I'm trying to design a system that has 2 major workflows:
Case 1: Producers query sources for documents. I need to process each document and then save the result to MongoDB. My plan is for a few Kafka producers to fetch documents from the sources and add them to a topic; Storm would read from the topic, and Bolts would do the processing and save the results. From what I've read, this should give high throughput.
Case 2: My webapp allows users to upload documents, so I wanted to add them to a separate Kafka topic (so my webapp acts as a producer). However, I don't want to ack the item in the consumer until I've saved it to Mongo, so that when I get the callback from producer.send, I can query the db and return the processed doc to the user. Basically, I want the producer to know when the data is in the db, and at that point query it and send the response to the user.
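To make Case 2 concrete, here is a minimal sketch of the behaviour I'm after, using only the JDK. Everything in it is hypothetical: `PendingUploads`, `register`, and `complete` are names I made up, the consumer is simulated by a thread in the same JVM, and no Kafka or Mongo calls are involved. I'm not sure the producer.send callback can carry this signal, so the sketch assumes some separate "document saved" notification reaches the webapp.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: tracks uploads the webapp is still waiting on.
class PendingUploads {
    // docId -> future that the web request blocks on (or chains a response to)
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called by the webapp just before it sends the document to the upload topic.
    CompletableFuture<String> register(String docId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(docId, future);
        return future;
    }

    // Called once the consumer side reports that the processed document is saved.
    // Returns false if nobody is waiting on this docId (unknown or already completed).
    boolean complete(String docId, String processedDoc) {
        CompletableFuture<String> future = pending.remove(docId);
        return future != null && future.complete(processedDoc);
    }

    public static void main(String[] args) throws Exception {
        PendingUploads uploads = new PendingUploads();
        CompletableFuture<String> response = uploads.register("doc-1");

        // Stand-in for the consumer: process, "save to Mongo", then signal completion.
        new Thread(() -> uploads.complete("doc-1", "processed:doc-1")).start();

        // The web request waits (with a timeout) and then returns the processed doc.
        System.out.println(response.get(5, TimeUnit.SECONDS));
    }
}
```

In a real deployment the webapp and the consumers would be separate processes, so the call to `complete` would have to be driven by something that crosses that boundary; which mechanism to use for that is part of what I'm asking.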
Question 1: Case 1 is completely asynchronous, so from what I've read I think Storm is the way to go, but I'm not sure I fully understand the benefit versus spinning up a bunch of my own consumer threads.
Question 2: I'm confused by how commitSync works in the consumer. Suppose I have a bunch of consumers subscribed to the same topic, and each one waits until it is done processing before calling commitSync. What happens if another consumer polls Kafka while the first is still processing? Will it get the same offset as the previous one, or does Kafka know that offset is being handled and give it the next one? And if a consumer has already gotten the next offset and the previous consumer fails, what happens to that entry? Does Kafka know to give it to the next consumer that polls?
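To pin down what I mean by "commit only after processing", here is a small in-memory model of a single consumer reading a single partition. It is not the Kafka API: `FakePartition`, `pollFromCommitted`, `commit`, and `runOnce` are hypothetical stand-ins, and the model deliberately says nothing about how several consumers share a topic, which is the part I'm unsure about.

```java
import java.util.ArrayList;
import java.util.List;

class CommitAfterProcessing {
    // Hypothetical stand-in for one partition plus its committed offset.
    static class FakePartition {
        final List<String> log;
        int committed = 0; // offset a restarted consumer would resume from

        FakePartition(List<String> log) { this.log = log; }

        // Returns up to max records starting at the committed offset.
        List<String> pollFromCommitted(int max) {
            int end = Math.min(log.size(), committed + max);
            return new ArrayList<>(log.subList(committed, end));
        }

        void commit(int nextOffset) { committed = nextOffset; }
    }

    // Processes one batch and commits only if every record was saved.
    // failOn simulates a crash on that record (null means no crash).
    static List<String> runOnce(FakePartition partition, int max, String failOn) {
        List<String> saved = new ArrayList<>();
        int next = partition.committed;
        for (String doc : partition.pollFromCommitted(max)) {
            if (doc.equals(failOn)) {
                return saved; // crashed before commit: committed offset is unchanged
            }
            saved.add("processed:" + doc); // stand-in for process + save to Mongo
            next++;
        }
        partition.commit(next); // stand-in for commitSync after the whole batch
        return saved;
    }

    public static void main(String[] args) {
        FakePartition partition = new FakePartition(List.of("a", "b", "c"));
        System.out.println(runOnce(partition, 2, "b"));  // crash while handling "b"
        System.out.println(runOnce(partition, 2, null)); // retry from the committed offset
        System.out.println(partition.committed);
    }
}
```

In this model the retry re-reads "a" even though it was already saved, because nothing was committed before the crash. So with this ordering, the save to Mongo would need to tolerate duplicates.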
Question 3: For Storm, it looks like nextTuple calls commitSync. I can't tell whether that happens after the last Bolt in the tree has called ack, or when the first Bolt in the tree accepts the message. For Case 1 (asynchronous) this doesn't really matter, but for Case 2 it does. Is there a way to use Storm for both cases, or should I write my own standalone consumers for Case 2? I'd rather have one set of consumers that handles both: check the synchronous topic first and, if nothing is there, check the async one.
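The "check the synchronous topic first, then the async one" idea boils down to something like the sketch below. The two in-memory queues are stand-ins for the two topics, and `PriorityPoll` and `next` are hypothetical names; this is only meant to show the selection order, not how it would be wired into Storm or a Kafka consumer.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class PriorityPoll {
    // Returns the next item to process: user uploads first, then background
    // documents, or null if both queues are empty.
    static String next(Queue<String> syncQueue, Queue<String> asyncQueue) {
        String item = syncQueue.poll();
        return item != null ? item : asyncQueue.poll();
    }

    public static void main(String[] args) {
        Queue<String> uploads = new ArrayDeque<>(List.of("upload-1"));
        Queue<String> sourced = new ArrayDeque<>(List.of("source-1", "source-2"));

        String item;
        while ((item = next(uploads, sourced)) != null) {
            System.out.println(item);
        }
    }
}
```

One consequence of this ordering is that a steady stream of uploads would starve the async documents, which may or may not be acceptable for Case 1.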
I just started looking into these technologies for this project, and haven't been able to find clear answers. I'm in the exploratory phase, so I want to make sure the technology choices for the architecture make sense.
Also, if I'm off base, any suggestions for better technologies would be greatly appreciated.
Thanks in advance!