How to run a Spark or MapReduce job on hourly aggregated data on HDFS produced by Spark Streaming at 5-minute intervals

I have a scenario where I am using Spark Streaming to collect data from the Kinesis service, following https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

In the streaming job I do some aggregation on the data and write the results to HDFS. That part works so far. Now I want a way to collect all the data from the last hour (or hourly data in general), feed it to a new Spark or MapReduce job, run further aggregations there, and send the results to a target analytics service.

Questions:

1. How can I get the hourly aggregated data from HDFS into the next Spark job, MapReduce job, or any other data processing step? Do we need to partition the output in some way before writing it from Spark?
2. Can we use Amazon Data Pipeline for this? Suppose we write the data without partitioning, say to the /user/hadoop/ folder. How would Data Pipeline know that it needs to pick up only the last hour's data? Can we do this by applying some convention to the folder names, such as a timestamp?
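
To make the folder-name idea concrete, here is a minimal sketch of one possible layout, not a tested solution. It assumes each 5-minute batch is written to its own directory nested under a per-hour directory, so that the hourly job only needs a path pattern for the previous hour. The root /user/hadoop/aggregates, the batch-MM naming, and both function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical output root; the question only mentions /user/hadoop/.
BASE_DIR = "/user/hadoop/aggregates"
HOUR_FORMAT = "%Y/%m/%d/%H"


def batch_output_dir(batch_time: datetime, base_dir: str = BASE_DIR) -> str:
    """Directory for one 5-minute batch, nested under the hour it belongs to."""
    t = batch_time.astimezone(timezone.utc)
    return f"{base_dir}/{t.strftime(HOUR_FORMAT)}/batch-{t.strftime('%M')}"


def last_complete_hour_glob(now: datetime, base_dir: str = BASE_DIR) -> str:
    """Glob pattern matching every batch directory of the hour before `now`."""
    t = now.astimezone(timezone.utc)
    # Truncate to the start of the current hour, then step back one hour.
    hour_start = t.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return f"{base_dir}/{hour_start.strftime(HOUR_FORMAT)}/*"


if __name__ == "__main__":
    batch = datetime(2015, 3, 1, 10, 35, tzinfo=timezone.utc)
    print(batch_output_dir(batch))
    print(last_complete_hour_glob(datetime(2015, 3, 1, 11, 2, tzinfo=timezone.utc)))
```

With a layout like this, the streaming job would write each batch to the path returned by batch_output_dir, and whatever schedules the hourly job would pass the pattern from last_complete_hour_glob as its input path. Using UTC throughout avoids ambiguity around daylight-saving changes; how a particular scheduler supplies the run time is left open here.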