In my earlier posts, we looked at how Spark Streaming can be used to process the streaming loan data and compute aggregations using Spark SQL. We also looked at how the data can be stored in the file system for future batch analysis. We discussed how Spark can be integrated with Kafka to ingest the streaming loan records. Finally, we also looked at how Storm can be integrated with Kafka to process events in real time with task-parallel operations executing in a Storm topology.
The focus of this post is to demonstrate how all three technologies (Kafka, Storm, and Spark) can be integrated to create a streaming big data pipeline that processes large volumes of data for real-time insights.
It is assumed that the Kafka broker and ZooKeeper are already installed and running, as described in my earlier post.
The following figure shows the data flow architecture for processing the loan records in real-time using Kafka, Storm and Spark Streaming.
The Loan Data Producer reads the loan records from the CSV file provided by Lending Club and publishes them to the Kafka broker. This simulates the various systems that generate financial events and publish them to an Enterprise Event Bus running on the Kafka broker. A few loan records may be invalid or may need some scrubbing before they can be processed in real time to generate aggregates. The Apache Storm topology reads the events from Kafka in real time and cleanses each record before publishing it back to the Kafka broker. The cleansed events are then consumed by Spark Streaming, which reads them in micro-batches and converts them to RDDs in a DStream. The RDDs are then converted into Datasets to perform computations and aggregations using Spark SQL and feed a dashboard in real time. The data can also be stored in Parquet or HBase for batch processing.
Three applications implement the above architecture:
kafka-financial-analysis has a LoanDataKafkaProducer that publishes the loan records to the Kafka broker on the topic raw_loan_data_in.
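The actual producer code lives in the kafka-financial-analysis project, but a minimal sketch of what such a producer could look like with the standard Kafka Java client is shown below. The class name and topic come from the description above; the CSV-reading logic and configuration details are assumptions for illustration, not the project's actual code.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LoanDataKafkaProducer {

    public static void main(String[] args) throws Exception {
        String bootstrapServers = args[0];   // e.g. localhost:9092
        String topic = args[1];              // e.g. raw_loan_data_in
        String csvPath = args[2];            // path to the Lending Club CSV file

        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Paths.get(csvPath))) {
            // Publish every CSV line as-is; cleansing happens downstream in the Storm topology.
            lines.forEach(line -> producer.send(new ProducerRecord<>(topic, line)));
        }
    }
}
```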
kafka-storm-int-financial-analysis has a LoanDataCleansingTopology that reads the raw loan data records from Kafka in real time and cleanses each loan record. It then publishes the cleansed loan records back to the Kafka broker on the topic cleansed_loan_data_out.
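Below is a rough sketch of how such a topology could be wired up with storm-kafka-client: a KafkaSpout reads the raw records, a cleansing bolt drops or scrubs bad lines, and a KafkaBolt writes the result back out. The cleansing rules and all names other than the topics are illustrative assumptions, not the code behind this post.

```java
import java.util.Properties;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class LoanDataCleansingTopology {

    // Drops header, empty, and trailer lines; forwards everything else unchanged.
    // The exact rules for detecting invalid records are an assumption for this sketch.
    public static class CleansingBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String record = tuple.getStringByField("value");
            if (record == null || record.trim().isEmpty()
                    || record.startsWith("\"id\"") || record.startsWith("Total amount funded")) {
                return;  // drop invalid records
            }
            // Emit with the field names expected by FieldNameBasedTupleToKafkaMapper.
            collector.emit(new Values(tuple.getStringByField("key"), record));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("key", "message"));
        }
    }

    public static void main(String[] args) throws Exception {
        String bootstrapServers = args[0];   // localhost:9092
        String inputTopic = args[1];         // raw_loan_data_in
        String outputTopic = args[2];        // cleansed_loan_data_out

        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder(bootstrapServers, inputTopic).build();

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", bootstrapServers);
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaBolt<String, String> kafkaBolt = new KafkaBolt<String, String>()
                .withProducerProperties(producerProps)
                .withTopicSelector(new DefaultTopicSelector(outputTopic))
                .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<>());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("raw-loan-spout", new KafkaSpout<>(spoutConfig));
        builder.setBolt("cleansing-bolt", new CleansingBolt()).shuffleGrouping("raw-loan-spout");
        builder.setBolt("kafka-bolt", kafkaBolt).shuffleGrouping("cleansing-bolt");

        new LocalCluster().submitTopology("loan-data-cleansing", new Config(), builder.createTopology());
    }
}
```

The sketch runs on a LocalCluster for simplicity; on a real cluster the topology would be submitted through StormSubmitter instead.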
kafka-spark-int-financial-analysis runs the Spark Streaming job that ingests the loan records from Kafka in real time with a pre-defined batch interval. The events received in each micro-batch are converted to Datasets to perform computations using Spark SQL. The data is saved in the Parquet file format for batch processing or to feed downstream systems for additional processing.
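A minimal sketch of such a Spark Streaming job is shown below, using the spark-streaming-kafka-0-10 direct stream with the 20-second batch interval used later in this post. The CSV column positions, the LoanRecord bean, the consumer group id, and the Parquet output path are assumptions for illustration, not the project's actual code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LoanDataSparkStreamingJob {

    public static void main(String[] args) throws Exception {
        String bootstrapServers = args[0];   // localhost:9092
        String topic = args[1];              // cleansed_loan_data_out

        SparkConf conf = new SparkConf().setAppName("loan-data-analysis").setMaster("local[*]");
        // 20-second micro-batch interval, matching the batch window described in this post.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(20));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", bootstrapServers);
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "loan-data-analysis");
        kafkaParams.put("auto.offset.reset", "earliest");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList(topic), kafkaParams));

        stream.map(ConsumerRecord::value).foreachRDD(rdd -> {
            if (rdd.isEmpty()) {
                return;
            }
            SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();

            // Parse each cleansed CSV record into (state, funded amount).
            // Column positions 3 (funded_amnt) and 23 (addr_state) are assumed for this sketch.
            JavaRDD<LoanRecord> loans = rdd.map(line -> {
                String[] cols = line.split(",");
                return new LoanRecord(cols[23].replace("\"", ""),
                        Double.parseDouble(cols[3].replace("\"", "")));
            });

            Dataset<Row> loanDs = spark.createDataFrame(loans, LoanRecord.class);
            loanDs.createOrReplaceTempView("loans");

            // Aggregate the total funded amount for IL and persist the micro-batch as Parquet.
            Dataset<Row> ilTotal = spark.sql(
                    "SELECT addrState, SUM(fundedAmount) AS totalFunded FROM loans "
                    + "WHERE addrState = 'IL' GROUP BY addrState");
            ilTotal.show();
            loanDs.write().mode("append").parquet("loan_data_parquet");
        });

        jssc.start();
        jssc.awaitTermination();
    }

    // Simple bean used to convert the RDD of parsed records into a Dataset.
    public static class LoanRecord implements java.io.Serializable {
        private String addrState;
        private double fundedAmount;

        public LoanRecord() { }
        public LoanRecord(String addrState, double fundedAmount) {
            this.addrState = addrState;
            this.fundedAmount = fundedAmount;
        }
        public String getAddrState() { return addrState; }
        public void setAddrState(String addrState) { this.addrState = addrState; }
        public double getFundedAmount() { return fundedAmount; }
        public void setFundedAmount(double fundedAmount) { this.fundedAmount = fundedAmount; }
    }
}
```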
The following steps are needed to test the pipeline end to end.
- Run the Spark Streaming job in kafka-spark-int-financial-analysis, passing the following command-line arguments:
localhost:9092 cleansed_loan_data_out
- Run the Storm topology in kafka-storm-int-financial-analysis, passing the following command-line arguments:
localhost:9092 raw_loan_data_in cleansed_loan_data_out
- Run the Loan Data Kafka Producer in kafka-financial-analysis, passing the following command-line arguments:
localhost:9092 raw_loan_data_in C:\bigdata\LoanStats_2017Q2.csv
After all the records are published by the Loan Data Kafka Producer, we can validate the results to ensure that the data pipeline is providing accurate results for real-time insight into the loan amount funded in the state of IL.
As seen in the image below, it took 184 seconds for the Loan Data Kafka Producer to publish all the loan records.
The Loan Data Cleansing Storm topology dropped 4 invalid loan data records (header, empty lines, and trailer records) and also scrubbed the loan data records to reformat invalid entries.
The Spark Streaming job aggregated the total loan amount funded in the state of IL. This amount is exactly the same as what we calculated by running the Spark Streaming job in isolation, without integration with Kafka and Storm.
Finally, it's worth noting that there were a total of 10 batches of data, and each of them was written to disk in the Parquet format, as seen in the image below.
It took a total of 184 seconds to publish all the loan records, and the Spark Streaming job used a batch window of 20 seconds. Hence the data pipeline processed all the streaming records in 10 batches (184 / 20 = 9.2, rounded up to 10).
Why not just do everything in Spark?