Analytics

Big Data Streaming Pipeline - Spark, Kafka, Cassandra & Hadoop

Data Engineer - Spark Structured Streaming & Big Data

Business summary

Operations automation • Turn raw data into executive-ready reporting

Eliminated repetitive manual steps across the workflow
Faster response time and fewer missed handoffs
Error handling and edge-case coverage to reduce operational risk

What I built

Built a real-time big-data pipeline that analyses traffic-sensor data (TII speed feeds). Using PySpark (Spark 3.5.4) Structured Streaming, I consumed JSON events from Apache Kafka, parsed and aggregated them with groupBy/count, and wrote the results to Apache Cassandra through the DataStax connector. The pipeline ran four streaming queries (Q1-Q4), each using foreachBatch and checkpointing. Cassandra ran in Docker, with keyspaces modelled in CQL. I also built batch Hadoop MapReduce jobs (Python mappers/reducers) for log analytics. The full stack (Spark, Kafka, Cassandra, Hadoop, Docker, Java 17 and Python) ran on Linux.
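
To illustrate the mapper/reducer pattern behind the batch log-analytics jobs, here is a minimal sketch in plain Python. It is not the project's actual code: the log layout (an HTTP-style status code as the last whitespace-separated field) and the function names mapper and reducer are assumptions made for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (key, 1) pair per log line.

    Assumption: the key of interest (e.g. a status code) is the last
    whitespace-separated field of each line.
    """
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[-1], 1


def reducer(pairs):
    """Sum the counts for each key.

    The pairs must arrive sorted by key, which is what Hadoop's
    shuffle/sort phase provides between the map and reduce steps.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Local stand-in for map -> shuffle/sort -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))
```

Calling run_local on a handful of sample lines is a quick way to check the logic before submitting a job. Under Hadoop Streaming, the same two functions would typically sit in separate scripts that read stdin and print tab-separated key/value pairs.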

Tech stack

Apache Spark (PySpark) • Spark Structured Streaming • Apache Kafka • Apache Cassandra • CQL • Hadoop MapReduce • Docker • Java • OpenJDK • Python • Linux • Real-Time Data Pipelines • Big Data Engineering