Analytics
Big Data Streaming Pipeline - Spark, Kafka, Cassandra & Hadoop
Data Engineer - Spark Structured Streaming & Big Data
Business summary
Operations automation • Turn raw data into executive-ready reporting
- Eliminated repetitive manual steps across the workflow
- Sped up response times and reduced missed handoffs
- Added error handling and edge-case coverage to reduce operational risk
What I built
Built a real-time big-data pipeline that analyses traffic-sensor data (TII speed feeds). Using PySpark (Spark 3.5.4) Structured Streaming, I consumed JSON events from Apache Kafka, parsed and aggregated them with groupBy/count, and wrote the results to Apache Cassandra via the DataStax connector. The pipeline ran four streaming queries (Q1-Q4), all using foreachBatch and checkpointing. Cassandra ran in Docker, with keyspaces modelled in CQL. I also built batch Hadoop MapReduce jobs (Python mappers/reducers) for log analytics. The full stack was Spark, Kafka, Cassandra, Hadoop, Docker, Java 17 and Python, all on Linux.
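The shape of one such streaming query can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the broker address, topic name, event fields (site_id, speed_kmh), keyspace, table, checkpoint path and the "update" output mode are all hypothetical placeholders. It also assumes the Kafka source and the DataStax Spark Cassandra Connector packages are available on the Spark classpath.

```python
"""Sketch: Kafka JSON events -> groupBy/count -> Cassandra via foreachBatch."""

# All values below are hypothetical placeholders, not taken from the project.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "traffic-speed",
    "startingOffsets": "latest",
}
CASSANDRA_TARGET = {"keyspace": "traffic", "table": "counts_by_site"}
CHECKPOINT_DIR = "/tmp/checkpoints/q1"


def write_to_cassandra(batch_df, batch_id):
    """foreachBatch sink: write one micro-batch through the Cassandra connector."""
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(**CASSANDRA_TARGET)
        .mode("append")  # appending rows with an existing primary key upserts them
        .save())


def start_count_query(spark):
    """Start a streaming query that counts events per site (assumed grouping key)."""
    # PySpark is imported here so the module can be loaded without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Assumed event schema; the real feed's fields are not given in the text.
    schema = StructType([
        StructField("site_id", StringType()),
        StructField("speed_kmh", DoubleType()),
    ])

    # Kafka delivers the payload as bytes in the "value" column.
    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()

    # Cast the payload to a string, parse the JSON, and flatten the struct.
    events = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*"))

    # Running count per site, renamed to match a hypothetical Cassandra column.
    counts = (events.groupBy("site_id").count()
        .withColumnRenamed("count", "event_count"))

    # The checkpoint lets the query resume from its last committed offsets.
    return (counts.writeStream
        .outputMode("update")
        .foreachBatch(write_to_cassandra)
        .option("checkpointLocation", CHECKPOINT_DIR)
        .start())
```

To try it, create a SparkSession and call start_count_query(spark).awaitTermination(). Running several queries like Q1-Q4 would mean starting one such query per aggregation, each with its own checkpoint directory.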
Tech stack
Apache Spark (PySpark) • Apache Kafka • Apache Cassandra • Hadoop MapReduce • Docker • Apache Spark • PySpark • Spark Structured Streaming • CQL • Hadoop • MapReduce • Java • OpenJDK • Python • Linux • Real-Time Data Pipelines • Big Data Engineering