Skip to content

xerxes-y/DAP

Repository files navigation

Realtime Product Analytics Platform (DAP)

Overview

A distributed analytics platform simulating real-world product analytics pipelines. Ingests user events via Kafka, processes with Spark, and enables real-time analytics. Features Java 21 compatibility with Apache Spark 3.5.0 and comprehensive monitoring tools.

🚀 Key Features

  • Java 21 Compatible - Upgraded to Spark 3.5.0 with full compatibility
  • Real-time Streaming - Kafka → Spark Structured Streaming → JSON/Delta Lake
  • Comprehensive Monitoring - Web UIs, shell scripts, and monitoring guides
  • Docker Orchestration - All services containerized for easy deployment
  • Production Ready - Proper error handling, checkpointing, and resilience

Architecture

+-----------------+
|  User Simulated |
|   Event Stream  |
+--------+--------+
         |
         v
+---------------------+
|     Apache Kafka     |
+----------+----------+
         |
         v
+-------------------------------------+
|    Spark Structured Streaming Job   |
| - Java 21 + Spark 3.5.0            |
| - Reads from Kafka                  |
| - Parses JSON events                |
| - Real-time processing              |
| - Writes to JSON/Delta Lake         |
+------------------+------------------+
                   |
        +----------+-----------+
        |                      |
        v                      v
+--------------------+    +--------------------+
|   Delta Lake (S3)  |    | Apache Druid OLAP  |
| (MinIO Storage)    |    | (Real-time Queries)|
+--------------------+    +--------------------+

🛠 Technology Stack

  • Apache Kafka - Event streaming platform
  • Apache Spark 3.5.0 - Stream processing (Java 21 compatible)
  • Scala 2.12.18 - Programming language
  • Delta Lake - Data lake storage format
  • MinIO - S3-compatible object storage
  • Apache Druid - Real-time OLAP database
  • Docker Compose - Container orchestration

📋 Project Structure

DAP/
├── README.md                           # This file
├── docker-compose.yml                  # All services configuration
├── kafka-producer/                     # Scala app to generate events
├── spark-job/                          # Spark streaming applications
│   ├── src/main/scala/Main.scala       # Delta Lake streaming job
│   └── src/main/scala/SimpleMain.scala # JSON streaming job (Java 21 ready)
├── python-consumer/                    # Python analysis tools
├── druid/                             # Druid configuration
├── monitoring/                        # Monitoring guides and scripts
├── quick-monitor.sh                   # Comprehensive monitoring script
├── restart-and-monitor.sh             # Full platform restart script
├── spark-java21-compatibility-guide.md # Java 21 setup guide
└── spark-web-ui-guide.md             # Spark monitoring guide

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Java 21 (OpenJDK or Oracle)
  • SBT (Scala Build Tool)

1. Start All Services

# Start the entire platform
docker-compose up -d

# Monitor startup
./quick-monitor.sh

2. Generate Sample Data

cd kafka-producer
sbt run  # Generates 100 sample events

3. Run Spark Streaming Job

cd spark-job

# Option 1: JSON output (Java 21 compatible)
sbt 'runMain SimpleMain'

# Option 2: Delta Lake output (when Delta 3.3.0+ becomes available)
sbt 'runMain Main'

4. Monitor Everything

📊 Monitoring & Operations

Built-in Monitoring Tools

  • quick-monitor.sh - Comprehensive health checks
  • restart-and-monitor.sh - Full platform restart
  • monitoring/ - Detailed monitoring guides
  • Web UIs for all components

Key Metrics to Watch

  • Kafka message throughput
  • Spark streaming batch processing times
  • Memory and CPU usage
  • Data output locations: /tmp/kafka-events-json/

🔧 Java 21 Compatibility

This project is fully compatible with Java 21 thanks to:

  • Spark 3.5.0 upgrade (from 3.4.1)
  • Scala 2.12.18 for better compatibility
  • JVM compatibility flags in build.sbt
  • Comprehensive testing on macOS with Java 21.0.5

See spark-java21-compatibility-guide.md for detailed setup instructions.

🐛 Troubleshooting

Common Issues

  1. DirectByteBuffer errors → Use Spark 3.5.0+ with Java 21
  2. Delta Lake compatibility → Use SimpleMain with JSON output temporarily
  3. Port conflicts → Check docker-compose ps and restart services
  4. Memory issues → Adjust Docker resource limits

Getting Help

  • Check monitoring/ directory for detailed guides
  • Use monitoring scripts for health checks
  • Review Spark Web UI for streaming job details

🎯 Development Phases

✅ Phase 1: Core Infrastructure (COMPLETED)

  • ✅ Kafka event streaming
  • ✅ Spark 3.5.0 + Java 21 compatibility
  • ✅ JSON/Delta Lake storage
  • ✅ Docker orchestration
  • ✅ Comprehensive monitoring

🔄 Phase 2: Enhanced Analytics (IN PROGRESS)

  • Druid real-time ingestion
  • Advanced aggregations
  • Dashboard integration

🔮 Phase 3: Production Features (PLANNED)

  • Kubernetes deployment
  • Auto-scaling
  • Advanced monitoring & alerting

🤝 Contributing

  1. Ensure Java 21 compatibility in all changes
  2. Update monitoring guides for new features
  3. Test with provided monitoring scripts
  4. Follow established project structure

📄 License

This project is for educational and demonstration purposes.

About

Distributed Analytics Platform

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors