- Docker and Docker Compose installed.

Clone the repository:

```bash
git clone https://github.com/vitalii-t12/upb-big-data-wikipedia-visualisation.git
cd upb-big-data-wikipedia-visualisation
```
Build and start the containers:

```bash
docker-compose up --build
```
Access the web interfaces:
- Hadoop Namenode UI: http://localhost:9870
- Zeppelin UI: http://localhost:8080
- Use Zeppelin to interact with Hadoop via SQL or other interpreters.
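To confirm the containers came up before opening the notebooks, you can poll the mapped UI ports. This helper is not part of the repository; it is a minimal sketch assuming the default port mappings listed above:

```python
# Hypothetical helper (not part of the repo): check whether a web UI
# answers on its mapped port. Adjust the URLs if your compose file
# maps different host ports.
from urllib.request import urlopen
from urllib.error import URLError

def ui_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the UI answers with a successful (2xx) response."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, OSError):
        return False

# Example: ui_reachable("http://localhost:9870")  # Hadoop Namenode UI
```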
To connect to the Hadoop CLI from the terminal, you can use the following command:

```bash
docker exec -it hadoop bash
```

To add a hadoop user:

```bash
adduser --disabled-password --gecos "" hadoop
```
Set appropriate permissions for the Hadoop directories:

```bash
chown -R hadoop:hadoop /opt/hadoop
```

To load data, open the playground notebook and select the dates you want to download data for.
Unprocessed data is saved at `/user/zeppelin/top-by-country/raw/countries_visits_{year}_{month}.json`.
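The download logic lives in the playground notebook and is not reproduced in this README. Purely as an illustration, here is a sketch of how the raw HDFS path and a monthly source URL could be built; the Wikimedia Pageviews `top-by-country` endpoint is an assumption about the data source, not something this README states:

```python
# Hypothetical helpers -- the notebook's actual logic may differ.

def raw_hdfs_path(year: int, month: int) -> str:
    """Build the HDFS path following the documented
    countries_visits_{year}_{month}.json pattern (padding unknown)."""
    return (f"/user/zeppelin/top-by-country/raw/"
            f"countries_visits_{year}_{month}.json")

def pageviews_url(year: int, month: int,
                  project: str = "en.wikipedia") -> str:
    """Assumed data source: the Wikimedia Pageviews REST endpoint for
    top views by country (months are zero-padded in that API)."""
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
            f"top-by-country/{project}/all-access/{year}/{month:02d}")

print(raw_hdfs_path(2024, 5))
# /user/zeppelin/top-by-country/raw/countries_visits_2024_5.json
```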
To process data from `/user/zeppelin/top-by-country/raw`, follow these steps:

Connect to the `spark-master` container:

```bash
docker exec -it spark-master bash
```

Then run the following script:

```bash
./spark-submit /spark-scripts/process-top-by-country.py
```

The output will be saved at `/user/zeppelin/top-by-country/processed/processed.parquet`.
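The processing script itself is not shown in this README. As a rough illustration of the kind of "top article per country" rollup such a job might perform, here is a plain-Python sketch; the JSON record shape below is invented, not the repository's actual schema:

```python
import json

# Assumed record shape -- the real schema of countries_visits_*.json
# may differ; this only illustrates a per-country top-article rollup.
raw = json.loads("""[
  {"country": "FR", "article": "Paris",     "views": 1200},
  {"country": "FR", "article": "Lyon",      "views": 300},
  {"country": "RO", "article": "Bucharest", "views": 800}
]""")

best = {}  # country -> (article, views) with the highest view count
for rec in raw:
    cur = best.get(rec["country"])
    if cur is None or rec["views"] > cur[1]:
        best[rec["country"]] = (rec["article"], rec["views"])

print(best["FR"])  # ('Paris', 1200)
```

The real job runs the same idea at scale on Spark and persists the result as Parquet instead of an in-memory dict.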
To process data from `/data/articles-in-range/10_km.csv`, follow these steps:

- Check that the `10_km.csv` file is in the `/data/articles-in-range/` folder (this folder is mounted to the `namenode` container).
- Upload the `10_km.csv` file to HDFS (run this command in the `namenode` container):

  ```bash
  hadoop fs -put /data/articles-in-range/10_km.csv /user/zeppelin/articles-in-range/raw
  ```

- Connect to the `spark-master` container:

  ```bash
  docker exec -it spark-master bash
  ```

- Run the following script (from the `/spark/bin` folder):

  ```bash
  ./spark-submit /spark-scripts/process-articles-in-range.py
  ```

To improve the capabilities and performance of the system, several tools and frameworks can be integrated:
- Apache Airflow: Automates workflows, ensuring tasks are executed in sequence, with real-time monitoring and error handling.
- Apache Hive: Introduces a SQL-like interface for querying processed data, enabling easier access for non-technical users and serving as a robust data warehouse.
- Grafana: Enhances visualization and monitoring, providing advanced dashboards for real-time system metrics and resource usage tracking.
- Apache Kafka and Apache Flink: Enable real-time data ingestion and processing, allowing trends to be analyzed as they emerge.
- Machine Learning Models: Adds intelligence to the system by implementing trend analysis and predictive models using TensorFlow or PyTorch.
These enhancements will make the system more scalable, user-friendly, and capable of handling complex analytical workflows.