upb-big-data-wikipedia-visualisation

Big Data Setup with Hadoop and Zeppelin

Table of Contents

  • Prerequisites
  • Setup Instructions
  • Hadoop configuration
  • Working with data
  • Future Enhancements

Prerequisites

  • Docker and Docker Compose installed.

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/vitalii-t12/upb-big-data-wikipedia-visualisation.git
    cd upb-big-data-wikipedia-visualisation
  2. Build and start the containers:

    docker-compose up --build
    
    
  3. Access the web interfaces exposed by the containers (see docker-compose.yml for the mapped ports).

  4. Use Zeppelin to interact with Hadoop via SQL or other interpreters (a quick smoke test follows this list).
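
To check that Zeppelin can reach HDFS, a %pyspark paragraph along these lines should work. This is a minimal sketch, not code from the repository; the path assumes the raw data location described under Working with data below.

  %pyspark
  # Minimal smoke-test paragraph: read the raw JSON written by the playground
  # notebook (path taken from the "Working with data" section below).
  df = spark.read.json("/user/zeppelin/top-by-country/raw/")
  df.printSchema()
  df.show(10, truncate=False)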

Hadoop configuration

To access the Hadoop CLI from the terminal, open a shell in the hadoop container:

  docker exec -it hadoop bash

To add a hadoop user:

  adduser --disabled-password --gecos "" hadoop

Set appropriate permissions for the Hadoop directories:

  chown -R hadoop:hadoop /opt/hadoop

Working with data

Get top countries by pageviews

To load data, open the playground notebook and select the dates you want to download data for. Unprocessed data is saved to /user/zeppelin/top-by-country/raw/countries_visits_{year}_{month}.json
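
For reference, the download step in the notebook presumably looks something like the sketch below. The Wikimedia REST endpoint, the all-days granularity, and the helper name are assumptions for illustration, not the notebook's actual code; it also assumes the hadoop CLI is available inside the container.

  # Hypothetical sketch of the notebook's download step. The Wikimedia
  # pageviews top-per-country endpoint and "all-days" granularity are
  # assumptions; the playground notebook defines the real request.
  import json
  import subprocess
  import requests

  def download_top_by_country(year, month):
      url = (
          "https://wikimedia.org/api/rest_v1/metrics/pageviews/"
          f"top-per-country/all-access/{year}/{month:02d}/all-days"
      )
      resp = requests.get(url, headers={"User-Agent": "upb-big-data-demo"})
      resp.raise_for_status()
      return resp.json()

  year, month = 2024, 1
  data = download_top_by_country(year, month)

  # Stage the response locally, then push it to the HDFS path used by this
  # project (assumes the hadoop CLI is on PATH inside the container).
  local = f"/tmp/countries_visits_{year}_{month}.json"
  with open(local, "w") as f:
      json.dump(data, f)
  subprocess.run(
      ["hadoop", "fs", "-put", "-f", local,
       f"/user/zeppelin/top-by-country/raw/countries_visits_{year}_{month}.json"],
      check=True,
  )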

Process countries by pageviews

To process the data in /user/zeppelin/top-by-country/raw, follow these steps. First, connect to the spark-master container:

docker exec -it spark-master bash

Then run the following script (from the /spark/bin folder):

  ./spark-submit /spark-scripts/process-top-by-country.py

The output will be saved to /user/zeppelin/top-by-country/processed/processed.parquet
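
The actual job is /spark-scripts/process-top-by-country.py in the repository; as a rough sketch of the kind of aggregation it performs (the column names here are assumptions, not the script's real schema):

  # Hypothetical PySpark sketch of a raw-JSON -> parquet aggregation job.
  # Field names ("country", "views") are assumptions for illustration.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("process-top-by-country").getOrCreate()

  raw = spark.read.json("/user/zeppelin/top-by-country/raw/")
  processed = (
      raw.groupBy("country")
         .agg(F.sum("views").alias("total_views"))
         .orderBy(F.desc("total_views"))
  )
  processed.write.mode("overwrite").parquet(
      "/user/zeppelin/top-by-country/processed/processed.parquet"
  )
  spark.stop()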

Process large CSV with geographical data

To process the data from /data/articles-in-range/10_km.csv, follow these steps:

  • Check that the 10_km.csv file is in the /data/articles-in-range/ folder (this folder is mounted into the namenode container)

  • Upload the 10_km.csv file to HDFS (run this command in the namenode container):

  hadoop fs -put /data/articles-in-range/10_km.csv /user/zeppelin/articles-in-range/raw

Connect to spark-master container:

docker exec -it spark-master bash

Then run the following script (from the /spark/bin folder):

./spark-submit /spark-scripts/process-articles-in-range.py
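
Again, the real logic lives in /spark-scripts/process-articles-in-range.py; below is a minimal sketch of such a CSV-to-parquet job, with hypothetical column names and output path:

  # Hypothetical PySpark sketch of the CSV -> parquet job.
  # The coordinate column names and the output path are assumptions.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("process-articles-in-range").getOrCreate()

  articles = (
      spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("/user/zeppelin/articles-in-range/raw")
  )
  # Example transformation: drop rows missing coordinates (assumed columns).
  cleaned = articles.dropna(subset=["latitude", "longitude"])
  cleaned.write.mode("overwrite").parquet(
      "/user/zeppelin/articles-in-range/processed"
  )
  spark.stop()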

Future Enhancements

To improve the capabilities and performance of the system, several tools and frameworks can be integrated:

  1. Apache Airflow: Automates workflows, ensuring tasks are executed in sequence, with real-time monitoring and error handling.
  2. Apache Hive: Introduces a SQL-like interface for querying processed data, enabling easier access for non-technical users and serving as a robust data warehouse.
  3. Grafana: Enhances visualization and monitoring, providing advanced dashboards for real-time system metrics and resource usage tracking.
  4. Apache Kafka and Apache Flink: Enable real-time data ingestion and processing, allowing trends to be analyzed as they emerge.
  5. Machine Learning Models: Adds intelligence to the system by implementing trend analysis and predictive models using TensorFlow or PyTorch.

These enhancements will make the system more scalable, user-friendly, and capable of handling complex analytical workflows.
