Overview

This repository is concerned about the data we're going to ingress to the application, like how we clean and process the data. For the application itself we created a separate github link

Our goal in gathering data was to provide voters with information on the 44,000 candidates running this upcoming elections. We sourced data from previous elections, to get previous candidacy details. We then tried processing election results which became a problem as Comelecs data was hacked and corrupted back in 2016. We also faced the problem of having to mine data with inconsistent formats, which required a different script each time.

Quick Links

Github and App

Documentation

Designs and other Docs

App Installation Guide

Install the Latest APK release Here:

APK Release

For a more detailed instruction guide visit Veripol App Github

VeriPol App Github

Guide

To reproduce the code, make sure to install all the dependencies and modules with respect to the python script you want to run. The libraries used in each python file are commented right beside the imports

Data Scraping Journey

Step 1 (Downloading PDF Files)

It all started with scraping the data from the https://comelec.gov.ph/?r=2022NLE/ListsofCandidates/TentativeListsofCandidates containing PDF files for each local and national candidates

The first step is to use python script that imports beatifulsoup4 to parse the HTML source code from the website and capture the links associated to the anchor tags. These links are saved to a list in python and after collecting all of the links, it is iterated through a loop and download each file with a user-defined helper function "download(url: str, dest_folder: str, filename: str)"

This process then outputs a set of folders with respect to their location in the format of REGION_NUMBER/PROVINCE/MUNICIPALITY.PNG

Step 2 (Merging PDF Files)

The PDF Files are then fed to another python script that merges all the PDF files into 2 parts, since this is in preparation for converting the PDF file to and XLSX file. The limitation occurs in this conversion part where there is a limit of the number of pages that the conversion software can handle. This merging is just done by PyPDF2 module.

Step 3 (Converting PDF to XLSX Files)

Since there are already 2 Merged PDF Files, the conversion from PDF to XLSX is also done twice. This is done using Adobe Acrobat by exporting the PDF File as an XLSX File. This process outputs 2 XLSX Files

Step 4 (Converting XLSX Files to CSV Files)

The XLSX Files can be coverted to a CSV Files through Microsoft Excel by saving it as a CSV File. Since there are 2 XLSX files, then there will also be 2 CSV Files. At this point, these 2 CSV files can be merged to one, but following through the experience, they remained to be separated.

Step 5 (Cleaning the CSV File to a Database File)

The 2 CSV Files are fed to a python script that uses "csv" module to simplify the workload of iterating through each row of the CSV files. The extraction of data are done based on the Entity Relationship Diagram which are also made as class models in a python file. At this step, it already requires SQLAlchemy to handle the database. The database contains the following tables

Location
Position
PoliticalParty
Candidate
FiledCandidacy
ElectionResult
HouseBills
SenateBills

Step 6 (Handling the local database to cloud firebase firestore)

The data is first residing on a local database to answer immidiate concerns to create the application. This also saves time and read/write quota incurred from firebase' free tier pack. Once the data were finalized, the database file is then fed to a python script that imports SQLAlchemy to read the database and pyrebase module to push the data to the firestore.

Output in the Firebase

After some hours, the data from the local SQLite db are already copied into the NoSQSL firebase firebase.

Constraints

Due to limitation of Firebase Firestore Free Tier, pushing previous years of candidacy (2019, 2016, 2010, etc) is challenging due to time constraints and quota that firebase has allocated for free tier packages.
Firebase only allows 50,000 document reads per day
Firebase only allows 20,000 document writes The total data for filed candidacy table alone has around 44,000 rows and another 44,000 rows on candidates table just for the 2022 candidates. This huge amount of data takes days or weeks to fully push all of the data. This also implies that including the previous years of candidacies will increase the amount of documment writes needed.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
Backend with Flask		Backend with Flask
Blobs		Blobs
Data Scraping		Data Scraping
Data		Data
Local to Cloud Database Handling		Local to Cloud Database Handling
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Quick Links

Github and App

Documentation

Designs and other Docs

App Installation Guide

Guide

Data Scraping Journey

Step 1 (Downloading PDF Files)

Step 2 (Merging PDF Files)

Step 3 (Converting PDF to XLSX Files)

Step 4 (Converting XLSX Files to CSV Files)

Step 5 (Cleaning the CSV File to a Database File)

Step 6 (Handling the local database to cloud firebase firestore)

Output in the Firebase

Constraints

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Overview

Quick Links

Github and App

Documentation

Designs and other Docs

App Installation Guide

Guide

Data Scraping Journey

Step 1 (Downloading PDF Files)

Step 2 (Merging PDF Files)

Step 3 (Converting PDF to XLSX Files)

Step 4 (Converting XLSX Files to CSV Files)

Step 5 (Cleaning the CSV File to a Database File)

Step 6 (Handling the local database to cloud firebase firestore)

Output in the Firebase

Constraints

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages