feat(DE): Add Coles and IGA MongoDB preprocessing scripts with GTIN mapping, unit price fallbacks, and data quality fixes#285

Open
adrinkimno wants to merge 2 commits into DataBytes-Organisation:main from adrinkimno:margie_licup/de-coles-preprocessing

@adrinkimno
Contributor

Why

Because the DE ETL pipeline is not yet fully implemented for Coles and IGA, these preprocessing scripts serve as a temporary solution for uploading 2026 scraped pricing data to MongoDB.
The scripts were also updated with thorough documentation, including structured steps, inline comments, a Data Issues section, and a Before You Run checklist, to guide future junior DE members in running the uploads independently and to serve as a reference when the formal DE pipeline is eventually implemented for these retailers.

This is in line with the assigned ticket:

  • [DE] Upload 2026 scraped data to MongoDB:

With subtasks:

  • [DE-01] Pre-process new data from 4 retailers to fit current database schema [Coles & IGA] - ML
  • [DE-04] Upload new data to MongoDB - ML & RDL

What changed

  • Added a purpose section at the top explaining what the script does, when to run it, and a summary of the steps
  • Added a pre-run checklist covering .env file setup, path configuration, and required Python packages
  • Added a Data Issues section documenting known issues: inconsistent unit price columns, duplicate products from multi-brand searches, no barcode available in the scraped data (so products are matched against an old scrape file), PromotionType values that do not all indicate a price reduction, and unmatched products being silently dropped
  • Added numbered step headers (Steps 1–9) with a markdown explanation cell before each step describing what it does and why
  • Added inline comments throughout all code cells explaining what each block does
  • Clarified the update_data placeholder as reserved for future use and not yet implemented
  • Fixed the upload failure recovery note to correctly instruct re-running from Step 8 so the deduplication filter runs before re-attempting the upload
  • Added Coles_Preprocessing_Script_20260516.ipynb to DiscountMate_new/DE/MongoDB - Legacy/
  • Added IGA_Preprocessing_Script_20260516.ipynb to DiscountMate_new/DE/MongoDB - Legacy/
  • Updated README in DiscountMate_new/DE/MongoDB - Legacy/ to include the title and descriptions of the IGA and Coles Preprocessing scripts
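
To make three of the documented data issues concrete (unit price fallbacks, duplicates from multi-brand searches, and unmatched products being silently dropped), here is a minimal sketch using only the Python standard library. It is not the notebooks' actual code: the column names (`unit_price`, `price_per_unit`, `comparative_price`, `product_code`), the helper functions, and the GTIN values are all hypothetical placeholders chosen for illustration.

```python
from typing import Optional

# Hypothetical column names; the real scrape files may use different ones.
UNIT_PRICE_COLUMNS = ("unit_price", "price_per_unit", "comparative_price")


def pick_unit_price(row: dict) -> Optional[float]:
    """Return the first usable unit price among the candidate columns."""
    for column in UNIT_PRICE_COLUMNS:
        value = row.get(column)
        if value in (None, ""):
            continue  # column missing or empty: fall back to the next one
        try:
            return float(str(value).replace("$", "").strip())
        except ValueError:
            continue  # unparseable text such as "N/A": try the next column
    return None


def deduplicate(rows: list, key: str = "product_code") -> list:
    """Keep the first row per key; multi-brand searches can return a product twice."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def attach_gtin(rows: list, gtin_by_code: dict) -> tuple:
    """Split rows into (matched, unmatched) instead of silently dropping misses."""
    matched, unmatched = [], []
    for row in rows:
        gtin = gtin_by_code.get(row["product_code"])
        if gtin is None:
            unmatched.append(row)
        else:
            matched.append({**row, "gtin": gtin})
    return matched, unmatched


# Made-up sample rows and placeholder GTINs, for illustration only.
scraped = [
    {"product_code": "A1", "unit_price": "$1.20"},
    {"product_code": "A1", "unit_price": "$1.20"},  # duplicate from a second search
    {"product_code": "B2", "unit_price": "", "price_per_unit": "0.85"},
    {"product_code": "C3", "unit_price": "N/A"},  # no usable unit price, no GTIN
]
old_scrape_gtins = {"A1": "9300000000001", "B2": "9300000000002"}

unique_rows = deduplicate(scraped)
for row in unique_rows:
    row["unit_price_clean"] = pick_unit_price(row)

matched, unmatched = attach_gtin(unique_rows, old_scrape_gtins)
print(f"matched={len(matched)} unmatched={len(unmatched)}")
```

Returning the unmatched rows as a separate list, rather than discarding them, gives whoever runs the upload a chance to log or review the products that would otherwise disappear without notice.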

@adrinkimno adrinkimno marked this pull request as draft May 16, 2026 07:27
@adrinkimno adrinkimno marked this pull request as ready for review May 16, 2026 07:27