
Scraper Fragility #167

@sm-28601


Robustness: BeautifulSoup CWS scraping is fragile; add fallback API and integrity tests

Description
Extension metadata is currently extracted by scraping the Chrome Web Store (CWS) HTML with BeautifulSoup. The logic depends on highly specific, hardcoded DOM details (e.g., particular <h1> tags, aria-label attributes, and regex matches for strings like "4.7 out of 5 stars").

Proof / Technical Details

  • File: core/extension_metadata.py
  • Google frequently updates the DOM structure of the Chrome Web Store. When a class name or an aria-label changes, our BeautifulSoup selectors stop matching.
  • Because of how the scraper is currently written, these failures will likely be silent: the scraper won't crash the app, it will just start returning None for critical extension metadata.
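
To illustrate the silent-failure mode described above, here is a minimal sketch using only the standard library. The pattern, function names, and the "changed" markup are all hypothetical and are not taken from core/extension_metadata.py; the point is only that a lookup returning None hides a layout change, while a strict variant surfaces it.

```python
import re
from typing import Optional

# Hypothetical pattern; the real one in core/extension_metadata.py may differ.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?) out of 5 stars")


class ScrapeError(RuntimeError):
    """Raised when an expected field cannot be extracted from the page."""


def parse_rating(html: str) -> Optional[float]:
    """Return the rating, or None when the expected text is absent."""
    match = RATING_RE.search(html)
    return float(match.group(1)) if match else None


def parse_rating_strict(html: str) -> float:
    """Same extraction, but a missing field raises instead of returning None."""
    rating = parse_rating(html)
    if rating is None:
        raise ScrapeError("rating not found; the CWS markup may have changed")
    return rating


old_markup = '<div aria-label="4.7 out of 5 stars"></div>'
# Invented example of a wording change, not real CWS markup.
new_markup = '<div aria-label="Rated 4.7 of 5"></div>'

print(parse_rating(old_markup))  # 4.7
print(parse_rating(new_markup))  # None
```

With the lenient version, the second call quietly yields None and the scan continues; parse_rating_strict(new_markup) would raise ScrapeError instead, which can be logged or used to trigger a fallback.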

Impact
If the CWS DOM changes, the backend will successfully complete scans, but the resulting reports will be missing crucial context (like user ratings, developer info, or descriptions), degrading the quality of the ExtensionShield output without triggering any internal server errors.

Proposed Solution

  1. Fallback Mechanism: Integrate a fallback data source (e.g., the chrome-stats.com API or a similar service) to fetch metadata when the HTML scraper fails to find the expected DOM elements.
  2. Integrity Testing: Create a scheduled integration test (via GitHub Actions) that scrapes a known, stable extension (like Google Translate) once a day and asserts that key fields (name, rating, developer) are not None. This will alert us as soon as Google changes its layout, rather than us finding out from users.
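
The two proposals above could be sketched roughly as follows. This is an assumption-laden outline, not the project's code: fetch_with_fallback, check_integrity, the Metadata shape, and both stub sources are hypothetical names introduced for illustration, and the stubs stand in for the real scraper and for whichever fallback API is chosen.

```python
from typing import Callable, Dict, Iterable, List, Optional

Metadata = Dict[str, Optional[str]]
REQUIRED_FIELDS = ("name", "rating", "developer")


def missing_fields(metadata: Metadata) -> List[str]:
    """List the required fields that are absent or None."""
    return [f for f in REQUIRED_FIELDS if metadata.get(f) is None]


def fetch_with_fallback(
    extension_id: str, sources: Iterable[Callable[[str], Metadata]]
) -> Metadata:
    """Try each source in order, filling only the fields still missing."""
    merged: Metadata = {f: None for f in REQUIRED_FIELDS}
    for source in sources:
        try:
            result = source(extension_id)
        except Exception:
            # Sketch only: real code should catch narrower errors and log them.
            continue
        for field in REQUIRED_FIELDS:
            if merged[field] is None:
                merged[field] = result.get(field)
        if not missing_fields(merged):
            break  # everything found, no need to query further sources
    return merged


def check_integrity(metadata: Metadata) -> None:
    """Fail loudly if any key field is missing (for the scheduled test)."""
    missing = missing_fields(metadata)
    if missing:
        raise AssertionError(f"missing metadata fields: {missing}")


# Stub sources with made-up values, standing in for the scraper and the API.
def stub_scraper(extension_id: str) -> Metadata:
    return {"name": "Example Extension", "rating": None, "developer": None}


def stub_fallback_api(extension_id: str) -> Metadata:
    return {"name": "Example Extension", "rating": "4.5", "developer": "Example Dev"}


result = fetch_with_fallback("example-extension-id", [stub_scraper, stub_fallback_api])
check_integrity(result)  # passes because the fallback filled the gaps
print(result)
```

For the daily check, calling check_integrity on the scraper's result alone (without the fallback) is probably the more useful assertion, since the fallback would otherwise mask the layout change the test is meant to detect.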

Metadata

Labels: bug, enhancement
