Robustness: BeautifulSoup CWS scraping is fragile; add fallback API and integrity tests
Description
The current method for extracting extension data relies heavily on scraping the Chrome Web Store (CWS) HTML using BeautifulSoup. The logic looks for highly specific, hardcoded DOM elements (e.g., specific <h1> tags, aria-label attributes, and regex matches for strings like "4.7 out of 5 stars").
Proof / Technical Details
- File: core/extension_metadata.py
- Google frequently updates the DOM structure of the Chrome Web Store. When they change a class name or an aria-label, our BeautifulSoup selectors will fail.
- Because of how the scraper is currently written, these failures will likely be silent: the scraper won't crash the app, it will just start returning None for critical extension metadata.
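To illustrate the silent-failure mode, here is a minimal sketch that uses only the standard library. The names RATING_RE, parse_rating, REQUIRED_FIELDS, and missing_fields are hypothetical and are not taken from core/extension_metadata.py; the pattern is only modeled on the "4.7 out of 5 stars" example above. It shows how a markup change turns a parsed value into None without raising, and how a small check could at least log the gap.

```python
import logging
import re
from typing import Dict, List, Optional

logger = logging.getLogger("extension_metadata")

# Hypothetical pattern, modeled on the "4.7 out of 5 stars" example.
# The real pattern in core/extension_metadata.py may differ.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?) out of 5 stars")

# Assumed set of fields a report cannot do without.
REQUIRED_FIELDS = ("name", "rating", "developer")


def parse_rating(html: str) -> Optional[float]:
    """Return the rating, or None when the expected text is absent.

    Returning None is what makes the failure silent: no exception
    is raised when the store's markup changes.
    """
    match = RATING_RE.search(html)
    return float(match.group(1)) if match else None


def missing_fields(metadata: Dict[str, object]) -> List[str]:
    """List required fields that are absent or None, and log a warning."""
    missing = [f for f in REQUIRED_FIELDS if metadata.get(f) is None]
    if missing:
        logger.warning("CWS scrape returned no value for: %s", ", ".join(missing))
    return missing
```

With the old markup, parse_rating('aria-label="4.7 out of 5 stars"') yields 4.7; if the label text changes to something like "Rated 4.7/5", the same call returns None and the scan still completes.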
Impact
If the CWS DOM changes, the backend will successfully complete scans, but the resulting reports will be missing crucial context (like user ratings, developer info, or descriptions), degrading the quality of the ExtensionShield output without triggering any internal server errors.
Proposed Solution
- Fallback Mechanism: Integrate a fallback data source (e.g., the chrome-stats.com API or similar service) to fetch metadata if the HTML scraper fails to find the expected DOM elements.
- Integrity Testing: Create a scheduled integration test (via GitHub Actions) that scrapes a known, stable extension (like Google Translate) once a day and asserts that key fields (name, rating, developer) are not None. This will alert us within a day if Google changes their layout, rather than us finding out from users.
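Both proposals could be sketched roughly as follows. This is an outline under stated assumptions, not the project's implementation: fetch_with_fallback, assert_key_fields_present, and KEY_FIELDS are hypothetical names, and the scraper and the fallback client are passed in as plain callables so that no particular third-party API is assumed.

```python
from typing import Callable, Dict, Optional

Metadata = Dict[str, Optional[object]]
Fetcher = Callable[[str], Optional[Metadata]]

# Fields named in the proposal; extend as needed.
KEY_FIELDS = ("name", "rating", "developer")


def fetch_with_fallback(extension_id: str, scrape: Fetcher, fallback: Fetcher) -> Metadata:
    """Use the HTML scraper first; consult the fallback only if key fields are missing."""
    primary = scrape(extension_id) or {}
    if all(primary.get(f) is not None for f in KEY_FIELDS):
        return primary
    secondary = fallback(extension_id) or {}
    # Start from the fallback data, then let scraped non-None values win.
    merged: Metadata = dict(secondary)
    merged.update({k: v for k, v in primary.items() if v is not None})
    return merged


def assert_key_fields_present(metadata: Metadata) -> None:
    """Integrity check intended for a scheduled test run."""
    missing = [f for f in KEY_FIELDS if metadata.get(f) is None]
    assert not missing, f"CWS scrape is missing key fields: {missing}"
```

A daily job could call the real scraper for the known extension and pass the result to assert_key_fields_present, so a layout change fails the scheduled run instead of quietly degrading reports. Preferring scraped values over fallback values is a design choice made here for illustration; the opposite priority may suit the project better.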