Scraper Fragility

### Robustness: BeautifulSoup CWS scraping is fragile; add fallback API and integrity tests

**Description**
The current method for extracting extension data relies heavily on scraping the Chrome Web Store (CWS) HTML with BeautifulSoup. The logic looks for highly specific, hardcoded DOM elements (e.g., particular `<h1>` tags, `aria-label` attributes, and regex matches for strings like "4.7 out of 5 stars").

**Proof / Technical Details**
* **File:** `core/extension_metadata.py`
* Google frequently updates the DOM structure of the Chrome Web Store. When they change a class name or an `aria-label`, our BeautifulSoup selectors will fail.
* Because of how the scraper is currently written, these failures will likely be *silent*: the scraper won't crash the app; it will simply start returning `None` for critical extension metadata.
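
To illustrate the silent-failure mode, here is a minimal sketch that uses only the standard library. It is not the code in `core/extension_metadata.py`; the function name `extract_rating`, the regex, and both markup samples are assumptions made up for the example.

```python
import re
from typing import Optional

# Assumed pattern, modeled on the "4.7 out of 5 stars" string mentioned above.
RATING_RE = re.compile(r"(\d(?:\.\d)?) out of 5 stars")


def extract_rating(html: str) -> Optional[float]:
    """Return the rating found in the markup, or None if the pattern is absent."""
    match = RATING_RE.search(html)
    return float(match.group(1)) if match else None


# Hypothetical markup before and after a store redesign.
old_markup = '<div aria-label="4.7 out of 5 stars"></div>'
new_markup = '<div aria-label="Rated 4.7 stars out of five"></div>'

rating_before = extract_rating(old_markup)  # pattern matches
rating_after = extract_rating(new_markup)   # pattern no longer matches, no exception raised
```

The second call raises nothing and logs nothing, so a scan would finish normally with the rating missing.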

**Impact**
If the CWS DOM changes, the backend will successfully complete scans, but the resulting reports will be missing crucial context (like user ratings, developer info, or descriptions), degrading the quality of the ExtensionShield output without triggering any internal server errors.

**Proposed Solution**
1. **Fallback Mechanism:** Integrate a fallback data source (e.g., the chrome-stats.com API or similar service) to fetch metadata if the HTML scraper fails to find the expected DOM elements.
2. **Integrity Testing:** Create a scheduled integration test (via GitHub Actions) that scrapes a known, stable extension (like Google Translate) once a day and asserts that key fields (`name`, `rating`, `developer`) are not `None`. This will alert us as soon as Google changes its layout, instead of us finding out from users.
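
Both proposals could be sketched roughly as follows. Everything here is hypothetical: `missing_fields`, `fetch_with_fallback`, and the two stub fetchers are illustrative names, and the stubs stand in for the real scraper and for whichever fallback service is chosen, so no actual API is called.

```python
from typing import Callable, Dict, List, Optional

Metadata = Dict[str, Optional[object]]
Fetcher = Callable[[str], Metadata]

# Key fields named in the integrity-testing proposal.
REQUIRED_FIELDS = ("name", "rating", "developer")


def missing_fields(metadata: Metadata) -> List[str]:
    """List the required fields that are absent or None."""
    return [field for field in REQUIRED_FIELDS if metadata.get(field) is None]


def fetch_with_fallback(extension_id: str, primary: Fetcher, fallback: Fetcher) -> Metadata:
    """Use the primary fetcher, then fill any missing required fields from the fallback."""
    result = dict(primary(extension_id))
    gaps = missing_fields(result)
    if gaps:
        backup = fallback(extension_id)
        for field in gaps:
            if backup.get(field) is not None:
                result[field] = backup[field]
    return result


# Stub fetchers (assumptions): a scraper hit by a DOM change, and a healthy fallback.
def broken_scraper(extension_id: str) -> Metadata:
    return {"name": "Example Extension", "rating": None, "developer": None}


def stub_fallback(extension_id: str) -> Metadata:
    return {"name": "Example Extension", "rating": 4.7, "developer": "Example Dev"}


merged = fetch_with_fallback("hypothetical-extension-id", broken_scraper, stub_fallback)
gaps_after_merge = missing_fields(merged)
```

A scheduled test could run the real scraper alone against the known extension and fail when `missing_fields` returns a non-empty list, which would turn the silent `None` values into a visible CI failure.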

Scraper Fragility #167