Why
Need clean text for search and reading. Raw HTML is noisy.
Description of Done
- Extractor returns a structured result: title, site name, byline if present, main text, guessed language code, and cleaned HTML
- Boilerplate, navigation, and scripts are removed
- Relative links are resolved to absolute URLs against the page URL
- Text is whitespace-normalized
- Unit tests cover common layouts and malformed documents
- Fuzz tests run the extractor against random inputs without panics
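
The result shape and two of the normalization rules above could be sketched as follows. This is only an illustration under stated assumptions: the ticket does not name an implementation language, so Python is used for brevity, and `ExtractionResult`, `normalize_whitespace`, and `resolve_link` are hypothetical names, not an existing API. Whether paragraph breaks should survive whitespace normalization is not specified here; this sketch collapses all whitespace.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class ExtractionResult:
    """Hypothetical result shape; field names are illustrative only."""
    title: str
    site_name: str
    byline: Optional[str]  # None when the page has no byline
    text: str              # main text, whitespace-normalized
    language: str          # guessed language code, e.g. "en"
    html: str              # cleaned HTML


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


def resolve_link(page_url: str, href: str) -> str:
    """Resolve href against the page URL; absolute hrefs pass through unchanged."""
    return urljoin(page_url, href)
```

For example, `resolve_link("https://example.com/blog/post.html", "../about")` yields `https://example.com/about`, and a unit test for a page without a byline would expect `byline` to be `None` rather than an empty string.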
Tasks