Why
Storing both the original and the cleaned representations enables re-extraction and auditing.
Definition of Done
- New columns exist for cleaned text, cleaned HTML, language, extracted_at timestamp, and checksum
- Checksums prevent duplicate writes when content has not changed
- Large bodies are stored efficiently and streamed to the database
- Database constraints protect referential integrity
- Migration is idempotent and reversible
- Unit tests cover insert, update, and no-op when the checksum matches
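As a rough illustration of the checksum rule in the list above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (items, item_content), the column names, the use of SHA-256, and the helper names content_checksum and save_content are all assumptions made for this example, not the project's actual schema or API. The real migration may target a different database and hash different fields.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema for illustration only; names are assumptions.
SCHEMA = """
PRAGMA foreign_keys = ON;
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS item_content (
    item_id      INTEGER PRIMARY KEY REFERENCES items(id) ON DELETE CASCADE,
    cleaned_text TEXT NOT NULL,
    cleaned_html TEXT NOT NULL,
    language     TEXT,
    extracted_at TEXT NOT NULL,
    checksum     TEXT NOT NULL
);
"""


def content_checksum(cleaned_text: str, cleaned_html: str) -> str:
    """Hash both cleaned bodies; a NUL byte separates them so that
    ("ab", "c") and ("a", "bc") do not collide."""
    digest = hashlib.sha256()
    digest.update(cleaned_text.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(cleaned_html.encode("utf-8"))
    return digest.hexdigest()


def save_content(conn: sqlite3.Connection, item_id: int, cleaned_text: str,
                 cleaned_html: str, language: str) -> str:
    """Write cleaned content and return 'insert', 'update', or 'noop'.

    Assumption: only the two cleaned bodies feed the checksum, so a
    change to language alone is treated as a no-op.
    """
    checksum = content_checksum(cleaned_text, cleaned_html)
    row = conn.execute(
        "SELECT checksum FROM item_content WHERE item_id = ?", (item_id,)
    ).fetchone()
    if row is not None and row[0] == checksum:
        return "noop"  # content unchanged, skip the write entirely

    extracted_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on exception
        if row is None:
            conn.execute(
                "INSERT INTO item_content (item_id, cleaned_text, cleaned_html,"
                " language, extracted_at, checksum) VALUES (?, ?, ?, ?, ?, ?)",
                (item_id, cleaned_text, cleaned_html, language,
                 extracted_at, checksum),
            )
            return "insert"
        conn.execute(
            "UPDATE item_content SET cleaned_text = ?, cleaned_html = ?,"
            " language = ?, extracted_at = ?, checksum = ? WHERE item_id = ?",
            (cleaned_text, cleaned_html, language, extracted_at,
             checksum, item_id),
        )
        return "update"
```

The three return values map onto the three unit-test cases named above: a first write inserts, a write with different content updates, and a repeated write with identical content does nothing. Reading the stored checksum before writing keeps the no-op path free of any write, at the cost of one extra SELECT per call.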
Tasks
item_id and checksum