USE 259 - parse HTML for mitlibwebsite source #266
Merged
Why these changes are being introduced:

Now that browsertrix-harvester is including full HTML + response headers in the source record available to Transmogrifier, we can do two things:

1. Parse metadata for mitlibwebsite TIMDEX records from the original, full HTML in a more opinionated fashion than we could in browsertrix-harvester.
2. Extract good, meaningful full-text from the full HTML to use for the new `fulltext` field.

How this addresses that need:

* Expects a new `html_base64` field in the browsertrix-harvester source records. Uses this to extract metadata and full-text for the record.

Side effects of this change:

* Full-text is now available in the TIMDEX record for the mitlibwebsite source.
* If needed, this HTML parsing could be utilized to extract more granular, source specific metadata in the future.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-259
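As a rough illustration of the approach described in this commit message, the sketch below decodes an `html_base64` value and pulls an `og:description` value and visible text out of the HTML. It is not the transformer's actual code: it uses only the Python standard library's `html.parser` rather than whatever parsing library Transmogrifier uses, and the names `PageParser`, `parse_source_record`, `SKIPPED_TAGS`, and the returned dict shape are hypothetical. The choice of which tags to exclude from full-text is also an assumption.

```python
import base64
from html.parser import HTMLParser

# Tags whose text is treated as non-content here (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}


class PageParser(HTMLParser):
    """Hypothetical helper: collects og:description and visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.og_description = None
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside any skipped tag

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("property") == "og:description":
            self.og_description = attr_map.get("content")
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_source_record(source_record):
    """Decode the base64 HTML and return summary + fulltext (hypothetical shape)."""
    html = base64.b64decode(source_record["html_base64"]).decode("utf-8")
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {
        "summary": parser.og_description,
        "fulltext": " ".join(parser.text_parts),
    }


# Made-up example record, built only to exercise the sketch.
example_html = (
    "<html><head><meta property='og:description' content='Hours and locations.'>"
    "<style>p {color: red}</style></head>"
    "<body><nav>Menu</nav><p>Visit the library.</p>"
    "<script>track()</script></body></html>"
)
record = {
    "html_base64": base64.b64encode(example_html.encode("utf-8")).decode("ascii")
}
parsed = parse_source_record(record)
```

Skipping navigation, scripts, and styles is one way to get source-specific full-text that is cleaner than a generic extraction; the real set of excluded elements would depend on the markup of the MIT Libraries website.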
Purpose and background context
This PR updates the transformer for `mitlibwebsite` to use the full, rendered HTML that browsertrix-harvester now includes in the source records.

Most impactfully, we can now set the new `fulltext` field with full-text extracted from the HTML in a way that makes sense for this source specifically (rather than relying on the full-text extraction from browsertrix, which is good but not great).

More mechanically, this recreates in Transmogrifier the minimal metadata parsing that was previously performed in browsertrix-harvester (at this time, really only an OpenGraph `og:description` element that we map to `summary`). However, we're set up nicely for the future if we want to extract more metadata from the original page HTML.

How can a reviewer manually see the effects of these changes?
1- Set AWS Dev1 credentials
2- Start ipython shell
3- Load the `MITLibWebsite` transformer with example records from S3:
4- Transform a single record to inspect:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES:
What are the relevant tickets?
Code review