USE 364 - mitlibwebsite targeted fulltext extraction #268
Merged
Why these changes are being introduced:
It was decided that the full-text being extracted from the full mitlibwebsite HTML was too broad. We were collecting header and footer data that was not unique to the record/URL at hand.

How this addresses that need:
After some analysis by DiscoEng, some URL + element selector patterns were identified to target meaningful container elements. This has dramatically reduced the amount of full-text while increasing its quality.

Side effects of this change:
* mitlibwebsite TIMDEX records have higher quality fulltext field values

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-364
ghukill
commented
Jan 29, 2026
Comment on lines +169 to +170
    (True, {"class": "content-main"}),  # True = wildcard element
    (True, {"class": "main-content"}),  # True = wildcard element
Contributor
Author
This was new syntax to me in BeautifulSoup4 (BS4): you can use True to wildcard match any element.
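For readers unfamiliar with this syntax, here is a minimal sketch of the wildcard match. The HTML snippet and variable names are invented for illustration and are not taken from the PR; only the `(True, {"class": "content-main"})` selector comes from the diff above.

```python
from bs4 import BeautifulSoup

# Invented sample page: a repeated header/footer around one meaningful container.
html = """
<html><body>
  <header class="site-header">Site navigation</header>
  <section class="content-main"><p>Page-specific text.</p></section>
  <footer>Footer links</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing True as the tag name matches any element; the attrs dict then
# narrows the match to elements carrying the given class.
container = soup.find(True, {"class": "content-main"})

container_name = container.name  # the tag that happened to match
container_text = container.get_text(" ", strip=True)
```

Because the tag name is not fixed, the same selector keeps working whether a page wraps its main content in a `div`, `section`, or `main` element.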
ehanson8
approved these changes
Jan 30, 2026
Contributor
ehanson8
left a comment
Looks good and a sensible change, since there is clearly a lot of non-useful content in the unrefined full text.
Contributor
Very helpfully formatted fixture!
Comment on lines +116 to +120
    Using the full-text from the entire page will include far too much content that
    is not unique or relevant to the page at hand, including repeating header and
    footer data. Our approach may evolve over time, but this method aims to extract
    only meaningful full-text from each record based on some simple rules and specific
    container elements to look for.
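The rule-based approach this docstring describes might be sketched roughly as follows. This is not the PR's implementation: `SELECTOR_RULES`, `extract_fulltext`, the URL patterns, and the choice to return `None` when nothing matches are all assumptions made for illustration. Only the two class selectors are taken from the diff.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical rule table: (URL regex, selectors tried in order).
# The URL patterns are invented; only the two class selectors come from the diff.
SELECTOR_RULES = [
    (re.compile(r"/news/"), [("article", {})]),
    (
        re.compile(r".*"),
        [
            (True, {"class": "content-main"}),  # True = wildcard element
            (True, {"class": "main-content"}),  # True = wildcard element
        ],
    ),
]


def extract_fulltext(url, html):
    """Return text from the first matching container, or None (assumed fallback)."""
    soup = BeautifulSoup(html, "html.parser")
    for url_pattern, selectors in SELECTOR_RULES:
        if not url_pattern.search(url):
            continue
        for name, attrs in selectors:
            container = soup.find(name, attrs)
            if container is not None:
                return container.get_text(" ", strip=True)
        # No selector matched for this rule; fall through to the next rule.
    return None


# Invented sample page with repeated header/footer content.
page = (
    "<html><body><header>Nav</header>"
    '<div class="main-content"><h1>Hours</h1><p>Open daily.</p></div>'
    "<footer>Contact</footer></body></html>"
)
text = extract_fulltext("https://example.org/hours", page)
```

Keeping the rules in an ordered table means a new URL pattern or container selector can be added without touching the extraction loop, which fits the docstring's note that the approach may evolve over time.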
Purpose and background context
This PR updates how we extract full-text for mitlibwebsite. After some analysis from DiscoEng, some logic was identified for HTML selectors we could use to grab container elements holding text specific to each page, excluding content that repeats across all pages like headers and footers.
NOTE: much of the file churn comes from updated dependencies and updated linting, which is encapsulated in a single commit. The meaningful changes can be found in this commit.
How can a reviewer manually see the effects of these changes?
Please see this USE-365 Jira ticket comment that links to a spreadsheet analyzing the results of the fulltext field after these changes were implemented.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: reduction of repeating and unhelpful text in the mitlibwebsite records will improve search relevancy and reduce noise in the USE interface.
What are the relevant tickets?
Code review