Currently all the events/CFPs are stored in a `CFPCache` table in the original source format (or as constructed JSON for HTML-scraped sources). When the `/api/cfps` endpoint is hit, the cached values (or fresh data, if there is no cache or the cache is more than 24 hours old) are collated, transformed, and merged into one result, which is returned to the front end.
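For reference, the freshness rule described above can be sketched as follows. This is only an illustration in Python, not the project's actual code; `needs_refresh` and `CACHE_TTL` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the description above: cached data is refetched once it is older than 24 hours.
CACHE_TTL = timedelta(hours=24)


def needs_refresh(cached_at: Optional[datetime], now: datetime) -> bool:
    """Return True when there is no cache entry or it is more than 24 hours old."""
    if cached_at is None:
        return True
    return now - cached_at > CACHE_TTL
```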
The plan:
- Move the fetch-transform-merge into a new `/api/cfps/update` endpoint, which writes the CFPs to a new database table (`CFPs`) in the `CFP` structure.
- Update the `/api/cfps` endpoint to return the data directly from the new table of CFPs
- Call the new API endpoint once a day
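A rough sketch of how the write path and read path could be split, using Python and an in-memory SQLite database purely for illustration. The column names, the `update_cfps` / `list_cfps` helpers, and the idea of passing sources as callables are all assumptions, not the existing code.

```python
import sqlite3
from typing import Callable, Iterable


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical columns; the real CFP structure may carry more fields.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cfps ("
        "name TEXT NOT NULL, event_date TEXT NOT NULL, cfp_url TEXT NOT NULL)"
    )


def update_cfps(conn: sqlite3.Connection, sources: Iterable[Callable[[], list]]) -> int:
    """What /api/cfps/update might do: fetch, transform, merge, then rewrite the table."""
    merged = []
    for fetch in sources:
        # Each source is assumed to return already-transformed CFP dicts.
        # Merging here is plain concatenation; de-duplication is a separate concern.
        merged.extend(fetch())
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM cfps")
        conn.executemany(
            "INSERT INTO cfps (name, event_date, cfp_url) "
            "VALUES (:name, :event_date, :cfp_url)",
            merged,
        )
    return len(merged)


def list_cfps(conn: sqlite3.Connection) -> list:
    """What /api/cfps might do afterwards: read straight from the table."""
    rows = conn.execute(
        "SELECT name, event_date, cfp_url FROM cfps ORDER BY event_date"
    ).fetchall()
    return [{"name": n, "event_date": d, "cfp_url": u} for n, d, u in rows]
```

With this split, the read endpoint never touches the upstream sources, so its latency no longer depends on cache age.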
The definition of "same CFP" is tricky, because sources report different data for the same event. In addition to the event names being similar but not identical, the event and CFP URLs might also vary slightly (trailing slashes, query strings, etc.), and I've even seen different event dates.
For this reason, I think a good definition of "same CFP" is:
- Conference names: match using some kind of string similarity algorithm (like Levenshtein distance or cosine similarity)
- Dates: match within a +/- 2 day window (narrow enough to differentiate events that run monthly or less frequently)
- CFP URL: match after trimming whitespace and slashes
It would be nice to also flag events that match only two of these three criteria for a moderator to look at, so that we can hone the algorithm further.
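A minimal sketch of the three criteria and the "two out of three" flag, in Python. It uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for Levenshtein distance or cosine similarity; the 0.85 threshold, the field names, and the `classify` helper are assumptions for illustration and would need tuning against real data.

```python
from datetime import date
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # assumed cutoff, not a tested value
DATE_WINDOW_DAYS = 2   # the +/- 2 day window proposed above


def normalize_url(url: str) -> str:
    # Trim surrounding whitespace first, then leading/trailing slashes.
    return url.strip().strip("/")


def names_match(a: str, b: str) -> bool:
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= NAME_THRESHOLD


def dates_match(a: date, b: date) -> bool:
    return abs((a - b).days) <= DATE_WINDOW_DAYS


def urls_match(a: str, b: str) -> bool:
    return normalize_url(a) == normalize_url(b)


def classify(a: dict, b: dict) -> str:
    """Return "same" for 3 of 3 criteria, "review" for 2 of 3, else "different"."""
    score = sum([
        names_match(a["name"], b["name"]),
        dates_match(a["event_date"], b["event_date"]),
        urls_match(a["cfp_url"], b["cfp_url"]),
    ])
    if score == 3:
        return "same"
    if score == 2:
        return "review"  # candidate for a moderator to look at
    return "different"
```

Pairs classified as "review" are the ones a moderator would inspect, and their outcomes could feed back into the threshold and window values.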