Basic semantic search for a tweet archive. Part of the Community Archive ecosystem. Generates semantic embeddings with OpenAI for each tweet thread (replies & retweets are ignored). Embeddings are inserted into CloudFlare's Vectorize DB. The frontend embeds the query with OpenAI and searches the vector DB.
Live demo: (https://defenderofbasic.github.io/twitter-semantic-search/)
The general steps are, create & deploy the CloudFlare worker + vector DB (see instructions in cloudflare-worker/ directory). Then generate embeddings (run the script in generate-embeddings/ with your archive JSON in archives/). Finally run the frontend/ and replace the cloudflare URL with your own, and a URL where the archive JSON is hosted.
- Support offline mode. Can just use a local server that queries the vectra DB, no need for cloudflare.
- script to turn a twitter zip to a single gzipped json, so you can do this even if your data isn't on Community Archive.