An AI-powered backend that takes in multiple types of learning resources — PDFs, GitHub READMEs, YouTube videos, and blog articles — and generates an optimized, ordered curriculum for learning a topic.
Given a mixed set of sources, the system:
- Extracts text from each source (PDF parsing, GitHub README fetching, YouTube transcript extraction, web scraping)
- Extracts key concepts using KeyBERT
- Generates semantic embeddings using SentenceTransformers
- Scores each source's difficulty (1-10) using the Gemini API
- Builds a dependency graph between sources based on concept similarity and difficulty
- Removes cycles to ensure a valid ordering exists
- Generates three candidate learning paths:
  - Easy-first: progressive difficulty
  - Hard-first: challenge-first
  - Balanced: interleaved difficulty
- Uses Gemini as an LLM-judge to evaluate all three paths and recommend the best one, with reasoning
- Framework: FastAPI (async)
- NLP: KeyBERT for concept/keyphrase extraction
- Embeddings: SentenceTransformers (all-MiniLM-L6-v2)
- LLM: Google Gemini 2.5 Flash (difficulty scoring and path evaluation)
- Algorithms: cosine similarity, cycle detection, topological sort (heap-based and median-selection variants)
- Source ingestion: BeautifulSoup (web scraping), youtube-transcript-api, GitHub raw README fetching
The API accepts a mix of uploaded files and URLs in a single request. URLs are auto-detected:
- YouTube links → transcript extraction via youtube-transcript-api
- GitHub links → raw README fetching
- Other URLs → article text extraction via BeautifulSoup
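The routing rules above could be sketched as a small host-based classifier. This is only an illustration: the function name `classify_url`, the returned labels, and the exact set of YouTube hosts are assumptions, not the project's actual code.

```python
from urllib.parse import urlparse

# Hosts treated as YouTube in this sketch (an assumed, non-exhaustive set).
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def classify_url(url: str) -> str:
    """Return 'youtube', 'github', or 'article' based on the URL's host."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com like youtube.com
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host == "github.com":
        return "github"
    # Anything else falls through to generic article scraping.
    return "article"
```

Matching on the parsed hostname rather than on a substring of the full URL avoids misrouting a blog post whose path merely contains the word "github" or "youtube".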
Each source's text is processed through KeyBERT to extract key concepts (1-2 word keyphrases), and SentenceTransformers generates a semantic embedding for the full text.
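A minimal sketch of this step is shown here. The helper name `extract_features`, the `top_n` default, and the choice to pass the models in as arguments are assumptions made for illustration; the two library calls it relies on are KeyBERT's `extract_keywords` and SentenceTransformers' `encode`.

```python
from typing import Any, Dict


def extract_features(text: str, kw_model: Any, embedder: Any, top_n: int = 10) -> Dict[str, Any]:
    """Extract 1-2 word keyphrases and one embedding for the full text.

    kw_model is expected to behave like keybert.KeyBERT and embedder like
    sentence_transformers.SentenceTransformer (both passed in by the caller).
    """
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),  # 1-2 word keyphrases
        stop_words="english",
        top_n=top_n,
    )
    # KeyBERT returns (phrase, score) pairs; keep only the phrases.
    concepts = [phrase for phrase, _score in keywords]
    embedding = embedder.encode(text)
    return {"concepts": concepts, "embedding": embedding}


# Usage with the real libraries (downloads the model on first run):
#   from keybert import KeyBERT
#   from sentence_transformers import SentenceTransformer
#   embedder = SentenceTransformer("all-MiniLM-L6-v2")
#   features = extract_features(text, KeyBERT(model=embedder), embedder)
```

Passing the models in keeps the function cheap to test and lets one loaded SentenceTransformer instance serve both keyphrase extraction and embedding.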
Gemini rates each source's technical difficulty on a 1-10 scale. Scores are cached to avoid redundant API calls.
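One way such a cache might look is sketched here, assuming an in-memory dictionary keyed by a hash of the source text. The class name `DifficultyCache` and the `scorer` callable (standing in for the Gemini call) are hypothetical, and clamping to the 1-10 range is an added safeguard rather than documented behavior.

```python
import hashlib
from typing import Callable, Dict


class DifficultyCache:
    """Memoize difficulty scores so identical text is only scored once."""

    def __init__(self, scorer: Callable[[str], int]) -> None:
        self._scorer = scorer  # hypothetical wrapper around the LLM call
        self._scores: Dict[str, int] = {}

    def score(self, text: str) -> int:
        # Hash the text so the cache key stays small for long documents.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._scores:
            raw = self._scorer(text)
            # Clamp to the 1-10 scale in case the model answers out of range.
            self._scores[key] = max(1, min(10, int(raw)))
        return self._scores[key]
```

Keying on the text rather than the source name means a re-uploaded file with the same content reuses its earlier score.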
A directed graph is built where edges represent "should be learned before" relationships, determined by:
- Concept similarity (cosine similarity ≥ 0.3)
- Relative difficulty scores
- Shared concept overlap (for sources of equal difficulty)
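The three criteria above might combine roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the function names are illustrative, the edge direction (easier source points to harder source) is inferred from "should be learned before", and the tie-break for equal difficulty (fewer concepts first, then name) is invented for the example.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_graph(
    embeddings: Dict[str, Sequence[float]],
    difficulty: Dict[str, int],
    concepts: Dict[str, Set[str]],
    threshold: float = 0.3,
) -> Dict[str, List[str]]:
    """Map each source to the sources that should be learned after it."""
    names = sorted(embeddings)
    graph: Dict[str, List[str]] = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) < threshold:
                continue  # unrelated sources get no edge
            if difficulty[a] < difficulty[b]:
                graph[a].append(b)
            elif difficulty[b] < difficulty[a]:
                graph[b].append(a)
            elif concepts[a] & concepts[b]:
                # Assumed tie-break: the source with fewer concepts goes first.
                first, second = sorted((a, b), key=lambda n: (len(concepts[n]), n))
                graph[first].append(second)
    return graph
```

The result is an adjacency list in the same form as the `graph` field of the API response, where each key lists its dependent sources.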
Any cycles in the dependency graph are detected and broken so a valid topological ordering exists.
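A common way to do this is a depth-first search that drops every back edge, sketched here. Which edge the project actually removes from a cycle is not stated, so the choice made in this example (the edge that closes the cycle during traversal) is an assumption, as is the function name `break_cycles`.

```python
from typing import Dict, List

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def break_cycles(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of graph with every DFS back edge removed."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    color = {n: WHITE for n in nodes}
    acyclic: Dict[str, List[str]] = {n: [] for n in nodes}

    def visit(node: str) -> None:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                continue  # edge points back into the current path: drop it
            acyclic[node].append(nxt)
            if color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK

    # Sorted start order keeps the result deterministic across runs.
    for node in sorted(nodes):
        if color[node] == WHITE:
            visit(node)
    return acyclic
```

Removing all back edges of a depth-first traversal always leaves a directed acyclic graph, so a topological order is guaranteed to exist afterwards. The recursive form is fine for a handful of sources; a very large graph would want an iterative version.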
Three topological sorts are run over the resulting graph:
- Easy-first — min-heap ordered by difficulty
- Hard-first — max-heap ordered by difficulty
- Balanced — median-selection scheduling
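The heap-based variants can be sketched as Kahn's algorithm with a priority queue over the sources that are currently unblocked. The balanced variant below is only one plausible reading of "median-selection scheduling" (always pick the ready source of median difficulty); the function names and tie-breaking by name are likewise assumptions. Both functions expect every source to appear as a key in `graph`.

```python
import heapq
from typing import Dict, List


def _indegrees(graph: Dict[str, List[str]]) -> Dict[str, int]:
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    return indegree


def heap_toposort(graph: Dict[str, List[str]], difficulty: Dict[str, int],
                  hard_first: bool = False) -> List[str]:
    """Kahn's algorithm; a heap picks the easiest (or hardest) ready source."""
    sign = -1 if hard_first else 1  # negate keys to turn the min-heap into a max-heap
    indegree = _indegrees(graph)
    heap = [(sign * difficulty[n], n) for n, d in indegree.items() if d == 0]
    heapq.heapify(heap)
    order: List[str] = []
    while heap:
        _, node = heapq.heappop(heap)
        order.append(node)
        for target in graph[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                heapq.heappush(heap, (sign * difficulty[target], target))
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle")
    return order


def median_toposort(graph: Dict[str, List[str]], difficulty: Dict[str, int]) -> List[str]:
    """Kahn's algorithm; picks the ready source with (lower) median difficulty."""
    indegree = _indegrees(graph)
    ready = [n for n, d in indegree.items() if d == 0]
    order: List[str] = []
    while ready:
        ready.sort(key=lambda n: (difficulty[n], n))
        node = ready.pop((len(ready) - 1) // 2)
        order.append(node)
        for target in graph[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle")
    return order
```

All three orderings respect the same dependency edges; they differ only in which unblocked source is taken next.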
All three paths, along with source metadata, are sent to Gemini, which returns a structured evaluation — pros, cons, best-fit learner type, and a recommended path with justification.
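Because the evaluation comes back from an LLM, it can be worth validating before returning it to the client. The sketch below checks the field names used in this README's response shape; the function name `parse_evaluation` and the fallback to the highest-scored path when `best_path` is missing or unknown are assumptions, not documented behavior.

```python
import json
from typing import Any, Dict

PATH_NAMES = ("easy_first", "hard_first", "balanced")
REQUIRED_FIELDS = {"pros", "cons", "best_for", "score"}


def parse_evaluation(raw: str) -> Dict[str, Any]:
    """Parse and sanity-check the judge's JSON evaluation of the three paths."""
    data = json.loads(raw)
    for name in PATH_NAMES:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing evaluation for path {name!r}")
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"path {name!r} is missing fields: {sorted(missing)}")
    if data.get("best_path") not in PATH_NAMES:
        # Assumed fallback: recommend the path with the highest score.
        data["best_path"] = max(PATH_NAMES, key=lambda n: data[n]["score"])
    return data
```

Failing fast on a malformed evaluation makes it easier to retry the LLM call than to debug a partially filled `analysis` object downstream.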
Accepts multipart form-data with files and/or URLs in any combination.
curl -X POST http://localhost:8000/generate-paths/ \
-F "files=@intro.txt" \
-F "files=@advanced.txt" \
-F "urls=https://github.com/user/repo" \
-F "urls=https://youtube.com/watch?v=VIDEO_ID" \
-F "urls=https://some-blog.com/article"

Response shape:
{
"sources": {
"source_name": { "difficulty": 1, "concepts": ["..."] }
},
"graph": {
"source_name": ["dependent_source_1", "dependent_source_2"]
},
"difficulty_scores": {
"source_name": 1
},
"paths": {
"easy_first": ["..."],
"hard_first": ["..."],
"balanced": ["..."]
},
"analysis": {
"easy_first": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
"hard_first": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
"balanced": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
"best_path": "easy_first",
"reason": "..."
}
}

pip install fastapi uvicorn keybert sentence-transformers scikit-learn \
google-generativeai httpx beautifulsoup4 youtube-transcript-api
# Add your Gemini API key
# genai.configure(api_key="YOUR_API_KEY")
uvicorn main:app --reload

🚧 This is a working backend: the full pipeline (multi-source ingestion, concept extraction, embeddings, dependency graph construction, path generation, and LLM-based evaluation) is functional and tested.
It does not have a frontend yet :( — I'm actively working on this, and within a few weeks there will be a UI to make this usable end-to-end :)
Thanks for stopping by and checking this out! If you found it interesting or useful, a star on the repo would mean a lot — it keeps me motivated to keep building 🚀
