Learning Path Generator

An AI-powered backend that takes in multiple types of learning resources — PDFs, GitHub READMEs, YouTube videos, and blog articles — and generates an optimized, ordered curriculum for learning a topic.

Overview

Given a mixed set of sources, the system:

1. Extracts text from each source (PDF parsing, GitHub README fetching, YouTube transcript extraction, web scraping)
2. Extracts key concepts using KeyBERT
3. Generates semantic embeddings using SentenceTransformers
4. Scores each source's difficulty (1-10) using the Gemini API
5. Builds a dependency graph between sources based on concept similarity and difficulty
6. Removes cycles to ensure a valid ordering exists
7. Generates three candidate learning paths:
   - Easy-first: progressive difficulty
   - Hard-first: challenge-first
   - Balanced: interleaved difficulty
8. Uses Gemini as an LLM judge to evaluate all three paths and recommend the best one, with reasoning

Tech Stack

- Framework: FastAPI (async)
- NLP: KeyBERT for concept/keyphrase extraction
- Embeddings: SentenceTransformers (all-MiniLM-L6-v2)
- LLM: Google Gemini 2.5 Flash for difficulty scoring and path evaluation
- Algorithms: cosine similarity, cycle detection, topological sort (heap-based and median-selection variants)
- Source ingestion: BeautifulSoup (web scraping), youtube-transcript-api, GitHub raw README fetching

How It Works

1. Source Ingestion

The API accepts a mix of uploaded files and URLs in a single request. URLs are auto-detected:

- YouTube links → transcript extraction via youtube-transcript-api
- GitHub links → raw README fetching
- Other URLs → article text extraction via BeautifulSoup
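The routing above could be sketched roughly like this. It is a minimal illustration and not the code in `main.py`: `classify_url` and the two hostname sets are hypothetical names, and the exact hosts the backend recognizes are an assumption.

```python
from urllib.parse import urlparse

# Hostnames treated as YouTube or GitHub. These sets are an assumption
# made for illustration; the real backend may match URLs differently.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}
GITHUB_HOSTS = {"github.com", "www.github.com"}


def classify_url(url: str) -> str:
    """Return 'youtube', 'github', or 'article' based on the URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host in GITHUB_HOSTS:
        return "github"
    # Anything else falls through to generic article scraping.
    return "article"
```

Each label would then select the matching extractor (transcript, README, or article scraper).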

2. Concept Extraction & Embedding

Each source's text is processed through KeyBERT to extract key concepts (1-2 word keyphrases), and SentenceTransformers generates a semantic embedding for the full text.

3. Difficulty Scoring

Gemini rates each source's technical difficulty on a 1-10 scale. Scores are cached to avoid redundant API calls.
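One way the caching could look, as a sketch only. `make_cached_scorer` is a hypothetical helper, `score_fn` stands in for whatever function actually calls Gemini, and both the content-hash cache key and the clamping to 1-10 are assumptions rather than documented behavior.

```python
import hashlib
from typing import Callable, Dict


def make_cached_scorer(score_fn: Callable[[str], int]) -> Callable[[str], int]:
    """Wrap a scoring function so identical text is only scored once."""
    cache: Dict[str, int] = {}

    def cached(text: str) -> int:
        # Key on a hash of the content so renamed duplicates still hit the cache.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            # Assumption: clamp whatever the model returns into the 1-10 range.
            cache[key] = max(1, min(10, int(score_fn(text))))
        return cache[key]

    return cached
```

In tests, `score_fn` can be a plain function that returns a fixed number, which avoids hitting the API at all.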

4. Dependency Graph Construction

A directed graph is built where edges represent "should be learned before" relationships, determined by:

- Concept similarity (cosine similarity ≥ 0.3)
- Relative difficulty scores
- Shared concept overlap (for sources of equal difficulty)
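A simplified sketch of the first two rules, in plain Python. It assumes the similarity is computed between the source embeddings and that edges point from the easier source to the harder one, matching the `graph` field in the response where each source lists its dependents. The equal-difficulty rule is not described in enough detail to reproduce, so this sketch leaves those pairs unlinked. `build_graph` and `cosine_similarity` are illustrative names.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

SIMILARITY_THRESHOLD = 0.3  # the threshold stated above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_graph(
    embeddings: Dict[str, Sequence[float]],
    difficulty: Dict[str, int],
) -> Dict[str, List[str]]:
    """Map each source to the sources that should be learned after it."""
    graph: Dict[str, List[str]] = {name: [] for name in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) < SIMILARITY_THRESHOLD:
            continue  # unrelated sources get no edge
        if difficulty[a] < difficulty[b]:
            graph[a].append(b)  # a is easier, so it comes first
        elif difficulty[b] < difficulty[a]:
            graph[b].append(a)
        # Equal difficulty: intentionally left unlinked in this sketch.
    return graph
```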

5. Cycle Removal

Any cycles in the dependency graph are detected and broken so a valid topological ordering exists.
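A common way to do this is a depth-first search that drops every back edge it meets. The sketch below uses that approach as an assumption; the project may pick which edge to break differently. `remove_cycles` is an illustrative name.

```python
from typing import Dict, List


def remove_cycles(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of the graph with DFS back edges removed.

    Assumes every node that appears as a neighbor is also a key in `graph`.
    Uses recursion, which is fine for the small graphs expected here.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    acyclic: Dict[str, List[str]] = {node: [] for node in graph}

    def visit(node: str) -> None:
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                continue  # back edge: it would close a cycle, so drop it
            acyclic[node].append(nxt)
            if color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK

    for node in graph:
        if color[node] == WHITE:
            visit(node)
    return acyclic
```

Which edge gets dropped depends on the traversal order, so the result is valid but not unique.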

6. Path Generation

Three topological sorts are run over the resulting graph:

- Easy-first: min-heap ordered by difficulty
- Hard-first: max-heap ordered by difficulty
- Balanced: median-selection scheduling
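The three variants can be sketched as Kahn-style topological sorts that differ only in how the next ready source is chosen. This is an illustration, not the code in `main.py`. In particular, "median-selection" is interpreted here as picking the middle item of the ready set when sorted by difficulty, which is an assumption. `heap_path` and `balanced_path` are hypothetical names, and the graph is assumed to be acyclic with every node present as a key.

```python
import heapq
from typing import Dict, List


def _in_degrees(graph: Dict[str, List[str]]) -> Dict[str, int]:
    """Count how many prerequisites each node has."""
    indeg = {node: 0 for node in graph}
    for dependents in graph.values():
        for dep in dependents:
            indeg[dep] += 1
    return indeg


def heap_path(graph: Dict[str, List[str]], difficulty: Dict[str, int],
              hardest_first: bool = False) -> List[str]:
    """Topological sort that always takes the easiest (or hardest) ready node."""
    sign = -1 if hardest_first else 1  # negate keys to turn heapq into a max-heap
    indeg = _in_degrees(graph)
    heap = [(sign * difficulty[n], n) for n in graph if indeg[n] == 0]
    heapq.heapify(heap)
    order: List[str] = []
    while heap:
        _, node = heapq.heappop(heap)
        order.append(node)
        for dep in graph[node]:
            indeg[dep] -= 1
            if indeg[dep] == 0:
                heapq.heappush(heap, (sign * difficulty[dep], dep))
    return order


def balanced_path(graph: Dict[str, List[str]],
                  difficulty: Dict[str, int]) -> List[str]:
    """Topological sort that takes the median-difficulty ready node each step."""
    indeg = _in_degrees(graph)
    ready = [n for n in graph if indeg[n] == 0]
    order: List[str] = []
    while ready:
        ready.sort(key=lambda n: (difficulty[n], n))
        node = ready.pop(len(ready) // 2)  # middle of the sorted ready set
        order.append(node)
        for dep in graph[node]:
            indeg[dep] -= 1
            if indeg[dep] == 0:
                ready.append(dep)
    return order
```

All three orderings respect the same dependency edges. They only differ when more than one source is ready at the same time.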

7. LLM-Based Path Evaluation

All three paths, along with source metadata, are sent to Gemini, which returns a structured evaluation — pros, cons, best-fit learner type, and a recommended path with justification.

API

`POST /generate-paths/`

Accepts multipart form-data with files and/or URLs in any combination.

curl -X POST http://localhost:8000/generate-paths/ \
  -F "files=@intro.txt" \
  -F "files=@advanced.txt" \
  -F "urls=https://github.com/user/repo" \
  -F "urls=https://youtube.com/watch?v=VIDEO_ID" \
  -F "urls=https://some-blog.com/article"

Response shape:

{
  "sources": {
    "source_name": { "difficulty": 1, "concepts": ["..."] }
  },
  "graph": {
    "source_name": ["dependent_source_1", "dependent_source_2"]
  },
  "difficulty_scores": {
    "source_name": 1
  },
  "paths": {
    "easy_first": ["..."],
    "hard_first": ["..."],
    "balanced": ["..."]
  },
  "analysis": {
    "easy_first": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
    "hard_first": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
    "balanced": { "pros": ["..."], "cons": ["..."], "best_for": "...", "score": 1 },
    "best_path": "easy_first",
    "reason": "..."
  }
}
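On the client side, the recommended ordering can be read straight out of this shape. A small sketch: `recommended_path` is a hypothetical helper, and it relies only on the `paths` and `analysis.best_path` keys shown above.

```python
from typing import Any, Dict, List


def recommended_path(response: Dict[str, Any]) -> List[str]:
    """Return the source ordering that the LLM judge recommended."""
    best = response["analysis"]["best_path"]  # e.g. "easy_first"
    paths = response["paths"]
    if best not in paths:
        # Guard against the judge naming a path that was not generated.
        raise KeyError(f"best_path {best!r} is not one of {sorted(paths)}")
    return paths[best]
```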

Setup

pip install fastapi uvicorn keybert sentence-transformers scikit-learn \
            google-generativeai httpx beautifulsoup4 youtube-transcript-api

# Add your Gemini API key in the Python code before starting the server:
# genai.configure(api_key="YOUR_API_KEY")

uvicorn main:app --reload

Status

🚧 This is a working backend — the full pipeline (multi-source ingestion, concept extraction, embeddings, dependency graph construction, path generation, and LLM-based evaluation) is functional and tested.

It does not have a frontend yet :( — I'm actively working on this, and within a few weeks there will be a UI to make this usable end-to-end :)

⭐ Show Some Love

Thanks for stopping by and checking this out! If you found it interesting or useful, a star on the repo would mean a lot — it keeps me motivated to keep building 🚀
