FSC Classifier

You give it a company name, a website, or some PDFs — it figures out which FSC codes that company falls under.

Under the hood it crawls the website, pulls text out of documents, runs everything through embeddings + vector search against ~498 pre-seeded FSC codes, then uses GPT-4o to pick the best matches and explain why.

Architecture

┌──────────────┐     ┌──────────────┐     ┌─────────────────────────────────┐
│   Next.js    │────▶│   NestJS     │────▶│   Trigger.dev Background Jobs   │
│   Frontend   │◀────│   API        │     │                                 │
└──────────────┘     └──────────────┘     │  1. Crawl site (axios/cheerio)  │
     React 19          Port 3001          │  2. Parse PDFs (pdf-parse/OCR)  │
     TanStack Query    File uploads       │  3. Extract company summary     │
     shadcn/ui         CORS + validation  │  4. Embed → pgvector search     │
                                          │  5. GPT-4o rerank → top codes   │
                                          └───────────┬─────────────────────┘
                                                      │
                                                      ▼
                                          ┌─────────────────────────┐
                                          │  PostgreSQL + pgvector   │
                                          │  498 FSC codes w/        │
                                          │  OpenAI embeddings       │
                                          └─────────────────────────┘

What it does

You submit a company with a name, a website URL, and/or uploaded PDFs or images
Crawls the website (axios + cheerio), focuses on about/products/services pages, picks up PDFs it finds along the way
Extracts text from documents — tries pdf-parse first, falls back to GPT-4o vision for scanned/image-based content
Embeds everything with text-embedding-3-small, searches pgvector for the closest FSC codes
GPT-4o reranks the top candidates down to 5–10 results, each with a confidence score and written reasoning
If a very similar company was already classified (>95% cosine similarity), it just reuses those codes instead of burning more API calls
The frontend polls for status updates so you can watch it go through each step in real time

Each code comes with a confidence bar, and you can expand the AI's reasoning.
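To make the result shape concrete, here is a minimal sketch of how reranked output could be cleaned up before it is shown. The `RankedCode` fields, the 0 to 1 confidence scale, and the `finalizeRanking` helper are assumptions for illustration, not the project's actual types.

```typescript
// Assumed shape of one reranked result (hypothetical field names).
interface RankedCode {
  code: string;        // FSC code, e.g. "3416"
  confidence: number;  // assumed to be on a 0..1 scale
  reasoning: string;   // the model's written explanation
}

// Keep only codes that were among the vector-search candidates, drop
// duplicates, clamp confidence into 0..1, sort best-first, cap the list.
function finalizeRanking(
  candidates: string[],
  ranked: RankedCode[],
  maxResults = 10,
): RankedCode[] {
  const allowed = new Set(candidates);
  const seen = new Set<string>();
  const kept: RankedCode[] = [];
  for (const r of ranked) {
    if (!allowed.has(r.code) || seen.has(r.code)) continue;
    seen.add(r.code);
    kept.push({ ...r, confidence: Math.min(1, Math.max(0, r.confidence)) });
  }
  kept.sort((a, b) => b.confidence - a.confidence);
  return kept.slice(0, maxResults);
}
```

The cap of 10 mirrors the "5–10 results" range above. The lower bound depends on what the model returns, so the sketch does not try to enforce it.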

Tech Stack

Layer	Tech
Frontend	Next.js 16, React 19, TailwindCSS v4, shadcn/ui, TanStack Query
API	NestJS 11, class-validator, multer
Background Jobs	Trigger.dev v4.4.1
Database	PostgreSQL + Prisma + pgvector
AI	GPT-4o (reranking + OCR vision), text-embedding-3-small (embeddings)
Crawling	axios + cheerio
PDF parsing	pdf-parse, GPT-4o vision (OCR fallback)
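The "fast parser first, OCR fallback" idea can be sketched as below. The README does not say what triggers the fallback, so the character threshold (`MIN_CHARS`) is an assumption, and both extractors are passed in as plain functions rather than calling pdf-parse or the OpenAI API directly.

```typescript
// An extractor takes raw file bytes and resolves to extracted text.
type Extractor = (file: Uint8Array) => Promise<string>;

// Assumed heuristic: very little non-whitespace text suggests a scanned PDF.
const MIN_CHARS = 50;

function looksLikeScan(text: string, minChars = MIN_CHARS): boolean {
  return text.replace(/\s+/g, "").length < minChars;
}

// Try the cheap text-layer parser first; fall back to OCR if it throws
// or returns too little text to be useful.
async function extractText(
  file: Uint8Array,
  fastParse: Extractor,
  ocrFallback: Extractor,
): Promise<{ text: string; method: "fast" | "ocr" }> {
  try {
    const text = await fastParse(file);
    if (!looksLikeScan(text)) return { text, method: "fast" };
  } catch {
    // Parser failure is treated the same as an empty result.
  }
  return { text: await ocrFallback(file), method: "ocr" };
}
```

Injecting the extractors keeps the decision logic testable without network calls or real PDFs.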

Project Structure

fsc-classifier/
├── apps/
│   ├── api/            # NestJS REST API
│   ├── frontend/       # Next.js App Router
│   └── trigger/        # Trigger.dev background tasks
├── packages/
│   ├── database/       # Prisma schema + pgvector helpers
│   ├── openai/         # OpenAI client singleton
│   └── shared/         # Shared TypeScript types
└── data/               # FSC classification source PDF

The pipeline, step by step

Company Input (name + URL + documents)
        │
        ▼
┌─── Crawl Website ───────────────────────────┐
│  axios + cheerio                             │
│  → Homepage + priority subpages (max 8)      │
│  → Auto-detect & download PDFs from site     │
└──────────────────────────────────────────────┘
        │
        ▼
┌─── Parse Documents ──────────────────────────────┐
│  pdf-parse (fast) → GPT-4o vision (OCR fallback) │
│  → Extract text from uploaded PDFs/images         │
└───────────────────────────────────────────────────┘
        │
        ▼
┌─── Classify ───────────────────────────────────────────┐
│  1. Aggregate all text (cap at 30K chars)               │
│  2. GPT-4o extracts a structured company summary        │
│  3. text-embedding-3-small embeds the summary           │
│  4. pgvector cosine search → top 20 FSC candidates      │
│  5. Cache check: similar company already done? reuse it │
│  6. GPT-4o reranks to top 5-10 with confidence + reason │
└─────────────────────────────────────────────────────────┘
        │
        ▼
  Final FSC codes (ranked, with confidence & reasoning)
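Step 1 of the Classify box (aggregate and cap at 30K characters) might look roughly like this. The function name, the whitespace normalization, and the order in which sources are passed are assumptions; only the 30K cap comes from the diagram.

```typescript
const MAX_CHARS = 30_000;

// Join text sources in order, collapsing whitespace, and stop once the
// combined length (including separators) reaches the cap.
function aggregateText(sources: string[], maxChars = MAX_CHARS): string {
  const parts: string[] = [];
  let remaining = maxChars;
  for (const raw of sources) {
    const text = raw.replace(/\s+/g, " ").trim();
    if (text.length === 0) continue;
    const sep = parts.length > 0 ? "\n\n" : "";
    if (remaining <= sep.length) break;
    const chunk = text.slice(0, remaining - sep.length);
    parts.push(chunk);
    remaining -= sep.length + chunk.length;
  }
  return parts.join("\n\n");
}
```

Because earlier sources win when the cap is hit, whatever is passed first (for example the homepage text) is the least likely to be truncated.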

Trigger.dev Task Flow

Six tasks in total, with classify-company orchestrating the other five. The flow depends on what input you provide:

With website URL:

classify-company
│
├─ [CRAWLING] crawl-website
│  ├─ fetch-page (homepage)
│  ├─ fetch-page (batch: /about, /products, /services... up to 7)
│  └─ detect-and-fetch-pdfs (if PDF links found on site)
│     └─ parse-document (batch: each discovered PDF)
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit (>0.95 similarity)? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes
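For the crawl branch, a rough sketch of picking priority subpages from the links found on the homepage. The keyword list and the helper name are assumptions (the README only names /about, /products and /services as examples); the limit of 7 matches the tree above.

```typescript
// Assumed keyword list; the real crawler's list may be longer.
const PRIORITY_KEYWORDS = ["about", "products", "services"];
const MAX_SUBPAGES = 7;

// From raw href values, keep same-origin, non-PDF pages whose path
// contains a priority keyword. Deduplicates and caps the result.
function pickPrioritySubpages(
  homepage: string,
  hrefs: string[],
  max = MAX_SUBPAGES,
): string[] {
  const base = new URL(homepage);
  const seen = new Set<string>();
  const picked: string[] = [];
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, base); // resolves relative links
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.origin !== base.origin) continue;
    const path = url.pathname.toLowerCase();
    if (path.endsWith(".pdf")) continue; // PDFs are handled separately
    if (!PRIORITY_KEYWORDS.some((k) => path.includes(k))) continue;
    const key = url.origin + url.pathname; // ignore query and hash
    if (seen.has(key)) continue;
    seen.add(key);
    picked.push(key);
    if (picked.length >= max) break;
  }
  return picked;
}
```

With cheerio, the `hrefs` array would typically come from the `href` attributes of the homepage's anchor tags.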

With uploaded documents only (no URL):

classify-company
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes
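The cache check in both flows comes down to a cosine-similarity threshold. The project runs its similarity search in pgvector; the in-memory version below is only a sketch of the threshold logic, and the `CachedCompany` shape and function names are assumptions.

```typescript
// Hypothetical record for a company that was already classified.
interface CachedCompany {
  name: string;
  embedding: number[];
  fscCodes: string[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const CACHE_THRESHOLD = 0.95;

// Return the codes of the most similar cached company if its similarity
// exceeds the threshold, otherwise null (a cache miss).
function findCachedCodes(
  embedding: number[],
  cache: CachedCompany[],
  threshold = CACHE_THRESHOLD,
): string[] | null {
  let best: CachedCompany | null = null;
  let bestScore = threshold;
  for (const company of cache) {
    const score = cosineSimilarity(embedding, company.embedding);
    if (score > bestScore) {
      best = company;
      bestScore = score;
    }
  }
  return best ? best.fscCodes : null;
}
```

A `null` result corresponds to the "cache miss" branch, where the GPT-4o rerank runs as usual.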

The crawling step is skipped entirely when there's no URL — the status goes straight from PENDING to PARSING. Subpages and PDFs are fetched in parallel batches. If any step fails, the job is marked FAILED and Trigger.dev retries (1-3x depending on the task).
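The stage-skipping rule can be written down in a few lines. This is a sketch under assumptions: the input field name is hypothetical, and PARSING is always included because the README only says that CRAWLING is skipped.

```typescript
// Status values named in this README (after PENDING).
type Stage = "CRAWLING" | "PARSING" | "CLASSIFYING";

// Decide which stages a job goes through after PENDING.
// `websiteUrl` is a hypothetical field name for the submitted URL.
function planStages(input: { websiteUrl?: string }): Stage[] {
  const stages: Stage[] = [];
  if (input.websiteUrl !== undefined && input.websiteUrl.trim() !== "") {
    stages.push("CRAWLING");
  }
  stages.push("PARSING", "CLASSIFYING");
  return stages;
}
```

For example, `planStages({})` yields PARSING then CLASSIFYING, which matches the documents-only flow above.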

Running locally

You'll need: Node.js 20+, pnpm, PostgreSQL with pgvector, and API keys for OpenAI and Trigger.dev.

pnpm install

cp .env.example .env
# fill in DATABASE_URL, OPENAI_API_KEY, TRIGGER_SECRET_KEY

pnpm db:generate
pnpm db:migrate

# seeds ~498 FSC codes with embeddings, takes about 2 min
pnpm --filter @fsc-c/db seed:fsc

pnpm dev

That gives you:

Frontend at http://localhost:3000
API at http://localhost:3001
Trigger.dev dev server running tasks locally

API

Method	Path	What it does
`POST`	`/classify`	Submit a company (multipart/form-data)
`GET`	`/classify/:id`	Poll status + get results
`GET`	`/classify`	List all companies
`GET`	`/classify/search?fscCode=3416`	Find companies by FSC code

All responses: { success: boolean, data?: T, error?: string }
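On the client side, the response envelope could be typed and unwrapped along these lines. The `unwrap` and `searchPath` helpers are illustrative names, not part of the API.

```typescript
// The envelope described above.
interface ApiResponse<T> {
  success: boolean;
  data?: T;
  error?: string;
}

// Return the payload, or throw with the server's error message.
function unwrap<T>(res: ApiResponse<T>): T {
  if (!res.success || res.data === undefined) {
    throw new Error(res.error ?? "Request failed without an error message");
  }
  return res.data;
}

// Build the search path for a given FSC code.
function searchPath(fscCode: string): string {
  return `/classify/search?fscCode=${encodeURIComponent(fscCode)}`;
}
```

A caller would typically parse the JSON body of a response from `http://localhost:3001` and pass it through `unwrap` before using the data.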

Next Steps

Things I'd want to tackle next if I keep building on this:

Smarter crawling — Right now the crawler follows a hardcoded list of priority subpages (/about, /products, /services, etc.). I'd like to replace that with AI-driven link discovery — let the model decide which pages are worth visiting based on context. Looking at Firecrawl as a potential drop-in replacement that handles this out of the box (smart crawling, JS rendering, structured extraction).
E2E tests — No tests yet. Would add Playwright tests covering the full flow: submit a company, watch it process, verify the results page renders correctly. Also API integration tests for the classification pipeline with mocked OpenAI responses.
User accounts + OAuth — Currently there's no auth at all — anyone can submit and view everything. Next step would be adding user registration via OAuth (Google, GitHub) so companies and classification history are tied to individual accounts.
Multiple AI provider support — The whole pipeline is hardwired to OpenAI right now (embeddings, reranking, OCR). I'd want to abstract that behind provider adapters so you could swap in Anthropic, Gemini, open-source models, etc. without touching the pipeline logic. Especially useful for the embedding step where there are good cheaper alternatives.

License

MIT
