kolezka/fsc-classifier

FSC Classifier

You give it a company name, a website, or some PDFs, and it figures out which Federal Supply Classification (FSC) codes that company falls under.

Under the hood it crawls the website, pulls text out of documents, runs everything through embeddings + vector search against ~498 pre-seeded FSC codes, then uses GPT-4o to pick the best matches and explain why.

Architecture

┌──────────────┐     ┌──────────────┐     ┌─────────────────────────────────┐
│   Next.js    │────▶│   NestJS     │────▶│   Trigger.dev Background Jobs   │
│   Frontend   │◀────│   API        │     │                                 │
└──────────────┘     └──────────────┘     │  1. Crawl website (axios/cheerio)│
     React 19          Port 3001          │  2. Parse PDFs (pdf-parse/OCR)  │
     TanStack Query    File uploads       │  3. Extract company summary     │
     shadcn/ui         CORS + validation  │  4. Embed → pgvector search     │
                                          │  5. GPT-4o rerank → top codes   │
                                          └───────────┬─────────────────────┘
                                                      │
                                                      ▼
                                          ┌─────────────────────────┐
                                          │  PostgreSQL + pgvector   │
                                          │  498 FSC codes w/        │
                                          │  OpenAI embeddings       │
                                          └─────────────────────────┘

What it does

  • Submit a company with a name, website URL, and/or uploaded PDFs/images
  • Crawls the website (axios + cheerio), focuses on about/products/services pages, picks up PDFs it finds along the way
  • Extracts text from documents — tries pdf-parse first, falls back to GPT-4o vision for scanned/image-based content
  • Embeds everything with text-embedding-3-small, searches pgvector for the closest FSC codes
  • GPT-4o reranks the top candidates down to 5–10 results, each with a confidence score and written reasoning
  • If a very similar company was already classified (>95% cosine similarity), it just reuses those codes instead of burning more API calls
  • The frontend polls for status updates so you can watch it go through each step in real time
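
The cache reuse in the list above boils down to a cosine similarity check against companies that were already classified. Here's a minimal sketch of that idea, not the actual implementation: the `CachedCompany` shape and the `findReusableCodes` helper are hypothetical names, and the real lookup presumably happens in pgvector rather than in memory.

```typescript
// Hypothetical shape of an already-classified company (field names are assumptions).
interface CachedCompany {
  id: string;
  embedding: number[];
  fscCodes: string[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the codes of the most similar cached company if it clears the
// threshold (0.95, per the README), otherwise null to signal a cache miss.
function findReusableCodes(
  embedding: number[],
  cache: CachedCompany[],
  threshold = 0.95,
): string[] | null {
  let best: CachedCompany | null = null;
  let bestScore = -Infinity;
  for (const candidate of cache) {
    const score = cosineSimilarity(embedding, candidate.embedding);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best !== null && bestScore > threshold ? best.fscCodes : null;
}
```

A `null` result means the pipeline would carry on to the GPT-4o rerank; anything else skips it.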

Classification Results

Each code comes with a confidence bar, and you can expand the AI's reasoning.

Tech Stack

| Layer | Tech |
|---|---|
| Frontend | Next.js 16, React 19, TailwindCSS v4, shadcn/ui, TanStack Query |
| API | NestJS 11, class-validator, multer |
| Background Jobs | Trigger.dev v4.4.1 |
| Database | PostgreSQL + Prisma + pgvector |
| AI | GPT-4o (reranking + OCR vision), text-embedding-3-small (embeddings) |
| Crawling | axios + cheerio |
| PDF parsing | pdf-parse, GPT-4o vision (OCR fallback) |

Project Structure

fsc-classifier/
├── apps/
│   ├── api/            # NestJS REST API
│   ├── frontend/       # Next.js App Router
│   └── trigger/        # Trigger.dev background tasks
├── packages/
│   ├── database/       # Prisma schema + pgvector helpers
│   ├── openai/         # OpenAI client singleton
│   └── shared/         # Shared TypeScript types
└── data/               # FSC classification source PDF

The pipeline, step by step

Company Input (name + URL + documents)
        │
        ▼
┌─── Crawl Website ───────────────────────────┐
│  axios + cheerio                             │
│  → Homepage + priority subpages (max 8)      │
│  → Auto-detect & download PDFs from site     │
└──────────────────────────────────────────────┘
        │
        ▼
┌─── Parse Documents ──────────────────────────────┐
│  pdf-parse (fast) → GPT-4o vision (OCR fallback) │
│  → Extract text from uploaded PDFs/images         │
└───────────────────────────────────────────────────┘
        │
        ▼
┌─── Classify ───────────────────────────────────────────┐
│  1. Aggregate all text (cap at 30K chars)               │
│  2. GPT-4o extracts a structured company summary        │
│  3. text-embedding-3-small embeds the summary           │
│  4. pgvector cosine search → top 20 FSC candidates      │
│  5. Cache check: similar company already done? reuse it │
│  6. GPT-4o reranks to top 5-10 with confidence + reason │
└─────────────────────────────────────────────────────────┘
        │
        ▼
  Final FSC codes (ranked, with confidence & reasoning)
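
Steps 1 and 4 of the Classify box can be sketched roughly like this. It's an illustration under assumptions, not the repo's code: the table and column names (`fsc_codes`, `code`, `title`, `embedding`) and all function names are made up. The one thing taken from pgvector itself is `<=>`, its cosine distance operator, so similarity is `1 - distance`.

```typescript
// Cap from the pipeline diagram above.
const MAX_CHARS = 30_000;

// Step 1: join crawled pages and parsed documents, then cap the total length.
function aggregateText(parts: string[], maxChars: number = MAX_CHARS): string {
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join("\n\n")
    .slice(0, maxChars);
}

// pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// Step 4: parameterized query for the closest FSC codes by cosine distance.
// Table and column names here are hypothetical.
function buildCandidateQuery(
  embedding: number[],
  limit = 20,
): { text: string; values: [string, number] } {
  return {
    text:
      "SELECT code, title, 1 - (embedding <=> $1::vector) AS similarity " +
      "FROM fsc_codes ORDER BY embedding <=> $1::vector LIMIT $2",
    values: [toVectorLiteral(embedding), limit],
  };
}
```

The `{ text, values }` shape is what a parameterized-query client like node-postgres takes; with Prisma you'd pass the same SQL through a raw query instead.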

Trigger.dev Task Flow

Six tasks in total: classify-company orchestrates the other five. The flow depends on what input you provide:

With website URL:

classify-company
│
├─ [CRAWLING] crawl-website
│  ├─ fetch-page (homepage)
│  ├─ fetch-page (batch: /about, /products, /services... up to 7)
│  └─ detect-and-fetch-pdfs (if PDF links found on site)
│     └─ parse-document (batch: each discovered PDF)
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit (>0.95 similarity)? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes

With uploaded documents only (no URL):

classify-company
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes

The crawling step is skipped entirely when there's no URL — the status goes straight from PENDING to PARSING. Subpages and PDFs are fetched in parallel batches. If any step fails, the job is marked FAILED and Trigger.dev retries (1-3x depending on the task).
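
Here's a small sketch of how the stage list might be derived from the input. The status names come from the flows above; the `CompanyInput` shape and the `planStages` function are assumptions for illustration, and retries are left out because Trigger.dev handles those.

```typescript
// Processing stages named in the task flow above.
type Stage = "CRAWLING" | "PARSING" | "CLASSIFYING";

// Hypothetical input shape: a name plus an optional URL and uploaded files.
interface CompanyInput {
  name: string;
  websiteUrl?: string;
  uploadedFiles: string[];
}

// Crawling only runs when a URL was provided; otherwise the job goes
// straight from PENDING to PARSING, then CLASSIFYING.
function planStages(input: CompanyInput): Stage[] {
  const hasUrl = input.websiteUrl !== undefined && input.websiteUrl.trim() !== "";
  const stages: Stage[] = [];
  if (hasUrl) stages.push("CRAWLING");
  stages.push("PARSING", "CLASSIFYING");
  return stages;
}
```

For example, `planStages({ name: "Acme", uploadedFiles: ["brochure.pdf"] })` leaves out the crawl stage entirely.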

Running locally

You'll need: Node.js 20+, pnpm, PostgreSQL with pgvector, and API keys for OpenAI and Trigger.dev.

pnpm install

cp .env.example .env
# fill in DATABASE_URL, OPENAI_API_KEY, TRIGGER_SECRET_KEY

pnpm db:generate
pnpm db:migrate

# seeds ~498 FSC codes with embeddings, takes about 2 min
pnpm --filter @fsc-c/db seed:fsc

pnpm dev

That gives you:

Add Company

API

| Method | Path | What it does |
|---|---|---|
| POST | /classify | Submit a company (multipart/form-data) |
| GET | /classify/:id | Poll status + get results |
| GET | /classify | List all companies |
| GET | /classify/search?fscCode=3416 | Find companies by FSC code |

All responses: { success: boolean, data?: T, error?: string }
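
On the client side, the envelope above can be typed and unwrapped like this. Treat it as a sketch: `unwrap` and `pollUntilDone` are hypothetical helpers, and the fetch and sleep functions are injected so nothing here assumes a particular HTTP client or how the frontend actually polls.

```typescript
// Response envelope as documented: { success, data?, error? }.
interface ApiResponse<T> {
  success: boolean;
  data?: T;
  error?: string;
}

// Returns the payload, or throws with the server-provided error message.
function unwrap<T>(res: ApiResponse<T>): T {
  if (!res.success || res.data === undefined) {
    throw new Error(res.error ?? "Request failed");
  }
  return res.data;
}

// Calls fetchOnce (e.g. GET /classify/:id) repeatedly until isDone says stop.
// The interval and attempt limit are arbitrary defaults, not values from the repo.
async function pollUntilDone<T>(
  fetchOnce: () => Promise<ApiResponse<T>>,
  isDone: (data: T) => boolean,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const data = unwrap(await fetchOnce());
    if (isDone(data)) return data;
    await sleep(intervalMs);
  }
  throw new Error("Timed out waiting for classification");
}
```

Throwing inside `unwrap` keeps error handling in one place, so callers only ever deal with a `T` or a caught exception.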

Next Steps

Things I'd want to tackle next if I keep building on this:

  • Smarter crawling — Right now the crawler follows a hardcoded list of priority subpages (/about, /products, /services, etc.). I'd like to replace that with AI-driven link discovery — let the model decide which pages are worth visiting based on context. Looking at Firecrawl as a potential drop-in replacement that handles this out of the box (smart crawling, JS rendering, structured extraction).

  • E2E tests — No tests yet. Would add Playwright tests covering the full flow: submit a company, watch it process, verify the results page renders correctly. Also API integration tests for the classification pipeline with mocked OpenAI responses.

  • User accounts + OAuth — Currently there's no auth at all — anyone can submit and view everything. Next step would be adding user registration via OAuth (Google, GitHub) so companies and classification history are tied to individual accounts.

  • Multiple AI provider support — The whole pipeline is hardwired to OpenAI right now (embeddings, reranking, OCR). I'd want to abstract that behind provider adapters so you could swap in Anthropic, Gemini, open-source models, etc. without touching the pipeline logic. Especially useful for the embedding step where there are good cheaper alternatives.
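
To make the provider adapter idea a bit more concrete, here's one way it could look. Nothing like this exists in the repo yet: the `EmbeddingProvider` interface, the `ProviderRegistry` class, and the fake provider are all hypothetical, and real adapters would wrap the OpenAI, Anthropic, or Gemini SDKs behind the same interface.

```typescript
// Hypothetical adapter contract for the embedding step.
interface EmbeddingProvider {
  readonly name: string;
  readonly dimensions: number;
  embed(text: string): Promise<number[]>;
}

// Small registry so the pipeline can look a provider up by name from config.
class ProviderRegistry<P extends { name: string }> {
  private readonly providers = new Map<string, P>();

  register(provider: P): this {
    this.providers.set(provider.name, provider);
    return this;
  }

  get(name: string): P {
    const provider = this.providers.get(name);
    if (provider === undefined) throw new Error(`Unknown provider: ${name}`);
    return provider;
  }
}

// Deterministic stand-in, useful for tests that should not call a real API.
const fakeEmbedder: EmbeddingProvider = {
  name: "fake",
  dimensions: 3,
  async embed(text: string): Promise<number[]> {
    return [text.length, 0, 1];
  },
};

const embedders = new ProviderRegistry<EmbeddingProvider>().register(fakeEmbedder);
```

One catch with swapping embedding providers: the stored FSC code embeddings would need re-seeding, since vectors from different models aren't comparable.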

License

MIT

About

AI system to classify companies into Federal Supply Classification (FSC) codes using web crawling, PDF parsing, OpenAI embeddings, pgvector search, and GPT-4o reranking
