Skip to content

LLMSystems/SEC-10-K-Structured-Extraction-Web

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SEC 10-K Structured Extraction

Turn raw SEC EDGAR filings into clean structured JSON and readable Markdown

English | 中文

License: MIT Python Node

output.mp4

SEC 10-K filings are notoriously hard to work with. Raw EDGAR documents are inconsistent HTML, Item boundaries vary across filers, and reconstructing a balance sheet from XBRL means navigating three separate linkbase files just to get a number with a label.

This project provides a parsing pipeline + web UI that handles all of that. Submit any 10-K filing URL and get back structured JSON with every Item labeled and extracted — or render Item 8 financial statements directly to Markdown.

Features

  • Full Item extraction — splits all Parts (I / II / III / IV) and labels each Item as extracted, incorporated_by_reference, not_applicable, reserved, or missing
  • XBRL financial reconstruction — parses Instance + Presentation + Label linkbases to rebuild Item 8 tables (Income Statement, Balance Sheet, Cash Flow, etc.)
  • Markdown rendering — outputs clean, readable Markdown including numeric footnotes and text disclosures
  • Async job queue — submit a filing and get a job_id immediately; poll for results when processing finishes
  • Caching — same filing processed only once, keyed by accession_number
  • Dual input modes — accepts cik + accession_number or a direct EDGAR URL
  • Admin panel — job health dashboard, flag analytics, per-parser performance, and an item detail drawer

Quick Start

Prerequisites: Python ≥ 3.10, Node.js ≥ 18

# Backend
cd api
pip install -r requirements.txt
uvicorn main:app --reload
# → http://localhost:8000  (interactive API docs at /docs)

# Frontend (new terminal)
cd frontend
npm install
npm run dev
# → http://localhost:5173

Environment variables

api/.env

DB_PATH=./data/sec_extraction.db
CORS_ORIGINS=http://localhost:5173

frontend/.env

VITE_API_BASE_URL=http://localhost:8000

Usage

Submit any filing via the web UI, or call the API directly:

# Submit a parsing job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"cik":"0000320193","accession_number":"0000320193-23-000106"}'
# → { "job_id": "...", "status": "pending", "cache_hit": false }

Or use the parsing module directly in Python:

from api.sec_10k_pipeline.pipeline import Pipeline
from api.sec_10k_pipeline.models import FilingInput

pipeline = Pipeline()

# Option 1: CIK + Accession Number
result = pipeline.run(FilingInput(
    cik="0000320193",
    accession_number="0000320193-23-000106",
))

# Option 2: Direct URL
result = pipeline.run(FilingInput(
    url="https://www.sec.gov/Archives/edgar/data/.../filing.htm",
))

# Save results (JSON + Markdown)
result = pipeline.run(input, save_to="output/")

Full API reference: docs/api.md

Tech Stack

Backend: FastAPI · SQLite (aiosqlite) · lxml / BeautifulSoup · Pydantic · asyncio

Frontend: Vue 3 · TypeScript · Vite · Pinia · shadcn-vue · Tailwind CSS v4

Project Layout

api/                  # FastAPI backend + parsing pipeline
├── sec_10k_pipeline/ # Core engine (regex, LLM-assisted, XBRL parsing)
└── ...
frontend/             # Vue 3 SPA
docs/                 # API reference, architecture notes, validator rules

Contributing

Issues and pull requests are welcome. If you find a filing that parses incorrectly, or a feature you'd like to see, opening an issue is the best place to start — edge cases from real filings are especially useful.

  1. Fork the repo and create a branch
  2. Make your change with a clear commit message
  3. Open a PR describing what changed and why

License

MIT — see LICENSE for details.

About

A web that converts SEC 10-K filings into structured data and human-readable Markdown

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors