SEC 10-K Structured Extraction Tool

Parses SEC EDGAR Form 10-K annual reports into standardized JSON. Automatically identifies the content and status of every Item — zero LLM cost, average latency < 1 second.

English | 中文

Features

Full Item coverage: Parses all 16 Items across Part I–IV, outputting one of five statuses: extracted / incorporated_by_reference / not_applicable / reserved / missing
Zero LLM cost: Purely rule-based pipeline — parsing itself takes ~0.03s with no API calls
iXBRL support: Handles the iXBRL format mandatory for large filers since 2019
XBRL financial extraction: Optionally reconstructs Item 8 primary financial statements directly from XBRL (income statement, balance sheet, cash flow statement, etc.)
Sync / async API: Both Pipeline and AsyncPipeline are available

Quick Start

Installation

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Usage

Synchronous

from src.pipeline import Pipeline
from src.models import FilingInput

pipeline = Pipeline()

# Option 1: CIK + Accession Number
result = pipeline.run(FilingInput(
    cik="0000320193",
    accession_number="0000320193-23-000106",
))

# Option 2: Direct URL
result = pipeline.run(FilingInput(
    url="https://www.sec.gov/Archives/edgar/data/.../filing.htm",
))

# Save results (JSON + Markdown)
filing_input = FilingInput(
    cik="0000320193",
    accession_number="0000320193-23-000106",
)
result = pipeline.run(filing_input, save_to="output/")

Async

from src.async_pipeline import AsyncPipeline
from src.models import FilingInput
import asyncio

async def main():
    pipeline = AsyncPipeline()
    result = await pipeline.run_async(FilingInput(
        cik="0000320193",
        accession_number="0000320193-23-000106",
    ))
    return result  # hand the parsed filing back to the caller

result = asyncio.run(main())

Output Format

{
  "filing_info": {
    "cik": "0000320193",
    "accession_number": "0000320193-23-000106",
    "company_name": "Apple Inc.",
    "fiscal_year_end": "2023-09-30"
  },
  "items": [
    {
      "part": "Part I",
      "item_number": "1",
      "item_title": "Business",
      "content_text": "Apple Inc. designs, manufactures...",
      "char_range": [1024, 45231],
      "status": "extracted"
    },
    {
      "part": "Part III",
      "item_number": "10",
      "item_title": "Directors, Executive Officers and Corporate Governance",
      "content_text": null,
      "char_range": null,
      "status": "incorporated_by_reference"
    }
  ]
}
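As a rough illustration of consuming this output, the sketch below groups items by status and derives a length from char_range. It relies only on the fields shown in the sample above. The helper names (items_by_status, extracted_lengths) are hypothetical, and treating char_range as a [start, end] offset pair is an assumption.

```python
import json
from collections import defaultdict

# Abbreviated sample in the shape shown above; a real result has more items.
SAMPLE = """
{
  "filing_info": {"cik": "0000320193", "accession_number": "0000320193-23-000106"},
  "items": [
    {"part": "Part I", "item_number": "1", "item_title": "Business",
     "content_text": "Apple Inc. designs, manufactures...",
     "char_range": [1024, 45231], "status": "extracted"},
    {"part": "Part III", "item_number": "10",
     "item_title": "Directors, Executive Officers and Corporate Governance",
     "content_text": null, "char_range": null,
     "status": "incorporated_by_reference"}
  ]
}
"""


def items_by_status(result: dict) -> dict[str, list[str]]:
    """Map each status to the item numbers that carry it."""
    grouped = defaultdict(list)
    for item in result["items"]:
        grouped[item["status"]].append(item["item_number"])
    return dict(grouped)


def extracted_lengths(result: dict) -> dict[str, int]:
    """Length implied by char_range, for items that have one (assumed start/end offsets)."""
    lengths = {}
    for item in result["items"]:
        if item["char_range"] is not None:
            start, end = item["char_range"]
            lengths[item["item_number"]] = end - start
    return lengths


result = json.loads(SAMPLE)
print(items_by_status(result))
print(extracted_lengths(result))
```

Items whose status is not extracted carry null for both content_text and char_range in the sample, so downstream code should check for None before slicing or measuring.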

How It Works

A purely rule-based pipeline — SEC regulations mandate Item numbering and ordering, giving the parser reliable anchors regardless of visual formatting variation:

Input (CIK + Accession Number or direct URL)
  ↓ fetch       Retrieve metadata and HTML from the SEC EDGAR API
  ↓ preprocess  HTML → plain text (handle iXBRL, tables, hyphenation)
  ↓ parse       RegexParser locates the start and end position of each Item
  ↓ postprocess Classify the status of each Item
Output (standardized JSON)
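The parse step can be pictured with a minimal sketch like the one below. It is not the project's RegexParser: the heading pattern is a simplified, hypothetical stand-in for what lives in src/patterns.py, and locate_items is an illustrative name. The idea shown is only that each Item starts at its heading and ends where the next heading begins.

```python
import re

# Hypothetical, simplified heading pattern (an assumption, not the project's
# real patterns): "Item", a number with an optional A-C suffix, then "." or ":".
ITEM_HEADING = re.compile(
    r"^[ \t]*Item\s+(\d{1,2}[A-C]?)\s*[.:]",
    re.IGNORECASE | re.MULTILINE,
)


def locate_items(text: str) -> dict[str, tuple[int, int]]:
    """Return {item_number: (start, end)} character offsets into text.

    Each item ends where the next heading starts; the last item runs to the
    end of the text. If a heading repeats (for example in a table of
    contents), the later occurrence overwrites the earlier one.
    """
    matches = list(ITEM_HEADING.finditer(text))
    spans = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        spans[match.group(1).upper()] = (match.start(), end)
    return spans


text = (
    "PART I\n"
    "Item 1. Business\n"
    "We make widgets.\n"
    "Item 1A. Risk Factors\n"
    "Demand may fall.\n"
    "Item 2. Properties\n"
    "We lease one office.\n"
)
spans = locate_items(text)
start, end = spans["1A"]
print(text[start:end])
```

A real parser has to cope with tables of contents, cross-references such as "see Item 7", and headings split across lines, which this sketch deliberately ignores.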

Architecture overview:

src/
├── pipeline.py              Main pipeline
├── async_pipeline.py        Async version
├── models.py                Data structures
├── patterns.py              Regex pattern definitions
├── postprocessor.py         Item status classification
├── item8_xbrl_facts.py      XBRL financial extraction
├── render_item8_markdown.py XBRL result renderer
├── parsers/
│   ├── regex_parser.py      Main parser
│   └── hybrid.py            Dispatcher (supports future LLM fallback)
└── eval/
    ├── metrics.py           Evaluation script
    └── runner.py            Batch evaluation entry point
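To make the status classification step more concrete, here is a hedged sketch of how the five statuses might be assigned from an Item's body text. The cue phrases, the SHORT_BODY cutoff, and the classify function are all assumptions for illustration. The actual rules in src/postprocessor.py are not described in this README.

```python
import re
from typing import Optional

# Hypothetical cue phrases and cutoff; the real rules may differ.
BY_REFERENCE = re.compile(r"incorporated\s+(?:herein\s+)?by\s+reference", re.IGNORECASE)
NOT_APPLICABLE = re.compile(r"^\W*(?:not\s+applicable|none)\W*$", re.IGNORECASE)
RESERVED = re.compile(r"^\W*reserved\W*$", re.IGNORECASE)
SHORT_BODY = 500  # assumed: only short bodies count as pure cross-references


def classify(body: Optional[str]) -> str:
    """Assign one of the five statuses to the text that follows an Item heading."""
    if body is None or not body.strip():
        return "missing"  # heading not found, or nothing under it
    stripped = body.strip()
    if RESERVED.match(stripped):
        return "reserved"
    if NOT_APPLICABLE.match(stripped):
        return "not_applicable"
    if len(stripped) < SHORT_BODY and BY_REFERENCE.search(stripped):
        return "incorporated_by_reference"
    return "extracted"


print(classify("[Reserved]"))
print(classify("Not applicable."))
```

The length cutoff reflects one possible design choice: a long section that merely mentions incorporation by reference somewhere in its text should still count as extracted.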

Evaluation

Benchmarked on 35 filings from 12 companies (2016–2026), with manually annotated ground truth:

Metric                      Value
Status Accuracy             100.0% (788 / 788 items)
Critical Regressions        0
Content Length Normal Rate  99.0% (484 / 489 extracted items)
Head/Tail Match Pass Rate   Head 99.8% / Tail 100.0%
Average Latency             0.687 s
LLM Cost                    $0

In a separate large-scale test on 507 randomly sampled filings, the structural error rate fell by 95.5% relative to the initial version.

Ground truth was manually annotated using SEC-10-K-Annotation-Tool. Full report: eval_datasets/results/.
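The README does not define the metrics precisely, so the following is only one plausible reading: status accuracy as the share of ground-truth items whose predicted status matches, and head/tail match as equality of the first and last n whitespace-normalized characters. The function names and the default n = 50 are assumptions, not the definitions used in src/eval/metrics.py.

```python
def _normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not affect comparison."""
    return " ".join(text.split())


def status_accuracy(pred: dict[str, str], truth: dict[str, str]) -> float:
    """Fraction of ground-truth items whose predicted status matches."""
    hits = sum(pred.get(item) == status for item, status in truth.items())
    return hits / len(truth)


def head_tail_match(pred_text: str, truth_text: str, n: int = 50) -> tuple[bool, bool]:
    """Compare the first and last n normalized characters of two texts."""
    pred_norm, truth_norm = _normalize(pred_text), _normalize(truth_text)
    return pred_norm[:n] == truth_norm[:n], pred_norm[-n:] == truth_norm[-n:]


truth = {"1": "extracted", "6": "reserved", "10": "incorporated_by_reference"}
pred = {"1": "extracted", "6": "not_applicable", "10": "incorporated_by_reference"}
print(status_accuracy(pred, truth))
print(head_tail_match("Item 1. Business\nWe make widgets.",
                      "Item 1.  Business We make widgets.", n=10))
```

Checking only the head and tail keeps the comparison cheap while still catching the most common boundary errors, such as an Item that starts too late or runs into the next one.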

Running Evaluation

python -m src.eval.metrics \
    --ground-truth eval_datasets/ground_truth \
    --output eval_datasets/results

XBRL Financial Extraction (Item 8)

Reconstructs Item 8 primary financial statements directly from XBRL, outputting a multi-period Markdown report.

from src.item8_xbrl_facts import get_item8_xbrl_facts
from src.render_item8_markdown import write_item8_markdown

payload = get_item8_xbrl_facts("0000019617", "0001628280-26-008131")
write_item8_markdown(payload, "output_item8.md")

Parses four XBRL source files (Instance Document, Presentation Linkbase, Label Linkbase, Schema) and classifies results into three blocks: main statements, numeric disclosures, and text disclosures.
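As a simplified illustration of the numeric versus text split, the sketch below reads facts from a tiny, made-up XBRL instance using only the standard library. It covers the instance document alone (no linkbases or schema), the sample values and the ex: taxonomy are invented, and split_facts is a hypothetical helper rather than part of src/item8_xbrl_facts.py. It relies on one XBRL convention: numeric facts carry a unitRef attribute, while text facts do not.

```python
import xml.etree.ElementTree as ET

# Made-up instance document; element names and values are illustrative only.
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:ex="http://example.com/taxonomy">
  <context id="FY">
    <entity><identifier scheme="http://www.sec.gov/CIK">0000000000</identifier></entity>
    <period><startDate>2023-01-01</startDate><endDate>2023-12-31</endDate></period>
  </context>
  <unit id="USD"><measure xmlns:iso4217="http://www.xbrl.org/2003/iso4217">iso4217:USD</measure></unit>
  <ex:Revenues contextRef="FY" unitRef="USD" decimals="0">1000</ex:Revenues>
  <ex:NetIncomeLoss contextRef="FY" unitRef="USD" decimals="0">250</ex:NetIncomeLoss>
  <ex:AccountingPolicyNote contextRef="FY">Illustrative policy text.</ex:AccountingPolicyNote>
</xbrl>"""


def split_facts(instance_xml: str) -> tuple[dict[str, float], dict[str, str]]:
    """Split top-level facts into numeric (has unitRef) and text (no unitRef)."""
    root = ET.fromstring(instance_xml)
    numeric: dict[str, float] = {}
    text: dict[str, str] = {}
    for element in root:
        if "contextRef" not in element.attrib:
            continue  # skip context, unit, and other non-fact elements
        name = element.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        if "unitRef" in element.attrib:
            numeric[name] = float(element.text)
        else:
            text[name] = (element.text or "").strip()
    return numeric, text


numeric, text = split_facts(SAMPLE)
print(numeric)
print(text)
```

Separating main statements from other numeric disclosures would additionally need the Presentation Linkbase, which this sketch does not attempt.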

Known Limitations

Supports HTML-format filings from approximately 2000 onward. SGML / plain-text formats from before 1996 have no HTML structure and cannot be processed.

Contributing

PRs and Issues are welcome. If you encounter a filing that fails to parse, please open an Issue with the CIK and Accession Number.
