🐶 DOGE Data Challenge 🚀

🏛️ A data-driven look into U.S. federal regulations using the eCFR API — exploring word counts, trends over time, and custom metrics like regulatory density to inform smarter de-regulation strategies

📜 Built to support data transparency and government-wide efficiency efforts by analyzing the Code of Federal Regulations (CFR), unleashing prosperity through de-regulation

✒️ Hamilton had his pen; I have my keyboard. Using data to untangle the regulatory state, one agency at a time

🎯 Project Purpose

This technical assessment explores how to better understand and visualize the scale and complexity of U.S. federal regulations:

The eCFR contains over 200,000 pages of regulatory text across ~150 agencies
The data is publicly accessible through an official API
The goal: build a tool to parse and analyze regulation data for actionable insights

Overview

The DOGE Data Challenge processes regulation XML snapshots to map agencies, extract text, and produce analytical reports. Key features include:

Dynamic Configuration: Paths and settings are managed via a .env file, with defaults set by bootstrap.py
Modular Utilities: Helper functions (e.g., path management, text processing) are packaged with Poetry for reusability
Notebook Pipeline: Several Jupyter notebooks handle data ingestion, processing, analysis, and visualization
Testing: Unit tests ensure reliability of core utilities
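
As a rough illustration of the text-extraction step, the sketch below pulls plain text out of a small XML snippet using only Python's standard library. The sample XML and the function name extract_text are assumptions made for this example; the project's notebooks may parse the eCFR XML differently.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string: str) -> str:
    """Return all text in an XML document as one whitespace-normalized string."""
    root = ET.fromstring(xml_string.strip())
    # itertext() yields each element's text and tail in document order.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# Hypothetical snippet; real eCFR XML is far larger and may use other tags.
sample = """
<DIV5 N="1" TYPE="PART">
  <HEAD>PART 1 - DEFINITIONS</HEAD>
  <P>As used in this chapter, <I>agency</I> means any executive department.</P>
</DIV5>
"""

text = extract_text(sample)
print(text)
```

Joining the pieces with single spaces keeps words from adjacent elements apart, which matters once the text is passed to a word counter.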

⚡ Quick Setup

1️⃣ Install Poetry:

pip install poetry

Manual Path Setup: If ~/.local/bin isn’t in your PATH, you may need to add it manually (e.g., export PATH="$HOME/.local/bin:$PATH")

2️⃣ Clone the repository:

git clone https://github.com/bkaewell/doge-data-challenge.git
cd doge-data-challenge

3️⃣ Install dependencies:

poetry install

Verify that pyproject.toml lists all required dependencies (pandas, matplotlib, etc.). If any are missing, add them with Poetry, for example: poetry add python-dotenv jupyter. This project uses Poetry to manage dependencies and to package doge_data_challenge/helpers/.

4️⃣ Bootstrap project:

poetry run python bootstrap.py

This creates a .env file if one doesn't exist and sets up the data folders used by the pipeline.

5️⃣ Run notebooks:

poetry run jupyter-notebook

Usage

Configure .env (copy .env.example and edit) to set SNAPSHOT_DATE, ARCHIVE_DIR, etc.
Run notebooks in order (01_agency_scraper.ipynb to 04_visualization_and_reporting.ipynb).
Each notebook uses init_notebook() to load paths:

from doge_data_challenge.helpers import init_notebook
paths = init_notebook()

🔢 Word Count Methods

The WORDCOUNT_METHOD defined in your .env controls how regulation text is parsed and counted. This ensures transparency and consistency across analyses — especially when comparing different agencies or dates.

Method	Description
`split`	Simple `text.split()` on whitespace; fast, but may over- or under-count
`regex`	Uses `\b\w+\b` to match word-like tokens; closer to the Google Docs word count
`legal`	Placeholder for stricter rules (e.g., excluding citations, headers, numbers)
`nlp`	Placeholder for future spaCy/NLTK-style tokenization
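
To make the difference between the two implemented methods concrete, here is a minimal sketch. The function name count_words and the sample sentence are assumptions for illustration and are not necessarily how wordcount.py is written; the `legal` and `nlp` methods are placeholders in the table, so they are left unimplemented.

```python
import re

WORD_PATTERN = re.compile(r"\b\w+\b")


def count_words(text: str, method: str = "regex") -> int:
    """Count words using one of the strategies named in the table."""
    if method == "split":
        # Whitespace split: "1026.4(b)" and "non-exempt" each count once.
        return len(text.split())
    if method == "regex":
        # Word-boundary match: punctuation splits tokens apart.
        return len(WORD_PATTERN.findall(text))
    raise NotImplementedError(f"Word count method not implemented: {method}")


sample = "See 12 CFR 1026.4(b) for non-exempt fees."
split_count = count_words(sample, "split")   # 7
regex_count = count_words(sample, "regex")   # 10
print(split_count, regex_count)
```

The two counts diverge on citations and hyphenated terms, which is why the same method should be used when comparing agencies or snapshot dates.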

📂 Repository Overview

doge-data-challenge/
├── README.md               # Documentation (this file)
├── bootstrap.py            # Validates .env config and sets up workspace for data pipeline
├── doge_data_challenge/
│   ├── __init__.py         # Marks doge_data_challenge as a Python package
│   └── helpers/            # Reusable utility functions
│       ├── __init__.py                # Exposes helper functions for import
│       ├── env_setup.py               # Sets up project paths and config from .env
│       ├── init_notebook.py           # Initializes notebooks with sys.path and paths
│       ├── print_helpers.py           # Formats and prints directory status messages
│       ├── trim_notebook_outputs.py   # Limits notebook output size for Git
│       └── wordcount.py               # Implements word counting strategies
├── notebooks/                                  # Data pipeline notebooks
│   ├── 01_agency_scraper.ipynb                 # Scrapes, maps, and flattens agency JSON to a dataframe
│   ├── 02_data_download_and_storage.ipynb      # Downloads and caches XMLs
│   ├── 03_text_extraction_and_analysis.ipynb   # Extracts and analyzes text
│   └── 04_visualization_and_reporting.ipynb    # Generates metrics and charts
├── tests/                                      # Unit tests for reliability
│   ├── __init__.py
│   └── test_env_paths.py            # Tests path loading and directory creation
├── .env                             # Configuration file
├── .gitignore                       # Ignores .env, data, checkpoints, and cache files
├── poetry.lock                      # Locks dependency versions (Git-ignored)
├── pyproject.toml                   # Poetry configuration and dependencies
│
│                                    # Below is Git-ignored to keep repo lightweight
│                                    ###############################################
├── {AGENCY_METADATA_DIR}/           # Stores metadata
│   └── {SNAPSHOT_DATE}/             
│       ├── agencies.json            # Top-level JSON from API
│       └── flattened_agencies.csv   # Output from 01_agency_scraper
│   ...
└── {REGULATION_TEXT_DIR}/           # Regulation XMLs from API
    └── {SNAPSHOT_DATE}/
        └── title_N/
            └── chapter_N.xml
    ...                              # Output from 02_data_download_and_storage
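
The snapshot layout in the tree above could be built from the .env settings roughly as follows. This is a sketch under assumptions: the helper names snapshot_paths and chapter_xml_path and the dictionary keys are hypothetical, and the actual return value of init_notebook() may look different.

```python
import os
from datetime import date
from pathlib import Path


def snapshot_paths(env=None):
    """Build the per-snapshot directories from env-style settings."""
    env = os.environ if env is None else env
    snapshot = env.get("SNAPSHOT_DATE", date.today().isoformat())
    metadata_dir = Path(env.get("AGENCY_METADATA_DIR", "agency_metadata")) / snapshot
    text_dir = Path(env.get("REGULATION_TEXT_DIR", "regulation_text")) / snapshot
    return {"agency_metadata": metadata_dir, "regulation_text": text_dir}


def chapter_xml_path(text_dir: Path, title: int, chapter: str) -> Path:
    """Return the title_N/chapter_N.xml location for one chapter."""
    return text_dir / f"title_{title}" / f"chapter_{chapter}.xml"


# Example settings passed as a plain dict instead of reading a real .env file.
settings = {
    "SNAPSHOT_DATE": "2025-03-27",
    "AGENCY_METADATA_DIR": "agency_metadata",
    "REGULATION_TEXT_DIR": "regulation_text",
}
paths = snapshot_paths(settings)
xml_path = chapter_xml_path(paths["regulation_text"], 12, "X")
print(xml_path.as_posix())  # regulation_text/2025-03-27/title_12/chapter_X.xml
```

Keying every directory on SNAPSHOT_DATE keeps downloads from different dates side by side, so comparisons over time do not overwrite earlier data.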

📚 Additional Documentation

TBD

🤝 Contributing & Contact

🎯 Looking to contribute? Open an issue or fork the repo!
👨‍💻 Author: Brian Kaewell
📧 Contact: Please open an issue on the GitHub repository

BACKUP

📌 Key Deliverables

Download and parse regulation text from the eCFR API
Compute word counts, track changes over time (e.g., 2020 → 2025), and generate SHA-256 checksums per agency
Normalize nested agency structures (including children) for accurate aggregation
Introduce a custom metric: regulatory density = words per CFR reference
Visualize agency sizes and regulation growth
Build a modular pipeline for future extension (e.g., NLP-based analysis)
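
Three of these deliverables (flattening nested agencies, regulatory density, and checksums) can be sketched together. Everything named here is an assumption for illustration: the function names are hypothetical, the input only mimics a nested agency list with "name", "children", and "cfr_references" fields, and returning 0.0 for an agency with no references is one possible convention rather than the project's confirmed behavior.

```python
import hashlib


def flatten_agencies(agencies, parent=None):
    """Yield one row per agency, walking nested children depth-first."""
    for agency in agencies:
        yield {
            "name": agency["name"],
            "parent": parent,
            "cfr_references": agency.get("cfr_references", []),
        }
        yield from flatten_agencies(agency.get("children", []), parent=agency["name"])


def regulatory_density(word_count: int, reference_count: int) -> float:
    """Words per CFR reference; 0.0 when there are no references (assumed)."""
    return word_count / reference_count if reference_count else 0.0


def text_checksum(text: str) -> str:
    """SHA-256 hex digest of the text, useful for detecting changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical input that mimics a nested agency structure.
agencies = [
    {
        "name": "Department A",
        "cfr_references": [{"title": 1, "chapter": "I"}, {"title": 2, "chapter": "IV"}],
        "children": [
            {"name": "Office B", "cfr_references": [{"title": 1, "chapter": "II"}]},
        ],
    }
]

rows = list(flatten_agencies(agencies))
names = [row["name"] for row in rows]
density = regulatory_density(1200, len(rows[0]["cfr_references"]))
checksum = text_checksum("sample regulation text")
print(names, density, len(checksum))
```

Recording the parent on each flattened row makes it possible to aggregate child agencies into their parent later without counting the same text twice.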

⚡ Quick Setup

1️⃣ Clone the Repo

git clone https://github.com/bkaewell/doge-data-challenge.git
cd doge-data-challenge

2️⃣ Bootstrap Your Environment

python bootstrap.py

This will:

✅ Create a .env file if it doesn't exist
✅ Set the default SNAPSHOT_DATE to today
✅ Set the default WORDCOUNT_METHOD to regex
✅ Create the necessary data folders under data/ and archive/
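
The steps listed above can be sketched in a few lines. This is not the contents of bootstrap.py; the function name bootstrap is hypothetical, and creating only the top-level data/ and archive/ folders is a simplifying assumption.

```python
import tempfile
from datetime import date
from pathlib import Path


def bootstrap(project_root: Path) -> Path:
    """Create a default .env (if missing) and the top-level data folders."""
    env_file = project_root / ".env"
    if not env_file.exists():
        lines = [
            f"SNAPSHOT_DATE={date.today().isoformat()}",
            "WORDCOUNT_METHOD=regex",
            "AGENCY_METADATA_DIR=agency_metadata",
            "REGULATION_TEXT_DIR=regulation_text",
        ]
        env_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    # exist_ok=True makes repeated runs safe.
    for folder in ("data", "archive"):
        (project_root / folder).mkdir(parents=True, exist_ok=True)
    return env_file


# Try it in a throwaway directory so no real project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    env_file = bootstrap(Path(tmp))
    created = env_file.read_text(encoding="utf-8")
    has_data_dir = (Path(tmp) / "data").is_dir()
    has_archive_dir = (Path(tmp) / "archive").is_dir()
print(created)
```

Because an existing .env is left untouched, rerunning the script does not overwrite a manually chosen SNAPSHOT_DATE.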

Context: The project involves scraping agency metadata and regulation texts, organized by SNAPSHOT_DATE. The .env file defines AGENCY_METADATA_DIR and REGULATION_TEXT_DIR, and users may extend it for snapshot-specific configurations

Configuration

All configuration lives in .env. You can manually set a specific date for analysis:

SNAPSHOT_DATE=2025-03-27
WORDCOUNT_METHOD=regex  # Options: split, regex, legal, nlp

AGENCY_METADATA_DIR=agency_metadata
REGULATION_TEXT_DIR=regulation_text

🔥🔥 regex is the default: it balances speed and accuracy, approximating word-level tokenization without an NLP library
