ucc-normalizer

ucc-normalizer is a Python command-line tool that normalizes and deduplicates messy Uniform Commercial Code (UCC) filing CSV files. It was designed for local, deterministic processing: there are no external API calls and no randomness in its output. The utility trims and case-folds text fields, removes common legal suffixes (Inc, LLC, Co., etc.), converts state names to USPS codes, parses filing dates into ISO format, and strips non-numeric characters from phone numbers. A simple fuzzy-matching heuristic then merges duplicate records, keeping the most recent filing date.
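A minimal sketch of the per-field transforms described above. The function names, suffix list, and state table here are illustrative, not the tool's actual API:

```python
import re
from datetime import datetime

# Illustrative excerpts only; the real tool's rule sets may differ.
LEGAL_SUFFIXES = {"inc", "llc", "co", "corp", "ltd"}
STATE_CODES = {"california": "CA", "new york": "NY", "texas": "TX"}

def clean_name(name: str) -> str:
    """Trim, case-fold, and drop trailing legal suffixes like Inc/LLC."""
    tokens = name.strip().lower().replace(",", " ").replace(".", " ").split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def to_usps(state: str) -> str:
    """Map a full state name to its USPS code; pass codes through as-is."""
    return STATE_CODES.get(state.strip().lower(), state.strip().upper())

def to_iso(date_str: str) -> str:
    """Parse a few common date layouts into ISO YYYY-MM-DD."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return date_str  # leave unparseable values untouched

def digits_only(phone: str) -> str:
    """Strip everything except digits from a phone number."""
    return re.sub(r"\D", "", phone)
```

For example, clean_name("  Acme Holdings, LLC ") yields "acme holdings", to_iso("03/05/2021") yields "2021-03-05", and digits_only("(415) 555-0199") yields "4155550199".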

Installation

The project relies on a small set of well‑known Python libraries. You can install dependencies and run the tool using a virtual environment or directly in a GitHub Codespace.

cd ucc-normalizer
python -m venv .venv
source .venv/bin/activate
pip install -r <(python -c "import tomllib,sys;print('\n'.join(tomllib.load(open('pyproject.toml','rb'))['project']['dependencies']))")

Alternatively, if you have pipx, you can install the package locally:

pipx install .

Usage

The main entry point is the normalize subcommand. Run ucc-normalizer normalize --help to see all options.

usage: ucc-normalizer normalize [OPTIONS] INPUT

Normalize and dedupe the records in INPUT CSV and write the results.

Arguments:
  INPUT    Path to the input CSV file containing UCC filings.

Options:
  --out PATH       [required] Path to write the normalized CSV
  --report PATH    Optional path to write a deduplication report as JSON
  --threshold INT  Similarity threshold (0‑100) for deduplication [default: 90]
  --help           Show this message and exit.

For example, to normalize the sample data and inspect the deduplication report:

ucc-normalizer normalize samples/raw.csv --out outputs/normalized.csv --report outputs/report.json --threshold 90

After running, you will find a normalized CSV under outputs/normalized.csv and a JSON report describing which rows were merged in outputs/report.json.

Configuration flags

  • --threshold – Adjusts the strictness of deduplication. It represents a percentage similarity (0–100) computed using RapidFuzz’s token set ratio on the debtor name, address, city and state. Higher thresholds require more exact matches.
  • --report – When provided, deduplication details are written as a JSON array. Each entry has keys a (the losing row’s original index), b (the winning row’s index), score (similarity score) and survivor (same as b).
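To build intuition for the threshold, here is a rough pure-Python stand-in for a token-set similarity. This is a simplified illustration of the idea, not RapidFuzz's actual implementation:

```python
from difflib import SequenceMatcher

def token_set_ratio(a: str, b: str) -> int:
    """Simplified token-set similarity in the spirit of RapidFuzz's
    fuzz.token_set_ratio: compare the shared tokens against each side's
    full token set, ignoring word order and duplicates. Returns 0-100."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    ratio = lambda x, y: SequenceMatcher(None, x, y).ratio() * 100
    return round(max(ratio(common, sa), ratio(common, sb), ratio(sa, sb)))
```

Because the comparison is set-based, "acme holdings san francisco ca" and "holdings acme san francisco ca" score 100 and would merge at any threshold, while genuinely different debtors score low and survive as separate rows.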

Testing

Automated tests verify normalization, suffix stripping, state conversion, date parsing, deduplication and CLI behaviour. To run the tests with pytest:

cd ucc-normalizer
pytest -q

All tests should pass. The tests create small in‑memory DataFrames and exercise the fuzzy matching logic as well as the CLI wrapper.
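A test in that style might look like the following self-contained sketch. The dedupe function here is a stand-in defined inline so the example runs on its own, not the package's real API:

```python
from datetime import date

def dedupe(rows):
    """Stand-in: collapse rows sharing a debtor name, keeping the row
    with the most recent filing date (the behavior the README describes)."""
    best = {}
    for row in rows:
        key = row["debtor"]
        if key not in best or row["filed"] > best[key]["filed"]:
            best[key] = row
    return list(best.values())

def test_dedupe_keeps_most_recent():
    rows = [
        {"debtor": "acme holdings", "filed": date(2020, 1, 15)},
        {"debtor": "acme holdings", "filed": date(2022, 6, 1)},
    ]
    out = dedupe(rows)
    assert len(out) == 1
    assert out[0]["filed"] == date(2022, 6, 1)

test_dedupe_keeps_most_recent()  # pytest would collect this automatically
```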

Sample data

The samples/raw.csv file contains a small set of intentionally messy UCC filings to illustrate the tool’s functionality. Running the normalizer on this file produces outputs/normalized.csv and outputs/report.json, showing how duplicate entries are collapsed and fields are standardized.

License

This project is licensed under the MIT license. See the LICENSE file at the repository root for more information.
