ucc-normalizer is a Python command line tool that normalizes and deduplicates messy Uniform Commercial Code (UCC) filing CSV files. It was designed for local, deterministic processing: there are no external API calls and no randomness in its output. The utility trims and case‑folds text fields, removes common legal suffixes (Inc, LLC, Co., etc.), converts state names to USPS codes, parses filing dates to ISO format and strips non‑numeric characters from phone numbers. A simple fuzzy matching heuristic then merges duplicate records, keeping the most recent filing date.
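As a rough illustration of the field-level cleaning described above, here is a minimal sketch in plain Python. The function names, suffix list, and state table are assumptions for illustration, not the tool's actual code, and the real tool handles more date formats than this:

```python
import re
from datetime import datetime

# Illustrative tables only; the tool's real suffix and state lists are larger.
LEGAL_SUFFIXES = {"inc", "llc", "co", "corp", "ltd"}
STATE_CODES = {"california": "CA", "new york": "NY", "texas": "TX"}

def normalize_debtor_name(name: str) -> str:
    """Trim, case-fold, and strip trailing legal suffixes."""
    tokens = re.sub(r"[.,]", "", name.strip().lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def normalize_state(state: str) -> str:
    """Convert a full state name to its USPS code when known."""
    return STATE_CODES.get(state.strip().lower(), state.strip().upper())

def normalize_phone(phone: str) -> str:
    """Strip every non-numeric character."""
    return re.sub(r"\D", "", phone)

def normalize_date(date_str: str) -> str:
    """Parse a US-style date and emit ISO 8601 (sketch: one format only)."""
    return datetime.strptime(date_str.strip(), "%m/%d/%Y").date().isoformat()
```

Each helper is deterministic, matching the tool's no-randomness design goal.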
The project relies on a small set of well‑known Python libraries. You can install dependencies and run the tool using a virtual environment or directly in a GitHub Codespace.
```bash
cd ucc-normalizer
python -m venv .venv
source .venv/bin/activate
pip install -r <(python -c "import tomllib,sys;print('\n'.join(tomllib.load(open('pyproject.toml','rb'))['project']['dependencies']))")
```

Alternatively, if you have pipx, you can install the package locally:
```bash
pipx install .
```

The main entry point is the `normalize` subcommand. Run `ucc-normalizer normalize --help` to see all options.
```text
usage: ucc-normalizer normalize [OPTIONS] INPUT

  Normalize and dedupe the records in INPUT CSV and write the results.

Arguments:
  INPUT  Path to the input CSV file containing UCC filings.

Options:
  --out PATH       [required] Path to write the normalized CSV
  --report PATH    Optional path to write a deduplication report as JSON
  --threshold INT  Similarity threshold (0-100) for deduplication [default: 90]
  --help           Show this message and exit.
```
For example, to normalize the sample data and inspect the deduplication report:
```bash
ucc-normalizer normalize samples/raw.csv --out outputs/normalized.csv --report outputs/report.json --threshold 90
```

After running, you will find a normalized CSV under `outputs/normalized.csv` and a JSON report describing which rows were merged in `outputs/report.json`.
- `--threshold` – Adjusts the strictness of deduplication. It represents a percentage similarity (0–100) computed using RapidFuzz’s token set ratio on the debtor name, address, city and state. Higher thresholds require more exact matches.
- `--report` – When provided, deduplication details are written as a JSON array. Each entry has keys `a` (the losing row’s original index), `b` (the winning row’s index), `score` (similarity score) and `survivor` (same as `b`).
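The merge pass can be sketched as follows. This is an illustration only: `difflib` from the standard library stands in for RapidFuzz's token set ratio (the scores will differ slightly), and the row shape and function names are assumptions based on the description above:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Token-set style score, 0-100 (difflib stand-in for RapidFuzz)."""
    ta = " ".join(sorted(set(a.lower().split())))
    tb = " ".join(sorted(set(b.lower().split())))
    return round(SequenceMatcher(None, ta, tb).ratio() * 100)

def dedupe(rows, threshold=90):
    """rows: dicts with 'name' and ISO 'filing_date' keys.
    Returns (surviving row indices, report entries keyed a/b/score/survivor)."""
    survivors, report = [], []
    for i, row in enumerate(rows):
        merged = False
        for pos, j in enumerate(survivors):
            score = similarity(row["name"], rows[j]["name"])
            if score >= threshold:
                # Keep the most recent filing; ISO dates compare lexicographically.
                winner = i if row["filing_date"] > rows[j]["filing_date"] else j
                loser = j if winner == i else i
                survivors[pos] = winner
                report.append({"a": loser, "b": winner,
                               "score": score, "survivor": winner})
                merged = True
                break
        if not merged:
            survivors.append(i)
    return survivors, report
```

Note how each report entry mirrors the JSON keys described above, with `survivor` duplicating `b` for convenience.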
Automated tests verify normalization, suffix stripping, state conversion, date parsing, deduplication and CLI behaviour. To run the tests with pytest:
```bash
cd ucc-normalizer
pytest -q
```

All tests should pass. The tests create small in‑memory DataFrames and exercise the fuzzy matching logic as well as the CLI wrapper.
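For context, a test in the suite's spirit might look like the following. The helper and its regex are hypothetical stand-ins written for this example, not the package's actual API:

```python
import re

def strip_legal_suffix(name: str) -> str:
    """Hypothetical helper: remove one trailing legal suffix such as Inc or LLC."""
    return re.sub(r"[\s,]+(inc|llc|co|corp|ltd)\.?$", "", name.strip(), flags=re.I)

def test_strip_legal_suffix():
    assert strip_legal_suffix("Acme Holdings, LLC") == "Acme Holdings"
    assert strip_legal_suffix("Widgets Co.") == "Widgets"
    # Names without a suffix pass through unchanged.
    assert strip_legal_suffix("Initech") == "Initech"
```

Because helpers like this are pure functions, the tests need no fixtures or temporary files.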
The `samples/raw.csv` file contains a small set of intentionally messy UCC filings to illustrate the tool’s functionality. Running the normalizer on this file produces `outputs/normalized.csv` and `outputs/report.json`, showing how duplicate entries are collapsed and fields are standardized.
This project is licensed under the MIT license. See the LICENSE file at the repository root for more information.