ucc-normalizer

ucc-normalizer is a Python command-line tool that normalizes and deduplicates messy Uniform Commercial Code (UCC) filing CSV files. It was designed for local, deterministic processing: there are no external API calls and no randomness in its output. The utility trims and case-folds text fields, removes common legal suffixes (Inc, LLC, Co., etc.), converts state names to USPS codes, parses filing dates into ISO format, and strips non-numeric characters from phone numbers. A simple fuzzy-matching heuristic then merges duplicate records, keeping the most recent filing date.
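A minimal sketch of the per-field transforms described above. The function names, suffix list, and state table here are illustrative, not the tool's actual API:

```python
import re
from datetime import datetime

# Illustrative excerpts only; the real tool's rule sets may differ.
LEGAL_SUFFIXES = {"inc", "llc", "co", "corp", "ltd"}
STATE_CODES = {"california": "CA", "new york": "NY", "texas": "TX"}

def clean_name(name: str) -> str:
    """Trim, case-fold, and drop trailing legal suffixes like Inc/LLC."""
    tokens = name.strip().lower().replace(",", " ").replace(".", " ").split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def to_usps(state: str) -> str:
    """Map a full state name to its USPS code; pass codes through as-is."""
    return STATE_CODES.get(state.strip().lower(), state.strip().upper())

def to_iso(date_str: str) -> str:
    """Parse a few common date layouts into ISO YYYY-MM-DD."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return date_str  # leave unparseable values untouched

def digits_only(phone: str) -> str:
    """Strip everything except digits from a phone number."""
    return re.sub(r"\D", "", phone)
```

For example, clean_name("  Acme Holdings, LLC ") yields "acme holdings", to_iso("03/05/2021") yields "2021-03-05", and digits_only("(415) 555-0199") yields "4155550199".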

Installation

The project relies on a small set of well‑known Python libraries. You can install dependencies and run the tool using a virtual environment or directly in a GitHub Codespace.

cd ucc-normalizer
python -m venv .venv
source .venv/bin/activate
pip install -r <(python -c "import tomllib,sys;print('\n'.join(tomllib.load(open('pyproject.toml','rb'))['project']['dependencies']))")

Alternatively, if you have pipx, you can install the package locally:

pipx install .

Usage

The main entry point is the normalize subcommand. Run ucc-normalizer normalize --help to see all options.

usage: ucc-normalizer normalize [OPTIONS] INPUT

Normalize and dedupe the records in INPUT CSV and write the results.

Arguments:
  INPUT    Path to the input CSV file containing UCC filings.

Options:
  --out PATH       [required] Path to write the normalized CSV
  --report PATH    Optional path to write a deduplication report as JSON
  --threshold INT  Similarity threshold (0‑100) for deduplication [default: 90]
  --help           Show this message and exit.

For example, to normalize the sample data and inspect the deduplication report:

ucc-normalizer normalize samples/raw.csv --out outputs/normalized.csv --report outputs/report.json --threshold 90

After running, you will find a normalized CSV under outputs/normalized.csv and a JSON report describing which rows were merged in outputs/report.json.

Configuration flags

  • --threshold – Adjusts the strictness of deduplication. It represents a percentage similarity (0–100) computed using RapidFuzz’s token set ratio on the debtor name, address, city and state. Higher thresholds require more exact matches.
  • --report – When provided, deduplication details are written as a JSON array. Each entry has keys a (the losing row’s original index), b (the winning row’s index), score (similarity score) and survivor (same as b).
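To build intuition for the threshold, here is a rough pure-Python stand-in for a token-set similarity. This is a simplified illustration of the idea, not RapidFuzz's actual implementation:

```python
from difflib import SequenceMatcher

def token_set_ratio(a: str, b: str) -> int:
    """Simplified token-set similarity in the spirit of RapidFuzz's
    fuzz.token_set_ratio: compare the shared tokens against each side's
    full token set, ignoring word order and duplicates. Returns 0-100."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    ratio = lambda x, y: SequenceMatcher(None, x, y).ratio() * 100
    return round(max(ratio(common, sa), ratio(common, sb), ratio(sa, sb)))
```

Because the comparison is set-based, "acme holdings san francisco ca" and "holdings acme san francisco ca" score 100 and would merge at any threshold, while genuinely different debtors score low and survive as separate rows.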

Testing

Automated tests verify normalization, suffix stripping, state conversion, date parsing, deduplication and CLI behaviour. To run the tests with pytest:

cd ucc-normalizer
pytest -q

All tests should pass. The tests create small in‑memory DataFrames and exercise the fuzzy matching logic as well as the CLI wrapper.
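A test in that style might look like the following self-contained sketch. The dedupe function here is a stand-in defined inline so the example runs on its own, not the package's real API:

```python
from datetime import date

def dedupe(rows):
    """Stand-in: collapse rows sharing a debtor name, keeping the row
    with the most recent filing date (the behavior the README describes)."""
    best = {}
    for row in rows:
        key = row["debtor"]
        if key not in best or row["filed"] > best[key]["filed"]:
            best[key] = row
    return list(best.values())

def test_dedupe_keeps_most_recent():
    rows = [
        {"debtor": "acme holdings", "filed": date(2020, 1, 15)},
        {"debtor": "acme holdings", "filed": date(2022, 6, 1)},
    ]
    out = dedupe(rows)
    assert len(out) == 1
    assert out[0]["filed"] == date(2022, 6, 1)

test_dedupe_keeps_most_recent()  # pytest would collect this automatically
```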

Sample data

The samples/raw.csv file contains a small set of intentionally messy UCC filings to illustrate the tool’s functionality. Running the normalizer on this file produces outputs/normalized.csv and outputs/report.json, showing how duplicate entries are collapsed and fields are standardized.

License

This project is licensed under the MIT license. See the LICENSE file at the repository root for more information.
