Address Classification (BK-IMP)

This project implements a high-performance system for parsing and classifying Vietnamese address data. By leveraging specialized Trie structures and fuzzy matching, the system accurately extracts hierarchical information (Province, District, and Ward) from messy input strings.

Project Workflow

The system follows a specific pipeline to transform raw, inconsistent address strings into structured geographic data.

1. Preprocessing & Data Cleaning

Before classification, the input data is "sanitized" to handle common human errors and formatting issues:

Character Cleanup: Removes special characters (dots, slashes, plus signs) and the words immediately associated with them.
Noise Reduction: Filters out standalone characters that carry no geographic meaning.
Word Segmentation: Automatically fixes mashed words by detecting uppercase transitions (e.g., TỉnhĐà $\rightarrow$ Tỉnh Đà).
Format Normalization: Generates variations of address data, including versions without Vietnamese accents and various casing styles.

2. The Trie Architecture

To achieve near-instant lookup times, the project utilizes a dual-Trie approach:

Standard Trie: Contains the core hierarchy (Province $\rightarrow$ District $\rightarrow$ Ward). Nodes are enriched with attributes like original_words, full_words, and address_type.
Trigram Trie: Deconstructs address data into trigrams. This allows the system to remain robust even when users provide partial or slightly misspelled inputs.

3. Classification Algorithms

The core logic prioritizes accuracy by processing data from right-to-left:

Right-to-Left Scan: Since Vietnamese addresses typically place the most general info (Province) at the end, the algorithm starts there and moves toward the Ward.
Exact Match: The word is first checked against the Standard Trie. If found, it is assigned a type and removed from the processing string.
Fuzzy Search: If an exact match fails, the system triggers a DFS search starting from the first unfounded character and uses Levenshtein Distance to identify the most likely intended word.

Performance Benchmarks

Designed for efficiency, the system remains extremely fast even in worst-case scenarios involving fuzzy matching.

Metric	Execution Time
Average Case	0.0001 s
Worst Case	0.0003 s

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitattributes		.gitattributes
Address_Classification.pptx		Address_Classification.pptx
Address_Classification.py		Address_Classification.py
README.md		README.md
list_district.txt		list_district.txt
list_province.txt		list_province.txt
list_ward.txt		list_ward.txt
test.json		test.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Address Classification (BK-IMP)

Project Workflow

1. Preprocessing & Data Cleaning

2. The Trie Architecture

3. Classification Algorithms

Performance Benchmarks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Address Classification (BK-IMP)

Project Workflow

1. Preprocessing & Data Cleaning

2. The Trie Architecture

3. Classification Algorithms

Performance Benchmarks

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages