Skip to content

minhngoncoding/address-classification

Repository files navigation

Address Classification (BK-IMP)

This project implements a high-performance system for parsing and classifying Vietnamese address data. By leveraging specialized Trie structures and fuzzy matching, the system accurately extracts hierarchical information (Province, District, and Ward) from messy input strings.

Project Workflow

The system follows a specific pipeline to transform raw, inconsistent address strings into structured geographic data.

1. Preprocessing & Data Cleaning

Before classification, the input data is "sanitized" to handle common human errors and formatting issues:

  • Character Cleanup: Removes special characters (dots, slashes, plus signs) and the words immediately associated with them.
  • Noise Reduction: Filters out standalone characters that carry no geographic meaning.
  • Word Segmentation: Automatically fixes mashed words by detecting uppercase transitions (e.g., TỉnhĐà $\rightarrow$ Tỉnh Đà).
  • Format Normalization: Generates variations of address data, including versions without Vietnamese accents and various casing styles.

2. The Trie Architecture

To achieve near-instant lookup times, the project utilizes a dual-Trie approach:

  • Standard Trie: Contains the core hierarchy (Province $\rightarrow$ District $\rightarrow$ Ward). Nodes are enriched with attributes like original_words, full_words, and address_type.
  • Trigram Trie: Deconstructs address data into trigrams. This allows the system to remain robust even when users provide partial or slightly misspelled inputs.

3. Classification Algorithms

The core logic prioritizes accuracy by processing data from right-to-left:

  1. Right-to-Left Scan: Since Vietnamese addresses typically place the most general info (Province) at the end, the algorithm starts there and moves toward the Ward.
  2. Exact Match: The word is first checked against the Standard Trie. If found, it is assigned a type and removed from the processing string.
  3. Fuzzy Search: If an exact match fails, the system triggers a DFS search starting from the first unfounded character and uses Levenshtein Distance to identify the most likely intended word.

Performance Benchmarks

Designed for efficiency, the system remains extremely fast even in worst-case scenarios involving fuzzy matching.

Metric Execution Time
Average Case 0.0001 s
Worst Case 0.0003 s

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages