This project implements a high-performance system for parsing and classifying Vietnamese address data. By leveraging specialized Trie structures and fuzzy matching, the system accurately extracts hierarchical information (Province, District, and Ward) from messy input strings.
The system follows a specific pipeline to transform raw, inconsistent address strings into structured geographic data.
Before classification, the input data is "sanitized" to handle common human errors and formatting issues:
- Character Cleanup: Removes special characters (dots, slashes, plus signs) and the words immediately associated with them.
- Noise Reduction: Filters out standalone characters that carry no geographic meaning.
-
Word Segmentation: Automatically fixes mashed words by detecting uppercase transitions (e.g.,
TỉnhĐà$\rightarrow$ Tỉnh Đà). - Format Normalization: Generates variations of address data, including versions without Vietnamese accents and various casing styles.
To achieve near-instant lookup times, the project utilizes a dual-Trie approach:
-
Standard Trie: Contains the core hierarchy (Province
$\rightarrow$ District$\rightarrow$ Ward). Nodes are enriched with attributes likeoriginal_words,full_words, andaddress_type. - Trigram Trie: Deconstructs address data into trigrams. This allows the system to remain robust even when users provide partial or slightly misspelled inputs.
The core logic prioritizes accuracy by processing data from right-to-left:
- Right-to-Left Scan: Since Vietnamese addresses typically place the most general info (Province) at the end, the algorithm starts there and moves toward the Ward.
- Exact Match: The word is first checked against the Standard Trie. If found, it is assigned a type and removed from the processing string.
- Fuzzy Search: If an exact match fails, the system triggers a DFS search starting from the first unfounded character and uses Levenshtein Distance to identify the most likely intended word.
Designed for efficiency, the system remains extremely fast even in worst-case scenarios involving fuzzy matching.
| Metric | Execution Time |
|---|---|
| Average Case | 0.0001 s |
| Worst Case | 0.0003 s |