Sort Chinese text the way Chinese readers expect.
hanzi-sort is a Rust CLI and library for sorting Hanzi by Hanyu Pinyin or by stroke count, with deterministic tie-breaking, phrase-level override rules for polyphonic characters, and terminal-friendly tabular output.
Migration note
`pinyin-sort` has been renamed to `hanzi-sort`. This is a hard rename: there is no compatibility binary alias.
Unicode codepoint order is not Chinese sort order.
If you want useful output for names, glossaries, study lists, or publishing workflows, you usually need more than plain lexical comparison:
- pinyin order for alphabetic-style indexes
- stroke order for dictionary-style or teaching workflows
- phrase-level override rules for polyphonic characters like 重庆 or 银行
- stable tie-breaking so the same dataset always produces the same order
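The gap between codepoint order and pinyin order shows up in a tiny self-contained sketch. The reading table below is a toy assumption for illustration, not the crate's implementation:

```rust
use std::collections::HashMap;

// Toy tone3 reading table for illustration only; the real tool ships
// generated data, so these entries are an assumption for the sketch.
fn toy_readings() -> HashMap<char, &'static str> {
    HashMap::from([
        ('汉', "han4"), ('字', "zi4"),
        ('张', "zhang1"), ('三', "san1"),
        ('赵', "zhao4"), ('四', "si4"),
    ])
}

/// Sort by a per-character pinyin key, tie-breaking on the original string.
fn pinyin_order(mut items: Vec<String>, table: &HashMap<char, &'static str>) -> Vec<String> {
    items.sort_by_key(|s| {
        // Unknown characters fall back to themselves instead of being dropped.
        let key: Vec<String> = s
            .chars()
            .map(|c| table.get(&c).map_or_else(|| c.to_string(), |p| p.to_string()))
            .collect();
        (key, s.clone()) // original string as the deterministic tie-breaker
    });
    items
}

fn main() {
    // Naive codepoint order puts 张 (U+5F20) before 汉 (U+6C49).
    let mut naive = vec!["汉字", "张三", "赵四"];
    naive.sort();
    println!("codepoint order: {:?}", naive);

    let items: Vec<String> = vec!["汉字".into(), "张三".into(), "赵四".into()];
    println!("pinyin order:    {:?}", pinyin_order(items, &toy_readings()));
}
```

A pinyin reader expects 汉字 (han4) before 张三 (zhang1); plain codepoint comparison reverses them.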
Sort by pinyin:

```
hanzi-sort -t 汉字 张三 赵四
```

Sort by stroke count:

```
hanzi-sort -t 天 一 十 --sort-by strokes --columns 1 --entry-width 2 --blank-every 0
```

Resolve a polyphonic phrase with an override file:

```
hanzi-sort -t 重庆 银行 --config ./override.toml
```

Write the result to a file:

```
hanzi-sort -t 重庆 银行 -o ./sorted.txt
```

Features:

- Sort by `pinyin` or `strokes`
- Keep unknown characters in the comparison key instead of dropping them
- Break ties by original character so output stays deterministic
- Read repeated `--text` values or one non-blank record per line from `--file`
- Override single characters or full phrases with TOML
- Format output with configurable columns, alignment, padding, separators, and blank-line cadence
- Use the same core sorter from Rust via `PinyinContext` and `SortMode`
Prerequisites:
- Rust toolchain
- Python 3 and `pypinyin` if you need to regenerate `data/pinyin.csv`
Build:

```
cargo build --release
target/release/hanzi-sort -h
```

Or with Nix:

```
nix develop
just build
```

Basic help:

```
hanzi-sort -h
```

Input rules:

- `--text` and `--file` are mutually exclusive
- `--file` reads one non-blank line per record
- directory inputs are rejected
- success exits with `0`; invalid args, bad override files, and I/O failures exit non-zero
Sort modes:

- `--sort-by pinyin` (default): compares the primary tone3 pinyin for each mapped character, then falls back to the original character.
- `--sort-by strokes`: compares total stroke count per character, then falls back to the original character.
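The strokes mode can be sketched the same way. The stroke counts below are a toy assumption; the real data derives from Unicode kTotalStrokes:

```rust
use std::collections::HashMap;

// Toy stroke counts (an assumption for the sketch; the real data derives
// from Unicode kTotalStrokes).
fn toy_strokes() -> HashMap<char, u32> {
    HashMap::from([('一', 1), ('十', 2), ('天', 4)])
}

/// Strokes-mode sketch: compare per-character stroke counts, then fall
/// back to the original string so equal counts stay deterministic.
/// (Here unknown characters count as 0; the real tool falls back to the
/// character itself instead.)
fn strokes_order(mut items: Vec<String>, table: &HashMap<char, u32>) -> Vec<String> {
    items.sort_by_key(|s| {
        let counts: Vec<u32> = s.chars().map(|c| *table.get(&c).unwrap_or(&0)).collect();
        (counts, s.clone())
    });
    items
}

fn main() {
    let items: Vec<String> = vec!["天".into(), "一".into(), "十".into()];
    println!("{:?}", strokes_order(items, &toy_strokes()));
}
```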
Flags:

- `-f, --file <FILE>`: input file path, can be repeated
- `-t, --text <TEXT>...`: inline text input, can be repeated
- `-o, --output <PATH>`: write output to a file instead of stdout
- `-c, --config <PATH>`: TOML override file
- `--sort-by <MODE>`: `pinyin` or `strokes`
- `--columns <N>`: entries per row, must be greater than `0`
- `--blank-every <N>`: insert a blank line every `N` rows; use `0` to disable
- `--entry-width <N>`: target display width per entry, must be greater than `0`
- `--align <MODE>`: `left`, `center`, `right`, or `even`
- `--padding-char <CHAR>`: padding character, must have display width `1`
- `--separator <CHAR>`: separator between entries
- `--line-ending <CHAR>`: line ending character
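How the layout knobs interact can be sketched roughly. This assumes every entry is a single double-width CJK character and a fixed single-space separator; the real tool measures display width and supports more alignment and padding options:

```rust
/// Rough layout sketch: `columns` entries per row, each padded toward
/// `entry_width` display cells, and a blank line after every `blank_every`
/// rows (0 disables the blank lines).
fn layout(entries: &[&str], columns: usize, entry_width: usize, blank_every: usize) -> String {
    let mut out = String::new();
    for (row, chunk) in entries.chunks(columns).enumerate() {
        if blank_every > 0 && row > 0 && row % blank_every == 0 {
            out.push('\n'); // blank-line cadence between row groups
        }
        let cells: Vec<String> = chunk
            .iter()
            // Assumption: each entry already occupies 2 display cells.
            .map(|e| format!("{}{}", e, " ".repeat(entry_width.saturating_sub(2))))
            .collect();
        out.push_str(cells.join(" ").trim_end());
        out.push('\n');
    }
    out
}

fn main() {
    // Two columns, width 2, blank line after every row.
    print!("{}", layout(&["一", "十", "天", "汉"], 2, 2, 1));
}
```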
`--config` accepts TOML with either or both sections:

```toml
[char_override]
'重' = "chong2"
'行' = "xing2"

[phrase_override]
"重庆" = ["chong2", "qing4"]
"银行" = ["yin2", "hang2"]
```

Rules:

- `phrase_override` takes precedence over `char_override`
- each `phrase_override` entry must provide exactly one pinyin syllable per character
- omitted sections default to empty maps
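The precedence rule above amounts to a two-level lookup, sketched here with illustrative names that are not the crate's API:

```rust
use std::collections::HashMap;

/// Precedence sketch: a phrase override supplies the whole reading;
/// otherwise each character tries the char override, then the default
/// table, then falls back to the character itself.
fn reading_of(
    phrase: &str,
    phrase_override: &HashMap<&str, Vec<&str>>,
    char_override: &HashMap<char, &str>,
    defaults: &HashMap<char, &str>,
) -> Vec<String> {
    if let Some(syllables) = phrase_override.get(phrase) {
        // Exactly one syllable per character, so this maps directly.
        return syllables.iter().map(|s| s.to_string()).collect();
    }
    phrase
        .chars()
        .map(|c| {
            char_override
                .get(&c)
                .or_else(|| defaults.get(&c))
                .map_or_else(|| c.to_string(), |p| p.to_string())
        })
        .collect()
}

fn main() {
    let defaults = HashMap::from([('重', "zhong4"), ('庆', "qing4"), ('银', "yin2")]);
    let chars = HashMap::from([('行', "xing2")]);
    let phrases = HashMap::from([("银行", vec!["yin2", "hang2"])]);

    // The phrase override wins over the char override for 行 inside 银行 …
    println!("{:?}", reading_of("银行", &phrases, &chars, &defaults));
    // … while the char override still applies outside the phrase.
    println!("{:?}", reading_of("行", &phrases, &chars, &defaults));
}
```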
The CLI is the primary product, but the sorter is available as a Rust library.
```rust
use hanzi_sort::{PinyinContext, SortMode, sort_strings_by};

let context = PinyinContext::default();
let sorted = sort_strings_by(
    vec!["一".into(), "十".into(), "天".into()],
    &context,
    SortMode::Strokes,
);
```

Key APIs:

- `PinyinContext::pinyin_of(&str) -> Vec<PinYinRecord>`
- `sort_strings(Vec<String>, &PinyinContext) -> Vec<String>`
- `sort_strings_by(Vec<String>, &PinyinContext, SortMode) -> Vec<String>`
- `scripts/convert_pinyin_to_csv.py` builds `data/pinyin.csv` from the vendored pinyin dataset
- `scripts/convert_strokes_to_csv.py` builds `data/strokes.csv` from Unicode `kTotalStrokes` data
- `build.rs` generates static lookup tables in `src/generated/`
- the build validates representative codepoints such as 〇, 汉, 重, and 一
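As a rough illustration of the data shape those scripts produce (assuming rows like `汉,han4`; the real pipeline bakes the data into generated Rust tables at build time rather than parsing at runtime):

```rust
use std::collections::HashMap;

/// Parse "character,reading" CSV rows into a lookup map.
/// (Illustrative only; the row format is an assumption for this sketch.)
fn parse_pinyin_csv(csv: &str) -> HashMap<char, String> {
    csv.lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(|line| {
            let (ch, reading) = line.split_once(',')?;
            Some((ch.chars().next()?, reading.trim().to_string()))
        })
        .collect()
}

fn main() {
    let table = parse_pinyin_csv("汉,han4\n重,zhong4\n一,yi1\n");
    println!("{} entries, 汉 -> {}", table.len(), table[&'汉']);
}
```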
```
cargo test
cargo clippy --all-targets --all-features -- -D warnings
```

Nix helpers:

```
nix develop
just prep-data
just build
```

AGPL-3.0-only. See LICENSE.