charconv

A pure Crystal implementation of GNU libiconv. Converts text between 150+ character encodings using Unicode (UCS-4) as a pivot, with performance-first design.

Features

150+ encodings: ASCII, UTF-8, UTF-16/32, ISO-8859-*, Windows codepages, Mac encodings, CJK (Shift_JIS, EUC-JP, GBK, Big5, EUC-KR, GB18030, ...), EBCDIC, and more
Fast: 8-byte ASCII scanner with memcpy for ASCII-superset pairs, enum-based dispatch compiling to jump tables, table-driven single-byte codecs, zero allocations in the hot path
Correct: Exhaustive byte-level tests against system iconv for every encoding
Streaming: Buffer-based API for zero-copy conversion, plus IO wrapper for convenience
GNU iconv compatible: Supports //IGNORE, //TRANSLIT, and combined flags

Installation

Add to your shard.yml:

dependencies:
  charconv:
    github: jackthorne/charconv

Usage

One-shot conversion

require "charconv"

# String/Bytes → Bytes
result = CharConv.convert("Hello, World!", "UTF-8", "ISO-8859-1")
result = CharConv.convert(input_bytes, "Shift_JIS", "UTF-8")

# With flags
result = CharConv.convert(input, "UTF-8", "ASCII//TRANSLIT")   # transliterate
result = CharConv.convert(input, "UTF-8", "ASCII//IGNORE")     # skip failures

Streaming (buffer-based)

converter = CharConv::Converter.new("EUC-JP", "UTF-8")

# You provide the buffers — zero allocations
src_consumed, dst_written = converter.convert(input_bytes, output_bytes)
# Call repeatedly until input is exhausted

IO streaming

File.open("input.txt", "r") do |input|
  File.open("output.txt", "w") do |output|
    CharConv.convert(input, output, "Shift_JIS", "UTF-8")
  end
end

# Or with a Converter instance for more control
converter = CharConv::Converter.new("GB18030", "UTF-8")
converter.convert(input_io, output_io, buffer_size: 16384)

Thread Safety

Converter instances are not thread-safe — they hold mutable codec state for stateful encodings (ISO-2022-JP, UTF-7, HZ, etc.). Do not share a converter across fibers or threads. Instead, call #dup to create an independent copy:

converter = CharConv::Converter.new("ISO-2022-JP", "UTF-8")

# Each fiber gets its own copy with fresh state
10.times do
  spawn do
    my_conv = converter.dup
    my_conv.convert(input_bytes, output_bytes)
  end
end

Error handling

charconv offers two error handling styles: status codes for streaming, and exceptions for one-shot conversion.

Status codes (streaming)

convert_with_status returns a ConvertStatus enum indicating why conversion stopped:

Status	Meaning	Action
`OK`	All input consumed	Done — read `dst[0, written]`
`E2BIG`	Output buffer full	Flush written bytes, call again with remaining input
`EILSEQ`	Invalid byte sequence	Error at `src[consumed]` — handle or abort
`EINVAL`	Incomplete sequence	Need more input bytes, or error at EOF

converter = CharConv::Converter.new("UTF-8", "ISO-8859-1")
src = input_bytes
dst = Bytes.new(src.size * 2)

consumed, written, status = converter.convert_with_status(src, dst)
case status
when .ok?     then io.write(dst[0, written])
when .e2_big? then # grow buffer or flush and retry with src[consumed..]
when .eilseq? then raise "Invalid byte at position #{consumed}"
when .einval? then raise "Incomplete sequence at end of input"
end

Exceptions (one-shot)

The one-shot CharConv.convert raises CharConv::ConversionError on failure:

begin
  result = CharConv.convert(input, "UTF-8", "ISO-8859-1")
rescue ex : CharConv::ConversionError
  puts ex.message # e.g. "Conversion failed at byte 42 (42/100 bytes consumed)"
end

Using flags instead of error handling

//IGNORE silently skips invalid bytes and unencodable characters. //TRANSLIT attempts ASCII approximations (e.g. é → e, © → (c)). Combine both for maximum tolerance:

# Skip what can't be transliterated
result = CharConv.convert(input, "UTF-8", "ASCII//TRANSLIT//IGNORE")

Buffer sizing for streaming

When using convert or convert_with_status with your own buffers:

Safe default: src.size * 4 covers worst-case expansion (1-byte source → 4-byte UTF-8)
Exact ceiling: src.size * converter.to.max_bytes_per_char — never overestimates
Same-family conversions (e.g. ISO-8859-1 → ISO-8859-2): output ≤ input, so src.size suffices
If E2BIG is returned: double the buffer and retry, or flush written bytes and continue

converter = CharConv::Converter.new("EUC-JP", "UTF-8")
src = File.read("input.dat").to_slice
dst = Bytes.new(src.size * converter.to.max_bytes_per_char)
consumed, written, status = converter.convert_with_status(src, dst)

Stateful encodings

ISO-2022-JP, ISO-2022-CN, ISO-2022-KR, UTF-7, and HZ use escape sequences to switch between character sets. This means:

Call flush_encoder after the last chunk to emit any pending escape sequences (e.g. the switch-back-to-ASCII sequence in ISO-2022-JP):

converter = CharConv::Converter.new("UTF-8", "ISO-2022-JP")
consumed, written = converter.convert(src, dst)
flush_len = converter.flush_encoder(dst, written)
output.write(dst[0, written + flush_len])

Call reset before reusing the converter for a new document — otherwise the encoder assumes it's continuing the previous stream's state.
One-shot and IO methods handle this automatically — flush_encoder and reset are only needed when using the buffer-based streaming API directly.

Querying encodings

CharConv.encoding_supported?("UTF-8")       # => true
CharConv.encoding_supported?("NONEXISTENT") # => false
CharConv.list_encodings                      # => ["ASCII", "UTF-8", ...]

Supported Encodings

Unicode: ASCII, UTF-8, UTF-16BE/LE/BOM, UTF-32BE/LE/BOM, UCS-2, UCS-4, UTF-7, C99, Java

Western European: ISO-8859-1/15, CP1252, MacRoman, HP-ROMAN8, NEXTSTEP

Central/Eastern European: ISO-8859-2/3/4/10/13/14/16, CP1250, MacCentralEurope

Cyrillic: ISO-8859-5, CP1251, KOI8-R, KOI8-U, KOI8-RU, MacCyrillic, MacUkraine

Greek: ISO-8859-7, CP1253, MacGreek

Turkish: ISO-8859-9, CP1254, MacTurkish

Hebrew: ISO-8859-8, CP1255, MacHebrew

Arabic: ISO-8859-6, CP1256, MacArabic, CP864

Thai: ISO-8859-11, TIS-620, CP874, MacThai

Vietnamese: VISCII, TCVN, CP1258

Japanese: EUC-JP, Shift_JIS, CP932, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2

Chinese (Simplified): GB2312, GBK, GB18030, EUC-CN, HZ, ISO-2022-CN

Chinese (Traditional): Big5, CP950, Big5-HKSCS, EUC-TW

Korean: EUC-KR, CP949, ISO-2022-KR, JOHAB

DOS/IBM: CP437, CP737, CP775, CP850, CP852, CP855, CP857, CP858, CP860-CP866, CP869

EBCDIC: CP037, CP273, CP277, CP278, CP280, CP284, CP285, CP297, CP423, CP424, CP500, CP905, CP1026

Other: ARMSCII-8, Georgian-Academy, Georgian-PS, PT154, KOI8-T, KZ-1048, MULELAO-1, ATARIST, RISCOS-LATIN1

Replacing libiconv in Crystal's stdlib

charconv can transparently replace Crystal's libiconv dependency for all stdlib encoding operations (String#encode, String.new(bytes, encoding), IO#set_encoding).

require "charconv/stdlib"

# All stdlib encoding now uses charconv — no libiconv calls at runtime
"café".encode("ISO-8859-1")
String.new(bytes, "Shift_JIS")

io = File.open("data.txt")
io.set_encoding("EUC-JP")
io.gets_to_end  # decoded through charconv

By default, libiconv is still linked but never called. To fully remove the libiconv dependency, compile with -Dwithout_iconv:

crystal build app.cr -Dwithout_iconv

Performance

charconv vs system libiconv, 1 MB input, --release mode.

Conversion	charconv	system iconv	Speedup
ASCII → ASCII	73.39 µs	11.89 ms	162.0×
UTF-8 → ISO-8859-1 (mixed Latin)	3.43 ms	14.62 ms	4.3×
ISO-8859-1 → UTF-8	2.08 ms	14.24 ms	6.9×
UTF-8 → UTF-8 (mixed widths)	4.92 ms	11.98 ms	2.4×
CP1252 → UTF-8	2.50 ms	17.24 ms	6.9×
UTF-8 → CP1252 (mixed Latin)	3.50 ms	14.50 ms	4.1×
UTF-16BE → UTF-8 (mixed widths)	3.73 ms	10.83 ms	2.9×
UTF-8 → UTF-16LE	4.57 ms	10.11 ms	2.2×

Measured on Apple M3 Pro, Crystal 1.19.1, macOS. Run crystal spec spec/bench_spec.cr --release to reproduce.

Architecture

Every conversion goes through a Unicode pivot:

Source bytes → UCS-4 codepoint → Target bytes
  decode()        (pivot)         encode()

For ASCII-superset encoding pairs (the vast majority), an 8-byte word scanner identifies ASCII runs and memcpys them directly, only falling back to the decode-pivot-encode loop for non-ASCII characters. This means ASCII-heavy text converts at memory bandwidth.

See ARCHITECTURE.md for the full design rationale.

Minimal Build

By default, charconv embeds lookup tables for all 150+ encodings including CJK (~2 MB of table data). To build a smaller binary with only Western encodings:

crystal build app.cr -Dcharconv_minimal --release

The minimal set includes: ASCII, UTF-8, ISO-8859-1 through -16, CP1250-1258, KOI8-R/U, Mac encodings, DOS codepages, EBCDIC, UTF-16/32, UTF-7, C99, Java, and all other single-byte encodings. CJK encodings (Shift_JIS, EUC-JP, GBK, Big5, GB18030, EUC-KR, ISO-2022-*, HZ, etc.) are excluded — attempting to use them will return EILSEQ/ILUNI errors (the stubs reject all input).

Development

crystal spec                        # run all tests
crystal spec spec/bench_spec.cr --release  # run benchmarks

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 110 Commits
.github/workflows		.github/workflows
spec		spec
src		src
tools		tools
.editorconfig		.editorconfig
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
PLAN.md		PLAN.md
README.md		README.md
shard.yml		shard.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

charconv

Features

Installation

Usage

One-shot conversion

Streaming (buffer-based)

IO streaming

Thread Safety

Error handling

Buffer sizing for streaming

Stateful encodings

Querying encodings

Supported Encodings

Replacing libiconv in Crystal's stdlib

Performance

Architecture

Minimal Build

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

charconv

Features

Installation

Usage

One-shot conversion

Streaming (buffer-based)

IO streaming

Thread Safety

Error handling

Buffer sizing for streaming

Stateful encodings

Querying encodings

Supported Encodings

Replacing libiconv in Crystal's stdlib

Performance

Architecture

Minimal Build

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages