Leadtype currently uses a single-byte encoding pipeline inherited from the Type1/AFM era of PDF. Every character in a rendered string must be mapped through a codepage to a byte value in the range 0–255, and the PDF font object is declared with an encoding dictionary (typically WinAnsiEncoding plus a Differences array). Simple fonts of this kind were the norm in early PDF, but composite (Type 0) fonts, which have been part of the format since PDF 1.2 (1996), remove the single-byte limit.
Unicode source text
→ EachCodepage() (rich_text: splits string into single-codepage runs)
→ CharForCodepoint() (codepage: maps rune → byte 0–255)
→ PDF text string (bytes) (pdf: written with /Encoding WinAnsiEncoding+diffs)
→ font metrics via cmap (ttf: glyph index used only for width lookup)
- Hard ceiling of 256 glyphs per encoding. CJK, Arabic, Devanagari, emoji, and anything outside the 24 supported codepages cannot be rendered at all.
- Multi-script text requires font-encoding switches. A string mixing Greek and Cyrillic produces two separate PDF text segments with different font keys. Ligature and shaping boundaries are broken.
- The codepage layer is unnecessary for TTF/OTF. TTF fonts already address all glyphs via a cmap (Unicode → glyph index). The current code ignores that for output and routes everything through a byte-encoding layer, applying the Type1 workflow to fonts that do not need it.
- No ToUnicode CMap in output. Copy-paste, search, and accessibility all fail because the PDF contains no reverse mapping from encoded bytes back to Unicode.
- No font subsetting. Entire font files are embedded. A document using a single CJK character embeds a 20–50 MB font.
Produce a PDF rendering path for TTF/OTF fonts that:
- Addresses glyphs by their TTF glyph index, not by a codepage byte.
- Emits Type 0 / CIDFont composite font objects so any glyph in the font can be used.
- Emits a ToUnicode CMap for every font so text extraction and search work correctly.
- Subsets font files to include only the glyphs actually used in a document.
- Leaves the existing codepage path intact for Type1/AFM output.
The font declaration in the PDF file changes from:
% Simple font — single byte encoding
/Type /Font
/Subtype /TrueType
/BaseFont /ArialMT
/Encoding /WinAnsiEncoding
/Widths [...]
to:
% Composite font — glyph index addressing
/Type /Font
/Subtype /Type0
/BaseFont /ArialMT
/Encoding /Identity-H
/DescendantFonts [
<< /Type /Font
/Subtype /CIDFontType2
/BaseFont /ArialMT
/CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>
/DW 1000
/W [...] % sparse glyph-index width array
/CIDToGIDMap /Identity
/FontDescriptor << ... >>
>>
]
/ToUnicode <stream> % CMap mapping glyph IDs back to Unicode
Text strings are written as big-endian uint16 glyph IDs rather than single bytes.
Unicode source text
→ font.cmapTable.glyphIndex(rune) (ttf: Unicode → glyph ID, uint16)
→ PDF text string (uint16 BE pairs) (pdf: written under /Identity-H encoding)
→ ToUnicode CMap stream (pdf: emitted alongside font object)
→ font subset stream (pdf: only used glyphs embedded)
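As a minimal sketch of the text-string step, the function below packs glyph IDs into a PDF hex string operand, two big-endian bytes per glyph. The name encodeGlyphRun is hypothetical and not part of the existing code; a hex string is used here only because it avoids the escaping that literal strings need for parentheses and backslashes.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"strings"
)

// encodeGlyphRun (hypothetical helper) returns a PDF hex string operand such
// as "<00240057>", holding each glyph ID as two big-endian bytes. This is the
// byte layout a Type 0 font with /Encoding /Identity-H reads as 2-byte codes.
func encodeGlyphRun(glyphs []uint16) string {
	buf := make([]byte, 2*len(glyphs))
	for i, gid := range glyphs {
		binary.BigEndian.PutUint16(buf[2*i:], gid)
	}
	return "<" + strings.ToUpper(hex.EncodeToString(buf)) + ">"
}
```

The result would be written in front of a Tj operator in the content stream, for example `<00240057> Tj`.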
A ToUnicode CMap stream maps each glyph ID used in the document back to its Unicode codepoint, enabling text extraction and copy-paste. Format:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
N beginbfchar
<GLYPHID> <UNICODE>
...
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
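The format above could be generated roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the input is taken to be a glyph ID to rune map collected during rendering, and entries are split into blocks of at most 100 because CMap tooling conventionally limits each bfchar block to that size. Runes above U+FFFF are written as UTF-16 surrogate pairs, since ToUnicode destination values are UTF-16BE.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf16"
)

const cmapHeader = `/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
`

// bfcharEntries (hypothetical) returns one "<GLYPHID> <UNICODE>" line per
// used glyph, sorted by glyph ID.
func bfcharEntries(used map[uint16]rune) []string {
	gids := make([]uint16, 0, len(used))
	for gid := range used {
		gids = append(gids, gid)
	}
	sort.Slice(gids, func(i, j int) bool { return gids[i] < gids[j] })

	entries := make([]string, 0, len(gids))
	for _, gid := range gids {
		r := used[gid]
		if r > 0xFFFF {
			// Supplementary-plane rune: emit a surrogate pair.
			hi, lo := utf16.EncodeRune(r)
			entries = append(entries, fmt.Sprintf("<%04X> <%04X%04X>", gid, hi, lo))
		} else {
			entries = append(entries, fmt.Sprintf("<%04X> <%04X>", gid, r))
		}
	}
	return entries
}

// buildToUnicode (hypothetical) wraps the entries in the CMap boilerplate,
// writing them in blocks of at most 100 entries each.
func buildToUnicode(used map[uint16]rune) string {
	var b strings.Builder
	b.WriteString(cmapHeader)
	entries := bfcharEntries(used)
	for len(entries) > 0 {
		n := len(entries)
		if n > 100 {
			n = 100
		}
		fmt.Fprintf(&b, "%d beginbfchar\n%s\nendbfchar\n", n, strings.Join(entries[:n], "\n"))
		entries = entries[n:]
	}
	b.WriteString("endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n")
	return b.String()
}
```

Sorting the glyph IDs keeps the output deterministic, which makes generated PDFs easier to diff in tests.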
Track which glyph IDs are referenced during document construction. Before embedding, strip the font to a subset:
- Rewrite the glyf table (for TrueType outlines) to include only used glyphs plus their composite dependencies.
- Rewrite the loca table to match.
- Update maxp, hmtx, cmap, post to reflect the subset.
- Rename the font (prefix the PostScript name with a 6-char tag, e.g. ABCDEF+ArialMT) per the PDF spec requirement for embedded subsets.
For CFF-based OpenType fonts, subset the CFF table similarly.
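The renaming step is small enough to sketch directly. The helper below is hypothetical; it takes the random source as a function so that it stays testable, and in practice something like rand.Intn from math/rand could be passed in. The tagged name would normally be used consistently wherever the font's PostScript name appears, such as /BaseFont and the descriptor's /FontName.

```go
package main

// subsetName (hypothetical helper) prefixes a PostScript font name with a
// six-letter uppercase tag and a plus sign, e.g. "ABCDEF+ArialMT".
// intn must return a value in [0, n), like math/rand's Intn.
func subsetName(intn func(n int) int, psName string) string {
	tag := make([]byte, 6)
	for i := range tag {
		tag[i] = byte('A' + intn(26))
	}
	return string(tag) + "+" + psName
}
```

If a document embeds more than one subset of the same font, each subset needs its own tag, so the caller should check for collisions rather than rely on randomness alone.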
Emit ToUnicode CMap streams for all currently produced simple-font (Type1 / TrueType) encodings. This does not change rendered output; it only adds reverse-mapping metadata so viewers can extract text correctly. The codepage's Map() method already provides the byte → rune mapping needed to construct the CMap.
Deliverable: Every font object in generated PDFs includes a /ToUnicode stream.
Add a parallel rendering path for TTF fonts that bypasses the codepage layer entirely:
- New Type0Font PDF object (alongside existing TrueTypeFont).
- New CIDFont PDF object (descendant of Type0).
- IdentityHEncoding constant: no encoding dictionary needed beyond /Identity-H.
- Text rendered as big-endian uint16 glyph ID pairs.
- Glyph ID width array (/W) constructed directly from hmtxTable.
- ToUnicode CMap generated from the glyph ID → Unicode map accumulated during rendering.
- Opt-in at the DocWriter level: a flag or constructor option selects Unicode mode.
Deliverable: A document created with the Unicode path renders all glyphs in a TTF font, including CJK and emoji, with correct text extraction.
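One way the /W array might be built is sketched below. The function name and its inputs are assumptions: advances stands in for whatever hmtxTable exposes (advance widths in font units, keyed by glyph ID), and unitsPerEm comes from the head table. Widths are scaled to the 1000-unit glyph space PDF expects, and consecutive glyph IDs are grouped into the `c [w1 w2 ...]` form.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildW (hypothetical) renders a /W array for the given glyph advances.
// Runs of consecutive glyph IDs share one "first [w1 w2 ...]" group.
func buildW(advances map[uint16]uint16, unitsPerEm int) string {
	gids := make([]uint16, 0, len(advances))
	for gid := range advances {
		gids = append(gids, gid)
	}
	sort.Slice(gids, func(i, j int) bool { return gids[i] < gids[j] })

	var b strings.Builder
	b.WriteString("[")
	for i, gid := range gids {
		if i == 0 || gid != gids[i-1]+1 {
			if i > 0 {
				b.WriteString("]") // close the previous group
			}
			fmt.Fprintf(&b, " %d [", gid)
		} else {
			b.WriteString(" ")
		}
		// Scale font units to 1/1000 em, rounding to nearest.
		w := (int(advances[gid])*1000 + unitsPerEm/2) / unitsPerEm
		fmt.Fprintf(&b, "%d", w)
	}
	if len(gids) > 0 {
		b.WriteString("]")
	}
	b.WriteString(" ]")
	return b.String()
}
```

For a 2048-unit font with advances 1139 and 1366 for glyphs 3 and 4, and 569 for glyph 10, this yields `[ 3 [556 667] 10 [278] ]`. Glyphs whose width equals /DW could be omitted to shrink the array further; the sketch does not do that.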
Once Phase 2 is stable, EachCodepage is no longer needed for TTF output. The rich_text rendering loop changes from iterating codepage segments to iterating characters directly, collecting glyph IDs per font face. EachCodepage is retained for the AFM/Type1 path only.
Deliverable: Multi-script TTF text renders in a single font-key segment. No mid-string font-encoding switches.
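A rough sketch of the per-character loop is shown below. Everything named here is hypothetical: glyphLookup stands in for the font's cmap lookup, and the used map is one possible way to accumulate the glyph ID to Unicode mapping that the ToUnicode CMap and the subsetter later need. Runes the font does not cover fall back to glyph 0 (.notdef). No shaping or ligature substitution is attempted.

```go
package main

// glyphLookup is a stand-in for the font's cmap lookup (an assumption, not
// the existing API). ok is false when the font has no glyph for r.
type glyphLookup func(r rune) (gid uint16, ok bool)

// glyphRun (hypothetical) maps text to a single run of glyph IDs for one
// font face, recording each glyph's source rune in used. When several runes
// share a glyph, the first one seen is kept.
func glyphRun(text string, lookup glyphLookup, used map[uint16]rune) []uint16 {
	run := make([]uint16, 0, len(text))
	for _, r := range text {
		gid, ok := lookup(r)
		if !ok {
			gid = 0 // .notdef
		} else if _, seen := used[gid]; !seen {
			used[gid] = r
		}
		run = append(run, gid)
	}
	return run
}
```

With this shape, a string mixing Greek and Cyrillic produces one run for one font key, because the loop never consults a codepage.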
Track glyph IDs referenced per font during document construction. Before Close():
- Open the original font file.
- Build a closure of required glyph IDs (add composite glyph components).
- Write a subset font stream containing only those glyphs.
- Update all font objects to reference the subset stream.
- Prefix the PostScript name with a 6-char random tag per PDF spec.
Deliverable: Embedded font streams contain only glyphs used in the document. File sizes for CJK-heavy documents reduce dramatically.
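The closure step is the least obvious part of the list above, so here is a sketch of it for TrueType outlines. It assumes a hypothetical accessor, glyphData, that returns the raw glyf record for a glyph ID (located via loca); the function names are illustrative. A composite glyph has a negative numberOfContours, followed after the 10-byte header by component records whose size depends on their flags.

```go
package main

import "encoding/binary"

// Composite glyph component flags from the TrueType glyf table.
const (
	argsAreWords   = 0x0001
	haveScale      = 0x0008
	moreComponents = 0x0020
	haveXYScale    = 0x0040
	haveTwoByTwo   = 0x0080
)

// componentIDs returns the glyph IDs referenced by a composite glyph record,
// or nil for an empty or simple glyph.
func componentIDs(glyph []byte) []uint16 {
	if len(glyph) < 10 || int16(binary.BigEndian.Uint16(glyph)) >= 0 {
		return nil
	}
	var ids []uint16
	pos := 10 // skip numberOfContours and the bounding box
	for pos+4 <= len(glyph) {
		flags := binary.BigEndian.Uint16(glyph[pos:])
		ids = append(ids, binary.BigEndian.Uint16(glyph[pos+2:]))
		pos += 4
		if flags&argsAreWords != 0 {
			pos += 4
		} else {
			pos += 2
		}
		switch {
		case flags&haveScale != 0:
			pos += 2
		case flags&haveXYScale != 0:
			pos += 4
		case flags&haveTwoByTwo != 0:
			pos += 8
		}
		if flags&moreComponents == 0 {
			break
		}
	}
	return ids
}

// glyphClosure (hypothetical) returns the used glyphs plus every glyph they
// reference, directly or indirectly. Glyph 0 (.notdef) is always included.
func glyphClosure(used []uint16, glyphData func(uint16) []byte) map[uint16]bool {
	closure := make(map[uint16]bool)
	stack := append([]uint16{0}, used...)
	for len(stack) > 0 {
		gid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if closure[gid] {
			continue
		}
		closure[gid] = true
		stack = append(stack, componentIDs(glyphData(gid))...)
	}
	return closure
}
```

The visited check also guards against malformed fonts in which composite glyphs reference each other in a cycle.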
Extend Phase 2 and Phase 4 to handle OpenType fonts with CFF outlines (CIDFontType0 instead of CIDFontType2). Requires parsing the CFF table for subsetting.
- The codepage/ package and its generated tables remain for Type1/AFM output.
- EachCodepage remains available for callers that still target simple fonts.
- All existing tests and the existing simple-font PDF output path continue to work unchanged.
- The TTF parsing infrastructure (cmapTable, hmtxTable, etc.) is already present and usable without modification.