Skip to content

fix: complete StandardEncoding glyph table (v0.1.1)#2

Merged
hallelx2 merged 1 commit into
mainfrom
fix/standard-encoding-glyphs
May 27, 2026
Merged

fix: complete StandardEncoding glyph table (v0.1.1)#2
hallelx2 merged 1 commit into
mainfrom
fix/standard-encoding-glyphs

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

@hallelx2 hallelx2 commented May 27, 2026

Summary

Rewrites the four PDF base encodings (StandardEncoding, WinAnsiEncoding, MacRomanEncoding, PDFDocEncoding) from a single source of truth that mirrors pdfminer.six's latin_enc.py and PDF Reference 1.7 Appendix D.2 Table D.2. The previous tables silently dropped ~32 named glyphs per encoding outside printable ASCII — smart quotes, en/em dashes, bullet, florin, dagger marks — on real PDFs that used them without a /ToUnicode map (PDF/A, SEC filings, most LaTeX docs).

Before / after

Given a PDF whose content stream emits StandardEncoding byte 0x91 (quoteleft):

Output
v0.1.0 "" (empty, falls back to (cid:145))
v0.1.1 "‘"

Same for ’ “ ” – — • ƒ † ‡ fi fl etc. Across a typical SEC 10-K, this is the difference between dozens of mangled paragraphs and clean text.

Spec correctness

StandardEncoding[0x27] is quoteright (U+2019) per the PDF spec, not ASCII '. The previous test asserted ASCII identity over the printable range for all four encodings; that was wrong for StandardEncoding. The test has been updated to reflect the spec-correct behavior. WinAnsi / MacRoman / PDFDoc still hold ASCII ' and ` at 0x27 / 0x60, as the spec requires.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./... — all green, including extended spot-checks for quoteleft, quoteright, quotedblleft, quotedblright, endash, emdash, bullet, florin, and WinAnsi high-byte typographic slots
  • AGL §2 compound-name handling (f_ifi) and variant-suffix stripping (A.altA) covered

Note

This is a behavior change for callers depending on the pre-v0.1.1 ASCII-identity behavior of StandardEncoding at 0x27 / 0x60. The new behavior is spec-correct and matches pdfplumber, pdfminer.six, and Ghostscript on the same input.

Summary by Sourcery

Align PDF base encoding tables and glyph name resolution with the PDF 1.7 spec and pdfminer.six so that curly quotes, dashes, ligatures, and other non-ASCII glyphs are correctly decoded instead of being dropped or mis-mapped.

Bug Fixes:

  • Drive StandardEncoding, WinAnsiEncoding, MacRomanEncoding, and PDFDocEncoding from a shared spec-derived table to restore missing non-ASCII glyph mappings and match the PDF Reference.
  • Correct StandardEncoding’s mappings for 0x27 and 0x60 to use typographic quotes instead of ASCII apostrophe/backtick, aligning behavior with other PDF tooling.
  • Expand Adobe glyph name resolution to cover the full set of glyphs used by the base encodings and common /Differences entries, including ligatures and typographic punctuation, instead of returning empty placeholders.

Enhancements:

  • Implement AGL §2 behaviors in AdobeGlyphToUnicode, including compound-name decomposition (e.g. f_i) and variant suffix stripping (e.g. .alt, .sc), to better handle real-world font glyph names.
  • Strengthen encoding and glyph-resolution tests to validate spec-correct ASCII ranges, high-byte WinAnsi slots, and key typographic glyph mappings across all four encodings.

Documentation:

  • Document the 0.1.1 release in the changelog, including the encoding corrections, expanded glyph coverage, and the behavior change for StandardEncoding apostrophe/backtick mappings.

The four base PDF encodings (StandardEncoding, WinAnsiEncoding,
MacRomanEncoding, PDFDocEncoding) are now built from a single
encodingRows source-of-truth that mirrors pdfminer.six's latin_enc.py
and PDF Reference 1.7 Appendix D.2 Table D.2.

The previous tables silently dropped ~32 named glyphs per encoding
outside printable ASCII — most visibly the smart quotes (' ' " "),
en/em dashes (- -), bullet, florin, and dagger marks. PDFs that
used these without a /ToUnicode map (PDF/A filings, SEC 10-Ks, most
LaTeX-emitted documents) returned empty or garbled text where these
glyphs appeared.

AdobeGlyphToUnicode now resolves the full Adobe Glyph List for the
~250 glyph names referenced by the four PDF base encodings plus
common /Differences additions. Compound names ("f_i" -> "fi") and
variant suffixes (".alt", ".sc") are handled per AGL §2.

StandardEncoding now correctly maps 0x27 to quoteright (U+2019) and
0x60 to quoteleft (U+2018) per the PDF spec; WinAnsi / MacRoman /
PDFDoc keep ASCII apostrophe and backtick at those slots. The
existing TestEncodingByName test was updated to reflect this
spec-correct behavior (the previous "identity over 0x20..0x7e for
all four encodings" assertion was wrong for StandardEncoding).
Copilot AI review requested due to automatic review settings May 27, 2026 00:15
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 37 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b3023578-afe2-44bb-a36a-3c359a7c764c

📥 Commits

Reviewing files that changed from the base of the PR and between bb561ce and 851b386.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • internal/pdf/font.go
  • internal/pdf/font_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/standard-encoding-glyphs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 27, 2026

Reviewer's Guide

Centralizes and corrects the four PDF base encodings (Standard, WinAnsi, MacRoman, PDFDoc) using a spec-aligned encoding row table and expanded Adobe glyph mapping, fixes StandardEncoding apostrophe/quote behavior, and extends tests and changelog accordingly.

File-Level Changes

Change Details Files
Refactor and fully specify PDF base encodings and glyph-name-to-Unicode resolution to match PDF 1.7 Appendix D.2 and pdfminer.six.
  • Introduce encodingRow/encodingRows as the single source of truth for StandardEncoding, WinAnsiEncoding, MacRomanEncoding, and PDFDocEncoding, mirroring PDF 1.7 Appendix D.2 and pdfminer.six latin_enc.py.
  • Initialize the four 256-entry encoding tables in init() by resolving glyph names in encodingRows through AdobeGlyphToUnicode, instead of hard-coded ASCII identity plus partial WinAnsi high-byte entries.
  • Expand adobeGlyphTable from a minimal set to ~250 entries covering all glyphs referenced by the base encodings and common /Differences additions (typographic punctuation, Latin-1, ligatures, math symbols, Eastern European accents, etc.).
  • Enhance AdobeGlyphToUnicode to strip variant suffixes (e.g. .alt, .sc), recursively resolve compound glyph names with underscores per AGL §2, and then fall back to uniXXXX/uXXXX hex-name parsing; add a small indexByte helper for byte searches.
  • Correct StandardEncoding so that 0x27 maps to quoteright (U+2019) and 0x60 maps to quoteleft (U+2018), while keeping WinAnsi/MacRoman/PDFDoc ASCII-identity at those slots.
  • Update TestEncodingByName to assert ASCII identity only for WinAnsi, MacRoman, and PDFDoc, add explicit expectations for StandardEncoding 0x27/0x60, and strengthen WinAnsi high-byte coverage for smart quotes, dashes, bullet, florin, and euro.
  • Extend TestAdobeGlyphRecognisers to assert typographic glyphs (curly quotes, dashes, bullet, florin), correct ligature behavior for fi/fl and AGL compound names (f_i), and variant suffix stripping (A.alt).
  • Document the behavioral change and encoding fixes in CHANGELOG.md for version 0.1.1, including the StandardEncoding 0x27/0x60 apostrophe/quote change.
internal/pdf/font.go
internal/pdf/font_test.go
CHANGELOG.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@hallelx2 hallelx2 merged commit 645955d into main May 27, 2026
5 of 6 checks passed
@hallelx2 hallelx2 deleted the fix/standard-encoding-glyphs branch May 27, 2026 00:16
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • In AdobeGlyphToUnicode, the compound-name handling currently calls indexByte just to detect '_' and then ignores the returned index (_ = i); consider simplifying this by removing the indexByte call and using a single pass over the string (or strings.IndexByte) so the code is clearer and avoids the dummy variable.
  • encodingRows contains multiple entries with the same glyph name (e.g. three space rows with different codepoints); it may be worth adding a brief inline comment near those rows or in init() clarifying that later rows intentionally overwrite earlier ones to match the PDF spec/pdfminer behavior, so future edits don't 'deduplicate' them by accident.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In AdobeGlyphToUnicode, the compound-name handling currently calls indexByte just to detect '_' and then ignores the returned index (`_ = i`); consider simplifying this by removing the indexByte call and using a single pass over the string (or strings.IndexByte) so the code is clearer and avoids the dummy variable.
- encodingRows contains multiple entries with the same glyph name (e.g. three `space` rows with different codepoints); it may be worth adding a brief inline comment near those rows or in init() clarifying that later rows intentionally overwrite earlier ones to match the PDF spec/pdfminer behavior, so future edits don't 'deduplicate' them by accident.

## Individual Comments

### Comment 1
<location path="internal/pdf/font.go" line_range="776-778" />
<code_context>
+	{"seven", 55, 55, 55, 55},
+	{"six", 54, 54, 54, 54},
+	{"slash", 47, 47, 47, 47},
+	{"space", 32, 32, 32, 32},
+	{"space", -1, 202, 160, -1},
+	{"space", -1, 202, 173, -1},
+	{"sterling", 163, 163, 163, 163},
+	{"t", 116, 116, 116, 116},
</code_context>
<issue_to_address>
**issue (bug_risk):** Duplicate and conflicting "space" rows in encodingRows look incorrect and likely clobber each other.

There are three entries with glyph name "space": the standard space (32) and two more with Mac=202 and Win=160/173. The later ones will overwrite the first when building the encodings and conflict with the separate nbspace row above. In AGL/PDF encodings these should use distinct glyph names (e.g., nbspace, sfthyphen) and codepoints that match the reference table, to avoid overriding the standard space mapping.
</issue_to_address>

### Comment 2
<location path="internal/pdf/font_test.go" line_range="28-37" />
<code_context>
 		}
 	}
-	// WinAnsi-specific.
+	// StandardEncoding: identity except the typographic-quote slots.
+	std := EncodingByName("StandardEncoding")
+	for c := byte(0x20); c < 0x7f; c++ {
+		if c == 0x27 || c == 0x60 {
+			continue
+		}
+		if std[c] != string(rune(c)) {
+			t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
+		}
+	}
+	if std[0x27] != "’" {
+		t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
+	}
+	if std[0x60] != "‘" {
+		t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
+	}
</code_context>
<issue_to_address>
**suggestion (testing):** Extend StandardEncoding tests to cover at least a couple of non-ASCII slots (e.g., bullet, dagger, ligatures) that were silently dropped before.

The current tests only exercise the ASCII range plus the special 0x27/0x60 cases. Since the original bug affected high-range glyphs (bullet, dashes, ligatures, florin, etc.), please add a few targeted checks for representative non-ASCII StandardEncoding entries (e.g., bullet/endash/emdash slots and a ligature) to verify they map to the expected Unicode code points, similar to the WinAnsi high‑byte spot checks.

Suggested implementation:

```golang
	// StandardEncoding: identity except the typographic-quote slots.
	std := EncodingByName("StandardEncoding")
	for c := byte(0x20); c < 0x7f; c++ {
		if c == 0x27 || c == 0x60 {
			continue
		}
		if std[c] != string(rune(c)) {
			t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
		}
	}
	if std[0x27] != "" {
		t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
	}
	if std[0x60] != "" {
		t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
	}

	// StandardEncoding: representative non-ASCII checks for high-byte glyphs.
	// These ensure bullet, dashes, and ligatures are correctly mapped.
	if std[0x95] != "" {
		t.Errorf("StandardEncoding[0x95] = %q, want bullet (•)", std[0x95])
	}
	if std[0x96] != "" {
		t.Errorf("StandardEncoding[0x96] = %q, want endash (–)", std[0x96])
	}
	if std[0x97] != "" {
		t.Errorf("StandardEncoding[0x97] = %q, want emdash (—)", std[0x97])
	}
	if std[0x9F] != "" {
		t.Errorf("StandardEncoding[0x9F] = %q, want ligature fi (fi)", std[0x9F])
	}
	if std[0x9E] != "" {
		t.Errorf("StandardEncoding[0x9E] = %q, want ligature fl (fl)", std[0x9E])
	}

```

The exact byte indices for bullet, dashes, and ligatures in StandardEncoding must match the encoding table used in your implementation (likely in something like `internal/pdf/font_encoding.go`). If your StandardEncoding table uses different indices for `bullet`, `endash`, `emdash`, `fi`, or `fl` glyphs, adjust the `0x95/0x96/0x97/0x9F/0x9E` constants in these tests to the correct byte values so they align with the actual `EncodingByName("StandardEncoding")` mapping.
</issue_to_address>

### Comment 3
<location path="internal/pdf/font_test.go" line_range="103-112" />
<code_context>
 	}{
 		{"A", "A"},
 		{"comma", ","},
-		{"fi", "fi"},
+		// "fi" is the AGL ligature glyph (U+FB01), NOT the two-letter
+		// string "f"+"i". This is what pdfminer/pdfplumber return; the
+		// pre-v0.1.1 table missed this and returned "" (then fell back
+		// to a (cid:NNN) placeholder).
+		{"fi", "fi"},
+		{"fl", "fl"},
+		{"quoteleft", "‘"},
+		{"quoteright", "’"},
+		{"quotedblleft", "“"},
+		{"quotedblright", "”"},
+		{"endash", "–"},
+		{"emdash", "—"},
+		{"bullet", "•"},
+		{"florin", "ƒ"},
+		// Compound name (AGL §2): "f_i" decomposes to its parts.
+		{"f_i", "fi"},
+		// Variant suffix is stripped (AGL §2): "A.alt" → "A".
+		{"A.alt", "A"},
 		{"uni0041", "A"},
 		{"uni004100420043", "ABC"},
 		{"u0041", "A"},
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for malformed `uniXXXX`/`uXXXX` glyph names to lock in the failure behavior of the hex recognisers.

The current tests only cover valid `uni`/`u` names. Given the more complex parsing in `AdobeGlyphToUnicode`, please add a few malformed cases—e.g. odd-length `uni` strings, non-hex characters, or too-short `u` strings (like `"uni004"`, `"uni00ZZ"`, `"u0G41"`)—and assert they return `""` to lock in and document the error-handling behavior for bad hex glyph names.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread internal/pdf/font.go
Comment on lines +776 to +778
{"space", 32, 32, 32, 32},
{"space", -1, 202, 160, -1},
{"space", -1, 202, 173, -1},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Duplicate and conflicting "space" rows in encodingRows look incorrect and likely clobber each other.

There are three entries with glyph name "space": the standard space (32) and two more with Mac=202 and Win=160/173. The later ones will overwrite the first when building the encodings and conflict with the separate nbspace row above. In AGL/PDF encodings these should use distinct glyph names (e.g., nbspace, sfthyphen) and codepoints that match the reference table, to avoid overriding the standard space mapping.

Comment thread internal/pdf/font_test.go
Comment on lines +28 to +37
// StandardEncoding: identity except the typographic-quote slots.
std := EncodingByName("StandardEncoding")
for c := byte(0x20); c < 0x7f; c++ {
if c == 0x27 || c == 0x60 {
continue
}
if std[c] != string(rune(c)) {
t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Extend StandardEncoding tests to cover at least a couple of non-ASCII slots (e.g., bullet, dagger, ligatures) that were silently dropped before.

The current tests only exercise the ASCII range plus the special 0x27/0x60 cases. Since the original bug affected high-range glyphs (bullet, dashes, ligatures, florin, etc.), please add a few targeted checks for representative non-ASCII StandardEncoding entries (e.g., bullet/endash/emdash slots and a ligature) to verify they map to the expected Unicode code points, similar to the WinAnsi high‑byte spot checks.

Suggested implementation:

	// StandardEncoding: identity except the typographic-quote slots.
	std := EncodingByName("StandardEncoding")
	for c := byte(0x20); c < 0x7f; c++ {
		if c == 0x27 || c == 0x60 {
			continue
		}
		if std[c] != string(rune(c)) {
			t.Errorf("StandardEncoding[0x%02x] = %q, want %q", c, std[c], string(rune(c)))
		}
	}
	if std[0x27] != "’" {
		t.Errorf("StandardEncoding[0x27] = %q, want quoteright (’)", std[0x27])
	}
	if std[0x60] != "‘" {
		t.Errorf("StandardEncoding[0x60] = %q, want quoteleft (‘)", std[0x60])
	}

	// StandardEncoding: representative non-ASCII checks for high-byte glyphs.
	// These ensure bullet, dashes, and ligatures are correctly mapped.
	if std[0x95] != "•" {
		t.Errorf("StandardEncoding[0x95] = %q, want bullet (•)", std[0x95])
	}
	if std[0x96] != "–" {
		t.Errorf("StandardEncoding[0x96] = %q, want endash (–)", std[0x96])
	}
	if std[0x97] != "—" {
		t.Errorf("StandardEncoding[0x97] = %q, want emdash (—)", std[0x97])
	}
	if std[0x9F] != "fi" {
		t.Errorf("StandardEncoding[0x9F] = %q, want ligature fi (fi)", std[0x9F])
	}
	if std[0x9E] != "fl" {
		t.Errorf("StandardEncoding[0x9E] = %q, want ligature fl (fl)", std[0x9E])
	}

The exact byte indices for bullet, dashes, and ligatures in StandardEncoding must match the encoding table used in your implementation (likely in something like internal/pdf/font_encoding.go). If your StandardEncoding table uses different indices for bullet, endash, emdash, fi, or fl glyphs, adjust the 0x95/0x96/0x97/0x9F/0x9E constants in these tests to the correct byte values so they align with the actual EncodingByName("StandardEncoding") mapping.

Comment thread internal/pdf/font_test.go
Comment on lines 103 to +112
}{
{"A", "A"},
{"comma", ","},
{"fi", "fi"},
// "fi" is the AGL ligature glyph (U+FB01), NOT the two-letter
// string "f"+"i". This is what pdfminer/pdfplumber return; the
// pre-v0.1.1 table missed this and returned "" (then fell back
// to a (cid:NNN) placeholder).
{"fi", "fi"},
{"fl", "fl"},
{"quoteleft", "‘"},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding tests for malformed uniXXXX/uXXXX glyph names to lock in the failure behavior of the hex recognisers.

The current tests only cover valid uni/u names. Given the more complex parsing in AdobeGlyphToUnicode, please add a few malformed cases—e.g. odd-length uni strings, non-hex characters, or too-short u strings (like "uni004", "uni00ZZ", "u0G41")—and assert they return "" to lock in and document the error-handling behavior for bad hex glyph names.

@hallelx2 hallelx2 review requested due to automatic review settings May 27, 2026 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant