Normalize duplicated arXiv LaTeX metadata #17

Open
hqhq1025 wants to merge 1 commit into aiming-lab:main from hqhq1025:codex/fix-arxiv-title-latex

hqhq1025 commented May 14, 2026

Summary

Fixes arXiv metadata formatting issues caused by mixed text/MathJax extraction and raw TeX display.

This is broader than a single paper title: it normalizes duplicated LaTeX/math fragments at seed time and startup, and routes all user-facing arXiv title/abstract/comment/journal-ref displays through a shared formatter.

Examples handled:

  • \gtrsim 100\times\gtrsim 100\times -> ≥ 100× for display, and a single cleaned TeX fragment in stored metadata.
  • z=1.37z=1.37 -> z=1.37 for display, and $z=1.37$ in stored metadata.
  • \mathbb{P}^1\mathbb{P}^1 -> P^1 for display.
  • \mathcal{N}=1\mathcal{N}=1 -> N=1 for display.
  • GL(d_1)\times GL(d_2)GL(d_1)\times GL(d_2) -> one GL(d_1)× GL(d_2) display fragment.
  • Raw URL TeX commands such as \url{...} and \href{...}{...} no longer appear in displayed metadata.
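A minimal sketch of how the duplicated fragments above could be collapsed and formatted for display. This is not the code in sites/arxiv/metadata_cleaning.py: the names collapse_adjacent_duplicates and to_display, the "math hint" heuristic, and the small symbol table are all assumptions made for illustration. The symbol choices mirror the examples listed above (for instance \gtrsim shown as ≥).

```python
import re

# Hypothetical helpers for illustration; not the PR's actual implementation.

# A fragment of 3+ characters immediately followed by an identical copy.
_REPEATED = re.compile(r"(\S.{2,}?)\1", re.DOTALL)
# Only collapse fragments that look like TeX/math, so that ordinary words
# such as "murmur" are left alone.
_MATH_HINT = re.compile(r"[\\=^_]")

# Assumed display mapping, taken from the examples in this description.
_SYMBOLS = {r"\gtrsim": "≥", r"\times": "×"}
_FONT_WRAPPER = re.compile(r"\\(?:mathbb|mathcal)\{([^{}]*)\}")


def collapse_adjacent_duplicates(text: str) -> str:
    """Keep one copy of a math-like fragment that appears twice in a row."""

    def _keep_one(match: re.Match) -> str:
        fragment = match.group(1)
        if _MATH_HINT.search(fragment):
            return fragment
        return match.group(0)  # not math-like: leave both copies

    previous = None
    while previous != text:  # repeat until nothing changes
        previous = text
        text = _REPEATED.sub(_keep_one, text)
    return text


def to_display(text: str) -> str:
    """Collapse duplicates, then replace a few TeX commands with plain text."""
    text = collapse_adjacent_duplicates(text)
    text = _FONT_WRAPPER.sub(r"\1", text)  # \mathbb{P} -> P
    for command, symbol in _SYMBOLS.items():
        text = text.replace(command, symbol)
    return text


if __name__ == "__main__":
    print(to_display(r"\gtrsim 100\times\gtrsim 100\times"))
    print(to_display(r"\mathbb{P}^1\mathbb{P}^1"))
```

The backreference approach is quadratic in the worst case, which seems acceptable for short metadata strings but would need care on long abstracts.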

Affected display surfaces now use display fields:

  • abstract page
  • search results
  • category pages
  • listing pages
  • author pages
  • account recent library
  • library and starred pages
  • export detail
  • simulated PDF preview

BibTeX/export download paths keep the cleaned raw metadata instead of the display-formatted text.
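The split between display fields and stored metadata might look roughly like the sketch below. The PaperView class, the title_display field name, and the build_view and bibtex_title_field helpers are hypothetical; the actual field names in sites/arxiv/app.py may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PaperView:
    """Hypothetical record carrying both forms of a title."""

    title: str          # cleaned raw TeX, kept for BibTeX/export downloads
    title_display: str  # formatted text, used by user-facing templates


def build_view(raw_title: str, formatter: Callable[[str], str]) -> PaperView:
    # The formatter stands in for the shared display formatter.
    return PaperView(title=raw_title, title_display=formatter(raw_title))


def bibtex_title_field(view: PaperView) -> str:
    # Export paths read the raw field, never the display field.
    return "title = {" + view.title + "}"


if __name__ == "__main__":
    view = build_view(r"A lens at $z=1.37$", lambda s: s.replace("$", ""))
    print(view.title_display)
    print(bibtex_title_field(view))
```

Keeping both fields on one record means templates and export code cannot accidentally swap formatters, at the cost of storing the title twice.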

Adds:

  • sites/arxiv/metadata_cleaning.py
  • scripts/check_arxiv_metadata_latex.py

Verification

  • python3 scripts/check_arxiv_metadata_latex.py
  • python3 -m py_compile sites/arxiv/app.py sites/arxiv/metadata_cleaning.py scripts/check_arxiv_metadata_latex.py
  • Flask test-client render checks across:
    • /abs/2604.07983
    • /search?query=2025mkn&searchtype=all
    • /list/astro-ph.CO/recent
    • /category/astro-ph
    • /author/Lemon,%20Cameron
  • Verified rendered pages do not contain duplicated \gtrsim 100\times\gtrsim 100\times, adjacent duplicated math blocks, z=1.37z=1.37, \url{...}, or \href{...}{...}.
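The render checks above could be automated along these lines. This is a sketch, not the contents of scripts/check_arxiv_metadata_latex.py: the FORBIDDEN table and the find_violations name are assumptions, and the function only scans an HTML string so that it runs without the app.

```python
import re

# Hypothetical pattern table based on the strings named in the checks above.
FORBIDDEN = {
    "url": re.compile(r"\\url\{"),
    "href": re.compile(r"\\href\{"),
    "duplicated_z": re.compile(re.escape("z=1.37z=1.37")),
    "duplicated_gtrsim": re.compile(
        re.escape(r"\gtrsim 100\times\gtrsim 100\times")
    ),
    # The same $...$ block twice in a row, optionally separated by whitespace.
    "adjacent_math": re.compile(r"(\$[^$]+\$)\s*\1"),
}


def find_violations(html: str) -> list[str]:
    """Return the names of all forbidden patterns found in rendered HTML."""
    return [name for name, pattern in FORBIDDEN.items() if pattern.search(html)]


if __name__ == "__main__":
    print(find_violations("<h1>Lensed quasar at z=1.37</h1>"))
    print(find_violations(r"<p>See \url{https://example.org}</p>"))
```

With a Flask test client, the input would presumably come from something like client.get("/abs/2604.07983").get_data(as_text=True), with an assertion that find_violations returns an empty list for each path.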

Fixes #16.

hqhq1025 force-pushed the codex/fix-arxiv-title-latex branch from 40e51c9 to 84d893d on May 14, 2026 11:25
Development

Successfully merging this pull request may close these issues.

arXiv metadata renders duplicated LaTeX/math fragments