Problem
Some arXiv mirror metadata contains duplicated LaTeX/math fragments caused by mixed text/MathJax extraction.
Example from /abs/2604.07983:
A Natural \gtrsim 100\times\gtrsim 100\times Telescope ... at z=1.37z=1.37
The intended display is closer to:
A Natural $\gtrsim 100\times$ Telescope ... at $z=1.37$
A scan found similar title artifacts such as:
- \mathbb{P}^1\mathbb{P}^1
- \mathcal{N}=1\mathcal{N}=1
- GL(d_1)\times GL(d_2)GL(d_1)\times GL(d_2)
- \mathrm{GL}_4\mathrm{GL}_4
- repeated isotope / ion fragments
Expected
The mirror should normalize known adjacent duplicate metadata fragments during seeding and startup so existing packaged databases and freshly seeded databases render clean titles/metadata.
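As a rough illustration of the startup half of this expectation, a backfill pass could rewrite only the rows whose title changes under a cleanup function. This is a minimal sketch, not the mirror's actual code: the `papers` table, its `id` and `title` columns, the use of SQLite, and the name `backfill_titles` are all assumptions made for the example. The cleanup function is passed in as `clean`.

```python
import sqlite3
from typing import Callable


def backfill_titles(conn: sqlite3.Connection, clean: Callable[[str], str]) -> int:
    """Rewrite titles that change under `clean`; return how many rows were updated.

    Hypothetical schema: papers(id INTEGER PRIMARY KEY, title TEXT).
    """
    rows = conn.execute("SELECT id, title FROM papers").fetchall()
    updates = []
    for paper_id, title in rows:
        if title is None:
            continue
        cleaned = clean(title)
        if cleaned != title:
            # Parameter order matches the UPDATE statement below.
            updates.append((cleaned, paper_id))
    conn.executemany("UPDATE papers SET title = ? WHERE id = ?", updates)
    conn.commit()
    return len(updates)
```

Because already-clean rows are skipped, a second run on the same database should update nothing, which makes the pass safe to repeat on every startup.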
Proposed Fix
Add an arXiv metadata cleanup helper shared by seed-time import and startup backfill, and add regression checks for known duplicated LaTeX fixtures.
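One possible shape for the shared helper is sketched below, assuming a regex-based heuristic rather than a list of known fixtures. The function name `collapse_duplicate_fragments`, the 3 to 81 character fragment window, and the set of characters treated as math hints are all assumptions chosen for illustration. It collapses a fragment only when it is immediately repeated and looks math-like. It does not try to restore the `$...$` delimiters.

```python
import re

# Assumption: a fragment counts as math-like if it contains any of these characters.
_MATH_HINT = re.compile(r"[\\^_={}]")

# A fragment of 3 to 81 characters, starting with a non-space character,
# that is immediately followed by an exact copy of itself.
_ADJACENT_DUP = re.compile(r"(\S[^\n]{2,80}?)\1")


def collapse_duplicate_fragments(text: str) -> str:
    """Collapse adjacent duplicated math-like fragments into a single copy."""

    def _dedupe(match: re.Match) -> str:
        fragment = match.group(1)
        if _MATH_HINT.search(fragment):
            return fragment
        # Leave ordinary repeated words untouched.
        return match.group(0)

    return _ADJACENT_DUP.sub(_dedupe, text)


# Regression fixtures taken from the examples in this issue.
FIXTURES = {
    r"\mathbb{P}^1\mathbb{P}^1": r"\mathbb{P}^1",
    r"\mathcal{N}=1\mathcal{N}=1": r"\mathcal{N}=1",
    r"GL(d_1)\times GL(d_2)GL(d_1)\times GL(d_2)": r"GL(d_1)\times GL(d_2)",
    r"\mathrm{GL}_4\mathrm{GL}_4": r"\mathrm{GL}_4",
}

for raw, expected in FIXTURES.items():
    assert collapse_duplicate_fragments(raw) == expected
```

A heuristic like this would also collapse an intentional repeat such as `\alpha\alpha`, so a production helper may want to restrict itself to known fixtures or to records flagged by the scan.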