arXiv metadata renders duplicated LaTeX/math fragments #16

@hqhq1025

Problem

Some arXiv mirror metadata contains duplicated LaTeX/math fragments caused by mixed text/MathJax extraction.

Example from /abs/2604.07983:

A Natural \gtrsim 100\times\gtrsim 100\times Telescope ... at z=1.37z=1.37

The intended display is closer to:

A Natural $\gtrsim 100\times$ Telescope ... at $z=1.37$

A scan found similar title artifacts such as:

  • \mathbb{P}^1\mathbb{P}^1
  • \mathcal{N}=1\mathcal{N}=1
  • GL(d_1)\times GL(d_2)GL(d_1)\times GL(d_2)
  • \mathrm{GL}_4\mathrm{GL}_4
  • repeated isotope / ion fragments

Expected

The mirror should collapse known adjacent duplicate fragments in metadata, both at seed time and at startup, so that existing packaged databases and freshly seeded databases alike render clean titles and metadata.
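
For illustration, the startup half of this could look roughly like the sketch below. It assumes a SQLite database with a hypothetical `papers(id, title)` table; the function name `backfill_titles` and the `clean` callable are likewise assumptions, not names from the mirror's codebase. The pass rewrites only rows whose title actually changes, so running it on every startup is harmless once the data is clean.

```python
import sqlite3
from typing import Callable


def backfill_titles(conn: sqlite3.Connection, clean: Callable[[str], str]) -> int:
    """Apply `clean` to every stored title and return how many rows changed.

    Assumes a hypothetical `papers(id, title)` table.
    """
    rows = conn.execute("SELECT id, title FROM papers").fetchall()
    updates = []
    for paper_id, title in rows:
        if title is None:
            continue
        cleaned = clean(title)
        if cleaned != title:
            updates.append((cleaned, paper_id))
    # `with conn` commits on success and rolls back if an update raises.
    with conn:
        conn.executemany("UPDATE papers SET title = ? WHERE id = ?", updates)
    return len(updates)
```

Passing the cleanup function in as an argument lets seed-time import and startup backfill share one normalizer.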

Proposed Fix

Add an arXiv metadata cleanup helper shared by seed-time import and startup backfill, and add regression checks for known duplicated LaTeX fixtures.
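
A minimal sketch of what such a helper might look like, using the examples above as regression fixtures. The name `collapse_duplicated_math`, the `wrap` flag, and the heuristic itself are assumptions for discussion, not the mirror's actual implementation. The heuristic: a fragment is collapsed only if it is immediately repeated and contains a character that hints at LaTeX (`\`, `=`, `^`, `_`, `{`, `}`), so ordinary words with doubled syllables are left alone.

```python
import re

# Shortest fragment of 3+ characters immediately followed by identical copies.
_ADJACENT_DUP = re.compile(r"(.{3,}?)\1+")
# Characters treated as a hint that a fragment is LaTeX/math (an assumption).
_MATH_HINTS = set("\\=^_{}")


def collapse_duplicated_math(text: str, wrap: bool = True) -> str:
    """Collapse adjacent duplicated math fragments; optionally wrap them in $...$."""
    out = []
    pos = 0
    while True:
        m = _ADJACENT_DUP.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        fragment = m.group(1)
        if _MATH_HINTS.isdisjoint(fragment):
            # Looks like prose (e.g. "bonbon"): keep it and rescan one char later.
            out.append(text[pos:m.start() + 1])
            pos = m.start() + 1
            continue
        out.append(text[pos:m.start()])
        out.append(f"${fragment}$" if wrap else fragment)
        pos = m.end()
    return "".join(out)


# Regression fixtures taken from the titles listed in this issue.
FIXTURES = {
    r"\mathbb{P}^1\mathbb{P}^1": r"$\mathbb{P}^1$",
    r"\mathcal{N}=1\mathcal{N}=1": r"$\mathcal{N}=1$",
    r"GL(d_1)\times GL(d_2)GL(d_1)\times GL(d_2)": r"$GL(d_1)\times GL(d_2)$",
    r"\mathrm{GL}_4\mathrm{GL}_4": r"$\mathrm{GL}_4$",
}
```

Because the output no longer contains an adjacent duplicate, applying the helper a second time is a no-op, which matters if it runs on every startup. The heuristic would also collapse legitimately repeated LaTeX, so a real implementation may prefer an explicit list of known fragments.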
