TED Issues 085-105 (2008) Reexported in Different Format (Sept 2023) #1

@monneyboi

Summary

We've discovered that TED (Tenders Electronic Daily) reexported issues 085-105 from 2008 on September 29, 2023, changing their format from the original TED META XML format to the INTERNAL_OJS R2.0.5 XML format.

This is a significant data anomaly: it raises questions about data integrity and versioning, and about why the reexport happened at all.

Key Facts

  • Affected Issues: 200800085 through 200800105 (21 issues)
  • Date Range: ~May 2-30, 2008
  • Reexport Date: September 29, 2023
  • Format Change: TED META XML → INTERNAL_OJS R2.0.5
  • Size Impact: Archive sizes dropped from ~50MB to ~9MB
  • Current Status: TED servers still serve the reexported version (verified Nov 10, 2025)
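The affected range can be pinned down in code so that later checks refer to a single definition. This is a minimal sketch, assuming issue IDs are handled as integers in the `YYYYNNNNN` form shown above; the constant and function names are illustrative and not taken from the scraper.

```python
# Issue IDs as listed under Key Facts: 200800085 through 200800105.
FIRST_REEXPORTED = 200800085
LAST_REEXPORTED = 200800105


def is_reexported(issue_id: int) -> bool:
    """Return True if the issue falls inside the reexported range."""
    return FIRST_REEXPORTED <= issue_id <= LAST_REEXPORTED


def affected_issues() -> list[int]:
    """List every issue ID in the reexported range, inclusive."""
    return list(range(FIRST_REEXPORTED, LAST_REEXPORTED + 1))
```

`len(affected_issues())` gives 21, matching the count above.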

Format Comparison

Original Format (Issues 001-084, 106+)

  • ZIP files per language (e.g., en_20080430_084_meta_org.zip)
  • ~50 ZIP files per issue
  • TED META XML structure

Reexported Format (Issues 085-105)

  • Document-based directories with opoce-input/ subdirectories
  • Files: {DOC_ID}_2008.{lang} (e.g., 114495_2008.en)
  • ~42,000 individual XML files per issue
  • INTERNAL_OJS R2.0.5 DTD format
  • File modification timestamps: Sept 29, 2023 18:23 UTC+2
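The two layouts can be told apart from member names alone, before any XML is parsed. The sketch below is one way to do that, assuming member names follow the two patterns quoted above (two-letter lowercase language codes, `opoce-input/` directories). The regular expressions and function names are hypothetical and may need loosening for real archives.

```python
import re
from typing import Iterable

# Patterns derived from the two example names in this issue; assumptions, not a spec.
META_ZIP = re.compile(r"(?:^|/)[a-z]{2}_\d{8}_\d{3}_meta_org\.zip$")
OJS_DOC = re.compile(r"(?:^|/)\d+/opoce-input/\d+_\d{4}\.[a-z]{2}$")


def classify_member(name: str) -> str:
    """Classify a single archive member name by layout."""
    if META_ZIP.search(name):
        return "meta_xml"
    if OJS_DOC.search(name):
        return "internal_ojs"
    return "other"


def classify_archive(names: Iterable[str]) -> str:
    """Classify a whole archive from its member names."""
    kinds = {classify_member(n) for n in names}
    has_meta = "meta_xml" in kinds
    has_ojs = "internal_ojs" in kinds
    if has_meta and has_ojs:
        return "mixed"
    if has_meta:
        return "meta_xml"
    if has_ojs:
        return "internal_ojs"
    return "unknown"
```

A "mixed" or "unknown" result is worth logging, since neither case is described in this issue.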

Technical Details

All files within the reexported archives carry the Sept 29, 2023 modification timestamp, while surrounding issues maintain their original 2008 structure and format.
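The timestamp observation can be turned into a repeatable check. The sketch below assumes the downloaded issue archive is a tar file that Python's `tarfile` module can open, which this issue does not state. The function names are illustrative. Dates are compared in UTC, where 18:23 UTC+2 is 16:23 on the same day.

```python
import tarfile
from datetime import date, datetime, timezone
from typing import Iterable, Set


def tar_mtimes(archive_path: str) -> list[float]:
    """Collect modification times (epoch seconds) of regular files in a tar archive."""
    with tarfile.open(archive_path) as tar:
        return [member.mtime for member in tar if member.isfile()]


def distinct_mtime_dates(mtimes: Iterable[float]) -> Set[date]:
    """Reduce epoch timestamps to the set of distinct UTC calendar dates."""
    return {datetime.fromtimestamp(t, tz=timezone.utc).date() for t in mtimes}


def looks_reexported(mtimes: Iterable[float],
                     reexport_date: date = date(2023, 9, 29)) -> bool:
    """True only if every file was modified on the reexport date (and there is at least one file)."""
    return distinct_mtime_dates(mtimes) == {reexport_date}
```

Running `looks_reexported(tar_mtimes(path))` over all 2008 archives would also help answer whether other ranges were reexported on the same day.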

Example document structure in reexported archives:

200800085/
├── 114238/
│   └── opoce-input/
│       ├── 114238_2008.bg
│       ├── 114238_2008.en  (INTERNAL_OJS R2.0.5 XML)
│       └── ... (all EU languages)
└── ... (1693 documents total)
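Paths in this layout carry the issue, document ID, year, and language, so they can be parsed without opening the files. The sketch below assumes every path matches the tree above exactly, including the document ID repeated in the directory and file name. `DocFile`, `parse_doc_path`, and `count_documents` are hypothetical names.

```python
import re
from typing import Iterable, NamedTuple, Optional


class DocFile(NamedTuple):
    issue: str
    doc_id: str
    year: int
    lang: str


# Pattern derived from the example tree; an assumption, not a documented layout.
DOC_PATH = re.compile(
    r"^(?P<issue>\d{9})/(?P<dir>\d+)/opoce-input/"
    r"(?P<doc>\d+)_(?P<year>\d{4})\.(?P<lang>[a-z]{2})$"
)


def parse_doc_path(path: str) -> Optional[DocFile]:
    """Parse one member path, or return None if it does not fit the layout."""
    m = DOC_PATH.match(path)
    if m is None or m["dir"] != m["doc"]:
        return None
    return DocFile(m["issue"], m["doc"], int(m["year"]), m["lang"])


def count_documents(paths: Iterable[str]) -> int:
    """Count distinct document IDs among the paths that fit the layout."""
    parsed = (parse_doc_path(p) for p in paths)
    return len({d.doc_id for d in parsed if d is not None})
```

Applied to the member list of issue 200800085, `count_documents` should reproduce the document total shown in the tree if the layout assumption holds.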

Questions for Investigation

  1. Why these specific 21 issues? What triggered the reprocessing of this particular date range?

  2. Data quality concerns? Were there errors or data quality issues in the original exports that necessitated reprocessing?

  3. Content differences? Are there any differences in the actual procurement data between the original and reexported versions?

  4. Format choice? Why use INTERNAL_OJS R2.0.5 instead of maintaining the META XML format for consistency?

  5. Other reexports? Are there other date ranges in 2008 or other years that were similarly reexported?

  6. Authoritative version? For data integrity purposes, which version should be considered authoritative?

  7. Documentation? Is there any official EU Publications Office documentation, change log, or bulletin from September 2023 about this event?

Impact on Our Project

Our scraper currently handles both formats correctly:

  • TedMetaXmlParser processes the original format
  • TedInternalOjsParser processes the reexported format
  • Data continuity is maintained

However, we should:

  • Document this anomaly in project documentation
  • Consider reaching out to EU Publications Office for clarification
  • Verify data integrity across the format boundary
  • Add logging to track which parser processes which issues
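For the logging item, one option is a small helper that records the parser class name against the issue ID each time a parser is chosen. This is a sketch only: the two classes below are empty stand-ins for the real `TedMetaXmlParser` and `TedInternalOjsParser`, and the logger name and `record_parser` function are made up for illustration.

```python
import logging

logger = logging.getLogger("ted.parser_audit")  # hypothetical logger name


class TedMetaXmlParser:
    """Stand-in for the project's real parser; illustration only."""


class TedInternalOjsParser:
    """Stand-in for the project's real parser; illustration only."""


def record_parser(issue_id: int, parser: object) -> str:
    """Log which parser class handles an issue and return the class name."""
    name = type(parser).__name__
    logger.info("issue=%s parser=%s", issue_id, name)
    return name
```

Grepping the resulting log for `parser=TedInternalOjsParser` would then show whether any issue outside 085-105 is routed to the reexport parser.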

Verification

This analysis is based on:

  • Direct comparison of archive structures and file timestamps
  • Fresh downloads from TED servers (Nov 10, 2025)
  • Examination of XML structure and DTD declarations
  • Size and file count comparisons
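The XML structure check can be partly automated by reading only the root element of each file. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` and stops at the first start event. The function name `root_tag` is illustrative, and the actual root element names in TED files are not confirmed here.

```python
import io
import xml.etree.ElementTree as ET


def root_tag(xml_bytes: bytes) -> str:
    """Return the root element's tag, with any namespace prefix removed."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        # The first start event belongs to the root element.
        return elem.tag.rsplit("}", 1)[-1]
    raise ValueError("no root element found")
```

Tallying `root_tag` results per issue gives a quick view of where the format boundary falls, independent of file names.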

Metadata

Labels: anomaly, question
