YAGO Time Metadata Parser

A Python script that efficiently parses YAGO knowledge base TTL (Turtle) files to extract time-related metadata without loading the entire file into memory.

Overview

This parser extracts time-related information from YAGO knowledge base files, including:

  • Birth dates (schema:birthDate)
  • Death dates (schema:deathDate)
  • Start dates (schema:startDate)
  • End dates (schema:endDate)
  • Publication dates (schema:datePublished)

For each entity, the parser finds the earliest and latest dates across all time properties, making it easy to aggregate temporal information.
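
Conceptually, this is a single streaming pass: read a line, check whether its predicate is one of the time properties, and fold the date into a per-entity minimum and maximum. Below is a minimal sketch of that idea (illustrative only, not the actual yago_parser.py code); it assumes simplified one-triple-per-line input with a quoted xsd:date object.

import re

# Illustrative sketch -- not the actual yago_parser.py implementation.
TIME_PREDICATES = {
    'schema:birthDate', 'schema:deathDate', 'schema:startDate',
    'schema:endDate', 'schema:datePublished',
}
DATE_RE = re.compile(r'"(\d{4}-\d{2}-\d{2})"')  # matches "1915-11-25"^^xsd:date

def scan_dates(path):
    """One streaming pass: per-entity earliest/latest date strings."""
    earliest, latest = {}, {}
    with open(path, encoding='utf-8') as f:  # iterates line by line, O(1) memory
        for line in f:
            parts = line.split(None, 2)  # subject, predicate, rest
            if len(parts) < 3 or parts[1] not in TIME_PREDICATES:
                continue
            match = DATE_RE.search(parts[2])
            if not match:
                continue
            entity, date = parts[0], match.group(1)
            # ISO-8601 dates compare correctly as strings, so min/max suffice.
            earliest[entity] = min(date, earliest.get(entity, date))
            latest[entity] = max(date, latest.get(entity, date))
    return earliest, latest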

Features

Memory-efficient: Streams large TTL files line by line
Fast parsing: Processes millions of triples efficiently
Unicode decoding: Decodes YAGO's escaped entity names (e.g., _u0028_ becomes "(") for easy searching
Multiple export formats: CSV and JSON output
Wikipedia linking: Extracts Wikipedia URLs when available
Progress tracking: Verbose mode shows real-time parsing progress
Flexible output: Configurable result limits and formats

Requirements

  • Python 3.7 or higher
  • No external dependencies (uses only Python standard library)

Installation

Note: All YAGO data (dumps, extracted text, OpenSearch index, and PostgreSQL database) is stored under the ${WIKI_DATA} path (TBD GB). Only the Ubuntu OS, packages, and additional software reside on the system drive.

Phase 1: System Preparation

  1. Set environment configuration variables:
# Source the DeepRedAI environment (recommended — sets all paths automatically)
source deepred-env.sh

# Or set paths manually:
# export WIKI_DATA="/mnt/data/wikipedia"

Phase 2: Download Yago Data

Switch to the wiki user:

sudo -iu wiki

Verify the environment variable is set:

echo $WIKI_DATA
# Should output your data path, e.g., $DEEPRED_ROOT/wikipedia

Note: If $WIKI_DATA is empty, source the environment or set it manually:

source deepred-env.sh
# Or: export WIKI_DATA=/mnt/data/wikipedia

Then download the data:

mkdir -p ${WIKI_DATA}/yago
cd ${WIKI_DATA}/yago
wget -c --timeout=60 --tries=10 https://yago-knowledge.org/data/yago4.5/yago-4.5.0.2.zip

This download is ~12 GB and takes 15-30 minutes on a fast connection. If interrupted, re-run the same command; the -c flag resumes the partial download.

Phase 3: Unzip Yago Data

Extract only the yago-facts.ttl file (contains Wikipedia entity time data):

cd ${WIKI_DATA}/yago
unzip yago-4.5.0.2.zip yago-facts.ttl

The unzip process may take 5-10 minutes depending on disk speed. The extracted yago-facts.ttl file is roughly 22 GB.

Verify the extraction:

ls -lh ${WIKI_DATA}/yago
# Should show the archive and the extracted .ttl file:
# -rw-r--r-- 1 wiki wiki  12G Apr  9  2024 yago-4.5.0.2.zip
# -rw-r--r-- 1 wiki wiki  22G Apr  4  2024 yago-facts.ttl

Extraction

Phase 4: Extract Yago Data

Create ${WIKI_DATA}/scripts/yago_parser.py (see scripts).

Run export to CSV:

python3 ${WIKI_DATA}/scripts/yago_parser.py ${WIKI_DATA}/yago/yago-facts.ttl --csv ${WIKI_DATA}/yago/yago-facts.csv --verbose

Run export to JSON:

python3 ${WIKI_DATA}/scripts/yago_parser.py ${WIKI_DATA}/yago/yago-facts.ttl --json ${WIKI_DATA}/yago/yago-facts.json --verbose

yago_parser.py

Command-Line Options

Option           Description
ttl_file         Path to the YAGO TTL file (required)
--csv FILE       Export results to a CSV file
--json FILE      Export results to a JSON file
--verbose, -v    Show parsing progress
--limit N        Display N entities in the summary (default: 20)
--no-summary     Skip the console summary output
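
For reference, a plausible argparse skeleton matching the documented options (illustrative; the real script's code may differ in detail):

import argparse

def build_arg_parser():
    # Hypothetical skeleton mirroring the documented options above.
    p = argparse.ArgumentParser(
        description='Extract time metadata from a YAGO TTL file')
    p.add_argument('ttl_file', help='Path to the YAGO TTL file')
    p.add_argument('--csv', metavar='FILE', help='Export results to a CSV file')
    p.add_argument('--json', metavar='FILE', help='Export results to a JSON file')
    p.add_argument('--verbose', '-v', action='store_true',
                   help='Show parsing progress')
    p.add_argument('--limit', type=int, default=20, metavar='N',
                   help='Display N entities in the summary')
    p.add_argument('--no-summary', action='store_true',
                   help='Skip the console summary output')
    return p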

Output Format

CSV Format

Entity,Wikipedia_URL,Earliest_Date,Latest_Date
A-1_(wrestler),https://en.wikipedia.org/wiki/A-1_(wrestler),1977-05-22,1977-05-22
Augusto_Pinochet,https://en.wikipedia.org/wiki/Augusto_Pinochet,1915-11-25,2006-12-10
Andrei_Tarkovsky,https://en.wikipedia.org/wiki/Andrei_Tarkovsky,1932-04-04,1986-12-29

Note: Entity names are automatically decoded from YAGO's Unicode encoding format for easy searching. For example, A-1__u0028_wrestler_u0029_ becomes A-1_(wrestler), preserving underscores that are part of the Wikipedia article name.
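
The decoding itself is mechanical: YAGO escapes a character as _uXXXX_, where XXXX is the hexadecimal Unicode code point. A standalone sketch of the idea (the script ships its own implementation):

import re

def decode_yago_entity(name: str) -> str:
    """Replace YAGO's _uXXXX_ escapes with the corresponding character."""
    return re.sub(
        r'_u([0-9A-Fa-f]{4})_',
        lambda m: chr(int(m.group(1), 16)),
        name,
    )

print(decode_yago_entity('A-1__u0028_wrestler_u0029_'))  # A-1_(wrestler)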

JSON Format

[
  {
    "entity": "Augusto_Pinochet",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Augusto_Pinochet",
    "earliest_date": "1915-11-25",
    "latest_date": "2006-12-10"
  },
  {
    "entity": "Andrei_Tarkovsky",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Andrei_Tarkovsky",
    "earliest_date": "1932-04-04",
    "latest_date": "1986-12-29"
  }
]
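
Either export is easy to consume downstream. For example, loading the JSON file back into Python (field names as shown above):

import json

with open('yago-facts.json', encoding='utf-8') as f:
    records = json.load(f)

# Index by entity name for quick lookups.
dates = {r['entity']: (r['earliest_date'], r['latest_date']) for r in records}
print(dates['Andrei_Tarkovsky'])  # ('1932-04-04', '1986-12-29')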

Python API

You can also use the parser programmatically in your Python code:

from yago_parser import YagoTimeExtractor

# Create parser
extractor = YagoTimeExtractor('yago-tiny.ttl')

# Parse the file
extractor.parse_file(verbose=True)

# Get results
results = extractor.get_results()
for entity, wiki_url, earliest, latest in results:
    print(f"{entity}: {earliest} to {latest}")

# Export
extractor.export_csv('output.csv')
extractor.export_json('output.json')

# Print summary
extractor.print_summary(limit=30)
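
Because get_results() returns plain tuples, downstream filtering is ordinary Python. For example, to keep only entities whose dates fall within a window (assuming the four-field tuple layout shown above):

# Entities whose entire date span lies within the 20th century.
in_window = [
    (entity, earliest, latest)
    for entity, wiki_url, earliest, latest in extractor.get_results()
    if earliest >= '1900-01-01' and latest <= '1999-12-31'
]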

Next Steps: Normalization

The YAGO parser output contains Wikipedia URLs in many languages. To use this data with your local English Wikipedia database, you need to normalize the output using the normalize_yago_output.py script.

See YAGO Normalizer Setup for details on:

  • Converting non-English Wikipedia URLs to English equivalents
  • Adding Wikipedia page IDs from your local database
  • Validating articles exist in your database

Quick Example:

# Step 1: Parse YAGO data
python yago_parser.py yago-facts.ttl --csv yago_raw.csv

# Step 2: Normalize to English Wikipedia with page IDs
python normalize_yago_output.py yago_raw.csv --output yago_normalized.csv --skip-missing

# Result: yago_normalized.csv contains only English Wikipedia articles
# that exist in your local database, with Wikipedia page IDs included

References