A Python script that efficiently parses YAGO knowledge base TTL (Turtle) files to extract time-related metadata without loading the entire file into memory.
This parser extracts time-related information from YAGO knowledge base files, including:
- Birth dates (`schema:birthDate`)
- Death dates (`schema:deathDate`)
- Start dates (`schema:startDate`)
- End dates (`schema:endDate`)
- Publication dates (`schema:datePublished`)
For each entity, the parser finds the earliest and latest dates across all time properties, making it easy to aggregate temporal information.
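Conceptually, the extraction works like the following minimal sketch (an illustration of the approach only, not the script's actual code; the exact triple layout and property prefixes in yago-facts.ttl are assumptions here):

```python
import re

# Illustrative sketch of the streaming approach: one pass over the file,
# keeping only an (earliest, latest) pair per entity in memory.
# Assumes simple N-Triples-style lines such as:
#   yago:Augusto_Pinochet schema:deathDate "2006-12-10"^^xsd:date .
TIME_PROPS = ("schema:birthDate", "schema:deathDate", "schema:startDate",
              "schema:endDate", "schema:datePublished")
DATE_RE = re.compile(r'"(\d{4}-\d{2}-\d{2})')

def scan_dates(path):
    spans = {}  # entity -> [earliest_date, latest_date]
    with open(path, encoding="utf-8") as fh:  # line by line, never read() all
        for line in fh:
            if not any(prop in line for prop in TIME_PROPS):
                continue
            match = DATE_RE.search(line)
            if not match:
                continue
            entity = line.split(None, 1)[0]   # subject of the triple
            date = match.group(1)
            span = spans.setdefault(entity, [date, date])
            span[0] = min(span[0], date)      # earliest across all properties
            span[1] = max(span[1], date)      # latest across all properties
    return spans
```

Because ISO 8601 dates compare lexicographically in chronological order, plain string min/max is enough to track the earliest and latest dates.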
✅ Memory-efficient: Streams large TTL files line by line
✅ Fast parsing: Processes millions of triples efficiently
✅ Unicode decoding: Converts encoded entity names (e.g., `_u0028_` → `(`) for easy searching
✅ Multiple export formats: CSV and JSON output
✅ Wikipedia linking: Extracts Wikipedia URLs when available
✅ Progress tracking: Verbose mode shows real-time parsing progress
✅ Flexible output: Configurable result limits and formats
- Python 3.7 or higher
- No external dependencies (uses only Python standard library)
Note: All Yago data (dumps, extracted text, OpenSearch index, and PostgreSQL database) is stored under the ${WIKI_DATA} path (TBD GB).
Only the Ubuntu OS, packages, and additional software reside on the system drive.
Phase 1: System Preparation
- Set environment configuration variables:
```bash
# Source the DeepRedAI environment (recommended — sets all paths automatically)
source deepred-env.sh

# Or set paths manually:
# export WIKI_DATA="/mnt/data/wikipedia"
```

Phase 2: Download Yago Data
Switch to the wiki user and download the dump:

```bash
sudo -iu wiki
```

Verify the environment variable is set:

```bash
echo $WIKI_DATA
# Should output your data path, e.g., $DEEPRED_ROOT/wikipedia
```

Note: If $WIKI_DATA is empty, source the environment or set it manually:

```bash
source deepred-env.sh
# Or: export WIKI_DATA=/mnt/data/wikipedia
```

Then download the data:

```bash
mkdir -p ${WIKI_DATA}/yago
cd ${WIKI_DATA}/yago
wget -c --timeout=60 --tries=10 https://yago-knowledge.org/data/yago4.5/yago-4.5.0.2.zip
```

This download is ~12 GB and takes 15-30 min. If interrupted, simply re-run the same command to resume.
Phase 3: Unzip Yago Data
Extract only the yago-facts.ttl file (contains Wikipedia entity time data):
```bash
cd ${WIKI_DATA}/yago
unzip yago-4.5.0.2.zip yago-facts.ttl
```

The unzip process may take 5-10 minutes depending on disk speed. The extracted TTL file will be several gigabytes in size.
Verify the extraction:
```bash
ls -lh ${WIKI_DATA}/yago
# Should show the archive and the extracted .ttl file:
# -rw-r--r-- 1 wiki wiki 12G Apr 9 2024 yago-4.5.0.2.zip
# -rw-r--r-- 1 wiki wiki 22G Apr 4 2024 yago-facts.ttl
```

Phase 4: Extract Yago Data
Create ${WIKI_DATA}/scripts/yago_parser.py (see scripts).
Run export to CSV:
```bash
python3 ${WIKI_DATA}/scripts/yago_parser.py ${WIKI_DATA}/yago/yago-facts.ttl --csv ${WIKI_DATA}/yago/yago-facts.csv --verbose
```

Run export to JSON:

```bash
python3 ${WIKI_DATA}/scripts/yago_parser.py ${WIKI_DATA}/yago/yago-facts.ttl --json ${WIKI_DATA}/yago/yago-facts.json --verbose
```

| Option | Description |
|---|---|
| `ttl_file` | Path to the YAGO TTL file (required) |
| `--csv FILE` | Export results to CSV file |
| `--json FILE` | Export results to JSON file |
| `--verbose`, `-v` | Show parsing progress |
| `--limit N` | Display N entities in summary (default: 20) |
| `--no-summary` | Skip console summary output |
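For orientation, the options in the table map naturally onto Python's argparse. The sketch below is one plausible wiring, not necessarily how yago_parser.py defines its CLI:

```python
import argparse

# A minimal argparse layout matching the options table above;
# the real yago_parser.py may be wired differently.
def build_arg_parser():
    p = argparse.ArgumentParser(
        description="Extract time metadata from a YAGO TTL file")
    p.add_argument("ttl_file", help="Path to the YAGO TTL file")
    p.add_argument("--csv", metavar="FILE", help="Export results to CSV file")
    p.add_argument("--json", metavar="FILE", help="Export results to JSON file")
    p.add_argument("--verbose", "-v", action="store_true",
                   help="Show parsing progress")
    p.add_argument("--limit", type=int, default=20, metavar="N",
                   help="Display N entities in summary (default: 20)")
    p.add_argument("--no-summary", action="store_true",
                   help="Skip console summary output")
    return p
```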
Example CSV output:

```csv
Entity,Wikipedia_URL,Earliest_Date,Latest_Date
A-1_(wrestler),https://en.wikipedia.org/wiki/A-1_(wrestler),1977-05-22,1977-05-22
Augusto_Pinochet,https://en.wikipedia.org/wiki/Augusto_Pinochet,1915-11-25,2006-12-10
Andrei_Tarkovsky,https://en.wikipedia.org/wiki/Andrei_Tarkovsky,1932-04-04,1986-12-29
```

Note: Entity names are automatically decoded from YAGO's Unicode encoding format for easy searching. For example, `A-1__u0028_wrestler_u0029_` becomes `A-1_(wrestler)`, preserving underscores that are part of the Wikipedia article name.
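As a rough illustration of that decoding step (the script's actual decoder may handle more cases), each `_uXXXX_` escape can be replaced with its Unicode character:

```python
import re

# Hypothetical sketch of the _uXXXX_ decoding; assumes four hex digits
# per escape, as in YAGO's _u0028_ for "(" and _u0029_ for ")".
_ESCAPE_RE = re.compile(r"_u([0-9A-Fa-f]{4})_")

def decode_entity(name: str) -> str:
    # Substitute each escape with the character for its hex code point.
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), name)

print(decode_entity("A-1__u0028_wrestler_u0029_"))  # -> A-1_(wrestler)
```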
Example JSON output:

```json
[
  {
    "entity": "Augusto_Pinochet",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Augusto_Pinochet",
    "earliest_date": "1915-11-25",
    "latest_date": "2006-12-10"
  },
  {
    "entity": "Andrei_Tarkovsky",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Andrei_Tarkovsky",
    "earliest_date": "1932-04-04",
    "latest_date": "1986-12-29"
  }
]
```

You can also use the parser programmatically in your Python code:
```python
from yago_parser import YagoTimeExtractor

# Create parser
extractor = YagoTimeExtractor('yago-tiny.ttl')

# Parse the file
extractor.parse_file(verbose=True)

# Get results
results = extractor.get_results()
for entity, wiki_url, earliest, latest in results:
    print(f"{entity}: {earliest} to {latest}")

# Export
extractor.export_csv('output.csv')
extractor.export_json('output.json')

# Print summary
extractor.print_summary(limit=30)
```

The YAGO parser output contains Wikipedia URLs in many languages. To use this data with your local English Wikipedia database, you need to normalize the output using the normalize_yago_output.py script.
See YAGO Normalizer Setup for details on:
- Converting non-English Wikipedia URLs to English equivalents (see the illustrative sketch after this list)
- Adding Wikipedia page IDs from your local database
- Validating articles exist in your database
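To make the language issue concrete, the illustrative helper below (not part of normalize_yago_output.py) shows how the language edition can be read off a Wikipedia URL, which is the first step in deciding whether a row needs normalization:

```python
from urllib.parse import urlparse

# Illustrative helper only: detect which Wikipedia language edition a URL
# points at, e.g. "fr" for fr.wikipedia.org.
def wikipedia_language(url: str) -> str:
    host = urlparse(url).netloc      # e.g. "fr.wikipedia.org"
    return host.split(".")[0]        # language subdomain

print(wikipedia_language("https://fr.wikipedia.org/wiki/Paris"))              # fr
print(wikipedia_language("https://en.wikipedia.org/wiki/Andrei_Tarkovsky"))   # en
```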
Quick Example:
```bash
# Step 1: Parse YAGO data
python yago_parser.py yago-facts.ttl --csv yago_raw.csv

# Step 2: Normalize to English Wikipedia with page IDs
python normalize_yago_output.py yago_raw.csv --output yago_normalized.csv --skip-missing

# Result: yago_normalized.csv contains only English Wikipedia articles
# that exist in your local database, with Wikipedia page IDs included
```
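Once normalized, the CSV can be loaded for downstream use. In this sketch the column names, including the added Page_ID column, are assumptions about normalize_yago_output.py's output rather than a documented schema:

```python
import csv

# Sketch of loading the normalized output, keyed by Wikipedia page ID.
# Column names (Page_ID, etc.) are assumed, not documented.
def load_normalized(path):
    by_page_id = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            by_page_id[row["Page_ID"]] = row
    return by_page_id
```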