The knowledge pipeline spec seems to suggest:
Target: 200+ relations, 100K+ pairs across all domains.
Why not shoot for something larger?
https://huggingface.co/datasets/ladybugdb/wikidata-20260401/tree/main
contains a recent Wikidata snapshot with 90 million entities and 750 million edges in DuckDB format, all under 5 GB.
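For anyone who wants to poke at it, here's a minimal sketch of querying the snapshot with the DuckDB Python API. The file name and the table/column names (`entities`, `edges`, `src`) are assumptions about the dataset's schema, so check the actual files on the Hugging Face repo first.

```python
# Minimal sketch: sanity-check the snapshot and look at out-degrees.
# File and table names are assumed; adjust to the actual schema.
import duckdb

con = duckdb.connect("wikidata-20260401.duckdb", read_only=True)

# Rough size check: entity and edge counts.
print(con.execute("SELECT count(*) FROM entities").fetchone())
print(con.execute("SELECT count(*) FROM edges").fetchone())

# Ten most connected entities by out-degree.
print(con.execute("""
    SELECT src, count(*) AS out_degree
    FROM edges
    GROUP BY src
    ORDER BY out_degree DESC
    LIMIT 10
""").fetchdf())
```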
I've since run graph clustering algorithms on this graph. Sample results:
https://gist.github.com/adsharma/7800d8f2db1eb8f687d8fbfa2d33102d
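The gist doesn't pin down which algorithm was used, so here is one hedged way to reproduce something similar: pull an edge list out of DuckDB and run Leiden community detection via python-igraph. Table and column names are again assumptions.

```python
# Sketch: cluster a subgraph of the snapshot with Leiden (python-igraph).
# The full graph has ~750M edges, so start with a bounded sample.
import duckdb
import igraph as ig

con = duckdb.connect("wikidata-20260401.duckdb", read_only=True)
rows = con.execute("SELECT src, dst FROM edges LIMIT 1000000").fetchall()

g = ig.Graph.TupleList(rows, directed=False)
clusters = g.community_leiden(objective_function="modularity")
print(f"{len(clusters)} clusters; largest has {max(clusters.sizes())} nodes")
```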
The pipeline that produces these DuckDB files is a very simple set of shell and Python scripts. I've tried to give this feedback to the Wikidata folks so that something similar is made available for convenience, but haven't been able to get on their priority radar.
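To give a flavor of what such a pipeline looks like (this is my own reconstruction, not the author's actual scripts): stream the Wikidata JSON dump, emit (source, property, target) rows for item-valued claims, and bulk-load the result into DuckDB.

```python
# extract_edges.py -- rough sketch of a dump-to-edge-list step.
# Usage: bzcat latest-all.json.bz2 | python extract_edges.py > edges.csv
import json
import sys

def iter_edges(lines):
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets wrapping the dump
        entity = json.loads(line)
        src = entity.get("id")
        for prop, claims in entity.get("claims", {}).items():
            for claim in claims:
                value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
                if isinstance(value, dict) and "id" in value:
                    yield src, prop, value["id"]  # item-valued statement -> edge

if __name__ == "__main__":
    for src, prop, dst in iter_edges(sys.stdin):
        print(f"{src},{prop},{dst}")
```

The CSV can then be loaded with something like `duckdb wikidata.duckdb -c "CREATE TABLE edges AS SELECT * FROM read_csv_auto('edges.csv')"`.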