The knowledge pipeline spec seems to suggest:
Target: 200+ relations, 100K+ pairs across all domains.
Why not shoot for something larger?
https://huggingface.co/datasets/ladybugdb/wikidata-20260401/tree/main
contains a recent Wikidata snapshot with 90 million entities and 750 million edges in DuckDB format, all under 5 GB.
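For anyone who wants to poke at it, here's a minimal sketch of querying the snapshot with the DuckDB Python API. The file name and the table/column names (`entities`, `edges`, `src`) are assumptions about the dataset's schema, so check the actual files on the Hugging Face repo first.

```python
# Minimal sketch: sanity-check the snapshot and look at out-degrees.
# File and table names are assumed; adjust to the actual schema.
import duckdb

con = duckdb.connect("wikidata-20260401.duckdb", read_only=True)

# Rough size check: entity and edge counts.
print(con.execute("SELECT count(*) FROM entities").fetchone())
print(con.execute("SELECT count(*) FROM edges").fetchone())

# Ten most connected entities by out-degree.
print(con.execute("""
    SELECT src, count(*) AS out_degree
    FROM edges
    GROUP BY src
    ORDER BY out_degree DESC
    LIMIT 10
""").fetchdf())
```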
I've since run graph clustering algorithms on this graph. Sample results:
https://gist.github.com/adsharma/7800d8f2db1eb8f687d8fbfa2d33102d
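The gist doesn't pin down which algorithm was used, so here is one hedged way to reproduce something similar: pull an edge list out of DuckDB and run Leiden community detection via python-igraph. Table and column names are again assumptions.

```python
# Sketch: cluster a subgraph of the snapshot with Leiden (python-igraph).
# The full graph has ~750M edges, so start with a bounded sample.
import duckdb
import igraph as ig

con = duckdb.connect("wikidata-20260401.duckdb", read_only=True)
rows = con.execute("SELECT src, dst FROM edges LIMIT 1000000").fetchall()

g = ig.Graph.TupleList(rows, directed=False)
clusters = g.community_leiden(objective_function="modularity")
print(f"{len(clusters)} clusters; largest has {max(clusters.sizes())} nodes")
```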
The pipeline that produces these DuckDB files is a very simple set of shell and Python scripts. I've tried to give this feedback to the Wikidata folks so that something similar is made available for convenience, but haven't been able to get on their priority radar.
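To give a flavor of what such a pipeline looks like (this is my own reconstruction, not the author's actual scripts): stream the Wikidata JSON dump, emit (source, property, target) rows for item-valued claims, and bulk-load the result into DuckDB.

```python
# extract_edges.py -- rough sketch of a dump-to-edge-list step.
# Usage: bzcat latest-all.json.bz2 | python extract_edges.py > edges.csv
import json
import sys

def iter_edges(lines):
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets wrapping the dump
        entity = json.loads(line)
        src = entity.get("id")
        for prop, claims in entity.get("claims", {}).items():
            for claim in claims:
                value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
                if isinstance(value, dict) and "id" in value:
                    yield src, prop, value["id"]  # item-valued statement -> edge

if __name__ == "__main__":
    for src, prop, dst in iter_edges(sys.stdin):
        print(f"{src},{prop},{dst}")
```

The CSV can then be loaded with something like `duckdb wikidata.duckdb -c "CREATE TABLE edges AS SELECT * FROM read_csv_auto('edges.csv')"`.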