
Load wikidata from duckdb instead of curated json #29

@adsharma

Description

The knowledge pipeline spec seems to suggest:

Target: 200+ relations, 100K+ pairs across all domains.

Why not shoot for something larger?

https://huggingface.co/datasets/ladybugdb/wikidata-20260401/tree/main

contains a recent Wikidata snapshot with 90 million entities and 750 million edges in DuckDB format, all in under 5 GB.

I've since run graph clustering algorithms on this graph. Sample results:

https://gist.github.com/adsharma/7800d8f2db1eb8f687d8fbfa2d33102d

The pipeline producing these DuckDB files is a simple set of shell and Python scripts. I've tried to give the Wikidata folks feedback so they make something similar available for convenience, but haven't been able to get on their priority radar.
