The hvectorspaces repository investigates the historical evolution of vector space research through citation network analysis.
To install the required packages, run the following command in your Python environment (Python 3.10+ is recommended):
pip install .[dev]
This document explains how to set up a local PostgreSQL database that mirrors the data we previously hosted on CockroachDB.
The local DB contains two tables:
- openalex_vector_spaces
- per_decade_citation_graph
All data is stored locally in PostgreSQL and loaded from compressed CSV files that are versioned in this repo using Git LFS.
The Cockroach → PostgreSQL export lives in:
exported_db/
├── schema.sql
├── openalex_vector_spaces.csv.gz
└── per_decade_citation_graph.csv.gz
The exported database files inside the exported_db/ directory are large (.csv.gz).
To avoid bloating the Git repository and to make downloads efficient, we store them using Git LFS (Large File Storage).
Download and install Git LFS by following the instructions at https://git-lfs.github.com/ or run:
curl -s https://packagecloud.io/github/git-lfs/install.sh | sudo bash
Enable Git LFS:
git lfs install
When you clone the repository for the first time, Git LFS will automatically download the large files.
After cloning:
git pull
Git LFS automatically downloads the actual CSV files into exported_db/.
Check with:
ls -lh exported_db/
You should see large file sizes, not tiny pointer files.
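File size is a rough signal. A minimal Python sketch of a more direct check is shown here: it reads the first bytes of each export and distinguishes a real gzip stream (magic bytes 0x1f 0x8b) from a Git LFS pointer file, which is a small text file beginning with the Git LFS spec line. The helper name classify_export is hypothetical and not part of this repository.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (the pointer is a small text file).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
# Magic number that starts every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def classify_export(path: Path) -> str:
    """Return 'gzip', 'lfs-pointer', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    return "unknown"


if __name__ == "__main__":
    # Report the status of every export in exported_db/ (run from the project root).
    for csv_path in sorted(Path("exported_db").glob("*.csv.gz")):
        print(f"{csv_path}: {classify_export(csv_path)}")
```

If any file is reported as lfs-pointer, running git lfs pull should replace it with the actual data.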
You’ll need:
PostgreSQL 14+
On macOS, we use Postgres.app:
Download and install from the Postgres.app website.
Open Postgres.app and make sure the server is running.
(Optional but recommended) Add the command-line tools to your PATH via Postgres.app’s preferences or by adding the following line to your shell profile:
export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"
For persistent PATH changes, add the above line to your ~/.bash_profile, ~/.zshrc, or equivalent file depending on your shell.
We assume a database name like hvectorspaces (you can change it, but then update .env accordingly).
From a terminal:
createdb hvectorspaces
Or from psql:
CREATE DATABASE hvectorspaces;
The Python clients expect standard PostgreSQL env vars. Create a .env file in the project root or modify the existing one with the following variables:
PG_HOST=localhost
PG_PORT=5432
PG_DATABASE=hvectorspaces
PG_USER=<your_mac_username>
PG_PASSWORD=
Notes:
PG_USER should normally be your macOS username (the output of the terminal command whoami).
For Postgres.app defaults, no password is required for local connections, so you can leave PG_PASSWORD empty.
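To illustrate how these variables map onto a connection, here is a minimal sketch that turns the PG_* variables into keyword arguments for a PostgreSQL driver. It is not the repository's PostgresClient implementation; the function name pg_settings_from_env and the fallback defaults are assumptions made for this example.

```python
import os
from typing import Mapping, Optional


def pg_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect PG_* variables into connection keyword arguments."""
    env = os.environ if env is None else env
    settings = {
        # Assumed fallbacks, matching the values suggested in this README.
        "host": env.get("PG_HOST", "localhost"),
        "port": int(env.get("PG_PORT", "5432")),
        "dbname": env.get("PG_DATABASE", "hvectorspaces"),
    }
    # Leave out empty values so the driver applies its own defaults
    # (libpq falls back to the operating-system user name).
    if env.get("PG_USER"):
        settings["user"] = env["PG_USER"]
    if env.get("PG_PASSWORD"):
        settings["password"] = env["PG_PASSWORD"]
    return settings
```

With psycopg2 installed, the result could be passed as psycopg2.connect(**pg_settings_from_env()); whether the project itself uses psycopg2 is an assumption here.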
From the project root, you can load the environment variables from the .env file by running:
source .env
In order to create the tables, run the following command from the terminal:
psql -d hvectorspaces -f exported_db/schema.sql
You can verify the tables exist by running
psql -d hvectorspaces -c "\d+ openalex_vector_spaces"
psql -d hvectorspaces -c "\d+ per_decade_citation_graph"
We have a Python script that uses the local PostgreSQL client (PostgresClient) to load the .csv.gz files via COPY.
From the project root:
python -m scripts.create_postgresql_db
This script should:
Connect to PostgreSQL using the env vars in .env.
Load:
exported_db/openalex_vector_spaces.csv.gz → openalex_vector_spaces
exported_db/per_decade_citation_graph.csv.gz → per_decade_citation_graph
If everything works, you should see log output indicating that the rows were loaded.
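For readers curious about what loading a .csv.gz file via COPY can look like, the following is a minimal sketch, not the actual contents of scripts/create_postgresql_db.py. It assumes a psycopg2-style connection (one whose cursors provide copy_expert) and assumes the CSV files start with a header row; the function name copy_gzipped_csv is hypothetical.

```python
import gzip
import re


def copy_gzipped_csv(conn, table: str, csv_gz_path: str) -> None:
    """Stream a gzip-compressed CSV into `table` with COPY ... FROM STDIN."""
    # Table names cannot be bound as query parameters, so accept plain identifiers only.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    # Assumption: the exported CSV files include a header row.
    copy_sql = f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"
    # The file is decompressed on the fly, so the full CSV never sits in memory.
    with gzip.open(csv_gz_path, "rb") as handle:
        with conn.cursor() as cur:
            cur.copy_expert(copy_sql, handle)
    conn.commit()
```

A call might look like copy_gzipped_csv(conn, "openalex_vector_spaces", "exported_db/openalex_vector_spaces.csv.gz"), where conn is an open psycopg2 connection.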
The repository contains three main components:
- The hvectorspaces package for data processing, analysis and visualization;
- The tests folder for unit tests;
- The scripts folder for data acquisition and processing scripts.
The most important module in the hvectorspaces package is the io module, which provides a client for reading from and writing to the PostgreSQL database.
You can create an instance of the PostgresClient class and use its methods to interact with the database. For example, to download the data and generate a map from OpenAlex work ID to the list of IDs of cited works, you can use the following code:
from hvectorspaces.io import PostgresClient

# Map from OpenAlex work ID to the list of IDs of the works it cites
id_to_cited_ids = {}
with PostgresClient() as client:
    citation_map = client.fetch_per_decade_data(1980)
    for oa_id, refs in citation_map:
        id_to_cited_ids[oa_id] = refs
The PostgresClient.fetch_per_decade_data method can also take a list of column names as an optional argument to specify which additional columns to retrieve from the database.
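Once such a map exists, a common next step is to invert it, so that each work points to the works that cite it. The sketch below does this with plain dictionaries; the function name invert_citation_map and the toy IDs are made up for illustration and do not come from the package.

```python
from collections import defaultdict


def invert_citation_map(id_to_cited_ids):
    """Map each cited work ID to the sorted list of works that cite it."""
    cited_by = defaultdict(set)
    for citing_id, cited_ids in id_to_cited_ids.items():
        for cited_id in cited_ids or []:  # tolerate missing (None) reference lists
            cited_by[cited_id].add(citing_id)
    return {cited_id: sorted(citers) for cited_id, citers in cited_by.items()}


# Toy data with made-up IDs, standing in for the map built from the database.
example = {"W1": ["W2", "W3"], "W2": ["W3"], "W3": []}
print(invert_citation_map(example))  # {'W2': ['W1'], 'W3': ['W1', 'W2']}
```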
To run an arbitrary SQL query through the PostgresClient, you can use the PostgresClient.execute_sql method. For example:
from hvectorspaces.io import PostgresClient

with PostgresClient() as client:
    result = client.execute_sql("SELECT * FROM openalex_vector_spaces LIMIT 10;")
    for row in result:
        print(row)
Your .env file should contain the environment variables needed to connect to the PostgreSQL database. These are PG_HOST, PG_DATABASE, PG_USER, PG_PASSWORD, and PG_PORT.
The main table used in this repository is openalex_vector_spaces, which contains the following columns:
- oa_id: OpenAlex work ID (string)
- doi: Digital Object Identifier of the work (string)
- title: Title of the work (string)
- publication_year: Year of publication (integer)
- cited_by_count: Number of times the work has been cited (integer)
- abstract: Abstract of the work (string)
- referenced_works: List of OpenAlex work IDs cited by the work (array of strings)
- domain: Domain of the work (string)
- field: Field of study of the work (string)
- topic: Topic of the work (string)
- layer: Number of hops from the seed for the work in the citation network (integer)
- in_decade_references: List of OpenAlex work IDs cited by the work in the same decade (array of strings)
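The relationship between referenced_works and in_decade_references can be sketched as follows. This is an illustration only: it assumes a decade is identified by its first year (1987 belongs to 1980) and that references absent from the table are dropped; the repository's script may define either point differently. Function names and IDs are hypothetical.

```python
def decade_of(year: int) -> int:
    """Return the first year of the decade containing `year` (1987 -> 1980)."""
    return (year // 10) * 10


def in_decade_references(work_id, referenced_works, publication_years):
    """Keep only the references published in the same decade as `work_id`."""
    own_decade = decade_of(publication_years[work_id])
    return [
        ref
        for ref in referenced_works
        # References missing from the table have no known year and are dropped.
        if ref in publication_years and decade_of(publication_years[ref]) == own_decade
    ]
```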
The scripts folder contains scripts for data acquisition and processing. The main scripts are:
- create_postgresql_db.py: This script was used in a migration from CockroachDB to PostgreSQL. It creates the openalex_vector_spaces table in the PostgreSQL instance with the appropriate schema and loads the data from local CSV files into the table.
- sql_upload_oa_data.py: This script uploads OpenAlex data to the PostgreSQL instance. It reads data from a specified source and populates the openalex_vector_spaces table. Currently, it searches for all works that contain the term "vector space" in their title or abstract, were published after 1920 and have more than 20 citations. Starting from these seed works, it performs a breadth-first search in the citation network to collect all works that cite or are cited by the seed works, up to 2 hops away, filtering out those that have fewer than 20 citations.
- add_in_decade_references_column.py: This script adds the in_decade_references column to the openalex_vector_spaces table. It populates this column with the list of cited works that were published in the same decade as the citing work.
- create_clusters.py: This script uses a pre-defined clustering method (defaults to leiden) to create clusters of works within different decades based on their citation relationships. The script can be updated to fetch additional fields from the database as needed. It can be called from the command line with
python -m scripts.create_clusters --output_path <json_path>
Optional arguments are: --clustering_method (default: leiden), --decade_start (default: 1950), --cluster_size_cutoff (default: 5) and --top_n (default: 10). Additional information can be found in the docstring of the main method.
- create_graph_from_clusters.py: This script generates a citation graph from the clusters created in the previous step. It constructs a directed graph where nodes represent works and edges represent citation relationships between them, divided by decades. The resulting graph is saved in a specified output path. It can be called from the command line with
python -m scripts.create_graph_from_clusters --input_path <json_path> --output_path <graph_path>
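The breadth-first collection described for sql_upload_oa_data.py can be sketched in a few lines. This is a simplified illustration rather than the script's code: neighbors and cited_by_count are hypothetical callables standing in for OpenAlex lookups, and the returned hop count corresponds to the layer column.

```python
from collections import deque


def collect_layers(seeds, neighbors, cited_by_count, max_hops=2, min_citations=20):
    """Breadth-first search from `seeds`; return {work_id: hops from the nearest seed}."""
    layer = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if layer[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for other in neighbors(current):  # works citing or cited by `current`
            if other in layer:
                continue  # already reached by a path at least as short
            if cited_by_count(other) < min_citations:
                continue  # drop works with fewer than `min_citations` citations
            layer[other] = layer[current] + 1
            queue.append(other)
    return layer
```

Because the search is breadth-first, the first time a work is reached is also its shortest distance from any seed, so each work receives a single, minimal layer value.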