Merged
33 commits
22bec9b
migrated to uv
dwnoble Aug 13, 2025
ee1854e
file formatting, more uv migrations
dwnoble Nov 20, 2025
363ca4c
unit test fixes
dwnoble Nov 20, 2025
0de6c30
python version fix
dwnoble Nov 20, 2025
ff20e87
test fixes
dwnoble Nov 20, 2025
3c0fa7b
Added schema validation
dwnoble Nov 21, 2025
ca1b4bd
updated lockfile
dwnoble Nov 21, 2025
2c25b95
Added initial schema validation service and in memory knowledge graph…
dwnoble Jan 26, 2026
d8d6f6f
merged
dwnoble Jan 26, 2026
96bd55c
cleanup
dwnoble Jan 26, 2026
c3c8b66
test fixes
dwnoble Jan 26, 2026
b0c5955
Apply suggestions from code review
dwnoble Jan 26, 2026
b6b33e2
Apply suggestions from code review
dwnoble Jan 26, 2026
08e0754
test cases
dwnoble Jan 26, 2026
05a9386
refactor
dwnoble Jan 26, 2026
c3884f0
header
dwnoble Jan 26, 2026
3cf2e15
fixed readme
dwnoble Jan 26, 2026
8ae1388
readme fixes
dwnoble Jan 26, 2026
0bce7e7
readme fixes
dwnoble Jan 26, 2026
334411a
readme fixes
dwnoble Jan 26, 2026
22bca66
Added datacommons-cli package and migrated command line utilities to …
dwnoble Jan 28, 2026
13bef43
undid accidental changes
dwnoble Jan 28, 2026
28957c7
fix
dwnoble Jan 28, 2026
3fa5732
suggestion
dwnoble Jan 28, 2026
8747f2e
Apply suggestions from code review
dwnoble Feb 4, 2026
5e89dd7
pr feedback
dwnoble Feb 4, 2026
be933fe
Merge branch 'datacommons-cli' of github.com:dwnoble/datacommons into…
dwnoble Feb 4, 2026
4430737
updated uv.lock
dwnoble Feb 4, 2026
1595b79
Separated test and lint dependencies into separate groups
dwnoble Feb 9, 2026
82ea89a
Set default uv index to pypi.org to fix github test failures
dwnoble Feb 9, 2026
f5ce2b1
comment
dwnoble Feb 9, 2026
b2c5a30
simplified set up
dwnoble Feb 9, 2026
f5027a6
Formatted
dwnoble Feb 9, 2026
4 changes: 2 additions & 2 deletions .github/workflows/ci.yaml
@@ -26,7 +26,7 @@ jobs:
         run: uv sync --locked

       - name: Check formatting with Ruff
-        run: uv run ruff format --check
+        run: uv run --extra lint ruff format --check

       # TODO(dwnoble): Fix formatting issues in datacommons-schema and uncomment these
       #- name: Lint with Ruff
@@ -35,4 +35,4 @@ jobs:
       - name: Run tests
         run: |
           # Enable parallel test execution
-          uv run pytest
+          uv run --extra test pytest
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
     hooks:
       - id: run-tests
         name: Run tests
-        entry: uv run pytest
+        entry: uv run --extra test pytest
         language: system
         pass_filenames: false
         stages: [pre-push]
37 changes: 31 additions & 6 deletions README.md
@@ -29,16 +29,17 @@ This section will guide you through setting up Data Commons locally and defining

 To get started, you'll need to check out the Data Commons repository and set up your local environment.

-#### Check out the repository:
+#### Clone the repository:

 ```bash
 git clone https://github.com/datacommonsorg/datacommons
 cd datacommons
 ```

-The repository contains three main components:
+The repository contains four main components:
 - `datacommons-api`: The REST API server for interacting with Data Commons
 - `datacommons-db`: The database layer for storing and querying data
+- `datacommons-cli`: Command-line interface for interacting with Data Commons
 - `datacommons-schema`: Schema management and validation tools

 #### Create a virtual environment with uv
@@ -52,7 +53,7 @@ uv sync
 Run the test suite to verify your setup:

 ```bash
-uv run pytest
+uv run --extra test pytest
 ```

 Tests are also run automatically before pushing changes.
@@ -71,14 +72,23 @@ Replace the values with your actual GCP project and Spanner instance details. Yo

 #### Start Data Commons:

-Run the `datacommons-api` command using `uv` to start a local development server.
+Run the `datacommons` command using `uv` to start a local development server.

 ```bash
-uv run datacommons-api
+uv run datacommons api start
 ```

 This will start the Data Commons API server on port 5000, ready to receive your schema and data.

+Alternatively, you can set the Spanner configuration using command-line arguments, which take precedence over environment variables:
+
+```bash
+uv run datacommons api start \
+  --gcp-project-id="your-gcp-project-id" \
+  --gcp-spanner-instance-id="your-spanner-instance-id" \
+  --gcp-spanner-database-name="your-spanner-database-name"
+```
+
 ### 2. Define Your Schema

 Data Commons uses JSON-LD to define schemas, building upon RDF, RDFS, and SHACL for robust data modeling and validation. The repository includes example schema and data files in the `examples` directory:
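
The README change in this hunk says that command-line arguments take precedence over environment variables. A minimal sketch of that rule, using only the standard library, might look like the following. The `resolve_settings` helper and the `ENV_VARS` mapping are hypothetical names for illustration, and the environment variable names are an assumption based on the config attributes shown elsewhere in this PR.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Assumption: each setting is read from an environment variable whose name
# matches the config attribute (for example GCP_PROJECT_ID).
ENV_VARS = {
    "gcp_project_id": "GCP_PROJECT_ID",
    "gcp_spanner_instance_id": "GCP_SPANNER_INSTANCE_ID",
    "gcp_spanner_database_name": "GCP_SPANNER_DATABASE_NAME",
}


def resolve_settings(
    cli_args: Mapping[str, str],
    env: Mapping[str, str] | None = None,
) -> dict[str, str]:
    """Return one value per setting, preferring non-empty CLI values."""
    env = os.environ if env is None else env
    resolved = {}
    for name, env_name in ENV_VARS.items():
        # An empty string means "not provided on the command line",
        # which mirrors the default="" used by the options in this PR.
        resolved[name] = cli_args.get(name, "") or env.get(env_name, "")
    return resolved
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.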
@@ -316,4 +326,19 @@ You should see the:
 }
 ]
 }
-```
+```
+
+### Schema Tools
+
+Use the `datacommons schema` command to convert between MCF and JSON-LD formats.
+
+```bash
+# Convert with default settings
+uv run datacommons schema mcf2jsonld data.mcf
+
+# Convert with custom namespace and output file
+uv run datacommons schema mcf2jsonld data.mcf -n "dc:https://datacommons.org/" -o output.jsonld
+
+# Generate compact output
+uv run datacommons schema mcf2jsonld data.mcf -c
+```
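
The `-n "dc:https://datacommons.org/"` argument above packs a prefix and a URL into one `prefix:url` string. The diff does not show how the CLI parses it, so the following is only a sketch of one plausible approach: split on the first colon (the URL itself contains colons) and collect the pairs into a JSON-LD `@context` mapping. Both function names are hypothetical.

```python
def parse_namespace(value: str) -> tuple[str, str]:
    """Split a 'prefix:url' string on the first colon only."""
    prefix, sep, url = value.partition(":")
    if not sep or not prefix or not url:
        raise ValueError(f"expected 'prefix:url', got {value!r}")
    return prefix, url


def build_context(namespaces: list[str]) -> dict[str, str]:
    """Build a JSON-LD @context mapping (prefix -> IRI) from 'prefix:url' strings."""
    return dict(parse_namespace(ns) for ns in namespaces)
```

Using `str.partition` rather than `str.split(":")` avoids breaking the URL apart at the `https:` scheme separator.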
@@ -16,26 +16,50 @@
 import uvicorn

 from datacommons_api.app import app
-from datacommons_api.core.config import get_config
+from datacommons_api.core.config import initialize_config
 from datacommons_api.core.logging import get_logger, setup_logging
 from datacommons_db.session import initialize_db

-
 setup_logging()
 logger = get_logger(__name__)


-@click.command()
+@click.group()
+def api():
+    """Data Commons API CLI suite"""
+    pass
+
+
+@api.command()
 @click.option("--host", default="127.0.0.1", help="Host to bind to.")
 @click.option("--port", default=5000, help="Port to listen on.")
 @click.option("--reload", is_flag=True, help="Enable auto-reload.")
-def main(host: str, port: int, *, reload: bool = False):
+@click.option("--gcp-project-id", default="", help="GCP project id.")
+@click.option("--gcp-spanner-instance-id", default="", help="GCP Spanner instance id.")
+@click.option(
+    "--gcp-spanner-database-name", default="", help="GCP Spanner database name."
+)
+def start(
+    host: str,
+    port: int,
+    reload: bool,
+    gcp_project_id: str,
+    gcp_spanner_instance_id: str,
+    gcp_spanner_database_name: str,
+):
     """Start the FastAPI app with Uvicorn."""
     logger.info("Starting Data Commons...")
-    config = get_config()
+    config = initialize_config(
+        gcp_project_id=gcp_project_id,
+        gcp_spanner_instance_id=gcp_spanner_instance_id,
+        gcp_spanner_database_name=gcp_spanner_database_name,
+    )

     # Initialize the database
     logger.info("Initializing database...")
+    logger.info("GCP Project ID: %s", config.GCP_PROJECT_ID)
+    logger.info("GCP Spanner Instance ID: %s", config.GCP_SPANNER_INSTANCE_ID)
+    logger.info("GCP Spanner Database Name: %s", config.GCP_SPANNER_DATABASE_NAME)
     initialize_db(
         config.GCP_PROJECT_ID,
         config.GCP_SPANNER_INSTANCE_ID,
41 changes: 37 additions & 4 deletions packages/datacommons-api/datacommons_api/core/config.py
@@ -56,6 +56,10 @@ class ProductionConfig(Config):
 }


+# Default configuration
+app_config = config[os.getenv("APP_ENV", "default")]()
+
+
 def validate_config_or_exit(config: Config) -> None:
     """Ensure the configuration is valid"""
     # Ensure GCP Spanner is configured
@@ -65,9 +69,38 @@ def validate_config_or_exit(config: Config) -> None:
         sys.exit(1)


-def get_config() -> Config:
-    """Get the appropriate configuration object based on environment."""
-    env = os.getenv("APP_ENV", "default")
-    app_config = config[env]()
+def initialize_config(
+    gcp_project_id: str = "",
+    gcp_spanner_instance_id: str = "",
+    gcp_spanner_database_name: str = "",
+) -> Config:
+    """
+    Initialize the configuration object based on environment or command line arguments.
+
+    Args:
+        gcp_project_id: Optional GCP project id.
+        gcp_spanner_instance_id: Optional GCP Spanner instance id.
+        gcp_spanner_database_name: Optional GCP Spanner database name.
+
+    Returns:
+        Config: The configuration object.
+    """
+    app_config.GCP_PROJECT_ID = gcp_project_id or app_config.GCP_PROJECT_ID
+    app_config.GCP_SPANNER_INSTANCE_ID = (
+        gcp_spanner_instance_id or app_config.GCP_SPANNER_INSTANCE_ID
+    )
+    app_config.GCP_SPANNER_DATABASE_NAME = (
+        gcp_spanner_database_name or app_config.GCP_SPANNER_DATABASE_NAME
+    )
     validate_config_or_exit(app_config)
     return app_config
Comment on lines +72 to 96
medium

This function modifies the global app_config object. Using global state can make the application harder to reason about, debug, and test, as it creates hidden dependencies and makes the behavior of functions dependent on the order in which they are called.

A better approach would be to avoid modifying global state. For example, this function could create and return a new configuration object, or the configuration object could be instantiated and passed explicitly where needed, rather than relying on a global instance.
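
One way to read the reviewer's suggestion is to build a new configuration object instead of assigning to the module-level `app_config`. The sketch below illustrates the idea with a frozen dataclass. The `Config` class here is a hypothetical stand-in (the PR's real `Config` class body is not shown in this diff), and `with_overrides` is an illustrative name, not part of the codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical immutable stand-in for the PR's Config class."""

    GCP_PROJECT_ID: str = ""
    GCP_SPANNER_INSTANCE_ID: str = ""
    GCP_SPANNER_DATABASE_NAME: str = ""


def with_overrides(base: Config, **overrides: str) -> Config:
    """Return a copy of base in which non-empty overrides replace base values."""
    # Empty strings mean "not provided", so they never overwrite a base value.
    provided = {key: value for key, value in overrides.items() if value}
    return replace(base, **provided)
```

Because `base` is never modified, callers can pass the returned object explicitly (for example to `initialize_db`) and tests can construct independent configurations without resetting shared state.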



+def get_config() -> Config:
+    """
+    Get the configuration object.
+
+    Returns:
+        Config: The configuration object.
+    """
+    return app_config
4 changes: 1 addition & 3 deletions packages/datacommons-api/pyproject.toml
@@ -12,14 +12,12 @@ dependencies = [
"uvicorn[standard]>=0.22.0",
"sqlalchemy>=2.0.0",
"pydantic>=2.0.0",
"click>=8.1.7",
"datacommons-db",
"datacommons-schema",
]
urls = {Homepage = "https://github.com/datacommonsorg/datacommons"}

[project.scripts]
datacommons-api = "datacommons_api.app_cli:main"

[build-system]
requires = ["uv", "setuptools"]
build-backend = "setuptools.build_meta"
33 changes: 33 additions & 0 deletions packages/datacommons-cli/datacommons_cli/cli.py
@@ -0,0 +1,33 @@
# Copyright 2025 Google LLC.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import click

from datacommons_api.core.logging import get_logger, setup_logging
from datacommons_api.api_cli import api as api_cli
from datacommons_schema.schema_cli import schema as schema_cli

setup_logging()
logger = get_logger(__name__)


@click.group()
def cli():
    """Datacommons CLI suite"""
    pass


# Add API and schema CLI commands to the main CLI
cli.add_command(api_cli)
cli.add_command(schema_cli)
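
The file above composes two Click groups under one top-level `datacommons` command, giving invocations such as `datacommons api start` and `datacommons schema mcf2jsonld`. To show the resulting command hierarchy without depending on Click, here is a rough standard-library analogue using `argparse` subparsers. It swaps in a different technique and is not how the PR implements the CLI, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level command tree: <group> <command> [options]."""
    parser = argparse.ArgumentParser(prog="datacommons")
    groups = parser.add_subparsers(dest="group", required=True)

    # "api" group with a "start" command (subset of the PR's options)
    api = groups.add_parser("api")
    api_commands = api.add_subparsers(dest="command", required=True)
    start = api_commands.add_parser("start")
    start.add_argument("--host", default="127.0.0.1")
    start.add_argument("--port", type=int, default=5000)

    # "schema" group with an "mcf2jsonld" command
    schema = groups.add_parser("schema")
    schema_commands = schema.add_subparsers(dest="command", required=True)
    convert = schema_commands.add_parser("mcf2jsonld")
    convert.add_argument("mcf_file")

    return parser
```

In the Click version each package owns its own group, and the top-level CLI only registers them, which keeps `datacommons-api` and `datacommons-schema` free of knowledge about each other.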
15 changes: 15 additions & 0 deletions packages/datacommons-cli/datacommons_cli/version.py
@@ -0,0 +1,15 @@
# Copyright 2026 Google LLC.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.0.1"
30 changes: 30 additions & 0 deletions packages/datacommons-cli/pyproject.toml
@@ -0,0 +1,30 @@
[project]
name = "datacommons-cli"
dynamic = ["version"]
description = 'Data Commons CLI'
license = "Apache-2.0"
readme = "README.md"

requires-python = ">=3.11"
keywords = []
authors = []
dependencies = [
"click>=8.1.7",
"datacommons-api",
"datacommons-db",
"datacommons-schema",
]
urls = {Homepage = "https://github.com/datacommonsorg/datacommons"}

[project.scripts]
datacommons = "datacommons_cli.cli:cli"

[build-system]
requires = ["uv", "setuptools"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
include-package-data = true

[tool.setuptools.dynamic]
version = {attr = "datacommons_cli.version.__version__"}
25 changes: 3 additions & 22 deletions packages/datacommons-schema/README.md
@@ -17,13 +17,13 @@ The `mcf2jsonld` command converts MCF files to JSON-LD format, with support for

 ```bash
 # Basic usage
-datacommons mcf2jsonld input.mcf
+datacommons schema mcf2jsonld input.mcf

 # With custom namespace
-datacommons mcf2jsonld input.mcf --namespace "schema:https://schema.org/"
+datacommons schema mcf2jsonld input.mcf --namespace "schema:https://schema.org/"

 # Output to file with compact format
-datacommons mcf2jsonld input.mcf -o output.jsonld -c
+datacommons schema mcf2jsonld input.mcf -o output.jsonld -c
 ```

 #### Options
@@ -70,25 +70,6 @@ mcf_nodes = parse_mcf_string(mcf_content)
 jsonld = mcf_nodes_to_jsonld(mcf_nodes, compact=True)
 ```

-### Command Line
-
-```bash
-# Convert with default settings
-datacommons mcf2jsonld data.mcf
-
-# Convert with custom namespace and output file
-datacommons mcf2jsonld data.mcf -n "dc:https://datacommons.org/" -o output.jsonld
-
-# Generate compact output
-datacommons mcf2jsonld data.mcf -c
-```
-
-## Dependencies
-
-- Click (for CLI interface)
-- Pydantic (for data validation)
-- JSON-LD processing libraries
-
 ## Contributing

 When contributing to this module:
4 changes: 2 additions & 2 deletions packages/datacommons-schema/datacommons_schema/schema_cli.py
@@ -7,11 +7,11 @@


 @click.group()
-def cli():
+def schema():
     """Data Commons Schema Parsing CLI"""


-@cli.command()
+@schema.command()
 @click.argument("mcf_file", type=click.Path(exists=True))
 @click.option(
     "--namespace",
10 changes: 7 additions & 3 deletions packages/datacommons-schema/pyproject.toml
@@ -18,16 +18,20 @@ classifiers = [
 dependencies = [
 "click",
 "pydantic",
-"pytest"
+"pyshacl",
+"rdflib",
 ]

+[project.optional-dependencies]
+test = [
+"pytest",
+]
+
 [project.urls]
 Documentation = "https://github.com/datacommonsorg/datacommons#readme"
 Issues = "https://github.com/datacommonsorg/datacommons/issues"
 Source = "https://github.com/datacommonsorg/datacommons"

-[project.scripts]
-datacommons-schema = "datacommons_schema.schema_cli:cli"

 [build-system]
 requires = ["uv", "setuptools"]