dartfx-dataverse

A Python toolkit for interacting with Dataverse repositories

⚠️ Early Development: This project is in its early development stages. While functional, the API may change and some features are still being implemented. We welcome your feedback and contributions!

Overview

dartfx-dataverse is a Python package that facilitates programmatic interactions with Dataverse server installations via their API. The package focuses on discovery and access rather than content management, making it ideal for researchers, data scientists, and developers who need to search and retrieve data from Dataverse repositories.

Key Features

  • 🔍 Powerful Search: Advanced search capabilities with filtering, faceting, and geographic queries
  • 🌍 Server Discovery: Retrieve information about known Dataverse installations worldwide
  • 📦 Dataset Metadata: Retrieve dataset metadata and export formats (DDI, Dublin Core, DataCite)
  • 🛡️ Type-Safe: Built with Pydantic models for robust data validation
  • Performance: Built-in request caching reduces repeated API calls and server load
  • 🔧 Configurable: Flexible error handling, SSL verification, and session management
  • 📚 Well-Documented: Comprehensive documentation with examples

Current Features

  • Retrieve server installation information and metadata
  • Search datasets, dataverses, and files
  • Advanced search with filters, facets, and geographic queries
  • Retrieve dataset metadata and export formats
  • Paginated result handling
  • Comprehensive error handling
  • Request caching support

Requirements

  • Python 3.12 or higher
  • uv or pip for package management

Command Line Interface

The toolkit includes a command-line interface, dartfx-dataverse, for quick discovery:

# List worldwide installations
dartfx-dataverse installations --limit 10

# Search Harvard Dataverse
dartfx-dataverse search "climate change" --per-page 5

# Get server information
dartfx-dataverse info dataverse.harvard.edu

# Get dataset metadata (JSON)
dartfx-dataverse dataset doi:10.5683/SP3/FNS9EF -H borealisdata.ca

# Get dataset in DDI format
dartfx-dataverse dataset doi:10.5683/SP3/FNS9EF -H borealisdata.ca --export ddi

# List metadata blocks
dartfx-dataverse metadatablocks dataverse.harvard.edu
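
To give a rough idea of what a request such as the `--export ddi` example might look like at the HTTP level, here is a minimal sketch that builds a URL for the Dataverse native dataset export endpoint (`/api/datasets/export`). The helper name `build_export_url` is hypothetical and not part of dartfx-dataverse, and whether the CLI calls exactly this endpoint internally is an assumption.

```python
from urllib.parse import urlencode, urlunsplit


def build_export_url(hostname: str, persistent_id: str, exporter: str = "ddi") -> str:
    """Build a dataset export URL for a Dataverse server (hypothetical helper)."""
    # urlencode percent-encodes the ':' and '/' characters inside the DOI.
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return urlunsplit(("https", hostname, "/api/datasets/export", query, ""))


url = build_export_url("borealisdata.ca", "doi:10.5683/SP3/FNS9EF")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client, which can help when checking what a server returns independently of the toolkit.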

Installation

Note: This package is not yet published on PyPI. Please use the development installation method below.

Development Installation (Current Method)

To install the package, clone the repository and install locally:

  1. Clone the Repository:

    git clone https://github.com/DataArtifex/dataverse-toolkit.git
    cd dataverse-toolkit
  2. Install in Editable Mode:

    Using uv (recommended):

    uv sync

    Or using pip:

    pip install -e ".[dev]"
  3. Using Hatch (Recommended for Development):

    # Install Hatch
    uv tool install hatch
    
    # Activate development environment
    hatch shell
    
    # Run tests
    hatch run test

Future PyPI Release

Once stable, this package will be released on PyPI. Installation will then be:

Using uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dartfx-dataverse
uv pip install dartfx-dataverse

Using pip

pip install dartfx-dataverse

Quick Start

Discover Dataverse Installations

Get a list of known Dataverse installations worldwide. This functionality leverages the community-maintained dataverse-installations project:

from dartfx.dataverse import fetch_dataverse_installations

# Fetch all known installations
installations = fetch_dataverse_installations()

# Display first 5
for installation in installations[:5]:
    print(f"{installation.name}: {installation.hostname}")
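
Once the list is loaded, a common next step is to look up a single installation by hostname. The sketch below assumes only the `name` and `hostname` attributes shown above; `find_installation` is a hypothetical helper, and the sample objects are stand-ins so the snippet runs without network access.

```python
from types import SimpleNamespace


def find_installation(installations, hostname):
    """Return the first installation whose hostname matches, or None."""
    wanted = hostname.strip().lower()
    for installation in installations:
        if installation.hostname.lower() == wanted:
            return installation
    return None


# Stand-in objects for illustration; real code would pass the list
# returned by fetch_dataverse_installations().
sample = [
    SimpleNamespace(name="Harvard Dataverse", hostname="dataverse.harvard.edu"),
    SimpleNamespace(name="Borealis", hostname="borealisdata.ca"),
]
match = find_installation(sample, "BorealisData.ca")
print(match.name if match else "not found")
```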

Connect to a Server

Create a connection to a specific Dataverse server:

from dartfx.dataverse import DataverseServer, ServerInstallation

# Create server installation object
harvard = ServerInstallation(
    name="Harvard Dataverse",
    hostname="dataverse.harvard.edu"
)

# Create server connection
server = DataverseServer(harvard)

# Get server information
info = server.get_server_info()
print(f"Server status: {info['status']}")

Search for Datasets

Perform searches with various options:

from dartfx.dataverse import SearchParameters

# Simple search (reuses the `server` object created above)
results = server.search_simple("climate change")
print(f"Found {results['data']['total_count']} results")

# Advanced search with parameters
params = SearchParameters(
    q="climate change",
    type="dataset",
    per_page=20,
    sort="date",
    order="desc",
    show_facets=True
)

results = server.search(params)
for item in results['data']['items']:
    print(f"- {item['name']}")
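
Searches that match more results than `per_page` need several requests. The Dataverse Search API pages results with a `start` offset alongside `per_page`; the sketch below only computes those offsets. `page_starts` is a hypothetical helper, and whether `SearchParameters` exposes a `start` field is an assumption to check against the API reference.

```python
from collections.abc import Iterator


def page_starts(total_count: int, per_page: int) -> Iterator[int]:
    """Yield the zero-based start offset of each page of results."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    yield from range(0, total_count, per_page)


# 45 results at 20 per page need three requests.
print(list(page_starts(45, 20)))  # [0, 20, 40]
```

Each offset would then be combined with the `total_count` value returned by the first search to decide when to stop requesting pages.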

More Examples

# Each SearchParameters object below can be passed to server.search(params)

# Search with filters
params = SearchParameters(
    q="*",
    type="dataset",
    fq=[
        "publicationDate:[2020 TO *]",  # From 2020 onwards
        "authorName:Smith"               # Author is Smith
    ]
)

# Geographic search
params = SearchParameters(
    q="environment",
    geo_point="42.3601,-71.0589",  # Boston, MA
    geo_radius="50"                 # 50 km radius
)

# Search with metadata fields
params = SearchParameters(
    q="health",
    metadata_fields=["citation", "identifier", "subjects"]
)
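
The `fq` strings above use Solr-style `field:value` and range syntax. When filters are assembled from user input, small helpers can keep the formatting consistent. The functions below are hypothetical and not part of dartfx-dataverse; they only build strings in the same shape as the examples above.

```python
def range_filter(field: str, start="*", end="*") -> str:
    """Build a range filter such as 'publicationDate:[2020 TO *]'."""
    return f"{field}:[{start} TO {end}]"


def term_filter(field: str, value: str) -> str:
    """Build a field:value filter, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{field}:{value}"


fq = [range_filter("publicationDate", 2020), term_filter("authorName", "Smith")]
print(fq)
```

The resulting list has the same form as the `fq` argument in the filter example above.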

Documentation

Comprehensive documentation is available, including:

  • Installation Guide: Detailed installation instructions and requirements
  • Quick Start: Get up and running in minutes
  • Usage Guide: In-depth coverage of all features
  • API Reference: Complete API documentation
  • Examples: Real-world use cases and code examples
  • Contributing Guide: How to contribute to the project

Visit the full documentation for more details.

Caching

The library automatically caches API requests to improve performance and reduce server load.

Default Behavior

  • Backend: In-memory (memory)
  • Persistence: Cached data is lost when the Python process exits.
  • CLI Scope: In the CLI, the cache is active only for the duration of a single command (effectively transient).
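
To make the in-memory behavior concrete, here is a toy cache written with the standard library only. It is a conceptual illustration, not the library's implementation (which relies on requests_cache); the `MemoryCache` class and its parameters are hypothetical.

```python
import time


class MemoryCache:
    """Toy in-memory cache keyed by URL, with optional expiry in seconds."""

    def __init__(self, expire_after=None, clock=time.monotonic):
        self._expire_after = expire_after
        self._clock = clock
        self._entries = {}

    def get(self, url, fetch):
        """Return the cached value for url, calling fetch(url) on a miss."""
        entry = self._entries.get(url)
        now = self._clock()
        if entry is not None:
            stored_at, value = entry
            if self._expire_after is None or now - stored_at < self._expire_after:
                return value
        value = fetch(url)
        self._entries[url] = (now, value)
        return value


calls = []
cache = MemoryCache()
fetch = lambda url: calls.append(url) or f"response for {url}"
cache.get("https://dataverse.harvard.edu/api/info/version", fetch)
cache.get("https://dataverse.harvard.edu/api/info/version", fetch)
print(len(calls))  # 1: the second call is served from memory
```

Because the entries live in a plain dictionary, they disappear when the process exits, which mirrors the persistence note above.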

Custom Configuration (Python API)

You can provide your own requests_cache.CachedSession to customize the caching behavior (e.g., using a persistent SQLite backend or setting an expiration time):

import requests_cache
from dartfx.dataverse import DataverseServer

# Create a persistent cache with 24-hour expiration
session = requests_cache.CachedSession(
    'dataverse_cache',
    backend='sqlite',
    expire_after=86400  # 24 hours
)

server = DataverseServer(
    server="dataverse.harvard.edu",
    session=session
)
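
As a readability note, requests_cache documents that `expire_after` also accepts a `datetime.timedelta`, which avoids hand-computed second counts. A minimal sketch (the constant name `ONE_DAY` is just an illustration):

```python
from datetime import timedelta

# Same 24-hour lifetime as the 86400 seconds used above.
ONE_DAY = timedelta(hours=24)
print(int(ONE_DAY.total_seconds()))  # 86400
```

Passing `expire_after=ONE_DAY` to `CachedSession` should then behave the same as the integer form.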

Clearing the Cache

If you are using a persistent cache and need to clear it manually:

# Clear all cached responses
server.session.cache.clear()

Disabling Caching

To disable caching entirely, you can pass a standard requests.Session:

import requests
from dartfx.dataverse import DataverseServer

server = DataverseServer(
    server="dataverse.harvard.edu",
    session=requests.Session()
)

Project Status & Roadmap

Current Version: 0.1.0 (Development)

This is an early development release. The core functionality is working, but APIs may change.

Roadmap

v0.2.0 (Planned)

  • Pydantic models for search results and datasets
  • Enhanced error messages and debugging
  • Batch operation support
  • Progress indicators for long-running operations

Completed Features (v0.1.x)

  • Dataset metadata retrieval (DDI, Dublin Core, DataCite)
  • Server information and versioning
  • Export formats discovery
  • Metadata block listing
  • Search API wrapper with caching
  • CLI for discovery and search

Future (v0.3.0+)

  • File metadata retrieval
  • Support for additional metadata formats (Croissant, schema.org)
  • Dataset and file download capabilities
  • Download progress tracking and resuming
  • Stable API (v1.0.0)

Contributing

We welcome contributions! Here's how you can help:

Getting Started

  1. Fork the repository
  2. Clone your fork: git clone https://github.com/YOUR-USERNAME/dataverse-toolkit.git
  3. Create a feature branch: git checkout -b feature/your-feature-name
  4. Set up development environment:
    uv tool install hatch
    hatch shell

Development Workflow

# Run tests
hatch run test

# Run tests with coverage
hatch run cov

# Type checking
hatch run types:check

# Format code
ruff format .

# Lint code
ruff check . --fix

Submitting Changes

  1. Make your changes and add tests
  2. Ensure all tests pass: hatch run test
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to your fork: git push origin feature/your-feature-name
  5. Submit a pull request

See the Contributing Guide for detailed guidelines.

Code of Conduct

This project follows the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code.

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

Support

If you encounter issues or have questions:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue if needed

Citation

If you use this package in your research, please cite:

@software{dartfx_dataverse,
  author = {Heus, Pascal},
  title = {dartfx-dataverse: A Python toolkit for Dataverse repositories},
  year = {2024},
  url = {https://github.com/DataArtifex/dataverse-toolkit}
}

Maintained by Data Artifex | Author: Pascal Heus
