A Python toolkit for interacting with Dataverse repositories
⚠️ Early Development: This project is in its early development stages. While functional, the API may change and some features are still being implemented. We welcome your feedback and contributions!
dartfx-dataverse is a Python package that facilitates programmatic interactions with Dataverse server installations via their API. The package focuses on discovery and access rather than content management, making it ideal for researchers, data scientists, and developers who need to search and retrieve data from Dataverse repositories.
- 🔍 Powerful Search: Advanced search capabilities with filtering, faceting, and geographic queries
- 🌍 Server Discovery: Retrieve information about known Dataverse installations worldwide
- 📦 Dataset Metadata: Retrieve dataset metadata and export formats (DDI, Dublin Core, schema.org)
- 🛡️ Type-Safe: Built with Pydantic models for robust data validation
- ⚡ Performance: Built-in request caching for improved performance
- 🔧 Configurable: Flexible error handling, SSL verification, and session management
- 📚 Well-Documented: Comprehensive documentation with examples
- Retrieve server installation information and metadata
- Search datasets, dataverses, and files
- Advanced search with filters, facets, and geographic queries
- Retrieve dataset metadata and export formats
- Paginated result handling
- Comprehensive error handling
- Request caching support
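As a rough illustration of paginated result handling, the sketch below walks a search page by page until every reported result has been seen. It assumes only the response shape used in the search examples of this README (`results['data']['total_count']` and `results['data']['items']`). The helper `iter_search_items`, the `fetch_page(start, per_page)` callable, and the `start` offset are hypothetical names chosen for this sketch and are not part of the documented toolkit API.

```python
from collections.abc import Callable, Iterator
from typing import Any


def iter_search_items(
    fetch_page: Callable[[int, int], dict[str, Any]],
    per_page: int = 10,
) -> Iterator[dict[str, Any]]:
    """Yield every item from a paginated search, one page at a time.

    `fetch_page(start, per_page)` is a hypothetical callable returning a
    response shaped like {"data": {"total_count": int, "items": [...]}}.
    """
    start = 0
    while True:
        data = fetch_page(start, per_page)["data"]
        items = data["items"]
        yield from items
        start += len(items)
        # Stop on an empty page or once every reported result has been seen.
        if not items or start >= data["total_count"]:
            break


# Stand-in for a real server call so the sketch runs offline.
_FAKE_RESULTS = [{"name": f"Dataset {i}"} for i in range(23)]


def fake_fetch(start: int, per_page: int) -> dict[str, Any]:
    """Return one slice of the fake results in the assumed response shape."""
    page = _FAKE_RESULTS[start:start + per_page]
    return {"data": {"total_count": len(_FAKE_RESULTS), "items": page}}


names = [item["name"] for item in iter_search_items(fake_fetch, per_page=10)]
print(len(names))
```

In real use, `fetch_page` would wrap whatever call returns one page of results from the server; writing the loop as a generator lets callers stop early without fetching the remaining pages.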
- Python 3.12 or higher
- uv or pip for package management
The toolkit includes a command-line interface, dartfx-dataverse, for quick discovery:
# List worldwide installations
dartfx-dataverse installations --limit 10
# Search Harvard Dataverse
dartfx-dataverse search "climate change" --per-page 5
# Get server information
dartfx-dataverse info dataverse.harvard.edu
# Get dataset metadata (JSON)
dartfx-dataverse dataset doi:10.5683/SP3/FNS9EF -H borealisdata.ca
# Get dataset in DDI format
dartfx-dataverse dataset doi:10.5683/SP3/FNS9EF -H borealisdata.ca --export ddi
# List metadata blocks
dartfx-dataverse metadatablocks dataverse.harvard.edu
Note: This package is not yet published on PyPI. Please use the development installation method below.
To install the package, clone the repository and install locally:
- Clone the Repository:
git clone https://github.com/DataArtifex/dataverse-toolkit.git
cd dataverse-toolkit
- Install in Editable Mode:
Using uv (recommended):
uv sync
Or using pip:
pip install -e ".[dev]"
- Using Hatch (Recommended for Development):
# Install Hatch
uv tool install hatch
# Activate development environment
hatch shell
# Run tests
hatch run test
Once stable, this package will be released on PyPI. Installation will then be:
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dartfx-dataverse
uv pip install dartfx-dataverse
# Or, using pip
pip install dartfx-dataverse
Get a list of known Dataverse installations worldwide. This functionality leverages the community-maintained dataverse-installations project:
from dartfx.dataverse import fetch_dataverse_installations
# Fetch all known installations
installations = fetch_dataverse_installations()
# Display first 5
for installation in installations[:5]:
    print(f"{installation.name}: {installation.hostname}")
Create a connection to a specific Dataverse server:
from dartfx.dataverse import DataverseServer, ServerInstallation
# Create server installation object
harvard = ServerInstallation(
    name="Harvard Dataverse",
    hostname="dataverse.harvard.edu"
)
# Create server connection
server = DataverseServer(harvard)
# Get server information
info = server.get_server_info()
print(f"Server status: {info['status']}")
Perform searches with various options:
from dartfx.dataverse import SearchParameters
# Simple search
results = server.search_simple("climate change")
print(f"Found {results['data']['total_count']} results")
# Advanced search with parameters
params = SearchParameters(
    q="climate change",
    type="dataset",
    per_page=20,
    sort="date",
    order="desc",
    show_facets=True
)
results = server.search(params)
for item in results['data']['items']:
    print(f"- {item['name']}")
# Search with filters
params = SearchParameters(
    q="*",
    type="dataset",
    fq=[
        "publicationDate:[2020 TO *]",  # From 2020 onwards
        "authorName:Smith"  # Author is Smith
    ]
)
# Geographic search
params = SearchParameters(
    q="environment",
    geo_point="42.3601,-71.0589",  # Boston, MA
    geo_radius="50"  # 50 km radius
)
# Search with metadata fields
params = SearchParameters(
    q="health",
    metadata_fields=["citation", "identifier", "subjects"]
)
Comprehensive documentation is available, including:
- Installation Guide: Detailed installation instructions and requirements
- Quick Start: Get up and running in minutes
- Usage Guide: In-depth coverage of all features
- API Reference: Complete API documentation
- Examples: Real-world use cases and code examples
- Contributing Guide: How to contribute to the project
Visit the full documentation for more details.
The library automatically caches API requests to improve performance and reduce server load.
- Backend: In-memory (memory)
- Persistence: Cached data is lost when the Python process exits.
- CLI Scope: In the CLI, the cache is active only for the duration of a single command (effectively transient).
You can provide your own requests_cache.CachedSession to customize the caching behavior (e.g., using a persistent SQLite backend or setting an expiration time):
import requests_cache
from dartfx.dataverse import DataverseServer
# Create a persistent cache with 24-hour expiration
session = requests_cache.CachedSession(
    'dataverse_cache',
    backend='sqlite',
    expire_after=86400  # 24 hours
)
server = DataverseServer(
    server="dataverse.harvard.edu",
    session=session
)
If you are using a persistent cache and need to clear it manually:
# Clear all cached responses
server.session.cache.clear()
To disable caching entirely, you can pass a standard requests.Session:
import requests
from dartfx.dataverse import DataverseServer
server = DataverseServer(
    server="dataverse.harvard.edu",
    session=requests.Session()
)
This is an early development release. The core functionality is working, but APIs may change.
- Pydantic models for search results and datasets
- Enhanced error messages and debugging
- Batch operation support
- Progress indicators for long-running operations
- Dataset metadata retrieval (DDI, Dublin Core, DataCite)
- Server information and versioning
- Export formats discovery
- Metadata block listing
- Search API wrapper with caching
- CLI for discovery and search
- File metadata retrieval
- Support for additional metadata formats (Croissant, schema.org)
- Dataset and file download capabilities
- Download progress tracking and resuming
- Stable API (v1.0.0)
We welcome contributions! Here's how you can help:
- Fork the repository
- Clone your fork:
git clone https://github.com/YOUR-USERNAME/dataverse-toolkit.git
- Create a feature branch:
git checkout -b feature/your-feature-name
- Set up development environment:
uv tool install hatch
hatch shell
# Run tests
hatch run test
# Run tests with coverage
hatch run cov
# Type checking
hatch run types:check
# Format code
ruff format .
# Lint code
ruff check . --fix
- Make your changes and add tests
- Ensure all tests pass:
hatch run test
- Commit your changes:
git commit -am 'Add some feature'
- Push to your fork:
git push origin feature/your-feature-name
- Submit a pull request
See the Contributing Guide for detailed guidelines.
This project follows the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code.
This project is licensed under the MIT License - see the LICENSE.txt file for details.
- Dataverse installations inventory provided by the community-maintained IQSS/dataverse-installations project.
- Built with Pydantic for data validation
- Uses Requests and requests-cache for HTTP operations
- Developed using Hatch project manager
- Documentation built with Sphinx
- Documentation: https://dataverse-toolkit.readthedocs.io/
- Source Code: https://github.com/DataArtifex/dataverse-toolkit
- Issue Tracker: https://github.com/DataArtifex/dataverse-toolkit/issues
- PyPI: https://pypi.org/project/dartfx-dataverse/
- Dataverse Project: https://dataverse.org/
If you encounter issues or have questions:
- Check the documentation
- Search existing issues
- Create a new issue if needed
If you use this package in your research, please cite:
@software{dartfx_dataverse,
  author = {Heus, Pascal},
  title = {dartfx-dataverse: A Python toolkit for Dataverse repositories},
  year = {2024},
  url = {https://github.com/DataArtifex/dataverse-toolkit}
}
Maintained by Data Artifex | Author: Pascal Heus