72 changes: 72 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,72 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.0.2] - 2026-01-18

### Changed
- Migrated from `setup.py` to `pyproject.toml` following PEP 517/518 standards for modern Python packaging
- Restructured codebase: moved implementation from `setlr/__init__.py` to `setlr/core.py` (~1020 lines)
- `setlr/__init__.py` now serves as a clean public API interface (~90 lines)

### Added
- New public API function `run_setl()` with comprehensive documentation and type hints
- Proper deprecation warning for `_setl()` function (still available for backward compatibility)
- Improved error messages for NaN/missing values (now displays `<empty/missing>` instead of `nan`)
- Extended JSON error context from 4 to 8 lines before error for better debugging
- Comprehensive API documentation with usage examples
- Development scripts for bootstrap, build, and release
- GitHub Actions workflows for automated testing and linting
- Migration documentation (MIGRATION.md)

### Fixed
- Improved error reporting for missing data scenarios
- Better context display for JSON syntax errors in templates
- Python version compatibility for JSON error handling
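
As a rough illustration of the JSON error context mentioned above, the sketch below collects the lines leading up to a syntax error using only the standard library. The helper name `json_error_context` and its return shape are assumptions made for this example and do not describe setlr's internal code.

```python
import json


def json_error_context(text, lines_before=8):
    """Return (message, context_lines) for invalid JSON, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        # err.lineno is 1-based; keep up to `lines_before` lines ahead of it.
        start = max(0, err.lineno - 1 - lines_before)
        # The slice ends at, and includes, the failing line.
        context = lines[start:err.lineno]
        return err.msg, context
    return None
```

Showing more preceding lines helps when the real mistake, such as a missing comma, sits several lines above the position the parser reports.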

## [1.0.1] - 2024-08-09

### Changed
- Moved version information from `_version.py` directly into `setup.py`
- Modified `setup.py` to support `--version` flag

### Fixed
- Fixed SHACL constraint in ontology example (changed `sh:minCount` from 1 to 0 for `rdfs:subClassOf`)

## [1.0.0] - 2024-04-29

### Added
- Initial stable release of setlr
- Core SETL (Semantic Extract, Transform, Load) functionality
- Support for generating RDF graphs from tabular data
- CLI tool via `setlr` command
- Data source readers: CSV, Excel, JSON, XML, and RDF graphs
- Template-based transformation using Jinja2
- Named graph support via ConjunctiveGraph
- RDF namespaces: csvw, ov, setl, prov, pv, sp, sd, dc, void, shacl
- Utility functions: `extract()`, `transform()`, `load()`, `hash()`, `camelcase()`
- SHACL validation support with pyshacl[js]
- Python 3.8+ support
- Comprehensive test suite

### Dependencies
- rdflib >= 6.0.0
- pandas >= 0.23.0
- jinja2
- click (CLI support)
- tqdm (progress bars)
- pyshacl[js] (validation)
- beautifulsoup4, lxml (XML/HTML parsing)
- requests (HTTP support)
- toposort (dependency ordering)
- Other utility libraries: numpy, xlrd, ijson, python-slugify

[Unreleased]: https://github.com/tetherless-world/setlr/compare/v1.0.2...HEAD
[1.0.2]: https://github.com/tetherless-world/setlr/compare/v1.0.1...v1.0.2
[1.0.1]: https://github.com/tetherless-world/setlr/compare/v1.0.0...v1.0.1
[1.0.0]: https://github.com/tetherless-world/setlr/releases/tag/v1.0.0
33 changes: 33 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,33 @@
# Include important files
include README.md
include LICENSE
include CHANGELOG.md
include MIGRATION.md
include pyproject.toml
include setup.py
include setup.cfg

# Include example files
recursive-include example *.csv *.ttl *.setl.ttl

# Exclude unwanted files and directories
global-exclude __pycache__
global-exclude *.py[cod]
global-exclude *.so
global-exclude .DS_Store
global-exclude *.egg-info
recursive-exclude * __pycache__
recursive-exclude * *.py[cod]

# Exclude tests, CI configuration, scripts, and built docs
prune tests
prune .github
prune .circleci
prune script
prune docs/_build

# Exclude development files
exclude .gitignore
exclude .pylintrc
exclude unittest.cfg
exclude IMPROVEMENT_SUMMARY.md
168 changes: 162 additions & 6 deletions README.md
@@ -1,18 +1,174 @@
# setlr: The Semantic Extract, Transform and Load-er
# setlr: Semantic Extract, Transform and Load

[![Unit Tests](https://github.com/tetherless-world/setlr/actions/workflows/test.yml/badge.svg)](https://github.com/tetherless-world/setlr/actions/workflows/test.yml)
[![Lint](https://github.com/tetherless-world/setlr/actions/workflows/lint.yml/badge.svg)](https://github.com/tetherless-world/setlr/actions/workflows/lint.yml)

setlr is a tool for generating RDF graphs, including named graphs, from almost any kind of tabular data.
**SETLr** is a powerful Python tool for generating RDF graphs from tabular data using declarative SETL (Semantic Extract, Transform, Load) scripts.

# Installation
## Features

Simply check out the code, optionally create a python virtual environment, and install it using pip:
- ✨ **Multiple Data Sources**: CSV, Excel, JSON, XML, RDF, SAS files
- 🔄 **Flexible Transformations**: JSON-LD templates with Jinja2, Python functions, SPARQL
- ⚡ **High Performance**: Streaming XML parsing, pandas DataFrames, progress tracking
- 🐍 **Python Integration**: Use as library or CLI tool
- ✅ **Validation**: Built-in SHACL validation
- 📝 **Well Documented**: Comprehensive guides and API reference

## Quick Start

### Installation

```bash
pip install setlr
```

# Learning how to SETL
### Simple Example

Create `data.csv`:
```csv
ID,Name,Email
1,Alice,alice@example.com
2,Bob,bob@example.com
```

Create `transform.setl.ttl`:
```turtle
@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <http://example.com/> .

:table a csvw:Table, setl:Table ;
prov:wasGeneratedBy [ a setl:Extract ; prov:used <data.csv> ] .

:output a void:Dataset ;
prov:wasGeneratedBy [
a setl:Transform, setl:JSLDT ;
prov:used :table ;
prov:value '''[{
"@id": "http://example.com/person/{{row.ID}}",
"@type": "http://xmlns.com/foaf/0.1/Person",
"http://xmlns.com/foaf/0.1/name": "{{row.Name}}",
"http://xmlns.com/foaf/0.1/mbox": "mailto:{{row.Email}}"
}]'''
] .
```

Run SETLr:
```bash
setlr transform.setl.ttl
```
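
To make the template step concrete, here is a rough sketch of what happens to each CSV row. It fills the `{{row.<column>}}` placeholders with a regular expression as a simplified stand-in for Jinja2 rendering. The helper `render_row` is hypothetical and is not part of SETLr, and the sketch ignores everything else the real engine does.

```python
import csv
import io
import json
import re

CSV_TEXT = "ID,Name,Email\n1,Alice,alice@example.com\n2,Bob,bob@example.com\n"

TEMPLATE = '''[{
  "@id": "http://example.com/person/{{row.ID}}",
  "@type": "http://xmlns.com/foaf/0.1/Person",
  "http://xmlns.com/foaf/0.1/name": "{{row.Name}}",
  "http://xmlns.com/foaf/0.1/mbox": "mailto:{{row.Email}}"
}]'''

# Matches placeholders such as {{row.ID}} and captures the column name.
PLACEHOLDER = re.compile(r"\{\{\s*row\.(\w+)\s*\}\}")


def render_row(template, row):
    """Fill {{row.<column>}} placeholders, then parse the result as JSON-LD.

    Simplification: values are inserted verbatim, so a value containing a
    double quote would produce invalid JSON.
    """
    filled = PLACEHOLDER.sub(lambda match: row[match.group(1)], template)
    return json.loads(filled)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
people = [render_row(TEMPLATE, row)[0] for row in rows]
print(json.dumps(people[0], indent=2))
```

Each row yields one JSON-LD object, so the two-row `data.csv` above produces two `foaf:Person` descriptions.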

### Using from Python

```python
from rdflib import Graph, URIRef
import setlr

# Load SETL script
setl_graph = Graph()
setl_graph.parse("transform.setl.ttl", format="turtle")

# Execute ETL pipeline
resources = setlr.run_setl(setl_graph)

# Access generated RDF
output = resources[URIRef('http://example.com/output')]
print(f"Generated {len(output)} RDF triples")
```

## Documentation

📚 **[Complete Documentation](docs/README.md)** - Full guides and references

**Quick Links:**
- [Tutorial](docs/tutorial.md) - Step-by-step guide to SETLr
- [JSLDT Template Language](docs/jsldt.md) - Transform syntax reference
- [Python API](docs/python-api.md) - Using SETLr from Python
- [Quick Start](docs/quickstart.md) - Get started in 5 minutes
- [Examples](docs/examples.md) - Real-world examples

**Advanced Topics:**
- [Streaming XML with XPath](docs/streaming-xml.md) - Efficient large file processing
- [Python Functions](docs/python-functions.md) - Custom Python transforms
- [SPARQL Support](docs/sparql.md) - Query and update endpoints
- [SHACL Validation](docs/shacl.md) - Validate your RDF output

## Key Concepts

SETLr uses RDF (with PROV-O vocabulary) to describe ETL workflows:

1. **Extract**: Load data from sources (CSV, Excel, JSON, XML, RDF, SAS)
2. **Transform**: Apply templates or Python scripts to generate RDF
3. **Load**: Save to files or SPARQL endpoints
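
The three stages can be sketched in plain Python to show how they fit together. This is a conceptual illustration using only the standard library, not SETLr's API: the function names `extract_rows`, `rows_to_triples`, and `write_ntriples` are made up for this example, and the output is hand-built N-Triples text.

```python
import csv
import io

FOAF = "http://xmlns.com/foaf/0.1/"


def extract_rows(csv_text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def rows_to_triples(rows):
    """Transform: turn each row into (subject, predicate, object) terms."""
    triples = []
    for row in rows:
        subject = f"<http://example.com/person/{row['ID']}>"
        # Simplification: literal values are not escaped here.
        triples.append((subject, f"<{FOAF}name>", f'"{row["Name"]}"'))
        triples.append((subject, f"<{FOAF}mbox>", f"<mailto:{row['Email']}>"))
    return triples


def write_ntriples(triples):
    """Load: serialize the triples as N-Triples text."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


ntriples = write_ntriples(
    rows_to_triples(extract_rows("ID,Name,Email\n1,Alice,alice@example.com\n"))
)
print(ntriples)
```

In SETLr itself each stage is declared in the SETL script as a PROV-O activity rather than written as a function call, so the same pipeline is described as data.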

## Supported Formats

**Input:**
- Tabular: CSV, TSV, Excel (XLS/XLSX), SAS (XPORT/SAS7BDAT)
- Structured: JSON (with ijson selectors), XML (with XPath streaming)
- Semantic: RDF (Turtle, JSON-LD, RDF/XML, etc.), OWL Ontologies

**Output:**
- RDF: Turtle, TriG, N-Triples, N3, RDF/XML, JSON-LD
- Destinations: Files, SPARQL Update endpoints

## Examples

See the [example/](example/) directory for complete working examples:

- `social.setl.ttl` - Basic CSV to RDF with conditionals and loops
- `ontology.setl.ttl` - OWL ontology transformation with SHACL shapes

## Development

```bash
# Clone repository
git clone https://github.com/tetherless-world/setlr.git
cd setlr

# Bootstrap (creates venv and installs dependencies)
./script/bootstrap

# Activate virtual environment
source venv/bin/activate

# Run tests
./script/build

# Run linter
flake8 setlr/
```

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## License

Apache License 2.0 - see [LICENSE](LICENSE) file for details.

## Citation

If you use SETLr in your research, please cite:

```bibtex
@software{setlr,
title = {SETLr: Semantic Extract, Transform and Load},
author = {McCusker, Jamie},
year = {2024},
url = {https://github.com/tetherless-world/setlr}
}
```

## Support

To learn how to use setlr please visit [the tutorial wiki page](https://github.com/tetherless-world/setlr/wiki/SETLr-Basics-Tutorial).
- 📖 [Documentation](docs/README.md)
- 🐛 [Issue Tracker](https://github.com/tetherless-world/setlr/issues)
- 💬 [Discussions](https://github.com/tetherless-world/setlr/discussions)
59 changes: 59 additions & 0 deletions docs/README.md
@@ -0,0 +1,59 @@
# SETLr Documentation

Welcome to the SETLr (Semantic Extract, Transform and Load-er) documentation!

## Table of Contents

1. [Quick Start](quickstart.md)
2. [Installation](installation.md)
3. [Tutorial](tutorial.md)
4. [JSLDT Template Language](jsldt.md)
5. [Python API](python-api.md)
6. [Advanced Features](advanced.md)
- [Streaming XML with XPath](streaming-xml.md)
- [Python Functions in Transforms](python-functions.md)
- [SPARQL Support](sparql.md)
- [SHACL Validation](shacl.md)
7. [Examples](examples.md)
8. [CLI Reference](cli.md)

## What is SETLr?

SETLr is a powerful tool for generating RDF graphs from tabular data sources. It uses declarative SETL (Semantic Extract, Transform, Load) scripts to:

- **Extract** data from CSV, Excel, JSON, XML, RDF, and SAS sources
- **Transform** data using JSON-LD templates with Jinja2 templating
- **Load** results to files or SPARQL endpoints

## Key Features

- 📊 **Multiple Data Formats**: CSV, Excel, JSON, XML, RDF, SAS files
- 🔄 **Powerful Transformations**: JSON-LD templates with @if, @for, @with control structures
- 🐍 **Python Integration**: Call from Python code or use custom Python functions
- ⚡ **Streaming**: Efficient XML parsing for large files with XPath filtering
- ✅ **Validation**: Built-in SHACL validation support
- 🎯 **SPARQL**: Execute SPARQL queries and load to endpoints

## Quick Example

```python
from rdflib import Graph, URIRef
import setlr

# Load your SETL script
setl_graph = Graph()
setl_graph.parse("my_script.setl.ttl", format="turtle")

# Execute the ETL pipeline
resources = setlr.run_setl(setl_graph)

# Access generated RDF
output_graph = resources[URIRef('http://example.com/output')]
```

## Learn More

- New to SETLr? Start with the [Quick Start Guide](quickstart.md)
- Want to learn the basics? Follow the [Tutorial](tutorial.md)
- Need to write transforms? Check the [JSLDT Template Language](jsldt.md)
- Using Python? See the [Python API Documentation](python-api.md)