diff --git a/docs/advanced.md b/docs/advanced.md index 86099cc..4858854 100644 --- a/docs/advanced.md +++ b/docs/advanced.md @@ -1,472 +1,278 @@ # Advanced Features -SETLr provides powerful advanced capabilities for complex data transformation workflows, large-scale processing, and production deployments. +SETLr provides several advanced capabilities beyond basic CSV-to-RDF transformation. This guide covers specialized features for working with large XML files, custom Python code, SPARQL endpoints, and SHACL validation. ## Overview -This guide covers advanced topics including: +- **[Streaming XML with XPath](#streaming-xml)** - Efficiently process large XML files with XPath filtering +- **[Python Functions in Transforms](#python-functions)** - Execute custom Python code within transforms +- **[SPARQL Support](#sparql-support)** - Load RDF to SPARQL endpoints +- **[SHACL Validation](#shacl-validation)** - Validate output RDF against SHACL shapes -- Multi-source transforms -- Conditional loading and filtering -- Performance optimization -- Error handling and debugging -- Integration patterns +## Streaming XML with XPath {#streaming-xml} -For specific advanced features, see: -- [Streaming XML with XPath](streaming-xml.md) - Efficient large file processing -- [Python Functions in Transforms](python-functions.md) - Custom Python code -- [SPARQL Support](sparql.md) - Query and update endpoints -- [SHACL Validation](shacl.md) - Validate your RDF output +For large XML files that don't fit in memory, SETLr provides streaming XML parsing with XPath filtering. -## Multi-Source Transforms +### Key Features -SETLr can combine data from multiple sources in a single transform. 
+- **Memory Efficient**: Uses incremental parsing (iterparse) to process one element at a time +- **XPath Filtering**: Extract only the elements you need +- **Progress Tracking**: Shows progress bar for long-running operations +- **DTD Validation**: Optional validation against document DTD -### Combining Multiple Tables +### Quick Example ```turtle @prefix setl: . @prefix prov: . -@prefix csvw: . - -# Load first table -:users a csvw:Table, setl:Table ; - prov:wasGeneratedBy [ - a setl:Extract ; - prov:used - ] . - -# Load second table -:orders a csvw:Table, setl:Table ; - prov:wasGeneratedBy [ - a setl:Extract ; - prov:used - ] . - -# Transform using both tables -:output prov:wasGeneratedBy [ - a setl:Transform, setl:JSLDT ; - prov:used :users, :orders ; - prov:value ''' - [{ - "@for": "user in users", - "@do": { - "@id": "http://example.com/user/{{user.ID}}", - "@type": "Person", - "name": "{{user.Name}}", - "orders": [{ - "@for": "order in orders", - "@if": "order.UserID == user.ID", - "@do": { - "@id": "http://example.com/order/{{order.OrderID}}" - } - }] - } - }] - ''' -] . -``` - -### Loading from Different Formats - -```turtle -# CSV data -:csv_table a csvw:Table, setl:Table ; - prov:wasGeneratedBy [ - a setl:Extract ; - prov:used - ] . - -# JSON data -:json_data a setl:Table ; - prov:wasGeneratedBy [ - a setl:Extract ; - prov:used ; - setl:hasJSONSelector "$.items[*]" - ] . +@prefix : . -# XML data -:xml_data a setl:Table ; - prov:wasGeneratedBy [ +:xmlTable a setl:Table ; + setl:xpath "//book" ; # Extract only <book> elements + prov:wasGeneratedBy [ a setl:Extract ; - prov:used ; - setl:hasXPathSelector "//item" + prov:used ; ] . ``` -## Conditional Loading +This extracts only `<book>` elements from the XML, ignoring all other elements and reducing memory usage. -Use conditional logic to selectively process data based on runtime conditions. 
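The incremental parsing mentioned under Key Features can be sketched with Python's standard library. This is a minimal illustration of the technique, not SETLr's own implementation: `iter_elements` and the sample document are hypothetical, and a plain tag name stands in for the `//book` XPath filter.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each element named `tag` as soon as it has been fully parsed."""
    # iterparse reads the input incrementally rather than building
    # the whole tree before returning anything.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Once the caller is done with the element, drop its
            # children and text so they can be garbage collected.
            elem.clear()


# Hypothetical sample input; a real use case would pass a file path
# or an open file object for a large document.
SAMPLE = b"""<library>
  <magazine><title>Ignored</title></magazine>
  <book><title>First</title></book>
  <book><title>Second</title></book>
</library>"""

titles = [book.findtext("title") for book in iter_elements(io.BytesIO(SAMPLE), "book")]
print(titles)  # ['First', 'Second']
```

Each `<book>` is handed to the caller and then cleared, so memory use depends on the size of one element rather than on the size of the file.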
+### When to Use -### Filtering with @if +- XML files larger than 100 MB +- Files with thousands of elements +- Limited memory environments +- Need to extract specific elements from complex XML -```json -[{ - "@for": "row in table", - "@if": "row.Status == 'active' and row.Score > 50", - "@do": { - "@id": "http://example.com/entity/{{row.ID}}", - "@type": "ActiveEntity", - "score": "{{row.Score}}" - } -}] -``` +**→ [Full Streaming XML Documentation](streaming-xml.md)** -### Conditional Fields - -```json -{ - "@id": "http://example.com/person/{{row.ID}}", - "@type": "Person", - "name": "{{row.Name}}", - "email": { - "@if": "row.Email", - "@do": "mailto:{{row.Email}}" - }, - "phone": { - "@if": "row.Phone and row.PhoneVerified", - "@do": "{{row.Phone}}" - } -} -``` +## Python Functions in Transforms {#python-functions} -## Performance Optimization +Execute custom Python code within JSLDT transforms for complex processing, graph manipulation, and post-processing. -### Streaming Processing +### Key Features -For large XML files, use streaming to reduce memory usage: +- **Graph Access**: Direct access to the RDF graph being generated +- **Post-Processing**: Add computed triples, aggregates, and statistics +- **Validation**: Check generated RDF for correctness +- **Custom Logic**: Execute arbitrary Python code + +### Quick Example ```turtle -:big_xml a setl:Table ; +@prefix setl: . +@prefix prov: . +@prefix void: . +@prefix : . 
+ +:enrichedGraph a void:Dataset ; prov:wasGeneratedBy [ - a setl:Extract ; - prov:used ; - setl:hasXPathSelector "//record" ; - setl:streaming true # Enable streaming + a setl:Transform, setl:JSLDT ; + prov:used :dataTable ; + prov:used [ + a setl:PythonScript ; + prov:value ''' +# Variables available: graph, setl_graph +from rdflib.namespace import RDF + +# Count triples by type +types = {} +for s, p, o in graph.triples((None, RDF.type, None)): + types[str(o)] = types.get(str(o), 0) + 1 + +print("Generated triples by type:") +for t, count in sorted(types.items()): + print(f" {t}: {count}") +''' + ] ; + prov:value '''[{ + "@id": "http://example.com/{{row.ID}}", + "@type": "http://example.com/Item" + }]''' ; ] . ``` -See [Streaming XML documentation](streaming-xml.md) for details. - -### Batch Processing - -Process data in batches to control memory usage: - -```python -from rdflib import Graph, Namespace, URIRef -import setlr - -# For very large datasets, process in chunks -chunk_size = 10000 -offset = 0 - -output_graph = Graph() - -while True: - # Create SETL script for this batch - setl_graph = create_batch_setl(offset, chunk_size) - - # Process batch - resources = setlr.run_setl(setl_graph) - - # Accumulate results - batch_output = resources[URIRef('http://example.com/output')] - output_graph += batch_output - - # Check if done - if len(batch_output) < chunk_size: - break - - offset += chunk_size - -# Save final results -output_graph.serialize('output.ttl', format='turtle') -``` - -### Pandas Optimization +### When to Use -For CSV/Excel files, pandas is used automatically. 
Optimize with: +- Computing aggregates or statistics after transformation +- Adding cross-references between generated entities +- Validating generated RDF structure +- Complex logic not easily expressed in JSLDT templates -```python -# Use appropriate dtypes to reduce memory -# Specify in your data loading if possible - -# For very wide tables, select only needed columns -# by processing the source data first -``` +⚠️ **Security Warning**: Python scripts execute with full system access. Only run trusted SETL scripts. -## Error Handling and Debugging +**→ [Full Python Functions Documentation](python-functions.md)** -### Verbose Logging +## SPARQL Support {#sparql-support} -Enable detailed logging to diagnose issues: +Load transformed RDF directly to SPARQL endpoints for integration with triple stores and semantic web applications. -```python -import logging -import setlr - -# Enable debug logging -logging.basicConfig(level=logging.DEBUG) -logger = logging.getLogger('setlr') -logger.setLevel(logging.DEBUG) - -# Now run SETL -resources = setlr.run_setl(setl_graph) -``` +### Key Features -### Progress Tracking +- **Direct Loading**: Send RDF to SPARQL UPDATE endpoints +- **Integration**: Works with Fuseki, GraphDB, Blazegraph, etc. +- **SPARQL Service Description**: Uses standard W3C vocabulary -Use tqdm for progress tracking on large datasets: +### Quick Example -```python -from tqdm import tqdm -import setlr +```turtle +@prefix setl: . +@prefix prov: . +@prefix void: . +@prefix sd: . +@prefix : . -# Progress bars are automatically shown for: -# - Large file processing -# - Batch operations -# - Network transfers +# Transform data (see previous examples) +:myGraph a void:Dataset ; + prov:wasGeneratedBy [ + a setl:Transform, setl:JSLDT ; + # ... transform details ... + ] . -resources = setlr.run_setl(setl_graph) +# Load to SPARQL endpoint +:sparql_load a setl:Load, sd:Service ; + sd:endpoint ; + prov:used :myGraph . 
``` -### Validation During Development +### Configuration -Validate intermediate results to catch issues early: +The SPARQL endpoint URL should point to the UPDATE endpoint: -```python -from rdflib import Graph -import setlr +- **Fuseki**: `http://localhost:3030/dataset/update` +- **GraphDB**: `http://localhost:7200/repositories/repo/statements` +- **Blazegraph**: `http://localhost:9999/blazegraph/namespace/kb/sparql` -# Process data -resources = setlr.run_setl(setl_graph) -output = resources[URIRef('http://example.com/output')] +### When to Use -# Validate results -print(f"Generated {len(output)} triples") -print(f"Subjects: {len(set(output.subjects()))}") -print(f"Predicates: {len(set(output.predicates()))}") -print(f"Objects: {len(set(output.objects()))}") +- Loading data into semantic web applications +- Integration with existing triple stores +- Building knowledge graphs +- Creating linked data services -# Check for specific patterns -for s, p, o in output.triples((None, RDF.type, None)): - print(f"Type: {o}") -``` +### Authentication -### Error Recovery - -Handle errors gracefully in production: - -```python -import setlr -from rdflib import Graph - -try: - setl_graph = Graph() - setl_graph.parse('transform.setl.ttl', format='turtle') - resources = setlr.run_setl(setl_graph) - -except setlr.SetlrError as e: - print(f"SETL processing error: {e}") - # Handle gracefully - -except Exception as e: - print(f"Unexpected error: {e}") - # Log and notify -``` +For endpoints requiring authentication, use HTTP authentication in the URL or configure credentials in your environment. 
-## Integration Patterns - -### CI/CD Integration - -Integrate SETLr into your CI/CD pipeline: - -```yaml -# GitHub Actions example -name: Generate RDF - -on: [push] - -jobs: - generate: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.11' - - - name: Install SETLr - run: pip install setlr - - - name: Generate RDF - run: setlr transform.setl.ttl -o output.ttl - - - name: Upload artifact - uses: actions/upload-artifact@v4 - with: - name: rdf-output - path: output.ttl -``` - -### Docker Integration +## SHACL Validation {#shacl-validation} -Use SETLr in containerized environments: +Validate transformed RDF against SHACL (Shapes Constraint Language) shapes to ensure data quality and conformance to schemas. -```dockerfile -FROM python:3.11-slim +### Key Features -# Install SETLr -RUN pip install setlr +- **W3C Standard**: Uses SHACL specification for validation +- **Pre-Load Validation**: Checks RDF before loading to files or endpoints +- **Detailed Reports**: Shows which constraints failed +- **Schema Enforcement**: Ensure data meets required structure -# Copy your SETL scripts and data -COPY transform.setl.ttl /app/ -COPY data/ /app/data/ +### Quick Example -WORKDIR /app +Create SHACL shapes file (`shapes.ttl`): -# Run transformation -CMD ["setlr", "transform.setl.ttl", "-o", "/output/result.ttl"] +```turtle +@prefix sh: . +@prefix xsd: . +@prefix ex: . +@prefix foaf: . + +ex:PersonShape + a sh:NodeShape ; + sh:targetClass foaf:Person ; + sh:property [ + sh:path foaf:name ; + sh:minCount 1 ; + sh:datatype xsd:string ; + ] ; + sh:property [ + sh:path foaf:mbox ; + sh:maxCount 1 ; + sh:nodeKind sh:IRI ; + ] . ``` +Run SETLr with validation: + ```bash -# Build and run -docker build -t my-setl-transform . 
-docker run -v $(pwd)/output:/output my-setl-transform +setlr transform.setl.ttl --rdf-validation shapes.ttl ``` -### Scheduled Processing - -Run SETLr transformations on a schedule: - -```python -# scheduled_transform.py -import schedule -import time -from rdflib import Graph -import setlr - -def run_transform(): - """Run the SETL transformation""" - print("Starting transformation...") - - setl_graph = Graph() - setl_graph.parse('transform.setl.ttl', format='turtle') - - resources = setlr.run_setl(setl_graph) - - # Save output with timestamp - timestamp = time.strftime('%Y%m%d_%H%M%S') - output_file = f'output_{timestamp}.ttl' - - output = resources[URIRef('http://example.com/output')] - output.serialize(output_file, format='turtle') - - print(f"Transformation complete: {output_file}") - -# Schedule to run every day at 2 AM -schedule.every().day.at("02:00").do(run_transform) - -while True: - schedule.run_pending() - time.sleep(60) -``` +### Validation Process -### REST API Wrapper - -Expose SETLr as a REST API: - -```python -from flask import Flask, request, jsonify -from rdflib import Graph -import setlr -import tempfile - -app = Flask(__name__) - -@app.route('/transform', methods=['POST']) -def transform(): - """Accept CSV data and SETL script, return RDF""" - - # Get input - csv_data = request.files['data'] - setl_script = request.form['setl'] - - # Save to temp files - with tempfile.NamedTemporaryFile(mode='w', suffix='.csv') as csv_file: - csv_data.save(csv_file.name) - - # Update SETL script with temp file path - setl_graph = Graph() - setl_graph.parse(data=setl_script, format='turtle') - - # Run transformation - resources = setlr.run_setl(setl_graph) - - # Return RDF - output = resources[URIRef('http://example.com/output')] - return output.serialize(format='turtle'), 200, { - 'Content-Type': 'text/turtle' - } - -if __name__ == '__main__': - app.run(debug=True) -``` +1. SETL transform executes and generates RDF +2. 
Generated RDF is validated against SHACL shapes +3. Validation report is generated +4. If validation passes, RDF is loaded +5. If validation fails, warnings are shown but loading continues -## Best Practices +### When to Use -### 1. Modular SETL Scripts +- Enforcing data quality standards +- Ensuring schema conformance +- Catching transformation errors early +- Documenting expected RDF structure -Break complex transformations into modules: +### Common Shape Constraints +**Required Properties:** ```turtle -# common.setl.ttl - shared definitions -@prefix : . -@prefix setl: . - -# users.setl.ttl - user-specific transforms -@prefix : . -<> owl:imports . +@prefix sh: . +@prefix foaf: . -# Main script imports both +sh:property [ + sh:path foaf:name ; + sh:minCount 1 ; # Required +] ; ``` -### 2. Version Control - -- Store SETL scripts in version control -- Track changes to transforms with your data processing pipeline -- Use branches for experimental transforms - -### 3. Testing - -- Test SETL scripts with sample data before production use -- Validate output with SHACL shapes -- Compare output to expected results +**Data Types:** +```turtle +@prefix sh: . +@prefix schema: . +@prefix xsd: . + +sh:property [ + sh:path schema:age ; + sh:datatype xsd:integer ; +] ; +``` -### 4. Documentation +**Value Ranges:** +```turtle +@prefix sh: . +@prefix schema: . + +sh:property [ + sh:path schema:age ; + sh:minInclusive 0 ; + sh:maxInclusive 150 ; +] ; +``` -- Document complex transforms with comments (use rdfs:comment) -- Maintain README files for transform collections -- Include example data with your scripts +**Pattern Matching:** +```turtle +@prefix sh: . +@prefix foaf: . -### 5. 
Monitoring +sh:property [ + sh:path foaf:mbox ; + sh:pattern "^mailto:" ; +] ; +``` -- Log transformation results (record counts, errors) -- Monitor resource usage for large datasets -- Set up alerts for transformation failures +### Installation -## Next Steps +SHACL validation requires the `pyshacl` package: -- Explore [Streaming XML](streaming-xml.md) for large file processing -- Learn about [Python Functions](python-functions.md) for custom logic -- Set up [SPARQL endpoints](sparql.md) for data loading -- Implement [SHACL validation](shacl.md) for quality control +```bash +pip install setlr[validation] +# or +pip install pyshacl[js] +``` -## Support +## See Also -For questions about advanced features: -- Check the [documentation](README.md) -- Open a [discussion](https://github.com/tetherless-world/setlr/discussions) -- Report issues on [GitHub](https://github.com/tetherless-world/setlr/issues) +- [Tutorial](tutorial.md) - Step-by-step guide to SETLr basics +- [JSLDT Template Language](jsldt.md) - Template syntax reference +- [Python API](python-api.md) - Using SETLr from Python code +- [CLI Reference](cli.md) - Command-line options and usage +- [Examples](examples.md) - Complete working examples