diff --git a/docs/advanced.md b/docs/advanced.md
index 86099cc..4858854 100644
--- a/docs/advanced.md
+++ b/docs/advanced.md
@@ -1,472 +1,278 @@
# Advanced Features
-SETLr provides powerful advanced capabilities for complex data transformation workflows, large-scale processing, and production deployments.
+SETLr provides several advanced capabilities beyond basic CSV-to-RDF transformation. This guide covers specialized features for working with large XML files, custom Python code, SPARQL endpoints, and SHACL validation.
## Overview
-This guide covers advanced topics including:
+- **[Streaming XML with XPath](#streaming-xml)** - Efficiently process large XML files with XPath filtering
+- **[Python Functions in Transforms](#python-functions)** - Execute custom Python code within transforms
+- **[SPARQL Support](#sparql-support)** - Load RDF to SPARQL endpoints
+- **[SHACL Validation](#shacl-validation)** - Validate output RDF against SHACL shapes
-- Multi-source transforms
-- Conditional loading and filtering
-- Performance optimization
-- Error handling and debugging
-- Integration patterns
+## Streaming XML with XPath {#streaming-xml}
-For specific advanced features, see:
-- [Streaming XML with XPath](streaming-xml.md) - Efficient large file processing
-- [Python Functions in Transforms](python-functions.md) - Custom Python code
-- [SPARQL Support](sparql.md) - Query and update endpoints
-- [SHACL Validation](shacl.md) - Validate your RDF output
+For large XML files that don't fit in memory, SETLr provides streaming XML parsing with XPath filtering.
-## Multi-Source Transforms
+### Key Features
-SETLr can combine data from multiple sources in a single transform.
+- **Memory Efficient**: Uses incremental parsing (iterparse) to process one element at a time
+- **XPath Filtering**: Extract only the elements you need
+- **Progress Tracking**: Shows progress bar for long-running operations
+- **DTD Validation**: Optional validation against document DTD
-### Combining Multiple Tables
+### Quick Example
```turtle
@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
-@prefix csvw: .
-
-# Load first table
-:users a csvw:Table, setl:Table ;
- prov:wasGeneratedBy [
- a setl:Extract ;
- prov:used
- ] .
-
-# Load second table
-:orders a csvw:Table, setl:Table ;
- prov:wasGeneratedBy [
- a setl:Extract ;
- prov:used
- ] .
-
-# Transform using both tables
-:output prov:wasGeneratedBy [
- a setl:Transform, setl:JSLDT ;
- prov:used :users, :orders ;
- prov:value '''
- [{
- "@for": "user in users",
- "@do": {
- "@id": "http://example.com/user/{{user.ID}}",
- "@type": "Person",
- "name": "{{user.Name}}",
- "orders": [{
- "@for": "order in orders",
- "@if": "order.UserID == user.ID",
- "@do": {
- "@id": "http://example.com/order/{{order.OrderID}}"
- }
- }]
- }
- }]
- '''
-] .
-```
-
-### Loading from Different Formats
-
-```turtle
-# CSV data
-:csv_table a csvw:Table, setl:Table ;
- prov:wasGeneratedBy [
- a setl:Extract ;
- prov:used
- ] .
-
-# JSON data
-:json_data a setl:Table ;
- prov:wasGeneratedBy [
- a setl:Extract ;
- prov:used ;
- setl:hasJSONSelector "$.items[*]"
- ] .
+@prefix : .
-# XML data
-:xml_data a setl:Table ;
- prov:wasGeneratedBy [
+:xmlTable a setl:Table ;
+ setl:xpath "//book" ; # Extract only <book> elements
+ prov:wasGeneratedBy [
a setl:Extract ;
- prov:used ;
- setl:hasXPathSelector "//item"
+ prov:used ;
] .
```
-## Conditional Loading
+This extracts only `<book>` elements from the XML, skipping all other elements and reducing memory usage.
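
The incremental-parsing idea can be illustrated outside SETLr with Python's standard library. This is a minimal sketch, not SETLr's implementation: the helper name `iter_elements` and the sample document are made up for illustration, and it matches on tag name only (roughly what `//book` selects) rather than evaluating full XPath.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each element named `tag` once it is fully parsed, then clear it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the element's children and text so they can be freed.
            # The emptied element stays attached to its parent; very long
            # documents may also need to remove it from the parent.
            elem.clear()


# Hypothetical sample input; a real run would pass a file path or file object.
sample = io.BytesIO(
    b"<library>"
    b"<book><title>A</title></book>"
    b"<magazine><title>skip me</title></magazine>"
    b"<book><title>B</title></book>"
    b"</library>"
)

titles = [book.findtext("title") for book in iter_elements(sample, "book")]
print(titles)
```

Only one matching element is held in full at a time, which is why this style of parsing suits files that do not fit in memory.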
-Use conditional logic to selectively process data based on runtime conditions.
+### When to Use
-### Filtering with @if
+- XML files larger than 100 MB
+- Files with thousands of elements
+- Limited memory environments
+- Extracting specific elements from complex XML
-```json
-[{
- "@for": "row in table",
- "@if": "row.Status == 'active' and row.Score > 50",
- "@do": {
- "@id": "http://example.com/entity/{{row.ID}}",
- "@type": "ActiveEntity",
- "score": "{{row.Score}}"
- }
-}]
-```
+**→ [Full Streaming XML Documentation](streaming-xml.md)**
-### Conditional Fields
-
-```json
-{
- "@id": "http://example.com/person/{{row.ID}}",
- "@type": "Person",
- "name": "{{row.Name}}",
- "email": {
- "@if": "row.Email",
- "@do": "mailto:{{row.Email}}"
- },
- "phone": {
- "@if": "row.Phone and row.PhoneVerified",
- "@do": "{{row.Phone}}"
- }
-}
-```
+## Python Functions in Transforms {#python-functions}
-## Performance Optimization
+Execute custom Python code within JSLDT transforms for complex processing, graph manipulation, and post-processing.
-### Streaming Processing
+### Key Features
-For large XML files, use streaming to reduce memory usage:
+- **Graph Access**: Direct access to the RDF graph being generated
+- **Post-Processing**: Add computed triples, aggregates, and statistics
+- **Validation**: Check generated RDF for correctness
+- **Custom Logic**: Execute arbitrary Python code
+
+### Quick Example
```turtle
-:big_xml a setl:Table ;
+@prefix setl: <http://purl.org/twc/vocab/setl/> .
+@prefix prov: <http://www.w3.org/ns/prov#> .
+@prefix void: <http://rdfs.org/ns/void#> .
+@prefix : .
+
+:enrichedGraph a void:Dataset ;
prov:wasGeneratedBy [
- a setl:Extract ;
- prov:used ;
- setl:hasXPathSelector "//record" ;
- setl:streaming true # Enable streaming
+ a setl:Transform, setl:JSLDT ;
+ prov:used :dataTable ;
+ prov:used [
+ a setl:PythonScript ;
+ prov:value '''
+# Variables available: graph, setl_graph
+from rdflib.namespace import RDF
+
+# Count triples by type
+types = {}
+for s, p, o in graph.triples((None, RDF.type, None)):
+ types[str(o)] = types.get(str(o), 0) + 1
+
+print("Generated triples by type:")
+for t, count in sorted(types.items()):
+ print(f" {t}: {count}")
+'''
+ ] ;
+ prov:value '''[{
+ "@id": "http://example.com/{{row.ID}}",
+ "@type": "http://example.com/Item"
+ }]''' ;
] .
```
-See [Streaming XML documentation](streaming-xml.md) for details.
-
-### Batch Processing
-
-Process data in batches to control memory usage:
-
-```python
-from rdflib import Graph, Namespace, URIRef
-import setlr
-
-# For very large datasets, process in chunks
-chunk_size = 10000
-offset = 0
-
-output_graph = Graph()
-
-while True:
- # Create SETL script for this batch
- setl_graph = create_batch_setl(offset, chunk_size)
-
- # Process batch
- resources = setlr.run_setl(setl_graph)
-
- # Accumulate results
- batch_output = resources[URIRef('http://example.com/output')]
- output_graph += batch_output
-
- # Check if done
- if len(batch_output) < chunk_size:
- break
-
- offset += chunk_size
-
-# Save final results
-output_graph.serialize('output.ttl', format='turtle')
-```
-
-### Pandas Optimization
+### When to Use
-For CSV/Excel files, pandas is used automatically. Optimize with:
+- Computing aggregates or statistics after transformation
+- Adding cross-references between generated entities
+- Validating generated RDF structure
+- Complex logic not easily expressed in JSLDT templates
-```python
-# Use appropriate dtypes to reduce memory
-# Specify in your data loading if possible
-
-# For very wide tables, select only needed columns
-# by processing the source data first
-```
+⚠️ **Security Warning**: Python scripts execute with full system access. Only run trusted SETL scripts.
-## Error Handling and Debugging
+**→ [Full Python Functions Documentation](python-functions.md)**
-### Verbose Logging
+## SPARQL Support {#sparql-support}
-Enable detailed logging to diagnose issues:
+Load transformed RDF directly to SPARQL endpoints for integration with triple stores and semantic web applications.
-```python
-import logging
-import setlr
-
-# Enable debug logging
-logging.basicConfig(level=logging.DEBUG)
-logger = logging.getLogger('setlr')
-logger.setLevel(logging.DEBUG)
-
-# Now run SETL
-resources = setlr.run_setl(setl_graph)
-```
+### Key Features
-### Progress Tracking
+- **Direct Loading**: Send RDF to SPARQL UPDATE endpoints
+- **Integration**: Works with Fuseki, GraphDB, Blazegraph, etc.
+- **SPARQL Service Description**: Uses standard W3C vocabulary
-Use tqdm for progress tracking on large datasets:
+### Quick Example
-```python
-from tqdm import tqdm
-import setlr
+```turtle
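+@prefix setl: <http://purl.org/twc/vocab/setl/> .
+@prefix prov: <http://www.w3.org/ns/prov#> .
+@prefix void: <http://rdfs.org/ns/void#> .
+@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .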
+@prefix : .
-# Progress bars are automatically shown for:
-# - Large file processing
-# - Batch operations
-# - Network transfers
+# Transform data (see previous examples)
+:myGraph a void:Dataset ;
+ prov:wasGeneratedBy [
+ a setl:Transform, setl:JSLDT ;
+ # ... transform details ...
+ ] .
-resources = setlr.run_setl(setl_graph)
+# Load to SPARQL endpoint
+:sparql_load a setl:Load, sd:Service ;
+ sd:endpoint ;
+ prov:used :myGraph .
```
-### Validation During Development
+### Configuration
-Validate intermediate results to catch issues early:
+The SPARQL endpoint URL should point to the UPDATE endpoint:
-```python
-from rdflib import Graph
-import setlr
+- **Fuseki**: `http://localhost:3030/dataset/update`
+- **GraphDB**: `http://localhost:7200/repositories/repo/statements`
+- **Blazegraph**: `http://localhost:9999/blazegraph/namespace/kb/sparql`
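
To make "UPDATE endpoint" concrete, the sketch below builds the kind of HTTP request the SPARQL 1.1 Protocol defines for updates. It is an illustration under assumptions, not a description of how SETLr sends data: the helper `build_update_request` is hypothetical, the request is constructed but never sent, and some stores expect a form-encoded `update=` parameter instead of a raw `application/sparql-update` body.

```python
import urllib.request


def build_update_request(endpoint, ntriples):
    """Build (but do not send) a SPARQL Update POST that inserts the
    given N-Triples statements into the default graph."""
    update = "INSERT DATA {\n" + ntriples + "\n}"
    return urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


# Example triple and endpoint are placeholders for illustration.
triples = '<http://example.com/1> <http://xmlns.com/foaf/0.1/name> "Alice" .'
request = build_update_request("http://localhost:3030/dataset/update", triples)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Pointing such a request at a read-only query endpoint would typically be rejected, which is why the URL has to be the store's update endpoint.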
-# Process data
-resources = setlr.run_setl(setl_graph)
-output = resources[URIRef('http://example.com/output')]
+### When to Use
-# Validate results
-print(f"Generated {len(output)} triples")
-print(f"Subjects: {len(set(output.subjects()))}")
-print(f"Predicates: {len(set(output.predicates()))}")
-print(f"Objects: {len(set(output.objects()))}")
+- Loading data into semantic web applications
+- Integration with existing triple stores
+- Building knowledge graphs
+- Creating linked data services
-# Check for specific patterns
-for s, p, o in output.triples((None, RDF.type, None)):
- print(f"Type: {o}")
-```
+### Authentication
-### Error Recovery
-
-Handle errors gracefully in production:
-
-```python
-import setlr
-from rdflib import Graph
-
-try:
- setl_graph = Graph()
- setl_graph.parse('transform.setl.ttl', format='turtle')
- resources = setlr.run_setl(setl_graph)
-
-except setlr.SetlrError as e:
- print(f"SETL processing error: {e}")
- # Handle gracefully
-
-except Exception as e:
- print(f"Unexpected error: {e}")
- # Log and notify
-```
+For endpoints requiring authentication, use HTTP authentication in the URL or configure credentials in your environment.
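
As one possible reading of "credentials in your environment", the sketch below reads a username and password from environment variables and attaches an HTTP Basic `Authorization` header. The variable names `SPARQL_USER` and `SPARQL_PASSWORD` and the helper `with_basic_auth` are hypothetical; SETLr may expect credentials in a different form, and an endpoint may use a scheme other than Basic.

```python
import base64
import os
import urllib.request


def with_basic_auth(request, user, password):
    """Attach an HTTP Basic Authorization header to a urllib request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical variable names; the fallbacks are placeholders for the demo.
user = os.environ.get("SPARQL_USER", "demo")
password = os.environ.get("SPARQL_PASSWORD", "secret")

request = with_basic_auth(
    urllib.request.Request("http://localhost:3030/dataset/update", method="POST"),
    user,
    password,
)
print(request.get_header("Authorization").split()[0])
```

Keeping credentials out of the endpoint URL avoids leaking them into logs and version-controlled SETL scripts.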
-## Integration Patterns
-
-### CI/CD Integration
-
-Integrate SETLr into your CI/CD pipeline:
-
-```yaml
-# GitHub Actions example
-name: Generate RDF
-
-on: [push]
-
-jobs:
- generate:
- runs-on: ubuntu-latest
- steps:
- - uses: actions/checkout@v4
-
- - name: Set up Python
- uses: actions/setup-python@v5
- with:
- python-version: '3.11'
-
- - name: Install SETLr
- run: pip install setlr
-
- - name: Generate RDF
- run: setlr transform.setl.ttl -o output.ttl
-
- - name: Upload artifact
- uses: actions/upload-artifact@v4
- with:
- name: rdf-output
- path: output.ttl
-```
-
-### Docker Integration
+## SHACL Validation {#shacl-validation}
-Use SETLr in containerized environments:
+Validate transformed RDF against SHACL (Shapes Constraint Language) shapes to ensure data quality and conformance to schemas.
-```dockerfile
-FROM python:3.11-slim
+### Key Features
-# Install SETLr
-RUN pip install setlr
+- **W3C Standard**: Uses SHACL specification for validation
+- **Pre-Load Validation**: Checks RDF before loading to files or endpoints
+- **Detailed Reports**: Shows which constraints failed
+- **Schema Enforcement**: Ensure data meets required structure
-# Copy your SETL scripts and data
-COPY transform.setl.ttl /app/
-COPY data/ /app/data/
+### Quick Example
-WORKDIR /app
+Create SHACL shapes file (`shapes.ttl`):
-# Run transformation
-CMD ["setlr", "transform.setl.ttl", "-o", "/output/result.ttl"]
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+@prefix ex: .
+@prefix foaf: <http://xmlns.com/foaf/0.1/> .
+
+ex:PersonShape
+ a sh:NodeShape ;
+ sh:targetClass foaf:Person ;
+ sh:property [
+ sh:path foaf:name ;
+ sh:minCount 1 ;
+ sh:datatype xsd:string ;
+ ] ;
+ sh:property [
+ sh:path foaf:mbox ;
+ sh:maxCount 1 ;
+ sh:nodeKind sh:IRI ;
+ ] .
```
+Run SETLr with validation:
+
```bash
-# Build and run
-docker build -t my-setl-transform .
-docker run -v $(pwd)/output:/output my-setl-transform
+setlr transform.setl.ttl --rdf-validation shapes.ttl
```
-### Scheduled Processing
-
-Run SETLr transformations on a schedule:
-
-```python
-# scheduled_transform.py
-import schedule
-import time
-from rdflib import Graph
-import setlr
-
-def run_transform():
- """Run the SETL transformation"""
- print("Starting transformation...")
-
- setl_graph = Graph()
- setl_graph.parse('transform.setl.ttl', format='turtle')
-
- resources = setlr.run_setl(setl_graph)
-
- # Save output with timestamp
- timestamp = time.strftime('%Y%m%d_%H%M%S')
- output_file = f'output_{timestamp}.ttl'
-
- output = resources[URIRef('http://example.com/output')]
- output.serialize(output_file, format='turtle')
-
- print(f"Transformation complete: {output_file}")
-
-# Schedule to run every day at 2 AM
-schedule.every().day.at("02:00").do(run_transform)
-
-while True:
- schedule.run_pending()
- time.sleep(60)
-```
+### Validation Process
-### REST API Wrapper
-
-Expose SETLr as a REST API:
-
-```python
-from flask import Flask, request, jsonify
-from rdflib import Graph
-import setlr
-import tempfile
-
-app = Flask(__name__)
-
-@app.route('/transform', methods=['POST'])
-def transform():
- """Accept CSV data and SETL script, return RDF"""
-
- # Get input
- csv_data = request.files['data']
- setl_script = request.form['setl']
-
- # Save to temp files
- with tempfile.NamedTemporaryFile(mode='w', suffix='.csv') as csv_file:
- csv_data.save(csv_file.name)
-
- # Update SETL script with temp file path
- setl_graph = Graph()
- setl_graph.parse(data=setl_script, format='turtle')
-
- # Run transformation
- resources = setlr.run_setl(setl_graph)
-
- # Return RDF
- output = resources[URIRef('http://example.com/output')]
- return output.serialize(format='turtle'), 200, {
- 'Content-Type': 'text/turtle'
- }
-
-if __name__ == '__main__':
- app.run(debug=True)
-```
+1. The SETL transform executes and generates RDF
+2. The generated RDF is validated against the SHACL shapes
+3. A validation report is generated
+4. If validation passes, the RDF is loaded
+5. If validation fails, the violations are shown as warnings, but loading still continues (a failed validation does not block the load)
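
The steps above can be sketched as a small function. This is an assumption-laden illustration, not SETLr's code: `validate_then_load` and `fake_validate` are hypothetical names, and the default validator is `pyshacl.validate`, which returns a `(conforms, report_graph, report_text)` tuple. The validator is injectable so the control flow can be exercised without `pyshacl` installed.

```python
def validate_then_load(data_graph, shapes_graph, load, validate=None):
    """Validate, report any violations, then load regardless of the outcome."""
    if validate is None:
        # Imported lazily so the sketch runs without pyshacl when a
        # validator is supplied by the caller.
        from pyshacl import validate
    conforms, _report_graph, report_text = validate(
        data_graph, shacl_graph=shapes_graph
    )
    if not conforms:
        print("SHACL validation warnings:")
        print(report_text)
    load(data_graph)  # loading continues either way (step 5)
    return conforms


def fake_validate(data_graph, shacl_graph=None):
    # Stand-in for pyshacl.validate that always reports one violation.
    return False, None, "foaf:name: minCount 1 not satisfied"


loaded = []
ok = validate_then_load("data", "shapes", loaded.append, validate=fake_validate)
print(ok, loaded)
```

With real graphs, `data_graph` and `shapes_graph` would be `rdflib.Graph` objects and `load` would write to a file or endpoint.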
-## Best Practices
+### When to Use
-### 1. Modular SETL Scripts
+- Enforcing data quality standards
+- Ensuring schema conformance
+- Catching transformation errors early
+- Documenting expected RDF structure
-Break complex transformations into modules:
+### Common Shape Constraints
+**Required Properties:**
```turtle
-# common.setl.ttl - shared definitions
-@prefix : .
-@prefix setl: .
-
-# users.setl.ttl - user-specific transforms
-@prefix : .
-<> owl:imports .
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix foaf: <http://xmlns.com/foaf/0.1/> .
-# Main script imports both
+sh:property [
+ sh:path foaf:name ;
+ sh:minCount 1 ; # Required
+] ;
```
-### 2. Version Control
-
-- Store SETL scripts in version control
-- Track changes to transforms with your data processing pipeline
-- Use branches for experimental transforms
-
-### 3. Testing
-
-- Test SETL scripts with sample data before production use
-- Validate output with SHACL shapes
-- Compare output to expected results
+**Data Types:**
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix schema: <http://schema.org/> .
+@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+
+sh:property [
+ sh:path schema:age ;
+ sh:datatype xsd:integer ;
+] ;
+```
-### 4. Documentation
+**Value Ranges:**
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix schema: <http://schema.org/> .
+
+sh:property [
+ sh:path schema:age ;
+ sh:minInclusive 0 ;
+ sh:maxInclusive 150 ;
+] ;
+```
-- Document complex transforms with comments (use rdfs:comment)
-- Maintain README files for transform collections
-- Include example data with your scripts
+**Pattern Matching:**
+```turtle
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix foaf: <http://xmlns.com/foaf/0.1/> .
-### 5. Monitoring
+sh:property [
+ sh:path foaf:mbox ;
+ sh:pattern "^mailto:" ;
+] ;
+```
-- Log transformation results (record counts, errors)
-- Monitor resource usage for large datasets
-- Set up alerts for transformation failures
+### Installation
-## Next Steps
+SHACL validation requires the `pyshacl` package:
-- Explore [Streaming XML](streaming-xml.md) for large file processing
-- Learn about [Python Functions](python-functions.md) for custom logic
-- Set up [SPARQL endpoints](sparql.md) for data loading
-- Implement [SHACL validation](shacl.md) for quality control
+```bash
+pip install "setlr[validation]"
+# or
+pip install "pyshacl[js]"
+```
-## Support
+## See Also
-For questions about advanced features:
-- Check the [documentation](README.md)
-- Open a [discussion](https://github.com/tetherless-world/setlr/discussions)
-- Report issues on [GitHub](https://github.com/tetherless-world/setlr/issues)
+- [Tutorial](tutorial.md) - Step-by-step guide to SETLr basics
+- [JSLDT Template Language](jsldt.md) - Template syntax reference
+- [Python API](python-api.md) - Using SETLr from Python code
+- [CLI Reference](cli.md) - Command-line options and usage
+- [Examples](examples.md) - Complete working examples