From 929e87db49ed8e6d659a7b0dafa095221a50e39d Mon Sep 17 00:00:00 2001
From: MarioDeFelipe <89535508+MarioDeFelipe@users.noreply.github.com>
Date: Thu, 26 Feb 2026 18:28:23 +0100
Subject: [PATCH 1/2] Add SAP Datasphere partner plugin

18 specialized skills covering the full SAP Datasphere lifecycle:
exploration, data modeling, integration, BW Bridge migration, security
architecture, CLI automation, business content activation, catalog
governance, performance optimization, and troubleshooting.

Powered by 45 MCP tools via @mariodefe/sap-datasphere-mcp with OAuth 2.0
authentication and enterprise-grade security.

Author: Mario De Felipe
Homepage: https://github.com/MarioDeFelipe/sap-datasphere-plugin-for-claude-cowork
---
 .../SAP-Datasphere/.claude-plugin/plugin.json |   12 +
 partner-built/SAP-Datasphere/.mcp.json        |   16 +
 partner-built/SAP-Datasphere/LICENSE          |   21 +
 partner-built/SAP-Datasphere/README.md        |  160 ++
 .../skills/datasphere-admin/SKILL.md          |  283 +++
 .../references/security-governance.md         |  137 ++
 .../references/space-management.md            |  130 ++
 .../references/system-monitoring.md           |  181 ++
 .../datasphere-admin/references/transport.md  |  197 ++
 .../SKILL.md                                  |  688 +++++++
 .../references/analytic-model-guide.md        |  940 +++++++++
 .../SKILL.md                                  |  799 ++++++++
 .../references/content-catalog.md             | 1404 +++++++++++++
 .../datasphere-bw-bridge-migration/SKILL.md   | 1303 ++++++++++++
 .../references/bw-bridge-guide.md             | 1236 ++++++++++++
 .../datasphere-catalog-steward/SKILL.md       | 1343 +++++++++++++
 .../references/catalog-governance-guide.md    | 1693 ++++++++++++++++
 .../skills/datasphere-cli-automator/SKILL.md  |  780 +++++++
 .../references/cli-reference.md               | 1174 +++++++++++
 .../skills/datasphere-connections/SKILL.md    |  194 ++
 .../references/authentication.md              |  120 ++
 .../references/connection-types.md            |  107 +
 .../references/troubleshooting-guide.md       |  143 ++
 .../skills/datasphere-data-flows/SKILL.md     |  451 +++++
 .../references/data-flows.md                  |  133 ++
 .../references/replication-flows.md           |   89 +
 .../references/task-chains.md                 |   92 +
 .../references/transformation-flows.md        |   95 +
 .../SKILL.md                                  |  972 +++++++++
 .../references/data-sharing-guide.md          | 1034 ++++++++++
 .../skills/datasphere-explorer/SKILL.md       |  182 ++
 .../references/exploration-workflows.md       |  243 +++
 .../skills/datasphere-flow-doctor/SKILL.md    | 1295 ++++++++++++
 .../references/abap-side-monitoring.md        |  422 ++++
 .../references/error-catalog.md               |  947 +++++++++
 .../replication-flow-error-patterns.md        |  167 ++
 .../datasphere-intelligent-lookup/SKILL.md    |  782 ++++++++
 .../references/intelligent-lookup-guide.md    |  909 +++++++++
 .../datasphere-performance-optimizer/SKILL.md |  597 ++++++
 .../references/diagnostic-procedures.md       |  156 ++
 .../references/optimization-techniques.md     |  553 +++++
 .../skills/datasphere-s4hana-import/SKILL.md  |  771 +++++++
 .../cds-replication-architecture.md           |  149 ++
 .../references/s4hana-integration-guide.md    |  532 +++++
 .../datasphere-security-architect/SKILL.md    | 1600 +++++++++++++++
 .../references/security-patterns.md           | 1345 +++++++++++++
 .../datasphere-transformation-logic/SKILL.md  |  707 +++++++
 .../references/transformation-patterns.md     |  730 +++++++
 .../datasphere-transport-manager/SKILL.md     | 1468 ++++++++++++++
 .../references/transport-operations.md        | 1784 +++++++++++++++++
 .../skills/datasphere-view-architect/SKILL.md |  421 ++++
 .../references/view-modeling-guide.md         |  742 +++++++
 52 files changed, 32429 insertions(+)
 create mode 100644 partner-built/SAP-Datasphere/.claude-plugin/plugin.json
 create mode 100644 partner-built/SAP-Datasphere/.mcp.json
 create mode 100644 partner-built/SAP-Datasphere/LICENSE
 create mode 100644 partner-built/SAP-Datasphere/README.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-admin/references/security-governance.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-admin/references/space-management.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-admin/references/system-monitoring.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-admin/references/transport.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/references/analytic-model-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/references/content-catalog.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/references/bw-bridge-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/references/catalog-governance-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-cli-automator/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-cli-automator/references/cli-reference.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-connections/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-connections/references/authentication.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-connections/references/connection-types.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-connections/references/troubleshooting-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/data-flows.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/transformation-flows.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/references/data-sharing-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-explorer/references/exploration-workflows.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/abap-side-monitoring.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/error-catalog.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/replication-flow-error-patterns.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/references/intelligent-lookup-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/diagnostic-procedures.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/optimization-techniques.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/cds-replication-architecture.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/s4hana-integration-guide.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-security-architect/references/security-patterns.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/references/transformation-patterns.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-transport-manager/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-transport-manager/references/transport-operations.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md
 create mode 100644 partner-built/SAP-Datasphere/skills/datasphere-view-architect/references/view-modeling-guide.md

diff --git a/partner-built/SAP-Datasphere/.claude-plugin/plugin.json b/partner-built/SAP-Datasphere/.claude-plugin/plugin.json
new file mode 100644
index 0000000..ccee345
--- /dev/null
+++ b/partner-built/SAP-Datasphere/.claude-plugin/plugin.json
@@ -0,0 +1,12 @@
+{
+  "name": "datasphere",
+  "version": "0.3.0",
+  "description": "The most comprehensive SAP Datasphere plugin for Claude. 18 specialized skills covering exploration, data modeling, integration, BW Bridge migration, security architecture, CLI automation, business content activation, catalog governance, performance optimization, and troubleshooting — all through natural language. Powered by 45 MCP tools with enterprise-grade security.",
+  "author": {
+    "name": "Mario De Felipe",
+    "url": "https://github.com/MarioDeFelipe"
+  },
+  "homepage": "https://github.com/MarioDeFelipe/sap-datasphere-plugin-for-claude-cowork",
+  "repository": "https://github.com/MarioDeFelipe/sap-datasphere-plugin-for-claude-cowork",
+  "license": "MIT"
+}
diff --git a/partner-built/SAP-Datasphere/.mcp.json b/partner-built/SAP-Datasphere/.mcp.json
new file mode 100644
index 0000000..8a89a54
--- /dev/null
+++ b/partner-built/SAP-Datasphere/.mcp.json
@@ -0,0 +1,16 @@
+{
+  "mcpServers": {
+    "sap-datasphere": {
+      "command": "npx",
+      "args": ["-y", "@mariodefe/sap-datasphere-mcp"],
+      "env": {
+        "DATASPHERE_BASE_URL": "~~https://your-tenant.region.hcs.cloud.sap",
+        "DATASPHERE_CLIENT_ID": "~~your-oauth-client-id",
+        "DATASPHERE_CLIENT_SECRET": "~~your-oauth-client-secret",
+        "DATASPHERE_TOKEN_URL": "~~https://your-tenant.authentication.region.hana.ondemand.com/oauth/token",
+        "DATASPHERE_AUTH_URL": "~~https://your-tenant.authentication.region.hana.ondemand.com",
+        "USE_MOCK_DATA": "false"
+      }
+    }
+  }
+}
diff --git a/partner-built/SAP-Datasphere/LICENSE b/partner-built/SAP-Datasphere/LICENSE
new file mode 100644
index 0000000..6aa4e67
--- /dev/null
+++ b/partner-built/SAP-Datasphere/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Datasphere Automations
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/partner-built/SAP-Datasphere/README.md b/partner-built/SAP-Datasphere/README.md
new file mode 100644
index 0000000..234b8cd
--- /dev/null
+++ b/partner-built/SAP-Datasphere/README.md
@@ -0,0 +1,160 @@
+# SAP Datasphere Plugin for Claude
+
+The most comprehensive SAP Datasphere plugin for Claude. 18 specialized skills, 30 reference files, and 31,000+ lines of expert content covering every major Datasphere workflow — from data exploration and view design to BW Bridge migration, security architecture, CLI automation, and catalog governance. Powered by a production-grade MCP server with 45 tools, OAuth 2.0 authentication, and enterprise-level security including SQL sanitization and PII filtering.
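[Reviewer note: the README above mentions OAuth 2.0 authentication. For context, the token exchange that the bundled MCP server performs under the hood is a standard OAuth 2.0 client-credentials grant against the tenant's token endpoint. The sketch below only *builds* such a request — no network call is made, and the tenant URL and credentials are placeholders, not values from this plugin:]

```python
# Illustrative sketch of an OAuth 2.0 client-credentials token request,
# the grant type used for SAP Datasphere technical users. Placeholder
# endpoint and credentials; the MCP server handles this automatically.
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Return a prepared POST request for an OAuth 2.0 access token."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


req = build_token_request(
    "https://mytenant.authentication.eu20.hana.ondemand.com/oauth/token",
    "my-client-id",
    "my-client-secret",
)
print(req.get_method(), req.full_url)
```

[Sending the request (e.g. via `urllib.request.urlopen(req)`) would return a JSON body containing an `access_token`; the MCP server refreshes this token automatically as it expires.]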
+
+## Skills
+
+### Exploration & Discovery
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-explorer** | Guided exploration and discovery — browse spaces, search the catalog, inspect schemas, profile data quality, trace lineage, and build queries interactively |
+
+### Data Modeling
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-view-architect** | Design Graphical and SQL views with proper semantic usage (Fact, Dimension, Text, Hierarchy), associations, persistence strategies, and performance optimization |
+| **datasphere-analytic-model-creator** | Create Analytic Models for SAP Analytics Cloud with measures, dimensions, variables, currency conversion, and exception aggregation |
+| **datasphere-intelligent-lookup** | Configure fuzzy matching and intelligent lookups for data harmonization across sources — matching strategies, threshold tuning, and review workflows |
+
+### Data Integration
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-data-flows** | Orchestrate replication flows (with CDC), data flows (visual ETL), transformation flows (SQL-based delta), and task chains |
+| **datasphere-transformation-logic** | Generate and validate SQLScript and Python transformations — SCD Type 2, deduplication, pivoting, delta handling patterns |
+| **datasphere-s4hana-import** | Import entities from SAP S/4HANA and BW/4HANA — CDS views, ODP extractors, Cloud Connector setup, and delta extraction |
+| **datasphere-connections** | Create and manage 35+ connection types including SAP S/4HANA, BigQuery, Redshift, Kafka, and generic JDBC/OData connectors |
+
+### Migration
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-bw-bridge-migration** | Migrate from BW/4HANA using Shell and Remote Conversion — ADSO modeling, Process Chain to Task Chain mapping, hybrid operation, and decommissioning |
+
+### Security
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-security-architect** | Design row-level security with Data Access Controls (DAC), import BW Analysis Authorizations, configure audit policies, and integrate Identity Providers (SAML/OIDC) |
+
+### Administration & Governance
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-admin** | Space management, user and role administration, system monitoring, capacity planning, and transport operations |
+| **datasphere-cli-automator** | Automate administration via CLI — generate JSON payloads for bulk space/user/connection provisioning, manage certificates, and build CI/CD pipelines |
+| **datasphere-data-product-publisher** | Publish data products through the Data Sharing Cockpit — product descriptions, license terms, visibility settings, and marketplace management |
+| **datasphere-transport-manager** | Manage CSN/JSON transport packages — dependency checking, export/import workflows, conflict resolution, and Content Network integration |
+| **datasphere-business-content-activator** | Activate pre-built SAP Business Content packages — prerequisite checking (Time Dimensions, TCUR*), LSA++ alignment, and content update management |
+| **datasphere-catalog-steward** | Internal data governance — metadata enrichment, glossary term management, KPI definitions, tag taxonomies, and lineage-based impact analysis |
+
+### Monitoring & Troubleshooting
+
+| Skill | Description |
+|-------|-------------|
+| **datasphere-flow-doctor** | Diagnose and resolve errors in Data Flows, Replication Flows, and Transformation Flows — error catalogs, root cause analysis, and fix recommendations |
+| **datasphere-performance-optimizer** | Analyze and optimize performance — View Analyzer, Explain Plans, persistence strategies, partitioning, storage tiering, and query tuning |
+
+## Reference Library (30 files)
+
+Each skill includes detailed reference documentation for deep-dive guidance:
+
+| Skill | Reference Files |
+|-------|----------------|
+| **explorer** | exploration-workflows.md |
+| **view-architect** | view-modeling-guide.md |
+| **analytic-model-creator** | analytic-model-guide.md |
+| **intelligent-lookup** | intelligent-lookup-guide.md |
+| **data-flows** | data-flows.md, replication-flows.md, transformation-flows.md, task-chains.md |
+| **transformation-logic** | transformation-patterns.md |
+| **s4hana-import** | s4hana-integration-guide.md, cds-replication-architecture.md |
+| **connections** | authentication.md, connection-types.md, troubleshooting-guide.md |
+| **bw-bridge-migration** | bw-bridge-guide.md |
+| **security-architect** | security-patterns.md |
+| **admin** | space-management.md, system-monitoring.md, security-governance.md, transport.md |
+| **cli-automator** | cli-reference.md |
+| **data-product-publisher** | data-sharing-guide.md |
+| **transport-manager** | transport-operations.md |
+| **business-content-activator** | content-catalog.md |
+| **catalog-steward** | catalog-governance-guide.md |
+| **flow-doctor** | error-catalog.md, abap-side-monitoring.md, replication-flow-error-patterns.md |
+| **performance-optimizer** | optimization-techniques.md, diagnostic-procedures.md |
+
+## Prerequisites
+
+- [Claude Code](https://claude.com/claude-code) v1.0.33+ or Claude Desktop with Cowork mode
+- An SAP Datasphere tenant with OAuth 2.0 client credentials
+- Node.js 18+ (for the MCP server)
+
+## Installation
+
+Install from a marketplace or load directly:
+
+```bash
+claude --plugin-dir ./sap-datasphere-plugin-for-claude-cowork
+```
+
+## Configuration
+
+After installation, configure your SAP Datasphere connection by setting the environment variables in `.mcp.json`:
+
+| Variable | Description | Example |
+|----------|-------------|---------|
+| `DATASPHERE_BASE_URL` | Your tenant URL | `https://mytenant.eu20.hcs.cloud.sap` |
+| `DATASPHERE_CLIENT_ID` | OAuth client ID | From SAP BTP cockpit |
+| `DATASPHERE_CLIENT_SECRET` | OAuth client secret | From SAP BTP cockpit |
+| `DATASPHERE_TOKEN_URL` | OAuth token endpoint | `https://mytenant.authentication.eu20.hana.ondemand.com/oauth/token` |
+| `DATASPHERE_AUTH_URL` | OAuth auth endpoint | `https://mytenant.authentication.eu20.hana.ondemand.com` |
+
+### Setting up OAuth credentials in SAP BTP
+
+1. Open the SAP BTP Cockpit for your subaccount
+2. Navigate to **Security > Instances and Subscriptions**
+3. Create a service instance for SAP Datasphere with the appropriate scopes
+4. Create a service key to obtain your client ID and secret
+
+## Usage
+
+Once configured, just talk to Claude naturally:
+
+- *"What spaces do we have in Datasphere?"*
+- *"Help me design a star schema for customer analytics"*
+- *"Create an analytic model with revenue measures"*
+- *"Import CDS views from our S/4HANA system"*
+- *"Migrate our BW Process Chains to Task Chains"*
+- *"Set up row-level security on the sales data"*
+- *"Bulk-create 50 users via CLI"*
+- *"Activate the Automotive business content package"*
+- *"My replication flow is failing — help me diagnose"*
+- *"Optimize this slow-running view"*
+- *"Set up a transport package for production deployment"*
+- *"Enrich our catalog with business glossary terms"*
+- *"Generate SCD Type 2 logic for the customer dimension"*
+
+## MCP Server
+
+This plugin uses the [`@mariodefe/sap-datasphere-mcp`](https://www.npmjs.com/package/@mariodefe/sap-datasphere-mcp) MCP server, which provides 45 tools covering:
+
+- **Foundation**: Connection testing, user info, tenant config
+- **Catalog & Discovery**: Space browsing, asset search, marketplace
+- **Schema & Metadata**: Table schemas, OData metadata, analytical models
+- **Data Query**: Smart queries, SQL execution, OData queries
+- **Data Profiling**: Column distribution analysis, cross-asset column search
+- **Repository & Lineage**: Object search, lineage tracing, deployment status
+- **Database Users**: Full CRUD for space-level database users
+
+## Security
+
+The MCP server includes enterprise-grade security:
+
+- OAuth 2.0 with automatic token refresh
+- RBAC-based authorization enforcement
+- SQL injection prevention and query sanitization
+- PII redaction and credential masking
+- Input validation on all tool parameters
+
+## License
+
+MIT
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md
new file mode 100644
index 0000000..b4c3ac5
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md
@@ -0,0 +1,283 @@
+---
+name: datasphere-admin
+description: SAP Datasphere system administration skill for managing spaces, users, roles, security, monitoring, and transport operations. Use when performing administrative tasks including creating/managing spaces, user management, role assignment, system monitoring, capacity planning, or transport package operations.
+---
+
+# SAP Datasphere Administrator
+
+Comprehensive administration skill for SAP Datasphere covering space management, security configuration, system monitoring, and transport operations.
+
+## Navigation Overview
+
+SAP Datasphere uses a left-side navigation menu. Administrative functions are located in the lower section:
+
+| Menu Item | Submenu Items | URL Fragment |
+|-----------|---------------|--------------|
+| Space Management | (direct) | `#/managespaces` |
+| Security | Users, Roles, Authorization Overview, Activities | `#/users`, `#/roles` |
+| Transport | Packages, Export, Import, Monitor | `#/repository_packages` |
+| System Monitor | System Monitor, Capacities | `#/monitoring` |
+| System | Configuration, Administration, About | `#/administration` |
+
+## Space Management
+
+**Navigation:** Left Menu → Space Management
+
+### View All Spaces
+1. Click **Space Management** in left navigation
+2. View space cards showing: Name, Status (Cold/Warm), Disk Storage, Memory Storage
+3. Toggle between tile/list view using icons in top-right toolbar
+
+### Create a New Space
+1. Navigate to Space Management
+2. Click **Create** button (top toolbar) OR click **+** on Spaces card from Home
+3. Fill in space details:
+   - Business Name (display name)
+   - Technical Name (system identifier)
+   - Storage allocation (Disk and Memory quotas)
+4. Click **Create**
+
+### Edit Space Properties
+1. Navigate to Space Management
+2. Locate the space card
+3. Click **Edit** button on the card
+4. Modify: Storage quotas, Members, Properties
+5. Save changes
+
+### Space Status Management
+- **Lock/Unlock:** Select space → Click Lock/Unlock in toolbar
+- **Monitor:** Select space → Click Monitor to view space-specific metrics
+- **Delete:** Select space → Click Delete (moves to Recycle Bin)
+
+### Recycle Bin
+- Access via left panel in Space Management
+- Restore or permanently delete spaces
+
+## Security Administration
+
+**Navigation:** Left Menu → Security → [submenu]
+
+### User Management
+**Path:** Security → Users (`#/users`)
+
+#### View Users
+1. Navigate to Security → Users
+2. Use Filter By panel to filter by: Role Name, Scope Name, License Type
+3. Results show: User ID, Display Name, First Name, Last Name, Email, Roles count
+
+#### Add New User
+1. Click **+** (Add) button in toolbar
+2. Enter user details
+3. Assign roles
+4. Save
+
+#### Edit User
+1. Select user row
+2. Click Edit icon
+3. Modify user properties and role assignments
+4. Save
+
+### Role Management
+**Path:** Security → Roles (`#/roles`)
+
+#### Standard Roles
+| Role | Description |
+|------|-------------|
+| Data Warehouse Cloud Space Administrator | Full space privileges |
+| Data Warehouse Cloud Integrator | Full data integration privileges |
+| Data Warehouse Cloud Modeler | Modeling privileges |
+| Data Warehouse Cloud Viewer | View-only access |
+| Data Warehouse Cloud Consumer | Data consumption privileges |
+| Data Warehouse Cloud Extended Viewer | Extended viewer privileges |
+| Data Catalog Administrator | Full catalog privileges |
+| Data Catalog User | Read catalog privileges |
+
+#### Scoped Roles
+Scoped roles limit permissions to specific spaces. Naming pattern: `Scoped Data Warehouse Cloud [Role Type]`
+
+#### Create Custom Role
+1. Navigate to Security → Roles
+2. Click **+** button
+3. Define role name and description
+4. Configure permissions
+5. Save
+
+### Authorization Overview
+**Path:** Security → Authorization Overview
+
+View consolidated authorization matrix across users, roles, and spaces.
+
+### Activities Log
+**Path:** Security → Activities
+
+Monitor user activities and audit trail.
+
+## System Monitoring
+
+**Navigation:** Left Menu → System Monitor → [submenu]
+
+### Dashboard
+**Path:** System Monitor → System Monitor (`#/monitoring`)
+
+**Tabs available:**
+- Dashboard (default)
+- Elastic Compute Nodes
+- Task Logs
+- Statement Logs
+- Object Store
+
+**Dashboard Metrics:**
+- Disk Storage Used (pie chart breakdown)
+- Disk Used by Spaces for Storage
+- Memory Used by Spaces for Storage
+- Failed Tasks (Last 24 Hours)
+- Out-of-Memory Errors
+- Top 5 Out-of-Memory Errors by Space
+- Admission Control Events
+
+### Capacities
+**Path:** System Monitor → Capacities
+
+View and manage compute capacity allocation.
+
+### Task Logs
+1. Navigate to System Monitor → System Monitor
+2. Click **Task Logs** tab
+3. Filter by: Space, Task Type, Status, Date Range
+4. View execution details and errors
+
+### Statement Logs
+1. Navigate to System Monitor → System Monitor
+2. Click **Statement Logs** tab
+3. Analyze SQL statement execution
+
+## Transport Operations
+
+**Navigation:** Left Menu → Transport → [submenu]
+
+### Packages
+**Path:** Transport → Packages (`#/repository_packages`)
+
+#### View Packages
+1. Navigate to Transport → Packages
+2. Filter by Space using dropdown
+3. View: Business Name, Technical Name, Space, Created On, Result
+
+#### Create Package
+1. Click **+** button
+2. Select space and objects to include
+3. Define package name
+4. Save
+
+### Export
+**Path:** Transport → Export
+
+Export packages for deployment to other systems.
+
+### Import
+**Path:** Transport → Import
+
+Import packages from other Datasphere instances.
+
+### Transport Monitor
+**Path:** Transport → Monitor
+
+Track transport operation status and history.
+
+## System Administration
+
+**Navigation:** Left Menu → System → Administration (`#/administration`)
+
+### Configuration Tabs
+| Tab | Purpose |
+|-----|---------|
+| System Configuration | Session timeout, SAP support access |
+| Tenant Links | External system links |
+| Data Source Configuration | Data source settings |
+| Security | Security policies |
+| App Integration | Third-party integrations |
+| Notifications | Alert and notification settings |
+
+### System Configuration
+1. Navigate to System → Administration
+2. On System Configuration tab:
+   - Set **Session Timeout** (default: 3600 seconds)
+   - Toggle **Allow SAP support user creation**
+3. Click Edit to modify, Save to confirm
+
+## Common Administrative Workflows
+
+### Onboard New Team Member
+1. Security → Users → Add user
+2. Assign appropriate roles (e.g., Data Warehouse Cloud Modeler)
+3. Space Management → Edit space → Add member to relevant space(s)
+4. Configure scoped roles if needed
+
+### Capacity Planning
+1. System Monitor → Dashboard → Review storage metrics
+2. System Monitor → Capacities → Assess compute allocation
+3. Space Management → Edit spaces to adjust quotas as needed
+
+### Troubleshoot Failed Tasks
+1. System Monitor → Task Logs → Filter by Failed status
+2. Review error details
+3. Check Statement Logs for SQL-level issues
+4. Review Out-of-Memory metrics if relevant
+
+## MCP Tools Integration
+
+When the SAP Datasphere MCP Server is connected (via Claude Desktop), the following tools are available for programmatic administration:
+
+### Foundation Tools
+| Tool | Description |
+|------|-------------|
+| `test_connection` | Verify connectivity to Datasphere tenant |
+| `get_current_user` | Get current authenticated user info |
+| `get_tenant_info` | Get tenant configuration details |
+| `list_spaces` | List all available spaces |
+
+### Catalog & Discovery Tools
+| Tool | Description |
+|------|-------------|
+| `list_catalog_assets` | Browse catalog assets |
+| `get_asset_details` | Get detailed asset metadata |
+| `search_catalog` | Search for assets by criteria |
+| `find_assets_by_column` | Find assets containing specific columns |
+
+### Data Quality & Analysis Tools
+| Tool | Description |
+|------|-------------|
+| `smart_query` | Intelligent data querying |
+| `query_analytical_data` | Query analytical models |
+| `query_relational_entity` | Query relational tables/views |
+| `analyze_column_distribution` | Analyze data distribution |
+
+### Database User Management Tools
+| Tool | Description |
+|------|-------------|
+| `list_database_users` | List database users in a space |
+| `create_database_user` | Create new database user |
+| `get_database_user_details` | Get user configuration |
+| `update_database_user` | Modify user settings |
+| `delete_database_user` | Remove database user |
+
+### MCP Server Setup
+To use MCP tools, configure Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):
+```json
+{
+  "mcpServers": {
+    "sap-datasphere": {
+      "command": "/path/to/start_sap_datasphere_mcp.sh",
+      "args": []
+    }
+  }
+}
+```
+
+## Resources
+
+See reference files for detailed procedures:
+- `references/space-management.md` - Detailed space operations
+- `references/security-governance.md` - Security configuration details
+- `references/system-monitoring.md` - Monitoring and troubleshooting
+- `references/transport.md` - Transport lifecycle management
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/references/security-governance.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/security-governance.md
new file mode 100644
index 0000000..52acaaf
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/security-governance.md
@@ -0,0 +1,137 @@
+# Security & Governance Reference
+
+## Table of Contents
+1. [User Management](#user-management)
+2. [Role Architecture](#role-architecture)
+3. [Scoped Roles](#scoped-roles)
+4. [Data Access Controls](#data-access-controls)
+5. [Authorization Overview](#authorization-overview)
+6. [Activity Monitoring](#activity-monitoring)
+
+## User Management
+
+### User Lifecycle
+1. **Provisioning:** Users created via SAP BTP cockpit or SCIM
+2. **Assignment:** Add to Datasphere and assign roles
+3. **Space Access:** Grant space membership
+4. **Deprovisioning:** Remove roles and space access
+
+### User Properties
+| Property | Description |
+|----------|-------------|
+| User ID | Unique identifier (typically email) |
+| Display Name | Full name shown in UI |
+| First Name | Given name |
+| Last Name | Family name |
+| Email | Contact email |
+| License Type | SAP Datasphere license assignment |
+
+### Filtering Users
+Filter panel options:
+- **Role Name:** Filter by assigned role
+- **Scope Name:** Filter by space assignment
+- **License Type:** Filter by license tier
+
+## Role Architecture
+
+### Standard Roles (Global)
+
+| Role | Privileges |
+|------|-----------|
+| **DW Cloud Space Administrator** | Full administrative access to assigned spaces |
+| **DW Cloud Integrator** | Create and manage data flows, replication, connections |
+| **DW Cloud Modeler** | Create and modify data models, views, tables |
+| **DW Cloud Viewer** | Read-only access to data and models |
+| **DW Cloud Consumer** | Consume data for analytics (SAC integration) |
+| **DW Cloud Extended Viewer** | Enhanced viewer with additional read permissions |
+| **DW Cloud AI Consumer** | Access AI/ML features |
+| **Data Catalog Administrator** | Full catalog management |
+| **Data Catalog User** | Browse and search catalog |
+
+### Assigning Roles
+1. Security → Users → Select user
+2. Click Edit
+3. Navigate to Roles section
+4. Add/remove role assignments
+5. Save
+
+## Scoped Roles
+
+### Concept
+Scoped roles restrict permissions to specific spaces rather than global access.
+
+### Naming Convention
+`Scoped Data Warehouse Cloud [Role Type]`
+
+Examples:
+- Scoped Data Warehouse Cloud Viewer
+- Scoped Data Warehouse Cloud Modeler
+- Scoped Data Warehouse Cloud Space Administrator
+
+### Creating Scoped Role Assignment
+1. Security → Roles
+2. Select scoped role template
+3. Configure scope (select spaces)
+4. Assign to users
+
+### When to Use Scoped Roles
+- Multi-tenant environments with isolated teams
+- Project-based access control
+- Principle of least privilege implementation
+
+## Data Access Controls
+
+### Row-Level Security
+Implement DAC to restrict data visibility:
+
+1. Define criteria (e.g., region, department)
+2. Map criteria to user attributes
+3. Apply to views/tables
+
+### Column-Level Security
+Restrict access to sensitive columns:
+- Mask sensitive data
+- Hide columns from unauthorized users
+
+### Implementation Path
+1. Data Builder → Create Data Access Control
+2. Define criteria structure
+3. Map to business semantics
+4. Assign to protected entities
+
+## Authorization Overview
+
+### Accessing Authorization Matrix
+Path: Security → Authorization Overview
+
+### Matrix Views
+- User vs. Role assignments
+- Role vs. Permission mappings
+- Space vs. User access
+
+### Use Cases
+- Audit compliance checks
+- Access review campaigns
+- Permission troubleshooting
+
+## Activity Monitoring
+
+### Activity Log Location
+Path: Security → Activities
+
+### Logged Events
+- User logins/logouts
+- Object modifications
+- Data access events
+- Administrative changes
+
+### Filtering Activities
+- By user
+- By action type
+- By date range
+- By object/space
+
+### Audit Best Practices
+- Regular activity review
+- Export logs for compliance
+- Set up alerting for sensitive operations
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/references/space-management.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/space-management.md
new file mode 100644
index 0000000..81f43a6
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/space-management.md
@@ -0,0 +1,130 @@
+# Space Management Reference
+
+## Table of Contents
+1. [Space Concepts](#space-concepts)
+2. [Creating Spaces](#creating-spaces)
+3. [Space Configuration](#space-configuration)
+4. [Member Management](#member-management)
+5. [Storage Management](#storage-management)
+6. [Elastic Compute Nodes](#elastic-compute-nodes)
+
+## Space Concepts
+
+Spaces are isolated virtual environments that contain data models, objects, and user assignments. Each space has:
+- **Technical Name:** System identifier (cannot be changed after creation)
+- **Business Name:** Human-readable display name
+- **Status:** Cold (inactive) or Warm (active with loaded data)
+- **Storage Quotas:** Disk and Memory limits
+
+## Creating Spaces
+
+### Step-by-Step Process
+1. Navigate: Left Menu → Space Management
+2. Click **Create** button in top toolbar
+3. Complete the form:
+
+| Field | Description | Example |
+|-------|-------------|---------|
+| Business Name | Display name | "Finance Analytics" |
+| Technical Name | System ID (uppercase, underscores) | "FINANCE_ANALYTICS" |
+| Disk Storage | Maximum disk quota | 2 GB |
+| Memory Storage | Maximum memory quota | 1 GB |
+
+4. Click **Create** to provision the space
+
+### Best Practices
+- Use meaningful technical names that reflect purpose
+- Start with conservative storage quotas and expand as needed
+- Document space purpose in the description field
+
+## Space Configuration
+
+### General Properties
+Access via: Space card → Edit → General tab
+
+- Business Name (editable)
+- Description
+- Priority (1-5, affects resource allocation)
+- Time Data settings
+
+### Database Users
+Configure Open SQL schema access:
+1. Space → Edit → Database Users tab
+2. Add database user with credentials
+3. Configure schema access permissions
+
+### Connections
+Associate data connections with the space:
+1. Space → Edit → Connections tab
+2. Assign existing connections or create new ones
+
+## Member Management
+
+### Adding Members
+1. Space card → Edit → Members tab
+2. Click **Add**
+3. Search for user by ID or name
+4. 
Select role for this space: + - Space Administrator + - Integrator + - Modeler + - Viewer + +### Role Permissions in Space Context +| Space Role | Create/Edit Models | Run Data Flows | View Data | Manage Space | +|------------|-------------------|----------------|-----------|--------------| +| Administrator | Yes | Yes | Yes | Yes | +| Integrator | Yes | Yes | Yes | No | +| Modeler | Yes | Limited | Yes | No | +| Viewer | No | No | Yes | No | + +### Removing Members +1. Space → Edit → Members tab +2. Select member row +3. Click **Remove** +4. Confirm action + +## Storage Management + +### Monitoring Storage Usage +View from Space Management: +- Disk for Storage: Used/Allocated +- Memory for Storage: Used/Allocated +- Progress bars indicate utilization + +### Adjusting Quotas +1. Space → Edit → General tab +2. Modify Disk Storage or Memory Storage values +3. Save changes + +Note: Cannot reduce below current usage. + +### Storage Best Practices +- Monitor usage regularly via System Monitor +- Set alerts for high utilization +- Plan for growth in data volumes + +## Elastic Compute Nodes + +### Overview +Elastic Compute Nodes provide additional compute capacity on-demand. + +### Viewing ECN Status +Location: Space Management → Left panel → "Elastic Compute Nodes" + +Shows: +- Block-Hour Remaining +- View Logs link +- Create option + +### Creating ECN +1. Click **Create** in Elastic Compute Nodes section +2. Configure: + - Node size + - Duration + - Associated space +3. 
Confirm creation + +### Monitoring ECN Usage +- View Logs shows consumption history +- Block-hours are consumed during active usage diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/references/system-monitoring.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/system-monitoring.md new file mode 100644 index 0000000..8c9561d --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/system-monitoring.md @@ -0,0 +1,181 @@ +# System Monitoring Reference + +## Table of Contents +1. [Dashboard Overview](#dashboard-overview) +2. [Storage Monitoring](#storage-monitoring) +3. [Task Logs](#task-logs) +4. [Statement Logs](#statement-logs) +5. [Memory Management](#memory-management) +6. [Elastic Compute Nodes](#elastic-compute-nodes) +7. [Troubleshooting Guide](#troubleshooting-guide) + +## Dashboard Overview + +### Accessing System Monitor +Path: Left Menu → System Monitor → System Monitor + +### Dashboard Layout +The dashboard displays real-time and historical metrics: + +| Card | Metric | Time Range | +|------|--------|------------| +| Disk Storage Used | Pie chart breakdown | Now | +| Disk Used by Spaces | Usage vs quota | Now | +| Memory Used by Spaces | Usage vs quota | Now | +| Failed Tasks | Count | Last 24 Hours | +| Out-of-Memory Errors | Count | Last 24 Hours | +| Top 5 OOM by Space | Space breakdown | Last 7 Days | +| Admission Control Events | Rejection/Queuing | Last 24 Hours | + +### Storage Categories +- **Other Data:** System and operational data +- **Data in Spaces:** User space data +- **Administrative Data:** Metadata and configurations +- **Audit Log Data:** Activity logging + +## Storage Monitoring + +### Disk Storage Metrics +- Total allocated vs used +- Breakdown by category +- Trend analysis available + +### Memory Storage Metrics +- In-memory table usage +- Per-space breakdown +- Warm vs cold data + +### Capacity Planning Indicators +- Red: >90% utilization (critical) +- Yellow: 70-90% 
utilization (warning) +- Green: <70% utilization (healthy) + +## Task Logs + +### Accessing Task Logs +1. System Monitor → System Monitor +2. Click **Task Logs** tab + +### Task Types Logged +- Data Flow executions +- Replication Flow runs +- Transformation Flow operations +- View persistence tasks +- Task Chain executions + +### Filtering Options +| Filter | Options | +|--------|---------| +| Space | All spaces or specific | +| Task Type | Flow type selection | +| Status | Running, Completed, Failed | +| Date Range | Custom date selection | + +### Task Details View +Click on a task row to see: +- Start/end timestamps +- Duration +- Records processed +- Error messages (if failed) +- Step-by-step execution log + +## Statement Logs + +### Accessing Statement Logs +1. System Monitor → System Monitor +2. Click **Statement Logs** tab + +### Logged Statements +- SQL queries +- DDL operations +- DML operations + +### Analysis Capabilities +- Execution time analysis +- Resource consumption +- Query optimization hints + +### Filtering Statements +- By user +- By statement type +- By execution time threshold +- By date range + +## Memory Management + +### Out-of-Memory (OOM) Errors +Dashboard shows: +- OOM count (last 24 hours) +- Top 5 spaces by OOM (last 7 days) +- MDS request failures + +### Causes of OOM +- Large data loads +- Complex queries +- Insufficient memory quota +- Multiple concurrent operations + +### Resolution Steps +1. Identify affected space from dashboard +2. Review Task Logs for failing operations +3. 
Options: + - Increase space memory quota + - Optimize query/model + - Schedule during off-peak + - Use Elastic Compute Nodes + +## Elastic Compute Nodes + +### Monitoring ECN +Path: System Monitor → Elastic Compute Nodes tab + +### Metrics Available +- Active nodes +- Block-hours consumed +- Usage by space + +### ECN Best Practices +- Use for burst workloads +- Monitor consumption trends +- Plan block-hour budgets + +## Troubleshooting Guide + +### Failed Task Investigation +1. System Monitor → Task Logs +2. Filter: Status = Failed +3. Click task for details +4. Review error message +5. Common resolutions: + - Connection timeout: Check source availability + - Memory error: Increase quota or optimize + - Permission error: Verify user roles + +### Slow Performance Investigation +1. System Monitor → Statement Logs +2. Sort by execution time +3. Identify slow queries +4. Options: + - Add indexes + - Optimize joins + - Partition large tables + - Persist frequently accessed views + +### Storage Capacity Issues +1. Dashboard → Check utilization percentages +2. Identify top consumers by space +3. Options: + - Archive old data + - Delete unused objects + - Increase storage quotas + - Compress data + +### Admission Control Events +Indicates resource contention: +- **Rejection Events:** Requests denied due to resource limits +- **Queuing Events:** Requests waiting for resources + +Resolution: +- Stagger scheduled tasks +- Increase capacity +- Optimize resource-heavy operations diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/references/transport.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/transport.md new file mode 100644 index 0000000..50b6952 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/references/transport.md @@ -0,0 +1,197 @@ +# Transport Operations Reference + +## Table of Contents +1. [Transport Concepts](#transport-concepts) +2. [Package Management](#package-management) +3. 
[Export Operations](#export-operations) +4. [Import Operations](#import-operations) +5. [Transport Monitoring](#transport-monitoring) +6. [Best Practices](#best-practices) + +## Transport Concepts + +### Purpose +Transport enables moving Datasphere objects between environments: +- Development → Test → Production +- Tenant to tenant migration +- Backup and recovery + +### Transportable Objects +- Tables (local and remote) +- Views (graphical and SQL) +- Analytic Models +- Data Flows +- Replication Flows +- Transformation Flows +- Task Chains +- Connections (metadata only) + +### Transport Lifecycle +1. Create Package (group objects) +2. Export Package (generate archive) +3. Transfer Archive (download/upload) +4. Import Package (deploy to target) +5. Monitor Results + +## Package Management + +### Accessing Packages +Path: Left Menu → Transport → Packages + +### Package Properties +| Field | Description | +|-------|-------------| +| Business Name | Display name | +| Technical Name | System identifier | +| Space | Source space | +| Created On | Creation timestamp | +| Result | Status (Success/Failed) | + +### Creating a Package +1. Navigate to Transport → Packages +2. Click **+** (Add) button +3. Configure package: + - Select source space + - Enter business name + - Enter technical name +4. Add objects: + - Browse space objects + - Select objects to include + - Dependencies auto-included +5. Save package + +### Editing a Package +1. Select package row +2. Click Edit icon +3. Add/remove objects +4. Save changes + +### Deleting a Package +1. Select package row +2. Click Delete icon +3. Confirm deletion + +Note: Deleting a package does not affect source objects. + +## Export Operations + +### Accessing Export +Path: Left Menu → Transport → Export + +### Export Process +1. Navigate to Transport → Export +2. Select package(s) to export +3. Configure export options: + - Include data (optional) + - Compression settings +4. Execute export +5. 
Download archive file (.zip) + +### Export Considerations +- **With Data:** Includes actual data records (larger file) +- **Without Data:** Structure only (smaller file) +- Export validates object dependencies + +### Export Troubleshooting +| Issue | Resolution | +|-------|------------| +| Missing dependencies | Add required objects to package | +| Permission denied | Verify export privileges | +| Large file timeout | Split into smaller packages | + +## Import Operations + +### Accessing Import +Path: Left Menu → Transport → Import + +### Import Process +1. Navigate to Transport → Import +2. Upload archive file +3. Select target space +4. Configure import options: + - Overwrite existing (yes/no) + - Data handling +5. Preview changes +6. Execute import +7. Review results + +### Import Modes +| Mode | Behavior | +|------|----------| +| Create Only | Fails if object exists | +| Overwrite | Replaces existing objects | +| Merge | Updates without losing custom changes | + +### Import Validation +Pre-import checks: +- Object name conflicts +- Dependency availability +- Connection references +- Space capacity + +### Post-Import Tasks +- Verify object functionality +- Update connection credentials +- Test data flows +- Validate security settings + +## Transport Monitoring + +### Accessing Monitor +Path: Left Menu → Transport → Monitor + +### Monitored Information +- Transport execution history +- Status per object +- Error details +- Timing information + +### Status Values +| Status | Meaning | +|--------|---------| +| Running | In progress | +| Completed | Successful | +| Completed with Warnings | Success with issues | +| Failed | Error occurred | + +### Viewing Details +Click on transport record to see: +- Object-level status +- Error messages +- Execution timestamps + +## Best Practices + +### Package Organization +- Group related objects together +- Use meaningful names +- Document package contents +- Version packages (v1, v2, etc.) + +### Development Workflow +1. 
Develop in DEV space +2. Create package +3. Export without data +4. Import to TEST +5. Validate functionality +6. Export for PROD +7. Import to PROD with approval + +### Dependency Management +- Include all dependencies +- Test complete package +- Document external dependencies + +### Security Considerations +- Remove sensitive data before transport +- Update credentials in target +- Verify role assignments post-import + +### Troubleshooting Transport Failures +1. Check Transport Monitor for errors +2. Common issues: + - Missing dependencies → Add to package + - Name conflicts → Rename or use overwrite + - Connection errors → Verify target connections + - Capacity limits → Free space or increase quota +3. Retry after resolution diff --git a/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md new file mode 100644 index 0000000..3b0fbcc --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md @@ -0,0 +1,688 @@ +--- +name: Analytic Model Creator +description: "Build Analytic Models for SAP Analytics Cloud consumption. Use this skill when you need to create reporting dimensions, define sophisticated measures (calculated, restricted, count distinct), configure currency/unit conversions, set up exception aggregation, or prepare data for SAC dashboards. Essential for analytics layer design, KPI definition, and self-service BI enablement." +--- + +# Analytic Model Creator Skill + +## Overview + +The Analytic Model Creator skill guides you through designing and implementing Analytic Models in SAP Datasphere. Analytic Models are semantic objects that present data to analytics tools like SAP Analytics Cloud (SAC) in a consumable, pre-aggregated format. They sit above Fact and Dimension views, providing a polished interface for end users to create reports and dashboards. + +## What Are Analytic Models? 

### Definition
An Analytic Model is a semantic object that combines:
- **One Fact source** — Contains quantitative measures and primary grain
- **Multiple Dimensions** — Provide context and analysis dimensions
- **Measures** — Simple, calculated, or restricted aggregations
- **Variables** — Optional parameters for dynamic filtering
- **Attributes** — Dimension properties and drill-down paths
- **Aggregations** — Exception rules for special measure handling

### Role in the Consumption Layer
```
Raw Data (Databases)
    ↓
Data Builder (Cleanse, Join)
    ↓
Semantic Views (Graphical/SQL)
    ↓
Analytic Models (Aggregated, Structured) ← YOU ARE HERE
    ↓
SAC Dashboards, Reports, Exploration
```

### When to Use Analytic Models

**Create an Analytic Model when:**
- Building SAC dashboards and reports
- End users need self-service analytics
- You need to define sophisticated aggregation rules
- Measures require conditional aggregation or currency conversion
- Multiple consumption patterns are needed from one data source
- Data governance requires controlled access to metrics

**DO NOT use Analytic Models for:**
- Pure data integration or distribution
- Direct operational reporting (use Relational Datasets)
- Real-time transaction querying (use transaction views)
- Data that rarely needs aggregation

## Creating an Analytic Model: Step-by-Step

### Step 1: Select a Fact Source

Choose the fact table that contains your measurable events or transactions.
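Fact source screening can be done programmatically before any modeling work. A minimal Python sketch, assuming a simplified schema dict; the real `get_table_schema` MCP tool may return a different shape, and `screen_fact_candidate` plus the type list are illustrative, not Datasphere APIs:

```python
# Sketch: screening a candidate fact view before building an Analytic Model.
# The schema dict is an assumed shape, not the actual get_table_schema output.

NUMERIC_TYPES = {"DECIMAL", "INTEGER", "BIGINT", "DOUBLE"}

def screen_fact_candidate(schema):
    """Return (ok, notes) for a candidate fact source."""
    # Numeric, non-key columns are measure candidates (amounts, quantities).
    measures = [c["name"] for c in schema["columns"]
                if c["type"] in NUMERIC_TYPES and not c.get("is_key")]
    # Non-key columns ending in "ID" are treated as dimensional foreign keys.
    dim_keys = [c["name"] for c in schema["columns"]
                if c["name"].endswith("ID") and not c.get("is_key")]
    notes = []
    if not measures:
        notes.append("no numeric measure columns")
    if not dim_keys:
        notes.append("no dimensional foreign keys")
    return (not notes, notes)

sales_orders = {
    "columns": [
        {"name": "OrderID", "type": "NVARCHAR", "is_key": True},
        {"name": "CustomerID", "type": "NVARCHAR"},
        {"name": "ProductID", "type": "NVARCHAR"},
        {"name": "Amount", "type": "DECIMAL"},
        {"name": "Quantity", "type": "INTEGER"},
    ]
}

ok, notes = screen_fact_candidate(sales_orders)
print(ok, notes)  # True []
```

A view failing either check is usually a dimension or a pure attribute view rather than a fact source.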
+ +**Fact source selection criteria:** +- Must be a view with semantic usage = "Fact" +- Contains quantitative measures (amounts, counts, quantities) +- Has appropriate grain for analysis (transaction, daily, order level) +- Includes necessary dimensional keys +- Consider data volume and refresh frequency + +**Query fact source schema:** +``` +Use MCP tool: get_table_schema(fact_view_name) +Returns: column list, data types, key indicators +``` + +**Fact source example:** +``` +SalesOrders Fact View +├── OrderID (Key) +├── OrderDate (Time dimension) +├── CustomerID (Customer dimension key) +├── ProductID (Product dimension key) +├── Amount (Measure) +├── Quantity (Measure) +├── OrderStatus (Attribute) +└── CurrencyCode (Attribute) +``` + +### Step 2: Add Dimensions + +Attach dimensional views that provide analysis context. + +**Dimension selection:** +- Customer Dimension → Analyze sales by customer attributes +- Product Dimension → Analyze by product category, brand +- Date Dimension → Analyze over time (month, quarter, year) +- Geography Dimension → Analyze by region, country +- Organization Dimension → Analyze by cost center, department + +**Association mapping:** +Each dimension is linked via a foreign key association: +``` +SalesOrders.CustomerID → Customer.CustomerID +SalesOrders.ProductID → Product.ProductID +SalesOrders.OrderDate → Date.DateKey +``` + +**Best practices:** +- Include all business-relevant dimensions +- Verify foreign key relationships exist +- Document dimension hierarchy levels +- Include dimension keys and attributes +- Use search_catalog to discover available dimensions + +**Dimension example:** +``` +Customer Dimension +├── CustomerID (Key) +├── CustomerName (Attribute) +├── IndustryCode (Attribute) +├── Region (Attribute, drill-down to Country) +├── SalesTerritory (Attribute) +└── CreditLimit (Attribute) +``` + +### Step 3: Define Measures + +Measures are the quantifiable metrics users analyze. 
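The association mappings above only work if every fact foreign key resolves to a dimension member, so a quick referential check before wiring dimensions can save debugging later. A rough Python sketch with illustrative data; the function name and table contents are assumptions, not Datasphere APIs:

```python
# Sketch: checking that fact foreign keys resolve to dimension members
# before defining associations. All data here is illustrative.

fact_rows = [
    {"OrderID": 1, "CustomerID": "C1", "ProductID": "P1", "Amount": 100.0},
    {"OrderID": 2, "CustomerID": "C2", "ProductID": "P9", "Amount": 250.0},
]
customer_dim = {"C1", "C2"}
product_dim = {"P1", "P2"}

def unresolved_keys(rows, fk, dim_keys):
    """Return fact foreign-key values with no matching dimension member."""
    return [r[fk] for r in rows if r[fk] not in dim_keys]

print(unresolved_keys(fact_rows, "CustomerID", customer_dim))  # []
print(unresolved_keys(fact_rows, "ProductID", product_dim))    # ['P9']
```

An unresolved key like `P9` would silently drop or blank out rows in SAC, depending on the join semantics of the association.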
+ +**Measure types:** + +#### Simple Measures +Direct aggregation of a fact table column. + +**Example:** +``` +Measure: Total Sales Amount +Source Column: Amount +Aggregation: SUM +``` + +**Common simple measures:** +- `SUM(Amount)` → Total revenue, total costs +- `SUM(Quantity)` → Total units sold +- `COUNT(*)` → Number of transactions +- `AVG(Price)` → Average unit price +- `MIN(Discount)` → Minimum discount applied +- `MAX(OrderValue)` → Largest order + +#### Calculated Measures +Derived from other measures or columns using expressions. + +**Example:** +``` +Measure: Average Order Value +Formula: Total Sales Amount / Number of Orders + +Measure: Gross Margin +Formula: (Revenue - COGS) / Revenue + +Measure: Days to Delivery +Formula: DATEDIFF(day, OrderDate, DeliveryDate) +``` + +**Expression examples:** +``` +# Percentage calculations +Profit_Margin = Net_Profit / Revenue * 100 + +# Unit economics +Cost_Per_Unit = Total_Cost / Quantity + +# Time-based metrics +Days_In_Inventory = 365 / Inventory_Turnover + +# Ratio analysis +Debt_To_Equity = Total_Debt / Total_Equity +``` + +#### Restricted Measures +Aggregation with specific filter conditions. + +**Example:** +``` +Measure: High-Value Orders +Source: Total Sales Amount +Filter: Amount > 10,000 + +Measure: On-Time Deliveries +Source: COUNT(OrderID) +Filter: ShipDate <= DueDate + +Measure: Completed Orders +Source: COUNT(OrderID) +Filter: Status = 'Completed' +``` + +**Use cases:** +- Key performance indicators with thresholds +- Subset metrics (Premium customers only) +- Conformance counts (Quality metrics) +- Compliance tracking (Orders meeting SLA) + +#### Count Distinct Measures +Count unique values of a dimension key. 
+ +**Example:** +``` +Measure: Number of Customers +Type: Count Distinct +Column: CustomerID +Result: Unique customer count + +Measure: Product Variety +Type: Count Distinct +Column: ProductID +Result: Number of distinct products sold +``` + +**Performance consideration:** +- Count distinct is expensive on large datasets +- Consider materialization if used frequently +- Avoid combining with many other dimensions in filters + +**Measure definition best practices:** +- Use clear, business-friendly names +- Document calculation logic and filters +- Verify aggregation type (SUM vs AVG vs COUNT) +- Test measures with `execute_query` on sample data +- Consider currency and unit attributes +- Define decimal places and formatting + +### Step 4: Configure Measure Aggregation Types + +Specify how measures combine across dimensions. + +**Aggregation types:** + +| Type | Behavior | Example | +|------|----------|---------| +| **SUM** | Add values across dimension | Sum of all order amounts = Total revenue | +| **AVG** | Average across dimension | Average order value by customer | +| **MIN** | Minimum value in dimension | Lowest price per product | +| **MAX** | Highest value in dimension | Highest discount offered | +| **COUNT** | Count non-null values | Number of orders | +| **COUNT DISTINCT** | Unique values in dimension | Unique customers | +| **NONE** | No aggregation (detail level) | Exception cases | + +**Context-specific aggregation:** +``` +Measure: Headcount +SUM by Company (add departments) +AVG by Department (doesn't make sense) +Should use NONE or formula-based + +Measure: Salary +AVG by Department (makes sense) +SUM by Department (total payroll) +``` + +### Step 5: Add Attributes + +Attributes provide drill-down paths and detail information for dimensions. 
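The context-specific aggregation guidance above (a weighted average is not an average of averages) is easy to demonstrate numerically. A few lines of Python with illustrative figures:

```python
# Sketch of the aggregation caveat: averaging per-row unit prices is not
# the same as the weighted average SUM(Amount) / SUM(Quantity).
# All numbers are illustrative.

rows = [
    {"category": "Electronics", "amount": 900.0, "quantity": 3},   # unit price 300
    {"category": "Clothing",    "amount": 100.0, "quantity": 10},  # unit price 10
]

# Naive AVG of per-row unit prices -- misleading across categories.
avg_of_unit_prices = sum(r["amount"] / r["quantity"] for r in rows) / len(rows)

# Weighted average -- what an exception aggregation rule should compute.
weighted_unit_price = sum(r["amount"] for r in rows) / sum(r["quantity"] for r in rows)

print(avg_of_unit_prices)   # 155.0
print(weighted_unit_price)  # ~76.92
```

The two results diverge whenever quantities differ between rows, which is exactly why unit-price style measures need an exception aggregation rule rather than a plain AVG.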
+ +**Attribute examples:** +``` +From Customer Dimension: +- CustomerName (detail attribute) +- Industry (classification attribute) +- Region (hierarchy attribute) +- SalesRep (responsibility attribute) + +From Product Dimension: +- ProductName (detail) +- Category (classification) +- Brand (classification) +- SkuNumber (identifier) +``` + +**Hierarchy attributes:** +``` +Date Dimension +├── Year (top level) +├── Quarter (drill-down) +├── Month (drill-down) +└── Day (detail level) + +Geography Dimension +├── Region (top level) +├── Country (drill-down) +├── Province/State (drill-down) +└── City (detail level) +``` + +**Best practices:** +- Include all user-facing attributes +- Order logically (hierarchical top-to-bottom) +- Document attribute meanings +- Consider user search/discovery + +### Step 6: Define Variables and Input Parameters + +Add dynamic filtering for user interaction. + +**Variable types:** + +#### Prompt Variables +Users select value(s) before executing queries. + +**Example:** +``` +Variable: FiscalYear +Type: Single-select Prompt +Values: 2022, 2023, 2024 +Usage in Measure: + COUNT(Orders) WHERE FiscalYear = :FiscalYear +``` + +**Multi-select prompt:** +``` +Variable: SelectRegions +Type: Multi-select Prompt +Values: North, South, East, West, Europe +Filter: WHERE Region IN (:SelectRegions) +``` + +#### Range Variables +Users specify start and end values. + +**Example:** +``` +Variable: SalesDateRange +Type: Date Range +Filter: WHERE OrderDate BETWEEN :StartDate AND :EndDate +``` + +**Numeric range:** +``` +Variable: OrderAmountRange +Type: Numeric Range +Filter: WHERE Amount >= :MinAmount AND Amount <= :MaxAmount +``` + +#### Fixed Variables +Predefined values for consistent calculations. 
+ +**Example:** +``` +Variable: CurrentFiscalYear = YEAR(CURRENT_DATE()) +Variable: PriorYear = CurrentFiscalYear - 1 +Usage: In measures for year-over-year comparisons +``` + +**Variable best practices:** +- Provide meaningful default values +- Add descriptive labels and help text +- Consider mandatory vs optional +- Validate ranges (e.g., EndDate > StartDate) +- Document business context + +### Step 7: Configure Currency Conversion + +Handle multi-currency scenarios. + +**Currency conversion setup:** + +``` +Define source currency column: +├── CurrencyCode (USD, EUR, GBP, etc.) + +Define conversion rules: +├── Target currency (e.g., USD for all reporting) +├── Exchange rate source (lookup table, feed) +├── Effective date matching +└── Rounding rules +``` + +**Currency conversion configuration:** +``` +Measure: Revenue (Converted) +Source: Amount +Original Currency: CurrencyCode column +Target Currency: USD +Exchange Rate Source: ExchangeRate.Lookup (SourceCurrency, TargetCurrency, Date) +Conversion Formula: Amount * ExchangeRate +``` + +**Multi-currency reporting:** +``` +Report users can choose reporting currency: +- Filter: :ReportingCurrency = USD/EUR/GBP +- Measures automatically convert to selected currency +- Reconciliation occurs at transaction level +``` + +**Best practices:** +- Define conversion rules at measure definition time +- Document exchange rate sources +- Handle conversion date logic (transaction date, report date) +- Consider rounding and precision +- Test multi-currency calculations with sample data + +### Step 8: Set Up Exception Aggregation Rules + +Define special aggregation behavior for specific measures/dimensions. + +**Exception aggregation examples:** + +#### Measure Exception +Specific measure aggregates differently in certain dimension contexts. 
+ +``` +Measure: UnitPrice +Normal Aggregation: AVG (average unit price) +Exception with ProductCategory: +- DO NOT average unit prices across product categories +- Use SUM(Amount) / SUM(Quantity) instead (weighted average) +- Prevents misleading aggregations +``` + +**Use cases:** +- Weighted averages (don't average averages) +- Margin calculations (aggregate costs/revenue separately) +- Retention rates (don't average percentages) + +#### Dimension Exception +Dimension behaves differently with certain measures. + +``` +Measure: Budget (allocated by department) +Normal behavior: SUM across all dimensions +Exception with Date dimension: +- DO NOT sum budgets across months +- Use only budget for the selected month +- Avoid double-counting + +Measure: Headcount (point-in-time) +Exception with Organization: +- DO NOT sum headcount across levels +- Use only at lowest level (employees) +``` + +**Configuration syntax:** +``` +Measure: Commission +Base Aggregation: SUM +Exception Rule: + IF dimension = 'ProductCategory' + THEN aggregate as NONE (show detail only) + ELSE aggregate as SUM +``` + +**Best practices:** +- Document exceptions clearly +- Test aggregations in SAC +- Avoid overly complex rules +- Consider impact on user experience + +## Attributes and Their Properties + +### Attribute Types + +#### Key Attributes +Unique identifiers for dimension members. + +``` +CustomerID (Key) +├── Used for dimension uniqueness +├── Not aggregated in reports +├── Used for filtering +└── Links to fact table +``` + +#### Classification Attributes +Categorical properties for grouping. + +``` +Product Category (Classification) +├── Values: Electronics, Clothing, Food +├── Used for drill-down and filtering +├── Typically aggregates with SUM +└── Creates different data segments +``` + +#### Hierarchy Attributes +Ordered levels for drill-down paths. 
+ +``` +Organization Hierarchy: +├── Level 0: Company +├── Level 1: Division +├── Level 2: Department +├── Level 3: Team +└── Level 4: Individual +``` + +#### Text/Description Attributes +Longer text fields for context. + +``` +Product Description (Text) +├── Marketing description +├── Not aggregated +├── Read-only in most contexts +└── Useful for report context +``` + +### Attribute Properties + +**Data type:** +- Text (string, variable length) +- Date (ISO format YYYY-MM-DD) +- Number (integer or decimal) +- Boolean (true/false) + +**Attribute properties:** +``` +Name: ProdCategory +Label: Product Category +Description: "High-level product grouping for analysis" +Data Type: Text +Semantic Role: Classification +Display Length: 50 characters +``` + +## Variables and Input Parameters + +### Variable Definition Patterns + +#### Financial Period Variables +``` +Variable: SelectedMonth +Type: Single Select +Values: FROM calendar table +Usage: WHERE Month = :SelectedMonth + +Variable: FiscalYearRange +Type: Range +Usage: WHERE FiscalYear BETWEEN :StartYear AND :EndYear +``` + +#### Geographic Variables +``` +Variable: SelectedRegion +Type: Multi-Select +Values: North, South, East, West +Usage: WHERE Region IN (:SelectedRegion) + +Variable: CountrySubset +Type: Multi-Select +Dynamic Values: FROM Country dimension +``` + +#### Customer Segment Variables +``` +Variable: IndustryFilter +Type: Single Select +Values: Manufacturing, Retail, Service, Government +Usage: WHERE Industry = :IndustryFilter + +Variable: MinAnnualRevenue +Type: Numeric +Default: 1000000 +Usage: WHERE AnnualRevenue >= :MinAnnualRevenue +``` + +#### Performance Threshold Variables +``` +Variable: TargetGrowthRate +Type: Numeric +Default: 0.10 (10%) +Usage in Measure: IF(GrowthRate >= :TargetGrowthRate, 'On Track', 'At Risk') +``` + +## SAC Consumption Considerations + +### Dashboard Compatibility + +**Layout and performance:** +- Keep analytic models focused (8-12 measures optimal) +- Include essential 
dimensions only +- Test dashboard responsiveness with typical filters +- Consider caching strategies for large datasets + +**Filter interaction:** +- Dimension filters must have clear values +- Avoid excessive hierarchy levels (limit to 5) +- Test filter performance in SAC +- Document filter behavior for users + +### Self-Service Analytics + +**Naming conventions for clarity:** +``` +GOOD: +- "Total Sales Revenue (USD)" +- "Customer Acquisition Cost" +- "On-Time Delivery Rate" + +AVOID: +- "SAL_AMT" +- "CALC_COST" +- "OTD_PERC" +``` + +**Measure organization:** +``` +Revenue Measures: +├── Total Sales Amount +├── Average Order Value +├── Revenue by Region +└── Year-over-Year Growth + +Cost Measures: +├── Total COGS +├── Operating Expenses +└── Cost Per Unit +``` + +**Documentation for end users:** +``` +For each measure, document: +- What it measures +- How it's calculated +- When to use it +- Any limitations or exceptions +- Currency or unit information +``` + +### Performance Optimization for SAC + +**Query optimization:** +- Persist frequently-accessed analytic models +- Pre-aggregate common combinations +- Use SAC query caching +- Test with realistic user loads + +**Dimension cardinality:** +``` +Low cardinality (good for filters): +- Department (10-20 values) +- Region (5-10 values) + +High cardinality (avoid for filtering): +- Customer ID (millions) +- Transaction ID (billions) +``` + +**Measure calculation placement:** +- Simple aggregations in Fact view +- Complex calculations in Analytic Model +- User-defined calculations in SAC (if acceptable) + +## MCP Tools Reference + +### get_analytical_metadata +Retrieve structure and properties of existing analytic models. +``` +Use to understand existing measure definitions and dimension associations +Helps avoid duplicate measures across models +``` + +### query_analytical_data +Execute queries against analytic models to validate results. 
+```
+Use to test measures, filters, and dimension combinations
+Verify correct aggregation behavior before deployment
+```
+
+### search_catalog
+Discover fact tables, dimensions, and existing analytic models.
+```
+Use to find suitable fact sources and dimension tables
+Identify reusable components
+```
+
+### get_table_schema
+Retrieve detailed column information for fact sources.
+```
+Use before creating measures to understand available data
+Verify data types and column names
+```
+
+## Key Takeaways
+
+1. **Choose fact sources carefully** — Correct grain and scope determine model usability
+2. **Define comprehensive measures** — Simple, calculated, and restricted measures address diverse needs
+3. **Include relevant dimensions** — Balance analysis richness with performance
+4. **Handle exceptions explicitly** — Complex aggregation rules must be documented
+5. **Consider currency and units** — Multi-currency models require explicit configuration
+6. **Test in SAC early** — Validate filter behavior and dashboard performance
+7. **Document for users** — Clear measure definitions enable self-service analytics
+8. **Optimize for consumption** — Keep models simple; performance matters in dashboards

diff --git a/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/references/analytic-model-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/references/analytic-model-guide.md
new file mode 100644
index 0000000..3e587d3
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/references/analytic-model-guide.md
@@ -0,0 +1,940 @@
+# Analytic Model Reference Guide
+
+## Measure Types - Syntax and Examples
+
+### Simple Measures
+
+Simple measures are straightforward aggregations of a single column with a standard aggregation function.
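Before the individual aggregation types below, the shared mechanics can be sketched in a few lines of Python. This is an illustrative sketch with made-up rows (none of the names or values come from a real Datasphere model); it shows how one fact column aggregates differently under SUM, COUNT, and AVG for each dimension member:

```python
from collections import defaultdict

# Hypothetical fact rows: (region dimension member, Amount measure)
orders = [
    ("North", 100.0), ("North", 300.0),
    ("South", 250.0), ("South", 250.0), ("South", 500.0),
]

# Aggregate per dimension member, the way an analytic model would
totals = defaultdict(float)   # SUM(Amount)
counts = defaultdict(int)     # COUNT(*)
for region, amount in orders:
    totals[region] += amount
    counts[region] += 1

for region in sorted(totals):
    avg = totals[region] / counts[region]   # AVG(Amount) = SUM / COUNT
    print(region, totals[region], counts[region], round(avg, 2))
# North 400.0 2 200.0
# South 1000.0 3 333.33
```

The same input rows yield a total, a row count, or a mean depending only on the aggregation function; that choice is the core of each simple measure type.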
+ +#### SUM Aggregation +```sql +Measure Name: Total Sales Revenue +Source Column: Amount (numeric) +Aggregation: SUM +Syntax: SUM(Amount) +Result: Sums all amount values across selected dimensions + +Example Values: +- All Data: $5,234,500 +- By Region: North=$1,200,000, South=$2,100,000, Europe=$1,934,500 +- By Month: Jan=$425,000, Feb=$398,000, Mar=$456,000 +``` + +#### COUNT Aggregation +```sql +Measure Name: Number of Orders +Source Column: OrderID +Aggregation: COUNT +Syntax: COUNT(OrderID) +Result: Counts non-null order IDs + +Example: +- Total Orders: 12,456 +- Orders by Customer: Acme=245, Beta=189, Gamma=512 +- Orders by Month: Jan=1050, Feb=967, Mar=1123 +``` + +#### COUNT DISTINCT Aggregation +```sql +Measure Name: Unique Customers +Source Column: CustomerID +Aggregation: COUNT(DISTINCT CustomerID) +Syntax: COUNT(DISTINCT CustomerID) +Result: Number of unique customer IDs + +Example: +- Total Unique Customers: 3,456 +- Unique Customers by Region: North=890, South=1,200, Europe=1,366 +- Unique Customers by Product: ProductA=1,234, ProductB=2,100 + +Performance Note: COUNT DISTINCT can be slow on high-cardinality columns +``` + +#### AVG Aggregation +```sql +Measure Name: Average Order Value +Source Column: Amount +Aggregation: AVG +Syntax: AVG(Amount) +Result: Average of all amount values + +Example: +- Overall Average: $419.42 +- Average by Customer Segment: Premium=$892.34, Standard=$305.67, Basic=$187.23 +- Average by Product Category: Electronics=$756.89, Clothing=$245.67 + +Formula: SUM(Amount) / COUNT(OrderID) +Interpretation: Users understand "typical order value" +``` + +#### MIN/MAX Aggregation +```sql +Measure Name: Minimum Order Value +Source Column: Amount +Aggregation: MIN +Syntax: MIN(Amount) +Result: Smallest order amount + +Example: +- Minimum Order: $5.00 +- Minimum by Region: North=$10.00, South=$5.00, Europe=$8.50 + +Measure Name: Maximum Order Value +Source Column: Amount +Aggregation: MAX +Syntax: MAX(Amount) +Result: Largest order 
amount + +Example: +- Maximum Order: $125,400.00 +- Maximum by Month: Jan=$95,600, Feb=$125,400, Mar=$87,300 +``` + +--- + +### Calculated Measures + +Calculated measures derive from expressions combining other measures, columns, or functions. + +#### Ratio and Percentage Measures +```sql +Measure Name: Profit Margin Percentage +Formula: (NetProfit / Revenue) * 100 +Source Measures: NetProfit, Revenue +Syntax: (NetProfit / Revenue) * 100 +Result: Percentage profit relative to revenue + +Example Calculation: +- NetProfit = $1,000,000 +- Revenue = $5,000,000 +- Result = (1,000,000 / 5,000,000) * 100 = 20% + +Usage: Monitor profitability +Insight: "For every $100 of revenue, we keep $20 as profit" + +--- + +Measure Name: Gross Margin Percentage +Formula: ((Revenue - COGS) / Revenue) * 100 +Source Columns: Revenue, COGS (Cost of Goods Sold) +Syntax: ((Revenue - COGS) / Revenue) * 100 +Result: Percentage of revenue above product cost + +Example: +- Revenue = $100,000 +- COGS = $60,000 +- Result = ((100,000 - 60,000) / 100,000) * 100 = 40% + +Variance Detection: +- Manufacturing: 35% (below target of 40%) +- Retail: 42% (above target of 40%) +``` + +#### Per-Unit Economics +```sql +Measure Name: Cost Per Unit +Formula: TotalCost / TotalQuantity +Source Measures: TotalCost (SUM of Costs), TotalQuantity (SUM of Quantity) +Syntax: TotalCost / TotalQuantity +Result: Average cost per unit + +Example: +- Total Cost = $150,000 +- Total Quantity = 5,000 units +- Cost Per Unit = $30.00 + +Use Case: Track unit economics improvements +- Year 1: $35.00 per unit +- Year 2: $30.00 per unit (14% improvement) + +--- + +Measure Name: Revenue Per Customer +Formula: TotalRevenue / UniqueCustomers +Source Measures: TotalRevenue (SUM), UniqueCustomers (COUNT DISTINCT) +Syntax: TotalRevenue / UniqueCustomers +Result: Average revenue per unique customer + +Example: +- Total Revenue = $10,000,000 +- Unique Customers = 2,500 +- Revenue Per Customer = $4,000 + +Segment Analysis: +- Premium 
Customers: $8,500 per customer +- Standard Customers: $2,200 per customer +- Basic Customers: $400 per customer +``` + +#### Year-over-Year and Growth Measures +```sql +Measure Name: Revenue YoY Growth % +Formula: ((CurrentYear - PriorYear) / PriorYear) * 100 +Calculated from: Two instances of Revenue measure with year filters +Syntax: ((RevenueCurrentYear - RevenuePriorYear) / RevenuePriorYear) * 100 +Result: Percentage growth year-over-year + +Example: +- 2024 Revenue = $12,500,000 +- 2023 Revenue = $10,000,000 +- YoY Growth = ((12,500,000 - 10,000,000) / 10,000,000) * 100 = 25% + +Interpretation: Revenue grew 25% from 2023 to 2024 + +--- + +Measure Name: Quarter-over-Quarter Change +Formula: (CurrentQtr - PriorQtr) / PriorQtr +Calculated from: Two instances of Revenue with quarter filters +Example: +- Q2 2024 = $3,200,000 +- Q1 2024 = $2,900,000 +- QoQ = (3,200,000 - 2,900,000) / 2,900,000 = 10.3% +``` + +#### Conditional Calculations +```sql +Measure Name: Incentive Payment +Formula: CASE + WHEN SalesAmount >= 1,000,000 THEN SalesAmount * 0.05 + WHEN SalesAmount >= 500,000 THEN SalesAmount * 0.03 + ELSE 0 + END +Source: SalesAmount measure +Result: Tiered commission/incentive + +Example: +- Sales $1,500,000 → Incentive = $75,000 (5%) +- Sales $750,000 → Incentive = $22,500 (3%) +- Sales $250,000 → Incentive = $0 + +--- + +Measure Name: Performance Rating +Formula: CASE + WHEN ActualSales >= TargetSales * 1.1 THEN 'Excellent' + WHEN ActualSales >= TargetSales THEN 'On Track' + WHEN ActualSales >= TargetSales * 0.8 THEN 'At Risk' + ELSE 'Critical' + END +Result: Categorical performance assessment + +Example: +- Target = $1,000,000, Actual = $1,150,000 → 'Excellent' (15% above) +- Target = $1,000,000, Actual = $950,000 → 'On Track' (5% below, within tolerance) +- Target = $1,000,000, Actual = $750,000 → 'At Risk' (25% below) +``` + +#### Time-Based Calculations +```sql +Measure Name: Average Days to Delivery +Formula: AVG(DATEDIFF(day, OrderDate, DeliveryDate)) 
+Source Columns: OrderDate, DeliveryDate +Result: Average delivery time in days + +Example: +- Order Jan 15, Delivered Jan 18 = 3 days +- Order Jan 16, Delivered Jan 20 = 4 days +- Average = 3.5 days + +Use Case: Service level tracking +- Target: 5 days +- Actual: 4.2 days (exceeding SLA) + +--- + +Measure Name: Days Sales Outstanding (DSO) +Formula: (AvgAccountsReceivable / DailySales) +Approximation: (365 * AccountsReceivable) / Revenue +Interpretation: Average days to collect payment + +Example Calculation: +- Annual Revenue = $10,000,000 +- Average AR Balance = $1,500,000 +- DSO = (365 * 1,500,000) / 10,000,000 = 54.75 days + +Trend Analysis: +- Month 1: 60 days (slow collections) +- Month 2: 55 days (improving) +- Month 3: 50 days (on track) +``` + +--- + +### Restricted Measures + +Restricted measures are aggregations with built-in filters, creating subset-specific metrics. + +#### Threshold-Based Restrictions +```sql +Measure Name: High-Value Orders +Definition: COUNT of Orders where Amount >= $10,000 +Source: OrderCount measure (COUNT(OrderID)) +Filter: WHERE Amount >= 10000 +Result: Count of orders above threshold + +Example: +- Total Orders: 5,000 +- High-Value Orders: 342 +- Percentage: 6.84% + +Use Case: Track premium transaction volume +Split by Sales Rep: +- Rep A: 45 high-value orders +- Rep B: 67 high-value orders +- Rep C: 28 high-value orders +``` + +#### Status-Based Restrictions +```sql +Measure Name: Completed Orders +Definition: SUM(Amount) where Status = 'Completed' +Source: Revenue measure (SUM(Amount)) +Filter: WHERE OrderStatus = 'Completed' +Result: Revenue from only completed transactions + +Example: +- Total Orders Revenue: $5,234,500 (includes pending, cancelled) +- Completed Orders: $4,987,234 (excludes non-completed) + +Status Breakdown: +- Completed: $4,987,234 (95.3%) +- Pending: $189,456 (3.6%) +- Cancelled: $57,810 (1.1%) + +--- + +Measure Name: On-Time Deliveries +Definition: COUNT(OrderID) where ShipDate <= DueDate +Filter: 
WHERE ShipDate <= DueDate +Result: Count of orders delivered on time + +Example: +- Total Orders: 1,200 +- On-Time: 1,050 +- Late: 150 +- On-Time Rate: 87.5% + +Regional Performance: +- North: 91% on-time +- South: 85% on-time +- Europe: 83% on-time +``` + +#### Conformance and Compliance Restrictions +```sql +Measure Name: Approved Invoices +Definition: SUM(InvoiceAmount) where ApprovalStatus = 'Approved' +Filter: WHERE ApprovalStatus = 'Approved' +Result: Total amount of invoices meeting approval criteria + +Example: +- Total Invoiced: $1,500,000 +- Approved: $1,450,000 (96.7%) +- Pending Review: $35,000 (2.3%) +- Rejected: $15,000 (1.0%) + +--- + +Measure Name: Quality Pass Rate +Definition: COUNT(InspectionID) where QualityScore >= 95 +Filter: WHERE QualityScore >= 95 +Result: Count of items passing quality threshold + +Example: +- Total Items Inspected: 10,000 +- Quality Pass: 9,750 +- Quality Pass Rate: 97.5% + +Quality Trend: +- Q1: 95.2% +- Q2: 96.8% +- Q3: 97.5% (improving) +``` + +#### Combination Restrictions +```sql +Measure Name: Large Orders from Key Customers +Definition: SUM(Amount) where Amount >= 50000 AND CustomerSegment = 'Key' +Filter: WHERE Amount >= 50000 AND CustomerSegment = 'Key' +Result: Revenue from large orders of key accounts + +Example Calculation: +- Total Revenue: $10,000,000 +- Large Orders (>$50K): $6,500,000 +- Large Orders from Key Customers: $5,800,000 (89% of large orders) + +Business Insight: Key customers represent significant large order volume +``` + +--- + +### Count Distinct Measures + +Count distinct measures identify unique members of a dimension. 
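The difference between a plain count and a distinct count, which underlies all of the measures in this section, can be sketched in Python (illustrative data only; the identifiers are hypothetical):

```python
# Hypothetical transaction rows: (OrderID, CustomerID)
transactions = [
    (1, "C1"), (2, "C1"), (3, "C2"),
    (4, "C3"), (5, "C2"), (6, "C1"),
]

order_count = len(transactions)                             # COUNT(OrderID)
unique_customers = len({cust for _, cust in transactions})  # COUNT(DISTINCT CustomerID)

print(order_count, unique_customers)  # 6 3
print(round(order_count / unique_customers, 1))  # 2.0 transactions per customer
```

A distinct count has to remember every value seen so far (here, a set), which is why COUNT DISTINCT on a high-cardinality column is far more expensive than a plain row count.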
+ +#### Customer Counting +```sql +Measure Name: Number of Unique Customers +Definition: COUNT(DISTINCT CustomerID) +Source Column: CustomerID +Result: Count of unique customers + +Example: +- Total Transactions: 50,000 +- Unique Customers: 3,200 +- Avg Transactions per Customer: 15.6 + +Segmentation: +- Premium Customers: 120 +- Standard Customers: 1,050 +- Basic Customers: 2,030 + +Trend: +- Year 1: 2,500 unique customers +- Year 2: 3,200 unique customers (28% growth) +``` + +#### Product/SKU Counting +```sql +Measure Name: Product Variety Sold +Definition: COUNT(DISTINCT ProductID) +Source Column: ProductID +Result: Number of unique products sold + +Example: +- Total Product Catalog: 5,000 SKUs +- Products Sold This Year: 1,200 (24% of catalog) +- Products Sold by Region: North=650, South=480, Europe=720 + +Use Case: Assortment analysis +- High-selling products: 50 SKUs (40% of revenue) +- Tail products: 1,150 SKUs (60% of revenue) +``` + +#### Unique Time Period Counting +```sql +Measure Name: Days with Sales Activity +Definition: COUNT(DISTINCT OrderDate) +Source Column: OrderDate +Result: Number of unique days with orders + +Example: +- Total Days in Year: 365 +- Days with Sales: 310 (85% of days) +- Average Orders per Day: 32.3 + +Use Case: Operational analysis +- Monday: 98% days with activity +- Friday: 92% days with activity +- Weekend: 15% days with activity +``` + +#### Supplier/Vendor Counting +```sql +Measure Name: Active Supplier Count +Definition: COUNT(DISTINCT SupplierID) +Source Column: SupplierID +Result: Number of active suppliers + +Example: +- Total Registered Suppliers: 500 +- Active Suppliers (has shipments this period): 180 (36%) +- Average Purchases per Supplier: 27.8 + +Risk Assessment: +- Primary suppliers (80% of value): 15 suppliers +- Secondary suppliers (20% of value): 165 suppliers +- Diversity is important for supply chain resilience +``` + +--- + +## Aggregation Types - Complete Reference + +| Aggregation | Input | Output | When 
Used | Example | +|---|---|---|---|---| +| **SUM** | Numeric values | Total | Amounts, quantities, costs | SUM(InvoiceAmount) = $1,500,000 | +| **AVG** | Numeric values | Mean | Unit prices, rates, percentages | AVG(UnitPrice) = $45.50 | +| **COUNT** | Any column | Integer count | Number of transactions | COUNT(OrderID) = 5,432 | +| **COUNT DISTINCT** | Any column with duplicates | Unique count | Unique customers, products | COUNT(DISTINCT CustomerID) = 1,200 | +| **MIN** | Comparable values | Minimum | Lowest price, earliest date | MIN(OrderDate) = 2024-01-01 | +| **MAX** | Comparable values | Maximum | Highest discount, latest date | MAX(Amount) = $95,600 | +| **STDDEV** | Numeric values | Standard deviation | Variability, risk measurement | STDDEV(SalesAmount) = $12,450 | +| **VARIANCE** | Numeric values | Variance | Spread of data points | VARIANCE(Price) = 156.25 | + +--- + +## Exception Aggregation Patterns + +Exception aggregation rules define how specific measures behave when combined with certain dimensions. + +### Pattern 1: Weighted Average Exception + +**Scenario:** Unit prices should not be averaged across product categories. + +``` +Measure: Unit Price +Normal Aggregation: AVG(Price) + +Exception Rule: +When dimension = ProductCategory + Aggregate as: SUM(TotalAmount) / SUM(TotalQuantity) + Instead of: AVG(Price) + +Why: Taking an average of averages is mathematically incorrect +Example: + Category A: 100 units at $10 (average) = $1,000 total + Category B: 50 units at $20 (average) = $1,000 total + + Wrong: AVG(10, 20) = $15 + Correct: (1,000 + 1,000) / (100 + 50) = $13.33 + +Dashboard Impact: + Without exception: Misleading $15 per unit price + With exception: Correct $13.33 weighted unit price +``` + +### Pattern 2: Point-in-Time Exception + +**Scenario:** Headcount and inventory are point-in-time measures; don't sum across periods. 
+
+```
+Measure: Employee Headcount
+Normal Aggregation: SUM(EmployeeCount)
+
+Exception Rule:
+When dimension = Date or Time Period
+  Aggregate as: NONE (show detail only)
+  Instead of: SUM
+
+Why: Summing headcount across months is meaningless
+Example:
+  Jan Headcount: 100 employees
+  Feb Headcount: 105 employees
+  Mar Headcount: 110 employees
+
+  Wrong Sum: 315 (meaningless)
+  Correct: Show each month separately; the user interprets the trend
+
+---
+
+Measure: Inventory On Hand
+Normal Aggregation: SUM(Quantity)
+
+Exception Rule:
+When dimension = Date or Month
+  Show: Last day of period balance only
+  Not: Sum of all daily balances
+
+Why: Inventory is measured at a point in time
+Example:
+  Month-end inventory is meaningful (stock available)
+  Sum of daily inventory is meaningless (double-counting)
+```
+
+### Pattern 3: Ratio Exceptions
+
+**Scenario:** Percentages and rates should not be averaged; recalculate from components.
+
+```
+Measure: Profit Margin %
+Formula: (NetProfit / Revenue) * 100
+Normal Aggregation: Uses formula components
+
+Exception Rule:
+When combining dimensions (e.g., Region and Product)
+  Recalculate as: SUM(AllProfit) / SUM(AllRevenue) * 100
+  Not: AVG(ProfitMarginByRegion)
+
+Why: Averaging percentages loses context
+Example:
+  Region North: 50 units, $1,000 profit, 5% margin ($20,000 revenue)
+  Region South: 100 units, $500 profit, 1% margin ($50,000 revenue)
+
+  Wrong: AVG(5%, 1%) = 3%
+  Correct: ($1,000 + $500) / ($20,000 + $50,000) * 100 = 2.14%
+
+Dashboard Implementation:
+  Raw measure: Use components (profit, revenue)
+  Calculated measure: Apply formula
+  Exception: Prevent aggregation of percentages
+```
+
+### Pattern 4: Subset Measure Exception
+
+**Scenario:** When combining restricted and unrestricted measures.
+ +``` +Measure: Total Revenue +Source: SUM(Amount) with no filter + +Measure: Online Revenue +Source: SUM(Amount) where SalesChannel = 'Online' + +Exception Rule: +Cannot calculate: Online Revenue / Total Revenue at certain dimension levels + Problem: Different denominator causes meaningless results + Solution: Only calculate this ratio at specific dimension levels (e.g., by Product) + +Dashboard Pattern: + Show Total Revenue and Online Revenue side-by-side + Calculate ratio only when both dimensions align +``` + +--- + +## Currency and Unit Conversion Configuration + +### Multi-Currency Setup + +**Configuration structure:** +``` +Dimension: TransactionCurrency (from fact table) +Values: USD, EUR, GBP, JPY, CAD, AUD + +Measure: Amount (in transaction currency) + +Conversion Method: +1. Define exchange rate source +2. Apply conversion at query time +3. Display in target currency +``` + +**Exchange rate source example:** +``` +ExchangeRates Table: +├── SourceCurrency (USD, EUR, GBP) +├── TargetCurrency (reporting currency) +├── ExchangeRate (conversion ratio) +├── EffectiveDate (when rate became valid) +└── SourceSystem (Reuters, ECB, Internal) + +Example Data: +| SourceCurrency | TargetCurrency | ExchangeRate | EffectiveDate | +| USD | EUR | 0.92 | 2024-01-01 | +| GBP | EUR | 1.17 | 2024-01-01 | +| JPY | EUR | 0.0067 | 2024-01-01 | +``` + +**Conversion formula in measure:** +``` +Measure: Revenue (EUR) +Formula: Amount * ExchangeRate +Lookup: ExchangeRates.ExchangeRate + WHERE SourceCurrency = TransactionCurrency + AND TargetCurrency = 'EUR' + AND EffectiveDate <= TransactionDate + ORDER BY EffectiveDate DESC + LIMIT 1 (most recent rate) +``` + +**Conversion date logic:** +``` +Spot Rate Method (Transaction date): +- Convert using rate effective on transaction date +- Most accurate for financial reporting +- Requires historical rate table + +Period-End Rate Method (Reporting date): +- Convert all transactions using month-end rate +- Simpler, used for P&L reporting 
+- May create volatility month-to-month
+```
+
+### Unit Conversion
+
+**Unit conversion configuration:**
+```
+Measure: Quantity Sold (units)
+
+Conversion Rules:
+├── Base Unit: Pieces
+├── Alternative Units: Dozen, Box, Pallet
+└── Conversion Factors:
+    ├── 1 Dozen = 12 Pieces
+    ├── 1 Box = 24 Pieces
+    └── 1 Pallet = 120 Boxes = 2,880 Pieces
+
+Measure: Quantity (Dozens)
+Formula: Quantity (units) / 12
+
+Measure: Weight
+├── Base Unit: Kilograms
+└── Conversions:
+    ├── 1 Ton = 1,000 KG
+    ├── 1 Pound = 0.454 KG
+    └── 1 Ounce = 0.0283 KG
+```
+
+**Mixed unit reporting:**
+```
+User selects reporting unit:
+└─ [Dropdown: Pieces | Dozen | Box | Pallet]
+
+System calculates:
+- Formula-based conversion
+- Display with appropriate unit label
+- Rounding rules per unit type
+
+Example:
+Quantity = 500 Pieces
+- In Pieces: 500
+- In Dozen: 41.67 (rounded up to 42)
+- In Box: 20.83 (rounded up to 21)
+- In Pallet: 0.17 (rounded up to 1)
+```
+
+---
+
+## Variable Types and Usage Patterns
+
+### Single Select Variable
+```
+Variable Definition:
+├── Name: SelectedRegion
+├── Label: Choose Region
+├── Type: Single Select
+├── Values: North, South, East, West, Europe
+├── Default: North
+└── Mandatory: Yes
+
+Usage in Measure:
+WHERE Region = :SelectedRegion
+
+Dashboard Behavior:
+- Dropdown with 5 options
+- User selects one before query execution
+- Entire dashboard filtered by selection
+```
+
+### Multi-Select Variable
+```
+Variable Definition:
+├── Name: SelectedSalesReps
+├── Label: Filter by Sales Representatives
+├── Type: Multi-Select
+├── Values: [List of 200+ reps]
+├── Default: All
+└── Mandatory: No
+
+Usage in Measure:
+WHERE SalesRepID IN (:SelectedSalesReps)
+
+Dashboard Behavior:
+- Multi-select list or checkbox group
+- User selects multiple reps
+- OR logic combines selections
+- Empty selection = All reps
+```
+
+### Range Variable
+```
+Variable Definition (Numeric):
+├── Name: SalesAmountRange
+├── Label: Order Amount Range
+├── Type: Range
+├── Min
Value: 0 +├── Max Value: 1,000,000 +├── Default Min: 0 +├── Default Max: 100,000 +└── Step: 1,000 + +Usage in Measure: +WHERE Amount >= :MinAmount AND Amount <= :MaxAmount + +Variable Definition (Date): +├── Name: DateRange +├── Label: Order Date Range +├── Type: Date Range +├── Default Start: First day of current month +├── Default End: Last day of current month +└── Format: YYYY-MM-DD + +Usage in Measure: +WHERE OrderDate >= :StartDate AND OrderDate <= :EndDate +``` + +### Hierarchical Variable +``` +Variable Definition: +├── Name: OrganizationLevel +├── Label: Organization Hierarchy +├── Type: Hierarchical Select +├── Hierarchy: Company → Division → Department → Team +├── Default: Company level +└── Allow drill-down: Yes + +Usage: +- User sees Company names +- On selection, displays Divisions +- On Division selection, displays Departments +- Drill-down continues through hierarchy +``` + +### Fixed Variable (Formula-Based) +``` +Variable Definition: +├── Name: CurrentFiscalYear +├── Type: Fixed (Calculated) +├── Formula: YEAR(CURRENT_DATE()) +├── Updated: Automatically daily + +Usage in Measure: +WHERE FiscalYear = :CurrentFiscalYear + +Example Calculations: +├── PriorYear = :CurrentFiscalYear - 1 +├── CurrentMonthFirstDay = DATE_TRUNC('month', CURRENT_DATE()) +├── LastDayOfMonth = LAST_DAY(CURRENT_DATE()) +└── QuarterStart = DATE_TRUNC('quarter', CURRENT_DATE()) +``` + +--- + +## Best Practices for SAC Compatibility + +### Measure Definition Best Practices + +1. **Naming Convention** + ``` + Good names: + - "Total Sales Revenue" + - "Customer Acquisition Cost" + - "On-Time Delivery Rate" + + Avoid: + - Acronyms without context (SAL_AMT, CAC, OTD) + - Technical names (SUM_FACT_000123) + - Vague names (VALUE, AMOUNT, DATA) + ``` + +2. 
**Documentation**
+   ```
+   For each measure, document:
+   - Definition: "Sum of all invoice amounts for completed orders"
+   - Calculation: "SUM(InvoiceAmount) WHERE OrderStatus='Completed'"
+   - Aggregation Type: "SUM"
+   - Currency/Unit: "USD"
+   - Limitations: "Excludes cancelled orders"
+   - Last Updated: "2024-02-01"
+   - Owner: "Finance Team"
+   ```
+
+3. **Aggregation Clarity**
+   ```
+   Specify default aggregation:
+
+   Measure: Revenue
+   ├── Default Agg: SUM (sum across regions, products)
+   └── Don't Sum: (leave as detail for certain dimensions)
+
+   Measure: Unit Price
+   ├── Default Agg: AVG
+   └── Exception: Use weighted average by product category
+   ```
+
+### Dimension and Attribute Organization
+
+```
+Dimension: ProductDimension
+├── Keys and IDs
+│   ├── ProductID
+│   └── SKUNumber
+├── Classification
+│   ├── ProductCategory
+│   ├── ProductSubcategory
+│   └── Brand
+├── Attributes
+│   ├── ProductName
+│   ├── Description
+│   └── ListPrice
+└── Hierarchy Levels
+    ├── Category (Level 0)
+    ├── Subcategory (Level 1)
+    └── Product (Level 2)
+```
+
+### Performance Optimization
+
+```
+Measure Complexity Levels:
+
+Level 1 (Simple, Fast):
+- Simple aggregations (SUM, COUNT)
+- Single column operations
+- Expected query time: <1 second
+
+Level 2 (Moderate, Acceptable):
+- Calculated measures with 2-3 components
+- Restricted measures with filters
+- Expected query time: 1-5 seconds
+
+Level 3 (Complex, Use Carefully):
+- Multiple calculated components
+- Complex window functions
+- Count distinct on high-cardinality columns
+- Expected query time: 5-30 seconds
+
+Recommendation:
+- Avoid Level 3 measures in interactive dashboards
+- Use for scheduled reports only
+- Consider materialization/persistence
+```
+
+---
+
+## Troubleshooting Common Issues
+
+### Aggregation Problems
+
+**Issue: Measures don't add up correctly**
+```
+Symptom: Region A shows $100K and Region B shows $200K, but the Total row is not $300K
+
+Cause: Unit price averaging issue
+- Dimension level shows average unit prices
+- But those
aren't weighted by volume + +Solution: Use exception aggregation rule +- Don't average prices across regions +- Recalculate from sum(amount) / sum(quantity) +``` + +**Issue: Data appears duplicated** +``` +Symptom: Total $500K + Restricted measure $500K = $1M (should be $500K total) + +Cause: N:N join cardinality explosion +- Join between fact and dimension creates duplicate rows +- Each duplicate counted separately + +Solution: Verify join cardinality +- Should be N:1 (many orders to one customer) +- Not N:N (many orders to many dimension members) +- Check for incorrect join conditions +``` + +### Performance Issues + +**Slow count distinct measures** +``` +Cause: COUNT(DISTINCT CustomerID) on billions of rows + +Solutions: +1. Create materialized view with daily customer counts +2. Pre-aggregate distinct counts at lower granularity +3. Archive old data before querying +4. Use approximate count algorithms +``` + +**Dashboard filter lag** +``` +Cause: High-cardinality dimension with millions of values + +Solutions: +1. Limit dimension values shown (top 100, not all) +2. Use hierarchical navigation instead of flat list +3. Implement search/autocomplete for selection +4. Create separate filtered dimension for dashboards +``` diff --git a/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/SKILL.md new file mode 100644 index 0000000..a5da5f4 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/SKILL.md @@ -0,0 +1,799 @@ +--- +name: "SAP Datasphere Business Content Activator" +description: "Activate pre-built SAP Business Content packages in Datasphere! Use this skill when you need to deploy industry-specific data models, manage content updates, handle prerequisites (Time Dimension, Currency Conversion), and align content with LSA++ layered architecture. 
Essential for rapid analytics implementation, reducing customization, and accelerating time-to-value in retail, automotive, finance, utilities, and other verticals." +--- + +# SAP Datasphere Business Content Activator + +## What is Business Content? + +SAP Business Content is a collection of pre-built, production-ready data models, analytical views, and data flows designed for specific industries and business domains. Rather than building your entire analytics solution from scratch, Business Content gives you: + +- **Pre-modeled Data Objects**: Tables, views, and dimensions aligned to industry best practices +- **Time and Currency Handling**: Built-in temporal and FX conversion logic +- **Industry-Specific Analytical Hierarchies**: Organized by sales channel, product lines, geographic regions, etc. +- **Data Flow Templates**: Extract-Transform-Load (ETL) patterns for common data integration scenarios +- **Reporting Views**: Pre-built analytics views ready for dashboards and reports +- **Documentation and Metadata**: Embedded business glossaries and lineage information + +### Key Benefits + +| Benefit | Impact | +|---------|--------| +| Time-to-Value | Deploy analytics in weeks vs. months | +| Best Practices | Industry-standard data modeling patterns | +| Reduced Customization | 70-80% of requirements covered by content | +| Consistency | Standardized KPIs across the organization | +| Maintainability | SAP updates content; you benefit from innovations | + +--- + +## Content Network Overview + +The SAP Datasphere Content Network is where you browse, preview, and select Business Content packages for activation. + +### Accessing the Content Network + +1. **Navigate to Content Network**: + - In Datasphere, go to **Business Content > Content Network** + - Or visit: `https://your-datasphere-instance/business-content/network` + +2. **Search and Filter**: + - Filter by **Industry** (Automotive, Utilities, Retail, Finance, Manufacturing, etc.) 
+   - Filter by **Domain** (Sales, Finance, Supply Chain, Human Resources, etc.)
+   - Filter by **Data Source** (SAP S/4HANA, Salesforce, Workday, etc.)
+   - Search by **Keyword** (e.g., "revenue analysis", "inventory management")
+
+3. **Preview Package**:
+   - View included data models and views
+   - Check dependencies and prerequisites
+   - Review object count and complexity
+   - Read implementation guide
+
+### Content Package Metadata
+
+Each package displays:
+- **Version**: Semantic versioning (e.g., 1.2.3) with release notes
+- **Objects Count**: Number of tables, views, and flows included
+- **Dependencies**: Packages this content requires
+- **Prerequisites**: Time Dimension, Currency Conversion, other setup needs
+- **Industries**: Which verticals this package serves
+- **Last Updated**: SAP's latest modification date
+
+---
+
+## Pre-Activation Prerequisites Checklist
+
+Before activating any Business Content, ensure these foundational elements are in place.
+
+### Time Dimension Tables
+
+Time Dimension is the backbone of temporal analytics. Business Content heavily relies on it for:
+- Year-to-Date (YTD) analysis
+- Period-over-Period (POP) comparisons
+- Fiscal vs. calendar year handling
+- Holidays and working day calculations
+
+#### Check If Time Dimension Exists
+
+```
+In Datasphere:
+1. Go to Business Content > Administration > Prerequisites
+2. Look for "Time Dimension" table
+3. Check: Status = "Populated" (with date range)
+```
+
+#### Populate Time Dimension (If Empty)
+
+If the Time Dimension table exists but is empty:
+
+1. **Download Time Dimension Data File**:
+   - Go to **Administration > Time Dimension**
+   - Click **Generate Data File** for your date range
+   - Select fiscal period definition (Gregorian or custom)
+   - Generate CSV with dates, quarters, years, etc.
+
+2. **Load Data**:
+   - Import CSV via **Data Integration > New Data Flow**
+   - Select target: **Time Dimension** table
+   - Map columns: date → DATE, year → YEAR, quarter → QUARTER, etc.
+ - Execute load + +3. **Verify Population**: + - Query should return records for all periods: + ```sql + SELECT MIN(DATE), MAX(DATE), COUNT(*) FROM TIME_DIMENSION + ``` + +#### Time Dimension Example Structure + +| DATE | YEAR | QUARTER | MONTH | WEEK | DAY_OF_WEEK | FISCAL_YEAR | FISCAL_QUARTER | IS_HOLIDAY | +|------|------|---------|-------|------|-------------|-------------|-----------------|-----------| +| 2024-01-01 | 2024 | Q1 | 1 | 1 | Monday | 2024 | Q1 | true | +| 2024-01-02 | 2024 | Q1 | 1 | 1 | Tuesday | 2024 | Q1 | false | + +### Currency Conversion Views (TCUR*) + +Currency Conversion enables multi-currency reporting and harmonization. Business Content uses these views to convert transactional amounts to reporting currency. + +#### Identify Required Currency Conversion Views + +Business Content packages typically require: +- **TCURR** — Exchange rates master table +- **TCURN** — Currency conversion rules and settings +- **TCURV** — Currency conversion view (calculated) + +#### Check If Currency Conversion is Available + +``` +In Datasphere: +1. Go to Business Content > Administration > Prerequisites +2. Look for "Currency Conversion" (TCUR*) +3. Check: Status = "Available" (with rates populated) +``` + +#### Set Up Currency Conversion (If Missing) + +1. **Source Currency Master Data**: + - **Option A**: Load from SAP S/4HANA connection + - Table: TCURR (exchange rates) + - Create data flow: SAP → Datasphere + - **Option B**: Load from external system + - Format: SOURCE_CURRENCY, TARGET_CURRENCY, RATE, VALID_FROM, VALID_TO + - Example: USD → EUR, 0.92, 2024-01-01, 2024-12-31 + +2. **Create Currency Conversion View**: + - Use Datasphere's **Currency Conversion Calculation View** template + - Configure source as TCURR table + - Specify conversion hierarchy (usually to reporting currency) + +3. 
**Verify Exchange Rates**: + - Query should return rates for all currency pairs: + ```sql + SELECT SOURCE_CURRENCY, TARGET_CURRENCY, RATE FROM TCURV + WHERE VALID_FROM <= TODAY() AND VALID_TO >= TODAY() + ``` + +### Unit of Measure Tables + +Business Content often includes measurements (quantity, weight, volume). Unit of Measure (UOM) tables normalize these: + +#### Check UOM Availability + +``` +In Datasphere: +1. Go to Business Content > Administration > Prerequisites +2. Look for "Unit of Measure" table +3. Check: Status = "Available" +``` + +#### UOM Table Structure + +| UNIT_CODE | UNIT_NAME | CATEGORY | CONVERSION_FACTOR | +|-----------|-----------|----------|------------------| +| KG | Kilogram | Weight | 1.0 | +| LB | Pound | Weight | 0.453592 | +| MTR | Meter | Length | 1.0 | +| KM | Kilometer | Length | 1000.0 | + +#### Populate UOM Table + +``` +Load standard UOM data: +1. Create data flow from SAP system → UOM table +2. Or upload CSV with standard UOM master data +3. Verify all conversion factors are populated +``` + +### Other Shared Dependencies + +Depending on the Business Content package, also verify: + +| Dependency | Purpose | Check | +|------------|---------|-------| +| Organizational Hierarchy | Drill-down by division, region, department | Table populated? | +| Customer Master | Customer dimensions and attributes | Source system connectivity? | +| Product Master | Product hierarchies and classifications | UPC/SKU mappings available? | +| General Ledger Accounts | Chart of Accounts for financial analysis | GL account mapping available? | + +--- + +## Content Activation Workflow (Step-by-Step) + +### Step 1: Select Business Content Package + +1. Navigate to **Business Content > Content Network** +2. Browse or search for desired package + - Example: "Sales Cloud Analytics" for retail +3. 
Click **Details** to review: + - Objects included + - Prerequisites + - Industry applicability + - Implementation time estimate + +### Step 2: Review Included Objects + +Packages typically include: + +**Data Models** (tables for raw data ingestion): +- Sales orders, line items, fulfillment status +- Customer master, product master +- Daily snapshots for analytics + +**Analytical Views** (pre-aggregated analytics): +- Revenue by product line (yearly, monthly) +- Customer acquisition and retention metrics +- Margin analysis by region + +**Data Flows** (ETL templates): +- Extract data from SAP S/4HANA +- Transform and load into analytical tables +- Schedule frequency: daily, weekly, monthly + +### Step 3: Choose Target Space + +Decide where content will be activated: + +**Option A: Single Space** (recommended for small teams) +- Create all content in one space (e.g., ANALYTICS) +- Simpler governance and discovery +- All users in space access all content + +**Option B: Separate Spaces by Layer** (recommended for large orgs using LSA++) +- **Inbound Layer Space**: Raw data tables +- **Harmonization Layer Space**: Data cleaning and transformation +- **Reporting Layer Space**: Analytical views for end users +- Enables fine-grained access control and performance isolation + +### Step 4: Handle Conflicts with Existing Objects + +If objects already exist in the target space: + +| Scenario | Action | +|----------|--------| +| First activation | Proceed (no conflicts) | +| Re-activating same version | Skip (use existing objects) | +| Activating new version | Choose: **Overwrite** or **Keep** | +| Custom modifications exist | Choose: **Keep** (preserve changes) | + +**Conflict Resolution Dialog**: +``` +Existing Object: SALES_ORDERS +┌─────────────────────────────────┐ +│ Overwrite (replace with new) │ +│ Keep (preserve customizations) │ +│ Rename new (add suffix _v2) │ +└─────────────────────────────────┘ +``` + +### Step 5: Activate Package + +1. 
Click **Activate** on the content package +2. Select target space(s) +3. Choose conflict resolution for each object +4. Review activation summary +5. Confirm activation + +**Activation Progress**: +``` +Creating objects: [████████░░] 80% (40/50 objects) +Estimated time remaining: 2 minutes +``` + +After activation, content objects appear in the target space. + +--- + +## LSA++ (Layered Scalable Architecture) Alignment + +LSA++ is SAP's recommended architecture for enterprise data warehouses. Business Content is designed to fit seamlessly into LSA++ layers. + +### Understanding LSA++ Layers + +``` +┌─────────────────────────────────────────────────────────┐ +│ REPORTING LAYER (L3) │ +│ Pre-aggregated analytics views for dashboards & reports │ +│ Example: Revenue_Analysis, Customer_Metrics │ +└─────────────────────────────────────────────────────────┘ + ↑ +┌─────────────────────────────────────────────────────────┐ +│ HARMONIZATION LAYER (L2) │ +│ Cleansed, standardized, unified data model │ +│ Example: Sales_Order_Harmonized, Customer_Unified │ +└─────────────────────────────────────────────────────────┘ + ↑ +┌─────────────────────────────────────────────────────────┐ +│ PROPAGATION LAYER (L1) │ +│ Document-level data, minimal transformation │ +│ Example: Sales_Order_Raw, Customer_Raw │ +└─────────────────────────────────────────────────────────┘ + ↑ +┌─────────────────────────────────────────────────────────┐ +│ INBOUND LAYER (L0) │ +│ Raw data extracted from source systems (as-is) │ +│ Example: SD_SALESDOCUMENT, MD_CUSTOMER │ +└─────────────────────────────────────────────────────────┘ +``` + +### How Business Content Maps to LSA++ Layers + +**Inbound Layer Objects** (L0 - Source extraction): +- Tables with source system structure (e.g., SALESDOCUMENT copied from S/4HANA) +- Staging tables for daily delta loads +- Minimal transformation, full detail + +**Propagation Layer Objects** (L1 - Document level): +- Document-level tables with line item detail +- Business 
fields added (e.g., sales document type descriptions) +- Ready for propagation to harmonization + +**Harmonization Layer Objects** (L2 - Unified): +- Cleansed, deduplicated, standardized data +- Cross-source consolidation (combining from multiple SAP modules) +- Rich master data (customer, product with attributes) +- Time-dimensioned snapshots for historical analysis + +**Reporting Layer Objects** (L3 - Analytics): +- Pre-aggregated cubes and analytical views +- Optimized for dashboard performance +- Business user language (e.g., "Revenue", "Gross Margin") +- Calculated fields and metrics + +### Best Practices for Layering Imported Content + +**1. Separate Spaces by Layer** (recommended): + +``` +Datasphere Spaces Structure: +├── INBOUND_LAYER (Space) +│ └── Raw data tables from source systems +│ └── Connections to SAP S/4HANA, Salesforce, etc. +├── HARMONIZATION_LAYER (Space) +│ └── Cleansed and standardized data +│ └── Data flows reading from INBOUND_LAYER +├── REPORTING_LAYER (Space) +│ └── Analytics views and dashboards +│ └── Analytical views reading from HARMONIZATION_LAYER +└── MASTERED_DATA (Space) + └── Reference data (Customer, Product, Organization) + └── Reusable by all layers +``` + +**Access Control by Layer**: +- **Inbound**: Only data engineers have access +- **Harmonization**: Data engineers + data architects +- **Reporting**: Business users (view only) +- **Mastered Data**: Read access for all layers + +**2. Organize Within a Space** (alternative for smaller teams): + +``` +Single Analytics Space with layering via naming: +├── L0_SALESDOCUMENT (Inbound) +├── L0_CUSTOMER (Inbound) +├── L1_SALESDOCUMENT_PROPAGATED (Propagation) +├── L2_SALESDOCUMENT_HARMONIZED (Harmonization) +├── L3_REVENUE_ANALYSIS (Reporting) +└── L3_CUSTOMER_METRICS (Reporting) +``` + +**3. Isolate Inbound from Harmonization** + +Critical principle: Never have data flows directly from source system to Reporting Layer. 
+
+**Wrong** (anti-pattern):
+```
+Source System → Reporting View
+(No data quality checks)
+```
+
+**Correct** (LSA++ compliant):
+```
+Source System → Inbound Tables → Harmonization Layer → Reporting View
+                  (staging)         (cleansing)          (optimization)
+```
+
+---
+
+## Managing Content Updates
+
+SAP regularly publishes new versions of Business Content. Decide whether to adopt updates.
+
+### Understanding Update Types
+
+**Patch Update** (e.g., 1.0.0 → 1.0.1):
+- Bug fixes and data corrections
+- No structural changes
+- Recommended to apply: **Always**
+
+**Minor Update** (e.g., 1.0 → 1.1):
+- New fields, additional views
+- Backward compatible
+- Recommended to apply: **Usually** (assess customizations)
+
+**Major Update** (e.g., 1.0 → 2.0):
+- Significant restructuring, deprecated objects
+- May break customizations
+- Recommended to apply: **Plan carefully**
+
+### Update Decision Matrix
+
+| Scenario | Recommendation | Notes |
+|----------|----------------|-------|
+| Patch update, no customizations | **Overwrite** | Apply immediately |
+| Patch update, minor customizations | **Keep**, then merge | Manual merge once the update is available |
+| Minor update, no customizations | **Overwrite** | Review new fields before updating |
+| Minor update, significant customizations | **Keep** | Evaluate if new features justify the re-work |
+| Major update, critical customizations | **Keep** | Plan migration project separately |
+| Production system, no customizations | **Overwrite** | Update after testing in non-prod |
+| Development space, any customizations | **Overwrite** | Easier to re-customize than maintain drift |
+
+### Update Workflow
+
+**Step 1: Check for Updates**
+
+```
+In Datasphere:
+1. Go to Business Content > Manage Content
+2. Look for "Update Available" badges
+3. Click to view release notes and changelog
+```
+
+**Step 2: Impact Analysis**
+
+```
+For each outdated object:
+1. Check if customizations exist (custom fields, flows)
+2. 
Check if dependent views use this object +3. Test update in non-production space first +``` + +**Step 3: Stage Update in Non-Prod** + +``` +1. Clone prod space to test space (if using separate spaces) +2. Or create separate test package version +3. Activate updated package version in test space +4. Run regression tests (SQL queries, dashboards) +5. Validate calculated fields and aggregations +``` + +**Step 4: Decide: Overwrite or Keep** + +If testing passes and no customizations exist: +``` +Click Overwrite → All objects replaced with new version +``` + +If customizations are critical or testing failed: +``` +Click Keep → Old version retained, new version labeled _v2 +``` + +After keeping old version: +- Manually migrate customizations to new version +- Gradually redirect data flows to _v2 objects +- Deprecate old version once migration complete + +**Step 5: Update Production** + +``` +After non-prod validation: +1. Activate update in production space +2. Monitor performance and error logs +3. Validate dashboards and reports render correctly +4. 
Communicate update to business users +``` + +--- + +## Industry-Specific Content Packages + +### Automotive Industry + +**Package: Automotive Sales & Service Analytics** +- Objects: 45 tables, 28 views +- Domains: Sales, Service, Warranty, Spare Parts +- Key Measures: Vehicle sales by model, service revenue, parts availability +- Prerequisites: Time Dimension, Currency Conversion, Customer master +- Typical Activation Time: 4-6 weeks + +**Package: Automotive Supply Chain** +- Objects: 62 tables, 35 views +- Domains: Procurement, Production Planning, Inventory, Logistics +- Key Measures: Supplier performance, production capacity, inventory turns +- Prerequisites: Time Dimension, Organization hierarchy, Product master +- Typical Activation Time: 8-10 weeks + +### Retail Industry + +**Package: Retail POS & Merchandise Analytics** +- Objects: 38 tables, 31 views +- Domains: Point of Sale, Merchandise Planning, Promotions +- Key Measures: Sales by product category, margin by location, promotion ROI +- Prerequisites: Time Dimension, Currency Conversion, Product hierarchy +- Typical Activation Time: 3-5 weeks + +**Package: Retail Supply Chain** +- Objects: 55 tables, 40 views +- Domains: Distribution, Inventory, Replenishment +- Key Measures: Stock coverage, distribution effectiveness, shrinkage +- Prerequisites: Time Dimension, Organization hierarchy, Product master +- Typical Activation Time: 6-8 weeks + +### Utilities Industry + +**Package: Energy & Water Distribution** +- Objects: 41 tables, 29 views +- Domains: Grid operations, Customer billing, Asset management +- Key Measures: Energy consumption by segment, outage frequency, billing accruals +- Prerequisites: Time Dimension, Currency Conversion, Equipment master +- Typical Activation Time: 5-7 weeks + +### Finance Industry + +**Package: General Ledger & Financial Reporting** +- Objects: 34 tables, 26 views +- Domains: Accounting, Profitability, Consolidation +- Key Measures: Revenue recognition, expense analysis, 
intercompany consolidation +- Prerequisites: Time Dimension, GL account master, Cost center hierarchy +- Typical Activation Time: 4-6 weeks + +**Package: Banking Risk & Compliance** +- Objects: 48 tables, 35 views +- Domains: Credit risk, Market risk, Regulatory reporting +- Key Measures: Risk-weighted assets, non-performing loans, regulatory ratios +- Prerequisites: Time Dimension, Product master, Risk classification master +- Typical Activation Time: 8-12 weeks + +### Manufacturing + +**Package: Production & Costing** +- Objects: 52 tables, 38 views +- Domains: Bill of Materials, Work orders, Job costing +- Key Measures: Cost per unit, throughput, variance analysis +- Prerequisites: Time Dimension, Product master, Cost center hierarchy +- Typical Activation Time: 7-9 weeks + +--- + +## Customizing Activated Content + +After activation, content is often tailored for organization-specific needs. + +### Safe Customization Patterns + +**Pattern 1: Add Calculated Fields** (Non-breaking) + +``` +Existing View: REVENUE_ANALYSIS +├── Base fields: Sales_Amount, Product, Customer +└── Add calculated fields: + ├── Margin_Percent = Gross_Margin / Sales_Amount * 100 + ├── Days_to_Payment = Invoice_Date - Payment_Date + └── Customer_Segment = (custom logic based on revenue) +``` + +**Pattern 2: Create Extension Views** (Recommended) + +Instead of modifying existing views, create new views that extend them: + +``` +Business Content View: REVENUE_ANALYSIS (DON'T MODIFY) +↓ +New Extension View: REVENUE_ANALYSIS_EXTENDED (your custom logic) +├── Extends: REVENUE_ANALYSIS +├── Adds: Additional dimensions and calculated fields +└── Data flows and dashboards consume _EXTENDED view +``` + +**Benefits**: +- Original view unmodified (survives future updates) +- Your customizations clearly separated +- Easier to merge future updates + +**Pattern 3: Create Custom Dimensions** + +Extend master data tables with organization-specific attributes: + +``` +Content View: CUSTOMER_MASTER 
(standard SAP fields) +├── Customer_ID, Name, Industry, Region (standard) +└── Add via custom fields: + ├── Account_Manager (org-specific) + ├── Customer_Segment_Custom (org-specific classification) + ├── Contract_Status (org-specific) +``` + +### Unsafe Customization (Avoid) + +**Anti-Pattern 1: Modify Content Objects Directly** + +``` +❌ DON'T DO THIS: +1. Edit view REVENUE_ANALYSIS (from content) +2. Add custom fields directly +3. Problem: Update overwrites customizations +``` + +**Anti-Pattern 2: Hard-code Values** + +``` +❌ DON'T DO THIS: +Sales_Amount WHERE Country = 'USA' +^ Hard-coded country filter breaks for other regions +``` + +**Better**: +``` +✓ DO THIS: +Create parameterized view with country input +Let business users select country via filter +``` + +### Common Customizations + +**1. Add Company-Specific Hierarchies** + +``` +Extend Organization hierarchy: +├── Region (from content) +└── Add: Sales Territory, Account Team (your custom) + +Extended View: SALES_REVENUE_BY_TERRITORY +├── Base: REVENUE_ANALYSIS +└── Joined with: Your Territory_Master table +``` + +**2. Align Chart of Accounts** + +``` +GL Account mapping table (YOUR custom): +├── Content_GL_Account → Your_GL_Account +├── 400000 (Sales) → 4000 (Sales Revenue) +├── 410000 (Returns) → 4100 (Sales Returns) + +Use mapping in data flow: +GL_Details → Map GL Account → Store in HARMONIZED table +``` + +**3. Add Company Fiscal Calendar** + +``` +If business uses non-Gregorian fiscal calendar: +1. Create custom fiscal calendar master +2. Extend Time Dimension joins with fiscal calendar +3. 
Reporting uses fiscal year / fiscal quarter +``` + +--- + +## Troubleshooting Failed Activations + +### Common Activation Failures + +| Error | Cause | Solution | +|-------|-------|----------| +| "Prerequisite not met: Time Dimension" | Time Dimension table empty | Populate Time Dimension with date data | +| "Space quota exceeded" | Not enough memory/disk | Increase space allocation or split across spaces | +| "Object name conflict" | Object exists, conflict resolution not specified | Choose Overwrite or Rename in conflict dialog | +| "Connection test failed" | Source system unreachable | Verify connection credentials and network | +| "Permission denied" | Insufficient space access | Ensure user has space_admin role | + +### Activation Log Review + +After failed activation, review logs: + +``` +In Datasphere: +1. Go to Business Content > Activation History +2. Find failed activation +3. Click View Logs +4. Search for ERROR lines +``` + +**Log Example**: +``` +[2024-02-01 10:15:30] INFO: Activation started for package SALES_ANALYTICS v1.2 +[2024-02-01 10:15:45] INFO: Creating objects... +[2024-02-01 10:16:02] ERROR: Failed to create object REVENUE_DAILY +[2024-02-01 10:16:02] ERROR: Reason: "Space SALES_ANALYTICS at capacity (1000 GB / 1000 GB)" +[2024-02-01 10:16:02] WARN: Rollback initiated. 12 objects created, 3 objects rolled back. +``` + +### Retry Failed Activation + +After fixing the underlying issue: + +``` +1. Go to Business Content > Manage Content +2. Find the package with failed activation +3. Click Retry Activation +4. Review conflict resolution settings +5. 
Click Confirm +``` + +--- + +## Post-Activation Validation Checklist + +After successful activation, verify everything is working: + +### Data Verification + +- [ ] Time Dimension table populated with correct date range + ```sql + SELECT MIN(DATE), MAX(DATE), COUNT(*) FROM TIME_DIMENSION + ``` + +- [ ] Currency Conversion rates populated + ```sql + SELECT COUNT(*) FROM TCURV WHERE VALID_FROM <= TODAY() + ``` + +- [ ] Master data tables have records + ```sql + SELECT TABLE_NAME, COUNT(*) FROM [ACTIVATED_OBJECTS] GROUP BY TABLE_NAME + ``` + +- [ ] Data flow test load executed successfully + - No error logs in data flow execution history + - Row counts match expected volumes + +### Analytical View Verification + +- [ ] Key analytical views return data + ```sql + SELECT TOP 100 * FROM REVENUE_ANALYSIS + -- Should return rows with expected columns + ``` + +- [ ] Calculated fields compute without errors + ```sql + SELECT *, MARGIN_PERCENT FROM REVENUE_ANALYSIS + -- No NULL or error values for MARGIN_PERCENT + ``` + +- [ ] Aggregation views perform acceptably + - Query execution time < 5 seconds + - No memory errors in query trace + +### Dashboard & Reporting + +- [ ] Pre-built dashboards load without errors +- [ ] Charts render with expected data +- [ ] Drill-down by dimensions works +- [ ] Filter selections (date range, region) apply correctly + +### Performance Baseline + +- [ ] Document current query performance + ``` + Create_Date Baseline: + - REVENUE_ANALYSIS: 2.1 seconds + - CUSTOMER_METRICS: 1.8 seconds + - MARGIN_ANALYSIS: 3.2 seconds + ``` +- [ ] Monitor performance over 1-2 weeks +- [ ] Alert if degradation > 20% + +### Access & Security + +- [ ] Users can access content in appropriate spaces +- [ ] Row-level security rules apply (if configured) +- [ ] Audit logs track access to sensitive views + +--- + +## Next Steps + +1. **Identify Industry Packages**: Browse Content Network for your vertical +2. 
**Verify Prerequisites**: Ensure Time Dimension, Currency Conversion ready +3. **Plan LSA++ Layout**: Decide on space separation by layer +4. **Test in Non-Prod**: Activate in development space first +5. **Customize Thoughtfully**: Use extension views, not in-place modifications +6. **Monitor Post-Activation**: Validate data, performance, access +7. **Plan Updates**: Track new versions and plan upgrades quarterly + +See **references/content-catalog.md** for complete prerequisite checklists, activation troubleshooting, industry-specific content listings, and post-activation validation templates. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/references/content-catalog.md b/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/references/content-catalog.md new file mode 100644 index 0000000..a1c8ae0 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-business-content-activator/references/content-catalog.md @@ -0,0 +1,1404 @@ +# SAP Datasphere Business Content: Complete Catalog & Reference + +## Table of Contents +1. [Complete Prerequisite Checklist](#complete-prerequisite-checklist) +2. [Time Dimension Population Guide](#time-dimension-population-guide) +3. [Currency Conversion Setup Guide](#currency-conversion-setup-guide) +4. [LSA++ Layer Mapping Reference](#lsa-layer-mapping-reference) +5. [Industry Content Packages Catalog](#industry-content-packages-catalog) +6. [Content Update Decision Matrix](#content-update-decision-matrix) +7. [Activation Troubleshooting Guide](#activation-troubleshooting-guide) +8. 
[Post-Activation Validation Checklist](#post-activation-validation-checklist) + +--- + +## Complete Prerequisite Checklist + +### By Content Type: What You Need Before Activation + +#### Sales & Revenue Analytics Content + +**Mandatory Prerequisites**: +- [ ] Time Dimension table created and populated + - Date range covers: Current year -3 years to +1 year + - Includes: Year, Quarter, Month, Week, Day of Week, Day of Month + - Fiscal calendar (if non-Gregorian): Fiscal_Year, Fiscal_Quarter, Fiscal_Period + +- [ ] Currency Conversion (TCURR, TCURV) available + - Exchange rates populated for all base currencies to reporting currency + - Historical rates available (for multi-currency transactions) + - Rate update frequency: Daily or weekly + +- [ ] Customer Master Data + - Source: SAP S/4HANA connection or external file + - Minimum fields: Customer_ID, Customer_Name, Industry, Country, Sales_District + - Record count: All active and inactive customers + +- [ ] Product Master Data + - Product_ID, Product_Name, Product_Category, Product_Line + - Product hierarchy for drill-down analysis + - Pricing: Standard cost or list price per product + +- [ ] Organization Hierarchy + - Company_Code, Division, Region, Sales_Office + - Cost Centers (for profitability analysis) + - Sales Territory (if using territory-based sales model) + +**Optional Prerequisites** (enhances content): +- [ ] Customer Attributes (Industry, Account Size, Customer Type) +- [ ] Product Attributes (Margin %, Warranty Terms, Product Lifecycle) +- [ ] Sales Employee Master (Sales Rep name, territory, manager) + +**Typical Implementation Time**: 2-4 weeks +**Effort**: 2-3 data engineers + 1 business analyst + +#### Supply Chain & Inventory Content + +**Mandatory Prerequisites**: +- [ ] Time Dimension (as above) + +- [ ] Product Master Data + - Product_ID, Product_Name, Product_Category + - Unit of Measure (UOM): Base unit, conversion factors + - Procurement data: Lead time, Minimum order quantity, Supplier + 
+- [ ] Supplier Master Data + - Supplier_ID, Supplier_Name, Country, Shipping_Terms + - Rating: Quality, On-time delivery, Cost competitiveness + +- [ ] Location/Warehouse Master + - Warehouse_ID, Warehouse_Name, Location, Capacity + - Warehouse type (DC, Store, Plant) + +- [ ] Storage Location Mapping + - Product → Warehouse → Storage Location + - Enables granular inventory tracking + +**Optional Prerequisites**: +- [ ] Transportation master (shipping modes, carriers, costs) +- [ ] Commodity codes and tariff classification +- [ ] Seasonal adjustment factors (for demand planning) + +**Typical Implementation Time**: 3-5 weeks + +#### Finance & Profitability Content + +**Mandatory Prerequisites**: +- [ ] Time Dimension (with fiscal calendar critical here) + +- [ ] General Ledger Master + - GL Account (from Chart of Accounts) + - Account type (Asset, Liability, Revenue, Expense, Equity) + - Company Code assignment + +- [ ] Cost Element Master + - Cost Element ID and description + - Cost Element category (Material, Labor, Overhead) + +- [ ] Cost Center Master + - Cost Center ID, Name, Department + - Cost Center hierarchy (for consolidation) + +- [ ] Currency Conversion (critical for multi-company consolidation) + +- [ ] Company/Legal Entity Master + - Company Code, Company Name + - Fiscal year definition + - Inter-company elimination rules + +**Optional Prerequisites**: +- [ ] Profit Center Master (for segment profitability) +- [ ] Internal Orders (for project costing) +- [ ] Real Estate/Asset location mapping + +**Typical Implementation Time**: 4-6 weeks + +#### HR & Compensation Content + +**Mandatory Prerequisites**: +- [ ] Time Dimension (with payroll periods) + +- [ ] Employee Master + - Employee_ID, Employee_Name, Department, Cost_Center + - Job Title, Job Grade, Employment Type (Full-time, Part-time, Contractor) + - Hire Date, Termination Date (if applicable) + +- [ ] Organization Structure + - Department, Division, Company + - Reporting hierarchy (Manager → 
Direct Reports) + +- [ ] Compensation Master + - Base Salary, Variable Pay components + - Benefits allocation per employee + +**Optional Prerequisites**: +- [ ] Time & Attendance data (hours worked, absences) +- [ ] Skills and Competency master +- [ ] Succession planning data + +**Typical Implementation Time**: 2-3 weeks + +#### Manufacturing & Quality Content + +**Mandatory Prerequisites**: +- [ ] Time Dimension + +- [ ] Product Master (including Bill of Materials) + - Product_ID, Product_Name + - Component hierarchy (for BOM) + - Unit of Measure and conversion factors + +- [ ] Work Center Master + - Work Center ID, Name, Facility, Capacity + - Standard time per operation + +- [ ] Material Master + - Material_ID, Description, Category + - Cost (Standard cost per unit) + - Supplier (for sourced materials) + +- [ ] Quality Codes Master + - Defect codes, Inspection results, Pass/Fail codes + +**Optional Prerequisites**: +- [ ] Equipment / Asset master (for maintenance tracking) +- [ ] Routing master (sequential operations per product) +- [ ] Batch/Lot number master (for traceability) + +**Typical Implementation Time**: 3-5 weeks + +--- + +## Time Dimension Population Guide + +The Time Dimension is foundational for all analytics. Proper setup ensures temporal calculations work correctly. 
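Two properties the verification steps in this guide depend on — a DATE column with no gaps and no duplicates — can also be checked client-side before loading. A minimal sketch in plain Python; the `find_gaps` helper and the sample data are illustrative, not part of any Business Content package:

```python
from datetime import date, timedelta

def find_gaps(dates):
    """Return the calendar dates missing between min(dates) and max(dates)."""
    present = set(dates)                  # a set also collapses duplicate rows
    day, end = min(present), max(present)
    missing = []
    while day <= end:
        if day not in present:            # hole in the Time Dimension range
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Illustrative sample: ten consecutive days with one deliberately removed
sample = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]
sample.remove(date(2024, 1, 5))
print(find_gaps(sample))  # → [datetime.date(2024, 1, 5)]
```

An empty result on the real export means the table is contiguous and safe for period-over-period joins.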
+ +### Step 1: Create Time Dimension Table + +If not auto-created by Business Content activation: + +```sql +CREATE TABLE TIME_DIMENSION ( + DATE DATE NOT NULL, + YEAR INTEGER, + QUARTER INTEGER, + QUARTER_NAME VARCHAR(10), + MONTH INTEGER, + MONTH_NAME VARCHAR(20), + WEEK_OF_YEAR INTEGER, + WEEK_OF_MONTH INTEGER, + DAY_OF_WEEK INTEGER, + DAY_NAME VARCHAR(20), + DAY_OF_MONTH INTEGER, + DAY_OF_YEAR INTEGER, + FISCAL_YEAR INTEGER, + FISCAL_QUARTER INTEGER, + FISCAL_QUARTER_NAME VARCHAR(10), + FISCAL_MONTH INTEGER, + IS_HOLIDAY BOOLEAN, + HOLIDAY_NAME VARCHAR(100), + IS_WEEKEND BOOLEAN, + WORKING_DAY BOOLEAN, + PRIMARY KEY (DATE) +); +``` + +### Step 2: Generate Time Dimension Data + +**Option A: Use SAP Datasphere Generator** + +``` +In Datasphere: +1. Go to Business Content > Administration > Time Dimension +2. Click Generate Data +3. Specify parameters: + ├── Start Date: 2021-01-01 + ├── End Date: 2026-12-31 + ├── Fiscal Calendar: Custom Fiscal Year starts July 1 + ├── Holiday Calendar: Select your country (USA, Germany, Japan, etc.) +4. 
Download CSV file
+```
+
+**Option B: Generate Programmatically**
+
+Python script to generate Time Dimension:
+
+```python
+import pandas as pd
+
+# Parameters
+start_date = pd.Timestamp('2021-01-01')
+end_date = pd.Timestamp('2026-12-31')
+fiscal_year_start_month = 7  # July
+
+# Generate date range
+dates = pd.date_range(start=start_date, end=end_date, freq='D')
+
+# Build Time Dimension
+time_dim = pd.DataFrame({
+    'DATE': dates,
+    'YEAR': dates.year,
+    'QUARTER': dates.quarter,
+    'QUARTER_NAME': 'Q' + dates.quarter.astype(str),
+    'MONTH': dates.month,
+    'MONTH_NAME': dates.strftime('%B'),
+    'WEEK_OF_YEAR': dates.isocalendar().week,
+    'WEEK_OF_MONTH': ((dates.day - 1) // 7) + 1,
+    'DAY_OF_WEEK': dates.weekday + 1,  # 1=Monday, 7=Sunday (weekday is a property, not a method)
+    'DAY_NAME': dates.strftime('%A'),
+    'DAY_OF_MONTH': dates.day,
+    'DAY_OF_YEAR': dates.dayofyear,
+    'IS_WEEKEND': dates.weekday >= 5,  # Saturday, Sunday
+    'WORKING_DAY': dates.weekday < 5,  # Monday-Friday
+})
+
+# Calculate Fiscal Year (starting July)
+def get_fiscal_year(date, fiscal_start_month=7):
+    if date.month >= fiscal_start_month:
+        return date.year + 1
+    else:
+        return date.year
+
+time_dim['FISCAL_YEAR'] = time_dim['DATE'].apply(
+    lambda x: get_fiscal_year(x, fiscal_year_start_month)
+)
+
+# Calculate Fiscal Quarter
+def get_fiscal_quarter(date, fiscal_start_month=7):
+    month_in_fiscal = (date.month - fiscal_start_month) % 12
+    return (month_in_fiscal // 3) + 1
+
+time_dim['FISCAL_QUARTER'] = time_dim['DATE'].apply(
+    lambda x: get_fiscal_quarter(x, fiscal_year_start_month)
+)
+time_dim['FISCAL_QUARTER_NAME'] = 'FY' + time_dim['FISCAL_YEAR'].astype(str) + \
+    'Q' + time_dim['FISCAL_QUARTER'].astype(str)
+
+# Add holidays (example: USA holidays)
+holidays = {
+    pd.Timestamp('2024-01-01'): 'New Year Day',
+    pd.Timestamp('2024-07-04'): 'Independence Day',
+    pd.Timestamp('2024-12-25'): 'Christmas',
+}
+
+time_dim['IS_HOLIDAY'] = time_dim['DATE'].isin(holidays.keys())
+time_dim['HOLIDAY_NAME'] = time_dim['DATE'].map(holidays) + +# Save to CSV +time_dim.to_csv('time_dimension.csv', index=False, date_format='%Y-%m-%d') +print(f"Generated Time Dimension: {len(time_dim)} rows") +print(time_dim.head()) +``` + +### Step 3: Load into Datasphere + +**Method 1: Direct CSV Upload** + +``` +In Datasphere: +1. Go to Data Integration > New Data Flow +2. Add source: Upload File (time_dimension.csv) +3. Add target: TIME_DIMENSION table +4. Map columns (ensure correct data types) +5. Execute load +``` + +**Method 2: Load via Data Integration Flow** + +Create SQL INSERT or SAP S/4HANA extraction if your company has a time dimension there: + +``` +Data Flow Steps: +1. Extract from SAP S/4HANA table /BIC/DIMDATE (if available) + → Or upload CSV as shown above +2. Transform: + ├── Ensure DATE is primary key (unique, not null) + ├── Convert all numeric fields to INTEGER + ├── Convert all name fields to VARCHAR +3. Load into TIME_DIMENSION table +4. Create index on DATE column for performance +``` + +### Step 4: Verify Population + +```sql +-- Check date coverage +SELECT MIN(DATE) as earliest_date, MAX(DATE) as latest_date, COUNT(*) as row_count +FROM TIME_DIMENSION; + +Expected output: +earliest_date: 2021-01-01 +latest_date: 2026-12-31 +row_count: 2191 (6 years × 365 days average) + +-- Check fiscal calendar +SELECT DISTINCT FISCAL_YEAR, MIN(DATE), MAX(DATE) +FROM TIME_DIMENSION +GROUP BY FISCAL_YEAR +ORDER BY FISCAL_YEAR; + +Expected output (if FY starts July): +FISCAL_YEAR: 2021, MIN: 2020-07-01, MAX: 2021-06-30 +FISCAL_YEAR: 2022, MIN: 2021-07-01, MAX: 2022-06-30 +... + +-- Check no missing dates +SELECT COUNT(*) FROM TIME_DIMENSION +WHERE DATE BETWEEN '2024-01-01' AND '2024-12-31'; + +Expected output: 366 (2024 is a leap year) +``` + +### Step 5: Update Frequency + +Set up daily or weekly refresh to maintain current date: + +``` +In Datasphere Data Integration: +1. Create repeating task for Time Dimension refresh +2. 
Schedule: Daily at 23:59 UTC
+3. Action: Insert new dates for next 1 year (rolling window)
+4. Delete future dates beyond 3 years from today (keep 3-year rolling window)
+```
+
+---
+
+## Currency Conversion Setup Guide
+
+Multi-currency analytics require proper currency conversion configuration.
+
+### Step 1: Source Currency Master Data
+
+#### From SAP S/4HANA (Recommended)
+
+Create data flow to extract exchange rates:
+
+```
+Source: SAP S/4HANA
+Table: TCURR (Exchange Rates)
+
+Fields:
+├── FCURR (From Currency) — Source currency code
+├── TCURR (To Currency) — Target currency code
+├── GDATU (Valid From Date) — Rate validity start (note: in the raw ABAP table
+│   GDATU is stored as an inverted date; most extraction tools convert it to a
+│   regular date). Standard TCURR has no valid-to field — a rate is valid until
+│   the next rate's start date.
+├── KURST (Exchange Rate Type) — M (Daily), B (Bond), etc.
+├── UKURS (Exchange Rate) — Conversion factor
+
+Target: Datasphere TCURR table
+```
+
+Data Flow SQL (the validity end is derived as the day before the next rate starts):
+
+```sql
+SELECT
+    FCURR as source_currency,
+    TCURR as target_currency,
+    GDATU as valid_from_date,
+    ADD_DAYS(
+        LEAD(GDATU) OVER (PARTITION BY FCURR, TCURR ORDER BY GDATU), -1
+    ) as valid_to_date,  -- NULL means the rate is still open-ended
+    KURST as rate_type,
+    UKURS as exchange_rate
+FROM TCURR@SAP_S4H
+WHERE KURST = 'M' -- Daily rates
+  AND GDATU <= TODAY()
+ORDER BY FCURR, TCURR, GDATU;
+```
+
+#### From External File
+
+If not available in SAP:
+
+```csv
+source_currency,target_currency,valid_from_date,valid_to_date,exchange_rate
+USD,EUR,2024-01-01,2024-12-31,0.92
+USD,GBP,2024-01-01,2024-12-31,0.79
+EUR,GBP,2024-01-01,2024-12-31,0.86
+JPY,USD,2024-01-01,2024-12-31,0.0067
+CNY,USD,2024-01-01,2024-12-31,0.138
+```
+
+### Step 2: Create Exchange Rate Lookup Table
+
+```sql
+CREATE TABLE TCURR (
+    SOURCE_CURRENCY CHAR(3) NOT NULL,
+    TARGET_CURRENCY CHAR(3) NOT NULL,
+    VALID_FROM_DATE DATE NOT NULL,
+    VALID_TO_DATE DATE NOT NULL,
+    RATE_TYPE CHAR(4),
+    EXCHANGE_RATE DECIMAL(20, 8),
+    PRIMARY KEY (SOURCE_CURRENCY, TARGET_CURRENCY, VALID_FROM_DATE)
+);
+
+CREATE INDEX idx_target_date ON TCURR (TARGET_CURRENCY, VALID_FROM_DATE);
+
+-- Insert exchange rates
+INSERT INTO TCURR VALUES
+('USD', 'EUR', '2024-01-01', '2024-12-31', 'M', 0.92),
+('USD',
'GBP', '2024-01-01', '2024-12-31', 'M', 0.79), +('EUR', 'USD', '2024-01-01', '2024-12-31', 'M', 1.0869), +('EUR', 'GBP', '2024-01-01', '2024-12-31', 'M', 0.86), +...; +``` + +### Step 3: Create Currency Conversion Calculation View + +```sql +CREATE VIEW TCURV ( + source_currency, + target_currency, + conversion_date, + exchange_rate +) AS +SELECT + tcurr.source_currency, + tcurr.target_currency, + /* Get the most recent rate for the given date */ + MAX(tcurr.valid_from_date) as conversion_date, + tcurr.exchange_rate +FROM TCURR tcurr +WHERE tcurr.valid_from_date <= CURRENT_DATE + AND tcurr.valid_to_date >= CURRENT_DATE +GROUP BY + tcurr.source_currency, + tcurr.target_currency, + tcurr.exchange_rate; +``` + +### Step 4: Use in Analytical Queries + +**Example: Convert all sales to USD** + +```sql +SELECT + t.transaction_id, + t.transaction_date, + t.amount_original, + t.currency_original, + t.amount_original * COALESCE(c.exchange_rate, 1) as amount_usd, + 'USD' as currency_converted +FROM SALES t +LEFT JOIN TCURV c + ON t.currency_original = c.source_currency + AND c.target_currency = 'USD' + AND c.conversion_date <= t.transaction_date +WHERE c.conversion_date = ( + SELECT MAX(conversion_date) + FROM TCURV c2 + WHERE c2.source_currency = t.currency_original + AND c2.target_currency = 'USD' + AND c2.conversion_date <= t.transaction_date +); +``` + +### Step 5: Verify Configuration + +```sql +-- Check exchange rates available +SELECT + source_currency, + target_currency, + COUNT(*) as rate_count, + MIN(valid_from_date) as earliest, + MAX(valid_to_date) as latest +FROM TCURR +GROUP BY source_currency, target_currency +ORDER BY source_currency; + +-- Check current rates +SELECT * FROM TCURV +WHERE conversion_date = TODAY() +ORDER BY source_currency, target_currency; + +-- Test conversion +SELECT + 100 as amount, + 'EUR' as from_currency, + 'USD' as to_currency, + 100 * (SELECT exchange_rate FROM TCURV + WHERE source_currency = 'EUR' + AND target_currency = 'USD' + AND 
conversion_date = TODAY() + ) as amount_converted_usd; +``` + +### Step 6: Setup Refresh Schedule + +Exchange rates typically change daily: + +``` +In Datasphere Data Integration: +1. Create task: "Update Exchange Rates Daily" +2. Source: External system or manual file upload +3. Target: TCURR table +4. Refresh frequency: Daily at 08:00 UTC +5. Retention: Keep rates for 3 years rolling +``` + +--- + +## LSA++ Layer Mapping Reference + +Complete reference for how Business Content objects map to LSA++ layers. + +### Layer 0: Inbound Layer - Source Extraction + +**Purpose**: Raw data from source systems, minimal transformation. + +**Characteristics**: +- Table structure mirrors source system +- Full detail level (no aggregation) +- Delta load frequency (daily, hourly) +- Field names may match source system language + +**Example Objects from Business Content**: + +``` +Sales Analytics Package: +├── SALES_ORDER_INBOUND (mirrors S/4HANA VBAK) +├── SALES_ITEM_INBOUND (mirrors S/4HANA VBAP) +├── CUSTOMER_INBOUND (mirrors S/4HANA KNA1) +└── MATERIAL_INBOUND (mirrors S/4HANA MARA) + +Supply Chain Package: +├── PURCHASE_ORDER_INBOUND (mirrors EKKO) +├── DELIVERY_INBOUND (mirrors LIKP) +├── GOODS_RECEIPT_INBOUND (mirrors EKET) +└── STOCK_INBOUND (mirrors MARD) +``` + +**Data Flow Pattern**: +``` +Source System → Inbound Table (extract as-is) +``` + +**Access Control**: Data engineers only + +--- + +### Layer 1: Propagation Layer - Document Level + +**Purpose**: Document-level data with minimal business logic, ready to propagate. 
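This document-level enrichment can be prototyped locally before building the corresponding Datasphere data flow. A minimal pandas sketch (all table and column names here are illustrative, not part of any Business Content package):

```python
import pandas as pd

# Hypothetical L0 inbound extracts (in Datasphere these would be inbound tables)
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "customer_id": ["C1", "C2", "C1"],
    "amount": [250.0, 480.0, 120.0],
})
customers = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "customer_name": ["Acme Corp", "Globex"],
    "industry": ["Manufacturing", "Retail"],
})

# L0 -> L1: enrich each document row with master-data attributes,
# keeping full document-level detail (no aggregation)
propagated = orders.merge(customers, on="customer_id", how="left")
propagated["load_timestamp"] = pd.Timestamp.now(tz="UTC")

print(propagated[["order_id", "customer_name", "industry", "amount"]])
```

The same left-join shape carries over to the data flow's join node: documents stay at full detail, and master-data attributes are denormalized onto each row.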
+ +**Characteristics**: +- One row per document or transaction +- Includes related master data attributes (not just IDs) +- Light transformation (decode values, convert units) +- Retains all detail (not aggregated) + +**Example Objects**: + +``` +Sales Analytics: +├── SALES_ORDER_PROPAGATED +│ From: SALES_ORDER_INBOUND + CUSTOMER_INBOUND +│ Added: Customer_Name, Industry, Sales_Rep +│ +├── INVOICE_PROPAGATED +│ From: BILLING_INBOUND + ORDER_INBOUND +│ Added: Order_ID, Customer reference, Terms +``` + +**Data Flow Pattern**: +``` +Inbound (L0) → Enrich with Master Data → Propagation Table (L1) +``` + +**Transformations**: +```sql +-- Example transformation from L0 to L1 +INSERT INTO SALES_ORDER_PROPAGATED +SELECT + o.order_id, + o.order_date, + o.customer_id, + c.customer_name, -- Added from CUSTOMER master + c.industry, -- Added from CUSTOMER master + o.amount, + o.currency, + o.sales_rep_id, + sr.sales_rep_name, -- Added from SALES_REP master + CURRENT_TIMESTAMP as load_timestamp +FROM SALES_ORDER_INBOUND o +LEFT JOIN CUSTOMER_INBOUND c ON o.customer_id = c.customer_id +LEFT JOIN SALES_REP_INBOUND sr ON o.sales_rep_id = sr.sales_rep_id; +``` + +**Access Control**: Data engineers + architects + +--- + +### Layer 2: Harmonization Layer - Unified Model + +**Purpose**: Cleansed, standardized, deduplicated data. Single source of truth. 
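The SCD Type 2 mechanics used in this layer — close the superseded version of a changed record, then append a new current version — can be worked through in a small in-memory sketch before committing to SQL (hypothetical data; pandas is used only for illustration):

```python
import pandas as pd

# Current state of the harmonized customer dimension (one active version per key)
dim = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "customer_name": ["Acme Corp", "Globex"],
    "valid_from": pd.to_datetime(["2023-01-01", "2023-03-15"]),
    "valid_to": [pd.NaT, pd.NaT],
    "is_active": [1, 1],
})

# Incoming record where a tracked attribute changed
incoming = {"customer_id": "C1", "customer_name": "Acme Corporation"}
today = pd.Timestamp("2024-02-01")

active = (dim["customer_id"] == incoming["customer_id"]) & (dim["is_active"] == 1)
changed = active & (dim["customer_name"] != incoming["customer_name"])

if changed.any():
    # Close the old version (history is preserved, not overwritten)
    dim.loc[changed, "valid_to"] = today - pd.Timedelta(days=1)
    dim.loc[changed, "is_active"] = 0
    # Append the new current version with an open-ended validity
    new_row = pd.DataFrame([{
        "customer_id": incoming["customer_id"],
        "customer_name": incoming["customer_name"],
        "valid_from": today,
        "valid_to": pd.NaT,
        "is_active": 1,
    }])
    dim = pd.concat([dim, new_row], ignore_index=True)

print(dim)
```

Note that only the version row whose tracked attribute changed is closed; unchanged customers keep their open-ended `valid_to`.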
+
+**Characteristics**:
+- Business-friendly naming (not source system codes)
+- Data quality checks and deduplication
+- Cross-source consolidation (combine SAP + Salesforce + external systems)
+- Slowly Changing Dimensions (SCD) logic
+- Time-dimensioned snapshots for historical analysis
+
+**Example Objects**:
+
+```
+Sales Analytics:
+├── CUSTOMER_HARMONIZED
+│ ├── Customer_ID (primary key)
+│ ├── Customer_Name (standardized)
+│ ├── Industry (from master, validated against industry list)
+│ ├── Valid_From, Valid_To (SCD Type 2)
+│ ├── Is_Active (calculated flag)
+│ └── Last_Modified_Date
+
+├── SALES_ORDER_HARMONIZED
+│ ├── Order_ID
+│ ├── Order_Date (standardized format)
+│ ├── Customer_ID (foreign key to CUSTOMER_HARMONIZED)
+│ ├── Amount_Local_Currency
+│ ├── Amount_USD (converted using TCURV)
+│ ├── Order_Status (standardized: NEW, CONFIRMED, SHIPPED, CLOSED)
+│ ├── Margin_Amount (calculated)
+│ └── Load_Date (when record was harmonized)
+
+├── SALES_DAILY_SNAPSHOT
+│ ├── Report_Date (grain)
+│ ├── Customer_ID
+│ ├── Orders_Count (aggregated)
+│ ├── Revenue_Amount (summed, in USD)
+│ └── Average_Order_Value (calculated)
+```
+
+**Data Quality Rules** (applied in L2):
+
+```sql
+-- Example: Deduplication
+INSERT INTO CUSTOMER_HARMONIZED
+SELECT
+    customer_id,
+    MAX(customer_name) as customer_name, -- Pick one value deterministically
+    MAX(industry) as industry,
+    CURRENT_DATE as valid_from,
+    NULL as valid_to,
+    1 as is_active,
+    CURRENT_TIMESTAMP as last_modified
+FROM SALES_ORDER_PROPAGATED
+WHERE customer_id IS NOT NULL
+  AND customer_name NOT LIKE '%TEST%' -- Quality check
+  AND customer_name NOT LIKE '%DUMMY%'
+GROUP BY customer_id;
+
+-- Example: SCD Type 2 (track history): close the superseded version of a
+-- changed customer; new customers are inserted as current versions
+MERGE INTO CUSTOMER_HARMONIZED tgt
+USING (SELECT * FROM CUSTOMER_INBOUND) src
+ON tgt.customer_id = src.customer_id
+WHEN MATCHED AND src.customer_name <> tgt.customer_name THEN
+    UPDATE SET tgt.valid_to = CURRENT_DATE - 1
+WHEN NOT MATCHED THEN
+    INSERT (customer_id, customer_name, valid_from, valid_to, 
is_active) + VALUES (src.customer_id, src.customer_name, CURRENT_DATE, NULL, 1); +``` + +**Access Control**: Analysts (read), architects (write) + +--- + +### Layer 3: Reporting Layer - Analytics + +**Purpose**: Pre-aggregated, optimized analytical views for dashboards and reports. + +**Characteristics**: +- Pre-aggregated (group by key dimensions) +- Optimized for dashboard performance +- Business metric naming (Revenue, Margin, %, Rank) +- Designed for end-user self-service +- Usually materialized (stored, not calculated on-demand) + +**Example Objects**: + +``` +Sales Analytics: +├── REVENUE_BY_PRODUCT +│ ├── Report_Date (daily grain) +│ ├── Product_Category +│ ├── Product_Name +│ ├── Revenue_Amount (SUM of orders) +│ ├── Order_Count (COUNT of orders) +│ ├── Average_Order_Value (SUM / COUNT) +│ └── Margin_Percent (Margin / Revenue * 100) + +├── CUSTOMER_METRICS +│ ├── Customer_ID +│ ├── Customer_Name +│ ├── Industry +│ ├── YTD_Revenue +│ ├── YTD_Order_Count +│ ├── YTD_Average_Order_Value +│ ├── Churn_Risk (calculated flag: no order in 90 days) +│ └── Customer_Lifetime_Value (all-time revenue) + +├── REGIONAL_SALES_DASHBOARD +│ ├── Report_Date +│ ├── Region +│ ├── Sales_Office +│ ├── Revenue_Amount +│ ├── Revenue_YoY_Growth (Year-over-Year %) +│ ├── Target_Revenue +│ ├── Variance_to_Target +│ ├── Rank_in_Region +│ └── Status (On_Track, At_Risk, Off_Track) +``` + +**Aggregation Examples**: + +```sql +-- Example L3 view: Revenue by Product +INSERT INTO REVENUE_BY_PRODUCT +SELECT + DATE_TRUNC(o.order_date, DAY) as report_date, + p.product_category, + p.product_name, + SUM(o.amount_usd) as revenue_amount, + COUNT(DISTINCT o.order_id) as order_count, + SUM(o.amount_usd) / COUNT(DISTINCT o.order_id) as average_order_value, + SUM(o.margin_amount) / SUM(o.amount_usd) * 100 as margin_percent +FROM SALES_ORDER_HARMONIZED o +JOIN PRODUCT_HARMONIZED p ON o.product_id = p.product_id +WHERE o.order_date >= DATE_TRUNC(CURRENT_DATE, MONTH) +GROUP BY + DATE_TRUNC(o.order_date, 
DAY), + p.product_category, + p.product_name; + +-- Example L3 view: Customer Metrics (with complex calculations) +INSERT INTO CUSTOMER_METRICS +SELECT + c.customer_id, + c.customer_name, + c.industry, + SUM(o.amount_usd) as ytd_revenue, + COUNT(DISTINCT o.order_id) as ytd_order_count, + SUM(o.amount_usd) / NULLIF(COUNT(DISTINCT o.order_id), 0) as ytd_avg_order_value, + CASE + WHEN MAX(o.order_date) < CURRENT_DATE - 90 THEN 'High' + WHEN MAX(o.order_date) < CURRENT_DATE - 30 THEN 'Medium' + ELSE 'Low' + END as churn_risk, + SUM(o.amount_usd) as customer_lifetime_value +FROM CUSTOMER_HARMONIZED c +LEFT JOIN SALES_ORDER_HARMONIZED o + ON c.customer_id = o.customer_id + AND YEAR(o.order_date) = YEAR(CURRENT_DATE) +GROUP BY + c.customer_id, + c.customer_name, + c.industry; +``` + +**Access Control**: All business users (read-only) + +--- + +## Industry Content Packages Catalog + +### Automotive Industry + +#### Package: Automotive Sales & Service Analytics (ASA-SAL-001) +- **Version**: 2.3.1 (as of Feb 2024) +- **Size**: 45 tables, 28 views, 12 data flows +- **Industry Domains**: Sales, Service, Warranty, Spare Parts +- **Source Systems**: SAP S/4HANA, SAP CRM, SAP ERP +- **Key Analytical Areas**: + - Vehicle sales by model, trim, color + - Service revenue by service type and workshop + - Warranty claim analysis + - Spare parts inventory and turnover +- **Key Dimensions**: Vehicle_Model, Customer, Sales_Region, Workshop, Service_Type, Time +- **Key Metrics**: + - Revenue_Sales, Revenue_Service, Revenue_Warranty + - Units_Sold, Service_Orders_Count + - Average_Selling_Price, Service_Margin_Percent + - Warranty_Cost_Percent_of_Revenue +- **Prerequisites**: + - Time Dimension (3-year history minimum) + - Customer Master, Product Master (vehicles) + - Sales Organization, Dealer Network + - Currency Conversion (if multi-currency) +- **Typical Activation Time**: 4-6 weeks +- **Customization Needs**: Regional sales channel mapping + +#### Package: Automotive Supply Chain & 
Inventory (ASA-SCM-001) +- **Version**: 1.8.0 +- **Size**: 62 tables, 35 views, 18 data flows +- **Domains**: Procurement, Production Planning, Inventory, Logistics +- **Key Analytical Areas**: + - Supplier performance (quality, on-time delivery, cost) + - Production planning and scheduling + - Inventory planning and optimization + - Logistics cost and efficiency +- **Key Dimensions**: Supplier, Material, Plant, Warehouse, Logistics_Partner +- **Key Metrics**: + - Supplier_Quality_Score, On_Time_Delivery_Rate + - Inventory_Turnover, Days_Inventory_Outstanding + - Procurement_Cost_Variance, Freight_Cost_per_Unit + - Production_Yield, Equipment_Downtime +- **Prerequisites**: + - Time Dimension (with production calendar) + - Product Master (with BOM) + - Supplier Master, Organization Hierarchy + - Unit of Measure, Standard costs +- **Typical Activation Time**: 8-10 weeks +- **Customization Needs**: Multi-plant production planning rules + +### Retail Industry + +#### Package: Retail POS & Merchandise Analytics (RET-SAL-001) +- **Version**: 2.1.4 +- **Size**: 38 tables, 31 views, 10 data flows +- **Domains**: Point of Sale, Merchandise Planning, Promotions +- **Key Analytical Areas**: + - Sales by product category, store location + - Merchandise margin and turnover + - Promotion effectiveness and ROI + - Customer traffic and conversion +- **Key Dimensions**: Store, Product_Category, Date, Cashier, Promotion +- **Key Metrics**: + - Revenue, Units_Sold, Transactions_Count + - Margin_Amount, Margin_Percent + - Average_Transaction_Value + - Promotion_Lift, Customer_Traffic +- **Prerequisites**: + - Time Dimension (with fiscal calendar) + - Product Master (UPC codes, categories) + - Store Master (locations, formats) + - Currency Conversion +- **Typical Activation Time**: 3-5 weeks +- **Customization Needs**: Store hierarchy alignment + +#### Package: Retail Supply Chain & Inventory (RET-SCM-001) +- **Version**: 1.9.2 +- **Size**: 55 tables, 40 views, 15 data flows +- 
**Domains**: Distribution, Inventory, Replenishment +- **Key Analytical Areas**: + - Stock coverage and stockout analysis + - Distribution center efficiency + - Inventory aging and obsolescence + - Replenishment effectiveness +- **Key Dimensions**: Store, Distribution_Center, Product, Supplier +- **Key Metrics**: + - Stock_Coverage_Days, Stockout_Incidents + - Inventory_Turnover, Days_Inventory_Outstanding + - Distribution_Cost_per_Unit, Shrinkage_Rate + - Replenishment_Accuracy +- **Prerequisites**: + - Time Dimension + - Product Master, Store Master, Distribution Network + - Inventory transaction history (6+ months) + - Unit of Measure +- **Typical Activation Time**: 6-8 weeks +- **Customization Needs**: Multi-tier distribution network + +### Utilities Industry + +#### Package: Energy & Water Distribution Analytics (UTI-OPS-001) +- **Version**: 1.6.0 +- **Size**: 41 tables, 29 views, 12 data flows +- **Domains**: Grid Operations, Customer Billing, Asset Management +- **Key Analytical Areas**: + - Energy/water consumption by customer segment + - Outage frequency and duration (SAIDI/SAIFI) + - Billing and revenue collection + - Asset condition and maintenance +- **Key Dimensions**: Customer_Segment, Service_Area, Equipment, Date +- **Key Metrics**: + - Consumption_Volume, Revenue_Billing + - Outage_Frequency, Outage_Duration, Customer_Impact + - Collection_Rate, Days_Sales_Outstanding + - Asset_Age, Maintenance_Cost +- **Prerequisites**: + - Time Dimension (hourly/daily) + - Customer Master (by segment: residential, commercial, industrial) + - Equipment Master, Service Territory + - Meter readings and consumption data (historical 12 months) +- **Typical Activation Time**: 5-7 weeks +- **Customization Needs**: Regulatory reporting alignment + +### Finance Industry + +#### Package: General Ledger & Financial Reporting (FIN-GL-001) +- **Version**: 1.7.3 +- **Size**: 34 tables, 26 views, 8 data flows +- **Domains**: Accounting, Profitability, Consolidation +- **Key 
Analytical Areas**: + - Profitability analysis (P&L by segment) + - Cash flow analysis + - Receivables and payables aging + - Consolidation and elimination +- **Key Dimensions**: Company_Code, GL_Account, Cost_Center, Time +- **Key Metrics**: + - Revenue, Cost_of_Goods_Sold, Operating_Expense + - Gross_Margin, Operating_Income, Net_Income + - Accounts_Receivable_Aging, Days_Sales_Outstanding + - Accounts_Payable_Aging, Days_Payable_Outstanding +- **Prerequisites**: + - Time Dimension (with fiscal calendar critical) + - GL Account Master with account type + - Cost Center hierarchy + - Currency Conversion (for consolidation) + - Company/Legal Entity Master +- **Typical Activation Time**: 4-6 weeks +- **Customization Needs**: Chart of accounts mapping + +#### Package: Banking Risk & Compliance (FIN-RISK-001) +- **Version**: 1.2.1 +- **Size**: 48 tables, 35 views, 14 data flows +- **Domains**: Credit Risk, Market Risk, Regulatory Reporting +- **Key Analytical Areas**: + - Credit risk assessment and monitoring + - Portfolio composition and concentration + - Non-performing loan (NPL) analysis + - Regulatory ratio reporting (Basel III/IV) +- **Key Dimensions**: Borrower, Loan_Product, Rating, Collateral_Type +- **Key Metrics**: + - Risk_Weighted_Assets, Capital_Ratio + - Non_Performing_Loan_Ratio, Loss_Provisions + - Interest_Rate_Risk_Exposure, Liquidity_Coverage_Ratio +- **Prerequisites**: + - Time Dimension + - Borrower Master, Loan Portfolio master + - Risk classification master + - Currency Conversion + - Regulatory parameter tables (risk weights, LGD, PD) +- **Typical Activation Time**: 8-12 weeks +- **Customization Needs**: Regulatory framework alignment + +### Manufacturing Industry + +#### Package: Production & Cost Analysis (MFG-PROD-001) +- **Version**: 1.5.2 +- **Size**: 52 tables, 38 views, 16 data flows +- **Domains**: Bill of Materials, Work Orders, Cost Accounting +- **Key Analytical Areas**: + - Production cost analysis (standard vs. 
actual) + - Variance analysis (material, labor, overhead) + - Production efficiency and throughput + - Order profitability +- **Key Dimensions**: Product, Work_Center, Cost_Element, Order, Plant +- **Key Metrics**: + - Cost_per_Unit, Variance_Amount, Variance_Percent + - Throughput, Cycle_Time, Equipment_Utilization + - Scrap_Rate, Rework_Rate + - Order_Profitability +- **Prerequisites**: + - Time Dimension + - Product Master (with BOM) + - Cost Element Master, Cost Center hierarchy + - Standard costs (historical) + - Work Center/Equipment master +- **Typical Activation Time**: 7-9 weeks +- **Customization Needs**: Cost accounting method (actual vs. standard) + +--- + +## Content Update Decision Matrix + +When new versions of Business Content packages are released, decide whether to update: + +### Decision Framework + +``` +Does content have Recommend +customizations? | Version Type │ Action │ Effort +─────────────────┼───────────────────────────────────────── +No │ Patch (1.0→1.0.1)│ Overwrite │ 1-2 hrs +No │ Minor (1.0→1.1) │ Overwrite │ 2-4 hrs +No │ Major (1.0→2.0) │ Overwrite │ 4-8 hrs +Yes, minor │ Patch │ Overwrite │ 1-2 hrs +Yes, minor │ Minor │ Keep │ 1 day +Yes, minor │ Major │ Keep │ Separate project +Yes, major │ Any type │ Keep │ Plan carefully +``` + +### Detailed Scenarios + +| Scenario | Decision | Rationale | Steps | +|----------|----------|-----------|-------| +| **Patch: No customizations** | Overwrite immediately | Fixes bugs, improves stability | 1. Backup current version 2. Activate patch 3. Test dashboards | +| **Patch: Minor customizations** | Overwrite | Customizations usually preserved | 1. Verify custom fields not modified 2. Overwrite 3. Test | +| **Minor Version: No customizations** | Overwrite | New features valuable | 1. Review release notes 2. Test in dev 3. Activate | +| **Minor Version: Significant customizations** | Keep | Cost of re-customization outweighs benefit | 1. Document customizations 2. Evaluate new features 3. 
Plan migration for major version | +| **Major Version: Any customizations** | Keep (Plan separately) | Risk of breaking changes high | 1. Create project for migration 2. Analyze all changes 3. Rebuild customizations 4. Test thoroughly | +| **Production environment** | Keep + Test in Dev | Minimize production risk | 1. Update content in DEV 2. Thorough testing 3. Plan maintenance window 4. Update PROD | + +### Update Testing Checklist + +Before updating production content: + +- [ ] Backup current version + ``` + Export all objects: Business Content > Manage Content > Export + ``` + +- [ ] Test in non-production space + - [ ] Activate update version + - [ ] Run all key analytical queries + - [ ] Verify dashboard performance + - [ ] Check custom field calculations + - [ ] Test data flows + +- [ ] Compare versions + - [ ] List new tables/views + - [ ] Check deprecated objects + - [ ] Document breaking changes + +- [ ] Merge customizations (if using Keep) + - [ ] Identify modified fields in old version + - [ ] Manually apply to new version + - [ ] Test recalculations + +- [ ] Communicate to stakeholders + - [ ] Notify business users of new features + - [ ] Schedule training if UI changed + - [ ] Provide release notes + +--- + +## Activation Troubleshooting Guide + +### Symptoms and Solutions + +#### Symptom: "Prerequisite not satisfied: Time Dimension" + +**Root Cause**: Time Dimension table not populated or empty. + +**Solution**: +``` +1. Check if TIME_DIMENSION table exists + → Go to Tables in space, search for "TIME_DIMENSION" + +2. If exists but empty: + → Go to Business Content > Administration > Time Dimension + → Click "Generate Data" + → Download CSV file + → Create Data Flow: Upload CSV → TIME_DIMENSION table + → Execute load + +3. 
Verify population: + → Query: SELECT COUNT(*) FROM TIME_DIMENSION + → Should return thousands of rows (1+ year of data) +``` + +#### Symptom: "Currency Conversion view not available" + +**Root Cause**: TCURR/TCURV not populated with exchange rates. + +**Solution**: +``` +1. Check if TCURR table exists and has data + → SELECT COUNT(*) FROM TCURR + → If 0 rows: + +2. Load exchange rates: + → Create Data Flow from SAP S/4HANA table TCURR + → OR upload CSV with rates + → Target: TCURR table + +3. Create currency conversion view: + → Use Datasphere's Calculation View template + → Base on TCURR table + → Publish as TCURV view + +4. Verify: + → SELECT * FROM TCURV WHERE CONVERSION_DATE = TODAY() + → Should return rates for all currency pairs +``` + +#### Symptom: "Space quota exceeded" during activation + +**Root Cause**: Insufficient memory, disk, or object quota in target space. + +**Solution**: +``` +1. Check space quota: + → Go to Space > Settings + → Review: Memory Used / Allocated, Disk Used / Allocated + +2. Free up space: + → Delete unused tables/views + → Archive historical data flows + → OR increase space allocation: + └── Go to Space Settings > Upgrade Resources + +3. Alternative: Activate to different space + → Create new space with larger quota + → Select as activation target +``` + +#### Symptom: "Connection test failed" during activation + +**Root Cause**: Source system connection invalid or unreachable. + +**Solution**: +``` +1. Verify connection: + → Go to Connections > [Connection Name] + → Click Test Connection + → Review error details + +2. Common fixes: + → Check credentials (username/password valid) + → Verify hostname/IP reachable (ping, telnet) + → Check firewall rules + → Verify TLS certificate if using HTTPS + +3. Update connection: + → Edit connection with correct parameters + → Retry test + +4. 
Retry activation: + → Business Content > Manage Content + → Click Retry Activation +``` + +#### Symptom: "Permission denied" error + +**Root Cause**: User lacks necessary roles for space. + +**Solution**: +``` +1. Check user role in target space: + → Space Settings > Members + → Look for current user + +2. Assign space admin role: + → Have space owner or admin assign: + └── User Role: Space_Admin + └── Or: Space_Editor (minimum for activation) + +3. Retry activation with new permissions +``` + +#### Symptom: "Object conflicts: CUSTOMER table already exists" + +**Root Cause**: Name collision with existing table. + +**Solution**: +``` +1. Choose conflict resolution: + → In activation dialog, when conflict appears: + ├── Overwrite (replace existing table) + ├── Keep (skip this object) + └── Rename (add _v2 suffix to new version) + +2. If choosing Rename: + → Update data flows to use new table names + → Recalculate dependent views + +3. If choosing Keep: + → Manually merge old and new schema later + → Document changes for future migration +``` + +#### Symptom: Activation hangs or times out + +**Root Cause**: Large package, insufficient resources, or service slowness. + +**Solution**: +``` +1. Cancel activation: + → If no progress > 30 mins, click Cancel + → Activation rolls back automatically + +2. Increase space resources: + → Space Settings > Upgrade Memory / Disk + → Increase to double current allocation + +3. Retry activation: + → Business Content > Manage Content > Retry + → Monitor progress in real-time + +4. 
Enable debug logging:
+   → Go to Settings > Logging Level = DEBUG
+   → Activation logs will show detailed steps
+```
+
+---
+
+## Post-Activation Validation Checklist
+
+After successful activation, validate everything works:
+
+### Data Completeness Checks
+
+**Time Dimension**:
+```sql
+SELECT MIN(DATE), MAX(DATE), COUNT(*) FROM TIME_DIMENSION;
+-- Verify: Covers at least 3-year rolling window
+-- Verify: No gaps in dates
+-- Verify: Fiscal calendar populated if used
+```
+
+**Master Data** (Customer, Product, Organization):
+```sql
+SELECT TABLE_NAME, COUNT(*) FROM [ACTIVATED_TABLES]
+WHERE TABLE_NAME LIKE '%MASTER' OR TABLE_NAME LIKE '%REFERENCE'
+GROUP BY TABLE_NAME;
+-- Verify: All master data tables > 0 rows
+```
+
+**Exchange Rates** (if multi-currency):
+```sql
+SELECT COUNT(DISTINCT SOURCE_CURRENCY || TARGET_CURRENCY) FROM TCURV
+WHERE CONVERSION_DATE = TODAY();
+-- Verify: All expected currency pairs have rates
+```
+
+### Analytical View Validation
+
+**Test Key Views**:
+```sql
+-- For each key analytical view
+SELECT COUNT(*) FROM REVENUE_ANALYSIS;
+SELECT COUNT(*) FROM CUSTOMER_METRICS;
+SELECT COUNT(*) FROM SALES_DASHBOARD;
+
+-- Verify: Returns rows, no errors
+-- Verify: Execution time < 5 seconds for < 1M rows
+```
+
+**Check Calculated Fields**:
+```sql
+SELECT MARGIN_PERCENT FROM REVENUE_ANALYSIS LIMIT 10;
+-- Verify: No NULLs or error values
+-- Verify: Percentages between 0-100 (if appropriate)
+
+SELECT YTD_REVENUE, ORDERS_COUNT FROM CUSTOMER_METRICS LIMIT 10;
+-- Verify: No unexpected negative numbers
+-- Verify: Aggregations match manual calculation
+```
+
+### Performance Baseline
+
+**Record Execution Times**:
+```
+Create_Date: [Today]
+Query Performance Baseline:
+├── REVENUE_ANALYSIS: 1.8 seconds
+├── CUSTOMER_METRICS: 2.3 seconds
+├── SALES_DASHBOARD: 1.2 seconds
+├── MARGIN_ANALYSIS: 2.9 seconds
+└── INVENTORY_STATUS: 3.1 seconds
+
+Alert threshold: If any query > 5x baseline
+```
+
+### Dashboard & Report Testing
+
+**For Each Pre-Built 
Dashboard**: +- [ ] Loads without errors +- [ ] All charts render with data +- [ ] Drill-down navigation works +- [ ] Filters apply and refresh correctly +- [ ] Dates display in correct format + +**Example Test Case**: +``` +Dashboard: Regional Sales Performance +├── Load dashboard: ✓ (< 3 seconds) +├── Chart 1 "Sales by Region": ✓ (shows 5 regions) +├── Chart 2 "YoY Growth": ✓ (shows comparison) +├── Filter by Date Range: ✓ (updates all charts) +└── Drill-down Region → Sales Office: ✓ (drills to detail) +``` + +### Access & Security + +**Verify Access Control**: +``` +Test user: analyst@company.com +Role: Datasphere_Analyst + +Expected: Can read analytics views, cannot modify +├── Can SELECT from REVENUE_ANALYSIS: ✓ +├── Can view dashboards: ✓ +├── Cannot INSERT into tables: ✓ +├── Cannot delete views: ✓ +``` + +**Check Audit Logs**: +``` +Go to Administration > Audit Logs +Filter: Last 24 hours, Action = Activation +Verify: All activation steps logged +``` + +### Business Validation + +**Have Business Users Review**: +- [ ] KPIs match expected values (within 5%) +- [ ] Dimensions and hierarchies align to organization +- [ ] Data freshness is acceptable +- [ ] Report templates useful for their role + +**Example Validation**: +``` +Business Reviewer: Sales Director + +Expected Revenue (from Finance System): $45.2M +Actual from REVENUE_ANALYSIS: $45.1M +Variance: 0.2% ✓ (acceptable) + +Expected Orders (from operational system): 12,500 +Actual from ORDERS_COUNT: 12,480 +Variance: 0.2% ✓ (acceptable) +``` + +### Issue Logging + +If issues found: +``` +Log Template: +├── Date Found: [Date] +├── Component: [Table/View/Dashboard] +├── Issue: [Description] +├── Severity: Critical|High|Medium|Low +├── Resolution: [Fix applied] +└── Root Cause: [Why it happened] + +Example: +├── Component: REVENUE_ANALYSIS view +├── Issue: Q1 2024 revenue 5% lower than expected +├── Root Cause: Missing transaction data from one sales office (data load failure) +├── Resolution: Re-ran data 
flow from source system, now correct +``` + +### Sign-Off + +Get formal approval before opening to business users: + +``` +Activation Sign-Off Form: +├── Activated Package: SALES_ANALYTICS v2.1 +├── Target Space: ANALYTICS_PROD +├── Activation Date: 2024-02-01 +├── Data Validation: ✓ PASSED +├── Performance Baseline: ✓ PASSED +├── Dashboard Testing: ✓ PASSED +├── Business Review: ✓ PASSED +├── Go-Live Approval: +│ ├── Technical Lead: [Name] ✓ Approved +│ ├── Business Lead: [Name] ✓ Approved +│ └── Date: 2024-02-05 +└── Available to Business Users: Yes (as of 2024-02-06) +``` + +--- + +## References and Support + +- **Datasphere Documentation**: https://help.sap.com/datasphere +- **Business Content Network**: https://www.sap.com/datasphere/content-network +- **SAP Community**: https://community.sap.com/datasphere +- **SAP Support Portal**: https://support.sap.com + +For questions on specific industry content, contact your SAP solution partner or Datasphere implementation team. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/SKILL.md new file mode 100644 index 0000000..db92d38 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/SKILL.md @@ -0,0 +1,1303 @@ +--- +name: BW Bridge Migration +description: "Migrate SAP BW/4HANA to Datasphere using BW Bridge Shell and Remote Conversion. Use when replacing legacy BW systems, converting Process Chains to Task Chains, handling ADSO inventory migrations, analyzing Task List compatibility (STC01), or transitioning from hybrid Bridge operations. Essential for BW→Datasphere modernization strategies." +--- + +# BW Bridge Migration Skill + +## Overview + +The SAP Datasphere BW Bridge provides a bridge environment for migrating from SAP BW/4HANA to Datasphere. 
This skill guides you through comprehensive migration strategies, architectural decisions, object conversion workflows, and operational transition patterns. + +### When to Use This Skill + +- **BW System Modernization**: Replacing legacy BW/4HANA systems with cloud-native Datasphere +- **Data Warehouse Migration**: Moving InfoCubes, DSOs, and Process Chains to native Datasphere +- **Hybrid Operations**: Running BW Bridge alongside native Datasphere during transition +- **Process Chain Migration**: Converting BW scheduling to Datasphere Task Chains +- **Inventory and Write-Interface ADSOs**: Handling specialized ADSO types in migration scenarios +- **Compatibility Assessment**: Determining which BW objects can be converted vs. rebuilt + +### BW Bridge Architecture + +The BW Bridge is a restricted instance of SAP BW/4HANA running **within** the Datasphere environment: + +``` +┌─────────────────────────────────────────────────────┐ +│ Datasphere Tenant │ +├─────────────────────────────────────────────────────┤ +│ ┌──────────────────────────────────────────────┐ │ +│ │ BW Bridge (Embedded BW/4HANA Instance) │ │ +│ │ - InfoCubes / Composite Providers │ │ +│ │ - DataStore Objects (ADSOs) │ │ +│ │ - Process Chains │ │ +│ │ - BW Authorizations & Hierarchies │ │ +│ └──────────────────────────────────────────────┘ │ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Native Datasphere │ │ +│ │ - Dimensions │ │ +│ │ - Fact Tables & Analytical Datasets │ │ +│ │ - Task Chains │ │ +│ │ - Data Access Controls (DACs) │ │ +│ └──────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────┘ +``` + +**Key Characteristics:** +- BW Bridge is a **non-productive** restricted environment +- Cannot be used for production reporting +- Limited administrative capabilities compared to standalone BW/4HANA +- Intended for migration and modernization, not long-term operation + +--- + +## Phase 1: Migration Assessment + +### BW Object Inventory and 
Compatibility Analysis + +Before conversion, analyze your entire BW landscape: + +#### Step 1: Extract BW Object Inventory + +Use transaction **STC01** (Task List Management) in the source BW system to: + +1. Navigate to **STC01** in SAP GUI +2. Select **View Task List** to see all migration tasks +3. Export the full task list to a spreadsheet for documentation +4. Categorize objects by type: + - InfoCubes / Composite Providers + - DataStore Objects (Standard, Inventory, Write-Interface) + - Master Data Objects (Characteristics, Key Figures) + - Process Chains + - Hierarchies and Authorization objects + +#### Step 2: Compatibility Assessment + +Create a compatibility matrix: + +| Object Type | Convertible | Approach | Effort | Notes | +|---|---|---|---|---| +| InfoCube | Yes | Shell Conversion | Low | Direct mapping to Analytical Dataset | +| ADSO (Standard) | Yes | Shell Conversion | Low | Maps to native table | +| ADSO (Inventory) | Partial | Hybrid or Rebuild | Medium | Requires special DAC modeling | +| ADSO (Write-Interface) | Partial | Remote Conversion | High | Complex load process mapping | +| Composite Provider | Yes | Shell Conversion | Medium | May include non-convertible objects | +| Hierarchy | Yes | Both | Low | DAC hierarchy filtering available | +| Process Chain | Yes | Task Chain Rebuild | Medium | Manual mapping to Task Chain | +| BW Authorization | Yes | DAC Mapping | Medium | Migration via Analysis Authorization import | + +#### Step 3: Risk Scoring + +For each object, score migration risk: + +``` +Risk = (Complexity × Dependencies × Customization) ÷ Team Expertise + +- Complexity: 1-5 (Simple BW objects = 1, complex logic = 5) +- Dependencies: 1-5 (Isolated = 1, many upstream consumers = 5) +- Customization: 1-5 (Standard = 1, heavily modified = 5) +- Team Expertise: 1-5 (Deep BW + Datasphere knowledge = 5) + +Score > 12 = High Risk (prioritize for Shell Conversion) +Score 6-12 = Medium Risk (detailed design required) +Score < 6 = Low Risk 
(standard conversion path) +``` + +--- + +## Phase 2: Shell Conversion Workflow + +Shell Conversion automatically converts BW objects to native Datasphere objects while preserving metadata and logic. + +### When to Use Shell Conversion + +- Standard InfoCubes → Analytical Datasets +- Standard ADSOs → Native Tables +- Simple hierarchies → Datasphere hierarchies +- Composite Providers with convertible objects + +### Shell Conversion Step-by-Step + +#### Pre-Conversion Validation (Steps 1-5) + +1. **Verify BW System Health** + - Run consistency checks: **RSRV** transaction + - Ensure no orphaned objects + - Validate all InfoCube aggregate tables are rebuilt + - Check for active process chains and allow them to complete + +2. **Extract Technical Metadata** + - Document all InfoCube characteristics and key figures + - List all navigational attributes + - Export all hierarchies and custom hierarchies + - Note all authorization-relevant fields + - Document all calculated fields and restricted KFGs + +3. **Analyze Data Volume** + - Run the standard size report (e.g., **SAP_INFOCUBE_DESIGNS**) to get InfoCube sizes + - Plan data transfer bandwidth and timeline + - Identify incremental vs. full load requirements + - Estimate network throughput: `Time in seconds ≈ (Size in GB × 8,000) ÷ Network Mbps` + +4. **Assess Custom Development** + - Search for ABAP enhancements on InfoCube load processes + - Document user-exits and BAdIs + - Identify custom ABAP reports dependent on this InfoCube + - Plan re-coding requirements for unsupported objects + +5. **Create Migration Backlog** + - Prioritize objects by: + - Business criticality (P1: Core reporting, P2: Secondary, P3: Legacy) + - Dependency chain (convert dependencies before consumers) + - Data volume (small → large to validate approach) + - Assign owners to each object conversion + +#### Datasphere Preparation (Steps 6-10) + +6. 
**Set Up Datasphere Space** + - Create dedicated migration space: `BW_BRIDGE_MIGRATION` + - Configure space members with appropriate roles + - Enable audit logging for compliance tracking + - Set up separate native Datasphere space: `NATIVE_TARGET` + +7. **Prepare Source System Connections** + - Create connection to source BW system with read-only user + - Test connectivity and credential validation + - Set up network routing if across different networks + - Enable BW Bridge connector (licensed separately) + +8. **Stage BW Bridge Instance** + - BW Bridge provisioning (SAP handles infrastructure) + - Validate Bridge system accessibility + - Configure user accounts with BW Bridge access + - Test STC01 access in Bridge environment + +9. **Plan Object Naming Convention** + - Establish naming prefix: `Z_`, `C_`, `_CONVERTED_` + - Document version tracking: `v1`, `v2_refined` + - Separate Bridge objects from native: `BRIDGE_*` vs `DS_*` + - Create mapping spreadsheet: BW Name → Datasphere Name + +10. **Set Up Data Transfer Infrastructure** + - Configure batch data loads schedule + - Plan full vs. incremental load strategy + - Set up error logging and monitoring + - Prepare rollback data snapshots + +#### Object Conversion Execution (Steps 11-20) + +11. **Initiate BW Bridge Shell Conversion** + - In BW Bridge STC01, select source InfoCube/ADSO + - Click **Propose Conversion** (automated pre-check) + - Review conversion proposal report for warnings/errors + - Resolve any compatibility issues identified + +12. **Map Characteristic to Dimension** + - For each characteristic in the InfoCube: + - Determine if it becomes a Dimension or dimension column + - Link to master data objects if they exist + - Configure hierarchy support (if needed) + - Validate attribute inheritance + +13. 
**Map Key Figure to Measure** + - Define aggregation type: + - SUM (default) → Standard measure + - MIN/MAX → Use in analytic models + - NONE → Dimension-like field + - Set decimal places and currency/unit links + - Document any formula-based KFGs requiring recalculation + +14. **Configure Advanced Mappings** + - Map navigational attributes to dimension attributes + - Convert restricted KFGs to calculated measures or materialized tables + - Handle time characteristics (calendar years, months) + - Map custom hierarchies to Datasphere hierarchies + +15. **Validate Shell Conversion Preview** + - Generate conversion impact report + - Review object dependencies and impact analysis + - Identify unsupported objects for manual rebuild + - Validate field mappings and data type compatibility + +16. **Execute Conversion** + - Trigger shell conversion in STC01 (`Execute Conversion` button) + - Monitor conversion logs for errors: `TL1-001`, `TL1-002`, etc. + - Conversion typically completes in 15-60 minutes depending on size + - Verify generated Datasphere objects in repository + +17. **Post-Conversion Object Validation** + - Verify Datasphere Analytical Dataset created with correct structure + - Check dimension and measure count matches original + - Validate field types: Integer, Decimal, String, Date + - Test hierarchy loading if hierarchies were converted + - Confirm object is marked as `Readable` in metadata + +18. **Data Transfer Execution** + - Create data transfer task from BW Bridge to native Datasphere + - Initial full load: Extract all historical data + - Validate row counts match source InfoCube + - Implement incremental load for ongoing updates + - Test data reconciliation: Source vs. Target row counts + +19. 
**Query and View Conversion** + - Convert dependent BW queries to Datasphere Analytical Models + - Rewrite query formulas (not all BW calculations directly portable) + - Test query performance (runtimes typically improve 3-10x) + - Validate query results against source system + +20. **Sign-Off and Documentation** + - Business owner approves converted object + - Document conversion details: Object ID, conversion time, data volume + - Update migration registry with conversion date + - Archive pre-conversion metadata for audit trail + +### Handling Unsupported Objects + +Some BW constructs cannot be converted automatically: + +| Unsupported Feature | Workaround | +|---|---| +| Account-Based Modeling | Rebuild as calculated measure in Analytical Model | +| Complex ABAP BAdIs in Load Process | Replicate logic in Datasphere Transformation rules | +| Time-Dependent Hierarchies | Use versioned Datasphere hierarchies with effective dates | +| Cell-Level Security (CLS) | Map to Data Access Controls (DACs) with column filtering | +| Query-Level Cascading | Model in Datasphere as dimensional filters | + +### Shell Conversion Common Issues and Resolutions + +| Issue | Root Cause | Resolution | +|---|---|---| +| Characteristic not converting | Non-standard attributes | Add missing attributes before conversion; retry | +| Data transfer timeout | Network latency | Increase timeout; split into smaller batches; check bandwidth | +| Hierarchies not loading | Circular hierarchy logic | Review and fix in source; convert simplified hierarchy | +| Permission errors on Bridge | User lacks Bridge access | Add user to BW Bridge roles in Datasphere | + +--- + +## Phase 3: Remote Conversion Workflow + +Remote Conversion is an alternative for complex scenarios where Shell Conversion is insufficient. 
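+
+A Remote Conversion typically starts from a federated (virtualized) view over the remote BW provider, with every column mapped and cast explicitly. The sketch below is conceptual only: in Datasphere the remote table itself is defined through the connection UI, and the schema, object, and column names here are illustrative assumptions, not part of any SAP API:
+
+```sql
+-- Conceptual sketch: explicit column mapping over a remote BW provider.
+-- BRIDGE_REMOTE.BW_SALES_PROVIDER and all column names are hypothetical.
+CREATE VIEW V_REMOTE_SALES AS
+SELECT
+    CAST(COMP_CODE AS VARCHAR(4))  AS COMPANY_CODE,    -- BW CHAR -> SQL VARCHAR
+    CAST(CALMONTH AS VARCHAR(6))   AS CALMONTH,        -- keep YYYYMM as text; derive dates downstream
+    CAST(AMOUNT AS DECIMAL(15,2))  AS AMOUNT,          -- normalize BW numeric types
+    CURRENCY,
+    'BW_BRIDGE'                    AS _SOURCE_SYSTEM,  -- audit column
+    CURRENT_TIMESTAMP              AS _LOAD_TIMESTAMP  -- audit column
+FROM BRIDGE_REMOTE.BW_SALES_PROVIDER;
+```
+
+The numbered steps that follow refine this pattern with dedicated transformations and staging tables.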
+ +### When to Use Remote Conversion + +- Composite Providers with heavy custom logic +- Write-Interface ADSOs with complex load processes +- Objects with extensive ABAP enhancements +- Scenarios requiring gradual cutover during transition +- Complex aggregation requirements + +### Remote Conversion Architecture + +``` +┌────────────────────────────────────────┐ +│ Source BW/4HANA System (Remote) │ +│ - InfoCubes, ADSOs, Queries │ +│ - Process Chains, Hierarchies │ +└────────────────┬───────────────────────┘ + │ (Direct connection) + │ +┌────────────────▼───────────────────────┐ +│ Datasphere Remote Conversion │ +│ - BW Bridge (embedded read-only) │ +│ - Custom transformation logic │ +│ - Remote data federation │ +└────────────────┬───────────────────────┘ + │ (Virtualization & ETL) + │ +┌────────────────▼───────────────────────┐ +│ Native Datasphere Objects │ +│ - Dimensions, Fact Tables │ +│ - Analytical Models, Views │ +└────────────────────────────────────────┘ +``` + +### Remote Conversion Step-by-Step + +1. **Establish Remote BW Connection** + - Set up read-only connection to source BW/4HANA + - Validate network connectivity and firewall rules + - Create dedicated RFC destination for Datasphere + +2. **Create Remote View in Datasphere** + - Create External Data Source pointing to BW remote query/provider + - Define column mapping explicitly (no automatic detection) + - Test remote table preview to validate connection + +3. **Implement Custom Transformation** + - Build Datasphere Transformation to apply business logic + - Replicate ABAP BAdI logic if custom enhancements exist + - Handle data type conversions (BW data types → SQL types) + - Implement unit and currency handling + +4. **Configure Staging Tables** + - Create intermediate tables for transformation staging + - Implement error handling for failed transformations + - Add audit columns: `_LOAD_ID`, `_LOAD_TIMESTAMP`, `_SOURCE_SYSTEM` + - Partition tables by loading period for performance + +5. 
**Establish Data Replication Schedule** + - Full load: Initial data population (typically 1-2x monthly) + - Incremental load: Delta from last run (daily or hourly) + - Monitor load performance and adjust timing + - Implement automated failure notifications + +6. **Testing and Validation** + - Reconcile row counts: Source vs. Datasphere + - Validate key metrics match (sum of measures by dimension) + - Compare sample query results: BW vs. Datasphere + - Document any data discrepancies and rationale + +7. **Operationalize Remote Conversion** + - Configure automated scheduling in Task Chains + - Implement monitoring and alerting for load failures + - Document runbook for manual recovery procedures + - Plan cutover to native Datasphere (if applicable) + +--- + +## Phase 4: Bridge-Specific Modeling + +### DataStore Objects (ADSOs) in Datasphere + +ADSOs in BW Bridge have special properties not found in standard tables. Understanding Bridge ADSO variants is critical. + +#### ADSO Type Comparison + +| Property | Standard ADSO | Inventory ADSO | Write-Interface ADSO | +|---|---|---|---| +| **Purpose** | Transactional staging | Periodic snapshots | Dimension/Master data | +| **Updates** | Full/Delta loads | Append-only (inventory) | Record management | +| **Activation** | Required after load | Incremental activation | Direct write capability | +| **Query-Ready** | Not directly queryable | Can be queried | Queryable immediately | +| **In Datasphere** | Standard Table | Fact Table (time-series) | Dimension Table | +| **Conversion Effort** | Low | Medium | High | + +#### Standard ADSO Conversion + +Standard ADSOs convert directly to Datasphere tables: + +```yaml +BW Bridge ADSO: + Name: 2_SALES_STAGING + Key Fields: + - COMPANY_CODE + - SALES_ORG + - PERIOD + Data Fields: + - AMOUNT + - QUANTITY + - CURRENCY + +Converted to Datasphere Table: + Name: T_SALES_STAGING + Columns: + - COMPANY_CODE (String, Key) + - SALES_ORG (String, Key) + - PERIOD (Date, Key) + - AMOUNT (Decimal) 
+ - QUANTITY (Integer) + - CURRENCY (String) + Indexes: + - Primary Key (COMPANY_CODE, SALES_ORG, PERIOD) + - Index on PERIOD for time-series queries +``` + +#### Inventory ADSO Conversion + +Inventory ADSOs maintain periodic snapshots. In Datasphere, these become fact tables with explicit time-dimensioning: + +```sql +-- Inventory ADSO Structure in BW Bridge +-- Tracks balances as of specific dates +SELECT + POSTING_DATE, + PRODUCT_ID, + WAREHOUSE_ID, + OPENING_QUANTITY, + INBOUND_QUANTITY, + OUTBOUND_QUANTITY, + CLOSING_QUANTITY, + LAST_TRANSACTION_ID +FROM 3_INV_ADSO +WHERE POSTING_DATE >= '2024-01-01' +ORDER BY POSTING_DATE, PRODUCT_ID; + +-- Equivalent Datasphere Fact Table +CREATE TABLE F_INVENTORY_SNAPSHOT ( + POSTING_DATE DATE NOT NULL, + PRODUCT_ID VARCHAR(10) NOT NULL, + WAREHOUSE_ID VARCHAR(10) NOT NULL, + OPENING_QUANTITY DECIMAL(15,2), + INBOUND_QUANTITY DECIMAL(15,2), + OUTBOUND_QUANTITY DECIMAL(15,2), + CLOSING_QUANTITY DECIMAL(15,2), + LAST_TRANSACTION_ID VARCHAR(20), + _LOAD_DATE DATE NOT NULL DEFAULT CURRENT_DATE, + PRIMARY KEY (POSTING_DATE, PRODUCT_ID, WAREHOUSE_ID) +); + +-- Data Access Control for inventory snapshots +-- Only allow users to see snapshots <= today +CREATE DATA ACCESS CONTROL DAC_INVENTORY_CURRENT +FOR TABLE F_INVENTORY_SNAPSHOT +FILTER BY + POSTING_DATE <= CURRENT_DATE +AND WAREHOUSE_ID IN (SELECT ASSIGNED_WAREHOUSE FROM T_USER_WAREHOUSE_MAP + WHERE USER_ID = CURRENT_USER); +``` + +**Key Design Pattern for Inventory ADSOs:** +- Use surrogate key (`_LOAD_ID`) to track each snapshot load +- Include effective date range (`EFFECTIVE_FROM`, `EFFECTIVE_TO`) for historical queries +- Implement type-2 slowly-changing dimension for dimension tables +- Use partitioning by `POSTING_DATE` for query performance + +#### Write-Interface ADSO Conversion + +Write-Interface ADSOs enable direct data writing. These are complex in Datasphere as native tables are append-only. 
Implement a dual-table strategy: + +```yaml +# BW Bridge Write-Interface ADSO +# Used for user-maintained master data (e.g., price lists) + +BW Structure: + ADSO: 4_PRICE_MASTER + Key Fields: + - MATERIAL_ID + - CUSTOMER_ID + - VALID_FROM (Date) + Data Fields: + - UNIT_PRICE (Decimal) + - CURRENCY (String) + - LAST_CHANGED (Timestamp) + +Datasphere Dual-Table Approach: + Staging Table (write-enabled): + T_PRICE_MASTER_STAGING + - Used only for ETL/maintenance + - Not exposed to business users + + Production Table (read-only): + T_PRICE_MASTER + - Fact table for queries + - Refreshed nightly from staging + - Enforces historical tracking + + UI Handler (if direct user input required): + - Use Datasphere Data Marketplace app + - Implement custom approval workflow + - Maintain change log in T_PRICE_MASTER_AUDIT +``` + +**SQL Implementation Pattern:** + +```sql +-- Write-Interface ADSO equivalent: Staging table +CREATE TABLE T_PRICE_MASTER_STAGING ( + MATERIAL_ID VARCHAR(18) NOT NULL, + CUSTOMER_ID VARCHAR(10) NOT NULL, + VALID_FROM DATE NOT NULL, + UNIT_PRICE DECIMAL(13,2), + CURRENCY VARCHAR(3), + LAST_CHANGED TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + CHANGED_BY VARCHAR(12) DEFAULT CURRENT_USER, + PRIMARY KEY (MATERIAL_ID, CUSTOMER_ID, VALID_FROM) +); + +-- Production table: Type-2 SCD +CREATE TABLE T_PRICE_MASTER ( + MATERIAL_ID VARCHAR(18) NOT NULL, + CUSTOMER_ID VARCHAR(10) NOT NULL, + VALID_FROM DATE NOT NULL, + VALID_TO DATE, + UNIT_PRICE DECIMAL(13,2), + CURRENCY VARCHAR(3), + IS_CURRENT CHAR(1) DEFAULT 'X', + _LOAD_DATE DATE NOT NULL, + PRIMARY KEY (MATERIAL_ID, CUSTOMER_ID, VALID_FROM) +); + +-- Nightly refresh job pseudo-code +-- Step 1: Mark previous current records as history +UPDATE T_PRICE_MASTER +SET IS_CURRENT = '', VALID_TO = CURRENT_DATE - 1 +WHERE IS_CURRENT = 'X' + AND (MATERIAL_ID, CUSTOMER_ID) IN ( + SELECT DISTINCT MATERIAL_ID, CUSTOMER_ID FROM T_PRICE_MASTER_STAGING + WHERE LAST_CHANGED >= CURRENT_DATE + ); + +-- Step 2: Insert new current records 
+INSERT INTO T_PRICE_MASTER + SELECT MATERIAL_ID, CUSTOMER_ID, VALID_FROM, NULL, UNIT_PRICE, CURRENCY, 'X', CURRENT_DATE + FROM T_PRICE_MASTER_STAGING + WHERE LAST_CHANGED >= CURRENT_DATE; +``` + +--- + +## Phase 5: Process Chain to Task Chain Mapping + +Process Chains in BW are the scheduling and orchestration layer. Datasphere uses Task Chains, which require manual redesign. + +### Architecture: BW Process Chain vs. Datasphere Task Chain + +**BW Process Chain Components:** +- Variant Processor (Variable initialization) +- Data Load (ABAP batch jobs) +- BW Cubes, ADSOs (target objects) +- Process Chain Steps (Sequential execution) +- Check and Decision steps (Conditional logic) +- Wait/Event steps (Time/event-based triggers) + +**Datasphere Task Chain Components:** +- Dimensions and Tables (source objects) +- Transformation rules (data logic) +- Data transfer tasks (load execution) +- Task sequences (dependencies) +- Conditions (IF/THEN logic) +- Schedules (Cron-based timing) + +### Mapping BW Process Chain Elements + +| BW Process Chain | Datasphere Mapping | Design Notes | +|---|---|---| +| Load InfoCube | Data Transfer Task or Transformation | Load from BW Bridge or external source | +| Process Chain Step | Task Sequence | Single task or complex sub-workflow | +| Check variant processor | Variables in Task Chain context | Parameterize tasks dynamically | +| Decision step (IF/THEN) | Conditional execution in Task Chain | Branch based on previous results | +| Parallel branches | Parallel execution flag in Task Chain | Improves performance, requires careful ordering | +| Post-load aggregation | Materialization Task | Pre-calculate aggregates for performance | +| Process chain event | Scheduled Trigger | Create external event-driven tasks if needed | + +### Process Chain to Task Chain Step-by-Step + +#### Step 1: Document Process Chain Structure + +``` +BW Process Chain: Z_DAILY_SALES_LOAD +├── Step 1: Load Sales Master [Transaction: RSBWP_LOAD_DATA, Variant: 
DEFAULT] +├── Step 2: Load Sales Transactions [Transaction: RSBWP_LOAD_DATAX, Variant: DAILY] +├── Step 3: Validate Load (Check) [Min Rec: 1000, Max Rec: 1000000] +├── Step 4a: Decision [IF success THEN → Step 5, ELSE → Step 6] +├── Step 5: Aggregate Sales (PARALLEL) +│ ├── 5a: Aggregate by Product [Time: 15 min] +│ ├── 5b: Aggregate by Region [Time: 12 min] +│ └── 5c: Aggregate by Customer [Time: 10 min] +├── Step 6: Send Email Notification [Email: admin@company.com] +└── Schedule: Daily at 03:00 UTC +``` + +#### Step 2: Design Task Chain Structure + +Create equivalent Task Chain in Datasphere: + +```yaml +Task Chain: TC_DAILY_SALES_LOAD + Description: "Datasphere equivalent of Z_DAILY_SALES_LOAD" + Schedule: CRON "0 3 * * ?" (Daily 03:00 UTC) + + Tasks: + 1. Task: DT_LOAD_SALES_MASTER + Type: Data Transfer + Source: BW Bridge InfoCube 0SALES_MASTER + Target: T_SALES_MASTER + Runtime: ~10 min + ErrorHandling: STOP_ON_ERROR + + 2. Task: DT_LOAD_SALES_TRANSACTIONS + Type: Data Transfer + Source: BW Bridge InfoCube 0SALES_001 + Target: T_SALES_TRANSACTIONS + Runtime: ~25 min + ErrorHandling: STOP_ON_ERROR + Variables: + - LOAD_TYPE = DELTA + - LAST_LOAD_DATE = + + 3. Task: TR_VALIDATE_LOAD + Type: Transformation (Validation Logic) + SQL: | + SELECT COUNT(*) as record_count FROM T_SALES_TRANSACTIONS + WHERE _LOAD_DATE = CURRENT_DATE + Condition: record_count >= 1000 AND record_count <= 1000000 + OnError: LOG_WARNING_AND_CONTINUE + + 4. Task: TC_AGGREGATION_TASKS (Parallel Sub-Task-Chain) + Type: Task Chain (Parallel Execution) + Runtime: ~15 min (parallelized) + + Sub-Tasks: + 4a. Task: TR_AGG_PRODUCT + Type: Transformation + Target: V_SALES_BY_PRODUCT (Materialized View) + + 4b. Task: TR_AGG_REGION + Type: Transformation + Target: V_SALES_BY_REGION + + 4c. Task: TR_AGG_CUSTOMER + Type: Transformation + Target: V_SALES_BY_CUSTOMER + + 5. 
Task: NOTIFY_COMPLETION + Type: Script (Send Email) + On Success: Send email to admin@company.com + Template: "Daily Sales Load Completed Successfully" + + Error Handling: + OnTaskFailure: Retry up to 3 times with 10 min interval + OnChainFailure: Send alert email and stop + Rollback: Not automatic; requires manual intervention +``` + +#### Step 3: Handle Conditional Logic + +BW Process Chains use decision steps; Datasphere Task Chains use conditional task execution: + +```yaml +# BW Decision Step +Decision Step: CHECK_LOAD_SUCCESS + If: Load Step was successful + Then: Continue to Aggregation + If: Load Step failed + Then: Jump to Error Handler + +# Datasphere Equivalent +Task Chain: TC_WITH_DECISION + Tasks: + 1. DT_LOAD_DATA + SuccessCondition: "record_count > 0" + NextTaskOnSuccess: TC_AGGREGATION + NextTaskOnFailure: SEND_ERROR_ALERT + + 2. TC_AGGREGATION + (Conditional execution only if Task 1 succeeds) + + 3. SEND_ERROR_ALERT + (Only executed if Task 1 fails) +``` + +#### Step 4: Handle Variable Processors + +BW Process Chains initialize variables; Datasphere Task Chains use parameters: + +```yaml +# BW Variant Processor +Variant Processor: VAR_SALES_LOAD + Variables: + - LOAD_DATE: today() - 1 + - LOAD_TYPE: DELTA + - COMPANY_CODE: $COMP_CODE (User input) + +# Datasphere Task Chain Parameters +Task Chain: TC_SALES_LOAD + Parameters: + - load_date (DateTime): default = + - load_type (String): default = "DELTA" + - company_code (String): required = true + + Usage in Transformation: + WHERE POSTING_DATE = :load_date + AND LOAD_TYPE = :load_type + AND COMPANY_CODE = :company_code +``` + +#### Step 5: Configure Queued Task Manager (QTM) Runtime + +The Queued Task Manager in Datasphere manages task execution: + +**Task Chain Execution Policy:** + +```yaml +Task Chain: TC_DAILY_SALES_LOAD + Execution: + MaxConcurrentRuns: 1 # Prevent concurrent execution + Timeout: 1 hour # Abort if running > 1 hour + RetryPolicy: + MaxRetries: 3 + RetryInterval: 10 minutes + 
BackoffMultiplier: 1.5 # 10, 15, 22 minutes + + RuntimeQueueing: + Priority: NORMAL # Can be LOW, NORMAL, HIGH + QueueBehavior: FIFO # First-In-First-Out + + Notifications: + OnStart: Log entry + OnSuccess: Email notification + OnFailure: Alert + Log + Email + OnRetry: Log attempt number +``` + +#### Step 6: Map Aggregation and Post-Load Steps + +BW typically includes aggregate table maintenance; Datasphere uses materialized views: + +```sql +-- BW Post-Load Aggregation (in Process Chain) +-- Creates aggregate InfoCubes based on Dimension subsets + +-- Datasphere Equivalent: Materialized View +CREATE MATERIALIZED VIEW V_SALES_BY_PRODUCT AS +SELECT + PRODUCT_ID, + SUM(SALES_AMOUNT) as total_sales, + SUM(QUANTITY) as total_quantity, + COUNT(DISTINCT CUSTOMER_ID) as unique_customers, + CURRENT_DATE as view_date +FROM T_SALES_TRANSACTIONS +WHERE POSTING_DATE >= DATEADD(month, -12, CURRENT_DATE) +GROUP BY PRODUCT_ID; + +-- Refresh Schedule (Task Chain) +Task: REFRESH_MATERIALIZED_VIEWS + Type: Materialization Task + Views: [V_SALES_BY_PRODUCT, V_SALES_BY_REGION, V_SALES_BY_CUSTOMER] + Schedule: Daily at 04:00 UTC + Parallelization: Yes (refresh all views in parallel) +``` + +#### Step 7: Test Task Chain Execution + +Before production deployment: + +1. **Functional Testing**: Execute Task Chain manually, verify all tasks complete +2. **Data Validation**: Compare output with legacy Process Chain results +3. **Performance Testing**: Validate runtime is acceptable (BW baseline +/- 20%) +4. **Error Scenario Testing**: Intentionally fail steps, verify error handling works +5. 
**Schedule Testing**: Run on actual schedule, verify no conflicts with other processes + +### Handling Complex Process Chain Scenarios + +**Parallel Execution in Task Chains:** + +```yaml +# BW Process Chain with parallel branches +Process Chain: Z_COMPLEX_LOAD + ├── Serial: Load Master Data (Step 1) + ├── Parallel Branch 1: Load Regional Data (Steps 2-4) + │ ├── 2: Americas Load + │ ├── 3: EMEA Load + │ └── 4: APAC Load + ├── Parallel Branch 2: Load Reference Data (Steps 5-6) + │ ├── 5: Exchange Rates + │ └── 6: GL Accounts + ├── Sync Point: After all parallel branches complete + └── Serial: Run Aggregations (Step 7) + +# Datasphere Task Chain equivalent +Task Chain: TC_COMPLEX_LOAD + Tasks: + 1. DT_LOAD_MASTER + NextTask: [DT_LOAD_AMERICAS, DT_LOAD_REFDATA] (Parallel) + + 2. DT_LOAD_AMERICAS (Parallel, execution starts after Task 1) + NextTask: TC_REGIONAL_AGGREGATION (on completion) + + 3. DT_LOAD_REFDATA (Parallel, execution starts after Task 1) + NextTask: TC_REFERENCE_LOADING (on completion) + + 4. TC_REGIONAL_AGGREGATION (waits for Task 2) + NextTask: TC_FINAL_AGGREGATION (sync point) + + 5. TC_REFERENCE_LOADING (waits for Task 3) + NextTask: TC_FINAL_AGGREGATION (sync point) + + 6. TC_FINAL_AGGREGATION (sync point: waits for Tasks 4 & 5) + FinalTask: Yes +``` + +--- + +## Phase 6: Data Transfer Patterns + +### Full vs. Incremental Load Strategy + +**Full Load Pattern** (Initial migration): + +```yaml +Full Load Strategy: + Frequency: Once per object + Timing: Off-peak hours (weekend or night) + Data Volume: 100% of historical data + Validation: Row count matching + + Steps: + 1. Extract all records from BW Bridge source + 2. Transform if needed (data type conversions, calculations) + 3. Load into Datasphere target table (TRUNCATE + INSERT) + 4. Validate: Record count, key uniqueness, NOT NULL constraints + 5. 
Confirm completion in migration registry + + Estimated Duration: (Size in GB × 8,000) ÷ Bandwidth in Mbps = seconds, plus processing + Example: 50 GB at 100 Mbps = ~1 hour + 15 min processing = 1.25 hours +``` + +**Incremental Load Pattern** (Ongoing updates): + +```yaml +Incremental Load Strategy: + Frequency: Daily, hourly, or real-time (depending on business need) + Timing: Scheduled after source system completes load + Data Volume: Only changes since last load + Validation: Duplicate detection, referential integrity + + Steps: + 1. Determine load scope: + - Timestamp-based: WHERE _CHANGED_AT >= :last_load_time + - Sequence-based: WHERE _CHANGE_SEQ > :last_load_seq + - Delta indicator: WHERE CHANGE_FLAG = 'X' + + 2. Extract delta records from source + 3. Perform upsert in Datasphere: + - For new records: INSERT + - For updated records: UPDATE (or DELETE+INSERT for immutable tables) + - For deleted records: Soft delete or remove depending on design + + 4. Update watermark: + - Store :last_load_time in metadata table + - Increment :last_load_seq counter + + 5. 
Validate: Row count in delta, uniqueness, referential integrity + + Metadata Tracking: + Table: T_LOAD_WATERMARK + Columns: + - TABLE_NAME: Source table identifier + - LAST_LOAD_TIMESTAMP: When last load occurred + - LAST_LOAD_SEQUENCE: Sequence number if available + - RECORD_COUNT_LOADED: Records in last load + - STATUS: SUCCESS, FAILED, IN_PROGRESS +``` + +### Upsert (Insert + Update) Implementation + +```sql +-- Technique 1: Merge Statement (Preferred) +MERGE INTO T_SALES_TRANSACTIONS tgt +USING T_SALES_TRANSACTIONS_DELTA src +ON (tgt.TRANSACTION_ID = src.TRANSACTION_ID + AND tgt.LINE_ITEM_NUM = src.LINE_ITEM_NUM) +WHEN MATCHED AND src._OPERATION_FLAG = 'U' + THEN UPDATE SET + tgt.AMOUNT = src.AMOUNT, + tgt.QUANTITY = src.QUANTITY, + tgt.LAST_CHANGED = CURRENT_TIMESTAMP, + tgt.CHANGED_BY = CURRENT_USER +WHEN MATCHED AND src._OPERATION_FLAG = 'D' + THEN DELETE +WHEN NOT MATCHED AND src._OPERATION_FLAG IN ('I', 'U') + THEN INSERT (TRANSACTION_ID, LINE_ITEM_NUM, AMOUNT, QUANTITY, LAST_CHANGED, CHANGED_BY) + VALUES (src.TRANSACTION_ID, src.LINE_ITEM_NUM, src.AMOUNT, src.QUANTITY, CURRENT_TIMESTAMP, CURRENT_USER); + +-- Technique 2: DELETE + INSERT (for immutable designs) +-- Delete all records from delta load period +DELETE FROM T_SALES_TRANSACTIONS +WHERE POSTING_DATE = CURRENT_DATE + AND SOURCE_SYSTEM = 'BW_BRIDGE'; + +-- Insert all records from staging area +INSERT INTO T_SALES_TRANSACTIONS +SELECT * FROM T_SALES_TRANSACTIONS_STAGING +WHERE POSTING_DATE = CURRENT_DATE; + +-- Technique 3: Type-2 SCD for dimension tables +-- Mark previous records as expired, insert new versions +UPDATE T_CUSTOMER_DIM +SET IS_CURRENT = '', VALID_TO = CURRENT_DATE - 1 +WHERE IS_CURRENT = 'X' + AND CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM T_CUSTOMER_STAGING); + +INSERT INTO T_CUSTOMER_DIM +SELECT *, 'X' as IS_CURRENT, CURRENT_DATE as VALID_FROM, NULL as VALID_TO +FROM T_CUSTOMER_STAGING; +``` + +### Data Reconciliation + +After transfer, validate data integrity: + +```sql +-- 
Reconciliation Query 1: Row Count Matching +SELECT + 'Row Count Check' as Check_Name, + (SELECT COUNT(*) FROM T_BW_SOURCE) as Source_Count, + (SELECT COUNT(*) FROM T_DATASPHERE_TARGET) as Target_Count, + CASE WHEN (SELECT COUNT(*) FROM T_BW_SOURCE) = + (SELECT COUNT(*) FROM T_DATASPHERE_TARGET) + THEN 'PASS' ELSE 'FAIL' END as Status; + +-- Reconciliation Query 2: Key Uniqueness +-- (Concatenate key parts: multi-column COUNT(DISTINCT ...) is not portable SQL) +SELECT + 'Key Uniqueness' as Check_Name, + CASE WHEN (SELECT COUNT(*) FROM T_DATASPHERE_TARGET) = + (SELECT COUNT(DISTINCT KEY_FIELD_1 || '|' || KEY_FIELD_2) FROM T_DATASPHERE_TARGET) + THEN 'PASS' ELSE 'FAIL' END as Status, + (SELECT COUNT(*) FROM T_DATASPHERE_TARGET) as Total_Records, + (SELECT COUNT(DISTINCT KEY_FIELD_1 || '|' || KEY_FIELD_2) FROM T_DATASPHERE_TARGET) as Unique_Keys; + +-- Reconciliation Query 3: Key Aggregate Matching +SELECT + 'SOURCE' as System_Name, + DIMENSION_FIELD, + COUNT(*) as Row_Count, + SUM(AMOUNT) as Sum_Amount +FROM T_BW_SOURCE +GROUP BY DIMENSION_FIELD +UNION ALL +SELECT + 'TARGET' as System_Name, + DIMENSION_FIELD, + COUNT(*) as Row_Count, + SUM(AMOUNT) as Sum_Amount +FROM T_DATASPHERE_TARGET +GROUP BY DIMENSION_FIELD; + +-- Reconciliation Query 4: Data Quality Checks +SELECT + 'Data Quality' as Check_Category, + 'NULL values in Amount' as Issue, + COUNT(*) as Record_Count +FROM T_DATASPHERE_TARGET +WHERE AMOUNT IS NULL +HAVING COUNT(*) > 0; + +SELECT + 'Data Quality' as Check_Category, + 'Invalid Currency Code' as Issue, + COUNT(*) as Record_Count +FROM T_DATASPHERE_TARGET +WHERE CURRENCY NOT IN ('USD', 'EUR', 'GBP', 'JPY') +HAVING COUNT(*) > 0; +``` + +--- + +## Phase 7: Hybrid Operations + +During migration, BW Bridge and native Datasphere run in parallel. 
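+
+While both stacks serve reports, the nightly reconciliation described in this phase can be automated as a variance query. The view names below are hypothetical; the ±0.01% measure tolerance is the one stated in the reconciliation process:
+
+```sql
+-- Sketch: flag aggregates that drift beyond tolerance between the
+-- BW Bridge copy and the native Datasphere copy of the same report.
+SELECT
+    COALESCE(b.DIMENSION_FIELD, d.DIMENSION_FIELD) AS DIMENSION_FIELD,
+    b.SUM_AMOUNT AS bridge_amount,
+    d.SUM_AMOUNT AS native_amount,
+    CASE
+        WHEN b.SUM_AMOUNT IS NULL OR d.SUM_AMOUNT IS NULL THEN 'INVESTIGATE'
+        WHEN ABS(b.SUM_AMOUNT - d.SUM_AMOUNT)
+             <= 0.0001 * ABS(b.SUM_AMOUNT) THEN 'PASS'   -- ±0.01% tolerance
+        ELSE 'INVESTIGATE'
+    END AS status
+FROM V_BRIDGE_REPORT_AGG b
+FULL OUTER JOIN V_NATIVE_REPORT_AGG d
+    ON b.DIMENSION_FIELD = d.DIMENSION_FIELD;
+```
+
+Rows flagged `INVESTIGATE` feed the root-cause analysis steps below.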
+ +### Parallel Reporting Architecture + +``` +┌──────────────────────────────────────────────────────┐ +│ Unified Reporting Layer │ +│ (BI Tools: SAC, Power BI, Tableau, Looker) │ +└──────────┬──────────────────────────────┬────────────┘ + │ (Dual queries) │ + ┌──────▼──────────┐ ┌──────────▼─────┐ + │ BW Bridge │ │ Native DS │ + │ (Converted │ │ (New objects) │ + │ objects) │ │ │ + └─────────────────┘ └─────────────────┘ + +Migration Phase: +- Phase 1 (Weeks 1-4): Parallel testing + - Reports pull from BOTH sources + - Results reconciled nightly + - Users provide feedback + +- Phase 2 (Weeks 5-6): Gradual cutover + - Critical reports → Native DS + - Secondary reports → Still use Bridge + +- Phase 3 (Week 7+): Complete cutover + - All reports → Native DS + - Bridge decommissioned +``` + +### Reporting Reconciliation During Migration + +```yaml +Report Reconciliation Process: + Daily Task: + 1. Extract report data from BW Bridge version + 2. Extract report data from Datasphere version + 3. Compare key metrics (with tolerance): + - Row counts (tolerance: ±0.1%) + - Sum of key measures (tolerance: ±0.01%) + 4. Flag discrepancies > tolerance + 5. 
Investigate root cause if discrepancies found + + Root Cause Analysis: + - Data not transferred yet + - Different filter logic in conversion + - Rounding or aggregation differences + - Incomplete incremental load + + Resolution: + - Validate conversion logic + - Adjust filters if needed + - Re-run data transfer + - Extend testing period if critical discrepancy +``` + +### Managing User Transition + +``` +Migration Communication Plan: + +Week 1-2: Announcement + - Explain BW → Datasphere transition + - Highlight benefits (speed, cloud-native, lower TCO) + - Share timeline + +Week 3-4: Training + - New Datasphere interface walkthrough + - Differences from BW querying + - Performance expectations + - Support contact info + +Week 5-6: Parallel Reporting + - Users test Datasphere reports + - Report issues to project team + - Provide feedback on accuracy + +Week 7: Cutover Window + - BW Bridge reports become read-only (2-4 hours) + - Verify final data sync + - Switch all reports to Datasphere + - Go-live validation + +Post-Cutover: Support + - Monitor query performance + - Respond to user issues + - Optimize slow queries + - Track cost/performance metrics +``` + +--- + +## Phase 8: Decommissioning the Bridge + +Once migration is complete and users are comfortable with native Datasphere, decommission the Bridge. 
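+
+Before disabling any connections, confirm that nothing still loads from the Bridge. A sketch against the T_LOAD_WATERMARK metadata table from Phase 6; the 30-day window and the `BRIDGE_` prefix (from the object naming convention defined earlier) are assumptions:
+
+```sql
+-- Pre-decommissioning check: any Bridge-sourced load in the last 30 days?
+-- An empty result supports the "no active Task Chains dependent on Bridge" item.
+SELECT
+    TABLE_NAME,
+    LAST_LOAD_TIMESTAMP,
+    RECORD_COUNT_LOADED,
+    STATUS
+FROM T_LOAD_WATERMARK
+WHERE TABLE_NAME LIKE 'BRIDGE_%'
+  AND LAST_LOAD_TIMESTAMP >= DATEADD(day, -30, CURRENT_DATE)
+ORDER BY LAST_LOAD_TIMESTAMP DESC;
+```
+
+Any rows returned point at loads that must be migrated or retired before the final window.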
+ +### Decommissioning Checklist + +``` +Pre-Decommissioning (2 weeks before): +☐ All critical reports migrated to Datasphere +☐ All historical data verified in Datasphere +☐ Users trained and comfortable +☐ Performance baseline established +☐ Backup/archival of BW metadata completed +☐ Legal/compliance sign-off obtained + +Decommissioning (Final window): +☐ Final data sync: BW Bridge → Datasphere +☐ Verify no active Task Chains dependent on Bridge +☐ Disable BW Bridge connections +☐ Archive BW Bridge database +☐ Revoke user access to Bridge +☐ Update documentation: systems inventory, data lineage diagrams + +Post-Decommissioning (30 days): +☐ Monitor for any Bridge connection attempts (should be zero) +☐ Monitor Datasphere performance (confirm no degradation) +☐ Archive BW Bridge infrastructure +☐ Document migration lessons learned +☐ Update training materials +☐ Schedule post-implementation review meeting + +Fallback Procedure (if critical issues): +☐ Keep BW Bridge in read-only mode for 30 days +☐ Maintain daily backup exports from Bridge +☐ Document any queries that fail in Datasphere +☐ Have rollback plan ready (requires Bridge licensing extension) +``` + +--- + +## Common Migration Pitfalls + +### Pitfall 1: Insufficient Data Validation + +**Problem**: Proceed to cutover without reconciling source vs. target data. + +**Prevention**: +- Implement automated reconciliation queries (see Phase 6) +- Require sign-off from business data steward +- Run parallel reports for 2+ weeks minimum +- Document tolerance levels for acceptable variance + +### Pitfall 2: Complex Custom Logic Not Translated + +**Problem**: BW custom calculations (BAdIs, user-exits) not replicated in Datasphere. 
+ +**Prevention**: +- Inventory all custom logic early (Phase 1) +- Document ABAP code and business purpose +- Rebuild logic as Datasphere transformation rules +- Validate calculated values match source + +### Pitfall 3: Process Chain Dependencies Overlooked + +**Problem**: Task Chains fail due to missing dependencies or incorrect sequencing. + +**Prevention**: +- Document all Process Chain step dependencies (Phase 5) +- Test Task Chain execution thoroughly before go-live +- Implement error handling and notifications +- Have runbook for manual recovery + +### Pitfall 4: Performance Regression After Migration + +**Problem**: Datasphere reports run slower than expected. + +**Prevention**: +- Establish BW baseline performance metrics +- Set Datasphere performance targets (goal: 3-10x faster) +- Implement indexes and partitioning +- Monitor query performance continuously +- Optimize slow queries identified in testing + +### Pitfall 5: Authorization Loss During Migration + +**Problem**: Data security lost when converting BW authorizations to DACs. + +**Prevention**: +- Map BW Analysis Authorizations to DACs early (Phase 1) +- Test DAC filtering with diverse user groups +- Implement audit logging for sensitive data access +- Validate user can/cannot see appropriate rows +- Use Data Access Controls (covered in Security Architect skill) + +### Pitfall 6: Incomplete Historical Data Transfer + +**Problem**: Some time periods missing in Datasphere due to load failure. + +**Prevention**: +- Implement data completeness checks by time period +- Run full load verification query: + ```sql + SELECT POSTING_DATE, COUNT(*) FROM T_DATASPHERE_TARGET + GROUP BY POSTING_DATE ORDER BY POSTING_DATE; + ``` +- Identify and re-run failed loads +- Keep detailed load execution log + +### Pitfall 7: Naming Conventions Create Confusion + +**Problem**: Users cannot find converted objects due to renamed InfoCubes/ADSOs. 
+ +**Prevention**: +- Create naming convention mapping document +- Distribute to all users before cutover +- Create synonyms/aliases if tools support them +- Add descriptive metadata/descriptions to objects +- Test report re-pointing before go-live + +--- + +## Migration Runbook Template + +```yaml +Migration Runbook: Z_SALES_MASTER_CUBE + +Object Identification: + Source: Z_SALES_MASTER (InfoCube) + Size: 45 GB + Record Count: 250 million transactions + Key Fields: COMPANY, SALES_ORG, CUSTOMER, POSTING_DATE + +Conversion Approach: Shell Conversion +Complexity Level: Medium +Risk Score: 8/25 (Medium-Low) + +Execution Schedule: + Phase: Wave 3 (Weeks 6-7) + Shell Conversion Window: 2024-03-15 03:00-04:30 UTC (1.5 hours) + Data Transfer: 2024-03-15 04:30-06:00 UTC (1.5 hours) + Validation: 2024-03-15 06:00-08:00 UTC (2 hours) + User Acceptance Testing: 2024-03-15 to 2024-03-20 + Production Cutover: 2024-03-22 03:00 UTC + +Pre-Execution Steps: + ☐ Verify backup of source InfoCube + ☐ Confirm BW Bridge availability + ☐ Disable dependent Process Chains + ☐ Notify users of maintenance window + ☐ Prepare rollback plan + +Execution Steps: + 1. [03:00] Execute Shell Conversion + Command: STC01 → Propose Conversion → Execute + Expected Duration: 30 minutes + Success Criteria: 0 errors, object created in Datasphere + + 2. [03:35] Monitor Conversion Progress + Transaction: SM37 (check batch job logs) + Logs to Check: CONVERSION_Z_SALES_MASTER job + + 3. [04:00] Validate Shell Conversion + Verify: Object exists in Datasphere, field count matches + SQL: SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'T_SALES_MASTER' + + 4. [04:30] Initiate Full Data Load + Task: DT_LOAD_SALES_MASTER + Source: BW Bridge T_SALES_MASTER + Target: Datasphere T_SALES_MASTER + Options: Truncate existing, Full load + + 5. [06:00] Monitor Data Load + Check: Row count progress in both systems + SQL: SELECT (SELECT COUNT(*) FROM T_SALES_MASTER_STAGING) AS staging_count, (SELECT COUNT(*) FROM T_SALES_MASTER) AS target_count FROM DUMMY + + 6.
[06:15] Data Load Completion + Expected: All 250M records transferred + Validation: Row count matching ±0.1% + + 7. [06:30] Run Reconciliation Queries + Query 1: Row count by COMPANY + Query 2: Sum of SALES_AMOUNT by SALES_ORG + Tolerance: ±0.01% + + 8. [07:00] Create Datasphere Views + Views: V_SALES_BY_PRODUCT, V_SALES_BY_REGION + Type: Materialized + Refresh: Daily at 04:00 + + 9. [08:00] Enable Reporting + Update BI tool connections + Point reports to Datasphere objects + Enable read-only access for UAT + +Post-Execution Steps: + ☐ Verify no production queries against BW Bridge source + ☐ Monitor Datasphere CPU/memory utilization + ☐ Respond to UAT user issues + ☐ Update documentation + ☐ Schedule final cutover validation + +Rollback Procedure (if issues detected): + 1. Disable Datasphere connections in BI tools + 2. Re-enable BW Bridge queries (kept as fallback) + 3. Halt incremental load Task Chain + 4. Delete failed Datasphere objects + 5. Investigate root cause + 6. Schedule retry within 48 hours + +Escalation Contacts: + Datasphere Admin: John Smith (john.smith@company.com) + BW Bridge Specialist: Jane Doe (jane.doe@company.com) + Data Quality Lead: Bob Johnson (bob.johnson@company.com) + +Lessons Learned (Post-Cutover): + +``` + +--- + +## MCP Tool References + +This skill integrates with these Claude MCP tools: + +- **search_repository**: Find BW objects, transformations, and Task Chains by keyword +- **get_object_definition**: Retrieve complete object metadata, field definitions, and properties +- **list_repository_objects**: Browse all objects in a space, filter by type, see creation dates +- **get_task_status**: Check Task Chain execution progress, view logs, identify failures + +**Example Usage:** + +``` +Assistant: "Let me search for your existing SALES InfoCubes in Datasphere." 
+Tool: search_repository(pattern: "SALES", object_type: "INFOCUBE") \ No newline at end of file diff --git a/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/references/bw-bridge-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/references/bw-bridge-guide.md new file mode 100644 index 0000000..6eead2f --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-bw-bridge-migration/references/bw-bridge-guide.md @@ -0,0 +1,1236 @@ +# BW Bridge Migration Reference Guide + +## Table of Contents + +1. Shell Conversion Step-by-Step Checklist +2. Remote Conversion Checklist +3. Object Compatibility Matrix +4. ADSO Type Comparison Details +5. Process Chain to Task Chain Mapping Patterns +6. STC01 Task List Error Codes +7. Migration Timeline Template +8. Rollback and Fallback Procedures + +--- + +## 1. Shell Conversion Step-by-Step Checklist + +### Pre-Conversion Phase (Days 1-3) + +**Week Before Conversion:** +- [ ] Schedule conversion window (weekend or off-peak hours) +- [ ] Notify all users of downtime +- [ ] Disable dependent Process Chains +- [ ] Create database backup in source BW system +- [ ] Document current InfoCube structure (field list) +- [ ] Export object metadata to spreadsheet + +**Day Before Conversion:** +- [ ] Run BW consistency check: **RSRV** transaction +- [ ] Fix any inconsistencies identified +- [ ] Verify BW Bridge system is accessible +- [ ] Test Datasphere connection from BW Bridge +- [ ] Confirm user has STC01 execution rights +- [ ] Review conversion task proposal + +**Conversion Day - 2 Hours Before:** +- [ ] Stop all batch jobs accessing the InfoCube +- [ ] Verify no users are running queries +- [ ] Final backup of metadata +- [ ] Confirm support team is available +- [ ] Have rollback plan ready + +### Shell Conversion Execution Phase (Hours 1-2) + +**Step 1: Access STC01 Transaction** (5 minutes) +``` +Path: SAP GUI → /nSTC01 +Authentication: Use BW administrator account +Location: BW
Bridge system (not source BW) +``` + +**Step 2: Select Source Object** (10 minutes) +``` +Transaction: STC01 +Action: View Task List → Select "InfoCubes" or "ADSOs" +Selection: + ☐ Object Name: [e.g., 0SALES_001] + ☐ Verify object type matches expected + ☐ Confirm object is not in use + ☐ Note object size in GBs +``` + +**Step 3: Propose Conversion** (15 minutes) +``` +Action: Click "Propose Conversion" button +System Response: Pre-conversion compatibility scan +Review Proposal Report: + ☐ Number of characteristics identified + ☐ Number of key figures identified + ☐ Warnings (if any) + ☐ Unsupported elements (if any) + +Handling Warnings: + ✓ Non-critical: can proceed + ✓ Unsupported attributes: plan manual rebuild + ✗ Critical: cancel and investigate + +If Critical Issues: + - Fix in source system + - Re-run consistency check + - Attempt conversion again +``` + +**Step 4: Validate Conversion Proposal** (15 minutes) +``` +Review System-Generated Report: + ☐ Field mapping completeness (100%?) + ☐ Data type compatibility (all convertible?) + ☐ Key field identification (correct?) + ☐ Navigational attribute handling (mapped?) + ☐ Hierarchy recognition (detected?) + +Data Type Mapping Validation: + BW Type → Datasphere Type ✓ Verify + ───────────────────────────────────────── + NUMC → VARCHAR [ ] + DEC → DECIMAL [ ] + DATS → DATE [ ] + TIMS → TIME [ ] + CHAR → VARCHAR [ ] + CLNT → VARCHAR [ ] +``` + +**Step 5: Execute Conversion** (30 minutes) +``` +Action: Click "Execute Conversion" button +Confirmation Dialog: "Confirm shell conversion of [OBJECT_NAME]?" + ☐ Confirm by entering object name + ☐ Click OK to proceed + +System Process: + 1. Lock object in BW Bridge (prevent access) + 2. Read object metadata + 3. Generate Datasphere DDL scripts + 4. Create Datasphere tables/dimensions + 5. Create supporting objects (hierarchies, attributes) + 6. Validate object creation + 7. 
Update migration registry + +Expected Duration: 15-45 minutes (depending on size) + +Monitoring: + ☐ Check batch job progress: SM37 (Job name: CONVERSION_xxxxx) + ☐ Monitor temp tablespace usage: DB02 + ☐ If > 90% utilization, contact DBAs +``` + +### Post-Conversion Validation Phase (1-3 Hours) + +**Step 6: Verify Object Creation** (30 minutes) +``` +Location: Datasphere Repository Browser + +Verification Checklist: + ☐ Table exists with correct name + ☐ Primary key fields present + ☐ All characteristics converted to columns + ☐ All key figures converted to measures + ☐ Field count matches original (Count: ___) + ☐ Data types are correct (spot-check 5 fields) + ☐ Object marked as "Readable" + ☐ Metadata populated (description, owner, etc.) + +SQL Validation Query: + SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE + FROM INFORMATION_SCHEMA.COLUMNS + WHERE TABLE_NAME = 'T_[CONVERTED_NAME]' + ORDER BY ORDINAL_POSITION; + + Expected Result: X rows (X = characteristic count + key figure count + system columns) +``` + +**Step 7: Data Transfer Configuration** (45 minutes) +``` +Create Data Transfer Task: + +Task Name: DT_LOAD_[OBJECTNAME]_FULL +Source: + ☐ Connection: BW Bridge + ☐ Source Object: [Original InfoCube/ADSO] + ☐ Load Type: FULL_LOAD + +Target: + ☐ Datasphere Target Table: T_[OBJECTNAME] + ☐ Load Behavior: TRUNCATE_AND_INSERT + ☐ Error Handling: STOP_ON_ERROR + +Scheduling: + ☐ Frequency: One-time (initial load) + ☐ Timing: Off-peak hours +``` + +**Step 8: Execute Full Data Load** (Variable - 30 min to 4 hours) +``` +Action: Manually trigger data transfer task + +Pre-Load Validation: + ☐ Source object accessible + ☐ Target table writable + ☐ Network connectivity confirmed + ☐ Estimated load time communicated to stakeholders + +During Load Monitoring: + Task: DT_LOAD_[OBJECTNAME]_FULL + + Check Progress Every 10 Minutes: + ☐ Records processed: ___ / total + ☐ Data transferred: ___ GB + ☐ Error count: ___ + ☐ Estimated time remaining: ___ + ☐ CPU/Memory 
utilization: ___ / ___ % + +Load Performance Metrics: + Metric Target Actual + ───────────────────────────────────────────────── + Records/Sec > 100K ___ + GB/Sec > 1 GB ___ + Network Utilization < 80% ___ + Datasphere CPU < 75% ___ +``` + +**Step 9: Validate Loaded Data** (60 minutes) +``` +Data Quality Checks (Run in Datasphere): + +1. Row Count Validation + Source Query: + SELECT COUNT(*) FROM [BW_SOURCE_TABLE] + Target Query: + SELECT COUNT(*) FROM T_[OBJECTNAME] + + Expected: Match exactly (or ±0.1% if incremental flagging) + ☐ PASS: ___,___ records in both + ☐ FAIL: Source ___, Target ___ (Variance: __%) + +2. Key Uniqueness Validation + Query: + SELECT COUNT(*), COUNT(DISTINCT [KEY_FIELDS]) + FROM T_[OBJECTNAME] + + Expected: Both counts equal (no duplicate keys) + ☐ PASS: No duplicates detected + ☐ FAIL: ___ duplicate keys found + +3. NOT NULL Constraint Validation + Query: + SELECT COUNT(*) FROM T_[OBJECTNAME] + WHERE [REQUIRED_FIELD] IS NULL + + Expected: 0 records + ☐ PASS: No NULL values + ☐ FAIL: ___ NULL values found in [FIELD] + +4. Measure Aggregation Validation + BW Query: + SELECT SUM([KEY_FIGURE_1]) FROM 0SALES_001 + WHERE [FILTER_1] = 'VALUE' + + Datasphere Query: + SELECT SUM([MEASURE_1]) FROM T_SALES_001 + WHERE [FILTER_1] = 'VALUE' + + Expected: Numeric match (same decimals) + ☐ PASS: Both = ___,___,___.__ + ☐ FAIL: Source __, Target __ (Variance: __%) + +5. Dimension Value Distribution Check + Query: + SELECT DIMENSION_FIELD, COUNT(*) + FROM T_[OBJECTNAME] + GROUP BY DIMENSION_FIELD + ORDER BY COUNT(*) DESC + + Expected: Distribution matches source + ☐ PASS: Top 10 values match source + ☐ FAIL: Top value mismatch - [VALUE] source vs [VALUE] target + +6. Date Range Validation + Query: + SELECT MIN(POSTING_DATE), MAX(POSTING_DATE), COUNT(DISTINCT POSTING_DATE) + FROM T_[OBJECTNAME] + + Expected: Full date range covered (no gaps) + ☐ PASS: Min ____, Max ____, ___ unique dates + ☐ FAIL: Gap detected: ____ + +7. 
Data Type Spot Check (Sample 5 Fields) + ☐ Field [NAME] - Type [VARCHAR] - Sample: '[value]' + ☐ Field [NAME] - Type [DECIMAL] - Sample: [123.45] + ☐ Field [NAME] - Type [DATE] - Sample: [2024-01-15] + ☐ Field [NAME] - Type [INTEGER] - Sample: [999] + ☐ Field [NAME] - Type [VARCHAR] - Sample: '[value]' +``` + +**Step 10: Create Dependent Objects** (30 minutes) +``` +Hierarchies (if applicable): + ☐ Create dimension hierarchies matching BW + ☐ Test hierarchy navigation in Datasphere + ☐ Validate parent-child relationships + ☐ Confirm all nodes present + +Analytical Models: + ☐ Create analytical model on converted table + ☐ Map dimensions and measures + ☐ Configure filters/variables if needed + ☐ Test query execution + +Materialized Views (for aggregate tables): + ☐ Create MV for pre-aggregated reporting + ☐ Define refresh schedule + ☐ Test query performance +``` + +**Step 11: User Acceptance Testing (Days 1-7 Post-Conversion)** +``` +UAT Activities: + +Day 1: System Orientation + ☐ Open sample query in Datasphere + ☐ Review result set (compare with BW) + ☐ Train users on new interface + ☐ Gather initial feedback + +Day 2-3: Functional Testing + ☐ Execute 5-10 key reports from converted object + ☐ Verify results match BW reports + ☐ Test filtering and drill-down + ☐ Validate calculations and aggregations + +Day 4-5: Performance Testing + ☐ Compare query execution time (BW vs. Datasphere) + ☐ Document baseline metrics + ☐ Verify performance meets expectations (goal: 3-10x faster) + ☐ If slower, escalate for optimization + +Day 6-7: User Acceptance + ☐ Collect sign-off from business owner + ☐ Document any issues or enhancement requests + ☐ Archive UAT results + ☐ Schedule production cutover + +UAT Sign-Off Template: + Object: [NAME] + Tester: [NAME] + Date: [DATE] + Result: ☐ APPROVED ☐ APPROVED WITH ISSUES ☐ REJECTED + Issues Found: [List any defects] + Performance: [Baseline vs. 
Datasphere metrics] + Notes: [Additional comments] +``` + +**Step 12: Production Cutover** (4 hours) +``` +Cutover Window: [Start Time] - [End Time] (Recommended: 03:00-07:00 UTC) + +Pre-Cutover (T-30 min): + ☐ Disable BW Bridge connections (read-only mode) + ☐ Verify no active queries + ☐ Final data sync (if incremental load pending) + ☐ Confirm support team on call + +Cutover Execution (T+0): + ☐ Final validation query execution + ☐ Update BI tool connections (point to Datasphere) + ☐ Enable Datasphere object for production use + ☐ Notify users of availability + +Post-Cutover (T+2 hours): + ☐ Monitor query execution + ☐ Check system performance + ☐ Respond to user issues + ☐ Verify no Bridge access attempts + +Post-Cutover (T+24 hours): + ☐ Confirm reports are using Datasphere source + ☐ Gather user feedback + ☐ Document any issues + ☐ Schedule follow-up optimization + +Post-Cutover (T+7 days): + ☐ Decommission BW Bridge connection (archive for 30 days) + ☐ Archive legacy object definition + ☐ Update data lineage documentation + ☐ Complete migration registry entry +``` + +--- + +## 2. 
Remote Conversion Checklist + +**Applicable For:** Composite Providers, Complex Scenarios, Gradual Cutover Requirements + +### Pre-Remote Conversion Phase (Days 1-2) + +- [ ] Assess if Shell Conversion is insufficient +- [ ] Document reasons for Remote Conversion choice +- [ ] Create RFC destination to source BW system +- [ ] Test network connectivity (latency, bandwidth) +- [ ] Identify custom transformation logic needed +- [ ] Plan staging table structure +- [ ] Estimate data volume and transfer time + +### Remote Conversion Setup Phase (Days 3-5) + +- [ ] Create External Data Source (connection to remote BW) +- [ ] Create Remote View in Datasphere +- [ ] Map BW fields to Datasphere columns +- [ ] Test remote table preview +- [ ] Create staging table (intermediate) +- [ ] Build Transformation rules for data logic +- [ ] Implement error handling and logging + +### Data Replication Configuration (Days 6-7) + +- [ ] Define full load strategy +- [ ] Define incremental load strategy +- [ ] Create Task Chain for scheduling +- [ ] Set up monitoring and alerts +- [ ] Test data transfer +- [ ] Validate data quality +- [ ] Document operational procedures + +### Testing and Validation (Days 8-10) + +- [ ] Execute full load test +- [ ] Run incremental load test +- [ ] Validate data reconciliation +- [ ] Test error scenarios +- [ ] Performance test +- [ ] User acceptance test +- [ ] Sign-off from business owner + +--- + +## 3. 
Object Compatibility Matrix + +### InfoCube / Composite Provider Convertibility + +| Component | Convertible | Shell/Remote | Complexity | Notes | +|---|---|---|---|---| +| **Standard Characteristics** | Yes | Shell | Low | Direct mapping to dimension | +| Navigational Attributes | Partial | Shell | Low | Convert if exist; add if missing | +| Time Characteristics | Yes | Shell | Low | Special handling for fiscal periods | +| Number Range | Yes | Shell | Low | Becomes integer/decimal | +| **Standard Key Figures** | Yes | Shell | Low | Direct mapping to measure | +| Cumulated Key Figures | Yes | Shell | Low | Can be calculated in model | +| Restricted Key Figures | Yes | Shell | Medium | Becomes calculated measure | +| Calculated Key Figures | Partial | Both | Medium | May need recoding in Datasphere | +| **Hierarchies** | Yes | Shell | Low | Converted to dimension hierarchies | +| Custom Hierarchies | Yes | Shell | Medium | Requires node master data | +| Time-Dependent Hierarchies | Partial | Remote | High | Use versioned hierarchies | +| **Aggregate Tables** | Yes | Shell | Low | Converted to materialized views | +| Partitioned Aggregates | Yes | Shell | Medium | Maintain partition logic | +| **ABAP Enhancements** | No | Remote | High | Replicate logic in transformation | +| User-Exit Logic | No | Remote | High | Implement in Datasphere | +| BAdI Implementation | No | Remote | High | Convert ABAP to SQL | + +### DataStore Object (ADSO) Convertibility + +| ADSO Type | Convertible | Approach | Complexity | Target Object | +|---|---|---|---|---| +| **Standard ADSO** | Yes | Shell | Low | Fact Table | +| Inventory ADSO | Yes | Shell/Hybrid | Medium | Fact Table (Type-2 SCD) | +| Write-Interface ADSO | Partial | Remote | High | Dual Table (Staging + Production) | +| Hierarchical ADSO | Yes | Shell | Medium | Dimension Table | +| Multi-Dimensional ADSO | Yes | Shell | Medium | Star Schema (fact + dims) | + +### Query and Analytics Convertibility + +| Query Component | 
Convertible | Approach | Notes | +|---|---|---|---| +| **Query Filters** | Yes | Analytical Model | Parameterized filters work | +| Drill-Down Paths | Yes | Analytical Model | Configure hierarchy navigation | +| Calculated Columns | Partial | Analytical Model | Rewrite if complex ABAP | +| Sorting | Yes | Analytical Model | Standard SQL ORDER BY | +| Grouping/Subtotals | Yes | Analytical Model | SQL GROUP BY + WITH ROLLUP | +| Formatting | Partial | BI Tool | Move to reporting layer | +| Exceptions | Partial | Alerting | Use Datasphere alerts/rules | + +--- + +## 4. ADSO Type Comparison Details + +### Standard ADSO + +``` +Purpose: Staging area for incremental data loads +Characteristics: + - Requires activation after load + - Not query-ready until activated + - Optimized for ETL processes + - Supports Insert, Update, Delete + - Supports partitioning by time + +Load Flow: + 1. Extract data from source + 2. Load into ADSO (inbound table) + 3. Activate ADSO (moves to active table) + 4. Available for queries + +Datasphere Equivalent: Standard Fact Table +Design: + CREATE TABLE T_STAGING ( + KEY_FIELD_1 VARCHAR(10) NOT NULL, + KEY_FIELD_2 DATE NOT NULL, + MEASURE_1 DECIMAL(15,2), + MEASURE_2 INTEGER, + _LOAD_DATE DATE, + PRIMARY KEY (KEY_FIELD_1, KEY_FIELD_2) + ); + + -- Separate production table (optional) + CREATE TABLE T_PRODUCTION AS + SELECT * FROM T_STAGING + WHERE _LOAD_DATE = CURRENT_DATE; +``` + +### Inventory ADSO + +``` +Purpose: Track inventory balances at specific points in time +Characteristics: + - Append-only (no updates to historical records) + - Time-dimensioned (POSTING_DATE key field) + - Can be queried directly + - Efficient for balance sheet reporting + - Supports balance carry-forward + +Load Flow: + 1. Extract inventory snapshot as of date X + 2. Append to Inventory ADSO + 3. Query-ready immediately (no activation needed) + 4.
Historical snapshots preserved forever + +Datasphere Equivalent: Fact Table with Type-2 SCD +Design: + CREATE TABLE T_INVENTORY ( + POSTING_DATE DATE NOT NULL, + PRODUCT_ID VARCHAR(18) NOT NULL, + WAREHOUSE_ID VARCHAR(10) NOT NULL, + OPENING_BALANCE DECIMAL(15,2), + RECEIPTS DECIMAL(15,2), + ISSUES DECIMAL(15,2), + CLOSING_BALANCE DECIMAL(15,2), + LAST_MOVEMENT_DATE DATE, + IS_CURRENT CHAR(1) DEFAULT 'X', + _LOAD_DATE DATE, + PRIMARY KEY (POSTING_DATE, PRODUCT_ID, WAREHOUSE_ID) + ); + + -- Query: Get current inventory (as of today) + SELECT + PRODUCT_ID, + WAREHOUSE_ID, + CLOSING_BALANCE, + POSTING_DATE + FROM T_INVENTORY + WHERE POSTING_DATE = CURRENT_DATE + AND IS_CURRENT = 'X'; + + -- Query: Historical inventory (balance as of specific date) + SELECT + PRODUCT_ID, + WAREHOUSE_ID, + CLOSING_BALANCE, + POSTING_DATE + FROM T_INVENTORY + WHERE POSTING_DATE = '2024-01-31' + AND IS_CURRENT = 'X'; +``` + +### Write-Interface ADSO + +``` +Purpose: Master data with direct user write capability +Characteristics: + - Can be written to directly (bypassing ETL) + - Query-ready immediately + - Typically small (thousands of records) + - Examples: Price lists, Discount tables, Exception rules + - Supports versioning/effective dating + +Load Flow: + 1. Users enter/modify data directly + 2. Data available immediately + 3. ETL can also load/update + 4. Queries can access latest version + 5. 
Historical tracking via effective dates + +Datasphere Equivalent: Dual-Table Pattern + +Design Approach 1: Simple Time-Versioned + CREATE TABLE T_PRICE_MASTER ( + MATERIAL_ID VARCHAR(18) NOT NULL, + CUSTOMER_ID VARCHAR(10), + VALID_FROM DATE NOT NULL, + VALID_TO DATE, + UNIT_PRICE DECIMAL(13,2), + CURRENCY VARCHAR(3), + CREATED_BY VARCHAR(12), + CREATED_AT TIMESTAMP, + MODIFIED_BY VARCHAR(12), + MODIFIED_AT TIMESTAMP, + IS_CURRENT CHAR(1) DEFAULT 'X', + PRIMARY KEY (MATERIAL_ID, CUSTOMER_ID, VALID_FROM) + ); + +Design Approach 2: Dual-Table (Staging + Read) + Staging Table (Write-enabled via app): + T_PRICE_MASTER_STAGING + - Used by data entry application + - Direct INSERT/UPDATE operations + - Not exposed to queries + + Production Table (Read-only): + T_PRICE_MASTER + - Refreshed hourly from staging + - Exposed to BI queries + - Maintains history + + Application Layer: + - Users interact with Datasphere Data Marketplace + - Custom approval workflow + - Logs changes to T_PRICE_MASTER_AUDIT + + Refresh Job (hourly): + -- Archive old current records + UPDATE T_PRICE_MASTER + SET IS_CURRENT = '', VALID_TO = CURRENT_DATE + WHERE IS_CURRENT = 'X' + AND (MATERIAL_ID, CUSTOMER_ID) IN ( + SELECT DISTINCT MATERIAL_ID, CUSTOMER_ID + FROM T_PRICE_MASTER_STAGING + WHERE MODIFIED_AT >= ADD_SECONDS(CURRENT_TIMESTAMP, -3600) + ); + + -- Insert new current records + INSERT INTO T_PRICE_MASTER + SELECT MATERIAL_ID, CUSTOMER_ID, VALID_FROM, NULL, UNIT_PRICE, + CURRENCY, CREATED_BY, CREATED_AT, MODIFIED_BY, MODIFIED_AT, + 'X', CURRENT_DATE + FROM T_PRICE_MASTER_STAGING + WHERE MODIFIED_AT >= ADD_SECONDS(CURRENT_TIMESTAMP, -3600); +``` + +--- + +## 5.
Process Chain to Task Chain Mapping Patterns + +### Pattern 1: Sequential Load with Validation + +**BW Process Chain:** +``` +Start + └─ Step 1: Load Master (InfoCube 0CUSTOMER) + └─ Step 2: Load Transactions (InfoCube 0SALES) + └─ Step 3: Check (Min: 100 records, Max: 10M records) + ├─ If Success → Step 4: Aggregate + └─ If Error → Step 5: Send Alert +``` + +**Datasphere Task Chain:** +```yaml +Task Chain: TC_SEQUENTIAL_LOAD_WITH_VALIDATION + Tasks: + 1. DT_LOAD_CUSTOMER + Type: Data Transfer + Source: BW Bridge 0CUSTOMER + Target: T_CUSTOMER + Dependencies: None (Start) + OnError: FAIL_ENTIRE_CHAIN + + 2. DT_LOAD_SALES + Type: Data Transfer + Source: BW Bridge 0SALES + Target: T_SALES + Dependencies: Task 1 Success + OnError: FAIL_ENTIRE_CHAIN + + 3. TR_VALIDATE_LOAD + Type: Transformation + SQL: | + SELECT COUNT(*) as load_count + INTO @load_count + FROM T_SALES + WHERE _LOAD_DATE = CURRENT_DATE; + + IF @load_count < 100 OR @load_count > 10000000 + THEN SIGNAL SQLEXCEPTION; + + Dependencies: Task 2 Success + OnError: GOTO Task 5 + + 4. TR_AGGREGATE_SALES + Type: Transformation + SQL: CREATE MATERIALIZED VIEW V_SALES_AGG AS ... + Dependencies: Task 3 Success + OnError: LOG_WARNING_CONTINUE + + 5. SEND_ALERT + Type: Script + Action: Send email to admin@company.com + Trigger: OnError from Task 3 +``` + +### Pattern 2: Parallel Load with Synchronization Point + +**BW Process Chain:** +``` +Start + ├─ Step 1: Load Regional Data (Americas, EMEA, APAC in parallel) + ├─ Step 2: Load Reference Data (Rates, GL Accounts in parallel) + └─ Sync Point (wait for all branches) + └─ Step 3: Reconcile & Aggregate +``` + +**Datasphere Task Chain:** +```yaml +Task Chain: TC_PARALLEL_WITH_SYNC + Tasks: + # Regional Data Loads (Parallel) + 1. DT_LOAD_AMERICAS + Type: Data Transfer + Runtime: ~30 min + NextTasks: [6] (Sync point) + + 2. DT_LOAD_EMEA + Type: Data Transfer + Runtime: ~25 min + NextTasks: [6] (Sync point) + + 3.
DT_LOAD_APAC + Type: Data Transfer + Runtime: ~20 min + NextTasks: [6] (Sync point) + + # Reference Data Loads (Parallel) + 4. DT_LOAD_EXCHANGE_RATES + Type: Data Transfer + Runtime: ~5 min + NextTasks: [6] (Sync point) + + 5. DT_LOAD_GL_ACCOUNTS + Type: Data Transfer + Runtime: ~10 min + NextTasks: [6] (Sync point) + + # Synchronization Point + 6. SYNC_POINT + Type: Control (wait for tasks 1,2,3,4,5) + WaitForTasks: [1, 2, 3, 4, 5] + Timeout: 120 minutes + NextTasks: [7] + + # Final Aggregation (after sync point) + 7. TR_FINAL_RECONCILE + Type: Transformation + Dependencies: Task 6 (SYNC_POINT) + NextTasks: [END] + + Execution Order: + - Tasks 1, 2, 3, 4, 5 start immediately (parallel) + - Task 6 waits for ALL to complete + - Task 7 starts after Task 6 +``` + +### Pattern 3: Conditional Branching + +**BW Process Chain:** +``` +Start + ├─ Step 1: Load Data + ├─ Step 2: Check (IF record count > 0) + │ ├─ Then → Step 3: Process Normal Path + │ └─ Else → Step 4: Process Empty Load + └─ Step 5: Send Notification +``` + +**Datasphere Task Chain:** +```yaml +Task Chain: TC_CONDITIONAL_BRANCHING + Tasks: + 1. DT_LOAD_DATA + Type: Data Transfer + OutputVariables: + - record_count (query result) + NextTasks: [2] + + 2. TR_CHECK_CONDITION + Type: Transformation + SQL: | + SELECT COUNT(*) as record_count + FROM T_LOADED_DATA + WHERE _LOAD_DATE = CURRENT_DATE + + ConditionCheck: record_count > 0 + NextTaskIfTrue: Task 3 + NextTaskIfFalse: Task 4 + + 3. TR_PROCESS_NORMAL + Type: Transformation (Normal processing logic) + NextTasks: [5] + + 4. TR_PROCESS_EMPTY + Type: Transformation (Handle empty load) + SQL: INSERT INTO T_AUDIT_LOG VALUES ('NO_DATA_LOADED', CURRENT_TIMESTAMP) + NextTasks: [5] + + 5.
TR_SEND_NOTIFICATION + Type: Script + Action: Send email (content depends on path taken) + FinalTask: Yes +``` + +### Pattern 4: Error Handling and Retry + +**BW Process Chain:** +``` +Start + └─ Step 1: Load Data + └─ Step 2: Check (If error, retry) + │ └─ Retry up to 3 times + │ └─ If success → Continue + │ └─ If all retries fail → Error Handler +``` + +**Datasphere Task Chain:** +```yaml +Task Chain: TC_ERROR_HANDLING_RETRY + Tasks: + 1. DT_LOAD_DATA_WITH_RETRY + Type: Data Transfer + ErrorHandling: + RetryPolicy: EXPONENTIAL_BACKOFF + MaxRetries: 3 + InitialRetryDelay: 5 minutes + BackoffMultiplier: 2.0 + + RetrySchedule: + Retry 1: after 5 min + Retry 2: after 10 min + Retry 3: after 20 min + + OnAllRetriesFailed: GOTO Task 2 + + 2. TR_ERROR_HANDLER + Type: Transformation + SQL: | + INSERT INTO T_ERROR_LOG + VALUES ('LOAD_FAILED', 'Max retries exceeded', CURRENT_TIMESTAMP) + + NextTasks: [3] + + 3. SEND_ESCALATION_ALERT + Type: Script + Action: Send alert to ops team + FinalTask: Yes +``` + +--- + +## 6.
STC01 Task List Error Codes and Resolutions + +### Conversion Phase Errors + +| Error Code | Message | Root Cause | Resolution | +|---|---|---|---| +| **TL1-001** | Object not found in BW Bridge | Object doesn't exist or wrong system | Verify object exists; check system connection | +| **TL1-002** | Insufficient authorization | User lacks conversion rights | Grant STC01_ADMIN role to user | +| **TL1-003** | Object already converted | Duplicate conversion attempt | Verify conversion completed; skip if already done | +| **TL1-004** | Unsupported object type | Object type not convertible (e.g., custom query) | Use Remote Conversion or manual rebuild | +| **TL1-005** | Metadata inconsistency detected | Orphaned fields or corrupt definition | Run BW consistency check (RSRV); fix and retry | +| **TL1-006** | Datasphere connection failed | Network issue or auth failure | Test connection; verify credentials; check firewall | +| **TL1-007** | Insufficient Datasphere space | No free storage | Clean up; compress old data; contact DBAs | +| **TL1-008** | Characteristic not recognized | Custom attribute not supported | Document attribute; plan manual handling | +| **TL1-009** | Key figure aggregation unsupported | Formula-based KFG not convertible | Implement as calculated measure in Analytical Model | +| **TL1-010** | Timeout during conversion | Process taking too long | Increase timeout setting; split large objects | + +### Data Transfer Errors + +| Error Code | Message | Root Cause | Resolution | +|---|---|---|---| +| **TL2-001** | Source table not accessible | Table doesn't exist or access denied | Verify source object; check user permissions | +| **TL2-002** | Target table locked | Another load in progress | Wait for completion; check Task Chain status | +| **TL2-003** | Data type mismatch | Field cannot be cast to target type | Adjust transformation; use CAST function | +| **TL2-004** | Primary key violation | Duplicate keys in source | Remove duplicates from source; use 
MERGE instead of INSERT | +| **TL2-005** | Foreign key violation | Referenced record not found | Load parent tables first; check referential integrity | +| **TL2-006** | NULL constraint violation | Required field is NULL | Apply transformation to populate; filter out nulls | +| **TL2-007** | Decimal precision loss | Number has more digits than target | Increase column precision; adjust data | +| **TL2-008** | String too long | VARCHAR exceeds max length | Increase column size; truncate/split data | +| **TL2-009** | Network timeout | Connection lost during transfer | Increase timeout; check bandwidth; retry | +| **TL2-010** | Disk space full | Target system out of space | Purge old data; add storage; reduce batch size | + +### Post-Load Validation Errors + +| Error Code | Message | Root Cause | Resolution | +|---|---|---|---| +| **TL3-001** | Row count mismatch | Source and target differ | Run data reconciliation; identify missing records | +| **TL3-002** | Aggregation mismatch | Sum of measures differs | Check for data type conversions; verify formulas | +| **TL3-003** | Key uniqueness violation | Duplicate keys found | Remove duplicates from source; investigate cause | +| **TL3-004** | Date range gap | Missing time periods | Re-run full load; check filtering logic | +| **TL3-005** | NULL values unexpected | More NULLs than source | Investigate transformation; check data quality | +| **TL3-006** | Data distribution skewed | Unusual grouping by dimension | Investigate source data; may be correct if expected | +| **TL3-007** | Hierarchy validation failed | Circular or orphaned nodes | Fix hierarchy in source; rebuild manually | +| **TL3-008** | Lookup failure | Referenced dimension record not found | Ensure dimension loaded first; add missing records | + +--- + +## 7. 
Migration Timeline Template + +### 4-Week Migration Timeline + +``` +WEEK 1: Planning & Assessment +┌─────────────────────────────────────────┐ +│ Monday: Kick-off Meeting │ +│ ☐ Define scope & objectives │ +│ ☐ Review timeline │ +│ ☐ Assign team roles │ +│ │ +│ Tuesday-Wednesday: Object Inventory │ +│ ☐ Export STC01 task list │ +│ ☐ Document BW object metadata │ +│ ☐ Assess complexity for each object │ +│ │ +│ Thursday: Compatibility Assessment │ +│ ☐ Classify objects: High/Med/Low risk │ +│ ☐ Identify unsupported features │ +│ ☐ Create conversion roadmap │ +│ │ +│ Friday: Infrastructure Setup │ +│ ☐ Provision Datasphere space │ +│ ☐ Configure BW Bridge access │ +│ ☐ Create user accounts & roles │ +│ │ +│ Deliverable: Migration Plan Document │ +└─────────────────────────────────────────┘ + +WEEK 2: Pilot Conversion (1 object) +┌─────────────────────────────────────────┐ +│ Monday: Pilot Object Selection │ +│ ☐ Choose low-risk test object │ +│ ☐ Size ~1-5 GB (manageable) │ +│ ☐ Some business value (validate value) │ +│ │ +│ Tuesday-Wednesday: Shell Conversion │ +│ ☐ Extract metadata │ +│ ☐ Run compatibility check │ +│ ☐ Execute shell conversion │ +│ ☐ Validate object creation │ +│ │ +│ Thursday: Data Transfer │ +│ ☐ Execute full load │ +│ ☐ Monitor progress │ +│ ☐ Validate row counts │ +│ ☐ Run reconciliation queries │ +│ │ +│ Friday: Testing & Sign-Off │ +│ ☐ Basic UAT testing │ +│ ☐ Performance validation │ +│ ☐ Business owner sign-off │ +│ │ +│ Deliverable: Pilot Conversion Report │ +└─────────────────────────────────────────┘ + +WEEK 3: Wave 1 Conversion (5-10 objects) +┌─────────────────────────────────────────┐ +│ Monday: Wave Planning │ +│ ☐ Prioritize remaining objects │ +│ ☐ Define conversion sequence │ +│ ☐ Create conversion schedule │ +│ │ +│ Tue-Thu: Parallel Conversions │ +│ ☐ Shell convert 5-10 objects (parallel)│ +│ ☐ Load data for each │ +│ ☐ Validate data quality │ +│ ☐ Create analytics models/views │ +│ │ +│ Friday: Consolidation │ +│ ☐ Complete 
UAT for all objects │ +│ ☐ Address issues found │ +│ ☐ Prepare for Wave 2 │ +│ │ +│ Deliverable: Wave 1 Completion Report │ +└─────────────────────────────────────────┘ + +WEEK 4: Wave 2 + Go-Live Prep +┌─────────────────────────────────────────┐ +│ Monday: Wave 2 Start │ +│ ☐ Convert remaining objects │ +│ ☐ Complete data transfers │ +│ │ +│ Tue-Wed: Go-Live Preparation │ +│ ☐ Final data reconciliation │ +│ ☐ Update BI tool connections │ +│ ☐ Create runbooks & documentation │ +│ ☐ Train support team │ +│ │ +│ Thursday: Readiness Check │ +│ ☐ All objects converted ✓ │ +│ ☐ Data validated ✓ │ +│ ☐ Users trained ✓ │ +│ ☐ Runbooks ready ✓ │ +│ ☐ Support prepared ✓ │ +│ │ +│ Friday: Go-Live Day │ +│ ☐ Switch BI tool connections (03:00) │ +│ ☐ Monitor first 2 hours │ +│ ☐ Respond to issues │ +│ ☐ Confirm success │ +│ │ +│ Deliverable: Go-Live Report │ +└─────────────────────────────────────────┘ +``` + +### 8-Week Migration Timeline (Large-Scale) + +``` +WEEK 1-2: Planning & Assessment + ☐ Form steering committee + ☐ Conduct full system inventory + ☐ Risk assessment & complexity scoring + ☐ Resource planning + +WEEK 3: Infrastructure & Pilot + ☐ Provision Datasphere environment + ☐ Configure connections + ☐ Execute pilot conversion + ☐ Validate pilot results + +WEEK 4-5: Wave 1 (High Priority Objects) + ☐ Convert top 20-30% of objects + ☐ Data transfer & validation + ☐ UAT phase 1 + +WEEK 6: Wave 2 (Medium Priority Objects) + ☐ Convert 40-60% of objects + ☐ Parallel testing continues + ☐ Identify optimization opportunities + +WEEK 7: Wave 3 (Remaining Objects) + ☐ Convert final 10-20% of objects + ☐ Complete UAT + ☐ Go-live readiness assessment + +WEEK 8: Go-Live & Stabilization + ☐ Final cutover + ☐ Production monitoring + ☐ Issue resolution + ☐ Decommission BW Bridge +``` + +--- + +## 8. 
Rollback and Fallback Procedures + +### Scenario 1: Shell Conversion Fails (Before Data Transfer) + +**Problem:** Conversion step fails; object not created in Datasphere + +**Rollback Steps:** +1. [ ] Note conversion failure in log +2. [ ] Investigate error code (see Section 6) +3. [ ] Correct root cause in source system +4. [ ] Delete failed object from Datasphere +5. [ ] Retry shell conversion +6. [ ] Document lessons learned + +**Fallback:** Defer object to Wave 2; use Remote Conversion instead + +--- + +### Scenario 2: Data Transfer Fails (Partial Load) + +**Problem:** Data load stops midway; target table has incomplete data + +**Rollback Steps:** +1. [ ] Stop data transfer task immediately +2. [ ] Truncate target table (remove partial data) +3. [ ] Investigate error code (see Section 6) +4. [ ] Fix root cause (network, data quality, etc.) +5. [ ] Retry full data load +6. [ ] Run reconciliation to confirm completeness + +**Prevention:** +- Set batch size to recoverable chunks (100K records) +- Log every 10K records processed +- Use restart checkpoints for large loads + +--- + +### Scenario 3: Data Reconciliation Fails (Row Count Mismatch) + +**Problem:** Source and target row counts don't match; data may have been lost during transfer + +**Investigation:** +1. [ ] Get exact row counts from both systems +2. [ ] Calculate absolute variance: `ABS(Target - Source) / Source * 100` +3. [ ] If variance < 0.1%: Acceptable; document variance +4. 
[ ] If variance >= 0.1%: Investigate + - Check if source was modified during load + - Verify load is complete (check timestamps) + - Look for NULL value filtering + - Check for duplicate key rejections + +**Remediation:** +```sql +-- Find records in source not in target +SELECT * FROM T_BW_SOURCE s +WHERE NOT EXISTS ( + SELECT 1 FROM T_DATASPHERE_TARGET t + WHERE t.KEY_FIELD_1 = s.KEY_FIELD_1 + AND t.KEY_FIELD_2 = s.KEY_FIELD_2 +) +LIMIT 100; -- Review first 100 + +-- If records found: Re-insert missing records +INSERT INTO T_DATASPHERE_TARGET +SELECT * FROM T_BW_SOURCE +WHERE KEY_FIELD_1 IN ( + -- List of missing keys from investigation above +); +``` + +**Fallback:** If unable to reconcile: +- Truncate target; re-run full load +- Extend reconciliation tolerance if variance < 0.5% +- Escalate to data quality team if > 0.5% + +--- + +### Scenario 4: Production Cutover Issues (Post-Go-Live) + +**Problem:** Datasphere queries failing; users cannot access reports + +**Immediate Actions (First 30 minutes):** +1. [ ] Disable Datasphere connections in BI tools +2. [ ] Re-enable BW Bridge as fallback source +3. [ ] Notify users of temporary service degradation +4. [ ] Assemble troubleshooting team +5. [ ] Initiate incident call + +**Investigation (30 minutes - 2 hours):** +- Check Datasphere system status +- Review query logs for errors +- Validate data transfer completion +- Check network connectivity + +**Resolution Examples:** +- **Missing data**: Re-run data transfer, validate upload +- **Slow queries**: Missing indexes; optimize with EXPLAIN PLAN +- **Authorization denied**: User permissions; review DACs +- **Data quality issue**: Investigate source; re-convert object + +**Escalation Criteria:** +- If issue unresolved within 2 hours: Escalate to Datasphere vendor +- If no resolution path identified: Proceed with permanent fallback + +**Permanent Fallback (if unresolved):** +1. [ ] Keep BW Bridge in production (read-only mode) +2. 
[ ] Postpone Datasphere cutover by 1 week +3. [ ] Root cause analysis & fix +4. [ ] Schedule retry + +--- + +### Scenario 5: BW Bridge Decommissioning Blocked + +**Problem:** Cannot shut down BW Bridge; it still has active dependencies + +**Investigation:** +```sql +-- Find active Task Chains using Bridge objects +SELECT TASK_CHAIN_ID, LAST_EXECUTION_DATE +FROM DATASPHERE.TASK_CHAINS +WHERE SOURCE_SYSTEM = 'BW_BRIDGE' +ORDER BY LAST_EXECUTION_DATE DESC; + +-- Find BI tool connections to Bridge +SELECT TOOL_NAME, CONNECTION_STRING, LAST_USED +FROM BI_TOOL_CONNECTIONS +WHERE SYSTEM = 'BW_BRIDGE' +ORDER BY LAST_USED DESC; +``` + +**Remediation:** +1. [ ] Identify all Bridge dependencies +2. [ ] Migrate each to its Datasphere equivalent +3. [ ] Update BI tool connections +4. [ ] Disable Bridge access for 30 days (read-only) +5. [ ] Monitor for orphaned connections +6. [ ] Retry decommissioning + +--- + +### Emergency Rollback Procedure (Nuclear Option) + +**Use ONLY in the event of critical data corruption or a security breach:** + +``` +Trigger Conditions: + ☐ Multiple data integrity issues affecting > 10% of records + ☐ Security breach with customer PII exposure + ☐ Complete system failure without recovery option + ☐ Executive decision to halt migration + +Execution: + 1. [T+0] Activate incident response team + 2. [T+15] Notify all executives & legal + 3. [T+30] Take Datasphere objects offline + 4. [T+45] Full restoration from BW Bridge backups + 5. [T+60] Validate data integrity + 6. [T+90] Restore user access to BW Bridge + 7. [T+2h] Root cause analysis begins + 8. 
[T+24h] Communication to users + +Communication: + - "Migration paused due to [REASON]" + - "BW reports temporarily restored" + - "Datasphere access temporarily unavailable" + - "Full investigation underway" + +Recovery Timeline: + - Week 1: Root cause analysis + - Week 2: Fix root cause + - Week 3-4: Retry migration with corrections +``` + +--- + +### Fallback Maintenance (Post-Fallback) + +If you execute a fallback: + +**Weekly Monitoring:** +- [ ] Monitor BW Bridge CPU/memory usage +- [ ] Check for any Bridge connection errors +- [ ] Validate data freshness + +**Monthly Reviews:** +- [ ] Analyze root cause in detail +- [ ] Identify prevention measures +- [ ] Update runbooks + +**Retry Planning:** +- Schedule retry within 2-4 weeks +- Communicate new target date to users +- Apply lessons learned from fallback +- Execute comprehensive retesting before go-live + +--- + +End of Reference Guide + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/SKILL.md new file mode 100644 index 0000000..3c2c30f --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/SKILL.md @@ -0,0 +1,1343 @@ +--- +name: Catalog Steward +description: Master your Datasphere catalog governance—enrich metadata, manage glossaries, define KPIs, control tags, and analyze impact. Use this when you need to improve data discoverability, ensure consistent terminology, prevent governance chaos, validate KPI definitions, or assess change impacts before modifications. Triggers include "organize our catalog," "standardize metadata," "define business glossaries," "link our KPIs," "what breaks if we change this," and "which models use this table." +--- + +# Catalog Steward Skill + +## Overview + +The Catalog Steward skill empowers you to take control of your SAP Datasphere's internal data governance. 
This skill focuses on **enriching metadata, managing business glossaries, defining KPIs, controlling tag taxonomies, and performing lineage-based impact analysis**—all essential for enabling self-service analytics and preventing governance chaos. + +Unlike the Data Product Publisher skill (which publishes external marketplace products), the Catalog Steward skill is about making your *internal* Datasphere repository discoverable, understandable, and trustworthy. When users search your catalog, they should find well-named assets with clear descriptions, consistent business terminology, quality metrics, and transparent lineage. + +### Why Catalog Governance Matters + +- **Self-Service Analytics:** Business users can find and trust data without submitting tickets +- **Compliance & Auditability:** Clear lineage and ownership trails support regulatory requirements +- **Impact Analysis:** Understand change ripple effects before modifying critical assets +- **Terminology Alignment:** Glossaries ensure "Revenue" means the same thing across teams +- **Data Quality Transparency:** Quality scores help users select the right datasets +- **Governance at Scale:** Consistent metadata reduces technical debt and tribal knowledge + +--- + +## Core Workflows + +### 1. Metadata Enrichment + +Metadata enrichment transforms technical asset names and sparse descriptions into discoverable, business-friendly documentation. + +#### Workflow: Analyze and Suggest Business-Friendly Names + +**When to use:** During onboarding, after importing source system tables, or during catalog cleanup sprints. + +**Steps:** + +1. **Search for undernamed assets:** + - Use `search_catalog` to find tables/views with missing or cryptic names (e.g., "T_SALES_001") + - Filter by asset type (Dimension, Fact, View, Model) + - Identify candidates for enrichment + +2. 
**Analyze content with column inspection:** + - Use `get_asset_details` to inspect table/view structure + - Review key columns to infer business meaning + - Identify primary dimensions and measures + - Example: "T_SALES_001" contains `CUST_ID`, `ORDER_DT`, `AMOUNT` → suggests "Customer Orders Fact" + +3. **Suggest and apply business names:** + - Map technical names to business-friendly alternatives + - Follow naming conventions (see references for templates) + - Apply updated names via catalog metadata endpoints + - Document rationale in internal notes + +**Best Practices:** + +- Include plural nouns for fact tables, singular for dimensions +- Use business domain terminology (not IT jargon) +- Avoid ambiguity: "Sales" → "Monthly Sales Orders" or "Daily Sales Revenue" +- Create a naming convention document and version it + +#### Workflow: Write Meaningful Descriptions + +**When to use:** When onboarding new users, before publishing catalog assets, or during quality audits. + +**Steps:** + +1. **Gather context:** + - Use `get_asset_details` to extract technical metadata + - Review related objects (upstream sources, downstream consumers) + - Identify responsible team or owner + +2. **Write descriptions following a template:** + - **What:** One-sentence summary of what the asset contains + - **Why:** Business purpose or use case + - **Key columns:** 2-3 most important dimensions/measures + - **Refresh frequency:** How often is it updated + - **Caveats:** Data quality issues, exclusions, or limitations + - Example template (see references) + +3. **Link to upstream sources:** + - Document source systems or parent tables + - Use `get_object_definition` to trace lineage + - Include transformation logic (if relevant) + +4. 
**Review and version:** + - Have data owner approve descriptions + - Track description changes in catalog versioning + +**Best Practices:** + +- Keep descriptions under 500 words; link to detailed documentation elsewhere +- Use plain language; assume audience is business analyst (not DBA) +- Include examples of typical queries or use cases +- Flag experimental or deprecated assets clearly +- Update descriptions when business meaning changes (not just when data structure changes) + +#### Workflow: Auto-Suggest Tags Based on Content Analysis + +**When to use:** During bulk catalog onboarding or when implementing a new tag taxonomy. + +**Steps:** + +1. **Analyze asset content:** + - Use `get_asset_details` to inspect column names, types, and distributions + - Use `analyze_column_distribution` to understand data characteristics + - Identify data types (financial, HR, product, customer, operational) + - Detect common patterns (dates, IDs, amounts) + +2. **Match against tag taxonomy:** + - Map identified characteristics to your tag taxonomy (see references) + - Example: columns contain "SALARY", "EMPLOYEE_ID" → suggest tags: `hr`, `sensitive`, `employee-master` + +3. **Propose tags with confidence scoring:** + - High confidence: tags match multiple column patterns + - Medium confidence: tags match domain or naming conventions + - Low confidence: tags are contextual or require human review + +4. 
**Review and apply:** + - Present suggestions with reasoning + - Allow manual override for edge cases + - Batch-apply approved tags + +**Best Practices:** + +- Use a controlled vocabulary (see tag taxonomy in references) +- Combine multiple tag types (domain, sensitivity, cadence, owner) +- Review auto-suggestions; don't apply blindly +- Document why assets receive specific tags +- Update tags when asset usage patterns change + +#### Workflow: Bulk Metadata Updates Across Multiple Assets + +**When to use:** After organizational changes, standardization initiatives, or when implementing governance policies. + +**Steps:** + +1. **Identify batch scope:** + - Use `list_catalog_assets` to find assets matching criteria (e.g., all tables from a source system, all models owned by a team) + - Validate that batch scope is correct (test with small sample first) + +2. **Define update template:** + - Standardize naming patterns, tags, descriptions, or ownership + - Create template for changes (see references) + - Document change rationale and approval + +3. **Execute updates in phases:** + - Phase 1: Apply changes to test/sandbox catalogs + - Phase 2: Validate against downstream consumers using lineage + - Phase 3: Apply to production with versioning + - Phase 4: Communicate changes to users + +4. **Track and audit changes:** + - Log all bulk changes with timestamp, author, and reason + - Enable catalog versioning to support rollback if needed + - Notify affected teams of changes + +**Best Practices:** + +- Always test bulk updates on a sample first +- Use lineage analysis to identify downstream impacts +- Batch updates by logical group (not random collections) +- Communicate timing and rationale to stakeholders +- Provide before/after comparisons for major changes + +--- + +### 2. Glossary Term Management + +A business glossary is the "source of truth" for terminology. It ensures that "Gross Margin," "EBITDA," and "Market Share" mean the same thing across all teams. 
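The "single source of truth" idea above can be sketched as a small lookup structure: every approved synonym resolves to one canonical term and one definition. This is a minimal illustration only, not part of the Datasphere catalog API; the class names, example term, and synonyms are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    """One approved business term (field names are illustrative)."""
    name: str                                    # canonical business name
    definition: str                              # plain-language meaning
    owner: str                                   # accountable domain owner
    synonyms: list = field(default_factory=list) # approved alternative names

class Glossary:
    """Resolves any approved synonym to the single canonical definition."""

    def __init__(self):
        self._terms = {}    # canonical key (lowercased) -> GlossaryTerm
        self._aliases = {}  # synonym key (lowercased) -> canonical key

    def add(self, term):
        key = term.name.lower()
        self._terms[key] = term
        for synonym in term.synonyms:
            self._aliases[synonym.lower()] = key

    def resolve(self, name):
        # Map a synonym to its canonical key, then look up the term;
        # returns None for unknown terminology.
        key = name.lower()
        return self._terms.get(self._aliases.get(key, key))

glossary = Glossary()
glossary.add(GlossaryTerm(
    name="Gross Margin",
    definition="Revenue minus cost of goods sold, divided by revenue.",
    owner="Finance domain owner",
    synonyms=["Gross Profit Margin", "GM %"],
))

# A lookup by synonym and a lookup by canonical name return the same term.
assert glossary.resolve("GM %") is glossary.resolve("Gross Margin")
```

A case-insensitive alias map like this also makes terminology conflicts detectable: if two teams try to register the same synonym under different canonical terms, the collision surfaces at registration time rather than in a dashboard dispute.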
+ +#### Workflow: Create and Maintain a Business Glossary + +**When to use:** At governance program launch, when onboarding new business domains, or when terminology conflicts arise. + +**Steps:** + +1. **Identify core business concepts:** + - Interview business owners and analysts + - Review existing reports, dashboards, and analysis + - Document terms with multiple definitions (conflicts to resolve) + - Prioritize high-impact terms (used in multiple models, KPIs, or reports) + +2. **Create glossary term entries:** + - Use the glossary term template (see references) + - Define each term with business meaning, not technical definition + - Include approved synonyms and related terms + - Document calculation methodology (for metrics) + - Assign owner and approval authority + - Set version and last-reviewed date + +3. **Build glossary hierarchy:** + - Group terms by business domain (Sales, Finance, HR, Operations) + - Create parent-child relationships (e.g., "Revenue" → "Product Revenue", "Service Revenue") + - Link related terms (see section on glossary relationships) + +4. **Enable feedback and evolution:** + - Publish draft glossary and collect feedback from stakeholders + - Review conflicts and make approval decisions + - Version published glossary (v1.0, v1.1, etc.) + - Schedule annual reviews with business owners + +**Best Practices:** + +- Start with 20-30 highest-impact terms, not the entire organization +- Involve business owners, not just IT, in definition +- Make glossary searchable and always discoverable (don't hide in PDFs) +- Include usage examples and anti-examples (what it is NOT) +- Document historical changes (why did definition change?) +- Link to actual data implementations (models, measures) + +#### Workflow: Link Glossary Terms to Technical Assets + +**When to use:** After glossary terms are approved, during model development, or during metadata enrichment sprints. + +**Steps:** + +1. 
**Identify linking opportunities:** + - Use `search_catalog` to find assets matching glossary terms + - Example: search for "revenue" → find all views, models, measures with revenue-related logic + - Use `get_asset_details` to inspect calculated fields and measures + +2. **Create term-to-asset mappings:** + - Link glossary term "Revenue" to measure `TOTAL_REVENUE` in model `Sales_Summary` + - Document how technical asset implements the glossary definition + - Capture calculation logic or transformation rules + - Note any deviations or approximations + +3. **Enable bidirectional navigation:** + - Users viewing glossary term should see which assets implement it + - Users viewing assets should see which glossary terms apply + - Create cross-reference views or dashboards + +4. **Validate consistency:** + - Check that all uses of the term apply the same definition + - Flag deviations or variant calculations + - Schedule reviews when definitions or implementations change + +**Best Practices:** + +- One glossary term can map to multiple technical assets (same concept, different contexts) +- Document if an asset implements the term exactly or is an approximation +- Include transformation rules or calculation logic in the mapping +- Update mappings when either glossary terms or asset definitions change +- Use mappings to detect duplicate or conflicting implementations + +#### Workflow: Term Approval Workflows + +**When to use:** When implementing formal governance, during terminology disputes, or when adding new glossary terms. + +**Steps:** + +1. **Define approval roles:** + - **Proposer:** Business analyst or data owner + - **Domain Owner:** Accountable for terms in their domain (Sales, Finance, etc.) + - **Governance Lead:** Final approval authority + - Use RACI matrix (see references) to clarify roles + +2. 
**Create proposal-to-approval workflow:** + - Proposer submits term with definition, calculation, and rationale + - Domain owner reviews for alignment with business standards + - Governance lead checks for conflicts, clarity, and compliance + - Feedback is provided; proposer revises if needed + - Final approval records who, when, and rationale + +3. **Track approval status:** + - Status: Draft → Proposed → Approved → Published + - Escalation path for disputes (which executive resolves conflicts?) + - SLA for reviews (e.g., 5 business days) + +4. **Manage versioning:** + - When a term definition changes, trigger re-approval + - Previous versions remain available (audit trail) + - Notify users when definitions change + +**Best Practices:** + +- Clarify approval authority upfront (don't create bottlenecks) +- Use lightweight workflow for low-risk terms, formal workflow for KPIs or financial terms +- Document why terms were rejected (helps future proposals) +- Include legal or compliance review for regulatory terms +- Set clear SLAs to prevent indefinite reviews + +#### Workflow: Glossary Hierarchies and Relationships + +**When to use:** As glossary grows beyond 30-50 terms, when standardizing across domains, or when implementing enterprise-wide terminology. + +**Steps:** + +1. **Design hierarchical structure:** + - Create top-level categories (Business Domains: Finance, Sales, HR, etc.) + - Create sub-categories (Finance → Revenue, Expenses, Assets) + - Create specific terms (Revenue → Product Revenue, Service Revenue) + - Support 2-3 levels of depth (too deep = hard to navigate) + +2. 
**Define relationship types:** + - **Synonym:** Alternative names for the same concept (e.g., "Gross Profit" = "Gross Margin") + - **Related:** Conceptually connected but distinct (e.g., "Revenue" related to "Cost of Goods Sold") + - **Parent-Child:** Hierarchical containment (e.g., "Revenue" ← "Product Revenue") + - **Derived:** One term calculated from others (e.g., "Profit Margin" derived from "Profit" and "Revenue") + +3. **Build navigation paths:** + - Enable browsing by domain (discover all financial terms) + - Enable searching across domains (find all revenue-related terms) + - Create "Related Terms" suggestions on term detail pages + - Build term dependency maps for KPI validation + +4. **Maintain consistency:** + - Review hierarchies during governance reviews + - Consolidate synonyms and related terms to reduce duplication + - Update relationships when definitions change + +**Best Practices:** + +- Don't create deep trees (3+ levels); use relationships instead +- Document relationship semantics (what makes two terms "related"?) +- Use hierarchies to organize domains, not to create arbitrary classification +- Enable free-text search as primary discovery mechanism +- Use term relationships to detect definition conflicts + +#### Workflow: Ensure Consistent Terminology Across the Organization + +**When to use:** During governance audits, when merging business units, or when enforcing standards. + +**Steps:** + +1. **Audit current terminology:** + - Search catalog for variant names and definitions (e.g., "Revenue", "Sales", "Turnover", "Top Line") + - Interview teams to understand why variants exist + - Use `search_catalog` to find all objects using each variant + - Document conflicts in a consolidation backlog + +2. 
**Resolve conflicts through glossary:** + - For each conflict, create a single approved glossary term + - Declare one variant as canonical; others as synonyms + - Document why this definition was chosen + - Get stakeholder approval before enforcement + +3. **Enforce consistency:** + - Link all variant implementations to approved glossary term + - Update descriptions/names in catalog to use approved terminology + - Add metadata (tags) to identify which implementations are authoritative vs. legacy + - Deprecate non-conforming implementations gradually + +4. **Ongoing audits:** + - Schedule quarterly reviews of new assets for terminology alignment + - Audit popular models/views for consistent term usage + - Include terminology checklist in data product publishing workflow + +**Best Practices:** + +- Enforce consistency gradually (phase out old terms over 6-12 months) +- Document migration path for teams using old terminology +- Recognize that business language evolves; update glossary annually +- Use glossary to enforce standards, not to restrict valid language +- Support common synonyms as alternate search terms + +--- + +### 3. KPI (Key Performance Indicator) Definition + +KPIs translate business objectives into measurable metrics. The catalog ensures KPIs are well-defined, validated against data, and linked to accountability. + +#### Workflow: Define KPIs Within the Catalog + +**When to use:** When launching new strategic initiatives, during business planning cycles, or when formalizing informal metrics. + +**Steps:** + +1. **Gather KPI requirements:** + - Interview executive sponsors and business owners + - Document strategic objective each KPI supports + - Define calculation methodology (detailed formula) + - Identify refresh cadence (daily, weekly, monthly) + - Assign accountability (who owns this KPI?) + - Define target/threshold values + +2. 
**Create KPI definition using template (see references):** + - **Name:** Business-friendly name (e.g., "Customer Lifetime Value") + - **Code:** Unique identifier (e.g., "CLV_001") + - **Strategic Objective:** Which business goal does this KPI support? + - **Definition:** Plain-language description + - **Calculation:** Detailed formula with logic + - **Dimensions:** How is KPI sliced? (by customer segment, region, product, time) + - **Data Sources:** Which tables/models feed this KPI? + - **Owner:** Who is accountable? + - **Review Frequency:** When is this KPI reviewed? + - **Version:** Creation date and change log + +3. **Validate against data landscape:** + - Use `get_asset_details` to inspect source tables + - Use `analyze_column_distribution` to check data availability and quality + - Verify required dimensions/measures exist + - Document any data gaps or approximations + +4. **Publish and socialize:** + - Create KPI detail page in catalog with calculation visible + - Share KPI definition with stakeholders + - Link to dashboards/reports that use this KPI + - Establish governance (who approves changes?) + +**Best Practices:** + +- Keep KPI definitions simple; complexity breeds misunderstanding +- Include examples: "If X happened, would KPI increase or decrease?" +- Document known limitations and caveats (e.g., "excludes international operations") +- Version KPI definitions; don't silently change calculations +- Link KPI to glossary terms for consistency + +#### Workflow: Link KPIs to Underlying Datasets and Measures + +**When to use:** During KPI validation, when optimizing data models, or when documenting lineage. + +**Steps:** + +1. **Map KPI to source measures:** + - Use `get_object_definition` to inspect model structure + - Identify which measures feed each KPI calculation + - Example: KPI "Profit Margin" uses measures `Total_Profit` and `Total_Revenue` + - Document transformation logic (if any) + +2. 
**Trace lineage to source systems:** + - Use `list_catalog_assets` or lineage analysis to trace back to source tables + - Document data flow: Source System → ETL → Model → Measure → KPI + - Identify any data transformations or aggregations + - Document refresh timing at each stage + +3. **Create bidirectional links:** + - KPI detail page shows source measures + - Measure detail page shows which KPIs consume it + - Enable impact analysis: "change this measure → affects these KPIs" + +4. **Validate availability and completeness:** + - Ensure all required source columns exist + - Check that historical data is available for trending + - Verify refresh frequency supports KPI review cycle + - Document any data quality issues in lineage + +**Best Practices:** + +- Map each KPI to its smallest constituent measures (enables reuse) +- Document assumptions in data flow (e.g., "excludes canceled orders") +- Use lineage to identify shared dependencies (optimization opportunities) +- Automate lineage updates when data models change +- Create data dictionaries linking business metrics to technical measures + +#### Workflow: KPI Ownership and Accountability + +**When to use:** During KPI launch, during governance reviews, or when resolving KPI disputes. + +**Steps:** + +1. **Assign clear ownership:** + - **KPI Owner:** Accountable for definition and business interpretation (executive) + - **Data Owner:** Accountable for underlying data quality (data team) + - **Dashboard Owner:** Accountable for reporting infrastructure (BI team) + - Use RACI matrix (see references) to clarify secondary responsibilities + +2. **Document ownership in catalog:** + - Assign owner to KPI definition with contact information + - Create KPI ownership matrix (spreadsheet or dashboard) + - Link KPI to team or department + - Document escalation path for KPI disputes + +3. 
**Enable accountability:** + - Schedule monthly KPI reviews with owners + - Track KPI performance trends + - Document explanations when KPIs miss targets + - Use KPI dashboards to highlight performance issues early + +4. **Rotate and transition ownership:** + - When owner changes roles, assign replacement + - Document transition in KPI versioning + - Provide new owner with calculation documentation and historical context + +**Best Practices:** + +- Assign single accountable owner (not a committee) +- Ensure owner has authority to make decisions about KPI +- Connect KPI ownership to performance management/compensation (creates accountability) +- Review ownership quarterly; update when roles change +- Document succession plan for critical KPI owners + +#### Workflow: KPI Validation + +**When to use:** Before publishing KPIs, during data quality issues, or when results seem suspicious. + +**Steps:** + +1. **Validate calculation logic:** + - Walk through calculation step-by-step + - Check for logic errors (incorrect operators, filters, aggregations) + - Verify dimensional alignment (are dimensions aggregated correctly?) + - Test with known scenarios (e.g., "if all customers had 100 orders, KPI should be X") + +2. **Validate data quality:** + - Use `analyze_column_distribution` on source columns + - Check for missing values, outliers, or data quality issues + - Validate assumptions (e.g., "all dates are in YYYY-MM-DD format") + - Review data freshness: is data current enough for KPI? + +3. **Validate against reality:** + - Compare KPI results to manual calculations (if available) + - Run KPI on historical data; check for expected trends + - Benchmark against external data if available (e.g., compare "Market Share" KPI to published reports) + - Interview business owners: "does this number feel right?" + +4. 
**Document validation results:** + - Create validation report (see references) + - Document any discrepancies and their root causes + - Establish data quality requirements for KPI use + - Define KPI confidence level (trusted, needs monitoring, experimental) + +5. **Set up ongoing monitoring:** + - Create KPI quality dashboard (shows data freshness, completeness, outliers) + - Set up alerts for data quality issues + - Schedule monthly validation checks + - Document changes to source data that might affect KPI + +**Best Practices:** + +- Never publish KPI without validation +- Include data quality caveats in KPI definition +- Validate with business owners, not just data teams +- Document validation assumptions (so others can replicate) +- Schedule re-validation when source data changes significantly + +#### Workflow: KPI Lifecycle Management + +**When to use:** When KPIs become irrelevant, during business strategy reviews, or when merging business units. + +**Steps:** + +1. **Establish KPI lifecycle states:** + - **Proposed:** New KPI being evaluated + - **Active:** Currently tracked and reviewed + - **Monitored:** Less critical but still watched + - **Deprecated:** Phased out or replaced by newer KPI + - **Archived:** Historically important, no longer used + +2. **Transition KPIs through lifecycle:** + - Proposed → Active: After validation and stakeholder approval + - Active → Monitored: When KPI remains relevant but no longer drives key decisions + - Active → Deprecated: When business objective changes or KPI becomes outdated + - Deprecated → Archived: After 6-12 month sunset period + - Document reason and date for each transition + +3. **Manage sunset of deprecated KPIs:** + - Communicate sunset date to stakeholders well in advance + - Identify replacement KPI (if applicable) + - Provide training on new KPI + - Archive old dashboards/reports gradually + - Keep historical data accessible for trend analysis + +4. **Review and refresh KPI portfolio:** + - Conduct annual KPI portfolio review + - Assess each KPI: Still aligned with strategy? 
Still accurate? Still relevant? + - Identify KPIs for deprecation + - Identify new KPIs needed for emerging priorities + +**Best Practices:** + +- Document why KPIs were deprecated (important context for future teams) +- Don't delete KPI definitions; archive them with historical data +- Communicate KPI changes to all stakeholders early +- Link deprecated KPI to replacement (if applicable) +- Review KPI portfolio annually, not ad hoc + +--- + +### 4. Tag Management + +Tags are lightweight metadata that enable discovery and governance. A well-designed tag taxonomy makes the catalog navigable at scale. + +#### Workflow: Design a Tag Taxonomy + +**When to use:** At governance program launch or when current tagging scheme becomes unwieldy. + +**Steps:** + +1. **Define tag categories:** + - **Domain Tags:** Business domain (Finance, Sales, HR, Operations, Product) + - **Sensitivity Tags:** PII, Confidential, Internal, Public + - **Cadence Tags:** Real-time, Daily, Weekly, Monthly, Ad-hoc + - **Owner/Team Tags:** Owned_by_Finance, Owned_by_Sales, etc. + - **Quality Tags:** Certified, Under_Review, Experimental, Legacy + - **Use Case Tags:** KPI, Reporting, Analysis, AI/ML, Regulatory + - See references for detailed taxonomy design patterns + +2. **Create controlled vocabulary:** + - Define each tag with clear definition + - Document when to use each tag (vs. related tags) + - Establish naming convention (lowercase, no spaces, hyphens for compound terms) + - Example: Use `hr-employee-master` not `HR Employee Master` or `hr_emp_master` + +3. **Design hierarchy (if needed):** + - Flat hierarchy: Simple tag list, good for small catalogs (<100 assets) + - Hierarchical: Parent/child relationships, good for large catalogs (>500 assets) + - Example hierarchy: `domain:finance`, `domain:finance:accounting`, `domain:finance:revenue` + +4. 
**Publish and train:** + - Create tag guide with examples + - Train data owners on tagging conventions + - Publish tag definitions in easily searchable location + - Include tag guide in onboarding documentation + +**Best Practices:** + +- Start simple; expand tags over time +- Limit tag count (20-50 active tags); too many defeats discovery +- Use domain category heavily; use other categories sparingly +- Don't use tags for information that should be in descriptions +- Review tags quarterly; consolidate if duplicates emerge +- See references for industry-specific tag examples + +#### Workflow: Apply Tags Consistently Across Assets + +**When to use:** During catalog onboarding, in metadata enrichment sprints, or during quality audits. + +**Steps:** + +1. **Define tagging standards:** + - Which asset types get tagged? (tables, views, models, measures) + - How many tags per asset? (typically 3-5) + - Which tag categories are mandatory? (e.g., domain, sensitivity) + - What approval is needed? (self-service vs. peer review) + +2. **Apply tags systematically:** + - Use `list_catalog_assets` to find untagged or under-tagged assets + - Apply domain tag based on asset purpose + - Apply sensitivity tag based on data content (PII, financial, health, etc.) + - Apply cadence tag based on refresh frequency + - Apply quality tag based on readiness level + +3. **Document tagging decisions:** + - For each asset, document why specific tags were applied + - Include examples in tag definition (to ensure consistent interpretation) + - Review tagging periodically; adjust tags if meanings change + +4. 
**Enable peer review:** + - Have data owner review proposed tags + - Tag changes should be visible in audit trail + - Create dashboard showing tagging coverage (% of assets with required tags) + +**Best Practices:** + +- Tag all new assets before publishing to catalog +- Use tags to enforce governance (e.g., all PII data must have `sensitivity:pii` tag) +- Don't over-tag; too many tags reduce usability +- Keep tag count consistent (don't tag one asset with 2 tags and another with 20) +- Review tags when asset purpose or data changes + +#### Workflow: Tag-Based Search and Discovery + +**When to use:** When enabling self-service analytics or building faceted search interfaces. + +**Steps:** + +1. **Enable tag-based filtering:** + - Use `search_catalog` with tag filters + - Support multi-tag searches (e.g., "show all assets tagged `domain:finance` AND `sensitivity:internal`") + - Support tag hierarchies in search (search for `domain:finance` returns all finance sub-tags) + +2. **Create tag-based browsing:** + - Create tag clouds or faceted navigation in catalog UI + - Enable "related tags" suggestions (if browsing `domain:sales`, suggest `cadence:daily`) + - Show tag frequency (how many assets have each tag?) + +3. **Enable self-service discovery:** + - Train users to search by tag (simpler than writing complex queries) + - Create tag guides for different personas (executives, analysts, engineers) + - Build dashboards that link tags to data discovery metrics + +4. 
**Track usage patterns:** + - Monitor which tags are most searched + - Identify unused tags (candidates for removal) + - Use search analytics to refine tagging strategy + +**Best Practices:** + +- Make tag search as prominent as free-text search +- Enable "did you mean" suggestions for similar tags +- Show tag-based recommendations (users viewing asset with tag X also viewed assets with tag Y) +- Use tag search to validate tagging strategy (if tags aren't searched, reconsider their value) +- Create saved tag searches for common discovery patterns + +#### Workflow: Tag Governance + +**When to use:** When tagging decisions affect multiple teams, when enforcing compliance, or when implementing self-service governance. + +**Steps:** + +1. **Define tagging authority:** + - Who can create new tags? (centralized: only governance team; decentralized: domain teams with approval) + - Who can apply tags? (any asset owner; or only owners of sensitive assets) + - Who can modify or delete tags? (governance team only) + - Document process in governance policy + +2. **Implement approval workflows (if needed):** + - For sensitive tags (PII, Financial) or compliance tags, require approval + - Asset owner proposes tags; governance team reviews and approves + - Approval captures who approved and timestamp + - Rejected tags include feedback for why + +3. **Monitor tagging compliance:** + - Create dashboard showing tagging coverage by domain/team + - Flag assets missing required tags (domain, sensitivity) + - Monthly tagging audit: review high-change assets + - Identify teams with inconsistent tagging practices + +4. 
**Enforce tagging standards:** + - Prevent publishing of assets without required tags + - Create alerts for untagged or mis-tagged sensitive assets + - Include tagging checklist in data product publishing workflow + - Tie tagging compliance to team scorecards (if appropriate) + +**Best Practices:** + +- Keep tagging lightweight; heavy approval workflows reduce adoption +- Tag sensitive assets with higher approval rigor +- Enable bulk tag application (don't require tagging each asset individually) +- Provide clear feedback when tagging violates standards +- Review tagging governance quarterly + +--- + +### 5. Lineage and Impact Analysis + +Understanding data lineage answers critical questions: "Where does this data come from?" "Which models consume this table?" "What breaks if I change this column?" + +#### Workflow: Use Impact and Lineage Analysis to Trace Data Flows + +**When to use:** Before modifying critical tables, during root cause analysis of data issues, or when documenting data flows. + +**Steps:** + +1. **Access lineage tools:** + - Use `get_object_definition` to inspect asset structure + - Use `list_catalog_assets` with lineage context to identify related assets + - Access Datasphere lineage visualization (typically in asset detail page) + - Filter lineage by direction (upstream, downstream, bidirectional) + +2. **Trace upstream lineage:** + - Start from asset of interest (e.g., a model or measure) + - Follow lineage upstream to source tables + - Identify transformations at each step + - Document assumptions and business logic in transformations + - Example: Model_Sales → View_Sales_Orders → Table_ORDERS (SAP source) + +3. **Trace downstream lineage:** + - Start from table or view + - Follow lineage downstream to consuming models, measures, KPIs + - Identify all downstream impacts (critical for change assessment) + - Example: Table_CUSTOMERS → View_Customer_Enriched → Model_Customer_Analytics → KPI_Churn_Rate + +4. 
**Analyze bidirectional flows:** + - Identify circular dependencies (should be rare) + - Find shared data flows (tables consumed by multiple models) + - Identify choke points (tables with many downstream consumers) + - Document data flow bottlenecks + +5. **Document lineage:** + - Create lineage diagram showing major data flows + - Document business logic at each transformation + - Include metadata about refresh timing + - Publish documentation with visual lineage + +**Best Practices:** + +- Lineage should be automatically captured; manually document only complex logic +- Include transformation rationale (not just "what" but "why") +- Document data quality changes through lineage (where does quality degrade?) +- Review lineage when source systems change +- Use lineage to identify optimization opportunities (redundant transformations, etc.) + +#### Workflow: Generate Impact Reports + +**When to use:** Before making changes to critical assets, during root cause analysis, or during governance audits. + +**Steps:** + +1. **Define change scope:** + - Identify specific table, column, model, or measure being changed + - Document nature of change (delete column, rename, change calculation, deprecate table) + - Estimate impact scope (how many downstream assets affected?) + - Assess impact severity (internal tools only vs. customer-facing dashboards) + +2. **Generate impact analysis:** + - Use lineage tools to identify all downstream consumers + - Classify consumers by impact type: + - **Direct:** Assets directly consuming the changed object + - **Indirect:** Assets consuming direct consumers + - **KPI:** KPIs affected by change + - **Reports/Dashboards:** BI artifacts consuming changed assets + - Identify affected stakeholders (which teams will feel impact?) + +3. **Assess impact severity:** + - For each impacted asset, assess: + - **Criticality:** Is this a critical KPI? Customer-facing? Regulatory? 
+ - **Detectability:** Would broken data be noticed immediately or silently wrong? + - **Blast Radius:** How many end users affected? + - Document each assessment with rationale + +4. **Create impact report:** + - **Change Summary:** What is being changed and why? + - **Downstream Impacts:** List all affected assets with severity + - **Mitigation Plans:** How to minimize impact? (e.g., phased rollout, temporary shadow calculation) + - **Testing Plan:** How to validate change doesn't break downstream? + - **Rollback Plan:** How to revert if issue discovered? + - **Stakeholder Notifications:** Which teams need to be informed? + +5. **Use report to gain approvals:** + - Share impact report with affected stakeholders + - Get sign-off from asset owners before proceeding + - Document approval (who approved, when, any conditions) + - Update report as approvals gathered + +**Best Practices:** + +- Always generate impact report before modifying critical assets +- Use impact reports to surface hidden downstream dependencies +- Include indirect impacts (sometimes more important than direct) +- Define severity thresholds (when is impact too high?) +- Use impact analysis to identify opportunities to consolidate redundant implementations + +#### Workflow: Upstream Analysis + +**When to use:** When investigating data quality issues, understanding data freshness, or documenting data sources. + +**Steps:** + +1. **Trace back to original sources:** + - Start from asset with issue (model, measure, or dashboard) + - Follow lineage upstream to source tables + - Identify each transformation step + - Document data quality at each stage + +2. **Identify source systems:** + - For each upstream table, identify source system (SAP, Salesforce, custom app, etc.) + - Document extraction frequency (real-time, batch, delayed) + - Identify data quality issues in source (missing values, duplicates, delays) + +3. 
**Analyze data quality degradation:** + - Identify where data quality issues are introduced + - Example: "Missing Customer Names in source → not populated in enriched view → shows as blanks in dashboard" + - Document which transformations impact quality + +4. **Identify opportunities to improve:** + - Use upstream analysis to fix quality issues at source (better than downstream workarounds) + - Identify redundant transformations that could be consolidated + - Propose moving transformations closer to source (for efficiency) + +**Best Practices:** + +- Document source system characteristics (reliability, update frequency) +- Use upstream analysis to identify data quality root causes +- Periodically review upstream dependencies; document changes +- Build dashboards tracking data freshness across lineage +- Work with source system owners to improve data quality upstream + +#### Workflow: Downstream Analysis + +**When to use:** When deprecating assets, understanding asset usage, or during data quality investigations. + +**Steps:** + +1. **Identify all downstream consumers:** + - Start from table, view, or measure + - Trace forward to consuming models, measures, KPIs + - Continue tracing to dashboards, reports, or AI/ML models + - Use `list_catalog_assets` to identify all references + +2. **Categorize downstream usage:** + - **Critical:** KPIs, customer-facing dashboards, regulatory reports + - **Important:** Internal dashboards, analyst-used models, operational reports + - **Experimental:** Prototype dashboards, one-time analyses + - Assess what happens if asset becomes unavailable + +3. **Understand consumption patterns:** + - Which downstream assets are actively used? + - Which downstream assets are stale/unused? (candidates for cleanup) + - Which measures appear in multiple KPIs? (indicates high leverage) + - Which dashboards have most users? (highest-risk to break) + +4. 
**Plan changes safely:** + - For deprecation: Provide replacement assets before turning off original + - For modifications: Test changes against actual downstream consumers + - Communicate changes early to downstream owners + - Build deprecation timeline that allows downstream adjustment + +**Best Practices:** + +- Use downstream analysis to understand asset criticality +- Identify "hidden" downstream consumers (often forgotten dependencies) +- Use downstream usage patterns to prioritize data governance efforts +- Periodically clean up unused downstream assets (reduces technical debt) +- Track downstream usage to validate data product adoption + +#### Workflow: Change Impact Assessment Before Modifications + +**When to use:** Before modifying any production asset, during data quality fixes, or during optimization projects. + +**Steps:** + +1. **Plan change:** + - Document what is being changed (table structure, calculation logic, assumptions) + - Document why change is needed (bug fix, performance, business requirement) + - Document expected benefits and risks + +2. **Analyze downstream impacts:** + - Use downstream analysis workflow (above) to identify all consumers + - Generate impact report (see above workflow) + - Identify critical vs. non-critical impacts + - Document stakeholders who need notification + +3. **Design safe change approach:** + - **Option 1 (No-Impact Approach):** Add new column/table; don't modify existing ones + - **Option 2 (Backward-Compatible):** Support old and new logic simultaneously during transition + - **Option 3 (Phased Rollout):** Change in phases; monitor for issues at each phase + - **Option 4 (Temporary Shadow):** Run new logic alongside old; validate before switching + - Choose approach based on risk and downstream impact + +4. 
**Build testing plan:** + - Test change in development/test environment first + - Test with actual data volumes and realistic downstream consumption patterns + - Validate that downstream assets still produce correct results + - Document test cases and results + +5. **Execute with monitoring:** + - Apply change (in phases if using phased rollout) + - Monitor downstream dashboards/reports for unexpected changes + - Monitor data quality metrics (completeness, accuracy, freshness) + - Have rollback plan ready + - Document any issues discovered during rollout + +6. **Communicate and document:** + - Notify downstream stakeholders of change and any impacts + - Document change in asset versioning/changelog + - Update lineage documentation if data flows changed + - Conduct post-mortem if issues discovered + +**Best Practices:** + +- Always analyze impact before change; never assume "no one will notice" +- Use phased or shadow approaches for high-risk changes +- Build automated tests to validate downstream assets after changes +- Keep rollback capability for at least 24-48 hours after change +- Review impact assessment process regularly; improve based on incidents + +--- + +### 6. Data Quality Scoring and Tracking + +Data quality scores help users select trustworthy datasets and identify improvement opportunities. + +#### Workflow: Define Quality Dimensions and Scoring + +**When to use:** During governance program launch or when implementing data quality initiatives. + +**Steps:** + +1. **Identify quality dimensions:** + - **Completeness:** Are all required values present? (inverse of missing/null rates) + - **Accuracy:** Do values match source of truth or business rules? + - **Timeliness:** Is data fresh? How long since last update? + - **Consistency:** Do related values align? (e.g., sum of parts = total) + - **Uniqueness:** Are there unintended duplicates? + - **Validity:** Do values match expected format/range? + - See references for detailed scoring templates + +2. 
**Define scoring methodology:** + - For each dimension, establish measurement logic + - Example Completeness: "Score = 100 * (non-null rows / total rows)" + - Example Timeliness: "Score = 100 if updated in last 24 hours; decreases 5 points per day of staleness" + - Document assumptions and edge cases + - Establish minimum thresholds (e.g., Completeness must be ≥95%) + +3. **Aggregate scores:** + - Calculate overall quality score from dimension scores + - Use a weighted average if some dimensions are more important (see references for templates) + - Example: Overall = (Completeness * 30% + Accuracy * 30% + Timeliness * 25% + Consistency * 15%) + - Establish overall quality tiers: Certified (90+), Trusted (80-89), Monitor (70-79), Issue (<70) + +4. **Establish scorecard template:** + - Create quality scorecard showing all dimensions + - Include trends over time (are we improving or degrading?) + - Document current blockers to achieving higher scores + - Set improvement targets + +**Best Practices:** + +- Start with simple dimensions (completeness, timeliness); add advanced dimensions over time +- Base scores on automated measurements when possible (avoid manual scoring) +- Review scoring methodology quarterly; update if dimensions/thresholds change +- Communicate quality scores to all data consumers +- Link quality issues to root causes (helps with remediation) + +#### Workflow: Score Assets on Quality Metrics + +**When to use:** During metadata enrichment, when onboarding new data sources, or in quality audits. + +**Steps:** + +1. **Measure quality dimensions:** + - Use `analyze_column_distribution` to assess completeness (what % of rows have values?) + - Check timeliness (when was data last refreshed?) + - Run validation rules to assess accuracy (do values meet business rules?) + - Compare with source to assess consistency + +2. 
**Calculate quality scores:** + - For each dimension, calculate score using methodology from above + - Aggregate dimension scores into overall quality tier + - Document any assumptions or manual overrides + - Identify root causes of low scores + +3. **Assign quality tags:** + - Tag assets with quality tier (Certified, Trusted, Monitor, Issue) + - Tag assets with specific quality issues (duplicate_data, stale_data, missing_values, etc.) + - Use tags to surface quality issues in catalog search + +4. **Publish quality metadata:** + - Add quality score and dimensions to asset detail page + - Create quality dashboard showing scores across portfolio + - Enable sorting/filtering by quality score + - Show trend charts (is quality improving over time?) + +**Best Practices:** + +- Automate quality scoring; don't rely on manual assessments +- Review quality scores weekly or monthly (not annually) +- Link quality issues to improvement projects (make it actionable) +- Highlight quick wins (easy-to-fix quality issues) +- Recognize teams that improve data quality + +#### Workflow: Quality Dashboards and Trending + +**When to use:** When establishing quality culture, during quality improvement programs, or for executive visibility. + +**Steps:** + +1. **Build quality scorecards:** + - Create dashboard showing quality scores across all assets + - Show breakdown by domain (Finance, Sales, HR, etc.) + - Show breakdown by source system or team + - Display as heatmap or scorecard format + +2. **Visualize trends over time:** + - Track quality score trends (improving or degrading?) + - Identify tables with declining quality (investigate why) + - Celebrate tables with improving quality (recognize teams) + - Use trends to justify investment in quality initiatives + +3. 
**Enable drill-down analysis:** + - Click table to see detailed quality metrics (dimension scores) + - See which columns are causing low quality + - View quality issues identified in validation rules + - Link to remediation projects or tickets + +4. **Use for accountability:** + - Assign quality scorecards to team owners + - Monthly review of team's quality scorecard + - Tie quality improvements to performance goals + - Use quality metrics in hiring/promotion decisions (if appropriate) + +**Best Practices:** + +- Make quality visible to all users (public dashboard, not hidden) +- Set realistic improvement targets (don't expect 100% overnight) +- Celebrate improvements; don't just criticize poor quality +- Link quality metrics to business impact (show cost/risk of poor quality) +- Review quality dashboards at least monthly + +--- + +### 7. Catalog Review Workflows + +Regular catalog reviews prevent stale data, ensure accurate metadata, and maintain governance standards. + +#### Workflow: Periodic Asset Review Scheduling + +**When to use:** To establish ongoing governance cadence. + +**Steps:** + +1. **Define review schedule:** + - **Critical Assets:** Monthly review (KPIs, customer-facing models, regulatory data) + - **Important Assets:** Quarterly review (heavily-used dashboards, core measures) + - **Standard Assets:** Annual review (everything else) + - Establish review calendar with assigned owners + +2. **Create review checklist:** + - Does the asset meet current business needs? + - Is metadata up-to-date (name, description, tags)? + - Is quality acceptable? Any known issues? + - Is asset actively used? By whom? + - Is ownership clear and current? + - Are there any deprecated or redundant assets to clean up? + +3. **Conduct reviews:** + - Send review request to asset owner with checklist + - Owner reviews and confirms or updates metadata + - Owner identifies any issues or improvement opportunities + - Governance team follows up on unfinished reviews + +4. 
**Track and report:** + - Monitor review completion rates + - Report on findings (common issues, needed improvements) + - Create action items for improvements identified + - Schedule follow-up reviews for problematic assets + +**Best Practices:** + +- Keep review lightweight; 5-minute checklist is better than hour-long review +- Automate notifications and tracking (don't rely on email) +- Make reviews part of team's regular cadence (e.g., first Friday of month) +- Recognize teams with high-quality catalogs +- Use review findings to improve governance processes + +#### Workflow: Stale Asset Identification and Cleanup + +**When to use:** During governance audits or when trying to reduce catalog clutter. + +**Steps:** + +1. **Identify stale assets:** + - Use `list_catalog_assets` to find assets with: + - No recent updates (e.g., not modified in 1+ year) + - No recent usage (e.g., not consumed by any dashboards/models) + - No owner or owner no longer in organization + - Flag assets as potentially stale + - Create stale asset inventory + +2. **Investigate stale assets:** + - For each stale asset, determine why it's not used + - Is it truly unused? Or is usage not tracked? + - Use downstream analysis to check for hidden dependencies + - Interview potential users (is this asset still needed?) + +3. **Plan consolidation or deprecation:** + - **Option 1:** Consolidate with similar active asset (reduce duplication) + - **Option 2:** Deprecate with migration path to replacement asset + - **Option 3:** Archive with clear documentation (in case needed in future) + - **Option 4:** Delete if truly redundant and no users identified + - Get stakeholder approval before action + +4. 
**Execute cleanup:** + - For deprecation: Communicate timeline, provide replacement access + - For consolidation: Migrate any remaining users to replacement asset + - For archival: Move to archive location, keep documentation accessible + - For deletion: Only after confirming no users/lineage dependencies + +5. **Monitor and report:** + - Track cleanup progress and completed actions + - Report on amount of technical debt removed + - Celebrate catalog cleanliness improvements + - Establish policies to prevent stale assets from accumulating again + +**Best Practices:** + +- Don't delete without confirming no users (hidden dependencies surprise you) +- Keep archived assets documented; don't just disappear them +- Communicate deprecations early; give long lead time (6+ months) +- Use cleanup projects to establish ongoing maintenance culture +- Review and approve stale asset cleanup at governance committee level + +#### Workflow: Ownership Assignment and Accountability + +**When to use:** During onboarding, when ownership gaps identified, or during governance reviews. + +**Steps:** + +1. **Define ownership model:** + - **Technical Owner:** Responsible for data model, refresh, quality + - **Business Owner:** Responsible for business interpretation, accuracy + - **Executive Sponsor:** Accountable for strategic alignment + - Document roles and responsibilities in RACI matrix (see references) + +2. **Identify ownership gaps:** + - Use `list_catalog_assets` to find assets without assigned owner + - Create ownership inventory (asset → assigned owner) + - Identify teams with too much ownership (capacity issues) + - Identify gaps where ownership unclear + +3. **Assign ownership:** + - Match assets to appropriate owners based on: + - Team responsible for data model + - Team most familiar with business context + - Team consuming data most heavily + - Get owner approval before assigning + - Document escalation path if owner unavailable + +4. 
**Enable accountability:** + - Use ownership assignments to route review requests + - Track owner response rates and quality of reviews + - Recognize owners with high-quality asset governance + - Provide support/training to struggling owners + +5. **Maintain ownership:** + - Review ownership quarterly + - Update when owners change roles + - Establish succession planning for critical asset owners + - Document owner transitions with knowledge transfer + +**Best Practices:** + +- Assign single owner (not committees); clarifies accountability +- Ensure owner has time/authority to manage asset (avoid overloading) +- Rotate ownership periodically (prevents siloing of knowledge) +- Provide owners with tools and dashboards to manage their assets +- Tie ownership to performance reviews/compensation (creates accountability) + +--- + +## MCP Tools Reference + +This skill leverages these Datasphere MCP tools: + +- **`search_catalog`** - Search catalog by name, description, or metadata; filter by type, domain, tag +- **`get_asset_details`** - Retrieve full metadata for table, view, model, or measure (structure, lineage, ownership) +- **`list_catalog_assets`** - List assets matching criteria (type, owner, status, tag); supports pagination +- **`search_repository`** - Search source system definitions and imported objects +- **`get_object_definition`** - Retrieve detailed definition of object (structure, calculations, lineage) +- **`get_deployed_objects`** - List deployed models/measures and their status +- **`analyze_column_distribution`** - Analyze column data types, cardinality, completeness, distributions + +--- + +## Best Practices Summary + +**Metadata Enrichment:** +- Use business terminology, not technical jargon +- Keep descriptions under 500 words; link to detailed docs +- Auto-suggest tags; require human review before applying +- Batch updates by logical group; test on samples first + +**Glossary Management:** +- Start with 20-30 high-impact terms; grow over time +- 
Involve business owners in definitions +- Link glossary terms to actual technical implementations +- Version glossary; communicate changes to stakeholders +- Resolve terminology conflicts through formal approval process + +**KPI Definition:** +- Define calculation logic clearly; include examples +- Validate KPI against underlying data before publishing +- Assign single owner; document accountability +- Review KPI portfolio annually for relevance +- Version KPI definitions; don't silently change calculations + +**Tag Management:** +- Design controlled vocabulary; limit active tags to 20-50 +- Apply tags consistently across all assets +- Use tags for governance enforcement (required tags for sensitive data) +- Review and consolidate tags quarterly + +**Lineage & Impact:** +- Always analyze downstream impact before modifying critical assets +- Use impact reports to gain stakeholder approval for changes +- Identify and consolidate redundant implementations +- Build automated tests to validate changes against downstream consumers + +**Data Quality:** +- Automate quality scoring; avoid manual assessments +- Make quality scores visible to all users +- Link quality issues to remediation projects +- Review quality metrics monthly; celebrate improvements + +**Catalog Reviews:** +- Keep reviews lightweight (5-minute checklist) +- Conduct critical asset reviews monthly; standard assets annually +- Identify and clean up stale assets regularly +- Maintain ownership assignments; rotate periodically + +--- + +## Common Anti-Patterns and Solutions + +**Anti-Pattern:** Metadata written for IT, not business +- **Solution:** Use business analyst as template reviewer; remove jargon + +**Anti-Pattern:** Too many tags; users can't navigate +- **Solution:** Consolidate to 20-30 core tags; deprecate duplicates + +**Anti-Pattern:** KPI definitions silently change (breaks downstream calculations) +- **Solution:** Version KPI definitions; communicate changes; validate impact + 
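The versioning remedy above can be sketched as an append-only record. The class and field names here are illustrative assumptions, not part of any Datasphere API:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class KPIVersion:
    """One immutable version of a KPI's calculation logic."""
    version: int
    formula: str          # human-readable calculation, e.g. "won_deals / total_deals"
    effective_from: date
    change_reason: str

@dataclass
class KPIDefinition:
    """A KPI with an append-only version history: old versions are never edited."""
    name: str
    versions: list[KPIVersion] = field(default_factory=list)

    def current(self) -> KPIVersion:
        return self.versions[-1]

    def revise(self, formula: str, effective_from: date, reason: str) -> KPIVersion:
        # Append a new version instead of silently overwriting the old one,
        # so downstream consumers can see exactly when and why the logic changed.
        v = KPIVersion(len(self.versions) + 1, formula, effective_from, reason)
        self.versions.append(v)
        return v

kpi = KPIDefinition("Win Rate")
kpi.revise("won_deals / total_deals", date(2025, 1, 1), "Initial definition")
kpi.revise("won_deals / (total_deals - canceled_deals)", date(2025, 7, 1),
           "Exclude canceled deals per sales-ops review")
print(kpi.current().version)  # → 2
```

Downstream consumers read `current()`, while auditors can replay the full history, so no calculation ever changes silently.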
+**Anti-Pattern:** Assets with no owner; governance unenforceable +- **Solution:** Systematically assign owners; include ownership in publishing workflow + +**Anti-Pattern:** Quality issues discovered downstream; no visibility upstream +- **Solution:** Build quality dashboards; surface issues early; tie to remediation projects + +**Anti-Pattern:** Lineage not captured; impact analysis impossible +- **Solution:** Ensure lineage automatically captured from data models; manually document complex logic + +**Anti-Pattern:** Glossary becomes unmaintainable; conflicting definitions +- **Solution:** Implement formal approval workflow; version terms; resolve conflicts through governance + +--- + +## Integration with Data Product Publishing + +The Catalog Steward skill complements the Data Product Publisher skill: + +- **Catalog Steward:** Organizes internal repository (metadata, quality, lineage, governance) +- **Data Product Publisher:** Publishes curated products to external marketplace + +Before publishing a data product, use Catalog Steward to: +- Ensure all source assets have clear ownership and quality certification +- Validate glossary terms and KPI definitions +- Verify lineage and impact analysis (understand ripple effects) +- Establish quality SLAs for published product +- Assign business/technical owners responsible for product quality + +--- + +## Getting Started + +1. **Audit Current State:** + - Use `list_catalog_assets` to inventory all assets + - Use `get_asset_details` to assess metadata quality + - Identify highest-value governance improvements + +2. **Design Your Governance Model:** + - Define roles and responsibilities (RACI) + - Design tag taxonomy and glossary structure + - Establish quality scoring methodology + +3. **Execute Pilot Project:** + - Pick 1-2 business domains for pilot + - Enrich metadata, add glossary terms, implement tagging + - Build sample quality dashboard + - Get stakeholder feedback and refine approach + +4. 
**Scale Governance Program:** + - Extend to additional domains + - Build automation for metadata enrichment, quality scoring, lineage capture + - Establish review cadences and ownership assignments + - Train data owners and catalog curators + +5. **Measure and Optimize:** + - Track catalog usage metrics (searches, views, discovery patterns) + - Monitor governance compliance (tagging, quality, ownership) + - Conduct quarterly reviews; adjust policies based on feedback + - Celebrate wins; recognize teams driving adoption diff --git a/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/references/catalog-governance-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/references/catalog-governance-guide.md new file mode 100644 index 0000000..f9776a4 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-catalog-steward/references/catalog-governance-guide.md @@ -0,0 +1,1693 @@ +# Catalog Governance Reference Guide + +## Table of Contents +1. [Metadata Enrichment Templates](#metadata-enrichment-templates) +2. [Glossary Term Template](#glossary-term-template) +3. [KPI Definition Template](#kpi-definition-template) +4. [Tag Taxonomy Design Patterns](#tag-taxonomy-design-patterns) +5. [Impact Analysis Report Template](#impact-analysis-report-template) +6. [Data Quality Scorecard](#data-quality-scorecard) +7. [Catalog Review Schedule Template](#catalog-review-schedule-template) +8. [Governance RACI Matrix](#governance-raci-matrix) +9. [Common Anti-Patterns](#common-anti-patterns) + +--- + +## Metadata Enrichment Templates + +### Asset Naming Convention + +Use these patterns to create consistent, discoverable asset names. 
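These suffix conventions are easy to enforce automatically at publish time. A minimal sketch in Python (hypothetical helper; the regexes encode only the `_Dim`/`_Fact`/`_View`/`_Model` suffix patterns from this guide, not a Datasphere API):

```python
import re

# Illustrative regexes for the naming conventions in this guide
NAMING_PATTERNS = {
    "dimension": re.compile(r"^[A-Z][A-Za-z]*_([A-Z][A-Za-z]*_)*Dim$"),
    "fact":      re.compile(r"^[A-Z][A-Za-z]*_([A-Z][A-Za-z]*_)*Fact$"),
    "view":      re.compile(r"^[A-Z][A-Za-z]*_([A-Z][A-Za-z]*_)*View(_[A-Za-z]+)*$"),
    "model":     re.compile(r"^[A-Z][A-Za-z]*_([A-Z][A-Za-z]*_)*Model$"),
}

def check_name(asset_name: str, asset_type: str) -> bool:
    """Return True if asset_name follows the convention for its type."""
    pattern = NAMING_PATTERNS.get(asset_type)
    if pattern is None:
        raise ValueError(f"Unknown asset type: {asset_type}")
    return bool(pattern.match(asset_name))

print(check_name("Customer_Dim", "dimension"))   # True
print(check_name("Sales_Order_Fact", "fact"))    # True
print(check_name("orders_table", "fact"))        # False
```

A check like this can run as a pre-publish gate so non-conforming names are rejected before they reach the catalog.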
+ +#### Dimension Table Naming +``` +[Domain]_[Entity]_Dim + +Examples: +- Customer_Dim (core customer master) +- Product_Dim (product catalog) +- Date_Dim (date dimension for time-series analysis) +- Employee_Dim (employee master) +- Geography_Dim (geographic locations/regions) +``` + +#### Fact Table Naming +``` +[Domain]_[Verb]_Fact + +Examples: +- Sales_Order_Fact (individual sales transactions) +- Finance_GL_Posting_Fact (general ledger postings) +- HR_Attendance_Fact (employee attendance records) +- Marketing_Campaign_Response_Fact (campaign responses) +``` + +#### View Naming +``` +[Domain]_[Entity]_View_[Purpose] + +Examples: +- Sales_Order_View_Current (current active orders) +- Customer_View_Enriched (customer with demographics) +- Finance_GL_View_Monthly (monthly GL aggregation) +- Product_View_with_Pricing (products with price hierarchy) +``` + +#### Model Naming +``` +[Domain]_[Topic]_Model + +Examples: +- Sales_Pipeline_Model +- Customer_Churn_Model +- Finance_Cash_Flow_Model +- HR_Headcount_Model +``` + +#### Measure Naming +``` +[Entity]_[Aggregation]_[Domain] + +Examples: +- Order_Count_Sales +- Revenue_Sum_Sales +- Cost_Avg_Finance +- Headcount_Current_HR +``` + +### Asset Description Template + +Use this template for all asset descriptions: + +```markdown +## [Asset Name] + +### What This Asset Contains +[One-sentence summary describing the primary data contents] + +Example: This table contains daily sales order transactions, including order details, +customer information, and order amounts for all regions. + +### Business Purpose +[Why this asset exists; what business problems does it solve?] + +Example: Enables sales analysis, revenue forecasting, and customer purchasing patterns +analysis for monthly business reviews and quarterly forecasting. 
+ +### Key Entities and Measures +- **Dimension [X]:** [Brief description] (Example value: "Customer ID, unique per customer") +- **Dimension [Y]:** [Brief description] (Example value: "Order date, YYYY-MM-DD format") +- **Measure [A]:** [Brief description] (Example value: "Order amount in USD, excluding tax") +- **Measure [B]:** [Brief description] (Example value: "Item quantity ordered") + +### Data Refresh +- **Frequency:** [Daily/Weekly/Monthly] at [time, timezone] +- **Latency:** Data typically available within [X hours] of transaction +- **Last Updated:** [Auto-populated timestamp] + +### Data Lineage +- **Source:** [List source systems or parent tables] +- **Transformations:** [Summarize key business logic or transformations applied] +- **Consumers:** [List primary consuming models or reports] + +### Known Limitations or Caveats +[Document any data quality issues, exclusions, or approximations. This is critical.] + +Examples: +- Excludes returns and cancellations; see Sales_Return_Fact for returns +- Historical data available from Jan 2020 forward only +- International orders excluded; see Global_Sales_Order_Fact +- Prices are list prices; do not include negotiated discounts + +### Owner and Contact +- **Business Owner:** [Name, Title, Email] +- **Technical Owner:** [Name, Title, Email] +- **Last Review Date:** [YYYY-MM-DD] + +### Related Assets +- [Related Dimension: Customer_Dim] +- [Related Fact: Sales_Order_Detail_Fact] +- [Related Model: Sales_Pipeline_Model] + +### Certification Status +[Certified / Trusted / Monitor / Issue] +- Quality Score: [XX%] +- Last Quality Review: [YYYY-MM-DD] +``` + +### Description Writing Guidelines + +**DO:** +- Use plain business language (assume audience is analyst, not DBA) +- Include concrete examples (helps readers understand scope) +- Document caveats and limitations (critical for users to know) +- Keep main description under 300 words (link to detailed docs) +- Use consistent tense (present tense, descriptive) +- Be 
specific (not "customer info" but "customer name, email, phone, address") + +**DON'T:** +- Use technical jargon without explanation (no "SCD Type 2 dimension") +- Hide important limitations (users will find out later at cost) +- Write essays (use "see detailed documentation" link instead) +- Assume readers know source system (explain SAP, Salesforce, etc.) +- Change descriptions without notifying users (impacts trust) + +### Examples + +#### Good Description +``` +Sales Order Fact + +What This Asset Contains: +This table contains 250M+ daily sales order transactions from 2020-present, +including order header details (order ID, customer, date, amount) and detail +lines (product, quantity, unit price). + +Business Purpose: +Enables revenue reporting, sales forecasting, and customer purchasing analysis. +Used in 40+ dashboards and 3 major KPIs (Monthly Revenue, Customer Lifetime Value, +Sales Pipeline). + +Key Entities: +- Order ID (unique per order) +- Customer ID (links to Customer_Dim) +- Order Date (YYYY-MM-DD) +- Total Amount (USD, including tax) +- Product ID (links to Product_Dim) + +Data Refresh: +- Frequency: Daily at 2 AM UTC +- Latency: Data available within 6 hours of order creation in SAP +- Last Updated: 2024-02-08 + +Limitations: +- Returns and cancellations excluded (see Sales_Return_Fact) +- International orders before 2023 not available (system migration) +- Negotiated discounts not shown; uses list prices +- Test orders included; filter by Customer ID >= 100000 to exclude + +Owner: Sarah Chen, Sales Analytics Manager (s.chen@company.com) +Quality Score: 97% (certified) +``` + +#### Poor Description +``` +Order Table + +Contains order data. Updated daily. Links to customer table. +Has order amount and product info. See SQL for details. +``` + +--- + +## Glossary Term Template + +Use this template for all glossary terms. 
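Synonym lists across many glossary terms tend to drift into conflicts, where two terms claim the same synonym (the regional "Turnover" note in the template is one example). A minimal Python sketch for flagging such conflicts (hypothetical glossary structure, not a Datasphere API):

```python
# Hypothetical glossary entries: term -> list of synonyms
glossary = {
    "Revenue": ["Sales", "Turnover"],
    "Net Sales": ["Turnover", "Net Revenue"],
    "Headcount": ["FTE Count"],
}

def find_synonym_conflicts(entries: dict) -> dict:
    """Map each synonym claimed by more than one term to the conflicting terms."""
    claims = {}
    for term, synonyms in entries.items():
        for syn in synonyms:
            claims.setdefault(syn.lower(), []).append(term)
    return {syn: terms for syn, terms in claims.items() if len(terms) > 1}

print(find_synonym_conflicts(glossary))
# {'turnover': ['Revenue', 'Net Sales']}
```

Conflicts found this way feed the formal approval workflow: the governance team decides which term owns the synonym, and the loser documents the distinction in its "Related Terms" section.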
+ +### Glossary Term Definition + +```markdown +## [Term Name] + +### Synonyms +[List alternative names for this concept] +- Alternative name 1 +- Alternative name 2 +- [Note: "Turnover" is NOT a synonym for US-based companies; document regional differences] + +### Plain Language Definition +[Define the business concept in language a business analyst would use. +Assume no technical background. 1-2 sentences.] + +Example: "The total income a company receives from selling products and services +to customers during a specific time period, before deducting expenses." + +### Technical Definition +[How is this calculated in data systems? Include formula if applicable.] + +Example: "Sum of all order amounts (order header TOTAL_AMOUNT field) for orders +with order status = 'Completed' during the reporting period, excluding canceled +and returned orders. Calculated in Sales_Summary model using TOTAL_REVENUE measure." + +### Calculation Method +[Detailed step-by-step calculation, including any business logic or exclusions] + +Example: +1. Identify all orders in SAP (table ORDERS) with status = 'Completed' +2. Exclude orders with flag CANCELLED_FLAG = 'Y' +3. Exclude order detail lines with return quantities > 0 (see RETURN_QTY column) +4. Sum TOTAL_AMOUNT field across all qualifying order detail lines +5. Convert to reporting currency using daily exchange rates (CURRENCY_CONVERSION table) +6. Apply any negotiated discounts recorded in CUSTOMER_DISCOUNT table + +### Dimensions / Slicing +[How is this metric sliced in analysis?] 
+ +Typical Dimensions: +- By Customer Segment (Enterprise, Mid-Market, SMB) +- By Product Category (Physical Products, Services, Digital) +- By Geography (North America, Europe, Asia-Pacific) +- By Time Period (Daily, Weekly, Monthly, Year-to-Date) +- By Sales Channel (Direct, Partner, Online) + +### Examples +[Provide concrete examples to clarify the definition] + +**What Increases Revenue:** +- Customer places new order: Order Amount is added to Revenue +- Volume discount removed: Revenue increases (less discount) +- Price increase applied: Revenue increases for same quantities + +**What Decreases Revenue:** +- Order canceled: Order Amount is removed from Revenue +- Product returned: Return amount is deducted from Revenue +- Negotiated discount applied: Revenue decreases + +**Edge Cases:** +- Revenue from free trials: Generally excluded (zero price) +- Multi-year contracts: Recognized monthly/quarterly (not all at once) +- Accrued vs. received: Reported on accrual basis (not cash received) + +### Related Terms +[Link to related glossary terms] + +- **Parent Term:** [Term this is derived from] +- **Related Terms:** [Terms used in conjunction with this one] +- **Complement Terms:** [Terms that measure the inverse or opposite] + +Examples: +- Parent: Total Revenue +- Related: Product Revenue, Service Revenue, Recurring Revenue +- Complement: Expenses, Cost of Revenue + +### Calculation Logic / Formula +[For metrics, include detailed formula with field references] + +Example SQL: +```sql +SELECT + SUM(o.TOTAL_AMOUNT) as REVENUE +FROM ORDERS o +WHERE o.ORDER_STATUS = 'Completed' + AND o.CANCELLED_FLAG = 'N' + AND NOT EXISTS (SELECT 1 FROM ORDER_DETAIL od + WHERE od.ORDER_ID = o.ORDER_ID + AND od.RETURN_QTY > 0) +``` + +### Data Sources +[Which systems/tables implement this term?] 
+ +- Primary Source: SAP ERP (ORDERS, ORDER_DETAIL tables) +- Transformation: Sales_Summary model in Datasphere +- Calculation: TOTAL_REVENUE measure +- Reporting: All Sales dashboards + +### Applicable Business Domains +[Which teams use this term?] + +- Sales (daily/weekly tracking) +- Finance (monthly closing, forecasting) +- Executive Leadership (strategic KPIs) +- Marketing (ROI analysis) + +### Owner and Approval +- **Business Owner:** [Name, Title] +- **Technical Owner:** [Name, Title] +- **Approved By:** [Name, Title, Governance Committee] +- **Approval Date:** [YYYY-MM-DD] +- **Last Review Date:** [YYYY-MM-DD] +- **Next Review Date:** [YYYY-MM-DD] + +### Version History +| Version | Date | Author | Change | +|---------|------|--------|--------| +| 1.2 | 2024-02-01 | J. Smith | Added international revenue; updated definition | +| 1.1 | 2023-10-15 | M. Johnson | Clarified discount treatment | +| 1.0 | 2023-06-01 | S. Chen | Initial definition | + +### Change Requests +[Link to any pending change requests or disputes about this definition] + +- [Change Request #001: Should revenue include multi-year contracts? In review...] +- [Change Request #002: Should we exclude certain customer segments? Pending domain owner review...] + +--- + +## KPI Definition Template + +Use this template for all Key Performance Indicators. + +```markdown +## [KPI Name] + +### KPI Code +[Unique identifier for tracking] +Example: KPI-SALES-001, KPI-FIN-MARGIN-001 + +### Strategic Objective +[Which business goal does this KPI support?] + +Example: "Increase annual recurring revenue (ARR) by 25% in 2024" +Example: "Improve customer retention from 92% to 95% by Q4 2024" + +### Business Definition +[Plain-language definition suitable for executives] + +Example: "The percentage of customers who renew their subscription with us +at the end of the contract period, measured monthly and tracked as trailing +12-month average." 
+ +### Technical Definition / Calculation +[Detailed calculation formula with field references] + +Example: +1. Identify all customers with contract renewal dates in reporting month +2. Count customers who renewed contracts (RENEWAL_FLAG = 'Y') +3. Count total eligible customers (all with renewal date in month) +4. Calculate: Churn Rate = (Total - Renewed) / Total * 100% +5. Calculate: Retention = 100% - Churn Rate +6. Apply trailing 12-month moving average + +### Formula +``` +Retention Rate (%) = (Renewed Customers in Period / Eligible Customers) * 100 +Trailing 12-Month Retention = Average of last 12 monthly rates +``` + +### Detailed Calculation Steps +[Step-by-step walkthrough for implementation] + +``` +Step 1: Identify eligible customers +- Customer must have active contract during reporting period +- Customer must have reached renewal date +- Exclude test customers (CUSTOMER_ID < 100000) + +Step 2: Count renewals +- FROM Contracts c +- JOIN Renewals r ON c.CONTRACT_ID = r.CONTRACT_ID +- WHERE c.RENEWAL_DATE >= start_of_month +- AND c.RENEWAL_DATE < end_of_month +- AND r.RENEWAL_FLAG = 'Y' + +Step 3: Calculate monthly rate +- Monthly Retention = Count(Renewals) / Count(Eligible) * 100 + +Step 4: Calculate trailing 12-month average +- T12M Retention = AVG(monthly rates for last 12 months) +``` + +### Dimensions (Slicing) +[How is this KPI analyzed by different attributes?] + +- **Customer Segment:** Enterprise, Mid-Market, SMB +- **Product:** Core Product, Premium Add-on, Professional Services +- **Geography:** North America, EMEA, Asia-Pacific +- **Sales Channel:** Direct Sales, Partner, Self-Serve +- **Cohort:** By contract start date (vintage analysis) + +### Data Sources +[Which tables and models provide this KPI?] 
+ +- **Source Table 1:** CONTRACTS (contract details) +- **Source Table 2:** RENEWALS (renewal transactions) +- **Transformation Model:** Customer_Success_Model +- **KPI Measure:** RETENTION_RATE_T12M +- **Dashboard:** Executive Revenue Dashboard, CS Team Dashboard + +### Refresh Frequency and Timing +- **Frequency:** Monthly (on the 3rd business day) +- **Latency:** Data available 1 business day after month-end +- **Granularity:** Monthly, trended + +### Thresholds and Targets +[What is "good" performance?] + +| Threshold | Performance | Action | +|-----------|-------------|--------| +| ≥ 95% | Target Met | Monitor for changes | +| 92-95% | Yellow | Investigate; develop improvement plan | +| 88-92% | Orange | Escalate; assign improvement task force | +| < 88% | Red | Executive escalation; emergency action plan | + +Target (2024): 95% + +### KPI Owner and Accountability +- **KPI Owner (Accountable):** [Name, Title, Email] - Sets target, reviews monthly +- **Business Owner:** [Name, Title, Email] - Interprets results, defines initiatives +- **Data Owner:** [Name, Title, Email] - Ensures data quality, troubleshoots issues +- **Dashboard Owner:** [Name, Title, Email] - Maintains dashboard infrastructure +- **Escalation Path:** [Name, Title] if KPI misses target + +### Related KPIs +[KPIs that work together or have dependencies] + +- **Parent KPI:** Annual Recurring Revenue (ARR) +- **Child KPI:** Renewal Rate by Segment +- **Companion KPI:** Churn Rate, Net Revenue Retention +- **Prerequisite:** Customer Acquisition Rate (must have customers to retain) + +### Historical Performance +[Track KPI performance over time] + +| Period | Value | Trend | Notes | +|--------|-------|-------|-------| +| 2024-02 | 94.2% | ↑ 0.3% | Improvements in onboarding | +| 2024-01 | 93.9% | ↓ 0.2% | Competitor pressure in SMB | +| 2023-12 | 94.1% | ↑ 0.1% | Holiday season stable | + +### Known Issues or Caveats +[Data quality issues, approximations, or limitations] + +- "Multi-year contracts 
treated as annual renewal; not ideal for trend analysis" +- "Excludes customers with <6 months tenure (insufficient history)" +- "Data from SAP available with 2-day delay; projections based on partial month" + +### Validation Results +[Was this KPI validated against data?] + +- **Validation Date:** 2024-01-15 +- **Status:** CERTIFIED +- **Data Quality Score:** 98% +- **Issues Found:** None +- **Validated By:** [Name, Data Quality Team] + +### KPI Change History +| Version | Date | Change | Approved By | +|---------|------|--------|-------------| +| 1.2 | 2024-01-01 | Changed to T12M avg; was monthly snap | CEO, CFO | +| 1.1 | 2023-10-01 | Added customer segment dimension | Sales VP | +| 1.0 | 2023-06-01 | Initial definition | Executive Team | + +### Related Documentation +- [Detailed calculation logic](link-to-detailed-doc) +- [Data dictionary for source tables](link-to-data-dictionary) +- [Customer Success Dashboard](link-to-dashboard) +- [CS Team Review Meeting Notes](link-to-meeting-notes) + +--- +``` + +--- + +## Tag Taxonomy Design Patterns + +### Pattern 1: Simple Flat Taxonomy (Good for <100 Assets) + +``` +Domain Tags: +- domain:sales +- domain:finance +- domain:hr +- domain:operations +- domain:marketing +- domain:product + +Sensitivity Tags: +- sensitivity:public +- sensitivity:internal +- sensitivity:confidential +- sensitivity:pii +- sensitivity:financial + +Cadence Tags: +- cadence:real-time +- cadence:daily +- cadence:weekly +- cadence:monthly +- cadence:quarterly + +Quality Tags: +- quality:certified +- quality:trusted +- quality:monitor +- quality:issue + +Usage Tags: +- usage:kpi +- usage:reporting +- usage:analysis +- usage:ml-model +- usage:regulatory +``` + +### Pattern 2: Hierarchical Taxonomy (Good for >500 Assets) + +``` +domain:finance + ├── domain:finance:accounting + ├── domain:finance:revenue + ├── domain:finance:expense + └── domain:finance:planning + +domain:sales + ├── domain:sales:pipeline + ├── domain:sales:forecast + ├── 
domain:sales:commission + └── domain:sales:customer + +domain:hr + ├── domain:hr:payroll + ├── domain:hr:recruiting + ├── domain:hr:performance + └── domain:hr:learning + +sensitivity:data-protection + ├── sensitivity:pii (Personally Identifiable Information) + ├── sensitivity:financial (Salary, bank accounts) + ├── sensitivity:health (Medical records) + └── sensitivity:behavioral (Personal preferences) + +cadence:update-frequency + ├── cadence:real-time (updated <1 minute) + ├── cadence:daily (updated 1x per day) + ├── cadence:weekly (updated 1x per week) + ├── cadence:monthly (updated 1x per month) + └── cadence:static (not updated) +``` + +### Pattern 3: Multi-Dimensional Tagging (Best Practice) + +Combine multiple tag types on each asset (3-5 tags typical): + +``` +Asset: Sales_Summary_Model +Tags Applied: +- domain:sales (which business area) +- cadence:daily (update frequency) +- quality:certified (data quality) +- usage:reporting (how it's used) + +Asset: Customer_Financial_Data_View +Tags Applied: +- domain:finance (business area) +- sensitivity:pii (sensitive data) +- quality:trusted (quality status) +- cadence:weekly (refresh freq) +``` + +### Industry-Specific Tag Examples + +#### Financial Services +``` +Domain: + ├── domain:banking + ├── domain:insurance + ├── domain:trading + ├── domain:compliance + ├── domain:risk + +Regulatory: + ├── regulatory:sox (Sarbanes-Oxley) + ├── regulatory:gdpr (GDPR compliance) + ├── regulatory:ccpa (California privacy) + ├── regulatory:pci (Payment Card Industry) + └── regulatory:hipaa (Healthcare) +``` + +#### Healthcare +``` +Domain: + ├── domain:clinical + ├── domain:billing + ├── domain:pharmacy + ├── domain:radiology + +Compliance: + ├── compliance:hipaa + ├── compliance:phi (Protected Health Info) + ├── compliance:audit-trail + └── compliance:consent-required +``` + +#### Retail/E-commerce +``` +Domain: + ├── domain:products + ├── domain:orders + ├── domain:inventory + ├── domain:pricing + ├── 
domain:customer-service + +Business-Focus: + ├── focus:inventory-mgmt + ├── focus:demand-forecasting + ├── focus:customer-lifetime-value + ├── focus:promotional-analysis +``` + +### Tag Governance Rules + +``` +Tag Creation Rules: +- Only governance team can create new tags +- New tags require business justification +- New tags must fit within established taxonomy +- Tags reviewed and approved within 5 business days + +Tag Application Rules: +- All assets must have at least 1 domain tag +- All assets with PII/sensitive data must be tagged +- Do not apply >5 tags per asset (creates noise) +- Tag changes must be logged in audit trail + +Tag Deprecation: +- Tags with <5 assets tagged are candidates for deprecation +- Deprecated tags sunset over 30-day period +- Assets using deprecated tags must be re-tagged +- Deprecation announced 2 weeks in advance +``` + +--- + +## Impact Analysis Report Template + +Use this template when assessing change impacts. + +```markdown +# Impact Analysis Report + +## Executive Summary + +**Asset:** [Name of asset being changed] +**Change Type:** [Modification / Deletion / Renaming / Deprecation] +**Estimated Impact Scope:** [Low / Medium / High] +**Required Approvals:** [List stakeholders who must approve] +**Recommended Action:** [Proceed / Proceed with caution / Block until conditions met] + +--- + +## Change Summary + +### What is Being Changed +[Detailed description of the change] + +Example: "Dropping column HISTORICAL_PRICE from Sales_Order_Fact table +(unused since 2022; replaced by dynamic price lookup)" + +### Why is This Change Needed +[Business or technical justification] + +Example: "Column is deprecated; all downstream models already migrated to +use Price_Dim table. Dropping will improve query performance by 3% and +reduce table size by 2.5 GB." 
+ +### Implementation Timeline +- **Analysis Date:** [When this impact analysis was completed] +- **Proposed Change Date:** [When change will be applied] +- **Approval Needed By:** [When approval must be complete] +- **Rollback Capability Until:** [Latest point rollback is possible] + +--- + +## Downstream Impact Analysis + +### Direct Consumers (Immediately Impacted) + +| Consumer Asset | Type | Impact | Owner | Risk Level | +|---|---|---|---|---| +| Model: Sales_Analysis | Model | Column no longer available; breaking change | Sarah Chen | HIGH | +| Dashboard: Daily Sales Report | Dashboard | Will display NULL; visuals affected | John Smith | HIGH | +| Report: Monthly Sales Forecast | Report | Source data changed; recalculation needed | Jane Doe | MEDIUM | + +**Summary:** 3 direct consumers identified. 2 require code changes; 1 requires testing. + +### Indirect Consumers (Second-Level Impact) + +| Asset Path | Type | Impact Depth | Owner | Risk Level | +|---|---|---|---|---| +| Table → Model: Sales_Analysis → Dashboard: Executive Dashboard | Dashboard | Breaking change will cascade; fixes required in model first | Sarah Chen | HIGH | +| Table → View: Sales_Summary_View → Report: Sales Trend Analysis | Report | Dependent on intermediate view; will propagate change | Bob Wilson | MEDIUM | + +**Summary:** 2 indirect consumers via 2-step lineage. Impact depends on fix to Model: Sales_Analysis. + +### KPI Consumers + +| KPI Name | Related Measures | Impact | Owner | +|---|---|---|---| +| Monthly Revenue | TOTAL_REVENUE (uses Sales_Order_Fact) | Will break if not handled; recalculation required | C-Level | +| Customer Lifetime Value | CUSTOMER_SPEND (uses Sales_Order_Fact) | Will break; business user impact | Sales VP | + +**Summary:** 2 KPIs directly impacted. CEO/CFO awareness required due to strategic importance. 
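The direct and indirect consumer tables above don't have to be assembled by hand: a breadth-first walk over the lineage graph yields every affected asset together with its depth (1 = direct, 2+ = indirect). A minimal sketch (hypothetical edge list; real lineage would come from `get_asset_details`):

```python
from collections import deque

# Hypothetical lineage: asset -> direct downstream consumers
lineage = {
    "Sales_Order_Fact": ["Sales_Analysis", "Sales_Summary_View"],
    "Sales_Analysis": ["Executive_Dashboard"],
    "Sales_Summary_View": ["Sales_Trend_Report"],
}

def downstream_impact(root: str, edges: dict) -> dict:
    """Breadth-first walk; returns each affected asset with its lineage depth."""
    depths, queue = {}, deque([(root, 0)])
    while queue:
        asset, depth = queue.popleft()
        for consumer in edges.get(asset, []):
            if consumer not in depths:  # skip revisits (handles cyclic graphs)
                depths[consumer] = depth + 1
                queue.append((consumer, depth + 1))
    return depths

print(downstream_impact("Sales_Order_Fact", lineage))
# {'Sales_Analysis': 1, 'Sales_Summary_View': 1,
#  'Executive_Dashboard': 2, 'Sales_Trend_Report': 2}
```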
+ +### Total Impact Scope + +``` +Impact Pyramid: +┌─────────────────────────────────────┐ +│ 5 Executive Dashboards (users: 200+)│ ← Highest user impact +├─────────────────────────────────────┤ +│ 12 Analyst Reports (users: 50+) │ +├─────────────────────────────────────┤ +│ 3 Internal Models │ +├─────────────────────────────────────┤ +│ 1 Direct Consumer (Sales_Analysis) │ ← Direct impact +└─────────────────────────────────────┘ +``` + +--- + +## Impact Classification + +### Risk Assessment Matrix + +| Impact Type | Severity | Detectability | Blast Radius | Overall Risk | +|---|---|---|---|---| +| Executive Dashboard breaks | High | Easy (immediate user notices) | 200+ users | CRITICAL | +| Model calculation breaks | High | Medium (appears wrong after delay) | 50+ users | HIGH | +| Report shows NULL | Medium | Easy (report owner notices) | 10+ users | MEDIUM | +| Performance degrades | Low | Hard (gradual degradation) | All users | MEDIUM | + +### Critical Assets at Risk +- **Executive Dashboard:** "Daily Sales Report" - CEO/CFO use this daily +- **KPI:** "Monthly Revenue" - Used in board reporting and compensation calc +- **Regulatory Report:** "Sales Tax Report" - Audit trail required + +--- + +## Stakeholder Assessment + +| Stakeholder | Role | Impact | Approval? 
|
+|---|---|---|---|
+| Sarah Chen (Sales Analyst) | Owner: Sales_Analysis Model | Must update model code | YES |
+| John Smith (BI Team) | Owner: Executive Dashboard | Dashboard will break; requires fixes | YES |
+| Jane Doe (Report Owner) | Owner: Sales Forecast Report | Dependent asset affected | YES |
+| CFO | Executive User | KPI impacted; board report affected | YES |
+| Data Governance Team | Oversight | Must track change impact | YES |
+
+---
+
+## Mitigation Strategy
+
+### Option 1: No-Impact Approach
+- Keep HISTORICAL_PRICE column in table
+- Mark column as deprecated (add tag)
+- Plan formal removal in 6-month sunset window
+- Notify downstream owners of deprecation
+- **Pros:** Zero breaking changes; minimal risk
+- **Cons:** Technical debt remains; column not removed
+
+### Option 2: Backward-Compatible Migration
+- Create view Sales_Order_Fact_V2 without deprecated column
+- Migrate downstream consumers to new view gradually
+- Keep original table for 90-day compatibility window
+- Provide mapping documentation
+- **Pros:** Clean break with transition path; reduces technical debt
+- **Cons:** Requires downstream testing; coordination needed
+
+### Option 3: Breaking Change with Immediate Fix
+- Remove column; notify all owners immediately
+- Provide SQL fix for all downstream models
+- Test fixes in dev/UAT before production
+- Apply changes in coordinated rollout
+- **Pros:** Complete cleanup; forces modernization
+- **Cons:** High risk if any downstream consumers missed; potential user impact
+
+### Recommendation
+Use **Option 2 (Backward-Compatible Migration)**:
+- Lower risk than breaking change
+- Cleaner outcome than indefinite deprecation
+- Provides transition time for downstream teams
+- Demonstrates governance responsibility
+
+---
+
+## Testing and Validation Plan
+
+### Testing Phases
+
+**Phase 1: Development Testing (Week 1)**
+- [ ] Apply change to development environment
+- [ ] Test all 3 direct consumer models
+- [ ] 
Validate KPI calculations still correct +- [ ] Run performance benchmarks +- [ ] Documentation: Test results, any issues found + +**Phase 2: Stakeholder Validation (Week 2)** +- [ ] Notify stakeholders; provide test access +- [ ] Sarah Chen (Model Owner): Validate model still produces correct results +- [ ] John Smith (Dashboard Owner): Validate dashboard visuals correct +- [ ] CFO (KPI User): Confirm KPI numbers still trusted +- [ ] Documentation: Stakeholder sign-offs, any requirements identified + +**Phase 3: UAT Testing (Week 3)** +- [ ] Deploy change to UAT environment +- [ ] Full regression testing of all downstream assets +- [ ] Performance testing under realistic load +- [ ] Data quality validation +- [ ] Documentation: UAT results, go/no-go decision + +**Phase 4: Production Rollout (Week 4)** +- [ ] Deploy to production during low-traffic window +- [ ] Monitor dashboards/reports for issues +- [ ] Track data quality metrics +- [ ] Have rollback plan ready +- [ ] Post-change review 7 days after deployment + +--- + +## Approval Sign-Offs + +### Required Approvals (All must approve before proceeding) + +- [ ] **Sarah Chen** (Sales_Analysis Model Owner) + - Signature: _________________ Date: _______ + - Comments: _______________________________ + +- [ ] **John Smith** (Executive Dashboard Owner) + - Signature: _________________ Date: _______ + - Comments: _______________________________ + +- [ ] **CFO** (Executive Stakeholder - KPI User) + - Signature: _________________ Date: _______ + - Comments: _______________________________ + +- [ ] **Data Governance Lead** (Oversight) + - Signature: _________________ Date: _______ + - Comments: _______________________________ + +--- + +## Communication Plan + +### Stakeholder Notifications + +**Immediate Notification (upon impact analysis completion):** +- Email to all direct consumer owners +- Summary of change, why needed, timeline +- Invite to review impact analysis +- Request any missing impacts be reported + 
+**Pre-Approval Notification (1 week before change):** +- Final impact summary +- Testing plan and timeline +- Approval request with sign-off deadline +- Q&A call offered + +**Post-Approval Notification (day before change):** +- Confirmed change date and time window +- Rollback plan and contact info +- Request to monitor their assets +- Escalation path if issues found + +**Post-Change Notification (day after change):** +- Confirmation change deployed successfully +- Request for any issues to be reported +- Summary of validation results +- Thank you for support + +### Training (if needed) +- [ ] Model: Sales_Analysis - 30-min working session with Sarah Chen +- [ ] Dashboard: Executive Dashboard - 15-min review with John Smith +- [ ] KPI: Monthly Revenue - Email summary to CFO + +--- + +## Rollback Plan + +**If issues discovered post-deployment:** + +1. **Detection (Within 24 hours):** + - Monitor dashboards for incorrect values + - Monitor model calculations + - Monitor data quality metrics + - Alert triggers: NULL values, 5%+ change in key metrics + +2. **Assessment (Within 1 hour of alert):** + - Confirm issue is real (not expected change) + - Assess severity (user impact, data correctness) + - Decide: Fix forward vs. rollback + +3. **Rollback (If necessary):** + - Restore table from backup (retain last 48 hours) + - Revert downstream model changes + - Notify stakeholders + - Root cause analysis + +4. 
**Prevention:** + - Post-mortem meeting to understand what was missed + - Enhanced testing added to process + - Documentation updated + +**Rollback Window:** 48 hours after production deployment +**Decision Authority:** Data Governance Lead + CFO + +--- + +## Post-Change Review + +**Schedule:** 7 days after deployment + +- [ ] All dashboards functioning normally +- [ ] All reports showing expected data +- [ ] KPIs within expected ranges +- [ ] Performance metrics improved as expected +- [ ] No user complaints or escalations +- [ ] Stakeholder feedback gathered +- [ ] Documentation updated with lessons learned +- [ ] Governance team signoff + +--- +``` + +--- + +## Data Quality Scorecard Template + +```markdown +# Data Quality Scorecard + +## Asset: [Asset Name] +**Date:** [YYYY-MM-DD] +**Owner:** [Name] +**Overall Score:** [XX%] | Status: [Certified / Trusted / Monitor / Issue] + +--- + +## Quality Dimensions + +### Dimension 1: Completeness +**Definition:** Percentage of non-null values in required fields + +| Field | Total Rows | Null Count | Null % | Score | Status | +|---|---|---|---|---|---| +| CUSTOMER_ID | 10,000,000 | 0 | 0% | 100 | ✓ | +| ORDER_DATE | 10,000,000 | 150 | 0.0015% | 99.8 | ✓ | +| AMOUNT | 10,000,000 | 500 | 0.005% | 99.9 | ✓ | +| DISCOUNT | 10,000,000 | 500,000 | 5% | 95 | ✓ | + +**Overall Completeness Score:** 98.7% +**Threshold:** ≥95% (PASS) +**Trend:** Stable (↔) + +--- + +### Dimension 2: Timeliness +**Definition:** How fresh is the data? How long since last update? + +| Metric | Target | Actual | Score | Status | +|---|---|---|---|---| +| Last Data Update | <6 hours | 2 hours ago | 100 | ✓ | +| Data Latency | <1 hour | 45 minutes | 100 | ✓ | +| Expected Next Update | >24 hours | 18 hours | 80 | ⚠ | +| Monthly Uptime | ≥99% | 99.7% | 100 | ✓ | + +**Overall Timeliness Score:** 95% +**Threshold:** ≥90% (PASS) +**Trend:** Improving (↑) + +--- + +### Dimension 3: Accuracy +**Definition:** Do values match source of truth or business rules? 
+ +| Validation Rule | Test Type | Records Tested | Failed | Pass Rate | Status | +|---|---|---|---|---|---| +| AMOUNT ≥ 0 | Range Check | 10M | 5 | 99.99995% | ✓ | +| CUSTOMER_ID exists in Customer_Dim | Referential Integrity | 10M | 1,250 | 99.9875% | ✓ | +| ORDER_DATE ≤ TODAY | Logic Check | 10M | 0 | 100% | ✓ | +| AMOUNT sums correctly by Order | Aggregation Check | 500K | 2 | 99.9996% | ✓ | + +**Issues Found & Remediation:** +- 1,250 orders with missing customer ID (0.0125%) + - Root Cause: Source system data issue + - Remediation: Pending with source team + - Status: Escalated to IT; ETA fix = 2024-02-15 + +**Overall Accuracy Score:** 99.99% +**Threshold:** ≥95% (PASS) +**Trend:** Stable (↔) + +--- + +### Dimension 4: Consistency +**Definition:** Do related values align? No contradictions? + +| Consistency Check | Test | Result | Status | +|---|---|---|---| +| Product Prices | List price ≤ extended price | 99.98% consistent | ✓ | +| Customer Hierarchy | Parent-child relationships valid | 100% consistent | ✓ | +| Date Fields | ORDER_DATE ≤ SHIP_DATE ≤ DELIVERY_DATE | 99.95% consistent | ✓ | +| Currency Conversion | Exchange rates consistent | 100% consistent | ✓ | + +**Issues Found & Remediation:** +- 200 orders with extended price < list price (duplicate discount applied) + - Root Cause: Discount logic in source system + - Remediation: In progress; fix scheduled for 2024-02-12 + - Temporary Workaround: Ignore orders with this pattern in analysis + +**Overall Consistency Score:** 99.98% +**Threshold:** ≥95% (PASS) +**Trend:** Stable (↔) + +--- + +### Dimension 5: Uniqueness (Duplicate Detection) +**Definition:** Are there unintended duplicates?
+ +| Check | Count Total | Count Unique | Duplicates | Duplicate % | Status | +|---|---|---|---|---|---| +| ORDER_ID | 10,000,000 | 9,999,500 | 500 | 0.005% | ✓ | +| (CUSTOMER_ID, ORDER_DATE, AMOUNT) | 10,000,000 | 9,999,800 | 200 | 0.002% | ✓ | + +**Duplicates Found & Remediation:** +- 500 orders appear 2x in table (exact duplicate rows) + - Root Cause: ETL bug causing double-load on 2024-01-15 + - Remediation: Duplicates deleted; ETL fixed + - Status: Resolved 2024-01-16 + +**Overall Uniqueness Score:** 99.997% +**Threshold:** ≥99.9% (PASS) +**Trend:** Improving after cleanup (↑) + +--- + +### Dimension 6: Validity +**Definition:** Do values match expected format, type, and range? + +| Validation | Rule | Violations | Valid % | Status | +|---|---|---|---|---| +| AMOUNT Data Type | Numeric only | 0 | 100% | ✓ | +| AMOUNT Range | 0 ≤ AMOUNT ≤ 999,999,999 | 0 | 100% | ✓ | +| ORDER_DATE Format | YYYY-MM-DD | 0 | 100% | ✓ | +| ORDER_DATE Range | Valid calendar date | 0 | 100% | ✓ | +| CUSTOMER_ID Format | 8-digit numeric | 0 | 100% | ✓ | + +**Overall Validity Score:** 100% +**Threshold:** ≥95% (PASS) +**Trend:** Stable (↔) + +--- + +## Overall Quality Score Calculation + +    Overall Score = (Completeness*30% + Accuracy*30% + Timeliness*25% + Consistency*15%) + +    Calculation: +    = (98.7 * 0.30) + (99.99 * 0.30) + (95 * 0.25) + (99.98 * 0.15) +    = 29.61 + 29.997 + 23.75 + 14.997 +    = 98.354% + +    Overall Score: 98.4% → Status: CERTIFIED + +*Uniqueness and Validity are monitored against their own thresholds but are not weighted into the overall score.* + +--- + +## Quality Tier Assignment + +| Score Range | Tier | Meaning | Action | +|---|---|---|---| +| ≥95% | **Certified** | Production-ready; low risk | Use freely; promote for publishing | +| 85-94% | **Trusted** | Generally reliable; monitor | Use with awareness; plan improvements | +| 75-84% | **Monitor** | Known issues; high touch | Use only if necessary; implement fixes | +| <75% | **Issue** | Significant problems | Do not use; escalate immediately | + +**This Asset:** **Certified** (98.4% → Excellent quality) + +--- + +## Data
Quality Issues & Remediation + +### Open Issues + +| Issue | Severity | Root Cause | Remediation | Owner | ETA | Status | +|---|---|---|---|---|---|---| +| Missing Customer IDs (1,250 rows) | High | Source system bug | IT investigating data quality issue in SAP | IT Team | 2024-02-15 | In Progress | +| Duplicate orders (500) | Medium | ETL double-load bug | ETL logic fixed; duplicates removed | Data Team | 2024-01-16 | Resolved | +| Price discrepancies (200) | Low | Discount logic issue | Source system owner reviewing | Finance Team | 2024-02-12 | In Progress | + +### Resolved Issues (Last 30 Days) + +| Issue | Severity | Resolution | Date Resolved | Lessons Learned | +|---|---|---|---|---| +| Stale data (data not updated for 7 days) | High | Identified and restarted ETL process | 2024-02-01 | Added monitoring alerts | +| Null values in optional field spiked to 15% | Medium | Source system configuration issue fixed | 2024-01-28 | Increased validation in ETL | + +--- + +## Quality Trends + +### Historical Scores (Last 12 Months) + +| Month | Completeness | Accuracy | Timeliness | Overall | Trend | +|---|---|---|---|---|---| +| 2024-02 | 98.7% | 99.99% | 95.0% | 98.4% | ↑ | +| 2024-01 | 98.0% | 99.95% | 94.0% | 97.8% | ↑ | +| 2023-12 | 97.5% | 99.50% | 92.0% | 97.1% | ↑ | +| 2023-11 | 97.0% | 99.20% | 90.0% | 96.2% | ↓ | +| ... | ... | ... | ... | ... | ... | + +**Trend:** Improving overall; especially strong improvement in timeliness + +--- + +## Improvement Opportunities + +| Opportunity | Effort | Impact | Owner | Timeline | +|---|---|---|---|---| +| Fix missing Customer IDs in source | Medium | High (reduces accuracy gap) | IT | 2024-02-15 | +| Reduce data refresh lag to <1 hour (vs.
current 2h) | High | Medium (improves for KPI users) | Data Team | Q1 2024 | +| Add duplicate detection in ETL | Low | Medium (prevents future duplicates) | Data Team | 2024-02-28 | +| Improve source data validation | Medium | High (reduces accuracy issues) | Finance Team | Q2 2024 | + +--- + +## Data Quality Dashboard & Monitoring + +**Public Dashboard:** [Link to Dashboard] +**Alert Thresholds:** +- Completeness <95% → Alert +- Accuracy <95% → Critical Alert +- Timeliness >6 hours → Alert +- Duplicate % >0.1% → Alert + +**Monitoring Frequency:** Daily (automated checks) +**Review Frequency:** Weekly (governance team) / Monthly (stakeholder review) + +--- + +## Owner Sign-Off + +- **Data Owner:** [Name, Signature, Date] +- **Technical Owner:** [Name, Signature, Date] +- **Last Review Date:** [YYYY-MM-DD] +- **Next Review Date:** [YYYY-MM-DD] + +--- +``` + +--- + +## Catalog Review Schedule Template + +```markdown +# Catalog Review Schedule & Tracking + +## Review Cadence + +| Asset Category | Frequency | Owners | Effort | Purpose | +|---|---|---|---|---| +| **Critical Assets** | Monthly | Asset Owner + Governance | 1 hour/asset | Ensure accuracy, quality, relevance | +| **Important Assets** | Quarterly | Asset Owner | 30 min/asset | Validate metadata, check usage | +| **Standard Assets** | Annual | Asset Owner | 15 min/asset | Confirm active, no cleanup needed | +| **Experimental** | N/A | Author | Ad hoc | Evaluate readiness for production | +| **Deprecated** | N/A | Retiring Team | One-time | Plan transition, archive | + +--- + +## Critical Assets Review Schedule + +Critical assets: KPIs, customer-facing models, regulatory data, high-volume dashboards + +| Month | Assets to Review | Owner | Status | Notes | +|---|---|---|---|---| +| **2024-02** | Revenue KPI, Sales Pipeline Model | Sarah Chen | ✓ Complete | Quality improved to 98% | +| **2024-03** | Churn Rate KPI, Customer Segment Model | John Smith | → In Progress | Due 2024-03-15 | +| **2024-04** | Profit 
Margin KPI, Finance GL Model | Jane Doe | ⏳ Pending | Scheduled for 2024-04-01 | +| **2024-05** | Forecast Accuracy Model, Inventory Levels | Bob Wilson | ⏳ Pending | Scheduled for 2024-05-01 | + +--- + +## Important Assets Quarterly Review Schedule + +Important assets: Heavy-use dashboards, core measures, popular models + +**Q1 2024 (Jan-Mar) Review Group:** +- Sales_Order_Fact (Table Owner: Sarah Chen) - Due 2024-02-15 +- Sales_Summary_Model (Model Owner: John Smith) - Due 2024-02-15 +- Executive Sales Dashboard (BI Owner: Jane Doe) - Due 2024-02-15 +- Product_Category_Model (Data Owner: Bob Wilson) - Due 2024-03-15 + +**Q2 2024 (Apr-Jun) Review Group:** +- Finance_GL_Dim (Table Owner: Finance Team) - Due 2024-04-15 +- Customer_Enriched_Model (Owner: Marketing Team) - Due 2024-04-15 +- Monthly Report Suite (BI Team) - Due 2024-05-15 + +--- + +## Annual Review Scheduling + +Standard assets: One comprehensive review per year, spread across months + +| Month | Review Focus | # Assets | Assigned To | Status | +|---|---|---|---|---| +| Jan | Sales Domain | 25 | Sales Data Team | → In Progress | +| Feb | Finance Domain | 20 | Finance Data Team | ⏳ Starting | +| Mar | HR Domain | 15 | HR Data Team | ⏳ Not Started | +| Apr | Operations Domain | 30 | Ops Data Team | ⏳ Not Started | +| May | Marketing Domain | 18 | Marketing Team | ⏳ Not Started | +| Jun-Dec | Buffer + spillover | 20+ | Ad hoc | ⏳ Not Started | + +--- + +## Review Checklist + +Use this checklist for each asset review: + +### Asset Metadata Review +- [ ] **Name:** Current and accurate? Business-friendly? +- [ ] **Description:** Up-to-date? Covers what/why/when/limitations? +- [ ] **Owner:** Current? Still in this role? +- [ ] **Tags:** Current and accurate? +- [ ] **Last Updated:** Metadata reflects current state? + +### Business Relevance +- [ ] **Still Used?** Any recent usage? By whom? +- [ ] **Still Needed?** Does it still serve a business purpose? 
+- [ ] **Alignment:** Still aligned with current business priorities? +- [ ] **Deprecation Needed?** Is this asset becoming obsolete? + +### Data Quality +- [ ] **Quality Score:** Current? Acceptable? +- [ ] **Known Issues:** Any data quality problems? +- [ ] **Freshness:** Data current? Update frequency still appropriate? +- [ ] **Accuracy Checks:** Still passing validation rules? + +### Governance & Compliance +- [ ] **Ownership Clear:** Do we know who owns this? +- [ ] **Sensitive Data Tagged:** Is PII/confidential data properly tagged? +- [ ] **Lineage Current:** Data sources still accurate? +- [ ] **Approvals Current:** Owner and stakeholder approvals current? + +### Action Items +- [ ] **Updates Needed:** Any metadata changes required? +- [ ] **Issues to Address:** Any problems identified? +- [ ] **Owners to Contact:** Anyone need to be notified? +- [ ] **Next Review:** Schedule next review date + +--- + +## Stale Asset Identification + +Automated process to identify candidate assets for cleanup. Criteria for "stale" classification: + +- No metadata updates in 24+ months +- No downstream consumers (per lineage) +- No recent queries (per query logs) +- Owner no longer in organization +- No associated KPIs or reports + +**Current Stale Assets:** + +| Asset | Last Updated | Owner Status | Consumers | Recommendation | +|---|---|---|---|---| +| Legacy_Sales_V1 | 2021-06-15 | Owner left org | 0 | Archive | +| Test_Customer_Model | 2022-03-01 | Still active | 0 | Delete | +| Old_GL_Reporting_View | 2021-12-31 | Still active | 2 (legacy reports) | Consolidate → New view | +| Experimental_Forecast | 2023-01-15 | Active | 0 | Deprecate (no adoption) | + +--- + +## Review Completion Tracking + +| Owner | Assets Assigned | Completed | In Progress | Pending | % Complete | +|---|---|---|---|---|---| +| Sarah Chen | 5 | 5 | 0 | 0 | 100% | +| John Smith | 6 | 4 | 2 | 0 | 67% | +| Jane Doe | 4 | 1 | 1 | 2 | 25% | +| Bob Wilson | 5 | 0 | 0 | 5 | 0% | +| **TOTAL** |
**20** | **10** | **3** | **7** | **50%** | + +**SLA:** 90% completion within review window + +--- + +## Review Outcome Summary + +### Issues Identified + +| Category | Count | Examples | +|---|---|---| +| Metadata Updates Needed | 8 | Stale descriptions, outdated owners | +| Quality Issues Found | 4 | Data completeness declining, lag increasing | +| Deprecation Candidates | 3 | Legacy assets with no users | +| Owner Changes Needed | 5 | Owner left org, reassignment needed | + +### Actions Taken + +| Action | Count | Status | +|---|---|---| +| Metadata Updated | 8 | Completed | +| Quality Improvement Planned | 4 | Escalated to data team | +| Deprecation Initiated | 3 | In progress (30-day sunset) | +| Ownership Reassigned | 5 | Completed | + +--- +``` + +--- + +## Governance RACI Matrix Template + +```markdown +# Governance RACI Matrix + +## Legend +- **R (Responsible):** Does the work; executes the task +- **A (Accountable):** Final decision authority; answerable for outcome +- **C (Consulted):** Provides input; expertise needed +- **I (Informed):** Kept in the loop; notified of decisions + +--- + +## Asset Metadata Management + +| Activity | Data Owner | Business Owner | Governance Lead | BI Team | Compliance | +|---|---|---|---|---|---| +| Enrich asset descriptions | **R** | C | **A** | I | I | +| Create/update asset names | R | **A** | C | I | - | +| Assign asset ownership | **R** | **A** | C | - | I | +| Update quality scores | **R** | I | C | **A** | - | +| Apply quality tags | R | I | **A** | - | C | + +--- + +## Glossary Management + +| Activity | Business Owner | Governance Lead | Data Dictionary Owner | Finance (if $-related) | Compliance | +|---|---|---|---|---|---| +| Define glossary terms | **A** | R | C | C | C | +| Approve new terms | **A** | R | I | **C** (if financial) | I | +| Link terms to assets | R | **A** | I | - | I | +| Review/update terms | **A** | R | C | C | C | +| Deprecate terms | **A** | R | I | I | I | + +--- + +## KPI Management + +| 
Activity | Business Owner | Data Owner | Governance Lead | CFO/Finance | BI Owner | +|---|---|---|---|---|---| +| Define KPI | **A** | C | R | **A** | C | +| Calculate/implement | **C** | **R** | I | I | **A** | +| Validate KPI | **A** | **R** | C | I | I | +| Monitor KPI performance | **A** | I | I | **C** | **R** | +| Review/update KPI | **A** | **R** | C | **A** | I | +| Approve KPI changes | **A** | I | **R** | **A** (if strategic) | I | + +--- + +## Data Quality Management + +| Activity | Data Owner | Governance Lead | Quality Team | Business Owner | CIO | +|---|---|---|---|---|---| +| Define quality dimensions | I | **R** | **A** | C | I | +| Score assets | **R** | I | **A** | I | - | +| Track trends | **R** | I | **A** | I | - | +| Fix quality issues | **A** | I | **R** | C | I | +| Publish quality dashboard | **R** | I | **A** | I | I | +| Escalate critical issues | I | **R** | **A** | **A** | **C** | + +--- + +## Change Management + +| Activity | Data Owner | Business Owner | Governance Lead | BI Team | IT/Support | +|---|---|---|---|---|---| +| Propose change | **R** | I | **A** | I | - | +| Impact analysis | **R** | **A** | C | **A** | I | +| Stakeholder approval | I | **A** | **R** | **A** (if consumer) | I | +| Test/validate change | **A** | **C** | I | **R** | **R** | +| Deploy change | **A** | I | **R** | **R** | **A** | +| Monitor post-deployment | **R** | I | **A** | **R** | I | +| Incident response | **A** | **A** | **C** | **A** | **R** | + +--- + +## Tag Management & Governance + +| Activity | Data Owner | Governance Lead | Domain Owner | IT/Technical | +|---|---|---|---|---| +| Design tag taxonomy | I | **A** | **R** | C | +| Create new tags | I | **A** | **R** | I | +| Apply tags to assets | **R** | **A** | C | - | +| Review tag usage | **R** | **A** | I | - | +| Consolidate/deprecate tags | I | **A** | **R** | I | +| Enforce tag governance | **R** | **A** | C | I | + +--- + +## Lineage & Impact Analysis + +| Activity | Data Owner | Governance 
Lead | Analytics Team | Business Owner | +|---|---|---|---|---| +| Document lineage | **A** | **R** | I | I | +| Maintain lineage accuracy | **R** | I | **A** | I | +| Generate impact reports | **A** | **R** | C | **A** | +| Perform impact analysis | **R** | **A** | C | **A** | +| Communicate impacts | I | **R** | I | **A** | + +--- + +## Catalog Review Process + +| Activity | Asset Owner | Governance Lead | Data Owner | Domain Lead | +|---|---|---|---|---| +| Schedule reviews | I | **A** | - | **R** | +| Conduct reviews | **R** | I | **A** | I | +| Follow up on issues | **A** | **R** | **R** | I | +| Report results | I | **A** | I | **R** | +| Plan remediations | **A** | I | **R** | I | + +--- + +## Approval Authority & Escalation + +### By Impact Level + +**Low Impact Changes** (e.g., description update) +- Authority: Data Owner +- Escalation: Governance Lead (if disputed) + +**Medium Impact Changes** (e.g., column rename, tag change) +- Authority: Data Owner + Governance Lead +- Escalation: Domain Lead (if timeline critical) + +**High Impact Changes** (e.g., delete column, deprecate KPI) +- Authority: Data Owner + Business Owner + Governance Lead +- Escalation: Executive Sponsor (if strategic impact) + +**Critical Impact Changes** (e.g., break KPI, regulatory impact) +- Authority: Executive Sponsor + Governance Committee +- Escalation: CFO/CIO/Chief Data Officer + +--- + +## Decision-Making Authority + +| Decision | Authority | Consulted | Timeline | Escalation | +|---|---|---|---|---| +| Add new glossary term | Business Owner + Governance Lead | Data Owner | 5 business days | CFO (if finance term) | +| Define/approve KPI | CFO + Business Owner | Data Owner, BI Lead | 10 business days | CEO (if strategic) | +| Tag new asset | Data Owner | Governance Lead | 1 business day | Governance Lead | +| Score asset as "Issue" | Data Owner + Governance Lead | Business Owner | 3 business days | CIO (if urgent) | +| Deprecate critical asset | Business Owner + Governance Lead 
| All users | 30 days notice | Executive Sponsor | +| Change KPI calculation | Business Owner + Data Owner | Governance, BI, Users | 15 business days | CFO | + +--- + +## Meeting Schedule & Attendees + +### Monthly Data Governance Meeting +- **Attendees:** Governance Lead, Domain Leads, Data Owners (rotating) +- **Duration:** 1 hour +- **Agenda:** Review new issues, approve changes, discuss trends + +### Quarterly KPI Steering Committee +- **Attendees:** CFO, Business Owners, Data Owner, Governance Lead +- **Duration:** 1.5 hours +- **Agenda:** KPI performance review, strategy alignment, new KPI approval + +### Annual Governance Assessment +- **Attendees:** CIO, CFO, Governance Lead, Domain Leads, Business Owners +- **Duration:** Full day (possibly across 2 half-days) +- **Agenda:** Review program health, adjust policies, plan improvements + +--- +``` + +--- + +## Common Anti-Patterns and Solutions + +### Anti-Pattern 1: "IT Metadata" Problem + +**Symptom:** Descriptions use technical jargon ("SCD Type 2 dimension", "surrogate key join") + +**Root Cause:** Data owners are IT/DBA staff writing for other technical people + +**Impact:** Business analysts can't understand what data to use; self-service breaks + +**Solution:** +1. Require business analyst to review descriptions +2. Rewrite using business language (customer, order, product) +3. Include examples: "This contains 500K+ customers with name, email, address, phone" +4. Remove technical implementation details +5. Add "For technical details, see data dictionary [link]" + +--- + +### Anti-Pattern 2: "Tag Explosion" + +**Symptom:** 200+ tags in system; no one knows what each one means; tags overlap heavily + +**Root Cause:** Anyone can create tags; no governance; tags created ad-hoc for each use case + +**Impact:** Tags become useless for discovery; searches return too many results; maintenance nightmare + +**Solution:** +1. Consolidate to 20-50 core tags (ruthless deduplication) +2. 
Only governance team can create new tags +3. Design controlled vocabulary (see tag taxonomy template above) +4. Deprecate redundant tags over 30-day period +5. Regular quarterly reviews to keep taxonomy clean + +--- + +### Anti-Pattern 3: "Silent KPI Changes" + +**Symptom:** KPI calculation changed without notification; downstream users get wrong results; no audit trail + +**Root Cause:** No governance around KPI changes; "just pushed change to production" + +**Impact:** Users don't trust KPI numbers; regulatory audit risk; broken dashboards + +**Solution:** +1. Version KPI definitions (v1.0, v1.1, v2.0) +2. Never silently change calculation (always bump version) +3. Impact analysis required before change +4. Stakeholder approval before deployment +5. Communicate change 2+ weeks in advance +6. Maintain old calculation alongside new (transition period) +7. Audit trail documents who changed what when + +--- + +### Anti-Pattern 4: "Phantom Owner" + +**Symptom:** Asset shows owner = "John Smith" but he left company 6 months ago; no one maintaining asset + +**Root Cause:** Asset ownership not tracked; no process to reassign when people leave + +**Impact:** Asset metadata stale; quality issues not fixed; governance ineffective + +**Solution:** +1. Audit ownership quarterly; identify orphaned assets +2. When people leave: data owner notifies governance team +3. Automatically reassign to manager or team lead +4. Escalate if no clear owner identified +5. Regular reviews confirm ownership is current + +--- + +### Anti-Pattern 5: "Quality Theater" + +**Symptom:** System reports "100% quality" but users complain about bad data + +**Root Cause:** Quality scoring ignores real issues; metrics disconnect from reality + +**Impact:** Users don't trust quality scores; quality program loses credibility + +**Solution:** +1. Focus on dimensions that matter (completeness, accuracy, timeliness) +2. Include business validation: "does data match reality?" +3. 
Document known issues (don't hide them in score) +4. Link quality scores to remediation projects (make it actionable) +5. Review scoring methodology with users (validate it matches their needs) + +--- + +### Anti-Pattern 6: "Lost Lineage" + +**Symptom:** Can't answer "where does this data come from?" or "what breaks if we change this table?" + +**Root Cause:** Lineage not captured automatically; manually documented lineage becomes outdated + +**Impact:** Impact analysis guesses at impacts; changes break downstream assets; scope creep + +**Solution:** +1. Capture lineage automatically from data models (not manual docs) +2. Use lineage visualization tools (Datasphere provides this) +3. Update lineage when data models change (not manually) +4. Build impact analysis into change process +5. Generate impact reports before major changes + +--- + +### Anti-Pattern 7: "Glossary Graveyard" + +**Symptom:** 500-term glossary that no one uses; conflicts abound; terms obsolete and conflicting + +**Root Cause:** Created terms once; never reviewed/maintained; no governance of conflicts + +**Impact:** Glossary loses credibility; teams create their own terminology; governance broken + +**Solution:** +1. Start with 20-30 high-impact terms (not 500) +2. Implement formal approval workflow +3. Version terms and track changes +4. Resolve conflicts through governance committee +5. Annual review; deprecate unused terms +6. Make glossary discoverable and always up-to-date + +--- + +### Anti-Pattern 8: "Cargo Cult Metadata" + +**Symptom:** Asset descriptions copied from template; generic/meaningless content ("Contains customer data") + +**Root Cause:** Metadata required by process; no incentive for quality; no review + +**Impact:** Descriptions not helpful for discovery; metadata burden without benefit + +**Solution:** +1. Require meaningful descriptions (not boilerplate) +2. Review descriptions before publishing +3. Tie metadata quality to data product quality +4. 
Provide examples of good descriptions +5. Include specific field names and business context +6. Keep descriptions up-to-date (part of maintenance) + +--- + +### Anti-Pattern 9: "Uncontrolled Sensitive Data" + +**Symptom:** PII and financial data not properly tagged; discovered in unexpected places + +**Root Cause:** No mandatory sensitivity tagging; no discovery mechanism for sensitive data + +**Impact:** Compliance violations; audit findings; data breach risk + +**Solution:** +1. Require `sensitivity` tag on ALL assets (mandatory) +2. Auto-suggest sensitivity tags based on content analysis +3. Prevent publishing assets without sensitivity tag +4. Build dashboard showing all sensitive data (inventory) +5. Regular audits for untagged sensitive data +6. Link sensitivity tags to access controls (if supported) + +--- + +### Anti-Pattern 10: "Stale Asset Accumulation" + +**Symptom:** Catalog full of unused tables, views, models; hard to find useful assets + +**Root Cause:** No process to retire assets; "keep everything just in case"; no cleanup culture + +**Impact:** Catalog becomes unmaintainable; users overwhelmed with choices; confusion about which asset to use + +**Solution:** +1. Periodic cleanup sprints (quarterly) +2. Automated detection of stale assets (unused for 12+ months) +3. Archive (not delete) unused assets +4. Keep archived assets discoverable (for history) +5. Establish deprecation timeline (30-90 days notice) +6. Consolidate redundant implementations +7. Celebrate cleanup progress + +--- diff --git a/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/SKILL.md new file mode 100644 index 0000000..56fa81c --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/SKILL.md @@ -0,0 +1,780 @@ +--- +name: "SAP Datasphere CLI Automator" +description: "Automate Datasphere administration with CLI commands! 
Use this skill when you need to bulk-provision users, create spaces programmatically, manage connections at scale, rotate certificates, or integrate Datasphere into CI/CD pipelines. Ideal for DevOps teams, system administrators, and enterprises managing hundreds of users or multi-system landscapes through Infrastructure-as-Code." +--- + +# SAP Datasphere CLI Automator + +## Overview + +The SAP Datasphere CLI (Command-Line Interface) is the power user's gateway to programmatic administration and automation. While the Datasphere UI excels at interactive tasks, the CLI is your tool of choice when you need to: + +- **Bulk Operations**: Provision hundreds or thousands of users, create multiple spaces, configure connections across landscapes +- **Infrastructure-as-Code**: Version control your Datasphere configuration in Git, deploy changes through pipelines +- **Scheduled Automation**: Trigger administrative tasks on schedules via cron, Task Chains, or cloud schedulers +- **CI/CD Integration**: Embed Datasphere provisioning into your DevOps pipeline (GitHub Actions, Azure DevOps, Jenkins) +- **Headless Environments**: Automate when no UI access is available or when running in containerized systems + +### CLI vs GUI: When to Use Each + +| Scenario | CLI | GUI | +|----------|-----|-----| +| Creating 500 users with attributes | ✓ | ✗ | +| Exploring data model visually | ✗ | ✓ | +| One-time user creation | Possible | ✓ | +| Batch certificate rotation | ✓ | ✗ | +| Setting up connection for testing | ✗ | ✓ | +| Deploying 10 connections across 5 systems | ✓ | ✗ | +| Configuring space capacity | ✓ | ✓ | +| Validating connection credentials | ✓ | ✓ | + +## Authentication and Setup + +Before using the CLI, configure authentication to your Datasphere instance. + +### Service Key Authentication (Recommended for Automation) + +Service keys are non-human identities ideal for automation, CI/CD, and scheduled tasks. + +1. 
**Create a Service Key in Datasphere**: + - Navigate to **System > Administration > Security** + - Create a new service key with appropriate scope + - Download the key file (JSON format containing client ID, secret, URL) + +2. **Configure CLI with Service Key**: + ```bash + datasphere config init \ + --client-id "your-client-id" \ + --client-secret "your-client-secret" \ + --instance-url "https://your-datasphere-instance.com" \ + --auth-method service-key + ``` + +3. **Store in Environment Variables** (for CI/CD pipelines): + ```bash + export DATASPHERE_CLIENT_ID="your-client-id" + export DATASPHERE_CLIENT_SECRET="your-client-secret" + export DATASPHERE_INSTANCE_URL="https://your-datasphere-instance.com" + ``` + +### OAuth Token Authentication (Interactive Use) + +For personal workstations with interactive CLI use: + +```bash +datasphere config init --auth-method oauth +# Opens browser for authentication +``` + +### Verify Configuration + +```bash +datasphere config validate +# Output: Configuration valid. Connected to datasphere.acme.com +``` + +## Space Management via CLI + +Spaces are the foundational containers in Datasphere. Manage them programmatically for consistent environments. 
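For repeatable automation it can help to assemble each space-creation invocation from parameters and review it as a dry run before executing. A minimal POSIX-sh sketch — the subcommand and flags mirror the examples in this guide and may differ across CLI versions, so treat them as assumptions and verify against `datasphere spaces create --help`:

```shell
#!/bin/sh
# Build (but do not execute) a `datasphere spaces create` command from
# parameters, so the generated invocation can be reviewed or logged first.
# Flag names follow the examples in this guide (assumed, not verified).
build_space_cmd() {
  name="$1"; ram="$2"; disk="$3"; priority="${4:-standard}"
  printf 'datasphere spaces create --name "%s" --ram-allocation %s --disk-allocation %s --priority %s\n' \
    "$name" "$ram" "$disk" "$priority"
}

# Print the command for review; pipe to `sh` only after inspecting it.
build_space_cmd "SALES_ANALYTICS" 100 500
```

This keeps space parameters in one place (or in a loop over a config file) while leaving an audit trail of exactly what would run.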
+ +### Creating a Single Space + +```bash +datasphere spaces create \ + --name "SALES_ANALYTICS" \ + --description "Sales and Revenue Analytics Space" \ + --ram-allocation 100 \ + --disk-allocation 500 \ + --priority standard +``` + +### Space Definition JSON (for bulk operations) + +Create a `spaces.json` file for reusable space configurations: + +```json +{ + "spaces": [ + { + "name": "SALES_ANALYTICS", + "description": "Sales and Revenue Analytics", + "configuration": { + "memory": { + "allocated_gb": 100, + "reserved_gb": 50 + }, + "disk": { + "allocated_gb": 500 + }, + "priority": "standard", + "network": { + "enable_public_access": false, + "data_isolation_level": "tenant" + } + }, + "owner": "sales-admin@company.com", + "tags": ["production", "analytics"] + }, + { + "name": "FINANCE_REPORTING", + "description": "Finance and Accounting Reporting", + "configuration": { + "memory": { + "allocated_gb": 150, + "reserved_gb": 75 + }, + "disk": { + "allocated_gb": 1000 + }, + "priority": "high", + "network": { + "enable_public_access": false, + "data_isolation_level": "tenant" + } + }, + "owner": "finance-admin@company.com", + "tags": ["production", "finance"] + } + ] +} +``` + +### Bulk Space Creation + +```bash +datasphere spaces create-bulk --file spaces.json --validate --dry-run +# Review output before committing + +datasphere spaces create-bulk --file spaces.json --confirm +``` + +### Pre-allocating Resources + +Control memory, disk, and processing priority during creation: + +```bash +datasphere spaces create \ + --name "HIGH_PERFORMANCE_SPACE" \ + --ram-allocation 200 \ + --disk-allocation 2000 \ + --priority high \ + --reserved-memory 100 \ + --network-isolation strict +``` + +**Resource Allocation Guide**: +- **RAM**: Minimum 50 GB for development, 100+ GB for production analytics +- **Disk**: 5x your expected data volume +- **Priority**: `low` (shared), `standard` (default), `high` (guaranteed resources) + +### Space Cloning + +Duplicate an existing space 
configuration as a template: + +```bash +datasphere spaces clone \ + --source "PROD_TEMPLATE" \ + --target "NEW_ENVIRONMENT" \ + --copy-connections true \ + --copy-users false +``` + +### Space Configuration Updates + +Modify existing space settings: + +```bash +datasphere spaces update SALES_ANALYTICS \ + --new-ram-allocation 150 \ + --new-disk-allocation 750 \ + --new-description "Updated: Sales and Revenue Analytics (upgraded)" +``` + +## User Management via CLI + +Efficiently provision and manage users at scale. + +### Batch User Onboarding + +Create a `users.json` file: + +```json +{ + "users": [ + { + "email": "alice.johnson@company.com", + "first_name": "Alice", + "last_name": "Johnson", + "roles": [ + { + "role": "datasphere.admin", + "scope": "global" + } + ], + "space_assignments": [ + { + "space_name": "SALES_ANALYTICS", + "role": "space_admin" + } + ], + "attributes": { + "department": "Sales", + "cost_center": "CC-1001", + "manager": "bob.smith@company.com" + }, + "status": "active" + }, + { + "email": "charlie.brown@company.com", + "first_name": "Charlie", + "last_name": "Brown", + "roles": [ + { + "role": "datasphere.analyst", + "scope": "global" + } + ], + "space_assignments": [ + { + "space_name": "SALES_ANALYTICS", + "role": "viewer" + }, + { + "space_name": "FINANCE_REPORTING", + "role": "editor" + } + ], + "attributes": { + "department": "Finance", + "cost_center": "CC-2001", + "manager": "alice.johnson@company.com" + }, + "status": "active" + } + ] +} +``` + +### Execute Bulk User Provisioning + +```bash +datasphere users create-bulk \ + --file users.json \ + --validate \ + --dry-run +# Review output + +datasphere users create-bulk \ + --file users.json \ + --send-invitations true \ + --confirm +``` + +**Output**: Lists success/failure per user, generates report with invitation links. 
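Bulk rosters usually originate in HR exports rather than hand-written JSON. As a sketch (the CSV column names and the `build_users_payload` helper are assumptions, not part of the CLI; the output follows the `users.json` schema shown above), a short Python script can assemble the bulk file:

```python
# build_users_json.py - assemble a users.json bulk payload from a CSV roster.
# Assumed CSV columns: email, first_name, last_name, department,
# cost_center, manager, spaces ("SPACE:ROLE" pairs separated by ';').
import csv


def build_users_payload(csv_path):
    """Map each roster row onto the users.json bulk schema shown above."""
    users = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Unpack "SALES_ANALYTICS:editor;FINANCE_REPORTING:viewer"
            assignments = [
                {"space_name": pair.split(":")[0], "role": pair.split(":")[1]}
                for pair in row["spaces"].split(";")
                if pair
            ]
            users.append({
                "email": row["email"],
                "first_name": row["first_name"],
                "last_name": row["last_name"],
                "roles": [{"role": "datasphere.analyst", "scope": "global"}],
                "space_assignments": assignments,
                "attributes": {
                    "department": row["department"],
                    "cost_center": row["cost_center"],
                    "manager": row["manager"],
                },
                "status": "active",
            })
    return {"users": users}
```

Write the result to `users.json` with `json.dump(..., indent=2)`, then run the `--validate --dry-run` pass above before `--confirm`.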
+ +### Assigning Scoped Roles Programmatically + +Scoped Roles attach users to specific spaces with granular permissions: + +```bash +datasphere users assign-role \ + --email "alice.johnson@company.com" \ + --role "datasphere.space_admin" \ + --space "SALES_ANALYTICS" \ + --effective-date "2024-02-01" +``` + +**Available Scoped Roles**: +- `datasphere.space_admin` — Full space administration +- `datasphere.space_editor` — Create and modify objects +- `datasphere.space_viewer` — Read-only access + +### User Attribute Management + +Store custom metadata on users for governance and integration: + +```bash +datasphere users set-attribute \ + --email "alice.johnson@company.com" \ + --attribute "department" \ + --value "Sales" \ + --attribute "cost_center" \ + --value "CC-1001" +``` + +Bulk update attributes: + +```bash +datasphere users batch-attributes \ + --file user_attributes.json +``` + +Where `user_attributes.json` contains: + +```json +{ + "updates": [ + { + "email": "alice.johnson@company.com", + "attributes": { + "department": "Sales", + "cost_center": "CC-1001", + "manager": "bob.smith@company.com" + } + } + ] +} +``` + +### User Deprovisioning Workflows + +**Soft Deprovisioning** (disable access without deleting): + +```bash +datasphere users disable \ + --email "alice.johnson@company.com" \ + --reason "Employee departure" \ + --effective-date "2024-03-15" +``` + +**Hard Deprovisioning** (permanent deletion): + +```bash +datasphere users delete \ + --email "alice.johnson@company.com" \ + --transfer-owned-objects-to "admin@company.com" \ + --confirm +``` + +## Connection Management via CLI + +Datasphere connections link to source and target systems. Manage them at scale via JSON templates. 
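Before handing a template to `connections create-bulk`, a local pre-flight check catches the two most common failures: missing required fields and credential environment variables that are referenced but never set. A minimal sketch, assuming the template shape used in this section (`validate_connections` and its checks are illustrative, not part of the CLI):

```python
# validate_connections.py - sanity-check a connections.json before bulk creation.
# Any authentication key ending in "_variable" must name a set env var.
import json
import os

REQUIRED = ("name", "type", "connection_details", "authentication")


def validate_connections(path):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = []
    with open(path) as f:
        doc = json.load(f)
    for conn in doc.get("connections", []):
        label = conn.get("name", "<unnamed>")
        for field in REQUIRED:
            if field not in conn:
                problems.append(f"{label}: missing required field '{field}'")
        auth = conn.get("authentication", {})
        for key, value in auth.items():
            # Keys like password_variable reference env vars, not literals
            if key.endswith("_variable") and value not in os.environ:
                problems.append(f"{label}: env var '{value}' ({key}) is not set")
    return problems
```

Run it in CI right before the `--validate-credentials --dry-run` step so broken templates fail fast, without a round trip to the tenant.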
+ +### Creating Connections from JSON Templates + +Create a `connections.json` file: + +```json +{ + "connections": [ + { + "name": "PROD_SAP_S4", + "type": "sap_s4hana", + "description": "Production SAP S/4HANA System", + "technical_user": "DATASPHERE_USER", + "connection_details": { + "host": "s4h-prod.company.com", + "port": 50013, + "client": "100", + "use_ssl": true, + "tls_version": "1.2" + }, + "authentication": { + "method": "basic", + "username_variable": "SAP_USER", + "password_variable": "SAP_PASS" + }, + "test_table": "MARA", + "retry_policy": { + "max_attempts": 3, + "backoff_seconds": 5 + }, + "owner": "integration-admin@company.com" + }, + { + "name": "SNOWFLAKE_WAREHOUSE", + "type": "snowflake", + "description": "Snowflake Data Warehouse", + "connection_details": { + "account_identifier": "xy12345.us-east-1", + "warehouse": "COMPUTE_WH", + "database": "DATASPHERE_DB", + "schema": "STAGING" + }, + "authentication": { + "method": "oauth", + "client_id_variable": "SF_CLIENT_ID", + "client_secret_variable": "SF_CLIENT_SECRET", + "token_endpoint": "https://xy12345.us-east-1.snowflakecomputing.com/oauth/token-request" + }, + "test_query": "SELECT 1", + "owner": "data-team@company.com" + } + ] +} +``` + +### Bulk Connection Setup + +```bash +datasphere connections create-bulk \ + --file connections.json \ + --validate-credentials \ + --dry-run +# Verify output + +datasphere connections create-bulk \ + --file connections.json \ + --confirm +``` + +### Connection Validation and Testing + +Verify connectivity before deployment: + +```bash +datasphere connections test \ + --name "PROD_SAP_S4" \ + --verbose +# Output: Connection test successful. Response time: 145ms +``` + +Batch test multiple connections: + +```bash +datasphere connections test-batch \ + --file connections.json \ + --generate-report test_results.html +``` + +## Certificate Management + +Manage TLS server certificates for secure connections to external systems.
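Certificate metadata exported as JSON can be post-processed for rotation planning. A sketch, assuming each entry carries a `name` and an ISO 8601 `expiry_date` (the field names and the `flag_expiring` helper are assumptions about the export shape, not a documented schema):

```python
# flag_expiring_certs.py - flag certificates that fall inside an alert window.
# Assumed input shape: {"certificates": [{"name": ..., "expiry_date": "<ISO 8601>"}]}
from datetime import datetime, timedelta, timezone


def flag_expiring(cert_doc, alert_days=30, now=None):
    """Return names of certificates expiring within alert_days of `now`."""
    now = now or datetime.now(timezone.utc)
    threshold = now + timedelta(days=alert_days)
    expiring = []
    for cert in cert_doc.get("certificates", []):
        expiry = datetime.fromisoformat(cert["expiry_date"])
        if expiry.tzinfo is None:
            # Treat naive timestamps as UTC so comparisons stay well-defined
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry < threshold:
            expiring.append(cert["name"])
    return expiring
```

Passing `now` explicitly makes the window reproducible in tests and scheduled reports; omit it for live checks.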
+ +### Certificate Lifecycle Operations + +**List Current Certificates**: + +```bash +datasphere configuration certificates list \ + --show-expiry \ + --sort-by "expiry_date" +``` + +**Upload New Certificate**: + +```bash +datasphere configuration certificates upload \ + --name "PROD_SAP_S4_CERT" \ + --certificate-file "/path/to/certificate.pem" \ + --key-file "/path/to/private.key" \ + --description "Production S/4HANA TLS Certificate" +``` + +**Certificate Rotation Workflow**: + +```bash +# 1. Upload new certificate +datasphere configuration certificates upload \ + --name "PROD_SAP_S4_CERT_NEW" \ + --certificate-file "/path/to/new_cert.pem" \ + --key-file "/path/to/new_key.pem" \ + --scheduled-activation "2024-02-15T00:00:00Z" + +# 2. Activate new certificate (automatic at scheduled time or manual) +datasphere configuration certificates activate \ + --name "PROD_SAP_S4_CERT_NEW" + +# 3. Deactivate old certificate +datasphere configuration certificates deactivate \ + --name "PROD_SAP_S4_CERT" + +# 4. 
Clean up (optional, after verification period) +datasphere configuration certificates delete \ + --name "PROD_SAP_S4_CERT" \ + --force +``` + +### Expiry Monitoring and Alerting + +Create an automated monitoring script: + +```bash +#!/bin/bash +# cert_expiry_check.sh - Monitor certificate expiry + +ALERT_DAYS=30 +export ALERT_DAYS + +datasphere configuration certificates list --json > certs.json + +# Quoted heredoc delimiter: no shell expansion inside, so pass ALERT_DAYS via the environment +python3 << 'EOF' +import json +import os +from datetime import datetime, timedelta + +ALERT_DAYS = int(os.environ["ALERT_DAYS"]) + +with open('certs.json') as f: + certs = json.load(f) + +alert_threshold = datetime.utcnow() + timedelta(days=ALERT_DAYS) + +for cert in certs['certificates']: + expiry = datetime.fromisoformat(cert['expiry_date']) + if expiry < alert_threshold: + print(f"ALERT: {cert['name']} expires on {cert['expiry_date']}") +EOF +``` + +Schedule in cron: + +```bash +# Run daily at 6 AM +0 6 * * * /opt/datasphere/cert_expiry_check.sh | mail -s "Datasphere Certificate Expiry Alert" security-team@company.com +``` + +## Task Automation Patterns + +### Scheduling CLI Commands via Cron + +Execute administrative tasks on a schedule: + +```bash +# Daily space quota report +0 2 * * * datasphere spaces report --format json > /var/reports/space_quota_$(date +\%Y\%m\%d).json + +# Weekly user access review +0 3 * * 0 datasphere users list --inactive-days 30 > /var/reports/inactive_users.txt + +# Monthly certificate expiry check +0 4 1 * * /opt/datasphere/cert_expiry_check.sh +``` + +### CI/CD Integration for Datasphere + +#### GitHub Actions Example + +Create `.github/workflows/datasphere-deploy.yml`: + +```yaml +name: Deploy Datasphere Configuration + +on: + push: + branches: [main] + paths: ['datasphere/**'] + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Install Datasphere CLI + run: | + curl -sL https://datasphere-cli.company.com/install.sh | bash + datasphere --version + + - name: Configure CLI + env: + DATASPHERE_CLIENT_ID: ${{ 
secrets.DATASPHERE_CLIENT_ID }} + DATASPHERE_CLIENT_SECRET: ${{ secrets.DATASPHERE_CLIENT_SECRET }} + DATASPHERE_INSTANCE_URL: ${{ secrets.DATASPHERE_INSTANCE_URL }} + run: datasphere config init --auth-method service-key + + - name: Validate Configuration + run: | + datasphere spaces create-bulk --file datasphere/spaces.json --validate --dry-run + datasphere users create-bulk --file datasphere/users.json --validate --dry-run + datasphere connections create-bulk --file datasphere/connections.json --validate --dry-run + + - name: Deploy Changes + run: | + datasphere spaces create-bulk --file datasphere/spaces.json --confirm + datasphere users create-bulk --file datasphere/users.json --send-invitations true --confirm + datasphere connections create-bulk --file datasphere/connections.json --confirm + + - name: Run Post-Deployment Tests + run: | + datasphere connections test-batch --file datasphere/connections.json --generate-report deployment_test.html + + - name: Archive Reports + if: always() + uses: actions/upload-artifact@v3 + with: + name: deployment-reports + path: deployment_test.html +``` + +#### Azure DevOps Pipeline Example + +Create `datasphere-pipeline.yml`: + +```yaml +trigger: + branches: + include: + - main + paths: + include: + - datasphere/* + +pool: + vmImage: 'ubuntu-latest' + +stages: +- stage: Validate + jobs: + - job: ValidateConfiguration + steps: + - task: Bash@3 + inputs: + targetType: 'inline' + script: | + curl -sL https://datasphere-cli.company.com/install.sh | bash + export DATASPHERE_CLIENT_ID=$(DATASPHERE_CLIENT_ID) + export DATASPHERE_CLIENT_SECRET=$(DATASPHERE_CLIENT_SECRET) + export DATASPHERE_INSTANCE_URL=$(DATASPHERE_INSTANCE_URL) + datasphere config init --auth-method service-key + datasphere spaces create-bulk --file datasphere/spaces.json --validate --dry-run + datasphere users create-bulk --file datasphere/users.json --validate --dry-run + +- stage: Deploy + condition: succeeded() + jobs: + - deployment: DeployDatasphere + 
environment: 'production' + strategy: + runOnce: + deploy: + steps: + - checkout: self + - task: Bash@3 + inputs: + targetType: 'inline' + script: | + curl -sL https://datasphere-cli.company.com/install.sh | bash + export DATASPHERE_CLIENT_ID=$(DATASPHERE_CLIENT_ID) + export DATASPHERE_CLIENT_SECRET=$(DATASPHERE_CLIENT_SECRET) + export DATASPHERE_INSTANCE_URL=$(DATASPHERE_INSTANCE_URL) + datasphere config init --auth-method service-key + datasphere spaces create-bulk --file datasphere/spaces.json --confirm + datasphere users create-bulk --file datasphere/users.json --send-invitations true --confirm + datasphere connections create-bulk --file datasphere/connections.json --confirm +``` + +### Infrastructure-as-Code Patterns + +Version all Datasphere configuration in Git: + +``` +datasphere-config/ +├── spaces/ +│ ├── sales.json +│ ├── finance.json +│ └── marketing.json +├── users/ +│ ├── bulk_onboarding.json +│ └── role_assignments.json +├── connections/ +│ ├── sap_systems.json +│ └── data_warehouses.json +├── certificates/ +│ └── certificates.json +└── deploy.sh +``` + +`deploy.sh` orchestrates all deployments: + +```bash +#!/bin/bash +set -e + +echo "Deploying Datasphere Configuration..." + +# --file expects a single file, so iterate over each JSON in the directory + +# Validate all configurations +echo "Validating spaces..." +for f in spaces/*.json; do + datasphere spaces create-bulk --file "$f" --validate --dry-run +done + +echo "Validating users..." +for f in users/*.json; do + datasphere users create-bulk --file "$f" --validate --dry-run +done + +echo "Validating connections..." +for f in connections/*.json; do + datasphere connections create-bulk --file "$f" --validate --dry-run +done + +# Deploy +echo "Deploying spaces..." +for f in spaces/*.json; do + datasphere spaces create-bulk --file "$f" --confirm +done + +echo "Deploying users..." +for f in users/*.json; do + datasphere users create-bulk --file "$f" --send-invitations true --confirm +done + +echo "Deploying connections..." +for f in connections/*.json; do + datasphere connections create-bulk --file "$f" --confirm +done + +echo "Deployment complete!" 
+``` + +## Error Handling and Logging in CLI Operations + +### Enable Verbose Logging + +```bash +datasphere --log-level debug spaces create-bulk --file spaces.json +# Output includes detailed request/response logs +``` + +### Capture Structured Output + +```bash +datasphere spaces create-bulk --file spaces.json --output json > deployment_log.json +# Parse with jq for post-processing +jq '.results[] | select(.status == "FAILED")' deployment_log.json +``` + +### Common CLI Error Codes + +| Code | Error | Resolution | +|------|-------|-----------| +| 401 | Authentication failed | Verify service key credentials and expiry | +| 403 | Permission denied | Check role assignments and space membership | +| 409 | Conflict (object exists) | Use `--force-overwrite` or change name | +| 422 | Invalid configuration | Validate JSON schema and retry | +| 503 | Service unavailable | Retry with exponential backoff | + +### Retry Logic in Scripts + +```bash +#!/bin/bash +retry_with_backoff() { + local max_attempts=5 + local attempt=1 + local delay=2 + + while [ $attempt -le $max_attempts ]; do + if "$@"; then + return 0 + fi + + if [ $attempt -lt $max_attempts ]; then + echo "Attempt $attempt failed. Retrying in ${delay}s..." + sleep $delay + delay=$((delay * 2)) + fi + + attempt=$((attempt + 1)) + done + + return 1 +} + +retry_with_backoff datasphere spaces create-bulk --file spaces.json --confirm +``` + +--- + +## MCP Tools Reference + +This skill leverages these MCP (Model Context Protocol) tools for enhanced automation: + +- **`list_spaces`** — List all spaces with metadata +- **`get_space_info`** — Retrieve detailed space configuration +- **`list_database_users`** — Query user records from Datasphere database +- **`create_database_user`** — Create users programmatically (advanced) +- **`test_connection`** — Validate connection before deployment + +Use these tools in conjunction with CLI commands for end-to-end automation workflows. + +--- + +## Next Steps + +1. 
**Set up service key authentication** for your Datasphere instance +2. **Create JSON templates** for your spaces, users, and connections +3. **Validate configurations** with `--dry-run` before committing +4. **Integrate with your CI/CD pipeline** (GitHub Actions or Azure DevOps) +5. **Monitor and alert** on certificate expiry and user provisioning + +See **references/cli-reference.md** for complete command syntax, JSON schemas, and troubleshooting guides. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/references/cli-reference.md b/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/references/cli-reference.md new file mode 100644 index 0000000..c69062b --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-cli-automator/references/cli-reference.md @@ -0,0 +1,1174 @@ +# SAP Datasphere CLI Complete Reference + +## Table of Contents +1. [Configuration Commands](#configuration-commands) +2. [Space Management Commands](#space-management-commands) +3. [User Management Commands](#user-management-commands) +4. [Connection Management Commands](#connection-management-commands) +5. [Certificate Management Commands](#certificate-management-commands) +6. [JSON Schema Reference](#json-schema-reference) +7. [Bulk Operation Templates](#bulk-operation-templates) +8. [CI/CD Pipeline Examples](#cicd-pipeline-examples) +9. [Certificate Rotation Runbook](#certificate-rotation-runbook) +10. 
[Error Codes and Troubleshooting](#error-codes-and-troubleshooting) + +--- + +## Configuration Commands + +### Initialize Configuration + +```bash +datasphere config init \ + --client-id <client-id> \ + --client-secret <client-secret> \ + --instance-url <instance-url> \ + --auth-method [service-key|oauth|basic] +``` + +**Options**: +- `--client-id` — Service key client ID +- `--client-secret` — Service key secret +- `--instance-url` — Datasphere instance URL (e.g., https://ds.company.com) +- `--auth-method` — Authentication method (service-key recommended for automation) + +**Example**: +```bash +datasphere config init \ + --client-id "sb-cli-user-001" \ + --client-secret "xXkJbK9...L2mN" \ + --instance-url "https://datasphere.acme.com" \ + --auth-method service-key +``` + +### Validate Configuration + +```bash +datasphere config validate [--verbose] +``` + +**Output**: +``` +Configuration valid. +Connected to: https://datasphere.acme.com +Authenticated as: sb-cli-user-001 +Timeout: 30s +``` + +### List Configuration + +```bash +datasphere config list +``` + +### Set Configuration Options + +```bash +datasphere config set \ + --key <key> \ + --value <value> +``` + +**Common Keys**: +- `timeout` — Request timeout in seconds (default: 30) +- `retry-attempts` — Max retry attempts (default: 3) +- `log-level` — Log level (debug|info|warn|error) +- `output-format` — Default output format (json|table|csv) + +--- + +## Space Management Commands + +### Create Space + +```bash +datasphere spaces create \ + --name <space-name> \ + [--description <text>] \ + [--ram-allocation <gb>] \ + [--disk-allocation <gb>] \ + [--priority <low|standard|high>] \ + [--reserved-memory <gb>] \ + [--owner <email>] \ + [--tags <tag1,tag2>] \ + [--public-access true|false] +``` + +**Parameters**: +- `--name` — Space name (alphanumeric, max 128 chars) +- `--description` — Space description (optional) +- `--ram-allocation` — RAM in GB (min: 50, recommended: 100+) +- `--disk-allocation` — Disk in GB (min: 100, recommended: 5x data volume) +- `--priority` — `low` | `standard` | `high` +- `--reserved-memory` — Reserved RAM 
in GB (must be <= ram-allocation) +- `--owner` — Space owner email +- `--tags` — Comma-separated tags for organization +- `--public-access` — Enable/disable public access (true|false) + +**Example**: +```bash +datasphere spaces create \ + --name "SALES_ANALYTICS" \ + --description "Sales and Revenue Analytics" \ + --ram-allocation 100 \ + --disk-allocation 500 \ + --priority standard \ + --owner "sales-admin@acme.com" \ + --tags "production,analytics" +``` + +### Bulk Create Spaces + +```bash +datasphere spaces create-bulk \ + --file <file.json> \ + [--validate] \ + [--dry-run] \ + [--confirm] +``` + +**Flags**: +- `--validate` — Validate JSON schema without creating +- `--dry-run` — Show what would be created +- `--confirm` — Confirm and create spaces + +**Example**: +```bash +datasphere spaces create-bulk \ + --file spaces.json \ + --validate \ + --dry-run +``` + +### List Spaces + +```bash +datasphere spaces list \ + [--filter <status>] \ + [--sort-by <field>] \ + [--output <format>] +``` + +**Output Formats**: json, table, csv + +**Example**: +```bash +datasphere spaces list --output json | jq '.spaces[] | select(.status == "ACTIVE")' +``` + +### Get Space Details + +```bash +datasphere spaces get <space-name> \ + [--include-members] \ + [--include-objects] +``` + +**Example**: +```bash +datasphere spaces get SALES_ANALYTICS --include-members +``` + +### Clone Space + +```bash +datasphere spaces clone \ + --source <space-name> \ + --target <space-name> \ + [--copy-connections true|false] \ + [--copy-users true|false] \ + [--copy-objects true|false] +``` + +**Example**: +```bash +datasphere spaces clone \ + --source "PROD_TEMPLATE" \ + --target "DEV_ENVIRONMENT" \ + --copy-connections true \ + --copy-users false +``` + +### Update Space + +```bash +datasphere spaces update <space-name> \ + [--new-ram-allocation <gb>] \ + [--new-disk-allocation <gb>] \ + [--new-description <text>] \ + [--new-priority <low|standard|high>] \ + [--new-owner <email>] \ + [--new-tags <tag1,tag2>] +``` + +**Example**: +```bash +datasphere spaces update SALES_ANALYTICS \ + --new-ram-allocation 150 \ + 
--new-disk-allocation 750 \ + --new-priority high +``` + +### Delete Space + +```bash +datasphere spaces delete <space-name> \ + [--backup-path <path>] \ + [--force] +``` + +**Example**: +```bash +datasphere spaces delete TEMP_SPACE \ + --backup-path /backups/temp_space_backup.json \ + --force +``` + +### Space Quota Report + +```bash +datasphere spaces report \ + [--metric <metric>] \ + [--format <format>] \ + [--output-file <path>] +``` + +**Metrics**: cpu|memory|disk|concurrent-users + +**Example**: +```bash +datasphere spaces report --metric memory --format json > space_usage.json +``` + +--- + +## User Management Commands + +### Create User + +```bash +datasphere users create \ + --email <email> \ + --first-name <name> \ + --last-name <name> \ + [--roles <role1,role2>] \ + [--spaces <SPACE_NAME:ROLE,...>] \ + [--attributes <key:value,...>] \ + [--send-invitation true|false] +``` + +**Parameters**: +- `--email` — User email address (unique) +- `--first-name` — First name +- `--last-name` — Last name +- `--roles` — Comma-separated global roles +- `--spaces` — Space assignments with roles (format: SPACE_NAME:ROLE) +- `--attributes` — Custom attributes for organization +- `--send-invitation` — Send invitation email + +**Available Roles**: +- `datasphere.admin` — Full system administration +- `datasphere.analyst` — Analytics and modeling +- `datasphere.viewer` — Read-only access + +**Example**: +```bash +datasphere users create \ + --email "alice.johnson@acme.com" \ + --first-name "Alice" \ + --last-name "Johnson" \ + --roles "datasphere.analyst" \ + --spaces "SALES_ANALYTICS:editor,FINANCE_REPORTING:viewer" \ + --attributes "department:Sales,cost_center:CC-1001" \ + --send-invitation true +``` + +### Bulk Create Users + +```bash +datasphere users create-bulk \ + --file <file.json> \ + [--validate] \ + [--dry-run] \ + [--send-invitations true|false] \ + [--confirm] +``` + +**Example**: +```bash +datasphere users create-bulk \ + --file users.json \ + --validate \ + --dry-run +``` + +### List Users + +```bash +datasphere users list \ + [--filter <status>] \ + [--inactive-days <days>] \ + [--sort-by <field>] \ 
+ [--output <format>] +``` + +**Filters**: active|inactive|disabled|pending + +**Example**: +```bash +datasphere users list --inactive-days 90 --output csv > inactive_users.csv +``` + +### Get User Details + +```bash +datasphere users get <email> \ + [--include-roles] \ + [--include-space-assignments] +``` + +### Assign Role + +```bash +datasphere users assign-role \ + --email <email> \ + --role <role> \ + [--space <space-name>] \ + [--effective-date <YYYY-MM-DD>] +``` + +**Example**: +```bash +datasphere users assign-role \ + --email "alice.johnson@acme.com" \ + --role "datasphere.space_admin" \ + --space "SALES_ANALYTICS" +``` + +### Remove Role + +```bash +datasphere users remove-role \ + --email <email> \ + --role <role> \ + [--space <space-name>] +``` + +### Set User Attributes + +```bash +datasphere users set-attribute \ + --email <email> \ + --attribute <name> \ + --value <value> \ + [--attribute <name> --value <value> ...] +``` + +**Example**: +```bash +datasphere users set-attribute \ + --email "alice.johnson@acme.com" \ + --attribute "department" \ + --value "Sales" \ + --attribute "cost_center" \ + --value "CC-1001" +``` + +### Batch Update User Attributes + +```bash +datasphere users batch-attributes \ + --file <file.json> +``` + +**File Format**: +```json +{ + "updates": [ + { + "email": "alice.johnson@acme.com", + "attributes": { + "department": "Sales", + "manager": "bob.smith@acme.com" + } + } + ] +} +``` + +### Disable User + +```bash +datasphere users disable \ + --email <email> \ + [--reason <text>] \ + [--effective-date <YYYY-MM-DD>] +``` + +**Example**: +```bash +datasphere users disable \ + --email "alice.johnson@acme.com" \ + --reason "Employee departure" \ + --effective-date "2024-03-15" +``` + +### Delete User + +```bash +datasphere users delete \ + --email <email> \ + [--transfer-owned-objects-to <email>] \ + [--force] +``` + +**Important**: Deletes user and reassigns owned objects. 
+ +**Example**: +```bash +datasphere users delete \ + --email "alice.johnson@acme.com" \ + --transfer-owned-objects-to "bob.smith@acme.com" \ + --force +``` + +--- + +## Connection Management Commands + +### Create Connection + +```bash +datasphere connections create \ + --name <name> \ + --type <type> \ + [--description <text>] \ + [--connection-file <file.json>] +``` + +**Connection Types**: sap_s4hana, sap_bw, snowflake, postgresql, oracle, mysql, kafka, etc. + +**Example**: +```bash +datasphere connections create \ + --name "PROD_SAP_S4" \ + --type "sap_s4hana" \ + --connection-file s4h_connection.json +``` + +### Bulk Create Connections + +```bash +datasphere connections create-bulk \ + --file <file.json> \ + [--validate-credentials] \ + [--dry-run] \ + [--confirm] +``` + +**Example**: +```bash +datasphere connections create-bulk \ + --file connections.json \ + --validate-credentials \ + --dry-run +``` + +### List Connections + +```bash +datasphere connections list \ + [--filter <status>] \ + [--type <type>] \ + [--output <format>] +``` + +**Filters**: active|inactive|test-failed|expiring + +### Get Connection Details + +```bash +datasphere connections get <name> \ + [--include-test-results] +``` + +### Test Connection + +```bash +datasphere connections test \ + --name <name> \ + [--verbose] \ + [--test-query <sql>] +``` + +**Example**: +```bash +datasphere connections test \ + --name "PROD_SAP_S4" \ + --verbose +``` + +### Batch Test Connections + +```bash +datasphere connections test-batch \ + --file <file.json> \ + [--generate-report <report.html>] \ + [--timeout <seconds>] +``` + +### Update Connection + +```bash +datasphere connections update \ + --name <name> \ + --connection-file <file.json> +``` + +### Delete Connection + +```bash +datasphere connections delete \ + --name <name> \ + [--force] +``` + +--- + +## Certificate Management Commands + +### List Certificates + +```bash +datasphere configuration certificates list \ + [--show-expiry] \ + [--sort-by <field>] \ + [--output <format>] +``` + +**Fields for Sort**: name|created|expiry|status + +**Example**: +```bash +datasphere configuration certificates 
list \ + --show-expiry \ + --sort-by expiry \ + --output table +``` + +### Upload Certificate + +```bash +datasphere configuration certificates upload \ + --name <name> \ + --certificate-file <cert.pem> \ + --key-file <key.pem> \ + [--description <text>] \ + [--scheduled-activation <ISO-8601-timestamp>] +``` + +**File Format**: PEM (Privacy Enhanced Mail) + +**Example**: +```bash +datasphere configuration certificates upload \ + --name "PROD_SAP_S4_CERT" \ + --certificate-file "/secure/cert.pem" \ + --key-file "/secure/key.pem" \ + --description "Production SAP S/4HANA Certificate" \ + --scheduled-activation "2024-03-01T00:00:00Z" +``` + +### Activate Certificate + +```bash +datasphere configuration certificates activate \ + --name <name> +``` + +### Deactivate Certificate + +```bash +datasphere configuration certificates deactivate \ + --name <name> +``` + +### Delete Certificate + +```bash +datasphere configuration certificates delete \ + --name <name> \ + [--force] +``` + +### Get Certificate Details + +```bash +datasphere configuration certificates get <name> +``` + +--- + +## JSON Schema Reference + +### Space Definition Schema + +```json +{ + "spaces": [ + { + "name": "string (required, alphanumeric max 128)", + "description": "string (optional)", + "configuration": { + "memory": { + "allocated_gb": "integer (min: 50, max: 10000)", + "reserved_gb": "integer (optional, <= allocated_gb)" + }, + "disk": { + "allocated_gb": "integer (min: 100, max: 100000)" + }, + "priority": "string (low|standard|high)", + "network": { + "enable_public_access": "boolean", + "data_isolation_level": "string (tenant|shared|isolated)" + } + }, + "owner": "string (email, required)", + "tags": "array of strings (optional)", + "labels": { + "custom_key": "custom_value" + } + } + ] +} +``` + +### User Provisioning Schema + +```json +{ + "users": [ + { + "email": "string (required, unique)", + "first_name": "string (required)", + "last_name": "string (required)", + "roles": [ + { + "role": "string (datasphere.admin|datasphere.analyst|datasphere.viewer)", + "scope": 
"string (global|space)", + "effective_date": "string (ISO 8601, optional)" + } + ], + "space_assignments": [ + { + "space_name": "string (required)", + "role": "string (space_admin|editor|viewer)", + "effective_date": "string (ISO 8601, optional)" + } + ], + "attributes": { + "department": "string (optional)", + "cost_center": "string (optional)", + "manager": "string (email, optional)", + "custom_field": "string (optional)" + }, + "status": "string (active|inactive|pending)" + } + ] +} +``` + +### Connection Definition Schema + +```json +{ + "connections": [ + { + "name": "string (required)", + "type": "string (required: sap_s4hana|snowflake|postgresql|etc)", + "description": "string (optional)", + "technical_user": "string (optional)", + "connection_details": { + "host": "string", + "port": "integer", + "client": "string (for SAP systems)", + "use_ssl": "boolean", + "tls_version": "string (1.2|1.3)", + "timeout_seconds": "integer (default: 30)" + }, + "authentication": { + "method": "string (basic|oauth|certificate|kerberos)", + "username": "string (or use variable)", + "username_variable": "string (env var name)", + "password_variable": "string (env var name)", + "client_id_variable": "string (for OAuth)", + "client_secret_variable": "string (for OAuth)", + "token_endpoint": "string (for OAuth)" + }, + "test_table": "string (optional, for validation)", + "test_query": "string (optional, for validation)", + "retry_policy": { + "max_attempts": "integer (default: 3)", + "backoff_seconds": "integer (default: 5)", + "backoff_multiplier": "float (default: 2.0)" + }, + "owner": "string (email)", + "tags": "array of strings (optional)" + } + ] +} +``` + +--- + +## Bulk Operation Templates + +### Template 1: 50-User Onboarding + +```json +{ + "users": [ + { + "email": "user001@acme.com", + "first_name": "John", + "last_name": "Doe", + "roles": [{"role": "datasphere.analyst", "scope": "global"}], + "space_assignments": [ + {"space_name": "SALES_ANALYTICS", "role": 
"editor"}, + {"space_name": "FINANCE_REPORTING", "role": "viewer"} + ], + "attributes": { + "department": "Sales", + "cost_center": "CC-1001", + "manager": "admin@acme.com" + }, + "status": "active" + }, + { + "email": "user002@acme.com", + "first_name": "Jane", + "last_name": "Smith", + "roles": [{"role": "datasphere.analyst", "scope": "global"}], + "space_assignments": [ + {"space_name": "FINANCE_REPORTING", "role": "editor"} + ], + "attributes": { + "department": "Finance", + "cost_center": "CC-2001", + "manager": "admin@acme.com" + }, + "status": "active" + } + ] +} +``` + +### Template 2: 5-Space Environment Setup + +```json +{ + "spaces": [ + { + "name": "INBOUND_LAYER", + "description": "Raw data ingestion from source systems", + "configuration": { + "memory": {"allocated_gb": 80, "reserved_gb": 40}, + "disk": {"allocated_gb": 2000}, + "priority": "standard" + }, + "owner": "data-admin@acme.com", + "tags": ["lsa_plus", "inbound"] + }, + { + "name": "HARMONIZATION_LAYER", + "description": "Data cleansing and standardization", + "configuration": { + "memory": {"allocated_gb": 100, "reserved_gb": 50}, + "disk": {"allocated_gb": 2500}, + "priority": "standard" + }, + "owner": "data-admin@acme.com", + "tags": ["lsa_plus", "harmonization"] + }, + { + "name": "REPORTING_LAYER", + "description": "Analytics and reporting views", + "configuration": { + "memory": {"allocated_gb": 150, "reserved_gb": 75}, + "disk": {"allocated_gb": 3000}, + "priority": "high" + }, + "owner": "analytics-admin@acme.com", + "tags": ["lsa_plus", "reporting"] + } + ] +} +``` + +### Template 3: Multi-System Connection Setup + +```json +{ + "connections": [ + { + "name": "SAP_S4H_PROD", + "type": "sap_s4hana", + "description": "Production SAP S/4HANA", + "connection_details": { + "host": "s4h-prod.acme.com", + "port": 50013, + "client": "100", + "use_ssl": true, + "tls_version": "1.2" + }, + "authentication": { + "method": "basic", + "username_variable": "SAP_USER", + "password_variable": 
"SAP_PASS" + }, + "test_table": "MARA", + "owner": "integration-admin@acme.com" + }, + { + "name": "SAP_BW_PROD", + "type": "sap_bw", + "description": "Production SAP BW", + "connection_details": { + "host": "bw-prod.acme.com", + "port": 50013, + "client": "100", + "use_ssl": true + }, + "authentication": { + "method": "basic", + "username_variable": "BW_USER", + "password_variable": "BW_PASS" + }, + "owner": "integration-admin@acme.com" + }, + { + "name": "SNOWFLAKE_PROD", + "type": "snowflake", + "description": "Production Snowflake Warehouse", + "connection_details": { + "account_identifier": "xy12345.us-east-1", + "warehouse": "COMPUTE_WH", + "database": "PROD_DB", + "schema": "STAGING" + }, + "authentication": { + "method": "oauth", + "client_id_variable": "SF_CLIENT_ID", + "client_secret_variable": "SF_CLIENT_SECRET" + }, + "test_query": "SELECT 1", + "owner": "data-team@acme.com" + } + ] +} +``` + +--- + +## CI/CD Pipeline Examples + +### GitHub Actions: Full Datasphere Deployment + +```yaml +name: Deploy Datasphere Config + +on: + push: + branches: [main] + paths: ['datasphere/**'] + pull_request: + branches: [main] + paths: ['datasphere/**'] + +jobs: + validate: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Install Datasphere CLI + run: | + curl -sL https://releases.datasphere.company.com/cli/install.sh | bash + datasphere --version + + - name: Configure CLI + env: + DATASPHERE_CLIENT_ID: ${{ secrets.DATASPHERE_CLIENT_ID }} + DATASPHERE_CLIENT_SECRET: ${{ secrets.DATASPHERE_CLIENT_SECRET }} + DATASPHERE_INSTANCE_URL: ${{ secrets.DATASPHERE_INSTANCE_URL }} + run: datasphere config init --auth-method service-key + + - name: Validate Spaces Configuration + run: datasphere spaces create-bulk --file datasphere/spaces.json --validate --dry-run + + - name: Validate Users Configuration + run: datasphere users create-bulk --file datasphere/users.json --validate --dry-run + + - name: Validate Connections Configuration + run: datasphere 
connections create-bulk --file datasphere/connections.json --validate --dry-run + + - name: Test Connections (Dry Run) + run: datasphere connections test-batch --file datasphere/connections.json --timeout 60 + + deploy: + needs: validate + runs-on: ubuntu-latest + if: github.event_name == 'push' && github.ref == 'refs/heads/main' + steps: + - uses: actions/checkout@v3 + + - name: Install Datasphere CLI + run: | + curl -sL https://releases.datasphere.company.com/cli/install.sh | bash + + - name: Configure CLI + env: + DATASPHERE_CLIENT_ID: ${{ secrets.DATASPHERE_CLIENT_ID }} + DATASPHERE_CLIENT_SECRET: ${{ secrets.DATASPHERE_CLIENT_SECRET }} + DATASPHERE_INSTANCE_URL: ${{ secrets.DATASPHERE_INSTANCE_URL }} + run: datasphere config init --auth-method service-key + + - name: Deploy Spaces + run: datasphere spaces create-bulk --file datasphere/spaces.json --confirm + + - name: Deploy Connections + run: datasphere connections create-bulk --file datasphere/connections.json --confirm + + - name: Deploy Users + run: datasphere users create-bulk --file datasphere/users.json --send-invitations true --confirm + + - name: Post-Deployment Tests + run: | + datasphere connections test-batch --file datasphere/connections.json \ + --generate-report deployment_report.html + + - name: Upload Deployment Report + if: always() + uses: actions/upload-artifact@v3 + with: + name: deployment-report + path: deployment_report.html + retention-days: 30 + + - name: Slack Notification + if: always() + uses: slackapi/slack-github-action@v1.24.0 + with: + payload: | + { + "text": "Datasphere Deployment: ${{ job.status }}", + "blocks": [ + { + "type": "section", + "text": { + "type": "mrkdwn", + "text": "*Datasphere Config Deployment*\nStatus: ${{ job.status }}\nCommit: <${{ github.server_url }}/${{ github.repository }}/commit/${{ github.sha }}|${{ github.sha }}>" + } + } + ] + } + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} +``` + +--- + +## Certificate Rotation Runbook + +### 
Pre-Rotation Checklist + +- [ ] Obtain new TLS certificate and private key (PEM format) +- [ ] Verify certificate validity dates and CN/SANs +- [ ] Confirm backup of current certificate +- [ ] Schedule maintenance window (off-peak) +- [ ] Notify stakeholders of potential brief service interruption + +### Rotation Steps + +**1. Validate New Certificate** + +```bash +# Check certificate details +openssl x509 -in /path/to/new_cert.pem -text -noout + +# Verify certificate chain +openssl verify -CAfile /path/to/ca_bundle.pem /path/to/new_cert.pem + +# Verify key matches certificate +openssl x509 -noout -modulus -in /path/to/new_cert.pem | openssl md5 +openssl rsa -noout -modulus -in /path/to/key.pem | openssl md5 +# Both commands should return same MD5 hash +``` + +**2. Backup Current Certificate** + +```bash +datasphere configuration certificates list --show-expiry > cert_backup_$(date +%Y%m%d).json +``` + +**3. Upload New Certificate** + +```bash +datasphere configuration certificates upload \ + --name "PROD_SAP_S4_CERT_NEW" \ + --certificate-file "/secure/new_cert.pem" \ + --key-file "/secure/new_key.pem" \ + --description "New Production SAP S/4HANA Certificate (Rotation on 2024-03-01)" \ + --scheduled-activation "2024-03-01T02:00:00Z" +``` + +**4. Verify Upload** + +```bash +datasphere configuration certificates get PROD_SAP_S4_CERT_NEW +# Verify: Status = PENDING_ACTIVATION, Expiry date correct +``` + +**5. During Maintenance Window - Activate New Certificate** + +```bash +datasphere configuration certificates activate --name "PROD_SAP_S4_CERT_NEW" +``` + +**6. Verify Connections After Rotation** + +```bash +datasphere connections test-batch --file connections.json --timeout 60 --generate-report rotation_test.html +``` + +**7. Deactivate Old Certificate (after 24-hour observation)** + +```bash +datasphere configuration certificates deactivate --name "PROD_SAP_S4_CERT" +``` + +**8. 
Archive Old Certificate (after 30-day retention period)** + +```bash +datasphere configuration certificates delete \ + --name "PROD_SAP_S4_CERT" \ + --force +``` + +### Rollback Procedure (If Issues Occur) + +```bash +# 1. Revert to old certificate +datasphere configuration certificates activate --name "PROD_SAP_S4_CERT" + +# 2. Deactivate new certificate +datasphere configuration certificates deactivate --name "PROD_SAP_S4_CERT_NEW" + +# 3. Verify connections +datasphere connections test-batch --file connections.json + +# 4. Notify stakeholders +``` + +--- + +## Error Codes and Troubleshooting + +### Authentication Errors + +| Code | Error | Cause | Resolution | +|------|-------|-------|-----------| +| 401 | Unauthorized | Invalid credentials or expired token | Refresh service key, verify client ID/secret | +| 403 | Forbidden | Insufficient permissions | Check service key roles and scopes | +| 405 | Method Not Allowed | Wrong HTTP method (CLI bug) | Update CLI to latest version | + +### Space Management Errors + +| Code | Error | Cause | Resolution | +|------|-------|-------|-----------| +| 409 | Space already exists | Name conflict | Use different space name or `--force-overwrite` | +| 422 | Invalid space name | Name contains invalid characters | Use alphanumeric + underscore only | +| 507 | Insufficient space quota | Organization limit exceeded | Contact SAP support for quota increase | +| 400 | Invalid memory allocation | Below minimum (50 GB) | Increase memory allocation | + +### User Management Errors + +| Code | Error | Cause | Resolution | +|------|-------|-------|-----------| +| 409 | User already exists | Email already in system | Check existing users or use different email | +| 422 | Invalid email format | Email not properly formatted | Verify email syntax | +| 400 | Invalid role | Role doesn't exist | Check available roles: datasphere.admin, datasphere.analyst, or datasphere.viewer | +| 404 | Space not found | Space doesn't exist | Verify space name and create space first | +
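For scripted bulk operations it helps to separate codes worth retrying from codes that signal a configuration problem. Below is a minimal Python sketch of that split, derived from the code tables in this section; the helper and its retry threshold are illustrative, not part of any official Datasphere SDK or CLI:

```python
# Transient codes (timeouts, unreachable targets) may succeed on retry;
# everything else in the tables points at a configuration problem that a
# retry will not fix. The mapping mirrors the code tables in this section.
RETRYABLE = {408, 503}
PERMANENT = {400, 401, 403, 404, 405, 409, 410, 422, 507}

def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Return True when another attempt is worthwhile for this status code."""
    if status_code in PERMANENT:
        return False  # fix credentials, names, or quotas instead of retrying
    return status_code in RETRYABLE and attempt < max_attempts
```

For example, `should_retry(503, attempt=1)` returns `True`, while a `401` always returns `False` because a refreshed service key, not a retry, is the fix.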
+### Connection Errors + +| Code | Error | Cause | Resolution | +|------|-------|-------|-----------| +| 400 | Invalid connection type | Unsupported system type | Check supported types: sap_s4hana, snowflake, etc | +| 422 | Invalid connection details | Missing required fields | Verify all required fields in JSON | +| 503 | Connection test failed | Target system unreachable | Verify host, port, network connectivity | +| 401 | Authentication failed | Invalid credentials | Verify username/password or OAuth token | +| 408 | Timeout | Response too slow | Increase timeout or check target system performance | + +### Certificate Errors + +| Code | Error | Cause | Resolution | +|------|-------|-------|-----------| +| 400 | Invalid certificate | Certificate format or content issue | Verify PEM format, expiry date, key validity | +| 409 | Certificate already exists | Name conflicts | Use unique certificate name | +| 422 | Certificate expired | Certificate no longer valid | Provide valid, non-expired certificate | +| 410 | Cannot deactivate active cert | Cert in use by connections | Activate replacement cert first | + +### Common Troubleshooting Steps + +**1. Enable Debug Logging** + +```bash +datasphere --log-level debug [COMMAND] +# Output shows detailed request/response for debugging +``` + +**2. Validate Configuration JSON** + +```bash +# Use jq to validate JSON syntax +jq . spaces.json > /dev/null && echo "Valid JSON" || echo "Invalid JSON" + +# Validate against schema +datasphere [spaces|users|connections] create-bulk --file [FILE] --validate +``` + +**3. Check CLI Version** + +```bash +datasphere --version +# Update if outdated +curl -sL https://releases.datasphere.company.com/cli/install.sh | bash +``` + +**4. Verify Connectivity** + +```bash +# Test Datasphere instance accessibility +curl -I https://datasphere.acme.com/health +# Should return HTTP 200 OK + +# Verify service key authentication +datasphere config validate +``` + +**5. 
Review Server Logs** + +Request logs from Datasphere instance administrator: +- `/var/log/datasphere/cli-requests.log` +- `/var/log/datasphere/admin.log` +- Include request ID from error output + +--- + +## Additional Resources + +- **Official CLI Documentation**: https://help.sap.com/datasphere/cli +- **API Reference**: https://api.datasphere.company.com/docs +- **Support Portal**: https://support.sap.com/datasphere +- **Community**: https://community.sap.com/datasphere diff --git a/partner-built/SAP-Datasphere/skills/datasphere-connections/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-connections/SKILL.md new file mode 100644 index 0000000..0829739 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-connections/SKILL.md @@ -0,0 +1,194 @@ +--- +name: datasphere-connections +description: SAP Datasphere connections skill for creating and managing data source connections. Use when configuring connections to SAP systems (S/4HANA, BW, ECC), cloud databases (BigQuery, Redshift, Azure SQL), or other data sources for use in views, data flows, and replication. +--- + +# SAP Datasphere Connections + +Skill for creating and managing connections to data sources in SAP Datasphere. Connections enable access to SAP and non-SAP systems for data federation, replication, and ETL operations. + +## Navigation Overview + +**Path:** Left Menu → Connections → `#/connections` + +Connections are space-scoped - you must select a space before viewing or creating connections. 
+ +## Connection Types (35 Available) + +### SAP Sources +| Connection Type | Features | Use Case | +|----------------|----------|----------| +| SAP ABAP | Remote Tables, Replication, Data Flows | Connect to SAP ERP systems | +| SAP S/4HANA Cloud | Remote Tables, Replication, Data Flows | S/4HANA Cloud integration | +| SAP S/4HANA On-Premise | Remote Tables, Replication, Data Flows | On-prem S/4HANA systems | +| SAP BW | Remote Tables, Model Import | BW/4HANA integration | +| SAP BW/4HANA Model Transfer | Model Import | Import BW models | +| SAP ECC | Remote Tables, Replication | Legacy ECC systems | +| SAP HANA | Remote Tables, Replication, Data Flows | Direct HANA connectivity | +| SAP SuccessFactors | Data Flows | HR data integration | +| SAP Fieldglass | Data Flows | Vendor management data | +| SAP Marketing Cloud | Data Flows | Marketing data | +| SAP Signavio | Data Flows | Process mining data | +| Cloud Data Integration | Data Flows | SAP CDI sources | + +### Cloud Databases +| Connection Type | Features | Use Case | +|----------------|----------|----------| +| Google BigQuery | Remote Tables, Data Flows | GCP analytics | +| Amazon Redshift | Remote Tables, Data Flows | AWS data warehouse | +| Amazon Athena | Remote Tables, Data Flows | S3 query service | +| Microsoft Azure SQL Database | Remote Tables, Data Flows | Azure SQL | +| Microsoft Azure Data Lake Gen2 | Data Flows | Azure storage | +| Microsoft Azure Blob Storage | Data Flows | Azure blobs | +| Microsoft OneLake | Data Flows | Fabric integration | +| Oracle | Remote Tables, Data Flows | Oracle databases | + +### Storage & Streaming +| Connection Type | Features | Use Case | +|----------------|----------|----------| +| Amazon S3 | Data Flows | AWS object storage | +| Google Cloud Storage | Data Flows | GCP object storage | +| Generic SFTP | Data Flows | File transfers | +| Apache Kafka | Data Flows | Event streaming | +| Confluent | Data Flows | Managed Kafka | + +### Generic Connectors +| 
Connection Type | Features | Use Case | +|----------------|----------|----------| +| Generic JDBC | Remote Tables, Data Flows | Any JDBC source | +| Generic OData | Remote Tables, Data Flows | OData services | +| Generic HTTP | API Tasks | REST APIs | +| Open Connectors | Data Flows | Third-party via SAP Open Connectors | + +## Creating a Connection + +### Step-by-Step Process + +1. **Navigate to Connections** + - Left Menu → Connections + - Select target space from the space cards + +2. **Start Connection Wizard** + - Click **+** dropdown → **Create Connection** + - Wizard opens with 3 steps + +3. **Step 1: Choose Connection Type** + - Use filters to narrow options: + - Features: API Tasks, Data Flows, Model Import, Remote Tables, Replication Flows + - Categories: Cloud, On-Premise + - Sources: Non-SAP, Partner Tools, SAP + - Click on desired connection type tile + +4. **Step 2: Configure Connection Properties** + - **Connection Details:** + - Category (Cloud/On-Premise) + - Host (server address) + - Port (service port) + - **Authentication:** Select method + - User Name and Password + - X.509 Client Certificate + - OAuth 2.0 + - **Features:** + - Remote Tables: Enable/disable virtual access + - Data Provisioning: Direct or via DP Agent + - Data Access: Remote, Replication, or both + - Click **Next Step** + +5. 
**Step 3: Enter Name and Description** + - Business Name (display name) + - Technical Name (system identifier) + - Description (optional) + - Click **Create** + +### Authentication Methods + +| Method | When to Use | +|--------|-------------| +| User Name and Password | Basic auth, service accounts | +| X.509 Client Certificate | Certificate-based, high security | +| OAuth 2.0 | Cloud services, SSO integration | +| SAP Assertion Ticket | SAP-to-SAP trusted communication | + +## Managing Connections + +### Connection Actions +| Action | Description | +|--------|-------------| +| Edit | Modify connection properties | +| Delete | Remove connection (requires no dependencies) | +| Validate | Test connection connectivity | +| Pause | Temporarily disable replication | +| Restart | Resume paused replication | + +### Connection Status +- **Connected:** Active and working +- **Disconnected:** Configuration issue or source unavailable +- **Paused:** Manually paused replication + +## Remote Tables + +When a connection supports Remote Tables: + +1. **In Data Builder:** + - Open a space in Data Builder + - Click **Sources** panel + - Expand connection to see available tables + - Drag table to canvas → creates Remote Table + +2. **Remote Table Options:** + - **Federation:** Virtual access (query on demand) + - **Replication:** Copy data to local storage + - **Snapshot:** Point-in-time copy + +### Replication Settings +- **Real-Time:** Continuous change capture +- **Scheduled:** Periodic full/delta loads +- **Manual:** On-demand refresh + +## SAP Open Connectors + +For third-party data sources not directly supported: + +1. Navigate to Connections → SAP Open Connectors tab +2. Click **Integrate your SAP Open Connectors Account** +3. Configure SAP Open Connectors instance +4. 
Access 150+ pre-built connectors + +## Best Practices + +### Connection Naming +- Use descriptive business names +- Include environment indicator (DEV, TEST, PROD) +- Example: "S4HANA_Finance_PROD" + +### Security +- Use service accounts, not personal credentials +- Rotate credentials regularly +- Use certificates where supported +- Limit connection access via space membership + +### Performance +- Enable replication for frequently accessed data +- Use federation for large, infrequently queried tables +- Consider data freshness requirements + +## Troubleshooting + +### Connection Fails to Validate +1. Verify host/port are correct +2. Check firewall rules (Cloud Connector if on-prem) +3. Validate credentials +4. Test network connectivity + +### Replication Errors +1. Check Data Integration Monitor for details +2. Verify source system availability +3. Review space storage capacity +4. Check for schema changes in source + +## Resources + +See reference files for detailed procedures: +- `references/connection-types.md` - Detailed connection type configurations +- `references/authentication.md` - Authentication setup guides +- `references/troubleshooting-guide.md` - Cloud Connector path configuration, Data Provisioning Agent troubleshooting, CORS setup, CSN Exposure prerequisites, OData/ODBC diagnostics diff --git a/partner-built/SAP-Datasphere/skills/datasphere-connections/references/authentication.md b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/authentication.md new file mode 100644 index 0000000..67d1d70 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/authentication.md @@ -0,0 +1,120 @@ +# Authentication Guide + +## User Name and Password + +### When to Use +- Simple authentication requirements +- Service accounts with static credentials +- Development/test environments + +### Configuration +1. Enter username +2. Enter password +3. 
Optionally save credentials + +### Security Considerations +- Use dedicated service accounts +- Avoid personal user credentials +- Implement password rotation policy + +--- + +## X.509 Client Certificate + +### When to Use +- High-security requirements +- Mutual TLS authentication +- SAP Cloud Platform services + +### Prerequisites +- Valid X.509 certificate (PEM format) +- Corresponding private key +- Certificate must be trusted by target system + +### Configuration Steps +1. Select "X.509 Client Certificate" as authentication type +2. Upload or paste certificate content +3. Upload or paste private key +4. Validate connection + +### Certificate Requirements +- RSA 2048-bit or higher +- Valid date range +- Proper chain of trust + +--- + +## OAuth 2.0 + +### When to Use +- Cloud services (S/4HANA Cloud, SuccessFactors) +- Token-based authentication +- SSO integration scenarios + +### OAuth Grant Types + +#### Client Credentials +Best for service-to-service communication: +1. Obtain Client ID and Secret from service key +2. Configure Token URL +3. System automatically refreshes tokens + +#### Authorization Code +For user-delegated access: +1. Configure authorization endpoint +2. User authenticates via browser +3. System stores refresh token + +### Configuration Properties +| Property | Description | +|----------|-------------| +| Client ID | OAuth application identifier | +| Client Secret | Application secret | +| Token URL | Endpoint for token exchange | +| Scope | Optional: requested permissions | + +--- + +## SAP Cloud Connector + +### When to Use +- On-premise system access +- Firewall traversal +- Secure tunnel to cloud + +### Setup Overview +1. Install Cloud Connector on-premise +2. Configure subaccount connection +3. Add virtual host mapping +4. 
Test connectivity from Datasphere + +### Connection Configuration +- Use virtual host name (not real hostname) +- Virtual port mapped by Cloud Connector +- Datasphere connects to Cloud Connector endpoint + +--- + +## Troubleshooting Authentication + +### Common Issues + +#### Invalid Credentials +- Verify username/password +- Check account lock status +- Confirm service account permissions + +#### Certificate Errors +- Validate certificate expiration +- Check certificate format (PEM required) +- Verify trust chain + +#### OAuth Token Failures +- Confirm client credentials +- Check token URL accessibility +- Verify scope permissions + +#### Network Issues +- Test connectivity to host/port +- Check Cloud Connector status (on-prem) +- Verify firewall rules diff --git a/partner-built/SAP-Datasphere/skills/datasphere-connections/references/connection-types.md b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/connection-types.md new file mode 100644 index 0000000..5eb6d7b --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/connection-types.md @@ -0,0 +1,107 @@ +# Connection Types Reference + +## SAP ABAP Connection + +### Configuration Properties +| Property | Required | Description | +|----------|----------|-------------| +| Category | Yes | Cloud or On-Premise | +| Host | Yes | Application server hostname | +| Port | Yes | Service port (typically 443 for Cloud) | +| Client | Yes | SAP client number | +| Language | No | Login language (EN, DE, etc.) 
| + +### Authentication Options +- User Name and Password +- X.509 Client Certificate + +### Supported Features +- Remote Tables +- Replication Flows +- Data Flows + +--- + +## SAP S/4HANA Cloud + +### Configuration Properties +| Property | Required | Description | +|----------|----------|-------------| +| Host | Yes | S/4HANA Cloud tenant URL | +| Authentication Type | Yes | OAuth 2.0 (recommended) | +| OAuth Client ID | Yes | Service key client ID | +| OAuth Client Secret | Yes | Service key secret | +| Token URL | Yes | OAuth token endpoint | + +### Supported Features +- Remote Tables (CDS Views) +- Replication Flows (ODP extraction) +- Data Flows + +--- + +## SAP HANA + +### Configuration Properties +| Property | Required | Description | +|----------|----------|-------------| +| Category | Yes | Cloud or On-Premise | +| Host | Yes | HANA server hostname | +| Port | Yes | SQL port (e.g., 443 for Cloud, 30015 for on-prem) | + +### Authentication Options +- User Name and Password +- X.509 Client Certificate + +### Features Configuration +| Feature | Options | +|---------|---------| +| Remote Tables | Enabled (default) | +| Data Provisioning | Direct | +| Data Access | Remote and Replication | + +--- + +## Google BigQuery + +### Configuration Properties +| Property | Required | Description | +|----------|----------|-------------| +| Project ID | Yes | GCP project identifier | +| Dataset | No | Default dataset | + +### Authentication +- Service Account JSON key + +### Supported Features +- Remote Tables +- Data Flows + +--- + +## Amazon Redshift + +### Configuration Properties +| Property | Required | Description | +|----------|----------|-------------| +| Host | Yes | Redshift cluster endpoint | +| Port | Yes | Database port (default: 5439) | +| Database | Yes | Database name | + +### Authentication +- User Name and Password +- IAM Authentication + +--- + +## Generic JDBC + +### Configuration Properties +| Property | Required | Description | 
+|----------|----------|-------------| +| JDBC URL | Yes | Full JDBC connection string | +| Driver Class | Yes | JDBC driver class name | + +### Notes +- Requires JDBC driver upload to Data Provisioning Agent +- For on-premise sources only diff --git a/partner-built/SAP-Datasphere/skills/datasphere-connections/references/troubleshooting-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/troubleshooting-guide.md new file mode 100644 index 0000000..306b7d3 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-connections/references/troubleshooting-guide.md @@ -0,0 +1,143 @@ +# Connection Troubleshooting Guide + +## Cloud Connector Troubleshooting + +### Required Access Paths for S/4HANA On-Premise +When configuring Cloud Connector for Datasphere, these paths must be allowed: + +| Path | Access Type | Purpose | +|------|------------|---------| +| /sap/bw/ina | Path and all sub-paths | InA protocol for model transfer and metadata | +| /sap/opu/odata/sap/ESH_SEARCH_SRV/* | All sub-paths | Enterprise Search for entity discovery | +| /sap/opu/odata4/sap/csn_exposure_v4 | Path and all sub-paths | CSN Exposure for model import | +| /sap/bw4/v1/dwc/* | All sub-paths | BW/4HANA model transfer (if applicable) | + +### Common Cloud Connector Issues +1. **Path not configured**: Connection validates but model import or replication fails — check that ALL required paths are added, not just the base URL +2. **Header configuration missing**: CORS requests fail — ensure proper header forwarding in Cloud Connector access control +3. **SSL/TLS certificate errors**: Certificate chain incomplete or expired — check trust store in Cloud Connector administration +4. **Authentication failures**: Verify backend user credentials in Cloud Connector, check password expiry +5. 
**Location ID mismatch**: If multiple Cloud Connectors exist, ensure Datasphere connection references the correct Location ID + +### Validation Checklist +- Connection shows "Replication flows are enabled" ✓ +- Connection shows "Data flows are enabled" ✓ +- Model Import shows either enabled or disabled (not error) ✓ +- Remote tables shows status ✓ +- If validation fails: SAP KBA 3369433 + +## Data Provisioning Agent Troubleshooting + +### Installation Prerequisites +- Version 2.6.3 or higher required for latest Datasphere features +- Java Runtime Environment (JRE) installed +- Network access from agent to both source system and Datasphere + +### IP Allowlisting +- Agent machine's IPv4 address (public IP) must be added to Datasphere IP allowlist +- System → Configuration → IP Allowlist +- See SAP KBA 3276488 for allowlist configuration +- See SAP KBA 2894588 for IP allowlist details + +### Agent Registration +1. Install agent on a machine with network access to source system +2. Configure agent connection to Datasphere +3. Register required adapters (SAP HANA, ABAP, etc.) +4. Verify registration in Datasphere → System → Configuration → Data Provisioning Agents + +### Common DP Agent Issues +1. **Agent not visible in Datasphere**: Check IP allowlist, verify agent is running, check network connectivity +2. **Adapter registration fails**: Verify adapter compatibility, check agent logs +3. **JCO exceptions**: Enable JCO trace in DP Agent for detailed connection diagnostics (see SAP Note 2938870) +4. **Connection timeouts**: Check firewall rules between agent machine and source system +5. 
**Agent upgrade issues**: Stop agent before upgrade, verify Java compatibility after upgrade + +### DP Agent Log Analysis +- Agent logs located in the agent installation directory +- Enable JCO trace for SAP system connectivity issues +- Check for connection pool exhaustion on high-throughput scenarios +- SAP Note 3196950 for comprehensive DP Agent troubleshooting + +## CORS Configuration (for S/4HANA Backend) + +### Temporary Enable (testing) +1. Transaction RZ11 → Parameter: icf/cors_enabled → Set to 1 +2. Requires ABAP AS restart to take effect + +### Permanent Enable (production) +1. Transaction rz10 → Profile: DEFAULT → Extended Maintenance +2. Create parameter: icf/cors_enabled = 1 +3. Restart ABAP Application Server + +### CORS Allowlist +1. Transaction UCONCOCKPIT +2. HTTP Allowlist Scenario +3. Add service path (e.g., /sap/bw/ina) +4. Configure allowed headers and methods + +## S/4HANA CSN Exposure Service Prerequisites (BASIS >= 756) + +### Required Software Components +- SAP_BASIS 756 or higher +- SAP_ABA +- SAP_BW (for BW/4HANA model transfer) +- Pre-requisite SAP Notes from SAP Note 3463326 + +### Required Search Connectors +Verify in transaction ESH_TEST_SEARCH (F4 on Connector ID, search "csn"): +- CSN_EXPOSURE_CDS_DEFAULT_FT +- CSN_EXPOSURE_CDS_DEFAULT_NFT +- SSCH_CMS_CDS_DESC_FT +- SSCH_CMS_CDS_DESC_NFT +- CSN_EXPOSURE_CDS + +If missing, activate using report ESH_CDSABAP_ACTIVATION. + +### Required Services +- CSN_EXPOSURE_V4: Check in /iwfnd/v4_admin → Published Services. If missing, use "Publish Service Groups" +- ESH_SEARCH_SRV: Check in /iwfnd/maint_service → Active Services. 
If missing, use "Add Service" + +### Required Authorizations for Communication User +- SDDLVIEW: DDLSRCNAME = CSN_EXPOSURE_* entities, ACTVT = 03 +- S_SERVICE: SRV_NAME = EF608938F3EB18256CE851763C2952, SRV_TYPE = HT +- S_START: AUTHPGMID = R3TR, AUTHOBJTYP = G4BA, AUTHOBJNAM = CSN_EXPOSURE_V4 +- S_SDSAUTH: ACTVT = 16 (Execute) + +## OData API Troubleshooting + +### Authentication Errors +- Verify OAuth client configuration in Datasphere +- Check token endpoint URL format +- Confirm client ID and secret +- SAP KBA 3318090 for authentication error resolution + +### Common OData Issues +1. **404 Not Found**: Verify the exact entity set name and URL path +2. **403 Forbidden**: Check user authorizations and Data Access Controls +3. **500 Internal Server Error**: Check Datasphere system logs, verify view deployment +4. **Timeout**: Reduce query scope, add filters, check view complexity + +## ODBC Connection Issues + +### Driver Installation +- Download HANA ODBC driver from SAP Development Tools +- Extract .SAR file using SAPCAR utility +- Configure ODBC DSN in Windows ODBC Administrator (64-bit) + +### Common ODBC Issues +1. **Connection refused**: Verify hostname and port, check IP allowlist +2. **Authentication failed**: ODBC credentials are Database User credentials, NOT Datasphere login +3. **Schema not visible**: Verify Database User has required space schema access +4. 
**Multi-space access**: Requires Database User Group with cross-space grants + +## Key SAP Notes +| Note | Description | +|------|-------------| +| 3369433 | Cloud Connector troubleshooting for Datasphere | +| 3276488 | IP Allowlist configuration | +| 2894588 | IP Allowlist details | +| 2938870 | DP Agent errors with Datasphere | +| 3196950 | DP Agent troubleshooting guide | +| 3463326 | S/4HANA CSN Exposure prerequisites | +| 3318090 | OData API authentication errors | +| 3383634 | BW/4HANA Model Import | diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md new file mode 100644 index 0000000..ea42f22 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md @@ -0,0 +1,451 @@ +--- +name: datasphere-data-flows +description: SAP Datasphere Data Integration expert skill covering Replication Flows, Data Flows, Transformation Flows, and Task Chains. Use for architecting, configuring, troubleshooting, and optimizing data movement pipelines including CDC/delta processing, ETL operations, and orchestration. +--- + +# SAP Datasphere Data Integration + +Expert skill for SAP Datasphere Data Integration layer covering all flow types and orchestration patterns. + +## Flow Type Decision Matrix + +| Requirement | Recommended Flow | Reason | +|-------------|------------------|--------| +| Mass 1:1 data movement | **Replication Flow** | Optimized for bulk transfer, supports CDC | +| Real-time delta capture | **Replication Flow** | Only flow type supporting continuous CDC | +| Complex ETL (joins, unions) | **Data Flow** | Visual modeling + Python scripting | +| Delta propagation through layers | **Transformation Flow** | Reads/writes delta tables, SQL-based | +| Schedule & orchestrate | **Task Chain** | Dependency management, parallel execution | + +## Navigation + +### Accessing Data Builder +1. 
Click **Data Builder** in the left navigation menu (cube icon) +2. Select a **Space** to work in (required before creating objects) +3. Use the tabs to filter: All Files | Tables | Views | E/R Models | Analytic Models | **Flows** | Intelligent Lookups | Task Chains + +### Creating Flows +From Data Builder, click one of the creation tiles at the top: +- **New Data Flow** - Opens Data Flow editor +- **New Replication Flow** - Opens Replication Flow wizard +- **New Transformation Flow** - Opens Transformation Flow editor +- **New Task Chain** - Opens Task Chain orchestrator + +| Flow Type | Navigation Path | URL Fragment | +|-----------|----------------|--------------| +| Data Builder | Left Menu → Data Builder | `#/databuilder` | +| Flows List | Data Builder → Flows tab | `#/databuilder&/db/{SPACE}` (filtered) | +| Data Flow Editor | Data Builder → New Data Flow | `#/databuilder&/db/{SPACE}/-newDataFlow` | +| Replication Flow | Data Builder → New Replication Flow | `#/replicationflow` | +| Transformation Flow | Data Builder → New Transformation Flow | `#/transformationflow` | +| Task Chain | Data Builder → New Task Chain | `#/taskchain` | +| Monitor All | Data Integration Monitor | `#/dim` | + +### Data Flow Editor Layout + +**Left Panel - Repository/Sources:** +- **Repository Tab:** Lists local tables, views, and objects (85+ objects) +- **Sources Tab:** Lists external connections for bringing in data + - Expand "Connections" to see available source systems + - Supports cloud sources (AWS, Azure, etc.) 
+ +**Center - Canvas:** +- Drag-and-drop visual modeling area +- Nodes represent: Sources, Operators, Targets +- Lines represent data flow direction +- Node badges show column counts + +**Operators Toolbar:** +| Icon | Operator | Purpose | +|------|----------|---------| +| Table | Source/Target | Add data source or target table | +| Chain | Join | Combine two inputs (INNER, LEFT, RIGHT, FULL) | +| Transform | Projection | Select/rename columns | +| Aggregate | Aggregation | GROUP BY with SUM, COUNT, AVG, etc. | +| Code | Script (Python) | Custom Python transformations | +| Filter | Filter | Row-level filtering | + +**Right Panel - Properties:** +- **General:** Business Name, Technical Name, Status +- **Run Status:** Execution state, last run info +- **Input Parameters:** Runtime variables +- **Advanced Properties:** Memory allocation, restart options + +--- + +## 1. Replication Flows + +### Primary Use Case +**1:1 mass data replication** from supported sources to supported targets with minimal transformation (projection/filtering only). This is the successor to SLT for cloud-to-cloud scenarios. + +### Capabilities +- **Initial Load:** Full data extraction +- **Delta Load (CDC):** Change Data Capture for incremental updates +- **Projections:** Column selection/filtering +- **Simple transformations:** Basic filtering only + +### Delta Prerequisites +> **CRITICAL:** The source object must have CDC annotations enabled. + +For S/4HANA CDS Views: +```abap +@Analytics.dataExtraction.enabled: true +@Analytics.dataExtraction.delta.changeDataCapture: true +``` +If the source lacks CDC annotations, only "Initial Load" is supported. 
+ +### Supported Targets + +#### Inbound Targets (SAP) +| Target | Delta Support | Notes | +|--------|---------------|-------| +| Local Table | Yes | Can be delta-capture enabled | +| Local Table (File) | Yes | HANA Data Lake Files (Object Store) | +| SAP HANA Cloud | Yes | Direct HANA connection | + +#### Outbound Targets (Non-SAP) - Requires POI +| Target | Format | Notes | +|--------|--------|-------| +| Amazon S3 | Parquet/CSV | Premium Outbound required | +| Google Cloud Storage | Parquet/CSV | Premium Outbound required | +| Google BigQuery | Native | Premium Outbound required | +| Azure Data Lake Gen2 | Parquet/CSV | Premium Outbound required | +| Apache Kafka | Events | Premium Outbound required | + +### Premium Outbound Integration (POI) + +> **LICENSING ALERT:** Replicating to non-SAP targets incurs specific costs. + +| Scenario | POI Required? | +|----------|---------------| +| Replicate to Datasphere Local Table | No | +| Replicate to HANA Cloud | No | +| Replicate to HDLF (Object Store) | No | +| Replicate to AWS S3 | **Yes** | +| Replicate to Azure ADLS Gen2 | **Yes** | +| Replicate to Kafka | **Yes** | + +**POI Blocks:** Measured in 20GB increments. Plan capacity accordingly. + +### Critical Constraints +- ❌ **No complex logic:** Cannot perform joins, unions, aggregations +- ❌ **No Python scripting:** Use Data Flows for custom code +- ✅ **Use for:** Data movement and simple filtering only + +### Creating a Replication Flow + +1. **Data Builder** → New → **Replication Flow** +2. **Add Source:** + - Select connection (e.g., S/4HANA) + - Choose source objects (CDS Views, tables) +3. **Configure Load Type:** + - Initial Load Only + - Initial + Delta (if CDC enabled) +4. **Add Target:** + - Select target type (Local Table, Object Store, External) +5. **Map Columns:** Projection and filtering +6. **Deploy & Run** + +--- + +## 2. Data Flows + +### Primary Use Case +**Complex ETL operations** requiring joins, unions, aggregations, or custom Python scripting. 
+ +### Capabilities +- Visual drag-and-drop modeling +- Join multiple sources +- Union datasets +- Aggregations and calculations +- **Python Operator:** Custom transformations using Pandas-like operations + +### Python Operator +```python +# Example: Rename columns and convert data types +def transform(data): + df = data.copy() + df.columns = [col.upper() for col in df.columns] + df['AMOUNT'] = df['AMOUNT'].astype(float) + return df +``` + +**Available libraries:** Standard Python data manipulation (Pandas-like dataframe operations) + +### Critical Constraints + +> **IMPORTANT:** Data Flows are **BATCH ONLY** + +| Constraint | Impact | +|------------|--------| +| No CDC support | Cannot propagate delta changes continuously | +| Batch execution | Full reload each run (unless filtered) | +| No delta chaining | Cannot use Data Flow target as delta source for another flow | + +### When NOT to Use Data Flows +- ❌ Real-time/streaming requirements → Use Replication Flow +- ❌ Delta propagation through layers → Use Transformation Flow +- ❌ Simple 1:1 data movement → Use Replication Flow (more efficient) + +### When to Use Data Flows +- ✅ Complex joins between multiple sources +- ✅ Custom Python transformations +- ✅ Data quality operations +- ✅ One-time or scheduled batch loads + +### Creating a Data Flow + +> **Note:** A banner in the editor reminds you: "Replication and Transformation Flows are now the recommended approach... while Data Flows will continue to be supported for existing workflows." + +**Step-by-Step:** + +1. **Open Data Builder** → Select Space → Click **"New Data Flow"** tile +2. **Canvas Opens** with empty workspace showing: + - Left panel: Repository (local objects) and Sources (external connections) + - Properties panel: Auto-generates name like "Data Flow 1" / "Data_Flow_1" + +3. 
**Add Source Tables:** + - From **Repository tab**: Click table, then click canvas area to add + - Or click Operators toolbar → Table icon → click canvas + - Source nodes show column count badge (e.g., "9" for 9 columns) + +4. **Add Transformation Operators:** + - Click operator icon in toolbar (Join, Projection, Aggregation, Script) + - Click canvas to place + - Connect nodes by clicking output port → dragging to input port + +5. **Configure Join (if used):** + - Select Join node → Properties panel shows: + - Join Type: INNER (default), LEFT, RIGHT, FULL OUTER + - Join Definition: Define key columns + - Connect two source nodes to the join inputs + +6. **Add Python Script (optional):** + - Click Script operator in toolbar + - Configure custom transformation logic in Properties panel + +7. **Add Target Table:** + - Click "+" icon on the last transformation node + - Or: Operators → Table → place on canvas + - Target shows "New" badge - creates new local table + - Configure: Connection (Datasphere), Business Name, Technical Name + +8. **Save & Deploy:** + - Click Save icon (General toolbar) + - Click Deploy to make executable + - Status changes from "Not Deployed" → "Deployed" + +9. **Run:** + - From Run Status section → Click Run icon + - Monitor in Data Integration Monitor + +--- + +## 3. Transformation Flows + +### Primary Use Case +**Delta propagation and multi-level staging** within Datasphere. The strategic successor to Data Flows for delta logic. + +### Key Concept: Delta Waterfall +``` +Replication Flow → [Delta Table A] → Transformation Flow → [Delta Table B] → Transformation Flow → [Delta Table C] +``` +Each layer receives only **changed records** (Insert, Update, Delete). 
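Conceptually, each Transformation Flow in the waterfall applies an I/U/D changelog to its target layer's keyed state. The following toy Python sketch illustrates that idea only; it is not how Datasphere implements delta capture internally:

```python
# Conceptual sketch: apply a delta changelog of (operation, key, row)
# records to a layer's keyed state, as each stage of the delta
# waterfall conceptually does.
def apply_delta(table: dict, changelog: list) -> dict:
    result = dict(table)
    for op, key, row in changelog:
        if op in ("I", "U"):   # insert or update: upsert the record
            result[key] = row
        elif op == "D":        # delete: remove the record if present
            result.pop(key, None)
    return result

layer_a = {"1001": {"amount": 100}}
changes = [
    ("U", "1001", {"amount": 120}),  # update existing record
    ("I", "1002", {"amount": 50}),   # insert new record
    ("D", "1001", {}),               # delete the first record
]
layer_b = apply_delta(layer_a, changes)
print(layer_b)  # {'1002': {'amount': 50}}
```

Only the changed records travel between layers; unchanged rows in `layer_a` would never be re-read.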
+ +### Capabilities +- Reads from Delta-enabled Local Table +- Writes to Delta-enabled Local Table +- **Delta Propagation:** Understands and propagates I/U/D operations +- **SQL-based transformations:** Powerful SQL logic + +### Architecture Rule + +> **CRITICAL DECISION:** If you need to: +> 1. Load data via Replication Flow (Inbound) +> 2. Process only the **changes** to a second layer +> +> → **Use Transformation Flow**, NOT Data Flow + +### Delta Logic Support +| Operation | Propagated? | +|-----------|-------------| +| INSERT | ✅ Yes | +| UPDATE | ✅ Yes | +| DELETE | ✅ Yes | + +### Creating a Transformation Flow + +1. **Data Builder** → New → **Transformation Flow** +2. **Select Source:** Must be a Delta-enabled Local Table +3. **Add SQL Logic:** + - Projections, filters + - Calculations + - CASE statements +4. **Select Target:** Delta-enabled Local Table +5. **Deploy & Run** + +### Comparison: Data Flow vs Transformation Flow + +| Aspect | Data Flow | Transformation Flow | +|--------|-----------|---------------------| +| Delta/CDC | ❌ No | ✅ Yes | +| Complex Joins | ✅ Yes | ⚠️ Limited | +| Python Scripts | ✅ Yes | ❌ No | +| Multi-level Staging | ❌ Not recommended | ✅ Designed for this | +| Performance | Batch reload | Incremental processing | + +--- + +## 4. Task Chains + +### Primary Use Case +**Scheduling and dependency management** for orchestrating multiple flows. + +### Capabilities +- Trigger: Replication Flows, Data Flows, Transformation Flows +- Trigger: View Persistency, Intelligent Lookups +- **Execution Modes:** Serial (Linear) and Parallel +- **Gates:** AND (wait for all) and OR (wait for any) + +### Execution Patterns + +#### Serial Execution +``` +[Flow A] → [Flow B] → [Flow C] +``` +Each step waits for the previous to complete. + +#### Parallel Execution with AND Gate +``` +[Flow A] ─┐ + ├─ AND ─→ [Flow D] +[Flow B] ─┤ +[Flow C] ─┘ +``` +Flow D runs only after A, B, AND C complete. 
+ +#### Parallel Execution with OR Gate +``` +[Flow A] ─┐ + ├─ OR ─→ [Flow D] +[Flow B] ─┘ +``` +Flow D runs after A OR B completes (first one). + +### BW Bridge Integration +Task Chains can trigger remote process chains in SAP BW Bridge, though Bridge chains are typically scheduled internally within the Bridge Cockpit. + +### Creating a Task Chain + +1. **Data Builder** → New → **Task Chain** +2. **Add Tasks:** Drag flows to canvas +3. **Configure Dependencies:** + - Connect tasks with arrows + - Set gate types (AND/OR) +4. **Set Schedule:** + - One-time or recurring + - Time-based or event-based +5. **Deploy & Activate** + +--- + +## Integration Patterns + +### SAP S/4HANA Integration + +**Preferred Method:** CDS Views via ABAP connection (using Cloud Connector) + +| Method | When to Use | +|--------|-------------| +| CDS Views | Preferred - semantic richness, CDC support | +| SLT (Trigger-based) | Legacy - supported but CDS preferred | + +### Object Store (HANA Data Lake Files) + +**Physical Storage:** Replication Flows can dump data to "Local Table (File)" in embedded HANA Data Lake. + +| Characteristic | Value | +|----------------|-------| +| Performance | Slower than In-Memory HANA | +| Use Case | Warm/Cold data, staging | +| Data Products | Foundation for BDC Data Products | + +### Databricks Integration + +| Direction | Method | Cost Impact | +|-----------|--------|-------------| +| **Inbound** (Databricks → SAP) | JDBC connection or Data Import | Standard | +| **Outbound** (SAP → Databricks) Federation | Delta Sharing (Zero Copy) | No data movement | +| **Outbound** (SAP → Databricks) Mass | Replication Flow → ADLS Gen2 → Mount as Delta Table | **POI Required** | + +--- + +## Troubleshooting Guide + +### Replication Flow Failed + +**Check these in order:** + +1. **CDC Annotations:** Does source CDS view have: + ```abap + @Analytics.dataExtraction.enabled: true + @Analytics.dataExtraction.delta.changeDataCapture: true + ``` + +2. 
**Cloud Connector:** Is it running and connected? + +3. **POI Blocks:** For non-SAP targets, are you out of Premium Outbound blocks? + +4. **Execution Nodes:** Check thread limits for large tables (e.g., ACDOCA) + +### Data Flow is Slow + +**Optimization checklist:** + +1. **Full loads every time?** → Switch to Replication Flow (for movement) or Transformation Flow (for delta logic) + +2. **Large joins?** → Pre-aggregate or filter at source + +3. **Python operator?** → Optimize DataFrame operations + +### How to Move BW Data? + +> **Do NOT rebuild manually!** + +| Scenario | Solution | +|----------|----------| +| Legacy BW logic (ABAP) | Use **SAP BW Bridge** | +| BW/4HANA 2021+ or BW 7.5 SP24+ | Use **Data Product Generator** to push InfoProviders to Object Store | + +### I Need Real-Time + +**Only option:** Replication Flows + +Data Flows and Transformation Flows are **batch only**. + +--- + +## Best Practices + +### Flow Selection +1. **Start with Replication Flow** for data ingestion +2. **Use Transformation Flows** for delta staging layers +3. **Reserve Data Flows** for complex one-time transformations +4. 
**Orchestrate with Task Chains** + +### Cost Optimization +- Replicate to SAP targets (no POI cost) +- Use federation (Remote Tables) when real-time not required +- Size POI blocks based on 20GB increments + +### Performance +- Enable CDC on source CDS views +- Use Object Store for warm/cold data +- Parallelize with Task Chains where possible + +## Resources + +See reference files for detailed procedures: +- `references/replication-flows.md` - Detailed replication configuration +- `references/transformation-flows.md` - Delta staging patterns +- `references/task-chains.md` - Orchestration patterns diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/data-flows.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/data-flows.md new file mode 100644 index 0000000..a78b56d --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/data-flows.md @@ -0,0 +1,133 @@ +# Data Flows Reference Guide + +## Overview + +Data Flows are SAP Datasphere's batch ETL tool for complex transformations. They support graphical data transformation with operators including a Python operator for custom logic. + +> **SAP Recommendation:** "Replication and Transformation Flows are now the recommended approach for loading and transforming data in SAP Datasphere. Data Flows will continue to be supported for existing workflows." + +## Key Characteristics + +| Characteristic | Value | +|---------------|-------| +| Execution Mode | **Batch only** | +| Delta Support | **No** (full load only) | +| CDC Support | **No** | +| Custom Code | Python operator available | +| Use Case | Complex transformations, aggregations | + +## Data Flow Editor UI + +### Toolbar Structure + +**General Section:** +- Save icon +- Undo/Redo icons +- Deploy icon + +**View Section:** +- View toggle buttons + +**Edit Section:** +- Standard edit operations + +**Tools Section:** +- Additional utilities + +### Operators Toolbar Icons (left to right) +1. 
**Table** - Add source or target table +2. **Join** - Combine two data sources +3. **Projection** - Select/transform columns +4. **Aggregation** - GROUP BY operations +5. **Script** - Python custom code +6. **Filter** - Row filtering + +### Canvas Node Elements +- **Source nodes:** Blue with grid icon, shows column count +- **Operator nodes:** Blue with operator-specific icon +- **Target nodes:** Blue with "Target" label, shows "New" if creating new table +- **Connection ports:** Circles on node edges for linking +- **Error badges:** Red indicators for validation issues + +## CRITICAL CONSTRAINT + +**Data Flows are BATCH ONLY** - They do NOT support: +- Delta/incremental loading +- Change Data Capture (CDC) +- Real-time processing + +For delta requirements, use **Transformation Flows** instead. + +## Operators + +### Source Operators +- Table (local tables, views) +- SQL (custom SQL statements) + +### Transformation Operators +- **Join** - Inner, left, right, full outer joins +- **Union** - Combine multiple sources +- **Filter** - Row-level filtering +- **Projection** - Column selection and renaming +- **Aggregation** - GROUP BY with SUM, COUNT, AVG, MIN, MAX +- **Script (Python)** - Custom Python transformations + +### Target Operators +- Table (local tables) + +## Python Operator + +The Python operator enables custom transformation logic: + +```python +# Example: Custom data cleansing +def transform(data): + # data is a pandas DataFrame + data['column'] = data['column'].str.upper() + return data +``` + +**Capabilities**: +- Pandas DataFrame operations +- Custom business logic +- Data quality checks +- Complex calculations + +**Limitations**: +- Performance overhead for large datasets +- No external library imports +- Memory constraints + +## Navigation + +**Access**: Data Builder → Data Flows + +**Create**: Data Builder → New Data Flow → Drag operators → Connect sources to targets + +## Execution + +Data Flows can be: +- Run manually from Data Builder +- 
Scheduled via Task Chains +- Triggered by external events + +## When to Use Data Flows + +**Use Data Flows when**: +- Complex transformations are needed +- Python/custom logic is required +- Full refresh is acceptable +- One-time or batch processing + +**Do NOT use Data Flows when**: +- Delta/CDC is required → Use Transformation Flows +- Simple 1:1 replication → Use Replication Flows +- Real-time processing needed → Use Replication Flows with CDC + +## Best Practices + +1. **Minimize Python Usage** - Use built-in operators when possible +2. **Filter Early** - Apply filters before joins to reduce data volume +3. **Test with Sample Data** - Validate logic before full runs +4. **Monitor Performance** - Check execution times in Data Integration Monitor +5. **Document Logic** - Add descriptions to operators for maintainability diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md new file mode 100644 index 0000000..470aa1e --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md @@ -0,0 +1,89 @@ +# Replication Flows Reference + +## Source Configuration + +### S/4HANA CDS Views + +**Required Annotations for Delta:** +```abap +@Analytics.dataExtraction.enabled: true +@Analytics.dataExtraction.delta.changeDataCapture: true +``` + +**Common CDS Views:** +| CDS View | Description | CDC Support | +|----------|-------------|-------------| +| I_ACDOCA | Universal Journal | Yes | +| I_BSEG | Accounting Line Items | Yes | +| I_PURCHASEORDER | Purchase Orders | Yes | +| I_SALESORDER | Sales Orders | Yes | + +### Connection Requirements +- ABAP connection with Cloud Connector +- User with extraction authorization +- RFC destination configured + +## Target Configuration + +### Local Table +- Best for: In-memory analytics +- Enable "Delta Capture" for downstream Transformation Flows +- 
Storage: Uses Space memory quota + +### Local Table (File) - Object Store +- Best for: Large volumes, warm/cold data +- Format: Parquet (default), CSV +- Storage: HANA Data Lake Files (HDLF) +- Performance: Slower than in-memory + +### External Targets (POI Required) + +#### Amazon S3 +``` +Bucket: s3://your-bucket +Path: /datasphere/exports/ +Format: Parquet (recommended) or CSV +``` + +#### Azure Data Lake Gen2 +``` +Container: your-container +Path: /datasphere/exports/ +Format: Parquet (recommended) or CSV +``` + +#### Kafka +``` +Bootstrap Servers: kafka:9092 +Topic: datasphere-events +Format: JSON or Avro +``` + +## Load Types + +### Initial Load Only +- Full extraction on each run +- Use when: Source lacks CDC, one-time migration + +### Initial + Delta +- First run: Full extraction +- Subsequent runs: Only changed records +- Requires: CDC-enabled source + +## Monitoring + +### Data Integration Monitor +Path: Left Menu → Data Integration Monitor + +**Key Metrics:** +- Records transferred +- Duration +- Error count +- Delta queue status + +### Troubleshooting Failed Runs +1. Check connection status +2. Verify CDC annotations +3. Review error logs in monitor +4. Check Cloud Connector (on-prem sources) +5. Verify POI block availability (external targets) diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md new file mode 100644 index 0000000..cbdb01f --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md @@ -0,0 +1,92 @@ +# Task Chains Reference Guide + +## Overview + +Task Chains orchestrate the execution of multiple data integration objects in SAP Datasphere. They provide scheduling, sequencing, and conditional execution capabilities. 
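Conditional execution is driven by AND/OR gates (detailed below). Their semantics can be sketched in a few lines of Python — an illustration of the rule, not Datasphere's scheduler:

```python
# Illustrative sketch of Task Chain gate semantics (not the actual
# scheduler): an AND gate fires only when every upstream task has
# completed; an OR gate fires as soon as any one upstream task completes.
def gate_fires(gate: str, upstream_done: dict) -> bool:
    if gate == "AND":
        return all(upstream_done.values())
    if gate == "OR":
        return any(upstream_done.values())
    raise ValueError(f"unknown gate type: {gate}")

status = {"Flow A": True, "Flow B": False, "Flow C": True}
print(gate_fires("AND", status))  # False: Flow B has not completed
print(gate_fires("OR", status))   # True: at least one flow completed
```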
+ +## Orchestration Objects + +Task Chains can orchestrate: +- **Replication Flows** - Mass data replication +- **Data Flows** - Batch ETL transformations +- **Transformation Flows** - Delta-aware SQL transformations +- **Other Task Chains** - Nested orchestration +- **BW Bridge Data Flows** - Legacy ABAP logic execution + +## Execution Modes + +### Serial Execution +Objects execute one after another in sequence: +``` +Object A → Object B → Object C +``` + +### Parallel Execution +Multiple objects execute simultaneously with gate logic: + +**AND Gate** - All parallel branches must complete before continuing: +``` + ┌─ Object B ─┐ +Object A ─┤ ├─ AND ─ Object D + └─ Object C ─┘ +``` + +**OR Gate** - First completed branch triggers continuation: +``` + ┌─ Object B ─┐ +Object A ─┤ ├─ OR ─ Object D + └─ Object C ─┘ +``` + +## Status Handling + +### On Success +Continue to next object or complete chain. + +### On Failure +Options: +- Stop chain execution +- Continue with next branch +- Execute error handling branch + +## BW Bridge Integration + +Task Chains can include BW Bridge Data Flows for: +- Legacy ABAP transformations +- Complex business logic +- InfoProvider updates + +## Scheduling Options + +| Schedule Type | Use Case | +|--------------|----------| +| Time-based | Regular intervals (hourly, daily, weekly) | +| Event-based | Triggered by external events | +| Manual | On-demand execution | + +## Best Practices + +1. **Group Related Flows** - Organize flows that load related data together +2. **Use Parallel Execution** - Independent flows should run in parallel +3. **Implement Error Handling** - Define behavior for each failure scenario +4. **Monitor Execution** - Use Data Integration Monitor to track runs +5. 
**Consider Dependencies** - Ensure upstream data is available before downstream processing + +## Navigation + +**Access**: Data Builder → Task Chains (or create new) + +**Create**: Data Builder → New Task Chain → Drag objects onto canvas → Connect with execution paths + +## Common Patterns + +### Daily Load Pattern +``` +Morning: Source Extracts (Parallel) → AND → Transformations (Serial) → Facts → Aggregates +``` + +### Real-time + Batch Hybrid +``` +Replication Flows (CDC) run continuously +Task Chain (scheduled): Transformation Flows → Analytics Layer +``` diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/transformation-flows.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/transformation-flows.md new file mode 100644 index 0000000..d465c6e --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/transformation-flows.md @@ -0,0 +1,95 @@ +# Transformation Flows Reference + +## Delta Staging Architecture + +### Multi-Layer Pattern +``` +[Source System] + ↓ (Replication Flow) +[Raw Layer - Delta Table] + ↓ (Transformation Flow) +[Cleansed Layer - Delta Table] + ↓ (Transformation Flow) +[Curated Layer - Delta Table] + ↓ (View) +[Consumption Layer] +``` + +### Delta Operations +| Operation | Symbol | Behavior | +|-----------|--------|----------| +| INSERT | I | New record created | +| UPDATE | U | Existing record modified | +| DELETE | D | Record marked for deletion | + +## Prerequisites + +### Source Table Requirements +- Must be a Local Table +- Must have **Delta Capture** enabled +- Contains delta records from Replication Flow + +### Target Table Requirements +- Must be a Local Table +- Must have **Delta Capture** enabled +- Schema compatible with source + transformations + +## SQL Transformations + +### Supported Operations +```sql +-- Projections +SELECT column1, column2 FROM source + +-- Filters +SELECT * FROM source WHERE status = 'ACTIVE' + +-- Calculations +SELECT amount 
* quantity AS total FROM source
+
+-- CASE statements
+SELECT
+    CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END AS priority
+FROM source
+
+-- Date functions
+SELECT
+    YEAR(created_date) AS year,
+    MONTH(created_date) AS month
+FROM source
+```
+
+### Limitations
+- No complex multi-table joins (use Data Flow instead)
+- No Python scripting
+- Limited to SQL syntax
+
+## Execution Modes
+
+### Run on Delta
+- Processes only changed records since last run
+- Most efficient for incremental processing
+- Maintains delta log for downstream flows
+
+### Run Full
+- Processes all records
+- Use after schema changes
+- Resets delta state
+
+## Best Practices
+
+### Naming Convention
+```
+TF_<SOURCE_LAYER>_TO_<TARGET_LAYER>
+Example: TF_RAW_TO_CLEANSED_SALES
+```
+
+### Error Handling
+- Monitor for failed delta propagation
+- Check data type compatibility
+- Validate NULL handling
+
+### Performance
+- Filter early in the pipeline
+- Minimize transformations per flow
+- Use appropriate scheduling frequency
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
new file mode 100644
index 0000000..c35fb3e
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
@@ -0,0 +1,972 @@
+---
+name: Data Product Publisher
+description: "Publish and monetize data products through Datasphere's Data Sharing Cockpit. Use when preparing data for external consumers, setting up marketplace listings, managing access control, or licensing data assets. Keywords: data sharing, data product, marketplace, consumer, license, access control, data provider, visibility, quality."
+---
+
+# Data Product Publisher Skill
+
+## Overview
+
+The Data Product Publisher skill guides you through the complete lifecycle of publishing, managing, and monetizing data products in SAP Datasphere.
From preparing high-quality data to creating marketplace listings, managing consumer subscriptions, and defining access policies, this skill provides best practices for successful data sharing. + +## When to Use This Skill + +Trigger this skill when you need to: +- Prepare data for sharing with external consumers +- Create a data product listing in the Data Sharing Cockpit +- Define access policies and license terms +- Publish data products to the Datasphere marketplace +- Onboard data consumers and manage access +- Update or deprecate existing data products +- Monitor data product quality and performance +- Manage subscription and licensing compliance +- Establish data governance for shared assets +- Generate insights from data product usage + +## What Are Data Products in Datasphere? + +### Business Value of Data Products + +Data products are curated, managed, business-ready datasets packaged for consumption by external parties. They transform raw data into valuable strategic assets. + +**Strategic Benefits:** +- **Revenue Generation**: Monetize internal data assets through data marketplace +- **Partner Enablement**: Share curated data with partners for joint business initiatives +- **Ecosystem Growth**: Build data ecosystem attracting new consumers and use cases +- **Competitive Advantage**: Leverage industry expertise through proprietary datasets +- **Operational Efficiency**: Enable self-service analytics reducing internal support burden + +**Business Use Cases:** +- Financial services: Credit risk models, market data, transaction datasets +- Retail: Customer demographics, inventory insights, demand forecasting data +- Healthcare: Clinical outcomes, patient cohorts, treatment effectiveness +- Manufacturing: Supply chain data, quality metrics, equipment diagnostics +- Transportation: Route optimization data, fleet analytics, logistics insights + +### Data Product Characteristics + +**Well-Designed Data Products:** +1. 
**Clear Business Context**: Explain what the data represents and business relevance +2. **High Quality**: Completeness, accuracy, timeliness verified and documented +3. **Well-Documented**: Metadata, lineage, refresh frequency, known limitations +4. **Governed**: Clear ownership, access policies, data classification +5. **Maintained**: Actively updated, version-controlled, deprecation path defined +6. **Accessible**: Multiple consumption formats, clear usage examples +7. **Discoverable**: Proper categorization, tags, searchable descriptions + +## Data Sharing Cockpit Overview + +The Data Sharing Cockpit is Datasphere's central hub for managing all data sharing activities. + +### Cockpit Sections + +**Data Provider Features:** +- **My Products**: Create and manage your published data products +- **Pending Requests**: Review and approve consumer access requests +- **Subscriptions**: Monitor active data product subscriptions +- **Analytics**: View usage metrics, consumer engagement, quality stats +- **Settings**: Configure data product defaults, policies, and governance + +**Navigation Path:** +``` +Datasphere Main Menu → Data Sharing Cockpit → Data Provider +``` + +### Key Capabilities + +| Capability | Purpose | Frequency | +|------------|---------|-----------| +| Create Product Listing | Define new data product | Once during setup | +| Set Access Policy | Control who can consume | Monthly review | +| Manage Terms & Conditions | Define license terms | During product launch | +| Review Requests | Approve/deny access | Daily/Weekly | +| Monitor Subscriptions | Track active consumers | Weekly | +| Publish Updates | Version data product | As needed | +| Manage Consumer Access | Grant/revoke permissions | As needed | +| View Analytics | Track usage and engagement | Weekly/Monthly | + +## Data Provider Workflow: Preparing Data + +### Step 1: Select Source Data + +**Evaluation Criteria:** +- **Business Value**: Does the data address real consumer needs? 
+- **Quality Readiness**: Is data accurate, complete, and timely? +- **Regulatory Compliance**: Does sharing comply with privacy and regulations? +- **Ownership Clarity**: Does business own the rights to share? +- **Refresh Frequency**: Can we maintain timeliness cost-effectively? + +**Data Selection Process:** +``` +1. Identify candidate datasets (brainstorm with business stakeholders) +2. Evaluate business value proposition +3. Assess data quality maturity +4. Review legal/compliance implications +5. Calculate maintenance cost vs. expected revenue +6. Prioritize by strategic importance and readiness +7. Select top 3-5 for pilot +``` + +### Step 2: Validate Data Quality + +**Quality Dimensions to Assess:** + +**Completeness:** +- No missing required fields +- Expected row count matches reality +- Historical coverage sufficient for analysis +- Example: "Product table has 100K SKUs with complete descriptions" + +**Accuracy:** +- Spot-check sample data against source +- Verify calculations and aggregations +- Validate against known benchmarks +- Example: "Revenue totals within 0.1% of source system" + +**Consistency:** +- Joins to related tables successful +- Foreign keys valid +- No conflicting values in related fields +- Example: "Every order.customer_id matches a customer record" + +**Timeliness:** +- Refresh frequency documented +- Refresh SLA achievable and tested +- Latest data available within committed timeframe +- Example: "Daily refresh by 6am UTC, 99.5% on-time" + +**Uniqueness:** +- Primary keys non-duplicated +- Grain consistent with documentation +- No unexpected duplicates in fact data +- Example: "Each transaction_id appears exactly once" + +### Step 3: Prepare Metadata Documentation + +**Essential Metadata:** + +``` +Dataset: Sales Transactions + +DESCRIPTION: + Daily transaction-level sales data across all retail locations. + Includes order details, item-level quantities and prices. 
+
+BUSINESS CONTEXT:
+  Enables analysis of sales patterns, product performance,
+  customer spending behavior, and regional trends.
+
+GRAIN:
+  One row per item per order (transactional detail)
+  Example: Order 12345 with 3 items = 3 rows
+
+REFRESH FREQUENCY:
+  Daily, 5am UTC. 99.5% on-time SLA.
+
+ROW COUNT:
+  Current: 50M rows (last 12 months)
+  Growth: +50% annually
+
+SIZE:
+  Current: 25 GB compressed
+  Growth: +2 GB monthly
+
+KEY COLUMNS:
+  - transaction_id (Primary Key)
+  - transaction_date (Partition Key)
+  - customer_id (Foreign Key)
+  - product_id (Foreign Key)
+  - quantity (Fact)
+  - unit_price (Fact)
+  - total_amount (Fact)
+
+KNOWN LIMITATIONS:
+  - Data includes 5 major retailers only (not all locations)
+  - Web sales excluded (retail store sales only)
+  - Pricing excludes regional discounts applied post-transaction
+  - Returns recorded as separate negative transactions
+
+DATA LINEAGE:
+  Source: Enterprise ERP (SAP S/4HANA)
+  ETL: Nightly Replication Flow (Datasphere)
+  Transformations: None (raw transaction copy)
+  Last Validated: 2024-01-15
+
+CONTACT:
+  Owner: sales@company.com
+  Support: datasupport@company.com
+```
+
+### Step 4: Create View or Table for Sharing
+
+Create a dedicated view/table for the data product (not the raw source):
+
+**Why Create a View:**
+- Isolates consumer access from production tables
+- Enables selective column exposure (hide sensitive columns)
+- Allows computed columns (e.g., profit margin calculation)
+- Simplifies access revocation
+- Enables data transformation/cleansing for consumers
+
+**Example View Creation:**
+
+```sql
+CREATE VIEW product_sales_for_sharing AS
+SELECT
+    transaction_id,
+    transaction_date,
+    customer_id,
+    product_id,
+    ROUND(unit_price, 2) as unit_price,
+    quantity,
+    ROUND(total_amount, 2) as total_amount,
+    CASE
+        WHEN EXTRACT(MONTH FROM transaction_date) <= 6 THEN 'H1'
+        ELSE 'H2'
+    END as fiscal_half
+FROM sales_transactions
+WHERE transaction_date >= ADD_YEARS(CURRENT_DATE, -2)
+
AND customer_id IS NOT NULL
+-- Exclude internal sales and test transactions
+  AND business_unit != 'INTERNAL'
+  AND customer_id NOT BETWEEN 999990 AND 999999;
+
+-- Column-level security: in Datasphere this is applied via Data Access
+-- Controls assigned to the view in the Data Builder (per consumer), not
+-- via SQL DDL. Flag sensitive columns such as unit_price and total_amount
+-- when defining the view's access controls.
+```
+
+**Best Practices:**
+- Use descriptive view names
+- Include computed/derived columns valuable for consumers
+- Exclude columns with sensitive information
+- Document all transformations
+- Version your view definitions
+- Use consistent naming conventions
+
+## Creating a Data Product Listing
+
+### Step 1: Access Data Sharing Cockpit
+
+```
+Datasphere Home → Data Sharing Cockpit → My Products → Create New Product
+```
+
+### Step 2: Complete Product Details
+
+**Product Information Section:**
+
+| Field | Required | Guidance |
+|-------|----------|----------|
+| Product Name | Yes | Concise, business-friendly (50 chars max) |
+| Product ID | Auto-generated | System identifier (immutable) |
+| Category | Yes | Pick from predefined (Finance, Sales, Operations, etc.) |
+| Short Description | Yes | One-liner explaining value (100 chars max) |
+| Long Description | Yes | Detailed explanation of contents (500 chars) |
+| Owner Name | Yes | Name of data owner/steward |
+| Owner Email | Yes | Contact for access requests |
+| Support Email | Yes | Technical support contact |
+
+**Example Listing:**
+
+```
+Product Name: Retail Sales Transactions
+Short Description: Daily transaction-level sales across 5 major retailers
+Long Description: Comprehensive view of retail transaction activity including
+  order details, item quantities, pricing, and temporal attributes. Enables
+  analysis of sales patterns, product performance, customer behavior, and
+  regional trends. Updated daily with 12-month rolling window.
+Category: Sales & Marketing +Owner: Sarah Johnson (sarah.johnson@company.com) +Support: data.support@company.com +``` + +### Step 3: Define Data Source + +**Selection Options:** +- Tables (Raw data tables in your space) +- Views (Logical views, recommended) +- Replicated Objects (Data from integrated systems) +- Materialized Views (Pre-calculated/optimized data) + +**Selection Criteria:** +- Use Views to enable secure column filtering +- Use Materialized Views for large frequently-accessed datasets +- Avoid exposing raw tables directly (limit consumer access) +- Select only necessary columns (principle of least privilege) + +### Step 4: Configure Visibility and Sharing + +**Visibility Options:** + +| Setting | Scope | Best For | +|---------|-------|----------| +| **Private** | Only explicitly approved consumers | Proprietary data, partners, pilot | +| **Internal Only** | All employees within organization | General internal data sharing | +| **Public** | Visible to all Datasphere tenants | Open industry data, benchmarks | + +**Recommendation:** +- Start with Private (explicit approval control) +- Move to Internal Only after stabilization +- Publish to Public only after extensive validation + +### Step 5: Set Initial Access Control + +**Access Control Mechanisms:** + +``` +1. Visibility (can user see the listing?) + → Private: Explicit allow list + → Internal: All internal users + → Public: All users + +2. Access Request (can user request access?) + → Manual Approval: You review requests + → Auto-Approval: Immediate access + → Restricted: No requests accepted + +3. 
Consumer Context (data consumer isolation) + → Shared Context: Shared with other consumers + → Private Context: Isolated per consumer +``` + +**Recommended Settings for Pilot:** +``` +Visibility: Private +Access Request: Manual Approval (review each consumer) +Consumer Context: Private Context (isolate early) +``` + +## Writing Effective Product Descriptions + +### Compelling Value Proposition + +**Structure: Problem → Solution → Benefit** + +``` +WEAK: +"Sales data from our company" + +STRONG: +"Gain competitive insight into retail sales patterns across 5 major markets. +Analyze product performance, customer spending trends, and seasonal demand +to optimize inventory and pricing strategies. Updated daily with 2 years +of historical transaction detail enabling time-series analysis and +forecasting model development." +``` + +### Key Information to Include + +**What the data contains:** +- Specific data elements (transactions, products, customers) +- Time period covered (last 12 months, 5-year history) +- Geographic scope (global, regions, countries) +- Industry/segment coverage + +**How to use it:** +- Top 3-5 common analytical use cases +- Example questions it can answer +- Recommended analysis approaches + +**Who should use it:** +- Target consumer roles (analyst, data scientist, business intelligence) +- Industry segments with most value +- Business functions (sales, marketing, operations) + +**Refresh and currency:** +- Update frequency (daily, weekly, monthly) +- Data lag (same-day reporting, 1-day lag) +- Historical period available + +### Example Product Descriptions + +**Customer Analytics Data Product:** + +``` +TITLE: Customer Master & Behavioral Analytics + +DATA INCLUDED: +• Customer demographics (50M customers, 200+ attributes) +• Purchase history (transactions last 3 years) +• Online engagement metrics (website clicks, session data) +• Customer lifetime value calculations +• Churn risk scores + +ANALYTICAL USE CASES: +1. 
Customer segmentation and targeting for marketing campaigns +2. Lifetime value prediction for pricing/acquisition optimization +3. Churn risk identification for retention programs +4. Product affinity analysis for cross-sell recommendations +5. Geographic expansion opportunity assessment + +REFRESH FREQUENCY: Daily at 6am UTC +HISTORICAL DEPTH: 3 years rolling window +DATA QUALITY: 99.8% completeness, deduplicated, validated + +TARGET AUDIENCE: Marketing teams, customer analytics, product managers + +KNOWN LIMITATIONS: +• Online data excludes mobile app interactions (web-only) +• Excludes B2B customers (B2C retail only) +• Regional data available for 8 primary markets only +``` + +## Defining License Terms and Access Policies + +### License Term Options + +**Free Access:** +``` +License Type: Community/Open Access +Cost: $0 +Best For: Internal sharing, ecosystem growth, public datasets +Terms: + - Non-exclusive access + - Non-commercial use (internal only) + - No warranty or support guarantee + - Data provided as-is +``` + +**Subscription-Based:** +``` +License Type: Monthly Subscription +Cost: $500-2,000/month (depends on data volume) +Best For: Premium proprietary data, commercial partners +Terms: + - Exclusive access (single consumer) + - Commercial use rights + - Monthly refresh updates + - Email support included + - 30-day cancellation notice required +``` + +**Pay-Per-Query:** +``` +License Type: Consumption-Based +Cost: $0.10-1.00 per GB queried +Best For: Large infrequent consumers, ad-hoc analysis +Terms: + - Pay only for actual usage + - Queries tracked and billed monthly + - No minimum commitment + - Real-time access control +``` + +### Access Control Patterns + +**Pattern 1: Column-Level Security** + +Expose different columns to different consumer types: + +``` +PARTNER A (Reseller): + - product_id, product_name, category + - quantity, revenue + - transaction_date + HIDE: unit_price, customer_id, cost_price + +PARTNER B (Investor): + - product_category, 
region + - total_revenue, total_quantity + - transaction_date + HIDE: customer data, pricing details, individual transactions + +INTERNAL SALES: + - All columns including unit_price and cost_price +``` + +**Implementation in Datasphere:** +```sql +-- Create consumer-specific views +CREATE VIEW product_sales_for_resellers AS +SELECT transaction_id, product_id, product_name, category, + quantity, revenue, transaction_date +FROM product_sales_for_sharing +WHERE authorized_partner = 'PARTNER_A'; + +CREATE VIEW product_sales_for_investors AS +SELECT product_category, region, + SUM(revenue) as total_revenue, SUM(quantity) as total_quantity, + transaction_date +FROM product_sales_for_sharing +GROUP BY product_category, region, transaction_date; +``` + +**Pattern 2: Row-Level Security (RLS)** + +Restrict data visibility based on consumer attributes: + +``` +SALES REGION (APAC): + - Access to APAC transactions only + WHERE region IN ('Australia', 'Japan', 'Singapore', ...) + +CUSTOMER (Small Retailer): + - Access to own transactions only + WHERE customer_id = AUTHENTICATED_CUSTOMER_ID + +DISTRIBUTOR (Multi-region): + - Access to 3 assigned regions only + WHERE region IN ('EMEA', 'LATAM', 'APAC') +``` + +**Pattern 3: Time-Based Access** + +Control which time periods consumers can access: + +``` +TIER: BASIC + - Last 12 months only + WHERE transaction_date >= DATEADD(YEAR, -1, CURRENT_DATE) + +TIER: PREMIUM + - Last 5 years + WHERE transaction_date >= DATEADD(YEAR, -5, CURRENT_DATE) + +TIER: ARCHIVE + - All historical (7+ years) + - No time restriction +``` + +### Terms and Conditions Template + +``` +DATA PRODUCT LICENSE AGREEMENT + +1. GRANT OF LICENSE + Provider grants Licensee non-exclusive right to access and use + the Data Product for purposes outlined in this Agreement. + +2. 
USAGE RESTRICTIONS + Licensee agrees to: + - Use Data Product only for authorized purposes + - Not share Data with third parties without written consent + - Not reverse-engineer or attempt to identify individuals + - Maintain data confidentiality and security + - Not re-distribute or re-sell Data + +3. INTELLECTUAL PROPERTY + Provider retains all intellectual property rights in Data Product. + Licensee receives only limited usage rights under this Agreement. + +4. DATA QUALITY & DISCLAIMERS + Provider provides Data "AS-IS" without warranties: + - Regarding accuracy or completeness + - Of fitness for specific purposes + - Against infringement or third-party claims + +5. DATA REFRESH & UPDATES + Data Product updated [daily/weekly/monthly] at [time] UTC. + Provider not responsible for delays due to source system outages. + +6. PAYMENT TERMS + License Fee: [description and amount] + Payment Due: [frequency and method] + Late Payment: [interest rate and consequences] + +7. TERMINATION + Either party may terminate with [30] days written notice. + Upon termination, Licensee must cease all use immediately. + +8. CONFIDENTIALITY + Licensee treats Data as confidential and not as public information. + +9. LIMITATION OF LIABILITY + Provider's total liability limited to fees paid in prior 12 months. + Provider not liable for indirect, consequential, or lost profits. + +10. GOVERNING LAW + This Agreement governed by laws of [jurisdiction]. +``` + +## Visibility Settings: Public vs Private Contexts + +### Context Model + +**Private Context:** +- Isolated data environment per consumer +- Consumer cannot see other consumers' data +- Recommended for sensitive data sharing +- Higher operational overhead + +**Shared Context:** +- Multiple authorized consumers in shared environment +- Reduces operational complexity +- Lower cost for provider +- Appropriate for non-sensitive, aggregate data + +### Visibility Configuration + +**Private Context Setup:** +``` +1. 
Create Data Product listing +2. Set Visibility: Private +3. Set Access: Manual Approval only +4. For each approved consumer: + a. Create consumer-specific context + b. Assign data access policies + c. Configure column/row-level security + d. Notify consumer of access provisioning +5. Monitor usage per consumer +``` + +**Shared Context Setup:** +``` +1. Create Data Product listing +2. Set Visibility: Private or Internal +3. Set Access: Auto-Approval (if homogeneous consumers) +4. Apply uniform security policies to all consumers +5. Monitor aggregate usage +``` + +## Consumer Perspective: Discovering and Requesting Access + +### How Consumers Find Your Data Products + +**Discovery Methods:** + +1. **Data Marketplace Browse:** + ``` + Datasphere → Data Sharing → Data Marketplace + → Browse by Category (Sales, Marketing, Operations) + → Search by keyword (sales, customer, revenue) + ``` + +2. **Search Functionality:** + ``` + Search: "product sales" + Results: All public/accessible products with keyword match + Sort By: Relevance, Recently Updated, Rating + ``` + +3. 
**Provider Recommendations:** + ``` + Follow favorite data providers + → Receive notifications of new products + → See curated product collections + ``` + +### Consumer Access Request Process + +**Step 1: Consumer Discovers Your Product** +- Finds listing in marketplace +- Reviews description and metadata +- Checks licensing terms + +**Step 2: Consumer Submits Access Request** +``` +Click: "Request Access" button +Form: + - Use Case: Describes intended analysis/purpose + - Expected Volume: How much data access needed + - Timeline: When access needed + - Company: Consumer organization + - Team: Consumer department/team +``` + +**Step 3: You Review and Approve** +``` +Data Sharing Cockpit → Pending Requests +Review: + - Consumer organization and legitimacy + - Use case alignment with your product intent + - Any risk concerns or conflicts +Actions: + - Approve: Grant access immediately + - Conditional Approve: Grant with restrictions (e.g., time-limited) + - Request Info: Ask clarifying questions + - Deny: Reject request with explanation +``` + +**Step 4: Consumer Receives Access** +- Automated notification upon approval +- Access instructions and documentation +- Connection details if applicable +- Support contact information + +## Data Marketplace: Publishing and Managing Subscribers + +### Publishing to Marketplace + +**Readiness Checklist Before Publishing:** +- Data quality validated (completeness, accuracy, timeliness) +- Metadata documented (business context, lineage, limitations) +- Security policies configured (column/row level) +- License terms finalized and reviewed by legal +- Support team trained on handling consumer questions +- SLA for refresh/updates defined +- Monitoring and alerting configured + +**Publishing Steps:** + +``` +1. Data Sharing Cockpit → My Products → [Product Name] +2. Click "Publish to Marketplace" +3. 
Confirm: + - Visibility: Public (marketplace listing) + - Access Request: Manual or Auto-Approval + - Pricing: Free, Subscription, or Pay-Per-Query +4. Add marketplace-specific metadata: + - Product tags (sales, customer, financial, etc.) + - Audience/industry tags + - Use case examples + - Rating and reviews enabled: Yes/No +5. Review marketplace preview +6. Confirm and publish +``` + +**Marketplace Publishing Considerations:** +- Product becomes searchable to all Datasphere tenants +- Enable consumer reviews/ratings +- Respond to consumer questions publicly +- Monitor marketplace analytics for interest trends +- Consider SEO in product name and tags + +### Managing Subscribers + +**Subscriber Dashboard:** + +``` +View → Subscriptions + +DISPLAY: +Consumer Company | Subscription Date | Status | Usage | Actions +───────────────────────────────────────────────────────────────── +Acme Corp | 2024-01-15 | Active | High | View Details +Beta Partners | 2024-01-10 | Active | Low | View Details +XYZ Industries | 2024-01-05 | Active | Med | Pause/Terminate +ABC Ventures | 2023-12-20 | Inactive| None | Reactivate +``` + +**Subscriber Actions:** + +| Action | Use Case | Impact | +|--------|----------|--------| +| View Details | Understand consumer usage patterns | Informational | +| Send Update | Notify of product changes/improvements | Engagement | +| Pause Access | Temporary suspension (technical issues) | Consumer blocked | +| Terminate | End subscription (license expired) | Consumer blocked | +| Adjust Limits | Change data volume/row limits | Access modified | + +**Managing Inactive Subscribers:** + +``` +STEP 1: Monitor Inactivity + - Alert after 30 days no queries + - Alert after 90 days no usage + +STEP 2: Outreach (if value-add) + - Email consumer explaining new features + - Offer training/support for adoption + - Explore unmet needs + +STEP 3: Decision + - If consumer engaged: Continue service + - If no response after outreach: Consider termination + - Negotiate 
reduced pricing if volume is a concern
+
+STEP 4: Communicate Changes
+   - Notify consumer of any adjustments
+   - Update SLA/terms if applicable
+```
+
+## Data Access Control Considerations
+
+### Sensitive Data Handling
+
+**Identify Sensitive Elements:**
+- Personal data (customer names, emails, IDs)
+- Financial data (pricing, costs, revenue)
+- Competitive information (market strategies)
+- Proprietary methodologies
+
+**Masking Strategies:**
+
+| Data Type | Masking Approach | Example |
+|-----------|-----------------|---------|
+| **Customer Names** | Redact or hash | John Smith → CUST_12345 |
+| **Email Addresses** | Partial redact | john@example.com → j***@example.com |
+| **Phone Numbers** | Partial redact | 555-123-4567 → 555-***-**** |
+| **Cost Data** | Hash or remove | $1.50 → [REDACTED] |
+| **Revenue** | Aggregate only | Sum/Avg, not individual |
+| **Personal IDs** | Hash with salt | 123-45-6789 → h7e2a9k4 |
+
+**Column-Level Security Implementation:**
+
+```sql
+-- Define sensitivity levels
+ALTER VIEW product_sales_for_sharing
+SET COLUMN SECURITY
+    (customer_id = 'HIGH', unit_price = 'MEDIUM', total_amount = 'MEDIUM');
+
+-- Create role-specific view: customer_id masked, unit_price removed
+CREATE VIEW sales_by_partner AS
+SELECT HASH(customer_id) AS customer_id, ...
+FROM product_sales_for_sharing;
+```
+
+### Privacy Compliance
+
+**Regulatory Considerations:**
+
+| Regulation | Key Requirement | Implementation |
+|------------|-----------------|-----------------|
+| **GDPR** | Right to be forgotten | Support customer deletion requests |
+| **CCPA** | Consumer data rights | Allow consumers to request/opt-out |
+| **PII Rules** | Non-identification of individuals | Hash/mask personal identifiers |
+| **Data Residency** | Geographic restrictions | Ensure data stays in authorized regions |
+
+**Privacy Impact Assessment:**
+
+```
+BEFORE PUBLISHING:
+
+Question 1: Does the data contain personally identifiable information?
+    → Confirm PII is masked/redacted
+
+Question 2: Can individuals be identified through data inference?
+    → Test combinations of attributes
+    → Ensure k-anonymity > 5 (minimum 5 individuals per cell)
+
+Question 3: Is there a compliance requirement for data geography?
+    → Confirm data not crossing borders
+    → Document jurisdiction alignment
+
+Question 4: What is the data retention requirement?
+    → Establish deletion schedule
+    → Configure automated purge policies
+```
+
+## Quality Requirements for Publishable Data Products
+
+### Quality Dimensions Framework
+
+**COMPLETENESS (100% Rule)**
+- Expected: No null values in key columns
+- Validation: Row count matches expected daily volume ± 5%
+- Example: "Daily customer updates expected 500K, actual 485K-515K"
+
+**ACCURACY (Known Validation)**
+- Expected: Spot-check sample matches source ± 0.5%
+- Validation: Totals align with ERP reporting
+- Example: "Monthly revenue reconciled to GL within 0.1%"
+
+**CONSISTENCY (Relational Integrity)**
+- Expected: All foreign keys resolve to valid parent records
+- Validation: No orphaned records
+- Example: "Every product_id in sales matches product_master"
+
+**TIMELINESS (SLA-Based)**
+- Expected: Data refreshed daily by 6am UTC
+- Validation: 99.5% on-time delivery tracked
+- Example: "Rolling SLA: 150/151 daily refreshes on-time = 99.3%"
+
+**CONFORMITY (Schema Validation)**
+- Expected: All columns present, correct data types
+- Validation: Schema unchanged without notice
+- Example: "Revenue always DECIMAL(10,2), never NULL, always > 0"
+
+### Quality Scorecard
+
+**Calculate Quality Score:**
+```
+Quality Score = (Completeness + Accuracy + Consistency + Timeliness + Conformity) / 5
+
+SCORING:
+  95-100% = Production Ready
+  85-95%  = Acceptable (monitor closely)
+  75-85%  = Conditional (fix issues before publishing)
+  < 75%   = Not Ready (rework required)
+```
+
+**Example Quality Report:**
+
+```
+PRODUCT: Retail Sales Transactions
+REPORTING PERIOD: January 2024
+
+COMPLETENESS:
+  Target: 100% non-null on transaction_id, customer_id, amount
+  Result: 99.8% (5,000 incomplete records out of 2.5M)
+  Status: PASS
+
+ACCURACY:
+  Target: Revenue within 0.5% of ERP GL
+  Result: GL reconciliation: GL $50.2M vs Data $50.1M = 0.2% variance
+  Status: PASS
+
+CONSISTENCY:
+  Target: All customer_ids in sales found in customer_master
+  Result: 99.97% matching (750 orphaned sales found, corrected)
+  Status: PASS WITH CORRECTION
+
+TIMELINESS:
+  Target: Daily refresh by 6am UTC, 99%+ on-time
+  Result: 31/31 refreshes on-time in January = 100%
+  Status: PASS
+
+CONFORMITY:
+  Target: All columns present, correct types, no schema drift
+  Result: All 15 columns present, types correct, no changes detected
+  Status: PASS
+
+OVERALL SCORE: (99.8 + 100 + 99.97 + 100 + 100) / 5 = 99.95%
+VERDICT: PRODUCTION READY ✓
+```
+
+## Using MCP Tools for Data Product Management
+
+### browse_marketplace
+Discover and analyze market opportunities:
+```
+browse_marketplace(category="Sales", sort_by="rating", limit=20)
+```
+Returns: Top products in category, enabling you to understand competitive positioning
+
+### search_catalog
+Find related data products and dependencies:
+```
+search_catalog(keyword="customer", consumer_type="public")
+```
+Returns: All products matching search, helping identify ecosystem opportunities
+
+### get_asset_details
+Retrieve complete product metadata:
+```
+get_asset_details(asset_id="retail_sales_transactions_001")
+```
+Returns: Full product definition, schemas, lineage, security policies
+
+### get_space_info
+Understand data space structure and capacity:
+```
+get_space_info(space_name="sales_analytics")
+```
+Returns: Space configuration, objects, sizing, security posture
+
+## Data Product Publishing Workflow
+
+1. **Evaluate**: Assess candidate datasets (business value, quality, compliance)
+2. **Prepare**: Create views, validate metadata, ensure quality
+3. **List**: Create product listing with compelling description
+4. 
**License**: Define terms, access policies, pricing +5. **Publish**: Make product discoverable to target consumers +6. **Manage**: Review requests, approve consumers, monitor subscriptions +7. **Support**: Answer questions, resolve issues, handle updates +8. **Monitor**: Track usage, quality metrics, customer satisfaction +9. **Iterate**: Gather feedback, make improvements, version updates +10. **Retire**: Deprecate products, communicate sunset dates + +## Best Practices + +- Start with high-quality, high-value data (not everything is a data product) +- Create consumer-specific views (avoid direct table exposure) +- Document thoroughly including known limitations +- Test access controls with representative consumers before publishing +- Implement automated quality monitoring +- Respond to access requests within 24-48 hours +- Establish clear support SLA for subscribers +- Monitor marketplace analytics to understand demand +- Regularly collect consumer feedback and improve products +- Version data products and communicate changes +- Plan data retention and archival strategy upfront +- Consider privacy and compliance implications early diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/references/data-sharing-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/references/data-sharing-guide.md new file mode 100644 index 0000000..d707d1f --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/references/data-sharing-guide.md @@ -0,0 +1,1034 @@ +# Data Sharing Guide Reference + +## Data Product Listing Template + +Use this template when creating a new data product listing in the Data Sharing Cockpit. 
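+Before filling in the template, it helps to have the source object in place. The template recommends a dedicated, consumer-friendly view rather than a raw table; a minimal sketch of such a publish view, assuming an illustrative base table and columns (not actual product objects):
+
+```sql
+-- Illustrative publish view for the template's "Source Object" field.
+-- Base table and column names are assumptions, not real product objects.
+CREATE VIEW vw_customer_analytics_publish AS
+SELECT
+    customer_id,
+    customer_segment,
+    lifetime_value,
+    churn_risk_score,
+    last_purchase_date
+FROM customer_analytics_base          -- hypothetical base table
+WHERE customer_type = 'B2C'           -- known limitation: B2C retail only
+  AND customer_id IS NOT NULL;        -- protects the completeness guarantee
+```
+
+Keeping the publish view separate from production tables makes the template's later fields (columns included, known limitations, data quality) straightforward to document.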
+ +### Required Fields + +``` +PRODUCT IDENTIFICATION +==================== +Product Name: [Business-friendly name, max 100 chars] + Example: "Customer Master Data with Behavioral Analytics" + +Product ID: [System-generated, read-only] + Example: "prod_customer_master_001" + +Category: [Select from dropdown] + Options: Financial, Sales & Marketing, Operations, HR, Supply Chain, Other + Selection: Sales & Marketing + +Owner Information +================= +Data Owner Name: [Full name] +Data Owner Email: [Email address] +Data Owner Phone: [Optional, phone number] +Support Contact: [Email for consumer support] +Support Hours: [Optional, e.g., "Mon-Fri 8am-5pm EST"] + +Business Context +================ +Short Description: [One-liner value prop, max 150 chars] + Example: "Complete customer profiles with 3-year purchase history and engagement metrics" + +Long Description: [Detailed explanation, max 2000 chars] + Include: + - What data is included + - Who should use it + - Key analytical use cases + - Time period covered + - Refresh frequency + +Data Source Configuration +========================= +Data Source Type: [Table | View | Materialized View | Replicated Object] + Recommendation: Use View for consumer-friendly schema + +Source Object: [Select from your space] + Example: "vw_customer_analytics_publish" + +Columns Included: [Auto-populated from source] + Verify: + - No sensitive columns exposed + - All necessary columns included + - Column names are business-friendly + +Row Count: [Current record count] + Example: "50,000,000 rows" + +Data Size: [Current storage size] + Example: "25 GB compressed" + +Data Grain: [Explain level of detail] + Example: "One row per customer, with aggregated metrics" + +Refresh Frequency: [How often data updates] + Options: Real-time | Hourly | Daily | Weekly | Monthly | Quarterly | Annual + Selection: Daily + Refresh Time (UTC): 06:00 + +Refresh SLA: [Service level agreement for timeliness] + Example: "99.5% on-time delivery (5 min 
tolerance)"
+
+Historical Depth: [How much historical data]
+  Example: "3 years rolling window (36 months)"
+
+Data Quality
+============
+Completeness: [% non-null for key columns]
+  Example: "99.8% (5,000 of 2.5M customer records are missing customer_id)"
+
+Accuracy Validation: [How accuracy is verified]
+  Example: "Daily reconciliation to ERP system, <0.5% variance tolerance"
+
+Known Limitations: [Document any data gaps or caveats]
+  Example:
+  - "Excludes B2B customers (B2C retail only)"
+  - "Online data limited to web (excludes mobile app)"
+  - "Regional data available for 12 primary markets"
+  - "Regional discounts excluded (list price only)"
+
+Data Quality Score: [Overall quality rating]
+  Options: Excellent (95%+) | Good (85-95%) | Acceptable (75-85%) | Poor (<75%)
+  Selection: Excellent (98%)
+
+Accessibility & Sharing
+======================
+Data Lineage: [Where data originates]
+  Example: "Source: SAP S/4HANA ERP
+            Transform: Nightly replication + view transformation
+            Updated: 2024-01-15"
+
+Visibility: [Who can see this product]
+  Options: Private | Internal Only | Public
+  Selection: Private
+  Rationale: "Initial pilot with selected partners"
+
+Access Approval: [How consumers request access]
+  Options: Manual Approval | Auto-Approval
+  Selection: Manual Approval
+  Rationale: "Want to control initial consumer access"
+
+Consumer Context: [Data isolation per consumer]
+  Options: Shared Context | Private Context
+  Selection: Private Context
+  Rationale: "Each consumer gets isolated data environment"
+
+Licensing & Terms
+=================
+License Type: [How data is licensed]
+  Options: Free | Subscription | Pay-Per-Query | Custom
+  Selection: Subscription
+
+License Fee: [Cost structure if applicable]
+  Example: "$500/month for subscription"
+
+License Terms: [Attach or reference T&C document]
+  Include:
+  - Non-exclusive access rights
+  - Permitted use cases
+  - Restrictions (no redistribution, commercial use)
+  - Confidentiality obligations
+  - Data 
retention requirements + +Sample License Terms: + "Non-exclusive access for internal business intelligence use. + Licensee may not re-distribute, re-sell, or share with third parties. + Data is confidential and provided as-is without warranty. + Data retention: No longer than 1 year after subscription ends." + +Pricing Model Details: [If applicable] + Subscription: $500/month, billed monthly, 30-day cancellation + Pay-Per-Query: $0.50 per GB queried, billed monthly + Volume Discounts: 20% off at 10+ subscriptions + +Marketplace +=========== +Marketplace Tags: [Keywords for discoverability] + Examples: customer, sales, marketing, analytics, 3-year, behavioral + +Target Industry: [Industries that would benefit] + Examples: Retail, E-commerce, Consumer Goods, Financial Services + +Use Case Examples: [2-3 concrete examples] + Example 1: "Customer segmentation for targeted marketing campaigns" + Example 2: "Lifetime value prediction for acquisition ROI optimization" + Example 3: "Churn risk identification for retention programs" + +Rating & Reviews: [Enable consumer feedback] + Option: Enabled + +Documentation Link: [URL to extended documentation] + Example: "https://knowledge.company.com/data-products/customer-master" + +Support Contacts +================ +Primary Support Contact: [Name and email] +Support Email: [Monitored email for requests] +Support SLA: [Response time commitment] + Example: "24-hour response during business hours" + +Documentation Contacts: [Who updates metadata] +``` + +## License Term Options and Templates + +### License 1: Internal Free Access + +**Use Case:** Sharing within organization, building adoption + +``` +LICENSE AGREEMENT: INTERNAL FREE ACCESS + +1. GRANT + Company grants employees of its organization non-exclusive, + royalty-free right to access and use the Data Product for + business purposes only. + +2. 
PERMITTED USE + Licensee may: + - Query and analyze the Data Product + - Use analysis results internally for business decisions + - Share insights (not raw data) with colleagues + +3. RESTRICTIONS + Licensee may NOT: + - Share raw data with external parties + - Attempt to identify individuals + - Reverse-engineer data structures + - Distribute data commercially + +4. CONFIDENTIALITY + Licensee acknowledges Data is confidential and maintains + appropriate security controls. + +5. NO WARRANTY + Data provided AS-IS without warranty of accuracy, completeness, + or fitness for particular purpose. + +6. TERMINATION + License terminates when employment ends or upon 30 days notice. + +TERM: Effective upon grant, continues until termination +NO COST: No license fees +``` + +### License 2: Partner Subscription + +**Use Case:** Premium data sharing with external partners + +``` +LICENSE AGREEMENT: PARTNER SUBSCRIPTION DATA + +1. GRANT + Provider grants Licensee non-exclusive right to access + and use Data Product under terms of this Agreement. + +2. PERMITTED USE + Licensee may: + - Query and analyze Data Product + - Incorporate insights into business processes + - Use for [SPECIFIC USE CASE: "customer analytics", "market analysis"] + - Support [NUMBER] internal users + +3. RESTRICTIONS + Licensee may NOT: + - Redistribute, re-sell, or share Data with third parties + - Disclose Data to competitors + - Create derivative products for external sale + - Attempt to identify individuals from data + - Use Data for purposes other than above + +4. DATA QUALITY + Provider provides Data "AS-IS" without warranties regarding: + - Accuracy, timeliness, or completeness + - Fitness for specific purposes + - Non-infringement of third-party rights + +5. REFRESH TERMS + - Data updated: [FREQUENCY: "Daily by 6am UTC"] + - Provider not liable for refresh delays + - Maximum 48-hour delay without penalty + +6. 
PAYMENT + - Fee: $[AMOUNT]/[PERIOD] (e.g., "$1,000/month") + - Payment: Due net 30 days from invoice + - Late payment: 1.5% monthly interest + - Annual discount available: [%] + +7. TERM & TERMINATION + - Initial Term: [PERIOD: "12 months"] + - Renewal: Automatic unless 30-day notice given + - Termination for Cause: Immediate if material breach + - Termination for Convenience: 30 days notice, no refund + - Upon termination: Licensee ceases all access immediately + +8. CONFIDENTIALITY + Licensee treats Data as confidential and implements reasonable + security measures to prevent unauthorized access. + +9. LIABILITY LIMIT + Provider's total liability limited to fees paid in prior 12 months. + In no event liable for indirect, consequential, lost profits, or + data corruption damages. + +10. DATA RESIDENCY + Data remains in [REGION: "US East"] data centers only. + Licensee not permitted to export or copy to other regions. + +EXECUTION: Executed and effective as of [DATE] +``` + +### License 3: Marketplace - Pay-Per-Query + +**Use Case:** Open marketplace model with usage-based pricing + +``` +LICENSE AGREEMENT: MARKETPLACE PAY-PER-QUERY + +1. GRANT + Provider grants Licensee non-exclusive right to query Data Product + according to pricing model in Section 5. + +2. PERMITTED USE + Licensee may: + - Query Data Product for business intelligence + - Download query results for analysis + - Use results for internal decision-making + - Support up to [NUMBER] concurrent queries + +3. RESTRICTIONS + Licensee may NOT: + - Redistribute raw data or query results + - Use for commercial re-sale + - Share Data with external parties + - Attempt to identify individuals + +4. QUERY TERMS + - Query Limit: Unlimited + - Concurrent Queries: [NUMBER, e.g., "5 concurrent"] + - Result Size Limit: [e.g., "1M rows per query"] + - Timeout: Queries > 1 hour auto-cancel + +5. 
PRICING & PAYMENT + Pricing Model: Pay-Per-Query (GB scanned) + Rate: $[PRICE] per GB scanned (e.g., "$0.50 per GB") + Minimum: None + Billing: Monthly, based on actual usage + Payment Due: Net 30 days from invoice + Late Payment: 1.5% monthly interest + + Example Billing: + - February Queries: 250 GB scanned + - Rate: $0.50/GB + - Invoice: 250 * $0.50 = $125.00 + +6. VOLUME DISCOUNTS + - 100+ GB/month: 10% discount + - 500+ GB/month: 25% discount + - 1,000+ GB/month: 40% discount + +7. USAGE MONITORING + Provider monitors: + - GB scanned per query (billed metric) + - Query frequency + - Data accessed + Licensee receives weekly usage reports. + +8. DATA QUALITY & DISCLAIMERS + Data provided AS-IS without warranty. + Provider not responsible for: + - Accuracy or completeness + - Timeliness relative to source systems + - Fitness for particular purposes + +9. TERM & TERMINATION + - Effective: Upon account creation + - Ongoing: Month-to-month auto-renewal + - Termination: Either party may terminate with 30-day notice + - Effect: Access revoked at end of notice period + - Final Bill: Due on termination date + +10. CONFIDENTIALITY + Licensee maintains Data confidentiality per industry standards. + +11. LIABILITY LIMITS + Provider total liability capped at usage fees in prior 3 months. + No liability for indirect or consequential damages. + +ACCEPTANCE: Acceptance of terms required upon first query +``` + +## Access Policy Configuration Patterns + +### Pattern 1: Consumer Role-Based Access + +Implement different access policies based on consumer type. 
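+A common wiring for role-based policies is one view per consumer role, with each view exposed only to that role's consumer context. A hedged sketch of the grants (the grantee names are illustrative; in Datasphere, consumer-context exposure is configured through the Data Sharing Cockpit rather than plain SQL):
+
+```sql
+-- Hypothetical grants: each role-specific view is exposed only to its audience
+GRANT SELECT ON vw_sales_data_internal TO internal_sales_role;
+GRANT SELECT ON vw_sales_data_reseller TO reseller_partner_context;
+GRANT SELECT ON vw_sales_data_investor TO investor_partner_context;
+```
+
+The role-specific views themselves are defined as follows.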
+ +**Sales Team (Internal)** +```sql +-- Full access to all columns, no filtering +CREATE VIEW vw_sales_data_internal AS +SELECT * FROM product_sales_fact +WHERE business_unit = 'SALES'; + +SECURITY POLICY: NONE (internal full access) +``` + +**Partner - Reseller** +```sql +-- Limited columns, aggregate level +CREATE VIEW vw_sales_data_reseller AS +SELECT + transaction_date, + product_category, -- Not product_id (too detailed) + region, + SUM(quantity) as total_quantity, -- Aggregated + SUM(revenue) as total_revenue, -- Aggregated + COUNT(DISTINCT customer_id) as customers -- Hidden identities +FROM product_sales_fact +WHERE transaction_date >= DATEADD(MONTH, -12, CURRENT_DATE) +GROUP BY transaction_date, product_category, region; + +SECURITY POLICY: +- Column restrictions: Hide unit_price, cost, customer_id +- Row restrictions: Current year only (12-month rolling) +``` + +**Partner - Investor** +```sql +-- Quarterly summaries only +CREATE VIEW vw_sales_data_investor AS +SELECT + EXTRACT(QUARTER FROM transaction_date) as quarter, + EXTRACT(YEAR FROM transaction_date) as year, + SUM(revenue) as total_revenue, + COUNT(DISTINCT customer_id) as customer_count, + AVG(revenue / NULLIF(quantity, 0)) as avg_price, + SUM(quantity) as total_units +FROM product_sales_fact +WHERE transaction_date >= DATEADD(YEAR, -3, CURRENT_DATE) +GROUP BY EXTRACT(QUARTER FROM transaction_date), + EXTRACT(YEAR FROM transaction_date); + +SECURITY POLICY: +- Column restrictions: Only aggregate metrics +- Row restrictions: Last 3 years quarterly summary +``` + +### Pattern 2: Data Masking by Consumer Sensitivity + +Hide sensitive columns based on consumer classification. 
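The masking step for Pattern 2 can also be sketched outside SQL. A minimal sketch assuming rows arrive as dictionaries; the column names match the illustrative configuration in this pattern, and the 16-character SHA-256 prefix is an arbitrary choice:

```python
import hashlib

# Columns hidden from partner-classified consumers (illustrative names).
SENSITIVE = {"unit_cost", "profit_margin", "negotiated_discount"}

def mask_row(row: dict) -> dict:
    """Apply partner-level masking: drop sensitive columns, hash the customer name."""
    masked = {k: v for k, v in row.items() if k not in SENSITIVE}
    if "customer_name" in masked:
        # Replace the name with a stable, non-reversible hash
        masked["customer_hash"] = hashlib.sha256(
            masked.pop("customer_name").encode()
        ).hexdigest()[:16]
    return masked

print(mask_row({"customer_name": "Acme GmbH", "unit_cost": 4.2, "unit_price": 9.9}))
```

Because the hash is deterministic, the same customer maps to the same `customer_hash` across refreshes, so partners can still count distinct customers without learning identities.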
+ +**Configuration:** +``` +CONSUMER CLASSIFICATION: + - Internal (Full Access): All columns, all rows + - Partner (Medium Access): 60% of columns, filtered rows + - Public (Limited Access): 20% of columns, aggregate only + +APPLICATION: + Consumer: Acme Partners (Partner classification) + Hide Columns: unit_cost, profit_margin, negotiated_discount + Filter Rows: Last 12 months (not 5-year history) + Mask Values: Customer names → hashed IDs +``` + +**Implementation:** +```sql +-- Base table (secure) +CREATE TABLE sales_transaction_secure ( + transaction_id INT, + transaction_date DATE, -- Needed by the partner view's 12-month filter + customer_name VARCHAR(100), -- SENSITIVE + customer_id INT, + product_id INT, + unit_cost DECIMAL(10,2), -- SENSITIVE + unit_price DECIMAL(10,2), + quantity INT, + revenue DECIMAL(10,2) +); + +-- Secure column definitions +ALTER TABLE sales_transaction_secure +SET COLUMN SECURITY ( + customer_name = 'SENSITIVE', + unit_cost = 'SENSITIVE' +); + +-- Partner view (limited access) +CREATE VIEW vw_sales_for_partner AS +SELECT + transaction_id, + HASH(customer_name) as customer_hash, -- Masked + product_id, + -- unit_cost hidden completely + unit_price, + quantity, + revenue +FROM sales_transaction_secure +WHERE transaction_date >= DATEADD(MONTH, -12, CURRENT_DATE) + AND customer_id NOT IN (SELECT customer_id FROM customer_vip_list); + -- Exclude VIP customers from partner view + +-- Grant permissions +GRANT SELECT ON vw_sales_for_partner TO 'acme_partners_consumer_context'; +``` + +### Pattern 3: Time-Based Data Tiering + +Provide different access horizons based on subscription level. 
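The tier horizons in Pattern 3 reduce to a small lookup. A sketch with assumed tier names matching the tiers in this pattern; the 30-day month approximation is deliberately coarse:

```python
from datetime import date, timedelta

# Months of history visible per subscription tier (None = unlimited history).
TIER_HORIZON_MONTHS = {"starter": 3, "professional": 12, "enterprise": None}

def earliest_visible_date(tier: str, today: date):
    """Oldest transaction date a tier may see; None means full history."""
    months = TIER_HORIZON_MONTHS[tier.lower()]
    if months is None:
        return None
    return today - timedelta(days=30 * months)  # coarse 30-day months

print(earliest_visible_date("starter", date(2026, 3, 1)))  # roughly 3 months back
```

In production the cutoff would live in the tiered view definitions themselves (as in the SQL above); a helper like this is only useful for validating that a consumer's assigned view matches their subscription.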
+ +**Tier 1: Starter (Free)** +``` +Access: Last 3 months only +Refresh: Weekly (not daily) +Latency Tolerance: 7 days (historical data only) +Rows Returned: Max 1M per query +Use Case: Ad-hoc exploration, small team +``` + +**Tier 2: Professional ($500/month)** +``` +Access: Last 12 months +Refresh: Daily +Latency Tolerance: Same-day (updated by 6am next day) +Rows Returned: Max 100M per query +Use Case: Regular analysis, medium team +``` + +**Tier 3: Enterprise ($2,000+/month)** +``` +Access: All historical (7+ years) +Refresh: Real-time (updated within 1 hour) +Latency Tolerance: 1 hour +Rows Returned: Unlimited +Concurrent Queries: 10+ +Use Case: Deep analysis, large organization +``` + +**Implementation:** +```sql +-- Tiered view creation +CREATE VIEW vw_sales_tier_starter AS +SELECT * FROM sales_fact +WHERE transaction_date >= DATEADD(MONTH, -3, CURRENT_DATE); + +CREATE VIEW vw_sales_tier_professional AS +SELECT * FROM sales_fact +WHERE transaction_date >= DATEADD(YEAR, -1, CURRENT_DATE); + +CREATE VIEW vw_sales_tier_enterprise AS +SELECT * FROM sales_fact; -- All data + +-- Assign views based on subscription +CONSUMER: Consumer_A (Tier: Starter) +GRANT SELECT ON vw_sales_tier_starter TO 'consumer_a_context'; + +CONSUMER: Consumer_B (Tier: Professional) +GRANT SELECT ON vw_sales_tier_professional TO 'consumer_b_context'; + +CONSUMER: Consumer_C (Tier: Enterprise) +GRANT SELECT ON vw_sales_tier_enterprise TO 'consumer_c_context'; +``` + +## Quality Checklist Before Publishing + +Use this checklist to verify data product readiness. 
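Several of the completeness checks in this checklist can be scripted rather than run by hand. A minimal sketch using an in-memory SQLite table as a stand-in for the published view (table and column names are illustrative, not from the product):

```python
import sqlite3

# Stand-in for the published view; in practice this would query Datasphere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER, name TEXT);
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget'), (NULL, 'Orphan');
""")

def null_pk_count(conn, table, pk):
    """Completeness check: count rows whose primary key is NULL (expected: 0)."""
    # table/pk come from a trusted checklist config, not user input
    return conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {pk} IS NULL").fetchone()[0]

violations = null_pk_count(conn, "product", "product_id")
print(f"NULL primary keys: {violations}")  # one seeded violation fails the check
```

Wiring checks like this into the refresh pipeline turns the pre-publication checklist into a regression gate: any nonzero count blocks the publish.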
+ +``` +DATA QUALITY VALIDATION +======================= + +COMPLETENESS CHECKS +[ ] No nulls in primary key columns + Script: SELECT COUNT(*) FROM product WHERE product_id IS NULL + Expected: 0 rows + +[ ] Row count stable (within ±5% of expected daily volume) + Script: SELECT COUNT(*) FROM sales WHERE DATE(load_date) = CURRENT_DATE + Expected: Daily volume ±5% + +[ ] All expected columns present + Script: DESCRIBE product_sales_view + Compare: Against documented schema + +[ ] No orphaned records (foreign key integrity) + Script: SELECT COUNT(*) FROM sales s + WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = s.customer_id) + Expected: 0 rows + +ACCURACY CHECKS +[ ] Sample spot-check against source system (10-20 records) + Process: Compare 20 random transaction IDs + Expected: 100% match + +[ ] Monthly totals reconcile to GL (within 0.5%) + Script: SELECT SUM(amount) FROM sales WHERE MONTH = CURRENT_MONTH + Compare: Against GL Total from ERP + Tolerance: ±0.5% + +[ ] Aggregate metrics validate with known reports + Process: Compare dashboard metrics with published reports + Expected: Match within 0.1% + +CONSISTENCY CHECKS +[ ] Foreign key relationships valid + Script: Validate all product_ids exist in product_master + Expected: 100% match + +[ ] Grain/cardinality matches documentation + Documentation: "One row per transaction, uniqueness on transaction_id" + Validation: Count(Distinct transaction_id) = Count(*) + Expected: Equals + +[ ] No duplicate composite keys + Script: SELECT transaction_id, customer_id, COUNT(*) FROM sales + GROUP BY transaction_id, customer_id + HAVING COUNT(*) > 1 + Expected: 0 rows + +[ ] Data types consistent with schema + Process: Review DESCRIBE output for column types + Expected: All matches documented schema + +TIMELINESS CHECKS +[ ] Last refresh timestamp recent (within SLA) + Script: SELECT MAX(load_timestamp) FROM sales_fact + Expected: Within last 24 hours + +[ ] Refresh success rate meets SLA (99%+) + Script: SELECT 
COUNT(*) FROM refresh_log + WHERE success = TRUE AND load_date >= DATEADD(DAY, -30, CURRENT_DATE) + Expected: ≥ 30 days / month * 0.99 = ≥29.7 successes + +[ ] No stale partitions (gaps in date coverage) + Script: SELECT DISTINCT transaction_date FROM sales + ORDER BY transaction_date DESC + Expected: No gaps in recent dates + +CONFORMITY CHECKS +[ ] No schema drift (columns, types unchanged) + Process: Compare current DDL to version-controlled schema + Expected: Exact match + +[ ] Column names business-friendly (no cryptic codes) + Review: All column names understandable to business users + Expected: All descriptive + +[ ] All documentation current and accurate + Review: Updated within last 30 days + Expected: All sections current + +SECURITY & PRIVACY CHECKS +[ ] No PII exposed unmasked + Process: Review view definitions for customer names, SSN, etc. + Expected: All sensitive data masked/removed + +[ ] Access control policies in place + Check: Row/column level security policies defined + Expected: Policies for all consumer types + +[ ] License terms final and reviewed by legal + Review: With legal/compliance team + Expected: Approved signature + +PERFORMANCE CHECKS +[ ] Query performance meets expectations + Test: Run 3 representative queries + Expected: All complete within 30 seconds + +[ ] Memory usage reasonable (no spills) + Monitor: Check query logs for memory spill events + Expected: Zero spill events + +[ ] View dependencies documented + List: All dependent views clearly identified + Expected: Dependency tree documented + +METADATA CHECKS +[ ] Business description complete and clear + Review: Can external user understand purpose/content? 
+ Expected: Yes, without follow-up questions + +[ ] Data lineage documented + Document: Source → ETL → View → Consumer + Expected: Full lineage documented + +[ ] Refresh frequency and SLA clearly stated + Document: "Daily by 6am UTC, 99.5% on-time SLA" + Expected: Clear SLA statement + +[ ] Known limitations documented + Example: "Excludes web orders, includes retail only" + Expected: Specific limitations listed + +[ ] Contact information complete + Check: Owner, support, escalation contacts + Expected: All contacts identified and verified + +INFRASTRUCTURE CHECKS +[ ] Monitoring & alerting configured + Setup: Email alerts for refresh failures + Expected: Alerts active and tested + +[ ] Backup strategy in place + Document: Recovery procedure and RTO/RPO + Expected: Documented and tested + +[ ] Capacity sufficient for growth + Plan: 3-year growth projection + Expected: Storage/compute adequate + +SIGN-OFF +======== +[ ] Business Owner Approves: _________________ Date: _______ +[ ] Data Steward Approves: _________________ Date: _______ +[ ] Quality Manager Approves: _________________ Date: _______ +[ ] Technical Owner Approves: _________________ Date: _______ + +Ready for Publication: YES / NO +``` + +## Consumer Onboarding Workflow + +Standard process for onboarding newly approved data product consumers. + +``` +CONSUMER ONBOARDING WORKFLOW +============================= + +PHASE 1: PRE-APPROVAL (1-2 days before approval) +═══════════════════════════════════════════════ + +Step 1: Verify Consumer Information + □ Company name and industry + □ Contact person and title + □ Number of users who will access + □ Expected use cases + Review: Does request align with product intent? + +Step 2: Assess Risk + □ Is consumer a direct competitor? + □ Are there geographic restrictions (data residency)? + □ Are there industry restrictions? 
+ □ Check: Industry/risk matrix for approval + +Step 3: Prepare Approval Response + □ Draft approval email + □ Include: Access instructions, support contact, T&Cs + □ Schedule: Prepare welcome materials + + +PHASE 2: APPROVAL NOTIFICATION (Day 1) +═══════════════════════════════════════ + +Step 1: Send Approval Email + Template: + ─────────────────────────────────────────────── + Subject: Access Approved: [Product Name] + + Dear [Consumer Name], + + Your request to access [Product Name] has been APPROVED. + + QUICK START: + 1. Login to Datasphere with your credentials + 2. Navigate to Data Sharing → My Subscriptions + 3. Click [Product Name] to begin querying + 4. See documentation: [link] + + SUPPORT: + Questions? Contact: [support email] + Support hours: [hours] + + LICENSE TERMS: + [Summary of key terms] + Full terms: [link] + + DATA DESCRIPTION: + [1-2 paragraph summary] + + IMPORTANT CONTACTS: + Data Owner: [name, email] + Technical Support: [email, hours] + + Welcome aboard! + ─────────────────────────────────────────────── + +Step 2: Notify Internal Team + To: Data owner, support team + Content: New consumer approved, use case summary + + +PHASE 3: TECHNICAL PROVISIONING (1-3 days) +═══════════════════════════════════════════ + +Step 1: Create Consumer Context (if private context model) + Action: Provision isolated data environment for consumer + Includes: Network, storage, access controls + Expected: 24 hours to completion + +Step 2: Configure Access Policies + Actions: + □ Apply row-level security (if applicable) + □ Apply column-level security (if applicable) + □ Configure view permissions + □ Test access with sample query + Validation: Verify only authorized data visible + +Step 3: Document Consumer-Specific Configuration + Record: + □ Consumer context ID + □ Connection strings/endpoints + □ Specific policies applied + □ Data restrictions (time, volume, columns) + File: [Documentation system] + + +PHASE 4: CONSUMER ENABLEMENT (3-5 days) 
+═══════════════════════════════════════ + +Step 1: Provide Detailed Documentation + Send: + □ Schema documentation (column descriptions, data types) + □ Sample queries for common use cases + □ Data dictionary + □ Known limitations and data quality notes + □ Refresh schedule and SLA + +Step 2: Conduct Technical Walkthrough (if requested) + Schedule: 30-minute video call + Content: + - Connection setup + - Navigation in Datasphere + - Query examples + - Where to get help + Attendees: Consumer tech lead + support person + +Step 3: Monitor Initial Usage + Week 1-2: + □ Check for errors in consumer logs + □ Verify expected data volume retrieved + □ Proactively reach out if issues detected + □ Gather initial feedback + + +PHASE 5: ONGOING SUPPORT (Continuous) +═════════════════════════════════════ + +Monthly Engagement: + □ Review usage metrics + □ Confirm SLA met (refresh timeliness, availability) + □ Check for support requests (response time < 24h) + □ Gather feedback (survey quarterly) + +Quarterly Business Review: + □ Usage trends + □ ROI/value realized + □ Product improvements based on feedback + □ Upcoming product changes + + +PHASE 6: OFFBOARDING (Upon Termination) +═════════════════════════════════════════ + +Step 1: Termination Notice + To: Consumer + Timeline: 30-day notice + Content: Specify termination date, offer grace period + +Step 2: Final Access & Data Export + Allow: Consumer to export final dataset (if applicable) + Timeline: Until termination date + Support: Help with export process if needed + +Step 3: Revoke Access + On Termination Date: + □ Disable consumer context + □ Revoke query permissions + □ Remove from subscription list + □ Archive access logs + +Step 4: Feedback Collection + Request: Post-termination survey + Questions: + - Was product valuable? + - Why discontinued? + - What would improve? 
+ Use: Guide product improvements +``` + +## Managing Data Product Lifecycle + +Framework for versioning, updating, deprecating, and retiring data products. + +``` +LIFECYCLE PHASES +════════════════ + +PHASE 1: LAUNCH (0-3 months) +───────────────────────────── +Status: Pilot / Limited Release +Consumers: Selected partners, internal teams +Focus: Quality validation, feedback collection +Actions: + □ Monitor quality metrics daily + □ Respond to questions within 24h + □ Collect detailed feedback + □ Prepare launch review +Output: Production readiness assessment + +PHASE 2: GROWTH (3-12 months) +────────────────────────────── +Status: General Availability +Consumers: Increasing adoption +Focus: Scale, stability, improvements +Actions: + □ Weekly quality/performance review + □ Respond to feature requests + □ Implement non-breaking improvements + □ Monitor usage trends + □ Add consumers as demand grows +Output: Stable, reliable product + +PHASE 3: MATURITY (12+ months) +──────────────────────────────── +Status: Stable +Consumers: Established user base +Focus: Optimization, efficiency +Actions: + □ Monthly review of metrics + □ Optimize performance/cost + □ Plan enhancements based on roadmap + □ Maintain high SLA +Output: Efficient, well-tuned service + +PHASE 4: DECLINE (Product nearing end) +──────────────────────────────────────── +Status: Limited Updates +Consumers: Stable or declining +Focus: Transition planning +Actions: + □ Announce intended retirement date (12+ months notice) + □ Reduce new feature development + □ Prepare migration path for consumers + □ Document replacement products +Output: Managed transition + + +VERSION MANAGEMENT +═══════════════════ + +Semantic Versioning: MAJOR.MINOR.PATCH + +MAJOR (Breaking changes): + Example: Removing a column + Action: 60+ days notice to consumers + Migration: Provide alternative view/product + Version: 2.0 → 3.0 + +MINOR (Additive changes): + Example: Adding new column, updating metrics + Action: 15+ days notice + 
Compatibility: Backward compatible + Version: 2.0 → 2.1 + +PATCH (Bug fixes): + Example: Correcting data quality issue + Action: Immediate deployment + Impact: Transparent to consumers + Version: 2.0 → 2.0.1 + +Version Communication: + Change Log: Document all changes + Timeline: When deployed + Impact: Who is affected + Action Required: Yes/No + + +UPDATE PROCESS +═══════════════ + +Step 1: Plan Update + □ Define what's changing + □ Determine version number + □ Calculate impact on consumers + □ Set deployment date + +Step 2: Communicate + □ Notify consumers 15-60 days before + □ Describe changes and benefits + □ Explain any action required + □ Provide migration guide if breaking + +Step 3: Implement + □ Deploy to staging environment + □ Test with representative queries + □ Validate consumer views still work + □ Deploy to production + +Step 4: Monitor + □ Watch consumer query success rate + □ Monitor performance impact + □ Respond to issues immediately + □ Collect feedback + +Step 5: Document + □ Update version history + □ Update product documentation + □ Update schema/lineage + + +DEPRECATION PROCESS +════════════════════ + +Timeline: + Day 1: Announce deprecation (12 months notice) + Month 6: Final warning + Month 11: Final opportunity to migrate + Month 12+1: Retirement + + +Deprecation Announcement (Day 1): + To: All consumers + Subject: [Product Name] Deprecation Announcement + + [Product Name] will be retired on [DATE], 12 months from now. + + REASON: [Explain business rationale] + REPLACEMENT: [What to use instead] + MIGRATION: [Steps to transition] + SUPPORT: [Help available during transition] + DEADLINE: [Final day for questions/assistance] + + Questions? Contact: [support contact] + + +Migration Support: + □ Provide comparison of old vs. 
new product + □ Supply migration guide with examples + □ Offer technical assistance (calls, emails) + □ Extend support during transition period + □ Provide data export if needed + + +RETIREMENT PROCESS +════════════════════ + +30 Days Before: + □ Final notice to all consumers + □ Confirm all consumers migrated or confirmed termination + □ Disable new access requests + +On Retirement Date: + □ Revoke all access permissions + □ Archive product and documentation + □ Remove from marketplace + □ Retain backups for compliance + +Post-Retirement: + □ Monitor for questions/incidents + □ Document lessons learned + □ Update guidance for future products +``` diff --git a/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md new file mode 100644 index 0000000..9ac72b9 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md @@ -0,0 +1,182 @@ +--- +name: datasphere-explorer +description: > + SAP Datasphere exploration and discovery assistant. Guides users through understanding their + Datasphere landscape — browsing spaces, discovering data assets in the catalog, inspecting table + schemas, profiling data quality, tracing lineage, and building queries interactively. + Use this skill whenever the user wants to explore, discover, understand, or get oriented in their + SAP Datasphere environment. Also trigger when the user mentions "Datasphere", "spaces", "catalog", + "data assets", "schema", "data profiling", "lineage", or asks questions like "what data do we have", + "show me the tables", "what's in this space", or "help me find data". Even if the user just wants + to browse or get a lay of the land, this skill should activate. +--- + +# SAP Datasphere Explorer + +Guide users through discovering and understanding their SAP Datasphere environment. 
Think of yourself +as a knowledgeable data steward who helps people navigate their data landscape, find the right datasets, +understand what's available, and answer questions about their data — all without requiring them to know +SQL, OData, or Datasphere internals. + +## Before You Start + +Verify the MCP connection is live by calling `test_connection`. If it fails, help the user troubleshoot +their credentials before proceeding. See `references/exploration-workflows.md` for the connection +troubleshooting checklist. + +## Core Exploration Workflows + +There are several natural ways a user might want to explore. Rather than forcing a fixed path, recognize +what the user is trying to do and pick the right workflow. + +### 1. Landscape Overview ("What do we have?") + +When the user wants to understand the big picture: + +1. Call `list_spaces` to get all available spaces +2. Summarize the spaces — group them by purpose if the naming makes it obvious (e.g., production vs. + development, by department, by data domain) +3. For each space of interest, call `get_space_info` to show storage usage, member count, and status +4. Offer to drill into any space the user finds interesting + +Present this conversationally. Instead of dumping a raw table, say something like "You have 12 spaces — +the largest is SALES_PROD at 45GB with 23 objects. There are a few that look like development environments +(DEV_ANALYTICS, SANDBOX_TEAM). Want me to explore any of these?" + +### 2. Space Deep Dive ("What's in this space?") + +When the user picks a specific space: + +1. Call `get_space_assets` to list all assets in the space +2. Group assets by type (views, tables, data flows, analytic models, etc.) +3. Highlight key metrics: total assets, most recently modified, any that look like "golden" or + curated datasets +4. Call `search_repository` with the space filter for additional object details and lineage info + +Provide a navigable summary. 
Suggest next steps: "This space has 8 views and 15 tables. The views look +like they're consumption-ready analytics layers. Want me to inspect the schema of any specific one?" + +### 3. Data Catalog Search ("Find me data about X") + +When the user is looking for specific data: + +1. Call `search_catalog` with the user's search terms +2. Present results with context: name, description, space, type, last modified +3. For promising results, call `get_asset_details` to show richer metadata +4. If the user wants to understand the shape of the data, call `get_table_schema` for column details +5. For analytical models, call `get_analytical_metadata` to understand measures and dimensions + +Help the user evaluate results: "I found 3 assets matching 'customer revenue.' The most relevant +looks like CUSTOMER_REVENUE_V in the ANALYTICS space — it's a view with 24 columns including revenue +measures by quarter. Want me to show you the full schema?" + +### 4. Schema Inspection ("What columns does this table have?") + +When the user wants to understand a specific table or view: + +1. Call `get_table_schema` (for relational) or `get_analytical_metadata` (for analytical models) +2. Present columns organized by purpose: key columns, measures, dimensions, timestamps +3. Show data types, nullability, and any descriptions +4. Call `get_relational_entity_metadata` for additional OData-level metadata if available + +Make the schema meaningful. Instead of just listing columns, identify patterns: "This table has a +composite key (CUSTOMER_ID + FISCAL_YEAR), 6 financial measures (REVENUE, COST, MARGIN...), and +3 geographic dimensions. The LAST_UPDATED timestamp suggests it refreshes regularly." + +### 5. Data Profiling ("What does the data actually look like?") + +When the user wants to understand data content and quality: + +1. Call `analyze_column_distribution` for key columns to understand value ranges, cardinality, and + null rates +2. 
Use `smart_query` to pull sample data (limit to 10-20 rows for readability) +3. Use `execute_query` for specific quality checks (null counts, duplicate detection, date ranges) +4. Identify potential data quality issues and flag them + +Interpret the results for the user: "The REGION column has 5 distinct values covering EMEA, APAC, and +Americas. The REVENUE column ranges from $1.2K to $4.8M with no nulls — looks clean. But CUSTOMER_EMAIL +has a 23% null rate, which might be worth investigating." + +### 6. Lineage and Impact ("Where does this data come from?") + +When the user wants to understand data flow: + +1. Call `search_repository` with object identifiers to find related objects +2. Use `get_object_definition` to understand transformation logic +3. Trace upstream (sources) and downstream (consumers) relationships +4. Call `get_deployed_objects` to see what's actively deployed + +Present lineage as a story: "SALES_SUMMARY_V pulls from two sources: the SAP S/4HANA sales orders +replicated through REPL_FLOW_S4 and the master data from the CUSTOMERS table. It's consumed by +the EXECUTIVE_DASHBOARD analytic model." + +### 7. Interactive Query Building ("Show me data where...") + +When the user wants to query data: + +1. Start with `smart_query` for natural-language-style queries — it handles aggregation intelligently +2. For more complex needs, help the user build SQL and execute with `execute_query` +3. Always start with small result sets (TOP 10/20) before pulling larger datasets +4. For analytical models, use `query_analytical_data` for proper measure aggregation + +Guide the user through refinement: "Here are the top 10 customers by revenue this quarter. Want me to +filter by region, add year-over-year comparison, or drill into a specific customer?" + +### 8. Marketplace Discovery ("What external data is available?") + +When the user wants to find external or shared data: + +1. Call `browse_marketplace` to see available data packages +2. 
Present packages with descriptions, providers, and content summaries +3. Help evaluate relevance to the user's needs + +## Handling Common Situations + +**User doesn't know where to start**: Begin with the Landscape Overview. Summarize what's there and +let the user's curiosity guide the next steps. + +**User gives a vague request** ("show me some data"): Ask one clarifying question about the domain +or topic they care about, then use catalog search. Don't ask too many questions — just get enough +to run a useful search. + +**User asks about something that doesn't exist**: Search the catalog first. If nothing matches, +check for similar names or related concepts. Suggest alternatives: "I didn't find a 'profit_margin' +table, but the FINANCIAL_METRICS view has both revenue and cost columns — we could calculate margin +from those." + +**Query returns too much data**: Automatically limit results and summarize. Let the user know there's +more: "Showing the first 20 of 1,450 records. Want me to filter or aggregate?" + +**Query returns errors**: Read the error message carefully. Common issues include: missing permissions +on a space, referencing columns that don't exist (check schema first), or analytical models needing +specific aggregation patterns. See `references/exploration-workflows.md` for the error resolution guide. + +## MCP Tools Reference + +For the full list of available tools with parameters and examples, read `references/exploration-workflows.md`. 
+ +**Quick reference — tools by workflow:** + +| Workflow | Primary Tools | +|----------|--------------| +| Landscape Overview | `list_spaces`, `get_space_info` | +| Space Deep Dive | `get_space_assets`, `search_repository`, `list_repository_objects` | +| Catalog Search | `search_catalog`, `list_catalog_assets`, `get_asset_details`, `get_asset_by_compound_key` | +| Schema Inspection | `get_table_schema`, `get_relational_entity_metadata`, `get_analytical_metadata` | +| Data Profiling | `analyze_column_distribution`, `smart_query`, `execute_query` | +| Lineage & Impact | `search_repository`, `get_object_definition`, `get_deployed_objects` | +| Query Building | `smart_query`, `execute_query`, `query_relational_entity`, `query_analytical_data` | +| Marketplace | `browse_marketplace` | +| Foundation | `test_connection`, `get_current_user`, `get_tenant_info`, `get_available_scopes` | + +## Presentation Guidelines + +Keep the conversation natural and accessible. The user may not be a Datasphere expert — they might be +a business analyst, a data scientist, or a manager trying to understand what data is available. 
+ +- Translate technical metadata into business language when possible +- Summarize before showing raw data +- Suggest logical next steps after each interaction +- Use concrete examples and numbers rather than abstract descriptions +- If showing tabular data, keep it to 5-10 rows unless the user asks for more +- When profiling, focus on the insights (quality issues, patterns, anomalies) not just the numbers diff --git a/partner-built/SAP-Datasphere/skills/datasphere-explorer/references/exploration-workflows.md b/partner-built/SAP-Datasphere/skills/datasphere-explorer/references/exploration-workflows.md new file mode 100644 index 0000000..f6a594e --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-explorer/references/exploration-workflows.md @@ -0,0 +1,243 @@ +# Exploration Workflows — Detailed Reference + +## Connection Troubleshooting Checklist + +If `test_connection` fails, walk through these steps: + +1. **OAuth credentials**: Verify the client ID and secret are correct and haven't expired +2. **Token URL**: Must match the tenant's authentication endpoint (format: `https://<tenant>.authentication.<region>.hana.ondemand.com/oauth/token`) +3. **Base URL**: Must include the tenant name and region (format: `https://<tenant>.<region>.hcs.cloud.sap`) +4. **Network access**: The MCP server must be able to reach SAP BTP endpoints (check firewall/proxy) +5. **Scopes**: The OAuth client needs appropriate scopes for the APIs being called +6. **User permissions**: The technical user must have space-level read access at minimum + +Common error patterns: +- `401 Unauthorized` → credentials are wrong or expired +- `403 Forbidden` → user lacks required permissions/scopes +- `Connection refused` → network/firewall issue +- `404 Not Found` → wrong base URL or tenant name + +## Tool Reference by Workflow + +### Foundation Tools + +#### test_connection +Verifies OAuth connectivity to the Datasphere tenant. 
+- **Use when**: Starting a session, or when other tools fail +- **Returns**: Connection status, tenant info, authenticated user + +#### get_current_user +Returns the authenticated user's details. +- **Use when**: Need to check who is connected and their permissions +- **Returns**: Username, display name, assigned roles + +#### get_tenant_info +Returns tenant configuration and capacity. +- **Use when**: Understanding the environment's size and setup +- **Returns**: Tenant name, region, storage capacity, feature flags + +#### get_available_scopes +Lists OAuth2 scopes the current credentials have access to. +- **Use when**: Debugging permission issues +- **Returns**: List of granted scopes + +### Catalog & Discovery Tools + +#### list_spaces +Lists all spaces the authenticated user can access. +- **Use when**: Getting a landscape overview +- **Returns**: Space names, descriptions, storage used, member counts + +#### get_space_info +Returns detailed information about a specific space. +- **Parameters**: `space_id` (required) +- **Use when**: Drilling into a specific space +- **Returns**: Storage quotas, members, settings, status + +#### search_catalog +Full-text search across the entire data catalog. +- **Parameters**: `query` (required), filters (optional) +- **Use when**: User is looking for data about a topic +- **Returns**: Matching assets with name, description, type, space + +#### list_catalog_assets +Browse catalog assets with filters. +- **Parameters**: Various filters (space, type, etc.) +- **Use when**: Browsing available assets +- **Returns**: Asset list with metadata + +#### get_asset_details +Returns rich metadata for a specific catalog asset. +- **Parameters**: Asset identifier +- **Use when**: Understanding a specific asset's purpose and structure +- **Returns**: Description, tags, lineage info, quality metrics + +#### get_asset_by_compound_key +Looks up an asset by its compound key (space + technical name). 
+- **Parameters**: `space_id`, `technical_name` +- **Use when**: You know the exact asset identifier +- **Returns**: Full asset details + +#### get_space_assets +Lists all assets within a specific space. +- **Parameters**: `space_id` (required) +- **Use when**: Exploring everything in a space +- **Returns**: All assets grouped by type + +#### search_tables +Searches for tables across spaces. +- **Parameters**: Search terms, space filter +- **Use when**: Looking for specific tables +- **Returns**: Matching tables with schema summaries + +#### browse_marketplace +Browse available data marketplace packages. +- **Use when**: Looking for external/shared data +- **Returns**: Available packages with descriptions + +### Schema & Metadata Tools + +#### get_table_schema +Returns column definitions for a relational table or view. +- **Parameters**: `space_id`, `table_name` +- **Use when**: Understanding table structure +- **Returns**: Column names, types, keys, nullability, descriptions + +#### get_relational_metadata +Returns CSDL metadata for relational entities in a space. +- **Parameters**: `space_id` +- **Use when**: Need OData-level schema details +- **Returns**: Entity types, properties, navigation properties + +#### get_analytical_metadata +Returns CSDL metadata for analytical models. +- **Parameters**: `space_id` +- **Use when**: Understanding analytical model structure (measures, dimensions) +- **Returns**: Measures, dimensions, attributes, hierarchies + +#### get_relational_entity_metadata +Returns detailed column metadata for a specific entity. +- **Parameters**: `space_id`, `entity_name` +- **Use when**: Need per-column OData details +- **Returns**: Property names, types, annotations + +#### get_consumption_metadata +Returns consumption layer schema information. 
+- **Parameters**: `space_id` +- **Use when**: Understanding what's exposed for consumption +- **Returns**: Consumption-ready entities and their structure + +### Data Query Tools + +#### smart_query +Intelligent SQL query with auto-aggregation. +- **Parameters**: `space_id`, `table_name`, `query_description` or `sql` +- **Use when**: User describes what data they want in natural language +- **Returns**: Query results with column headers + +Best practice: Start with natural language descriptions. The tool will construct appropriate SQL +including aggregations, filters, and limits. + +#### execute_query +Runs a direct SQL query against Datasphere. +- **Parameters**: `space_id`, `sql` (required) +- **Use when**: Need precise SQL control (complex joins, window functions, specific formatting) +- **Returns**: Query results +- **Limits**: Max 10,000 characters query length; results capped + +Safety: The SQL sanitizer blocks destructive operations (DROP, DELETE, TRUNCATE, ALTER). +Only SELECT queries are permitted. + +#### query_relational_entity +OData query on relational entities. +- **Parameters**: `space_id`, `entity_name`, OData query options ($filter, $select, $top, etc.) +- **Use when**: Querying relational data through OData protocol +- **Returns**: Entity data in structured format + +#### query_analytical_data +OData query on analytical models. +- **Parameters**: `space_id`, `model_name`, OData query options +- **Use when**: Querying analytical models with proper aggregation +- **Returns**: Aggregated analytical data + +### Data Profiling Tools + +#### analyze_column_distribution +Profiles a column's value distribution. +- **Parameters**: `space_id`, `table_name`, `column_name` +- **Use when**: Understanding data quality, cardinality, value ranges +- **Returns**: Distinct count, null rate, min/max, top values, distribution + +#### find_assets_by_column +Searches for assets containing a specific column name. 
+- **Parameters**: `column_name` +- **Use when**: Finding which tables/views contain a particular field +- **Returns**: List of assets with the matching column + +### Repository & Lineage Tools + +#### search_repository +Searches for objects with lineage information. +- **Parameters**: Search terms, space filter +- **Use when**: Understanding data flow and dependencies +- **Returns**: Objects with upstream/downstream relationships + +#### list_repository_objects +Lists objects in the repository. +- **Parameters**: Space and type filters +- **Use when**: Enumerating all objects in a space +- **Returns**: Object list with types and status + +#### get_deployed_objects +Lists actively deployed objects. +- **Parameters**: Space filter +- **Use when**: Understanding what's live in production +- **Returns**: Deployed objects with deployment status + +#### get_object_definition +Returns the full definition/specification of an object. +- **Parameters**: Object identifier +- **Use when**: Understanding transformation logic, view definitions, or flow configurations +- **Returns**: Object specification (columns, transformations, SQL, mappings) + +#### get_task_status +Checks the execution status of data flows and tasks. +- **Parameters**: Task identifier +- **Use when**: Monitoring running or recently completed jobs +- **Returns**: Status, start/end times, record counts, error messages + +### Database User Management + +#### list_database_users +Lists database users in a space. +- **Parameters**: `space_id` + +#### create_database_user / get_database_user_details / update_database_user / delete_database_user / reset_database_user_password +Full CRUD operations on database users. 
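
Chained together, the catalog, schema, and query tools above form the typical explore-then-query loop: discover assets, inspect schemas, then query. The sketch below is illustrative only — `call_tool(name, args)` is a hypothetical stand-in for however your MCP client dispatches calls to `@mariodefe/sap-datasphere-mcp`, and the shape of each response dict is assumed; the tool names and parameters mirror the reference entries above.

```python
def explore_space(call_tool, space_id, max_tables=5):
    """Discover a space's assets, then fetch schemas for its first few tables.

    `call_tool` is a hypothetical dispatcher: it takes a tool name and an
    argument dict and returns that tool's JSON response as a Python dict.
    """
    # get_space_assets lists everything in the space, grouped by type
    assets = call_tool("get_space_assets", {"space_id": space_id})

    # get_table_schema returns columns, types, and keys for one table
    schemas = {}
    for table in assets.get("tables", [])[:max_tables]:
        schemas[table] = call_tool(
            "get_table_schema", {"space_id": space_id, "table_name": table}
        )
    return schemas
```

From there, a `smart_query` or `execute_query` call on one of the discovered tables completes the workflow.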
+ +## Error Resolution Guide + +### Common Query Errors + +| Error | Likely Cause | Resolution | +|-------|-------------|------------| +| `Entity not found` | Wrong table/view name | Use `search_tables` or `get_space_assets` to find correct name | +| `Column not found` | Wrong column name | Use `get_table_schema` to verify column names | +| `Insufficient privileges` | Missing space access | Check user roles with `get_current_user` | +| `Query too complex` | SQL exceeds limits | Simplify the query, reduce joins, or use smaller result sets | +| `Timeout` | Large dataset without filters | Add WHERE clauses, use TOP/LIMIT, or aggregate first | + +### OData-Specific Errors + +| Error | Likely Cause | Resolution | +|-------|-------------|------------| +| `400 Bad Request` | Malformed $filter | Check OData syntax — strings need single quotes | +| `404 Not Found` | Entity not exposed | Verify the entity is exposed for consumption | +| `501 Not Implemented` | Unsupported operation | Try a different query approach (SQL vs OData) | + +### Analytical Model Errors + +Analytical models require proper aggregation. If you get unexpected results: +1. Check that you're using `query_analytical_data` (not relational queries) for analytical models +2. Ensure measures are being aggregated (SUM, AVG, etc.) +3. Verify dimension combinations exist in the data diff --git a/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/SKILL.md new file mode 100644 index 0000000..5ee6424 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/SKILL.md @@ -0,0 +1,1295 @@ +--- +name: Flow Doctor +description: "Diagnose and resolve errors in Data Flows, Replication Flows, Transformation Flows, and Task Chains. Use this when flows fail, to read error logs, identify root causes, and implement fixes. Critical for debugging data pipeline issues." 
+--- + +# Flow Doctor + +## Overview + +The Flow Doctor skill guides you through systematic troubleshooting of SAP Datasphere flows. When a flow fails, this skill helps you read error messages, understand root causes, and implement targeted solutions. + +## When to Use This Skill + +- **Flow execution failed**: Red status in monitor, need to identify why +- **Partial data loading**: Only some records loaded, others rejected +- **Performance degradation**: Flow taking longer than expected +- **Connectivity issues**: Cannot reach source or target system +- **Memory errors**: Out-of-memory or resource constraints +- **Data mismatches**: Unexpected differences between source and target +- **Schema changes**: Source structure changed, flow now incompatible +- **Task chain failures**: Dependent tasks failing or not executing + +## Systematic Troubleshooting Workflow + +### Step 1: Check Overall Flow Status + +``` +Datasphere → Data Integrations → Monitor +└─ Find failing flow +├─ Note last run time and status +├─ Check error indicator (red icon) +└─ Count failed runs vs total runs +``` + +**Status Indicators:** + +| Status | Meaning | Action | +|--------|---------|--------| +| GREEN | Running successfully | Check logs for warnings | +| YELLOW | Running but with warnings | Review warning messages | +| RED | Failed | Read error log immediately | +| GRAY | Not running | Check scheduling or manual trigger | +| BLUE | Running (in progress) | Wait for completion | + +**Example Status Review:** +``` +Flow: REP_CUSTOMER_DAILY +Status: RED (Failed) +Last Run: 2024-01-16 23:45:00 +Duration: 12 minutes (usually 5 minutes - 2.4x longer!) 
+Records: 45,000 / 1,000,000 loaded +Error Count: 1 +``` + +### Step 2: Access and Read Error Logs + +**Location:** Data Integrations → Monitor → [Flow Name] +``` +Error Log Details: +├─ Timestamp: 2024-01-16 23:45:45 +├─ Severity: ERROR +├─ Message: "Memory limit exceeded" +├─ Step: "LOAD_DATA_TRANSFORM" +└─ Context: "Processing batch 3 of 10" +``` + +**Log Types by Flow Type:** + +**Replication Flow Logs:** +``` +[2024-01-16 23:45:12] INFO | Initializing replication from S/4HANA +[2024-01-16 23:45:30] INFO | Connected to source system +[2024-01-16 23:45:31] INFO | Extracting C_CUSTOMER (1M rows) +[2024-01-16 23:46:00] WARN | Extraction slow: 20K rows/min (target: 50K rows/min) +[2024-01-16 23:46:45] ERROR | Authorization error: User missing RFC_READ_TABLE +[2024-01-16 23:46:45] INFO | Replication failed - 0 rows loaded +``` + +**Data Flow Logs:** +``` +[2024-01-16 23:45:00] INFO | Starting data flow +[2024-01-16 23:45:05] INFO | Reading source: 100,000 rows +[2024-01-16 23:45:10] INFO | Python operator: Transform started +[2024-01-16 23:45:25] ERROR | Python operator: TypeError in line 42 +[2024-01-16 23:45:25] ERROR | Stack trace: ... +[2024-01-16 23:45:25] INFO | Data flow failed +``` + +**Transformation Flow Logs:** +``` +[2024-01-16 23:45:00] INFO | Executing transformation procedure +[2024-01-16 23:45:10] INFO | Creating temporary tables +[2024-01-16 23:45:12] ERROR | SQL Error [327]: Column "AMOUNT" not found +[2024-01-16 23:45:12] INFO | Rollback executed +[2024-01-16 23:45:12] INFO | Transformation failed +``` + +### Step 3: Identify Root Cause Category + +Use the decision tree to narrow down the issue: + +``` +Flow Failed +│ +├─ Connection Error? +│ ├─ Cannot reach source system +│ ├─ Cloud Connector offline +│ └─ Network timeout +│ +├─ Authorization Error? +│ ├─ User lacks role +│ ├─ RFC function not authorized +│ └─ Table access denied +│ +├─ Schema Error? 
+│ ├─ Column not found +│ ├─ Data type mismatch +│ └─ Table structure changed +│ +├─ Data Quality Error? +│ ├─ Invalid data values +│ ├─ NULL where NOT NULL +│ └─ Type conversion failed +│ +├─ Resource Error? +│ ├─ Memory exhausted +│ ├─ CPU limit exceeded +│ └─ Disk full +│ +├─ Logic Error? +│ ├─ SQL syntax error +│ ├─ Python operator crash +│ └─ Transformation logic bug +│ +└─ Configuration Error? + ├─ Wrong mapping + ├─ Invalid parameters + └─ Missing prerequisites +``` + +### Step 4: Extract Key Error Information + +From the error log, extract: + +1. **Exact Error Message** + ``` + Raw: "ERROR [327]: Column 'REVENUE_AMOUNT' not found in table 'C_INVOICE'" + Key: Column REVENUE_AMOUNT is missing + ``` + +2. **Component Where It Failed** + ``` + Step: "MERGE_DATA" (not in initial load, in merge phase) + Line/Position: 23 in transformation procedure + ``` + +3. **Severity Level** + ``` + FATAL - Flow stops immediately + ERROR - Flow stops after current step + WARN - Flow continues but with issues + INFO - Informational only + ``` + +4. **Affected Data** + ``` + Records processed: 45,000 + Records failed: 1,000 + Failed %: 2.2% + ``` + +### Step 5: Implement Targeted Fix + +Based on root cause category, apply the appropriate solution (see sections below). + +### Step 6: Test and Validate + +Before restarting in production: + +``` +Test Strategy: +1. Run on sample data (LIMIT 1000) +2. Verify output matches expected schema +3. Check row counts and data quality +4. 
Execute once before scheduling +``` + +**Validation Query:** +```sql +SELECT + COUNT(*) AS TOTAL_ROWS, + COUNT(DISTINCT KEY_COLUMN) AS UNIQUE_KEYS, + COUNT(CASE WHEN AMOUNT IS NULL THEN 1 END) AS NULL_AMOUNTS, + MIN(AMOUNT) AS MIN_AMOUNT, + MAX(AMOUNT) AS MAX_AMOUNT +FROM TARGET_TABLE +WHERE LOAD_DATE = CURRENT_DATE; +``` + +## Common Replication Flow Failures + +### Error Type 1: Authorization Errors + +**Symptoms:** +``` +ERROR: User [DATASPH_USER] does not have authorization for RFC_READ_TABLE +ERROR: Access denied to table [C_CUSTOMER] +ERROR: Missing role SAP_BC_ANALYTICS_EXTRACTOR +``` + +**Root Causes:** +- User missing required role in S/4HANA +- RFC function not authorized for user +- CDS view not marked as extraction-enabled +- Security policy restricts table access + +**Diagnosis:** +``` +1. Check user roles in S/4HANA (PFCG transaction) +2. Verify RFC authorizations (SM59 → ODP settings) +3. Confirm CDS view has @Analytics.dataExtraction.enabled = true +4. Review data security policies (RSECADMIN) +``` + +**Solution Steps:** + +``` +S/4HANA System Administration: +1. Go to PFCG transaction +2. Search for user: DATASPH_USER +3. Assign role: SAP_BC_ANALYTICS_EXTRACTOR +4. Add to role: S_ODP_* objects +5. Save and activate + +Verify: +SM59 → ODP_BACKEND +Test Connection → Should succeed +``` + +**Prevention:** +``` +Create dedicated technical user for extraction: +- User: DATASPH_EXTRACT +- Role: Custom role with minimum permissions + ├─ RFC: RFC_READ_TABLE (for fallback) + ├─ RFC: /ODSO/* (for ODP) + ├─ Table: C_* (CDS views only) + └─ Authorization level: VIEW only + +Apply principle of least privilege. 
+``` + +### Error Type 2: Connection Timeouts + +**Symptoms:** +``` +ERROR: Connection timeout after 300 seconds +ERROR: Read timed out: no further information +ERROR: Unable to connect to host [s4h-prod.example.com:443] +WARN: Slow network response (5 min for 100K rows) +``` + +**Root Causes:** +- Network latency or packet loss +- Cloud Connector offline or overloaded +- Source system busy/unresponsive +- Large data volume causing slow extraction +- Firewall blocking connection + +**Diagnosis Steps:** + +``` +1. Check Cloud Connector Status: + Datasphere → Administration → Connections + → S/4HANA_PROD → Status + Expected: GREEN - Connected + +2. Verify Network Connectivity: + From DP Agent or Cloud Connector: + ping s4h-prod.example.com + telnet s4h-prod.example.com 443 + +3. Check Source System Load: + S/4HANA → SM50 (Work Process Overview) + Look for: CPU > 80%, Memory > 90%, Queued requests +``` + +**Solution:** + +**Increase Timeout:** +``` +Data Integrations → Connection Settings +→ S/4HANA_PROD +→ Advanced → Connection Timeout + Current: 300 seconds + Change to: 600 seconds + +Save and retest +``` + +**Optimize Extraction:** +``` +Replication Flow Settings: +├─ Reduce batch size: 100,000 → 50,000 rows/batch +├─ Increase parallel threads: 1 → 4 +├─ Add filter to source: WHERE load_date >= CURRENT_DATE - 7 +└─ Schedule off-peak: 23:00 instead of 09:00 +``` + +**Check Cloud Connector Performance:** +``` +Admin Console → https://cc-prod.example.com:8443 +→ Monitoring +├─ Check CPU usage (should be < 70%) +├─ Check throughput (should be > 50 MB/s) +├─ Check request queue (should be < 100) +└─ If overloaded: scale Cloud Connector or add standby +``` + +### Error Type 3: Schema Mismatches + +**Symptoms:** +``` +ERROR: Column 'REVENUE' not found in source table C_CUSTOMER +ERROR: Data type mismatch - Expected STRING, got DECIMAL +ERROR: Source table structure changed - table refresh required +``` + +**Root Causes:** +- Source CDS view fields changed in new S/4HANA 
patch +- Target table schema doesn't match source +- Column was deleted from source +- Data type definitions are incompatible + +**Diagnosis:** + +``` +1. Get source schema: + use MCP: get_table_schema(table_name="C_CUSTOMER", source="S/4HANA") + +2. Get target schema: + use MCP: get_table_schema(table_name="CUSTOMER_MASTER", source="DATASPHERE") + +3. Compare field-by-field: + Source: AMOUNT (DECIMAL 15,2), String length: 17 + Target: AMOUNT (DECIMAL 10,2), String length: 12 + Issue: Target too small for source values! +``` + +**Solution:** + +**Refresh Source Schema:** +``` +Data Integrations → Replication Flow → [Flow Name] +→ Source Definition +→ Refresh Schema +→ Review changes +→ Accept and Save +``` + +**Resize Target Columns:** +```sql +-- Identify columns needing expansion +SELECT + COLUMN_NAME, + DATA_TYPE, + CHARACTER_MAXIMUM_LENGTH, + NUMERIC_PRECISION, + NUMERIC_SCALE +FROM INFORMATION_SCHEMA.COLUMNS +WHERE TABLE_NAME = 'CUSTOMER_MASTER' + AND ( + CHARACTER_MAXIMUM_LENGTH < 255 + OR NUMERIC_PRECISION < 15 + ); + +-- Expand problem columns +ALTER TABLE CUSTOMER_MASTER + MODIFY COLUMN REVENUE DECIMAL(19,4), + MODIFY COLUMN CUSTOMER_NAME VARCHAR(500); +``` + +**Recreate Flow with Updated Schema:** +``` +If structural changes are significant: +1. Backup existing data +2. Rename old table: CUSTOMER_MASTER → CUSTOMER_MASTER_OLD +3. Delete old flow +4. Create new flow (will auto-create table with current schema) +5. Perform full reload +6. Validate row counts match +7. 
Drop _OLD table +``` + +### Error Type 4: Delta Queue Issues + +**Symptoms:** +``` +ERROR: Change record (CN: 1500000) not found in delta queue +ERROR: Queue overflow - exceeds 4GB limit +WARN: Only 2 days of changes retained (retention < extraction frequency) +``` + +**Root Causes:** +- Delta queue purged before extraction (retention expired) +- Queue size exceeded, oldest records deleted +- Too much time between delta extractions +- Source system change log not maintained + +**Diagnosis:** + +``` +Check Delta Queue Status: +use MCP: test_connection( + source="S/4HANA_PROD", + check_delta_queue=True +) + +Expected output: +{ + "queue_size_gb": 1.2, + "max_size_gb": 4.0, + "oldest_record_days": 5, + "retention_policy_days": 8, + "health": "HEALTHY" +} +``` + +**Solution:** + +**For Expired Queue (Records Deleted):** +``` +1. The delta is lost, must do full reload +2. Stop the delta replication flow +3. Create new replication flow with full load +4. Set new watermark to current value +5. Resume delta from new watermark + +SQL verification: +SELECT MAX(CHANGENUMBER) FROM C_CUSTOMER; +Store this value as new watermark. 
+``` + +**For Queue Overflow:** +``` +Immediate: Perform full reload to reset + +Preventive: + a) Increase delta frequency: + From: Every 30 minutes + To: Every 5-10 minutes + Rationale: Smaller batches prevent overflow + + b) Increase queue size in S/4HANA: + SPRO → ODP Configuration + Queue max size: 4GB → 8GB + Retention: 8 days → 14 days + + c) Add source filter: + Load only recent: WHERE changed_date >= CURRENT_DATE - 30 + Reduces volume extracted each run +``` + +## Common Data Flow Failures + +### Error Type 1: Memory Limit Exceeded + +**Symptoms:** +``` +ERROR: Java heap space - Out of memory +ERROR: Insufficient memory for operator execution +WARN: Memory usage at 95% of limit +ERROR: Python operator crashed - memory allocation failed +``` + +**Root Causes:** +- Input dataset larger than available memory +- Python operator loading entire dataframe +- No pagination or chunking in code +- Memory-intensive aggregations + +**Diagnosis:** + +``` +1. Check input row count: + use MCP: execute_query("SELECT COUNT(*) FROM source_table") + Example result: 50,000,000 rows (50M) + +2. Estimate memory needed: + 50M rows × 1KB per row = 50GB (exceeds 32GB limit!) + +3. 
Check operator code for memory issues:
+   - Loading all data into memory at once
+   - Creating large intermediate arrays
+   - Not releasing memory between operations
+```
+
+**Solution:**
+
+**Chunked Processing in Python:**
+```python
+import pandas as pd
+
+def categorize_revenue(value):
+    # Placeholder business rule - replace with your own categorization
+    return 'HIGH' if value >= 100000 else 'LOW'
+
+def process_large_dataset_chunked(input_df, chunk_size=10000):
+    """
+    Process in chunks to avoid memory issues
+    """
+    result_chunks = []
+
+    for i in range(0, len(input_df), chunk_size):
+        chunk = input_df.iloc[i:i + chunk_size]
+
+        # Process chunk
+        processed_chunk = chunk.assign(
+            revenue_category=chunk['revenue'].apply(categorize_revenue)
+        )
+
+        result_chunks.append(processed_chunk)
+
+        # Explicitly free memory
+        del chunk
+
+    return pd.concat(result_chunks, ignore_index=True)
+
+def process_via_partition(input_df, partition_column='YEAR_MONTH'):
+    """
+    Process by partitions instead of full load
+    """
+    result = []
+
+    for partition_val in input_df[partition_column].unique():
+        partition_data = input_df[input_df[partition_column] == partition_val]
+        # Apply the same per-chunk transformation to each partition
+        processed = partition_data.assign(
+            revenue_category=partition_data['revenue'].apply(categorize_revenue)
+        )
+        result.append(processed)
+        del partition_data
+
+    return pd.concat(result, ignore_index=True)
+```
+
+**Add Source Filter:**
+```
+Data Flow Settings:
+├─ Source: LARGE_TABLE (100M rows)
+├─ Add WHERE clause:
+│   WHERE TRANSACTION_DATE >= CURRENT_DATE - 90
+│   (Reduces to 5M rows, fits in memory)
+└─ Result: Incremental load instead of full
+```
+
+**Increase Memory Allocation:**
+```
+Data Flow → Advanced Settings
+├─ Memory Limit: 16GB → 32GB
+├─ Note: Requires larger execution environment
+└─ Cost: Higher (check pricing)
+```
+
+### Error Type 2: Data Type Mismatch
+
+**Symptoms:**
+```
+ERROR: TypeError: cannot convert from DECIMAL to STRING
+ERROR: Cannot convert value "ABC123" to type INTEGER
+ERROR: Date conversion error: invalid format "01/32/2024"
+```
+
+**Root Causes:**
+- Source delivers unexpected data type
+- Python operator assumes wrong type
+- Implicit type conversion fails
+- NULL values not handled
+
+**Diagnosis:**
+
+```python
+# Check data types in Python operator
+def diagnose_types(input_df):
+    print(input_df.dtypes)
+    print(input_df.head(10))
+    print(input_df.describe())
+
+    # Check for problematic values
+    for col in input_df.columns:
+        print(f"{col}: unique={input_df[col].nunique()}, nulls={input_df[col].isnull().sum()}")
+```
+
+**Solution:**
+
+**Add Type Conversion:**
+```python
+def safe_transform(input_df):
+    """
+    Explicit type conversion with error handling
+    """
+    # String to number
+    input_df['amount'] = pd.to_numeric(
+        input_df['amount'],
+        errors='coerce'  # Invalid values become NULL
+    )
+
+    # String to date
+    input_df['transaction_date'] = pd.to_datetime(
+        input_df['transaction_date'],
+        format='%Y-%m-%d',
+        errors='coerce'
+    )
+
+    # Handle NULLs from failed conversions
+    input_df['amount'] = input_df['amount'].fillna(0)
+    input_df['transaction_date'] = input_df['transaction_date'].fillna(
+        pd.Timestamp('1900-01-01')
+    )
+
+    return input_df
+```
+
+**Add Validation Step:**
+```python
+def validate_before_transform(input_df):
+    """
+    Validate data quality before transformation
+    """
+    errors = []
+
+    # Check nulls
+    if input_df['amount'].isnull().any():
+        errors.append(f"NULL amounts found: {input_df['amount'].isnull().sum()}")
+
+    # Check type (catch only conversion errors, never a bare except)
+    try:
+        pd.to_numeric(input_df['amount'])
+    except (ValueError, TypeError):
+        errors.append("Amount contains non-numeric values")
+
+    # Check value ranges
+    if (input_df['amount'] < 0).any():
+        errors.append(f"Negative amounts found: {(input_df['amount'] < 0).sum()}")
+
+    if errors:
+        raise ValueError("Data quality issues: " + "; ".join(errors))
+
+    return input_df
+```
+
+### Error Type 3: Python Operator Crash
+
+**Symptoms:**
+```
+ERROR: Python operator failure - execution terminated
+ERROR: ImportError: No module named 'sklearn'
+ERROR: NameError: name 'df' is not defined
+TRACEBACK: File "operator.py", line 42, in process_data
+```
+
+**Root Causes:**
+- Python syntax error in operator code
+- Missing library import
+- Variable name typo +- Incompatible library version + +**Diagnosis:** + +``` +1. Read full traceback from logs +2. Identify line number causing issue +3. Check if syntax is valid +4. Verify all imports available + +Example traceback: + File "transform.py", line 42, in process_data + revenue_category = categorize(df['revenue']) + ^^ + NameError: name 'df' is not defined + +Issue: DataFrame not created - likely parameter name wrong +``` + +**Solution:** + +**Fix Syntax Errors:** +```python +# WRONG +def transform(input_df) + # Missing colon + result = input_df + 1 # Wrong + reutrn result # Typo + +# RIGHT +def transform(input_df): + result = input_df.assign(value=input_df['amount'] + 1) + return result +``` + +**Add Missing Imports:** +```python +# Check what libraries are available +import sys +print(sys.version) + +# Required imports for data flows +import pandas as pd +import numpy as np + +# Optional (may not be available) +try: + from sklearn import preprocessing +except ImportError: + print("sklearn not available - using alternative") + # Implement without sklearn +``` + +**Debug Print Statements:** +```python +def transform(input_df): + print(f"Input shape: {input_df.shape}") + print(f"Columns: {input_df.columns.tolist()}") + + result = input_df.assign( + value=input_df['amount'] * 1.1 + ) + + print(f"Output shape: {result.shape}") + return result +``` + +## Common Transformation Flow Failures + +### Error Type 1: SQL Syntax Errors + +**Symptoms:** +``` +ERROR SQL0104: Statement was not prepared - "CREATE TABLE" not recognized +ERROR SQL0207: Column 'REVENUE_AMOUNT' not found +ERROR SQL0289: Trigger or constraint violation +``` + +**Root Causes:** +- SQLScript syntax not valid +- Column name misspelled +- Reserved keyword used as identifier +- SQL dialect mismatch + +**Diagnosis:** + +``` +1. Read error line number +2. Check syntax at that line +3. 
Verify all table/column names exist
+
+Example:
+Line 42: CREATE LOCAL TEMP TABLE temp_customer (
+Line 43:   customer id INT,   // WRONG - missing underscore in column name
+```
+
+**Solution:**
+
+**Test SQL Incrementally:**
+```sql
+-- Test step-by-step, not the entire procedure at once
+
+-- Step 1: Confirm the source table exists and is readable
+SELECT * FROM SOURCE_CUSTOMER LIMIT 10;
+
+-- Step 2: Test the transformation expressions
+SELECT
+    CUSTOMER_ID,
+    UPPER(CUSTOMER_NAME) AS name_upper,
+    CURRENT_TIMESTAMP AS load_ts
+FROM SOURCE_CUSTOMER
+LIMIT 10;
+
+-- Step 3: If that works, test the full transformation logic
+MERGE INTO TARGET_CUSTOMER tc
+USING SOURCE_CUSTOMER sc
+    ON tc.CUSTOMER_ID = sc.CUSTOMER_ID
+WHEN MATCHED THEN
+    UPDATE SET tc.NAME = sc.NAME
+WHEN NOT MATCHED THEN
+    INSERT (CUSTOMER_ID, NAME)
+    VALUES (sc.CUSTOMER_ID, sc.NAME);
+```
+
+**Quote Identifiers:**
+```sql
+-- If using reserved words or special characters:
+SELECT
+    "user",            -- reserved word - quote it
+    "account-number",  -- contains hyphen - quote it
+    normal_column      -- no quotes needed
+FROM table_name;
+```
+
+### Error Type 2: Delta Watermark Issues
+
+**Symptoms:**
+```
+ERROR: Watermark value invalid - not found in source data
+ERROR: Cannot compare TIMESTAMP and STRING types
+WARN: Watermark not advancing - same value as last run
+ERROR: Overlapping deltas - duplicate records loaded
+```
+
+**Root Causes:**
+- Watermark field doesn't exist or changed type
+- Watermark value outside valid range
+- Source data older than watermark
+- Timezone issues with timestamps
+
+**Diagnosis:**
+
+```
+1. Check watermark field exists:
+   use MCP: get_table_schema(table="C_CUSTOMER")
+   Verify: LAST_CHANGED_AT field exists and is TIMESTAMP
+
+2. Check watermark value:
+   use MCP: execute_query("
+     SELECT MAX(LAST_CHANGED_AT) FROM C_CUSTOMER
+   ")
+   Result: 2024-01-15 23:59:59
+   Current watermark stored: 2024-01-16 00:00:00
+   Issue: Watermark > max value means no data to extract!
+
+3.
Check for duplicates: + use MCP: execute_query(" + SELECT CHANGENUMBER, COUNT(*) as cnt + FROM delta_load_batch + GROUP BY CHANGENUMBER + HAVING COUNT(*) > 1 + ") +``` + +**Solution:** + +**Reset Watermark to Valid Value:** +```sql +-- Check current maximum +SELECT MAX(LAST_CHANGED_AT) as current_max FROM C_CUSTOMER; + +-- Update stored watermark +UPDATE WATERMARK_CONTROL +SET LAST_WATERMARK = ( + SELECT MAX(LAST_CHANGED_AT) - INTERVAL '1' HOUR + FROM C_CUSTOMER +) +WHERE TABLE_NAME = 'CUSTOMER'; + +-- Next delta run will get last 1 hour of changes +``` + +**Fix Timestamp Timezone Issues:** +```sql +-- Problem: Timestamps stored in UTC but watermark in local time +-- Solution: Normalize to UTC + +PROCEDURE LOAD_DELTA_NORMALIZED ( + IN iv_last_watermark TIMESTAMP +) +LANGUAGE SQLSCRIPT +AS +BEGIN + DECLARE v_watermark_utc TIMESTAMP; + + -- Convert input to UTC if needed + SET v_watermark_utc = TO_UTCTIMESTAMP( + iv_last_watermark, + 'America/New_York' -- source timezone + ); + + -- Load with UTC comparison + MERGE INTO TARGET_DATA + USING ( + SELECT * + FROM SOURCE_DATA + WHERE TO_UTCTIMESTAMP(CREATED_AT, 'America/New_York') > :v_watermark_utc + ) delta + ON TARGET_DATA.ID = delta.ID + WHEN MATCHED THEN + UPDATE SET TARGET_DATA.VALUE = delta.VALUE + WHEN NOT MATCHED THEN + INSERT VALUES (delta.ID, delta.VALUE); +END; +``` + +**Deduplicate if Overlaps Exist:** +```sql +-- Merge with deduplication on key fields +MERGE INTO TARGET_CUSTOMER tc +USING ( + SELECT * + FROM SOURCE_DELTA + QUALIFY ROW_NUMBER() OVER ( + PARTITION BY CUSTOMER_ID + ORDER BY CHANGENUMBER DESC + ) = 1 -- Keep only latest version +) delta +ON tc.CUSTOMER_ID = delta.CUSTOMER_ID +WHEN MATCHED THEN + UPDATE SET tc.NAME = delta.NAME +WHEN NOT MATCHED THEN + INSERT VALUES (delta.CUSTOMER_ID, delta.NAME); +``` + +### Error Type 3: Merge Conflicts + +**Symptoms:** +``` +ERROR: Merge violation - target record locked +ERROR: Constraint violation - duplicate key +ERROR: Cannot delete row referenced by 
foreign key +WARN: 500 rows failed to merge due to conflicts +``` + +**Root Causes:** +- Concurrent modifications (other process updating target) +- Key field contains duplicate values +- Foreign key constraint violated +- NOT NULL column has NULL value + +**Diagnosis:** + +``` +1. Check for locks: + CALL DBMS_LOCKS.CHECK_LOCKS(); + +2. Check for duplicate keys: + SELECT CUSTOMER_ID, COUNT(*) as cnt + FROM TARGET_CUSTOMER + GROUP BY CUSTOMER_ID + HAVING COUNT(*) > 1; + +3. Check for orphaned records: + SELECT * + FROM TARGET_CUSTOMER tc + WHERE NOT EXISTS ( + SELECT 1 FROM CUSTOMER_MASTER cm + WHERE cm.CUSTOMER_ID = tc.CUSTOMER_ID + ); + +4. Check for NULLs in NOT NULL columns: + SELECT * + FROM TARGET_CUSTOMER + WHERE CUSTOMER_ID IS NULL + OR CUSTOMER_NAME IS NULL; +``` + +**Solution:** + +**Handle Duplicate Keys:** +```sql +-- Before merge, deduplicate +MERGE INTO TARGET_CUSTOMER tc +USING ( + -- Get only latest version of each key + SELECT * + FROM SOURCE_DELTA + QUALIFY ROW_NUMBER() OVER ( + PARTITION BY CUSTOMER_ID + ORDER BY SEQUENCE_NUM DESC + ) = 1 +) delta +ON tc.CUSTOMER_ID = delta.CUSTOMER_ID +WHEN MATCHED THEN + UPDATE SET tc.CUSTOMER_NAME = delta.CUSTOMER_NAME +WHEN NOT MATCHED THEN + INSERT VALUES (delta.CUSTOMER_ID, delta.CUSTOMER_NAME); +``` + +**Handle Foreign Key Violations:** +```sql +-- Insert only valid references +MERGE INTO ORDERS o +USING ( + SELECT * + FROM ORDERS_STAGING os + WHERE EXISTS ( + SELECT 1 FROM CUSTOMER_MASTER cm + WHERE cm.CUSTOMER_ID = os.CUSTOMER_ID + ) +) delta +ON o.ORDER_ID = delta.ORDER_ID +WHEN MATCHED THEN + UPDATE SET o.CUSTOMER_ID = delta.CUSTOMER_ID +WHEN NOT MATCHED THEN + INSERT VALUES (delta.ORDER_ID, delta.CUSTOMER_ID); +``` + +**Handle Concurrent Updates:** +```sql +-- Add retry logic +PROCEDURE MERGE_WITH_RETRY ( + IN iv_max_retries INT DEFAULT 3 +) +LANGUAGE SQLSCRIPT +AS +BEGIN + DECLARE v_retry_count INT := 0; + DECLARE v_success CHAR(1) := 'N'; + + WHILE :v_retry_count < :iv_max_retries AND :v_success = 
'N' DO
+    BEGIN
+      -- SQLScript handlers are declared at the top of the block:
+      -- on any SQL error, count the retry, give up after the max, else wait
+      DECLARE EXIT HANDLER FOR SQLEXCEPTION
+      BEGIN
+        v_retry_count := :v_retry_count + 1;
+        IF :v_retry_count >= :iv_max_retries THEN
+          RESIGNAL;  -- Give up after max retries
+        END IF;
+        -- Wait before retrying (requires USING SQLSCRIPT_SYNC AS SYNC;
+        -- in the procedure header)
+        CALL SYNC:SLEEP_SECONDS(5);
+      END;
+
+      -- Attempt merge
+      MERGE INTO TARGET_DATA td
+      USING SOURCE_DELTA sd
+        ON td.KEY = sd.KEY
+      WHEN MATCHED THEN
+        UPDATE SET td.VALUE = sd.VALUE
+      WHEN NOT MATCHED THEN
+        INSERT VALUES (sd.KEY, sd.VALUE);
+
+      v_success := 'Y';
+    END;
+  END WHILE;
+END;
+```
+
+## Task Chain Failures
+
+### Error Type 1: Dependency Errors
+
+**Symptoms:**
+```
+ERROR: Task [STEP_B] failed - dependency on STEP_A not satisfied
+ERROR: Task [STEP_C] not executed - parent task STEP_B failed
+WARN: Task chain halted at step 2 of 5
+```
+
+**Root Causes:**
+- Parent task failed before child task could run
+- Dependency graph has circular reference
+- Task configuration specifies wrong predecessor
+- Timing issue - child starts before parent finishes
+
+**Diagnosis:**
+
+```
+1. View task chain graph:
+   Task Chain → [Chain Name] → Design View
+   Look for: Red nodes (failed), Blue nodes (waiting)
+
+2. Check dependency configuration:
+   Task [STEP_B] Dependencies:
+   ├─ Require STEP_A success
+   ├─ Start after STEP_A completion
+   └─ If STEP_A fails, STEP_B: SKIP | FAIL | RETRY
+
+3. Check task execution order:
+   Logs show: STEP_A failed at 10:15:00
+   Logs show: STEP_B started at 10:15:05 (shouldn't have!)
+```
+
+**Solution:**
+
+**Check Dependency Graph:**
+```
+Task Chain: CUSTOMER_LOAD_CHAIN
+├─ STEP_A: REP_CUSTOMER (Replication)
+│   └─ Status: FAILED
+├─ STEP_B: TF_CUSTOMER_TRANSFORM (Transformation)
+│   └─ Depends on: STEP_A SUCCESS (blocked)
+├─ STEP_C: TF_CUSTOMER_AGGREGATE (Aggregation)
+│   └─ Depends on: STEP_B SUCCESS (blocked)
+└─ STEP_D: TF_CUSTOMER_PUBLISH (Export)
+    └─ Depends on: STEP_C SUCCESS (blocked)
+
+Fix:
+1. Investigate STEP_A failure (see Replication Flow errors)
+2. Restart STEP_A
+3.
Task chain automatically continues through B → C → D +``` + +**Configure Error Handling:** +``` +Task Dependency Settings: +├─ If predecessor fails: +│ ├─ Option 1: STOP (whole chain halts) +│ ├─ Option 2: SKIP (run anyway, may fail) +│ ├─ Option 3: RETRY (attempt predecessor again) +│ └─ Option 4: CONDITIONAL (run only if success) +│ +└─ Example configuration: + STEP_B depends on STEP_A + If STEP_A fails: RETRY (up to 3 times) + If all retries fail: STOP (don't waste resources) +``` + +### Error Type 2: Scheduling Conflicts + +**Symptoms:** +``` +WARN: Task scheduled to run 10:00 but previous run still executing at 10:15 +ERROR: Task skipped - overlap with concurrent execution detected +WARN: Daily run scheduled but weekly run already in progress +``` + +**Root Causes:** +- Previous run took longer than expected +- Frequency too high (overlapping executions) +- Manual trigger conflicts with scheduled run +- System resources exhausted + +**Diagnosis:** + +``` +Check execution timeline: +Run 1: Starts 10:00, ends 10:45 (45 min) +Run 2: Scheduled 10:30 (overlap with Run 1!) +Run 3: Scheduled 11:00 (may overlap with Run 2!) + +Issue: Task takes 45 min but scheduled every 30 min. 
+Duration: 45 min +Frequency: 30 min +Fix: Change frequency to 60 min minimum +``` + +**Solution:** + +**Adjust Schedule Frequency:** +``` +Current Schedule: +├─ Frequency: Every 30 minutes +├─ Avg Duration: 45 minutes +├─ Overlap Risk: HIGH + +New Schedule: +├─ Frequency: Every 60 minutes (1 hour) +├─ Avg Duration: 45 minutes +├─ Buffer: 15 minutes +└─ Overlap Risk: LOW +``` + +**Prevent Overlapping Runs:** +``` +Task Chain Settings: +├─ Allow Parallel Execution: NO +├─ Max Concurrent Runs: 1 +├─ Queue Behavior: Skip if busy +└─ Description: Ensures one run at a time +``` + +### Error Type 3: Parallel Execution Issues + +**Symptoms:** +``` +ERROR: Race condition - two tasks writing same target table +WARN: Task 1 and Task 2 both running - resource contention +ERROR: Deadlock - Task A waiting for Task B, Task B waiting for Task A +``` + +**Root Causes:** +- Tasks configured to run in parallel but have shared resources +- No locking mechanism to prevent conflicts +- Circular dependencies in parallel tasks +- Memory/CPU constraints with parallel load + +**Diagnosis:** + +``` +Check Parallel Configuration: +Task A: Loads TABLE_X +Task B: Loads TABLE_X (parallel) +Task C: Aggregates TABLE_X + +Issue: Tasks A and B both write TABLE_X simultaneously! 
+
+Solution: Make A and B sequential OR
+          Load to separate tables and union results
+```
+
+**Solution:**
+
+**Separate Target Tables Instead of a Shared Target:**
+```
+Change from:
+Task A → TABLE_X ┐
+Task B → TABLE_X ├─ TABLE_RESULT
+Task C ────────────→
+
+To:
+Task A → TABLE_X_A ┐
+Task B → TABLE_X_B ├─ TABLE_RESULT
+Task C ────────────→
+(C does UNION of X_A and X_B)
+```
+
+**Add Locking:**
+```sql
+-- Serialize concurrent loaders with an exclusive table lock.
+-- (HANA has no DBMS_LOCK package; the lock is held until the
+-- transaction ends with COMMIT or ROLLBACK.)
+CREATE PROCEDURE LOAD_WITH_LOCK
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+  DECLARE EXIT HANDLER FOR SQLEXCEPTION
+    BEGIN
+      ROLLBACK; -- releases the lock on error
+      RESIGNAL;
+    END;
+
+  LOCK TABLE TARGET_CUSTOMER IN EXCLUSIVE MODE;
+
+  -- Load data (only one task executes at a time)
+  MERGE INTO TARGET_CUSTOMER ...;
+
+  COMMIT; -- releases the lock
+END;
+```
+
+## When to Restart vs Investigate Deeper
+
+### Restart Only (Self-Resolving Issues)
+
+```
+Symptoms → Action:
+├─ Temporary timeout → Restart flow
+├─ One-time network hiccup → Restart flow
+├─ Cloud Connector was briefly down → Restart flow
+└─ Out of memory during spike → Restart flow (if single instance)
+
+Expected Outcome: Successful on second attempt
+No code/config changes needed
+```
+
+### Investigate Before Restarting (Persistent Issues)
+
+```
+Symptoms → Investigation Required:
+├─ Authorization error → Fix user roles first
+├─ Schema mismatch → Refresh schema and adjust mapping
+├─ Duplicate rows → Add deduplication logic
+├─ Memory consistently exceeded → Reduce batch size or add filter
+├─ Dependency failure → Fix parent task first
+└─ Data quality issues → Add validation/cleansing
+
+Expected Outcome: Understand root cause, apply fix
+Prevent same issue recurring
+```
+
+## MCP Tool References
+
+### get_task_status
+Check current and historical status of flows and tasks:
+
+```
+get_task_status(
+  task_name="REP_CUSTOMER_DAILY",
+  include_history=True,
+  last_runs=10
+)
+```
+
+### search_repository
+Find flows and dependencies:
+
+```
+search_repository( + object_type="REPLICATION_FLOW", + name_contains="CUSTOMER", + status="FAILED" +) +``` + +### get_object_definition +View complete flow configuration: + +``` +get_object_definition(object_id="REP_CUSTOMER_DAILY") +``` + +### execute_query +Run test queries to validate data and diagnose issues: + +``` +execute_query( + query="SELECT COUNT(*) FROM TARGET_CUSTOMER WHERE LOAD_DATE = CURRENT_DATE" +) +``` + +## Reference Materials + +See reference files for detailed procedures: +- `references/error-catalog.md` - Complete error code catalog with solutions +- `references/abap-side-monitoring.md` - ABAP-side monitoring tools (DHCDCMON, DHRDBMON, SLG1) for diagnosing Replication Flow issues on the source S/4HANA system +- `references/replication-flow-error-patterns.md` - 9 known Replication Flow error patterns with root causes, diagnostic steps, SAP Note references, and SAP support component assignments + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/abap-side-monitoring.md b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/abap-side-monitoring.md new file mode 100644 index 0000000..09f1e01 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/abap-side-monitoring.md @@ -0,0 +1,422 @@ +# ABAP-Side Monitoring for Replication Flows + +## Overview + +Replication Flows in SAP Datasphere depend on a distributed architecture spanning both the source SAP S/4HANA system and the Datasphere environment. The source system hosts three critical components: + +- **CDC Engine** (Change Data Capture): Captures insert, update, and delete events from application tables using logging tables and the CDC framework +- **Resilient Data Buffer (RDB)**: A staging layer that stores captured changes temporarily before transmission +- **Replication Management Service (RMS)**: Coordinates data collection and transfer to Datasphere + +Most runtime issues originate on the source system. 
While Datasphere's Data Integration Monitor shows replication status, the diagnostic root cause almost always requires investigation of ABAP-side components. This reference covers the transaction-based tools available in S/4HANA for monitoring and troubleshooting these components. + +--- + +## Transaction DHCDCMON (CDC Monitor) + +### Purpose + +DHCDCMON provides real-time visibility into Change Data Capture (CDC) operations. It monitors the logging table infrastructure, view reconstruction, and data flow from application tables through the CDC framework. Use this transaction to diagnose issues with "Initial and Delta" replication flows where the CDC engine is actively capturing changes. + +### Object Overview Tab + +The Object Overview tab displays the core CDC state for each registered object: + +**Registered Objects and CDC Status Section** +- **Object Name**: The CDS view being replicated (e.g., `I_Customer`) +- **Application Table**: The source S/4 table being monitored (e.g., `KNA1`) +- **Master LogTab (/1DH/ML\*)**: System-generated logging table receiving change events from the application +- **Subscriber LogTab (/1DH/SL\*)**: Derived logging table holding reconstructed view data +- **Record Counts**: Shows current row counts in both logging tables + +**Key Diagnostic Rule: Logging Table Health Indicator** +- In a healthy state, both Master and Subscriber logging tables should contain 0 or minimal records (typically 1-10) +- Records accumulate temporarily as the CDC framework processes changes +- A **permanently rising record count** indicates the Observer or Transfer job is not consuming data +- High record counts (thousands+) signal a backlog and point to job failures or resource constraints + +**Subscriber Worklist Section** +- Lists individual subscriber connections with their processing status +- **Status B** (Blocked): Subscriber processing is suspended; investigate job failures or authorization issues +- **Status E** (Failed): Subscriber encountered 
an error during processing; check Application Log +- Any records appearing in this section indicate active issues requiring immediate attention + +**Subscriber Logging Table Sequence ID** +- Tracks the highest pointer value for each object's Subscriber logging table +- Increments each time a new change record is inserted into the logging table +- A static sequence ID (not changing over time) indicates no new changes are being captured or processed + +**Observer Job Worklist** +- Shows which logging tables are currently being processed by the background Observer job +- Reflects real-time job activity; entries appear and disappear as processing completes +- Absence of activity for extended periods suggests the Observer job is not running + +### Job Settings Tab + +The Job Settings tab controls two critical background jobs that move data through the CDC pipeline: + +**Observer Job (/1DH/OBSERVE_LOGTAB)** +- **SM37 Job Name**: `/1DH/OBSERVE_LOGTAB` +- **Underlying Report**: `DHCDCR_OBSERVE_LOGTAB` +- **Function**: Reads entries from the Master logging table, applies view reconstruction logic (executing CDS view definitions against captured deltas), and writes reconstructed results to the Subscriber logging table +- **Required Status**: Must show **green** status icon for healthy operation +- **Failure Impact**: If not green, view reconstruction halts, preventing delta data from propagating + +**Transfer Job (/1DH/PUSH_CDS_DELTA)** +- **SM37 Job Name**: `/1DH/PUSH_CDS_DELTA` +- **Underlying Report**: `DHCDCR_PUSH_CDS_DELTA` +- **Function**: Extracts records from the Subscriber logging table and pushes them into the Resilient Data Buffer (RDB), preparing them for transmission to Datasphere +- **Required Status**: Must show **green** status icon +- **Failure Impact**: If not green, even though CDC is capturing and reconstructing data, it never reaches the buffer or Datasphere + +**Job Recovery** +- If either job shows a non-green status (red, yellow, or stopped): + - 
Click the **Dispatcher Job** button to trigger immediate rescheduling + - The system will attempt to restart the job within seconds + - Refresh the tab to verify the status icon returns to green +- If status does not recover after rescheduling, proceed to Application Log + +**Application Log Integration** +- Click the **Application Log** button to jump to SLG1 (System Log) +- Displays detailed error messages from the most recent job execution +- Essential for understanding why a job failed to complete + +### Key Parameters (from Job Settings) + +The Job Settings tab also displays configuration parameters that control job behavior and performance: + +**Observer Job Parallelism and Scheduling** +- **OBSERVER_MAX_JOBS / OBSERVER_MIN_JOBS**: Number of parallel Observer job instances (default typically 1-4) +- **OBSERVER_MIN_RUNTIME**: Minimum duration a job waits for new records before terminating +- **OBSERVER_PERIOD**: Time interval between job start attempts (controls scheduling frequency) + +**Transfer Job Parallelism and Performance** +- **TRANSFER_MAX_JOBS / TRANSFER_MIN_JOBS**: Number of parallel Transfer job instances +- **Tuning Note**: SAP Note 3223735 recommends increasing `TRANSFER_MAX_JOBS` to 4-8 for high-volume scenarios to prevent RDB buffer overflow +- **TRANSFER_MAX_RUNTIME_MIN**: Maximum execution time per Transfer job instance (in minutes) + +**Health Monitoring and Thresholds** +- **HEALTH_CHECK_PERIOD**: Interval (in seconds) between system health checks +- **LOGTAB_RECORDS_LIMIT**: Record count threshold at which the system raises a warning (e.g., 100,000) +- **LOGTAB_RECORDS_LIMIT_WARNING**: Alerts are triggered when actual record count exceeds this value + +### Typical Error Symptoms + +The following patterns in DHCDCMON indicate CDC problems requiring investigation: + +1. 
**Subscriber Logging Table Permanently Growing** + - Record count increases over time and does not decrease + - Indicates the Transfer Job is not running or not consuming records + - Check Job Status tab for Transfer Job status; if not green, reschedule via Dispatcher Job button + +2. **Subscriber Worklist Shows Status B or E** + - Status B (Blocked) suggests authorization failure or resource lock + - Status E (Failed) indicates processing error; check Application Log for details + - May indicate communication failure between source and Datasphere + +3. **Repeating Errors in Application Log** + - Same error message appearing across multiple job execution attempts + - Systematic issue rather than transient failure; requires investigation of cause + - Check SLG1 for more detailed error context + +4. **Job Status Not Green (Red or Yellow)** + - Job is inactive, failed, or in an error state + - Reschedule via Dispatcher Job button + - If status does not recover, review Application Log + +5. **Authorization Failures** + - Message: "User XXX is not authorized to use the dispatcher/observer process" + - The background job user lacks necessary privileges + - Check SU53 (Authorization Check) and review user role assignment + - Ensure user has the `DHCDC` authorization object assigned + +### Related SAP Notes + +- **KBA 3365864**: "Where does information in DHCDCMON come from?" — Explains data sources and refresh behavior +- **KBA 2930269**: "ABAP CDS CDC common issues and troubleshooting" — Comprehensive CDC troubleshooting guide + +--- + +## Transaction DHRDBMON (RDB Monitor / Data Buffer Monitor) + +### Purpose + +DHRDBMON provides visibility into the Resilient Data Buffer (RDB), which acts as a staging layer between the CDC Engine and Datasphere. The buffer temporarily stores captured changes, organizes them into packages, and waits for the Replication Management Service (RMS) to collect and transmit them. 
Use this transaction to diagnose issues where CDC is working but data is not reaching Datasphere, or when the buffer is accumulating records. + +### Buffer Table Overview + +The Buffer Table Overview section displays summary information for each object's buffer: + +**Core Buffer Metadata** +- **Object Name**: The CDS view being replicated +- **Buffer Table Name**: Generated automatically when the object is subscribed from Datasphere; typically follows pattern `/1DH/BUF_*` +- **Producer ID**: Identifier corresponding to the Subscriber ID in DHCDCMON; useful for cross-referencing CDC and RDB state +- **Phase**: Maps to replication status in Datasphere's Replication Flow UI (e.g., "Initial", "Delta", "Complete") +- **Buffer Table Status**: Current operational state (Ready, Full, etc.) + +### Buffer Table Details + +Expanding a buffer table entry reveals detailed statistics: + +**Capacity and Content Metrics** +- **Maximum Buffer Records**: Total record capacity the buffer can hold +- **Package Size**: Number of records grouped into each package for transmission +- **Current Number of Records**: Active records currently in the buffer +- **Records not Assigned to Package**: Count of records not yet grouped into a transmission package + - Highlighted if this value is high relative to current records + - Indicates data is accumulating without being organized for transmission + - May signal RMS is not consuming packages or packages cannot form due to size constraints + +**Package Processing Status** +- **Number of Packages Ready**: Packages completed and waiting for RMS to collect + - Should appear and disappear as RMS transfers data to Datasphere + - Accumulation indicates RMS collection failure or Cloud Connector issues +- **Number of Packages In Process**: Packages currently being transferred by RMS + - Reflects real-time transfer activity + - Should increase and decrease as RMS batches data + +### Package Overview + +The Package Overview section lists individual 
packages with detailed status: + +**Package Information** +- **Package ID**: Unique identifier for this transmission batch +- **Status**: Current state (Ready, In Process, Transferred, etc.) +- **Number of Records**: Record count in this package +- **Last Status Change**: Timestamp of the most recent status update + +**Package Accumulation Analysis** +- **Expected Behavior**: Packages should appear in READY status, then transition to IN_PROCESS, then be removed as RMS collects them +- **Accumulation Pattern**: If READY packages accumulate without being consumed (remaining in status Ready for minutes), indicates RMS or Cloud Connector is not polling or collecting +- **Performance Bottleneck**: If packages accumulate while in IN_PROCESS status, indicates slow network or Datasphere ingest capacity issue + +### Expert Functions + +The Package Overview tab includes action buttons for manual intervention: + +- **Delete Package**: Removes a package and its records from the buffer (use cautiously; may lose data) +- **Change Status to Ready**: Forces a package back to Ready status for retransmission +- **Remove Records from Package**: Deletes specific records from a package + +Use these sparingly; they are diagnostic aids rather than routine operations. 
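
The buffer and package signals above map onto a few simple triage rules. The sketch below is purely illustrative: DHRDBMON is a GUI transaction with no public API, so the input values would be read manually from the Buffer Table Details and Package Overview screens, and the rule wording is this guide's interpretation rather than official SAP guidance.

```python
# Illustrative triage of DHRDBMON observations. Inputs are counts read
# manually from the Buffer Table Details and Package Overview screens;
# DHRDBMON exposes no API, so this is a reasoning aid, not an integration.

def triage_rdb(buffer_records: int, ready_packages: int,
               in_process_packages: int) -> str:
    """Map buffer/package counts to the likely failing component."""
    if ready_packages > 0 and in_process_packages == 0:
        # Packages form but are never collected: downstream problem
        return "downstream: RMS / Cloud Connector is not collecting packages"
    if in_process_packages > 0:
        # Packages stall mid-transfer: throughput problem
        return "transfer: network or Datasphere ingest capacity bottleneck"
    if buffer_records == 0:
        # Nothing arrives from CDC: upstream problem
        return "upstream: check the DHCDCMON Transfer Job (buffer is empty)"
    # Records arrive but never become Ready packages
    return "packaging: check Package Size and buffer capacity settings"

print(triage_rdb(buffer_records=0, ready_packages=0, in_process_packages=0))
```

The same boundaries recur below: accumulating READY packages point downstream, an empty buffer points upstream, and a filling buffer with no packages points at packaging configuration.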
+ +### Diagnostic Decision Tree + +Use the following decision tree to isolate RDB-related issues: + +**Are READY packages appearing in the Package Overview section?** + +- **YES, but accumulating over time** + - Packages are being created but not collected by RMS + - Problem is downstream: RMS is not polling, Cloud Connector is disconnected, or Datasphere ingestion is blocked + - Action: Check Cloud Connector status, verify RMS service health in Datasphere, review network connectivity + - This is NOT an ABAP-side issue; focus troubleshooting on infrastructure and Datasphere + +- **NO packages appearing, buffer table stays empty or has minimal records** + - RDB is not receiving data from the CDC Engine + - Problem is upstream: Transfer Job is not running, CDC is not capturing changes, or records are stuck in Subscriber logging table + - Action: Check DHCDCMON Transfer Job status, verify CDC is capturing (check Subscriber logging table record count), review Application Log for errors + +- **NO packages appearing, buffer table is full or filling up** + - Data is arriving from CDC but packages are not forming or are not being marked as Ready + - Problem is in the packaging logic or buffer is misconfigured + - Action: Check buffer capacity settings, verify Package Size parameter, check for errors in DHRDBMON Application Log, consider manually changing package status to Ready to resume flow + +--- + +## Transaction SLG1 (Application Log) + +### Purpose + +SLG1 (System Log) is the centralized application log for SAP systems. Replication Flow errors from the source S/4HANA system are logged with detailed messages, error codes, and resolution guidance. When DHCDCMON or DHRDBMON show failures, SLG1 provides the underlying cause. + +### How to Use + +**Step-by-Step Query Process** + +1. 
**Obtain the Error Timestamp from Datasphere**
+   - Open the failing Replication Flow in Datasphere Data Integration Monitor
+   - Navigate to the Messages tab
+   - Note the error timestamp (typically displayed in the user's local timezone)
+
+2. **Adjust for Time Zone Difference**
+   - S/4HANA system time and the Datasphere user's timezone may differ
+   - In SAP GUI, choose System > Status from the menu to check the source system's current date/time
+   - Calculate the offset between source system time and the timestamp you recorded
+   - Adjust the query timestamps accordingly
+
+3. **Open SLG1 on the Source System**
+   - Transaction SLG1 in SAP GUI
+   - Defaults to showing logs from the past 24 hours
+
+4. **Set Date/Time Range**
+   - Enter **From Date** and **From Time** (adjusted for source system timezone)
+   - Enter **To Date** and **To Time** (set slightly after the error timestamp to capture surrounding context)
+   - Typical window: error timestamp ± 5 minutes
+
+5. **Execute Query**
+   - Press F8 to execute
+   - SLG1 retrieves matching logs
+
+6. **Review Results and Expand Logs**
+   - Logs are displayed as collapsible entries
+   - Click on an entry to expand and view detailed messages
+   - Multiple messages may be grouped under one log
+
+### Important Objects to Filter
+
+SLG1 can return hundreds of logs across many SAP components. 
To reduce noise, focus on these object categories relevant to Replication Flows: + +- **DHADM**: Administration components and system configuration +- **DHAMB**: ABAP Management (job scheduling, process control) +- **DHAPE**: ABAP Pipeline Engine (data transformation and extraction) +- **DHBAS**: Base Framework (core replication infrastructure) +- **DHCDC**: Change Data Capture (logging table operations, view reconstruction) +- **DHODP**: Operational Data Provisioning (legacy extraction framework) +- **DHRDB**: Resilient Data Buffer (packaging, transmission staging) + +Filter by entering one of these object codes in the Object field to narrow results to Replication-specific logs. + +### Reading Error Details + +**Accessing Detailed Error Messages** +- Click the yellow question mark icon (?) or magnifying glass icon next to a log entry +- Opens a detailed message view with full error context + +**Message Components** +- **Message Class (sy-msgid)**: Classification code (e.g., `DH`, `00`) +- **Message Number (sy-msgno)**: Specific error number within the class +- **Cause Description**: Explanation of what triggered the error +- **System Response**: What the system did in response +- **Resolution Instructions**: Suggested corrective actions +- **Searchable Keywords**: Tags for finding related documentation + +**Long Text Reading** +- Some messages contain extensive long text with step-by-step instructions +- Scroll or expand the text area to view complete content +- Print or export messages if needed for analysis outside SAP + +### Important Note: Search Limitations + +- **Critical Limitation**: SLG1 does NOT support free-text search +- Filtering is based only on time ranges, object codes, and external identifiers +- To find errors related to a specific object (e.g., a CDS view name), you must: + - Search within a narrow time window (reducing results manually) + - Use object code filtering to limit to relevant components + - Scan results manually or export to 
spreadsheet for offline analysis + +**Time Zone Precision is Critical** +- Always adjust timestamps for source system timezone +- A ±10 minute error can result in missing relevant logs +- When unsure of exact time, query a wider window (±15 minutes) to avoid missing data + +--- + +## Cross-Tool Diagnostic Workflow + +For a failing Replication Flow, follow this systematic investigation sequence to isolate the component causing the issue: + +**Step 1: Capture Initial Error Information** +- Open Datasphere Data Integration Monitor +- Navigate to the failing Replication Flow's Messages tab +- Record the error message text and timestamp +- Note the object name (CDS view being replicated) + +**Step 2: Check RDB Status (Downstream of CDC)** +- Open DHRDBMON on the source system +- Search for the object by name +- Questions to answer: + - Are READY packages accumulating? (indicates RMS/Cloud Connector issue) + - Is the buffer table empty? (indicates CDC not pushing data) + - Is the buffer table full? (indicates packaging or downstream bottleneck) + +**Step 3: Check CDC Status (Data Capture)** +- Open DHCDCMON on the source system +- Search for the object +- Questions to answer: + - Is the Subscriber logging table growing? (indicates Transfer Job not running) + - Are any Subscriber Worklist entries in status B or E? (indicates job failure) + - Are both Observer Job and Transfer Job showing green status? 
+ +**Step 4: Retrieve Detailed Error Context** +- Open SLG1 on the source system +- Enter adjusted timestamp range and filter by object code (DHCDC, DHRDB, DHAPE, or DHBAS) +- Find matching error log entries +- Expand entries to read the detailed error message with cause and resolution + +**Step 5: Test CDC Outside of Datasphere (If CDC Issue Suspected)** +- If DHCDCMON shows logging table growth or job failures, test the CDS view independently +- Open RODPS_REPL_TEST on the source system +- Test the CDS view to confirm CDC replication works outside of Datasphere context +- If test fails, the issue is in the CDS view definition or CDC configuration, not in Datasphere + +**Step 6: Check Operational Data Queue (If ODP Involved)** +- For some CDS views, Operational Data Provisioning (ODP) is the underlying extraction mechanism +- Open ODQMON on the source system +- Monitor queue status and extraction counters for the object +- Confirms whether ODP subscribers are receiving data + +**Decision Logic** +- **Problem in steps 3-4 (DHCDCMON or SLG1)**: Issue is on the ABAP source system; fix CDC configuration, job scheduling, or authorization +- **Problem in step 2 (DHRDBMON)**: Issue is either upstream (CDC not pushing) or downstream (RMS not collecting) + - Check step 3 to determine which +- **Problem passes steps 2-4 but fails in step 5 (RODPS_REPL_TEST)**: CDS view definition has an issue incompatible with Datasphere replication +- **All steps pass on source system**: Problem is in Datasphere, Cloud Connector, or network + +--- + +## Additional Diagnostic Tools + +Beyond the primary CDC and RDB transactions, several other tools provide complementary diagnostic information: + +### SDDLAR (DDL Source Tool) +- **When to Use**: When CDC is failing due to CDS view definition issues +- **Capabilities**: + - Display DDL Source: View the ABAP CDS definition syntax + - Check DDL Source: Validate syntax without compilation + - Data Preview: Execute the CDS view to test data 
retrieval +- **Diagnostic Value**: Confirms whether the CDS view itself is valid and executable + +### RODPS_REPL_TEST (Replication Test Tool) +- **When to Use**: When CDC replication is failing in the context of Datasphere, but you want to isolate whether the CDC engine or CDS view is the problem +- **Function**: Tests CDS view replication outside of Datasphere context +- **Expected Behavior**: Successful test indicates the CDS view and CDC engine are healthy; failure points to CDS definition or CDC configuration issues + +### ODQMON (Operational Data Queue Monitor) +- **When to Use**: For CDS views that use ODP (Operational Data Provisioning) as the underlying extraction mechanism +- **Visibility**: Queue status, subscriber count, extraction counters, and request history +- **Diagnostic Value**: Confirms whether ODP is successfully extracting and queuing data + +### SM37 (Background Job Log) +- **When to Use**: When DHCDCMON shows job failures +- **Actions**: + - Verify job status (completed, failed, running) + - Review job logs and execution history + - Check job scheduling parameters + - Manually trigger job execution for testing + +### SU53 (Authorization Check) +- **When to Use**: When errors mention "User XXX is not authorized" +- **Function**: Shows the last failed authorization check for the current user +- **Diagnostic Value**: Identifies missing authorization objects or role assignments + +### STAUTHTRACE (Authorization Trace) +- **When to Use**: When SU53 does not provide sufficient detail +- **Function**: Enables detailed trace of authorization checks for a specific user or transaction +- **Diagnostic Value**: Shows every authorization check evaluated, which object failed, and why + +--- + +## Summary Table: Transaction Reference + +| Transaction | Component | Primary Use | Key Actions | +|---|---|---|---| +| DHCDCMON | CDC Engine | Monitor change capture | Check job status, review logging tables, reschedule jobs | +| DHRDBMON | RDB Buffer | Monitor 
data staging | Check package status, review buffer capacity | +| SLG1 | Logs | Find error details | Query by time range and object code | +| SDDLAR | CDS Metadata | Validate view definition | Display source, check syntax, preview data | +| RODPS_REPL_TEST | CDC Testing | Test outside Datasphere | Confirm CDC engine capability | +| ODQMON | ODP Queue | Monitor queue status | Check extraction count and queue depth | +| SM37 | Jobs | Monitor background jobs | Check Observer/Transfer job logs | +| SU53 | Authorization | Check failed permissions | Identify missing authorizations | +| STAUTHTRACE | Authorization | Trace permission checks | Debug authorization failures | diff --git a/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/error-catalog.md b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/error-catalog.md new file mode 100644 index 0000000..e542ea6 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/error-catalog.md @@ -0,0 +1,947 @@ +# Error Catalog Reference + +## Replication Flow Errors + +### Connection and Authorization Errors + +#### Error: Cannot reach SAP system +``` +Code: CONN_001 +Severity: CRITICAL +Message: Unable to establish connection to host [s4h-prod.example.com:443] + +Causes: +1. Network unreachable +2. Cloud Connector offline +3. Firewall blocking connection +4. Wrong host/port configuration +5. Source system down + +Diagnosis: +- Ping host: ping s4h-prod.example.com +- Check Cloud Connector: CC admin console → Status +- Check network: telnet s4h-prod.example.com 443 +- Verify config: Data Integration → Connection → Check host/port + +Solutions: +1. Verify network connectivity +2. Restart Cloud Connector if offline +3. Check firewall rules allow port 443 +4. Verify connection credentials +5. Contact SAP system administrator +6. 
Increase timeout from 300s to 600s (if slow network) + +Prevention: +- Redundant Cloud Connector instances (HA) +- Monitor Cloud Connector health continuously +- Document network/firewall requirements +- Test connection before scheduling +``` + +#### Error: Authorization failed - RFC_READ_TABLE denied +``` +Code: AUTH_001 +Severity: CRITICAL +Message: User [DATASPH_USER] not authorized for RFC_READ_TABLE + +Causes: +1. User missing required role +2. RFC not enabled for user +3. Data security policy restricts access +4. Role missing specific authorization objects + +Diagnosis: +- Check user roles: S/4HANA → PFCG → [USER] → Roles +- Check RFC auth: SM59 → ODP_BACKEND → Test +- Check security policy: RSECADMIN → Data access rules + +Solutions: +1. Assign role SAP_BC_ANALYTICS_EXTRACTOR +2. Add RFC authorization: S_ODP_* objects +3. Grant table access via S_TABU_DIS +4. Review data security policies in RSECADMIN +5. Create custom role with minimum permissions + +Prevention: +- Use dedicated technical user +- Apply principle of least privilege +- Regular access review +- Document required authorizations +``` + +#### Error: CDS view not marked for extraction +``` +Code: EXTRACT_001 +Severity: CRITICAL +Message: CDS view [I_CUSTOMER] not extraction-enabled + +Causes: +1. Using internal view (I_*) instead of consumption (C_*) +2. Extraction not enabled in view annotations +3. View changed in new SAP patch +4. Custom view not properly configured + +Diagnosis: +- Check view prefix: C_* (consumption) vs I_* (internal) +- Check annotations: SE11 → View → [Name] + Look for: @Analytics.dataExtraction.enabled: true +- Search catalog: Datasphere → Search with extraction filter + +Solutions: +1. Use SAP-provided consumption view (C_CUSTOMER not I_CUSTOMER) +2. Create custom view with extraction enabled: + @Analytics.dataExtraction.enabled: true + @Analytics.dataExtraction.deltaSupported: true +3. Request SAP to enable extraction if needed +4. 
Work with ABAP team to create extraction view
+
+Prevention:
+- Always use C_* views (consumption) when available
+- Verify @Analytics annotation before using view
+- Document which views are extraction-enabled
+- Maintain list of approved CDS views for each module
+```
+
+### Data Extraction Errors
+
+#### Error: Delta queue expired - must do full reload
+```
+Code: DELTA_001
+Severity: CRITICAL
+Message: Change number 1500000 not found in delta queue (extraction gap exceeded queue retention)
+
+Causes:
+1. Gap between delta extractions > queue retention (3-8 days)
+2. Queue purged (automated cleanup of records older than retention)
+3. Change records deleted by maintenance job
+4. Extraction frequency too low (monthly vs daily requirements)
+
+Diagnosis:
+- Check delta queue: test_connection(check_delta_queue=True)
+- Check queue retention: 3-8 days typical
+- Check extraction frequency: Current vs required
+- Calculate gap: Days since last successful delta
+
+Solutions:
+1. Accept delta loss, perform full reload:
+   - Stop current delta flow
+   - Create new replication with full load
+   - Set new watermark to max(CHANGENUMBER)
+   - Resume delta from new base
+
+2. Increase extraction frequency:
+   From: Weekly
+   To: Daily
+   Reduces gap, keeps within queue retention
+
+3. Increase source system queue:
+   SPRO → ODP → Increase max size (4GB → 8GB)
+   Increase retention (8 days → 14 days)
+
+Prevention:
+- Schedule delta runs ≥2x per day so gaps never approach queue retention
+- Monitor queue health in advance
+- Set alerts: Queue days retained < 5
+- Document extraction frequency vs queue retention
+```
+
+#### Error: Source table structure changed
+```
+Code: SCHEMA_001
+Severity: HIGH
+Message: Column [REVENUE_AMOUNT] not found in source table [C_CUSTOMER]
+
+Causes:
+1. CDS view updated in new S/4HANA patch
+2. View field removed or renamed
+3. Flow mapped to wrong source
+4. 
Column metadata out of sync + +Diagnosis: +- Compare source schema: get_table_schema("C_CUSTOMER", "S4H_PROD") +- Check target schema: get_table_schema("CUSTOMER_MASTER", "DATASPHERE") +- Verify flow mappings: Data Integration → Flow → Mappings +- Check S/4HANA version: Recent patch introduced changes? + +Solutions: +1. Refresh source schema in flow + → Right-click source → Refresh Schema + → Review changes and accept + +2. Adjust field mappings + → Map source fields to target + → Fill in missing fields with defaults or NULL + → Remove obsolete target fields + +3. Recreate target table + → If structural changes significant + → Delete old target table + → Flow will recreate with current schema + → Perform full reload + +Prevention: +- Test after S/4HANA patches +- Maintain schema documentation +- Set up alerts for CDS view changes +- Use snapshot tables for archival +``` + +#### Error: No data extracted - query returns empty +``` +Code: EXTRACT_002 +Severity: MEDIUM +Message: Replication completed but 0 rows extracted from source + +Causes: +1. Source table is empty +2. Filter condition too restrictive +3. Source system down but reports success +4. Incorrect table mapping +5. Valid scenario - no new data since last run + +Diagnosis: +- Verify data exists: SELECT COUNT(*) FROM C_CUSTOMER +- Check filter: Is WHERE clause in extraction correct? +- Verify mapping: Table name, schema, connection +- Test connection directly: test_connection(check_objects=True) + +Solutions: +1. If valid (expected): Status = WARNING, continue monitoring + +2. If invalid (unexpected): + - Check source data: SELECT * FROM C_CUSTOMER LIMIT 100 + - Remove/relax filters if too restrictive + - Verify table name and schema + - Check source system for data issues + +3. 
If first run: + - Expected behavior + - Full load will extract all data + - Subsequent delta runs will be non-empty +``` + +### Performance Errors + +#### Error: Extraction slow - timeout approaching +``` +Code: PERF_001 +Severity: MEDIUM +Message: Extraction rate 10K rows/min, expected >50K rows/min; timeout in 5 min + +Causes: +1. Large dataset (millions of rows) +2. Network congestion +3. Source system busy/overloaded +4. Batch size too large +5. Parallel threads too few + +Diagnosis: +- Check source load: SM50 in S/4HANA +- Check network: latency, packet loss +- Check Cloud Connector: CPU/Memory usage +- Check Datasphere: Network throttling? +- Calculate expected time: rows / extraction_rate + +Solutions: +1. Reduce batch size: + From: 100,000 rows/batch + To: 50,000 rows/batch + Result: More batches but faster per batch + +2. Increase parallel threads: + From: 1 thread + To: 4 threads + Result: 4x faster (if resource available) + +3. Add source filter: + Load recent: WHERE changed_date >= CURRENT_DATE - 30 + Reduces volume, faster extraction + +4. Increase timeout: + From: 300 seconds + To: 600 seconds + Only if network genuinely slow + +5. Schedule off-peak: + From: 09:00 (business hours) + To: 23:00 (off-peak) + Less system contention + +Prevention: +- Estimate initial load size: Rows × 1KB = Storage +- Test extraction with sample before scheduling +- Monitor trends over time +- Set thresholds for alerts +``` + +## Data Flow Errors + +### Memory and Resource Errors + +#### Error: Out of memory - Java heap space +``` +Code: MEM_001 +Severity: CRITICAL +Message: Java heap space - Out of memory exception in Python operator + +Causes: +1. Input dataset larger than available memory +2. Python operator loading entire dataframe at once +3. Memory leak in operator code +4. No chunking/pagination in processing +5. 
Intermediate results not freed + +Diagnosis: +- Check input row count: SELECT COUNT(*) FROM source +- Calculate memory: rows × ~1KB per row = GB needed +- Review operator code: Memory allocation patterns +- Check available memory: Data Flow → Resource Settings + +Solutions: +1. Chunked processing: + ```python + for chunk in pd.read_csv(file, chunksize=10000): + process_chunk(chunk) # Process smaller pieces + ``` + +2. Add source filter: + WHERE load_date >= CURRENT_DATE - 90 + (Extract 3 months instead of 3 years) + +3. Increase memory allocation: + Data Flow → Advanced → Memory: 16GB → 32GB + Cost: Higher (verify in pricing) + +4. Streaming/iterator pattern: + Don't load entire dataframe, process by partition + +5. Reduce intermediate objects: + ```python + # Bad: Creates copies + df2 = df1.assign(col=df1['x'] * 2) + df3 = df2.groupby('y').sum() + + # Good: Chain operations + result = df1.assign(col=df1['x'] * 2).groupby('y').sum() + ``` + +Prevention: +- Test with sample data first (LIMIT 10000) +- Monitor memory during execution +- Set alerts for memory > 80% +- Document operator memory requirements +``` + +#### Error: CPU limit exceeded - operation timeout +``` +Code: CPU_001 +Severity: HIGH +Message: CPU usage exceeded 100% limit; operator terminated + +Causes: +1. Complex calculations on large dataset +2. Inefficient algorithm (O(n²) instead of O(n)) +3. String operations on huge text fields +4. Regex patterns too complex +5. No optimization in operator code + +Diagnosis: +- Check operation complexity: Nested loops? String parsing? +- Profile code: Which lines consume most CPU? +- Check algorithm: Is there more efficient way? + +Solutions: +1. Optimize algorithm: + # Bad: O(n²) + for i in data: + for j in data: + if i['id'] == j['id']: ... + + # Good: O(n) with dict lookup + id_map = {row['id']: row for row in data} + for i in data: + j = id_map.get(i['id']) + +2. 
Use vectorized operations (numpy/pandas): + # Bad: Loop + result = [x * 2 for x in values] + + # Good: Vectorized + result = values * 2 # 100x faster + +3. Move complex logic to database: + # Bad: Python processing + data = df[df['revenue'] > 1000].groupby('customer') + + # Good: SQL processing (in data source) + SELECT * FROM C_SALES WHERE REVENUE > 1000 GROUP BY CUSTOMER + +4. Reduce dataset: + Add WHERE clause to extract fewer rows + +Prevention: +- Test complexity on representative data +- Monitor CPU during execution +- Document operator CPU requirements +- Use profiling tools to identify bottlenecks +``` + +### Data Type and Conversion Errors + +#### Error: Cannot convert STRING to DECIMAL +``` +Code: TYPE_001 +Severity: HIGH +Message: TypeError: cannot convert value 'ABC123' from type STRING to DECIMAL + +Causes: +1. Source delivers unexpected data format +2. Python operator assumes wrong type +3. Locale-specific formatting (comma vs period) +4. Special characters in numeric fields +5. NULL values or empty strings + +Diagnosis: +```python +# Check actual data +print(df['amount'].dtype) # STRING instead of expected DECIMAL +print(df['amount'].head(10)) # See examples +print(df['amount'].unique()[:20]) # Check for non-numeric +``` + +Solutions: +1. Safe type conversion with error handling: + ```python + df['amount'] = pd.to_numeric( + df['amount'], + errors='coerce' # Non-numeric become NULL + ) + ``` + +2. Clean before conversion: + ```python + df['amount'] = ( + df['amount'] + .astype(str) + .str.replace(',', '.') # European format + .str.replace('$', '') # Currency symbol + .str.strip() # Whitespace + ) + df['amount'] = pd.to_numeric(df['amount'], errors='coerce') + ``` + +3. Handle NULLs from failed conversions: + ```python + df['amount'] = df['amount'].fillna(0) # or any default + ``` + +4. 
Add validation: + ```python + invalid = df['amount'].isnull().sum() + if invalid > 0: + print(f"WARNING: {invalid} values failed conversion") + ``` + +Prevention: +- Examine source data: SELECT * FROM source LIMIT 100 +- Document expected formats +- Add type validation in first operator step +- Log/alert on conversion failures +``` + +#### Error: Date format invalid +``` +Code: TYPE_002 +Severity: HIGH +Message: ParserError: unable to parse string "01/32/2024" at position 0 + +Causes: +1. Date in unexpected format (MM/DD/YYYY vs DD/MM/YYYY) +2. Invalid date values (day > 31, month > 12) +3. Timezone issues +4. Locale-specific interpretation + +Diagnosis: +```python +# Check samples +print(df['date'].head(10)) +# Output: ['01/32/2024', '02/30/2024', ...] +# Issue: Impossible dates (day > number of days in month) + +# Check source format +SELECT DISTINCT DATE_FIELD FROM SOURCE LIMIT 20; +``` + +Solutions: +1. Specify format: + ```python + df['date'] = pd.to_datetime( + df['date'], + format='%m/%d/%Y', # Explicit format + errors='coerce' # Invalid → NULL + ) + ``` + +2. Try multiple formats: + ```python + for fmt in ['%m/%d/%Y', '%d/%m/%Y', '%Y-%m-%d']: + try: + df['date'] = pd.to_datetime(df['date'], format=fmt) + break + except: + continue + ``` + +3. Clean invalid dates: + ```python + df['date'] = pd.to_datetime( + df['date'], + errors='coerce' # Invalid → NaT (NULL) + ) + df['date'] = df['date'].fillna(pd.Timestamp('1900-01-01')) + ``` + +Prevention: +- Document source date format +- Request ISO format (YYYY-MM-DD) from source +- Validate samples before full extraction +- Test with known good dates +``` + +### Python Operator Errors + +#### Error: ImportError - module not found +``` +Code: IMPORT_001 +Severity: CRITICAL +Message: ImportError: No module named 'sklearn' + +Causes: +1. Library not available in Datasphere environment +2. Typo in import statement +3. Library version mismatch +4. 
Non-standard library not installed + +Diagnosis: +```python +# Check available libraries +import sys +print(sys.version) + +# Test import +try: + import sklearn + print("sklearn available") +except ImportError: + print("sklearn NOT available") +``` + +Solutions: +1. Use only standard libraries: + ✓ pandas, numpy, scipy + ✓ statistics, math, datetime + ? sklearn, tensorflow (may not be available) + +2. Implement without library: + ```python + # Instead of sklearn.preprocessing.StandardScaler + def standardize(values): + mean = sum(values) / len(values) + std = (sum((x - mean)**2 for x in values) / len(values))**0.5 + return [(x - mean) / std for x in values] + ``` + +3. Contact support if critical library needed + +Prevention: +- Test imports before deploying +- List available libraries in documentation +- Avoid exotic/non-standard libraries +- Implement fallback logic +``` + +#### Error: NameError - undefined variable +``` +Code: NAME_001 +Severity: HIGH +Message: NameError: name 'df' is not defined + +Causes: +1. Variable name typo +2. Variable not created before use +3. Scope issue (variable local, not available) +4. Import statement failed silently + +Diagnosis: +```python +# Check execution line by line +def transform(input_df): + print(f"Input type: {type(input_df)}") + print(f"Input shape: {input_df.shape}") + # If input_df not available, input parameter name wrong + + dff = input_df.assign(col=1) # Typo: dff vs df + print(df) # ERROR: df doesn't exist +``` + +Solutions: +1. Check parameter names match: + ```python + def transform(input_df): # Parameter name + # Use input_df, not df + return input_df.assign(new_col=1) + ``` + +2. Declare variables before use: + ```python + # Bad + result = data + offset # offset not defined! + + # Good + offset = 100 + result = data + offset + ``` + +3. 
Add debug output: + ```python + print(f"Available variables: {dir()}") + print(f"Variable values: df={df}, offset={offset}") + ``` + +Prevention: +- Use IDE with syntax checking +- Test on sample data first +- Add debug print statements +- Use consistent naming conventions +``` + +#### Error: SyntaxError in operator code +``` +Code: SYNTAX_001 +Severity: CRITICAL +Message: SyntaxError: invalid syntax (line 5) + +Causes: +1. Missing colon or parenthesis +2. Indentation error +3. Invalid operator +4. Reserved keyword as variable name + +Diagnosis: +```python +# Line 5 with error: +def transform(input_df) # Missing colon! + return input_df + +# Fix: +def transform(input_df): + return input_df +``` + +Solutions: +1. Check syntax: + ```bash + python -m py_compile operator.py + ``` + +2. Use IDE with syntax highlighting + +3. Common issues: + - Missing colon after function/if/for + - Indentation not consistent (tabs vs spaces) + - Extra parenthesis + - Invalid operator combination + +Prevention: +- Use Python IDE with linting +- Test code locally before deploying +- Enable syntax checking in editor +``` + +## Transformation Flow Errors + +### SQL Syntax Errors + +#### Error: Column not found in table +``` +Code: SQL_001 +Severity: HIGH +Message: SQL Error [207]: Column 'REVENUE_AMOUNT' not found + +Causes: +1. Column name typo +2. Column removed in schema update +3. Wrong table referenced +4. Case sensitivity issue (AMOUNT vs Amount) + +Diagnosis: +- Check table schema: get_table_schema('C_CUSTOMER') +- Verify column in source: SELECT REVENUE_AMOUNT FROM C_CUSTOMER LIMIT 1 +- Check flow mappings + +Solutions: +1. Fix column name: + # Bad + SELECT REVENUE_AMOUNT FROM C_CUSTOMER + + # Good (correct name) + SELECT REVENUE FROM C_CUSTOMER + +2. Qualify table name: + SELECT sc.REVENUE FROM SOURCE_CUSTOMER sc + +3. 
Use schema-qualified name: + SELECT REVENUE FROM SAP_SOURCE.C_CUSTOMER + +Prevention: +- Refresh schema before writing SQL +- Test queries incrementally +- Use aliases for clarity +``` + +#### Error: Table not found +``` +Code: SQL_002 +Severity: HIGH +Message: SQL Error [261]: Table 'C_CUSTOMER' not found in schema + +Causes: +1. Table doesn't exist +2. Schema not specified +3. Table name typo +4. Table in different schema/system + +Diagnosis: +- Check table exists: SELECT * FROM C_CUSTOMER LIMIT 1 +- Verify schema: SELECT * FROM SAP_SOURCE.C_CUSTOMER LIMIT 1 +- Check source system connection + +Solutions: +1. Specify schema: + SELECT * FROM SAP_SOURCE.C_CUSTOMER + +2. Create the table first if missing: + CREATE TABLE C_CUSTOMER AS (SELECT * FROM source_system.C_CUSTOMER) + +3. Verify table name +``` + +#### Error: Data type conversion impossible +``` +Code: SQL_003 +Severity: HIGH +Message: SQL Error [389]: CAST from [VARCHAR] to [INTEGER] not possible + +Causes: +1. Column contains non-numeric characters +2. NULL values with no default +3. Precision loss (DECIMAL → INTEGER) + +Diagnosis: +- Check data: SELECT AMOUNT, TYPEOF(AMOUNT) FROM source LIMIT 100 +- Verify target type requirements + +Solutions: +1. Use CAST with error handling: + CAST(AMOUNT AS DECIMAL(19,2)) AS converted_amount + Or a safe-cast function (e.g. TRY_CAST), if the SQL dialect provides one + +2. Trim/clean before conversion: + CAST(TRIM(REPLACE(AMOUNT, ',', '.')) AS DECIMAL) + +3. Use COALESCE for NULLs: + COALESCE(CAST(AMOUNT AS DECIMAL), 0) AS amount +``` + +### Delta and Watermark Errors + +#### Error: Watermark value not in source data +``` +Code: DELTA_002 +Severity: HIGH +Message: Watermark timestamp '2024-01-15 23:59:59' exceeds maximum in source + +Causes: +1. Watermark set to future date +2. Source data older than watermark +3. No new data since last extraction +4. Timezone mismatch + +Diagnosis: +- Check current max: SELECT MAX(CHANGED_AT) FROM C_CUSTOMER +- Check stored watermark: SELECT * FROM WATERMARK_CONTROL +- Compare timestamps + +Solutions: +1.
Reset to valid value: + UPDATE WATERMARK_CONTROL + SET LAST_WATERMARK = (SELECT MAX(CHANGED_AT) FROM C_CUSTOMER) + +2. Or set to past value: + UPDATE WATERMARK_CONTROL + SET LAST_WATERMARK = CURRENT_TIMESTAMP() - INTERVAL '1' HOUR + +Prevention: +- Don't manually set watermark to future +- Validate watermark before loading +- Add check: watermark <= current_max_value +``` + +#### Error: Overlapping delta loads - duplicate records +``` +Code: DELTA_003 +Severity: MEDIUM +Message: 1000 duplicate records loaded; watermark not advancing correctly + +Causes: +1. Watermark not advancing between runs +2. Same change records selected multiple times +3. Overlap in extraction range + +Diagnosis: +- Compare consecutive runs: + Run 1: WHERE changed_at > 2024-01-15 12:00:00 + Run 2: WHERE changed_at > 2024-01-15 12:00:00 (same!) +- Check stored watermark + +Solutions: +1. Advance watermark after successful load: + ```sql + UPDATE WATERMARK_CONTROL + SET LAST_WATERMARK = CURRENT_TIMESTAMP() + WHERE TABLE_NAME = 'CUSTOMER' + ``` + +2. Deduplicate in merge: + ```sql + MERGE INTO TARGET_CUSTOMER tc + USING ( + SELECT * FROM SOURCE_DELTA + QUALIFY ROW_NUMBER() OVER ( + PARTITION BY CUSTOMER_ID + ORDER BY CHANGENUMBER DESC + ) = 1 + ) delta + ON tc.CUSTOMER_ID = delta.CUSTOMER_ID + ... + ``` + +Prevention: +- Verify watermark advancement in logs +- Check stored value after each load +- Add validations +``` + +### Lock and Constraint Errors + +#### Error: Table locked - cannot write +``` +Code: LOCK_001 +Severity: MEDIUM +Message: SQL Error [389]: Cannot acquire lock on table TARGET_CUSTOMER + +Causes: +1. Other process writing to same table +2. Long-running transaction holding lock +3. Deadlock between processes +4. Manual lock not released + +Diagnosis: +- Check locks: CALL DBMS_LOCKS.CHECK_LOCKS() +- Check processes: SELECT * FROM M_TRANSACTIONS +- Check blocking: SELECT * FROM M_BLOCKED_TRANSACTIONS + +Solutions: +1. Wait and retry: + Increase timeout, flow will retry + +2. 
Kill blocking transaction: + SELECT CONNECTION_ID FROM M_CONNECTIONS WHERE SESSION_ID = 'X' + ALTER SYSTEM KILL SESSION 'X' + +3. Run serially instead of parallel + +Prevention: +- Use row-level locking (lock specific records) +- Keep transactions short +- Release locks promptly +``` + +#### Error: Foreign key constraint violated +``` +Code: CONST_001 +Severity: HIGH +Message: SQL Error [301]: Foreign key constraint violation in table ORDERS + +Causes: +1. Order references non-existent customer +2. Parent record deleted while child exists +3. Wrong key value + +Diagnosis: +- Check orphan records: + SELECT * FROM ORDERS o + WHERE NOT EXISTS (SELECT 1 FROM CUSTOMER c WHERE c.ID = o.CUSTOMER_ID) + +Solutions: +1. Load only valid records: + ```sql + MERGE INTO ORDERS + USING ( + SELECT * FROM ORDERS_STAGING os + WHERE EXISTS (SELECT 1 FROM CUSTOMER c WHERE c.ID = os.CUSTOMER_ID) + ) delta + ... + ``` + +2. Create missing parents first + +Prevention: +- Validate before merge +- Load parents before children +- Use referential integrity constraints +``` + +## Performance Degradation Issues + +### Diagnosis Workflow + +``` +Flow taking longer than expected +├─ Step 1: Compare to historical runs +│ ├─ Previous avg: 10 minutes +│ ├─ Current run: 25 minutes +│ └─ Degradation: 2.5x slower +│ +├─ Step 2: Identify bottleneck +│ ├─ Source extraction: Slow +│ ├─ Transformation: Normal +│ └─ Target load: Normal +│ → Problem: Source slowdown +│ +├─ Step 3: Investigate root cause +│ ├─ Source system load +│ ├─ Network latency +│ ├─ Data volume increase +│ └─ Cloud Connector resource +│ +└─ Step 4: Apply solution + ├─ Reduce batch size + ├─ Add filters + ├─ Increase parallelism + └─ Schedule differently +``` + +### Common Performance Issues + +| Issue | Symptom | Solution | +|-------|---------|----------| +| Large data volume | Extraction takes 2h (was 30min) | Add WHERE filter, split into smaller batch | +| Network congestion | Latency 500ms (was 50ms) | Schedule off-peak, use 
compression | +| Source system busy | CPU/Memory high in SM50 | Schedule when less busy (23:00) | +| Cloud Connector overloaded | Low throughput (10MB/s vs 100MB/s) | Add standby CC, increase resources | +| Inefficient SQL | Slow aggregation | Add indexes, rewrite query, increase DB resources | +| Python operator inefficient | CPU 95% for simple operation | Optimize algorithm, use vectorization | +| Memory pressure | Swapping to disk | Reduce batch size, add filter | +| Unindexed tables | Table scan vs index seek | Add index on key columns, vacuum analyze | + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/replication-flow-error-patterns.md b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/replication-flow-error-patterns.md new file mode 100644 index 0000000..45cab9e --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-flow-doctor/references/replication-flow-error-patterns.md @@ -0,0 +1,167 @@ +# Replication Flow Error Patterns and Resolutions + +## Architecture Context +Brief explanation: Replication Flows use RMS (Replication Management Service) collecting data from RDB (Resilient Data Buffer) tables via Cloud Connector. For "Initial and Delta" loads, the CDC Engine pushes changes through Master Logging Tables → Subscriber Logging Tables → RDB Buffer → RMS → Datasphere local table. Two load types: Initial Only (source table → RDB) and Initial and Delta (source table → CDC → RDB). 
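This split — the CDC engine filling the RDB buffer on one side, RMS draining it via the Cloud Connector on the other — is also the first branch point when a stalled flow must be diagnosed. A rough triage sketch in Python; the `triage_buffer` helper and its thresholds are illustrative assumptions (not any SAP API), encoding the buffer-empty vs. buffer-full logic described under the DATA_NOT_READY patterns below:

```python
def triage_buffer(ready_packages: int, unassigned_records: int) -> str:
    """First-pass triage of a stalled replication, based on what
    DHRDBMON shows for the RDB buffer. Thresholds are illustrative."""
    if ready_packages == 0 and unassigned_records == 0:
        # Nothing reaches the buffer: suspect the CDC side (Observer/Transfer jobs)
        return "buffer empty: check Observer/Transfer jobs in DHCDCMON"
    if ready_packages > 100 or unassigned_records > 1_000_000:
        # Buffer fills but is not drained: suspect RMS / Cloud Connector side
        return "buffer full: check Cloud Connector and the Datasphere connection"
    return "buffer flowing: look elsewhere (locks, dumps, job status)"

print(triage_buffer(0, 0))
print(triage_buffer(500, 2_000_000))
```

The point of the sketch is only the branch: an empty buffer points upstream (ABAP/CDC), a full one points downstream (RMS/Cloud Connector).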
+ +## Error Pattern 1: "Error occurred during execution of API activity; see application log" — Authorization +### Symptoms +- Error in Data Integration Monitor Messages tab +- SLG1 shows: "Not authorized to use operator 'internal.inport' (BADI_DHAPE_OPER_INPORT)" or "Not authorized to use operator 'com.sap.abap.operator_reader' (BADI_DHAPE_OPER_OPER_READER)" + +### Root Cause +Missing security authorizations on the source ABAP system for the communication user + +### Resolution +Apply SAP Note 3100673 — SAP Data Intelligence / SAP Datasphere - ABAP Integration - Security Settings. Assign required authorizations to the RFC communication user. + +### Diagnostic Steps +1. Check SLG1 with Object DHAPE for detailed authorization error +2. Run SU53 for the communication user to see failed auth checks +3. Use STAUTHTRACE to trace the exact missing authorization objects + +## Error Pattern 2: "Subscription interface error / Error while processing subscription of CDS view" +### Symptoms +- Deployment or initial run fails +- Run Log states CDS view definition is "complex or inconsistent" + +### Root Cause +CDS view definition has issues — missing annotations, complex joins, unsupported features, or inconsistent metadata + +### Resolution +1. Run SDDLAR on source system to validate CDS view (Check DDL Source, Data Preview) +2. Verify required annotations: @Analytics.dataExtraction.enabled: true +3. For delta: verify @Analytics.dataExtraction.delta.changeDataCapture annotation +4. Check SAP Note 2890171 for CDS view requirements in Replication Flow scenarios + +### Diagnostic Steps +1. Open SLG1 with Object DHCDC for subscription errors +2. Test extraction independently with report RODPS_REPL_TEST +3. Validate CDS view in SDDLAR + +## Error Pattern 3: "No source partition is available. Replication Flow will restart." 
+### Symptoms +- Data Integration Monitor shows State = "Partitioning Initial Load", Status = Active (Retrying) +- Flow keeps restarting without making progress + +### Root Cause +Communication user lacks authorization to access the CDS view for subscriber RMS + +### Resolution +1. Check SLG1 Object DHCDC for the specific authorization failure +2. Grant the required authorizations to the RFC user for the CDS view +3. Ensure S_SDSAUTH with ACTVT=16 (Execute) is assigned + +### Diagnostic Steps +1. Check SLG1 for DHCDC entries at the time of the partitioning attempt +2. Run SU53 for the communication user immediately after the failure +3. Verify the user has S_SDSAUTH and SDDLVIEW authorization objects + +## Error Pattern 4: "Failed because maximum number of deletions retries reached" +### Symptoms +- Multiple CDS views in a single Replication Flow +- Several views fail during replication +- Run Log states "CDS view definition is complex or inconsistent" + +### Root Cause +When a Replication Flow contains many CDS views, failures in individual views can cascade. The deletion retry mechanism exhausts after repeated attempts. + +### Resolution +1. Identify which specific CDS views are failing (check per-object status in Data Integration Monitor) +2. Validate each failing CDS view individually with SDDLAR and RODPS_REPL_TEST +3. Consider splitting large Replication Flows into smaller ones with fewer objects +4. Fix the "complex or inconsistent" CDS view definitions on the source system + +## Error Pattern 5: "Cannot determine tables for CDS view" +### Symptoms +- Error during API activity execution +- SLG1 shows message DHCDC_CORE028 "Cannot determine tables for CDS view " + +### Root Cause +System cannot determine the underlying database tables for data extraction from the CDS view + +### Resolution +See SAP KBA 3397020 — Check the @Analytics.dataExtraction annotation implementation and correct any errors or warnings. Restart extraction after fixing. 
+ +### Diagnostic Steps +1. Open SLG1, find entry with Object DHCDC, expand error details +2. Check long text for Message No. DHCDC_CORE028 +3. In SDDLAR, verify CDS view can resolve to underlying tables +4. Check if annotation @Analytics.dataExtraction is correctly implemented + +## Error Pattern 6: "Partitioning for object failed" +### Symptoms +- Random failures during replication, especially for larger datasets +- May work sometimes and fail other times + +### Root Cause +Missing corrections in the source system + +### Resolution +Apply SAP KBA 3465112 — Replication Flow of CDS views randomly fails with "Partitioning for Object failed". Implement the referenced SAP Notes on the source system. + +## Error Pattern 7: DATA_NOT_READY — Buffer Table Empty or Low +### Symptoms +- Replication stalls, no data flowing +- DHRDBMON shows buffer table empty or very low record count +- No READY packages in Package Overview + +### Root Cause (CDS views via CDC engine) +CDC Observer/Transfer jobs not running or not producing data + +### Resolution +1. Check DHCDCMON → Job Settings → Verify Observer and Transfer jobs are green +2. If not green, click Dispatcher Job to reschedule +3. Check DHCDCMON → Application Log for errors +4. For performance issues with empty buffer: Increase transfer jobs per SAP Note 3669170 and SAP Note 3223735 (Transaction DHCDCSTG / Table DHCDC_JOBSTG parameters TRANSFER_MAX_JOBS and TRANSFER_MIN_JOBS) + +## Error Pattern 8: DATA_NOT_READY — Buffer Table Full +### Symptoms +- DHRDBMON shows buffer table full or filling up +- Records not Assigned to Package is high (highlighted) +- READY packages accumulating but not being consumed + +### Root Cause +RMS cannot collect data from the buffer — typically a Cloud Connector issue, network problem, or Datasphere-side bottleneck + +### Resolution +1. Check Cloud Connector status and connectivity +2. Verify Datasphere connection is valid (Connection Management → Validate) +3. 
Check DHRDBMON Expert Functions to manually manage packages if needed +4. If packages are stuck, use "Change Status to Ready" or "Remove Records from Package" cautiously + +## Error Pattern 9: ABAP Requests Completely Stuck +### Symptoms +- No progress in replication +- DHRDBMON shows no movement in buffer tables +- DHCDCMON logging tables may be growing + +### Root Cause +Backend ABAP processing is blocked — jobs stuck, locks, or resource exhaustion + +### Resolution +1. Check SM37 for Observer/Transfer job status +2. Check SM12 for locks on relevant tables +3. Check ST22 for ABAP short dumps related to /1DH/ programs +4. Verify no system-wide issues (SM21 system log, ST06 OS monitor) + +## Component Ownership for SAP Support Cases +When opening an SAP support case, use these component assignments: +- RODPS_REPL_TEST returns error → Component: BW-WHM-DBA-ODA +- DHCDCMON Application Log shows "ACP daemon not start" → Component: BC-DB-CDC +- CDS extraction stuck → Check both BC-DB-CDC and DS-INT-RF (Datasphere Integration - Replication Flows) +- Cloud Connector issues → Component: BC-MID-SCC +- Datasphere-side pipeline issues → Component: DS-INT-RF + +## Key SAP Notes Quick Reference +| SAP Note | Description | +|----------|-------------| +| 2890171 | ABAP Integration - CDS view requirements for Replication Flows | +| 3100673 | ABAP Integration - Security Settings | +| 3397020 | "Cannot determine tables for CDS view" resolution | +| 3465112 | "Partitioning for Object failed" random failures | +| 3669170 | How to improve replication performance - ABAP CDC Engine | +| 3223735 | SAP Data Intelligence - Transfer job tuning (DHCDC_JOBSTG) | +| 3369433 | Cloud Connector troubleshooting for Datasphere connections | +| 3365864 | Where does information in DHCDCMON come from? 
| +| 2930269 | ABAP CDS CDC common issues and troubleshooting | +| 3476918 | How to access HANA Cloud DB traces | diff --git a/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/SKILL.md new file mode 100644 index 0000000..0532004 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/SKILL.md @@ -0,0 +1,782 @@ +--- +name: Intelligent Lookup Wizard +description: "Master fuzzy matching and intelligent lookup configuration for data harmonization and record matching. Use this when you need to match company names across sources, reconcile customer addresses, match product descriptions, deduplicate data, or harmonize master data from multiple systems. Essential for data quality improvement, MDM implementation, and data integration workflows." +--- + +# Intelligent Lookup Wizard Skill + +## Overview + +The Intelligent Lookup Wizard guides you through creating intelligent lookups in SAP Datasphere. Intelligent lookups use fuzzy matching algorithms to find similar records across datasets, enabling data harmonization, deduplication, and master data management without requiring exact matches. + +## What Are Intelligent Lookups? 
+ +### Definition +An intelligent lookup is a data matching tool that: +- Compares text values using fuzzy matching algorithms +- Finds similar records across two datasets (input and lookup entities) +- Assigns match scores indicating confidence level +- Enables data enrichment, harmonization, and deduplication +- Includes review workflow for manual confirmation of matches + +### When to Use Intelligent Lookups + +**Use intelligent lookups for:** +- **Company name matching** — "ACME Corp", "Acme Corporation", "ACME Inc" → Match +- **Address reconciliation** — Handling abbreviations, spelling variations, format differences +- **Product description matching** — "Widget Pro" vs "Professional Widget" → Same product +- **Customer deduplication** — Finding duplicate customer records within dataset +- **Master data harmonization** — Matching vendor IDs across multiple procurement systems +- **Data quality improvement** — Identifying and merging incomplete/duplicate records +- **Cross-system reconciliation** — Matching accounts between ERP and CRM systems + +**DO NOT use intelligent lookups for:** +- Exact match lookups (use standard SQL join) +- Numeric lookups (use hash or checksum matching) +- High-speed real-time lookups (use reference tables/caches) +- Data that must match perfectly (use data quality tools instead) + +### When Intelligent Lookups Add Value + +``` +Scenario 1: Company Name Matching +Input data has: "ACME Corporations Inc" +Lookup table has: "Acme Corp" +Standard lookup: NO MATCH ✗ +Intelligent lookup: MATCH ✓ (score: 0.92) + +Scenario 2: Product Matching +Input: "Professional Grade Widget with Advanced Features" +Lookup: "Widget Pro - Advanced" +Standard lookup: NO MATCH ✗ +Intelligent lookup: MATCH ✓ (score: 0.85) + +Scenario 3: Address Matching +Input: "123 Main St, New York, NY 10001" +Lookup: "123 Main Street, New York, New York 10001" +Standard lookup: NO MATCH ✗ +Intelligent lookup: MATCH ✓ (score: 0.95) +``` + +## Use Cases by Industry + +### 
Finance and Procurement +- **Vendor deduplication:** Find duplicate vendor records from multiple divisions +- **Invoice matching:** Reconcile invoices with PO line items despite data entry variations +- **Bank account reconciliation:** Match transactions across GL accounts with slight format variations +- **Payment matching:** Find matching payment records in different systems + +### Sales and CRM +- **Customer deduplication:** Identify duplicate customer records from different sources +- **Account consolidation:** Merge similar account names from prospect databases +- **Contact matching:** Harmonize contact information across sales regions +- **Lead matching:** Identify duplicate leads across marketing automation systems + +### Supply Chain +- **Supplier matching:** Match supplier names across procurement systems +- **Product catalogue harmonization:** Link products with similar descriptions across brands +- **Site/location matching:** Find duplicate warehouse/store locations by address +- **SKU matching:** Match products by description when SKU numbers differ + +### Healthcare +- **Patient matching:** Identify patient records with slight name/date variations +- **Provider matching:** Link healthcare providers by practice name and address +- **Drug matching:** Match pharmaceutical products by description and strength +- **Facility matching:** Identify duplicate hospital/clinic locations + +### Retail +- **Store location matching:** Find duplicate store records by address and postal code +- **Product matching:** Link products across multiple vendor systems by description +- **Customer matching:** Identify duplicate customer loyalty accounts +- **Price matching:** Find comparable products for price comparison + +## Setting Up an Intelligent Lookup + +### Step 1: Define Input Entity + +The input entity is the dataset containing values you want to match. 
+ +**Input entity selection criteria:** +- Should contain the "new" or "dirty" data that needs matching +- Columns must include text fields for matching (company names, addresses, descriptions) +- Can be a view, table, or query result +- May have multiple rows to match (matching many-to-one or many-to-many) + +**Input entity example (Vendor Data to Harmonize):** +``` +VendorInputID | VendorName | Address | City | State | ZIP +1 | ACME Corporation Inc | 123 Main Street | New York | NY | 10001 +2 | Acme Corp | 123 Main St | NY | NY | 10001 +3 | ACME Inc | 123 Main Str | NewYork | NY | 10001 +4 | Best Widgets LLC | 456 Oak Avenue | Boston | MA | 02101 +5 | Best Widget Industries | 456 Oak Ave | Boston | MA | 02101 +``` + +**Prepare input data:** +1. Remove leading/trailing whitespace +2. Standardize case (upper, lower, or title) +3. Remove special characters if not meaningful +4. Verify data completeness in key matching columns +5. Use `search_catalog` to locate input tables + +### Step 2: Define Lookup Entity + +The lookup entity is the reference dataset containing "correct" or master values. + +**Lookup entity selection criteria:** +- Should be authoritative source of truth (master data) +- Contains the target values you want to match to +- Can be smaller than input (for efficiency) +- Should have unique key identifier for lookup results +- May contain additional attributes for enrichment + +**Lookup entity example (Vendor Master):** +``` +VendorID | VendorMasterName | Address | City | State | ZIP | VendorStatus +V-001 | Acme Corporation | 123 Main Street | New York | NY | 10001 | Active +V-002 | Best Widgets Co. 
| 456 Oak Avenue | Boston | MA | 02101 | Active +V-003 | Global Supply Inc | 789 Pine Road | Chicago | IL | 60601 | Active +V-004 | Premier Components | 321 Elm Street | Denver | CO | 80202 | Inactive +``` + +**Lookup data quality considerations:** +- Master data should be clean and standardized +- Remove duplicates in lookup entity first +- Ensure key columns have no nulls +- Verify foreign key consistency +- Use `get_table_schema` to understand lookup structure + +### Step 3: Select Matching Columns + +Choose which columns from input and lookup entities to compare. + +**Column selection strategy:** +- **Primary column:** Most important for matching (company name, main address) +- **Secondary columns:** Support matching logic (city, state, postal code) +- **Avoid noise:** Don't include columns that vary naturally (phone, email, website) + +**Single column matching (Simple):** +``` +Input Column: CompanyName +Lookup Column: VendorMasterName +Matching Strategy: Fuzzy text match +Example: +- "ACME Corp Inc" → Match to "Acme Corporation" +``` + +**Multi-column matching (Stronger):** +``` +Column 1 (Weight: 60%): +Input: CompanyName +Lookup: VendorMasterName + +Column 2 (Weight: 30%): +Input: City +Lookup: City + +Column 3 (Weight: 10%): +Input: PostalCode +Lookup: ZIP + +Scoring: Weighted combination of individual column scores +``` + +**Column preparation:** +- Standardize text case before matching +- Remove special characters (punctuation, symbols) +- Trim whitespace +- Expand abbreviations if possible (St → Street, Inc → Incorporated) +- Use `analyze_column_distribution` to understand data patterns + +### Step 4: Configure Matching Strategies + +Choose matching algorithms appropriate for your data. 
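Before committing to a strategy, it can help to score a handful of known pairs and see how normalization and algorithm choice move the numbers. The sketch below uses Python's standard-library `difflib` as a generic stand-in for a fuzzy score — it is **not** the algorithm Datasphere uses internally, and the helper names (`normalize`, `fuzzy_score`) are illustrative:

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Uppercase, trim, and collapse internal whitespace."""
    return " ".join(text.upper().split())

def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Pairs that an exact match misses but a fuzzy score catches
pairs = [
    ("ACME Corp", "Acme Corporation"),
    ("123 Main St", "123 Main Street"),
    ("Smith", "Smyth"),
]
for a, b in pairs:
    exact = normalize(a) == normalize(b)
    print(f"{a!r} vs {b!r}: exact={exact}, fuzzy={fuzzy_score(a, b):.2f}")
```

Running known-good and known-bad pairs through each candidate strategy like this makes the threshold discussion later in this skill concrete before any full run.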
+ +**Matching strategy options:** + +#### Exact Matching +**When to use:** When matches must be perfect but case-insensitive + +**Algorithm:** Character-by-character comparison after normalization + +**Configuration:** +``` +Strategy: Exact (Case-Insensitive) +Normalization: +├── Convert to uppercase +├── Trim whitespace +├── Remove punctuation (optional) +└── Handle accents (ä → a) + +Example: +Input: "acme corp." +Lookup: "ACME CORP" +Result: MATCH ✓ +``` + +**Use case:** SKU matching, postal code matching, account numbers + +#### Fuzzy Text Matching +**When to use:** When input data has typos, abbreviations, or minor variations + +**Algorithm:** Levenshtein distance, Jaro-Winkler, or Soundex-based matching + +**Configuration:** +``` +Strategy: Fuzzy (Token-based) +Algorithm: Jaro-Winkler +Threshold: 0.85 (matches scoring 0.85+ are considered matches) + +Example Matches (Score shown): +"ACME Corp" vs "Acme Corporation" = 0.89 ✓ +"Smith Company" vs "Smyth Co" = 0.82 ✗ (below threshold) +"John Smith" vs "Jon Smyth" = 0.87 ✓ +``` + +**Use case:** Company names, person names, product descriptions + +#### Phonetic Matching +**When to use:** When spelling variations sound similar + +**Algorithm:** Soundex, Metaphone, or Double Metaphone encoding + +**Configuration:** +``` +Strategy: Phonetic +Algorithm: Metaphone +Process: +├── Convert both strings to phonetic codes +├── Compare phonetic codes +└── Score based on code similarity + +Example Matches: +"Smith" Metaphone: SM0 +"Smyth" Metaphone: SM0 +Result: MATCH ✓ (sounds identical) + +"Catherine" Metaphone: K0RN +"Katherine" Metaphone: K0RN +Result: MATCH ✓ +``` + +**Use case:** Person names, location names with spelling variations + +#### Token-Based Matching +**When to use:** When columns contain multiple words/tokens that can be in different order + +**Algorithm:** Break text into tokens; compare token sets + +**Configuration:** +``` +Strategy: Token-Based +Process: +├── Split text into words: "123 Main Street" → 
[123, Main, Street]
+├── Match tokens between input and lookup
+├── Score based on percentage of matching tokens
+└── Consider token order (optional)
+
+Example Matches:
+"New York City" vs "City of New York" = 0.87 ✓
+"Smith John" vs "John Smith" = 1.0 ✓ (all tokens match)
+"Acme Inc" vs "Acme International" = 0.50 (1 of 2 tokens match)
+```
+
+**Use case:** Address matching, product names, company names with variable word order
+
+#### Composite/Hybrid Matching
+**When to use:** When combining multiple matching strategies for best results
+
+**Configuration:**
+```
+Rule 1: Try exact match first (fastest)
+├── If found: Return match immediately
+└── If not found: Continue to Rule 2
+
+Rule 2: Try phonetic match (for name variations)
+├── If found with score > 0.90: Return match
+└── If not found: Continue to Rule 3
+
+Rule 3: Try fuzzy text match (for typos/abbreviations)
+├── If found with score > 0.85: Return match
+└── If not found: No match
+
+Example:
+"John Smyth" → Try exact (no match)
+             → Try phonetic (match with Smith)
+             → Return match
+```
+
+**Use case:** High-precision matching scenarios
+
+## Matching Strategies Deep Dive
+
+### Levenshtein Distance (Edit Distance)
+
+**What it measures:** Minimum number of single-character edits needed to transform one string to another.
+
+**Edits allowed:** Insert, delete, replace character
+
+**Example calculation:**
+```
+String 1: "ACME"
+String 2: "ACE"
+Edits: Delete 'M' = 1 edit
+Distance: 1
+
+String 1: "kitten"
+String 2: "sitting"
+Edits:
+1. k → s: "sitten"
+2. e → i: "sittin"
+3. Insert g: "sitting"
+Distance: 3
+```
+
+**Normalized score (0-1):**
+```
+Similarity = 1 - (Distance / Max_Length)
+
+Example: "Smith" vs "Smyth"
+Distance: 1 (replace i with y)
+Max Length: 5
+Similarity: 1 - (1/5) = 0.80
+
+Threshold typically: 0.80+ for match
+```
+
+### Jaro-Winkler Distance
+
+**What it measures:** Similarity based on matching characters and their order, with bonus for matching prefix.
+ +**Characteristics:** +- Ranges from 0 (no match) to 1.0 (perfect match) +- Rewards matching characters in same position +- Gives bonus for matching prefix (first 4 characters) +- Better for short strings than Levenshtein + +**Example:** +``` +String 1: "ACME CORP" +String 2: "ACME CORPORATION" + +Jaro-Winkler: 0.9167 +- Matching characters: A, C, M, E (prefix bonus) +- Order preserved for beginning of string +- Good score despite length difference +``` + +**Typical thresholds:** +- Strict: 0.90+ +- Standard: 0.85-0.89 +- Permissive: 0.80-0.84 + +### Soundex + +**What it measures:** Phonetic encoding of how text sounds when spoken. + +**Algorithm:** +1. Encode first letter +2. Encode remaining letters numerically: + - 1: B, F, P, V + - 2: C, G, J, K, Q, S, X, Z + - 3: D, T + - 4: L + - 5: M, N + - 6: R + - (vowels and H, W, Y ignored) +3. Remove consecutive duplicates +4. Pad/truncate to 4 characters + +**Examples:** +``` +"Smith" → S530 +"Smyth" → S530 +Result: MATCH ✓ + +"John" → J500 +"Jean" → J500 +Result: MATCH ✓ + +"Robert" → R163 +"Rupert" → R163 +Result: MATCH ✓ +``` + +**Limitations:** +- Only returns 4-character code +- Loss of information +- Works best for surname matching +- English names primarily + +## Threshold Tuning + +Match score thresholds determine which results are considered matches. 
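The mechanics of thresholding are simple — only candidates scoring at or above the cut-off count as matches — so the real work is choosing the cut-off. A minimal sketch in Python, with stdlib `difflib`'s ratio standing in for the match score (the company names and the `best_match` helper are illustrative, not a Datasphere API):

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; stand-in for a match score."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

lookup = ["ACME CORPORATION", "BEST WIDGETS CO", "GLOBAL SUPPLY INC"]

def best_match(value: str, candidates: list[str], threshold: float):
    """Return (candidate, score) for the best candidate at/above threshold, else None."""
    scored = [(c, score(value, c)) for c in candidates]
    cand, s = max(scored, key=lambda pair: pair[1])
    return (cand, s) if s >= threshold else None

# Tightening the threshold trades recall for precision:
# the same input matches at 0.85 but is rejected at 0.95.
for threshold in (0.95, 0.85):
    print(threshold, best_match("Acme Corporation Inc", lookup, threshold))
```

Re-running the same sample inputs at several thresholds, as in the loop above, is a quick way to find the balance point before committing to a full matching run.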
+ +### Threshold Selection by Data Type + +**Numeric/Postal Code (Exact or near-exact):** +``` +Threshold: 0.98+ +Reason: Small data, should be nearly identical +Example: "02101" vs "2101" (missing leading zero) +``` + +**Company Names (Moderate tolerance):** +``` +Threshold: 0.85-0.90 +Reason: May have legal entity type variations +Example: "Acme Corp" vs "Acme Corporation" +Typical Variation: Legal entity names, abbreviations +``` + +**Person Names (Higher tolerance):** +``` +Threshold: 0.80-0.85 +Reason: Common spelling variations, nickname matching +Example: "Johnson" vs "Jonson", "William" vs "Bill" +Typical Variation: Phonetic similarities, nickname/formal name +``` + +**Product Descriptions (Lower tolerance):** +``` +Threshold: 0.75-0.85 +Reason: May have significant description differences +Example: "Professional Widget" vs "Widget Pro" +Typical Variation: Word order, abbreviations, marketing language +``` + +**Address Matching (Moderate):** +``` +Threshold: 0.80-0.90 +Reason: Abbreviations (St/Street, Ave/Avenue) and format variations +Example: "123 Main St" vs "123 Main Street" +Typical Variation: Abbreviations, format, direction (N/S/E/W) +``` + +### Threshold Impact Matrix + +| Threshold | Precision | Recall | Use Case | +|-----------|-----------|--------|----------| +| **0.95+** | Very High | Low | Only accept near-perfect matches (numeric, ID matching) | +| **0.90-0.94** | High | Moderate | Critical matching (master data, financial records) | +| **0.85-0.89** | Good | Good | Standard matching (company names, basic reconciliation) | +| **0.80-0.84** | Moderate | High | Permissive matching (descriptions, free-text fields) | +| **<0.80** | Low | Very High | Review all matches manually (high false positive rate) | + +**Threshold adjustment strategy:** +``` +Start with 0.85 (standard) +├── If too many false positives (incorrect matches): +│ └── Increase to 0.90 (tighter matching) +├── If too many false negatives (missed matches): +│ └── Decrease to 0.80 
(looser matching) +└── Test with sample data before full run +``` + +## Review Workflow + +### Match Review Interface + +After intelligent lookup runs, review and approve matches. + +**Review screen shows:** +``` +Input Record: | Lookup Match: | Score: | Status: +ACME Corp | Acme Corporation | 0.92 | [Approve/Reject/Skip] +Address: 123 Main St | Address: 123 Main St | Details: | + | | • Same street address + | | • City matches + | | • 8% name variation +``` + +**Match states:** +- **Approved:** Match confirmed, can be used for enrichment/deduplication +- **Rejected:** Not a valid match despite score +- **Skipped:** No decision made; requires manual review later +- **No Match:** Input record has no matching lookup record + +### Batch Review Process + +**For large result sets:** + +``` +Step 1: Filter by score range +├── Display high-confidence matches (0.95+) — Usually OK to auto-approve +├── Display moderate matches (0.85-0.94) — Manual review recommended +└── Display low matches (0.80-0.84) — Always review + +Step 2: Review high-confidence matches +├── Spot-check 10% sample +├── Approve if valid +└── Batch-approve if consistent + +Step 3: Manual review of moderate matches +├── Show side-by-side comparison +├── Display match reason (which fields matched) +├── Approve or reject individually + +Step 4: Handle low-confidence matches +├── Review all or use alternative data source +└── Decide: Accept, reject, or request additional matching attempt +``` + +### Handling Ambiguous Matches + +When a record has multiple potential matches: + +``` +Input: "Smith Inc" + +Possible Matches: +1. "Smith Corporation" — Score: 0.88 +2. "Smith Industries" — Score: 0.87 +3. 
"Smiths Inc" — Score: 0.91 + +Decision options: +├── Select highest score (Smiths Inc) — Automatic +├── Select all above threshold — All three +├── Manual review to disambiguate — Human decision +└── Reject all ambiguous matches — Conservative approach +``` + +**Disambiguation strategies:** +- Use secondary columns (address, industry, size) to differentiate +- Show additional context (historical data, relationships) +- Allow reviewer to select from list +- Escalate unclear matches for business owner review + +## Best Practices for Match Quality + +### Data Preparation (Before Matching) + +**Standardization:** +``` +Before Matching: +├── Remove extra whitespace (leading, trailing, internal) +├── Normalize case (UPPERCASE or Titlecase) +├── Expand common abbreviations (St→Street, Co→Company, Inc→Incorporated) +├── Remove special characters if not meaningful (@, #, $, etc.) +├── Convert accented characters (é→e, ñ→n) if language-neutral comparison needed +└── Fix obvious typos if known + +Example Transformation: +Input: "ACME corp., Inc." +Step 1 (Trim): "ACME corp., Inc." +Step 2 (Case): "ACME CORP., INC." +Step 3 (Expand): "ACME CORPORATION, INCORPORATED" +Step 4 (Clean): "ACME CORPORATION" +Result for matching: Much cleaner, higher match rate +``` + +**Use `analyze_column_distribution` to understand data before matching:** +``` +Questions to answer: +├── What percentage of values are null/empty? +├── What's the length distribution (shortest, longest, average)? +├── Are there common prefixes or suffixes? +├── What characters are present (numbers, special chars)? +├── Are there obvious spelling variations? +└── What's the vocabulary size (unique values)? + +Example output: +Column: Company Name +├── Null %: 2.3% +├── Length: Min=3, Max=150, Avg=35 +├── Most common prefix: None +├── Special characters: ., &, -, ', () +├── Unique values: 12,456 (high variety) +├── Top variations: +│ ├── "Inc" / "Inc." / "Incorporated" / "Inc," +│ ├── "Corp" / "Corp." 
/ "Corporation" +│ ├── "LLC" / "L.L.C." / "Ltd." +``` + +### Column Selection Strategy + +**Primary vs Secondary columns:** +``` +Primary Matching (weight: 70-80%): +├── Company Name (highest signal) +├── Product Code (if available) +└── Full Address (if detailed) + +Secondary Matching (weight: 20-30%): +├── City +├── State/Province +├── Postal Code +├── Industry +└── Contact person +``` + +**Avoid unreliable columns:** +``` +Don't use for matching: +├── Phone numbers (change frequently) +├── Email addresses (change frequently, privacy sensitive) +├── Website URLs (change, domain squatting) +├── Internal reference numbers (varies by system) +└── Timestamps (reflect data entry time, not content) + +These vary too much; add noise rather than signal +``` + +### Iterative Refinement + +**Matching process iteration:** +``` +Iteration 1: Initial matching with 0.85 threshold +├── Results: 8,000 matches, 2,000 unmatched +├── Review high-confidence (>0.90): 95% accuracy +├── Review moderate (0.85-0.90): 75% accuracy +└── Finding: Moderate threshold needs tuning + +Iteration 2: Adjust threshold to 0.88 +├── Results: 7,200 matches (tighter), 2,800 unmatched +├── Review: 90% accuracy across all matches +├── Finding: Better precision, acceptable recall + +Iteration 3: Manual review of unmatched +├── 500 unmatched actually have matches in reference data +├── Add domain-specific matching (e.g., subsidiary names) +└── Final: 99% of matching records identified + +Lessons learned: Industry-specific variations required custom rules +``` + +### Handling Special Cases + +**Company name variations:** +``` +Input: "ABC Manufacturing" +Matches to find: +├── "ABC Manufacturing" (exact) +├── "ABC Manufacturing Corp" (legal name) +├── "ABC Mfg" (abbreviation) +├── "American Business Corp (ABC)" (full name with acronym) +├── "ABC Manufacturing Inc, subsidiary of XYZ" (nested relationship) + +Strategy: +├── Use fuzzy matching with 0.85 threshold (catches first 3) +├── Add secondary matching on 
known abbreviation mappings +├── Include relationship data in review +``` + +**International names:** +``` +Company: "Société Générale" +Challenges: +├── Accented characters (é) +├── Legal form in French (Société) +├── May be registered as "Societe Generale" without accent + +Strategy: +├── Normalize accents before matching +├── Include common language translations +├── Use phonetic matching as fallback +``` + +**Address variations:** +``` +Input: "123 Oak Ave, Apt 4B, Springfield, IL 62701" +Match candidates: +├── "123 Oak Avenue, Suite 4B, Springfield, IL 62701" (abbreviation) +├── "123 Oak Ave, Springfield, IL 62701" (without unit) +├── "123 Oak Street, Springfield, IL 62701" (wrong street type) + +Strategy: +├── Standardize street types (Ave→Avenue, St→Street) +├── Match at building/street level, not exact unit +├── Use postal code as secondary match +└── Accept lower score for address matching (0.80+) +``` + +## Troubleshooting Poor Match Rates + +**Symptom: Very few matches found** + +``` +Diagnosis: Threshold too high +Solution: +├── Lower threshold from 0.85 to 0.80 +├── Review matches at new threshold for false positives +└── Find acceptable balance point + +Diagnosis: Data quality issues +Solution: +├── Run analyze_column_distribution on both input and lookup +├── Check for data type mismatches (numeric stored as text) +├── Look for null values in matching columns +├── Verify foreign key consistency +└── Standardize data before re-matching + +Diagnosis: Column selection mismatch +Solution: +├── Verify input and lookup columns contain comparable data +├── Example: Matching ZIP code against City name won't work +├── Select columns with same semantic meaning +└── Test matching on subset first + +Example Fix: +Before: Matching CompanyFullDescription (input) vs CompanyName (lookup) +After: Matching CompanyName (input) vs CompanyName (lookup) +Result: Match rate improves dramatically +``` + +**Symptom: Too many false positives (incorrect matches)** + +``` 
+Diagnosis: Threshold too low +Solution: +├── Increase threshold from 0.80 to 0.85 or higher +├── Review quality of now-excluded matches +└── Manually review borderline cases + +Diagnosis: Algorithm not appropriate for data type +Solution: +├── "Smith" vs "Smyth" needs phonetic matching (not fuzzy) +├── Company names with acronyms need token-based (not exact) +├── Addresses need standardization first (not raw comparison) +└── Choose algorithm based on data characteristics + +Diagnosis: Lookup data has duplicates or errors +Solution: +├── Find and merge duplicate lookup records first +├── Fix obvious errors in master data +├── Re-run matching against cleaned lookup +└── Clean input should match against clean reference +``` + +**Symptom: Unexpected matches missing from results** + +``` +Diagnosis: Match fell below threshold +Solution: +├── Lower threshold to capture similar results +├── Review those borderline matches +├── Decide if manual approval is acceptable + +Diagnosis: Format mismatch between datasets +Solution: +├── Example: Input has "123 Main St", Lookup has "123 Main Street" +├── Standardize abbreviations before matching +├── Pre-process data consistently in both datasets +└── Test with sample matches first + +Diagnosis: Matching algorithm doesn't handle data type +Solution: +├── Different algorithms for names, addresses, descriptions +├── Person names: Soundex, phonetic (catch nicknames) +├── Addresses: Token-based (word order varies) +├── Descriptions: Fuzzy (typos, abbreviations) +└── Test algorithm on known match pairs +``` + +## Key Takeaways + +1. **Choose appropriate matching strategy** — Fuzzy for names/descriptions, exact for codes, phonetic for sound-alikes +2. **Prepare data thoroughly** — Standardization and cleansing improve match quality significantly +3. **Tune thresholds carefully** — Test different thresholds on sample data before full run +4. **Review results systematically** — High-confidence matches vs manual review vs rejection +5. 
**Iterate and refine** — Match quality improves with multiple passes and domain knowledge +6. **Document matching rules** — For reproducibility and future maintenance +7. **Consider false positives and negatives** — Choose threshold based on cost of each type of error diff --git a/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/references/intelligent-lookup-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/references/intelligent-lookup-guide.md new file mode 100644 index 0000000..032eba2 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-intelligent-lookup/references/intelligent-lookup-guide.md @@ -0,0 +1,909 @@ +# Intelligent Lookup Reference Guide + +## Matching Strategy Details and Algorithms + +### Fuzzy Matching (Token-Based) + +**How it works:** +- Breaks text into words/tokens +- Compares sets of tokens between input and lookup +- Assigns score based on matching token percentage +- Optionally considers token order + +**Strengths:** +- Good for variable word order: "New York City" vs "City of New York" +- Handles added/missing words: "Smith Inc" vs "Smith Inc Corporation" +- Robust to word reordering + +**Limitations:** +- Single-word fields don't benefit from tokenization +- Doesn't handle within-word typos well: "Smyth" vs "Smith" +- May miss partial matches + +**Example:** +``` +Input: "Professional Widget Manufacturing" +Tokens: [Professional, Widget, Manufacturing] + +Lookup 1: "Widget Manufacturing Professional" +Tokens: [Widget, Manufacturing, Professional] +Match: 100% (same tokens, different order) +Score: 0.99 ✓ + +Lookup 2: "Professional Widget" +Tokens: [Professional, Widget] +Match: 66% (2 of 3 tokens) +Score: 0.66 ✗ (below typical 0.85 threshold) + +Lookup 3: "Professional Widget Manufacturing Inc" +Tokens: [Professional, Widget, Manufacturing, Inc] +Match: 75% (3 of 4 tokens) +Score: 0.75 ✗ +``` + +**Configuration options:** +``` +Word Order Sensitivity: +├── Ignore order 
(order=false): More matches, less strict +│ Example: "Smith John" matches "John Smith" +└── Respect order (order=true): Fewer matches, more strict + Example: "Smith John" does NOT match "John Smith" + +Min Token Length: +├── 1: Count all tokens including single characters +├── 2: Ignore single-letter words (A, B, I) +└── 3: Ignore short words (And, The, For) + Example: "The Smith Corporation" vs "Smith Corp" + With min=3: Effectively matches on Smith, Corporation (ignores The) +``` + +--- + +### Levenshtein Distance + +**How it works:** +- Counts minimum number of single-character edits needed +- Edits: insert, delete, or replace a character +- Distance of 0 = identical, higher = more different +- Normalized to 0-1 scale: similarity = 1 - (distance / max_length) + +**Strengths:** +- Handles typos well: "Smith" vs "Smth" +- Works for single words +- Mathematically precise + +**Limitations:** +- Expensive for long strings +- Doesn't recognize phonetically similar words: "Smith" vs "Smyth" +- Word order matters: "John Smith" ≠ "Smith John" +- Sensitive to letter position changes + +**Example calculation:** +``` +String 1: "ACME" +String 2: "ACNE" + +Edit sequence: +1. ACME → ACNE (replace M with N) +Distance: 1 + +Normalized Score: +Max length: 4 +Similarity: 1 - (1/4) = 0.75 + +Result: 0.75 score (below 0.85 threshold, no match) + +--- + +String 1: "kitten" +String 2: "sitting" + +Edit sequence: +1. kitten → sitten (replace k with s) +2. sitten → sittin (replace e with i) +3. sittin → sitting (insert g) +Distance: 3 + +Normalized Score: +Max length: 7 +Similarity: 1 - (3/7) = 0.57 + +Result: No match +``` + +**When to use:** +- Detecting typos and minor spelling variations +- Matching codes with single-character errors +- Comparing short strings (names, codes) + +--- + +### Jaro-Winkler Distance + +**How it works:** +1. Calculate base Jaro score (character match and order) +2. Apply Winkler bonus if strings share common prefix +3. 
Jaro-Winkler = Jaro + (prefix_length × prefix_weight × (1 - Jaro))
+   - prefix_weight typically 0.1
+   - prefix_length max 4 characters
+
+**Formula:**
+```
+Jaro = (1/3) × [(matches_in_str1 / len_str1) +
+                (matches_in_str2 / len_str2) +
+                ((matches - transpositions) / matches)]
+
+Jaro-Winkler = Jaro + (l × p × (1 - Jaro))
+where:
+- l = common prefix length (max 4)
+- p = prefix weight (typically 0.1)
+```
+
+**Strengths:**
+- Better than Levenshtein for shorter strings
+- Rewards matching prefix: ACME vs ACNE scores higher
+- Considers character position and order
+- Industry standard for name matching
+
+**Limitations:**
+- More complex computation
+- Still doesn't handle phonetic variations
+- Length difference still impacts score
+
+**Example calculation:**
+```
+String 1: "Smith"
+String 2: "Smyth"
+
+Character Analysis:
+Position: 1=S, 2=m, 3=i/y, 4=t, 5=h
+Matches: S, m, t, h (4 matches)
+Transposition: i/y are different, not counted as match
+Common prefix: "Sm" (length 2 — the prefix bonus stops at the first differing character)
+
+Jaro = (1/3) × [(4/5) + (4/5) + ((4-0)/4)]
+     = (1/3) × [0.8 + 0.8 + 1.0]
+     = 0.867
+
+Jaro-Winkler = 0.867 + (2 × 0.1 × (1 - 0.867))
+             = 0.867 + (2 × 0.1 × 0.133)
+             = 0.867 + 0.027
+             = 0.893
+
+Result: 0.89 score (matches at 0.85 threshold)
+```
+
+**When to use:**
+- Person name matching
+- Company name matching
+- Comparing short strings where word order is fixed
+
+---
+
+### Soundex
+
+**How it works:**
+1. Keep first letter as is
+2. Replace letters with numbers:
+   - 1: B, F, P, V
+   - 2: C, G, J, K, Q, S, X, Z
+   - 3: D, T
+   - 4: L
+   - 5: M, N
+   - 6: R
+   - (Remove vowels, H, W, Y)
+3. Remove consecutive duplicates
+4.
Pad/truncate to 4 characters
+
+**Strengths:**
+- Very fast computation
+- Handles phonetic variations
+- Works well for surnames
+- English pronunciation-based
+
+**Limitations:**
+- Significant information loss (4 characters only)
+- English-centric
+- Can produce many false positives
+- Doesn't work well for names < 4 characters
+
+**Example encoding:**
+```
+"Robert"
+R-o-b-e-r-t
+
+Step 1: Keep R
+Step 2: o(vowel-remove), b(1), e(vowel-remove), r(6), t(3)
+   → R163
+
+Step 3: Remove consecutive duplicates
+   → R163 (no duplicates)
+
+Step 4: Pad to 4 characters
+   → R163 (already 4)
+
+---
+
+"Rupert"
+R-u-p-e-r-t
+
+Step 1: Keep R
+Step 2: u(vowel-remove), p(1), e(vowel-remove), r(6), t(3)
+   → R163
+
+Result: "Robert" and "Rupert" both encode to R163 = MATCH ✓
+
+---
+
+"Lloyd"
+L-l-o-y-d
+
+Step 1: Keep L
+Step 2: l(4), o(vowel-remove), y(remove), d(3)
+   → L43
+
+Step 3: Remove consecutive duplicates
+   (the second l has the same code as the first letter L, so it is dropped)
+   → L3
+
+Step 4: Pad to 4 characters
+   → L300
+
+---
+
+"Lloyd" vs "Laude"
+Lloyd = L300
+Laude = L300 (L kept; a, u, e removed; d → 3)
+
+SAME CODE = MATCH ✓ — a false positive: the names do not sound alike,
+which shows how coarse the 4-character code is
+```
+
+**Soundex encoding table:**
+```
+Digit | Letters                | Examples
+1     | B, F, P, V             | "Baker" → B260; "Barker" → B626 — no match vs Baker ✗
+2     | C, G, J, K, Q, S, X, Z | "Carter" → C636; "Karter" → K636 — first letter kept as-is, no match ✗
+3     | D, T                   | "Davis" → D120; "Tavis" → T120 — first letters differ, no match ✗
+4     | L                      | "Lewis" → L200
+5     | M, N                   | "Morris" → M620; "Norris" → N620 — first letters differ, no match ✗
+6     | R                      | "Roy" → R000
+(vowels and H, W, Y removed)
+```
+
+**When to use:**
+- Surname matching
+- Name matching in general
+- As initial filter before more expensive algorithms
+
+---
+
+### Metaphone and Double Metaphone
+
+**How it works:**
+Similar to Soundex but with more sophisticated phonetic rules:
+- More rules than Soundex
+- Handles English pronunciation better
+- Can produce primary and secondary codes (Double Metaphone)
+- Keeps more letters than Soundex
+
+**Strengths:**
+- Better than Soundex for English names
+- Handles consonant clusters
+- Fewer false
matches than Soundex
+
+**Limitations:**
+- Still English-centric
+- Information loss (typically 4-5 characters)
+- More complex rules to implement
+
+**Example encoding:**
+```
+"Smith" → SM0
+"Smyth" → SM0
+Match: ✓
+
+"Katherine" → K0RN
+"Catherine" → K0RN
+Match: ✓
+
+"Johnson" → JNSN
+"Jonson" → JNSN
+Match: ✓
+
+"Phillip" → FLP (PH encodes as F)
+"Philip" → FLP
+Match: ✓
+```
+
+**When to use:**
+- Surname/name matching (better than Soundex)
+- Phonetic variations of English names
+- More accurate than Soundex while still fast
+
+---
+
+### Composite/Custom Matching Rules
+
+**When single algorithm not sufficient:**
+
+**Example 1: Company Name Matching Rule**
+```
+Rule Set:
+Step 1: Exact match (case-insensitive, no punctuation)
+   "ACME Corporation" vs "acme corporation" → MATCH ✓
+
+Step 2: If no exact, try fuzzy (0.90 threshold)
+   "ACME Corp" vs "ACME Corporation" → Score 0.93 ✓
+
+Step 3: If no fuzzy, try token-based
+   "Acme Inc" vs "Inc Acme" → Tokens match ✓
+
+Step 4: If still no match, try without legal entity type
+   "ACME Manufacturing" vs "ACME Mfg Co" → Remove Co, match ✓
+```
+
+Implementation (pseudocode):
+```sql
+IF exact_match(input, lookup) THEN
+  RETURN MATCH (score 1.0)
+ELSE IF fuzzy_score(input, lookup) >= 0.90 THEN
+  RETURN MATCH (score fuzzy_score)
+ELSE IF token_match(input, lookup) >= 0.80 THEN
+  RETURN MATCH (score token_match)
+ELSE IF token_match(remove_legal_entity(input), lookup) >= 0.85 THEN
+  RETURN MATCH (score adjusted)
+ELSE
+  RETURN NO_MATCH
+END IF
+```
+
+**Example 2: Address Matching Rule**
+```
+Rule Set:
+Column 1 (Weight 50%): Street Address
+   Algorithm: Fuzzy (Jaro-Winkler 0.85)
+   Preprocessing: Standardize abbreviations (St→Street)
+
+Column 2 (Weight 30%): City
+   Algorithm: Exact (case-insensitive)
+   Preprocessing: None
+
+Column 3 (Weight 20%): Postal Code
+   Algorithm: Exact or prefix match
+   Preprocessing: Remove dashes (12345-6789 → 12345)
+
+Final Score: 50% × Street_Score + 30% × City_Score + 20% × Postal_Score
+
+Example:
+Input: "123 Main St, NewYork, NY 10001" +Lookup: "123 Main Street, New York, NY 10001" + +Scoring: +├── Street: "Main St" vs "Main Street" = 0.95 → 0.95 +├── City: "NewYork" vs "New York" = 1.0 (normalized) → 1.0 +├── Postal: "10001" vs "10001" = 1.0 → 1.0 + +Final Score: (50% × 0.95) + (30% × 1.0) + (20% × 1.0) = 0.975 +Result: MATCH ✓ +``` + +--- + +## Score Threshold Recommendations by Data Type + +### Text Fields - Company Names + +**Threshold: 0.85 (Standard)** + +**Reasoning:** +- Often have legal entity types (Inc, Corp, Ltd) +- May be abbreviated (Acme vs Acme Corporation) +- Regional variations (American vs US) + +**Match examples (Jaro-Winkler):** +``` +"ACME Corp" vs "Acme Corporation" = 0.88 ✓ +"ABC Manufacturing" vs "ABC Mfg" = 0.87 ✓ +"Global Supply Inc" vs "Global Supplies Inc" = 0.84 ✗ +"Smith Company" vs "Smiths Company" = 0.93 ✓ +``` + +**Threshold tuning:** +- Too strict (0.95): Misses legitimate matches +- Too loose (0.75): Includes many false matches +- 0.85: Balanced precision/recall + +--- + +### Text Fields - Person Names + +**Threshold: 0.82 (Slightly lower)** + +**Reasoning:** +- Many common phonetic variations +- Nicknames: Bill/William, Bob/Robert +- Spelling variations: Johnson/Jonson, Smith/Smyth +- Order variations: John Smith vs Smith, John + +**Match examples (Jaro-Winkler):** +``` +"John Smith" vs "Jon Smyth" = 0.84 ✓ +"William Johnson" vs "Bill Johnson" = 0.79 ✗ (too different) +"Catherine" vs "Katherine" = 0.92 ✓ +"Patricia" vs "Patrice" = 0.89 ✓ +``` + +**Threshold tuning:** +- Use 0.82-0.85 for general matching +- Use 0.75 if you accept manual review +- Phonetic matching works better (use Soundex/Metaphone first) + +--- + +### Addresses - Street Address + +**Threshold: 0.88 (Stricter)** + +**Reasoning:** +- Must match location accurately +- Abbreviations are standardizable +- Format variations are common (12 vs 12th) + +**Match examples:** +``` +"123 Main Street" vs "123 Main St" = 0.95 ✓ +"456 Oak Ave" vs "456 Oak Avenue" = 
0.93 ✓ +"789 Pine Rd" vs "789 Pine Road" = 0.93 ✓ +"100 First St" vs "100 First Street" = 0.95 ✓ +"2 Second Ave" vs "200 Second Ave" = 0.88 ✓ (barely) +"250 Elm" vs "250 Oak" = 0.33 ✗ (wrong street) +``` + +**Threshold tuning:** +- 0.88-0.92: Recommended for address matching +- Pre-standardize abbreviations first (St, Ave, Rd, Blvd) +- Use postal code as secondary verification + +--- + +### Addresses - Postal Code + +**Threshold: 0.98 (Very strict)** + +**Reasoning:** +- Must be nearly exact +- 5-10 digit number +- Variations are uncommon + +**Match examples:** +``` +"10001" vs "10001" = 1.0 ✓ +"10001-1234" vs "10001" = 0.92 (different formats) +"02101" vs "2101" = 0.80 (missing leading 0) +``` + +**Threshold tuning:** +- Use exact match, not fuzzy +- Handle format variations programmatically +- Pre-clean before matching + +--- + +### Numeric Fields - Product Codes + +**Threshold: 0.99 (Almost exact)** + +**Reasoning:** +- Must be nearly identical +- Small character set (digits) +- Any mismatch = wrong product + +**Match examples:** +``` +"SKU-12345" vs "SKU-12345" = 1.0 ✓ +"SKU12345" vs "SKU-12345" = 0.97 (format difference) +"SKU-12345" vs "SKU-12346" = 0.88 ✗ +``` + +**Threshold tuning:** +- 0.98+: Strict, recommended +- Only accept near-perfect matches +- Pre-standardize format + +--- + +### Text Fields - Product Descriptions + +**Threshold: 0.78-0.82 (More permissive)** + +**Reasoning:** +- May use different marketing language +- Word order varies +- Abbreviations common (Pro, Max, Standard, etc.) 
+- Typos possible in descriptions
+
+**Match examples:**
+```
+"Professional Widget" vs "Widget Pro" = 0.82 ✓
+"Advanced Manufacturing Tool" vs "Advanced Tool" = 0.80 ✓
+"Standard Gadget" vs "Basic Gadget" = 0.73 ✗
+"Product XYZ-100" vs "XYZ 100 Product" = 0.78 ✓
+```
+
+**Threshold tuning:**
+- Use 0.78-0.82 for description matching
+- Token-based algorithm works better
+- Be prepared for manual review
+
+---
+
+### Summary Table
+
+| Data Type | Algorithm | Threshold | Note |
+|-----------|-----------|-----------|------|
+| Company Names | Jaro-Winkler | 0.85 | Good balance; handles legal entities |
+| Person Names | Soundex/Metaphone* | 0.82 | Phonetic better; also try Jaro-Winkler |
+| Street Address | Jaro-Winkler | 0.88 | Pre-standardize abbreviations |
+| Postal Code | Exact | 0.98 | Nearly exact match required |
+| Product Codes | Exact | 0.99 | Nearly exact; strict matching |
+| Descriptions | Token-based | 0.80 | More permissive; manual review expected |
+| Email Address | Exact | 0.99 | Nearly exact (domain case-insensitive) |
+| Phone Number | Exact (digits only) | 0.99 | Remove formatting, match digits only |
+
+*Soundex/Metaphone provides boolean match (same code = match), not scored
+
+---
+
+## Pre-Processing Tips - Data Standardization
+
+### String Case Normalization
+
+**Recommended approach:**
+```
+Normalize to UPPERCASE for matching
+├── Avoids case sensitivity issues
+├── Case inconsistency won't prevent matches
+├── Standard practice for fuzzy matching
+
+Example:
+Input: "aCmE cOrP"
+After: "ACME CORP"
+
+Lookup: "acme corporation"
+After: "ACME CORPORATION"
+```
+
+### Whitespace Handling
+
+```
+Step 1: Trim leading and trailing
+├── " ACME " → "ACME"
+
+Step 2: Standardize internal spaces (collapse multiple to single)
+├── "ACME  CORP" → "ACME CORP"
+
+Step 3: Normalize line breaks and tabs
+├── "ACME\nCORP" → "ACME CORP"
+```
+
+### Special Character Removal
+
+**Decision: What to keep/remove**
+
+```
+Keep (meaningful):
+├── Hyphens in
compound words (New-York) +├── Apostrophes in names (O'Brien) +└── Numbers (Model 100, Version 2.0) + +Remove (usually noise): +├── Punctuation at end (ACME, → ACME) +├── Extra symbols (@, #, $, %, &) +├── Parentheses and brackets (ACME (subsidiary) → ACME subsidiary) +└── Commas and periods + +Partial removal: +├── Periods in abbreviations (U.S.A. → USA) +├── Dashes in codes (SKU-12345 → SKU12345 for matching) +``` + +**Examples:** +``` +Input: "ACME, Inc." +Step 1 (case): "ACME, INC." +Step 2 (trim): "ACME, INC." (no leading/trailing) +Step 3 (special chars): "ACME INC" +Result: "ACME INC" + +Input: "O'Brien & Associates (New York)" +Step 1 (case): "O'BRIEN & ASSOCIATES (NEW YORK)" +Step 2 (remove parens): "O'BRIEN & ASSOCIATES NEW YORK" +Step 3 (remove &): "O'BRIEN ASSOCIATES NEW YORK" +Result: "O'BRIEN ASSOCIATES NEW YORK" +``` + +### Abbreviation Standardization + +**Common expansions:** + +``` +Company Legal Forms: +├── Inc → Incorporated +├── Corp → Corporation +├── Ltd → Limited +├── LLC → Limited Liability Company +├── Co → Company +├── Assoc → Association + +Geographic: +├── St → Street +├── Ave → Avenue +├── Blvd → Boulevard +├── Rd → Road +├── Ln → Lane +├── Dr → Drive +├── Pl → Place +├── Ct → Court +├── N/S/E/W → North/South/East/West +├── NY → New York +├── PA → Pennsylvania +├── CA → California + +Professional: +├── Ph.D. → Doctor +├── M.D. → Medical Doctor +├── Mr. → Mister +├── Mrs. → Missus +└── Ms. → Miss/Missus + +Measurement: +├── Ft → Foot +├── Lbs → Pounds +├── Oz → Ounces +├── Qt → Quart +└── Gal → Gallon +``` + +**Implementation:** +``` +Expansion strategy: +1. Create abbreviation mapping table +2. Apply during pre-processing +3. 
Or include both versions in matching + +Example: +Input: "ACME Inc Street" +Option A (expand): "ACME Incorporated Street" +Option B (both): Match both "Inc" and "Incorporated" versions + +Lookup: "ACME Corporation Street" +``` + +### Accents and Diacritics + +**When to remove:** + +``` +Keep accents when: +├── Language-specific matching required (French, Spanish) +├── Name identity important (José vs Jose are different people) + +Remove accents when: +├── Language-neutral matching +├── System doesn't support Unicode +├── Names used internationally +``` + +**Conversion table:** +``` +À, Á, Â, Ã, Ä, Å → A +È, É, Ê, Ë → E +Ì, Í, Î, Ï → I +Ò, Ó, Ô, Õ, Ö → O +Ù, Ú, Û, Ü → U +Ñ → N +Ç → C +``` + +**Example:** +``` +Input: "Société Générale" +After: "Societe Generale" + +Lookup: "Societe Generale" +Result: MATCH ✓ +``` + +### Number Formatting + +``` +Remove formatting for matching: +├── Phone: "(555) 123-4567" → "5551234567" +├── ZIP Code: "10001-1234" → "100011234" (for matching digits only) +├── Currency: "$1,234.56" → "1234.56" +├── Product code: "SKU-ABC-12345" → "SKUABC12345" + +Consider context: +├── If dash is meaningful (model numbers): Keep +├── If just formatting: Remove +``` + +--- + +## Common Matching Patterns by Industry + +### Finance/Accounting + +**Pattern 1: Vendor Deduplication** +``` +Columns to match: Vendor Name, Address, City +Primary: Vendor Name (fuzzy, 0.85) +Secondary: City (exact) +Threshold: 0.85 + +Example: +Input: "ACME Manufacturing Inc" +Match: "Acme Mfg Corporation" +Requires: Manual review despite score + +Additional rule: If name matches but different city → Likely different vendor +``` + +**Pattern 2: Invoice Reconciliation** +``` +Match: Invoice-to-PO line +Columns: Vendor, PO Number, Amount +Algorithm: Exact on Vendor + PO, fuzzy on amount (±5%) +Threshold: Vendor exact + PO fuzzy 0.95 + Amount within tolerance + +Example: +Input PO: Amount $10,000 +Invoice: Amount $9,987.50 (0.125% difference = match) +``` + +### Sales/CRM + 
+**Pattern 1: Customer Deduplication**
+```
+Columns: Customer Name, Phone, Address
+Algorithm 1: Exact match on phone (most reliable)
+Algorithm 2: If no phone, fuzzy on name + city
+Threshold: Phone exact, else name 0.82 + city exact
+
+Example:
+Phone Match: Always match regardless of name
+No Phone: Require name match + city match
+```
+
+**Pattern 2: Lead Deduplication**
+```
+Columns: Email, Phone, First Name + Last Name
+Algorithm: Exact email (if available), else phone, else fuzzy names
+Threshold: Email/Phone exact, names 0.85
+
+Priority:
+1. Email address (nearly unique)
+2. Phone number (high specificity)
+3. First + Last Name fuzzy match
+```
+
+### Supply Chain
+
+**Pattern 1: Supplier Master Consolidation**
+```
+Columns: Company, Address, Supplier Code
+Algorithm: Fuzzy company (0.85) + address (0.88) + optional code
+Threshold: Both address and company must match, code confirms
+
+Example:
+"ACME Mfg" + "123 Main" = potential match
+"ACME Mfg" + "456 Oak" = different supplier (address doesn't match)
+```
+
+**Pattern 2: SKU Harmonization**
+```
+Columns: Product Code, Product Name, Supplier
+Algorithm: Exact on code, fuzzy on name for fallback
+Threshold: Code 0.99 (nearly exact), Name 0.85 if code missing
+
+Example:
+Code "SKU-12345" exact matches primary
+Name match only if code not available
+```
+
+### Healthcare
+
+**Pattern 1: Patient Record Matching**
+```
+Columns: First Name, Last Name, DOB, Phone
+Algorithm:
+ Rule 1: DOB + Last Name exact → MATCH
+ Rule 2: First + Last Name + Phone → MATCH
+ Rule 3: Fuzzy name (0.82) + DOB exact → MATCH
+Threshold: At least 2 fields must strongly match
+
+Caution: Medical matching requires high precision
+```
+
+**Pattern 2: Drug/Medication Matching**
+```
+Columns: Drug Name, Strength, Form
+Algorithm: Fuzzy on name (0.88), exact on strength
+Threshold: Name must match, strength must be exact
+Note: Synonyms like "ASA" vs "Aspirin" score poorly on string similarity; resolve via synonym list first
+
+Example:
+"Aspirin 100mg tablet" vs "Aspirin 100 mg tablets" = 0.91 ✓ (close name, exact strength)
+"Aspirin 100mg" vs "Aspirin 200mg" = NO MATCH ✗ (strength must match exactly) +``` + +### Retail + +**Pattern 1: Product Catalog Harmonization** +``` +Columns: UPC, Product Name, Brand +Algorithm: Exact on UPC, fuzzy on name for lookups +Threshold: UPC 0.99, Name 0.80 (descriptions vary) + +Example: +UPC "012345678901" exact matches +Name match secondary if UPC unavailable +``` + +**Pattern 2: Store Location Matching** +``` +Columns: Store Code, Street Address, City, State +Algorithm: Store code exact, address + city + state fuzzy +Threshold: Code 0.99, else address+city+state 0.88 + +Example: +Store "NYC-001" exact on code +If code missing: Address 0.90 + City 1.0 + State 1.0 = weighted match +``` + +--- + +## Troubleshooting Decision Tree + +``` +Start: Poor match results + +├─ Too few matches found? +│ ├─ Check threshold +│ │ └─ Lower from 0.85 to 0.80, test +│ ├─ Check column selection +│ │ └─ Run analyze_column_distribution; verify data in columns +│ ├─ Check data quality +│ │ └─ Any nulls? Truncated values? Wrong data type? +│ └─ Check algorithm appropriateness +│ └─ Names: Use phonetic? Addresses: Use token-based? +│ +├─ Too many false matches? +│ ├─ Check threshold +│ │ └─ Raise from 0.85 to 0.90, review quality +│ ├─ Check algorithm fit +│ │ └─ Example: Exact matching for codes, not fuzzy +│ ├─ Add secondary columns +│ │ └─ Single column matching too broad; add city, postal code +│ └─ Check lookup data +│ └─ Duplicates in master data? Clean first +│ +├─ Specific records not matching? +│ ├─ Check for case sensitivity issues +│ │ └─ Normalize to uppercase +│ ├─ Check for special characters +│ │ └─ "Smith, Inc." vs "Smith Inc" (comma matters) +│ ├─ Check for abbreviation issues +│ │ └─ "St" vs "Street"; expand abbreviations +│ ├─ Check for extra whitespace +│ │ └─ "Smith Inc" (double space) vs "Smith Inc" +│ └─ Try different algorithm +│ └─ If fuzzy fails, try phonetic or token-based +│ +└─ Performance issues (slow matching)? 
+ ├─ Check data volume + │ └─ Millions of rows? Consider batch processing + ├─ Check algorithm complexity + │ └─ Soundex (fast) vs Jaro-Winkler (slower) + ├─ Check column cardinality + │ └─ Matching on high-cardinality field? May be slow + └─ Check for pre-processing + └─ Can you pre-filter (exact matches first)? +``` diff --git a/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/SKILL.md new file mode 100644 index 0000000..68d3973 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/SKILL.md @@ -0,0 +1,597 @@ +--- +name: Performance Optimizer +description: "Optimize Datasphere performance NOW! Use when views run slow, queries timeout, or data flows lag. Analyze bottlenecks, tune execution plans, optimize persistence, manage storage. Keywords: slow query, performance, bottleneck, timeout, memory, CPU, execution plan, view optimization, partition, index." +--- + +# Performance Optimizer Skill + +## Overview + +The Performance Optimizer skill helps you identify and resolve performance issues in SAP Datasphere. Whether your views are running slowly, queries are timing out, or data flows are consuming excessive resources, this skill provides a systematic approach to diagnose and fix performance problems at every layer: views, queries, data flows, and storage. + +## When to Use This Skill + +Trigger this skill when you encounter: +- Views or queries running slower than expected +- Query timeouts or cancellations +- High memory or CPU consumption +- Data flow execution taking excessive time +- Replication flows with poor delta performance +- Need to optimize storage usage +- Capacity warnings or resource contention alerts +- Performance regression after schema changes +- Need to benchmark performance improvements + +## Performance Analysis Approach + +### 1. 
Identify the Bottleneck + +Start by pinpointing where performance degradation occurs: + +**View Performance Issues:** +- Access the View Analyzer tool in your view's details +- Check execution time trends over the last 7-30 days +- Compare execution times across different consumer queries +- Identify which views are called most frequently +- Note views with long initialization times + +**Query Performance Issues:** +- Review query execution statistics in task logs +- Use Explain Plan to understand query execution strategy +- Check statement logs for query duration and resource usage +- Identify queries with full table scans or inefficient joins +- Monitor query frequency and timing patterns + +**Data Flow Performance Issues:** +- Check data flow execution logs for step-level timing +- Monitor initial load duration vs. delta load duration +- Review parallelism settings and actual utilization +- Check for data quality issues causing processing overhead +- Analyze memory allocation and spill events + +### 2. Measure Current Performance + +Establish baseline metrics before optimization: + +**Key Metrics to Capture:** +- Query execution time (wall-clock and CPU time) +- Memory usage (peak and average) +- Data volume processed (rows and bytes) +- I/O operations and throughput +- Number of disk accesses vs. in-memory operations +- Index usage patterns +- Cache hit rates +- Parallelism level achieved +- Storage tier distribution + +Use the MCP tools to gather metrics: +- `execute_query`: Run diagnostic queries to capture baseline stats +- `analyze_column_distribution`: Understand data skew and selectivity +- `get_table_schema`: Review column types and indexing +- `get_task_status`: Monitor execution metrics + +### 3. Optimize + +Apply targeted optimizations based on findings (see following sections). + +### 4. 
Validate + +Re-measure performance after changes: +- Compare new metrics against baseline +- Verify performance improvement meets targets +- Monitor for 7-14 days to ensure consistency +- Check for negative side effects on other queries +- Document changes for future reference + +## View Analyzer + +The View Analyzer provides insights into view execution performance and resource consumption. + +### Interpreting View Analyzer Results + +**Execution Time Breakdown:** +- **Initialization Time**: Time to prepare execution plan and allocate resources. High values indicate complex view logic or many dependent views. +- **Execution Time**: Actual query processing time. Dominated by data retrieval and transformation. +- **Fetch Time**: Time to return results to consumer. Long fetch times indicate large result sets or network latency. + +**Resource Consumption:** +- **Peak Memory**: Maximum memory used during execution. Watch for memory spills to disk. +- **CPU Time**: Total CPU cycles consumed. High CPU indicates data processing-heavy operations. +- **I/O Throughput**: Data read from storage. Compare with data returned to find filtering efficiency. + +### Identifying Expensive Operations + +**High Memory Consumers:** +- Joins with large intermediate result sets +- Aggregations without pre-filtering +- Complex subqueries creating multiple temporary tables +- Non-persisted views with repeated calculations + +**CPU-Intensive Operations:** +- Complex calculations and expressions +- Multiple levels of nested aggregations +- String operations on large text fields +- Date/time calculations without optimization + +**I/O-Heavy Operations:** +- Full table scans on large tables +- Inefficient join strategies +- Missing indexes on join keys +- Reading unnecessary columns + +## Explain Plan Analysis + +The Explain Plan shows exactly how Datasphere executes your query, revealing inefficiencies at each step. + +### Reading Execution Plans + +**Plan Structure:** +1. 
**Node Type**: SCAN (table access), JOIN, AGGREGATE, SORT, FILTER, etc. +2. **Input Rows**: Rows input to this node +3. **Output Rows**: Rows produced by this node +4. **Selectivity**: Output/Input ratio. Values < 0.1 indicate effective filtering. +5. **Estimated Time**: Predicted execution time +6. **Actual Time**: Measured execution time (if available) + +### Spotting Full Table Scans + +**Indicators:** +- SCAN node with no index specification +- Input rows equal to table row count +- High selectivity discrepancy (many input rows, few output rows) + +**Solutions:** +- Add index on WHERE clause columns +- Add covering index including SELECT columns +- Consider partitioning with partition-aware queries +- Rewrite to improve filter pushdown + +### Identifying Join Inefficiencies + +**Red Flags:** +- Join producing result set larger than inputs (indicates Cartesian product) +- Very high intermediate row counts before filtering +- Hash joins with small build tables (should use index joins) +- Join order processing large tables first + +**Solutions:** +- Reorder joins to process smallest tables first +- Add indexes on join keys +- Apply filters before joins (reduce probe table size) +- Consider materialization of frequently-joined dimensions +- Use broadcast tables for small dimension tables + +### Common Explain Plan Anti-Patterns + +**Pattern: Multiple Cascading Aggregations** +``` +AGGREGATE (GROUP BY col1, col2, col3) + AGGREGATE (GROUP BY col1, col2) + AGGREGATE (GROUP BY col1) + SCAN big_table +``` +Issue: Each aggregation processes full dataset +Solution: Combine aggregations or materialize intermediate results + +**Pattern: Cartesian Join** +``` +JOIN (on false condition) + SCAN table1 (1M rows) + SCAN table2 (1M rows) +``` +Issue: 1T row intermediate result +Solution: Fix join condition, verify ON clause logic + +**Pattern: Late Filtering** +``` +FILTER (where price > 100) + JOIN table1 (1M rows) with table2 (1M rows) +``` +Issue: Join processes all rows 
before filtering +Solution: Move WHERE clause into source tables (pushdown) + +## View Optimization Strategies + +### Persistence Strategy + +**When to Persist:** +- Views consumed by many queries (reuse calculation) +- Complex views with expensive calculations +- Views used in real-time dashboards (reduce latency) +- High-frequency views (cache results between accesses) +- Views with stable underlying data + +**When NOT to Persist:** +- Frequently updated source data (maintenance cost) +- Views with specific daily snapshots +- Once-per-execution views +- Views with minimal consumer count + +**Persistence Modes:** +- **Runtime**: Results cached in memory during session (no latency, high memory) +- **Disk-Based**: Results written to disk (lower memory, I/O cost) +- **Materialized**: Pre-calculated, updated on schedule (consistent performance, staleness risk) + +### Column Pruning + +Remove unnecessary columns from views to reduce data volume: + +**Strategy:** +1. Audit all consuming queries +2. Identify columns never referenced +3. Remove unused columns from source +4. Test dependent views for impact +5. 
Document retained columns and purpose + +**Benefits:** +- Reduced memory footprint +- Faster data transfer +- Smaller index sizes +- Improved cache effectiveness + +### Partition Pruning + +Organize data into partitions to reduce scanned volume: + +**Partition Key Selection:** +- High-cardinality columns with clear ranges (dates, regions) +- Columns frequently in WHERE clauses +- Columns supporting most common filters +- Avoid partitioning on low-cardinality columns + +**Partition Strategy:** +- **Time-Based**: Year, month, day (best for temporal data) +- **Range-Based**: Numeric ranges (best for numeric dimensions) +- **Hash-Based**: Distribute evenly (when no natural partition key) +- **List-Based**: Specific values (geographic regions, business units) + +**Partition Pruning in Queries:** +```sql +SELECT * FROM sales +WHERE year = 2024 AND month IN (1,2,3) +-- Partition pruning eliminates months outside this range +``` + +### Push-Down Optimization + +Move filtering and aggregation to earliest possible stage: + +**Examples:** +```sql +-- GOOD: Filters pushed to source tables +SELECT dept, COUNT(*) +FROM employees +WHERE hire_date >= '2023-01-01' -- Pushed down +GROUP BY dept + +-- BAD: Filter applied to pre-aggregated result (wrong answer potential) +SELECT dept, COUNT(*) as emp_count +FROM employees +GROUP BY dept +HAVING CAST(MAX(hire_date) AS DATE) >= '2023-01-01' +``` + +**Benefits:** +- Reduces intermediate result sizes +- Minimizes data movement +- Allows index usage +- Reduces memory pressure + +## Query Optimization Techniques + +### Index Usage + +**Index Types:** +- **B-Tree Index**: Default, efficient for range and equality queries +- **Hash Index**: For equality lookups only, faster than B-Tree +- **Bitmap Index**: For low-cardinality columns, space-efficient +- **Covering Index**: Includes all columns needed (no table access required) + +**Index Selection:** +1. Profile query WHERE and JOIN clauses +2. Create indexes on high-selectivity columns +3. 
Use composite indexes for multi-column conditions +4. Add covering indexes for frequently-run queries +5. Monitor index usage; drop unused indexes + +**Query Hints for Index Usage:** +```sql +SELECT /*+ INDEX(orders order_date_idx) */ + order_id, amount +FROM orders +WHERE order_date = '2024-01-15' +``` + +### Join Order Optimization + +**Heuristic: Smallest Table First** +```sql +-- GOOD: Dimension first (smaller) +SELECT * +FROM departments d + INNER JOIN employees e ON d.dept_id = e.dept_id +WHERE d.region = 'WEST' + +-- BAD: Large fact table first +SELECT * +FROM sales_fact f + INNER JOIN date_dim d ON f.date_id = d.date_id +WHERE d.year = 2024 +``` + +**Broadcast Strategy:** +Use broadcast join for small dimensions (< 100MB): +```sql +SELECT /*+ BROADCAST(dimension) */ + f.amount, d.category +FROM fact_table f + INNER JOIN small_dimension d ON f.dim_id = d.id +``` + +### Aggregation Strategies + +**Pre-Aggregation Pattern:** +Materialize frequently-aggregated views: +```sql +-- Materialized view updated daily +CREATE MATERIALIZED VIEW daily_sales_summary AS +SELECT date, product_id, SUM(amount) as total_sales, COUNT(*) as order_count +FROM sales_fact +GROUP BY date, product_id + +-- Consumer query now fast +SELECT product_id, SUM(total_sales) +FROM daily_sales_summary +WHERE date >= '2024-01-01' +GROUP BY product_id +``` + +**Aggregation Push-Down:** +```sql +-- GOOD: Aggregate at source +SELECT dept, SUM(salary) +FROM employees +WHERE status = 'ACTIVE' +GROUP BY dept + +-- BAD: Aggregate after joining to dimensions +SELECT e.dept, SUM(e.salary) +FROM employees e + INNER JOIN departments d ON e.dept_id = d.dept_id +GROUP BY e.dept +``` + +## Data Flow Performance + +### Parallelism Configuration + +**Parallelism Settings:** +- **DOP (Degree of Parallelism)**: Number of parallel worker threads +- **Default**: Auto-calculated based on CPU cores and available memory +- **Manual Override**: Set for specific flows with special requirements + +**Tuning Approach:** 
+1. Baseline with default parallelism +2. Monitor CPU utilization (target 70-85%) +3. Increase DOP if CPU < 50% (data not distributed well) +4. Decrease DOP if CPU > 90% (contention, thread overhead) +5. Test with representative data volume + +### Batch Sizes + +**Batch Size Impact:** +- **Small batches** (1K rows): More overhead, better interactivity +- **Large batches** (100K+ rows): Better throughput, more memory per batch +- **Optimal**: Typically 10K-50K rows depending on row width and memory + +**Configuration:** +``` +Batch Size = Available Memory / (Row Width * 2) +``` + +### Memory Allocation + +**Memory Distribution:** +- Allocate 60% to data buffers +- 20% to working memory (joins, aggregations) +- 20% to system overhead and safety margin + +**Monitoring:** +- Watch for memory spill messages (indicates insufficient allocation) +- Monitor peak memory usage over time +- Check for memory leaks in long-running flows + +## Replication Flow Performance + +### Initial Load Optimization + +**Full Table Replication:** +1. Schedule during low-activity windows +2. Use maximum batch size for the table size +3. Verify target has sufficient staging space +4. Monitor progress via status logs +5. Validate row counts match source + +**Selective Replication:** +```sql +-- Replicate only recent data +WHERE created_date >= CURRENT_DATE - 7 +``` + +Benefits: Reduced network traffic, faster completion + +### Delta Performance Tuning + +**Change Data Capture (CDC) Optimization:** +1. Ensure source system has CDC enabled +2. Configure appropriate change log retention +3. Set delta frequency (hourly, daily) based on volume +4. Monitor delta processing time trends +5. 
Alert on delta lag exceeding threshold
+
+**Performance Issues and Solutions:**
+- **Long delta processing**: Increase parallelism, check for data anomalies
+- **High memory usage**: Reduce batch size, increase delta frequency
+- **Network bottleneck**: Verify network bandwidth, consider compression
+- **Target contention**: Schedule deltas during off-peak, increase parallel writers
+
+## Storage Optimization
+
+### Disk vs. In-Memory Strategy
+
+**In-Memory Advantages:**
+- Sub-millisecond query latency
+- Efficient for repeated access
+- Better for interactive dashboards
+
+**In-Memory Disadvantages:**
+- Limited capacity
+- Higher cost per GB
+- Unsuitable for very large tables
+
+**Disk Advantages:**
+- Unlimited capacity
+- Cost-effective for archival data
+- Suitable for infrequently-accessed tables
+
+**Disk Disadvantages:**
+- Higher query latency (100ms+)
+- Better suited to batch reporting than interactive use
+
+**Decision Matrix:**
+- High-frequency, small tables → In-Memory
+- Large, infrequently-accessed → Disk/Object Store
+- Medium-sized, moderate frequency → Hybrid (hot set in-memory)
+
+### Object Store Tiering (HDLF - SAP HANA Data Lake Files)
+
+**Tier Organization:**
+1. **Hot Tier**: Frequently accessed, in-memory or SSD
+2. **Warm Tier**: Moderate access, disk-based
+3. 
**Cold Tier**: Archive data, object store (S3/blob) + +**Tiering Policy:** +``` +Data age < 30 days → Hot tier +Data age 30-90 days → Warm tier +Data age > 90 days → Cold tier +``` + +**Benefits:** +- Optimal cost-performance balance +- Automatic promotion/demotion +- Transparent to queries +- Improved capacity utilization + +## Monitoring and Alerting + +### Task Logs + +**Key Information:** +- Task execution time (start, end, duration) +- Rows processed (read, written, failed) +- Memory usage (peak, average) +- Data quality metrics (error counts) +- Dependencies (parent/child tasks) + +**Analysis:** +- Track execution time trends +- Identify tasks with increasing duration +- Monitor failure rates +- Check for resource contention patterns + +### Statement Logs + +**Query Metrics:** +- Query text (full SQL) +- Execution time breakdown (parse, optimize, execute) +- Rows returned +- Memory consumed +- CPU time +- Table accesses and I/O + +**Usage:** +```sql +SELECT query_text, AVG(execution_time_ms) as avg_time, COUNT(*) as frequency +FROM statement_logs +WHERE timestamp >= CURRENT_DATE - 7 +GROUP BY query_text +ORDER BY frequency DESC +``` + +### Capacity Dashboards + +**Metrics to Monitor:** +- Overall CPU utilization +- Memory consumption by tenant/space +- Storage usage by object type +- Network I/O +- Query queue depth +- Active sessions + +**Alerting Thresholds:** +- CPU > 80%: Investigate resource contention +- Memory > 85%: Add capacity or optimize +- Storage > 90%: Archive or clean data +- Queue depth > 10: Performance degradation +- Session count increasing: Check for runaway queries + +## Using MCP Tools for Performance Analysis + +### execute_query +Use to run diagnostic queries and capture explain plans: +``` +execute_query(query="EXPLAIN SELECT ...", explain_type="full") +``` +Returns execution plan with estimated and actual metrics. 
+ +### analyze_column_distribution +Understand data distribution for optimal indexing: +``` +analyze_column_distribution(table="sales", column="product_id") +``` +Returns histogram of values, identifying skew and cardinality. + +### get_table_schema +Review table structure, indexes, and partitioning: +``` +get_table_schema(table="orders") +``` +Returns columns, data types, indexes, constraints, and statistics. + +### get_task_status +Monitor task execution metrics: +``` +get_task_status(task_id="replication_flow_01") +``` +Returns execution time, rows processed, memory used, and state. + +## Performance Optimization Workflow + +1. **Identify Issue**: Run diagnostic queries, review task logs +2. **Measure Baseline**: Capture execution time, resource usage, row counts +3. **Analyze Root Cause**: Review Explain Plan, check for anti-patterns +4. **Create Optimization Plan**: Document proposed changes and expected impact +5. **Implement Changes**: Add indexes, modify view logic, adjust persistence +6. **Test Thoroughly**: Validate correctness and measure performance +7. **Monitor Results**: Track metrics for 1-2 weeks post-deployment +8. 
**Document Learnings**: Record successful optimizations for future reference + +## Best Practices + +- Always establish baseline metrics before optimization +- Make one change at a time to isolate impact +- Test in non-production environment first +- Monitor for 2+ weeks after changes (catch edge cases) +- Document all optimization decisions and rationale +- Regular review of slow query logs (at least weekly) +- Implement automated alerting on performance regressions +- Consider future growth when sizing indexes and partitions +- Balance optimization effort against business value +- Regularly validate statistics and rebuild indexes + +## Reference Materials + +See reference files for detailed procedures: +- `references/optimization-techniques.md` - Explain Plan reading guide, persistence decision matrix, partitioning strategies, memory management +- `references/diagnostic-procedures.md` - Advanced diagnostic procedures: PlanViz trace generation/analysis, MDS query diagnosis for SAC live connections, HAR file network analysis, tenant memory/CPU profiling, and extracting underlying SQL from Datasphere views diff --git a/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/diagnostic-procedures.md b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/diagnostic-procedures.md new file mode 100644 index 0000000..e59fd2c --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/diagnostic-procedures.md @@ -0,0 +1,156 @@ +# Advanced Performance Diagnostic Procedures + +## 1. 
Explain Plan Analysis
+
+### Generating an Explain Plan
+- Method 1: SQL Console menu → Analyze → Explain Plan
+- Method 2: SQL statement: `EXPLAIN PLAN SET STATEMENT_NAME = '<name>' FOR <SQL statement>`
+
+### Key Metrics to Review
+- OPERATOR_DETAILS: Operators used in execution
+- Execution Engine: ROW vs COLUMN engine (COLUMN is preferred for analytics)
+- Table type: Column store vs Row store
+- SUBTREE_COST: Relative cost of operations
+- CPU time, execution time, memory consumed
+
+### Interpreting Results
+- Look for full table scans (expensive on large tables)
+- Check if joins are pushed down to the engine level
+- Verify aggregations happen early in the plan
+- Watch for row-to-column conversions (performance penalty)
+
+## 2. PlanViz Trace Generation and Analysis
+
+### When to Use
+- Complex queries taking >10 seconds
+- SAC live connection performance issues
+- Unexplained slow view previews
+- Comparing performance between views
+
+### Generating a PlanViz Trace
+1. Open SQL Console in Database Explorer
+2. Right-click on the query → "Generate SQL Analyzer Plan File"
+3. Provide a meaningful filename
+4. For long-running queries: Run in background
+5. Download from Database Diagnostic Files → "other" folder
+
+### Analyzing PlanViz
+- Use HANA Studio PlanViz perspective or VS Code SQL Analyzer extension
+- Key analysis points:
+  - Dominant Plan Operators (top 5 by CPU time)
+  - Accessed tables and remote sources
+  - Row counts at each step (data volume)
+  - GROUP BY and JOIN operator counts
+  - Remote source round-trips (for federated queries)
+- Optimization targets:
+  - Reduce join complexity
+  - Push filters earlier in the plan
+  - Persist intermediate results for hot paths
+  - Cache frequently accessed dimensions
+
+## 3. MDS Query Diagnosis (for SAC Live Connections)
+
+### Understanding MDS
+Multi-Dimensional Services (MDS) is the query engine used by SAP Analytics Cloud to consume Datasphere analytical models via live connections.
+ +### Extracting MDS Queries from SAC +1. Open Chrome Developer Tools → Network tab +2. Reproduce the slow action in SAC +3. Find the GetResponse request in the network log +4. Download as HAR file (Right-click → Save all as HAR) +5. In the HAR file, locate the Payload with PerformanceAnalysis data +6. Extract the MDS query from the View Source payload + +### Executing MDS Queries Directly +Execute directly on HANA to isolate whether the issue is in the query or SAC rendering: +```sql +-- Parameter placeholders elided here; see SAP Notes 2550833 and 2691501 for the full signature +CALL SYS.EXECUTE_MDS('Analytics', '', '', '', '', '', ?); +``` +- Compare execution time to SAC response time +- If HANA is fast but SAC is slow → Network or rendering issue +- If HANA is slow → Query optimization needed +- See SAP Notes 2550833 and 2691501 for EXECUTE_MDS details + +### Result Set Limits +- SAC Drill Limitation: Default 500 rows per table widget +- Datasphere: result_set_size_max_default = 1,000,000 +- Datasphere: result_set_size_max_limit = 10,000,000 +- When exceeded: "MaxResultRecords" error +- Solution: Reduce dimensions/measures or add filters +- See SAP Note 2770570 for parameter configuration + +## 4. Network Performance Diagnosis + +### HAR File Collection +1. Open Chrome Developer Tools (F12) → Network tab +2. Check "Preserve log" checkbox +3. Reproduce the slow operation +4. Right-click in network list → "Save all as HAR with content" +5. See SAP KBA 3405253 for detailed instructions + +### Interpreting HAR Timing +- DNS Lookup: Should be <50ms +- Initial Connection: Should be <100ms +- SSL Handshake: Should be <100ms +- Waiting (TTFB): Server processing time — this is where most performance issues live +- Content Download: Network transfer time +- If Waiting is high → Server-side optimization needed +- If Content Download is high → Network bandwidth issue + +## 5.
Tenant Performance Analysis + +### Memory Analysis +- Check System Monitor → Memory allocation trends over 4 weeks +- Rising trend without corresponding data growth = potential memory leak +- Key consumers: Persisted tables/views, complex queries, in-memory caches +- Use Allocators tab to identify top memory consumers +- See SAP Note 1969700 for HANA memory SQL collection + +### CPU Analysis +- Monitor thread states and CPU utilization +- Check for MVCC (Multi-Version Concurrency Control) version buildup +- Analyze workload distribution across threads +- Identify long-running queries consuming disproportionate CPU +- See SAP Note 2114710 for HANA thread analysis + +### Creating a Database Analysis User +1. System → Configuration → Database Access → Database Analysis Users +2. Provide name suffix +3. Enable "space schema access" for SQL execution +4. Generate and save credentials +5. Use credentials to access Database Explorer or HANA Cockpit + +## 6. Retrieving Underlying SQL from Datasphere + +### Purpose +Extract the actual SQL generated by Datasphere views to execute directly on HANA for diagnosis. + +### Steps +1. Create Database Analysis User (see above) +2. Open Database Explorer via the analysis user +3. In Datasphere, locate the view's generated SQL (from SQL view definition or calculated columns) +4. In Database Explorer SQL Console, add schema prefixes to table references +5. Execute and compare results/performance with Datasphere execution +6. 
Use Explain Plan on the extracted SQL for detailed analysis + +### Common Use Case: Function Errors +Example: DAYS_BETWEEN function error — "inconsistent datatype" +- Root cause: Function expects DATE type but receives wrong type +- Fix: Use TO_DATE() conversion: `DAYS_BETWEEN(TO_DATE(col1), TO_DATE(col2))` +- See SAP Note 2573900 for HANA 2.0 SPS03 behavior changes + +## Key SAP Notes for Performance + +| Note | Description | +|------|-------------| +| 1969700 | SQL Statement Collection for HANA | +| 2114710 | HANA Threads and Thread Samples FAQ | +| 2073964 | PlanViz trace generation | +| 2550833 | EXECUTE_MDS procedure reference | +| 2691501 | EXECUTE_MDS for troubleshooting | +| 2770570 | Result set size parameters | +| 2573900 | DAYS_BETWEEN behavior change HANA 2.0 SPS03 | +| 3405253 | Network trace collection with Developer Tools | +| 3575377 | Max result records for MDS | +| 3616785 | Analytics mode and data persistency | +| 3476918 | How to access HANA Cloud DB traces | diff --git a/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/optimization-techniques.md b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/optimization-techniques.md new file mode 100644 index 0000000..ac289c9 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-performance-optimizer/references/optimization-techniques.md @@ -0,0 +1,553 @@ +# Performance Optimization Techniques Reference + +## Explain Plan Reading Guide + +### Basic Structure Example + +``` +EXPLAIN SELECT p.category, SUM(s.amount) as total +FROM sales s +INNER JOIN product_dim p ON s.product_id = p.id +WHERE s.sale_date >= '2024-01-01' +GROUP BY p.category +``` + +**Expected Plan:** + +``` +Plan (cost=100..500 rows=10) + AGGREGATE (GROUP BY category) + Input rows: 50,000 + Output rows: 10 + Estimated cost: 450 + Selectivity: 0.0002 + Node Detail: GROUP BY category, SUM(amount) + + JOIN (INNER) + Input rows: 50,000 + Output rows: 50,000 + Estimated
cost: 350 + Join strategy: HASH JOIN + Build table: product_dim (500 rows) + Probe table: sales (100,000 rows) + + SCAN sales + Input rows: 100,000 + Output rows: 50,000 + Estimated cost: 100 + Selectivity: 0.5 + Filter: sale_date >= '2024-01-01' + Index: sales_date_idx (USED) + + SCAN product_dim + Input rows: 500 + Output rows: 500 + Estimated cost: 50 + No filter +``` + +### Interpretation Key + +| Metric | Good Value | Bad Value | Implication | +|--------|-----------|-----------|-------------| +| Selectivity | < 0.1 | > 0.5 | Effective filtering removes rows early | +| Index Used | Yes | No | Leveraging indexes for fast access | +| Rows Output | < Rows Input | = Rows Input | Filtering effectiveness | +| Join Strategy | NESTED LOOP (small) or HASH | CARTESIAN | Correct join algorithm | +| Cost Trend | Decreasing | Increasing at each node | Proper plan structure | + +## Persistence Decision Matrix + +| Factor | Persist | Don't Persist | +|--------|---------|---------------| +| **Consumer Count** | 3+ | 1-2 | +| **Calculation Cost** | High (> 1s) | Low (< 100ms) | +| **Update Frequency** | Stable (daily) | Frequent (hourly+) | +| **Data Freshness** | Can tolerate 1h lag | Requires real-time | +| **Query Concurrency** | High (10+ concurrent) | Low (< 5 concurrent) | +| **Size** | Medium (100MB-10GB) | Very large (> 50GB) | +| **Storage Cost** | Worth reuse benefit | Minimal benefit | + +### Persistence Decision Examples + +**PERSIST these views:** +- Customer dimension (500MB, stable daily, used in 5+ queries) +- Monthly sales summary (10GB, calculated from 1TB fact, used in 8 dashboards) +- Product hierarchy (100MB, rarely changes, joined in many queries) + +**DON'T PERSIST these views:** +- Ad-hoc analysis views (1-time use) +- Real-time operational views (requires latest data) +- Huge aggregations (> 50GB, rarely accessed) +- Frequently recalculated dimensions (underlying data changes hourly) + +## Partitioning Strategies by Use Case + +### Time-Based 
Partitioning (Monthly) + +**Best For:** Sales, transactions, event logs, time-series data + +**Strategy:** +```sql +CREATE TABLE sales_fact ( + sale_id INT, + amount DECIMAL(10,2), + sale_date DATE, + product_id INT +) +PARTITION BY MONTH(sale_date) +``` + +**Benefits:** +- Natural alignment with business calendars +- Easy to archive old partitions +- Supports typical date-based filters +- Good for time-series analytics + +**Query Optimization:** +```sql +-- Partition pruning: only scans 2024-01 partition +SELECT SUM(amount) +FROM sales_fact +WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31' + +-- No pruning: scans all partitions +SELECT SUM(amount) +FROM sales_fact +WHERE EXTRACT(MONTH FROM sale_date) = 1 +``` + +### Range-Based Partitioning (Customer ID) + +**Best For:** Large customer tables, hierarchical data, geographic data + +**Strategy:** +```sql +CREATE TABLE customers ( + customer_id INT, + name VARCHAR(100), + region VARCHAR(50) +) +PARTITION BY RANGE(customer_id) ( + PARTITION p_group_1 VALUES LESS THAN (1000000), + PARTITION p_group_2 VALUES LESS THAN (2000000), + PARTITION p_group_3 VALUES LESS THAN (MAXVALUE) +) +``` + +**Benefits:** +- Distributes large tables evenly +- Supports range-based access patterns +- Enables parallel processing + +### Hash-Based Partitioning + +**Best For:** Distributing data evenly without natural key, load balancing + +**Strategy:** +```sql +CREATE TABLE order_items ( + order_id INT, + item_id INT, + product_id INT +) +PARTITION BY HASH(order_id) PARTITIONS 16 +``` + +**Benefits:** +- Automatic even distribution +- No natural partition key needed +- Scalable to any partition count + +### List-Based Partitioning (Region) + +**Best For:** Categorical data, geographic regions, business units + +**Strategy:** +```sql +CREATE TABLE regional_sales ( + region VARCHAR(50), + amount DECIMAL(10,2) +) +PARTITION BY LIST(region) ( + PARTITION p_west VALUES ('CA', 'WA', 'OR'), + PARTITION p_midwest VALUES ('IL', 'MI', 'OH'), + 
PARTITION p_east VALUES ('NY', 'MA', 'PA') +) +``` + +**Benefits:** +- Logical alignment with business divisions +- Easy to assign partitions to storage tiers +- Supports region-based compliance requirements + +## Memory Management Best Practices + +### Memory Calculation Formulas + +**Available Memory for Queries:** +``` +Available Memory = Total Memory - OS Overhead - Connections - Caching + = 256GB - 16GB - 8GB - 32GB + = 200GB +``` + +**Per-Query Memory Allocation:** +``` +Per-Query Limit = Available Memory / Expected Concurrent Queries + = 200GB / 10 + = 20GB per query +``` + +**Batch Size Optimization:** +``` +Batch Size = Per-Query Memory / (Row Width * Copies Buffered) + = 20GB / (500 bytes/row * 2 copies in memory) + = 20,000,000 rows + → halve to ≈ 10M rows as a safety margin, then adjust based on testing +``` + +### Memory Spill Prevention + +**Causes of Memory Spill:** +1. Batch size too large for memory +2. Intermediate result sizes grow unexpectedly +3. Join with large build table +4. Multiple concurrent queries exceeding limit +5. Incorrect memory allocation + +**Monitoring:** +```sql +SELECT + task_id, + memory_peak_mb, + memory_spill_mb, + spill_count, + CASE WHEN memory_spill_mb > 0 THEN 'SPILL DETECTED' + ELSE 'OK' END as status +FROM task_metrics +WHERE execution_date >= CURRENT_DATE - 7 +ORDER BY memory_spill_mb DESC +``` + +**Resolution:** +1. Reduce batch size (increase memory efficiency) +2. Increase total memory allocation +3. Add indexes to reduce intermediate rows +4. Split large operations into smaller chunks +5.
Schedule during lower concurrent load + +## Common Anti-Patterns and Fixes + +### Anti-Pattern 1: Late Filtering in Joins + +**Problem:** +```sql +-- SLOW: Joins full tables, then filters +SELECT * +FROM orders o +INNER JOIN customers c ON o.customer_id = c.id +WHERE o.order_date >= '2024-01-01' +``` + +**Impact:** +- The join must process all 10M orders against 100K customers before the date filter prunes anything +- The filter reduces the result to 1M rows only after the join work is done +- Unnecessary memory usage and I/O for rows that are immediately discarded +- (Note: the optimizer can often push simple filters down itself, but stacked views and federated sources frequently defeat pushdown) + +**Fix:** +```sql +-- FAST: Filter source before join +SELECT * +FROM (SELECT * FROM orders WHERE order_date >= '2024-01-01') o +INNER JOIN customers c ON o.customer_id = c.id +``` + +**Result:** +- Filters orders down to 1M rows first +- Joins only 1M orders with 100K customers +- Roughly 10x less join input, with correspondingly lower memory usage + +### Anti-Pattern 2: Multiple Cascading Aggregations + +**Problem:** +```sql +-- SLOW: Two aggregation passes, each rescanning the intermediate result +SELECT dept, COUNT(*) +FROM ( + SELECT dept, employee_id + FROM employees + GROUP BY dept, employee_id +) +GROUP BY dept +``` + +**Fix:** +```sql +-- FAST: Single aggregation +SELECT dept, COUNT(DISTINCT employee_id) +FROM employees +GROUP BY dept +``` + +### Anti-Pattern 3: Function on Indexed Column + +**Problem:** +```sql +-- SLOW: Index not usable +SELECT * +FROM orders +WHERE YEAR(order_date) = 2024 +``` + +**Explain Plan Shows:** +- Full table scan (no index used) +- Function evaluation on every row +- High CPU usage + +**Fix:** +```sql +-- FAST: Index usable +SELECT * +FROM orders +WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01' +``` + +**Impact:** +- Index used (order_date_idx) +- Partition pruning (if date-partitioned) +- Orders of magnitude faster + +### Anti-Pattern 4: SELECT * with Unused Columns + +**Problem:** +```sql +-- SLOW: Fetches all 50 columns +SELECT * +FROM customer_360 +WHERE region = 'WEST' +``` + +**Impact:** +- Fetches columns never used in join/filter +- Larger result sets over network +- More
memory required + +**Fix:** +```sql +-- FAST: Only needed columns +SELECT customer_id, name, region +FROM customer_360 +WHERE region = 'WEST' +``` + +### Anti-Pattern 5: Missing Index on Foreign Key + +**Problem:** +```sql +CREATE TABLE orders ( + order_id INT PRIMARY KEY, + customer_id INT, -- No index! + amount DECIMAL(10,2) +) +``` + +**Impact:** +- Joins on customer_id require full table scan +- Every join query slowly crawls through 100M rows +- High memory usage for hash join build + +**Fix:** +```sql +CREATE INDEX idx_orders_customer_id ON orders(customer_id); +``` + +**Impact:** +- Nested loop join uses index +- 100x faster for typical filters + +## Performance Benchmarking Approaches + +### Baseline Establishment (Week 1) + +**Measurements to Capture:** +``` +FOR EACH QUERY: + - Cold execution (cache cleared): 3 runs, take median + - Warm execution (cache primed): 5 runs, take median + - Peak memory: from task metrics + - Rows returned: from logs + - Data volume scanned: from task metrics + - CPU time: from task metrics + +AGGREGATE: + - P50, P95, P99 execution time + - Average rows per query + - Total data scanned per day + - Memory utilization pattern +``` + +### Benchmark Tools Setup + +```sql +-- Create benchmark table to track results +CREATE TABLE perf_benchmarks ( + benchmark_date DATE, + query_name VARCHAR(256), + execution_time_ms INT, + memory_peak_mb INT, + rows_returned INT, + status VARCHAR(20), + notes VARCHAR(500) +); + +-- Procedure to run benchmark +CREATE PROCEDURE benchmark_query(query_name VARCHAR, query_sql VARCHAR) AS + DECLARE start_time TIMESTAMP; + DECLARE end_time TIMESTAMP; + DECLARE execution_ms INT; +BEGIN + SET start_time = CURRENT_TIMESTAMP; + EXECUTE query_sql; + SET end_time = CURRENT_TIMESTAMP; + SET execution_ms = EXTRACT(EPOCH FROM (end_time - start_time)) * 1000; + INSERT INTO perf_benchmarks VALUES (CURRENT_DATE, query_name, execution_ms, ...); +END; +``` + +### Comparison Methodology (Post-Optimization) + +**Compare 
Against Baseline:** +``` +For each metric: + - Improvement % = (Baseline - New) / Baseline * 100 + - 20% improvement = significant + - 50%+ improvement = major optimization + - < 5% = noise, monitor for regression +``` + +**Statistical Significance:** +- Collect minimum 5 runs in stable conditions +- Use median (not mean) for outlier resistance +- Run for 2+ weeks to catch time-of-day effects +- Monitor for p-value < 0.05 on t-test + +### Regression Detection + +```sql +-- Daily performance check +SELECT + query_name, + AVG(execution_time_ms) as avg_time, + LAG(AVG(execution_time_ms)) OVER (PARTITION BY query_name ORDER BY benchmark_date) as prev_day_avg, + ROUND((AVG(execution_time_ms) - LAG(...)) / LAG(...) * 100, 2) as regression_pct +FROM perf_benchmarks +WHERE benchmark_date >= CURRENT_DATE - 7 +GROUP BY query_name, benchmark_date +HAVING regression_pct > 10 +ORDER BY regression_pct DESC +``` + +## Storage Tier Recommendations + +### Hot Tier (In-Memory) + +**Characteristics:** +- Access latency: < 1ms +- Cost: $50-100 per GB/month +- Capacity: Limited to available RAM + +**Candidates:** +- Dimension tables (< 1GB) +- Small reference data +- Frequently queried aggregations +- Real-time KPI tables + +**Example Allocation:** +``` +Customer Dimension: 500 MB +Product Dimension: 200 MB +Date Dimension: 50 MB +Daily Sales Summary: 2 GB +Region Master Data: 100 MB +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +TOTAL HOT: 3 GB +``` + +### Warm Tier (SSD/NVMe) + +**Characteristics:** +- Access latency: 1-10ms +- Cost: $5-20 per GB/month +- Capacity: Typically 50-500 GB + +**Candidates:** +- Medium-sized tables (1-50 GB) +- Week's worth of transaction data +- Aggregations queried daily +- Staging tables + +**Example Allocation:** +``` +Last 4 weeks sales: 30 GB +Last 90 days inventory: 5 GB +Transactional staging: 15 GB +Marketing analytics: 20 GB +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +TOTAL WARM: 70 GB +``` + +### Cold Tier (Object Store/Archive) + +**Characteristics:** +- Access latency: 
100ms-1s (may require restore) +- Cost: $1-5 per GB/month +- Capacity: Unlimited + +**Candidates:** +- Historical data (> 2 years) +- Infrequently accessed archives +- Compliance/regulatory data +- Backup data + +**Example Allocation:** +``` +2022 historical sales: 500 GB +2021 archive: 800 GB +Compliance records: 200 GB +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +TOTAL COLD: 1.5 TB +``` + +### Tiering Policy Implementation + +```sql +-- Auto-tier based on age +CREATE POLICY tier_by_age AS + IF data_age_days < 7 THEN hot_tier + ELSE IF data_age_days < 90 THEN warm_tier + ELSE cold_tier; + +-- Example: Automatic execution +ALTER TABLE sales_fact SET STORAGE POLICY tier_by_age; +``` + +### Cost-Benefit Analysis + +**Monthly Storage Cost Comparison:** + +| Strategy | Hot | Warm | Cold | Total | Notes | +|----------|-----|------|------|-------|-------| +| All Hot | 250 | - | - | $250 | Simple, expensive, limited capacity | +| Hot + Warm | 50 | 350 | - | $160 | Balanced, good performance | +| Hot + Warm + Cold | 50 | 70 | 75 | $130 | Optimal cost, tiering complexity | + +**Performance vs. Cost Trade-off:** +- Use tiering when potential savings > 20% +- Avoid over-tiering (management overhead) +- Monitor actual access patterns quarterly +- Adjust tiers based on usage evolution diff --git a/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/SKILL.md new file mode 100644 index 0000000..ea4212f --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/SKILL.md @@ -0,0 +1,771 @@ +--- +name: S/4HANA Import Assistant +description: "Specialized assistant for importing entities from SAP S/4HANA and BW/4HANA into Datasphere. Use this when connecting to SAP systems, selecting CDS views, configuring ODP extraction, setting up Cloud Connector, or enabling real-time delta loads." 
+--- + +# S/4HANA Import Assistant + +## Overview + +This skill provides comprehensive guidance for successfully importing data from SAP S/4HANA and BW/4HANA systems into SAP Datasphere. It covers the entire workflow from understanding available source objects to configuring extraction methods and monitoring data flows. + +## When to Use This Skill + +- **Initial S/4HANA connection**: Setting up Cloud Connector and DP Agent +- **Exploring available objects**: Finding the right CDS views, InfoProviders, or tables +- **Selecting semantically rich data**: Choosing extraction-enabled views over raw tables +- **Configuring delta extraction**: Setting up ODP for real-time or near-real-time data +- **Handling BW/4HANA objects**: Importing InfoProviders and CompositeProviders +- **Troubleshooting connectivity**: Resolving Cloud Connector or DP Agent issues +- **Managing delta queues**: Monitoring and recovering from extraction issues +- **Optimizing extraction performance**: Choosing the right extraction methods + +## The Import Entities Wizard Workflow + +The Import Entities wizard guides you through a structured process for bringing SAP source objects into Datasphere: + +### Step 1: Create or Select a Connection + +``` +Datasphere UI → Data Integrations → Connections +→ Create New or Select Existing S/4HANA Connection +``` + +**Required Information:** +- Connection Name (e.g., "SAP_S4H_PROD") +- Source System Type (S/4HANA, BW/4HANA, ECC, etc.) 
+- Host and Port +- Client Number +- Authentication Method (Basic, Certificate, OAuth) +- Cloud Connector Location ID (for on-premise systems) + +**Example Connection Configuration:** +- Source: SAP S/4HANA 2023 +- Host: s4h-prod.example.com +- Port: 443 +- Client: 100 +- Cloud Connector: CC_PROD_001 + +### Step 2: Create a New Replication or Data Integration Task + +``` +Data Integrations → New Task +→ Select Source System (from connections) +→ Choose Task Type (Replication Flow, Data Flow, or Transformation Flow) +``` + +### Step 3: Search and Select Source Objects + +The system searches for available objects in the SAP source system: + +**Search Criteria:** +- Object Type (CDS View, InfoProvider, ODP Extractor, Query) +- Business Area (Financial, Sales, Procurement, HR, Supply Chain) +- Object Name or Pattern +- Extraction Method (ODP, Database, Query) + +**Search Example:** +``` +Search: "C_*" +Filter by: CDS Views, Extraction-enabled only +Results: C_CUSTOMER, C_SALES_ORDER, C_INVOICE, ... 
+``` + +### Step 4: Configure Extraction Method + +Choose how data will be extracted: + +``` +Object: C_CUSTOMER (CDS View) +Available Extraction Methods: + ✓ ODP (Operational Data Provisioning) - Real-time, Delta support + ✓ Database Access - Direct SQL, Full only + ✓ API - If exposed as API + → Select ODP for Delta capability +``` + +### Step 5: Map Source to Target + +``` +Source CDS View: C_CUSTOMER +Fields: CUSTOMER_ID, CUSTOMER_NAME, EMAIL, PHONE, CREATED_DATE +↓ +Target Table (auto-created or manual selection) +CUSTOMER_MASTER (new table in Datasphere) +``` + +### Step 6: Configure Load Settings + +``` +Initial Load: + - Full Load: Yes, load all historical data + - Parallel Threads: 4 (for performance) + - Package Size: 10000 rows + +Delta Load: + - Delta Enabled: Yes + - Delta Type: ODP Change Data Capture + - Extraction Frequency: Every 15 minutes + - Watermark Field: CHANGENUMBER (from ODP) +``` + +### Step 7: Review and Activate + +``` +Review: + - Source: C_CUSTOMER (S/4HANA) + - Target: CUSTOMER_MASTER (Datasphere) + - Method: ODP with Delta + - Schedule: Automatic every 15 minutes + +Status: Ready to Activate +``` + +## Identifying Extraction-Enabled CDS Views in S/4HANA + +Not all CDS views are extraction-enabled. You must use views explicitly marked for extraction: + +### Characteristics of Extraction-Enabled CDS Views + +**Naming Convention:** +- Prefix `C_` : Consumption views, typically extraction-enabled (e.g., C_CUSTOMER, C_SALES_ORDER) +- Prefix `I_` : Interface (basic/composite) views; extraction-enabled only when annotated for it +- Prefix `P_` : Private views, not for direct consumption +- Prefix `DD_` : Domain-specific views + +**Extraction Capability Indicators:** + +1. **@Analytics Annotation** +```abap +@Analytics.dataCategory: #FACT +@Analytics.dataExtraction.enabled: true +define view C_SALES_ORDER as + select from vbak { + vbak.vbeln as SalesOrderNumber, + vbak.erdat as CreatedDate, + vbak.netwr as NetAmount + } +``` + +2.
**@Semantics Annotation** +```abap +@Semantics.amount.currencyCode: 'CurrencyCode' +@ObjectModel.Composition.RefreshingElement: true +``` + +### Finding CDS Views in S/4HANA + +**Via Transaction SE11 (ABAP Dictionary):** +``` +SE11 → CDS View → Search for C_* +Filter: "Author = SAP" AND "Extraction Enabled = X" +``` + +**Via Search in Datasphere:** +``` +Data Integrations → Search Catalog +Source: S/4HANA_PROD +Type: CDS View +Search Text: "customer" OR "sales" OR "invoice" +``` + +**Common Extraction-Enabled Views by Module:** + +| Module | View Name | Purpose | +|--------|-----------|---------| +| Finance | C_GENERALLEDGER | General Ledger transactions | +| Finance | C_CUSTOMER_INVOICE | Customer invoices | +| Finance | C_SUPPLIER_INVOICE | Supplier invoices | +| Sales | C_CUSTOMER | Customer master | +| Sales | C_SALES_ORDER | Sales orders | +| Sales | C_SALES_ORDER_ITEM | Sales order line items | +| Procurement | C_SUPPLIER | Supplier master | +| Procurement | C_PURCHASE_ORDER | Purchase orders | +| Procurement | C_PURCHASE_ORDER_ITEM | Purchase order lines | +| Inventory | C_MATERIAL | Material master | +| Inventory | C_MATERIAL_STOCK | Inventory balances | +| HR | C_EMPLOYEE | Employee master | +| HR | C_EMPLOYEE_SALARY | Salary information | + +### Verify Extraction Capability + +Use the `search_catalog` MCP tool: + +``` +search_catalog( + source="S/4HANA_PROD", + object_type="CDS_VIEW", + search_term="C_CUSTOMER", + extraction_enabled=True +) +``` + +Expected output includes extraction metadata: +```json +{ + "object_name": "C_CUSTOMER", + "extraction_enabled": true, + "extraction_types": ["ODP", "DATABASE"], + "delta_capable": true, + "fields": [ + { + "name": "CUSTOMER_ID", + "type": "STRING", + "key": true, + "changeable": false + }, + { + "name": "CUSTOMER_NAME", + "type": "STRING", + "changeable": true + } + ] +} +``` + +## Understanding CDS View Annotations + +Annotations in CDS views control behavior and extraction characteristics: + +### @Analytics 
Annotations + +```abap +@Analytics.dataCategory: #FACT -- Type: FACT, DIMENSION, CUBE, QUERY +@Analytics.dataExtraction.enabled: true -- Enable extraction +@Analytics.dataExtraction.deltaSupported: true -- Support delta extractions +``` + +### @ObjectModel Annotations + +```abap +@ObjectModel.readOnly: true -- View is read-only +@ObjectModel.transactional: false -- Not transactional +@ObjectModel.usageType: #FACT -- Usage classification +@ObjectModel.Composition.RefreshingElement: true -- Refresh semantics +``` + +### @Semantics Annotations + +```abap +@Semantics.amount.currencyCode: 'CurrencyCode' -- Currency field +@Semantics.quantity.unitOfMeasure: 'UnitOfMeasure' -- UOM field +@Semantics.calendar.date: true -- Date field +@Semantics.businessKey: true -- Business key +``` + +### @EndUserText Annotations + +```abap +@EndUserText.label: 'Customer Master' +@EndUserText.description: 'Extraction-enabled customer master data' +@EndUserText.quickInfo: 'All customers from KNVP and KNVV tables' +``` + +## ODP (Operational Data Provisioning) Extractors + +ODP is SAP's modern extraction framework, replacing older RFC-based methods: + +### ODP Architecture + +``` +S/4HANA System +├── ODP Provider (e.g., CDS View C_CUSTOMER) +├── ODP Context (e.g., ABAP:CDS_VIEWS) +├── Change Log Table +├── Delta Queue (Pending changes) +└── Watermark (Last extracted position) +``` + +### ODP Extractor Types + +| Type | Source | Use Case | Delta Support | +|------|--------|----------|----------------| +| CDS_VIEWS | Extraction-enabled CDS views | Modern standard objects | Yes (FULL) | +| FUNCTION | RFC function modules | Custom extraction logic | Yes | +| TABLE | Database tables | Direct table access | Yes | +| QUERY | ABAP query/report | Parameterized extraction | No | +| LOGICAL_LOG | Application log/changes | Event-based changes | Yes | + +### Configuring ODP Delta Load + +``` +Source: C_SALES_ORDER (ODP Provider) +Delta Configuration: + - Delta Type: FULL_THEN_DELTA + - Semantics: 
Changed records only + - Key Fields: SALES_ORDER_NUMBER (determines uniqueness) + - Extraction Sequence: Change Number (internal counter) + +Example Delta First Run: +1. Full load: all records +2. Store watermark: CHANGENUMBER = 1000000 + +Example Delta Second Run: +1. Query: CHANGENUMBER > 1000000 +2. Delta load: only changed since last run +3. Update watermark: CHANGENUMBER = 1000050 +``` + +### ODP Change Data Capture (CDC) + +ODP tracks changes via change numbers and delta queues: + +``` +Change Number = Internal sequence counter +When a record changes: + → ODP logs the change + → Increments change number + → Stores in delta queue + → Available until queue is cleared (usually 3-8 days) + +Delta Extraction Sequence: + Initial Load (FULL): Get CHANGENUMBER = 500000 + Delta Load (Day 2): WHERE CHANGENUMBER > 500000 + Delta Load (Day 3): WHERE CHANGENUMBER > 500050 + ... +``` + +## ABAP CDS Views: Released APIs vs Custom Views + +### SAP-Released CDS Views (Safe for Production) + +**Characteristics:** +- Prefix: `C_` (consumption) or published by SAP +- Fully documented and supported by SAP +- Extraction-enabled with stable field lists +- Backward compatible across releases +- Published in SAP API Hub + +**Example - C_GENERALLEDGER:** +```abap +@VDM.viewType: #CONSUMPTION +@ObjectModel.semanticKey: ['CompanyCode', 'DocumentNumber', 'FiscalYear'] +@Analytics.dataCategory: #FACT +@Analytics.dataExtraction.enabled: true +define view C_GENERALLEDGER as + select from fin_gl_posting as posting { + posting.bukrs as CompanyCode, + posting.belnr as DocumentNumber, + posting.gjahr as FiscalYear, + posting.dmbtr as Amount, + posting.budat as DocumentDate + } + where posting.bukrs <> '' + and posting.dmbtr <> 0; +``` + +### Custom CDS Views (Use with Caution) + +**Risks:** +- No support guarantee from SAP +- Fields may change or disappear +- Not backward compatible +- May not have extraction enabled +- Performance not optimized for extraction + +**When to use custom views:** 
+- SAP standard view doesn't exist +- Need to combine multiple tables +- Apply complex filtering/calculations +- Specific business requirements + +**Creating a custom extraction-enabled view:** +```abap +@VDM.viewType: #CONSUMPTION +@Analytics.dataCategory: #FACT +@Analytics.dataExtraction.enabled: true +@Analytics.dataExtraction.deltaSupported: true +@EndUserText.label: 'Custom Sales Analysis' +define view Z_CUSTOM_SALES as + select from vbak as orders + inner join vbap as items on orders.vbeln = items.vbeln + left outer join vbrk as invoices on orders.vbeln = invoices.xblnr { + orders.vbeln as SalesOrder, + items.posnr as LineNumber, + orders.erdat as CreatedDate, + items.netwr as NetAmount + }; +``` + +## BW/4HANA Objects: InfoProviders and CompositeProviders + +BW/4HANA uses InfoProviders as the primary data container for analytics: + +### InfoProvider Types + +**Standard InfoProvider (Cube):** +``` +BW/4HANA → InfoProvider: /BIC/SALES (Facts and dimensions) +├── Fact Table: /BIC/FSALES00 +├── Dimensions: +│ ├── 0CUSTOMER (Customer dimension) +│ ├── 0MATERIAL (Material dimension) +│ └── 0PLANT (Plant dimension) +└── Measures: Sales Value, Quantity, Margin +``` + +**CompositeProvider (Virtual Cube):** +``` +CompositeProvider: /BIC/COMP_SALES +├── InfoProvider 1: /BIC/SALES (Current year) +├── InfoProvider 2: /BIC/SALES_ARCHIVE (Prior years) +└── Query logic: UNION with transformation rules +``` + +### Extraction from BW/4HANA + +**Via ODP:** +``` +BW System → ODP Provider (BW:INFOPROVIDER) +Source: /BIC/SALES +Type: InfoProvider +Delta: Change request number (BW concept) +Extraction: All cubes support ODP extraction +``` + +**Configuration:** +``` +Source BW Object: /BIC/SALES +Characteristics (Dimensions): + - 0CUSTOMER + - 0MATERIAL + - 0PLANT + +Key Figures (Measures): + - 0SALES_VALUE + - 0QUANTITY + - 0GROSS_MARGIN + +Load Setting: + - Initial Load: Full (all data) + - Delta: Request-based (BW change request tracking) +``` + +### BW Query Extraction + +**Query Option
(Extract from OLAP Query):** +``` +BW Query: Z_SALES_ANALYSIS +└── Extract directly as data source +├── Advantages: Complex logic defined in query, pre-calculated +├── Disadvantages: Delta not supported, slower than ODP +└── Use case: Aggregated reporting data +``` + +## Connection Prerequisites + +### Cloud Connector Setup (On-Premise Systems) + +The Cloud Connector acts as a reverse proxy tunnel between cloud and on-premise SAP systems: + +**Installation Checklist:** + +1. **Procurement** + - [ ] Download Cloud Connector from SAP downloads + - [ ] Obtain license key + - [ ] Allocate VM or physical server (4GB RAM minimum) + +2. **Installation** + - [ ] Install on machine in SAP network + - [ ] Configure JDK (Java 11+) + - [ ] Set up HTTPS certificates + - [ ] Configure administrative user + +3. **Configuration** + ``` + Administration UI: https://localhost:8443 + Add Backend System: + - System Type: SAP System + - Host: s4h-prod.example.com + - Port: 443 + - Protocol: HTTPS + - Virtual Host: s4h-prod (for cloud access) + + Resource Mapping: + - URL Pattern: /sap/opu/odata/sap/* + - URL Regex: ^/sap/opu/odata/sap/.* + - Check: YES (enabled) + ``` + +4. **Certificate Exchange** + - [ ] Export Cloud Connector certificate + - [ ] Import in S/4HANA trusted store + - [ ] Exchange client certificate if using mutual TLS + +5. 
**Testing** + - [ ] Test from Datasphere → Cloud Connector → SAP system + - [ ] Verify ODP availability + - [ ] Check latency and throughput + +**Cloud Connector Architecture:** +``` +Datasphere (Cloud) + ↓ HTTPS +Cloud Connector (DMZ/Internal Network) + ↓ HTTPS/HTTP +S/4HANA System (On-Premise) +``` + +### DP Agent Setup (Alternative to Cloud Connector) + +Data Provisioning Agent is used in some scenarios: + +**When to use DP Agent:** +- Multi-tier network (requires agent in SAP network) +- Firewall restrictions prevent Cloud Connector +- Batch/scheduled extractions preferred + +**Installation:** +``` +Download: SAP Datasphere → Administration → DP Agent Download +Install on: Server in SAP network with access to source systems +Configure: + - Datasphere tenant URL + - OAuth credentials or certificate + - Source system connections + +Start Agent: .\dpsagent.exe start +Verify: Check "Agent Status" in Datasphere Administration +``` + +**DP Agent Communication:** +``` +Datasphere → HTTPS → DP Agent (On-Premise) + ← Data ← Source Systems +``` + +## Delta Extraction Patterns + +### Pattern 1: Timestamp-Based Delta (Most Common) + +**Setup:** +``` +Source View: C_CUSTOMER +Delta Field: CHANGED_AT (timestamp) +ODP Configuration: + - Track changes via timestamp + - Load only records changed since last run +``` + +**Extraction Sequence:** +``` +Day 1: Full load all customers (CHANGED_AT <= 2024-01-15) + Save watermark: 2024-01-15 23:59:59 + +Day 2: Delta load where CHANGED_AT > 2024-01-15 23:59:59 + Changes: 150 new/updated records + New watermark: 2024-01-16 23:59:59 + +Day 3: Delta load where CHANGED_AT > 2024-01-16 23:59:59 + Changes: 75 new/updated records +``` + +**Advantages:** Simple, reliable, handles late arrivals +**Disadvantages:** Requires timestamp maintenance + +### Pattern 2: Change Number Sequence (ODP Native) + +**Setup:** +``` +Source: C_SALES_ORDER +ODP provides: Change Number (CHANGENUMBER field) +Automatic tracking by ODP +``` + +**Extraction 
Sequence:** +``` +Day 1: Full load with CHANGENUMBER up to 5000000 +Day 2: Delta where CHANGENUMBER > 5000000 + Extract: records 5000001 to 5000500 +Day 3: Delta where CHANGENUMBER > 5000500 +``` + +**Advantages:** ODP handles completely, most reliable +**Disadvantages:** Limited history (3-8 days typically) + +### Pattern 3: Logical Change Document (For Complex Changes) + +**Setup:** +``` +Source: BW InfoProvider +Tracking: Change request numbers +Each request = batch of changes +``` + +**Handling:** +``` +Request 001: 1000 rows inserted +Request 002: 500 rows updated +Request 003: 100 rows deleted + +Delta load pulls complete requests (no partial) +``` + +## Best Practices for S/4HANA Imports + +### 1. Choose the Right Extraction Method + +**Decision Tree:** +``` +Need Real-time Data? +├─ YES → Use ODP if available +│ └─ Delta every 5-15 minutes +└─ NO → Can batch load suffice? + ├─ YES → Use Database access or ODP batch + └─ NO → Use ODP with frequent schedule +``` + +### 2. Identify Semantically Rich Objects + +**Semantic Richness Hierarchy:** +``` +1. SAP-provided consumption CDS view (C_*) ✓✓✓ (Use this) +2. SAP cluster/pooled table via CDS ✓✓ +3. SAP standard table via CDS ✓ +4. Custom CDS view ✓ (Use with caution) +5. Raw SAP table ✗ (Avoid if possible) +``` + +**Why it matters:** +- Consumption views have business logic applied +- Transformations and hierarchies built-in +- Data quality rules enforced +- Fields well-documented and stable + +### 3. 
Handle Hierarchies Correctly + +**Hierarchy Example (0CUSTOMER dimension):** +``` +Hierarchy Structure: +Region +├── North America +│ ├── USA +│ │ ├── East Region +│ │ └── West Region +│ └── Canada +└── Europe + ├── UK + └── Germany + +Extract with Hierarchy: +- Get all leaf customers +- Include parent relationships +- Track parent changes (SCD Type 2) +``` + +**Configuration:** +``` +BW InfoProvider: /BIC/SALES +Dimension: 0CUSTOMER +Include: Hierarchy levels + - CUSTHIER01 (Standard customer hierarchy) + - With parent-child relationships +``` + +### 4. Manage Load Volumes + +**Large Initial Loads:** +``` +Source: C_GENERALLEDGER (500M rows) +Strategy: + - Partition by date range + - Load in parallel batches + - Monitor memory and storage + - Estimate: 500M rows ≈ 100GB compressed +``` + +**Configuration:** +``` +Parallel Threads: 4-8 (for initial load) +Package Size: 100,000 rows per package +Estimated Duration: 8-12 hours +Monitor: Memory, CPU, Network +``` + +### 5. Monitor Delta Queue Health + +**Check Delta Queue Status:** +``` +Use MCP tool: test_connection(source="S/4HANA_PROD") +Verify: + - Delta queue not overflowing + - Change documents available + - No extraction errors +``` + +**Delta Queue Limits:** +``` +Typical retention: 3-8 days +Size limit: 2-4GB per table +If exceeded: Must do full reload + +Monitor: + - Days of changes retained + - % of queue capacity used + - Time since last cleanup +``` + +## MCP Tool References + +### search_catalog +Find available objects in S/4HANA or BW/4HANA systems: + +``` +search_catalog( + source="S/4HANA_PROD", + object_type="CDS_VIEW", + search_term="customer", + extraction_enabled=True, + module="SALES" +) +``` + +### get_space_assets +View existing objects imported into Datasphere spaces: + +``` +get_space_assets( + space_name="FINANCE", + asset_type="TABLE", + source="S/4HANA_PROD" +) +``` + +### search_repository +Search Datasphere repository for specific objects: + +``` +search_repository( + 
object_type="REPLICATION_FLOW", + name_contains="CUSTOMER", + source_system="S/4HANA" +) +``` + +### test_connection +Verify connectivity to SAP systems and ODP availability: + +``` +test_connection( + connection_name="SAP_S4H_PROD", + verify_odp=True, + check_delta_queue=True +) +``` + +## Reference Materials + +See reference files for detailed procedures: +- `references/s4hana-integration-guide.md` - CDS views by functional area, ODP configuration, Cloud Connector and DP Agent setup +- `references/cds-replication-architecture.md` - End-to-end architecture for CDS view replication (RMS, RDB, CDC Engine), setup checklist, CDS annotation requirements, and troubleshooting quick reference + +## Next Steps + +1. **Gather prerequisites** using test_connection +2. **Search for objects** using search_catalog +3. **Evaluate semantic richness** of available CDS views +4. **Plan extraction method** (ODP for delta, Database for full) +5. **Set up Cloud Connector** if on-premise system +6. **Create and activate** replication flow +7. 
**Monitor** initial load and subsequent deltas + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/cds-replication-architecture.md b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/cds-replication-architecture.md new file mode 100644 index 0000000..7e4c436 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/cds-replication-architecture.md @@ -0,0 +1,149 @@ +# CDS View Replication Architecture: S/4HANA On-Premise to Datasphere + +## End-to-End Architecture + +### Datasphere Side (Target) +- Replication Flow UI/Services/Repository +- vFlow → VFlow-sub-abap → Replication Management Service (RMS) +- Axino/ABAP Pipeline Engine +- Replicated local table (target) + +### Network Layer +- SAP Cloud Connector: Bridges cloud-to-on-premise connectivity + +### S/4HANA Side (Source) +- CDS View with @Analytics.dataExtraction.enabled: true +- For Initial Only: Source table → RDB Buffer (direct) +- For Initial and Delta: Source table → Internal SQL View → CDC Engine (Master/Subscriber Logging Tables) → RDB Buffer + +### Data Flow Summary + +1. **Initial Only**: CDS View → RDB Buffer Tables → (Cloud Connector) → RMS → Local Table in Datasphere +2. 
**Initial and Delta**: CDS View → CDC Engine → Master Logging Table → Subscriber Logging Table → RDB Buffer → (Cloud Connector) → RMS → Local Table + +## CDS View Requirements + +### Mandatory Annotations + +``` +@Analytics:{ + dataExtraction: { + enabled: true, + delta.changeDataCapture.automatic: true // for Initial and Delta + } +} +``` + +- `dataExtraction.enabled: true` — Required for any replication +- `delta.changeDataCapture.automatic: true` — Required for delta capability (simple CDS views) +- `delta.changeDataCapture.mapping` — Alternative for complex CDS views with explicit field mapping +- See SAP Note 2890171 for complete CDS view requirements + +### CDS View Validation + +Use Transaction SDDLAR on source system with the following options: + +- **Display DDL Source**: Review annotation correctness +- **Check DDL Source**: Validate syntax and consistency +- **Data Preview**: Confirm data is extractable +- **Show ROOT/COMPOSITION relations**: Check view hierarchy + +## Setup Checklist + +### Step 1: Cloud Connector + +- Install and configure SAP Cloud Connector +- Add to Datasphere Configuration → Cloud Connector list +- Configure access paths for required services +- References: "Preparing Cloud Connector Connectivity", "Prepare Connectivity to SAP S/4HANA On-Premise" + +### Step 2: Create S/4HANA On-Premise Connection + +Connection Details: +- SAP Logon Connection Type (Application Server) +- Application Server +- System Number +- Client +- Language + +Cloud Connector Configuration: +- Use Cloud Connector = true +- Location +- Virtual Host/Port + +### Step 3: Validate Connection + +Required validation result: +- Must show: "Replication flows are enabled" + +Also verify: +- Data flows enabled +- Remote tables status +- Model Import status +- For validation errors: SAP KBA 3369433 + +### Step 4: Prepare CDS Views on Source + +- Ensure required annotations are present +- Activate CDS views +- Verify in CDS_EXTRACTION container + +### Step 5: Create 
Replication Flow in Datasphere + +1. Data Builder → New Replication Flow +2. Select Source Connection (S/4HANA) +3. Select Source Container: CDS_EXTRACTION folder +4. Select Source Objects (CDS views) +5. Select Target Connection: SAP Datasphere +6. Set Load Type (Initial Only or Initial and Delta) +7. Save and Deploy before running + +### Step 6: Run and Monitor + +Monitor via Data Integration Monitor: +- Run Status +- Object Status +- Messages +- Metrics + +**For Initial and Delta Replication**: +- Run Status = ACTIVE (RETRYING OBJECTS) between deltas is NORMAL +- Object Status cycles: INITIAL RUNNING → RETRYING → DELTA RUNNING → RETRYING +- Delta Load Interval configurable (Hours/Minutes) + +## Source Container Notes + +- Standard CDS views appear in CDS_EXTRACTION root folder +- If CDS view is not visible in CDS_EXTRACTION: + 1. Confirm `@Analytics.dataExtraction.enabled: true` annotation + 2. Confirm data can be extracted via RODPS_REPL_TEST + 3. Verify communication user has required authorizations + 4. Verify user has authorization to access the specific CDS view + +## Troubleshooting Quick Reference + +### Pre-Runtime Checks + +1. All SAP Notes from SAP Note 2890171 applied on source system +2. CDS view validated with SDDLAR +3. Connection validates with "Replication flows are enabled" + +### Runtime Error Investigation + +1. Check Data Integration Monitor → Messages for error text +2. Check SLG1 on source system with objects DHAPE or DHCDC +3. Check DHRDBMON for buffer table status +4. Check DHCDCMON for delta/CDC status (Initial and Delta only) +5. 
Verify Observer and Transfer jobs are green in DHCDCMON Job Settings + +### Performance Investigation + +- System Monitor Dashboard in Datasphere +- HANA Cockpit for CPU and memory +- HANA indexserver log (access via SAP KBA 3476918) +- Check partitioning configuration + +### Testing Outside Datasphere + +- Report RODPS_REPL_TEST: Test extraction for the CDS view's corresponding SQL view +- Transaction ODQMON: Verify record count after test extraction diff --git a/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/s4hana-integration-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/s4hana-integration-guide.md new file mode 100644 index 0000000..16c8e06 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-s4hana-import/references/s4hana-integration-guide.md @@ -0,0 +1,532 @@ +# S/4HANA Integration Reference Guide + +## Common S/4HANA CDS Views by Functional Area + +### Finance Module (FI/CO) + +| View Name | Description | Key Fields | Delta Support | Typical Volume | +|-----------|-------------|-----------|---|---| +| C_GENERALLEDGER | General ledger transactions | Company Code, Account, Amount, Date | Yes | 100M-500M | +| C_ACCOUNT_MASTER | Chart of accounts | Company Code, Account, Type, Currency | No | 10K-100K | +| C_COSTCENTER | Cost center master | Cost Center, Name, Manager | No | 1K-10K | +| C_PROFITCENTER | Profit center master | Profit Center, Name, Controller | No | 100-1K | +| C_CUSTOMER_INVOICE | Customer invoices | Invoice#, Customer, Amount, Date | Yes | 10M-50M | +| C_SUPPLIER_INVOICE | Vendor invoices | Invoice#, Vendor, Amount, Date | Yes | 10M-50M | +| C_PAYMENT_HISTORY | Payment transactions | Payment#, Customer/Vendor, Date | Yes | 5M-20M | +| C_BANK_RECONCILIATION | Bank transactions | Bank Account, Transaction, Date | Yes | 1M-10M | + +**Extraction Recommendation:** +- Use ODP for all transactions with delta enabled +- Full load for masters (daily or weekly) +- Delta 
frequency: 15 minutes for transactions + +### Sales Module (SD) + +| View Name | Description | Key Fields | Delta Support | Typical Volume | +|-----------|-------------|-----------|---|---| +| C_CUSTOMER | Customer master | Customer#, Name, Region, Currency | Yes | 10K-100K | +| C_SALES_ORDER | Sales orders | Order#, Customer, Date, Amount | Yes | 100K-1M | +| C_SALES_ORDER_ITEM | Order line items | Order#, Item#, Material, Qty, Price | Yes | 1M-10M | +| C_INVOICE | Customer invoices | Invoice#, Customer, Date, Amount | Yes | 100K-1M | +| C_CREDIT_MEMO | Credit memos | Memo#, Customer, Amount, Reason | Yes | 10K-100K | +| C_DELIVERY | Deliveries | Delivery#, Order#, Date, Status | Yes | 100K-1M | +| C_SALES_FORECAST | Sales forecasts | Customer, Material, Month, Qty | Yes | 100K-1M | + +**Extraction Recommendation:** +- Delta enabled for all documents +- 5-minute frequency for real-time sales dashboards +- Use for sales analytics, AR aging, pipeline tracking + +### Procurement Module (MM/PO) + +| View Name | Description | Key Fields | Delta Support | Typical Volume | +|-----------|-------------|-----------|---|---| +| C_SUPPLIER | Supplier master | Supplier#, Name, Region, Currency | Yes | 1K-10K | +| C_PURCHASE_ORDER | Purchase orders | PO#, Supplier, Date, Amount | Yes | 100K-1M | +| C_PURCHASE_ORDER_ITEM | PO line items | PO#, Item#, Material, Qty, Price | Yes | 1M-10M | +| C_SUPPLIER_INVOICE | Vendor invoices | Invoice#, Supplier, Date, Amount | Yes | 10M-50M | +| C_MATERIAL | Material master | Material#, Description, Type, UOM | No | 10K-100K | +| C_MATERIAL_STOCK | Inventory balances | Plant, Material, Qty, Value | Yes | 100K-1M | +| C_PURCHASE_REQUISITION | Purchase requisitions | PR#, Date, Requester, Amount | Yes | 100K-1M | + +**Extraction Recommendation:** +- Masters: Full load daily or weekly +- Transactions: Delta every 5-15 minutes +- Stock: Delta with high frequency (5 min) if needed for inventory management + +### Human Resources (HR) + +| View Name 
| Description | Key Fields | Delta Support | Typical Volume | +|-----------|-------------|-----------|---|---| +| C_EMPLOYEE | Employee master | Employee#, Name, Department, Manager | Yes | 1K-10K | +| C_EMPLOYEE_SALARY | Salary information | Employee#, Period, Amount, Currency | Yes | 100K-1M | +| C_EMPLOYEE_TIME | Time tracking | Employee#, Date, Hours, Activity | Yes | 1M-10M | +| C_ORGANIZATIONAL_UNIT | Org structure | Org Unit#, Name, Parent, Manager | No | 100-1K | +| C_EMPLOYEE_SKILL | Skill master | Employee#, Skill, Level, Date | Yes | 10K-100K | + +**Extraction Recommendation:** +- Masters: Weekly full load +- Transactions: Daily delta +- Use for HR analytics, headcount, skill gap analysis + +### Supply Chain Module (SCM) + +| View Name | Description | Key Fields | Delta Support | Typical Volume | +|-----------|-------------|-----------|---|---| +| C_DEMAND | Demand forecast | Product, Region, Month, Quantity | Yes | 100K-1M | +| C_SUPPLY | Supply availability | Plant, Material, Date, Quantity | Yes | 100K-1M | +| C_PURCHASE_REQUISITION | PR forecast | Date, Material, Quantity, Required Date | Yes | 100K-1M | + +## CDS Annotation Reference + +### Complete Annotation Examples + +**Complete C_CUSTOMER Example:** +```abap +@VDM.viewType: #CONSUMPTION +@VDM.Datalake:true +@Analytics.dataCategory: #DIMENSION +@Analytics.dataExtraction.enabled: true +@Analytics.dataExtraction.deltaSupported: true + +@ObjectModel.readOnly: true +@ObjectModel.usageType: #DOCUMENT +@ObjectModel.semanticKey: ['CustomerID'] +@ObjectModel.Composition.RefreshingElement: true + +@Semantics.businessKey: true +@Semantics.text: {element: ['CustomerName']} + +@EndUserText.label: 'Customer Master' +@EndUserText.description: 'Extraction-enabled customer master data for analytics' +@EndUserText.quickInfo: 'Customer dimensions for sales and finance' + +@UI.headerInfo: { typeName: 'Customer', typeNamePlural: 'Customers' } + +define view C_CUSTOMER as + select from kna1 { + 
@Semantics.businessKey: true + kna1.kunnr as CustomerID, + + @Semantics.text: true + kna1.name1 as CustomerName, + + kna1.ort01 as City, + + @Semantics.location.longitude: true + kna1.location_lon as Longitude, + + @Semantics.location.latitude: true + kna1.location_lat as Latitude, + + @Semantics.amount.currencyCode: 'CurrencyCode' + kna1.netwr as NetAnnualRevenue, + + kna1.waers as CurrencyCode, + + kna1.erdat as CreatedDate, + + @Semantics.systemDateTime.lastChangedAt: true + kna1.aedat as LastChangedDate + }; +``` + +### Annotation Quick Reference Table + +| Annotation | Purpose | Example | +|-----------|---------|---------| +| @VDM.viewType | Defines view layer | #CONSUMPTION (exposed), #INTERNAL, #DERIVED | +| @Analytics.dataCategory | Data role | #FACT, #DIMENSION, #CUBE, #QUERY | +| @Analytics.dataExtraction.enabled | Enable extraction | true/false | +| @Analytics.dataExtraction.deltaSupported | Delta capable | true/false | +| @ObjectModel.readOnly | Prevents writes | true/false | +| @ObjectModel.semanticKey | Unique identifier | ['Field1', 'Field2'] | +| @Semantics.businessKey | Business identifier | true/false | +| @Semantics.text | Description field | {element: ['FieldName']} | +| @Semantics.amount.currencyCode | Currency field | 'CurrencyField' | +| @Semantics.quantity.unitOfMeasure | Unit field | 'UOMField' | +| @Semantics.calendar.date | Date semantics | true/false | +| @EndUserText.label | User-friendly name | 'Customer Master' | +| @EndUserText.description | Detailed description | 'Customer master data...' 
+|
+
+## ODP Extractor Configuration
+
+### ODP Context Types
+
+```
+ABAP:CDS_VIEWS
+├── Source: ABAP CDS Views
+├── Providers: All extraction-enabled views (C_*)
+├── Delta: Yes, via ODP change log
+└── Best For: Standard extraction
+
+SAP_HANA:CALCULATION_VIEWS
+├── Source: SAP HANA calculation views
+├── Providers: Published views
+├── Delta: Limited
+└── Best For: Pre-calculated aggregates
+
+BW:INFOPROVIDER
+├── Source: BW InfoProviders
+├── Providers: Cubes, DSO, InfoObjects
+├── Delta: Yes, via BW requests
+└── Best For: Integrated BW data
+
+LOGICAL_LOG
+├── Source: Application change logs
+├── Providers: Business transactions
+├── Delta: Event-based
+└── Best For: Audit trail, compliance
+```
+
+### ODP Configuration Best Practices
+
+**Enable Change Log in Source System:**
+```
+S/4HANA → SPRO → Source System Settings → ODP
+└─ Activate change logging for CDS views:
+   - Set retention period: 8 days
+   - Set log size limit: 4GB
+   - Set parallel write threads: 4
+```
+
+**Define ODP Extractor:**
+```xml
+<!-- Schematic ODP provider configuration; element names are illustrative,
+     the actual settings are maintained in the ODP/Datasphere tooling -->
+<OdpProvider>
+  <Context>ABAP:CDS_VIEWS</Context>
+  <Provider>C_CUSTOMER</Provider>
+  <LoadBehavior>
+    <Mode>FULL_THEN_DELTA</Mode>
+    <InitialLoad>FULL</InitialLoad>
+    <DeltaMethod>CHANGE_LOG</DeltaMethod>
+  </LoadBehavior>
+</OdpProvider>
+```
+
+**ODP Request Semantics:**
+```
+Request Concept in ODP:
+  - Request = Batch of extractions for a run
+  - Request ID = Uniquely identifies extraction batch
+  - Watermark = Position within change log
+  - Delta Request = Request for changes since last run
+
+Full Load Request:
+  SELECT_ALL = true
+  FROM_CHANGENUMBER = 0
+  Result: All records
+
+Delta Request:
+  SELECT_ALL = false
+  FROM_CHANGENUMBER = 1000000
+  Result: Only changes since change number 1000000
+```
+
+## Cloud Connector Setup Checklist
+
+### Pre-Installation
+
+- [ ] VM allocated: 4GB RAM minimum, 100GB storage
+- [ ] Network: VM can reach S/4HANA system
+- [ ] Java: JDK 11 or 17 installed and JAVA_HOME set
+- [ ] SSL Certificate: Valid certificate for CC domain or self-signed accepted
+- [ ] Firewall: Port 8443 (admin UI), port 443 (backend)
+- [ ] User: Admin user for CC
installation + +### Installation Steps + +```bash +# 1. Download Cloud Connector +cd /opt/sap +wget https://tools.hana.ondemand.com/additional/sapcc-latest.tar.gz +tar -xzf sapcc-latest.tar.gz + +# 2. Set environment +export JAVA_HOME=/usr/lib/jvm/java-11-openjdk +export PATH=$PATH:$JAVA_HOME/bin + +# 3. Start Cloud Connector +cd /opt/sap/scc +./go.sh start + +# 4. Access Admin UI +# https://localhost:8443/admin +# Default: Administrator / manage +``` + +### Configuration Steps + +**1. Add Backend System:** +``` +Admin Console → Cloud Connectors → Connector Group +→ Connected Backend Systems → + +Name: SAP_S4H_PROD +Type: SAP System +Host: s4h-prod.example.com +Port: 443 +Principal Type: User Name/Password +User: DATASPH_TECH +Password: [encrypted in CC] +``` + +**2. Add Resource Mapping:** +``` +Resource: /sap/opu/odata/sap/* +Check: YES (Allow) +Semantics: Access allowed +Protocol: HTTP/HTTPS + +Resource: /sap/bc/odata/v4/* +Check: YES +Protocol: HTTP/HTTPS + +Resource: /sap/bc/adt/* +Check: YES +Protocol: HTTP/HTTPS +``` + +**3. Configure High Availability:** +``` +Multiple Cloud Connectors: +CC1: Primary (Active) +CC2: Standby (Passive) + +Configure in Datasphere: +Connection Settings → Cloud Connector +Connector Group: SCC_GROUP_PROD +Connectors: CC1, CC2 (failover enabled) +``` + +**4. 
SSL Certificate:** +```bash +# Generate self-signed (development only) +openssl req -x509 -nodes -days 365 -newkey rsa:2048 \ + -keyout /opt/sap/scc/server.key \ + -out /opt/sap/scc/server.crt + +# Import in SAP system +STRUST transaction → Certificate +Import Cloud Connector certificate +``` + +### Cloud Connector Monitoring + +**Performance Metrics:** +``` +Admin Console → Monitoring +├── Requests/second: Target >100 req/s +├── Latency: Target <100ms average +├── Throughput: Target >50MB/s +└── Error Rate: Target <0.1% +``` + +**Health Check:** +``` +Backend Connectivity: + ✓ HTTPS port 443 reachable + ✓ SSL handshake successful + ✓ Authentication successful + ✓ ODP availability confirmed +``` + +## DP Agent Installation and Configuration + +### System Requirements + +``` +OS: Windows Server 2016+ or Linux (RHEL/SUSE) +CPU: 4 cores minimum +RAM: 8GB minimum (16GB recommended) +Storage: 50GB free space +Java: JDK 11+ +Network: Direct access to S/4HANA and Datasphere +Firewall: Outbound HTTPS to Datasphere tenant +``` + +### Installation (Windows) + +```powershell +# 1. Download DP Agent from Datasphere +# Administration → Data Provisioning → Download DP Agent + +# 2. Extract and install +Expand-Archive -Path dpsagent-*.zip -DestinationPath C:\dpsagent +cd C:\dpsagent + +# 3. Run installer +.\install.exe /path C:\dpsagent /user datasph_agent /password [pass] + +# 4. Configure +notepad conf\agent.properties +# Set: +# DATASPHERE_TENANT=https://[tenant].datasphere.cloud.sap +# DATASPHERE_USER=dpsagent@company.com +# DATASPHERE_PASSWORD=[oauth_token] +# SAP_SYSTEM_HOST=s4h-prod.example.com +# SAP_SYSTEM_CLIENT=100 + +# 5. Start service +Start-Service dpsagent + +# 6. Verify +.\dpsagent status +``` + +### Installation (Linux) + +```bash +# 1. Download and extract +cd /opt/sap +unzip dpsagent-*.zip +cd dpsagent + +# 2. Configure +nano conf/agent.properties +# Set properties as above + +# 3. 
Create systemd service (minimal illustrative unit; adjust paths to your install)
+sudo tee /etc/systemd/system/dpsagent.service > /dev/null <<'EOF'
+[Unit]
+Description=SAP Data Provisioning Agent
+After=network.target
+
+[Service]
+Type=simple
+User=dpsagent
+ExecStart=/opt/sap/dpsagent/dpsagent start
+Restart=on-failure
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+# 4. Start and verify
+sudo systemctl enable --now dpsagent
+sudo systemctl status dpsagent
+```
+
+### Delta Load Troubleshooting
+
+**Symptom: Delta load fails with watermark error**
+```
+Cause: Stored watermark > current max change number in source
+Solution:
+  1. Full reload to get current watermark
+  2. Restart delta from new watermark
+```
+
+**Symptom: No delta records extracted**
+```
+Cause: No changes since last run
+Solution:
+  1. Verify source data actually changed
+  2. Check watermark = previous max
+  3. Query source directly for verification
+```
+
+**Symptom: Duplicate records in delta load**
+```
+Cause: Duplicate changes in log
+Solution:
+  1. Deduplicate on key fields
+  2. Use SCD Type 2 for dimension tables
+  3. Verify merge logic in target
+```
+
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md
new file mode 100644
index 0000000..3070e6d
--- /dev/null
+++ b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md
@@ -0,0 +1,1600 @@
+---
+name: Security Architect
+description: "Design and enforce row-level security, Data Access Controls (DACs), Analysis Authorization imports from BW/4HANA, and audit policies. Use when implementing data governance, protecting sensitive information, migrating BW authorizations, configuring compliance auditing (SOX, GDPR), or establishing segregation of duties. Critical for regulated industries."
+---
+
+# Security Architect Skill
+
+## Overview
+
+This skill guides you through designing and implementing comprehensive security controls in SAP Datasphere. Security architecture is foundational to enterprise data governance, ensuring users see only appropriate data while maintaining audit trails for compliance.
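Conceptually, a Data Access Control rewrites every query with a predicate derived from the querying user's attributes. A minimal Python sketch of that idea (the function and attribute names are illustrative, not Datasphere APIs):

```python
# Sketch of row-level security: a DAC appends a filter predicate built from
# the user's attribute values. Illustrative only, not the Datasphere engine.

def apply_dac(base_query: str, column: str, user_attributes: dict, attr_name: str) -> str:
    """Rewrite a query so the user only sees rows matching their attribute values."""
    values = user_attributes.get(attr_name, [])
    if not values:
        # No assignment means no visible rows (deny by default)
        return f"{base_query} WHERE 1 = 0"
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{base_query} WHERE {column} IN ({quoted})"

user = {"ASSIGNED_REGIONS": ["Americas", "EMEA"]}
print(apply_dac("SELECT * FROM T_SALES", "SALES_REGION", user, "ASSIGNED_REGIONS"))
# SELECT * FROM T_SALES WHERE SALES_REGION IN ('Americas', 'EMEA')
```

The deny-by-default branch mirrors DAC behavior: a user with no assigned values sees an empty result set, not unfiltered data.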
+ +### When to Use This Skill + +- **Authorization Migration**: Converting SAP BW/4HANA Analysis Authorizations to Datasphere Data Access Controls +- **Row-Level Security**: Restricting data visibility by user attributes, organizational hierarchy, or business rules +- **Compliance Requirements**: Implementing SOX, GDPR, HIPAA, or industry-specific audit requirements +- **Data Governance**: Establishing data access policies and enforcing them across organization +- **Sensitive Data Protection**: Masking or restricting access to PII, financial data, or proprietary information +- **Segregation of Duties**: Preventing unauthorized combinations of access +- **Audit Trail Management**: Logging and monitoring data access for investigation and reporting + +### Security Architecture Overview + +Datasphere provides layered security controls: + +``` +┌──────────────────────────────────────────────────────────┐ +│ Data Consumer (BI Tool, App) │ +└───────────────────────┬──────────────────────────────────┘ + │ + ┌────▼─────────────────┐ + │ Identity & Access │ + │ Management (IdP) │ + │ - SAML, OIDC │ + │ - User Attributes │ + │ - Group Membership │ + └────┬─────────────────┘ + │ + ┌───────────────┼───────────────┐ + │ │ │ + ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ + │ Space │ │ Object │ │ Row- │ + │ Level │ │ Level │ │ Level │ + │ Access │ │ Access │ │ Access │ + └────┬────┘ └────┬────┘ └────┬────┘ + │ │ │ + └──────────────┼──────────────┘ + │ + ┌──────────▼────────────┐ + │ Data Access Controls │ + │ (DAC) - Row Filters │ + │ │ + │ - Operator/Values DAC │ + │ - Hierarchy DAC │ + │ - Combined DACs │ + └──────────┬───────────┘ + │ + ┌──────────────▼──────────────┐ + │ Data (Tables, Views) │ + │ │ + │ User sees filtered rows │ + └─────────────────────────────┘ + + ┌──────────────────────────────┐ + │ Audit Trail (Logging) │ + │ - Read operations logged │ + │ - Changes logged │ + │ - Access violations logged │ + └──────────────────────────────┘ +``` + +--- + +## Part 1: Data Access 
Controls (DAC) — Row-Level Security + +Data Access Controls are Datasphere's primary mechanism for row-level security (RLS). DACs filter rows based on user context. + +### DAC Architecture + +**Core Concept:** + +```sql +-- Without DAC: User sees all rows +SELECT * FROM T_SALES WHERE 1=1 + +-- Result: 1,000,000 rows + +-- With DAC applied: User sees filtered rows +SELECT * FROM T_SALES WHERE COMPANY_CODE = CURRENT_USER_COMPANY + +-- Result: 50,000 rows (only their company) +``` + +### DAC Types + +#### Type 1: Operator and Values DAC + +Filters rows by comparing a table column to a user attribute or fixed list. + +**Use Cases:** +- Restrict sales reps to their assigned region +- Show finance staff only their cost center +- Limit product managers to their category +- Prevent cross-subsidiary data access + +**Architecture:** + +```yaml +DAC Type: Operator and Values +Name: DAC_SALES_BY_REGION + +Filter Logic: + Table Column: SALES_REGION + Condition Type: IN + Condition Values: [User Attribute: USER_ASSIGNED_REGIONS] + +Result: + - User with attribute REGIONS = ['Americas', 'EMEA'] + - Sees only rows where SALES_REGION IN ('Americas', 'EMEA') + - Other regions hidden (Europe, APAC if not in attribute) +``` + +**Implementation Steps:** + +1. **Create Custom User Attribute** + ```sql + -- In your Identity Provider (IdP) or Datasphere User Management + User: john.smith@company.com + Custom Attributes: + - ASSIGNED_REGIONS: ['Americas', 'EMEA'] + - ASSIGNED_COST_CENTERS: ['CC001', 'CC002'] + - ASSIGNED_PLANTS: ['PLANT_USA', 'PLANT_MEXICO'] + ``` + +2. **Create DAC in Datasphere** + - Navigate: Data Access Controls → New + - Name: `DAC_SALES_BY_REGION` + - Filter Column: `SALES_REGION` (from T_SALES table) + - Operator: `IN` + - Value Source: `User Attribute: ASSIGNED_REGIONS` + +3. **Apply DAC to Table/View** + - Open table: `T_SALES` + - Add Data Access Control: `DAC_SALES_BY_REGION` + - Scope: All users (or specific role) + - Save and activate + +4. 
**Test DAC Behavior** + ```sql + -- As user john.smith (ASSIGNED_REGIONS = ['Americas', 'EMEA']) + SELECT COUNT(*) FROM T_SALES + -- Expected: Only Americas & EMEA rows (e.g., 50K of 100K total) + + -- As user jane.doe (ASSIGNED_REGIONS = ['APAC']) + SELECT COUNT(*) FROM T_SALES + -- Expected: Only APAC rows (e.g., 30K of 100K total) + ``` + +**SQL Expression DACs (Advanced):** + +For complex logic, use SQL expressions: + +```yaml +DAC Name: DAC_COMPLEX_SALES_ACCESS +Filter Expression: | + (SALES_REGION IN :USER_REGIONS + AND SALES_ORG = :USER_PRIMARY_ORG) + OR (USER_ROLE = 'GLOBAL_MANAGER' AND 1=1) -- Bypass for managers + +Parameters: + - :USER_REGIONS (from custom attribute) + - :USER_PRIMARY_ORG (from custom attribute) + - :USER_ROLE (from IdP role) +``` + +**Operator Choices:** + +| Operator | Use Case | Example | +|---|---|---| +| `=` | Exact match | `COMPANY_CODE = USER_COMPANY` | +| `<>` | Not equal | `STATUS <> 'RESTRICTED'` | +| `IN` | Multiple values | `REGION IN (USER_REGIONS)` | +| `NOT IN` | Exclude values | `CATEGORY NOT IN (USER_EXCLUDED)` | +| `>`, `<`, `>=`, `<=` | Range filtering | `AMOUNT >= USER_MIN_THRESHOLD` | +| `LIKE` | Pattern matching | `CUSTOMER_NAME LIKE USER_NAME_PATTERN` | +| `BETWEEN` | Range (inclusive) | `POSTING_DATE BETWEEN START_DATE AND END_DATE` | + +--- + +#### Type 2: Hierarchy DAC + +Filters rows by organizational or master data hierarchies. 
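The "include sub-hierarchy" behavior can be pictured as collecting the assigned node plus all of its descendants. A small Python sketch under that assumption (the node names mirror the sample org structure used in this guide):

```python
# Sketch of hierarchy-based filtering: a user assigned to one node may see
# that node plus every descendant. Sample hierarchy data is illustrative.
from collections import defaultdict

def visible_nodes(assigned: str, parent_of: dict) -> set:
    """Return the assigned node and all of its descendants."""
    children = defaultdict(list)
    for node, parent in parent_of.items():
        children[parent].append(node)
    result, stack = set(), [assigned]
    while stack:
        node = stack.pop()
        result.add(node)
        stack.extend(children[node])  # walk down the tree
    return result

parent_of = {
    "DIV_ENG": "ROOT", "DIV_SALES": "ROOT",
    "DEPT_DEV": "DIV_ENG", "DEPT_QA": "DIV_ENG",
    "DEPT_EMEA": "DIV_SALES", "DEPT_APAC": "DIV_SALES",
}
print(sorted(visible_nodes("DIV_SALES", parent_of)))
# ['DEPT_APAC', 'DEPT_EMEA', 'DIV_SALES']
```

A user mapped to a leaf department sees only that department; a user mapped to the root sees everything, which matches the drill-down semantics described below.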
+ +**Use Cases:** +- Restrict users to their department and sub-departments (org hierarchy) +- Show sales data by customer hierarchy (top → bottom levels) +- Limit access by product category hierarchy +- Enable drill-down while preventing sibling access + +**Architecture:** + +```yaml +DAC Type: Hierarchy +Name: DAC_ORG_HIERARCHY_ACCESS + +Hierarchy: ORG_STRUCTURE + ├── Company + │ ├── Division + │ │ ├── Department + │ │ └── Sub-Department + │ └── Region + │ ├── Country + │ └── Territory + +User Assignment: + john.smith: Department "Engineering" + Result: Can see Engineering + all sub-departments + Cannot see: Other divisions, other departments + +jane.doe: Division "Sales" +Result: Can see Sales division + all departments within Sales +Cannot see: Engineering division, other divisions +``` + +**Implementation Steps:** + +1. **Create Hierarchy Table** + ```sql + CREATE TABLE T_ORG_HIERARCHY ( + NODE_ID VARCHAR(10) PRIMARY KEY, + PARENT_NODE_ID VARCHAR(10), + NODE_NAME VARCHAR(50), + NODE_LEVEL INTEGER, -- 1=Company, 2=Division, 3=Dept + NODE_TYPE VARCHAR(20), + EFFECTIVE_FROM DATE, + EFFECTIVE_TO DATE + ); + + INSERT INTO T_ORG_HIERARCHY VALUES + ('ROOT', NULL, 'Company', 1, 'COMPANY', '2024-01-01', NULL), + ('DIV_ENG', 'ROOT', 'Engineering', 2, 'DIVISION', '2024-01-01', NULL), + ('DIV_SALES', 'ROOT', 'Sales', 2, 'DIVISION', '2024-01-01', NULL), + ('DEPT_DEV', 'DIV_ENG', 'Development', 3, 'DEPARTMENT', '2024-01-01', NULL), + ('DEPT_QA', 'DIV_ENG', 'QA', 3, 'DEPARTMENT', '2024-01-01', NULL), + ('DEPT_EMEA', 'DIV_SALES', 'EMEA Sales', 3, 'DEPARTMENT', '2024-01-01', NULL), + ('DEPT_APAC', 'DIV_SALES', 'APAC Sales', 3, 'DEPARTMENT', '2024-01-01', NULL); + ``` + +2. 
**Create User-to-Hierarchy Mapping** + ```sql + CREATE TABLE T_USER_ORG_MAPPING ( + USER_ID VARCHAR(12), + ASSIGNED_NODE_ID VARCHAR(10), + ASSIGNED_NODE_LEVEL INTEGER, + EFFECTIVE_FROM DATE + ); + + INSERT INTO T_USER_ORG_MAPPING VALUES + ('john.smith', 'DEPT_DEV', 3, '2024-01-01'), + ('jane.doe', 'DIV_SALES', 2, '2024-01-01'), + ('mary.johnson', 'ROOT', 1, '2024-01-01'); -- CEO sees all + ``` + +3. **Create Hierarchy-Based DAC** + - Create DAC: `DAC_ORG_HIERARCHY` + - Type: Hierarchy + - Hierarchy Table: `T_ORG_HIERARCHY` + - Key Column: `NODE_ID` + - User Mapping Table: `T_USER_ORG_MAPPING` + - Mapping Key: `ASSIGNED_NODE_ID` + - Include Sub-Hierarchy: Yes (allows drill-down) + +4. **Apply DAC to Fact Tables** + - Add DAC to any table with department/org column + - Ensure table has matching key: `DEPARTMENT_ID` + - Filter merges table column with user's hierarchy level + +5. **Test Hierarchy Access** + ```sql + -- As john.smith (assigned DEPT_DEV, level 3) + -- John can see data for: + -- - DEPT_DEV (his department) + -- - Any future sub-departments under DEPT_DEV (if any) + -- - Sub-departments he gets reassigned to + -- + -- John CANNOT see: + -- - DEPT_QA (sibling department) + -- - DIV_SALES (other division) + -- - Other company divisions + + -- As jane.doe (assigned DIV_SALES, level 2) + -- Jane can see data for: + -- - DIV_SALES (her division) + -- - DEPT_EMEA (child of DIV_SALES) + -- - DEPT_APAC (child of DIV_SALES) + -- - All sub-departments under EMEA/APAC + -- + -- Jane CANNOT see: + -- - DIV_ENG (other division) + -- - DEPT_DEV, DEPT_QA (different division) + ``` + +**Hierarchy Best Practices:** + +- Maintain effective dates for organizational changes +- Test transitions when org structure changes +- Document hierarchy structure for audit purposes +- Keep hierarchy tables normalized to avoid anomalies +- Use version control if hierarchy evolves frequently + +--- + +#### Type 3: Combined DACs + +Chain multiple DACs together for complex scenarios. 
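When several DACs protect the same object, a row is returned only if every filter admits it: the filters are AND-combined and the user sees the intersection of all of them. A small Python sketch of that combination semantics — the predicates, row shape, and values are illustrative:

```python
# Sketch of combined-DAC semantics: each DAC acts as a row predicate,
# and chained DACs behave like a logical AND over all predicates.
# All names and data here are illustrative.

def operator_dac(user_regions):
    return lambda row: row["SALES_REGION"] in user_regions

def hierarchy_dac(allowed_categories):
    return lambda row: row["CUSTOMER_CATEGORY"] in allowed_categories

def date_dac(fiscal_year):
    return lambda row: row["FISCAL_YEAR"] == fiscal_year

def apply_dacs(rows, dacs):
    """A row survives only if every active DAC admits it (intersection)."""
    return [r for r in rows if all(dac(r) for dac in dacs)]

rows = [
    {"SALES_REGION": "EMEA", "CUSTOMER_CATEGORY": "A", "FISCAL_YEAR": 2024},
    {"SALES_REGION": "EMEA", "CUSTOMER_CATEGORY": "D", "FISCAL_YEAR": 2024},
    {"SALES_REGION": "APAC", "CUSTOMER_CATEGORY": "A", "FISCAL_YEAR": 2024},
    {"SALES_REGION": "EMEA", "CUSTOMER_CATEGORY": "B", "FISCAL_YEAR": 2023},
]

dacs = [
    operator_dac({"Americas", "EMEA"}),   # DAC 1: assigned regions
    hierarchy_dac({"A", "B", "C"}),       # DAC 2: approved categories
    date_dac(2024),                       # DAC 3: current fiscal year
]

# Only the first row satisfies all three filters:
print(apply_dacs(rows, dacs))
```

Note that adding a DAC can only narrow the result set, never widen it — which is why a bypass (for example for global managers) must be built into a filter expression rather than added as an extra DAC.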
+ +**Use Case: Multi-Attribute Filtering** + +```yaml +Scenario: Sales Manager needs: + - Only their assigned regions (Operator DAC) + - Only approved customer categories (Hierarchy DAC) + - Only current fiscal year data (Date DAC) + +Implementation: + DAC 1 (Operator): SALES_REGION IN (:USER_REGIONS) + DAC 2 (Hierarchy): CUSTOMER_CATEGORY hierarchy filter + DAC 3 (Date): FISCAL_YEAR = CURRENT_FISCAL_YEAR + + Combination Logic: DAC1 AND DAC2 AND DAC3 + Result: User sees intersection of all three filters +``` + +**SQL Representation:** + +```sql +-- Combined DAC logic (applied automatically by Datasphere) +SELECT * FROM T_SALES +WHERE SALES_REGION IN ('Americas', 'EMEA') -- DAC 1 + AND CUSTOMER_CATEGORY IN ('A', 'B', 'C') -- DAC 2 + AND FISCAL_YEAR = 2024 -- DAC 3 +``` + +--- + +### DAC Testing and Validation + +**Test Framework:** + +```yaml +Test Plan: DAC_SALES_BY_REGION + +Test Case 1: User with Single Region + User: john.smith (REGIONS = ['Americas']) + Query: SELECT COUNT(*) FROM T_SALES + Expected: Only Americas rows visible + Verification: + SELECT DISTINCT SALES_REGION FROM T_SALES + Expected Result: ['Americas'] + +Test Case 2: User with Multiple Regions + User: jane.doe (REGIONS = ['EMEA', 'APAC']) + Query: SELECT COUNT(*) FROM T_SALES + Expected: EMEA + APAC rows visible, Americas hidden + Verification: + SELECT DISTINCT SALES_REGION FROM T_SALES + Expected Result: ['EMEA', 'APAC'] + +Test Case 3: User with No Assigned Regions + User: alex.brown (REGIONS = []) + Query: SELECT COUNT(*) FROM T_SALES + Expected: 0 rows (empty result set) + Verification: Row count = 0 + +Test Case 4: Admin Bypass (if applicable) + User: admin.user (ROLE = 'GLOBAL_ADMIN') + Query: SELECT COUNT(*) FROM T_SALES + Expected: All rows visible (DAC bypassed) + Verification: Row count = total all regions +``` + +**Validation SQL Queries:** + +```sql +-- Query 1: Verify DAC is active on table +SELECT TABLE_NAME, DAC_NAME, DAC_TYPE, ACTIVE_FLAG +FROM DATASPHERE.DATA_ACCESS_CONTROLS 
+WHERE TABLE_NAME = 'T_SALES'; + +-- Expected: DAC_SALES_BY_REGION, Operator, Y + +-- Query 2: List all DACs applied to table +SELECT DAC_NAME, FILTER_COLUMN, FILTER_OPERATOR, FILTER_VALUES +FROM DATASPHERE.DAC_DEFINITIONS +WHERE TABLE_NAME = 'T_SALES' +ORDER BY APPLY_SEQUENCE; + +-- Query 3: Count rows per user per region (for validation) +-- Run as each test user and verify results match expectations +SELECT USER_ID, SALES_REGION, COUNT(*) as row_count +FROM T_SALES +GROUP BY USER_ID, SALES_REGION +ORDER BY USER_ID, SALES_REGION; +``` + +**Regression Testing (Ongoing):** + +After any DAC changes: +- Re-run all test cases +- Validate no unintended data exposure +- Check performance impact (DAC filters add overhead) +- Document any behavioral changes + +--- + +## Part 2: Analysis Authorization Import (BW → Datasphere) + +BW/4HANA Analysis Authorizations define who can see which InfoCubes and which rows. Datasphere uses a similar model but requires mapping. + +### Understanding BW Analysis Authorizations + +**BW Authorization Objects:** + +``` +Authorization: +├── Object: 0_RF_BOA +│ └── InfoCube: 0SALES_001 +│ ├── Authorization Type: Value-based +│ ├── Auth Fields: +│ │ ├── COMPANY_CODE: ['0010', '0020'] +│ │ ├── SALES_ORG: ['1000'] +│ │ ├── CUSTOMER: ['*'] (All) +│ │ └── PRODUCT_CATEGORY: ['CARS', 'BIKES'] +│ └── Result: User can access 0SALES_001 with those filters + +└── Object: 0_USER + └── Authorization Values: + ├── User: john.smith + ├── Company Code: '0010', '0020' + ├── Role: Sales Manager + └── Reports Allowed: [SALES001, SALES002] +``` + +**BW Authorization Types:** + +| Type | Scope | Datasphere Mapping | +|---|---|---| +| Value-Based | Specific field values | Operator DAC | +| Hierarchy-Based | Organizational structure | Hierarchy DAC | +| Derived | Based on other authorizations | Computed DAC | +| Role-Based | User role determines access | Space/Object roles | + +--- + +### BW to Datasphere Authorization Mapping + +#### Step 1: Extract BW 
Authorizations
+
+```
+Transaction: SU01 (User Maintenance) in source BW system
+
+For Each User:
+  1. Select User ID
+  2. View "Roles" tab
+  3. For each role, note authorizations
+  4. Document authorization fields and values:
+
+User: JOHN.SMITH
+├── Role: ZSD_REGIONAL_SALES_MGR
+│   ├── Authorization 0_RF_BOA (InfoCube Access)
+│   │   ├── COMPANY_CODE: 0010, 0020
+│   │   ├── SALES_ORG: 1000
+│   │   ├── SALES_DISTRICT: 10, 20, 30
+│   │   └── REGION: EUR, AMER
+│   ├── Authorization 0_AUDIT (Report Access)
+│   │   └── Reports: SALES001, SALES002, SALES003
+│   └── Authorization S_DEVELOP (Development Access)
+│       └── Programs: ZSD_*
+│
+└── Role: ZFI_COST_CENTER_MGR
+    ├── Authorization 0_RF_BOA
+    │   ├── COMPANY_CODE: 0010
+    │   ├── COST_CENTER: 1000, 1100, 1200
+    │   └── PLANT: 1001, 1002
+    └── Authorization 0_F_LEDGER
+        └── Ledger access: All
+```
+
+**Export Steps:**
+
+1. Use transaction **RSECADMIN** (Management of Analysis Authorizations) to review authorization assignments
+2. Select user range (or all users)
+3. Run the report to extract authorization details
+4. Export to Excel/CSV with columns:
+   - USER_ID
+   - AUTH_OBJECT
+   - AUTH_FIELD_1
+   - AUTH_FIELD_1_VALUES
+   - AUTH_FIELD_2
+   - AUTH_FIELD_2_VALUES
+   - ... 
(repeat for all fields) + - INFOCUBE_ID (for 0_RF_BOA) + +**Example Export Format:** + +``` +USER_ID,AUTH_OBJECT,FIELD_1,VALUE_1,FIELD_2,VALUE_2,FIELD_3,VALUE_3,INFOCUBE +john.smith,0_RF_BOA,COMPANY_CODE,"0010;0020",SALES_ORG,"1000",SALES_DISTRICT,"10;20;30",0SALES_001 +john.smith,0_RF_BOA,COMPANY_CODE,"0010;0020",SALES_ORG,"1000",REGION,"EUR;AMER",0SALES_001 +jane.doe,0_RF_BOA,COMPANY_CODE,"0010",COST_CENTER,"1000;1100;1200",PLANT,"1001;1002",0COST_001 +mary.johnson,0_RF_BOA,COMPANY_CODE,"*",SALES_ORG,"*",REGION,"*",0SALES_001 +``` + +--- + +#### Step 2: Map BW Fields to Datasphere Columns + +Create mapping table: + +```sql +CREATE TABLE T_AUTH_FIELD_MAPPING ( + BW_INFOCUBE_ID VARCHAR(30), + BW_AUTH_FIELD VARCHAR(30), + DS_TABLE_NAME VARCHAR(60), + DS_COLUMN_NAME VARCHAR(60), + MAPPING_TYPE VARCHAR(20), -- VALUES, HIERARCHY, FORMULA + NOTES VARCHAR(500) +); + +INSERT INTO T_AUTH_FIELD_MAPPING VALUES + ('0SALES_001', 'COMPANY_CODE', 'T_SALES', 'COMPANY_CODE', 'VALUES', + 'Direct column match'), + ('0SALES_001', 'SALES_ORG', 'T_SALES', 'SALES_ORG', 'VALUES', + 'Direct column match'), + ('0SALES_001', 'SALES_DISTRICT', 'T_SALES', 'DISTRICT_ID', 'VALUES', + 'Maps to DISTRICT_ID column'), + ('0SALES_001', 'REGION', 'T_SALES', 'SALES_REGION', 'HIERARCHY', + 'Use org hierarchy for filtering'), + ('0COST_001', 'COST_CENTER', 'T_FINANCIALS', 'COST_CENTER', 'HIERARCHY', + 'Use cost center hierarchy'), + ('0COST_001', 'PLANT', 'T_FINANCIALS', 'PLANT_ID', 'VALUES', + 'Direct column match'); +``` + +**Mapping Challenges:** + +| BW Field | Datasphere Challenge | Solution | +|---|---|---| +| Wildcard (*) | Means "all values" | Remove filter (no DAC needed) | +| Compound key | Multiple fields required | Create composite key DAC | +| User-exit logic | Custom ABAP code | Rebuild logic in transformation | +| Time-dependent | Fields change per period | Use date-scoped DAC | +| Hierarchies with wildcards | "REGION: EUR*" | Hierarchy DAC with pattern matching | + +--- + +#### Step 3: 
Create Datasphere DACs from BW Authorizations
+
+```
+For Each BW Authorization:
+  1. Parse all fields and values
+  2. Create corresponding Datasphere DAC
+  3. Link DAC to appropriate table/view
+  4. Assign to user/role
+```
+
+**Example Conversion:**
+
+```
+BW Authorization:
+User: john.smith
+AuthObject: 0_RF_BOA
+InfoCube: 0SALES_001
+Fields:
+  - COMPANY_CODE = 0010, 0020
+  - SALES_ORG = 1000
+  - SALES_DISTRICT = 10, 20, 30
+
+Datasphere Implementation:
+  DAC Name: DAC_SALES_JOHN_SMITH
+  Type: Operator
+  Filter Expression:
+    COMPANY_CODE IN ('0010', '0020')
+    AND SALES_ORG = '1000'
+    AND DISTRICT_ID IN ('10', '20', '30')
+
+  Applied To: T_SALES (table), V_SALES_QUERY (view)
+  Assigned To: john.smith
+  Active: Yes
+```
+
+**Batch DAC Creation Script Template:**
+
+```sql
+-- Pseudo-code for bulk DAC creation
+FOR EACH row in T_BW_AUTH_EXPORT:
+  SET @user_id = row.USER_ID
+  SET @infocube = row.INFOCUBE_ID
+  SET @field_1 = row.FIELD_1
+  SET @values_1 = row.VALUES_1
+  SET @field_2 = row.FIELD_2
+  SET @values_2 = row.VALUES_2
+
+  -- Map BW fields to Datasphere columns
+  SET @ds_table = (SELECT DS_TABLE_NAME
+                   FROM T_AUTH_FIELD_MAPPING
+                   WHERE BW_INFOCUBE_ID = @infocube)
+
+  SET @ds_column_1 = (SELECT DS_COLUMN_NAME
+                      FROM T_AUTH_FIELD_MAPPING
+                      WHERE BW_INFOCUBE_ID = @infocube
+                      AND BW_AUTH_FIELD = @field_1)
+
+  SET @ds_column_2 = (SELECT DS_COLUMN_NAME
+                      FROM T_AUTH_FIELD_MAPPING
+                      WHERE BW_INFOCUBE_ID = @infocube
+                      AND BW_AUTH_FIELD = @field_2)
+
+  -- Build filter expression (extend analogously for further fields)
+  SET @filter_expr = CONCAT(
+    @ds_column_1, ' IN (', @values_1, ')',
+    ' AND ', @ds_column_2, ' IN (', @values_2, ')'
+  )
+
+  -- Create DAC in Datasphere on the mapped target table
+  CREATE DATA ACCESS CONTROL AS (CONCAT('DAC_', @user_id, '_', @infocube))
+  FOR TABLE @ds_table
+  WHERE (@filter_expr)
+
+  -- Assign to user
+  GRANT DATA ACCESS CONTROL (CONCAT('DAC_', @user_id, '_', @infocube))
+  TO USER @user_id
+END FOR
+
+-- After creation, confirm every user received at least one DAC
+SELECT USER_ID, COUNT(*) as dac_count
+FROM DATASPHERE.DATA_ACCESS_CONTROLS
+GROUP BY USER_ID
+HAVING dac_count > 0
+ORDER BY dac_count DESC;
+```
+
+---
+
+#### Step 4: Validate Authorization Migration
+
+**Reconciliation Steps:**
+
+1. **Verify all BW users have corresponding DACs** + ```sql + -- Compare user count + SELECT COUNT(DISTINCT USER_ID) as bw_users + FROM T_BW_AUTH_EXPORT; + + SELECT COUNT(DISTINCT USER_ID) as ds_users + FROM DATASPHERE.DATA_ACCESS_CONTROLS; + + -- Expected: Both counts equal (or very close) + ``` + +2. **Verify all BW fields are mapped** + ```sql + -- Check for unmapped fields + SELECT DISTINCT BW_AUTH_FIELD + FROM T_BW_AUTH_EXPORT bw + WHERE NOT EXISTS ( + SELECT 1 FROM T_AUTH_FIELD_MAPPING m + WHERE m.BW_AUTH_FIELD = bw.BW_AUTH_FIELD + ); + + -- Expected: Empty result set (no unmapped fields) + ``` + +3. **Test sample authorizations** + ```sql + -- For test user john.smith: + -- 1. Log in as john.smith in Datasphere + -- 2. Execute query: SELECT DISTINCT COMPANY_CODE FROM T_SALES + -- 3. Expected result: ['0010', '0020'] (from BW auth) + -- 4. Execute: SELECT COUNT(*) FROM T_SALES WHERE SALES_ORG = '1000' + -- 5. Expected: 150,000 records (only his sales org) + ``` + +4. **Compare row counts (BW vs. Datasphere)** + ```sql + -- BW System: + SELECT USER, COUNT(*) as row_count + FROM 0SALES_001 + WHERE COMPANY_CODE IN ('0010', '0020') + AND SALES_ORG = '1000' + GROUP BY USER; + + -- Datasphere: + SELECT CURRENT_USER as USER, COUNT(*) as row_count + FROM T_SALES + WHERE 1=1 -- DAC filter applied automatically + GROUP BY CURRENT_USER; + + -- Expected: Row counts match ±0.1% + ``` + +--- + +## Part 3: Audit Policy Configuration + +Audit logging tracks who accessed what data and when, essential for compliance. + +### Audit Logging Architecture + +``` +┌─────────────────┐ +│ User Query │ +│ SELECT ... │ +└────────┬────────┘ + │ + ▼ +┌──────────────────────────────┐ +│ Datasphere Query Engine │ +│ Applies Row-Level Security │ +└────────┬─────────────────────┘ + │ + ▼ +┌──────────────────────────────┐ +│ Audit Policy Evaluator │ +│ Is this query auditable? │ +│ - Table in audit scope? │ +│ - Operation type logged? │ +│ - User role triggers audit? 
│ +└────────┬─────────────────────┘ + │ + ┌──┴───┬────────┐ + │ │ │ + Yes│ │No │No Log + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ Query + │Log to Audit │ Continues + │Table │ Normally + └────┬────────┘ + │ + ▼ + ┌──────────────────────┐ + │ T_AUDIT_LOG (Table) │ + │ - USER_ID │ + │ - OPERATION │ + │ - TABLE_NAME │ + │ - RECORD_COUNT │ + │ - TIMESTAMP │ + │ - IP_ADDRESS │ + └──────────────────────┘ +``` + +### Audit Policy Configuration Steps + +#### Step 1: Define Audit Scope + +Decide which tables, operations, and users to audit: + +```yaml +Audit Scope Decision: + +High Priority (Audit All Operations): + ├── T_CUSTOMER_PII (Personal Identifiable Information) + │ ├── Operations: Read, Insert, Update, Delete + │ ├── Users: All + │ └── Retention: 3 years (GDPR) + ├── T_PAYROLL (Sensitive HR Data) + │ ├── Operations: Read, Update, Delete + │ ├── Users: All (especially non-HR) + │ └── Retention: 7 years (Labor Laws) + └── T_FINANCIAL_TRANSACTIONS (Compliance) + ├── Operations: Read, Insert, Update, Delete + ├── Users: All + └── Retention: 10 years (SOX) + +Medium Priority (Audit Changes Only): + ├── T_PRODUCT_MASTER + │ ├── Operations: Insert, Update, Delete (not Read) + │ ├── Users: All + │ └── Retention: 1 year + ├── T_SALES + │ ├── Operations: Insert, Update (not Read) + │ ├── Users: Only non-sales users (exception access) + │ └── Retention: 6 months + └── T_CUSTOMER_ATTRIBUTES + ├── Operations: Update, Delete + ├── Users: All + └── Retention: 1 year + +Low Priority (Sample/Spot-Check): + ├── T_REFERENCE_DATA + │ ├── Operations: Reads (random sampling) + │ ├── Users: All + │ └── Retention: 3 months +``` + +#### Step 2: Create Audit Log Table + +```sql +CREATE TABLE T_AUDIT_LOG ( + AUDIT_ID BIGINT PRIMARY KEY AUTO_INCREMENT, + AUDIT_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + DATASPHERE_USER VARCHAR(256) NOT NULL, -- Logged-in user + IP_ADDRESS VARCHAR(45), -- IPv4 or IPv6 + SESSION_ID VARCHAR(64), -- Unique session identifier + OBJECT_NAME VARCHAR(256) NOT NULL, -- 
Table or view name + OBJECT_TYPE VARCHAR(20), -- TABLE, VIEW, ANALYTICAL_MODEL + OPERATION VARCHAR(20) NOT NULL, -- READ, INSERT, UPDATE, DELETE + ROW_COUNT_AFFECTED INTEGER, -- Rows read/modified + EXECUTION_TIME_MS INTEGER, -- Query duration + SQL_HASH VARCHAR(64), -- Hash of SQL (for uniqueness, not visibility) + FILTER_APPLIED CHAR(1), -- Y/N - was DAC applied? + DAC_FILTERS_APPLIED VARCHAR(500), -- Which DACs filtered data? + RESULT_STATUS VARCHAR(20), -- SUCCESS, FAILED, DENIED + ERROR_MESSAGE VARCHAR(1000), -- If failed + SENSITIVE_DATA_ACCESSED CHAR(1), -- Y/N - flagged by policy? + SUSPICIOUS_ACTIVITY_FLAG CHAR(1), -- Y/N - triggered alert? + COMPLIANCE_RELEVANT CHAR(1), -- Y/N - required for audit trail? + CREATED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + RETAINED_UNTIL DATE, -- When to archive/delete + + INDEX idx_timestamp (AUDIT_TIMESTAMP), + INDEX idx_user_timestamp (DATASPHERE_USER, AUDIT_TIMESTAMP), + INDEX idx_object_name (OBJECT_NAME, OPERATION), + INDEX idx_sensitivity (SENSITIVE_DATA_ACCESSED), + INDEX idx_retention (RETAINED_UNTIL) +); +``` + +**Partitioning Strategy (for performance):** + +```sql +-- Partition by month for easier archival +PARTITION BY RANGE (YEAR_MONTH(AUDIT_TIMESTAMP)) ( + PARTITION p_202401 VALUES LESS THAN ('202402'), + PARTITION p_202402 VALUES LESS THAN ('202403'), + -- ... monthly partitions + PARTITION p_202412 VALUES LESS THAN ('202501'), + PARTITION p_future VALUES LESS THAN MAXVALUE +); + +-- Allows fast deletion: ALTER TABLE T_AUDIT_LOG DROP PARTITION p_202301; +-- Much faster than DELETE WHERE AUDIT_TIMESTAMP < '2023-02-01' +``` + +#### Step 3: Configure Audit Policies + +In Datasphere: + +1. Navigate: **Administration** → **Audit Management** +2. 
Create Audit Policy
+
+```yaml
+Audit Policy: POLICY_HIGH_SENSITIVITY_DATA
+
+Name: POLICY_HIGH_SENSITIVITY_DATA
+Description: "Audit all access to PII, financial, and compliance-relevant data"
+Enabled: Yes
+
+Tables in Scope:
+  - T_CUSTOMER_PII
+  - T_PAYROLL
+  - T_FINANCIAL_TRANSACTIONS
+  - T_EXECUTIVE_COMPENSATION
+
+Operations to Audit:
+  ☑ READ (All read queries)
+  ☑ INSERT (New records added)
+  ☑ UPDATE (Existing records modified)
+  ☑ DELETE (Records removed)
+  ☑ TRUNCATE (Entire table cleared)
+
+Users in Scope:
+  ☑ All users (selected)
+  ☐ Specific roles: [ADMIN, ANALYST]
+  ☐ Specific users: []
+
+Audit Destination:
+  ☑ Central Audit Table: T_AUDIT_LOG
+  ☐ External SIEM: siem.company.com:514
+  ☐ File Export: /var/audit/datasphere/
+
+Retention Settings:
+  Retention Period: 10 years (longest applicable mandate: SOX)
+  Retention Method: Keep in active storage for 1 year, then archive to cold storage
+  Purge Schedule: Quarterly (automatic deletion after retention expires)
+
+Alerting:
+  ☑ Alert on suspicious activity
+  ├─ > 100 rows accessed in single query: YES, Alert Level: MEDIUM
+  ├─ DELETE operation on sensitive table: YES, Alert Level: HIGH
+  ├─ Same user accessing multiple restricted tables in 10 min: YES, Alert Level: HIGH
+  └─ Query from unusual IP address: YES, Alert Level: MEDIUM
+
+Compliance Mapping:
+  ☑ GDPR (General Data Protection Regulation)
+  ├─ Personal Data Processing: YES
+  ├─ Retention: 3 years minimum
+  └─ Right to be Forgotten: Support deletion log
+  ☑ SOX (Sarbanes-Oxley)
+  ├─ Financial Data: YES
+  ├─ Retention: 10 years minimum
+  └─ Immutable Log: YES
+  ☑ HIPAA (Health Insurance Portability and Accountability Act)
+  ├─ PHI (Protected Health Information): YES
+  ├─ Retention: 6 years minimum
+  └─ Breach Notification: Auto-trigger alert
+```
+
+#### Step 4: Configure Read/Change Audit Levels
+
+**Read-Level Auditing (Detailed):**
+
+```yaml
+Audit Level: DETAILED_READ
+
+When to Use:
+  - Highly sensitive data (customer PII, health records, executive data)
+  - Regulatory requirements (HIPAA, 
GDPR, SOX)
+  - High-value business intelligence
+
+Captured Information:
+  ✓ Every SELECT query on audited table
+  ✓ User ID and timestamp
+  ✓ Number of rows accessed
+  ✓ Filters applied (DAC information)
+  ✓ Query execution time
+  ✓ IP address and session ID
+  ✓ Any exception/error conditions
+
+Performance Impact:
+  - Significant: ~5-15% query overhead
+  - Storage: roughly 1-3 KB per logged query
+  - Example: 1 million audited queries/day → roughly 30-90 GB/month of audit logs
+
+Query Example (Detailed Read Audit):
+  User: john.smith
+  Query: SELECT * FROM T_CUSTOMER_PII WHERE STATE = 'CA'
+  Audit Log Entry:
+    - OPERATION: READ
+    - ROW_COUNT_AFFECTED: 500,000
+    - EXECUTION_TIME_MS: 2,345
+    - DAC_FILTERS_APPLIED: [DAC_STATE_FILTER]
+    - SENSITIVE_DATA_ACCESSED: Y
+    - Timestamp: 2024-02-08 14:32:15
+```
+
+**Change-Level Auditing (Moderate):**
+
+```yaml
+Audit Level: CHANGES_ONLY
+
+When to Use:
+  - Standard operational tables (products, orders, inventory)
+  - Security/compliance needs but not highest tier
+  - Balance between audit trail and performance
+
+Captured Information:
+  ✓ INSERT operations (all new records)
+  ✓ UPDATE operations (modified records)
+  ✓ DELETE operations (removed records)
+  ✗ READ operations (not logged)
+  ✓ User, timestamp, row count
+  ✓ Execution time, session ID
+
+Performance Impact:
+  - Low: ~1-2% query overhead (only changes logged)
+  - Storage: roughly 0.5-1 KB per logged change
+
+Query Example (Change-Level Audit):
+  User: jane.doe
+  Query: UPDATE T_PRODUCT SET PRICE = 99.99 WHERE PRODUCT_ID = 'ABC123'
+  Audit Log Entry:
+    - OPERATION: UPDATE
+    - ROW_COUNT_AFFECTED: 1
+    - EXECUTION_TIME_MS: 145
+    - Timestamp: 2024-02-08 14:35:22
+```
+
+**Space-Level Audit Scoping:**
+
+```yaml
+Scenario: Multiple Spaces with Different Sensitivity
+
+Space: FINANCE (Highly Sensitive)
+  Audit Policies: [POLICY_HIGH_SENSITIVITY_DATA, POLICY_SOX]
+  Audit Level: DETAILED_READ + DETAILED_CHANGES
+  Retention: 10 years
+
+Space: SALES (Medium Sensitivity)
+  Audit Policies: 
[POLICY_STANDARD, POLICY_PII_ONLY] + Audit Level: CHANGES_ONLY + Retention: 1 year + +Space: REFERENCE (Low Sensitivity) + Audit Policies: [] + Audit Level: NONE + Retention: N/A + +Result: + - All spaces can use shared T_AUDIT_LOG table + - Partition by AUDIT_TIMESTAMP for retention management + - Query audit logs with WHERE clause: + SELECT * FROM T_AUDIT_LOG + WHERE OBJECT_NAME IN (SELECT TABLE_NAME FROM FINANCE.TABLES) + AND AUDIT_TIMESTAMP >= DATE_SUB(CURRENT_DATE, INTERVAL 10 YEAR) +``` + +--- + +## Part 4: Identity Provider (IdP) Integration + +Datasphere integrates with SAML 2.0 and OIDC-compliant Identity Providers for authentication and user attribute mapping. + +### Supported Identity Providers + +| Provider | Protocol | User Attributes | Status | +|---|---|---|---| +| Azure AD | SAML 2.0, OIDC | Object ID, UPN, Groups | Supported | +| Okta | SAML 2.0, OIDC | Okta ID, Groups | Supported | +| Ping Identity | SAML 2.0, OIDC | User ID, Groups | Supported | +| SAP Cloud Identity | SAML 2.0 | Global User ID | Supported | +| Custom SAML | SAML 2.0 | Any custom attributes | Supported | + +### IdP Configuration Steps + +#### Step 1: Gather Identity Provider Metadata + +From your IdP (Azure AD example): + +```yaml +Identity Provider Details: + +Azure AD Tenant: contoso.onmicrosoft.com +Application ID: 12345678-1234-1234-1234-123456789012 +Client Secret: [SECRET] + +SAML Endpoints: + - Sign-On URL: https://login.microsoftonline.com/[TENANT_ID]/saml2 + - Sign-Out URL: https://login.microsoftonline.com/[TENANT_ID]/saml2/logout + - Certificate (Signing): [X.509 Certificate] + +Claim Mappings: + - NameID (User Identifier): userPrincipalName (john.smith@contoso.com) + - Email: mail (john.smith@contoso.com) + - First Name: givenName (John) + - Last Name: surname (Smith) + - Custom Attributes: + - REGIONS: extensionAttribute1 (list: ["Americas", "EMEA"]) + - COST_CENTER: extensionAttribute2 ("CC001") + - SALES_ORG: extensionAttribute3 ("1000") +``` + +#### Step 2: 
Configure Datasphere as SAML Service Provider + +In Datasphere Admin Console: + +1. Navigate: **Administration** → **Security** → **Identity Providers** +2. Click **Add Identity Provider** + +```yaml +Configuration: + +Basic Settings: + Provider Name: Azure_AD_Production + Protocol: SAML 2.0 + Status: Active + +Endpoint Configuration: + IdP Sign-On URL: https://login.microsoftonline.com/[TENANT]/saml2 + IdP Sign-Out URL: https://login.microsoftonline.com/[TENANT]/saml2/logout + IdP Signing Certificate: [Paste X.509 certificate content] + +Service Provider Configuration: + Assertion Consumer Service (ACS) URL: + https://datasphere.company.com/saml/acs + Entity ID: + https://datasphere.company.com/saml/entity + Signing Certificate: [Auto-generated] + +Name ID Mapping: + Source: SAML NameID + Target: Datasphere User ID + Format: urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified + +Attribute Mappings: + ┌─────────────────────┬──────────────────┬────────────────────┐ + │ Datasphere Field │ SAML Claim Name │ Example Value │ + ├─────────────────────┼──────────────────┼────────────────────┤ + │ User ID │ userPrincipalName│ john.smith@contoso │ + │ Email │ mail │ john.smith@contoso │ + │ First Name │ givenName │ John │ + │ Last Name │ surname │ Smith │ + │ REGIONS (Custom) │ extensionAttribute1│["Americas","EMEA"]│ + │ COST_CENTER (Custom)│ extensionAttribute2│ CC001 │ + │ SALES_ORG (Custom) │ extensionAttribute3│ 1000 │ + └─────────────────────┴──────────────────┴────────────────────┘ + +User Auto-Provisioning: + ☑ Auto-Create User on First Login + ☐ Auto-Assign to Space: [SALES_ANALYTICS] + ☐ Auto-Assign Role: [ANALYTICS_VIEWER] + +Session Settings: + Single Sign-Out: Enabled + Session Timeout: 60 minutes + Remember Me: Disabled + +Validation: + [ ] Test Connection + [ ] Download SP Metadata (for IdP registration) +``` + +#### Step 3: Map User Attributes to DACs + +Create a mapping table to link IdP attributes to DAC filters: + +```sql +CREATE TABLE 
T_USER_ATTRIBUTE_TO_DAC_MAPPING ( + USER_ID VARCHAR(256), + ATTRIBUTE_NAME VARCHAR(50), + ATTRIBUTE_VALUE VARCHAR(500), + DAC_NAME VARCHAR(256), + DAC_FILTER_VALUE VARCHAR(500), + EFFECTIVE_FROM DATE, + EFFECTIVE_TO DATE, + MAPPING_STATUS VARCHAR(20) -- ACTIVE, INACTIVE, PENDING +); + +INSERT INTO T_USER_ATTRIBUTE_TO_DAC_MAPPING VALUES + -- User: john.smith + ('john.smith@contoso.com', 'REGIONS', 'Americas', 'DAC_SALES_BY_REGION', 'Americas', '2024-01-01', NULL, 'ACTIVE'), + ('john.smith@contoso.com', 'REGIONS', 'EMEA', 'DAC_SALES_BY_REGION', 'EMEA', '2024-01-01', NULL, 'ACTIVE'), + ('john.smith@contoso.com', 'COST_CENTER', 'CC001', 'DAC_FINANCE_BY_COST_CENTER', 'CC001', '2024-01-01', NULL, 'ACTIVE'), + -- User: jane.doe + ('jane.doe@contoso.com', 'REGIONS', 'EMEA', 'DAC_SALES_BY_REGION', 'EMEA', '2024-01-01', NULL, 'ACTIVE'), + ('jane.doe@contoso.com', 'REGIONS', 'APAC', 'DAC_SALES_BY_REGION', 'APAC', '2024-01-01', NULL, 'ACTIVE'), + ('jane.doe@contoso.com', 'SALES_ORG', '2000', 'DAC_SALES_BY_ORG', '2000', '2024-01-01', NULL, 'ACTIVE'); +``` + +**Synchronization Process:** + +```yaml +Flow: IdP Attribute → Datasphere User → DAC Application + +1. User Logs In to Datasphere + - SAML Request sent to Azure AD + - Azure AD authenticates user + - Azure AD returns SAML Response with: + * User ID (john.smith@contoso.com) + * Email (john.smith@contoso.com) + * Custom Attributes (REGIONS: ['Americas', 'EMEA']) + +2. Datasphere Processes SAML Response + - Validates SAML signature + - Extracts user ID and attributes + - Creates/updates user in Datasphere + - Stores attributes in user session context + +3. User Executes Query + SELECT * FROM T_SALES + +4. Datasphere Applies Security Filters + - Retrieves user's stored attributes + - Looks up applicable DACs for T_SALES + - Builds WHERE clause from DACs: + WHERE SALES_REGION IN ('Americas', 'EMEA') + +5. Query Returns Filtered Results + - 500,000 rows (50% of total) + - Other regions hidden + +6. 
Audit Log Entry + - USER_ID: john.smith@contoso.com + - OPERATION: READ + - FILTER_APPLIED: DAC_SALES_BY_REGION + - ROW_COUNT_AFFECTED: 500,000 +``` + +--- + +## Part 5: Privilege Escalation Prevention + +Prevent users from gaining unauthorized access through configuration exploits. + +### Principle of Least Privilege + +**Core Concept:** + +``` +Each user gets MINIMUM permissions needed for their job. + +Example: Sales Rep +├── NEED: Read sales data for their region +├── DON'T NEED: Edit master data, delete records, admin functions +└── ASSIGN: Viewer role on Sales space + DAC filtering by region + +Example: Data Administrator +├── NEED: Create/modify objects, manage users, audit logs +├── DON'T NEED: View actual data, execute ad-hoc queries +└── ASSIGN: Admin role on Admin space + explicit DATA_READER removed +``` + +**Implementation Checklist:** + +```yaml +Least Privilege Checklist: + +☐ Identify Role Requirements + For each job function: + - List required access (tables, operations) + - List forbidden access (data types, functions) + - Document time-sensitive access (only during month-end) + +☐ Create Custom Roles + Instead of: Using standard Viewer/Editor/Admin roles + Do This: Create specific roles: + - SALES_ANALYST: Read sales only, no customers + - FINANCE_REVIEWER: Read financial data only, no HR + - ADMIN_DATAMODEL: Create objects, no data access + +☐ Assign Minimum Permissions + - Role: Finance Analyst + - Space Access: FINANCE (Viewer) + - Object Level: T_GENERAL_LEDGER (Read) + - Row Level: DAC_COST_CENTER_FILTER (their CC) + - Avoid: Admin rights, cross-space access, write permissions + +☐ Regular Access Reviews + - Quarterly: Review all user assignments + - Question: "Does john.smith still need ADMIN role?" 
+ - Action: Remove if not actively used + +☐ Segregation of Duties (SOD) Enforcement + Prevent: Same person having conflicting roles + - Cannot be: Approver AND Requester + - Cannot be: Data Creator AND Auditor + - Cannot be: Admin AND Data User +``` + +### Segregation of Duties (SoD) Patterns + +**Pattern 1: Financial Controls (SOX Compliance)** + +```yaml +Scenario: Accounts Payable Process + +Segregation: + ├── Purchase Requisition (Employee) + │ └── Can: Create requisitions, view own POs + │ Cannot: Approve own requisitions, access accounting + ├── Approval (Manager) + │ └── Can: Approve requisitions up to $10K + │ Cannot: Create invoices, execute payments + ├── Invoice Receipt (Accounts Payable) + │ └── Can: Enter invoices, match to POs + │ Cannot: Approve requisitions, execute payments + ├── Payment Execution (Finance) + │ └── Can: Execute payments to approved invoices + │ Cannot: Approve, create invoices, change vendors + └── Audit (Internal Audit) + └── Can: View all, produce reports + Cannot: Modify data, execute transactions + +Implementation in Datasphere: + Create 5 Spaces: + - PROCUREMENT (for requisition creation) + - APPROVAL (for managers) + - ACCOUNTS_PAYABLE (for AP team) + - FINANCE (for payment execution) + - AUDIT (for compliance review) + + Create 5 Roles: + - REQUISITIONER: Read/Write on PROCUREMENT tables + - APPROVER: Read on PROCUREMENT, Write on APPROVAL tables + - AP_STAFF: Read on PROCUREMENT/APPROVAL, Write on AP tables + - FINANCE: Read on AP tables, Write on FINANCE tables + - AUDITOR: Read all, no Write permissions + + DAC Controls: + - Requisitioner: Only sees own requisitions + - Approver: Sees requisitions below their limit + - AP_Staff: Sees invoices matched to approved POs + - Finance: Sees invoices ready for payment + - Auditor: Sees all with audit trail +``` + +**Pattern 2: Data Governance (GDPR Compliance)** + +```yaml +Scenario: Customer PII Access + +Segregation: + ├── Data Collection (Customer Service) + │ └── Can: Create 
customer records with contact info + │ Cannot: Delete records, access payment info + ├── Analysis (Marketing) + │ └── Can: View aggregated customer data (no PII) + │ Cannot: Access individual customer records + ├── Privacy (Data Protection Officer) + │ └── Can: Access full customer records for privacy requests + │ Cannot: Use data for marketing, audit customer queries + └── Audit (Compliance) + └── Can: View who accessed PII, when, what actions + Cannot: Modify data, access actual PII + +Implementation: + Tables: + - T_CUSTOMER_CONTACT (phone, email) + - T_CUSTOMER_PAYMENT (CC, bank account) + - T_CUSTOMER_AGGREGATE (summary, no PII) + - T_CUSTOMER_AUDIT (access logs) + + Roles: + - CUSTOMER_REPRESENTATIVE: Write CONTACT, cannot access PAYMENT + - MARKETING_ANALYST: Read AGGREGATE only, CONTACT hidden + - DATA_PROTECTION_OFFICER: Read all PII, Write audit logs + - COMPLIANCE_AUDITOR: Read AUDIT, specific rows based on time-range DAC + + DACs: + - MARKETING_ANALYST: No DAC on CONTACT (entire table hidden) + - CUSTOMER_REP: DAC_ASSIGNED_CUSTOMERS (only their region's customers) + - DPO: DAC_PRIVACY_REQUESTS (only records under active request) +``` + +**Pattern 3: System Administration (Separation from Data Use)** + +```yaml +Scenario: Admin Role Compartmentalization + +Segregation: + ├── Object Administrator + │ └── Can: Create tables, views, transformations + │ Cannot: Access data, manage users, see audit logs + ├── User/Security Administrator + │ └── Can: Manage users, assign roles, configure DACs + │ Cannot: Create objects, access data, modify audit settings + ├── Audit Administrator + │ └── Can: Configure audit policies, view audit logs + │ Cannot: Modify users, create objects, access data + └── System Administrator (limited) + └── Can: Overall system settings, coordinate between other admins + Cannot: Day-to-day object creation, user management + +Implementation in Datasphere: + Roles: + - OBJECT_ADMIN: Roles SPACE_ADMIN on Tech space only + - USER_ADMIN: 
User.Manage permission, cannot enter Data space + - AUDIT_ADMIN: Audit.Manage permission, cannot create objects + - SUPER_ADMIN: All three roles + + Space Segregation: + - TECH_SPACE: Only OBJECT_ADMIN has access + - MASTER_DATA: Only USER_ADMIN can modify access + - DATA_SPACE: No admin role, only business users with DACs + - AUDIT_SPACE: Only AUDIT_ADMIN, read-only for others +``` + +--- + +## Part 6: Security Testing and Validation Workflows + +### Pre-Go-Live Security Checklist + +```yaml +Security Validation - Pre-Production Checklist: + +Identity & Access Management: + ☐ IdP (SAML/OIDC) configuration tested + ☐ User provisioning tested (new users created correctly) + ☐ User attribute mapping verified (custom attributes populated) + ☐ Single sign-out verified (logout clears session) + ☐ Session timeout tested (idle users logged out) + ☐ Concurrent session limits enforced + +Data Access Controls: + ☐ DACs active on all sensitive tables + ☐ Each DAC tested with test users + ☐ Row-level filtering verified (users see only assigned rows) + ☐ Hierarchy DACs tested (managers see subordinate data) + ☐ Combined DACs tested (multiple filters work together) + ☐ Admin override tested (admin users bypass DACs if configured) + ☐ DAC performance measured (< 5% query overhead) + +Audit & Compliance: + ☐ Audit policies active on sensitive tables + ☐ Read operations logged for PII tables + ☐ Change operations logged for all data + ☐ Audit log table populated correctly + ☐ Retention policies configured and tested + ☐ Alert thresholds set and tested + ☐ Compliance mapping configured + +Authorization Migration (if from BW): + ☐ All BW users have corresponding DACs + ☐ All BW fields mapped to Datasphere columns + ☐ Row count reconciliation completed (±0.1%) + ☐ Authorization coverage tested (sample users) + ☐ Audit trail shows migration timestamp + +Privilege Segregation: + ☐ No user has multiple conflicting roles + ☐ Least privilege assignments verified + ☐ Admin roles limited to 
necessary users + ☐ Data access separated from admin functions + ☐ Audit access separated from data access + +Encryption & Network: + ☐ TLS 1.2+ for all connections + ☐ Data in transit encrypted + ☐ Data at rest encrypted (if required) + ☐ VPN/Firewall rules configured + ☐ IP whitelisting tested (if applicable) + +Disaster & Incident Response: + ☐ Audit log backup/archival process defined + ☐ Emergency access procedure documented + ☐ Incident response runbook created + ☐ Key person coverage confirmed + ☐ Change management process implemented +``` + +### Security Testing Scenarios + +#### Test 1: User Isolation (DAC Filtering) + +```yaml +Scenario: Sales Reps Cannot See Each Other's Data + +Setup: + User 1: john.smith (ASSIGNED_REGION = 'Americas') + User 2: jane.doe (ASSIGNED_REGION = 'EMEA') + Table: T_SALES (100,000 rows: 50K Americas, 50K EMEA) + +Test Execution: + Step 1: Log in as john.smith + Step 2: Execute: SELECT COUNT(*) FROM T_SALES + Expected Result: 50,000 (Americas only) + Actual Result: ___ + + Step 3: Log in as jane.doe + Step 4: Execute: SELECT COUNT(*) FROM T_SALES + Expected Result: 50,000 (EMEA only) + Actual Result: ___ + + Step 5: Try to query: SELECT * FROM T_SALES WHERE REGION = 'EMEA' as john.smith + Expected Result: 0 rows (filtered out by DAC) + Actual Result: ___ + +Pass Criteria: Both users see only their region, cannot access other regions +``` + +#### Test 2: Admin Override (if configured) + +```yaml +Scenario: Admin Can See All Data When Override Enabled + +Setup: + User: admin.user (ROLE = 'ADMIN', DAC bypass = enabled) + Table: T_SALES (100,000 rows) + +Test Execution: + Step 1: Log in as admin.user + Step 2: Execute: SELECT COUNT(*) FROM T_SALES + Expected Result: 100,000 (all rows, DAC bypassed) + Actual Result: ___ + +Pass Criteria: Admin sees complete dataset +``` + +#### Test 3: Audit Trail Completeness + +```yaml +Scenario: All Sensitive Data Access Logged + +Setup: + User: john.smith + Table: T_CUSTOMER_PII (sensitive) + 
Audit Policy: Detailed READ logging enabled + +Test Execution: + Step 1: Execute SELECT COUNT(*) FROM T_CUSTOMER_PII + Step 2: Check audit log: SELECT * FROM T_AUDIT_LOG + WHERE DATASPHERE_USER = 'john.smith' + AND OBJECT_NAME = 'T_CUSTOMER_PII' + + Expected Audit Entry: + - AUDIT_TIMESTAMP: (within 1 second of query) + - DATASPHERE_USER: john.smith + - OPERATION: READ + - ROW_COUNT_AFFECTED: (should match count) + - FILTER_APPLIED: Y + - DAC_FILTERS_APPLIED: [DAC_name] + +Pass Criteria: Audit entry exists with complete information +``` + +#### Test 4: Hierarchy DAC Navigation + +```yaml +Scenario: Manager Can See Team Data but Not Siblings + +Setup: + Organization Hierarchy: + ├── Company + │ ├── Sales Division + │ │ ├── Americas Department + │ │ │ ├── North Region + │ │ │ └── South Region + │ │ └── EMEA Department + │ │ ├── North Europe + │ │ └── South Europe + │ └── Engineering Division + + User: john.smith (Assigned: Sales Division → Americas Department) + Data: T_SALES_TRANSACTIONS (has DEPARTMENT_ID column) + +Test Execution: + Step 1: john.smith executes: SELECT DISTINCT DEPARTMENT_ID FROM T_SALES_TRANSACTIONS + Expected Result: [Americas, North Region, South Region] + Actual Result: ___ + + Step 2: john.smith executes: + SELECT COUNT(*) FROM T_SALES_TRANSACTIONS WHERE DEPARTMENT_ID = 'EMEA' + Expected Result: 0 (EMEA is sibling, not subordinate) + Actual Result: ___ + + Step 3: john.smith executes: + SELECT COUNT(*) FROM T_SALES_TRANSACTIONS WHERE DEPARTMENT_ID = 'North Region' + Expected Result: Actual count (subordinate department visible) + Actual Result: ___ + +Pass Criteria: john.smith sees Americas and its sub-departments, +not EMEA or other divisions +``` + +--- + +## Part 7: MCP Tool References + +This skill integrates with these Claude MCP tools: + +- **get_current_user**: Retrieve logged-in user's identity and attributes +- **get_available_scopes**: List user's accessible spaces and permissions +- **list_spaces**: Enumerate all spaces with access 
metadata +- **get_space_info**: Retrieve space security configuration, members, roles +- **search_repository**: Find security-relevant objects (DACs, audit policies) + +**Example Usage:** + +``` +Assistant: "Let me check what spaces you have access to and your current permission level." +Tool: get_available_scopes() +Result: Scopes = [FINANCE (Viewer), SALES (Editor), AUDIT (Viewer)] +``` + +--- + +## MCP Tool Integration Examples + +### Example 1: Review Current User Authorization + +``` +User Request: "What security controls apply to me?" + +Assistant Actions: +1. get_current_user() + → Returns: user_id=john.smith@contoso.com, roles=[SALES_ANALYST, VIEWER] + +2. get_available_scopes() + → Returns: spaces=[ + { space: 'SALES', role: 'EDITOR', tables: ['T_SALES', 'T_CUSTOMER'] }, + { space: 'FINANCE', role: 'VIEWER', tables: ['T_GL'] } + ] + +3. search_repository(query='DAC', filter='assigned_to:john.smith') + → Returns: dacs=[ + { dac_name: 'DAC_SALES_BY_REGION', filter_column: 'REGION', values: ['Americas', 'EMEA'] } + ] + +Response: "You have access to SALES (editor) and FINANCE (viewer). +A data access control (DAC_SALES_BY_REGION) filters your SALES data to Americas and EMEA regions." +``` + +### Example 2: Audit Policy Impact Analysis + +``` +User Request: "Which tables have audit logging enabled?" + +Assistant Actions: +1. search_repository(query='audit', filter='type:policy', status='active') + → Returns list of active audit policies + +2. For each policy, get_space_info() to see affected tables + +3. 
Compile table list with audit levels and retention + +Response: "These 12 tables have audit logging: +- T_CUSTOMER_PII: Detailed READ logging, 3-year retention +- T_FINANCIAL: Change-only logging, 10-year retention +- T_SALES: Change-only logging, 1-year retention" +``` + +--- + +End of SKILL.md \ No newline at end of file diff --git a/partner-built/SAP-Datasphere/skills/datasphere-security-architect/references/security-patterns.md b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/references/security-patterns.md new file mode 100644 index 0000000..1d3fbdb --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/references/security-patterns.md @@ -0,0 +1,1345 @@ +# Security Architect Reference Guide + +## Table of Contents + +1. DAC Creation Patterns with SQL Examples +2. Authorization Migration Mapping +3. Audit Policy Templates +4. Identity Provider Configuration Checklists +5. Security Review Pre-Go-Live Checklist +6. Common Security Anti-Patterns and Fixes +7. Emergency Access Procedures + +--- + +## 1. DAC Creation Patterns with SQL Examples + +### Pattern 1: Simple Value-Based DAC + +**Use Case:** Sales reps restricted to their assigned region. + +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_SALES_BY_REGION +FOR TABLE T_SALES +WHERE SALES_REGION = :USER_ASSIGNED_REGION; + +-- Parameters: +-- :USER_ASSIGNED_REGION - User attribute from IdP +-- Example: john.smith has attribute ASSIGNED_REGION = 'Americas' +``` + +**User Attribute Setup:** +```sql +-- In IdP (Azure AD, Okta, etc.) 
or Datasphere User Management: +User: john.smith +Custom Attributes: + ASSIGNED_REGION = 'Americas' + +User: jane.doe +Custom Attributes: + ASSIGNED_REGION = 'EMEA' + +User: mary.johnson +Custom Attributes: + ASSIGNED_REGION = 'APAC' +``` + +**Testing:** +```sql +-- Test as john.smith +SELECT DISTINCT SALES_REGION FROM T_SALES; +-- Expected: ['Americas'] + +-- Test as jane.doe +SELECT DISTINCT SALES_REGION FROM T_SALES; +-- Expected: ['EMEA'] + +-- Test as mary.johnson +SELECT DISTINCT SALES_REGION FROM T_SALES; +-- Expected: ['APAC'] +``` + +--- + +### Pattern 2: Multi-Value DAC (IN List) + +**Use Case:** Manager sees multiple assigned cost centers. + +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_COST_CENTER_MULTI +FOR TABLE T_FINANCIALS +WHERE COST_CENTER IN :USER_ASSIGNED_COST_CENTERS; + +-- Parameters: +-- :USER_ASSIGNED_COST_CENTERS - User attribute (array/list) +-- Example: john.smith has attribute ASSIGNED_COST_CENTERS = ['CC001', 'CC002', 'CC003'] +``` + +**User Attribute Setup:** +```sql +-- In IdP: +User: john.smith +Custom Attributes: + ASSIGNED_COST_CENTERS = ['CC001', 'CC002', 'CC003'] + +User: jane.doe +Custom Attributes: + ASSIGNED_COST_CENTERS = ['CC010'] +``` + +**Testing:** +```sql +-- Test as john.smith +SELECT DISTINCT COST_CENTER FROM T_FINANCIALS; +-- Expected: ['CC001', 'CC002', 'CC003'] + +SELECT COUNT(*) FROM T_FINANCIALS WHERE COST_CENTER = 'CC010'; +-- Expected: 0 (not in assigned list) + +-- Test as jane.doe +SELECT DISTINCT COST_CENTER FROM T_FINANCIALS; +-- Expected: ['CC010'] +``` + +--- + +### Pattern 3: Compound Key DAC + +**Use Case:** Restrict by company AND subsidiary. 
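Before translating a compound restriction like this into a DAC, it can help to simulate the combined predicate against a handful of sample rows and confirm the expected visible set. A minimal Python sketch — the row shape and the helper name are illustrative, not a Datasphere API:

```python
# Simulate a compound-key row filter: an exact company match AND a
# subsidiary membership check, mirroring the DAC defined in this pattern.
def dac_compound_filter(rows, company, subsidiaries):
    """Rows visible to a user with the given COMPANY and
    ASSIGNED_SUBSIDIARIES attributes (illustrative attribute names)."""
    return [
        r for r in rows
        if r["COMPANY_CODE"] == company
        and r["SUBSIDIARY_CODE"] in subsidiaries
    ]

ledger = [
    {"COMPANY_CODE": "CORP001", "SUBSIDIARY_CODE": "SUB_USA", "AMOUNT": 100},
    {"COMPANY_CODE": "CORP001", "SUBSIDIARY_CODE": "SUB_UK", "AMOUNT": 200},
    {"COMPANY_CODE": "CORP002", "SUBSIDIARY_CODE": "SUB_UK", "AMOUNT": 300},
]

visible = dac_compound_filter(ledger, "CORP001", ["SUB_USA", "SUB_CANADA"])
print([r["SUBSIDIARY_CODE"] for r in visible])  # ['SUB_USA']
```

Both conditions must hold at once: a row from the right company but an unassigned subsidiary (CORP001/SUB_UK above) stays hidden.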
+ +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_COMPANY_SUBSIDIARY +FOR TABLE T_GENERAL_LEDGER +WHERE COMPANY_CODE = :USER_COMPANY + AND SUBSIDIARY_CODE IN :USER_ASSIGNED_SUBSIDIARIES; + +-- Parameters: +-- :USER_COMPANY - Single value +-- :USER_ASSIGNED_SUBSIDIARIES - Array value +``` + +**User Attribute Setup:** +```sql +-- In IdP: +User: john.smith +Custom Attributes: + COMPANY = 'CORP001' + ASSIGNED_SUBSIDIARIES = ['SUB_USA', 'SUB_CANADA', 'SUB_MEXICO'] + +User: jane.doe +Custom Attributes: + COMPANY = 'CORP002' + ASSIGNED_SUBSIDIARIES = ['SUB_UK', 'SUB_DE', 'SUB_FR'] +``` + +**Testing:** +```sql +-- Test as john.smith +SELECT DISTINCT COMPANY_CODE FROM T_GENERAL_LEDGER; +-- Expected: ['CORP001'] + +SELECT DISTINCT SUBSIDIARY_CODE FROM T_GENERAL_LEDGER +WHERE COMPANY_CODE = 'CORP001'; +-- Expected: ['SUB_USA', 'SUB_CANADA', 'SUB_MEXICO'] + +SELECT COUNT(*) FROM T_GENERAL_LEDGER +WHERE COMPANY_CODE = 'CORP002'; +-- Expected: 0 (different company, hidden) +``` + +--- + +### Pattern 4: Hierarchy-Based DAC + +**Use Case:** Users see their department and subordinates in org hierarchy. 
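A hierarchy DAC ultimately needs the full descendant set of the user's assigned node. Because a fixed-depth SQL subquery (as used in the simplified definition later in this pattern) stops after a set number of levels, a common approach is to precompute the closure with a recursive walk and load it into a flat mapping table. A minimal Python sketch over the node IDs used below — a hypothetical helper, not part of Datasphere:

```python
from collections import defaultdict

# Parent -> children adjacency, mirroring the T_ORG_HIERARCHY rows
# defined in this pattern.
edges = [
    ("CORP", None), ("DIV_ENG", "CORP"), ("DIV_SALES", "CORP"),
    ("DEPT_SWE", "DIV_ENG"), ("DEPT_QA", "DIV_ENG"),
    ("DEPT_FIELD", "DIV_SALES"), ("DEPT_INSIDE", "DIV_SALES"),
    ("TEAM_PYTHON", "DEPT_SWE"), ("TEAM_JAVA", "DEPT_SWE"),
]

children = defaultdict(list)
for node, parent in edges:
    if parent:
        children[parent].append(node)

def descendants(node):
    """Node plus all of its descendants, depth-first -- the set a
    hierarchy DAC should expose for a user assigned to `node`."""
    out = [node]
    for child in children[node]:
        out.extend(descendants(child))
    return out

print(sorted(descendants("DEPT_SWE")))
# ['DEPT_SWE', 'TEAM_JAVA', 'TEAM_PYTHON']
```

The resulting (user node, visible node) pairs can be persisted and joined in the DAC instead of nesting subqueries per level.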
+ +**Hierarchy Table Structure:** +```sql +CREATE TABLE T_ORG_HIERARCHY ( + NODE_ID VARCHAR(20) PRIMARY KEY, + PARENT_NODE_ID VARCHAR(20), + NODE_NAME VARCHAR(100), + NODE_LEVEL INTEGER, + EFFECTIVE_FROM DATE, + EFFECTIVE_TO DATE +); + +INSERT INTO T_ORG_HIERARCHY VALUES + ('CORP', NULL, 'Corporation', 1, '2024-01-01', NULL), + ('DIV_ENG', 'CORP', 'Engineering', 2, '2024-01-01', NULL), + ('DIV_SALES', 'CORP', 'Sales', 2, '2024-01-01', NULL), + ('DEPT_SWE', 'DIV_ENG', 'Software Engineering', 3, '2024-01-01', NULL), + ('DEPT_QA', 'DIV_ENG', 'QA', 3, '2024-01-01', NULL), + ('DEPT_FIELD', 'DIV_SALES', 'Field Sales', 3, '2024-01-01', NULL), + ('DEPT_INSIDE', 'DIV_SALES', 'Inside Sales', 3, '2024-01-01', NULL), + ('TEAM_PYTHON', 'DEPT_SWE', 'Python Team', 4, '2024-01-01', NULL), + ('TEAM_JAVA', 'DEPT_SWE', 'Java Team', 4, '2024-01-01', NULL); +``` + +**User-to-Hierarchy Mapping:** +```sql +CREATE TABLE T_USER_HIERARCHY_MAP ( + USER_ID VARCHAR(256), + ASSIGNED_NODE_ID VARCHAR(20), + EFFECTIVE_FROM DATE, + EFFECTIVE_TO DATE +); + +INSERT INTO T_USER_HIERARCHY_MAP VALUES + ('john.smith@contoso.com', 'DEPT_SWE', '2024-01-01', NULL), -- Dept head + ('jane.doe@contoso.com', 'DIV_SALES', '2024-01-01', NULL), -- Division head + ('mary.johnson@contoso.com', 'CORP', '2024-01-01', NULL), -- CEO + ('alex.brown@contoso.com', 'TEAM_PYTHON', '2024-01-01', NULL); -- Team lead +``` + +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_ORG_HIERARCHY +FOR TABLE T_EMPLOYEE_DATA +WHERE DEPARTMENT_ID IN ( + SELECT NODE_ID FROM T_ORG_HIERARCHY h + WHERE h.NODE_ID = :USER_ASSIGNED_NODE + OR h.PARENT_NODE_ID = :USER_ASSIGNED_NODE + OR h.PARENT_NODE_ID IN ( + SELECT NODE_ID FROM T_ORG_HIERARCHY + WHERE PARENT_NODE_ID = :USER_ASSIGNED_NODE + ) +); + +-- Simplified: covers the assigned node plus two levels of descendants; +-- deeper hierarchies need a recursive expansion or a flattened closure table +``` + +**Testing:** +```sql +-- Test as john.smith (DEPT_SWE): +-- Can see: DEPT_SWE, TEAM_PYTHON, TEAM_JAVA +SELECT DISTINCT DEPARTMENT_ID FROM T_EMPLOYEE_DATA; +-- Expected: 
['DEPT_SWE', 'TEAM_PYTHON', 'TEAM_JAVA'] + +-- Cannot see: DIV_SALES, DEPT_FIELD +SELECT COUNT(*) FROM T_EMPLOYEE_DATA WHERE DEPARTMENT_ID = 'DEPT_FIELD'; +-- Expected: 0 + +-- Test as jane.doe (DIV_SALES): +-- Can see: DIV_SALES, DEPT_FIELD, DEPT_INSIDE +SELECT DISTINCT DEPARTMENT_ID FROM T_EMPLOYEE_DATA; +-- Expected: ['DIV_SALES', 'DEPT_FIELD', 'DEPT_INSIDE'] + +-- Test as mary.johnson (CORP): +-- Can see: Everything (all descendants) +SELECT COUNT(DISTINCT DEPARTMENT_ID) FROM T_EMPLOYEE_DATA; +-- Expected: 9 (all departments) with a full-depth hierarchy expansion; +-- note: the simplified two-level DAC above returns only 7 here, because the +-- level-4 TEAM_* nodes are out of its reach -- extend the subquery recursively +``` + +--- + +### Pattern 5: Time-Bounded DAC + +**Use Case:** Users can only see current year's data (rolling window). + +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_CURRENT_YEAR +FOR TABLE T_SALES_TRANSACTIONS +WHERE POSTING_DATE >= DATE_TRUNC('year', CURRENT_DATE) + AND POSTING_DATE < DATE_TRUNC('year', CURRENT_DATE) + INTERVAL 1 YEAR; + +-- Alternative: Last 12 months +CREATE DATA ACCESS CONTROL DAC_TRAILING_12_MONTHS +FOR TABLE T_SALES_TRANSACTIONS +WHERE POSTING_DATE >= CURRENT_DATE - INTERVAL 1 YEAR + AND POSTING_DATE < CURRENT_DATE + INTERVAL 1 DAY; +``` + +**Testing:** +```sql +-- Current date: 2024-02-08 + +-- Test: Current Year (2024) +SELECT MIN(POSTING_DATE), MAX(POSTING_DATE) FROM T_SALES_TRANSACTIONS; +-- Expected: '2024-01-01' to '2024-02-08' +-- Hidden: Any 2023 or prior dates + +-- Test: Trailing 12 months +SELECT COUNT(*) FROM T_SALES_TRANSACTIONS +WHERE POSTING_DATE < '2023-02-08'; +-- Expected: 0 (older than 12 months hidden) +``` + +--- + +### Pattern 6: Conditional DAC (IF/THEN) + +**Use Case:** Different filters based on user role. 
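A useful preliminary step is to write the intended role logic as ordinary code before translating it into the DAC's CASE expression: admins are unfiltered, managers and sales reps are scoped to their own rows, and unknown roles are denied by default. A minimal Python sketch — row fields and role names mirror this pattern, but the helper itself is illustrative:

```python
def conditional_filter(row, user_role, user_id):
    """Mirror of the role-based predicate: admins see everything,
    managers/reps see only their own rows, unknown roles see nothing."""
    if user_role == "ADMIN":
        return True
    if user_role == "MANAGER":
        return row["MANAGER_ID"] == user_id
    if user_role == "SALES_REP":
        return row["ASSIGNED_SALES_REP"] == user_id
    return False  # deny by default, like the ELSE 1 = 0 branch

orders = [
    {"ORDER_ID": 1, "MANAGER_ID": "MGR001", "ASSIGNED_SALES_REP": "SR001"},
    {"ORDER_ID": 2, "MANAGER_ID": "MGR002", "ASSIGNED_SALES_REP": "SR002"},
]

print(sum(conditional_filter(o, "ADMIN", "X") for o in orders))       # 2
print(sum(conditional_filter(o, "MANAGER", "MGR001") for o in orders))  # 1
print(sum(conditional_filter(o, "INTERN", "SR001") for o in orders))    # 0
```

The explicit deny-by-default branch is the important part: a role missing from the mapping should yield zero rows, never the full table.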
+ +**DAC Definition:** +```sql +CREATE DATA ACCESS CONTROL DAC_CONDITIONAL_ROLE +FOR TABLE T_ORDERS +WHERE CASE + WHEN :USER_ROLE = 'ADMIN' THEN 1 = 1 -- No filter for admins + WHEN :USER_ROLE = 'MANAGER' THEN MANAGER_ID = :USER_ID + WHEN :USER_ROLE = 'SALES_REP' THEN ASSIGNED_SALES_REP = :USER_ID + ELSE 1 = 0 -- Deny all if role unknown +END; + +-- Parameters: +-- :USER_ROLE - from IdP role mapping +-- :USER_ID - current user's ID +``` + +**Testing:** +```sql +-- Test as admin user (ROLE='ADMIN'): +SELECT COUNT(*) FROM T_ORDERS; +-- Expected: All orders + +-- Test as manager (ROLE='MANAGER', USER_ID='MGR001'): +SELECT COUNT(*) FROM T_ORDERS; +-- Expected: Only orders where MANAGER_ID = 'MGR001' + +-- Test as sales rep (ROLE='SALES_REP', USER_ID='SR001'): +SELECT COUNT(*) FROM T_ORDERS; +-- Expected: Only orders where ASSIGNED_SALES_REP = 'SR001' +``` + +--- + +### Pattern 7: Sensitive Column Masking (Advanced) + +**Note:** Datasphere DACs filter rows, not columns. For column-level masking, use Transformation Rules: + +```sql +-- Column-level masking: Redact PII in SELECT +SELECT + CUSTOMER_ID, + CUSTOMER_NAME, + CASE + WHEN CURRENT_USER NOT IN ('dpo@company.com', 'admin@company.com') + THEN '***REDACTED***' + ELSE CUSTOMER_EMAIL + END as CUSTOMER_EMAIL, + CASE + WHEN CURRENT_USER NOT IN ('dpo@company.com', 'admin@company.com') + THEN '***REDACTED***' + ELSE CUSTOMER_PHONE + END as CUSTOMER_PHONE +FROM T_CUSTOMER_SENSITIVE; + +-- Result: Non-privileged users see redacted values +-- DPO and admin see actual PII +``` + +--- + +## 2. 
Authorization Migration Mapping (BW → Datasphere) + +### Mapping Table Structure + +```sql +CREATE TABLE T_BW_DS_AUTH_MAPPING ( + MAPPING_ID INTEGER PRIMARY KEY AUTO_INCREMENT, + BW_USER_ID VARCHAR(256), + BW_AUTH_OBJECT VARCHAR(30), + BW_AUTH_FIELD VARCHAR(30), + BW_AUTH_VALUES VARCHAR(1000), + DS_TABLE_NAME VARCHAR(256), + DS_COLUMN_NAME VARCHAR(256), + DS_DAC_NAME VARCHAR(256), + MAPPING_STATUS VARCHAR(20), -- PENDING, MAPPED, VALIDATED, ACTIVE + VALIDATION_STATUS VARCHAR(20), -- PASSED, FAILED, WARNING + VALIDATION_NOTES VARCHAR(1000), + CREATED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + ACTIVATED_AT TIMESTAMP, + CREATED_BY VARCHAR(256), + + INDEX idx_bw_user (BW_USER_ID), + INDEX idx_ds_table (DS_TABLE_NAME), + INDEX idx_status (MAPPING_STATUS) +); +``` + +### Migration Workflow + +``` +Phase 1: EXTRACTION (BW System) +│ +├─ Extract all users with active authorizations +├─ For each user, export: +│ ├─ BW_USER_ID +│ ├─ Authorization objects (0_RF_BOA, etc.) +│ ├─ Auth fields and values +│ └─ Assigned InfoCubes +│ +└─ Output: T_BW_AUTH_EXTRACT (staging table) + +Phase 2: MAPPING (Analysis & Planning) +│ +├─ For each BW authorization: +│ ├─ Identify corresponding Datasphere table +│ ├─ Map BW field → Datasphere column +│ ├─ Determine DAC type (Operator, Hierarchy, etc.) +│ └─ Create DAC definition +│ +└─ Output: T_BW_DS_AUTH_MAPPING (PENDING status) + +Phase 3: CREATION (Datasphere) +│ +├─ For each row in T_BW_DS_AUTH_MAPPING: +│ ├─ Create DAC in Datasphere +│ ├─ Assign to table/view +│ ├─ Assign to user +│ └─ Update status to MAPPED +│ +└─ Output: Datasphere DACs created + +Phase 4: VALIDATION +│ +├─ For each DAC: +│ ├─ Test user access +│ ├─ Verify row filtering +│ ├─ Reconcile BW vs. 
DS row counts +│ └─ Update validation status +│ +└─ Output: T_BW_DS_AUTH_MAPPING (VALIDATED status) + +Phase 5: ACTIVATION +│ +├─ Disable BW Bridge access +├─ Enable Datasphere access +├─ Monitor for access issues +├─ Update status to ACTIVE +│ +└─ Output: Go-live complete +``` + +### Example Migration + +``` +BW Authorization Export: +┌────────────┬──────────────────────────────┐ +│ USER_ID │ Authorization Details │ +├────────────┼──────────────────────────────┤ +│ JOHN.SMITH │ Object: 0_RF_BOA │ +│ │ InfoCube: 0SALES_001 │ +│ │ COMPANY_CODE: 0010, 0020 │ +│ │ SALES_ORG: 1000 │ +│ │ REGION: EUR, AMER │ +└────────────┴──────────────────────────────┘ + +Datasphere Mapping: +┌──────────────────┬──────────────────────┬─────────────────────────┐ +│ BW Field │ DS Column │ DAC Name │ +├──────────────────┼──────────────────────┼─────────────────────────┤ +│ COMPANY_CODE │ COMPANY_CODE │ DAC_JOHN_COMPANY │ +│ SALES_ORG │ SALES_ORG │ DAC_JOHN_SALES_ORG │ +│ REGION │ SALES_REGION │ DAC_JOHN_REGION │ +└──────────────────┴──────────────────────┴─────────────────────────┘ + +DAC Creation: +DAC_JOHN_COMPANY: + WHERE COMPANY_CODE IN ('0010', '0020') + +DAC_JOHN_SALES_ORG: + WHERE SALES_ORG = '1000' + +DAC_JOHN_REGION: + WHERE SALES_REGION IN ('EUR', 'AMER') + +Combined Effect: + SELECT * FROM T_SALES + WHERE COMPANY_CODE IN ('0010', '0020') + AND SALES_ORG = '1000' + AND SALES_REGION IN ('EUR', 'AMER') +``` + +--- + +## 3. Audit Policy Templates for Compliance Frameworks + +### Template 1: SOX (Sarbanes-Oxley) Compliance + +**Applicable To:** Publicly traded companies, financial reporting. 
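Several of the alert rules in the template below reduce to simple threshold checks over individual audit entries. A minimal Python sketch of such a classifier — the field names and thresholds are illustrative, not a Datasphere API:

```python
# Classify an audit entry per SOX-style alert rules: journal-voucher
# deletions are critical, bulk GL updates (> 100 rows) are high severity.
def classify_alert(entry):
    if entry["operation"] == "DELETE" and entry["table"] == "T_JOURNAL_VOUCHERS":
        return "CRITICAL"
    if (entry["operation"] == "UPDATE"
            and entry["table"] == "T_GENERAL_LEDGER"
            and entry["rows"] > 100):
        return "HIGH"
    return None  # no alert for routine activity

print(classify_alert({"operation": "UPDATE", "table": "T_GENERAL_LEDGER", "rows": 500}))  # HIGH
print(classify_alert({"operation": "DELETE", "table": "T_JOURNAL_VOUCHERS", "rows": 1}))  # CRITICAL
```

Keeping the rules as data (table, operation, threshold, severity) rather than code makes it easier to extend them per compliance framework.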
+ +```yaml +Audit Policy: SOX_COMPLIANCE_AUDIT + +Scope: + Tables: + - T_GENERAL_LEDGER (all accounting entries) + - T_JOURNAL_VOUCHERS (manual journal entries) + - T_RECONCILIATIONS (bank/balance sheet reconciliations) + - T_FIXED_ASSETS (asset management) + - T_INTERCOMPANY_TRANSACTIONS (consolidation entries) + +Operations: + ☑ READ (Financial data access) + ☑ INSERT (New transactions) + ☑ UPDATE (Changes to existing transactions) + ☑ DELETE (Deletion of transactions - should not occur) + +Users: + - All (especially non-Finance) + - Priority: Non-accounting users accessing financial data + +Logging Details: + ├─ Timestamp (down to millisecond) + ├─ User ID & session + ├─ IP address & hostname + ├─ Rows affected (count) + ├─ Key field values (GL account, cost center) + ├─ SQL hash (for deduplication) + └─ Execution time + +Retention: + Duration: 10 years (SEC requirement) + Storage: Primary database for 1 year, then archive + Archival: Immutable (cannot be deleted) + Retrieval: Must be queryable within 24 hours + +Alerting: + ☑ DELETE on T_JOURNAL_VOUCHERS: CRITICAL + ☑ Bulk UPDATE on T_GENERAL_LEDGER: HIGH (> 100 rows) + ☑ Access from unusual IP: MEDIUM + ☑ Non-Finance user reading GL: MEDIUM + +Reporting: + ├─ Daily reconciliation: Transactions entered vs. approved + ├─ Weekly: User access reports (who accessed what) + ├─ Monthly: Exception report (DELETE/bulk updates) + ├─ Quarterly: Compliance audit (coverage = 100%) + └─ Annual: SOX Section 404 attestation data + +Audit Review: + Frequency: Continuous monitoring + Monthly deep-dive + Owner: Internal Audit + Scope: 100% of financial transactions + Approval: CFO quarterly certification +``` + +--- + +### Template 2: GDPR (General Data Protection Regulation) + +**Applicable To:** European data subjects, personal data processing. 
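A recurring operational detail in audit templates like the one below is automated retention: entries past the configured window are deleted unless a legal hold applies. A minimal Python sketch of such a sweep — the row shape and the 3-year window are illustrative assumptions:

```python
from datetime import date, timedelta

RETENTION = timedelta(days=3 * 365)  # illustrative 3-year storage-limitation window

def expired(entries, today):
    """Audit entries past retention and not under legal hold --
    the set eligible for automatic deletion."""
    return [
        e for e in entries
        if today - e["logged_on"] > RETENTION and not e["legal_hold"]
    ]

audit_log = [
    {"id": 1, "logged_on": date(2020, 1, 1), "legal_hold": False},
    {"id": 2, "logged_on": date(2020, 1, 1), "legal_hold": True},   # litigation pending
    {"id": 3, "logged_on": date(2025, 6, 1), "legal_hold": False},  # still in window
]

print([e["id"] for e in expired(audit_log, date(2026, 2, 1))])  # [1]
```

The legal-hold check matters: retention deletion must be suspended for records under litigation, even when the window has elapsed.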
+ +```yaml +Audit Policy: GDPR_PERSONAL_DATA_AUDIT + +Scope: + Tables: + - T_CUSTOMER_PERSONAL (name, address, contact) + - T_CUSTOMER_PAYMENT (credit cards, bank accounts) + - T_EMPLOYEE_PERSONAL (HR records) + - T_APPLICANT_DATA (job application info) + - Any table containing PII + + Definition of PII: Name, Email, Phone, SSN, Address, Payment methods, etc. + +Operations: + ☑ READ (Accessing personal data - DETAILED logging) + ☑ INSERT (Collecting personal data) + ☑ UPDATE (Modifying personal data) + ☑ DELETE (Right to be forgotten) + +Users: + - All (especially non-authorized business functions) + - Flag: Non-Customer-Service users accessing customer PII + +Logging Details: + ├─ Timestamp + ├─ User ID & session + ├─ Purpose code (e.g., "BILLING", "SUPPORT", "MARKETING") + ├─ Data subject ID (anonymized if possible) + ├─ Rows accessed (count) + ├─ Specific fields accessed (if column-level logging available) + └─ Data export/download (flagged separately) + +Retention: + Duration: 3 years (GDPR Article 5.1.e - storage limitation) + Storage: Encrypted at rest + Archival: Delete on expiration (automatic) + Legal Hold: Extended retention if litigation pending + Data Subject Access: Must provide log extract within 30 days + +Alerting: + ☑ DELETE personal data: CRITICAL (log for audit trail) + ☑ Bulk export: HIGH (> 100 records) + ☑ Non-support user reading PII: HIGH + ☑ International data transfer: CRITICAL + ☑ Unauthorized processing: CRITICAL + +Reporting: + ├─ Data Subject Access Requests (SARs): 30-day reporting + ├─ Breach notification: <72 hours to regulator + ├─ Purpose justification: User must document legal basis + ├─ Weekly: Users accessing PII without clear business need + └─ Monthly: Compliance dashboard + +Special Requirements: + ├─ Data Minimization: Only collect/store minimum necessary + ├─ Consent Tracking: Log consent dates/revocation + ├─ Purpose Limitation: Can only use data for stated purpose + ├─ Right to be Forgotten: Track deletion requests + ├─ 
Legitimate Interest: Document basis for processing + └─ Data Protection Impact Assessment (DPIA): Document high-risk processing + +Audit Review: + Frequency: Continuous + Quarterly DPA review + Owner: Data Protection Officer (DPO) + Scope: 100% of PII access + Approval: DPO sign-off for compliance +``` + +--- + +### Template 3: HIPAA (Health Insurance Portability and Accountability Act) + +**Applicable To:** Healthcare providers, health plans, covered entities. + +```yaml +Audit Policy: HIPAA_PHI_AUDIT + +Scope: + Tables: + - T_PATIENT_RECORDS (medical history) + - T_DIAGNOSES (diagnosis codes) + - T_MEDICATIONS (drug records) + - T_CLAIMS (insurance claims with patient info) + - T_GENETIC_DATA (DNA/genetic tests) + - Any table with Protected Health Information (PHI) + + Definition of PHI: Medical records, diagnoses, medications, genetic info, etc. + +Operations: + ☑ READ (Clinician access - requires authorization) + ☑ INSERT (New patient records) + ☑ UPDATE (Chart updates) + ☑ DELETE (Corrections - rare) + +Users: + - Clinicians (doctors, nurses, specialists) + - Billing staff (claims processing) + - QA/Auditors (compliance review) + - Flag all others + +Logging Details: + ├─ Timestamp (down to second) + ├─ User ID + ├─ User role (Physician, Nurse, Billing, etc.) 
+ ├─ Patient identifier + ├─ Access purpose (e.g., "Direct Care", "Treatment", "Billing") + ├─ Rows accessed (patient record count) + ├─ Unusual access patterns (e.g., accessing deceased patient) + └─ Export/download of records + +Retention: + Duration: 6 years (HIPAA minimum) + Storage: Encrypted, access-controlled + Archival: Immutable after 1 year + Retrieval: Audit logs available within 24 hours + +Alerting: + ☑ Unauthorized access attempt: CRITICAL + ☑ Non-clinician accessing PHI: CRITICAL + ☑ Bulk PHI export: CRITICAL + ☑ After-hours access (unusual): HIGH + ☑ Access to unassigned patient: MEDIUM + +Reporting: + ├─ Daily: Unauthorized access attempts + ├─ Weekly: User access reports (by role) + ├─ Monthly: Unusual access patterns (dead patients, bulk queries) + ├─ Quarterly: Breach risk assessment + └─ Annual: Compliance certification + +Access Controls: + ├─ Role-Based: Only clinicians caring for patient see records + ├─ Time-Limited: Access expires at discharge + ├─ Minimum Necessary: Show only relevant portions + ├─ Emergency Access: Track and log all emergency overrides + └─ Terminated Employee: Immediate revocation + +Breach Notification: + ├─ Detection: Unauthorized access → immediate alert + ├─ Assessment: Within 24 hours (< 500 records = low risk) + ├─ Notification: Patients (> 500 records = HHS/media notice) + ├─ Documentation: Audit log required for investigation + └─ Corrective Action: Preventive measures documented + +Audit Review: + Frequency: Continuous monitoring + Monthly deep-dive + Owner: Chief Compliance Officer (CCO) & Privacy Officer + Scope: 100% of PHI access + Approval: Annual OCR (Office for Civil Rights) audit +``` + +--- + +## 4. 
Identity Provider Configuration Checklist + +### Azure AD SAML Configuration + +```yaml +Prerequisites: + ☐ Azure AD Premium subscription (or free tier) + ☐ Datasphere tenant URL (e.g., https://datasphere.company.com) + ☐ Domain administrator access + ☐ User attribute mapping planned + +Step 1: Register Application in Azure AD + ☐ Sign in to Azure Portal (portal.azure.com) + ☐ Navigate: Azure AD → App Registrations → New Registration + ☐ Name: "SAP Datasphere" + ☐ Supported account types: "Accounts in this organizational directory only" + ☐ Redirect URI: Web - https://datasphere.company.com/saml/acs + ☐ Click Register + +Step 2: Configure SAML Single Sign-On + ☐ In app overview, click "Single sign-on" + ☐ Select "SAML" + ☐ Upload or copy Datasphere SP metadata + +Step 3: Basic SAML Configuration + ☐ Identifier (Entity ID): https://datasphere.company.com/saml/entity + ☐ Reply URL (ACS): https://datasphere.company.com/saml/acs + ☐ Sign On URL: https://datasphere.company.com/login + ☐ Logout URL: https://datasphere.company.com/logout + +Step 4: Attributes & Claims + ☐ NameID format: unspecified + ☐ NameID value: user.userprincipalname + ☐ Add custom claim: + Name: http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress + Value: user.mail + ☐ Add custom claim for REGIONS: + Name: REGIONS + Value: user.extensionAttribute1 + +Step 5: SAML Signing Certificate + ☐ Download signing certificate (Base64) + ☐ Upload to Datasphere IdP configuration + +Step 6: Azure AD Sign-In URL & Issuer + ☐ Copy "Azure AD Sign-In URL" → Datasphere IdP Sign-On URL + ☐ Copy "Azure AD Identifier" → Datasphere IdP Entity ID + +Step 7: Configure User Attributes in Azure AD + ☐ Go: Azure AD → Users → Select User (e.g., john.smith) + ☐ Profile → Edit + ☐ Extension attributes: + extensionAttribute1 = Americas,EMEA (for REGIONS) + extensionAttribute2 = CC001,CC002 (for COST_CENTERS) + +Step 8: Test Configuration + ☐ In Datasphere Admin Console, click "Test SSO" + ☐ Select test user 
(john.smith) + ☐ Verify login succeeds + ☐ Verify user attributes populated correctly + +Step 9: Assign Users to Application + ☐ In Azure AD, navigate to Users and groups + ☐ Add users/groups to the Datasphere application + ☐ (Alternative: Allow all users, manage via Datasphere roles) + +Step 10: User Provisioning (Optional) + ☐ If using automatic provisioning: + ☑ Enable SCIM provisioning + ☑ Copy SCIM URL from Azure + ☑ Generate bearer token + ☑ Enter in Datasphere IdP configuration + ☐ If manual: Users created in Datasphere, authenticate via Azure AD + +Verification: + ☐ User can log in via SAML + ☐ User attributes visible in Datasphere profile + ☐ DACs apply based on user attributes + ☐ Logout successful + ☐ Session timeout works +``` + +### Okta SAML Configuration + +```yaml +Prerequisites: + ☐ Okta organization (org-xxxxx.okta.com) + ☐ Admin access + ☐ Datasphere SP metadata or details + +Step 1: Create SAML 2.0 App in Okta + ☐ Sign in to Okta Admin (okta.com/admin) + ☐ Applications → Applications → Create New App + ☐ Platform: Web + ☐ Sign-On Method: SAML 2.0 + ☐ Click Create + ☐ App Name: SAP Datasphere + ☐ Click Next + +Step 2: Configure SAML Settings + ☐ Single Sign-On URL: https://datasphere.company.com/saml/acs + ☐ Audience URI (SP Entity ID): https://datasphere.company.com/saml/entity + ☐ Name ID Format: unspecified + ☐ Name ID Value: user.login + +Step 3: Configure Attributes & Claims + ☐ Add Attribute Statements: + Name: email, Value: user.email + Name: firstName, Value: user.firstName + Name: lastName, Value: user.lastName + ☐ Add Custom Claims: + Name: REGIONS, Value: user.regions (if custom attribute) + Name: COST_CENTER, Value: user.costCenter + +Step 4: Configure Group Claims (Optional) + ☐ Add Group Claim: + Name: groups + Filter: Starts with = "*" (all groups) + Value: group.name + +Step 5: Download Metadata + ☐ Scroll to "SAML Signing Certificates" + ☐ Click "Download" to get certificate + +Step 6: Provide Okta Details to Datasphere + ☐ 
Sign-On URL: https://org-xxxxx.okta.com/app/123.../sso/saml + ☐ Issuer ID: https://org-xxxxx.okta.com + ☐ Certificate: [Downloaded certificate] + ☐ Sign-Out URL: https://org-xxxxx.okta.com/login/signout + +Step 7: Assign Users/Groups to App + ☐ In Okta, go to Assignments tab + ☐ Assign users or groups (e.g., "Datasphere Users" group) + ☐ Set provisioning options if automated provisioning enabled + +Step 8: Configure User Attributes (if custom) + ☐ Profile Editor → Okta user profile + ☐ Add custom attributes: + regions: List (value: "Americas,EMEA") + costCenter: String (value: "CC001") + ☐ Populate attributes for test users + +Step 9: Test SAML Login + ☐ In Datasphere, click "Test SAML Configuration" + ☐ User is redirected to Okta + ☐ Okta login prompt appears + ☐ User logs in + ☐ Redirected back to Datasphere + ☐ Session established + +Step 10: Verify Attribute Flow + ☐ Log in as test user + ☐ Check Datasphere user profile + ☐ Verify custom attributes present + ☐ Verify DACs applied correctly + +Troubleshooting: + ☐ If "Assertion Consumer Service URL mismatch": + Check ACS URL matches exactly in both Okta and Datasphere + ☐ If "NameID not found": + Verify NameID attribute matches IdP claim name + ☐ If "Attributes not populated": + Check Attribute Statements in Okta are correct + Verify user has values for those attributes +``` + +--- + +## 5. 
Security Review Pre-Go-Live Checklist + +### 30-Day Pre-Go-Live Security Review + +```yaml +Timeline: T-30 days to Go-Live + +Week 1 (T-30 to T-23): + Access Control Review: + ☐ List all users with Datasphere access + ☐ For each user, verify: + ☑ Authorized to access Datasphere + ☑ Has appropriate roles assigned + ☑ No excessive privileges + ☑ Segregation of duties maintained + + DAC Validation: + ☐ All DACs created from authorization migration + ☐ All DACs active and applied to tables + ☐ Sample test: 5 users, verify data filtering works + ☐ High-risk users tested (admins, global access) + + IdP Integration Testing: + ☐ SAML/OIDC configuration verified + ☐ User attributes populate correctly + ☐ Custom attributes flowing through + ☐ Single sign-out works + ☐ Session timeout enforced + +Week 2 (T-23 to T-16): + Audit Policy Validation: + ☐ All sensitive tables have audit policies + ☐ Audit logging working (test with sample queries) + ☐ Audit table receiving records + ☐ Retention policies configured + ☐ Alert thresholds set appropriately + + Compliance Mapping: + ☐ All compliance requirements identified + ☐ Applicable audit policies assigned (SOX, GDPR, HIPAA) + ☐ Retention periods configured per regulation + ☐ Breach notification procedures documented + ☐ Data protection impact assessment (DPIA) completed if required + + Network & Encryption: + ☐ TLS 1.2+ required for all connections + ☐ Certificate installed and valid + ☐ VPN/Firewall rules in place + ☐ IP whitelisting configured (if applicable) + ☐ Disable weak ciphers + +Week 3 (T-16 to T-9): + Stress Testing: + ☐ High-volume query test (100 concurrent users) + ☐ DAC performance impact measured (< 5% overhead acceptable) + ☐ Audit logging performance (not slowing queries > 10%) + ☐ Large data transfer test (10 GB table load) + + Disaster Recovery: + ☐ Backup procedure tested + ☐ Restore from backup tested + ☐ Audit log backup separate from data + ☐ RTO (Recovery Time Objective) documented + ☐ RPO (Recovery Point 
Objective) documented + + Incident Response Planning: + ☐ Incident response team assigned + ☐ Escalation procedures defined + ☐ On-call rotation established + ☐ Communication templates prepared + ☐ Contact information for all key personnel + +Week 4 (T-9 to Go-Live): + Final Security Audit: + ☐ External security assessment (if required) + ☐ Vulnerability scanning completed + ☐ Penetration testing completed + ☐ Security findings remediated + ☐ Risk sign-off from CISO/Security Team + + User Readiness: + ☐ Security awareness training completed by all users + ☐ Password change required on first login + ☐ MFA enabled (if applicable) + ☐ Support team trained on security procedures + ☐ User documentation reviewed + + Go-Live Prep: + ☐ Cutover plan reviewed with security team + ☐ Rollback procedures tested + ☐ Emergency access procedures documented + ☐ On-call team briefed + ☐ Final security sign-off obtained + +Approval Gates: + ☐ Week 1: Access Control Review PASS + ☐ Week 2: Audit Policy & Compliance Review PASS + ☐ Week 3: Stress Test & DR Test PASS + ☐ Week 4: Final Security Audit PASS + CISO Sign-Off + ☐ Go-Live: Authorized by Security & Business Leadership +``` + +--- + +## 6. Common Security Anti-Patterns and Fixes + +### Anti-Pattern 1: DAC Not Applied to All Tables + +**Problem:** +``` +User has DAC on T_SALES but not on V_SALES_SUMMARY +(View defined as: SELECT * FROM T_SALES) + +Result: User can access T_SALES → filtered by DAC + User can access V_SALES_SUMMARY → NO DAC applied → sees all data! 
+```
+
+**Fix:**
+```sql
+-- Apply DAC to both base table AND views
+CREATE DATA ACCESS CONTROL DAC_SALES_BY_REGION FOR TABLE T_SALES
+  WHERE SALES_REGION = :USER_REGION;
+
+CREATE DATA ACCESS CONTROL DAC_SALES_BY_REGION FOR VIEW V_SALES_SUMMARY
+  WHERE SALES_REGION = :USER_REGION;
+
+-- Verify:
+SELECT TABLE_NAME, DAC_NAME FROM DATASPHERE.DAC_ASSIGNMENTS
+WHERE DAC_NAME = 'DAC_SALES_BY_REGION'
+ORDER BY TABLE_NAME;
+
+-- Expected: Both T_SALES and V_SALES_SUMMARY listed
+```
+
+---
+
+### Anti-Pattern 2: Admin User Sees All Data Despite DAC
+
+**Problem:**
+```
+Admin user (ROLE='ADMIN') has DAC applied:
+  DAC: SALES_REGION = :USER_REGION
+  User's REGION attribute: 'Americas'
+
+Result: Admin sees ALL regions (not just Americas)
+        DAC bypass not documented
+```
+
+**Root Cause:**
+- Admin role may have implicit bypass
+- Configuration issue in DAC setup
+
+**Fix:**
+```sql
+-- Option 1: Exclude admins from the DAC filter explicitly
+-- (a CASE expression cannot be used as a boolean predicate; use OR)
+CREATE DATA ACCESS CONTROL DAC_SALES_BY_REGION FOR TABLE T_SALES
+  WHERE (:USER_ROLE = 'ADMIN')          -- Admins see all
+     OR (SALES_REGION = :USER_REGION);  -- Others filtered
+
+-- Option 2: Keep admin under same DAC (requires admin REGION attribute)
+UPDATE T_USER_ATTRIBUTES
+  SET ASSIGNED_REGION = 'ALL'
+  WHERE USER_ID = 'admin.user';
+
+-- Verify DAC behavior (run while connected as each test user):
+SELECT COUNT(*) AS visible_rows, COUNT(DISTINCT SALES_REGION) AS regions
+FROM T_SALES;
+
+-- Check for anomalies: admins should see all regions, other users only their own
+```
+
+---
+
+### Anti-Pattern 3: Audit Logs Not Retained Long Enough
+
+**Problem:**
+```
+Audit policy configured with 1-year retention
+GDPR requires 3-year retention
+After 1 year, audit logs auto-deleted
+
+Result: Cannot respond to GDPR data access request (beyond 1 year)
+        Regulatory violation
+```
+
+**Fix:**
+```sql
+-- Update audit retention policy
+ALTER AUDIT POLICY SOX_COMPLIANCE_AUDIT
+  SET RETENTION_PERIOD = 10
YEARS;
+
+ALTER AUDIT POLICY GDPR_COMPLIANCE_AUDIT
+  SET RETENTION_PERIOD = 3 YEARS;
+
+-- Archive strategy: Move to cold storage after 1 year
+-- but keep for required duration
+CREATE TABLE T_AUDIT_LOG_ARCHIVE LIKE T_AUDIT_LOG
+PARTITION BY RANGE (YEAR(AUDIT_TIMESTAMP))
+(
+  PARTITION 2023 <= VALUES < 2024,
+  PARTITION 2024 <= VALUES < 2025,
+  PARTITION OTHERS
+);
+
+-- Archive job (quarterly):
+INSERT INTO T_AUDIT_LOG_ARCHIVE
+SELECT * FROM T_AUDIT_LOG
+WHERE AUDIT_TIMESTAMP < ADD_YEARS(CURRENT_DATE, -1);
+
+DELETE FROM T_AUDIT_LOG
+WHERE AUDIT_TIMESTAMP < ADD_YEARS(CURRENT_DATE, -1);
+
+-- Verify retention:
+SELECT
+  POLICY_NAME,
+  RETENTION_PERIOD,
+  PURGE_DATE
+FROM DATASPHERE.AUDIT_POLICIES;
+
+-- Expected: All policies meet regulatory requirements
+```
+
+---
+
+### Anti-Pattern 4: User Attribute Not Synced with IdP
+
+**Problem:**
+```
+IdP (Azure AD) has:
+  john.smith: ASSIGNED_REGION = 'Americas'
+
+Datasphere has:
+  john.smith: ASSIGNED_REGION = 'EMEA' (stale value)
+
+Result: DAC filters to wrong region (old value)
+        User sees unintended data
+```
+
+**Root Cause:**
+- Manual attribute import → out of sync
+- One-time sync → no ongoing updates
+- SCIM provisioning not enabled
+
+**Fix:**
+```sql
+-- Option 1: Enable SCIM (automated provisioning)
+-- In IdP settings: Enable SCIM 2.0
+-- In Datasphere: Configure SCIM token and endpoint
+-- Result: Attributes sync every 15-60 minutes
+
+-- Option 2: Scheduled attribute sync (if SCIM not available)
+-- (illustrative pseudocode; schedule via a task chain or external scheduler)
+CREATE SCHEDULED TASK sync_user_attributes
+  FREQUENCY: DAILY AT 02:00 UTC
+  ACTION: Execute stored procedure;
+
+CREATE PROCEDURE sync_user_attributes AS
+  -- Extract from IdP LDAP/API
+  DELETE FROM T_USER_ATTRIBUTES_STAGING;
+  INSERT INTO T_USER_ATTRIBUTES_STAGING
+  SELECT USER_ID, ASSIGNED_REGION, ASSIGNED_COST_CENTER
+  FROM LDAP.USERS;  -- Query source IdP
+
+  -- Upsert into Datasphere
+  MERGE INTO T_USER_ATTRIBUTES target
+  USING T_USER_ATTRIBUTES_STAGING source
+  ON target.USER_ID = source.USER_ID
+  WHEN MATCHED THEN
UPDATE SET + target.ASSIGNED_REGION = source.ASSIGNED_REGION, + target.ASSIGNED_COST_CENTER = source.ASSIGNED_COST_CENTER, + target.SYNC_TIMESTAMP = CURRENT_TIMESTAMP + WHEN NOT MATCHED THEN INSERT + VALUES (source.USER_ID, source.ASSIGNED_REGION, source.ASSIGNED_COST_CENTER, CURRENT_TIMESTAMP); + + -- Audit + INSERT INTO T_ATTRIBUTE_SYNC_LOG + VALUES (CURRENT_TIMESTAMP, 'SYNC_COMPLETE', @@AFFECTED_ROWS); +END; + +-- Verify sync: +SELECT USER_ID, ASSIGNED_REGION, SYNC_TIMESTAMP +FROM T_USER_ATTRIBUTES +WHERE SYNC_TIMESTAMP > CURRENT_TIMESTAMP - INTERVAL 1 HOUR +ORDER BY SYNC_TIMESTAMP DESC; +``` + +--- + +### Anti-Pattern 5: No Segregation of Duties + +**Problem:** +``` +Same person (john.smith) assigned both: + - Role: PURCHASE_REQUISITIONER (create POs) + - Role: PAYMENT_APPROVER (approve payments) + +Result: john.smith can requisition AND approve his own request + Fraud control broken +``` + +**Fix:** +```sql +-- Define SoD rules +CREATE TABLE T_SEGREGATION_OF_DUTIES_RULES ( + RULE_ID INTEGER PRIMARY KEY, + CONFLICT_ROLE_1 VARCHAR(100), + CONFLICT_ROLE_2 VARCHAR(100), + REASON VARCHAR(500), + ENFORCEMENT VARCHAR(20) -- PREVENT or MONITOR +); + +INSERT INTO T_SEGREGATION_OF_DUTIES_RULES VALUES + (1, 'PURCHASE_REQUISITIONER', 'PAYMENT_APPROVER', + 'Cannot request and approve own payment', 'PREVENT'), + (2, 'TRANSACTION_CREATOR', 'TRANSACTION_AUDITOR', + 'Auditor must not create transactions they audit', 'PREVENT'), + (3, 'USER_ADMIN', 'DATA_ADMIN', + 'User mgmt and data mgmt must be separate', 'MONITOR'); + +-- Validation: Check for SoD violations +SELECT u.USER_ID, COUNT(DISTINCT ur.ROLE_ID) as role_count +FROM T_USERS u +JOIN T_USER_ROLES ur ON u.USER_ID = ur.USER_ID +WHERE EXISTS ( + SELECT 1 FROM T_SEGREGATION_OF_DUTIES_RULES sod + WHERE sod.CONFLICT_ROLE_1 IN (SELECT ur2.ROLE_ID + FROM T_USER_ROLES ur2 + WHERE ur2.USER_ID = u.USER_ID) + AND sod.CONFLICT_ROLE_2 IN (SELECT ur3.ROLE_ID + FROM T_USER_ROLES ur3 + WHERE ur3.USER_ID = u.USER_ID) + AND 
sod.ENFORCEMENT = 'PREVENT' +) +GROUP BY u.USER_ID; + +-- If results found: Remove conflicting role assignments +-- Example: john.smith → remove PAYMENT_APPROVER role +``` + +--- + +## 7. Emergency Access Procedures + +### Emergency Access When Normal IdP Unavailable + +**Scenario:** Azure AD outage, users cannot log in via SAML. + +```yaml +Emergency Access Procedure (Break-Glass): + +Trigger Conditions: + ☐ IdP unavailable for > 15 minutes + ☐ Critical business operations blocked + ☐ Approval from CISO/Security Lead obtained + +Pre-Requisites: + ☐ Emergency access accounts created and secured + ☐ Temporary credentials stored in secure vault (Vault, AWS Secrets Manager) + ☐ Access log prepared for audit trail + ☐ Rollback plan documented + +Step 1: Declare Emergency (5 min) + ☐ CISO/Security Lead confirms IdP issue + ☐ Escalation email sent to incident response team + ☐ Status page updated: "Emergency access enabled" + ☐ Timer started: Document emergency duration + +Step 2: Authenticate User via Emergency Account (5 min) + ☐ User provides: Employee ID, Manager approval, Business justification + ☐ Validate against employee database + ☐ Call user's manager to verbally confirm (if possible) + ☐ Issue temporary password (e.g., TempPass_2024_ABC123) + ☐ Force password change on first login + +Step 3: Grant Temporary Access (5 min) + ☐ Assign minimal required roles (NOT admin roles) + ☐ Apply restrictive DACs (if any changes to standard) + ☐ Limit session duration: 4 hours max + ☐ Enable IP whitelist if possible + ☐ Log all assignments in T_EMERGENCY_ACCESS_LOG + +Step 4: Audit & Monitoring (ongoing) + ☐ Monitor user session closely + ☐ Alert on: Data exports, schema changes, privilege escalation attempts + ☐ Capture detailed audit logs: Every query, every data access + ☐ Disable access immediately after business need satisfied (not waiting 4 hours) + +Step 5: Revoke Emergency Access (at resolution) + ☐ IdP restored and working + ☐ Confirm all users can log in normally 
+ ☐ Disable all temporary accounts + ☐ Force password change required on next IdP login + ☐ Deactivate emergency session tokens + +Step 6: Post-Emergency Audit (within 24 hours) + ☐ Review audit log: T_EMERGENCY_ACCESS_LOG + ☐ Verify all access was legitimate + ☐ Check for data exfiltration or unauthorized changes + ☐ Document findings in incident report + ☐ Update incident management system (e.g., Jira, ServiceNow) + +Template: Emergency Access Log Entry + ┌────────────────────────────────────────┐ + │ Emergency Access Event Log │ + ├────────────────────────────────────────┤ + │ Incident ID: INC-2024-0234 │ + │ User ID: john.smith@contoso.com │ + │ Access Type: Temporary Login │ + │ Reason: Azure AD outage (03:00-04:30 UTC) │ + │ Authorized By: CISO - Jane Doe │ + │ Grant Time: 2024-02-08 03:15 UTC │ + │ Revoke Time: 2024-02-08 04:45 UTC │ + │ Duration: 1.5 hours │ + │ Roles Granted: VIEWER │ + │ DACs Applied: Standard │ + │ Queries Executed: 12 │ + │ Data Accessed: T_SALES (4 rows) │ + │ Audit: REVIEWED & APPROVED │ + │ Reviewer: Internal Audit Team │ + └────────────────────────────────────────┘ + +RTO (Recovery Time Objective): 15 minutes + - Emergency access available within 15 min of IdP failure + +Frequency Review: + - Quarterly: Validate emergency accounts still work + - Semi-annually: Update authorized personnel list + - Annually: Run mock emergency access drill +``` + +--- + +### Permanent Account Deprovisioning + +**When:** Employee termination, role change, access revocation. + +```yaml +Deprovisioning Checklist: + +T+0 (Termination Date): + ☐ HR notifies IT/Security of termination + ☐ Collect all devices (laptop, phone, badge) + ☐ Take note of time (often end-of-business Friday) + +T+1 hour (Immediate Actions): + ☐ Disable IdP account (Azure AD, Okta, etc.) 
+ ☐ Revoke session tokens (logout all sessions) + ☐ Disable Datasphere user account + ☐ Remove user from all roles and spaces + ☐ Revoke API tokens/credentials + ☐ Log action: T_USER_DEPROVISIONING_LOG + +T+1 day (Extended Access Cleanup): + ☐ Check for shared passwords (vault) → change all + ☐ Remove from security groups and mailing lists + ☐ Transfer file ownership to manager + ☐ Archive email (if required by retention policy) + ☐ Document what data user created/owns + +T+30 days (Archival): + ☐ Review audit logs for any activity post-termination (should be zero) + ☐ Archive all user session logs + ☐ Delete audit events > 30 days old (unless regulatory hold) + ☐ Confirm no data leakage occurred + ☐ Final sign-off from Security/Compliance + +Template: User Deprovisioning Checklist + User: john.smith@contoso.com + Termination Date: 2024-02-08 + Reason: Voluntary resignation + + Deprovisioning Steps: + ☐ [02/08 17:00] HR notification received + ☐ [02/08 17:15] Datasphere access disabled + ☐ [02/08 17:30] Azure AD account disabled + ☐ [02/08 17:45] All sessions terminated + ☐ [02/09 09:00] Shared credentials changed + ☐ [02/09 10:00] File access transferred to manager + ☐ [02/09 11:00] Email archived + ☐ [02/10] Audit log review completed (no suspicious activity) + ☐ [02/10] Final approval: Security & Compliance sign-off + + Risk Assessment: + Access Duration: 4 years + Data Accessed: T_CUSTOMER_PII, T_FINANCIAL + Export Capability: YES (standard role) + Suspicious Activity: NONE detected + Conclusion: LOW RISK - Clean deprovisioning +``` + +--- + +End of Security Architect Reference Guide + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/SKILL.md new file mode 100644 index 0000000..badc7dd --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/SKILL.md @@ -0,0 +1,707 @@ +--- +name: Transformation Logic Generator 
+description: "Generate and validate SQLScript and Python transformation logic for Data Flows and Transformation Flows. Use this when building complex transformations, optimizing performance, handling delta logic, or implementing SCD Type 2 slowly changing dimensions." +--- + +# Transformation Logic Generator + +## Overview + +This skill helps you design, write, and validate transformation logic for SAP Datasphere. Whether you're building a Transformation Flow with SQLScript or a Data Flow with Python operators, this skill provides patterns, best practices, and diagnostic tools to ensure your transformations are correct, performant, and maintainable. + +## When to Use This Skill + +- **Designing transformations** from scratch: Deciding which tool and language to use +- **Handling delta logic**: Implementing incremental loads with watermarks +- **Slowly Changing Dimensions (SCD Type 2)**: Tracking history of dimension changes +- **Complex data cleansing**: Deduplication, pivoting, date/time normalization +- **Performance optimization**: Dealing with large datasets or slow execution +- **Troubleshooting transformation failures**: SQL errors, type mismatches, operator crashes +- **Data type mapping**: Converting between source and target systems +- **Error handling**: Adding logging and validation to transformations + +## SQLScript vs Python: Choosing Your Tool + +### Use SQLScript for Transformation Flows When: +- Working with structured, tabular data from relational sources +- Needing high performance for large volumes (1M+ rows) +- Implementing delta loads with watermark patterns +- Performing set-based operations (MERGE, aggregations, window functions) +- Operating in the SAP HANA native database +- Team expertise is SQL-focused + +### Use Python for Data Flows When: +- Requiring complex business logic that's hard to express in SQL +- Integrating with ML libraries (scikit-learn, pandas, numpy) +- Handling unstructured data (text, JSON, images) +- Needing 
pandas-like data manipulation
+- Working with multiple input sources in flexible ways
+- Team expertise is Python-focused
+
+## SQLScript Transformations for Transformation Flows
+
+### Delta Handling with Watermarks
+
+Watermarks track the last extracted value so each run loads only what changed. A timestamp watermark is the most common pattern:
+
+```sql
+-- Timestamp watermark pattern
+PROCEDURE TF_LOAD_CUSTOMER_DELTA (
+    IN iv_last_watermark TIMESTAMP
+)
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+    -- Get current watermark (typically max of changed timestamp)
+    DECLARE v_current_watermark TIMESTAMP := CURRENT_TIMESTAMP();
+
+    -- Load only changed records (match rows on the target's primary key)
+    UPSERT TARGET_CUSTOMER
+    SELECT
+        CUSTOMER_ID,
+        CUSTOMER_NAME,
+        REVENUE,
+        UPDATED_AT,
+        'ACTIVE' AS RECORD_STATUS
+    FROM SOURCE_CUSTOMER
+    WHERE UPDATED_AT > :iv_last_watermark
+      AND UPDATED_AT <= :v_current_watermark
+    WITH PRIMARY KEY;
+
+    -- Update watermark in control table
+    UPSERT WATERMARK_CONTROL
+    VALUES ('CUSTOMER', :v_current_watermark)
+    WITH PRIMARY KEY;
+END;
+```
+
+### MERGE Operations for Upserts
+
+MERGE is the most efficient way to handle inserts and updates:
+
+```sql
+-- Standard MERGE pattern
+MERGE INTO TARGET_PRODUCT tp
+USING SOURCE_PRODUCT sp
+    ON tp.PRODUCT_ID = sp.PRODUCT_ID
+WHEN MATCHED AND sp.IS_DELETED = 'X' THEN
+    DELETE
+WHEN MATCHED THEN
+    UPDATE SET
+        tp.PRODUCT_NAME = sp.PRODUCT_NAME,
+        tp.PRICE = sp.PRICE,
+        tp.UPDATED_AT = CURRENT_TIMESTAMP()
+WHEN NOT MATCHED THEN
+    INSERT (
+        PRODUCT_ID,
+        PRODUCT_NAME,
+        PRICE,
+        CREATED_AT,
+        UPDATED_AT
+    )
+    VALUES (
+        sp.PRODUCT_ID,
+        sp.PRODUCT_NAME,
+        sp.PRICE,
+        CURRENT_TIMESTAMP(),
+        CURRENT_TIMESTAMP()
+    );
+```
+
+### Window Functions for Analytics
+
+Window functions enable efficient ranking, running totals, and partition-based calculations:
+
+```sql
+-- Rank products by revenue within each category
+SELECT
+    PRODUCT_ID,
+    PRODUCT_NAME,
+    CATEGORY,
+    REVENUE,
+    ROW_NUMBER() OVER (
+        PARTITION BY CATEGORY
+        ORDER BY REVENUE DESC
+    ) AS REVENUE_RANK,
+    SUM(REVENUE) OVER (
+        PARTITION BY CATEGORY
+        ORDER BY
MONTH_ID + ROWS BETWEEN 2 PRECEDING AND CURRENT ROW + ) AS ROLLING_3M_REVENUE +FROM PRODUCT_SALES; +``` + +### Common Table Expressions (CTEs) for Readability + +CTEs make complex logic more maintainable: + +```sql +WITH customer_revenue AS ( + SELECT + CUSTOMER_ID, + SUM(ORDER_AMOUNT) AS TOTAL_REVENUE, + COUNT(DISTINCT ORDER_ID) AS ORDER_COUNT + FROM ORDERS + GROUP BY CUSTOMER_ID +), +customer_segments AS ( + SELECT + CUSTOMER_ID, + TOTAL_REVENUE, + CASE + WHEN TOTAL_REVENUE >= 100000 THEN 'GOLD' + WHEN TOTAL_REVENUE >= 50000 THEN 'SILVER' + ELSE 'BRONZE' + END AS SEGMENT + FROM customer_revenue +) +SELECT * +FROM customer_segments +ORDER BY TOTAL_REVENUE DESC; +``` + +## Python Operators for Data Flows + +### Pandas-like Operations + +Python operators work with pandas DataFrames for flexible transformations: + +```python +import pandas as pd + +def process_orders(orders_df): + """ + Transform and enrich orders with customer segments + """ + # Group by customer and calculate metrics + customer_stats = orders_df.groupby('customer_id').agg({ + 'order_amount': ['sum', 'mean', 'count'], + 'order_date': 'max' + }).reset_index() + + customer_stats.columns = ['customer_id', 'total_revenue', + 'avg_order_value', 'order_count', 'last_order_date'] + + # Calculate customer segment + customer_stats['segment'] = pd.cut( + customer_stats['total_revenue'], + bins=[0, 50000, 100000, float('inf')], + labels=['BRONZE', 'SILVER', 'GOLD'] + ) + + return customer_stats +``` + +### Multi-input Fusion + +Python operators can merge multiple inputs flexibly: + +```python +def enrich_orders(orders_df, customers_df, products_df): + """ + Join orders with customer and product dimensions + """ + enriched = orders_df.merge( + customers_df[['customer_id', 'segment', 'region']], + on='customer_id', + how='left' + ).merge( + products_df[['product_id', 'category', 'margin']], + on='product_id', + how='left' + ) + + enriched['revenue_contribution'] = ( + enriched['order_amount'] * 
enriched['margin']
+    )
+
+    return enriched[enriched['order_date'] >= '2024-01-01']
+```
+
+### Custom Business Logic
+
+Implement domain-specific rules that are cumbersome in SQL:
+
+```python
+def apply_discount_rules(orders_df, rules_df):
+    """
+    Apply complex tiered discount rules based on customer and product
+    """
+    def calculate_discount(row):
+        applicable_rules = rules_df[
+            (rules_df['segment'] == row['segment']) &
+            (rules_df['category'] == row['category'])
+        ]
+
+        if applicable_rules.empty:
+            return 0.0
+
+        # Apply the best applicable rate, capped at 15% of the order amount
+        max_rate = applicable_rules['discount_rate'].max()
+        return row['order_amount'] * min(max_rate, 0.15)
+
+    orders_df['discount'] = orders_df.apply(calculate_discount, axis=1)
+    orders_df['net_amount'] = orders_df['order_amount'] - orders_df['discount']
+
+    return orders_df
+```
+
+## Common Transformation Patterns
+
+### SCD Type 2: Slowly Changing Dimensions
+
+Track historical changes to dimension attributes:
+
+```sql
+-- Initialize dimension with SCD Type 2 structure
+CREATE TABLE DIM_CUSTOMER (
+    CUSTOMER_SK BIGINT,
+    CUSTOMER_ID STRING,
+    CUSTOMER_NAME STRING,
+    REGION STRING,
+    EFFECTIVE_DATE DATE,
+    END_DATE DATE,
+    IS_CURRENT CHAR(1),
+    PRIMARY KEY (CUSTOMER_SK)
+);
+
+-- Load new and changed dimensions
+PROCEDURE LOAD_DIM_CUSTOMER_SCD2 (
+    IN iv_load_date DATE
+)
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+    -- Close expired records (compare only against the current version)
+    UPDATE DIM_CUSTOMER
+    SET END_DATE = :iv_load_date,
+        IS_CURRENT = 'N'
+    WHERE IS_CURRENT = 'Y'
+      AND CUSTOMER_ID IN (
+          SELECT CUSTOMER_ID
+          FROM SOURCE_CUSTOMER sc
+          WHERE EXISTS (
+              SELECT 1 FROM DIM_CUSTOMER dc
+              WHERE dc.CUSTOMER_ID = sc.CUSTOMER_ID
+                AND dc.IS_CURRENT = 'Y'
+                AND (dc.REGION <> sc.REGION
+                     OR dc.CUSTOMER_NAME <> sc.CUSTOMER_NAME)
+          )
+      );
+
+    -- Insert new records and changed dimensions
+    INSERT INTO DIM_CUSTOMER
+    SELECT
+        NEXT VALUE FOR DIM_CUSTOMER_SK_SEQ,
+        CUSTOMER_ID,
+        CUSTOMER_NAME,
+        REGION,
+        :iv_load_date,
+        NULL,
+        'Y'
+    FROM SOURCE_CUSTOMER
+    WHERE CUSTOMER_ID NOT IN (
+        SELECT DISTINCT CUSTOMER_ID FROM
DIM_CUSTOMER WHERE IS_CURRENT = 'Y'
+    )
+    UNION ALL
+    SELECT
+        NEXT VALUE FOR DIM_CUSTOMER_SK_SEQ,
+        sc.CUSTOMER_ID,
+        sc.CUSTOMER_NAME,
+        sc.REGION,
+        :iv_load_date,
+        NULL,
+        'Y'
+    FROM SOURCE_CUSTOMER sc
+    JOIN DIM_CUSTOMER dc ON sc.CUSTOMER_ID = dc.CUSTOMER_ID
+    WHERE dc.IS_CURRENT = 'Y'
+      AND (
+          dc.REGION <> sc.REGION
+          OR dc.CUSTOMER_NAME <> sc.CUSTOMER_NAME
+      );
+END;
+```
+
+### Deduplication Pattern
+
+Remove duplicate records, keeping the most recent or best version:
+
+```sql
+-- Keep only the most recent version of each record
+WITH ranked_records AS (
+    SELECT
+        *,
+        ROW_NUMBER() OVER (
+            PARTITION BY SOURCE_ID
+            ORDER BY LOAD_DATE DESC, RECORD_ID DESC
+        ) AS RN
+    FROM SOURCE_DATA
+)
+SELECT *
+FROM ranked_records
+WHERE RN = 1;
+```
+
+### Pivoting/Unpivoting
+
+Convert between row and column formats:
+
+```sql
+-- Pivot: Months as columns
+SELECT
+    CUSTOMER_ID,
+    SUM(CASE WHEN MONTH = 1 THEN AMOUNT ELSE 0 END) AS JAN,
+    SUM(CASE WHEN MONTH = 2 THEN AMOUNT ELSE 0 END) AS FEB,
+    SUM(CASE WHEN MONTH = 3 THEN AMOUNT ELSE 0 END) AS MAR
+FROM MONTHLY_SALES
+GROUP BY CUSTOMER_ID;
+
+-- Unpivot: Months as rows (using UNION)
+SELECT CUSTOMER_ID, 1 AS MONTH, JAN_SALES AS AMOUNT FROM CUSTOMER_MONTHLY_SALES
+UNION ALL
+SELECT CUSTOMER_ID, 2 AS MONTH, FEB_SALES AS AMOUNT FROM CUSTOMER_MONTHLY_SALES
+UNION ALL
+SELECT CUSTOMER_ID, 3 AS MONTH, MAR_SALES AS AMOUNT FROM CUSTOMER_MONTHLY_SALES;
+```
+
+### Date/Time Handling
+
+Common date transformations for business requirements:
+
+```sql
+-- Fiscal period calculation
+SELECT
+    DATE_FIELD,
+    EXTRACT(YEAR FROM DATE_FIELD) AS CAL_YEAR,
+    EXTRACT(MONTH FROM DATE_FIELD) AS CAL_MONTH,
+    WEEKDAY(DATE_FIELD) AS DAY_OF_WEEK,
+    TO_INTEGER(TO_VARCHAR(DATE_FIELD, 'YYYYMM')) AS YYYYMM,
+    -- Fiscal calendar (starts April)
+    CASE
+        WHEN MONTH(DATE_FIELD) >= 4 THEN YEAR(DATE_FIELD)
+        ELSE YEAR(DATE_FIELD) - 1
+    END AS FISCAL_YEAR,
+    CASE
+        WHEN MONTH(DATE_FIELD) >= 4 THEN MONTH(DATE_FIELD) - 3
+        ELSE
MONTH(DATE_FIELD) + 9
+    END AS FISCAL_MONTH
+FROM TRANSACTIONS;
+```
+
+## Data Type Mapping
+
+Map source types to target types correctly to prevent runtime errors:
+
+| Source Type | Target Type | Notes |
+|-----------|-----------|-------|
+| Source VARCHAR(255) | VARCHAR(255) or TEXT | Map size appropriately |
+| Source DECIMAL(15,2) | DECIMAL(19,4) or FLOAT | Allow room for calculations |
+| Source DATE | DATE or TIMESTAMP | Timestamp if time needed |
+| Source NUMERIC(10) | INTEGER or BIGINT | Use BIGINT for IDs |
+| Source BOOLEAN/CHAR(1) | CHAR(1) or INTEGER | Use 'Y'/'N' or 0/1 consistently |
+| Source JSON | STRING | Parse in Python operator |
+| Source NULL | Not applicable | Must handle explicitly |
+
+Use `execute_query` to test type conversions:
+
+```sql
+SELECT
+    CAST('2024-01-15' AS DATE) AS converted_date,
+    CAST('123.45' AS DECIMAL(10,2)) AS converted_amount,
+    CASE WHEN source_value IS NULL THEN 0 ELSE source_value END AS handled_null
+FROM source_table LIMIT 10;
+```
+
+## Error Handling and Logging
+
+Implement robust error handling in transformations:
+
+### SQLScript Error Handling
+
+```sql
+PROCEDURE TRANSFORM_WITH_ERROR_HANDLING (
+    IN iv_batch_id STRING
+)
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+    DECLARE v_row_count INT;
+    DECLARE v_error_message STRING;
+
+    -- SQLScript has no EXCEPTION block; declare an exit handler that
+    -- logs the failure instead of aborting the procedure
+    DECLARE EXIT HANDLER FOR SQLEXCEPTION
+    BEGIN
+        v_error_message := 'Error Code: ' || ::SQL_ERROR_CODE ||
+                           ', Message: ' || ::SQL_ERROR_MESSAGE;
+
+        INSERT INTO lt_log VALUES (
+            :iv_batch_id,
+            'INSERT',
+            0,
+            'Y',
+            :v_error_message,
+            CURRENT_TIMESTAMP()
+        );
+    END;
+
+    -- Create result logging table if needed
+    CREATE LOCAL TEMPORARY TABLE lt_log (
+        BATCH_ID STRING,
+        OPERATION STRING,
+        ROWS_AFFECTED INT,
+        ERROR_FLAG CHAR(1),
+        ERROR_MESSAGE STRING,
+        TIMESTAMP TIMESTAMP
+    );
+
+    INSERT INTO TARGET_TABLE
+    SELECT * FROM SOURCE_TABLE
+    WHERE BATCH_ID = :iv_batch_id
+      AND STATUS = 'VALID';
+
+    v_row_count := ::ROWCOUNT;
+
+    INSERT INTO lt_log VALUES (
+        :iv_batch_id,
+        'INSERT',
+        :v_row_count,
+        'N',
+        NULL,
+        CURRENT_TIMESTAMP()
+    );
+
+    -- Return log results
+    SELECT * FROM lt_log;
+END;
+```
+
+### Python Operator Logging
+
+```python
+import logging
+from datetime import datetime
+
+import pandas as pd
+
+def transform_with_logging(input_df):
+    """
+    Transform with comprehensive logging
+    """
+    logger = logging.getLogger(__name__)
+    logger.info(f"Processing {len(input_df)} input rows at {datetime.now()}")
+
+    try:
+        # Data validation
+        if input_df.isnull().sum().sum() > 0:
+            logger.warning(f"Found nulls: {input_df.isnull().sum().to_dict()}")
+            input_df = input_df.fillna(0)
+
+        # Transformation
+        output_df = input_df.assign(
+            processed_amount=input_df['amount'] * 1.1,
+            processed_date=pd.to_datetime(input_df['date'])
+        )
+
+        logger.info(f"Successfully processed {len(output_df)} rows")
+        return output_df
+
+    except Exception as e:
+        logger.error(f"Transformation failed: {str(e)}", exc_info=True)
+        raise ValueError(f"Transformation error: {str(e)}")
+```
+
+## Testing Transformations
+
+### Using execute_query for SQLScript Testing
+
+Test your SQL logic incrementally before deploying:
+
+```sql
+-- Test 1: Data quality checks
+EXECUTE QUERY '
+SELECT
+    COUNT(*) as total_rows,
+    COUNT(DISTINCT customer_id) as distinct_customers,
+    SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) as negative_amounts
+FROM source_orders
+WHERE load_date = CURRENT_DATE
+';
+
+-- Test 2: Transformation validation
+EXECUTE QUERY '
+SELECT
+    segment,
+    COUNT(*) as count,
+    AVG(revenue) as avg_revenue,
+    MIN(revenue) as min_revenue,
+    MAX(revenue) as max_revenue
+FROM transformed_customers
+GROUP BY segment
+';
+
+-- Test 3: Delta logic verification
+EXECUTE QUERY '
+SELECT
+    operation,
+    COUNT(*) as count
+FROM (
+    SELECT CASE
+        WHEN old_value IS NULL THEN "INSERT"
+        WHEN new_value IS NULL THEN
"DELETE"
+        ELSE "UPDATE"
+    END as operation
+    FROM merge_changes
+)
+GROUP BY operation
+';
+```
+
+### Using smart_query for Intelligent Analysis
+
+Let the system identify anomalies and patterns:
+
+```
+smart_query(
+    dataset="transformed_customers",
+    question="Are there any unexpected patterns or anomalies in the revenue by segment?"
+)
+
+smart_query(
+    dataset="delta_loads",
+    question="Is the distribution of inserted vs updated records normal for this load?"
+)
+```
+
+## Performance Considerations for Large Datasets
+
+### Indexing Strategy
+
+```sql
+-- Create indexes on join keys and filter conditions
+CREATE INDEX IDX_ORDER_CUSTOMER ON ORDERS (CUSTOMER_ID);
+CREATE INDEX IDX_PRODUCT_CATEGORY ON PRODUCTS (CATEGORY_ID);
+CREATE INDEX IDX_DATE_PARTITION ON FACT_SALES (LOAD_DATE, CUSTOMER_ID);
+```
+
+### Partitioning for Scalability
+
+```sql
+-- Partition by month for faster filtering
+-- (range partitioning needs a plain key column, so carry a YYYYMM column
+-- populated on load, e.g. TO_VARCHAR(TRANSACTION_DATE, 'YYYYMM'))
+CREATE TABLE FACT_SALES (
+    TRANSACTION_ID BIGINT,
+    CUSTOMER_ID INT,
+    AMOUNT DECIMAL(15,2),
+    TRANSACTION_DATE DATE,
+    YEAR_MONTH NVARCHAR(6),
+    PRIMARY KEY (TRANSACTION_ID, YEAR_MONTH)
+)
+PARTITION BY RANGE (YEAR_MONTH)
+(
+    PARTITION '202401' <= VALUES < '202402',
+    PARTITION '202402' <= VALUES < '202403',
+    PARTITION OTHERS
+);
+```
+
+### Query Optimization
+
+```sql
+-- Use LIMIT for initial testing
+SELECT * FROM large_table LIMIT 1000;
+
+-- Filter early and often
+SELECT *
+FROM fact_table
+WHERE load_date = CURRENT_DATE  -- Filter first
+  AND customer_id IN (SELECT id FROM active_customers)
+  AND amount > 0;
+
+-- Aggregate before joining
+SELECT
+    c.customer_id,
+    c.name,
+    agg.total_revenue
+FROM customers c
+JOIN (
+    SELECT customer_id, SUM(amount) as total_revenue
+    FROM orders
+    WHERE load_date >= CURRENT_DATE - 30
+    GROUP BY customer_id
+) agg ON c.customer_id = agg.customer_id;
+```
+
+### Memory Management in Python
+
+```python
+def process_large_file_chunked(input_df, chunk_size=10000):
+    """
+    Process large data in chunks to avoid memory issues
+    """
+    result_chunks = []
+
+    for
i in range(0, len(input_df), chunk_size): + chunk = input_df.iloc[i:i + chunk_size] + + # Process chunk + processed_chunk = chunk.assign( + processed_value=chunk['value'] * 1.1 + ) + + result_chunks.append(processed_chunk) + + # Explicitly free memory + del chunk + + return pd.concat(result_chunks, ignore_index=True) +``` + +## MCP Tool References + +### execute_query +Run and test SQL queries in your transformations. Use for validation and testing logic before deploying. + +``` +execute_query( + query="SELECT * FROM source_table WHERE load_date = CURRENT_DATE LIMIT 100" +) +``` + +### smart_query +Ask intelligent questions about your data to identify patterns, anomalies, and quality issues. + +``` +smart_query( + dataset="transformed_data", + question="What are the top 5 anomalies in this dataset?" +) +``` + +### get_table_schema +Understand the structure of source and target tables before writing transformations. + +``` +get_table_schema(table_name="SOURCE_CUSTOMER") +``` + +### get_object_definition +View the complete definition of a Transformation Flow or Data Flow object. + +``` +get_object_definition(object_id="TF_CUSTOMER_TRANSFORM") +``` + +### analyze_column_distribution +Analyze data distribution to identify outliers and inform transformation logic. + +``` +analyze_column_distribution( + table_name="CUSTOMER_REVENUE", + column_name="ANNUAL_REVENUE" +) +``` + +## Next Steps + +1. Identify your source and target structures using `get_table_schema` +2. Choose SQLScript for set-based operations or Python for custom logic +3. Draft your transformation logic using provided patterns +4. Test with `execute_query` or `smart_query` on sample data +5. Validate data types and mappings +6. Deploy to Transformation Flow or Data Flow +7. 
Monitor performance and adjust as needed + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/references/transformation-patterns.md b/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/references/transformation-patterns.md new file mode 100644 index 0000000..5d649ac --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-transformation-logic/references/transformation-patterns.md @@ -0,0 +1,730 @@ +# Transformation Patterns Reference + +## SQLScript Syntax Reference + +### MERGE Statement Complete Syntax + +```sql +MERGE INTO target_table AS tt +USING source_table AS st + ON tt.key_column = st.key_column +WHEN MATCHED AND condition THEN + UPDATE SET + col1 = st.col1, + col2 = st.col2 +WHEN MATCHED THEN + DELETE +WHEN NOT MATCHED THEN + INSERT (col1, col2, col3) + VALUES (st.col1, st.col2, st.col3); +``` + +#### MERGE Examples + +**Basic Upsert (Insert or Update)** +```sql +MERGE INTO CUSTOMER_MASTER cm +USING CUSTOMER_STAGING cs + ON cm.CUSTOMER_ID = cs.CUSTOMER_ID +WHEN MATCHED THEN + UPDATE SET + cm.CUSTOMER_NAME = cs.CUSTOMER_NAME, + cm.EMAIL = cs.EMAIL, + cm.UPDATED_AT = CURRENT_TIMESTAMP() +WHEN NOT MATCHED THEN + INSERT (CUSTOMER_ID, CUSTOMER_NAME, EMAIL, CREATED_AT, UPDATED_AT) + VALUES (cs.CUSTOMER_ID, cs.CUSTOMER_NAME, cs.EMAIL, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()); +``` + +**Soft Delete Pattern** +```sql +MERGE INTO PRODUCT_DIM pd +USING PRODUCT_STAGING ps + ON pd.PRODUCT_ID = ps.PRODUCT_ID +WHEN MATCHED AND ps.IS_ACTIVE = 0 THEN + UPDATE SET + pd.EFFECTIVE_END_DATE = CURRENT_DATE, + pd.IS_CURRENT = 'N' +WHEN MATCHED AND ps.IS_ACTIVE = 1 THEN + UPDATE SET + pd.PRODUCT_NAME = ps.PRODUCT_NAME, + pd.PRICE = ps.PRICE, + pd.UPDATED_AT = CURRENT_TIMESTAMP() +WHEN NOT MATCHED AND ps.IS_ACTIVE = 1 THEN + INSERT (PRODUCT_ID, PRODUCT_NAME, PRICE, EFFECTIVE_START_DATE, IS_CURRENT) + VALUES (ps.PRODUCT_ID, ps.PRODUCT_NAME, ps.PRICE, CURRENT_DATE, 'Y'); +``` + +**Three-Way Merge (Insert, Update, Delete)** 
+```sql
+MERGE INTO EMPLOYEE e
+USING EMPLOYEE_STAGING es
+ ON e.EMP_ID = es.EMP_ID
+WHEN MATCHED AND es.STATUS = 'TERMINATED' THEN
+ DELETE
+WHEN MATCHED THEN
+ UPDATE SET
+ e.EMP_NAME = es.EMP_NAME,
+ e.SALARY = es.SALARY,
+ e.DEPARTMENT = es.DEPARTMENT,
+ e.MODIFIED_DATE = CURRENT_TIMESTAMP()
+WHEN NOT MATCHED THEN
+ INSERT (EMP_ID, EMP_NAME, SALARY, DEPARTMENT, CREATED_DATE)
+ VALUES (es.EMP_ID, es.EMP_NAME, es.SALARY, es.DEPARTMENT, CURRENT_TIMESTAMP());
+```
+
+### Window Functions Reference
+
+#### ROW_NUMBER() - Sequential numbering within partition
+
+```sql
+-- Get top 3 products by revenue in each category
+-- (SAP HANA has no QUALIFY clause, so filter the row number in an outer query)
+SELECT *
+FROM (
+ SELECT
+ PRODUCT_ID,
+ PRODUCT_NAME,
+ CATEGORY,
+ REVENUE,
+ ROW_NUMBER() OVER (PARTITION BY CATEGORY ORDER BY REVENUE DESC) AS RN
+ FROM PRODUCTS
+)
+WHERE RN <= 3;
+```
+
+#### RANK() and DENSE_RANK() - Handle ties differently
+
+```sql
+-- RANK() skips numbers after ties, DENSE_RANK() doesn't
+SELECT
+ EMPLOYEE_ID,
+ SALARY,
+ RANK() OVER (ORDER BY SALARY DESC) AS RANK_NO,
+ DENSE_RANK() OVER (ORDER BY SALARY DESC) AS DENSE_RANK_NO
+FROM EMPLOYEES;
+
+-- Output:
+-- EMP1, 100000, 1, 1
+-- EMP2, 100000, 1, 1
+-- EMP3, 90000, 3, 2 <- RANK skips 2, DENSE_RANK doesn't
+```
+
+#### NTILE() - Divide into buckets
+
+```sql
+-- Divide customers into quartiles by revenue
+SELECT
+ CUSTOMER_ID,
+ TOTAL_REVENUE,
+ NTILE(4) OVER (ORDER BY TOTAL_REVENUE DESC) AS QUARTILE
+FROM CUSTOMER_REVENUE
+ORDER BY QUARTILE, TOTAL_REVENUE DESC;
+```
+
+#### LAG() and LEAD() - Look at adjacent rows
+
+```sql
+-- Calculate month-over-month revenue change
+SELECT
+ YEAR_MONTH,
+ REVENUE,
+ LAG(REVENUE) OVER (ORDER BY YEAR_MONTH) AS PREV_MONTH_REVENUE,
+ REVENUE - LAG(REVENUE) OVER (ORDER BY YEAR_MONTH) AS MONTH_CHANGE,
+ ROUND(
+ 100.0 * (REVENUE - LAG(REVENUE) OVER (ORDER BY YEAR_MONTH)) /
+ LAG(REVENUE) OVER (ORDER BY YEAR_MONTH),
+ 2
+ ) AS PERCENT_CHANGE
+FROM MONTHLY_REVENUE
+ORDER BY YEAR_MONTH;
+```
+
+#### Running Totals and Cumulative Sums
+
+```sql
+-- Calculate
running revenue total +SELECT + DATE_FIELD, + DAILY_REVENUE, + SUM(DAILY_REVENUE) OVER ( + ORDER BY DATE_FIELD + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS CUMULATIVE_REVENUE, + AVG(DAILY_REVENUE) OVER ( + ORDER BY DATE_FIELD + ROWS BETWEEN 6 PRECEDING AND CURRENT ROW + ) AS ROLLING_7DAY_AVG +FROM DAILY_SALES +ORDER BY DATE_FIELD; +``` + +### Common Table Expressions (CTEs) + +#### Multi-Level CTE with Dependencies + +```sql +WITH monthly_sales AS ( + SELECT + CUSTOMER_ID, + EXTRACT(YEAR_MONTH FROM ORDER_DATE) AS YEAR_MONTH, + SUM(ORDER_AMOUNT) AS MONTHLY_REVENUE + FROM ORDERS + GROUP BY CUSTOMER_ID, EXTRACT(YEAR_MONTH FROM ORDER_DATE) +), +customer_annual AS ( + SELECT + CUSTOMER_ID, + SUM(MONTHLY_REVENUE) AS ANNUAL_REVENUE, + AVG(MONTHLY_REVENUE) AS AVG_MONTHLY_REVENUE, + MAX(MONTHLY_REVENUE) AS MAX_MONTHLY_REVENUE, + MIN(MONTHLY_REVENUE) AS MIN_MONTHLY_REVENUE + FROM monthly_sales + GROUP BY CUSTOMER_ID +), +customer_with_rank AS ( + SELECT + CUSTOMER_ID, + ANNUAL_REVENUE, + RANK() OVER (ORDER BY ANNUAL_REVENUE DESC) AS REVENUE_RANK, + CASE + WHEN ANNUAL_REVENUE >= 500000 THEN 'PLATINUM' + WHEN ANNUAL_REVENUE >= 250000 THEN 'GOLD' + WHEN ANNUAL_REVENUE >= 100000 THEN 'SILVER' + ELSE 'BRONZE' + END AS CUSTOMER_SEGMENT + FROM customer_annual +) +SELECT * +FROM customer_with_rank +WHERE REVENUE_RANK <= 100 +ORDER BY REVENUE_RANK; +``` + +## Delta Load Patterns + +### Timestamp-Based Delta Load + +```sql +PROCEDURE LOAD_CUSTOMER_DELTA_TIMESTAMP ( + IN iv_last_load_timestamp TIMESTAMP +) +LANGUAGE SQLSCRIPT +AS +BEGIN + -- Get the current maximum timestamp (becomes next watermark) + DECLARE v_current_max_timestamp TIMESTAMP; + + SELECT MAX(LAST_MODIFIED_AT) INTO v_current_max_timestamp + FROM SOURCE_CUSTOMER; + + -- Load only modified records + UPSERT TARGET_CUSTOMER ( + CUSTOMER_ID, + NAME, + EMAIL, + PHONE, + LAST_MODIFIED_AT + ) + SELECT + CUSTOMER_ID, + NAME, + EMAIL, + PHONE, + LAST_MODIFIED_AT + FROM SOURCE_CUSTOMER + WHERE LAST_MODIFIED_AT > 
:iv_last_load_timestamp + AND LAST_MODIFIED_AT <= :v_current_max_timestamp; + + -- Update the watermark + UPDATE LOAD_WATERMARK + SET LAST_LOAD_TIMESTAMP = :v_current_max_timestamp, + LOAD_COUNT = LOAD_COUNT + 1, + LAST_RUN_DATE = CURRENT_TIMESTAMP() + WHERE TABLE_NAME = 'CUSTOMER'; +END; +``` + +### Change Data Capture (CDC) Pattern + +```sql +PROCEDURE LOAD_FROM_CDC_QUEUE ( + IN iv_queue_name STRING +) +LANGUAGE SQLSCRIPT +AS +BEGIN + -- Process CDC log entries + MERGE INTO FACT_ORDERS fo + USING ( + SELECT + ORDER_ID, + CUSTOMER_ID, + ORDER_AMOUNT, + ORDER_DATE, + OPERATION, + OPERATION_TIMESTAMP + FROM CDC_QUEUE + WHERE QUEUE_NAME = :iv_queue_name + AND PROCESSED_FLAG = 'N' + ) cdc + ON fo.ORDER_ID = cdc.ORDER_ID + WHEN MATCHED AND cdc.OPERATION = 'D' THEN + DELETE + WHEN MATCHED AND cdc.OPERATION IN ('U', 'M') THEN + UPDATE SET + fo.CUSTOMER_ID = cdc.CUSTOMER_ID, + fo.ORDER_AMOUNT = cdc.ORDER_AMOUNT, + fo.ORDER_DATE = cdc.ORDER_DATE + WHEN NOT MATCHED AND cdc.OPERATION IN ('I', 'M') THEN + INSERT (ORDER_ID, CUSTOMER_ID, ORDER_AMOUNT, ORDER_DATE) + VALUES (cdc.ORDER_ID, cdc.CUSTOMER_ID, cdc.ORDER_AMOUNT, cdc.ORDER_DATE); + + -- Mark CDC entries as processed + UPDATE CDC_QUEUE + SET PROCESSED_FLAG = 'Y', + PROCESSED_TIMESTAMP = CURRENT_TIMESTAMP() + WHERE QUEUE_NAME = :iv_queue_name + AND PROCESSED_FLAG = 'N'; +END; +``` + +### Numeric Sequence Delta Load + +```sql +PROCEDURE LOAD_MATERIAL_DELTA_SEQUENCE ( + IN iv_last_sequence INT +) +LANGUAGE SQLSCRIPT +AS +BEGIN + DECLARE v_current_max_sequence INT; + + -- Load only records with sequence > last_sequence + SELECT MAX(CHANGE_SEQUENCE) INTO v_current_max_sequence + FROM SOURCE_MATERIAL; + + UPSERT TARGET_MATERIAL + SELECT + MATERIAL_ID, + MATERIAL_NAME, + UNIT_PRICE, + MATERIAL_GROUP, + CHANGE_SEQUENCE + FROM SOURCE_MATERIAL + WHERE CHANGE_SEQUENCE > :iv_last_sequence + AND CHANGE_SEQUENCE <= :v_current_max_sequence; + + -- Store the new sequence number + UPDATE SEQUENCE_WATERMARK + SET LAST_SEQUENCE = 
:v_current_max_sequence,
+ LAST_LOAD_DATE = CURRENT_TIMESTAMP()
+ WHERE TABLE_NAME = 'MATERIAL';
+END;
+```
+
+## SCD Type 2 Implementation Templates
+
+### Full SCD Type 2 Pattern with History
+
+```sql
+PROCEDURE LOAD_DIM_ACCOUNT_SCD2 (
+ IN iv_effective_date DATE
+)
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+ DECLARE EXIT HANDLER FOR SQLEXCEPTION
+ RESIGNAL;
+
+ -- Step 1: Close current records whose tracked attributes changed in the source
+ UPDATE DIM_ACCOUNT da
+ SET END_DATE = :iv_effective_date,
+ IS_CURRENT = 'N'
+ WHERE da.IS_CURRENT = 'Y'
+ AND EXISTS (
+ SELECT 1
+ FROM SOURCE_ACCOUNT sa
+ WHERE sa.ACCOUNT_ID = da.ACCOUNT_ID
+ AND ( sa.ACCOUNT_NAME <> da.ACCOUNT_NAME
+ OR sa.ACCOUNT_TYPE <> da.ACCOUNT_TYPE
+ OR sa.REGION <> da.REGION
+ OR sa.MANAGER_ID <> da.MANAGER_ID )
+ );
+
+ -- Step 2: Insert new versions for new accounts and for the versions just closed
+ -- (unchanged accounts keep their current row and are skipped)
+ INSERT INTO DIM_ACCOUNT (
+ ACCOUNT_SK,
+ ACCOUNT_ID,
+ ACCOUNT_NAME,
+ ACCOUNT_TYPE,
+ REGION,
+ MANAGER_ID,
+ EFFECTIVE_DATE,
+ END_DATE,
+ IS_CURRENT
+ )
+ SELECT
+ NEXT VALUE FOR SEQ_ACCOUNT_SK,
+ sa.ACCOUNT_ID,
+ sa.ACCOUNT_NAME,
+ sa.ACCOUNT_TYPE,
+ sa.REGION,
+ sa.MANAGER_ID,
+ :iv_effective_date,
+ NULL,
+ 'Y'
+ FROM SOURCE_ACCOUNT sa
+ WHERE NOT EXISTS (
+ SELECT 1
+ FROM DIM_ACCOUNT da
+ WHERE da.ACCOUNT_ID = sa.ACCOUNT_ID
+ AND da.IS_CURRENT = 'Y'
+ );
+END;
+```
+
+### Audit Trail for SCD Type 2
+
+```sql
+PROCEDURE AUDIT_SCD2_CHANGES (
+ IN iv_table_name STRING,
+ IN iv_effective_date DATE
+)
+LANGUAGE SQLSCRIPT
+AS
+BEGIN
+ INSERT INTO DIM_AUDIT_TRAIL (
+ TABLE_NAME,
+ RECORD_ID,
+ CHANGE_TYPE,
+ CHANGED_COLUMNS,
+ OLD_VALUES,
+ NEW_VALUES,
+ EFFECTIVE_DATE,
+ AUDIT_TIMESTAMP
+ )
+ SELECT
+
:iv_table_name,
+ ACCOUNT_ID,
+ CASE
+ WHEN OLD_NAME IS NULL THEN 'NEW'
+ ELSE 'CHANGED'
+ END,
+ cols,
+ OLD_VALUES,
+ NEW_VALUES,
+ :iv_effective_date,
+ CURRENT_TIMESTAMP()
+ FROM (
+ SELECT
+ sa.ACCOUNT_ID,
+ sa.ACCOUNT_NAME,
+ da.ACCOUNT_NAME AS OLD_NAME,
+ -- Build a pipe-separated list of all changed columns
+ RTRIM(
+ CASE WHEN da.ACCOUNT_NAME <> sa.ACCOUNT_NAME THEN 'ACCOUNT_NAME|' ELSE '' END ||
+ CASE WHEN da.REGION <> sa.REGION THEN 'REGION|' ELSE '' END ||
+ CASE WHEN da.MANAGER_ID <> sa.MANAGER_ID THEN 'MANAGER_ID|' ELSE '' END,
+ '|'
+ ) AS cols,
+ CONCAT('OLD_NAME:', COALESCE(da.ACCOUNT_NAME, 'NULL'),
+ '|OLD_REGION:', COALESCE(da.REGION, 'NULL')) AS OLD_VALUES,
+ CONCAT('NEW_NAME:', COALESCE(sa.ACCOUNT_NAME, 'NULL'),
+ '|NEW_REGION:', COALESCE(sa.REGION, 'NULL')) AS NEW_VALUES
+ FROM SOURCE_ACCOUNT sa
+ LEFT JOIN DIM_ACCOUNT da
+ ON sa.ACCOUNT_ID = da.ACCOUNT_ID
+ AND da.IS_CURRENT = 'Y'
+ );
+END;
+```
+
+## Data Cleansing Transformation Recipes
+
+### Remove Duplicates, Keep Latest
+
+```sql
+CREATE TABLE CUSTOMER_CLEANED AS
+WITH ranked_records AS (
+ SELECT
+ *,
+ ROW_NUMBER() OVER (
+ PARTITION BY EMAIL
+ ORDER BY LAST_MODIFIED_AT DESC
+ ) AS RN
+ FROM CUSTOMER_RAW
+ WHERE EMAIL IS NOT NULL
+)
+SELECT * FROM ranked_records WHERE RN = 1;
+```
+
+### Fix Common Data Quality Issues
+
+```sql
+SELECT
+ -- Trim and uppercase strings
+ UPPER(TRIM(CUSTOMER_NAME)) AS CUSTOMER_NAME,
+ -- Standardize phone format
+ REPLACE(REPLACE(REPLACE(REPLACE(PHONE, ' ', ''), '-', ''), '(', ''), ')', '') AS PHONE_NORMALIZED,
+ -- Handle null amounts
+ COALESCE(SALES_AMOUNT, 0) AS SALES_AMOUNT,
+ -- Fix dates
+ CASE
+ WHEN BIRTH_DATE > CURRENT_DATE THEN CURRENT_DATE - 1
+ WHEN BIRTH_DATE < '1900-01-01' THEN NULL
+ ELSE BIRTH_DATE
+ END AS BIRTH_DATE_FIXED,
+ -- Standardize boolean
+ CASE WHEN IS_ACTIVE IN ('Y', 'true', 1, 'yes') THEN 'Y' ELSE 'N' END AS IS_ACTIVE
+FROM CUSTOMER_RAW;
+```
+
+### Consolidate Duplicate Entries
+
+```sql
+SELECT
+ CUSTOMER_ID,
+
MAX(CUSTOMER_NAME) AS CUSTOMER_NAME, + MAX(EMAIL) AS EMAIL, + COUNT(*) AS OCCURRENCE_COUNT, + MAX(LAST_MODIFIED_AT) AS LATEST_UPDATE, + STRING_AGG(PHONE, '|' ORDER BY PHONE) AS ALL_PHONES +FROM CUSTOMER_RAW +WHERE CUSTOMER_ID IS NOT NULL +GROUP BY CUSTOMER_ID +HAVING COUNT(*) > 1; +``` + +### Classify Data Quality Issues + +```sql +SELECT + CUSTOMER_ID, + CUSTOMER_NAME, + EMAIL, + PHONE, + CASE + WHEN CUSTOMER_NAME IS NULL THEN 'MISSING_NAME' + WHEN LENGTH(TRIM(CUSTOMER_NAME)) < 3 THEN 'NAME_TOO_SHORT' + WHEN EMAIL IS NULL THEN 'MISSING_EMAIL' + WHEN NOT EMAIL LIKE '%@%.%' THEN 'INVALID_EMAIL' + WHEN PHONE IS NULL THEN 'MISSING_PHONE' + WHEN REGEXP_LIKE(PHONE, '^[0-9\-\(\) ]+$') = 0 THEN 'INVALID_PHONE' + ELSE 'VALID' + END AS DATA_QUALITY_FLAG +FROM CUSTOMER_RAW; +``` + +## Date/Time Manipulation Functions + +### Fiscal Calendar Calculations + +```sql +-- Return week number within fiscal year +SELECT + CALENDAR_DATE, + EXTRACT(YEAR FROM CALENDAR_DATE) AS CAL_YEAR, + EXTRACT(MONTH FROM CALENDAR_DATE) AS CAL_MONTH, + EXTRACT(WEEK FROM CALENDAR_DATE) AS CAL_WEEK, + -- Fiscal year starting April 1 + CASE + WHEN MONTH(CALENDAR_DATE) >= 4 + THEN YEAR(CALENDAR_DATE) + ELSE YEAR(CALENDAR_DATE) - 1 + END AS FISCAL_YEAR, + -- Fiscal month (1-12, starting April) + CASE + WHEN MONTH(CALENDAR_DATE) >= 4 + THEN MONTH(CALENDAR_DATE) - 3 + ELSE MONTH(CALENDAR_DATE) + 9 + END AS FISCAL_MONTH, + -- Fiscal quarter + CASE + WHEN MONTH(CALENDAR_DATE) IN (4, 5, 6) THEN 1 + WHEN MONTH(CALENDAR_DATE) IN (7, 8, 9) THEN 2 + WHEN MONTH(CALENDAR_DATE) IN (10, 11, 12) THEN 3 + ELSE 4 + END AS FISCAL_QUARTER +FROM DATE_DIMENSION; +``` + +### Age and Duration Calculations + +```sql +SELECT + CUSTOMER_ID, + BIRTH_DATE, + -- Age in years + DATEDIFF(YEAR, BIRTH_DATE, CURRENT_DATE) AS AGE_YEARS, + -- Exact age including months and days + ROUND(DATEDIFF(DAY, BIRTH_DATE, CURRENT_DATE) / 365.25, 2) AS AGE_YEARS_EXACT, + -- Tenure in months + DATEDIFF(MONTH, CUSTOMER_START_DATE, CURRENT_DATE) AS 
TENURE_MONTHS, + -- Days since last activity + DATEDIFF(DAY, LAST_ACTIVITY_DATE, CURRENT_DATE) AS DAYS_INACTIVE +FROM CUSTOMER; +``` + +### Period-Over-Period Comparisons + +```sql +-- Compare current month to same month last year +SELECT + curr.YEAR_MONTH, + curr.REVENUE AS CURRENT_REVENUE, + prev.REVENUE AS PRIOR_YEAR_REVENUE, + curr.REVENUE - prev.REVENUE AS ABSOLUTE_CHANGE, + ROUND(100.0 * (curr.REVENUE - prev.REVENUE) / prev.REVENUE, 2) AS PERCENT_CHANGE +FROM MONTHLY_REVENUE curr +LEFT JOIN MONTHLY_REVENUE prev + ON curr.CUSTOMER_ID = prev.CUSTOMER_ID + AND DATE_ADD(prev.YEAR_MONTH, INTERVAL 12 MONTH) = curr.YEAR_MONTH +ORDER BY curr.YEAR_MONTH; +``` + +### Rolling Window Periods + +```sql +-- 13-week rolling average +SELECT + WEEK_ENDING_DATE, + WEEKLY_REVENUE, + AVG(WEEKLY_REVENUE) OVER ( + ORDER BY WEEK_ENDING_DATE + ROWS BETWEEN 12 PRECEDING AND CURRENT ROW + ) AS ROLLING_13WEEK_AVG, + -- YTD total + SUM(WEEKLY_REVENUE) OVER ( + PARTITION BY EXTRACT(YEAR FROM WEEK_ENDING_DATE) + ORDER BY WEEK_ENDING_DATE + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS YTD_REVENUE +FROM WEEKLY_REVENUE_TREND +ORDER BY WEEK_ENDING_DATE; +``` + +## Python Operator API Reference + +### Basic DataFrame Operations + +```python +import pandas as pd + +def transform_data(input_df): + """ + Common DataFrame operations + """ + # Select columns + subset = input_df[['customer_id', 'amount', 'date']] + + # Filter rows + active = input_df[input_df['status'] == 'ACTIVE'] + + # Add computed columns + input_df['new_column'] = input_df['amount'] * 1.1 + + # Rename columns + renamed = input_df.rename(columns={'amount': 'sale_amount'}) + + # Drop nulls in specific column + no_nulls = input_df.dropna(subset=['amount']) + + return input_df +``` + +### Aggregation and Grouping + +```python +def aggregate_by_segment(orders_df): + """ + Group and aggregate operations + """ + summary = orders_df.groupby('customer_segment').agg({ + 'order_amount': ['sum', 'mean', 'count', 'std'], + 
'order_date': ['min', 'max'], + 'customer_id': 'nunique' + }).reset_index() + + # Multi-level grouping + customer_segment = orders_df.groupby(['customer_id', 'segment']).agg( + total_amount=('order_amount', 'sum'), + order_count=('order_id', 'count'), + avg_order_size=('order_amount', 'mean') + ).reset_index() + + return summary +``` + +### Merging and Joining + +```python +def join_dimensions(facts_df, customer_df, product_df): + """ + Multi-input fusion with joins + """ + # Inner join + enriched = facts_df.merge( + customer_df[['customer_id', 'segment', 'region']], + on='customer_id', + how='inner' + ) + + # Left join with product + enriched = enriched.merge( + product_df[['product_id', 'category', 'list_price']], + on='product_id', + how='left' + ) + + # Fill missing with defaults + enriched['segment'] = enriched['segment'].fillna('UNKNOWN') + enriched['category'] = enriched['category'].fillna('UNCATEGORIZED') + + return enriched +``` + +### Advanced Transformations + +```python +def apply_business_rules(orders_df, rules_config): + """ + Complex business logic + """ + # Apply conditional transformations + orders_df['discount_rate'] = orders_df.apply( + lambda row: apply_discount_rule( + row['order_amount'], + row['customer_segment'], + row['order_date'] + ), + axis=1 + ) + + # Pivot transformation + monthly_summary = orders_df.pivot_table( + index='customer_id', + columns='order_month', + values='order_amount', + aggfunc='sum', + fill_value=0 + ) + + # String operations + orders_df['normalized_name'] = orders_df['customer_name'].str.upper().str.strip() + + return orders_df +``` + diff --git a/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/SKILL.md new file mode 100644 index 0000000..218273c --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/SKILL.md @@ -0,0 +1,1468 @@ +--- +name: Transport Manager +description: "Move objects between 
Datasphere tenants using transport packages. Use when migrating objects from Dev to QA/Prod, managing versions, handling dependencies, or integrating SAP Content Network packages. Keywords: transport, package, export, import, CSN, JSON, dev qa prod, migration, dependencies, version control." +--- + +# Transport Manager Skill + +## Overview + +The Transport Manager skill guides you through the complete lifecycle of creating, exporting, and importing transport packages in SAP Datasphere. From packaging objects in your development environment to deploying them into production tenants, this skill covers the essential workflows for managing object movement across your tenant landscape. + +## When to Use This Skill + +Trigger this skill when you need to: +- Move objects from development to quality assurance or production +- Create reusable transport packages for deployment +- Manage dependencies between objects during transport +- Handle version control and change tracking +- Import objects from SAP Content Network +- Resolve conflicts when importing into target tenants +- Test transport package validity before deployment +- Troubleshoot missing dependencies or import failures +- Establish automated deployment pipelines +- Document transport history and decisions + +## Transport Concept in Datasphere + +### What is Transport in Datasphere? + +Transport is the mechanism for moving data models (tables, views, data flows, etc.) between isolated Datasphere tenants representing different environments: Development (Dev), Quality Assurance (QA), and Production (Prod). 
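The Dev-to-Prod flow can be pictured as a packaging operation: collect the selected object definitions and their dependency graph, bundle them with metadata, and checksum the result so the import side can verify it. The sketch below assembles an in-memory archive mirroring the package layout described in this section; the file names and JSON fields are illustrative assumptions for explanation only, not the exact format the Datasphere Transport cockpit produces.

```python
import hashlib
import io
import json
import zipfile

def build_transport_package(name, version, objects, dependencies):
    """Assemble an illustrative transport-style archive: package metadata,
    an object list, a dependency graph, and one CSN-style file per object."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps({"name": name, "version": version}))
        zf.writestr("objects.json", json.dumps(sorted(objects)))
        zf.writestr("dependencies.json", json.dumps(dependencies))
        for obj in objects:
            # Placeholder CSN body; real exports carry the full object definition
            zf.writestr(f"content/{obj}.csn", json.dumps({"definitions": {obj: {}}}))
    data = buf.getvalue()
    # SHA-256 checksum, used to verify the download after export
    return data, hashlib.sha256(data).hexdigest()

pkg, digest = build_transport_package(
    "Customer Analytics Suite", "1.0.0",
    ["customer_master", "vw_customer_active"],
    {"vw_customer_active": ["customer_master"]},
)
names = zipfile.ZipFile(io.BytesIO(pkg)).namelist()
```

Recording the checksum alongside the archive is what makes the later "checksum matches" step of the export validation checklist possible on the import side.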
+ +### Multi-Tenant Landscape + +**Typical Setup:** +``` +┌─────────────────────────────────────────────────────────┐ +│ SAP DATASPHERE MULTI-TENANT LANDSCAPE │ +├─────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────────┐ ┌──────────────────┐ │ +│ │ DEVELOPMENT │ │ QUALITY ASSURE │ │ +│ │ TENANT │ │ TENANT │ │ +│ │ │ │ │ │ +│ │ • Rapid changes │ │ • Validation │ │ +│ │ • Experiments │ │ • Testing │ │ +│ │ • Draft objects │ │ • UAT │ │ +│ │ │ │ • Stable subset │ │ +│ └────────┬─────────┘ └────────┬─────────┘ │ +│ │ TRANSPORT PACKAGE │ TRANSPORT PACKAGE │ +│ │ Export │ Export │ +│ v v │ +│ ┌──────────────────────────────────────┐ │ +│ │ PRODUCTION TENANT │ │ +│ │ │ │ +│ │ • Live data models │ │ +│ │ • Approved objects only │ │ +│ │ • Change-controlled │ │ +│ │ • High availability & backup │ │ +│ └──────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────┘ +``` + +### Transport Package Format + +**Package Composition:** +- **Metadata**: Objects and their definitions (tables, views, data flows) +- **Dependencies**: References to related objects +- **Versioning**: Timestamp and version information +- **Change Log**: What objects changed, who made changes +- **Format**: CSN (Core Schema Notation) or JSON + +**File Structure:** +``` +transport_package_20240115.zip +├── metadata.json # Package metadata +├── objects.json # Object definitions +├── dependencies.json # Dependency graph +├── changelog.json # Version/change history +└── content/ + ├── table_001.csn # Table definitions + ├── view_001.csn # View definitions + ├── dataflow_001.csn # Data flow definitions + └── ... 
+``` + +### Key Benefits of Transport + +| Benefit | Impact | Use Case | +|---------|--------|----------| +| **Controlled Deployment** | Change tracked, auditable | Regulatory compliance, governance | +| **Reproducibility** | Same object definitions across tenants | Consistency across environments | +| **Rollback Capability** | Revert to prior package version | Error recovery, quick fixes | +| **Content Reuse** | Share packages across teams/companies | Accelerate implementation | +| **Version Control** | Track object evolution over time | Historical analysis, compliance | + +## Package Creation Workflow + +### Step 1: Plan Package Contents + +**Before Creating Package, Determine:** + +1. **Scope**: Which objects to include + - Tables, Views, Data Flows, Replication Flows, Analytic Models + - Business rules, Calculations, Custom Logic + - Documentation and Metadata + +2. **Dependencies**: What other objects are required + - Tables sourcing other tables + - Views building on tables or other views + - Data flows reading from tables or views + - Connections needed for data flows + +3. **Versioning**: What version are you releasing? 
+ - Version number (semantic: 1.0, 1.1, 2.0) + - Release date + - Change summary + +**Example Planning Session:** + +``` +PACKAGE: Customer Analytics Suite v1.0 + +CONTENTS: +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +Table: customer_master (source) +Table: customer_transactions (source) + +View: vw_customer_active + Dependencies: customer_master + +View: vw_customer_lifetime_value + Dependencies: customer_transactions, customer_master + +View: vw_customer_segmentation + Dependencies: vw_customer_lifetime_value, customer_master + +Data Flow: df_customer_enrichment + Dependencies: vw_customer_active (reads), customer_master (writes) + +Data Flow: df_segment_scoring + Dependencies: vw_customer_segmentation (reads), customer_master (writes) + +CONNECTION: Connection_ERP (for replication) + +TOTAL OBJECTS: 8 +DEPENDENCIES: 6 relationships +ESTIMATED TRANSPORT SIZE: 15 MB + +VERSION: 1.0 (Initial release) +TARGET DEPLOYMENT: QA first (UAT), then Production +``` + +### Step 2: Navigate Transport Cockpit + +**Location:** +``` +Datasphere Home → Administration → Transport +``` + +**Transport Cockpit Sections:** +1. **Export Packages**: Create and manage outbound packages +2. **Import History**: View imported packages and status +3. **Package Library**: Browse and reuse existing packages +4. 
**Deployment Queue**: Scheduled transports
+
+### Step 3: Create New Package
+
+**Package Creation Dialog:**
+
+| Field | Required | Guidance |
+|-------|----------|----------|
+| **Package Name** | Yes | Descriptive, business-focused (50 chars max) |
+| **Package ID** | Auto | System identifier (immutable) |
+| **Description** | Yes | Purpose and contents (500 chars) |
+| **Version** | Yes | Semantic version (X.Y.Z) |
+| **Release Notes** | Yes | What changed, why |
+| **Source Space** | Yes | Space containing objects to transport |
+| **Target Tenants** | Yes | Which tenant(s) will receive this |
+
+**Example Package Definition:**
+
+```
+Package Name: Customer Analytics Suite
+Package ID: PKG_CUST_ANALYTICS_001
+Version: 1.0.0
+
+Description:
+ Comprehensive customer analytics data models including master data,
+ customer lifetime value calculations, segmentation logic, and supporting
+ data flows. Includes 2 source tables, 3 analytic views, and 2 data flows.
+
+Release Notes (v1.0.0):
+ - Initial release with customer master and segmentation models
+ - Includes vw_customer_lifetime_value for RFM analysis
+ - Includes df_enrichment data flow for daily customer updates
+ - Tested with 50M customer records, supports 99.9% uptime SLA
+
+Release Date: 2024-01-15
+Released By: john.smith@company.com
+
+Target Tenants:
+ - QA Tenant (datasphere-qa.company.com)
+ - Production Tenant (datasphere-prod.company.com)
+```
+
+### Step 4: Select Objects for Transport
+
+**Selection Process:**
+
+```
+DATASPHERE TRANSPORT DIALOG
+═══════════════════════════════
+
+Space: Sales Analytics
+
+Available Objects:
+ ☑ table__customer_master
+ ☑ table__customer_transactions
+ ☑ vw_customer_active
+ ☑ vw_customer_lifetime_value
+ ☑ vw_customer_segmentation
+ ☑ df_customer_enrichment
+ ☑ df_segment_scoring
+ ☐ custom_calculation_revenue
+ ☐ business_rule_inactive_filter
+
+SELECTED: 7 objects
+TOTAL SIZE: ~12 MB
+ESTIMATED DEPENDENCY CHECK TIME: 30 seconds
+```
+
+**Selection Tips:**
+- Don't include "draft" or "in-progress" objects +- Include all dependencies explicitly +- Start with tables, then views, then flows +- Group related objects in one package +- Avoid transporting test/temporary objects + +### Step 5: Verify Complete List + +**Review Checklist:** + +``` +PACKAGE CONTENTS REVIEW +═══════════════════════ + +OBJECTS INCLUDED: +[✓] customer_master (Table) +[✓] customer_transactions (Table) +[✓] vw_customer_active (View) +[✓] vw_customer_lifetime_value (View) +[✓] vw_customer_segmentation (View) +[✓] df_customer_enrichment (Data Flow) +[✓] df_segment_scoring (Data Flow) + Total: 7 objects + +DEPENDENCIES VERIFIED: +[✓] All referenced tables included +[✓] All dependent views included +[✓] All data flow sources identified +[✓] No external dependencies outside this package + +SIZE ANALYSIS: +[✓] Total size: ~12 MB (reasonable) +[✓] Largest object: vw_customer_segmentation (3.5 MB) +[✓] Compressed transport size: ~3.2 MB + +VERSIONING: +[✓] New package (v1.0.0) +[✓] Version > previous release (N/A, first) +[✓] Release notes complete +[✓] Change log documented +``` + +## Dependency Checking + +### Understanding Object Dependencies + +**Dependency Graph Example:** + +``` +DEPENDENCY HIERARCHY +════════════════════════════════════════════════════════════ + +SOURCE TABLES (No dependencies) + │ + ├─ customer_master + │ (Base customer data from ERP) + │ + └─ customer_transactions + (Base sales transactions) + + +LEVEL 1 VIEWS (Depend on source tables) + │ + ├─ vw_customer_active + │ (Filter: status = 'ACTIVE') + │ └─ Depends on: customer_master + │ + └─ vw_transaction_detail + (Enriched transaction data) + └─ Depends on: customer_transactions, customer_master + + +LEVEL 2 VIEWS (Depend on Level 1 views or tables) + │ + └─ vw_customer_lifetime_value + (Calculate LTV metrics) + └─ Depends on: vw_customer_active, vw_transaction_detail + +LEVEL 3 VIEWS (Depend on Level 2 views) + │ + └─ vw_customer_segmentation + (RFM segmentation using LTV) + └─ 
Depends on: vw_customer_lifetime_value + + +DATA FLOWS (Complex dependencies) + │ + ├─ df_customer_enrichment + │ (Read: vw_customer_active, Write: customer_master) + │ └─ Depends on: customer_master, customer_transactions + │ + └─ df_segment_scoring + (Read: vw_customer_segmentation, Write: customer_master) + └─ Depends on: vw_customer_lifetime_value + +CONNECTIONS + │ + └─ Connection_ERP (for replication flows) + (Used by: df_customer_enrichment) +``` + +### Dependency Resolution Algorithm + +**Automatic Dependency Detection:** + +``` +STEP 1: Identify Direct Dependencies +──────────────────────────────────── +For each selected object: + 1. Scan object definition for table/view references + 2. Extract all SOURCE object names + 3. Add to dependency list + +Example: vw_customer_lifetime_value + Definition: SELECT ... FROM vw_transaction_detail, vw_customer_active + Direct Dependencies: vw_transaction_detail, vw_customer_active + + +STEP 2: Recursive Dependency Resolution +───────────────────────────────────────── +For each direct dependency: + 1. Check if object included in package + 2. If not included: + a. Mark as MISSING + b. Add to dependency resolution queue + 3. Recursively resolve object's dependencies + +Example: vw_customer_lifetime_value + Direct: vw_transaction_detail, vw_customer_active + vw_transaction_detail: + └─ Depends on: customer_transactions, customer_master + vw_customer_active: + └─ Depends on: customer_master + + +STEP 3: Build Complete Dependency List +──────────────────────────────────────── +Transitive closure of all dependencies: + vw_customer_lifetime_value + ├─ vw_transaction_detail + │ ├─ customer_transactions + │ └─ customer_master + └─ vw_customer_active + └─ customer_master + +Unique dependencies: customer_master, customer_transactions + + +STEP 4: Verify Completeness +──────────────────────────── +✓ All dependencies included? +✓ All objects deployable? +✓ Version compatibility? +✓ No circular dependencies? 
+``` + +### Manual Dependency Verification + +**When to Manually Check:** + +1. **Custom Objects**: Custom code not auto-detected + ``` + View definition uses SCRIPT operator: + CREATE VIEW vw_complex_calc AS + SELECT * FROM table_x + PLUS + + Manual Check: Verify CUSTOM_SCRIPT dependencies + ``` + +2. **External Data Sources**: References to systems outside Datasphere + ``` + Data Flow reads from: SAP S/4HANA ERP via Replication Flow + + Manual Check: Is Replication Flow in package? + Is connection definition included? + ``` + +3. **Parameter References**: Views/flows using global parameters + ``` + View: SELECT * FROM table WHERE region = $$REGION_PARAM + + Manual Check: Is parameter defined in target tenant? + Is default value set? + ``` + +**Manual Dependency Checklist:** + +``` +MANUAL VERIFICATION CHECKLIST +══════════════════════════════════════════════ + +For each object in package: + +[ ] Check view source code for table references + Command: View Definition → Review "FROM" clause + +[ ] Check view parameters + Command: View Definition → Review "WHERE" clause for parameters + +[ ] Check data flow connections + Command: Data Flow → Review "Source" operator + +[ ] Check for user-defined functions + Command: View Definition → Search for custom UDF calls + +[ ] Check replication flows + Command: Data Flow → Review Replication object references + +[ ] Verify external system connections + Command: Transport Cockpit → Review connection list + +[ ] Document all dependencies found + Format: Dependency CSV with: + [Dependent Object] | [Required Object] | [Type] + +[ ] Validate no circular dependencies + Example (CIRCULAR - BAD): + vw_a → vw_b → vw_c → vw_a + +[ ] Confirm all dependencies included + Mark: [✓] Included in package + [✗] MISSING - must add +``` + +## Export Workflow + +### Step 1: Prepare for Export + +**Pre-Export Validation:** + +``` +EXPORT READINESS CHECKLIST +═══════════════════════════════ + +OBJECT VALIDATION +[ ] All objects compile without 
errors + Check: Datasphere UI shows no error indicators + All objects in "ACTIVE" status + +[ ] No in-progress edits + Check: No one has objects checked out + All changes committed + +[ ] Documentation complete + Check: All objects have descriptions + Business context documented + +[ ] Testing completed + Check: All views query successfully + Data flows execute without errors + No data quality issues detected + +DEPENDENCY VALIDATION +[ ] All dependencies included in package + Check: Run dependency analysis + No "MISSING" or "UNRESOLVED" indicators + +[ ] No circular dependencies + Check: Dependency tool reports no cycles + Manual review of complex paths + +[ ] Version compatibility verified + Check: All object versions compatible + No deprecated syntax used + +METADATA VALIDATION +[ ] Package metadata complete + Check: Version number set + Release notes documented + Change log updated + +[ ] Access rights correct + Check: Objects not marked confidential/restricted + Appropriate consumers can see metadata + No unnecessary security settings + +BACKUP & SAFETY +[ ] Source tenant has recent backup + Check: Run backup before export + Backup verified restorable + +[ ] Change log prepared + Check: Document all changes since last version + Note any breaking changes +``` + +### Step 2: Execute Export + +**Export Process:** + +``` +DATASPHERE TRANSPORT COCKPIT +═══════════════════════════════ + +1. Click: "Create Export Package" + +2. Enter Package Details: + Package Name: Customer Analytics Suite + Version: 1.0.0 + Description: [as planned] + Release Notes: [as prepared] + +3. Select Objects: + ☑ customer_master (Table) + ☑ customer_transactions (Table) + ☑ vw_customer_active (View) + ☑ vw_customer_lifetime_value (View) + ☑ vw_customer_segmentation (View) + ☑ df_customer_enrichment (Data Flow) + ☑ df_segment_scoring (Data Flow) + +4. Verify Dependencies: + ✓ All dependencies included + ✓ No circular dependencies detected + ✓ Total size: 12 MB + ✓ 7 objects selected + +5. 
Select Export Format: + Format: CSN (Core Schema Notation) + Compression: GZIP + Output: transport_cust_analytics_v1_0_0.zip + +6. Click: "Generate Export Package" + Status: Processing... + Progress: [████████░░░░░░░░░░] 40% + ... + Status: EXPORT COMPLETE + File: transport_cust_analytics_v1_0_0.zip (3.2 MB) + +7. Download Package: + Action: Download to local file system + Verify: Checksum SHA256: a3f5e... +``` + +### Step 3: Validate Export Package + +**Post-Export Verification:** + +``` +POST-EXPORT VALIDATION +════════════════════════════ + +FILE CHECKS +[ ] File exists and readable + Command: ls -lh transport_cust_analytics_v1_0_0.zip + +[ ] File size reasonable + Expected: ~3.2 MB + Actual: 3.2 MB ✓ + +[ ] Checksum matches + Expected: a3f5e... + Actual: a3f5e... ✓ + +[ ] Not corrupted (test extraction) + Command: unzip -t transport_cust_analytics_v1_0_0.zip + Result: All files test OK ✓ + +CONTENTS VERIFICATION +[ ] Package metadata present + Check: metadata.json exists + +[ ] Object definitions present + Check: objects.json contains 7 objects + +[ ] Dependencies documented + Check: dependencies.json complete + +[ ] File structure valid + Check: All expected folders and files present + +VERSION VERIFICATION +[ ] Version number in metadata matches intent + Expected: 1.0.0 + Actual: 1.0.0 ✓ + +[ ] Release notes present and correct +[ ] Change log documented +[ ] Timestamp recorded + +SECURITY VERIFICATION +[ ] No sensitive data in export + Check: No passwords, API keys, credentials + No PII unencrypted + +[ ] Encryption enabled + Check: Export uses encryption for transmission + Access controls in place +``` + +## Import Workflow + +### Step 1: Prepare Target Tenant + +**Pre-Import Steps:** + +``` +TARGET TENANT PREPARATION +═══════════════════════════════════════════ + +ENVIRONMENT VERIFICATION +[ ] Target tenant accessible + Test: Login to target Datasphere instance + Verify read/write permissions + +[ ] Required space exists + Check: Target space available in 
destination + Space has sufficient quota + Space permissions allow imports + +[ ] Space is clean and ready + Check: No conflicting objects (same names) + Space has recent backup + Space in stable state (no ongoing changes) + +DEPENDENCY VERIFICATION +[ ] All table sources available + Check: customer_master exists in target (or will be created) + customer_transactions exists in target (or will be created) + Note: Source tables often already exist; confirm status + +[ ] Required connections present + Check: Connection_ERP exists in target + Connection credentials valid + Connection can connect successfully + +[ ] Parameter definitions available + Check: Any global parameters used are defined in target + Default values appropriate for target environment + +[ ] User/Role access configured + Check: Import user has CREATE/UPDATE permissions + Target users have SELECT permissions on objects + Data consumer roles configured + +NETWORK & CAPACITY +[ ] Network connectivity stable + Test: Ping target system + No recent network issues + Upload bandwidth sufficient + +[ ] Target has sufficient space + Check: Disk space > package size * 3 (for extraction) + Memory available for import process + No capacity alerts in target + +BACKUP & SAFETY +[ ] Target tenant has recent backup + Verify: Last backup completed successfully + Backup can be restored if needed + Backup retention meets compliance + +[ ] Rollback plan ready + Document: How to rollback if import fails + Restore procedure tested + Team trained on rollback +``` + +### Step 2: Upload Package to Target + +**Upload Process:** + +``` +TARGET TENANT TRANSPORT COCKPIT +═══════════════════════════════════ + +1. Navigate: Administration → Transport → Import Packages + +2. Click: "Upload Package" + +3. Select File: + File: transport_cust_analytics_v1_0_0.zip + Size: 3.2 MB + +4. Upload: + Progress: [████████████████░░] 80% + Status: Validating package... + Status: UPLOAD SUCCESSFUL + Uploaded: 2024-01-15 14:30 UTC + +5. 
Review Package Details:
+   Package: Customer Analytics Suite
+   Version: 1.0.0
+   Objects: 7 items
+   Size: 12 MB (uncompressed)
+   Release Notes: [as documented]
+
+6. System Pre-Import Check:
+   Checking dependencies...
+   [✓] All table sources available or will be created
+   [✓] All view dependencies resolvable
+   [✓] No version conflicts detected
+   [!] 1 object will be OVERWRITTEN (vw_customer_active exists in target)
+   [!] WARNING: Data flow connections require verification
+       Action Required: Verify Connection_ERP is available
+       Status: User must confirm before import proceeds
+```
+
+### Step 3: Resolve Conflicts (If Any)
+
+**Conflict Scenarios:**
+
+**Scenario 1: Object Already Exists (Update)**
+
+```
+CONFLICT DETECTED
+═════════════════════════════════════════
+
+Object: vw_customer_active (View)
+Status in Target: EXISTS
+
+Current Definition in Target:
+  SELECT * FROM customer_master WHERE status = 'ACTIVE'
+
+New Definition in Package:
+  SELECT * FROM customer_master WHERE status = 'ACTIVE'
+  AND account_age_days > 30  -- CHANGED
+
+Options:
+  [ ] SKIP - Don't overwrite, keep existing
+  [ ] OVERWRITE - Replace with new version
+  [ ] RENAME - Import as vw_customer_active_v2
+  [ ] REVIEW - Show detailed diff before deciding
+
+Decision: OVERWRITE
+Reason: New version includes important account age filter
+        Vetted in QA before export
+```
+
+**Scenario 2: Dependency Missing**
+
+```
+MISSING DEPENDENCY
+═════════════════════════════════════════
+
+Object: vw_customer_lifetime_value (View)
+Required Dependency: vw_transaction_detail (View)
+Status: NOT FOUND in target tenant
+
+Options:
+  [ ] FAIL - Block import, require dependency first
+  [ ] CREATE - Import will create missing dependency
+  [ ] SKIP - Skip this object, import others
+  [ ] ADD_TO_IMPORT - Request missing object be added
+
+Decision: CREATE
+Status: System will create vw_transaction_detail as prerequisite
+```
+
+**Scenario 3: Connection Not Available**
+
+```
+CONNECTION ERROR
+═════════════════════════════════════════
+
+Object: df_customer_enrichment (Data Flow)
+Required Resource: Connection_ERP
+Status: Connection NOT FOUND in target
+
+Error: Cannot import data flow without connection
+
+Options:
+  [ ] SKIP - Import other objects, skip data flow
+  [ ] FAIL - Block entire import
+  [ ] CREATE_MANUAL - Create connection, retry import later
+  [ ] REASSIGN - Use existing similar connection
+
+Decision: CREATE_MANUAL
+Action Required:
+  1. Create Connection_ERP in target manually
+  2. Configure credentials for target ERP system
+  3. Test connection (verify connectivity)
+  4. Retry import
+```
+
+### Step 4: Execute Import
+
+**Import Execution:**
+
+```
+IMPORT EXECUTION
+═════════════════════════════════════════
+
+1. Review Conflict Resolutions:
+   [✓] vw_customer_active (OVERWRITE)
+   [✓] vw_transaction_detail (CREATE)
+   [✓] df_customer_enrichment (SKIP - connection not available)
+   [✓] df_segment_scoring (SKIP - connection not available)
+   [✓] 8 total objects (5 create, 1 update, 2 skip)
+
+2. Click: "Start Import"
+   Status: Validating package...
+   Status: Creating objects...
+
+3. Import Progress:
+   [████████████████████████] 100%
+   Objects Created: 5
+   Objects Updated: 1
+   Objects Skipped: 2
+   Errors: 0
+
+4. Import Complete
+   Status: SUCCESS (with 1 warning)
+   Timestamp: 2024-01-15 14:45 UTC
+   Duration: 15 minutes
+
+5. Post-Import Report:
+   ┌─────────────────────────────────────────────┐
+   │ IMPORT SUMMARY REPORT                       │
+   ├─────────────────────────────────────────────┤
+   │ Package: Customer Analytics Suite v1.0.0    │
+   │ Target Tenant: datasphere-qa.company.com    │
+   │ Import Date: 2024-01-15                     │
+   │                                             │
+   │ OBJECT RESULTS:                             │
+   │ ✓ customer_master (CREATE)                  │
+   │ ✓ customer_transactions (CREATE)            │
+   │ ✓ vw_customer_active (UPDATE)               │
+   │ ✓ vw_customer_lifetime_value (CREATE)       │
+   │ ✓ vw_transaction_detail (CREATE)            │
+   │ ✓ vw_customer_segmentation (CREATE)         │
+   │ ⊘ df_customer_enrichment (SKIPPED)          │
+   │ ⊘ df_segment_scoring (SKIPPED)              │
+   │                                             │
+   │ STATUS: 6 SUCCESS, 2 SKIPPED, 0 FAILED      │
+   │                                             │
+   │ WARNINGS:                                   │
+   │ ! Data flows require manual connection      │
+   │   configuration before execution            │
+   │                                             │
+   │ NEXT STEPS:                                 │
+   │ 1. Create Connection_ERP in target          │
+   │ 2. Configure data flow inputs/outputs       │
+   │ 3. Test view queries                        │
+   │ 4. Execute data flows                       │
+   │ 5. Validate data results                    │
+   └─────────────────────────────────────────────┘
+```
+
+## SAP Content Network Integration
+
+### What is SAP Content Network?
+
+The SAP Content Network provides pre-built, industry-standard data models and solutions available for direct import into Datasphere.
+
+**Content Types:**
+- Industry solutions (Financial Services, Retail, Manufacturing)
+- Data models aligned with SAP standards
+- Best-practice configurations
+- Sample data and documentation
+
+### Discovering Content Network Packages
+
+**Access:**
+```
+Datasphere Home → Content Network
+  OR
+Transport Cockpit → Content Network Tab
+```
+
+**Browse Options:**
+1. **By Industry**: Filter packages for your industry
+2. **By Solution**: Find packaged solutions (e.g., "Sales Cloud Analytics")
+3. **By Use Case**: Browse by analytical need (e.g., "Customer Analytics")
+4. **By Popularity**: See most-used packages
+5. **By New**: Latest releases
+
+**Example Package:**
+
+```
+SAP CONTENT NETWORK PACKAGE
+════════════════════════════════════════════════════════════
+
+Name: SAP Analytics Cloud - Sales Insights
+Publisher: SAP
+Version: 4.2.1
+Release Date: 2024-01-10
+
+Description:
+  Industry-standard Sales Analytics solution with best-practice
+  KPI definitions, customer data models, and sales pipeline
+  analytics. Includes 12 pre-built models, 20+ analytic views,
+  and 5 data flows for common analysis patterns.
+ +Industry: Retail, Manufacturing, High-Tech +Use Cases: + - Sales pipeline forecasting + - Territory performance analysis + - Customer acquisition cost calculation + - Sales rep productivity benchmarking + +Objects Included: + - 12 data models (tables/entities) + - 25 analytic views + - 5 pre-configured data flows + - 8 KPI definitions + - Sample data included + +Size: 45 MB +Estimated Deploy Time: 20 minutes +Prerequisites: + - Datasphere Cloud Edition or higher + - S/4HANA connection (for data replication) + - Business Role "Space Admin" or higher + +Downloads: 5,231 +Rating: 4.8/5.0 stars +Latest Review: "Great starting point, saves weeks of modeling" + +Cost: Free (included with Datasphere license) +Support: Community forum + SAP support +Documentation: Full documentation + videos + sample queries +``` + +### Importing Content Network Package + +**Import Process:** + +``` +STEP 1: Select Package from Content Network + Action: Find and review package details + Check: Prerequisites met, supports use case + +STEP 2: Click "Import to My Tenant" + Automatic: + - Download package + - Validate against target environment + - Display conflict resolution dialog + +STEP 3: Configure Target Space + Selection: + ☐ Create new space: "Sales_Analytics_CN" + ☐ Import to existing space: [select from list] + Recommendation: Create new space (cleaner isolation) + +STEP 4: Resolve Conflicts + Typical Issues: + - Table names exist (choose RENAME or OVERWRITE) + - Connections not available (create manually later) + - Sample data volume (remove if not needed) + +STEP 5: Start Import + Progress: [████████████████░░░░] 50% + Objects: 3 created, 2 dependencies resolving... 
+ Status: IMPORT COMPLETE + +STEP 6: Post-Import Validation + Tasks: + [ ] Navigate to imported space + [ ] Review imported objects + [ ] Test views with sample queries + [ ] Update any connections needed + [ ] Load sample data if provided + [ ] Customize KPI definitions for your data +``` + +### Customizing Imported Content + +**Common Customizations:** + +| Change | Effort | Impact | +|--------|--------|--------| +| Rename objects to match standards | Low | Branding consistency | +| Update data connections | Medium | Enable actual data flows | +| Modify KPI definitions | Medium | Business alignment | +| Add organization filters (region, department) | Medium | Scoping to your org | +| Extend views with additional columns | Medium | Enhanced analysis | +| Remove sample data | Low | Reclaim storage | + +**Example Customization:** + +``` +ORIGINAL PACKAGE OBJECT: + View: vw_sales_by_region + Definition: 5 generic regions (North, South, East, West, Central) + +CUSTOMIZATION: + 1. Clone view: vw_sales_by_region_company_specific + 2. Modify WHERE clause: + Original: WHERE region IN ('North', 'South', ...) + Modified: WHERE region IN (SELECT region FROM region_mapping + WHERE company_id = CURRENT_COMPANY_ID) + 3. Add column: company_name (join to company master) + 4. Test with your data + 5. 
Update dependent views to use company-specific version + +RESULT: + - Original package unmodified (easier to upgrade) + - Company-specific views available for your analytics + - Multi-tenant capable using company_id filter +``` + +## Versioning and Change Tracking + +### Semantic Versioning + +**Version Format: MAJOR.MINOR.PATCH** + +``` +Version Number | Increment When | Breaking | Consumer Action +──────────────────────────────────────────────────────────────── +1.0.0 | Initial release | N/A | Baseline +1.0.1 | Bug fix | No | Optional update +1.1.0 | Feature addition | No | Update recommended +2.0.0 | Breaking change | Yes | Must update +``` + +**Examples:** + +``` +1.0.0 → 1.0.1: Bug fix in vw_customer_lifetime_value calculation + - Backward compatible + - No schema changes + - Safe to deploy anytime + +1.0.1 → 1.1.0: Add new column (tenure_months) to vw_customer_active + - Backward compatible (new column optional) + - Existing queries unaffected + - Safe to deploy anytime + +1.1.0 → 2.0.0: Remove deprecated column (legacy_id) from customer_master + - BREAKING: Queries using legacy_id will fail + - Requires consumer testing before deployment + - 60+ day notice required + +2.0.0 → 2.0.1: Fix data quality issue in customer_master + - Backward compatible + - Data corrected (historical) + - Safe to deploy anytime +``` + +### Change Tracking + +**Changelog Format:** + +``` +CHANGE LOG: Customer Analytics Suite +═══════════════════════════════════════════════════════════════ + +VERSION 1.0.0 (2024-01-15) +─────────────────────────── +Release Type: Initial Release +Status: Production Ready +Author: Data Team +Tested By: QA Team + +OBJECTS ADDED: + - customer_master (Table): Base customer directory + - customer_transactions (Table): All customer purchases + - vw_customer_active (View): Active customers (status = 'ACTIVE') + - vw_customer_lifetime_value (View): 3-year LTV calculation + - vw_customer_segmentation (View): RFM segmentation model + - df_customer_enrichment 
(Data Flow): Daily customer master updates + - df_segment_scoring (Data Flow): Daily RFM score calculation + +OBJECTS MODIFIED: + [none - initial release] + +OBJECTS REMOVED: + [none - initial release] + +DATA QUALITY: + - customer_master: 50M records, 99.8% completeness + - customer_transactions: 1B records, 99.95% completeness + - Validation: GL reconciliation within 0.1% + +KNOWN LIMITATIONS: + - customer_transactions includes retail orders only (excludes web) + - RFM segmentation uses last 3 years data + - Daily refresh by 6am UTC (max 1-hour delay tolerance) + +BREAKING CHANGES: None + +DEPLOYMENT NOTES: + - QA deployment completed 2024-01-10, testing passed + - Production deployment scheduled 2024-01-20 + - Rollback procedure tested and verified + - Support team trained on new objects and queries + +─────────────────────────────────────────────────────────────── + +VERSION 1.1.0 (Expected Q2 2024) +───────────────────────────────── +Planned Changes: + - Add vw_customer_churn_risk (predictive model) + - Add df_churn_prediction (data flow) + - Add region filter to all customer views + - Expand customer_master to 100M records + +Breaking: No +Notice Required: 30 days +Estimated Deploy Date: Q2 2024 + +─────────────────────────────────────────────────────────────── + +VERSION 2.0.0 (Planned Q4 2024) +───────────────────────────────── +Planned Breaking Changes: + - Remove legacy_customer_id column (use customer_id only) + - Restructure customer_transactions grain + - Rename vw_customer_lifetime_value → vw_customer_value + - Consolidate segment tables into single entity + +Breaking: YES +Notice Required: 60+ days +Migration: Detailed guide + webinar +New Deploy Date: Q4 2024 +``` + +## Best Practices for Transport + +### Transport Strategy + +**Development Process:** +``` +1. Dev Space: Rapid iteration, experiments, drafts + - Make changes frequently + - Test locally + - Document incremental changes + +2. 
Feature Branch: Create feature-specific packages + - One feature = one package + - Clear naming (feature_name_v1_0) + - Complete testing in dev + +3. Integration: Combine approved features + - Merge features into main package + - Run integrated testing + - Final validation + +4. QA Space: Test in target environment + - Import package to QA + - Run full test suite + - Validate with business users + - Document issues + +5. UAT (Optional): Business user acceptance testing + - Provide limited access to QA + - Gather feedback + - Make adjustments + +6. Production: Controlled deployment + - Import to production + - Monitor for issues + - Support production consumers +``` + +### Naming Conventions + +**Package Names:** + +``` +Format: [Feature]_[Version]_[Date] + +Examples: + - customer_analytics_v1_0_0_20240115 + - sales_dashboard_v2_1_0_20240201 + - replication_flow_updates_v1_0_0_20240208 + +Guidelines: + - Use snake_case (lowercase_with_underscores) + - Include version number + - Include date (YYYYMMDD) + - Keep descriptive but concise + - Avoid special characters +``` + +**Object Naming Within Package:** + +``` +Tables: tbl_[descriptive_name] + tbl_customer_master + tbl_sales_transactions + +Views: vw_[descriptive_name] + vw_customer_active + vw_customer_lifetime_value + +Data Flows: df_[descriptive_name] + df_customer_enrichment + df_segment_scoring + +Connections: conn_[system_name] + conn_erp_production + conn_excel_uploads +``` + +### Testing After Import + +**Validation Checklist:** + +``` +POST-IMPORT VALIDATION +═════════════════════════════════════════════════ + +OBJECT INTEGRITY TESTS +[ ] All objects created/updated successfully + Check: View object list in target space + All expected objects present + +[ ] No compilation errors + Check: Open each object definition + No error indicators + Syntax valid + +[ ] Dependencies resolved + Check: All dependent objects accessible + No broken references + +DATA INTEGRITY TESTS +[ ] Tables have data (if applicable) + 
Check: Query each table, verify row counts + Sample data appears correct + No null key fields + +[ ] Views return results + Check: Query each view, verify results + Execution time reasonable + Output matches expectations + +[ ] Data freshness + Check: Data timestamp recent + No stale data indicators + Last refresh successful + +FUNCTIONAL TESTS +[ ] View logic correct + Test: Run known queries + Results match expected outcomes + Calculations validated + +[ ] Data flows execute + Test: Trigger each data flow + Monitor execution logs + Completion status successful + No errors or warnings + +[ ] Filters work correctly + Test: Apply various filters + Results subset correctly + No false positives/negatives + +PERFORMANCE TESTS +[ ] Query execution time acceptable + Benchmark: < 5 seconds for typical queries + < 30 seconds for complex aggregations + +[ ] Memory usage reasonable + Monitor: No spill to disk + Peak memory < 80% available + +[ ] Concurrency tolerated + Test: Run multiple queries simultaneously + No slowdown or deadlocks + +COMPARISON WITH SOURCE +[ ] Objects match source definition + Compare: Definition in source (dev) vs. target (qa) + No unexpected changes + Versions match + +[ ] Data matches source + Compare: Sample data in source vs. target + Row counts align + Totals match +``` + +## Common Issues and Resolution + +### Issue 1: Missing Dependencies + +**Symptom:** +``` +IMPORT ERROR: Cannot import vw_customer_lifetime_value +Missing dependency: vw_transaction_detail not found in target +``` + +**Root Causes:** +1. Dependency not included in package +2. Dependency name changed in target +3. Dependency in different space than expected + +**Resolution:** +``` +Option A: Include missing dependency + 1. Go back to source (dev) tenant + 2. Edit package to include vw_transaction_detail + 3. Re-export package + 4. Re-import to target + +Option B: Create missing object manually + 1. Manually create vw_transaction_detail in target + 2. 
Use definition from source + 3. Retry import (should now succeed) + +Option C: Update view reference + 1. Import vw_customer_lifetime_value with error + 2. Edit in target to use existing view with different name + 3. Save modified version +``` + +**Prevention:** +- Use dependency analysis tool before exporting +- Test import in QA before production +- Document all dependencies in package release notes + +### Issue 2: Version Conflicts + +**Symptom:** +``` +IMPORT ERROR: vw_customer_active exists in target +Current version: 1.0.1 in target +Package version: 1.0.0 (older) +Cannot downgrade object +``` + +**Root Cause:** +Target already has a newer version of the object than what you're trying to import. + +**Resolution:** +``` +Option A: Skip this object + Decision: Don't overwrite target version + Choose: SKIP in conflict resolution + Verify: Target version has all needed changes + +Option B: Get latest package from source + Action: Go to source, update package + Include: Latest version of object (1.0.1 or newer) + Re-export and re-import + +Option C: Upgrade source package first + Action: Make improvements in source + Release: As version 1.1.0 or 2.0.0 + Then: Export and import new version + +Prevention: + - Always export latest version from source + - Maintain version consistency across tenants + - Document version of each environment +``` + +### Issue 3: Connection Not Available + +**Symptom:** +``` +IMPORT WARNING: Data flow df_customer_enrichment references +connection 'Connection_ERP' which does not exist in target +Data flow will not execute until connection configured +``` + +**Root Cause:** +Connection requires different credentials or configuration in target (different ERP instance, different credentials, different environment). + +**Resolution:** +``` +Option A: Import without data flows (temporary) + 1. Skip data flow objects during import + 2. Create connection in target manually + 3. 
Re-import data flows later + +Option B: Create connection in target first + 1. Create new connection in target tenant + 2. Name it identically to source (Connection_ERP) + 3. Configure credentials for target ERP system + 4. Test connection (verify connectivity) + 5. Re-import package (should now succeed) + +Option C: Modify data flow to use existing connection + 1. Import data flows anyway (with warning) + 2. Edit data flows in target + 3. Update connection reference to existing connection + 4. Test data flow execution + 5. Save modified version + +Prevention: + - Create connections in target before import + - Use standard connection names across environments + - Document required connections in package metadata + - Include connection setup in import instructions +``` + +## Using MCP Tools for Transport Management + +### search_repository +Find transportable objects in your space: +``` +search_repository(space="sales_analytics", object_type="view") +``` +Returns: All views in space, enabling object selection for packaging + +### get_object_definition +Retrieve complete object metadata for dependency analysis: +``` +get_object_definition(object_id="vw_customer_lifetime_value") +``` +Returns: Full definition including source references, enabling manual dependency verification + +### list_repository_objects +List all objects in a space to validate package contents: +``` +list_repository_objects(space="sales_analytics", include_metadata=true) +``` +Returns: All objects with metadata, supporting package planning + +### get_deployed_objects +Verify what's currently deployed in target tenant: +``` +get_deployed_objects(tenant="datasphere-qa", space="sales_analytics") +``` +Returns: Current objects in target, identifying conflicts before import + +## Transport Workflow Summary + +1. **Plan Package**: Define scope, dependencies, versioning +2. **Create Package**: Select objects, verify completeness +3. **Validate Export**: Check file integrity and contents +4. 
**Test Package**: Validate in lower environment first +5. **Deploy to QA**: Import to QA, test thoroughly +6. **Deploy to Prod**: After QA sign-off, import to production +7. **Verify Post-Deployment**: Validate all objects functional +8. **Document Changes**: Update tracking, communicate to users + +## Best Practices Summary + +- Always test in QA before production deployment +- Use semantic versioning for clear change tracking +- Include comprehensive dependency documentation +- Create rollback plan before any deployment +- Backup target tenant before import +- Maintain clear naming conventions +- Document known limitations and breaking changes +- Monitor data quality post-deployment +- Communicate changes to all stakeholders +- Keep detailed changelog for compliance and tracking diff --git a/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/references/transport-operations.md b/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/references/transport-operations.md new file mode 100644 index 0000000..2b34ea5 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-transport-manager/references/transport-operations.md @@ -0,0 +1,1784 @@ +# Transport Operations Reference + +## Package Creation Step-by-Step Guide + +Complete walkthrough for creating and preparing a transport package from start to finish. 
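
+The file-integrity checks this walkthrough relies on (comparing the SHA-256 checksum recorded at export time, testing the archive the way `unzip -t` does) are easy to script. A minimal Python sketch — `verify_package` and its arguments are illustrative helpers for this guide, not part of the Datasphere tooling:

```python
import hashlib
import zipfile

def verify_package(path: str, expected_sha256: str) -> bool:
    """Verify a transport package before uploading it to a target tenant."""
    # 1. Compare the SHA-256 digest with the checksum shown in the
    #    Transport Cockpit (a truncated prefix such as "a3f5e" also works).
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if not digest.startswith(expected_sha256.rstrip(".")):
        return False
    # 2. Test every archive member for corruption (equivalent of `unzip -t`).
    with zipfile.ZipFile(path) as zf:
        return zf.testzip() is None  # None means all members extracted cleanly
```

Run it against the downloaded file and the checksum recorded at export time; a `False` result means the package should be re-downloaded or re-exported before any import attempt.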
+ +### Prerequisites + +Before starting package creation, verify: +- Access to source Datasphere tenant (Dev environment) +- Objects in source tenant are finalized and tested +- All objects compile without errors +- Dependencies identified and documented +- Target tenant(s) prepared and accessible +- Backup of target tenant completed + +### Step 1: Plan Package Contents + +**Determine Package Scope:** + +``` +PLANNING WORKSHEET +════════════════════════════════════════════════════════════ + +PACKAGE INFORMATION +Project: ________________ +Feature: ________________ +Owner: ________________ +Target Deployment: Dev [ ] QA [ ] Prod [ ] All [ ] + +OBJECT INVENTORY +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +TABLES (Raw data sources): + [ ] Table Name: __________________ Priority: High/Med/Low + [ ] Table Name: __________________ Priority: High/Med/Low + +VIEWS (Transformed/enriched data): + [ ] View Name: __________________ Priority: High/Med/Low + [ ] View Name: __________________ Priority: High/Med/Low + +DATA FLOWS (Processes): + [ ] Data Flow Name: __________________ Priority: High/Med/Low + [ ] Data Flow Name: __________________ Priority: High/Med/Low + +CONNECTIONS (External system connections): + [ ] Connection Name: __________________ Priority: High/Med/Low + +TOTAL OBJECTS: ____ + +DEPENDENCY ANALYSIS +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Object | Depends On | Depends Type | Status +───────────────────────────────────────────────────────────── + +VERSIONING +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Previous Version: ________ +New Version: ________ +Version Type: [ ] Major [ ] Minor [ ] Patch +Breaking Changes: [ ] Yes [ ] No + +Change Summary: +______________________________________________________________ +______________________________________________________________ +``` + +### Step 2: Access Transport Cockpit + +**Navigation:** +``` +Step 1: Log into Datasphere (source/dev tenant) +Step 2: Click Main 
Menu (hamburger icon) → Administration +Step 3: Select "Transport" from left navigation +Step 4: You're now in Transport Cockpit + +Location: [Tenant Name] → Administration → Transport +``` + +### Step 3: Create New Export Package + +**Dialog Entry:** + +``` +TRANSPORT COCKPIT DIALOG +═══════════════════════════════════════════════════════════ + +BUTTON: "Create Export Package" + +DIALOG OPENS: +┌────────────────────────────────────────────────────────┐ +│ CREATE NEW EXPORT PACKAGE │ +├────────────────────────────────────────────────────────┤ +│ │ +│ Package Information: │ +│ ┌──────────────────────────────────────────────────┐ │ +│ │ Package Name: [________________] │ │ +│ │ Description: [_____________________] │ │ +│ │ Version: [1.0.0] │ │ +│ │ │ │ +│ │ Release Notes: │ │ +│ │ [_________________________________ │ │ +│ │ _________________________________] │ │ +│ │ │ │ +│ │ Source Space: [Sales Analytics ▼] │ │ +│ │ Target Tenant(s): [✓QA ✓Prod] │ │ +│ │ │ │ +│ │ [ ] Include Sample Data │ │ +│ │ [ ] Include Documentation │ │ +│ │ │ │ +│ │ [NEXT] [CANCEL] │ │ +│ └──────────────────────────────────────────────────┘ │ +│ │ +└────────────────────────────────────────────────────────┘ + +FIELD GUIDANCE: +- Package Name: customer_analytics_v1_0_0 +- Description: Customer master, LTV calculation, segmentation +- Version: 1.0.0 (semantic versioning) +- Release Notes: Initial release with core analytics models +- Source Space: Select space containing objects +- Target Tenants: Check all target environments +``` + +### Step 4: Select Objects for Transport + +**Object Selection Dialog:** + +``` +STEP 2: SELECT OBJECTS +═══════════════════════════════════════════════════════════ + +Available Objects in Space: "Sales Analytics" + +FILTER BY TYPE: + [All] [Tables] [Views] [Data Flows] [Connections] [Other] + +OBJECTS: +┌────────────────────────────────────────────────────────┐ +│ [Checkbox] Object Name Type Size Notes │ +├────────────────────────────────────────────────────────┤ 
+│ [✓] customer_master Table 5MB Active│ +│ [✓] customer_transactions Table 100MB Active│ +│ [ ] product_master Table 2MB Draft │ +│ [✓] vw_customer_active View 1MB ✓ │ +│ [✓] vw_customer_lifetime_value View 2MB ✓ │ +│ [✓] vw_customer_segmentation View 3MB ✓ │ +│ [ ] vw_churn_risk View 1MB Beta │ +│ [✓] df_customer_enrichment DataFlo 100KB ✓ │ +│ [✓] df_segment_scoring DataFlo 80KB ✓ │ +│ [✓] Connection_ERP Conn 0KB ✓ │ +│ │ +│ [SELECT ALL] [SELECT NONE] [FILTER] │ +│ Objects Selected: 7 │ +│ Total Package Size: ~12 MB │ +│ [NEXT] [BACK] [CANCEL] │ +│ │ +└────────────────────────────────────────────────────────┘ + +SELECTION TIPS: +- Use filter to show only tables first, select all +- Then filter to views, select dependencies +- Then filter to data flows, select dependent flows +- Avoid draft objects (not ready for deployment) +- Avoid beta/experimental objects (not stable) +``` + +### Step 5: Verify Dependencies + +**Dependency Analysis:** + +``` +STEP 3: VERIFY DEPENDENCIES +═══════════════════════════════════════════════════════════ + +System is analyzing dependencies... 
+[████████████████░░░░░░░░░░░░] 60% + +DEPENDENCY REPORT: +═════════════════════════════════════════════════════════════ + +INCLUDED OBJECTS: 7 +├── Tables (2) +│ ├── customer_master (sources: ERP) +│ └── customer_transactions (sources: ERP) +│ +├── Views (3) +│ ├── vw_customer_active (sources: customer_master) +│ ├── vw_customer_lifetime_value (sources: customer_transactions, customer_master) +│ └── vw_customer_segmentation (sources: vw_customer_lifetime_value, customer_master) +│ +├── Data Flows (2) +│ ├── df_customer_enrichment (reads: vw_customer_active, writes: customer_master) +│ └── df_segment_scoring (reads: vw_customer_segmentation, writes: customer_master) +│ +└── Connections (1) + └── Connection_ERP (used by: df_customer_enrichment) + +DEPENDENCY ANALYSIS RESULT: +✓ All dependencies included +✓ No circular dependencies detected +✓ All objects deployable +✓ Version compatibility verified +✓ No external dependencies outside package + +PACKAGE COMPOSITION: + Total Objects: 7 + Total Size: ~12 MB (compressed: ~3.2 MB) + Deploy Time Estimate: 10-15 minutes + Required Actions After Import: Configure Connection_ERP + +[CONFIRM] [MODIFY SELECTION] [BACK] +``` + +### Step 6: Configure Export Options + +**Export Configuration:** + +``` +STEP 4: CONFIGURE EXPORT +═════════════════════════════════════════════════════════════ + +EXPORT FORMAT: + [✓] CSN (Core Schema Notation) - recommended + [ ] JSON (JSON format) + +COMPRESSION: + [✓] Enable GZIP compression + Estimated file size: 3.2 MB (from 12 MB uncompressed) + +ENCRYPTION: + [✓] Enable encryption for transport + Encryption method: AES-256 + +METADATA: + [✓] Include object documentation + [✓] Include lineage information + [✓] Include access controls + [✓] Include version history + +ADDITIONAL OPTIONS: + [ ] Include sample data (if any) + [ ] Include test data + [✓] Generate deployment guide + [✓] Generate dependency report + +TRANSPORT SETTINGS: + Output Format: Binary (.zip file) + Output Location: [Datasphere 
Downloads]
+  Filename: transport_customer_analytics_v1_0_0.zip
+  Checksum: [Will be generated]
+
+[GENERATE EXPORT]  [BACK]  [CANCEL]
+```
+
+### Step 7: Generate and Download
+
+**Export Process:**
+
+```
+GENERATING EXPORT...
+═════════════════════════════════════════════════════════════
+
+Step 1: Compiling object definitions...
+[████████░░░░░░░░░░░░░░░░░░] 20%
+
+Step 2: Resolving dependencies...
+[████████████░░░░░░░░░░░░░░] 40%
+
+Step 3: Creating metadata files...
+[████████████████░░░░░░░░░░] 60%
+
+Step 4: Compressing package...
+[████████████████████░░░░░░] 80%
+
+Step 5: Generating checksum...
+[████████████████████████░░] 95%
+
+Step 6: Finalizing...
+[████████████████████████████] 100%
+
+SUCCESS!
+═════════════════════════════════════════════════════════════
+
+Package Generated:
+  Filename: transport_customer_analytics_v1_0_0.zip
+  Size: 3.2 MB
+  Checksum (SHA-256, truncated to 32 hex digits): a3f5e8c2d91b4f6a7e9c2b8d4a1f6e3c
+  Generated: 2024-01-15 14:30 UTC
+  Expires: 2024-02-15 14:30 UTC (30 days)
+
+DOWNLOAD OPTIONS:
+  [DOWNLOAD TO COMPUTER]
+  [COPY DOWNLOAD LINK]
+  [SEND TO EMAIL]
+  [SCHEDULE DOWNLOAD]
+
+NEXT STEPS:
+  1. Download package to secure location
+  2. Verify file integrity (checksum)
+  3. Upload to target tenant(s)
+  4. 
Follow import procedure
+
+[DOWNLOAD]  [BACK TO COCKPIT]
+```
+
+## Dependency Resolution Algorithm and Manual Checks
+
+### Automated Dependency Resolution Process
+
+**Algorithm Pseudocode:**
+
+```
+FUNCTION resolve_dependencies(selected_objects):
+    completed       = empty set
+    missing         = empty set
+    resolution_list = empty set   // dependencies to add to the package
+    unresolved      = copy(selected_objects)
+
+    WHILE unresolved is not empty:
+        object = take_next(unresolved)
+
+        // STEP 1: Extract references from the object definition
+        references = scan_definition(object)
+        // Returns list of table/view names referenced in SQL
+
+        // STEP 2: Classify each reference
+        FOR EACH ref in references:
+            IF ref in completed OR ref in unresolved:
+                continue                      // already processed or queued
+            ELSE IF ref exists in space:
+                add ref to resolution_list    // found in space, not selected
+                add ref to unresolved         // scan its dependencies too
+            ELSE:
+                add ref to missing            // not found in space
+
+        // STEP 3: Mark object as processed
+        add object to completed
+
+    RETURN (completed, missing, resolution_list)
+```
+
+**Example Resolution:**
+
+```
+INPUT SELECTION:
+  - vw_customer_segmentation
+
+RESOLUTION TRACE:
+═════════════════════════════════════════════════════════════
+
+LEVEL 1: vw_customer_segmentation
+  References: vw_customer_lifetime_value, customer_master
+  Status: vw_customer_lifetime_value NOT SELECTED
+          customer_master NOT SELECTED
+  Action: Add both objects to resolution list
+
+LEVEL 2: vw_customer_lifetime_value (resolved)
+  References: vw_transaction_detail, vw_customer_active, customer_master
+  Status: vw_transaction_detail NOT SELECTED
+          vw_customer_active NOT SELECTED
+          customer_master ALREADY PROCESSED
+  Action: Add both views to resolution list
+
+LEVEL 3: vw_transaction_detail (resolved)
+  References: customer_transactions, customer_master
+  Status: customer_transactions NOT SELECTED
+          customer_master ALREADY PROCESSED
+  Action: Add 
customer_transactions to resolution list + +LEVEL 3: vw_customer_active (resolved) + References: customer_master + Status: customer_master ALREADY PROCESSED + Action: No new dependencies + +LEVEL 4: customer_transactions (resolved) + References: [internal ERP system] - external source + Status: External dependency (resolved) + Action: Note that ERP connection required + +FINAL DEPENDENCIES: + Selected: 1 object (vw_customer_segmentation) + Resolved: 5 objects (vw_customer_lifetime_value, + vw_transaction_detail, + vw_customer_active, + customer_master, + customer_transactions) + Missing: 0 objects + External: Connection_ERP (manual creation required) + +RECOMMENDATION: + ✓ Include all 5 resolved dependencies + ✓ Verify Connection_ERP exists in target +``` + +### Manual Dependency Verification Checklist + +**For Complex or Custom Objects:** + +``` +MANUAL DEPENDENCY VERIFICATION +═════════════════════════════════════════════════════════════ + +OBJECT: vw_customer_lifetime_value +CRITICALITY: High (used in 3+ views) + +STEP 1: SOURCE ANALYSIS +──────────────────────────────────────────────────────────── + +[ ] Access view definition in Datasphere UI + Location: [Space] → Objects → [View Name] → Definition + +[ ] Extract all table/view references + Command in view code: Search for FROM and JOIN keywords + Results: + FROM customer_transactions ct + LEFT JOIN customer_master cm ON ct.customer_id = cm.id + Identified references: customer_transactions, customer_master + +[ ] Extract all column references + Review: All columns used in SELECT, WHERE, GROUP BY + Check: All referenced columns exist in source tables + + Example: + SELECT cm.customer_id, cm.region, COUNT(*) as order_count + Columns: customer_id, region from customer_master + order_count (calculated) + +[ ] Check for function/expression dependencies + Review: Any user-defined functions, stored procedures + Example: WHERE CUSTOM_CALC(amount) > threshold + Action: Note function dependency for manual verification + 
+[ ] Check for parameter references + Review: Any $$ parameters used (e.g., $$REGION_PARAM) + Example: WHERE region = $$REGION_PARAM + Action: Verify parameter exists in target tenant + +STEP 2: DATA SOURCE DEPENDENCIES +──────────────────────────────────────────────────────────── + +[ ] For each referenced table, verify it's included + [ ] customer_transactions - Included? YES / NO + [ ] customer_master - Included? YES / NO + +[ ] For each referenced view, verify it's included + [ ] None in this example + +[ ] For external data sources (replication, imports) + Check: Is source system accessible? + Is connection configured? + Is replication schedule maintained? + +STEP 3: RELATIONSHIP VERIFICATION +──────────────────────────────────────────────────────────── + +[ ] Check all foreign key relationships + FK: customer_transactions.customer_id → customer_master.id + Status: Required for joins to work + Action: Verify customer_master included + +[ ] Check all referenced attributes + Attributes: region (used in GROUP BY) + Status: Must exist in customer_master + Action: Verify column exists after import + +STEP 4: COMPLETENESS CHECK +──────────────────────────────────────────────────────────── + +[ ] All tables needed to build view are included +[ ] All views this view depends on are included +[ ] All functions/calculations can be performed +[ ] All parameters are available in target +[ ] All external connections available in target + +RESULT: + ✓ All dependencies verified + ✓ Safe to include in package + ✓ No missing prerequisites +``` + +## Conflict Resolution Strategies During Import + +### Conflict Type 1: Object Already Exists (Update) + +**Scenario:** +``` +CONFLICT: vw_customer_active exists in target + +Target Current Version: + SELECT * FROM customer_master + WHERE status = 'ACTIVE' + AND account_type = 'STANDARD' + +Package New Version: + SELECT * FROM customer_master + WHERE status = 'ACTIVE' + AND account_age_days > 30 -- Change: added filter + +Questions to 
Resolve: + 1. Is the new version better/required? + 2. Will existing queries still work? + 3. Is this a breaking change? + 4. Do consumers need to be notified? +``` + +**Resolution Decision Tree:** + +``` +DECISION TREE: OVERWRITE OR SKIP? +═════════════════════════════════════════════════════════════ + +START: Object exists in target + │ + ├─→ Is new version backward compatible? + │ │ + │ ├─YES─→ Is new version more recent? + │ │ │ + │ │ ├─YES─→ Does new version add value? + │ │ │ │ + │ │ │ ├─YES─→ ACTION: OVERWRITE + │ │ │ │ Reason: Improved, compatible + │ │ │ │ + │ │ │ └─NO──→ ACTION: SKIP + │ │ │ Reason: No new value + │ │ │ + │ │ └─NO──→ Are target version and package version equivalent? + │ │ │ + │ │ ├─YES─→ ACTION: SKIP + │ │ │ Reason: Same version already in target + │ │ │ + │ │ └─NO──→ Is target version actively used? + │ │ │ + │ │ ├─YES─→ ACTION: SKIP + │ │ │ Reason: Keep production version + │ │ │ + │ │ └─NO──→ ACTION: RENAME + │ │ Reason: Keep both versions + │ │ + │ └─NO──→ Is breaking change unavoidable? 
+ │ │ + │ ├─YES─→ Notify all consumers of required update + │ │ Provide migration guide + │ │ Set update deadline (30+ days) + │ │ + │ └─NO──→ Refactor to maintain compatibility + │ Consider alternative approach + │ + └─END +``` + +**Implementation:** + +``` +RESOLUTION DIALOG +═════════════════════════════════════════════════════════════ + +Object: vw_customer_active + +CURRENT TARGET VERSION: + SELECT * FROM customer_master WHERE status = 'ACTIVE' AND account_type = 'STANDARD' + Last Modified: 2024-01-10 by sarah.johnson + +NEW PACKAGE VERSION: + SELECT * FROM customer_master WHERE status = 'ACTIVE' AND account_age_days > 30 + Package Version: 1.0.0 + Prepared by: john.smith (QA Approver) + +CONFLICT OPTIONS: + [ ] SKIP - Keep current target version (1/10 modified) + Rationale: Target version has recent modifications + New version may override important changes + + [✓] OVERWRITE - Replace with package version + Rationale: New version improves filter logic + Both versions functionally aligned + QA has validated new version + Impact: Active queries may return different results + Column set unchanged (backward compatible) + + [ ] RENAME - Import as vw_customer_active_v2 + Rationale: Keep both versions for comparison + Impact: Requires updating dependent views + Adds maintenance burden + + [ ] REVIEW - Show detailed diff before deciding + Rationale: Examine changes in detail + Action: [SHOW DIFF] + +RECOMMENDATION: OVERWRITE +DECISION: OVERWRITE +``` + +### Conflict Type 2: Dependency Missing + +**Scenario:** +``` +CONFLICT: vw_customer_lifetime_value requires vw_transaction_detail + vw_transaction_detail NOT FOUND in target + +Package contains: vw_customer_lifetime_value +Does NOT contain: vw_transaction_detail + +Action required: How to provide vw_transaction_detail? 
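+
+Detection (illustrative pseudocode; scan_definition as used in the
+dependency-resolution algorithm above, not an actual importer API):
+
+  FOR EACH object in package:
+      FOR EACH ref in scan_definition(object):
+          IF ref NOT in package AND ref NOT found in target space:
+              raise MISSING DEPENDENCY conflict (object, ref)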
+``` + +**Resolution Options:** + +``` +MISSING DEPENDENCY RESOLUTION +═════════════════════════════════════════════════════════════ + +CONFLICT: Missing Dependency +Object: vw_customer_lifetime_value +Missing: vw_transaction_detail +Error: Cannot import view without required source view + +OPTIONS: + +[✓] ADD TO IMPORT - Request dependency be added + Action: Package processor searches source for vw_transaction_detail + Adds it to import package automatically + Benefit: Simple, ensures compatibility + Risk: May import unwanted dependencies + Time: +5 minutes to re-analyze dependencies + +[ ] SKIP - Don't import this object + Action: Skip vw_customer_lifetime_value + Import other objects in package + Benefit: No blocking issues + Risk: Incomplete package, dependent objects unusable + Time: Immediate + +[ ] FAIL - Block entire import + Action: Stop import process + Require manual resolution + Benefit: Forces proper dependency planning + Risk: No progress until issue resolved + Time: Unknown (manual intervention required) + +[ ] CREATE MANUALLY - Create missing view in target first + Action: Manually create vw_transaction_detail in target + Then retry import + Benefit: Control implementation details + Risk: Manual effort, risk of mismatch vs. 
source + Time: 30+ minutes (manual development) + +RECOMMENDATION: ADD TO IMPORT +DECISION: ADD TO IMPORT +Revised package: 8 objects (added vw_transaction_detail) +``` + +### Conflict Type 3: Connection Not Available + +**Scenario:** +``` +CONFLICT: Data flow df_customer_enrichment requires Connection_ERP + Connection_ERP NOT FOUND in target + +Error: Cannot import data flow without required connection + +Root Cause: ERP connection uses different credentials in target environment + (test ERP instance vs production) +``` + +**Resolution Process:** + +``` +CONNECTION AVAILABILITY RESOLUTION +═════════════════════════════════════════════════════════════ + +PROBLEM: + - Data Flow references: Connection_ERP + - Target Status: Connection not found + - Impact: Data flow won't execute in target + +ROOT CAUSE OPTIONS: + A) Connection doesn't exist yet (new environment) + B) Connection exists but has different name + C) Connection exists but insufficient permissions + D) Connection configured for different target system + +DIAGNOSTIC STEPS: + [ ] Step 1: Check connection list in target + Action: View all connections in target tenant + Search for "ERP" connections + Result: Found "Connection_ERP_TEST" and "Connection_ERP_PROD" + Not found: "Connection_ERP" + + [ ] Step 2: Check connection configurations + Action: Review each connection's target system + Result: Connection_ERP_TEST → Test ERP (10.0.0.5) + Connection_ERP_PROD → Production ERP (10.0.0.1) + Data flow needs: Production ERP + + [ ] Step 3: Verify credentials + Action: Check if credentials have access + Result: Credentials valid for PROD ERP + +RESOLUTION DECISION: + +[ ] USE EXISTING CONNECTION + Decision: Map Connection_ERP → Connection_ERP_PROD + Action: Update data flow in target to use existing connection + Benefit: No new connection creation needed + Risk: Must verify right target system + Steps: + 1. Import data flow (with warning) + 2. Edit data flow in target + 3. 
Change connection from "Connection_ERP" to "Connection_ERP_PROD" + 4. Test data flow + 5. Save modified version + +[✓] CREATE NEW CONNECTION + Decision: Create Connection_ERP in target + Action: Create connection with same name/config as source + Benefit: Consistent naming across tenants + Risk: Must configure credentials for target environment + Time: 15-30 minutes + Steps: + 1. Create new connection in target + 2. Name: Connection_ERP + 3. System: Production ERP + 4. Configure credentials (different from source) + 5. Test connection (verify connectivity) + 6. Retry import + 7. Verify data flows execute + +[ ] SKIP DATA FLOWS + Decision: Import other objects, skip data flows + Action: During import, skip all data flow objects + Benefit: Unblocks import process + Risk: Data flows unavailable until connection created + Time: Create connection later manually + Steps: + 1. Import package without data flows + 2. Create Connection_ERP later + 3. Manually re-import data flows + 4. Or: Recreate data flows manually in target + +RECOMMENDATION: CREATE NEW CONNECTION FIRST +WORKFLOW: + 1. PAUSE import + 2. Create Connection_ERP in target (15 min) + 3. RESUME import + 4. 
Verify data flows execute (5 min) +``` + +## Transport Landscape Setup (Dev → QA → Prod) + +### Architecture Overview + +**Three-Tier Landscape:** + +``` +DATASPHERE TRANSPORT LANDSCAPE +═════════════════════════════════════════════════════════════ + +┌────────────────────────────────────────────────────────────┐ +│ │ +│ DEVELOPMENT TENANT (datasphere-dev.company.com) │ +│ ────────────────────────────────────────────────────── │ +│ • Rapid development and iteration │ +│ • Free-form experimentation │ +│ • Draft objects and features │ +│ • Full access for data engineers │ +│ • Limited data volume (1M records sample) │ +│ • 99% uptime SLA │ +│ │ +│ Owner: Data Engineering Team │ +│ Approver: Team Lead │ +│ │ +│ UPDATE FREQUENCY: Multiple times daily │ +│ CHANGE PROCESS: None (free-form) │ +│ BACKUP: Daily (30-day retention) │ +│ │ +└────────────────────┬─────────────────────────────────────┘ + │ + │ EXPORT PACKAGE + │ (Feature-complete, QA-ready) + │ + v +┌────────────────────────────────────────────────────────────┐ +│ │ +│ QUALITY ASSURANCE TENANT (datasphere-qa.company.com) │ +│ ─────────────────────────────────────────────────────── │ +│ • Controlled testing environment │ +│ • Production-like configuration │ +│ • UAT environment (business user validation) │ +│ • Limited access (QA + Business Users) │ +│ • Full data volume (all production data) │ +│ • 99.5% uptime SLA │ +│ │ +│ Owner: QA Team │ +│ Approver: QA Lead + Business Lead │ +│ │ +│ UPDATE FREQUENCY: Weekly (controlled) │ +│ CHANGE PROCESS: Controlled (test plan required) │ +│ BACKUP: Daily (60-day retention) │ +│ │ +└────────────────────┬─────────────────────────────────────┘ + │ + │ EXPORT PACKAGE + │ (QA-approved, production-ready) + │ + v +┌────────────────────────────────────────────────────────────┐ +│ │ +│ PRODUCTION TENANT (datasphere-prod.company.com) │ +│ ──────────────────────────────────────────────────────── │ +│ • Live operations │ +│ • Real customer data │ +│ • Change-controlled │ +│ • 
Limited access (select power users) │ +│ • Full data volume (production data) │ +│ • 99.9% uptime SLA │ +│ • High availability configuration │ +│ │ +│ Owner: Operations Team │ +│ Approver: CTO + Change Advisory Board (CAB) │ +│ │ +│ UPDATE FREQUENCY: Monthly (controlled releases) │ +│ CHANGE PROCESS: Formal (CAB approval, rollback plan) │ +│ BACKUP: Hourly (90-day retention) │ +│ DISASTER RECOVERY: Tested quarterly │ +│ │ +└────────────────────────────────────────────────────────────┘ +``` + +### Tenants Configuration + +**Development Tenant Setup:** + +``` +DEVELOPMENT ENVIRONMENT CONFIGURATION +══════════════════════════════════════════════════════════════ + +DATA VOLUME: + Sample Data Model: 1-5% of production volume + Purpose: Fast iteration, quick feedback loops + Refresh: On-demand by developers + Retention: 3 months + Cost: Minimal (small storage) + +ACCESS CONTROL: + Users: All data engineers and developers + Roles: Admin (full access) + External Access: None + Data Classification: Internal only + +DATA SOURCES: + Replication Flows: From test/staging systems + Frequency: Manual on-demand + Freshness: Not required to be current + Example: Test ERP (10.0.0.5) + +DEVELOPMENT BEST PRACTICES: + ✓ Create feature branches (separate spaces for features) + ✓ Name objects clearly (tbl_, vw_, df_ prefixes) + ✓ Document changes in object metadata + ✓ Commit to version control (CSN files) + ✓ Self-testing (verify queries work before packaging) + ✓ Tag objects with version numbers + ✗ Don't use production data + ✗ Don't maintain strict change control + ✗ Don't create permanent test objects +``` + +**QA Tenant Setup:** + +``` +QA ENVIRONMENT CONFIGURATION +══════════════════════════════════════════════════════════════ + +DATA VOLUME: + Data Model: 100% of production (full volume) + Purpose: Realistic testing, performance validation + Refresh: Weekly (Friday evening) + Retention: 6 months (for regression testing) + Cost: Same as production (full storage) + +ACCESS CONTROL: 
+ Users: QA Team, Business Analysts, UAT participants + Roles: Viewer, Developer, Admin (by role) + External Access: Limited to select partners (if needed) + Data Classification: Internal confidential + +DATA SOURCES: + Replication Flows: From production systems (mirrored) + Frequency: Weekly (Friday at 8pm UTC) + Freshness: 1-week lag acceptable + Example: Production ERP (snapshot) + +QA PROCEDURES: + ✓ Formal test plans required for each import + ✓ Sign-off from QA lead before production deployment + ✓ Performance benchmarking (compare to production) + ✓ User acceptance testing (business validation) + ✓ Regression testing (verify no breakage) + ✓ Data quality validation + ✓ Security/access control validation + ✗ Ad-hoc changes (all changes via transport) + ✗ Direct production data (use copy/refresh) + ✗ Production-level access (read-only primary users) +``` + +**Production Tenant Setup:** + +``` +PRODUCTION ENVIRONMENT CONFIGURATION +══════════════════════════════════════════════════════════════ + +DATA VOLUME: + Data Model: 100% of production (full volume) + Purpose: Operational analytics, business reporting + Refresh: Real-time to daily (per object SLA) + Retention: 12+ months (compliance/archival) + Cost: Highest (storage, compute, HA) + +ACCESS CONTROL: + Users: Limited (business intelligence teams, analysts) + Roles: Viewer (primary), Developer (select), Admin (rare) + External Access: Via SAP Datasphere cloud sharing only + Data Classification: Business confidential + +DATA SOURCES: + Replication Flows: From production systems (live) + Frequency: Real-time to hourly (per replication SLA) + Freshness: Critical (must match reporting requirements) + Example: Production ERP (live feeds) + +PRODUCTION REQUIREMENTS: + ✓ All objects thoroughly tested in QA + ✓ Change Advisory Board (CAB) approval required + ✓ Rollback plan documented and tested + ✓ Performance validated with production data volume + ✓ Security review completed (data classification, access) + ✓ 
Runbook documentation (how to operate) + ✓ Monitoring and alerting configured + ✓ Support team trained on changes + ✗ No ad-hoc changes (frozen, transport-only) + ✗ No experimental objects (only stable, approved) + ✗ No direct source system changes (through controlled ETL) + +DEPLOYMENT APPROVAL PROCESS: + 1. Development Complete: Objects finalized in dev + 2. Submit for QA: Package prepared, sent to QA + 3. QA Testing: Test plan executed, issues logged + 4. QA Sign-Off: "Ready for production" approval + 5. CAB Review: Change review board approves deployment + 6. Schedule: Deployment window determined (maintenance) + 7. Pre-Deployment: Backup of production verified + 8. Deploy: Package imported to production + 9. Smoke Test: Quick functionality verification + 10. Monitor: Watch for 24-48 hours after deployment + 11. Success: Deployment considered complete +``` + +### Environment Promotion Workflow + +**Standard Promotion Path:** + +``` +OBJECT LIFECYCLE: DEV → QA → PROD +════════════════════════════════════════════════════════════════ + +PHASE 1: DEVELOPMENT (DEV TENANT) +───────────────────────────────────────────────────────────── +Timeline: 1-4 weeks +Activities: + □ Create view/table in dev space + □ Implement logic and calculations + □ Test with sample data + □ Document purpose and logic + □ Gather initial feedback + □ Make iterations based on feedback + □ Tag for promotion to QA + +Criteria for Promotion: + ✓ Objects compile without errors + ✓ Sample data tests pass + ✓ Documentation complete + ✓ No open issues/bugs + ✓ Developer sign-off obtained + ✓ Objects in ACTIVE status + +Output: Transport package (dev_to_qa_v1.zip) + + +PHASE 2: QUALITY ASSURANCE (QA TENANT) +────────────────────────────────────────────────────────────── +Timeline: 1-2 weeks +Activities: + □ Import package to QA + □ Verify objects created successfully + □ Connect to full production data volume + □ Execute comprehensive test plan + □ Performance benchmark testing + □ User acceptance 
testing (business stakeholders) + □ Issue discovery and resolution + □ Sign-off documentation + +Criteria for Production Approval: + ✓ All tests passed + ✓ Performance meets requirements + ✓ Users accept functionality + ✓ Data quality validated + ✓ No critical issues remaining + ✓ QA lead sign-off obtained + ✓ CAB approves for production + +Output: Transport package (qa_to_prod_v1.zip) + + +PHASE 3: PRODUCTION (PROD TENANT) +────────────────────────────────────────────────────────────── +Timeline: 1 day (deployment) +Activities: + □ CAB approves change window + □ Production backup completed and verified + □ Import package to production + □ Verify all objects deployed successfully + □ Run smoke tests (sample queries) + □ Notify stakeholders (deployment complete) + □ Monitor for issues (24-48 hours) + □ Close change ticket + +Post-Deployment Monitoring: + □ Query execution times normal + □ Data refresh successful + □ No error messages in logs + □ Users can access objects + □ Performance within expectations + □ Data quality metrics normal + +Success Criteria: + ✓ All objects functional in production + ✓ No errors or warnings in logs + ✓ Data refresh SLA met + ✓ Users report successful access + ✓ Performance metrics acceptable + ✓ Change ticket completed + +Output: Production objects live, supporting business analytics + + +ROLLBACK PROCEDURE (IF NEEDED) +────────────────────────────────────────────────────────────── +Trigger: Critical issue discovered, user complaints, data errors + +Steps: + 1. Assess severity: Can issue be worked around or fixed? + Critical issues: Data corruption, wrong results, unavailable + Minor issues: UI issue, performance < SLA but usable + + 2. If critical: + □ STOP using production objects + □ Switch users back to previous version (if available) + □ Notify all stakeholders + □ Restore from pre-deployment backup + □ Root cause analysis in QA + □ Fix issues + □ Retest thoroughly + □ Redeploy with fixes + + 3. 
Restore production objects: + Command: Restore backup from 24 hours before deployment + Verify: All objects restored, data intact + Confirm: Users can access previous version + + 4. Post-Rollback: + □ Root cause analysis meeting + □ Document lessons learned + □ Update test procedures (prevent recurrence) + □ Retrain team on issue +``` + +## Pre-Transport Validation Checklist + +**Complete Before Creating Export Package:** + +``` +PRE-TRANSPORT VALIDATION CHECKLIST +════════════════════════════════════════════════════════════════ + +OBJECT READINESS +════════════════════════════════════════════════════════════════ + +COMPILATION & SYNTAX +[ ] All objects compile without errors + Verify: No red error indicators in UI + Open each object, check status + +[ ] No deprecated syntax + Check: Review SQL for deprecated functions + Run syntax validator if available + +[ ] All objects in ACTIVE status + Verify: Status = ACTIVE (not Draft, Error, Locked) + +[ ] No objects checked out by other users + Check: Confirm no one editing objects + All changes committed + + +DATA COMPLETENESS +════════════════════════════════════════════════════════════════ + +[ ] All required tables included + List: Document each required table + Verify: All listed in package selection + +[ ] All dependent views included + Trace: Use dependency analyzer + Verify: No missing view dependencies + +[ ] All required data flows included + List: Flows that read/write to objects + Verify: All flows in package + +[ ] Connection definitions included (if needed) + Identify: Which flows need connections + Verify: Connections exist and configured + + +DOCUMENTATION & METADATA +════════════════════════════════════════════════════════════════ + +[ ] Object descriptions complete + Check: Each object has meaningful description + Explains business purpose + +[ ] Business context documented + Include: What data represents, why important + Typical use cases and analytics + +[ ] Data quality metrics documented + Record: 
Completeness, accuracy, freshness + Last validation date + +[ ] Known limitations documented + Example: "Excludes web orders (retail only)" + "Data lag: 1 day behind source" + "Grain: One row per transaction" + +[ ] Change log prepared + Document: What changed from previous version + Why it changed + Who approved the change + +[ ] Release notes prepared + Summary: 2-3 sentences describing package + Key improvements or fixes + Target audience + + +DATA QUALITY VALIDATION +════════════════════════════════════════════════════════════════ + +[ ] Source data validated + Check: Completeness (nulls, duplicates) + Accuracy (spot-check sample records) + Consistency (relationships intact) + Timeliness (last refresh recent) + +[ ] Calculated fields verified + Test: Spot-check calculations + Verify formulas correct + Compare to known totals + +[ ] Aggregations validated + Test: Row counts expected + Totals reconcile to source + No unexpected null values + +[ ] Filters working correctly + Test: WHERE clauses eliminate expected rows + Sample different filters + + +DEPENDENCY VERIFICATION +════════════════════════════════════════════════════════════════ + +[ ] All dependencies identified + Tool: Use dependency analyzer + Output: Dependency graph reviewed + +[ ] No circular dependencies + Check: No A → B → C → A cycles + Dependency analyzer confirms + +[ ] External dependencies documented + Identify: Systems outside Datasphere + Replication flows required + Connections needed + +[ ] Deployment order determined + Plan: In what order will objects deploy? 
+ Dependencies satisfied at each step + + +TESTING VERIFICATION +════════════════════════════════════════════════════════════════ + +[ ] Views query successfully + Test: Run each view with no WHERE clause + Verify results reasonable + Check execution time acceptable + +[ ] Sample queries work + Test: 3-5 representative queries + Results match expected values + Performance acceptable + +[ ] Data flows execute successfully + Test: Trigger manual execution + Monitor completion status + Verify no errors in logs + +[ ] No data quality issues + Test: Check for unexpected nulls + Verify no data corruption + Reconciliation passes + +[ ] Performance acceptable + Benchmark: Query execution time < target + Memory usage reasonable + No unexpected slowness + + +ENVIRONMENT READINESS +════════════════════════════════════════════════════════════════ + +[ ] Target tenant accessible + Test: Can connect to target Datasphere + Have sufficient permissions + Credentials valid + +[ ] Target space prepared + Check: Target space exists + Space has sufficient quota + Space is ready for import (no conflicts) + +[ ] Dependencies available in target + Verify: Required source systems accessible + Required connections exist (or will be created) + Credentials correct for target + +[ ] Backup strategy in place + Confirm: Target has recent backup + Backup verified restorable + Backup retention meets compliance + +[ ] Rollback plan documented + Plan: How to rollback if needed + Restore procedure tested + Team trained on rollback + + +VERSION & GOVERNANCE +════════════════════════════════════════════════════════════════ + +[ ] Version number determined + Decide: Major, Minor, or Patch release + Follows semantic versioning + +[ ] Previous version documented + Record: What was version before this? + How does this improve on previous? 
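+
+      Example (semantic versioning, illustrative values):
+        1.0.0 → 1.1.0  new view added               (MINOR)
+        1.1.0 → 1.1.1  filter logic bug fixed       (PATCH)
+        1.1.1 → 2.0.0  column removed, breaking     (MAJOR)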
+ +[ ] Approval obtained + From: Development Lead + QA Lead (if applicable) + Product Owner (if applicable) + +[ ] Change request created (if required) + System: Ticketing system (Jira, SAP, etc.) + Status: Approved + Attachment: Package metadata + + +SECURITY & COMPLIANCE +════════════════════════════════════════════════════════════════ + +[ ] No sensitive data exposed + Review: Check for unmasked PII, passwords, secrets + Verify: Customer names masked + Phone numbers redacted + Pricing not exposed to partners + +[ ] Access controls configured + Verify: Column-level security set + Row-level security set + Consumer-specific views created + +[ ] Data classification correct + Check: Objects marked with appropriate classification + Compliance with data governance policies + +[ ] Licensing review completed + Verify: All included objects properly licensed + No unlicensed or restricted content + + +SIGN-OFF +════════════════════════════════════════════════════════════════ + +Prepared By: __________________ Title: __________ Date: ______ + +Reviewed By: __________________ Title: __________ Date: ______ + +Approved By: __________________ Title: __________ Date: ______ + +Final Check: [ ] All items verified and checked + [ ] Ready for export + [ ] Ready for import to target + +SIGN-OFF STATEMENT: +I certify that all items on this checklist have been completed +and verified. All objects are production-ready and safe to +deploy to the target environment. 
+``` + +## Post-Import Verification Steps + +**Validation After Successful Import:** + +``` +POST-IMPORT VERIFICATION PROCEDURE +════════════════════════════════════════════════════════════════ + +PHASE 1: BASIC VERIFICATION (Immediate, 30 min) +────────────────────────────────────────────────────────────── + +STEP 1: Object Inventory + [ ] Navigate to imported space + Location: Target Tenant → Spaces → [Space Name] + + [ ] Count objects + Expected: 7 objects (per package) + Actual: 7 objects ✓ + + [ ] Verify object names + Table: customer_master ✓ + Table: customer_transactions ✓ + View: vw_customer_active ✓ + View: vw_customer_lifetime_value ✓ + View: vw_customer_segmentation ✓ + DataFlow: df_customer_enrichment ✓ + DataFlow: df_segment_scoring ✓ + +STEP 2: Object Status Check + [ ] No error indicators + Visual check: No red X, error messages + Status bar: Shows ACTIVE for all objects + + [ ] Check compilation status + Open each object: Definition shows no errors + Test view in editor: No syntax issues + +STEP 3: Quick Query Test + [ ] Execute sample query on each view + vw_customer_active: SELECT COUNT(*) FROM ... 
+ Expected: 50,000,000 rows + Actual: 50,000,000 rows ✓ + + vw_customer_lifetime_value: SELECT * LIMIT 10 + Expected: 10 sample rows + Actual: 10 rows with values ✓ + +RESULT: ✓ Objects successfully imported, basic structure OK + + +PHASE 2: DATA VALIDATION (1-4 hours) +────────────────────────────────────────────────────────────── + +STEP 1: Data Availability + [ ] Tables have data + customer_master: COUNT(*) > 0 + Result: 50,000,000 rows ✓ + + customer_transactions: COUNT(*) > 0 + Result: 1,000,000,000 rows ✓ + +STEP 2: Data Quality Checks + [ ] No unexpected nulls + SELECT COUNT(*) FROM customer_master + WHERE customer_id IS NULL + Result: 0 (expected) ✓ + + [ ] Row counts reasonable + customer_master: 50M (matches source) ✓ + customer_transactions: 1B (matches source) ✓ + +STEP 3: Derived Data Validation + [ ] View calculations correct + SELECT COUNT(*), SUM(lifetime_value) + FROM vw_customer_lifetime_value + Compare to expected aggregates + Expected: 50M customers, $50B total value + Actual: 50M customers, $50.1B total value + Variance: 0.2% (acceptable) ✓ + + [ ] Filters work correctly + SELECT COUNT(*) FROM vw_customer_active + WHERE status = 'ACTIVE' + Compare to source + Expected: 40M active + Actual: 40M active ✓ + +STEP 4: Data Freshness + [ ] Last modified date recent + Objects imported: 2024-01-15 14:45 UTC + Current time: 2024-01-15 15:00 UTC + Age: 15 minutes (acceptable) ✓ + +RESULT: ✓ Data integrity verified, quality acceptable + + +PHASE 3: FUNCTIONAL TESTING (2-8 hours) +────────────────────────────────────────────────────────────── + +STEP 1: Query Functionality + [ ] Test representative queries + Query 1: Sales by region (GROUP BY) + Query 2: Customer segmentation (JOIN + AGGREGATE) + Query 3: Time-series analysis (time window) + Status: All three queries execute successfully ✓ + + [ ] Verify query results match expectations + Sample query result rows review + Spot-check 10 rows against source data + All values match expectations ✓ + + [ ] 
Check execution time + vw_customer_active: 2.5 sec (< 5 sec target) ✓ + vw_customer_lifetime_value: 8.2 sec (< 10 sec target) ✓ + vw_customer_segmentation: 15.3 sec (< 30 sec target) ✓ + +STEP 2: Data Flow Functionality + [ ] Data flows executable + df_customer_enrichment: Manual trigger successful ✓ + df_segment_scoring: Manual trigger successful ✓ + + [ ] Data flows complete successfully + Both flows show "COMPLETED" status + Duration: Within expected range + Errors: None ✓ + + [ ] Output data validated + After df_customer_enrichment execution: + customer_master updated (verified by timestamp) + New columns populated: enrichment_date, enrichment_status + No errors in data quality checks ✓ + +STEP 3: Performance Baseline + [ ] Establish baseline metrics + Record: Execution times for 5 sample queries + Average execution time: 8.5 seconds + Peak memory: 2.3 GB + CPU utilization: 45% + Storage used: 125 GB (expected 120 GB) ✓ + +STEP 4: User Acceptance Test + [ ] Business users validate results + Sample group: 3 power users + Queries run: 5 representative business questions + Results match expectations: 5/5 ✓ + Feedback: "Looks good, matches dev environment" + User sign-off: Obtained ✓ + +RESULT: ✓ All functions working correctly, acceptable performance + + +PHASE 4: COMPLIANCE & SECURITY (30 min) +────────────────────────────────────────────────────────────── + +STEP 1: Data Access Control + [ ] Row/column-level security active + Test: Consumer roles can only see authorized data + vw_customer_active: Only 'ACTIVE' rows visible to partner + columns: customer_id masked in partner view + Result: Security policies enforced ✓ + +STEP 2: Access Logging + [ ] Audit log records object access + Check: Audit logs show all queries from test users + Timestamp: Matches query execution time + User: Correctly identified + Result: Logging active ✓ + +STEP 3: Data Classification + [ ] Objects marked with classification + customer_master: "Internal Confidential" + 
vw_customer_lifetime_value: "Internal Confidential" + Result: All objects properly classified ✓ + +RESULT: ✓ Security and compliance verified + + +PHASE 5: SIGN-OFF DOCUMENTATION (30 min) +────────────────────────────────────────────────────────────── + +POST-IMPORT VERIFICATION REPORT +════════════════════════════════════════════════════════════════ + +Package: Customer Analytics Suite v1.0.0 +Target Tenant: datasphere-qa.company.com +Import Date: 2024-01-15 14:45 UTC +Verified By: QA Team + +VERIFICATION RESULTS: + +✓ BASIC VERIFICATION: PASSED + - 7 objects successfully imported + - All objects in ACTIVE status + - No compilation errors + +✓ DATA VALIDATION: PASSED + - Row counts match source (50M + 1B records) + - Data quality metrics acceptable + - No unexpected null values + - Data freshness: 15 minutes old (acceptable) + +✓ FUNCTIONAL TESTING: PASSED + - 3 representative queries execute successfully + - Query results match expectations + - Execution times within targets + - 2 data flows execute successfully + - User acceptance test: Passed (5/5 queries) + +✓ PERFORMANCE BASELINE: ESTABLISHED + - Average execution time: 8.5 seconds + - Peak memory: 2.3 GB + - Storage: 125 GB (vs expected 120 GB) + - Acceptable for production deployment + +✓ SECURITY & COMPLIANCE: VERIFIED + - Row/column-level security enforced + - Audit logging active + - Data classification correct + +RECOMMENDATION: READY FOR PRODUCTION DEPLOYMENT + +Verified By: Sarah Johnson (QA Lead) __________ Date: 2024-01-15 +Approved By: John Smith (QA Manager) _________ Date: 2024-01-15 + +Next Steps: + 1. Request CAB approval for production deployment + 2. Schedule production deployment window (maintenance) + 3. Notify support team of upcoming changes + 4. 
Prepare runbook for production operations +``` + +## Rollback Procedures + +**Emergency Rollback Workflow:** + +``` +ROLLBACK DECISION & EXECUTION +════════════════════════════════════════════════════════════════ + +PHASE 1: INCIDENT DETECTION (5-15 min after deployment) +────────────────────────────────────────────────────────────── + +SYMPTOM TRIGGERS: + [ ] Critical Error in logs + Example: "Customer_master table corrupted" + Severity: CRITICAL + + [ ] Wrong Results in queries + Example: "Revenue totals off by 50%" + Severity: CRITICAL + + [ ] Objects Unavailable + Example: "vw_customer_lifetime_value query timeout" + Severity: HIGH → CRITICAL (if prevents access) + + [ ] Data Quality Issue + Example: "Duplicate customer IDs found" + Severity: HIGH → CRITICAL (if widespread) + +ASSESSMENT: + 1. Severity assessment: Does issue block operations? + CRITICAL: Data unavailable or wrong (ROLLBACK) + HIGH: Performance degraded (INVESTIGATE FIRST) + MEDIUM: Minor issue (FIX IN PLACE) + + 2. Impact scope: How many users/systems affected? + Global: All users affected (rollback NOW) + Partial: Some functions working (investigate) + Single: One user issue (user-specific) + + 3. Root cause hypothesis: Deployment cause? + Suspected: Yes (likely rollback needed) + Unknown: Investigate first + Elsewhere: Don't rollback + +DECISION: IS CRITICAL & GLOBAL & DEPLOYMENT-CAUSED? + YES → INITIATE ROLLBACK + NO → INVESTIGATE FURTHER + + +PHASE 2: ROLLBACK AUTHORIZATION (5-10 min) +────────────────────────────────────────────────────────────── + +NOTIFY: Incident Commander + On-Call Manager + Message: "Critical issue post-deployment. Assessing rollback." + Status: Incident severity CRITICAL + Time: 14:50 UTC (5 min after deployment) + +AUTHORIZE: CTO / Senior IT Manager + Question: "Rollback approved to restore production?" 
+ Decision: YES / NO / INVESTIGATE FIRST + Answer: YES (given critical issue) + Authority: CTO (on-call) + +CONFIRM: Change Management (if required) + Verify: Emergency change approval process + Fast-track approval for critical incidents + Documentation: Incident ticket number + + +PHASE 3: ROLLBACK PREPARATION (5-10 min) +────────────────────────────────────────────────────────────── + +STEP 1: Backup Assessment + [ ] Verify pre-deployment backup exists + Backup ID: PROD_BACKUP_20240115_1400 + Timestamp: 2024-01-15 14:00 UTC (45 min before deployment) + Size: 250 GB + Status: Verified restorable ✓ + + [ ] Check backup integrity + Test restore: Spot-check restore to test system + Verify data intact + All objects present + Result: Backup verified good ✓ + +STEP 2: Production Cutover Planning + [ ] Communicate to users + Message: "Production data temporarily unavailable" + Duration: "Estimated 15-20 minutes" + Expected restoration: 15:15 UTC + + [ ] Halt incoming transactions (if applicable) + Action: Stop data flows/ETL processes + Protect: Prevent data loss during rollback + + [ ] Notify support team + Update: Ticket status to "ROLLBACK IN PROGRESS" + Task: Support to hold all user requests + + +PHASE 4: ROLLBACK EXECUTION (10-20 min) +────────────────────────────────────────────────────────────── + +STEP 1: Stop Current Services + [ ] Disable all queries to production objects + Action: Set objects to read-only mode + Effect: Existing queries complete, new queries blocked + + [ ] Stop data flows + Action: Cancel executing data flows + Status: Monitor for completion + + [ ] Notify users (2nd notice) + Message: "Rollback now in progress" + "Expect data back online 15:15 UTC" + +STEP 2: Restore From Backup + [ ] Select backup to restore + Backup: PROD_BACKUP_20240115_1400 + Source: Last known good (45 min before incident) + Data loss: 45 minutes (acceptable for critical issue) + + [ ] Execute restore + Command: $ datasphere restore backup_id=PROD_BACKUP_20240115_1400 
+
+      Progress: [████████████████░░░░░░░░░░] 60%
+                [██████████████████████░░░░] 85%
+                [█████████████████████████░] 95%
+      Status: RESTORE COMPLETE (15:05 UTC, 15 minutes)
+
+  [ ] Verify restored objects
+      Check: All objects present
+             Data matches expected (45 min old)
+             No errors in logs
+      Result: ✓ Restore verified successful
+
+
+PHASE 5: PRODUCTION VERIFICATION (5-10 min)
+──────────────────────────────────────────────────────────────
+
+STEP 1: Quick Sanity Checks
+  [ ] Query customer_master
+      SELECT COUNT(*) FROM customer_master
+      Expected: 50M
+      Actual: 50M ✓
+
+  [ ] Query vw_customer_active
+      SELECT COUNT(*) FROM vw_customer_active
+      Expected: 40M
+      Actual: 40M ✓
+
+  [ ] Check data freshness
+      Last update: 2024-01-15 14:00 UTC (45 min lag)
+      Status: Acceptable for rollback ✓
+
+STEP 2: User Availability Check
+  [ ] Enable queries again
+      Action: Set objects from read-only back to read-write
+      Effect: Users can now query production data
+
+  [ ] Verify user access
+      Test: Sample power user queries
+      Status: All queries successful ✓
+
+  [ ] Notify users (3rd notice)
+      Message: "Production data restored"
+               "Data is from 14:00 UTC (45 min old)"
+               "Normal operations resumed"
+      Status: RESOLVED
+
+
+PHASE 6: POST-ROLLBACK ANALYSIS (30 min - 4 hours)
+──────────────────────────────────────────────────────────────
+
+INCIDENT ANALYSIS:
+  [ ] Root cause investigation
+      What: Objects created/updated by deployment
+      When: Exactly 14:45 UTC
+      Why: [To be determined in investigation]
+      Impact: Queries returned wrong results
+
+  [ ] Data loss assessment
+      What: 45 minutes of updates lost (14:00-14:45)
+      How many: 50K customer records updated
+      Recovery: Re-run necessary delta loads
+
+LESSONS LEARNED MEETING:
+  [ ] Attend: Development, QA, Operations teams
+      Topic: Why didn't QA testing catch the issue?
+             How can it be prevented in future?
+
+  [ ] Action items:
+      1. Enhanced QA test cases for this scenario
+      2. Add data quality check before production import
+      3.
Add 15-min smoke test window before full cutover + 4. Additional monitoring/alerting for data quality + +DEPLOYMENT DECISION: + [ ] When ready: Redeploy with fixes + Timeline: After root cause fixed and re-tested + Testing: Full QA cycle required + Approval: CAB re-approval needed + + +POST-ROLLBACK STATUS +════════════════════════════════════════════════════════════════ + +Rollback Initiated: 2024-01-15 14:50 UTC +Rollback Completed: 2024-01-15 15:05 UTC +Total Downtime: 15 minutes +Data Loss: 45 minutes of updates (14:00-14:45) +Root Cause: [Under investigation] +Permanent Fix: [Estimated 48 hours] +Redeployment: [When fix tested and approved] + +Incident Summary: + Status: RESOLVED (rolled back) + Severity: CRITICAL + Duration: 15 minutes + Impact: All production users (45 min data age) + Resolution: Automatic restore from pre-deployment backup + Prevention: Enhanced QA testing added +``` diff --git a/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md new file mode 100644 index 0000000..daf35a8 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md @@ -0,0 +1,421 @@ +--- +name: View Architect +description: "Expert guide for designing Graphical and SQL Views in SAP Datasphere's Data Builder. Use this when you need to create views, define semantic models, set up associations, configure data access controls, or optimize view performance. Essential for building semantic layers, implementing star schemas, and preparing data for analytics." +--- + +# View Architect Skill + +## Overview + +The View Architect skill guides you through creating and designing views in SAP Datasphere's Data Builder. Views are fundamental semantic objects that organize raw data into meaningful business concepts. 
Whether you're building a graphical view through the intuitive UI or writing custom SQL, this skill covers the complete workflow from source selection through deployment and optimization. + +## When to Use This Skill + +- Creating new Graphical or SQL Views +- Designing semantic data models (Fact, Dimension, Hierarchy tables) +- Setting up associations between entities +- Implementing complex business logic through calculated columns and filters +- Configuring data access controls and security +- Optimizing view performance and push-down behavior +- Deploying views to development and production environments + +## Graphical Views vs SQL Views + +### Graphical Views +**When to use:** +- Building views through visual, drag-and-drop interface +- Team members need low-code/no-code approach +- Rapid prototyping and iteration +- Complex join logic with multiple tables +- Need version control and change tracking built-in + +**Advantages:** +- Visual representation of logic +- Built-in validation and constraint checking +- Easier maintenance and documentation +- Collaborative design capabilities +- Automatic dependency tracking + +**Limitations:** +- Cannot express certain complex SQL patterns +- May have slight performance overhead compared to hand-optimized SQL +- Limited to Datasphere's graphical expression capabilities + +### SQL Views +**When to use:** +- Implementing complex analytical logic +- Requiring specific SQL functions or window operations +- Performance-critical transformations +- Migrating from existing SQL systems +- Need for advanced SQL patterns (CTEs, recursive queries, etc.) 
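+
+As an illustrative sketch of such a pattern (the object and column names below are hypothetical, not taken from a real model), a CTE combined with a window function is the kind of logic that is typically easier to express in a SQL view than in the graphical editor:
+
+```sql
+-- Hypothetical example: keep each customer's three largest orders.
+-- "orders", "customer_id", "order_id", and "amount" are placeholder names.
+WITH ranked_orders AS (
+  SELECT
+    customer_id,
+    order_id,
+    amount,
+    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
+  FROM orders
+)
+SELECT customer_id, order_id, amount
+FROM ranked_orders
+WHERE amount_rank <= 3;
+```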
+ +**Advantages:** +- Maximum flexibility and control +- Hand-optimized for performance +- Access to all SQL functions available in SAP HANA +- Familiar to SQL developers +- Direct control over execution plans + +**Limitations:** +- Requires SQL expertise +- Harder to maintain without proper documentation +- Less visual feedback during design +- Version control less integrated + +## Semantic Usage Types + +Semantic usage defines how Datasphere interprets and uses your view data. Choosing the correct semantic usage is crucial for reporting, aggregation, and filtering behavior. + +### Fact +**Use when:** Your view represents transactional data, events, or measurements that you want to aggregate and analyze across dimensions. + +**Characteristics:** +- Contains quantitative measures (amounts, counts, durations) +- Typically has grain at the transaction or event level +- Often used as the source for Analytic Models +- Supports multiple measures and aggregation rules +- Can reference multiple dimensions through associations + +**Example:** Sales transactions (Order ID, Amount, Date, Quantity) + +### Dimension +**Use when:** Your view represents descriptive attributes that provide context to facts. + +**Characteristics:** +- Contains hierarchical or categorical data +- Typically slower-changing than fact data +- Used to filter, group, and drill down in analytics +- Should have a unique key (business key) +- Often associated with text entities for translations + +**Example:** Product catalog (Product ID, Category, Subcategory, Supplier) + +### Text +**Use when:** Your view contains multi-language descriptions or attributes for other entities. 
+ +**Characteristics:** +- Supports language-specific text and descriptions +- Associated with a main entity (Dimension or Fact) +- Language key and text element key structure +- Used for translation and localization + +**Example:** Product descriptions in English, German, French + +### Hierarchy +**Use when:** Your view defines parent-child relationships for drill-down and roll-up analysis. + +**Characteristics:** +- Represents hierarchical structures (Organization charts, Product hierarchies) +- Contains hierarchy key, parent key, and order fields +- Supports multiple hierarchies in a single entity +- Used for drill-down/roll-up in SAC and reporting tools + +**Example:** Organizational structure (Manager ID, Employee ID, Level) + +### Relational Dataset +**Use when:** Your view is purely for data distribution, data integration, or non-analytical purposes. + +**Characteristics:** +- Not used for analytical aggregation +- Used for operational reporting or data export +- Cannot be the source of an Analytic Model +- Useful for intermediate transformations +- Supports full-outer-join semantics + +**Example:** Transaction audit log, data export view + +## View Creation Workflow + +### Step 1: Select Source Tables +Start by identifying which source tables contain the data you need. + +**Best practices:** +- Use the Data Catalog to understand table structures: `search_catalog` with table keywords +- Retrieve detailed schema information: `get_table_schema` for your source tables +- Document your source dependencies +- Consider whether you're building on raw data or existing views +- Verify data quality and completeness of source tables + +### Step 2: Define Joins +Create relationships between source tables using appropriate join types. + +**Join types:** + +- **INNER JOIN:** Returns rows matching in both tables. Use when you want only matched records (e.g., Orders with Customers). +- **LEFT OUTER JOIN:** Returns all rows from left table + matched rows from right. 
Use when left table is primary (e.g., All Customers and their Orders, if any). +- **RIGHT OUTER JOIN:** Returns all rows from right table + matched rows from left. Use when right table is primary. +- **FULL OUTER JOIN:** Returns all rows from both tables. Use when you need all records from both sources (e.g., reconciliation). +- **CROSS JOIN:** Cartesian product of both tables. Use cautiously; creates many rows. Example: creating combinations of dimensions. + +**Join strategy best practices:** +- Join on business keys (unique identifiers) when possible +- Avoid joining on descriptive fields +- Consider join cardinality (1:1, 1:N, N:N) +- N:N joins can create data explosion; validate results +- Filter early to reduce join volume +- Document join logic in calculated columns + +### Step 3: Configure Projections +Select which columns to include and exclude from the view. + +**Best practices:** +- Include only columns needed downstream +- Remove redundant columns (avoid duplicate keys from joined tables) +- Rename columns for clarity (Customer ID → CustomerID) +- Consider column ordering for readability +- Mark key columns appropriately +- Use semantic naming conventions + +### Step 4: Add Calculated Columns +Create derived fields using expressions. 
+
+**Expression syntax examples:**
+- **String concatenation:** `"Company: " || COMPANY_NAME`
+- **Conditional logic:** `CASE WHEN AMOUNT > 1000 THEN 'Large' ELSE 'Small' END`
+- **Date calculations:** `DAYS_BETWEEN(ORDER_DATE, DELIVERY_DATE)`
+- **Aggregations (window functions):** `SUM(AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY DATE)`
+- **String functions:** `UPPER(PRODUCT_NAME)`, `SUBSTRING(CODE, 1, 3)`
+- **Numeric functions:** `ROUND(PRICE, 2)`, `ABS(VARIANCE)`
+
+**Calculated column best practices:**
+- Use meaningful aliases
+- Document complex formulas
+- Test expressions with `execute_query` before deployment
+- Avoid overly complex nested expressions; break into multiple columns
+- Consider performance impact of calculations on large datasets
+
+### Step 5: Add Filters
+Define row-level filters to exclude unwanted data.
+
+**Filter expression examples:**
+- **Simple comparison:** `STATUS = 'ACTIVE'`
+- **Date ranges:** `INVOICE_DATE >= '2024-01-01' AND INVOICE_DATE < '2024-02-01'`
+- **IN lists:** `COUNTRY IN ('USA', 'Canada', 'Mexico')`
+- **Null checks:** `CUSTOMER_EMAIL IS NOT NULL`
+- **Complex logic:** `(STATUS = 'ACTIVE' AND AMOUNT > 0) OR (STATUS = 'ARCHIVED' AND APPROVAL_DATE > '2023-01-01')`
+
+**Filter best practices:**
+- Apply filters at the view level to prevent duplicate filtering logic
+- Use meaningful filter names and descriptions
+- Consider whether filters should be static (always applied) or dynamic (parameterized)
+- Test filter performance on large datasets
+- Document business logic behind filters
+
+## Associations
+
+Associations define relationships between entities without creating joins at the view level. They enable navigation and filtering in SAC and other analytics tools.
+
+### Creating Associations
+
+**Association types:**
+
+- **To-One Association:** Links to a single dimension record. Example: Order → Customer (many orders to one customer)
+- **To-Many Association:** Links to multiple records.
Less common; use for navigational purposes
+
+**Association setup:**
+1. Define the foreign key (your view's column)
+2. Define the target entity and its primary key
+3. Set cardinality (1:1, N:1, 1:N)
+4. Optionally set as "primary" for default navigation
+5. Add descriptive label and documentation
+
+**Association best practices:**
+- Create associations to Dimension and Text entities
+- Limit to business-meaningful relationships
+- Avoid circular associations
+- Document navigation paths for users
+- Use consistent naming conventions (e.g., "To_Customer", "To_Date")
+
+### Managed Associations vs Direct References
+- **Managed associations:** Defined in the view, tracked in metadata
+- **Direct references:** Column-based references without formal association
+
+Use managed associations for navigational clarity and to enable SAC drill-down.
+
+## Input Parameters and Data Access Controls
+
+### Input Parameters
+Add dynamic parameters to views for flexible filtering and analysis.
+
+**Parameter types:**
+- **Prompt (Single Value):** Users select one value before executing the view
+- **Prompt (Multiple Values):** Users can select multiple values
+- **Range:** Users define a start and end value (e.g., date range)
+- **Variable:** Parametrized column value for runtime substitution
+
+**Parameter usage example:**
+```
+WHERE INVOICE_DATE >= :StartDate
+  AND INVOICE_DATE <= :EndDate
+  AND REGION = :SelectedRegion
+```
+
+**Input parameter best practices:**
+- Provide meaningful default values
+- Add descriptions and help text for end users
+- Consider mandatory vs optional parameters
+- Validate parameter ranges (e.g., EndDate > StartDate)
+- Document which parameters are required for queries
+
+### Data Access Controls (DAC)
+Implement row-level security to restrict data based on user attributes.
+
+**DAC setup:**
+1. Define a Principal Hierarchy (user groups, departments, regions)
+2. Create privilege assignments mapping users to hierarchy levels
+3.
Apply DAC filters at the view level + +**DAC expression example:** +``` +REGION IN (SELECT region FROM user_region_mapping WHERE user_id = CURRENT_USER) +``` + +**DAC best practices:** +- Align with organizational structure +- Review and audit access regularly +- Test with sample users to verify restrictions +- Document security policies in view descriptions +- Avoid hardcoding user-specific logic; use system variables + +## Persistence Strategies + +### Virtual Views +**When to use:** +- Data is frequently updated in source tables +- You need minimal storage overhead +- Underlying data changes daily/hourly +- Query latency is acceptable (seconds range) +- Data volume is moderate + +**Characteristics:** +- No physical storage of view data +- Always reflects latest source data +- Query executed on-demand +- Lower storage costs +- Longer query times (joins and transformations happen at query time) + +### Persisted Views +**When to use:** +- Data changes less frequently (daily or weekly) +- You need sub-second query performance +- Data volume is large and queries are heavy +- Multiple downstream views consume this view +- Aggregated or summarized data + +**Characteristics:** +- Data physically stored in SAP HANA +- Refreshed on schedule (hourly, daily, etc.) +- Fast query performance +- Higher storage consumption +- Slightly stale data (between refresh intervals) +- Can be source for other persisted views + +### Hybrid Approach +Persist heavily-used aggregates while virtualizing granular data: +``` +Raw transactions (virtual) + ↓ (source) +Daily summaries by customer (persisted, refreshed nightly) + ↓ (source) +Monthly KPI reports (persisted, refreshed monthly) +``` + +## Performance Best Practices + +### Push-Down Optimization +Enable Datasphere to push filters and projections down to the source database. 
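+
+As a hedged illustration (table and column names are hypothetical), compare a filter that can usually be pushed down with one that often cannot:
+
+```sql
+-- Push-down friendly: the filter references a source column directly,
+-- so the source database can evaluate it before any data moves.
+SELECT order_id, amount
+FROM orders
+WHERE order_date >= '2024-01-01';
+
+-- Harder to push down: wrapping the source column in an expression
+-- often forces evaluation only after the rows have been fetched.
+SELECT order_id, amount
+FROM orders
+WHERE UPPER(region_code) = 'EMEA';
+```
+
+Inspecting the execution plan (for example by testing with `execute_query`) is the reliable way to confirm whether a given filter was actually pushed down.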
+ +**Optimization rules:** +- Use columns from underlying tables directly when possible +- Place filters that operate on source columns at the view level +- Avoid complex calculated columns that prevent push-down +- Test execution plans to verify push-down behavior +- Use `execute_query` to analyze query performance + +**Non-push-down scenarios:** +- Complex string manipulation +- Calculations requiring multiple source rows (window functions) +- Union operations +- Calculated columns combining multiple tables + +### Avoiding Unnecessary Columns +- Include only columns needed by downstream objects +- Remove columns used only in intermediate joins +- Reduces memory and I/O +- Simplifies metadata for end users +- Speeds up view compilation + +### Join Strategy Optimization +- Join dimension tables (small) after fact tables (large) +- Apply row filters before joins when possible +- Use inner joins where applicable (reduces rows) +- Monitor join cardinality; avoid N:N situations +- Consider materialization of intermediate steps + +### Column Order and Indexing +- Place frequently filtered columns early in selection +- Logical grouping (keys, measures, attributes) +- No direct indexing control in views, but impacts source query + +### Aggregation and Deduplication +- Use COUNT(DISTINCT column) carefully on large datasets +- Consider materializing lookups to avoid repeated joins +- Use aggregation functions efficiently (GROUP BY on necessary columns only) +- Test performance with realistic data volumes + +## Deployment Workflow + +### Pre-Deployment Validation +1. **Schema validation:** Run `get_table_schema` on all sources to confirm structure +2. **Query testing:** Use `execute_query` to test view with sample data +3. **Dependency review:** Use `get_object_definition` to check downstream dependencies +4. **Performance testing:** Analyze query time and execution plan +5. **Data validation:** Verify row counts and data ranges match expectations + +### Deployment Steps +1. 
Save view and resolve any validation errors +2. Deploy to development environment for testing +3. Test with end users if applicable +4. Deploy to production environment +5. Monitor view execution performance +6. Document in change management system + +### Post-Deployment +- Monitor query performance in production +- Collect metrics on refresh times (if persisted) +- Gather user feedback +- Adjust filters, associations, or persistence as needed +- Update documentation + +## MCP Tools Reference + +### get_table_schema +Retrieve detailed information about source table structure, data types, and constraints. +``` +Use to understand available columns and their properties before designing joins +``` + +### search_catalog +Search for existing tables, views, and data assets in the catalog. +``` +Use to find source data and understand the semantic data model landscape +``` + +### get_object_definition +Retrieve detailed metadata about views, dimensions, or other semantic objects. +``` +Use to understand existing view structures, associations, and dependencies +``` + +### execute_query +Execute test queries against views to validate logic and performance. +``` +Use to test calculated column expressions, filters, and join logic +``` + +## Key Takeaways + +1. **Choose semantic usage carefully** — Fact, Dimension, Text, Hierarchy, or Relational Dataset determines behavior in analytics tools +2. **Design for performance** — Push-down optimization, efficient joins, and persistence strategies impact user experience +3. **Build associations for navigation** — Enable users to drill down and explore data across dimensions +4. **Validate before deployment** — Use MCP tools to test queries and understand dependencies +5. 
**Document comprehensively** — Clear descriptions help future maintainers and end users understand intent and usage patterns diff --git a/partner-built/SAP-Datasphere/skills/datasphere-view-architect/references/view-modeling-guide.md b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/references/view-modeling-guide.md new file mode 100644 index 0000000..ded76b9 --- /dev/null +++ b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/references/view-modeling-guide.md @@ -0,0 +1,742 @@ +# View Modeling Reference Guide + +## Semantic Usage Types - Detailed Reference + +### Fact Entity + +**Definition:** Contains measures and quantitative data that is aggregated and analyzed across dimensions. + +**Key characteristics:** +- Contains numeric measures (amounts, counts, weights, durations) +- Supports COUNT, SUM, AVG, MIN, MAX aggregations +- Typically at the grain of transactions or events +- Can have multiple fact tables with different grains +- Referenced by Analytic Models as primary source + +**Data structure example:** +``` +OrderID (Key) | OrderDate | CustomerID | ProductID | Amount | Quantity | ShipDate +123 | 2024-01-15| 456 | 789 | 1500 | 10 | 2024-01-18 +124 | 2024-01-15| 457 | 790 | 2300 | 15 | 2024-01-19 +``` + +**When NOT to use:** +- Contains primarily text/descriptions (use Dimension or Text) +- Data doesn't change frequently enough for aggregation +- Only used for lookups (use Dimension instead) + +--- + +### Dimension Entity + +**Definition:** Provides descriptive context and attributes for analysis. Slower-changing reference data. 
+ +**Key characteristics:** +- Contains categorical and descriptive attributes +- Typically has a business key (unique identifier) +- Can be hierarchical (category → subcategory → product) +- Associated with one or more Fact tables +- Should be relatively stable (infrequent changes) + +**Data structure example:** +``` +ProductID (Key) | ProductName | Category | Subcategory | UnitPrice | Supplier +789 | Widget Pro | Widgets | Premium | 150 | SuppCo A +790 | Widget Standard | Widgets | Standard | 89 | SuppCo B +791 | Gadget Deluxe | Gadgets | Premium | 299 | SuppCo A +``` + +**Associated Text Entity example:** +``` +ProductID | Language | Description +789 | EN | Professional-grade widget with advanced features +789 | DE | Professionelles Widget mit erweiterten Funktionen +790 | EN | Cost-effective widget for general use +790 | DE | Kostengünstiges Widget für den allgemeinen Gebrauch +``` + +**Hierarchy example (Product Hierarchy):** +``` +ProductHierarchyKey | ProductID | ParentProductID | HierarchyLevel | OrderNumber +PH-001 | 789 | 100 | 1 | 1 +PH-002 | 100 | NULL | 0 | 1 +PH-003 | 790 | 100 | 1 | 2 +``` + +--- + +### Text Entity + +**Definition:** Contains language-specific translations and descriptions for other entities. 
+
+**Key characteristics:**
+- Language-dependent content
+- Associated with a parent entity (usually Dimension or Fact)
+- Structure: Entity Key + Language Key + Text content
+- Used for multi-language reporting
+- Cannot have measures
+
+**Data structure example:**
+```
+CustomerID | Language | CustomerName_Text | AddressLine1_Text
+456        | EN       | Acme Corporation  | 123 Main Street
+456        | DE       | Acme Konzern      | Hauptstraße 123
+456        | FR       | Société Acme      | 123 rue Principale
+457        | EN       | Beta Industries   | 456 Oak Avenue
+457        | DE       | Beta Industrien   | Eichenallee 456
+```
+
+**Usage in reporting:**
+```
+Users see:
+EN: "Acme Corporation" → "123 Main Street"
+DE: "Acme Konzern" → "Hauptstraße 123"
+FR: "Société Acme" → "123 rue Principale"
+```
+
+---
+
+### Hierarchy Entity
+
+**Definition:** Defines parent-child relationships for drill-down and roll-up analysis.
+
+**Key characteristics:**
+- Represents organizational or categorical hierarchies
+- Contains hierarchy key, member key, parent key
+- Supports drill-down in analytics (Level 0 → Level 1 → Level 2)
+- Can be recursive or balanced
+- Includes order/sequence information
+
+**Data structure example (Organizational Hierarchy):**
+```
+OrgHierarchyKey | EmployeeID | ParentEmployeeID | DepartmentName     | Level
+1               | 100        | NULL             | Chief Executive    | 0
+2               | 200        | 100              | Chief Financial    | 1
+3               | 300        | 100              | Chief Operating    | 1
+4               | 250        | 200              | Accounting Manager | 2
+5               | 260        | 200              | Finance Manager    | 2
+6               | 251        | 250              | Accountant         | 3
+```
+
+**Drill-down path:** CEO (100) → CFO (200) → Accounting Manager (250) → Accountant (251)
+
+**Data structure example (Geography Hierarchy):**
+```
+GeoHierarchyKey | CountryCode | CountryName   | RegionCode | RegionName | CityCode | CityName
+1               | US          | United States | CA         | California | SF       | San Francisco
+2               | US          | United States | CA         | California | LA       | Los Angeles
+3               | US          | United States | NY         | New York   | NYC      | New York
+4               | CA          | Canada        | ON         | Ontario    | TO       | Toronto
+```
+
+---
+
+### Relational Dataset
+
+**Definition:**
Data used for distribution, integration, or operational purposes, not for analytical aggregation. + +**Key characteristics:** +- No aggregation semantics +- Used for operational reporting or data export +- Cannot be source for Analytic Model +- Useful as intermediate transformation layer +- Supports full outer join semantics + +**Use cases:** +- Customer contact lists for CRM exports +- Transaction audit logs for compliance +- Master data interfaces for upstream systems +- Data distribution feeds +- Non-analytical reporting views + +--- + +## Join Types - Detailed Reference + +### INNER JOIN + +**Syntax:** +```sql +SELECT * FROM Orders o +INNER JOIN Customers c ON o.CustomerID = c.CustomerID +``` + +**Result visualization:** +``` +Orders: Customers: Result: +O1-C1 C1-Acme O1-C1-Acme +O2-C1 C2-Beta O2-C1-Acme +O3-C2 C3-Gamma O3-C2-Beta + (matches only) +``` + +**When to use:** +- Only want data where both tables match +- Filtering out unmatched records is acceptable +- Customer orders (exclude customers with no orders) +- Invoice line items (exclude invoices with no lines) + +**Performance:** Generally fast, filters rows early + +--- + +### LEFT OUTER JOIN + +**Syntax:** +```sql +SELECT * FROM Customers c +LEFT OUTER JOIN Orders o ON c.CustomerID = o.CustomerID +``` + +**Result visualization:** +``` +Customers: Orders: Result: +C1-Acme O1-C1 C1-Acme-O1 +C2-Beta O2-C1 C1-Acme-O2 +C3-Gamma O3-C2 C2-Beta-O3 + C3-Gamma-NULL +``` + +**When to use:** +- All records from left table are important +- Right table may not have matching records +- All customers, whether they have orders or not +- All products, whether they've been sold or not + +**NULL handling:** Right table columns will be NULL where no match exists + +--- + +### RIGHT OUTER JOIN + +**Syntax:** +```sql +SELECT * FROM Customers c +RIGHT OUTER JOIN Orders o ON c.CustomerID = o.CustomerID +``` + +**Result visualization:** +``` +Customers: Orders: Result: +C1-Acme O1-C1 O1-C1-Acme +C2-Beta O2-C1 O2-C1-Acme 
+C3-Gamma O3-C2 O3-C2-Beta + O4-C99 O4-C99-NULL +``` + +**When to use:** +- Right table records are the primary source +- Want all right table records, some left may not match +- Rare in data modeling; usually rewrite with LEFT JOIN and swap table order + +--- + +### FULL OUTER JOIN + +**Syntax:** +```sql +SELECT * FROM Customers c +FULL OUTER JOIN Orders o ON c.CustomerID = o.CustomerID +``` + +**Result visualization:** +``` +Customers: Orders: Result: +C1-Acme O1-C1 C1-Acme-O1 +C2-Beta O2-C1 C1-Acme-O2 +C3-Gamma O3-C2 C2-Beta-O3 + O4-C99 C3-Gamma-NULL + O4-C99-NULL +``` + +**When to use:** +- Reconciliation queries (all from both sides) +- Outer join of independent datasets +- Finding unmatched records in either table + +**Performance:** Most expensive join type, retains all rows from both tables + +--- + +### CROSS JOIN + +**Syntax:** +```sql +SELECT * FROM DimDate d +CROSS JOIN DimProducts p +WHERE d.Year = 2024 +``` + +**Result visualization:** +``` +DimDate: DimProducts: Result (no join condition): +2024-01-01 Product A 2024-01-01-Product A +2024-01-02 Product B 2024-01-01-Product B +2024-01-03 Product C 2024-01-02-Product A + Product D 2024-01-02-Product B + ... (many combinations) +``` + +**When to use:** +- Creating all possible combinations +- Generating complete date-product matrix +- Budget allocation across products +- Forecasting scenarios + +**Warning:** Result size = left rows × right rows. Can create millions of rows! 
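
The join semantics above can be sanity-checked outside Datasphere with any SQL engine. A minimal sketch using Python's built-in `sqlite3`, with hypothetical `Customers`/`Orders` rows mirroring the visualizations (IDs and names are illustrative only):

```python
import sqlite3

# Hypothetical rows mirroring the join visualizations above.
# Any SQL engine gives the same semantics; sqlite3 just makes it testable.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (CustomerID TEXT, Name TEXT);
    CREATE TABLE Orders    (OrderID TEXT, CustomerID TEXT);
    INSERT INTO Customers VALUES ('C1','Acme'), ('C2','Beta'), ('C3','Gamma');
    INSERT INTO Orders    VALUES ('O1','C1'), ('O2','C1'), ('O3','C2'), ('O4','C99');
""")

# INNER JOIN keeps matches only: O4 (orphan customer C99) is dropped.
inner = con.execute("""
    SELECT o.OrderID, c.Name
    FROM Orders o INNER JOIN Customers c ON o.CustomerID = c.CustomerID
""").fetchall()

# LEFT OUTER JOIN keeps every customer; Gamma gets NULL (None) for OrderID.
left = con.execute("""
    SELECT c.Name, o.OrderID
    FROM Customers c LEFT OUTER JOIN Orders o ON c.CustomerID = o.CustomerID
""").fetchall()

# CROSS JOIN size is the product of both row counts: 3 x 4 = 12.
cross = con.execute("SELECT COUNT(*) FROM Customers CROSS JOIN Orders").fetchone()[0]

print(len(inner))               # 3
print(('Gamma', None) in left)  # True
print(cross)                    # 12
```

Note how the unmatched rows behave exactly as in the diagrams: the inner join silently drops the orphan order, while the left join pads the customer with no orders using NULL.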
+ +--- + +## Calculated Column Expressions - Syntax Reference + +### String Functions + +```sql +-- Concatenation +'Customer: ' || CUSTOMER_NAME || ' (' || COUNTRY || ')' +CONCAT(FIRST_NAME, ' ', LAST_NAME) + +-- Case conversion +UPPER(PRODUCT_NAME) -- 'product' → 'PRODUCT' +LOWER(DESCRIPTION) -- 'UPPER' → 'upper' +INITCAP(CITY_NAME) -- 'new york' → 'New York' + +-- String manipulation +SUBSTRING(SKU_CODE, 1, 3) -- First 3 characters +LENGTH(PRODUCT_CODE) -- Number of characters +TRIM(CUSTOMER_NAME) -- Remove leading/trailing spaces +LTRIM(VALUE) -- Remove leading spaces +RTRIM(VALUE) -- Remove trailing spaces +REPLACE(DESCRIPTION, 'Old', 'New') -- Find and replace + +-- Pattern matching +POSITION('ABC' IN PRODUCT_CODE) -- Find position of substring +INSTR(CUSTOMER_NAME, 'Inc') -- Index of substring +LIKE '%Corp%' -- Pattern matching (use in WHERE) + +-- String extraction +LEFT(ACCOUNT_CODE, 2) -- Leftmost n characters +RIGHT(ACCOUNT_CODE, 4) -- Rightmost n characters +MID(CODE, 2, 3) -- Extract n chars starting at position + +-- Encoding/Decoding +HEX(DATA) -- Convert to hexadecimal +UNHEX(HEX_VALUE) -- Convert from hexadecimal +``` + +### Numeric Functions + +```sql +-- Rounding +ROUND(PRICE, 2) -- Round to 2 decimals: 19.555 → 19.56 +FLOOR(AMOUNT) -- Round down: 19.9 → 19 +CEIL(AMOUNT) -- Round up: 19.1 → 20 +TRUNCATE(VALUE, 1) -- Truncate to 1 decimal: 19.99 → 19.9 + +-- Sign and absolute value +ABS(VARIANCE) -- Absolute value: -150 → 150 +SIGN(DIFFERENCE) -- Return -1, 0, or 1 +SQRT(VARIANCE_SQUARED) -- Square root + +-- Trigonometry +SIN(ANGLE), COS(ANGLE), TAN(ANGLE) -- Trigonometric functions +ASIN(), ACOS(), ATAN() -- Inverse trigonometric + +-- Logarithms +LOG(VALUE) -- Natural logarithm +LOG10(VALUE) -- Base-10 logarithm +EXP(POWER) -- e raised to power + +-- Power and modulo +POWER(BASE, EXPONENT) -- 2 raised to power 3 = 8 +MOD(17, 5) -- Remainder: 17 mod 5 = 2 +GREATEST(10, 20, 5) -- Maximum of values: 20 +LEAST(10, 20, 5) -- Minimum of values: 5 + 
+-- Random numbers +RAND() -- Random decimal 0-1 +RANDINT(1, 100) -- Random integer 1-100 +``` + +### Date Functions + +```sql +-- Current date/time +CURRENT_DATE -- Today: 2024-01-15 +CURRENT_TIMESTAMP -- Now: 2024-01-15 14:30:45.123 +CURRENT_TIME -- Time: 14:30:45 + +-- Date arithmetic +DATE_ADD(ORDER_DATE, INTERVAL 30 DAY) -- Add 30 days +DATE_SUB(DELIVERY_DATE, INTERVAL 1 WEEK) -- Subtract 1 week +DATEDIFF(day, ORDER_DATE, DELIVERY_DATE) -- Days between dates: 5 +DATEDIFF(month, START_DATE, END_DATE) -- Months between dates + +-- Date extraction +YEAR(INVOICE_DATE) -- Extract year: 2024 +MONTH(INVOICE_DATE) -- Extract month: 6 +DAY(INVOICE_DATE) -- Extract day: 15 +QUARTER(INVOICE_DATE) -- Extract quarter: 2 +WEEK(INVOICE_DATE) -- Week of year: 24 +DAYNAME(ORDER_DATE) -- Day name: 'Monday' +MONTHNAME(ORDER_DATE) -- Month name: 'June' + +-- Date formatting +TO_DATE(DATE_STRING, 'YYYY-MM-DD') -- Parse string to date +TO_CHAR(ORDER_DATE, 'YYYY-MM-DD') -- Format date as string + +-- Quarter and fiscal calculations +'Q' || QUARTER(DATE_COLUMN) -- 'Q1', 'Q2', etc. 
+YEAR(DATE_COLUMN) * 10000 + MONTH(DATE_COLUMN) * 100 + DAY(DATE_COLUMN) -- YYYYMMDD
+```
+
+### Conditional Functions
+
+```sql
+-- Simple CASE (compares one expression against values)
+CASE STATUS WHEN 'A' THEN 'Active'
+            WHEN 'I' THEN 'Inactive'
+            ELSE 'Unknown'
+END
+
+-- Searched CASE (tiered thresholds)
+CASE WHEN AMOUNT > 1000 THEN 'Large'
+     WHEN AMOUNT > 100 THEN 'Medium'
+     ELSE 'Small'
+END
+
+-- Searched CASE (multiple conditions)
+CASE WHEN STATUS = 'ACTIVE' AND AMOUNT > 0 THEN 'Valid'
+     WHEN STATUS = 'INACTIVE' THEN 'Closed'
+     ELSE 'Unknown'
+END
+
+-- IF alternative
+IF(QUANTITY > 0, AMOUNT / QUANTITY, 0) -- Avoid division by zero
+
+-- COALESCE (return first non-NULL)
+COALESCE(UPDATED_DATE, CREATED_DATE, SYSTEM_DATE) -- Use first available
+
+-- NULL handling
+IFNULL(COMMISSION, 0) -- Replace NULL with 0
+NULLIF(VALUE1, VALUE2) -- Return NULL if equal, else VALUE1
+```
+
+### Aggregate Functions (Window Functions)
+
+```sql
+-- Running totals and averages
+SUM(AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE)
+-- Running total of sales per customer
+
+AVG(PRICE) OVER (PARTITION BY PRODUCT_CATEGORY)
+-- Average price within each category
+
+-- Ranking
+ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC)
+-- Row number per customer, newest orders first
+
+RANK() OVER (ORDER BY SALES_AMOUNT DESC)
+-- Rank with gaps (1, 2, 2, 4)
+
+DENSE_RANK() OVER (ORDER BY SALES_AMOUNT DESC)
+-- Rank without gaps (1, 2, 2, 3)
+
+-- Lead and lag
+LAG(AMOUNT, 1) OVER (PARTITION BY CUSTOMER_ID ORDER BY DATE)
+-- Previous order amount for each customer
+
+LEAD(AMOUNT, 1) OVER (PARTITION BY CUSTOMER_ID ORDER BY DATE)
+-- Next order amount for each customer
+
+-- First and last values
+FIRST_VALUE(AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY DATE)
+-- First purchase amount per customer
+
+LAST_VALUE(AMOUNT) OVER (PARTITION BY CUSTOMER_ID ORDER BY DATE
+  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
+-- Last purchase amount per customer (the explicit frame is required: the
+-- default frame ends at the current row and would return the current value)
+```
+
+---
+
+## Filter Expressions - Reference
+
+### Date Range Filters
+
+```sql
+-- Single month
+INVOICE_DATE >= '2024-01-01' AND INVOICE_DATE < '2024-02-01'
+
+-- Year to date
+INVOICE_DATE >= '2024-01-01' AND INVOICE_DATE <=
CURRENT_DATE + +-- Last 90 days +INVOICE_DATE >= CURRENT_DATE - 90 + +-- Fiscal year (April-March) +CASE WHEN MONTH(ORDER_DATE) >= 4 + THEN YEAR(ORDER_DATE) + ELSE YEAR(ORDER_DATE) - 1 +END = 2024 + +-- Relative filters +ORDER_DATE >= DATE_ADD(CURRENT_DATE, INTERVAL -30 DAY) -- Last 30 days +DELIVERY_DATE >= DATE_SUB(CURRENT_DATE, INTERVAL 1 MONTH) -- Last month +``` + +### Categorical Filters + +```sql +-- Single value +STATUS = 'ACTIVE' +REGION = 'North America' + +-- Multiple values +COUNTRY IN ('USA', 'Canada', 'Mexico') +PRODUCT_LINE IN (SELECT PRODUCT_LINE_ID FROM APPROVED_LINES) + +-- Exclusions +DEPARTMENT NOT IN ('Discontinued', 'Testing') +STATUS <> 'INACTIVE' + +-- Null/empty handling +CUSTOMER_EMAIL IS NOT NULL +MIDDLE_NAME IS NULL +DESCRIPTION <> '' +``` + +### Numeric Range Filters + +```sql +-- Simple ranges +AMOUNT > 1000 +QUANTITY BETWEEN 10 AND 100 + +-- Percentage-based +DISCOUNT_PERCENT <= 10 + +-- Variance tolerance +ABS(ACTUAL - FORECAST) <= 100 + +-- Top values +SALES_RANK <= 10 -- Top 10 products +PERCENTILE >= 0.75 -- Top quartile +``` + +### Complex Composite Filters + +```sql +-- Multiple conditions (AND) +STATUS = 'ACTIVE' +AND ORDER_DATE >= '2024-01-01' +AND AMOUNT > 0 +AND COUNTRY IN ('USA', 'Canada') + +-- Alternative conditions (OR) +REGION IN ('North', 'South') +OR MANAGER_ID IS NULL +OR PERFORMANCE_RATING >= 4 + +-- Complex logic +(STATUS = 'ACTIVE' AND AMOUNT > 100) +OR (STATUS = 'PENDING' AND AMOUNT > 1000) +OR (STATUS = 'ARCHIVED' AND APPROVAL_DATE > '2023-01-01') + +-- Excluding problematic records +AMOUNT > 0 -- No zero/negative sales +AND CREATED_DATE <= CURRENT_DATE -- No future dates +AND CUSTOMER_ID <> 0 -- Valid customers only +AND LENGTH(TRIM(DESCRIPTION)) > 0 -- Non-empty descriptions +``` + +--- + +## Data Access Control (DAC) Setup + +### Principal Hierarchy Structure + +``` +Organization (Root) +├── Region +│ ├── Sales_East +│ ├── Sales_West +│ └── Sales_Europe +└── Department + ├── Finance + ├── Operations + └── 
Executive +``` + +### User-to-Principal Mapping + +``` +UserID | PrincipalValue | Level +john.smith | Sales_East | Region +jane.doe | Sales_West | Region +carlos.lopez | Executive | Department +maria.garcia | Finance | Department +``` + +### DAC Filter Expression Examples + +```sql +-- Region-based access +SALES_REGION IN ( + SELECT principal_value + FROM user_principals + WHERE user_id = CURRENT_USER + AND principal_type = 'Region' +) + +-- Department-based access +COST_CENTER IN ( + SELECT cost_center_id + FROM department_mapping + WHERE department_name IN ( + SELECT principal_value + FROM user_principals + WHERE user_id = CURRENT_USER + ) +) + +-- Hierarchical access (user sees their team and below) +MANAGER_ID IN ( + SELECT employee_id + FROM org_hierarchy + WHERE reporting_line LIKE CONCAT('%', + (SELECT employee_id FROM employees WHERE user_id = CURRENT_USER), '%') +) + +-- Multi-dimensional access +(REGION = (SELECT region FROM user_attributes WHERE user_id = CURRENT_USER)) +AND (FISCAL_YEAR >= (SELECT start_year FROM user_fiscal_access WHERE user_id = CURRENT_USER)) +``` + +--- + +## Common View Patterns + +### Star Schema Pattern + +``` +Central Fact Table (Orders) + ↙ ↓ ↓ ↖ + Customer Date Product Warehouse + (Dimension) Dimensions... 
+``` + +**Implementation:** +- Fact view contains transaction-level data with foreign keys +- Dimension views contain reference data +- Associations link Fact to Dimensions +- Supports drill-down and analysis across multiple dimensions + +--- + +### Snowflake Schema Pattern + +``` +Fact Table (Orders) + ↓ + Customer Dimension + ↓ + Geography (nested hierarchy) + ↓ + Region → Country → Continent +``` + +**Implementation:** +- Normalized dimension tables +- Dimension views reference sub-dimensions +- More storage efficient than star schema +- Slightly more complex join logic + +--- + +### Denormalized Pattern + +``` +Single Fact View (Sales with all attributes) +- OrderID, CustomerID, CustomerName, Address +- ProductID, ProductName, Category +- Date, Region, Amount, Quantity +``` + +**Use when:** +- Data is read-heavy, rarely updated +- Performance is critical +- Query simplicity is important + +**Caution:** Data redundancy, update anomalies possible + +--- + +### Federated/Materialized View Pattern + +``` +Virtual View (combines virtual and persisted sources) +- Virtual customer orders (real-time) +- Persisted customer master (daily refresh) +- Persisted product catalog (weekly refresh) +``` + +**Implementation:** +- Mix persistent and virtual views +- Persist frequently-accessed aggregates +- Virtual where real-time is critical + +--- + +## Association Best Practices + +### Navigational Hierarchy +``` +Fact: Orders + → To_Customer (Customer Dimension) + → To_Geography (Customer Geography) + → To_Region (Region Master) +``` + +**Enable:** Drill-down from Order → Customer Region in analytics + +### Bidirectional Reference +``` +Customer (Dimension) + ← Has_Orders (inverse association) + +Order (Fact) + → To_Customer (forward association) +``` + +**Note:** Explicitly model if needed for reverse navigation + +### Avoiding Circular References +``` +AVOID: +Product ← → Category ← → Brand + +DO: +Product → Category +Product → Brand +(No circular dependency) +``` + +--- + 
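
The circular-reference rule lends itself to an automated check. A minimal sketch in plain Python (the association pairs are hypothetical) that flags a cycle in a set of `source → target` associations before you model them:

```python
# Sketch: detect circular association chains before modeling them.
# Each edge is a hypothetical ("source view", "associated view") pair.
def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back-edge to the current path: circular reference
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)

# The "DO" layout above: Product -> Category, Product -> Brand (no cycle)
print(has_cycle([("Product", "Category"), ("Product", "Brand")]))     # False
# The "AVOID" layout: Product <-> Category is a circular dependency
print(has_cycle([("Product", "Category"), ("Category", "Product")]))  # True
```

Running this over an exported association list is a cheap pre-deployment guard: any `True` result means the model contains a chain that will loop back on itself.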
+## Performance Tuning Reference + +### Push-Down Eligible Operations +- ✓ Column selection (projection) +- ✓ Row filters on source columns +- ✓ Simple arithmetic operations +- ✓ String comparisons + +### Push-Down NOT Possible +- ✗ Complex calculated columns +- ✗ Window functions +- ✗ Union operations +- ✗ Group-by aggregations in virtual views + +### Storage Estimate Formula +``` +View Size = Row Count × Average Row Width + +Example: +100 million order rows × 80 bytes/row = 8 GB + +Compressed (typical 3:1) = ~2.7 GB +``` From debdaa683ec96e8fa1266827623bf17202f60ebb Mon Sep 17 00:00:00 2001 From: MarioDeFelipe <89535508+MarioDeFelipe@users.noreply.github.com> Date: Fri, 27 Feb 2026 20:57:13 +0100 Subject: [PATCH 2/2] Bump SAP Datasphere plugin to v0.4.0 Updated skills with improved cross-references and content refinements. Co-Authored-By: Claude Opus 4.6 --- partner-built/SAP-Datasphere/.claude-plugin/plugin.json | 4 ++-- .../SAP-Datasphere/skills/datasphere-admin/SKILL.md | 5 +++++ .../skills/datasphere-analytic-model-creator/SKILL.md | 5 +++++ .../SAP-Datasphere/skills/datasphere-data-flows/SKILL.md | 5 +++++ .../references/replication-flows.md | 5 +++++ .../datasphere-data-flows/references/task-chains.md | 9 +++++++++ .../skills/datasphere-data-product-publisher/SKILL.md | 4 ++++ .../SAP-Datasphere/skills/datasphere-explorer/SKILL.md | 4 ++++ .../skills/datasphere-security-architect/SKILL.md | 4 ++++ .../skills/datasphere-view-architect/SKILL.md | 6 ++++++ 10 files changed, 49 insertions(+), 2 deletions(-) diff --git a/partner-built/SAP-Datasphere/.claude-plugin/plugin.json b/partner-built/SAP-Datasphere/.claude-plugin/plugin.json index ccee345..684b038 100644 --- a/partner-built/SAP-Datasphere/.claude-plugin/plugin.json +++ b/partner-built/SAP-Datasphere/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "datasphere", - "version": "0.3.0", - "description": "The most comprehensive SAP Datasphere plugin for Claude. 
18 specialized skills covering exploration, data modeling, integration, BW Bridge migration, security architecture, CLI automation, business content activation, catalog governance, performance optimization, and troubleshooting — all through natural language. Powered by 45 MCP tools with enterprise-grade security.", + "version": "0.4.0", + "description": "The most comprehensive SAP Datasphere plugin for Claude. 18 specialized skills covering exploration, data modeling, integration, BW Bridge migration, security architecture, CLI automation, business content activation, catalog governance, performance optimization, and troubleshooting \u2014 all through natural language. Powered by 45 MCP tools with enterprise-grade security.", "author": { "name": "Mario De Felipe", "url": "https://github.com/MarioDeFelipe" diff --git a/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md index b4c3ac5..46e31d1 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-admin/SKILL.md @@ -281,3 +281,8 @@ See reference files for detailed procedures: - `references/security-governance.md` - Security configuration details - `references/system-monitoring.md` - Monitoring and troubleshooting - `references/transport.md` - Transport lifecycle management + +## What's New (2026.05) + +- **Critical Storage Threshold — Automatic Space Locking**: To protect tenant stability, Datasphere will now automatically lock ALL spaces when disk usage reaches a critical threshold. A message appears in the Space Management app and on each space page. To unlock and resume work, you must either increase tenant disk storage or delete unneeded data. This is a critical operational consideration — monitor disk usage proactively and set up alerts before reaching the threshold. 
+- **Installing Intelligent Applications for Multiple Source Systems**: If your Datasphere is part of a Business Data Cloud formation, you can now install a single intelligent application multiple times for different source systems. Each source system creates (or reuses) its own ingestion space, with source-specific preparation and application spaces identified by an alias. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md index 3b0fbcc..f203281 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-analytic-model-creator/SKILL.md @@ -686,3 +686,8 @@ Verify data types and column names 6. **Test in SAC early** — Validate filter behavior and dashboard performance 7. **Document for users** — Clear measure definitions enable self-service analytics 8. **Optimize for consumption** — Simplify complexity; performance matters in dashboards + +## What's New (2026.05) + +- **Share Analytic Models Across Spaces**: You can now share an analytic model to one or more other spaces. In the target space, you can create a new analytic model on top of the shared one for space-specific consumption. This enables a hub-and-spoke pattern where a central analytics team maintains core models and business units layer their own measures and dimensions on top. +- **Filter on Aggregated Measure Values in OData API**: When consuming analytic models via OData, you can now filter on aggregated measure values. Example syntax: `?$filter=Partner_ID eq '100000005' and Value gt 1000000`. This enables consumers to request only rows where measures meet specific thresholds, reducing data transfer and improving dashboard performance. 
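
The new aggregated-measure filter can be exercised from any HTTP client. A minimal sketch in Python that builds the request URL; the host, space, and model path segments are hypothetical placeholders, and only the `$filter` expression itself comes from the release note above:

```python
from urllib.parse import quote

# Hypothetical consumption endpoint: substitute your tenant host, space,
# and analytic model names.
base = ("https://mytenant.eu10.hcs.cloud.sap/api/v1/dwc/consumption/"
        "analytical/SALES/SalesModel/SalesModel")

# New in 2026.05: the filter may reference aggregated measure values (Value).
flt = "Partner_ID eq '100000005' and Value gt 1000000"
encoded = quote(flt, safe="'")  # percent-encode spaces, keep OData quotes
url = f"{base}?$filter={encoded}"
print(url)
```

Because the threshold is applied server-side, only qualifying rows cross the wire, which is exactly the data-transfer saving the release note describes.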
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md index ea42f22..61424f5 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/SKILL.md @@ -449,3 +449,8 @@ See reference files for detailed procedures: - `references/replication-flows.md` - Detailed replication configuration - `references/transformation-flows.md` - Delta staging patterns - `references/task-chains.md` - Orchestration patterns + +## What's New (2026.05) + +- **Improved Primary Key Order Handling in Replication Flows**: During table replication, the primary key order from the source is now preserved in the target. This prevents replication failures caused by key order mismatches between source and target tables. No configuration needed — this is automatic behavior. +- **Output Parameters in Task Chains**: Task chain objects now support output parameters. You can map output parameters from task objects to the parent task chain, enabling more flexible orchestration of nested task chains and conditional logic based on task results. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md index 470aa1e..d53f7a3 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/replication-flows.md @@ -87,3 +87,8 @@ Path: Left Menu → Data Integration Monitor 3. Review error logs in monitor 4. Check Cloud Connector (on-prem sources) 5. Verify POI block availability (external targets) + +## What's New (2026.05) + +### Improved Primary Key Order Handling +During table replication, the primary key order from the source is now preserved in the target table. 
Previously, mismatches in primary key column ordering between source and target could cause replication failures. This fix is automatic and requires no configuration changes. If you previously encountered key order mismatch errors, re-deploying the replication flow should resolve the issue.
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md
index cbdb01f..b88685e 100644
--- a/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md
+++ b/partner-built/SAP-Datasphere/skills/datasphere-data-flows/references/task-chains.md
@@ -89,4 +89,13 @@ Morning: Source Extracts (Parallel) → AND → Transformations (Serial) → Fact Loads
 ```
 Replication Flows (CDC) run continuously
 Task Chain (scheduled): Transformation Flows → Analytics Layer
 ```
+
+## What's New (2026.05)
+
+### Output Parameters in Task Chains
+Task chain objects now support output parameters, enabling more flexible orchestration:
+- **Define output parameters** on task objects (Data Flows, Transformation Flows, etc.)
+- **Map output parameters** from a task object to the parent task chain
+- **Use in nested task chains** — output parameters propagate upward through the chain hierarchy
+- This enables conditional logic and dynamic behavior based on task results
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
index c35fb3e..d79d1d3 100644
--- a/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
+++ b/partner-built/SAP-Datasphere/skills/datasphere-data-product-publisher/SKILL.md
@@ -970,3 +970,7 @@ Returns: Space configuration, objects, sizing, security posture
 - Version data products and communicate changes
 - Plan data retention and archival strategy upfront
 - Consider privacy and compliance implications early
+
+## What's New (2026.05)
+
+- **Optimized Data Product Uninstallation**: Uninstalling data products is now significantly faster. All related artifacts — including replication flows — are automatically removed during uninstallation. Previously, orphaned replication flows could remain after uninstalling a data product, requiring manual cleanup.
diff --git a/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md index 9ac72b9..39899b3 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-explorer/SKILL.md @@ -180,3 +180,7 @@ a business analyst, a data scientist, or a manager trying to understand what dat - Use concrete examples and numbers rather than abstract descriptions - If showing tabular data, keep it to 5-10 rows unless the user asks for more - When profiling, focus on the insights (quality issues, patterns, anomalies) not just the numbers + +## What's New (2026.05) + +- **Filter on Aggregated Measure Values in OData API**: When querying analytic models via OData, you can now filter on aggregated measure values. Example: `?$filter=Partner_ID eq '100000005' and Value gt 1000000`. This is useful when exploring data programmatically or building consumption queries — you can request only rows where measures exceed specific thresholds. diff --git a/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md index 3070e6d..5fd9d97 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-security-architect/SKILL.md @@ -1593,6 +1593,10 @@ Response: "These 12 tables have audit logging: - T_CUSTOMER_PII: Detailed READ logging, 3-year retention - T_FINANCIAL: Change-only logging, 10-year retention - T_SALES: Change-only logging, 1-year retention" + +## What's New (2026.05) + +- **Visibility of Data Access Controls Applied to Sources**: When editing a Graphical or SQL View, you can now see data access controls (DACs) applied to the view's sources in a new subsection "Applied via Sources" under the Data Access Controls panel. 
This provides transparency into inherited security — you can see not just the DACs applied directly to your view, but also those applied upstream to the views and tables your view consumes. Critical for auditing and debugging access control behavior in complex view hierarchies. ``` --- diff --git a/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md index daf35a8..6e3a716 100644 --- a/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md +++ b/partner-built/SAP-Datasphere/skills/datasphere-view-architect/SKILL.md @@ -419,3 +419,9 @@ Use to test calculated column expressions, filters, and join logic 3. **Build associations for navigation** — Enable users to drill down and explore data across dimensions 4. **Validate before deployment** — Use MCP tools to test queries and understand dependencies 5. **Document comprehensively** — Clear descriptions help future maintainers and end users understand intent and usage patterns + +## What's New (2026.05) + +- **Partitioning Local Tables for Intelligent Applications**: If your Datasphere is part of an SAP Business Data Cloud formation, you can now create partitions for local tables installed via intelligent applications. This enables better management of read-only tables with large data volumes by breaking data into chunks. +- **Change Primary Key Index Type in Local Tables**: When a local table has multiple primary keys, you can now change the index type in the Local Table editor. This optimizes performance in very large volume scenarios where the default index type may not be optimal. +- **Review and Restore Transformation Flow Versions**: You can now review past versions of transformation flows, open them in read-only mode, download them as CSN/JSON files, and restore a past version to replace the current version. This provides version history and rollback capability for transformation logic.