diff --git a/notebooks/databricks/validmind_databricks_quickstart.ipynb b/notebooks/databricks/validmind_databricks_quickstart.ipynb new file mode 100644 index 000000000..c54c52ae8 --- /dev/null +++ b/notebooks/databricks/validmind_databricks_quickstart.ipynb @@ -0,0 +1,553 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ValidMind + Databricks Quickstart\n", + "\n", + "Use this notebook to run the ValidMind Library inside a Databricks notebook and document a simple classification model end to end.\n", + "\n", + "In this notebook, you will:\n", + "\n", + "- Install and initialize the ValidMind Library\n", + "- Load data from a Unity Catalog table linked to your model in ValidMind\n", + "- Train a simple classification model\n", + "- Run ValidMind tests and send the results to the ValidMind Platform\n", + "\n", + "## Before you begin\n", + "\n", + "You will need:\n", + "1. A running Databricks workspace with Unity Catalog enabled\n", + "2. A ValidMind account with a registered model\n", + "3. Your ValidMind API credentials (API key, API secret, and model identifier)\n", + "\n", + "To get your credentials: log in to ValidMind → **Model Inventory** → select your model → **Getting Started** → **Copy snippet to clipboard**.\n", + "\n", + "For step-by-step instructions on setting up the Databricks integration and linking a Unity Catalog table to your model, refer to [Synchronize with Databricks](https://docs.validmind.ai/guide/integrations/integrations-examples/synchronize-with-databricks.html).\n", + "\n", + "> **Note:** If you don't have a Unity Catalog table linked to your model yet, this notebook includes a synthetic-data fallback so you can still run through the full workflow."
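, + "\n", + "If you plan to run this notebook as a parameterized Databricks job, you can pre-create the credential widgets that Step 3 reads. The helper below is a hypothetical convenience sketch, not part of the ValidMind API; `dbutils` only exists inside a Databricks notebook, so it degrades to a no-op anywhere else:\n", + "\n",
+ "```python\n",
+ "# Hypothetical helper: create the widgets Step 3 reads. Returns False when\n",
+ "# dbutils is unavailable (i.e., when running outside Databricks).\n",
+ "def ensure_credential_widgets():\n",
+ "    try:\n",
+ "        for name in ('vm_api_host', 'vm_api_key', 'vm_api_secret', 'vm_model_cuid'):\n",
+ "            dbutils.widgets.text(name, '')\n",
+ "        return True\n",
+ "    except NameError:\n",
+ "        return False\n",
+ "\n",
+ "ensure_credential_widgets()\n",
+ "```"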
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1 — Install the ValidMind Library\n", + "\n", + "Run this cell first. Databricks requires a Python restart after `%pip install`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q validmind" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Restart Python kernel to pick up newly installed packages\n", + "dbutils.library.restartPython()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2 — Verify installation\n", + "\n", + "Confirm that the ValidMind Library installed successfully and check the version available in your notebook environment:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import importlib.metadata\n", + "version = importlib.metadata.version('validmind')\n", + "print(f'ValidMind Library version: {version}')\n", + "print('Installation successful!')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3 — Initialize the ValidMind Library\n", + "\n", + "Initialize the ValidMind Library with the *code snippet* unique to your model so that test results are uploaded to the correct model in the ValidMind Platform.\n", + "\n", + "You can supply your credentials in either of two ways:\n", + "\n", + "- **Databricks widgets**: set widgets named `vm_api_host`, `vm_api_key`, `vm_api_secret`, and `vm_model_cuid` on the notebook. This is convenient when you parameterize the notebook as part of a Databricks job.\n", + "- **Edit the next cell directly**: replace the placeholder values with your own credentials.\n", + "\n", + "To get your credentials:\n", + "\n", + "1. In ValidMind, go to **Model Inventory** and select your model.\n", + "2. 
Open **Getting Started** and click **Copy snippet to clipboard**.\n", + "3. Paste the values into the next cell, or use them to set the corresponding widgets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import validmind as vm\n", + "\n", + "# ---------------------------------------------------------------------------\n", + "# Credentials are read from Databricks widgets if set. Otherwise, replace the\n", + "# placeholder values below before running this cell.\n", + "# ---------------------------------------------------------------------------\n", + "try:\n", + " api_host = dbutils.widgets.getAll().get(\"vm_api_host\", \"\")\n", + " api_key = dbutils.widgets.getAll().get(\"vm_api_key\", \"\")\n", + " api_secret = dbutils.widgets.getAll().get(\"vm_api_secret\", \"\")\n", + " model_cuid = dbutils.widgets.getAll().get(\"vm_model_cuid\", \"\")\n", + "except NameError:\n", + " # dbutils is not available — running outside Databricks\n", + " api_host = \"\" # replace with your API host\n", + " api_key = \"\" # replace with your API key\n", + " api_secret = \"\" # replace with your API secret\n", + " model_cuid = \"\" # replace with your model CUID\n", + "\n", + "vm.init(\n", + " api_host=api_host,\n", + " api_key=api_key,\n", + " api_secret=api_secret,\n", + " model=model_cuid,\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4 — Load data from your linked Databricks table\n", + "\n", + "Load the data for this notebook from a Unity Catalog table that you've linked to your model in ValidMind. Once a table binding is set up, ValidMind syncs the data and makes it available through the tracking API. You don't need a Spark session or direct Unity Catalog credentials in this notebook.\n", + "\n", + "Before running the next cell, make sure you have:\n", + "\n", + "1. A Databricks integration configured in **Settings → Integrations → Databricks**\n", + "2. 
A `table` binding created for your model that links a Unity Catalog table to it\n", + "3. At least one successful sync (the initial sync runs automatically when you create the binding)\n", + "\n", + "If you don't have a table binding yet, set `USE_SYNTHETIC_FALLBACK = True` in the next cell to run this notebook with generated data instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import requests\n", + "import pandas as pd\n", + "from validmind import api_client as _vm_client\n", + "\n", + "# Set to True only if you don't have a Databricks table binding set up yet\n", + "USE_SYNTHETIC_FALLBACK = False\n", + "\n", + "# ---------------------------------------------------------------------------\n", + "# Load from ValidMind — uses the linked Databricks table binding for this model\n", + "# ---------------------------------------------------------------------------\n", + "if not USE_SYNTHETIC_FALLBACK:\n", + " _api_host = _vm_client.get_api_host() # same host as vm.init()\n", + " _headers = _vm_client._get_api_headers()\n", + "\n", + " _response = requests.get(\n", + " f\"{_api_host}/integrations/dataset\",\n", + " headers=_headers,\n", + " timeout=30,\n", + " )\n", + "\n", + " if _response.status_code == 200:\n", + " _data = _response.json()\n", + " TABLE_NAME = _data.get(\"table_name\", \"unknown\")\n", + " TARGET_COLUMN = \"target\" # <-- update if your table uses a different column name\n", + " row_data = _data.get(\"row_data\", [])\n", + "\n", + " if not row_data:\n", + " raise RuntimeError(\n", + " f\"Binding found for table '{TABLE_NAME}' but row_data is empty. \"\n", + " \"The sync may still be in progress — wait a moment and re-run this cell.\"\n", + " )\n", + "\n", + " df = pd.DataFrame(row_data)\n", + "\n", + " if TARGET_COLUMN not in df.columns:\n", + " raise ValueError(\n", + " f\"Column '{TARGET_COLUMN}' not found in synced data. 
\"\n", + " f\"Available columns: {list(df.columns)}. \"\n", + " \"Update TARGET_COLUMN above to match your table's target column.\"\n", + " )\n", + "\n", + " print(f\"Loaded {len(df):,} rows, {len(df.columns)} columns from {TABLE_NAME}\")\n", + " print(f\"Last synced: {_data.get('last_synced_at', 'unknown')}\")\n", + " print(f\"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}\")\n", + " display(df.head())\n", + "\n", + " elif _response.status_code == 404:\n", + " raise RuntimeError(\n", + " \"No active Databricks table binding found for this model.\\n\\n\"\n", + " \"To fix:\\n\"\n", + " \" 1. Go to ValidMind → Settings → Integrations → Databricks\\n\"\n", + " \" 2. Open the model binding browser and select a Unity Catalog table\\n\"\n", + " \" 3. Wait ~30 seconds for the initial sync to complete\\n\"\n", + " \" 4. Re-run this cell\\n\\n\"\n", + " \"Or set USE_SYNTHETIC_FALLBACK = True above to continue with generated data.\"\n", + " )\n", + " else:\n", + " raise RuntimeError(\n", + " f\"Unexpected error loading dataset from ValidMind: \"\n", + " f\"{_response.status_code} — {_response.text}\"\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ---------------------------------------------------------------------------\n", + "# Synthetic data fallback — runs when USE_SYNTHETIC_FALLBACK = True\n", + "# Uses the Bank Customer Churn dataset pattern from ValidMind examples\n", + "# ---------------------------------------------------------------------------\n", + "if USE_SYNTHETIC_FALLBACK:\n", + " import numpy as np\n", + " from sklearn.datasets import make_classification\n", + "\n", + " np.random.seed(42)\n", + " X, y = make_classification(\n", + " n_samples=1000,\n", + " n_features=10,\n", + " n_informative=6,\n", + " n_redundant=2,\n", + " random_state=42,\n", + " )\n", + " feature_names = [\n", + " \"credit_score\", \"age\", \"tenure\", \"balance\",\n", + " \"num_products\", 
\"has_credit_card\", \"is_active_member\",\n", + " \"estimated_salary\", \"geography_encoded\", \"gender_encoded\",\n", + " ]\n", + " df = pd.DataFrame(X, columns=feature_names)\n", + " df[\"target\"] = y\n", + " TARGET_COLUMN = \"target\"\n", + " TABLE_NAME = \"synthetic\"\n", + "\n", + " print(f\"Using synthetic dataset: {len(df):,} rows, {len(df.columns)} columns\")\n", + " print(f\"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}\")\n", + " display(df.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5 — Prepare train/test split\n", + "\n", + "Split the dataset into a training set and a test set so you can train the model on one slice of the data and evaluate how it generalizes on data it hasn't seen:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "feature_columns = [c for c in df.columns if c != TARGET_COLUMN]\n", + "\n", + "train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)\n", + "\n", + "print(f'Train set: {len(train_df):,} rows')\n", + "print(f'Test set: {len(test_df):,} rows')\n", + "print(f'Features: {feature_columns}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6 — Train a simple model\n", + "\n", + "Train a gradient boosting classifier on the training set. This is a small, fast model that's well-suited to a quickstart. The goal here is to produce something documentable end-to-end, not to tune for accuracy." 
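, + "\n", + "If you later want to tune rather than just document, a small grid search is the natural next step. The sketch below is optional and assumes nothing from this notebook; it runs on stand-in synthetic data:\n", + "\n",
+ "```python\n",
+ "# Optional tuning sketch on stand-in data; not required for this quickstart.\n",
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.ensemble import GradientBoostingClassifier\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "X, y = make_classification(n_samples=200, n_features=5, random_state=42)\n",
+ "grid = GridSearchCV(\n",
+ "    GradientBoostingClassifier(random_state=42),\n",
+ "    param_grid={'n_estimators': [50, 100], 'max_depth': [2, 3]},\n",
+ "    cv=3,\n",
+ ")\n",
+ "grid.fit(X, y)\n",
+ "print(grid.best_params_)\n",
+ "```"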
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.ensemble import GradientBoostingClassifier\n", + "\n", + "model = GradientBoostingClassifier(n_estimators=100, random_state=42)\n", + "model.fit(train_df[feature_columns], train_df[TARGET_COLUMN])\n", + "\n", + "train_accuracy = model.score(train_df[feature_columns], train_df[TARGET_COLUMN])\n", + "test_accuracy = model.score(test_df[feature_columns], test_df[TARGET_COLUMN])\n", + "\n", + "print(f'Train accuracy: {train_accuracy:.4f}')\n", + "print(f'Test accuracy: {test_accuracy:.4f}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7 — Register datasets and model with ValidMind\n", + "\n", + "Before you can run tests, ValidMind needs to know about your datasets and your model. Wrap the training and test DataFrames with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) and the trained classifier with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model). Each call returns a ValidMind object that the test functions accept as input.\n", + "\n", + "The `input_id` you pass identifies each input when results are sent to the ValidMind Platform." 
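, + "\n", + "As a mental model only (this is not ValidMind's implementation), you can think of each wrapper as pairing your object with the metadata you pass, roughly:\n", + "\n",
+ "```python\n",
+ "# Simplified mental model of a dataset wrapper; NOT ValidMind's implementation.\n",
+ "from dataclasses import dataclass, field\n",
+ "\n",
+ "import pandas as pd\n",
+ "\n",
+ "@dataclass\n",
+ "class WrappedDataset:\n",
+ "    df: pd.DataFrame\n",
+ "    input_id: str\n",
+ "    target_column: str\n",
+ "    feature_columns: list = field(default_factory=list)\n",
+ "\n",
+ "    def __post_init__(self):\n",
+ "        # Infer feature columns when they aren't given explicitly\n",
+ "        if not self.feature_columns:\n",
+ "            self.feature_columns = [\n",
+ "                c for c in self.df.columns if c != self.target_column\n",
+ "            ]\n",
+ "\n",
+ "demo = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'target': [0, 1]})\n",
+ "wrapped = WrappedDataset(demo, input_id='train_dataset', target_column='target')\n",
+ "print(wrapped.feature_columns)\n",
+ "```"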
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_train_ds = vm.init_dataset(\n", + " dataset=train_df,\n", + " input_id=\"train_dataset\",\n", + " target_column=TARGET_COLUMN,\n", + ")\n", + "\n", + "vm_test_ds = vm.init_dataset(\n", + " dataset=test_df,\n", + " input_id=\"test_dataset\",\n", + " target_column=TARGET_COLUMN,\n", + ")\n", + "\n", + "vm_model = vm.init_model(\n", + " model=model,\n", + " input_id=\"gradient_boosting_model\",\n", + ")\n", + "\n", + "print('Datasets and model registered with ValidMind.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8 — Assign predictions to datasets\n", + "\n", + "Many tests compare predicted values against actual values, so ValidMind needs the model's predictions attached to each dataset. The [`assign_predictions()` method](https://docs.validmind.ai/validmind/validmind/vm_models.html#assign_predictions) computes predictions from your model and links them to the dataset object, once for the training set and once for the test set:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_train_ds.assign_predictions(model=vm_model)\n", + "vm_test_ds.assign_predictions(model=vm_model)\n", + "\n", + "print('Predictions assigned.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 9 — Run individual tests\n", + "\n", + "Run a few individual tests against your registered datasets and model to get familiar with how ValidMind tests work before running the full suite. 
Each call to [`vm.tests.run_test()`](https://docs.validmind.ai/validmind/validmind/tests.html#run_test) executes one test and renders the result inline in this notebook; calling `result.log()` then sends the result to the ValidMind Platform:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Dataset statistics — summary statistics for the training data\n", + "result = vm.tests.run_test(\n", + " \"validmind.data_validation.DatasetDescription\",\n", + " inputs={\"dataset\": vm_train_ds},\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Class imbalance check\n", + "result = vm.tests.run_test(\n", + " \"validmind.data_validation.ClassImbalance\",\n", + " inputs={\"dataset\": vm_train_ds},\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Confusion matrix — classification performance on the test set\n", + "result = vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ConfusionMatrix\",\n", + " inputs={\"dataset\": vm_test_ds, \"model\": vm_model},\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ROC curve\n", + "result = vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ROCCurve\",\n", + " inputs={\"dataset\": vm_test_ds, \"model\": vm_model},\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Feature importance\n", + "result = vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.FeatureImportance\",\n", + " inputs={\"dataset\": vm_train_ds, \"model\": vm_model},\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 10 — Run the full test 
suite\n", + "\n", + "Run the complete classifier documentation suite. This single call executes every test in the suite and sends all results to the ValidMind Platform, where they populate the corresponding sections of your model documentation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_suite_result = vm.run_test_suite(\n", + " \"classifier_full_suite\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " \"train_dataset\": vm_train_ds,\n", + " \"test_dataset\": vm_test_ds,\n", + " },\n", + ")\n", + "print('Full test suite completed and results sent to ValidMind Platform.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 11 — Verify results on the platform\n", + "\n", + "To see the results of this notebook in the ValidMind Platform:\n", + "\n", + "1. Go to the [ValidMind Platform](https://app.prod.validmind.ai) (or your ValidMind instance).\n", + "2. Navigate to **Model Inventory** and select your model.\n", + "3. Open the **Documentation** tab.\n", + "4. Confirm that the test results from this notebook appear in the relevant sections.\n", + "\n", + "After a successful run, you should see the following results in your model's documentation:\n", + "\n", + "- Dataset Description table\n", + "- Class Imbalance chart\n", + "- Confusion Matrix\n", + "- ROC Curve\n", + "- Feature Importance chart\n", + "- Full classifier suite results\n", + "\n", + "---\n", + "\n", + "## Troubleshooting\n", + "\n", + "If you run into any of the issues below, the table lists the likely fix:\n", + "\n", + "| Issue | Fix |\n", + "|-------|-----|\n", + "| `ModuleNotFoundError` after install | Re-run the `dbutils.library.restartPython()` cell. |\n", + "| `ConnectionError` on `vm.init()` | Your workspace may block outbound traffic. Check your network policy, or use a cluster with internet access. 
|\n", + "| `401 Unauthorized` on `vm.init()` | The API key or secret is incorrect. Copy your credentials again from the ValidMind Platform. |\n", + "| `numpy` version conflict | Pin a compatible version with `%pip install -q validmind \"numpy>=1.23,<2.0.0\"`. |\n", + "| `404` on dataset load | No Databricks table binding was found. Create one in **Settings → Integrations → Databricks**, then wait for the initial sync to complete. |\n", + "| `row_data is empty` after binding created | The initial sync is still running. Wait about 30 seconds and re-run Step 4. |\n", + "| Wrong columns or target not found | Update `TARGET_COLUMN` in Step 4 to match the target column in your Unity Catalog table. |\n", + "| Want to try the notebook without a binding | Set `USE_SYNTHETIC_FALLBACK = True` in Step 4 to use generated data. |" + ] + }, + { + "cell_type": "markdown", + "id": "copyright-08359404300c413f964cfb59cd670f71", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}