diff --git a/examples/triton_gpt2/GPT2-ONNX-Azure.ipynb b/examples/triton_gpt2/GPT2-ONNX-Azure.ipynb
deleted file mode 100644
index c06f2fc85e..0000000000
--- a/examples/triton_gpt2/GPT2-ONNX-Azure.ipynb
+++ /dev/null
@@ -1,836 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "liked-toronto",
- "metadata": {},
- "source": [
- "# Pretrained GPT2 Model Deployment Example\n",
- "\n",
- "In this notebook, we will run an example of text generation using a GPT2 model exported from HuggingFace and deployed with Seldon's Triton pre-packaged server. The example also covers converting the model to ONNX format.\n",
- "The example below implements the greedy approach to next-token prediction.\n",
- "More info: https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2\n",
- "\n",
- "After the model is deployed to Kubernetes, we will run a simple load test to evaluate its inference performance.\n",
- "\n",
- "\n",
- "## Steps:\n",
- "- [Download pretrained GPT2 model from Hugging Face](#hf)\n",
- "- [Convert the model to ONNX](#onnx)\n",
- "- [Store model in Azure Storage Blob](#blob)\n",
- "- [Create PersistentVolume and PVC](#pv) mounting Azure Storage Blob\n",
- "- [Setup Seldon-Core](#seldon) in your Kubernetes cluster\n",
- "- [Deploy the ONNX model](#sd) with Seldon’s prepackaged Triton server.\n",
- "- [Run model inference](#infer), run a greedy algorithm example (generate sentence completion)\n",
- "- [Monitor model with Azure Monitor](#azuremonitor)\n",
- "- [Run load test using vegeta](#vegeta)\n",
- "- [Clean-up](#cleanup)\n",
- "\n",
- "## Basic requirements\n",
- "* Helm v3.0.0+\n",
- "* A Kubernetes cluster running v1.13 or above (minikube / docker-for-windows work well if enough RAM)\n",
- "* kubectl v1.14+\n",
- "* Python 3.6+ "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "korean-reporter",
- "metadata": {},
- "outputs": [],
- "source": [
- "%%writefile requirements.txt\n",
- "transformers==4.5.1\n",
- "torch==1.8.1\n",
- "tokenizers<0.11,>=0.10.1\n",
- "tensorflow==2.4.1\n",
- "tf2onnx"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "assigned-diesel",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "!pip install --trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org -r requirements.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "completed-evaluation",
- "metadata": {},
- "source": [
- "### Export HuggingFace TFGPT2LMHeadModel pre-trained model and save it locally "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "iraqi-million",
- "metadata": {},
- "outputs": [],
- "source": [
- "from transformers import GPT2Tokenizer, TFGPT2LMHeadModel\n",
- "\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
- "model = TFGPT2LMHeadModel.from_pretrained(\n",
- " \"gpt2\", from_pt=True, pad_token_id=tokenizer.eos_token_id\n",
- ")\n",
- "model.save_pretrained(\"./tfgpt2model\", saved_model=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "further-tribute",
- "metadata": {},
- "source": [
- "### Convert the TensorFlow saved model to ONNX "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "irish-mountain",
- "metadata": {},
- "outputs": [],
- "source": [
- "!python -m tf2onnx.convert --saved-model ./tfgpt2model/saved_model/1 --opset 13 --output model.onnx"
- ]
- },
- {
- "source": [
- "## Azure Setup\n",
- "We have provided an [Azure Setup Notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example_azure_setup.html) that deploys an AKS cluster and an Azure storage account and installs the Azure Blob CSI driver. If an AKS cluster already exists, skip to the Blob Storage creation and CSI driver installation steps. Upon completion of the Azure setup, the following infrastructure will be created:\n",
- ""
- ],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "resource_group = \"seldon\" # feel free to replace or use this default\n",
- "aks_name = \"modeltests\"\n",
- "\n",
- "storage_account_name = \"modeltestsgpt\" # fill in\n",
- "storage_container_name = \"gpt2onnx\""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "sunset-pantyhose",
- "metadata": {},
- "source": [
- "### Copy your model to Azure Blob \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "lasting-performance",
- "metadata": {},
- "outputs": [],
- "source": [
- "%%time\n",
- "# Copy model file\n",
- "!az extension add --name storage-preview\n",
- "!az storage azcopy blob upload --container {storage_container_name} \\\n",
- " --account-name {storage_account_name} \\\n",
- " --source ./model.onnx \\\n",
- " --destination gpt2/1/model.onnx "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "\u001b[33mThis command has been deprecated and will be removed in future release. Use 'az storage fs file list' instead. For more information go to https://github.com/Azure/azure-cli/blob/dev/src/azure-cli/azure/cli/command_modules/storage/docs/ADLS%20Gen2.md\u001b[39m\n",
- "\u001b[33mThe behavior of this command has been altered by the following extension: storage-preview\u001b[0m\n",
- "Name IsDirectory Blob Type Blob Tier Length Content Type Last Modified Snapshot\n",
- "----------------- ------------- ----------- ----------- --------- ------------------------ ------------------------- ----------\n",
- "gpt2/1/model.onnx BlockBlob Hot 652535462 application/octet-stream 2021-05-28T04:37:11+00:00\n",
- "\u001b[0m"
- ]
- }
- ],
- "source": [
- "# Verify the uploaded file\n",
- "!az storage blob list \\\n",
- " --account-name {storage_account_name}\\\n",
- " --container-name {storage_container_name} \\\n",
- " --output table \n",
- " "
- ]
- },
- {
- "source": [
- "## Add Azure PersistentVolume and Claim \n",
- "For more details on creating PersistentVolume using CSI driver refer to https://github.com/kubernetes-sigs/blob-csi-driver/blob/master/deploy/example/e2e_usage.md\n",
- " - Create secret\n",
- " - Create PersistentVolume pointing to the secret and the Blob container name, with `mountOptions` specifying the user id for non-root containers \n",
- " - Create PersistentVolumeClaim to bind to volume"
- ],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "key = !az storage account keys list --account-name {storage_account_name} -g {resource_group} --query '[0].value' -o tsv\n",
- "storage_account_key = key[0]"
- ]
- },
- {
- "source": [],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create secret to access storage account\n",
- "!kubectl create secret generic azure-blobsecret --from-literal azurestorageaccountname={storage_account_name} --from-literal azurestorageaccountkey=\"{storage_account_key}\" --type=Opaque"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%writefile azure-blobfuse-pv.yaml\n",
- "apiVersion: v1\n",
- "kind: PersistentVolume\n",
- "metadata:\n",
- "  name: pv-gpt2blob\n",
- " \n",
- "spec:\n",
- "  capacity:\n",
- "    storage: 10Gi\n",
- "  accessModes:\n",
- "    - ReadWriteMany\n",
- "  persistentVolumeReclaimPolicy: Retain # \"Delete\" is not supported in static provisioning\n",
- "  csi:\n",
- "    driver: blob.csi.azure.com\n",
- "    readOnly: false\n",
- "    volumeHandle: trainingdata # make sure this volumeid is unique in the cluster\n",
- "    volumeAttributes:\n",
- "      containerName: gpt2onnx # Modify if changed in Notebook\n",
- "    nodeStageSecretRef:\n",
- "      name: azure-blobsecret\n",
- "      namespace: default\n",
- "  mountOptions: # Use same user id that is used by POD security context\n",
- "    - -o uid=8888\n",
- "    - -o allow_other\n",
- "---\n",
- "kind: PersistentVolumeClaim\n",
- "apiVersion: v1\n",
- "metadata:\n",
- "  name: pvc-gpt2blob\n",
- " \n",
- "spec:\n",
- "  accessModes:\n",
- "    - ReadWriteMany\n",
- "  resources:\n",
- "    requests:\n",
- "      storage: 10Gi\n",
- "  volumeName: pv-gpt2blob\n",
- "  storageClassName: \"\"\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "persistentvolume/pv-gptblob configured\n",
- "persistentvolumeclaim/pvc-gptblob unchanged\n"
- ]
- }
- ],
- "source": [
- "!kubectl apply -f azure-blobfuse-pv.yaml"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE\npersistentvolume/pv-gpt2blob 10Gi RWX Retain Bound default/pvc-gpt2blob 4h54m\n\nNAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE\npersistentvolumeclaim/pvc-gpt2blob Bound pv-gpt2blob 10Gi RWX 4h54m\n"
- ]
- }
- ],
- "source": [
- "# Verify PVC is bound\n",
- "!kubectl get pv,pvc"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "convinced-syracuse",
- "metadata": {},
- "source": [
- "### Run Seldon in your Kubernetes cluster \n",
- "\n",
- "Follow the [Seldon-Core Setup notebook](https://docs.seldon.io/projects/seldon-core/en/latest/examples/seldon_core_setup.html) to set up a cluster with Istio Ingress and install Seldon Core"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "backed-outreach",
- "metadata": {},
- "source": [
- "### Deploy your model with Seldon pre-packaged Triton server "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "beneficial-anime",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Overwriting gpt2-deploy.yaml\n"
- ]
- }
- ],
- "source": [
- "%%writefile gpt2-deploy.yaml\n",
- "apiVersion: machinelearning.seldon.io/v1alpha2\n",
- "kind: SeldonDeployment\n",
- "metadata:\n",
- "  name: gpt2gpu\n",
- "spec:\n",
- "  annotations:\n",
- "    prometheus.io/port: \"8002\" # we will explain below in Monitoring section\n",
- "    prometheus.io/path: \"/metrics\"\n",
- "  predictors:\n",
- "  - componentSpecs:\n",
- "    - spec:\n",
- "        containers:\n",
- "        - name: gpt2\n",
- "          resources:\n",
- "            requests:\n",
- "              memory: 2Gi\n",
- "              cpu: 2\n",
- "              nvidia.com/gpu: 1\n",
- "            limits:\n",
- "              memory: 4Gi\n",
- "              cpu: 4\n",
- "              nvidia.com/gpu: 1\n",
- "        tolerations:\n",
- "        - key: \"nvidia.com\" # to be able to run in GPU Nodepool\n",
- "          operator: \"Equal\"\n",
- "          value: \"gpu\"\n",
- "          effect: \"NoSchedule\"\n",
- "    graph:\n",
- "      implementation: TRITON_SERVER\n",
- "      logger:\n",
- "        mode: all\n",
- "      modelUri: pvc://pvc-gpt2blob/\n",
- "      name: gpt2\n",
- "      type: MODEL\n",
- "    name: default\n",
- "    replicas: 1\n",
- "  protocol: kfserving"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "subjective-involvement",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "seldondeployment.machinelearning.seldon.io/gpt2 created\n"
- ]
- }
- ],
- "source": [
- "!kubectl apply -f gpt2-deploy.yaml -n default"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "demanding-thesaurus",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "deployment \"gpt2gpu-default-0-gpt2\" successfully rolled out\n"
- ]
- }
- ],
- "source": [
- "!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=gpt2gpu -o jsonpath='{.items[0].metadata.name}')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "digital-supervisor",
- "metadata": {},
- "source": [
- "#### Interact with the model: get model metadata (a \"test\" request to make sure our model is available and loaded correctly)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "married-roller",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "* Trying 20.75.117.145:80...\n",
- "* TCP_NODELAY set\n",
- "* Connected to 20.75.117.145 (20.75.117.145) port 80 (#0)\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "* Mark bundle as not supporting multiuse\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "* Connection #0 to host 20.75.117.145 left intact\n",
- "{\"name\":\"gpt2\",\"versions\":[\"1\"],\"platform\":\"onnxruntime_onnx\",\"inputs\":[{\"name\":\"input_ids:0\",\"datatype\":\"INT32\",\"shape\":[-1,-1]},{\"name\":\"attention_mask:0\",\"datatype\":\"INT32\",\"shape\":[-1,-1]}],\"outputs\":[{\"name\":\"past_key_values\",\"datatype\":\"FP32\",\"shape\":[12,2,-1,12,-1,64]},{\"name\":\"logits\",\"datatype\":\"FP32\",\"shape\":[-1,-1,50257]}]}"
- ]
- }
- ],
- "source": [
- "ingress_ip = !(kubectl get svc --namespace istio-system istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')\n",
- "ingress_ip = ingress_ip[0]\n",
- "\n",
- "!curl -v http://{ingress_ip}:80/seldon/default/gpt2gpu/v2/models/gpt2"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "anonymous-resource",
- "metadata": {},
- "source": [
- "### Run prediction test: generate a sentence completion using GPT2 model - Greedy approach \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "modified-termination",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence .\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love the\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love the way\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love the way it\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love the way it 's\n",
- "sending request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n",
- "Sentence: I love Artificial Intelligence . I love the way it 's designed\n",
- "Input: I love Artificial Intelligence\n",
- "Output: I love Artificial Intelligence . I love the way it 's designed\n"
- ]
- }
- ],
- "source": [
- "import http\n",
- "import json\n",
- "\n",
- "import numpy as np\n",
- "import requests\n",
- "from transformers import GPT2Tokenizer\n",
- "\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
- "input_text = \"I love Artificial Intelligence\"\n",
- "count = 0\n",
- "max_gen_len = 8\n",
- "gen_sentence = input_text\n",
- "while count < max_gen_len:\n",
- "    input_ids = tokenizer.encode(gen_sentence, return_tensors=\"tf\")\n",
- "    shape = input_ids.shape.as_list()\n",
- "    payload = {\n",
- "        \"inputs\": [\n",
- "            {\n",
- "                \"name\": \"input_ids:0\",\n",
- "                \"datatype\": \"INT32\",\n",
- "                \"shape\": shape,\n",
- "                \"data\": input_ids.numpy().tolist(),\n",
- "            },\n",
- "            {\n",
- "                \"name\": \"attention_mask:0\",\n",
- "                \"datatype\": \"INT32\",\n",
- "                \"shape\": shape,\n",
- "                \"data\": np.ones(shape, dtype=np.int32).tolist(),\n",
- "            },\n",
- "        ]\n",
- "    }\n",
- "\n",
- "    tfserving_url = (\n",
- "        \"http://\" + str(ingress_ip) + \"/seldon/default/gpt2gpu/v2/models/gpt2/infer\"\n",
- "    )\n",
- "    print(f\"sending request to {tfserving_url}\")\n",
- "\n",
- "    with requests.post(tfserving_url, json=payload) as ret:\n",
- "        try:\n",
- "            res = ret.json()\n",
- "        except ValueError:  # the response body was not valid JSON\n",
- "            break  # stop instead of retrying the same request forever\n",
- "\n",
- "    # extract logits\n",
- "    logits = np.array(res[\"outputs\"][1][\"data\"])\n",
- "    logits = logits.reshape(res[\"outputs\"][1][\"shape\"])\n",
- "\n",
- "    # pick the highest-scoring next token at the last input position (greedy approach)\n",
- "    next_token = logits.argmax(axis=2)[0]\n",
- "    next_token_str = tokenizer.decode(\n",
- "        next_token[-1:], skip_special_tokens=True, clean_up_tokenization_spaces=True\n",
- "    ).strip()\n",
- "    gen_sentence += \" \" + next_token_str\n",
- "    print(f\"Sentence: {gen_sentence}\")\n",
- "\n",
- "    count += 1\n",
- "\n",
- "print(f\"Input: {input_text}\\nOutput: {gen_sentence}\")"
- ]
- },
- {
- "source": [
- "## Configure Model Monitoring with Azure Monitor \n",
- "Azure Monitor Container Insights can collect data from any Prometheus endpoint. This removes the need to install and operate a Prometheus server and manage the monitoring data, because Azure Monitor provides a centralized point for collecting, displaying and alerting on monitoring data. To turn on Azure Monitor Container Insights, follow the steps described [here](https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-onboard) and verify that you have an `omsagent` pod running."
- ],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "source": [
- "!kubectl get pods -n kube-system | grep omsagent"
- ],
- "cell_type": "code",
- "metadata": {},
- "execution_count": 5,
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "omsagent-27lk7 1/1 Running 3 12d\nomsagent-7q49d 1/1 Running 3 12d\nomsagent-9slf6 1/1 Running 3 12d\nomsagent-kzbkr 1/1 Running 3 12d\nomsagent-q85hk 1/1 Running 3 12d\nomsagent-rs-5976fbdc8b-rgxs4 1/1 Running 0 8d\nomsagent-tpkq2 1/1 Running 3 12d\n"
- ]
- }
- ]
- },
- {
- "source": [
- "### Configure Prometheus Metrics scraping\n",
- "Once `omsagent` is running, we need to configure it to collect metrics from Prometheus endpoints. Azure Monitor Container Insights allows configuration to be applied at cluster-wide or node-wide scope, with the endpoints to monitor specified in one of the following ways:\n",
- "- Provide an array of URLs \n",
- "- Provide an array of Kubernetes services\n",
- "- Enable monitoring of any pods with Prometheus annotations\n",
- "For more details on how to configure the scraping endpoints and query collected data refer to [MS Docs on Configure scraping of Prometheus metrics with Container insights](https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration)\n",
- "\n",
- "Our deployed model metrics are available from two infrastructure layers: [Seldon model orchestrator metrics](https://docs.seldon.io/projects/seldon-core/en/latest/analytics/analytics.html) and [Nvidia Triton Server Metrics](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md). To enable scraping for both endpoints, we updated the Microsoft-provided default `ConfigMap` that configures `omsagent`, [azure-metrics-cm.yaml](./azure-metrics-cm.yaml):\n",
- "- **Triton Server:** update `monitor_kubernetes_pods = true` to enable scraping for Pods with `prometheus.io` annotations\n",
- " In the SeldonDeployment shown above, `prometheus.io/path` and `prometheus.io/port` point to the default Triton metrics endpoint\n",
- "- **Seldon Orchestrator:** add our deployed model seldon service endpoint to list of Kubernetes services to be scraped: \n",
- " ```yaml\n",
- " kubernetes_services = [\"http://gpt2gpu-default.default:8000/prometheus\"]\n",
- " ``` "
- ],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!kubectl apply -f azure-metrics-cm.yaml"
- ]
- },
- {
- "source": [
- "## Query and Visualize collected data\n",
- "Collected metrics are available in the Logs blade of Azure Monitor, in the **InsightsMetrics** table. To see all gathered metrics, run the following query:\n",
- "\n",
- "```kusto\n",
- "InsightsMetrics\n",
- "| where Namespace == \"prometheus\" \n",
- "```\n",
- "\n",
- "To get model inference requests per minute from the Seldon metrics, run the following query and pin it to a dashboard or add it to an Azure Monitor Workbook:\n",
- "\n",
- "```kusto\n",
- "InsightsMetrics \n",
- "| where Namespace == \"prometheus\"\n",
- "| where Name == \"seldon_api_executor_server_requests_seconds_count\"\n",
- "| extend Model = parse_json(Tags).deployment_name\n",
- "| where parse_json(Tags).service == \"predictions\" \n",
- "| order by TimeGenerated asc \n",
- "| extend RequestsPerMin = Val - prev(Val,1)\n",
- "| project TimeGenerated, RequestsPerMin\n",
- "| render areachart \n",
- "```\n",
- "\n",
- "\n",
- "To get Inference Duration from Triton Metrics:\n",
- "\n",
- "```kusto\n",
- "InsightsMetrics \n",
- "| where Namespace == \"prometheus\"\n",
- "| where Name in (\"nv_inference_request_duration_us\")\n",
- "| order by TimeGenerated asc\n",
- "| extend RequestDurationMs = (Val - prev(Val, 1)) / 1000\n",
- "| project TimeGenerated, Name, RequestDurationMs\n",
- "| render areachart \n",
- "```\n",
- "\n",
- "Here is an example dashboard we created using the queries above\n",
- "\n",
- " \n"
- ],
- "cell_type": "markdown",
- "metadata": {}
- },
- {
- "cell_type": "markdown",
- "id": "colored-status",
- "metadata": {},
- "source": [
- "### Run Load Test / Performance Test using vegeta "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "exempt-discovery",
- "metadata": {},
- "source": [
- "#### Install vegeta. For more details, see the official [vegeta](https://github.com/tsenart/vegeta#install) documentation"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "id": "interesting-laptop",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "--2021-05-28 18:40:27-- https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-arm64.tar.gz\n",
- "Resolving github.com (github.com)... 140.82.114.4\n",
- "Connecting to github.com (github.com)|140.82.114.4|:443... connected.\n",
- "HTTP request sent, awaiting response... 302 Found\n",
- "Location: https://github-releases.githubusercontent.com/12080551/ba68d580-6e90-11ea-8bd2-3f43f5c08b3c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210528%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210528T224014Z&X-Amz-Expires=300&X-Amz-Signature=2efad77c33f1663eea17d366986bfad1cd081128d45012c9b6e6659c4c80eff6&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=12080551&response-content-disposition=attachment%3B%20filename%3Dvegeta-12.8.3-linux-arm64.tar.gz&response-content-type=application%2Foctet-stream [following]\n",
- "--2021-05-28 18:40:27-- https://github-releases.githubusercontent.com/12080551/ba68d580-6e90-11ea-8bd2-3f43f5c08b3c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210528%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210528T224014Z&X-Amz-Expires=300&X-Amz-Signature=2efad77c33f1663eea17d366986bfad1cd081128d45012c9b6e6659c4c80eff6&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=12080551&response-content-disposition=attachment%3B%20filename%3Dvegeta-12.8.3-linux-arm64.tar.gz&response-content-type=application%2Foctet-stream\n",
- "Resolving github-releases.githubusercontent.com (github-releases.githubusercontent.com)... 185.199.108.154, 185.199.109.154, 185.199.110.154, ...\n",
- "Connecting to github-releases.githubusercontent.com (github-releases.githubusercontent.com)|185.199.108.154|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 3281900 (3.1M) [application/octet-stream]\n",
- "Saving to: ‘vegeta-12.8.3-linux-arm64.tar.gz.2’\n",
- "\n",
- "vegeta-12.8.3-linux 100%[===================>] 3.13M 2.95MB/s in 1.1s \n",
- "\n",
- "2021-05-28 18:40:28 (2.95 MB/s) - ‘vegeta-12.8.3-linux-arm64.tar.gz.2’ saved [3281900/3281900]\n",
- "\n",
- "CHANGELOG\n",
- "LICENSE\n",
- "README.md\n",
- "vegeta\n"
- ]
- }
- ],
- "source": [
- "!wget https://github.com/tsenart/vegeta/releases/download/v12.8.3/vegeta-12.8.3-linux-arm64.tar.gz\n",
- "!tar -zxvf vegeta-12.8.3-linux-arm64.tar.gz\n",
- "!chmod +x vegeta"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "friendly-lying",
- "metadata": {},
- "source": [
- "#### Generate a vegeta [target file](https://github.com/tsenart/vegeta#-targets) containing a \"post\" command with the payload in the required structure"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "reliable-croatia",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "preparing request to http://20.75.117.145/seldon/default/gpt2gpu/v2/models/gpt2/infer\n"
- ]
- }
- ],
- "source": [
- "import base64\n",
- "import json\n",
- "from subprocess import PIPE, Popen, run\n",
- "\n",
- "import numpy as np\n",
- "from transformers import GPT2Tokenizer, TFGPT2LMHeadModel\n",
- "\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
- "input_text = \"I enjoy working in Seldon\"\n",
- "input_ids = tokenizer.encode(input_text, return_tensors=\"tf\")\n",
- "shape = input_ids.shape.as_list()\n",
- "payload = {\n",
- "    \"inputs\": [\n",
- "        {\n",
- "            \"name\": \"input_ids:0\",\n",
- "            \"datatype\": \"INT32\",\n",
- "            \"shape\": shape,\n",
- "            \"data\": input_ids.numpy().tolist(),\n",
- "        },\n",
- "        {\n",
- "            \"name\": \"attention_mask:0\",\n",
- "            \"datatype\": \"INT32\",\n",
- "            \"shape\": shape,\n",
- "            \"data\": np.ones(shape, dtype=np.int32).tolist(),\n",
- "        },\n",
- "    ]\n",
- "}\n",
- "tfserving_url = (\n",
- "    \"http://\" + str(ingress_ip) + \"/seldon/default/gpt2gpu/v2/models/gpt2/infer\"\n",
- ")\n",
- "print(f\"preparing request to {tfserving_url}\")\n",
- "\n",
- "cmd = {\n",
- "    \"method\": \"POST\",\n",
- "    \"header\": {\"Content-Type\": [\"application/json\"]},\n",
- "    \"url\": tfserving_url,\n",
- "    \"body\": base64.b64encode(bytes(json.dumps(payload), \"utf-8\")).decode(\"utf-8\"),\n",
- "}\n",
- "\n",
- "with open(\"vegeta_target.json\", mode=\"w\") as file:\n",
- "    json.dump(cmd, file)\n",
- "    file.write(\"\\n\\n\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "tribal-statistics",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Requests [total, rate, throughput] 60, 1.02, 0.95\nDuration [total, attack, wait] 1m3s, 58.994s, 4.445s\nLatencies [min, mean, 50, 90, 95, 99, max] 1.45s, 4.003s, 3.983s, 5.249s, 6.329s, 7.876s, 7.97s\nBytes In [total, mean] 475803960, 7930066.00\nBytes Out [total, mean] 13140, 219.00\nSuccess [ratio] 100.00%\nStatus Codes [code:count] 200:60 \nError Set:\n"
- ]
- }
- ],
- "source": [
- "!./vegeta attack -targets=vegeta_target.json -rate=1 -duration=60s -format=json | ./vegeta report -type=text"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "patient-suite",
- "metadata": {},
- "source": [
- "### Clean-up "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "pacific-collectible",
- "metadata": {},
- "outputs": [],
- "source": [
- "!kubectl delete -f gpt2-deploy.yaml -n default"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3.8.5 64-bit"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.5"
- },
- "interpreter": {
- "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}