
Commit f014aa4

feat: create references for evaluations and simulations
1 parent 5c2b809 commit f014aa4

3 files changed

Lines changed: 358 additions & 0 deletions

File tree

skills/netra-best-practices/SKILL.md
skills/netra-best-practices/references/multi-turn-simulations.md
skills/netra-best-practices/references/single-turn-eval.md

skills/netra-best-practices/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ npm install netra-sdk
 ## Use-Case specific references

 - Instrumenting an LLM application: references/instrumentation.md
+- Setting up single-turn evaluations: references/single-turn-eval.md
+- Setting up multi-turn simulations: references/multi-turn-simulations.md

 ## Feedback

skills/netra-best-practices/references/multi-turn-simulations.md

Lines changed: 155 additions & 0 deletions

@@ -0,0 +1,155 @@
# Multi-Turn Simulations (Netra SDK)

Use this reference when the user asks to test an AI agent with goal-oriented, multi-turn conversations.

## Outcome

Set up simulation runs where:
- A multi-turn dataset defines realistic scenarios.
- A task wrapper calls the user's agent each turn.
- Netra evaluates whole conversations with simulation evaluators.

## Prerequisites

1. Netra SDK installed and initialized.
2. Multi-turn dataset created in the Netra dashboard.
3. Dataset includes scenario goal, max turns, persona, user data, and fact-checker data.
4. Simulation evaluators selected (start with Goal Fulfillment + Factual Accuracy).

## Recommended workflow for the agent

1. Confirm the target agent entry point (how to send one message and get one reply).
2. Verify Netra init happens before simulation execution.
3. Implement or extend a `BaseTask` wrapper around the user's agent.
4. Ensure session continuity is preserved (`session_id` / `sessionId`).
5. Run the simulation with conservative concurrency.
6. Report completed/failed counts and where to inspect results.

## Python template

```python
from uuid import uuid4

from netra import Netra
from netra.simulation.task import BaseTask
from netra.simulation.models import TaskResult

Netra.init(
    app_name="my-app",
    headers="x-api-key=YOUR_NETRA_API_KEY",
)

class MyAgentTask(BaseTask):
    def __init__(self, agent):
        self.agent = agent

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        # Netra can call the first turn with session_id=None.
        # Generate a fresh per-conversation session id in that case.
        sid = session_id or str(uuid4())
        # Replace with the user's real agent call
        reply = self.agent.chat(message, session_id=sid)
        return TaskResult(
            message=reply,
            session_id=sid,
        )

result = Netra.simulation.run_simulation(
    name="Customer Support Simulation",
    dataset_id="your-multi-turn-dataset-id",
    task=MyAgentTask(agent=...),
    context={"environment": "staging"},
    max_concurrency=3,
)

print("completed:", len(result["completed"]))
print("failed:", len(result["failed"]))
```

## TypeScript template

```typescript
import { Netra } from "netra-sdk";
import { BaseTask, TaskResult } from "netra-sdk/simulation";

await Netra.init({
  appName: "my-app",
  headers: `x-api-key=${process.env.NETRA_API_KEY}`,
});

class MyAgentTask extends BaseTask {
  constructor(private agent: any) {
    super();
  }

  async run(message: string, sessionId?: string | null): Promise<TaskResult> {
    // Netra can call the first turn with sessionId=null/undefined.
    // Generate a fresh per-conversation session id in that case.
    const sid = sessionId ?? crypto.randomUUID();
    // Replace with the user's real agent call
    const reply = await this.agent.chat(message, { sessionId: sid });
    return {
      message: String(reply?.text ?? reply ?? ""),
      sessionId: sid,
    };
  }
}

const result = await Netra.simulation.runSimulation({
  name: "Customer Support Simulation",
  datasetId: "your-multi-turn-dataset-id",
  task: new MyAgentTask(/* agent */),
  context: { environment: "staging" },
  maxConcurrency: 3,
});

console.log("completed:", result?.completed.length ?? 0);
console.log("failed:", result?.failed.length ?? 0);
```

## Dataset design guidance

When instructing users to create datasets in the dashboard, include the following (see the sketch after this list):
1. Clear scenario goal (what success looks like).
2. Realistic max turns (for support flows, 4-6 is a good default).
3. Persona fit (neutral/friendly/frustrated/confused/custom).
4. Simulated user data (context the simulator can reference).
5. Fact-checker values (critical facts that must be correct).
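
For concreteness, here is a minimal sketch of one scenario expressed as a Python dict. The field names (`scenario_goal`, `max_turns`, `persona`, `user_data`, `fact_checker`) are illustrative assumptions, not the dashboard's actual schema; map them onto whatever fields the Netra dashboard exposes.

```python
# Hypothetical shape of one multi-turn scenario. Field names are
# illustrative only and may not match the Netra dashboard schema.
scenario = {
    "scenario_goal": "User gets a refund initiated for a damaged item",
    "max_turns": 6,            # realistic cap for a support flow
    "persona": "frustrated",   # neutral/friendly/frustrated/confused/custom
    "user_data": {             # context the simulated user can reference
        "order_id": "A-10293",
        "item": "wireless headphones",
    },
    "fact_checker": {          # critical facts the agent must state correctly
        "refund_window_days": 30,
    },
}
```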

## Evaluator guidance

Simulation evaluators are session-level (they score the whole conversation).

Start with:
- `Goal Fulfillment`
- `Factual Accuracy`

Then add as needed:
- `Conversation Completeness`
- `Guideline Adherence`
- `Conversational Flow`
- `Conversation Memory`
- `Profile Utilization`
- `Information Elicitation`

## What to check after setup

1. Simulation run starts and returns totals.
2. Most conversations land in `completed`.
3. Failures include an actionable `error` and `turn_id`/`turnId` (see the triage sketch below).
4. Evaluation scores appear under Test Runs for each scenario.
5. Per-turn traces are available for debugging.
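
As a minimal triage sketch for step 3, assuming the Python `run_simulation` result is a dict with `completed` and `failed` lists (as in the template above) and that each failed entry carries the `error` and `turn_id` fields described here:

```python
# Minimal failure triage. Assumes failed entries are dicts exposing
# "error" and "turn_id"; adjust to the actual result shape if it differs.
print(f"{len(result['completed'])} completed, {len(result['failed'])} failed")
for failure in result["failed"]:
    print(f"  turn {failure.get('turn_id')}: {failure.get('error')}")
```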

## Troubleshooting guidance

- Repeated session resets: verify you propagate the returned session id on every turn.
- High failure count: reduce concurrency and inspect the first failing turn's trace.
- Unrealistic simulations: improve the scenario goal, persona, and user-data quality.
- Weak signal: raise evaluator pass thresholds once baseline quality is stable.

## References

- https://docs.getnetra.ai/quick-start/QuickStart_Simulation
- https://docs.getnetra.ai/Simulation/Datasets
- https://docs.getnetra.ai/Simulation/Evaluators
- https://docs.getnetra.ai/sdk-reference/simulation/python
- https://docs.getnetra.ai/sdk-reference/simulation/typescript

skills/netra-best-practices/references/single-turn-eval.md

Lines changed: 201 additions & 0 deletions

@@ -0,0 +1,201 @@
# Single-Turn Evaluations (Netra SDK)

Use this reference when the user asks to set up or automate single-turn evaluations (input -> output) with Netra.

## Outcome

Set up a repeatable evaluation loop where:
- Test cases live in a Netra dataset.
- A task function runs the user's app logic for each dataset item.
- Netra executes a test run and scores outputs with evaluators.

## Prerequisites

1. Netra SDK installed (`netra-sdk` for both Python and TypeScript).
2. Netra initialized with an API key/header.
3. At least one dataset configured in the dashboard, or created programmatically.
4. Evaluators attached in the dashboard (recommended), or passed in code.

## Recommended workflow for the agent

1. Confirm language/runtime (Python or TypeScript).
2. Ensure `Netra.init(...)` / `await Netra.init(...)` is called once at startup.
3. Fetch the dataset via the SDK.
4. Define `task(input)` that returns the system's output string/value.
5. Run the test suite with a clear run name and safe concurrency.
6. Return the run id and a quick status summary to the user.
7. Direct the user to Evaluation -> Test Runs for detailed scores.

## Python template

```python
from netra import Netra

Netra.init(
    app_name="my-app",
    headers="x-api-key=YOUR_NETRA_API_KEY",
)

def my_task(input_data):
    # Call the user's app/agent here and return generated output
    return f"response for: {input_data}"

dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

result = Netra.evaluation.run_test_suite(
    name="My Single-Turn Eval",
    data=dataset,
    task=my_task,
    evaluators=["correctness", "relevance"],  # optional
    max_concurrency=5,
)

print(result["runId"])
```

## TypeScript template

```typescript
import { Netra } from "netra-sdk";

await Netra.init({
  appName: "my-app",
  headers: `x-api-key=${process.env.NETRA_API_KEY}`,
});

async function myTask(inputData: any): Promise<string> {
  // Call the user's app/agent here and return generated output
  return `response for: ${String(inputData)}`;
}

const dataset = await Netra.evaluation.getDataset("your-dataset-id");

const result = await Netra.evaluation.runTestSuite(
  "My Single-Turn Eval",
  dataset,
  myTask,
  ["correctness", "relevance"], // optional
  5
);

console.log(result?.runId);
```

## Programmatic dataset management (optional)

Use these SDK APIs when the user wants setup fully in code:
- Python: `create_dataset`, `add_dataset_item`, `get_dataset`, `run_test_suite`
- TypeScript: `createDataset`, `addDatasetItem`, `getDataset`, `runTestSuite`

Minimal pattern:
1. Create the dataset.
2. Add dataset items with `input` and `expected_output`/`expectedOutput`.
3. Fetch the dataset and execute the test suite.

### Python example (fully programmatic)

```python
from netra import Netra

Netra.init(
    app_name="my-app",
    headers="x-api-key=YOUR_NETRA_API_KEY",
)

created = Netra.evaluation.create_dataset(name="Support QA Dataset")
dataset_id = created["datasetId"]

Netra.evaluation.add_dataset_item(
    dataset_id=dataset_id,
    item={
        "input": "What is your refund window?",
        "expected_output": "You can request a refund within 30 days of purchase.",
    },
)

Netra.evaluation.add_dataset_item(
    dataset_id=dataset_id,
    item={
        "input": "Do you support overnight shipping?",
        "expected_output": "Yes, overnight shipping is available in select regions.",
    },
)

def task(input_data):
    # Replace with your real app/agent call.
    return f"response for: {input_data}"

dataset = Netra.evaluation.get_dataset(dataset_id=dataset_id)

result = Netra.evaluation.run_test_suite(
    name="Support QA Programmatic Eval",
    data=dataset,
    task=task,
    max_concurrency=3,
)

print(result["runId"])
```

### TypeScript example (fully programmatic)

```typescript
import { Netra } from "netra-sdk";

await Netra.init({
  appName: "my-app",
  headers: `x-api-key=${process.env.NETRA_API_KEY}`,
});

const created = await Netra.evaluation.createDataset("Support QA Dataset");
const datasetId = created?.datasetId;

if (!datasetId) {
  throw new Error("Dataset creation failed: missing datasetId");
}

await Netra.evaluation.addDatasetItem(datasetId, {
  input: "What is your refund window?",
  expectedOutput: "You can request a refund within 30 days of purchase.",
});

await Netra.evaluation.addDatasetItem(datasetId, {
  input: "Do you support overnight shipping?",
  expectedOutput: "Yes, overnight shipping is available in select regions.",
});

const dataset = await Netra.evaluation.getDataset(datasetId);

const result = await Netra.evaluation.runTestSuite(
  "Support QA Programmatic Eval",
  dataset,
  async (inputData: any) => {
    // Replace with your real app/agent call.
    return `response for: ${String(inputData)}`;
  },
  undefined, // evaluators (optional)
  3
);

console.log(result?.runId);
```

## What to check after setup

1. A `runId` is returned (see the sanity-check sketch below).
2. Items are mostly `completed` (not `failed`).
3. Traces are linked for each test item.
4. Evaluator scores appear in Test Runs.
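
A minimal post-run sanity check, assuming only that the Python result is a dict carrying `runId` as in the templates above (no richer result fields are assumed):

```python
# Minimal sanity check. Assumes run_test_suite returns a dict with "runId",
# as shown in the templates above.
run_id = result.get("runId")
if not run_id:
    raise RuntimeError("No runId returned; check Netra.init and the dataset id")
print(f"Inspect scores under Evaluation -> Test Runs (run id: {run_id})")
```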

## Troubleshooting guidance

- No test runs: confirm the dataset has evaluators attached and that the correct dataset id is used.
- Empty/invalid outputs: ensure the `task` function returns an output for every item (see the wrapper sketch below).
- Too many failures/timeouts: lower concurrency first (`max_concurrency` / `maxConcurrency`).
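
One way to guard against the empty-output case is to wrap the user's task in a thin checker before passing it to `run_test_suite`. This is a sketch, not part of the Netra SDK; `my_task` stands in for the user's real task function:

```python
# Hypothetical defensive wrapper; not part of the Netra SDK.
def checked_task(input_data):
    output = my_task(input_data)  # the user's real task function
    if output is None or (isinstance(output, str) and not output.strip()):
        # Fail loudly so the empty output surfaces in the run instead of
        # being scored as a blank response.
        raise ValueError(f"Task returned empty output for input: {input_data!r}")
    return output

# Then pass checked_task instead of the raw task:
# result = Netra.evaluation.run_test_suite(name=..., data=dataset, task=checked_task)
```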

## References

- https://docs.getnetra.ai/quick-start/QuickStart_Evals
- https://docs.getnetra.ai/Evaluation/Datasets
- https://docs.getnetra.ai/sdk-reference/evaluation/python
- https://docs.getnetra.ai/sdk-reference/evaluation/typescript
