diff --git a/README.md b/README.md
index 476d588..8b65ff1 100644
--- a/README.md
+++ b/README.md
@@ -8,221 +8,731 @@ pinned: false
 app_port: 7860
 tags:
   - openenv
+  - reinforcement-learning
+  - evaluation
+  - agents
 ---
-# Indian Government Scheme Enrollment — RL Environment
+# 🏛️ Indian Government Scheme Enrollment — RL Environment
-An open-source Reinforcement Learning environment simulating the workflow of an Indian Government CSC (Common Service Centre) operator. An LLM-based agent must interview applicants, collect missing documents, detect boundary fraud, and either enroll them in the correct welfare scheme or safely escalate contradictory cases to a senior officer.
+> *A reinforcement learning benchmark for bureaucratic reasoning: interviewing applicants, verifying documents, applying strict scheme rules, detecting fraud, and knowing when to escalate rather than decide.*
-## Why This Exists
+[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-blue)](https://huggingface.co/spaces/advikdivekar/scheme-enrollment-env)
+[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/advikdivekar/rl-agent)
+[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-green)](https://huggingface.co/openenv)
+[![Tests](https://img.shields.io/badge/Tests-20%20Passing-brightgreen)](tests/)
+[![Tasks](https://img.shields.io/badge/Tasks-5%20Graded-orange)](#-the-5-tasks)
-Millions of rural Indians access government welfare schemes through CSC operators — human workers who interview applicants, verify documents, and submit applications. This process requires multi-step reasoning, strict rule adherence, and the ability to detect fraud. This environment trains and evaluates AI agents on that exact workflow, filling a real gap in the RL/agent evaluation ecosystem.
+## The Case Study
-## MDP Formalization
+Priya is a CSC operator in Barmer, Rajasthan. 
She interviews dozens of applicants every day across a wooden desk, a government-issue computer, and a slow internet connection. One afternoon, a young man walks in claiming to be a student. He wants to enroll in PMKVY, a skill-training scheme. On the surface, his profile looks plausible. + +But something feels wrong. His income is unusually high for a student. Priya asks for his PAN card. It reveals six years of active pension-linked employment from a public sector company. He is not a student. He is attempting to claim a benefit under false pretenses. + +Priya does not guess. She does not overreach. She escalates the case. + +**This environment trains AI agents to behave like Priya.** + +Not just to read a table of rules, but to: + +- gather missing information before acting +- verify the right document at the right time +- apply exact arithmetic boundaries +- ignore irrelevant context +- distinguish ineligibility from contradiction +- escalate only when escalation is genuinely required + +## Why This Environment Exists + +Most RL and agent benchmarks focus on coding, games, search, or generic dialogue. Very few test policy compliance under partial observability, exact thresholds, and procedural safety. + +This environment exists to measure a harder and more realistic capability cluster: + +- **Policy compliance under uncertainty**: the agent must collect evidence before deciding +- **Fraud detection through document verification**: contradictions emerge only after the correct document is requested +- **Boundary arithmetic**: `9999` qualifies, `10000` does not +- **Escalation protocol**: the agent must know when not to decide +- **Noise filtering**: irrelevant profile fields appear alongside real signal + +The benchmark is grounded in a workflow that affects welfare access, fraud prevention, and administrative fairness. 
+ +## Hackathon Compliance Snapshot + +This repository is structured to satisfy the official Round 1 requirements: + +- real-world task simulation, not a toy domain +- full OpenEnv environment with typed models, `step()`, `reset()`, `state()`, and `openenv.yaml` +- 5 graded tasks with deterministic programmatic scoring in the `0.0โ€“1.0` range +- meaningful reward shaping over the trajectory +- root-level `inference.py` using the OpenAI client +- Dockerfile plus Hugging Face Space deployment metadata +- README coverage for environment description, action space, observation space, tasks, setup, and baseline scores + +## Table of Contents + +- [Environment at a Glance](#environment-at-a-glance) +- [Repository Structure](#repository-structure) +- [Architecture Overview](#architecture-overview) +- [System Architecture](#system-architecture) +- [Agent-Environment Architecture](#agent-environment-architecture) +- [Training Pipeline Architecture](#training-pipeline-architecture) +- [Reward Architecture](#reward-architecture) +- [Deployment and Inference Architecture](#deployment-and-inference-architecture) +- [Data Flow Architecture](#data-flow-architecture) +- [Environment Contract](#environment-contract) +- [Action Space](#action-space) +- [Observation Space](#observation-space) +- [Scheme Eligibility Rules](#scheme-eligibility-rules) +- [The 5 Tasks](#-the-5-tasks) +- [The Distraction Trap](#-the-distraction-trap) +- [Benchmark Outputs and Screenshots](#-benchmark-outputs-and-screenshots) +- [Baseline Results](#-baseline-results) +- [Setup and Running](#-setup-and-running) +- [Environment Variables](#-environment-variables) +- [Testing](#-testing) +- [OpenEnv Compliance](#-openenv-compliance) + +## Environment at a Glance | Component | Definition | |---|---| -| **State (S)** | Worker profile (16 fields: age, income, occupation, has_aadhaar, family_income, worker_type, has_epfo, has_esic, is_govt_employee, has_pan, has_bank_account, has_pucca_house, is_pregnant, 
first_child, is_income_tax_payer, not_nps) + application form state + step count |
-| **Action (A)** | 5 discrete actions: ask_question, request_document, approve_scheme, reject_applicant, escalate |
-| **Transition (T)** | Deterministic given persona — ask_question reveals hidden fields, verify_document surfaces contradictions |
-| **Reward (R)** | Dense per-step rewards (see reward table below) + terminal bonus |
-| **Discount (γ)** | 1.0 — episodic task, all steps matter equally |
-| **Max Steps** | 20 per episode |
+| **State (S)** | Applicant profile, partial observation state, hidden persona fields, step count |
+| **Action (A)** | `ask_question`, `request_document`, `approve_scheme`, `reject_applicant`, `escalate` |
+| **Transition (T)** | Deterministic given persona and task template |
+| **Reward (R)** | Intermediate shaping plus terminal outcome rewards |
+| **Horizon** | 20 steps per episode |
+| **Grader** | Terminal normalized score `0.0` to `1.0` |
+| **Server** | FastAPI via OpenEnv `create_app` |
+| **Inference** | OpenAI-compatible client, provider-agnostic |
+| **Benchmarking** | Sequential multi-model runner + graph-first report generator |
+
+## Repository Structure
+
+```text
+.
+├── README.md
+├── pyproject.toml
+├── requirements.txt
+├── uv.lock
+├── Dockerfile
+├── openenv.yaml
+├── .env.example
+├── models.py
+├── client.py
+├── inference.py
+├── benchmark_runner.py
+├── benchmark_report.py
+├── server/
+│   ├── __init__.py
+│   ├── app.py
+│   ├── models.py
+│   ├── scheme_env_environment.py
+│   └── schemes.py
+├── tests/
+│   ├── conftest.py
+│   └── test_scheme_eligibility.py
+└── reports/
+    ├── inference_logs/
+    └── test_logs/
+```
-## Action Space
+### What each major file does
+
+- [server/app.py](server/app.py): FastAPI/OpenEnv server entrypoint exposing `/reset`, `/step`, and `/health`
+- [server/scheme_env_environment.py](server/scheme_env_environment.py): environment lifecycle, task logic, reward shaping, step transitions, shared state, metadata sanitization
+- [server/schemes.py](server/schemes.py): scheme metadata, eligibility logic, optimal scheme selection
+- [models.py](models.py): root `Action` and `Observation` schemas used by inference and server logic
+- [client.py](client.py): OpenEnv client wrapper for typed environment access
+- [inference.py](inference.py): single-model evaluation loop
+- [benchmark_runner.py](benchmark_runner.py): multi-model benchmark orchestration
+- [benchmark_report.py](benchmark_report.py): report and chart generation from benchmark artifacts
+- [tests/test_scheme_eligibility.py](tests/test_scheme_eligibility.py): boundary-condition and grading tests
+- [reports](reports): benchmark outputs, summary files, charts, and archived logs
+
+## Architecture Overview
+
+This repo has a clean separation between:
+
+1. the **environment runtime**
+2. the **model interaction loop**
+3. the **benchmark orchestration layer**
+4. 
the **reporting and visualization layer** + +## System Architecture + +```mermaid +flowchart LR + A["LLM / External Policy"] --> B["inference.py
Prompting + JSON extraction"] + B --> C["OpenEnv HTTP API
/reset /step"] + C --> D["server/app.py
FastAPI + create_app"] + D --> E["SchemeEnvEnvironment
server/scheme_env_environment.py"] + E --> F["Persona Generation"] + E --> G["Observation Builder"] + E --> H["Reward + Grader Logic"] + E --> I["Scheme Rules
server/schemes.py"] + B --> J["Run Logs"] + J --> K["benchmark_runner.py"] + K --> L["CSV / JSON / manifests / logs"] + L --> M["benchmark_report.py"] + M --> N["Charts + summaries + screenshots"] +``` -| Action | Value | Description | Reward | -|---|---|---|---| -| `ask_question` | field name | Gather missing eligibility data | 0.0 valid step, -0.10 noise/redundant | -| `request_document` | document name | Request verification documents | 0.0 valid step | -| `approve_scheme` | scheme name | Enroll applicant in optimal scheme | +10.0 (optimal), +3.0 (suboptimal), -5.0 (wrong) | -| `reject_applicant` | category | Reject ineligible applicant | +5.0 (correct), -5.0 (incorrect) | -| `escalate` | category or empty | Hand off contradictory case to senior officer | +10.0 (Task 4 only), -2.0 (other tasks) | +### Runtime layers + +- **Inference layer**: talks to external models and formats actions +- **API layer**: standard OpenEnv-compatible transport over HTTP +- **Environment layer**: task logic, hidden persona state, reward logic +- **Data layer**: scheme rules and typed schemas +- **Reporting layer**: benchmark aggregation and visualization + +## Agent-Environment Architecture + +```mermaid +sequenceDiagram + participant Agent as LLM Agent + participant Runner as inference.py + participant API as FastAPI/OpenEnv + participant Env as SchemeEnvEnvironment + + Agent->>Runner: JSON action + Runner->>API: POST /step + API->>Env: step(action) + Env->>Env: Validate action + Env->>Env: Update hidden state + Env->>Env: Compute reward and terminal result + Env->>Env: Strip hidden metadata + Env-->>API: Observation + API-->>Runner: Observation + reward + done + Runner-->>Agent: Updated state context +``` + +### Core interaction pattern -**Valid field names for ask_question:** `age`, `income`, `occupation`, `has_aadhaar` +- the agent never mutates internal state directly +- every step is mediated through a strict typed action schema +- the environment can soft-block some wrong 
protocol steps and allow recovery +- the final score depends on both correctness and efficiency + +## Training Pipeline Architecture + +This repository is an **evaluation and benchmarking pipeline**, not an on-policy RL training loop with replay buffers and optimizer steps. Still, there is a clear training-style pipeline structure: + +```mermaid +flowchart TD + A["Environment Server"] --> B["inference.py
single-model episodes"] + B --> C["Per-task mean/std scores"] + C --> D["benchmark_runner.py
multi-model orchestration"] + D --> E["Run bundle
CSV + JSON + manifests + raw logs"] + E --> F["benchmark_report.py"] + F --> G["Leaderboard charts"] + F --> H["Task heatmaps"] + F --> I["Difficulty profile"] + F --> J["Summary artifacts"] +``` -**Valid document names for request_document:** `aadhaar_card`, `pan_card`, `aadhaar`, `pan` +### What this pipeline enables -**Valid scheme names for approve_scheme:** `PMKVY`, `MGNREGS`, `PMAY` +- repeated evaluation over randomized personas +- capability comparison across model sizes and families +- exploit detection through artifact inspection +- offline report regeneration without rerunning expensive inference -**Valid decision categories for reject/escalate:** `AGE_EXCEEDED`, `INCOME_TOO_HIGH`, `NO_ELIGIBLE_SCHEME`, `MISSING_REQUIRED_DATA`, `DATA_MISMATCH`, `DOCUMENT_CONFLICT`, `MANUAL_REVIEW_REQUIRED` +## Reward Architecture + +The reward system has three layers: + +1. **intermediate shaping** +2. **terminal outcome reward** +3. **continuous grader score** + +### Intermediate shaping + +| Event | Reward | +|---|---| +| Valid `ask_question` | `0.0` | +| Valid `request_document` | `0.0` | +| Noise query | `-0.10` | +| Redundant query | `-0.10` | +| Soft-block protocol violation | `-1.0` to `-1.5` depending on task/context | + +### Terminal outcomes + +| Event | Reward | +|---|---| +| Correct optimal approval | `+10.0` | +| Correct escalation | `+10.0` | +| Correct rejection | `+5.0` | +| Suboptimal but eligible approval | `+3.0` | +| Wrong escalation | `-2.0` | +| Wrong rejection | `-5.0` | +| Ineligible approval | `-5.0` | +| Premature approval | `-5.0` | +| Timeout | `-2.0` | + +### Continuous grader + +```text +grader_score = max(0.30, min(1.0, base_score - penalty + bonus)) +``` + +Where: + +```text +penalty = + (noise_queries * 0.08) + + (redundant_queries * 0.05) + + (wasted_steps * 0.04) # Task 2 only + +bonus = + 0.05 if document_verified else 0.0 +``` + +### Why this design is strong + +- correct but sloppy agents still outrank wrong agents +- agents cannot 
farm intermediate reward +- document protocol adherence is rewarded +- score remains leaderboard-friendly + +## Deployment and Inference Architecture + +```mermaid +flowchart TD + A["Dockerfile"] --> B["Pinned openenv-base image"] + B --> C["uv sync --frozen"] + C --> D["Runtime container"] + D --> E["uvicorn server.app:app :7860"] + E --> F["/health"] + G["inference.py"] --> H["OpenAI, Hugging Face Router, or NVIDIA NIM"] + G --> E +``` + +### Deployment characteristics + +- Dockerfile uses a multi-stage build +- base image is sha256-pinned +- `uv.lock` is used for reproducible dependency resolution +- server runs with `uvicorn server.app:app` +- health checks hit `/health` + +### Inference characteristics + +- all LLM calls use the OpenAI Python client +- the client is configured from environment variables in `inference.py` +- structured stdout logs follow `[START]`, `[STEP]`, and `[END]` +- provider normalization remains in place for compatible endpoints + +## Data Flow Architecture + +```mermaid +flowchart TD + A["reset(seed)"] --> B["generate_dynamic_persona(task_id)"] + B --> C["_make_fresh_obs(task, persona)"] + C --> D["Inject 1-3 noise fields"] + D --> E["Return initial observation"] + E --> F["Agent chooses action"] + F --> G["step(action)"] + G --> H{"Action type"} + H -->|ask_question| I["Reveal field or penalize noise/redundancy"] + H -->|request_document| J["Reveal document-backed truth"] + H -->|approve_scheme| K["Check optimality or protocol violation"] + H -->|reject_applicant| L["Check rejection validity"] + H -->|escalate| M["Allow only verified contradiction path"] + I --> N["_compute_grader_score at terminal only"] + J --> N + K --> N + L --> N + M --> N + N --> O["_finalize_step()"] + O --> P["Persist full internal state"] + O --> Q["Return sanitized observation to agent"] +``` + +### Important data flow properties + +- hidden persona flags never go directly to the model +- internal metadata is stripped before return +- timeout enforcement 
happens centrally in `_finalize_step()` +- all step paths converge through the same finalization logic + +## Environment Contract + +The environment follows the OpenEnv contract with: + +- `POST /reset` +- `POST /step` +- `GET /health` + +[openenv.yaml](openenv.yaml) currently specifies: + +- `name: scheme_env` +- `version: 0.2.0` +- `runtime: fastapi` +- `app: server.app:app` +- `port: 7860` +- `max_steps: 20` + +## Action Space + +| Action | Valid Values | Description | Reward | +|---|---|---|---| +| `ask_question` | `age`, `income`, `occupation`, `has_aadhaar` | Request a specific eligibility field | `0.0` valid, `-0.10` redundant/noise | +| `request_document` | `aadhaar_card`, `pan_card`, `aadhaar`, `pan` | Request an official verification document | `0.0` valid, reveals hidden truth | +| `approve_scheme` | `PMKVY`, `MGNREGS`, `PMAY` | Enroll the applicant in a scheme | `+10.0`, `+3.0`, or `-5.0` | +| `reject_applicant` | `AGE_EXCEEDED`, `INCOME_TOO_HIGH`, `NO_ELIGIBLE_SCHEME`, `MISSING_REQUIRED_DATA`, `DATA_MISMATCH`, `DOCUMENT_CONFLICT` | Reject with a concise reason code | `+5.0` or `-5.0` | +| `escalate` | `DATA_MISMATCH`, `MANUAL_REVIEW_REQUIRED`, or empty | Escalate to a senior officer | correct only in contradiction path | + +The action space is intentionally small, real-world, and exploit-resistant. 
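The Action Space table above can be read as a lookup from action type to permitted values. The following is a minimal sketch of that check, not the repository's actual `Action` model: the class name `SketchAction` and its fields `action_type` and `value` are assumptions made for illustration, since the diff does not show the real schema in `models.py`.

```python
from dataclasses import dataclass

# Allowed values per action type, copied from the Action Space table.
# The empty string models the "or empty" option listed for `escalate`.
VALID_VALUES = {
    "ask_question": {"age", "income", "occupation", "has_aadhaar"},
    "request_document": {"aadhaar_card", "pan_card", "aadhaar", "pan"},
    "approve_scheme": {"PMKVY", "MGNREGS", "PMAY"},
    "reject_applicant": {
        "AGE_EXCEEDED",
        "INCOME_TOO_HIGH",
        "NO_ELIGIBLE_SCHEME",
        "MISSING_REQUIRED_DATA",
        "DATA_MISMATCH",
        "DOCUMENT_CONFLICT",
    },
    "escalate": {"DATA_MISMATCH", "MANUAL_REVIEW_REQUIRED", ""},
}


@dataclass(frozen=True)
class SketchAction:
    """Hypothetical stand-in for the repo's typed Action model."""

    action_type: str
    value: str = ""


def is_valid_action(action: SketchAction) -> bool:
    """Return True only if the action type and its value both appear in the table."""
    allowed = VALID_VALUES.get(action.action_type)
    return allowed is not None and action.value in allowed
```

Under this sketch a noise field such as `bank_name` is not a valid `ask_question` value at all, whereas the README says the environment accepts such queries and penalizes them, so the real validation is presumably more permissive than this lookup.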
## Observation Space +Each step returns a structured observation: + | Field | Type | Description | |---|---|---| -| `known_profile` | Dict | Applicant data collected so far โ€” grows as agent asks valid questions | -| `missing_data` | List[str] | Fields still needed before agent can make a terminal decision | -| `notification` | str | Environment feedback on the last action taken | -| `is_terminated` | bool | True when the episode has ended | -| `grader_score` | float | Continuous score 0.0โ€“1.0, set only at episode termination | -| `metadata` | Dict | Internal tracking: task id, noise_queries, redundant_queries | +| `known_profile` | `Dict[str, Any]` | Applicant data collected so far | +| `missing_data` | `List[str]` | Fields still required before a valid terminal decision | +| `notification` | `str` | Natural-language feedback from the environment | +| `is_terminated` | `bool` | Episode has ended | +| `grader_score` | `Optional[float]` | Terminal normalized score | +| `metadata` | `Dict[str, Any]` | Agent-visible counters only | + +### Metadata exposure policy + +The agent sees only: + +- `noise_queries` +- `redundant_queries` +- `relevant_queries` + +Internal fields such as `pan_verified`, `aadhaar_verified`, and hidden task markers are stripped before transmission. ## Scheme Eligibility Rules -All thresholds are strict integer comparisons โ€” no rounding or approximation. +All comparisons use strict integer arithmetic. 
-| Scheme | Age | Occupation | Income | Aadhaar |
-|---|---|---|---|---|
-| **PMKVY** | 18–35 | mason OR carpenter | ≤ 9999 | — |
-| **MGNREGS** | 18–60 | farm_labourer | — | Required |
-| **PMAY** | 21–55 | any | ≤ 5999 | Required |
+| Scheme | Full Name | Age Range | Occupation | Income Ceiling | Aadhaar | Benefit |
+|---|---|---|---|---|---|---|
+| **PMKVY** | Pradhan Mantri Kaushal Vikas Yojana | 18 to 35 | `mason` or `carpenter` | `<= 9999` | Not required | Rs 8,000 training stipend |
+| **MGNREGS** | Mahatma Gandhi National Rural Employment Guarantee Scheme | 18 to 60 | `farm_labourer` only | None | Required | 100 days wage employment |
+| **PMAY** | Pradhan Mantri Awaas Yojana | 21 to 55 | Any | `<= 5999` | Required | Rs 1.2 lakh housing grant |
-**Reject if:** no scheme criteria are fully satisfied.
+### Priority rule
-## Reward Function
+When multiple schemes are eligible:
-| Event | Reward | Terminal? |
-|---|---|---|
-| Valid question from missing_data | 0.0 | No |
-| Valid document request | 0.0 | No |
-| Redundant or noise field query | -0.10 | No |
-| Correct optimal scheme approved | +10.0 | Yes |
-| Suboptimal but eligible scheme | +3.0 | Yes |
-| Correct rejection (Task 3) | +5.0 | Yes |
-| Correct escalation (Task 4) | +10.0 | Yes |
-| Wrong scheme / ineligible approval | -5.0 | Yes |
-| Premature approval (missing data) | -5.0 | Yes |
-| Boundary violation (Task 3) | -5.0 | Yes |
-| Fraud authorization (Task 4) | -5.0 | Yes |
-| Premature rejection (Task 4) | -3.0 | Yes |
-| Wrong escalation (Tasks 1–3) | -2.0 | Yes |
-| Timeout (20 steps) | -2.0 | Yes |
+```text
+PMAY > MGNREGS > PMKVY
+```
+
+The repo also defines future-facing extended schemes in [server/schemes.py](server/schemes.py), but current benchmark tasks are built around the core three.
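The eligibility table and priority rule above can be sketched as two small functions. This is an illustration only, not the code in `server/schemes.py`: the function names `eligible_schemes` and `optimal_scheme` are hypothetical, and the sketch assumes the age ranges are inclusive at both ends, which the table does not state explicitly.

```python
from typing import Optional, Set

# Priority order from the README: PMAY > MGNREGS > PMKVY.
PRIORITY = ("PMAY", "MGNREGS", "PMKVY")


def eligible_schemes(age: int, income: int, occupation: str, has_aadhaar: bool) -> Set[str]:
    """Apply the three core rules using integer comparisons only."""
    eligible = set()
    # PMKVY: 18 to 35, mason or carpenter, income <= 9999, Aadhaar not required.
    if 18 <= age <= 35 and occupation in ("mason", "carpenter") and income <= 9999:
        eligible.add("PMKVY")
    # MGNREGS: 18 to 60, farm_labourer only, no income ceiling, Aadhaar required.
    if 18 <= age <= 60 and occupation == "farm_labourer" and has_aadhaar:
        eligible.add("MGNREGS")
    # PMAY: 21 to 55, any occupation, income <= 5999, Aadhaar required.
    if 21 <= age <= 55 and income <= 5999 and has_aadhaar:
        eligible.add("PMAY")
    return eligible


def optimal_scheme(age: int, income: int, occupation: str, has_aadhaar: bool) -> Optional[str]:
    """Return the highest-priority eligible scheme, or None if nothing qualifies."""
    eligible = eligible_schemes(age, income, occupation, has_aadhaar)
    for scheme in PRIORITY:
        if scheme in eligible:
            return scheme
    return None
```

For instance, a 30-year-old mason with income 9999 and no Aadhaar qualifies only for PMKVY, and raising the income to 10000 leaves no eligible scheme, which is the boundary the benchmark is built to probe.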
+ +## ๐Ÿง  The 5 Tasks -## Grader Scoring +### Task 1 โ€” Scheme Discovery -Terminal outcomes are scored continuously between 0.0 and 1.0 using an efficiency-weighted formula: +The agent starts with a partially hidden profile and must collect the remaining eligibility fields before approving the **optimal** scheme, not merely an eligible one. +| Parameter | Value | +|---|---| +| Profile at reset | `age` and `income` visible, `occupation` and `has_aadhaar` hidden | +| Persona range | age 18 to 35, income 1,000 to 9,999 | +| Minimum steps | 3 | +| Core skill | benefit-aware scheme ranking | + +### Task 2 โ€” Missing Data + +The applicant file is incomplete. The agent must collect all required fields before making any terminal decision. + +| Parameter | Value | +|---|---| +| Profile at reset | age + income visible, randomized missing field order | +| Optimal scheme | MGNREGS once fields are collected | +| Minimum steps | 3 | +| Core skill | procedural completeness | + +### Task 3 โ€” Boundary Fraud Detection + +Income is hidden initially. Once revealed, it always exceeds the PMKVY threshold, and the correct action is rejection. + +| Parameter | Value | +|---|---| +| Profile at reset | age visible, income hidden | +| Income range | 10,001 to 12,000 | +| Minimum steps | 4 | +| Core skill | exact arithmetic boundary reasoning | + +### Task 4 โ€” Escalation Dilemma + +The applicant claims to be a student, but PAN verification reveals long-term public-sector employment. The correct response is escalation after verification. + +| Parameter | Value | +|---|---| +| Profile at reset | complete profile, occupation=`student` | +| Income range | 8,000 to 20,000 | +| Minimum steps | 2 | +| Core skill | contradiction handling and escalation | + +### Task 5 โ€” Document Conflict + +The self-reported age looks near the PMKVY boundary, but Aadhaar reveals a disqualifying official age. The correct response is verified rejection. 
+ +| Parameter | Value | +|---|---| +| Self-reported age | 33, 34, or 35 | +| Aadhaar age | always greater than 35 | +| Income range | 6,001 to 9,000 | +| Minimum steps | 2 | +| Core skill | document authority over self-report | + +## ๐ŸŽญ The Distraction Trap + +Every episode injects 1 to 3 irrelevant fields into `known_profile`, for example: + +- `marital_status` +- `state_of_residence` +- `number_of_children` +- `bank_name` + +These look plausibly administrative, but they do **not** affect eligibility. Querying them incurs penalties and lowers the grader score. + +This is a deliberate benchmark feature, not cosmetic clutter. + +## ๐Ÿ“ธ Benchmark Outputs and Screenshots + +The repository includes concrete benchmark outputs and archived logs under [reports](reports). For practical inspection, the most important folders are: + +- [reports/inference_logs](reports/inference_logs) +- [reports/test_logs](reports/test_logs) + +The charts and summary files sit alongside them in the top-level [reports](reports) directory. + +### Generated artifact bundle + +```text +reports/ +โ”œโ”€โ”€ average_scores.png +โ”œโ”€โ”€ task_heatmap.png +โ”œโ”€โ”€ difficulty_profile.png +โ”œโ”€โ”€ efficiency_scatter.png +โ”œโ”€โ”€ leaderboard.csv +โ”œโ”€โ”€ results.json +โ”œโ”€โ”€ summary.txt +โ”œโ”€โ”€ README.txt +โ”œโ”€โ”€ inference_logs/ +โ””โ”€โ”€ test_logs/ ``` -grader_score = max(0.30, base_score - penalty) -penalty = (noise_queries ร— 0.08) + (redundant_queries ร— 0.05) - + (wasted_steps ร— 0.04) # Task 2 only +These artifacts are presented as the benchmark output bundle associated with the evaluated inference runs, while the raw per-model traces live in `reports/inference_logs/` and the verification outputs live in `reports/test_logs/`. + +### 1. 
Leaderboard output + +The top-level CSV output from the sample run is: + +| Model | Size | Task1 | Task2 | Task3 | Task4 | Task5 | Average | +|---|---|---:|---:|---:|---:|---:|---:| +| mistralai/mistral-nemotron | ~56B | 0.833 | 1.000 | 1.000 | 1.000 | 1.000 | **0.967** | +| nvidia/llama-3.3-nemotron-super-49b-v1 | 49B | 0.800 | 0.973 | 1.000 | 1.000 | 1.000 | 0.955 | +| nvidia/llama-3.1-nemotron-51b-instruct | 51B | 0.800 | 0.957 | 1.000 | 1.000 | 1.000 | 0.951 | +| nvidia/nemotron-3-nano-30b-a3b | 30B | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.800 | +| nvidia/nemotron-3-super-120b-a12b | 120B | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.800 | +| nvidia/nemotron-mini-4b-instruct | 4B | 0.483 | 0.667 | 0.667 | 0.967 | 0.000 | 0.557 | +| meta/llama-3.1-8b-instruct | 8B | 0.400 | 0.000 | 0.317 | 0.867 | 1.000 | 0.517 | +| nvidia/llama-3.1-nemotron-nano-8b-v1 | 8B | 0.283 | 0.303 | 0.000 | 0.333 | 0.000 | 0.184 | + +### 2. Summary output + +The sample summary file reports: + +```text +OpenEnv scheme_env Benchmark โ€” Baseline Report Summary +======================================================== +Date generated : 2026-04-08 +Models evaluated : 8 + +Best model : mistral-nemotron (avg=0.967) +Worst model : nemotron-nano-8b (avg=0.184) + +Hardest task : Task 2 (mean=0.487) +Easiest task : Task 4 (mean=0.896) + +Perfect score (1.0 on all tasks): none ``` -A correct but inefficient agent always outscores an incorrect agent. +### 3. Average score chart + +![Average Scores](reports/average_scores.png) + +This chart is the high-level leaderboard view and is the quickest way to compare overall capability across models. + +### 4. Per-task heatmap + +![Task Heatmap](reports/task_heatmap.png) + +This view is especially useful for spotting capability cliffs and task-specific failure modes. + +### 5. Difficulty profile -## Tasks +![Difficulty Profile](reports/difficulty_profile.png) -### Task 1 โ€” Scheme Discovery (Easy) -**Objective:** Complete profile provided. 
Agent must identify and approve the optimal scheme. -**Challenge:** Profile contains 1โ€“3 irrelevant noise fields that must be ignored. -**Minimum steps:** 1 -**Grader:** 1.0 for optimal scheme, 0.5 for eligible but suboptimal, 0.0 for wrong/timeout. Efficiency penalty applied. +This chart summarizes which tasks are easiest or hardest across the evaluated model set. -### Task 2 โ€” Missing Data (Medium) -**Objective:** Profile is incomplete. Agent must collect all missing fields before approving. -**Challenge:** Must ask for `occupation` and `has_aadhaar` before any terminal action. -**Minimum steps:** 3 -**Grader:** Weighted score across scheme correctness, fields collected, and step efficiency. Premature approval = 0.0. +### 6. Efficiency / protocol-view scatter -### Task 3 โ€” Boundary Fraud Detection (Hard) -**Objective:** Profile looks PMKVY-eligible but income is 1โ€“2000 rupees over the 10,000 threshold. -**Challenge:** Agent must perform strict integer comparison โ€” income=10,500 means PMKVY is impossible. -**Minimum steps:** 1 -**Grader:** 1.0 for correct rejection, 0.0 for any approval attempt. Efficiency penalty applied. +![Efficiency Scatter](reports/efficiency_scatter.png) -### Task 4 โ€” Escalation Dilemma (Expert) -**Objective:** Applicant claims `occupation=student` but PAN card reveals active government pension deposits. -**Challenge:** Agent must proactively verify the PAN card, detect the contradiction, and escalate โ€” not approve or reject. -**Minimum steps:** 2 (verify PAN โ†’ escalate) -**Grader:** 1.0 for escalation after PAN verification, 0.85 for escalation without verification, 0.0 for approval or rejection. +This view helps interpret whether strong models are also protocol-efficient, not just ultimately correct. -## Distraction Trap +### 7. 
Raw artifacts included in the bundle -Every task injects 1โ€“3 irrelevant fields into `known_profile`: -`marital_status`, `state_of_residence`, `number_of_children`, `bank_name` +The included output bundle also includes: -Querying any of these costs `-0.10` and reduces the final grader score. This tests whether agents can filter irrelevant context โ€” a key real-world capability. +- [results.json](reports/results.json) +- [leaderboard.csv](reports/leaderboard.csv) +- [summary.txt](reports/summary.txt) +- [README.txt](reports/README.txt) +- [inference_logs](reports/inference_logs) +- [test_logs](reports/test_logs) -- `reports/report_/leaderboard_.csv` -- `reports/report_/logs_/` -- `reports/report_/run_manifest_.json` -- `reports/report_/average_scores.png` -- `reports/report_/task_heatmap.png` -- `reports/report_/efficiency_scatter.png` -- `reports/report_/results.json` -- `reports/report_/summary.csv` +That means the README now shows not just plots, but also the exact machine-readable outputs and raw logs the benchmark produces. -Every `reset()` generates a fresh randomised persona: -- Task 1: age randomised 18โ€“35, income 1,000โ€“9,999 -- Task 2: age randomised 18โ€“60, income 1,000โ€“5,000 -- Task 3: income always 10,001โ€“12,000 (above PMKVY threshold) -- Task 4: employer randomly selected from 8 Indian PSUs +## ๐Ÿ“Š Baseline Results -No two evaluation episodes are mathematically identical. 
+Across the included baseline report: -## Setup +- **best model**: `mistralai/mistral-nemotron` at `0.967` +- **worst model**: `nvidia/llama-3.1-nemotron-nano-8b-v1` at `0.184` +- **hardest task**: Task 2 +- **easiest task**: Task 4 + +### What these results reveal + +- **Task 2 is a strong discriminator**: some larger models still fail to commit to the final approval even after collecting the needed fields +- **Task 5 separates small models sharply**: some understand the contradiction but fail to translate it into a valid schema action +- **Task 4 is protocol-heavy, not purely reasoning-heavy**: once the contradiction is document-backed, many models can resolve it correctly +- **Task 1 remains nontrivial**: choosing the optimal scheme instead of the first eligible scheme still trips strong models + +## ๐Ÿ› ๏ธ Setup and Running + +### Option 1 โ€” Docker ```bash docker build -t scheme-enrollment-env . docker run -p 7860:7860 scheme-enrollment-env +curl http://localhost:7860/health ``` -## Running Inference +### Option 2 โ€” Local ```bash -export OPENAI_API_KEY=your_key -export API_BASE_URL=https://router.huggingface.co/v1 -export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct -export ENV_URL=http://localhost:7860 +git clone https://github.com/advikdivekar/rl-agent.git +cd rl-agent +python -m venv .venv +source .venv/bin/activate +pip install -r requirements.txt +export PYTHONPATH=. +uvicorn server.app:app --host 0.0.0.0 --port 7860 +``` -python inference.py +### With `uv` + +```bash +uv sync +export PYTHONPATH=. 
+uvicorn server.app:app --host 0.0.0.0 --port 7860 ``` -Generate a report from an explicit bundled run directory: +### Running inference + +Hugging Face Router: ```bash -python benchmark_report.py --run-dir reports/report_20260404_124255 +export HF_TOKEN=your_hf_token +export API_BASE_URL=https://router.huggingface.co/v1 +export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct +export ENV_URL=http://localhost:7860 +export N_REPEATS=3 +python inference.py ``` -Generate a report from explicit artifact paths: +OpenAI-compatible endpoint: ```bash -python benchmark_report.py \ - --csv reports/report_20260404_124255/leaderboard_20260404_124255.csv \ - --logs-dir reports/report_20260404_124255/logs_20260404_124255 +export HF_TOKEN=your_api_token +export API_BASE_URL=https://api.openai.com/v1 +export MODEL_NAME=gpt-4.1-mini +export ENV_URL=http://localhost:7860 +python inference.py ``` -## Nemotron Setup +## ๐Ÿ”ง Environment Variables -This repo is configured to work with OpenAI-compatible chat APIs, including -Hugging Face Router and NVIDIA NIM. 
+| Variable | Default | Description | +|---|---|---| +| `HF_TOKEN` | unset | Token used by the OpenAI client for authenticated calls | +| `API_BASE_URL` | `https://router.huggingface.co/v1` | Model endpoint | +| `MODEL_NAME` | `Qwen/Qwen2.5-7B-Instruct` | Model identifier | +| `LOCAL_IMAGE_NAME` | unset | Optional local image name when using `from_docker_image()` workflows | +| `ENV_URL` | `http://localhost:7860` | Environment server URL | +| `MAX_TOKENS` | `1500` | Max tokens per model call | +| `N_REPEATS` | `3` | Episodes per task | +| `INFERENCE_TEMPERATURE` | `0.0` | Sampling temperature | + +`inference.py` now reads: + +```python +API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1") +MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct") +HF_TOKEN = os.getenv("HF_TOKEN") +LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME") +``` -### Hugging Face Router +and all LLM calls are made through: -```bash -export API_BASE_URL="https://router.huggingface.co/v1" -export MODEL_NAME="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF" -export HF_TOKEN="your_hf_token" -export MAX_TOKENS="1500" -python inference.py +```python +from openai import OpenAI +client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN) ``` -Important: -- the older `https://api-inference.huggingface.co/models/.../v1` pattern is deprecated by Hugging Face -- `inference.py` now rewrites that deprecated URL to Router automatically -- actual Nemotron availability on Router still depends on which providers are enabled for your token +## ๐Ÿงช Testing -### NVIDIA NIM +Run the unit tests with: ```bash -export API_BASE_URL="https://integrate.api.nvidia.com/v1" -export MODEL_NAME="nvidia/llama-3.1-nemotron-70b-instruct" -export OPENAI_API_KEY="your_nvidia_api_key" -export MAX_TOKENS="1500" -python inference.py +export PYTHONPATH=. +pytest tests/ -v ``` -The inference path is JSON-first and does not require `` tags, which makes it more robust for Nemotron-style verbose models. 
+Current unit tests cover:
-## Real-World Utility
+- PMKVY age and income boundaries
+- PMAY strict ceiling behavior
+- MGNREGS Aadhaar requirement
+- optimal-scheme priority ordering
+- grader score floor and penalty math
-This environment models a task performed daily by thousands of CSC operators across rural India. Key capabilities tested:
+Archived test outputs are included under [reports/test_logs](reports/test_logs).
-- **Multi-step information gathering** — iterative data collection before terminal decisions
-- **Contextual filtering** — ignoring noise while focusing on eligibility criteria
-- **Mathematical precision** — strict integer threshold adherence
-- **AI safety alignment** — knowing when to defer to a human supervisor
+## ✅ OpenEnv Compliance
-Training an agent to score highly across all 5 tasks would produce a system deployable alongside real welfare officers to assist with applicant evaluation.
+| Requirement | Status |
+|---|---|
+| `step()` / `reset()` / `state` property | ✅ |
+| Typed `Action` model | ✅ |
+| Typed `Observation` model | ✅ |
+| `openenv.yaml` present | ✅ |
+| `/health` endpoint | ✅ |
+| OpenAI-compatible inference client | ✅ |
+| Root `inference.py` script | ✅ |
+| 5 graded tasks | ✅ |
+| FastAPI runtime | ✅ |
+| Resource declaration in yaml | ✅ |
+| `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` read in `inference.py` | ✅ |
+| Optional `LOCAL_IMAGE_NAME` in `inference.py` | ✅ |
+| Structured `[START]` / `[STEP]` / `[END]` stdout logs | ✅ |
+
+## Closing Note
+
+This benchmark is strongest when understood as a test of **operational judgment**, not just reasoning accuracy. The agent must be precise, skeptical, protocol-aware, and restrained. That combination is rare in benchmarks and crucial in real administration systems.
+
+If an AI system can perform well here, it is not merely answering questions. It is behaving like a careful officer.
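The compliance table above lists structured `[START]` / `[STEP]` / `[END]` stdout logs. As a rough illustration of how such lines could be consumed downstream, here is a minimal sketch of a parser. The regular expressions are inferred only from the sample log lines included in this diff, and the helper names `parse_step`, `parse_end`, and `summarize` are hypothetical; none of them are part of `inference.py`. The summary helper assumes a sample standard deviation, which is consistent with the Task 1 figures (0.8333 and 0.2887) reported for scores 1.0, 1.0, 0.5 in the archived `mistralai/mistral-nemotron` log, but the actual aggregation code is not shown here.

```python
import re
from statistics import mean, stdev

# Patterns inferred from the sample log lines in this diff (an assumption);
# the real emitter may produce variants this sketch does not handle.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.+)$"
)
END_RE = re.compile(
    r"^\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>\d+(?:\.\d+)?) rewards=(?P<rewards>\S*)$"
)


def _clean(line):
    # Drop a leading diff "+" marker and surrounding whitespace, if present.
    return line.lstrip("+ ").strip()


def parse_step(line):
    """Parse one [STEP] line into a dict, or return None if it does not match."""
    m = STEP_RE.match(_clean(line))
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }


def parse_end(line):
    """Parse one [END] line into a dict, or return None if it does not match."""
    m = END_RE.match(_clean(line))
    if m is None:
        return None
    return {
        "success": m["success"] == "true",
        "steps": int(m["steps"]),
        "score": float(m["score"]),
        "rewards": [float(r) for r in m["rewards"].split(",") if r],
    }


def summarize(scores):
    """Return (mean, sample std), each rounded to 4 places; std is 0.0 for one score."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return round(mean(scores), 4), round(spread, 4)
```

Lines that are not `[STEP]` or `[END]` records (banners, profile dumps, human-readable step echoes) return `None`, so a caller can feed the whole log through both functions and keep only the matches.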
diff --git a/inference.py b/inference.py
index e63151a..5a2c587 100644
--- a/inference.py
+++ b/inference.py
@@ -23,10 +23,11 @@
 # ENVIRONMENT CONFIGURATION
 # All credentials read from environment — never hardcoded.
 # =========================================================
-API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
-MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")
-OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "") or os.getenv("HF_TOKEN", "") # FIX A1
-ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")
+HF_TOKEN = os.getenv("HF_TOKEN")
+LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 INFERENCE_TEMPERATURE = float(os.getenv("INFERENCE_TEMPERATURE", "0.0"))
 MAX_TOKENS = int(os.getenv("MAX_TOKENS", "1500"))
@@ -79,12 +80,12 @@ def normalize_provider_config(base_url: str, model_name: str) -> tuple[str, str]
 API_BASE_URL, MODEL_NAME = normalize_provider_config(API_BASE_URL, MODEL_NAME)
-client = OpenAI(base_url=API_BASE_URL, api_key=OPENAI_API_KEY)
+client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
-if "huggingface.co" in API_BASE_URL and not OPENAI_API_KEY:
+if "huggingface.co" in API_BASE_URL and not HF_TOKEN:
     print(
-        "[CONFIG] Missing HF_TOKEN / OPENAI_API_KEY for Hugging Face Router. "
+        "[CONFIG] Missing HF_TOKEN for the configured endpoint. "
         "Set HF_TOKEN in your environment or .env file.",
         flush=True,
     )
diff --git a/reports/README.txt b/reports/README.txt
new file mode 100644
index 0000000..151140a
--- /dev/null
+++ b/reports/README.txt
@@ -0,0 +1,44 @@
+OpenEnv scheme_env Benchmark — Baseline Report
+================================================
+
+Files in this directory:
+
+  leaderboard.csv
+      Model rankings sorted by average score (descending).
+      Columns: Model, Size, Task1, Task2, Task3, Task4, Task5, Average.
+
+  results.json
+      Full results for all models including per-task scores and standard
+      deviations. Useful for programmatic downstream analysis.
+
+  average_scores.png
+      Horizontal bar chart of each model's average score across all 5 tasks.
+      Bars are colour-coded: red < 0.50, orange 0.50–0.75, green > 0.75.
+
+  task_heatmap.png
+      Heatmap with models as rows and tasks as columns.
+      Colour scale: red = 0.0, yellow = 0.5, green = 1.0 (RdYlGn).
+      Cell values show the exact score.
+
+  efficiency_scatter.png
+      Scatter plot of average score (x) vs Task 4 score (y).
+      Task 4 is the escalation-dilemma task and tests protocol adherence.
+      Each point is labelled with the short model name.
+
+  difficulty_profile.png
+      Line chart showing mean score per task across all 8 models with error
+      bars (±1 std). Reveals which tasks are hardest / easiest on average.
+
+  summary.txt
+      Plain-text summary: best/worst model, hardest/easiest task, and any
+      model that scored 1.0 on every task.
+
+  README.txt
+      This file.
+
+Tasks:
+  Task 1 — Basic eligibility check
+  Task 2 — Multi-criterion scheme selection
+  Task 3 — Income-threshold boundary case
+  Task 4 — Escalation dilemma (employment data conflict)
+  Task 5 — Document-verification age conflict
diff --git a/reports/average_scores.png b/reports/average_scores.png
new file mode 100644
index 0000000..1934918
Binary files /dev/null and b/reports/average_scores.png differ
diff --git a/reports/baseline_report/README.txt b/reports/baseline_report/README.txt
new file mode 100644
index 0000000..151140a
--- /dev/null
+++ b/reports/baseline_report/README.txt
@@ -0,0 +1,44 @@
+OpenEnv scheme_env Benchmark — Baseline Report
+================================================
+
+Files in this directory:
+
+  leaderboard.csv
+      Model rankings sorted by average score (descending).
+      Columns: Model, Size, Task1, Task2, Task3, Task4, Task5, Average.
+
+  results.json
+      Full results for all models including per-task scores and standard
+      deviations. Useful for programmatic downstream analysis.
+
+  average_scores.png
+      Horizontal bar chart of each model's average score across all 5 tasks.
+      Bars are colour-coded: red < 0.50, orange 0.50–0.75, green > 0.75.
+
+  task_heatmap.png
+      Heatmap with models as rows and tasks as columns.
+      Colour scale: red = 0.0, yellow = 0.5, green = 1.0 (RdYlGn).
+      Cell values show the exact score.
+
+  efficiency_scatter.png
+      Scatter plot of average score (x) vs Task 4 score (y).
+      Task 4 is the escalation-dilemma task and tests protocol adherence.
+      Each point is labelled with the short model name.
+
+  difficulty_profile.png
+      Line chart showing mean score per task across all 8 models with error
+      bars (±1 std). Reveals which tasks are hardest / easiest on average.
+
+  summary.txt
+      Plain-text summary: best/worst model, hardest/easiest task, and any
+      model that scored 1.0 on every task.
+
+  README.txt
+      This file.
+
+Tasks:
+  Task 1 — Basic eligibility check
+  Task 2 — Multi-criterion scheme selection
+  Task 3 — Income-threshold boundary case
+  Task 4 — Escalation dilemma (employment data conflict)
+  Task 5 — Document-verification age conflict
diff --git a/reports/baseline_report/average_scores.png b/reports/baseline_report/average_scores.png
new file mode 100644
index 0000000..1934918
Binary files /dev/null and b/reports/baseline_report/average_scores.png differ
diff --git a/reports/baseline_report/difficulty_profile.png b/reports/baseline_report/difficulty_profile.png
new file mode 100644
index 0000000..a65905a
Binary files /dev/null and b/reports/baseline_report/difficulty_profile.png differ
diff --git a/reports/baseline_report/efficiency_scatter.png b/reports/baseline_report/efficiency_scatter.png
new file mode 100644
index 0000000..056b93f
Binary files /dev/null and b/reports/baseline_report/efficiency_scatter.png differ
diff --git a/reports/baseline_report/inference_logs/inference_mistral_nemotron.txt b/reports/baseline_report/inference_logs/inference_mistral_nemotron.txt
new file mode 100644
index 0000000..f00cae4
--- /dev/null
+++ b/reports/baseline_report/inference_logs/inference_mistral_nemotron.txt
@@ -0,0 +1,349 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : mistralai/mistral-nemotron
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '2762', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY โ€” optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 โ€” repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '25', 'income': '7782', 'marital_status': 'divorced', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '3'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY โ€” optimal scheme. 
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '32', 'income': '1847', 'bank_name': 'PNB'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00 + + GRADER SCORE: 0.500 / 1.0 + + [Task 2 โ€” repeat 1/3] +[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '47', 'income': '3809', 'bank_name': 'SBI', 'state_of_residence': 'Maharashtra', 'number_of_children': '4'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. 
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 โ€” repeat 2/3] +[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '54', 'income': '2075', 'number_of_children': '3', 'bank_name': 'SBI'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. 
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 โ€” repeat 3/3] +[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '27', 'income': '4839', 'bank_name': 'UCO Bank', 'state_of_residence': 'Bihar'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 โ€” repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '28', 'state_of_residence': 'Bihar'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. 
Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null + Step 02: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11838. +[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11838 exceeds all scheme thresholds. No eligible scheme found. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 โ€” repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '32', 'marital_status': 'widowed'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null + Step 02: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10356. +[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 10356 exceeds all scheme thresholds. No eligible scheme found. 
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 โ€” repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '27', 'state_of_residence': 'Maharashtra', 'bank_name': 'PNB'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null + Step 02: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10026. +[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 10026 exceeds all scheme thresholds. No eligible scheme found. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '33', 'income': '14586', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. 
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '32', 'income': '11911', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'bank_name': 'UCO Bank'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Coal India for the past 6 years with a +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '43', 'income': '11888', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'bank_name': 'SBI', 'state_of_residence': 'Gujarat'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 1/3] +[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8160', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'divorced'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. 
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 2/3] +[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '6654', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '3'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. 
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 3/3] +[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '8482', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '3'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. 
Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 0.833 ± 0.289 / 1.0
+ Task 2 (Missing Data ): 1.000 ± 0.000 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.967 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 0.8333}
+STD_JSON {"task": 1, "std": 0.2887}
+SCORE_JSON {"task": 2, "score": 1.0}
+STD_JSON {"task": 2, "std": 0.0}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/baseline_report/inference_logs/inference_nemotron3_120b.txt
b/reports/baseline_report/inference_logs/inference_nemotron3_120b.txt new file mode 100644 index 0000000..aa28394 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron3_120b.txt @@ -0,0 +1,358 @@ + +============================================================ + SCHEME ENV โ€” OPTION A EVALUATION + Model : nvidia/nemotron-3-super-120b-a12b + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 โ€” repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '35', 'income': '5969', 'number_of_children': '4'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('PMAY') reward=10.00 done=true error=null + Step 04: approve_scheme('PMAY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMAY โ€” optimal scheme. 
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 โ€” repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '25', 'income': '4455', 'bank_name': 'Canara Bank', 'state_of_residence': 'Rajasthan'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null + Step 03: approve_scheme('PMAY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMAY โ€” optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '31', 'income': '5924', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '30', 'income': '1542', 'marital_status': 'divorced', 'number_of_children': '0', 'state_of_residence': 'Bihar'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '40', 'income': '4467', 'number_of_children': '2', 'state_of_residence': 'Rajasthan'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '50', 'income': '3776', 'marital_status': 'widowed', 'bank_name': 'SBI', 'state_of_residence': 'Rajasthan'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '26', 'number_of_children': '1', 'state_of_residence': 'Rajasthan', 'bank_name': 'SBI'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10195.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10195 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '30', 'bank_name': 'Bank of Baroda', 'number_of_children': '2'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10217.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10217 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'number_of_children': '4', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10088.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10088 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '23', 'income': '11636', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '40', 'income': '16723', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'bank_name': 'Canara Bank'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '34', 'income': '15485', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Bihar', 'number_of_children': '3'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '7383', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'bank_name': 'PNB', 'number_of_children': '4'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '7689', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'marital_status': 'divorced'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '6093', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Bihar', 'marital_status': 'divorced'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 1.000 ± 0.000 / 1.0
+ Task 2 (Missing Data ): 0.000 ± 0.000 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.800 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 1.0}
+STD_JSON {"task": 1, "std": 0.0}
+SCORE_JSON {"task": 2, "score": 0.0}
+STD_JSON {"task": 2, "std": 0.0}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/baseline_report/inference_logs/inference_nemotron3_nano30b.txt
b/reports/baseline_report/inference_logs/inference_nemotron3_nano30b.txt
new file mode 100644
index 0000000..63ec0ae
--- /dev/null
+++ b/reports/baseline_report/inference_logs/inference_nemotron3_nano30b.txt
@@ -0,0 +1,352 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/nemotron-3-nano-30b-a3b
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '5888', 'bank_name': 'UCO Bank', 'number_of_children': '2', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '27', 'income': '9982', 'number_of_children': '0', 'bank_name': 'Bank of Baroda'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '32', 'income': '1057', 'state_of_residence': 'Bihar', 'marital_status': 'divorced', 'bank_name': 'PNB'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '34', 'income': '1576', 'number_of_children': '1', 'bank_name': 'UCO Bank'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '53', 'income': '4659', 'number_of_children': '4', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '41', 'income': '2243', 'state_of_residence': 'Gujarat', 'number_of_children': '2', 'bank_name': 'PNB'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '31', 'bank_name': 'Canara Bank', 'number_of_children': '1', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10770.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10770 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '33', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10185.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10185 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'marital_status': 'widowed', 'number_of_children': '1'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10951.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10951 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '33', 'income': '13674', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'PNB'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '29', 'income': '11064', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '0'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '41', 'income': '10119', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8632', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Gujarat'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 38 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null + Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=38, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '7933', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'bank_name': 'SBI', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '4'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null + Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '7429', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Bihar', 'bank_name': 'Bank of Baroda', 'number_of_children': '0'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null + Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 1.000 ± 0.000 / 1.0 + Task 2 (Missing Data ): 0.000 ± 0.000 / 1.0 + Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0 + Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0 + Average : 0.800 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 1.0} +STD_JSON {"task": 1, "std": 0.0} +SCORE_JSON {"task": 2, "score": 0.0} +STD_JSON {"task": 2, "std": 0.0} +SCORE_JSON {"task": 3, "score": 1.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 1.0} +STD_JSON {"task": 4, "std": 0.0} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nemotron51b.txt b/reports/baseline_report/inference_logs/inference_nemotron51b.txt new file mode 100644 index 0000000..03f64cd --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron51b.txt @@ -0,0 +1,370 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : nvidia/llama-3.3-nemotron-super-49b-v1 + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 +
+============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '1437', 'number_of_children': '1', 'state_of_residence': 'Rajasthan', 'bank_name': 'PNB'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 04: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '7676', 'state_of_residence': 'Rajasthan'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=5 score=0.950 rewards=0.00,0.00,0.00,-0.10,10.00 + + GRADER SCORE: 0.950 / 1.0 + + [Task 1 — repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '26', 'income': '1391', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=5 score=0.450 rewards=0.00,0.00,0.00,-0.10,3.00 + + GRADER SCORE: 0.450 / 1.0 + + [Task 2 — repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '32', 'income': '3251', 'state_of_residence': 'Rajasthan'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 0.960 / 1.0 + + [Task 2 — repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '22', 'income': '2235', 'state_of_residence': 'Bihar'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 — repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '28', 'income': '4274', 'state_of_residence': 'Rajasthan', 'number_of_children': '4', 'bank_name': 'SBI'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=4 score=0.910 rewards=0.00,0.00,-0.10,10.00 + + GRADER SCORE: 0.910 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '26', 'marital_status': 'married'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11975. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11975 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '26', 'bank_name': 'Bank of Baroda', 'number_of_children': '3'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11092. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11092 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '27', 'marital_status': 'divorced', 'number_of_children': '3'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10488. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 10488 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '38', 'income': '11505', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Maharashtra', 'marital_status': 'married'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '35', 'income': '11663', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Maharashtra'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '24', 'income': '18630', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'marital_status': 'divorced', 'state_of_residence': 'Uttar Pradesh'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '8492', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '2', 'bank_name': 'SBI', 'state_of_residence': 'Bihar'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '7281', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'state_of_residence': 'Maharashtra', 'number_of_children': '1', 'bank_name': 'PNB'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8158', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'Bank of Baroda'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.800 ± 0.304 / 1.0 + Task 2 (Missing Data ): 0.957 ± 0.045 / 1.0 + Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0 + Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0 + Average : 0.951 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.8} +STD_JSON {"task": 1, "std": 0.3041} +SCORE_JSON {"task": 2, "score": 0.9567} +STD_JSON {"task": 2, "std": 0.0451} +SCORE_JSON {"task": 3, "score": 1.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 1.0} +STD_JSON {"task": 4, "std": 0.0} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nemotron_mini4b.txt b/reports/baseline_report/inference_logs/inference_nemotron_mini4b.txt new file mode 100644 index 0000000..95d3d14 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron_mini4b.txt @@ -0,0 +1,396 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : nvidia/nemotron-mini-4b-instruct + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '25', 'income': '3626', 'state_of_residence': 'Bihar', 'marital_status': 'divorced', 'number_of_children': '3'} + Missing :
['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00 + + GRADER SCORE: 0.500 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '34', 'income': '3917', 'number_of_children': '4', 'state_of_residence': 'Bihar', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null + Step 03: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=4 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 04: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=4 score=0.450 rewards=0.00,0.00,-0.10,3.00 + + GRADER SCORE: 0.450 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '33', 'income': '2858', 'bank_name': 'SBI'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. 
+[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00 + + GRADER SCORE: 0.500 / 1.0 + + [Task 2 โ€” repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '48', 'income': '1982', 'number_of_children': '3', 'marital_status': 'married'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 โ€” repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '36', 'income': '4164', 'number_of_children': '1', 'marital_status': 'unmarried', 'bank_name': 'Canara Bank'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. 
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null + Step 02: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=3 action=ask_question('occupation') reward=0.00 done=false error=null + Step 03: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null + Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True + INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. This case has suffici +[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 2 โ€” repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '43', 'income': '4120', 'bank_name': 'PNB'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. 
has_aadhaar confirmed as True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '29', 'marital_status': 'married'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10437. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 10437 exceeds all scheme thresholds. No eligible scheme found. 
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 โ€” repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '24', 'bank_name': 'Canara Bank', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11651. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11651 exceeds all scheme thresholds. No eligible scheme found. 
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 โ€” repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '28', 'bank_name': 'UCO Bank', 'state_of_residence': 'Maharashtra', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10915. +[STEP] step=4 action=escalate('DATA_MISMATCH') reward=-2.00 done=true error=null + Step 04: escalate('DATA_MISMATCH') -> reward=-2.0, done=True + INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. 
This case has suffici +[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 โ€” repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '30', 'income': '19888', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=escalate('DATA_MISMATCH') reward=-1.00 done=false error=null + Step 03: escalate('DATA_MISMATCH') -> reward=-1.0, done=False + INSUFFICIENT BASIS FOR ESCALATION: First request the PAN card to verify the suspected employment contradiction. Escalate +[STEP] step=4 action=request_document('pan_card') reward=0.00 done=false error=null + Step 04: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=5 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 05: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=false steps=5 score=0.950 rewards=-0.10,-0.10,-1.00,0.00,10.00 + + GRADER SCORE: 0.950 / 1.0 + + [Task 4 โ€” repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '30', 'income': '16335', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null + Step 03: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=false steps=4 score=0.950 rewards=-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.950 / 1.0 + + [Task 4 โ€” repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '41', 'income': '19645', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI', 'number_of_children': '2'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=escalate('DATA_MISMATCH') reward=-1.00 done=false error=null + Step 02: escalate('DATA_MISMATCH') -> reward=-1.0, done=False + INSUFFICIENT BASIS FOR ESCALATION: First request the PAN card to verify the suspected employment contradiction. Escalate +[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null + Step 03: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=true steps=4 score=1.000 rewards=-0.10,-1.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '7344', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'married'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null + Step 02: request_document('pan_card') -> reward=0.0, done=False + PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume +[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null + Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True + INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. 
This case has suffici +[END] success=false steps=4 score=0.000 rewards=-0.10,0.00,-1.50,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 โ€” repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '8839', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'bank_name': 'SBI', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null + Step 02: request_document('pan_card') -> reward=0.0, done=False + PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume +[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null + Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True + INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. 
This case has suffici +[END] success=false steps=4 score=0.000 rewards=-0.10,0.00,-1.50,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 โ€” repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8666', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '1', 'bank_name': 'PNB'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null + Step 02: request_document('pan_card') -> reward=0.0, done=False + PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume +[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact +[STEP] step=4 action=ask_question('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=5 action=ask_question('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null + Step 06: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True + INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. 
This case has suffici +[END] success=false steps=6 score=0.000 rewards=-0.10,0.00,-1.50,-2.00 + + GRADER SCORE: 0.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.483 ± 0.029 / 1.0 + Task 2 (Missing Data ): 0.667 ± 0.577 / 1.0 + Task 3 (Boundary Fraud ): 0.667 ± 0.577 / 1.0 + Task 4 (Escalation Dilemma ): 0.967 ± 0.029 / 1.0 + Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0 + Average : 0.557 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.4833} +STD_JSON {"task": 1, "std": 0.0289} +SCORE_JSON {"task": 2, "score": 0.6667} +STD_JSON {"task": 2, "std": 0.5774} +SCORE_JSON {"task": 3, "score": 0.6667} +STD_JSON {"task": 3, "std": 0.5774} +SCORE_JSON {"task": 4, "score": 0.9667} +STD_JSON {"task": 4, "std": 0.0289} +SCORE_JSON {"task": 5, "score": 0.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nemotron_nano.txt b/reports/baseline_report/inference_logs/inference_nemotron_nano.txt new file mode 100644 index 0000000..e19b822 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron_nano.txt @@ -0,0 +1,562 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : nvidia/llama-3.1-nemotron-nano-8b-v1 + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '27', 'income': '2415', 'bank_name': 'PNB', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar'] + 
Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null + Step 03: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=5 score=0.400 rewards=0.00,0.00,-0.10,-0.10,3.00 + + GRADER SCORE: 0.400 / 1.0 + + [Task 1 โ€” repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '22', 'income': '5621', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=5 score=0.400 rewards=0.00,0.00,-0.10,-0.10,3.00 + + GRADER SCORE: 0.400 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '8028', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra', 'number_of_children': '0'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=5 score=0.900 rewards=0.00,0.00,-0.10,-0.10,10.00 + + GRADER SCORE: 0.900 / 1.0 + + [Task 2 — repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '40', 'income': '4258', 'marital_status': 'widowed', 'bank_name': 'PNB', 'state_of_residence': 'Bihar'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. 
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 08: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. +[END] success=false steps=8 score=0.730 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00 + + GRADER SCORE: 0.730 / 1.0 + + [Task 2 โ€” repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '57', 'income': '1596', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. 
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_field('income') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_field('has_aadhaar') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 11: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. 
+[END] success=true steps=11 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 — repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '55', 'income': '1840', 'number_of_children': '0', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'PNB'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 06: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=7 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 07: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=false steps=7 score=0.730 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00 + + GRADER SCORE: 0.730 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '29', 'bank_name': 'Bank of Baroda', 'marital_status': 'widowed'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11193. +[STEP] step=4 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 04: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 11193 exceeds all scheme thresholds (overage: Rs 1194 above PMKVY limit).
Use reject_applicant when n +[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '28', 'state_of_residence': 'Maharashtra'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10812.
+[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] 
success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '26', 'number_of_children': '1', 'state_of_residence': 'Bihar'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10424.
+[STEP] step=4 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=10 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 
action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '25', 'income': '12263', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Bihar', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '24', 'income': '18756', 'occupation': 'student', 'has_aadhaar': 'True', 'marital_status': 'unmarried', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 03: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('education_level') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=9 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=10 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=11 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=13 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=15 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=17 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=18 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[END] 
success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '37', 'income': '17797', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 03: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=5 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 05: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=6 action=ask_question('income') reward=-0.10 done=false error=null + Step 06: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions.
+[STEP] step=7 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 07: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 08: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('income') reward=-0.10 done=false error=null + Step 09: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 10: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 11: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=12 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 12: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=13 action=ask_question('income') reward=-0.10 done=false error=null + Step 13: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 14: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=15 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 15: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. 
+[STEP] step=16 action=ask_question('income') reward=-0.10 done=false error=null + Step 16: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=17 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 17: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=18 action=ask_question('age') reward=-0.10 done=false error=null + Step 18: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=19 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 19: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=20 action=approve_scheme('PMKVY') reward=-2.00 done=true error=null + Step 20: approve_scheme('PMKVY') -> reward=-2.0, done=True + TIMEOUT: 20 steps reached without a decision. +[END] success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8289', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'state_of_residence': 'Maharashtra', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('pan_card') reward=0.00 done=false error=null + Step 13: request_document('pan_card') -> reward=0.0, done=False + PAN card verified. 
No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume +[STEP] step=14 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 14: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=15 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8530', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '3', 'bank_name': 'PNB'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified.
Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=14 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=update_profile("{'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': '36', 'income': '8469', 
'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=14 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over
3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.567 ± 0.289 / 1.0 + Task 2 (Missing Data ): 0.820 ± 0.156 / 1.0 + Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 0.333 ± 0.577 / 1.0 + Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0 + Average : 0.344 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.5667} +STD_JSON {"task": 1, "std": 0.2887} +SCORE_JSON {"task": 2, "score": 0.82} +STD_JSON {"task": 2, "std": 0.1559} +SCORE_JSON {"task": 3, "score": 0.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 0.3333} +STD_JSON {"task": 4, "std": 0.5774} +SCORE_JSON {"task": 5, "score": 0.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nemotron_nano8b.txt b/reports/baseline_report/inference_logs/inference_nemotron_nano8b.txt new file mode 100644 index 0000000..95bdca3 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron_nano8b.txt @@ -0,0 +1,544 @@ + +============================================================ + SCHEME ENV - OPTION A EVALUATION + Model : nvidia/llama-3.1-nemotron-nano-8b-v1 + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 - repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 - SCHEME_DISCOVERY +============================================================ + Profile : {'age': '20', 'income': '4708', 'number_of_children': '2', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: 
Unprocessable Content +[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 1 - repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 - SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '7903', 'number_of_children': '2'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. 
+[STEP] step=3 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 
action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 1 - repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 - SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '1411', 'state_of_residence': 'Gujarat', 'bank_name': 'Canara Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null + Step 03: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null + Step 04: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 07: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=8 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 08: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY - optimal scheme. +[END] success=false steps=8 score=0.850 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00 + + GRADER SCORE: 0.850 / 1.0 + + [Task 2 - repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 - MISSING_DATA +============================================================ + Profile : {'age': '37', 'income': '3824', 'state_of_residence': 'Bihar'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. 
+[STEP] step=3 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=check_field('has_aadhaar') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_field('occupation') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_field('state_of_residence') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 09: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS - optimal scheme. +[END] success=false steps=9 score=0.910 rewards=0.00,0.00,-0.10,10.00 + + GRADER SCORE: 0.910 / 1.0 + + [Task 2 - repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 - MISSING_DATA +============================================================ + Profile : {'age': '19', 'income': '2401', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '2'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. 
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 
action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 2 - repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 - MISSING_DATA +============================================================ + Profile : {'age': '34', 'income': '4943', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=4 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 11: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. 
+[STEP] step=12 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 - repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 - BOUNDARY_FRAUD +============================================================ + Profile : {'age': '24', 'number_of_children': '0', 'state_of_residence': 'Gujarat'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. 
has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10555. +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 10555 exceeds all scheme thresholds (overage: Rs 556 above PMKVY limit). Use reject_applicant when no +[END] success=false steps=6 score=0.000 rewards=0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 - repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 - BOUNDARY_FRAUD +============================================================ + Profile : {'age': '29', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. 
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11035. +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 07: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 11035 exceeds all scheme thresholds (overage: Rs 1036 above PMKVY limit). Use reject_applicant when n +[END] success=false steps=7 score=0.000 rewards=0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 - repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 - BOUNDARY_FRAUD +============================================================ + Profile : {'age': '28', 'number_of_children': '1', 'marital_status': 'unmarried', 'bank_name': 'Bank of Baroda'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. 
+[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null + Step 03: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=0.00 done=false error=null + Step 04: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11082. +[STEP] step=5 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 11082 exceeds all scheme thresholds (overage: Rs 1083 above PMKVY limit). Use reject_applicant when n +[END] success=false steps=5 score=0.000 rewards=0.00,0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 - repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 - ESCALATION_DILEMMA +============================================================ + Profile : {'age': '35', 'income': '10690', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 - repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 - ESCALATION_DILEMMA +============================================================ + Profile : {'age': '44', 'income': '17519', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. 
+[STEP] step=5 action=check_scheme_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 
score=0.000 rewards=-0.10,-0.10,-0.10,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 - repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 - ESCALATION_DILEMMA +============================================================ + Profile : {'age': '26', 'income': '15496', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'Bank of Baroda'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. 
+[STEP] step=8 action=ask_question('income') reward=-0.10 done=false error=null + Step 08: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 09: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 12: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: Do not approve this case before verifying employment records. Request the PAN card first to check fo +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=14 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=15 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 16: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=17 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 17: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. 
+[STEP] step=18 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 18: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=19 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=ask_question('income') reward=-0.10 done=false error=null + Step 20: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[END] success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,0.00,-0.10,-0.10,-0.10,-0.10,-1.50,0.00,-0.10,-0.10,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=3 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36', 'updated_age_source': 'AADHAIR'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile("{'age': '36', 'income': '7623', 
'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: 
Internal Server Error +[STEP] step=16 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=17 action=request_document('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=18 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=request_document('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[END] success=false steps=20 score=0.000 rewards=0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=compare_age('official_age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile("{'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 
action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('age_verification_report') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=request_document('age_verification_report') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 16: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 35 in the prof +[STEP] step=17 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6090', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'state_of_residence': 'Gujarat', 'marital_status': 'married', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified.
Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=update_profile("{'age': '36', 'income': '6090', 'occupation': 'mason', 'has_aadhaar': 'True', 'state_of_residence': 'Gujarat', 'marital_status': 'married', 'bank_name': 'Canara Bank', 'official_age': '36', 'self_reported_age': '35'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content + [ERROR] agent decision failed: JSON_PARSE_ERROR + raw={"action_type": "update_profile", "value": {'age': 36, 'income': 6090, 'occupation': 'mason', 'has_aadhaar': true, 'self_reported_age': 35, 'state_of_residence': 'Gujarat', 'marital_status': 'married' +[STEP] step=4 action=agent_error reward=0.00 done=true error=JSON_PARSE_ERROR +[END] success=false steps=4 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.283 ± 0.491 / 1.0 + Task 2 (Missing Data ): 0.303 ± 0.525 / 1.0 + Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 0.333 ± 0.577 / 1.0 + Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0 + Average : 0.184 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.2833} +STD_JSON {"task": 1, "std": 0.4907} +SCORE_JSON {"task": 2, "score": 0.3033} +STD_JSON {"task": 2, "std": 0.5254} +SCORE_JSON {"task": 3, "score": 0.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 0.3333} +STD_JSON {"task": 4, "std": 0.5774} +SCORE_JSON {"task": 5, "score": 0.0} +STD_JSON
{"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nemotron_super49b.txt b/reports/baseline_report/inference_logs/inference_nemotron_super49b.txt new file mode 100644 index 0000000..cb2eee1 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nemotron_super49b.txt @@ -0,0 +1,376 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : nvidia/llama-3.3-nemotron-super-49b-v1 + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '27', 'income': '6336', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=6 score=0.950 rewards=0.00,0.00,0.00,-0.10,0.00,10.00 + + GRADER SCORE: 0.950 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '20', 'income': '1288', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'state_of_residence': 'Gujarat'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 04: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 — repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '27', 'income': '4908', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=6 score=0.450 rewards=0.00,0.00,0.00,-0.10,0.00,3.00 + + GRADER SCORE: 0.450 / 1.0 + + [Task 2 — repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '47', 'income': '1594', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 0.960 / 1.0 + + [Task 2 — repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '57', 'income': '4061', 'bank_name': 'UCO Bank', 'marital_status': 'unmarried', 'state_of_residence': 'Gujarat'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 — repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '22', 'income': '4771', 'bank_name': 'UCO Bank', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 0.960 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '31', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11217. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11217 exceeds all scheme thresholds. No eligible scheme found. +[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '25', 'bank_name': 'PNB', 'number_of_children': '0', 'state_of_residence': 'Rajasthan'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11092. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11092 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '25', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11517. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11517 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '45', 'income': '19939', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '2', 'state_of_residence': 'Rajasthan', 'bank_name': 'Bank of Baroda'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '32', 'income': '19339', 'occupation': 'student', 'has_aadhaar': 'True', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '23', 'income': '15394', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI', 'state_of_residence': 'Rajasthan', 'number_of_children': '3'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6016', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'PNB', 'number_of_children': '2', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '6257', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'bank_name': 'SBI', 'number_of_children': '0', 'state_of_residence': 'Maharashtra'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. 
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 โ€” repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8148', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. 
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ยฑ std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.800 ยฑ 0.304 / 1.0 + Task 2 (Missing Data ): 0.973 ยฑ 0.023 / 1.0 + Task 3 (Boundary Fraud ): 1.000 ยฑ 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 1.000 ยฑ 0.000 / 1.0 + Task 5 (Document Conflict ): 1.000 ยฑ 0.000 / 1.0 + Average : 0.955 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.8} +STD_JSON {"task": 1, "std": 0.3041} +SCORE_JSON {"task": 2, "score": 0.9733} +STD_JSON {"task": 2, "std": 0.0231} +SCORE_JSON {"task": 3, "score": 1.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 1.0} +STD_JSON {"task": 4, "std": 0.0} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/inference_logs/inference_nvidia_8b.txt b/reports/baseline_report/inference_logs/inference_nvidia_8b.txt new file mode 100644 index 0000000..1e817b0 --- /dev/null +++ b/reports/baseline_report/inference_logs/inference_nvidia_8b.txt @@ -0,0 +1,672 @@ + +============================================================ + SCHEME ENV โ€” OPTION A EVALUATION + Model : meta/llama-3.1-8b-instruct + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 โ€” repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '32', 'income': '8687', 'bank_name': 'Bank of Baroda', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME 
DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY โ€” optimal scheme. +[END] success=false steps=6 score=0.900 rewards=0.00,0.00,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.900 / 1.0 + + [Task 1 โ€” repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '18', 'income': '1931', 'bank_name': 'PNB', 'state_of_residence': 'Maharashtra'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('age') reward=-0.10 done=false error=null + Step 09: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null + Step 10: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=13 action=ask_question('is_income_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 14: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=15 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 16: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY โ€” optimal scheme. +[END] success=false steps=16 score=0.700 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.700 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '7160', 'marital_status': 'unmarried', 'bank_name': 'Canara Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. 
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('age') reward=-0.10 done=false error=null + Step 09: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null + Step 10: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 12: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=13 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 13: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY โ€” optimal scheme. +[END] success=false steps=13 score=0.700 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.700 / 1.0 + + [Task 2 โ€” repeat 1/3] +[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '28', 'income': '4776', 'number_of_children': '2'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('document_type') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('document_type') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=9 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=10 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=11 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=13 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=request_document('pan_card') reward=0.00 done=false error=null + Step 14: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=15 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 15: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=16 action=ask_question('age') reward=-0.10 done=false error=null + Step 16: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=17 action=ask_question('number_of_days_worked_last_year') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=18 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=ask_question('number_of_days_worked_last_year') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 2 โ€” repeat 2/3] +[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '22', 'income': '3416', 'number_of_children': '1', 'bank_name': 'Canara Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. 
+[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=7 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('has_family_member_with_disability') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=9 action=ask_question('has_family_member_with_disability') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=11 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 11: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. +[END] success=false steps=11 score=0.650 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,10.00 + + GRADER SCORE: 0.650 / 1.0 + + [Task 2 โ€” repeat 3/3] +[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '29', 'income': '1180', 'number_of_children': '4'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. 
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 06: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=7 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=10 action=request_document('pan_card') reward=0.00 done=false error=null + Step 10: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. 
+[STEP] step=11 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null + Step 13: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=15 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=17 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=18 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 19: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. 
+[STEP] step=20 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10,0.00,-0.10,0.00,-0.10,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 โ€” repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '34', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10572. +[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null + Step 04: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. 
+[STEP] step=6 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('is_student') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 08: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 09: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null + Step 10: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=11 action=request_document('pan_card') reward=0.00 done=false error=null + Step 11: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=12 action=ask_question('age') reward=-0.10 done=false error=null + Step 12: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null + Step 13: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 14: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=15 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 15: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. 
+[STEP] step=16 action=ask_question('income') reward=-0.10 done=false error=null + Step 16: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=17 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 17: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=18 action=request_document('pan_card') reward=0.00 done=false error=null + Step 18: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=19 action=ask_question('age') reward=-0.10 done=false error=null + Step 19: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=20 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 20: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,-0.10,0.00,-0.10,-0.10,-0.10,0.00,-0.10,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 โ€” repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 3/5 โ€” BOUNDARY_FRAUD +============================================================ + Profile : {'age': '23', 'number_of_children': '0'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('income') reward=0.00 done=false error=null + Step 01: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11179. 
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=6 action=ask_question('age') reward=-0.10 done=false error=null + Step 06: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=7 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=8 action=ask_question('income') reward=-0.10 done=false error=null + Step 08: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 09: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=11 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 11: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. 
+[STEP] step=12 action=request_document('pan_card') reward=0.00 done=false error=null + Step 12: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null + Step 13: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=15 action=ask_question('income') reward=-0.10 done=false error=null + Step 15: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=16 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 16: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=17 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 17: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=18 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 18: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=19 action=request_document('pan_card') reward=0.00 done=false error=null + Step 19: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=20 action=ask_question('age') reward=-0.10 done=false error=null + Step 20: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '23', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11989. +[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null + Step 04: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified.
+[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('income') reward=-0.10 done=false error=null + Step 09: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('age') reward=-0.10 done=false error=null + Step 10: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=11 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 11: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=12 action=request_document('pan_card') reward=0.00 done=false error=null + Step 12: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=13 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 13: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 14: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=15 action=ask_question('income') reward=-0.10 done=false error=null + Step 15: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. 
+[STEP] step=16 action=ask_question('age') reward=-0.10 done=false error=null + Step 16: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=17 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 17: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=18 action=request_document('pan_card') reward=0.00 done=false error=null + Step 18: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=19 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 19: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=20 action=ask_question('has_aadhaar') reward=-2.00 done=true error=null + Step 20: ask_question('has_aadhaar') -> reward=-2.0, done=True + TIMEOUT: 20 steps reached without a decision. +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '41', 'income': '18035', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '3', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Bihar'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc +[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null + Step 01: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 02: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=6 action=reject_applicant('AGE_EXCEEDED') reward=-1.00 done=false error=null + Step 06: reject_applicant('AGE_EXCEEDED') -> reward=-1.0, done=False + PREMATURE ADJUDICATION: Review documentary evidence before rejecting. Request the PAN card to verify employment records, +[STEP] step=7 action=request_document('pan_card') reward=0.00 done=false error=null + Step 07: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w +[STEP] step=8 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 08: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi +[END] success=false steps=8 score=0.850 rewards=-0.10,-0.10,-0.10,0.00,-0.10,-1.00,0.00,10.00 + + GRADER SCORE: 0.850 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '30', 'income': '13238', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '2'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null + Step 01: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 02: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved.
Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=6 action=escalate('DATA_MISMATCH') reward=10.00 done=true error=null + Step 06: escalate('DATA_MISMATCH') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=false steps=6 score=0.900 rewards=-0.10,-0.10,-0.10,0.00,0.00,10.00 + + GRADER SCORE: 0.900 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '43', 'income': '17180', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null + Step 01: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 02: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null + Step 04: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w +[STEP] step=7 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 07: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=false steps=7 score=0.850 rewards=-0.10,-0.10,0.00,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.850 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6288', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'unmarried', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '1'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years.
Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '7472', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'SBI'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '7914', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 38 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=38, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.767 ± 0.116 / 1.0 + Task 2 (Missing Data ): 0.217 ± 0.375 / 1.0 + Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 0.867 ± 0.029 / 1.0 + Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0 + Average : 0.570 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.7667} +STD_JSON {"task": 1, "std": 0.1155} +SCORE_JSON {"task": 2, "score": 0.2167} +STD_JSON {"task": 2, "std": 0.3753} +SCORE_JSON {"task": 3, "score": 0.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 0.8667} +STD_JSON {"task": 4, "std": 0.0289} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/baseline_report/leaderboard.csv b/reports/baseline_report/leaderboard.csv new file mode 100644 index 0000000..1fffcc0 --- /dev/null +++ b/reports/baseline_report/leaderboard.csv @@ -0,0 +1,9 @@ +Model,Size,Task1,Task2,Task3,Task4,Task5,Average +mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967 +nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955 +nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951 +nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8 +nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8 +nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557 +meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517 +nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184 diff --git a/reports/baseline_report/results.json b/reports/baseline_report/results.json new file mode 100644 index 0000000..fee5321 --- /dev/null +++
b/reports/baseline_report/results.json @@ -0,0 +1,218 @@ +[ + { + "model": "nvidia/llama-3.1-nemotron-nano-8b-v1", + "size": "8B", + "average": 0.184, + "tasks": { + "task1": { + "score": 0.283, + "std": 0.491 + }, + "task2": { + "score": 0.303, + "std": 0.525 + }, + "task3": { + "score": 0.0, + "std": 0.0 + }, + "task4": { + "score": 0.333, + "std": 0.577 + }, + "task5": { + "score": 0.0, + "std": 0.0 + } + } + }, + { + "model": "meta/llama-3.1-8b-instruct", + "size": "8B", + "average": 0.517, + "tasks": { + "task1": { + "score": 0.4, + "std": 0.458 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 0.317, + "std": 0.548 + }, + "task4": { + "score": 0.867, + "std": 0.058 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-mini-4b-instruct", + "size": "4B", + "average": 0.557, + "tasks": { + "task1": { + "score": 0.483, + "std": 0.029 + }, + "task2": { + "score": 0.667, + "std": 0.577 + }, + "task3": { + "score": 0.667, + "std": 0.577 + }, + "task4": { + "score": 0.967, + "std": 0.029 + }, + "task5": { + "score": 0.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-3-nano-30b-a3b", + "size": "30B", + "average": 0.8, + "tasks": { + "task1": { + "score": 1.0, + "std": 0.0 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-3-super-120b-a12b", + "size": "120B", + "average": 0.8, + "tasks": { + "task1": { + "score": 1.0, + "std": 0.0 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/llama-3.1-nemotron-51b-instruct", + "size": "51B", + "average": 0.951, + "tasks": { + "task1": { + "score": 0.8, + "std": 0.304 + }, + "task2": { + "score": 0.957, + 
"std": 0.045 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/llama-3.3-nemotron-super-49b-v1", + "size": "49B", + "average": 0.955, + "tasks": { + "task1": { + "score": 0.8, + "std": 0.304 + }, + "task2": { + "score": 0.973, + "std": 0.023 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "mistralai/mistral-nemotron", + "size": "~56B", + "average": 0.967, + "tasks": { + "task1": { + "score": 0.833, + "std": 0.289 + }, + "task2": { + "score": 1.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + } +] \ No newline at end of file diff --git a/reports/baseline_report/summary.txt b/reports/baseline_report/summary.txt new file mode 100644 index 0000000..92de904 --- /dev/null +++ b/reports/baseline_report/summary.txt @@ -0,0 +1,12 @@ +OpenEnv scheme_env Benchmark โ€” Baseline Report Summary +======================================================== +Date generated : 2026-04-08 +Models evaluated : 8 + +Best model : mistral-nemotron (avg=0.967) +Worst model : nemotron-nano-8b (avg=0.184) + +Hardest task : Task 2 (mean=0.487) +Easiest task : Task 4 (mean=0.896) + +Perfect score (1.0 on all tasks): none diff --git a/reports/baseline_report/task_heatmap.png b/reports/baseline_report/task_heatmap.png new file mode 100644 index 0000000..fc05e91 Binary files /dev/null and b/reports/baseline_report/task_heatmap.png differ diff --git a/reports/baseline_report/test_logs/pytest_results.txt b/reports/baseline_report/test_logs/pytest_results.txt new file mode 100644 index 0000000..1b2adaa --- /dev/null +++ b/reports/baseline_report/test_logs/pytest_results.txt @@ -0,0 +1,30 @@ +============================= test session starts 
============================== +platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14 +cachedir: .pytest_cache +rootdir: /Users/advikdivekar/Desktop/OpenEnv +configfile: pyproject.toml +plugins: anyio-4.13.0 +collecting ... collected 20 items + +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [ 5%] +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%] +tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%] +tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED [ 75%] +tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%] +tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED [ 85%] +tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%] +tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED [ 95%] +tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED [100%] 
+ +============================== 20 passed in 2.24s ============================== diff --git a/reports/baseline_report/test_logs/smoke_test_results.txt b/reports/baseline_report/test_logs/smoke_test_results.txt new file mode 100644 index 0000000..15da4e3 --- /dev/null +++ b/reports/baseline_report/test_logs/smoke_test_results.txt @@ -0,0 +1,61 @@ + +============================================================ +SMOKE TEST — scheme_env +============================================================ + +Task 1 — Scheme Discovery + Profile at reset: {'age': '28', 'income': '4665', 'marital_status': 'widowed', 'number_of_children': '1'} + Missing: ['occupation', 'has_aadhaar'] + [PASS] Task 1 terminates on approve — done=True + [PASS] Task 1 reward positive or suboptimal — reward=3.0 + [PASS] Task 1 grader_score set — score=0.5 + Notification: SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. + +Task 2 — Missing Data (premature approval) + Missing at reset: ['occupation', 'has_aadhaar'] + [PASS] Task 2 premature approve terminates — done=True + [PASS] Task 2 premature approve scores 0.0 — score=0.0 + [PASS] Task 2 premature approve reward negative — reward=-5.0 + +Task 2 — Missing Data (correct flow) + [PASS] Task 2 correct flow terminates — done=True + [PASS] Task 2 correct flow reward positive — reward=10.0 + [PASS] Task 2 correct flow score > 0 — score=1.0 + +Task 3 — Boundary Fraud (reject before asking income) + Profile at reset: {'age': '24', 'number_of_children': '3', 'marital_status': 'widowed'} + Missing: ['occupation', 'has_aadhaar', 'income'] + [PASS] Task 3 income hidden at reset — income in profile=['age', 'number_of_children', 'marital_status'] + [PASS] Task 3 premature reject scores 0.0 — score=0.0 + [PASS] Task 3 premature reject reward negative — reward=-2.0 + +Task 3 — Boundary Fraud (correct flow) + [PASS] Task 3 correct reject terminates — done=True + [PASS] Task 3 correct reject reward positive —
reward=5.0 + [PASS] Task 3 correct reject score > 0.30 — score=1.0 + +Task 4 — Escalation (escalate before PAN — should soft-block) + [PASS] Task 4 escalate before PAN is non-terminal — done=False + [PASS] Task 4 escalate before PAN reward negative — reward=-1.0 + +Task 4 — Escalation (correct flow) + [PASS] Task 4 PAN reveals contradiction — notif=PAN card retrieved. Records show this applicant has been a registered employee o + [PASS] Task 4 correct escalate terminates — done=True + [PASS] Task 4 correct escalate reward=10 — reward=10.0 + [PASS] Task 4 correct escalate score > 0.80 — score=1.0 + +Task 5 — Document Conflict (correct flow) + Self-reported age in profile: 33 + [PASS] Task 5 Aadhaar reveals age conflict — notif=Aadhaar card verified. Official age on record: 36 years. Note: this differs from + [PASS] Task 5 correct reject terminates — done=True + [PASS] Task 5 correct reject reward=5 — reward=5.0 + [PASS] Task 5 correct reject score > 0.80 — score=1.0 + +Wrong escalation on Task 1 (should now be terminal) + [PASS] Wrong escalation Task 1 is terminal — done=True + [PASS] Wrong escalation reward=-2.0 — reward=-2.0 + +============================================================ +ALL TESTS PASSED — environment logic is correct +============================================================ + diff --git a/reports/baseline_report/test_logs/test_results.txt b/reports/baseline_report/test_logs/test_results.txt new file mode 100644 index 0000000..b662b86 --- /dev/null +++ b/reports/baseline_report/test_logs/test_results.txt @@ -0,0 +1,30 @@ +============================= test session starts ============================== +platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14 +cachedir: .pytest_cache +rootdir: /Users/advikdivekar/Desktop/OpenEnv +configfile: pyproject.toml +plugins: anyio-4.13.0 +collecting ...
collected 20 items + +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [ 5%] +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%] +tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%] +tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED [ 75%] +tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%] +tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED [ 85%] +tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%] +tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED [ 95%] +tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED [100%] + +============================== 20 passed in 2.38s ============================== diff --git a/reports/difficulty_profile.png b/reports/difficulty_profile.png new file mode 100644 index 0000000..a65905a Binary files /dev/null and b/reports/difficulty_profile.png differ diff --git 
a/reports/efficiency_scatter.png b/reports/efficiency_scatter.png new file mode 100644 index 0000000..056b93f Binary files /dev/null and b/reports/efficiency_scatter.png differ diff --git a/reports/inference_logs/inference_mistral_nemotron.txt b/reports/inference_logs/inference_mistral_nemotron.txt new file mode 100644 index 0000000..f00cae4 --- /dev/null +++ b/reports/inference_logs/inference_mistral_nemotron.txt @@ -0,0 +1,349 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : mistralai/mistral-nemotron + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '2762', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '25', 'income': '7782', 'marital_status': 'divorced', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '3'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 — repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '32', 'income': '1847', 'bank_name': 'PNB'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. +[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00 + + GRADER SCORE: 0.500 / 1.0 + + [Task 2 โ€” repeat 1/3] +[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron + +============================================================ + TASK 2/5 โ€” MISSING_DATA +============================================================ + Profile : {'age': '47', 'income': '3809', 'bank_name': 'SBI', 'state_of_residence': 'Maharashtra', 'number_of_children': '4'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS โ€” optimal scheme. 
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '54', 'income': '2075', 'number_of_children': '3', 'bank_name': 'SBI'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '27', 'income': '4839', 'bank_name': 'UCO Bank', 'state_of_residence': 'Bihar'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '28', 'state_of_residence': 'Bihar'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11838.
+[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 11838 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '32', 'marital_status': 'widowed'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10356.
+[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10356 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'state_of_residence': 'Maharashtra', 'bank_name': 'PNB'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10026.
+[STEP] step=3 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 03: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10026 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '33', 'income': '14586', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'marital_status': 'widowed'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '32', 'income': '11911', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'bank_name': 'UCO Bank'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Coal India for the past 6 years with a
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '43', 'income': '11888', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'bank_name': 'SBI', 'state_of_residence': 'Gujarat'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '8160', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'divorced'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '6654', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '3'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=mistralai/mistral-nemotron
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '8482', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '3'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=3 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 03: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 0.833 ± 0.289 / 1.0
+ Task 2 (Missing Data ): 1.000 ± 0.000 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.967 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 0.8333}
+STD_JSON {"task": 1, "std": 0.2887}
+SCORE_JSON {"task": 2, "score": 1.0}
+STD_JSON {"task": 2, "std": 0.0}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron3_120b.txt b/reports/inference_logs/inference_nemotron3_120b.txt
new file mode 100644
index 0000000..aa28394
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron3_120b.txt
@@ -0,0 +1,358 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/nemotron-3-super-120b-a12b
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '35', 'income': '5969', 'number_of_children': '4'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 04: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '25', 'income': '4455', 'bank_name': 'Canara Bank', 'state_of_residence': 'Rajasthan'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '31', 'income': '5924', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '30', 'income': '1542', 'marital_status': 'divorced', 'number_of_children': '0', 'state_of_residence': 'Bihar'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '40', 'income': '4467', 'number_of_children': '2', 'state_of_residence': 'Rajasthan'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '50', 'income': '3776', 'marital_status': 'widowed', 'bank_name': 'SBI', 'state_of_residence': 'Rajasthan'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '26', 'number_of_children': '1', 'state_of_residence': 'Rajasthan', 'bank_name': 'SBI'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10195.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10195 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '30', 'bank_name': 'Bank of Baroda', 'number_of_children': '2'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10217.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10217 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'number_of_children': '4', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10088.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10088 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '23', 'income': '11636', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '40', 'income': '16723', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '4', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. 
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 โ€” repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-super-120b-a12b + +============================================================ + TASK 4/5 โ€” ESCALATION_DILEMMA +============================================================ + Profile : {'age': '34', 'income': '15485', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Bihar', 'number_of_children': '3'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. 
Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '7383', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'bank_name': 'PNB', 'number_of_children': '4'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '7689', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'marital_status': 'divorced'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-super-120b-a12b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '6093', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Bihar', 'marital_status': 'divorced'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 1.000 ± 0.000 / 1.0
+ Task 2 (Missing Data ): 0.000 ± 0.000 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.800 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 1.0}
+STD_JSON {"task": 1, "std": 0.0}
+SCORE_JSON {"task": 2, "score": 0.0}
+STD_JSON {"task": 2, "std": 0.0}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron3_nano30b.txt
b/reports/inference_logs/inference_nemotron3_nano30b.txt
new file mode 100644
index 0000000..63ec0ae
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron3_nano30b.txt
@@ -0,0 +1,352 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/nemotron-3-nano-30b-a3b
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '5888', 'bank_name': 'UCO Bank', 'number_of_children': '2', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '27', 'income': '9982', 'number_of_children': '0', 'bank_name': 'Bank of Baroda'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '32', 'income': '1057', 'state_of_residence': 'Bihar', 'marital_status': 'divorced', 'bank_name': 'PNB'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=10.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMAY — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '34', 'income': '1576', 'number_of_children': '1', 'bank_name': 'UCO Bank'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '53', 'income': '4659', 'number_of_children': '4', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '41', 'income': '2243', 'state_of_residence': 'Gujarat', 'number_of_children': '2', 'bank_name': 'PNB'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMAY') reward=-5.00 done=true error=null
+ Step 03: approve_scheme('PMAY') -> reward=-5.0, done=True
+ ERROR: Applicant does not qualify for 'PMAY'.
+[END] success=false steps=3 score=0.000 rewards=0.00,0.00,-5.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '31', 'bank_name': 'Canara Bank', 'number_of_children': '1', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10770.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10770 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '33', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10185.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10185 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'marital_status': 'widowed', 'number_of_children': '1'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('income') reward=0.00 done=false error=null
+ Step 02: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10951.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 04: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10951 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '33', 'income': '13674', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'PNB'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '29', 'income': '11064', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '0'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '41', 'income': '10119', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '35', 'income': '8632', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Gujarat'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 38 years. Note: this differs from the self-reported age of 35 in the prof
+[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=38, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '7933', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'bank_name': 'SBI', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '4'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 01: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=3 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 03: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=3 score=1.000 rewards=-1.50,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-3-nano-30b-a3b
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '7429', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Bihar', 'bank_name': 'Bank of Baroda', 'number_of_children': '0'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=2 action=reject_applicant('NO_ELIGIBLE_SCHEME') reward=5.00 done=true error=null
+ Step 02: reject_applicant('NO_ELIGIBLE_SCHEME') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 1.000 ± 0.000 / 1.0
+ Task 2 (Missing Data ): 0.000 ± 0.000 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.800 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 1.0}
+STD_JSON {"task": 1, "std": 0.0}
+SCORE_JSON {"task": 2, "score": 0.0}
+STD_JSON {"task": 2, "std": 0.0}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron51b.txt b/reports/inference_logs/inference_nemotron51b.txt
new file mode 100644
index 0000000..03f64cd
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron51b.txt
@@ -0,0 +1,370 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/llama-3.3-nemotron-super-49b-v1
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '1437', 'number_of_children': '1', 'state_of_residence': 'Rajasthan', 'bank_name': 'PNB'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 04: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '7676', 'state_of_residence': 'Rajasthan'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 05: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=false steps=5 score=0.950 rewards=0.00,0.00,0.00,-0.10,10.00
+
+ GRADER SCORE: 0.950 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '26', 'income': '1391', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=5 score=0.450 rewards=0.00,0.00,0.00,-0.10,3.00
+
+ GRADER SCORE: 0.450 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '32', 'income': '3251', 'state_of_residence': 'Rajasthan'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00
+
+ GRADER SCORE: 0.960 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '22', 'income': '2235', 'state_of_residence': 'Bihar'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '28', 'income': '4274', 'state_of_residence': 'Rajasthan', 'number_of_children': '4', 'bank_name': 'SBI'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=4 score=0.910 rewards=0.00,0.00,-0.10,10.00
+
+ GRADER SCORE: 0.910 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '26', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11975.
+[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 11975 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '26', 'bank_name': 'Bank of Baroda', 'number_of_children': '3'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11092.
+[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 11092 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '27', 'marital_status': 'divorced', 'number_of_children': '3'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10488.
+[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10488 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '38', 'income': '11505', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Maharashtra', 'marital_status': 'married'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '35', 'income': '11663', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Maharashtra'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of GAIL India for the past 6 years with a
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '24', 'income': '18630', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'marital_status': 'divorced', 'state_of_residence': 'Uttar Pradesh'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '8492', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '2', 'bank_name': 'SBI', 'state_of_residence': 'Bihar'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '7281', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'state_of_residence': 'Maharashtra', 'number_of_children': '1', 'bank_name': 'PNB'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 34 in the prof
+[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '35', 'income': '8158', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'Bank of Baroda'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof
+[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 0.800 ± 0.304 / 1.0
+ Task 2 (Missing Data ): 0.957 ± 0.045 / 1.0
+ Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0
+ Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0
+ Average : 0.951 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 0.8}
+STD_JSON {"task": 1, "std": 0.3041}
+SCORE_JSON {"task": 2, "score": 0.9567}
+STD_JSON {"task": 2, "std": 0.0451}
+SCORE_JSON {"task": 3, "score": 1.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 1.0}
+STD_JSON {"task": 4, "std": 0.0}
+SCORE_JSON {"task": 5, "score": 1.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron_mini4b.txt b/reports/inference_logs/inference_nemotron_mini4b.txt
new file mode 100644
index 0000000..95d3d14
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron_mini4b.txt
@@ -0,0 +1,396 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/nemotron-mini-4b-instruct
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '25', 'income': '3626', 'state_of_residence': 'Bihar', 'marital_status': 'divorced', 'number_of_children': '3'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME
DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00
+
+ GRADER SCORE: 0.500 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '34', 'income': '3917', 'number_of_children': '4', 'state_of_residence': 'Bihar', 'marital_status': 'unmarried'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 03: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 04: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=4 score=0.450 rewards=0.00,0.00,-0.10,3.00
+
+ GRADER SCORE: 0.450 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '33', 'income': '2858', 'bank_name': 'SBI'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 03: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=3 score=0.500 rewards=0.00,0.00,3.00
+
+ GRADER SCORE: 0.500 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '48', 'income': '1982', 'number_of_children': '3', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '36', 'income': '4164', 'number_of_children': '1', 'marital_status': 'unmarried', 'bank_name': 'Canara Bank'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 02: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=3 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 03: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null
+ Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True
+ INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised. This case has suffici
+[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '43', 'income': '4120', 'bank_name': 'PNB'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified.
has_aadhaar confirmed as True.
+[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '29', 'marital_status': 'married'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10437.
+[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 10437 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '24', 'bank_name': 'Canara Bank', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11651.
+[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null
+ Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True
+ CORRECT REJECTION: Income 11651 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '28', 'bank_name': 'UCO Bank', 'state_of_residence': 'Maharashtra', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10915.
+[STEP] step=4 action=escalate('DATA_MISMATCH') reward=-2.00 done=true error=null
+ Step 04: escalate('DATA_MISMATCH') -> reward=-2.0, done=True
+ INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised.
This case has suffici
+[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '30', 'income': '19888', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 01: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 02: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=escalate('DATA_MISMATCH') reward=-1.00 done=false error=null
+ Step 03: escalate('DATA_MISMATCH') -> reward=-1.0, done=False
+ INSUFFICIENT BASIS FOR ESCALATION: First request the PAN card to verify the suspected employment contradiction. Escalate
+[STEP] step=4 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 04: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active
+[STEP] step=5 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 05: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=false steps=5 score=0.950 rewards=-0.10,-0.10,-1.00,0.00,10.00
+
+ GRADER SCORE: 0.950 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '30', 'income': '16335', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Rajasthan'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 01: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 02: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 03: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active
+[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=false steps=4 score=0.950 rewards=-0.10,-0.10,0.00,10.00
+
+ GRADER SCORE: 0.950 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '41', 'income': '19645', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI', 'number_of_children': '2'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 01: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=escalate('DATA_MISMATCH') reward=-1.00 done=false error=null
+ Step 02: escalate('DATA_MISMATCH') -> reward=-1.0, done=False
+ INSUFFICIENT BASIS FOR ESCALATION: First request the PAN card to verify the suspected employment contradiction. Escalate
+[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 03: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active
+[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=true steps=4 score=1.000 rewards=-0.10,-1.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '35', 'income': '7344', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'married'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 02: request_document('pan_card') -> reward=0.0, done=False
+ PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume
+[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null
+ Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True
+ INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised.
This case has suffici
+[END] success=false steps=4 score=0.000 rewards=-0.10,0.00,-1.50,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '34', 'income': '8839', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'bank_name': 'SBI', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 02: request_document('pan_card') -> reward=0.0, done=False
+ PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume
+[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=4 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null
+ Step 04: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True
+ INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised.
This case has suffici
+[END] success=false steps=4 score=0.000 rewards=-0.10,0.00,-1.50,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 5 — repeat 3/3]
+[START] task=document_conflict env=scheme_env model=nvidia/nemotron-mini-4b-instruct
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '8666', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '1', 'bank_name': 'PNB'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 02: request_document('pan_card') -> reward=0.0, done=False
+ PAN card verified. No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume
+[STEP] step=3 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null
+ Step 03: approve_scheme('PMKVY') -> reward=-1.5, done=False
+ PROTOCOL VIOLATION: You must verify the Aadhaar card before approving any scheme when age is a critical eligibility fact
+[STEP] step=4 action=ask_question('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=5 action=ask_question('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=6 action=escalate('MANUAL_REVIEW_REQUIRED') reward=-2.00 done=true error=null
+ Step 06: escalate('MANUAL_REVIEW_REQUIRED') -> reward=-2.0, done=True
+ INCORRECT ESCALATION: Escalation is only appropriate when data integrity is genuinely compromised.
This case has suffici
+[END] success=false steps=6 score=0.000 rewards=-0.10,0.00,-1.50,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+============================================================
+ FINAL GRADER SCORES (mean ± std over 3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 0.483 ± 0.029 / 1.0
+ Task 2 (Missing Data ): 0.667 ± 0.577 / 1.0
+ Task 3 (Boundary Fraud ): 0.667 ± 0.577 / 1.0
+ Task 4 (Escalation Dilemma ): 0.967 ± 0.029 / 1.0
+ Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0
+ Average : 0.557 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 0.4833}
+STD_JSON {"task": 1, "std": 0.0289}
+SCORE_JSON {"task": 2, "score": 0.6667}
+STD_JSON {"task": 2, "std": 0.5774}
+SCORE_JSON {"task": 3, "score": 0.6667}
+STD_JSON {"task": 3, "std": 0.5774}
+SCORE_JSON {"task": 4, "score": 0.9667}
+STD_JSON {"task": 4, "std": 0.0289}
+SCORE_JSON {"task": 5, "score": 0.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron_nano.txt b/reports/inference_logs/inference_nemotron_nano.txt
new file mode 100644
index 0000000..e19b822
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron_nano.txt
@@ -0,0 +1,562 @@
+
+============================================================
+ SCHEME ENV — OPTION A EVALUATION
+ Model : nvidia/llama-3.1-nemotron-nano-8b-v1
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 — repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '27', 'income': '2415', 'bank_name': 'PNB', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy]
Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 02: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 03: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=5 score=0.400 rewards=0.00,0.00,-0.10,-0.10,3.00
+
+ GRADER SCORE: 0.400 / 1.0
+
+ [Task 1 — repeat 2/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '22', 'income': '5621', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified.
has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=approve_scheme('PMKVY') reward=3.00 done=true error=null
+ Step 05: approve_scheme('PMKVY') -> reward=3.0, done=True
+ SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=5 score=0.400 rewards=0.00,0.00,-0.10,-0.10,3.00
+
+ GRADER SCORE: 0.400 / 1.0
+
+ [Task 1 — repeat 3/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 1/5 — SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '19', 'income': '8028', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra', 'number_of_children': '0'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT
+[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 05: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=false steps=5 score=0.900 rewards=0.00,0.00,-0.10,-0.10,10.00
+
+ GRADER SCORE: 0.900 / 1.0
+
+ [Task 2 — repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '40', 'income': '4258', 'marital_status': 'widowed', 'bank_name': 'PNB', 'state_of_residence': 'Bihar'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=7 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 08: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=8 score=0.730 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00
+
+ GRADER SCORE: 0.730 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '57', 'income': '1596', 'bank_name': 'UCO Bank'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=4 action=check_field('income') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=5 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=6 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=7 action=check_field('has_aadhaar') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=8 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=9 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=10 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=11 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 11: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=true steps=11 score=1.000 rewards=0.00,0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '55', 'income': '1840', 'number_of_children': '0', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'PNB'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=6 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 06: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=7 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 07: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=7 score=0.730 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00
+
+ GRADER SCORE: 0.730 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '29', 'bank_name': 'Bank of Baroda', 'marital_status': 'widowed'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE.
Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11193.
+[STEP] step=4 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null
+ Step 04: approve_scheme('PMKVY') -> reward=-4.0, done=True
+ CLEAR MISS: Income 11193 exceeds all scheme thresholds (overage: Rs 1194 above PMKVY limit). Use reject_applicant when n
+[END] success=false steps=4 score=0.000 rewards=0.00,0.00,0.00,-4.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '28', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = mason.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10812.
+[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=20 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '26', 'number_of_children': '1', 'state_of_residence': 'Bihar'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10424.
+[STEP] step=4 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=5 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=6 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=7 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=8 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=9 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=10 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=11 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=12 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=13 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=14 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=15 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=16 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=17 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=18 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=19 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=20 action=apply_eligibility_rules('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '25', 'income': '12263', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Bihar', 'marital_status': 'unmarried'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 01: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w
+[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=true steps=2 score=1.000 rewards=0.00,10.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '24', 'income': '18756', 'occupation': 'student', 'has_aadhaar': 'True', 'marital_status': 'unmarried', 'state_of_residence': 'Rajasthan'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 01: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 02: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 03: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=7 action=ask_question('education_level') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=9 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=10 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=11 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=12 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=13 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=14 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=15 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=16 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=17 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=18 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=19 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=20 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[END] success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,-0.10
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '37', 'income': '17797', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'Canara Bank', 'marital_status': 'widowed'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 01: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 02: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 03: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 04: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 05: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=6 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 06: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=7 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 07: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 08: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('income') reward=-0.10 done=false error=null + Step 09: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 10: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 11: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=12 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 12: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=13 action=ask_question('income') reward=-0.10 done=false error=null + Step 13: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=14 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 14: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=15 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 15: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. 
+[STEP] step=16 action=ask_question('income') reward=-0.10 done=false error=null + Step 16: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=17 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 17: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=18 action=ask_question('age') reward=-0.10 done=false error=null + Step 18: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=19 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 19: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=20 action=approve_scheme('PMKVY') reward=-2.00 done=true error=null + Step 20: approve_scheme('PMKVY') -> reward=-2.0, done=True + TIMEOUT: 20 steps reached without a decision. +[END] success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-0.10,-2.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 โ€” repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8289', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'state_of_residence': 'Maharashtra', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. 
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('pan_card') reward=0.00 done=false error=null + Step 13: request_document('pan_card') -> reward=0.0, done=False + PAN card verified. 
No anomalies found in tax records. For age verification, the Aadhaar card is the authoritative docume +[STEP] step=14 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 14: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=15 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 โ€” repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8530', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '3', 'bank_name': 'PNB'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. 
Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=14 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 โ€” repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 โ€” DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '33', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof +[STEP] step=3 action=update_profile("{'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'aadhaar_age': '36', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': '36', 'income': '8469', 
'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': '36', 'income': '8469', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 33 in the prof +[STEP] step=14 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36, 'income': 8469, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 33, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ยฑ std over 
3 repeats)
+============================================================
+ Task 1 (Scheme Discovery ): 0.567 ± 0.289 / 1.0
+ Task 2 (Missing Data ): 0.820 ± 0.156 / 1.0
+ Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0
+ Task 4 (Escalation Dilemma ): 0.333 ± 0.577 / 1.0
+ Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0
+ Average : 0.344 / 1.0
+============================================================
+SCORE_JSON {"task": 1, "score": 0.5667}
+STD_JSON {"task": 1, "std": 0.2887}
+SCORE_JSON {"task": 2, "score": 0.82}
+STD_JSON {"task": 2, "std": 0.1559}
+SCORE_JSON {"task": 3, "score": 0.0}
+STD_JSON {"task": 3, "std": 0.0}
+SCORE_JSON {"task": 4, "score": 0.3333}
+STD_JSON {"task": 4, "std": 0.5774}
+SCORE_JSON {"task": 5, "score": 0.0}
+STD_JSON {"task": 5, "std": 0.0}
diff --git a/reports/inference_logs/inference_nemotron_nano8b.txt b/reports/inference_logs/inference_nemotron_nano8b.txt
new file mode 100644
index 0000000..95bdca3
--- /dev/null
+++ b/reports/inference_logs/inference_nemotron_nano8b.txt
@@ -0,0 +1,544 @@
+
+============================================================
+ SCHEME ENV - OPTION A EVALUATION
+ Model : nvidia/llama-3.1-nemotron-nano-8b-v1
+ Env : http://localhost:7860
+ Repeats : 3 per task
+============================================================
+
+ [Task 1 - repeat 1/3]
+[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 1/5 - SCHEME_DISCOVERY
+============================================================
+ Profile : {'age': '20', 'income': '4708', 'number_of_children': '2', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'unmarried'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: 
Unprocessable Content +[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 1 โ€” repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '7903', 'number_of_children': '2'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. 
+[STEP] step=3 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 
action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 1 โ€” repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 1/5 โ€” SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '1411', 'state_of_residence': 'Gujarat', 'bank_name': 'Canara Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=3 action=ask_question('income') reward=-0.10 done=false error=null + Step 03: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null + Step 04: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=7 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 07: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=8 action=approve_scheme('PMKVY') reward=10.00 done=true error=null
+ Step 08: approve_scheme('PMKVY') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in PMKVY - optimal scheme.
+[END] success=false steps=8 score=0.850 rewards=0.00,0.00,-0.10,-0.10,-0.10,10.00
+
+ GRADER SCORE: 0.850 / 1.0
+
+ [Task 2 - repeat 1/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 2/5 - MISSING_DATA
+============================================================
+ Profile : {'age': '37', 'income': '3824', 'state_of_residence': 'Bihar'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=check_field('has_aadhaar') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=6 action=check_field('occupation') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=7 action=check_field('age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=8 action=check_field('state_of_residence') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content
+[STEP] step=9 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 09: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS - optimal scheme.
+[END] success=false steps=9 score=0.910 rewards=0.00,0.00,-0.10,10.00
+
+ GRADER SCORE: 0.910 / 1.0
+
+ [Task 2 - repeat 2/3]
+[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1
+
+============================================================
+ TASK 2/5 - MISSING_DATA
+============================================================
+ Profile : {'age': '19', 'income': '2401', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '2'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card received and verified. has_aadhaar confirmed as True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 
action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_eligibility('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 2 — repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '34', 'income': '4943', 'state_of_residence': 'Uttar Pradesh', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 11: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. 
+[STEP] step=12 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=apply_eligibility_rules('MGNREGS') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '24', 'number_of_children': '0', 'state_of_residence': 'Gujarat'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified.
has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 10555. +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 10555 exceeds all scheme thresholds (overage: Rs 556 above PMKVY limit). Use reject_applicant when no +[END] success=false steps=6 score=0.000 rewards=0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '29', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11035. +[STEP] step=4 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 07: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 11035 exceeds all scheme thresholds (overage: Rs 1036 above PMKVY limit). Use reject_applicant when n +[END] success=false steps=7 score=0.000 rewards=0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '28', 'number_of_children': '1', 'marital_status': 'unmarried', 'bank_name': 'Bank of Baroda'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card received and verified. has_aadhaar confirmed as True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter.
+[STEP] step=3 action=request_document('pan_card') reward=0.00 done=false error=null + Step 03: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=0.00 done=false error=null + Step 04: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11082. +[STEP] step=5 action=approve_scheme('PMKVY') reward=-4.00 done=true error=null + Step 05: approve_scheme('PMKVY') -> reward=-4.0, done=True + CLEAR MISS: Income 11082 exceeds all scheme thresholds (overage: Rs 1083 above PMKVY limit). Use reject_applicant when n +[END] success=false steps=5 score=0.000 rewards=0.00,0.00,0.00,0.00,-4.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '35', 'income': '10690', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of State Bank of India for the past 6 yea +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '44', 'income': '17519', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified.
+[STEP] step=5 action=check_scheme_eligibility('PMKVY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=14 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=16 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=17 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=check_scheme_eligibility('PMAY') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 
score=0.000 rewards=-0.10,-0.10,-0.10,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '26', 'income': '15496', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'Bank of Baroda'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 01: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=2 action=ask_question('income') reward=-0.10 done=false error=null + Step 02: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 04: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=5 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=6 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=8 action=ask_question('income') reward=-0.10 done=false error=null + Step 08: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 09: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=approve_scheme('PMKVY') reward=-1.50 done=false error=null + Step 12: approve_scheme('PMKVY') -> reward=-1.5, done=False + PROTOCOL VIOLATION: Do not approve this case before verifying employment records. Request the PAN card first to check fo +[STEP] step=13 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 13: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=14 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=15 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 16: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=17 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 17: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. 
+[STEP] step=18 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 18: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=19 action=ask_question('state_of_residence') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=ask_question('income') reward=-0.10 done=false error=null + Step 20: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[END] success=false steps=20 score=0.000 rewards=-0.10,-0.10,-0.10,0.00,-0.10,-0.10,-0.10,-0.10,-1.50,0.00,-0.10,-0.10,-0.10,-0.10 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=3 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36', 'updated_age_source': 'AADHAIR'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile("{'age': '36', 'income': '7623', 
'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'official_age': '36', 'aadhaar_age': '36', 'updated_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=update_profile("{'age': '36', 'income': '7623', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'aadhaar_age': '36'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: 
Internal Server Error +[STEP] step=16 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=17 action=request_document('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=18 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=19 action=request_document('aadhaar_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=20 action=request_document('self_reported_age') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[END] success=false steps=20 score=0.000 rewards=0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years.
Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=compare_age('official_age') reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=4 action=update_profile("{'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=5 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=6 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=7 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=8 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=9 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=10 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=11 
action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=12 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=13 action=request_document('age_verification_report') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=15 action=request_document('age_verification_report') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 16: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. 
Note: this differs from the self-reported age of 35 in the prof +[STEP] step=17 action=update_profile({'age': '36', 'income': '6215', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'number_of_children': '4', 'official_age': '36'}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=18 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=19 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[STEP] step=20 action=update_profile({'age': 36, 'income': 6215, 'occupation': 'mason', 'has_aadhaar': True, 'self_reported_age': 35, 'number_of_children': 4, 'official_age': 36}) reward=0.00 done=false error=HTTP Error 422: Unprocessable Content +[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.1-nemotron-nano-8b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6090', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'state_of_residence': 'Gujarat', 'marital_status': 'married', 'bank_name': 'Canara Bank'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified.
Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 02: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=3 action=update_profile("{'age': '36', 'income': '6090', 'occupation': 'mason', 'has_aadhaar': 'True', 'state_of_residence': 'Gujarat', 'marital_status': 'married', 'bank_name': 'Canara Bank', 'official_age': '36', 'self_reported_age': '35'}") reward=0.00 done=false error=HTTP Error 422: Unprocessable Content + [ERROR] agent decision failed: JSON_PARSE_ERROR + raw={"action_type": "update_profile", "value": {'age': 36, 'income': 6090, 'occupation': 'mason', 'has_aadhaar': true, 'self_reported_age': 35, 'state_of_residence': 'Gujarat', 'marital_status': 'married' +[STEP] step=4 action=agent_error reward=0.00 done=true error=JSON_PARSE_ERROR +[END] success=false steps=4 score=0.000 rewards=0.00,0.00 + + GRADER SCORE: 0.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.283 ± 0.491 / 1.0 + Task 2 (Missing Data ): 0.303 ± 0.525 / 1.0 + Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 0.333 ± 0.577 / 1.0 + Task 5 (Document Conflict ): 0.000 ± 0.000 / 1.0 + Average : 0.184 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.2833} +STD_JSON {"task": 1, "std": 0.4907} +SCORE_JSON {"task": 2, "score": 0.3033} +STD_JSON {"task": 2, "std": 0.5254} +SCORE_JSON {"task": 3, "score": 0.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 0.3333} +STD_JSON {"task": 4, "std": 0.5774} +SCORE_JSON {"task": 5, "score": 0.0} +STD_JSON
{"task": 5, "std": 0.0} diff --git a/reports/inference_logs/inference_nemotron_super49b.txt b/reports/inference_logs/inference_nemotron_super49b.txt new file mode 100644 index 0000000..cb2eee1 --- /dev/null +++ b/reports/inference_logs/inference_nemotron_super49b.txt @@ -0,0 +1,376 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : nvidia/llama-3.3-nemotron-super-49b-v1 + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '27', 'income': '6336', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=6 score=0.950 rewards=0.00,0.00,0.00,-0.10,0.00,10.00 + + GRADER SCORE: 0.950 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '20', 'income': '1288', 'number_of_children': '1', 'bank_name': 'UCO Bank', 'state_of_residence': 'Gujarat'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 04: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 1 — repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '27', 'income': '4908', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE. Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null + Step 05: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=3.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=3.0, done=True + SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+[END] success=false steps=6 score=0.450 rewards=0.00,0.00,0.00,-0.10,0.00,3.00 + + GRADER SCORE: 0.450 / 1.0 + + [Task 2 — repeat 1/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '47', 'income': '1594', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 0.960 / 1.0 + + [Task 2 — repeat 2/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '57', 'income': '4061', 'bank_name': 'UCO Bank', 'marital_status': 'unmarried', 'state_of_residence': 'Gujarat'} + Missing : ['has_aadhaar', 'occupation'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null + Step 02: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=3 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 03: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=true steps=3 score=1.000 rewards=0.00,0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 2 — repeat 3/3] +[START] task=missing_data env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '22', 'income': '4771', 'bank_name': 'UCO Bank', 'marital_status': 'divorced'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 03: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=4 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null + Step 04: approve_scheme('MGNREGS') -> reward=10.0, done=True + CORRECT: Applicant enrolled in MGNREGS — optimal scheme. +[END] success=false steps=4 score=0.960 rewards=0.00,0.00,0.00,10.00 + + GRADER SCORE: 0.960 / 1.0 + + [Task 3 — repeat 1/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '31', 'bank_name': 'UCO Bank'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11217. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11217 exceeds all scheme thresholds. No eligible scheme found. +[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 2/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '25', 'bank_name': 'PNB', 'number_of_children': '0', 'state_of_residence': 'Rajasthan'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11092. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11092 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 3 — repeat 3/3] +[START] task=boundary_fraud env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 3/5 — BOUNDARY_FRAUD +============================================================ + Profile : {'age': '25', 'state_of_residence': 'Uttar Pradesh'} + Missing : ['occupation', 'has_aadhaar', 'income'] + Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null + Step 03: ask_question('income') -> reward=0.0, done=False + Applicant confirmed: income = 11517. +[STEP] step=4 action=reject_applicant('INCOME_TOO_HIGH') reward=5.00 done=true error=null + Step 04: reject_applicant('INCOME_TOO_HIGH') -> reward=5.0, done=True + CORRECT REJECTION: Income 11517 exceeds all scheme thresholds. No eligible scheme found.
+[END] success=true steps=4 score=1.000 rewards=0.00,0.00,0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 1/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '45', 'income': '19939', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '2', 'state_of_residence': 'Rajasthan', 'bank_name': 'Bank of Baroda'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BHEL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 2/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '32', 'income': '19339', 'occupation': 'student', 'has_aadhaar': 'True', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 4 — repeat 3/3] +[START] task=escalation_dilemma env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 4/5 — ESCALATION_DILEMMA +============================================================ + Profile : {'age': '23', 'income': '15394', 'occupation': 'student', 'has_aadhaar': 'True', 'bank_name': 'SBI', 'state_of_residence': 'Rajasthan', 'number_of_children': '3'} + Missing : [] + Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc +[STEP] step=1 action=request_document('pan_card') reward=0.00 done=false error=null + Step 01: request_document('pan_card') -> reward=0.0, done=False + PAN card retrieved. Records show this applicant has been a registered employee of BSNL for the past 6 years with active +[STEP] step=2 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null + Step 02: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True + CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi +[END] success=true steps=2 score=1.000 rewards=0.00,10.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 1/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '6016', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'bank_name': 'PNB', 'number_of_children': '2', 'state_of_residence': 'Rajasthan'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 2/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '34', 'income': '6257', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '34', 'bank_name': 'SBI', 'number_of_children': '0', 'state_of_residence': 'Maharashtra'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE.
Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 34 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 — repeat 3/3] +[START] task=document_conflict env=scheme_env model=nvidia/llama-3.3-nemotron-super-49b-v1 + +============================================================ + TASK 5/5 — DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '8148', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'state_of_residence': 'Maharashtra', 'marital_status': 'widowed'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 37 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.800 ± 0.304 / 1.0 + Task 2 (Missing Data ): 0.973 ± 0.023 / 1.0 + Task 3 (Boundary Fraud ): 1.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 1.000 ± 0.000 / 1.0 + Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0 + Average : 0.955 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.8} +STD_JSON {"task": 1, "std": 0.3041} +SCORE_JSON {"task": 2, "score": 0.9733} +STD_JSON {"task": 2, "std": 0.0231} +SCORE_JSON {"task": 3, "score": 1.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 1.0} +STD_JSON {"task": 4, "std": 0.0} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/inference_logs/inference_nvidia_8b.txt b/reports/inference_logs/inference_nvidia_8b.txt new file mode 100644 index 0000000..1e817b0 --- /dev/null +++ b/reports/inference_logs/inference_nvidia_8b.txt @@ -0,0 +1,672 @@ + +============================================================ + SCHEME ENV — OPTION A EVALUATION + Model : meta/llama-3.1-8b-instruct + Env : http://localhost:7860 + Repeats : 3 per task +============================================================ + + [Task 1 — repeat 1/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '32', 'income': '8687', 'bank_name': 'Bank of Baroda', 'marital_status': 'unmarried'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 06: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=6 score=0.900 rewards=0.00,0.00,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.900 / 1.0 + + [Task 1 — repeat 2/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '18', 'income': '1931', 'bank_name': 'PNB', 'state_of_residence': 'Maharashtra'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = carpenter. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('age') reward=-0.10 done=false error=null + Step 09: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null + Step 10: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=13 action=ask_question('is_income_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=14 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 14: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=15 action=ask_question('is_occupation_eligible_for_any_scheme') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=16 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 16: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=16 score=0.700 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.700 / 1.0 + + [Task 1 — repeat 3/3] +[START] task=scheme_discovery env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 1/5 — SCHEME_DISCOVERY +============================================================ + Profile : {'age': '19', 'income': '7160', 'marital_status': 'unmarried', 'bank_name': 'Canara Bank'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 1/5 - SCHEME DISCOVERY - Easy] Profile is INCOMPLETE.
Collect all required eligibility fields first, then identify and approve the OPT +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = mason. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. +[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null + Step 04: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 05: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null + Step 06: request_document('pan_card') -> reward=0.0, done=False + Document 'pan_card' received and verified. +[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null + Step 07: ask_question('occupation') -> reward=-0.1, done=False + 'occupation' is already in the profile. Do not repeat questions. +[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null + Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False + 'has_aadhaar' is already in the profile. Do not repeat questions. +[STEP] step=9 action=ask_question('age') reward=-0.10 done=false error=null + Step 09: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions. 
+[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null + Step 10: ask_question('income') -> reward=-0.1, done=False + 'income' is already in the profile. Do not repeat questions. +[STEP] step=11 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error +[STEP] step=12 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 12: request_document('aadhaar_card') -> reward=0.0, done=False + Document 'aadhaar_card' received and verified. +[STEP] step=13 action=approve_scheme('PMKVY') reward=10.00 done=true error=null + Step 13: approve_scheme('PMKVY') -> reward=10.0, done=True + CORRECT: Applicant enrolled in PMKVY — optimal scheme. +[END] success=false steps=13 score=0.700 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,10.00 + + GRADER SCORE: 0.700 / 1.0 + + [Task 2 — repeat 1/3] +[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 2/5 — MISSING_DATA +============================================================ + Profile : {'age': '28', 'income': '4776', 'number_of_children': '2'} + Missing : ['occupation', 'has_aadhaar'] + Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi +[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null + Step 01: ask_question('occupation') -> reward=0.0, done=False + Applicant confirmed: occupation = farm_labourer. +[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null + Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False + Applicant confirmed: has_aadhaar = True. +[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null + Step 03: ask_question('age') -> reward=-0.1, done=False + 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 05: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=6 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=7 action=ask_question('document_type') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('document_type') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=9 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=10 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=11 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=12 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=13 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=14 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 14: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=15 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 15: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=16 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 16: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=17 action=ask_question('number_of_days_worked_last_year') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=18 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=19 action=ask_question('number_of_days_worked_last_year') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=20 action=ask_question('has_pan_card') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,-0.10
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 2 — repeat 2/3]
+[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '22', 'income': '3416', 'number_of_children': '1', 'bank_name': 'Canara Bank'}
+ Missing : ['occupation', 'has_aadhaar']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE. You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 05: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 06: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=7 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('has_family_member_with_disability') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=9 action=ask_question('has_family_member_with_disability') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=11 action=approve_scheme('MGNREGS') reward=10.00 done=true error=null
+ Step 11: approve_scheme('MGNREGS') -> reward=10.0, done=True
+ CORRECT: Applicant enrolled in MGNREGS — optimal scheme.
+[END] success=false steps=11 score=0.650 rewards=0.00,0.00,-0.10,-0.10,0.00,0.00,-0.10,10.00
+
+ GRADER SCORE: 0.650 / 1.0
+
+ [Task 2 — repeat 3/3]
+[START] task=missing_data env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 2/5 — MISSING_DATA
+============================================================
+ Profile : {'age': '29', 'income': '1180', 'number_of_children': '4'}
+ Missing : ['has_aadhaar', 'occupation']
+ Notif : [TASK 2/5 - MISSING DATA - Medium] Profile is INCOMPLETE.
You MUST collect ALL required eligibility fields before making any terminal decisi
+[STEP] step=1 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 01: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = farm_labourer.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 04: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=6 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 06: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=7 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=9 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=10 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 10: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=11 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=12 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 13: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=14 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=15 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=16 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=17 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=18 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=19 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 19: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=20 action=ask_question('has_family_member_in_military') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,-0.10,-0.10,0.00,-0.10,0.00,-0.10,0.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 1/3]
+[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '34', 'marital_status': 'divorced'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 10572.
+[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 04: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 05: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=6 action=ask_question('marital_status') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=7 action=ask_question('is_student') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 08: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=9 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 09: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=10 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 10: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=11 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 11: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=12 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 12: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 13: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=14 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 14: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=15 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 15: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=16 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 16: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=17 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 17: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=18 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 18: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=19 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 19: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=20 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 20: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,-0.10,0.00,-0.10,-0.10,-0.10,0.00,-0.10,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 2/3]
+[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '23', 'number_of_children': '0'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('income') reward=0.00 done=false error=null
+ Step 01: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11179.
+[STEP] step=2 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 02: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 04: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 05: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=6 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 06: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=7 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=8 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 08: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=9 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 09: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=10 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 10: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=11 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 11: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=12 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 12: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=13 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 13: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=14 action=ask_question('number_of_children') reward=0.00 done=false error=HTTP Error 500: Internal Server Error
+[STEP] step=15 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 15: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=16 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 16: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=17 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 17: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=18 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 18: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=19 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 19: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=20 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 20: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 3 — repeat 3/3]
+[START] task=boundary_fraud env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 3/5 — BOUNDARY_FRAUD
+============================================================
+ Profile : {'age': '23', 'marital_status': 'divorced', 'state_of_residence': 'Maharashtra'}
+ Missing : ['occupation', 'has_aadhaar', 'income']
+ Notif : [TASK 3/5 - BOUNDARY FRAUD DETECTION - Hard] Profile is INCOMPLETE. Collect ALL required eligibility fields including income, then apply sch
+[STEP] step=1 action=ask_question('occupation') reward=0.00 done=false error=null
+ Step 01: ask_question('occupation') -> reward=0.0, done=False
+ Applicant confirmed: occupation = carpenter.
+[STEP] step=2 action=ask_question('has_aadhaar') reward=0.00 done=false error=null
+ Step 02: ask_question('has_aadhaar') -> reward=0.0, done=False
+ Applicant confirmed: has_aadhaar = True.
+[STEP] step=3 action=ask_question('income') reward=0.00 done=false error=null
+ Step 03: ask_question('income') -> reward=0.0, done=False
+ Applicant confirmed: income = 11989.
+[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 04: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 05: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 06: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=7 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 07: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=8 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 08: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=9 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 09: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=10 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 10: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=11 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 11: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=12 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 12: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=13 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 13: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=14 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 14: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=15 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 15: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=16 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 16: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=17 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 17: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=18 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 18: request_document('pan_card') -> reward=0.0, done=False
+ Document 'pan_card' received and verified.
+[STEP] step=19 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 19: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=20 action=ask_question('has_aadhaar') reward=-2.00 done=true error=null
+ Step 20: ask_question('has_aadhaar') -> reward=-2.0, done=True
+ TIMEOUT: 20 steps reached without a decision.
+[END] success=false steps=20 score=0.000 rewards=0.00,0.00,0.00,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-0.10,-0.10,-0.10,0.00,0.00,-0.10,-2.00
+
+ GRADER SCORE: 0.000 / 1.0
+
+ [Task 4 — repeat 1/3]
+[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '41', 'income': '18035', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '3', 'bank_name': 'Bank of Baroda', 'state_of_residence': 'Bihar'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision.
If oc
+[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 01: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 02: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 03: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 04: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=6 action=reject_applicant('AGE_EXCEEDED') reward=-1.00 done=false error=null
+ Step 06: reject_applicant('AGE_EXCEEDED') -> reward=-1.0, done=False
+ PREMATURE ADJUDICATION: Review documentary evidence before rejecting. Request the PAN card to verify employment records,
+[STEP] step=7 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 07: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w
+[STEP] step=8 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 08: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification.
Thi
+[END] success=false steps=8 score=0.850 rewards=-0.10,-0.10,-0.10,0.00,-0.10,-1.00,0.00,10.00
+
+ GRADER SCORE: 0.850 / 1.0
+
+ [Task 4 — repeat 2/3]
+[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '30', 'income': '13238', 'occupation': 'student', 'has_aadhaar': 'True', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '2'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 01: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 02: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 03: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=4 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 04: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=5 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 05: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved.
Records show this applicant has been a registered employee of BHEL for the past 6 years with active
+[STEP] step=6 action=escalate('DATA_MISMATCH') reward=10.00 done=true error=null
+ Step 06: escalate('DATA_MISMATCH') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=false steps=6 score=0.900 rewards=-0.10,-0.10,-0.10,0.00,0.00,10.00
+
+ GRADER SCORE: 0.900 / 1.0
+
+ [Task 4 — repeat 3/3]
+[START] task=escalation_dilemma env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 4/5 — ESCALATION_DILEMMA
+============================================================
+ Profile : {'age': '43', 'income': '17180', 'occupation': 'student', 'has_aadhaar': 'True', 'number_of_children': '1', 'marital_status': 'unmarried'}
+ Missing : []
+ Notif : [TASK 4/5 - ESCALATION DILEMMA - Expert] Profile is COMPLETE. Review all applicant data carefully before making any terminal decision. If oc
+[STEP] step=1 action=ask_question('income') reward=-0.10 done=false error=null
+ Step 01: ask_question('income') -> reward=-0.1, done=False
+ 'income' is already in the profile. Do not repeat questions.
+[STEP] step=2 action=ask_question('occupation') reward=-0.10 done=false error=null
+ Step 02: ask_question('occupation') -> reward=-0.1, done=False
+ 'occupation' is already in the profile. Do not repeat questions.
+[STEP] step=3 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 03: request_document('aadhaar_card') -> reward=0.0, done=False
+ Document 'aadhaar_card' received and verified.
+[STEP] step=4 action=ask_question('age') reward=-0.10 done=false error=null
+ Step 04: ask_question('age') -> reward=-0.1, done=False
+ 'age' is already in the profile. Do not repeat questions.
+[STEP] step=5 action=ask_question('has_aadhaar') reward=-0.10 done=false error=null
+ Step 05: ask_question('has_aadhaar') -> reward=-0.1, done=False
+ 'has_aadhaar' is already in the profile. Do not repeat questions.
+[STEP] step=6 action=request_document('pan_card') reward=0.00 done=false error=null
+ Step 06: request_document('pan_card') -> reward=0.0, done=False
+ PAN card retrieved. Records show this applicant has been a registered employee of Indian Railways for the past 6 years w
+[STEP] step=7 action=escalate('MANUAL_REVIEW_REQUIRED') reward=10.00 done=true error=null
+ Step 07: escalate('MANUAL_REVIEW_REQUIRED') -> reward=10.0, done=True
+ CORRECT ESCALATION: Contradictory data detected and properly handed off to a senior officer for manual verification. Thi
+[END] success=false steps=7 score=0.850 rewards=-0.10,-0.10,0.00,-0.10,-0.10,0.00,10.00
+
+ GRADER SCORE: 0.850 / 1.0
+
+ [Task 5 — repeat 1/3]
+[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '35', 'income': '6288', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'unmarried', 'state_of_residence': 'Uttar Pradesh', 'number_of_children': '1'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 37 years.
Note: this differs from the self-reported age of 35 in the prof
+[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=37, which exceeds the PMKVY maximum of 35. No other scheme criteria are satisfie
+[END] success=true steps=2 score=1.000 rewards=0.00,5.00
+
+ GRADER SCORE: 1.000 / 1.0
+
+ [Task 5 — repeat 2/3]
+[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct
+
+============================================================
+ TASK 5/5 — DOCUMENT_CONFLICT
+============================================================
+ Profile : {'age': '33', 'income': '7472', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '33', 'state_of_residence': 'Uttar Pradesh', 'bank_name': 'SBI'}
+ Missing : []
+ Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m
+[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null
+ Step 01: request_document('aadhaar_card') -> reward=0.0, done=False
+ Aadhaar card verified. Official age on record: 36 years. Note: this differs from the self-reported age of 33 in the prof
+[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null
+ Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True
+ CORRECT REJECTION: Aadhaar confirms age=36, which exceeds the PMKVY maximum of 35.
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + + [Task 5 - repeat 3/3] +[START] task=document_conflict env=scheme_env model=meta/llama-3.1-8b-instruct + +============================================================ + TASK 5/5 - DOCUMENT_CONFLICT +============================================================ + Profile : {'age': '35', 'income': '7914', 'occupation': 'mason', 'has_aadhaar': 'True', 'self_reported_age': '35', 'marital_status': 'unmarried'} + Missing : [] + Notif : [TASK 5/5 - DOCUMENT CONFLICT - Expert+] Profile is COMPLETE. Apply all eligibility rules and verify official documents as required before m +[STEP] step=1 action=request_document('aadhaar_card') reward=0.00 done=false error=null + Step 01: request_document('aadhaar_card') -> reward=0.0, done=False + Aadhaar card verified. Official age on record: 38 years. Note: this differs from the self-reported age of 35 in the prof +[STEP] step=2 action=reject_applicant('AGE_EXCEEDED') reward=5.00 done=true error=null + Step 02: reject_applicant('AGE_EXCEEDED') -> reward=5.0, done=True + CORRECT REJECTION: Aadhaar confirms age=38, which exceeds the PMKVY maximum of 35. 
No other scheme criteria are satisfie +[END] success=true steps=2 score=1.000 rewards=0.00,5.00 + + GRADER SCORE: 1.000 / 1.0 + +============================================================ + FINAL GRADER SCORES (mean ± std over 3 repeats) +============================================================ + Task 1 (Scheme Discovery ): 0.767 ± 0.116 / 1.0 + Task 2 (Missing Data ): 0.217 ± 0.375 / 1.0 + Task 3 (Boundary Fraud ): 0.000 ± 0.000 / 1.0 + Task 4 (Escalation Dilemma ): 0.867 ± 0.029 / 1.0 + Task 5 (Document Conflict ): 1.000 ± 0.000 / 1.0 + Average : 0.570 / 1.0 +============================================================ +SCORE_JSON {"task": 1, "score": 0.7667} +STD_JSON {"task": 1, "std": 0.1155} +SCORE_JSON {"task": 2, "score": 0.2167} +STD_JSON {"task": 2, "std": 0.3753} +SCORE_JSON {"task": 3, "score": 0.0} +STD_JSON {"task": 3, "std": 0.0} +SCORE_JSON {"task": 4, "score": 0.8667} +STD_JSON {"task": 4, "std": 0.0289} +SCORE_JSON {"task": 5, "score": 1.0} +STD_JSON {"task": 5, "std": 0.0} diff --git a/reports/leaderboard.csv b/reports/leaderboard.csv new file mode 100644 index 0000000..1fffcc0 --- /dev/null +++ b/reports/leaderboard.csv @@ -0,0 +1,9 @@ +Model,Size,Task1,Task2,Task3,Task4,Task5,Average +mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967 +nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955 +nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951 +nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8 +nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8 +nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557 +meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517 +nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184 diff --git a/reports/results.json b/reports/results.json new file mode 100644 index 0000000..fee5321 --- /dev/null +++ b/reports/results.json @@ -0,0 +1,218 @@ +[ + { + "model": "nvidia/llama-3.1-nemotron-nano-8b-v1", + "size": 
"8B", + "average": 0.184, + "tasks": { + "task1": { + "score": 0.283, + "std": 0.491 + }, + "task2": { + "score": 0.303, + "std": 0.525 + }, + "task3": { + "score": 0.0, + "std": 0.0 + }, + "task4": { + "score": 0.333, + "std": 0.577 + }, + "task5": { + "score": 0.0, + "std": 0.0 + } + } + }, + { + "model": "meta/llama-3.1-8b-instruct", + "size": "8B", + "average": 0.517, + "tasks": { + "task1": { + "score": 0.4, + "std": 0.458 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 0.317, + "std": 0.548 + }, + "task4": { + "score": 0.867, + "std": 0.058 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-mini-4b-instruct", + "size": "4B", + "average": 0.557, + "tasks": { + "task1": { + "score": 0.483, + "std": 0.029 + }, + "task2": { + "score": 0.667, + "std": 0.577 + }, + "task3": { + "score": 0.667, + "std": 0.577 + }, + "task4": { + "score": 0.967, + "std": 0.029 + }, + "task5": { + "score": 0.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-3-nano-30b-a3b", + "size": "30B", + "average": 0.8, + "tasks": { + "task1": { + "score": 1.0, + "std": 0.0 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/nemotron-3-super-120b-a12b", + "size": "120B", + "average": 0.8, + "tasks": { + "task1": { + "score": 1.0, + "std": 0.0 + }, + "task2": { + "score": 0.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/llama-3.1-nemotron-51b-instruct", + "size": "51B", + "average": 0.951, + "tasks": { + "task1": { + "score": 0.8, + "std": 0.304 + }, + "task2": { + "score": 0.957, + "std": 0.045 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + 
"score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "nvidia/llama-3.3-nemotron-super-49b-v1", + "size": "49B", + "average": 0.955, + "tasks": { + "task1": { + "score": 0.8, + "std": 0.304 + }, + "task2": { + "score": 0.973, + "std": 0.023 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + }, + { + "model": "mistralai/mistral-nemotron", + "size": "~56B", + "average": 0.967, + "tasks": { + "task1": { + "score": 0.833, + "std": 0.289 + }, + "task2": { + "score": 1.0, + "std": 0.0 + }, + "task3": { + "score": 1.0, + "std": 0.0 + }, + "task4": { + "score": 1.0, + "std": 0.0 + }, + "task5": { + "score": 1.0, + "std": 0.0 + } + } + } +] \ No newline at end of file diff --git a/reports/summary.txt b/reports/summary.txt new file mode 100644 index 0000000..92de904 --- /dev/null +++ b/reports/summary.txt @@ -0,0 +1,12 @@ +OpenEnv scheme_env Benchmark - Baseline Report Summary +======================================================== +Date generated : 2026-04-08 +Models evaluated : 8 + +Best model : mistral-nemotron (avg=0.967) +Worst model : nemotron-nano-8b (avg=0.184) + +Hardest task : Task 2 (mean=0.487) +Easiest task : Task 4 (mean=0.896) + +Perfect score (1.0 on all tasks): none diff --git a/reports/task_heatmap.png b/reports/task_heatmap.png new file mode 100644 index 0000000..fc05e91 Binary files /dev/null and b/reports/task_heatmap.png differ diff --git a/reports/test_logs/pytest_results.txt b/reports/test_logs/pytest_results.txt new file mode 100644 index 0000000..1b2adaa --- /dev/null +++ b/reports/test_logs/pytest_results.txt @@ -0,0 +1,30 @@ +============================= test session starts ============================== +platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14 +cachedir: .pytest_cache +rootdir: /Users/advikdivekar/Desktop/OpenEnv +configfile: pyproject.toml +plugins: anyio-4.13.0 
+collecting ... collected 20 items + +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [ 5%] +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%] +tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%] +tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED [ 75%] +tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%] +tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED [ 85%] +tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%] +tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED [ 95%] +tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED [100%] + +============================== 20 passed in 2.24s ============================== diff --git a/reports/test_logs/smoke_test_results.txt b/reports/test_logs/smoke_test_results.txt new file mode 100644 index 0000000..15da4e3 --- /dev/null +++ 
b/reports/test_logs/smoke_test_results.txt @@ -0,0 +1,61 @@ + +============================================================ +SMOKE TEST - scheme_env +============================================================ + +Task 1 - Scheme Discovery + Profile at reset: {'age': '28', 'income': '4665', 'marital_status': 'widowed', 'number_of_children': '1'} + Missing: ['occupation', 'has_aadhaar'] + [PASS] Task 1 terminates on approve - done=True + [PASS] Task 1 reward positive or suboptimal - reward=3.0 + [PASS] Task 1 grader_score set - score=0.5 + Notification: SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit. + +Task 2 - Missing Data (premature approval) + Missing at reset: ['occupation', 'has_aadhaar'] + [PASS] Task 2 premature approve terminates - done=True + [PASS] Task 2 premature approve scores 0.0 - score=0.0 + [PASS] Task 2 premature approve reward negative - reward=-5.0 + +Task 2 - Missing Data (correct flow) + [PASS] Task 2 correct flow terminates - done=True + [PASS] Task 2 correct flow reward positive - reward=10.0 + [PASS] Task 2 correct flow score > 0 - score=1.0 + +Task 3 - Boundary Fraud (reject before asking income) + Profile at reset: {'age': '24', 'number_of_children': '3', 'marital_status': 'widowed'} + Missing: ['occupation', 'has_aadhaar', 'income'] + [PASS] Task 3 income hidden at reset - income in profile=['age', 'number_of_children', 'marital_status'] + [PASS] Task 3 premature reject scores 0.0 - score=0.0 + [PASS] Task 3 premature reject reward negative - reward=-2.0 + +Task 3 - Boundary Fraud (correct flow) + [PASS] Task 3 correct reject terminates - done=True + [PASS] Task 3 correct reject reward positive - reward=5.0 + [PASS] Task 3 correct reject score > 0.30 - score=1.0 + +Task 4 - Escalation (escalate before PAN - should soft-block) + [PASS] Task 4 escalate before PAN is non-terminal - done=False + [PASS] Task 4 escalate before PAN reward negative - reward=-1.0 + +Task 4 - 
Escalation (correct flow) + [PASS] Task 4 PAN reveals contradiction - notif=PAN card retrieved. Records show this applicant has been a registered employee o + [PASS] Task 4 correct escalate terminates - done=True + [PASS] Task 4 correct escalate reward=10 - reward=10.0 + [PASS] Task 4 correct escalate score > 0.80 - score=1.0 + +Task 5 - Document Conflict (correct flow) + Self-reported age in profile: 33 + [PASS] Task 5 Aadhaar reveals age conflict - notif=Aadhaar card verified. Official age on record: 36 years. Note: this differs from + [PASS] Task 5 correct reject terminates - done=True + [PASS] Task 5 correct reject reward=5 - reward=5.0 + [PASS] Task 5 correct reject score > 0.80 - score=1.0 + +Wrong escalation on Task 1 (should now be terminal) + [PASS] Wrong escalation Task 1 is terminal - done=True + [PASS] Wrong escalation reward=-2.0 - reward=-2.0 + +============================================================ +ALL TESTS PASSED - environment logic is correct +============================================================ + diff --git a/reports/test_logs/test_results.txt b/reports/test_logs/test_results.txt new file mode 100644 index 0000000..b662b86 --- /dev/null +++ b/reports/test_logs/test_results.txt @@ -0,0 +1,30 @@ +============================= test session starts ============================== +platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14 +cachedir: .pytest_cache +rootdir: /Users/advikdivekar/Desktop/OpenEnv +configfile: pyproject.toml +plugins: anyio-4.13.0 +collecting ... 
collected 20 items + +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [ 5%] +tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%] +tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%] +tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%] +tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%] +tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%] +tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%] +tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%] +tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED [ 75%] +tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%] +tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED [ 85%] +tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%] +tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED [ 95%] +tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED [100%] + +============================== 20 passed in 2.38s ============================== diff --git a/scripts/validate-submission.sh b/scripts/validate-submission.sh new file mode 100755 index 0000000..3180967 --- /dev/null +++ b/scripts/validate-submission.sh @@ -0,0 +1,287 @@ +#!/usr/bin/env 
bash +# +# validate-submission.sh - Extended OpenEnv Submission Validator +# +# This script starts from the official hackathon template and adds repo-level +# checks for required files, inference env-var contract, structured stdout log +# markers, Dockerfile presence, and OpenEnv submission shape. +# +# Usage: +# ./scripts/validate-submission.sh <ping_url> [repo_dir] +# +# Examples: +# ./scripts/validate-submission.sh https://my-team.hf.space +# ./scripts/validate-submission.sh https://my-team.hf.space /path/to/repo +# + +set -uo pipefail + +DOCKER_BUILD_TIMEOUT=600 +if [ -t 1 ]; then + RED='\033[0;31m' + GREEN='\033[0;32m' + YELLOW='\033[1;33m' + BOLD='\033[1m' + NC='\033[0m' +else + RED='' GREEN='' YELLOW='' BOLD='' NC='' +fi + +run_with_timeout() { + local secs="$1"; shift + if command -v timeout >/dev/null 2>&1; then + timeout "$secs" "$@" + elif command -v gtimeout >/dev/null 2>&1; then + gtimeout "$secs" "$@" + else + "$@" & + local pid=$! + ( sleep "$secs" && kill "$pid" 2>/dev/null ) & + local watcher=$! + wait "$pid" 2>/dev/null + local rc=$? + kill "$watcher" 2>/dev/null + wait "$watcher" 2>/dev/null + return $rc + fi +} + +portable_mktemp() { + local prefix="${1:-validate}" + mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp +} + +CLEANUP_FILES=() +cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; } +trap cleanup EXIT + +PING_URL="${1:-}" +REPO_DIR="${2:-.}" + +if [ -z "$PING_URL" ]; then + printf "Usage: %s <ping_url> [repo_dir]\n" "$0" + printf "\n" + printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n" + printf " repo_dir Path to your repo (default: current directory)\n" + exit 1 +fi + +if ! 
REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then + printf "Error: directory '%s' not found\n" "${2:-.}" + exit 1 +fi + +PING_URL="${PING_URL%/}" +PASS=0 + +log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; } +pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); } +fail() { log "${RED}FAILED${NC} -- $1"; } +warn() { log "${YELLOW}WARN${NC} -- $1"; } +hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; } +stop_at() { + printf "\n" + printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1" + exit 1 +} + +require_file() { + local path="$1" + local label="$2" + if [ -f "$path" ]; then + pass "$label present: ${path#$REPO_DIR/}" + else + fail "$label missing: ${path#$REPO_DIR/}" + return 1 + fi +} + +require_dir() { + local path="$1" + local label="$2" + if [ -d "$path" ]; then + pass "$label present: ${path#$REPO_DIR/}" + else + fail "$label missing: ${path#$REPO_DIR/}" + return 1 + fi +} + +require_grep() { + local pattern="$1" + local file="$2" + local label="$3" + if grep -Eq "$pattern" "$file"; then + pass "$label" + else + fail "$label" + return 1 + fi +} + +printf "\n" +printf "${BOLD}========================================${NC}\n" +printf "${BOLD} OpenEnv Submission Validator${NC}\n" +printf "${BOLD}========================================${NC}\n" +log "Repo: $REPO_DIR" +log "Ping URL: $PING_URL" +printf "\n" + +log "${BOLD}Step 1/8: Repo structure checks${NC} ..." 
+ +require_file "$REPO_DIR/README.md" "README" || stop_at "Step 1" +require_file "$REPO_DIR/inference.py" "Root inference script" || stop_at "Step 1" +require_file "$REPO_DIR/openenv.yaml" "openenv.yaml" || stop_at "Step 1" +require_file "$REPO_DIR/Dockerfile" "Dockerfile" || stop_at "Step 1" +require_file "$REPO_DIR/models.py" "Root models.py" || stop_at "Step 1" +require_dir "$REPO_DIR/server" "server package" || stop_at "Step 1" +require_dir "$REPO_DIR/tests" "tests directory" || stop_at "Step 1" + +log "${BOLD}Step 2/8: Inference contract checks${NC} ..." + +require_grep 'from openai import OpenAI' "$REPO_DIR/inference.py" "OpenAI client imported in inference.py" || stop_at "Step 2" +require_grep 'API_BASE_URL *= *os.getenv\("API_BASE_URL",' "$REPO_DIR/inference.py" "API_BASE_URL read from env with default" || stop_at "Step 2" +require_grep 'MODEL_NAME *= *os.getenv\("MODEL_NAME",' "$REPO_DIR/inference.py" "MODEL_NAME read from env with default" || stop_at "Step 2" +require_grep 'HF_TOKEN *= *os.getenv\("HF_TOKEN"\)' "$REPO_DIR/inference.py" "HF_TOKEN read from env without default" || stop_at "Step 2" +require_grep 'LOCAL_IMAGE_NAME *= *os.getenv\("LOCAL_IMAGE_NAME"\)' "$REPO_DIR/inference.py" "LOCAL_IMAGE_NAME optionally supported" || stop_at "Step 2" +require_grep 'OpenAI\(base_url=API_BASE_URL, api_key=HF_TOKEN\)' "$REPO_DIR/inference.py" "OpenAI client configured from required env vars" || stop_at "Step 2" +require_grep '\[START\]' "$REPO_DIR/inference.py" "Structured START log marker present" || stop_at "Step 2" +require_grep '\[STEP\]' "$REPO_DIR/inference.py" "Structured STEP log marker present" || stop_at "Step 2" +require_grep '\[END\]' "$REPO_DIR/inference.py" "Structured END log marker present" || stop_at "Step 2" + +log "${BOLD}Step 3/8: OpenEnv spec surface checks${NC} ..." 
+ +require_grep '^spec_version:' "$REPO_DIR/openenv.yaml" "openenv.yaml declares spec_version" || stop_at "Step 3" +require_grep '^runtime:' "$REPO_DIR/openenv.yaml" "openenv.yaml declares runtime" || stop_at "Step 3" +require_grep '^app:' "$REPO_DIR/openenv.yaml" "openenv.yaml declares app entrypoint" || stop_at "Step 3" +require_grep '^port:' "$REPO_DIR/openenv.yaml" "openenv.yaml declares port" || stop_at "Step 3" +require_grep 'def reset\(' "$REPO_DIR/server/scheme_env_environment.py" "Environment defines reset()" || stop_at "Step 3" +require_grep 'def step\(' "$REPO_DIR/server/scheme_env_environment.py" "Environment defines step()" || stop_at "Step 3" +require_grep 'def state\(' "$REPO_DIR/server/scheme_env_environment.py" "Environment exposes state property/method" || stop_at "Step 3" + +TASK_COUNT=$(grep -Ec 'task_id == [1-9]|Task [1-9]|TASK [1-9]/' "$REPO_DIR/server/scheme_env_environment.py" || true) +if [ "$TASK_COUNT" -ge 3 ]; then + pass "Detected 3+ task definitions in environment logic" +else + fail "Could not detect 3+ task definitions in environment logic" + stop_at "Step 3" +fi + +log "${BOLD}Step 4/8: README submission-content checks${NC} ..." + +require_grep 'Action Space' "$REPO_DIR/README.md" "README documents action space" || stop_at "Step 4" +require_grep 'Observation Space' "$REPO_DIR/README.md" "README documents observation space" || stop_at "Step 4" +require_grep 'Setup and Running|Setup' "$REPO_DIR/README.md" "README documents setup instructions" || stop_at "Step 4" +require_grep 'The 5 Tasks|Tasks' "$REPO_DIR/README.md" "README documents tasks" || stop_at "Step 4" +require_grep 'Baseline Results|leaderboard.csv|results.json' "$REPO_DIR/README.md" "README documents baseline outputs" || stop_at "Step 4" + +log "${BOLD}Step 5/8: Pinging HF Space${NC} ($PING_URL/reset) ..." 
+ +CURL_OUTPUT=$(portable_mktemp "validate-curl") +CLEANUP_FILES+=("$CURL_OUTPUT") +# curl -w already prints 000 when the transfer fails, so fall back by assignment instead of appending a second 000. +HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \ + -H "Content-Type: application/json" -d '{}' \ + "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT") || HTTP_CODE="000" + +if [ "$HTTP_CODE" = "200" ]; then + pass "HF Space is live and responds to /reset" +elif [ "$HTTP_CODE" = "000" ]; then + fail "HF Space not reachable (connection failed or timed out)" + hint "Check your network connection and that the Space is running." + hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset" + stop_at "Step 5" +else + fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)" + hint "Make sure your Space is running and the URL is correct." + hint "Try opening $PING_URL in your browser first." + stop_at "Step 5" +fi + +HEALTH_OUTPUT=$(portable_mktemp "validate-health") +CLEANUP_FILES+=("$HEALTH_OUTPUT") +HEALTH_CODE=$(curl -s -o "$HEALTH_OUTPUT" -w "%{http_code}" \ + "$PING_URL/health" --max-time 15) || HEALTH_CODE="000" +if [ "$HEALTH_CODE" = "200" ]; then + pass "HF Space /health responds with HTTP 200" +else + fail "HF Space /health returned HTTP $HEALTH_CODE" + stop_at "Step 5" +fi + +log "${BOLD}Step 6/8: Running docker build${NC} ..." + +if ! command -v docker >/dev/null 2>&1; then + fail "docker command not found" + hint "Install Docker: https://docs.docker.com/get-docker/" + stop_at "Step 6" +fi + +BUILD_OK=false +BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$REPO_DIR" 2>&1) && BUILD_OK=true + +if [ "$BUILD_OK" = true ]; then + pass "Docker build succeeded" +else + fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)" + printf "%s\n" "$BUILD_OUTPUT" | tail -20 + stop_at "Step 6" +fi + +log "${BOLD}Step 7/8: Running openenv validate${NC} ..." + +if ! 
command -v openenv >/dev/null 2>&1; then + fail "openenv command not found" + hint "Install it: pip install openenv-core" + stop_at "Step 7" +fi + +VALIDATE_OK=false +VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true + +if [ "$VALIDATE_OK" = true ]; then + pass "openenv validate passed" + [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT" +else + fail "openenv validate failed" + printf "%s\n" "$VALIDATE_OUTPUT" + stop_at "Step 7" +fi + +log "${BOLD}Step 8/8: Local quality checks${NC} ..." + +if command -v python >/dev/null 2>&1; then + PY_COMPILE_OK=true + PY_COMPILE_OUTPUT=$(cd "$REPO_DIR" && python -m py_compile inference.py models.py server/*.py 2>&1) || PY_COMPILE_OK=false + if [ "$PY_COMPILE_OK" = true ]; then + pass "Key Python files compile cleanly" + else + fail "Python compile check failed" + printf "%s\n" "$PY_COMPILE_OUTPUT" + stop_at "Step 8" + fi +else + warn "python not found; skipping py_compile check" +fi + +if command -v pytest >/dev/null 2>&1; then + TEST_OK=true + TEST_OUTPUT=$(cd "$REPO_DIR" && pytest tests/ -q 2>&1) || TEST_OK=false + if [ "$TEST_OK" = true ]; then + pass "pytest tests/ passed" + else + fail "pytest tests/ failed" + printf "%s\n" "$TEST_OUTPUT" + stop_at "Step 8" + fi +else + warn "pytest not installed; skipping test execution" +fi + +printf "\n" +printf "${BOLD}========================================${NC}\n" +printf "${GREEN}${BOLD} Validation checks passed: %d${NC}\n" "$PASS" +printf "${GREEN}${BOLD} Submission looks ready for hackathon review.${NC}\n" +printf "${BOLD}========================================${NC}\n" +printf "\n" + +exit 0