advikdivekar · HeetRanpura · Apr 8, 2026
diff --git a/README.md b/README.md
diff --git a/reports/README.txt b/reports/README.txt
@@ -0,0 +1,44 @@
+OpenEnv scheme_env Benchmark — Baseline Report
+================================================
+
+Files in this directory:
+
+  leaderboard.csv
+      Model rankings sorted by average score (descending).
+      Columns: Model, Size, Task1, Task2, Task3, Task4, Task5, Average.
+
+  results.json
+      Full results for all 8 models, ordered by ascending average score:
+      per-task scores and standard deviations, for programmatic analysis.
+
+  average_scores.png
+      Horizontal bar chart of each model's average score across all 5 tasks.
+      Bars are colour-coded: red < 0.50, orange 0.50–0.75, green > 0.75.
+
+  task_heatmap.png
+      Heatmap with models as rows and tasks as columns.
+      Colour scale: red = 0.0, yellow = 0.5, green = 1.0 (RdYlGn).
+      Cell values show the exact score.
+
+  efficiency_scatter.png
+      Scatter plot of average score (x) vs Task 4 score (y).
+      Task 4 is the escalation-dilemma task and tests protocol adherence.
+      Each point is labelled with the short model name.
+
+  difficulty_profile.png
+      Line chart of the mean score per task across all 8 models, with error
+      bars (±1 std). Shows which tasks are hardest and easiest on average.
+
+  summary.txt
+      Plain-text summary: best/worst model, hardest/easiest task, and any
+      model that scored 1.0 on every task.
+
+  README.txt
+      This file.
+
+Tasks:
+  Task 1 — Scheme discovery (choose the optimal eligible scheme)
+  Task 2 — Missing data (collect required fields before approving)
+  Task 3 — Boundary fraud (income-threshold boundary case)
+  Task 4 — Escalation dilemma (employment data conflict)
+  Task 5 — Document-verification age conflict
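
Review note: the colour bands described for average_scores.png can be written as a small lookup. This is a minimal sketch, not the plotting code from this PR. The function name `bar_colour` is hypothetical, and it assumes that the boundary values 0.50 and 0.75 both fall in the orange band, which the README implies but does not state outright.

```python
def bar_colour(score: float) -> str:
    """Map an average score in [0, 1] to the colour band named in README.txt.

    Assumption: 0.50 and 0.75 are both treated as orange.
    """
    if score < 0.50:
        return "red"
    if score <= 0.75:
        return "orange"
    return "green"


# A few averages taken from leaderboard.csv in this diff.
sample_averages = {
    "mistralai/mistral-nemotron": 0.967,
    "nvidia/nemotron-3-nano-30b-a3b": 0.8,
    "nvidia/nemotron-mini-4b-instruct": 0.557,
    "nvidia/llama-3.1-nemotron-nano-8b-v1": 0.184,
}

for model, average in sample_averages.items():
    print(f"{model}: {bar_colour(average)}")
```

If the chart actually treats 0.75 as green, only the second comparison needs to change from `<=` to `<`.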
diff --git a/reports/average_scores.png b/reports/average_scores.png
diff --git a/reports/difficulty_profile.png b/reports/difficulty_profile.png
diff --git a/reports/efficiency_scatter.png b/reports/efficiency_scatter.png
diff --git a/reports/inference_logs/inference_mistral_nemotron.txt b/reports/inference_logs/inference_mistral_nemotron.txt
diff --git a/reports/inference_logs/inference_nemotron3_120b.txt b/reports/inference_logs/inference_nemotron3_120b.txt
diff --git a/reports/inference_logs/inference_nemotron3_nano30b.txt b/reports/inference_logs/inference_nemotron3_nano30b.txt
diff --git a/reports/inference_logs/inference_nemotron51b.txt b/reports/inference_logs/inference_nemotron51b.txt
diff --git a/reports/inference_logs/inference_nemotron_mini4b.txt b/reports/inference_logs/inference_nemotron_mini4b.txt
diff --git a/reports/inference_logs/inference_nemotron_nano.txt b/reports/inference_logs/inference_nemotron_nano.txt
diff --git a/reports/inference_logs/inference_nemotron_nano8b.txt b/reports/inference_logs/inference_nemotron_nano8b.txt
diff --git a/reports/inference_logs/inference_nemotron_super49b.txt b/reports/inference_logs/inference_nemotron_super49b.txt
diff --git a/reports/inference_logs/inference_nvidia_8b.txt b/reports/inference_logs/inference_nvidia_8b.txt
diff --git a/reports/leaderboard.csv b/reports/leaderboard.csv
@@ -0,0 +1,9 @@
+Model,Size,Task1,Task2,Task3,Task4,Task5,Average
+mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967
+nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955
+nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951
+nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8
+nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8
+nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557
+meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517
+nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184
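
Review note: the Average column can be cross-checked against the five task columns. The sketch below embeds the CSV rows from this diff as a string so it runs on its own. The helper name `check_averages` and the tolerance of 0.001 (to absorb three-decimal rounding) are assumptions for illustration, not part of the PR.

```python
import csv
import io

# Rows copied from reports/leaderboard.csv in this diff.
LEADERBOARD_CSV = """\
Model,Size,Task1,Task2,Task3,Task4,Task5,Average
mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967
nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955
nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951
nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557
meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517
nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184
"""


def check_averages(text: str, tol: float = 0.001) -> list:
    """Return (model, recomputed mean, reported average) for rows that disagree."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(text)):
        scores = [float(row[f"Task{i}"]) for i in range(1, 6)]
        recomputed = sum(scores) / len(scores)
        reported = float(row["Average"])
        if abs(recomputed - reported) > tol:
            mismatches.append((row["Model"], round(recomputed, 3), reported))
    return mismatches


mismatches = check_averages(LEADERBOARD_CSV)
print("rows with inconsistent Average:", mismatches)
```

An empty list means every reported average agrees with the mean of its five task scores to within the tolerance.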
diff --git a/reports/results.json b/reports/results.json
@@ -0,0 +1,218 @@
+[
+  {
+    "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
+    "size": "8B",
+    "average": 0.184,
+    "tasks": {
+      "task1": {
+        "score": 0.283,
+        "std": 0.491
+      },
+      "task2": {
+        "score": 0.303,
+        "std": 0.525
+      },
+      "task3": {
+        "score": 0.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 0.333,
+        "std": 0.577
+      },
+      "task5": {
+        "score": 0.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "meta/llama-3.1-8b-instruct",
+    "size": "8B",
+    "average": 0.517,
+    "tasks": {
+      "task1": {
+        "score": 0.4,
+        "std": 0.458
+      },
+      "task2": {
+        "score": 0.0,
+        "std": 0.0
+      },
+      "task3": {
+        "score": 0.317,
+        "std": 0.548
+      },
+      "task4": {
+        "score": 0.867,
+        "std": 0.058
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "nvidia/nemotron-mini-4b-instruct",
+    "size": "4B",
+    "average": 0.557,
+    "tasks": {
+      "task1": {
+        "score": 0.483,
+        "std": 0.029
+      },
+      "task2": {
+        "score": 0.667,
+        "std": 0.577
+      },
+      "task3": {
+        "score": 0.667,
+        "std": 0.577
+      },
+      "task4": {
+        "score": 0.967,
+        "std": 0.029
+      },
+      "task5": {
+        "score": 0.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "nvidia/nemotron-3-nano-30b-a3b",
+    "size": "30B",
+    "average": 0.8,
+    "tasks": {
+      "task1": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task2": {
+        "score": 0.0,
+        "std": 0.0
+      },
+      "task3": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "nvidia/nemotron-3-super-120b-a12b",
+    "size": "120B",
+    "average": 0.8,
+    "tasks": {
+      "task1": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task2": {
+        "score": 0.0,
+        "std": 0.0
+      },
+      "task3": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "nvidia/llama-3.1-nemotron-51b-instruct",
+    "size": "51B",
+    "average": 0.951,
+    "tasks": {
+      "task1": {
+        "score": 0.8,
+        "std": 0.304
+      },
+      "task2": {
+        "score": 0.957,
+        "std": 0.045
+      },
+      "task3": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
+    "size": "49B",
+    "average": 0.955,
+    "tasks": {
+      "task1": {
+        "score": 0.8,
+        "std": 0.304
+      },
+      "task2": {
+        "score": 0.973,
+        "std": 0.023
+      },
+      "task3": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  },
+  {
+    "model": "mistralai/mistral-nemotron",
+    "size": "~56B",
+    "average": 0.967,
+    "tasks": {
+      "task1": {
+        "score": 0.833,
+        "std": 0.289
+      },
+      "task2": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task3": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task4": {
+        "score": 1.0,
+        "std": 0.0
+      },
+      "task5": {
+        "score": 1.0,
+        "std": 0.0
+      }
+    }
+  }
+]
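
Review note: results.json does not say how many runs each score is averaged over, or which estimator produced `std`. Several reported pairs, such as score 0.333 with std 0.577, are consistent with a sample standard deviation over three runs. The sketch below only illustrates that reading: the per-run values are hypothetical (the real ones are not in this diff), and the three-run, sample-std interpretation is an inference.

```python
import statistics


def mean_and_sample_std(run_scores):
    """Return (mean, sample standard deviation), both rounded to 3 decimals."""
    return (
        round(statistics.mean(run_scores), 3),
        round(statistics.stdev(run_scores), 3),
    )


# Hypothetical per-run scores: one success out of three runs.
one_of_three = mean_and_sample_std([0.0, 0.0, 1.0])
# Hypothetical per-run scores: two successes out of three runs.
two_of_three = mean_and_sample_std([1.0, 1.0, 0.0])

print(one_of_three)
print(two_of_three)
```

If this reading is right, documenting the run count and the estimator in README.txt would make the `std` fields easier to interpret.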
diff --git a/reports/summary.txt b/reports/summary.txt
@@ -0,0 +1,12 @@
+OpenEnv scheme_env Benchmark — Baseline Report Summary
+========================================================
+Date generated      : 2026-04-08
+Models evaluated    : 8
+
+Best model          : mistral-nemotron (avg=0.967)
+Worst model         : nemotron-nano-8b (avg=0.184)
+
+Hardest task        : Task 2 (mean=0.487)
+Easiest task        : Task 4 (mean=0.896)
+
+Perfect score (1.0 on all tasks): none
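
Review note: the hardest and easiest tasks in this summary can be recomputed from the per-task scores. The following is a minimal sketch with the scores copied from leaderboard.csv in this diff. The names `SCORES` and `task_means` are hypothetical and this is not the script that generated summary.txt.

```python
from statistics import mean

# Per-model scores for Task 1..Task 5, copied from leaderboard.csv.
SCORES = {
    "mistralai/mistral-nemotron": [0.833, 1.0, 1.0, 1.0, 1.0],
    "nvidia/llama-3.3-nemotron-super-49b-v1": [0.8, 0.973, 1.0, 1.0, 1.0],
    "nvidia/llama-3.1-nemotron-51b-instruct": [0.8, 0.957, 1.0, 1.0, 1.0],
    "nvidia/nemotron-3-nano-30b-a3b": [1.0, 0.0, 1.0, 1.0, 1.0],
    "nvidia/nemotron-3-super-120b-a12b": [1.0, 0.0, 1.0, 1.0, 1.0],
    "nvidia/nemotron-mini-4b-instruct": [0.483, 0.667, 0.667, 0.967, 0.0],
    "meta/llama-3.1-8b-instruct": [0.4, 0.0, 0.317, 0.867, 1.0],
    "nvidia/llama-3.1-nemotron-nano-8b-v1": [0.283, 0.303, 0.0, 0.333, 0.0],
}


def task_means(scores):
    """Average each task column across all models."""
    columns = zip(*scores.values())
    return {f"Task {i}": mean(col) for i, col in enumerate(columns, start=1)}


means = task_means(SCORES)
hardest = min(means, key=means.get)
easiest = max(means, key=means.get)
perfect = [model for model, s in SCORES.items() if all(v == 1.0 for v in s)]

print("hardest:", hardest)
print("easiest:", easiest)
print("perfect on all tasks:", perfect or "none")
```

The recomputed Task 2 mean is 0.4875, which sits on a rounding boundary, so a three-decimal report can show either 0.487 or 0.488 depending on the rounding method.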
diff --git a/reports/task_heatmap.png b/reports/task_heatmap.png
diff --git a/reports/test_logs/pytest_results.txt b/reports/test_logs/pytest_results.txt
@@ -0,0 +1,30 @@
+============================= test session starts ==============================
+platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14
+cachedir: .pytest_cache
+rootdir: /Users/advikdivekar/Desktop/OpenEnv
+configfile: pyproject.toml
+plugins: anyio-4.13.0
+collecting ... collected 20 items
+
+tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [  5%]
+tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%]
+tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%]
+tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%]
+tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%]
+tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%]
+tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%]
+tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%]
+tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%]
+tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%]
+tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%]
+tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%]
+tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%]
+tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%]
+tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED       [ 75%]
+tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%]
+tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED       [ 85%]
+tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%]
+tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED     [ 95%]
+tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED  [100%]
+
+============================== 20 passed in 2.24s ==============================
diff --git a/reports/test_logs/smoke_test_results.txt b/reports/test_logs/smoke_test_results.txt
@@ -0,0 +1,61 @@
+
+============================================================
+SMOKE TEST — scheme_env
+============================================================
+
+Task 1 — Scheme Discovery
+  Profile at reset: {'age': '28', 'income': '4665', 'marital_status': 'widowed', 'number_of_children': '1'}
+  Missing: ['occupation', 'has_aadhaar']
+  [PASS] Task 1 terminates on approve — done=True
+  [PASS] Task 1 reward positive or suboptimal — reward=3.0
+  [PASS] Task 1 grader_score set — score=0.5
+  Notification: SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.
+
+Task 2 — Missing Data (premature approval)
+  Missing at reset: ['occupation', 'has_aadhaar']
+  [PASS] Task 2 premature approve terminates — done=True
+  [PASS] Task 2 premature approve scores 0.0 — score=0.0
+  [PASS] Task 2 premature approve reward negative — reward=-5.0
+
+Task 2 — Missing Data (correct flow)
+  [PASS] Task 2 correct flow terminates — done=True
+  [PASS] Task 2 correct flow reward positive — reward=10.0
+  [PASS] Task 2 correct flow score > 0 — score=1.0
+
+Task 3 — Boundary Fraud (reject before asking income)
+  Profile at reset: {'age': '24', 'number_of_children': '3', 'marital_status': 'widowed'}
+  Missing: ['occupation', 'has_aadhaar', 'income']
+  [PASS] Task 3 income hidden at reset — income in profile=['age', 'number_of_children', 'marital_status']
+  [PASS] Task 3 premature reject scores 0.0 — score=0.0
+  [PASS] Task 3 premature reject reward negative — reward=-2.0
+
+Task 3 — Boundary Fraud (correct flow)
+  [PASS] Task 3 correct reject terminates — done=True
+  [PASS] Task 3 correct reject reward positive — reward=5.0
+  [PASS] Task 3 correct reject score > 0.30 — score=1.0
+
+Task 4 — Escalation (escalate before PAN — should soft-block)
+  [PASS] Task 4 escalate before PAN is non-terminal — done=False
+  [PASS] Task 4 escalate before PAN reward negative — reward=-1.0
+
+Task 4 — Escalation (correct flow)
+  [PASS] Task 4 PAN reveals contradiction — notif=PAN card retrieved. Records show this applicant has been a registered employee o
+  [PASS] Task 4 correct escalate terminates — done=True
+  [PASS] Task 4 correct escalate reward=10 — reward=10.0
+  [PASS] Task 4 correct escalate score > 0.80 — score=1.0
+
+Task 5 — Document Conflict (correct flow)
+  Self-reported age in profile: 33
+  [PASS] Task 5 Aadhaar reveals age conflict — notif=Aadhaar card verified. Official age on record: 36 years. Note: this differs from
+  [PASS] Task 5 correct reject terminates — done=True
+  [PASS] Task 5 correct reject reward=5 — reward=5.0
+  [PASS] Task 5 correct reject score > 0.80 — score=1.0
+
+Wrong escalation on Task 1 (should now be terminal)
+  [PASS] Wrong escalation Task 1 is terminal — done=True
+  [PASS] Wrong escalation reward=-2.0 — reward=-2.0
+
+============================================================
+ALL TESTS PASSED — environment logic is correct
+============================================================
+