873 changes: 736 additions & 137 deletions README.md

Large diffs are not rendered by default.

44 changes: 44 additions & 0 deletions reports/README.txt
@@ -0,0 +1,44 @@
OpenEnv scheme_env Benchmark — Baseline Report
================================================

Files in this directory:

leaderboard.csv
Model rankings sorted by average score (descending).
Columns: Model, Size, Task1, Task2, Task3, Task4, Task5, Average.

results.json
Full results for all models including per-task scores and standard
deviations. Useful for programmatic downstream analysis.

average_scores.png
Horizontal bar chart of each model's average score across all 5 tasks.
Bars are colour-coded: red < 0.50, orange 0.50–0.75, green > 0.75.

task_heatmap.png
Heatmap with models as rows and tasks as columns.
Colour scale: red = 0.0, yellow = 0.5, green = 1.0 (RdYlGn).
Cell values show the exact score.

efficiency_scatter.png
Scatter plot of average score (x) vs Task 4 score (y).
Task 4 is the escalation-dilemma task and tests protocol adherence.
Each point is labelled with the short model name.

difficulty_profile.png
Line chart showing mean score per task across all 8 models with error
bars (±1 std). Reveals which tasks are hardest / easiest on average.

summary.txt
Plain-text summary: best/worst model, hardest/easiest task, and any
model that scored 1.0 on every task.

README.txt
This file.

Tasks:
Task 1 — Basic eligibility check
Task 2 — Multi-criterion scheme selection
Task 3 — Income-threshold boundary case
Task 4 — Escalation dilemma (employment data conflict)
Task 5 — Document-verification age conflict
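
The colour bands described for average_scores.png can be expressed as a small lookup. This is a minimal sketch, not the plotting code from this PR: the function name `bar_colour` is hypothetical, and treating exactly 0.50 and 0.75 as orange is an assumption read from the ranges listed above.

```python
def bar_colour(average: float) -> str:
    """Map an average score in [0, 1] to the colour band listed in the README."""
    if average < 0.50:
        return "red"
    if average <= 0.75:  # assumption: both boundary values fall in the orange band
        return "orange"
    return "green"


# Averages taken from reports/leaderboard.csv in this PR.
for model, average in [
    ("mistralai/mistral-nemotron", 0.967),
    ("nvidia/nemotron-mini-4b-instruct", 0.557),
    ("nvidia/llama-3.1-nemotron-nano-8b-v1", 0.184),
]:
    print(f"{model}: {bar_colour(average)}")
```
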
Binary file added reports/average_scores.png
Binary file added reports/difficulty_profile.png
Binary file added reports/efficiency_scatter.png
349 changes: 349 additions & 0 deletions reports/inference_logs/inference_mistral_nemotron.txt

Large diffs are not rendered by default.

358 changes: 358 additions & 0 deletions reports/inference_logs/inference_nemotron3_120b.txt

Large diffs are not rendered by default.

352 changes: 352 additions & 0 deletions reports/inference_logs/inference_nemotron3_nano30b.txt

Large diffs are not rendered by default.

370 changes: 370 additions & 0 deletions reports/inference_logs/inference_nemotron51b.txt

Large diffs are not rendered by default.

396 changes: 396 additions & 0 deletions reports/inference_logs/inference_nemotron_mini4b.txt

Large diffs are not rendered by default.

562 changes: 562 additions & 0 deletions reports/inference_logs/inference_nemotron_nano.txt

Large diffs are not rendered by default.

544 changes: 544 additions & 0 deletions reports/inference_logs/inference_nemotron_nano8b.txt

Large diffs are not rendered by default.

376 changes: 376 additions & 0 deletions reports/inference_logs/inference_nemotron_super49b.txt

Large diffs are not rendered by default.

672 changes: 672 additions & 0 deletions reports/inference_logs/inference_nvidia_8b.txt

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions reports/leaderboard.csv
@@ -0,0 +1,9 @@
Model,Size,Task1,Task2,Task3,Task4,Task5,Average
mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967
nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955
nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951
nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557
meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517
nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184
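
A reviewer may want to confirm that the Average column agrees with the five task columns and that the rows are sorted in descending order. The sketch below does that with the standard `csv` module. The CSV text is copied from this diff; the helper names and the 0.001 tolerance (to allow for three-decimal rounding) are assumptions, not part of the PR.

```python
import csv
import io

# Contents of reports/leaderboard.csv as added in this PR.
LEADERBOARD_CSV = """\
Model,Size,Task1,Task2,Task3,Task4,Task5,Average
mistralai/mistral-nemotron,~56B,0.833,1.0,1.0,1.0,1.0,0.967
nvidia/llama-3.3-nemotron-super-49b-v1,49B,0.8,0.973,1.0,1.0,1.0,0.955
nvidia/llama-3.1-nemotron-51b-instruct,51B,0.8,0.957,1.0,1.0,1.0,0.951
nvidia/nemotron-3-nano-30b-a3b,30B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-3-super-120b-a12b,120B,1.0,0.0,1.0,1.0,1.0,0.8
nvidia/nemotron-mini-4b-instruct,4B,0.483,0.667,0.667,0.967,0.0,0.557
meta/llama-3.1-8b-instruct,8B,0.4,0.0,0.317,0.867,1.0,0.517
nvidia/llama-3.1-nemotron-nano-8b-v1,8B,0.283,0.303,0.0,0.333,0.0,0.184
"""

TASK_COLUMNS = ["Task1", "Task2", "Task3", "Task4", "Task5"]


def recompute_averages(csv_text):
    """Return (model, recomputed mean, reported average) for every row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores = [float(row[column]) for column in TASK_COLUMNS]
        rows.append((row["Model"], sum(scores) / len(scores), float(row["Average"])))
    return rows


def mismatched_models(csv_text, tolerance=1e-3):
    """List models whose reported average differs from the recomputed mean."""
    return [
        model
        for model, recomputed, reported in recompute_averages(csv_text)
        if abs(recomputed - reported) >= tolerance
    ]


def is_sorted_descending(csv_text):
    """Check that the reported averages never increase from one row to the next."""
    reported = [average for _, _, average in recompute_averages(csv_text)]
    return all(a >= b for a, b in zip(reported, reported[1:]))


print("mismatches:", mismatched_models(LEADERBOARD_CSV))
print("sorted descending:", is_sorted_descending(LEADERBOARD_CSV))
```
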
218 changes: 218 additions & 0 deletions reports/results.json
@@ -0,0 +1,218 @@
[
{
"model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
"size": "8B",
"average": 0.184,
"tasks": {
"task1": {
"score": 0.283,
"std": 0.491
},
"task2": {
"score": 0.303,
"std": 0.525
},
"task3": {
"score": 0.0,
"std": 0.0
},
"task4": {
"score": 0.333,
"std": 0.577
},
"task5": {
"score": 0.0,
"std": 0.0
}
}
},
{
"model": "meta/llama-3.1-8b-instruct",
"size": "8B",
"average": 0.517,
"tasks": {
"task1": {
"score": 0.4,
"std": 0.458
},
"task2": {
"score": 0.0,
"std": 0.0
},
"task3": {
"score": 0.317,
"std": 0.548
},
"task4": {
"score": 0.867,
"std": 0.058
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
},
{
"model": "nvidia/nemotron-mini-4b-instruct",
"size": "4B",
"average": 0.557,
"tasks": {
"task1": {
"score": 0.483,
"std": 0.029
},
"task2": {
"score": 0.667,
"std": 0.577
},
"task3": {
"score": 0.667,
"std": 0.577
},
"task4": {
"score": 0.967,
"std": 0.029
},
"task5": {
"score": 0.0,
"std": 0.0
}
}
},
{
"model": "nvidia/nemotron-3-nano-30b-a3b",
"size": "30B",
"average": 0.8,
"tasks": {
"task1": {
"score": 1.0,
"std": 0.0
},
"task2": {
"score": 0.0,
"std": 0.0
},
"task3": {
"score": 1.0,
"std": 0.0
},
"task4": {
"score": 1.0,
"std": 0.0
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
},
{
"model": "nvidia/nemotron-3-super-120b-a12b",
"size": "120B",
"average": 0.8,
"tasks": {
"task1": {
"score": 1.0,
"std": 0.0
},
"task2": {
"score": 0.0,
"std": 0.0
},
"task3": {
"score": 1.0,
"std": 0.0
},
"task4": {
"score": 1.0,
"std": 0.0
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
},
{
"model": "nvidia/llama-3.1-nemotron-51b-instruct",
"size": "51B",
"average": 0.951,
"tasks": {
"task1": {
"score": 0.8,
"std": 0.304
},
"task2": {
"score": 0.957,
"std": 0.045
},
"task3": {
"score": 1.0,
"std": 0.0
},
"task4": {
"score": 1.0,
"std": 0.0
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
},
{
"model": "nvidia/llama-3.3-nemotron-super-49b-v1",
"size": "49B",
"average": 0.955,
"tasks": {
"task1": {
"score": 0.8,
"std": 0.304
},
"task2": {
"score": 0.973,
"std": 0.023
},
"task3": {
"score": 1.0,
"std": 0.0
},
"task4": {
"score": 1.0,
"std": 0.0
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
},
{
"model": "mistralai/mistral-nemotron",
"size": "~56B",
"average": 0.967,
"tasks": {
"task1": {
"score": 0.833,
"std": 0.289
},
"task2": {
"score": 1.0,
"std": 0.0
},
"task3": {
"score": 1.0,
"std": 0.0
},
"task4": {
"score": 1.0,
"std": 0.0
},
"task5": {
"score": 1.0,
"std": 0.0
}
}
}
]
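
The nested layout of results.json (model, then tasks, then score and std) lends itself to a per-task aggregation like the one described for difficulty_profile.png. The following is a sketch under stated assumptions: `task_profile` is a hypothetical helper, the spread is computed as the sample standard deviation of scores across models (the PR does not say how the chart's error bars were derived), and the inline sample holds only two of the eight records, compacted from the file above.

```python
import json
from statistics import mean, stdev

# Two records copied from reports/results.json, compacted for brevity.
SAMPLE_JSON = """
[
  {"model": "nvidia/llama-3.1-nemotron-nano-8b-v1", "size": "8B", "average": 0.184,
   "tasks": {"task1": {"score": 0.283, "std": 0.491},
             "task2": {"score": 0.303, "std": 0.525},
             "task3": {"score": 0.0, "std": 0.0},
             "task4": {"score": 0.333, "std": 0.577},
             "task5": {"score": 0.0, "std": 0.0}}},
  {"model": "mistralai/mistral-nemotron", "size": "~56B", "average": 0.967,
   "tasks": {"task1": {"score": 0.833, "std": 0.289},
             "task2": {"score": 1.0, "std": 0.0},
             "task3": {"score": 1.0, "std": 0.0},
             "task4": {"score": 1.0, "std": 0.0},
             "task5": {"score": 1.0, "std": 0.0}}}
]
"""


def task_profile(results):
    """Return {task: (mean score across models, std of scores across models)}."""
    profile = {}
    for task in sorted(results[0]["tasks"]):
        scores = [entry["tasks"][task]["score"] for entry in results]
        # Assumption: spread across models, using the sample standard deviation.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        profile[task] = (mean(scores), spread)
    return profile


results = json.loads(SAMPLE_JSON)
for task, (avg, spread) in task_profile(results).items():
    print(f"{task}: mean={avg:.3f} std={spread:.3f}")
```

To run it on the full file, replace `json.loads(SAMPLE_JSON)` with `json.load(open("reports/results.json"))`.
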
12 changes: 12 additions & 0 deletions reports/summary.txt
@@ -0,0 +1,12 @@
OpenEnv scheme_env Benchmark — Baseline Report Summary
========================================================
Date generated : 2026-04-08
Models evaluated : 8

Best model : mistral-nemotron (avg=0.967)
Worst model : nemotron-nano-8b (avg=0.184)

Hardest task : Task 2 (mean=0.487)
Easiest task : Task 4 (mean=0.896)

Perfect score (1.0 on all tasks): none
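
The hardest and easiest tasks and the "perfect score" line can be re-derived from the leaderboard scores. This is a sketch for cross-checking, not the script that produced summary.txt. The scores are copied from reports/leaderboard.csv; the names `SCORES` and `task_means` are hypothetical.

```python
# Task 1..5 scores per model, copied from reports/leaderboard.csv.
SCORES = {
    "mistralai/mistral-nemotron": [0.833, 1.0, 1.0, 1.0, 1.0],
    "nvidia/llama-3.3-nemotron-super-49b-v1": [0.8, 0.973, 1.0, 1.0, 1.0],
    "nvidia/llama-3.1-nemotron-51b-instruct": [0.8, 0.957, 1.0, 1.0, 1.0],
    "nvidia/nemotron-3-nano-30b-a3b": [1.0, 0.0, 1.0, 1.0, 1.0],
    "nvidia/nemotron-3-super-120b-a12b": [1.0, 0.0, 1.0, 1.0, 1.0],
    "nvidia/nemotron-mini-4b-instruct": [0.483, 0.667, 0.667, 0.967, 0.0],
    "meta/llama-3.1-8b-instruct": [0.4, 0.0, 0.317, 0.867, 1.0],
    "nvidia/llama-3.1-nemotron-nano-8b-v1": [0.283, 0.303, 0.0, 0.333, 0.0],
}


def task_means(scores):
    """Average each task column across all models."""
    columns = zip(*scores.values())  # transpose: one tuple of scores per task
    return {
        f"Task {index}": sum(column) / len(column)
        for index, column in enumerate(columns, start=1)
    }


means = task_means(SCORES)
hardest = min(means, key=means.get)
easiest = max(means, key=means.get)
perfect = [model for model, row in SCORES.items() if all(s == 1.0 for s in row)]

print(f"Hardest task: {hardest} (mean={means[hardest]:.4f})")
print(f"Easiest task: {easiest} (mean={means[easiest]:.4f})")
print("Perfect score:", perfect or "none")
```

The Task 2 mean is 3.9 / 8 = 0.4875, which summary.txt reports as 0.487.
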
Binary file added reports/task_heatmap.png
30 changes: 30 additions & 0 deletions reports/test_logs/pytest_results.txt
@@ -0,0 +1,30 @@
============================= test session starts ==============================
platform darwin -- Python 3.14.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/advikdivekar/Desktop/OpenEnv/venv/bin/python3.14
cachedir: .pytest_cache
rootdir: /Users/advikdivekar/Desktop/OpenEnv
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 20 items

tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_lower_bound PASSED [ 5%]
tests/test_scheme_eligibility.py::test_pmkvy_qualifies_age_upper_bound PASSED [ 10%]
tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_age_exceeded PASSED [ 15%]
tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_income_exceeded PASSED [ 20%]
tests/test_scheme_eligibility.py::test_pmkvy_disqualifies_wrong_occupation PASSED [ 25%]
tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_lower_bound PASSED [ 30%]
tests/test_scheme_eligibility.py::test_mgnregs_qualifies_age_upper_bound PASSED [ 35%]
tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_age_exceeded PASSED [ 40%]
tests/test_scheme_eligibility.py::test_mgnregs_disqualifies_no_aadhaar PASSED [ 45%]
tests/test_scheme_eligibility.py::test_pmay_qualifies_age_lower_bound PASSED [ 50%]
tests/test_scheme_eligibility.py::test_pmay_disqualifies_income_at_threshold PASSED [ 55%]
tests/test_scheme_eligibility.py::test_pmay_qualifies_age_upper_bound PASSED [ 60%]
tests/test_scheme_eligibility.py::test_pmay_disqualifies_age_exceeded PASSED [ 65%]
tests/test_scheme_eligibility.py::test_optimal_prefers_pmay_over_pmkvy PASSED [ 70%]
tests/test_scheme_eligibility.py::test_optimal_mgnregs_only PASSED [ 75%]
tests/test_scheme_eligibility.py::test_optimal_none_when_no_scheme PASSED [ 80%]
tests/test_scheme_eligibility.py::test_grader_score_perfect PASSED [ 85%]
tests/test_scheme_eligibility.py::test_grader_score_noise_penalty PASSED [ 90%]
tests/test_scheme_eligibility.py::test_grader_score_zero_base PASSED [ 95%]
tests/test_scheme_eligibility.py::test_grader_score_floor_at_030 PASSED [100%]

============================== 20 passed in 2.24s ==============================
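
The test names above suggest two boundary conventions: age limits are inclusive (an applicant qualifies at both the lower and the upper bound), while the income limit is exclusive (an applicant is disqualified at the threshold). The sketch below illustrates that pattern only. `SchemeRule` and every numeric limit in it are hypothetical placeholders, not the rules in tests/test_scheme_eligibility.py and not the real criteria of any scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemeRule:
    """Hypothetical eligibility rule; all limits are illustrative placeholders."""

    min_age: int
    max_age: int
    income_limit: int  # applicant must earn strictly less than this

    def qualifies(self, age: int, income: int) -> bool:
        age_ok = self.min_age <= age <= self.max_age  # inclusive on both ends
        income_ok = income < self.income_limit  # exclusive at the threshold
        return age_ok and income_ok


# Placeholder limits chosen only to exercise the boundaries.
RULE = SchemeRule(min_age=21, max_age=55, income_limit=6000)

print(RULE.qualifies(age=21, income=5999))  # lower age bound
print(RULE.qualifies(age=55, income=5999))  # upper age bound
print(RULE.qualifies(age=56, income=5999))  # age exceeded
print(RULE.qualifies(age=30, income=6000))  # income exactly at threshold
```
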
61 changes: 61 additions & 0 deletions reports/test_logs/smoke_test_results.txt
@@ -0,0 +1,61 @@

============================================================
SMOKE TEST — scheme_env
============================================================

Task 1 — Scheme Discovery
Profile at reset: {'age': '28', 'income': '4665', 'marital_status': 'widowed', 'number_of_children': '1'}
Missing: ['occupation', 'has_aadhaar']
[PASS] Task 1 terminates on approve — done=True
[PASS] Task 1 reward positive or suboptimal — reward=3.0
[PASS] Task 1 grader_score set — score=0.5
Notification: SUBOPTIMAL: PMKVY is eligible but PMAY provides greater benefit.

Task 2 — Missing Data (premature approval)
Missing at reset: ['occupation', 'has_aadhaar']
[PASS] Task 2 premature approve terminates — done=True
[PASS] Task 2 premature approve scores 0.0 — score=0.0
[PASS] Task 2 premature approve reward negative — reward=-5.0

Task 2 — Missing Data (correct flow)
[PASS] Task 2 correct flow terminates — done=True
[PASS] Task 2 correct flow reward positive — reward=10.0
[PASS] Task 2 correct flow score > 0 — score=1.0

Task 3 — Boundary Fraud (reject before asking income)
Profile at reset: {'age': '24', 'number_of_children': '3', 'marital_status': 'widowed'}
Missing: ['occupation', 'has_aadhaar', 'income']
[PASS] Task 3 income hidden at reset — income in profile=['age', 'number_of_children', 'marital_status']
[PASS] Task 3 premature reject scores 0.0 — score=0.0
[PASS] Task 3 premature reject reward negative — reward=-2.0

Task 3 — Boundary Fraud (correct flow)
[PASS] Task 3 correct reject terminates — done=True
[PASS] Task 3 correct reject reward positive — reward=5.0
[PASS] Task 3 correct reject score > 0.30 — score=1.0

Task 4 — Escalation (escalate before PAN — should soft-block)
[PASS] Task 4 escalate before PAN is non-terminal — done=False
[PASS] Task 4 escalate before PAN reward negative — reward=-1.0

Task 4 — Escalation (correct flow)
[PASS] Task 4 PAN reveals contradiction — notif=PAN card retrieved. Records show this applicant has been a registered employee o
[PASS] Task 4 correct escalate terminates — done=True
[PASS] Task 4 correct escalate reward=10 — reward=10.0
[PASS] Task 4 correct escalate score > 0.80 — score=1.0

Task 5 — Document Conflict (correct flow)
Self-reported age in profile: 33
[PASS] Task 5 Aadhaar reveals age conflict — notif=Aadhaar card verified. Official age on record: 36 years. Note: this differs from
[PASS] Task 5 correct reject terminates — done=True
[PASS] Task 5 correct reject reward=5 — reward=5.0
[PASS] Task 5 correct reject score > 0.80 — score=1.0

Wrong escalation on Task 1 (should now be terminal)
[PASS] Wrong escalation Task 1 is terminal — done=True
[PASS] Wrong escalation reward=-2.0 — reward=-2.0

============================================================
ALL TESTS PASSED — environment logic is correct
============================================================
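
Each line in the log above follows the shape `[PASS] <label> <dash> <detail>`, ending with a banner. A reporting helper that produces this shape might look like the sketch below. This is an assumption about how such a harness could be written, not the smoke-test script from this PR: the names `check` and `banner` and the wording of the failure banner are hypothetical.

```python
SEP = " \u2014 "  # the dash separator seen between label and detail in the log


def check(label, condition, detail, outcomes):
    """Record one assertion and return the log line that describes it."""
    outcomes.append(bool(condition))
    status = "PASS" if condition else "FAIL"
    line = f"[{status}] {label}{SEP}{detail}"
    print(line)
    return line


def banner(outcomes):
    """Summarise the run; the failure wording here is a hypothetical choice."""
    if all(outcomes):
        return f"ALL TESTS PASSED{SEP}environment logic is correct"
    return f"{outcomes.count(False)} TEST(S) FAILED"


# Values mirror the first two Task 1 checks in the log above.
outcomes = []
done, reward = True, 3.0
check("Task 1 terminates on approve", done is True, f"done={done}", outcomes)
check("Task 1 reward positive or suboptimal", reward > 0, f"reward={reward}", outcomes)
print(banner(outcomes))
```
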
