This repository contains the code, prompts, raw conversation data, judge outputs, and LaTeX source for the paper *Same Question, Different Answer: Latent Quality in LLMs Under Critical Engagement* (`paper/main.pdf` · `paper/main.tex`).

An LLM's default single-turn output systematically underrepresents what the model can produce under sustained critical user engagement. Across 14 open-ended analytical tasks and 9 engagement conditions, critical engagement substantially exceeds best-of-$N$ sampling, and the effect decomposes into two structurally distinct modes:

- Evaluative critique (three error-and-gap-flagging moves) drives epistemic calibration (Cohen's $d = 1.51$ vs. the sampling baseline).
- Comprehensive critique (the full 14-move taxonomy, adding elicitation, generative, and calibration moves) drives analytical novelty (Cohen's $d = 2.54$).

Both effects persist against a passive-engagement control, replicate across focal models (DeepSeek V4 Pro → Grok 4.3), and survive multiple-comparison correction. The two modes are not points on a single dose-response axis: a controlled persistence-directive ablation (Appendix B) shows that combining them via prompt instruction degrades both. Full results in the paper.
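
For readers unfamiliar with the effect-size metric quoted above, here is a minimal sketch of Cohen's $d$ using a pooled standard deviation. This is one common convention and an assumption on our part; the paper's analysis scripts may use a different variant (for example, a paired design). The function name and the scores are made up for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Cohen's d with a pooled standard deviation (assumed convention)."""
    n1, n2 = len(treatment), len(baseline)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Hypothetical 7-point rubric scores, for illustration only.
critique_scores = [6, 5, 6, 7, 5]
sampling_scores = [4, 5, 4, 5, 3]

d = cohens_d(critique_scores, sampling_scores)
print(f"d = {d:.2f}")
```

By the usual rule of thumb, values of $d$ above roughly 0.8 are considered large, which puts the reported 1.51 and 2.54 in context.
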
| Path | What it is |
|---|---|
| `paper/` | Paper LaTeX source, references, figures and figure-generation scripts (`paper/figures/`), and prose-number analysis scripts (`paper/analysis/`). `paper/main.pdf` is the compiled manuscript. |
| `src/crit_thinking/` | Python package: experiment runner, model clients (Anthropic, OpenAI, DeepSeek, Grok, Google), conversation pipeline, judge pipeline, storage layer. |
| `src/crit_thinking/prompts/` | All 9 user-LLM system prompts (one per engagement condition) plus the 2 judge prompts (per-dimension 7-point rubric, position-bias-corrected pairwise). The same files are loaded at runtime and reproduced verbatim in Appendix B. |
| `src/crit_thinking/tasks/` | The 16 analytical tasks (opening prompts and metadata). |
| `data/experiments/` | Raw transcripts and judge outputs for all 327 conversations across the primary, replication, ablation, bookend, and pilot studies. See `data/experiments/README.md` for layout. |
| `tests/` | Unit and integration tests for the experiment infrastructure. |
| `art_study/` | The taxonomy-pilot conversations (one human researcher engaging Claude Opus 4.7 across 4 overlap topics), plus the operational codebook and the critical-thinking reference card used to scaffold the engagement. |
| `docs/` | Conceptual notes, the measurement framework, and the phase-by-phase project journal (`docs/phases/`). |
| `literature-review/` | Reviews of the three closest prior works (Self-Refine, Multi-Agent Debate, Another Turn) that gated the design. Blocks B/C were de-scoped. |
| `research_design.md` | The authoritative pre-registered design document. |

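
The "position-bias-corrected pairwise" judge mentioned in the prompts row can be illustrated with a small sketch. One common way to correct for position bias is to judge each pair in both presentation orders and count a win only when the verdict agrees across orders. Whether the judge pipeline in `src/crit_thinking/` works exactly this way is an assumption; the function names and toy judges below are hypothetical and do not call any model.

```python
from typing import Callable

# A judge sees (first_response, second_response) and returns "first" or "second".
Judge = Callable[[str, str], str]


def debiased_pairwise(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped with position, so treat it as uninformative.
    return "tie"


# Toy stand-ins for an LLM judge (hypothetical, for illustration only).
def always_first(x: str, y: str) -> str:
    return "first"  # a purely position-biased judge


def prefers_longer(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"


biased_result = debiased_pairwise(always_first, "short", "a longer answer")
content_result = debiased_pairwise(prefers_longer, "short", "a longer answer")
print(biased_result, content_result)
```

A judge that always picks whichever response comes first yields a tie under this scheme, while a judge that responds to content yields the same winner in both orders.
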
Every figure and reported number regenerates from a committed script against committed data.

    # 1. Set up the environment
    uv sync    # uses pyproject.toml + uv.lock

    # 2. Reproduce figures (output goes to paper/figures/)
    uv run python paper/figures/fig1_taxonomy.py
    uv run python paper/figures/fig2_dimension_heatmap.py
    uv run python paper/figures/fig3_effect_sizes.py
    # ... etc

    # 3. Reproduce prose numbers (output JSON in paper/analysis/)
    uv run python paper/analysis/objective_detection.py
    uv run python paper/analysis/art_study_pairwise.py
    uv run python paper/analysis/v2_followthrough.py
    uv run python paper/analysis/v2_conversation_length.py

    # 4. Build the paper
    cd paper && make pdf

The analysis and figure scripts read from `data/experiments/` and make no API calls. To re-run the experiments themselves (which does require API credentials), copy `.env.example` to `.env`, fill in keys for Anthropic / OpenAI / DeepSeek / xAI / Google, then drive the runner from `src/crit_thinking/scripts/`.
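
Before launching a run, it can help to confirm that `.env` actually contains the keys. The sketch below is not part of the repository: the variable names are assumptions based on common provider conventions, so check `.env.example` for the names the runner really expects.

```python
from pathlib import Path

# Hypothetical key names; the real ones are listed in .env.example.
ASSUMED_KEYS = (
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "XAI_API_KEY",
    "GOOGLE_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, names=ASSUMED_KEYS) -> list[str]:
    """Return the names that are absent or empty in the given .env text."""
    values = parse_env(text)
    return [name for name in names if not values.get(name)]


env_path = Path(".env")
if env_path.exists():
    print("missing or empty:", missing_keys(env_path.read_text()))
else:
    print("no .env found; copy .env.example to .env first")
```

Run it from the repository root. An empty list means every assumed key has a non-empty value; it does not verify that the keys are valid.
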

- Python 3.11+ · `uv` for environment management
- For paper rebuild: TeX Live 2025 or equivalent (the paper uses `newtx`, `fvextra`, `cleveref`, `tikz`, and standard ML-paper packages)
- For experiment re-runs: API access to DeepSeek (focal, primary) and optionally Grok (focal, replication), Anthropic (user-LLM), OpenAI (judge)


    @misc{bhat2026samequestion,
      title  = {Same Question, Different Answer: Latent Quality in LLMs
                Under Critical Engagement},
      author = {Bhat, Kartik G},
      year   = {2026},
      url    = {https://github.com/kar-ganap/crit-thinking}
    }

Licensed under MIT; see `LICENSE`. This covers code, prompts, conversation transcripts, and the paper LaTeX source. Use freely with attribution.

Kartik G Bhat · gkartik@gmail.com