Same Question, Different Answer

Latent Quality in LLMs Under Critical Engagement

This repository contains the code, prompts, raw conversation data, judge outputs, and LaTeX source for the paper "Same Question, Different Answer: Latent Quality in LLMs Under Critical Engagement" (`paper/main.pdf` · `paper/main.tex`).

What the paper finds

LLM default single-turn outputs systematically underrepresent what the model can produce under sustained critical user engagement. Across 14 open-ended analytical tasks and 9 engagement conditions, critical engagement substantially exceeds best-of-$N$ sampling, and the effect decomposes into two structurally distinct modes:

Evaluative critique (three error-and-gap-flagging moves) drives epistemic calibration: Cohen's $d = 1.51$ vs. the sampling baseline.
Comprehensive critique (the full 14-move taxonomy, adding elicitation, generative, and calibration moves) drives analytical novelty: Cohen's $d = 2.54$.
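For readers unfamiliar with the effect-size metric quoted above, the following is a minimal sketch of Cohen's $d$ computed with a pooled sample standard deviation. It assumes two independent groups of rubric scores; the scores and the `cohens_d` helper are illustrative and not taken from the repository, and the paper may use a different variant (for example a paired or bias-corrected estimator).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(baseline)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Made-up 7-point rubric scores, for illustration only (not data from the paper).
critique_scores = [5, 6, 6, 7]
sampling_scores = [4, 4, 5, 5]
print(round(cohens_d(critique_scores, sampling_scores), 2))  # prints 2.12
```

By the usual rule of thumb, values above roughly 0.8 are considered large, so the reported values of 1.51 and 2.54 are large effects on this scale.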

Both effects persist against a passive-engagement control, replicate across focal models (DeepSeek V4 Pro → Grok 4.3), and survive multiple-comparison correction. The two modes are not points on a single dose-response axis: a controlled persistence-directive ablation (Appendix B) shows that combining them via prompt instruction degrades both. Full results in the paper.
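This README does not say which multiple-comparison correction the paper applies. As one common example only, the sketch below implements the Holm step-down procedure; the `holm_reject` helper and the p-values are hypothetical and do not come from the paper or the codebase.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected under Holm's step-down procedure."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, every larger p-value is retained too
    return reject


# Hypothetical p-values, for illustration only.
print(holm_reject([0.001, 0.04, 0.03, 0.2]))  # prints [True, False, False, False]
```

An effect "survives" correction in this sense when its hypothesis is still rejected after the thresholds have been tightened for the number of comparisons made.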

Repository layout

| Path | What it is |
| --- | --- |
| `paper/` | Paper LaTeX source, references, figures and figure-generation scripts (`paper/figures/`), and prose-number analysis scripts (`paper/analysis/`). `paper/main.pdf` is the compiled manuscript. |
| `src/crit_thinking/` | Python package: experiment runner, model clients (Anthropic, OpenAI, DeepSeek, Grok, Google), conversation pipeline, judge pipeline, storage layer. |
| `src/crit_thinking/prompts/` | All 9 user-LLM system prompts (one per engagement condition) plus the 2 judge prompts (per-dimension 7-point rubric, position-bias-corrected pairwise). Same files loaded at runtime and reproduced verbatim in Appendix B. |
| `src/crit_thinking/tasks/` | The 16 analytical tasks (opening prompts and metadata). |
| `data/experiments/` | Raw transcripts and judge outputs for all 327 conversations across the primary, replication, ablation, bookend, and pilot studies. See `data/experiments/README.md` for layout. |
| `tests/` | Unit and integration tests for the experiment infrastructure. |
| `art_study/` | The taxonomy-pilot conversations (one human researcher engaging Claude Opus 4.7 across 4 overlap topics), plus the operational codebook and the critical-thinking reference card used to scaffold the engagement. |
| `docs/` | Conceptual notes, the measurement framework, and the phase-by-phase project journal (`docs/phases/`). |
| `literature-review/` | Reviews of the three closest prior works (Self-Refine, Multi-Agent Debate, Another Turn) that gated the design. Blocks B/C were de-scoped. |
| `research_design.md` | The authoritative pre-registered design document. |
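
The layout above mentions a position-bias-corrected pairwise judge prompt. One common way to correct for position bias is to query the judge in both presentation orders and keep a preference only when the two verdicts agree. The sketch below illustrates that idea under stated assumptions: `debiased_preference`, the verdict labels, and the toy judges are hypothetical and are not the repository's judge pipeline, which may correct for position differently.

```python
from typing import Callable

# A judge takes (first_shown, second_shown) and answers "first", "second", or "tie".
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after querying the judge in both presentation orders."""
    forward = judge(a, b)  # a is shown first
    reverse = judge(b, a)  # b is shown first
    # Translate position-relative verdicts back to the underlying responses.
    forward_winner = {"first": "a", "second": "b", "tie": "tie"}[forward]
    reverse_winner = {"first": "b", "second": "a", "tie": "tie"}[reverse]
    # Keep a preference only if it is stable under swapping the order.
    return forward_winner if forward_winner == reverse_winner else "tie"


def always_first(x: str, y: str) -> str:
    """Toy judge that is pure position bias: it always picks whatever is shown first."""
    return "first"


def prefers_longer(x: str, y: str) -> str:
    """Toy judge with a content-based preference that ignores position."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


print(debiased_preference(always_first, "short", "a much longer answer"))    # prints tie
print(debiased_preference(prefers_longer, "short", "a much longer answer"))  # prints b
```

Treating disagreement as a tie is a design choice in this sketch; averaging scores across the two orders is another option.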

Reproducing the paper's numbers

Every figure and reported number regenerates from a committed script against committed data.

# 1. Set up the environment
uv sync                       # uses pyproject.toml + uv.lock

# 2. Reproduce figures (output goes to paper/figures/)
uv run python paper/figures/fig1_taxonomy.py
uv run python paper/figures/fig2_dimension_heatmap.py
uv run python paper/figures/fig3_effect_sizes.py
# ... etc

# 3. Reproduce prose numbers (output JSON in paper/analysis/)
uv run python paper/analysis/objective_detection.py
uv run python paper/analysis/art_study_pairwise.py
uv run python paper/analysis/v2_followthrough.py
uv run python paper/analysis/v2_conversation_length.py

# 4. Build the paper
cd paper && make pdf

The analysis and figure scripts read from `data/experiments/` and make no API calls. To re-run the experiments themselves, which requires API credentials, copy `.env.example` to `.env`, fill in keys for Anthropic / OpenAI / DeepSeek / xAI / Google, then drive the runner from `src/crit_thinking/scripts/`.
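Before launching an experiment re-run, it can help to confirm that `.env` has a value for every provider. The following is a minimal sketch of such a check. The variable names in `REQUIRED_KEYS` are assumptions for illustration; the authoritative names are whatever `.env.example` lists. The helper functions are hypothetical and not part of the `crit_thinking` package.

```python
from pathlib import Path

# Assumed variable names, for illustration only; check .env.example for the real ones.
REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "XAI_API_KEY",
    "GOOGLE_API_KEY",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required):
    """Return the required names that are absent or have an empty value."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(text, REQUIRED_KEYS)
    if missing:
        print("Missing or empty keys:", ", ".join(missing))
    else:
        print("All expected keys are set.")
```

Run from the repository root, this prints which of the assumed keys still need to be filled in; it never prints the key values themselves.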

Requirements

Python 3.11+ · uv for environment management
For paper rebuild: TeX Live 2025 or equivalent (the paper uses newtx, fvextra, cleveref, tikz, and standard ML-paper packages)
For experiment re-runs: API access to DeepSeek (focal, primary) and optionally Grok (focal, replication), Anthropic (user-LLM), OpenAI (judge)

Citation

@misc{bhat2026samequestion,
  title  = {Same Question, Different Answer: Latent Quality in LLMs
            Under Critical Engagement},
  author = {Bhat, Kartik G},
  year   = {2026},
  url    = {https://github.com/kar-ganap/crit-thinking}
}

License

MIT — see LICENSE. This covers code, prompts, conversation transcripts, and the paper LaTeX source. Use freely with attribution.

Contact

Kartik G Bhat · gkartik@gmail.com
