An early LLM evaluation sandbox for comparing how different model providers handle structured data correction and self-evaluation tasks.
The project runs prompts through multiple providers, extracts JSON-like outputs, repairs malformed responses when possible, and records scores for later comparison. It was built to explore practical failure modes in LLM pipelines: invalid JSON, inconsistent scoring, brittle formatting, and evaluator disagreement.
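A minimal sketch of the loop each notebook runs; the `client` and `judge` objects, the prompts, and the field names are illustrative rather than the repo's actual API:

```python
import json

def run_case(client, judge, record: dict) -> dict:
    """Ask one model to correct a record, then have a second model score it.

    `client` and `judge` are assumed to expose a `complete(prompt) -> str`
    method; the repo wraps each provider's SDK behind something similar.
    """
    prompt = "Correct the following record and return only JSON:\n" + json.dumps(record)
    raw = client.complete(prompt)

    # Parse the model output; a fuller repair step is sketched further down.
    try:
        corrected = json.loads(raw)
        parse_error = None
    except json.JSONDecodeError as exc:
        corrected, parse_error = None, str(exc)

    # A second model acts as the evaluator and returns a 0-10 score.
    score = None
    if corrected is not None:
        verdict = judge.complete(
            "Rate this correction from 0 to 10 and return only the number:\n"
            + json.dumps({"original": record, "corrected": corrected})
        )
        try:
            score = float(verdict.strip())
        except ValueError:
            pass  # scoring failures are recorded as None rather than raised

    return {
        "input": record,
        "raw_output": raw,
        "corrected": corrected,
        "parse_error": parse_error,
        "score": score,
    }
```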
- Prompting LLMs to correct structured records
- Using a second model to evaluate generated corrections
- Recovering malformed JSON responses from model output (see the repair sketch after this list)
- Comparing model performance across providers and datasets
- Writing repeatable evaluation artifacts to JSON and CSV
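
The JSON recovery step roughly amounts to stripping code fences, slicing out the outermost braces, and deleting trailing commas before giving up. A hedged sketch of that kind of helper (the function name and regexes are illustrative):

```python
import json
import re

def extract_json(raw: str):
    """Best-effort recovery of a JSON object from free-form model output.

    Returns (parsed_object, error_message); exactly one of the two is None.
    """
    text = raw.strip()

    # Drop Markdown code fences such as ```json ... ```
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text, flags=re.IGNORECASE)

    # If the model wrapped the object in prose, keep only the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]

    # Common minor fault: trailing commas before } or ]
    text = re.sub(r",\s*([}\]])", r"\1", text)

    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"unrepairable output: {exc}"
```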
- OpenAI-style clients
- Anthropic-style clients
- Gemini-style experiments
- Local/open-model experiments (Llama- and Mistral-style model folders)
- Python
- Jupyter notebooks
- Provider-specific LLM client wrappers (sketched below)
- JSON/CSV evaluation outputs
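
The client wrappers exist so the pipeline can swap providers without touching the evaluation code. A rough sketch assuming the current `openai` and `anthropic` Python SDKs; the repo's actual wrapper classes and default model names may differ:

```python
class OpenAIClient:
    """Thin wrapper so every provider exposes the same complete() method."""

    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI  # assumes the openai>=1.x SDK
        self._client = OpenAI()
        self._model = model

    def complete(self, prompt: str) -> str:
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


class AnthropicClient:
    """Same interface over the Anthropic SDK."""

    def __init__(self, model: str = "claude-3-haiku-20240307"):
        import anthropic
        self._client = anthropic.Anthropic()
        self._model = model

    def complete(self, prompt: str) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```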
This is a research/prototyping repo rather than a polished product. The useful part is the evaluation pattern: run model output through a repeatable pipeline, repair what can be repaired, and make failures visible instead of manually inspecting every response.
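
Concretely, each run is dumped to JSON for full detail and to CSV for a quick summary, with parse failures surfacing as a status column instead of a silent gap. A sketch of that last step (column and field names are illustrative):

```python
import csv
import json

def write_artifacts(results: list[dict], stem: str = "eval_run") -> None:
    """Persist one evaluation run as both JSON (full detail) and CSV (summary)."""
    with open(f"{stem}.json", "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, ensure_ascii=False)

    with open(f"{stem}.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["provider", "score", "status"])
        for row in results:
            status = "ok" if row.get("parse_error") is None else "parse_failed"
            writer.writerow([row.get("provider"), row.get("score"), status])
```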