humaneval

Here are 38 public repositories matching this topic...

bin123apple / AutoCoder

A new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024) and GPT-4o.

nlp text-generation code-generation nlp-machine-learning humaneval llm code-interpreter

Updated Jul 6, 2024
Python

the-crypt-keeper / can-ai-code

Self-evaluating interview for AI coders

ai transformers humaneval llm langchain llama-cpp ggml

Updated Jun 21, 2025
Python

abacaj / code-eval

Run evaluations on LLMs using the human-eval benchmark

humaneval wizardcoder

Updated Sep 12, 2023
Python

SkyWorkAIGC / SkyCode-AI-CodeX-GPT3

SkyCode is a multilingual open-source programming model that adopts the GPT-3 model architecture. It supports Java, JavaScript, C, C++, Python, Go, shell, and other mainstream programming languages, and can understand Chinese comments. The model can complete code and has strong problem-solving ability, freeing you from routine programming so you can focus on more important problems.

javascript python java go shell openai deepmind codex alphacode gpt-3 gpt3 gpt-neo humaneval polycoder codeparrot

Updated Mar 2, 2023

zorse-project / COBOLEval

Evaluate LLM-generated COBOL

evaluation cobol humaneval llm

Updated May 9, 2024
Python

declare-lab / LLM-ReasoningTest

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

reasoning humaneval gsm8k

Updated Apr 21, 2025
Python

nerdskingcom / gguf-humaneval-benchmark

A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.

benchmark humaneval llamacpp llama-cpp gguf

Updated Jan 16, 2026
Python

bjdbjd / human-eval-testbed

HumanEval model evaluation tool

ai test openai exam humaneval llm anthropic

Updated Apr 28, 2026
Python

Dan23RR / snc-core

Behavioral Trust Clustering: a thermodynamic governance layer that reduces LLM hallucination by 52% on HumanEval. Drop-in wrapper for any decoder. MIT licensed.

abstention openai-api selective-prediction humaneval llm ollama qwen hallucination-mitigation trust-calibration regulated-ai behavioral-clustering

Updated May 4, 2026
Python

JackYoung27 / s0-tuning

S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

lora mamba hybrid-model fine-tuning state-space-model peft humaneval qwen gated-delta-net recurrent-state

Updated Apr 8, 2026
Python

abhaymundhara / llm-benchmark-suite

Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.

python benchmark evaluation gemini openai code-generation claude streamlit humaneval llm ollama swe-bench mbpp bigcodebench

Updated Apr 23, 2026
Python

he-yufeng / LiteBench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.

python agent cli benchmark evaluation humaneval llm gsm8k mmlu litellm

Updated May 12, 2026
Python

Miaoge-Ge / llm-eval-framework

A lightweight, configuration-driven evaluation framework for LLM code generation & reasoning tasks (MBPP, HumanEval, GSM8K). Supports multi-provider (DeepSeek, OpenAI, ZhipuAI) and concurrent execution.

benchmark evaluation humaneval llm gsm8k mbpp

Updated May 27, 2026
Python

OpenMLRL / LLM_Collab_Code_Generation

LLM Collaboration for Code Generation

code-generation multi-agent-systems multi-agent-reinforcement-learning humaneval large-language-models code-agent mbpp comlrl openmlrl coophumaneval

Updated May 30, 2026
Python

arcxteam / fortytwo-node

Fortytwo Network Node: Building AI on Monad

machine-learning ai monad rust-lang testnet ai-agents huggingface-models fortytwo humaneval llm-inference testnet-node monad-testnet node-operator noderunning rust-dataset swarm-inference

Updated Dec 8, 2025
Shell

aws-samples / sample-claude-code-multi-model

Run Claude Code with any foundation model on Amazon Bedrock (43 models) or a self-hosted model on EC2. Includes a HumanEval benchmark.

ec2 bedrock claude humaneval ollama amazon-bedrock litellm claude-code ai-coding-assistant coding-agent

Updated Jun 5, 2026
Shell

chengjun-xu / ai-eval-platform

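Large-model evaluation platform with a three-way backend (local/API/HuggingFace/OpenCompass). Supports data production (Self-Instruct/Evol-Instruct), long-tail scenario generation, weakness mining, regression analysis, contamination detection, and bad-case attribution. Includes an extensible benchmark system and LLM-as-Judge automatic scoring.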

python flask humaneval ai-evaluation gsm8k mmlu llm-evaluation benchmark-platform rag-evaluation llm-as-judge opencompass llm-benchmark data-contamination-detection

Updated Jun 3, 2026
Python

MrRobotop / evalforge

A multi-method LLM evaluation harness for text-to-SQL & code generation — bootstrapped CIs, calibrated LLM-as-judge, cost/latency Pareto frontiers, and a CI regression gate.

benchmark spider text-to-sql humaneval llmops llm-evaluation

Updated May 10, 2026
Python

sakethyalamanchili / DARWIN-PHOENIX

Co-evolutionary LLM framework where DARWIN (generator) and PHOENIX (adversary) battle to produce antifragile code. Introduces behavioral fingerprinting — drift predicts degradation (ρ=0.720).

python research agents humaneval llm behavioral-fingerprinting openrouter langgraph qwen3 adversarial-testing

Updated May 1, 2026
Python

ognjenvujovic04 / privacy-utility-trade-off

Privacy-utility trade-off analysis for AI code completion using AST-based obfuscation techniques on HumanEval dataset. Measures how variable renaming and comment stripping affect CodeGen model output quality via CodeBLEU and Levenshtein distance metrics.

machine-learning privacy code-completion code-obfuscation humaneval

Updated Jan 9, 2026
Jupyter Notebook
