We introduced a new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024) and GPT-4o.
Updated Jul 6, 2024 - Python
SkyCode is a multilingual open-source code LLM that adopts the GPT-3 model architecture. It supports Java, JavaScript, C, C++, Python, Go, shell, and other mainstream programming languages, and can understand Chinese comments. The model can complete code and has strong problem-solving ability, freeing you from routine programming so you can focus on more important problems.
Behavioral Trust Clustering: a thermodynamic governance layer that reduces LLM hallucination by 52% on HumanEval. Drop-in wrapper for any decoder. MIT licensed.
S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Benchmark suite for evaluating LLMs and SLMs on coding and SE tasks. Features HumanEval, MBPP, SWE-bench, and BigCodeBench with an interactive Streamlit UI. Supports cloud APIs (OpenAI, Anthropic, Google) and local models via Ollama. Tracks pass rates, latency, token usage, and costs.
LLM Collaboration for Code Generation
Fortytwo Network Node Building AI on Monad
Run Claude Code with any foundation model on Amazon Bedrock (43 models) or a self-hosted model on EC2. Includes a HumanEval benchmark.
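LLM evaluation platform with a three-route backend (local/API/HuggingFace/OpenCompass). Supports data production (Self-Instruct/Evol-Instruct), long-tail scenario generation, weakness mining, regression analysis, contamination detection, and bad-case attribution. Includes an extensible benchmark system and LLM-as-Judge automatic scoring.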
A multi-method LLM evaluation harness for text-to-SQL & code generation — bootstrapped CIs, calibrated LLM-as-judge, cost/latency Pareto frontiers, and a CI regression gate.
Co-evolutionary LLM framework where DARWIN (generator) and PHOENIX (adversary) battle to produce antifragile code. Introduces behavioral fingerprinting — drift predicts degradation (ρ=0.720).
Privacy-utility trade-off analysis for AI code completion using AST-based obfuscation techniques on HumanEval dataset. Measures how variable renaming and comment stripping affect CodeGen model output quality via CodeBLEU and Levenshtein distance metrics.
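As a rough illustration of the AST-based obfuscation idea in the last entry, the sketch below renames bound variables, drops comments, and measures Levenshtein distance using only the Python standard library. It is not taken from that repository: the `obfuscate` and `levenshtein` helpers and the `v0, v1, ...` naming scheme are assumptions made for this example, and CodeBLEU scoring is left out.

```python
import ast


def obfuscate(source: str) -> str:
    """Rename arguments and assigned variables; comments vanish on unparse.

    Hypothetical helper for illustration only. Requires Python 3.9+ for
    ast.unparse. Function names, attributes, imports, and keyword-argument
    names at call sites are left untouched, so code that calls a renamed
    parameter by keyword would break.
    """
    tree = ast.parse(source)

    # Pass 1: collect names that are bound as arguments or assignment targets.
    bound = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in bound:
            bound.append(name)
    mapping = {old: f"v{i}" for i, old in enumerate(bound)}

    # Pass 2: rewrite every occurrence of a collected name.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    # Comments are not part of the AST, so unparsing strips them.
    return ast.unparse(tree)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    sample = (
        "def add(first, second):\n"
        "    # sum two numbers\n"
        "    total = first + second\n"
        "    return total\n"
    )
    hidden = obfuscate(sample)
    print(hidden)
    print("edit distance:", levenshtein(sample, hidden))
```

In a study like the one described, the same prompt would be sent to the model in original and obfuscated form, and the two completions compared; here the distance is computed between the source and its obfuscated version only to show the helpers working together.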