Dataset • Technical Blog • Paper (Coming soon)
A long-form response quality evaluation harness for Web3/crypto domain queries. It evaluates multiple AI responses with a judge LLM (Deepseek-V3.1-671B by default) on four key parameters: Relevance, Temporal Relevance, Depth, and Data Consistency.
CryptoAnalystBench evaluates AI responses to crypto/blockchain queries using an automated judge that scores each response on:
- Relevance (1-10): How well does the response address the specific question?
- Temporal Relevance (1-10): How current and timely is the information?
- Depth (1-10): How comprehensive and detailed is the response?
- Data Consistency (1-10): How consistent and contradiction-free is the information?
The system generates evaluation reports including per-model statistics, tag-wise rankings, and comparative analysis.
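For orientation, below is a minimal sketch of what a single judge call and score aggregation might look like. It assumes a Fireworks OpenAI-compatible endpoint; the base URL, model ID, rubric prompt, JSON schema, and the `judge` helper are all illustrative, not the harness's actual implementation (which lives in `script.py`).

```python
# Illustrative sketch only -- the real prompt, model ID, and parsing live in script.py.
import json
import os

from openai import OpenAI  # Fireworks exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

RUBRIC = (
    "Score the response to the query on four criteria, each 1-10: "
    "relevance, temporal_relevance, depth, data_consistency. "
    "Reply with a JSON object containing exactly those four keys."
)

def judge(query: str, response: str) -> dict:
    """Ask the judge LLM for the four rubric scores (hypothetical helper)."""
    completion = client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",  # placeholder model ID
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Query:\n{query}\n\nResponse:\n{response}"},
        ],
        temperature=0.0,
    )
    # Assumes the judge returns bare JSON; a robust version would validate/repair this.
    return json.loads(completion.choices[0].message.content)

scores = judge("What's the Bitcoin fear and greed index today?", "example model answer")
overall = sum(scores.values()) / len(scores)  # simple unweighted average
```

The actual harness scores several model responses per query in parallel workers, so a production version would batch such calls rather than issuing them one at a time.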
The benchmark dataset contains 198 queries across 11 unique categories:
| Tag | Count |
|---|---|
| Project & Fundamental Research | 36 |
| Market Data & Price Discovery | 34 |
| On-Chain Analytics & Flows | 33 |
| Macro & Narrative Context | 23 |
| Trading & Strategy Design | 19 |
| Crypto Concepts & How-To | 17 |
| Comparative & Performance Analysis | 13 |
| Meme Coins | 10 |
| Security & Risks | 10 |
| NFTs | 2 |
| Default / General Analysis | 1 |
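The distribution above can be reproduced directly from the dataset. A quick check with pandas, assuming the category labels are stored in a `tags` column (the same column name used for the optional input field described later):

```python
# Reproduce the tag distribution table (assumes a `tags` column in the dataset).
import pandas as pd

df = pd.read_csv("data/dataset.csv")
print(df["tags"].value_counts())
print("Total queries:", len(df))  # expected: 198
```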
- Set up virtual environment:

  ```bash
  python3.12 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env-example .env
  ```

  Edit `.env` and add your API key:

  ```bash
  export FIREWORKS_API_KEY="your_api_key_here"
  ```

- Prepare your input file (the required format and a generation sketch appear in the input-format section below):
  - Use `data/dataset.csv` for queries
  - Generate one input file with AI responses for each query
  - Place the input file under `data/input/`

- Run the evaluation:

  ```bash
  python3.12 script.py --csv_path data/input/your_input_file.csv --models model1 model2 model3
  ```

  Example:

  ```bash
  python3.12 script.py --csv_path data/input/sample_input.csv --models sentient gpt5 grok4 pplx
  ```

  Optional arguments:
  - `--num_workers`: Number of parallel workers (default: 3)
  - `--max_queries`: Maximum number of queries to evaluate (default: all)

- Output: The evaluation generates an XLSX file in `data/output/` with the following sheets (a loading sketch follows this list):
  - Evaluation Results (detailed scores and rankings)
  - Per-Model Statistics (aggregate metrics)
  - Tag-wise Rankings (performance by query category)
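Once a run finishes, the report can also be inspected programmatically. A minimal sketch with pandas, assuming a hypothetical output filename and the sheet names listed above (adjust both to match the actual file):

```python
# Inspect the generated XLSX report (filename and sheet names are assumptions).
import pandas as pd

report = pd.read_excel(
    "data/output/evaluation_report.xlsx",  # hypothetical file name
    sheet_name=None,                       # load every sheet into a dict of DataFrames
)
for name, sheet in report.items():
    print(name, sheet.shape)

per_model = report.get("Per-Model Statistics")  # aggregate metrics, if the sheet uses this name
```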
The input CSV file must contain:

- Required columns:
  - `query`: The crypto/blockchain question to evaluate
  - `{model_name}_response`: One response column per model (e.g., `sentient_response`, `gpt5_response`, `grok4_response`)
- Optional columns:
  - `tags`: Category tags for the query (e.g., "Macro & Narrative Context", "Comparative & Performance Analysis")
Example structure:
| query | tags | sentient_response | gpt5_response | grok4_response | pplx_response |
|---|---|---|---|---|---|
| What's the Bitcoin fear and greed index today? | Macro & Narrative Context | Response from Sentient... | Response from GPT5... | Response from Grok4... | Response from PPLX... |
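One way to produce a file in this format is to iterate over the benchmark queries and fill in one response column per model. A rough sketch, assuming `data/dataset.csv` carries `query` and `tags` columns and using a placeholder `generate_response` helper in place of your actual model calls:

```python
# Build an input CSV skeleton from the benchmark queries.
# `generate_response` is a placeholder for however you call each model under test.
import pandas as pd

def generate_response(model: str, query: str) -> str:
    # Placeholder: replace with a real call to the model under test.
    return f"[{model} response to: {query}]"

queries = pd.read_csv("data/dataset.csv")  # assumed to provide `query` (and `tags`, if present)
models = ["sentient", "gpt5", "grok4", "pplx"]

rows = []
for _, row in queries.iterrows():
    record = {"query": row["query"], "tags": row.get("tags", "")}
    for model in models:
        record[f"{model}_response"] = generate_response(model, row["query"])
    rows.append(record)

pd.DataFrame(rows).to_csv("data/input/my_input_file.csv", index=False)
```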
TL;DR. Use `data/dataset.csv` for queries, generate one input file with LLM responses, place it under `data/input/`, and you are good to go!

