CryptoAnalystBench


Dataset • Technical Blog • Paper (Coming soon)

Long-form response quality evaluation harness for Web3/crypto domain queries. It evaluates multiple AI responses with a judge LLM (Deepseek-V3.1-671B by default) on four key parameters: Relevance, Temporal Relevance, Depth, and Data Consistency.

Evaluation Summary

CryptoAnalystBench evaluates AI responses to crypto/blockchain queries using an automated judge that scores each response on:

  • Relevance (1-10): How well does the response address the specific question?
  • Temporal Relevance (1-10): How current and timely is the information?
  • Depth (1-10): How comprehensive and detailed is the response?
  • Data Consistency (1-10): How consistent and contradiction-free is the information?
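
For orientation, a single judge verdict can be thought of as four integer scores per response. A minimal Python sketch of that shape follows; the field names and the unweighted mean are illustrative assumptions, not the exact schema script.py uses.

# Illustrative shape of one judge verdict; field names and the
# aggregation rule are assumptions, not the exact schema in script.py.
from dataclasses import dataclass

@dataclass
class JudgeScores:
    relevance: int           # 1-10: addresses the specific question
    temporal_relevance: int  # 1-10: currency/timeliness of the information
    depth: int               # 1-10: comprehensiveness and detail
    data_consistency: int    # 1-10: internal consistency, no contradictions

    def mean_score(self) -> float:
        # Unweighted mean across the four dimensions (assumption).
        return (self.relevance + self.temporal_relevance
                + self.depth + self.data_consistency) / 4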

The system generates evaluation reports including per-model statistics, tag-wise rankings, and comparative analysis.

Category distribution of CryptoAnalystBench queries

The benchmark dataset contains 198 queries across 11 unique categories:

Query Distribution

Tag | Count
--- | ---
Project & Fundamental Research | 36
Market Data & Price Discovery | 34
On-Chain Analytics & Flows | 33
Macro & Narrative Context | 23
Trading & Strategy Design | 19
Crypto Concepts & How-To | 17
Comparative & Performance Analysis | 13
Meme Coins | 10
Security & Risks | 10
NFTs | 2
Default / General Analysis | 1
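
To recompute this distribution yourself, a minimal sketch (assuming the tags column in data/dataset.csv holds one category label per row):

import pandas as pd

# Count queries per category; assumes a single tag per row in `tags`.
df = pd.read_csv("data/dataset.csv")
print(df["tags"].value_counts())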

How to Run

  1. Set up a virtual environment:
python3.12 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
  2. Configure environment variables:
cp .env-example .env

Edit .env and add your API key:

export FIREWORKS_API_KEY="your_api_key_here"
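
You can sanity-check that the key is visible to Python before launching a run; this assumes you exported it in the current shell (e.g., via source .env):

import os

# Fail fast if the key was not exported in this shell.
if not os.environ.get("FIREWORKS_API_KEY"):
    raise SystemExit("FIREWORKS_API_KEY is not set; run `source .env` first.")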
  3. Prepare your input file:

    • Use data/dataset.csv for queries
    • Generate a single input file containing each model's response to every query (a sketch follows below)
    • Place the input file under data/input/
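
A minimal sketch for building such an input file; call_model is a hypothetical stand-in for whatever client code you use to query each model:

import pandas as pd

def call_model(model: str, query: str) -> str:
    # Hypothetical placeholder: replace with your own API client per model.
    raise NotImplementedError

models = ["sentient", "gpt5", "grok4", "pplx"]
df = pd.read_csv("data/dataset.csv")  # must contain a `query` column
for model in models:
    # One response column per model, named {model}_response as required.
    df[f"{model}_response"] = df["query"].apply(lambda q: call_model(model, q))
df.to_csv("data/input/my_input_file.csv", index=False)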
  4. Run the evaluation:

python3.12 script.py --csv_path data/input/your_input_file.csv --models model1 model2 model3

Example:

python3.12 script.py --csv_path data/input/sample_input.csv --models sentient gpt5 grok4 pplx

Optional arguments:

  • --num_workers: Number of parallel workers (default: 3)
  • --max_queries: Maximum number of queries to evaluate (default: all)
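
For example, a quick smoke test over 20 queries with extra parallelism:

python3.12 script.py --csv_path data/input/sample_input.csv --models sentient gpt5 grok4 pplx --num_workers 5 --max_queries 20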
  5. Output: The evaluation generates an XLSX file in data/output/ with:
    • Evaluation Results (detailed scores and rankings)
    • Per-Model Statistics (aggregate metrics)
    • Tag-wise Rankings (performance by query category)
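
To inspect the workbook programmatically, a minimal sketch (the output filename below is a placeholder, and you should verify the actual sheet names in your generated file):

import pandas as pd

# Placeholder filename; use the XLSX your run wrote to data/output/.
xlsx = pd.ExcelFile("data/output/your_results.xlsx")
print(xlsx.sheet_names)  # confirm the actual sheet names first
results = xlsx.parse(xlsx.sheet_names[0])  # detailed scores and rankings
print(results.head())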

Expected Input File Format

The input CSV file must contain:

  • Required columns:

    • query: The crypto/blockchain question to evaluate
    • {model_name}_response: Response column for each model (e.g., sentient_response, gpt5_response, grok4_response)
  • Optional columns:

    • tags: Category tags for the query (e.g., "Macro & Narrative Context", "Comparative & Performance Analysis")

Example structure:

query | tags | sentient_response | gpt5_response | grok4_response | pplx_response
--- | --- | --- | --- | --- | ---
What's the Bitcoin fear and greed index today? | Macro & Narrative Context | Response from Sentient... | Response from GPT5... | Response from Grok4... | Response from PPLX...
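
A small validation sketch to catch format errors before launching an evaluation (the model names reuse the hypothetical ones from the example above):

import pandas as pd

models = ["sentient", "gpt5", "grok4", "pplx"]
df = pd.read_csv("data/input/your_input_file.csv")  # placeholder path

required = ["query"] + [f"{m}_response" for m in models]
missing = [c for c in required if c not in df.columns]
if missing:
    raise SystemExit(f"Input file is missing required columns: {missing}")
if "tags" not in df.columns:
    print("Note: optional `tags` column not found; tag-wise rankings may be unavailable.")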

TL;DR: Use data/dataset.csv for queries, generate one input file with each model's responses, place it under data/input/, and you are good to go.
