Dataset • Technical Blog • Paper (Coming soon)
A long-form response quality evaluation harness for Web3/crypto domain queries. It evaluates multiple AI responses with a judge LLM (Deepseek-V3.1-671B by default) on four key parameters: Relevance, Temporal Relevance, Depth, and Data Consistency.
CryptoAnalystBench evaluates AI responses to crypto/blockchain queries using an automated judge that scores each response on:
- Relevance (1-10): How well does the response address the specific question?
- Temporal Relevance (1-10): How current and timely is the information?
- Depth (1-10): How comprehensive and detailed is the response?
- Data Consistency (1-10): How consistent and contradiction-free is the information?
The system generates evaluation reports including per-model statistics, tag-wise rankings, and comparative analysis.
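For orientation, below is a minimal sketch of what a single judge call and score aggregation might look like. It assumes a Fireworks OpenAI-compatible endpoint; the base URL, model ID, rubric prompt, JSON schema, and the `judge` helper are all illustrative, not the harness's actual implementation (which lives in `script.py`).

```python
# Illustrative sketch only -- the real prompt, model ID, and parsing live in script.py.
import json
import os

from openai import OpenAI  # Fireworks exposes an OpenAI-compatible API

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

RUBRIC = (
    "Score the response to the query on four criteria, each 1-10: "
    "relevance, temporal_relevance, depth, data_consistency. "
    "Reply with a JSON object containing exactly those four keys."
)

def judge(query: str, response: str) -> dict:
    """Ask the judge LLM for the four rubric scores (hypothetical helper)."""
    completion = client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-v3p1",  # placeholder model ID
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Query:\n{query}\n\nResponse:\n{response}"},
        ],
        temperature=0.0,
    )
    # Assumes the judge returns bare JSON; a robust version would validate/repair this.
    return json.loads(completion.choices[0].message.content)

scores = judge("What's the Bitcoin fear and greed index today?", "example model answer")
overall = sum(scores.values()) / len(scores)  # simple unweighted average
```

The actual harness scores several model responses per query in parallel workers, so a production version would batch such calls rather than issuing them one at a time.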
The benchmark dataset contains 198 queries across 11 unique categories:
| Tag | Count |
|---|---|
| Project & Fundamental Research | 36 |
| Market Data & Price Discovery | 34 |
| On-Chain Analytics & Flows | 33 |
| Macro & Narrative Context | 23 |
| Trading & Strategy Design | 19 |
| Crypto Concepts & How-To | 17 |
| Comparative & Performance Analysis | 13 |
| Meme Coins | 10 |
| Security & Risks | 10 |
| NFTs | 2 |
| Default / General Analysis | 1 |
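The distribution above can be reproduced directly from the dataset. A quick check with pandas, assuming the category labels are stored in a `tags` column (the same column name used for the optional input field described later):

```python
# Reproduce the tag distribution table (assumes a `tags` column in the dataset).
import pandas as pd

df = pd.read_csv("data/dataset.csv")
print(df["tags"].value_counts())
print("Total queries:", len(df))  # expected: 198
```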
- Set up virtual environment:

  ```bash
  python3.12 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env-example .env
  ```

  Edit `.env` and add your API key:

  ```bash
  export FIREWORKS_API_KEY="your_api_key_here"
  ```

- Prepare your input file (the required format and a generation sketch appear in the input-format section below):
  - Use `data/dataset.csv` for queries
  - Generate one input file with AI responses for each query
  - Place the input file under `data/input/`

- Run the evaluation:

  ```bash
  python3.12 script.py --csv_path data/input/your_input_file.csv --models model1 model2 model3
  ```

  Example:

  ```bash
  python3.12 script.py --csv_path data/input/sample_input.csv --models sentient gpt5 grok4 pplx
  ```

  Optional arguments:
  - `--num_workers`: Number of parallel workers (default: 3)
  - `--max_queries`: Maximum number of queries to evaluate (default: all)

- Output: The evaluation generates an XLSX file in `data/output/` with the following sheets (a loading sketch follows this list):
  - Evaluation Results (detailed scores and rankings)
  - Per-Model Statistics (aggregate metrics)
  - Tag-wise Rankings (performance by query category)
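Once a run finishes, the report can also be inspected programmatically. A minimal sketch with pandas, assuming a hypothetical output filename and the sheet names listed above (adjust both to match the actual file):

```python
# Inspect the generated XLSX report (filename and sheet names are assumptions).
import pandas as pd

report = pd.read_excel(
    "data/output/evaluation_report.xlsx",  # hypothetical file name
    sheet_name=None,                       # load every sheet into a dict of DataFrames
)
for name, sheet in report.items():
    print(name, sheet.shape)

per_model = report.get("Per-Model Statistics")  # aggregate metrics, if the sheet uses this name
```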
The input CSV file must contain:

- Required columns:
  - `query`: The crypto/blockchain question to evaluate
  - `{model_name}_response`: One response column per model (e.g., `sentient_response`, `gpt5_response`, `grok4_response`)
- Optional columns:
  - `tags`: Category tags for the query (e.g., "Macro & Narrative Context", "Comparative & Performance Analysis")
Example structure:
| query | tags | sentient_response | gpt5_response | grok4_response | pplx_response |
|---|---|---|---|---|---|
| What's the Bitcoin fear and greed index today? | Macro & Narrative Context | Response from Sentient... | Response from GPT5... | Response from Grok4... | Response from PPLX... |
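One way to produce a file in this format is to iterate over the benchmark queries and fill in one response column per model. A rough sketch, assuming `data/dataset.csv` carries `query` and `tags` columns and using a placeholder `generate_response` helper in place of your actual model calls:

```python
# Build an input CSV skeleton from the benchmark queries.
# `generate_response` is a placeholder for however you call each model under test.
import pandas as pd

def generate_response(model: str, query: str) -> str:
    # Placeholder: replace with a real call to the model under test.
    return f"[{model} response to: {query}]"

queries = pd.read_csv("data/dataset.csv")  # assumed to provide `query` (and `tags`, if present)
models = ["sentient", "gpt5", "grok4", "pplx"]

rows = []
for _, row in queries.iterrows():
    record = {"query": row["query"], "tags": row.get("tags", "")}
    for model in models:
        record[f"{model}_response"] = generate_response(model, row["query"])
    rows.append(record)

pd.DataFrame(rows).to_csv("data/input/my_input_file.csv", index=False)
```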
TL;DR. Use `data/dataset.csv` for queries, generate one input file with LLM responses, place it under `data/input/`, and you are good to go!

