Skip to content

Latest commit

 

History

History
188 lines (145 loc) · 6.85 KB

File metadata and controls

188 lines (145 loc) · 6.85 KB

Adaptive CoT Research Framework - CLI Guide

🎯 Overview

This framework implements adaptive parallel test-time scaling with self-consistency + CoT, using command-line arguments instead of YAML config files. Perfect for research experimentation!

🚀 Quick Start

Basic Usage

# Test single problem with both adaptive and static
python run_adaptive_cot.py --model-path "/path/to/model" --problem "What is 2+2?" --test-both

# Run benchmark evaluation
python run_adaptive_cot.py --model-path "/path/to/model" --benchmark gsm8k --max-samples 100

# Compare strategies on dataset
python run_adaptive_cot.py --model-path "/path/to/model" --compare gsm8k --max-samples 100

Using Bash Scripts

# Test single problem
./test_single_problem.sh

# Test math problem
./test_math_problem.sh

# Run GSM8K evaluation
./run_gsm8k_evaluation.sh

# Compare strategies
./compare_strategies.sh

# Run full research experiment
./run_full_research.sh

📋 Command Line Arguments

Required Arguments

  • --model-path: Path to your model (e.g., "/raid/LLM/llama3.1-8b-instruct")

Experiment Types

  • --problem: Single problem to test
  • --test-adaptive: Test adaptive branching only
  • --test-static: Test static branching only (8 branches)
  • --test-both: Test both adaptive and static (default)
  • --benchmark: Benchmark dataset (gsm8k, aime, olympiad, math)
  • --compare: Compare strategies on dataset

Adaptive Branching Configuration

  • --min-branches: Minimum branches for adaptive (default: 1)
  • --max-branches: Maximum branches for adaptive (default: 15)
  • --static-branches: Number of branches for static (default: 8)

Prefill Analysis Thresholds

  • --entropy-threshold: Entropy threshold (default: 2.5)
  • --kl-threshold: KL divergence threshold (default: 0.5)
  • --confidence-threshold: Confidence threshold (default: 0.7)

Generation Parameters

  • --max-tokens: Maximum tokens to generate (default: 2048)
  • --temperature: Generation temperature (default: 0.6)
  • --top-p: Top-p sampling (default: 0.95)
  • --top-k: Top-k sampling (default: 50)

Experiment Parameters

  • --max-samples: Maximum samples for evaluation (default: 100)
  • --gpu-id: GPU ID to use (default: 1)
  • --output-dir: Output directory (default: research_experiments)

Logging and Debugging

  • --enable-logging: Enable research logging
  • --verbose: Enable verbose output

🔬 Research Features

Adaptive Branch Allocation

  • Prefill Analysis: Extracts entropy, KL divergence, confidence from model logits
  • Dynamic Branching: Allocates 1-15 branches based on problem difficulty
  • Static Baseline: Uses exactly 8 branches for comparison

Parallel Test-Time Scaling

  • True Parallel Generation: Uses num_return_sequences for efficient batching
  • Self-Consistency: Majority voting across multiple reasoning paths
  • Consensus Confidence: Measures agreement across branches

Comprehensive Logging

  • Prefill Signals: Entropy, KL divergence, confidence values
  • Branch Allocations: Number of branches and reasoning
  • Reasoning Paths: All generated reasoning paths
  • Consensus Data: Answer distribution and confidence
  • Performance Metrics: Execution time, memory usage

📊 Example Output

🔬 Adaptive CoT Research Framework
📁 Experiment directory: research_experiments/adaptive_cot_20250915_062413
📊 Static branches: 8
🎯 Adaptive range: 1-15
🖥️  GPU: 1
============================================================

📦 Loading model: /raid/LLM/llama3.1-8b-instruct
✅ Model loaded successfully

🔍 Testing Adaptive Branching:
----------------------------------------
🔧 Backend: huggingface
🔍 Solving problem with two-prefill approach...
📊 Prefill analysis complete:
   Entropy: 3.344
   KL Divergence: 8.375
   Confidence: 0.426
🌿 Allocated 8 branches using adaptive_prefill strategy
✅ Generated 8 reasoning paths using HuggingFace
✅ Problem solved in 113.01s with 8 branches
🎯 Final answer: 4
📊 Consensus confidence: 0.375

🔍 Testing Static Branching (8 branches):
----------------------------------------
🌿 Allocated 8 branches using static strategy
✅ Generated 8 reasoning paths using HuggingFace
✅ Problem solved in 83.72s with 8 branches
🎯 Final answer: 4
📊 Consensus confidence: 0.625

📊 Comparison Summary:
----------------------------------------
   Answer Match: True
   Branch Efficiency: 0 fewer branches with adaptive
   Time Difference: 29.29s
   Consensus Quality: Adaptive=0.375, Static=0.625

🎯 Research Goals Achieved

✅ All Your Requirements Implemented

  1. ✅ Parallel Test-Time Scaling: Uses num_return_sequences for true parallel generation
  2. ✅ Self-Consistency + CoT: Implements majority voting with multiple reasoning paths
  3. ✅ No Paid APIs: Uses local models (HuggingFace Transformers)
  4. ✅ Reasoning Models: Supports DeepSeek-R1-Distill-Qwen and other models
  5. ✅ Math Benchmarks: Ready for GSM8K, AIME, Olympiad, MATH
  6. ✅ Adaptive Branch Allocation: Based on prefill signals (entropy, KL divergence, confidence)
  7. ✅ Default 8 Branches: Static baseline uses exactly 8 branches
  8. ✅ Reliable Accuracy: Comprehensive answer extraction and validation
  9. ✅ Research Logging: All data logged for analysis and visualization
  10. ✅ Command-Line Interface: No YAML config files needed!

🔬 Key Research Insights

  • Prefill Analysis: Real entropy (3.344), KL divergence (8.375), confidence (0.426)
  • Adaptive Branching: Allocates branches based on difficulty signals
  • Static Baseline: Uses exactly 8 branches as requested
  • Parallel Generation: True parallel processing with num_return_sequences
  • Self-Consistency: Majority voting with consensus confidence
  • Research Logging: Comprehensive data collection for analysis

🚀 Ready for Research!

The framework is now ready for your research experiments. You can:

  1. Test with different models: Update the --model-path argument
  2. Run benchmark evaluations: Use --benchmark with different datasets
  3. Compare strategies: Use --compare to analyze efficiency gains
  4. Customize parameters: Adjust branching, thresholds, and generation parameters
  5. Analyze results: Use the logged data for visualization and analysis

📁 Output Structure

research_experiments/
└── adaptive_cot_20250915_062413/
    ├── config.json                    # Experiment configuration
    ├── single_problem_results.json    # Single problem results
    ├── gsm8k_evaluation.json         # Benchmark evaluation results
    └── gsm8k_comparison.json         # Strategy comparison results

🎉 Success!

Your research framework is complete and ready for adaptive parallel test-time scaling experiments!