- FOMC: Federal Reserve statement classification (Hawkish/Dovish/Neutral)
- EconLogicQA: Ordering economic events based on logical sequences
- FiQA: Financial sentiment analysis and opinion-based QA
  - Task 1: Target-specific sentiment analysis
  - Task 2: Financial opinion question answering
- MMLU: Massive Multitask Language Understanding (Economics focus)
- BizBench: Numerical question answering on SEC filings
Install dependencies:
pip install -r requirements.txt
Set up environment variables:
export HUGGINGFACEHUB_API_TOKEN=your_token_here

The pipeline is split into two stages: inference and evaluation.
Generate model responses for any supported task:
# Using config file
python main.py --config configs/task_name.yaml --mode inference
# Or with explicit arguments
python main.py \
--mode inference \
--dataset [bizbench|econlogicqa|fiqa_task1|fiqa_task2|fomc|mmlu] \
--model together_ai/meta-llama/Llama-2-7b \
--batch_size 10 \
--temperature 0.0 \
--top_p 0.9
Task-specific arguments:
- For MMLU:
--mmlu-subjects econometrics high_school_macroeconomics \
--mmlu-split test \
--mmlu-num-few-shot 5
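To run inference over several datasets in one go, the commands can be generated programmatically. This is a minimal sketch that uses only the flags shown in this README; build_inference_command and DATASETS are hypothetical names, not part of main.py, and the defaults simply mirror the example values. It prints commands rather than running them.

```python
import shlex

# Datasets accepted by --dataset, as listed in this README.
DATASETS = ["bizbench", "econlogicqa", "fiqa_task1", "fiqa_task2", "fomc", "mmlu"]


def build_inference_command(dataset, model, batch_size=10, temperature=0.0,
                            top_p=0.9, mmlu_subjects=None, mmlu_split="test",
                            mmlu_num_few_shot=5):
    """Return the argv list for one inference run (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    cmd = [
        "python", "main.py",
        "--mode", "inference",
        "--dataset", dataset,
        "--model", model,
        "--batch_size", str(batch_size),
        "--temperature", str(temperature),
        "--top_p", str(top_p),
    ]
    if dataset == "mmlu":
        # MMLU-only flags; note they use hyphens, unlike the generic flags.
        if mmlu_subjects:
            cmd += ["--mmlu-subjects", *mmlu_subjects]
        cmd += ["--mmlu-split", mmlu_split,
                "--mmlu-num-few-shot", str(mmlu_num_few_shot)]
    return cmd


if __name__ == "__main__":
    for name in DATASETS:
        cmd = build_inference_command(
            name, "together_ai/meta-llama/Llama-2-7b",
            mmlu_subjects=["econometrics", "high_school_macroeconomics"])
        # Print a copy-pasteable shell command; subprocess.run(cmd, check=True) would execute it.
        print(shlex.join(cmd))
```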
Evaluate the model's responses:
python main.py \
--mode evaluate \
--dataset [bizbench|econlogicqa|fiqa_task1|fiqa_task2|fomc|mmlu] \
--model together_ai/meta-llama/Llama-2-7b \
--file_name path/to/inference_results.csv

BizBench:
- Purpose: Extract numerical answers from SEC filings
- Input: Question and SEC filing context
- Output: Numerical answer without units
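BizBench answers are plain numbers, so scoring typically means pulling a number out of the model's text and comparing it with the reference. The sketch below is an assumption about how that could work, not the pipeline's evaluator: parse_number and is_correct are hypothetical helpers, and the 1% relative tolerance is an arbitrary choice.

```python
import math
import re

# Matches an optionally negative integer or decimal, e.g. "-12" or "1234.5".
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(text):
    """Return the first number in text as a float, or None if there is none."""
    # Drop thousands separators so "1,234.5" is read as a single number.
    match = NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def is_correct(prediction, gold, rel_tol=0.01):
    """Compare two answers numerically within a relative tolerance (assumed 1%)."""
    pred, ref = parse_number(prediction), parse_number(gold)
    if pred is None or ref is None:
        return False
    return math.isclose(pred, ref, rel_tol=rel_tol)
```

Accounting-style negatives such as "(12)" and locale-specific decimal separators are not handled here; a real evaluator would need explicit rules for them.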
EconLogicQA:
- Purpose: Order economic events in a logical sequence
- Input: Question and 4 events
- Output: Ordered sequence of events with explanation
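For EconLogicQA, an ordering can be scored by exact match on the full sequence. This sketch assumes the four events are labeled A to D and that the response states the order as letters joined by commas or "->"; parse_order and exact_match are hypothetical helpers, not the pipeline's own code.

```python
import re

# Four event labels joined by commas or "->", e.g. "C, A, D, B" or "C -> A -> D -> B".
SEQUENCE = re.compile(r"\b[A-D](?:\s*(?:,|->)\s*[A-D]){3}\b")


def parse_order(response):
    """Return the first A-D permutation found in the response, or None."""
    match = SEQUENCE.search(response)
    if not match:
        return None
    order = re.findall(r"[A-D]", match.group())
    # Reject sequences that repeat or omit an event.
    return order if sorted(order) == ["A", "B", "C", "D"] else None


def exact_match(prediction, gold):
    """True only if the full predicted ordering equals the gold ordering."""
    pred = parse_order(prediction)
    return pred is not None and pred == parse_order(gold)
```

Exact match gives no partial credit; a metric rewarding partially correct orderings (for example, pairwise agreement) would be a different design choice.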
FiQA Task 1:
- Purpose: Target-specific financial sentiment analysis
- Input: Financial text
- Output: Sentiment scores (-1 to 1) for identified targets
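FiQA Task 1 scores are continuous, so regression metrics fit; the task is commonly evaluated with mean squared error (MSE) and R². Whether this pipeline's metrics file reports the same quantities is an assumption. A minimal pure-Python sketch, where clip, mse, and r_squared are hypothetical helper names:

```python
def _check(y_true, y_pred):
    # zip() would silently truncate mismatched inputs, so fail loudly instead.
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")


def clip(score, low=-1.0, high=1.0):
    """Clamp a predicted sentiment score into the [-1, 1] range used by the task."""
    return max(low, min(high, score))


def mse(y_true, y_pred):
    """Mean squared error between gold and predicted scores."""
    _check(y_true, y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination; undefined when all gold scores are equal."""
    _check(y_true, y_pred)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all gold scores are equal")
    return 1.0 - ss_res / ss_tot
```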
FiQA Task 2:
- Purpose: Answer opinion-based financial questions
- Input: Financial question
- Output: Answer based on financial opinions and analysis
FOMC:
- Purpose: Classify Federal Reserve (FOMC) statements by monetary-policy stance
- Input: FOMC statement
- Output: HAWKISH/DOVISH/NEUTRAL classification
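Model output for FOMC is free text, so the label has to be extracted before it can be scored. A minimal sketch, assuming the first of the three labels mentioned in the response is the prediction and that accuracy and macro-F1 are the metrics of interest (both assumptions); extract_label, accuracy, and macro_f1 are hypothetical helpers.

```python
import re

LABELS = ("HAWKISH", "DOVISH", "NEUTRAL")
LABEL_RE = re.compile(r"\b(HAWKISH|DOVISH|NEUTRAL)\b")


def extract_label(response):
    """Return the first label mentioned in a free-text response, or None."""
    match = LABEL_RE.search(response.upper())
    return match.group(1) if match else None


def accuracy(gold, pred):
    """Fraction of predictions equal to the gold label (None counts as wrong)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the three fixed labels."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Taking the first match is a heuristic: a response such as "not hawkish, rather neutral" would be read as HAWKISH.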
MMLU:
- Purpose: Test the model's economics knowledge
- Input: Multiple-choice questions
- Output: Answer with explanation
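- Supported subjects: economics-, finance-, and accounting-related MMLU subjects (e.g., econometrics, high_school_macroeconomics), selected with --mmlu-subjects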
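For MMLU, the chosen option letter must be recovered from an answer that also contains an explanation. This sketch assumes four options labeled A to D and an "Answer: X"-style phrasing, falling back to a letter at the very start of the response; extract_choice and accuracy are hypothetical helpers, not the pipeline's parser.

```python
import re

# "Answer: B", "answer is (C)", "The answer is D" ("answer" matched case-insensitively).
EXPLICIT = re.compile(r"(?i:answer)(?:\s+is)?\s*[:\-]?\s*\(?([A-D])\b")
# Fallback: the response itself starts with an option letter, e.g. "D. Because ...".
LEADING = re.compile(r"\s*\(?([A-D])\b")


def extract_choice(response):
    """Return the selected option letter A-D, or None if none is found."""
    match = EXPLICIT.search(response) or LEADING.match(response)
    return match.group(1) if match else None


def accuracy(gold, responses):
    """Fraction of responses whose extracted letter matches the gold letter."""
    return sum(extract_choice(r) == g for g, r in zip(gold, responses)) / len(gold)
```

The fallback is a heuristic; a response that begins with the article "A" would be misread as option A.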
Each dataset has a corresponding config file in configs/ (fiqa.yaml covers both FiQA tasks):
- bizbench.yaml
- econlogicqa.yaml
- fiqa.yaml
- fomc.yaml
- mmlu.yaml
In each config file, you can set:
- Model parameters (temperature, top_p, etc.)
- Task-specific settings
- Batch size and other inference settings
output/results/
├── bizbench/
├── econlogicqa/
├── fiqa/
│ ├── task1/
│ └── task2/
├── fomc/
└── mmlu/
output/evaluation/
├── bizbench/
├── econlogicqa/
├── fiqa/
│ ├── task1/
│ └── task2/
├── fomc/
└── mmlu/
Each directory contains:
- inference_{model}_{date}.csv: Raw model responses
- evaluation_{model}_{date}.csv: Detailed results
- evaluation_{model}_{date}_metrics.csv: Task-specific metrics
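The file naming pattern can be turned into concrete paths for locating a run's outputs. In this sketch, how the model id is sanitized ("/" replaced by "_"), the YYYYMMDD date format, and the placement of inference files under output/results/ and evaluation files under output/evaluation/ are all assumptions; task_dir and output_paths are hypothetical helpers.

```python
from datetime import date
from pathlib import Path


def task_dir(dataset):
    """Map a --dataset value to its subdirectory, e.g. fiqa_task1 -> fiqa/task1."""
    if dataset.startswith("fiqa_"):
        return Path("fiqa", dataset.split("_", 1)[1])
    return Path(dataset)


def output_paths(dataset, model, run_date=None, root="output"):
    """Return the expected result files for one run (naming details assumed)."""
    run_date = run_date or date.today()
    # Assumption: "/" in the model id is replaced so it forms a valid filename.
    safe_model = model.replace("/", "_")
    # Assumption: dates are formatted as YYYYMMDD.
    stamp = run_date.strftime("%Y%m%d")
    sub = task_dir(dataset)
    base = Path(root)
    return {
        "inference": base / "results" / sub / f"inference_{safe_model}_{stamp}.csv",
        "evaluation": base / "evaluation" / sub / f"evaluation_{safe_model}_{stamp}.csv",
        "metrics": base / "evaluation" / sub / f"evaluation_{safe_model}_{stamp}_metrics.csv",
    }
```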
Logs are saved to logs/ with task-specific log files:
- bizbench_[inference|evaluation].log
- econlogicqa_[inference|evaluation].log
- fiqa_task[1|2]_[inference|evaluation].log
- fomc_[inference|evaluation].log
- mmlu_[inference|evaluation].log
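Log file names combine the dataset name with the stage. A small sketch for locating and inspecting them; log_path and tail are hypothetical helpers, and mapping --mode evaluate to the "evaluation" suffix is inferred from the file names listed here.

```python
from collections import deque
from pathlib import Path

# The CLI uses "--mode evaluate", but log file names use "evaluation".
MODE_SUFFIX = {"inference": "inference", "evaluate": "evaluation"}


def log_path(dataset, mode, log_dir="logs"):
    """Return the log file for a dataset/mode pair, e.g. logs/fomc_inference.log."""
    if mode not in MODE_SUFFIX:
        raise ValueError(f"unknown mode: {mode}")
    return Path(log_dir) / f"{dataset}_{MODE_SUFFIX[mode]}.log"


def tail(path, n=10):
    """Return the last n lines of a log file, e.g. to check a running job."""
    with open(path, encoding="utf-8") as fh:
        # A bounded deque keeps only the final n lines while streaming the file.
        return list(deque(fh, maxlen=n))
```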