RoseRahimi/FinTimeGAN-Pipeline

FinTimeGAN

FinTimeGAN is a conditional TimeGAN/FinGAN-style research pipeline for financial return generation and next-step forecasting. It trains TensorFlow models on cached market data, evaluates multiple finance-aware loss combinations, and writes model weights, plots, PnL series, and summary tables under runs/<run_label>/.

The repository currently supports two workflows:

  • single: train and evaluate one ticker at a time.
  • universal: pool training windows from multiple tickers, then evaluate every loss combination across the full ticker set plus a portfolio aggregate.

Highlights

  • Conditional sequence modeling with an RNN embedder, recovery network, generator, and discriminator
  • Optional shared attention mode for both the generator and discriminator
  • Finance-aware objectives such as PnL, Sharpe ratio, MSE, and combinations of them
  • Local ArcticDB cache so missing symbols are downloaded once and reused later
  • Automatic output organization for weights, plots, results, and run notes
  • Support for repeated runs across multiple seeds

Repository Layout

src/
|-- main.py                            # CLI entry point
|-- libraries/
|   |-- data/
|   |   |-- loader.py                  # Nasdaq Data Link + ArcticDB cache
|   |   `-- preprocessing.py           # Return construction and windowing
|   `-- training/
|       |-- fin_timegan_trainer.py     # Single-ticker pipeline
|       `-- fin_timegan_universal.py   # Universal pooled pipeline
|-- models/
|   |-- embedder.py
|   |-- recovery.py
|   |-- generator.py
|   |-- discriminator.py
|   `-- utils.py
|-- utils/
|   `-- evaluation_fintimegan.py       # Metrics and plotting helpers
|-- database/arctic/                   # ArcticDB LMDB cache
|-- runs/                              # Saved plots and model weights from experiments
`-- readme.md

How the Pipeline Works

1. Data loading and caching

The data layer lives in libraries/data/loader.py.

  • Stocks are fetched from Nasdaq Data Link table SHARADAR/SEP.
  • ETFs are fetched from SHARADAR/SFP.
  • Adjusted open and adjusted close are computed and stored in ArcticDB at lmdb://./database/arctic.
  • Once a symbol is cached locally, later runs read it from ArcticDB instead of downloading again.
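
The cache-first behaviour described above can be sketched without any external dependency. This is a minimal illustration, not the code in loader.py: InMemoryLibrary, load_prices, and the download callable are hypothetical names, and the in-memory dictionary only stands in for the ArcticDB library.

```python
# Hypothetical sketch of the "download once, reuse later" pattern.
TABLES = {"stock": "SHARADAR/SEP", "etf": "SHARADAR/SFP"}


class InMemoryLibrary:
    """Stand-in for a cache library exposing has_symbol/read/write."""

    def __init__(self):
        self._store = {}

    def has_symbol(self, symbol):
        return symbol in self._store

    def read(self, symbol):
        return self._store[symbol]

    def write(self, symbol, data):
        self._store[symbol] = data


def load_prices(symbol, kind, library, download):
    """Return cached data if present; otherwise download and cache it."""
    if library.has_symbol(symbol):
        return library.read(symbol)
    data = download(TABLES[kind], symbol)  # only reached on a cache miss
    library.write(symbol, data)
    return data
```

With this shape, a second call for the same symbol never reaches the download callable, which is the property the ArcticDB cache is described as providing.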

2. Return construction

The preprocessing logic lives in libraries/data/preprocessing.py.

  • For supported stocks, the model uses excess returns relative to a mapped sector ETF.
  • For tickers whose symbol starts with X, the model uses raw returns instead.
  • Returns are built from interleaved adjusted open and adjusted close log prices.
  • The resulting series is clipped to [-0.15, 0.15].
  • Windows are created with length l + pred, stepping by h.
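
The return construction steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the code in preprocessing.py: the function names are hypothetical, the interleaved series is assumed to start with the previous close (so even indices are close-to-open legs, matching the Result Metrics section), and excess returns are assumed to be the stock return minus the ETF return.

```python
import math


def interleaved_log_returns(adj_open, adj_close):
    """Differences of the series log C_0, log O_1, log C_1, log O_2, ..."""
    log_prices = [math.log(adj_close[0])]
    for o, c in zip(adj_open[1:], adj_close[1:]):
        log_prices.append(math.log(o))
        log_prices.append(math.log(c))
    return [b - a for a, b in zip(log_prices, log_prices[1:])]


def excess_returns(stock_returns, etf_returns):
    """Assumed definition: stock return minus mapped sector ETF return."""
    return [s - e for s, e in zip(stock_returns, etf_returns)]


def clip(series, lo=-0.15, hi=0.15):
    """Clip each return to [lo, hi]."""
    return [min(max(x, lo), hi) for x in series]


def make_windows(series, l=10, pred=1, h=1):
    """Sliding windows of length l + pred, stepping by h."""
    size = l + pred
    return [series[i:i + size] for i in range(0, len(series) - size + 1, h)]
```

For example, a series of 6 returns with l=3, pred=1, and h=2 yields two windows of length 4, starting at indices 0 and 2.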

3. Training phases

The training logic lives in libraries/training/fin_timegan_trainer.py and libraries/training/fin_timegan_universal.py.

The default workflow is:

  1. Phase 1: pre-train the embedder and recovery network as an autoencoder on the condition window.
  2. Phase 2: optionally warm-start the generator/discriminator with BCE-only adversarial training. (This phase dates from the experimentation period and has since been removed from the default workflow; --n_epochs_phase2 defaults to 0.)
  3. GradientCheck: estimate scaling coefficients for PnL, MSE, Sharpe, and STD objectives from gradient norms.
  4. Phase 3: train the generator/discriminator across a set of loss combinations.
  5. Evaluation: sample many forecasts per condition, compute forecasting and trading metrics, and save plots/CSVs.
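
One plausible reading of the GradientCheck step is gradient-norm matching: each finance-aware term is scaled so that its gradient has roughly the same magnitude as the BCE gradient. The sketch below assumes that rule; the function names and the use of flat gradient lists are hypothetical, and the trainer may compute its coefficients differently.

```python
import math
from statistics import mean


def l2_norm(grad):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(g * g for g in grad))


def scaling_coefficients(bce_grads, term_grads):
    """Assumed rule: coefficient = mean BCE grad norm / mean term grad norm.

    bce_grads:  list of flattened BCE gradients, one per step.
    term_grads: dict mapping a term name (e.g. "PnL") to such a list.
    """
    reference = mean(l2_norm(g) for g in bce_grads)
    return {
        name: reference / mean(l2_norm(g) for g in grads)
        for name, grads in term_grads.items()
    }
```

Under this rule, a term whose gradients are ten times larger than the BCE gradients receives a coefficient of about 0.1.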

4. Loss combinations

The project evaluates these combinations:

  • PnL
  • PnL MSE
  • PnL MSE STD
  • PnL MSE SR
  • PnL SR
  • PnL STD
  • SR
  • SR MSE
  • MSE
  • BCE
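
To make the combination names concrete, the sketch below shows one way a combination string could be turned into a generator loss. It is an assumption-laden illustration, not the trainer's implementation: the adversarial BCE term is assumed to be present in every combination, PnL and SR are assumed to be maximized (so they are subtracted), MSE and STD are assumed to be minimized, and the sign is approximated with tanh as suggested by the --tanh_coeff argument.

```python
import math
from statistics import mean, pstdev


def soft_sign(x, tanh_coeff=100.0):
    """Differentiable approximation of sign(x)."""
    return math.tanh(tanh_coeff * x)


def pnl_series(preds, reals, tanh_coeff=100.0):
    """Per-step PnL of trading in the direction of the soft sign."""
    return [soft_sign(p, tanh_coeff) * r for p, r in zip(preds, reals)]


def combined_loss(combo, bce, preds, reals, coeffs, tanh_coeff=100.0):
    """combo is a name such as "PnL MSE SR"; coeffs maps term -> weight."""
    terms = set(combo.split())
    pnl = pnl_series(preds, reals, tanh_coeff)
    loss = bce  # assumed to be part of every combination
    if "PnL" in terms:
        loss -= coeffs["PnL"] * mean(pnl)
    if "MSE" in terms:
        loss += coeffs["MSE"] * mean((p - r) ** 2 for p, r in zip(preds, reals))
    if "SR" in terms:
        loss -= coeffs["SR"] * mean(pnl) / (pstdev(pnl) + 1e-8)
    if "STD" in terms:
        loss += coeffs["STD"] * pstdev(pnl)
    return loss
```

With this layout, the BCE combination reduces to the plain adversarial loss, and the coefficients would come from the GradientCheck step.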

Model Architecture

Embedder

  • RNN over the condition window
  • Choice of gru or lstm
  • Projects the normalized sequence into a latent representation

Recovery

  • RNN decoder from latent space back to return space
  • Used both for Phase 1 reconstruction and for decoding generated next-step latents

Generator

  • Takes random noise plus condition context
  • context_mode=last: uses the last latent state only
  • context_mode=attention: uses multi-head attention over the latent condition sequence
  • In attention mode, the query is built from the last latent state plus a pooled summary of the full latent history
  • Noise is injected after attention so sampling stays stochastic without randomizing which history the model looks at
  • Outputs the next latent state, which the recovery network decodes into the next return
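
The attention-mode ordering described above (attend first, add noise afterwards) can be illustrated with a deliberately simplified sketch. It uses a single dot-product head in plain Python instead of multi-head attention, assumes the query is the elementwise sum of the last latent state and the mean-pooled history, and uses hypothetical function names; the actual generator in models/generator.py will differ.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(latents):
    """latents: list of T latent vectors (lists of floats)."""
    dim = len(latents[0])
    pooled = [sum(h[d] for h in latents) / len(latents) for d in range(dim)]
    # Assumed query: last latent state plus pooled summary of the history.
    query = [last + p for last, p in zip(latents[-1], pooled)]
    scores = [
        sum(q * k for q, k in zip(query, h)) / math.sqrt(dim) for h in latents
    ]
    weights = softmax(scores)
    context = [
        sum(w * h[d] for w, h in zip(weights, latents)) for d in range(dim)
    ]
    return context, weights


def generator_input(latents, noise_dim, rng):
    """Attention runs first; noise is appended afterwards."""
    context, _ = attention_context(latents)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return context + noise
```

Because the noise never enters the query or the scores, two samples drawn for the same condition share identical attention weights and differ only in the noise part of the input.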

Discriminator

  • Encodes the condition sequence with an RNN
  • context_mode=last: uses the last encoded condition state
  • context_mode=attention: uses multi-head attention over the encoded condition history with the candidate next return as the query seed
  • Projects the candidate next return into the model hidden space before comparison
  • Predicts real/fake in return space

Training Modes

Single mode

Single mode loads one ticker, trains one model pipeline, and evaluates each loss combination on that ticker.

Important detail:

  • In single mode, the Phase 3 combinations are trained separately. Each combination starts from the weights saved after GradientCheck, so a separate model is trained for each loss combination.

Universal mode

Universal mode pools training windows from --pool_tickers, then evaluates the trained model on all pool_tickers plus --other_tickers.

Important details:

  • The pooled training set is built only from --pool_tickers.
  • Validation and test sets are kept per ticker for evaluation.
  • A portfolio-level row named PORTFOLIO is added to the results.
  • In universal mode, each loss combination is run independently from the same saved baseline weights.
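
The pooling rule above can be sketched as follows. The function names are hypothetical, and the chronological split by --tr and --vl is an assumption about how the train/validation/test sets are cut; only the pooling behaviour itself is taken from the description.

```python
def split_windows(windows, tr=0.8, vl=0.1):
    """Assumed chronological split into train / validation / test."""
    n = len(windows)
    n_tr = int(n * tr)
    n_vl = int(n * vl)
    return windows[:n_tr], windows[n_tr:n_tr + n_vl], windows[n_tr + n_vl:]


def build_universal_sets(windows_by_ticker, pool_tickers, tr=0.8, vl=0.1):
    """Pool training windows from pool_tickers; keep val/test per ticker."""
    pooled_train, val, test = [], {}, {}
    for ticker, windows in windows_by_ticker.items():
        train, val[ticker], test[ticker] = split_windows(windows, tr, vl)
        if ticker in pool_tickers:  # evaluation-only tickers add no training data
            pooled_train.extend(train)
    return pooled_train, val, test
```

In this sketch a ticker passed through --other_tickers still gets validation and test sets, but contributes nothing to the pooled training set.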

Supported Ticker Logic

Out of the box, stock excess returns are supported only for stocks present in STOCK_ETF_MAP in libraries/data/preprocessing.py.

Current stock-to-ETF mappings are:

| Sector ETF | Stocks |
| --- | --- |
| XLY (Consumer Discretionary) | AMZN (Amazon.com), HD (Home Depot), NKE (Nike), CCL (Carnival), EBAY (eBay) |
| XLP (Consumer Staples) | CL (Colgate-Palmolive), EL (Estée Lauder), KO (Coca-Cola), PEP (PepsiCo), SYY (Sysco), TSN (Tyson Foods) |
| XLE (Energy) | APA (APA Corporation), OXY (Occidental Petroleum) |
| XLF (Financials) | WFC (Wells Fargo), GS (Goldman Sachs), BLK (BlackRock), TROW (T. Rowe Price) |
| XLV (Health Care) | PFE (Pfizer), HUM (Humana), CERN (Cerner) |
| XLI (Industrials) | FDX (FedEx), GD (General Dynamics) |
| XLK (Information Technology) | IBM (International Business Machines), TER (Teradyne) |
| XLB (Materials) | ECL (Ecolab), IP (International Paper) |
| XLU (Utilities) | DTE (DTE Energy), WEC (WEC Energy Group) |

Raw-return mode is triggered when the ticker starts with X, for example:

  • XLY (Consumer Discretionary)
  • XLP (Consumer Staples)
  • XLE (Energy)
  • XLF (Financials)
  • XLV (Health Care)
  • XLI (Industrials)
  • XLK (Information Technology)
  • XLB (Materials)
  • XLU (Utilities)

Notes:

  • A stock not present in STOCK_ETF_MAP will fail until you add a mapping.
  • The current routing logic does not automatically treat non-X ETFs such as SPY as raw-return symbols.
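
The routing rules above amount to a small lookup. The sketch below is illustrative only: return_mode is a hypothetical name, the mapping is a short excerpt of the table above, and the order of the two checks is an assumption.

```python
# Excerpt of the stock-to-ETF table, for illustration.
STOCK_ETF_MAP = {"PFE": "XLV", "HUM": "XLV", "AMZN": "XLY", "KO": "XLP"}


def return_mode(ticker, mapping=STOCK_ETF_MAP):
    """Return ("raw", None) or ("excess", sector_etf) for a ticker."""
    if ticker.startswith("X"):
        return ("raw", None)
    if ticker in mapping:
        return ("excess", mapping[ticker])
    # Mirrors the documented behaviour: unmapped symbols are not supported.
    raise KeyError(f"{ticker} has no sector ETF mapping; add it to STOCK_ETF_MAP")
```

Under these rules SPY raises an error, since it neither starts with X nor appears in the mapping.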

Installation

1. Create a Python environment

Use a Python version supported by your TensorFlow and ArcticDB wheels.

Windows PowerShell example:

python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip

2. Install dependencies

The code imports these packages directly:

  • numpy
  • pandas
  • tensorflow
  • matplotlib
  • arcticdb
  • nasdaqdatalink
  • tqdm

Example install:

pip install numpy pandas tensorflow matplotlib arcticdb nasdaq-data-link tqdm

3. Configure Nasdaq Data Link access

The loader currently sets the API key directly inside libraries/data/loader.py:

ndl.ApiConfig.api_key = "YOUR_API_KEY"

Before running experiments, replace that value with your own key. If you prefer a cleaner setup, move it to an environment variable and load it from code.
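
A minimal sketch of the environment-variable approach is shown below. The helper name load_api_key and the variable name NASDAQ_DATA_LINK_API_KEY are choices made for this example, not something the repository defines.

```python
import os


def load_api_key(env=None, name="NASDAQ_DATA_LINK_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# In libraries/data/loader.py this could replace the hard-coded assignment:
# ndl.ApiConfig.api_key = load_api_key()
```

This keeps the key out of version control and makes a missing key fail with a clear message instead of a rejected request.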

4. Check the cache location

By default, market data is stored in:

database/arctic

This directory is managed through:

ARCTIC_URI = "lmdb://./database/arctic"

Running the Project

The entry point is main.py.

Single-ticker run

python main.py --train_mode single --ticker PFE --seeds 42 --plot

This creates:

runs/PFE/

unless you override the run label with --run_name.

Universal run

python main.py --train_mode universal --pool_tickers PFE HUM --other_tickers XLV --run_name healthcare_universe --seeds 42 --plot

This creates:

runs/healthcare_universe/

A few practical examples

Run on CPU only:

python main.py --train_mode single --ticker PFE --device cpu

Use attention in both the generator and discriminator:

python main.py --train_mode single --ticker PFE --context_mode attention --attention_heads 2

Run multiple seeds:

python main.py --train_mode universal --pool_tickers PFE HUM --other_tickers XLV --seeds 42 7 99

Disable plot generation:

python main.py --train_mode single --ticker PFE --no-plot

Enable TensorFlow JIT compilation:

python main.py --train_mode single --ticker PFE --jit_compile

Command-Line Reference

The main arguments exposed by main.py are:

| Argument | Default | Meaning |
| --- | --- | --- |
| --train_mode | universal | single or universal |
| --run_name | None | Optional run label; spaces are replaced with underscores |
| --ticker | PFE | Main ticker; also used as a fallback in universal mode if --pool_tickers is omitted |
| --pool_tickers | None | Training pool for universal mode |
| --other_tickers | [] | Extra evaluation-only tickers in universal mode |
| --l | 10 | Condition window length |
| --pred | 1 | Prediction horizon parameter used during window construction |
| --tr | 0.8 | Train split ratio |
| --vl | 0.1 | Validation split ratio |
| --h | 1 | Sliding-window stride |
| --module | lstm | RNN cell type: gru or lstm |
| --hidden_dim | 32 | Hidden dimension for the networks |
| --num_layers | 1 | Number of recurrent layers |
| --noise_dim | 32 | Generator noise dimension |
| --n_epochs_phase1 | 25 | Autoencoder pre-training epochs |
| --n_epochs_phase2 | 0 | Optional BCE warm-start epochs |
| --n_epochs_phase3, --n_epochs | 5 | Epochs per Phase 3 loss combination |
| --ngrad | 5 | GradientCheck epochs |
| --batch_size | 64 | Batch size |
| --lr_phase1 | 0.001 | Learning rate for Phase 1 |
| --lr_phase2, --lr | 0.0001 | Learning rate for BCE warm-start and Phase 3 |
| --tanh_coeff | 100.0 | Coefficient for the differentiable sign approximation |
| --diter | 1 | Number of discriminator updates per generator update |
| --eval_samples | 1000 | Monte Carlo samples per condition during evaluation |
| --phase1_early_stop_loss | 1e-6 | Early-stop threshold for Phase 1 reconstruction loss |
| --phase1_min_epochs | 1 | Minimum Phase 1 epochs before early stopping can trigger |
| --context_mode | last | Shared generator/discriminator context mode: last or attention |
| --attention_heads | 2 | Number of heads used in shared attention mode |
| --jit_compile | False | Enable TensorFlow jit_compile=True on selected training steps |
| --device | auto | auto, cpu, gpu, or npu |
| --notes_file | None | Optional custom location for run_notes.txt |
| --plot | True | Save plots |
| --no-plot | False | Disable plot generation |
| --seeds | [42] | One or more random seeds |

Device selection behavior:

  • auto prefers NPU, then GPU, then CPU
  • cpu hides visible GPUs/NPUs if TensorFlow allows it
  • gpu or npu will fall back to CPU if the requested device is not visible
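
The selection rules above can be written as a small pure function. This sketch only models the decision logic: choose_device is a hypothetical name, and the set of visible device kinds is passed in rather than queried from TensorFlow.

```python
def choose_device(requested, visible):
    """Pick a device kind given the request and the visible kinds.

    requested: "auto", "cpu", "gpu", or "npu".
    visible:   set of available kinds, e.g. {"cpu", "gpu"}.
    """
    if requested == "auto":
        for kind in ("npu", "gpu", "cpu"):  # preference order for auto
            if kind in visible:
                return kind
        return "cpu"
    if requested in ("gpu", "npu") and requested not in visible:
        return "cpu"  # documented fallback when the device is not visible
    return requested
```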

Outputs

Every run writes into:

runs/<run_label>/

where:

  • run_label = --run_name if provided
  • otherwise, in single mode, the label is the ticker
  • otherwise, in universal mode, the label defaults to universal
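
The label rules can be summarized in a few lines. The function name resolve_run_label is hypothetical; the behaviour follows the rules listed here and the --run_name description in the Command-Line Reference.

```python
def resolve_run_label(train_mode, ticker="PFE", run_name=None):
    """Return the directory label used under runs/."""
    if run_name:
        return run_name.replace(" ", "_")  # spaces become underscores
    return ticker if train_mode == "single" else "universal"
```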

Each run directory contains:

| Path | Contents |
| --- | --- |
| models/ | Saved weights for the embedder, recovery, and per-combo generator/discriminator checkpoints |
| plots/ | Training curves, forecast plots, cumulative PnL charts, combo correlation heatmaps, and summary figures |
| results/ | Per-seed result tables, all-seed concatenations, combo correlation CSVs, and across-seed summaries |
| pnl/ | Saved PnL series per combination |
| run_notes.txt | Timestamp, configuration, device report, output paths, and a summary preview |

Typical result files

Single mode examples:

  • runs/PFE/results/PFE_results_seed_42.csv
  • runs/PFE/results/PFE_results_all_seeds.csv
  • runs/PFE/results/PFE_summary_across_seeds.csv
  • runs/PFE/results/PFE_seed_42_combo_corr.csv

Universal mode examples:

  • runs/universal/results/universal_results_seed_42.csv
  • runs/universal/results/universal_results_all_seeds.csv
  • runs/universal/results/universal_summary_across_seeds.csv
  • runs/universal/results/universal_seed_42_combo_corr.csv

Result Metrics

The evaluation code lives in utils/evaluation_fintimegan.py.

The most important columns in the results tables are:

| Column | Meaning |
| --- | --- |
| RMSE, MAE | Forecast error between the mean generated return and the real return |
| Corr | Correlation between the mean generated return and the real return |
| Hit rate | Sign accuracy of the weighted trading signal |
| Hit rate hard sign | Sign accuracy using sign(mean prediction) |
| SR_w scaled | Annualized Sharpe ratio of the weighted-signal PnL |
| SR_w hard sign | Annualized Sharpe ratio using the hard-sign signal |
| PnL_w | Mean paired PnL from the weighted signal |
| Close-to-Open SR_w | Sharpe ratio on the even-index leg of the paired returns |
| Open-to-Close SR_w | Sharpe ratio on the odd-index leg of the paired returns |
| SR_w scaled shuffled | Sharpe ratio after shuffling the signal as a leakage sanity check |
| SR_w constant +1 | Long-only Sharpe baseline |
| SR_w constant -1 | Short-only Sharpe baseline |
| Pos mn, Neg mn | Fraction of positive and negative mean forecasts |
| narrow dist | Collapse flag based on the sampled distribution width |
| narrow means dist | Collapse flag based on the dispersion of mean forecasts |
| scope | ticker or portfolio in universal mode |

How the weighted signal is formed

For each condition window, the evaluator samples many future returns from the generator:

  • p_up = mean(samples >= 0)
  • p_down = 1 - p_up
  • weighted_signal = p_up - p_down

PnL is then computed as:

10000 * weighted_signal * real_return

The evaluator groups the interleaved close-to-open and open-to-close legs into paired daily PnL series before computing Sharpe metrics.
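
The signal and PnL formulas above can be sketched end to end in plain Python. The function names are hypothetical, and three details are assumptions rather than facts from the evaluator: the paired daily PnL is taken as the sum of the two legs, the Sharpe ratio uses the population standard deviation, and annualization uses 252 periods per year.

```python
import math
from statistics import mean, pstdev


def weighted_signal(samples):
    """p_up - p_down for one condition window's generated samples."""
    p_up = mean(1.0 if s >= 0 else 0.0 for s in samples)
    p_down = 1.0 - p_up
    return p_up - p_down


def leg_pnl(signals, real_returns):
    """PnL per interleaved leg: 10000 * signal * real return."""
    return [10000 * s * r for s, r in zip(signals, real_returns)]


def paired_daily_pnl(pnl):
    """Assumed pairing: sum each even-index leg with the following odd-index leg."""
    return [pnl[i] + pnl[i + 1] for i in range(0, len(pnl) - 1, 2)]


def annualized_sharpe(daily_pnl, periods_per_year=252):
    """Mean over standard deviation, scaled by sqrt(periods_per_year)."""
    sd = pstdev(daily_pnl)
    if sd == 0:
        return 0.0
    return mean(daily_pnl) / sd * math.sqrt(periods_per_year)
```

A signal of +1 or -1 means every sample agreed on the direction, while a signal near 0 means the generated distribution is split and the position is correspondingly small.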

Existing Example Artifacts

The repository already contains example outputs:

  • runs/PFE/
  • runs/universal/

These are useful if you want to inspect the expected folder structure and file naming before launching a new run.

Important Assumptions and Limitations

  • The loader can fetch data through 2026-02-15, but preprocessing currently keeps only rows with dates earlier than 2022-01-01.
  • The current training loop is effectively one-step-ahead. Although --pred exists, the target used during training/evaluation is batch[:, l, :], so pred=1 is the safe setting.
  • Universal mode uses the first pooled ticker's validation/test tensors for Phase 1 reconstruction diagnostics.
  • Plotting uses the non-interactive Agg backend, so figures are saved to disk rather than shown in a GUI window.
  • The project currently has no requirements.txt, pyproject.toml, or license file in this directory.
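
The one-step-ahead limitation noted in the list above can be illustrated on a single window. The sketch below is a simplified, list-based analogue of the batch[:, l, :] indexing (no batch or feature dimension), with a hypothetical function name.

```python
def split_window(window, l=10):
    """Split one window of length l + pred into condition and target.

    Only the element at index l is used as the target, so any entries
    beyond it (when pred > 1) are ignored.
    """
    condition = window[:l]
    target = window[l]
    return condition, target
```

With l=10 and pred=2, a window has 12 entries, but the entry at index 11 is never used, which is why pred=1 is the safe setting.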

Quick Start

If you just want one reproducible local run, use:

python main.py --train_mode single --ticker PFE --seeds 42 --plot

Then inspect:

  • runs/PFE/run_notes.txt
  • runs/PFE/results/PFE_summary_across_seeds.csv
  • runs/PFE/plots/

About

This is the accompanying code for distributional forecasting of ETF and stock returns with Generative Adversarial Networks. The pipeline extends the FinGAN pipeline by Milena Vuletic.
