FinTimeGAN

FinTimeGAN is a conditional TimeGAN/FinGAN-style research pipeline for financial return generation and next-step forecasting. It trains TensorFlow models on cached market data, evaluates multiple finance-aware loss combinations, and writes model weights, plots, PnL series, and summary tables under runs/<run_label>/.

The repository currently supports two workflows:

single: train and evaluate one ticker at a time.
universal: pool training windows from multiple tickers, then evaluate every loss combination across the full ticker set plus a portfolio aggregate.

Highlights

Conditional sequence modeling with an RNN embedder, recovery network, generator, and discriminator
Optional shared attention mode for both the generator and discriminator
Finance-aware objectives such as PnL, Sharpe ratio, MSE, and combinations of them
Local ArcticDB cache so missing symbols are downloaded once and reused later
Automatic output organization for weights, plots, results, and run notes
Support for repeated runs across multiple seeds

Repository Layout

src/
|-- main.py                            # CLI entry point
|-- libraries/
|   |-- data/
|   |   |-- loader.py                  # Nasdaq Data Link + ArcticDB cache
|   |   `-- preprocessing.py           # Return construction and windowing
|   `-- training/
|       |-- fin_timegan_trainer.py     # Single-ticker pipeline
|       `-- fin_timegan_universal.py   # Universal pooled pipeline
|-- models/
|   |-- embedder.py
|   |-- recovery.py
|   |-- generator.py
|   |-- discriminator.py
|   `-- utils.py
|-- utils/
|   `-- evaluation_fintimegan.py       # Metrics and plotting helpers
|-- database/arctic/                   # ArcticDB LMDB cache
|-- runs/                              # Saved plots and model weights from experiments
`-- readme.md

How the Pipeline Works

1. Data loading and caching

The data layer lives in libraries/data/loader.py.

Stocks are fetched from Nasdaq Data Link table SHARADAR/SEP.
ETFs are fetched from SHARADAR/SFP.
Adjusted open and adjusted close are computed and stored in ArcticDB at lmdb://./database/arctic.
Once a symbol is cached locally, later runs read it from ArcticDB instead of downloading again.
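
The cache-then-download behavior can be sketched as below. This is a minimal illustration, not the code in loader.py: load_prices, InMemoryLibrary, and fake_download are hypothetical names, and the in-memory class only mimics the has_symbol / read / write methods that an ArcticDB library object exposes.

```python
from types import SimpleNamespace


class InMemoryLibrary:
    """Hypothetical stand-in exposing the three library methods used below."""

    def __init__(self):
        self._store = {}

    def has_symbol(self, symbol):
        return symbol in self._store

    def read(self, symbol):
        # ArcticDB returns an object whose .data attribute holds the stored frame.
        return SimpleNamespace(data=self._store[symbol])

    def write(self, symbol, data):
        self._store[symbol] = data


def load_prices(symbol, library, download):
    """Return cached data for `symbol`, downloading only on a cache miss."""
    if library.has_symbol(symbol):
        return library.read(symbol).data
    data = download(symbol)       # hypothetical callable wrapping the API request
    library.write(symbol, data)   # stored so later runs skip the download
    return data


calls = []


def fake_download(symbol):
    """Records each call so the example can show the download happens once."""
    calls.append(symbol)
    return {"symbol": symbol, "adj_open": [10.0], "adj_close": [10.5]}


lib = InMemoryLibrary()
first = load_prices("PFE", lib, fake_download)
second = load_prices("PFE", lib, fake_download)  # served from the cache
```

After both calls, calls contains a single entry, because the second request never reaches the downloader.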

2. Return construction

The preprocessing logic lives in libraries/data/preprocessing.py.

For supported stocks, the model uses excess returns relative to a mapped sector ETF.
For tickers whose symbol starts with X, the model uses raw returns instead.
Returns are built from interleaved adjusted open and adjusted close log prices.
The resulting series is clipped to [-0.15, 0.15].
Windows are created with length l + pred, stepping by h.
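
A rough sketch of the raw-return path follows. It uses plain Python lists so it runs without dependencies; the function names are hypothetical, the open-then-close ordering within each day is an assumption, and the sector-ETF subtraction used for excess returns is left out.

```python
import math


def interleaved_log_returns(adj_open, adj_close, clip=0.15):
    """Log returns of the series open_1, close_1, open_2, close_2, ..."""
    log_prices = []
    for o, c in zip(adj_open, adj_close):
        log_prices.extend([math.log(o), math.log(c)])
    returns = [b - a for a, b in zip(log_prices, log_prices[1:])]
    # Clip every return to [-clip, clip].
    return [max(-clip, min(clip, r)) for r in returns]


def make_windows(series, l, pred, h):
    """Slice `series` into windows of length l + pred, advancing by h."""
    size = l + pred
    return [series[i:i + size] for i in range(0, len(series) - size + 1, h)]


adj_open = [100.0, 101.0, 99.0, 130.0]
adj_close = [101.0, 100.0, 100.0, 131.0]

rets = interleaved_log_returns(adj_open, adj_close)
windows = make_windows(rets, l=3, pred=1, h=1)

# The first l entries of a window are the condition; entry l is the target.
condition, target = windows[0][:3], windows[0][3]
```

Four days of prices give seven interleaved returns, and the jump from 100 to 130 is clipped to 0.15.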

3. Training phases

The training logic lives in libraries/training/fin_timegan_trainer.py and libraries/training/fin_timegan_universal.py.

The default workflow is:

Phase 1: pre-train the embedder and recovery network as an autoencoder on the condition window.
Phase 2 (legacy): warm-start the generator/discriminator with BCE-only adversarial training. This phase was introduced during experimentation and has since been removed from the default workflow; --n_epochs_phase2 defaults to 0.
GradientCheck: estimate scaling coefficients for PnL, MSE, Sharpe, and STD objectives from gradient norms.
Phase 3: train the generator/discriminator across a set of loss combinations.
Evaluation: sample many forecasts per condition, compute forecasting and trading metrics, and save plots/CSVs.
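
One plausible reading of the GradientCheck step is gradient-norm matching: each extra objective gets a coefficient that brings its gradient norm in line with the adversarial (BCE) gradient norm. The sketch below assumes exactly that and is not taken from the trainer; scaling_coefficient and the toy gradients are hypothetical.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def scaling_coefficient(reference_grads, term_grads, eps=1e-12):
    """Average of reference-norm / term-norm over the observed batches."""
    ratios = [
        l2_norm(ref) / (l2_norm(term) + eps)
        for ref, term in zip(reference_grads, term_grads)
    ]
    return sum(ratios) / len(ratios)


# Toy per-batch gradients: BCE norms are 5 and 10, PnL norms are 0.5 and 1.0.
bce_grads = [[3.0, 4.0], [6.0, 8.0]]
pnl_grads = [[0.3, 0.4], [0.6, 0.8]]

alpha_pnl = scaling_coefficient(bce_grads, pnl_grads)  # close to 10
```

Under this assumption, a term whose gradients are ten times smaller than the BCE gradients is scaled up by roughly ten before Phase 3.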

4. Loss combinations

The project evaluates these combinations:

PnL
PnL MSE
PnL MSE STD
PnL MSE SR
PnL SR
PnL STD
SR
SR MSE
MSE
BCE
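
To make the naming concrete, the sketch below shows one way a combination string could be turned into a weighted sum of finance-aware terms. It is an illustration under assumptions: the exact formulas in the trainers may differ, BCE is treated as the adversarial term that is handled separately, and TERMS, finance_loss, and the unit coefficients are hypothetical. The tanh-based soft sign mirrors the --tanh_coeff argument.

```python
import math
import statistics


def soft_sign(x, tanh_coeff=100.0):
    """Smooth stand-in for sign(x); steeper as tanh_coeff grows."""
    return math.tanh(tanh_coeff * x)


def trade_pnl(pred, real, tanh_coeff=100.0):
    """Per-sample PnL of trading the soft sign of each forecast."""
    return [soft_sign(p, tanh_coeff) * r for p, r in zip(pred, real)]


# Each term is written as a loss, so lower is better.
TERMS = {
    "PnL": lambda pred, real: -statistics.fmean(trade_pnl(pred, real)),
    "MSE": lambda pred, real: statistics.fmean(
        [(p - r) ** 2 for p, r in zip(pred, real)]
    ),
    "SR": lambda pred, real: -statistics.fmean(trade_pnl(pred, real))
    / (statistics.pstdev(trade_pnl(pred, real)) + 1e-8),
    "STD": lambda pred, real: statistics.pstdev(trade_pnl(pred, real)),
}


def finance_loss(combo, pred, real, coeffs):
    """Weighted sum of the terms named in a combination such as 'PnL MSE SR'."""
    names = [n for n in combo.split() if n != "BCE"]  # BCE: adversarial term only
    return sum(coeffs[n] * TERMS[n](pred, real) for n in names)


pred = [0.02, -0.03, 0.01]
real = [0.01, -0.02, -0.01]
coeffs = {"PnL": 1.0, "MSE": 1.0, "SR": 1.0, "STD": 1.0}  # placeholder weights

loss_pnl_mse = finance_loss("PnL MSE", pred, real, coeffs)
loss_bce_only = finance_loss("BCE", pred, real, coeffs)  # no extra terms
```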

Model Architecture

Embedder

RNN over the condition window
Choice of gru or lstm
Projects the normalized sequence into a latent representation

Recovery

RNN decoder from latent space back to return space
Used both for Phase 1 reconstruction and for decoding generated next-step latents

Generator

Takes random noise plus condition context
context_mode=last: uses the last latent state only
context_mode=attention: uses multi-head attention over the latent condition sequence
In attention mode, the query is built from the last latent state plus a pooled summary of the full latent history
Noise is injected after attention so sampling stays stochastic without randomizing which history the model looks at
Outputs the next latent state, which the recovery network decodes into the next return

Discriminator

Encodes the condition sequence with an RNN
context_mode=last: uses the last encoded condition state
context_mode=attention: uses multi-head attention over the encoded condition history with the candidate next return as the query seed
Projects the candidate next return into the model hidden space before comparison
Predicts real/fake in return space

Training Modes

Single mode

Single mode loads one ticker, trains one model pipeline, and evaluates each loss combination on that ticker.

Important detail:

In single mode, the Phase 3 combinations are trained separately: each combination starts from the weights saved after GradientCheck and produces its own model.

Universal mode

Universal mode pools training windows from --pool_tickers, then evaluates the trained model on every ticker in --pool_tickers plus those in --other_tickers.

Important details:

The pooled training set is built only from --pool_tickers.
Validation and test sets are kept per ticker for evaluation.
A portfolio-level row named PORTFOLIO is added to the results.
In universal mode, each loss combination is run independently from the same saved baseline weights.

Supported Ticker Logic

Out of the box, stock excess returns are supported only for stocks present in STOCK_ETF_MAP in libraries/data/preprocessing.py.

Current stock-to-ETF mappings are:

Sector ETF	Stocks
`XLY` (Consumer Discretionary)	`AMZN` (Amazon.com), `HD` (Home Depot), `NKE` (Nike), `CCL` (Carnival), `EBAY` (eBay)
`XLP` (Consumer Staples)	`CL` (Colgate-Palmolive), `EL` (Estée Lauder), `KO` (Coca-Cola), `PEP` (PepsiCo), `SYY` (Sysco), `TSN` (Tyson Foods)
`XLE` (Energy)	`APA` (APA Corporation), `OXY` (Occidental Petroleum)
`XLF` (Financials)	`WFC` (Wells Fargo), `GS` (Goldman Sachs), `BLK` (BlackRock), `TROW` (T. Rowe Price)
`XLV` (Health Care)	`PFE` (Pfizer), `HUM` (Humana), `CERN` (Cerner)
`XLI` (Industrials)	`FDX` (FedEx), `GD` (General Dynamics)
`XLK` (Information Technology)	`IBM` (International Business Machines), `TER` (Teradyne)
`XLB` (Materials)	`ECL` (Ecolab), `IP` (International Paper)
`XLU` (Utilities)	`DTE` (DTE Energy), `WEC` (WEC Energy Group)

Raw-return mode is triggered when the ticker starts with X, for example:

XLY (Consumer Discretionary)
XLP (Consumer Staples)
XLE (Energy)
XLF (Financials)
XLV (Health Care)
XLI (Industrials)
XLK (Information Technology)
XLB (Materials)
XLU (Utilities)

Notes:

A stock not present in STOCK_ETF_MAP will fail until you add a mapping.
The current routing logic does not automatically treat non-X ETFs such as SPY as raw-return symbols.
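
The routing rule described above can be summarized in a few lines. This is a sketch, not the code in preprocessing.py: route_ticker is a hypothetical name and the dictionary is only an excerpt of the mapping table.

```python
# Excerpt of the stock-to-ETF table above; the real STOCK_ETF_MAP is larger.
STOCK_ETF_MAP_EXCERPT = {
    "AMZN": "XLY",
    "KO": "XLP",
    "PFE": "XLV",
    "HUM": "XLV",
}


def route_ticker(ticker, stock_etf_map=STOCK_ETF_MAP_EXCERPT):
    """Return (mode, benchmark): raw returns for X* symbols, excess otherwise."""
    if ticker.startswith("X"):
        return ("raw", None)
    if ticker in stock_etf_map:
        return ("excess", stock_etf_map[ticker])
    raise KeyError(f"{ticker} has no sector ETF mapping; add it to the map first")


pfe_route = route_ticker("PFE")   # ("excess", "XLV")
xlv_route = route_ticker("XLV")   # ("raw", None)

# SPY neither starts with X nor appears in the map, so it is rejected.
try:
    route_ticker("SPY")
    spy_supported = True
except KeyError:
    spy_supported = False
```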

Installation

1. Create a Python environment

Use a Python version supported by your TensorFlow and ArcticDB wheels.

Windows PowerShell example:

python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip

2. Install dependencies

The code imports these packages directly:

numpy
pandas
tensorflow
matplotlib
arcticdb
nasdaqdatalink
tqdm

Example install:

pip install numpy pandas tensorflow matplotlib arcticdb nasdaq-data-link tqdm

3. Configure Nasdaq Data Link access

The loader currently sets the API key directly inside libraries/data/loader.py:

ndl.ApiConfig.api_key = "YOUR_API_KEY"

Before running experiments, replace that value with your own key. If you prefer a cleaner setup, move it to an environment variable and load it from code.
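
A minimal sketch of the environment-variable approach is shown below. The helper name read_api_key and the variable name NASDAQ_DATA_LINK_API_KEY are choices made for this example; adapt them to your setup.

```python
import os


def read_api_key(env_var="NASDAQ_DATA_LINK_API_KEY"):
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# In loader.py, the hard-coded string would then become:
# ndl.ApiConfig.api_key = read_api_key()
```

In PowerShell the variable can be set for the current session with $env:NASDAQ_DATA_LINK_API_KEY = "..." before launching main.py.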

4. Check the cache location

By default, market data is stored in:

database/arctic

This directory is managed through:

ARCTIC_URI = "lmdb://./database/arctic"

Running the Project

The entry point is main.py, located in src/ (see Repository Layout). The commands below assume src/ is the working directory.

Single-ticker run

python main.py --train_mode single --ticker PFE --seeds 42 --plot

This creates:

runs/PFE/

unless you override the run label with --run_name.

Universal run

python main.py --train_mode universal --pool_tickers PFE HUM --other_tickers XLV --run_name healthcare_universe --seeds 42 --plot

This creates:

runs/healthcare_universe/

A few practical examples

Run on CPU only:

python main.py --train_mode single --ticker PFE --device cpu

Use attention in both the generator and discriminator:

python main.py --train_mode single --ticker PFE --context_mode attention --attention_heads 2

Run multiple seeds:

python main.py --train_mode universal --pool_tickers PFE HUM --other_tickers XLV --seeds 42 7 99

Disable plot generation:

python main.py --train_mode single --ticker PFE --no-plot

Enable TensorFlow JIT compilation:

python main.py --train_mode single --ticker PFE --jit_compile

Command-Line Reference

The main arguments exposed by main.py are:

Argument	Default	Meaning
`--train_mode`	`universal`	`single` or `universal`
`--run_name`	`None`	Optional run label; spaces are replaced with underscores
`--ticker`	`PFE`	Main ticker; also used as a fallback in universal mode if `--pool_tickers` is omitted
`--pool_tickers`	`None`	Training pool for universal mode
`--other_tickers`	`[]`	Extra evaluation-only tickers in universal mode
`--l`	`10`	Condition window length
`--pred`	`1`	Prediction horizon parameter used during window construction
`--tr`	`0.8`	Train split ratio
`--vl`	`0.1`	Validation split ratio
`--h`	`1`	Sliding-window stride
`--module`	`lstm`	RNN cell type: `gru` or `lstm`
`--hidden_dim`	`32`	Hidden dimension for the networks
`--num_layers`	`1`	Number of recurrent layers
`--noise_dim`	`32`	Generator noise dimension
`--n_epochs_phase1`	`25`	Autoencoder pre-training epochs
`--n_epochs_phase2`	`0`	Optional BCE warm-start epochs
`--n_epochs_phase3`, `--n_epochs`	`5`	Epochs per Phase 3 loss combination
`--ngrad`	`5`	GradientCheck epochs
`--batch_size`	`64`	Batch size
`--lr_phase1`	`0.001`	Learning rate for Phase 1
`--lr_phase2`, `--lr`	`0.0001`	Learning rate for BCE warm-start and Phase 3
`--tanh_coeff`	`100.0`	Coefficient for the differentiable sign approximation
`--diter`	`1`	Number of discriminator updates per generator update
`--eval_samples`	`1000`	Monte Carlo samples per condition during evaluation
`--phase1_early_stop_loss`	`1e-6`	Early-stop threshold for Phase 1 reconstruction loss
`--phase1_min_epochs`	`1`	Minimum Phase 1 epochs before early stopping can trigger
`--context_mode`	`last`	Shared generator/discriminator context mode: `last` or `attention`
`--attention_heads`	`2`	Number of heads used in shared attention mode
`--jit_compile`	`False`	Enable TensorFlow `jit_compile=True` on selected training steps
`--device`	`auto`	`auto`, `cpu`, `gpu`, or `npu`
`--notes_file`	`None`	Optional custom location for `run_notes.txt`
`--plot`	`True`	Save plots
`--no-plot`	`False`	Disable plot generation
`--seeds`	`[42]`	One or more random seeds

Device selection behavior:

auto prefers NPU, then GPU, then CPU
cpu hides visible GPUs/NPUs if TensorFlow allows it
gpu or npu will fall back to CPU if the requested device is not visible
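
The selection rules can be expressed as a small pure function. This is a sketch of the documented behavior rather than the code in main.py: resolve_device is a hypothetical name, and visible stands for the set of device kinds TensorFlow reports (for example via tf.config.list_physical_devices).

```python
def resolve_device(requested, visible):
    """Pick the device kind to use given the request and the visible kinds."""
    if requested == "auto":
        # Preference order: NPU, then GPU, then CPU.
        for kind in ("npu", "gpu", "cpu"):
            if kind in visible:
                return kind
        return "cpu"
    if requested in ("gpu", "npu") and requested not in visible:
        return "cpu"  # requested accelerator is not visible, fall back
    return requested


auto_choice = resolve_device("auto", {"cpu", "gpu"})   # "gpu"
npu_fallback = resolve_device("npu", {"cpu", "gpu"})   # "cpu"
forced_cpu = resolve_device("cpu", {"cpu", "gpu"})     # "cpu"
```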

Outputs

Every run writes into:

runs/<run_label>/

where:

run_label is --run_name if provided
otherwise, in single mode, it is the ticker
otherwise, in universal mode, it is universal
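
The same rule as a short function, including the space-to-underscore replacement noted in the command-line reference. resolve_run_label is a hypothetical name used only for this sketch.

```python
import os


def resolve_run_label(run_name, train_mode, ticker):
    """Derive the run directory label from the CLI arguments."""
    if run_name:
        return run_name.replace(" ", "_")
    if train_mode == "single":
        return ticker
    return "universal"


label_named = resolve_run_label("healthcare universe", "universal", "PFE")
label_single = resolve_run_label(None, "single", "PFE")
label_default = resolve_run_label(None, "universal", "PFE")

run_dir = os.path.join("runs", label_single)
```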

Each run directory contains:

Path	Contents
`models/`	Saved weights for the embedder, recovery, and per-combo generator/discriminator checkpoints
`plots/`	Training curves, forecast plots, cumulative PnL charts, combo correlation heatmaps, and summary figures
`results/`	Per-seed result tables, all-seed concatenations, combo correlation CSVs, and across-seed summaries
`pnl/`	Saved PnL series per combination
`run_notes.txt`	Timestamp, configuration, device report, output paths, and a summary preview

Typical result files

Single mode examples:

runs/PFE/results/PFE_results_seed_42.csv
runs/PFE/results/PFE_results_all_seeds.csv
runs/PFE/results/PFE_summary_across_seeds.csv
runs/PFE/results/PFE_seed_42_combo_corr.csv

Universal mode examples:

runs/universal/results/universal_results_seed_42.csv
runs/universal/results/universal_results_all_seeds.csv
runs/universal/results/universal_summary_across_seeds.csv
runs/universal/results/universal_seed_42_combo_corr.csv

Result Metrics

The evaluation code lives in utils/evaluation_fintimegan.py.

The most important columns in the results tables are:

Column	Meaning
`RMSE`, `MAE`	Forecast error between the mean generated return and the real return
`Corr`	Correlation between the mean generated return and the real return
`Hit rate`	Sign accuracy of the weighted trading signal
`Hit rate hard sign`	Sign accuracy using `sign(mean prediction)`
`SR_w scaled`	Annualized Sharpe ratio of the weighted-signal PnL
`SR_w hard sign`	Annualized Sharpe ratio using the hard-sign signal
`PnL_w`	Mean paired PnL from the weighted signal
`Close-to-Open SR_w`	Sharpe ratio on the even-index leg of the paired returns
`Open-to-Close SR_w`	Sharpe ratio on the odd-index leg of the paired returns
`SR_w scaled shuffled`	Sharpe ratio after shuffling the signal as a leakage sanity check
`SR_w constant +1`	Long-only Sharpe baseline
`SR_w constant -1`	Short-only Sharpe baseline
`Pos mn`, `Neg mn`	Fraction of positive and negative mean forecasts
`narrow dist`	Collapse flag based on the sampled distribution width
`narrow means dist`	Collapse flag based on the dispersion of mean forecasts
`scope`	`ticker` or `portfolio` in universal mode

How the weighted signal is formed

For each condition window, the evaluator samples many future returns from the generator:

p_up = mean(samples >= 0)
p_down = 1 - p_up
weighted_signal = p_up - p_down

PnL is then computed as:

10000 * weighted_signal * real_return

The evaluator groups the interleaved close-to-open and open-to-close legs into paired daily PnL series before computing Sharpe metrics.
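
The steps above can be sketched in plain Python. The helper names are hypothetical, and three details are assumptions rather than facts about the evaluator: consecutive legs are summed into one daily value, annualization uses 252 trading days, and the sample standard deviation is used.

```python
import math
import statistics


def weighted_signal(samples):
    """p_up - p_down for one condition window, which equals 2 * p_up - 1."""
    p_up = sum(1 for s in samples if s >= 0) / len(samples)
    return p_up - (1 - p_up)


def leg_pnl(samples_per_condition, real_returns):
    """PnL in basis points for each interleaved leg."""
    return [
        10000 * weighted_signal(s) * r
        for s, r in zip(samples_per_condition, real_returns)
    ]


def paired_daily_pnl(legs):
    """Sum consecutive legs two at a time (assumed pairing rule)."""
    return [legs[i] + legs[i + 1] for i in range(0, len(legs) - 1, 2)]


def annualized_sharpe(daily_pnl, periods=252):
    """sqrt(periods) * mean / sample standard deviation."""
    return math.sqrt(periods) * statistics.fmean(daily_pnl) / statistics.stdev(daily_pnl)


samples = [
    [0.01, 0.02, -0.01, 0.03],    # p_up = 0.75, signal = 0.5
    [-0.02, -0.01, 0.01, -0.03],  # p_up = 0.25, signal = -0.5
    [0.01, 0.01, 0.01, 0.01],     # p_up = 1.0, signal = 1.0
    [-0.01, -0.02, 0.0, 0.01],    # p_up = 0.5, signal = 0.0
]
real = [0.01, -0.02, 0.005, 0.01]

legs = leg_pnl(samples, real)
daily = paired_daily_pnl(legs)
sharpe = annualized_sharpe(daily)
```

Note that a sample equal to zero counts toward p_up because the comparison is samples >= 0.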

Existing Example Artifacts

The repository already contains example outputs:

runs/PFE/
runs/universal/

These are useful if you want to inspect the expected folder structure and file naming before launching a new run.

Important Assumptions and Limitations

The loader can fetch data through 2026-02-15, but preprocessing currently keeps only rows with dates earlier than 2022-01-01.
The current training loop is effectively one-step-ahead. Although --pred exists, the target used during training/evaluation is batch[:, l, :], so pred=1 is the safe setting.
Universal mode uses the first pooled ticker's validation/test tensors for Phase 1 reconstruction diagnostics.
Plotting uses the non-interactive Agg backend, so figures are saved to disk rather than shown in a GUI window.
The project currently has no requirements.txt, pyproject.toml, or license file in this directory.

Quick Start

If you just want one reproducible local run, use:

python main.py --train_mode single --ticker PFE --seeds 42 --plot

Then inspect:

runs/PFE/run_notes.txt
runs/PFE/results/PFE_summary_across_seeds.csv
runs/PFE/plots/

References Inside the Codebase

Entry point: main.py
Data loading: libraries/data/loader.py
Preprocessing: libraries/data/preprocessing.py
Single-ticker training: libraries/training/fin_timegan_trainer.py
Universal training: libraries/training/fin_timegan_universal.py
Evaluation: utils/evaluation_fintimegan.py
