Most traditional fraud detection systems (like standard XGBoost models) ask a simple question: "Is this transaction bad?" They rely on static rules and patterns mined from historical data (e.g., "Transactions in Italy at 3 AM are high risk"). But fraudsters adapt, and static rules decay.
This project takes a different approach, inspired by high-end recommendation systems (like YouTube's 2016 paper). It asks: "Does this transaction logically follow this specific user's historical behavior?"
By framing fraud detection as a Self-Supervised Sequence Modeling problem, this Two-Tower neural network learns the geometry of normal human behavior. It doesn't just flag known fraud; it flags behavioral anomalies, creating a robust "Vibe Check" embedding that can supercharge downstream tree-based models.
The dataset used to train this model can be found here: 💳 Financial Transactions Dataset: Analytics (Kaggle)
The model projects both a user's history and their current transaction into the same 128-dimensional latent space to measure their compatibility.
[Context Transactions (t-128 to t-1)] ➔ [Shared Feature Encoder] ➔ [GRU Layer] ➔ Context Vector

[Target Transaction (t)] ➔ [Shared Feature Encoder] ➔ [Dense Layer (projection to GRU dim)] ➔ Target Vector

[Concat(Context Vector, Target Vector) + MLP] ➔ [Softmax] ➔ [Output: Probability of "Next Step"]
- The Shared Encoder: Acts as a tokenizer for financial transactions, mapping categorical (MCC, Country) and continuous (Amount, Time Deltas) features into a dense representation.
- The Context Tower: A GRU that processes up to 128 past transactions to build a "behavioral fingerprint" of the user's rhythm and habits.
- The Target Tower: Encodes the current transaction being evaluated.
- The Discriminator: A deep classifier with Dropout and Layer Normalization that evaluates the distance/compatibility between the Context and the Target.
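The four components above can be sketched in PyTorch. Vocabulary sizes, embedding widths, and the continuous-feature dimension below are illustrative assumptions, not values from the trained model; a sigmoid is used in place of the 2-class softmax (they are equivalent for binary output).

```python
import torch
import torch.nn as nn

class TwoTowerVibeCheck(nn.Module):
    """Sketch of the Two-Tower architecture: shared encoder, GRU context
    tower, dense target tower, and an MLP discriminator."""

    def __init__(self, n_mcc=1000, n_country=250, cont_dim=4, emb_dim=32, hidden=128):
        super().__init__()
        # Shared encoder: embeds categoricals, projects continuous features
        self.mcc_emb = nn.Embedding(n_mcc, emb_dim)
        self.country_emb = nn.Embedding(n_country, emb_dim)
        self.cont_proj = nn.Linear(cont_dim, emb_dim)
        enc_dim = emb_dim * 3
        # Context tower: GRU over up to 128 past transactions
        self.gru = nn.GRU(enc_dim, hidden, batch_first=True)
        # Target tower: dense projection to the same dim as the GRU state
        self.target_proj = nn.Linear(enc_dim, hidden)
        # Discriminator: MLP with LayerNorm and Dropout
        self.discriminator = nn.Sequential(
            nn.Linear(hidden * 2, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def encode(self, mcc, country, cont):
        # Shared "tokenizer" for transactions: dense representation
        return torch.cat(
            [self.mcc_emb(mcc), self.country_emb(country), self.cont_proj(cont)], dim=-1
        )

    def forward(self, ctx, tgt):
        # ctx: (mcc, country, cont) with a sequence dim; tgt: a single step
        ctx_enc = self.encode(*ctx)              # (B, T, enc_dim)
        _, h = self.gru(ctx_enc)                 # h: (1, B, hidden)
        context_vec = h.squeeze(0)               # behavioral fingerprint
        target_vec = self.target_proj(self.encode(*tgt))
        logit = self.discriminator(torch.cat([context_vec, target_vec], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)  # P("next step")
```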
This project wasn't trained on a massive AWS GPU cluster. It was engineered to run efficiently on consumer hardware (8 CPU cores in a slim laptop) by solving severe memory and data-loading bottlenecks:
- Interleaved Client Batching: Uses a custom `cicle_iterator` to stream materialized rolling sequence windows, ensuring batch diversity without memory explosion.
- Streaming-Aware Feature Engineering: Features like the `30-day Z-score` were designed with real-time production constraints in mind (e.g., bucketed window functions to prevent Kafka/streaming memory bottlenecks).
- Universalist Features: Intentionally avoided highly specific, "leaky" features. Relies on stationary features (Z-scores, time deltas, cyclical sine/cosine time) to ensure the model generalizes to new datasets and shifting fraud patterns.
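A minimal sketch of the interleaved batching idea (the real `cicle_iterator` streams materialized windows; the data shapes here are assumptions): round-robin across per-client window iterators so consecutive samples come from different clients, plus the cyclical time encoding mentioned above.

```python
import math

def cicle_iterator(client_windows):
    """Round-robin over per-client iterators of rolling sequence windows
    so each batch mixes many clients without loading everything at once.

    client_windows: dict mapping client_id -> iterable of windows.
    Yields (client_id, window), interleaving clients until all exhausted.
    """
    iterators = {cid: iter(ws) for cid, ws in client_windows.items()}
    while iterators:
        for cid in list(iterators):  # copy keys: we delete during iteration
            try:
                yield cid, next(iterators[cid])
            except StopIteration:
                del iterators[cid]   # this client is exhausted; drop it

def cyclical_hour(hour):
    """Encode hour-of-day (0-23) as sine/cosine so 23h and 0h are close."""
    angle = 2 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)
```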
Training a self-supervised model requires negative examples. Random noise is too easy for the model to spot.
After experimenting with MCC swapping and rotational matrices, the final strategy uses Temporal Freezing with Future Swapping:
- We keep the timestamp and time delta frozen.
- We swap the merchant, amount, and entry mode with a transaction the user actually makes 2 or 3 steps in the future.
- The Task: The model must learn that even though the user does buy coffee, they don't buy it immediately after paying for parking at this specific velocity. This forces the model to learn deep, sequential relationships rather than lazy shortcuts.
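The recipe above can be sketched as follows; the field names (`merchant`, `amount`, `entry_mode`, `timestamp`, `time_delta`) are illustrative assumptions about the schema.

```python
import copy

# Fields taken from the future transaction (illustrative schema).
SWAP_FIELDS = ("merchant", "amount", "entry_mode")

def make_negative(sequence, t, offset=2):
    """Temporal Freezing with Future Swapping (sketch).

    Keep the target's timestamp and time delta frozen, but replace its
    merchant, amount, and entry mode with those of a transaction the
    same user actually makes `offset` steps later.
    """
    target = copy.deepcopy(sequence[t])   # don't mutate the real data
    future = sequence[t + offset]
    for f in SWAP_FIELDS:
        target[f] = future[f]             # plausible values, wrong moment
    # timestamp / time_delta stay untouched -> "temporal freezing"
    return target
```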
Zero-Shot Generalization: 9 clients were completely removed from the training set to ensure the model learns universal human behavior, not just specific user habits.
- Validation AUC: `0.82` (on unseen clients)
- Fraud Detection (Downstream Task): `0.80 AUC` (Precision: 0.75, Recall: 0.70 on a balanced sample of 695 fraud / 695 legit)
Anomaly Score Distribution: When scoring 12.4 million transactions, the model naturally assigns lower probability scores to fraudulent transactions, proving it successfully identifies fraud as a behavioral break:
| Label | Avg Score | Std Dev | Total Transactions |
|---|---|---|---|
| 0 (Legit) | 0.667 | 0.155 | 12,466,649 |
| 1 (Fraud) | 0.476 | 0.195 | 12,546 |
In fraud prevention, catching fraud is easy; doing it without blocking legitimate customers (friction) is the hard part.
Simple rule mining with manual exploratory techniques surfaced two patterns:

- 89% of fraud occurs online or via magnetic-stripe swipe.
- The 25th-percentile (p25) fraud amount is $22.91, versus $10.96 for legitimate transactions.
By applying a very basic, uncalibrated rule combining the model's anomaly score with simple heuristics (score <= 0.3 AND entry_mode != chip AND amount >= $22.91), we evaluated the model across ~12.4 million transactions:
| Total Transactions | Avg Score | True Positives | False Positives | False Negatives |
|---|---|---|---|---|
| 12,479,195 | 0.667 | 1,709 | 123,247 | 10,837 |
- Precision: 1.37%
- Recall: 13.62%
- Approval Impact (Rejection Rate): 1.00%
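These metrics follow directly from the confusion counts in the table; recomputing them:

```python
# Confusion counts from the evaluation table above.
tp, fp, fn = 1_709, 123_247, 10_837
total = 12_479_195

precision = tp / (tp + fp)          # how many flagged txns were fraud
recall = tp / (tp + fn)             # how much of total fraud we caught
rejection_rate = (tp + fp) / total  # friction imposed on total traffic

print(f"Precision: {precision:.2%}")        # 1.37%
print(f"Recall: {recall:.2%}")              # 13.62%
print(f"Rejection: {rejection_rate:.2%}")   # 1.00%
```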
The Takeaway: Without any advanced threshold calibration, this lightweight 100k-parameter model can reduce fraud by over 13% while only causing friction for 1% of total traffic. It is definitely not world class, but for a 100k-parameter model with no calibration, no fine-tuning, and no task-specific features, it is a respectable result.
This technique can also serve as a guardrail in businesses where fraud is very rare, either because they are small or operate in a niche. There it is very useful, and far better than the pure guesswork of transaction limits.
This model is not meant to replace XGBoost; it is meant to feed it. By outputting the Discriminator's score (or the latent vectors themselves) as a feature into a tree-based model, we combine the behavioral intuition of deep learning with the hard rules of XGBoost, eliminating the need for noisy, manually engineered context features (e.g., `amount_spent_last_30_mins`), improving ML engineering productivity, generalization, and tolerance to drift.
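As a sketch of the stacking idea (array names and dimensions below are assumptions), the Discriminator's score and, optionally, the latent vectors are simply appended as columns before training the tree model:

```python
import numpy as np

# Illustrative stand-ins for the real feature pipeline.
rng = np.random.default_rng(0)
n = 1000
X_tabular = rng.normal(size=(n, 10))       # hand-built tabular features
anomaly_score = rng.uniform(size=(n, 1))   # Discriminator output per txn
context_vec = rng.normal(size=(n, 128))    # optional: latent vectors too

# Stack deep-learning outputs next to the tabular features.
X_boosted = np.hstack([X_tabular, anomaly_score, context_vec])
# X_boosted now goes straight into a tree-based model, e.g.:
#   xgboost.XGBClassifier().fit(X_boosted, y)
print(X_boosted.shape)  # (1000, 139)
```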
The data and architecture can be pushed further with more testing and work:
- Contrastive Pre-training (SCARF): Pre-train the shared transaction encoder using Self-Supervised Contrastive Learning for Tabular Data to improve epoch-0 embeddings.
- Target-to-Context Attention: Augment the GRU with an Attention mechanism (Target as Query, Context as Key/Value) to allow the model to "focus" on specific past transactions (e.g., learning that a user shops for groceries on Sundays, not Mondays).
- Adversarial Synthetic Generation (RL): Train a Reinforcement Learning agent to act as a "Red Team." The agent generates synthetic transactions designed to fool the Discriminator while staying within realistic bounds, effectively mapping and refining the blind spots of the latent space. (This is hard, experimental territory: existing papers may help, but it would be risky for production without exhaustive testing, and training an RL agent may be infeasible on a consumer CPU laptop.)
- Latent SMOTE: Interpolate known fraud cases within the latent space to smooth the decision boundaries for rare fraud typologies.
- Demographic Feature Integration: Extract and include features like credit limit, age, and gender. This was intentionally omitted to validate the core behavioral concept as simply as possible. Adding them would allow the model to learn demographic-specific baselines (e.g., older demographics frequenting pharmacies vs. younger demographics on gaming platforms).
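Of the directions above, Latent SMOTE is the simplest to sketch: interpolate between a fraud embedding and one of its nearest fraud neighbours in the latent space. The function name and the k-nearest-neighbour selection are assumptions for illustration.

```python
import numpy as np

def latent_smote(fraud_latents, n_new, k=5, seed=0):
    """SMOTE-style interpolation in the latent space (sketch).

    For each synthetic sample: pick a fraud embedding, pick one of its
    k nearest fraud neighbours, and interpolate a random fraction of
    the way between them.
    """
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(fraud_latents))
        z = fraud_latents[i]
        # distances to every other fraud embedding
        d = np.linalg.norm(fraud_latents - z, axis=1)
        d[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        lam = rng.uniform()                # fraction of the way to z_j
        out.append(z + lam * (fraud_latents[j] - z))
    return np.stack(out)
```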