Most traditional fraud detection systems (like standard XGBoost models) ask a simple question: "Is this transaction bad?" They rely on static rules and patterns mined from historical data (e.g., "Transactions in Italy at 3 AM are high risk"). But fraudsters adapt, and static rules decay.
This project takes a different approach, inspired by high-end recommendation systems (like YouTube's 2016 paper). It asks: "Does this transaction logically follow this specific user's historical behavior?"
By framing fraud detection as a Self-Supervised Sequence Modeling problem, this Two-Tower neural network learns the geometry of normal human behavior. It doesn't just flag known fraud; it flags behavioral anomalies, creating a robust "Vibe Check" embedding that can supercharge downstream tree-based models.
The dataset used to train this model can be found here: 💳 Financial Transactions Dataset: Analytics (Kaggle)
The model projects both a user's history and their current transaction into the same 128-dimensional latent space to measure their compatibility.
[Context Transactions (t-128 to t-1)] ➔ [Shared Feature Encoder] ➔ [GRU Layer] ➔ Context Vector

[Target Transaction (t)] ➔ [Shared Feature Encoder] ➔ [Dense Layer (projection to GRU dim)] ➔ Target Vector

[Concat(Context Vector, Target Vector) + MLP] ➔ [Softmax] ➔ [Output: Probability of "Next Step"]
- The Shared Encoder: Acts as a tokenizer for financial transactions, mapping categorical (MCC, Country) and continuous (Amount, Time Deltas) features into a dense representation.
- The Context Tower: A GRU that processes up to 128 past transactions to build a "behavioral fingerprint" of the user's rhythm and habits.
- The Target Tower: Encodes the current transaction being evaluated.
- The Discriminator: A deep classifier with Dropout and Layer Normalization that evaluates the distance/compatibility between the Context and the Target.
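The four components above can be sketched in PyTorch. Vocabulary sizes, embedding widths, and the continuous-feature dimension below are illustrative assumptions, not values from the trained model; a sigmoid is used in place of the 2-class softmax (they are equivalent for binary output).

```python
import torch
import torch.nn as nn

class TwoTowerVibeCheck(nn.Module):
    """Sketch of the Two-Tower architecture: shared encoder, GRU context
    tower, dense target tower, and an MLP discriminator."""

    def __init__(self, n_mcc=1000, n_country=250, cont_dim=4, emb_dim=32, hidden=128):
        super().__init__()
        # Shared encoder: embeds categoricals, projects continuous features
        self.mcc_emb = nn.Embedding(n_mcc, emb_dim)
        self.country_emb = nn.Embedding(n_country, emb_dim)
        self.cont_proj = nn.Linear(cont_dim, emb_dim)
        enc_dim = emb_dim * 3
        # Context tower: GRU over up to 128 past transactions
        self.gru = nn.GRU(enc_dim, hidden, batch_first=True)
        # Target tower: dense projection to the same dim as the GRU state
        self.target_proj = nn.Linear(enc_dim, hidden)
        # Discriminator: MLP with LayerNorm and Dropout
        self.discriminator = nn.Sequential(
            nn.Linear(hidden * 2, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def encode(self, mcc, country, cont):
        # Shared "tokenizer" for transactions: dense representation
        return torch.cat(
            [self.mcc_emb(mcc), self.country_emb(country), self.cont_proj(cont)], dim=-1
        )

    def forward(self, ctx, tgt):
        # ctx: (mcc, country, cont) with a sequence dim; tgt: a single step
        ctx_enc = self.encode(*ctx)              # (B, T, enc_dim)
        _, h = self.gru(ctx_enc)                 # h: (1, B, hidden)
        context_vec = h.squeeze(0)               # behavioral fingerprint
        target_vec = self.target_proj(self.encode(*tgt))
        logit = self.discriminator(torch.cat([context_vec, target_vec], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)  # P("next step")
```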
This project wasn't trained on a massive AWS GPU cluster. It was engineered to run efficiently on consumer hardware (8 CPU cores in a slim laptop) by solving severe memory and data-loading bottlenecks:
- Interleaved Client Batching: Uses a custom `cicle_iterator` to stream materialized rolling sequence windows, ensuring batch diversity without memory explosion.
- Streaming-Aware Feature Engineering: Features like the `30-day Z-score` were designed with real-time production constraints in mind (e.g., bucketed window functions to prevent Kafka/streaming memory bottlenecks).
- Universalist Features: Intentionally avoided highly specific, "leaky" features. Relies on stationary features (Z-scores, time deltas, cyclical sine/cosine time) to ensure the model generalizes to new datasets and shifting fraud patterns.
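A minimal sketch of the interleaved batching idea (the real `cicle_iterator` streams materialized windows; the data shapes here are assumptions): round-robin across per-client window iterators so consecutive samples come from different clients, plus the cyclical time encoding mentioned above.

```python
import math

def cicle_iterator(client_windows):
    """Round-robin over per-client iterators of rolling sequence windows
    so each batch mixes many clients without loading everything at once.

    client_windows: dict mapping client_id -> iterable of windows.
    Yields (client_id, window), interleaving clients until all exhausted.
    """
    iterators = {cid: iter(ws) for cid, ws in client_windows.items()}
    while iterators:
        for cid in list(iterators):  # copy keys: we delete during iteration
            try:
                yield cid, next(iterators[cid])
            except StopIteration:
                del iterators[cid]   # this client is exhausted; drop it

def cyclical_hour(hour):
    """Encode hour-of-day (0-23) as sine/cosine so 23h and 0h are close."""
    angle = 2 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)
```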
Training a self-supervised model requires negative examples. Random noise is too easy for the model to spot.
After experimenting with MCC swapping and rotational matrices, the final strategy uses Temporal Freezing with Future Swapping:
- We keep the timestamp and time delta frozen.
- We swap the merchant, amount, and entry mode with a transaction the user actually makes 2 or 3 steps in the future.
- The Task: The model must learn that even though the user does buy coffee, they don't buy it immediately after paying for parking at this specific velocity. This forces the model to learn deep, sequential relationships rather than lazy shortcuts.
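The recipe above can be sketched as follows; the field names (`merchant`, `amount`, `entry_mode`, `timestamp`, `time_delta`) are illustrative assumptions about the schema.

```python
import copy

# Fields taken from the future transaction (illustrative schema).
SWAP_FIELDS = ("merchant", "amount", "entry_mode")

def make_negative(sequence, t, offset=2):
    """Temporal Freezing with Future Swapping (sketch).

    Keep the target's timestamp and time delta frozen, but replace its
    merchant, amount, and entry mode with those of a transaction the
    same user actually makes `offset` steps later.
    """
    target = copy.deepcopy(sequence[t])   # don't mutate the real data
    future = sequence[t + offset]
    for f in SWAP_FIELDS:
        target[f] = future[f]             # plausible values, wrong moment
    # timestamp / time_delta stay untouched -> "temporal freezing"
    return target
```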
Zero-Shot Generalization: 9 clients were completely removed from the training set to ensure the model learns universal human behavior, not just specific user habits.
- Validation AUC: `0.82` (on unseen clients)
- Fraud Detection (Downstream Task): `0.80 AUC` (Precision: 0.75, Recall: 0.70 on a balanced sample of 695 fraud / 695 legit)
Anomaly Score Distribution: When scoring 12.4 million transactions, the model naturally assigns lower probability scores to fraudulent transactions, proving it successfully identifies fraud as a behavioral break:
| Label | Avg Score | Std Dev | Total Transactions |
|---|---|---|---|
| 0 (Legit) | 0.667 | 0.155 | 12,466,649 |
| 1 (Fraud) | 0.476 | 0.195 | 12,546 |
In fraud prevention, catching fraud is easy; doing it without blocking legitimate customers (friction) is the hard part.
Simple rule mining with manual exploratory techniques surfaced two patterns:

- 89% of fraud occurs online or via magnetic-stripe swipe.
- The 25th-percentile (p25) fraud amount is $22.91, versus $10.96 for legitimate transactions.
By applying a very basic, uncalibrated rule combining the model's anomaly score with simple heuristics (score <= 0.3 AND entry_mode != chip AND amount >= $22.91), we evaluated the model across ~12.4 million transactions:
| Total Transactions | Avg Score | True Positives | False Positives | False Negatives |
|---|---|---|---|---|
| 12,479,195 | 0.667 | 1,709 | 123,247 | 10,837 |
- Precision: 1.37%
- Recall: 13.62%
- Approval Impact (Rejection Rate): 1.00%
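These metrics follow directly from the confusion counts in the table; recomputing them:

```python
# Confusion counts from the evaluation table above.
tp, fp, fn = 1_709, 123_247, 10_837
total = 12_479_195

precision = tp / (tp + fp)          # how many flagged txns were fraud
recall = tp / (tp + fn)             # how much of total fraud we caught
rejection_rate = (tp + fp) / total  # friction imposed on total traffic

print(f"Precision: {precision:.2%}")        # 1.37%
print(f"Recall: {recall:.2%}")              # 13.62%
print(f"Rejection: {rejection_rate:.2%}")   # 1.00%
```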
The Takeaway: Without any advanced threshold calibration, this lightweight 100k-parameter model can reduce fraud by over 13% while only causing friction for 1% of total traffic. It is definitely not world class, but for a 100k-parameter model with no calibration, no fine-tuning, and no task-specific features, it is a respectable result.
This technique can also serve as a guardrail in businesses where fraud is very rare, either because they are small or operate in a niche. There it is very useful, and far better than the pure guesswork of transaction limits.
This model is not meant to replace XGBoost; it is meant to feed it. By outputting the Discriminator's score (or the latent vectors themselves) as a feature into a tree-based model, we combine the behavioral intuition of deep learning with the hard rules of XGBoost, eliminating the need for noisy, manually engineered context features (e.g., `amount_spent_last_30_mins`), improving ML engineering productivity, generalization, and tolerance to drift.
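As a sketch of the stacking idea (array names and dimensions below are assumptions), the Discriminator's score and, optionally, the latent vectors are simply appended as columns before training the tree model:

```python
import numpy as np

# Illustrative stand-ins for the real feature pipeline.
rng = np.random.default_rng(0)
n = 1000
X_tabular = rng.normal(size=(n, 10))       # hand-built tabular features
anomaly_score = rng.uniform(size=(n, 1))   # Discriminator output per txn
context_vec = rng.normal(size=(n, 128))    # optional: latent vectors too

# Stack deep-learning outputs next to the tabular features.
X_boosted = np.hstack([X_tabular, anomaly_score, context_vec])
# X_boosted now goes straight into a tree-based model, e.g.:
#   xgboost.XGBClassifier().fit(X_boosted, y)
print(X_boosted.shape)  # (1000, 139)
```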
The data and architecture can be pushed further with more testing and work:
- Contrastive Pre-training (SCARF): Pre-train the shared transaction encoder using Self-Supervised Contrastive Learning for Tabular Data to improve epoch-0 embeddings.
- Target-to-Context Attention: Augment the GRU with an Attention mechanism (Target as Query, Context as Key/Value) to allow the model to "focus" on specific past transactions (e.g., learning that a user shops for groceries on Sundays, not Mondays).
- Adversarial Synthetic Generation (RL): Train a Reinforcement Learning agent to act as a "Red Team." The agent generates synthetic transactions designed to fool the Discriminator while staying within realistic bounds, effectively mapping and refining the blind spots of the latent space. (This is hard, experimental territory: existing papers may help, but it would be risky for production without exhaustive testing, and training an RL agent may be infeasible on a consumer CPU laptop.)
- Latent SMOTE: Interpolate known fraud cases within the latent space to smooth the decision boundaries for rare fraud typologies.
- Demographic Feature Integration: Extract and include features like credit limit, age, and gender. This was intentionally omitted to validate the core behavioral concept as simply as possible. Adding them would allow the model to learn demographic-specific baselines (e.g., older demographics frequenting pharmacies vs. younger demographics on gaming platforms).
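Of the directions above, Latent SMOTE is the simplest to sketch: interpolate between a fraud embedding and one of its nearest fraud neighbours in the latent space. The function name and the k-nearest-neighbour selection are assumptions for illustration.

```python
import numpy as np

def latent_smote(fraud_latents, n_new, k=5, seed=0):
    """SMOTE-style interpolation in the latent space (sketch).

    For each synthetic sample: pick a fraud embedding, pick one of its
    k nearest fraud neighbours, and interpolate a random fraction of
    the way between them.
    """
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(fraud_latents))
        z = fraud_latents[i]
        # distances to every other fraud embedding
        d = np.linalg.norm(fraud_latents - z, axis=1)
        d[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        lam = rng.uniform()                # fraction of the way to z_j
        out.append(z + lam * (fraud_latents[j] - z))
    return np.stack(out)
```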