Model Architecture Comparison

ROCKET:
Input Time Series → Random Kernels → Convolution → Feature Vector → Ridge Classifier → Output
                    (10,000 kernels)               (Max + PPV: 20k features)

TS2Vec:
Input Time Series → Linear Embedding → Transformer Encoder → Contrastive Learning → Representation
                                       (Standard Attention)  (Hierarchical)

ConvTran:
Input Time Series → 2D Conv Embedding → Positional Encoding → Transformer Encoder → Classification
                                        (Absolute / tAPE / Learnable)

SwinTime:
Input Time Series → Dual-Path Pipeline → Swin Transformer → Feature Fusion → Classification
                    (Cross-Channel + Patch) (Multi-scale Conv)

SwinTime Architecture (detailed):
+----------------------------+
| Input Time Series |
| [B, T, C] |
+----------------------------+
|
v
+----------------------------+
| Dual-Path Pipeline |
+----------------------------+
| |
v v
+-------------+ +-------------+
|Cross-Channel| | Patch |
| Extractor | | Embedding |
+-------------+ +-------------+
| |
v v
+-------------+ +-------------+
| Channel | | Multi-scale |
| Attention | | Patching |
| + Mixing | | [3,6,9,12] |
+-------------+ +-------------+
| |
v v
+-------------+ +-------------+
| Temporal | | Swin Time |
| Statistics | | Blocks |
+-------------+ +-------------+
| |
v v
+-------------+ +-------------+
|Cross-Channel| | Global |
| Projection | | Pooling |
+-------------+ +-------------+
| |
+-------------------+
|
v
+----------------------------+
| Feature Fusion |
| [Concat + MLP + Norm] |
+----------------------------+
|
v
+----------------------------+
| Classification |
| [Dropout + Linear] |
+----------------------------+
SwinTime Components Detail
1. Cross-Channel Extractor
Input [B, T, C]
|
├── Channel Attention
│ ├── AdaptiveAvgPool1d(1)
│ ├── Conv1d(C, hidden_C, 1) → ReLU → Dropout
│ ├── Conv1d(hidden_C, C, 1) → Sigmoid
│ └── Apply attention weights
│
├── Channel Mixer
│ ├── Linear(C, mixer_hidden) → LayerNorm → GELU → Dropout
│ ├── Linear(mixer_hidden, mixer_hidden//2) → LayerNorm → GELU → Dropout
│ └── Linear(mixer_hidden//2, cross_channel_dim//2)
│
├── Temporal Statistics
│ ├── Compute: [mean, max, min, std, range] across time dimension
│ ├── Concat → [B, C*5]
│ ├── Linear(C*5, temp_hidden) → LayerNorm → GELU → Dropout
│ └── Linear(temp_hidden, cross_channel_dim//2)
│
└── Channel Interaction
├── Linear(C, interaction_dim) → LayerNorm → GELU → Dropout
├── Linear(interaction_dim, C)
└── Residual connection
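The Channel Attention stage above is a squeeze-and-excite pattern: pool away time, score each channel, then reweight the input. A minimal numpy sketch (shapes and weights are illustrative; the 1x1 convolutions become plain matrix products once the time dimension is pooled away, and dropout is omitted):

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Squeeze-and-excite style channel attention, mirroring the
    AdaptiveAvgPool1d -> Conv1d -> ReLU -> Conv1d -> Sigmoid chain.
    x: [B, T, C]; w1: [C, hidden_C]; w2: [hidden_C, C]."""
    squeezed = x.mean(axis=1)                       # AdaptiveAvgPool1d(1): [B, C]
    hidden = np.maximum(squeezed @ w1, 0.0)         # Conv1d(C, hidden_C, 1) + ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # Conv1d(hidden_C, C, 1) + Sigmoid
    return x * weights[:, None, :]                  # apply per-channel attention weights

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16, 4))                 # [B=2, T=16, C=4]
out = channel_attention(x, rng.standard_normal((4, 2)), rng.standard_normal((2, 4)))
print(out.shape)  # (2, 16, 4)
```

Because the attention weights pass through a sigmoid, each channel is scaled by a factor in (0, 1); the shape of the input is preserved.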
2. Patch Embedding
Input [B, T, C]
|
├── Multi-scale Conv1D Branches
│ ├── Branch 1: Conv1d(C, embed_dim//4, kernel=3, padding=1)
│ ├── Branch 2: Conv1d(C, embed_dim//4, kernel=6, padding=3)
│ ├── Branch 3: Conv1d(C, embed_dim//4, kernel=9, padding=4)
│ └── Branch 4: Conv1d(C, embed_dim//4, kernel=12, padding=6)
│ Each: → BatchNorm1d → GELU
│
├── Adaptive Pooling
│ └── Pool each branch to target_patches (default: 20)
│
├── Concatenate branches → [B, target_patches, embed_dim]
│
└── Enhanced Projection
├── Linear(embed_dim, embed_dim*2) → GELU
├── Linear(embed_dim*2, embed_dim)
└── LayerNorm(embed_dim)
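The adaptive pooling step is what makes the four branches concatenable regardless of kernel size: each branch's output is averaged down to a fixed number of patches. A numpy sketch of that step (shapes are illustrative; window boundaries follow PyTorch's AdaptiveAvgPool1d convention):

```python
import numpy as np

def adaptive_avg_pool(x, target):
    """Rough equivalent of nn.AdaptiveAvgPool1d: pool a length-T sequence
    down to `target` patches by averaging over near-equal windows.
    x: [B, T, D] -> [B, target, D]."""
    B, T, D = x.shape
    out = np.empty((B, target, D))
    for i in range(target):
        lo = (i * T) // target              # floor(i * T / target)
        hi = -(-((i + 1) * T) // target)    # ceil((i + 1) * T / target)
        out[:, i] = x[:, lo:hi].mean(axis=1)
    return out

rng = np.random.default_rng(1)
branch = rng.standard_normal((2, 50, 16))   # one conv branch: [B, T, embed_dim//4]
patches = adaptive_avg_pool(branch, 20)     # pooled to target_patches = 20
print(patches.shape)  # (2, 20, 16)
```

Concatenating the four pooled branches along the feature axis then yields the [B, target_patches, embed_dim] tensor the projection expects.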
3. Swin Time Block
Input [B, patches, embed_dim]
|
├── Multi-scale Convolution Path
│ ├── Conv1d(embed_dim, embed_dim, kernel=3, groups=embed_dim//16)
│ ├── Conv1d(embed_dim, embed_dim, kernel=5, groups=embed_dim//16)
│ ├── Conv1d(embed_dim, embed_dim, kernel=7, groups=embed_dim//16)
│ ├── Concat → Conv1d(embed_dim*3, embed_dim, kernel=1)
│ ├── BatchNorm1d → GELU
│ └── Residual + LayerNorm
│
├── Multi-Head Attention
│ ├── MultiheadAttention(embed_dim, num_heads, dropout)
│ ├── Residual connection
│ └── LayerNorm
│
└── MLP Path
├── Linear(embed_dim, embed_dim*4) → LayerNorm → GELU → Dropout
├── Linear(embed_dim*4, embed_dim) → Dropout
├── Residual connection
└── LayerNorm
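Each of the block's three paths ends with the same residual + LayerNorm pattern. A numpy sketch of the MLP path (expand 4x, GELU, project back, residual, LayerNorm; weights are scaled-down random values for illustration, and dropout/inner LayerNorm are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the last (feature) dimension, as nn.LayerNorm does."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_path(x, w1, w2):
    """MLP path of the block: x -> 4x expansion -> GELU -> projection,
    then residual connection + LayerNorm. x: [B, P, D]."""
    h = x @ w1                                # Linear(embed_dim, embed_dim*4)
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU (tanh approx)
    return layer_norm(x + h @ w2)             # Linear back down, residual, LayerNorm

rng = np.random.default_rng(2)
x = rng.standard_normal((2, 20, 8))           # [B, patches, embed_dim]
y = mlp_path(x, 0.1 * rng.standard_normal((8, 32)), 0.1 * rng.standard_normal((32, 8)))
print(y.shape)  # (2, 20, 8)
```

The final LayerNorm guarantees each patch vector leaves the block with zero mean and unit variance across features.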
4. Feature Fusion
Cross-Channel Features [B, cross_channel_dim]
Patch Features [B, embed_dim] (from global pooling)
|
├── Cross-Channel Projection
│ ├── Linear(cross_channel_dim, embed_dim*2) → GELU → Dropout
│ └── Linear(embed_dim*2, embed_dim) → GELU
│
├── Concatenate → [B, embed_dim*2]
│
└── Feature Fusion MLP
├── Linear(embed_dim*2, embed_dim*2) → LayerNorm → GELU → Dropout
└── Linear(embed_dim*2, embed_dim)
5. Classification Head
Fused Features [B, embed_dim]
|
├── Linear(embed_dim, embed_dim//2) → LayerNorm → GELU → Dropout
└── Linear(embed_dim//2, num_classes)
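The fusion and classification stages are mostly about getting the shapes to line up. A numpy shape walkthrough with illustrative sizes (cross_channel_dim=64, embed_dim=128, num_classes=5 are assumptions; GELU is simplified to ReLU and norms/dropout are omitted):

```python
import numpy as np

B, cross_channel_dim, embed_dim, num_classes = 2, 64, 128, 5
rng = np.random.default_rng(3)

cross = rng.standard_normal((B, cross_channel_dim))  # cross-channel path output
patch = rng.standard_normal((B, embed_dim))          # globally pooled patch path output

# Cross-channel projection: cross_channel_dim -> embed_dim*2 -> embed_dim
cross = np.maximum(cross @ rng.standard_normal((cross_channel_dim, embed_dim * 2)), 0)
cross = np.maximum(cross @ rng.standard_normal((embed_dim * 2, embed_dim)), 0)

# Concatenate the two pathways, then the fusion MLP: 2*embed_dim -> 2*embed_dim -> embed_dim
fused = np.concatenate([cross, patch], axis=1)       # [B, embed_dim*2]
fused = np.maximum(fused @ rng.standard_normal((embed_dim * 2, embed_dim * 2)), 0)
fused = fused @ rng.standard_normal((embed_dim * 2, embed_dim))

# Classification head: embed_dim -> embed_dim//2 -> num_classes
hidden = np.maximum(fused @ rng.standard_normal((embed_dim, embed_dim // 2)), 0)
logits = hidden @ rng.standard_normal((embed_dim // 2, num_classes))
print(logits.shape)  # (2, 5)
```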
Embedding Layer Comparison

ROCKET:
- Random kernels: 10,000 kernels with random weights, lengths, and dilations
- Feature extraction: max pooling + PPV calculation
- Ridge classifier on the resulting features

TS2Vec:
- Linear layer (input → hidden)

ConvTran:
- Conv2D(1, emb*4, kernel=[1, 8]) → BatchNorm2D → GELU
- Conv2D(emb*4, emb, kernel=[ch, 1]) → BatchNorm2D → GELU

SwinTime (dual-path):
1. Cross-Channel: channel attention, channel mixing, temporal statistics
2. Multi-Patch: multi-scale Conv1D branches + projection
SwinTime Component Flow Summary

1. Cross-Channel Extractor:
Input [B, T, C] → Channel Attention → Channel Mixing → Temporal Stats → Interactive Features → Cross-Channel Features

2. Patch Embedding:
Input [B, T, C] → Multi-scale Conv1D (kernels 3, 6, 9, 12) → Adaptive Pooling → Projection → Patch Features

3. Swin Time Block:
Input → Multi-scale Conv (kernels 3, 5, 7) → Attention → MLP → Layer Normalization → Output

4. Feature Fusion:
Cross-Channel Features + Patch Features → Concat → MLP → Fused Features
ROCKET (Random Convolutional Kernel Transform)
Random Kernel Generation : 10,000 randomly initialized 1D convolutional kernels
Diverse Parameters : Random lengths, weights, biases, and dilations
Simple Features : Only 2 features per kernel (max value + PPV)
Fast Training : No backpropagation, only Ridge regression
Surprising Effectiveness : Simple approach achieves strong results
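The two features per kernel are simple enough to state exactly. A numpy sketch of ROCKET-style feature extraction for a single univariate series (10 kernels stand in for the full 10,000; fixed kernel length 9 and the dilation choices are simplifications of ROCKET's random sampling):

```python
import numpy as np

def rocket_features(x, kernels):
    """Two features per kernel, as ROCKET computes them: the global max of
    the convolution output and PPV (proportion of positive values).
    x: [T]; kernels: list of (weights, bias, dilation)."""
    feats = []
    for w, b, d in kernels:
        dilated = np.zeros((len(w) - 1) * d + 1)   # dilate by inserting zeros
        dilated[::d] = w
        # flip so np.convolve performs the sliding dot product (cross-correlation)
        conv = np.convolve(x, dilated[::-1], mode="valid") + b
        feats.append(conv.max())                   # max value
        feats.append((conv > 0).mean())            # PPV
    return np.array(feats)

rng = np.random.default_rng(4)
x = rng.standard_normal(100)
kernels = [(rng.standard_normal(9), rng.uniform(-1, 1), rng.choice([1, 2, 4]))
           for _ in range(10)]                     # 10 kernels stand in for 10,000
print(rocket_features(x, kernels).shape)  # (20,)
```

With the full 10,000 kernels this yields the 20,000-dimensional feature vector that the ridge classifier is fitted on; no gradients flow through the kernels.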
TS2Vec
Contrastive Learning : Hierarchical contrasting at multiple scales
Transformer Encoder : Standard multi-head attention mechanism
Representation Learning : Unsupervised feature extraction
Temporal Consistency : Maintains temporal relationships
ConvTran
2D Convolution : Treats the time series as an image-like 2D input
Positional Encoding : Multiple types (absolute, tAPE, learnable)
Supervised Learning : Direct classification training
Flexible Architecture : Various encoding strategies
SwinTime
Dual-Path Architecture : Combines cross-channel analysis with multi-scale patching
Cross-Channel Extractor : Captures channel interactions and temporal statistics
Multi-scale Patching : Uses multiple patch sizes for different temporal patterns
Swin Transformer Blocks : Efficient attention with multi-scale convolutions
Feature Fusion : Combination of both pathways