Model Architecture Comparison

ROCKET

Input Time Series → Random Kernels → Convolution → Feature Vector → Ridge Classifier → Output
                   (10,000 kernels)   + PPV + Max   (20k features)

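The ROCKET pipeline above can be sketched in a few lines of NumPy. This is a toy version (100 kernels instead of 10,000, univariate input, a simplified dilation rule), not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rocket_features(series, n_kernels=100):
    """Toy ROCKET transform for a univariate series: random kernels with
    random lengths, weights, biases and dilations; two features per kernel
    (max pooling + PPV, the proportion of positive values)."""
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(size=length)
        weights -= weights.mean()            # mean-centred weights
        bias = rng.uniform(-1, 1)
        # dilation sampled on a log scale so the kernel fits in the series
        dilation = int(2 ** rng.uniform(0, np.log2((len(series) - 1) / (length - 1))))
        span = (length - 1) * dilation
        conv = np.array([
            np.dot(weights, series[i:i + span + 1:dilation]) + bias
            for i in range(len(series) - span)
        ])
        feats.append(conv.max())             # max pooling
        feats.append((conv > 0).mean())      # PPV
    return np.array(feats)

x = np.sin(np.linspace(0, 8 * np.pi, 256))
f = rocket_features(x)
print(f.shape)  # (200,) — two features per kernel
```

Because the kernels are never trained, the only fitted component downstream is the Ridge classifier on these features.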
TS2Vec

Input Time Series → Linear Embedding → Transformer Encoder → Contrastive Learning → Representation
                                      (Standard Attention)    (Hierarchical)

ConvTran

Input Time Series → 2D Conv Embedding → Positional Encoding → Transformer Encoder → Classification
                                        (Absolute/tAPE/Learn)   (Various Encoding)

SwinTime

Input Time Series → Dual-Path Pipeline → Swin Transformer → Feature Fusion → Classification
                   (Cross-Ch + Patch)    (Multi-scale Conv)
+----------------------------+
|     Input Time Series      |
|       [B, T, C]            |
+----------------------------+
              |
              v
+----------------------------+
|      Dual-Path Pipeline    |
+----------------------------+
     |                   |
     v                   v
+-------------+    +-------------+
|Cross-Channel|    |   Patch     |
| Extractor   |    | Embedding   |
+-------------+    +-------------+
     |                   |
     v                   v
+-------------+    +-------------+
|  Channel    |    | Multi-scale |
| Attention   |    |  Patching   |
| + Mixing    |    | [3,6,9,12]  |
+-------------+    +-------------+
     |                   |
     v                   v
+-------------+    +-------------+
| Temporal    |    | Swin Time   |
| Statistics  |    |   Blocks    |
+-------------+    +-------------+
     |                   |
     v                   v
+-------------+    +-------------+
|Cross-Channel|    |   Global    |
| Projection  |    |  Pooling    |
+-------------+    +-------------+
     |                   |
     +-------------------+
              |
              v
+----------------------------+
|      Feature Fusion        |
|   [Concat + MLP + Norm]    |
+----------------------------+
              |
              v
+----------------------------+
|     Classification         |
|    [Dropout + Linear]      |
+----------------------------+

SwinTime Components Detail

1. Cross-Channel Extractor

Input [B, T, C]
    |
    ├── Channel Attention
    │   ├── AdaptiveAvgPool1d(1)
    │   ├── Conv1d(C, hidden_C, 1) → ReLU → Dropout
    │   ├── Conv1d(hidden_C, C, 1) → Sigmoid
    │   └── Apply attention weights
    │
    ├── Channel Mixer
    │   ├── Linear(C, mixer_hidden) → LayerNorm → GELU → Dropout
    │   ├── Linear(mixer_hidden, mixer_hidden//2) → LayerNorm → GELU → Dropout  
    │   └── Linear(mixer_hidden//2, cross_channel_dim//2)
    │
    ├── Temporal Statistics
    │   ├── Compute: [mean, max, min, std, range] across time dimension
    │   ├── Concat → [B, C*5]
    │   ├── Linear(C*5, temp_hidden) → LayerNorm → GELU → Dropout
    │   └── Linear(temp_hidden, cross_channel_dim//2)
    │
    └── Channel Interaction
        ├── Linear(C, interaction_dim) → LayerNorm → GELU → Dropout
        ├── Linear(interaction_dim, C)
        └── Residual connection

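The Channel Attention and Temporal Statistics sub-modules above can be sketched in PyTorch as follows; `hidden_c` and the dropout rate are illustrative placeholders, not values taken from the source:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gating over channels, following the
    AdaptiveAvgPool1d -> Conv1d -> ReLU -> Conv1d -> Sigmoid recipe above."""
    def __init__(self, channels: int, hidden_c: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),           # [B, C, T] -> [B, C, 1]
            nn.Conv1d(channels, hidden_c, 1),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Conv1d(hidden_c, channels, 1),
            nn.Sigmoid(),                      # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xc = x.transpose(1, 2)                 # [B, T, C] -> [B, C, T]
        weights = self.net(xc)                 # [B, C, 1]
        return (xc * weights).transpose(1, 2)  # reweighted, back to [B, T, C]

def temporal_statistics(x: torch.Tensor) -> torch.Tensor:
    """[mean, max, min, std, range] per channel, concatenated to [B, C*5]."""
    stats = [x.mean(dim=1), x.amax(dim=1), x.amin(dim=1), x.std(dim=1),
             x.amax(dim=1) - x.amin(dim=1)]
    return torch.cat(stats, dim=1)

x = torch.randn(4, 128, 6)                     # B=4, T=128, C=6
out = ChannelAttention(channels=6)(x)
print(out.shape, temporal_statistics(x).shape)
```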
2. Patch Embedding

Input [B, T, C]
    |
    ├── Multi-scale Conv1D Branches
    │   ├── Branch 1: Conv1d(C, embed_dim//4, kernel=3, padding=1)
    │   ├── Branch 2: Conv1d(C, embed_dim//4, kernel=6, padding=3)
    │   ├── Branch 3: Conv1d(C, embed_dim//4, kernel=9, padding=4)
    │   └── Branch 4: Conv1d(C, embed_dim//4, kernel=12, padding=6)
    │       Each: → BatchNorm1d → GELU
    │
    ├── Adaptive Pooling
    │   └── Pool each branch to target_patches (default: 20)
    │
    ├── Concatenate branches → [B, target_patches, embed_dim]
    │
    └── Enhanced Projection
        ├── Linear(embed_dim, embed_dim*2) → GELU
        ├── Linear(embed_dim*2, embed_dim)
        └── LayerNorm(embed_dim)

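The multi-scale patch embedding above can be sketched like this; `embed_dim` and `target_patches` defaults mirror the diagram, while the pooling type (average) is an assumption:

```python
import torch
import torch.nn as nn

class MultiScalePatchEmbedding(nn.Module):
    """Four parallel Conv1d branches (kernels 3/6/9/12), each yielding
    embed_dim//4 channels, pooled to a fixed patch count and concatenated,
    then run through the enhanced projection from the diagram."""
    def __init__(self, in_channels: int, embed_dim: int = 64, target_patches: int = 20):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, embed_dim // 4, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(embed_dim // 4),
                nn.GELU(),
            )
            for k in (3, 6, 9, 12)
        ])
        self.pool = nn.AdaptiveAvgPool1d(target_patches)
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 2),
            nn.GELU(),
            nn.Linear(embed_dim * 2, embed_dim),
            nn.LayerNorm(embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xc = x.transpose(1, 2)                             # [B, C, T]
        feats = [self.pool(branch(xc)) for branch in self.branches]
        patches = torch.cat(feats, dim=1).transpose(1, 2)  # [B, patches, embed_dim]
        return self.proj(patches)

x = torch.randn(4, 128, 6)
emb = MultiScalePatchEmbedding(6)(x)
print(emb.shape)  # torch.Size([4, 20, 64])
```

The adaptive pooling makes every branch emit the same patch count regardless of kernel size, so branches with different receptive fields concatenate cleanly.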
3. SwinTime Block

Input [B, patches, embed_dim]
    |
    ├── Multi-scale Convolution Path
    │   ├── Conv1d(embed_dim, embed_dim, kernel=3, groups=embed_dim//16)
    │   ├── Conv1d(embed_dim, embed_dim, kernel=5, groups=embed_dim//16)  
    │   ├── Conv1d(embed_dim, embed_dim, kernel=7, groups=embed_dim//16)
    │   ├── Concat → Conv1d(embed_dim*3, embed_dim, kernel=1)
    │   ├── BatchNorm1d → GELU
    │   └── Residual + LayerNorm
    │
    ├── Multi-Head Attention
    │   ├── MultiheadAttention(embed_dim, num_heads, dropout)
    │   ├── Residual connection
    │   └── LayerNorm
    │
    └── MLP Path
        ├── Linear(embed_dim, embed_dim*4) → LayerNorm → GELU → Dropout
        ├── Linear(embed_dim*4, embed_dim) → Dropout
        ├── Residual connection
        └── LayerNorm

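A minimal sketch of one SwinTime block as diagrammed: grouped multi-scale convs, multi-head self-attention, and an MLP, each followed by a residual + LayerNorm. The head count and dropout rate here are assumed defaults:

```python
import torch
import torch.nn as nn

class SwinTimeBlock(nn.Module):
    def __init__(self, embed_dim: int = 64, num_heads: int = 4, p_drop: float = 0.1):
        super().__init__()
        groups = max(embed_dim // 16, 1)
        # Multi-scale convolution path: kernels 3/5/7, grouped as in the diagram
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, embed_dim, k, padding=k // 2, groups=groups)
            for k in (3, 5, 7)
        ])
        self.merge = nn.Sequential(
            nn.Conv1d(embed_dim * 3, embed_dim, 1),
            nn.BatchNorm1d(embed_dim),
            nn.GELU(),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=p_drop, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(embed_dim * 4, embed_dim), nn.Dropout(p_drop),
        )
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, patches, embed_dim]; convs expect [B, embed_dim, patches]
        xc = x.transpose(1, 2)
        conv = self.merge(torch.cat([c(xc) for c in self.convs], dim=1)).transpose(1, 2)
        x = self.norm1(x + conv)               # residual + LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm2(x + attn_out)
        return self.norm3(x + self.mlp(x))

y = SwinTimeBlock()(torch.randn(2, 20, 64))
print(y.shape)  # torch.Size([2, 20, 64])
```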
4. Feature Fusion

Cross-Channel Features [B, cross_channel_dim]
Patch Features [B, embed_dim] (from global pooling)
    |
    ├── Cross-Channel Projection
    │   ├── Linear(cross_channel_dim, embed_dim*2) → GELU → Dropout
    │   └── Linear(embed_dim*2, embed_dim) → GELU
    │
    ├── Concatenate → [B, embed_dim*2]
    │
    └── Feature Fusion MLP
        ├── Linear(embed_dim*2, embed_dim*2) → LayerNorm → GELU → Dropout
        └── Linear(embed_dim*2, embed_dim)

5. Classification Head

Fused Features [B, embed_dim]
    |
    ├── Linear(embed_dim, embed_dim//2) → LayerNorm → GELU → Dropout
    └── Linear(embed_dim//2, num_classes)

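Sections 4 and 5 can be sketched together: project the cross-channel features to `embed_dim`, concatenate with the pooled patch features, run the fusion MLP, then the two-layer head. `cross_dim`, `embed_dim`, and `num_classes` below are illustrative sizes:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Feature fusion + classification head as diagrammed above."""
    def __init__(self, cross_dim: int = 32, embed_dim: int = 64,
                 num_classes: int = 5, p_drop: float = 0.1):
        super().__init__()
        self.proj = nn.Sequential(             # cross-channel projection
            nn.Linear(cross_dim, embed_dim * 2), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(embed_dim * 2, embed_dim), nn.GELU(),
        )
        self.fuse = nn.Sequential(             # fusion MLP over the concat
            nn.Linear(embed_dim * 2, embed_dim * 2), nn.LayerNorm(embed_dim * 2),
            nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(embed_dim * 2, embed_dim),
        )
        self.head = nn.Sequential(             # classification head
            nn.Linear(embed_dim, embed_dim // 2), nn.LayerNorm(embed_dim // 2),
            nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(embed_dim // 2, num_classes),
        )

    def forward(self, cross_feats: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([self.proj(cross_feats), patch_feats], dim=-1))
        return self.head(fused)

logits = FusionHead()(torch.randn(4, 32), torch.randn(4, 64))
print(logits.shape)  # torch.Size([4, 5])
```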
Embedding Layer Comparison

+-------------------+  +-------------------+  +-------------------+  +-------------------+
|      ROCKET       |  |      TS2Vec       |  |     ConvTran      |  |     SwinTime      |
+-------------------+  +-------------------+  +-------------------+  +-------------------+
| Random Kernels:   |  | Linear Layer      |  | Conv2D(1, emb*4,  |  | Dual-Path:        |
| - 10,000 kernels  |  | (input → hidden)  |  |   kernel=[1, 8])  |  | 1. Cross-Channel  |
| - Random weights  |  +-------------------+  +-------------------+  | 2. Multi-Patch    |
| - Random lengths  |                        | BatchNorm2D       |  +-------------------+
| - Random dilations|                        | GELU              |  | Cross-Channel:    |
+-------------------+                        +-------------------+  | - Channel Attn    |
| Feature Extraction|                        | Conv2D(emb*4, emb,|  | - Channel Mixing  |
| - Max pooling     |                        |   kernel=[ch, 1]) |  | - Temporal Stats  |
| - PPV calculation |                        +-------------------+  +-------------------+
+-------------------+                        | BatchNorm2D       |  | Patch Embedding:  |
| Ridge Classifier  |                        | GELU              |  | - Multi-scale     |
+-------------------+                        +-------------------+  | - Conv1D branches |
                                                                   | - Projection      |
                                                                   +-------------------+

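The ConvTran column of the table can be sketched as two stacked Conv2d stages: a temporal convolution with kernel `[1, 8]`, then a channel-collapsing convolution with kernel `[n_channels, 1]`. The sizes `n_channels`, `emb`, and `T` below are illustrative, as is the `(0, 4)` temporal padding:

```python
import torch
import torch.nn as nn

n_channels, emb, T = 6, 16, 128
embed = nn.Sequential(
    nn.Conv2d(1, emb * 4, kernel_size=(1, 8), padding=(0, 4)),  # temporal conv
    nn.BatchNorm2d(emb * 4),
    nn.GELU(),
    nn.Conv2d(emb * 4, emb, kernel_size=(n_channels, 1)),       # collapse channels
    nn.BatchNorm2d(emb),
    nn.GELU(),
)
x = torch.randn(4, 1, n_channels, T)   # series viewed as a 1-channel image
out = embed(x)
print(out.shape)                       # height collapses to 1; width stays ~T
```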

Model Comparison

ROCKET (Random Convolutional Kernel Transform)

  1. Random Kernel Generation: 10,000 randomly initialized 1D convolutional kernels
  2. Diverse Parameters: Random lengths, weights, biases, and dilations
  3. Simple Features: Only 2 features per kernel (max value + PPV)
  4. Fast Training: No backpropagation, only Ridge regression
  5. Surprising Effectiveness: Despite its simplicity, the transform is competitive with far more complex learned models

TS2Vec

  1. Contrastive Learning: Hierarchical contrasting at multiple scales
  2. Transformer Encoder: Standard multi-head attention mechanism
  3. Representation Learning: Unsupervised feature extraction
  4. Temporal Consistency: Maintains temporal relationships

ConvTran

  1. 2D Convolution: Treats the multivariate series as a 2D image (channels × time)
  2. Positional Encoding: Multiple types (absolute, tAPE, learnable)
  3. Supervised Learning: Direct classification training
  4. Flexible Architecture: Various encoding strategies

SwinTime

  1. Dual-Path Architecture: Combines cross-channel analysis with multi-scale patching
  2. Cross-Channel Extractor: Captures channel interactions and temporal statistics
  3. Multi-scale Patching: Uses multiple patch sizes for different temporal patterns
  4. Swin Transformer Blocks: Efficient attention with multi-scale convolutions
  5. Feature Fusion: Concatenates and projects the outputs of both pathways