Problem Summary
The current BP decoder implementation shows only 2% improvement over a simple syndrome-parity baseline decoder, which is far below expected performance based on published research. After analyzing the code and comparing against recent literature, I've identified fundamental architectural issues that prevent the decoder from working properly.
Expected vs. Actual Performance
What We Should See (from Literature)
Based on recent 2024 research on BP decoders for surface codes:
What We Actually See
At p=0.03 (3% physical error):
| Decoder | LER | Precision | Recall | F1 |
|---|---|---|---|---|
| Syndrome-parity baseline | 50.6% | 35.6% | 50.8% | 0.418 |
| BP Decoder | 49.6% | 36.1% | 50.3% | 0.421 |
| Improvement | 2.0% | - | - | - |
This is broken! At 3% physical error, we're getting ~50% logical error rate, which means the decoder is barely better than random guessing.
Root Cause Analysis
I've identified three fundamental architectural issues:
Issue 1: Wrong Factor Graph Structure
Current UAI Model (from `dem_to_uai()`):

```python
# Lines 174-182 in dem.py
n_detectors = dem.num_detectors    # 24 for d=3
lines.append(str(n_detectors))     # Variables = DETECTORS
lines.append(str(len(errors)))     # Factors = ERROR MECHANISMS
```
This creates:
- Variables = Detectors (binary: fired=1, not fired=0)
- Factors = Error mechanisms (one factor per DEM error)
The Problem:
- This models P(detectors | errors), the forward model
- For decoding, we need P(errors | syndrome), the inverse model
- Setting detectors as "evidence" makes no sense—we already KNOW the detector states from the syndrome!
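The mismatch is easy to see on a toy example. The 3×4 parity-check matrix `H` below is hypothetical (not taken from `dem.py`), but it shows why the forward direction has nothing left to infer while the inverse direction does:

```python
import numpy as np
from itertools import product

# Hypothetical toy parity-check matrix H (detectors x errors):
# H[i, j] = 1 means error mechanism j flips detector i.
H = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=np.uint8)

# Forward direction (what the current graph encodes): given the errors,
# the detector pattern is fully determined -- nothing is left to infer.
e = np.array([0, 1, 0, 0], dtype=np.uint8)
syndrome = (H @ e) % 2                      # -> array([1, 1, 0])

# Inverse direction (what decoding needs): given only the syndrome,
# several error configurations remain consistent, and the decoder's
# job is to weigh them by their prior probabilities.
consistent = [cand for cand in product([0, 1], repeat=H.shape[1])
              if np.array_equal((H @ np.array(cand)) % 2, syndrome)]
# Here both (0, 1, 0, 0) and (1, 0, 1, 1) explain the syndrome.
```

Enumerating configurations only works at toy scale, of course; BP exists precisely to do this weighting implicitly on the full graph.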
Issue 2: Inference on the Wrong Quantity
What the code does:
```python
# examples/tanner_graph_walkthrough.py, line 438
evidence = {det_idx + 1: int(syndrome[det_idx]) for det_idx in range(len(syndrome))}
bp_ev = apply_evidence(bp, evidence)
state, info = belief_propagate(bp_ev, ...)
marginals = compute_marginals(state, bp_ev)  # Marginals over DETECTORS!
```
The Problem:
- After setting evidence, BP computes marginals over detector variables
- But we already know the detector values (from the syndrome)!
- What we don't know is which errors occurred
- The marginals over detectors are useless for decoding
Issue 3: Decoding Logic Outside BP
Current approach:
```python
# Lines 457-499 in examples/tanner_graph_walkthrough.py
# Build error likelihood scores based on syndrome
for det_idx in syndrome_active:
    errors_for_detector = np.where(H[det_idx, :] == 1)[0]
    error_likelihoods[errors_for_detector] *= 2.0  # Ad-hoc boost

# Apply parity heuristic
if syndrome_weight % 2 == 1:
    flip_likelihood_avg *= 3.0  # Ad-hoc multiplier
```
The Problem:
- The "decoding" happens outside BP using ad-hoc heuristics
- This completely bypasses the power of belief propagation!
- The 2% improvement comes from these heuristics, not from BP
- It's essentially a fancy wrapper around syndrome-parity decoding
What a Proper BP Decoder Should Do
A correct BP decoder for surface codes should:
1. Correct Factor Graph
- Variables = error mechanisms (binary: occurred=1, didn't occur=0)
- Factors = parity-check constraints (one per detector)
- Evidence = the observed syndrome s
2. Proper Inference
- Enforce the syndrome constraints s_i = ⊕_{j: H[i,j]=1} e_j (parity checks)
- Run BP to compute marginal probabilities over errors
- Decode by finding most likely error configuration
3. Maximum Likelihood Decoding
```
error_config = argmax_e P(e | s)
             = argmax_e P(s | e) · P(e)    [Bayes' rule; P(s) is constant]
             = argmax_e [∏_i δ(s_i = ⊕_j H[i,j] e_j)] · [∏_j p_j^{e_j} (1-p_j)^{1-e_j}]
```
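This constrained inference can be sketched in a few dozen lines of sum-product BP in the log-likelihood-ratio domain. This is a minimal illustration, not the repo's API: the name `bp_syndrome_decode`, the scalar error probability `p`, and the dense-matrix message layout are all assumptions made for the sketch.

```python
import numpy as np

def bp_syndrome_decode(H, syndrome, p, iters=50):
    """Sum-product BP on the Tanner graph of H (illustrative sketch).

    Variables are error mechanisms, factors are the parity checks
    s_i = XOR_j H[i,j] e_j.  Messages are log-likelihood ratios
    log P(e_j = 0) / P(e_j = 1); a negative final LLR reads "error occurred".
    """
    m, n = H.shape
    prior = np.log((1 - p) / p)                # prior LLR per error (p scalar)
    msg_vc = np.where(H == 1, prior, 0.0)      # variable -> check messages
    msg_cv = np.zeros((m, n))                  # check -> variable messages
    sign = (-1.0) ** syndrome                  # fired checks flip the sign
    e_hat = np.zeros(n, dtype=np.uint8)
    for _ in range(iters):
        # Check -> variable: tanh rule over the *other* incident variables;
        # the observed syndrome bit enters only through the sign.
        t = np.where(H == 1, np.tanh(np.clip(msg_vc, -30, 30) / 2.0), 1.0)
        for i in range(m):
            row_prod = np.prod(t[i])
            for j in np.flatnonzero(H[i]):
                extrinsic = np.clip(row_prod / t[i, j], -0.999999, 0.999999)
                msg_cv[i, j] = sign[i] * 2.0 * np.arctanh(extrinsic)
        # Variable -> check: prior plus all *other* incoming check messages.
        total = prior + msg_cv.sum(axis=0)
        msg_vc = np.where(H == 1, total[None, :] - msg_cv, 0.0)
        e_hat = (total < 0).astype(np.uint8)
        if np.array_equal((H @ e_hat) % 2, syndrome):
            break                              # hard decision satisfies s
    return e_hat
```

On a tree-shaped Tanner graph this is exact; on the loopy graphs of surface codes plain BP is approximate and degenerate, which is exactly why the OSD post-processing and the modern variants listed under Option 3 exist.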
Proposed Solutions
Option 1: Build Correct Factor Graph
Rewrite `dem_to_uai()` to create:
- Variables: error mechanisms (286 binary variables for d=3, r=3)
- Factors: parity constraints (24 factors, one per detector)
- Each factor enforces `detector_i = XOR(errors affecting detector i)`
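For small detector degrees, such a parity factor can be written out as an explicit table, which is the natural form for a UAI model. A minimal sketch (the helper name `parity_factor_table` is an assumption, not an existing function in the repo):

```python
import numpy as np
from itertools import product

def parity_factor_table(k, s_bit):
    """Dense truth table for one detector's parity factor (illustrative).

    The factor touches the k error variables incident on a detector and
    gives weight 1.0 to assignments whose XOR equals the observed syndrome
    bit s_bit, weight 0.0 otherwise -- a hard delta constraint.
    """
    table = np.zeros((2,) * k)
    for assignment in product([0, 1], repeat=k):
        if sum(assignment) % 2 == s_bit:
            table[assignment] = 1.0
    return table
```

The dense table is exponential in the detector degree, so production decoders keep parity checks implicit and evaluate them analytically inside BP (the tanh rule); the table form is mainly useful for emitting small UAI factors and for unit-testing the graph construction.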
Option 2: Use Existing Tools
Consider using proven implementations:
- PyMatching (GitHub) - MWPM decoder, ~1% threshold
- LDPC (bp_osd) - BP+OSD implementation
- Stim built-in decoders for benchmarking
Option 3: Implement Modern BP Variants
From the 2024 literature:
- EWAInit-BP: Adaptive prior probability updates
- MBP: Momentum-based belief propagation
- BlockBP: Outperforms MWPM by >10x in some regimes
Impact
This affects:
- The `pytorch_bp` module used for actual decoding
References
- Improved Belief Propagation Decoding on Surface Codes (2024)
- Quantum Error Correction Below the Surface Code Threshold
- Error Correction Zoo: Surface Code
- PyMatching: MWPM Decoder
- BP-OSD Decoder
Next Steps
I propose:
- Document the problem (this issue)
- Add a warning to the current tutorial about its performance limitations
- Design the correct factor graph structure
- Implement a proper BP decoder
- Benchmark against PyMatching to validate
Should I proceed with designing the corrected architecture?
Labels: bug, enhancement, research, decoder
Priority: High - affects core functionality