py-libp2p Multicodec Integration Status and Improvement Opportunities #1172

acul71 · 2026-01-28T23:27:19Z

acul71
Jan 28, 2026
Maintainer

py-libp2p Multicodec Integration Status and Improvement Opportunities

Executive Summary

This document analyzes the current usage of multicodec functionality in py-libp2p and identifies opportunities to leverage new features from py-multicodec (v1.0.0+) to improve code quality, type safety, and maintainability.

Key Finding: py-libp2p currently uses hardcoded multicodec constants and manual prefix handling, but does not import or use the py-multicodec library. The new py-multicodec features (Code type, serialization framework, named constants) can significantly improve the codebase.

Current State Analysis

1. Hardcoded Multicodec Constants

Location: libp2p/bitswap/cid.py

# Simplified multicodec constants
CODEC_DAG_PB = 0x70
CODEC_RAW = 0x55

# Simplified multihash constants
HASH_SHA256 = 0x12

Issues:

Hardcoded integer values without type safety
No validation that codes are valid multicodec codes
No easy way to convert between code names and values
Risk of using incorrect or deprecated codes

2. Manual Prefix Handling

Location: libp2p/peer/envelope.py

PEER_RECORD_CODEC = b"\x03\x01"  # Hardcoded bytes

The Envelope class uses payload_type as raw bytes, which should be a multicodec-prefixed identifier. Currently, it's manually constructed without using multicodec's prefix functions.

Issues:

Manual byte construction is error-prone
No validation that the codec is valid
Difficult to maintain if multicodec table changes
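Before swapping the constant, it may be worth checking what the hardcoded bytes actually represent. The sketch below is plain Python and does not call py-multicodec; `decode_uvarint` is a hypothetical helper written for this example, and `0x0301` is assumed to be the multicodec table value for libp2p-peer-record. It suggests that `b"\x03\x01"` is the big-endian form of the code rather than its varint form, so a varint-based prefix function would presumably return different bytes. Whether `get_prefix()` behaves that way is an assumption to confirm against the library before changing anything that is signed or sent on the wire.

```python
def decode_uvarint(buf: bytes) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the unsigned varint at the start of buf."""
    value = 0
    for i, byte in enumerate(buf[:9]):
        value |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:  # continuation bit clear: last byte
            return value, i + 1
    raise ValueError("truncated or oversized varint")


HARDCODED = b"\x03\x01"  # current PEER_RECORD_CODEC
TABLE_CODE = 0x0301      # assumed table value for libp2p-peer-record

# Read as a varint, the constant is a one-byte value, not 0x0301.
as_varint, used = decode_uvarint(HARDCODED)
# Read as a two-byte big-endian integer, it matches the table value.
as_big_endian = int.from_bytes(HARDCODED, "big")

# The two-byte varint that does decode to TABLE_CODE.
varint_bytes = bytes([(TABLE_CODE & 0x7F) | 0x80, TABLE_CODE >> 7])

print(hex(as_varint), used, hex(as_big_endian), varint_bytes.hex())
```

If the two encodings do differ, keeping the existing bytes for the envelope payload type and using the library only for validation would avoid changing signed data.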

3. CID Construction Without Proper Multicodec Support

Location: libp2p/bitswap/cid.py

The CID construction functions manually construct CIDs with hardcoded codec values:

def compute_cid_v1(data: bytes, codec: int = CODEC_RAW) -> bytes:
    # ...
    cid = bytes([CID_V1, codec]) + multihash

Issues:

Should use multicodec.add_prefix() for proper varint encoding
Codec values are raw integers instead of validated Code objects
No support for codec name strings (only integers)
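To illustrate the varint point in the list above, here is a minimal sketch of unsigned-varint encoding in plain Python. `encode_uvarint` is a hypothetical helper written for this example, not a py-multicodec or py-libp2p function. It shows why `bytes([CID_V1, codec])` happens to be correct for codes below 128 and falls short for larger ones.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)  # last byte: continuation bit clear
            return bytes(out)


# raw (0x55) fits in one byte, so the manual construction matches the varint.
print(encode_uvarint(0x55).hex())
# 0x85 needs two bytes; bytes([CID_V1, 0x85]) would emit only the first one.
print(encode_uvarint(0x85).hex())
```

Codes above 255 fail even earlier with the manual construction, because `bytes([...])` rejects values outside the 0-255 range.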

py-multicodec Features Overview

Core Features Available

Type-Safe Code Objects (Code class)
- Type-safe codec handling
- String conversion (name lookup)
- Validation and error handling
- Comparison operators
Named Constants (code_table.py)
- Pre-defined Code objects for all known multicodecs
- Examples: SHA2_256, DAG_PB, RAW, IP4, TCP, etc.
- Type-safe and IDE-friendly
Prefix Operations
- add_prefix(codec_name, data) - Add multicodec prefix
- remove_prefix(prefixed_data) - Remove prefix
- get_codec(prefixed_data) - Extract codec name
- extract_prefix(prefixed_data) - Extract prefix integer
- get_prefix(codec_name) - Get prefix bytes for codec
Serialization Framework (serialization.py)
- Codec abstract base class for custom codecs
- Built-in JSONCodec and RawCodec
- encode(codec_name, data) - Generic encoding
- decode(data) - Auto-detect and decode
- Codec registry for extensibility
Code Management
- known_codes() - List all registered codes
- Code.from_string() - Create Code from name or hex
- is_codec(name) - Validate codec names
- is_reserved(code) - Check reserved range

Improvement Opportunities

Priority 1: Replace Hardcoded Constants

1.1 Bitswap CID Module (`libp2p/bitswap/cid.py`)

Current Code:

CODEC_DAG_PB = 0x70
CODEC_RAW = 0x55
HASH_SHA256 = 0x12

Proposed Improvement:

from multicodec import Code
from multicodec.code_table import DAG_PB, RAW, SHA2_256

# Use named constants
CODEC_DAG_PB = DAG_PB
CODEC_RAW = RAW
HASH_SHA256 = SHA2_256

# Or use Code.from_string for flexibility
CODEC_DAG_PB = Code.from_string("dag-pb")
CODEC_RAW = Code.from_string("raw")

Benefits:

Type safety with Code objects
Automatic validation
Self-documenting code
Easy refactoring if code values change

Impact: High - Core functionality used throughout bitswap

1.2 Envelope Payload Type (`libp2p/peer/envelope.py`)

Current Code:

PEER_RECORD_CODEC = b"\x03\x01"

Proposed Improvement:

from multicodec import get_prefix, get_codec, Code
from multicodec.code_table import LIBP2P_PEER_RECORD

# Use proper multicodec prefix
PEER_RECORD_CODEC = get_prefix("libp2p-peer-record")
# Or use Code object
peer_record_code = Code.from_string("libp2p-peer-record")
PEER_RECORD_CODEC = get_prefix(str(peer_record_code))

Benefits:

Proper varint encoding
Validation that codec exists
Consistent with multicodec spec

Impact: Medium - Used in peer record handling

Priority 2: Use Proper Prefix Operations

2.1 CID Construction (`libp2p/bitswap/cid.py`)

Current Code:

def compute_cid_v1(data: bytes, codec: int = CODEC_RAW) -> bytes:
    # ...
    cid = bytes([CID_V1, codec]) + multihash

Proposed Improvement:

from multicodec import add_prefix, Code

def compute_cid_v1(data: bytes, codec: Code | str | int = CODEC_RAW) -> bytes:
    # Normalize to a Code object
    if isinstance(codec, str):
        codec = Code.from_string(codec)
    elif isinstance(codec, int):
        codec = Code(codec)

    # ... compute multihash as before ...

    # Prefixing empty data yields just the varint-encoded codec
    codec_prefix = add_prefix(str(codec), b"")
    # Construct CID: <version><codec-varint><multihash>
    cid = bytes([CID_V1]) + codec_prefix + multihash
    return cid

Benefits:

Proper varint encoding (handles multi-byte codes)
Support for codec names as strings
Type safety with Code objects
Future-proof for new codecs

Impact: High - Critical for CID correctness

2.2 Envelope Payload Type Handling

Current Code:

payload_type: bytes  # Raw bytes

Proposed Improvement:

from multicodec import Code, get_prefix

class Envelope:
    payload_type_code: Code  # Type-safe code
    
    @property
    def payload_type(self) -> bytes:
        """Return the multicodec-prefixed payload type."""
        return get_prefix(str(self.payload_type_code))

Benefits:

Type safety
Validation
Easier to work with codec names

Impact: Medium - Improves peer record handling

Priority 3: Leverage Serialization Framework

3.1 Peer Record Serialization (`libp2p/peer/peer_record.py`)

Current Code:

def marshal_record(self) -> bytes:
    msg = self.to_protobuf()
    return msg.SerializeToString()

Proposed Improvement:

from multicodec.serialization import Codec, register_codec
from multicodec.code_table import LIBP2P_PEER_RECORD

class PeerRecordCodec(Codec[PeerRecord]):
    @property
    def name(self) -> str:
        return "libp2p-peer-record"
    
    @property
    def code(self) -> int:
        return int(LIBP2P_PEER_RECORD)
    
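    def _encode(self, data: PeerRecord) -> bytes:
        # Serialize the protobuf directly; calling data.marshal_record() here
        # would recurse once marshal_record() delegates to this codec.
        return data.to_protobuf().SerializeToString()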
    
    def _decode(self, data: bytes) -> PeerRecord:
        return unmarshal_record(data)

# Register the codec
peer_record_codec = PeerRecordCodec()
register_codec(peer_record_codec)

# Then use it
def marshal_record(self) -> bytes:
    return peer_record_codec.encode(self)

Benefits:

Consistent serialization interface
Auto-detection of codec from prefix
Extensible for other record types
Better error handling

Impact: Medium - Improves serialization consistency

3.2 DAG-PB Encoding (`libp2p/bitswap/dag_pb.py`)

Current Code:

def encode_dag_pb(links: list[Link], unixfs_data: UnixFSData | None = None) -> bytes:
    # Manual protobuf encoding
    pb_node = PBNode()
    # ... populate pb_node ...
    return pb_node.SerializeToString()

Proposed Improvement:

from multicodec.serialization import Codec
from multicodec.code_table import DAG_PB

class DAGPBCodec(Codec[tuple[list[Link], UnixFSData | None]]):
    @property
    def name(self) -> str:
        return "dag-pb"
    
    @property
    def code(self) -> int:
        return int(DAG_PB)
    
    def _encode(self, data: tuple[list[Link], UnixFSData | None]) -> bytes:
        links, unixfs_data = data
        return encode_dag_pb(links, unixfs_data)
    
    def _decode(self, data: bytes) -> tuple[list[Link], UnixFSData | None]:
        return decode_dag_pb(data)

dag_pb_codec = DAGPBCodec()
register_codec(dag_pb_codec)

Benefits:

Consistent interface with other codecs
Can use generic encode()/decode() functions
Better integration with CID handling

Impact: Low-Medium - Nice to have for consistency

Priority 4: Code Validation and Error Handling

4.1 Validate Codec Names

Current Code: No validation when codec names/values are used

Proposed Improvement:

from multicodec import is_codec, Code, UnknownCodecError

def compute_cid_v1(data: bytes, codec: Code | str | int = CODEC_RAW) -> bytes:
    # Validate codec
    if isinstance(codec, str):
        if not is_codec(codec):
            raise ValueError(f"Unknown codec: {codec}")
        codec = Code.from_string(codec)
    elif isinstance(codec, int):
        codec = Code(codec)
        if codec.name == "<unknown>":
            raise ValueError(f"Unknown codec code: 0x{codec:x}")
    # ... rest of function

Benefits:

Early error detection
Better error messages
Prevents invalid CIDs

Impact: Medium - Improves robustness

4.2 Envelope Payload Type Validation

Current Code:

if self.payload_type != PEER_RECORD_CODEC:
    raise ValueError("Unsuported payload type in envelope")

Proposed Improvement:

from multicodec import get_codec, Code

def record(self) -> PeerRecord:
    # Validate payload type
    try:
        codec_name = get_codec(self.payload_type)
    except ValueError as e:
        # Malformed or unknown multicodec prefix
        raise ValueError(f"Invalid payload type: {e}") from e
    # Checked outside the try block so this error is not re-wrapped above
    if codec_name != "libp2p-peer-record":
        raise ValueError(f"Unsupported payload type: {codec_name}")

    # ... rest of method

Benefits:

Validates multicodec prefix format
Better error messages
Supports future codec types

Impact: Low-Medium - Improves error handling

Priority 5: Use Named Constants Throughout

5.1 Replace All Hardcoded Values

Locations to Update:

libp2p/bitswap/cid.py - Codec and hash constants
libp2p/peer/envelope.py - Payload type
Any other places using multicodec values

Proposed Improvement:

# Instead of:
codec = 0x70  # What is this?

# Use:
from multicodec.code_table import DAG_PB
codec = DAG_PB  # Clear and type-safe

Benefits:

Self-documenting code
IDE autocomplete
Refactoring safety
Type checking support

Impact: High - Improves code maintainability

Migration Strategy

Phase 1: Add Dependency and Basic Integration

Add py-multicodec to dependencies:

dependencies = [
    # ... existing dependencies ...
    "py-multicodec>=1.0.0",
]

Replace hardcoded constants in libp2p/bitswap/cid.py:
- Import Code and named constants
- Replace CODEC_DAG_PB = 0x70 with CODEC_DAG_PB = DAG_PB
- Update function signatures to accept Code | str | int
Update envelope payload type in libp2p/peer/envelope.py:
- Use get_prefix() for proper encoding
- Add validation with get_codec()

Phase 2: Improve CID Construction

Update compute_cid_v1() to use add_prefix():
- Proper varint encoding
- Support for codec names as strings
- Better error handling
Update CID parsing to use get_codec():
- Extract codec name from CID
- Validate codec values

Phase 3: Add Serialization Framework (Optional)

Create custom codecs for PeerRecord and DAG-PB
Register codecs in module initialization
Use generic encode/decode where appropriate

Phase 4: Testing and Validation

Update tests to use Code objects
Add validation tests for invalid codecs
Test backward compatibility with existing CIDs

Code Examples

Example 1: Improved CID Construction

import hashlib

from multicodec import Code, add_prefix, get_codec
from multicodec.code_table import RAW, SHA2_256

CID_V1 = 0x01

def compute_cid_v1(data: bytes, codec: Code | str | int = RAW) -> bytes:
    """
    Compute a CIDv1 with proper multicodec support.
    
    Args:
        data: The data to hash
        codec: Codec as Code object, string name, or integer code
    
    Returns:
        CIDv1 bytes with proper varint-encoded codec prefix
    """
    # Normalize codec to Code object
    if isinstance(codec, str):
        codec = Code.from_string(codec)
    elif isinstance(codec, int):
        codec = Code(codec)
    
    # Compute multihash
    digest = hashlib.sha256(data).digest()
    multihash = bytes([int(SHA2_256), len(digest)]) + digest
    
    # Get proper varint-encoded codec prefix
    codec_prefix = add_prefix(str(codec), b"")
    
    # Construct CID: <version><codec-varint><multihash>
    return bytes([CID_V1]) + codec_prefix + multihash

def parse_cid_codec(cid: bytes) -> str:
    """Extract codec name from CID."""
    if len(cid) < 2 or cid[0] != CID_V1:
        return "cidv0"  # CIDv0 doesn't have explicit codec
    
    # Extract codec prefix (skip version byte)
    codec_prefixed = cid[1:]
    return get_codec(codec_prefixed)

Example 2: Type-Safe Envelope

from multicodec import Code, get_prefix, get_codec
from multicodec.code_table import LIBP2P_PEER_RECORD

class Envelope:
    payload_type_code: Code
    
    def __init__(
        self,
        public_key: PublicKey,
        payload_type: Code | str | bytes,
        raw_payload: bytes,
        signature: bytes,
    ):
        self.public_key = public_key
        
        # Normalize payload_type to Code
        if isinstance(payload_type, bytes):
            # Extract codec from prefix
            codec_name = get_codec(payload_type)
            self.payload_type_code = Code.from_string(codec_name)
        elif isinstance(payload_type, str):
            self.payload_type_code = Code.from_string(payload_type)
        else:
            self.payload_type_code = payload_type
        
        self.raw_payload = raw_payload
        self.signature = signature
    
    @property
    def payload_type(self) -> bytes:
        """Return the multicodec-prefixed payload type."""
        return get_prefix(str(self.payload_type_code))
    
    def record(self) -> PeerRecord:
        """Decode and return the embedded PeerRecord."""
        if self.payload_type_code != LIBP2P_PEER_RECORD:
            raise ValueError(
                f"Unsupported payload type: {self.payload_type_code.name}"
            )
        # ... rest of method

Example 3: Custom Codec for PeerRecord

from multicodec.serialization import Codec, register_codec
from multicodec.code_table import LIBP2P_PEER_RECORD

class PeerRecordCodec(Codec[PeerRecord]):
    """Codec for libp2p PeerRecord serialization."""
    
    @property
    def name(self) -> str:
        return "libp2p-peer-record"
    
    @property
    def code(self) -> int:
        return int(LIBP2P_PEER_RECORD)
    
    def _encode(self, data: PeerRecord) -> bytes:
        """Encode PeerRecord to protobuf bytes."""
        return data.marshal_record()
    
    def _decode(self, data: bytes) -> PeerRecord:
        """Decode protobuf bytes to PeerRecord."""
        return unmarshal_record(data)

# Register globally
peer_record_codec = PeerRecordCodec()
register_codec(peer_record_codec)

# Usage
from multicodec import get_prefix

def seal_record(record: PeerRecord, private_key: PrivateKey) -> Envelope:
    """Create and sign a new Envelope from a PeerRecord."""
    # Use codec for encoding
    payload = peer_record_codec.encode(record)
    # The payload type is the bare multicodec prefix. encode(b"") would pass
    # bytes to _encode(), which expects a PeerRecord.
    payload_type = get_prefix(peer_record_codec.name)

    unsigned = make_unsigned(record.domain(), payload_type, payload)
    signature = private_key.sign(unsigned)

    return Envelope(
        public_key=private_key.get_public_key(),
        payload_type=LIBP2P_PEER_RECORD,  # parameter name from Example 2
        raw_payload=payload,
        signature=signature,
    )

Benefits Summary

Type Safety

Code objects instead of raw integers
IDE autocomplete for codec names
Type checking support with mypy

Maintainability

Self-documenting code with named constants
Centralized codec definitions
Easy updates when multicodec table changes

Correctness

Proper varint encoding for multi-byte codes
Validation of codec names/values
Consistent with multicodec specification

Extensibility

Serialization framework for custom codecs
Codec registry for dynamic codec handling
Future-proof for new codec types

Developer Experience

Better error messages with codec names
Easier debugging with Code.__repr__
Consistent API across codebase

Potential Issues and Considerations

1. Backward Compatibility

Issue: Existing code may rely on integer codec values.

Solution:

Support both Code objects and integers in function signatures
Use int(code) to convert Code to integer where needed
Gradual migration path
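The dual-input approach above can be sketched without depending on the library. `StandInCode` is a hypothetical stand-in used only to make the example runnable; it is not the py-multicodec `Code` class, and the only property assumed of the real class is that `int(code)` works, as the list above suggests.

```python
from typing import SupportsInt


class StandInCode:
    """Hypothetical Code-like object for illustration only."""

    def __init__(self, value: int, name: str) -> None:
        self._value = value
        self.name = name

    def __int__(self) -> int:
        return self._value


def codec_as_int(codec: SupportsInt) -> int:
    """Accept a plain int or any object implementing __int__."""
    value = int(codec)
    if value < 0:
        raise ValueError("codec must be non-negative")
    return value


# Existing integer call sites and new Code-style call sites give the same value.
legacy_style = codec_as_int(0x70)
new_style = codec_as_int(StandInCode(0x70, "dag-pb"))
print(legacy_style == new_style)
```

Normalizing once at the function boundary keeps the internals on plain integers, which limits how far the migration has to reach in a single change.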

2. Performance

Issue: Code object creation and validation may add overhead.

Solution:

Use cached Code objects (named constants)
Validation only on input, not in hot paths
Profile to ensure acceptable performance

3. Dependency Management

Issue: Adding py-multicodec as a dependency.

Solution:

py-multicodec is lightweight and well-maintained
Already used by other libp2p implementations
Minimal dependency footprint

4. Codec Table Updates

Issue: Multicodec table may change over time.

Solution:

py-multicodec provides tools to update the table
Version pinning for stability
Regular updates for new codecs

Recommendations

Immediate Actions (High Priority)

✅ Add py-multicodec dependency to pyproject.toml
✅ Replace hardcoded constants in libp2p/bitswap/cid.py
✅ Update envelope payload type to use get_prefix()
✅ Improve CID construction with proper varint encoding

Short-term Improvements (Medium Priority)

⚠️ Add codec validation in CID and envelope functions
⚠️ Use Code objects in function signatures
⚠️ Update tests to use Code objects

Long-term Enhancements (Low Priority)

📋 Create custom codecs for PeerRecord and DAG-PB
📋 Use serialization framework for consistency
📋 Add codec discovery utilities

Conclusion

py-multicodec provides powerful features that can significantly improve py-libp2p's code quality, type safety, and maintainability. The most impactful improvements are:

Replacing hardcoded constants with type-safe Code objects
Using proper prefix operations for CID and envelope handling
Adding validation for codec names and values

These changes will make the codebase more robust, easier to maintain, and better aligned with the multicodec specification. The migration can be done gradually with backward compatibility, minimizing risk while maximizing benefits.

References

py-multicodec Documentation
Multicodec Specification
Multicodec Table
py-libp2p files analyzed:
- libp2p/bitswap/cid.py
- libp2p/peer/envelope.py
- libp2p/peer/peer_record.py
- libp2p/bitswap/dag_pb.py

gerceboss · 2026-02-06T05:19:38Z

gerceboss
Feb 6, 2026

Hey @acul71 @asmit27rai, I would like to take up this issue.

yashksaini-coder · 2026-02-10T06:58:20Z

yashksaini-coder
Feb 10, 2026

CID Breaking Change Guide

Overview

A developer working on the multicodec implementation raised a concern: changing the CIDv1 codec encoding from a single byte to a varint introduces a breaking change.

Before (Legacy):  01 85 12 20 <digest>        (single-byte codec)
After (Varint):   01 85 01 12 20 <digest>     (2-byte varint codec)

Impact:

✅ 95% of CIDs unaffected - codecs < 128 (raw, dag-pb, dag-cbor)
🔴 5% of CIDs affected - codecs ≥ 128 (dag-json, dag-jose, experimental)

The contributor proposed the following compatibility analysis and migration strategy, which this guide implements:

Compatibility:

Codecs < 128 (e.g., raw=0x55, dag-pb=0x70): Single-byte varint encoding means legacy and new formats are identical (backward compatible)
Codecs ≥ 128: Multi-byte varint encoding means formats are different (breaking change)
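The compatibility split can be checked byte for byte. The following is a small sketch under stated assumptions: `legacy_header`, `varint_header` and `encode_uvarint` are hypothetical helpers for this example, and the multihash is assumed to be SHA2-256 (0x12) with a 32-byte digest, as in the Before/After lines above.

```python
CID_V1 = 0x01
SHA2_256 = 0x12
DIGEST_LEN = 32


def encode_uvarint(value: int) -> bytes:
    """Unsigned varint: 7 bits per byte, MSB set on all but the last byte."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def legacy_header(codec: int) -> bytes:
    """Legacy layout: the codec is written as one raw byte."""
    return bytes([CID_V1, codec, SHA2_256, DIGEST_LEN])


def varint_header(codec: int) -> bytes:
    """New layout: the codec is written as an unsigned varint."""
    return bytes([CID_V1]) + encode_uvarint(codec) + bytes([SHA2_256, DIGEST_LEN])


for codec in (0x55, 0x70, 0x85):
    same = legacy_header(codec) == varint_header(codec)
    print(f"0x{codec:02x}: legacy={legacy_header(codec).hex()} "
          f"varint={varint_header(codec).hex()} identical={same}")
```

Only the header differs between the two layouts; the digest that follows is unchanged, which is why recomputation needs the codec and the original data but no new hashing scheme.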

Migration Strategy:

✅ Legacy CIDs with codec < 128 continue to work without migration
⚠️ Legacy CIDs with codec ≥ 128 need recomputation from original data
🔧 Use detect_cid_version() to identify format
🔧 Use migrate_legacy_cid() to convert when possible

Implementation Notes:

This guide implements detect_cid_encoding_format() as the detection function (similar to detect_cid_version())
This guide implements recompute_cid_from_data() as the migration function (similar to migrate_legacy_cid())
Alternative function names are provided below for contributor preference

Quick Impact Check

# Check if a codec causes breaking change
def is_breaking_codec(codec_value: int) -> bool:
    return codec_value >= 128  # Multi-byte varint required

# Common codecs:
is_breaking_codec(0x55)    # raw → False ✅
is_breaking_codec(0x70)    # dag-pb → False ✅
is_breaking_codec(0x85)    # dag-jose → True 🔴
is_breaking_codec(0x0129)  # dag-json → True 🔴

Phase 1: Core Implementation

1.1 Fix CID Prefix Extraction

Problem: Current code assumes single-byte codec, breaks with multi-byte varint

Location: libp2p/bitswap/cid.py - get_cid_prefix()

Fix:

def get_cid_prefix(cid: bytes) -> bytes:
    """Extract CID prefix, properly parsing varint codec.
    
    CIDv1 format: <version:1> <codec:varint> <hash_type:1> <hash_len:1>
    """
    if len(cid) < 2 or cid[0] != CID_V1:
        return b""
    
    # Parse varint codec to find its length
    codec_length = 0
    terminated = False
    for byte in cid[1:11]:  # Skip version byte; max varint is 10 bytes
        codec_length += 1
        if (byte & 0x80) == 0:  # MSB clear = last byte of varint
            terminated = True
            break
    if not terminated:
        return b""  # Truncated or malformed varint

    # Prefix = version + codec + hash_type + hash_length
    prefix_length = 1 + codec_length + 2
    if len(cid) < prefix_length:
        return b""

    return cid[:prefix_length]

Test It:

def test_get_cid_prefix_multibyte_varint():
    """Test prefix extraction with 2-byte varint codec."""
    # dag-jose (0x85 = 133) uses varint: 0x85 0x01
    data = b"test data"
    cid = compute_cid_v1(data, codec=0x85)
    
    prefix = get_cid_prefix(cid)
    
    # Should be: version(1) + codec(2) + hash_type(1) + hash_len(1) = 5 bytes
    assert len(prefix) == 5, f"Expected 5, got {len(prefix)}"
    assert prefix[0] == 0x01  # CIDv1
    assert prefix[1:3] == bytes([0x85, 0x01])  # 2-byte varint

1.2 Fix CID Verification

Problem: Uses reverse indexing that fails with multi-byte varint

Location: libp2p/bitswap/cid.py - verify_cid()

Fix:

def verify_cid(cid: bytes, data: bytes) -> bool:
    """Verify CID matches data using forward parsing."""
    if len(cid) < 2 or cid[0] != CID_V1:
        return False
    
    # Parse varint codec to find where multihash starts
    codec_length = 0
    offset = 1  # Skip version byte
    
    for i in range(offset, min(len(cid), offset + 10)):
        codec_length += 1
        if (cid[i] & 0x80) == 0:  # Last byte of varint
            break
    
    # Extract multihash components
    hash_type_offset = 1 + codec_length
    if len(cid) <= hash_type_offset + 1:
        return False
    
    hash_type = cid[hash_type_offset]
    hash_length = cid[hash_type_offset + 1]
    
    # Extract digest from CID
    digest_offset = hash_type_offset + 2
    if len(cid) < digest_offset + hash_length:
        return False
    
    cid_digest = cid[digest_offset : digest_offset + hash_length]
    
    # Compute expected digest
    # TODO: Support multiple hash algorithms, not just SHA2-256
    expected_digest = hashlib.sha256(data).digest()
    
    return expected_digest == cid_digest
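One way to address the TODO above is a small dispatch table keyed by the multihash function code. This is only a sketch: `HASHERS` and `digest_matches` are hypothetical names, and the table covers just three multihash codes (sha1 0x11, sha2-256 0x12, sha2-512 0x13) that map directly onto `hashlib`.

```python
import hashlib

# Multihash function code -> hashlib constructor (illustrative subset).
HASHERS = {
    0x11: hashlib.sha1,
    0x12: hashlib.sha256,
    0x13: hashlib.sha512,
}


def digest_matches(hash_type: int, cid_digest: bytes, data: bytes) -> bool:
    """Recompute the digest with the hash named in the CID and compare."""
    hasher = HASHERS.get(hash_type)
    if hasher is None:
        return False  # unsupported hash function: cannot verify
    return hasher(data).digest() == cid_digest


sample = b"test data"
print(digest_matches(0x12, hashlib.sha256(sample).digest(), sample))
print(digest_matches(0x13, hashlib.sha512(sample).digest(), sample))
```

In `verify_cid()`, the final comparison could then use the parsed `hash_type` instead of assuming SHA2-256. Returning False for unknown hash types keeps the function's existing contract of never raising.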

Test It:

def test_verify_cid_multibyte_varint():
    """Test verification with 2-byte varint codec."""
    data = b"test data"
    cid = compute_cid_v1(data, codec=0x85)  # dag-jose

    # Should verify successfully
    assert verify_cid(cid, data)

    # Should fail with wrong data
    assert not verify_cid(cid, b"wrong data")

Phase 2: Backward Compatibility Testing

2.1 Test Codec < 128 (Backward Compatible)

Goal: Verify codecs < 128 produce identical CIDs in both formats

def test_backward_compatible_codecs():
    """Verify codecs < 128 are backward compatible."""
    test_data = b"Hello World"
    
    # Test raw (0x55)
    cid_raw = compute_cid_v1(test_data, codec=0x55)
    assert cid_raw[1] == 0x55  # Single byte
    
    # Test dag-pb (0x70)
    cid_dag_pb = compute_cid_v1(test_data, codec=0x70)
    assert cid_dag_pb[1] == 0x70  # Single byte
    
    # Test dag-cbor (0x71)
    cid_dag_cbor = compute_cid_v1(test_data, codec=0x71)
    assert cid_dag_cbor[1] == 0x71  # Single byte
    
    # Verify prefix extraction works
    for cid in [cid_raw, cid_dag_pb, cid_dag_cbor]:
        prefix = get_cid_prefix(cid)
        assert len(prefix) == 4  # version(1) + codec(1) + type(1) + len(1)
        assert verify_cid(cid, test_data) == True

2.2 Test Codec ≥ 128 (Breaking Change)

Goal: Verify codecs ≥ 128 use multi-byte varint correctly

def test_multibyte_varint_codecs():
    """Verify codecs ≥ 128 use multi-byte varint encoding."""
    test_data = b"test data"
    
    # Test dag-jose (0x85 = 133)
    cid_dag_jose = compute_cid_v1(test_data, codec=0x85)

    # Should use 2-byte varint: 0x85 0x01
    assert cid_dag_jose[1:3] == bytes([0x85, 0x01])

    # Prefix should be 5 bytes: version(1) + codec(2) + type(1) + len(1)
    prefix = get_cid_prefix(cid_dag_jose)
    assert len(prefix) == 5, f"Expected 5, got {len(prefix)}"

    # Verification should work
    assert verify_cid(cid_dag_jose, test_data)

    # Test dag-json (0x0129 = 297) if available; its varint is 0xA9 0x02
    try:
        cid_dag_json = compute_cid_v1(test_data, codec=0x0129)
        assert len(get_cid_prefix(cid_dag_json)) == 5  # version(1) + codec(2) + type(1) + len(1)
        assert verify_cid(cid_dag_json, test_data)
    except ValueError:
        pass  # codec not available

2.3 Test Edge Cases

def test_cid_edge_cases():
    """Test edge cases and error handling."""
    
    # Test empty data
    cid_empty = compute_cid_v1(b"", codec=0x55)
    assert verify_cid(cid_empty, b"") == True
    
    # Test large data
    large_data = b"x" * 10000
    cid_large = compute_cid_v1(large_data, codec=0x55)
    assert verify_cid(cid_large, large_data) == True
    
    # Test invalid CID (too short)
    assert get_cid_prefix(b"\x01") == b""
    assert verify_cid(b"\x01", b"test") == False
    
    # Test string codec input
    cid_str = compute_cid_v1(b"test", codec="raw")
    cid_int = compute_cid_v1(b"test", codec=0x55)
    assert cid_str == cid_int
    
    # Test invalid codec
    try:
        compute_cid_v1(b"test", codec="nonexistent")
        assert False, "Should raise ValueError"
    except ValueError as e:
        assert "Unknown codec" in str(e)

Phase 3: Migration Tools

3.1 Format Detection Function

Purpose: Identify if a CID uses legacy or varint encoding

Contributor's Proposed Name: detect_cid_version()
Implementation Name: detect_cid_encoding_format()

Note: Both function names are provided below. Use the one that fits your naming conventions.

Option A: `detect_cid_encoding_format()` (Detailed)

def detect_cid_encoding_format(cid: bytes) -> dict:
    """
    Detect CID encoding format and codec details.
    
    Returns:
        {
            'version': 0 or 1,
            'codec_value': int,
            'codec_name': str,
            'encoding': 'legacy' or 'varint',
            'needs_migration': bool,
            'is_breaking': bool
        }
    """
    from multicodec import get_codec, Code
    
    if len(cid) < 2:
        return {'version': None, 'error': 'CID too short'}
    
    version = cid[0]
    
    if version == 0x12:  # CIDv0 (multihash only)
        return {
            'version': 0,
            'codec_value': 0x70,  # dag-pb
            'codec_name': 'dag-pb',
            'encoding': 'legacy',
            'needs_migration': False,
            'is_breaking': False
        }
    
    if version != 0x01:  # Not CIDv1
        return {'version': version, 'error': 'Unknown CID version'}
    
    # Parse codec value from varint
    codec_value = 0
    shift = 0
    codec_length = 0
    
    for i in range(1, min(len(cid), 11)):  # Max varint is 10 bytes
        byte = cid[i]
        codec_value |= (byte & 0x7F) << shift
        shift += 7
        codec_length += 1
        
        if (byte & 0x80) == 0:  # Last byte
            break
    
    # Get codec name
    try:
        codec = Code(codec_value)
        codec_name = str(codec)
    except Exception:  # Unknown code: fall back to its hex form
        codec_name = f"0x{codec_value:x}"
    
    # Determine if this uses legacy or varint encoding
    # Legacy: single byte for all codecs
    # Varint: matches codec_value encoding
    is_breaking = codec_value >= 128
    
    # For codecs < 128, legacy and varint are identical (both 1 byte)
    # For codecs ≥ 128, we can't definitively tell without the original data
    # But we assume varint if properly implemented
    encoding = 'varint' if codec_length > 1 else 'legacy-or-varint'
    
    return {
        'version': 1,
        'codec_value': codec_value,
        'codec_name': codec_name,
        'codec_length': codec_length,
        'encoding': encoding,
        'needs_migration': False,  # Can't migrate without data
        'is_breaking': is_breaking
    }

Test It:

def test_detect_cid_encoding_format():
    """Test format detection."""
    
    # Test raw codec (backward compatible)
    cid_raw = compute_cid_v1(b"test", codec=0x55)
    info = detect_cid_encoding_format(cid_raw)
    assert info['codec_value'] == 0x55
    assert info['codec_name'] == 'raw'
    assert info['is_breaking'] == False
    
    # Test dag-jose (breaking)
    cid_jose = compute_cid_v1(b"test", codec=0x85)
    info = detect_cid_encoding_format(cid_jose)
    assert info['codec_value'] == 0x85
    assert info['codec_name'] == 'dag-jose'
    assert info['is_breaking'] == True
    assert info['codec_length'] == 2  # 2-byte varint
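Forward varint parsing cannot recognise a legacy CID whose single codec byte is 128 or higher, because that byte has its continuation bit set and the parser consumes the hash-type byte as part of the codec. A length-based heuristic can flag such CIDs before parsing. The sketch below assumes a SHA2-256 multihash with a 32-byte digest; `looks_like_legacy_single_byte` is a hypothetical helper, and it is a heuristic rather than a proof of format.

```python
import hashlib

CID_V1 = 0x01
SHA2_256 = 0x12
DIGEST_LEN = 32


def looks_like_legacy_single_byte(cid: bytes) -> bool:
    """True if cid fits the legacy layout with a one-byte codec >= 128."""
    return (
        len(cid) == 4 + DIGEST_LEN
        and cid[0] == CID_V1
        and cid[1] >= 0x80          # codec byte with the continuation bit set
        and cid[2] == SHA2_256      # hash type directly after the codec byte
        and cid[3] == DIGEST_LEN    # digest length
    )


digest = hashlib.sha256(b"test data").digest()
legacy_cid = bytes([CID_V1, 0x85, SHA2_256, DIGEST_LEN]) + digest
varint_cid = bytes([CID_V1, 0x85, 0x01, SHA2_256, DIGEST_LEN]) + digest
raw_cid = bytes([CID_V1, 0x55, SHA2_256, DIGEST_LEN]) + digest

print(looks_like_legacy_single_byte(legacy_cid))
print(looks_like_legacy_single_byte(varint_cid))
print(looks_like_legacy_single_byte(raw_cid))
```

A check like this could set `needs_migration` in the detection function, since the codec value parsed from a legacy CID would otherwise be wrong.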

Option B: `detect_cid_version()` (Contributor's Name)

def detect_cid_version(cid: bytes) -> dict:
    """
    Detect CID version and encoding format (contributor's naming).
    
    Alias for detect_cid_encoding_format() with same functionality.
    
    Returns:
        Same as detect_cid_encoding_format()
    """
    return detect_cid_encoding_format(cid)

# Or use as main function name:
detect_cid_version = detect_cid_encoding_format

3.2 CID Recomputation Helper

Purpose: Recompute CID from original data with new encoding

Contributor's Proposed Name: migrate_legacy_cid()
Implementation Name: recompute_cid_from_data()

Note: Both function names are provided below. Use the one that fits your naming conventions.

Option A: `recompute_cid_from_data()` (Explicit)

def recompute_cid_from_data(old_cid: bytes, data: bytes) -> bytes:
    """
    Recompute CID with proper varint encoding.
    
    Note: Original data is required because CIDs use cryptographic hashes
    (one-way functions that cannot be reversed).
    
    Args:
        old_cid: Existing CID (used to extract codec)
        data: Original data that was hashed
    
    Returns:
        New CID with proper varint-encoded codec
    
    Raises:
        ValueError: If old_cid is invalid or doesn't match data
    """
    # Detect old CID format
    info = detect_cid_encoding_format(old_cid)
    
    if info.get('error'):
        raise ValueError(f"Invalid CID: {info['error']}")
    
    # Extract codec
    codec_value = info['codec_value']
    
    # Recompute with proper varint encoding
    new_cid = compute_cid_v1(data, codec=codec_value)
    
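    # Verify the provided data against the old CID. The SHA2-256 digest is the
    # trailing 32 bytes in both the legacy and varint layouts. Verifying
    # new_cid against data would always pass, since it was just computed from it.
    if old_cid[-32:] != hashlib.sha256(data).digest():
        raise ValueError("Old CID does not verify with provided data")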
    
    return new_cid

Test It:

def test_recompute_cid_from_data():
    """Test CID recomputation."""
    data = b"test data"
    
    # Create CID with dag-jose (0x85)
    old_cid = compute_cid_v1(data, codec=0x85)
    
    # Recompute (should be identical in this case)
    new_cid = recompute_cid_from_data(old_cid, data)
    
    assert new_cid == old_cid  # Same when properly encoded
    assert verify_cid(new_cid, data) == True
    
    # Test with wrong data (should fail)
    try:
        recompute_cid_from_data(old_cid, b"wrong data")
        assert False, "Should raise ValueError"
    except ValueError as e:
        assert "does not verify" in str(e)

Option B: `migrate_legacy_cid()` (Contributor's Name)

def migrate_legacy_cid(old_cid: bytes, data: bytes) -> bytes:
    """
    Migrate legacy CID to new varint format (contributor's naming).
    
    Alias for recompute_cid_from_data() with same functionality.
    
    Args:
        old_cid: Legacy CID (used to extract codec)
        data: Original data that was hashed
    
    Returns:
        New CID with proper varint-encoded codec
    
    Raises:
        ValueError: If old_cid is invalid or doesn't match data
    """
    return recompute_cid_from_data(old_cid, data)

# Or use as main function name:
migrate_legacy_cid = recompute_cid_from_data

Usage Example (Contributor's Function Names):

# Using contributor's preferred names
from libp2p.bitswap.cid import detect_cid_version, migrate_legacy_cid

# Detect format
info = detect_cid_version(old_cid)

if info['is_breaking'] and info['needs_migration']:
    # Migrate legacy CID
    new_cid = migrate_legacy_cid(old_cid, original_data)
    print(f"Migrated CID from {old_cid.hex()} to {new_cid.hex()}")

3.3 Batch CID Analysis Tool

Purpose: Analyze collections of CIDs for migration needs

def analyze_cid_collection(cids: list[bytes]) -> dict:
    """
    Analyze a collection of CIDs for migration impact.
    
    Returns:
        {
            'total': int,
            'backward_compatible': int,
            'breaking_change': int,
            'by_codec': {codec_name: count},
            'breaking_cids': [bytes]
        }
    """
    results = {
        'total': len(cids),
        'backward_compatible': 0,
        'breaking_change': 0,
        'by_codec': {},
        'breaking_cids': []
    }
    
    for cid in cids:
        try:
            info = detect_cid_encoding_format(cid)
            
            if info.get('error'):
                continue
            
            # Count by codec
            codec_name = info['codec_name']
            results['by_codec'][codec_name] = results['by_codec'].get(codec_name, 0) + 1
            
            # Categorize
            if info['is_breaking']:
                results['breaking_change'] += 1
                results['breaking_cids'].append(cid)
            else:
                results['backward_compatible'] += 1
        
        except Exception:
            continue
    
    return results
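
One way to consume the result is a small text report. The sketch below relies only on the dictionary shape documented in the docstring above; `summarize_analysis` is a hypothetical helper, and the sample dictionary is hand-written rather than real output. Since the analysis loop silently skips CIDs that fail to parse, the report derives a "skipped" count from the gap between `total` and the two categories.

```python
def summarize_analysis(results: dict) -> str:
    """Render an analyze_cid_collection() result as a short text report."""
    counted = results["backward_compatible"] + results["breaking_change"]
    # Unparseable CIDs are skipped by the analysis loop, so they only
    # show up as the difference between 'total' and the counted CIDs.
    skipped = results["total"] - counted
    lines = [
        f"total: {results['total']}",
        f"backward compatible: {results['backward_compatible']}",
        f"breaking change: {results['breaking_change']}",
        f"skipped (unparseable): {skipped}",
    ]
    for name, count in sorted(results["by_codec"].items()):
        lines.append(f"  {name}: {count}")
    return "\n".join(lines)


# Hand-written sample in the documented result shape (not real data).
sample = {
    "total": 4,
    "backward_compatible": 2,
    "breaking_change": 1,
    "by_codec": {"raw": 2, "dag-jose": 1},
    "breaking_cids": [b"\x01\x85\x01"],
}
print(summarize_analysis(sample))
```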

Phase 4: Documentation

4.1 Breaking Change Notice

Add to docs/getting-started, or to a new doc file named codec:

# BREAKING CHANGE: CIDv1 Codec Varint Encoding

## Summary

CIDv1 now uses proper varint encoding for codec values, as specified in the
multicodec specification. This changes the binary format for CIDs using
codecs ≥ 128.

## Impact

- **95% of CIDs unaffected**: Common codecs (raw, dag-pb, dag-cbor) use
  values < 128, which encode identically in both formats
- **5% of CIDs affected**: dag-jose (0x85), dag-json (0x0129), and
  experimental codecs ≥ 128 now use multi-byte varint encoding

## Migration Required

If you use dag-json, dag-jose, or custom codecs ≥ 128:

1. **Identify affected CIDs** using `detect_cid_encoding_format()`
2. **Recompute CIDs** from original data using `recompute_cid_from_data()`
3. **Update storage** (databases, caches) with new CIDs

## Code Examples

### Check if your CIDs are affected:

```python
from libp2p.bitswap.cid import detect_cid_encoding_format
# Or using contributor's naming:
# from libp2p.bitswap.cid import detect_cid_version

info = detect_cid_encoding_format(your_cid)
# Or: info = detect_cid_version(your_cid)

if info['is_breaking']:
    print(f"CID uses {info['codec_name']} and needs migration")
```

### Recompute affected CIDs:

```python
from libp2p.bitswap.cid import recompute_cid_from_data
# Or using contributor's naming:
# from libp2p.bitswap.cid import migrate_legacy_cid

new_cid = recompute_cid_from_data(old_cid, original_data)
# Or: new_cid = migrate_legacy_cid(old_cid, original_data)
```

## Backward Compatibility

Code continues to accept integer codec values for API compatibility:

```python
# All of these work:
cid1 = compute_cid_v1(data, codec=0x55)        # int
cid2 = compute_cid_v1(data, codec="raw")       # str
cid3 = compute_cid_v1(data, codec=CODEC_RAW)   # Code object
```

4.2 Code Comments

Add to libp2p/bitswap/cid.py:

"""
CID (Content Identifier) implementation for libp2p.

IMPORTANT: Breaking Change in v1.0
====================================

CIDv1 now uses proper varint encoding for codec values:

- Codecs < 128: Single byte (backward compatible)
  Example: raw (0x55) → [0x55]

- Codecs ≥ 128: Multi-byte varint (BREAKING CHANGE)
  Example: dag-jose (0x85) → [0x85, 0x01]

This matches the multicodec specification but changes binary format
for dag-json, dag-jose, and experimental codecs.
"""

Phase 5: Integration & Verification

5.1 Integration Test Suite

def test_complete_cid_workflow():
    """End-to-end test of CID creation, parsing, and verification."""
    
    test_data = b"Integration test data"
    test_codecs = [
        (0x55, "raw", False),          # Backward compatible
        (0x70, "dag-pb", False),       # Backward compatible
        (0x85, "dag-jose", True),      # Breaking (0x85 is dag-jose; dag-json is 0x0129)
    ]
    
    for codec_value, codec_name, is_breaking in test_codecs:
        # Create CID
        cid = compute_cid_v1(test_data, codec=codec_value)
        
        # Detect format
        info = detect_cid_encoding_format(cid)
        assert info['codec_value'] == codec_value
        assert info['codec_name'] == codec_name
        assert info['is_breaking'] == is_breaking
        
        # Extract prefix
        prefix = get_cid_prefix(cid)
        expected_prefix_len = 5 if is_breaking else 4
        assert len(prefix) == expected_prefix_len
        
        # Verify CID
        assert verify_cid(cid, test_data) == True
        assert verify_cid(cid, b"wrong data") == False
        
        # Recompute
        new_cid = recompute_cid_from_data(cid, test_data)
        assert new_cid == cid
        
        print(f"✅ {codec_name}: All tests passed")

5.2 Pre-Review Checklist

Before merging/deploying:

Quick Reference: Implementation Checklist

🔴 P0 - Must Complete (Blockers)

Fix get_cid_prefix() - Parse varint codec correctly
Fix verify_cid() - Use forward parsing, not reverse indexing
Test codecs < 128 - Verify backward compatibility
Test codecs ≥ 128 - Verify multi-byte varint works
Document breaking change - Add to CHANGELOG/MIGRATION.md

🟡 P1 - Strongly Recommended

Implement detect_cid_encoding_format() - Format detection
Implement recompute_cid_from_data() - Migration helper
Add edge case tests - Empty data, large data, invalid input
Add code comments - Document breaking change in source
Integration test suite - End-to-end verification

🟢 P2 - Nice to Have (can be done in this PR or raised in a separate PR)

Implement analyze_cid_collection() - Batch analysis
Add dual-format support - Generate legacy format (optional)
Cross-compatibility tests - Test with other implementations

Common Pitfalls & Solutions

❌ Pitfall #1: Assuming Single-Byte Codec

Wrong:

codec = cid[1]  # Assumes single byte

Right:

# Parse varint to get codec value
codec_value = 0
shift = 0
for i in range(1, min(len(cid), 11)):
    byte = cid[i]
    codec_value |= (byte & 0x7F) << shift
    shift += 7
    if (byte & 0x80) == 0:
        break
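
The loop above yields a wrong value without complaint if the CID ends in the middle of the varint, and it does not report how many bytes the codec occupied. A slightly more defensive sketch, using a hypothetical helper name `decode_uvarint` and a 9-byte cap chosen for this illustration, could look like this:

```python
MAX_VARINT_LEN = 9  # cap chosen for this sketch


def decode_uvarint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, bytes_consumed) for the varint starting at offset."""
    value = 0
    shift = 0
    for i in range(offset, min(len(buf), offset + MAX_VARINT_LEN)):
        byte = buf[i]
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # no continuation bit: this was the last byte
            return value, i - offset + 1
        shift += 7
    raise ValueError("truncated or overlong varint")


# Hand-built CIDv1 prefix: version 1, codec 0x85 as varint, sha2-256, length 32
cid = bytes([0x01, 0x85, 0x01, 0x12, 0x20]) + bytes(32)
codec_value, codec_length = decode_uvarint(cid, 1)
print(hex(codec_value), codec_length)  # 0x85 2
```

Returning the consumed length lets the caller continue parsing at `1 + codec_length` without re-scanning the varint.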

❌ Pitfall #2: Using Reverse Indexing

Wrong:

hash_length = cid[-33]  # Assumes fixed position

Right:

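# Parse forward to find multihash position
codec_length = 1
while cid[codec_length] & 0x80:  # continuation bit set: codec varint continues
    codec_length += 1
hash_type_offset = 1 + codec_length
# Assumes the hash type and hash length each fit in a single byte
hash_length = cid[hash_type_offset + 1]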
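
Taken a step further, forward parsing can split a whole binary CIDv1 into its parts without assuming any fixed offsets. The following is a minimal sketch under that idea; `read_uvarint` and `split_cid_v1` are hypothetical names for this illustration, not py-libp2p functions, and the sample CID is built by hand with `hashlib`.

```python
import hashlib


def read_uvarint(buf: bytes, offset: int) -> tuple[int, int]:
    """Return (value, next_offset) for the varint starting at offset."""
    value = shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            return value, offset


def split_cid_v1(cid: bytes) -> tuple[int, int, bytes]:
    """Parse a binary CIDv1 front to back into (codec, hash_code, digest)."""
    version, pos = read_uvarint(cid, 0)
    if version != 1:
        raise ValueError("not a CIDv1")
    codec, pos = read_uvarint(cid, pos)
    hash_code, pos = read_uvarint(cid, pos)
    digest_len, pos = read_uvarint(cid, pos)
    digest = cid[pos:pos + digest_len]
    if len(digest) != digest_len or pos + digest_len != len(cid):
        raise ValueError("digest length does not match CID length")
    return codec, hash_code, digest


data = b"test data"
digest = hashlib.sha256(data).digest()
# Hand-built CIDv1: version 1, codec 0x85 (varint 0x85 0x01), sha2-256 (0x12), length 32
cid = bytes([0x01, 0x85, 0x01, 0x12, 0x20]) + digest
codec, hash_code, parsed = split_cid_v1(cid)
print(hex(codec), hex(hash_code), parsed == digest)  # 0x85 0x12 True
```

Verification then reduces to hashing the candidate data with the algorithm named by `hash_code` and comparing the result with `digest`.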

❌ Pitfall #3: No Breaking Change Tests

Wrong:

# Only test common codecs
def test_cid():
    cid = compute_cid_v1(data, codec=0x55)  # Only raw
    assert verify_cid(cid, data)

Right:

# Test breaking change codecs
def test_cid_breaking_codecs():
    cid = compute_cid_v1(data, codec=0x85)  # dag-jose
    assert cid[1:3] == bytes([0x85, 0x01])  # Verify 2-byte varint
    assert verify_cid(cid, data)

Run the Test Suite

# Run CID-specific tests
pytest tests/core/bitswap/test_cid.py -v

# Run with coverage
pytest tests/core/bitswap/test_cid.py --cov=libp2p.bitswap.cid

Validate Your Implementation

# Quick validation script
def validate_implementation():
    """Verify critical functions work correctly."""
    test_data = b"validation test"
    
    # Test 1: Backward compatible codec
    cid_raw = compute_cid_v1(test_data, codec=0x55)
    assert len(get_cid_prefix(cid_raw)) == 4
    assert verify_cid(cid_raw, test_data)
    print("✅ Backward compatible codec works")
    
    # Test 2: Breaking change codec (0x85 is dag-jose in the multicodec table)
    cid_jose = compute_cid_v1(test_data, codec=0x85)
    assert len(get_cid_prefix(cid_jose)) == 5
    assert verify_cid(cid_jose, test_data)
    print("✅ Breaking change codec works")
    
    # Test 3: Detection
    info = detect_cid_encoding_format(cid_jose)
    assert info['is_breaking']
    print("✅ Format detection works")
    
    print("\n🎉 Implementation validated!")

if __name__ == "__main__":
    validate_implementation()

| Phase | Priority | Description |
|-------|----------|-------------|
| Phase 1 | 🔴 P0 | Fix core functions |
| Phase 2 | 🔴 P0 | Backward compat tests |
| Phase 3 | 🟡 P1 | Migration tools |
| Phase 4 | 🔴 P0 | Documentation |
| Phase 5 | 🔴 P0 | Integration & verification |


acul71 · 2026-02-22T01:15:35Z

acul71
Feb 22, 2026
Maintainer Author

Status update:

Core multicodec work: PR #1194 (add multicodec) is merged. It implements Priority 1 and 2 from this discussion: type-safe Code constants in libp2p/bitswap/cid.py, proper varint codec encoding, envelope payload type via py-multicodec, migration helpers (detect_cid_encoding_format, recompute_cid_from_data, analyze_cid_collection), and the breaking-change doc. Issue #1193 (feat: multicodec integration) remains open for tracking.
Serialization (Priority 3): This was left optional in the discussion and was not part of #1194; it can be taken up in a follow-up issue/PR if desired.
Overlap note: PR #1235 (feat: Bitswap py-cid Migration Phase 1 + Phase 2) has since merged; the CID module is now py-cid-based. Any new multicodec work touching libp2p/bitswap/cid.py should align with that.

