Hey @acul71 @asmit27rai, I would like to take up this issue.
-
CID Breaking Change Guide
Overview
One of the developers working on the Multicodec implementation raised a concern about a breaking change in CID handling. It has the following impact:
The guide below follows the migration strategy proposed by the contributor:
Compatibility:
Migration Strategy:
Implementation Notes:
Quick Impact Check

    # Check if a codec causes a breaking change
    def is_breaking_codec(codec_value: int) -> bool:
        return codec_value >= 128  # Multi-byte varint required

    # Common codecs:
    is_breaking_codec(0x55)    # raw → False ✅
    is_breaking_codec(0x70)    # dag-pb → False ✅
    is_breaking_codec(0x85)    # dag-jose → True 🔴
    is_breaking_codec(0x0129)  # dag-json → True 🔴

Phase 1: Core Implementation
1.1 Fix CID Prefix Extraction
Problem: Current code assumes a single-byte codec and breaks with a multi-byte varint
Location:
Fix:

    def get_cid_prefix(cid: bytes) -> bytes:
        """Extract CID prefix, properly parsing varint codec.

        CIDv1 format: <version:1> <codec:varint> <hash_type:1> <hash_len:1>
        """
        if len(cid) < 2 or cid[0] != CID_V1:
            return b""
        # Parse varint codec to find its length
        codec_length = 0
        offset = 1  # Skip version byte
        for i in range(offset, min(len(cid), offset + 10)):  # Max varint is 10 bytes
            codec_length += 1
            if (cid[i] & 0x80) == 0:  # MSB clear = last byte of varint
                break
        # Now we know where multihash starts
        hash_type_offset = 1 + codec_length
        if len(cid) <= hash_type_offset + 1:
            return b""
        # Read hash length from multihash
        hash_length = cid[hash_type_offset + 1]
        # Prefix = version + codec + hash_type + hash_length
        prefix_length = 1 + codec_length + 2
        if len(cid) < prefix_length:
            return b""
        return cid[:prefix_length]

Test It:

    def test_get_cid_prefix_multibyte_varint():
        """Test prefix extraction with 2-byte varint codec."""
        # dag-jose (0x85 = 133) uses varint: 0x85 0x01
        data = b"test data"
        cid = compute_cid_v1(data, codec=0x85)
        prefix = get_cid_prefix(cid)
        # Should be: version(1) + codec(2) + hash_type(1) + hash_len(1) = 5 bytes
        assert len(prefix) == 5, f"Expected 5, got {len(prefix)}"
        assert prefix[0] == 0x01  # CIDv1
        assert prefix[1:3] == bytes([0x85, 0x01])  # 2-byte varint

1.2 Fix CID Verification
Problem: Uses reverse indexing that fails with a multi-byte varint
Location:
Fix:

    import hashlib

    def verify_cid(cid: bytes, data: bytes) -> bool:
        """Verify CID matches data using forward parsing."""
        if len(cid) < 2 or cid[0] != CID_V1:
            return False
        # Parse varint codec to find where multihash starts
        codec_length = 0
        offset = 1  # Skip version byte
        for i in range(offset, min(len(cid), offset + 10)):
            codec_length += 1
            if (cid[i] & 0x80) == 0:  # Last byte of varint
                break
        # Extract multihash components
        hash_type_offset = 1 + codec_length
        if len(cid) <= hash_type_offset + 1:
            return False
        hash_type = cid[hash_type_offset]
        hash_length = cid[hash_type_offset + 1]
        # Extract digest from CID
        digest_offset = hash_type_offset + 2
        if len(cid) < digest_offset + hash_length:
            return False
        cid_digest = cid[digest_offset : digest_offset + hash_length]
        # Compute expected digest
        # TODO: Support multiple hash algorithms, not just SHA2-256
        expected_digest = hashlib.sha256(data).digest()
        return expected_digest == cid_digest

Test It:

    def test_verify_cid_multibyte_varint():
        """Test verification with 2-byte varint codec."""
        data = b"test data"
        cid = compute_cid_v1(data, codec=0x85)  # dag-jose
        # Should verify successfully
        assert verify_cid(cid, data)
        # Should fail with wrong data
        assert not verify_cid(cid, b"wrong data")

Phase 2: Backward Compatibility Testing
2.1 Test Codec < 128 (Backward Compatible)
Goal: Verify codecs < 128 produce identical CIDs in both formats

    def test_backward_compatible_codecs():
        """Verify codecs < 128 are backward compatible."""
        test_data = b"Hello World"
        # Test raw (0x55)
        cid_raw = compute_cid_v1(test_data, codec=0x55)
        assert cid_raw[1] == 0x55  # Single byte
        # Test dag-pb (0x70)
        cid_dag_pb = compute_cid_v1(test_data, codec=0x70)
        assert cid_dag_pb[1] == 0x70  # Single byte
        # Test dag-cbor (0x71)
        cid_dag_cbor = compute_cid_v1(test_data, codec=0x71)
        assert cid_dag_cbor[1] == 0x71  # Single byte
        # Verify prefix extraction works
        for cid in [cid_raw, cid_dag_pb, cid_dag_cbor]:
            prefix = get_cid_prefix(cid)
            assert len(prefix) == 4  # version(1) + codec(1) + type(1) + len(1)
            assert verify_cid(cid, test_data)

2.2 Test Codec ≥ 128 (Breaking Change)
Goal: Verify codecs ≥ 128 use multi-byte varint correctly

    def test_multibyte_varint_codecs():
        """Verify codecs ≥ 128 use multi-byte varint encoding."""
        test_data = b"test data"
        # Test dag-jose (0x85 = 133)
        cid_dag_jose = compute_cid_v1(test_data, codec=0x85)
        # Should use 2-byte varint: 0x85 0x01
        assert cid_dag_jose[1:3] == bytes([0x85, 0x01])
        # Prefix should be 5 bytes: version(1) + codec(2) + type(1) + len(1)
        prefix = get_cid_prefix(cid_dag_jose)
        assert len(prefix) == 5, f"Expected 5, got {len(prefix)}"
        # Verification should work
        assert verify_cid(cid_dag_jose, test_data)
        # Test dag-json (0x0129 = 297) if available
        try:
            cid_dag_json = compute_cid_v1(test_data, codec=0x0129)
            assert len(get_cid_prefix(cid_dag_json)) == 5  # version(1) + codec(2) + type(1) + len(1)
            assert verify_cid(cid_dag_json, test_data)
        except ValueError:
            pass  # codec not available

2.3 Test Edge Cases

    def test_cid_edge_cases():
        """Test edge cases and error handling."""
        # Test empty data
        cid_empty = compute_cid_v1(b"", codec=0x55)
        assert verify_cid(cid_empty, b"")
        # Test large data
        large_data = b"x" * 10000
        cid_large = compute_cid_v1(large_data, codec=0x55)
        assert verify_cid(cid_large, large_data)
        # Test invalid CID (too short)
        assert get_cid_prefix(b"\x01") == b""
        assert not verify_cid(b"\x01", b"test")
        # Test string codec input
        cid_str = compute_cid_v1(b"test", codec="raw")
        cid_int = compute_cid_v1(b"test", codec=0x55)
        assert cid_str == cid_int
        # Test invalid codec
        try:
            compute_cid_v1(b"test", codec="nonexistent")
            assert False, "Should raise ValueError"
        except ValueError as e:
            assert "Unknown codec" in str(e)

Phase 3: Migration Tools
3.1 Format Detection Function
Purpose: Identify if a CID uses legacy or varint encoding
Contributor's Proposed Name:
Note: Both function names are provided below. Use the one that fits your naming conventions.
Option A:
-
Status update:
-
py-libp2p Multicodec Integration Status and Improvement Opportunities
Executive Summary
This document analyzes the current usage of multicodec functionality in py-libp2p and identifies opportunities to leverage new features from py-multicodec (v1.0.0+) to improve code quality, type safety, and maintainability.
Key Finding: py-libp2p currently uses hardcoded multicodec constants and manual prefix handling, but does not import or use the py-multicodec library. The new py-multicodec features (Code type, serialization framework, named constants) can significantly improve the codebase.
Current State Analysis
1. Hardcoded Multicodec Constants
Location: libp2p/bitswap/cid.py
Issues:
2. Manual Prefix Handling
Location: libp2p/peer/envelope.py
The Envelope class uses payload_type as raw bytes, which should be a multicodec-prefixed identifier. Currently, it is constructed manually without using multicodec's prefix functions.
Issues:
3. CID Construction Without Proper Multicodec Support
Location: libp2p/bitswap/cid.py
The CID construction functions manually construct CIDs with hardcoded codec values:
Issues:
- multicodec.add_prefix() for proper varint encoding
py-multicodec Features Overview
Core Features Available
Type-Safe Code Objects (Code class)
Named Constants (code_table.py)
- SHA2_256, DAG_PB, RAW, IP4, TCP, etc.
Prefix Operations
- add_prefix(codec_name, data) - Add multicodec prefix
- remove_prefix(prefixed_data) - Remove prefix
- get_codec(prefixed_data) - Extract codec name
- extract_prefix(prefixed_data) - Extract prefix integer
- get_prefix(codec_name) - Get prefix bytes for codec
Serialization Framework (serialization.py)
- Codec abstract base class for custom codecs
- JSONCodec and RawCodec
- encode(codec_name, data) - Generic encoding
- decode(data) - Auto-detect and decode
Code Management
- known_codes() - List all registered codes
- Code.from_string() - Create Code from name or hex
- is_codec(name) - Validate codec names
- is_reserved(code) - Check reserved range
Improvement Opportunities
Priority 1: Replace Hardcoded Constants
1.1 Bitswap CID Module (libp2p/bitswap/cid.py)
Current Code:
Proposed Improvement:
Benefits:
Impact: High - Core functionality used throughout bitswap
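To make the idea of named constants concrete, here is a minimal sketch that does not use py-multicodec itself. The `Codec` enum and the `describe` helper are hypothetical stand-ins invented for illustration; only the numeric values (sha2-256, raw, dag-pb) come from the multicodec table.

```python
from enum import IntEnum


class Codec(IntEnum):
    """Hypothetical stand-in for named multicodec constants."""

    SHA2_256 = 0x12
    RAW = 0x55
    DAG_PB = 0x70


def describe(code: int) -> str:
    """Return a readable label for a code, falling back to hex for unknown values."""
    try:
        return Codec(code).name
    except ValueError:
        return f"unknown(0x{code:x})"


print(describe(0x70))     # DAG_PB
print(Codec.RAW == 0x55)  # True: IntEnum members still compare equal to plain ints
print(describe(0x99))     # unknown(0x99)
```

Because the members behave like integers, code that already passes raw values such as 0x70 around would keep working while new code uses the names.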
1.2 Envelope Payload Type (libp2p/peer/envelope.py)
Current Code:
Proposed Improvement:
Benefits:
Impact: Medium - Used in peer record handling
Priority 2: Use Proper Prefix Operations
2.1 CID Construction (libp2p/bitswap/cid.py)
Current Code:
Proposed Improvement:
Benefits:
Impact: High - Critical for CID correctness
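As a rough sketch of why varint encoding matters for CID correctness, the following builds a CIDv1 by hand using only the standard library. The helper names `encode_varint` and `compute_cid_v1_sketch` are assumptions made for this example; they are not the py-libp2p or py-multicodec functions, and the sketch supports SHA2-256 only.

```python
import hashlib

CID_V1 = 0x01
SHA2_256 = 0x12  # multihash code for sha2-256


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compute_cid_v1_sketch(data: bytes, codec: int) -> bytes:
    """Hypothetical helper: version + varint codec + multihash (sha2-256 only)."""
    digest = hashlib.sha256(data).digest()
    # The hash code and digest length both fit in one byte for sha2-256.
    return bytes([CID_V1]) + encode_varint(codec) + bytes([SHA2_256, len(digest)]) + digest


cid_raw = compute_cid_v1_sketch(b"test data", 0x55)
cid_big = compute_cid_v1_sketch(b"test data", 0x85)
print(len(cid_raw), len(cid_big))  # 36 37
print(cid_big[1:3].hex())          # 8501
```

Codecs below 128 occupy one byte, so their CIDs are unchanged; any codec at or above 128 grows the CID by at least one byte, which is where fixed-offset parsing goes wrong.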
2.2 Envelope Payload Type Handling
Current Code:
Proposed Improvement:
Benefits:
Impact: Medium - Improves peer record handling
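Handling a multicodec-prefixed identifier comes down to splitting the leading varint code from the bytes that follow. The sketch below shows that step with the standard library only; `decode_varint`, `split_prefix` and the 10-byte cap are assumptions for illustration, not functions from py-multicodec or py-libp2p.

```python
MAX_VARINT_LEN = 10  # cap assumed for this sketch


def decode_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode an unsigned varint from buf starting at offset.

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    shift = 0
    window = buf[offset:offset + MAX_VARINT_LEN]
    for consumed, byte in enumerate(window, start=1):
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # continuation bit clear: last byte
            return value, consumed
        shift += 7
    raise ValueError("truncated or over-long varint")


def split_prefix(prefixed: bytes) -> tuple[int, bytes]:
    """Hypothetical helper: return (code, remaining bytes) for prefixed data."""
    code, size = decode_varint(prefixed)
    return code, prefixed[size:]


print(split_prefix(b"\x55abc"))          # (85, b'abc')
print(split_prefix(b"\x85\x01payload"))  # (133, b'payload')
```

Raising on a truncated prefix, rather than returning a partial value, keeps malformed input from being silently treated as a valid code.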
Priority 3: Leverage Serialization Framework
3.1 Peer Record Serialization (libp2p/peer/peer_record.py)
Current Code:
Proposed Improvement:
Benefits:
Impact: Medium - Improves serialization consistency
3.2 DAG-PB Encoding (libp2p/bitswap/dag_pb.py)
Current Code:
Proposed Improvement:
Benefits:
- encode()/decode() functions
Impact: Low-Medium - Nice to have for consistency
Priority 4: Code Validation and Error Handling
4.1 Validate Codec Names
Current Code: No validation when codec names/values are used
Proposed Improvement:
Benefits:
Impact: Medium - Improves robustness
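One way to picture codec validation is a single function that accepts either a name or an integer and rejects anything unknown. This is a minimal sketch with a tiny hand-written table; `resolve_codec` and the table are hypothetical, and a real integration would presumably consult the library's full codec table instead.

```python
from typing import Union

# Tiny illustrative table; the real multicodec table is much larger.
_NAME_TO_CODE = {"raw": 0x55, "dag-pb": 0x70, "dag-cbor": 0x71}
_KNOWN_CODES = set(_NAME_TO_CODE.values())


def resolve_codec(codec: Union[str, int]) -> int:
    """Hypothetical helper: normalise a codec name or value to its integer code."""
    if isinstance(codec, str):
        try:
            return _NAME_TO_CODE[codec]
        except KeyError:
            raise ValueError(f"Unknown codec: {codec!r}") from None
    if codec not in _KNOWN_CODES:
        raise ValueError(f"Unknown codec: 0x{codec:x}")
    return codec


print(resolve_codec("raw"))  # 85
print(resolve_codec(0x70))   # 112
try:
    resolve_codec("nonexistent")
except ValueError as exc:
    print(exc)               # Unknown codec: 'nonexistent'
```

Validating at the boundary means a typo in a codec name fails immediately with a clear message rather than producing a CID with an unintended prefix.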
4.2 Envelope Payload Type Validation
Current Code:
Proposed Improvement:
Benefits:
Impact: Low-Medium - Improves error handling
Priority 5: Use Named Constants Throughout
5.1 Replace All Hardcoded Values
Locations to Update:
- libp2p/bitswap/cid.py - Codec and hash constants
- libp2p/peer/envelope.py - Payload type
Proposed Improvement:
Benefits:
Impact: High - Improves code maintainability
Migration Strategy
Phase 1: Add Dependency and Basic Integration
Add py-multicodec to dependencies:
Replace hardcoded constants in libp2p/bitswap/cid.py:
- CODEC_DAG_PB = 0x70 with CODEC_DAG_PB = DAG_PB
- Code | str | int
Update envelope payload type in libp2p/peer/envelope.py:
- get_prefix() for proper encoding
- get_codec()
Phase 2: Improve CID Construction
Update compute_cid_v1() to use add_prefix():
Update CID parsing to use get_codec():
Phase 3: Add Serialization Framework (Optional)
Phase 4: Testing and Validation
Code Examples
Example 1: Improved CID Construction
Example 2: Type-Safe Envelope
Example 3: Custom Codec for PeerRecord
Benefits Summary
Type Safety
Maintainability
Correctness
Extensibility
Developer Experience
Potential Issues and Considerations
1. Backward Compatibility
Issue: Existing code may rely on integer codec values.
Solution:
- Code objects and integers in function signatures
- int(code) to convert Code to integer where needed
2. Performance
Issue: Code object creation and validation may add overhead.
Solution:
3. Dependency Management
Issue: Adding py-multicodec as a dependency.
Solution:
4. Codec Table Updates
Issue: Multicodec table may change over time.
Solution:
Recommendations
Immediate Actions (High Priority)
- pyproject.toml
- libp2p/bitswap/cid.py
- get_prefix()
Short-term Improvements (Medium Priority)
Long-term Enhancements (Low Priority)
Conclusion
py-multicodec provides powerful features that can significantly improve py-libp2p's code quality, type safety, and maintainability. The most impactful improvements are:
These changes will make the codebase more robust, easier to maintain, and better aligned with the multicodec specification. The migration can be done gradually with backward compatibility, minimizing risk while maximizing benefits.
References
- libp2p/bitswap/cid.py
- libp2p/peer/envelope.py
- libp2p/peer/peer_record.py
- libp2p/bitswap/dag_pb.py