fix: replace peek() with SHA-256 hash to prevent cache poisoning (AC-2) #678
3em0 wants to merge 1 commit into zilliztech:main
Welcome @3em0! It looks like this is your first PR to zilliztech/GPTCache 🎉 |
…AC-2)

BufferedReader.peek() only returns the first ~8192 bytes of a file, making it trivial to construct different images/files that produce identical cache keys. This enables cache poisoning where an attacker's query returns another user's cached answer.

Replace peek() with streaming SHA-256 hash of the full file content in get_file_bytes(), get_input_str(), and get_image_question(). The file pointer is reset after hashing so downstream LLM calls can still read the complete file.

Also fixes a resource leak in get_image_question() (open() without close).

Signed-off-by: 3em0 <3em0@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
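The hash-then-reset approach described in this commit message could look roughly like the sketch below. It is not the code from the PR: `sha256_of_stream` and `chunk_size` are hypothetical names chosen for illustration, and the sketch assumes the input is a seekable binary stream.

```python
import hashlib
import io


def sha256_of_stream(fileobj, chunk_size=65536):
    """Hypothetical helper: hash the full content of a seekable binary stream.

    Reads in fixed-size chunks so large files are not loaded into memory
    at once, then rewinds so later readers still see the whole file.
    """
    fileobj.seek(0)
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)  # reset the file pointer for downstream consumers
    return digest.hexdigest()


if __name__ == "__main__":
    stream = io.BytesIO(b"example image bytes" * 1000)
    print(sha256_of_stream(stream))
    print(stream.tell())  # position after hashing
```

Hashing in chunks keeps memory use bounded regardless of file size, and the final `seek(0)` is what lets the same file object be passed on to the LLM call afterwards.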
e901655 to 8fde3d2
|
@xiaofan-luan hi! How is it going? |
|
/assign @cxie Hi @cxie and @xiaofan-luan, DCO is fixed and the PR is ready for review. |
Summary
- BufferedReader.peek() only returns the first ~8192 bytes (buffer size) of a file. Cache keys generated from peek() output are vulnerable to collision: two files sharing the same header but different content produce identical cache keys, enabling cache poisoning and information disclosure.
- Replace peek() with a streaming SHA-256 hash of the full file content in three affected functions: get_file_bytes(), get_input_str(), and get_image_question()
- Reset the file pointer with seek(0) after hashing so downstream LLM calls still read the complete file
- Fix a resource leak in get_image_question() (open() without close)

Affected Functions

| Function | Before | After |
| --- | --- | --- |
| get_file_bytes() | file.peek() → ~8KB | sha256(file.read()) → full content hash |
| get_input_str() | str(image.peek()) + question | sha256(image) + question |
| get_image_question() | str(open(img).peek()) + question | sha256(img) + question |

Attack Scenario

1. img_A (legitimate) + question → answer cached with key derived from first 8KB
2. img_B with identical first 8KB but completely different content
3. img_B + same question → cache returns img_A's answer (LLM never called)

See tests/security/AC-2_cache_poisoning_via_peek.md for the full vulnerability report with CVSS scoring.

Test plan

- … (tests/unit_tests/processor/test_pre.py)
- … (tests/poc_ac2_peek_collision.py, tests/poc_ac2_e2e_poisoning.py)

🤖 Generated with Claude Code
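The collision described in the attack scenario can be reproduced in a few lines. The sketch below is an illustration only, not one of the PoC scripts listed in the test plan: `peek_key`, `sha256_key`, and the sample byte strings are hypothetical, and the buffer size is pinned to 8192 explicitly because the default chosen by `open()` can differ between platforms and Python versions.

```python
import hashlib
import io

BUFFER_SIZE = 8192  # assumed buffer size, fixed here to keep the demo deterministic


def peek_key(data: bytes, question: str) -> str:
    """Old-style key: built from whatever peek() returns (at most one buffer)."""
    reader = io.BufferedReader(io.BytesIO(data), buffer_size=BUFFER_SIZE)
    return str(reader.peek()) + question


def sha256_key(data: bytes, question: str) -> str:
    """New-style key: built from a digest of the full content."""
    return hashlib.sha256(data).hexdigest() + question


# Two hypothetical "images" that share a header longer than the buffer
# but differ in everything after it.
shared_header = b"\x89PNG" + b"\x00" * (2 * BUFFER_SIZE)
img_a = shared_header + b"legitimate image body"
img_b = shared_header + b"attacker-controlled body"
question = "What is in this picture?"

# Compare the keys produced for the two inputs under each scheme.
print("peek keys collide:", peek_key(img_a, question) == peek_key(img_b, question))
print("sha256 keys collide:", sha256_key(img_a, question) == sha256_key(img_b, question))
```

Under these assumptions the peek-based keys match while the SHA-256 keys do not, which is the difference the PR relies on.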