
Implement payload metadata extraction engine. Closes #13 #15

Open
opbot-xd wants to merge 2 commits into develop from feature/metadata-extraction

Conversation

@opbot-xd
Collaborator

Description

This PR introduces a scanner module to locate captured payloads modified within a specific timeframe using os.stat().st_mtime. Additionally, it implements a hasher module that streams files to compute MD5, SHA1, and SHA256 hashes simultaneously. It also safely integrates python-magic to detect the true MIME type of payloads without executing them.
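As a rough illustration of the single-pass hashing described above, here is a minimal sketch. The function name `compute_hashes` and the 64 KiB chunk size are assumptions made for this example, not taken from the PR's `hasher.py`.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # assumed chunk size; the PR does not state one


def compute_hashes(path: Path) -> dict[str, str]:
    """Read the file once and feed every chunk to all three digests."""
    digests = {
        "md5": hashlib.md5(usedforsecurity=False),
        "sha1": hashlib.sha1(usedforsecurity=False),
        "sha256": hashlib.sha256(),
    }
    with path.open("rb") as fh:
        # Read fixed-size chunks so large payloads never sit fully in memory.
        while chunk := fh.read(CHUNK_SIZE):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Because each chunk is passed to all three digest objects before the next read, the file is read from disk only once regardless of how many algorithms are computed.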

Related issues

Closes #13

Type of change

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • Chore (refactoring, dependency updates, CI/CD changes, code cleanup, docs-only changes).

Checklist

Please complete this checklist carefully. It helps guide your contribution and lets maintainers verify that all requirements are met.

Formalities

  • I chose an appropriate title for the pull request in the form: Implement payload metadata extraction engine. Closes #13
  • My branch is based on develop.
  • The pull request is for the branch develop.
  • I have reviewed and verified any LLM-generated code included in this PR.

Docs and tests

  • I documented my code changes with docstrings and/or comments.
  • Linter (Ruff) gave 0 errors.
  • I have added tests for the feature/bug I solved.
  • All the tests gave 0 errors.

Review process

  • We encourage you to create a draft PR first, even when your changes are incomplete. This way you refine your code while we can track your progress and actively review and help.
  • If you think your draft PR is ready to be reviewed by the maintainers, click the corresponding button. Your draft PR will become a real PR.
  • Every time you make changes to the PR and you think the work is done, you should explicitly ask for a review. After receiving a "change request", address the feedback and click "request re-review" next to the reviewer's profile picture at the top right.

- Added hasher.py for streaming MD5, SHA1, and SHA256 hashes
- Added scanner.py to extract MIME types via python-magic and filter files by mtime
- Centralized ruff linter ignore rules for S324 and BLE001
- Added pytest test coverage for metadata extraction

Copilot AI left a comment


Pull request overview

This PR adds a small “payload metadata extraction engine” to the service by introducing a filesystem scanner that filters payloads by mtime and extracts per-file metadata (hashes + MIME type) without executing the payloads, addressing Issue #13.

Changes:

  • Added app/scanner.py to recursively scan a directory and emit metadata for payloads modified within a given time window.
  • Added app/hasher.py to stream files once while computing MD5/SHA1/SHA256.
  • Added python-magic dependency and a new pytest module covering hashing + scanning behavior.
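The mtime-window filtering summarized above could look roughly like the following. This is a sketch under assumptions: the function name `find_recent_files`, the inclusive bounds, and the decision to skip unreadable files are illustrative choices, not necessarily what `app/scanner.py` does.

```python
import os
from pathlib import Path


def find_recent_files(root: Path, start: float, end: float) -> list[Path]:
    """Return files under root whose mtime lies in [start, end] (epoch seconds)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it in this sketch.
                continue
            if start <= mtime <= end:
                matches.append(path)
    return sorted(matches)
```

Only `stat()` is called on each file here, so nothing is opened or executed during the filtering step.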

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| app/hasher.py | New streaming hash helper (md5/sha1/sha256). |
| app/scanner.py | New recursive scanner that filters by mtime and extracts hashes + MIME type. |
| tests/test_metadata.py | Adds unit tests for hash computation and scanner filtering/metadata. |
| pyproject.toml | Adds python-magic dependency and adjusts Ruff ignore list. |
| uv.lock | Locks python-magic==0.4.27 and updates project dependency metadata. |


Comment thread app/hasher.py Outdated
Comment thread app/scanner.py
Comment on lines +26 to +30
try:
    mime = magic.Magic(mime=True)
except Exception:
    mime = None

Comment thread app/scanner.py Outdated
Comment on lines +40 to +44
mime_type = "unknown"
if mime:
    with contextlib.suppress(Exception):
        mime_type = mime.from_file(str(file_path))

Comment thread pyproject.toml
Comment thread tests/test_metadata.py Outdated
payload = results[0]

assert payload["file_path"] == str(file_path1)
assert "text" in payload["mime_type"] or "plain" in payload["mime_type"]
@opbot-xd opbot-xd marked this pull request as draft May 22, 2026 20:23
@opbot-xd opbot-xd marked this pull request as ready for review May 25, 2026 15:29
@opbot-xd
Collaborator Author

Hi @regulartim! The PR is ready for review. Please take a look when you get time :D

Comment thread app/scanner.py
Comment on lines +42 to +47
except Exception as e:
    if type(e).__name__ == "MagicException":
        mime = None
    else:
        raise

Member

There seems to be a magic.MagicException class. If you used that, the exception handling here would be much cleaner, no?
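One way this suggestion might look, as a sketch only: the helper name `detect_mime` and the "unknown" fallback are assumptions for illustration, and the guarded import is there so the snippet still loads when python-magic or libmagic is not installed.

```python
from pathlib import Path

try:
    import magic  # python-magic; also needs the libmagic shared library
except ImportError:
    magic = None


def detect_mime(file_path: Path) -> str:
    """Return the MIME type reported by libmagic, or "unknown" if detection fails."""
    if magic is None:
        return "unknown"
    try:
        return magic.from_file(str(file_path), mime=True)
    except magic.MagicException:
        # Catch the library's own exception class by type, not by its name string.
        return "unknown"
```

Any other exception still propagates, which matches the re-raise behaviour of the name-comparison version in the diff.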

Comment thread app/hasher.py
Comment on lines +28 to +29
except OSError:
    return {}
Member

Maybe you should include some kind of logging. Then you could log an error here to avoid silent failures.
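A minimal sketch of what logging the failure could look like, using the standard `logging` module. The function name `hash_file_logged` is hypothetical, and only SHA256 is computed to keep the example short.

```python
import hashlib
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def hash_file_logged(path: Path) -> dict[str, str]:
    """Hash a file, logging (instead of silently swallowing) read errors."""
    sha256 = hashlib.sha256()
    try:
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                sha256.update(chunk)
    except OSError:
        # logger.exception records the message plus the traceback at ERROR level.
        logger.exception("Could not hash %s", path)
        return {}
    return {"sha256": sha256.hexdigest()}
```

The return value stays the same as in the diff (an empty dict on failure), so callers are unaffected; the only change is that the error now leaves a trace in the logs.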

Comment thread app/scanner.py
Comment on lines +16 to +19
except Exception as e:
    if type(e).__name__ == "MagicException":
        return mimetypes.guess_type(str(file_path))[0] or "unknown"
    raise
Member

There seems to be a magic.MagicException class. If you used that, the exception handling here would be much cleaner, no?

Comment thread app/scanner.py
if type(e).__name__ == "MagicException":
    return mimetypes.guess_type(str(file_path))[0] or "unknown"
raise
return mimetypes.guess_type(str(file_path))[0] or "unknown"
Member

Malware file names are often chosen to deceive, so mimetypes.guess_type, which only looks at the file name, can produce misleading answers (rather than an honest "unknown"). Two possible solutions: (1) drop it entirely and just leave the type as "unknown", or (2) flag the source in the output (e.g. mime_source: "magic" | "filename" | "unknown").
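Option (2) might be sketched as follows. The function name `describe_mime` and the injected `detect_from_content` callable (standing in for a libmagic-based detector that returns None on failure) are assumptions for this example; only the `mime_source` values come from the comment above.

```python
import mimetypes
from pathlib import Path
from typing import Callable, Optional


def describe_mime(
    file_path: Path,
    detect_from_content: Callable[[Path], Optional[str]],
) -> dict[str, str]:
    """Return the MIME type together with a flag saying where it came from."""
    detected = detect_from_content(file_path)
    if detected:
        return {"mime_type": detected, "mime_source": "magic"}
    # Fallback based on the file name only: easy to spoof, so label it clearly.
    guessed, _encoding = mimetypes.guess_type(file_path.name)
    if guessed:
        return {"mime_type": guessed, "mime_source": "filename"}
    return {"mime_type": "unknown", "mime_source": "unknown"}
```

With this shape, a consumer can decide for itself whether to trust a "filename" result, instead of the scanner presenting a guess as if it were content-based detection.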

Comment thread pyproject.toml
"ISC001", # Conflicts with formatter
"D100", # Missing docstring in public module
"D104", # Missing docstring in public package
"S324", # Probable use of insecure hash functions (used for malware identification)
Member

In case you don't know: you can also exclude rules per file. See GB's pyproject.toml.
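For reference, a per-file exclusion in Ruff could look roughly like this. The file path is an assumption based on where the hashes are computed in this PR, and the table name depends on the Ruff version (older configurations use `[tool.ruff.per-file-ignores]`).

```toml
[tool.ruff.lint.per-file-ignores]
# Assumed path: only the hashing module needs the insecure-hash rule disabled.
"app/hasher.py" = ["S324"]
```

This keeps S324 active for the rest of the code base, so an accidental use of MD5 or SHA1 elsewhere would still be flagged.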

Comment thread app/hasher.py
Comment on lines +32 to +34
"md5": md5.hexdigest(),
"sha1": sha1.hexdigest(),
"sha256": sha256.hexdigest(),
Member

Why do you need three different hashes again?

Comment thread app/scanner.py
Comment on lines +64 to +70
"file_path": str(file_path),
"mime_type": mime_type,
"md5": hashes.get("md5"),
"sha1": hashes.get("sha1"),
"sha256": hashes.get("sha256"),
"mtime": mtime,
"size": stat_result.st_size,
Member

So this will be the dict that the API responds with and that will be consumed on the GB side?


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Core Payload Scanner & Metadata Extractor

3 participants