Implement payload metadata extraction engine. Closes #13#15
Conversation
- Added hasher.py for streaming MD5, SHA1, and SHA256 hashes - Added scanner.py to extract MIME types via python-magic and filter files by mtime - Centralized ruff linter ignore rules for S324 and BLE001 - Added pytest test coverage for metadata extraction
There was a problem hiding this comment.
Pull request overview
This PR adds a small “payload metadata extraction engine” to the service by introducing a filesystem scanner that filters payloads by mtime and extracts per-file metadata (hashes + MIME type) without executing the payloads, addressing Issue #13.
Changes:
- Added
app/scanner.pyto recursively scan a directory and emit metadata for payloads modified within a given time window. - Added
app/hasher.pyto stream files once while computing MD5/SHA1/SHA256. - Added
python-magicdependency and a new pytest module covering hashing + scanning behavior.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
app/hasher.py |
New streaming hash helper (md5/sha1/sha256). |
app/scanner.py |
New recursive scanner that filters by mtime and extracts hashes + MIME type. |
tests/test_metadata.py |
Adds unit tests for hash computation and scanner filtering/metadata. |
pyproject.toml |
Adds python-magic dependency and adjusts Ruff ignore list. |
uv.lock |
Locks python-magic==0.4.27 and updates project dependency metadata. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| mime = magic.Magic(mime=True) | ||
| except Exception: | ||
| mime = None | ||
|
|
| mime_type = "unknown" | ||
| if mime: | ||
| with contextlib.suppress(Exception): | ||
| mime_type = mime.from_file(str(file_path)) | ||
|
|
| payload = results[0] | ||
|
|
||
| assert payload["file_path"] == str(file_path1) | ||
| assert "text" in payload["mime_type"] or "plain" in payload["mime_type"] |
…ror handling in scanner and hasher modules
|
Hi! @regulartim, The PR is ready for review. Please take a look when you get time :D |
| except Exception as e: | ||
| if type(e).__name__ == "MagicException": | ||
| mime = None | ||
| else: | ||
| raise | ||
|
|
There was a problem hiding this comment.
There seems to be a magic.MagicException class. If you would use that, the exception handling here would be much cleaner, no?
| except OSError: | ||
| return {} |
There was a problem hiding this comment.
Maybe you should include some kind of logging. Then you could log an error here to avoid silent failures.
| except Exception as e: | ||
| if type(e).__name__ == "MagicException": | ||
| return mimetypes.guess_type(str(file_path))[0] or "unknown" | ||
| raise |
There was a problem hiding this comment.
There seems to be a magic.MagicException class. If you would use that, the exception handling here would be much cleaner, no?
| if type(e).__name__ == "MagicException": | ||
| return mimetypes.guess_type(str(file_path))[0] or "unknown" | ||
| raise | ||
| return mimetypes.guess_type(str(file_path))[0] or "unknown" |
There was a problem hiding this comment.
Malware names are often chosen to deceive. So mimetypes.guess_type can produce misleading answers (rather than an honest "unknown"). Two possible solutions to that: (1) drop it entirely and just leave it as "unknown" or (2) flag the source in the output (e.g.: mime_source: "magic" | "filename" | "unknown").
| "ISC001", # Conflicts with formatter | ||
| "D100", # Missing docstring in public module | ||
| "D104", # Missing docstring in public package | ||
| "S324", # Probable use of insecure hash functions (used for malware identification) |
There was a problem hiding this comment.
In case you don't know: you can also exclude rules per file. See GB's pyproject.toml .
| "md5": md5.hexdigest(), | ||
| "sha1": sha1.hexdigest(), | ||
| "sha256": sha256.hexdigest(), |
There was a problem hiding this comment.
Why do you need three different hashes again?
| "file_path": str(file_path), | ||
| "mime_type": mime_type, | ||
| "md5": hashes.get("md5"), | ||
| "sha1": hashes.get("sha1"), | ||
| "sha256": hashes.get("sha256"), | ||
| "mtime": mtime, | ||
| "size": stat_result.st_size, |
There was a problem hiding this comment.
So this will be the dict that the API responds with and that will be consumed on the GB side?
Description
This PR introduces a
scannermodule to locate captured payloads modified within a specific timeframe usingos.stat().st_mtime. Additionally, it implements ahashermodule that streams files to compute MD5, SHA1, and SHA256 hashes simultaneously. It also safely integratespython-magicto detect the true MIME type of payloads without executing them.Related issues
Closes #13
Type of change
Checklist
Please complete this checklist carefully. It helps guide your contribution and lets maintainers verify that all requirements are met.
Formalities
Implement payload metadata extraction engine. Closes #13develop.develop.Docs and tests
Ruff) gave 0 errors.Review process