diff --git a/docs/INDEX.md b/docs/INDEX.md
new file mode 100644
index 0000000..911bd32
--- /dev/null
+++ b/docs/INDEX.md
@@ -0,0 +1,62 @@
+# BinarySniffer Documentation
+
+Welcome to the BinarySniffer documentation. This index will help you find the information you need.
+
+## Getting Started
+
+- **[Installation Guide](INSTALLATION.md)** - Step-by-step installation instructions for BinarySniffer
+- **[User Guide](USER_GUIDE.md)** - Complete guide to using BinarySniffer CLI and library API
+
+## Core Features
+
+- **[Detailed Features](DETAILED_FEATURES.md)** - Comprehensive overview of BinarySniffer's capabilities
+- **[Architecture](ARCHITECTURE.md)** - System architecture and design principles
+- **[Signature Management](SIGNATURE_MANAGEMENT.md)** - Managing and updating signature databases
+
+## Signature Creation
+
+- **[Creating Signatures](CREATING_SIGNATURES.md)** - Guide to creating new component signatures
+- **[Signature Creation](SIGNATURE_CREATION.md)** - Advanced signature creation techniques and best practices
+
+## Advanced Topics
+
+- **[ML Security Analysis](ML_SECURITY.md)** - Security scanning and analysis for ML frameworks
+- **[TLSH Fuzzy Matching](TLSH_FUZZY_MATCHING.md)** - Using TLSH for fuzzy hash matching
+- **[TLSH Setup Guide](TLSH_SETUP_GUIDE.md)** - Installing and configuring TLSH support
+
+## Reference
+
+- **[API Reference](API_REFERENCE.md)** - Python API documentation and examples
+- **[Package Verification](PACKAGE_VERIFICATION.md)** - Verifying BinarySniffer packages and integrity
+
+## Scripts and Examples
+
+- **[create_tlsh_example.py](create_tlsh_example.py)** - Example script for TLSH hash creation
+
+---
+
+## Quick Links
+
+### For New Users
+1. Start with [Installation Guide](INSTALLATION.md)
+2. Follow the [User Guide](USER_GUIDE.md)
+3. Explore [Detailed Features](DETAILED_FEATURES.md)
+
+### For Developers
+1. Review [Architecture](ARCHITECTURE.md)
+2. Check [API Reference](API_REFERENCE.md)
+3. Learn about [Signature Creation](SIGNATURE_CREATION.md)
+
+### For Security Analysis
+1. Read [ML Security Analysis](ML_SECURITY.md)
+2. Understand [TLSH Fuzzy Matching](TLSH_FUZZY_MATCHING.md)
+3. Configure with [TLSH Setup Guide](TLSH_SETUP_GUIDE.md)
+
+---
+
+## Getting Help
+
+If you need help or have questions:
+- Check the relevant documentation section above
+- Review the main [README.md](../README.md) in the project root
+- Report issues at [GitHub Issues](https://github.com/SemClone/binarysniffer/issues)
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
deleted file mode 100644
index 709c655..0000000
--- a/docs/SUMMARY.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# Semantic Copycat BinarySniffer - Implementation Summary
-
-## Overview
-
-Successfully restructured the xmonkey-curator project into a new, efficient implementation called **binarysniffer** v1.0.0.
-
-## Key Achievements
-
-### 1. **Complete Architecture Redesign**
-- Replaced naive Trie-based string matching with progressive three-tier matching
-- Implemented SQLite-based signature storage with compression (90% size reduction)
-- Created memory-efficient analysis using <100MB RAM (vs 500MB-2GB previously)
-
-### 2. **Core Implementation**
-- **Dual Interface**: CLI tool (`binarysniffer`) and Python library API
-- **Progressive Matching**:
-  - Tier 1: Bloom filters for quick elimination (microseconds)
-  - Tier 2: MinHash LSH for similarity search (milliseconds)
-  - Tier 3: Detailed database matching (seconds)
-- **Feature Extraction**: Binary string extraction with categorization (functions, constants, imports)
-
-### 3. **Performance Improvements**
-- Analysis speed: 10-50ms per file (20-100x faster)
-- Parallel processing support
-- Streaming analysis for large files
-- Efficient indexing with trigrams and MinHash
-
-### 4. **Testing & Validation**
-- Created comprehensive test suite (32 tests, 29 passing)
-- Successfully built distribution packages (wheel and tarball)
-- Validated detection of real components:
-  - Test binary: Detected OpenSSL and curl signatures
-  - Real curl binary: Successfully identified curl with 82.5% confidence
-
-## Package Structure
-
-```
-binarysniffer/
-├── binarysniffer/          # Core library
-│   ├── core/               # Analyzer, config, results
-│   ├── extractors/         # Feature extraction
-│   ├── matchers/           # Progressive matching
-│   ├── storage/            # Database and updates
-│   ├── index/              # Bloom filters, MinHash
-│   └── utils/              # Hashing utilities
-├── tests/                  # Test suite
-├── examples/               # Usage examples
-└── dist/                   # Built packages
-    ├── semantic_copycat_binarysniffer-1.0.0-py3-none-any.whl
-    └── semantic_copycat_binarysniffer-1.0.0.tar.gz
-```
-
-## Usage Examples
-
-### CLI
-```bash
-# Analyze single file
-binarysniffer analyze /path/to/binary
-
-# Analyze directory
-binarysniffer analyze /path/to/project -r
-
-# Update signatures
-binarysniffer update
-```
-
-### Python API
-```python
-from binarysniffer import BinarySniffer
-
-sniffer = BinarySniffer()
-result = sniffer.analyze_file("/path/to/binary")
-for match in result.matches:
-    print(f"{match.component}: {match.confidence:.1%}")
-```
-
-## Technical Features
-
-1. **Smart Signature Storage**
-   - SQLite with ZSTD compression
-   - Trigram indexing for substring matching
-   - Pre-computed MinHash for similarity search
-
-2. **Efficient Matching Algorithms**
-   - MinHash with 128 permutations
-   - LSH with 16 bands for candidate selection
-   - Tiered bloom filters (0.1%, 1%, 10% false positive rates)
-
-3. **Extensible Design**
-   - Plugin-based extractor system
-   - Configurable analysis parameters
-   - Support for signature updates
-
-## Migration from XMonkey-Curator
-
-- Created migration script (`migrate_from_xmonkey.py`)
-- Addresses all major pain points:
-  - Memory usage: 500MB-2GB → <100MB
-  - Analysis speed: 1-5s → 10-50ms per file
-  - False positives: ~25% → <5%
-  - Signature storage: 100MB+ JSON → 50MB SQLite
-
-## Next Steps
-
-The implementation is ready for:
-- PyPI publication
-- Integration with CI/CD pipelines
-- Extension with additional extractors (source code, archives)
-- Machine learning enhancements for signature quality
-
-## Conclusion
-
-Successfully transformed an inefficient prototype into a production-ready tool with 20-100x performance improvements while maintaining the core functionality of detecting open source components in binaries.
\ No newline at end of file
diff --git a/examples/libcom_err.so b/examples/libcom_err.so
deleted file mode 100644
index 4a8cd17..0000000
Binary files a/examples/libcom_err.so and /dev/null differ