A high-performance Python library and command-line tool for comprehensive DNA/RNA sequence analysis with advanced visualization capabilities. This toolkit is designed for both bioinformaticians and molecular biologists, providing a robust set of tools for sequence analysis, manipulation, and visualization.
- Key Features
- Project Structure
- Requirements
- Installation
- Quick Start
- Usage Examples
- Configuration
- Documentation
- Contributing
- Testing
- License
- Changelog
- GC content calculation
- Melting temperature prediction
- Molecular weight calculation
- Sequence validation and sanitization
- Motif finding and pattern matching
- ORF (Open Reading Frame) detection
- Reverse complement generation
- Transcription and translation
- Sequence alignment
- Primer design
- Restriction site analysis
- FASTA/FASTQ format support
- GZIP/BZ2 compression support
- Batch processing of multiple files
- Stream processing for large files
- Configurable output formats
- Parallel processing options
- GC content plots
- Sequence logos
- Restriction maps
- Interactive sequence viewers
- User-friendly command-line tools
- Batch processing support
- Configurable output formats
- Parallel processing options
DNASequenceAnalysisTool/
├── dna_sequence_analysis_tool/  # Main package
│   ├── core/  # Core functionality
│   │   ├── __init__.py
│   │   ├── sequence_analysis.py  # Sequence analysis functions
│   │   ├── sequence_io.py  # File I/O operations
│   │   ├── sequence_validation.py  # Sequence validation
│   │   ├── sequence_statistics.py  # Statistical analysis
│   │   ├── sequence_transformation.py  # Sequence manipulation
│   │   └── visualization.py  # Visualization tools
│   ├── data/  # Sample data
│   │   ├── __init__.py
│   │   └── sample_sequence.py  # Sample sequences
│   ├── tests/  # Test suite
│   │   ├── __init__.py
│   │   └── test_sequence_analysis.py
│   ├── utils/  # Utility functions
│   │   ├── __init__.py
│   │   └── file_io.py
│   ├── __init__.py
│   ├── cli.py  # Command-line interface
│   ├── config.py  # Configuration settings
│   ├── exceptions.py  # Custom exceptions
│   └── logging_config.py  # Logging configuration
├── examples/  # Example scripts
│   ├── basic_sequence_analysis.py
│   ├── file_io_and_visualization.py
│   └── README.md
├── .gitignore
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── Makefile
├── pyproject.toml
├── requirements-dev.txt
├── requirements.txt
└── setup.py
- Python 3.8+
- Dependencies are listed in `requirements.txt`
- NumPy >= 1.19.0
- SciPy >= 1.5.0
- Biopython >= 1.78
- pandas >= 1.2.0
- pydantic >= 1.8.0
- pyyaml >= 5.4.1
- click >= 8.0.0
- rich >= 10.0.0
- matplotlib >= 3.3.0
- plotly >= 5.0.0
pip install dna-sequence-analysis-tool
- Clone the repository:
  git clone https://github.com/YanCotta/DNASequenceAnalysisTool.git
  cd DNASequenceAnalysisTool
- Install with pip in development mode:
  pip install -e .
- Install development dependencies:
  pip install -r requirements-dev.txt
- Set up pre-commit hooks:
  pre-commit install
from dna_sequence_analysis_tool import DNASequence, DNAToolkit
# Create a DNA sequence
sequence = DNASequence("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", "example_sequence")
# Get sequence information
print(f"Sequence ID: {sequence.id}")
print(f"Length: {sequence.length} bp")
print(f"GC content: {sequence.gc_content:.2f}%")
# Get reverse complement
rev_comp = sequence.reverse_complement()
print(f"Reverse complement: {rev_comp}")
# Find motifs
motif = "GGC"
positions = sequence.find_motif(motif)
print(f"Motif '{motif}' found at positions: {positions}")
# Analyze with toolkit
toolkit = DNAToolkit()
tm = toolkit.calculate_melting_temperature(sequence.sequence)
print(f"Melting temperature: {tm:.2f}°C")
# Analyze a sequence file
dnatool analyze sequences.fasta --output results.csv
# Generate a GC content plot
dnatool plot-gc sequences.fasta --output gc_plot.png
# Find ORFs in a sequence
dnatool find-orfs sequence.fasta --min-length 100
# Get help
dnatool --help
The tool can be configured using a YAML configuration file located at ~/.dna_sequence_analysis/config.yaml.
Example configuration:
# General settings
log_level: INFO
max_sequence_length: 10000000
# File I/O settings
default_input_format: fasta
default_output_format: fasta
auto_detect_format: true
# Performance settings
chunk_size: 10000
max_workers: 4
# Visualization settings
plot_theme: default
default_figure_size: [10, 6]
Comprehensive documentation is available at Read the Docs.
To build the documentation locally:
cd docs
make html
Contributions are welcome! Please see our Contributing Guide for details on how to contribute to this project.
Run the test suite with:
pytest
For a test coverage report:
pytest --cov=dna_sequence_analysis_tool --cov-report=term-missing
This project is licensed under the MIT License - see the LICENSE file for details.
See CHANGELOG.md for a history of changes to this project.
For support or questions, please open an issue on GitHub.
Made with ❤️ by the DNA Sequence Analysis Tool contributors
- GC content calculation
- Melting temperature prediction
- ORF detection and analysis
- Nucleotide composition analysis
- Pattern recognition and motif finding
- DNA/RNA transcription
- Codon-optimized protein translation
- Sophisticated ORF detection
- Advanced melting temperature calculations
- FASTA/FASTQ format support
- GZIP/BZIP2 compression
- Batch processing capabilities
- Format conversion utilities
- GC content plots
- Sequence logos
- Multiple sequence alignments
- Interactive visualizations
- Intuitive command structure
- Batch processing support
- Multiple output formats (text, JSON, CSV)
- Visualization export to image files
dna_sequence_analysis_tool/
├── core/
│   ├── __init__.py
│   ├── sequence_analysis.py
│   ├── sequence_validation.py
│   └── visualization.py
├── data/
│   └── sample_sequences.fasta
├── utils/
│   ├── file_io.py
│   └── logging.py
├── tests/
│   ├── test_sequence_analysis.py
│   └── test_validation.py
├── cli.py
├── README.md
└── requirements.txt
- Python 3.8+
- NumPy >= 1.19.0
- SciPy >= 1.5.0
- Biopython >= 1.78
- pandas >= 1.2.0
- matplotlib >= 3.3.0 (for visualization)
- click >= 8.0.0 (for CLI)
- rich >= 10.0.0 (for rich CLI output)
- plotly >= 5.0.0 (for interactive visualizations)
- python-magic (for file type detection)
- python-magic-bin (Windows only, for file type detection)
# Install from PyPI
pip install dna-sequence-analysis-tool
# Install from source
git clone https://github.com/YanCotta/DNASequenceAnalysisTool.git
cd DNASequenceAnalysisTool
pip install -e .
from dna_sequence_analysis_tool import DNAToolkit
# Initialize toolkit
toolkit = DNAToolkit()
# Analyze a sequence
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
result = toolkit.analyze_sequence(sequence)
print(f"GC Content: {result.gc_content}%")
class DNASequence:
    """
    Core class for DNA sequence analysis.

    Attributes:
        sequence (str): The DNA sequence
        length (int): Sequence length
        gc_content (float): GC content percentage
    """
- Validates DNA sequences (A, T, G, C)
- Returns: (bool, str) - validity status and error message
- Calculates GC content percentage
- Raises ValueError for invalid sequences
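The GC content calculation described above can be sketched in a few lines. This is a minimal illustration, not the library's own code: the function name `gc_content` and the choice to reject empty input are assumptions made for the example.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage.

    Hypothetical helper for illustration; rejecting empty input is an
    assumption, since the README only says invalid sequences raise ValueError.
    """
    seq = sequence.upper()
    # Reject empty input and any character outside A, T, G, C.
    if not seq or set(seq) - set("ATGC"):
        raise ValueError("sequence must be non-empty and contain only A, T, G, C")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

For example, `gc_content("ATGC")` evaluates to `50.0`, while `gc_content("ATXG")` raises `ValueError`.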
- Generates reverse complement of DNA sequence
- Returns: String of complementary sequence
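As a rough sketch of what reverse complement generation involves: each base is swapped for its partner (A with T, G with C) and the result is read backwards. The function name below is hypothetical, and the sketch performs no validation.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ATGCatgc", "TACGtacg")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (illustrative only)."""
    # Complement every base, then reverse the string.
    return sequence.translate(_COMPLEMENT)[::-1]
```

With this sketch, `reverse_complement("ATGC")` gives `"GCAT"`.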
- Finds all occurrences of a motif
- Returns: List of starting positions (0-based)
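One way to collect 0-based motif positions is a repeated `str.find` scan. The sketch below assumes overlapping occurrences should be reported; the README does not say whether the library does this, and the function name is hypothetical.

```python
from typing import List


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return 0-based start positions of every (possibly overlapping) match."""
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return positions
```

Applied to the sample sequence used elsewhere in this README, `find_motif("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", "GGC")` returns `[2, 15]`.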
- Validates RNA sequences (A, U, G, C)
- Returns: (bool, str) - validity status and error message
- Converts DNA to RNA sequence
- Returns: RNA sequence (replaces T with U)
- Converts RNA to protein sequence
- Returns: Amino acid sequence using standard genetic code
- Finds all possible Open Reading Frames
- Parameters:
- min_length: Minimum ORF length (default: 30)
- Returns: List of (start_position, sequence, frame)
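To make the return shape concrete, here is a simplified ORF scan. It is a sketch under several assumptions that the README does not confirm: only the forward strand is searched, `min_length` is measured in nucleotides and includes the stop codon, frames are numbered 0 to 2, and an ORF runs from the first ATG to the next in-frame stop codon. The function name is hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_length: int = 30) -> List[Tuple[int, str, int]]:
    """Return (start_position, orf_sequence, frame) for forward-strand ORFs."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) >= min_length:
                    orfs.append((start, orf, frame))
                start = None
    return orfs
```

For instance, `find_orfs("ATGAAATAG", min_length=9)` returns `[(0, "ATGAAATAG", 0)]`. A real implementation would typically also scan the reverse complement.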
- Calculates DNA melting temperature
- Uses different formulas based on sequence length:
- < 14 bases: Tm = (A+T)*2 + (G+C)*4
- ≥ 14 bases: Tm = 64.9 + 41*(G+C-16.4)/(A+T+G+C)
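The two length-dependent formulas listed above translate directly into code. The sketch below is illustrative only: the function name is hypothetical, and bases other than A, T, G, C are simply ignored in the counts, which is an assumption.

```python
def melting_temperature(sequence: str) -> float:
    """Estimate Tm in degrees Celsius using the two formulas from this README."""
    seq = sequence.upper()
    a, t, g, c = (seq.count(base) for base in "ATGC")
    n = a + t + g + c
    if n == 0:
        raise ValueError("sequence contains no A, T, G or C bases")
    if n < 14:
        # Short sequences: Tm = (A+T)*2 + (G+C)*4
        return float(2 * (a + t) + 4 * (g + c))
    # Longer sequences: Tm = 64.9 + 41*(G+C-16.4)/(A+T+G+C)
    return 64.9 + 41.0 * (g + c - 16.4) / n
```

As a worked check, the 4-mer `"ATGC"` gives 2*2 + 4*2 = 12.0, and a 20-mer with 10 G/C bases gives 64.9 + 41*(10 - 16.4)/20 = 51.78.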
- Calculates similarity between two sequences
- Returns: Percentage of matching positions
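A position-by-position similarity can be sketched as follows. The README does not say how sequences of unequal length are handled; this example assumes they are rejected, and the function name is hypothetical.

```python
def sequence_similarity(seq1: str, seq2: str) -> float:
    """Return the percentage of positions at which two sequences match."""
    if not seq1 or len(seq1) != len(seq2):
        # Assumption for this sketch: inputs must be non-empty and equal length.
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq1.upper(), seq2.upper()))
    return 100.0 * matches / len(seq1)
```

For example, `sequence_similarity("ATGC", "ATGG")` is `75.0`, since three of the four positions agree.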
- Provides comprehensive sequence analysis including:
- Basic statistics (length, GC content, nucleotide counts)
- Dinucleotide frequencies
- Melting temperature
- Molecular weight
DNAToolkit class: Primary interface for DNA analysis
- Comprehensive sequence analysis pipeline
- Integrated logging and error handling
- Sequence manipulation and transformation methods
- Advanced sequence analysis algorithms
- ORF detection and promoter prediction
- Local and global sequence alignment
- Repeat sequence analysis
- Motif finding and pattern matching
- Reading and writing FASTA/FASTQ files
- Support for gzip and bzip2 compression
- Automatic format detection
- Sequence validation during I/O operations
- Sequence validation for DNA/RNA
- Support for ambiguous bases
- Configurable validation rules
- Detailed error reporting
- GC content plots
- Sequence logos
- Multiple sequence alignment visualization
- Interactive plots with Plotly
- Export to various image formats
- Sequence complexity measures
- Nucleotide frequency analysis
- Statistical significance calculations
- Sequence similarity metrics
- Interactive command-line interface
- Support for batch processing
- Rich text formatting and progress bars
- Multiple output formats (text, JSON, CSV)
- Integrated help system
from dna_sequence_analysis_tool.visualization import plot_gc_content
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
fig = plot_gc_content(sequence, window_size=10, step_size=1)
fig.savefig("gc_content.png")
from dna_sequence_analysis_tool.visualization import plot_sequence_logo
sequences = ["ATCG", "ATTA", "ATGC", "ATAA"]
fig = plot_sequence_logo(sequences)
fig.savefig("sequence_logo.png")
from dna_sequence_analysis_tool.visualization import plot_alignment
alignment = [
"ATCGATCGAT",
"AT-GATCGAT",
"ATCGAT---T",
"ATCGATCGAT"
]
fig = plot_alignment(alignment, title="Multiple Sequence Alignment")
fig.savefig("alignment.png")
The DNA Sequence Analysis Tool comes with a powerful command-line interface for batch processing and automation:
# Analyze a single sequence file
dna-tool analyze sequences.fasta --output results.json
# Process multiple files in a directory
dna-tool batch-process input_dir/ --output results/
# Convert between file formats
dna-tool convert input.fasta --output output.fastq --format fastq
sequences = read_fasta("sequences.fasta")
write_fasta(sequences, "processed_sequences.fasta")
### Advanced Analysis
```python
from dna_sequence_analysis_tool.core.sequence_analysis import AdvancedSequenceAnalyzer

# Example input sequence
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

# Initialize analyzer
analyzer = AdvancedSequenceAnalyzer()
# Predict promoter regions
promoters = analyzer.predict_promoter_regions(sequence)
# Analyze repeats
repeats = analyzer.analyze_repeats(sequence)
```
import logging
from dna_sequence_analysis_tool.utils.logging import logger
# Set custom log level
logger.setLevel(logging.DEBUG)
# Add custom handler
handler = logging.FileHandler('analysis.log')
logger.addHandler(handler)
# Run all tests
pytest
# Run with coverage
pytest --cov=dna_sequence_analysis_tool tests/
We welcome contributions!
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Sequence Validation | O(n) | O(1) |
| GC Content | O(n) | O(1) |
| ORF Detection | O(n) | O(n) |
| Sequence Alignment | O(mn) | O(mn) |
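The O(mn) time and space figures for sequence alignment in the table above come from filling a full dynamic-programming matrix. The sketch below shows a Needleman-Wunsch style global alignment score with a linear gap penalty. It is not the library's aligner: the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
from typing import List


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    m, n = len(a), len(b)
    # (m+1) x (n+1) score matrix: this is the O(mn) space cost.
    score: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    # Each cell is filled once from three neighbours: the O(mn) time cost.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[m][n]
```

For example, `global_alignment_score("ATCG", "ATG")` is `1`: three matches and one gap. Recovering the alignment itself would require a traceback through the same matrix.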
If you find this tool useful, please consider giving it a ⭐ on GitHub.
Made with ❤️ for the bioinformatics community
- Added command-line interface with rich output
- Enhanced visualization capabilities (GC content, sequence logos, alignments)
- Improved sequence I/O with gzip/bzip2 compression support
- Added comprehensive documentation and examples
- Added support for batch processing of sequence files
- Implemented multiple output formats (text, JSON, CSV)
- Added sequence visualization module with matplotlib and plotly support
- Implemented FASTA/FASTQ file format support
- Added sequence validation framework with configurable rules
- Improved error handling and logging
- Added support for ambiguous nucleotide codes
- Initial release with core DNA/RNA analysis functionality
- Basic sequence manipulation and analysis tools
- Comprehensive documentation and example usage
- Optimized sequence processing
- Graceful fallback for operations when NumPy/SciPy are not available
- Enhanced error handling and validation across all modules
- Memory-efficient sequence processing with caching
- Improved logging system with configurable output
- Added molecular weight calculations
- Added sequence complexity analysis
- Implemented affine gap penalties in sequence alignment
- Optimized GC content calculation
- Improved TATA box prediction algorithm
- Enhanced repeat sequence detection
- Better integration between basic and advanced analyzers
- More robust sequence validation
- Updated dependencies to latest stable versions
- Fixed dependency management in setup.py and requirements.txt
- Resolved numpy/scipy import issues
- Improved error messages and exception handling
- Fixed memory leaks in sequence analysis
- Corrected molecular weight calculations
- Added type hints throughout the codebase
- Improved test coverage
- Better documentation and code organization
- Memory optimizations for large sequences
- Performance improvements in core algorithms
- Add more comprehensive unit tests
- Include integration tests
- Add test coverage reporting
- Test edge cases and biological validity
- Implement codon usage bias analysis
- Add protein structure prediction
- Include phylogenetic analysis capabilities
- Add primer design functionality
- Implement parallel processing for large sequences
- Add memory optimization for large datasets
- Include progress tracking for long operations
- Add visualization capabilities
- Implement machine learning for sequence analysis
- Add web interface
- Add biological background for each analysis
- Include example workflows
- Add benchmarking results
- Document algorithm complexity
- Updated outdated README to properly reference the project and its functionalities
- Add type hints consistently
- Improve error messages
- Add performance metrics
- Implement logging