An intelligent bioinformatics tool that combines the power of the gget library with local AI interpretation via Ollama. Built on the FastMCP framework, it provides natural language querying of gene databases with smart gene detection and comprehensive biological insights.
- Natural Language Queries: Ask about genes in plain English - "tell me about TP53"
- Smart Gene Detection: Automatically identifies gene symbols and Ensembl IDs from text
- Typo Filtering: AI-powered validation prevents common English words from being treated as genes
- Multiple Gene Formats: Supports gene symbols (TP53, BRCA1), Ensembl IDs (ENSG00000141510), and aliases
- Local AI Integration: Uses Ollama with llama3:8b for biological insights
- No API Costs: AI interpretation runs entirely on your machine with no paid API keys (database lookups still require an internet connection)
- Biological Expertise: Draws on the model's understanding of genetics and molecular biology
- Interactive Processing: Real-time interpretation of gene functions and significance
- gget Library Integration: Direct access to Ensembl, NCBI, and other genomics databases
- Gene Information: Detailed gene annotations, descriptions, and metadata
- Sequence Data: Access to gene sequences, transcripts, and protein data
- Reference Genomes: Species-specific genomic reference information
- Search Capabilities: Find genes by symbols, keywords, or biological functions
- Complete Building Guide: Comprehensive step-by-step documentation of the entire development process, including design decisions, safety framework implementation, local model setup, and beginner-friendly tutorials
- Local Model Setup: Instructions for cost-effective deployment with Qwen2.5-Coder-3B
- AI Safety Framework: Detailed explanation of safety controls and boundaries
- Python 3.8+
- Ollama for local AI processing
# Clone the repository
git clone https://github.com/georgiesamaha/bio-nerd-tool.git
cd bio-nerd-tool
# Install Python dependencies
pip install -e .
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh
# Download the AI model (llama3:8b - ~4.7GB)
ollama pull llama3:8b
# Set up the bio-nerd command
chmod +x ~/.local/bin/bio-nerd
# Test the AI model
ollama run llama3:8b "What is the TP53 gene?"
# Test bio-nerd (without AI first)
bio-nerd query "TP53"
# Test with AI interpretation
bio-nerd query "tell me about BRCA1" --ai
# Simple gene lookup
bio-nerd query "TP53"
bio-nerd query "ENSG00000141510"
# Natural language queries with AI interpretation
bio-nerd query "tell me about the BRCA1 gene" --ai
bio-nerd query "what does TP53 do?" --ai
bio-nerd query "EGFR function and mutations" --ai
# Multiple genes
bio-nerd query "compare TP53 and BRCA1" --ai
# Disable AI interpretation for faster queries
bio-nerd query "BRCA2" --no-ai
# Interactive mode
bio-nerd
# Then type queries interactively
# Get help
bio-nerd --help
bio-nerd query --help
Processing: tell me about TP53
--------------------------------------------------
🎯 Gene IDs detected: TP53
Ensembl ID mappings:
TP53 → ENSG00000141510
Gene Information:
• TP53:
tumor protein p53
tumor protein p53 [Source:HGNC Symbol;Acc:HGNC:11998]
🧠 AI Interpretation:
TP53 is one of the most important tumor suppressor genes in human biology, often called the "guardian of the genome." Located on chromosome 17, it encodes the p53 protein which acts as a transcription factor that regulates the cell cycle and prevents cancer formation...
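The AI interpretation step in the output above could be produced by sending the retrieved gene record to a locally running Ollama server. The following is a minimal sketch, assuming Ollama's default local endpoint and its `/api/generate` route; the function names and the prompt wording are hypothetical and are not taken from bio-nerd's own code.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(gene_symbol, description, model="llama3:8b"):
    """Build the JSON body for a non-streaming Ollama generate request."""
    # Hypothetical prompt wording, for illustration only.
    prompt = (
        f"Gene: {gene_symbol}\n"
        f"Database description: {description}\n"
        "Briefly explain this gene's function and clinical significance."
    )
    return {"model": model, "prompt": prompt, "stream": False}


def interpret_gene(gene_symbol, description, timeout=60):
    """Send the prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_payload(gene_symbol, description)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False, Ollama returns one JSON object whose
        # "response" field holds the full generated text.
        return json.loads(response.read())["response"]
```

Calling `interpret_gene("TP53", "tumor protein p53")` requires the Ollama server to be running and the `llama3:8b` model to be pulled already.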
- Comprehensive gene annotations from Ensembl
- Gene descriptions, biotypes, and synonyms
- Chromosome locations and genomic coordinates
- Protein coding information and domains
- Find genes by symbols, names, or keywords
- Species-specific searches (default: human)
- Fuzzy matching for partial gene names
- Alias and synonym resolution
- DNA, RNA, and protein sequences
- Transcript isoform sequences
- UTR and coding sequence regions
- FASTA format output
- Download genome assemblies and annotations
- GTF/GFF3 annotation files
- Species-specific reference data
- Assembly metadata and statistics
- Natural Language Processing: Extract gene identifiers from conversational queries
- Biological Context: Explain gene functions, pathways, and disease associations
- Intelligent Filtering: Distinguish real gene names from typos and common words
- Multi-gene Analysis: Compare and analyze multiple genes simultaneously
- Database Access: gget queries the Ensembl REST API (typically 10-15 seconds per query)
- AI Processing: Local llama3:8b inference (~1-3 seconds)
- Optimisation: Future versions may include local database caching for faster access
- FastMCP Framework: Robust server infrastructure
- Async Processing: Non-blocking query handling
- Error Recovery: Graceful handling of network timeouts and API failures
- Memory Efficient: Streaming responses for large datasets
- Pattern Matching: Regex detection of Ensembl IDs and gene symbols
- AI Validation: llama3:8b confirms ambiguous candidates are real genes
- Database Resolution: Map gene symbols to canonical Ensembl identifiers
- Data Retrieval: Fetch comprehensive information via gget library
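The first two stages of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual implementation: the regular expressions are guesses at reasonable patterns, and a small static word list stands in for the llama3:8b validation step. All names are hypothetical.

```python
import re

# Assumption: Ensembl gene IDs look like ENSG (optionally with a species
# infix such as ENSMUSG) followed by 11 digits.
ENSEMBL_ID = re.compile(r"\bENS[A-Z]*G\d{11}\b")

# Assumption: gene symbols are upper-case tokens of 2 to 10 characters.
GENE_SYMBOL = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")

# Stand-in for AI validation: reject upper-case words that are not genes.
COMMON_WORDS = {"THE", "AND", "WHAT", "IS", "DNA", "RNA", "GENE", "TELL", "ME", "ABOUT"}


def detect_gene_candidates(text):
    """Return Ensembl IDs and candidate gene symbols found in free text."""
    ensembl_ids = ENSEMBL_ID.findall(text)
    # Blank out Ensembl IDs so they are not re-scanned as symbols.
    remainder = ENSEMBL_ID.sub(" ", text)
    symbols = []
    for token in GENE_SYMBOL.findall(remainder):
        if token not in COMMON_WORDS and token not in symbols:
            symbols.append(token)
    return {"ensembl_ids": ensembl_ids, "symbols": symbols}
```

For example, `detect_gene_candidates("compare TP53 and BRCA1")` yields the symbols `TP53` and `BRCA1`. The remaining stages would map each symbol to its canonical Ensembl ID and fetch the record via gget.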
- Ensembl: Primary source for gene annotations and genomic data
- NCBI: Complementary database for additional gene information
- Real-time Access: Always fetches latest database versions
- No Data Caching: Ensures information is current (though slower)
- Source Attribution: All data includes original database references
- Error Handling: Clear reporting when information is unavailable
- Gene Validation: AI prevents misidentification of non-gene terms as genes
- Precise Matching: Exact gene symbol to Ensembl ID resolution
- Network Dependency: Requires internet connection for database access
- API Rate Limits: Ensembl REST API may throttle heavy usage
- Query Speed: 10-15 seconds typical for comprehensive gene information
- Human-Focused: Primarily optimized for human genome queries
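One common way to soften the network and rate-limit constraints above is to retry failed lookups with exponential backoff. The sketch below is a generic pattern, not something the project is known to implement; `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands in for whatever function wraps the database call.

```python
import time


def fetch_with_retry(fetch, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(*args), retrying on connection errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(*args)
        except (ConnectionError, TimeoutError):
            # Give up after the final attempt and surface the original error.
            if attempt == retries - 1:
                raise
            # Wait 1x, 2x, 4x ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delay schedule can be checked in tests without actually waiting.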
- Local Database: PyEnsembl integration for faster queries (~100x speedup)
- Multi-species Support: Expanded beyond human genome
- Batch Processing: Handle multiple genes in single queries
- Visualization: Integration with plotting libraries for gene data
- API Extensions: Additional gget tools (phylogenetic trees, mutations, etc.)
- Performance: Local database caching to eliminate API delays
- Features: Pathway analysis and gene network visualization
- Integrations: Connect with other bioinformatics tools and workflows
- User Experience: Enhanced error messages and query suggestions
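The caching idea listed above could take many forms. As one hedged illustration, a small in-memory cache with a time-to-live keeps repeated lookups fast while bounding how stale a result can get. The class and method names here are hypothetical and do not describe a planned design.

```python
import time


class TTLCache:
    """Cache fetched values in memory and refetch once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (timestamp, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) if missing or expired."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch(key)
        self._store[key] = (now, value)
        return value
```

A shorter `ttl_seconds` stays closer to the current "always fetch the latest" behaviour; a longer one removes more API round trips.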
bio-nerd-tool/
├── gget_mcp/
│   ├── __init__.py
│   ├── server_simple.py           # Main FastMCP server with AI integration
│   ├── server.py                  # Alternative MCP server implementation
│   ├── safety/                    # Safety framework components
│   │   ├── boundaries.py          # Domain boundaries and validation
│   │   ├── epistemic.py           # Confidence and uncertainty handling
│   │   └── failures.py            # Error management and recovery
│   ├── tools/                     # Bioinformatics tool implementations
│   │   ├── gget_info.py           # Gene information queries
│   │   └── nl_gene_query.py       # Natural language processing
│   ├── schemas/                   # Data validation schemas
│   └── nlp/                       # Natural language processing components
│       └── query_processor.py
├── config/                        # Configuration files for various MCP clients
│   ├── claude_desktop_config.json
│   ├── lm_studio_config.json
│   └── ollama_config.py
├── tests/                         # Test suite
├── docs/                          # Documentation
├── ~/.local/bin/bio-nerd          # CLI command wrapper
├── pyproject.toml                 # Python package configuration
└── README.md