2 changes: 2 additions & 0 deletions .gitignore
@@ -4,6 +4,8 @@
.Ruserdata
.DS_Store
*tmp
__pycache__/
*.pyc

data/m84082_250614_081210_s3.hifi_reads.bam
data/GCF_016432855.1_SaNama_1.0_genomic.fna.gz
176 changes: 176 additions & 0 deletions analyses/04-pacbio/QUICKSTART.md
@@ -0,0 +1,176 @@
# Quick Start Guide: PacBio Data Summary

This guide helps you quickly generate a summary report of PacBio Revio sequencing data.

## Prerequisites

- Python 3.11 or higher
- Optional: `samtools` for detailed BAM analysis
- Optional: `pysam` Python package for BAM analysis

## Quick Commands

### 1. View the Pre-Generated Summary

A template summary report has already been generated:

```bash
# View the report
cat analyses/04-pacbio/sequencing_data_summary.md

# Or open in your markdown viewer
```

### 2. Generate a New Summary Report

If you have access to the data or want to customize the report:

#### Option A: Generate Template Only (No Data Required)

```bash
cd code/04-pacbio
python create_sequencing_summary.py --generate-template
```

#### Option B: Analyze Local Data Directory

If you have downloaded the data locally:

```bash
cd code/04-pacbio
python create_sequencing_summary.py \
--data-dir /path/to/pacbio/data \
--output ../../analyses/04-pacbio/sequencing_data_summary.md
```

#### Option C: Full Analysis with BAM Statistics

For detailed read statistics (slower, requires samtools or pysam):

```bash
cd code/04-pacbio
python create_sequencing_summary.py \
--data-dir /path/to/pacbio/data \
--analyze-bams \
--max-bams 5 \
--output ../../analyses/04-pacbio/sequencing_data_summary.md
```

## Data Location

The PacBio Revio sequencing data is hosted at:
```
https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/
```

**Note**: This URL may require VPN or specific network access to reach.

## Downloading Data

If you have access to the data repository, you can download files using:

```bash
# Create data directory
mkdir -p data/pacbio-reads

# Download using wget (example)
wget -r -np -nH --cut-dirs=3 \
  https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/ \
  -P data/pacbio-reads/

# Or use curl to list files first, then fetch each one
# (grep -P requires GNU grep; sort links, parent links, and subdirectories are skipped)
curl -s https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/ | \
  grep -oP 'href="\K[^"?/][^"]*' | grep -v '/$' | \
  while read -r file; do
    wget "https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/$file" \
      -P data/pacbio-reads/
  done
```
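
The file-listing step above can also be sketched in Python using only the standard library. This is a minimal illustration, assuming the server returns an Apache-style HTML directory index; the `LinkCollector` class and `file_urls` function are hypothetical names and are not part of `create_sequencing_summary.py`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def file_urls(listing_html, base_url):
    """Return absolute URLs for the plain files linked from a directory index."""
    parser = LinkCollector()
    parser.feed(listing_html)
    urls = []
    for href in parser.hrefs:
        # Skip column-sort links (?C=N;O=D), absolute and parent links,
        # and subdirectories (which end in a slash).
        if href.startswith(("?", "/", "#", "..")) or href.endswith("/") or "://" in href:
            continue
        urls.append(urljoin(base_url, href))
    return urls
```

The returned URLs could then be passed to `wget` or `urllib.request.urlretrieve` one at a time, which makes it easy to filter for specific samples before downloading.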

## Understanding the Report

The generated report includes:

### 1. **Overview Section**
- Project description
- Data source information
- Sequencing platform details

### 2. **File Inventory**
- List of all data files found
- Grouped by file type (BAM, XML, index files, etc.)
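
As a rough illustration of how such an inventory might be built, the sketch below groups file names by their final suffix. The `CATEGORIES` mapping and the `group_by_type` function are assumptions made for this example; the actual script may categorize files differently.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffix-to-category mapping; adjust to the files actually present.
CATEGORIES = {
    ".bam": "BAM",
    ".pbi": "Index",
    ".bai": "Index",
    ".xml": "XML",
}


def group_by_type(filenames):
    """Group file names into categories based on their final suffix."""
    groups = defaultdict(list)
    for name in filenames:
        suffix = Path(name).suffix.lower()
        groups[CATEGORIES.get(suffix, "Other")].append(name)
    return {category: sorted(names) for category, names in groups.items()}


if __name__ == "__main__":
    example = ["a.hifi_reads.bam", "a.hifi_reads.bam.pbi", "notes.txt"]
    for category, names in sorted(group_by_type(example).items()):
        print(category, names)
```

Only the last suffix is used, so a file such as `a.hifi_reads.bam.pbi` is treated as an index file rather than a BAM.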

### 3. **Sequencing Statistics** (if BAMs analyzed)
- Total number of reads
- Mean/median read lengths
- N50 values
- Quality scores
- File sizes
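
For readers unfamiliar with these metrics, the following sketch shows one common way to compute them from a list of read lengths. It is not taken from `create_sequencing_summary.py`; the function names are hypothetical. N50 is taken here to be the length L such that reads of length L or longer contain at least half of all sequenced bases.

```python
from statistics import mean, median


def n50(lengths):
    """Return the read length at which the cumulative base count,
    walking from longest to shortest read, first reaches half the total."""
    if not lengths:
        raise ValueError("no read lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def summarize(lengths):
    """Collect basic read-length statistics into a dictionary."""
    return {
        "reads": len(lengths),
        "total_bases": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
    }
```

In practice the lengths would come from the BAM records themselves, for example via `pysam` or `samtools`.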

### 4. **Sample Information**
- Lean lake trout subspecies details
- Siscowet lake trout subspecies details
- Biological context

### 5. **Analysis Workflows**
- Recommended commands for alignment
- Variant calling pipelines
- Methylation analysis approaches

### 6. **References**
- Links to related data (NCBI BioProject)
- Tool documentation
- Reference genome information

## Common Issues

### Cannot Access Data URL

If you cannot access the owl.fish.washington.edu URL:
- You may need VPN access to the University of Washington network
- Contact the RobertsLab for access instructions
- Use `--generate-template` to create a report without data access

### Missing Dependencies

If you get import errors:
```bash
# Install pysam for Python-based BAM analysis
cd code/04-pacbio
uv add pysam

# Or install samtools system-wide
# Ubuntu/Debian:
sudo apt-get install samtools

# macOS:
brew install samtools
```
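
Before running a BAM analysis, it can help to check which backend is actually available. The snippet below is a small sketch using only the standard library; `available_bam_backends` is a hypothetical helper and is not part of the project scripts.

```python
import importlib.util
import shutil


def available_bam_backends():
    """Report whether the pysam package and the samtools binary can be found."""
    return {
        # find_spec returns None when the package is not importable.
        "pysam": importlib.util.find_spec("pysam") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "samtools": shutil.which("samtools") is not None,
    }


if __name__ == "__main__":
    for name, found in available_bam_backends().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If both entries are reported as missing, the BAM analysis step has no backend to use, while `--generate-template` needs no data and may still be an option.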

### Script Not Executable

If you get permission errors:
```bash
chmod +x code/04-pacbio/create_sequencing_summary.py
```

## Next Steps

After generating the summary:

1. **Review the report** to understand the data structure
2. **Download specific samples** you want to analyze
3. **Run alignment workflows** using `align_hifi_pbmm2.py`
4. **Perform downstream analysis** (variants, methylation, etc.)

For more details, see:
- `code/04-pacbio/README.md` - Detailed script documentation
- `analyses/04-pacbio/README.md` - Analysis outputs overview
- `code/05-pacbio-align.Rmd` - Example alignment workflow

## Support

For questions or issues:
- Open an issue in the project-lake-trout repository
- Contact the RobertsLab team
- Check existing documentation in the `notes/` directory
46 changes: 46 additions & 0 deletions analyses/04-pacbio/README.md
@@ -0,0 +1,46 @@
# PacBio Analysis Outputs

This directory contains analysis outputs and reports from PacBio Revio sequencing data.

## Contents

### `sequencing_data_summary.md`

Comprehensive summary report of the PacBio Revio sequencing effort for the Lake Trout genomics project. This report includes:

- Overview of sequencing data from https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/
- File inventory and organization
- Sequencing statistics (read counts, lengths, quality metrics)
- Sample information (Lean vs Siscowet subspecies)
- Technology overview (PacBio Revio HiFi sequencing)
- Recommended analysis workflows
- References and related resources

**To regenerate or update this report:**
```bash
cd ../../code/04-pacbio
python create_sequencing_summary.py --data-dir /path/to/data --analyze-bams
```

### `alignments/`

Directory for storing aligned BAM files and related outputs from the pbmm2 alignment workflow.

## Related Scripts

Analysis scripts are located in `../../code/04-pacbio/`:
- `create_sequencing_summary.py` - Generate sequencing data summary reports
- `align_hifi_pbmm2.py` - Batch align HiFi reads using pbmm2
- See `../../code/04-pacbio/README.md` for detailed documentation

## Data Source

Primary sequencing data is hosted at:
https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/

## Project Context

This analysis is part of the Lake Trout (_Salvelinus namaycush_) genomics project comparing lean and siscowet subspecies. Related data and analyses:
- NCBI BioProject: [PRJNA674328](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA674328)
- Reference genome: GCF_016432855.1 (SaNama_1.0)
- RNAseq differential expression analysis: `../` (parent directory)
117 changes: 117 additions & 0 deletions analyses/04-pacbio/sequencing_data_summary.md
@@ -0,0 +1,117 @@
# PacBio Revio Sequencing Data Summary

## Lake Trout (_Salvelinus namaycush_) Genomics Project

**Report Generated**: 2025-10-18 17:56:57 UTC

## Overview

This report summarizes PacBio Revio HiFi sequencing data generated for comparative genomics analysis of lean and siscowet lake trout subspecies.

## Data Source

- **Repository**: https://owl.fish.washington.edu/nightingales/S_namaycush/LakeTrout/
- **Sequencing Platform**: PacBio Revio
- **Technology**: HiFi (High-Fidelity) Circular Consensus Sequencing (CCS)
- **Species**: _Salvelinus namaycush_ (Lake Trout)
- **Subspecies**: Lean and Siscowet

## Sample Information

### Subspecies

The sequencing data includes samples from two lake trout subspecies:

1. **Lean Lake Trout**
   - Morphotype: Pelagic/limnetic
   - Habitat: Open water
   - Characteristics: Streamlined body, lower fat content

2. **Siscowet Lake Trout**
   - Morphotype: Benthic/profundal
   - Habitat: Deep water
   - Characteristics: Higher fat content, adapted to deep waters

## Sequencing Technology

### PacBio Revio Platform

The PacBio Revio system is a high-throughput long-read sequencer that produces HiFi reads:

- **Read Type**: HiFi (High-Fidelity) reads
- **Accuracy**: At least 99% per read (Q20+) by definition; median read quality is typically Q30 (99.9%) or higher
- **Read Length**: Typically 10-25 kb, can exceed 30 kb
- **Method**: Circular Consensus Sequencing (CCS)
- **Applications**:
  - De novo genome assembly
  - Structural variant detection
  - Full-length isoform sequencing
  - Epigenetic analysis (5mC, 6mA methylation)
  - Haplotype phasing
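
The accuracy percentages and Q values quoted above are two views of the same quantity: a Phred quality score is defined as Q = -10 · log10(error probability). The short sketch below converts between them; the function names are illustrative only.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred quality score to the corresponding accuracy."""
    return 1 - 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: accuracy {phred_to_accuracy(q):.4%}")
```

Under this definition, 99.9% accuracy corresponds to Q30 and 99% to Q20.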

## Potential Analyses

This dataset enables multiple types of genomic analyses:

### 1. Genome Assembly
- De novo assembly for each subspecies
- Comparative genomics between lean and siscowet
- Identification of subspecies-specific genomic features

### 2. Structural Variation Analysis
- Detection of large insertions/deletions
- Identification of inversions and translocations
- Copy number variation analysis

### 3. Isoform Analysis
- Full-length transcript sequencing
- Alternative splicing patterns
- Gene expression differences between subspecies

### 4. Epigenetic Analysis
- DNA methylation patterns (5mC)
- Comparison of methylation between subspecies
- Gene regulation insights

## Recommended Analysis Workflows

### Alignment
```bash
# Align HiFi reads to reference genome using pbmm2
pbmm2 align --preset CCS --sort \
reference.fa \
input.hifi_reads.bam \
output.aligned.bam
```
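
When several HiFi BAM files need aligning, the same command can be generated per file. The sketch below only builds the argument lists and does not run anything. It is an illustration under stated assumptions, not the logic of `align_hifi_pbmm2.py`; the `plan_alignments` name and the `.aligned.bam` output naming are hypothetical.

```python
from pathlib import Path


def plan_alignments(reference, bam_paths, out_dir):
    """Build one `pbmm2 align` argument list per input BAM.

    Output files are named after the input with `.bam` replaced by
    `.aligned.bam` (an assumed convention for this example).
    """
    commands = []
    for bam in bam_paths:
        stem = Path(bam).name.removesuffix(".bam")
        output = str(Path(out_dir) / f"{stem}.aligned.bam")
        commands.append(
            ["pbmm2", "align", "--preset", "CCS", "--sort", reference, bam, output]
        )
    return commands


if __name__ == "__main__":
    for cmd in plan_alignments("reference.fa", ["input.hifi_reads.bam"], "alignments"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right. Building lists rather than shell strings avoids quoting problems with unusual file names.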

### Variant Calling
```bash
# Call structural variants using pbsv (DeepVariant is an option for SNVs and small indels)
pbsv discover aligned.bam variants.svsig.gz
pbsv call reference.fa variants.svsig.gz variants.vcf
```

### Methylation Analysis
```bash
# Predict 5mC methylation (MM/ML tags) using primrose; input reads must carry kinetics tags
primrose aligned.bam output.bam
# Analyze with pb-CpG-tools or custom scripts
```
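
Methylation calls are stored in the SAM `MM` and `ML` optional tags, so a quick check of one record shows whether a BAM already carries them. The sketch below inspects a single SAM text line, for example one produced by `samtools view`. The `has_methylation_tags` helper is hypothetical, and accepting lowercase spellings is an assumption intended to cover output from some older tools.

```python
def has_methylation_tags(sam_line):
    """Return True if a SAM record line carries both MM and ML optional tags."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 fields are mandatory; optional TAG:TYPE:VALUE fields follow.
    tags = {field[:2].upper() for field in fields[11:]}
    return {"MM", "ML"} <= tags


if __name__ == "__main__":
    import sys

    # Example: samtools view input.hifi_reads.bam | head -n 1 | python check_tags.py
    first_record = sys.stdin.readline()
    print("methylation tags present" if has_methylation_tags(first_record) else "no MM/ML tags")
```

If the tags are already present, a separate methylation-calling step may be unnecessary for that file.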

## References

### Related Data
- **NCBI BioProject**: [PRJNA674328](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA674328)
- **Reference Genome**: GCF_016432855.1 (SaNama_1.0)

### Tools and Documentation
- [PacBio SMRT Tools](https://www.pacb.com/support/software-downloads/)
- [pbmm2 Aligner](https://github.com/PacificBiosciences/pbmm2)
- [pbsv Structural Variant Caller](https://github.com/PacificBiosciences/pbsv)
- [Primrose Methylation Caller](https://github.com/PacificBiosciences/primrose)

---

*This report was generated using `create_sequencing_summary.py` from the project-lake-trout repository.*

*For questions or issues, please contact the RobertsLab team.*