RafigaM · RafigaM · Jan 27, 2026 · Jan 27, 2026 · Jan 27, 2026
diff --git a/.idea/.gitignore b/.idea/.gitignore
diff --git a/README.md b/README.md
@@ -1 +1,376 @@
-# genomics-scripts
+# genomics-scripts
+
+A comprehensive toolkit for analyzing genomic structural variants (SVs) across cohorts. This project provides tools for processing VCF files containing SV information and generating unified cohort-level SV catalogs.
+
+---
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Scripts](#scripts)
+  - [cohort_vcf_SV.py](#cohort_vcf_svpy)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Output Files](#output-files)
+- [Functions Reference](#functions-reference)
+- [Author](#author)
+
+---
+
+## Overview
+
+This project processes structural variant (SV) data from multiple VCF files to identify unique deletions across a cohort, calculate allele frequencies, and generate standardized output files for further analysis.
+
+### Key Features
+
+- **Multi-sample VCF processing**: Reads VCF files from a directory recursively
+- **SV deduplication**: Identifies unique deletions accounting for overlapping coordinates (≥80% overlap)
+- **Allele frequency calculation**: Computes allele counts (AC), allele frequencies (AF), and distinguishes heterozygous (het) vs. homozygous (hom) calls
+- **Multiple output formats**: Generates VCF, BED, and TSV files for downstream analysis
+- **Deletion-specific filtering**: Focuses on deletions (DEL) ≥50 bp
+
+---
+
+## Scripts
+
+### cohort_vcf_SV.py
+
+**Purpose**: Processes a cohort of VCF files containing structural variant (SV) data, identifies unique deletions across samples, and generates unified SV catalogs in multiple formats.
+
+**Author**: Rafiga Masmaliyeva (@RafigaM)  
+**Email**: rmasmaliyeva@gmail.com  
+**Date Created**: Tue Nov 19 2024
+
+#### Dependencies
+
+```
+pysam
+scikit-allel (imported as allel)
+pandas
+numpy
+```
+
+Install dependencies:
+```bash
+pip install pysam scikit-allel pandas numpy
+```
+
+#### Main Functions
+
+##### `get_samples(input)`
+Recursively searches a directory for all VCF files.
+
+**Parameters:**
+- `input` (str): Path to directory containing VCF files
+
+**Returns:**
+- `list`: List of file paths to all `.vcf` files found
+
+---
+
+##### `convert_array_to_scalar(arr)`
+Converts a genotype array to VCF genotype format (e.g., `[0, 1]` → `"0/1"`).
+
+**Parameters:**
+- `arr` (array): Array with two genotype values
+
+**Returns:**
+- `str`: Genotype string in format "allele1/allele2"
+
+---
+
+##### `find_unique_thresholds_with_indexes(thresholds, chr)`
+Identifies unique SV positions by clustering overlapping intervals on the same chromosome. Uses 80% overlap threshold to group similar deletions across samples.
+
+**Parameters:**
+- `thresholds` (list): List of tuples `(start, end)` representing deletion positions
+- `chr` (str): Chromosome name
+
+**Returns:**
+- `tuple`: 
+  - `unique_thresholds` (list): List of non-overlapping representative intervals
+  - `threshold_map` (dict): Maps unique intervals to their "mate" variants with indices
+
+**Overlap Logic:**
+- Computes overlap percentage for each interval pair
+- Groups intervals if either has ≥80% overlap with existing unique interval
+- Allows clustering of SVs that are similar across samples
+
+---
+
+##### `generate_cohort(input)`
+Reads all VCF files in a directory and extracts PASS deletions ≥50 bp. Creates an initial dictionary with variant information.
+
+**Parameters:**
+- `input` (str): Path to directory with VCF files
+
+**Returns:**
+- `defaultdict`: Dictionary with structure:
+  ```
+  {
+    "chrom:pos:end": {
+      "CHROM": str,
+      "POS": int,
+      "ID": str,
+      "REF": str,
+      "ALT": str,
+      "QUAL": int,
+      "FILTER": str,
+      "INFO": str,
+      "sample": [list of genotypes],
+      "num": [list of VCF filenames]
+    }
+  }
+  ```
+
+**Filtering Criteria:**
+- Only PASS variants
+- Type: DEL (deletion)
+- Length: ≥50 bp
+
+---
+
+##### `unique_sv(my_dict, n)`
+Groups similar deletions across the cohort and calculates allele frequency statistics. This is the core function for deduplication and frequency calculation.
+
+**Parameters:**
+- `my_dict` (dict): Output from `generate_cohort()`
+- `n` (int): Total number of samples in the cohort
+
+**Returns:**
+- `defaultdict`: Unique SV dictionary with structure:
+  ```
+  {
+    "chrom:start:end": {
+      "CHROM": str,
+      "POS": int,
+      "end": int,
+      "ID": str,
+      "SVTYPE": str,
+      "SVLEN": int,
+      "FILTER": str,
+      "QUAL": int,
+      "GT": [list of genotypes],
+      "sample": [list of sample filenames],
+      "mates": [list of overlapping variants],
+      "ac": int (allele count),
+      "ac_het": int (heterozygous count),
+      "ac_hom": int (homozygous count),
+      "af": float (allele frequency)
+    }
+  }
+  ```
+
+**Filtering:**
+- Processes all 24 chromosomes (chr1-chr22, chrX, chrY)
+- Only includes deletions <1,000,000 bp (1 Mb)
+- Clusters overlapping variants
+
+**Calculations:**
+- **AC_het**: Number of heterozygous samples (0/1 genotype)
+- **AC_hom**: Number of homozygous samples (1/1 genotype)
+- **AC**: Total allele count = AC_het + (2 × AC_hom)
+- **AF**: Allele frequency = AC / (n_samples × 2)
+
+---
+
+##### `find_pos(mates)`
+Finds the longest interval among mate variants (overlapping SVs representing the same deletion).
+
+**Parameters:**
+- `mates` (list/tuple): List of `(start, end)` coordinate pairs
+
+**Returns:**
+- `tuple`: `(longest_start, longest_end)` - the longest interval
+
+---
+
+##### `find_gt(gt_list)`
+Determines the most common genotype across all samples for a variant.
+
+**Parameters:**
+- `gt_list` (list): List of genotype strings
+
+**Returns:**
+- `str`: Most frequently occurring genotype
+
+---
+
+##### `write2vcf(output_file, converted_dict)`
+Writes unique SVs to a VCF format file with proper headers.
+
+**Parameters:**
+- `output_file` (str): Output VCF filename
+- `converted_dict` (list): List of SV dictionaries (one per variant)
+
+**Output Format:** VCF 4.2 with INFO fields:
+- `AF`: Allele frequency
+- `AC`: Allele count
+- `AC_het`: Heterozygous count
+- `AC_hom`: Homozygous count
+- `END`: Deletion end position
+- `SVTYPE`: Structural variant type
+- `SVLEN`: Structural variant length
+
+---
+
+##### `write2bed(output_file, inner_dict)`
+Writes SV data to a TSV (tab-separated) file.
+
+**Parameters:**
+- `output_file` (str): Output TSV filename
+- `inner_dict` (list): List of SV dictionaries
+
+---
+
+##### `write2bed2(output_file, converted_dict)`
+Writes a simplified BED-format file with core SV information.
+
+**Parameters:**
+- `output_file` (str): Output BED filename
+- `converted_dict` (list): List of SV dictionaries
+
+**BED Format Columns:**
+1. CHROM: Chromosome
+2. POS: Start position
+3. end: End position
+4. SVTYPE: Type of SV (e.g., DEL)
+5. af: Allele frequency
+
+---
+
+#### Usage
+
+##### Basic Usage
+
+1. **Prepare your VCF files** in a directory (e.g., `vcf_files/`)
+2. **Edit the main() function** to configure:
+   - `folder`: Directory containing VCF files
+   - `n`: Total number of samples in your cohort
+   - `output_*`: Output file names
+
+3. **Run the script**:
+```bash
+python cohort_vcf_SV.py
+```
+
+##### Example Configuration
+
+```python
+def main():
+    folder = "vcf_files"           # Directory with VCF files
+    output_vcf = "cohort_del.vcf"  # Output VCF file
+    output_bed0 = "cohort_del.bed" # Output TSV file
+    output_bed = "cohort_del_all.bed"  # Output BED file
+
+    my_dict = generate_cohort(folder)
+    n = 1000  # Set to your actual cohort size
+    unique_dict = unique_sv(my_dict, n)
+
+    # Convert to records format for writing
+    unique_normal_dict = {key: value for key, value in unique_dict.items()}
+    unique_df = pd.DataFrame.from_dict(unique_normal_dict, orient='index')
+    unique_converted_dict = unique_df.to_dict(orient='records')
+
+    # Generate output files
+    write2vcf(output_vcf, unique_converted_dict)
+    write2bed(output_bed0, unique_converted_dict)
+    write2bed2(output_bed, unique_converted_dict)
+```
+
+---
+
+## Output Files
+
+The script generates three output files:
+
+### 1. **cohort_del.vcf**
+Standard VCF 4.2 format containing unique deletions with population statistics.
+
+**Fields:**
+- CHROM, POS, ID, REF, ALT, QUAL, FILTER
+- INFO: AF, AC, AC_het, AC_hom, END, SVTYPE, SVLEN
+- FORMAT: GT, FT, GQ
+- Sample genotypes
+
+### 2. **cohort_del.bed** (TSV format)
+Tab-separated file with all SV information converted from dictionaries.
+
+### 3. **cohort_del_all.bed** (BED format)
+Simplified BED format with essential SV coordinates and frequency:
+```
+chr1    1000    5000    DEL     0.025
+chr2    20000   22000   DEL     0.010
+...
+```
+
+---
+
+## Input Requirements
+
+### VCF File Format
+
+Input VCF files should contain:
+- **FILTER field**: Marked as PASS for variants to include
+- **SVTYPE**: Must be "DEL" for deletion
+- **SVLEN**: Deletion length (must be ≥50 bp)
+- **END**: Deletion end position
+- **Genotype data**: GT field with call data
+
+**Supported VCF versions**: 4.1+
+
+---
+
+## Algorithm Summary
+
+### Workflow
+
+1. **Read VCFs**: Recursively scan directory for `.vcf` files
+2. **Filter variants**: Keep only PASS, DEL, ≥50 bp
+3. **Cluster deletions**: Group overlapping variants (≥80% overlap) per chromosome
+4. **Calculate statistics**:
+   - Count heterozygous and homozygous carriers
+   - Compute allele counts and frequencies
+   - Identify longest interval among mates
+5. **Output results**: Write to VCF, BED, and TSV formats
+
+### Overlap Threshold
+
+Two intervals are considered the same deletion if:
+- Overlap percentage for interval 1 ≥ 80%, OR
+- Overlap percentage for interval 2 ≥ 80%
+
+This allows flexible clustering of imprecise SV boundaries.
+
+---
+
+## Notes and Limitations
+
+- **Maximum SV size**: 1,000,000 bp (filtered in `unique_sv()`)
+- **Chromosome handling**: Currently supports hg38-style chromosome names (chr1-chr22, chrX, chrY)
+- **Genotype parsing**: Expects standard VCF genotype format (0/0, 0/1, 1/1, etc.)
+- **Sample count**: Must specify correct `n` value for accurate AF calculation
+- **Performance**: Runtime scales with number of VCF files and total variants
+
+---
+
+## Installation & Setup
+
+1. Clone the repository
+2. Install Python 3.6+
+3. Install dependencies:
+   ```bash
+   pip install pysam scikit-allel pandas numpy
+   ```
+4. Organize your VCF files in a single directory
+5. Update file paths in `main()` function
+6. Run: `python cohort_SV/cohort_vcf_SV.py`
+
+---
+
+## Author
+
+**Rafiga Masmaliyeva**  
+Twitter/GitHub: @RafigaM  
+Email: rmasmaliyeva@gmail.com  
+Created: November 19, 2024