-
Notifications
You must be signed in to change notification settings - Fork 23
Feature/karen bacterial wrapper #237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
karenhx2
wants to merge
31
commits into
develop
Choose a base branch
from
feature/karen_bacterial_wrapper
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
31 commits
Select commit
Hold shift + click to select a range
8aa2032
Developed script for wrapping bacterial chromosomes
karenhx2 fbca683
Cleaned up files
karenhx2 1440b21
Merge branch 'main' into feature/karen_bacterial_wrapper
joshfactorial edc67b8
Worked on testing functions for stitching outputs together
karenhx2 ba60be7
Merge branch 'feature/karen_bacterial_wrapper' of github.com:ncsa/NEA…
karenhx2 541b3b1
Removed reference and data files
karenhx2 85a9c51
Updated script to include paired-ended runs
karenhx2 acef491
Implemented CLI for wrapper
karenhx2 e75e596
Removed extra copy of runner code
karenhx2 8459c8b
Fixed bugs with config files and output stitching
karenhx2 3fd4eb2
Merge branch 'develop' into feature/karen_bacterial_wrapper
joshfactorial dc28b19
Finished final debugging
karenhx2 aaaeed4
Accidentally removed poetry file and re-adding it
karenhx2 ec1c70a
Merge branch 'develop' of github.com:ncsa/NEAT into feature/karen_bac…
karenhx2 349857c
Merge branch 'develop' of github.com:ncsa/NEAT into feature/karen_bac…
karenhx2 4a33b46
Updated wrapper
karenhx2 383a4ef
Developed script for wrapping bacterial chromosomes
karenhx2 49ada66
Cleaned up files
karenhx2 7c9c546
Worked on testing functions for stitching outputs together
karenhx2 17a5340
Removed reference and data files
karenhx2 7d9798a
Updated script to include paired-ended runs
karenhx2 7c92f0f
Implemented CLI for wrapper
karenhx2 4e482af
Removed extra copy of runner code
karenhx2 a73ce4c
Fixed bugs with config files and output stitching
karenhx2 1a8dd50
Finished final debugging
karenhx2 1b7e179
Accidentally removed poetry file and re-adding it
karenhx2 625be66
Updated wrapper
karenhx2 0464d64
A couple of changes related to filetypes
joshfactorial d109128
Fixing a small error introduced by merging
joshfactorial be62d97
Resolved merging conflicts
karenhx2 f55f87d
Fixed pytest error with reuse splits
karenhx2 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| """ | ||
| Load modules needed for other parts of the program | ||
| """ | ||
| from .runner import * |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,203 @@ | ||
| import subprocess | ||
| import gzip | ||
| import shutil | ||
| import yaml | ||
| import pysam | ||
| import unittest | ||
| import os | ||
|
|
||
| from pathlib import Path | ||
| from typing import List | ||
| from Bio import bgzf | ||
| from Bio.bgzf import BgzfWriter, BgzfReader | ||
|
|
||
|
|
||
| # Rearranges the bacterial chromosome by wrapping it around | ||
|
|
||
| def wrapper(seq): | ||
| length = len(seq) | ||
|
|
||
| if (length % 2 == 0): | ||
| half_index = length // 2 | ||
| else: | ||
| half_index = (length // 2) + 1 | ||
|
|
||
| first_half = seq[:half_index] | ||
| second_half = seq[half_index:] | ||
|
|
||
| new_seq = second_half + first_half | ||
|
|
||
| return new_seq | ||
|
|
||
|
|
||
| # Writes the newly rearranged chromosome's sequence to a new fasta file | ||
|
|
||
| def write_fasta_file(new_seq, bacteria_name, fasta_header, output_dir_path, type): | ||
| fasta_file_name = f"{type}_{bacteria_name}.fna" | ||
| fasta_file_path = output_dir_path / fasta_file_name | ||
| fasta_file = open(fasta_file_path, "w") | ||
|
|
||
| fasta_file.write(fasta_header + "\n" + new_seq) | ||
|
|
||
| fasta_file.close() | ||
|
|
||
| return fasta_file_path | ||
|
|
||
|
|
||
| # Writes a yml configuration file for the newly rearranged chromosome's fasta sequence | ||
| # Splits the coverage in half for the reference and new config files | ||
| # These use default values for all other parameters for NEAT | ||
|
|
||
| def write_config_file(ref_config_file, rearranged_seq_file, orig_seq_file, bacteria_name, output_dir_path): | ||
| new_config_file_name = f"new_{bacteria_name}_config_test.yml" | ||
| old_config_file_name = f"{bacteria_name}_config_test.yml" | ||
|
|
||
| new_config_file_path = output_dir_path / new_config_file_name | ||
| old_config_file_path = output_dir_path / old_config_file_name | ||
|
|
||
| with open(ref_config_file, 'r') as ref_file, open(new_config_file_path, 'w') as new_file, open(old_config_file_path, 'w') as old_file: | ||
| for line in ref_file: | ||
| if line.find("reference:") != -1: | ||
| new_file.write(f"reference: {rearranged_seq_file}\n") | ||
| old_file.write(f"reference: {orig_seq_file}\n") | ||
| elif line.find("coverage:") != -1: | ||
| if line.strip() == "coverage: .": | ||
| new_coverage = 5.0 | ||
| else: | ||
| new_coverage = float((line.split(" "))[1].strip()) // 2 | ||
|
|
||
| new_file.write(f"coverage: {new_coverage}\n") | ||
| old_file.write(f"coverage: {new_coverage}\n") | ||
| else: | ||
| new_file.write(line) | ||
| old_file.write(line) | ||
|
|
||
|
|
||
| ref_file.close() | ||
| new_file.close() | ||
| old_file.close() | ||
|
|
||
| return old_config_file_path, new_config_file_path | ||
|
|
||
|
|
||
| # Runs the NEAT read simulator using the given config file | ||
|
|
||
| def run_neat(config_file, output_dir, prefix): | ||
| subprocess.run(["neat", "read-simulator", "-c", config_file, "-o", output_dir + "/" + prefix]) | ||
|
|
||
|
|
||
| # General function for bacterial wrapper that calls all of the functions defined above | ||
|
|
||
| def bacterial_wrapper(reference_file, bacteria_name, ref_config_file, output_dir): | ||
|
|
||
| orig_seq = "" | ||
|
|
||
| f = open(reference_file) | ||
| fasta_header = f.readline().strip() | ||
|
|
||
| plasmids = False | ||
|
|
||
| for line in f: | ||
| if line[0] != ">": | ||
| orig_seq += line.strip() | ||
| elif line.lower().find("plasmid") != -1: # exclude plasmids from the sequence to be rearranged | ||
| plasmids = True | ||
| break | ||
|
|
||
| f.close() | ||
|
|
||
| output_dir_path = Path(output_dir) | ||
|
|
||
| rearranged_seq = wrapper(orig_seq) | ||
| rearranged_seq_file = write_fasta_file(rearranged_seq, bacteria_name, fasta_header, output_dir_path, "wrapped") | ||
|
|
||
| orig_seq_file = reference_file | ||
| if plasmids: | ||
| orig_seq_file = write_fasta_file(orig_seq, bacteria_name, fasta_header, output_dir_path, "orig") | ||
|
|
||
| config_files = write_config_file(ref_config_file, rearranged_seq_file, orig_seq_file, bacteria_name, output_dir_path) | ||
| old_config_file = config_files[0] | ||
| new_config_file = config_files[1] | ||
|
|
||
| run_neat(old_config_file, output_dir, "Regular") | ||
| run_neat(new_config_file, output_dir, "Wrapped") | ||
|
|
||
|
|
||
| # Stitching all outputs together - Keshav's script | ||
|
|
||
| def concat_fq(input_files: List[Path], dest: Path) -> None: | ||
|
|
||
| if not input_files: | ||
| # Nothing to do, and no error to throw | ||
| return | ||
|
|
||
| with gzip.open(dest, 'wt') as out_f: | ||
| for input_file in input_files: | ||
| with gzip.open(input_file, 'rt') as in_f: | ||
| shutil.copyfileobj(in_f, out_f) | ||
|
|
||
| def merge_bam(bams: List[Path], dest: Path, threads: int) -> None: | ||
|
|
||
| if not bams: | ||
| return | ||
|
|
||
| unsorted = dest.with_suffix(".unsorted.bam") | ||
| pysam.merge("--no-PG", "-@", str(threads), "-f", str(unsorted), *map(str, bams)) | ||
| pysam.sort("-@", str(threads), "-o", str(dest), str(unsorted)) | ||
| unsorted.unlink(missing_ok=True) | ||
|
|
||
| def merge_vcf(vcfs: List[Path], dest: Path) -> None: | ||
| if not vcfs: | ||
| return | ||
|
|
||
| first, *rest = vcfs | ||
| shutil.copy(first, dest) | ||
|
|
||
| with dest.open("ab") as out_f: | ||
| for vcf in rest: | ||
| with vcf.open("rb") as fh: | ||
| for line in fh: | ||
| if not line.startswith(b"#"): | ||
| out_f.write(line) | ||
|
|
||
| def stitch_all_outputs(files: List[Path], output_dir) -> None: | ||
| fq1_list = [] | ||
| fq2_list = [] | ||
| vcf_list = [] | ||
| bam_list = [] | ||
|
|
||
| for file in files: | ||
| file_name = file.stem # use stem to differentiate fq1 and fq2 | ||
| suffixes = file.suffixes # use suffixes to catch vcf and bam files | ||
|
|
||
| if "r2.fastq" in file_name: | ||
| fq2_list.append(file) | ||
| elif "r1.fastq" in file_name or ".fastq" in suffixes: | ||
| fq1_list.append(file) | ||
| elif ".vcf" in suffixes and ".tbi" not in suffixes: | ||
| vcf_list.append(file) | ||
| elif ".bam" in suffixes and ".bai" not in suffixes: | ||
| bam_list.append(file) | ||
|
|
||
| dest_fq1 = Path(f"{output_dir}/stitched_fq1.gz") | ||
| dest_bam = Path(f"{output_dir}/stitched.bam") | ||
| dest_vcf = Path(f"{output_dir}/stitched.vcf") | ||
|
|
||
| concat_fq(fq1_list, dest_fq1) | ||
|
|
||
| if (fq2_list): | ||
| dest_fq2 = Path(f"{output_dir}/stitched_fq2.gz") | ||
| concat_fq(fq2_list, dest_fq2) | ||
|
|
||
| merge_bam(bam_list, dest_bam, 2) | ||
| merge_vcf(vcf_list, dest_vcf) | ||
|
|
||
|
|
||
| # Testing functions | ||
|
|
||
| class TestWrapper(unittest.TestCase): | ||
| def test_even(self): | ||
| self.assertEqual(wrapper("ABBCBB"), "CBBABB") | ||
|
|
||
| def test_odd(self): | ||
| self.assertEqual(wrapper("ABBCBBC"), "BBCABBC") |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| """ | ||
| Command line interface for NEAT's bacterial wrapper function | ||
| """ | ||
|
|
||
| import argparse | ||
| import subprocess | ||
| import os | ||
| from pathlib import Path | ||
|
|
||
| from ...bacterial_wrapper import bacterial_wrapper | ||
| from ...bacterial_wrapper import stitch_all_outputs | ||
| from .base import BaseCommand | ||
| from .options import output_group | ||
|
|
||
|
|
||
| class Command(BaseCommand): | ||
| """ | ||
| Class that generates wrapped bacterial models | ||
| """ | ||
| name = "bacterial-wrapper" | ||
| description = "Generate wrapped bacterial model reads" | ||
|
|
||
| def add_arguments(self, parser: argparse.ArgumentParser): | ||
| """ | ||
| Add the command's arguments to its parser | ||
|
|
||
| :param parser: The parser to add arguments to | ||
| """ | ||
|
|
||
| parser.add_argument('reference', | ||
| type=str, | ||
| metavar='reference.fa', | ||
| help="Reference file for organism in fasta format.") | ||
|
|
||
| parser.add_argument('bacteria_name', | ||
| type=str, | ||
| metavar='bacteria_name', | ||
| help="Name of the bacteria.") | ||
|
|
||
| parser.add_argument( | ||
| "-c", "--config", | ||
| metavar="config", | ||
| type=str, | ||
| required=True, | ||
| help="Path (including filename) to the configuration file for the reference run." | ||
| ) | ||
|
|
||
| output_group.add_to_parser(parser) | ||
|
|
||
| def execute(self, arguments: argparse.Namespace): | ||
| """ | ||
| Execute the command | ||
|
|
||
| :param arguments: The namespace with arguments and their values. | ||
| """ | ||
| bacterial_wrapper(arguments.reference, arguments.bacteria_name, arguments.config, arguments.output_dir) | ||
|
|
||
| output_path = Path(arguments.output_dir) | ||
| file_list = os.listdir(output_path / "Regular") # same file names for both Regular and Wrapped folders | ||
|
|
||
| output_files = [] | ||
|
|
||
| for file in file_list: | ||
| reg_file_path = output_path / "Regular" / file | ||
| wrap_file_path = output_path / "Wrapped" / file | ||
|
|
||
| if ("vcf" in file): | ||
| subprocess.run(["gzip", "-d", reg_file_path]) | ||
| subprocess.run(["gzip", "-d", wrap_file_path]) | ||
|
|
||
| file = file[:-3] | ||
| reg_file_path = output_path / "Regular" / file | ||
| wrap_file_path = output_path / "Wrapped" / file | ||
|
|
||
| output_files.append(reg_file_path) | ||
| output_files.append(wrap_file_path) | ||
|
|
||
| stitch_all_outputs(output_files, arguments.output_dir) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.