Skip to content

Minimum genome size is not settable, making normalizing amplicons impossible #2

@da-i

Description

@da-i

Hi Author(s),

I think it is super cool that you've implemented this dynamic strategy!

We where trying to normalize some amplicons on a given chromosome, so we made sequences that where unique to these amplicons and provided these as a reference file. However the minimum genome size of 10k made this impossible.

There is an unclear error when no sequence is long enough in the reference genome


2023-06-08 15:35:31,905 reading reference file
Traceback (most recent call last):
  File "bossruns.py", line 14, in <module>
    bossrun.initialise_OTUs()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1374, in initialise_OTUs
    otu.init_reference_wgs(ref=self.args.ref, mmi=self.args.ref_idx, reject_refs=self.otu.reject_refs)
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 603, in init_reference_wgs
    reference.load_reference()
  File "/home/grid/BOSS-RUNS/bossruns/BR_reference.py", line 90, in load_reference
    self.genome = integer_genome(chromosome_sequences=self.chromosome_sequences)
  File "/home/grid/BOSS-RUNS/bossruns/BR_utils.py", line 204, in integer_genome
    genome = np.concatenate(genome)
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate

After some digging it apears that the mmi property at 604 is None.
If we hardcode the mmi, and set min_len in BR_reference to 100: we get slightly further into the code:

2023-06-08 15:48:57,585 finished init ------- 

2023-06-08 15:48:57,585 loading mapping index: None
Traceback (most recent call last):
  File "bossruns.py", line 15, in <module>
    bossrun.initialise_merged()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1426, in initialise_merged
    self.mapper.init_mappy(ref=ref)
  File "/home/grid/BOSS-RUNS/bossruns/BR_mapper.py", line 25, in init_mappy
    self.aligner = mappy.Aligner(fn_idx_in=ref, preset="map-ont")  # , extra_flags=0x400)
  File "python/mappy.pyx", line 144, in mappy.Aligner.__cinit__
TypeError: descriptor 'encode' for 'str' objects doesn't apply to a 'NoneType' object

If we print ref here, it is None

If we check the values at line 1426 in BR_core we get

Checking args.mu
Namespace(alpha=300, alphaPrior=1, base_quality_threshold=1, bucket_size=20000, ckp=None, conditions=False, cov_until=5, deletion_error=0.05, deletion_substitution_ratio=0.4, device='X2', err_missed_deletion=0.1, eta=11, fastq_dir='/data/./TP53_bossrun_test_01/TP53_multiplex/20230607_1545_X2_FAW73389_935289a8/fastq_pass', flank=10000, host='localhost', mu=400, out_dir='./bossruns_TP53_sampling', p0=0.1, ploidy=1, port=None, ref='/data/references/TP53_sampling.fa', ref_idx=None, ref_priors=1, reject_refs=None, rho=300, run_name='TP53_sampling', substitution_error=0.06, testing=False, theta=0.01, vcf=None, wait=90, wgs=1, window=100, windowSize=2000)
Checking ref variable in BR_mapper
None
Traceback (most recent call last):
  File "bossruns.py", line 15, in <module>
    bossrun.initialise_merged()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1428, in initialise_merged
    self.mapper.init_mappy(ref=ref)
  File "/home/grid/BOSS-RUNS/bossruns/BR_mapper.py", line 27, in init_mappy
    self.aligner = mappy.Aligner(fn_idx_in=ref, preset="map-ont")  # , extra_flags=0x400)
  File "python/mappy.pyx", line 144, in mappy.Aligner.__cinit__
TypeError: descriptor 'encode' for 'str' objects doesn't apply to a 'NoneType' object

If i hardcode ref here, i get slightly futher, as it is actually aligning reads:

...
2023-06-08 15:57:47,973 reading file: /data/./TP53_bossrun_test_01/TP53_multiplex/20230607_1545_X2_FAW73389_935289a8/fastq_pass/FAW73389_pass_935289a8_f7ada654_10.fastq.gz
2023-06-08 15:57:48,130 processing 4000 reads in this batch
2023-06-08 15:59:29,869 Number of mapped reads: 291241, unmapped reads: 96759
2023-06-08 16:00:06,494 Counts and rel. proportions of observed reads:
2023-06-08 16:00:06,494 bl4: 27002 0.07
2023-06-08 16:00:06,494 bl5: 5773 0.015
2023-06-08 16:00:06,494 bl6: 95893 0.247
2023-06-08 16:00:06,494 bl7: 2277 0.006
2023-06-08 16:00:06,494 bl8: 58456 0.151
2023-06-08 16:00:06,494 bl9: 3670 0.009
2023-06-08 16:00:06,494 bl10: 822 0.002
2023-06-08 16:00:06,495 bl11: 89223 0.23
2023-06-08 16:00:06,495 bl12: 4950 0.013
2023-06-08 16:00:06,495 bl13: 3078 0.008
2023-06-08 16:00:06,495 bl15: 97 0.0
2023-06-08 16:00:06,899 updating read length distribution
2023-06-08 16:00:06,905 new approx ccl
2023-06-08 16:00:06,905 [ 856  963 1070 1202 1312 1394 1462 1531 1639 2178]
2023-06-08 16:00:06,905 updating scores
2023-06-08 16:00:13,063 switch count: off 0; on 7
Traceback (most recent call last):
  File "bossruns.py", line 24, in <module>
    next_update = bossrun.process_batch()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2404, in process_batch
    self.update_strategy()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2051, in update_strategy
    BossRun.flush_strat_buckets(strat=strat, otu=otu, switches=switches, window=self.args.window)
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2011, in flush_strat_buckets
    c_switches_adj = adjust_length(original=prev_strat, expanded=c_switches_rep)
  File "/home/grid/BOSS-RUNS/bossruns/BR_utils.py", line 85, in adjust_length
    assert repl.shape[0] == original.shape[0]
AssertionError

This is where i decided to stop digging in the codebase to hardcode "fixes".

Is there a way to make this work? How would you procede? An alternative is to add 5kb of noise to the start and end of the amplicon sequences, but that feels very hacky.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions