The EM Algorithm is an iterative method to calculate the MLE
of observed data when there is a latent variable. This
algorithm is implemented in Python in emAlgorithm.py.
The Hierarchical Estimator uses Markov Chain Composite
Likelihood (MCCL) to approximate the MLE by assuming the
observed data follow a Markov Chain. This algorithm is
implemented in R. See DOCUMENTATION.txt for details.
When you first clone the repo, run the following commands in your terminal from this directory.
python3 -m venv .venvsource .venv/bin/activatepip install -r requirements.txt
Create a file of the binarized DNA sequences, with one observed sequence per line.
To use the EM algorithm: in emAlgorithm.py, run main() with
the following parameters:
Lis an integer with the length of the DNA sequencesqis a float representing the probability of recombinationfile_nameis a path to the file containing the data
To use the Hierarchical estimator: In simulator.R, run
get_one_estimate with the following parameters:
Lis an integer with the length of the DNA sequencesq_valis a float representing the probability of recombinationmis an integer representing the margin to use, which must be between 1 and min(3,L- 2).nis the number of DNA sequences in the data setfile_nameis a path to the file containing the data
In main.py, set the parameters test_l, test_q, test_pi, and
test_m. Then, call run_full_simulation() to do the simulation.
test_lis an array of integers representing the lengths you want to testtest_qis an array of floats representing the recombination probabilities you want to testtest_piis a dictionary with a key for each element, L oftest_l. The value associated with each key is an array of length 2^L representing the true ancestral distribution.test_mis a dictionary with a key for each element, L oftest_l. The value associated with each key is an array with integers ranging from 1 to L - 2, with a maximum of 3.generate_new_datais a Boolean. It is False by default, but can be True if you don't have existing data or want to generate new data.