Summary statistics
To estimate summary statistics, we used the estimated analytical rates and empirical data following the description at section Empirical estimation
Before starting the summary statistics, consider parallelizing the process using Julia Distributed computing. If you are following the tutorial step by step, do not input the following commands.
using Distributed
addprocs(7)
@everywhere using Analytical, CSV, DataFrames, JLD2, ProgressMeter
Before to compute the summary statistic, declare a model specifying the sample size and a DAC.
adap = Analytical.parameters(n=661,dac=[2,4,5,10,20,50,200,661,925])
Note you can only input DAC already estimated, nonetheles you can perform any subset from the estimated DAC. To check the estimated DAC you can follow the hierarchy of the h5file
variable.
# Check hierarchy
h5 = jldopen("analysis/rates.jld2")
JLDFile /home/jmurga/analysis/rates.jld2 (read-only)
โโ๐ 1000
โโ๐ 661
โโ๐ข models
โโ๐ข neut
โโ๐ข sel
โโ๐ข dsdn
โโ๐ข dac
# Checking estimated dac, string pattern inside the HDF5 variable
h5["1000/661/dac"]
14-element Vector{Int64}:
1
2
4
5
10
20
50
100
200
400
500
661
925
1000
To standardize the summary statistic estimation, the function Analytical.summary_statistics
will search and read the SFS and divergence files given a folder. Please be sure that you write the SFS and divergence files (check Parsing genomic data) using the prefix sfs and div to read the files correctly. Otherwise, the function will not read the files correctly.
We include the argument bootstrap
to perform bootstrap analysis following polyDFE manual. In the following example we boostrap the SFS and divegence file 100 times subseting 10^5 summary statistic for each dataset:
@time summstat = Analytical.summary_statistics(param=adap,h5_file ="analysis/rates.jld2",analysis_folder="analysis/",summstat_size=10^5,replicas=100,bootstrap=true);
The function will create a summary statistic file and the observed data file (summaries.txt and alphas.txt respectively). Both files will be used to perform the ABC inference. Each line in alphas.txt contains the $\alpha_(x)$ estimations from the bootstrapped SFS.