Summary statistics

To estimate summary statistics, we used the estimated analytical rates and empirical data following the description at section Empirical estimation

Before starting the summary statistics, consider parallelizing the process using Julia Distributed computing. If you are following the tutorial step by step, do not input the following commands.

using Distributed
addprocs(7)

@everywhere using Analytical, CSV, DataFrames, JLD2, ProgressMeter

Before to compute the summary statistic, declare a model specifying the sample size and a DAC.

adap = Analytical.parameters(n=661,dac=[2,4,5,10,20,50,200,661,925])

Note you can only input DAC already estimated, nonetheles you can perform any subset from the estimated DAC. To check the estimated DAC you can follow the hierarchy of the h5file variable.

# Check hierarchy
h5 = jldopen("analysis/rates.jld2")

JLDFile /home/jmurga/analysis/rates.jld2 (read-only)
 โ””โ”€๐Ÿ“‚ 1000
    โ””โ”€๐Ÿ“‚ 661
       โ”œโ”€๐Ÿ”ข models
       โ”œโ”€๐Ÿ”ข neut
       โ”œโ”€๐Ÿ”ข sel
       โ”œโ”€๐Ÿ”ข dsdn
       โ””โ”€๐Ÿ”ข dac
# Checking estimated dac, string pattern inside the HDF5 variable
h5["1000/661/dac"]

14-element Vector{Int64}:
    1
    2
    4
    5
   10
   20
   50
  100
  200
  400
  500
  661
  925
 1000

To standardize the summary statistic estimation, the function Analytical.summary_statistics will search and read the SFS and divergence files given a folder. Please be sure that you write the SFS and divergence files (check Parsing genomic data) using the prefix sfs and div to read the files correctly. Otherwise, the function will not read the files correctly.

We include the argument bootstrap to perform bootstrap analysis following polyDFE manual. In the following example we boostrap the SFS and divegence file 100 times subseting 10^5 summary statistic for each dataset:

@time summstat = Analytical.summary_statistics(param=adap,h5_file ="analysis/rates.jld2",analysis_folder="analysis/",summstat_size=10^5,replicas=100,bootstrap=true);

The function will create a summary statistic file and the observed data file (summaries.txt and alphas.txt respectively). Both files will be used to perform the ABC inference. Each line in alphas.txt contains the $\alpha_(x)$ estimations from the bootstrapped SFS.