Infering the adapation rate from multiple datasets

As we explained at section Input Data it is possible to subset the TGP dataset given a matrix of Ensembl/Flybase ids. In this section we will use whole-genome protein-coding information, Viral Interacting Proteins and Non-Viral Interacting Proteins to infer the empirical adaptation.

The next step assume that you execute the previous sections. Please before to execute this example, be sure you already performed the rates estimates and your analysis/ folder contain the TGP data.

We are going to create three separate directories into the analysis/ directory we are working with to perform the estimations. You can create the folder manually, through bash, or directly in the Julia interpreter

using Distributed
addprocs(7)
@everywhere using Analytical
using CSV, DataFrames, JLD2

run(`mkdir -p analysis/wg/ analysis/vips/ analysis/nonvips/`)

Once you have the folder please download from our repository the files containing the VIPs and Non-VIPs ids.

run(`curl -o analysis/vips/vip_list.txt https://raw.githubusercontent.com/jmurga/Analytical.jl/master/data/vip_list.txt`)
run(`curl -o analysis/nonvips/nonvip_list.txt https://raw.githubusercontent.com/jmurga/Analytical.jl/master/data/nonvip_list.txt`)

Now we are going to parse our three dataset to output the SFS and divergence files into each analysis folder. We we follow the example provided at Input Data section.

Whole-Genome dataset

alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 661, data = "analysis/tgp.txt")
CSV.write("analysis/wg/sfs_tgp.tsv",DataFrame(sfs,:auto),delim='\t',header=false)
CSV.write("analysis/wg/div_tgp.tsv",DataFrame(permutedims(divergence),:auto),delim='\t',header=false)

VIPs dataset

vips_list = CSV.read("analysis/vips/vip_list.txt",DataFrame) |> Matrix{String}

alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 661, data = "analysis/tgp.txt",gene_list=vips_list)
CSV.write("analysis/vips/sfs.tsv",DataFrame(sfs,:auto),delim='\t',header=false)
CSV.write("analysis/vips/div.tsv",DataFrame(permutedims(divergence),:auto),delim='\t',header=false)

Non-VIPs dataset

nonvips_list = CSV.read("analysis/nonvips/nonvip_list.txt",DataFrame) |> Matrix{String}

alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 661, data = "analysis/tgp.txt",gene_list=nonvips_list)
CSV.write("analysis/nonvips/sfs.tsv",DataFrame(sfs,:auto),delim='\t',header=false)
CSV.write("analysis/nonvips/div.tsv",DataFrame(permutedims(divergence),:auto),delim='\t',header=false)

Once you have the data parsed, you can estimate the summary statistic following Summary statistic section. Load the rates and select a DAC before to continue the analysis

adap = Analytical.parameters(n=661,dac=[2,4,5,10,20,50,200,661,925])

Whole-Genome dataset

@time summstat = Analytical.summary_statistics(param=adap,h5_file="analysis/rates.jld2",analysis_folder="analysis/wg/",summstat_size=10^6,replicas=100,bootstrap=true);

VIPs dataset

@time summstat = Analytical.summary_statistics(param=adap,h5_file="analysis/rates.jld2",analysis_folder="analysis/vips/",summstat_size=10^6,replicas=100,bootstrap=true);

Non-VIPs dataset

@time summstat = Analytical.summary_statistics(param=adap,h5_file="analysis/rates.jld2",analysis_folder="analysis/nonvips/",summstat_size=10^6,replicas=100,bootstrap=true);

Now you can perform the ABC inference using ABCreg or another ABC software using the files alphas.txt and summaries.txt deposited in each folder. We will perform the inference using ABCreg. Please compile ABCreg before to perform the execution as described in the Installation

Whole-Genome dataset

Analytical.ABCreg(analysis_folder="analysis/wg/",S=size(adap.dac,1),tol=0.001,abcreg="/home/jmurga/ABCreg/src/reg");

VIPs dataset

Analytical.ABCreg(analysis_folder="analysis/vips/",S=size(adap.dac,1),tol=0.001,abcreg="/home/jmurga/ABCreg/src/reg");

Non-VIPs dataset

Analytical.ABCreg(analysis_folder="analysis/nonvips/",S=size(adap.dac,1),tol=0.001,abcreg="/home/jmurga/ABCreg/src/reg");

Once the posterior distributions are estimated you can perform the Maximum-A-Posterior (MAP) estimates. We performed the MAP estimates following ABCreg examples. Please be sure you have installed R and the packages(ggplot2, locfit and data.table). This packages are already installed in Docker and Singularity images. Load the R functions to estimate and plot MAP executing the following command

Analytical.source_plot_map_R(script="analysis/script.jl")

Whole-Genome dataset

tgpmap = Analytical.plotMap(analysis_folder="analysis/wg/")
DataFrames.describe(tgpmap)

5×7 DataFrame
 Row │ variable  mean          min          median        max          nmissing  eltype   
     │ Symbol    Float64       Float64      Float64       Float64      Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ aw           0.104377     0.0350986     0.110202      0.184566         0  Float64
   2 │ as           0.0481464   -0.004137      0.0441584     0.114982         0  Float64
   3 │ a            0.151626     0.110054      0.148903      0.203543         0  Float64
   4 │ gamNeg    1302.71       612.844      1474.7        1742.59             0  Float64
   5 │ shape        0.141999     0.130137      0.141801      0.15542          0  Float64

VIPs dataset

vipsmap = Analytical.plotMap(analysis_folder="analysis/vips/")
DataFrames.describe(vipsmap)

5×7 DataFrame
 Row │ variable  mean        min          median      max          nmissing  eltype   
     │ Symbol    Float64     Float64      Float64     Float64      Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ aw          0.131834   -0.016945     0.14184      0.284644         0  Float64
   2 │ as          0.160363    0.0780512    0.157498     0.27026          0  Float64
   3 │ a           0.284903    0.229213     0.283095     0.369928         0  Float64
   4 │ gamNeg    807.123     583.229      657.89      1594.79             0  Float64
   5 │ shape       0.202337    0.17996      0.201092     0.235535         0  Float64

Non-VIPs dataset

nonvipsmap = Analytical.plotMap(analysis_folder="analysis/nonvips/")
DataFrames.describe(nonvipsmap)

5×7 DataFrame
 Row │ variable  mean          min          median        max          nmissing  eltype   
     │ Symbol    Float64       Float64      Float64       Float64      Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ aw           0.0982744    0.0320596     0.098271      0.196821         0  Float64
   2 │ as           0.0438451   -0.001272      0.0450831     0.099763         0  Float64
   3 │ a            0.140408     0.0964904     0.141531      0.196149         0  Float64
   4 │ gamNeg    1441.17       681.867      1547.24       1773.32             0  Float64
   5 │ shape        0.134675     0.12122       0.135073      0.148782         0  Float64