Parsing genomic data
The module includes functions to parse the TGP data from Uricchio et al. (2019) and the DGN data from Murga-Moreno et al. (2019). In addition, the module has a function to parse SFS and divergence from multi-FASTA data, following Murga-Moreno et al. (2019).
To parse raw data into SFS and divergence counts, first download the raw files deposited in our repository:
mkdir -p analysis/
curl -o analysis/tgp.txt https://raw.githubusercontent.com/jmurga/Analytical.jl/master/data/tgp.txt
curl -o analysis/dgnRal.txt https://raw.githubusercontent.com/jmurga/Analytical.jl/master/data/dgnRal.txt
Parsing TGP and DGN data manually
Once you have downloaded the files, use the function Analytical.parse_sfs to convert the data into SFS and divergence counts. See the documentation of Analytical.parse_sfs for more information, or run:
alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 661, data = "analysis/tgp.txt")
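For readers unfamiliar with the term, the following is a minimal sketch of what "SFS counts" mean: the number of segregating sites observed at each derived-allele count. It is a toy illustration only and does not reproduce how Analytical.parse_sfs works internally. The function name toy_sfs and the input vector are hypothetical.

```julia
# Toy illustration only: tally a site frequency spectrum (SFS) from
# per-site derived-allele counts. `toy_sfs` is a hypothetical helper,
# not part of Analytical.jl.
function toy_sfs(derived_counts::Vector{Int}, n::Int)
    sfs = zeros(Int, n - 1)          # one class per count 1 .. n-1
    for c in derived_counts
        if 1 <= c <= n - 1           # skip fixed or absent variants
            sfs[c] += 1
        end
    end
    return sfs
end

# Made-up derived-allele counts for seven sites in a sample of 5
toy_counts = [1, 1, 2, 4, 1, 3, 2]
toy_result = toy_sfs(toy_counts, 5)
```

Here toy_result holds one entry per frequency class, so the first entry counts singletons, the second doubletons, and so on.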
To save the data, you can use the CSV and DataFrames packages:
using CSV, DataFrames
CSV.write("analysis/tgp_sfs.tsv",DataFrame(sfs,:auto),delim='\t',header=false)
CSV.write("analysis/tgp_div.tsv",DataFrame(permutedims(divergence),:auto),delim='\t',header=false)
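As an alternative sketch, similar tab-delimited files can be written with the DelimitedFiles standard library, which avoids the CSV and DataFrames dependencies. The example assumes that sfs is a numeric matrix and divergence a numeric vector. The values below are made-up stand-ins, and the files go to a temporary directory rather than analysis/.

```julia
using DelimitedFiles

# Stand-in values (assumption): replace with the output of parse_sfs
sfs = [1 10 4; 2 6 3]
divergence = [120, 45]

outdir = mktempdir()
sfs_path = joinpath(outdir, "tgp_sfs.tsv")
div_path = joinpath(outdir, "tgp_div.tsv")

# Write without headers, tab-separated; divergence becomes a single row
writedlm(sfs_path, sfs, '\t')
writedlm(div_path, permutedims(divergence), '\t')

# Read the files back to check the round trip
sfs_back = readdlm(sfs_path, '\t', Int)
div_back = readdlm(div_path, '\t', Int)
```

Reading the files back is only a sanity check. In practice the two writedlm calls are enough.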
Genes can be subset directly by Ensembl or FlyBase ID. Pass a variable of type Matrix{String} to the argument gene_list:
download("https://raw.githubusercontent.com/jmurga/Analytical.jl/master/data/ensembl_list.txt", "analysis/ensembl_list.txt")
ensembl_list = CSV.read("analysis/ensembl_list.txt", DataFrame, header=false) |> Array
alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 661, data = "analysis/tgp.txt",gene_list = ensembl_list)
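If the IDs are not stored in a file, a Matrix{String} can also be built by hand. The sketch below uses placeholder IDs and a made-up table to illustrate the idea of subsetting by ID. The layout of the real input files is not described here, so the rows are an assumption, and this is not the filtering code used by Analytical.parse_sfs.

```julia
# Placeholder IDs (hypothetical); use real Ensembl or FlyBase IDs in practice
ids = ["GENE_A", "GENE_C"]

# reshape turns the vector into a one-column Matrix{String}
gene_list = reshape(ids, :, 1)

# Made-up per-gene rows: (gene ID, some count)
rows = [("GENE_A", 3), ("GENE_B", 5), ("GENE_C", 7)]

# Keep only the rows whose ID appears in the list
wanted = Set(vec(gene_list))
subset = filter(r -> first(r) in wanted, rows)
```

A variable built this way has the same type as the one read from ensembl_list.txt, so it can be passed to gene_list in the same manner.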
To parse DGN data, set the argument isolines to true. Following Murga-Moreno et al. (2019), the sample size for each population is:
- Zambia population: 154
- RAL population: 160
alpha, sfs, divergence = Analytical.parse_sfs(sample_size = 160, data = "analysis/dgnRal.txt", isolines = true)