read_slim: Import SLiM data to R

Description

To import SLiM data into R, we provide the read_slim function, which has been tested for SLiM versions 2.0-3.1. The read_slim function is only appropriate for single-nucleotide variant (SNV) data produced by SLiM's outputFull() method. We do not support output in MS or VCF format, i.e. output produced by outputMSSample() or outputVCFSample() in SLiM.

Usage

read_slim(file_path, keep_maf = 0.01, recomb_map = NULL,
  pathway_df = NULL, recode_recurrent = TRUE)

Arguments

file_path

character. The file path or URL of the .txt output file created by the outputFull() method in SLiM.

keep_maf

numeric. The largest allele frequency for retained SNVs; by default, keep_maf = 0.01. All variants with allele frequency greater than keep_maf will be removed. Please note, removing common variants is recommended for large data sets because of memory allocation limits in R. See details.

recomb_map

data frame. (Optional) A recombination map of the same format as the data frame returned by create_slimMap. See details.

pathway_df

data frame. (Optional) A data frame that contains the positions for each exon in a pathway of interest. See details.

recode_recurrent

logical. When TRUE, recurrent SNVs are cataloged as a single observation; by default, recode_recurrent = TRUE. See details.

Value

An object of class SNVdata, which inherits from a list and contains:

Haplotypes

A sparse matrix of haplotypes. See details.

Mutations

A data frame cataloging SNVs in Haplotypes. See details.

Details

In addition to reducing the size of the data, the argument keep_maf has practical value. In family-based studies, common SNVs are generally filtered out prior to analysis. Users who intend to study common variants in addition to rare variants may need to run chromosome-specific analyses to allow for allocation of large data sets in R.
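The filtering rule described for keep_maf can be illustrated on a toy matrix. This is only a sketch of the rule as stated above (retain SNVs whose allele frequency is at most keep_maf), not the internal code of read_slim; the matrix haps and the threshold of 0.2 are made up for illustration.

```r
# Toy haplotype matrix: 6 haplotypes (rows) by 4 SNVs (columns).
# An entry of 1 means the haplotype carries the derived allele.
# matrix() fills column by column, so each line below is one SNV.
haps <- matrix(c(1, 0, 0, 0, 0, 0,
                 1, 1, 1, 0, 0, 0,
                 0, 0, 0, 0, 0, 1,
                 1, 1, 1, 1, 1, 0),
               nrow = 6, ncol = 4)

# Derived allele frequency of each SNV (column means of a 0/1 matrix).
afreq <- colMeans(haps)

# Illustrative threshold; the read_slim default is 0.01.
keep_maf <- 0.2

# Retain SNVs whose allele frequency does not exceed the threshold.
keep <- afreq <= keep_maf
rare_haps <- haps[, keep, drop = FALSE]

dim(rare_haps)
```

Dropping columns before further processing is what keeps the retained data small enough to allocate in R.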

The argument recomb_map is used to remap mutations to their actual locations and chromosomes. This is necessary when data has been simulated over non-contiguous regions such as exon-only data. If create_slimMap was used to create the recombination map for SLiM, simply supply the output of create_slimMap to recomb_map. If recomb_map is not provided we assume that the SNV data has been simulated over a contiguous segment starting with the first base pair on chromosome 1.
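To make the idea of remapping concrete, the sketch below converts positions simulated over concatenated exons back to chromosome coordinates. It is an assumption-laden illustration: the data frame exons, its column names, the 1-based simulated positions, and the helper remap_position are all hypothetical and do not reproduce the format returned by create_slimMap or the code used by read_slim.

```r
# Hypothetical exon layout (NOT the create_slimMap format): two exons on
# chromosome 1 and one on chromosome 2, simulated back to back.
exons <- data.frame(chrom = c(1, 1, 2),
                    start = c(1001, 5001, 2001),
                    end   = c(1100, 5050, 2200))

# Length of each exon and its interval on the simulated (contiguous) scale.
exons$len       <- exons$end - exons$start + 1
exons$sim_end   <- cumsum(exons$len)
exons$sim_start <- exons$sim_end - exons$len + 1

# Hypothetical helper: map 1-based simulated positions to real coordinates.
remap_position <- function(sim_pos, exons) {
  # Index of the exon whose simulated interval contains each position.
  idx <- findInterval(sim_pos, exons$sim_start)
  data.frame(chrom    = exons$chrom[idx],
             position = exons$start[idx] + (sim_pos - exons$sim_start[idx]))
}

remapped <- remap_position(c(1, 100, 101, 350), exons)
remapped
```

Without such a map, every simulated position would be read as an offset from the first base pair of chromosome 1, which is the default behaviour described above.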

The data frame pathway_df allows users to identify SNVs located within a pathway of interest. When supplied, we expect that pathway_df does not contain any overlapping segments. All overlapping exons in pathway_df MUST be combined into a single observation. Users may combine overlapping exons with the combine_exons function.
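What "combined into a single observation" means can be shown with a small base R sketch. The helper merge_overlaps and the column names chrom, exonStart, and exonEnd are assumptions made for this illustration; in practice the package's combine_exons function is the intended tool, and its exact input format should be checked in its own help page.

```r
# Hypothetical helper: merge overlapping exons within each chromosome.
merge_overlaps <- function(df) {
  df <- df[order(df$chrom, df$exonStart), ]
  out <- df[0, ]
  for (i in seq_len(nrow(df))) {
    n <- nrow(out)
    overlaps <- n > 0 &&
      out$chrom[n] == df$chrom[i] &&
      df$exonStart[i] <= out$exonEnd[n]
    if (overlaps) {
      # Extend the last stored segment to cover the current exon.
      out$exonEnd[n] <- max(out$exonEnd[n], df$exonEnd[i])
    } else {
      # Start a new, non-overlapping segment.
      out <- rbind(out, df[i, ])
    }
  }
  rownames(out) <- NULL
  out
}

# Made-up exon data: the first two exons on chromosome 1 overlap.
exons <- data.frame(chrom     = c(1, 1, 1, 2),
                    exonStart = c(100, 150, 400, 100),
                    exonEnd   = c(200, 250, 500, 180))

merged <- merge_overlaps(exons)
merged
```

After merging, each base pair belongs to at most one segment, so an SNV can be classified as inside or outside the pathway without ambiguity.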

When TRUE, the logical argument recode_recurrent indicates that recurrent SNVs should be recorded as a single observation. SLiM can model many types of mutations; e.g. neutral, beneficial, and deleterious mutations. When different types of mutations occur at the same position carriers will experience different fitness effects depending on the carried mutation. However, when mutations at the same location have the same fitness effects, they represent a recurrent mutation. Even so, SLiM stores recurrent mutations separately and calculates their prevalence independently. When the argument recode_recurrent = TRUE we store recurrent mutations as a single observation and calculate the derived allele frequency based on their combined prevalence. This convention allows for both reduction in storage and correct estimation of the derived allele frequency of the mutation. Users who prefer to store recurrent mutations from independent lineages as unique entries should set recode_recurrent = FALSE.
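The effect of recoding can be sketched with a toy mutation table. The code below is an illustration under simplifying assumptions, not the implementation in read_slim: the data frame muts, its column names, and the haplotype count are invented, and mutations are treated as recurrent when they share both position and mutation type, which stands in for "same fitness effects".

```r
# Hypothetical mutation records: position, mutation type, and the number of
# haplotypes carrying each separately stored mutation (its prevalence).
muts <- data.frame(position   = c(10, 10, 10, 25),
                   type       = c("m1", "m1", "m2", "m1"),
                   prevalence = c(3, 2, 4, 1),
                   stringsAsFactors = FALSE)

# Assumed total number of haplotypes in the simulated population.
n_haplotypes <- 100

# Collapse records that share a position and a type, summing prevalence.
recoded <- aggregate(prevalence ~ position + type, data = muts, FUN = sum)

# Derived allele frequency from the combined prevalence.
recoded$afreq <- recoded$prevalence / n_haplotypes

recoded <- recoded[order(recoded$position, recoded$type), ]
recoded
```

The two m1 records at position 10 become one row with combined prevalence, while the m2 mutation at the same position stays separate because it is a different mutation type.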

The read_slim function returns an object of class SNVdata, which inherits from a list and contains the following two items:

Haplotypes A sparse matrix of class dgCMatrix (see dgCMatrix-class). The columns in Haplotypes represent distinct SNVs, while the rows represent individual haplotypes. We note that this matrix contains two rows of data for each diploid individual in the population: one row for the maternally inherited haplotype and the other for the paternally inherited haplotype.
Mutations A data frame cataloging SNVs in Haplotypes. The variables in the Mutations data set are described as follows:

colID
Associates the rows, i.e. SNVs, in Mutations to the columns of Haplotypes.

chrom
The chromosome that the SNV resides on.

position
The position of the SNV in base pairs.

afreq
The derived allele frequency of the SNV.

marker
A unique character identifier for the SNV.

type
The mutation type, as specified in the user's SLiM simulation.

pathwaySNV
Identifies SNVs located within the pathway of interest as TRUE.

Please note: the variable pathwaySNV will be omitted when pathway_df is not supplied to read_slim.
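The link between Mutations and Haplotypes through colID can be illustrated with toy objects. This sketch uses an ordinary dense matrix and invented values (including the marker strings) in place of real read_slim output, which stores Haplotypes as a sparse dgCMatrix; the same column indexing applies to both.

```r
# Toy stand-in for the Mutations data frame (values are made up).
Mutations <- data.frame(colID      = 1:4,
                        chrom      = c(1, 1, 2, 2),
                        position   = c(120, 480, 95, 300),
                        afreq      = c(0.25, 0.5, 0.25, 0.75),
                        marker     = c("1_120", "1_480", "2_95", "2_300"),
                        pathwaySNV = c(TRUE, FALSE, FALSE, TRUE),
                        stringsAsFactors = FALSE)

# Toy stand-in for Haplotypes: 4 haplotypes (rows) by 4 SNVs (columns).
Haplotypes <- matrix(c(1, 0, 0, 0,
                       1, 1, 0, 0,
                       0, 0, 1, 0,
                       1, 1, 1, 0),
                     nrow = 4)

# colID gives the Haplotypes column that corresponds to each Mutations row.
pathway_cols <- Mutations$colID[Mutations$pathwaySNV]
pathway_haps <- Haplotypes[, pathway_cols, drop = FALSE]

# Haplotypes carrying at least one pathway SNV.
carriers <- which(rowSums(pathway_haps) > 0)
carriers
```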

References

Haller, B. C. and Messer, P. W. (2017). SLiM 2: Flexible, interactive forward genetic simulations. Molecular Biology and Evolution, 34(1), pp. 230-240.

Douglas Bates and Martin Maechler (2018). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-14. https://CRAN.R-project.org/package=Matrix

Examples

# NOT RUN {
# Specify the URL of the example output data simulated by SLiM.
file_url <-
'https://raw.githubusercontent.com/cnieuwoudt/Example--SLiMSim/master/example_SLIMout.txt'
s_out <- read_slim(file_url)

class(s_out)
str(s_out)


# As seen above, read_slim returns an object of class SNVdata,
# which contains two items. The first is a sparse matrix
# named Haplotypes, which contains the haplotypes for each individual in the
# simulation. The second item is a data set named Mutations, which catalogs
# the mutations in the Haplotypes matrix.

# View the first 5 lines of the mutation data
head(s_out$Mutations, n = 5)

# View the first 20 mutations on the first 10 haplotypes
s_out$Haplotypes[1:10, 1:20]


# }