baysic.data: Organizes data for BaySIC functions

Description

Creates a list object from mutation and reference data for use with the BaySIC fitting and testing functions.

Usage

baysic.data(dat, ref.dat, plot = FALSE, N = NULL, silent = TRUE)

Arguments

dat

matrix; mutation input data. BaySIC requires a specific format similar to the MUT file format: dat should be an $M\times7$ matrix with column headings "chr", "start", "end", "id", "type", "gene", "context", where each of the $M$ rows details an individual mutation.

ref.dat

a data frame or list of data frames; ref.dat is a representation of the sequence content of each gene of interest for 32 unique trinucleotide sequence contexts, yielding a $G\times34$ matrix, where $G$ is the total number of genes. If ref.dat is a single matrix, all subjects are assumed to share the same reference data. Reference data may vary from subject to subject because of different platforms or coverage; in that case, ref.dat can instead be a list of N reference data matrices, where N is the number of subjects. The names of the list elements should correspond to the ids used in dat.
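As an illustration of the per-subject case, the sketch below builds a named list of reference tables and checks that every subject id has a matching list element. The ids, gene name, and the single "ACA" column are hypothetical placeholders (a real table would carry all 32 context columns), and the check is only a suggested sanity test, not part of the package.

```r
# Hypothetical subject ids as they would appear in the "id" column of dat
ids <- c("S1", "S2", "S1")

# Truncated reference table (one motif column only, for brevity)
ref.one <- data.frame(gene = "GENE1", chr = "1", ACA = 12L,
                      stringsAsFactors = FALSE)

# One reference table per subject, named by subject id
ref.list <- list(S1 = ref.one, S2 = ref.one)

# Subject ids that have no matching reference table (ideally none)
missing.ids <- setdiff(unique(ids), names(ref.list))
length(missing.ids) == 0
```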

plot

logical; if TRUE, a plot summarizing the mutation data overall and per subject is generated. Defaults to FALSE.

N

integer (optional); the number of subjects represented in dat. If N=NULL and is.list(ref.dat)==FALSE, N is assumed to be the number of unique subject ids in dat. If is.list(ref.dat)==TRUE, then N=length(ref.dat).
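The rule for N can be paraphrased as a small helper. This is a sketch of the documented behaviour only, not the package's own code; resolve_N is a hypothetical name. Note that in R a data frame is itself a list, so is.list() returns TRUE for it; the sketch therefore treats a data frame as a single reference table, which is an assumption on top of the description above.

```r
# Hypothetical helper that mirrors the documented rule for N
resolve_N <- function(dat, ref.dat, N = NULL) {
  # A plain list of reference tables: one element per subject
  if (is.list(ref.dat) && !is.data.frame(ref.dat)) {
    return(length(ref.dat))
  }
  # Single reference table and no N supplied: count unique subject ids
  if (is.null(N)) {
    return(length(unique(dat[, "id"])))
  }
  N
}

# Toy inputs (illustrative values only)
dat <- cbind(id = c("S1", "S2", "S1"), gene = c("GENE1", "GENE1", "GENE2"))
ref.single <- data.frame(gene = "GENE1", chr = "1")

n.single <- resolve_N(dat, ref.single)
n.list   <- resolve_N(dat, list(S1 = ref.single, S2 = ref.single, S3 = ref.single))
n.given  <- resolve_N(dat, ref.single, N = 5)
```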

silent

logical; if FALSE, mutations defined as 'Synonymous' or 'Silent' will be removed from the dataset and subsequent analyses. Defaults to TRUE.

Value

all.dat: Original mutation data object dat
ref.dat: Original reference data object ref.dat
N: Number of subjects with observed data
genes: Vector of length $G$ of gene names included in analysis, where $G$ is the total number of genes. Derived from ref.dat
snv.dat: A $G\times32$ matrix of total number of SNV mutations per sequence context and gene
indel.dat: Vector of length $G$ of total number of indel mutations per gene
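
To make the shapes of snv.dat and indel.dat concrete, the sketch below tabulates a toy mutation table by gene and sequence context. It is one plausible way to obtain a $G\times32$ count matrix and a length-$G$ indel vector; how baysic.data computes them internally is not described here, and all gene names and mutations are made up.

```r
genes <- c("GENE1", "GENE2")

# The 32 contexts with "C" or "T" as the central base
bases  <- c("A", "C", "G", "T")
grid   <- expand.grid(five = bases, centre = c("C", "T"), three = bases,
                      stringsAsFactors = FALSE)
motifs <- paste0(grid$five, grid$centre, grid$three)

# Toy mutations (illustrative values only)
mut <- data.frame(
  type    = c("Missense", "Nonsense", "INDEL", "Missense"),
  gene    = c("GENE1", "GENE1", "GENE1", "GENE2"),
  context = c("ACA", "ACA", NA, "TTG"),
  stringsAsFactors = FALSE
)

is.indel <- toupper(mut$type) == "INDEL"
snv      <- mut[!is.indel, ]

# G x 32 matrix of SNV counts per gene and context
snv.counts <- unclass(table(factor(snv$gene, levels = genes),
                            factor(snv$context, levels = motifs)))

# Length-G vector of indel counts per gene
indel.counts <- as.vector(table(factor(mut$gene[is.indel], levels = genes)))
names(indel.counts) <- genes
```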

Details

The mutation data dat is a 7-column matrix similar in style to other popular mutation file formats. The first three columns ("chr", "start", "end") give the position of the somatic mutation. The "id" column holds the subject id for each documented mutation. The "type" column gives the type of mutation for each entry. This column is relatively flexible for point mutations: it only needs to contain some form of "silent" or "synonymous" for such mutations when silent=FALSE. Insertion/deletion events, however, should be designated as "INDEL". The "gene" column gives the name of the gene the mutation falls in, and must match the gene names used in ref.dat. The "context" entries give the trinucleotide sequence context of each point mutation (NA for indels).
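
A minimal sketch of a dat matrix in this layout is shown below. All positions, ids, gene names, and mutation types are invented for illustration; only the column headings and the "INDEL"/NA conventions come from the description above.

```r
# Toy mutation matrix; cbind() on character vectors yields a character matrix
dat <- cbind(
  chr     = c("1", "1", "17"),
  start   = c("1000", "2500", "7579000"),
  end     = c("1000", "2502", "7579000"),
  id      = c("S1", "S2", "S1"),
  type    = c("Missense", "INDEL", "Silent"),
  gene    = c("GENE1", "GENE1", "GENE2"),
  context = c("ACA", NA, "TTG")   # NA for the indel row
)

# Basic format checks before passing the matrix on
expected.cols <- c("chr", "start", "end", "id", "type", "gene", "context")
identical(colnames(dat), expected.cols)
all(is.na(dat[dat[, "type"] == "INDEL", "context"]))
```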

The first two columns of the data matrix (or matrices) in ref.dat should hold the gene name and the corresponding chromosome, and the column names of the remaining 32 columns should be the trinucleotide motifs (e.g. "ACA"). The sequence content entries should be integer values giving the number of nucleotides in the coding sequence of a given gene that satisfy the trinucleotide motif (central base with flanking 5' and 3' bases). Each base should be counted exactly once, so that the sum of all 32 counts equals the base-pair length of the total coding sequence of the gene.
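
The layout can be sketched as follows. The counts are random placeholders rather than real coding-sequence content, and the column names "gene" and "chr" for the first two columns are assumptions, since only their contents are specified above.

```r
# The 32 motifs: any 5' base, central "C" or "T", any 3' base
bases  <- c("A", "C", "G", "T")
grid   <- expand.grid(five = bases, centre = c("C", "T"), three = bases,
                      stringsAsFactors = FALSE)
motifs <- paste0(grid$five, grid$centre, grid$three)

# Placeholder integer counts for two hypothetical genes
set.seed(1)
counts <- matrix(sample(10:60, 2 * 32, replace = TRUE), nrow = 2,
                 dimnames = list(NULL, motifs))

ref.dat <- data.frame(gene = c("GENE1", "GENE2"), chr = c("1", "17"),
                      counts, stringsAsFactors = FALSE, check.names = FALSE)

# The 32 counts of each row should add up to the gene's coding length
coding.length <- rowSums(ref.dat[, motifs])
```

With real data, comparing coding.length against known coding-sequence lengths is a quick way to confirm that each base was counted exactly once.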

The baysic.data function has its own trinucleotide naming convention, in that all motifs are in all caps and have either "T" or "C" as the central base. Column names of ref.dat and "context" entries in dat will be adjusted to accommodate this convention if they deviate from it.
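
One plausible way to bring motifs into this convention is to upper-case them and take the reverse complement whenever the central base is "A" or "G". The helper below, normalize_motif, is a hypothetical sketch under that assumption; the adjustment actually performed by baysic.data is not spelled out in this documentation.

```r
# Hypothetical helper: upper-case, then reverse-complement motifs whose
# central base is a purine so that the centre becomes "C" or "T"
normalize_motif <- function(x) {
  ok <- !is.na(x)
  x[ok] <- toupper(x[ok])
  flip <- ok & substr(x, 2, 2) %in% c("A", "G")
  revcomp <- function(s) {
    vapply(strsplit(s, ""), function(ch) {
      paste(rev(chartr("ACGT", "TGCA", ch)), collapse = "")
    }, character(1))
  }
  x[flip] <- revcomp(x[flip])
  x
}

normalized <- normalize_motif(c("aca", "TGT", NA, "AGC"))
# normalized is c("ACA", "ACA", NA, "GCT")
```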

Examples

## Not run:
data(example.dat)
data(ccds.19)
baysic.dat.ex <- baysic.data(example.dat, ccds.19)
## End(Not run)