featureScores: Get scores at regular sample points around genomic features.

Description

Given a GRanges or GRangesList object, or BAM file paths, of reads for each experimental condition, or an AffymetrixCelSet or a numeric matrix of array data, where the rows are probes and the columns are the different samples, and an annotation of features of interest, scores at regularly spaced positions around the features are calculated. In the case of sequencing data, the score is the smoothed coverage of reads divided by the library size. In the case of array data, it is the array intensity.

Arguments

Usage

The ANY,data.frame method:
featureScores(x, anno, ...)

The ANY,GRanges method:
featureScores(x, anno, up = NULL, down = NULL, ...)
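
For the data.frame method, the annotation table needs the columns chr, start, and end, with strand and name optional. The following is a minimal sketch of building such a table in base R and checking it before use. The coordinates, gene names, and the `reads` object in the commented call are hypothetical and serve only as illustration.

```r
# Hypothetical annotation table; coordinates and names are made up for illustration.
anno <- data.frame(
  chr    = c("chr21", "chr21", "chr21"),
  start  = c(10000, 55000, 120000),
  end    = c(15000, 60000, 128000),
  strand = c("+", "-", "+"),        # optional column
  name   = c("geneA", "geneB", "geneC"),  # optional column
  stringsAsFactors = FALSE
)

# Check that the required columns are present before passing the table on.
required <- c("chr", "start", "end")
missing.cols <- setdiff(required, colnames(anno))
stopifnot(length(missing.cols) == 0)

# With Repitools loaded and `reads` (hypothetical) holding the read data,
# the call would look something like:
# fs <- featureScores(reads, anno, up = 2000, down = 1000, freq = 500, s.width = 500)
```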

Details

If x is a vector of paths or a GRangesList object, then names(x) should contain the types of the experiments.

If anno is a data.frame, it must contain the columns chr, start, and end. Optional columns are strand and name. If anno is a GRanges object, then the name can be present as a column called name in the element metadata of the GRanges object. If names are given, then the coverage matrices will use the names as their row names.

An approximation to running mean smoothing of the coverage is used. Reads are extended to the smoothing width, rather than to their fragment size, and coverage is used directly. This method is faster than a running mean of the calculated coverage, and qualitatively almost identical.

If providing a matrix of array intensity values, the column names of this matrix are used as the names of the samples.

The annotation can be stranded or not. If the annotation is stranded, then the reference point is the start coordinate for features on the + strand, and the end coordinate for features on the - strand. If the annotation is unstranded (e.g. annotation of CpG islands), then the midpoint of the feature is used as the reference point.

The up and down values give how far up and down from the reference point to find scores. Their semantics depend on whether the annotation is stranded. If the annotation is stranded, then they give how far upstream and downstream will be sampled. If the annotation is unstranded, then up gives how far towards the start of a chromosome to go, and down gives how far towards the end of a chromosome to go. If sequencing data is being analysed, and dist is "percent", then they give how many percent of each feature's width away from the reference point the sampling boundaries are. If dist is "base", then the boundaries of the sampling region are a fixed width for every feature, and the units of up and down are bases. up and down must be identical if the features are unstranded.
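
The reference point and up/down rules above can be illustrated with a small base R sketch for the dist = "base" case. The helper `samplingPositions` is hypothetical and is not part of Repitools; in particular, the order in which positions are returned for features on the - strand is an assumption made here for illustration.

```r
# Hypothetical helper, not part of Repitools: sampling positions for one
# feature when distances are given in bases.
samplingPositions <- function(start, end, strand = "*", up, down, freq) {
  stranded <- strand %in% c("+", "-")
  if (!stranded && up != down)
    stop("up and down must be identical for unstranded features")

  # Reference point: start for +, end for -, midpoint when unstranded.
  if (strand == "+") {
    ref <- start
  } else if (strand == "-") {
    ref <- end
  } else {
    ref <- round((start + end) / 2)
  }

  # Offsets run from 'up' bases before the reference point to 'down' after it.
  offsets <- seq(-up, down, by = freq)

  # On the - strand, upstream lies at higher coordinates, so offsets are flipped.
  if (strand == "-") ref - offsets else ref + offsets
}

plus <- samplingPositions(10000, 15000, "+", up = 2000, down = 1000, freq = 500)
# 8000 8500 9000 9500 10000 10500 11000

minus <- samplingPositions(10000, 15000, "-", up = 2000, down = 1000, freq = 500)
# 17000 16500 16000 15500 15000 14500 14000

unstranded <- samplingPositions(10000, 15000, "*", up = 1000, down = 1000, freq = 500)
# 11500 12000 12500 13000 13500
```

For the unstranded feature, lower coordinates (towards the chromosome start) correspond to up and higher coordinates to down, which is why the two values have to match.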

The units of freq are percent when dist is "percent", and bases when dist is "base".

In the case of array data, the sequence of positions described by up, down, and freq actually describes the boundaries of windows, and the probe that is closest to the midpoint of each window is chosen as the representative score of that window. On the other hand, when analysing sequencing data, the sequence of positions refers to the positions at which coverage is taken.

Providing a mappability object for sequencing data is recommended. Otherwise, it is not possible to know whether a score of 0 arises because the window around the sampling position is unmappable, or because there were really no reads mapping there in the experiment. Coverage is normalised by dividing the raw coverage by the total number of reads in a sample. The coverage at a sampling position is multiplied by 1 / mappability. Any positions that have mappability below the mappability cutoff will have their score set to NA.
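
The normalisation and mappability correction described above amount to a short calculation, sketched here in base R. The helper `normaliseScores`, the example values, and the cutoff of 0.5 are assumptions for illustration only and do not reproduce the package's internal code.

```r
# Hypothetical helper, not part of Repitools: normalise raw coverage at a set
# of sampling positions and apply a mappability correction.
normaliseScores <- function(raw.cov, lib.size, mappability = NULL, cutoff = 0.5) {
  # Divide raw coverage by the total number of reads in the sample.
  scores <- raw.cov / lib.size
  if (!is.null(mappability)) {
    # Scale up coverage in partially mappable windows.
    scores <- scores * (1 / mappability)
    # Positions below the mappability cutoff cannot be interpreted.
    scores[mappability < cutoff] <- NA
  }
  scores
}

raw.cov     <- c(12, 0, 30, 8)          # made-up coverage at four positions
mappability <- c(1.0, 0.2, 0.75, 0.5)   # made-up fraction of mappable bases
scores <- normaliseScores(raw.cov, lib.size = 1e6, mappability, cutoff = 0.5)
# 1.2e-05 NA 4.0e-05 1.6e-05
```

The second position shows why a mappability object helps: its raw coverage is 0, but because the window is mostly unmappable the score is reported as NA rather than as a true absence of reads.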

Value

A ScoresList object, that holds a list of score matrices, one for each experiment type, and the parameters that were used to create the score matrices.

Examples

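library(Repitools)

data(chr21genes)
data(samplesList)  # Loads 'samples.list.subset'.

# Score the first two samples around each gene, with up = 2000 and down = 1000,
# sampling at intervals of 500 and using a smoothing width of 500.
fs <- featureScores(samples.list.subset[1:2], chr21genes, up = 2000, down = 1000,
                    freq = 500, s.width = 500)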