getPeaklist: Peak pattern extraction by non-negative ls/lad template matching

Description

Generates a candidate list of isotopic peak patterns present in a protein mass spectrum. This is achieved by matching templates calculated according to the so-called Averagine model to the raw spectrum using either non-negative least squares (ls) or non-negative least absolute deviation (lad) estimation. The presence of multiple charge states is supported. In particular, the approach is capable of deconvolving overlapping patterns. The method can be applied with two different kind of peak shapes, Gaussians and Exponentially Modified Gaussians (EMG).

Usage

getPeaklist(mz, intensities, model = c("Gaussian", "EMG"),
model.parameters = list(alpha = function(){}, sigma = function(){}, mu = function(){}),
averagine.table = NULL,                     
loss = c("L2", "L1"), binning = FALSE,
postprocessing = TRUE, trace = TRUE, returnbasis = TRUE,
control.basis = list(charges = c(1,2,3,4), eps = 1e-05),
control.localnoise = list(quantile = 0.5, factor.place = 1.5, factor.post = 0, window = NULL, subtract = FALSE),
control.postprocessing = list(mzfilter = FALSE, prune = FALSE, factor.prune = NULL, ppm = NULL, goodnessoffit = FALSE),
control.binning = list(tol = 0.01))

Arguments

A numeric vector of m/z (mass/charge) values (in Thomson), ordered increasingly.

intensities

A numeric vector of intensities corresponding to mz.

model

Basic model for the shape of a single peak. Must be "Gaussian" or "EMG" (exponentially modified Gaussian). See fitModelParameters for further information on these models.

averagine.table

If averagine.table = NULL the usual Averagine model (Senko et al. (1995): Determination of Monoisotopic Masses and Ion Populations for Large Biomolecules from Resolved Isotopic Distributions, J. Am. Soc. Mass Spect., 6, 229-233) is used. Otherwise, avergine.table has to be a matrix or data.frame of the following form: each row encodes the isotope distribution at a certain mass. The first entry of each row contains such mass while the remaining entries contain the relative abundances of the isotopes.

loss

The loss function to be used. The choice

loss =
      "L2"

yield a nonnegative least squares fit, loss = "L1" a nonnegative least absolute deviation fit. The second choice is more robust when deviations from model assumptions (peak model, Averagine model,...) frequently occur in the data. Note, however, that computation time is much higher for the second choice (at least by a factor two).

model.parameters

A list of functions with precisely one argument representing mz. The parameters of a single peak are typically modeled as a function of m/z. If model = "Gaussian", the peak shape depends on the parameter sigma (a function sigma(mz)). If model = "EMG", the peak shape additionally depends on two parameters alpha and mu (two functions alpha(mz) and mu(mz)). Note that constant functions are usually specified by using a construction of the form

parameter(mz) <- function(mz)
      rep(constant, length(mz))

. Moreover, note that a valid function has to be vectorized. For the automatic generation of those functions from a raw spectrum and further details on the meaning of the parameters, see fitModelParameters. The output of a call to fitModelParameters can directly be plugged into getPeaklist via the argument model.parameters.

binning

A logical indicating whether the fitting process should be done sequentially in 'bins'. If TRUE, the spectrum is cut into pieces (bins). Each bin is then fitted separately, and the results of all bins are combined in the end. Division into bins may be configured using control.binning. See also the 'Details' section below.

postprocessing

A logical indicating whether a post-processing correction should be applied to the raw peaklist. See also the argument control.postprocessing and the 'Details' section below.

trace

A logical indicating whether information tracing the different steps of the fitting process should be displayed.

returnbasis

A logical indicating whether the matrix of basis functions (template functions evaluated at mz) should be returned. Note that this may be expensive in terms of storage.

control.basis

A list of arguments controlling the computation of the matrix of basis functions:

charges: The set of charge states present in the spectrum.
eps: Function values below eps are set equal to precisely zero in order to make the basis function matrix sparse.

control.localnoise

A list of arguments controlling the placement and selection of basis functions on the basis of a 'local noise level':

quantile: A value from 0.1, 0.2, ..., 0.9, specifying the quantile of the intensities residing in a sliding m/z window (s. below) to be used as 'local noise level'.
factor.place: Controls the placement of basis functions. A basis function is placed at an element of mz if and only if the intensity at that position exceeds the 'local noise level' by a factor at least equal to factor.place.
factor.post: Controls which basis functions enter the postprocessing step. A basis function is discarded before the postprocessing step if its estimated amplitude does not exceed the factor.post times the 'local noise level'. By default factor.post = 0. The pre-filtering step before postprocessing is mainly done for computational speed-up, and factor.post = 0 is supposed to yield the qualitatively best solution, though it may take additional time.
window: The length of the sliding window used to compute quantile, to be specified in Thomson. By default, window is chosen in such a way that it equals the length of the support of an 'average' basis function for charge state one.
subtract: A logical indicating whether the 'local noise level' should be subtracted from the observed intensities. Setting subtract = TRUE is typically beneficial in the sense that fitting of noise is reduced.

control.postprocessing

A list of arguments controlling the postprocessing step (provided postprocessing = TRUE):

mzfilter: Setting mzfilter = TRUE removes basis functions at positions where peak patterns are highly improbable to occur, thereby removing peaks from the list which are likely to be noise peaks. This filter is sometimes called 'peptide mass rule', see Zubarev et al. (1996): Accuracy Requirements for Peptide Characterization by Monoisotopic Molecular Measurements, Anal. Chem.,88,4060-4063.
prune, factor.prune: Setting prune = TRUE activates a crude scheme that removes low-intensity peaks (likely to be noise peaks), as frequently occurring in regions with extremely intense peaks. According to this scheme, a peak is removed from the peak list if its amplitude is less than factor.prune times the locally most intense amplitude, where factor.prune typically ranges from 0.01 to 0.1.
ppm: A ppm (= parts per million) tolerance value within which basis functions at different m/z positions are considered to be merged, s. 'Details' below. By default, that value is computed from the spacing of the first two m/z positions.
goodnessoffit: A logical indicating whether a local goodness-of-fit adjustment of the signa-to-noise ratio should be computed. Yields usually more reliable evaluation of the detected patterns, but is computationally more demanding.

control.binning

Controls the division of the spectrum into bins (if binning = TRUE). Based on the 'local noise level' described in control.localnoise, if within a range of (1+tol) Thomson no further significant position occurs, a bin is closed, and a new one is not opened as long as a new significant position occurs..

Value

peaklist.

Warning

Although we have tried to choose default values expected to produce sensible results, the user should carefully examine all options.

Warning

Depending on the length and the resolution of the raw spectrum, fitting the whole spectrum simultaneously as recommended is expensive from a computational point of view, and may take up to several minutes per spectrum.

Details

While setting binning = TRUE yields a procedure which is less memory consuming than fitting the whole spectrum simultaneously (binning = FALSE), it may be inferior from a quality aspect, since division into bins has to be done with care. Otherwise, peak patterns might be split up into different bins, which would result into erroneous fitting.
Postprocessing of the raw list usually yields a considerably improved result by counteracting the 'peak-splitting phenomenon': due to a limited sampling rate and discrepancies between theoretical and observed peak patterns, several templates at adjacent positions are used to fit the same peak pattern.

Examples

Run this code

### load data
data(toyspectrum)
data(toyspectrumsolution)
mz <- toyspectrum[,"x"]
intensities <- toyspectrum[,"yyy"]
### select mz range
filter <- mz >= 2800 & mz <= 3200
### Extract peak patterns with model = "Gaussian"
sigmafun <- function (mz)  -8.5e-07 * mz + 6.09e-10 * mz^2 + 0.00076
gausslist <- getPeaklist(mz = mz[filter], intensities = intensities[filter],
                   model = "Gaussian",
                   model.parameters = list(sigma = sigmafun,
                                           alpha = function(mz){},
                                           mu = function(mz){}),
                    control.localnoise = list(quantile = 0.5, factor.place = 3))

show(gausslist)
### threshold list at signal-to-noise ratio = 2
peaklist <- threshold(gausslist, threshold = 2)

### Extract peak patterns with model = "EMG" and loss = "L1"
alpha0 <- function(mz) 0.00001875 * 0.5 * 4/3 * mz
sigma0 <- function(mz) 0.00001875 * 0.5 * mz
mu0 <- function(mz) return(rep(-0.06162891, length(mz)))
EMGlist <- getPeaklist(mz = mz[filter], intensities = intensities[filter],
                   model = "EMG", loss = "L1",
                   model.parameters = list(sigma = sigma0,
                                           alpha = alpha0,
                                           mu = mu0),
                   control.localnoise = list(quantile = 0.5, factor.place = 3))
show(EMGlist)
peaklist2 <- threshold(EMGlist, threshold = 2)

### plot results of the 1st list and compare vs. 'truth' 

### 'ground truth'
solution <- toyspectrumsolution[toyspectrumsolution[,1] >= 2800 & toyspectrumsolution[,1] <= 3200,]

visualize(gausslist, mz[filter], intensities[filter], lower = 3150, upper = 3170,
          truth = TRUE,
          signal = TRUE,
          fitted = TRUE,
          postprocessed = TRUE,
          booktrue = as.matrix(toyspectrumsolution),
          cutoff.eps = 0.2)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

Warning

Warning

Details

See Also

Examples