plotSelect.GS: Plotting (several) Selected Gene Set(s) in some Subjects

Description

This function can plot different representations of the gene expression in selected gene sets, among a subset of selected subjects.

Usage

plotSelect.GS(expr, gmt, Subject_ID, TimePoint, geneset.names.select, Subject_ID.select, display = "one subject per page", baseline = NULL, group.var = NULL, Group_ID_paired = NULL, ref = NULL, group_of_interest = NULL, FUNcluster = NULL, clustering_metric = "euclidian", clustering_method = "ward", B = 500, max_trends = 4, aggreg.fun = "median", trend.fun = "median", methodOptiClust = "firstSEmax", verbose = TRUE, clustering = TRUE, time_unit = "", title = NULL, y.lab = NULL, desc = TRUE, lab.cex = 1, axis.cex = 1, main.cex = 1, y.lab.angle = 90, x.axis.angle = 45, y.lim = NULL, x.lim = NULL, gg.add = list(theme()))

Arguments

expr

either a matrix or dataframe of gene expression upon which dynamics are to be calculated, or a list of gene sets estimation of gene expression. In the case of a matrix or dataframe, its dimension are $n$ x $p$, with the $p$ sample in column and the $n$ genes in row. In the case of a list, its length should correspond to the number of gene sets under scrutiny and each element should be an 3 dimension array of estimated gene expression, such as for the list returned in the 'Estimations' element of TcGSA.LR. See Details.

gmt

a gmt object containing the gene sets definition. See GSA.read.gmt and definition on www.broadinstitute.org.

Subject_ID

a factor of length $p$ that is in the same order as the columns of expr (when it is a dataframe) and that contains the patient identifier of each sample.

TimePoint

a numeric vector or a factor of length $p$ that is in the same order as TimePoint and the columns of expr (when it is a dataframe), and that contains the time points at which gene expression was measured.

geneset.names.select

a character vector containing the names of the gene sets to be plotted, that must appear in the "geneset.names" element of gmt.

Subject_ID.select

a character vector containing the names of the subjects to be plotted, that must appear in the Subject_ID vector.

display

How to display the resulting graphs. One of the following : "one GS per page", "one subject per page", "median over selected patients". Default is "one subject per page".

baseline

a character string which is the value of TimePoint that can be used as a baseline. Default is NULL, in which case no timepoint is used as a baseline value for gene expression. Has to be NULL when comparing two treatment groups.

group.var

in the case of several treatment groups, this is a factor of length $p$ that is in the same order as Timepoint, Subject_ID and the columns of expr. It indicates to which treatment group each sample belongs to. Default is NULL, which means that there is only one treatment group.

Group_ID_paired

a character vector of length $p$ that is in the same order as Timepoint, Subject_ID, group.var and the columns of expr. This argument must not be NULL in the case of a paired analysis, and must be NULL otherwise. Default is NULL.

ref

the group which is used as reference in the case of several treatment groups. Default is NULL, which means that reference is the first group in alphabetical order of the labels of group.var. See Details.

group_of_interest

the group of interest, for which dynamics are to be computed in the case of several treatment groups. Default is NULL, which means that group of interest is the second group in alphabetical order of the labels of group.var.

FUNcluster

a function which accepts as first argument a matrix x and as second argument the number of clusters desired k, and which returns a list with a component named 'cluster' which is a vector of length n = nrow(x) of integers in 1:k, determining the clustering or grouping of the n observations. Default is NULL, in which case a hierachical clustering is performed via the function agnes, using the metric clustering_metric and the method clustering_method. See 'FUNcluster' in clusGap and Details.

clustering_metric

character string specifying the metric to be used for calculating dissimilarities between observations in the hierarchical clustering when FUNcluster is NULL. The currently available options are "euclidean" and "manhattan". Default is "euclidean". See agnes. Also, a "sts" option is available in TcGSA. It implements the 'Short Time Series' distance [Moller-Levet et al., Fuzzy CLustering of short time series and unevenly distributed sampling points, Advances in Intelligent Data Analysis V:330-340 Springer, 2003] designed specifically for clustering time series.

clustering_method

character string defining the agglomerative method to be used in the hierarchical clustering when FUNcluster is NULL. The six methods implemented are "average" ([unweighted pair-]group average method, UPGMA), "single" (single linkage), "complete" (complete linkage), "ward" (Ward's method), "weighted" (weighted average linkage). Default is "ward". See agnes.

integer specifying the number of Monte Carlo ("bootstrap") samples used to compute the gap statistics. Default is 500. See clusGap.

max_trends

integer specifying the maximum number of different clusters to be tested. Default is 4.

aggreg.fun

a character string such as "mean", "median" or the name of any other defined statistics function that returns a single numeric value. It specifies the function used to aggregate the observations before the clustering. Default is to median.

trend.fun

a character string such as "mean", "median" or the name of any other function that returns a single numeric value. It specifies the function used to calculate the trends of the identified clustered. Default is to median.

methodOptiClust

character string indicating how the "optimal" number of clusters is computed from the gap statistics and their standard deviations. Possible values are "globalmax", "firstmax", "Tibs2001SEmax", "firstSEmax" and "globalSEmax". Default is "firstSEmax". See 'method' in clusGap, Details and Tibshirani et al., 2001 in References.

verbose

logical flag enabling verbose messages to track the computing status of the function. Default is TRUE.

clustering

logical flag. If FALSE, there is no clustering representation; if TRUE, the lines are colored according to which cluster they belong to. Default is TRUE. See Details.

time_unit

the time unit to be displayed (such as "Y", "M", "W", "D", "H", etc) next to the values of TimePoint on the x-axis. Default is "".

title

character specifying the title of the plot. If NULL, a title is automatically generated, if "", no title appears. Default is NULL.

y.lab

character specifying the annotation of the y axis. If NULL, an annotation is automatically generated, if "", no annotation appears. Default is NULL.

desc

a logical flag. If TRUE, a line is added to the title of the plot with the description of the gene set plotted (from the gmt file). Default is TRUE.

lab.cex

a numerical value giving the amount by which lab labels text should be magnified relative to the default 1.

axis.cex

a numerical value giving the amount by which axis annotation text should be magnified relative to the default 1.

main.cex

a numerical value giving the amount by which title text should be magnified relative to the default 1.

y.lab.angle

a numerical value (in [0, 360]) giving the orientation by which y-label text should be turned (anti-clockwise). Default is 90. See element_text.

x.axis.angle

a numerical value (in [0, 360]) giving the orientation by which x-axis annotation text should be turned (anti-clockwise). Default is 45.

y.lim

a numeric vector of length 2 giving the range of the y-axis. See plot.default.

x.lim

if numeric, will create a continuous scale, if factor or character, will create a discrete scale. Observations not in this range will be dropped. See xlim.

gg.add

A list of instructions to add to the ggplot2 instruction. See +.gg. Default is list(theme()), which adds nothing to the plot.

Value

A dataframe the 2 following variables:

ProbeID which contains the IDs of the probes of the plotted gene set.
Cluster which to which cluster the probe belongs to.

If clustering is FALSE, then Cluster is NA for all the probes.

Details

If expr is a matrix or a dataframe, then the "original" data are plotted. On the other hand, if expr is a list returned in the 'Estimations' element of TcGSA.LR, then it is those "estimations" made by the TcGSA.LR function that are plotted.

If indiv is 'genes', then each line of the plot is the median of a gene expression over the patients. On the other hand, if indiv is 'patients', then each line of the plot is the median of a patient genes expression in this gene set.

This function uses the Gap statistics to determine the optimal number of clusters in the plotted gene set. See clusGap.

References

Tibshirani, R., Walther, G. and Hastie, T., 2001, Estimating the number of data clusters via the Gap statistic, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63, 2: 411--423.

Examples

Run this code


## Not run:  
# data(data_simu_TcGSA)
# tcgsa_sim_1grp <- TcGSA.LR(expr=expr_1grp, gmt=gmt_sim, design=design, 
#                           subject_name="Patient_ID", time_name="TimePoint",
#                           time_func="linear", crossedRandom=FALSE)
# ## End(Not run)

## Not run:  
# plotSelect.GS(expr=tcgsa_sim_1grp$Estimations, TimePoint=design$TimePoint, 
#        Subject_ID=design$Patient_ID, gmt=gmt_sim,
#        geneset.names.select=c("Gene set 3", "Gene set 4", "Gene set 5"),
#        Subject_ID.select=c("P1", "P2"),
#        display="one GS per page", 
#        time_unit="H",
#        lab.cex=0.7
# )
# ## End(Not run)

## Not run:  
# plotSelect.GS(expr=tcgsa_sim_1grp$Estimations, TimePoint=design$TimePoint, 
#        Subject_ID=design$Patient_ID, gmt=gmt_sim,
#        geneset.names.select=c("Gene set 3", "Gene set 4", "Gene set 5"),
#        Subject_ID.select=c("P1", "P2"),
#        display="one subject per page", 
#        time_unit="H",
#        lab.cex=0.7
# )
# ## End(Not run)

Run the code above in your browser using DataLab