Performs site-weighted gene set enrichment analysis, or standard GSEA when the likelihood/weight columns in input_df are 1 or 0, p=1, q=1 and thresh_type="val".
swGsea(
input_df,
thresh_type = "percentile",
thresh = 0.9,
thresh_action = "exclude",
min_set_size = 10,
max_set_size = 500,
max_score = "max",
min_score = "min",
psuedocount = 0.001,
perms = 1000,
p = 1,
q = 1,
nThreads = 1,
rng_seed = 1,
fork = FALSE
)
A list of Enrichment_Results, Items_in_Set and Running_Sums.
A data frame with row names of gene set and columns of "ES", "NES", "p_val", "fdr".
A list of one-column data frames. Describes genes and their ranks in each set.
Running sum scores along genes sorted by ranked scores, with gene sets as columns.
A data frame in which the first column is the name of the item of interest (gene, protein, phosphosite, etc.), the second is the correlation of that item with the phenotype (typically the log ratio of expression for phenotype vs. normal), and the remaining columns are the scores for the likelihood that the item belongs in each set (one column per set).
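As an illustration of the layout described above, the following sketch builds a small toy data frame in that shape. The column names (item, score, setA, setB) and the random values are hypothetical and chosen only for the example; only the column order follows the description.

```r
# Hypothetical toy input: names and values are made up for illustration.
set.seed(1)
n_items <- 20
input_df <- data.frame(
  item  = paste0("gene", seq_len(n_items)),  # column 1: item of interest
  score = rnorm(n_items),                    # column 2: phenotype score (e.g. log ratio)
  setA  = runif(n_items),                    # remaining columns: one likelihood
  setB  = runif(n_items),                    # score per set, here between 0 and 1
  stringsAsFactors = FALSE
)
str(input_df)
```

For a standard GSEA run, the set columns would hold only 0 or 1 instead of continuous scores.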
The type of thresh. Use 'percentile' to include all scores over the percentile given in thresh (i.e., 0.9 would be all items in the 90th percentile, or top 10 percent); 'list' to include a list of set lists where the set lists are in the same order as the corresponding set columns in input_df; 'val' to apply a single threshold value to all sets; or 'values' to use a vector of unique cutoffs for each set (needs to be in the same order as the sets are specified in the columns of input_df).
Depends on thresh_type. A list of lists of the items in each set (with the same names as the colnames of the scores); a numeric vector of threshold scores for each set (in the same order as the colnames of the scores in input_df); or a single percentile value between 0 and 1 (i.e., if thresh=0.9, the 90th percentile of the score, or the highest-scoring 10 percent of the items, is included in the set for each scoring regimen). thresh="all" is not supported at this time, as it doesn't result in a Kolmogorov-Smirnov statistic; this may be worked in as an alternate scoring method later on.
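The percentile option can be pictured with the short sketch below. It is not the package's internal code: the use of quantile() with its default interpolation and the strict "greater than" comparison are assumptions, and the scores are invented.

```r
# Hypothetical likelihood scores for one set column.
scores <- c(0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95, 0.99, 1.00)
names(scores) <- paste0("gene", seq_along(scores))

thresh <- 0.9
# Assumption: the cutoff is the thresh-quantile of this set's scores.
cutoff <- quantile(scores, probs = thresh, names = FALSE)

# Assumption: items strictly above the cutoff are kept in the set.
in_set <- names(scores)[scores > cutoff]
in_set
```

With ten items and thresh = 0.9, only the single highest-scoring item is retained, which matches the "top 10 percent" reading.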
Either "include", "exclude" (default), or "adjust"; this specifies how to treat each set if it doesn't contain a minimum number of items or contains all of the items. This option cannot be used with predefined lists of items in sets (if the number of items in a given set doesn't meet requirements, that set will be skipped).
The minimum/maximum number of items each set needs for the analysis to proceed.
An optional numeric vector of minimum/maximum boundaries to clip scores for each set.
Pseudocount (pc), passed as psuedocount, is used for rescaling set scores: (score - min_score + pc)/(max_score - min_score + pc); this is needed to prevent division by 0 if max_score==min_score (in this case, all scores for items in the set will be 1, which is equivalent to standard GSEA). It also allows users to adjust weights for scores that are close to the minimum for the scores in the set (unless min_score==max_score): as the pseudocount approaches 0, scaled minimum scores also approach 0; as the pseudocount approaches infinity, scaled minimum scores approach the scaled maximum scores (which equal 1). This value must be larger than 0.
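The rescaling formula can be tried out with a minimal sketch. The helper name rescale_scores and its argument defaults are hypothetical, and clipping scores to the min/max boundaries before rescaling is an assumption about the order of operations, not a statement about the package internals.

```r
# Hypothetical helper implementing (score - min_score + pc) / (max_score - min_score + pc).
rescale_scores <- function(score,
                           min_score = min(score),
                           max_score = max(score),
                           pc = 0.001) {
  stopifnot(pc > 0)  # the pseudocount must be larger than 0
  # Assumption: scores are clipped to [min_score, max_score] before rescaling.
  clipped <- pmin(pmax(score, min_score), max_score)
  (clipped - min_score + pc) / (max_score - min_score + pc)
}

scaled_small_pc <- rescale_scores(c(2, 4, 6), pc = 0.001)  # minimum is close to 0
scaled_large_pc <- rescale_scores(c(2, 4, 6), pc = 1e9)    # minimum is close to 1
scaled_equal    <- rescale_scores(c(3, 3, 3))              # min == max, so all are 1
```

The three calls mirror the limiting cases described above: a small pseudocount pushes the scaled minimum toward 0, a very large one pushes it toward 1, and identical scores all rescale to exactly 1.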
The number of permutations.
The exponential scaling factor of the phenotype score (second column in input_df).
The exponential scaling factor of the likelihood score (weights).
The number of threads to use in calculating permutations.
Random seed.
A boolean. Whether to pass "fork" to the type parameter of makeCluster on Unix-like machines.
Eric Jaehnig
The formula for weighting is as follows: $$\frac{s_{j}^{q}|r_{j}|^{p}}{\sum s^{q}|r|^{p}}$$ where r is the log ratio score, s is the likelihood score, and j is the index of the gene.
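The weighting formula can be computed directly, as in the sketch below. The function name gsea_weights and the example vectors are hypothetical; the sum in the denominator is assumed to run over the same items passed in.

```r
# Hypothetical helper: weight_j = s_j^q * |r_j|^p / sum(s^q * |r|^p).
gsea_weights <- function(r, s, p = 1, q = 1) {
  raw <- s^q * abs(r)^p
  raw / sum(raw)
}

r <- c(2.0, -1.0, 0.5, -0.25)  # made-up log ratio scores
s <- c(1, 1, 0.5, 0)           # made-up likelihood scores
w <- gsea_weights(r, s)
w
```

An item with a likelihood score of 0 receives zero weight, and when every s is 0 or 1 with p = 1 and q = 1 the weights depend only on |r|, which corresponds to the standard GSEA case mentioned in the description.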