In most cases, genes will be a single vector or one-column matrix. However,
there are some cases where a row of seqAnnot corresponds to two (or more) genes
(e.g. the V and J gene segments of a single immune sequence). Rather than make multiple
rows for each gene, the calcVDJcounts function provides the option to provide
a multi-column matrix for genes. The counts for each column will be tallied
separately, and are then concatenated.
To ensure equal variance across all repertoires, the default RDI metric uses
subsampling to ensure that all repertoires have the same number of sequences. The
default RDI metric subsamples all repertoires to the size of the smallest repertoire,
which may result in a loss of power for comparisons between larger repertoires.
In order to increase power for various tests, it is often useful to only calculate
the repertoire counts for a subset of the repertoires in seqAnnot. This can be done by
using the select and combine parameters to specify which
repertoires to include in the analysis.
Both parameters are lists containing entries
with the same name as one of the columns of seqAnnot. For select, each entry is
a vector defining which values to include (e.g., to include only Visit 1 and 3, you
might specify select=list(visit=c("V1","V3")), where the 'visit' column
in seqAnnot contains the values "V1","V2", and "V3"). In this
case, any rows of genes and seqAnnot that come from a repertoire not
specified in select will be discarded. By default, if a select code is
not specified for a column in seqAnnot, all values from that column will be
included.
The combine parameter works in a similar fashion, but instead of a vector
describing which parameters to include, you can specify a vector of regular
expressions, and any values of the seqAnnot column that match the regular
expression will be combined into a single repertoire (e.g. to combine visits 1 and 3
into a single repertoire, you might specify combine=list(visit="V[13]")).
The vdjDrop parameter is also useful for limiting sequences. Like
select and combine, this is a named list, with entries corresponding to
the columns of genes. Each entry of vdjDrop is a vector of gene segment
names to remove from the analysis. All sequences containing those genes are removed
from the analysis before subsampling.
Once unwanted rows have been removed, the columns of seqAnnot are concatenated
to generate "repertoire" labels for each row. The repertoire labels are then used
to split the rows of genes, and gene prevalence is tallied within a repertoire.
By default, columns of seqAnnot that are constant after subsetting will not be
included in the label. However, this can be controlled by the simplifyNames
parameter. If simplifyNames is FALSE, all columns of seqAnnot are
included when generating labels.