Given Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, builds the network graph for the immune repertoire based on sequence similarity, computes specified network properties and generates customized visualizations.
buildNet() is identical to buildRepSeqNetwork(), existing as an alias for convenience.
buildRepSeqNetwork(
  ## Input ##
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,
  ## Network ##
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  ## Visualization ##
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,
  ## Output ##
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE
)
# Alias for buildRepSeqNetwork()
buildNet(
  data,
  seq_col,
  count_col = NULL,
  subset_cols = NULL,
  min_seq_length = 3,
  drop_matches = NULL,
  dist_type = "hamming",
  dist_cutoff = 1,
  drop_isolated_nodes = TRUE,
  node_stats = FALSE,
  stats_to_include = chooseNodeStats(),
  cluster_stats = FALSE,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id",
  plots = TRUE,
  print_plots = FALSE,
  plot_title = "auto",
  plot_subtitle = "auto",
  color_nodes_by = "auto",
  ...,
  output_dir = NULL,
  output_type = "rds",
  output_name = "MyRepSeqNetwork",
  pdf_width = 12,
  pdf_height = 10,
  verbose = FALSE
)
If the constructed network contains no nodes, the function will return NULL, invisibly, with a warning. Otherwise, the function invisibly returns a list containing the following items:
details: A list containing information about the network and the settings used during its construction.
igraph: An object of class igraph containing the list of nodes and edges for the network graph.
adjacency_matrix: The network graph adjacency matrix, stored as a sparse matrix of class dgCMatrix from the Matrix package. See dgCMatrix-class.
node_data: A data frame containing metadata for the network nodes, where each row corresponds to a node in the network graph. This data frame contains all variables from data (unless otherwise specified via subset_cols) in addition to the computed node-level network properties if node_stats = TRUE. Each row's name is the name of the corresponding row from data.
cluster_data: A data frame containing network properties for the clusters, where each row corresponds to a cluster in the network graph. Only included if cluster_stats = TRUE.
plots: A list containing one element for each plot generated as well as an additional element for the matrix that specifies the graph layout. Each plot is an object of class ggraph. Only included if plots = TRUE.
data: A data frame containing the AIRR-Seq data, with variables indexed by column and observations (e.g., clones or cells) indexed by row.
seq_col: Specifies the column(s) of data containing the receptor sequences to be used as the basis of similarity between rows. Accepts a character string containing the column name or a numeric scalar containing the column index. Also accepts a vector of length 2 specifying distinct sequence columns (e.g., alpha chain and beta chain), in which case similarity between rows depends on similarity in both sequence columns (see details).
count_col: Optional. Specifies the column of data containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name or column index. Passed to addClusterStats(); only relevant if cluster_stats = TRUE.
subset_cols: Specifies which columns of the AIRR-Seq data are included in the output. Accepts a vector of column names or a vector of column indices. The default NULL includes all columns. The receptor sequence column is always included regardless of this argument's value. Passed to filterInputData().
min_seq_length: A numeric scalar, or NULL. Observations whose receptor sequences have fewer than min_seq_length characters are removed prior to network analysis.
drop_matches: Optional. Passed to filterInputData(). Accepts a character string containing a regular expression (see regex). Checks receptor sequences for a pattern match using grep(). Those returning a match are removed prior to network analysis.
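The two filtering rules described above (a minimum sequence length and a regular-expression exclusion) can be sketched in base R. This is only an illustration of the described behavior, not the code used by filterInputData(); the toy data frame and the pattern are hypothetical, and the column names merely mirror those used in the examples on this page.

```r
# Hypothetical toy input; column names mirror the examples on this page
toy <- data.frame(
  CloneSeq = c("CASSLGQ", "CA", "CASS*GT", "CASSIRD", "AC_G"),
  CloneCount = c(10, 3, 7, 1, 2),
  stringsAsFactors = FALSE
)

min_seq_length <- 3
drop_matches <- "[*_]"  # assumed pattern: sequences containing * or _

# Keep rows whose sequence has at least min_seq_length characters
keep_length <- nchar(toy$CloneSeq) >= min_seq_length

# Flag rows whose sequence matches the regular expression
matched <- grepl(drop_matches, toy$CloneSeq)

# Rows that survive both filters
filtered <- toy[keep_length & !matched, , drop = FALSE]
print(filtered)
```

Row names are preserved by the subsetting step, which is consistent with the statement that each row of the node metadata keeps the name of its row in data.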
dist_type: Specifies the function used to quantify the similarity between sequences. The similarity between two sequences determines the pairwise distance between their respective nodes in the network graph, with greater similarity corresponding to shorter distance. Valid options are "hamming" (the default), which uses hamDistBounded(), and "levenshtein", which uses levDistBounded().
dist_cutoff: A nonnegative scalar. Specifies the maximum pairwise distance (based on dist_type) for an edge connection to exist between two nodes. Pairs of nodes whose distance is less than or equal to this value will be joined by an edge connection in the network graph. Controls the stringency of the network construction and affects the number and density of edges in the network. A lower cutoff value requires greater similarity between sequences in order for their respective nodes to be joined by an edge connection. A value of 0 requires two sequences to be identical in order for their nodes to be joined by an edge.
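As a rough illustration of how dist_cutoff turns pairwise distances into edges, the sketch below builds a small adjacency matrix in base R. The hamming() helper is a plain equal-length Hamming distance written for this example; it is a stand-in and should not be read as the behavior of hamDistBounded(). The sequences are made up.

```r
# Illustrative helper: plain Hamming distance for equal-length strings
# (an assumption for this sketch, not the package's hamDistBounded())
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c("CASSLG", "CASSLA", "CASTLA", "CAWWWW")  # hypothetical sequences
dist_cutoff <- 1

# Pairwise distance matrix
n <- length(seqs)
dist_mat <- matrix(0L, n, n, dimnames = list(seqs, seqs))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- hamming(seqs[i], seqs[j])
  }
}

# Two distinct nodes share an edge when their distance is <= dist_cutoff
adjacency <- (dist_mat <= dist_cutoff) * 1L
diag(adjacency) <- 0L  # no self-loops
print(adjacency)
```

With a cutoff of 1, only the pairs differing at a single position are connected; raising the cutoff to 2 would also connect "CASSLG" and "CASTLA".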
drop_isolated_nodes: A logical scalar. When TRUE, removes each node that is not joined by an edge connection to any other node in the network graph.
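Removing isolated nodes amounts to dropping every node of degree zero. A minimal base R sketch on a hypothetical dense adjacency matrix follows; the function itself works with a sparse dgCMatrix, so this only conveys the idea.

```r
# Hypothetical adjacency matrix for four nodes; node "d" has no edges
adjacency <- matrix(
  c(0, 1, 0, 0,
    1, 0, 1, 0,
    0, 1, 0, 0,
    0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(letters[1:4], letters[1:4])
)

# A node is isolated when its row of the adjacency matrix sums to zero
degree <- rowSums(adjacency)
connected <- degree > 0

# Keep only the rows and columns of connected nodes
adjacency_kept <- adjacency[connected, connected, drop = FALSE]
print(rownames(adjacency_kept))
```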
node_stats: A logical scalar. Specifies whether node-level network properties are computed.
stats_to_include: A named logical vector returned by chooseNodeStats() or exclusiveNodeStats(). Specifies the node-level network properties to compute. Also accepts the value "all". Only relevant if node_stats = TRUE.
cluster_stats: A logical scalar. Specifies whether to compute cluster-level network properties.
cluster_fun: Passed to addClusterMembership(). Specifies the clustering algorithm used when cluster analysis is performed. Cluster analysis is performed when cluster_stats = TRUE or when node_stats = TRUE with the cluster_id property enabled via the stats_to_include argument.
cluster_id_name: Passed to addClusterMembership(). Specifies the name of the cluster membership variable added to the node metadata when cluster analysis is performed (see cluster_fun).
plots: A logical scalar. Specifies whether to generate plots of the network graph.
print_plots: A logical scalar. If plots = TRUE, specifies whether the plots should be printed to the R plotting window.
plot_title: A character string or NULL. If plots = TRUE, this is the title used for each plot. The default value "auto" generates the title based on the value of the output_name argument.
plot_subtitle: A character string or NULL. If plots = TRUE, this is the subtitle used for each plot. The default value "auto" generates a subtitle based on the values of the dist_type and dist_cutoff arguments.
color_nodes_by: Optional. Specifies a variable to be used as metadata for coloring the nodes in the network graph plot. Accepts a character string. This can be a column name of data or (if node_stats = TRUE) the name of a computed node-level network property (based on stats_to_include). Also accepts a character vector specifying multiple variables, in which case one plot will be generated for each variable. The default value "auto" attempts to use one of several potential variables to color the nodes, depending on what is available. A value of NULL leaves the nodes uncolored.
...: Other named arguments to addPlots().
output_dir: A file path specifying the directory for saving the output. The directory will be created if it does not exist. If NULL, output will be returned but not saved.
output_type: A character string specifying the file format to use when saving the output. The value "individual" saves each element of the returned list as an individual uncompressed file, with data frames saved in csv format. For better compression, the values "rda" and "rds" (the default) save the returned list as a single file using the rda and rds format, respectively (in the former case, the list will be named net within the rda file). Regardless of the argument value, any plots generated will be saved to a pdf file containing one plot per page.
output_name: A character string. All files saved will have file names beginning with this value.
pdf_width: Sets the width of each plot when writing to pdf. Passed to saveNetwork().
pdf_height: Sets the height of each plot when writing to pdf. Passed to saveNetwork().
verbose: Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
Brian Neal (Brian.Neal@ucsf.edu)
To construct the immune repertoire network, each TCR/BCR clone (bulk data) or cell (single-cell data) is modeled as a node in the network graph, corresponding to a single row of the AIRR-Seq data. For each node, the corresponding receptor sequence is considered. Both nucleotide and amino acid sequences are supported for this purpose. The receptor sequence is used as the basis of similarity and distance between nodes in the network.
Similarity between sequences is measured using either the Hamming distance or Levenshtein (edit) distance. The similarity determines the pairwise distance between nodes in the network graph. The more similar two sequences are, the shorter the distance between their respective nodes. Two nodes in the graph are joined by an edge if the distance between them is sufficiently small, i.e., if their receptor sequences are sufficiently similar.
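The difference between the two metrics can be seen on a pair of sequences where one is a shifted copy of the other. The sketch below uses only base R: a position-wise comparison for the Hamming distance and utils::adist() for the Levenshtein distance. It is meant to illustrate the metrics themselves, not the package's hamDistBounded() or levDistBounded(); the sequences are made up.

```r
a <- "CASSLGQ"
b <- "ASSLGQC"  # hypothetical: a with its first character moved to the end

# Hamming distance: count of position-by-position mismatches
# (both strings have the same length here)
ham <- sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

# Levenshtein (edit) distance: minimum number of insertions, deletions,
# and substitutions, computed with base R's utils::adist()
lev <- as.integer(utils::adist(a, b))

cat("Hamming:", ham, " Levenshtein:", lev, "\n")
```

Because the Levenshtein distance allows insertions and deletions, it treats the shifted sequences as close, whereas the Hamming distance penalizes every misaligned position. With dist_cutoff = 2, this pair would be joined by an edge under the Levenshtein distance but not under the Hamming distance.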
For single-cell data, edge connections between nodes can be based on similarity in both the alpha chain and beta chain sequences. This is done by providing a vector of length 2 to seq_col specifying the two sequence columns in data. The distance between two nodes is then the greater of the two distances between sequences in corresponding chains. Two nodes will be joined by an edge if their alpha chain sequences are sufficiently similar and their beta chain sequences are sufficiently similar.
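The dual-chain rule (node distance equals the greater of the two per-chain distances) can be sketched as follows. The column names alpha and beta, the toy sequences, and the hamming() helper are all assumptions made for this illustration; they are not taken from the package.

```r
# Illustrative helper: plain Hamming distance for equal-length strings
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

# Hypothetical single-cell data with one column per chain
cells <- data.frame(
  alpha = c("CAVRD", "CAVRE", "CAVRD"),
  beta  = c("CASSL", "CASSL", "CAWWL"),
  stringsAsFactors = FALSE
)
dist_cutoff <- 1

# Distance between two cells: the greater of the per-chain distances
pair_distance <- function(i, j) {
  max(hamming(cells$alpha[i], cells$alpha[j]),
      hamming(cells$beta[i], cells$beta[j]))
}

d12 <- pair_distance(1, 2)  # alpha differs at 1 position, beta identical
d13 <- pair_distance(1, 3)  # alpha identical, beta differs at 2 positions

edge12 <- d12 <= dist_cutoff
edge13 <- d13 <= dist_cutoff
cat("cells 1-2 joined:", edge12, " cells 1-3 joined:", edge13, "\n")
```

Taking the maximum means an edge requires both chains to pass the cutoff: cells 1 and 3 share an identical alpha chain, yet they are not joined because their beta chains are too far apart.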
See the buildRepSeqNetwork package vignette for more details. The vignette can be accessed offline using vignette("buildRepSeqNetwork").
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
set.seed(42)
toy_data <- simulateToyData()
# Simple call
network <- buildNet(
  toy_data,
  seq_col = "CloneSeq",
  print_plots = TRUE
)
# Customized:
network <- buildNet(
  toy_data, "CloneSeq",
  dist_type = "levenshtein",
  node_stats = TRUE,
  cluster_stats = TRUE,
  cluster_fun = "louvain",
  cluster_id_name = "cluster_membership",
  count_col = "CloneCount",
  color_nodes_by = c("SampleID", "cluster_membership", "coreness"),
  color_scheme = c("default", "Viridis", "plasma-1"),
  size_nodes_by = "degree",
  node_size_limits = c(0.1, 1.5),
  plot_title = NULL,
  plot_subtitle = NULL,
  print_plots = TRUE,
  verbose = TRUE
)
typeof(network)
names(network)
network$details
head(network$node_data)
head(network$cluster_data)