Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.
filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = deprecated(),
  verbose = FALSE
)
Value: A data frame.
data: A data frame.
seq_col: Specifies the column(s) of data containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of NA in any of the specified columns will be dropped.
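The coercion and NA handling described here can be approximated in base R. This is only a sketch of the documented behavior, not the package's internal code; the toy data frame and its column names are hypothetical.

```r
# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  CloneSeq = c("CASSLGTDTQYF", NA, "CASSPGGGGEQYF"),
  Count = c(10, 5, 3),
  stringsAsFactors = FALSE
)
seq_col <- "CloneSeq"  # may also hold two names or indices

# Coerce each specified column to character.
for (col in seq_col) {
  toy[[col]] <- as.character(toy[[col]])
}

# Keep only rows with no NA in any of the specified columns.
keep <- stats::complete.cases(toy[, seq_col, drop = FALSE])
kept <- toy[keep, , drop = FALSE]
```

Using `drop = FALSE` keeps the selection a data frame even when `seq_col` names a single column, so the same code covers both the length-1 and length-2 cases.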
min_seq_length: Observations whose receptor sequences have fewer than min_seq_length characters are dropped.
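As a rough illustration of the length rule, the following base-R sketch keeps sequences with at least `min_seq_length` characters, since only sequences with fewer characters are dropped. The sequences are made up for the example, and the use of `nchar()` is an assumption about how such a filter could be written rather than a statement about the package's implementation.

```r
# Hypothetical sequences with 12, 13 and 3 characters respectively.
seqs <- c("CASSLGTDTQYF", "CASSPGGGGEQYF", "CAS")
min_seq_length <- 13

# "Fewer than" means a sequence of exactly min_seq_length characters is kept.
keep <- nchar(seqs) >= min_seq_length
kept <- seqs[keep]
```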
drop_matches: Accepts a character string containing a regular expression (see regex). Checks values in the receptor sequence column for a pattern match using grep(). Rows in which a match is found are dropped.
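A pattern-based filter of this kind might look like the sketch below. It uses `grepl()`, the logical-vector counterpart of `grep()`, purely for convenience; the sequences and the pattern are hypothetical.

```r
# Hypothetical sequences; only the second contains "GGGG".
seqs <- c("CASSLGTDTQYF", "CASSPGGGGEQYF", "CASSQETQYF")
drop_matches <- "GGGG"  # any regular expression can be used here

# grepl() returns TRUE wherever the pattern matches; those entries are dropped.
matched <- grepl(drop_matches, seqs)
kept <- seqs[!matched]
```

Negating a logical vector avoids a common pitfall of index-based removal: when `grep()` finds no match it returns `integer(0)`, and `seqs[-integer(0)]` yields an empty vector rather than the full one.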
subset_cols: Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default NULL includes all columns. The receptor sequence column is always included regardless of this argument's value.
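One way to picture the rule that the sequence column is always retained is a `union()` of the sequence column with the requested columns. This is a sketch under assumptions: the toy data are hypothetical, and the column order in the function's actual output is not specified by this page.

```r
# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  CloneSeq = c("CASSLGTDTQYF", "CASSPGGGGEQYF"),
  CloneFrequency = c(0.6, 0.4),
  SampleID = c("S1", "S1"),
  Extra = c(1, 2),
  stringsAsFactors = FALSE
)
seq_col <- "CloneSeq"
subset_cols <- "SampleID"  # the sequence column is not listed here

# NULL means "keep everything"; otherwise add the sequence column if missing.
cols <- if (is.null(subset_cols)) names(toy) else union(seq_col, subset_cols)
out <- toy[, cols, drop = FALSE]
```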
verbose: Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
Brian Neal (Brian.Neal@ucsf.edu)
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
set.seed(42)
raw_data <- simulateToyData()

# Remove sequences shorter than 13 characters,
# as well as sequences containing the subsequence "GGGG".
# Keep variables for clone sequence, clone frequency and sample ID
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols = c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)