The dfm class is a type of Matrix-class object with additional slots,
described below. quanteda uses two subclasses of the dfm class: dfmSparse
when the object can be represented by a sparse matrix, and dfmDense when
it is dense. See Details.
# S4 method for dfmDense
t(x)
# S4 method for dfmSparse
t(x)
# S4 method for dfmSparse
colSums(x, na.rm = FALSE, dims = 1L, ...)
# S4 method for dfmSparse
rowSums(x, na.rm = FALSE, dims = 1L, ...)
# S4 method for dfmSparse
colMeans(x, na.rm = FALSE, dims = 1L, ...)
# S4 method for dfmSparse
rowMeans(x, na.rm = FALSE, dims = 1L, ...)
# S4 method for dfmSparse,numeric
+(e1, e2)
# S4 method for numeric,dfmSparse
+(e1, e2)
# S4 method for dfmDense,numeric
+(e1, e2)
# S4 method for numeric,dfmDense
+(e1, e2)
# S4 method for dfm,index,index,missing
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,index,index,logical
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,missing,missing,missing
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,missing,missing,logical
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,index,missing,missing
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,index,missing,logical
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,missing,index,missing
[(x, i, j, ..., drop = FALSE)
# S4 method for dfm,missing,index,logical
[(x, i, j, ..., drop = FALSE)
x
the dfm object
na.rm
if TRUE, omit missing values (including NaN) from the calculations
dims
ignored
...
additional arguments not used here
e1
first quantity in "+" operation for dfm
e2
second quantity in "+" operation for dfm
i
index for documents
j
index for features
drop
always set to FALSE
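The methods listed above can be exercised on a small dfm. The following is a minimal sketch, assuming a quanteda version in which `tokens()` and `dfm()` are available; the two toy documents and the document names `d1` and `d2` are made up for illustration.

```r
library(quanteda)

# Two hypothetical toy documents, named d1 and d2 for illustration.
toks <- tokens(c(d1 = "a b b c", d2 = "b c c d"))
x <- dfm(toks)

colSums(x)    # total count of each feature across all documents
rowSums(x)    # total number of feature counts in each document
colMeans(x)   # mean count of each feature per document
rowMeans(x)   # mean count per feature within each document

x_plus <- x + 1       # "+" with a numeric: adds 1 to every cell
x_first <- x[1, ]     # i indexes documents
x_bc <- x[, c("b", "c")]  # j indexes features
x_t <- t(x)           # transpose: features in rows, documents in columns
```

Because `drop` is always set to FALSE, subsetting a single document or feature still returns a two-dimensional object rather than a plain vector.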
settings
settings that govern corpus handling and subsequent downstream operations, including the settings used to clean and tokenize the texts, and to create the dfm. See settings.
weighting
the feature weighting applied to the dfm. Default is "frequency", indicating that the values in the cells of the dfm are simple feature counts. To change this, use the weight method.
smooth
a smoothing parameter, defaults to zero. Can be changed using either the smooth or the weight methods.
Dimnames
These are inherited from Matrix-class but are named docs and features respectively.
The dfm class is a virtual class that will contain one of two subclasses
for holding the cell counts of document-feature matrices: dfmSparse or
dfmDense.
The dfmSparse class is a sparse matrix version of dfm-class, inheriting
dgCMatrix-class from the Matrix package. It is the default object type
created when feature counts are the object of interest, as typical
text-based feature counts tend to contain many zeroes. As long as
subsequent transformations of the dfm preserve cells with zero counts, the
dfm should remain sparse.
When the Matrix package implements sparse integer matrices, we will switch the default object class to this object type, as integers are 4 bytes each (compared to the current numeric double type, which requires 8 bytes per cell).
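The point about zero-preserving transformations can be illustrated directly with the Matrix package. This is a minimal sketch that works on a plain dgCMatrix rather than on a dfm object; the small count matrix is made up for illustration.

```r
library(Matrix)

# A made-up 3 x 4 count matrix with mostly zero cells, stored as a dgCMatrix.
counts <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 4), x = c(2, 1, 5),
                       dims = c(3, 4))
class(counts)

# Multiplying by a constant keeps zero cells at zero, so the result stays sparse.
scaled <- counts * 2
is(scaled, "sparseMatrix")

# Adding a non-zero constant fills every cell, so the result is stored densely.
shifted <- counts + 1
is(shifted, "denseMatrix")
```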
The dfmDense class is a dense matrix version of dfm-class, inheriting
dgeMatrix-class from the Matrix package. dfm objects whose cells no longer
contain zeroes after weighting or other transformations will be
automatically converted to the dfmDense class. Such an object will
necessarily be much larger than one of dfmSparse class, because every
cell, including those that would otherwise be zero, is recorded as a
numeric (double) type requiring 8 bytes of storage.
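To give a rough sense of the storage difference, the sketch below compares a sparse matrix with its dense equivalent using the Matrix package directly, not quanteda's own classes. The dimensions and the 5% density are arbitrary assumptions chosen for illustration.

```r
library(Matrix)

set.seed(1)
n_docs <- 200    # hypothetical number of documents
n_feats <- 500   # hypothetical number of features

# Random sparse matrix with about 5% non-zero cells (a dgCMatrix).
sparse <- rsparsematrix(n_docs, n_feats, density = 0.05)

# The same values stored densely: every cell is held as a double.
dense <- as(sparse, "denseMatrix")

length(sparse@x)       # only the non-zero cells are stored
length(dense@x)        # one stored value per cell: n_docs * n_feats
length(dense@x) * 8    # approximate bytes used by the dense values alone

object.size(sparse)
object.size(dense)
```

The sparse representation also stores row indices and column pointers, so its advantage shrinks as the share of non-zero cells grows.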
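# NOT RUN {
library(quanteda)

# dfm subsetting
x <- dfm(tokens(c("this contains lots of stopwords",
                  "no if, and, or but about it: lots",
                  "and a third document is it"),
                remove_punct = TRUE))
x[1:2, ]      # first two documents, all features
x[1:2, 1:5]   # first two documents, first five features

# fcm subsetting
y <- fcm(tokens(c("this contains lots of stopwords",
                  "no if, and, or but about it: lots"),
                remove_punct = TRUE))
y[1:3, ]      # first three features (rows), all columns
y[4:5, 1:5]   # rows 4 to 5, first five columns
# }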