quanteda v2.1.2
Quantitative Analysis of Textual Data
A fast, flexible, and comprehensive framework for quantitative text analysis in R. Provides functionality for corpus management; creating and manipulating tokens and ngrams; exploring keywords in context; forming and manipulating sparse matrices of documents by features and of feature co-occurrences; analyzing keywords; computing feature similarities and distances; applying content dictionaries; applying supervised and unsupervised machine learning; visually representing text and text analyses; and more.
Readme
About
An R package for managing and analyzing text, created by Kenneth Benoit. Supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS.
For more details, see https://quanteda.io.
How to Install
Install the released version from CRAN, using your R GUI or:
install.packages("quanteda")
Or for the latest development version:
# devtools package required to install quanteda from GitHub
devtools::install_github("quanteda/quanteda")
Because this compiles some C++ and Fortran source code, you will need to have the appropriate compilers installed.
If you are using a Windows platform, you will also need to install the Rtools software available from CRAN.
If you are using macOS, you should install the macOS tools, namely the Clang 6.x compiler and the GNU Fortran compiler (quanteda requires gfortran to build). If you are still getting errors related to gfortran, follow the fixes here.
How to Use
See the quick start guide to learn how to use quanteda.
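As a minimal sketch of a typical workflow (assuming quanteda is installed and using its bundled data_char_ukimmig2010 sample texts), you build a corpus, tokenize it, and form a document-feature matrix:

```r
library(quanteda)

# Build a corpus from the bundled UK immigration manifesto texts
corp <- corpus(data_char_ukimmig2010)

# Tokenize, dropping punctuation, then remove common English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Form a document-feature matrix and inspect the most frequent features
dfmat <- dfm(toks)
topfeatures(dfmat, 10)

# Locate keywords in context, here matched by glob pattern
head(kwic(toks, pattern = "immigr*"))
```

The quick start guide covers these steps, plus dictionaries, similarity measures, and plotting, in more depth.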
How to cite
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.
For a BibTeX entry, use the output from citation(package = "quanteda").
Leaving Feedback
If you like quanteda, please consider leaving feedback or a testimonial here.
Contributing
Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:
- Fork the source code, modify, and issue a pull request through the project GitHub page. See our Contributor Code of Conduct and the all-important quanteda Style Guide.
- Issues, bug reports, and wish lists: File a GitHub issue.
- Usage questions: Submit a question on the quanteda channel on StackOverflow.
- Contact the maintainer by email.
Functions in quanteda
Name | Description | |
as.fcm | Coercion and checking functions for fcm objects | |
as.dictionary | Coercion and checking functions for dictionary objects | |
View | View methods for quanteda | |
char_tolower | Convert the case of character objects | |
as.corpus | Coerce a compressed corpus to a standard corpus | |
as.matrix.dfm | Coerce a dfm to a matrix or data.frame | |
as.igraph | Convert an fcm to an igraph object | |
check_font | Check if font is available on the system | |
as.list.tokens | Coercion, checking, and combining functions for tokens objects | |
compute_mattr | Compute the Moving-Average Type-Token Ratio (MATTR) | |
as.yaml | Convert quanteda dictionary objects to the YAML format | |
corpus_sample | Randomly sample documents from a corpus | |
compute_lexdiv_stats | Compute lexical diversity from a dfm or tokens | |
as.data.frame.dfm | Convert a dfm to a data.frame | |
dfm_replace | Replace features in dfm | |
data_char_sampletext | A paragraph of text for testing various text-based functions | |
convert-wrappers | Convenience wrappers for dfm convert | |
dfm_match | Match the feature set of a dfm to given feature names | |
data_char_ukimmig2010 | Immigration-related sections of 2010 UK party manifestos | |
corpus_segment | Segment texts on a pattern match | |
as.dfm | Coercion and checking functions for dfm objects | |
compute_msttr | Compute the Mean Segmental Type-Token Ratio (MSTTR) | |
data_dictionary_LSD2015 | Lexicoder Sentiment Dictionary (2015) | |
dfm-class | Virtual class "dfm" for a document-feature matrix | |
as.matrix,textstat_simil_sparse-method | as.matrix method for textstat_simil_sparse | |
fcm | Create a feature co-occurrence matrix | |
dfm2lsa | Convert a dfm to an lsa "textmatrix" | |
dfm_compress | Recombine a dfm or fcm by combining identical dimension elements | |
dfm_sort | Sort a dfm by frequency of one or more margins | |
char_select | Select or remove elements from a character vector | |
cbind.dfm | Combine dfm objects by Rows or Columns | |
fcm-class | Virtual class "fcm" for a feature co-occurrence matrix | |
flatten_dictionary | Flatten a hierarchical dictionary into a list of character vectors | |
dfm_split_hyphenated_features | Split a dfm's hyphenated features into constituent parts | |
dictionary2-class | dictionary class objects and functions | |
as.network | redefinition of network::as.network() | |
convert | Convert quanteda objects to non-quanteda formats | |
dfm_group | Combine documents in a dfm by a grouping variable | |
attributes<- | Function extending base::attributes() | |
corpus_subset | Extract a subset of a corpus | |
dfm_lookup | Apply a dictionary to a dfm | |
data_corpus_inaugural | US presidential inaugural address texts | |
data_dfm_lbgexample | dfm from data in Table 1 of Laver, Benoit, and Garry (2003) | |
corpus-class | Base method extensions for corpus objects | |
corpus | Construct a corpus object | |
is_glob | Check if patterns contain glob wildcards | |
format_sparsity | Format a sparsity value for printing | |
bootstrap_dfm | Bootstrap a dfm | |
corpus_trim | Remove sentences based on their token lengths or a pattern match | |
dictionary_edit | Conveniently edit dictionaries | |
get_object_version | Get the package version that created an object | |
dictionary | Create a dictionary | |
meta_system | Internal function to get, set or initialize system metadata | |
get_docvars | Internal function to extract docvars | |
meta | Get or set object metadata | |
is_indexed | Check if a glob pattern is indexed by index_types | |
kwic | Locate keywords-in-context | |
dfm_tolower | Convert the case of the features of a dfm and combine | |
corpus_trimsentences | Remove sentences based on their token lengths or a pattern match | |
docfreq | Compute the (weighted) document frequency of a feature | |
list2dictionary | Internal function to convert a list to a dictionary | |
phrase | Declare a compound character to be a sequence of separate pattern matches | |
create | Function to assign multiple slots to a S4 object | |
dfm_trim | Trim a dfm using frequency threshold-based feature selection | |
set_fcm_slots<- | Set values to an fcm's S4 slots | |
set_dfm_slots<- | Set values to a dfm's S4 slots | |
sample_bygroup | Sample a vector by a group | |
%>% | Pipe operator | |
reshape_docvars | Internal function to subset or duplicate docvar rows | |
corpus_reshape | Recast the document units of a corpus | |
metadoc | Get or set document-level meta-data | |
docnames | Get or set document names | |
dfm-internal | Internal functions for dfm objects | |
data-relocated | Formerly included data objects | |
dfm_sample | Randomly sample documents or features from a dfm | |
textplot_network | Plot a network of feature co-occurrences | |
textplot_keyness | Plot word keyness | |
dfm_select | Select features from a dfm or fcm | |
featnames | Get the feature labels from a dfm | |
docvars | Get or set document-level variables | |
dfm_subset | Extract a subset of a dfm | |
data-internal | Internal data sets | |
fcm_sort | Sort an fcm in alphabetical order of the features | |
field_system | Shortcut functions to access or assign metadata | |
dfm | Create a document-feature matrix | |
featfreq | Compute the frequencies of features | |
dfm_tfidf | Weight a dfm by tf-idf | |
groups | Grouping variable(s) for various functions | |
dfm_weight | Weight the feature frequencies in a dfm | |
nsyllable | Count syllables in a text | |
ntoken | Count the number of tokens or types | |
remove_empty_keys | Utility function to remove empty keys | |
textstat_lexdiv | Calculate lexical diversity | |
replace_dictionary_values | Internal function to replace dictionary values | |
textstat_keyness | Calculate keyness statistics | |
tokens_group | Recombine documents tokens by groups | |
names-quanteda | Special handling for names of quanteda objects | |
head.corpus | Return the first or last part of a corpus | |
tokens_lookup | Apply a dictionary to a tokens object | |
diag2na | Convert same-value pairs to NA in a textstat_proxy object | |
split_values | Internal function for special handling of multi-word dictionary values | |
is_regex | Internal function for select_types() to check if a string is a regular expression | |
summary.corpus | Summarize a corpus | |
escape_regex | Internal function for select_types() to escape regular expressions | |
texts | Get or assign corpus texts | |
expand | Simpler and faster version of expand.grid() in base package | |
unlist_integer | Unlist a list of integer vectors safely | |
textstat_collocations | Identify and score multi-word expressions | |
textstat_proxy-class | textstat_simil/dist classes | |
keyness | Compute keyness (internal functions) | |
textstat_proxy | [Experimental] Compute document/feature proximity | |
friendly_class_undefined_message | Print friendly object class not defined message | |
matrix2dfm | Converts a Matrix to a dfm | |
matrix2fcm | Converts a Matrix to a fcm | |
pattern2id | Convert regex and glob patterns to type IDs or fixed patterns | |
unused_dots | Raise warning of unused dots | |
generate_groups | Generate a grouping vector from docvars | |
head.dfm | Return the first or last part of a dfm | |
pattern2list | Convert various pattern inputs to a vector used in tokens_select, tokens_compound, and kwic | |
head.textstat_proxy | Return the first or last part of a textstat_proxy object | |
tokens_compound | Convert token sequences into compound tokens | |
print-quanteda | Print methods for quanteda core objects | |
tokens_chunk | Segment tokens object by chunks of a given size | |
make_meta | Internal functions to create a list for the meta attribute | |
lowercase_dictionary_values | Internal function to lowercase dictionary values | |
merge_dictionary_values | Internal function to merge values of duplicated keys | |
ndoc | Count the number of documents or features | |
nscrabble | Count the Scrabble letter values of text | |
nsentence | Count the number of sentences | |
types | Get word types from a tokens object | |
nest_dictionary | Utility function to generate a nested list | |
print.phrases | Print a phrase object | |
serialize_tokens | Function to serialize list-of-character tokens | |
set_dfm_dimnames<- | Internal functions to set dimnames | |
message_error | Return an error message | |
quanteda-package | An R package for the quantitative analysis of textual data | |
reexports | Objects exported from other packages | |
pattern | Pattern for feature, token and keyword matching | |
spacyr-methods | Extensions for and from spacy_parse objects | |
read_dict_functions | Internal functions to import dictionary files | |
object-builders | Object compilers | |
quanteda_options | Get or set package options for quanteda | |
wordcloud_comparison | Internal function for textplot_wordcloud | |
unlist_character | Unlist a list of character vectors safely | |
sparsity | Compute the sparsity of a document-feature matrix | |
textplot_wordcloud | Plot features as a wordcloud | |
textplot_xray | Plot the dispersion of key word(s) | |
tokenize_internal | quanteda tokenizers | |
tokens | Construct a tokens object | |
textstat_entropy | Compute entropies of documents or features | |
search_index | Internal function for select_types to search the index using fastmatch. | |
textstat_summary | Summarize documents | |
search_glob | Select types without performing slow regex search | |
textstat_simil | Similarity and distance computation between documents or features | |
textstat_frequency | Tabulate feature frequencies | |
tokens_segment | Segment tokens object by patterns | |
tokens_replace | Replace tokens in a tokens object | |
tokens_select | Select or remove tokens from a tokens object | |
tokens_wordstem | Stem the terms in an object | |
tokens_split | Split tokens by a separator pattern | |
tokens_subset | Extract a subset of a tokens | |
tokens_sample | Randomly sample documents from a tokens object | |
topfeatures | Identify the most frequent features in a dfm | |
summary_metadata | Functions to add or retrieve corpus summary metadata | |
textmodels | Models for scaling and classification of textual data | |
textstat_readability | Calculate readability | |
textstat_select | Select rows of textstat objects by glob, regex or fixed patterns | |
tokens_ngrams | Create ngrams and skipgrams from tokens | |
tokens_tolower | Convert the case of tokens | |
tokens_tortl | [Experimental] Change direction of words in tokens | |
valuetype | Pattern matching using valuetype | |
wordcloud | Internal function for textplot_wordcloud | |
tokens_recompile | Recompile a serialized tokens object | |
Vignettes of quanteda
Name | ||
quickstart.Rmd | ||
Details
License | GPL-3 |
LinkingTo | Rcpp, RcppParallel, RcppArmadillo (>= 0.7.600.1.0) |
URL | https://quanteda.io |
Encoding | UTF-8 |
BugReports | https://github.com/quanteda/quanteda/issues |
LazyData | TRUE |
VignetteBuilder | knitr |
Language | en-GB |
Collate | 'RcppExports.R' 'View.R' 'meta.R' 'quanteda-documentation.R' 'aaa.R' 'bootstrap_dfm.R' 'casechange-functions.R' 'char_select.R' 'convert.R' 'corpus-addsummary-metadata.R' 'corpus-methods-base.R' 'corpus-methods-quanteda.R' 'corpus.R' 'corpus_reshape.R' 'corpus_sample.R' 'corpus_segment.R' 'corpus_subset.R' 'corpus_trim.R' 'data-documentation.R' 'dfm-classes.R' 'dfm-methods.R' 'dfm-print.R' 'dfm-subsetting.R' 'dfm.R' 'dfm_compress.R' 'dfm_group.R' 'dfm_lookup.R' 'dfm_match.R' 'dfm_replace.R' 'dfm_sample.R' 'dfm_select.R' 'dfm_sort.R' 'dfm_subset.R' 'dfm_trim.R' 'dfm_weight.R' 'dictionaries.R' 'dictionary_edit.R' 'dimnames.R' 'directionchange-functions.R' 'fcm-classes.R' 'docnames.R' 'docvars.R' 'fcm-methods.R' 'fcm-print.R' 'fcm-subsetting.R' 'fcm.R' 'fcm_select.R' 'kwic.R' 'metadoc.R' 'nfunctions.R' 'nscrabble.R' 'nsyllable.R' 'object-builder.R' 'pattern2fixed.R' 'phrases.R' 'quanteda_options.R' 'readtext-methods.R' 'spacyr-methods.R' 'stopwords.R' 'summary.R' 'textmodel.R' 'textplot_keyness.R' 'textplot_network.R' 'textplot_wordcloud.R' 'textplot_xray.R' 'textstat-methods.R' 'textstat_collocations.R' 'textstat_entropy.R' 'textstat_frequency.R' 'textstat_keyness.R' 'textstat_lexdiv.R' 'textstat_readability.R' 'textstat_simil.R' 'textstat_summary.R' 'tokenizers.R' 'tokens-methods-base.R' 'tokens.R' 'tokens_chunk.R' 'tokens_compound.R' 'tokens_group.R' 'tokens_lookup.R' 'tokens_ngrams.R' 'tokens_replace.R' 'tokens_sample.R' 'tokens_segment.R' 'tokens_select.R' 'tokens_split.R' 'tokens_subset.R' 'utils.R' 'wordstem.R' 'zzz.R' |
RoxygenNote | 7.1.1 |
SystemRequirements | C++11 |
NeedsCompilation | yes |
Packaged | 2020-09-19 17:44:00 UTC; kbenoit |
Repository | CRAN |
Date/Publication | 2020-09-23 04:10:03 UTC |
imports | data.table (>= 1.9.6) , digest , extrafont , fastmatch , ggplot2 (>= 2.2.0) , ggrepel , jsonlite , magrittr , Matrix (>= 1.2) , network , proxyC (>= 0.1.4) , Rcpp (>= 0.12.12) , RcppParallel , sna , SnowballC , stopwords , stringi , xml2 , yaml |
suggests | dplyr , DT , e1071 , entropy , ExPosition , formatR , igraph , knitr , lda , lsa , proxy , purrr , quanteda.textmodels , RColorBrewer , rmarkdown , slam , spacyr , spelling , stm , svs , testthat , text2vec , tibble , tidytext , tm (>= 0.6) , tokenizers , topicmodels , wordcloud , xtable |
depends | methods , R (>= 3.1.0) |
linkingto | RcppArmadillo (>= 0.7.600.1.0) |
Contributors | Ian Fellows, Paul Nulty, Kohei Watanabe, Adam Obeng, Will Lowe, Stefan Müller, Akitaka Matsuo, Haiyan Wang, Jouni Kuha, Christian Müller, Lori Young, Stuart Soroka, European Research Council, Jiong Wei Lua |