Note: a newer version (0.1.5) of this package is available.

textreuse (version 0.1.2)

Detect Text Reuse and Document Similarity

Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
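
To make the description concrete, here is a minimal base R sketch of two of the ideas it names: shingled word n-grams and a set-based (Jaccard) similarity score. The helpers `shingle_ngrams` and `jaccard` are hypothetical names written for this illustration; they are not the package's functions, whose own tokenizers and similarity measures are documented under `tokenizers` and `similarity-functions`.

```r
# Hypothetical helper: split a string into overlapping word n-grams (shingles).
# Text is lowercased and split on runs of non-alphanumeric characters.
shingle_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: Jaccard similarity, |A intersect B| / |A union B|,
# computed on the sets of distinct tokens.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  if (length(a) == 0 && length(b) == 0) return(NA_real_)
  length(intersect(a, b)) / length(union(a, b))
}

a <- shingle_ngrams("the quick brown fox jumps over the lazy dog")
b <- shingle_ngrams("the quick brown fox leaps over the lazy dog")

# The two sentences differ by one word, which changes three of seven trigrams.
sim <- jaccard(a, b)
print(sim)
```

Each sentence yields seven trigrams, four of which are shared, so the score is 4 / 10 = 0.4. Longer shingles make the comparison stricter, since a single changed word disrupts more n-grams.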

Install

install.packages('textreuse')

Monthly Downloads

833

Version

0.1.2

License

MIT + file LICENSE

Maintainer

Lincoln Mullen

Last Published

November 6th, 2015

Functions in textreuse (0.1.2)

hash_string
Hash a string to an integer

TextReuseCorpus
TextReuseCorpus

TextReuseTextDocument-accessors
Accessors for TextReuse objects

as.matrix.textreuse_candidates
Convert candidates data frames to other formats

lsh
Locality sensitive hashing for minhash

TextReuseTextDocument
TextReuseTextDocument

lsh_query
Query a LSH cache for matches to a single document

align_local
Local alignment of natural language texts

lsh_probability
Probability that a candidate pair will be detected with LSH

pairwise_candidates
Candidate pairs from pairwise comparisons

reexports
Objects exported from other packages

tokenize
Recompute the tokens for a document or corpus

wordcount
Count words

pairwise_compare
Pairwise comparisons among documents in a corpus

lsh_compare
Compare candidates identified by LSH

lsh_candidates
Candidate pairs from LSH comparisons

tokenizers
Split texts into tokens

rehash
Recompute the hashes for a document or corpus

filenames
Filenames from paths

minhash_generator
Generate a minhash function

lsh_subset
List of all candidates in a corpus

textreuse-package
Detect Text Reuse and Document Similarity

similarity-functions
Measure similarity/dissimilarity in documents
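
The `lsh_probability` entry in the list above concerns how likely LSH is to flag a pair of documents as a candidate match. In the standard banding analysis of minhash LSH, h minhashes are split into b bands of r = h / b rows each, and a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The following is a base R sketch under that assumption. The helper name `lsh_detection_probability` is hypothetical, and the code is not presented as the package's implementation.

```r
# Hypothetical helper (not a textreuse function): probability that a pair with
# Jaccard similarity `s` agrees on every row of at least one band, given `h`
# minhashes split evenly into `b` bands.
lsh_detection_probability <- function(h, b, s) {
  stopifnot(h %% b == 0, s >= 0, s <= 1)
  r <- h / b          # rows (minhashes) per band
  1 - (1 - s^r)^b     # complement of "no band matches"
}

# Detection probability across a range of similarities,
# for 240 minhashes in 80 bands (3 rows per band).
s <- c(0.1, 0.3, 0.5, 0.7, 0.9)
p <- lsh_detection_probability(h = 240, b = 80, s = s)
print(round(p, 4))
```

With the number of minhashes fixed, more bands lower the similarity at which pairs start to be detected, at the cost of more false candidates. Fewer bands do the opposite.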