Note: a newer version (0.1.5) of this package is available.

textreuse (version 0.1.1)

Detect Text Reuse and Document Similarity

Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
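
To make the description concrete, here is a minimal base R sketch of two of the ideas it names: shingled word n-grams and Jaccard similarity. The helper names `shingle_ngrams` and `jaccard` are hypothetical and written only for this illustration; they are not the package's own implementation, which is documented under the `tokenizers` and `similarity-functions` entries.

```r
# Hypothetical helpers for illustration; not the textreuse implementation.

# Split a text into overlapping word n-grams ("shingles").
shingle_ngrams <- function(text, n = 3) {
  # Lowercase, then split on runs of non-alphanumeric characters.
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  # Each shingle is n consecutive words, overlapping the previous by n - 1.
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: size of the intersection over size of the union.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

doc_a <- "How many roads must a man walk down"
doc_b <- "How many roads must a person walk down"

shingles_a <- shingle_ngrams(doc_a)
shingles_b <- shingle_ngrams(doc_b)
score <- jaccard(shingles_a, shingles_b)
print(score)
```

Changing one word alters every shingle that contains it, so shingle-based similarity drops faster than a plain word-overlap score would. That sensitivity to word order is what makes shingles useful for spotting reused passages.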

Install

install.packages('textreuse')

Monthly Downloads

833

Version

0.1.1

License

MIT + file LICENSE

Maintainer

Lincoln Mullen

Last Published

November 4th, 2015

Functions in textreuse (0.1.1)

filenames

Filenames from paths

wordcount

Count words

lsh_query

Query a LSH cache for matches to a single document

pairwise_candidates

Candidate pairs from pairwise comparisons

rehash

Recompute the hashes for a document or corpus

as.matrix.textreuse_candidates

Convert candidates data frames to other formats

textreuse-package

Detect Text Reuse and Document Similarity

pairwise_compare

Pairwise comparisons among documents in a corpus

lsh_subset

List of all candidates in a corpus

lsh

Locality sensitive hashing for minhash

reexports

Objects exported from other packages

hash_string

Hash a string to an integer

TextReuseTextDocument-accessors

Accessors for TextReuse objects

lsh_probability

Probability that a candidate pair will be detected with LSH

lsh_candidates

Candidate pairs from LSH comparisons

TextReuseCorpus

TextReuseCorpus

align_local

Local alignment of natural language texts

similarity-functions

Measure similarity/dissimilarity in documents

lsh_compare

Compare candidates identified by LSH

TextReuseTextDocument

TextReuseTextDocument

tokenize

Recompute the tokens for a document or corpus

minhash_generator

Generate a minhash function

tokenizers

Split texts into tokens
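
The `lsh_probability` entry above refers to the chance that LSH flags a pair of documents as candidates. As a rough sketch, the standard banding argument gives that chance as 1 - (1 - s^r)^b, where s is the pair's Jaccard similarity, b is the number of bands, and r = h / b is the number of minhashes per band. The function name `lsh_detect_prob` and its argument names are hypothetical, chosen for this illustration rather than taken from the package. The sketch assumes each minhash matches independently with probability s.

```r
# Hypothetical helper for illustration; not the package's own function.
# h = number of minhashes, b = number of bands,
# s = Jaccard similarity of the pair. Assumes b divides h evenly.
lsh_detect_prob <- function(h, b, s) {
  stopifnot(h %% b == 0, s >= 0, s <= 1)
  r <- h / b        # rows (minhashes) per band
  # A band matches when all r of its minhashes agree (probability s^r);
  # the pair is a candidate when at least one of the b bands matches.
  1 - (1 - s^r)^b
}

# Same signature size and banding, two different similarity levels.
p_low  <- lsh_detect_prob(h = 240, b = 80, s = 0.2)
p_high <- lsh_detect_prob(h = 240, b = 80, s = 0.5)
print(c(p_low, p_high))
```

With the number of minhashes held fixed, more bands means fewer rows per band, which lowers the similarity at which pairs start to be detected but admits more false candidates. Fewer bands has the opposite effect.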