textreuse (version 0.1.2)

textreuse-package: Detect Text Reuse and Document Similarity

Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
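To make the terms above concrete, the following is a minimal base R sketch of shingled n-gram tokenization followed by a Jaccard similarity comparison. It does not use the package and is not its implementation: the helper names shingle_ngrams and jaccard are hypothetical, and the word-splitting rule is an assumption chosen for illustration.

```r
# Split a string into lowercase words, then build overlapping ("shingled") n-grams.
# shingle_ngrams is a hypothetical helper, not a textreuse function.
shingle_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings left by the split
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: size of the intersection divided by size of the union.
# jaccard is likewise a hypothetical helper written for this sketch.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

a <- shingle_ngrams("The quick brown fox jumps over the lazy dog")
b <- shingle_ngrams("The quick brown fox leaps over the lazy dog")
jaccard(a, b)
```

Changing a single word alters every n-gram that contains it, so the two sentences share only four of their ten distinct trigrams. Larger values of n make the comparison stricter about word order.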

Details

The best place to begin with this package is the introductory vignette.

vignette("textreuse-introduction", package = "textreuse")

After reading that vignette, see the "pairwise", "minhash", and "alignment" vignettes, which introduce specific paths for working with the package.

vignette("textreuse-pairwise", package = "textreuse")

vignette("textreuse-minhash", package = "textreuse")

vignette("textreuse-alignment", package = "textreuse")

Another good place to begin is the documentation for loading documents (TextReuseTextDocument and TextReuseCorpus), for the tokenizers, for the similarity functions, and for locality-sensitive hashing.
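As a rough sketch of how those pieces fit together, the example below loads the sample legal texts into a corpus, tokenizes them, and compares every pair of documents. It assumes the package is installed and that tokenize_ngrams, jaccard_similarity, pairwise_compare, and pairwise_candidates behave as their help pages describe. The choice of n = 5 is arbitrary.

```r
library(textreuse)

# Directory of sample legal texts shipped with the package (see References).
dir <- system.file("extdata/legal", package = "textreuse")

# Load every file in the directory, tokenizing each into shingled 5-grams.
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5)

# Score every pair of documents with Jaccard similarity; the result is a matrix.
comparisons <- pairwise_compare(corpus, jaccard_similarity)

# Reshape the matrix into a data frame of document pairs and their scores.
pairwise_candidates(comparisons)
```

Comparing every pair grows quadratically with the number of documents, which is why the package also offers minhash and locality-sensitive hashing for larger corpora.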

References

The sample data provided in the extdata/ats directory is taken from a corpus of American Tract Society publications (http://lincolnmullen.com/blog/corpus-of-american-tract-society-publications/) from the nineteenth century, gathered from the Internet Archive (https://archive.org/).

The sample data provided in the extdata/legal directory is taken from the following nineteenth-century codes of civil procedure from California and New York:

Final Report of the Commissioners on Practice and Pleadings, in 2 Documents of the Assembly of New York, 73rd Sess., No. 16 (1850): 243-250, sections 597-613. Google Books: http://books.google.com/books?id=9HEbAQAAIAAJ&pg=PA243#v=onepage&q&f=false

An Act To Regulate Proceedings in Civil Cases, 1851 California Laws 51, 51-53, sections 4-17; 101, sections 313-316. Google Books: http://books.google.com/books?id=4PHEAAAAIAAJ&pg=PA51#v=onepage&q&f=false