striprtf (version 0.2.2)

striprtf: Extract Text from RTF (Rich Text Format) File

Description

Parses an RTF file and extracts its plain text as a character vector.

Usage

striprtf(file, verbose = FALSE, ...)
rtf2text(text, verbose = FALSE)

Arguments

file
Path to an RTF file. Must be a character vector of length 1.
verbose
Logical. If TRUE, a progress report is printed to the console. This can be informative when parsing a large file, but the option itself slows the process down.
...
Additional arguments passed to readLines.
text
Character vector of length 1. Expected to be the contents of an RTF file.

Value

Character vector of extracted text

Details

Rich Text Format (RTF) files are text files consisting of ASCII characters. The specification was developed by Microsoft. This function interprets the character strings and extracts the plain text of the file. The major part of the algorithm this function employs was discussed in a Stack Overflow thread (http://stackoverflow.com/a/188877) and later refactored and implemented for Python 3 by Gilson Filho (https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676). The function is a translation of that code to R, with C++ code added for performance.
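To illustrate what "a text file consisting of ASCII characters" means in practice, the following base R sketch removes control words and braces from a tiny, hypothetical RTF string. It is a deliberately simplified stand-in and not the algorithm the package uses: it ignores hex escapes, Unicode escapes, and destination groups such as font tables.

```r
# Hypothetical RTF fragment: control words start with a backslash,
# and groups are delimited by braces.
rtf <- "{\\rtf1\\ansi Hello, \\b world\\b0 !}"

# Drop control words: a backslash, letters, an optional (possibly negative)
# numeric parameter, and at most one delimiting space.
txt <- gsub("\\\\[a-zA-Z]+-?[0-9]*[ ]?", "", rtf)

# Drop the group delimiters.
txt <- gsub("[{}]", "", txt)

print(txt)
```

A real document needs a proper tokenizer, which is what the referenced implementations provide.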

An improvement over the preceding implementations is that the function accommodates various ANSI code pages. For example, RTF files created by the Japanese version of Microsoft Word contain \ansicpg932, which indicates that code page 932 is used for character encoding. The function detects the code page declaration and converts the text to UTF-8 where possible. The conversion tables are retrieved from http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/.
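The idea of code page handling can be sketched in base R. The example below is only an illustration under stated assumptions: the RTF fragment is hypothetical, and base R iconv() is swapped in for the package's own conversion tables, so it relies on the platform's iconv supporting the name "CP932".

```r
# Hypothetical RTF fragment: the header declares code page 932 and the body
# holds two characters written as \'hh hex escapes (bytes 93 FA 96 7B).
rtf <- "{\\rtf1\\ansi\\ansicpg932 \\'93\\'fa\\'96\\'7b}"

# Detect the code page number that follows \ansicpg.
m <- regmatches(rtf, regexpr("\\\\ansicpg[0-9]+", rtf))
code_page <- sub("\\\\ansicpg", "", m)

# Collect the hex escapes and turn them into raw bytes.
hex <- regmatches(rtf, gregexpr("\\\\'[0-9a-fA-F]{2}", rtf))[[1]]
bytes <- as.raw(strtoi(substring(hex, 3), base = 16L))

# Convert the bytes from the declared code page to UTF-8.
# iconv() here is a stand-in for the package's mapping tables.
utf8 <- iconv(rawToChar(bytes), from = paste0("CP", code_page), to = "UTF-8")
```

Without this step, the bytes would be read in whatever encoding the session assumes, and text from non-Latin code pages would come out garbled.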

Examples

striprtf(system.file("extdata/king.rtf", package = "striprtf"))