LexisNexisTools (version 0.1.2)

lnt_read: Read in a LexisNexis TXT file

Description

Read a LexisNexis TXT file and convert it to a data frame.

Usage

lnt_read(x, encoding = "UTF-8", extract_paragraphs = TRUE,
  convert_date = TRUE, date_format = "%B %d, %Y",
  start_keyword = "\\d+ of \\d+ DOCUMENTS$| Dokument \\d+ von \\d+$",
  end_keyword = "^LANGUAGE: |^SPRACHE: ",
  length_keyword = "^LENGTH: |^LÄNGE: ", verbose = TRUE)

Arguments

x

Name or names of LexisNexis TXT file to be converted.

encoding

Encoding to be assumed for input files. Defaults to UTF-8 (the LexisNexis standard value).

extract_paragraphs

A logical flag indicating if the returned object will include a third data frame with paragraphs.

convert_date

A logical flag indicating whether the function should try to convert the date of each article into Date format. Conversion fails for non-standard dates provided by LexisNexis, so it might be safer to convert dates afterwards.

date_format

The pattern used to convert dates. If convert_date is set to TRUE, all dates are converted using this same pattern. See strptime.

start_keyword

Is used to indicate the beginning of an article. All files need to have the same number of beginnings, ends and lengths (the length line indicates the last line of the meta-data).

end_keyword

Is used to indicate the end of an article.

length_keyword

Is used to indicate the end of the meta-data.

verbose

A logical flag indicating whether information should be printed to the screen.
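
As noted for convert_date, it can be safer to convert dates after reading. The following is a minimal base R sketch of that idea, not part of the package: the date strings are made up for illustration, and the C locale is set so that %B matches English month names.

```r
# Hypothetical raw date strings, standing in for an unconverted date column
raw_dates <- c("January 5, 2018", "March 12, 2018", "5 janvier 2018")

# %B is locale dependent; switch to the C locale so English month names match
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# Same pattern as the default date_format; non-matching entries become NA
dates <- as.Date(raw_dates, format = "%B %d, %Y")

# Restore the previous locale
Sys.setlocale("LC_TIME", old_locale)

# Inspect the entries that could not be converted with this pattern
failed <- raw_dates[is.na(dates)]
failed
```

Entries collected in failed can then be handled separately, for example with a second pattern.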

Value

An LNToutput S4 object containing data.frames for meta-data, articles and, if extract_paragraphs is TRUE, paragraphs.

Details

The function produces an LNToutput S4 object with two data.frames: meta, containing all meta information such as date, author and headline, and articles, containing just the article ID and the text of the articles. When extract_paragraphs is set to TRUE, the output contains a third data.frame, paragraphs, which is similar to articles but with the articles split into paragraphs.

Note: All files need to have the same number of beginnings, ends and lengths (the length line indicates the last line of the meta-data). Whether this is the case can be tested with lnt_checkFiles. In some cases it makes sense to change the keywords for these three indicators, e.g. to "^LANGUAGE: ENGLISH", to narrow down the search for the ends of articles.
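
The idea of matching counts can be illustrated with base R alone. This is only a sketch on made-up lines, not how lnt_read or lnt_checkFiles work internally; the start and end patterns are the defaults from Usage, while the length pattern is reduced to its English alternative.

```r
# Default start and end patterns; length pattern reduced to its English part
start_keyword  <- "\\d+ of \\d+ DOCUMENTS$| Dokument \\d+ von \\d+$"
end_keyword    <- "^LANGUAGE: |^SPRACHE: "
length_keyword <- "^LENGTH: "

# Hypothetical lines mimicking two articles in a LexisNexis TXT file
lines <- c(
  "1 of 2 DOCUMENTS",
  "The Guardian",
  "LENGTH: 355 words",
  "Some article text.",
  "LANGUAGE: ENGLISH",
  "2 of 2 DOCUMENTS",
  "Le Monde",
  "LENGTH: 512 words",
  "Texte de l'article.",
  "LANGUAGE: FRENCH"
)

# Count how many lines match each indicator; all three counts should agree
counts <- c(
  starts  = sum(grepl(start_keyword, lines)),
  ends    = sum(grepl(end_keyword, lines)),
  lengths = sum(grepl(length_keyword, lines))
)
counts

# A narrower end pattern only matches articles marked as English
english_ends <- sum(grepl("^LANGUAGE: ENGLISH", lines))
english_ends
```

With the narrower pattern, the number of ends no longer equals the number of starts in this mixed-language sample, which is the kind of mismatch the note warns about.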

Examples

# NOT RUN {
LNToutput <- lnt_read(lnt_sample())
meta.df <- LNToutput@meta
articles.df <- LNToutput@articles
paragraphs.df <- LNToutput@paragraphs
# }
