readtext (version 0.71)

get_nexis_html: extract texts and meta data from Nexis HTML files

Description

This function extracts headings, body texts and meta data (date, byline, length, section, edition) from items in HTML files downloaded by the scraper.

Usage

get_nexis_html(path, paragraph_separator = "\n\n", verbosity, ...)

Arguments

path

either a path to an HTML file or a directory that contains HTML files

paragraph_separator

a character string used to separate paragraphs in body texts (default "\n\n")

verbosity

  • 0: output errors only

  • 1: output errors and warnings (default)

  • 2: output a brief summary message

  • 3: output detailed file-related messages

...

only to trap extra arguments
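
The effect of paragraph_separator can be illustrated with a minimal base R sketch. It does not call get_nexis_html; the paragraphs vector is made up, and the assumption is that the paragraphs of each body text are joined into a single string with the separator.

```r
# Hypothetical paragraphs standing in for one extracted article body
paragraphs <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Assumption: paragraphs are joined into one string with the separator
paragraph_separator <- "\n\n"
body <- paste(paragraphs, collapse = paragraph_separator)

# Splitting on the same separator recovers the individual paragraphs
recovered <- strsplit(body, paragraph_separator, fixed = TRUE)[[1]]
```

A separator that is unlikely to occur inside a paragraph, such as the default "\n\n", keeps this kind of split reliable.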

Examples

# NOT RUN {
irt <- readtext:::get_nexis_html('tests/data/nexis/irish-times_1995-06-12_0001.html')
afp <- readtext:::get_nexis_html('tests/data/nexis/afp_2013-03-12_0501.html')
gur <- readtext:::get_nexis_html('tests/data/nexis/guardian_1986-01-01_0001.html')
sun <- readtext:::get_nexis_html('tests/data/nexis/sun_2000-11-01_0001.html')
spg <- readtext:::get_nexis_html('tests/data/nexis/spiegel_2012-02-01_0001.html', 
                                  language_date = 'german')

all <- readtext('tests/data/nexis', source = 'nexis')
# }
