get_nexis_html

either a path to an HTML file or a directory that contains HTML files

path

a character to separate paragraphs in body texts

paragraph_separator
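
To illustrate what a paragraph separator does, here is a minimal base R sketch. It does not call get_nexis_html itself. The sample paragraphs and the "|" separator are assumptions chosen for illustration, not values taken from the package.

```r
# Hypothetical paragraphs from one article body (not real Nexis output).
paragraphs <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Assumed separator character; the function's actual default is not stated here.
paragraph_separator <- "|"

# Collapse the paragraphs into a single body text, marking each boundary.
body <- paste(paragraphs, collapse = paragraph_separator)

# Recover the paragraphs later by splitting on the same separator.
# fixed = TRUE treats "|" literally rather than as a regular expression.
recovered <- strsplit(body, paragraph_separator, fixed = TRUE)[[1]]
```

A separator that is unlikely to occur in the article text keeps the split unambiguous.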

<ul>
<li>0: output errors only</li>
<li>1: output errors and warnings (default)</li>
<li>2: output a brief summary message</li>
<li>3: output detailed file-related messages</li>
</ul>

verbosity
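
The levels above can be read as cumulative, which the following sketch makes explicit. The helper messages_shown is hypothetical and not part of the package. It assumes that levels 2 and 3 also include everything shown at the lower levels, which the list suggests but does not state.

```r
# Hypothetical helper, not part of the package: lists the kinds of messages
# shown at a given verbosity level, assuming the levels are cumulative.
messages_shown <- function(verbosity = 1) {
  stopifnot(verbosity %in% 0:3)
  kinds <- c("errors", "warnings", "summary", "file details")
  # Level 0 keeps only the first kind, level 3 keeps all four.
  kinds[seq_len(verbosity + 1)]
}

messages_shown(0)  # errors only
messages_shown(1)  # errors and warnings (the documented default level)
```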

This function extracts headings, body texts and meta data (date, byline, length,
section, edition) from items in HTML files downloaded by the scraper.

internal

Functions for importing and handling text files and formatted text
files with additional meta-data, including '.csv', '.tab', '.json', '.xml',
'.html', '.pdf', '.doc', '.docx', '.xls', '.xlsx', and others.

Kenneth Benoit

get_nexis_html: extract texts and meta data from Nexis HTML files

Description

Usage

Arguments

Examples