Given a link to a filing document (e.g. a 10-K or 8-K) in TXT format, split
the file into parts and items. This enables follow-up processing of a
specific section, e.g. just the Risk Factors. `item.name` and `part.name` are
taken directly from the document, with no attempt at normalization.
- Should non-text elements be removed? Default: TRUE
include.raw
- Include unprocessed nodes in result? Default: FALSE
fix.errors
- Try to fix document errors (e.g. missing part labels)?
Work in progress. Default: TRUE
Value
A data frame with one row per paragraph
part.name
Detected name of the Part
item.name
Detected name of the Item
text
Text of the paragraph / node
raw*
Raw HTML of the node if include.raw = TRUE
Details
NOTE: This has been tested on a range of documents, but formatting
differences can still cause failures. Please report an issue for any
document that isn't parsed correctly.
FURTHER NOTE: Not all filings are well formed. Missing headings, bad
spacing, and similar defects can all throw the parsing off.
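Because `item.name` is not normalized, picking out one section afterwards usually means matching on the label loosely. The following is a minimal sketch of that follow-up step, not part of the package itself. The `filing` data frame is mock data invented for illustration. It only mimics the documented columns, and the label spellings in it are assumptions about what a filing might contain.

```r
# Mock result with the documented columns (part.name, item.name, text).
# All values are invented for illustration; a real result comes from a filing.
filing <- data.frame(
  part.name = c("PART I", "PART I", "PART I", "PART II"),
  item.name = c("Item 1. Business",
                "ITEM 1A. RISK FACTORS",
                "ITEM 1A. RISK FACTORS",
                "Item 7. Management's Discussion"),
  text = c("We make widgets.",
           "Demand may fall.",
           "Costs may rise.",
           "Revenue grew."),
  stringsAsFactors = FALSE
)

# Labels are copied verbatim from the document, so match case-insensitively
# on the item number rather than on one exact spelling.
is_risk <- grepl("^item\\s+1a\\b", filing$item.name,
                 ignore.case = TRUE, perl = TRUE)
risk_factors <- filing[is_risk, ]

# The section's paragraphs as a character vector, ready for text processing.
risk_text <- risk_factors$text
```

Matching on the item number rather than the full heading tolerates differences in case and punctuation between filings, though it will still miss a heading the document itself omits.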