get.text: get.text

Description

Extracts main textual content from NISO-JATS coded XML file or text as sectioned text.

Usage

get.text(
  x,
  sectionsplit = "",
  grepsection = "",
  letter.convert = TRUE,
  greek2text = FALSE,
  sentences = FALSE,
  paragraph = FALSE,
  cermine = "auto",
  rm.table = TRUE,
  rm.formula = TRUE,
  rm.xref = TRUE,
  rm.media = TRUE,
  rm.graphic = TRUE,
  rm.ext_link = TRUE
)

Value

List with two elements. 1: Character vector with section title/s, 2: Character vector with floating text of sections or list with vector of sentences per section/s if sentences=TRUE.

Arguments

x: a NISO-JATS coded XML file or text.
sectionsplit: search patterns for section split (forced to lower case), e.g. c("intro", "method", "result", "discus").
grepsection: search pattern to reduce text to specific section namings only.
letter.convert: Logical. If TRUE converts hexadecimal and HTML coded characters to Unicode.
greek2text: Logical. If TRUE some greek letters and special characters will be unified to textual representation (important to extract stats).
sentences: Logical. IF TRUE text is returned as sectioned list with sentences.
paragraph: Logical. IF TRUE "<New paragraph>" is added at the end of each paragraph to enable manual splitting at paragraphs.
cermine: Logical. If TRUE CERMINE specific error handling and letter conversion will be applied. If set to "auto" file name ending with 'cermxml$' will set cermine=TRUE.
rm.table: Logical. If TRUE removes <table> tag from text.
rm.formula: Logical. If TRUE removes <formula> tags.
rm.xref: Logical. If TRUE removes <xref> tag (citing) from text.
rm.media: Logical. If TRUE removes <media> tag from text.
rm.graphic: Logical. If TRUE removes <graphic> and <fig> tag from text.
rm.ext_link: Logical. If TRUE removes <ext link> tag from text.

Description

Usage

Value

Arguments

See Also