docx_summary: Get Word content in a data.frame

Description

Read the content of a Word document and return a data.frame representing the document.

Usage

docx_summary(x, preserve = FALSE, remove_fields = FALSE, detailed = FALSE)

Value

A data.frame with the following columns depending on the value of detailed:

When detailed = FALSE (default), the data.frame contains:

doc_index: Document element index (integer).
content_type: Type of content: "paragraph" or "table cell" (character).
style_name: Name of the paragraph style (character).
text: Collapsed text content of the paragraph or cell (character).
table_index: Index of the table (integer). NA for non-table content.
row_id: Row position in table (integer). NA for non-table content.
cell_id: Cell position in table row (integer). NA for non-table content.
is_header: Whether the row is a table header (logical). NA for non-table content.
row_span: Number of rows spanned by the cell (integer). 0 for merged cells. NA for non-table content.
col_span: Number of columns spanned by the cell (integer). NA for non-table content.
table_stylename: Name of the table style (character). NA for non-table content.
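As an illustration of how these columns can be used, the sketch below separates paragraphs from table cells and lays the cells of the first table out as a grid. It assumes the example document shipped with officer and assumes that all cells of one table share the same doc_index; the object names (content, table_cells, grid) are only illustrative.

```r
library(officer)

# Example document shipped with the officer package
doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))
content <- docx_summary(doc)

# Split the rows by content type
paragraphs <- content[content$content_type %in% "paragraph", ]
table_cells <- content[content$content_type %in% "table cell", ]

# Arrange the cells of the first table as a row x column character matrix.
# Assumption: every cell of a given table carries the same doc_index.
grid <- NULL
if (nrow(table_cells) > 0) {
  first <- table_cells[table_cells$doc_index == min(table_cells$doc_index), ]
  grid <- tapply(
    first$text,
    list(row = first$row_id, cell = first$cell_id),
    function(z) z[1]
  )
}
```

Positions with no matching cell (for example, cells absorbed by a column span) appear as NA in grid.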

When detailed = TRUE, the data.frame contains additional run-level information:

run_index: Index of the run within the paragraph (integer).
run_content_index: Index of content element within the run (integer).
run_content_text: Text content of the run element (character).
image_path: Path to embedded image stored in the temporary directory associated with the rdocx object (character). Images should be copied to a permanent location before closing the R session if needed.
field_code: Field code content (character).
footnote_text: Footnote text content (character).
link: Hyperlink URL (character).
link_to_bookmark: Internal bookmark anchor name for hyperlinks (character).
bookmark_start: Name of the bookmark starting at this run (character).
character_stylename: Name of the character/run style (character).
sz: Font size in half-points (integer).
sz_cs: Complex script font size in half-points (integer).
font_family_ascii: Font family for ASCII characters (character).
font_family_eastasia: Font family for East Asian characters (character).
font_family_hansi: Font family for high ANSI characters (character).
font_family_cs: Font family for complex script characters (character).
bold: Whether the run is bold (logical).
italic: Whether the run is italic (logical).
underline: Whether the run is underlined (logical).
color: Text color in hexadecimal format (character).
shading: Shading pattern (character).
shading_color: Shading foreground color (character).
shading_fill: Shading background fill color (character).
keep_with_next: Whether paragraph should stay with next (logical).
align: Paragraph alignment (character).
level: Numbering level (integer). NA if not a numbered list.
num_id: Numbering definition ID (integer). NA if not a numbered list.
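A minimal sketch of working with the run-level columns, assuming an officer version that supports the detailed argument and returns the columns listed above. It converts sz from half-points to points and keeps the runs flagged as bold; the names detail, size_pt and bold_runs are illustrative only.

```r
library(officer)

# Example document shipped with the officer package
doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))
detail <- docx_summary(doc, detailed = TRUE)

# sz is expressed in half-points, so divide by 2 to get points.
# Runs without an explicit size stay NA.
size_pt <- detail$sz / 2

# Keep runs explicitly marked as bold; %in% treats NA as "not bold"
bold_runs <- detail[detail$bold %in% TRUE, c("doc_index", "run_index", "run_content_text")]
```

Formatting columns may be NA when a run does not set the property itself, so the result reflects direct formatting only under that assumption.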

Arguments

x: an rdocx object
preserve: If FALSE (default), text in table cells is collapsed into a single line. If TRUE, line breaks in table cells are preserved as a "\n" character. This feature is adapted from docxtractr::docx_extract_tbl(), published under the MIT license in the 'docxtractr' package by Bob Rudis.
remove_fields: If TRUE, field codes are excluded from the returned data.frame.
detailed: Whether to include run-level information in the data.frame. Defaults to FALSE. If TRUE, the data.frame describes each run (text formatting, images, hyperlinks, etc.) instead of collapsing content at the paragraph level. If FALSE, this run-level information is not available because data is aggregated at the paragraph level.
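The effect of preserve and remove_fields can be checked with a short sketch like the one below. It assumes the example document shipped with officer; whether that document actually contains multi-line cells or field codes is not guaranteed, so the code only flags rows where a line break is present.

```r
library(officer)

# Example document shipped with the officer package
doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))

# Default: text in table cells is collapsed into a single line
collapsed <- docx_summary(doc)

# Line breaks in table cells are kept as "\n"
preserved <- docx_summary(doc, preserve = TRUE)

# Flag the rows whose text contains a preserved line break
has_break <- grepl("\n", preserved$text, fixed = TRUE)

# Exclude field codes from the returned data.frame
no_fields <- docx_summary(doc, remove_fields = TRUE)
```

Inspecting preserved[has_break, ] shows which cells, if any, span several lines.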

Examples

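library(officer)

# Locate the example document shipped with officer
example_docx <- system.file(
  package = "officer",
  "doc_examples/example.docx"
)
doc <- read_docx(example_docx)

# Paragraph-level summary (default)
docx_summary(doc)

# Run-level summary with formatting, images and hyperlinks
docx_summary(doc, detailed = TRUE)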