Read the content of a Word document and return a data.frame representing the document.
docx_summary(x, preserve = FALSE, remove_fields = FALSE, detailed = FALSE)
A data.frame with the following columns, depending on the value of detailed:
When detailed = FALSE (default), the data.frame contains:
doc_index: Document element index (integer).
content_type: Type of content: "paragraph" or "table cell" (character).
style_name: Name of the paragraph style (character).
text: Collapsed text content of the paragraph or cell (character).
table_index: Index of the table (integer). NA for non-table content.
row_id: Row position in table (integer). NA for non-table content.
cell_id: Cell position in table row (integer). NA for non-table content.
is_header: Whether the row is a table header (logical). NA for non-table content.
row_span: Number of rows spanned by the cell (integer). 0 for merged cells. NA for non-table content.
col_span: Number of columns spanned by the cell (character). NA for non-table content.
table_stylename: Name of the table style (character). NA for non-table content.
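As an illustration of how the table columns fit together, the sketch below rebuilds the first table of a document as a character matrix, using row_id and cell_id as coordinates. It assumes the example document shipped with officer contains at least one table and that row_id and cell_id are 1-based positions; the names cells, first and tab are illustrative only.

```r
library(officer)

# Example document shipped with officer (same file as in the examples).
doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))
content <- docx_summary(doc)

# Keep only table cells; paragraphs have NA in the table columns.
cells <- content[content$content_type %in% "table cell", ]
stopifnot(nrow(cells) > 0)  # assumption: the document has at least one table

# Take the first table and place each cell text at (row_id, cell_id).
first <- cells[cells$table_index == min(cells$table_index), ]
tab <- matrix(
  NA_character_,
  nrow = max(first$row_id),
  ncol = max(first$cell_id)
)
tab[cbind(first$row_id, first$cell_id)] <- first$text
tab
```

Merged cells are not treated specially here; row_span and col_span would need to be inspected to handle them.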
When detailed = TRUE, the data.frame contains additional run-level information:
run_index: Index of the run within the paragraph (integer).
run_content_index: Index of content element within the run (integer).
run_content_text: Text content of the run element (character).
image_path: Path to embedded image stored in the temporary directory
associated with the rdocx object (character).
Images should be copied to a permanent location before closing the R
session if needed.
field_code: Field code content (character).
footnote_text: Footnote text content (character).
link: Hyperlink URL (character).
link_to_bookmark: Internal bookmark anchor name for hyperlinks (character).
bookmark_start: Name of the bookmark starting at this run (character).
character_stylename: Name of the character/run style (character).
sz: Font size in half-points (integer).
sz_cs: Complex script font size in half-points (integer).
font_family_ascii: Font family for ASCII characters (character).
font_family_eastasia: Font family for East Asian characters (character).
font_family_hansi: Font family for high ANSI characters (character).
font_family_cs: Font family for complex script characters (character).
bold: Whether the run is bold (logical).
italic: Whether the run is italic (logical).
underline: Whether the run is underlined (logical).
color: Text color in hexadecimal format (character).
shading: Shading pattern (character).
shading_color: Shading foreground color (character).
shading_fill: Shading background fill color (character).
keep_with_next: Whether paragraph should stay with next (logical).
align: Paragraph alignment (character).
level: Numbering level (integer). NA if not a numbered list.
num_id: Numbering definition ID (integer). NA if not a numbered list.
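A minimal sketch of working with the run-level columns, assuming they are returned as documented above: it converts sz from half-points to points and lists the runs that are explicitly bold. The names runs, size_pt and bold_runs are illustrative and not part of the package.

```r
library(officer)

doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))
runs <- docx_summary(doc, detailed = TRUE)

# sz is stored in half-points, so divide by 2 to get points.
runs$size_pt <- runs$sz / 2

# Runs explicitly marked bold; %in% TRUE treats NA as "not bold".
bold_runs <- runs[
  runs$bold %in% TRUE,
  c("doc_index", "run_index", "run_content_text")
]
head(bold_runs)
```

Formatting inherited from a style may not show up in these columns, so an empty result does not necessarily mean the document has no bold text.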
x: An rdocx object.
preserve: If FALSE (default), text in table cells is collapsed into a
single line. If TRUE, line breaks in table cells are preserved as a "\n"
character. This feature is adapted from docxtractr::docx_extract_tbl(),
published under the MIT license in
the 'docxtractr' package by Bob Rudis.
remove_fields: If TRUE, prevent field codes from appearing in the returned data.frame.
detailed: Should run-level information be included in the data.frame?
Defaults to FALSE. If TRUE, the data.frame contains detailed information
about each run (text formatting, images, hyperlinks, etc.) instead of
collapsing content at the paragraph level. When FALSE, run-level
information such as images, hyperlinks, and text formatting is not available,
since data is aggregated at the paragraph level.
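The effect of preserve and remove_fields can be checked by calling the function with each option and comparing the results. The sketch below assumes the example document shipped with officer; whether it contains multi-line table cells or field codes is not stated here, so the counts may well be zero. The names collapsed, preserved, no_fields and n_multiline are illustrative.

```r
library(officer)

doc <- read_docx(system.file(package = "officer", "doc_examples/example.docx"))

collapsed <- docx_summary(doc)                        # defaults
preserved <- docx_summary(doc, preserve = TRUE)       # keep "\n" in table cells
no_fields <- docx_summary(doc, remove_fields = TRUE)  # drop field codes

# Count table cells whose text contains a preserved line break.
is_cell <- preserved$content_type %in% "table cell"
n_multiline <- sum(grepl("\n", preserved$text[is_cell], fixed = TRUE))
n_multiline

# Count elements whose text changes once field codes are removed
# (only comparable when both calls return the same number of rows).
if (nrow(collapsed) == nrow(no_fields)) {
  sum(collapsed$text != no_fields$text, na.rm = TRUE)
}
```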
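library(officer)

# Path to the example document shipped with the package
example_docx <- system.file(
  package = "officer",
  "doc_examples/example.docx"
)
doc <- read_docx(example_docx)

# One row per paragraph or table cell
docx_summary(doc)

# One row per run content element, with formatting details
docx_summary(doc, detailed = TRUE)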