Extract content from Word and PowerPoint

Import Word document

Function docx_summary is returning content of a Word document.

library(officer) example_docx <- system.file(package = "officer", "doc_examples/example.docx") doc <- read_docx(example_docx) content <- docx_summary(doc) content

Explore the results:

library(dplyr) content %>% group_by(content_type) %>% summarise(n = n_distinct(doc_index))

To get all paragraphs:

par_data <- content %>% filter(content_type %in% "paragraph") %>% select(doc_index, style_name, text, level, num_id) %>% # let's make text shorter so it can be display in that vignette mutate(text = substr(text, start = 1, stop = ifelse(nchar(text)<30, nchar(text), 30) )) par_data

Word tables

Tables are unstacked:

table_cells <- content %>% filter(content_type %in% "table cell") print(table_cells)

Cells positions and values are dispatched in columns row_id, cell_id, text and is_header (a logical column indicating if the cell is part of header or not). Note that content (column text) is a character vector.

table_body <- table_cells %>% filter(!is_header) %>% select(row_id, cell_id, text) table_body

Reshape data with columns row_id, cell_id and text, it's easy to do with tidyr :

if( require("tidyr")) table_body %>% spread(cell_id, text)

Getting headers requires another operation:

if( require("tidyr")) table_cells %>% filter(is_header) %>% select(row_id, cell_id, text) %>% spread(cell_id, text)

Import PowerPoint document

Function pptx_summary is returning content of a PowerPoint document

example_pptx <- system.file(package = "officer", "doc_examples/example.pptx") doc <- read_pptx(example_pptx) content <- pptx_summary(doc) content

Explore the results:

content %>% group_by(content_type) %>% summarise(n = n_distinct(id))

To get all paragraphs:

par_data <- content %>% filter(content_type %in% "paragraph") %>% select(id, text) par_data

To get an image:

image_row <- content %>% filter(content_type %in% "image") media_extract(doc, path = image_row$media_file, target = "extract.png")

PowerPoint tables

Tables are unstacked :

table_cells <- content %>% filter(content_type %in% "table cell") table_cells

Cells positions and values are dispatched in columns row_id, cell_id, text. Note here there is no indicator for table header.

if( require("tidyr")) table_cells %>% select(row_id, cell_id, text) %>% spread(cell_id, text)