Extract content from Word and PowerPoint

Import Word document

The function docx_summary() returns the content of a Word document.

library(officer) example_docx <- system.file(package = "officer", "doc_examples/example.docx") doc <- read_docx(example_docx) content <- docx_summary(doc) head(content)

Explore the results:

tapply(content$doc_index, content$content_type, function(x) length(unique(x)))

To get all paragraphs:

par_data <- subset(content, content_type %in% "paragraph") par_data <- par_data[, c("doc_index", "style_name", "text", "level", "num_id") ] par_data$text <- with(par_data, { substr( text, start = 1, stop = ifelse(nchar(text)<30, nchar(text), 30) ) }) par_data

Word tables

Tables are unstacked:

table_cells <- subset(content, content_type %in% "table cell") print(head( table_cells) )

Cells positions and values are dispatched in columns row_id, cell_id, text and is_header (a logical column indicating if the cell is part of a header or not). Note that the content itself (column text) is a character vector.

table_body <- subset(table_cells, !is_header) table_body <- table_body[,c("row_id", "cell_id", "text")] head(table_body)

Reshaping the data with columns row_id, cell_id and text would display something close to the orginal table:

tapply(table_body$text, list(row_id = table_body$row_id, cell_id = table_body$cell_id ), FUN = I )

Getting headers requires another operation:

data <- subset(table_cells, is_header) data <- data[, c("row_id", "cell_id", "text") ] tapply(data$text, list(row_id = data$row_id, cell_id = data$cell_id ), FUN = I )

Import PowerPoint document

The function pptx_summary() returns the content of a PowerPoint document.

example_pptx <- system.file(package = "officer", "doc_examples/example.pptx") doc <- read_pptx(example_pptx) content <- pptx_summary(doc) head(content)

Explore the results:

tapply(content$id, content$content_type, function(x) length(unique(x)))

To get all paragraphs:

par_data <- subset(content, content_type %in% "paragraph", select = c(id, text) ) head(par_data)

To get an image:

image_row <- subset(content, content_type %in% "image") media_extract(doc, path = image_row$media_file, target = "extract.png")

PowerPoint tables

Tables are unstacked :

table_cells <- subset(content, content_type %in% "table cell") head(table_cells)

Cells positions and values are dispatched in columns row_id, cell_id, text. Note that here there is no indicator for the table header.

data <- subset(table_cells, id == 18, c(row_id, cell_id, text) ) tapply(data$text, list(row_id = data$row_id, cell_id = data$cell_id ), FUN = I )