textreadr (version 0.5.1)

read_dir_transcript: Read In Multiple Transcript Files From a Directory

Description

Read in multiple transcript files from a directory and create a data.frame.

Usage

read_dir_transcript(path, col.names = c("Document", "Person", "Dialogue"),
  pattern = NULL, all.files = FALSE, recursive = FALSE, skip = 0,
  merge.broke.tot = TRUE, header = FALSE, dash = "", ellipsis = "...",
  quote2bracket = FALSE, rm.empty.rows = TRUE, na = "", sep = NULL,
  comment.char = "", max.person.nchar = 20, ...)

Arguments

path
Path to the directory.
col.names
A character vector specifying the column names of the transcript columns (document, person, dialogue).
pattern
An optional regular expression. Only file names which match the regular expression will be returned.
all.files
Logical. If FALSE, only the names of visible files are returned. If TRUE, all file names will be returned.
recursive
Logical. Should the listing recurse into directories?
skip
Integer; the number of lines of the data file to skip before beginning to read data.
merge.broke.tot
Logical. If TRUE and the file being read in is a .docx in which a single turn of talk is broken across lines, read_transcript will attempt to merge the pieces into a single turn of talk.
header
Logical. If TRUE the file contains the names of the variables as its first line.
dash
A character string to replace en and em dash characters (the default, "", removes them).
ellipsis
A character string to replace the ellipsis special character.
quote2bracket
Logical. If TRUE replaces curly quotes with curly braces (default is FALSE). If FALSE curly quotes are removed.
rm.empty.rows
Logical. If TRUE read_transcript attempts to remove empty rows.
na
A character string to be interpreted as an NA value.
sep
The field separator character. Values on each line of the file are separated by this character. The default of NULL instructs read_transcript to use a separator suitable for the file type being read in.
comment.char
A character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.
max.person.nchar
The maximum number of characters a person's name is expected to have. This information is used to warn the user if a separator appears beyond this length in the text.
...
Ignored.
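
The descriptions of pattern, all.files, and recursive match the arguments of the same names in base R's list.files(). Assuming read_dir_transcript selects files in the same way (the page does not state this outright), a quick way to preview which files a given combination would pick up is to call list.files() directly. The sketch below uses a throwaway directory with empty, hypothetical file names so that it runs without any real transcripts.

```r
# Build a throwaway directory; the file names are hypothetical placeholders.
demo_dir <- file.path(tempdir(), "transcript_demo")
dir.create(file.path(demo_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.docx", "b.txt", ".hidden.docx", "nested/c.docx")))

# Defaults (all.files = FALSE, recursive = FALSE):
# only visible .docx files in the top level of the directory.
top_level <- list.files(demo_dir, pattern = "\\.docx$")

# Include hidden files and descend into subdirectories.
everything <- list.files(demo_dir, pattern = "\\.docx$",
                         all.files = TRUE, recursive = TRUE)

print(top_level)
print(everything)
```

If the preview lists files you did not intend to read, tighten pattern before passing the same values to read_dir_transcript.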

Value

Returns a data.frame of documents, people, and dialogue.
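
As a rough sketch of working with the returned value, the code below reads the transcripts bundled with the package and summarises them with base R. It assumes textreadr is installed, that the columns carry the default col.names, and that the skip vector from the package example supplies one value per file; none of the output is shown here because it depends on the bundled files.

```r
library(textreadr)

# Skip vector and path copied from the package example
# (assumed to give one skip value per file in the directory).
skips <- c(0, 1, 1, 0, 0, 1)
path <- system.file("docs/transcripts", package = "textreadr")
dat <- read_dir_transcript(path, skip = skips)

# Column names should follow the col.names default.
names(dat)

# Number of turns of talk per document.
table(dat$Document)

# Dialogue grouped by speaker, as a named list of character vectors.
by_person <- split(dat$Dialogue, dat$Person)
```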

See Also

read_transcript

Examples

skips <- c(0, 1, 1, 0, 0, 1)
path <- system.file("docs/transcripts", package = 'textreadr')
textreadr::peek(read_dir_transcript(path, skip = skips), Inf)

## Not run: ------------------------------------
# ## with additional cleaning
# ## library() attaches one package per call
# library(tidyverse)
# library(textshape)
# library(textclean)
# 
# path %>%
#     read_dir_transcript(skip = skips) %>%
#     textclean::filter_row("Person", "^\\[") %>%
#     mutate(
#         Person = stringi::stri_replace_all_regex(Person, "(^/\\s*)|(:\\s*$)", "") %>%
#             trimws(),
#         Dialogue = stringi::stri_replace_all_regex(Dialogue, "(^/\\s*)", "")
#     ) %>%
#     peek(Inf)
## ---------------------------------------------
