read.transcript: Read Transcripts Into R

Description

Read .docx, .csv or .xlsx files into R.

Usage

read.transcript(
  file,
  col.names = NULL,
  text.var = NULL,
  merge.broke.tot = TRUE,
  header = FALSE,
  dash = "",
  ellipsis = "...",
  quote2bracket = FALSE,
  rm.empty.rows = TRUE,
  na.strings = c("999", "NA", "", " "),
  sep = NULL,
  skip = 0,
  nontext2factor = TRUE,
  text,
  comment.char = "",
  ...
)

Arguments

file

The name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd().

col.names

A character vector specifying the column names of the transcript columns.

text.var

A character string specifying the name of the text variable; supplying it ensures that the variable is classed as character. If NULL, read.transcript attempts to guess the text variable (dialogue).

merge.broke.tot

logical. If TRUE and the file being read in is a .docx in which a single turn of talk is broken up by space, read.transcript attempts to merge the pieces into a single turn of talk.

header

logical. If TRUE the file contains the names of the variables as its first line.

dash

A character string used to replace en dash and em dash special characters (the default, "", removes them).

ellipsis

A character string used to replace ellipsis special characters (the default is the text "...").

quote2bracket

logical. If TRUE, curly quotes are replaced with curly braces. If FALSE (the default), curly quotes are removed.

rm.empty.rows

logical. If TRUE read.transcript attempts to remove empty rows.

na.strings

A vector of character strings which are to be interpreted as NA values.

sep

The field separator character. Values on each line of the file are separated by this character. The default of NULL instructs read.transcript to use a separator suitable for the file type being read in.

skip

Integer; the number of lines of the data file to skip before beginning to read data.

nontext2factor

logical. If TRUE, read.transcript attempts to convert any non-text variables to factors.

text

Character string: if file is not supplied and this is, then data are read from the value of text. Notice that a literal string can be used to include (small) data sets within R code.

comment.char

A character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

...

Further arguments to be passed to read.table.
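
Several of the arguments above (sep, skip, header, col.names, na.strings, text, comment.char) share their names with arguments of base R's read.table, to which further arguments are passed. The sketch below calls read.table directly, not read.transcript, to illustrate how those arguments interact on a small literal string. It assumes read.transcript treats them analogously; the sample text and the extra strip.white, quote and stringsAsFactors settings are illustrative choices, not taken from the package.

```r
# Base R only: this calls read.table(), not read.transcript().
# The transcript text is made up for illustration.
trans <- "transcript from session one
sam: Computer is fun.
greg: 999
teacher: What should we do?"

dat <- read.table(
  text = trans,                          # read from a string instead of a file
  sep = ":",                             # speaker and dialogue are split on the colon
  skip = 1,                              # skip the title line
  header = FALSE,                        # no row of variable names in the data
  col.names = c("person", "dialogue"),
  na.strings = c("999", "NA", "", " "),  # "999" is read as a missing value
  strip.white = TRUE,                    # drop the space that follows the colon
  comment.char = "",                     # do not interpret any comment character
  quote = "",                            # treat quotes and apostrophes literally
  stringsAsFactors = FALSE
)

dat
```

Here the second turn is read as NA because, after white space is stripped, its value matches an entry in na.strings.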

Value

Returns a data frame of dialogue and people.

Warning

The output of read.transcript may contain errors if the file being read in is .docx. The researcher should carefully inspect each transcript for errors before parsing the data further.
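
As one possible starting point for that inspection, the base R sketch below flags rows with missing or blank dialogue and speaker labels that begin with a bracket. The data frame is a hypothetical stand-in for read.transcript output, and the two checks are illustrative assumptions rather than a procedure from the package; they do not replace reading the transcript.

```r
# Hypothetical data frame standing in for the result of read.transcript().
dat <- data.frame(
  person = c("sam", "greg", "[Cross Talk", "teacher"),
  dialogue = c("Computer is fun.", "", "00:01:04]", NA),
  stringsAsFactors = FALSE
)

# Rows whose dialogue is missing or contains only white space.
blank <- is.na(dat$dialogue) | !nzchar(trimws(dat$dialogue))

# Speaker labels starting with "[", which may be annotation rows
# rather than turns of talk.
bracket <- grepl("^\\[", dat$person)

# Rows that deserve a closer look, and a tally of speaker labels
# to help spot misspelled or unexpected names.
dat[blank | bracket, ]
table(dat$person)
```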

References

https://github.com/trinker/qdap/wiki/Reading-.docx-%5BMS-Word%5D-Transcripts-into-R

Examples

# NOT RUN {
#Note: to view the document below use the path:
system.file("extdata/transcripts/", package = "qdap")
(doc1 <- system.file("extdata/transcripts/trans1.docx", package = "qdap"))
(doc2 <- system.file("extdata/transcripts/trans2.docx", package = "qdap"))
(doc3 <- system.file("extdata/transcripts/trans3.docx", package = "qdap"))
(doc4 <- system.file("extdata/transcripts/trans4.xlsx", package = "qdap"))

dat1 <- read.transcript(doc1)
truncdf(dat1, 40)
dat2 <- read.transcript(doc1, col.names = c("person", "dialogue"))
truncdf(dat2, 40)
dat2b <- rm_row(dat2, "person", "[C") #remove bracket row
truncdf(dat2b, 40)

## read.transcript(doc2) #throws an error (need skip)
dat3 <- read.transcript(doc2, skip = 1); truncdf(dat3, 40)

## read.transcript(doc3, skip = 1) #incorrect read; wrong sep
dat4 <- read.transcript(doc3, sep = "-", skip = 1); truncdf(dat4, 40)

dat5 <- read.transcript(doc4); truncdf(dat5, 40) #an .xlsx file

## Read a transcript supplied as a character string
trans <- "sam: Computer is fun. Not too fun.
greg: No it's not, it's dumb.
teacher: What should we do?
sam: You liar, it stinks!"

read.transcript(text=trans)

## Read in text specify spaces as sep
## EXAMPLE 1

read.transcript(text="34    The New York Times reports a lot of words here.
12    Greenwire reports a lot of words.
31    Only three words.
 2    The Financial Times reports a lot of words.
 9    Greenwire short.
13    The New York Times reports a lot of words again.", 
    col.names=qcv(NO,    ARTICLE), sep="   ")

## EXAMPLE 2

read.transcript(text="34..    The New York Times reports a lot of words here.
12..    Greenwire reports a lot of words.
31..    Only three words.
 2..    The Financial Times reports a lot of words.
 9..    Greenwire short.
13..    The New York Times reports a lot of words again.", 
    col.names=qcv(NO,    ARTICLE), sep="\\.\\.")
# }