readtext: Import and handling for plain and formatted text files
An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, and Paul Nulty.
Introduction
readtext is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables. Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.
readtext accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types. readtext is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.
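To make the filemask idea concrete, here is a minimal base-R sketch of what expanding a pattern and collecting the results into a data.frame amounts to. It is an illustration only, not readtext's actual implementation: the temporary directory, the file names, and the `docs` object are all hypothetical.

```r
# Illustrative only: a base-R approximation of "filemask in, data.frame out".
# The directory and file names below are made up for this sketch.
tmp <- tempfile("filemask_demo")
dir.create(tmp)
writeLines("First document.", file.path(tmp, "a.txt"))
writeLines("Second document.", file.path(tmp, "b.txt"))

# Expand the filemask into concrete file paths (sorted alphabetically)
paths <- Sys.glob(file.path(tmp, "*.txt"))

# One row per file: an identifier plus the whole file as a single string
docs <- data.frame(
  doc_id = basename(paths),
  text = vapply(paths,
                function(p) paste(readLines(p), collapse = "\n"),
                character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
docs
```

With readtext itself, the equivalent call would simply pass the pattern as the file argument, as in the examples further down.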
As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings. (All encoding functions are handled by the stringi package.)
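The idea of a vectorized input encoding can be sketched in base R: each file is paired with its own encoding and converted to UTF-8 on the way in. This sketch uses base `iconv()` as a stand-in for stringi, and the helper `read_as_utf8()` and the file names are assumptions made for illustration, not part of readtext.

```r
# Illustrative only: per-file encodings, using base iconv() instead of stringi.
tmp <- tempfile("encoding_demo")
dir.create(tmp)
original <- "caf\u00e9"  # "café", held in R as a UTF-8 string

# Write the same text to disk in two different encodings, as raw bytes
writeBin(charToRaw(enc2utf8(original)), file.path(tmp, "doc_UTF-8.txt"))
writeBin(iconv(original, from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]],
         file.path(tmp, "doc_ISO-8859-1.txt"))

# Hypothetical helper: read a file's bytes and convert them to UTF-8
read_as_utf8 <- function(path, encoding) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  iconv(list(bytes), from = encoding, to = "UTF-8")
}

# One encoding per file, matched by position
paths <- file.path(tmp, c("doc_ISO-8859-1.txt", "doc_UTF-8.txt"))
encodings <- c("ISO-8859-1", "UTF-8")
texts <- mapply(read_as_utf8, paths, encodings, USE.NAMES = FALSE)
texts
```

Both elements of `texts` should hold the same four-character string, even though the bytes on disk differ.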
How to Install
From GitHub
# devtools package required to install readtext from GitHub
devtools::install_github("kbenoit/readtext")
From CRAN
(coming soon)
Inter-operability with quanteda
readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. It was spawned from the textfile() function from that package, and now lives exclusively in readtext. Because quanteda's corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.
require(readtext)
## Loading required package: readtext
require(quanteda)
## Loading required package: quanteda
## quanteda version 0.9.9.57
## Using 3 of 4 cores for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
# set verbosity level to 2
options(readtext_verbosity = 2)
FILEDIR <- tempdir()
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = FILEDIR)
# get encoding from filename
filenames <- list.files(FILEDIR, "\\.txt$")
# strip the extension
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
fileencodings
## [1] "UTF-16LE" "UTF-8-BOM" "ISO-8859-6" "UTF-8"
## [5] "WINDOWS-1256" "GB2312" "GBK" "UTF-8"
## [9] "UTF-16BE" "UTF-16LE" "UTF-8" "WINDOWS-1252"
## [13] "ISO-8859-1" "UTF-8" "WINDOWS-1252" "ISO-8859-1"
## [17] "UTF-8" "WINDOWS-1252" "CP1253" "ISO-8859-7"
## [21] "UTF-8" "UTF-8" "ISO-8859-1" "UTF-8"
## [25] "WINDOWS-1252" "CP932" "ISO-2022-JP" "UTF-8"
## [29] "WINDOWS-936" "ISO-2022-KR" "UTF-8" "ISO-8859-5"
## [33] "KOI8-R" "UTF-8" "WINDOWS-1251" "UTF-8"
# find out which conversions are unavailable (through iconv())
cat("Encoding conversions not available for this platform:")
## Encoding conversions not available for this platform:
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
## [1] "UTF-8-BOM"
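"UTF-8-BOM" is not an iconv conversion name, which is why it shows up as unavailable above. As a side note, base R connections do accept it as an encoding when reading and strip the byte order mark. The sketch below demonstrates that with a hypothetical temporary file; it says nothing about how readtext treats this label internally.

```r
# Illustrative only: reading a UTF-8 file that starts with a byte order mark.
bom_file <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("hello\n")), bom_file)

# Read without declaring the BOM: the three BOM bytes stay in the text
with_bom <- readLines(bom_file)

# Read through a connection declared as "UTF-8-BOM": the BOM is removed
con <- file(bom_file, encoding = "UTF-8-BOM")
without_bom <- readLines(con)
close(con)

nchar(with_bom, type = "bytes")     # BOM bytes plus "hello"
nchar(without_bom, type = "bytes")  # "hello" only
```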
# read in some text files
# try readtext
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
## Reading texts from /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//
## RtmpmFk3IF/*.txt
##
## ... read 36 documents.
# substring(texts(txts)[1], 1, 80) # gibberish
# substring(texts(txts)[4], 1, 80) # hex
# substring(texts(txts)[40], 1, 80) # hex
# read them in again
txts <- readtext(paste0(FILEDIR, "/", "*.txt"), encoding = fileencodings)
## Reading texts from /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//RtmpmFk3IF/*.txt
## ... read 36 documents.
substring(texts(txts)[1], 1, 80) # English
## 1
## "WHEREAS, the Sisseton and Wahpeton Bands of Dakota or Sioux Indians, on the 20th"
substring(texts(txts)[4], 1, 80) # Arabic, looking good
## 4
## "الديباجة\nلما كان الاعتراف بالكرامة المتأصلة في جميع أعضاء الأسرة البشرية وبحقوقه"
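# note: only 36 documents were read, so index 40 is out of range and returns NA;
# the Russian (Cyrillic) texts are documents 32 to 35
substring(texts(txts)[40], 1, 80)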
## <NA>
## NA
substring(texts(txts)[7], 1, 80) # Chinese, looking good
## 7
## "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布\n\n1948年12月10日,联合国大会通过并颁布《世界人权宣言》。这一具有历史意"
substring(texts(txts)[26], 1, 80) # Japanese, looking good
## 26
## "『世界人権宣言』\n \n\n(1948.12.10 第3回国連総会採択)\n\n \n\n〈前文〉\n \n人類社会のすべての構成員の固有の尊厳と平等で譲ることのできない権利と"
txts <- readtext(paste0(FILEDIR, "/", "*.txt"),
encoding = fileencodings,
docvarsfrom = "filenames",
docvarnames = c("document", "language", "inputEncoding"))
## Reading texts from /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//RtmpmFk3IF/*.txt
## ... read 36 documents.
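Judging from the summary output, docvarsfrom = "filenames" splits each file name on underscores and assigns the pieces to the names given in docvarnames. A base-R sketch of that splitting, using three of the file names from this example, might look as follows. It is an approximation for illustration, not readtext's own code, and the `docvars` object is hypothetical.

```r
# Illustrative only: deriving document variables from underscore-separated
# file names, in the spirit of docvarsfrom = "filenames".
filenames <- c("UDHR_Arabic_UTF-8.txt",
               "UDHR_Russian_KOI8-R.txt",
               "IndianTreaty_English_UTF-16LE.txt")

# Drop the extension, then split each name into its three parts
stems <- sub("\\.txt$", "", filenames)
parts <- strsplit(stems, "_", fixed = TRUE)

# Bind the parts row-wise and label the columns
docvars <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars) <- c("document", "language", "inputEncoding")
docvars
```

Note that this only works when every file name has the same number of underscore-separated parts, as the files in this example do.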
encodingCorpus <- corpus(txts, textField = "texts",
source = "Created by encoding-tests.R")
## Warning in corpus.character(x[, text_fieldi], docvars = x[, -text_fieldi, :
## Arguments textFieldsource not used.
summary(encodingCorpus)
## Corpus consisting of 36 documents.
##
## Text Types Tokens Sentences doc_id
## text1 690 2938 155 IndianTreaty_English_UTF-16LE.txt
## text2 646 3104 154 IndianTreaty_English_UTF-8-BOM.txt
## text3 753 1555 86 UDHR_Arabic_ISO-8859-6.txt
## text4 753 1555 86 UDHR_Arabic_UTF-8.txt
## text5 753 1555 86 UDHR_Arabic_WINDOWS-1256.txt
## text6 596 1840 62 UDHR_Chinese_GB2312.txt
## text7 596 1840 62 UDHR_Chinese_GBK.txt
## text8 596 1840 62 UDHR_Chinese_UTF-8.txt
## text9 559 1916 61 UDHR_English_UTF-16BE.txt
## text10 559 1916 61 UDHR_English_UTF-16LE.txt
## text11 576 1943 63 UDHR_English_UTF-8.txt
## text12 576 1943 63 UDHR_English_WINDOWS-1252.txt
## text13 670 2132 58 UDHR_French_ISO-8859-1.txt
## text14 670 2132 58 UDHR_French_UTF-8.txt
## text15 670 2132 58 UDHR_French_WINDOWS-1252.txt
## text16 661 1831 61 UDHR_German_ISO-8859-1.txt
## text17 661 1831 61 UDHR_German_UTF-8.txt
## text18 661 1831 61 UDHR_German_WINDOWS-1252.txt
## text19 735 2193 106 UDHR_Greek_CP1253.txt
## text20 735 2193 106 UDHR_Greek_ISO-8859-7.txt
## text21 735 2193 106 UDHR_Greek_UTF-8.txt
## text22 661 2388 104 UDHR_Hindi_UTF-8.txt
## text23 725 1887 106 UDHR_Icelandic_ISO-8859-1.txt
## text24 725 1887 106 UDHR_Icelandic_UTF-8.txt
## text25 725 1887 106 UDHR_Icelandic_WINDOWS-1252.txt
## text26 578 2482 65 UDHR_Japanese_CP932.txt
## text27 578 2482 65 UDHR_Japanese_ISO-2022-JP.txt
## text28 578 2482 65 UDHR_Japanese_UTF-8.txt
## text29 578 2482 65 UDHR_Japanese_WINDOWS-936.txt
## text30 661 1322 65 UDHR_Korean_ISO-2022-KR.txt
## text31 661 1322 65 UDHR_Korean_UTF-8.txt
## text32 762 1829 62 UDHR_Russian_ISO-8859-5.txt
## text33 762 1829 62 UDHR_Russian_KOI8-R.txt
## text34 762 1829 62 UDHR_Russian_UTF-8.txt
## text35 762 1829 62 UDHR_Russian_WINDOWS-1251.txt
## text36 541 2404 34 UDHR_Thai_UTF-8.txt
## document language inputEncoding
## IndianTreaty English UTF-16LE
## IndianTreaty English UTF-8-BOM
## UDHR Arabic ISO-8859-6
## UDHR Arabic UTF-8
## UDHR Arabic WINDOWS-1256
## UDHR Chinese GB2312
## UDHR Chinese GBK
## UDHR Chinese UTF-8
## UDHR English UTF-16BE
## UDHR English UTF-16LE
## UDHR English UTF-8
## UDHR English WINDOWS-1252
## UDHR French ISO-8859-1
## UDHR French UTF-8
## UDHR French WINDOWS-1252
## UDHR German ISO-8859-1
## UDHR German UTF-8
## UDHR German WINDOWS-1252
## UDHR Greek CP1253
## UDHR Greek ISO-8859-7
## UDHR Greek UTF-8
## UDHR Hindi UTF-8
## UDHR Icelandic ISO-8859-1
## UDHR Icelandic UTF-8
## UDHR Icelandic WINDOWS-1252
## UDHR Japanese CP932
## UDHR Japanese ISO-2022-JP
## UDHR Japanese UTF-8
## UDHR Japanese WINDOWS-936
## UDHR Korean ISO-2022-KR
## UDHR Korean UTF-8
## UDHR Russian ISO-8859-5
## UDHR Russian KOI8-R
## UDHR Russian UTF-8
## UDHR Russian WINDOWS-1251
## UDHR Thai UTF-8
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/readtext/* on x86_64 by kbenoit
## Created: Wed May 17 11:48:23 2017
## Notes:
Piping works too:
require(magrittr)
## Loading required package: magrittr
readtext(paste0(FILEDIR, "/", "*.txt"), encoding = fileencodings) %>%
corpus(textField = "texts") %>%
summary
## Reading texts from /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//
## RtmpmFk3IF/*.txt
##
## ... read 36 documents.
## Warning in corpus.character(x[, text_fieldi], docvars = x[, -text_fieldi, :
## Argument textField not used.
## Corpus consisting of 36 documents.
##
## Text Types Tokens Sentences doc_id
## text1 690 2938 155 IndianTreaty_English_UTF-16LE.txt
## text2 646 3104 154 IndianTreaty_English_UTF-8-BOM.txt
## text3 753 1555 86 UDHR_Arabic_ISO-8859-6.txt
## text4 753 1555 86 UDHR_Arabic_UTF-8.txt
## text5 753 1555 86 UDHR_Arabic_WINDOWS-1256.txt
## text6 596 1840 62 UDHR_Chinese_GB2312.txt
## text7 596 1840 62 UDHR_Chinese_GBK.txt
## text8 596 1840 62 UDHR_Chinese_UTF-8.txt
## text9 559 1916 61 UDHR_English_UTF-16BE.txt
## text10 559 1916 61 UDHR_English_UTF-16LE.txt
## text11 576 1943 63 UDHR_English_UTF-8.txt
## text12 576 1943 63 UDHR_English_WINDOWS-1252.txt
## text13 670 2132 58 UDHR_French_ISO-8859-1.txt
## text14 670 2132 58 UDHR_French_UTF-8.txt
## text15 670 2132 58 UDHR_French_WINDOWS-1252.txt
## text16 661 1831 61 UDHR_German_ISO-8859-1.txt
## text17 661 1831 61 UDHR_German_UTF-8.txt
## text18 661 1831 61 UDHR_German_WINDOWS-1252.txt
## text19 735 2193 106 UDHR_Greek_CP1253.txt
## text20 735 2193 106 UDHR_Greek_ISO-8859-7.txt
## text21 735 2193 106 UDHR_Greek_UTF-8.txt
## text22 661 2388 104 UDHR_Hindi_UTF-8.txt
## text23 725 1887 106 UDHR_Icelandic_ISO-8859-1.txt
## text24 725 1887 106 UDHR_Icelandic_UTF-8.txt
## text25 725 1887 106 UDHR_Icelandic_WINDOWS-1252.txt
## text26 578 2482 65 UDHR_Japanese_CP932.txt
## text27 578 2482 65 UDHR_Japanese_ISO-2022-JP.txt
## text28 578 2482 65 UDHR_Japanese_UTF-8.txt
## text29 578 2482 65 UDHR_Japanese_WINDOWS-936.txt
## text30 661 1322 65 UDHR_Korean_ISO-2022-KR.txt
## text31 661 1322 65 UDHR_Korean_UTF-8.txt
## text32 762 1829 62 UDHR_Russian_ISO-8859-5.txt
## text33 762 1829 62 UDHR_Russian_KOI8-R.txt
## text34 762 1829 62 UDHR_Russian_UTF-8.txt
## text35 762 1829 62 UDHR_Russian_WINDOWS-1251.txt
## text36 541 2404 34 UDHR_Thai_UTF-8.txt
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/readtext/* on x86_64 by kbenoit
## Created: Wed May 17 11:48:24 2017
## Notes: