read_ndjson: JSON Data Input

Description

Read data from a file in newline-delimited JavaScript Object Notation (NDJSON) format.

Usage

read_ndjson(file, mmap = FALSE, simplify = TRUE, text = NULL)

Arguments

file

the name of the file which the data are to be read from, or a connection (unless mmap is TRUE, see below). The data should be encoded as UTF-8, and each line should be a valid JSON value.

mmap

whether to memory-map the file instead of reading all of its data into memory simultaneously. See the ‘Memory mapping’ section.

simplify

whether to attempt to simplify the type of the return value. For example, if each line of the file stores an integer and simplify is TRUE, then the return value will be an integer vector rather than a corpus_json object.

text

a character vector of string fields to interpret as text instead of character, or NULL to interpret all strings as character.

Value

In the default usage, with argument simplify = TRUE, when the lines of the file are records (JSON object literals), the return value from read_ndjson is a data frame with class c("corpus_frame", "data.frame"). With simplify = FALSE, the result is a corpus_json object.
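
As an illustration of the return types described above, the following sketch reads the same kind of data with and without simplification. It assumes the corpus package is installed and attached; the sample data and the variable names (ints, df, raw) are made up for this example.

```r
library(corpus)

# A file whose lines are plain integers
int_file <- tempfile()
writeLines(c("1", "2", "3"), int_file)
ints <- read_ndjson(int_file)            # simplify = TRUE by default
class(ints)

# A file whose lines are records (JSON object literals)
rec_file <- tempfile()
writeLines(c('{"a": 1, "b": true}', '{"a": 2, "b": false}'), rec_file)
df  <- read_ndjson(rec_file)                     # simplified to a data frame
raw <- read_ndjson(rec_file, simplify = FALSE)   # kept as a corpus_json object
class(df)
class(raw)
```

Inspecting the classes is a quick way to confirm which representation you received before passing the result to other functions.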

Memory mapping

When you specify mmap = TRUE, the function memory-maps the file instead of reading it into memory directly. In this case, the file argument must be a character string giving the path to the file, not a connection object. When you memory-map the file, the operating system reads data into memory only when it is needed, enabling you to transparently process large data sets that do not fit into memory.

In terms of memory usage, enabling mmap = TRUE reduces the footprint for corpus_json and corpus_text objects; native R objects (character, integer, list, logical, and numeric) get fully deserialized to memory and produce identical results regardless of whether mmap is TRUE or FALSE. To process a large text corpus with a text field named "text", you should set text = "text" and mmap = TRUE. Or, to reduce the memory footprint even further, set simplify = FALSE and mmap = TRUE.

One danger in memory-mapping is that if you delete the file after calling read_ndjson but before processing the data, then the results are undefined, and your computer may crash. (In practice, on POSIX-compliant systems like macOS and Linux, there should be no ill effects from deleting the file. On recent versions of Windows, the system will not allow you to delete the file as long as the data is active.)

Another danger in memory-mapping is that if you serialize a corpus_json object or derived corpus_text object using saveRDS or a similar function, and then you deserialize the object, R will attempt to create a new memory-map using the file argument passed to the original read_ndjson call. If file is a relative path, then your working directory at the time of deserialization must agree with your working directory at the time of the read_ndjson call. You can avoid this situation by specifying an absolute path as the file argument (the normalizePath function will convert a relative path to an absolute path).
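
The absolute-path advice above might look like the following sketch. It assumes the corpus package is installed; the file name example.ndjson and the sample records are hypothetical, and the scratch directory is used only to create a relative path for the demonstration.

```r
library(corpus)

# Create a small file, referring to it by a relative path
old_wd <- setwd(tempdir())
writeLines(c('{"id": 1}', '{"id": 2}'), "example.ndjson")

# Resolve the relative path to an absolute one before memory-mapping
abs_path <- normalizePath("example.ndjson", mustWork = TRUE)
setwd(old_wd)   # the working directory no longer matters from here on

data <- read_ndjson(abs_path, mmap = TRUE, simplify = FALSE)

# Round-trip through saveRDS/readRDS; the stored path is absolute, so
# deserialization does not depend on the current working directory
rds <- tempfile(fileext = ".rds")
saveRDS(data, rds)
restored <- readRDS(rds)
print(restored)

# Release the memory-maps before deleting the underlying file
rm(data, restored)
invisible(gc())
file.remove(abs_path, rds)
```

Calling normalizePath with mustWork = TRUE also fails early if the file does not exist, rather than deferring the error to the read.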

Details

This function is the recommended means of reading data for processing by the corpus package.

When the text argument is non-NULL, string data fields with the names given by this argument are decoded as text values, not as character values.
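
A minimal sketch of the text argument, assuming the corpus package is installed and attached. The field names title and text and the sample records are invented for illustration.

```r
library(corpus)

lines <- c('{"title": "First",  "text": "A rose is a rose is a rose."}',
           '{"title": "Second", "text": "To be, or not to be."}')
file <- tempfile()
writeLines(lines, file)

# Decode the "text" field as text; "title" is not listed, so it
# is read as an ordinary character vector
data <- read_ndjson(file, text = "text")
class(data$title)
class(data$text)
```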

Examples

# NOT RUN {
# Memory mapping
lines <- c('{ "a": 1, "b": true }',
           '{ "b": false, "nested": { "c": 100, "d": false }}',
           '{ "a": 3.14, "nested": { "d": true }}')
file <- tempfile()
writeLines(lines, file)
(data <- read_ndjson(file, mmap = TRUE))

data$a
data$b
data$nested.c
data$nested.d

rm("data")
invisible(gc()) # force the garbage collector to release the memory-map
file.remove(file)
# }