corpus (version 0.4.0)

read_ndjson: JSON Data Input

Description

Read data from a file in newline-delimited JavaScript Object Notation (NDJSON) format.

Usage

read_ndjson(file, mmap = FALSE, simplify = TRUE)

Arguments

file
the name of the file which the data are to be read from, or a connection (unless mmap is TRUE, see below). The data should be encoded as UTF-8, and each line should be a valid JSON value.
mmap
whether to memory-map the file instead of reading all of its data into memory simultaneously. See the ‘Memory mapping’ section.
simplify
whether to attempt to simplify the type of the return value. For example, if each line of the file stores an integer and simplify is TRUE, then the return value is an integer vector rather than a jsondata object.
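
As a minimal sketch of the simplify behavior described above, the following writes a file whose lines are bare JSON numbers and reads it back with the default simplify = TRUE. The variable names path and x are illustrative only, and the code assumes the corpus package is installed.

```r
library(corpus)

# Each line of the file is a valid JSON value (here, a bare integer).
path <- tempfile()
writeLines(c("1", "2", "3"), path)

# With the default simplify = TRUE, the result should be an atomic vector
# rather than a jsondata object.
x <- read_ndjson(path)
x

file.remove(path)
```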

Memory mapping

When you specify mmap = TRUE, the function memory-maps the file instead of reading it into memory directly. In this case, the file argument must be a character string giving the path to the file, not a connection object. When you memory-map the file, the operating system reads data into memory only when it is needed, enabling you to transparently process large data sets that do not fit into memory.

One danger of memory-mapping is that if you delete the file after calling read_ndjson but before processing the data, the results are undefined and your computer may crash. (On POSIX-compliant systems such as macOS and Linux, deleting the file should have no ill effects. On recent versions of Windows, the system will not allow you to delete the file as long as the data is active; on older versions, the behavior is undefined.)

Another danger of memory-mapping is that if you serialize a jsondata object or derived text object using saveRDS or a similar function and then deserialize it, R will attempt to create a new memory-map using the original file argument passed to the read_ndjson call. If file is a relative path, then your working directory at the time of deserialization must agree with your working directory at the time of the read_ndjson call. You can avoid this situation by specifying an absolute path as the file argument (the normalizePath function will convert a relative path to an absolute path).
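
The absolute-path advice above can be sketched as follows. The directory layout, the file name data.json, and the variable names are hypothetical, chosen only for illustration; simplify = FALSE is used so that the returned object still refers to the memory-mapped file.

```r
library(corpus)

# Create a small NDJSON file inside a temporary directory.
dir <- tempfile()
dir.create(dir)
writeLines(c('{"a": 1}', '{"a": 2}'), file.path(dir, "data.json"))

# Work relative to that directory, as a script often would.
old_wd <- setwd(dir)

# Convert the relative path to an absolute one before memory-mapping.
path <- normalizePath("data.json")
data <- read_ndjson(path, mmap = TRUE, simplify = FALSE)

# Serialize the object, then change the working directory.
rds <- tempfile(fileext = ".rds")
saveRDS(data, rds)
setwd(old_wd)

# Because the stored path is absolute, re-creating the memory-map should
# not depend on the working directory at this point.
restored <- readRDS(rds)

# Clean up: release the memory-maps before removing the files.
rm(data, restored)
gc()
unlink(c(rds, dir), recursive = TRUE)
```

Had the call used the relative path "data.json" instead, the working directory at the time of readRDS would need to match the one in effect when read_ndjson was called.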

Details

This function is the principal means of reading data for processing by the corpus package.

In the default usage, with argument simplify = TRUE, when the lines of the file are records (JSON object literals), the return value from read_ndjson is a data frame. With simplify = FALSE, the result is a jsondata object.
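
To illustrate the difference, here is a short sketch that reads the same file of records both ways. The field names id and ok are made up for the example.

```r
library(corpus)

# Each line is a JSON object literal (a record).
path <- tempfile()
writeLines(c('{"id": 1, "ok": true}',
             '{"id": 2, "ok": false}'), path)

# Default: records are simplified to a data frame.
df <- read_ndjson(path)
is.data.frame(df)
df$id

# Without simplification, the unsimplified JSON data object is returned.
raw <- read_ndjson(path, simplify = FALSE)
class(raw)

file.remove(path)
```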

Examples

    # Memory mapping
    lines <- c('{ "a": 1, "b": true }',
               '{ "b": false, "nested": { "c": 100, "d": false }}',
               '{ "a": 3.14, "nested": { "d": true }}')
    file <- tempfile()
    writeLines(lines, file)
    (data <- read_ndjson(file, mmap = TRUE))

    data$a
    data$b
    data$nested.c
    data$nested.d

    rm("data")
    gc() # force the garbage collector to release the memory-map
    file.remove(file)
