When you specify mmap = TRUE, the function memory-maps the file
instead of reading it into memory directly. In this case, the file
argument must be a character string giving the path to the file, not
a connection object. When you memory-map the file, the operating
system reads data into memory only when it is needed, enabling
you to transparently process large data sets that do not fit into
memory.
In terms of memory usage, enabling mmap = TRUE reduces the
footprint for corpus_json and corpus_text objects; native R objects
(character, integer, list, logical, and numeric) get fully
deserialized to memory and produce identical results regardless of
whether mmap is TRUE or FALSE. To process a large text corpus with a
text field named "text", you should set text = "text" and
mmap = TRUE. Or, to reduce the memory footprint even further, set
simplify = FALSE and mmap = TRUE.
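As a rough illustration of the two settings described above, the following sketch writes a small two-record file and reads it back with memory-mapping enabled. The file contents and the variable names (path, data, raw) are made up for the example, and the calls are guarded so that they only run if the corpus package is installed.

```r
# Write a tiny NDJSON file to a temporary location (illustrative data).
path <- tempfile(fileext = ".json")
writeLines(c('{"id": 1, "text": "A first document."}',
             '{"id": 2, "text": "A second document."}'), path)

if (requireNamespace("corpus", quietly = TRUE)) {
    # Memory-map the file; 'file' must be a path, not a connection.
    # The "text" field is kept as corpus_text rather than character.
    data <- corpus::read_ndjson(path, mmap = TRUE, text = "text")
    print(nrow(data))

    # Skip simplification and keep the result as a corpus_json object.
    raw <- corpus::read_ndjson(path, mmap = TRUE, simplify = FALSE)
    print(class(raw))
}
```

The temporary file is deliberately left in place while the objects are in use, since they may still refer to it.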
One danger in memory-mapping is that if you delete the file
after calling read_ndjson but before processing the data, then
the results will be undefined, and your computer may crash. (On
POSIX-compliant systems like macOS and Linux, there should be no
ill effects from deleting the file. On recent versions of Windows,
the system will not allow you to delete the file while it is still
memory-mapped.)
Another danger in memory-mapping is that if you serialize a
corpus_json object or derived corpus_text object using saveRDS or
another similar function, and then you deserialize the object, R will
attempt to create a new memory-map using the file argument passed to
the original read_ndjson call. If file is a relative path, then your
working directory at the time of deserialization must agree with your
working directory at the time of the read_ndjson call. You can avoid
this situation by specifying an absolute path as the file argument
(the normalizePath function will convert a relative path to an
absolute path).
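A minimal sketch of this workaround is shown below. The file name example.json and the variable names are hypothetical; the path is resolved with normalizePath before it is passed to read_ndjson, and the save and restore step only runs if the corpus package is installed.

```r
# Create a small file and refer to it by a relative path (illustrative).
old_wd <- setwd(tempdir())
writeLines('{"text": "Hello."}', "example.json")
rel_path <- "example.json"

# Resolve the relative path; mustWork = TRUE signals an error if the
# file does not exist.
abs_path <- normalizePath(rel_path, mustWork = TRUE)

# Change the working directory back; the absolute path still resolves.
setwd(old_wd)
print(file.exists(abs_path))

if (requireNamespace("corpus", quietly = TRUE)) {
    # Pass the absolute path so that a later deserialization does not
    # depend on the working directory.
    data <- corpus::read_ndjson(abs_path, mmap = TRUE, simplify = FALSE)
    rds <- tempfile(fileext = ".rds")
    saveRDS(data, rds)
    restored <- readRDS(rds)
}
```

Resolving the path once, before the read_ndjson call, means the stored file argument stays valid from any working directory, as long as the file itself is not moved or deleted.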