When you specify mmap = TRUE, the function memory-maps the file
instead of reading it into memory directly. In this case, the file
argument must be a character string giving the path to the file, not
a connection object. When you memory-map the file, the operating
system reads data into memory only when it is needed, enabling
you to transparently process large data sets that do not fit into
memory.
Setting mmap = TRUE reduces the memory footprint only for
corpus_json and corpus_text objects. Native R objects
(character, integer, list, logical, and numeric) are always
fully deserialized into memory, with identical results whether
mmap is TRUE or FALSE. To process a large
text corpus with a text field named "text", you should set
text = "text" and mmap = TRUE. Or, to reduce the memory
footprint even further, set simplify = FALSE and
mmap = TRUE.
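
The two settings described above can be sketched as follows. This is a minimal illustration that assumes the corpus package is installed. The temporary file and its two records are made up for the example and are not part of the package.

```r
library(corpus)  # assumes the corpus package is installed

# Write a small newline-delimited JSON file to a temporary path.
# The records here are illustrative only.
path <- tempfile(fileext = ".ndjson")
writeLines(c('{"id": 1, "text": "The first document."}',
             '{"id": 2, "text": "The second document."}'),
           path)

# Memory-map the file and treat the "text" field as text rather than
# as an ordinary character vector. Note that `path` is a character
# string, not a connection.
data <- read_ndjson(path, text = "text", mmap = TRUE)

# Alternatively, skip simplification and keep the result as a
# corpus_json object backed by the memory-mapped file.
raw <- read_ndjson(path, simplify = FALSE, mmap = TRUE)
```

In both calls the file must remain in place for as long as the returned objects are in use.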
One danger in memory-mapping is that if you delete the file
after calling read_ndjson but before processing the data, then
the results will be undefined, and your computer may crash. (On
POSIX-compliant systems like macOS and Linux, there should be no
ill effects to deleting the file. On recent versions of Windows,
the system will not allow you to delete the file as long as the data
is active.)
Another danger in memory-mapping is that if you serialize a
corpus_json object or derived corpus_text object using
saveRDS or another similar function, and then you
deserialize the object, R will attempt to create a new memory-map
using the file argument passed to the original read_ndjson
call. If file is a relative path, then your working directory
at the time of deserialization must agree with your working directory
at the time of the read_ndjson call. You can avoid this
situation by specifying an absolute path as the file argument
(the normalizePath function will convert a relative path
to an absolute path).
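
As a rough sketch of the absolute-path approach, the snippet below resolves a relative file name with normalizePath before it would be passed to read_ndjson. The file name example.ndjson and its contents are hypothetical, and the read_ndjson call is shown only as a comment so that the snippet runs with base R alone.

```r
# Create a small file in a temporary directory and refer to it by a
# relative name (hypothetical file, for illustration only).
old_wd <- setwd(tempdir())
writeLines('{"text": "A document."}', "example.ndjson")

# Resolve the relative name while the working directory is still the
# one that contains the file.
abs_path <- normalizePath("example.ndjson", mustWork = TRUE)

# Restore the previous working directory; abs_path still points at
# the file.
setwd(old_wd)

# Pass the absolute path, so that a later readRDS does not depend on
# the working directory:
# data <- corpus::read_ndjson(abs_path, text = "text", mmap = TRUE)
```

Setting mustWork = TRUE makes normalizePath signal an error if the file does not exist, which surfaces a wrong path before any memory-mapping takes place.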