When you specify mmap = TRUE, the function memory-maps the file
instead of reading it into memory directly. In this case, the file
argument must be a character string giving the path to the file, not
a connection object. When you memory-map the file, the operating
system reads data into memory only when it is needed, enabling
you to transparently process large data sets that do not fit into
memory.
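
As a minimal sketch of the call described above, assuming the corpus package is installed: the temporary file and its two records are invented for illustration, and the requireNamespace guard is there only so the snippet does nothing when the package is missing.

```r
# Sketch only: assumes the corpus package is installed. The temporary
# file and its two records are made up for illustration.
if (requireNamespace("corpus", quietly = TRUE)) {
    path <- tempfile(fileext = ".json")
    writeLines(c('{"id": 1, "text": "first record"}',
                 '{"id": 2, "text": "second record"}'), path)

    # With mmap = TRUE, file must be a path (a character string),
    # not a connection object
    data <- corpus::read_ndjson(path, mmap = TRUE)
    print(nrow(data))
}
```

Keep the file in place for as long as data is in use, for the reasons given below.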
One danger in memory-mapping is that if you delete the file
after calling read_ndjson but before processing the data, then
the results will be undefined, and your computer may crash. (On
POSIX-compliant systems like macOS and Linux, there should be no
ill effects from deleting the file. On recent versions of Windows,
the system will not allow you to delete the file as long as the data
is active; on older versions, the behavior is undefined.)
Another danger in memory-mapping is that if you serialize a
corpus_json object or derived text object using saveRDS or another
similar function, and then you deserialize the object, then R will
attempt to create a new memory-map using the original file argument
passed to the read_ndjson call. If file is a relative path, then your
working directory at the time of deserialization must agree with your
working directory at the time of the read_ndjson call. You can avoid
this situation by specifying an absolute path as the file argument
(the normalizePath function will convert a relative path to an
absolute path).
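
The following base R sketch illustrates the relative-versus-absolute path point. The file name records.json and the use of tempdir() are assumptions made for the example; the read_ndjson call appears only in a comment, since it requires the corpus package.

```r
# Create a small file and refer to it by a relative path.
# "records.json" is a hypothetical file name used for illustration.
old_wd <- setwd(tempdir())
writeLines('{"id": 1}', "records.json")

rel_path <- "records.json"
# Resolve against the current working directory; fail if the file is missing
abs_path <- normalizePath(rel_path, mustWork = TRUE)

# Change directory, as might happen before the object is deserialized
setwd(old_wd)

# The absolute path no longer depends on the working directory
print(file.exists(abs_path))

# With corpus installed, one would then pass the absolute path:
#   data <- corpus::read_ndjson(abs_path, mmap = TRUE)
```

Passing abs_path rather than rel_path means that an object restored with readRDS can locate the file regardless of the working directory at that time, provided the file itself has not been moved or deleted.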