Wmisc (version 0.3.4)

Tokenizer: Create a Tokenizer to read string tokens from a memory-mapped file. A more flexible, incremental, easy-to-use variant of the readLines() function.

Description

Reading and processing tokens from a text file is usually done in three steps: load the file, cut it into tokens, and act upon the resulting vector of strings. The Tokenizer aims to simplify and streamline this process when tokens must be processed sequentially.
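For comparison, a minimal sketch of the classic three-step approach next to the incremental Tokenizer loop (the file name is hypothetical):

# classic three-step approach with base R
lines  <- readLines("tokenfile.txt")          # 1. load the file
tokens <- unlist(strsplit(lines, "[ \t]+"))   # 2. cut into tokens
for (s in tokens) print(s)                    # 3. act on the result

# incremental processing with a Tokenizer
tok <- Tokenizer$new("tokenfile.txt")
while (!is.na(s <- tok$nextToken())) print(s) # tokens arrive one at a time
tok$close()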

Usage

# tok <- Tokenizer$new(filename=NA, skipEmptyTokens=TRUE)
# Tokenizer$getDelimiters()
# Tokenizer$setDelimiters(delims)
# Tokenizer$nextToken()
# Tokenizer$close()
# Tokenizer$getOffset()
# Tokenizer$setOffset(offset)

Arguments

filename

The file to open.

skipEmptyTokens

Whether empty tokens ("") shall be skipped or returned.

delims

An integer vector holding the ASCII codes of characters that serve as delimiters. If not set, it defaults to blank, tab, carriage return and linefeed (the last two together form a Windows newline); see the sketch after this argument list.

offset

An integer vector of length >= 2, where the first component holds the upper 32 bits of the offset and the second component holds the lower 32 bits of the offset.
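A short sketch of both arguments, assuming a Tokenizer tok and the defaults named above (the ASCII codes are standard):

# the default delimiters blank, tab, CR and LF as ASCII codes
as.integer(charToRaw(" \t\r\n"))        # 32  9 13 10
tok$setDelimiters(c(32L, 9L, 13L, 10L)) # equivalent to the default

# combining the two 32-bit halves of an offset into one byte position;
# assumes non-negative components, exact below 2^53 (R's double precision)
off <- tok$getOffset()
pos <- off[1] * 2^32 + off[2]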

Value

A new Tokenizer object, backed by a memory-mapped file, with the delimiters set to the default values.

Format

An R6Class generator object.

Methods

new()

Create a new instance of a Tokenizer.

nextToken()

Obtain the next token, that is, the string reaching from the character after the previous delimiter up to the next delimiter from the current list of delimiters. Once the end of the file is reached, it will return NA on all further invocations.

setDelimiters()

Set the list of delimiters. It is given as an integer vector of (extended) ASCII character values, i.e. in the range [0..255].

getDelimiters()

Get the current list of delimiters.

close()

Close the file behind the Tokenizer. Future calls to nextToken() will return NA. It is considered good style to close the file manually to avoid too many open handles. The file will be closed automatically when there are no more references to the Tokenizer and it is garbage collected, or upon exiting the R session.

print()

Prints the name of the currently opened file.

getOffset()

Get the offset of the next token, relative to the beginning of the file.

setOffset()

Set the offset where reading should continue.
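Together, getOffset() and setOffset() allow a position in the file to be saved and restored later; a minimal sketch, assuming a Tokenizer tok opened on some file:

pos <- tok$getOffset()   # remember where the next token starts
tok$nextToken()          # read ahead
tok$setOffset(pos)       # rewind ...
tok$nextToken()          # ... and obtain the same token again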

Final Remarks

While it may be tempting to clone a Tokenizer object to split a file into different tokens from a given start position, this is not supported: file state cannot be synchronized between the clones, leading to unpredictable results when one of the clones closes the underlying shared file.

For efficiency reasons, the Tokenizer will not re-stat the file once it is successfully opened. In particular, a change of the file size after opening can lead to unpredictable behaviour.

The sequence \r\n (carriage return followed by linefeed) is treated as two distinct delimiters; if skipEmptyTokens=FALSE, the empty token between them will be returned. The default setting is TRUE.
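A minimal sketch of that behaviour, using a temporary file; the commented values are those expected from the semantics described above:

f <- tempfile()
writeChar("a\r\nb", f, eos = NULL)
tok <- Tokenizer$new(f, skipEmptyTokens = FALSE)
tok$nextToken()   # "a"
tok$nextToken()   # ""  (the empty token between \r and \n)
tok$nextToken()   # "b"
tok$close()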

Details

While the life-cycle of the Tokenizer still requires the user to act in three phases, it abstracts away the nasties of file access and leverages the powers of the underlying operating system for prefetching. Most of all, bookkeeping is much simpler: the user simply has to keep track of the object returned by the constructor and is free to pass it around between functions without caring for the current state. The Tokenizer will also try to close open files by itself before it is garbage collected.

The Tokenizer is object-oriented, so functions on any instance can be called in an OO style or in a more imperative style:

  • OO style: tok$nextToken()

  • imperative style: nextToken(tok)

Both calls will give the same result.

Examples

## Not run:
tok <- Tokenizer$new("tokenfile.txt")
tok$nextToken()                                    # read the first token
tok$print()                                        # or just 'tok'
tok$getDelimiters()
tok$setDelimiters(c(59L, 0xaL))                    # new delimiters: ';', newline
tok$setDelimiters(as.integer(charToRaw(";\n")))    # the same
tok$nextToken()
tok$setDelimiters(Tokenizer$new()$getDelimiters()) # reset to default
while (!is.na(s <- tok$nextToken())) print(s)      # print the remaining tokens of the file
tok$close()                                        # good style, but not required
## End(Not run)
