lineByLine: Line-by-line modification of files

Description

Modifies a data file line by line: each line is read, converted, and written to the modified file. This is especially useful for large datasets, where reading the entire file at once may be slow and require a large amount of memory.

Usage

lineByLine(infile, outfile, linefunc = identity, choose.lines = NULL,
choose.columns = NULL, col.sep = " ", ask = TRUE, 
blank.lines.skip = TRUE, verbose = TRUE, ...)

Arguments

infile

A character string giving the name and path of the file to be modified.

outfile

A character string giving the name of the modified file. The name is interpreted relative to the current working directory unless it includes a path.

linefunc

The function that lineByLine applies to each line. Default is the identity function, which leaves each line unchanged. Users may supply their own line-modifying function; see Details for a thorough description.

choose.lines

A numeric vector of lines to be selected or dropped from infile. Positive values refer to lines to be chosen, whereas negative values refer to lines to be skipped. The vector cannot include both positive and negative values at the same time. If "NULL" (default), all lines are selected.

choose.columns

A numeric vector of columns to be selected (positive values) or skipped (negative values) from infile. The vector cannot include both positive and negative values at the same time. By default, all columns are selected in their original order. When columns are listed explicitly, they are written to the modified file in the order given, so columns can be reordered or duplicated.

col.sep

Specifies the separator that splits the columns in infile. By default, col.sep = " " (space). To split at all types of spaces or blank characters, set col.sep = "[[:space:]]" or col.sep = "[[:blank:]]".

ask

Logical. Default is "TRUE". If set to "FALSE", an already existing outfile will be overwritten without asking.

blank.lines.skip

Logical. If "TRUE" (default), lineByLine ignores blank lines in the input.

verbose

Logical. Default is "TRUE", which means that the line number is displayed for each iteration, in addition to output from linefunc. If choose.columns contains invalid column numbers, this will also be displayed.

...

Further arguments to be passed to linefunc.
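
The effect of col.sep and choose.columns can be pictured with base R alone. The sketch below is not Haplin's implementation; it assumes only that a line is split with a regular-expression separator and then indexed like an ordinary vector. The sample line and variable names are made up for illustration.

```r
## A made-up input line that mixes a space and a tab
line <- "id1 A\tC G"

## Splitting at a single space leaves the tab inside one field
by_space <- strsplit(line, split = " ")[[1]]

## Splitting at any whitespace character separates every field
by_any <- strsplit(line, split = "[[:space:]]")[[1]]

## Positive indices select columns and may reorder or duplicate them
picked <- by_any[c(3, 1, 1)]

## Negative indices drop columns and keep the rest in order
dropped <- by_any[-c(1, 2)]
```

Here by_space has three fields while by_any has four, which is why a file with mixed separators may need col.sep = "[[:space:]]".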

Value

Invisibly returns the number of lines read. The main result is the modified file.

Details

When reading large data files, functions such as read.table can use a large amount of memory and be extremely time consuming. Instead of reading the entire file at once, lineByLine reads one line at a time, modifies the line using linefunc, and then writes the line to outfile.

The user may specify his or her own line-converting function. This function must take the argument x, a character vector representing a single line of the file, split at col.sep. Additional arguments may be included; they are supplied through the ... argument of lineByLine. The function must return the modified vector. If verbose equals "TRUE", any output printed by the function is displayed together with the line number, which can be used to monitor progress. The framework of the line-modifying function may look something like this:

lineModify <- function(x){
    .xnew <- x

    ## Define any modifications, for instance recoding missing values
    ## in a dataset from NA to 0:
    .xnew[is.na(.xnew)] <- 0

    ## Just to monitor progress, display, for instance, 10 first elements,
    ## without newline:
    cat(paste(.xnew[1:min(10, length(.xnew))], collapse = " "))

    ## Return converted vector
    return(.xnew)
}

See Haplin:::lineConvert for an additional example of a line-modifying function.
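
The read-convert-write cycle described above can be sketched in base R. This is a minimal illustration of the general technique, not the Haplin source: processLines is a hypothetical helper, it always joins the converted fields with a single space, and it leaves out line and column selection, blank-line handling, and the overwrite prompt.

```r
## Hypothetical helper, not part of Haplin:
## read, convert, and write one line at a time
processLines <- function(infile, outfile, linefunc = identity,
                         col.sep = " ", ...) {
  inCon <- file(infile, open = "r")
  on.exit(close(inCon), add = TRUE)
  outCon <- file(outfile, open = "w")
  on.exit(close(outCon), add = TRUE)

  nRead <- 0
  repeat {
    line <- readLines(inCon, n = 1)
    if (length(line) == 0) break   # end of file reached
    nRead <- nRead + 1
    ## Split the line into fields, convert them, and write them back out
    fields <- strsplit(line, split = col.sep)[[1]]
    fields <- linefunc(fields, ...)
    writeLines(paste(fields, collapse = " "), outCon)
  }
  invisible(nRead)
}

## Small demonstration on temporary files: reverse the fields of each line
tmpIn <- tempfile(fileext = ".txt")
tmpOut <- tempfile(fileext = ".txt")
writeLines(c("a b c", "d e f"), tmpIn)
n <- processLines(tmpIn, tmpOut, linefunc = rev)
result <- readLines(tmpOut)
```

Because only one line is held in memory at a time, the memory footprint stays small regardless of the size of the input file.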

References

Web Site: http://folk.uib.no/gjessing/genetics/software/haplin/

Examples

## Not run: 
# 
# ## Extract the first ten columns from "myfile.txt", 
# ## without reordering
# lineByLine(infile = "myfile.txt", outfile = "myfile_modified.txt", 
# choose.columns = c(1:10))
# 
# ## End(Not run)
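
A further sketch shows a user-defined line-modifying function with extra arguments, which would be passed on through the ... argument of lineByLine. The function recodeMissing, its arguments from and to, and the file names are hypothetical, and the sketch assumes that missing values appear as literal strings in the split line.

```r
## Hypothetical line-modifying function with two extra arguments
recodeMissing <- function(x, from = "NA", to = "0") {
  ## x is a character vector holding the fields of one line
  x[x == from] <- to
  x
}

## The function can be checked on a single split line before use
checked <- recodeMissing(c("1", "NA", "2"))

## Not run:
# lineByLine(infile = "myfile.txt", outfile = "myfile_recoded.txt",
#   linefunc = recodeMissing, from = "-9", to = "NA",
#   ask = FALSE, verbose = FALSE)
## End(Not run)
```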