import.big.data: Load a text file into a big.matrix object

Description

Provides a faster way to import text data into a big.matrix object than bigmemory::read.big.matrix(). The method allows import of a data matrix whose size exceeds RAM limits. It can import from a delimited matrix file with or without row/column names, or from a long-format dataset with no row/column names (these should be specified as separate lists).

Usage

import.big.data(input.fn = NULL, dir = getwd(), long = FALSE,
  rows.fn = NULL, cols.fn = NULL, pref = "", delete.existing = TRUE,
  ret.obj = FALSE, verbose = TRUE, row.names = NULL, col.names = NULL,
  dat.type = "double", ram.gb = 2, hd.gb = 1000, tracker = TRUE)

Arguments

input.fn

character or list; either a single file name for the data, or a list of multiple file names if the data is stored in multiple files. If multiple, then the corresponding row or column names that are unique to each file should be supplied as a list of the same length.

dir

character, the directory containing all files. Or, if files are split between directories, then either include the directories explicitly in the filenames, or multiple directories can be entered as a list, with names 'big', 'ano' and 'col', where big is the location for big.matrix objects to file-back to, 'ano' is the location of row and column names, and 'col' is the location of the raw text datafiles.
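
As an illustration of the named-list form of dir, the following minimal sketch sets up three directories. Only the list names 'big', 'ano' and 'col' come from the description above; the directory names and the file name in the final comment are hypothetical.

```r
# Hypothetical directory layout under tempdir(); only the list names
# 'big', 'ano' and 'col' are taken from the argument description.
base <- file.path(tempdir(), "import_demo")
dirs <- list(
  big = file.path(base, "bigmatrix"),   # big.matrix backing/description files
  ano = file.path(base, "annotation"),  # row and column name files
  col = file.path(base, "rawdata")      # raw text data files
)
# Create any directories that do not exist yet
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
# The list would then be passed as, e.g.,
# import.big.data("mydata.txt", dir = dirs)
```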

long

logical; if TRUE, the data is assumed to be in long format, where each datapoint is on a new line, and the file is structured so that the data for each case/sample/id is consecutive and ordered consistently between samples. A long-format file should contain no row or column names; these should be specified either in the rows.fn/cols.fn file name arguments or in the row.names/col.names vector arguments. If long=FALSE, the dimensions of the file are detected automatically, including whether the file is in long format. However, if you know the data is in long format, specifying this explicitly will be quicker and guarantees the correct import method.
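
The long-format layout described above can be sketched with base R alone. This small example mirrors the way the Examples section writes its long-format test file; the matrix values and the temporary file are illustrative assumptions.

```r
# A small 3 x 2 matrix: 3 rows (variables) by 2 columns (samples)
M <- matrix(c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5), nrow = 3, ncol = 2)
long.fn <- tempfile(fileext = ".txt")
# as.vector() unrolls the matrix column by column, so the values for each
# sample are consecutive, with one datapoint per line
writeLines(as.character(as.vector(M)), con = long.fn)
# The file itself carries no dimensions: the number of rows has to be
# supplied from elsewhere to rebuild the matrix
vals <- scan(long.fn, quiet = TRUE)
M2 <- matrix(vals, nrow = 3)
```

Because the file holds only a single column of values, the matrix shape cannot be recovered from it, which is presumably why the row and column names must be supplied separately for long-format input.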

rows.fn

character, with the name of a text file containing the list of row labels for the dataset. Unnecessary if importing from a matrix with row/column names in the file, or if using the row.names parameter. Must be a list of filenames if row names are split across multiple input.fn files.

cols.fn

character, with the name of a text file containing the list of column labels for the dataset. Unnecessary if importing from a matrix with row/column names in the file, or if using the col.names parameter. Must be a list of filenames if column names are split across multiple input.fn files.

pref

character, optional prefix to use in naming the big.matrix files (description/backing files)

delete.existing

logical; if a big.matrix already exists with the same name as implied by the current 'pref' and 'dir' arguments, then FALSE causes an error to be returned. To overwrite any existing big.matrix file(s) of the same name(s), set this parameter to TRUE (the default shown in Usage).

ret.obj

logical, whether to return a big.matrix.descriptor object (TRUE), or just the file name of the big.matrix description file of the imported dataset.

verbose

logical, whether to display extra information about import progress and notifications.

row.names

character vector, an optional alternative to specifying rows.fn file name(s). Directly specify row names as a single vector, or as a list of vectors if multiple input files with differing row names are being imported.

col.names

character vector, an optional alternative to specifying cols.fn file name(s). Directly specify column names as a single vector, or as a list of vectors if multiple input files with differing column names are being imported.
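
When the columns are split across several input files, the list form of col.names might be built as in this minimal sketch. The labels, group sizes and file names are hypothetical, chosen only to show a list with one character vector per file.

```r
# Hypothetical column labels for a dataset stored in three files
coln <- paste0("ID", 1:10)
# Assign each column to one of three files (2, 5 and 3 columns respectively)
file.id <- factor(rep(1:3, times = c(2, 5, 3)))
colnL <- split(coln, file.id)             # list of 3 character vectors
infs <- paste0("split", 1:3, ".txt")      # one hypothetical input file per list element
stopifnot(length(colnL) == length(infs))  # the list and the file vector must line up
lengths(colnL)                            # number of column names per file
```

The pair would then be passed as input.fn = infs and col.names = colnL, with a single row.names vector shared by all files.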

dat.type

character, the data type being imported. The default is "double", but any type supported by filebacked.big.matrix() can be specified, namely "integer", "char" or "short". Note that these are C-style data types: double=numeric, char=character, integer=integer, short=numeric (although stored with less precision in the C-based big.matrix object).

ram.gb

numeric, the number of gigabytes of free RAM that the import is allowed to use. The higher this amount, the quicker the import will be, because flushing RAM contents to the hard drive more often slows down the process. Setting this lower will reduce the RAM footprint of the import. Note that if you set it too high, the behaviour can't be guaranteed, but R and bigmemory will usually do a reasonable job of managing the memory, and it shouldn't crash your computer.

hd.gb

numeric, the amount of free space on your hard disk; if you set this parameter accurately the function will stop if it believes there is insufficient disk space to import the object you have specified. By default this is set to 1 terabyte, so if importing an object larger than that, you will have to increase this parameter to make it work.
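
To judge whether hd.gb needs to be raised, the size of the resulting backing file can be estimated from the matrix dimensions and the storage type. The helper below is a hypothetical sketch and not part of the package; it assumes the usual C type sizes (8 bytes for double, 4 for integer, 2 for short, 1 for char), and the function's own disk-space check may use different arithmetic.

```r
# Bytes per element for each C-style storage type (assumed standard sizes)
bytes.per.el <- c(double = 8, integer = 4, short = 2, char = 1)
# Hypothetical helper: estimated size of the matrix data in gigabytes
est.gb <- function(n.row, n.col, dat.type = "double") {
  n.row * n.col * bytes.per.el[[dat.type]] / 2^30
}
est.gb(1e6, 1e4)            # 1 million x 10 thousand doubles: about 74.5 GB
est.gb(1e6, 1e4, "short")   # same dimensions stored as short: about 18.6 GB
```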

tracker

logical, whether to display a progress bar for the importing process

Value

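Returns a reference to a single big.matrix containing the imported data, even when the text input is split across multiple files. Depending on ret.obj, this is either a big.matrix.descriptor object or the file name of the big.matrix description file.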

Examples

# NOT RUN {
orig.dir <- getwd(); setwd(tempdir()); # move to temporary dir
# Collate all file names to use in this example #
all.fn <- c("rownames.txt","colnames.txt","functestdn.txt","funclongcol.txt","functest.txt",
 paste("rn",1:3,".txt",sep=""),paste("cn",1:3,".txt",sep=""),
 paste("split",1:3,".txt",sep=""),
 paste("splitmatCd",1:3,".txt",sep=""),paste("splitmatRd",1:3,".txt",sep=""),
 paste("splitmatC",1:3,".txt",sep=""), paste("splitmatR",1:3,".txt",sep=""))
any.already <- file.exists(all.fn)
if(any(any.already)) { 
 warning("files already exist in the working directory with the same names as some example files") }
# SETUP a test matrix and reference files # 
test.size <- 4 # try increasing this number for larger matrices
M <- matrix(runif(10^test.size),ncol=10^(test.size-2)) # normal matrix
write.table(M,sep="\t",col.names=FALSE,row.names=FALSE,
 file="functest.txt",quote=FALSE) # no dimnames
rown <- paste("rs",sample(10:99,nrow(M),replace=TRUE),sample(10000:99999,nrow(M)),sep="")
coln <- paste("ID",sample(1:9,ncol(M),replace=TRUE),sample(10000:99999,ncol(M)),sep="")
r.fn <- "rownames.txt"; c.fn <- "colnames.txt"
Mdn <- M; colnames(Mdn) <- coln; rownames(Mdn) <- rown
# with dimnames
write.table(Mdn,sep="\t",col.names=TRUE,row.names=TRUE,file="functestdn.txt",quote=FALSE) 
prv.large(Mdn)
writeLines(paste(as.vector(M)),con="funclongcol.txt")
in.fn <- "functest.txt"

### IMPORTING SIMPLE 1 FILE MATRIX ##
writeLines(rown,r.fn); writeLines(coln,c.fn)
#1. import without specifying row/column names
ii <- import.big.data(in.fn); prv.big.matrix(ii) # SLOWER without dimnames!
#2. import using row/col names from file
ii <- import.big.data(in.fn,cols.fn="colnames.txt",rows.fn="rownames.txt", pref="p1")
prv.big.matrix(ii)
#3. import by passing colnames/rownames as objects
ii <- import.big.data(in.fn, col.names=coln,row.names=rown, pref="p2")
prv.big.matrix(ii)

### IMPORTING SIMPLE 1 FILE MATRIX WITH DIMNAMES ##
#1. import without specifying row/column names, but they ARE in the file
in.fn <- "functestdn.txt"
ii <- import.big.data(in.fn, pref="p3"); prv.big.matrix(ii)

### IMPORTING SIMPLE 1 FILE MATRIX WITH MISORDERED rownames ##
rown2 <- rown; rown <- sample(rown);
# re-run test3 using in.fn with dimnames
ii <- import.big.data(in.fn, col.names=coln,row.names=rown, pref="p4")
prv.big.matrix(ii)
# restore rownames: 
rown <- rown2

### IMPORTING SIMPLE 1 FILE LONG FORMAT by columns ##
in.fn <- "funclongcol.txt"; #rerun test 2 # 
ii <- import.big.data(in.fn,cols.fn="colnames.txt",rows.fn="rownames.txt", pref="p5")
prv.big.matrix(ii)

### IMPORTING multifile LONG by cols ##
# create the dataset and references
splF <- factor(rep(c(1:3),ncol(M)*c(.1,.5,.4)))
colnL <- split(coln,splF); MM <- as.data.frame(t(M))
Ms2 <- split(MM,splF)
Ms2 <- lapply(Ms2,
   function(X) { X <- t(X); dim(X) <- c(nrow(M),length(X)/nrow(M)); X } )
# preview Ms2 - not run # lapply(Ms2,prv.large)
colfs <- paste("cn",1:length(colnL),".txt",sep="")
infs <- paste("split",1:length(colnL),".txt",sep="")
# create multiple column name files and input files
for(cc in 1:length(colnL)) { writeLines(colnL[[cc]],con=colfs[cc]) }
for(cc in 1:length(infs)) { 
  writeLines(paste(as.vector((Ms2[[cc]]))),con=infs[cc]) }
  
# Now test the import using colnames and rownames lists
ii <- import.big.data(infs, col.names=colnL,row.names=rown, pref="p6")
prv.big.matrix(ii)

### IMPORTING multifile MATRIX by rows ##
# create the dataset and references
splF <- factor(rep(c(1,2,3),nrow(M)*c(.1,.5,.4)))
rownL <- split(rown,splF)
Ms <- split(M,splF)
Ms <- lapply(Ms,function(X) { dim(X) <- c(length(X)/ncol(M),ncol(M)); X } )
# preview Ms - not run # lapply(Ms,prv.large)
# create multiple row name files and input files
rowfs <- paste("rn",1:length(rownL),".txt",sep="")
for(cc in 1:length(rownL)) { writeLines(rownL[[cc]],con=rowfs[cc]) }
infs <- paste("splitmatR",1:length(rownL),".txt",sep="")
for(cc in 1:length(infs)) { 
 write.table(Ms[[cc]],sep="\t",col.names=FALSE,row.names=FALSE,file=infs[cc],quote=FALSE) }
 
# Now test the import using colnames and rownames files
ii <- import.big.data(infs, cols.fn="colnames.txt",rows.fn=rowfs, pref="p7")
prv.big.matrix(ii)

# DELETE ALL FILES ##
unlink(all.fn[!any.already]) # prevent deleting user's files
## many files to clean up! ##
unlink(c("funclongcol.bck","funclongcol.dsc","functest.bck","functest.dsc",
 "functestdn.RData","functestdn.bck","functestdn.dsc","functestdn_file_rowname_list_check_this.txt",
 "split1.bck","split1.dsc","splitmatR1.bck","splitmatR1.dsc", paste0("p",2:7)))
setwd(orig.dir) # reset working dir to original
# }
