cSplit_f: Split Concatenated Cells in a data.frame or a data.table

Description

A variation of the concat.split family of functions designed for large rectangular datasets. This function makes use of fread from the "data.table" package for very speedy splitting of concatenated columns of data.

Usage

cSplit_f(indt, splitCols, sep, drop = TRUE, dotsub = "|")

Arguments

indt

The input data.frame or data.table.

splitCols

The columns that need to be split up.

sep

The character or characters that serve as delimiters within the columns that need to be split up. If different columns use different delimiters, enter the delimiters as a character vector.

drop

Logical. Should the original columns be dropped? Defaults to TRUE.

dotsub

The character that should be substituted as a delimiter if sep = ".". fread does not seem to work nicely with sep = ".", so it needs to be substituted. By default, this function will substitute "." with "|".
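
To make the dotsub substitution concrete, here is a minimal base R sketch of swapping a "." delimiter for "|" before splitting. It is an illustration only and is not taken from the function's source; the vector names `x` and `x_sub` are made up for this example, and it assumes the substitution is a plain literal replacement.

```r
## Illustration only: base R, no packages required.
## `x` and `x_sub` are hypothetical names, not part of splitstackshape.
x <- c("A.B.C", "D.E.F")

## fixed = TRUE treats "." as a literal character, not a regex wildcard.
x_sub <- gsub(".", "|", x, fixed = TRUE)
x_sub

## The substituted strings can then be split on the replacement delimiter.
strsplit(x_sub, "|", fixed = TRUE)
```

If the data already contain "|", a different dotsub value would presumably be needed so that the replacement delimiter does not collide with existing content.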

Value

A data.table.

Details

The general concat.split functions (cSplit in particular) can handle "unbalanced" datasets, where the number of fields in a given column differs from row to row. Because of the nature of fread from the "data.table" package, this function does not support such data: every row of a column to be split must contain the same number of fields.
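
One way to check whether a column is balanced before choosing between this function and cSplit is to count the fields per row. The sketch below uses only base R; `is_balanced` is a hypothetical helper written for this illustration and is not part of splitstackshape.

```r
## Hypothetical helper (not part of splitstackshape): returns TRUE when
## every value in `x` splits into the same number of fields on `sep`.
is_balanced <- function(x, sep) {
  ## fixed = TRUE so that delimiters such as "." are matched literally.
  n_fields <- lengths(strsplit(as.character(x), sep, fixed = TRUE))
  length(unique(n_fields)) == 1L
}

balanced   <- c("a_b_c", "d_e_f", "g_h_i")   # three fields in every row
unbalanced <- c("a_b_c", "d_e", "g")         # three, two and one fields

is_balanced(balanced, "_")
is_balanced(unbalanced, "_")
```

Note that strsplit drops a trailing empty field, so a value such as "a_b_" is counted as two fields; data with trailing delimiters would need extra handling.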

References

http://stackoverflow.com/a/19231054/1270695

Examples


## Sample data. Change `n` to larger values to test on larger data
set.seed(1)
n <- 10
mydf <- data.frame(id = sequence(n))
mydf <- within(mydf, {
  v3 <- do.call(paste, c(data.frame(matrix(sample(
  letters, n*4, TRUE), ncol = 4)), sep = "_"))
  v2 <- do.call(paste, c(data.frame(matrix(sample(
  LETTERS, n*3, TRUE), ncol = 3)), sep = "."))
  v1 <- do.call(paste, c(data.frame(matrix(sample(
  10, n*2, TRUE), ncol = 2)), sep = "-"))
})
mydf

cSplit_f(mydf, splitCols = c("v1", "v2", "v3"), sep = c("-", ".", "_"))
