data.table

Extension of Data.frame

Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development.

Name Description
IDateTime Integer based date class
copy Copy an entire object
data.table-package Enhanced data.frame
chmatch Faster match of character vectors
rleid Generate run-length type group id
as.xts.data.table Efficient data.table to xts conversion
rbindlist Makes one data.table from a list of many
last Last item of an object
tables Display all objects of class 'data.table'
setattr Set attributes of objects by reference
fread Fast and friendly file finagler
shift Fast lead/lag for vectors and lists
truelength Over-allocation access
like Convenience function for calling regexpr.
patterns Regex patterns to extract columns from data.table
as.data.table.xts Efficient xts to as.data.table conversion
tstrsplit strsplit and transpose the resulting list efficiently
address Address in RAM of a variable
timetaken Pretty print of time taken
transpose Efficient transpose of list
:= Assignment by reference
transform.data.table Data table utilities
setcolorder Fast column reordering of a data.table by reference
subset.data.table Subsetting data.tables
J Creates a Join data table
all.equal Equality Test Between Two Data Tables
na.omit.data.table Remove rows with missing values on columns specified
duplicated Determine Duplicate Rows
test.data.table Runs a set of tests.
foverlaps Fast overlap joins
data.table-class S4 Definition for data.table
setNumericRounding Change or turn off numeric rounding
dcast.data.table Fast dcast for data.table
frank Fast rank
between Convenience function for range subset logic.
setkey Create key on a data table
setDF Convert a data.table to data.frame by reference
setDT Convert lists and data.frames to data.table by reference
melt.data.table Fast melt for data.table
setorder Fast row reordering of a data.table by reference
merge Merge Two Data Tables

Readme

Current stable release (always even) : v1.9.6 on CRAN, released 19th Sep 2015.
Development version (always odd): v1.9.7 on GitHub. How to install?

Introduction, installation, documentation, benchmarks etc: HOMEPAGE

Guidelines for filing issues / pull requests: Please see the project's Contribution Guidelines.


Changes in v1.9.6 (on CRAN 19 Sep 2015)

NEW FEATURES

  1. fread

    • passes showProgress=FALSE through to download.file() (as quiet=TRUE). Thanks to a pull request from Karl Broman and Richard Scriven for filing the issue, #741.
    • accepts dec=',' (and other non-'.' decimal separators), #917. A new paragraph has been added to ?fread. On Windows this should just-work. On Unix it may just-work but if not you will need to read the paragraph for an extra step. In case it somehow breaks dec='.', this new feature can be turned off with options(datatable.fread.dec.experiment=FALSE).
    • Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
    • gains check.names argument, with default value FALSE. When TRUE, it uses the base function make.unique() to ensure that the column names of the data.table read in are all unique. Thanks to David Arenburg for filing #1027.
    • gains encoding argument. Acceptable values are "unknown", "UTF-8" and "Latin-1" with default value of "unknown". Closes #563. Thanks to @BenMarwick for the original report and to the many requests from others, and Q on SO.
    • gains col.names argument, and is similar to base::read.table(). Closes #768. Thanks to @dardesta for filing the FR.
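A minimal sketch of two of the new fread arguments, assuming a small made-up inline CSV (the column names `id` and `label` are illustrative only):

```r
library(data.table)

# Hypothetical inline CSV; a string containing a newline is read as data.
csv <- "id,label\n1,x\n2,y\n"

# stringsAsFactors = TRUE converts character columns to factors.
DT1 <- fread(csv, header = TRUE, stringsAsFactors = TRUE)

# col.names supplies column names when the input has no header row.
DT2 <- fread("1,x\n2,y\n", header = FALSE, col.names = c("id", "label"))
```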
  2. DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).

  3. na.omit method for data.table is rewritten in C, for speed. It's ~11x faster on bigger data; see examples under ?na.omit. It also gains two additional arguments: a) cols accepts column names (or numbers) on which to check for missing values; b) invert, when TRUE, returns the rows with any missing values instead. Thanks to the suggestion and PR from @matthieugomez.
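As a rough illustration of the cols and invert arguments (the table below is invented for the example):

```r
library(data.table)

DT <- data.table(x = c(1, NA, 3), y = c("a", "b", NA))

complete_all <- na.omit(DT)                            # rows with no NA in any column
complete_x   <- na.omit(DT, cols = "x")                # only check column x
missing_x    <- na.omit(DT, cols = "x", invert = TRUE) # rows where x is NA
```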

  4. New function shift() implements fast lead/lag of vectors, lists, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.
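A small sketch of shift() on a plain vector and within groups; the table and the column name `prev` are made up for illustration:

```r
library(data.table)

x <- 1:5
lagged <- shift(x, 1L)                 # NA 1 2 3 4
led    <- shift(x, 1L, type = "lead")  # 2 3 4 5 NA

# Lag within each id, assigned by reference.
DT <- data.table(id = c("a", "a", "b", "b"), val = 1:4)
DT[, prev := shift(val, 1L), by = id]
```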

  5. frank() is now implemented. It's much faster than base::rank and does more. It accepts vectors, lists with all elements of equal lengths, data.frames and data.tables, and optionally takes a cols argument. In addition to implementing all the ties.method methods available from base::rank, it also implements dense rank. It is also capable of calculating ranks by ordering column(s) in ascending or descending order. See ?frank for more. Closes #760 and #771
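To make the ranking options concrete, here is a hedged sketch with an invented `score` vector; the `-score` form for descending order follows the usage shown in ?frank:

```r
library(data.table)

score <- c(10, 20, 10, 30)

r_min   <- frank(score, ties.method = "min")    # 1 3 1 4
r_dense <- frank(score, ties.method = "dense")  # 1 2 1 3

# Dense rank in descending order of score, on a data.table.
DT <- data.table(score = score)
r_desc <- frank(DT, -score, ties.method = "dense")  # 3 2 3 1
```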

  6. rleid(), a convenience function for generating a run-length type id column to be used in grouping operations is now implemented. Closes #686. Check ?rleid examples section for usage scenarios.
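One way this can be used for grouping by consecutive runs, sketched with a made-up `state` column:

```r
library(data.table)

DT <- data.table(state = c("up", "up", "down", "up"), v = 1:4)

# Each run of identical consecutive values gets its own id: 1 1 2 3.
DT[, run := rleid(state)]

# Aggregate per run rather than per distinct value of state.
per_run <- DT[, .(total = sum(v)), by = .(state, run)]
```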

  7. Efficient conversion of xts to data.table. Closes #882. Check examples in ?as.xts.data.table and ?as.data.table.xts. Thanks to @jangorecki for the PR.

  8. rbindlist gains idcol argument which can be used to generate an index column. If idcol=TRUE, the column is automatically named .id. Instead you can also provide a column name directly. If the input list has no names, indices are automatically generated. Closes #591. Also thanks to @KevinUshey for filing #356.
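A brief sketch of idcol with a named and an unnamed list (the list names `jan` and `feb` and the column name `month` are invented):

```r
library(data.table)

l <- list(jan = data.table(x = 1:2), feb = data.table(x = 3L))

named   <- rbindlist(l, idcol = "month")   # id column named by the caller
default <- rbindlist(l, idcol = TRUE)      # id column named .id
indexed <- rbindlist(unname(l), idcol = TRUE)  # no list names: integer indices
```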

  9. A new helper function uniqueN is now implemented. It is equivalent to length(unique(x)) but much faster. It handles atomic vectors, lists, data.frames and data.tables as input and returns the number of unique rows. Closes #884. Gains by argument. Closes #1080. Closes #1224. Thanks to @DavidArenburg, @kevinmistry and @jangorecki.
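For instance, on a vector and on a small invented table:

```r
library(data.table)

n_vec <- uniqueN(c(1, 1, 2, 3, 3))   # same as length(unique(x))

DT <- data.table(a = c(1, 1, 2), b = c("x", "y", "y"))
n_rows <- uniqueN(DT)             # number of unique rows
n_a    <- uniqueN(DT, by = "a")   # unique values of column a only
```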

  10. Implemented transpose() to transpose a list and tstrsplit which is a wrapper for transpose(strsplit(...)). This is particularly useful in scenarios where a column has to be split and the resulting list has to be assigned to multiple columns. See ?transpose and ?tstrsplit, #1025 and #1026 for usage scenarios. Closes both #1025 and #1026 issues.

    • Implemented type.convert as suggested by Richard Scriven. Closes #1094.
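The column-splitting scenario described above might look like this; the `code` column and the new column names are hypothetical:

```r
library(data.table)

# transpose() turns a list of rows into a list of columns.
tl <- transpose(list(1:2, 3:4))   # list(c(1, 3), c(2, 4))

# Split one column into two; type.convert = TRUE converts "1", "2" to integers.
DT <- data.table(code = c("A_1", "B_2"))
DT[, c("letter", "num") := tstrsplit(code, "_", fixed = TRUE, type.convert = TRUE)]
```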
  11. melt.data.table

    • can now melt into multiple columns by providing a list of columns to measure.vars argument. Closes #828. Thanks to Ananda Mahto for the extended email discussions and ideas on generating the variable column.
    • also retains attributes wherever possible. Closes #702 and #993. Thanks to @richierocks for the report.
    • Added patterns.Rd. Closes #1294. Thanks to @MichaelChirico.
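A sketch of melting into multiple columns, assuming a made-up wide table with `dob_*` and `gender_*` columns and using patterns() to build the list of measure columns:

```r
library(data.table)

DT <- data.table(
  id       = 1:2,
  dob_1    = c("2001-01-01", "2002-02-02"),
  dob_2    = c("2003-03-03", "2004-04-04"),
  gender_1 = c("M", "F"),
  gender_2 = c("F", "M")
)

# Each pattern selects one group of columns and yields one value column.
long <- melt(DT, id.vars = "id",
             measure.vars = patterns("^dob_", "^gender_"),
             value.name = c("dob", "gender"))
```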
  12. .SDcols

    • understands ! now, i.e., DT[, .SD, .SDcols=!"a"] now works, and is equivalent to DT[, .SD, .SDcols = -c("a")]. Closes #1066.
    • accepts logical vectors as well. If length is smaller than number of columns, the vector is recycled. Closes #1060. Thanks to @StefanFritsch.
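For illustration, with an invented three-column table (the logical vector here has full length, so no recycling is involved):

```r
library(data.table)

DT <- data.table(a = 1:2, b = 3:4, c = 5:6)

not_a <- DT[, .SD, .SDcols = !"a"]                   # every column except a
by_lgl <- DT[, .SD, .SDcols = c(TRUE, FALSE, TRUE)]  # columns a and c
```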
  13. dcast can now:

    • cast multiple value.var columns simultaneously. Closes #739.
    • accept multiple functions under fun.aggregate. Closes #716.
    • support optional column prefixes as mentioned under this SO post. Closes #862. Thanks to @JohnAndrews.
    • work with undefined variables directly in formula. Closes #1037. Thanks to @DavidArenburg for the MRE.
    • Naming conventions on multiple columns changed according to #1153. Thanks to @MichaelChirico for the FR.
    • also has a sep argument with default _ for backwards compatibility. #1210. Thanks to @dbetebenner for the FR.
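A minimal sketch of casting two value.var columns at once, on an invented table; the resulting names combine the value column and the cast level with the default separator:

```r
library(data.table)

DT <- data.table(id  = c(1, 1, 2, 2),
                 grp = c("x", "y", "x", "y"),
                 a   = 1:4,
                 b   = 5:8)

# One output column per (value.var, grp) combination.
wide <- dcast(DT, id ~ grp, value.var = c("a", "b"))
```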
  14. .SDcols and with=FALSE understand colA:colB form now. That is, DT[, lapply(.SD, sum), by=V1, .SDcols=V4:V6] and DT[, V5:V7, with=FALSE] works as intended. This is quite useful for interactive use. Closes #748 and #1216. Thanks to @carbonmetrics, @jangorecki and @mtennekes.
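The range form can be sketched as follows, with made-up columns `v1` to `v3`:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), v1 = 1:3, v2 = 4:6, v3 = 7:9)

# Sum the columns from v1 through v2 within each group.
sums <- DT[, lapply(.SD, sum), by = g, .SDcols = v1:v2]

# Select the columns from v2 through v3.
cols <- DT[, v2:v3, with = FALSE]
```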

  15. setcolorder() and setorder() work with data.frames too. Closes #1018.

  16. as.data.table.* and setDT argument keep.rownames can take a column name as well. When keep.rownames=TRUE, the column will still automatically be named rn. Closes #575.

  17. setDT gains a key argument so that setDT(X, key="a") would convert X to a data.table by reference and key by the columns specified. Closes #1121.
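Both the keep.rownames and key arguments can be tried on small invented data.frames like these:

```r
library(data.table)

# Convert by reference and key in one step.
df <- data.frame(a = c(3, 1, 2), b = c("x", "y", "z"))
setDT(df, key = "a")

# keep.rownames as a column name, or TRUE for the default name rn.
m <- data.frame(v = 1:2, row.names = c("r1", "r2"))
named   <- as.data.table(m, keep.rownames = "name")
default <- as.data.table(m, keep.rownames = TRUE)
```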

  18. setDF also converts list of equal length to data.frame by reference now. Closes #1132.

  19. CJ gains logical unique argument with default FALSE. If TRUE, unique values of vectors are automatically computed and used. This is convenient, for example, DT[CJ(a, b, c, unique=TRUE)] instead of doing DT[CJ(unique(a), unique(b), unique(c))]. Ultimately, unique = TRUE will be default. Closes #1148.
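A short sketch of the difference, using invented input vectors with a repeated value:

```r
library(data.table)

x <- c(2, 1, 1)
y <- c("a", "b")

all_combos    <- CJ(x, y)                 # duplicates in x are kept: 6 rows
unique_combos <- CJ(x, y, unique = TRUE)  # unique(x) is used: 4 rows
```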

  20. on= syntax: data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables. Closes #1130 partly.
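An ad hoc join without keys might look like this; the tables and column names are made up for the example:

```r
library(data.table)

DT1 <- data.table(x = c("a", "b", "c"), v = 1:3)
DT2 <- data.table(y = c("b", "c", "d"), w = 4:6)

# Join column y of DT2 with column x of DT1; unmatched rows of DT2 get NA in v.
joined <- DT1[DT2, on = c(x = "y")]

# Drop rows of DT2 that have no match in DT1.
matched <- DT1[DT2, on = c(x = "y"), nomatch = 0L]
```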

  21. merge.data.table gains arguments by.x and by.y. Closes #637 and #1130. No copies are made even when the specified columns aren't key columns in data.tables, and it is therefore much faster and more memory efficient. Thanks to @blasern for the initial PRs. Also gains logical argument sort (like base R). Closes #1282.
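A hedged sketch of by.x and by.y on two invented tables whose join columns have different names:

```r
library(data.table)

DT1 <- data.table(x = c("a", "b", "c"), v = 1:3)
DT2 <- data.table(y = c("b", "c", "d"), w = 4:6)

inner <- merge(DT1, DT2, by.x = "x", by.y = "y")              # matching rows only
outer <- merge(DT1, DT2, by.x = "x", by.y = "y", all = TRUE)  # all rows from both
```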

  22. setDF() gains rownames argument for ready conversion to a data.frame with user-specified rows. Closes #1320. Thanks to @MichaelChirico for the FR and PR.

  23. print.data.table gains quote argument (default=FALSE). This option surrounds all printed elements with quotes, which helps make whitespace more evident. Closes #1177; thanks to @MichaelChirico for the PR.

  24. [.data.table now accepts single column numeric matrix in i argument the same way as data.frame. Closes #826. Thanks to @jangorecki for the PR.

  25. setDT() gains check.names argument paralleling that of fread, data.table, and base functionality, allowing poorly declared objects to be converted to tidy data.tables by reference. Closes #1338; thanks to @MichaelChirico for the FR/PR.

BUG FIXES

  1. if (TRUE) DT[,LHS:=RHS] no longer prints, #869 and #1122. Tests added. To get this to work we've had to live with one downside: if a := is used inside a function with no DT[] before the end of the function, then the next time DT or print(DT) is typed at the prompt, nothing will be printed. A repeated DT or print(DT) will print. To avoid this: include a DT[] after the last := in your function. If that is not possible (e.g., it's not a function you can change) then DT[] at the prompt is guaranteed to print. As before, adding an extra [] on the end of a := query is a recommended idiom to update and then print; e.g. > DT[,foo:=3L][]. Thanks to Jureiss and Jan Gorecki for reporting.

  2. DT[FALSE,LHS:=RHS] no longer prints either, #887. Thanks to Jureiss for reporting.

  3. := no longer prints in knitr for consistency with behaviour at the prompt, #505. Output of a test knit("knitr.Rmd") is now in data.table's unit tests. Thanks to Corone for the illustrated report.

  4. knitr::kable() works again without needing to upgrade from knitr v1.6 to v1.7, #809. Packages which evaluate user code and don't wish to import data.table need to be added to data.table:::cedta.pkgEvalsUserCode and now only the eval part is made data.table-aware (the rest of such package's code is left data.table-unaware). data.table:::cedta.override is now empty and will be deprecated if no need for it arises. Thanks to badbye and Stephanie Locke for reporting.

  5. fread():

    • doubled quotes ("") inside quoted fields are now handled, including when immediately followed by an embedded newline. Thanks to James Sams for reporting, #489.
    • quoted fields with embedded newlines in the lines used to detect types are now handled, #810. Thanks to Vladimir Sitnikov for the scrambled data file which is now included in the test suite.
    • when detecting types in the middle and end of the file, if the jump lands inside a quoted field with (possibly many) embedded newlines, this is now detected.
    • if the file doesn't exist the error message is clearer (#486)
    • system commands are now checked to contain at least one space
    • sep="." now works (when dec!="."), #502. Thanks to Ananda Mahto for reporting.
    • better error message if quoted field is missing an end quote, #802. Thanks to Vladimir Sitnikov for the sample file which is now included in the test suite.
    • providing sep which is not present in the file now reads as if sep="\n" rather than 'sep not found', #738. Thanks to Adam Kennedy for explaining the use-case.
    • seg fault with errors over 1,000 characters (when long lines are included) is fixed, #802. Thanks again to Vladimir Sitnikov.
    • Missing integer64 values are properly assigned NAs. Closes #488. Thanks to @PeterStoyanov and @richierocks for the report.
    • Column headers with empty strings aren't skipped anymore. Closes #483. Thanks to @RobyJoehanes and @kforner.
    • Detects separator correctly when commas also exist in text fields. Closes #923. Thanks to @raymondben for the report.
    • NA values in NA inflated file are read properly. Closes #737. Thanks to Adam Kennedy.
    • correctly handles na.strings argument for all types of columns - it detects possible NA values without coercion to character, like in base read.table. Fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.
    • deals with quotes more robustly. When reading a quoted field fails, it re-attempts to read the field as if it wasn't quoted. This helps read in those fields that might have unbalanced quotes without erroring immediately, thereby closing issues #568, #1256, #1077, #1079 and #1095. Thanks to @Synergist, @daroczig, @geotheory and @rsaporta for the reports.
    • gains argument strip.white which is TRUE by default (unlike base::read.table). All unquoted columns' leading and trailing white spaces are automatically removed. If FALSE, only trailing spaces of the header are removed. Closes #1113, #1035, #1000, #785, #529 and #956. Thanks to @dmenne, @dpastoor, @GHarmata, @gkalnytskyi, @renqian, @MatthewForrest, @fxi and @heraldb.
    • doesn't warn about empty lines when 'nrow' argument is specified and that many rows are read properly. Thanks to @richierocks for the report. Closes #1330.
    • doesn't error/warn about not being able to read last 5 lines when 'nrow' argument is specified. Thanks to @robbig2871. Closes #773.
  6. Auto indexing:

    • DT[colA == max(colA)] now works again without needing options(datatable.auto.index=FALSE). Thanks to Jan Gorecki and kaybenleroll, #858. Test added.
    • DT[colA %in% c("id1","id2","id2","id3")] now ignores the RHS duplicates (as before, consistent with base R) without needing options(datatable.auto.index=FALSE). Thanks to Dayne Filer for reporting.
    • If DT contains a column class (happens to be a reserved attribute name in R) then DT[class=='a'] now works again without needing options(datatable.auto.index=FALSE). Thanks to sunnyghkm for reporting, #871.
    • := and set* now drop secondary keys (new in v1.9.4) so that DT[x==y] works again after a := or set* without needing options(datatable.auto.index=FALSE). Only setkey() was dropping secondary keys correctly. 23 tests added. Thanks to user36312 for reporting, #885.
    • Automatic indices are not created on .SD so that dt[, .SD[b == "B"], by=a] works correctly. Fixes #958. Thanks to @azag0 for the nice reproducible example.
    • i-operations resulting in 0-length rows ignore j on subsets using auto indexing. Closes #1001. Thanks to @Gsee.
    • POSIXct type columns work as expected with auto indexing. Closes #955. Thanks to @GSee for the minimal report.
    • Auto indexing with ! operator, for e.g., DT[!x == 1] works as intended. Closes #932. Thanks to @matthieugomez for the minimal example.
    • While fixing #932, issues on subsetting NA were also spotted and fixed, for e.g., DT[x==NA] or DT[!x==NA].
    • Works fine when RHS is of list type - quite unusual operation but could happen. Closes #961. Thanks to @Gsee for the minimal report.
    • Auto indexing errored in some cases when LHS and RHS were not of same type. This is fixed now. Closes #957. Thanks to @GSee for the minimal report.
    • DT[x == 2.5] where x is integer type resulted in val being coerced to integer (for binary search) and therefore returned incorrect result. This is now identified using the function isReallyReal() and if so, auto indexing is turned off. Closes #1050.
    • Auto indexing errored during DT[x %in% val] when val has some values not present in x. Closes #1072. Thanks to @CarlosCinelli for asking on StackOverflow.
  7. as.data.table.list with list input having 0-length items, e.g. x = list(a=integer(0), b=3:4): as.data.table(x) recycles item a with NAs to fit the length of the longer column b (length=2), as before, but now with an additional warning message that the item has been recycled with NA. Closes #847. Thanks to @tvinodr for the report. This was a regression from 1.9.2.

  8. DT[i, j] when i returns all FALSE and j contains some length-0 values (ex: integer(0)) now returns an empty data.table as it should. Closes #758 and #813. Thanks to @tunaaa and @nigmastar for the nice reproducible reports.

  9. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.

  10. names<-.data.table works as intended on data.table unaware packages with R v3.1.0+. Closes #476 and #825. Thanks to ezbentley for reporting here on SO and to @narrenfrei.

  11. .EACHI is now an exported symbol (just like .SD,.N,.I,.GRP and .BY already were) so that packages using data.table and .EACHI pass R CMD check with no NOTE that this symbol is undefined. Thanks to Matt Bannert for highlighting.

  12. Some optimisations of .SD in j were done in 1.9.4, refer to #735. Due to an oversight, j-expressions of the form c(lapply(.SD, ...), list(...)) were optimised improperly. This is now fixed. Thanks to @mmeierer for filing #861.

  13. j-expressions in DT[, col := x$y()] (or) DT[, col := x[[1]]()] are now (re)constructed properly. Thanks to @ihaddad-md for reporting. Closes #774.

  14. format.ITime now handles negative values properly. Closes #811. Thanks to @StefanFritsch for the report along with the fix!

  15. Compatibility with big endian machines (e.g., SPARC and PowerPC) is restored. Most Windows, Linux and Mac systems are little endian; type .Platform$endian to confirm. Thanks to Gerhard Nachtmann for reporting and the QEMU project for their PowerPC emulator.

  16. DT[, LHS := RHS] where RHS is of the form eval(parse(text = foo[1])) referring to columns in DT is now handled properly. Closes #880. Thanks to tyner.

  17. subset handles extracting duplicate columns in consistency with data.table's rule - if a column name is duplicated, then accessing that column using column number should return that column, whereas accessing by column name (due to ambiguity) will always extract the first column. Closes #891. Thanks to @jjzz.

  18. rbindlist handles combining levels of data.tables with both ordered and unordered factor columns properly. Closes #899. Thanks to @ChristK.

  19. Updating .SD by reference using set also errors appropriately now; similar to :=. Closes #927. Thanks to @jrowen for the minimal example.

  20. X[Y, .N] returned the same result as X[Y, .N, nomatch=0L] when Y contained rows that have no matches in X. Fixed now. Closes #963. Thanks to this SO post from @Alex which helped discover the bug.

  21. data.table::dcast handles levels in factor columns properly when drop = FALSE. Closes #893. Thanks to @matthieugomez for the great minimal example.

  22. [.data.table subsets complex and raw type objects again. Thanks to @richierocks for the nice minimal example. Closes #982.

  23. Fixed a bug in the internal optimisation of j-expression with more than one lapply(.SD, function(..) ..) as illustrated here on SO. Closes #985. Thanks to @jadaliha for the report and to @BrodieG for the debugging on SO.

  24. mget fetches columns from the default environment .SD when called from within the frame of DT. That is, DT[, mget(cols)], DT[, lapply(mget(cols), sum), by=.] etc. work as intended. Thanks to @Roland for filing this issue. Closes #994.

  25. foverlaps() did not find overlapping intervals correctly on numeric ranges in a special case where both start and end intervals had 0.0. This is now fixed. Thanks to @tdhock for the reproducible example. Closes #1006 partly.

  26. When performing rolling joins, keys are set only when we can be absolutely sure. Closes #1010, which explains cases where keys should not be retained.

  27. Rolling joins with -Inf and Inf are handled properly. Closes #1007. Thanks to @tdhock for filing #1006 which led to the discovery of this issue.

  28. Overlapping range joins with -Inf and Inf and 0.0 in them are handled properly now. Closes #1006. Thanks to @tdhock for filing the issue with a nice reproducible example.

  29. Fixed two segfaults in shift() when the number of rows in x is less than the value of n. Closes #1009 and #1014. Thanks to @jangorecki and @ashinm for the reproducible reports.

  30. Attributes are preserved for sum() and mean() when fast internal (GForce) implementations are used. Closes #1023. Thanks to @DavidArenburg for the nice reproducible example.

  31. lapply(l, setDT) is handled properly now; over-allocation isn't lost. Similarly, for (i in 1:k) setDT(l[[i]]) is handled properly as well. Closes #480.

  32. rbindlist stack imbalance on all NULL list elements is now fixed. Closes #980. Thanks to @ttuggle.

  33. List columns can be assigned to columns of factor type by reference. Closes #936. Thanks to @richierocks for the minimal example.

  34. After setting the datatable.alloccol option, creating a data.table with more than the set truelength resulted in error or segfault. This is now fixed. Closes #970. Thanks to @caneff for the nice minimal example.

  35. Update by reference using := after loading from disk where the data.table exists within a local environment now works as intended. Closes #479. Thanks to @ChongWang for the minimal reproducible example.

  36. Issues on merges involving factor columns with NA and merging factor with character type with non-identical levels are both fixed. Closes #499 and #945. Thanks to @AbielReinhart and @stewbasic for the minimal examples.

  37. as.data.table(ll) returned a data.table with 0-rows when the first element of the list has 0-length, for e.g., ll = list(NULL, 1:2, 3:4). This is now fixed by removing those 0-length elements. Closes #842. Thanks to @Rick for the nice minimal example.

  38. as.data.table.factor redirects to as.data.table.matrix when input is a matrix, but also of type factor. Closes #868. Thanks to @mgahan for the example.

  39. setattr now returns an error when trying to set data.table and/or data.frame as class to a non-list type object (ex: matrix). Closes #832. Thanks to @Rick for the minimal example.

  40. data.table(table) works as expected. Closes #1043. Thanks to @rnso for the SO post.

  41. Joins and binary search based subsets of the form x[i] where x's key column is integer and i a logical column threw an error before. This is now fixed by converting the logical column to integer type and then performing the join, so that it works as expected.

  42. When the by expression is, for example, by = x %% 2, data.table tries to automatically extract meaningful column names from the expression. In this case it would be x. However, if the j-expression also contains x, for example, DT[, last(x), by= x %% 2], the original x got masked by the expression in by. This is now fixed; by-expressions are not simplified in column names for these cases. Closes #497. Thanks to @GSee for the report.

  43. rbindlist now errors when columns have non-identical class attributes and are not factors, e.g., binding column of class Date with POSIXct. Previously this returned incorrect results. Closes #705. Thanks to @ecoRoland for the minimal report.

  44. Fixed a segfault in melt.data.table when measure.vars have duplicate names. Closes #1055. Thanks to @ChristK for the minimal report.

  45. Fixed another segfault in melt.data.table that was caught due to an issue on Windows. Closes #1059. Thanks again to @ChristK for the minimal report.

  46. DT[rows, newcol := NULL] resulted in a segfault on the next assignment by reference. Closes #1082. Thanks to @stevenbagley for the MRE.

  47. as.matrix(DT) handles cases where DT contains both numeric and logical columns correctly (doesn't coerce to character columns anymore). Closes #1083. Thanks to @bramvisser for the SO post.

  48. Coercion is handled properly on subsets/joins on integer64 key columns. Closes #1108. Thanks to @vspinu.

  49. setDT() and as.data.table() both strip all classes preceding data.table/data.frame, to be consistent with base R. Closes #1078 and #1128. Thanks to Jan and @helix123 for the reports.

  50. setattr(x, 'levels', value) handles duplicate levels in value appropriately. Thanks to Jeffrey Horner for pointing it out here. Closes #1142.

  51. x[J(vals), .N, nomatch=0L] also included no matches in result, #1074. And x[J(...), col := val, nomatch=0L] returned a warning with incorrect results when join resulted in no matches as well, even though nomatch=0L should have no effect in :=, #1092. Both issues are fixed now. Thanks to @riabusan and @cguill95 for #1092.

  52. .data.table.locked attributes set to NULL in internal function subsetDT. Closes #1154. Thanks to @Jan.

  53. Internal function fastmean() retains column attributes. Closes #1160. Thanks to @renkun-ken.

  54. Using .N in i, for e.g., DT[, head(.SD, 3)[1:(.N-1L)]] accessed incorrect value of .N. This is now fixed. Closes #1145. Thanks to @claytonstanley.

  55. setDT handles key= argument properly when input is already a data.table. Closes #1169. Thanks to @DavidArenburg for the PR.

  56. Key is retained properly when joining on factor type columns. Closes #477. Thanks to @nachti for the report.

  57. Over-allocated memory is released more robustly thanks to Karl Miller's investigation and suggested fix.

  58. DT[TRUE, colA:=colA*2] no longer churns through 4 unnecessary allocations as large as one column. This was caused by i=TRUE being recycled. Thanks to Nathan Kurz for reporting and investigating. Added provided test to test suite. Only a single vector is allocated now for the RHS (colA*2). Closes #1249.

  59. Thanks to @and3k for the excellent bug report #1258. This was a result of shallow copy retaining keys when it shouldn't. It affected some cases of joins using on=. Fixed now.

  60. set() and := handle RHS value NA_integer_ on factor types properly. Closes #1234. Thanks to @DavidArenburg.

  61. merge.data.table() didn't set column order (and therefore names) properly in some cases. Fixed now. Closes #1290. Thanks to @ChristK for the minimal example.

  62. print.data.table now works for 100+ rows as intended when row.names=FALSE. Closes #1307. Thanks to @jangorecki for the PR.

  63. Row numbers are not printed in scientific format. Closes #1167. Thanks to @jangorecki for the PR.

  64. Using .GRP unnamed in j now returns a variable named GRP instead of .GRP as the period was causing issues. Same for .BY. Closes #1243; thanks to @MichaelChirico for the PR.

  65. DT[, 0, with=FALSE] returns null data.table to be consistent with data.frame's behaviour. Closes #1140. Thanks to @franknarf1.

  66. Evaluating quoted expressions with . in by works as intended. That is, dt = data.table(a=c(1,1,2,2), b=1:4); expr=quote(.(a)); dt[, sum(b), eval(expr)] works now. Closes #1298. Thanks @eddi.

  67. as.list method for IDate object works properly. Closes #1315. Thanks to @gwerbin.

NOTES

  1. Clearer explanation of what duplicated() does (borrowed from base). Thanks to @matthieugomez for pointing out. Closes #872.

  2. ?setnames has been updated now that names<- and colnames<- shallow (rather than deep) copy from R >= 3.1.0, #853.

  3. FAQ 1.6 has been embellished, #517. Thanks to a discussion with Vivi and Josh O'Brien.

  4. data.table redefines the melt generic and suggests reshape2 instead of importing it. As a result we don't have to load the reshape2 package to use melt.data.table anymore. The reason for this change is that data.table requires R >=2.14, whereas reshape2 requires R v3.0.0+. Reshape2's melt methods can be used without any issues by loading the package normally.

  5. DT[, j, ] at times made an additional (unnecessary) copy. This is now fixed. This fix also avoids allocating .I when j doesn't use it. As a result := and other subset operations should be faster (and use less memory). Thanks to @szilard for the nice report. Closes #921.

  6. Because reshape2 requires R >3.0.0, and data.table works with R >= 2.14.1, we can not import reshape2 anymore. Therefore we define a melt generic and melt.data.table method for data.tables and redirect to reshape2's melt for other objects. This is to ensure that existing code works fine.

  7. dcast is also a generic now in data.table. So we can use dcast(...) directly, and don't have to spell it out as dcast.data.table(...) like before. The dcast generic in data.table redirects to reshape2::dcast if the input object is not a data.table. But for that you have to load reshape2 before loading data.table. If not, reshape2's dcast overwrites data.table's dcast generic, in which case you will need the :: operator - ex: data.table::dcast(...).

    NB: Ideal situation would be for dcast to be a generic in reshape2 as well, but it is not. We have issued a pull request to make dcast in reshape2 a generic, but that has not yet been accepted.

  8. Clarified the use of bit64::integer64 in merge.data.table() and setNumericRounding(). Closes #1093. Thanks to @sfischme for the report.

  9. Removed an unnecessary (and silly) giveNames argument from setDT(). Not sure why I added this in the first place!

  10. options(datatable.prettyprint.char=5L) restricts the number of characters to be printed for character columns. For example:

    options(datatable.prettyprint.char = 5L)
    DT = data.table(x=1:2, y=c("abcdefghij", "klmnopqrstuv"))
    DT
    #    x        y
    # 1: 1 abcde...
    # 2: 2 klmno...

  11. rolltolast argument in [.data.table is now defunct. It was deprecated in 1.9.4.

  12. data.table's dependency has been moved forward from R 2.14.0 to R 2.14.1, now nearly 4 years old (Dec 2011). As usual before release to CRAN we ensure data.table passes the test suite on the stated dependency and keep this as old as possible for as long as possible, as requested by users in managed environments. For this reason we still don't use paste0() internally, since that was added in R 2.15.0.

  13. Warning about datatable.old.bywithoutby option (for grouping on join without providing by) being deprecated in the next release is in place now. Thanks to @jangorecki for the PR.

  14. Fixed allow.cartesian documentation to nrow(x)+nrow(i) instead of max(nrow(x), nrow(i)). Closes #1123.

Changes in v1.9.4 (on CRAN 2 Oct 2014)

NEW FEATURES

  1. by=.EACHI runs j for each group in DT that each row of i joins to.

    setkey(DT, ID)
    DT[c("id1", "id2"), sum(val)]                # single total across both id1 and id2
    DT[c("id1", "id2"), sum(val), by = .EACHI]   # sum(val) for each id
    DT[c("id1", "id2"), sum(val), by = key(DT)]  # same
    

    In other words, by-without-by is now explicit, as requested by users, #371. When i contains duplicates, by=.EACHI is different to by=key(DT); e.g.,

    setkey(DT, ID)
    ids = c("id1", "id2", "id1")     # NB: id1 appears twice
    DT[ids, sum(val), by = ID]       # 2 rows returned
    DT[ids, sum(val), by = .EACHI]   # 3 rows in the order of ids (result 1 and 3 are not merged)
    

    by=.EACHI can be useful when i is event data, where you don't want the events aggregated by common join values but wish the output to be ordered with repeats, or when simply using join inherited columns as parameters; e.g.,

    X[Y, head(.SD, i.top), by = .EACHI]
    

    where top is a non-join column in Y; i.e., join inherited column. Thanks to many, especially eddi, Sadao Milberg and Gabor Grothendieck for extended discussions. Closes #538.

  2. Accordingly, X[Y, j] now does what X[Y][, j] did. To return the old behaviour: options(datatable.old.bywithoutby=TRUE). This is a temporary option to aid migration and will be removed in future. See this, this and this post for discussions and motivation.

  3. Overlap joins (#528) are now here, finally!! Except for type="equal" and the maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.

  4. DT[column==value] and DT[column %in% values] are now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==value] is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).

  5. fread():

    • accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting here.
    • accepts trailing backslash in quoted fields. Thanks to user2970844 for highlighting here.
    • Blank and "NA" values in logical columns (T,True,TRUE) no longer cause them to be read as character, #567. Thanks to Adam November for reporting.
    • URLs now work on Windows. R's download.file() converts \r\n to \r\r\n on Windows. Now avoided by downloading in binary mode. Thanks to Steve Miller and Dean MacGregor for reporting, #492.
    • Fixed seg fault in sparse data files when bumping to character, #796 and #722. Thanks to Adam Kennedy and Richard Cotton for the detailed reproducible reports.
    • New argument fread(...,data.table=FALSE) returns a data.frame instead of a data.table. This can be set globally: options(datatable.fread.datatable=FALSE).
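
    A small sketch of the data.table=FALSE argument and the logical-column handling described above. The inline CSV text is hypothetical and is passed directly as input, which assumes fread treats a string containing newlines as data rather than as a file name:

```r
library(data.table)

# Hypothetical input: the 'flag' column has a blank field in row 2
csv <- "id,flag\n1,TRUE\n2,\n3,FALSE\n"

# Ask for a plain data.frame instead of a data.table
df <- fread(csv, data.table = FALSE)

class(df)      # a data.frame only, not a data.table
str(df$flag)   # blank field is read as NA rather than forcing character
```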
  6. .() can now be used in j and is identical to list(), for consistency with i.

    DT[,list(MySum=sum(B)),by=...]
    DT[,.(MySum=sum(B)),by=...]     # same
    DT[,list(colB,colC,colD)]
    DT[,.(colB,colC,colD)]          # same
    

    Similarly, by=.() is now a shortcut for by=list(), for consistency with i and j.

  7. rbindlist gains use.names and fill arguments and is now implemented entirely in C. Closes #345:

    • use.names by default is FALSE for backwards compatibility (does not bind by names by default)
    • rbind(...) now just calls rbindlist() internally, except that use.names is TRUE by default, for compatibility with base (and backwards compatibility).
    • fill=FALSE by default. If fill=TRUE, use.names has to be TRUE.
    • When use.names=TRUE, at least one item of the input list has to have non-null column names.
    • When fill=TRUE, all items of the input list have to have non-null column names.
    • Duplicate columns are bound in the order of occurrence, like base.
    • Attributes that might exist in individual items would be lost in the bound result.
    • Columns are coerced to the highest SEXPTYPE when they are different, if possible.
    • And incredibly fast ;).
    • Documentation updated in much detail. Closes #333.
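
    The use.names and fill rules above can be illustrated with a short sketch. The tables a, b, p and q are hypothetical; use.names is passed explicitly in both calls so the result does not depend on the default:

```r
library(data.table)

a <- data.table(x = 1:2, y = c("p", "q"))
b <- data.table(y = "r", z = 3.5)

# Bind by name and pad columns missing from either input with NA
filled <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
print(filled)

# Bind strictly by position, ignoring the differing column order
p <- data.table(a = 1, b = 2)
q <- data.table(b = 3, a = 4)
positional <- rbindlist(list(p, q), use.names = FALSE)
print(positional)
```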
  8. bit64::integer64 now works in grouping and joins, #342. Thanks to James Sams for highlighting UPCs and Clayton Stanley for this SO post. fread() has been detecting and reading integer64 for a while.

  9. setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type 'numeric', #342. See example in ?setNumericRounding and NEWS item below for v1.9.2. getNumericRounding() returns the current setting.
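
    A minimal sketch of changing the rounding setting temporarily and restoring it afterwards. The choice of 1 byte is only for illustration; the default value is not assumed here because it can differ between versions:

```r
library(data.table)

old <- getNumericRounding()     # remember the current setting
setNumericRounding(1L)          # round the last byte when joining or grouping on numerics
current <- getNumericRounding()

# ... joins or grouping on 'numeric' columns would go here ...

setNumericRounding(old)         # restore the previous setting
```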

  10. X[Y] now names non-join columns from i that have the same name as a column in x, with an i. prefix for consistency with the i. prefix that has been available in j for some time. This is now documented.
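
    As an illustration of the i. prefix in the result of X[Y], consider this sketch with two hypothetical keyed tables that share a non-join column name v:

```r
library(data.table)

X <- data.table(id = 1:3, v = c(10, 20, 30), key = "id")
Y <- data.table(id = 2:3, v = c(200, 300), key = "id")

res <- X[Y]
names(res)   # the clashing column from Y appears as i.v
print(res)
```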

  11. For a keyed table X where the key columns are not at the beginning in order, X[Y] now retains the original order of columns in X rather than moving the join columns to the beginning of the result.

  12. It is no longer an error to assign to row 0 or row NA.

    DT[0, colA := 1L]             # now does nothing, silently (was error)
    DT[NA, colA := 1L]            # now does nothing, silently (was error)
    DT[c(1, NA, 0, 2), colA:=1L]  # now ignores the NA and 0 silently (was error)
    DT[nrow(DT) + 1, colA := 1L]  # error (out-of-range) as before
    

    This is for convenience, to avoid the need for a switch in user code that evaluates various i conditions in a loop, passing in i as an integer vector which may contain 0 or NA.

  13. A new function setorder is now implemented which uses data.table's internal fast order to reorder rows by reference. It returns the result invisibly (like setkey) that allows for compound statements; e.g., setorder(DT, a, -b)[, cumsum(c), by=list(a,b)]. Check ?setorder for more info.
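
    A short sketch of reordering by reference and chaining on the invisible result. The table and its column names are hypothetical:

```r
library(data.table)

DT <- data.table(a = c(2L, 1L, 2L, 1L), b = c(1L, 2L, 3L, 4L), val = 1:4)

# Sort by a ascending, then b descending; DT itself is modified, no copy is made
setorder(DT, a, -b)
print(DT)

# setorder() returns DT invisibly, so a query can follow directly
res <- setorder(DT, a, -b)[, cumsum(val), by = a]
print(res)
```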

  14. DT[order(x, -y)] is now by default optimised to use data.table's internal fast order as DT[forder(DT, x, -y)]. It can be turned off by setting datatable.optimize to < 1L or just calling base:::order explicitly. It results in 20x speedup on data.table of 10 million rows with 2 integer columns, for example. To order character vectors in descending order it's sufficient to do DT[order(x, -y)] as opposed to DT[order(x, -xtfrm(y))] in base. This closes #603.
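
    A small sketch of descending order on a character column inside DT[...], assuming the default datatable.optimize setting is in effect. The data are hypothetical:

```r
library(data.table)

DT <- data.table(x = c(1L, 1L, 2L, 2L), y = c("a", "b", "c", "d"))

# -y on a character column works here because order() is optimised internally
res <- DT[order(x, -y)]
print(res)
```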

  15. mult="all" -vs- mult="first"|"last" now return consistent types and columns, #340. Thanks to Michele Carriero for highlighting.

  16. duplicated.data.table and unique.data.table gain a fromLast = TRUE/FALSE argument, similar to base. Default value is FALSE. Closes #347.

  17. anyDuplicated.data.table is now implemented. Closes #350. Thanks to M C (bluemagister) for reporting.
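
    The fromLast argument and anyDuplicated can be sketched on a hypothetical table whose first and third rows are identical:

```r
library(data.table)

DT <- data.table(k = c("a", "b", "a"), v = c(1, 2, 1))

dup_first <- duplicated(DT)                   # flags the later copy (row 3)
dup_last  <- duplicated(DT, fromLast = TRUE)  # flags the earlier copy (row 1)
kept      <- unique(DT, fromLast = TRUE)      # keeps the last occurrence of each row
first_dup <- anyDuplicated(DT)                # index of the first duplicated row, 0 if none
```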

  18. Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised as long as .SD is of the form lapply(.SD, fun) or .SD, .SD[1] or .SD[1L]. This resolves #370. Thanks to Sam Steingold for reporting. This also completes the first two task lists in #735.

    ## example:
    DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
    ## is optimised to
    DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
    ## and now... these variations are also optimised internally for speed
    DT[, c(..., .SD, lapply(.SD, sum), ...), by=grp]
    DT[, c(..., .SD[1], lapply(.SD, sum), ...), by=grp]
    DT[, .SD, by=grp]
    DT[, c(.SD), by=grp]
    DT[, .SD[1], by=grp] # Note: but not yet DT[, .SD[1,], by=grp]
    DT[, c(.SD[1]), by=grp]
    DT[, head(.SD, 1), by=grp] # Note: but not yet DT[, head(.SD, -1), by=grp]
    # but not yet optimised
    DT[, c(.SD[a], .SD[x>1], lapply(.SD, sum)), by=grp] # where 'a' is, say, a numeric or a data.table, and also for expressions like x>1
    

    The underlying message is that .SD is being slowly optimised internally wherever possible, for speed, without compromising in the nice readable syntax it provides.

  19. setDT gains keep.rownames = TRUE/FALSE argument, which works only on data.frames. TRUE retains the data.frame's row names as a new column named rn.
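
    A minimal sketch of keep.rownames on a hypothetical data.frame with explicit row names:

```r
library(data.table)

df <- data.frame(x = 1:3, row.names = c("r1", "r2", "r3"))

# Converts df to a data.table by reference; row names become the column 'rn'
setDT(df, keep.rownames = TRUE)
print(df)
```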

  20. The output of tables() now includes NCOL. Thanks to @dnlbrky for the suggestion.

  21. DT[, LHS := RHS] (or its equivalent in set) now provides a warning and returns DT as it was, instead of an error, when length(LHS) = 0L, #343. For example:

    DT[, grep("^b", names(DT)) := NULL] # where no columns start with b
    # warns now and returns DT instead of error
    
  22. GForce now is also optimised for j-expression with .N. Closes #334 and part of #523.

    DT[, list(.N, mean(y), sum(y)), by=x] # 1.9.2 - doesn't know to use GForce - will be (relatively) slower
    DT[, list(.N, mean(y), sum(y)), by=x] # 1.9.3+ - will use GForce.
    
  23. setDF is now implemented. It accepts a data.table and converts it to data.frame by reference, #338. Thanks to canneff for the discussion here on data.table mailing list.
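
    A short sketch of the by-reference conversion, using a hypothetical one-column table:

```r
library(data.table)

DT <- data.table(a = 1:3)

# Converts in place; no assignment is needed and no copy is made
setDF(DT)
class(DT)
```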

  24. .I gets named as I (instead of .I) wherever possible, similar to .N, #344.

  25. setkey on .SD is now an error, rather than warnings for each group about rebuilding the key. The new error is similar to when attempting to use := in a .SD subquery: ".SD is locked. Using set*() functions on .SD is reserved for possible future use; a tortuously flexible way to modify the original data by group." Thanks to Ron Hylton for highlighting the issue on datatable-help here.

  26. Looping calls to unique(DT) such as in DT[,unique(.SD),by=group] is now faster by avoiding internal overhead of calling [.data.table. Thanks again to Ron Hylton for highlighting in the same thread. His example is reduced from 28 sec to 9 sec, with identical results.

  27. Following gsum and gmean, now gmin and gmax from GForce are also implemented. Closes part of #523. Benchmarks are also provided.

    DT[, list(sum(x), min(y), max(z), .N), by=...] # runs by default using GForce
    
  28. setorder() and DT[order(.)] handle integer64 type in descending order as well. Closes #703.

  29. setorder() and setorderv() gain na.last = TRUE/FALSE. Closes #706.

  30. .N is now available in i, FR#724. Thanks to newbie indirectly here and Farrel directly here.
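
    As an illustration of .N in i, this sketch picks the last row, and the first and last rows, of a hypothetical table:

```r
library(data.table)

DT <- data.table(x = 1:5, y = letters[1:5])

last_row   <- DT[.N]          # .N is the number of rows, so this is the last row
first_last <- DT[c(1L, .N)]   # first and last rows
```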

  31. by=.EACHI is now implemented for not-joins as well. Closes #604. Thanks to Garrett See for filing the FR. As an example:

    DT = data.table(x=c(1,1,1,1,2,2,3,4,4,4), y=1:10, key="x")
    DT[!J(c(1,4)), sum(y), by=.EACHI] # is equivalent to DT[J(c(2,3)), sum(y), by=.EACHI]
    

BUG FIXES

  1. When joining to fewer columns than the key has, using one of the later key columns explicitly in j repeated the first value. A problem introduced by v1.9.2 and not caught by the 1,220 tests, or tests in 37 dependent packages. Test added. Many thanks to Michele Carriero for reporting.

    DT = data.table(a=1:2, b=letters[1:6], key="a,b")    # keyed by a and b
    DT[.(1), list(b,...)]    # correct result again (joining just to a not b but using b)
    
  2. setkey works again when a non-key column is type list (e.g. each cell can itself be a vector), #54. Test added. Thanks to James Sams, Michael Nelson and Musx for the reproducible examples.

  3. The warning "internal TRUE value has been modified" with recently released R 3.1 when grouping a table containing a logical column and where all groups are just 1 row is now fixed and tests added. Thanks to James Sams for the reproducible example. The warning is issued by R and we have asked if it can be upgraded to error (UPDATE: change now made for R 3.1.1 thanks to Luke Tierney).

  4. data.table(list()), data.table(data.table()) and data.table(data.frame()) now return a null data.table (no columns) rather than one empty column, #48. Test added. Thanks to Shubh Bansal for reporting.

  5. unique(<NULL data.table>) now returns a null data.table, #44. Thanks to agstudy for reporting.

  6. data.table() converted POSIXlt to POSIXct, consistent with base:::data.frame(), but now also provides a helpful warning instead of coercing silently, #59. Thanks to Brodie Gaslam, Patrick and Ragy Isaac for reporting here and here.

  7. If another class inherits from data.table; e.g. class(DT) == c("UserClass","data.table","data.frame") then DT[...] now retains UserClass in the result. Thanks to Daniel Krizian for reporting, #64. Test added.

  8. An error object '<name>' not found could occur in some circumstances, particularly after a previous error. Reported on SO with non-ASCII characters in a column name, a red herring we hope since non-ASCII characters are fully supported in data.table including in column names. Fix implemented and tests added.

  9. Column order was reversed in some cases by as.data.table.table(), #43. Test added. Thanks to Benjamin Barnes for reporting.

  10. DT[, !"missingcol", with=FALSE] now returns DT (rather than a NULL data.table) with warning that "missingcol" is not present.

  11. DT[,y := y * eval(parse(text="1*2"))] resulted in an error unless eval() was wrapped in parentheses. That is, DT[,y := y * (eval(parse(text="1*2")))], #5423. Thanks to Wet Feet for reporting and to Simon O'Hanlon for identifying the issue here on SO.

  12. Using by columns with attributes (ex: factor, Date) in j did not retain the attributes, also in case of :=. This was partially a regression from an earlier fix (#155) due to recent changes for R3.1.0. Now fixed and clearer tests added. Thanks to Christophe Dervieux for reporting and to Adam B for reporting here on SO. Closes #36.

  13. .BY special variable did not retain names of the grouping columns which resulted in not being able to access .BY$grpcol in j. Ex: DT[, .BY$x, by=x]. This is now fixed. Closes #5415. Thanks to Stephane Vernede for the bug report.

  14. Fixed another issue with eval(parse(...)) in j along with assignment by reference :=. Closes #30. Thanks to Michele Carriero for reporting.

  15. get() in j did not see i's columns when i is a data.table, which led to errors while doing operations like: DT1[DT2, list(get('c'))]. Now, use of get makes all x's and i's columns visible (fetches all columns). Still, as the verbose message states, using .SDcols or eval(macro) would be able to select just the columns used, which is better for efficiency. Closes #34. Thanks to Eddi for reporting.

  16. Fixed an edge case with unique and duplicated, which on empty data.tables returned a 1-row data.table with all NAs. Closes #28. Thanks to Shubh Bansal for reporting.

  17. dcast.data.table resulted in an error (because function CJ() was not visible) in packages that "import" data.table. This did not happen if the package "depends" on data.table. Closes bug #31. Thanks to K Davis for the excellent report.

  18. merge(x, y, all=TRUE) error when x is empty data.table is now fixed. Closes #24. Thanks to Garrett See for filing the report.

  19. Implementing #5249 closes bug #26, a case where rbind gave error when binding with empty data.tables. Thanks to Roger for reporting on SO.

  20. Fixed a segfault during grouping with assignment by reference, ex: DT[, LHS := RHS, by=.], where length(RHS) > group size (.N). Closes #25. Thanks to Zachary Long for reporting on datatable-help mailing list.

  21. Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, in j or in .SDcols, then just those columns are returned (or deleted if you provide -.SDcols or !j). If instead column names are given and there is more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #22 and #86.

    Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.

  22. When DT is empty, DT[, newcol:=max(b), by=a] now properly adds the column, #49. Thanks to Shubh Bansal for filing the report.

  23. When j evaluates to integer(0)/character(0), DT[, j, with=FALSE] resulted in error, #21. Thanks indirectly to Malcolm Cook for #52, through which this (recent) regression (from 1.9.3) was found.

  24. print(DT) now respects digits argument on list type columns, #37. Thanks to Frank for the discussion on the mailing list and to Matthew Beckers for filing the bug report.

  25. FR # 2551 implemented lenience in warning messages when columns are coerced with DT[, LHS := RHS], when length(RHS)==1. But this was too lenient; e.g., DT[, a := "bla"], where a is a logical column, should get a warning. This is now fixed such that only very obvious cases are coerced silently; e.g., DT[, a := 1] where a is integer. Closes #35. Thanks to Michele Carriero and John Laing for reporting.

  26. dcast.data.table provides better error message when fun.aggregate is specified but it returns length != 1. Closes #693. Thanks to Trevor Alexander for reporting here on SO.

  27. dcast.data.table tries to preserve attributes wherever possible, except when value.var is a factor (or ordered factor). For factor types, the casted columns will be coerced to type character, thereby losing the levels attribute. Closes #688. Thanks to juancentro for reporting.

  28. melt now returns a friendly error when measure.vars are not in data instead of a segfault. Closes #699. Thanks to vsalmendra for this post on SO and the subsequent bug report.

  29. DT[, list(m1 = eval(expr1), m2=eval(expr2)), by=val] where expr1 and expr2 are constructed using parse(text=.) now works instead of resulting in error. Closes #472. Thanks to Benjamin Barnes for reporting with a nice reproducible example.

  30. A join of the form X[Y, roll=TRUE, nomatch=0L] where some of Y's key columns occur more than once (duplicated keys) might at times return incorrect join. This was introduced only in 1.9.2 and is fixed now. Closes #700. Thanks to Michael Smith for the very nice reproducible example and nice spotting of such a tricky case.

  31. Fixed an edge case in DT[order(.)] internal optimisation to be consistent with base. Closes #696. Thanks to Michael Smith and Garrett See for reporting.

  32. DT[, list(list(.)), by=.] and DT[, col := list(list(.)), by=.] return correct results in R >= 3.1.0 as well. The bug was due to recent (welcome) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481. Also thanks to KrishnaPG for filing #728.

  33. dcast.data.table handles fun.aggregate argument properly when called from within
