Current stable release (always even) : v1.9.6 on CRAN, released 19th Sep 2015.
Development version (always odd): v1.9.7 on GitHub
How to install?
Introduction, installation, documentation, benchmarks etc: HOMEPAGE
Guidelines for filing issues / pull requests: Please see the project's Contribution Guidelines.
Changes in v1.9.6 (on CRAN 19 Sep 2015)
NEW FEATURES
`fread`
- passes `showProgress=FALSE` through to `download.file()` (as `quiet=TRUE`). Thanks to a pull request from Karl Broman and to Richard Scriven for filing the issue, #741.
- accepts `dec=','` (and other non-'.' decimal separators), #917. A new paragraph has been added to `?fread`. On Windows this should just work. On Unix it may just work, but if not you will need to read the paragraph for an extra step. In case it somehow breaks `dec='.'`, this new feature can be turned off with `options(datatable.fread.dec.experiment=FALSE)`.
- Implemented `stringsAsFactors` argument for `fread()`. When `TRUE`, character columns are converted to factors. Default is `FALSE`. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
- gains `check.names` argument, with default value `FALSE`. When `TRUE`, it uses the base function `make.unique()` to ensure that the column names of the data.table read in are all unique. Thanks to David Arenburg for filing #1027.
- gains `encoding` argument. Acceptable values are "unknown", "UTF-8" and "Latin-1" with default value of "unknown". Closes #563. Thanks to @BenMarwick for the original report and to the many requests from others, and Q on SO.
- gains `col.names` argument, similar to `base::read.table()`. Closes #768. Thanks to @dardesta for filing the FR.
`DT[column == value]` no longer recycles `value` except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If `length(value)==length(column)` then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. `DT[column %in% values]` still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of `==` and `%in%`) may still be turned off with `options(datatable.auto.index=FALSE)`.
`na.omit` method for data.table is rewritten in C, for speed. It's ~11x faster on bigger data; see examples under `?na.omit`. It also gains two additional arguments: a) `cols` accepts column names (or numbers) on which to check for missing values; b) `invert`, when `TRUE`, returns the rows with any missing values instead. Thanks to the suggestion and PR from @matthieugomez.
New function `shift()` implements fast `lead/lag` of vector, list, data.frames or data.tables. It takes a `type` argument which can be either "lag" (default) or "lead". It enables very convenient usage along with `:=` or `set()`. For example: `DT[, (cols) := shift(.SD, 1L), by=id]`. Please have a look at `?shift` for more info.
`frank()` is now implemented. It's much faster than `base::rank` and does more. It accepts vectors, lists with all elements of equal lengths, data.frames and data.tables, and optionally takes a `cols` argument. In addition to implementing all the `ties.method` methods available from `base::rank`, it also implements dense rank. It is also capable of calculating ranks by ordering column(s) in ascending or descending order. See `?frank` for more. Closes #760 and #771.
`rleid()`, a convenience function for generating a run-length type id column to be used in grouping operations, is now implemented. Closes #686. Check the `?rleid` examples section for usage scenarios.
Efficient conversion of `xts` to data.table. Closes #882. Check examples in `?as.xts.data.table` and `?as.data.table.xts`. Thanks to @jangorecki for the PR.
`rbindlist` gains `idcol` argument which can be used to generate an index column. If `idcol=TRUE`, the column is automatically named `.id`. Instead you can also provide a column name directly. If the input list has no names, indices are automatically generated. Closes #591. Also thanks to @KevinUshey for filing #356.
A new helper function `uniqueN` is now implemented. It is equivalent to `length(unique(x))` but much faster. It handles `atomic vectors`, `lists`, `data.frames` and `data.tables` as input and returns the number of unique rows. Closes #884. Gains `by` argument. Closes #1080. Closes #1224. Thanks to @DavidArenburg, @kevinmistry and @jangorecki.
Implemented `transpose()` to transpose a list and `tstrsplit` which is a wrapper for `transpose(strsplit(...))`. This is particularly useful in scenarios where a column has to be split and the resulting list has to be assigned to multiple columns. See `?transpose` and `?tstrsplit`, #1025 and #1026 for usage scenarios. Closes both #1025 and #1026.
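The helper functions introduced in the items above (`shift()`, `frank()`, `rleid()`, `uniqueN()` and `tstrsplit()`) combine naturally with `:=`. The sketch below uses a made-up table; the column names `id`, `val` and `key_str` are assumptions for illustration only.

```r
library(data.table)

DT <- data.table(id      = c("a", "a", "a", "b", "b"),
                 val     = c(10L, 20L, 20L, 5L, 7L),
                 key_str = c("x_1", "x_2", "y_1", "y_2", "z_3"))

# shift(): previous value of val within each id group (default type is "lag")
DT[, prev_val := shift(val, 1L, type = "lag"), by = id]

# frank(): dense rank of val, so tied values share a rank and no ranks are skipped
DT[, rnk := frank(val, ties.method = "dense")]

# rleid(): id that increments each time val changes from one row to the next
DT[, run := rleid(val)]

# uniqueN(): number of distinct ids, equivalent to length(unique(DT$id))
n_ids <- uniqueN(DT$id)

# tstrsplit(): split one column and assign the pieces to several new columns
DT[, c("prefix", "suffix") := tstrsplit(key_str, "_", fixed = TRUE)]

DT
```

Because every step assigns by reference, no copy of `DT` is made along the way.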
* Implemented `type.convert` as suggested by Richard Scriven. Closes [#1094](https://github.com/Rdatatable/data.table/issues/1094).
`melt.data.table`
- can now melt into multiple columns by providing a list of columns to `measure.vars` argument. Closes #828. Thanks to Ananda Mahto for the extended email discussions and ideas on generating the `variable` column.
- also retains attributes wherever possible. Closes #702 and #993. Thanks to @richierocks for the report.
- Added `patterns.Rd`. Closes #1294. Thanks to @MichaelChirico.
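Melting into multiple columns, as described in the first item above, can be sketched as follows. The table and its column names are invented for illustration; each element of the `measure.vars` list becomes one output column, named through `value.name`.

```r
library(data.table)

wide <- data.table(id         = 1:2,
                   dob_child1 = c("2001-01-01", "2002-02-02"),
                   dob_child2 = c("2003-03-03", "2004-04-04"),
                   sex_child1 = c("F", "M"),
                   sex_child2 = c("M", "F"))

# Each list element groups the columns that should be stacked into one output column
long <- melt(wide,
             id.vars      = "id",
             measure.vars = list(c("dob_child1", "dob_child2"),
                                 c("sex_child1", "sex_child2")),
             value.name   = c("dob", "sex"))
long
```

The generated `variable` column records which position within each group a row came from (first child, second child, and so on).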
`dcast` can now:
- cast multiple `value.var` columns simultaneously. Closes #739.
- accept multiple functions under `fun.aggregate`. Closes #716.
- support optional column prefixes as mentioned under this SO post. Closes #862. Thanks to @JohnAndrews.
- work with undefined variables directly in formula. Closes #1037. Thanks to @DavidArenburg for the MRE.
- Naming conventions on multiple columns changed according to #1153. Thanks to @MichaelChirico for the FR.
- also has a `sep` argument with default `_` for backwards compatibility. #1210. Thanks to @dbetebenner for the FR.
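A short sketch of the first two `dcast` items above, using a made-up table (the names `id`, `grp`, `v1` and `v2` are assumptions). The column names shown in the comment follow the `sep = "_"` default.

```r
library(data.table)

DT <- data.table(id  = c(1L, 1L, 2L, 2L),
                 grp = c("a", "b", "a", "b"),
                 v1  = c(1, 2, 3, 4),
                 v2  = c(10, 20, 30, 40))

# Cast two value columns at once; result columns are id, v1_a, v1_b, v2_a, v2_b
wide <- dcast(DT, id ~ grp, value.var = c("v1", "v2"))
wide

# Apply more than one aggregation function to the same value column
agg <- dcast(DT, id ~ grp, value.var = "v1", fun.aggregate = list(sum, mean))
agg
```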
`.SDcols` and `with=FALSE` understand `colA:colB` form now. That is, `DT[, lapply(.SD, sum), by=V1, .SDcols=V4:V6]` and `DT[, V5:V7, with=FALSE]` work as intended. This is quite useful for interactive use. Closes #748 and #1216. Thanks to @carbonmetrics, @jangorecki and @mtennekes.
`setcolorder()` and `setorder()` work with `data.frame`s too. Closes #1018.
`as.data.table.*` and `setDT` argument `keep.rownames` can take a column name as well. When `keep.rownames=TRUE`, the column will still be automatically named `rn`. Closes #575.
`setDT` gains a `key` argument so that `setDT(X, key="a")` would convert `X` to a `data.table` by reference and key by the columns specified. Closes #1121.
`setDF` also converts `list` of equal length to `data.frame` by reference now. Closes #1132.
`CJ` gains logical `unique` argument with default `FALSE`. If `TRUE`, unique values of vectors are automatically computed and used. This is convenient, for example, `DT[CJ(a, b, c, unique=TRUE)]` instead of doing `DT[CJ(unique(a), unique(b), unique(c))]`. Ultimately, `unique = TRUE` will be default. Closes #1148.
`on=` syntax: data.tables can join now without having to set keys by using the new `on` argument. For example: `DT1[DT2, on=c(x = "y")]` would join column 'y' of `DT2` with 'x' of `DT1`. `DT1[DT2, on="y"]` would join on column 'y' on both data.tables. Closes #1130 partly.
`merge.data.table` gains arguments `by.x` and `by.y`. Closes #637 and #1130. No copies are made even when the specified columns aren't key columns in data.tables, and it is therefore much faster and more memory efficient. Thanks to @blasern for the initial PRs. Also gains logical argument `sort` (like base R). Closes #1282.
`setDF()` gains `rownames` argument for ready conversion to a `data.frame` with user-specified rows. Closes #1320. Thanks to @MichaelChirico for the FR and PR.
`print.data.table` gains `quote` argument (default=`FALSE`). This option surrounds all printed elements with quotes, which helps make whitespace(s) more evident. Closes #1177; thanks to @MichaelChirico for the PR.
`[.data.table` now accepts single column numeric matrix in `i` argument the same way as `data.frame`. Closes #826. Thanks to @jangorecki for the PR.
`setDT()` gains `check.names` argument paralleling that of `fread`, `data.table`, and `base` functionality, allowing poorly declared objects to be converted to tidy `data.table`s by reference. Closes #1338; thanks to @MichaelChirico for the FR/PR.
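Three of the features in this list (`on=` joins, `CJ(..., unique=TRUE)` and the `colA:colB` form of `.SDcols`) are easiest to see on small tables. This is a minimal sketch with made-up tables and column names, not a description of the internals.

```r
library(data.table)

DT1 <- data.table(x = c("a", "b", "c"), v1 = 1:3)
DT2 <- data.table(y = c("b", "c", "d"), v2 = c(10, 20, 30))

# Join column 'y' of DT2 with column 'x' of DT1 without setting a key on either table
res <- DT1[DT2, on = c(x = "y")]
res   # one row per row of DT2; v1 is NA where DT1 has no match ("d")

# CJ() with unique = TRUE de-duplicates each input before forming the cross join
grid <- CJ(c(2, 1, 1), c("u", "u"), unique = TRUE)
grid

# .SDcols accepts a range of column names
DT3 <- data.table(g = c("a", "a", "b"), v1 = 1:3, v2 = 4:6, v3 = 7:9)
sums <- DT3[, lapply(.SD, sum), by = g, .SDcols = v1:v2]
sums
```

Since `on=` does not set a key, the physical row order of `DT1` and `DT2` is left untouched.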
BUG FIXES
`if (TRUE) DT[,LHS:=RHS]` no longer prints, #869 and #1122. Tests added. To get this to work we've had to live with one downside: if a `:=` is used inside a function with no `DT[]` before the end of the function, then the next time `DT` or `print(DT)` is typed at the prompt, nothing will be printed. A repeated `DT` or `print(DT)` will print. To avoid this: include a `DT[]` after the last `:=` in your function. If that is not possible (e.g., it's not a function you can change) then `DT[]` at the prompt is guaranteed to print. As before, adding an extra `[]` on the end of a `:=` query is a recommended idiom to update and then print; e.g. `> DT[,foo:=3L][]`. Thanks to Jureiss and Jan Gorecki for reporting.
`DT[FALSE,LHS:=RHS]` no longer prints either, #887. Thanks to Jureiss for reporting.
`:=` no longer prints in knitr for consistency with behaviour at the prompt, #505. Output of a test `knit("knitr.Rmd")` is now in data.table's unit tests. Thanks to Corone for the illustrated report.
`knitr::kable()` works again without needing to upgrade from knitr v1.6 to v1.7, #809. Packages which evaluate user code and don't wish to import data.table need to be added to `data.table:::cedta.pkgEvalsUserCode` and now only the `eval` part is made data.table-aware (the rest of such package's code is left data.table-unaware). `data.table:::cedta.override` is now empty and will be deprecated if no need for it arises. Thanks to badbye and Stephanie Locke for reporting.
`fread()`:
- doubled quotes ("") inside quoted fields including if immediately followed by an embedded newline. Thanks to James Sams for reporting, #489.
- quoted fields with embedded newlines in the lines used to detect types, #810. Thanks to Vladimir Sitnikov for the scrambled data file which is now included in the test suite.
- when detecting types in the middle and end of the file, if the jump lands inside a quoted field with (possibly many) embedded newlines, this is now detected.
- if the file doesn't exist the error message is clearer (#486)
- system commands are now checked to contain at least one space
- sep="." now works (when dec!="."), #502. Thanks to Ananda Mahto for reporting.
- better error message if quoted field is missing an end quote, #802. Thanks to Vladimir Sitnikov for the sample file which is now included in the test suite.
- providing sep which is not present in the file now reads as if sep="\n" rather than 'sep not found', #738. Thanks to Adam Kennedy for explaining the use-case.
- seg fault with errors over 1,000 characters (when long lines are included) is fixed, #802. Thanks again to Vladimir Sitnikov.
- Missing `integer64` values are properly assigned `NA`s. Closes #488. Thanks to @PeterStoyanov and @richierocks for the report.
- Column headers with empty strings aren't skipped anymore. Closes #483. Thanks to @RobyJoehanes and @kforner.
- Detects separator correctly when commas also exist in text fields. Closes #923. Thanks to @raymondben for the report.
- `NA` values in NA inflated file are read properly. Closes #737. Thanks to Adam Kennedy.
- correctly handles `na.strings` argument for all types of columns - it detects possible `NA` values without coercion to character, like in base `read.table`. Fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., `na.strings = c("-999", "FALSE")` etc. also work.
- deals with quotes more robustly. When reading a quoted field fails, it re-attempts to read the field as if it wasn't quoted. This helps read in those fields that might have unbalanced quotes without erroring immediately, thereby closing issues #568, #1256, #1077, #1079 and #1095. Thanks to @Synergist, @daroczig, @geotheory and @rsaporta for the reports.
- gains argument `strip.white` which is `TRUE` by default (unlike `base::read.table`). All unquoted columns' leading and trailing white spaces are automatically removed. If `FALSE`, only trailing spaces of the header are removed. Closes #1113, #1035, #1000, #785, #529 and #956. Thanks to @dmenne, @dpastoor, @GHarmata, @gkalnytskyi, @renqian, @MatthewForrest, @fxi and @heraldb.
- doesn't warn about empty lines when 'nrow' argument is specified and that many rows are read properly. Thanks to @richierocks for the report. Closes #1330.
- doesn't error/warn about not being able to read last 5 lines when 'nrow' argument is specified. Thanks to @robbig2871. Closes #773.
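The `na.strings` and `strip.white` behaviour described in the `fread()` items above can be checked on a small inline input. This is only a sketch: the data are invented, and it assumes a data.table version that includes both fixes.

```r
library(data.table)

# "-999" marks a missing score; the first name is padded with spaces
dt <- fread("name,score\n  alice  ,-999\nbob,12\n", na.strings = "-999")

dt$name    # padding around unquoted fields is removed (strip.white = TRUE by default)
dt$score   # the -999 entry is read as NA and the column stays numeric
```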
Auto indexing:
- `DT[colA == max(colA)]` now works again without needing `options(datatable.auto.index=FALSE)`. Thanks to Jan Gorecki and kaybenleroll, #858. Test added.
- `DT[colA %in% c("id1","id2","id2","id3")]` now ignores the RHS duplicates (as before, consistent with base R) without needing `options(datatable.auto.index=FALSE)`. Thanks to Dayne Filer for reporting.
- If `DT` contains a column `class` (happens to be a reserved attribute name in R) then `DT[class=='a']` now works again without needing `options(datatable.auto.index=FALSE)`. Thanks to sunnyghkm for reporting, #871.
- `:=` and `set*` now drop secondary keys (new in v1.9.4) so that `DT[x==y]` works again after a `:=` or `set*` without needing `options(datatable.auto.index=FALSE)`. Only `setkey()` was dropping secondary keys correctly. 23 tests added. Thanks to user36312 for reporting, #885.
- Automatic indices are not created on `.SD` so that `dt[, .SD[b == "B"], by=a]` works correctly. Fixes #958. Thanks to @azag0 for the nice reproducible example.
- `i`-operations resulting in 0-length rows ignore `j` on subsets using auto indexing. Closes #1001. Thanks to @Gsee.
- `POSIXct` type columns work as expected with auto indexing. Closes #955. Thanks to @GSee for the minimal report.
- Auto indexing with `!` operator, for e.g., `DT[!x == 1]` works as intended. Closes #932. Thanks to @matthieugomez for the minimal example.
- While fixing `#932`, issues on subsetting `NA` were also spotted and fixed, for e.g., `DT[x==NA]` or `DT[!x==NA]`.
- Works fine when RHS is of `list` type - quite unusual operation but could happen. Closes #961. Thanks to @Gsee for the minimal report.
- Auto indexing errored in some cases when LHS and RHS were not of same type. This is fixed now. Closes #957. Thanks to @GSee for the minimal report.
- `DT[x == 2.5]` where `x` is integer type resulted in `val` being coerced to integer (for binary search) and therefore returned an incorrect result. This is now identified using the function `isReallyReal()` and if so, auto indexing is turned off. Closes #1050.
- Auto indexing errored during `DT[x %in% val]` when `val` has some values not present in `x`. Closes #1072. Thanks to @CarlosCinelli for asking on StackOverflow.
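A few of the auto-indexing cases fixed above can be reproduced on a toy table. The sketch below shows the results expected after the fixes; the table itself is made up.

```r
library(data.table)

DT <- data.table(x = 1:5, y = letters[1:5])

# Integer column compared with a non-integer value: no row matches,
# because 2.5 is no longer coerced to 2 for the binary search
r1 <- DT[x == 2.5]

# %in% with a value that is absent from x: the absent value is simply ignored
r2 <- DT[x %in% c(2L, 99L)]

# Negated comparison
r3 <- DT[!x == 1L]

nrow(r1); nrow(r2); nrow(r3)
```

All three subsets give the same rows as the equivalent vector scans in base R, which is the point of these fixes: the optimisation should change speed only.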
`as.data.table.list` with list input having 0-length items, e.g. `x = list(a=integer(0), b=3:4)`. `as.data.table(x)` recycles item `a` with `NA`s to fit the length of the longer column `b` (length=2), as before, but now with an additional warning message that the item has been recycled with `NA`. Closes #847. Thanks to @tvinodr for the report. This was a regression from 1.9.2.
`DT[i, j]` when `i` returns all `FALSE` and `j` contains some length-0 values (ex: `integer(0)`) now returns an empty data.table as it should. Closes #758 and #813. Thanks to @tunaaa and @nigmastar for the nice reproducible reports.
`allow.cartesian` is ignored during joins when:
In both these cases (and during a `not-join` which was already fixed in 1.9.4), `allow.cartesian` can be safely ignored.
`names<-.data.table` works as intended on data.table unaware packages with R v3.1.0+. Closes #476 and #825. Thanks to ezbentley for reporting here on SO and to @narrenfrei.
`.EACHI` is now an exported symbol (just like `.SD`, `.N`, `.I`, `.GRP` and `.BY` already were) so that packages using `data.table` and `.EACHI` pass `R CMD check` with no NOTE that this symbol is undefined. Thanks to Matt Bannert for highlighting.
Some optimisations of `.SD` in `j` were done in 1.9.4, refer to #735. Due to an oversight, j-expressions of the form `c(lapply(.SD, ...), list(...))` were optimised improperly. This is now fixed. Thanks to @mmeierer for filing #861.
`j`-expressions in `DT[, col := x$y()]` (or) `DT[, col := x[[1]]()]` are now (re)constructed properly. Thanks to @ihaddad-md for reporting. Closes #774.
`format.ITime` now handles negative values properly. Closes #811. Thanks to @StefanFritsch for the report along with the fix!
Compatibility with big endian machines (e.g., SPARC and PowerPC) is restored. Most Windows, Linux and Mac systems are little endian; type `.Platform$endian` to confirm. Thanks to Gerhard Nachtmann for reporting and the QEMU project for their PowerPC emulator.
`DT[, LHS := RHS]` with RHS of the form `eval(parse(text = foo[1]))` referring to columns in `DT` is now handled properly. Closes #880. Thanks to tyner.
`subset` handles extracting duplicate columns in consistency with data.table's rule - if a column name is duplicated, then accessing that column using column number should return that column, whereas accessing by column name (due to ambiguity) will always extract the first column. Closes #891. Thanks to @jjzz.
`rbindlist` handles combining levels of data.tables with both ordered and unordered factor columns properly. Closes #899. Thanks to @ChristK.
Updating `.SD` by reference using `set` also errors appropriately now; similar to `:=`. Closes #927. Thanks to @jrowen for the minimal example.
`X[Y, .N]` returned the same result as `X[Y, .N, nomatch=0L]` when `Y` contained rows that have no matches in `X`. Fixed now. Closes #963. Thanks to this SO post from @Alex which helped discover the bug.
`data.table::dcast` handles levels in factor columns properly when `drop = FALSE`. Closes #893. Thanks to @matthieugomez for the great minimal example.
`[.data.table` subsets complex and raw type objects again. Thanks to @richierocks for the nice minimal example. Closes #982.
Fixed a bug in the internal optimisation of `j-expression` with more than one `lapply(.SD, function(..) ..)` as illustrated here on SO. Closes #985. Thanks to @jadaliha for the report and to @BrodieG for the debugging on SO.
`mget` fetches columns from the default environment `.SD` when called from within the frame of `DT`. That is, `DT[, mget(cols)]`, `DT[, lapply(mget(cols), sum), by=.]` etc. work as intended. Thanks to @Roland for filing this issue. Closes #994.
`foverlaps()` did not find overlapping intervals correctly on numeric ranges in a special case where both `start` and `end` intervals had 0.0. This is now fixed. Thanks to @tdhock for the reproducible example. Closes #1006 partly.
When performing rolling joins, keys are set only when we can be absolutely sure. Closes #1010, which explains cases where keys should not be retained.
Rolling joins with `-Inf` and `Inf` are handled properly. Closes #1007. Thanks to @tdhock for filing #1006 which led to the discovery of this issue.
Overlapping range joins with `-Inf` and `Inf` and 0.0 in them are handled properly now. Closes #1006. Thanks to @tdhock for filing the issue with a nice reproducible example.
Fixed two segfaults in `shift()` when the number of rows in `x` is less than the value for `n`. Closes #1009 and #1014. Thanks to @jangorecki and @ashinm for the reproducible reports.
Attributes are preserved for `sum()` and `mean()` when fast internal (GForce) implementations are used. Closes #1023. Thanks to @DavidArenburg for the nice reproducible example.
`lapply(l, setDT)` is handled properly now; over-allocation isn't lost. Similarly, `for (i in 1:k) setDT(l[[i]])` is handled properly as well. Closes #480.
`rbindlist` stack imbalance on all `NULL` list elements is now fixed. Closes #980. Thanks to @ttuggle.
List columns can be assigned to columns of `factor` type by reference. Closes #936. Thanks to @richierocks for the minimal example.
After setting the `datatable.alloccol` option, creating a data.table with more than the set `truelength` resulted in error or segfault. This is now fixed. Closes #970. Thanks to @caneff for the nice minimal example.
Update by reference using `:=` after loading from disk where the `data.table` exists within a local environment now works as intended. Closes #479. Thanks to @ChongWang for the minimal reproducible example.
Issues on merges involving `factor` columns with `NA` and merging `factor` with `character` type with non-identical levels are both fixed. Closes #499 and #945. Thanks to @AbielReinhart and @stewbasic for the minimal examples.
`as.data.table(ll)` returned a `data.table` with 0-rows when the first element of the list has 0-length, for e.g., `ll = list(NULL, 1:2, 3:4)`. This is now fixed by removing those 0-length elements. Closes #842. Thanks to @Rick for the nice minimal example.
`as.data.table.factor` redirects to `as.data.table.matrix` when input is a `matrix`, but also of type `factor`. Closes #868. Thanks to @mgahan for the example.
`setattr` now returns an error when trying to set `data.table` and/or `data.frame` as class to a non-list type object (ex: `matrix`). Closes #832. Thanks to @Rick for the minimal example.
`data.table(table)` works as expected. Closes #1043. Thanks to @rnso for the SO post.
Joins and binary search based subsets of the form `x[i]` where `x`'s key column is integer and `i` a logical column threw an error before. This is now fixed by converting the logical column to integer type and then performing the join, so that it works as expected.
When `by` expression is, for example, `by = x %% 2`, `data.table` tries to automatically extract meaningful column names from the expression. In this case it would be `x`. However, if the `j-expression` also contains `x`, for example, `DT[, last(x), by= x %% 2]`, the original `x` got masked by the expression in `by`. This is now fixed; by-expressions are not simplified in column names for these cases. Closes #497. Thanks to @GSee for the report.
`rbindlist` now errors when columns have non-identical class attributes and are not `factor`s, e.g., binding column of class `Date` with `POSIXct`. Previously this returned incorrect results. Closes #705. Thanks to @ecoRoland for the minimal report.
Fixed a segfault in `melt.data.table` when `measure.vars` have duplicate names. Closes #1055. Thanks to @ChristK for the minimal report.
Fixed another segfault in `melt.data.table` that was caught due to an issue on Windows. Closes #1059. Thanks again to @ChristK for the minimal report.
`DT[rows, newcol := NULL]` resulted in a segfault on the next assignment by reference. Closes #1082. Thanks to @stevenbagley for the MRE.
`as.matrix(DT)` handles cases where `DT` contains both numeric and logical columns correctly (doesn't coerce to character columns anymore). Closes #1083. Thanks to @bramvisser for the SO post.
Coercion is handled properly on subsets/joins on `integer64` key columns. Closes #1108. Thanks to @vspinu.
`setDT()` and `as.data.table()` both strip all classes preceding data.table/data.frame, to be consistent with base R. Closes #1078 and #1128. Thanks to Jan and @helix123 for the reports.
`setattr(x, 'levels', value)` handles duplicate levels in `value` appropriately. Thanks to Jeffrey Horner for pointing it out here. Closes #1142.
`x[J(vals), .N, nomatch=0L]` also included no matches in result, #1074. And `x[J(...), col := val, nomatch=0L]` returned a warning with incorrect results when join resulted in no matches as well, even though `nomatch=0L` should have no effect in `:=`, #1092. Both issues are fixed now. Thanks to @riabusan and @cguill95 for #1092.
`.data.table.locked` attributes set to NULL in internal function `subsetDT`. Closes #1154. Thanks to @Jan.
Internal function `fastmean()` retains column attributes. Closes #1160. Thanks to @renkun-ken.
Using `.N` in `i`, for e.g., `DT[, head(.SD, 3)[1:(.N-1L)]]` accessed incorrect value of `.N`. This is now fixed. Closes #1145. Thanks to @claytonstanley.
`setDT` handles `key=` argument properly when input is already a `data.table`. Closes #1169. Thanks to @DavidArenburg for the PR.
Key is retained properly when joining on factor type columns. Closes #477. Thanks to @nachti for the report.
Over-allocated memory is released more robustly thanks to Karl Miller's investigation and suggested fix.
`DT[TRUE, colA:=colA*2]` no longer churns through 4 unnecessary allocations as large as one column. This was caused by `i=TRUE` being recycled. Thanks to Nathan Kurz for reporting and investigating. Added provided test to test suite. Only a single vector is allocated now for the RHS (`colA*2`). Closes #1249.
Thanks to @and3k for the excellent bug report #1258. This was a result of shallow copy retaining keys when it shouldn't. It affected some cases of joins using `on=`. Fixed now.
`set()` and `:=` handle RHS value `NA_integer_` on factor types properly. Closes #1234. Thanks to @DavidArenburg.
`merge.data.table()` didn't set column order (and therefore names) properly in some cases. Fixed now. Closes #1290. Thanks to @ChristK for the minimal example.
`print.data.table` now works for 100+ rows as intended when `row.names=FALSE`. Closes #1307. Thanks to @jangorecki for the PR.
Row numbers are not printed in scientific format. Closes #1167. Thanks to @jangorecki for the PR.
Using `.GRP` unnamed in `j` now returns a variable named `GRP` instead of `.GRP` as the period was causing issues. Same for `.BY`. Closes #1243; thanks to @MichaelChirico for the PR.
`DT[, 0, with=FALSE]` returns null data.table to be consistent with `data.frame`'s behaviour. Closes #1140. Thanks to @franknarf1.
Evaluating quoted expressions with `.` in `by` works as intended. That is, `dt = data.table(a=c(1,1,2,2), b=1:4); expr=quote(.(a)); dt[, sum(b), eval(expr)]` works now. Closes #1298. Thanks @eddi.
`as.list` method for `IDate` object works properly. Closes #1315. Thanks to @gwerbin.
NOTES
Clearer explanation of what `duplicated()` does (borrowed from base). Thanks to @matthieugomez for pointing out. Closes #872.
`?setnames` has been updated now that `names<-` and `colnames<-` shallow (rather than deep) copy from R >= 3.1.0, #853.
FAQ 1.6 has been embellished, #517. Thanks to a discussion with Vivi and Josh O'Brien.
`data.table` redefines `melt` generic and suggests `reshape2` instead of import. As a result we don't have to load `reshape2` package to use `melt.data.table` anymore. The reason for this change is that `data.table` requires R >=2.14, whereas `reshape2` R v3.0.0+. Reshape2's melt methods can be used without any issues by loading the package normally.
`DT[, j, ]` at times made an additional (unnecessary) copy. This is now fixed. This fix also avoids allocating `.I` when `j` doesn't use it. As a result `:=` and other subset operations should be faster (and use less memory). Thanks to @szilard for the nice report. Closes #921.
Because `reshape2` requires R >3.0.0, and `data.table` works with R >= 2.14.1, we can not import `reshape2` anymore. Therefore we define a `melt` generic and `melt.data.table` method for data.tables and redirect to `reshape2`'s `melt` for other objects. This is to ensure that existing code works fine.
`dcast` is also a generic now in data.table. So we can use `dcast(...)` directly, and don't have to spell it out as `dcast.data.table(...)` like before. The `dcast` generic in data.table redirects to `reshape2::dcast` if the input object is not a data.table. But for that you have to load `reshape2` before loading `data.table`. If not, reshape2's `dcast` overwrites data.table's `dcast` generic, in which case you will need the `::` operator - ex: `data.table::dcast(...)`.
NB: Ideal situation would be for `dcast` to be a generic in reshape2 as well, but it is not. We have issued a pull request to make `dcast` in reshape2 a generic, but that has not yet been accepted.
Clarified the use of `bit64::integer64` in `merge.data.table()` and `setNumericRounding()`. Closes #1093. Thanks to @sfischme for the report.
Removed an unnecessary (and silly) `giveNames` argument from `setDT()`. Not sure why I added this in the first place!
`options(datatable.prettyprint.char=5L)` restricts the number of characters to be printed for character columns. For example, after `options(datatable.prettyprint.char = 5L)`, printing `DT = data.table(x=1:2, y=c("abcdefghij", "klmnopqrstuv"))` gives:
x y
1: 1 abcde...
2: 2 klmno...
`rolltolast` argument in `[.data.table` is now defunct. It was deprecated in 1.9.4.
`data.table`'s dependency has been moved forward from R 2.14.0 to R 2.14.1, now nearly 4 years old (Dec 2011). As usual before release to CRAN we ensure data.table passes the test suite on the stated dependency and keep this as old as possible for as long as possible, as requested by users in managed environments. For this reason we still don't use `paste0()` internally, since that was added to R 2.15.0.
Warning about `datatable.old.bywithoutby` option (for grouping on join without providing `by`) being deprecated in the next release is in place now. Thanks to @jangorecki for the PR.
Fixed `allow.cartesian` documentation to `nrow(x)+nrow(i)` instead of `max(nrow(x), nrow(i))`. Closes #1123.
Changes in v1.9.4 (on CRAN 2 Oct 2014)
NEW FEATURES
`by=.EACHI` runs `j` for each group in `DT` that each row of `i` joins to.
```R
setkey(DT, ID)
DT[c("id1", "id2"), sum(val)] # single total across both id1 and id2
DT[c("id1", "id2"), sum(val), by = .EACHI] # sum(val) for each id
DT[c("id1", "id2"), sum(val), by = key(DT)] # same
```
In other words, `by-without-by` is now explicit, as requested by users, [#371](https://github.com/Rdatatable/data.table/issues/371). When `i` contains duplicates, `by=.EACHI` is different to `by=key(DT)`; e.g.,
```R
setkey(DT, ID)
ids = c("id1", "id2", "id1") # NB: id1 appears twice
DT[ids, sum(val), by = ID] # 2 rows returned
DT[ids, sum(val), by = .EACHI] # 3 rows in the order of ids (result 1 and 3 are not merged)
```
`by=.EACHI` can be useful when `i` is event data, where you don't want the events aggregated by common join values but wish the output to be ordered with repeats, or simply just using join inherited columns as parameters; e.g.;
```R
X[Y, head(.SD, i.top), by = .EACHI]
```
where `top` is a non-join column in `Y`; i.e., join inherited column. Thanks to many, especially eddi, Sadao Milberg and Gabor Grothendieck for extended discussions. Closes [#538](https://github.com/Rdatatable/data.table/issues/538).
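The snippets above assume an existing `DT`. For a self-contained illustration of the difference between the single total and `by=.EACHI`, here is a small sketch with made-up values (`ID` and `val` are the same illustrative names used above).

```r
library(data.table)

DT <- data.table(ID  = c("id1", "id1", "id2", "id2", "id3"),
                 val = c(1, 2, 3, 4, 5))
setkey(DT, ID)

ids <- c("id1", "id2", "id1")   # NB: id1 appears twice

# Without by: j is evaluated once over all joined rows
total <- DT[ids, sum(val)]

# With by = .EACHI: j is evaluated once per row of i, in the order of ids
each <- DT[ids, sum(val), by = .EACHI]
each
```

Here `total` counts the `id1` rows twice because `id1` is requested twice, while `each` keeps the two `id1` requests as separate result rows.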
Accordingly, `X[Y, j]` now does what `X[Y][, j]` did. To return the old behaviour: `options(datatable.old.bywithoutby=TRUE)`. This is a temporary option to aid migration and will be removed in future. See this, this and this post for discussions and motivation.
`Overlap joins` (#528) is now here, finally!! Except for `type="equal"` and `maxgap` and `minoverlap` arguments, everything else is implemented. Check out `?foverlaps` and the examples there on its usage. This is a major feature addition to `data.table`.
`DT[column==value]` and `DT[column %in% values]` are now optimized to use `DT`'s key when `key(DT)[1]=="column"`, otherwise a secondary key (a.k.a. index) is automatically added so the next `DT[column==value]` is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using `set2key()` and existence checked using `key2()`. These optimizations and function names/arguments are experimental and may be turned off with `options(datatable.auto.index=FALSE)`.
`fread()`:
- accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting here.
- accepts trailing backslash in quoted fields. Thanks to user2970844 for highlighting here.
- Blank and `"NA"` values in logical columns (`T`, `True`, `TRUE`) no longer cause them to be read as character, #567. Thanks to Adam November for reporting.
- URLs now work on Windows. R's `download.file()` converts `\r\n` to `\r\r\n` on Windows. Now avoided by downloading in binary mode. Thanks to Steve Miller and Dean MacGregor for reporting, #492.
- Fixed seg fault in sparse data files when bumping to character, #796 and #722. Thanks to Adam Kennedy and Richard Cotton for the detailed reproducible reports.
- New argument `fread(...,data.table=FALSE)` returns a `data.frame` instead of a `data.table`. This can be set globally: `options(datatable.fread.datatable=FALSE)`.
`.()` can now be used in `j` and is identical to `list()`, for consistency with `i`.
```R
DT[,list(MySum=sum(B)),by=...]
DT[,.(MySum=sum(B)),by=...] # same
DT[,list(colB,colC,colD)]
DT[,.(colB,colC,colD)] # same
```
Similarly, `by=.()` is now a shortcut for `by=list()`, for consistency with `i` and `j`.
- `rbindlist` gains `use.names` and `fill` arguments and is now implemented entirely in C. Closes #345:
  - `use.names` by default is FALSE for backwards compatibility (does not bind by names by default)
  - `rbind(...)` now just calls `rbindlist()` internally, except that `use.names` is TRUE by default, for compatibility with base (and backwards compatibility).
  - `fill=FALSE` by default. If `fill=TRUE`, `use.names` has to be TRUE.
  - When `use.names=TRUE`, at least one item of the input list has to have non-null column names.
  - When `fill=TRUE`, all items of the input list have to have non-null column names.
  - Duplicate columns are bound in the order of occurrence, like base.
  - Attributes that might exist in individual items would be lost in the bound result.
  - Columns are coerced to the highest SEXPTYPE when they are different, if possible.
  - And incredibly fast ;).
  - Documentation updated in much detail. Closes #333.
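The difference between binding by position, binding by name, and filling missing columns can be sketched as follows. The tables `A`, `B` and `C` are toy inputs invented for this example, and `use.names` is always passed explicitly so the result does not depend on the default in the installed version.

```R
library(data.table)

A <- data.table(a = 1:2, b = 3:4)
B <- data.table(b = 5L, a = 6L)      # same columns, different order

# use.names=FALSE binds by position: B's first column lands under 'a'
by_pos <- rbindlist(list(A, B), use.names = FALSE)

# use.names=TRUE matches columns by name instead
by_name <- rbindlist(list(A, B), use.names = TRUE)

# fill=TRUE pads columns missing from an item with NA (requires use.names=TRUE)
C <- data.table(a = 7L, c = "x")
filled <- rbindlist(list(A, C), use.names = TRUE, fill = TRUE)

by_pos$a        # 1 2 5
by_name$a       # 1 2 6
names(filled)   # "a" "b" "c"
```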
- `bit64::integer64` now works in grouping and joins, #342. Thanks to James Sams for highlighting UPCs and Clayton Stanley for this SO post. `fread()` has been detecting and reading `integer64` for a while.
- `setNumericRounding()` may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type 'numeric', #342. See example in `?setNumericRounding` and NEWS item below for v1.9.2. `getNumericRounding()` returns the current setting.
- `X[Y]` now names non-join columns from `i` that have the same name as a column in `x`, with an `i.` prefix for consistency with the `i.` prefix that has been available in `j` for some time. This is now documented.
- For a keyed table `X` where the key columns are not at the beginning in order, `X[Y]` now retains the original order of columns in X rather than moving the join columns to the beginning of the result.
- It is no longer an error to assign to row 0 or row NA.
```R
DT[0, colA := 1L] # now does nothing, silently (was error)
DT[NA, colA := 1L] # now does nothing, silently (was error)
DT[c(1, NA, 0, 2), colA:=1L] # now ignores the NA and 0 silently (was error)
DT[nrow(DT) + 1, colA := 1L] # error (out-of-range) as before
```
This is for convenience to avoid the need for a switch in user code that evals various `i` conditions in a loop passing in `i` as an integer vector which may contain `0` or `NA`.
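The loop scenario described above might look like the following sketch. The `selections` list stands in for row indices produced by some upstream logic and is purely hypothetical.

```R
library(data.table)

DT <- data.table(id = 1:3, colA = 0L)

# Hypothetical row selections; some are "empty" (0 or NA)
# and need no special-casing in the loop body
selections <- list(0L, NA_integer_, c(1L, NA, 0L, 2L))

for (rows in selections) {
  DT[rows, colA := 1L]   # 0 and NA entries are skipped silently
}

DT$colA   # rows 1 and 2 were updated; row 3 is untouched
```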
- A new function `setorder` is now implemented which uses data.table's internal fast order to reorder rows by reference. It returns the result invisibly (like `setkey`), which allows for compound statements; e.g., `setorder(DT, a, -b)[, cumsum(c), by=list(a,b)]`. Check `?setorder` for more info.
- `DT[order(x, -y)]` is now by default optimised to use data.table's internal fast order as `DT[forder(DT, x, -y)]`. It can be turned off by setting `datatable.optimize` to < 1L or just calling `base:::order` explicitly. It results in 20x speedup on data.table of 10 million rows with 2 integer columns, for example. To order character vectors in descending order it's sufficient to do `DT[order(x, -y)]` as opposed to `DT[order(x, -xtfrm(y))]` in base. This closes #603.
- `mult="all"` -vs- `mult="first"|"last"` now return consistent types and columns, #340. Thanks to Michele Carriero for highlighting.
- `duplicated.data.table` and `unique.data.table` gain a `fromLast = TRUE/FALSE` argument, similar to base. Default value is FALSE. Closes #347.
- `anyDuplicated.data.table` is now implemented. Closes #350. Thanks to M C (bluemagister) for reporting.
- Complex j-expressions of the form `DT[, c(..., lapply(.SD, fun)), by=grp]` are now optimised as long as `.SD` is of the form `lapply(.SD, fun)` or `.SD`, `.SD[1]` or `.SD[1L]`. This resolves #370. Thanks to Sam Steingold for reporting. This also completes the first two task lists in #735.
```R
## example:
DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
## is optimised to
DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
## and now... these variations are also optimised internally for speed
DT[, c(..., .SD, lapply(.SD, sum), ...), by=grp]
DT[, c(..., .SD[1], lapply(.SD, sum), ...), by=grp]
DT[, .SD, by=grp]
DT[, c(.SD), by=grp]
DT[, .SD[1], by=grp] # Note: but not yet DT[, .SD[1,], by=grp]
DT[, c(.SD[1]), by=grp]
DT[, head(.SD, 1), by=grp] # Note: but not yet DT[, head(.SD, -1), by=grp]
# but not yet optimised
DT[, c(.SD[a], .SD[x>1], lapply(.SD, sum)), by=grp] # where 'a' is, say, a numeric or a data.table, and also for expressions like x>1
```
The underlying message is that `.SD` is being slowly optimised internally wherever possible, for speed, without compromising the nice readable syntax it provides.
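As a small sanity check of the idea above, the sketch below compares a `.SD`-based `j` with its hand-written equivalent on a toy table (column names invented for illustration). It only shows that the two forms give the same answer; whether and how `j` is rewritten internally depends on the installed version and the `datatable.optimize` setting.

```R
library(data.table)

DT <- data.table(grp = c("a", "a", "b"), x = 1:3, y = c(10, 20, 30))

# Count per group plus the sum of every non-grouping column, written with .SD
via_sd <- DT[, c(list(n = .N), lapply(.SD, sum)), by = grp]

# The hand-written equivalent
by_hand <- DT[, list(n = .N, x = sum(x), y = sum(y)), by = grp]

all.equal(via_sd, by_hand)
```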
- `setDT` gains `keep.rownames = TRUE/FALSE` argument, which works only on `data.frame`s. TRUE retains the data.frame's row names as a new column named `rn`.
- The output of `tables()` now includes `NCOL`. Thanks to @dnlbrky for the suggestion.
- `DT[, LHS := RHS]` (or its equivalent in `set`) now provides a warning and returns `DT` as it was, instead of an error, when `length(LHS) = 0L`, #343. For example:
```R
DT[, grep("^b", names(DT)) := NULL] # where no columns start with b
# warns now and returns DT instead of error
```
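A self-contained sketch of the `setDT(keep.rownames=)` and empty-`LHS` items above, on a made-up two-row data.frame. The warning text is captured rather than asserted, since its exact wording is not given here.

```R
library(data.table)

# setDT() with keep.rownames=TRUE moves the data.frame's row names into column 'rn'
DF <- data.frame(a = 1:2, c = c("x", "y"), row.names = c("r1", "r2"))
setDT(DF, keep.rownames = TRUE)
names(DF)   # "rn" "a" "c"

# No column starts with "b", so the LHS is empty and nothing is deleted.
# Any warning is captured as text instead of being printed.
msg <- tryCatch({
  DF[, grep("^b", names(DF)) := NULL]
  NA_character_
}, warning = function(w) conditionMessage(w))

names(DF)   # unchanged
```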
```R
DT[, list(.N, mean(y), sum(y)), by=x] # 1.9.2 - doesn't know to use GForce - will be (relatively) slower
DT[, list(.N, mean(y), sum(y)), by=x] # 1.9.3+ - will use GForce.
```
- `setDF` is now implemented. It accepts a data.table and converts it to data.frame by reference, #338. Thanks to canneff for the discussion here on data.table mailing list.
- `.I` gets named as `I` (instead of `.I`) wherever possible, similar to `.N`, #344.
- `setkey` on `.SD` is now an error, rather than warnings for each group about rebuilding the key. The new error is similar to when attempting to use `:=` in a `.SD` subquery: ".SD is locked. Using set*() functions on .SD is reserved for possible future use; a tortuously flexible way to modify the original data by group." Thanks to Ron Hylton for highlighting the issue on datatable-help here.
- Looping calls to `unique(DT)` such as in `DT[,unique(.SD),by=group]` are now faster by avoiding internal overhead of calling `[.data.table`. Thanks again to Ron Hylton for highlighting in the same thread. His example is reduced from 28 sec to 9 sec, with identical results.
- Following `gsum` and `gmean`, now `gmin` and `gmax` from GForce are also implemented. Closes part of #523. Benchmarks are also provided.
```R
DT[, list(sum(x), min(y), max(z), .N), by=...] # runs by default using GForce
```
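For a concrete version of the call above, here is a small sketch on a toy table (all column names invented). It only demonstrates the results of the grouped `sum`/`min`/`max`/`.N`; it does not show or measure whether GForce was used.

```R
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"),
                 x = 1:4,
                 y = c(5, 2, 9, 7),
                 z = c(1L, 8L, 3L, 4L))

# sum, min, max and .N per group in a single grouped call
res <- DT[, list(sx = sum(x), miny = min(y), maxz = max(z), n = .N), by = g]

res$miny   # 2 7
res$maxz   # 8 4
```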
- `setorder()` and `DT[order(.)]` handle `integer64` type in descending order as well. Closes #703.
- `setorder()` and `setorderv()` gain `na.last = TRUE/FALSE`. Closes #706.
- `.N` is now available in `i`, FR#724. Thanks to newbie indirectly here and Farrel directly here.
- `by=.EACHI` is now implemented for not-joins as well. Closes #604. Thanks to Garrett See for filing the FR. As an example:
```R
DT = data.table(x=c(1,1,1,1,2,2,3,4,4,4), y=1:10, key="x")
DT[!J(c(1,4)), sum(y), by=.EACHI] # is equivalent to DT[J(c(2,3)), sum(y), by=.EACHI]
```
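The `na.last` and `.N`-in-`i` items above can be tried with a sketch like this one, again on a made-up table:

```R
library(data.table)

DT <- data.table(x = c(2L, NA, 1L), y = c("b", "c", "a"))

# Reorder by reference, ascending on x, with NAs placed last
setorder(DT, x, na.last = TRUE)
DT$x       # 1 2 NA

# .N in i is the number of rows, so this picks the last row
last_row <- DT[.N]
last_row$y # "c"
```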
BUG FIXES
- When joining to fewer columns than the key has, using one of the later key columns explicitly in j repeated the first value. A problem introduced by v1.9.2 and not caught by the 1,220 tests, or tests in 37 dependent packages. Test added. Many thanks to Michele Carriero for reporting.
```R
DT = data.table(a=1:2, b=letters[1:6], key="a,b") # keyed by a and b
DT[.(1), list(b,...)] # correct result again (joining just to a not b but using b)
```
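A runnable variant of the example above, restricted to the single column `b` so the expected values can be checked by eye. It assumes a version containing the fix.

```R
library(data.table)

DT <- data.table(a = 1:2, b = letters[1:6], key = "a,b")   # keyed by a and b

# Join only on the first key column 'a', but use the second key column 'b' in j
res <- DT[.(1), list(b)]
res$b   # "a" "c" "e"
```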
- `setkey` works again when a non-key column is type list (e.g. each cell can itself be a vector), #54. Test added. Thanks to James Sams, Michael Nelson and Musx for the reproducible examples.
- The warning "internal TRUE value has been modified" with recently released R 3.1 when grouping a table containing a logical column and where all groups are just 1 row is now fixed and tests added. Thanks to James Sams for the reproducible example. The warning is issued by R and we have asked if it can be upgraded to error (UPDATE: change now made for R 3.1.1 thanks to Luke Tierney).
- `data.table(list())`, `data.table(data.table())` and `data.table(data.frame())` now return a null data.table (no columns) rather than one empty column, #48. Test added. Thanks to Shubh Bansal for reporting.
- `unique(<NULL data.table>)` now returns a null data.table, #44. Thanks to agstudy for reporting.
- `data.table()` converted POSIXlt to POSIXct, consistent with `base:::data.frame()`, but now also provides a helpful warning instead of coercing silently, #59. Thanks to Brodie Gaslam, Patrick and Ragy Isaac for reporting here and here.
- If another class inherits from data.table; e.g. `class(DT) == c("UserClass","data.table","data.frame")` then `DT[...]` now retains `UserClass` in the result. Thanks to Daniel Krizian for reporting, #64. Test added.
- An error `object '<name>' not found` could occur in some circumstances, particularly after a previous error. Reported on SO with non-ASCII characters in a column name, a red herring we hope since non-ASCII characters are fully supported in data.table including in column names. Fix implemented and tests added.
- Column order was reversed in some cases by `as.data.table.table()`, #43. Test added. Thanks to Benjamin Barnes for reporting.
- `DT[, !"missingcol", with=FALSE]` now returns `DT` (rather than a NULL data.table) with warning that "missingcol" is not present.
- `DT[,y := y * eval(parse(text="1*2"))]` resulted in error unless `eval()` was wrapped with parentheses. That is, `DT[,y := y * (eval(parse(text="1*2")))]`, #5423. Thanks to Wet Feet for reporting and to Simon O'Hanlon for identifying the issue here on SO.
- Using `by` columns with attributes (ex: factor, Date) in `j` did not retain the attributes, also in case of `:=`. This was partially a regression from an earlier fix (#155) due to recent changes for R3.1.0. Now fixed and clearer tests added. Thanks to Christophe Dervieux for reporting and to Adam B for reporting here on SO. Closes #36.
- `.BY` special variable did not retain names of the grouping columns which resulted in not being able to access `.BY$grpcol` in `j`. Ex: `DT[, .BY$x, by=x]`. This is now fixed. Closes #5415. Thanks to Stephane Vernede for the bug report.
- Fixed another issue with `eval(parse(...))` in `j` along with assignment by reference `:=`. Closes #30. Thanks to Michele Carriero for reporting.
- `get()` in `j` did not see `i`'s columns when `i` is a data.table which led to errors while doing operations like: `DT1[DT2, list(get('c'))]`. Now, use of `get` makes all x's and i's columns visible (fetches all columns). Still, as the verbose message states, using `.SDcols` or `eval(macro)` would be able to select just the columns used, which is better for efficiency. Closes #34. Thanks to Eddi for reporting.
- Fixed an edge case with `unique` and `duplicated`, which on empty data.tables returned a 1-row data.table with all NAs. Closes #28. Thanks to Shubh Bansal for reporting.
- `dcast.data.table` resulted in error (because function `CJ()` was not visible) in packages that "import" data.table. This did not happen if the package "depends" on data.table. Closes bug #31. Thanks to K Davis for the excellent report.
- `merge(x, y, all=TRUE)` error when `x` is empty data.table is now fixed. Closes #24. Thanks to Garrett See for filing the report.
- Implementing #5249 closes bug #26, a case where rbind gave error when binding with empty data.tables. Thanks to Roger for reporting on SO.
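A minimal sketch of the `rbind` case mentioned in the last item, assuming a version that contains the fix. `empty` is a zero-row table with the same columns, and `null` is a data.table with no columns at all; both names are illustrative.

```R
library(data.table)

DT    <- data.table(a = 1:2, b = c("x", "y"))
empty <- DT[0]           # zero rows, same columns
null  <- data.table()    # null data.table: no columns at all

r_empty <- rbind(empty, DT)
r_null  <- rbind(DT, null)

nrow(r_empty)   # 2
nrow(r_null)    # 2
```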
- Fixed a segfault during grouping with assignment by reference, ex: `DT[, LHS := RHS, by=.]`, where length(RHS) > group size (.N). Closes #25. Thanks to Zachary Long for reporting on datatable-help mailing list.
- Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there is more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #22 and #86.
  Note that using `by=` to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.
- When DT is empty, `DT[, newcol:=max(b), by=a]` now properly adds the column, #49. Thanks to Shubh Bansal for filing the report.
- When `j` evaluates to `integer(0)/character(0)`, `DT[, j, with=FALSE]` resulted in error, #21. Thanks indirectly to Malcolm Cook for #52, through which this (recent) regression (from 1.9.3) was found.
- `print(DT)` now respects `digits` argument on list type columns, #37. Thanks to Frank for the discussion on the mailing list and to Matthew Beckers for filing the bug report.
- FR # 2551 implemented leniency in warning messages when columns are coerced with `DT[, LHS := RHS]`, when `length(RHS)==1`. But this was very lenient; e.g., `DT[, a := "bla"]`, where `a` is a logical column should get a warning. This is now fixed such that only very obvious cases coerce silently; e.g., `DT[, a := 1]` where `a` is `integer`. Closes #35. Thanks to Michele Carriero and John Laing for reporting.
- `dcast.data.table` provides better error message when `fun.aggregate` is specified but it returns length != 1. Closes #693. Thanks to Trevor Alexander for reporting here on SO.
- `dcast.data.table` tries to preserve attributes wherever possible, except when `value.var` is a `factor` (or ordered factor). For `factor` types, the casted columns will be coerced to type `character` thereby losing the `levels` attribute. Closes #688. Thanks to juancentro for reporting.
- `melt` now returns friendly error when `measure.vars` are not in data instead of segfault. Closes #699. Thanks to vsalmendra for this post on SO and the subsequent bug report.
- `DT[, list(m1 = eval(expr1), m2=eval(expr2)), by=val]` where `expr1` and `expr2` are constructed using `parse(text=.)` now works instead of resulting in error. Closes #472. Thanks to Benjamin Barnes for reporting with a nice reproducible example.
- A join of the form `X[Y, roll=TRUE, nomatch=0L]` where some of Y's key columns occur more than once (duplicated keys) might at times return incorrect join. This was introduced only in 1.9.2 and is fixed now. Closes #700. Thanks to Michael Smith for the very nice reproducible example and nice spotting of such a tricky case.
- Fixed an edge case in `DT[order(.)]` internal optimisation to be consistent with base. Closes #696. Thanks to Michael Smith and Garrett See for reporting.
- `DT[, list(list(.)), by=.]` and `DT[, col := list(list(.)), by=.]` return correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where `list(.)` does not result in a copy. Closes #481. Also thanks to KrishnaPG for filing #728.
- `dcast.data.table` handles `fun.aggregate` argument properly when called from within