cleanup.import will correct errors and shrink the size of data frames. By default, double precision numeric variables are changed to integer when they contain no fractional components. Infinite values or values greater than 1e20 in absolute value are set to NA. This solves problems of importing Excel spreadsheets that contain occasional character values for numeric columns, as S converts these to Inf without warning. There is also an option to convert variable names to lower case and to add labels to variables. The latter can be made easier by importing a CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option as shown in the example below. cleanup.import can also transform character or factor variables to dates.

upData is a function facilitating the updating of a data frame without attaching it in search position one. New variables can be added, old variables can be modified, variables can be removed or renamed, and "labels" and "units" attributes can be provided. Observations can be subsetted. Various checks are made for errors and inconsistencies, with warnings issued to help the user. Levels of factor variables can be replaced, especially using the list notation of the standard merge.levels function. Unless force.single is set to FALSE, upData also converts double precision vectors to integer if no fractional values are present in a vector.

upData is also used to process R workspace objects created by StatTransfer, which puts variable and value labels as attributes on the data frame rather than on each variable. If such attributes are present, they are used to define all the labels and value labels (through conversion to factor variables) before any label changes take place, and force.single is set to a default of FALSE, as StatTransfer already does conversion to integer.

Variables having labels but not classed "labelled" (e.g., data imported using the haven package) have that class added to them by upData.
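As a rough illustration of the upData operations described above, the following sketch renames a variable, derives a new one, and attaches "label" and "units" attributes in a single call. The data frame d and its variables (age, sex, wt, wt.lb) are made up for this example and do not come from the documentation; print=FALSE is used only to keep the output quiet.

```r
library(Hmisc)

# Hypothetical data, invented for illustration
d <- data.frame(age = c(31, 45, 52),
                sex = c("f", "m", "f"),
                wt  = c(70.5, 82.1, 64.3))

d2 <- upData(d,
             wt.lb  = wt * 2.2046,                    # derive a new variable
             rename = c(sex = "gender"),              # old name = new name
             labels = c(age = "Age", wt = "Weight"),  # variable labels
             units  = c(age = "years", wt = "kg"),    # units attributes
             print  = FALSE)

names(d2)        # variable names after renaming and adding wt.lb
label(d2$age)    # label attached to age
units(d2$age)    # units attached to age
```

The original data frame d is left unchanged; the updated copy is returned and must be assigned, as done with d2 here.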
The dataframeReduce
function removes variables from a data frame
that are problematic for certain analyses. Variables can be removed
because the fraction of missing values exceeds a threshold, because they
are character or categorical variables having too many levels, or
because they are binary and have too small a prevalence in one of the
two values. Categorical variables can also have their levels combined
when a level is of low prevalence.
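The missing-value screen can be pictured with a few lines of base R. This is only a sketch of the idea, not the code that dataframeReduce runs; the data frame d and the threshold value are invented for illustration.

```r
# Hypothetical data frame with varying amounts of missingness
d <- data.frame(a = c(1, NA, 3, 4),
                b = c(NA, NA, NA, 4),
                c = c("x", "y", "x", "y"))

fracmiss <- 0.5                   # maximum tolerated fraction of NAs (assumed value)

frac_na <- colMeans(is.na(d))     # fraction of missing values per variable
kept    <- d[, frac_na <= fracmiss, drop = FALSE]

names(kept)                       # variables that survive the screen
```

Here b is missing in three of four rows and is dropped, while a and c are kept.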
cleanup.import(obj, labels, lowernames=FALSE, force.single=TRUE, force.numeric=TRUE, rmnames=TRUE, big=1e20, sasdict, print, datevars=NULL, datetimevars=NULL, dateformat='%F', fixdates=c('none','year'), charfactor=FALSE)
upData(object, ..., subset, rename, drop, keep, labels, units, levels, force.single=TRUE, lowernames=FALSE, caplabels=FALSE, moveUnits=FALSE, charfactor=FALSE, print=TRUE, html=FALSE)
dataframeReduce(data, fracmiss=1, maxlevels=NULL, minprev=0, print=TRUE)
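force.single: By default, double precision variables are converted to single precision (in S-Plus only) unless force.single=FALSE. force.single=TRUE will also convert vectors having only integer values to have a storage mode of integer, in R or S-Plus.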
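The storage-mode conversion can be sketched in base R as follows. This is an assumed approximation of the check (every non-missing value is a whole number), not the package's own code, and the vector x is made up.

```r
x <- c(1, 2, 3, NA)       # stored as double, but holds whole numbers only
storage.mode(x)           # "double"

# Convert only when every non-missing value is a whole number
if (all(x == floor(x), na.rm = TRUE)) storage.mode(x) <- "integer"

storage.mode(x)           # "integer"
```

Integer storage takes roughly half the memory of double storage for long vectors, which is where the size reduction comes from.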
force.numeric: By default, cleanup.import will check each factor variable to see if the levels contain only numeric values and "". In that case, the variable will be converted to numeric, with "" converted to NA. Set force.numeric=FALSE to prevent this behavior.
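A base R sketch of this kind of factor-to-numeric recovery is shown below. It is an illustration under the assumption that "numeric" means every non-empty level parses with as.numeric; the factor f is invented, and this is not the package's implementation.

```r
f <- factor(c("1.5", "", "3", "10"))   # numeric data that arrived as a factor

lev      <- levels(f)
nonblank <- lev[lev != ""]

# TRUE when every non-empty level parses as a number
all_numeric <- !anyNA(suppressWarnings(as.numeric(nonblank)))

# Convert via the level labels, not the underlying integer codes
x <- if (all_numeric) as.numeric(as.character(f)) else f
x
```

Going through as.character first matters: as.numeric applied directly to a factor returns the internal codes rather than the values the levels spell out.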
labels: for cleanup.import, a character vector with one element per variable in obj. These character values are taken to be variable labels in the same order of variables in obj. For upData, labels is a named list or named vector with variables in no specific order.
lowernames: set to TRUE to change variable names to lower case. upData does this before applying any other changes, so variable names given inside arguments to upData need to be lower case if lowernames==TRUE.
big: values greater than big in absolute value are set to NA by cleanup.import.
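The handling of infinite and implausibly large values (the big threshold, 1e20 by default in the usage above) can be pictured with this base R sketch. The vector x is made up, and the code is an assumed equivalent rather than what cleanup.import executes.

```r
big <- 1e20
x   <- c(1, Inf, -3e25, NA, 42)   # hypothetical imported column

# which() drops NA comparisons, so existing NAs are left alone
x[which(is.infinite(x) | abs(x) > big)] <- NA
x
```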
print: set to TRUE or FALSE to force or prevent printing of the current variable number being processed. By default, such messages are printed if the product of the number of variables and number of observations in obj exceeds 500,000. For dataframeReduce, set print to FALSE to suppress printing information about dropped or modified variables. The same applies to upData.
datevars: character vector of names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F", which uses the yyyy-mm-dd format.
datetimevars: character vector of names (after lowernames is applied) of variables to consider to be date-time variables, with date formats as described under datevars followed by a space followed by time in hh:mm:ss format. chron is used to store date-time variables. If all times in the variable are 00:00:00, the variable will be converted to an ordinary date variable.
dateformat: for cleanup.import, the input format (see strptime).
fixdates: for any of the variables listed in datevars that have a dateformat that cleanup.import understands, specifying fixdates allows corrections of certain formatting inconsistencies before the fields are attempted to be converted to dates (the default is to assume that the dateformat is followed for all observations for datevars). Currently fixdates='year' is implemented, which will cause 2-digit or 4-digit years to be shifted to the alternate number of digits when dateformat is the default "%F" or is "%y-%m-%d", "%m/%d/%y", or "%m/%d/%Y". Two-digit years are padded with 20 on the left. Set dateformat to the desired format, not the exceptional format.
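To make the year correction concrete, here is a base R sketch of one direction of it: padding 2-digit years with 20 so that every value matches a 4-digit-year format. The input vector is invented, and the regular expression is an assumption for illustration, not the logic cleanup.import uses internally.

```r
d <- c("01/02/2004", " 1/3/04", "")   # mixed 4-digit and 2-digit years

d <- trimws(d)                         # drop stray leading/trailing blanks

# Pad a trailing 2-digit year with "20"; 4-digit years do not match the pattern
d4 <- sub("/([0-9]{2})$", "/20\\1", d)

dates <- as.Date(d4, format = "%m/%d/%Y")   # the empty string becomes NA
dates
```

The format passed to as.Date is the desired one (%m/%d/%Y); the exceptional 2-digit values are repaired before conversion rather than being given their own format.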
charfactor: set to TRUE to change character variables to factors if they have fewer than n/2 unique values. Null strings and blanks are converted to NAs.
...: for upData, one or more expressions of the form variable=expression, to derive new variables or change old ones.
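subset: an expression that evaluates to a logical vector specifying which rows of object should be retained. The expression should use the original variable names, i.e., before any variables are renamed but after lowernames takes effect.
rename: a list or named character vector giving old and new variable names. For example, to rename variables age and sex to respectively Age and gender, specify rename=list(age="Age", sex="gender") or rename=c(age=...).
units: a named vector or list defining "units" attributes of variables, in no specific order.
levels: a named list defining "levels" attributes for factor variables, in no specific order. The values in this list may be character vectors redefining levels (in order) or another list (see merge.levels if using S-Plus).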
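In R, the base levels<- replacement function accepts the same kind of list, so the list notation can be tried on its own. A small sketch, with a made-up factor y:

```r
y <- factor(c("a", "b1", "b2", "b1"))

# Each list element names a new level and lists the old levels merged into it
levels(y) <- list(a = "a", b = c("b1", "b2"))

levels(y)         # "a" "b"
as.character(y)   # "a" "b" "b" "b"
```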
caplabels: set to TRUE to capitalize the first letter of each word in each variable label.
moveUnits: set to TRUE to look for units of measurements in variable labels and move them to a "units" attribute. If an expression in a label is enclosed in parentheses or brackets it is assumed to be units if moveUnits=TRUE.
html: set to TRUE to print conversion information as html verbatim at 0.6 size. The user will need to put results='asis' in a knitr chunk header to properly render this output.
fracmiss: the maximum permissible proportion of NAs for a variable to be kept. Default is to keep all variables no matter how many NAs are present.
See also: sas.get, data.frame, describe, label, read.csv, strptime, POSIXct, Date
## Not run:
# dat <- read.table('myfile.asc')
# dat <- cleanup.import(dat)
# ## End(Not run)
dat <- data.frame(a=1:3, d=c('01/02/2004',' 1/3/04',''))
cleanup.import(dat, datevars='d', dateformat='%m/%d/%y', fixdates='year')
dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2 # copy to original name and delete dat2 if OK
rm(dat2)
dat3 <- upData(dat, X=X^2, subset = x < (3/7)^2 - 5, rename=c(x='X'))
# Remove hard to analyze variables from a redundancy analysis of all
# variables in the data frame
d <- dataframeReduce(dat, fracmiss=.1, minprev=.05, maxlevels=5)
# Could run redun(~., data=d) at this point or include dataframeReduce
# arguments in the call to redun
# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data. Let's also
# convert names to lower case for the main data file
## Not run:
# mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
# ## End(Not run)