asNumericDF
Coerce to numeric dropping commas and info after a blank
For asNumericChar, delete leading blanks and a
leading dollar sign plus commas (thousand separators)
and drop information after a blank (other than leadng
blanks), then coerce to numeric or to factors, Dates,
or POSIXct as desired.
For a data.frame, apply
asNumericChar to all columns and drop
columns except those in keep, ignore,
factors, Dates, POSIX and
MSdates.
Then order the rows by the
orderBy column. Some Excel imports include
commas as thousand separators; this replaces any
commas with char(0), '', before trying to convert
to numeric.
Similarly, if "%" is found as the last character in any field, drop the percent sign and divide the resulting numeric conversion by 100 to convert to proportion.
Also, some character data includes footnote references following the year.
For example Table F-1 from the US Census Bureau
needs all three of these numeric conversion features:
It needs orderBy, because the most recent
year appears first, just the opposite of most other
data sets where the most recent year appears last.
It has footnote references following a character
string indicating the year. And it includes commas
as thousand separators.
- Keywords
- manip
Usage
asNumericChar(x, leadingChar='^\\$',
suppressChar=',', pctChar='%$')
asNumericDF(x, keep=function(x)any(!is.na(x)),
orderBy=NA, ignore=NULL, factors=NULL,
Dates=NULL, POSIX=NULL, MSdates=NULL,
format., leadingChar='^\\$',
suppressChar=',', pctChar='%$')
Arguments
- x
For
asNumericChar, this is a character vector to be converted to numeric aftergsub(',', '', x). ForasNumericDF, this is adata.framewith all character columns to be converted to numerics.- keep
something to indicate which columns to keep, in addition to columns specified in
ignore,factors,Dates, andPOSIX.- orderBy
Which columns to order the rows of
x[, keep]by. Default is to keep the input order.- ignore
vector identifying columns of
xto ignore, i.e., to keep and not attempt to convert to another data type.- factors
vector indicating columns of
xto convert tofactor- Dates
vector indicating columns of
xto convert usingas.Date(,format).- POSIX
vector indicating columns of
xto convert usingas.POSIXct(,format).- format.
Character vector of length 1 to pass as argument
formattoas.Dateand / oras.POSIXctfor conversion fromcharacter.For
Dates,as.Dateis first tried with format = '%Y-%m-%d', then with '%Y/%m/%d', '%m-%d-%Y', and '%m/%d/%Y'. The conversion with the fewest NAs is kept. If two match for numbers of NAs, the one with the minimum absolute deviations from as.Date1970(0) is used.- MSdates
The names or numbers identifying columns of x identifying dates as integer numbers of days since 1899-12-31. In Microsoft Excel, dates are stored in that format. Tests on 2016-08-10 revealed that
read.xlsdid not read those dates properly (using gdata version 2.17.0), whileread.xlsxread them as integers using openxlsx 3.0.0. This capability will convert those integers to Dates usingas.Date(., origin = as.Date('1899-12-31')).- leadingChar
A regular expression passed to
grepandsubto replace something like an initial dollar sign withcharacter(0).- suppressChar
a regular expression passed to
gsubto replace all occurrances of something like "," (a thousands separator in the U.S.) withcharacter(0).- pctChar
A regular expression passed to
grepto identify percent columns.pctCharis then passed tosubto replacepctCharwithcharacter(0), and the converted numbers are then divided by 100 to convert them to proportions.
Details
For asNumericChar:
1. Replace commas by nothing
2. strsplit on ' ' and take only the first part, thereby eliminating the footnote references.
3. Replace any blanks with NAs
4. as.numeric
for asNumericDF:
1. Copy x to X.
2. Confirm that ignore, factors,
Dates, and POSIX all refer to columns
of x and do not overlap. [*** NOTE: as of
2016-07-21, these checks have only been implemented
for ignore.]
3. Convert factors, Dates, and
POSIX.
4. Apply asNumericChar to all columns
not in ignore, factors, Dates,
or POSIX.
5. Keep columns specified by keep.
6. return the result.
Value
References
"Add (sum) or subtract dates; Applies To: Excel 2013". Microsoft. (accessed 2016-08-11)
See Also
scan
gsub
Quotes
stripBlanks
as.numeric, factor,
as.Date, as.POSIXct
read.xlsx
Examples
# NOT RUN {
##
## 1. an example
##
xDate <- as.Date('1970-01-01')+c(0, 365)
xPOSIX <- as.POSIXct(xDate)+c(1, 99)
xMSdate <- as.Date(1, as.Date('1899-12-31'))+1:2
fakeF1 <- data.frame(yr=c('1948', '1947 (1)'),
q1=c(' 1,234 ', ''), duh=rep(NA, 2),
dol=c('$1,234', ''), pct=c('1%', '2%'),
xDate=as.character(xDate, format='%m-%d-%Y'),
xPOSIX=as.character(xPOSIX,
format='%m-%d-%Y %H:%M:%S'),
xMSdate=2:3, junk=c('this is','junk'))
# This converts the last 3 columns to NAs and drops them:
str(nF1.1 <- asNumericChar(fakeF1$yr))
str(nF1.2 <- asNumericChar(fakeF1$q1))
str(nF1.3 <- asNumericChar(fakeF1$duh))
nF1 <- asNumericDF(fakeF1)
nF2 <- asNumericDF(fakeF1, Dates=6, MSdate='xMSdate',
ignore=c('junk', 'xPOSIX'), format.='%m-%d-%Y')
nF3 <- asNumericDF(nF2, POSIX='xPOSIX',
ignore=c(5,7:8), format.='%m-%d-%Y %H:%M:%S')
# check
nF1. <- data.frame(yr=asNumericChar(fakeF1$yr),
q1=asNumericChar(fakeF1$q1),
dol=asNumericChar(fakeF1$dol),
pct=c(.01, .02),
xMSdate=2:3)
nF1c <- data.frame(yr=1948:1947, q1=c(1234, NA),
dol=c(1234, NA), pct=c(.01, .02),
xMSdate=2:3)
# }
# NOT RUN {
all.equal(nF1, nF1.)
# }
# NOT RUN {
all.equal(nF1., nF1c)
# }
# NOT RUN {
nF3c <- data.frame(yr=1948:1947, q1=c(1234, NA),
dol=c(1234, NA), pct=c(.01, .02),
xDate=xDate, xPOSIX=xPOSIX,
xMSdate=xMSdate, junk=fakeF1$junk)
# }
# NOT RUN {
all.equal(nF3, nF3c)
# }
# NOT RUN {
##
## 2. orderBy=1:2
##
nF. <- asNumericDF(fakeF1, orderBy=1:2)
# }
# NOT RUN {
all.equal(nF., nF1c[2:1,])
# }