The AdultUCI data set contains the questionnaire data of the “Adult” database (originally called the “Census Income” Database) formatted as a data.frame. The Adult data set contains the same data already prepared and coerced to transactions for use with arules.
data("Adult")
data("AdultUCI")
The AdultUCI data set contains a data frame with 48842 observations on the following 15 variables.
age: a numeric vector.
workclass: a factor with levels Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, and Without-pay.
education: an ordered factor with levels Preschool < 1st-4th < 5th-6th < 7th-8th < 9th < 10th < 11th < 12th < HS-grad < Prof-school < Assoc-acdm < Assoc-voc < Some-college < Bachelors < Masters < Doctorate.
education-num: a numeric vector.
marital-status: a factor with levels Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, and Widowed.
occupation: a factor with levels Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, and Transport-moving.
relationship: a factor with levels Husband, Not-in-family, Other-relative, Own-child, Unmarried, and Wife.
race: a factor with levels Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, and White.
sex: a factor with levels Female and Male.
capital-gain: a numeric vector.
capital-loss: a numeric vector.
fnlwgt: a numeric vector.
hours-per-week: a numeric vector.
native-country: a factor with levels Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, and Yugoslavia.
income: an ordered factor with levels small < large.
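The difference between the plain and the ordered factors listed above matters when comparing values. The following is a minimal base R sketch on a toy vector (the values are hypothetical, and only a subset of the documented education levels is used, in the documented order):

```r
# Toy vector (hypothetical values) using a subset of the documented
# education levels, kept in the documented order.
edu_levels <- c("Preschool", "9th", "HS-grad", "Bachelors", "Masters", "Doctorate")
edu <- factor(c("HS-grad", "Masters", "9th", "Bachelors"),
              levels = edu_levels, ordered = TRUE)

# Ordered factors support comparison operators based on the level order.
at_least_hs <- edu >= "HS-grad"
at_least_hs

# A plain factor such as sex carries no ordering.
sex <- factor(c("Female", "Male", "Female"))
is.ordered(edu)
is.ordered(sex)
```

The same kind of comparison should work on the ordered columns of AdultUCI once arules is loaded, since they are stored as ordered factors.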
The “Adult” database was extracted from the census bureau database found at https://www.census.gov/ in 1994 by Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics. It was originally used to predict whether income exceeds USD 50K/yr based on census data. We added the attribute income with levels small and large (>50K).
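As an illustration of how such a two-level ordered attribute can be derived, here is a small base R sketch. The raw class labels "<=50K" and ">50K" and the toy values are assumptions made for the example; this is not necessarily how the packaged data were built.

```r
# Hypothetical raw class labels (an assumption for this sketch).
raw_income <- c("<=50K", ">50K", "<=50K", NA)

# Map the raw labels to an ordered factor with small < large.
# Missing values stay NA.
income <- factor(raw_income,
                 levels = c("<=50K", ">50K"),
                 labels = c("small", "large"),
                 ordered = TRUE)

levels(income)
table(income, useNA = "ifany")
```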
We prepared the data set for association mining as shown in the section Examples. We removed the continuous attribute fnlwgt (final weight). We also eliminated education-num because it is just a numeric representation of the attribute education.
The other 4 continuous attributes we mapped to ordinal attributes as follows:
age: cut into levels Young (0-25), Middle-aged (26-45), Senior (46-65) and Old (66+).
hours-per-week: cut into levels Part-time (0-25), Full-time (26-40), Over-time (41-60) and Workaholic (more than 60).
capital-gain and capital-loss: each cut into levels None (0), Low (greater than 0, up to the median of the values greater than zero) and High (greater than that median).
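The mapping above can be tried on a few values with base R's cut(). This is a minimal sketch on toy vectors (hypothetical values, not taken from AdultUCI); the breaks and labels follow the example code for this data set.

```r
# Toy values (hypothetical), not taken from AdultUCI.
age   <- c(17, 25, 26, 45, 46, 70)
hours <- c(10, 25, 40, 41, 60, 80)
gain  <- c(0, 0, 1000, 3000, 5000, 0)

# cut() builds right-closed intervals by default, so 25 falls in (15,25].
age_cat <- cut(age, breaks = c(15, 25, 45, 65, 100),
               labels = c("Young", "Middle-aged", "Senior", "Old"),
               ordered_result = TRUE)

hours_cat <- cut(hours, breaks = c(0, 25, 40, 60, 168),
                 labels = c("Part-time", "Full-time", "Over-time", "Workaholic"),
                 ordered_result = TRUE)

# Zero gets its own level; positive values are split at their median.
gain_median <- median(gain[gain > 0])
gain_cat <- cut(gain, breaks = c(-Inf, 0, gain_median, Inf),
                labels = c("None", "Low", "High"),
                ordered_result = TRUE)

data.frame(age, age_cat, hours, hours_cat, gain, gain_cat)
```

Because the intervals are closed on the right, a value equal to a break (for example 25 hours, or a gain equal to the median) lands in the lower level.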
A. Asuncion & D. J. Newman (2007): UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
The data set was first cited in Kohavi, R. (1996): Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
# NOT RUN {
## the data sets and the "transactions" class come with arules
library("arules")

data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2, ]

## remove attributes
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL

## map metric attributes
AdultUCI[["age"]] <- ordered(
  cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))

AdultUCI[["hours-per-week"]] <- ordered(
  cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

## capital-gain and capital-loss: zero, then split at the median of the positive values
AdultUCI[["capital-gain"]] <- ordered(
  cut(AdultUCI[["capital-gain"]],
    c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
  labels = c("None", "Low", "High"))

AdultUCI[["capital-loss"]] <- ordered(
  cut(AdultUCI[["capital-loss"]],
    c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
  labels = c("None", "Low", "High"))

## create transactions
Adult <- as(AdultUCI, "transactions")
Adult
# }
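To give an intuition for what the coercion to transactions produces, the following base R sketch builds item labels of the form variable=level from a toy data frame. The helper to_items() and the toy values are hypothetical and only illustrate the idea; this is not the arules implementation, which stores the result in a sparse transactions object.

```r
# Toy data frame (hypothetical values) with two discretized attributes.
toy <- data.frame(
  age = factor(c("Young", "Senior", NA)),
  sex = factor(c("Male", "Female", "Female"))
)

# Hypothetical helper: build "variable=level" item labels row by row,
# skipping missing values.
to_items <- function(df) {
  lapply(seq_len(nrow(df)), function(i) {
    vals <- vapply(df[i, , drop = FALSE], as.character, character(1))
    keep <- !is.na(vals)
    paste(names(df)[keep], vals[keep], sep = "=")
  })
}

items <- to_items(toy)
items
```

Each row becomes one transaction, and each non-missing attribute value contributes one item, which is why all attributes have to be categorical before the coercion.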