SPOTMisc (version 1.19.52)

getDataCensus: Get Census KDD data set (+variation)

Description

This function downloads the Census KDD data set (OpenML ID: 4535) or loads it from the cache folder. If requested, the data set is modified with respect to the number of observations, the number of numerical and categorical features, the cardinality of the categorical features, and the task type (regression or classification).

Usage

getDataCensus(
  task.type = "classif",
  nobs = 50000,
  nfactors = "high",
  nnumericals = "high",
  cardinality = "high",
  data.seed = 1,
  cachedir = "oml.cache",
  target = NULL,
  cache.only = FALSE
)

Value

census data set

Arguments

task.type

character, either "classif" or "regr".

nobs

integer, number of observations uniformly sampled from the full data set.

nfactors

character, controls the number of factors (categorical features) to use. Can be "low", "med", "high", or "full" (full corresponds to original data set).

nnumericals

character, controls the number of numerical features to use. Can be "low", "med", "high", or "full" (full corresponds to original data set).

cardinality

character, controls the number of factor levels (categories) for the categorical features. Can be "low", "med", "high" (high corresponds to original data set).

data.seed

integer, this will be used via set.seed() to make the random subsampling reproducible. Will not have an effect if all observations are used.
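The effect of data.seed can be illustrated with base R alone. The sketch below is not SPOTMisc's internal code: it assumes the subsampling amounts to a call to set.seed() followed by a uniform draw of row indices with sample(). The data frame `full` and the helper `subsample_rows` are hypothetical stand-ins introduced only for this illustration.

```r
# Toy stand-in for the full data set (hypothetical; not the census data).
full <- data.frame(id = 1:1000, x = seq(0, 1, length.out = 1000))

# Hypothetical helper: draw `nobs` rows uniformly without replacement.
subsample_rows <- function(data, nobs, data.seed) {
  if (nobs >= nrow(data)) {
    return(data)  # all observations are used, so the seed has no effect
  }
  set.seed(data.seed)
  data[sample(nrow(data), nobs), , drop = FALSE]
}

a <- subsample_rows(full, nobs = 100, data.seed = 1)
b <- subsample_rows(full, nobs = 100, data.seed = 1)

identical(a, b)  # the same seed reproduces the same rows
```

Under this assumption, repeating a call with the same nobs and data.seed returns the same subset, while requesting all observations bypasses the random draw entirely.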

cachedir

character. The cache directory, e.g., "oml.cache". Default: "oml.cache".

target

character, either "age" or "income_class". If target = "age", the numerical variable age is converted to a factor: age <- as.factor(age < 40)
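To make the conversion quoted above concrete, here is a small base R sketch applied to a hypothetical vector of ages. The values are invented for illustration and are not taken from the census data.

```r
# Hypothetical toy ages; the real column comes from the census data.
age <- c(23, 39, 40, 57, 18)

# Conversion quoted in the argument description: TRUE if younger than 40.
age <- as.factor(age < 40)

levels(age)  # the two classes of the resulting binary target
table(age)   # class counts
```

The result is a two-level factor, so the age target becomes a binary classification label. An age of exactly 40 falls into the FALSE class because the comparison is strict.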

cache.only

logical. Only try to retrieve the object from cache. Will result in an error if the object is not found. Default is FALSE.

Examples

# \donttest{
## Example downloads OpenML data, might take some time:
task.type <- "classif"
nobs <- 1e4 # max: 229285
data.seed <- 1
nfactors <- "full"
nnumericals <- "low"
cardinality <- "med"
censusData <- getDataCensus(
  task.type = task.type,
  nobs = nobs,
  nfactors = nfactors,
  nnumericals = nnumericals,
  cardinality = cardinality,
  data.seed = data.seed,
  cachedir = "oml.cache",
  target = "age"
)
# }