datanugget (version 1.4.0)

create.DN: Create Data Nuggets

Description

This function draws a random sample of observations from a large dataset and creates data nuggets, a type of representative sample of the dataset, using a specified distance metric.

Usage

create.DN(x,
          center.method = "original",
          R = 5000,
          delete.percent = .1,
          DN.num1 = 10^4,
          DN.num2 = 2000,
          dist.metric = "euclidean", 
          seed = 291102,
          no.cores = (detectCores() - 1),
          make.pbs = FALSE)

Value

An object of class datanugget:

Data Nuggets

DN.num2 by (ncol(x)+3) data frame containing the information for the data nuggets created (index, center, weight and scale).

Data Nugget Assignments

Vector of length nrow(x) containing the data nugget assignment of each observation in x.
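
As a rough illustration of how the assignment vector can be used, the sketch below counts observations per data nugget and splits the rows of the data accordingly. The `assignments` vector is a hand-made stand-in for `my.DN$`Data Nugget Assignments``, and the integer labels are an assumption about its format, not confirmed output of create.DN.

```r
set.seed(1)
x <- data.frame(a = rnorm(12), b = rnorm(12))

# Hypothetical stand-in for my.DN$`Data Nugget Assignments`:
# one nugget label per row of x (labels assumed to be integers).
assignments <- rep(1:3, times = c(5, 4, 3))
stopifnot(length(assignments) == nrow(x))

# Number of observations assigned to each nugget.
nugget.sizes <- table(assignments)

# Rows of x grouped by the nugget they belong to.
rows.by.nugget <- split(x, assignments)
```

Each element of `rows.by.nugget` is a data frame holding the observations of one nugget, which is convenient for inspecting what a particular nugget summarizes.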

Arguments

x

A data matrix (of class matrix, data.frame, or data.table) containing only entries of class numeric.

center.method

The method used for creating data nugget centers. Must be 'mean', 'random', or 'original'. 'mean' sets each data nugget center to the mean of all observations within that data nugget. 'random' sets each center to a randomly chosen observation within that data nugget. 'original' keeps the centers generated by the final run of data nugget creation in the create.DNcenters function. Default is 'original'.
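
The difference between 'mean' and 'random' can be sketched in base R for a single nugget. This is only an illustration of the two definitions above, not the package's internal code, and `nugget.obs` is a made-up set of member observations.

```r
set.seed(42)

# Hypothetical observations that all belong to one data nugget.
nugget.obs <- matrix(rnorm(30), ncol = 3)

# 'mean': the center is the column-wise mean of the members.
center.mean <- colMeans(nugget.obs)

# 'random': the center is one member picked at random.
center.random <- nugget.obs[sample(nrow(nugget.obs), 1), ]
```

The mean center need not coincide with any observation, while the random center is always an actual data point.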

R

The number of observations to sample from the data matrix when creating the initial data nugget centers. Must be of class numeric within [100,10000]. Default is 5000.

delete.percent

The proportion of observations to remove from the data matrix at each iteration when finding data nugget centers. Must be of class numeric and within (0,1). Default is 0.1.

DN.num1

The number of initial data nugget centers to create. Must be of class numeric. Default is 10^4.

DN.num2

The number of final data nuggets to create. Must be of class numeric. Default is 2000.

dist.metric

The distance metric used to create the initial centers of data nuggets. Must be 'euclidean' or 'manhattan'. Default is 'euclidean'.
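
For reference, the two metrics can be compared on a pair of points using base R's `dist()`. This only illustrates the metrics themselves; how create.DN applies them internally is not shown here.

```r
# Two points in the plane.
pts <- rbind(p = c(0, 0), q = c(3, 4))

# Euclidean: sqrt(3^2 + 4^2) = 5
d.euclidean <- as.numeric(dist(pts, method = "euclidean"))

# Manhattan: |3| + |4| = 7
d.manhattan <- as.numeric(dist(pts, method = "manhattan"))
```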

seed

Random seed for replication. Must be of class numeric. Default is 291102.

no.cores

Number of cores used for parallel processing. If 0, parallel processing is not used. Must be of class numeric. Default is detectCores() - 1.
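
One way to choose a value for this argument is sketched below. `detectCores()` from the parallel package can return NA when the core count is unknown, so the sketch falls back to 0 (no parallel processing) in that case. The fallback policy is a suggestion, not something create.DN requires.

```r
library(parallel)

available <- detectCores()

# Leave one core free; use 0 (serial) if the count is unknown
# or only a single core is available.
no.cores <- if (is.na(available)) 0 else max(available - 1, 0)
```

The resulting `no.cores` can then be passed to create.DN explicitly.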

make.pbs

Logical; whether to show a progress bar while the function runs. Default is FALSE.

Author

Rituparna Dey, Traymon Beavers, Javier Cabrera, Mariusz Lubomirski

Details

Data nuggets are a representative sample meant to summarize Big Data: a large dataset is reduced to a much smaller one by eliminating redundant points while preserving the peripheries of the dataset. Each data nugget is defined by a center (location), weight (importance), and scale (internal variability). This function creates data nuggets using Algorithm 1 provided in the reference.
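
To make the center, weight, and scale triple concrete, here is a minimal base R sketch. It is not Algorithm 1: the provisional centers are simply random observations, each observation is assigned to its nearest center, and the scale is assumed to be the average within-nugget variance across variables. All of these choices are assumptions made for illustration only.

```r
set.seed(7)
x <- matrix(rnorm(200 * 2), ncol = 2)
k <- 10

# Provisional centers: k observations picked at random (illustrative only).
centers0 <- x[sample(nrow(x), k), , drop = FALSE]

# Squared Euclidean distance from every observation to every center.
d2 <- outer(rowSums(x^2), rowSums(centers0^2), "+") - 2 * x %*% t(centers0)

# Assign each observation to its nearest provisional center.
assignment <- max.col(-d2, ties.method = "first")

# Weight: number of observations in each nugget.
weight <- tabulate(assignment, nbins = k)

# Center: mean of the observations in each nugget.
center <- rowsum(x, assignment) / weight

# Scale (assumed definition): mean within-nugget variance over variables.
within.ss <- rowsum((x - center[assignment, , drop = FALSE])^2, assignment)
scale <- rowSums(within.ss) / (weight * ncol(x))

# One row per nugget: index, center coordinates, weight, scale.
nuggets <- data.frame(index = seq_len(k), center, weight = weight, scale = scale)
```

The resulting table has `ncol(x) + 3` columns, mirroring the layout described under Value, and the weights sum to the number of observations.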

References

Beavers, T. E., Cheng, G., Duan, Y., Cabrera, J., Lubomirski, M., Amaratunga, D., & Teigler, J. E. (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure. Journal of Computational and Graphical Statistics, 1-21.

Cherasia, K. E., Cabrera, J., Fernholz, L. T., & Fernholz, R. (2022). Data Nuggets in Supervised Learning. In Robust and Multivariate Statistical Methods: Festschrift in Honor of David E. Tyler (pp. 429-449). Cham: Springer International Publishing.

Examples

      library(datanugget)

      ## small example
      X = cbind.data.frame(rnorm(10^3),
                           rnorm(10^3),
                           rnorm(10^3))

      suppressMessages({

        my.DN = create.DN(x = X,
                          R = 500,
                          delete.percent = .1,
                          DN.num1 = 500,
                          DN.num2 = 250,
                          no.cores = 0,
                          make.pbs = FALSE)

      })

      my.DN$`Data Nuggets`
      my.DN$`Data Nugget Assignments`

      ## large example
      X = cbind.data.frame(rnorm(5*10^4),
                           rnorm(5*10^4),
                           rnorm(5*10^4),
                           rnorm(5*10^4),
                           rnorm(5*10^4))

      my.DN = create.DN(x = X,
                        R = 5000,
                        delete.percent = .9,
                        DN.num1 = 10^4,
                        DN.num2 = 2000,
                        no.cores = 2)

      my.DN$`Data Nuggets`
      my.DN$`Data Nugget Assignments`