rgeode: GEOmetric Density Estimation.

Description

It selects the principal directions of the data and performs inference. GEODE can also handle missing data.

Usage

rgeode(Y, d = 6, burn = 1000, its = 2000, tol = 0.01, atau = 1/20,
  asigma = 1/2, bsigma = 1/2, starttime = NULL, stoptime = NULL,
  fast = TRUE, c0 = -1, c1 = -0.005)

Arguments

Y

array_like a real input matrix (or data frame) with dimensions \((n, D)\): the data matrix.

d

int, optional a conservative upper bound for the true dimension of the data. The true dimension is assumed to be smaller than d.

burn

int, optional number of burn-in iterations of the Gibbs sampler. It is also the time at which the selection of the principal axes stops.

its

int, optional number of iterations that must be performed after the burn-in.

tol

double, optional threshold for adaptively removing redundant dimensions. It is compared with the ratio \(\frac{\alpha_j^2(t)}{\max_i \alpha_i^2(t)}\).

atau

double, optional The parameter \(a_\tau\) of the truncated Exponential (the prior for \(\tau_j\)).

asigma

double, optional The shape parameter \(a_\sigma\) of the truncated Gamma (the prior for \(\sigma^2\)).

bsigma

double, optional The rate parameter \(b_\sigma\) of the truncated Gamma (the prior for \(\sigma^2\)).

starttime

int, optional starting time for adaptive pruning. It must be less than the number of burn-in iterations.

stoptime

int, optional stopping time for adaptive pruning. It must be less than the number of burn-in iterations.

fast

bool, optional If TRUE, a fast rank-d SVD is used; otherwise the classical SVD is used.

c0

double, optional Additive constant for the exponent of the pruning step.

c1

double, optional Multiplicative constant for the exponent of the pruning step.
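The pruning-related arguments (tol, c0, c1) can be illustrated with a small base-R sketch. It is not the package's implementation. It assumes that a dimension is flagged as redundant when its ratio falls strictly below tol, and that the exponent of the pruning step has the form c0 + c1 * t; neither detail is stated on this page. The values in alpha_sq and the helper prune_prob are hypothetical.

```r
# Hypothetical values of alpha_j^2(t) at one iteration (not produced by rgeode).
alpha_sq <- c(4.0, 2.5, 0.9, 0.03, 0.01)
tol <- 0.01

# Ratio compared with tol: alpha_j^2(t) / max_i alpha_i^2(t).
ratio <- alpha_sq / max(alpha_sq)

# Assumption: a dimension counts as redundant when its ratio is below tol.
redundant <- which(ratio < tol)

# Assumption: the pruning step uses exp(c0 + c1 * t), so c0 is the additive
# constant and c1 the multiplicative constant of the exponent.
prune_prob <- function(t, c0 = -1, c1 = -0.005) {
  exp(c0 + c1 * t)
}

# With negative c1 the value decays as the iteration index t grows.
probs <- prune_prob(c(0, 100, 1000))
```

Under these assumptions, the default c1 = -0.005 makes pruning progressively rarer over the iterations, while c0 = -1 sets the starting level.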

Value

rgeode returns a list containing the following components:

InD

array_like The chosen principal axes.

matrix Contains the samples from the full conditional posterior of the \(u_j\). Each iteration is stored in a column.

tau

matrix Contains the samples from the full conditional posterior of the \(\tau_j\).

sigmaS

array_like Contains the samples from the full conditional posterior of \(\sigma^2\).

matrix Contains the principal singular vectors.

Miss

list Contains all the information about missing data. If there are no missing data, this output is not provided.

id_m array The set of rows with missing data.
pos_m list The positions of the missing entries for each row with missing values.
yms list The pseudo-observations substituted for the missing data. Each element of the list holds the simulated data for one iteration.
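The stored samples are typically summarised after discarding the burn-in. The sketch below uses a simulated stand-in chain rather than real rgeode output, and it assumes that a stored chain such as sigmaS has length burn + its with the burn-in draws first. The helper summarise_chain is hypothetical and not part of the package.

```r
# Simulated stand-in for a stored chain (e.g. the sigmaS component);
# these are NOT real rgeode results.
set.seed(1)
burn <- 1000
its <- 2000
chain <- abs(rnorm(burn + its, mean = 0.5, sd = 0.05))

# Hypothetical helper: drop the first `burn` draws, then report the
# posterior mean and an equal-tailed 95% interval of the remaining draws.
summarise_chain <- function(chain, burn) {
  kept <- chain[(burn + 1):length(chain)]
  c(mean = mean(kept), quantile(kept, c(0.025, 0.975)))
}

s <- summarise_chain(chain, burn)
```

With a fitted object, the same helper could be applied to the sigmaS component, provided the assumption about the chain layout holds.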

Details

GEOmetric Density Estimation (rgeode) is a fast algorithm for inference on normally distributed data. It consists of two main steps:

Selection of the principal axes of the data.
Adaptive Gibbs sampler with the creation of a set of samples from the full conditional posteriors of the parameters of interest, which enable us to perform inference.

It takes several inputs: a rectangular \((n, D)\) matrix \(Y\), on which a fast rank-\(d\) SVD is run; the conservative upper bound \(d\) on the true dimension of the data; and a set of tuning parameters. The upper bound must satisfy \(d > p\), where \(p\) is the true dimension, and \(d \ll D\).
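The role of \(d\) can be seen in a small base-R sketch on synthetic data. It uses the classical svd() truncated to \(d\) components, not the fast rank-\(d\) algorithm used by the package, and all sizes and variable names are chosen only for illustration.

```r
set.seed(42)
n <- 100; D <- 50   # data dimensions
p <- 3              # true dimension (known here only because we simulate)
d <- 6              # conservative upper bound: d > p and d much smaller than D

# Synthetic data lying close to a p-dimensional subspace of R^D.
W <- qr.Q(qr(matrix(rnorm(D * p), D, p)))      # D x p orthonormal basis
Y <- matrix(rnorm(n * p, sd = 5), n, p) %*% t(W) +
  matrix(rnorm(n * D, sd = 0.1), n, D)

# Classical SVD, keeping only the first d singular vectors.
s <- svd(Y, nu = d, nv = d)

# Rank-d reconstruction and its relative error in Frobenius norm.
Y_d <- s$u %*% diag(s$d[1:d], nrow = d) %*% t(s$v)
rel_err <- norm(Y - Y_d, "F") / norm(Y, "F")

# The first p singular values dominate the remaining ones.
gap <- s$d[p] / s$d[p + 1]
```

Because \(d > p\), the retained directions cover the signal subspace, and the surplus \(d - p\) directions are the kind of redundant dimensions the adaptive pruning step is meant to remove.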

References

[1] Y. Wang, A. Canale, D. Dunson. "Scalable Geometric Density Estimation" (2016).

Examples


# NOT RUN {
library(MASS)
library(RGeode)

####################################################################
# WITHOUT MISSING DATA
####################################################################
# Define the dataset
D= 200
n= 500
d= 10
d_true= 3

set.seed(321)

mu_true= runif(d_true, -3, 10)

Sigma_true= matrix(0,d_true,d_true)
diag(Sigma_true)= c(runif(d_true, 10, 100))

W_true = svd(matrix(rnorm(D*d_true, 0, 1), d_true, D))$v

sigma_true = abs(runif(1,0,1))

mu= W_true%*%mu_true
C= W_true %*% Sigma_true %*% t(W_true)+ sigma_true* diag(D)

y= mvrnorm(n, mu, C)

################################
# GEODE: Without missing data
################################

start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)
       
       
####################################################################
# WITH MISSING DATA
####################################################################

###########################
#Insert NaN
n_m = 5 #number of data vectors containing missing features
d_m = 1  #number of missing features

data_miss= sample(seq(1,n),n_m)

features= sample(seq(1,D), d_m)
for(i in 2:n_m)
{
  features= rbind(features, sample(seq(1,D), d_m))
}

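# Set the selected entries to NaN. The last selected row drops its first
# feature index; with d_m = 1 that leaves this row without missing values.
for(i in 1:length(data_miss))
{
  if(i==length(data_miss))
  {
    y[data_miss[i],features[i,][-1]]= NaN
  }
  else
  {
    y[data_miss[i],features[i,]]= NaN
  }
}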

################################
# GEODE: With missing data
################################
set.seed(321)
start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)



####################################################################
####################################################################
# }
