
yaImpute (version 1.0-15)

yai: Find K nearest neighbors

Description

Given a set of observations, yai 1) separates the observations into reference and target observations, 2) applies the specified method to project the X-variables into a Euclidean space (not always; see argument method), and 3) finds the k-nearest neighbors within the reference observations and between the reference and target observations. An alternative method using randomForest classification and regression trees is provided for steps 2 and 3. Target observations are those with values for the X-variables but not for the Y-variables, while reference observations are those with no missing values for the X- and Y-variables (see Details for the exception).

Usage

yai(x=NULL,y=NULL,data=NULL,k=1,noTrgs=FALSE,noRefs=FALSE,
    nVec=NULL,pVal=.05,method="msn",ann=TRUE,mtry=NULL,ntree=500,
    rfMode="buildClasses")

Arguments

x
1) a matrix or data frame containing the X-variables for all observations, with row names as the identification for the observations, or 2) a one-sided formula defining the X-variables as a linear formula. If a formula is coded for x, one must be used for y as well, if needed.
y
1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula.
data
when x and y are formulas, data is a data frame or matrix that contains all the variables, with row names as the identification for the observations. The observations are split by yai into the reference and target sets (a minimal sketch of the formula interface follows the argument list).
k
the number of nearest neighbors; default is 1.
noTrgs
when TRUE, skip finding neighbors for target observations.
noRefs
when TRUE, skip finding neighbors for reference observations.
nVec
the number of canonical vectors to use (methods msn and msn2), or the number of independent X-variables in the reference data when method is mahalanobis. When NULL, the number is set by the function.
pVal
significance level for canonical vectors; used when method is msn or msn2.
method
is the strategy used for finding neighbors; the options are the quoted key words (see Details):
  • euclidean - distance is computed in a normalized X space.
  • raw - like euclidean, except no normalization is done.
  • mahalanobis - distance is computed in its namesake space.
  • ica - like mahalanobis, but based on independent component analysis using package fastICA.
  • msn - distance is computed in a projected canonical space.
  • msn2 - like msn, but with variance weighting (canonical regression rather than correlation).
  • gnn - distance is computed using a projected ordination of the X-variables found using canonical correspondence analysis (cca from package vegan).
  • randomForest - distance is one minus the proportion of randomForest trees where a target observation is in the same terminal node as a reference observation.
  • random - like raw, except that the X space is a single random uniform variable.
ann
TRUE (the default) if the ann function is used to find neighbors; FALSE if a slower exact search is used.
mtry
the number of X-variables picked at random when method is randomForest; when NULL, the randomForest default is used (see randomForest).
ntree
the number of classification and regression trees grown when method is randomForest; when more than one Y-variable is used, the trees are divided among the variables.
rfMode
when rfMode="buildClasses" (the default) and method is randomForest, continuous Y-variables are converted to classes, forcing randomForest to build classification trees; otherwise, regression trees are built for continuous variables (this requires randomForest version 4.5-18 or newer).
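A minimal sketch of the formula interface described above, using iris purely for illustration (the variable choices are arbitrary, not from the package examples); rows given missing Y-values become the targets:

require (yaImpute)
data(iris)
dat <- iris[, 1:4]
set.seed(1)
dat[sample(nrow(dat), 50), 3:4] <- NA   # rows with missing Y's become targets
frm <- yai(x = ~ Sepal.Length + Sepal.Width,
           y = ~ Petal.Length + Petal.Width,
           data = dat, method = "msn", k = 1)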

Value

An object of class yai, which is a list with the following tags:
  • call - the call.
  • yRefs, xRefs - matrices of the X- and Y-variables for just the reference observations (unscaled). The scale factors are attached as attributes.
  • obsDropped - a list of the row names for observations dropped for various reasons (missing data).
  • trgRows - a list of the row names for target observations as a subset of all observations.
  • xall - the X-variables for all observations.
  • cancor - returned from the cancor function when method msn or msn2 is used (NULL otherwise).
  • ccaVegan - an object of class cca (from package vegan) when method gnn is used.
  • ftest - a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise).
  • yScale, xScale - scale data used on yRefs and xRefs as needed.
  • k - the value of k.
  • pVal - as input; only used when method msn or msn2 is used.
  • projector - NULL when not used. For methods msn, msn2, gnn and mahalanobis, this is a matrix that projects normalized X-variables into a space suitable for computing Euclidean distances.
  • nVec - the number of canonical vectors used (methods msn and msn2), or the number of independent X-variables in the reference data when method mahalanobis is used.
  • method - as input, the method used.
  • ranForest - a list of the forests if method randomForest is used. There is one forest for each Y-variable, or just one forest when there are no Y-variables.
  • ICA - a list of information from fastICA when method ica is used.
  • ann - the value of ann, TRUE when ann is used, FALSE otherwise.
  • xlevels - NULL if no factors are used as predictors; otherwise a list of predictors that have factors and their levels (see lm).
  • neiDstTrgs - a data frame of distances between a target (identified by its row name) and the k references. There are k columns (a short access example follows this list).
  • neiIdsTrgs - a data frame of reference identifications that correspond to neiDstTrgs.
  • neiDstRefs, neiIdsRefs - counterparts for references.
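A brief sketch of reading the neighbor tags from a fitted object, built on the iris setup from the Examples section below (the k = 3 choice is arbitrary):

require (yaImpute)
data(iris)
set.seed(12345)
refs <- sample(rownames(iris), 50)
m <- yai(x = iris[, 1:2], y = iris[refs, 3:4], k = 3)
head(m$neiIdsTrgs)  # one row per target; k = 3 reference ids
head(m$neiDstTrgs)  # the matching distances in the projected space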

Details

See the paper at http://www.jstatsoft.org/v23/i10 (it includes examples).

The following information supplements the content of the paper.

You need not have any Y-variables to run yai for the following methods: euclidean, raw, mahalanobis, ica, random, and randomForest (in which case unsupervised classification is performed). Normally, however, yai classifies reference observations as those with no missing values for the X- and Y-variables, and target observations as those with values for the X-variables and missing data for the Y-variables. When y is NULL (there are no Y-variables), all the observations are considered references. See newtargets for an example of how to use yai in this situation.
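A minimal sketch of this unsupervised case, assuming only X-variables (iris is used purely for illustration):

require (yaImpute)
data(iris)
euc <- yai(x = iris[, 1:4], method = "euclidean", k = 2)  # no y: all rows are references
head(euc$neiIdsRefs)  # nearest neighbors among the references
head(euc$neiDstRefs)  # the corresponding distances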

Examples

require (yaImpute)

data(iris)

# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345) 

# form some test data, y's are defined only for reference 
# observations.
refs <- sample(rownames(iris), 50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
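# a quick peek at the results using yaImpute's foruse and impute
# functions (this peek is not part of the original example):
head(foruse(msn))   # the reference used for each observation, with distance
head(impute(msn))   # observed Y's for references, imputed Y's for targets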

# running the following examples will load packages vegan
# and randomForest; these examples are more involved.

data(MoscowMtStJoe)

# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).

polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01      # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
               {return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]

mal <- yai(x=x, y=y, method="mahalanobis", k=1)
gnn <- yai(x=x, y=y, method="gnn", k=1)
msn <- yai(x=x, y=y, method="msn", k=1)

plot(mal,vars=yvars(mal)[1:16])

# reduce the plant community data for randomForest.
yba  <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax

rf <- yai(x=x, y=ybaB, method="randomForest", k=1)

# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)

# compare the results
compare.yai(mal,gnn,msn,rforig)
plot(compare.yai(mal,gnn,msn,rforig))

# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ
# but one is not clearly better than the other.

rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)
