partysplit: Binary and Multiway Splits

Description

A class for representing multiway splits and functions for computing on splits.

Usage

partysplit(varid, breaks = NULL, index = NULL, right = TRUE, 
    prob = NULL, info = NULL)
kidids_split(split, data, vmatch = 1:ncol(data), obs = NULL, 
    perm = NULL)
character_split(split, data = NULL, 
    digits = getOption("digits") - 2)
varid_split(split)
breaks_split(split)
index_split(split)
right_split(split)
prob_split(split)
info_split(split)

Arguments

varid

an integer specifying the variable to split in, i.e., a column number in data.

breaks

a numeric vector of split points.

index

an integer vector containing a contiguous sequence from one to the number of kid nodes. May contain NAs.

right

a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa.

prob

a numeric vector representing a probability distribution over kid nodes.

info

additional information.

split

an object of class partysplit.

data

a list or data.frame.

vmatch

a permutation of the variable numbers in data.

obs

a logical or integer vector indicating a subset of the observations in data.

perm

a vector of integers specifying the variables to be permuted prior to splitting (i.e., for computing permutation variable importances). The default NULL doesn't alter the data.

digits

minimal number of significant digits.

Value

The constructor partysplit() returns an object of class partysplit:

varid

an integer specifying the variable to split in, i.e., a column number in data,

breaks

a numeric vector of split points,

index

an integer vector containing a contiguous sequence from one to the number of kid nodes,

right

a logical, indicating if the intervals defined by breaks should be closed on the right (and open on the left) or vice versa,

prob

a numeric vector representing a probability distribution over kid nodes,

info

additional information.

kidids_split() returns an integer vector describing the partition of the observations into kid nodes.
character_split() gives a character representation of the split and the remaining functions return the corresponding slots of partysplit objects.

Details

A split is basically a function that maps data, more specifically a partitioning variable, to a set of integers indicating the kid nodes to send observations to. Objects of class partysplit describe such a function and can be set up via the partysplit() constructor. The variables are available in a list or data.frame (here called data) and varid specifies the partitioning variable, i.e., the variable or list element to split in. The constructor partysplit() doesn't have access to the actual data, i.e., it doesn't estimate splits.

kidids_split(split, data) actually partitions the data data[obs,varid_split(split)] and assigns an integer (giving the kid node number) to each observation. If vmatch is given, the variable vmatch[varid_split(split)] is used. In case perm contains varid_split(split), the data are permuted using sample prior to partitioning. character_split() returns a character representation of its split argument. The remaining functions defined here are accessor functions for partysplit objects.
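The obs and vmatch arguments described above can be illustrated with a small sketch. It assumes the partykit package is installed; the object names (kids_all, kids_sub, shuffled, kids_shuffled) are chosen here for illustration only.

```r
library(partykit)  # assumption: partykit is installed

data("iris", package = "datasets")
sl5 <- partysplit(which(names(iris) == "Sepal.Length"), breaks = 5)

# kid node numbers for all observations, and for the first ten only
kids_all <- kidids_split(sl5, data = iris)
kids_sub <- kidids_split(sl5, data = iris, obs = 1:10)

# `shuffled` holds the same variables in reversed column order;
# vmatch maps the original variable numbers to the columns of `shuffled`,
# so vmatch[varid_split(sl5)] points at Sepal.Length again
shuffled <- iris[, 5:1]
kids_shuffled <- kidids_split(sl5, data = shuffled, vmatch = 5:1)
```

If the description above holds, kids_sub should match the first ten entries of kids_all, and kids_shuffled should match kids_all.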

The numeric vector breaks defines how the range of the partitioning variable (after coercing to a numeric via as.numeric) is divided into intervals (like in cut) and may be NULL. These intervals are represented by the numbers one to length(breaks) + 1.
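The interval numbering can be mimicked in base R with cut(). This is only a sketch of the semantics described above, not the package's own implementation; the vector x and the break points are made up for illustration.

```r
# Hypothetical values of a partitioning variable and two split points
x <- c(2.0, 3.0, 3.2, 3.5, 4.1)
breaks <- c(3, 3.5)

# right = TRUE: intervals (-Inf, 3], (3, 3.5], (3.5, Inf]
interval_right <- cut(x, breaks = c(-Inf, breaks, Inf),
    right = TRUE, labels = FALSE)
interval_right
# 1 1 2 2 3

# right = FALSE: intervals [-Inf, 3), [3, 3.5), [3.5, Inf)
interval_left <- cut(x, breaks = c(-Inf, breaks, Inf),
    right = FALSE, labels = FALSE)
interval_left
# 1 2 2 3 3
```

Values that fall exactly on a break point (3 and 3.5 here) are the ones that change interval when right is toggled.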

index assigns each of these length(breaks) + 1 intervals to one of at least two kid nodes. Thus, index is a vector of integers where each element corresponds to one element in a list kids containing partynode objects, see partynode for details. The vector index may contain NAs; in that case, the corresponding values of the splitting variable are treated as missing (for example, factor levels that are not present in the learning sample). Either breaks or index must be given. When breaks is NULL, it is assumed that the partitioning variable itself has storage mode integer (e.g., is a factor).
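The role of index can be sketched in base R as a simple lookup from interval number (or factor level code) to kid node number. This illustrates the mapping only and is not taken from the package's code; all object names are illustrative.

```r
# Interval numbers for five observations (three intervals, i.e., two breaks)
interval <- c(1L, 1L, 2L, 2L, 3L)

# index = 3:1 sends the first interval to kid 3 and the last to kid 1
index <- 3:1
kid <- index[interval]
kid
# 3 3 2 2 1

# With breaks = NULL, the integer codes of a factor are looked up directly
species <- factor(c("setosa", "versicolor", "virginica", "setosa"))
index_f <- c(1L, 1L, 2L)
kid_f <- index_f[as.integer(species)]
kid_f
# 1 1 2 1

# An NA in index marks the corresponding level as missing
index_na <- c(1L, NA, 2L)
kid_na <- index_na[as.integer(species)]
kid_na
# 1 NA 2 1
```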

prob defines a probability distribution over all kid nodes which is used for random splitting when a deterministic split isn't possible (due to missing values, for example).
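One way to picture such random splitting is the base R sketch below. It is an assumption about how prob might be applied, not the package's implementation; the kid vector and the probabilities are made up for illustration.

```r
set.seed(1)  # only to make the sketch reproducible

# Kid node numbers after a deterministic split; NA marks observations
# that could not be assigned (e.g., missing values in the split variable)
kid <- c(1L, NA, 2L, NA)

# Hypothetical distribution over two kid nodes
prob <- c(0.25, 0.75)

# Draw a kid node at random for each unassigned observation
unassigned <- is.na(kid)
kid[unassigned] <- sample(seq_along(prob), size = sum(unassigned),
    replace = TRUE, prob = prob)
kid
```

Observations that were already assigned keep their kid node; only the NA entries are filled in by the random draw.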

info takes arbitrary user-specified information.

Examples

data("iris", package = "datasets")

## binary split in numeric variable `Sepal.Length'
sl5 <- partysplit(which(names(iris) == "Sepal.Length"),
    breaks = 5)
character_split(sl5, data = iris)
table(kidids_split(sl5, data = iris), iris$Sepal.Length <= 5)

## multiway split in numeric variable `Sepal.Width',
## highest values go to the first kid, lowest values
## to the last kid
sw23 <- partysplit(which(names(iris) == "Sepal.Width"),
    breaks = c(3, 3.5), index = 3:1)
character_split(sw23, data = iris)
## compare with the intervals defined by the same breaks
table(kidids_split(sw23, data = iris),
    cut(iris$Sepal.Width, breaks = c(-Inf, 3, 3.5, Inf)))

## binary split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"),
    index = c(1L, 1L, 2L))
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in factor `Species'
sp <- partysplit(which(names(iris) == "Species"), index = 1:3)
character_split(sp, data = iris)
table(kidids_split(sp, data = iris), iris$Species)

## multiway split in numeric variable `Sepal.Width'
sp <- partysplit(which(names(iris) == "Sepal.Width"), 
    breaks = quantile(iris$Sepal.Width))
character_split(sp, data = iris)
## predictions for permuted values of `Sepal.Width'
## correlation with actual data should be small
cor(kidids_split(sp, data = iris, 
    perm = which(names(iris) == "Sepal.Width")),
    iris$Sepal.Width)