xgb.importance: Show importance of features in a model

Description

Reads an xgboost model text dump. The model can be a tree or a linear model (text dumps of linear models are only supported in the development version of XGBoost for now).

Usage

xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ((x + label) == 2))

Arguments

feature_names

names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be NULL.

filename_dump

the path to the text file storing the model. The model dump must include the gain per feature and per tree (with.stats = T in function xgb.dump).

model

the model object generated by the xgb.train function. Passing it avoids the creation of a dump file.

data

the dataset used for the training step. It is used together with the label parameter for the co-occurrence computation. More information in the Details section. This parameter is optional.

label

the label vector used for the training step. It is used together with the data parameter for the co-occurrence computation. More information in the Details section. This parameter is optional.

target

a function which returns TRUE or 1 when an observation should be counted as a co-occurrence, and FALSE or 0 otherwise. The default function computes co-occurrences for a binary classification. The target function should take exactly one parameter. This parameter receives each important feature vector after the split condition has been applied, so these vectors contain only 0 and 1, whatever the original values were. More information in the Details section. This parameter is optional.
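
As a rough illustration of what a target function does, the sketch below applies the default target from the Usage section to a toy vector. The data, the name target_negative and the way the counts are summed are assumptions made for illustration; they are not taken from the package internals.

```r
# Toy data (hypothetical): a feature vector after the split condition
# has been applied, so it only contains 0 and 1, and the matching labels.
x     <- c(1, 0, 1, 1, 0, 1)
label <- c(1, 0, 0, 1, 1, 1)

# Default target from the Usage section:
# TRUE when the feature and the label are both 1.
target_default <- function(x) ((x + label) == 2)

# Hypothetical alternative target:
# TRUE when the feature is 1 and the label is 0.
target_negative <- function(x) (x == 1 & label == 0)

# Assumption: the co-occurrence count is the number of TRUE values.
count_default  <- sum(target_default(x))
count_negative <- sum(target_negative(x))
```

Both functions take a single parameter and read label from the enclosing environment, which mirrors the shape of the default value shown in Usage.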

Value

A data.table of the features used in the model, with their average gain (and their weight for boosted tree models).

Details

This function helps you understand the trained model (and, through your model, your data).

Results are returned for both linear and tree models.

The function returns a data.table with the following columns:

Features: names of the features, as provided in feature_names or already present in the model dump ;
Gain: contribution of each feature to the model. For a boosted tree model, the gain of each feature in each tree is taken into account, then averaged per feature to give a view of the entire model. The highest percentage indicates the most important feature for predicting the label used for the training ;
Cover: metric of the number of observations related to this feature (only available for tree models) ;
Weight: percentage representing the relative number of times a feature has been used in trees. Gain should be preferred when searching for the most important feature. For a boosted linear model, this column has no meaning.
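
To make the difference between Gain and Weight more concrete, here is a minimal sketch in base R on a hypothetical table of splits. The feature names and gain values are invented, and the aggregation (gains summed per feature, then normalised to shares) is an assumption; the exact computation used by the package may differ.

```r
# Hypothetical splits: one row per split, with the feature used
# and the gain of that split. Values are made up for illustration.
splits <- data.frame(
  Feature = c("odor=none", "stalk-root=club", "odor=none",
              "spore-print-color=green"),
  Gain    = c(4000, 1200, 800, 200),
  stringsAsFactors = FALSE
)

# Assumed Gain share: total gain per feature divided by total gain.
gain_by_feature <- tapply(splits$Gain, splits$Feature, sum)
gain_share      <- gain_by_feature / sum(gain_by_feature)

# Assumed Weight share: number of splits using the feature
# divided by the total number of splits.
weight_share <- tapply(splits$Gain, splits$Feature, length) / nrow(splits)

# Features ordered from most to least important according to gain.
ranking <- names(sort(gain_share, decreasing = TRUE))
```

In this toy table a feature can be used in only one split and still carry more gain than a feature used more often, which is why Gain is the better column to rank on.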

Co-occurrence count

The gain indicates how important a feature is in making a branch of a decision tree purer. However, with this information alone, you can't know whether the feature has to be present or absent to get a specific classification. In the example code, you may wonder whether odor=none should be TRUE for a mushroom not to be eaten.

The co-occurrence computation helps in understanding this relation between a predictor and a specific class. It counts how many observations are returned as TRUE by the target function (see the Arguments section). When you execute the example below, only 92 of the 3140 observations of the train dataset are mushrooms that have no odor and can be eaten safely.

If you need to remember one thing only: unless you want to leave us early, don't eat a mushroom which has no odor :-)

Examples

data(agaricus.train, package='xgboost')

# The dataset is a list with two items, a sparse matrix and labels
# (labels = outcome column which will be learned).
# Each column of the sparse matrix is a feature in one-hot encoding format.
train <- agaricus.train

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

# train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.importance(train$data@Dimnames[[2]], model = bst)

# Same thing, with the co-occurrence computation this time
xgb.importance(train$data@Dimnames[[2]], model = bst, data = train$data, label = train$label)
