⚠️ There's a newer version (0.2.2) of this package.

partition

partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain a user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. partition is also flexible: how variables are selected for reduction, how information loss is measured, and how the data are reduced can all be customized.
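The "reduce" step of Direct-Measure-Reduce can be sketched in a few lines of base R: two correlated variables are replaced by the mean of their scaled values, yielding one composite in place of two originals. This is only an illustration of the idea, not the package's exact implementation.

```r
# A minimal base-R sketch of the "reduce" step: two correlated
# columns are replaced by the mean of their scaled values.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)   # correlated with x1
composite <- rowMeans(scale(cbind(x1, x2)))
length(composite)                 # one new variable for two originals
```

partition's partitioners automate this: a director picks which variables to try, a metric checks how much information the composite retains, and a reducer creates the new variable.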

Installation

You can install the development version of partition from GitHub with:

# install.packages("remotes")
remotes::install_github("USCbiostats/partition")

Example

library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)

#  don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602

# return reduced data
partition_scores(prt)
#> # A tibble: 100 x 11
#>    block1_x1 block1_x2 block1_x3 block2_x1 block2_x2 block3_x1 block3_x2
#>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#>  1   -1.00     -0.344      1.35     -0.526    -1.25      1.13     0.357 
#>  2    0.518    -0.434     -0.361    -1.48     -1.53     -0.317    0.290 
#>  3   -1.77     -0.913     -0.722     0.122     0.224    -0.529    0.114 
#>  4   -1.49     -0.998      0.189     0.149    -0.994    -0.433    0.0120
#>  5    0.616     0.0211     0.895     1.09     -1.25      0.440   -0.550 
#>  6    0.0765    0.522      1.20     -0.152    -0.419    -0.912   -0.362 
#>  7    1.74      0.0993    -0.654    -1.26     -0.502    -0.792   -1.03  
#>  8    1.05      2.19       0.913     0.254     0.328    -1.07    -0.976 
#>  9   -1.07     -0.292     -0.763     0.437     0.739     0.899   -0.342 
#> 10   -1.02     -0.959     -1.33     -1.57     -1.11      0.618    0.153 
#> # … with 90 more rows, and 4 more variables: block3_x3 <dbl>,
#> #   block3_x4 <dbl>, block3_x5 <dbl>, reduced_var_1 <dbl>
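Since the default reducer is a scaled mean, the composite can be approximated by hand from its mapped columns. This is a sketch: the package's scaled_mean() may scale before or after averaging, so compare by correlation rather than exact equality.

```r
# approximate reduced_var_1 from the columns it maps to
# (assumes a rowMeans-of-scaled-columns style reducer)
by_hand <- rowMeans(scale(df[, c("block2_x3", "block2_x4")]))
cor(by_hand, partition_scores(prt)$reduced_var_1)
```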

# access mapping keys
mapping_key(prt)
#> # A tibble: 11 x 4
#>    variable      mapping   information indices  
#>    <chr>         <list>          <dbl> <list>   
#>  1 block1_x1     <chr [1]>       1     <int [1]>
#>  2 block1_x2     <chr [1]>       1     <int [1]>
#>  3 block1_x3     <chr [1]>       1     <int [1]>
#>  4 block2_x1     <chr [1]>       1     <int [1]>
#>  5 block2_x2     <chr [1]>       1     <int [1]>
#>  6 block3_x1     <chr [1]>       1     <int [1]>
#>  7 block3_x2     <chr [1]>       1     <int [1]>
#>  8 block3_x3     <chr [1]>       1     <int [1]>
#>  9 block3_x4     <chr [1]>       1     <int [1]>
#> 10 block3_x5     <chr [1]>       1     <int [1]>
#> 11 reduced_var_1 <chr [2]>       0.602 <int [2]>

unnest_mappings(prt)
#> # A tibble: 12 x 4
#>    variable      information mapping   indices
#>    <chr>               <dbl> <chr>       <int>
#>  1 block1_x1           1     block1_x1       1
#>  2 block1_x2           1     block1_x2       2
#>  3 block1_x3           1     block1_x3       3
#>  4 block2_x1           1     block2_x1       4
#>  5 block2_x2           1     block2_x2       5
#>  6 block3_x1           1     block3_x1       8
#>  7 block3_x2           1     block3_x2       9
#>  8 block3_x3           1     block3_x3      10
#>  9 block3_x4           1     block3_x4      11
#> 10 block3_x5           1     block3_x5      12
#> 11 reduced_var_1       0.602 block2_x3       6
#> 12 reduced_var_1       0.602 block2_x4       7
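Because the mapping key is a tidy tibble, pulling out just the composite variables is plain subsetting, e.g. keeping rows whose information dropped below 1 (a base-R sketch on the tibble returned above):

```r
# keep only rows that belong to reduced (composite) variables
keys <- unnest_mappings(prt)
keys[keys$information < 1, ]
```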

# use a lower information threshold
partition(df, threshold = .5, partitioner = part_kmeans())
#> Partitioner:
#>    Director: K-Means Clusters 
#>    Metric: Minimum Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 2 reduced variables created from 7 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block3_x1, block3_x2, block3_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.508

# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
  part_icc, 
  reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans) 
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602

partition also supports a number of ways to visualize partitions and permutation tests; these functions all start with plot_*. They return ggplot objects and can thus be extended using ggplot2.

plot_stacked_area_clusters(df) +
  ggplot2::theme_minimal(14)

Install

install.packages('partition')

Monthly Downloads

335

Version

0.1.0

License

MIT + file LICENSE

Maintainer

Malcolm Barrett

Last Published

May 17th, 2019

Functions in partition (0.1.0)

build_next_name: Create new variable name based on prefix and previous reductions
as_partition_step: Create a partition object from a data frame
map_partition: Map a partition across a range of minimum information
corr: Efficiently fit correlation coefficient for matrix or two vectors
is_partition_step: Is this object a partition_step?
part_kmeans: Partitioner: K-means, ICC, scaled means
direct_distance: Target based on minimum distance matrix
is_partition: Is this object a partition?
part_minr2: Partitioner: distance, minimum R-squared, scaled means
linear_k_search: Search for best k using the linear search method
direct_k_cluster: Target based on K-means clustering
k_searching_forward: Assess k search
filter_reduced: Filter the reduced mappings
calculate_new_variable: Calculate or retrieve stored reduced variable
matrix_is_exhausted: Have all pairs of variables been checked for metric?
find_min_distance_variables: Find the index of the pair with the smallest distance
fit_distance_matrix: Fit a distance matrix using correlation coefficients
assign_partition: Process a dataset with a partitioner
binary_k_search: Search for best k using the binary search method
guess_init_k: Guess initial k based on threshold and p
summarize_partitions: Summarize and map partitions and permutations
measure_icc: Measure the information loss of reduction using intraclass correlation coefficient
mutual_information: Calculate the standardized mutual information of a data set
mapping_key: Return partition mapping key
%>%: Pipe operator
plot_area_clusters: Plot partitions
search_k: Search for the best k
scaled_mean: Average and scale rows in a data.frame
part_icc: Partitioner: distance, ICC, scaled means
count_clusters: Helper functions to print partition summary
under_threshold: Compare metric to threshold
direct_measure_reduce: Apply a partitioner
pull_composite_variables: Access mapping variables
fill_in_missing: Process reduced variables when missing data
rewind_target: Set target to last value
return_if_single: Reduce targets if more than one variable, return otherwise
icc: Calculate the intraclass correlation coefficient
measure_min_icc: Measure the information loss of reduction using the minimum intraclass correlation coefficient
test_permutation: Permute partitions
update_dist: Only fit the distances for a new variable
increase_hits: Count and retrieve the number of metrics below threshold
icc_r: Calculate the intraclass correlation coefficient
measure_min_r2: Measure the information loss of reduction using minimum R-squared
is_partitioner: Is this object a partitioner?
partition: Agglomerative partitioning
is_same_function: Are two functions the same?
k_exhausted: Have all values of k been checked for metric?
measure_std_mutualinfo: Measure the information loss of reduction using standardized mutual information
find_algorithm: Which kmeans algorithm to use?
plot_permutation: Plot permutation tests
measure_variance_explained: Measure the information loss of reduction using the variance explained
partition_scores: Return the reduced data from a partition
part_pc1: Partitioner: distance, first principal component, scaled means
cat_bold: Print to the console in color
part_stdmi: Partitioner: distance, mutual information, scaled means
paste_director: Lookup partitioner types to print in English
reduce_mappings: Create a mapping key out of a list of targets
reduce_scaled_mean: Reduce selected variables to scaled means
reduce_cluster: Reduce a target
replace_partitioner: Replace the director, metric, or reducer for a partitioner
permute_df: Permute a data set
simplify_names: Simplify reduced variable names
simulate_block_data: Simulate correlated blocks of variables
reduce_first_component: Reduce selected variables to first principal component
reduce_kmeans: Reduce selected variables to scaled means
as_partitioner: Create a partitioner
as_measure: Create a custom metric
as_director: Create a custom director
as_reducer: Create a custom reducer
get_indices: Process mapping key to return from partition()
all_columns_reduced: Check if all variables reduced to a single composite
append_mappings: Append a new variable to mapping and filter out composite variables
as_partition: Return a partition object
all_done: Mark the partition as complete to stop search