partition

partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. partition is flexible, as well: how variables are selected to reduce, how information loss is measured, and the way data is reduced can all be customized.
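To make the Direct-Measure-Reduce idea concrete, here is a minimal base R sketch of a single step: pick the closest pair of variables, measure how much information a reduction would keep, and reduce only if that clears the threshold. The helper names (`direct_closest_pair`, `measure_icc_sketch`, `reduce_scaled_mean_sketch`) are hypothetical, and the one-way intraclass correlation used here is an assumption for illustration. This is not the package's internal code.

```r
# Toy data: a and b share a common signal, c is independent noise
set.seed(1)
n <- 200
shared <- rnorm(n)
dat <- data.frame(
  a = shared + rnorm(n, sd = 0.5),
  b = shared + rnorm(n, sd = 0.5),
  c = rnorm(n)
)

# Direct: choose the pair of variables with the highest Pearson correlation
direct_closest_pair <- function(x) {
  r <- cor(x)
  diag(r) <- -Inf                       # ignore self-correlations
  idx <- which(r == max(r), arr.ind = TRUE)[1, ]
  colnames(x)[sort(idx)]
}

# Measure: one-way intraclass correlation across standardized columns
# (assumed formula for illustration)
measure_icc_sketch <- function(x) {
  x <- scale(as.matrix(x))
  n <- nrow(x)
  k <- ncol(x)
  row_means <- rowMeans(x)
  ms_between <- k * sum((row_means - mean(x))^2) / (n - 1)
  ms_within <- sum((x - row_means)^2) / (n * (k - 1))
  (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
}

# Reduce: average the standardized columns, then rescale the result
reduce_scaled_mean_sketch <- function(x) {
  as.numeric(scale(rowMeans(scale(as.matrix(x)))))
}

threshold <- 0.6
pair <- direct_closest_pair(dat)
info <- measure_icc_sketch(dat[pair])

# Accept the reduction only if enough information is retained
if (info >= threshold) {
  reduced <- data.frame(
    dat[setdiff(names(dat), pair)],
    reduced_var_1 = reduce_scaled_mean_sketch(dat[pair])
  )
} else {
  reduced <- dat
}
names(reduced)
```

In this sketch each original variable ends up in exactly one column of `reduced`, either unchanged or as part of `reduced_var_1`, which is the interpretability property described above. The package repeats steps like this until no further reduction meets the threshold.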

Installation

You can install partition from CRAN with:

install.packages("partition")

Or you can install the development version of partition from GitHub with:

# install.packages("remotes")
remotes::install_github("USCbiostats/partition")

Example

library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)

#  don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602

# return reduced data
partition_scores(prt)
#> # A tibble: 100 × 11
#>    block1_x1 block1_x2 block1_x3 block2_x1 block2_x2 block3_x1 block3_x2
#>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#>  1   -1.00     -0.344      1.35     -0.526    -1.25      1.13     0.357 
#>  2    0.518    -0.434     -0.361    -1.48     -1.53     -0.317    0.290 
#>  3   -1.77     -0.913     -0.722     0.122     0.224    -0.529    0.114 
#>  4   -1.49     -0.998      0.189     0.149    -0.994    -0.433    0.0120
#>  5    0.616     0.0211     0.895     1.09     -1.25      0.440   -0.550 
#>  6    0.0765    0.522      1.20     -0.152    -0.419    -0.912   -0.362 
#>  7    1.74      0.0993    -0.654    -1.26     -0.502    -0.792   -1.03  
#>  8    1.05      2.19       0.913     0.254     0.328    -1.07    -0.976 
#>  9   -1.07     -0.292     -0.763     0.437     0.739     0.899   -0.342 
#> 10   -1.02     -0.959     -1.33     -1.57     -1.11      0.618    0.153 
#> # ℹ 90 more rows
#> # ℹ 4 more variables: block3_x3 <dbl>, block3_x4 <dbl>, block3_x5 <dbl>,
#> #   reduced_var_1 <dbl>

# access mapping keys
mapping_key(prt)
#> # A tibble: 11 × 4
#>    variable      mapping   information indices  
#>    <chr>         <list>          <dbl> <list>   
#>  1 block1_x1     <chr [1]>       1     <int [1]>
#>  2 block1_x2     <chr [1]>       1     <int [1]>
#>  3 block1_x3     <chr [1]>       1     <int [1]>
#>  4 block2_x1     <chr [1]>       1     <int [1]>
#>  5 block2_x2     <chr [1]>       1     <int [1]>
#>  6 block3_x1     <chr [1]>       1     <int [1]>
#>  7 block3_x2     <chr [1]>       1     <int [1]>
#>  8 block3_x3     <chr [1]>       1     <int [1]>
#>  9 block3_x4     <chr [1]>       1     <int [1]>
#> 10 block3_x5     <chr [1]>       1     <int [1]>
#> 11 reduced_var_1 <chr [2]>       0.602 <int [2]>

unnest_mappings(prt)
#> # A tibble: 12 × 4
#>    variable      mapping   information indices
#>    <chr>         <chr>           <dbl>   <int>
#>  1 block1_x1     block1_x1       1           1
#>  2 block1_x2     block1_x2       1           2
#>  3 block1_x3     block1_x3       1           3
#>  4 block2_x1     block2_x1       1           4
#>  5 block2_x2     block2_x2       1           5
#>  6 block3_x1     block3_x1       1           8
#>  7 block3_x2     block3_x2       1           9
#>  8 block3_x3     block3_x3       1          10
#>  9 block3_x4     block3_x4       1          11
#> 10 block3_x5     block3_x5       1          12
#> 11 reduced_var_1 block2_x3       0.602       6
#> 12 reduced_var_1 block2_x4       0.602       7
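
The long-format mapping shown above is convenient for lookups in either direction. The following base R sketch rebuilds a few of those rows by hand (values copied from the output above, so the partition package is not needed to run it) and looks up which reduced variable an original variable belongs to. The object names `mappings` and `lookup` are illustrative only.

```r
# A subset of the rows printed by unnest_mappings(prt), rebuilt by hand
mappings <- data.frame(
  variable = c("block2_x1", "block2_x2", "reduced_var_1", "reduced_var_1"),
  mapping = c("block2_x1", "block2_x2", "block2_x3", "block2_x4"),
  information = c(1, 1, 0.602, 0.602),
  indices = c(4L, 5L, 6L, 7L),
  stringsAsFactors = FALSE
)

# Each original variable appears exactly once in the mapping column
stopifnot(!anyDuplicated(mappings$mapping))

# Original variable -> variable in the reduced data
lookup <- setNames(mappings$variable, mappings$mapping)
lookup[["block2_x4"]]
#> [1] "reduced_var_1"

# Reduced variable -> the original variables it combines
split(mappings$mapping, mappings$variable)[["reduced_var_1"]]
#> [1] "block2_x3" "block2_x4"
```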

# use a lower information threshold with the K-means partitioner
partition(df, threshold = .5, partitioner = part_kmeans())
#> Partitioner:
#>    Director: <custom director> 
#>    Metric: <custom metric> 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 2 reduced variables created from 7 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block3_x1, block3_x2, block3_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.508

# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
  part_icc, 
  reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans) 
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602
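
To see what swapping the reducer changes, this base R sketch compares plain row means with a scaled mean. It assumes that a "scaled mean" is the row mean rescaled to mean 0 and standard deviation 1, following the package's description of `scaled_mean` ("Average and scale rows in a data.frame"). The package's own implementation may differ in details such as missing-data handling.

```r
# Two correlated toy variables on an arbitrary scale
set.seed(42)
x <- data.frame(v1 = rnorm(50, mean = 10, sd = 2))
x$v2 <- x$v1 + rnorm(50)

# Plain row means keep the original location and scale
plain_mean <- rowMeans(x)

# Assumed scaled mean: row means standardized to mean 0, sd 1
scaled_mean_sketch <- as.numeric(scale(rowMeans(x)))

c(mean = mean(plain_mean), sd = sd(plain_mean))
c(mean = mean(scaled_mean_sketch), sd = sd(scaled_mean_sketch))

# The two reductions differ only by a linear rescaling
cor(plain_mean, scaled_mean_sketch)
```

Under that assumption the two reducers produce perfectly correlated columns, which is consistent with the identical mappings and minimum information in the two outputs above.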

partition also supports a number of ways to visualize partitions and permutation tests. These functions all start with plot_*() and return ggplots, so they can be extended using ggplot2.

plot_stacked_area_clusters(df) +
  ggplot2::theme_minimal(14)

Performance

partition has been meticulously benchmarked and profiled to improve performance, and key sections are written in C++ or use C++-based packages. Using a data frame with 1 million rows on a 2017 MacBook Pro with 16 GB RAM, here’s how each of the built-in partitioners performs:

large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)

basic_benchmarks <- microbenchmark::microbenchmark(
  icc = partition(large_df, .3),
  kmeans = partition(large_df, .3, partitioner = part_kmeans()),
  minr2 = partition(large_df, .3, partitioner = part_minr2()),
  pc1 = partition(large_df, .3, partitioner = part_pc1()),
  stdmi = partition(large_df, .3, partitioner = part_stdmi())
)

ICC vs K-Means

As the number of features (columns) in the data set grows beyond the number of observations (rows), the default ICC method scales more linearly than K-Means-based methods. While K-Means is often faster at lower dimensions, it becomes slower as the features outnumber the observations. For example, using three data sets with increasing numbers of columns, K-Means starts as the fastest and gets increasingly slower, although in this case it is still comparable to ICC:

narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)

icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
  icc_narrow = partition(narrow_df, .3),
  icc_wide = partition(wide_df, .3),
  icc_wider = partition(wider_df, .3),
  kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
  kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
  kmeans_wider  = partition(wider_df, .3, partitioner = part_kmeans())
)

For more information, see our paper in Bioinformatics, which discusses these issues in more depth (Millstein et al. 2020).

Contributing

Please read the Contributor Guidelines prior to submitting a pull request to partition. Also note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

References

Millstein, Joshua, Francesca Battaglin, Malcolm Barrett, Shu Cao, Wu Zhang, Sebastian Stintzing, Volker Heinemann, and Heinz-Josef Lenz. 2020. “Partition: A Surjective Mapping Approach for Dimensionality Reduction.” Bioinformatics 36 (3): 676–81. https://doi.org/10.1093/bioinformatics/btz661.

Monthly Downloads: 335

Version: 0.2.0

License: MIT + file LICENSE

Maintainer: Malcolm Barrett

Last Published: January 24th, 2024

Functions in partition (0.2.0)

get_indices: Process mapping key to return from partition()
is_partition: Is this object a partition?
as_partition_step: Create a partition object from a data frame
under_threshold: Compare metric to threshold
corr: Efficiently fit correlation coefficient for matrix or two vectors
calculate_new_variable: Calculate or retrieve stored reduced variable
guess_init_k: Guess initial k based on threshold and p
k_searching_forward: Assess k search
all_columns_reduced: Check if all variables reduced to a single composite
fill_in_missing: Process reduced variables when missing data
find_min_distance_variables: Find the index of the pair with the smallest distance
as_measure: Create a custom metric
append_mappings: Append a new variable to mapping and filter out composite variables
mapping_key: Return partition mapping key
fit_distance_matrix: Fit a distance matrix using correlation coefficients
as_director: Create a custom director
count_clusters: Helper functions to print partition summary
as_partition: Return a partition object
matrix_is_exhausted: Have all pairs of variables been checked for metric?
measure_icc: Measure the information loss of reduction using intraclass correlation coefficient
measure_min_icc: Measure the information loss of reduction using the minimum intraclass correlation coefficient
replace_partitioner: Replace the director, metric, or reducer for a partitioner
direct_distance: Target based on minimum distance matrix
direct_k_cluster: Target based on K-means clustering
measure_variance_explained: Measure the information loss of reduction using the variance explained
cat_bold: Print to the console in color
all_done: Mark the partition as complete to stop search
increase_hits: Count and retrieve the number of metrics below threshold
icc: Calculate the intraclass correlation coefficient
is_partition_step: Is this object a partition_step?
direct_measure_reduce: Apply a partitioner
filter_reduced: Filter the reduced mappings
is_same_function: Are two functions the same?
binary_k_search: Search for best k using the binary search method
build_next_name: Create new variable name based on prefix and previous reductions
mutual_information: Calculate the standardized mutual information of a data set
simplify_names: Simplify reduced variable names
pull_composite_variables: Access mapping variables
return_if_single: Reduce targets if more than one variable, return otherwise
map_partition: Map a partition across a range of minimum information
k_exhausted: Have all values of k been checked for metric?
is_partitioner: Is this object a partitioner?
part_minr2: Partitioner: distance, minimum R-squared, scaled means
reduce_first_component: Reduce selected variables to first principal component
simulate_block_data: Simulate correlated blocks of variables
assign_partition: Process a dataset with a partitioner
search_k: Search for the best k
part_pc1: Partitioner: distance, first principal component, scaled means
baxter_data: Microbiome data
icc_r: Calculate the intraclass correlation coefficient
summarize_partitions: Summarize and map partitions and permutations
find_algorithm: Which kmeans algorithm to use?
plot_area_clusters: Plot partitions
partition_scores: Return the reduced data from a partition
paste_director: Lookup partitioner types to print in English
rewind_target: Set target to last value
scaled_mean: Average and scale rows in a data.frame
plot_permutation: Plot permutation tests
part_icc: Partitioner: distance, ICC, scaled means
super_partition: super_partition
test_permutation: Permute partitions
linear_k_search: Search for best k using the linear search method
part_kmeans: Partitioner: K-means, ICC, scaled means
permute_df: Permute a data set
update_dist: Only fit the distances for a new variable
%>%: Pipe operator
reduce_scaled_mean: Reduce selected variables to scaled means
measure_min_r2: Measure the information loss of reduction using minimum R-squared
reduce_cluster: Reduce a target
measure_std_mutualinfo: Measure the information loss of reduction using standardized mutual information
part_stdmi: Partitioner: distance, mutual information, scaled means
reduce_kmeans: Reduce selected variables to scaled means
partition: Agglomerative partitioning
reduce_mappings: Create a mapping key out of a list of targets
as_reducer: Create a custom reducer
as_partitioner: Create a partitioner