partition v0.1.0

Agglomerative Partitioning Framework for Dimension Reduction

A fast and flexible framework for agglomerative partitioning. 'partition' uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. 'partition' is flexible, as well: how variables are selected to reduce, how information loss is measured, and the way data is reduced can all be customized.

partition

partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: the original variables map to one and only one variable in the reduced data set. partition is flexible, as well: how variables are selected to reduce, how information loss is measured, and the way data is reduced can all be customized.
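
To make the three roles concrete, here is a minimal base-R sketch of a single Direct-Measure-Reduce step. It is an illustration under stated assumptions, not the package's implementation: the helper names `direct_pair()`, `icc_oneway()`, `reduce_pair()`, and `dmr_step()` are hypothetical, the director picks the most correlated pair by Pearson correlation, the metric is a one-way intraclass correlation on standardized columns, and the reducer is a standardized row mean.

```r
# Hypothetical helpers for illustration; these are not partition's internals.

# Direct: pick the pair of columns with the highest Pearson correlation.
direct_pair <- function(x) {
  r <- cor(x)
  r[upper.tri(r, diag = TRUE)] <- -Inf  # keep each pair once, ignore the diagonal
  idx <- which(r == max(r), arr.ind = TRUE)[1, ]
  sort(unname(idx))
}

# Measure: one-way intraclass correlation across standardized columns
# (rows are observations, columns are the variables being combined).
icc_oneway <- function(x) {
  x <- scale(x)
  n <- nrow(x)
  k <- ncol(x)
  row_means <- rowMeans(x)
  ms_between <- k * sum((row_means - mean(x))^2) / (n - 1)
  ms_within <- sum((x - row_means)^2) / (n * (k - 1))
  (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
}

# Reduce: replace the selected columns with a standardized row mean.
reduce_pair <- function(x) {
  as.numeric(scale(rowMeans(scale(x))))
}

# One step: accept the reduction only if it keeps enough information.
dmr_step <- function(x, threshold) {
  pair <- direct_pair(x)
  info <- icc_oneway(x[, pair])
  if (info < threshold) {
    return(list(data = x, reduced = FALSE, pair = pair, information = info))
  }
  reduced <- cbind(x[, -pair, drop = FALSE], reduced_var_1 = reduce_pair(x[, pair]))
  list(data = reduced, reduced = TRUE, pair = pair, information = info)
}

# Toy data: a and b share a signal, c is independent noise.
set.seed(1234)
z <- rnorm(200)
toy <- cbind(
  a = z + rnorm(200, sd = 0.3),
  b = z + rnorm(200, sd = 0.3),
  c = rnorm(200)
)

step <- dmr_step(toy, threshold = 0.6)
step$pair         # columns chosen by the director
step$information  # information retained by the candidate reduction
colnames(step$data)
```

In this sketch the original columns that were not reduced pass through unchanged, and each original column feeds at most one reduced variable, which is the interpretability property described above. A full partition would repeat the step until no candidate reduction meets the threshold.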

Installation

You can install the development version of partition from GitHub with:

# install.packages("remotes")
remotes::install_github("USCbiostats/partition")

Example

library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)

#  don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602

# return reduced data
partition_scores(prt)
#> # A tibble: 100 x 11
#>    block1_x1 block1_x2 block1_x3 block2_x1 block2_x2 block3_x1 block3_x2
#>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#>  1   -1.00     -0.344      1.35     -0.526    -1.25      1.13     0.357 
#>  2    0.518    -0.434     -0.361    -1.48     -1.53     -0.317    0.290 
#>  3   -1.77     -0.913     -0.722     0.122     0.224    -0.529    0.114 
#>  4   -1.49     -0.998      0.189     0.149    -0.994    -0.433    0.0120
#>  5    0.616     0.0211     0.895     1.09     -1.25      0.440   -0.550 
#>  6    0.0765    0.522      1.20     -0.152    -0.419    -0.912   -0.362 
#>  7    1.74      0.0993    -0.654    -1.26     -0.502    -0.792   -1.03  
#>  8    1.05      2.19       0.913     0.254     0.328    -1.07    -0.976 
#>  9   -1.07     -0.292     -0.763     0.437     0.739     0.899   -0.342 
#> 10   -1.02     -0.959     -1.33     -1.57     -1.11      0.618    0.153 
#> # … with 90 more rows, and 4 more variables: block3_x3 <dbl>,
#> #   block3_x4 <dbl>, block3_x5 <dbl>, reduced_var_1 <dbl>

# access mapping keys
mapping_key(prt)
#> # A tibble: 11 x 4
#>    variable      mapping   information indices  
#>    <chr>         <list>          <dbl> <list>   
#>  1 block1_x1     <chr [1]>       1     <int [1]>
#>  2 block1_x2     <chr [1]>       1     <int [1]>
#>  3 block1_x3     <chr [1]>       1     <int [1]>
#>  4 block2_x1     <chr [1]>       1     <int [1]>
#>  5 block2_x2     <chr [1]>       1     <int [1]>
#>  6 block3_x1     <chr [1]>       1     <int [1]>
#>  7 block3_x2     <chr [1]>       1     <int [1]>
#>  8 block3_x3     <chr [1]>       1     <int [1]>
#>  9 block3_x4     <chr [1]>       1     <int [1]>
#> 10 block3_x5     <chr [1]>       1     <int [1]>
#> 11 reduced_var_1 <chr [2]>       0.602 <int [2]>

unnest_mappings(prt)
#> # A tibble: 12 x 4
#>    variable      information mapping   indices
#>    <chr>               <dbl> <chr>       <int>
#>  1 block1_x1           1     block1_x1       1
#>  2 block1_x2           1     block1_x2       2
#>  3 block1_x3           1     block1_x3       3
#>  4 block2_x1           1     block2_x1       4
#>  5 block2_x2           1     block2_x2       5
#>  6 block3_x1           1     block3_x1       8
#>  7 block3_x2           1     block3_x2       9
#>  8 block3_x3           1     block3_x3      10
#>  9 block3_x4           1     block3_x4      11
#> 10 block3_x5           1     block3_x5      12
#> 11 reduced_var_1       0.602 block2_x3       6
#> 12 reduced_var_1       0.602 block2_x4       7

# use a lower minimum-information threshold with the k-means partitioner
partition(df, threshold = .5, partitioner = part_kmeans())
#> Partitioner:
#>    Director: K-Means Clusters 
#>    Metric: Minimum Intraclass Correlation 
#>    Reducer: Scaled Mean
#> 
#> Reduced Variables:
#> 2 reduced variables created from 7 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block3_x1, block3_x2, block3_x5}
#> reduced_var_2 = {block2_x1, block2_x2, block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.508

# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
  part_icc, 
  reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans) 
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: <custom reducer>
#> 
#> Reduced Variables:
#> 1 reduced variables created from 2 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block2_x3, block2_x4}
#> 
#> Minimum information:
#> 0.602
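
The long-format mapping returned by `unnest_mappings()` is convenient for looking up where an original variable ended up. The sketch below does not call the package: it rebuilds a few rows of the output shown above as a plain data frame (values copied from that output) and builds a lookup from them. The object names `mappings`, `lookup`, and `members` are only for illustration.

```r
# A few rows copied from the unnest_mappings() output above.
mappings <- data.frame(
  variable = c("block2_x1", "block2_x2", "reduced_var_1", "reduced_var_1"),
  information = c(1, 1, 0.602, 0.602),
  mapping = c("block2_x1", "block2_x2", "block2_x3", "block2_x4"),
  indices = c(4L, 5L, 6L, 7L),
  stringsAsFactors = FALSE
)

# Original variable -> variable in the reduced data set.
lookup <- setNames(mappings$variable, mappings$mapping)
lookup[["block2_x3"]]
#> [1] "reduced_var_1"

# Reduced variable -> the original variables it was built from.
members <- split(mappings$mapping, mappings$variable)
members[["reduced_var_1"]]
```

Because every original variable maps to exactly one variable in the reduced data, the lookup is a plain named vector with no duplicated names.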

partition also supports a number of ways to visualize partitions and permutation tests; these functions all start with plot_*(). They return ggplot objects and can therefore be extended with ggplot2.

plot_stacked_area_clusters(df) +
  ggplot2::theme_minimal(14)

Functions in partition

Name Description
build_next_name Create new variable name based on prefix and previous reductions
as_partition_step Create a partition object from a data frame
map_partition Map a partition across a range of minimum information
corr Efficiently fit correlation coefficient for matrix or two vectors
is_partition_step Is this object a partition_step?
part_kmeans Partitioner: K-means, ICC, scaled means
direct_distance Target based on minimum distance matrix
is_partition Is this object a partition?
part_minr2 Partitioner: distance, minimum R-squared, scaled means
linear_k_search Search for best k using the linear search method
direct_k_cluster Target based on K-means clustering
k_searching_forward Assess k search
filter_reduced Filter the reduced mappings
calculate_new_variable Calculate or retrieve stored reduced variable
matrix_is_exhausted Have all pairs of variables been checked for metric?
find_min_distance_variables Find the index of the pair with the smallest distance
fit_distance_matrix Fit a distance matrix using correlation coefficients
assign_partition Process a dataset with a partitioner
binary_k_search Search for best k using the binary search method
guess_init_k Guess initial k based on threshold and p
summarize_partitions Summarize and map partitions and permutations
measure_icc Measure the information loss of reduction using intraclass correlation coefficient
mutual_information Calculate the standardized mutual information of a data set
mapping_key Return partition mapping key
%>% Pipe operator
plot_area_clusters Plot partitions
search_k Search for the best k
scaled_mean Average and scale rows in a data.frame
part_icc Partitioner: distance, ICC, scaled means
count_clusters Helper functions to print partition summary
under_threshold Compare metric to threshold
direct_measure_reduce Apply a partitioner
pull_composite_variables Access mapping variables
fill_in_missing Process reduced variables when missing data
rewind_target Set target to last value
return_if_single Reduce targets if more than one variable, return otherwise
icc Calculate the intraclass correlation coefficient
measure_min_icc Measure the information loss of reduction using the minimum intraclass correlation coefficient
test_permutation Permute partitions
update_dist Only fit the distances for a new variable
increase_hits Count and retrieve the number of metrics below threshold
icc_r Calculate the intraclass correlation coefficient
measure_min_r2 Measure the information loss of reduction using minimum R-squared
is_partitioner Is this object a partitioner?
partition Agglomerative partitioning
is_same_function Are two functions the same?
k_exhausted Have all values of k been checked for metric?
measure_std_mutualinfo Measure the information loss of reduction using standardized mutual information
find_algorithm Which kmeans algorithm to use?
plot_permutation Plot permutation tests
measure_variance_explained Measure the information loss of reduction using the variance explained
partition_scores Return the reduced data from a partition
part_pc1 Partitioner: distance, first principal component, scaled means
cat_bold Print to the console in color
part_stdmi Partitioner: distance, mutual information, scaled means
paste_director Lookup partitioner types to print in English
reduce_mappings Create a mapping key out of a list of targets
reduce_scaled_mean Reduce selected variables to scaled means
reduce_cluster Reduce a target
replace_partitioner Replace the director, metric, or reducer for a partitioner
permute_df Permute a data set
simplify_names Simplify reduced variable names
simulate_block_data Simulate correlated blocks of variables
reduce_first_component Reduce selected variables to first principal component
reduce_kmeans Reduce selected variables to scaled means
as_partitioner Create a partitioner
as_measure Create a custom metric
as_director Create a custom director
as_reducer Create a custom reducer
get_indices Process mapping key to return from partition()
all_columns_reduced Check if all variables reduced to a single composite
append_mappings Append a new variable to mapping and filter out composite variables
as_partition Return a partition object
all_done Mark the partition as complete to stop search
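
Several of the functions listed above (`permute_df`, `test_permutation`, `plot_permutation`) compare observed partitions with partitions of permuted data. The idea behind permuting a data set can be sketched in base R: shuffling each column independently keeps every variable's own distribution but breaks the correlations between variables. `permute_columns()` is a hypothetical helper written for this example, and whether the package permutes in exactly this way is an assumption.

```r
# Hypothetical helper: shuffle every column independently of the others.
permute_columns <- function(df) {
  df[] <- lapply(df, sample)
  df
}

# Two variables that share a common signal.
set.seed(1234)
shared <- rnorm(200)
original <- data.frame(
  x1 = shared + rnorm(200, sd = 0.3),
  x2 = shared + rnorm(200, sd = 0.3)
)

permuted <- permute_columns(original)

cor(original$x1, original$x2)  # strong correlation in the observed data
cor(permuted$x1, permuted$x2)  # close to zero after permutation
```

Partitioning the permuted data then gives a reference for how much reduction to expect when the variables share no real structure.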

Vignettes of partition

Name
extending-partition.Rmd
introduction-to-partition.Rmd

Details

Type Package
License MIT + file LICENSE
Encoding UTF-8
LazyData true
LinkingTo Rcpp, RcppArmadillo
RoxygenNote 6.1.1
URL https://uscbiostats.github.io/partition/, https://github.com/USCbiostats/partition
BugReports https://github.com/USCbiostats/partition/issues
Language en-US
VignetteBuilder knitr
NeedsCompilation yes
Packaged 2019-05-16 16:08:33 UTC; malcolmbarrett
Repository CRAN
Date/Publication 2019-05-17 07:00:04 UTC
