medic: Medication clustering (based on ATC and timing)

Description

The medic method uses agglomerative hierarchical clustering with a bespoke distance measure based on ATC code similarity, medication timing, and medication amount or dosage.

Usage

medic(
  data,
  k = 5,
  id,
  atc,
  timing,
  base_clustering,
  linkage = "complete",
  summation_method = "sum_of_minima",
  alpha = 1,
  beta = 1,
  gamma = 1,
  p = 1,
  theta = (5:0)/5,
  parallel = FALSE,
  return_distance_matrix = FALSE,
  set_seed = FALSE,
  ...
)
# S3 method for medic
print(x, ...)

Value

An object of class medic which describes the clusters produced by the hierarchical clustering process. The object is a list with components:

data: the inputted data frame data with the cluster assignments appended at the end.
clustering: a data frame with the person id as given by id, the .analysis_order and the clusters found.
variables: a list of the variables used in the clustering.
parameters: a data frame with all the inputted clustering parameters and the corresponding method names. These method names correspond to the column names for each cluster in the clustering data frame described right above.
key: a list of keys used internally in the function to keep track of simplified versions of the data.
distance_matrix: the distance matrices for each method if return_distance_matrix is TRUE otherwise NULL.
call: the matched call.

Arguments

data

A data frame containing all the variables for the clustering.

k

A vector specifying the number of clusters to identify.

id

<tidy-select> An unquoted expression naming the variable in data describing person id.

atc

<tidy-select> An unquoted expression naming the variable in data containing ATC codes.

timing

<tidy-select> An unquoted expression naming the variable or variables in data describing medication timing. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables. Moreover, pattern matching selection helpers such as starts_with or num_range may also be used to select timing variables.

base_clustering

<tidy-select> An unquoted expression naming the variable in data that gives an initial clustering to start the medic from or NULL.

linkage

The agglomeration method to be used in the clustering. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). See stats::hclust for more information. For a discussion of linkage criterion choice see details below.

summation_method

The summation method used in the distance measure. This should be either "double_sum" or "sum_of_minima". See details below for more information.

alpha

A number giving the tuning of the normalization. See details below for more information.

beta

A number giving the power of the individual medication combinations. See details below for more information.

gamma

A number giving the weight of the timing terms. See details below for more information.

p

The power of the Minkowski distance used in the timing-specific distance. See details below for more information.

theta

A vector of length 6 specifying the tuning of the ATC measure. See details below for more information.

parallel

A logical or an integer. If FALSE, the default, no parallelization is done.

If TRUE or an integer larger than 2L parallelization is implemented via parLapply from the parallel package. When parallel is TRUE the number of clusters is set to detectCores - 1, and when parallel is an integer then the number of clusters is set to parallel. For more details on the parallelization method see parallel::parLapply.

return_distance_matrix

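A logical. If TRUE, the distance matrices for each method are returned in the distance_matrix component of the result; otherwise that component is NULL.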

set_seed

A logical or an integer.

...

Additional arguments not currently in use.

x

A medic object for printing.

Methods (by generic)

print(medic): Print method for medic-objects

Details

The medic method uses agglomerative hierarchical clustering with a bespoke distance measure based on medication ATC codes and timing similarities to assign medication pattern clusters to people.

Two versions of the distance measure are available:

The double sum:

$$d(p_i, p_j) = N_{\alpha}(M_i \times M_j) \sum_{m \in M_i}\sum_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta}.$$

and the sum of minima:

$$d(p_i, p_j) = \frac{1}{2}\left(N_{\alpha}(M_i)\sum_{m \in M_i}\min_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta} + N_{\alpha}(M_j) \sum_{n \in M_j}\min_{m \in M_i} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta}\right).$$

Normalization

$$N_{\alpha}(x) = |x|^{-\alpha}$$

If the normalization tuning, alpha, is 0, then no normalization is performed and the distance measure becomes highly dependent on the number of distinct medications given. That is, people using more medications will have larger distances to others. If the normalization tuning, alpha, is 1 (the default), then the summation is normalized by the number of terms in the sum; in other words, the average is calculated.
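As a rough illustration of how the two summation methods and the normalization fit together, the base R sketch below combines a matrix of pairwise terms. The helper names pair_term, double_sum, and sum_of_minima are hypothetical and are not part of the package. The sketch assumes that rows index the medications of person i and columns those of person j, and it uses made-up distances. It is not the package's internal implementation.

```r
# Hypothetical helpers for illustration only; not the package's internals.

# Combined ATC/timing term for one medication pair:
# ((1 + D) * (1 + gamma * T) - 1)^beta
pair_term <- function(atc_dist, timing_dist, gamma = 1, beta = 1) {
  ((1 + atc_dist) * (1 + gamma * timing_dist) - 1)^beta
}

# terms: matrix of pair terms, rows = medications of person i,
# columns = medications of person j.
double_sum <- function(terms, alpha = 1) {
  # N_alpha(M_i x M_j) = (number of medication pairs)^(-alpha)
  length(terms)^(-alpha) * sum(terms)
}

sum_of_minima <- function(terms, alpha = 1) {
  # Best match in person j for each medication of person i, and vice versa
  row_part <- nrow(terms)^(-alpha) * sum(apply(terms, 1, min))
  col_part <- ncol(terms)^(-alpha) * sum(apply(terms, 2, min))
  (row_part + col_part) / 2
}

# Made-up distances for two people with two medications each
atc_dist    <- matrix(c(0, 0.4, 1, 0.2), nrow = 2)
timing_dist <- matrix(0, nrow = 2, ncol = 2)  # identical timing
terms <- pair_term(atc_dist, timing_dist)

double_sum(terms, alpha = 1)     # average over all four pairs
sum_of_minima(terms, alpha = 1)  # average of the best matches in each direction
double_sum(terms, alpha = 0)     # no normalization: grows with the number of pairs
```

With alpha = 0 the double sum simply adds all four terms, which shows why unnormalized distances grow with the number of medications a person uses.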

ATC distance

The central idea of this method, namely the ATC distance, is given as

$$D_{\theta}(x, y) = \sum_{i=1}^{5} 1\{x \text{ and } y \text{ match on level } i \text{, but not level } i + 1\}\,\theta_i$$

The ATC distance is tuned using the vector theta.

Note that two ATC codes are said to match at level i when they are identical at level i. For example, the two codes N06AB01 and N06AA01 match on levels 1, 2, and 3, as they are both "N" at level 1, "N06" at level 2, and "N06A" at level 3, but at level 4 they differ ("N06AB" and "N06AA" are not the same).
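The level matching described above can be sketched in base R as follows. The function atc_distance is a hypothetical helper, not part of the package. The sketch assumes the standard ATC prefix lengths of 1, 3, 4, 5, and 7 characters for levels 1 to 5. It also assumes that the first element of theta applies when two codes match on no level and the last element when they match on all five. That indexing is an assumption made here for illustration and is not confirmed by this page.

```r
# Hypothetical helper for illustration only; not the package's internals.
# Assumption: ATC levels 1-5 end at characters 1, 3, 4, 5 and 7.
atc_level_end <- c(1, 3, 4, 5, 7)

atc_distance <- function(x, y, theta = (5:0) / 5) {
  # Prefix of each code at every ATC level
  prefix_x <- substring(x, 1, atc_level_end)
  prefix_y <- substring(y, 1, atc_level_end)
  # A level only counts if both codes are long enough to define it
  defined <- nchar(x) >= atc_level_end & nchar(y) >= atc_level_end
  # Prefixes are nested, so the count of matching levels is the deepest match
  matched_levels <- sum(prefix_x == prefix_y & defined)
  # Assumption: theta[1] is used for no match, theta[6] for a full match
  theta[matched_levels + 1]
}

atc_distance("N06AB01", "N06AA01")  # match on levels 1-3, so theta[4]
atc_distance("N06AB01", "N06AB01")  # identical codes, so theta[6]
atc_distance("A10BA02", "N06AB01")  # no shared level, so theta[1]
```

Under these assumptions and the default theta = (5:0)/5, codes that share more levels get a smaller distance, from 1 for unrelated codes down to 0 for identical codes.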

Timing distance

The timing distance is a simple Minkowski distance:

$$T_p(x,y) = \left(\sum_{t \in T} |x_t - y_t|^p\right)^{1/p}.$$

When p is 1, the default, the Manhattan distance is used.
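A minimal base R sketch of the Minkowski distance is shown below. The function timing_distance and the example vectors are hypothetical and are not taken from the package; each vector is assumed to hold one person's exposure to a single medication across three timing variables.

```r
# Hypothetical helper for illustration only; not the package's internals.
timing_distance <- function(x, y, p = 1) {
  sum(abs(x - y)^p)^(1 / p)
}

# Made-up exposure indicators over three time periods
first_person  <- c(1, 0, 0)
second_person <- c(0, 1, 1)

timing_distance(first_person, second_person)         # Manhattan distance: 3
timing_distance(first_person, second_person, p = 2)  # Euclidean distance: sqrt(3)
```

Larger values of p put relatively more weight on the time periods where two people differ the most.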

Examples

# A simple clustering based only on ATC
clust <- medic(complications, id = id, atc = atc, k = 3)

# A simple clustering with both ATC and timing
clust <- medic(
  complications,
  id = id,
  atc = atc,
  timing = first_trimester:third_trimester,
  k = 3
)
