The medic method uses agglomerative hierarchical clustering with a
bespoke distance measure based on ATC code similarity,
medication timing, and medication amount or dosage.
medic(
  data,
  k = 5,
  id,
  atc,
  timing,
  base_clustering,
  linkage = "complete",
  summation_method = "sum_of_minima",
  alpha = 1,
  beta = 1,
  gamma = 1,
  p = 1,
  theta = (5:0)/5,
  parallel = FALSE,
  return_distance_matrix = FALSE,
  set_seed = FALSE,
  ...
)

# S3 method for medic
print(x, ...)
An object of class medic which describes the clusters produced by the hierarchical clustering process. The object is a list with components:
the input data frame data with the cluster assignments appended at the end.
a data frame with the person id as given by id, the .analysis_order and the clusters found.
a list of the variables used in the clustering.
a data frame with all the input clustering parameters and the corresponding method names. These method names correspond to the column names for each cluster in the clustering data frame described right above.
a list of keys used internally in the function to keep track of simplified versions of the data.
the distance matrices for each method if return_distance_matrix is TRUE, otherwise NULL.
the matched call.
data: A data frame containing all the variables for the clustering.
k: A vector specifying the number of clusters to identify.
id: <tidy-select> An unquoted expression naming the variable in data describing person id.
atc: <tidy-select> An unquoted expression naming the variable in data containing ATC codes.
timing: <tidy-select> An unquoted expression naming the variable or variables in data describing medication timing. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables. Moreover, pattern matching selection helpers such as starts_with or num_range may also be used to select timing variables.
base_clustering: <tidy-select> An unquoted expression naming the variable in data that gives an initial clustering to start the medic from, or NULL.
linkage: The agglomeration method to be used in the clustering. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). See stats::hclust for more information. For a discussion of linkage criterion choice see details below.
summation_method: The summation method used in the distance measure. This should be either "double_sum" or "sum_of_minima". See details below for more information.
alpha: A number giving the tuning of the normalization. See details below for more information.
beta: A number giving the power of the individual medication combinations. See details below for more information.
gamma: A number giving the weight of the timing terms. See details below for more information.
p: The power of the Minkowski distance used in the timing-specific distance. See details below for more information.
theta: A vector of length 6 specifying the tuning of the ATC measure. See details below for more information.
parallel: A logical or an integer. If FALSE, the default, no parallelization is done. If TRUE or an integer larger than 2L, parallelization is implemented via parLapply from the parallel package. When parallel is TRUE, the number of clusters is set to detectCores() - 1, and when parallel is an integer, the number of clusters is set to parallel. For more details on the parallelization method see parallel::parLapply.
return_distance_matrix: A logical.
set_seed: A logical or an integer.
...: Additional arguments not currently in use.
x: A medic object for printing.
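The parallel argument refers to parLapply from the parallel package. The following is only a minimal sketch of that general pattern in base R, not the code medic runs internally; the two-worker setup and the toy squaring task are assumptions chosen to keep the example light.

```r
library(parallel)

# Start a small set of worker processes. The text above describes
# detectCores() - 1 workers when parallel = TRUE; two are used here
# purely for illustration.
cl <- makeCluster(2)

# Apply a function to each element of a list-like input across the workers.
squares <- parLapply(cl, 1:4, function(i) i^2)

# Always release the workers when done.
stopCluster(cl)

unlist(squares)
```

In this pattern the "number of clusters" mentioned for the parallel argument is the number of worker processes, which is unrelated to the number of medication clusters k.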
print(medic): Print method for medic objects
The medic method uses agglomerative hierarchical
clustering with a bespoke distance measure based on medication ATC codes and
timing similarities to assign medication pattern clusters to people.
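As a rough illustration of agglomerative hierarchical clustering on a precomputed distance matrix, the sketch below uses stats::hclust and stats::cutree. The distance values are made up, and whether medic calls these functions in exactly this way is an assumption; the linkage argument does point to stats::hclust for the agglomeration methods.

```r
# Hypothetical pairwise distances between four people.
d <- matrix(
  c(0.0, 0.1, 0.8, 0.9,
    0.1, 0.0, 0.7, 0.8,
    0.8, 0.7, 0.0, 0.2,
    0.9, 0.8, 0.2, 0.0),
  nrow = 4,
  dimnames = list(paste0("p", 1:4), paste0("p", 1:4))
)

# Build the tree with complete linkage (the default linkage in medic).
tree <- stats::hclust(stats::as.dist(d), method = "complete")

# Cut the tree into k = 2 clusters.
clusters <- stats::cutree(tree, k = 2)
clusters
```

Here p1 and p2 end up in one cluster and p3 and p4 in the other, because the within-pair distances are much smaller than the between-pair distances.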
Two versions of the distance measure are available:
The double sum:
$$d(p_i, p_j) = N_{\alpha}(M_i \times M_j) \sum_{m \in M_i} \sum_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im}, t_{jn})) - 1\right)^{\beta}$$
and the sum of minima:
$$d(p_i, p_j) = \frac{1}{2}\left(N_{\alpha}(M_i) \sum_{m \in M_i} \min_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im}, t_{jn})) - 1\right)^{\beta} + N_{\alpha}(M_j) \sum_{n \in M_j} \min_{m \in M_i} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im}, t_{jn})) - 1\right)^{\beta}\right)$$
In both versions the normalization is
$$N_{\alpha}(x) = |x|^{-\alpha}$$
If the normalization tuning, alpha, is 0, then no normalization is
performed and the distance measure becomes highly dependent on the number of
distinct medications given. That is, people using more medication will have
larger distances to others. If the normalization tuning, alpha, is 1,
the default, then the summation is normalized by the number of terms in
the sum; in other words, the average is calculated.
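The two summation methods and the effect of alpha can be sketched in base R for a single pair of people. This is a reading of the formulas above, not the package's implementation; the helper names pair_terms, double_sum and sum_of_minima and the toy matrices are hypothetical.

```r
# Pairwise terms ((1 + D)(1 + gamma * T) - 1)^beta for one pair of people.
# D and Tm are |M_i| x |M_j| matrices of ATC and timing distances.
pair_terms <- function(D, Tm, beta = 1, gamma = 1) {
  ((1 + D) * (1 + gamma * Tm) - 1)^beta
}

# Double sum: normalize by |M_i x M_j|^(-alpha), then sum all terms.
double_sum <- function(terms, alpha = 1) {
  length(terms)^(-alpha) * sum(terms)
}

# Sum of minima: average of the row-wise and column-wise sums of minima,
# each normalized by its own number of medications.
sum_of_minima <- function(terms, alpha = 1) {
  row_part <- nrow(terms)^(-alpha) * sum(apply(terms, 1, min))
  col_part <- ncol(terms)^(-alpha) * sum(apply(terms, 2, min))
  0.5 * (row_part + col_part)
}

# Person i uses 2 medications, person j uses 3 (made-up ATC distances).
D <- matrix(c(0.0, 0.4, 1.0,
              0.6, 0.0, 1.0), nrow = 2, byrow = TRUE)
Tm <- matrix(0, nrow = 2, ncol = 3)  # identical timing, so T = 0 everywhere

terms <- pair_terms(D, Tm)
double_sum(terms, alpha = 1)   # mean of the six terms
double_sum(terms, alpha = 0)   # plain sum, grows with the number of medications
sum_of_minima(terms, alpha = 1)
```

With alpha = 1 the double sum is the mean of all pairwise terms, while with alpha = 0 it is their raw sum, which is why people with many medications drift away from everyone else when no normalization is applied.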
The central idea of this method, namely the ATC distance, is given as
$$D_{\theta}(x, y) = \sum_{i=1}^{5} \mathbf{1}\{x \text{ and } y \text{ match on level } i \text{, but not level } i + 1\}\, \theta_i$$
The ATC distance is tuned using the vector theta.
Note that two ATC codes are said to match at level i when they are identical at level i. For example, the two codes N06AB01 and N06AA01 match on levels 1, 2, and 3, as they are both "N" at level 1, "N06" at level 2, and "N06A" at level 3, but at level 4 they differ ("N06AB" and "N06AA" are not the same).
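The matching rule can be sketched in base R. The helper name atc_match_level is hypothetical, and the prefix lengths 1, 3, 4, 5 and 7 follow the standard structure of the five ATC levels. How the six entries of theta are indexed is not stated above; the last lines assume they correspond to 0 through 5 matching levels, which should be read as an assumption rather than the package's definition.

```r
# Number of characters that make up ATC levels 1 to 5.
atc_prefix_lengths <- c(1, 3, 4, 5, 7)

# Deepest level at which two ATC codes are identical (0 if they already
# differ at level 1).
atc_match_level <- function(x, y) {
  same <- substring(x, 1, atc_prefix_lengths) == substring(y, 1, atc_prefix_lengths)
  complete <- nchar(x) >= atc_prefix_lengths & nchar(y) >= atc_prefix_lengths
  # cumprod() zeroes out every level after the first mismatch.
  sum(cumprod(same & complete))
}

level <- atc_match_level("N06AB01", "N06AA01")
level  # 3: the codes agree on "N", "N06" and "N06A"

# ASSUMPTION: theta[1] applies to 0 matching levels and theta[6] to 5.
theta <- (5:0) / 5
atc_distance <- theta[level + 1]
atc_distance
```

Under that indexing assumption and the default theta, identical codes get distance 0 and codes from different anatomical groups get distance 1.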
The timing distance is a simple Minkowski distance:
$$T_p(x, y) = \left(\sum_{t \in T} |x_t - y_t|^p\right)^{1/p}$$
When p is 1, the default, the Manhattan distance is used.
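The timing distance is straightforward to reproduce in base R. In this sketch the function name timing_distance and the two exposure vectors are hypothetical; stats::dist is used only as a cross-check.

```r
# Minkowski distance between two timing vectors of equal length.
timing_distance <- function(x, y, p = 1) {
  sum(abs(x - y)^p)^(1 / p)
}

# Hypothetical exposure indicators over three time periods.
x <- c(1, 0, 0)
y <- c(0, 1, 1)

manhattan <- timing_distance(x, y, p = 1)
euclidean <- timing_distance(x, y, p = 2)

# Cross-check against the built-in Minkowski distance.
reference <- as.numeric(stats::dist(rbind(x, y), method = "minkowski", p = 2))

c(manhattan = manhattan, euclidean = euclidean, reference = reference)
```

With p = 1 every differing period contributes its full absolute difference, while larger values of p put relatively more weight on the periods with the largest differences.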
summary.medic for summaries and plots.
employ for employing an existing clustering to new data.
enrich for enriching the metadata in the medic object with additional data.
# A simple clustering based only on ATC
clust <- medic(complications, id = id, atc = atc, k = 3)
# A simple clustering with both ATC and timing
clust <- medic(
  complications,
  id = id,
  atc = atc,
  timing = first_trimester:third_trimester,
  k = 3
)