distantia Time Series Dissimilarity

Warning

Version 2.0.0 of distantia is a full re-write of the original package and will break existing workflows before making them better. Please refer to the Changelog for details before updating.

Summary

The R package distantia offers an efficient, feature-rich toolkit for managing, comparing, and analyzing time series data. It is designed to handle a wide range of scenarios, including:

  • Multivariate and univariate time series.
  • Regular and irregular sampling.
  • Time series of different lengths.

Key Features

Comprehensive Analytical Tools

  • 10 distance metrics: see distantia::distances.
  • The normalized dissimilarity metric psi.
  • Free and Restricted Dynamic Time Warping (DTW) for shape-based comparison.
  • A Lock-Step method for sample-to-sample comparison.
  • Restricted permutation tests for robust inferential support.
  • Analysis of contribution to dissimilarity of individual variables in multivariate time series.
  • Hierarchical and K-means clustering of time series based on dissimilarity matrices.

Computational Efficiency

  • A C++ back-end powered by Rcpp.
  • Parallel processing managed through the future package.
  • Efficient data handling via zoo.

Time Series Management Tools

  • Introduces time series lists (TSL), a versatile format for handling collections of time series stored as lists of zoo objects.
  • Includes a suite of tsl_...() functions for generating, resampling, transforming, analyzing, and visualizing univariate and multivariate time series.

Citation

If you find this package useful, please cite it as:

Blas M. Benito, H. John B. Birks (2020). distantia: an open-source toolset to quantify dissimilarity between multivariate ecological time-series. Ecography, 43(5), 660-667. doi: 10.1111/ecog.04895.

Blas M. Benito (2024). distantia: A Toolset for Time Series Dissimilarity Analysis. R package version 2.0.0. url: https://blasbenito.github.io/distantia/.

Install

Version 1.0.2 of distantia can be installed from CRAN.

install.packages("distantia")

Version 2.0.0 can be installed from GitHub.

remotes::install_github(
  repo = "blasbenito/distantia",
  ref = "development"
)

Getting Started

This section showcases several features of the package distantia. Please check the Articles section for further details.

Setup

All heavy-duty functions in distantia support parallelization via the future package. However, because the C++ backend of distantia is highly efficient, parallel execution only pays off for very large datasets and restricted permutation analyses.

Progress bars provided by the progressr package are also available. Unfortunately, they do not work in R Markdown documents like this one.

library(distantia, quietly = TRUE)
library(future)
# library(progressr)

#parallelization setup
# future::plan(future::multisession)

#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)

Example Data

The albatross data frame contains daily GPS data of 4 individuals of Waved Albatross in the Pacific captured during the summer of 2008. Below are the first rows of this data frame:

#>   name       time         x         y     speed temperature  heading
#> 1 X132 2008-05-31 -89.62097 -1.389512 0.1473333    29.06667 212.0307
#> 2 X132 2008-06-01 -89.62101 -1.389508 0.2156250    28.25000 184.0337
#> 3 X132 2008-06-02 -89.62101 -1.389503 0.2143750    27.68750 123.1269
#> 4 X132 2008-06-03 -89.62099 -1.389508 0.2018750    27.81250 183.4600
#> 5 X132 2008-06-04 -89.62098 -1.389507 0.2256250    27.68750 114.8931
#> 6 X132 2008-06-05 -89.62925 -1.425734 1.3706667    25.73333 245.8033

The code below transforms the data to a Time Series List with tsl_initialize() and applies global scaling and centering with tsl_transform() and f_scale_global to facilitate time series comparisons.

tsl <- tsl_initialize(
  x = distantia::albatross,
  name_column = "name",
  time_column = "time",
  lock_step = TRUE
) |> 
  tsl_transform(
    f = f_scale_global
  )

tsl_plot(
  tsl = tsl,
  ylim = "relative"
)
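To make the idea of global scaling concrete, here is a minimal base R sketch. It does not call distantia and is not the implementation of f_scale_global; it assumes that "global" means one mean and one standard deviation per variable, computed from all time series pooled together. The toy values and object names are hypothetical.

```r
# Two toy series sharing one variable (hypothetical values).
series <- list(
  a = data.frame(temperature = c(10, 12, 14)),
  b = data.frame(temperature = c(20, 22, 24))
)

# Global parameters (assumption): one mean and one standard deviation
# per variable, computed from the values of all series pooled together.
pooled <- do.call(rbind, series)
global_mean <- colMeans(pooled)
global_sd <- apply(pooled, 2, stats::sd)

# Apply the same parameters to every series.
scaled_global <- lapply(
  series,
  function(df) as.data.frame(scale(df, center = global_mean, scale = global_sd))
)

# Local scaling, for contrast: each series uses its own mean and sd,
# which erases differences in level between series.
scaled_local <- lapply(
  series,
  function(df) as.data.frame(scale(df))
)

# Mean of each series after global and local scaling.
sapply(scaled_global, function(df) mean(df$temperature))
sapply(scaled_local, function(df) mean(df$temperature))
```

Under these assumptions, global scaling keeps series a below series b, while local scaling centers both at zero. That is why a global transformation is the safer choice when differences in magnitude between time series should count towards dissimilarity.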

Dissimilarity Analysis

Lock-Step Analysis

Lock-step analysis performs direct comparisons between samples observed at the same time without any time distortion. It requires time series of the same length, preferably observed at the same times.

df_ls <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

df_ls[, c("x", "y", "psi")]
#>      x    y      psi
#> 1 X132 X134 1.888451
#> 3 X132 X153 2.128340
#> 5 X134 X153 2.187862
#> 4 X134 X136 2.270977
#> 2 X132 X136 2.427479
#> 6 X136 X153 2.666099

The “psi” column shows normalized dissimilarity values and is used to sort the data frame from lowest to highest dissimilarity. Hence, the first row shows the most similar pair of time series.
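The following base R sketch illustrates how a normalized lock-step score of this kind can be built. It is not the package's code: the function lock_step_psi() and the toy matrices are hypothetical, and the normalization (twice the lock-step sum divided by the summed within-series variation, so that identical series score 0) is an assumption. The exact equation used by the package is documented in psi_equation().

```r
# Euclidean distance between matching rows of two matrices.
row_distance <- function(p, q) sqrt(rowSums((p - q)^2))

# Sum of distances between consecutive samples within one series.
auto_sum <- function(m) {
  sum(row_distance(m[-1, , drop = FALSE], m[-nrow(m), , drop = FALSE]))
}

# Hypothetical lock-step score (assumed normalization, see lead-in).
lock_step_psi <- function(a, b) {
  stopifnot(nrow(a) == nrow(b), ncol(a) == ncol(b))
  between <- sum(row_distance(a, b))   # sample-to-sample distances
  within <- auto_sum(a) + auto_sum(b)  # variation inside each series
  2 * between / within
}

# Two toy multivariate series with the same number of rows.
a <- cbind(v1 = c(1, 2, 3, 4), v2 = c(2, 2, 3, 5))
b <- cbind(v1 = c(1, 3, 3, 5), v2 = c(1, 2, 4, 4))

lock_step_psi(a, b)  # greater than zero: the series differ
lock_step_psi(a, a)  # zero: a series compared with itself
```

Dividing by the within-series variation is what makes scores comparable across pairs of time series with different amounts of internal change.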

The function distantia_boxplot() enables quick identification of the time series that are most dissimilar (top) or most similar (bottom) to the others.

distantia_boxplot(df = df_ls, text_cex = 0.8)

Dynamic Time Warping

By default, distantia() computes unrestricted dynamic time warping with orthogonal and diagonal least cost paths.

df_dtw <- distantia(
  tsl = tsl
)

df_dtw[, c("x", "y", "psi")]
#>      x    y      psi
#> 1 X132 X134 1.299380
#> 5 X134 X153 2.074241
#> 3 X132 X153 2.091923
#> 4 X134 X136 2.358040
#> 2 X132 X136 2.449381
#> 6 X136 X153 2.666099

The function distantia_dtw_plot() provides detailed insights into the alignment between a pair of time series resulting from DTW.

distantia_dtw_plot(
  tsl = tsl[c("X132", "X153")]
)

Deviations from the perfect diagonal in the least-cost path reveal adjustments made by DTW to align time series by shape rather than time.
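A small base R sketch can show where such deviations come from. It implements textbook dynamic time warping with orthogonal and diagonal moves on two hypothetical univariate series. It is an illustration only and does not reproduce the C++ routines of distantia, which may weight diagonal moves differently.

```r
# Two toy univariate series; b is a copy of a delayed by one step.
a <- c(0, 1, 3, 1, 0, 0)
b <- c(0, 0, 1, 3, 1, 0)

# Pairwise absolute distances between all samples of a and b.
d <- abs(outer(a, b, "-"))

# Cumulative cost matrix allowing orthogonal and diagonal moves.
n <- length(a)
m <- length(b)
cost <- matrix(Inf, n, m)
cost[1, 1] <- d[1, 1]
for (i in seq_len(n)) {
  for (j in seq_len(m)) {
    if (i == 1 && j == 1) next
    up <- if (i > 1) cost[i - 1, j] else Inf
    left <- if (j > 1) cost[i, j - 1] else Inf
    diagonal <- if (i > 1 && j > 1) cost[i - 1, j - 1] else Inf
    cost[i, j] <- d[i, j] + min(up, left, diagonal)
  }
}

# Backtrack from the last cell to the first to recover the path.
i <- n
j <- m
path <- data.frame(a = i, b = j)
while (i > 1 || j > 1) {
  candidates <- rbind(c(i - 1, j - 1), c(i - 1, j), c(i, j - 1))
  valid <- candidates[, 1] >= 1 & candidates[, 2] >= 1
  candidates <- candidates[valid, , drop = FALSE]
  best <- candidates[which.min(cost[candidates]), ]
  i <- best[1]
  j <- best[2]
  path <- rbind(data.frame(a = i, b = j), path)
}

cost[n, m]       # total cost of the warped alignment
sum(abs(a - b))  # cost of the lock-step (perfect diagonal) alignment
path             # rows where a != b are deviations from the diagonal
```

In this toy case the warped alignment absorbs the one-step delay, so its total cost is lower than the lock-step sum, and the path leaves the diagonal wherever one sample of a series is matched to several samples of the other.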

The article Dynamic Time Warping vs Lock-Step provides further insights on the advantages and disadvantages of each method in different scenarios.

Permutation Test

The function distantia() implements restricted permutation tests to assess the significance of dissimilarity scores. It provides several setups to support different assumptions.

For example, the configuration below rearranges complete rows within 7-day blocks, assuming strong dependencies within rows and between observations that are close in time.

future::plan(future::multisession)

df_dtw <- distantia(
  tsl = tsl,
  repetitions = 1000,
  permutation = "restricted_by_row",
  block_size = 7 #one week
)

future::plan(future::sequential)

df_dtw[, c("x", "y", "psi", "p_value")]
#>      x    y      psi p_value
#> 1 X132 X134 1.299380   0.001
#> 5 X134 X153 2.074241   0.001
#> 3 X132 X153 2.091923   0.001
#> 4 X134 X136 2.358040   0.184
#> 2 X132 X136 2.449381   0.499
#> 6 X136 X153 2.666099   0.007

The “p_value” column represents the fraction of permutations yielding a psi score lower than the observed value, and indicates the strength of similarity between two time series. A significance threshold (e.g., 0.05, chosen with the number of repetitions in mind) helps identify pairs of time series with robust similarity.
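The logic of a restricted permutation test can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: permute_rows_in_blocks() and dissimilarity() are hypothetical helpers, the data are simulated, the score is a plain mean Euclidean distance rather than psi, and the p-value follows the definition given above (the package may treat ties or the observed score differently).

```r
set.seed(1)

# Shuffle complete rows, but only within consecutive blocks of
# block_size rows, so that coarse temporal structure is preserved.
permute_rows_in_blocks <- function(m, block_size) {
  block <- ceiling(seq_len(nrow(m)) / block_size)
  new_order <- unlist(
    lapply(
      split(seq_len(nrow(m)), block),
      function(idx) idx[sample.int(length(idx))]
    ),
    use.names = FALSE
  )
  m[new_order, , drop = FALSE]
}

# Hypothetical stand-in for psi: mean Euclidean distance between rows.
dissimilarity <- function(a, b) mean(sqrt(rowSums((a - b)^2)))

# Two simulated series: b is a noisy copy of a (28 daily samples).
time <- 1:28
a <- cbind(x = sin(time / 4), y = cos(time / 4))
b <- a + matrix(rnorm(length(a), sd = 0.1), nrow = nrow(a))

observed <- dissimilarity(a, b)

# Null distribution: scores after permuting both series within
# 7-row blocks.
null_scores <- replicate(
  200,
  dissimilarity(permute_rows_in_blocks(a, 7), permute_rows_in_blocks(b, 7))
)

# Fraction of permuted scores lower than the observed score.
p_value <- mean(null_scores < observed)
p_value
```

A small p-value here means that shuffling rows within weeks almost always makes the two series less alike, which supports the observed similarity being more than a chance arrangement.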

Variable Importance

When comparing multivariate time series, certain variables contribute more to similarity or dissimilarity. The momentum() function uses a leave-one-out algorithm to quantify each variable’s contribution to the overall dissimilarity between two time series.

df_importance <- momentum(
  tsl = tsl
)

df_importance[, c("x", "y", "variable", "importance", "effect")]
#>       x    y    variable   importance               effect
#> 1  X132 X134           x   87.6066043 decreases similarity
#> 2  X132 X134           y   93.9587187 decreases similarity
#> 3  X132 X134       speed  -21.9171171 increases similarity
#> 4  X132 X134 temperature   72.8121621 decreases similarity
#> 5  X132 X134     heading  -38.0165137 increases similarity
#> 6  X132 X136           x   48.3845903 decreases similarity
#> 7  X132 X136           y   93.5214543 decreases similarity
#> 8  X132 X136       speed  -61.1729252 increases similarity
#> 9  X132 X136 temperature  356.8824838 decreases similarity
#> 10 X132 X136     heading -102.9830173 increases similarity
#> 11 X132 X153           x  427.7381576 decreases similarity
#> 12 X132 X153           y  156.1285451 decreases similarity
#> 13 X132 X153       speed  -40.9249630 increases similarity
#> 14 X132 X153 temperature  -14.2831545 increases similarity
#> 15 X132 X153     heading  -79.3532025 increases similarity
#> 16 X134 X136           x   61.3361468 decreases similarity
#> 17 X134 X136           y  108.9650664 decreases similarity
#> 18 X134 X136       speed  -59.2603918 increases similarity
#> 19 X134 X136 temperature  310.6812842 decreases similarity
#> 20 X134 X136     heading  -90.2797292 increases similarity
#> 21 X134 X153           x  592.0783167 decreases similarity
#> 22 X134 X153           y  116.4310429 decreases similarity
#> 23 X134 X153       speed  -52.4149093 increases similarity
#> 24 X134 X153 temperature    0.9936944 decreases similarity
#> 25 X134 X153     heading  -85.0271172 increases similarity
#> 26 X136 X153           x  507.6153648 decreases similarity
#> 27 X136 X153           y   56.6957442 decreases similarity
#> 28 X136 X153       speed  -65.4516103 increases similarity
#> 29 X136 X153 temperature  240.9053814 decreases similarity
#> 30 X136 X153     heading -116.2461929 increases similarity

Positive “importance” values indicate variables contributing to dissimilarity, while negative values indicate contribution to similarity. The function documentation provides more details on how importance scores are computed.
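As a rough illustration of the leave-one-out idea, the base R sketch below recomputes a simplified lock-step score after dropping each variable in turn. Everything here is an assumption made for illustration: the score() helper, the toy data, and the percentage formula are hypothetical, and the package computes importance differently (see the documentation of momentum()).

```r
# Euclidean distance between matching rows of two matrices.
row_distance <- function(p, q) sqrt(rowSums((p - q)^2))

# Simplified lock-step score (assumption, not the package's psi).
score <- function(a, b) {
  between <- sum(row_distance(a, b))
  within <-
    sum(row_distance(a[-1, , drop = FALSE], a[-nrow(a), , drop = FALSE])) +
    sum(row_distance(b[-1, , drop = FALSE], b[-nrow(b), , drop = FALSE]))
  2 * between / within
}

# Toy series: x is identical in both, y differs strongly, z barely.
a <- cbind(x = c(1, 2, 3, 4, 5), y = c(5, 3, 4, 2, 1), z = c(0, 1, 0, 1, 0))
b <- cbind(x = c(1, 2, 3, 4, 5), y = c(1, 2, 4, 3, 5), z = c(0, 1, 0, 1, 0.5))

score_all <- score(a, b)

# Leave one variable out at a time and express the change in score
# as a percentage of the score computed with all variables.
importance <- vapply(
  colnames(a),
  function(v) {
    keep <- setdiff(colnames(a), v)
    score_without <- score(a[, keep, drop = FALSE], b[, keep, drop = FALSE])
    100 * (score_all - score_without) / score_all
  },
  numeric(1)
)

importance
```

With this sign convention, dropping y lowers the score, so y gets a positive value (it contributes to dissimilarity), while dropping the shared variable x raises the score, so x gets a negative value (it contributes to similarity).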

The momentum_boxplot() function provides a quick insight into which variables contribute the most to similarity or dissimilarity across all pairs of time series.

momentum_boxplot(
  df = df_importance
)

Clustering

The package distantia provides tools to group time series by dissimilarity using hierarchical or K-means clustering. The example below applies the former to the albatross dataset to identify groups of individuals with the most similar movement time series.

dtw_hclust <- distantia_cluster_hclust(
  df = df_dtw,
  clusters = NULL, #automatic mode
  method = NULL    #automatic mode
  )

#cluster object
dtw_hclust$cluster_object
#> 
#> Call:
#> stats::hclust(d = d_dist, method = method)
#> 
#> Cluster method   : ward.D 
#> Number of objects: 4

#number of clusters
dtw_hclust$clusters
#> [1] 2

#clustering data frame
#group label in column "cluster"
#negatives in column "silhouette_width" highlight anomalous cluster assignments
dtw_hclust$df
#>   name cluster silhouette_width
#> 1 X132       1        0.3077225
#> 2 X134       1        0.2846556
#> 3 X136       2        0.0000000
#> 4 X153       1        0.2186781

#tree plot
par(mar=c(3,1,1,3))

plot(
  x = stats::as.dendrogram(
    dtw_hclust$cluster_object
    ),
  horiz = TRUE
)
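For readers curious about what happens underneath, the sketch below rebuilds a comparable grouping with base R only. It copies the psi scores from the dynamic time warping output shown earlier in this document into a distance matrix, and fixes the method ("ward.D") and the number of clusters (2) to the values reported above. It is an approximation for illustration: distantia_cluster_hclust() selects these settings automatically, which this sketch does not attempt.

```r
# Psi scores copied from the DTW output shown earlier in this document.
psi <- data.frame(
  x = c("X132", "X134", "X132", "X134", "X132", "X136"),
  y = c("X134", "X153", "X153", "X136", "X136", "X153"),
  psi = c(1.299380, 2.074241, 2.091923, 2.358040, 2.449381, 2.666099)
)

# Build a symmetric dissimilarity matrix with zeros on the diagonal.
ids <- sort(unique(c(psi$x, psi$y)))
m <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
m[cbind(psi$x, psi$y)] <- psi$psi
m[cbind(psi$y, psi$x)] <- psi$psi

# Hierarchical clustering with the method reported above,
# cut into the number of clusters reported above.
tree <- stats::hclust(stats::as.dist(m), method = "ward.D")
groups <- stats::cutree(tree, k = 2)

groups
```

The resulting grouping leaves X136 in a cluster of its own, in line with the clustering data frame shown above.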

This is just a summary of the features implemented in the package. Please visit the Articles section to find out more about distantia.

Getting help

If you encounter bugs or issues with the documentation, please file an issue on GitHub.

Version

2.0.0

License

MIT + file LICENSE

Maintainer

Blas M. Benito

Last Published

January 8th, 2025

Functions in distantia (2.0.0)

cost_path_diagonal_cpp

(C++) Orthogonal and Diagonal Least Cost Path
cost_matrix_diagonal_weighted_cpp

(C++) Compute Orthogonal and Weighted Diagonal Least Cost Matrix from a Distance Matrix
cost_path_orthogonal_cpp

(C++) Orthogonal Least Cost Path
cost_path_slotting_cpp

(C++) Least Cost Path for Sequence Slotting
cost_path_trim_cpp

(C++) Remove Blocks from a Least Cost Path
cost_path_sum_cpp

(C++) Sum Distances in a Least Cost Path
cost_path_orthogonal_bandwidth_cpp

(C++) Orthogonal Least Cost Path
cost_matrix_orthogonal_cpp

(C++) Compute Orthogonal Least Cost Matrix from a Distance Matrix
distance_bray_curtis_cpp

(C++) Bray-Curtis Distance Between Two Vectors
distance

Distance Between Two Numeric Vectors
covid_counties

County Coordinates of the Covid Prevalence Dataset
distance_euclidean_cpp

(C++) Euclidean Distance Between Two Vectors
distance_hamming_cpp

(C++) Hamming Distance Between Two Binary Vectors
distance_chi_cpp

(C++) Normalized Chi Distance Between Two Vectors
distance_canberra_cpp

(C++) Canberra Distance Between Two Binary Vectors
distance_russelrao_cpp

(C++) Russell-Rao Distance Between Two Binary Vectors
distance_sorensen_cpp

(C++) Sørensen Distance Between Two Binary Vectors
cost_path_cpp

Least Cost Path
covid_prevalence

Time Series of Covid Prevalence in California Counties
distance_ls_cpp

(C++) Sum of Pairwise Distances Between Cases in Two Aligned Time Series
distance_hellinger_cpp

(C++) Hellinger Distance Between Two Vectors
distance_jaccard_cpp

(C++) Jaccard Distance Between Two Binary Vectors
distance_chebyshev_cpp

(C++) Chebyshev Distance Between Two Vectors
distance_matrix

Data Frame to Distance Matrix
distance_manhattan_cpp

(C++) Manhattan Distance Between Two Vectors
distantia_cluster_kmeans

K-Means Clustering of Dissimilarity Analysis Data Frames
distantia_dtw

Dynamic Time Warping Dissimilarity Analysis of Time Series Lists
distance_cosine_cpp

(C++) Cosine Dissimilarity Between Two Vectors
distantia_dtw_plot

Two-Way Dissimilarity Plots of Time Series Lists
distantia_ls

Lock-Step Dissimilarity Analysis of Time Series Lists
distantia_boxplot

Distantia Boxplot
distance_matrix_cpp

(C++) Distance Matrix of Two Time Series
cost_path_diagonal_bandwidth_cpp

(C++) Orthogonal and Diagonal Least Cost Path Restricted by Sakoe-Chiba band
distantia_model_frame

Dissimilarity Model Frame
distantia_matrix

Convert Dissimilarity Analysis Data Frame to Distance Matrix
distantia_cluster_hclust

Hierarchical Clustering of Dissimilarity Analysis Data Frames
f_clr

Data Transformation: Rowwise Centered Log-Ratio
f_detrend_linear

Data Transformation: Linear Detrending of Zoo Time Series
f_detrend_poly

Data Transformation: Polynomial Linear Detrending of Zoo Time Series
eemian_pollen

Pollen Counts of Nine Interglacial Sites in Central Europe
f_binary

Transform Zoo Object to Binary
distantia

Dissimilarity Analysis of Time Series Lists
distances

Distance Methods
distantia-package

distantia: A Toolset for Time Series Dissimilarity Analysis
f_detrend_difference

Data Transformation: Detrending and Differencing
f_rescale_local

Data Transformation: Local Rescaling to a New Range
f_rescale_global

Data Transformation: Global Rescaling to a New Range
f_proportion_sqrt

Data Transformation: Rowwise Square Root of Proportions
f_scale_global

Data Transformation: Global Centering and Scaling
distantia_time_delay

Time Shift Between Time Series
f_proportion

Data Transformation: Rowwise Proportions
distantia_spatial

Spatial Representation of distantia() Data Frames
f_log

Data Transformation: Log
f_scale_local

Data Transformation: Local Centering and Scaling
eemian_coordinates

Site Coordinates of Nine Interglacial Sites in Central Europe
distantia_aggregate

Aggregate distantia() Data Frames Across Parameter Combinations
importance_dtw_cpp

(C++) Contribution of Individual Variables to the Dissimilarity Between Two Time Series (Robust Version)
f_percent

Data Transformation: Rowwise Percentages
honeycomb_climate

Rainfall and Temperature in The Americas
fagus_coordinates

Site Coordinates of Fagus sylvatica Stands
importance_dtw_legacy_cpp

(C++) Contribution of Individual Variables to the Dissimilarity Between Two Time Series (Legacy Version)
fagus_dynamics

Time Series Data from Three Fagus sylvatica Stands
distantia_stats

Stats of Dissimilarity Data Frame
f_list

Lists Available Transformation Functions
f_hellinger

Data Transformation: Rowwise Hellinger Transformation
momentum_model_frame

Dissimilarity Model Frame
momentum_boxplot

Momentum Boxplot
momentum_aggregate

Aggregate momentum() Data Frames Across Parameter Combinations
f_trend_poly

Data Transformation: Polynomial Linear Trend of Zoo Time Series
permute_free_by_row_cpp

(C++) Unrestricted Permutation of Complete Rows
permute_free_cpp

(C++) Unrestricted Permutation of Cases
f_trend_linear

Data Transformation: Linear Trend of Zoo Time Series
honeycomb_polygons

Hexagonal Grid
momentum_stats

Stats of Dissimilarity Data Frame
permute_restricted_by_row_cpp

(C++) Restricted Permutation of Complete Rows Within Blocks
importance_ls_cpp

(C++) Contribution of Individual Variables to the Dissimilarity Between Two Aligned Time Series
momentum_spatial

Spatial Representation of momentum() Data Frames
permute_restricted_cpp

(C++) Restricted Permutation of Cases Within Blocks
momentum_to_wide

Momentum Data Frame to Wide Format
psi_dtw_cpp

(C++) Psi Dissimilarity Score of Two Time-Series
psi_distance_matrix

Distance Matrix
psi_null_ls_cpp

(C++) Null Distribution of the Dissimilarity Scores of Two Aligned Time Series
subset_matrix_by_rows_cpp

(C++) Subset Matrix by Rows
psi_ls_cpp

(C++) Psi Dissimilarity Score of Two Aligned Time Series
psi_null_dtw_cpp

(C++) Null Distribution of Dissimilarity Scores of Two Time Series
psi_equation

Normalized Dissimilarity Score
psi_auto_distance

Cumulative Sum of Distances Between Consecutive Cases in a Time Series
psi_equation_cpp

(C++) Equation of the Psi Dissimilarity Score
psi_auto_sum

Auto Sum
tsl_aggregate

Aggregate Time Series List Over Time Periods
psi_cost_matrix

Cost Matrix
tsl_names_clean

Clean Time Series Names in a Time Series List
momentum

Contribution of Individual Variables to Time Series Dissimilarity
tsl_names_get

Get Time Series Names from a Time Series List
tsl_burst

Multivariate TSL to Univariate TSL
tsl_diagnose

Diagnose Issues in Time Series Lists
tsl_handle_NA

Handle NA Cases in Time Series Lists
psi_cost_path_sum

Sum of Distances in Least Cost Path
momentum_ls

Lock-Step Variable Importance Analysis of Multivariate Time Series Lists
tsl_initialize

Transform Raw Time Series Data to Time Series List
tsl_colnames_prefix

Append Prefix to Column Names of Time Series List
tsl_join

Join Time Series Lists
momentum_dtw

Dynamic Time Warping Variable Importance Analysis of Multivariate Time Series Lists
psi_distance_lock_step

Lock-Step Distance
tsl_colnames_set

Set Column Names in Time Series Lists
psi_cost_path

Least Cost Path
tsl_ncol

Get Number of Columns in Time Series Lists
tsl_colnames_clean

Clean Column Names in Time Series Lists
tsl_nrow

Get Number of Rows in Time Series Lists
tsl_smooth

Smoothing of Time Series Lists
tsl_repair

Repair Issues in Time Series Lists
tsl_plot

Plot Time Series List
tsl_transform

Transform Values in Time Series Lists
tsl_stats

Summary Statistics of Time Series Lists
tsl_colnames_get

Get Column Names from a Time Series List
tsl_colnames_suffix

Append Suffix to Column Names of Time Series List
tsl_subset

Subset Time Series Lists by Time Series Names, Time, and/or Column Names
tsl_to_df

Transform Time Series List to Data Frame
tsl_names_set

Set Time Series Names in a Time Series List
tsl_count_NA

Count NA Cases in Time Series Lists
utils_as_time

Ensures Correct Class for Time Arguments
tsl_time

Time Features of Time Series Lists
utils_check_args_matrix

Checks Input Matrix
tsl_names_test

Tests Naming Issues in Time Series Lists
tsl_resample

Resample Time Series Lists to a New Time
utils_boxplot_common

Common Boxplot Component of distantia_boxplot() and momentum_boxplot()
utils_check_args_zoo

Checks Argument x
utils_coerce_time_class

Coerces Vector to a Given Time Class
utils_check_args_momentum

Check Input Arguments of momentum()
utils_cluster_silhouette

Compute Silhouette Width of a Clustering Solution
utils_cluster_kmeans_optimizer

Optimize the Silhouette Width of K-Means Clustering Solutions
utils_block_size

Default Block Size for Restricted Permutation in Dissimilarity Analyses
utils_check_distance_args

Check Distance Argument
utils_cluster_hclust_optimizer

Optimize the Silhouette Width of Hierarchical Clustering Solutions
utils_check_list_class

Checks Classes of List Elements Against Expectation
tsl_simulate

Simulate a Time Series List
utils_clean_names

Clean Character Vector of Names
utils_check_args_distantia

Check Input Arguments of distantia()
utils_matrix_guide

Color Guide for Matrix Plot
utils_check_args_path

Checks Least Cost Path
utils_color_breaks

Auto Breaks for Matrix Plotting Functions
utils_digits

Number of Decimal Places
utils_drop_geometry

Removes Geometry Column from SF Data Frames
utils_check_args_tsl

Structural Check for Time Series Lists
utils_rescale_vector

Rescale Numeric Vector to a New Data Range
utils_time_keywords_translate

Translates The User's Time Keywords Into Valid Ones
utils_distantia_df_split

Split Dissimilarity Analysis Data Frames by Combinations of Arguments
utils_prepare_zoo_list

Convert List of Data Frames to List of Zoo Objects
utils_time_units

Data Frame with Supported Time Units
utils_prepare_matrix

Convert Matrix to Data Frame
utils_line_color

Handles Line Colors for Sequence Plots
utils_prepare_matrix_list

Convert List of Matrices to List of Data Frames
utils_line_guide

Guide for Time Series Plots
zoo_name_set

Set Name of a Zoo Time Series
zoo_smooth_window

Rolling Window Smoothing of Zoo Time Series
utils_matrix_plot

Plot Distance or Cost Matrix and Least Cost Path
utils_prepare_vector_list

Convert List of Vectors to List of Data Frames
utils_global_scaling_params

Global Centering and Scaling Parameters of Time Series Lists
utils_prepare_time

Handles Time Column in a List of Data Frames
utils_tsl_pairs

Data Frame with Pairs of Time Series in Time Series Lists
zoo_time

Get Time Features from Zoo Objects
utils_is_time

Title
utils_new_time

New Time for Time Series Aggregation
zoo_permute

Random or Restricted Permutation of Zoo Time Series
zoo_aggregate

Aggregate Cases in Zoo Time Series
zoo_vector_to_matrix

Coerce Coredata of Univariate Zoo Time Series to Matrix
zoo_to_tsl

Convert Individual Zoo Objects to Time Series List
utils_optimize_spline

Optimize Spline Models for Time Series Resampling
utils_optimize_loess

Optimize Loess Models for Time Series Resampling
zoo_resample

Resample Zoo Objects to a New Time
zoo_plot

Plot Zoo Time Series
zoo_name_clean

Clean Name of a Zoo Time Series
zoo_name_get

Get Name of a Zoo Time Series
utils_prepare_df

Convert Data Frame to a List of Data Frames
utils_time_keywords

Valid Aggregation Keywords
utils_time_keywords_dictionary

Dictionary of Time Keywords
zoo_simulate

Simulate a Zoo Time Series
zoo_smooth_exponential

Exponential Smoothing of Zoo Time Series
cities_coordinates

Coordinates of 100 Major Cities
color_discrete

Default Discrete Color Palettes
auto_sum_path_cpp

(C++) Sum Distances Between All Consecutive Samples in the Least Cost Path Between Two Time Series
color_continuous

Default Continuous Color Palette
auto_sum_cpp

(C++) Sum Distances Between Consecutive Samples in Two Time Series
cities_temperature

Long Term Monthly Temperature in 20 Major Cities
cost_matrix_diagonal_cpp

(C++) Compute Orthogonal and Diagonal Least Cost Matrix from a Distance Matrix
auto_sum_full_cpp

(C++) Sum Distances Between All Consecutive Samples in Two Time Series
albatross

Flight Path Time Series of Albatrosses in The Pacific
auto_distance_cpp

(C++) Sum Distances Between Consecutive Samples in a Time Series