This function creates an MD-plot for each variable of the data matrix. The MD-plot is a visualization for a boxplot-like shape of the PDF published in [Thrun et al., 2019]. It is an improvement of violin or so-called bean plots and possesses advantages in comparison to the conventional well-known box plot [Thrun et al., 2019].
A complete guide to the MDplot can be found at https://md-plot.readthedocs.io/en/latest/index.html.
MDplot(Data, Names, Ordering = 'Default', Scaling = "None", Fill = 'darkblue',
       RobustGaussian = TRUE, GaussianColor = 'magenta', Gaussian_lwd = 1.5,
       BoxPlot = FALSE, BoxColor = 'darkred', MDscaling = 'width',
       LineColor = 'black', LineSize = 0.01, MinimalAmoutOfData = 40,
       MinimalAmoutOfUniqueData = 12, SampleSize = 5e+05,
       SizeOfJitteredPoints = 1, OnlyPlotOutput = TRUE)
Data
[1:n,1:d] Numerical matrix containing the n cases of d variables. Each column is one variable. A data.frame is automatically transformed to a numerical matrix.
Names
Optional: [1:d] Names of the variables. If missing, the column names of Data are used.
Ordering
Optional: string, either Default, Columnwise, Alphabetical or Statistics. Please see details for explanation.
Scaling
Optional: string, either None (default), Percentalize, CompleteRobust, Robust or Log. Please see details for explanation.
Fill
Optional: string, color with which the MDs are filled.
RobustGaussian
Optional: If TRUE, each MD-plot of a variable is overlaid with a robustly estimated unimodal Gaussian distribution in the range of this variable, if statistical testing does not yield a significant p-value. In this case the packages moments, diptest and signal are required.
GaussianColor
Optional: string, color of the robustly estimated Gaussian, only for RobustGaussian=TRUE.
Gaussian_lwd
Optional: numerical, line width of the robustly estimated Gaussian, only for RobustGaussian=TRUE.
BoxPlot
Optional: If TRUE, each MD-plot is overlaid with a box-whisker diagram.
BoxColor
Optional: string, color of the box plot, only for BoxPlot=TRUE.
MDscaling
Optional: if "area", all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width" (default), all MDs have the same maximum width.
LineColor
Optional: string, color of the line around the mirrored densities. NA disables this feature, which is useful if one wants to avoid vertical lines leading to outliers.
LineSize
Optional: numerical, line width of the line around the mirrored densities.
MinimalAmoutOfData
Optional: numeric value defining a threshold. Below this threshold no density estimation is performed and a jitter plot with a median line is drawn. Only Data Science experts should change this value after they understand how the density is estimated (see [Ultsch, 2005]).
MinimalAmoutOfUniqueData
Optional: numeric value defining a threshold. Below this threshold no density estimation and no statistical testing is performed and a jitter plot is drawn. Only Data Science experts should change this value after they understand how the density is estimated (see [Ultsch, 2005]).
SampleSize
Optional: numeric value defining a threshold. Above this threshold, uniform sampling of finite cases is performed in order to shorten computation time. If rowr is not installed, uniform sampling of all cases is performed. If required, SampleSize=n can be set to omit this procedure.
SizeOfJitteredPoints
Optional: scalar. If not enough unique values for density estimation are given, data points are jittered. This parameter defines the size of the points.
OnlyPlotOutput
Optional: If TRUE (default), only a ggplot object is returned. If FALSE, the scaled data and the ordering are additionally returned in a list.
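The effect of the SampleSize threshold can be pictured with a small base-R sketch. The helper subsample_rows below is hypothetical and not part of the package. It only illustrates drawing a uniform sample of rows once the number of cases exceeds a threshold, it ignores the handling of non-finite cases, and the package's internal procedure may differ.

```r
# Hypothetical helper, not part of DataVisualizations: illustrates
# uniform row sampling once the number of cases exceeds a threshold.
subsample_rows <- function(Data, SampleSize = 5e+05) {
  Data <- as.matrix(Data)
  n <- nrow(Data)
  if (n <= SampleSize) {
    return(Data)  # small enough: keep every case
  }
  idx <- sample.int(n, size = SampleSize)  # uniform, without replacement
  Data[idx, , drop = FALSE]
}

set.seed(1)
x <- cbind(A = rnorm(1000), B = runif(1000))
small <- subsample_rows(x, SampleSize = 200)
dim(small)  # 200 rows, 2 columns
```

Setting the threshold to at least the number of rows leaves the data untouched, which mirrors the note that SampleSize=n omits the sampling step.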
In the default case of OnlyPlotOutput==TRUE: The ggplot object of the MD-plot.
Otherwise for OnlyPlotOutput==FALSE: A list of
The ggplot object of the MD-plot.
The ordering of columns of data defined by Ordering.
[1:n,1:d] matrix of ordered and scaled data defined by Ordering and Scaling.
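A minimal sketch of requesting the list output follows. It assumes the DataVisualizations package is installed and is guarded so that it does nothing otherwise. The names of the list elements are not stated above, so the sketch inspects the result with str() instead of assuming them.

```r
# Guarded so the snippet is a no-op when the package is not installed.
has_pkg <- requireNamespace("DataVisualizations", quietly = TRUE)

if (has_pkg) {
  set.seed(1)
  x <- cbind(A = runif(2000, 1, 5), B = rnorm(2000, 2.5, 1))

  # Columnwise ordering and no Gaussian overlay keep the call simple;
  # these argument choices are illustrative, not recommendations.
  res <- DataVisualizations::MDplot(
    x,
    Ordering = "Columnwise",
    Scaling = "Percentalize",
    RobustGaussian = FALSE,
    OnlyPlotOutput = FALSE
  )

  # Inspect the top-level structure rather than assuming element names.
  str(res, max.level = 1)
}
```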
In short, the MD-plot can be described as a PDE optimized violin plot. The Pareto Density Estimation (PDE) is an approach to estimate the probability density function (pdf) [Ultsch, 2005].
The MD-plot was used in [Thrun et al., 2018b] for the evaluation of stochastic clustering methods and in [Thrun et al., 2018a] in order to simultaneously estimate variances of a high-dimensional data set. The MD-plot is in the process of being peer-reviewed [Thrun/Ultsch, 2019].
Statistical testing is performed with dip.test and agostino.test.
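The two tests can also be run directly on a single variable. The sketch below assumes the packages diptest and moments are installed (it is skipped otherwise) and uses simulated data. How MDplot combines the two p-values internally is not shown here.

```r
# Skipped entirely when diptest or moments is missing.
has_tests <- requireNamespace("diptest", quietly = TRUE) &&
  requireNamespace("moments", quietly = TRUE)

if (has_tests) {
  set.seed(1)
  unimodal <- rnorm(2000, 0, 1)
  bimodal <- c(rnorm(1000, 0, 1), rnorm(1000, 5, 1))

  # Hartigans' dip test: null hypothesis of unimodality
  dip_uni <- diptest::dip.test(unimodal)$p.value
  dip_bi <- diptest::dip.test(bimodal)$p.value

  # D'Agostino test: null hypothesis of zero skewness
  # (agostino.test accepts sample sizes between 8 and 46340)
  skew_uni <- moments::agostino.test(unimodal)$p.value

  print(c(dip_unimodal = dip_uni, dip_bimodal = dip_bi, skew_unimodal = skew_uni))
}
```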
For the parameter Ordering the following options are possible:
Default
Ordering of plots by convex/concave/unimodal/non-unimodal shapes. In this case the package signal is required.
Columnwise
Ordering of plots by the order of columns of Data.
Alphabetical
Ordering of plots by the order of columns of Data sorted in alphabetical order by column names.
Statistics
Ordering of plots depending on the logarithm of the p-values of statistical testing. In this case the packages moments, diptest and signal are required.
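The two simplest options, Columnwise and Alphabetical, can be sketched in base R on a small matrix. This only illustrates the resulting column order and is not the package's internal code.

```r
# Small matrix whose columns are deliberately not in alphabetical order
x <- cbind(C = c(1, 2, 3), A = c(4, 5, 6), B = c(7, 8, 9))

# Columnwise: keep the columns as they appear in the data
columnwise <- colnames(x)

# Alphabetical: reorder the columns by their names
x_alpha <- x[, order(colnames(x)), drop = FALSE]

columnwise          # "C" "A" "B"
colnames(x_alpha)   # "A" "B" "C"
```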
For the parameter Scaling the following options are possible:
None
No scaling of data is done.
Percentalize
Data is scaled between zero and 100.
CompleteRobust
Data is first robustly scaled between zero and 1, then centered to zero, and outliers are capped by a robust formula described in the DatabionicSwarm package.
Robust
Data is robustly scaled between zero and 1 by a formula described in the DatabionicSwarm package.
Log
Data is transformed with a signed log, allowing negative values to be transformed with a logarithm of base 10; please see SignedLog for details.
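Two of these options are simple enough to sketch in base R. The helpers percentalize and signed_log10 below are hypothetical illustrations, not the package's functions. In particular, the +1 offset inside the logarithm is an assumption made so that zero maps to zero, and the package's SignedLog may be defined differently.

```r
# Hypothetical helper: min-max scaling of a numeric vector to [0, 100]
percentalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * 100
}

# Hypothetical helper: base-10 log that keeps the sign of the input.
# The +1 offset is an assumption, chosen so that 0 maps to 0.
signed_log10 <- function(x) {
  sign(x) * log10(abs(x) + 1)
}

x <- c(-99, -9, 0, 9, 99)
percentalize(x)   # smallest value maps to 0, largest to 100
signed_log10(x)   # -2 -1 0 1 2
```

The robust variants are not sketched here because their formulas are described in the DatabionicSwarm package rather than on this page.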
[Ultsch, 2005] Ultsch, A.: Pareto density estimation: A density estimation for knowledge discovery, in Baier, D.; Wernecke, K. D. (Eds), Innovations in classification, data science, and information systems, Proc. GfKl 2003, pp. 91-100, Springer, Berlin, 2005.
[Thrun et al., 2018a] Thrun, M. C., Breuer, L., & Ultsch, A.: Knowledge discovery from low-frequency stream nitrate concentrations: hydrology and biology contributions, Proc. European Conference on Data Analysis (ECDA), pp. 46-47, Paderborn, Germany, 2018.
[Thrun et al., 2018b] Thrun, M. C., Pape, F., & Ultsch, A.: Benchmarking Cluster Analysis Methods using PDE-Optimized Violin Plots, Proc. European Conference on Data Analysis (ECDA), p. 26, Paderborn, Germany, 2018.
[Thrun et al., 2019] Thrun, M. C., Gehlert, T., & Ultsch, A.: Analyzing the Fine Structure of Distributions, arXiv:1908.06081, 2019.
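library(DataVisualizations)
# Four variables with different distribution shapes
x <- cbind(
  A = runif(20000, 1, 5),                           # uniform
  B = c(rnorm(10000, 0, 1), rnorm(10000, 2.6, 1)),  # bimodal Gaussian mixture
  C = rnorm(20000, 2.5, 1),                         # unimodal Gaussian
  D = rpois(20000, 5)                               # discrete Poisson counts
)
MDplot(x)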