Continuous data
If \(X\) is a continuous random variable with cumulative distribution function (CDF)
\(F\), then the probability integral transform (PIT) \(F(X)\) is uniformly distributed
between 0 and 1. Given a vector \(x_{1}, \dots, x_{n}\) of realisations of \(X\),
get_pit
produces an estimate \(\hat{F}\) of the CDF \(F\), and returns a
vector of PIT values corresponding to another set of realisations \(z_{1}, \dots, z_{N}\),
$$\hat{F}(z_{1}), \dots, \hat{F}(z_{n}).$$
x_ref
contains the values \(x_{1}, \dots, x_{n}\) from which the CDF estimate
\(\hat{F}\) is obtained. x_new
contains the values \(z_{1}, \dots, z_{n}\)
from which the PIT values \(\hat{F}(z_{1}), \dots, \hat{F}(z_{n})\) are calculated.
By default, x_ref
and x_new
are the same, so that the PIT values are
calculated in-sample.
To estimate the distribution, get_pit
calls fit_dist
. The arguments
dist
, method
and n_thres
are
documented in detail in the corresponding help page.
To check that the chosen distribution adequately fits the data, the argument
return_fit = TRUE
can be used to return the estimated parameters of the
distribution, as well as properties of the fit such as the AIC and a p-value
for the Kolmogorov-Smirnov goodness-of-fit test.
Non-stationary distributions
The estimated distribution can also be non-stationary, by depending on some predictor variables or covariates.
These predictors can be included via the arguments preds_ref
and preds_new
,
which should be data frames with a separate column for each predictor, and with
numbers of rows equal to the lengths of x_ref
and x_new
, respectively.
In this case, a Generalized Additive Model for Location, Scale, and Shape (GAMLSS) is
fit to x_ref
using the predictors in preds_ref
.
The PIT values corresponding to x_new
are then calculated by applying the
estimated distribution with predictors preds_new
.
If a non-stationary distribution is to be estimated, both preds_ref
and
preds_new
must be provided. By default, preds_new
is assumed to be
the same as preds_ref
, to align with x_new
being the same as x_ref
.
Censored data
If the random variable \(X\) is not continuous, the PIT will not be uniformly distributed.
A relevant case is when \(X\) is censored. For example, precipitation is censored below
at zero. This results in several PIT values being equal to \(F(0)\). The lower
and upper
arguments to get_pit
allow the user to specify the lower and upper
bounds at which the data is censored; the default is lower = -Inf
and upper = Inf
,
i.e. there is no censoring.
If the PIT values are used to construct standardised indices, this censoring can lead to
unintuitive index values.
To deal with censored data, it has been proposed to map the PIT values of the censored
values to a different constant \(c\); see references. For example, for precipitation, the PIT
values would become
$$F(X) \quad \text{if} \quad X > 0,$$
$$c \quad \text{if} \quad X = 0.$$
The constant \(c\) can be chosen so that the PIT values satisfy some desired property.
For example, if \(F(X)\) is uniformly distributed between 0 and 1, then it has mean equal
to \(1/2\). Hence, \(c\) could be chosen such that the mean of the PIT values of the
censored distribution are equal to \(1/2\).
Alternatively, if \(F(X)\) is uniformly distributed between 0 and 1, then
the transformed PIT value \(\Phi^{-1}(F(X))\) (where \(\Phi^{-1}\) is the quantile function
of the standard normal distribution) follows a standard normal distribution, and therefore
has mean equal to 0. The constant \(c\) could therefore be chosen such that the mean of the
transformed PIT values of the censored distribution are equal to 0.
The argument cens
in get_pit
can be used to treat censored data. cens
can be one of four options: a single numeric value containing the value \(c\) at which to
assign the PIT values of the censored realisations; the string 'none'
if no censoring
is to be performed; the string 'prob'
if \(c\)
is to be chosen automatically so that the mean of the PIT values is equal to \(1/2\);
or the string 'normal'
if \(c\) is to be chosen automatically so that the mean of
the transformed PIT values is equal to 0. If the data is censored both above and below,
then cens
must be a numeric vector of length two, specifying the values to assign
the realisations that are censored both below and above.
When the data is censored, dist
corresponds to the distribution used to estimate the
uncensored realisations, e.g. positive precipitations. The probability of being at the boundary
points is estimated using the relative frequency of censored observations in x_ref
.