interpollen: Interpolation of Missing Data in a Pollen Database by Different Methods

Description

Replaces, in a single call, all missing data in a historical database of several pollen types, using one of several interpolation methods.

Usage

interpollen(data, method = "lineal", maxdays = 30, plot = TRUE,
  factor = 2, ndays = 3, spar = 0.5, data2 = NULL, data3 = NULL,
  data4 = NULL, data5 = NULL, mincorr = 0.6, result = "wide")

Arguments

data

A data.frame object containing the general database on which the interpolation must be performed. This data.frame must include a first column in Date format and the remaining columns in numeric format. Each column must contain information on one pollen type. It is not necessary to insert missing gaps; the function will automatically detect them.

method

A character string specifying the method applied to calculate and generate the missing pollen data. The implemented methods are: "lineal", "movingmean", "spline", "tseries" or "neighbour". See Details for a fuller description of each method. The method argument will be "lineal" by default.

maxdays

A numeric (integer) value specifying the maximum number of consecutive days with missing data that the algorithm will interpolate. If a gap is bigger than this value, the gap will not be interpolated. Not valid with the "tseries" method. The maxdays argument will be 30 by default.

plot

A logical argument. If TRUE, graphical previews of the input database will be plotted at the end of the interpolation process. All the interpolated gaps will be marked in red. The plot argument will be TRUE by default.

factor

A numeric (integer) value bigger than 1. Only valid if the "movingmean" method is chosen. The argument specifies the factor by which the gap size is multiplied to establish the window of the moving mean used to fill the gap. See Details for more information about the factor. The factor argument will be 2 by default.

ndays

A numeric (integer) value bigger than 1. Only valid if the "spline" method is chosen. Specifies the number of days on each side of the gap that are used to perform the spline regression. The ndays argument will be 3 by default.

spar

A numeric (double) value between 0 and 1 specifying the degree of smoothness of the spline regression adjustment. The smoother the adjustment, the more data are treated as outliers by the spline regression. Only valid if the "spline" method is chosen. The spar argument will be 0.5 by default.

data2, data3, data4, data5

Each one is a data.frame object containing the database of a neighbour pollen station, which will be used to interpolate missing data in the target station. Only valid if the "neighbour" method is chosen. Each data.frame must include a first column in Date format and the remaining columns in numeric format, one pollen type per column. It is not necessary to insert the missing gaps; the function will automatically detect them. The arguments will be NULL by default.

mincorr

A numeric (double) value between 0 and 1. It specifies the minimum correlation coefficient (Spearman correlation) that a neighbour station must have with the target station to be taken into account for the interpolation. Only valid if the "neighbour" method is chosen. The mincorr argument will be 0.6 by default.

result

A character string specifying the format of the resulting data.frame. Either "wide" or "long". The result argument will be "wide" by default.
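As an illustration of the input layout described under data, the following sketch builds a small data.frame with a Date first column and one numeric column per pollen type. The column names, dates and values are hypothetical and chosen only for this example; the commented calls at the end show how the arguments above might be combined and assume the package providing interpollen is attached.

```r
# Hypothetical input: first column of class Date, remaining columns numeric,
# one column per pollen type. Names and values are illustrative only.
dates <- seq(as.Date("2020-04-01"), as.Date("2020-04-30"), by = "day")
pollen <- data.frame(
  date    = dates,
  Betula  = c(5, 8, 12, 20, 35, 50, 42, 30, 22, 15, rep(10, 20)),
  Poaceae = as.numeric(1:30)
)

# Remove two days completely. According to the description of `data`,
# gaps do not have to be inserted as NA rows beforehand.
pollen <- pollen[!(pollen$date %in% as.Date(c("2020-04-12", "2020-04-13"))), ]

# Basic structure checks mirroring the requirements stated for `data`.
stopifnot(inherits(pollen[[1]], "Date"))
stopifnot(all(vapply(pollen[-1], is.numeric, logical(1))))

# One way to see which days are absent (not necessarily how the
# function detects them internally).
all_days <- seq(min(pollen$date), max(pollen$date), by = "day")
missing_days <- all_days[!(all_days %in% pollen$date)]
print(missing_days)

# Possible calls (not run here):
# interpollen(pollen, method = "spline", ndays = 3, spar = 0.5, plot = FALSE)
# interpollen(pollen, method = "movingmean", factor = 2, maxdays = 10,
#             result = "long", plot = FALSE)
```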

Value

This function returns different results:

If result = "wide", returns a data.frame including the original data and completed with the interpolated data.
If result = "long", returns a data.frame containing your data in long format (the first column for date, the second for pollen type, the third for concentration and an additional fourth column with 1 if the value has been interpolated or 0 if not).
If plot = TRUE, plots with daily values for each year and pollen type are drawn in the active graphics window. Interpolated values are marked in red. If the method argument is "tseries", the seasonality is also drawn in grey.
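To make the difference between the two result formats concrete, the sketch below reshapes a small wide table into the four-column long layout described above, using base R only. The column names (date, type, value, interpolated) and the data are assumptions made for this example; the names used in the function's actual output may differ.

```r
# Hypothetical wide table before interpolation (NA marks a missing value).
wide_raw <- data.frame(
  date    = as.Date("2020-04-01") + 0:3,
  Betula  = c(5, NA, NA, 20),
  Poaceae = c(1, 2, 3, 4)
)

# The same table after a gap fill (values here follow a straight line).
wide_filled <- wide_raw
wide_filled$Betula <- c(5, 10, 15, 20)

# Long layout: date, pollen type, concentration, and a 0/1 flag that is 1
# where the value was missing in the original table.
types <- names(wide_filled)[-1]
long <- data.frame(
  date         = rep(wide_filled$date, times = length(types)),
  type         = rep(types, each = nrow(wide_filled)),
  value        = unlist(wide_filled[types], use.names = FALSE),
  interpolated = as.integer(is.na(unlist(wide_raw[types], use.names = FALSE)))
)
print(long)
```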

Details

This function interpolates missing data in a pollen database using five different methods, which are described below. Interpolation for each pollen type will be automatically done for gaps smaller than the "maxdays" argument.

"lineal" method. The interpolation will be carried out by tracing a straight line between the gap extremes.
"movingmean" method. It calculates the moving mean of the daily pollen concentrations with a window size equal to the gap size multiplied by the factor argument, and replaces the missing data with the moving mean for those days. The window is dynamic: for each gap in the database, the window size of the moving mean changes depending on the size of that gap.
"spline" method. The interpolation will be carried out by performing a spline regression with the days preceding and following the gap. The number of days on each side of the gap that are taken into account for the spline regression is specified by the ndays argument. The smoothness of the spline regression adjustment can be specified by the spar argument.
"tseries" method. The interpolation will be carried out by analysing the time series of the pollen database. It performs a seasonal-trend decomposition based on LOESS (Cleveland et al., 1990). The seasonality of the historical database is extracted and used to predict the missing data by performing a linear regression with the target year.
"neighbour" method. Nearby stations provided by the user are used to interpolate the missing data of the target station. First, a Spearman correlation is calculated between the target station and each neighbour station, and neighbour stations with a correlation coefficient smaller than the mincorr value are discarded. For each gap, a linear regression is performed between the neighbour stations and the target station to determine the equation that converts the pollen concentrations of the neighbour stations into the pollen concentration of the target station. Only neighbour stations without any missing data during the gap period are taken into account for each gap.
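The ideas behind three of these methods can be sketched in base R. This is not the package's implementation: the helper names fill_linear, fill_moving_mean and fill_from_neighbour are hypothetical, the moving-mean window is only roughly gap size times factor (it is centred on each missing day, which is an assumption), and the neighbour sketch uses a single neighbour station and one regression over the whole series instead of one per gap.

```r
# Hypothetical daily series with two gaps, plus one neighbour station.
x <- c(10, 12, NA, NA, 18, 20, 22, NA, 26, 28)
neighbour <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14)

# "lineal" idea: straight line between the observed values around each gap.
fill_linear <- function(x) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  # approx() interpolates linearly between known points; positions outside
  # the observed range stay NA (rule = 1).
  approx(idx[ok], x[ok], xout = idx, method = "linear", rule = 1)$y
}

# "movingmean" idea: mean of observed values in a window whose width grows
# with the gap length (half-width = ceiling(gap length * factor / 2)).
fill_moving_mean <- function(x, factor = 2) {
  out <- x
  runs <- rle(is.na(x))
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (k in which(runs$values)) {
    half <- ceiling(runs$lengths[k] * factor / 2)
    for (i in starts[k]:ends[k]) {
      win <- max(1, i - half):min(length(x), i + half)
      out[i] <- mean(x[win], na.rm = TRUE)  # uses original values only
    }
  }
  out
}

# "neighbour" idea: keep the neighbour only if its Spearman correlation with
# the target reaches mincorr, then predict the gaps from a linear regression.
fill_from_neighbour <- function(target, neighbour, mincorr = 0.6) {
  rho <- cor(target, neighbour, method = "spearman", use = "complete.obs")
  if (is.na(rho) || rho < mincorr) return(target)
  fit <- lm(target ~ neighbour)  # rows with NA are dropped by default
  gap <- is.na(target) & !is.na(neighbour)
  target[gap] <- predict(fit, newdata = data.frame(neighbour = neighbour[gap]))
  target
}

linear_fill    <- fill_linear(x)
mm_fill        <- fill_moving_mean(x, factor = 2)
neighbour_fill <- fill_from_neighbour(x, neighbour, mincorr = 0.6)
print(rbind(linear_fill, mm_fill, neighbour_fill))
```

In this toy series the target is exactly twice the neighbour, so the linear and neighbour fills coincide; with real data the methods will generally give different values.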

References

Cleveland RB, Cleveland WS, McRae JE, Terpenning I (1990) STL: a seasonal-trend decomposition procedure based on loess. J Off Stat 6(1):3-33.

Examples

# NOT RUN {
data("munich_pollen")
interpollen(munich_pollen, method = "lineal", plot = FALSE)
# }
