genasis (version 1.0)

genoutlier: Identification and exclusion of outliers

Description

The genoutlier function identifies and excludes outlying (concentration) values according to the selected method and draws a plot of the outliers.

Usage

genoutlier(x, y=NA, input="openair", output=NA, method="lm3s", sides=2, pollutant=NA, plot=TRUE, columns=2, col.points="black", pch=1, xlab="Date", ylab="Concentration", main=NA)

Arguments

x
a vector of concentration values or a data frame of genasis/openair type. See 'Details' for a description of both data types.
y
a vector of measurement dates in the case of vector input only.
input
the type of data frame when x is a data frame. The allowed values are "openair" (default) and "genasis". For vector input, this argument has no effect.
output
the type of the output data frame. As with the input argument, both "openair" and "genasis" are available; the default is the value of input.
method
method used to determine the threshold(s). Allowed values are "m2s" and "m3s" for mean ± 2 or 3 standard deviations, "lm2s" and "lm3s" for the log-transformed variants, and "iqr2", "iqr4" and "iqr7" for interquartile distances. See 'Details' for a description of the methods.
sides
if sides=2 (default), both the lower and the upper threshold are used. If sides=1, only the upper threshold is applied.
pollutant
name(s) of the pollutant(s) for which outliers are to be found. Not necessary if x contains data for only one pollutant. If not specified, plots for all pollutants are drawn in a multi-plot arrangement.
plot
logical. Indicates whether the plot should be drawn.
columns
number of columns in the multi-plot arrangement.
col.points
color of the non-outlying points in the plot.
pch
plotting 'character', i.e., symbol to use. For more details see points.
xlab
the x label of the plot.
ylab
the y label of the plot.
main
overall title for the plot.
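
The two data-frame layouts accepted by x (described under 'Details') can be sketched in base R as follows. This is only an illustration of the column layout: the compound names and values are made up, the mapping of "date" to "date_start" is an assumption, and nothing here calls genoutlier itself.

```r
## Hypothetical illustration data; only the column layout follows the text.
dates <- as.Date("2013-01-01") + 0:4

## "openair" layout: a "date" column of class Date, followed by one numeric
## column per compound, named after the compound.
wide <- data.frame(date = dates,
                   compound_a = c(1.2, 1.5, 0.9, 1.1, 1.3),
                   compound_b = c(10.1, 9.8, 10.4, 10.0, 9.9))

## "genasis" layout: one row per measurement and six columns in fixed order
## (value, compound, start date, end date, temperature, wind).
## Assumption: the openair "date" is used as the start date here.
long <- data.frame(valu = c(wide$compound_a, wide$compound_b),
                   comp = rep(c("compound_a", "compound_b"), each = nrow(wide)),
                   date_start = rep(dates, 2),
                   date_end = rep(dates + 1, 2),
                   temp = NA_real_,   # placeholder, no temperature data
                   wind = NA_real_,   # placeholder, no wind data
                   stringsAsFactors = FALSE)

str(wide)
str(long)
```

The wide layout has one row per date, while the long layout has one row per date and compound, which is why the example above yields 5 and 10 rows respectively.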

Value

a list containing:
res
the data frame (or vector), in the format given by the output argument, with outlying values replaced by NA.
lower
numeric value of lower threshold
upper
numeric value of upper threshold
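
The shape of the returned list can be mimicked in base R for a plain numeric vector. The helper mask_outliers below is hypothetical and not part of genasis; it takes ready-made thresholds and only shows how values outside them could be replaced by NA and returned together with the thresholds. Whether genoutlier still reports a lower threshold when sides=1 is not stated in the documentation and is assumed here.

```r
## Hypothetical helper, not part of genasis: masks values outside the given
## thresholds and returns a list shaped like the documented value.
mask_outliers <- function(x, lower, upper, sides = 2) {
  out <- x > upper                          # above the upper threshold
  if (sides == 2) out <- out | (x < lower)  # optionally below the lower one
  out[is.na(out)] <- FALSE                  # existing NAs are left untouched
  res <- x
  res[out] <- NA
  list(res = res, lower = lower, upper = upper)
}

conc <- c(11.8, 12.1, 12.4, 11.9, 25.0, 12.0, 3.0)

r2 <- mask_outliers(conc, lower = 9, upper = 15)             # two-sided
r1 <- mask_outliers(conc, lower = 9, upper = 15, sides = 1)  # upper side only

r2$res   # both 25.0 and 3.0 are replaced by NA
r1$res   # only 25.0 is replaced by NA
```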

Details

The genoutlier function finds outlying (concentration) values according to the criterion given by the method and sides arguments and replaces them with NA.

The function recognises three input formats. Option input="openair" uses the "openair" format of data frame, with a first column named "date" of class "Date", optional columns named "date_end", "temp", "wind" and "note", and further columns of class "numeric" that contain concentration values and are named after the compounds. Option input="genasis" is used for a data frame with six columns "valu", "comp", "date_start", "date_end", "temp" and "wind", where the first, fifth and sixth are of class "numeric", the second is of class "character", and the third and fourth may be of either "character" or "Date" class. The column names for input="genasis" are not fixed; only their order is assumed. It is also possible to specify x and y as two vectors of equal length, the first of class "numeric" containing concentration values and the second of class "character" or "Date" containing measurement dates.

The output argument specifies the type of the result. Both data frame types, output="openair" and output="genasis", are available. The default is the value of the input argument, so vector output is possible only if x is of class "numeric" and output is not specified.

There are seven methods for setting the outlier thresholds. method="m3s" sets the lower threshold to the sample mean - 3 standard deviations and the upper threshold to the sample mean + 3 standard deviations. The variant method="m2s" works in the same way with 2 standard deviations. For log-normally distributed data, the variant method="lm3s" may work better; it sets the lower threshold to geometric mean / 3 geometric standard deviations and the upper threshold to geometric mean * 3 geometric standard deviations. Analogously, method="lm2s" uses 2 geometric standard deviations. The non-parametric variants "iqr2", "iqr4" and "iqr7" set the lower threshold to the 25th percentile - a * interquartile range and the upper threshold to the 75th percentile + a * interquartile range, with the parameter a equal to 0.5, 1.5 and 3 respectively (so the whole range between the thresholds is 2, 4 and 7 times the interquartile range).

The sides argument specifies whether outliers are excluded on one side or on both sides. With sides=2 (default), outliers both below the lower and above the upper threshold are excluded; with sides=1, only outliers above the upper threshold are excluded.
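
The threshold rules described above can be sketched in base R. The helper outlier_thresholds is hypothetical and is not the genasis source code. In particular, the treatment of "lm2s" and "lm3s" is an assumption: the mean ± k standard deviations rule is applied to log(x) and transformed back, which corresponds to dividing and multiplying the geometric mean by the k-th power of the geometric standard deviation. The quantiles use R's default quantile type, which may differ from what the package uses.

```r
## Hypothetical helper, not the genasis implementation: computes lower and
## upper thresholds for the seven documented method names.
outlier_thresholds <- function(x, method = "lm3s") {
  x <- x[!is.na(x)]
  if (method %in% c("m2s", "m3s")) {
    k <- if (method == "m2s") 2 else 3
    c(lower = mean(x) - k * sd(x), upper = mean(x) + k * sd(x))
  } else if (method %in% c("lm2s", "lm3s")) {
    ## Assumption: rule applied on the log scale and back-transformed.
    k <- if (method == "lm2s") 2 else 3
    lx <- log(x)
    c(lower = exp(mean(lx) - k * sd(lx)), upper = exp(mean(lx) + k * sd(lx)))
  } else if (method %in% c("iqr2", "iqr4", "iqr7")) {
    a <- c(iqr2 = 0.5, iqr4 = 1.5, iqr7 = 3)[[method]]
    q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartile
    iqr <- q[2] - q[1]
    c(lower = q[1] - a * iqr, upper = q[2] + a * iqr)
  } else {
    stop("unknown method: ", method)
  }
}

x <- 1:9                                  # quartiles 3 and 7, IQR 4
th4 <- outlier_thresholds(x, "iqr4")      # lower -3, upper 13
th7 <- outlier_thresholds(x, "iqr7")
unname(th7["upper"] - th7["lower"]) / 4   # whole range is 7 times the IQR
```

For the "iqr" variants the distance between the thresholds is (1 + 2a) times the interquartile range, which gives the factors 2, 4 and 7 behind the method names.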

See Also

genloq, genhistogram, genpastoact, genanaggr, genplot, genstatistic, gentransform, genwhisker

Examples

## Load the package that provides genoutlier and the example data:
library(genasis)

## Definition of simple data sources:
c1<-rnorm(100)+12
c2<-"random compound"
c3<-as.Date(as.Date("2013-01-01"):as.Date("2013-04-10"),
            origin="1970-01-01")
c4<-c3+1

sample_genasis<-data.frame(c1,c2,c3,c4)
sample_openair<-data.frame(c4,c1)
colnames(sample_openair)=c("date",c2)

## Examples of different usages:
genoutlier(sample_openair,input="openair",pollutant="random compound",
           method="m2s")
genoutlier(sample_genasis,input="genasis",method="m3s")

## Use of example data from the package:
data(kosetice.pas.openair)
genoutlier(genpastoact(kosetice.pas.openair[,1:8]),method="lm3s",
           main="Outliers",ylab="Concentration ngm-3")
genoutlier(kosetice.pas.openair[,c(1:4,23:26)],col.points="orange",
           method="lm3s")
data(kosetice.pas.genasis)
genoutlier(kosetice.pas.genasis[625:832,],input="genasis",
           method="lm2s",sides=1)
