Learn R Programming

SHARK4R (version 1.0.2)

get_shark_statistics: Summarize numeric SHARK parameters with ranges and outlier thresholds

Description

Downloads SHARK data for a given time period, filters to numeric parameters, and calculates descriptive statistics and Tukey outlier thresholds.

Usage

get_shark_statistics(
  fromYear = NULL,
  toYear = NULL,
  datatype = NULL,
  group_col = NULL,
  min_obs = 3,
  max_non_numeric_frac = 0.05,
  cache_result = FALSE,
  prod = TRUE,
  utv = FALSE,
  verbose = TRUE
)

Value

A tibble with one row per parameter (and optionally per group) and the following columns:

parameter

Parameter name (character).

datatype

SHARK datatype (character).

min, Q1, median, Q3, max

Observed quantiles.

P01, P05, P95, P99

1st, 5th, 95th and 99th percentiles.

IQR

Interquartile range.

mean

Arithmetic mean of numeric values.

sd

Standard deviation of numeric values.

var

Variance of numeric values.

cv

Coefficient of variation (sd / mean).

mad

Median absolute deviation.

mild_lower, mild_upper

Lower/upper bounds for mild outliers (1.5 × IQR).

extreme_lower, extreme_upper

Lower/upper bounds for extreme outliers (3 × IQR).

n

Number of numeric observations used.

fromYear

First year included in the SHARK data download (numeric).

toYear

Last year included in the SHARK data download (numeric).

<group_col>

Optional grouping column if provided.

Arguments

fromYear

Start year for download (numeric). Defaults to 5 years before the last complete year.

toYear

End year for download (numeric). Defaults to the last complete year.

datatype

Optional, one or more datatypes to filter on (e.g. "Bacterioplankton"). If NULL, all datatypes are included.

group_col

Optional column name in the SHARK data to group by (e.g. "station_name"). If provided, statistics will be computed separately for each group. Default is NULL.

min_obs

Minimum number of numeric observations required for a parameter to be included (default: 3).

max_non_numeric_frac

Maximum allowed fraction of non-numeric values for a parameter to be kept (default: 0.05).

cache_result

Logical, whether to save the result in a persistent cache (statistics.rds) for use by other functions. Default is FALSE.

prod

Logical, whether to download from the production (TRUE, default) or test (FALSE) SHARK server. Ignored if utv is TRUE.

utv

Logical. Select UTV server when TRUE.

verbose

Logical, whether to show download progress messages. Default is TRUE.

Details

By default, the function uses the previous five complete years. For example, if called in 2025 it will use data from 2020–2024.

Examples

Run this code
# \donttest{
# Uses previous 5 years automatically, Chlorophyll data only
res <- get_shark_statistics(datatype = "Chlorophyll", verbose = FALSE)
print(res)

# Group by station name and save result in persistent cache
res_station <- get_shark_statistics(datatype = "Chlorophyll",
                                    group_col = "station_name",
                                    cache_result = TRUE,
                                    verbose = FALSE)
print(res_station)
# }

Run the code above in your browser using DataLab