read.ids: Read Site-Tree-Core IDs

Description

These functions try to read site, tree, and core IDs from a rwl data.frame.

Usage

read.ids(rwl, stc = c(3, 2, 3), ignore.site.case = FALSE,
         ignore.case = FALSE, fix.typos = FALSE, typo.ratio = 5,
         use.cor = TRUE)

autoread.ids(rwl, ignore.site.case = TRUE, ignore.case = "auto",
             fix.typos = TRUE, typo.ratio = 5, use.cor = TRUE)

Arguments

rwl

a data.frame with series as columns and years as rows such as that produced by read.rwl or ca533

stc

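a vector of three integral values or character string "auto". The numbers indicate the number of characters to split the site code (stc[1]), the tree IDs (stc[2]), and the core IDs (stc[3]). Defaults to c(3, 2, 3). If "auto", the function tries to detect the split automatically. See Details.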

use.cor

a logical flag. If TRUE and stc is "auto", correlation clustering may be used for determining the length of the tree and core parts. See Details.

ignore.site.case

a logical flag. If TRUE, the function does not distinguish between upper case and lower case letters in the site part of the series names.

ignore.case

a logical flag or "auto". If TRUE, the function does not distinguish between upper case and lower case letters in the tree / core part of the series names. The default in read.ids is FALSE; in autoread.ids it is "auto". See Details for the effect of "auto".

fix.typos

a logical flag. If TRUE, the function will try to detect and fix typing errors.

typo.ratio

a numeric value larger than 1, affecting the eagerness of the function to fix typing errors. The default is 5. See Details.

Value

A data.frame with column one named "tree" giving an ID for each tree and column two named "core" giving an ID for each core. The original series IDs are copied from rwl as rownames. The order of the rows in the output matches the order of the series in rwl. If more than one site is detected, an additional third column named "site" will contain a site ID. All columns have integral valued numeric values.
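
When the naming scheme cannot be read by machine, a data.frame in the format described above can be put together by hand. The following is a minimal sketch; the series names and the tree / core assignments are made up for illustration.

```r
# Hypothetical series names; for real data use names(rwl)
series <- c("ABC011", "ABC012", "ABC021")

# Tree and core IDs assigned by hand, one entry per series,
# with the series names stored as row names
ids <- data.frame(tree = c(1, 1, 2),
                  core = c(1, 2, 1),
                  row.names = series)
ids
```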

Details

Because dendrochronologists often take more than one core per tree, it is occasionally useful to calculate within vs. between tree variance. The International Tree Ring Data Bank (ITRDB) allows the first eight characters in an rwl file for series IDs, but these are often shorter. Typically the creators of rwl files use a logical labeling method that allows the user to determine the tree and core ID from the label.

Argument stc tells how each series name separates into site, tree, and core IDs. For instance, a series code might be "ABC011", indicating site "ABC", tree 1, core 1. If this format is consistent, then the stc mask would be c(3, 2, 3), allowing up to three characters for the core ID (i.e., pad to the right). If it is not possible to define the scheme (and often it is not possible to machine read IDs), then the output data.frame can be built manually. See Value for format.

The function autoread.ids is a wrapper to read.ids with stc = "auto", i.e. automatic detection of the site / tree / core scheme, and different default values of some parameters. In automatic mode, the names in the same rwl can even follow different site / tree / core schemes. As there are numerous possible encoding schemes for naming measurement series, the function cannot always produce the correct result.

With stc = "auto", the site part can be one of the following.

In names mostly consisting of numbers, the longest common prefix is the site part
Alphanumeric site part ending with alphabet, when followed by numbers and alphabets
Alphabetic site part (quite complicated actual definition). Setting ignore.case to "auto" allows the function to try to guess when a case change in the middle of a sequence of alphabets signifies a boundary between the site part and the tree part.
The characters before the first sequence of space / punctuation characters in a name that contains at least two such sequences
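
As an illustration of the first scheme in the list (longest common prefix of mostly numeric names), the idea can be sketched in base R. The helper common_prefix and the series names are hypothetical, and the sketch does not reproduce the regular expressions used inside read.ids.

```r
# Longest common prefix of a character vector
# (illustrative helper, not part of dplR)
common_prefix <- function(x) {
  n <- min(nchar(x))
  k <- 0
  # Extend the prefix while every name has the same character at position k + 1
  while (k < n && length(unique(substr(x, k + 1, k + 1))) == 1) {
    k <- k + 1
  }
  substr(x[1], 1, k)
}

# Hypothetical all-digit series names
common_prefix(c("77011", "77012", "77021", "77101"))
```

A plain common prefix can swallow leading characters of the tree part when all trees happen to share them, which is one reason automatic detection cannot always be correct.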

These descriptions are somewhat general, and the details can be found in regular expressions inside the function. If a name does not match any of the descriptions, it is matched against a previously found site part, starting from the longest.

The following ID schemes are detected and supported in the tree / core part. The detection is done per site.

Numbers in tree part, core part starts with something else
Alphabets in tree part, core part starts with something else
Alphabets, either tree part all lower case and core part all upper case or vice versa. For this to work, ignore.case must be set to "auto" or FALSE.
All digits. In this case, the number of characters belonging to the tree and core parts is detected with one of the following methods.
- If numeric tree parts were found before, it is assumed that the core part is missing (one core per tree).
- If the series are numbered continuously, one core per tree is assumed.
- Otherwise, try to find a core part as the suffix so that the cores are numbered continuously.
If none of the above fits, the tree / core split of the all-digit names will be decided with the methods described further down the list, or finally with the fallback mechanism.
The combined tree / core part is empty or one character. In this case, the core part is assumed to be missing.
Tree and core parts separated by a punctuation or white space character
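
The first scheme in the list (numbers in the tree part, core part starting with something else) can be sketched with two regular expressions. The strings below are hypothetical tree / core parts with the site part already removed, and the patterns are an assumption made for illustration, not the ones used by read.ids.

```r
# Hypothetical tree / core strings (site part already stripped)
tc <- c("01A", "01B", "12A", "12b")

# Leading run of digits is taken as the tree part
tree <- sub("^([0-9]+).*$", "\\1", tc)
# Whatever follows the leading digits is taken as the core part
core <- sub("^[0-9]+", "", tc)

data.frame(name = tc, tree = tree, core = core)
```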

If the split of a tree / core part cannot be found with any of the methods described above, the prefix of the string is matched against a previously found tree part, starting from the longest.

The fallback mechanism for the still undecided tree / core parts is one of the following. The first one is used if use.cor is TRUE, the second one if it is FALSE.

Pairwise correlation coefficients are computed between all remaining series. Pairs of series with above median correlation are flagged as similar, and the other pairs are flagged as dissimilar. Each possible number of characters (minimum 1) is considered for the share of the tree ID. The corresponding unique would-be tree IDs determine a set of clusterings where one cluster is formed by all the measurement series of a single tree. For each clustering (allocation of characters), an agreement score is computed. The agreement score is defined as the sum of the number of similar pairs with matching cluster number and the number of dissimilar pairs with non-matching cluster number. The number of characters with the maximum agreement is chosen.
If the majority of the names in the site use k characters for the tree part, that number is chosen. Otherwise, one core per tree is assumed. Parameter typo.ratio has a double meaning as it also defines what is meant by majority here: at least typo.ratio / (typo.ratio + 1) * n.tot, where n.tot is the number of names in the site.
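
The agreement score of the correlation-based fallback can be sketched on simulated data. Everything below (the simulated series, the names in tc, the helper agreement) is an assumption made for illustration; it follows the description above but is not the code used inside read.ids.

```r
set.seed(1)

# Simulated data: 3 hypothetical trees, 2 cores each; cores of the same
# tree share a common signal plus a little independent noise
n_years <- 50
tree_signal <- matrix(rnorm(n_years * 3), n_years, 3)
tc <- c("011", "012", "021", "022", "031", "032")  # tree / core strings
tree_of <- c(1, 1, 2, 2, 3, 3)
series <- tree_signal[, tree_of] +
  matrix(rnorm(n_years * 6, sd = 0.3), n_years, 6)
colnames(series) <- tc

# One correlation per unordered pair of series
cors <- cor(series)
pairs <- which(upper.tri(cors), arr.ind = TRUE)
pair_cor <- cors[pairs]

# Above-median pairs are "similar", the rest "dissimilar"
similar <- pair_cor > median(pair_cor)

# Agreement score when the first k characters form the tree ID
agreement <- function(k) {
  cluster <- substr(tc, 1, k)
  same <- cluster[pairs[, 1]] == cluster[pairs[, 2]]
  sum(similar & same) + sum(!similar & !same)
}

# Try every possible allocation and keep the one with maximum agreement
scores <- vapply(1:3, agreement, integer(1))
best_k <- which.max(scores)
scores
best_k
```

In this simulation the two-character allocation should score highest, because it is the only one whose clusters coincide with the simulated trees.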

In both fallback mechanisms, the number of characters allocated for the tree part will be increased until all trees have a non-zero ID or there are no more characters.

Suspected typing errors will be fixed by the function if fix.typos is TRUE. The parameter typo.ratio affects the eagerness to fix typos, i.e. the number of counterexamples required to declare a typo. Two main typo fixing mechanisms are implemented.

The function attempts to convert the tree and core substrings to integral values. When this succeeds, the converted values are copied to the output without modification. When non-integral substrings are observed, each unique tree is assigned a unique integral value. The same applies to cores within a tree, but there are some subtleties with respect to the handling of duplicates. Substrings are sorted before assigning the numeric IDs. The order of columns in rwl, in most cases, does not affect the tree and core IDs assigned to each series.
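
The conversion of substrings to integral IDs can be sketched as follows. The helper to_ids is hypothetical and deliberately ignores the subtleties with duplicates mentioned above; it only shows the two main cases (digits copied as numbers, other strings numbered in sorted order).

```r
# Convert tree or core substrings to integral IDs
# (illustrative helper, not part of dplR)
to_ids <- function(x) {
  if (all(grepl("^[0-9]+$", x))) {
    # All substrings are digits: use the numbers as they are
    as.numeric(x)
  } else {
    # Otherwise number the sorted unique substrings 1, 2, ...
    match(x, sort(unique(x)))
  }
}

to_ids(c("01", "02", "10"))
to_ids(c("B", "A", "B", "C"))
```

Because the unique substrings are sorted before numbering, the second call gives the same ID to "B" regardless of where it appears in the input.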

Examples

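library(dplR)
data(ca533)
# Fixed split: 3 site, 2 tree, and up to 3 core characters
read.ids(ca533, stc = c(3, 2, 3))
# Automatic detection of the site / tree / core scheme
autoread.ids(ca533)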
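
To see what a fixed mask such as stc = c(3, 2, 3) means for a single series name, the split can be sketched in base R. The helper split_by_mask is hypothetical and only mimics the fixed-width cut described in Details; it does not convert the parts to numeric IDs.

```r
# Cut a series name into site, tree, and core parts by a width mask
# (illustrative helper, not part of dplR)
split_by_mask <- function(id, stc = c(3, 2, 3)) {
  ends <- cumsum(stc)          # last character of each part
  starts <- ends - stc + 1     # first character of each part
  parts <- substring(id, starts, ends)
  names(parts) <- c("site", "tree", "core")
  parts
}

# The core part may be shorter than its allotted width
split_by_mask("ABC011")
```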