incadata (version 0.6.1)

as.incadata: Identify data formats used by INCA and Rockan

Description

Coerce data of any form to its relevant type as identified either by column/vector names or by variable content and convert all variable names to lower case.

Usage

as.incadata(x, ...)

is.incadata(x)

# S3 method for data.frame
as.incadata(x, decode = TRUE, id = TRUE, ask = TRUE, ...)

# S3 method for default
as.incadata(x, n_i = NULL, ...)

Arguments

x

data

...

arguments passed to exceed_threshold (the most useful are probably "threshold" and "force"; see the "interactive use" section below)

decode

Should decode be applied to variables with identified variable names? (TRUE by default).

id

Should an id-column be added (see id)?

ask

ask for input if unsure how to coerce variables (see the "interactive use" section below)

n_i

used internally between methods (should not be set by the user)

Value

as.incadata.data.frame

object of class incadata based on the "tibble"-class used within the "tidyverse" with all variables possibly coerced as described above.

as.incadata.default

input vector coerced to relevant class

is.incadata

TRUE for objects of class incadata, otherwise FALSE

factors

Note that the incadata format does not include factors. Factors can be really useful for some applications, but our philosophy is that they should be explicitly stated as such when needed. Otherwise, factor levels are commonly created from just the responses present in a certain data set. These might or might not form a complete list of the possible alternatives for an INCA variable with a fixed value set.
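The point about incomplete factor levels can be illustrated with base R alone. This is a minimal sketch; the value set "1", "2", "3" and the observed responses are invented for illustration and do not come from any INCA register.

```r
# Responses observed in one hypothetical extract. The full value set
# ("1", "2", "3") is an assumption made only for this illustration.
observed <- c("1", "3", "3", "1")

# Implicit conversion: levels come only from the values that happen to occur.
implicit <- factor(observed)
levels(implicit)   # "1" "3"

# Explicit conversion: the full value set is stated, so "2" is kept as a level.
explicit <- factor(observed, levels = c("1", "2", "3"))
levels(explicit)   # "1" "2" "3"

# The unobserved alternative shows up with a count of zero.
table(explicit)
```

Keeping such variables as character and converting them to factors explicitly, with a stated set of levels, avoids silently losing alternatives that were not observed.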

interactive use

Some vectors can be recognised without doubt according to the specifications in the "Details" section. It is however possible that a vector of an intended format has been "contaminated" with data of some other form. This might happen, for example, when a numeric variable is technically a character in INCA. A hospital unit code like c(111, 123, "?") might suddenly occur (if someone uses a question mark as a placeholder for an unknown code). The ordinary coercion rules of R would treat this vector as a character (see c), although it might be more correct to treat it as a numeric with "?" set to NA.

The as.incadata function relies on exceed_threshold to ignore such contaminated values if they represent only a (preferably small) proportion of the values.

By default, if contaminated values exist but make up less than 10 percent of the values, the function will stop and ask the user for input on how to handle this variable. If the proportion exceeds 10 percent, ordinary coercing principles will apply.

The 10 percent limit can be modified through the argument threshold, and it is possible to force vectors with contaminated values to the otherwise potential format (without the need for individual confirmation) by setting the argument force = TRUE (passed to exceed_threshold).
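The reasoning in this section can be sketched in base R. The helper contaminated_share and the variable limit below are hypothetical and not part of incadata. They only mimic the 10 percent rule described here, and the scale on which the real threshold argument is expressed is not assumed.

```r
# The hospital unit codes from the example: c() coerces to the most general
# type, so the whole vector becomes character.
unit <- c(111, 123, "?")
class(unit)   # "character"

# Hypothetical helper (not part of incadata): share of non-missing values
# that cannot be read as numbers.
contaminated_share <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(0)
  mean(is.na(suppressWarnings(as.numeric(x))))
}

# One placeholder among 20 values gives a share of 0.05.
codes <- c(rep(c("111", "123"), length.out = 19), "?")
share <- contaminated_share(codes)

limit <- 0.1   # illustrative stand-in for the 10 percent default
if (share > 0 && share < limit) {
  # Here as.incadata would ask the user (or coerce directly if force = TRUE).
  cleaned <- suppressWarnings(as.numeric(codes))   # "?" becomes NA
} else {
  # Too many contaminated values (or none): keep the ordinary coercion.
  cleaned <- codes
}
```

With the three-element vector unit, one value in three is a placeholder, which is well above the limit, so ordinary coercion would apply and the vector would stay character.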

Details

Vectors are coerced to identified formats in the following order:

  • vectors recognised as Boolean by is.incalogical are coerced to logical (this is a strict format that cannot be contaminated with any unwanted values; the section "interactive use" does therefore not apply to these values)

  • vectors with an already specified class attribute (except the common "factor" class) remain members of that class

  • columns or vectors named "persnr" or "pnr" will be coerced to the "pin" class by as.pin

  • columns or vectors with names ending in "_Beskrivning", "_Varde", "_Gruppnamn" or "_id" are always treated as character (not factors; see section "factors")

  • columns or vectors named "PAT_ID", "KON_VALUE" and "LAN_VALUE" are also always treated as character. These could also be thought of as numerics but are treated as character internally by INCA. Staying with that format keeps the format stable.

  • If all values of a vector are NA, it is coerced from logical to character. This might be a faulty assumption but it is in fact more likely that an empty vector is a character variable (since most INCA variables are of type character) than that it is a Boolean vector (that has its own format in INCA).

  • Dates in formats recognised by as.Dates are coerced to such.

  • Integers (even if stored as characters or factors) without leading zeros (except when the zero is the only digit) are coerced to integers

  • Numerics (even if stored as characters or factors) containing either a Swedish decimal comma or an English decimal point are coerced to numeric (with possible commas changed to points).

  • all other formats are coerced to character. This includes integers with leading zeros (since these might be unit codes where a leading zero might bear meaning).
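The last rules in the list (all-NA vectors, integers, numerics and the character fallback) can be sketched in base R. The function guess_class is hypothetical and not part of incadata. It ignores the name-based rules, dates and Booleans, and it assumes unsigned values only.

```r
# Hypothetical helper (not part of incadata) covering only the final rules.
guess_class <- function(x) {
  x <- as.character(x)                 # also handles values stored as factors
  x <- x[!is.na(x)]
  if (length(x) == 0) return("character")   # all NA: assume character
  # Integers: digits only, no leading zero unless zero is the only digit.
  if (all(grepl("^(0|[1-9][0-9]*)$", x))) return("integer")
  # Numerics: digits with an optional decimal comma or point,
  # and at least one value that actually has a decimal mark.
  if (all(grepl("^[0-9]+([.,][0-9]+)?$", x)) && any(grepl("[.,]", x))) {
    return("numeric")
  }
  "character"                          # everything else, e.g. "007"
}

guess_class(c("1", "20", "0"))         # "integer"
guess_class(factor(c("10", "20")))     # "integer"
guess_class(c("1,5", "2.25"))          # "numeric"
guess_class(c("007", "012"))           # "character"
guess_class(c(NA, NA))                 # "character"

# A Swedish decimal comma is changed to a point before conversion.
as.numeric(sub(",", ".", c("1,5", "2.25"), fixed = TRUE))   # 1.50 2.25
```

Checking for integers before numerics, and falling back to character last, mirrors the order of the list: codes such as "007" keep their leading zeros because they never match the integer pattern.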