lsa.cut.vars: Cut continuous variables into discrete categorical variables

Description

lsa.cut.vars cuts continuous variables into discrete ones using user-defined ranges. For example, some continuous scales in large-scale assessments and surveys can be converted into two, three or more categories depending on the cut-points provided by the user.

Usage

lsa.cut.vars(
  data.file,
  data.object,
  src.variables,
  new.variables,
  new.var.labels,
  cut.points,
  value.labels,
  out.file
)

Value

An lsa.data object in memory (if out.file is missing) or an .RData file containing the lsa.data object with the new discrete variables.

Arguments

data.file: The file containing the lsa.data object. Either this or data.object shall be specified, but not both. See details.
data.object: The object in memory containing the lsa.data object. Either this or data.file shall be specified, but not both. See details.
src.variables: Names of the variables to cut into categories. Accepts only continuous variables. No PV variables are accepted. See details.
new.variables: The names of the new, cut variables to append to the dataset. See details.
new.var.labels: Optional vector of strings to add as variable labels for the new.variables. See details.
cut.points: Numeric vector of values at which to cut the src.variables. See details.
value.labels: Optional character vector of labels to assign to the newly formed discrete values in the new.variables. See details.
out.file: Full path to the .RData file to be written. If missing, the original object will be overwritten in memory. See examples.

Details

The function cuts continuous variables in large-scale assessment data into variables with discrete values. The resulting variables can be numeric or categorical (i.e. factors), depending on whether value labels for the new values are provided.

Either data.file or data.object shall be provided as the source of data. If both of them are provided, the function will stop with an error message.

The src.variables argument specifies the variables that shall be cut. Only continuous variables are accepted. Multiple src.variables can be passed; all of them will be split at the same cut points (see below). PVs are not accepted.

The new.variables argument is optional and specifies the names of the new discrete variables created from the src.variables. The order of the new.variables names is the same as the order of the src.variables. If the new.variables argument is omitted, the function will create the names automatically, appending CUT at the end of the src.variables names, and store the discrete variable data under these names. If provided, the number of new.variables must be the same as the number of src.variables.

The new.var.labels argument is optional. Regardless of whether new.variables are provided, if new.var.labels are provided, they will be assigned to the new.variables generated from the discretization. If neither new.variables nor new.var.labels are provided, the function will automatically generate new.variables (see above) and copy the variable labels from src.variables to the newly generated variables, adding Cut at the beginning. The argument takes a vector with the same number of elements as the number of variable names in src.variables.
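The length rules described above can be sketched as a small pre-flight check. This is only an illustration in base R: `check.cut.args` is a hypothetical helper written for this example, not a function of the package, and the error messages are made up.

```r
# Hypothetical helper (not part of the package): verifies that the optional
# name and label vectors have one element per source variable.
check.cut.args <- function(src.variables, new.variables = NULL,
                           new.var.labels = NULL) {
  if (!is.null(new.variables) &&
      length(new.variables) != length(src.variables)) {
    stop("new.variables must have one name per src.variables entry.")
  }
  if (!is.null(new.var.labels) &&
      length(new.var.labels) != length(src.variables)) {
    stop("new.var.labels must have one label per src.variables entry.")
  }
  invisible(TRUE)
}

# Two source variables, two new names, two new labels: the check passes.
ok <- check.cut.args(
  src.variables = c("ASBGSLR", "ASBGHRL"),
  new.variables = c("ASBGSLRREC", "ASBGHRLREC"),
  new.var.labels = c("Categorical like reading", "Categorical learning resources")
)
```

Running such a check before calling lsa.cut.vars only surfaces mismatched argument lengths early; the function itself remains the authority on what it accepts.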

cut.points is a mandatory argument. It specifies the ranges (from-to) in the original variables to be cut into discrete categories. There can be multiple cut.points; the new values will correspond to the ranges between them. For example, if the cut points 3.29309, 7.97028, 9.98618, and 10.99411 are passed, there will be five categories in the resulting discrete variables, as follows:

1 - from lowest up to 3.29309;
2 - from above 3.29309 up to 7.97028;
3 - from above 7.97028 up to 9.98618;
4 - from above 9.98618 up to 10.99411; and
5 - from above 10.99411 to the highest value.

The cut.points must be within the range of the src.variables. Otherwise the function will stop with an error.
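The mapping from cut points to category codes can be illustrated with base R's `cut()`. This is a minimal sketch, not the package's internal code: the `scores` vector is made-up data, and the right-closed intervals (a value equal to a cut point falls into the lower category) are an assumption read off the "from above ... up to ..." wording of the list.

```r
# Made-up scale scores, used only for illustration.
scores <- c(2.5, 3.29309, 3.3, 8.0, 9.98618, 10.5, 12.0)
cut.points <- c(3.29309, 7.97028, 9.98618, 10.99411)

# Pad the cut points with -Inf and Inf so the lowest and highest ranges are
# covered. right = TRUE makes each interval (lower, upper], and
# labels = FALSE returns plain integer codes instead of a factor.
codes <- cut(scores, breaks = c(-Inf, cut.points, Inf),
             labels = FALSE, right = TRUE)

# One row per score with its assigned category code.
result <- data.frame(score = scores, category = codes)
print(result)
```

Four cut points produce five codes (1 to 5), matching the five ranges listed above.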

The value.labels argument is optional. If it is omitted, the values in the new discrete variables will be numeric (integers); if the data was exported with missing.to.NA = FALSE (i.e. user-defined missings are kept), the missing values will remain as they are. If value.labels are provided, the new values will be converted to factor levels; if the data was exported with missing.to.NA = FALSE, the names of the missing values will be assigned to factor levels too. Either way, the missing values will remain missing values and will be handled properly by the analysis functions. If the data was exported with missing.to.NA = TRUE (i.e. the user-defined missing values were set to NA), the NA values will remain NA in the resulting discrete new.variables.
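The difference between numeric codes and labelled factor levels can be sketched in base R as follows. The `scores` vector is made-up data, the sketch covers only the missing.to.NA = TRUE situation (plain NA values), and it is not meant to reproduce how the package handles user-defined missing values.

```r
# Made-up scores with one missing value, used only for illustration.
scores <- c(2.5, 8.0, NA, 12.0)
cut.points <- c(4.1, 7.9, 9.9, 10.7)
value.labels <- c("Very low", "Low", "Medium", "High", "Very high")

# Without value labels: integer codes 1 to 5, with NA left as NA.
codes <- cut(scores, breaks = c(-Inf, cut.points, Inf), labels = FALSE)

# With value labels: the same codes turned into factor levels.
# NA stays NA because it matches none of the levels.
categories <- factor(codes, levels = seq_along(value.labels),
                     labels = value.labels)
print(categories)
```

Note that five labels are needed for four cut points, one label per resulting range.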

If the full path to an .RData file is provided to out.file, the dataset will be written to that file. If not, the complemented data will remain in memory.

Examples

# Produce new discrete variables from the PIRLS 2021 Students Like Reading and the
# Home Resources for Learning scales. The values for the new variables are
# numeric. Save the dataset into a file, overwriting it. The names for the new
# variables are automatically generated.
if (FALSE) {
  lsa.cut.vars(data.file = "C:/Data/PIRLS_2021_Student_Miss_to_NA.RData",
               src.variables = c("ASBGSLR", "ASBGHRL"),
               cut.points = c(4.1, 7.9, 9.9, 10.7),
               out.file = "C:/Data/PIRLS_2021_Student_Miss_to_NA.RData")
}

# Same as the above, but assign custom variable names, custom variable labels,
# and value labels for the new categorical variables. Because out.file is
# supplied, the data is saved to the disk; omit out.file to keep the result in
# the memory instead.
if (FALSE) {
  lsa.cut.vars(data.file = "C:/Data/PIRLS_2021_Student_Miss_to_NA.RData",
               src.variables = c("ASBGSLR", "ASBGHRL"),
               new.variables = c("ASBGSLRREC", "ASBGHRLREC"),
               new.var.labels = c("Categorical like reading", "Categorical learning resources"),
               cut.points = c(4.1, 7.9, 9.9, 10.7),
               value.labels = c("Very low", "Low", "Medium", "High", "Very high"),
               out.file = "C:/Data/PIRLS_2021_Student_Miss_to_NA.RData")
}