nail_catdes: Interpret a categorical latent variable

Description

Generate an LLM response to analyze a categorical latent variable.

Usage

nail_catdes(
  dataset,
  num.var,
  introduction = NULL,
  request = NULL,
  model = "llama3",
  isolate.groups = FALSE,
  quali.sample = 1,
  quanti.sample = 1,
  drop.negative = FALSE,
  proba = 0.05,
  row.w = NULL,
  generate = FALSE
)

Value

A data frame, or a list of data frames, containing the prompt and, if generate = TRUE, the LLM's response.

Arguments

dataset: a data frame made up of at least one categorical variable and a set of quantitative variables and/or categorical variables.
num.var: the index of the variable to be characterized.
introduction: the introduction for the LLM prompt.
request: the request made to the LLM.
model: the model name ('llama3' by default).
isolate.groups: a boolean that indicates whether to give the LLM a single prompt, or one prompt per category. Recommended with long catdes results.
quali.sample: from 0 to 1, the proportion of qualitative features that are randomly kept.
quanti.sample: from 0 to 1, the proportion of quantitative features that are randomly kept.
drop.negative: a boolean that indicates whether to drop negative v.test values for interpretation (keeping only positive v.tests). Recommended with long catdes results.
proba: the significance threshold considered to characterize the categories (by default 0.05).
row.w: an optional vector of integers giving the row weights (by default, a vector of 1s for uniform row weights).
generate: a boolean that indicates whether to generate the LLM response. If FALSE, the function only returns the prompt.
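
As a minimal sketch of the generate argument described above, the call below builds the prompt without asking the LLM for a response, which can be a convenient way to check what will be sent before running a slow generation. It assumes the NaileR package is installed and is skipped otherwise. The introduction and request strings are illustrative only, and the structure of the returned object is inspected with str() rather than assumed.

```r
# Sketch only: build the prompt without generating an LLM response.
# Assumption: NaileR is installed; the block is skipped otherwise.
if (requireNamespace("NaileR", quietly = TRUE)) {
  data(iris)

  res_prompt <- NaileR::nail_catdes(
    iris,
    num.var = 5,  # Species is the 5th column of iris
    introduction = "A study measured iris flowers from 3 species.",  # illustrative text
    request = "Please explain what makes each species distinct.",    # illustrative text
    generate = FALSE  # only return the prompt
  )

  # Inspect what was returned before sending anything to the model
  str(res_prompt)
}
```

Once the prompt looks right, the same call can be rerun with generate = TRUE.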

Details

This function sends a prompt directly to an LLM. To get a consistent answer, we highly recommend customizing the introduction and request parameters and adding all relevant information about your data for the LLM. We also recommend renaming the columns with clear, unabbreviated and unambiguous names.
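
The column renaming recommended above could look like the following sketch. The data frame and every column name in it are hypothetical, invented for illustration; only base R is used.

```r
# Hypothetical survey data with abbreviated column names
survey <- data.frame(
  grp    = factor(c("a", "b", "a", "b")),
  q1_frq = c(3, 5, 2, 4),
  hh_sz  = c(1, 4, 2, 3)
)

# Lookup table: old abbreviated name -> clear, unambiguous name
clear_names <- c(
  grp    = "cluster",
  q1_frq = "meals_thrown_away_per_week",
  hh_sz  = "household_size"
)

# Rename by matching on the old names, so column order does not matter
names(survey) <- unname(clear_names[names(survey)])

names(survey)
```

Renaming through a lookup table keeps the mapping in one place, so the data frame passed to nail_catdes carries names the LLM can read without extra explanation.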

Additionally, if isolate.groups = TRUE, you will need an introduction and a request that take into account the fact that only one group is analyzed at a time.

Examples

if (FALSE) {
# Processing time is often longer than ten seconds
# because the function uses a large language model.

### Example 1: Fisher's iris ###
library(NaileR)
data(iris)

intro_iris <- "A study measured various parts of iris flowers
from 3 different species: setosa, versicolor and virginica.
I will give you the results from this study.
You will have to identify what sets these flowers apart."
intro_iris <- gsub('\n', ' ', intro_iris) |>
stringr::str_squish()

req_iris <- "Please explain what makes each species distinct.
Also, tell me which species has the biggest flowers,
and which species has the smallest."
req_iris <- gsub('\n', ' ', req_iris) |>
stringr::str_squish()

res_iris <- nail_catdes(iris,
                        num.var = 5,
                        introduction = intro_iris,
                        request = req_iris,
                        generate = TRUE)

cat(res_iris$response)

### Example 2: food waste dataset ###

library(FactoMineR)

data(waste)
waste <- waste[-14]    # no variability on this question

set.seed(1)
res_mca_waste <- MCA(waste, quali.sup = c(1,2,50:76),
                     ncp = 35, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_waste, choix = "ind",
         invisible = c("var", "quali.sup"), label = "none")
res_hcpc_waste <- HCPC(res_mca_waste, nb.clust = 3, graph = FALSE)
plot.HCPC(res_hcpc_waste, choice = "map", draw.tree = FALSE,
          ind.names = FALSE)
don_clust_waste <- res_hcpc_waste$data.clust

intro_waste <- 'These data were collected
after a survey on food waste,
with participants describing their habits.'
intro_waste <- gsub('\n', ' ', intro_waste) |>
stringr::str_squish()

req_waste <- 'Please summarize the characteristics of each group.
Then, give each group a new name, based on your conclusions.
Finally, give each group a grade between 0 and 10,
based on how wasteful they are with food:
0 being "not at all", 10 being "absolutely".'
req_waste <- gsub('\n', ' ', req_waste) |>
stringr::str_squish()

res_waste <- nail_catdes(don_clust_waste,
                         num.var = ncol(don_clust_waste),
                         introduction = intro_waste,
                         request = req_waste,
                         drop.negative = TRUE,
                         generate = TRUE)

cat(res_waste$response)


### Example 3: local_food dataset ###

data(local_food)

set.seed(1)
res_mca_food <- MCA(local_food, quali.sup = 46:63,
                    ncp = 100, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_food, choix = "ind",
         invisible = c("var", "quali.sup"), label = "none")
res_hcpc_food <- HCPC(res_mca_food, nb.clust = 3, graph = FALSE)
plot.HCPC(res_hcpc_food, choice = "map", draw.tree = FALSE,
          ind.names = FALSE)
don_clust_food <- res_hcpc_food$data.clust

intro_food <- 'A study on sustainable food systems
was led on several French participants.
This study had 2 parts. In the first part,
participants had to rate how acceptable
"a food system that..." (e.g., "a food system that
only uses renewable energy") was to them.
In the second part, they had to say
if they agreed or disagreed with some statements.'
intro_food <- gsub('\n', ' ', intro_food) |>
stringr::str_squish()

req_food <- 'I will give you the answers from one group.
Please explain who the individuals of this group are,
what their beliefs are.
Then, give this group a new name,
and explain why you chose this name.
Do not use 1st person ("I", "my"...) in your answer.'
req_food <- gsub('\n', ' ', req_food) |>
stringr::str_squish()

res_food <- nail_catdes(don_clust_food,
                        num.var = 64,
                        introduction = intro_food,
                        request = req_food,
                        isolate.groups = TRUE,
                        drop.negative = TRUE,
                        generate = TRUE)

res_food[[1]]$response |> cat()
res_food[[2]]$response |> cat()
res_food[[3]]$response |> cat()
}
