Given a text input, find up to num_suggestions possible occupation categories.
get_job_suggestions(
  text,
  suggestion_type = "auxco-1.2.x",
  num_suggestions = 5,
  suggestion_type_options = list(),
  aggregate_score_threshold = 0.02,
  item_score_threshold = 0,
  distinctions = TRUE,
  steps = list(
    simbased_wordwise = list(
      algorithm = algo_similarity_based_reasoning,
      parameters = list(sim_name = "wordwise")
    ),
    simbased_substring = list(
      algorithm = algo_similarity_based_reasoning,
      parameters = list(sim_name = "substring")
    )
  ),
  include_general_id = FALSE
)
A data.table with suggestions or NULL if no suggestions were found.
text: The raw text input from the user.
suggestion_type: Which type of suggestion to provide. Possible options are "auxco-1.2.x" and "kldb-2010".
num_suggestions: The maximum number of suggestions to show. This is an upper bound, and fewer suggestions may be returned. Defaults to 5.
suggestion_type_options: A list with options for generating suggestions. Supported options:
- datasets: Pass specific datasets to be used when adding information to predictions, e.g. use a specific version of the KldB or Auxco. Supported datasets are: "auxco-1.2.x", "kldb-2010". By default, the datasets bundled with this package are used.
aggregate_score_threshold: A single value or a named list of thresholds between 0 and 1. If it is a list, each entry should correspond to one of the steps. If it is a single value, it applies to all steps. Results from a step are only returned if the sum of their scores is equal to or greater than the specified threshold. With an aggregate_score_threshold of 0, results will always be returned (if there are any).
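The single-value versus named-list rule can be sketched in base R as follows. The helper resolve_thresholds() is hypothetical and not part of the package, and the sketch assumes that a named list must contain an entry for every step; it only illustrates how one threshold per step could be looked up.

```r
# Hypothetical helper (not part of the package): expand a threshold
# specification into one numeric value per step.
resolve_thresholds <- function(threshold, step_names) {
  if (is.list(threshold)) {
    # Assumption: a named list must provide an entry for every step.
    missing_steps <- setdiff(step_names, names(threshold))
    if (length(missing_steps) > 0) {
      stop(
        "No threshold given for step(s): ",
        paste(missing_steps, collapse = ", ")
      )
    }
    resolved <- unlist(threshold[step_names])
  } else {
    # A single value applies to all steps.
    resolved <- rep(threshold, length(step_names))
    names(resolved) <- step_names
  }
  # Thresholds are documented to lie between 0 and 1.
  stopifnot(is.numeric(resolved), all(resolved >= 0 & resolved <= 1))
  resolved
}

step_names <- c("simbased_wordwise", "simbased_substring")

# Same threshold for every step
single <- resolve_thresholds(0.02, step_names)

# A different threshold per step
per_step <- resolve_thresholds(
  list(simbased_wordwise = 0.535, simbased_substring = 0.02),
  step_names
)
```

Both calls return a named numeric vector with one entry per step, so later code can look up the threshold by step name regardless of how it was specified.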
item_score_threshold: A threshold between 0 and 1 (usually very small, default 0). A result from any step is only returned if its score is greater than the specified threshold. This allows the removal of highly implausible suggestions.
distinctions: Whether to add additional distinctions, i.e. similar occupational categories, to the suggestions. Defaults to TRUE.
steps: A list with the algorithms to use and their parameters. Each entry of the list should contain a nested list with two entries: algorithm (the algorithm's function itself) and parameters (the parameters to pass on to the algorithm). Each algorithm will also always have access to a default set of three parameters:
- text_processed: The input text after preprocessing.
- suggestion_type: Which type of suggestion to output.
- num_suggestions: How many suggestions shall be returned.
These parameters must not be specified manually; they are provided automatically. Defaults to:
list(
  # try similarity "one word at most 1 letter different" first
  list(
    algorithm = algo_similarity_based_reasoning,
    parameters = list(
      sim_name = "wordwise",
      min_aggregate_prob = 0.535
    )
  ),
  # since everything else failed, try "substring" similarity
  list(
    algorithm = algo_similarity_based_reasoning,
    parameters = list(
      sim_name = "substring",
      min_aggregate_prob = 0.02
    )
  )
)
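To illustrate the shape of a steps list, here is a minimal base R sketch using a stand-in algorithm. The functions toy_algorithm() and run_step() are hypothetical and not part of the package. How the package actually merges the automatic parameters with the step-specific ones is an assumption here, as is the format of the value an algorithm returns.

```r
# Stand-in algorithm (hypothetical): accepts the three automatically
# provided parameters plus one custom parameter, min_length.
toy_algorithm <- function(text_processed, suggestion_type, num_suggestions,
                          min_length = 1) {
  if (nchar(text_processed) < min_length) {
    return(NULL)
  }
  ranks <- seq_len(num_suggestions)
  data.frame(
    id = sprintf("%s-%d", suggestion_type, ranks),
    score = rev(ranks) / sum(ranks) # toy scores that sum to 1
  )
}

# A steps list in the documented shape: each entry holds `algorithm`
# (the function itself) and `parameters` (a list of extra arguments).
my_steps <- list(
  strict = list(algorithm = toy_algorithm, parameters = list(min_length = 10)),
  lenient = list(algorithm = toy_algorithm, parameters = list(min_length = 1))
)

# Hypothetical runner (assumption): merge the automatic parameters with
# the step-specific ones and call the algorithm.
run_step <- function(step, text_processed, suggestion_type, num_suggestions) {
  automatic <- list(
    text_processed = text_processed,
    suggestion_type = suggestion_type,
    num_suggestions = num_suggestions
  )
  do.call(step$algorithm, c(automatic, step$parameters))
}

strict_result <- run_step(my_steps$strict, "koch", "auxco-1.2.x", 3)
lenient_result <- run_step(my_steps$lenient, "koch", "auxco-1.2.x", 3)
```

Because the three automatic parameters are supplied by the runner, the parameters list of each step only needs to carry the algorithm-specific settings.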
include_general_id: Whether a general column called "id" should always be returned. This column automatically contains the appropriate id for the chosen suggestion_type, e.g. for "auxco-1.2.x" it contains the same data as the column "auxco_id".
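The effect of a general id column can be sketched on a toy table in base R. The helper add_general_id() is hypothetical, the column name "kldb_id" is an assumption (only "auxco_id" is named above), and the ids and scores are made up for illustration.

```r
# Hypothetical helper: add a general `id` column that mirrors the
# type-specific id column of a suggestions table.
add_general_id <- function(suggestions, suggestion_type) {
  id_column <- switch(suggestion_type,
    "auxco-1.2.x" = "auxco_id",
    "kldb-2010" = "kldb_id", # assumed column name
    stop("Unknown suggestion_type: ", suggestion_type)
  )
  suggestions$id <- suggestions[[id_column]]
  suggestions
}

# Toy suggestions table with made-up ids and scores
toy <- data.frame(auxco_id = c("1001", "1002"), score = c(0.6, 0.2))
with_id <- add_general_id(toy, "auxco-1.2.x")
```

Downstream code can then read the "id" column without needing to know which suggestion_type produced the table.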
The procedure implemented here is, roughly speaking, as follows:
1. Predict categories from KldB 2010, including their scores. The first algorithm mentioned in steps is used (default: algo_similarity_based_reasoning()).
2. Convert the predicted KldB 2010 categories to suggestion_type (default: auxco-1.2.x, an n:m mapping; scores are mapped accordingly). See the internal function convert_suggestions() for details.
3. Remove predicted categories if their score is below item_score_threshold and keep only the num_suggestions top-ranked suggestions.
4. Start anew, trying the next algorithm in steps, if the top-ranked suggestions have a low chance of being correct. (Technically, this happens if the summed score of the num_suggestions top-ranked suggestions is below aggregate_score_threshold.)
5. If suggestion_type == "auxco-1.2.x" and distinctions == TRUE, insert additional and (highly) similar categories or replace existing ones. See the internal function add_distinctions_auxco(). Reorder and keep only the num_suggestions top-ranked suggestions. Auxco categories added during this step can be identified by their scores: 0.05 for categories with high similarity and 0.005 for categories with medium similarity.
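The filtering and fallback logic of the procedure can be sketched in base R with toy data. This is a simplified illustration, not the package's implementation: the conversion between classifications and the distinctions step are omitted, the toy algorithms and their scores are made up, and keeping items whose score equals item_score_threshold is an assumption.

```r
# Toy stand-ins for two algorithms (hypothetical): each returns a data
# frame of candidate categories with scores.
toy_steps <- list(
  wordwise = function(text) {
    data.frame(id = c("A", "B"), score = c(0.30, 0.10))
  },
  substring = function(text) {
    data.frame(id = c("A", "C", "D"), score = c(0.50, 0.30, 0.001))
  }
)

suggest_with_fallback <- function(text, steps, num_suggestions = 5,
                                  aggregate_score_threshold = 0.02,
                                  item_score_threshold = 0) {
  for (step_name in names(steps)) {
    candidates <- steps[[step_name]](text)
    # Remove items scoring below the item threshold (ties are kept,
    # which is an assumption), then keep the top-ranked suggestions.
    keep <- candidates$score >= item_score_threshold
    candidates <- candidates[keep, , drop = FALSE]
    candidates <- candidates[order(-candidates$score), , drop = FALSE]
    candidates <- head(candidates, num_suggestions)
    # Accept this step only if the summed score is high enough;
    # otherwise start anew with the next step.
    if (nrow(candidates) > 0 &&
      sum(candidates$score) >= aggregate_score_threshold) {
      candidates$step <- step_name
      return(candidates)
    }
  }
  NULL # no step produced acceptable suggestions
}

result <- suggest_with_fallback(
  "koch", toy_steps,
  num_suggestions = 2,
  aggregate_score_threshold = 0.535,
  item_score_threshold = 0.01
)
```

In this toy run the first step is rejected because its two scores sum to 0.40, which is below 0.535, so the second step supplies the suggestions. If no step reaches the aggregate threshold, the function returns NULL, mirroring the documented return value.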
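# Restrict data.table to a single thread
data.table::setDTthreads(1)
if (FALSE) {
  if (interactive()) {
    # "Koch" is German for "cook"
    get_job_suggestions("Koch")
  }
  if (interactive()) {
    # "Schlosser" is German for "locksmith"
    get_job_suggestions("Schlosser")
  }
}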