Creates a learning curve object, which can be plotted using the plotLearningCurve() function.
createLearningCurve(
population,
plpData,
modelSettings,
testSplit = "person",
testFraction = 0.25,
trainFractions = c(0.25, 0.5, 0.75),
trainEvents = NULL,
splitSeed = NULL,
nfold = 3,
indexes = NULL,
verbosity = "TRACE",
clearffTemp = FALSE,
minCovariateFraction = 0.001,
normalizeData = T,
saveDirectory = getwd(),
savePlpData = F,
savePlpResult = F,
savePlpPlots = F,
saveEvaluation = F,
timeStamp = FALSE,
analysisId = NULL
)
The population created using createStudyPopulation() that will be used to develop the model.
An object of type plpData: the patient-level prediction data extracted from the CDM.
An object of class modelSettings created using one of the following functions:
setLassoLogisticRegression - a lasso logistic regression model
setGradientBoostingMachine - a gradient boosting machine
setRandomForest - a random forest model
setKNN - a k-nearest neighbour model
Specifies the type of evaluation used. Can be either 'person' or 'time'. The value 'time' finds the date that splits the population into the testing and training fractions provided. Patients with an index after this date are assigned to the test set and patients with an index prior to this date are assigned to the training set. The value 'person' splits the data randomly into testing and training sets according to the fractions provided. The split is stratified by the class label.
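The two split types can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data frame, its column names (rowId, indexDate, outcomeCount), and the handling of patients whose index falls exactly on the split date are all assumptions made for illustration.

```r
# Toy population (hypothetical columns): row id, index date, binary outcome.
set.seed(42)
toy <- data.frame(
  rowId = 1:200,
  indexDate = as.Date("2015-01-01") + sample(0:999, 200, replace = TRUE),
  outcomeCount = rbinom(200, 1, 0.2)
)
testFraction <- 0.25

# 'time' split: find the date that leaves roughly testFraction of the
# patients after it; later index dates go to the test set.
splitDate <- as.Date(
  quantile(as.numeric(toy$indexDate), 1 - testFraction, type = 1),
  origin = "1970-01-01"
)
# Assumption: patients indexed exactly on the split date stay in training.
toy$timeSet <- ifelse(toy$indexDate > splitDate, "test", "train")

# 'person' split: random assignment, stratified by the class label so that
# each outcome class contributes the same fraction to the test set.
toy$personSet <- "train"
for (label in unique(toy$outcomeCount)) {
  rows <- which(toy$outcomeCount == label)
  nTest <- round(length(rows) * testFraction)
  testRows <- rows[sample.int(length(rows), nTest)]
  toy$personSet[testRows] <- "test"
}

# Compare how the two strategies distribute outcomes across the sets.
table(toy$timeSet, toy$outcomeCount)
table(toy$personSet, toy$outcomeCount)
```

In the 'person' split the outcome rate is the same in both sets by construction, whereas in the 'time' split it depends on how outcomes are distributed over calendar time.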
The fraction of the data that will be used as the testing set in the patient split evaluation.
A list of training fractions to create models for. Note that providing trainEvents will override your input to trainFractions.
The number of events has been shown to be a determinant of model performance. Therefore, it is recommended to provide trainEvents rather than trainFractions. Note that providing trainEvents will override your input to trainFractions. The format should be as follows:
c(500, 1000, 1500) - a list of training events
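One way to see how event counts relate to training fractions is the following base R sketch. The helper eventsToFractions is hypothetical and is not part of PatientLevelPrediction; how the package maps trainEvents onto the training data internally is not stated here, so treat this only as an illustration of the idea.

```r
# Hypothetical helper (not a package function): the fraction of the training
# data expected to contain each requested number of events, given the total
# number of events available in the training data.
eventsToFractions <- function(trainEvents, totalTrainEvents) {
  fractions <- trainEvents / totalTrainEvents
  # Assumption: requests exceeding the available events are dropped.
  fractions[fractions <= 1]
}

# With 2000 events available, the example format above corresponds to
# training on a quarter, a half, and three quarters of the training data.
eventsToFractions(c(500, 1000, 1500), totalTrainEvents = 2000)
```

Specifying events rather than fractions keeps learning curves comparable across data sets of different sizes and outcome rates, since each point on the curve then reflects the same amount of outcome information.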
The seed used to split the testing and training sets when using a 'person' type split.
The number of folds used in the cross validation (default = 3).
A data frame containing a rowId column and an index column, where an index value of -1 means the row is in the test set and a positive integer gives the cross-validation fold (default is NULL).
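A data frame in the shape described above could be built by hand as in this sketch. The row ids, the test set size, and the random fold assignment are illustrative assumptions; only the column names rowId and index and the meaning of -1 and positive integers come from the description.

```r
set.seed(1)
nfold <- 3
rowIds <- 1:20

# Pick 5 of the 20 rows for the test set (index = -1).
testRows <- sample(rowIds, 5)
isTest <- rowIds %in% testRows

indexes <- data.frame(rowId = rowIds, index = NA_integer_)
indexes$index[isTest] <- -1L

# Spread the remaining rows evenly over folds 1..nfold, in random order.
indexes$index[!isTest] <- sample(rep_len(seq_len(nfold), sum(!isTest)))

# Count rows per test set / fold.
table(indexes$index)
```

Supplying such a data frame fixes the test set and fold membership explicitly, which can be useful when the same split has to be reused across several analyses.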
Sets the level of the verbosity. If the log level is at or higher in priority than the logger threshold, a message will print. The levels are:
DEBUG - highest verbosity showing all debug statements
TRACE - showing information about start and end of steps (the default in the usage above)
INFO - show informative messages
WARN - show warning messages
ERROR - show error messages
FATAL - be silent except for fatal errors
Clears the temporary ff-directory after each iteration. This can be useful if the fitted models are large.
Minimum covariate prevalence in the population to avoid removal during preprocessing.
Whether to normalise the data.
Location to save log and results
Whether to save the plpData
Whether to save the plpResult
Whether to save the plp plots
Whether to save the plp performance csv files
Include a timestamp in the log
The analysis unique identifier
A learning curve object containing the various performance measures obtained by the model for each training set fraction. It can be plotted using plotLearningCurve.
# NOT RUN {
# define model
modelSettings <- PatientLevelPrediction::setLassoLogisticRegression()
# create learning curve
learningCurve <- PatientLevelPrediction::createLearningCurve(population,
                                                             plpData,
                                                             modelSettings)
# plot learning curve
PatientLevelPrediction::plotLearningCurve(learningCurve)
# }