PatientLevelPrediction

PatientLevelPrediction is part of HADES.

Introduction

PatientLevelPrediction is an R package for building and validating patient-level predictive models using data in the OMOP Common Data Model format.

Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.

The figure below illustrates the prediction problem we address. Among a population at risk, we aim to predict which patients at a defined moment in time (t = 0) will experience some outcome during a time-at-risk. Prediction is done using only information about the patients in an observation window prior to that moment in time.

To define a prediction problem, we have to define t = 0 by a target cohort (T), the outcome we would like to predict by an outcome cohort (O), and the time-at-risk (TAR). Furthermore, we have to make design choices for the model we would like to develop and determine the observational datasets used for internal and external validation. This conceptual framework works for all types of prediction problems, for example those presented below (T=green, O=red).
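To make the roles of T, O, and the TAR concrete, the following is a minimal base R sketch, not the package's own implementation. The toy data, the helper name labelOutcomes, and the choice of a TAR running from day 1 to day 365 after t = 0 are all assumptions made for illustration.

```r
# Hypothetical target cohort (T): cohort entry defines t = 0 for each patient.
target <- data.frame(
  subjectId = c(1, 2, 3, 4),
  cohortStartDate = as.Date(c("2020-01-01", "2020-03-15", "2020-06-01", "2020-09-10"))
)

# Hypothetical outcome cohort (O): dates on which the outcome was recorded.
outcome <- data.frame(
  subjectId = c(1, 3, 4),
  outcomeDate = as.Date(c("2020-02-01", "2022-01-01", "2020-09-05"))
)

# Assumed time-at-risk (TAR): day 1 through day 365 after t = 0.
riskWindowStart <- 1
riskWindowEnd <- 365

# Illustrative helper (hypothetical name): label each target patient with
# 1 if an outcome falls inside the TAR, 0 otherwise.
labelOutcomes <- function(target, outcome, riskWindowStart, riskWindowEnd) {
  # Left join keeps patients without any outcome record (outcomeDate is NA).
  merged <- merge(target, outcome, by = "subjectId", all.x = TRUE)
  daysToOutcome <- as.numeric(merged$outcomeDate - merged$cohortStartDate)
  inTar <- !is.na(daysToOutcome) &
    daysToOutcome >= riskWindowStart &
    daysToOutcome <= riskWindowEnd
  # Collapse to one label per patient in case of multiple outcome records.
  labels <- tapply(inTar, merged$subjectId, any)
  data.frame(
    subjectId = as.numeric(names(labels)),
    outcomeInTar = as.integer(labels)
  )
}

population <- labelOutcomes(target, outcome, riskWindowStart, riskWindowEnd)
print(population)
```

In this toy example only patient 1 has an outcome inside the TAR: patient 3's outcome occurs after the window closes, and patient 4's outcome precedes t = 0.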

Features

  • Takes one or more target cohorts (Ts) and one or more outcome cohorts (Os) and develops and validates models for all T and O combinations.
  • Allows for multiple prediction design options.
  • Extracts the necessary data from a database in OMOP Common Data Model format for multiple covariate settings.
  • Uses a large set of covariates including for example all drugs, diagnoses, procedures, as well as age, comorbidity indexes, and custom covariates.
  • Allows you to add custom covariates or cohort covariates.
  • Includes a large number of state-of-the-art machine learning algorithms that can be used to develop predictive models, including Regularized logistic regression, Random forest, Gradient boosting machines, Decision tree, Naive Bayes, K-nearest neighbours, Neural network, AdaBoost and Support vector machines.
  • Allows you to add custom algorithms.
  • Allows you to add custom feature engineering.
  • Allows you to add custom under/over sampling (or any other sampling) [note: based on existing research this is not recommended]
  • Contains functionality to externally validate models.
  • Includes functions to plot and explore model performance (ROC + Calibration).
  • Builds ensemble models using EnsemblePatientLevelPrediction.
  • Builds deep learning models using DeepPatientLevelPrediction.
  • Generates learning curves.
  • Includes a shiny app to interactively view and explore results.
  • In the shiny app you can create an HTML document (report or protocol) containing all the study results.
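The performance measures mentioned in the list above (ROC and calibration) can be illustrated without the package. The sketch below uses base R only; the function names aucRank, brier, and calibrationInTheLarge and the toy data are assumptions for illustration and are not the package's own functions.

```r
# Hypothetical observed binary outcomes and predicted risks.
observed  <- c(0, 0, 1, 0, 1, 1, 0, 1)
predicted <- c(0.10, 0.30, 0.35, 0.20, 0.80, 0.60, 0.40, 0.90)

# Discrimination: area under the ROC curve via the rank-sum
# (Mann-Whitney) identity. Average ranks handle tied predictions.
aucRank <- function(predicted, observed) {
  r <- rank(predicted)
  nPos <- sum(observed == 1)
  nNeg <- sum(observed == 0)
  (sum(r[observed == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Overall accuracy: Brier score, the mean squared difference between
# predicted risk and observed outcome (lower is better).
brier <- function(predicted, observed) {
  mean((predicted - observed)^2)
}

# Calibration-in-the-large: compare the mean predicted risk with the
# observed outcome proportion.
calibrationInTheLarge <- function(predicted, observed) {
  c(meanPredicted = mean(predicted), observedRisk = mean(observed))
}

auc <- aucRank(predicted, observed)
bs <- brier(predicted, observed)
cal <- calibrationInTheLarge(predicted, observed)
print(c(auc = auc, brier = bs, cal))
```

A well-calibrated model has a mean predicted risk close to the observed risk; here the toy predictions slightly underestimate it.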

Screenshots

Demo of the Shiny Apps can be found here:

Technology

PatientLevelPrediction is an R package, with some functions using Python through reticulate.

System Requirements

Requires R (version 4.0 or higher). Installation on Windows requires RTools. Libraries used in PatientLevelPrediction require Java and Python.

The Python installation is required for some of the machine learning algorithms. We advise installing Python 3.9 or higher using Anaconda (https://www.continuum.io/downloads).

Getting Started

  • To install the package, please read the Package Installation guide.

  • Have a look at the video below for an extensive demo of the package.

Please read the main vignette for the package:

In addition we have created vignettes that describe advanced functionality in more detail:

Package function reference: Reference

User Documentation

Documentation can be found on the package website.

Support

  • Developer questions/comments/feedback: OHDSI Forum
  • We use the GitHub issue tracker for all bugs/issues/enhancements

Contributing

Read here how you can contribute to this package.

License

PatientLevelPrediction is licensed under Apache License 2.0

Development

PatientLevelPrediction is being developed in RStudio.

Acknowledgements

  • The package is maintained by Egill Fridgeirsson and Jenna Reps and has been developed with major contributions from Peter Rijnbeek, Martijn Schuemie, Patrick Ryan, and Marc Suchard.
  • We would like to thank the following people for their contributions to the package: Seng Chan You, Ross Williams, Henrik John, Xiaoyong Pan, James Wiggins, and Alexandros Rekkas.
  • This project is supported in part through the National Science Foundation grant IIS 1251151.

Install

install.packages('PatientLevelPrediction')

Monthly Downloads

3,799

Version

6.5.1

License

Apache License 2.0

Maintainer

Egill Fridgeirsson

Last Published

October 15th, 2025

Functions in PatientLevelPrediction (6.5.1)

createCohortCovariateSettings

Extracts covariates based on cohorts
createDatabaseDetails

Create a setting that holds the details about the cdmDatabase connection for data extraction
createFeatureEngineeringSettings

Create the settings for defining any feature engineering that will be done
createLearningCurve

createLearningCurve
createExistingSplitSettings

Create the settings for defining how the plpData are split into test/validation/train sets using an existing split - good to use for reproducing results from a different run
createPreprocessSettings

Create the settings for preprocessing the trainData.
createPlpResultTables

Create the results tables to store PatientLevelPrediction models and results into a database
createRareFeatureRemover

Create the settings for removing rare features
createRandomForestFeatureSelection

Create the settings for random forest based feature selection
createTempModelLoc

Create a temporary model location
createUnivariateFeatureSelection

Create the settings for defining any feature selection that will be done
createModelDesign

Specify settings for developing a single model
createSklearnModel

Plug an existing scikit learn python model into the PLP framework
createSimpleImputer

Create Simple Imputer settings
createNormalizer

Create the settings for normalizing the data. The type argument selects the normalization to use, either "minmax" or "robust"
createLogSettings

Create the settings for logging the progression of the analysis
fitPlp

fitPlp
createStudyPopulation

Create a study population
extractDatabaseToCsv

Exports all the results from a database into csv files
createStudyPopulationSettings

Create the study population settings
diagnoseMultiplePlp

Run a list of prediction diagnoses
diagnosePlp

diagnostic - Investigates the prediction problem settings - use before training a model
evaluatePlp

evaluatePlp
externalValidateDbPlp

externalValidateDbPlp - Validate a model on new databases
createRestrictPlpDataSettings

createRestrictPlpDataSettings define extra restriction settings when calling getPlpData
createValidationSettings

createValidationSettings define optional settings for performing external validation
createValidationDesign

createValidationDesign - Define the validation design for external validation
createSampleSettings

Create the settings for defining how the trainData from splitData are sampled using default sample functions.
getPredictionDistribution

Calculates the prediction distribution
getPlpData

Extract the patient level prediction data from the server
insertResultsToSqlite

Create sqlite database with the results
loadPlpShareable

Loads the plp result saved as json/csv files for transparent sharing
listCartesian

Cartesian product
listAppend

join two lists
iterativeImpute

Imputation
plotPrecisionRecall

Plot the precision-recall curve using the sparse thresholdSummary data frame
loadPlpResult

Loads the evaluation dataframe
plotPlp

Plot all the PatientLevelPrediction plots
loadPlpModel

loads the plp model
outcomeSurvivalPlot

Plot the outcome incidence over time
getDemographicSummary

Get a demographic summary
modelBasedConcordance

Calculate the model-based concordance, which is a calculation of the expected discrimination performance of a model under the assumption the model predicts the "TRUE" outcome as detailed in van Klaveren et al. https://pubmed.ncbi.nlm.nih.gov/27251001/
plotGeneralizability

Plot the train/test generalizability diagnostic
plotF1Measure

Plot the F1 measure efficiency frontier using the sparse thresholdSummary data frame
plotSparseCalibration2

Plot the conventional calibration
plotSparseCalibration

Plot the calibration
loadPrediction

Loads the prediction dataframe from json
plotNetBenefit

Plot the net benefit
plotLearningCurve

plotLearningCurve
plotSmoothCalibration

Plot the smooth calibration as detailed in Van Calster et al. "A calibration hierarchy for risk models was defined: from utopia to empirical data" (2016)
plotPreferencePDF

Plot the preference score probability density function, showing prediction overlap between true and false cases
recalibratePlp

recalibratePlp
print.summary.plpData

Print a summary.plpData object
createGlmModel

createGlmModel
createIterativeImputer

Create Iterative Imputer settings
predictGlm

predict using a logistic regression model
predictPlp

predictPlp
removeRareFeatures

A function that removes rare features from the data
recalibratePlpRefit

recalibratePlpRefit
runMultiplePlp

Run a list of prediction analyses
predictCyclops

Create predictive probabilities
robustNormalize

A function that normalizes continuous features by the interquartile range and optionally forces the resulting values to be between -3 and 3. It uses (value - median) / iqr to normalize the data and can then apply the function f(x) = x / sqrt(1 + (x/3)^2) to the normalized values, which forces them to lie between -3 and 3 while preserving their relative ordering. See https://arxiv.org/abs/2407.04491 for more details
pmmFit

Predictive mean matching using lasso
createSplineSettings

Create the settings for adding a spline for continuous variables
getThresholdSummary

Calculate all measures for sparse ROC
createStratifiedImputationSettings

Create the settings for using stratified imputation.
getCohortCovariateData

Extracts covariates based on cohorts
getCalibrationSummary

Get a sparse summary of the calibration
loadPlpAnalysesJson

Load the multiple prediction json settings from a file
getPredictionDistribution_binary

Calculates the prediction distribution
getEunomiaPlpData

Create a plpData object from the Eunomia database
migrateDataModel

Migrate Data model
insertCsvToDatabase

Function to insert results into a database from csvs
ici

Calculate the Integrated Calibration Index from Austin and Steyerberg https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8281
savePrediction

Saves the prediction dataframe to a json file
setAdaBoost

Create setting for AdaBoost with python DecisionTreeClassifier base estimator
savePlpResult

Saves the result from runPlp into the location directory
savePlpShareable

Save the plp result as json files and csv files for transparent sharing
loadPlpData

Load the plpData from a folder
setIterativeHardThresholding

Create setting for Iterative Hard Thresholding model
setLassoLogisticRegression

Create modelSettings for lasso logistic regression
setLightGBM

Create setting for gradient boosting machine model using lightGBM (https://github.com/microsoft/LightGBM/tree/master/R-package).
setGradientBoostingMachine

Create setting for gradient boosting machine model using gbm_xgboost implementation
toSparseM

Convert the plpData in COO format into a sparse R matrix
minMaxNormalize

A function that normalizes continuous features to have values between 0 and 1
plotPredictionDistribution

Plot the side-by-side boxplots of prediction distribution, by class
plotPredictedPDF

Plot the Predicted probability density function, showing prediction overlap between true and false cases
validateExternal

validateExternal - Validate model performance on new data
savePlpData

Save the plpData to folder
savePlpModel

Saves the plp model
setSVM

Create setting for the python sklearn SVM (SVC function)
simpleImpute

Simple Imputation
setNaiveBayes

Create setting for naive bayes model with python
setMLP

Create setting for neural network model with python's scikit-learn. For bigger models, consider using DeepPatientLevelPrediction package.
sklearnFromJson

Loads sklearn python model from json
sklearnToJson

Saves sklearn python model object to json in path
runPlp

runPlp - Develop and internally evaluate a model using specified settings
savePlpAnalysesJson

Save the modelDesignList to a json file
setDecisionTree

Create setting for the scikit-learn DecisionTree with python
setCoxModel

Create setting for lasso Cox model
setPythonEnvironment

Use the python environment created using configurePython()
validateMultiplePlp

externally validate the multiple plp models across new datasets
setRandomForest

Create setting for random forest model using sklearn
pfi

Permutation Feature Importance
plotDemographicSummary

Plot the Observed vs. expected incidence, by age and gender
plotSparseRoc

Plot the ROC curve using the sparse thresholdSummary data frame
plotVariableScatterplot

Plot the variable importance scatterplot
print.plpData

Print a plpData object
preprocessData

A function that wraps around FeatureExtraction::tidyCovariateData to normalise the data and remove rare or redundant features
viewDatabaseResultPlp

open a local shiny app for viewing the results of PLP analyses from a database
splitData

Split the plpData into test/train sets using split settings of class splitSettings
summary.plpData

Summarize a plpData object
simulatePlpData

Generate simulated data
viewPlp

viewPlp - Interactively view the performance and model settings
viewMultiplePlp

open a local shiny app for viewing the results of multiple PLP analyses
simulationProfile

A simulation profile for generating synthetic patient level prediction data
MapIds

Map covariate and row Ids so they start from 1
averagePrecision

Calculate the average precision
computeAuc

Compute the area under the ROC curve
PatientLevelPrediction

PatientLevelPrediction
calibrationLine

calibrationLine
covariateSummary

covariateSummary
computeGridPerformance

Computes grid performance with a specified performance function
brierScore

brierScore
configurePython

Sets up a python environment to use for PLP (can be conda or venv)
createDefaultSplitSetting

Create the settings for defining how the plpData are split into test/validation/train sets using default splitting functions (either random stratified by outcome, time or subject splitting)
createExecuteSettings

Creates list of settings specifying what parts of runPlp to execute
createDefaultExecuteSettings

Creates default list of settings specifying what parts of runPlp to execute
createDatabaseSchemaSettings

Create the PatientLevelPrediction database result schema settings
calibrationInLarge

Calculate the calibration in large
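
The two normalizations described in the robustNormalize and minMaxNormalize entries above can be sketched in base R as follows. This is an illustration of the stated formulas only, not the package's implementation: the function names robustNormalizeSketch and minMaxNormalizeSketch, the clip argument, and the error on a zero interquartile range are assumptions.

```r
# Robust normalization: (value - median) / iqr, optionally followed by
# f(x) = x / sqrt(1 + (x/3)^2), which keeps values strictly between -3 and 3
# while preserving their relative ordering.
robustNormalizeSketch <- function(x, clip = TRUE) {
  med <- stats::median(x, na.rm = TRUE)
  iqr <- stats::IQR(x, na.rm = TRUE)
  # Assumption: treat a zero IQR as an error rather than guessing a fallback.
  if (iqr == 0) stop("IQR is zero; cannot normalize")
  z <- (x - med) / iqr
  if (clip) {
    z <- z / sqrt(1 + (z / 3)^2)
  }
  z
}

# Min-max normalization: rescale values to the interval [0, 1].
minMaxNormalizeSketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) stop("All values are equal; cannot normalize")
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical feature with one extreme value.
x <- c(1, 2, 3, 4, 5, 1000)

zRobust <- robustNormalizeSketch(x)
zRaw <- robustNormalizeSketch(x, clip = FALSE)
zMinMax <- minMaxNormalizeSketch(x)
print(round(cbind(x, zRaw, zRobust, zMinMax), 4))
```

With the extreme value 1000 in the input, min-max scaling compresses the remaining values close to 0, whereas the robust variant keeps them spread out and only squashes the outlier toward 3.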