PatientLevelPrediction

PatientLevelPrediction is part of HADES.

Introduction

PatientLevelPrediction is an R package for building and validating patient-level predictive models using data in the OMOP Common Data Model format.

Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.

The figure below illustrates the prediction problem we address. Among a population at risk, we aim to predict which patients at a defined moment in time (t = 0) will experience some outcome during a time-at-risk. Prediction is done using only information about the patients in an observation window prior to that moment in time.

To define a prediction problem we have to define t = 0 by a target cohort (T), the outcome we would like to predict by an outcome cohort (O), and the time-at-risk (TAR). Furthermore, we have to make design choices for the model we would like to develop and determine the observational datasets in which to perform internal and external validation. This conceptual framework works for all types of prediction problems, for example those presented below (T=green, O=red).
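
The relationship between T, O, and the TAR can be illustrated with a small base R sketch. It does not use the package API: the toy data frames, the column names, the 1 to 365 day time-at-risk, and the helper labelOutcomes are all assumptions made for illustration, and each person is assumed to enter the target cohort only once.

```r
# Hypothetical target cohort (T): one row per person, cohortStartDate is t = 0.
target <- data.frame(
  subjectId = c(1, 2, 3),
  cohortStartDate = as.Date(c("2020-01-01", "2020-03-15", "2020-06-01"))
)

# Hypothetical outcome cohort (O): one row per outcome event.
outcome <- data.frame(
  subjectId = c(1, 3),
  outcomeDate = as.Date(c("2020-02-10", "2021-09-01"))
)

# Assumed time-at-risk (TAR): day 1 through day 365 after t = 0.
tarStart <- 1
tarEnd <- 365

# Label each target subject TRUE if any outcome event falls inside the TAR.
labelOutcomes <- function(target, outcome, tarStart, tarEnd) {
  merged <- merge(target, outcome, by = "subjectId", all.x = TRUE)
  daysToEvent <- as.numeric(merged$outcomeDate - merged$cohortStartDate)
  merged$inTar <- !is.na(daysToEvent) &
    daysToEvent >= tarStart & daysToEvent <= tarEnd
  labels <- aggregate(inTar ~ subjectId, data = merged, FUN = any)
  names(labels)[2] <- "outcome"
  labels
}

labels <- labelOutcomes(target, outcome, tarStart, tarEnd)
print(labels)
```

In this toy data only subject 1 has an event inside the time-at-risk; subject 3 has an event, but it occurs after the window has closed.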

Features

  • Takes one or more target cohorts (Ts) and one or more outcome cohorts (Os) and develops and validates models for all T and O combinations.
  • Allows for multiple prediction design options.
  • Extracts the necessary data from a database in OMOP Common Data Model format for multiple covariate settings.
  • Uses a large set of covariates including, for example, all drugs, diagnoses, and procedures, as well as age, comorbidity indexes, and custom covariates.
  • Allows you to add custom covariates or cohort covariates.
  • Includes a large number of state-of-the-art machine learning algorithms that can be used to develop predictive models, including regularized logistic regression, random forest, gradient boosting machines, decision tree, naive Bayes, k-nearest neighbours, neural network, AdaBoost, and support vector machines.
  • Allows you to add custom algorithms.
  • Allows you to add custom feature engineering.
  • Allows you to add custom under/over sampling (or any other sampling) [note: based on existing research this is not recommended].
  • Contains functionality to externally validate models.
  • Includes functions to plot and explore model performance (ROC + Calibration).
  • Builds ensemble models using EnsemblePatientLevelPrediction.
  • Builds deep learning models using DeepPatientLevelPrediction.
  • Generates learning curves.
  • Includes a shiny app to interactively view and explore results.
  • The shiny app can generate an HTML document (report or protocol) containing all the study results.

Screenshots

Demo of the Shiny Apps can be found here:

Technology

PatientLevelPrediction is an R package, with some functions using Python through reticulate.

System Requirements

Requires R (version 4.0 or higher). Installation on Windows requires RTools. Libraries used in PatientLevelPrediction require Java and Python.

The Python installation is required for some of the machine learning algorithms. We advise installing Python 3.9 or higher using Anaconda (https://www.continuum.io/downloads).

Getting Started

  • To install the package please read the Package Installation guide

  • Have a look at the video below for an extensive demo of the package.

Please read the main vignette for the package:

In addition we have created vignettes that describe advanced functionality in more detail:

Package function reference: Reference

User Documentation

Documentation can be found on the package website.

Support

  • Developer questions/comments/feedback: OHDSI Forum
  • We use the GitHub issue tracker for all bugs/issues/enhancements

Contributing

Read here how you can contribute to this package.

License

PatientLevelPrediction is licensed under Apache License 2.0

Development

PatientLevelPrediction is being developed in RStudio.

Acknowledgements

  • The package is maintained by Egill Fridgeirsson and Jenna Reps and has been developed with major contributions from Peter Rijnbeek, Martijn Schuemie, Patrick Ryan, and Marc Suchard.
  • We would like to thank the following people for their contributions to the package: Seng Chan You, Ross Williams, Henrik John, Xiaoyong Pan, James Wiggins, and Alexandros Rekkas.
  • This project is supported in part through the National Science Foundation grant IIS 1251151.

Install

install.packages('PatientLevelPrediction')

Monthly Downloads

518

Version

6.4.1

License

Apache License 2.0

Maintainer

Egill Fridgeirsson

Last Published

April 20th, 2025

Functions in PatientLevelPrediction (6.4.1)

createDefaultSplitSetting

Create the settings for defining how the plpData are split into test/validation/train sets using default splitting functions (either random stratified by outcome, time or subject splitting)
createLearningCurve

createLearningCurve
createPlpResultTables

Create the results tables to store PatientLevelPrediction models and results into a database
createLogSettings

Create the settings for logging the progression of the analysis
createUnivariateFeatureSelection

Create the settings for defining any feature selection that will be done
createFeatureEngineeringSettings

Create the settings for defining any feature engineering that will be done
createTempModelLoc

Create a temporary model location
createRareFeatureRemover

Create the settings for removing rare features
createValidationDesign

createValidationDesign - Define the validation design for external validation
createValidationSettings

createValidationSettings define optional settings for performing external validation
createRandomForestFeatureSelection

Create the settings for random forest based feature selection
diagnoseMultiplePlp

Run a list of prediction diagnoses
createStudyPopulation

Create a study population
createStudyPopulationSettings

Create the study population settings
evaluatePlp

evaluatePlp
diagnosePlp

diagnostic - Investigates the prediction problem settings - use before training a model
createModelDesign

Specify settings for developing a single model
createNormalizer

Create the settings for normalizing the data. The type argument selects the normalization to use, either "minmax" or "robust"
createRestrictPlpDataSettings

createRestrictPlpDataSettings define extra restriction settings when calling getPlpData
createSklearnModel

Plug an existing scikit learn python model into the PLP framework
createSampleSettings

Create the settings for defining how the trainData from splitData are sampled using default sample functions.
getEunomiaPlpData

Create a plpData object from the Eunomia database
createSimpleImputer

Create Simple Imputer settings
getDemographicSummary

Get a demographic summary
createPreprocessSettings

Create the settings for preprocessing the trainData.
extractDatabaseToCsv

Exports all the results from a database into csv files
createStratifiedImputationSettings

Create the settings for using stratified imputation.
getCalibrationSummary

Get a sparse summary of the calibration
createSplineSettings

Create the settings for adding a spline for continuous variables
externalValidateDbPlp

externalValidateDbPlp - Validate a model on new databases
ici

Calculate the Integrated Calibration Index from Austin and Steyerberg https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8281
getCohortCovariateData

Extracts covariates based on cohorts
getPlpData

Extract the patient level prediction data from the server
getThresholdSummary

Calculate all measures for sparse ROC
getPredictionDistribution_binary

Calculates the prediction distribution
loadPlpShareable

Loads the plp result saved as json/csv files for transparent sharing
insertCsvToDatabase

Function to insert results into a database from csvs
pfi

Permutation Feature Importance
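
The idea behind permutation feature importance can be sketched in base R as follows. This is not the package's pfi implementation: the simulated data, the glm model, the helper names aucSketch and pfiSketch, and the choice of AUC as the performance metric are assumptions for illustration, and for brevity the importance is computed on the training data rather than on a held-out set.

```r
set.seed(42)

# Simulated data: x1 drives the outcome, x2 is pure noise.
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x1))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- glm(y ~ x1 + x2, family = binomial(), data = dat)

# AUC via the rank-sum (Mann-Whitney) formulation.
aucSketch <- function(prediction, outcome) {
  r <- rank(prediction)
  nPos <- sum(outcome == 1)
  nNeg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

baseline <- aucSketch(predict(fit, newdata = dat, type = "response"), dat$y)

# Importance of a feature = mean drop in AUC after shuffling that feature.
pfiSketch <- function(feature, repeats = 5) {
  drops <- replicate(repeats, {
    shuffled <- dat
    shuffled[[feature]] <- sample(shuffled[[feature]])
    permuted <- predict(fit, newdata = shuffled, type = "response")
    baseline - aucSketch(permuted, shuffled$y)
  })
  mean(drops)
}

importance <- sapply(c("x1", "x2"), pfiSketch)
print(importance)
```

Shuffling the informative feature x1 should reduce the AUC substantially, while shuffling the noise feature x2 should leave it almost unchanged.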
plotPredictedPDF

Plot the Predicted probability density function, showing prediction overlap between true and false cases
loadPrediction

Loads the prediction dataframe from a json file
plotDemographicSummary

Plot the Observed vs. expected incidence, by age and gender
listAppend

join two lists
listCartesian

Cartesian product
fitPlp

fitPlp
insertResultsToSqlite

Create sqlite database with the results
loadPlpAnalysesJson

Load the multiple prediction json settings from a file
plotLearningCurve

plotLearningCurve
loadPlpData

Load the plpData from a folder
plotPredictionDistribution

Plot the side-by-side boxplots of prediction distribution, by class
plotPreferencePDF

Plot the preference score probability density function, showing prediction overlap between true and false cases
iterativeImpute

Imputation
minMaxNormalize

A function that normalizes continuous features to have values between 0 and 1
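
As a rough illustration of min-max scaling, the following base R sketch rescales a numeric vector to the range 0 to 1. The helper minMaxSketch is hypothetical and operates on a plain vector; the package function works on plpData covariates and may differ in detail.

```r
# Rescale a numeric vector to [0, 1] using its own minimum and maximum.
minMaxSketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant feature carries no spread; map every value to 0.
    return(x - rng[1])
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- minMaxSketch(c(2, 4, 6, 10))
print(scaled)
```

When a model is applied to new data, the minimum and maximum learned on the training data would normally be reused rather than recomputed.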
plotF1Measure

Plot the F1 measure efficiency frontier using the sparse thresholdSummary data frame
migrateDataModel

Migrate Data model
plotSparseRoc

Plot the ROC curve using the sparse thresholdSummary data frame
plotGeneralizability

Plot the train/test generalizability diagnostic
plotVariableScatterplot

Plot the variable importance scatterplot
preprocessData

A function that wraps around FeatureExtraction::tidyCovariateData to normalise the data and remove rare or redundant features
plotSmoothCalibration

Plot the smooth calibration as detailed in Van Calster et al. "A calibration hierarchy for risk models was defined: from utopia to empirical data" (2016)
print.plpData

Print a plpData object
getPredictionDistribution

Calculates the prediction distribution
modelBasedConcordance

Calculate the model-based concordance, which is a calculation of the expected discrimination performance of a model under the assumption the model predicts the "TRUE" outcome as detailed in van Klaveren et al. https://pubmed.ncbi.nlm.nih.gov/27251001/
outcomeSurvivalPlot

Plot the outcome incidence over time
pmmFit

Predictive mean matching using lasso
savePrediction

Saves the prediction dataframe to a json file
setAdaBoost

Create setting for AdaBoost with python DecisionTreeClassifier base estimator
plotNetBenefit

Plot the net benefit
predictPlp

predictPlp
savePlpData

Save the plpData to folder
predictGlm

predict using a logistic regression model
runPlp

runPlp - Develop and internally evaluate a model using specified settings
savePlpAnalysesJson

Save the modelDesignList to a json file
savePlpModel

Saves the plp model
recalibratePlpRefit

recalibratePlpRefit
removeRareFeatures

A function that removes rare features from the data
setCoxModel

Create setting for lasso Cox model
setSVM

Create setting for the python sklearn SVM (SVC function)
setDecisionTree

Create setting for the scikit-learn DecisionTree with python
simpleImpute

Simple Imputation
setPythonEnvironment

Use the python environment created using configurePython()
predictCyclops

Create predictive probabilities
print.summary.plpData

Print a summary.plpData object
loadPlpModel

loads the plp model
plotPlp

Plot all the PatientLevelPrediction plots
loadPlpResult

Loads the evaluation dataframe
recalibratePlp

recalibratePlp
validateMultiplePlp

Externally validate multiple plp models across new datasets
setLassoLogisticRegression

Create modelSettings for lasso logistic regression
splitData

Split the plpData into test/train sets using split settings of class splitSettings
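
A random split stratified by outcome can be sketched in base R as below. This is an illustration only, not the package's splitting code: the outcome vector, the 25% test fraction, and the helper stratifiedSplit are assumptions.

```r
set.seed(1)

# Hypothetical outcome labels: 100 non-cases and 20 cases.
outcome <- rep(c(0, 1), times = c(100, 20))

# Assign each person to "train" or "test", sampling within each outcome class
# so that both sets keep roughly the same outcome rate.
stratifiedSplit <- function(outcome, testFraction) {
  isTest <- logical(length(outcome))
  for (cls in unique(outcome)) {
    idx <- which(outcome == cls)
    nTest <- round(length(idx) * testFraction)
    isTest[idx[sample.int(length(idx), nTest)]] <- TRUE
  }
  ifelse(isTest, "test", "train")
}

split <- stratifiedSplit(outcome, testFraction = 0.25)
print(table(split, outcome))
```

With these numbers the test set receives 25 non-cases and 5 cases, so the outcome rate is the same in both sets.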
plotSparseCalibration

Plot the calibration
plotPrecisionRecall

Plot the precision-recall curve using the sparse thresholdSummary data frame
viewDatabaseResultPlp

Open a local shiny app for viewing the results of PLP analyses from a database
runMultiplePlp

Run a list of prediction analyses
plotSparseCalibration2

Plot the conventional calibration
robustNormalize

A function that normalizes continuous features by the interquartile range and optionally forces the resulting values to be between -3 and 3. It uses (value - median) / iqr to normalize the data and can then apply the function f(x) = x / sqrt(1 + (x/3)^2) to the normalized values, which forces them to be between -3 and 3 while preserving their relative ordering. See https://arxiv.org/abs/2407.04491 for more details.
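
The two steps described above, (value - median) / iqr followed by f(x) = x / sqrt(1 + (x/3)^2), can be sketched in base R on a plain numeric vector. The helper robustSketch and its clip argument are hypothetical names for this illustration; the package function operates on plpData covariates.

```r
# Centre by the median, scale by the interquartile range, then optionally
# squash the result into (-3, 3) with f(x) = x / sqrt(1 + (x / 3)^2).
# Note: a feature with an IQR of zero would need special handling.
robustSketch <- function(x, clip = TRUE) {
  z <- (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
  if (clip) {
    z <- z / sqrt(1 + (z / 3)^2)
  }
  z
}

x <- c(1, 2, 3, 4, 100)
z <- robustSketch(x)
print(z)
```

The outlier 100 is pulled to just under 3 instead of dominating the scale, and the ordering of the values is unchanged.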
setIterativeHardThresholding

Create setting for Iterative Hard Thresholding model
setGradientBoostingMachine

Create setting for gradient boosting machine model using gbm_xgboost implementation
summary.plpData

Summarize a plpData object
savePlpShareable

Save the plp result as json files and csv files for transparent sharing
savePlpResult

Saves the result from runPlp into the location directory
simulatePlpData

Generate simulated data
setLightGBM

Create setting for gradient boosting machine model using lightGBM (https://github.com/microsoft/LightGBM/tree/master/R-package).
validateExternal

validateExternal - Validate model performance on new data
toSparseM

Convert the plpData in COO format into a sparse R matrix
setRandomForest

Create setting for random forest model using sklearn
viewPlp

viewPlp - Interactively view the performance and model settings
viewMultiplePlp

Open a local shiny app for viewing the results of multiple PLP analyses
simulationProfile

A simulation profile for generating synthetic patient level prediction data
sklearnFromJson

Loads sklearn python model from json
setNaiveBayes

Create setting for naive bayes model with python
setMLP

Create setting for neural network model with python's scikit-learn. For bigger models, consider using DeepPatientLevelPrediction package.
sklearnToJson

Saves sklearn python model object to json in path
configurePython

Sets up a python environment to use for PLP (can be conda or venv)
averagePrecision

Calculate the average precision
computeAuc

Compute the area under the ROC curve
calibrationInLarge

Calculate the calibration in large
brierScore

brierScore
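
Three of the performance metrics listed above can be sketched in a few lines of base R. These are textbook definitions written for illustration, not the package's code: the helpers aucSketch, brierSketch, and citlSketch and the toy vectors are assumptions, and calibration-in-the-large is taken here to mean comparing the mean predicted risk with the observed outcome rate.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.
aucSketch <- function(prediction, outcome) {
  r <- rank(prediction)  # average ranks handle tied predictions
  nPos <- sum(outcome == 1)
  nNeg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Brier score: mean squared difference between predicted risk and outcome.
brierSketch <- function(prediction, outcome) {
  mean((prediction - outcome)^2)
}

# Calibration-in-the-large: mean predicted risk next to the observed rate.
citlSketch <- function(prediction, outcome) {
  c(meanPredicted = mean(prediction), observed = mean(outcome))
}

prediction <- c(0.1, 0.4, 0.35, 0.8)
outcome <- c(0, 0, 1, 1)

auc <- aucSketch(prediction, outcome)
brier <- brierSketch(prediction, outcome)
citl <- citlSketch(prediction, outcome)
print(c(auc = auc, brier = brier, citl))
```

For this toy input the AUC is 0.75 and the Brier score is 0.158125; the mean predicted risk (0.4125) is below the observed rate (0.5), which indicates under-prediction on average.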
computeGridPerformance

Computes grid performance with a specified performance function
calibrationLine

calibrationLine
covariateSummary

covariateSummary
MapIds

Map covariate and row Ids so they start from 1
PatientLevelPrediction

PatientLevelPrediction
createDefaultExecuteSettings

Creates default list of settings specifying what parts of runPlp to execute
createDatabaseSchemaSettings

Create the PatientLevelPrediction database result schema settings
createCohortCovariateSettings

Extracts covariates based on cohorts
createIterativeImputer

Create Iterative Imputer settings
createDatabaseDetails

Create a setting that holds the details about the cdmDatabase connection for data extraction
createExistingSplitSettings

Create the settings for defining how the plpData are split into test/validation/train sets using an existing split - good to use for reproducing results from a different run
createGlmModel

createGlmModel
createExecuteSettings

Creates list of settings specifying what parts of runPlp to execute