healthcareai provides a clean interface to create and compare multiple models on your data and then deploy the model that is most accurate. healthcareai also includes functions for data exploration, data cleaning, and model evaluation.
healthcareai follows a three-step process: first, load, profile, and engineer features; second, develop a model; third, deploy and monitor the model.
Load and profile data
Loading Data:
Use selectData to pull data directly from a SQL database.
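A minimal sketch of pulling data with selectData, assuming the v1-era API in which it takes a connection string and a query; the server, database, and table names below are purely illustrative.

```r
library(healthcareai)

# Illustrative connection string -- substitute your own server and database
connection.string <- "driver={SQL Server};
                      server=localhost;
                      database=SAM;
                      trusted_connection=true"

# Illustrative query -- pull only the columns you plan to profile or model
query <- "SELECT PatientEncounterID, SystolicBPNBR, LDLNBR, A1CNBR
          FROM [SAM].[dbo].[HCRDiabetesClinical]"

df <- selectData(connection.string, query)
head(df)
```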
Profiling and Analyzing Data - One can get quite far in healthcare data analysis without ever going beyond this step:
featureAvailabilityProfiler will find how much data is
present in each variable over time.
countMissingData finds the proportion of missing data in
each variable.
findVariation and variationAcrossGroups are
used to find variation within and across subgroups of data.
findTrends finds trends lasting six months or longer.
RiskAdjustedComparisons compares groups in a
risk-adjusted fashion. See
Risk Adjustment
for a general introduction.
calculateTargetedCorrelations will calculate correlations
for all numeric columns and a specified variable of interest.
returnColsWithMoreThanFiftyCategories shows which
categorical columns have more than 50 categories.
KmeansClustering is used to cluster data, with or without an
outcome variable.
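A few of the profiling functions above, sketched against a data frame df like the one loaded with selectData; the target-column argument name is an assumption, so check each function's help page for the exact signature.

```r
# Proportion of missing data in each variable
countMissingData(df)

# Categorical columns with more than 50 categories
returnColsWithMoreThanFiftyCategories(df)

# Correlations between every numeric column and one variable of interest
# (the targetCol argument name is illustrative)
calculateTargetedCorrelations(df, targetCol = "A1CNBR")
```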
Feature Engineering:
convertDateTimeColToDummies has been renamed to
splitOutDateTimeCols; a deprecated shell with a warning remains until the following release.
splitOutDateTimeCols converts a date variable
into dummy columns of day, hour, etc., which is useful for modeling seasonal patterns.
countDaysSinceFirstDate counts the days since the first date in the
input column.
groupedLOCF carries last observed value forward. This is
an imputation method for longitudinal data.
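The feature-engineering steps above, sketched as one pipeline; the column names and the exact argument shapes (which column to split, which column defines the groups for LOCF) are assumptions, not the confirmed signatures.

```r
# Split a datetime column into day, hour, etc. dummy columns
# for seasonal pattern modeling (column name is illustrative)
df <- splitOutDateTimeCols(df, "AdmitDTS")

# Days elapsed since the first date in the input column
countDaysSinceFirstDate(df, "AdmitDTS")

# Carry the last observed value forward within each patient --
# an imputation method for longitudinal data
# (grouping column is illustrative)
df <- groupedLOCF(df, "PatientEncounterID")
```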
Develop a machine learning model
Models:
LassoDevelopment: Used for regression or classification;
performs especially well when there are many variables.
RandomForestDevelopment: Used for regression or
classification and is well suited to non-linear data.
XGBoostDevelopment: Used for multi-class classification
(problems where there are more than 2 classes). Well suited to non-linear
data.
LinearMixedModelDevelopment: Best suited to longitudinal
data and datasets with fewer than 100k rows and 50 variables. Can do
classification or regression.
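Developing and comparing two of the models above might look like the following sketch. It assumes the v1-era pattern of filling in a shared SupervisedModelDevelopmentParams object and passing it to each R6 development class; the field names and outcome/grain columns are illustrative and may differ in your version.

```r
library(healthcareai)

# Shared development parameters (field names follow the v1-era API)
p <- SupervisedModelDevelopmentParams$new()
p$df <- df
p$type <- "classification"
p$predictedCol <- "ThirtyDayReadmitFLG"  # illustrative outcome column
p$grainCol <- "PatientEncounterID"       # illustrative row-ID column
p$impute <- TRUE
p$cores <- 1

# Train two candidate models on the same parameters and compare
# their reported performance before choosing one to deploy
lasso <- LassoDevelopment$new(p)
lasso$run()

rf <- RandomForestDevelopment$new(p)
rf$run()
```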
Performance of Trained Models:
Area under the ROC curve or area under the precision-recall curve is used to evaluate the performance of classification models.
Mean squared error (MSE) and root mean squared error (RMSE) are used to evaluate the performance of regression models.
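For reference, the regression metrics reduce to a one-liner in base R, given vectors of actual and predicted values:

```r
# MSE: mean of squared prediction errors; RMSE: its square root,
# which is in the same units as the outcome
mse  <- mean((actual - predicted)^2)
rmse <- sqrt(mse)
```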
Deploy and Monitor the Machine Learning Model
Deploy the Model:
Use LassoDeployment,
LinearMixedModelDeployment,
RandomForestDeployment, or XGBoostDeployment to
load the model from development and predict against test data. The
deployments can be tested locally, but eventually live on the production
server.
Use writeData to push the predicted values into a SQL
environment.
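A deployment sketch following the same pattern as development; the SupervisedModelDeploymentParams fields, the method for retrieving the prediction data frame, the writeData argument names, and the destination table are all assumptions to be checked against your version's help pages.

```r
library(healthcareai)

# Deployment parameters mirror development (field names are illustrative)
p <- SupervisedModelDeploymentParams$new()
p$df <- df
p$type <- "classification"
p$predictedCol <- "ThirtyDayReadmitFLG"
p$grainCol <- "PatientEncounterID"
p$impute <- TRUE

dep <- RandomForestDeployment$new(p)
dep$deploy()

# Push predicted values into SQL (argument names and table are illustrative)
predictions <- dep$getOutDf()
writeData(connection.string, predictions, "dbo.Predictions")
```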
Monitoring the model:
generateAUC is used to monitor performance over time.
This should happen once predictions can be validated against observed outcomes: if
you're predicting 30-day readmissions, you can't validate until 30 days have
passed since the predictions were made.
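Once the outcomes have come in, the monitoring call might look like the sketch below; the argument names and the ROC/precision-recall selector value are assumptions, so consult ?generateAUC for the exact signature.

```r
# Compare stored predicted probabilities against observed outcomes
# (column and argument names are illustrative)
generateAUC(predictions = df$PredictedProbNBR,
            labels = df$ThirtyDayReadmitFLG,
            aucType = "SS")  # ROC-style AUC; use the PR option for precision-recall
```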