healthcareai provides a clean interface to create and compare multiple models on your data and then deploy the model that is most accurate. healthcareai also includes functions for data exploration, data cleaning, and model evaluation.
The workflow has three steps: first, load and profile the data and engineer features; second, develop a model; third, deploy and monitor the model.
Load and profile data
Loading Data:
Use selectData to pull data directly from a SQL database.
Profiling and Analyzing Data (one can get quite far in healthcare data analysis without ever going beyond this step):
featureAvailabilityProfiler finds how much data is present in each variable over time.
countMissingData finds the proportion of missing data in each variable.
findVariation and variationAcrossGroups find variation across and between subgroups of data.
findTrends finds trends that are six months or longer.
RiskAdjustedComparisons compares groups in a risk-adjusted fashion. See RiskAdjustement for a general introduction.
calculateTargetedCorrelations calculates the correlation between each numeric column and a specified variable of interest.
returnColsWithMoreThanFiftyCategories shows which categorical columns have more than 50 categories.
KmeansClustering is used to cluster data with or without an outcome variable.
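As a rough illustration of the kind of check the profiling step performs, the sketch below computes the proportion of missing values per column. It is written in plain Python as a language-neutral example and is not healthcareai's implementation of countMissingData; the function name `proportion_missing`, the set of values treated as missing, and the toy data are all assumptions made for this example.

```python
def proportion_missing(rows, missing_values=(None, "", "NA", "NULL")):
    """Return {column: fraction of rows whose value counts as missing}.

    `rows` is a list of dicts that all share the same keys. The default
    `missing_values` is an assumption for this sketch, not a package setting.
    """
    if not rows:
        return {}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) in missing_values) / total
        for col in rows[0].keys()
    }


# Hypothetical toy data, invented for illustration only.
patients = [
    {"patient_id": 1, "a1c": 7.1, "gender": "F"},
    {"patient_id": 2, "a1c": None, "gender": "M"},
    {"patient_id": 3, "a1c": None, "gender": ""},
    {"patient_id": 4, "a1c": 6.4, "gender": "F"},
]
missing = proportion_missing(patients)
for column, fraction in missing.items():
    print(f"{column}: {fraction:.0%} missing")
```

Columns with a high missing proportion are natural candidates for imputation or for exclusion before model development.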
Feature Engineering:
convertDateTimeColToDummies has been renamed to splitOutDateTimeCols. A shell function that issues a warning remains until the following release.
splitOutDateTimeCols converts a date variable into dummy columns of day, hour, etc., for modeling seasonal patterns.
countDaysSinceFirstDate shows the number of days since the first date in the input column.
groupedLOCF carries the last observed value forward. This is an imputation method for longitudinal data.
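To make the last-observation-carried-forward idea concrete, here is a minimal plain-Python sketch of grouped LOCF. It is not the package's groupedLOCF code; the function name `grouped_locf`, its arguments, and the toy visit data are hypothetical, and the sketch assumes rows are already sorted by time within each group.

```python
def grouped_locf(rows, group_key, value_keys):
    """Carry the last observed value forward within each group.

    Assumes `rows` is sorted by time within each group and that None marks a
    missing value. Leading missing values stay missing because there is no
    earlier observation to carry forward. The input rows are not modified.
    """
    last_seen = {}  # (group, column) -> most recent non-missing value
    filled = []
    for row in rows:
        new_row = dict(row)
        for key in value_keys:
            slot = (row[group_key], key)
            if new_row[key] is None:
                new_row[key] = last_seen.get(slot)
            else:
                last_seen[slot] = new_row[key]
        filled.append(new_row)
    return filled


# Hypothetical longitudinal data: repeated visits per patient.
visits = [
    {"patient_id": 1, "visit": 1, "weight": 80.0},
    {"patient_id": 1, "visit": 2, "weight": None},
    {"patient_id": 2, "visit": 1, "weight": None},
    {"patient_id": 2, "visit": 2, "weight": 65.5},
    {"patient_id": 2, "visit": 3, "weight": None},
]
filled = grouped_locf(visits, "patient_id", ["weight"])
weights = [row["weight"] for row in filled]
```

Grouping matters here: without it, patient 1's weight would leak into patient 2's first visit.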
Develop a machine learning model
Models:
LassoDevelopment: Used for regression or classification. It does an especially good job when there are a lot of variables.
RandomForestDevelopment: Used for regression or classification and is well suited to non-linear data.
XGBoostDevelopment: Used for multi-class classification (problems where there are more than two classes). Well suited to non-linear data.
LinearMixedModelDevelopment: Best suited for longitudinal data and datasets with fewer than 100k rows and 50 variables. Can do classification or regression.
Performance of Trained Models:
The area under the ROC curve or the area under the precision-recall curve is used to evaluate the performance of classification models.
The mean squared error (MSE) and root mean squared error (RMSE) are used to evaluate the performance of regression models.
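For readers who want to see what these metrics measure, the following plain-Python sketch computes the area under the ROC curve, MSE, and RMSE from their standard definitions. It is an illustration only, not how healthcareai computes them internally; the function names and example numbers are made up for this sketch.

```python
import math


def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    This is the probability that a randomly chosen positive case (label 1)
    receives a higher score than a randomly chosen negative case (label 0),
    counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the units of the outcome."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical classification example: three of four pairs are ranked correctly.
auc_value = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical regression example: one prediction is off by 4.
mse_value = mse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
rmse_value = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking; for MSE and RMSE, lower is better.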
Deploy and Monitor the Machine Learning Model
Deploy the Model:
Use LassoDeployment, LinearMixedModelDeployment, RandomForestDeployment, or XGBoostDeployment to load the model from development and predict against test data. The deployments can be tested locally, but should eventually live on the production server.
Use writeData to push the predicted values into a SQL environment.
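The general pattern of pushing predictions into a SQL table can be sketched as follows. This example uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the production SQL environment; it does not show writeData's interface, and the table name, column names, and `write_predictions` helper are hypothetical.

```python
import sqlite3


def write_predictions(conn, predictions):
    """Insert (patient_id, predicted_risk) pairs into a predictions table.

    The table and column names are assumptions made for this sketch.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "patient_id INTEGER PRIMARY KEY, predicted_risk REAL NOT NULL)"
    )
    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany(
        "INSERT INTO predictions (patient_id, predicted_risk) VALUES (?, ?)",
        predictions,
    )
    conn.commit()


# In-memory SQLite database standing in for the real SQL environment.
conn = sqlite3.connect(":memory:")
write_predictions(conn, [(1, 0.82), (2, 0.15), (3, 0.40)])
rows = conn.execute(
    "SELECT patient_id, predicted_risk FROM predictions ORDER BY patient_id"
).fetchall()
```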
Monitor the Model:
generateAUC is used to monitor performance over time. This should happen only after the predictions can be validated against the actual outcomes. For example, if you're predicting 30-day readmissions, you can't validate until 30 days have passed since the predictions were made.
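The timing constraint above can be expressed as a simple filter: only predictions whose outcome window has fully elapsed are eligible for validation. The plain-Python sketch below illustrates that rule; the `validatable` function, the `predicted_on` field, and the dates are hypothetical and are not part of healthcareai.

```python
from datetime import date, timedelta


def validatable(predictions, as_of, outcome_window_days=30):
    """Return the predictions old enough for their outcome to be known.

    A prediction made on day D for a 30-day outcome can be validated only
    once D + 30 days has been reached.
    """
    cutoff = as_of - timedelta(days=outcome_window_days)
    return [p for p in predictions if p["predicted_on"] <= cutoff]


# Hypothetical prediction log.
predictions = [
    {"patient_id": 1, "predicted_on": date(2017, 1, 1), "risk": 0.82},
    {"patient_id": 2, "predicted_on": date(2017, 1, 20), "risk": 0.15},
    {"patient_id": 3, "predicted_on": date(2017, 2, 10), "risk": 0.40},
]
ready = validatable(predictions, as_of=date(2017, 2, 20))
ready_ids = [p["patient_id"] for p in ready]
```

Only the predictions returned by such a filter should be joined to actual outcomes and scored; including newer predictions would count outcomes that have not yet had time to occur as negatives.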