healthcareai provides a clean interface to create and compare multiple models on your data and then deploy the model that is most accurate. healthcareai also includes functions for data exploration, data cleaning, and model evaluation.
healthcareai follows a three-step process: first, load, profile, and engineer features; second, develop a model; third, deploy and monitor the model.
Load and profile data
Loading Data:
Use selectData to pull data directly from a SQL database.
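A minimal sketch of pulling data with selectData, assuming the v1-era API in which it takes a connection string and a query; the server, database, and table names below are purely illustrative.

```r
library(healthcareai)

# Illustrative connection string -- substitute your own server and database
connection.string <- "driver={SQL Server};
                      server=localhost;
                      database=SAM;
                      trusted_connection=true"

# Illustrative query -- pull only the columns you plan to profile or model
query <- "SELECT PatientEncounterID, SystolicBPNBR, LDLNBR, A1CNBR
          FROM [SAM].[dbo].[HCRDiabetesClinical]"

df <- selectData(connection.string, query)
head(df)
```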
Profiling and Analyzing Data - One can get quite far in healthcare data analysis without ever going beyond this step:
featureAvailabilityProfiler will find how much data is
present in each variable over time.
countMissingData finds the proportion of missing data in
each variable.
findVariation and variationAcrossGroups are
used to find variation within and across subgroups of data.
findTrends finds trends lasting six months or longer.
RiskAdjustedComparisons compares groups in a
risk-adjusted fashion. See
Risk Adjustment
for a general introduction.
calculateTargetedCorrelations will calculate correlations
for all numeric columns and a specified variable of interest.
returnColsWithMoreThanFiftyCategories shows which
categorical columns have more than 50 categories.
KmeansClustering is used to cluster data, with or without an
outcome variable.
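A few of the profiling functions above, sketched against a data frame df like the one loaded with selectData; the target-column argument name is an assumption, so check each function's help page for the exact signature.

```r
# Proportion of missing data in each variable
countMissingData(df)

# Categorical columns with more than 50 categories
returnColsWithMoreThanFiftyCategories(df)

# Correlations between every numeric column and one variable of interest
# (the targetCol argument name is illustrative)
calculateTargetedCorrelations(df, targetCol = "A1CNBR")
```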
Feature Engineering:
convertDateTimeColToDummies has been renamed to
splitOutDateTimeCols; a deprecated shell with a warning remains until the following release.
splitOutDateTimeCols converts a date variable
into dummy columns of day, hour, etc., which is useful for modeling seasonal patterns.
countDaysSinceFirstDate counts the days since the first date in the
input column.
groupedLOCF carries last observed value forward. This is
an imputation method for longitudinal data.
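The feature-engineering steps above, sketched as one pipeline; the column names and the exact argument shapes (which column to split, which column defines the groups for LOCF) are assumptions, not the confirmed signatures.

```r
# Split a datetime column into day, hour, etc. dummy columns
# for seasonal pattern modeling (column name is illustrative)
df <- splitOutDateTimeCols(df, "AdmitDTS")

# Days elapsed since the first date in the input column
countDaysSinceFirstDate(df, "AdmitDTS")

# Carry the last observed value forward within each patient --
# an imputation method for longitudinal data
# (grouping column is illustrative)
df <- groupedLOCF(df, "PatientEncounterID")
```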
Develop a machine learning model
Models:
LassoDevelopment: Used for regression or classification;
performs especially well when there are many variables.
RandomForestDevelopment: Used for regression or
classification and is well suited to non-linear data.
XGBoostDevelopment: Used for multi-class classification
(problems where there are more than 2 classes). Well suited to non-linear
data.
LinearMixedModelDevelopment: Best suited to longitudinal
data and datasets with fewer than 100k rows and 50 variables. Can do
classification or regression.
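Developing and comparing two of the models above might look like the following sketch. It assumes the v1-era pattern of filling in a shared SupervisedModelDevelopmentParams object and passing it to each R6 development class; the field names and outcome/grain columns are illustrative and may differ in your version.

```r
library(healthcareai)

# Shared development parameters (field names follow the v1-era API)
p <- SupervisedModelDevelopmentParams$new()
p$df <- df
p$type <- "classification"
p$predictedCol <- "ThirtyDayReadmitFLG"  # illustrative outcome column
p$grainCol <- "PatientEncounterID"       # illustrative row-ID column
p$impute <- TRUE
p$cores <- 1

# Train two candidate models on the same parameters and compare
# their reported performance before choosing one to deploy
lasso <- LassoDevelopment$new(p)
lasso$run()

rf <- RandomForestDevelopment$new(p)
rf$run()
```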
Performance of Trained Models:
Area under the ROC curve or area under the precision-recall curve is used to evaluate the performance of classification models.
Mean squared error (MSE) and root mean squared error (RMSE) are used to evaluate the performance of regression models.
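For reference, the regression metrics reduce to a one-liner in base R, given vectors of actual and predicted values:

```r
# MSE: mean of squared prediction errors; RMSE: its square root,
# which is in the same units as the outcome
mse  <- mean((actual - predicted)^2)
rmse <- sqrt(mse)
```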
Deploy and Monitor the Machine Learning Model
Deploy the Model:
Use LassoDeployment,
LinearMixedModelDeployment,
RandomForestDeployment, or XGBoostDeployment to
load the model from development and predict against test data. The
deployments can be tested locally, but eventually live on the production
server.
Use writeData to push the predicted values into a SQL
environment.
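A deployment sketch following the same pattern as development; the SupervisedModelDeploymentParams fields, the method for retrieving the prediction data frame, the writeData argument names, and the destination table are all assumptions to be checked against your version's help pages.

```r
library(healthcareai)

# Deployment parameters mirror development (field names are illustrative)
p <- SupervisedModelDeploymentParams$new()
p$df <- df
p$type <- "classification"
p$predictedCol <- "ThirtyDayReadmitFLG"
p$grainCol <- "PatientEncounterID"
p$impute <- TRUE

dep <- RandomForestDeployment$new(p)
dep$deploy()

# Push predicted values into SQL (argument names and table are illustrative)
predictions <- dep$getOutDf()
writeData(connection.string, predictions, "dbo.Predictions")
```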
Monitoring the model:
generateAUC is used to monitor performance over time.
This should happen once predictions can be validated against observed outcomes: if
you're predicting 30-day readmissions, you can't validate until 30 days have
passed since the predictions were made.
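Once the outcomes have come in, the monitoring call might look like the sketch below; the argument names and the ROC/precision-recall selector value are assumptions, so consult ?generateAUC for the exact signature.

```r
# Compare stored predicted probabilities against observed outcomes
# (column and argument names are illustrative)
generateAUC(predictions = df$PredictedProbNBR,
            labels = df$ThirtyDayReadmitFLG,
            aucType = "SS")  # ROC-style AUC; use the PR option for precision-recall
```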