Learn R Programming

healthdb

The goal of ‘healthdb’ is to provide a set of tools for identifying diseases or events from healthcare database and preparing data for epidemiological studies. It features abilities that are not natively support by database, such as matching strings by ‘stringr’ style regular expression and using ‘LIKE’ operator with multiple patterns in a vector. Three types of functions are included: interactive functions – for customizing complex definitions; call building functions – for batch execution of simple definition; miscellaneous functions – for data wrangling, computing age and comorbidity index, etc.

The package is tested only on SQL Server and SQLite as we do not have access to other SQL dialects. Please report bugs if you encounter issues with other dialects.

Administrative health data are often stored on SQL database with strict security measures which may disable permission to write temporary tables. Writing queries without being able to cache intermediate results is challenging, especially when the data is too large to be downloaded from database into R (i.e., local memory) without some filtering process.

This package leverages ‘dbplyr’, particularly its ability to chain subqueries, in order to implement a common disease definition as a one-shot big query. Outputs are fully compatible with ‘dplyr’ functions.

Common disease definitions often are in the form of having n primary care/hospitalization/prescription records with some International Classification of Diseases (ICD) codes within some time span. See below for an example of implementing such case definition.

Installation

Install from CRAN:

install.packages("healthdb")

You could also install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("KevinHzq/healthdb")

Usage

We are going to implement the following case definition:

One or more hospitalization with a substance use disorder (SUD) ICD-9 diagnostic code, OR Two or more physician claims with a substance use disorder ICD-10 diagnostic code within one year.

Before we get started, please see how to connect to a database and how to write query with ‘dbplyr’ if you don’t have experience of working with database in R.

First, let’s make a demo data sets for the two sources:

Physician claims

library(healthdb)
library(tidyverse)

# make_test_dat() makes either a toy data.frame or database table in memory with known number of rows that satisfy the query we will show later
claim_db <- make_test_dat(vals_kept = c("303", "304", "305", "291", "292", str_glue("30{30:59}"), str_glue("29{10:29}"), noise_val = c("999", "111")), type = "database")

# this is a database table
# note that in-memory SQLite database stores dates as numbers
claim_db %>% head()
#> # Source:   SQL [?? x 6]
#> # Database: sqlite 3.51.0 [:memory:]
#>     uid clnt_id dates diagx diagx_1 diagx_2
#>   <int>   <int> <dbl> <chr> <chr>   <chr>  
#> 1    59       1 16650 999   <NA>    <NA>   
#> 2    14       1 17100 3046  3058    <NA>   
#> 3    65       1 17381 999   <NA>    <NA>   
#> 4    19       1 17948 2916  2915    999    
#> 5    71       1 18479 999   999     999    
#> 6    66       2 16553 999   <NA>    999

Hospitalization

hosp_df <- make_test_dat(vals_kept = c(str_glue("F{10:19}"), str_glue("F{100:199}"), noise_val = "999"), type = "data.frame")

# this is a local data.frame/tibble
hosp_df %>% head()
#>   uid clnt_id      dates diagx diagx_1 diagx_2
#> 1  96       1 2016-01-04   999     999     999
#> 2 100       1 2020-09-15   999    <NA>    <NA>
#> 3  79       1 2020-12-26   999     999    <NA>
#> 4  15       2 2019-08-24  F185    F170    <NA>
#> 5  48       2 2020-02-12  F102    F138    F180
#> 6  10       3 2016-02-06  F163    F174     999

Here’s how you could use healthdb to implement the SUD definition above:

  1. Identify rows contains the target codes in the claim database

    result1 <- claim_db %>%
      identify_row(
    vars = starts_with("diagx"),
    match = "start",
    vals = c(291:292, 303:305)
      )
    #> ℹ Identify records with condition(s):
    #> • where at least one of the diagx, diagx_1, diagx_2 column(s) in each record
    #> • contains a value satisfied SQL LIKE pattern: 291% OR 292% OR 303% OR 304% OR 305%
    #> ℹ To see the final query generated by 'dbplyr', use dplyr::show_query() on the output.
    #> To extract the SQL string, use dbplyr::remote_query().
  2. Restrict the number of records per client

    result2 <- result1 %>% restrict_n(
      clnt_id = clnt_id,
      n_per_clnt = 2,
      count_by = dates,
      # here we use filter mode to remove records that failed the restriction
      mode = "filter"
    )
    #> ℹ Apply restriction that each client must have at least 2 records with distinct
    #> dates. Clients/groups which did not met the condition were excluded.
    result2 %>% head()
    #> # Source:   SQL [?? x 7]
    #> # Database: sqlite 3.51.0 [:memory:]
    #>     uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n
    #>   <int>   <int> <dbl> <chr> <chr>   <chr>             <int>
    #> 1    14       1 17100 3046  3058    <NA>                  1
    #> 2    19       1 17948 2916  2915    999                   1
    #> 3    36       3 17235 3047  3037    999                   1
    #> 4    37       3 17984 3055  292     <NA>                  1
    #> 5    16       3 18169 3051  999     <NA>                  1
    #> 6    49       5 17935 3051  2922    3052                  1
  3. Restrict the temporal pattern of diagnoses

    result3 <- result2 %>% restrict_date(
      clnt_id = clnt_id,
      date_var = dates,
      n = 2,
      within = 365,
      uid = uid,
      # here we use flag mode to flag records that met the restriction instead of removing those
      mode = "flag"
    )
    #> ℹ Apply restriction that each client must have 2 records that were within 365
    #> days. Records that met the condition were flagged.
    result3 %>% head()
    #> # Source:   SQL [?? x 8]
    #> # Database: sqlite 3.51.0 [:memory:]
    #>     uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n flag_restrict_date
    #>   <int>   <int> <dbl> <chr> <chr>   <chr>             <int>              <int>
    #> 1    14       1 17100 3046  3058    <NA>                  1                  0
    #> 2    19       1 17948 2916  2915    999                   1                  0
    #> 3    36       3 17235 3047  3037    999                   1                  0
    #> 4    37       3 17984 3055  292     <NA>                  1                  1
    #> 5    16       3 18169 3051  999     <NA>                  1                  0
    #> 6    49       5 17935 3051  2922    3052                  1                  1
  4. Repeat these steps for hospitalization and row bind the results.

The output of these functions, including identify_row(), exclude(), restrict_n(), restrict_date() and more, can be piped into ‘dplyr’ functions for further manipulations. Therefore, wrangling with them along with ‘dplyr’ provide the maximum flexibility for implementing complex algorithms. However, your code could look repetitive if multiple data sources were involved. See the introduction vignette (vignette("healthdb")) for a much more concise way to work with multiple sources and definitions (the ‘Call-building functions’ section).

Copy Link

Version

Install

install.packages('healthdb')

Monthly Downloads

361

Version

0.5.0

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Kevin Hu

Last Published

January 8th, 2026

Functions in healthdb (0.5.0)

fetch_var

Get variables from multiple tables with common ID columns
lookup

Find value corresponding to the input vector using a look-up table
identify_row

Identify rows with a match
healthdb-package

healthdb: Working with Healthcare Databases
restrict_n

Remove or flag groups with n less than some number
cut_period

Cut the time period in one row into multiple rows by interval
execute_def

Execute parameterized case definitions
compute_comorbidity

Compute Elixhauser Comorbidity Index
compute_duration

Compute duration between two dates
define_case_with_age

Identify diseases/events from administrative records with age restriction
exclude

Remove rows based on conditions or another data set
collapse_episode

Group records no more than n days apart as episodes
define_case

Identify diseases/events from administrative records
build_def

Build case definition function calls
bind_source

Row-bind a list of data.frames or remote tables with variable selection and renaming
if_date

Interpret if any n elements drawn from a date vector could be some days apart within some time span
report_n

Report number of distinct value in a column across data frames
pool_case

Pool qualified clients from results of multiple definitions
restrict_date

Remove or flag groups failed to meet conditions based on dates
%>%

Pipe operator
make_test_dat

Make test data