clinpubr: Clinical Publication

Overview

clinpubr is an R package designed to streamline the workflow from clinical data processing to publication-ready outputs. It provides tools for clinical data cleaning, significant result screening, and generating tables/figures suitable for medical journals.

Key Features

  • Clinical Data Cleaning: Functions to handle missing values, standardize units, convert dates, and clean numerical/categorical variables.
  • Result Screening: Screen regression and interaction analyses across common variable transformations to identify key findings.
  • Publication-Ready Outputs: Generate baseline characteristic tables, forest plots, RCS curves, and other visualizations formatted for medical publications.

Installation

You can install clinpubr from CRAN with:

install.packages("clinpubr")

Optional Dependencies

Some functions require additional packages for full functionality. The package will automatically prompt you to install missing packages when needed. If you want to install the package with all dependencies, you can use:

install.packages("clinpubr", dependencies = TRUE)
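
To check ahead of time whether optional packages are present, a base-R sketch like the following may help. It relies only on `requireNamespace()`. The package names are examples taken from this page (`survival` and `knitr`), not a list of clinpubr's actual optional dependencies.

```r
# Example package names only; substitute the ones a function asks for.
optional_pkgs <- c("survival", "knitr")

# requireNamespace() returns TRUE/FALSE without attaching the package.
available <- vapply(optional_pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Report anything that is not installed yet.
missing_pkgs <- names(available)[!available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```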

Basic Usage

Cleaning Tools

Example 1.1: Generate Data Overview and Cleaning Recommendations

library(clinpubr)

# Sample messy data with various quality issues
messy_data <- data.frame(
  id = 1:15,
  # Numeric with outliers
  bmi = c(
    22.5, 23.1, 24.2, 21.8, 25.0, 23.5, 999, 24.1, 22.9, 23.8,
    21.5, 24.3, 23.0, 22.7, 23.9
  ),
  # Character with case inconsistency
  city = c(
    "Beijing", "BEIJING", "beijing", "Shanghai", "SHANGHAI",
    "Guangzhou", "chengdu", "CHENGDU", "Shenzhen", "shenzhen",
    "Beijing", "Shanghai", "Guangzhou", "Chengdu", "Shenzhen"
  ),
  # Numeric with negative values in predominantly positive
  height = c(
    1.75, 1.80, 1.65, 1.70, 1.85, 1.78, 1.68, 1.72, 1.76, 1.82,
    1.60, 1.62, 1.74, 179, -1
  ),
  # Date with suspicious year
  visit_date = as.Date(c(
    "2020-01-15", "2020-02-20", "2020-03-10", "2019-05-18", "2020-06-22",
    "2018-07-30", "2020-08-12", "2020-09-25", "2020-10-08", "2020-11-15",
    "2020-12-20", "1900-01-01", "2030-02-28", "2020-03-15", "2020-04-20"
  )),
  # Numeric stored as character
  age = c(
    "25", "26", "27", "28", "29", "30", "31", "32", "33", "34",
    "35", "unknown", "36", "37", "38"
  ),
  stringsAsFactors = FALSE
)

overview <- data_overview(messy_data)
#> === Data Overview Summary ===
#> Dataset: 15 rows, 6 columns
#> 
#> Variable Types:
#>   numeric   : 3 variables
#>   character : 2 variables
#>   date      : 1 variables
#> 
#> Found 6 potential quality issues:
#>   numeric_as_character     : 1 cases
#>   outliers                 : 2 cases
#>   negative_in_positive     : 1 cases
#>   suspicious_dates         : 1 cases
#>   case_issues              : 1 cases
#> 
#> Recommendations:
#>   - Consider converting these character variables to numeric: age
#>   - Review outliers in these numeric variables: bmi, height
#>   - Numeric variables with mostly positive values but containing negatives: height
#>   - Review suspicious dates (year < 1910 or > current year) in: visit_date
#>   - These character variables have case inconsistency issues: city - consider standardizing to lowercase or uppercase

print(overview$quality_issues$case_issues)
#> $city
#> $city$n_original
#> [1] 11
#> 
#> $city$n_normalized
#> [1] 5
#> 
#> $city$reduction
#> [1] 6
#> 
#> $city$examples
#> $city$examples$beijing
#> [1] "Beijing" "BEIJING" "beijing"
#> 
#> $city$examples$shanghai
#> [1] "Shanghai" "SHANGHAI"
#> 
#> $city$examples$chengdu
#> [1] "chengdu" "CHENGDU" "Chengdu"
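
The counts reported for `city` can be reproduced with base R, which also suggests one way to act on two of the recommendations. This is a sketch using only base functions; it is not how `data_overview()` is implemented, and lowercasing is just one possible normalization.

```r
city <- c(
  "Beijing", "BEIJING", "beijing", "Shanghai", "SHANGHAI",
  "Guangzhou", "chengdu", "CHENGDU", "Shenzhen", "shenzhen",
  "Beijing", "Shanghai", "Guangzhou", "Chengdu", "Shenzhen"
)
age <- c(
  "25", "26", "27", "28", "29", "30", "31", "32", "33", "34",
  "35", "unknown", "36", "37", "38"
)

# Distinct spellings before and after lowercasing
n_original <- length(unique(city))
n_normalized <- length(unique(tolower(city)))
print(c(
  n_original = n_original,
  n_normalized = n_normalized,
  reduction = n_original - n_normalized
))

# Standardize case, then convert age; non-numeric entries become NA
city_clean <- tolower(city)
age_num <- suppressWarnings(as.numeric(age))
print(which(is.na(age_num))) # position of the entry that failed to convert
```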

Example 1.2: Screen Multi-Table Cohort by Entry and Anchor Rules

patient <- data.frame(pid = 1:4)
admission <- data.frame(
  pid = c(1, 1, 2, 3, 4),
  vid = c(11, 12, 21, 31, 41),
  admit_day = c(1, 5, 2, 3, 4)
)
diagnosis <- data.frame(
  pid = c(1, 2, 3, 4),
  vid = c(11, 21, 31, 41),
  dx_day = c(1, 2, 3, 4),
  icd = c("I10", "I10", "J18", "I11")
)
lab <- data.frame(
  pid = c(1, 1, 2, 2, 3, 4),
  vid = c(11, 12, 21, 21, 31, 41),
  lab_day = c(1, 5, 2, 5, 3, 4),
  Hb = c(9.8, 10.6, 10.7, 5, 8.9, 9.1)
)

# Keep patients with any I10 diagnosis, then keep records from first Hb > 10 onward, and join tables together
res <- screen_data_list(
  data_list = list(patient = patient, admission = admission, diagnosis = diagnosis, lab = lab),
  entry_expr = any(icd == "I10"),
  entry_level = "patient_id",
  anchor_expr = any(Hb > 10),
  anchor_level = "visit_id",
  anchor_window = "from_first_anchor",
  patient_id_map = "pid",
  visit_id_map = "vid",
  date_map = c(admission = "admit_day", diagnosis = "dx_day", lab = "lab_day"),
  output = "joined"
)

knitr::kable(res)
| patient_id| visit_id| date|icd |   Hb|
|----------:|--------:|----:|:---|----:|
|          1|       12|    5|NA  | 10.6|
|          2|       21|    2|I10 | 10.7|
|          2|       21|    5|NA  |  5.0|
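
To see why these three rows survive, the entry and anchor rules can be traced by hand in base R. The sketch below follows one reading of the rules (keep patients with any I10 diagnosis, then keep lab records on or after the first day with Hb > 10). It is not the implementation of `screen_data_list()`, and it covers only the `lab` table.

```r
diagnosis <- data.frame(
  pid = c(1, 2, 3, 4),
  icd = c("I10", "I10", "J18", "I11")
)
lab <- data.frame(
  pid = c(1, 1, 2, 2, 3, 4),
  vid = c(11, 12, 21, 21, 31, 41),
  lab_day = c(1, 5, 2, 5, 3, 4),
  Hb = c(9.8, 10.6, 10.7, 5, 8.9, 9.1)
)

# Entry rule: patients with any I10 diagnosis
entry_pids <- unique(diagnosis$pid[diagnosis$icd == "I10"])
lab_in <- lab[lab$pid %in% entry_pids, ]

# Anchor rule: first day on which Hb > 10, per patient
anchor_rows <- lab_in[lab_in$Hb > 10, ]
first_anchor <- tapply(anchor_rows$lab_day, anchor_rows$pid, min)

# Window: keep records from the first anchor day onward
keep <- lab_in$lab_day >= first_anchor[as.character(lab_in$pid)]
kept <- lab_in[keep, ]
print(kept)
```

Patient 1's first visit (Hb 9.8 on day 1) falls before the anchor and is dropped, while patient 2's later Hb of 5.0 is kept because it comes after the anchor.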

Example 1.3: Standardize Values in Medical Records

# Sample messy data
messy_data <- data.frame(values = c("12.3", "0..45", "  67 ", "", "abandon"))
clean_data <- value_initial_cleaning(messy_data$values)
print(clean_data)
#> [1] "12.3"    "0.45"    "67"      NA        "abandon"
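
For intuition, the same result on this input can be obtained with a few base-R string operations. This is only a sketch of the kind of cleaning involved; `value_initial_cleaning()` may apply other rules that this input does not exercise.

```r
x <- c("12.3", "0..45", "  67 ", "", "abandon")

cleaned <- trimws(x)                     # drop leading/trailing spaces
cleaned <- gsub("\\.{2,}", ".", cleaned) # collapse repeated decimal points
cleaned[cleaned == ""] <- NA             # treat empty strings as missing
print(cleaned)
```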

Example 1.4: Check Non-numerical Values

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(check_nonnum(x))
#> [1] "1.2(XXX)" "5-8POS"   "NS"       "FULL"

This function returns the values that cannot be read as numbers, which helps you decide how to handle them.
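
A rough base-R equivalent, assuming "non-numerical" means "fails `as.numeric()`", could look like this. It is a sketch for comparison rather than the function's actual logic.

```r
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")

# Values that as.numeric() cannot parse become NA (warning suppressed)
as_num <- suppressWarnings(as.numeric(x))
non_numeric <- x[is.na(as_num) & !is.na(x)]
print(non_numeric)
```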

Example 1.5: Extracting Numerical Values from Text

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(extract_num(x))
#> [1] 1.20 1.50 0.82 5.00   NA   NA

print(extract_num(x,
  res_type = "first", # Extract the first number
  multimatch2na = TRUE, # Convert illegal multiple matches to NA
  zero_regexp = "NEG|NS", # Convert "NEG" and "NS" (matched using regex) to 0
  max_regexp = "FULL", # Convert "FULL" (matched using regex) to some specified quantile
  max_quantile = 0.95
))
#> [1] 1.20 1.50 0.82   NA 0.00 1.47
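
The two outputs above can be approximated with base-R regular expressions. The sketch below pulls the first number from each string, counts how many numbers each string contains (the situation `multimatch2na` guards against), and shows that the 0.95 quantile of the three cleanly parsed values is 1.47, which is consistent with the value substituted for "FULL". The regular expression is an assumption for illustration, not the pattern used by `extract_num()`.

```r
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
num_pattern <- "[0-9]+(\\.[0-9]+)?" # assumed pattern: digits with optional decimals

# First number in each string, NA where nothing matches
m <- regexpr(num_pattern, x)
first_num <- rep(NA_real_, length(x))
first_num[m > 0] <- as.numeric(regmatches(x, m))
print(first_num)

# How many numbers each string contains; "5-8POS" has two
n_matches <- lengths(regmatches(x, gregexpr(num_pattern, x)))
print(n_matches)

# 0.95 quantile of the unambiguous values
q95 <- unname(quantile(c(1.2, 1.5, 0.82), 0.95))
print(q95)
```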

Other Cleaning Functions

  • to_date(): Convert text to dates; handles mixed formats.
  • unit_view() and unit_standardize(): Provide a pipeline to standardize conflicting units.
  • cut_by(): Split numeric variables into factors, with a variety of splitting options and automatic labeling.
  • And more…

Screening Results to Identify Potential Findings

data(cancer, package = "survival")

# Screening for potential findings with regression models in the cancer dataset
scan_result <- regression_scan(cancer, y = "status", time = "time", save_table = FALSE)
#> Taking all variables as predictors
knitr::kable(scan_result)
|   |predictor | nvalid| original.HR| original.pval| original.padj| logarithm.HR| logarithm.pval| logarithm.padj| categorized.HR| categorized.pval| categorized.padj| rcs.overall.pval| rcs.overall.padj| rcs.nonlinear.pval| rcs.nonlinear.padj|best.var.trans |
|:--|:---------|------:|-----------:|-------------:|-------------:|------------:|--------------:|--------------:|--------------:|----------------:|----------------:|----------------:|----------------:|------------------:|------------------:|:--------------|
|4  |ph.ecog   |    227|   1.6095320|     0.0000269|     0.0002154|           NA|             NA|             NA|             NA|        0.0001530|        0.0012237|               NA|               NA|                 NA|                 NA|original       |
|6  |pat.karno |    225|   0.9803456|     0.0002824|     0.0011296|    0.2709544|      0.0003071|      0.0015356|      0.5755627|        0.0006608|        0.0026431|        0.0025848|        0.0155086|          0.5908952|          0.8863427|original       |
|3  |sex       |    228|   0.5880028|     0.0014912|     0.0039766|           NA|             NA|             NA|      0.5880028|        0.0014912|        0.0039766|               NA|               NA|                 NA|                 NA|categorized    |
|5  |ph.karno  |    227|   0.9836863|     0.0049579|     0.0099157|    0.3184168|      0.0079468|      0.0198669|      0.6352465|        0.0077670|        0.0155339|        0.0128462|        0.0385385|          0.2307961|          0.6848245|original       |
|2  |age       |    228|   1.0188965|     0.0418531|     0.0669650|    3.0256773|      0.0466926|      0.0778209|      1.1440790|        0.3910647|        0.3957558|        0.0825447|        0.1650894|          0.3424123|          0.6848245|original       |
|1  |inst      |    227|   0.9903692|     0.3459838|     0.4613117|    0.9292046|      0.3181432|      0.3976790|      0.8384047|        0.2600040|        0.3466720|        0.8175277|        0.8707131|          0.9839705|          0.9839705|categorized    |
|7  |meal.cal  |    181|   0.9998762|     0.5929402|     0.6776459|    0.9141580|      0.6128095|      0.6128095|      0.8620604|        0.3957558|        0.3957558|        0.8707131|        0.8707131|          0.8227256|          0.9839705|categorized    |
|8  |wt.loss   |    214|   1.0013201|     0.8281974|     0.8281974|           NA|             NA|             NA|      1.3190185|        0.0909098|        0.1454557|        0.1128907|        0.1693361|          0.0514936|          0.3089618|rcs.nonlinear  |
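
The `padj` columns adjust each set of p-values for the number of predictors screened. The `original.padj` values in this output appear consistent with a Benjamini-Hochberg adjustment across the eight predictors; that is inferred from the numbers, not stated by the package. The sketch below applies base R's `p.adjust()` to the `original.pval` column as printed.

```r
# original.pval values copied from the scan output (rounded to 7 decimals)
p_original <- c(
  ph.ecog = 0.0000269, pat.karno = 0.0002824, sex = 0.0014912,
  ph.karno = 0.0049579, age = 0.0418531, inst = 0.3459838,
  meal.cal = 0.5929402, wt.loss = 0.8281974
)

# Benjamini-Hochberg adjustment (assumed method, see note above)
padj <- p.adjust(p_original, method = "BH")
print(round(padj, 7))
```

Small differences in the last digit against the table come from using the rounded p-values as input.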

Generating Publication-Ready Tables and Figures

Example 3.1: Count Excluded Samples at Each Step

cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)
res <- exclusion_count(
  cohort,
  age < 18,
  is.na(value),
  dementia == TRUE,
  .criteria_names = c(
    "Age < 18 years",
    "Missing value",
    "History of dementia"
  )
)
#> Warning in exclusion_count(cohort, age < 18, is.na(value), dementia == TRUE, :
#> Criterion 'Age < 18 years' resulted in NA values. These rows have been excluded
#> by default. Consider adding an explicit check for missing values (e.g.,
#> is.na(variable)) as a preceding criterion.
knitr::kable(res) # Display the table
|Criteria            |  N|
|:-------------------|--:|
|Initial N           |  6|
|Age < 18 years      |  2|
|Missing value       |  2|
|History of dementia |  1|
|Final N             |  1|
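
The counts are sequential: each criterion is applied only to the rows that survived the previous ones, and rows where a criterion evaluates to NA are excluded at that step, as the warning describes. A base-R sketch of that bookkeeping, written independently of the package's implementation, reproduces the table:

```r
cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)
criteria <- list(
  "Age < 18 years" = quote(age < 18),
  "Missing value" = quote(is.na(value)),
  "History of dementia" = quote(dementia == TRUE)
)

remaining <- cohort
counts <- c("Initial N" = nrow(cohort))
for (nm in names(criteria)) {
  hit <- eval(criteria[[nm]], remaining) # evaluate within remaining rows
  hit[is.na(hit)] <- TRUE                # NA results count as excluded
  counts[nm] <- sum(hit)
  remaining <- remaining[!hit, , drop = FALSE]
}
counts["Final N"] <- nrow(remaining)
print(counts)
```

The row with a missing age is counted under "Age < 18 years", which is why that step removes two rows although only one patient is actually under 18.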

Example 3.2: Automatic Type Inference and Baseline Table Generation

var_types <- get_var_types(mtcars, strata = "vs") # Automatically infer variable types
print(var_types)
#> $factor_vars
#> [1] "cyl"  "vs"   "am"   "gear"
#> 
#> $exact_vars
#> [1] "cyl"  "gear"
#> 
#> $nonnormal_vars
#> [1] "drat" "carb"
#> 
#> $omit_vars
#> NULL
#> 
#> $strata
#> [1] "vs"
#> 
#> attr(,"class")
#> [1] "var_types"

tables <- baseline_table(mtcars,
  var_types = var_types, contDigits = 1, save_table = FALSE,
  filename = "baseline.csv", seed = 1 # set seed for simulated fisher exact test
)
knitr::kable(tables$baseline) # Display the table
|                    |Overall        |vs: 0          |vs: 1         |p      |test    |
|:-------------------|:--------------|:--------------|:-------------|:------|:-------|
|n                   |32             |18             |14            |       |        |
|mpg (mean (SD))     |20.1 (6.0)     |16.6 (3.9)     |24.6 (5.4)    |<0.001 |        |
|cyl (%)             |               |               |              |<0.001 |exact   |
|   4                |11 (34.4)      |1 (5.6)        |10 (71.4)     |       |        |
|   6                |7 (21.9)       |3 (16.7)       |4 (28.6)      |       |        |
|   8                |14 (43.8)      |14 (77.8)      |0 (0.0)       |       |        |
|disp (mean (SD))    |230.7 (123.9)  |307.1 (106.8)  |132.5 (56.9)  |<0.001 |        |
|hp (mean (SD))      |146.7 (68.6)   |189.7 (60.3)   |91.4 (24.4)   |<0.001 |        |
|drat (median [IQR]) |3.7 [3.1, 3.9] |3.2 [3.1, 3.7] |3.9 [3.7, 4.1] |0.013  |nonnorm |
|wt (mean (SD))      |3.2 (1.0)      |3.7 (0.9)      |2.6 (0.7)     |0.001  |        |
|qsec (mean (SD))    |17.8 (1.8)     |16.7 (1.1)     |19.3 (1.4)    |<0.001 |        |
|am = 1 (%)          |13 (40.6)      |6 (33.3)       |7 (50.0)      |0.556  |        |
|gear (%)            |               |               |              |0.001  |exact   |
|   3                |15 (46.9)      |12 (66.7)      |3 (21.4)      |       |        |
|   4                |12 (37.5)      |2 (11.1)       |10 (71.4)     |       |        |
|   5                |5 (15.6)       |4 (22.2)       |1 (7.1)       |       |        |
|carb (median [IQR]) |2.0 [2.0, 4.0] |4.0 [2.2, 4.0] |1.5 [1.0, 2.0] |<0.001 |nonnorm |
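
To make the cell contents concrete, the sketch below recomputes the `mpg` row and the `am` p-value with base R. It assumes the `am` p-value comes from a chi-squared test with continuity correction, which matches the printed 0.556 but is an inference from the output rather than a documented detail of `baseline_table()`.

```r
# "mean (SD)" cells with one decimal, overall and by stratum
fmt_mean_sd <- function(v) sprintf("%.1f (%.1f)", mean(v), sd(v))
overall <- fmt_mean_sd(mtcars$mpg)
by_vs <- tapply(mtcars$mpg, mtcars$vs, fmt_mean_sd)
print(overall)
print(by_vs)

# Chi-squared test of am against vs (continuity correction is the default)
p_am <- chisq.test(table(mtcars$am, mtcars$vs))$p.value
print(round(p_am, 3))
```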

Example 3.3: RCS Plot

data(cancer, package = "survival")

# Performing cox regression, which is inferred by `y` and `time`
p <- rcs_plot(cancer, x = "age", y = "status", time = "time", covars = c("sex", "ph.karno"), save_plot = FALSE)
#> Warning in predictor_effect_plot(data = data, x = x, y = y, time = time, : 1
#> incomplete cases excluded.
plot(p)
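
For readers who want to see what sits behind such a curve, here is a rough sketch using `survival::coxph()` with a natural cubic spline from `splines::ns()`. A natural cubic spline is closely related to a restricted cubic spline, but the knot placement, reference value, and plotting here are assumptions chosen for illustration and will not match `rcs_plot()` exactly.

```r
library(survival)
library(splines)
data(cancer, package = "survival")

# Complete cases for the variables in the model (1 row has missing ph.karno)
d <- na.omit(cancer[, c("time", "status", "age", "sex", "ph.karno")])

# Cox model with a spline term for age, adjusted for sex and ph.karno
fit <- coxph(Surv(time, status) ~ ns(age, df = 3) + sex + ph.karno, data = d)

# Contribution of the age term to the linear predictor, per observation
age_term <- predict(fit, type = "terms")[, 1]

# Hazard ratio relative to the observation closest to the median age
ref <- age_term[which.min(abs(d$age - median(d$age)))]
hr <- exp(age_term - ref)

ord <- order(d$age)
plot(d$age[ord], hr[ord],
  type = "l", log = "y",
  xlab = "age", ylab = "Hazard ratio vs. median age"
)
abline(h = 1, lty = 2)
```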

Example 3.4: Interaction Plot

data(cancer, package = "survival")

# Generating interaction plot of both linear and RCS models
p <- interaction_plot(cancer,
  y = "status", time = "time", predictor = "age",
  group_var = "sex", save_plot = FALSE
)
plot(p$lin)
plot(p$rcs)
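
An interaction like this is commonly summarized by a likelihood ratio test comparing models with and without the interaction term. The sketch below does that for the linear case with `survival::coxph()`. Whether `interaction_plot()` computes its p-value this way is an assumption.

```r
library(survival)
data(cancer, package = "survival")

# Nested Cox models: main effects only vs. age-by-sex interaction
fit_main <- coxph(Surv(time, status) ~ age + sex, data = cancer)
fit_int <- coxph(Surv(time, status) ~ age * sex, data = cancer)

# Likelihood ratio statistic; loglik[2] is the fitted model's log-likelihood
lr_stat <- 2 * (fit_int$loglik[2] - fit_main$loglik[2])
p_int <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(c(lr_stat = lr_stat, p_interaction = p_int))
```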

Example 3.5: Regression Forest Plot

data(cancer, package = "survival")
cancer$dead <- cancer$status == 2 # Preparing a binary variable for logistic regression
cancer$`age per 1 sd` <- c(scale(cancer$age)) # Standardizing age

# Performing multivariate logistic regression
p1 <- regression_forest(cancer,
  model_vars = c("age per 1 sd", "sex", "wt.loss"), y = "dead",
  as_univariate = FALSE, save_plot = FALSE
)
plot(p1)

p2 <- regression_forest(
  cancer,
  model_vars = list(
    Crude = c("age per 1 sd"),
    Model1 = c("age per 1 sd", "sex"),
    Model2 = c("age per 1 sd", "sex", "wt.loss")
  ),
  y = "dead",
  save_plot = FALSE
)
plot(p2)
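
The numbers a forest plot displays are odds ratios with confidence intervals. As a sketch of where such values come from, the base-R code below fits the same multivariable logistic model with `glm()` and tabulates odds ratios with Wald intervals. The column name `age_sd` and the choice of Wald intervals are assumptions for illustration; `regression_forest()` may compute its intervals differently.

```r
data(cancer, package = "survival")
cancer$dead <- cancer$status == 2
cancer$age_sd <- c(scale(cancer$age)) # age per 1 SD (hypothetical column name)

fit <- glm(dead ~ age_sd + sex + wt.loss, family = binomial(), data = cancer)

# Odds ratios with Wald 95% confidence intervals, intercept dropped
est <- coef(fit)[-1]
ci <- confint.default(fit)[-1, , drop = FALSE]
or_table <- data.frame(
  term = names(est),
  OR = exp(est),
  lower = exp(ci[, 1]),
  upper = exp(ci[, 2]),
  p = summary(fit)$coefficients[-1, 4],
  row.names = NULL
)
print(or_table)
```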

Example 3.6: Subgroup Forest Plot

data(cancer, package = "survival")
# coxph model with time assigned
p <- subgroup_forest(cancer,
  subgroup_vars = c("age", "sex", "wt.loss"), x = "ph.ecog", y = "status",
  time = "time", covars = "ph.karno", ticks_at = c(1, 2), save_plot = FALSE
)
plot(p)

Example 3.7: Classification Model Performance

# Building models with example data
data(cancer, package = "survival")
df <- kidney
df$dead <- ifelse(df$time <= 100 & df$status == 0, NA, df$time <= 100)
df <- na.omit(df[, -c(1:3)])

model0 <- glm(dead ~ age + frail, family = binomial(), data = df)
model1 <- glm(dead ~ ., family = binomial(), data = df)
df$base_pred <- predict(model0, type = "response")
df$full_pred <- predict(model1, type = "response")

# Generating most of the useful plots and metrics for model comparison
results <- classif_model_compare(df, "dead", c("base_pred", "full_pred"), save_output = FALSE)
#> Assuming 'TRUE' is [Event] and 'FALSE' is [non-Event]

knitr::kable(results$metric_table)
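|   |Model     |AUC                  | PRAUC| Accuracy| Sensitivity| Specificity| Pos Pred Value| Neg Pred Value|    F1| Kappa| Brier| cutoff| Youden| HosLem|
|:--|:---------|:--------------------|-----:|--------:|-----------:|-----------:|--------------:|--------------:|-----:|-----:|-----:|------:|------:|------:|
|2  |full_pred |0.915 (0.847, 0.984) | 0.885|    0.839|         0.8|       0.889|          0.903|          0.774| 0.848| 0.677| 0.114|  0.626|  0.689|  0.944|
|1  |base_pred |0.822 (0.711, 0.933) | 0.766|    0.806|         0.8|       0.815|          0.848|          0.759| 0.824| 0.610| 0.171|  0.490|  0.615|  0.405|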
plot(results$roc_plot)
plot(results$pr_plot)
plot(results$calibration_plot)
plot(results$dca_plot)
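
Several columns of the metric table have simple definitions. The sketch below implements three of them (AUC, Brier score, and the Youden index) in base R on a small made-up example. The toy vectors are illustrative only, and the helper functions are not part of clinpubr.

```r
# Toy labels and predicted probabilities, for illustration only
y <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(y, pred) {
  r <- rank(pred)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Brier score: mean squared difference between prediction and outcome
brier <- function(y, pred) mean((pred - y)^2)

# Youden index at a cutoff: sensitivity + specificity - 1
youden <- function(y, pred, cutoff) {
  sens <- mean(pred[y == 1] >= cutoff)
  spec <- mean(pred[y == 0] < cutoff)
  sens + spec - 1
}

print(auc(y, pred))
print(brier(y, pred))
print(youden(y, pred, cutoff = 0.5))
```

In the table above, the Youden column matches this definition: for `full_pred`, 0.8 + 0.889 - 1 = 0.689.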

Example 3.8: Importance Plot

# Generating a dummy importance vector
set.seed(5)
dummy_importance <- runif(20, 0.2, 0.6)^5
names(dummy_importance) <- paste0("var", 1:20)

# Plotting variable importance, keeping only top 15 and splitting at 10
p <- importance_plot(dummy_importance, top_n = 15, split_at = 10, save_plot = FALSE)
plot(p)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).

Documentation

For detailed usage, refer to the package vignettes (coming soon) or the GitHub repository.

Contributing

Bug reports and feature requests are welcome via the issue tracker.

License

clinpubr is licensed under GPL (>= 3).

Package Information

  • Version: 1.3.0
  • License: MIT + file LICENSE
  • Maintainer: Yue Niu
  • Last Published: March 7th, 2026
  • Monthly Downloads: 509

Functions in clinpubr (1.3.0)

  • exclusion_count(): Count the number of excluded samples at each step
  • df_view_nonnum(): Show non-numeric elements in a data frame
  • merge_by_substring(): Merge data frame by string key matching
  • merge_ordered_vectors(): Merging vectors while maintaining order
  • formula_add_covs(): Add covariates to a formula
  • interaction_scan(): Scan for interactions between variables
  • keep_by_keyword(): Keep string segment by regex keyword position
  • format_pval(): Format p-value for publication
  • get_var_types(): Get variable types for baseline table
  • get_valid_subset(): Get the subset that satisfies the missing rate condition
  • filter_rcs_predictors(): Filter predictors for RCS
  • first_mode(): Calculate the first mode
  • fill_with_last(): Fill NA values with the last valid value
  • max_missing_rates(): Get the maximum missing rate of rows and columns
  • mad_outlier(): Mark possible outliers using different methods
  • common_prefix(): Get common prefix of a string vector
  • get_samples(): Generate a sample of values from a vector and collapse them
  • na2false(): Replace NA values with FALSE
  • na_max(): Safe min and max functions that return NA if all values are NA
  • get_valid(): Get one valid value from vector
  • qq_show(): QQ plot
  • predictor_effect_plot(): Plot the effect of a predictor variable
  • regression_scan(): Scan for significant regression predictors
  • subject_view(): Get an overview of different subjects in data
  • rcs_plot(): Plot restricted cubic spline
  • regression_fit(): Obtain regression results
  • importance_plot(): Importance plot
  • split_multichoice(): Split multi-choice data into columns
  • to_wide(): Fast long-to-wide conversion with item selection
  • screen_data_list(): Screen and join multi-table clinical data by expression
  • unit_standardize(): Standardize units of numeric data
  • str_match_replace(): Match string and replace with corresponding value
  • regression_basic_results(): Basic results of logistic or Cox regression
  • subgroup_forest(): Create subgroup forest plot
  • test_normality(): Test normality of a numeric variable
  • unit_view(): Generate a table of conflicting units
  • replace_elements(): Replacing elements in a vector
  • unmake_names(): Unmake names
  • interaction_p_value(): Calculate interaction p-value
  • indicate_duplicates(): Determine duplicate elements including their first occurrence
  • interaction_plot(): Plot interactions
  • value_initial_cleaning(): Preliminarily cleaning string vectors
  • vec2code(): Generate code from string vector
  • regression_forest(): Forest plot of regression results
  • time_roc_plot(): Calculate and plot time-dependent ROC curves
  • to_date(): Convert numerical or character date to date
  • baseline_table(): Create a baseline table for a dataset
  • calc_cindex(): Calculate C-index for survival data
  • add_lists(): Adding lists element-wise
  • combine_files(): Combine multiple data files into a single data frame
  • check_package(): Check if a package is available and provide helpful error message
  • calculate_index(): Calculate index based on conditions
  • answer_check(): Check answers of multiple choice questions
  • break_at(): Generate breaks for histogram
  • check_nonnum(): Check elements that are not numeric
  • classif_model_compare(): Performance comparison of classification models
  • data_overview(): Data overview and quality check
  • cut_by(): Convert numeric to factor
  • detect_outliers(): Detect outliers in a numeric vector
  • emp_colors(): Default color palette for clinpubr plots
  • extract_num(): Extract numbers from string
  • combine_multichoice(): Combine multi-choice columns into one