clinpubr: Clinical Publication

Overview

clinpubr is an R package designed to streamline the workflow from clinical data processing to publication-ready outputs. It provides tools for clinical data cleaning, significant result screening, and generating tables/figures suitable for medical journals.

Key Features

  • Clinical Data Cleaning: Functions to handle missing values, standardize units, convert dates, and clean numerical/categorical variables.
  • Result Screening: Screen regression and interaction analyses across common variable transformations to identify key findings.
  • Publication-Ready Outputs: Generate baseline characteristic tables, forest plots, RCS curves, and other visualizations formatted for medical publications.

Installation

You can install clinpubr from CRAN with:

install.packages("clinpubr")

Optional Dependencies

Some functions require additional packages for full functionality. The package will automatically prompt you to install missing packages when needed. If you want to install the package with all dependencies, you can use:

install.packages("clinpubr", dependencies = TRUE)
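If you would rather check up front which optional packages are present, a base-R sketch along these lines may help. The names in `optional_pkgs` are placeholders chosen for illustration; they are not clinpubr's actual list of suggested packages.

```r
# Placeholder names for illustration only; see the package DESCRIPTION for the real list.
optional_pkgs <- c("stats", "aPackageThatDoesNotExist")

# requireNamespace() reports availability without attaching the package.
installed <- vapply(optional_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- names(installed)[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

The listed function check_package() appears to serve a similar purpose inside the package, judging by its description.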

Basic Usage

Cleaning Tools

Example 1.1: Standardize Values in Medical Records

library(clinpubr)

# Sample messy data
messy_data <- data.frame(values = c("12.3", "0..45", "  67 ", "", "abandon"))
clean_data <- value_initial_cleaning(messy_data$values)
print(clean_data)
#> [1] "12.3"    "0.45"    "67"      NA        "abandon"

Example 1.2: Check Non-numerical Values

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(check_nonnum(x))
#> [1] "1.2(XXX)" "5-8POS"   "NS"       "FULL"

check_nonnum() returns the values that cannot be read as numbers, which helps you choose an appropriate way to handle them.
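To see what such a check involves, here is a rough base-R equivalent. It is only a sketch of the idea and makes no claim about how check_nonnum() is implemented internally.

```r
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")

# as.numeric() yields NA (with a warning) for anything it cannot parse.
parsed <- suppressWarnings(as.numeric(x))

# Keep the distinct entries that failed to parse but were not missing to begin with.
non_numeric <- unique(x[is.na(parsed) & !is.na(x)])
print(non_numeric)
#> [1] "1.2(XXX)" "5-8POS"   "NS"       "FULL"
```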

Example 1.3: Extracting Numerical Values from Text

# Sample messy data
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")
print(extract_num(x))
#> [1] 1.20 1.50 0.82 5.00   NA   NA

print(extract_num(x,
  res_type = "first", # Extract the first number
  multimatch2na = TRUE, # Convert illegal multiple matches to NA
  zero_regexp = "NEG|NS", # Convert "NEG" and "NS" (matched using regex) to 0
  max_regexp = "FULL", # Convert "FULL" (matched using regex) to some specified quantile
  max_quantile = 0.95
))
#> [1] 1.20 1.50 0.82   NA 0.00 1.47
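For intuition, first-number extraction and keyword mapping can be approximated in base R with regular expressions. The sketch below is an assumption about the general technique, not the package's code; `num_pattern` is a name introduced here, and it ignores signs and scientific notation.

```r
x <- c("1.2(XXX)", "1.5", "0.82", "5-8POS", "NS", "FULL")

# Simple pattern: digits, optionally followed by a decimal part.
num_pattern <- "[0-9]+\\.?[0-9]*"

# regexpr() locates the first match in each element (-1 when there is none).
m <- regexpr(num_pattern, x)
first_num <- rep(NA_character_, length(x))
first_num[m > 0] <- regmatches(x, m)
values <- as.numeric(first_num)

# Map keyword entries such as "NEG" or "NS" to zero.
values[grepl("NEG|NS", x)] <- 0
print(values)
#> [1] 1.20 1.50 0.82 5.00 0.00   NA
```

Unlike the extract_num() call above, this sketch keeps the first number of "5-8POS" and leaves "FULL" as NA.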

Other Cleaning Functions

  • to_date(): Converts text to dates and can handle mixed formats.
  • unit_view() and unit_standardize(): Provide a pipeline for standardizing conflicting units.
  • cut_by(): Splits numeric variables into factors, with a variety of splitting options and automatic labeling.
  • And more…
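To illustrate why mixed date formats need special handling, here is a small base-R sketch that tries several formats in turn. The helper name `parse_mixed_dates` and the format list are assumptions for this example; they do not describe how to_date() works.

```r
# Hypothetical helper: try each format on the entries that are still unparsed.
parse_mixed_dates <- function(x, formats = c("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

raw_dates <- c("2024-01-15", "2024/02/20", "05.03.2024", "not a date")
parsed_dates <- parse_mixed_dates(raw_dates)
print(parsed_dates)
#> [1] "2024-01-15" "2024-02-20" "2024-03-05" NA
```

The order of `formats` matters when a string is ambiguous (for example day-first versus month-first), so it should reflect what is known about the data source.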

Screening Results to Identify Potential Findings

data(cancer, package = "survival")

# Screening for potential findings with regression models in the cancer dataset
scan_result <- regression_scan(cancer, y = "status", time = "time", save_table = FALSE)
#> Taking all variables as predictors
knitr::kable(scan_result)
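
|   | predictor | nvalid | original.HR | original.pval | original.padj | logarithm.HR | logarithm.pval | logarithm.padj | categorized.HR | categorized.pval | categorized.padj | rcs.overall.pval | rcs.overall.padj | rcs.nonlinear.pval | rcs.nonlinear.padj | best.var.trans |
|---|-----------|--------|-------------|---------------|---------------|--------------|----------------|----------------|----------------|------------------|------------------|------------------|------------------|--------------------|--------------------|----------------|
| 4 | ph.ecog   | 227 | 1.6095320 | 0.0000269 | 0.0002154 | NA | NA | NA | NA | 0.0001530 | 0.0012237 | NA | NA | NA | NA | original |
| 6 | pat.karno | 225 | 0.9803456 | 0.0002824 | 0.0011296 | 0.2709544 | 0.0003071 | 0.0015356 | 0.5755627 | 0.0006608 | 0.0026431 | 0.0025848 | 0.0155086 | 0.5908952 | 0.8863427 | original |
| 3 | sex       | 228 | 0.5880028 | 0.0014912 | 0.0039766 | NA | NA | NA | 0.5880028 | 0.0014912 | 0.0039766 | NA | NA | NA | NA | categorized |
| 5 | ph.karno  | 227 | 0.9836863 | 0.0049579 | 0.0099157 | 0.3184168 | 0.0079468 | 0.0198669 | 0.6352465 | 0.0077670 | 0.0155339 | 0.0128462 | 0.0385385 | 0.2307961 | 0.6848245 | original |
| 2 | age       | 228 | 1.0188965 | 0.0418531 | 0.0669650 | 3.0256773 | 0.0466926 | 0.0778209 | 1.1440790 | 0.3910647 | 0.3957558 | 0.0825447 | 0.1650894 | 0.3424123 | 0.6848245 | original |
| 1 | inst      | 227 | 0.9903692 | 0.3459838 | 0.4613117 | 0.9292046 | 0.3181432 | 0.3976790 | 0.8384047 | 0.2600040 | 0.3466720 | 0.8175277 | 0.8707131 | 0.9839705 | 0.9839705 | categorized |
| 7 | meal.cal  | 181 | 0.9998762 | 0.5929402 | 0.6776459 | 0.9141580 | 0.6128095 | 0.6128095 | 0.8620604 | 0.3957558 | 0.3957558 | 0.8707131 | 0.8707131 | 0.8227256 | 0.9839705 | categorized |
| 8 | wt.loss   | 214 | 1.0013201 | 0.8281974 | 0.8281974 | NA | NA | NA | 1.3190185 | 0.0909098 | 0.1454557 | 0.1128907 | 0.1693361 | 0.0514936 | 0.3089618 | rcs.nonlinear |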
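
The padj columns hold multiplicity-adjusted p-values. The README does not say which adjustment is used, but the printed original.padj values are numerically consistent with a Benjamini-Hochberg correction of original.pval. This is an inference from the numbers shown, not a documented guarantee. The check below uses only base R and the rounded values from the table.

```r
# Rounded values copied from the original.pval and original.padj columns above.
p_raw <- c(
  ph.ecog = 0.0000269, pat.karno = 0.0002824, sex = 0.0014912, ph.karno = 0.0049579,
  age = 0.0418531, inst = 0.3459838, meal.cal = 0.5929402, wt.loss = 0.8281974
)
p_reported <- c(
  0.0002154, 0.0011296, 0.0039766, 0.0099157,
  0.0669650, 0.4613117, 0.6776459, 0.8281974
)

# Benjamini-Hochberg adjustment from the stats package.
p_bh <- p.adjust(p_raw, method = "BH")

# Largest absolute gap between recomputed and reported values (rounding error only).
max_gap <- max(abs(p_bh - p_reported))
print(max_gap < 1e-6)
#> [1] TRUE
```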

Generating Publication-Ready Tables and Figures

Example 3.1: Exclusion Counting for Cohort Selection

cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)
res <- exclusion_count(
  cohort,
  age < 18,
  is.na(value),
  dementia == TRUE,
  .criteria_names = c(
    "Age < 18 years",
    "Missing value",
    "History of dementia"
  )
)
#> Warning in exclusion_count(cohort, age < 18, is.na(value), dementia == TRUE, :
#> Criterion 'Age < 18 years' resulted in NA values. These rows have been excluded
#> by default. Consider adding an explicit check for missing values (e.g.,
#> is.na(variable)) as a preceding criterion.
knitr::kable(res) # Display the table

| Criteria            | N |
|---------------------|---|
| Initial N           | 6 |
| Age < 18 years      | 2 |
| Missing value       | 2 |
| History of dementia | 1 |
| Final N             | 1 |
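
The counts above follow from applying each criterion in sequence to the rows that survived the previous step, with NA results treated as exclusions (as the warning notes). A plain base-R sketch of that logic, written for illustration and not taken from the package source, reproduces the same numbers:

```r
cohort <- data.frame(
  age = c(17, 25, 30, NA, 50, 60),
  sex = c("M", "F", "F", "M", "F", "M"),
  value = c(1, NA, 3, 4, 5, NA),
  dementia = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Criteria stored as unevaluated expressions, applied in order.
criteria <- list(
  "Age < 18 years" = quote(age < 18),
  "Missing value" = quote(is.na(value)),
  "History of dementia" = quote(dementia == TRUE)
)

remaining <- cohort
counts <- c("Initial N" = nrow(cohort))
for (label in names(criteria)) {
  flag <- eval(criteria[[label]], remaining) # evaluate on the rows still included
  flag[is.na(flag)] <- TRUE                  # NA results count as excluded
  counts[label] <- sum(flag)
  remaining <- remaining[!flag, , drop = FALSE]
}
counts["Final N"] <- nrow(remaining)
print(counts)
```

The two exclusions at the first step are the 17-year-old and the row with a missing age.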

Example 3.2: Automatic Type Inference and Baseline Table Generation

var_types <- get_var_types(mtcars, strata = "vs") # Automatically infer variable types
print(var_types)
#> $factor_vars
#> [1] "cyl"  "vs"   "am"   "gear"
#> 
#> $exact_vars
#> [1] "cyl"  "gear"
#> 
#> $nonnormal_vars
#> [1] "drat" "carb"
#> 
#> $omit_vars
#> NULL
#> 
#> $strata
#> [1] "vs"
#> 
#> attr(,"class")
#> [1] "var_types"

tables <- baseline_table(mtcars,
  var_types = var_types, contDigits = 1, save_table = FALSE,
  filename = "baseline.csv", seed = 1 # set seed for simulated fisher exact test
)
knitr::kable(tables$baseline) # Display the table
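
|                     | Overall       | vs: 0         | vs: 1        | p      | test    |
|---------------------|---------------|---------------|--------------|--------|---------|
| n                   | 32            | 18            | 14           |        |         |
| mpg (mean (SD))     | 20.1 (6.0)    | 16.6 (3.9)    | 24.6 (5.4)   | <0.001 |         |
| cyl (%)             |               |               |              | <0.001 | exact   |
| 4                   | 11 (34.4)     | 1 (5.6)       | 10 (71.4)    |        |         |
| 6                   | 7 (21.9)      | 3 (16.7)      | 4 (28.6)     |        |         |
| 8                   | 14 (43.8)     | 14 (77.8)     | 0 (0.0)      |        |         |
| disp (mean (SD))    | 230.7 (123.9) | 307.1 (106.8) | 132.5 (56.9) | <0.001 |         |
| hp (mean (SD))      | 146.7 (68.6)  | 189.7 (60.3)  | 91.4 (24.4)  | <0.001 |         |
| drat (median [IQR]) | 3.7 [3.1, 3.9] | 3.2 [3.1, 3.7] | 3.9 [3.7, 4.1] | 0.013 | nonnorm |
| wt (mean (SD))      | 3.2 (1.0)     | 3.7 (0.9)     | 2.6 (0.7)    | 0.001  |         |
| qsec (mean (SD))    | 17.8 (1.8)    | 16.7 (1.1)    | 19.3 (1.4)   | <0.001 |         |
| am = 1 (%)          | 13 (40.6)     | 6 (33.3)      | 7 (50.0)     | 0.556  |         |
| gear (%)            |               |               |              | 0.002  | exact   |
| 3                   | 15 (46.9)     | 12 (66.7)     | 3 (21.4)     |        |         |
| 4                   | 12 (37.5)     | 2 (11.1)      | 10 (71.4)    |        |         |
| 5                   | 5 (15.6)      | 4 (22.2)      | 1 (7.1)      |        |         |
| carb (median [IQR]) | 2.0 [2.0, 4.0] | 4.0 [2.2, 4.0] | 1.5 [1.0, 2.0] | <0.001 | nonnorm |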
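
The seed matters because a simulated Fisher exact test draws random tables, so its p-value can vary between runs. The base-R sketch below shows the idea for the cyl row; the number of replicates `B = 2000` is the stats default and an assumption here, since the README does not state what baseline_table() uses.

```r
# Fix the random number generator so the Monte Carlo p-value is reproducible.
set.seed(1)

# Cross-tabulate cylinder count against the stratifying variable.
tab <- table(cyl = mtcars$cyl, vs = mtcars$vs)

# Monte Carlo version of Fisher's exact test from the stats package.
res <- fisher.test(tab, simulate.p.value = TRUE, B = 2000)
print(res$p.value)
```

With B replicates, the smallest p-value the simulation can return is 1 / (B + 1), which is why very strong associations are reported as a bound such as <0.001.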

Example 3.3: RCS Plot

data(cancer, package = "survival")

# Performing cox regression, which is inferred by `y` and `time`
p <- rcs_plot(cancer, x = "age", y = "status", time = "time", covars = c("sex", "ph.karno"), save_plot = FALSE)
#> Warning in predictor_effect_plot(data = data, x = x, y = y, time = time, : 1
#> incomplete cases excluded.
plot(p)
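
For readers who want to see what a restricted cubic spline term looks like in a Cox model, the sketch below fits one by hand. It uses splines::ns() (a natural cubic spline basis, which is linear beyond the boundary knots) with an arbitrary `df = 3`; the knot placement and settings used by rcs_plot() are not stated in this README, so the numbers will not necessarily match the package output.

```r
library(survival)
library(splines)

# Complete cases for the variables used, so both models see the same rows.
lung_cc <- na.omit(lung[, c("time", "status", "age", "sex", "ph.karno")])

# Linear age effect versus a natural cubic spline in age (df = 3 is an arbitrary choice).
fit_lin <- coxph(Surv(time, status) ~ age + sex + ph.karno, data = lung_cc)
fit_rcs <- coxph(Surv(time, status) ~ ns(age, df = 3) + sex + ph.karno, data = lung_cc)

# Likelihood-ratio test for nonlinearity: the spline adds 2 parameters over the linear term.
lrt <- 2 * (fit_rcs$loglik[2] - fit_lin$loglik[2])
p_nonlinear <- pchisq(lrt, df = 2, lower.tail = FALSE)
print(p_nonlinear)
```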

Example 3.4: Interaction Plot

data(cancer, package = "survival")

# Generating interaction plot of both linear and RCS models
p <- interaction_plot(cancer,
  y = "status", time = "time", predictor = "age",
  group_var = "sex", save_plot = FALSE
)
plot(p$lin)
plot(p$rcs)

Example 3.5: Regression Forest Plot

data(cancer, package = "survival")
cancer$dead <- cancer$status == 2 # Preparing a binary variable for logistic regression
cancer$`age per 1 sd` <- c(scale(cancer$age)) # Standardizing age

# Performing multivariate logistic regression
p1 <- regression_forest(cancer,
  model_vars = c("age per 1 sd", "sex", "wt.loss"), y = "dead",
  as_univariate = FALSE, save_plot = FALSE
)
plot(p1)

p2 <- regression_forest(
  cancer,
  model_vars = list(
    Crude = c("age per 1 sd"),
    Model1 = c("age per 1 sd", "sex"),
    Model2 = c("age per 1 sd", "sex", "wt.loss")
  ),
  y = "dead",
  save_plot = FALSE
)
plot(p2)
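
A forest plot of this kind displays odds ratios with confidence intervals. As a rough illustration of the quantities involved, the base-R sketch below fits the same three-variable logistic model and tabulates odds ratios with Wald intervals. The interval type that regression_forest() reports is not stated here, so treat Wald as an assumption; `age_sd` is a syntactic stand-in for the `age per 1 sd` column.

```r
lung_df <- survival::lung
lung_df$dead <- lung_df$status == 2         # binary outcome, as in the example above
lung_df$age_sd <- c(scale(lung_df$age))     # age standardized to 1 SD units

fit <- glm(dead ~ age_sd + sex + wt.loss, family = binomial(), data = lung_df)

# Exponentiate coefficients and Wald limits to get odds ratios; drop the intercept row.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))[-1, ]
print(round(or_table, 2))
```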

Example 3.6: Subgroup Forest Plot

data(cancer, package = "survival")
# coxph model with time assigned
p <- subgroup_forest(cancer,
  subgroup_vars = c("age", "sex", "wt.loss"), x = "ph.ecog", y = "status",
  time = "time", covars = "ph.karno", ticks_at = c(1, 2), save_plot = FALSE
)
plot(p)

Example 3.7: Classification Model Performance

# Building models with example data
# Loading the survival datasets this way also makes `kidney` available
data(cancer, package = "survival")
df <- kidney
df$dead <- ifelse(df$time <= 100 & df$status == 0, NA, df$time <= 100)
df <- na.omit(df[, -c(1:3)])

model0 <- glm(dead ~ age + frail, family = binomial(), data = df)
model1 <- glm(dead ~ ., family = binomial(), data = df)
df$base_pred <- predict(model0, type = "response")
df$full_pred <- predict(model1, type = "response")

# Generating most of the useful plots and metrics for model comparison
results <- classif_model_compare(df, "dead", c("base_pred", "full_pred"), save_output = FALSE)
#> Assuming 'TRUE' is [Event] and 'FALSE' is [non-Event]

knitr::kable(results$metric_table)
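
|   | Model     | AUC                  | PRAUC | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1    | Kappa | Brier | cutoff | Youden | HosLem |
|---|-----------|----------------------|-------|----------|-------------|-------------|----------------|----------------|-------|-------|-------|--------|--------|--------|
| 2 | full_pred | 0.915 (0.847, 0.984) | 0.885 | 0.839    | 0.8         | 0.889       | 0.903          | 0.774          | 0.848 | 0.677 | 0.114 | 0.626  | 0.689  | 0.944  |
| 1 | base_pred | 0.822 (0.711, 0.933) | 0.766 | 0.806    | 0.8         | 0.815       | 0.848          | 0.759          | 0.824 | 0.610 | 0.171 | 0.490  | 0.615  | 0.405  |
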
plot(results$roc_plot)
plot(results$pr_plot)
plot(results$calibration_plot)
plot(results$dca_plot)
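
Several columns of the metric table have simple closed forms. The toy base-R sketch below computes AUC (via the rank-sum identity), the Brier score, and the Youden index on four made-up observations. The data and the 0.5 cutoff are invented for illustration, and the package may compute these quantities differently.

```r
# Made-up outcomes and predicted probabilities.
truth <- c(TRUE, TRUE, FALSE, FALSE)
pred <- c(0.9, 0.4, 0.6, 0.1)

n_pos <- sum(truth)
n_neg <- sum(!truth)

# AUC: share of (event, non-event) pairs ranked correctly, from the rank sum of events.
auc <- (sum(rank(pred)[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Brier score: mean squared difference between prediction and outcome.
brier <- mean((pred - truth)^2)

# Youden index at an arbitrary cutoff: sensitivity + specificity - 1.
cutoff <- 0.5
sensitivity <- mean(pred[truth] >= cutoff)
specificity <- mean(pred[!truth] < cutoff)
youden <- sensitivity + specificity - 1

print(c(AUC = auc, Brier = brier, Youden = youden))
#>    AUC  Brier Youden 
#>  0.750  0.185  0.000
```

In the table above, the Youden column equals Sensitivity + Specificity - 1 at the reported cutoff (for example 0.8 + 0.889 - 1 = 0.689 for full_pred).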

Example 3.8: Importance Plot

# Generating a dummy importance vector
set.seed(5)
dummy_importance <- runif(20, 0.2, 0.6)^5
names(dummy_importance) <- paste0("var", 1:20)

# Plotting variable importance, keeping only top 15 and splitting at 10
p <- importance_plot(dummy_importance, top_n = 15, split_at = 10, save_plot = FALSE)
plot(p)
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_bar()`).

Documentation

For detailed usage, refer to the package vignettes (coming soon) or the GitHub repository.

Contributing

Bug reports and feature requests are welcome via the issue tracker.

License

clinpubr is licensed under GPL (>= 3).

Package Details

  • Version: 1.1.1
  • License: MIT + file LICENSE
  • Maintainer: Yue Niu
  • Last Published: December 19th, 2025
  • Monthly Downloads: 435

Functions in clinpubr (1.1.1)

  • fill_with_last: Fill NA values with the last valid value
  • extract_num: Extract numbers from string
  • emp_colors: Default color palette for clinpubr plots
  • common_prefix: Get common prefix of a string vector
  • filter_rcs_predictors: Filter predictors for RCS
  • cut_by: Convert Numeric to Factor
  • first_mode: Calculate the first mode
  • exclusion_count: Count the number of excluded samples at each step
  • df_view_nonnum: Show non-numeric elements in a data frame
  • get_valid_subset: Get the subset that satisfies the missing rate condition
  • interaction_p_value: Calculate interaction p-value
  • get_samples: Generate a sample of values from a vector and collapse them
  • get_var_types: Get variable types for baseline table
  • interaction_plot: Plot interactions
  • get_valid: Get one valid value from vector
  • format_pval: Format p-value for publication
  • importance_plot: Importance plot
  • formula_add_covs: Add covariates to a formula
  • indicate_duplicates: Determine duplicate elements including their first occurrence
  • na2false: Replace NA values with FALSE
  • merge_ordered_vectors: Merging vectors while maintaining order
  • rcs_plot: Plot restricted cubic spline
  • mad_outlier: Mark possible outliers with MAD
  • predictor_effect_plot: Plot the effect of a predictor variable
  • qq_show: QQ plot
  • max_missing_rates: Get the maximum missing rate of rows and columns
  • regression_basic_results: Basic results of logistic or Cox regression
  • interaction_scan: Scan for interactions between variables
  • na_max: Safe min and max functions that return NA if all values are NA
  • str_match_replace: Match string and replace with corresponding value
  • split_multichoice: Split multi-choice data into columns
  • replace_elements: Replacing elements in a vector
  • regression_scan: Scan for significant regression predictors
  • subgroup_forest: Create subgroup forest plot
  • regression_fit: Obtain regression results
  • subject_view: Get an overview of different subjects in data
  • time_roc_plot: Calculate and plot time-dependent ROC curves
  • test_normality: Test normality of a numeric variable
  • regression_forest: Forest plot of regression results
  • value_initial_cleaning: Preliminarily cleaning string vectors
  • vec2code: Generate code from string vector
  • to_date: Convert numerical or character date to date
  • unit_standardize: Standardize units of numeric data
  • unit_view: Generate a table of conflicting units
  • unmake_names: Unmake names
  • check_nonnum: Check elements that are not numeric
  • check_package: Check if a package is available and provide helpful error message
  • add_lists: Adding lists element-wise
  • answer_check: Check answers of multiple choice questions
  • combine_files: Combine multiple data files into a single data frame
  • classif_model_compare: Performance comparison of classification models
  • calc_cindex: Calculate C-index for survival data
  • calculate_index: Calculate index based on conditions
  • combine_multichoice: Combine multi-choice columns into one
  • baseline_table: Create a baseline table for a dataset
  • break_at: Generate breaks for histogram