cleanepi (version 1.1.1)

remove_duplicates: Remove duplicates

Description

When removing duplicates, users can specify a set of columns to consider with the target_columns argument.

Usage

remove_duplicates(data, target_columns = NULL)

Value

The input <data.frame> or <linelist> with the duplicated rows removed. Duplicates are identified from all columns or from the specified columns only.

Arguments

data

The input <data.frame> or <linelist>.

target_columns

A <vector> of column names to use when looking for duplicates. When the input data is a linelist object, this parameter can be set to "linelist_tags" to look for duplicates on tagged columns only. Default is NULL, in which case all columns are used.

Details

Caveat: In many epidemiological datasets, multiple rows may share the same value in one or more columns without being true duplicates. For example, several individuals might have the same symptom onset date and admission date. Be cautious when using this function—especially when applying it to a single target column—to avoid incorrect identification or removal of valid entries.
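
The caveat can be illustrated with a minimal sketch in base R. It uses duplicated() as a stand-in for the general idea of row-wise duplicate detection; it is not the cleanepi implementation, and the toy data frame and its column names (id, date_onset, date_admission) are made up for illustration.

```r
# Hypothetical toy data: three distinct patients; row 4 is an exact copy of row 1.
cases <- data.frame(
  id = c(1, 2, 3, 1),
  date_onset = as.Date(
    c("2023-01-05", "2023-01-05", "2023-01-07", "2023-01-05")
  ),
  date_admission = as.Date(
    c("2023-01-08", "2023-01-08", "2023-01-09", "2023-01-08")
  )
)

# Duplicates judged on all columns: only the exact copy is flagged.
dup_all <- duplicated(cases)

# Duplicates judged on a single column: patient 2 shares an onset date
# with patient 1, so that valid row is flagged as well.
dup_onset <- duplicated(cases$date_onset)

# Rows kept under each rule.
kept_all <- cases[!dup_all, ]
kept_onset <- cases[!dup_onset, ]

print(kept_all)
print(kept_onset)
```

Under these assumptions, checking all columns keeps all three patients, while checking only date_onset silently drops patient 2. Inspecting the detected duplicates before discarding them is one way to catch this kind of loss.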

Examples

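# load the example linelist shipped with the package
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)

# remove duplicates, comparing rows on the tagged columns only
no_dups <- remove_duplicates(
  data = data,
  target_columns = "linelist_tags"
)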

# print the removed duplicates
print_report(no_dups, "removed_duplicates")

# print the detected duplicates
print_report(no_dups, "found_duplicates")