DupChecker

# install.packages("devtools")
devtools::install_github("shengqh/DupChecker")

Or install it from Bioconductor with the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("DupChecker")

The following shows the most basic steps of a validation procedure. You need a target directory to store the data; here we assume it is your working directory.

library(DupChecker)
geoDownload(datasets = c("GSE14333", "GSE13067", "GSE17538"), targetDir=getwd())
datafile<-buildFileTable(rootDir=getwd(), filePattern="cel$")
result<-validateFile(datafile)
if(result$hasdup){
	duptable<-result$duptable
	write.csv(duptable, file="duptable.csv")
}

If downloading or decompressing takes too long within the R environment, you can download the GEO/ArrayExpress raw data and decompress it into individual data files with other tools. DupChecker expects uncompressed CEL files because compressed copies of the same CEL file, produced by different compression software, may have different MD5 fingerprints.
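The effect of compression on fingerprints can be illustrated with base R alone. The sketch below is only an illustration: it uses a small text file as a stand-in for a CEL file (an assumption, no real microarray data is involved) and compresses it with gzip and bzip2 through the base connections `gzfile` and `bzfile`, then compares fingerprints with `tools::md5sum`.

```r
library(tools)  # md5sum() is part of the 'tools' package shipped with R

# Stand-in content for a CEL file (hypothetical data, for illustration only).
content <- rep("probe intensity data", 1000)

# Compress the same content with two different compressors.
gz_file <- tempfile(fileext = ".gz")
bz_file <- tempfile(fileext = ".bz2")
con <- gzfile(gz_file, "w"); writeLines(content, con); close(con)
con <- bzfile(bz_file, "w"); writeLines(content, con); close(con)

# The compressed files have different MD5 fingerprints ...
md5_gz <- unname(md5sum(gz_file))
md5_bz <- unname(md5sum(bz_file))
archives_differ <- md5_gz != md5_bz

# ... although both decompress to exactly the same content.
same_content <- identical(readLines(gzfile(gz_file)), readLines(bzfile(bz_file)))
```

Comparing fingerprints of the decompressed files avoids this problem, which is why the data should be unpacked before validation.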

The following code downloads two datasets from ArrayExpress and three datasets from GEO. It may take from a few minutes to a few hours, depending on your network performance.

library(DupChecker)

#download from ArrayExpress system
datatable<-arrayExpressDownload(datasets = c("E-TABM-158", "E-TABM-43"), targetDir=getwd())
datatable

#Or download from GEO system
datatable<-geoDownload(datasets = c("GSE14333", "GSE13067", "GSE17538"), targetDir=getwd())
datatable

The datatable is a data frame containing the dataset names and the number of CEL files in each dataset.

##Build file table

Secondly, the function buildFileTable finds all files in the subdirectories under the root directories you provide. The resulting data frame contains two columns, dataset and filename. Here, rootDir can also be a vector of directories.

datafile<-buildFileTable(rootDir=getwd(), filePattern="cel$")
datafile
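To make the shape of the file table concrete, here is a minimal base R sketch of a comparable directory scan. It is not DupChecker's implementation: the helper name `sketchFileTable` is hypothetical, and taking the dataset name from each file's parent directory and matching the pattern case-insensitively are assumptions.

```r
# Hypothetical helper, not part of DupChecker: scan one or more root
# directories and return a data frame with columns dataset and filename.
sketchFileTable <- function(rootDir, filePattern = "cel$") {
  files <- list.files(rootDir, pattern = filePattern, recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  data.frame(dataset = basename(dirname(files)),  # parent directory as dataset
             filename = files,
             stringsAsFactors = FALSE)
}

# Usage on a throwaway directory tree with two fake datasets.
root <- tempfile("dupchecker_sketch")
for (ds in c("GSE_A", "GSE_B")) {
  dir.create(file.path(root, ds), recursive = TRUE)
  writeLines("fake", file.path(root, ds, paste0(ds, "_1.CEL")))
}
writeLines("not a CEL file", file.path(root, "GSE_A", "notes.txt"))

filetable <- sketchFileTable(root)
filetable
```

Because `list.files` accepts a vector of paths, the same helper also works when `rootDir` holds several directories.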

##Validate file redundancy

The function validateFile calculates the MD5 fingerprint of each file in the table and then checks whether any two files share the same fingerprint. Files with the same fingerprint are treated as duplicates. The function returns a table containing all duplicated files and their datasets.

result<-validateFile(datafile)
if(result$hasdup){
	duptable<-result$duptable
	write.csv(duptable, file="duptable.csv")
}

| MD5 | GSE13067(64/74) | GSE14333(231/290) | GSE17538(167/244) |
|---|---|---|---|
| 001ddd757f185561c9ff9b4e95563372 | | GSM358397.CEL | GSM437169.CEL |
| 00b2e2290a924fc2d67b40c097687404 | | GSM358503.CEL | GSM437210.CEL |
| 012ed9083b8f1b2ae828af44dbab29f0 | GSM327335 | GSM358620.CEL | |
| 023c4e4f9ebfc09b838a22f2a7bdaa59 | | GSM358441.CEL | GSM437117.CEL |
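The duplicate check itself can be approximated in a few lines of base R. The sketch below is an assumption-laden illustration, not the package's validateFile: the helper name `sketchFindDuplicates` is hypothetical, it expects a data frame with dataset and filename columns, and it returns the duplicates in long form (one row per file) rather than the wide per-dataset layout.

```r
library(tools)  # md5sum() is part of the 'tools' package shipped with R

# Hypothetical helper, not part of DupChecker: fingerprint every file and
# keep the rows whose fingerprint occurs more than once.
sketchFindDuplicates <- function(filetable) {
  filetable$md5 <- unname(md5sum(filetable$filename))
  dup_md5 <- unique(filetable$md5[duplicated(filetable$md5)])
  duptable <- filetable[filetable$md5 %in% dup_md5, , drop = FALSE]
  duptable <- duptable[order(duptable$md5), , drop = FALSE]
  list(hasdup = nrow(duptable) > 0, duptable = duptable)
}

# Usage: three fake files, two of them with identical content.
root <- tempfile("dupchecker_sketch")
dir.create(root)
paths <- file.path(root, c("a.CEL", "b.CEL", "c.CEL"))
writeLines("same content", paths[1])
writeLines("same content", paths[2])
writeLines("different content", paths[3])

filetable <- data.frame(dataset = c("GSE_A", "GSE_B", "GSE_B"),
                        filename = paths, stringsAsFactors = FALSE)
result <- sketchFindDuplicates(filetable)
result$duptable
```

The sketch assumes every listed file exists; `md5sum` returns `NA` for missing files, so a real check would need to filter those out first.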

If you use DupChecker in published research, please cite:

Quanhu Sheng, Yu Shyr, Xi Chen: DupChecker: a Bioconductor package for checking high-throughput genomic data redundancy in meta-analysis. BMC Bioinformatics 2014, 15:323.

Version: 1.10.2

License: GPL (>= 2)

Maintainer: Quanhu Sheng

Last published: February 15th, 2017

Functions in DupChecker (1.10.2)

validateFile

buildFileTable

arrayExpressDownload

geoDownload