DataClean cleans the data in a character vector according to the 
conditions in the arguments.
DataClean(x, fix.comma = TRUE, fix.semcol = TRUE, fix.col = TRUE,
          fix.bracket = TRUE, fix.punct = TRUE, fix.space = TRUE,
          fix.sep = TRUE, fix.leadzero = TRUE)

x: A character vector. If not of type character, it is coerced with as.character.
fix.comma: Logical. If TRUE, all the commas are replaced by space (see Details).
fix.semcol: Logical. If TRUE, all the semicolons are replaced by space (see Details).
fix.col: Logical. If TRUE, all the colons are replaced by space (see Details).
fix.bracket: Logical. If TRUE, all the brackets are replaced by space (see Details).
fix.punct: Logical. If TRUE, all punctuation characters are removed (see Details).
fix.space: Logical. If TRUE, all space characters are replaced by space and multiple spaces are converted to a single space (see Details).
fix.sep: Logical. If TRUE, space between alphabetic characters followed by digits is removed (see Details).
fix.leadzero: Logical. If TRUE, leading zeros are removed (see Details).

Returns a character vector of the cleaned data, converted to upper case. NAs, if any, are converted to blank strings.

This function prepares passport data for the creation of a keyword-in-context
index with the KWIC function and for the identification of probable duplicate
accessions by the ProbDup function. It cleans the character strings in
passport data fields (columns) specified as the input character vector
x according to the conditions in the arguments, applied in the same order. If
the input vector x is not of type character, it is coerced to a
character vector.

This function is designed particularly for use with fields that identify
accessions, such as accession IDs, collection numbers and accession names.
It is essentially a wrapper around the gsub base
function with regex arguments. It also converts all
strings to upper case and removes leading and trailing spaces.
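The coercion, case conversion and trimming described above can be sketched in base R. This is an illustration under assumed sample values, not the internal code of DataClean; the order of the steps here is also an assumption.

```r
# Illustrative sketch only; sample values are made up for this example.
x <- c("  T78, Mwitunde ", "Gwalior", NA)

x <- as.character(x)   # coerce to character if needed
x[is.na(x)] <- ""      # NAs become blank strings
y <- trimws(toupper(x))  # upper case, then drop leading/trailing spaces

print(y)
```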
Commas, semicolons and colons which are sometimes used to separate multiple 
strings or names within the same field can be replaced with a single space 
using the logical arguments fix.comma, fix.semcol and 
fix.col respectively.
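As a rough illustration of this kind of replacement, a single gsub call with a character class can turn commas, semicolons and colons into spaces. The pattern below is an assumption chosen for the example, not necessarily the one used inside DataClean.

```r
# Illustrative sketch only; the pattern is an assumed equivalent.
x <- c("U 4-47-18;EC 21127", "T78, Mwitunde", "A:B")

# Replace each comma, semicolon or colon with one space.
y <- gsub("[,;:]", " ", x)

print(y)
```

Note that "T78, Mwitunde" ends up with two consecutive spaces after this step; collapsing them is left to the space-handling step.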
Similarly, the logical argument fix.bracket can be used to replace all
brackets, including parentheses, square brackets and curly brackets, with
space.
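A minimal sketch of bracket replacement with gsub follows. The pattern and the use of perl = TRUE are assumptions made for this example; DataClean may use a different expression.

```r
# Illustrative sketch only; the pattern is an assumed equivalent.
x <- c("2-5 (NRCG-4053)", "#648-4 (Gwalior)", "A[1]{b}")

# Replace every round, square or curly bracket with one space.
y <- gsub("[\\[\\](){}]", " ", x, perl = TRUE)

print(y)
```

Each bracket is replaced one for one, so the strings keep their length and may contain runs of spaces until those are collapsed.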
The logical argument fix.punct can be used to remove all punctuation 
from the data.
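Punctuation removal of this kind can be approximated with the POSIX class [[:punct:]]. The sketch below is an assumption about one way to do it, not the function's actual source.

```r
# Illustrative sketch only; the pattern is an assumed equivalent.
x <- c("S7-12-6", "TG4;U/4/47/13", "#648-4")

# Delete every punctuation character.
y <- gsub("[[:punct:]]", "", x)

print(y)
# [1] "S7126"     "TG4U44713" "6484"
```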
fix.space can be used to convert all space characters such as tab, 
newline, vertical tab, form feed and carriage return to spaces and finally 
convert multiple spaces to single space.
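The two steps described here, converting every space character to a plain space and then collapsing runs of spaces, can be sketched as follows. The patterns are assumptions for illustration only.

```r
# Illustrative sketch only; the patterns are assumed equivalents.
x <- c("RS   1", "AK\t12-24", "ICG\n 3410")

# Step 1: tabs, newlines and similar characters become plain spaces.
y <- gsub("[[:space:]]", " ", x)
# Step 2: runs of spaces collapse to a single space.
y <- gsub(" +", " ", y)

print(y)
# [1] "RS 1"     "AK 12-24" "ICG 3410"
```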
fix.sep can be used to merge together accession identifiers
composed of alphabetic characters separated from a series of digits by a
space character. For example, IR 64 and PUSA 256 become IR64 and PUSA256.
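One way to express this merge is a gsub call with two capture groups, keeping the letters and the digits while dropping the space between them. The pattern is an assumption for this sketch and may differ from what DataClean uses.

```r
# Illustrative sketch only; the pattern is an assumed equivalent.
x <- c("IR 64", "PUSA 256", "AK 12-24", "T78 MWITUNDE")

# Letters, one space, digits: keep both groups, drop the space.
y <- gsub("([[:alpha:]]+) ([[:digit:]]+)", "\\1\\2", x)

print(y)
# [1] "IR64"         "PUSA256"      "AK12-24"      "T78 MWITUNDE"
```

The last element is unchanged because its space is followed by letters, not digits.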
fix.leadzero can be used to remove leading zeros from accession name
fields to facilitate matching to identify probable duplicates. For example,
IR0064 becomes IR64.
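A hedged sketch of leading-zero removal: zeros are dropped only when they start a run of digits, so zeros inside a number are kept. The pattern and the use of perl = TRUE are assumptions for this example, not the package's confirmed implementation.

```r
# Illustrative sketch only; the pattern is an assumed equivalent.
x <- c("IR0064", "EC0021003", "ICG 3410", "007", "A100")

# Drop zeros that sit at the start of a digit run (after a non-digit
# or at the start of the string) and are followed by another digit.
y <- gsub("(^|[^[:digit:]])0+([[:digit:]])", "\\1\\2", x, perl = TRUE)

print(y)
# [1] "IR64"     "EC21003"  "ICG 3410" "7"        "A100"
```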
See also: gsub, regex, MergeKW, KWIC, ProbDup
library(PGRdup)  # package that provides DataClean
names <- c("S7-12-6", "ICG-3505", "U 4-47-18;EC 21127", "AH 6481", "RS   1",
           "AK 12-24", "2-5 (NRCG-4053)", "T78, Mwitunde", "ICG 3410",
           "#648-4 (Gwalior)", "TG4;U/4/47/13", "EC0021003")
DataClean(names)