taxonSortPBDBocc: Sorting Unique Taxa of a Given Rank from Paleobiology Database Occurrence Data

Description

Functions for sorting out unique taxa from Paleobiology Database occurrence downloads, which should accept several different formats resulting from different versions of the PBDB API and different vocabularies available from the API.

Usage

taxonSortPBDBocc(data, rank, onlyFormal = TRUE, cleanUncertain = TRUE,
  cleanResoValues = c(NA, "\"", "", "n. sp.", "n. gen.", " ", "  "))

Arguments

data

A table of occurrence data collected from the Paleobiology Database.

rank

The selected taxon rank; must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'.

onlyFormal

If TRUE (the default), only taxa formally accepted by the Paleobiology Database are returned. If FALSE, the identified-name fields are also searched for any additional 'informal' taxa of the sought taxon rank. If their taxon name happens to match any formal taxon, their occurrences are merged with that formal taxon's occurrences (see Details).

cleanUncertain

If TRUE (the default), any occurrences with an entry in the respective 'resolution' field that is *not* found in the argument cleanResoValues will be removed from the dataset. These are assumed to be values indicating taxonomic uncertainty, such as 'cf.'.

cleanResoValues

The set of values that can be found in a 'resolution' field that do not cause a taxon to be removed, as they do not seem to indicate taxonomic uncertainty.
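
How cleanUncertain and cleanResoValues interact can be illustrated with a small sketch. This is not the package's internal code; the toy table and its rows are invented for illustration, and only the field name primary_reso and the default values are taken from this page.

```r
# Toy occurrence table (invented rows, for illustration only).
occ <- data.frame(
  accepted_name = c("Dicellograptus", "Dicellograptus",
                    "Climacograptus", "Climacograptus"),
  primary_reso  = c(NA, "cf.", "n. gen.", "?"),
  stringsAsFactors = FALSE
)

# Default 'harmless' resolution values, copied from the Usage section.
cleanResoValues <- c(NA, "\"", "", "n. sp.", "n. gen.", " ", "  ")

# Keep a row only if its resolution entry is among the accepted values.
# %in% matches NA against NA, so rows with no entry are retained.
keep <- occ$primary_reso %in% cleanResoValues
occClean <- occ[keep, , drop = FALSE]

print(occClean)
```

For rank = "species", the same check would presumably be applied to species_reso rather than primary_reso.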

Value

Returns a list where each element is a different unique taxon obtained by the sorting function, named with that taxon's name. Each element is a table containing all the same occurrence data fields as the input (potentially with some fields renamed and some field contents changed, due to vocabulary translation).
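
Because the result is a named list of tables, it can be summarized with the usual apply-family functions. The sketch below uses a hand-built stand-in for the output; the taxon names and the fields occurrence_no and early_age are hypothetical placeholders, not guaranteed field names.

```r
# Hypothetical stand-in for the list returned by taxonSortPBDBocc:
# one table of occurrences per taxon, named by taxon.
occList <- list(
  "Aus bus" = data.frame(occurrence_no = 1:3,
                         early_age = c(450, 448, 452)),
  "Aus dus" = data.frame(occurrence_no = 4:5,
                         early_age = c(445, 447))
)

# Number of occurrences sorted under each taxon.
nOcc <- sapply(occList, nrow)

# Oldest listed age among each taxon's occurrences (assumed field).
oldest <- sapply(occList, function(x) max(x$early_age))

print(nOcc)
print(oldest)
```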

Details

Data input for taxonSortPBDBocc is expected to be from version 1.2 of the API with the 'pbdb' vocabulary. However, datasets are passed to the internal function translatePBDBocc, which attempts to correct any necessary field names and field contents used by taxonSortPBDBocc.

This function can pull either just the 'formally' identified and synonymized taxa in a given table of occurrence data, or additionally pull occurrences listed under informal taxa of the sought taxonomic rank. Only formal taxa are sorted by default; this is controlled by argument onlyFormal. Pulling the informally listed taxonomic occurrences is often necessary in groups that have received little focused taxonomic effort, such that many species are linked to their generic taxon ID and never received a species-level taxonomic ID in the PBDB. Pulling both formal and informally listed taxonomic occurrences is a hierarchical process performed in stages: formal taxa are identified first; informal taxa are then identified from the occurrences that are 'leftover'; and informal occurrences with name labels that match a previously sorted, formally listed taxon are concatenated to the 'formal' occurrences for that same taxon, rather than being listed under separate elements of the list as if they were separate taxa.

This function is simpler than the similar functions that inspired it: it uses the input rank both to filter occurrences and to directly reference a taxon's accepted taxonomic placement, rather than relying on a series of specific if() checks. Unlike some similar functions in other packages, such as pbdb_temp_range in version 0.3 of paleobioDB, taxonSortPBDBocc does not check whether sorted taxa have a single 'taxon_no' ID number. This makes the blanket assumption that if a taxon's listed name in relevant fields is identical, the taxon is identical, with the important caveat that occurrences with accepted formal synonymies are sorted first based on their accepted names, followed by taxa without formal taxon IDs.
This should avoid mistakenly linking the same occurrences to multiple taxa, or assigning occurrences listed under separate formal taxa to the same taxon based on their 'identified' taxon name, as long as all formal taxa have unique names (which is an untested assumption). In some cases, this procedure is helpful, such as when taxa with identical generic and species names are listed under separate taxon ID numbers because of a difference in the listed subgenus for some occurrences (for example, 'Pseudoclimacograptus (Metaclimacograptus) hughesi' and 'Pseudoclimacograptus hughesi' in the PBDB as of 03/01/2015). Presumably, the amount of data affected by differences in this procedure is very minor.

Occurrences with taxonomic uncertainty indicators in the listed identified taxon name are removed by default, as controlled by argument cleanUncertain. This is done by removing any occurrences that have an entry in primary_reso (which was 'genus_reso' in the v1.1 API) when rank is a supraspecific level, or in species_reso when rank = "species", if that entry is not found in cleanResoValues. In some rare cases, when onlyFormal = FALSE, supraspecific taxon names may be returned in the output with various 'cruft' attached, like 'n.sp'.

Empty values in the input data table ("") are converted to NAs, as they may be due to issues with using read.csv to convert API-downloaded data.
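
The staged, hierarchical sorting used when onlyFormal = FALSE can be sketched on a toy table. This is only an illustration under stated assumptions, not the function's actual implementation: the column names accepted_name, accepted_rank and identified_name, and all taxon names, are placeholders chosen for the sketch.

```r
# Toy occurrences (invented). Rows 1-2 have a formal species-level ID;
# rows 3-5 are only linked to a formal genus.
occ <- data.frame(
  accepted_name   = c("Aus bus", "Aus bus", "Aus", "Aus", "Aus"),
  accepted_rank   = c("species", "species", "genus", "genus", "genus"),
  identified_name = c("Aus bus", "Aus cus", "Aus dus", "Aus bus", "Aus dus"),
  stringsAsFactors = FALSE
)

# Stage 1: formal taxa, sorted by their accepted name at the sought rank.
isFormal <- occ$accepted_rank == "species"
sorted <- split(occ[isFormal, ], occ$accepted_name[isFormal])

# Stage 2: informal taxa, taken from the 'leftover' occurrences and
# grouped by the name they were identified under.
leftover <- occ[!isFormal, ]
informal <- split(leftover, leftover$identified_name)

# Stage 3: informal names matching a formal taxon are appended to it;
# all other informal names become new list elements.
for (taxon in names(informal)) {
  if (taxon %in% names(sorted)) {
    sorted[[taxon]] <- rbind(sorted[[taxon]], informal[[taxon]])
  } else {
    sorted[[taxon]] <- informal[[taxon]]
  }
}

print(sapply(sorted, nrow))
```

In this toy case, the occurrence identified as 'Aus cus' stays with its accepted (synonymized) name 'Aus bus', and the leftover occurrence identified as 'Aus bus' is merged into the same element rather than forming a duplicate taxon.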

Examples

#load example graptolite PBDB occ dataset
data(graptPBDB)

#get formal genera
occGenus<-taxonSortPBDBocc(graptOccPBDB, rank="genus")
length(occGenus)

#get formal species
occSpeciesFormal<-taxonSortPBDBocc(graptOccPBDB, rank="species")
length(occSpeciesFormal)

#yes, there are fewer 'formal' graptolite species in the PBDB than genera

#get formal and informal species
occSpeciesInformal<-taxonSortPBDBocc(graptOccPBDB, rank="species",
	 onlyFormal=FALSE)
length(occSpeciesInformal)

#way more graptolite species are 'informal' in the PBDB

#get formal and informal species
	#including from occurrences with uncertain taxonomy
	#basically everything and the kitchen sink
occSpeciesEverything<-taxonSortPBDBocc(graptOccPBDB, rank="species",
		onlyFormal=FALSE, cleanUncertain=FALSE)
length(occSpeciesEverything)

# simple function for getting occurrence data from API v1.1
easyGetPBDBocc<-function(taxa,show=c("ident","phylo")){
	#cleans PBDB occurrence downloads of warnings
	#collapse taxon names into one comma-separated, URL-safe string
	taxa<-paste(taxa,collapse=",")
	taxa<-paste(unlist(strsplit(taxa,"_")),collapse="%20")
	show<-paste(show,collapse=",")
	#build the API query URL, replacing any spaces with %20
	command<-paste0("http://paleobiodb.org/data1.1/occs/list.txt?base_name=",
		taxa,"&show=",show,"&limit=all",
		collapse="")
	command<-paste(unlist(strsplit(command,split=" ")),collapse="%20")
	downData<-readLines(command)
	if(length(grep("Warning",downData))!=0){
		#warning lines precede the "Records" line; strip quotes and commas
		start<-grep("Records",downData)
		warn<-downData[1:(start-1)]
		warn<-sapply(warn, function(x)
			paste0(unlist(strsplit(unlist(strsplit(x,'"')),",")),collapse=""))
		warn<-paste0(warn,collapse="\n")
		names(warn)<-NULL
		#everything after the "Records" line is the data table itself
		mat<-downData[-(1:start)]
		mat<-read.csv(textConnection(mat))
		message(warn)
	}else{
		#no warnings: the whole download is the data table
		mat<-downData
		mat<-read.csv(textConnection(mat))
		}
	return(mat)
	}

#try a PBDB API download with lots of synonymization
	#this should have only 1 species
#old way:
#acoData<-read.csv(paste0("http://paleobiodb.org/data1.1/occs/list.txt?",
#	"base_name=Acosarina%20minuta&show=ident,phylo&limit=all"))
# with easyGetPBDBocc:
acoData<-easyGetPBDBocc("Acosarina minuta")
x<-taxonSortPBDBocc(acoData, rank="species", onlyFormal=FALSE)
names(x)

#make sure works with API v1.2
		#won't work until v1.2 goes live at the regular server
dicelloData<-read.csv(paste0("http://paleobiodb.org",
	"/data1.2/occs/list.txt?base_name=Dicellograptus",
	"&show=ident,phylo&limit=all"))
dicelloOcc2<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE)
names(dicelloOcc2)

#make sure works with compact vocab v1.1
dicelloData<-read.csv(paste0("http://paleobiodb.org",
	"/data1.1/occs/list.txt?base_name=Dicellograptus",
	"&show=ident,phylo&limit=all&vocab=com"))
dicelloOccCom1<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE)
names(dicelloOccCom1)
head(dicelloOccCom1[[1]])[,1:7]
