GenomicFeatures (version 1.18.7)

makeTranscriptDbFromGFF: Make a TxDb object from annotations available as a GFF3 or GTF file

Description

The makeTranscriptDbFromGFF function allows the user to make a TxDb object from transcript annotations available as a GFF3 or GTF file.

Usage

makeTranscriptDbFromGFF(file, format=c("gff3","gtf"), exonRankAttributeName=NA, gffGeneIdAttributeName=NA, chrominfo=NA, dataSource=NA, species=NA, circ_seqs=DEFAULT_CIRC_SEQS, miRBaseBuild=NA, useGenesAsTranscripts=FALSE)

Arguments

file
path/file to be processed
format
"gff3" or "gtf" depending on which file format you have to process
exonRankAttributeName
character(1) name of the attribute that defines the exon rank information, or NA to indicate that exon ranks are inferred from order of occurrence in the GFF.
gffGeneIdAttributeName
an optional argument that can be used for gff style files ONLY. If the gff file lacks rows that specify gene IDs, but its mRNA rows specify the gene IDs via a named attribute, then passing the name of that attribute for this argument lets the parser still extract gene IDs that map to these transcripts. If left as NA (the default), the parser will try to extract rows that are named 'gene' for gene to transcript mappings when parsing a gff3 file. For gtf files this argument is ignored entirely.
chrominfo
data frame containing information about the chromosomes. Will be passed to the internal call to makeTranscriptDb. See ?makeTranscriptDb for the details.
dataSource
Where did this data file originate? Please be as specific as possible.
species
What is the genus and species of this organism? Please use proper scientific nomenclature, for example "Homo sapiens" or "Canis familiaris", and not "human" or "my fuzzy buddy". If properly written, this information may be used by the software to help you out later.
circ_seqs
a character vector to list out which chromosomes should be marked as circular.
miRBaseBuild
specify the string for the appropriate build information from mirbase.db to use for microRNAs. This can be learned by calling supportedMiRBaseBuildValues. By default, this value will be set to NA, which will inactivate the microRNAs accessor.
useGenesAsTranscripts
This flag is normally off, but if enabled it will try to salvage a file that has no RNA features by assuming that you can use the ranges available for the Gene features in their place. Obviously, this is something you won't want to do unless you are dealing with something very simple like a prokaryote.

Value

TxDb object.

Details

makeTranscriptDbFromGFF is a convenience function that feeds data from the parsed file to the lower level makeTranscriptDb function.

There are some real deficiencies in the gtf and the gff3 file formats to bear in mind when making use of them. For gtf files, the extent of each transcript is not normally encoded, so it has to be inferred from the exon ranges presented. That's not a horrible problem, but it bears mentioning for the sake of full disclosure.

For gff3 files the situation is typically even worse, since they usually don't encode any information about the exon rank within a transcript. This is a serious oversight, so if you have an alternative to using this kind of data, you should really use it.

Some files do define an attribute that holds the exon rank. For GTF files this is usually "exon_number"; however, you must still pass it as exonRankAttributeName if you don't want the code to try to infer the exon rank. For gff3 files, we have not seen any examples of this information encoded anywhere, but if your file has such an attribute, you can specify it in the same way to avoid the inference.
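The two kinds of inference described above can be illustrated with a small base R sketch. It is not the code that makeTranscriptDbFromGFF runs internally. The exon table, its column names and all coordinates are invented for illustration, and the only assumption taken from this page is that ranks follow order of occurrence in the file.

```r
# Toy exon table, listed in file order. All IDs and coordinates are invented.
exons <- data.frame(
  tx_id = c("tx1", "tx1", "tx1", "tx2", "tx2"),
  start = c(100L, 300L, 500L, 900L, 700L),
  end   = c(200L, 400L, 600L, 1000L, 800L),
  stringsAsFactors = FALSE
)

# Exon rank inferred from order of occurrence within each transcript:
# the first exon row seen for a transcript gets rank 1, the next rank 2, etc.
exons$exon_rank <- ave(seq_along(exons$tx_id), exons$tx_id, FUN = seq_along)

# Transcript extent inferred from the exon ranges:
# smallest exon start and largest exon end per transcript.
tx_start <- tapply(exons$start, exons$tx_id, min)
tx_end   <- tapply(exons$end, exons$tx_id, max)
tx_ranges <- data.frame(
  tx_id = names(tx_start),
  start = as.integer(tx_start),
  end   = as.integer(tx_end),
  stringsAsFactors = FALSE
)

print(exons)
print(tx_ranges)
```

In this toy table the exons of tx2 are listed in descending coordinate order, so the inferred rank follows the file rather than the coordinates. This is why a file that lists exons out of transcription order yields unreliable ranks unless an explicit rank attribute is supplied.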

See Also

DEFAULT_CIRC_SEQS, makeTranscriptDbFromUCSC, makeTranscriptDbFromBiomart, makeTranscriptDb, supportedMiRBaseBuildValues

Examples

## TESTING GFF3
gffFile <- system.file("extdata","a.gff3",package="GenomicFeatures")
txdb <- makeTranscriptDbFromGFF(file=gffFile,
            format="gff3",
            exonRankAttributeName=NA,
            dataSource="partial gff3 file for Tomatoes for testing",
            species="Solanum lycopersicum")
if(interactive()) {
    saveDb(txdb,file="TESTGFF.sqlite")
}

## TESTING GTF, this time specifying the chrominfo
gtfFile <- system.file("extdata","Aedes_aegypti.partial.gtf",
                       package="GenomicFeatures")
chrominfo <- data.frame(chrom = c('supercont1.1','supercont1.2'),
                        length=c(5220442, 5300000),
                        is_circular=c(FALSE, FALSE))
txdb2 <- makeTranscriptDbFromGFF(file=gtfFile,
             format="gtf",
             exonRankAttributeName="exon_number",
             chrominfo=chrominfo,
             dataSource=paste("ftp://ftp.ensemblgenomes.org/pub/metazoa/",
                              "release-13/gtf/aedes_aegypti/",sep=""),
             species="Aedes aegypti")
if(interactive()) {
    saveDb(txdb2,file="TESTGTF.sqlite")
}
