makeTxDb: Making a TxDb object from user supplied annotations

Description

makeTxDb is a low-level constructor for making a TxDb object from user supplied transcript annotations. See ?makeTxDbFromUCSC and ?makeTxDbFromBiomart for higher-level functions that feed data from the UCSC or BioMart sources to makeTxDb.

Usage

makeTxDb(transcripts, splicings,
         genes=NULL, chrominfo=NULL, metadata=NULL,
         reassign.ids=FALSE)

Arguments

transcripts

data frame containing the genomic locations of a set of transcripts

splicings

data frame containing the exon and cds locations of a set of transcripts

genes

data frame containing the genes associated to a set of transcripts

chrominfo

data frame containing information about the chromosomes hosting the set of transcripts

metadata

2-column data frame containing meta information about this set of transcripts like organism, genome, UCSC table, etc... The names of the columns must be "name" and "value" and their type must be character.

reassign.ids

controls how internal ids should be assigned for each type of feature i.e. for transcripts, exons, and cds. For each type, if reassign.ids is FALSE and if the ids are supplied, then they are used as the internal ids, otherwise the internal ids are assigned in a way that is compatible with the order defined by ordering the features first by chromosome, then by strand, then by start, and finally by end.
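
The ordering described for reassign.ids can be illustrated in base R. This is only a sketch of the stated sort order, not the code makeTxDb uses internally. The features data frame, its column names, the chromosome order in chrom_levels and the choice of "+" before "-" are all assumptions made for the illustration.

```r
# Hypothetical feature table (values are made up for illustration).
features <- data.frame(
    chrom  = c("chr1", "chr2", "chr1", "chr1"),
    strand = c("-", "+", "+", "+"),
    start  = c(100L, 500L, 300L, 100L),
    end    = c(200L, 900L, 400L, 250L),
    stringsAsFactors = FALSE)

# Assumed chromosome order and assumed strand order ("+" before "-").
chrom_levels <- c("chr1", "chr2")

# Sort by chromosome, then strand, then start, then end.
o <- order(match(features$chrom, chrom_levels),
           match(features$strand, c("+", "-")),
           features$start, features$end)

# Assign ids 1..n so that they follow the sorted order.
features$internal_id <- integer(nrow(features))
features$internal_id[o] <- seq_along(o)
print(features)
```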

Value

A TxDb object.

Details

The transcripts (required), splicings (required) and genes (optional) arguments must be data frames that describe a set of transcripts and the genomic features related to them (exons, cds and genes at the moment). The chrominfo (optional) argument must be a data frame containing chromosome information like the length of each chromosome.

transcripts must have 1 row per transcript and the following columns:

tx_id: Transcript ID. Integer vector. No NAs. No duplicates.
tx_name: [optional] Transcript name. Character vector (or factor). NAs and/or duplicates are ok.
tx_type: [optional] Transcript type (e.g. mRNA, ncRNA, snoRNA, etc...). Character vector (or factor). NAs and/or duplicates are ok.
tx_chrom: Transcript chromosome. Character vector (or factor) with no NAs.
tx_strand: Transcript strand. Character vector (or factor) with no NAs where each element is either "+" or "-".
tx_start,tx_end: Transcript start and end. Integer vectors with no NAs.

Other columns, if any, are ignored (with a warning).

splicings must have N rows per transcript, where N is the number of exons in the transcript. Each row describes an exon plus, optionally, the cds contained in this exon. Its columns must be:

tx_id: Foreign key that links each row in the splicings data frame to a unique row in the transcripts data frame. Note that more than 1 row in splicings can be linked to the same row in transcripts (many-to-one relationship). Same type as transcripts$tx_id (integer vector). No NAs. All the values in this column must be present in transcripts$tx_id.
exon_rank: The rank of the exon in the transcript. Integer vector with no NAs. (tx_id,exon_rank) pairs must be unique.
exon_id: [optional] Exon ID. Integer vector with no NAs.
exon_name: [optional] Exon name. Character vector (or factor). NAs and/or duplicates are ok.
exon_chrom: [optional] Exon chromosome. Character vector (or factor) with no NAs. If missing then transcripts$tx_chrom is used. If present then exon_strand must also be present.
exon_strand: [optional] Exon strand. Character vector (or factor) with no NAs. If missing then transcripts$tx_strand is used and exon_chrom must also be missing.
exon_start,exon_end: Exon start and end. Integer vectors with no NAs.
cds_id: [optional] cds ID. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match NAs in cds_start and cds_end.
cds_name: [optional] cds name. Character vector (or factor). If present then cds_start and cds_end must also be present. NAs and/or duplicates are ok. Must be NA if the corresponding cds_start and cds_end are NA.
cds_start,cds_end: [optional] cds start and end. Integer vectors. If one of the 2 columns is missing then all cds_* columns must be missing. NAs are allowed and must occur at the same positions in cds_start and cds_end.

Other columns, if any, are ignored (with a warning).
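
Several of the constraints above can be checked in base R before calling makeTxDb. The sketch below is one possible pre-flight check, not part of the package: check_splicings is a hypothetical helper, and the table values are illustrative.

```r
# Illustrative tables following the column layout described above.
transcripts <- data.frame(
    tx_id     = 1:2,
    tx_chrom  = "chr1",
    tx_strand = c("-", "+"),
    tx_start  = c(1L, 2001L),
    tx_end    = c(999L, 2199L),
    stringsAsFactors = FALSE)
splicings <- data.frame(
    tx_id      = c(1L, 2L, 2L),
    exon_rank  = c(1L, 1L, 2L),
    exon_start = c(1L, 2001L, 2101L),
    exon_end   = c(999L, 2085L, 2199L),
    cds_start  = c(1L, 2022L, NA),
    cds_end    = c(999L, 2085L, NA))

# Hypothetical helper: returns one TRUE/FALSE per constraint.
check_splicings <- function(transcripts, splicings) {
    c(
        # every splicings$tx_id is non-NA and present in transcripts$tx_id
        fk_ok = !anyNA(splicings$tx_id) &&
            all(splicings$tx_id %in% transcripts$tx_id),
        # (tx_id, exon_rank) pairs are unique
        rank_unique = !any(duplicated(splicings[c("tx_id", "exon_rank")])),
        # NAs occur at the same positions in cds_start and cds_end
        cds_na_aligned = identical(is.na(splicings$cds_start),
                                   is.na(splicings$cds_end))
    )
}

checks <- check_splicings(transcripts, splicings)
print(checks)
```

A FALSE entry points at the constraint that the input violates; the helper does not cover every rule listed above.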

genes must have N rows per transcript, where N is the number of genes linked to the transcript (N will be 1 most of the time). Its columns must be:

tx_id: [optional] genes must have either a tx_id or a tx_name column but not both. Like splicings$tx_id, this is a foreign key that links each row in the genes data frame to a unique row in the transcripts data frame.
tx_name: [optional] Can be used as an alternative to the genes$tx_id foreign key.
gene_id: Gene ID. Character vector (or factor). No NAs.

Other columns, if any, are ignored (with a warning).
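
As an illustration of the layout above, the following base R sketch builds a genes data frame keyed by tx_id, in which one transcript is linked to two genes. The gene identifiers and the tx_ids vector (standing in for transcripts$tx_id) are made up for the example.

```r
# Assumed transcript IDs, standing in for transcripts$tx_id.
tx_ids <- 1:3

# One row per (transcript, gene) link; transcript 2 is linked to two genes.
genes <- data.frame(
    tx_id   = c(1L, 2L, 2L, 3L),
    gene_id = c("geneA", "geneB", "geneC", "geneB"),
    stringsAsFactors = FALSE)

# Exactly one of tx_id / tx_name may be present.
has_one_key <- xor("tx_id" %in% names(genes), "tx_name" %in% names(genes))

# Every link must point at a known transcript, and gene_id has no NAs.
links_ok <- all(genes$tx_id %in% tx_ids) && !anyNA(genes$gene_id)

# Number of genes linked to each transcript (the N in the text above).
genes_per_tx <- table(genes$tx_id)
print(genes_per_tx)
```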

chrominfo must have 1 row per chromosome and the following columns:

chrom: Chromosome name. Character vector (or factor) with no NAs and no duplicates.
length: Chromosome length. Integer vector with either all NAs or no NAs.
is_circular: [optional] Chromosome circularity flag. Logical vector. NAs are ok.

Other columns, if any, are ignored (with a warning).
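
The chrominfo layout above, and the metadata layout described under Arguments, might look like this in base R. The chromosome names, lengths and metadata entries are illustrative values only, and length_ok is a hypothetical helper for the "all NAs or no NAs" rule.

```r
# One row per chromosome.
chrominfo <- data.frame(
    chrom       = c("chr1", "chrM"),
    length      = c(248956422L, 16569L),
    is_circular = c(FALSE, TRUE),
    stringsAsFactors = FALSE)

# Hypothetical helper: lengths must be either all NA or all non-NA.
length_ok <- function(x) all(is.na(x)) || !anyNA(x)

chrominfo_ok <- !anyNA(chrominfo$chrom) &&
    !any(duplicated(chrominfo$chrom)) &&
    length_ok(chrominfo$length)

# Two character columns named "name" and "value".
metadata <- data.frame(
    name  = c("Organism", "Genome"),
    value = c("Homo sapiens", "hg38"),
    stringsAsFactors = FALSE)

print(chrominfo)
print(metadata)
```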

Examples

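## makeTxDb() is exported by GenomicFeatures; recent Bioconductor
## releases provide it in the txdbmaker package instead.
library(GenomicFeatures)

## 3 transcripts on chr1: one on the minus strand, two on the plus strand.
transcripts <- data.frame(
                   tx_id=1:3,
                   tx_chrom="chr1",
                   tx_strand=c("-", "+", "+"),
                   tx_start=c(1, 2001, 2001),
                   tx_end=c(999, 2199, 2199))

## One row per exon. Transcript 3 is non-coding, so its cds_* values are NA.
splicings <- data.frame(
                   tx_id=c(1L, 2L, 2L, 2L, 3L, 3L),
                   exon_rank=c(1, 1, 2, 3, 1, 2),
                   exon_start=c(1, 2001, 2101, 2131, 2001, 2131),
                   exon_end=c(999, 2085, 2144, 2199, 2085, 2199),
                   cds_start=c(1, 2022, 2101, 2131, NA, NA),
                   cds_end=c(999, 2085, 2144, 2193, NA, NA))

txdb <- makeTxDb(transcripts, splicings)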
