makeTxDb: Making a TxDb object from user-supplied annotations

Description

makeTxDb is a low-level constructor for making a TxDb object from user-supplied transcript annotations. See ?makeTxDbFromUCSC and ?makeTxDbFromBiomart for higher-level functions that feed data from the UCSC or BioMart sources to makeTxDb.

Usage

makeTxDb(transcripts, splicings, genes=NULL, chrominfo=NULL, metadata=NULL, reassign.ids=FALSE)

Arguments

transcripts

data frame containing the genomic locations of a set of transcripts

splicings

data frame containing the exon and cds locations of a set of transcripts

genes

data frame containing the genes associated with a set of transcripts

chrominfo

data frame containing information about the chromosomes hosting the set of transcripts

metadata

2-column data frame containing meta information about this set of transcripts, such as organism, genome, UCSC table, etc. The names of the columns must be "name" and "value" and their type must be character.
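
As an illustration, a metadata data frame meeting these two requirements could be built as sketched below. The entries ("Organism", "Genome") and their values are hypothetical and are not prescribed by makeTxDb.

```r
# Hypothetical metadata entries; the names and values are illustrative only.
metadata <- data.frame(
    name  = c("Organism", "Genome"),
    value = c("Homo sapiens", "hg19"),
    stringsAsFactors = FALSE  # keep both columns as character, not factor
)

# Check the two documented requirements: column names and character type.
stopifnot(identical(colnames(metadata), c("name", "value")))
stopifnot(is.character(metadata$name), is.character(metadata$value))
```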

reassign.ids

controls how internal ids are assigned for each type of feature, i.e. for transcripts, exons, and cds. For each type, if reassign.ids is FALSE and ids are supplied, they are used as the internal ids. Otherwise, the internal ids are assigned in a way that is compatible with the order obtained by sorting the features first by chromosome, then by strand, then by start, and finally by end.
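
The ordering described above can be sketched in base R as follows. This is not makeTxDb's internal code: the column names are hypothetical, and the level order chosen for chromosomes and strands is an assumption made so that the sort is deterministic.

```r
# Toy feature table; column names are hypothetical, chosen for this sketch.
features <- data.frame(
    # Assumption: chromosomes sort as chr1 < chr2, strands as "+" < "-".
    chrom  = factor(c("chr2", "chr1", "chr1", "chr1"), levels = c("chr1", "chr2")),
    strand = factor(c("+", "-", "+", "+"), levels = c("+", "-")),
    start  = c(50L, 10L, 30L, 30L),
    end    = c(90L, 20L, 80L, 60L)
)

# Sort by chromosome, then strand, then start, then end.
oo <- order(features$chrom, features$strand, features$start, features$end)
features <- features[oo, ]

# Assign consecutive ids that follow that order.
features$id <- seq_len(nrow(features))
```

Supplying explicit factor levels avoids depending on locale-specific sorting of character strings.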

Value

A TxDb object.

Details

The transcripts (required), splicings (required) and genes (optional) arguments must be data frames that describe a set of transcripts and the genomic features related to them (exons, cds and genes at the moment). The chrominfo (optional) argument must be a data frame containing chromosome information like the length of each chromosome.

transcripts must have 1 row per transcript and the following columns:

tx_id: Transcript ID. Integer vector. No NAs. No duplicates.
tx_name: [optional] Transcript name. Character vector (or factor). NAs and/or duplicates are ok.
tx_type: [optional] Transcript type (e.g. mRNA, ncRNA, snoRNA, etc.). Character vector (or factor). NAs and/or duplicates are ok.
tx_chrom: Transcript chromosome. Character vector (or factor) with no NAs.
tx_strand: Transcript strand. Character vector (or factor) with no NAs where each element is either "+" or "-".
tx_start, tx_end: Transcript start and end. Integer vectors with no NAs.

Other columns, if any, are ignored (with a warning).
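
The column requirements listed above can be checked before calling makeTxDb. The following is a minimal base-R sketch on a toy data frame; it is an illustrative pre-check, not the validation that makeTxDb itself performs.

```r
# Toy transcripts table with the required columns only.
transcripts <- data.frame(
    tx_id     = 1:3,
    tx_chrom  = "chr1",
    tx_strand = c("-", "+", "+"),
    tx_start  = c(1L, 2001L, 2001L),
    tx_end    = c(999L, 2199L, 2199L),
    stringsAsFactors = FALSE
)

stopifnot(
    is.integer(transcripts$tx_id),
    !anyNA(transcripts$tx_id),                    # no NAs
    !any(duplicated(transcripts$tx_id)),          # no duplicates
    !anyNA(transcripts$tx_chrom),
    all(transcripts$tx_strand %in% c("+", "-")),  # only "+" or "-"
    !anyNA(transcripts$tx_start),
    !anyNA(transcripts$tx_end)
)
```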

splicings must have N rows per transcript, where N is the number of exons in the transcript. Each row describes an exon plus, optionally, the cds contained in this exon. Its columns must be:

tx_id: Foreign key that links each row in the splicings data frame to a unique row in the transcripts data frame. Note that more than 1 row in splicings can be linked to the same row in transcripts (many-to-one relationship). Same type as transcripts$tx_id (integer vector). No NAs. All the values in this column must be present in transcripts$tx_id.
exon_rank: The rank of the exon in the transcript. Integer vector with no NAs. (tx_id, exon_rank) pairs must be unique.
exon_id: [optional] Exon ID. Integer vector with no NAs.
exon_name: [optional] Exon name. Character vector (or factor). NAs and/or duplicates are ok.
exon_chrom: [optional] Exon chromosome. Character vector (or factor) with no NAs. If missing then transcripts$tx_chrom is used. If present then exon_strand must also be present.
exon_strand: [optional] Exon strand. Character vector (or factor) with no NAs. If missing then transcripts$tx_strand is used and exon_chrom must also be missing.
exon_start, exon_end: Exon start and end. Integer vectors with no NAs.
cds_id: [optional] cds ID. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match NAs in cds_start and cds_end.
cds_name: [optional] cds name. Character vector (or factor). If present then cds_start and cds_end must also be present. NAs and/or duplicates are ok. Must be NA if corresponding cds_start and cds_end are NAs.
cds_start, cds_end: [optional] cds start and end. Integer vectors. If one of the 2 columns is missing then all cds_* columns must be missing. NAs are allowed and must occur at the same positions in cds_start and cds_end.

Other columns, if any, are ignored (with a warning).
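
Three of the constraints above (the foreign key, the uniqueness of (tx_id, exon_rank) pairs, and the matching NAs in cds_start and cds_end) lend themselves to a quick pre-check. The sketch below uses toy data and base R only; the variable names fk_ok, pairs_ok and cds_na_ok are hypothetical.

```r
# Toy tables: transcript 1 has one exon, transcript 2 has two.
transcripts <- data.frame(tx_id = 1:2)
splicings <- data.frame(
    tx_id      = c(1L, 2L, 2L),
    exon_rank  = c(1L, 1L, 2L),
    exon_start = c(1L, 2001L, 2101L),
    exon_end   = c(999L, 2085L, 2144L),
    cds_start  = c(1L, 2022L, NA),   # last exon carries no cds
    cds_end    = c(999L, 2085L, NA)
)

# Every splicings$tx_id must be present in transcripts$tx_id.
fk_ok <- all(splicings$tx_id %in% transcripts$tx_id)

# (tx_id, exon_rank) pairs must be unique.
pairs_ok <- !any(duplicated(splicings[, c("tx_id", "exon_rank")]))

# NAs must occur at the same positions in cds_start and cds_end.
cds_na_ok <- identical(is.na(splicings$cds_start), is.na(splicings$cds_end))

stopifnot(fk_ok, pairs_ok, cds_na_ok)
```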

genes must have N rows per transcript, where N is the number of genes linked to the transcript (N will be 1 most of the time). Its columns must be:

tx_id: [optional] genes must have either a tx_id or a tx_name column but not both. Like splicings$tx_id, this is a foreign key that links each row in the genes data frame to a unique row in the transcripts data frame.
tx_name: [optional] Can be used as an alternative to the genes$tx_id foreign key.
gene_id: Gene ID. Character vector (or factor). No NAs.

Other columns, if any, are ignored (with a warning).
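
For illustration, a genes data frame keyed by tx_id might look like the sketch below. The gene ids are hypothetical; transcript 3 is linked to two genes and therefore occupies two rows.

```r
# Hypothetical gene ids, for illustration only.
genes <- data.frame(
    tx_id   = c(1L, 2L, 3L, 3L),
    gene_id = c("geneA", "geneB", "geneB", "geneC"),
    stringsAsFactors = FALSE
)

# Exactly one of the two foreign-key columns (tx_id or tx_name) may be present.
has_key <- c("tx_id", "tx_name") %in% colnames(genes)
stopifnot(sum(has_key) == 1L)

# gene_id must not contain NAs.
stopifnot(!anyNA(genes$gene_id))
```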

chrominfo must have 1 row per chromosome and the following columns:

chrom: Chromosome name. Character vector (or factor) with no NAs and no duplicates.
length: Chromosome length. Integer vector with either all NAs or no NAs.
is_circular: [optional] Chromosome circularity flag. Logical vector. NAs are ok.

Other columns, if any, are ignored (with a warning).
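
A chrominfo data frame satisfying these constraints could be sketched as follows. The chromosome names and lengths are made up for the example and do not correspond to any real genome assembly.

```r
# Hypothetical chromosomes and lengths, for illustration only.
chrominfo <- data.frame(
    chrom       = c("chr1", "chrM"),
    length      = c(5000L, 1200L),
    is_circular = c(FALSE, TRUE),
    stringsAsFactors = FALSE
)

# chrom: no NAs and no duplicates.
stopifnot(!anyNA(chrominfo$chrom), !any(duplicated(chrominfo$chrom)))

# length: either all NAs or no NAs.
len_na <- is.na(chrominfo$length)
stopifnot(all(len_na) || !any(len_na))
```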

Examples

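## makeTxDb is documented here as part of GenomicFeatures; in recent
## Bioconductor releases it may be provided by the txdbmaker package instead.
library(GenomicFeatures)

## One row per transcript.
transcripts <- data.frame(
                   tx_id=1:3,
                   tx_chrom="chr1",
                   tx_strand=c("-", "+", "+"),
                   tx_start=c(1, 2001, 2001),
                   tx_end=c(999, 2199, 2199))

## One row per exon. cds_start and cds_end are NA for exons that
## contain no cds (here, both exons of transcript 3).
splicings <-  data.frame(
                   tx_id=c(1L, 2L, 2L, 2L, 3L, 3L),
                   exon_rank=c(1, 1, 2, 3, 1, 2),
                   exon_start=c(1, 2001, 2101, 2131, 2001, 2131),
                   exon_end=c(999, 2085, 2144, 2199, 2085, 2199),
                   cds_start=c(1, 2022, 2101, 2131, NA, NA),
                   cds_end=c(999, 2085, 2144, 2193, NA, NA))

## genes, chrominfo and metadata are optional and left at their defaults.
txdb <- makeTxDb(transcripts, splicings)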
