query: To get a list of sequence names from an ACNUC data base located on the web

Description

This is a major command of the package. It executes all sequence retrievals using any selection criteria the data base allows. The sequences come from an ACNUC data base located on the web and are transferred through a socket. The command produces the list of all sequence names that fit the required criteria. The sequence names belong to the class SeqAcnucWeb.

Usage

query(socket, listname, query, invisible = FALSE)

Arguments

socket

a socket of class connection returned by choosebank.

listname

The name of the list as a quoted string of chars

query

A quoted string of chars containing the request with the syntax given in the details section

invisible

if TRUE, the result of the query will be invisible but still assigned in the environment

Value

A list with the following components:

bank

the name of the bank that has been chosen by choosebank.socket

call

the original call

name

the list name

req

a list of sequence names that fit the required criteria

Details

Each selection criterion is written using the following syntax:

c = criterion value

where c indicates which criterion is used. Many selection criteria are available. They correspond mainly to the structured elements of the sequence documentation in the data banks, and are detailed below. Criteria can be combined using 3 logical operations:

criterion1 ET criterion2 : logical AND (sequences that fit criteria 1 and 2 simultaneously).

criterion1 OU criterion2 : logical OR (sequences that fit at least one of the two criteria).

NO criterion1 : logical negation (sequences that do not fit criterion 1).

Parentheses can be used to delimit the range of operations. Lists of sequences can be re-used at will, which is very convenient for breaking complex requests into simple ones. For instance, here are two equivalent ways to get all coding sequences from Escherichia coli that are not partial:

# First way: a single request
s = choosebank("genbank")
query(s$socket,"final","sp=escherichia coli ET t=cds ET NO k=partial")

# Second way: intermediate lists re-used in later requests
s = choosebank("genbank")
query(s$socket,"eco","sp=escherichia coli")
query(s$socket,"ecocds","eco ET t=cds")
query(s$socket,"final","ecocds ET NO k=partial")
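The combination rules above amount to plain string manipulation, which the following R sketch tries to make concrete. The helpers crit, q_and, q_or, q_not and grp are hypothetical names introduced here for illustration; they are not part of the package and only build the request string that would be passed as the query argument.

```r
# Hypothetical helpers (not part of the package): they only build request strings.
crit  <- function(code, value) paste0(code, "=", value)  # e.g. "t=cds"
q_and <- function(...) paste(..., sep = " ET ")          # logical AND
q_or  <- function(...) paste(..., sep = " OU ")          # logical OR
q_not <- function(x) paste("NO", x)                      # logical negation
grp   <- function(x) paste0("(", x, ")")                 # parentheses delimit a sub-request

# Same request as the single-request example in the text
request <- q_and(crit("sp", "escherichia coli"),
                 crit("t", "cds"),
                 q_not(crit("k", "partial")))
print(request)

# A nested request: parentheses keep the OU together before the ET applies
nested <- q_and(grp(q_or(crit("sp", "homo sapiens"),
                         crit("sp", "mus musculus"))),
                crit("t", "cds"))
print(nested)

# The strings would then be used as the query argument, for example:
# s <- choosebank("genbank")
# query(s$socket, "final", request)
```

Building requests this way keeps each criterion separate, which may help when the same sub-request is reused in several queries.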

SP = species name

Sequences from a given species or group of species. The special character @ can be used to match any group of characters in the species name, e.g. SP=RATTUS@. Spaces are allowed. Examples: ESCHERICHIA COLI, @COLI, E@COLI. Species names are tree-structured according to the biological classification of species.
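To make the behaviour of @ concrete, the sketch below mimics it locally by translating a pattern into a regular expression. This is only an illustration: the real matching is performed by the ACNUC server, the function name matches_acnuc_pattern and the small species vector are hypothetical, and case-insensitive matching is an assumption made here. The sketch also ignores the tree structure of species names.

```r
# Hypothetical local illustration of "@" matching; the server does the real work.
matches_acnuc_pattern <- function(pattern, names) {
  # "@" stands for any group of characters, like "*" in a shell glob
  glob <- gsub("@", "*", pattern, fixed = TRUE)
  # utils::glob2rx() turns the glob into an anchored regular expression
  rx <- utils::glob2rx(glob)
  # Case-insensitive matching is an assumption of this sketch
  grepl(rx, names, ignore.case = TRUE)
}

# Made-up candidate list, for illustration only
species <- c("ESCHERICHIA COLI", "RATTUS NORVEGICUS",
             "RATTUS RATTUS", "ENTAMOEBA COLI")

rattus <- species[matches_acnuc_pattern("RATTUS@", species)]
coli   <- species[matches_acnuc_pattern("@COLI", species)]
exact  <- species[matches_acnuc_pattern("escherichia coli", species)]
print(rattus)
print(coli)
print(exact)
```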

K = keyword

Sequences having a given keyword. Since keywords are tree-structured, as are species, you will also select all sequences associated with keywords further down in the tree. The character @ can be used to match any group of characters.

R = reference code

Sequences from a given reference. References are specified as follows depending on the type of document:

Document                   Format                        Example
Journal article            journal_code/volume/1st_page  jme/34/17
Book                       book/year/1st_author          book/1980/broker
Thesis                     thesis/year/1st_author        thesis/1984/wildgruber
Patent                     patent/patent_coded_number    patent/ep0238993
Unpublished, or submitted  unpubl/year/1st_author        unpubl/1993/cho

J = journal name

Sequences published in a given journal.

Y = year

Sequences published in a given year (e.g. 1982).

Y > year

Sequences published after or during a given year.

Y < year

Sequences published before or during a given year.
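The reference code formats listed for the R criterion are slash-separated strings, so they can be assembled programmatically. The sketch below uses a hypothetical helper ref_code (not part of the package) and lower-cases the parts, which is an assumption based on the examples given for each document type.

```r
# Hypothetical helper (not part of the package): joins the parts with "/".
# Lower-casing is an assumption based on the documented examples.
ref_code <- function(...) tolower(paste(..., sep = "/"))

article <- ref_code("jme", 34, 17)                 # journal_code/volume/1st_page
thesis  <- ref_code("thesis", 1984, "Wildgruber")  # thesis/year/1st_author

# Requests using the R criterion
requests <- paste0("r=", c(article, thesis))
print(requests)

# Each string would then be passed as the query argument, for example:
# s <- choosebank("genbank")
# query(s$socket, "ref1", requests[1])
```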

AU = author

Sequences published by given author(s). Use @ to specify any letters in name (e.g. @ORMOND@ for Van Ormondt). Only last names are indexed - initials are ignored. All authors of journal articles are indexed. Only the first author of books, theses, patents and other documents is indexed.

T = sequence type

Sequences of given type. You generally obtain subsequences with this criterion because types are for example tRNA, rRNA or protein gene. Type should not be confused with molecule which denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Type is defined only for the nucleotide sequence banks. Presently the existing types are:

ID              Locus entry (EMBL, SWISS-PROT, NRSub)
LOCUS           Locus entry (GenBank, Hovergen, EMGLib)
CDS       .PE   protein coding region (all)
RRNA      .RR   mature ribosomal RNA (all)
TRNA      .TR   mature transfer RNA (all)
MISC_RNA  .RN   other structural RNA coding region (EMBL, GenBank, Hovergen, NRSub, EMGLib)
SNRNA     .SN   small nuclear RNA (EMBL, GenBank, Hovergen, EMGLib)
SCRNA     .SC   small cytoplasmic RNA (EMBL, GenBank, Hovergen, NRSub, EMGLib)
3'INT     .3I   3' intron (Hovergen)
3'NCR     .3F   3' non-coding region (Hovergen)
5'INT     .5I   5' intron (Hovergen)
5'NCR     .5F   5' non-coding region (Hovergen)
CPG       .CG   CpGobs/CpGexp>0.5 (Hovergen)
INT_INT   .IN   internal intron (Hovergen)

Each entry of a FEATURE TABLE describing a coding region of a DNA fragment gives rise to a subsequence equal to the fragments described in the location of the feature. The type of the resulting subsequence equals the key of the corresponding feature table entry. The name of the resulting subsequence is built by adding to the parent sequence's name an extension uniquely identifying this particular feature.

Sequences of a given type are generally subsequences, i.e., fragments of parent sequences, except if the coding region covers totally the parent sequence, in which case ACNUC does not create a subsequence.

O = organelle

Sequences from a given organelle. Organelle (e.g., chloroplast, mitochondrion) denotes the nature of the genome that harbors a particular gene. By extension, ACNUC also sees the nucleus as an organelle. Also, a nuclear-encoded gene coding for a protein exported to an organelle is considered as a nuclear gene. The existing organelles are:

CHLOROPLAST    Chloroplast genome (EMBL, GenBank, NBRF, Hovergen)
MITOCHONDRION  Mitochondrial genome (EMBL, GenBank, NBRF, Hovergen)
KINETOPLAST    Kinetoplast genome (EMBL, GenBank, Hovergen)
NUCLEAR        Nuclear genome (all)

M = molecule name

Sequences with given chemical structure. In ACNUC, molecule denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Molecule should not be confused with type which identifies the encoded molecule (e.g., protein, tRNA, rRNA). Thus the sequence of a tRNA gene has DNA for molecule because DNA rather than tRNA was sequenced. The subsequence covering the tRNA region has tRNA for type because this is the nature of the encoded product. Molecule is defined only for the nucleotide sequence banks (GenBank, EMBL, Hovergen, NRSub, and CGDB). Presently the existing molecules are:

DNA   Sequenced molecule is DNA (all)
RNA   Sequenced molecule is RNA (all)
MRNA  Sequenced molecule is mRNA (GenBank, Hovergen)
RRNA  Sequenced molecule is rRNA (GenBank, Hovergen)
TRNA  Sequenced molecule is tRNA (GenBank, Hovergen)
URNA  Sequenced molecule is snRNA (GenBank, Hovergen)

N = sequence name

Sequence of given name.

AC = accession number

Sequences of given accession number.

F = file name

Sequences whose names are in a specified file.

FA = file name

Sequences whose accession numbers are in a specified file.

References

To get the release date and content of all the databases located at the pbil, please look at the following url: http://pbil.univ-lyon1.fr/search/releases.php

Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier, C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucl. Acids Res., 12:121-127.

Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Appl. Biosci., 3:167-172.

Gouy, M., Gautier, C., Milleret, F. (1985) System analysis and nucleic acid sequence banks. Biochimie, 67:433-436.

To have an overview of seqinR's functionality, please consult this vignette:

Charif, D., Lobry, J.R. (2005) SeqinR: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Springer Verlag, Biological and Medical Physics/Biomedical Series, in preparation.

Examples


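s = choosebank("genbank")
query(s$socket, "ecoli", "sp=escherichia coli@")
# The result is available under the list name given above
ecoli
# Get the first 4 sequence names
ecoli$req[1:4]
# Get the 5th sequence name
ecoli$req[[5]]
# Show the original call
ecoli$call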
