query: To get a list of sequence names from an ACNUC data base located on the web

Description

This is a major command of the package. It executes all sequence retrievals using any selection criteria the data base allows. The sequences come from an ACNUC data base located on the web and are transferred through a socket. The command produces the list of all sequence names that fit the required criteria. The sequence names belong to the class SeqAcnucWeb.

Usage

query(listname, query, socket = "auto", invisible = TRUE, verbose = FALSE, virtual = FALSE)

Arguments

listname

The name of the list as a quoted string of chars

query

A quoted string of chars containing the request with the syntax given in the details section

socket

a socket of class connection and sockconn returned by choosebank. The default value (auto) means that the socket will be set to the socket component of the banknameSocket variable.

invisible

if FALSE, the result is returned visibly.

verbose

if TRUE, verbose mode is on

virtual

if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.

Value

A list with the following components:

bank

the name of the bank that has been chosen by choosebank.socket

call

the original call

name

the list name

nelem

the number of elements in the list on the server

typelist

the type of the elements of the list: SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.

req

a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE
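
The components listed above can be read with the usual `$` accessor. The sketch below is only an illustration: `summarise_acnuc_list` is a hypothetical helper that is not part of seqinr, and the `mock` object is built by hand with invented values so that the code runs without a server connection.

```r
# Hypothetical helper (not part of seqinr): one-line summary of a list
# that has the components documented in the Value section.
summarise_acnuc_list <- function(x) {
  kinds <- c(SQ = "sequence names", KW = "keywords", SP = "species names")
  kind <- unname(kinds[x$typelist])
  # req is a single NA when the list was created with virtual = TRUE
  fetched <- !(length(x$req) == 1 && is.atomic(x$req) && is.na(x$req))
  sprintf("list '%s': %s %s (%s)", x$name, format(x$nelem), kind,
          if (fetched) "elements retrieved" else "virtual, req is NA")
}

# Mock object with invented values, mimicking a call made with virtual = TRUE
mock <- list(call = quote(query("bb", "sp=Borrelia burgdorferi", virtual = TRUE)),
             name = "bb", nelem = 42L, typelist = "SQ", req = NA)
print(summarise_acnuc_list(mock))
```

With a real result, the same helper would be called on the object produced by query instead of on `mock`.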

Details

Each selection criterion is written using the following syntax:

c = criterion value

where c indicates which criterion is used. Many selection criteria are available. They correspond mainly to the structured elements of the sequence documentation in the data banks, and are detailed below. Criteria can be combined using 3 logical operations:

criterion1 ET criterion2 : logical AND (sequences that fit criteria 1 and 2 simultaneously).

criterion1 OU criterion2 : logical OR (sequences that fit at least one of the two criteria).

NO criterion1 : logical negation (sequences that do not fit criterion 1).

Parentheses can be used to delimit the range of operations. Lists of sequences can be re-used at will, which is very convenient for breaking complex requests into simple ones. For instance, here are two equivalent ways to get all coding sequences from Escherichia coli that are not partial:

 # First way: a single request
 choosebank("genbank")
 query("final", "sp=escherichia coli ET t=cds ET NO k=partial")

 # Second way: intermediate lists re-used in later requests
 choosebank("genbank")
 query("eco", "sp=escherichia coli")
 query("ecocds", "eco ET t=cds")
 query("final", "ecocds ET NO k=partial")
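
Because a request is an ordinary character string, it can also be assembled programmatically. The following is a minimal sketch: the helpers `crit`, `all_of` and `none_of` are hypothetical names introduced here, not seqinr functions, and they only paste strings together using the ET and NO operators described above. The server call is wrapped in `if (FALSE)` because it needs an internet connection.

```r
# Hypothetical string-building helpers (not part of seqinr)
crit    <- function(code, value) paste0(code, "=", value)  # e.g. "t=cds"
all_of  <- function(...) paste(..., sep = " ET ")          # logical AND
none_of <- function(x) paste("NO", x)                      # logical negation

q <- all_of(crit("sp", "escherichia coli"),
            crit("t", "cds"),
            none_of(crit("k", "partial")))
print(q)

if (FALSE) {  # needs an internet connection and the seqinr package
  library(seqinr)
  choosebank("genbank")
  query("final", q)
}
```

The string built here is the same as the single-request form shown in the example above.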

SP = species name

sequences from a given species or group of species. The special character @ can be used to match any group of characters in the species name, e.g. SP=RATTUS@. Spaces are allowed. Examples: ESCHERICHIA COLI, @COLI, E@COLI. Species names are tree-structured according to the biological classification of species.

K = keyword

sequences having a given keyword. Since keywords are tree-structured, as species are, a keyword also selects all sequences associated with keywords further down in the tree. (@ can be used to match any group of characters.)
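
To get a feel for what an @ pattern selects, the matching can be imitated locally. This is only a sketch under two assumptions that the text above does not state: that @ behaves like a shell `*`, and that matching ignores case. `matches_acnuc_pattern` is a hypothetical helper; the real matching is done by the ACNUC server and may differ.

```r
# Hypothetical local helper (not part of seqinr or ACNUC).
# Assumes '@' acts like a shell '*' and that case is ignored.
# Note: utils::glob2rx also treats '?' as a one-character wildcard.
matches_acnuc_pattern <- function(pattern, names) {
  glob <- gsub("@", "*", pattern, fixed = TRUE)
  grepl(utils::glob2rx(glob), names, ignore.case = TRUE)
}

species <- c("ESCHERICHIA COLI", "RATTUS NORVEGICUS")
print(matches_acnuc_pattern("E@COLI", species))
print(matches_acnuc_pattern("RATTUS@", species))
print(matches_acnuc_pattern("@COLI", species))
```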

R = reference code

sequences from a given reference. References are specified as follows depending on the type of document:

Document                    Format                          Example
Journal article             journal_code/volume/1st_page    jme/34/17
Book                        book/year/1st_author            book/1980/broker
Thesis                      thesis/year/1st_author          thesis/1984/wildgruber
Patent                      patent/patent_coded_number      patent/ep0238993
Unpublished, or submitted   unpubl/year/1st_author          unpubl/1993/cho

J = journal name

sequences published in a given journal.

Y = year

sequences published in given year (e.g. 1982).

Y > year

sequences published after or during a given year.

Y < year

sequences published before or during a given year.

AU = author

sequences published by given author(s). Use @ to specify any letters in name (e.g. @ORMOND@ for Van Ormondt). Only last names are indexed; initials are ignored. All authors of journal articles are indexed. Only the first author of books, theses, patents and other documents is indexed.

T = sequence type

sequences of given type. You generally obtain subsequences with this criterion because types are for example tRNA, rRNA or protein gene. Type should not be confused with molecule, which denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Type is defined only for the nucleotide sequence banks. Presently the existing types are:

ID                 Locus entry (EMBL, SWISS-PROT, NRSub)
LOCUS              Locus entry (GenBank, Hovergen, EMGLib)
CDS        .PE     protein coding region (all)
RRNA       .RR     mature ribosomal RNA (all)
TRNA       .TR     mature transfer RNA (all)
MISC_RNA   .RN     other structural RNA coding region (EMBL, GenBank, Hovergen, NRSub, EMGLib)
SNRNA      .SN     small nuclear RNA (EMBL, GenBank, Hovergen, EMGLib)
SCRNA      .SC     small cytoplasmic RNA (EMBL, GenBank, Hovergen, NRSub, EMGLib)
3'INT      .3I     3' intron (Hovergen)
3'NCR      .3F     3' non-coding region (Hovergen)
5'INT      .5I     5' intron (Hovergen)
5'NCR      .5F     5' non-coding region (Hovergen)
CPG        .CG     CpGobs/CpGexp>0.5 (Hovergen)
INT_INT    .IN     internal intron (Hovergen)

Each entry of a FEATURE TABLE describing a coding region of a DNA fragment gives rise to a subsequence equal to the fragments described in the location of the feature. The type of the resulting subsequence equals the key of the corresponding feature table entry. The name of the resulting subsequence is built by adding to the parent sequence's name an extension uniquely identifying this particular feature.

Sequences of a given type are generally subsequences, i.e., fragments of parent sequences, unless the coding region covers the entire parent sequence, in which case ACNUC does not create a subsequence.

O = organelle

sequences from a given organelle. Organelle (e.g., chloroplast, mitochondrion) denotes the nature of the genome that harbors a particular gene. By extension, ACNUC also sees the nucleus as an organelle. Also, a nuclear-encoded gene coding for a protein exported to an organelle is considered as a nuclear gene. The existing organelles are:

CHLOROPLAST     Chloroplast genome (EMBL, GenBank, NBRF, Hovergen)
MITOCHONDRION   Mitochondrial genome (EMBL, GenBank, NBRF, Hovergen)
KINETOPLAST     Kinetoplast genome (EMBL, GenBank, Hovergen)
NUCLEAR         Nuclear genome (all)

M = molecule name

sequences with given chemical structure. In ACNUC, molecule denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Molecule should not be confused with type, which identifies the encoded molecule (e.g., protein, tRNA, rRNA). Thus the sequence of a tRNA gene has DNA for molecule because DNA rather than tRNA was sequenced. The subsequence covering the tRNA region has tRNA for type because this is the nature of the encoded product. Molecule is defined only for the nucleotide sequence banks (GenBank, EMBL, Hovergen, NRSub, and CGDB). Presently the existing molecules are:

DNA     Sequenced molecule is DNA (all)
RNA     Sequenced molecule is RNA (all)
MRNA    Sequenced molecule is mRNA (GenBank, Hovergen)
RRNA    Sequenced molecule is rRNA (GenBank, Hovergen)
TRNA    Sequenced molecule is tRNA (GenBank, Hovergen)
URNA    Sequenced molecule is snRNA (GenBank, Hovergen)

N = sequence name

sequence of given name.

AC = accession number

sequences of given accession number.

F = file name

(not implemented) sequences whose names are in a specified file. Use crelistfromclientdata with type = "SQ" for this purpose.

FA = file name

(not implemented) sequences whose accession numbers are in a specified file. Use crelistfromclientdata with type = "AC" for this purpose.

References

To get the release date and content of all the databases located at the pbil, please look at the following url: http://pbil.univ-lyon1.fr/search/releases.php

Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier, C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucl. Acids Res., 12:121-127.

Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Appl. Biosci., 3:167-172.

Gouy, M., Gautier, C., Milleret, F. (1985) System analysis and nucleic acid sequence banks. Biochimie, 67:433-436.

citation("seqinr")

Examples

# Need internet connection
 choosebank("genbank")
 query("bb", "sp=Borrelia burgdorferi")
 # To get the names of the 4 first sequences:
 sapply(bb$req[1:4], getName)
 # To get the 4 first sequences:
 sapply(bb$req[1:4], getSequence, as.string = TRUE)