DBWorld: E-mails from DBWorld mailing list

Description

The dataset contains the bodies of n = 64 e-mails, in binary bag-of-words representation, which Filannino manually collected from the DBWorld mailing list.

The DBWorld mailing list announces conferences, jobs, books, software, and grants.

Filannino applied supervised learning to classify the e-mails into two classes: "announces of conferences" and "everything else".

Out of 64 e-mails, 29 are about conference announcements and 35 are not.

Every e-mail is represented as a vector of p binary values, where p is the size of the vocabulary extracted from the entire corpus under two constraints: common words such as "the", "is" or "which" (so-called stop words) are removed, as are words with fewer than 3 or more than 30 characters.

An entry of the vector is 1 if the corresponding word occurs in the e-mail and 0 otherwise.
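
The construction described above can be sketched in base R. This is only an illustration under stated assumptions: the three e-mail bodies and the short stop-word list are made up, and the tokenization rule (split on non-letters) is a guess, since the exact preprocessing used for DBWorld is not given here.

```r
# Toy corpus: three hypothetical e-mail bodies (not taken from DBWorld).
emails <- c(
  "Call for papers: the database conference is open",
  "Postdoc position in data management",
  "The conference deadline is extended"
)

# Small illustrative stop-word list (an assumption, not the list used for DBWorld).
stop_words <- c("the", "is", "which", "for", "in")

# Lower-case each body and split it on anything that is not a letter.
tokens <- lapply(strsplit(tolower(emails), "[^a-z]+"), function(w) {
  w <- unique(w[nzchar(w)])
  # Drop stop words and words with fewer than 3 or more than 30 characters.
  w[!(w %in% stop_words) & nchar(w) >= 3 & nchar(w) <= 30]
})

# The vocabulary is the set of distinct words that survive the filtering.
vocabulary <- sort(unique(unlist(tokens)))

# One row per e-mail, one column per word: 1 if the word occurs, 0 otherwise.
bow <- t(vapply(tokens, function(w) as.integer(vocabulary %in% w),
                integer(length(vocabulary))))
colnames(bow) <- vocabulary
```

Here `ncol(bow)` plays the role of p and `nrow(bow)` the role of n; for the actual dataset these would be 4702 and 64.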

The number of unique words in the dataset is p=4702.

The dataset is originally from the UCI Machine Learning Repository DBWorldData.

rawDBWorld is a list of 64 objects containing the original e-mails.

Usage

data(DBWorld)
data(rawDBWorld)

Details

See Bache and Lichman (2013) for details of the data. The original dataset is freely available from the UCI Machine Learning Repository website: http://archive.ics.uci.edu/ml/datasets/DBWorld+e-mails

References

Bache K, Lichman M (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets

Filannino M (2011). DBWorld e-mail classification using a very small corpus. Project of Machine Learning course, University of Manchester.

Examples

## Not run:
data(DBWorld)
data(rawDBWorld)
## End(Not run)