The DBWorld mailing list announces conferences, jobs, books, software and grants.
Filannino applied supervised learning algorithms to classify e-mails as either ``announces of conferences'' or ``everything else''.
Out of 64 e-mails, 29 are about conference announcements and 35 are not.
Every e-mail is represented as a vector of p binary values, where p is the size of the vocabulary extracted from the entire corpus under two constraints: common words such as ``the'', ``is'' or ``which'' (so-called stop words) are removed, and so are words with fewer than 3 or more than 30 characters.
An entry of the vector is 1 if the corresponding word occurs in the e-mail and 0 otherwise.
The number of unique words in the dataset is p=4702.
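The binary representation described above can be illustrated on a toy corpus. This is a minimal sketch, not the preprocessing code used to build the dataset: the three example texts, the short stop-word list, and the helper name `tokenize` are all assumptions made for illustration.

```r
# Toy corpus standing in for the e-mails (hypothetical text, not the real data).
emails <- c(
  "Call for papers: the conference on databases is open",
  "Job opening for a database engineer",
  "Conference deadline extended for papers"
)

# Small illustrative stop-word list (assumption; the original list is not given).
stop_words <- c("the", "is", "which", "for", "on", "a")

# Lower-case the text, split on non-letters, drop words with fewer than 3
# or more than 30 characters, then drop stop words and duplicates.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nchar(words) >= 3 & nchar(words) <= 30]
  unique(setdiff(words, stop_words))
}

tokens <- lapply(emails, tokenize)

# The vocabulary is the set of words that survive filtering in any e-mail.
vocabulary <- sort(unique(unlist(tokens)))

# One row per e-mail, one column per vocabulary word; entries are 0 or 1.
X <- t(sapply(tokens, function(tk) as.integer(vocabulary %in% tk)))
colnames(X) <- vocabulary

dim(X)
```

Here `ncol(X)` plays the role of p; in the real dataset the same construction over 64 e-mails gives p=4702 columns.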
The dataset is originally from the UCI Machine Learning Repository DBWorldData.
rawDBWorld is a list of 64 objects containing the original e-mails.
data(DBWorld)
data(rawDBWorld)
Filannino, M., (2011). 'DBWorld e-mail classification using a very small corpus', Project of Machine Learning course, University of Manchester.
## Not run:
# data(DBWorld)
# data(rawDBWorld)
# ## End(Not run)