Spam E-mail Database
A data set collected at Hewlett-Packard Labs, that classifies 4601 e-mails as spam or non-spam. In addition to this class label there are 57 variables indicating the frequency of certain words and characters in the e-mail.
The data set contains 2788 e-mails classified as
"nonspam" and 1813
The ``spam'' concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... This collection of spam e-mails came from the collectors' postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
A data frame with 4601 observations and 58 variables.
The first 48 variables contain the frequency of the variable name
(e.g., business) in the e-mail. If the variable name starts with num (e.g.,
num650) the it indicates the frequency of the corresponding number (e.g., 650).
The variables 49-54 indicate the frequency of the characters `;', `(', `[', `!',
`\$', and `\#'. The variables 55-57 contain the average, longest
and total run-length of capital letters. Variable 58 indicates the type of the
mail and is either
"spam", i.e. unsolicited
T. Hastie, R. Tibshirani, J.H. Friedman. The Elements of Statistical Learning. Springer, 2001.