Extracts a set of features from a vector of email addresses for use in credit scoring. In addition to parsing each email into its constituent parts (the username and the domain), the function computes character-level statistics for the username (e.g., counts of digits, dots, and uppercase letters) and string distance metrics between the username and the client's name. If dates of birth are provided, it also checks whether date-of-birth components appear in the username in several formats.
extract_email_features(
  emails,
  client_name = NULL,
  client_surname = NULL,
  date_of_birth = NULL,
  error_on_invalid = FALSE
)
A data.frame with the following columns:
email_domain
The domain part of the email address (i.e., the substring after the '@').
email_major_domain
The major domain extracted as the substring after the last dot in the domain (e.g., "com" in "gmail.com").
email_n_chars
The number of characters in the email username (i.e., the part before the '@').
email_n_digits
The number of digits found in the email username.
email_n_dots
The number of dot ('.') characters in the email username.
email_n_caps
The number of uppercase letters in the email username.
email_total_letters
The total count of alphabetic characters (both uppercase and lowercase) in the email username.
email_prop_digits
The proportion of digits in the email username (calculated as email_n_digits / email_n_chars).
email_max_consecutive_digits
The maximum length of any sequence of consecutive digits in the email username.
email_name_in_email
Logical. TRUE if the provided client name is found within the email username (case-insensitive), FALSE otherwise.
email_name_in_email_dist_lv
The Levenshtein distance between the client name and the email username (if the stringdist package is available; otherwise NA).
email_name_in_email_dist_lcs
The Longest Common Subsequence distance between the client name and the email username (if computed).
email_name_in_email_dist_cosine
The cosine distance between the client name and the email username (if computed).
email_name_in_email_dist_jaccard
The Jaccard distance between the client name and the email username (if computed).
email_name_in_email_dist_jw
The Jaro-Winkler distance between the client name and the email username (if computed).
email_name_in_email_dist_soundex
The Soundex distance between the client name and the email username (if computed).
email_surname_in_email
Logical. TRUE if the provided client surname is found within the email username (case-insensitive), FALSE otherwise.
email_surname_in_email_dist_lv
The Levenshtein distance between the client surname and the email username (if computed).
email_surname_in_email_dist_lcs
The Longest Common Subsequence distance between the client surname and the email username (if computed).
email_surname_in_email_dist_cosine
The cosine distance between the client surname and the email username (if computed).
email_surname_in_email_dist_jaccard
The Jaccard distance between the client surname and the email username (if computed).
email_surname_in_email_dist_jw
The Jaro-Winkler distance between the client surname and the email username (if computed).
email_surname_in_email_dist_soundex
The Soundex distance between the client surname and the email username (if computed).
email_fullname_in_email_dist_lv
The Levenshtein distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_lcs
The Longest Common Subsequence distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_cosine
The cosine distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_jaccard
The Jaccard distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_jw
The Jaro-Winkler distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_soundex
The Soundex distance between the concatenated client name and surname and the email username (if computed).
email_has_full_year_of_birth
Logical. TRUE if the full 4-digit year (e.g., "1986") of the client's date of birth is present in the email username.
email_has_last_two_digits_of_birth
Logical. TRUE if the last two digits of the client's birth year are present in the email username.
email_has_full_dob_in_username
Logical. TRUE if the full date of birth (in one of the following formats: YYYYMMDD, YYYY.MM.DD, YYYY_MM_DD, or YYYY-MM-DD) is present in the email username.
email_has_other_4digit_year
Logical. TRUE if the email username contains a 4-digit year between 1920 and 2020 that differs from the client's own birth year.
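The four date-of-birth flags above can be approximated in base R. The sketch below is an illustration only, not the package's implementation: the helper name `dob_flags`, the output column names, and the matching rules (plain substring search, non-overlapping 4-digit runs) are assumptions drawn from the column descriptions.

```r
# Hypothetical helper: approximates the date-of-birth flags described above.
# Assumption: matching is a plain substring search on the username.
dob_flags <- function(username, dob) {
  stopifnot(length(username) == length(dob), inherits(dob, "Date"))
  yyyy <- format(dob, "%Y")
  mm   <- format(dob, "%m")
  dd   <- format(dob, "%d")

  # Element-wise literal substring test: is pattern[i] inside x[i]?
  contains <- function(x, pattern) {
    mapply(function(s, p) grepl(p, s, fixed = TRUE), x, pattern,
           USE.NAMES = FALSE)
  }

  # Full date of birth as YYYYMMDD, YYYY.MM.DD, YYYY_MM_DD or YYYY-MM-DD
  full_dob <- Reduce(`|`, lapply(c("", ".", "_", "-"), function(sep) {
    contains(username, paste(yyyy, mm, dd, sep = sep))
  }))

  # Any (non-overlapping) 4-digit run in 1920-2020 that is not the birth year
  other_year <- mapply(function(s, y) {
    found <- regmatches(s, gregexpr("[0-9]{4}", s))[[1]]
    found <- found[as.integer(found) >= 1920 & as.integer(found) <= 2020]
    any(found != y)
  }, username, yyyy, USE.NAMES = FALSE)

  data.frame(
    has_full_year  = contains(username, yyyy),
    has_last_two   = contains(username, substr(yyyy, 3, 4)),
    has_full_dob   = full_dob,
    has_other_year = other_year
  )
}

flags <- dob_flags(
  c("john1986", "j.smith_1986-03-14", "anna2001"),
  as.Date(c("1986-03-14", "1986-03-14", "1990-07-01"))
)
flags
```

The usernames and dates in the usage call are made-up values chosen so that each flag is triggered at least once.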
emails
A character vector of email addresses. Depending on the value of error_on_invalid, invalid email addresses are either replaced with NA (with a warning) or cause an error.
client_name
Optional. A character vector of client first names. When provided, its length must equal that of emails.
client_surname
Optional. A character vector of client surnames. When provided, its length must equal that of emails.
date_of_birth
Optional. A Date vector containing the clients' dates of birth. When provided, its length must equal that of emails.
error_on_invalid
Logical. If TRUE, the function throws an error when it encounters an invalid email address. If FALSE (the default), invalid emails are replaced with NA and a warning is issued.
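The warn-or-stop behaviour controlled by error_on_invalid can be sketched as follows. This is a minimal illustration, not the package's code: the helper name `validate_emails` is hypothetical, and the validity rule used here (exactly one "@", non-empty parts, a dot in the domain) is an assumption, since the documentation does not state how validity is decided.

```r
# Hypothetical helper: mimics the documented error_on_invalid behaviour.
# Assumption: an email is "valid" if it has one '@', no whitespace,
# and a dot somewhere in the domain. The package may check differently.
validate_emails <- function(emails, error_on_invalid = FALSE) {
  pattern <- "^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$"
  bad <- !is.na(emails) & !grepl(pattern, emails)
  if (any(bad)) {
    msg <- sprintf("%d invalid email address(es) found", sum(bad))
    if (error_on_invalid) stop(msg, call. = FALSE)
    warning(msg, call. = FALSE)
    emails[bad] <- NA_character_   # replace invalid entries with NA
  }
  emails
}

# Default mode: invalid entries become NA (the warning is silenced here)
cleaned <- suppressWarnings(
  validate_emails(c("a.b@example.com", "not-an-email"))
)
cleaned

# Strict mode: the same input raises an error instead
strict <- try(validate_emails("not-an-email", error_on_invalid = TRUE),
              silent = TRUE)
inherits(strict, "try-error")
```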
The function is designed to support feature engineering for credit-scoring datasets. It not only extracts parts of an email address (such as the username and domain) but also computes detailed characteristics from the username, which may include embedded client information.
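To make the username-level features concrete, here is a base-R sketch that reproduces several of the columns listed above for a couple of made-up addresses. The helper name `email_char_stats` is hypothetical and the regular expressions are assumptions; the package's own parsing may differ in edge cases.

```r
# Hypothetical helper: base-R approximation of the username-level features.
email_char_stats <- function(emails) {
  username <- sub("@.*$", "", emails)      # part before the '@'
  domain   <- sub("^[^@]*@", "", emails)   # part after the '@'

  # Number of regex matches per element (0 when there is no match)
  count_matches <- function(x, pattern) {
    lengths(regmatches(x, gregexpr(pattern, x)))
  }

  # Length of the longest run of consecutive digits (0 if there are none)
  digit_runs <- regmatches(username, gregexpr("[[:digit:]]+", username))
  max_run <- vapply(
    digit_runs,
    function(runs) if (length(runs)) max(nchar(runs)) else 0L,
    integer(1)
  )

  n_chars  <- nchar(username)
  n_digits <- count_matches(username, "[[:digit:]]")

  data.frame(
    email_domain                 = domain,
    email_major_domain           = sub("^.*\\.", "", domain),
    email_n_chars                = n_chars,
    email_n_digits               = n_digits,
    email_n_dots                 = count_matches(username, "\\."),
    email_n_caps                 = count_matches(username, "[[:upper:]]"),
    email_total_letters          = count_matches(username, "[[:alpha:]]"),
    email_prop_digits            = n_digits / n_chars,
    email_max_consecutive_digits = max_run,
    stringsAsFactors = FALSE
  )
}

stats <- email_char_stats(
  c("John.Smith86@gmail.com", "a1b22c333@mail.example.org")
)
stats
```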
When client name information is provided, the function computes various string distance metrics (using the stringdist package) between the client name (and surname) and the email username. If stringdist is not installed, the function will issue a warning and assign NA to the distance-based features.
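The optional-dependency pattern described above can be sketched with requireNamespace(). The helper name `name_distances`, the output column names, and the lower-casing of both inputs are assumptions made for illustration; the method names passed to stringdist::stringdist() are the ones that package documents. Note that with stringdist's default arguments the "jw" method returns the plain Jaro distance unless a prefix factor p is supplied; which setting the package uses is not stated here.

```r
# Hypothetical helper: distance features with an NA fallback when the
# stringdist package is not installed.
# Assumption: both strings are lower-cased before comparison.
name_distances <- function(username, client_name) {
  methods <- c("lv", "lcs", "cosine", "jaccard", "jw", "soundex")
  a <- tolower(username)
  b <- tolower(client_name)

  if (requireNamespace("stringdist", quietly = TRUE)) {
    out <- lapply(methods, function(m) {
      stringdist::stringdist(a, b, method = m)
    })
  } else {
    warning("Package 'stringdist' is not installed; returning NA distances.")
    out <- lapply(methods, function(m) rep(NA_real_, length(a)))
  }

  names(out) <- paste0("dist_", methods)
  as.data.frame(out)
}

d <- suppressWarnings(
  name_distances(c("john.smith86", "jsmith"), c("John", "Anna"))
)
d
```

Either branch returns a data frame of the same shape, so downstream code does not need to know whether stringdist was available.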
stringdist for the calculation of string distances.
# Load sample data included in the package
data("featForge_sample_data")
# Extract features from the sample emails
features <- extract_email_features(
  emails = featForge_sample_data$email,
  client_name = featForge_sample_data$client_name,
  client_surname = featForge_sample_data$client_surname,
  date_of_birth = featForge_sample_data$date_of_birth
)
# Display the first few rows of the resulting feature set
head(features)