This function processes a vector of email addresses and extracts a set of features that can be useful for credit scoring. In addition to parsing each email into its constituent parts (the username and the domain), the function computes character-level statistics on the username (e.g., counts of digits, dots, and uppercase letters) and string distance metrics between the username and the client's name and surname. If a date of birth is provided, it also checks whether components of that date appear in the username (in several flexible formats).
extract_email_features(
  emails,
  client_name = NULL,
  client_surname = NULL,
  date_of_birth = NULL,
  error_on_invalid = FALSE
)

A data.frame with the following columns:
email_domain: The domain part of the email address (i.e., the substring after the '@').
email_major_domain: The major domain extracted as the substring after the last dot in the domain (e.g., "com" in "gmail.com").
email_n_chars: The number of characters in the email username (i.e., the part before the '@').
email_n_digits: The number of digits found in the email username.
email_n_dots: The number of dot ('.') characters in the email username.
email_n_caps: The number of uppercase letters in the email username.
email_total_letters: The total count of alphabetic characters (both uppercase and lowercase) in the email username.
email_prop_digits: The proportion of digits in the email username (calculated as email_n_digits/email_n_chars).
email_max_consecutive_digits: The maximum length of any sequence of consecutive digits in the email username.
email_name_in_email: Logical. TRUE if the provided client name is found within the email username (case-insensitive), FALSE otherwise.
email_name_in_email_dist_lv: The Levenshtein distance between the client name and the email username (if the stringdist package is available; otherwise NA).
email_name_in_email_dist_lcs: The Longest Common Subsequence distance between the client name and the email username (if computed).
email_name_in_email_dist_cosine: The cosine distance between the client name and the email username (if computed).
email_name_in_email_dist_jaccard: The Jaccard distance between the client name and the email username (if computed).
email_name_in_email_dist_jw: The Jaro-Winkler distance between the client name and the email username (if computed).
email_name_in_email_dist_soundex: The Soundex distance between the client name and the email username (if computed).
email_surname_in_email: Logical. TRUE if the provided client surname is found within the email username (case-insensitive), FALSE otherwise.
email_surname_in_email_dist_lv: The Levenshtein distance between the client surname and the email username (if computed).
email_surname_in_email_dist_lcs: The Longest Common Subsequence distance between the client surname and the email username (if computed).
email_surname_in_email_dist_cosine: The cosine distance between the client surname and the email username (if computed).
email_surname_in_email_dist_jaccard: The Jaccard distance between the client surname and the email username (if computed).
email_surname_in_email_dist_jw: The Jaro-Winkler distance between the client surname and the email username (if computed).
email_surname_in_email_dist_soundex: The Soundex distance between the client surname and the email username (if computed).
email_fullname_in_email_dist_lv: The Levenshtein distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_lcs: The Longest Common Subsequence distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_cosine: The cosine distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_jaccard: The Jaccard distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_jw: The Jaro-Winkler distance between the concatenated client name and surname and the email username (if computed).
email_fullname_in_email_dist_soundex: The Soundex distance between the concatenated client name and surname and the email username (if computed).
email_has_full_year_of_birth: Logical. TRUE if the full 4-digit year (e.g., "1986") of the client's date of birth is present in the email username.
email_has_last_two_digits_of_birth: Logical. TRUE if the last two digits of the client's birth year are present in the email username.
email_has_full_dob_in_username: Logical. TRUE if the full date of birth (in one of the following formats: YYYYMMDD, YYYY.MM.DD, YYYY_MM_DD, or YYYY-MM-DD) is present in the email username.
email_has_other_4digit_year: Logical. TRUE if a different 4-digit year (between 1920 and 2020) is found in the email username that does not match the client's own birth year.
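
To make the column definitions above concrete, the following base R sketch recomputes the username-level statistics and three of the date-of-birth flags. It is an illustration only, not the package's implementation: username_stats() and dob_flags() are hypothetical helper names, and the regular expressions are assumptions about how the columns could be derived.

```r
# Hypothetical helper (not part of the package): recomputes a few of the
# username-level columns with base R only.
username_stats <- function(emails) {
  username <- sub("@.*$", "", emails)     # part before the '@'
  domain   <- sub("^[^@]*@", "", emails)  # part after the '@'

  # Count regex matches per username; gregexpr() returns -1 when nothing matches.
  count_matches <- function(pattern) {
    vapply(gregexpr(pattern, username), function(pos) sum(pos > 0), integer(1))
  }

  # Length of the longest run of consecutive digits (0 if there are none).
  digit_runs <- regmatches(username, gregexpr("[0-9]+", username))
  longest_run <- vapply(
    digit_runs,
    function(runs) if (length(runs) > 0) max(nchar(runs)) else 0L,
    integer(1)
  )

  n_chars  <- nchar(username)
  n_digits <- count_matches("[0-9]")

  data.frame(
    email_domain = domain,
    email_major_domain = sub("^.*[.]", "", domain),  # text after the last dot
    email_n_chars = n_chars,
    email_n_digits = n_digits,
    email_n_dots = count_matches("[.]"),
    email_n_caps = count_matches("[A-Z]"),
    email_total_letters = count_matches("[A-Za-z]"),
    email_prop_digits = n_digits / n_chars,
    email_max_consecutive_digits = longest_run,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper for a single username and a single Date of birth.
dob_flags <- function(username, dob) {
  has <- function(piece) grepl(piece, username, fixed = TRUE)
  formats <- c("%Y%m%d", "%Y.%m.%d", "%Y_%m_%d", "%Y-%m-%d")
  full_dates <- vapply(formats, function(f) format(dob, format = f), character(1))
  c(
    email_has_full_year_of_birth = has(format(dob, format = "%Y")),
    email_has_last_two_digits_of_birth = has(format(dob, format = "%y")),
    email_has_full_dob_in_username = any(vapply(full_dates, has, logical(1)))
  )
}

stats <- username_stats("John.Smith1986@gmail.com")
flags <- dob_flags("anna_1990-05-02", as.Date("1990-05-02"))
```

For "John.Smith1986@gmail.com" the username is "John.Smith1986", so this sketch reports 14 characters, 4 digits, 1 dot, 2 uppercase letters, and a longest digit run of 4.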
emails: A character vector of email addresses. Invalid email addresses are either replaced with NA (with a warning) or cause an error, depending on the value of error_on_invalid.
client_name: Optional. A character vector of client first names. When provided, its length must equal that of emails.
client_surname: Optional. A character vector of client surnames. When provided, its length must equal that of emails.
date_of_birth: Optional. A Date vector containing the clients' dates of birth. When provided, its length must equal that of emails.
error_on_invalid: Logical. If TRUE, the function will throw an error when encountering an invalid email address. If FALSE (the default), invalid emails are replaced with NA and a warning is issued.
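
The two behaviours for invalid addresses can be sketched as follows. This page does not state the rule the function uses to decide that an address is invalid, so the regular expression below is purely an assumption, and check_emails() is a hypothetical name rather than a package function.

```r
# Hypothetical validator illustrating the warn-and-replace vs. error behaviour.
# ASSUMPTION: an address is "valid" if it looks like something@something.tld.
check_emails <- function(emails, error_on_invalid = FALSE) {
  looks_valid <- grepl("^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$", emails)
  bad <- !is.na(emails) & !looks_valid   # existing NA values are left alone

  if (any(bad)) {
    msg <- sprintf("%d invalid email address(es) found", sum(bad))
    if (error_on_invalid) {
      stop(msg, call. = FALSE)           # strict mode: abort
    }
    warning(msg, call. = FALSE)          # default mode: warn and replace with NA
    emails[bad] <- NA_character_
  }
  emails
}

cleaned <- suppressWarnings(check_emails(c("a@b.com", "not-an-email")))
```

Passing error_on_invalid = TRUE is the stricter choice for pipelines where a malformed address signals an upstream data problem that should stop the run.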
The function is designed to support feature engineering for credit-scoring datasets. It not only extracts parts of an email address (such as the username and domain) but also computes detailed characteristics from the username, which may include embedded client information.
When client name information is provided, the function computes various string distance metrics (using the stringdist package) between
the client name (and surname) and the email username. If stringdist is not installed, the function will issue a warning and assign NA
to the distance-based features.
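
The fallback described above can be sketched with requireNamespace() and stringdist::stringdist(). The helper name name_distances() is hypothetical, and lowercasing both strings and using p = 0.1 for the Jaro-Winkler prefix weight are assumptions; the page does not say which settings the function itself uses.

```r
# Hypothetical helper: distances between one client name and one username,
# returning NA for every metric when stringdist is not installed.
name_distances <- function(client_name, username) {
  methods <- c("lv", "lcs", "cosine", "jaccard", "jw", "soundex")

  if (!requireNamespace("stringdist", quietly = TRUE)) {
    warning("Package 'stringdist' is not installed; returning NA distances.",
            call. = FALSE)
    return(setNames(rep(NA_real_, length(methods)), methods))
  }

  a <- tolower(client_name)  # ASSUMPTION: compare case-insensitively
  b <- tolower(username)

  vapply(
    methods,
    # In stringdist, method = "jw" with the default p = 0 gives the plain Jaro
    # distance; p = 0.1 (an assumed value) applies the Winkler prefix weight.
    # The p argument has no effect on the other methods.
    function(m) stringdist::stringdist(a, b, method = m, p = 0.1),
    numeric(1)
  )
}

d <- suppressWarnings(name_distances("John", "john.smith"))
```

The result is a named numeric vector with one entry per metric, which maps naturally onto the email_name_in_email_dist_* columns.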
See also the stringdist package, which is used to calculate the string distances.
# Load sample data included in the package
data("featForge_sample_data")
# Extract features from the sample emails
features <- extract_email_features(
  emails = featForge_sample_data$email,
  client_name = featForge_sample_data$client_name,
  client_surname = featForge_sample_data$client_surname,
  date_of_birth = featForge_sample_data$date_of_birth
)
# Display the first few rows of the resulting feature set
head(features)