
featForge (version 0.1.2)

extract_email_features: Extract Email Features for Credit Scoring

Description

This function processes a vector of email addresses to extract a set of features that can be useful for credit scoring. In addition to parsing each email into its constituent parts (the username and the domain), the function computes character-level statistics on the username (e.g., counts of digits, dots, and uppercase letters) and string distance metrics between the username and the client's name and surname. If dates of birth are provided, it also checks whether date-of-birth components appear in the username, in several formats.

Usage

extract_email_features(
  emails,
  client_name = NULL,
  client_surname = NULL,
  date_of_birth = NULL,
  error_on_invalid = FALSE
)

Value

A data.frame with the following columns:

email_domain

The domain part of the email address (i.e., the substring after the '@').

email_major_domain

The major domain extracted as the substring after the last dot in the domain (e.g., "com" in "gmail.com").

email_n_chars

The number of characters in the email username (i.e., the part before the '@').

email_n_digits

The number of digits found in the email username.

email_n_dots

The number of dot ('.') characters in the email username.

email_n_caps

The number of uppercase letters in the email username.

email_total_letters

The total count of alphabetic characters (both uppercase and lowercase) in the email username.

email_prop_digits

The proportion of digits in the email username (calculated as email_n_digits/email_n_chars).

email_max_consecutive_digits

The maximum length of any sequence of consecutive digits in the email username.

email_name_in_email

Logical. TRUE if the provided client name is found within the email username (case-insensitive), FALSE otherwise.

email_name_in_email_dist_lv

The Levenshtein distance between the client name and the email username (if the stringdist package is available; otherwise NA).

email_name_in_email_dist_lcs

The Longest Common Subsequence distance between the client name and the email username (if computed).

email_name_in_email_dist_cosine

The cosine distance between the client name and the email username (if computed).

email_name_in_email_dist_jaccard

The Jaccard distance between the client name and the email username (if computed).

email_name_in_email_dist_jw

The Jaro-Winkler distance between the client name and the email username (if computed).

email_name_in_email_dist_soundex

The Soundex distance between the client name and the email username (if computed).

email_surname_in_email

Logical. TRUE if the provided client surname is found within the email username (case-insensitive), FALSE otherwise.

email_surname_in_email_dist_lv

The Levenshtein distance between the client surname and the email username (if computed).

email_surname_in_email_dist_lcs

The Longest Common Subsequence distance between the client surname and the email username (if computed).

email_surname_in_email_dist_cosine

The cosine distance between the client surname and the email username (if computed).

email_surname_in_email_dist_jaccard

The Jaccard distance between the client surname and the email username (if computed).

email_surname_in_email_dist_jw

The Jaro-Winkler distance between the client surname and the email username (if computed).

email_surname_in_email_dist_soundex

The Soundex distance between the client surname and the email username (if computed).

email_fullname_in_email_dist_lv

The Levenshtein distance between the concatenated client name and surname and the email username (if computed).

email_fullname_in_email_dist_lcs

The Longest Common Subsequence distance between the concatenated client name and surname and the email username (if computed).

email_fullname_in_email_dist_cosine

The cosine distance between the concatenated client name and surname and the email username (if computed).

email_fullname_in_email_dist_jaccard

The Jaccard distance between the concatenated client name and surname and the email username (if computed).

email_fullname_in_email_dist_jw

The Jaro-Winkler distance between the concatenated client name and surname and the email username (if computed).

email_fullname_in_email_dist_soundex

The Soundex distance between the concatenated client name and surname and the email username (if computed).

email_has_full_year_of_birth

Logical. TRUE if the full 4-digit year (e.g., "1986") of the client's date of birth is present in the email username.

email_has_last_two_digits_of_birth

Logical. TRUE if the last two digits of the client's birth year are present in the email username.

email_has_full_dob_in_username

Logical. TRUE if the full date of birth (in one of the following formats: YYYYMMDD, YYYY.MM.DD, YYYY_MM_DD, or YYYY-MM-DD) is present in the email username.

email_has_other_4digit_year

Logical. TRUE if a different 4-digit year (between 1920 and 2020) is found in the email username that does not match the client's own birth year.
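
The four date-of-birth flags above can be approximated in base R. The sketch below is an illustration only, not featForge's implementation: dob_flags and its column names are hypothetical, and it assumes plain substring matching with no missing values in either input.

```r
# Hypothetical helper (not part of featForge): approximates the four
# date-of-birth flags using plain substring matching in base R.
dob_flags <- function(username, dob) {
  stopifnot(length(username) == length(dob), inherits(dob, "Date"))

  year4 <- format(dob, "%Y")
  year2 <- substr(year4, 3, 4)
  mm <- format(dob, "%m")
  dd <- format(dob, "%d")

  # Element-wise literal (non-regex) substring test
  contains <- function(needle, haystack) {
    mapply(function(n, h) grepl(n, h, fixed = TRUE),
           needle, haystack, USE.NAMES = FALSE)
  }

  # YYYYMMDD, YYYY.MM.DD, YYYY_MM_DD and YYYY-MM-DD
  seps <- c("", ".", "_", "-")
  full_dob <- Reduce(`|`, lapply(seps, function(s) {
    contains(paste(year4, mm, dd, sep = s), username)
  }))

  # Any year from 1920 to 2020 that differs from the birth year
  year_pattern <- "(19[2-9][0-9]|20[01][0-9]|2020)"
  other_year <- mapply(function(u, y) {
    hits <- regmatches(u, gregexpr(year_pattern, u))[[1]]
    any(hits != y)
  }, username, year4, USE.NAMES = FALSE)

  data.frame(
    has_full_year = contains(year4, username),
    has_last_two = contains(year2, username),
    has_full_dob = full_dob,
    has_other_4digit_year = other_year
  )
}

dob_flags(
  c("john.1986", "mary_1990-05-17", "bob75x1999"),
  as.Date(c("1986-03-02", "1990-05-17", "1975-01-01"))
)
```

Substring matching is deliberately loose: a two-digit year such as "86" also matches inside longer digit runs, so the package may apply stricter rules than this sketch does.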

Arguments

emails

A character vector of email addresses. Invalid email addresses are either replaced with NA (with a warning) or cause an error, depending on the value of error_on_invalid.

client_name

Optional. A character vector of client first names. When provided, its length must equal that of emails.

client_surname

Optional. A character vector of client surnames. When provided, its length must equal that of emails.

date_of_birth

Optional. A Date vector containing the clients' dates of birth. When provided, its length must equal that of emails.

error_on_invalid

Logical. If TRUE, the function will throw an error when encountering an invalid email address. If FALSE (the default), invalid emails are replaced with NA and a warning is issued.
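
The two behaviours selected by error_on_invalid can be sketched as follows. validate_emails is a hypothetical helper, and the validity pattern is a deliberately simple assumption; the rule featForge actually applies is not documented here.

```r
# Hypothetical helper (not part of featForge): either fail on invalid
# addresses or replace them with NA and warn.
validate_emails <- function(emails, error_on_invalid = FALSE) {
  # Assumed validity rule: exactly one '@', no whitespace, and at least
  # one dot in the domain part.
  pattern <- "^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$"
  invalid <- !is.na(emails) & !grepl(pattern, emails)

  if (any(invalid)) {
    msg <- sprintf("%d invalid email address(es) found", sum(invalid))
    if (error_on_invalid) {
      stop(msg, call. = FALSE)
    }
    warning(msg, call. = FALSE)
    emails[invalid] <- NA_character_
  }
  emails
}

# Default behaviour: the invalid entry becomes NA and a warning is raised
validate_emails(c("a.b@example.com", "not-an-email"))
```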

Details

The function is designed to support feature engineering for credit-scoring datasets. It not only extracts parts of an email address (such as the username and domain) but also computes detailed characteristics from the username, which may include embedded client information.
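
To make the username-level features concrete, here is a minimal base R sketch of how such statistics could be computed. username_stats and its column names are hypothetical, and the code assumes every input is a valid, non-missing address; it is not the package's own code.

```r
# Hypothetical helper (not part of featForge): splits each address and
# derives character-level statistics from the username.
username_stats <- function(emails) {
  username <- sub("@.*$", "", emails)
  domain <- sub("^.*@", "", emails)

  # Number of matches of `pattern` in each element of x
  count_matches <- function(pattern, x) {
    lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
  }

  # All runs of consecutive digits, one character vector per username
  digit_runs <- regmatches(username, gregexpr("[0-9]+", username, perl = TRUE))

  n_chars <- nchar(username)
  n_digits <- count_matches("[0-9]", username)

  data.frame(
    domain = domain,
    major_domain = sub("^.*\\.", "", domain),
    n_chars = n_chars,
    n_digits = n_digits,
    n_dots = count_matches("[.]", username),
    n_caps = count_matches("[A-Z]", username),
    total_letters = count_matches("[A-Za-z]", username),
    prop_digits = n_digits / n_chars,
    max_consecutive_digits = vapply(
      digit_runs,
      function(r) if (length(r)) max(nchar(r)) else 0L,
      integer(1)
    ),
    stringsAsFactors = FALSE
  )
}

username_stats(c("John.Doe1986@gmail.com", "abc@mail.co.uk"))
```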

When client name information is provided, the function computes various string distance metrics (using the stringdist package) between the client name (and surname) and the email username. If stringdist is not installed, the function will issue a warning and assign NA to the distance-based features.
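
The fallback described above can be sketched with requireNamespace(). name_distances is a hypothetical helper; lower-casing both strings and using stringdist's default parameters are assumptions, so the values may differ from those featForge returns.

```r
# Hypothetical helper (not part of featForge): one distance column per
# method, or NA columns plus a warning when stringdist is unavailable.
name_distances <- function(name, username,
                           methods = c("lv", "lcs", "cosine",
                                       "jaccard", "jw", "soundex")) {
  a <- tolower(name)       # assumption: comparison is case-insensitive
  b <- tolower(username)

  if (requireNamespace("stringdist", quietly = TRUE)) {
    out <- lapply(methods, function(m) {
      stringdist::stringdist(a, b, method = m)
    })
  } else {
    warning("Package 'stringdist' is not installed; returning NA distances.",
            call. = FALSE)
    out <- lapply(methods, function(m) rep(NA_real_, length(a)))
  }

  names(out) <- paste0("dist_", methods)
  as.data.frame(out)
}

name_distances(c("john", "mary"), c("john.doe", "m.smith"))
```

Note that in stringdist the "jw" method with its default p = 0 gives the plain Jaro distance; a Jaro-Winkler distance requires a positive p. Which setting the package uses is not stated here.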

See Also

stringdist for the calculation of string distances.

Examples

# Load sample data included in the package
data("featForge_sample_data")

# Extract features from the sample emails
features <- extract_email_features(
  emails = featForge_sample_data$email,
  client_name = featForge_sample_data$client_name,
  client_surname = featForge_sample_data$client_surname,
  date_of_birth = featForge_sample_data$date_of_birth
)

# Display the first few rows of the resulting feature set
head(features)
