This function extracts a set of features from a vector of IP address strings to support feature engineering in credit-scoring datasets. It processes both IPv4 and IPv6 addresses and returns a data frame of derived features: IP version classification, an octet-level breakdown for IPv4 addresses (with both string- and numeric-based octets), leading-zero checks, a numeric conversion of the address, a basic approximation of IPv6 numeric values, pattern metrics such as a palindrome check and Shannon entropy, multicast status, and a Hilbert curve encoding for IPv4 addresses.
extract_ip_features(ip_addresses, error_on_invalid = FALSE)
A data frame with the following columns:
ip_version
A character vector indicating the IP version: either "IPv4" or "IPv6". Invalid addresses are set to NA.
ip_v4_octet1
The numeric conversion of the first octet of an IPv4 address as extracted from the IP string.
ip_v4_octet2
The numeric conversion of the second octet of an IPv4 address.
ip_v4_octet3
The numeric conversion of the third octet of an IPv4 address.
ip_v4_octet4
The numeric conversion of the fourth octet of an IPv4 address.
ip_v4_octet1_has_leading_zero
An integer flag indicating whether the first octet of an IPv4 address includes a leading zero.
ip_v4_octet2_has_leading_zero
An integer flag indicating whether the second octet includes a leading zero.
ip_v4_octet3_has_leading_zero
An integer flag indicating whether the third octet includes a leading zero.
ip_v4_octet4_has_leading_zero
An integer flag indicating whether the fourth octet includes a leading zero.
ip_leading_zero_count
An integer count of how many octets in an IPv4 address contain leading zeros.
ip_v4_numeric_vector
The 32-bit integer representation of an IPv4 address, computed as \((A * 256^3) + (B * 256^2) + (C * 256) + D\).
ip_v6_numeric_approx_vector
An approximate numeric conversion of an IPv6 address. This value is computed from the eight hextets and is intended for interval comparisons only; precision may be lost for large values (above 2^53).
ip_is_palindrome
An integer value indicating whether the entire IP address string is a palindrome (i.e., it reads the same forwards and backwards).
ip_entropy
A numeric value representing the Shannon entropy of the IP address string, computed over the distribution of its characters. Higher entropy values indicate a more varied (less repetitive) pattern.
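As an illustration of the numeric columns described above, the following is a minimal sketch of the IPv4 conversion formula and the leading-zero flags. The helpers `ipv4_to_numeric` and `octet_has_leading_zero` are hypothetical names used only for this example, not the package's internals, and the rule that a lone "0" octet does not count as a leading zero is an assumption.

```r
# Hypothetical helper (not part of the package): apply
# (A * 256^3) + (B * 256^2) + (C * 256) + D to dotted-quad strings.
ipv4_to_numeric <- function(ip) {
  parts <- strsplit(ip, ".", fixed = TRUE)
  vapply(parts, function(p) {
    # Doubles are used because values above 2^31 - 1 do not fit R integers.
    octets <- as.numeric(p)
    sum(octets * 256^(3:0))
  }, numeric(1))
}

# Hypothetical helper: one 0/1 flag per octet, set when the octet string
# starts with "0" followed by another digit (assumed rule; "0" alone is not flagged).
octet_has_leading_zero <- function(ip) {
  parts <- strsplit(ip, ".", fixed = TRUE)
  t(vapply(parts, function(p) as.integer(grepl("^0[0-9]", p)), integer(4)))
}

ips <- c("192.168.1.10", "010.001.0.255")
ipv4_to_numeric(ips)                 # first element is 3232235786
flags <- octet_has_leading_zero(ips) # one row per address, one column per octet
rowSums(flags)                       # count of octets with leading zeros

# Why the IPv6 value is only approximate: doubles cannot distinguish
# neighbouring integers above 2^53.
2^53 + 1 == 2^53                     # TRUE
```

Because the full 128-bit IPv6 range far exceeds 2^53, a double-based conversion can preserve ordering only coarsely, which matches the "interval comparisons only" caveat above.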
ip_addresses
A character vector of IP address strings.
error_on_invalid
Logical flag indicating how to handle invalid IP addresses. If TRUE, the function throws an error upon encountering any invalid IP address; if FALSE (the default), invalid IP addresses are replaced with NA and a warning is issued.
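The two behaviours of the invalid-address flag can be sketched as follows. This is not the package's code: `validate_ips` is a hypothetical helper, and its patterns are simplified assumptions (the IPv6 pattern accepts only the full eight-group form, without "::" compression).

```r
# Hypothetical helper illustrating the documented error/warning semantics.
validate_ips <- function(ip_addresses, error_on_invalid = FALSE) {
  # Simplified, assumed patterns; leading zeros in IPv4 octets are allowed.
  ipv4_shape <- "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$"
  ipv6_full  <- "^[0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){7}$"

  is_v4 <- grepl(ipv4_shape, ip_addresses)
  # Reject octets above 255, which the shape pattern alone would accept.
  is_v4[is_v4] <- vapply(
    strsplit(ip_addresses[is_v4], ".", fixed = TRUE),
    function(p) all(as.numeric(p) <= 255),
    logical(1)
  )
  is_v6 <- grepl(ipv6_full, ip_addresses)

  invalid <- !(is_v4 | is_v6)
  if (any(invalid)) {
    msg <- sprintf("%d invalid IP address(es) found", sum(invalid))
    if (error_on_invalid) stop(msg, call. = FALSE)
    warning(msg, call. = FALSE)
  }

  version <- ifelse(is_v4, "IPv4", ifelse(is_v6, "IPv6", NA_character_))
  ip_addresses[invalid] <- NA_character_
  data.frame(ip = ip_addresses, ip_version = version, stringsAsFactors = FALSE)
}

ips <- c("192.168.1.10", "2001:0db8:0000:0000:0000:ff00:0042:8329", "999.1.1.1")
# suppressWarnings() only keeps the example quiet; without it a warning is raised.
res <- suppressWarnings(validate_ips(ips))
res$ip_version   # "IPv4" "IPv6" NA
```

With `error_on_invalid = TRUE`, the same input would stop with an error instead of returning a data frame.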
The function follows these steps:
Validation: Each IP address is checked against regular expressions for both IPv4 and IPv6. If an IP does not match either pattern, it is deemed invalid. Depending on the value of error_on_invalid, invalid entries are either replaced with NA (with a warning) or cause an error.
IP Version Identification: The function determines whether an IP address is IPv4 or IPv6.
IPv4 Feature Extraction:
The IPv4 addresses are split into four octets.
For each octet, both the raw (string) and numeric representations are extracted.
The presence of leading zeros is checked for each octet, and the total count of octets with leading zeros is computed.
The full IPv4 address is converted to a 32-bit numeric value.
Hilbert curve encoding is applied to the numeric value, yielding two dimensions that can be used as features in modeling.
IPv6 Feature Extraction: For IPv6 addresses, an approximate numeric conversion is performed to allow for coarse interval analysis.
Pattern Metrics: Independent of IP version, the function computes:
A palindrome check on the entire IP string.
The Shannon entropy of the IP string to capture the diversity of characters.
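The two pattern metrics can be sketched as below. The helper names `is_palindrome` and `shannon_entropy` are hypothetical, and the use of base-2 logarithms (entropy in bits) is an assumption; the package may use a different base.

```r
# Hypothetical helper: 1 if the string reads the same forwards and backwards.
is_palindrome <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)
  vapply(chars, function(ch) as.integer(identical(ch, rev(ch))), integer(1))
}

# Hypothetical helper: Shannon entropy over the character distribution
# of each string, H = -sum(p * log2(p)) (base 2 is an assumption).
shannon_entropy <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)
  vapply(chars, function(ch) {
    p <- as.numeric(table(ch)) / length(ch)  # relative frequency per character
    -sum(p * log2(p))
  }, numeric(1))
}

is_palindrome(c("1.2.2.1", "10.0.0.1"))   # 1 0
shannon_entropy(c("1.1.1.1", "192.168.1.10"))
```

An address built from few distinct characters, such as "1.1.1.1", scores lower than one with a more varied character mix, which is the sense in which higher entropy indicates a less repetitive pattern.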
# Load the package's sample dataset
data(featForge_sample_data)
# Extract IP features and combine them with the original IP column
result <- cbind(
data.frame(ip = featForge_sample_data$ip),
extract_ip_features(featForge_sample_data$ip)
)
print(result)