
Domain names have suffixes - common endings that people can (or could) register domains under. This includes things like ".org", but also things like ".edu.co". A simple top-level-domain list, as a result, probably won't cut it.
suffix_extract takes the list of public suffixes, as maintained by Mozilla (see suffix_dataset), and a vector of domain names, and produces a data.frame containing the suffix that each domain uses and the remaining fragment.
suffix_extract(domains, suffixes = NULL)
a data.frame of four columns: "host", "subdomain", "domain" and "suffix". "host" is what was passed in. "subdomain" is the subdomain of the suffix. "domain" contains the part of the domain name that came before the matched suffix. "suffix" is, well, the suffix.
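To illustrate the shape of that return value, here is a small sketch (assuming the urltools package is installed; the values shown follow the column descriptions above for a simple host):

```r
library(urltools)

# Decompose a single host into its public-suffix parts
result <- suffix_extract("en.wikipedia.org")

# One row per input host, with the four columns described above
result$host       # "en.wikipedia.org" - the input as passed in
result$subdomain  # "en"
result$domain     # "wikipedia"
result$suffix     # "org"
```

Passing a vector of hosts returns one row per element, so the output can be column-bound back onto the input data.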
a vector of domains, from domain or url_parse. Alternatively, full URLs can be provided and will then be run through domain internally.
a dataset of suffixes. By default, this is NULL and the function relies on suffix_dataset. Optionally, if you want more up-to-date suffix data, you can provide the result of suffix_refresh for this parameter.
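As a sketch of the full-URL behaviour described for the domains argument (assuming the urltools package is installed), a URL is stripped to its domain internally before suffix matching, so the two calls below match the same suffix:

```r
library(urltools)

# A full URL is run through domain() internally before suffix matching
from_url <- suffix_extract("http://en.wikipedia.org/wiki/Main_Page")

# Equivalent to extracting the domain yourself first
from_domain <- suffix_extract(domain("http://en.wikipedia.org/wiki/Main_Page"))

from_url$suffix == from_domain$suffix
```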
suffix_dataset for the dataset of suffixes.
# Using url_parse
domain_name <- url_parse("http://en.wikipedia.org")$domain
suffix_extract(domain_name)
# Using domain()
domain_name <- domain("http://en.wikipedia.org")
suffix_extract(domain_name)
if (FALSE) {
#Relying on a fresh version of the suffix dataset
suffix_extract(domain("http://en.wikipedia.org"), suffix_refresh())
}