This function takes a data frame as its argument and returns the names (or indices) of all columns containing country names.
It can be used to automate the search for country columns in data frames.
For the purpose of this function, a country is any of the 249 territories designated in the ISO standard 3166.
On large datasets a random sample is used for evaluating the columns.
find_countrycol(
x,
return_index = FALSE,
allow_NA = TRUE,
min_share = 0.8,
sample_size = 1000
)
Returns a vector of column names (return_index=FALSE) or column indices (return_index=TRUE) of columns containing country names.
x: A data frame object.
return_index: A logical value indicating whether the function should return the index of country columns instead of the column names. Default is FALSE, i.e. column names are returned.
allow_NA: Logical value indicating whether columns containing NA values are to be considered as country columns. Default is allow_NA=TRUE, as given in the usage signature. With allow_NA=FALSE, the function will not return country columns containing NA values.
min_share: A value between 0 and 1 indicating the minimum share of country names in columns that are returned. A value of 0 will return any column containing a country name. A value of 1 will return only columns whose entries are all country names. Default is 0.8, as given in the usage signature, i.e. at least 80 percent of the column entries need to be country names.
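The share threshold described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `known_countries` is a toy lookup, `country_share` is a hypothetical helper, and dropping NA values before computing the share is an assumption made for illustration only.

```r
# Toy lookup (illustrative only). The real function recognises the
# ISO 3166 territories; this vector is a stand-in for the sketch.
known_countries <- c("brazil", "tonga", "france", "japan")

# Hypothetical helper: share of non-missing entries found in the lookup.
# Dropping NA values from the denominator is an assumption of this sketch.
country_share <- function(column, lookup = known_countries) {
  values <- tolower(trimws(as.character(column)))
  values <- values[!is.na(values)]
  if (length(values) == 0) return(0)
  mean(values %in% lookup)
}

df <- data.frame(
  origin = c("Brazil", "Tonga", "France", "Japan", "unknown"),
  amount = c(1, 2, 3, 4, 5)
)

# Share of recognised country names in each column
shares <- vapply(df, country_share, numeric(1))
min_share <- 0.8

# Keep columns with at least one match whose share reaches the threshold
selected <- names(shares)[shares > 0 & shares >= min_share]
print(selected)
```

Here four of the five entries in `origin` match the lookup, so the column just meets a threshold of 0.8, while `amount` has no matches and is excluded at any threshold.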
sample_size: Either NA or a numeric value indicating the sample size used for evaluating columns. Default is 1000. If NA is passed, the function will evaluate the full table. The minimum accepted value is 100 (i.e. 100 randomly sampled rows are used to evaluate the columns). This parameter can be tuned to speed up computation on long datasets. Taking a sample could result in inexact identification of country columns; accuracy improves with larger samples.
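The row sampling described above could look roughly like the following base R sketch. `sample_rows` is a hypothetical helper, not a function from the package, and raising an error for values below 100 is an assumption; the documentation only states that 100 is the minimum accepted value.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical helper mirroring the documented sample_size behaviour
sample_rows <- function(x, sample_size = 1000) {
  # Assumption: values below the documented minimum raise an error
  if (!is.na(sample_size) && sample_size < 100) {
    stop("sample_size must be NA or at least 100")
  }
  # NA, or a table no longer than the sample size: use the full table
  if (is.na(sample_size) || nrow(x) <= sample_size) {
    return(x)
  }
  # Otherwise draw sample_size rows at random, without replacement
  x[sample(nrow(x), sample_size), , drop = FALSE]
}

big <- data.frame(id = seq_len(5000), value = rnorm(5000))
n_sampled <- nrow(sample_rows(big, sample_size = 500))
n_full <- nrow(sample_rows(big, sample_size = NA))
print(c(sampled = n_sampled, full = n_full))
```

A column that narrowly passes or fails `min_share` on the full table may be classified differently on a sample, which is why larger samples give more stable results.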
See also: is_country, country_name, find_keycol, find_timecol
find_countrycol(x=data.frame(a=c("Brésil","Tonga","FRA"), b=c(1,2,3)))
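A slightly longer variant of the example above exercises the documented arguments. It assumes the function is provided by the `countries` package and that the package is installed; the calls are skipped otherwise. No output is shown because the result depends on how each entry is matched.

```r
# Same data as the example above
df <- data.frame(a = c("Brésil", "Tonga", "FRA"), b = c(1, 2, 3))

# Assumption: find_countrycol() comes from the 'countries' package
if (requireNamespace("countries", quietly = TRUE)) {
  # Column names of detected country columns (default)
  print(countries::find_countrycol(df))

  # Column indices instead of names
  print(countries::find_countrycol(df, return_index = TRUE))

  # Require every entry in a column to be a country name
  print(countries::find_countrycol(df, min_share = 1))

  # Evaluate the full table rather than a random sample
  print(countries::find_countrycol(df, sample_size = NA))
}
```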