Standardize an address from a free text expression into its components as used in the PSMA (formerly, "Public Sector for Mapping Agencies") database.
standardize_address(
Address,
AddressLine2 = NULL,
return.type = c("data.table", "integer"),
integer_StreetType = FALSE,
hash_StreetName = FALSE,
check = 1L,
nThread = getOption("healthyAddress.nThread", 1L)
)
standard_address2(Address, nThread = getOption("healthyAddres.nThread", 1L))
standard_address3(Line1, Line2, Postcode = NULL, KeepStreetName = FALSE)
A data.table containing columns indicating the components of the standard address:
FLAT_NUMBER
The flat or unit number. This includes things like SHOP number.
NUMBER_FIRST
As used in the PSMA, this identifies the first (or only) number in the address range.
NUMBER_LAST
As used in the PSMA, if an address is marked as having a range of street numbers, the last of the range.
NUMBER_SUFFIX
A raw vector. The suffix observed after the numbers. The PSMA technically has multiple suffixes for each number component.
H0
If hash_StreetName = TRUE, the DJB2 hash (as used in HashStreetName) of the street name. Observed to have performance benefits.
STREET_NAME
The street name, in uppercase. Streets such as 'THE ESPLANADE' or 'THE AVENUE' are treated as entirely made up of a street name and have a STREET_TYPE_CODE of zero.
STREET_TYPE_CODE
An integer, the street type code marking the type of street such as ROAD, STREET, AVENUE, etc. The codes correspond approximately to the rank of each street type's frequency in addresses.
STREET_TYPE
If integer_StreetType = FALSE, then the (uppercase) standard name of the street type.
POSTCODE
An integer vector, the postcode observed.
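A minimal sketch of a call that returns the table described above. The address strings are made up for illustration, and the call is guarded so the snippet does nothing if healthyAddress is not installed. No output is shown because it depends on the installed version.

```r
# Illustrative input only; these addresses are invented for the example.
addresses <- c("1 George Street Sydney NSW 2000",
               "Unit 5 12-14 Smith Road Richmond VIC 3121")

if (requireNamespace("healthyAddress", quietly = TRUE)) {
  # Default return.type is "data.table": one row per input address.
  out <- healthyAddress::standardize_address(addresses)
  print(names(out))  # column names of the standardized result
  print(out)
}
```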
Address
A character vector, either a full address or (if AddressLine2 is not NULL) the first line of an Australian address.
AddressLine2
Either NULL (the default) or a character vector, the same length as Address, giving the second line of the address.
return.type
Either "data.table" or "integer". "data.table" implies a table of columns separating the address components. "integer" means an integer vector creating a bijection between the address and the PSMA internal id.
integer_StreetType
Should the street type be returned as an integer vector?
hash_StreetName
Should STREET_NAME be returned as an integer hash, as in HashStreetName?
check
An integer, whether the inputs should be checked for possibly invalid addresses or addresses that may not be parsed correctly.
nThread
Number of threads to use.
Line1, Line2, Postcode
For addresses split by line. Line1 is assumed to end with the street type. Line2 is only used to determine the postcode, and then only if Postcode is NULL (the default).
KeepStreetName
Should the street name be included in the result as an additional character vector?
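For addresses that are already split by line, a call to standard_address3 might look like the following sketch. The inputs are invented for illustration, with each first line ending in the street type as the function assumes. The call is guarded so the snippet does nothing if healthyAddress is not installed.

```r
# Illustrative input only: first lines end with the street type,
# second lines carry the locality, state and postcode.
line1 <- c("1 George Street", "Unit 5 12-14 Smith Road")
line2 <- c("Sydney NSW 2000", "Richmond VIC 3121")

if (requireNamespace("healthyAddress", quietly = TRUE)) {
  # Postcode is left as NULL, so it is determined from the second line.
  res3 <- healthyAddress::standard_address3(line1, line2, KeepStreetName = TRUE)
  print(res3)
}
```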
By convention observed in the PSMA, street names such as 'THE ESPLANADE' have a street name of 'THE ESPLANADE' and an absent street type code.
Non-addresses passed have unspecified behaviour, though usually the numbers of the standard address will be 0 or NA. Postcodes may be negative in some circumstances where a postcode is not detected, though this should not be relied on.
For maximum performance, consider setting integer_StreetType and hash_StreetName to TRUE. Joining two tables has been observed to be faster when using the hash of the standardized street name rather than the street name itself, even after accounting for the cost of hashing.
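To give a sense of why a hashed street name can serve as a cheap join key, here is a base R sketch of the textbook DJB2 string hash (seed 5381, multiplier 33, wrapped to 32 bits). The helper djb2 is hypothetical and written only for illustration. The package's own HashStreetName may differ in details such as signedness or character handling, so these values should not be expected to match H0.

```r
# Textbook DJB2: h = h * 33 + byte, starting from 5381.
# Doubles are used with an explicit modulo because R integers
# overflow to NA instead of wrapping around.
djb2 <- function(s) {
  h <- 5381
  for (b in as.integer(charToRaw(s))) {
    h <- (h * 33 + b) %% 2^32
  }
  h
}

# Hash a few uppercase street names; equal names give equal hashes,
# so the numeric value can stand in for the string when joining.
street_names <- c("GEORGE", "SMITH", "THE ESPLANADE")
hashes <- vapply(street_names, djb2, numeric(1), USE.NAMES = FALSE)
```

Comparing fixed-width numbers is generally cheaper than comparing variable-length strings, which is consistent with the join speed-up described above.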
For performance reasons, addresses with more than 32 words are not supported.
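Since long inputs are not supported, it may be worth screening them out before standardizing. The sketch below counts whitespace-separated words in base R. The helper count_words is hypothetical, and splitting on whitespace is an assumption, since how the package itself counts words is not stated here.

```r
# Count whitespace-separated words in each element of a character vector.
count_words <- function(x) {
  x <- trimws(x)
  n <- lengths(strsplit(x, "[[:space:]]+"))
  n[is.na(x)] <- NA_integer_  # strsplit() returns NA (length 1) for NA input
  n
}

# Illustrative input: one ordinary address and one overly long string.
candidates <- c("1 George Street Sydney NSW 2000",
                paste(rep("WORD", 40), collapse = " "))
too_long <- count_words(candidates) > 32L
supported <- candidates[!too_long]
```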
If a postcode-like number exists at the end of an Address but is not in fact a postcode, then NA will be in each field except postcode, which will have the value -1.
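Given the behaviour described above, one defensive pattern is to flag rows whose postcode is missing or negative, or whose first street number is missing, before using the result. The helper flag_unparsed is hypothetical, and the small table below is hand-built for illustration rather than produced by the package.

```r
# Flag rows that look like they were not parsed as a valid address:
# a missing or negative POSTCODE, or a missing NUMBER_FIRST.
flag_unparsed <- function(res) {
  is.na(res$POSTCODE) | res$POSTCODE < 0L | is.na(res$NUMBER_FIRST)
}

# Hand-built stand-in for a standardized result (not real package output).
mock_result <- data.frame(
  NUMBER_FIRST = c(1L, NA, 12L),
  POSTCODE     = c(2000L, -1L, NA)
)
suspect <- flag_unparsed(mock_result)
clean <- mock_result[!suspect, ]
```

Because negative postcodes are not guaranteed, the check treats them as one signal among several and does not rely on them alone.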