This function calculates both dissimilarity and Euclidean distances for genlight or snpclone objects.
bitwise.dist(
  x,
  percent = TRUE,
  mat = FALSE,
  missing_match = TRUE,
  scale_missing = FALSE,
  euclidean = FALSE,
  differences_only = FALSE,
  threads = 0L
)
A dist object containing pairwise distances between samples.
x: a genlight or snpclone object.
percent: logical. Should the distance be represented from 0 to 1? Defaults to TRUE. FALSE returns the distance as integers from 0 to n, where n is the number of loci. This option has no effect if euclidean = TRUE.
mat: logical. Return a matrix object. Defaults to FALSE, returning a dist object. TRUE returns a matrix object.
missing_match: logical. Determines whether two samples differing by missing data at a location should be counted as matching at that location. Defaults to TRUE, which forces missing data to match with anything. FALSE forces missing data to not match with any other information, including other missing data.
scale_missing: logical. If TRUE, comparisons with missing data are scaled up proportionally to the number of columns used by multiplying the value by m / (m - x), where m is the number of loci and x is the number of missing sites. This option matches the behavior of base R's dist() function. Defaults to FALSE.
euclidean: logical. If TRUE, the Euclidean distance will be calculated.
differences_only: logical. When differences_only = TRUE, the output will reflect the number of different loci. The default setting, differences_only = FALSE, reflects the number of different alleles. Note: this has no effect on haploid organisms since 1 locus = 1 allele. This option is NOT recommended.
threads: The maximum number of parallel threads to be used within this function. A value of 0 (default) will attempt to use as many threads as there are available cores/CPUs. In most cases this is ideal. A value of 1 will force the function to run serially, which may increase stability on some systems. Other values may be specified, but should be used with caution.
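The two missing-data rules described for missing_match can be made concrete with a small base R sketch. The helper count_diffs() below is hypothetical and not part of poppr; it works on plain vectors rather than genlight objects and only mirrors the documented rule, not the package's bitwise implementation.

```r
# Hypothetical helper (not a poppr function): counts differing positions
# between two equal-length vectors under the two documented NA rules.
count_diffs <- function(a, b, missing_match = TRUE) {
  stopifnot(length(a) == length(b))
  differs <- a != b                          # NA wherever either value is missing
  differs[is.na(differs)] <- !missing_match  # TRUE rule: match; FALSE rule: mismatch
  sum(differs)
}

s1 <- c(0, 1, NA, 1, NA)
s2 <- c(0, 0, 1, NA, NA)

count_diffs(s1, s2, missing_match = TRUE)   # 1: only position 2 differs
count_diffs(s1, s2, missing_match = FALSE)  # 4: positions 3, 4 and 5 also count
```

Note that position 5 is missing in both vectors and still counts as a mismatch under missing_match = FALSE, in line with the description above.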
Zhian N. Kamvar, Jonah C. Brooks
The default distance calculated here is quite simple and goes by many names depending on its application. The most familiar name might be the Hamming distance, or the number of differences between two strings.
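As a rough illustration of this idea, the base R sketch below counts differences between two made-up samples. It assumes diploid genotypes coded as the number of copies of the second allele (0, 1 or 2), and it uses plain vectors instead of genlight objects, so it shows the concept rather than what bitwise.dist() does internally. The two counts correspond to the loci-versus-alleles distinction described for differences_only.

```r
# Illustration only: plain integer vectors stand in for two diploid samples,
# coded as the number of copies of the second allele at each locus (0, 1, 2).
g1 <- c(0, 1, 2, 2, 0)
g2 <- c(2, 1, 1, 2, 0)

# Hamming-style count: at how many loci do the two samples differ?
loci_diff <- sum(g1 != g2)

# Allele-level count: how many allele copies differ across all loci?
allele_diff <- sum(abs(g1 - g2))

loci_diff    # 2 (loci 1 and 3)
allele_diff  # 3 (two alleles at locus 1, one at locus 3)
```

For haploid data coded as 0/1, the two counts coincide, which is why differences_only has no effect there.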
As of poppr version 2.8.0, this function also calculates Euclidean distance and is considerably faster and more memory-efficient than the standard dist() function.
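The m / (m - x) scaling that scale_missing is documented to share with dist() can be checked by hand in base R. The sketch below uses two made-up vectors and assumes, as base R does, that the scaling is applied to the sum of squared differences before taking the square root; it does not call bitwise.dist() itself.

```r
# Two made-up samples with missing data at different loci
a <- c(0, 1, 2, NA, 1, 0)
b <- c(1, 1, 0, 2, NA, 0)

m    <- length(a)              # total number of loci
used <- !is.na(a) & !is.na(b)  # loci observed in both samples
x    <- m - sum(used)          # loci dropped because of missing data

# Sum of squared differences over the observed loci, scaled by m / (m - x)
scaled <- sqrt(sum((a[used] - b[used])^2) * m / (m - x))

# Base R's dist() applies the same scaling when it skips NA columns
base_r <- as.numeric(dist(rbind(a, b)))

all.equal(scaled, base_r)
```

Here four of six loci are usable, so the sum of squares (5) is multiplied by 6 / 4 before the square root is taken.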
diss.dist(), snpclone, genlight, win.ia(), samp.ia()
set.seed(999)
x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2)
x
# Assess fraction of different alleles
system.time(xd <- bitwise.dist(x, threads = 1L))
xd
# Calculate Euclidean distance
system.time(xdt <- bitwise.dist(x, euclidean = TRUE, scale_missing = TRUE, threads = 1L))
xdt
if (FALSE) {
  # This function is more efficient in both memory and speed than dist() for
  # calculating Euclidean distance on genlight objects. For example, we can
  # observe a clear speed increase when we attempt a calculation on 100k SNPs
  # with 10% missing data:
  set.seed(999)
  mat <- matrix(sample(c(0:2, NA),
                       100000 * 50,
                       replace = TRUE,
                       prob = c(0.3, 0.3, 0.3, 0.1)),
                nrow = 50)
  glite <- new("genlight", mat, ploidy = 2)
  # Default Euclidean distance
  system.time(dist(glite))
  # Bitwise dist
  system.time(bitwise.dist(glite, euclidean = TRUE, scale_missing = TRUE))
}