This function reads genotypes encoded in a Plink-formatted BED (binary) file, returning them in a standard R matrix containing genotypes encoded numerically as dosages (values in c( 0, 1, 2, NA )
).
Each genotype per locus (m
loci) and individual (n
total) counts the number of reference alleles, or NA
for missing data.
No *.fam or *.bim files are read by this basic function.
Since BED does not encode the data dimensions internally, these values must be provided by the user.
read_bed(
file,
names_loci = NULL,
names_ind = NULL,
m_loci = NA,
n_ind = NA,
ext = "bed",
verbose = TRUE
)
The m
-by-n
genotype matrix.
Input file path.
*.bed extension may be omitted (will be added automatically if file
doesn't exist but file
.bed does).
See ext
option below.
Vector of loci names, to become the row names of the genotype matrix.
If provided, its length sets m_loci
below.
If NULL
, the returned genotype matrix will not have row names, and m_loci
must be provided.
Vector of individual names, to become the column names of the genotype matrix.
If provided, its length sets n_ind
below.
If NULL
, the returned genotype matrix will not have column names, and n_ind
must be provided.
Number of loci in the input genotype table.
Required if names_loci = NULL
, as its value is not deducible from the BED file itself.
Ignored if names_loci
is provided.
Number of individuals in the input genotype table.
Required if names_ind = NULL
, as its value is not deducible from the BED file itself.
Ignored if names_ind
is provided.
The desired file extension (default "bed").
Ignored if file
points to an existing file.
Set to NA
to force file
to exist as-is.
If TRUE
(default) function reports the path of the file being read (after autocompleting the extension).
The code enforces several checks to validate data given the requested dimensions. Errors are thrown if file terminates too early or does not terminate after genotype matrix is filled. In addition, as each locus is encoded in an integer number of bytes, and each byte contains up to four individuals, bytes with fewer than four are padded. To agree with other software (plink2, BEDMatrix), byte padding values are ignored (may take on any value without causing errors).
This function only supports locus-major BED files, which are the standard for modern data. Format is validated via the BED file's magic numbers (first three bytes of file). Older BED files can be converted using Plink.
read_plink()
for reading a set of BED/BIM/FAM files.
geno_to_char()
for translating numerical genotypes into more human-readable character encodings.
Plink BED format reference: https://www.cog-genomics.org/plink/1.9/formats#bed
# first obtain data dimensions from BIM and FAM files
# all file paths
file_bed <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE)
file_bim <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE)
file_fam <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE)
# read annotation tables
bim <- read_bim(file_bim)
fam <- read_fam(file_fam)
# read an existing Plink *.bim file
# pass locus and individual IDs as vectors, setting data dimensions too
X <- read_bed(file_bed, bim$id, fam$id)
X
# can specify without extension
file_bed <- sub('\\.bed$', '', file_bed) # remove extension from this path on purpose
file_bed # verify .bed is missing
X <- read_bed(file_bed, bim$id, fam$id) # loads too!
X
Run the code above in your browser using DataLab