RBERT

RBERT is an R implementation of the Python package BERT developed at Google for Natural Language Processing.

Installation

You can install RBERT from GitHub with:

# install.packages("devtools")
devtools::install_github(
  "jonathanbratt/RBERT", 
  build_vignettes = TRUE
)

TensorFlow Installation

RBERT requires TensorFlow. Currently the version must be <= 1.13.1. You can install it using the tensorflow package (installed as a dependency of this package; see note below about Windows).

tensorflow::install_tensorflow(version = "1.13.1")
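
After installation, you can confirm which TensorFlow version reticulate picked up (a general tensorflow-package convenience, not an RBERT-specific step):

# Should report TensorFlow 1.13.1 along with the Python configuration in use.
tensorflow::tf_config()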

Windows

The current CRAN version of reticulate (1.13) causes some issues with the TensorFlow installation. Rebooting your machine after installing Anaconda seems to fix this issue; alternatively, upgrade to the development version of reticulate:

devtools::install_github("rstudio/reticulate")

Basic usage

RBERT is a work in progress. While fine-tuning a BERT model using RBERT may be possible, it is not currently recommended.

RBERT is best suited for exploring pre-trained BERT models, and obtaining contextual representations of input text for use as features in downstream tasks.
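
As a rough sketch of that feature-extraction workflow: download_BERT_checkpoint() and extract_features() are exported by the package, but the argument names (model, examples, ckpt_dir, layer_indexes) and the checkpoint name shown below are assumptions to be checked against the "Introduction to RBERT" vignette.

library(RBERT)

# Download and cache a pre-trained BERT checkpoint (model name assumed here).
BERT_PRETRAINED_DIR <- download_BERT_checkpoint(model = "bert_base_uncased")

# Sentences in which the same word should receive different contextual embeddings.
sentences <- c(
  "I saw the branch on the bank.",
  "I saw the branch of the bank."
)

# Extract layer activations to use as features in a downstream task.
BERT_feats <- extract_features(
  examples = sentences,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)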

  • See the "Introduction to RBERT" vignette included with the package for more specific examples.
  • For a quick explanation of what BERT is, see the "BERT Basics" vignette.
  • The package RBERTviz provides tools for making fun and easy visualizations of BERT data.
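
If the package was installed with build_vignettes = TRUE, the vignettes listed above can be browsed directly from R:

browseVignettes("RBERT")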

Running Tests

The first time you run the test suite, the 388.8 MB bert_base_uncased.zip file will be downloaded to your tests/testthat/test_checkpoints directory. Subsequent test runs will reuse that download. This was our best compromise to allow relatively rapid testing without bloating the repository.
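
To run the suite yourself from a local clone of the repository, the usual devtools workflow applies (shown here as a general convention rather than an RBERT-specific command):

# Run from the root of the RBERT source directory.
devtools::test()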

Disclaimer

This is not an officially supported Macmillan Learning product.

Contact information

Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jon.harmon@macmillan.com).

Version

0.1.11

License

file LICENSE

Maintainer

Jonathan Bratt

Last Published

April 2nd, 2021

Functions in RBERT (0.1.11)

WordpieceTokenizer: Construct objects of WordpieceTokenizer class.
InputExample: Construct objects of class InputExample
InputExample_EF: Construct objects of class InputExample_EF
AdamWeightDecayOptimizer: Constructor for objects of class AdamWeightDecayOptimizer
BertConfig: Construct objects of BertConfig class
BasicTokenizer: Construct objects of BasicTokenizer class.
FullTokenizer: Construct objects of FullTokenizer class.
InputFeatures: Construct objects of class InputFeatures
BertModel: Construct object of class BertModel
apply_to_chars: Apply a function to each character in a string.
bert_config_from_json_file: Load BERT config object from json file
assert_rank: Confirm the rank of a tensor
clean_text: Perform invalid character removal and whitespace cleanup on text.
create_optimizer: Create an optimizer training op
convert_single_example: Convert a single InputExample into a single InputFeatures
create_model: Create a classification model
create_initializer: Create truncated normal initializer
convert_examples_to_features: Convert InputExamples to InputFeatures
.InputFeatures_EF: Construct objects of class InputFeatures_EF
.infer_archive_type: Infer the archive type for a BERT checkpoint
.choose_BERT_dir: Choose a directory for BERT checkpoints
convert_by_vocab: Convert a sequence of tokens/ids using the provided vocab.
convert_to_unicode: Convert `text` to Unicode
check_vocab: Check Vocabulary
.convert_examples_to_features_EF: Convert InputExample_EFs to InputFeatures_EF
.infer_checkpoint_archive_path: Infer the path to the archive for a BERT checkpoint
.get_actual_index: Standardize Indices
attention_layer: Build multi-headed attention layer
.get_model_archive_path: Locate an archive file for a BERT checkpoint
.get_model_archive_type: Get archive type of a BERT checkpoint
download_BERT_checkpoint: Download a BERT checkpoint
.process_BERT_checkpoint: Unzip and check a BERT checkpoint zip
get_activation: Map a string to a Python function
get_shape_list: Return the shape of tensor
.download_BERT_checkpoint: Download a checkpoint zip file
.get_model_url: Get url of a BERT checkpoint
.convert_single_example_EF: Convert a single InputExample_EF into a single InputFeatures_EF
reshape_from_matrix: Turn a matrix into a tensor
get_assignment_map_from_checkpoint: Compute the intersection of the current variables and checkpoint variables
file_based_convert_examples_to_features: Convert a set of InputExamples to a TFRecord file.
.get_model_subdir: Locate a subdir for a BERT checkpoint
input_fn_builder: Create an input_fn closure to be passed to TPUEstimator
layer_norm_and_dropout: Run layer normalization followed by dropout
load_vocab: Load a vocabulary file
whitespace_tokenize: Run basic whitespace cleaning and splitting on a piece of text.
reshape_to_matrix: Turn a tensor into a matrix
tokenize_word: Tokenize a single "word" (no whitespace).
embedding_postprocessor: Perform various post-processing on a word embedding tensor
.has_checkpoint: Check whether the user already has a checkpoint
create_attention_mask_from_input_mask: Create 3D attention mask from a 2D tensor mask
extract_features: Extract output features from BERT
.infer_ckpt_dir: Infer the subdir for a BERT checkpoint
.infer_model_paths: Find Paths to Checkpoint Files
dropout: Perform Dropout
make_examples_simple: Easily make examples for BERT
file_based_input_fn_builder: Create an input_fn closure to be passed to TPUEstimator
is_whitespace: Check whether `char` is a whitespace character.
transformer_model: Build multi-head, multi-layer Transformer
model_fn_builder: Define model_fn closure for TPUEstimator
layer_norm: Run layer normalization
transpose_for_scores: Reshape and transpose tensor
truncate_seq_pair: Truncate a sequence pair to the maximum length.
embedding_lookup: Look up word embeddings for an id tensor
input_fn_builder_EF: Create an input_fn closure to be passed to TPUEstimator
set_BERT_dir: Set the directory for BERT checkpoints
is_chinese_char: Check whether cp is the codepoint of a CJK character.
split_on_punc: Split text on punctuation.
tokenize: Tokenizers for various objects.
tokenize_text: Tokenize Text with Word Pieces
tokenize_chinese_chars: Add whitespace around any CJK character.
strip_accents: Strip accents from a piece of text.
find_files: Find Checkpoint Files
.maybe_download_checkpoint: Find or Possibly Download a Checkpoint
.model_fn_builder_EF: Define model_fn closure for TPUEstimator
gelu: Gaussian Error Linear Unit
is_control: Check whether `char` is a control character.
is_punctuation: Check whether `char` is a punctuation character.