RBERT

RBERT is an R implementation of the Python package BERT developed at Google for Natural Language Processing.

Installation

You can install RBERT from GitHub with:

# install.packages("devtools")
devtools::install_github(
  "jonathanbratt/RBERT", 
  build_vignettes = TRUE
)

TensorFlow Installation

RBERT requires TensorFlow. Currently the version must be <= 1.13.1. You can install it using the tensorflow package (installed as a dependency of this package; see note below about Windows).

tensorflow::install_tensorflow(version = "1.13.1")
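
After installation, you can confirm which TensorFlow version reticulate picked up (a general tensorflow-package convenience, not an RBERT-specific step):

# Should report TensorFlow 1.13.1 along with the Python configuration in use.
tensorflow::tf_config()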

Windows

The current CRAN version of reticulate (1.13) causes some issues with the TensorFlow installation. Rebooting your machine after installing Anaconda seems to fix this issue; alternatively, upgrade to the development version of reticulate:

devtools::install_github("rstudio/reticulate")

Basic usage

RBERT is a work in progress. While fine-tuning a BERT model using RBERT may be possible, it is not currently recommended.

RBERT is best suited for exploring pre-trained BERT models, and obtaining contextual representations of input text for use as features in downstream tasks.
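
As a rough sketch of that feature-extraction workflow: download_BERT_checkpoint() and extract_features() are exported by the package, but the argument names (model, examples, ckpt_dir, layer_indexes) and the checkpoint name shown below are assumptions to be checked against the "Introduction to RBERT" vignette.

library(RBERT)

# Download and cache a pre-trained BERT checkpoint (model name assumed here).
BERT_PRETRAINED_DIR <- download_BERT_checkpoint(model = "bert_base_uncased")

# Sentences in which the same word should receive different contextual embeddings.
sentences <- c(
  "I saw the branch on the bank.",
  "I saw the branch of the bank."
)

# Extract layer activations to use as features in a downstream task.
BERT_feats <- extract_features(
  examples = sentences,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)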

  • See the "Introduction to RBERT" vignette included with the package for more specific examples.
  • For a quick explanation of what BERT is, see the "BERT Basics" vignette.
  • The package RBERTviz provides tools for making fun and easy visualizations of BERT data.
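
If the package was installed with build_vignettes = TRUE, the vignettes listed above can be browsed directly from R:

browseVignettes("RBERT")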

Running Tests

The first time you run the test suite, the 388.8 MB bert_base_uncased.zip file will be downloaded to your tests/testthat/test_checkpoints directory. Subsequent test runs will reuse that download. This was our best compromise to allow relatively rapid testing without bloating the repository.
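
To run the suite yourself from a local clone of the repository, the usual devtools workflow applies (shown here as a general convention rather than an RBERT-specific command):

# Run from the root of the RBERT source directory.
devtools::test()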

Disclaimer

This is not an officially supported Macmillan Learning product.

Contact information

Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jon.harmon@macmillan.com).

Version

0.1.11

License

file LICENSE

Maintainer

Jonathan Bratt

Last Published

April 2nd, 2021

Functions in RBERT (0.1.11)

WordpieceTokenizer: Construct objects of WordpieceTokenizer class.
InputExample: Construct objects of class InputExample
InputExample_EF: Construct objects of class InputExample_EF
AdamWeightDecayOptimizer: Constructor for objects of class AdamWeightDecayOptimizer
BertConfig: Construct objects of BertConfig class
BasicTokenizer: Construct objects of BasicTokenizer class.
FullTokenizer: Construct objects of FullTokenizer class.
InputFeatures: Construct objects of class InputFeatures
BertModel: Construct object of class BertModel
apply_to_chars: Apply a function to each character in a string.
bert_config_from_json_file: Load BERT config object from json file
assert_rank: Confirm the rank of a tensor
clean_text: Perform invalid character removal and whitespace cleanup on text.
create_optimizer: Create an optimizer training op
convert_single_example: Convert a single InputExample into a single InputFeatures
create_model: Create a classification model
create_initializer: Create truncated normal initializer
convert_examples_to_features: Convert InputExamples to InputFeatures
.InputFeatures_EF: Construct objects of class InputFeatures_EF
.infer_archive_type: Infer the archive type for a BERT checkpoint
.choose_BERT_dir: Choose a directory for BERT checkpoints
convert_by_vocab: Convert a sequence of tokens/ids using the provided vocab.
convert_to_unicode: Convert `text` to Unicode
check_vocab: Check Vocabulary
.convert_examples_to_features_EF: Convert InputExample_EFs to InputFeatures_EF
.infer_checkpoint_archive_path: Infer the path to the archive for a BERT checkpoint
.get_actual_index: Standardize Indices
attention_layer: Build multi-headed attention layer
.get_model_archive_path: Locate an archive file for a BERT checkpoint
.get_model_archive_type: Get archive type of a BERT checkpoint
download_BERT_checkpoint: Download a BERT checkpoint
.process_BERT_checkpoint: Unzip and check a BERT checkpoint zip
get_activation: Map a string to a Python function
get_shape_list: Return the shape of tensor
.download_BERT_checkpoint: Download a checkpoint zip file
.get_model_url: Get url of a BERT checkpoint
.convert_single_example_EF: Convert a single InputExample_EF into a single InputFeatures_EF
reshape_from_matrix: Turn a matrix into a tensor
get_assignment_map_from_checkpoint: Compute the intersection of the current variables and checkpoint variables
file_based_convert_examples_to_features: Convert a set of InputExamples to a TFRecord file.
.get_model_subdir: Locate a subdir for a BERT checkpoint
input_fn_builder: Create an input_fn closure to be passed to TPUEstimator
layer_norm_and_dropout: Run layer normalization followed by dropout
load_vocab: Load a vocabulary file
whitespace_tokenize: Run basic whitespace cleaning and splitting on a piece of text.
reshape_to_matrix: Turn a tensor into a matrix
tokenize_word: Tokenize a single "word" (no whitespace).
embedding_postprocessor: Perform various post-processing on a word embedding tensor
.has_checkpoint: Check whether the user already has a checkpoint
create_attention_mask_from_input_mask: Create 3D attention mask from a 2D tensor mask
extract_features: Extract output features from BERT
.infer_ckpt_dir: Infer the subdir for a BERT checkpoint
.infer_model_paths: Find Paths to Checkpoint Files
dropout: Perform Dropout
make_examples_simple: Easily make examples for BERT
file_based_input_fn_builder: Create an input_fn closure to be passed to TPUEstimator
is_whitespace: Check whether `char` is a whitespace character.
transformer_model: Build multi-head, multi-layer Transformer
model_fn_builder: Define model_fn closure for TPUEstimator
layer_norm: Run layer normalization
transpose_for_scores: Reshape and transpose tensor
truncate_seq_pair: Truncate a sequence pair to the maximum length.
embedding_lookup: Look up word embeddings for an id tensor
input_fn_builder_EF: Create an input_fn closure to be passed to TPUEstimator
set_BERT_dir: Set the directory for BERT checkpoints
is_chinese_char: Check whether cp is the codepoint of a CJK character.
split_on_punc: Split text on punctuation.
tokenize: Tokenizers for various objects.
tokenize_text: Tokenize Text with Word Pieces
tokenize_chinese_chars: Add whitespace around any CJK character.
strip_accents: Strip accents from a piece of text.
find_files: Find Checkpoint Files
.maybe_download_checkpoint: Find or Possibly Download a Checkpoint
.model_fn_builder_EF: Define model_fn closure for TPUEstimator
gelu: Gaussian Error Linear Unit
is_control: Check whether `char` is a control character.
is_punctuation: Check whether `char` is a punctuation character.