wordpiece

The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.

Installation

You can install the released version of wordpiece from CRAN with:

install.packages("wordpiece")

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")

Examples

This package can be used to tokenize text for modeling. A common use case is to tokenize every text entry in a column of a data.frame or tibble.

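library(wordpiece)
# dplyr supplies tibble(), mutate(), and the %>% pipe used below.
library(dplyr, warn.conflicts = FALSE)
df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>%
  mutate(
    # wordpiece_tokenize() returns one element per input string,
    # so tokens becomes a list column.
    tokens = wordpiece_tokenize(text)
  )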

df_tokenized
#> # A tibble: 3 x 2
#>   text                                          tokens    
#>   <chr>                                         <list>    
#> 1 I like tacos.                                 <dbl [5]> 
#> 2 I like apples with cheese.                    <dbl [6]> 
#> 3 The unaffable coder wrote incorrect examples. <dbl [10]>
df_tokenized$tokens[[1]]
#>     i  like    ta ##cos     . 
#>  1045  2066 11937 13186  1012
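
The split of "tacos" into "ta" and "##cos" above is the kind of result a greedy longest-match-first search produces, which is the usual way wordpiece tokenizers break a word into pieces. The following is a minimal base-R sketch of that idea, not the package's actual implementation. The function name tokenize_word_sketch and the toy vocabulary are hypothetical, and the sketch skips lowercasing, punctuation splitting, and the mapping from tokens to integer ids.

```r
# Hypothetical toy vocabulary; real vocabularies hold thousands of tokens.
toy_vocab <- c("i", "like", "ta", "##cos", ".", "[UNK]")

# Greedy longest-match-first split of a single word (illustrative only).
# Every piece after the first carries a "##" continuation prefix.
tokenize_word_sketch <- function(word, vocab, unk_token = "[UNK]") {
  chars <- strsplit(word, "", fixed = TRUE)[[1]]
  n <- length(chars)
  pieces <- character(0)
  start <- 1
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Shrink the candidate from the right until it is in the vocabulary.
    while (end >= start) {
      candidate <- paste(chars[start:end], collapse = "")
      if (start > 1) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        found <- candidate
        break
      }
      end <- end - 1
    }
    # If no piece fits, the whole word maps to the unknown token.
    if (is.na(found)) return(unk_token)
    pieces <- c(pieces, found)
    start <- end + 1
  }
  pieces
}

tokenize_word_sketch("tacos", toy_vocab)   # returns c("ta", "##cos")
tokenize_word_sketch("like", toy_vocab)    # returns "like"
tokenize_word_sketch("apples", toy_vocab)  # returns "[UNK]" with this toy vocabulary
```

With a full vocabulary, a word such as "apples" would normally match a whole-word entry or a few pieces instead of falling back to the unknown token.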

Code of Conduct

Please note that the wordpiece project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Disclaimer

This is not an officially supported Macmillan Learning product.

Contact information

Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).

Monthly Downloads: 823
Version: 2.1.3
License: Apache License (>= 2)

Maintainer: Jonathan Bratt
Last Published: March 3rd, 2022

Functions in wordpiece (2.1.3)

prepare_vocab: Format a Token List as a Vocabulary
.validate_wordpiece_vocabulary: Validator for Objects of Class wordpiece_vocabulary
wordpiece_tokenize: Tokenize Sequence with Word Pieces
wordpiece_cache_dir: Retrieve Directory for wordpiece Cache
set_wordpiece_cache_dir: Set a Cache Directory for wordpiece
reexports: Objects exported from other packages
load_vocab: Load a vocabulary file
.process_wp_vocab: Process a Wordpiece Vocabulary for Tokenization
.process_vocab: Process a Vocabulary for Tokenization
.wp_tokenize_word: Tokenize a Word
.wp_tokenize_single_string: Tokenize an Input Word-by-word
.new_wordpiece_vocabulary: Constructor for Class wordpiece_vocabulary
.get_casedness: Determine Casedness of Vocabulary
load_or_retrieve_vocab: Load a vocabulary file, or retrieve from cache
.infer_case_from_vocab: Determine Vocabulary Casedness