audubon

audubon provides Japanese text processing tools for:

  • filling Japanese iteration marks
  • hiraganization, katakanization and romanization using hakatashi/japanese.js
  • segmentation by phrase using google/budoux and ‘TinySegmenter.js’
  • text normalization based on the rules of the ‘Sudachi’ morphological analyzer and ‘NEologd’ (the neologism dictionary for ‘MeCab’).

Some features above are not implemented in ‘ICU’ (i.e., the stringi package), and the goal of the audubon package is to provide these additional features.

Installation

remotes::install_github("paithiov909/audubon")

Usage

Fill Japanese iteration marks (Odori-ji)

strj_fill_iter_mark replaces each iteration mark with the preceding character it repeats, if the element has more than 5 characters. You can combine this function with strj_normalize or strj_rewrite_as_def.

strj_fill_iter_mark(c(
  "あいうゝ〃かき",
  "金子みすゞ",
  "のたり〳〵かな",
  "しろ/″\\とした"
))
#> [1] "あいうううかき"  "金子みすず"     "のたりたりかな"  "しろじろとした"

strj_fill_iter_mark("いすゞエルフトラック") |>
  strj_normalize()
#> [1] "いすずエルフトラック"
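
To make the idea concrete, here is a minimal base-R sketch of what filling an iteration mark means. It is not the package's implementation: the helper name fill_simple_iter_mark is hypothetical, it handles only the unvoiced single-character marks ゝ and ヽ, and it ignores the length condition as well as the voiced and multi-character marks that strj_fill_iter_mark covers.

```r
# Hypothetical helper, for illustration only (not part of audubon).
# Replaces each unvoiced iteration mark with the character before it.
fill_simple_iter_mark <- function(text) {
  marks <- c("ゝ", "ヽ")
  vapply(text, function(elem) {
    chars <- strsplit(elem, "", fixed = TRUE)[[1]]
    for (i in seq_along(chars)) {
      # A mark at the very start has nothing to repeat, so it is left alone.
      if (i > 1 && chars[i] %in% marks) {
        chars[i] <- chars[i - 1]
      }
    }
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

result <- fill_simple_iter_mark(c("こゝろ", "あいうゝ"))
# result is c("こころ", "あいうう")
```

Because the loop reads the already-filled previous character, consecutive marks repeat the same character. This sketch assumes a UTF-8 session.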

Character class conversion

Character class conversion uses hakatashi/japanese.js.

strj_hiraganize("あのイーハトーヴォのすきとおった風")
#> [1] "あのいーはとーゔぉのすきとおった風"
strj_katakanize("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
strj_romanize("あのイーハトーヴォのすきとおった風")
#> [1] "anoīhatōvonosukitōtta"

Segmentation by phrase

strj_tokenize splits Japanese text into phrases using google/budoux, TinySegmenter, or other tokenizers.

strj_tokenize("あのイーハトーヴォのすきとおった風", engine = "budoux")
#> $`1`
#> [1] "あのイーハトーヴォの" "すきと"               "おった"              
#> [4] "風"
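
Since different engines can place boundaries differently, it may help to compare them on the same input. The sketch below assumes that "tinyseg" is accepted as an engine name alongside "budoux" (this is my recollection of the function's arguments, not something stated above), and it relies on the list-shaped return value shown in the example above.

```r
library(audubon)

text <- "あのイーハトーヴォのすきとおった風"

# Assumption: "tinyseg" is a valid value for `engine`, like "budoux".
engines <- c("budoux", "tinyseg")

# strj_tokenize returns a list with one element per input string,
# so [[1]] extracts the tokens for our single sentence.
tokens <- lapply(engines, function(e) strj_tokenize(text, engine = e)[[1]])
names(tokens) <- engines

# Number of tokens produced by each engine
lengths(tokens)
```

No output is shown here because the exact segmentation depends on the engine and the package version.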

Japanese text normalization

strj_normalize normalizes text following rules based on the NEologd style.

strj_normalize("――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
#> [1] "ー南アルプスの天然水-Sparking* Lemon+レモン一絞り"

strj_rewrite_as_def is an R port of SudachiCharNormalizer that typically normalizes characters following a ’*.def’ file.

The audubon package ships with several ’*.def’ files; you can use them or write your own ‘rewrite.def’ file as follows.

# single characters will **never** be normalized.
…
# if two characters are separated by a tab,
# the left-side form is always rewritten to the right-side form
# before normalization.
斎   斉
齋   斉
齊   斉
# only rewriting a single character to a single character is supported,
# i.e., the following rule does not work.
アッ  ア

This feature offers finer control than stringi::stri_trans_* because users decide which characters are normalized. For instance, this function can be used to convert kyuji-tai (old-form) characters to shinji-tai (new-form) characters.

stringi::stri_trans_nfkc("Ⅹⅳ")
#> [1] "Xiv"
strj_rewrite_as_def("Ⅹⅳ")
#> [1] "Ⅹⅳ"
strj_rewrite_as_def("惡と假面のルール", read_rewrite_def(system.file("def/kyuji.def", package = "audubon")))
#> [1] "悪と仮面のルール"
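
As a rough sketch of writing your own definition file, the example below creates a small file in a temporary location and passes it to read_rewrite_def, in the same way the bundled kyuji.def is loaded above. The file contents, the variable names, and the sample input are made up for illustration. I have not verified how the reader treats a file this small, so treat it as a starting point rather than a reference.

```r
library(audubon)

# Hypothetical custom definition file, written as UTF-8.
def_path <- tempfile(fileext = ".def")
con <- file(def_path, open = "w", encoding = "UTF-8")
writeLines(
  c(
    "# a character on its own line is never normalized",
    "Ⅲ",
    "# tab-separated pairs: the left form is rewritten to the right form",
    "斎\t斉",
    "齋\t斉"
  ),
  con
)
close(con)

# Load the rules and apply them, mirroring the kyuji.def call above.
rules <- read_rewrite_def(def_path)
rewritten <- strj_rewrite_as_def("齋藤さん", rules)
rewritten
```

Writing the tab as "\t" avoids editors silently converting it to spaces, which would turn a rewrite rule into something the format does not describe.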

License

© 2023 Akiru Kato

Licensed under the Apache License, Version 2.0.

Icons made by iconixar from www.flaticon.com.

Install

install.packages('audubon')

Monthly Downloads: 726

Version: 0.5.1

License: Apache License (>= 2)

Maintainer: Akiru Kato

Last Published: May 2nd, 2023

Functions in audubon (0.5.1)

  • read_rewrite_def: Read a rewrite.def file
  • strj_fill_iter_mark: Fill Japanese iteration marks
  • strj_hiraganize: Hiraganize Japanese characters
  • strj_katakanize: Katakanize Japanese characters
  • strj_transcribe_num: Transcribe Arabic to Kansuji
  • strj_segment: Segment text into tokens
  • strj_tinyseg: Segment text into phrases
  • strj_romanize: Romanize Japanese Hiragana and Katakana
  • strj_tokenize: Split text into tokens
  • hiroba: Whole tokens of 'Porano no Hiroba' written by Miyazawa Kenji from Aozora Bunko
  • get_dict_features: Get dictionary's features
  • mute_tokens: Mute tokens by condition
  • lex_density: Calculate lexical density
  • ngram_tokenizer: Ngrams tokenizer
  • bind_tf_idf2: Bind term frequency and inverse document frequency
  • bind_lr: Bind importance of bigrams
  • collapse_tokens: Collapse sequences of tokens by condition
  • pack: Pack a data.frame of tokens
  • prettify: Prettify tokenized output
  • polano: Whole text of 'Porano no Hiroba' written by Miyazawa Kenji from Aozora Bunko
  • strj_rewrite_as_def: Rewrite text using rewrite.def
  • strj_normalize: Convert text following the rules of 'NEologd'
  • audubon-package: audubon: Japanese Text Processing Tools