cwbtools (version 0.3.3)

p_attribute_encode: Encode Positional Attribute(s).

Description

Pure R implementation to generate a positional attribute from a character vector of tokens (the token stream).

Usage

p_attribute_encode(
  token_stream,
  p_attribute = "word",
  registry_dir,
  corpus,
  data_dir,
  method = c("R", "CWB"),
  verbose = TRUE,
  encoding = get_encoding(token_stream),
  compress = FALSE
)

p_attribute_recode(
  data_dir,
  p_attribute,
  from = c("UTF-8", "latin1"),
  to = c("UTF-8", "latin1")
)

Arguments

token_stream

A character vector with the tokens of the corpus.

p_attribute

The positional attribute.

registry_dir

Registry directory (needed by p_attribute_huffcode and p_attribute_compress_rdx).

corpus

The CWB corpus (needed by p_attribute_huffcode and p_attribute_compress_rdx).

data_dir

The data directory for the corpus with the binary files.

method

Either 'CWB' or 'R'.

verbose

Logical, whether to print progress messages.

encoding

Encoding as defined in the charset corpus property of the registry file for the corpus ('latin1' to 'latin9', and 'utf8').

compress

Logical.

from

Character string describing the current encoding of the attribute.

to

Character string describing the target encoding of the attribute.

Details

Four steps generate the binary CWB corpus data format for positional attributes: First, encode a character vector (the token stream) using p_attribute_encode. Second, create the reverse index using p_attribute_makeall. Third, compress the token stream using p_attribute_huffcode. Fourth, compress the index files using p_attribute_compress_rdx.
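
As a rough illustration of the first step, the sketch below maps a token stream to integer ids and a lexicon using base R only. It is not the cwbtools implementation: the helper name encode_tokens_sketch and the file name are hypothetical, and the layout (0-based ids in order of first occurrence, written as 4-byte big-endian integers) is an assumption based on the CWB encoding tutorial.

```r
# Hypothetical helper, not part of cwbtools: map tokens to 0-based integer ids.
encode_tokens_sketch <- function(token_stream) {
  lexicon <- unique(token_stream)            # types in order of first occurrence
  ids <- match(token_stream, lexicon) - 1L   # 0-based id for every corpus position
  list(lexicon = lexicon, ids = ids)
}

tokens <- c("the", "oil", "price", "of", "the", "oil")
enc <- encode_tokens_sketch(tokens)
enc$lexicon  # "the" "oil" "price" "of"
enc$ids      # 0 1 2 3 0 1

# Write the id stream as 4-byte big-endian integers (assumed on-disk layout).
corpus_file <- tempfile(fileext = ".corpus")
writeBin(enc$ids, con = corpus_file, size = 4L, endian = "big")
file.size(corpus_file)  # 24 bytes: 6 tokens * 4 bytes
```

Reading the file back with readBin(corpus_file, what = "integer", n = 6L, size = 4L, endian = "big") should return the same ids, which is a quick way to check the round trip.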

The implementation for the first two steps (p_attribute_encode and p_attribute_makeall) is a pure R implementation (so far). These two steps are enough to use the CQP functionality. To run p_attribute_huffcode and p_attribute_compress_rdx, an installation of the CWB may be necessary.
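
The reverse index built in the second step can be thought of as a lookup from each token id to the corpus positions where it occurs. The base R sketch below illustrates the idea only. It does not reproduce the files written by p_attribute_makeall, and all variable names are hypothetical.

```r
tokens <- c("the", "oil", "price", "of", "the", "oil")
lexicon <- unique(tokens)
ids <- match(tokens, lexicon) - 1L   # 0-based token ids
cpos <- seq_along(ids) - 1L          # 0-based corpus positions

# Reverse index: for every id, the sorted corpus positions where it occurs.
rev_index <- split(cpos, ids)
names(rev_index) <- lexicon[as.integer(names(rev_index)) + 1L]

rev_index[["oil"]]   # 1 5
lengths(rev_index)   # frequency of each type
```

A lookup of this kind is what lets a query engine jump from a word form to all of its occurrences without scanning the whole token stream.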

See the CQP Corpus Encoding Tutorial (http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf) for an explanation of the procedure (section 3, "Indexing and compression without CWB/Perl").

p_attribute_recode will recode the values in the avs-file and change the attribute value index in the avx file. The rng-file remains unchanged. The registry file also remains unchanged, so it is highly recommended to treat p_attribute_recode as a helper for corpus_recode, which recodes all s-attributes and all p-attributes and resets the encoding in the registry file.
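
Conceptually, recoding amounts to converting the stored strings from one encoding to another. A minimal base R sketch of that conversion on an in-memory character vector, using iconv, might look as follows. It does not touch any corpus files and is not the code p_attribute_recode runs; the variable names are hypothetical.

```r
# A latin1-encoded string given as raw bytes: "Gr", u-umlaut, sharp s, "e".
latin1_bytes <- as.raw(c(0x47, 0x72, 0xfc, 0xdf, 0x65))
value_latin1 <- rawToChar(latin1_bytes)
Encoding(value_latin1) <- "latin1"

# Convert to UTF-8, mirroring from = "latin1", to = "UTF-8".
value_utf8 <- iconv(value_latin1, from = "latin1", to = "UTF-8")

charToRaw(value_utf8)              # 47 72 c3 bc c3 9f 65
nchar(value_utf8, type = "bytes")  # 7
```

The two non-ASCII characters grow from one byte to two, which is why any byte offsets stored alongside the strings have to be rebuilt after recoding.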

Examples

# NOT RUN {
library(RcppCWB)

# In this example, we pursue a "pure R" approach. To rely on the "CWB"
# method, you can use the cwb_install() function, which will download and
# install the CWB command line tools within the package.

tokens <- readLines(system.file(package = "RcppCWB", "extdata", "examples", "reuters.txt"))

# Create new (and empty) directory structure

tmpdir <- normalizePath(tempdir(), winslash = "/")
registry_tmp <- file.path(tmpdir, "registry", fsep = "/")
data_dir_tmp <- file.path(tmpdir, "data_dir", "reuters", fsep = "/")
if (file.exists(file.path(data_dir_tmp, "word.corpus"))){
  file.remove(file.path(data_dir_tmp, "word.corpus"))
}
if (dir.exists(registry_tmp)) unlink(registry_tmp, recursive = TRUE)
if (dir.exists(data_dir_tmp)) unlink(data_dir_tmp, recursive = TRUE)
dir.create(registry_tmp)
dir.create(data_dir_tmp, recursive = TRUE)

# Now encode token stream

p_attribute_encode(
  corpus = "reuters",
  token_stream = tokens, p_attribute = "word",
  data_dir = data_dir_tmp, method = "R",
  registry_dir = registry_tmp,
  compress = FALSE,
  encoding = "utf8"
  )

# Create minimal registry file

regdata <- registry_data(
  id = "REUTERS", name = "Reuters Sample Corpus", home = data_dir_tmp,
  properties = c(encoding = "utf-8", language = "en"), p_attributes = "word"
)

regfile <- registry_file_write(
  data = regdata, corpus = "REUTERS",
  registry_dir = registry_tmp, data_dir = data_dir_tmp
)

# Reload corpus and run query as a test

if (cqp_is_initialized()) cqp_reset_registry(registry_tmp) else cqp_initialize(registry_tmp)

cqp_query(corpus = "REUTERS", query = '[]{3} "oil" []{3};')
regions <- cqp_dump_subcorpus(corpus = "REUTERS")
kwic <- apply(
  regions, 1,
  function(region){
    ids <- cl_cpos2id("REUTERS", "word", registry_tmp, cpos = region[1]:region[2])
    words <- cl_id2str(corpus = "REUTERS", p_attribute = "word", registry = registry_tmp, id = ids)
    paste0(words, collapse = " ")
  }
)
kwic[1:10]
# }
