mclm

The goal of mclm is to gather functions that support quantitative corpus linguistics. It provides classes for corpus files, frequency lists, association-score data frames and concordances, along with functions to create and manipulate these objects, read them from files and write them to files.

The package is a companion to the Methods in Corpus Linguistics course of the Advanced Master in Linguistics (KU Leuven), but it can also be used on its own for basic corpus-linguistic analyses. In particular, it offers a number of learnr tutorials on how to perform basic tasks with mclm and how to filter objects with Perl-flavor regular expressions.

Installation

You can install the development version of mclm from GitHub with:

remotes::install_github("masterclm/mclm")

Examples

Below are a few basic uses of mclm.

The freqlist() function generates a frequency list either from corpus text supplied directly (with as_text = TRUE) or from corpus files.

library(mclm)
#> Loading required package: ca
#> Loading required package: tibble
#> 
#> Attaching package: 'mclm'
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."

flist <- freqlist(toy_corpus, as_text = TRUE)
print(flist, n = 5)
#> Frequency list (types in list: 19, tokens in list: 21)
#> rank      type abs_freq nrm_freq
#> ---- --------- -------- --------
#>    1         a        2  952.381
#>    2        it        2  952.381
#>    3     after        1  476.190
#>    4       and        1  476.190
#>    5 consisted        1  476.190
#> ...
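
To make the columns of this output concrete, here is a minimal base-R sketch that rebuilds the same counts by hand. It is not how freqlist() is implemented; the tokenization rule (lowercase, runs of letters) and the per-10,000-token normalization are assumptions chosen because they are consistent with the numbers printed above. The names toks and manual_flist are illustrative only.

```r
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."

# Assumption: lowercase the text and treat every run of letters as a token.
lower <- tolower(toy_corpus)
toks <- regmatches(lower, gregexpr("[a-z]+", lower))[[1]]

# Absolute frequency per type.
counts <- table(toks)
manual_flist <- data.frame(
  type = names(counts),
  abs_freq = as.vector(counts),
  stringsAsFactors = FALSE
)

# Sort by descending frequency, then alphabetically by type.
manual_flist <- manual_flist[order(-manual_flist$abs_freq, manual_flist$type), ]
rownames(manual_flist) <- NULL

# Assumption: normalized frequency is occurrences per 10,000 tokens.
manual_flist$nrm_freq <- round(manual_flist$abs_freq / length(toks) * 10000, 3)

head(manual_flist, 5)
```

With 21 tokens in total, a type that occurs twice gets 2 / 21 * 10000, which rounds to 952.381, and a type that occurs once gets 476.19.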

The get_fnames() function builds a collection of filenames from the contents of a directory; the result can be passed to the functions that process corpora. surf_cooc(), for example, computes the surface co-occurrences of an item, such as the type “government”, in a given corpus. These co-occurrences can then be passed to assoc_scores() to compute the strength of association between the node (here “government”) and its collocates in the corpus.

corpus_files <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
length(corpus_files)
#> [1] 4

surf <- surf_cooc(corpus_files, "government", w_left = 5, w_right = 5)
assoc_scores(surf)
#> Association scores (types in list: 77)
#>      type   a    PMI G_signed|   b    c     d dir   exp_a DP_rows
#>  1    the 230  0.578   39.554|1321 2152 20276   1 154.072   0.052
#>  2     of 136  0.403   11.259|1415 1454 20974   1 102.844   0.023
#>  3     to  57  0.286    2.323|1494  666 21762   1  46.765   0.007
#>  4     by  39  1.017   17.223|1512  259 22169   1  19.275   0.014
#>  5     in  37  0.038    0.028|1514  520 21908   1  36.028   0.001
#>  6   this  37  1.811   45.360|1514  126 22302   1  10.543   0.018
#>  7    and  36 -0.634   -8.873|1515  828 21600  -1  55.885  -0.014
#>  8      a  28  0.207    0.600|1523  347 22081   1  24.256   0.003
#>  9    has  18  1.238   11.232|1533  100 22328   1   7.632   0.007
#> 10     be  15 -0.332   -0.927|1536  277 22151  -1  18.887  -0.003
#> 11   that  15 -0.067   -0.036|1536  228 22200  -1  15.718   0.000
#> 12    for  14 -0.185   -0.258|1537  232 22196  -1  15.912  -0.001
#> 13   with  14  0.136    0.130|1537  183 22245   1  12.742   0.001
#> 14  their  13  0.112    0.082|1538  173 22255   1  12.031   0.001
#> 15  which  10 -0.120   -0.076|1541  158 22270  -1  10.867  -0.001
#> 16     as   9 -0.128   -0.078|1542  143 22285  -1   9.832  -0.001
#> 17   made   9  1.393    6.903|1542   44 22384   1   3.428   0.004
#> 18    our   9 -0.297   -0.440|1542  162 22266  -1  11.061  -0.001
#> 19 states   9  0.491    1.012|1542   90 22338   1   6.403   0.002
#> 20   been   8  0.169    0.114|1543  102 22326   1   7.115   0.001
#> ...
#> <number of extra columns to the right: 7>
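
The scores in this table can be checked by hand. The sketch below takes the counts from the first row ("the") and reads a, b, c and d as the cells of a 2×2 contingency table; that reading, and the textbook formulas for expected frequency, PMI, delta P and signed G, are assumptions here rather than a description of the package's internals, which may differ in details such as the handling of zero cells.

```r
# Cell counts copied from the row for "the" in the output above.
obs <- c(a = 230, b = 1321, c = 2152, d = 20276)

n <- sum(obs)
row1 <- obs[["a"]] + obs[["b"]]  # all tokens in the context of the node
row2 <- obs[["c"]] + obs[["d"]]  # all tokens elsewhere
col1 <- obs[["a"]] + obs[["c"]]  # all occurrences of the collocate
col2 <- obs[["b"]] + obs[["d"]]  # all occurrences of other types

# Expected cell counts under independence.
expected <- c(
  a = row1 * col1 / n,
  b = row1 * col2 / n,
  c = row2 * col1 / n,
  d = row2 * col2 / n
)

# Pointwise mutual information (base 2) and delta P over rows.
pmi <- log2(obs[["a"]] / expected[["a"]])
dp_rows <- obs[["a"]] / row1 - obs[["c"]] / row2

# Log-likelihood ratio G, signed by whether a is above or below expectation.
direction <- if (obs[["a"]] >= expected[["a"]]) 1 else -1
g_signed <- direction * 2 * sum(obs * log(obs / expected))

scores <- round(
  c(exp_a = expected[["a"]], PMI = pmi, G_signed = g_signed, DP_rows = dp_rows),
  3
)
scores
```

The rounded values should agree with the exp_a, PMI, G_signed and DP_rows columns of that row. A negative dir, as in the row for "and", corresponds to an observed a that is lower than exp_a.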

The function conc() finds occurrences of a regular expression in a corpus and generates a concordance.

conc(corpus_files, "govern")
#> Concordance-based data frame (number of observations: 29)
#> idx                             left|match |right                           
#>   1 ...heir power and right of self-|govern|ment they have committed to o...
#>   2 ... the strength and safety of a|govern|ment by the people. In each s...
#>   3 ...d the surest guaranty of good|govern|ment. But the best results in...
#>   4 ...results in the operation of a|govern|ment wherein every citizen ha...
#>   5 ...efits which our happy form of|govern|ment can bestow. On this ausp...
#>   6 ...ation of a republican form of|govern|ment and most compatible with...
#>   7 ...f. In the administration of a|govern|ment pledged to do equal and ...
#>   8 ... benefits of the best form of|govern|ment ever vouchsafed to man. ...
#>   9 ...hina. The admitted right of a|govern|ment to prevent the influx of...
#>  10 ...asure of that sovereign self-|govern|ment pertaining to the States...
#>  11 ...his land of freedom, of self-|govern|ment, and of laws, here peace...
#>  12 ... of successful constitutional|govern|ment, maintenance of good fai...
#>  13 ...ulty pending with any foreign|govern|ment. The Argentine Governmen...
#>  14 ...itation in favor of a foreign|govern|ment upon the right of select...
#>  15 ... several States into a single|govern|ment. In these contests betwe...
#>  16 ... and complications of distant|govern|ments. Therefore I am unable ...
#>  17 ...hina. The admitted right of a|govern|ment to prevent the influx of...
#>  18 ...Kongo has been organized as a|govern|ment under the sovereignty of...
#>  19 ...he plenipotentiaries of other|govern|ments, thus making the United...
#>  20 ...purpose toward their original|govern|ments. These evils have had m...
#>  21 ...the safety and welfare of any|govern|ment. Emergency calling for a...
#>  22 ...es at legations. Some foreign|govern|ments do not recognize the un...
#>  23 ...he President shall invite the|govern|ments of the countries compos...
#>  24 ... attitude and intent of those|govern|ments in respect of the estab...
#>  25 ...ioned that the views of these|govern|ments are in each instance su...
#>  26 ...to the fixed rules which must|govern|the Army, I am inclined to ag...
#>  27 ...ected by a republican form of|govern|ment, to which they owe alleg...
#>  28 ...nd the people who desire good|govern|ment, having secured this sta...
#>  29 ...g for the use of the District|govern|ment which shall better secur...
#> 
#> This data frame has 6 columns:
#>    column
#> 1 glob_id
#> 2      id
#> 3  source
#> 4    left
#> 5   match
#> 6   right

Package details

Version: 0.2.7
License: GPL-2
Maintainer: Mariana Montes
Last published: October 3rd, 2022
Monthly downloads: 231

Install the released version with:

install.packages('mclm')

Functions in mclm (0.2.7)

assoc_scores: Association scores used in collocation analysis and keyword analysis
as_tokens: Coerce object to class tokens
as_data_frame: Coerce object to a data frame
as_fnames: Coerce object to 'fnames'
as_character: Coerce object to character
brackets: Subset an object by different criteria
as_freqlist: Coerce table to a frequency list
as_conc: Coerce data frame to a concordance object
as_types: Coerce object to a vector of types
as_numeric: Coerce object to a numeric vector
drop_empty_rc: Drop empty rows and columns from a matrix
chisq1_to_p: Proportion of chi-squared distribution with one degree of freedom that sits to the right of x
create_cooc: Build collocation frequencies
cleanup_spaces: Clean up the use of whitespace in a character vector
explore: Interactively navigate through an object
cat_re: Print a regular expression to the console
ca_help: Helpers for plotting ca objects
conc: Build a concordance for the matches of a regex
drop_tags: Drop XML tags from character string
details: Details on a specific item
keep_fnames: Filter collection of filenames by name
fnames: Retrieve the names of files in a given path
find_xpath: Run XPath query
keep_pos: Subset an object by index
keep_re: Subset an object based on regular expressions
keep_types: Subset an object based on a selection of types
freqlist: Build the frequency list of a corpus
keep_bool: Subset an object based on logical criteria
import_conc: Import a concordance
merge_types: Merge 'types' objects
merge_tokens: Merge tokens objects
freqlist_diff: Subtract frequency lists
merge_conc: Merge concordances
mclm_xml_text: Get text from xml node
n_types: Count types
n_fnames: Count number of items in an 'fnames' object
n_tokens: Count tokens
orig_ranks: Retrieve or set original ranks
read_assoc: Read association scores from file
read_conc: Read a concordance from a file
merge_freqlist: Merge frequency lists
merge_fnames: Merge filenames collections
p_to_chisq1: P right quantile in chi-squared distribution with 1 degree of freedom
perl_flavor: Retrieve or set the flavor of a regular expression
re_convenience: Convenience functions in support of regular expressions
read_tokens: Read a tokens object from a text file
re: Build a regular expression
read_fnames: Read a collection of filenames from a text file
read_freqlist: Read a frequency list from a csv file
ranks: Retrieve the current ranks for frequency counts
print_kwic: Print a concordance in KWIC format
slma: Stable lexical marker analysis
read_txt: Read a text file into a character vector
mclm-package: Mastering Corpus Linguistic Methods
scan_txt: Scan a character string from console
sort.assoc_scores: Sort an 'assoc_scores' object
short_names: Shorten filenames
print.assoc_scores: Print an object
read_types: Read a vector of types from a text file
write_assoc: Write association scores to file
scan_re: Scan a regular expression from console
tot_n_tokens: Retrieve or set the total number of tokens
type_names: Return the names of the types in an object
types: Build a 'types' object
type_freqs: Retrieve frequencies from 'freqlist' object
trunc_at: Truncate a sequence of character data
write_tokens: Write a tokens object to a text file
write_freqlist: Write a frequency list to a csv file
sort.freqlist: Sort a frequency list
zero_plus: Make all values strictly higher than zero
write_fnames: Write a collection of filenames to a text file
write_txt: Write a character vector to a text file
write_types: Write a vector of types to a text file
tokens: Create or coerce an object into class tokens
write_conc: Write a concordance to file