mscsweblm4r (version 0.1.2)

weblmBreakIntoWords: Breaks a string of concatenated words into individual words

Description

This function inserts spaces into a string of words lacking spaces, like a hashtag or part of a URL. Punctuation or exotic characters can prevent a string from being broken, so it's best to limit input strings to lower-case, alpha-numeric characters. The input string must be in ASCII format.
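
The input advice above can be applied before the call. The following is a minimal sketch, not part of mscsweblm4r: cleanForWordBreak() is a hypothetical helper name, and dropping disallowed characters (rather than replacing them with spaces) is an assumption made so that no extra hard breaks are introduced.

```r
# Hypothetical helper (not part of mscsweblm4r): reduce a string to
# lower-case ASCII letters, digits, and spaces before word breaking.
cleanForWordBreak <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  x <- tolower(x)
  # perl = TRUE keeps the [a-z] range restricted to ASCII letters
  x <- gsub("[^a-z0-9 ]+", "", x, perl = TRUE)
  # Leading and trailing spaces would be trimmed anyway, so drop them here
  trimws(x)
}

cleanForWordBreak("#TestForWordBreak!")
#> [1] "testforwordbreak"
```

The cleaned string can then be passed as textToBreak. Existing spaces are kept on purpose, since the function treats them as hard breaks.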

Internally, this function invokes the Microsoft Cognitive Services Web Language Model REST API documented at https://www.microsoft.com/cognitive-services/en-us/web-language-model-api/documentation.

You MUST have a valid Microsoft Cognitive Services account and an API key for this function to work properly. See https://www.microsoft.com/cognitive-services/en-us/pricing for details.

Usage

weblmBreakIntoWords(textToBreak, modelToUse = "body", orderOfNgram = 5L, maxNumOfCandidatesReturned = 5L)

Arguments

textToBreak
(character) Line of text to break into words. If spaces are present, they will be interpreted as hard breaks and maintained, except for leading or trailing spaces, which will be trimmed. Must be in ASCII format.
modelToUse
(character) Which language model to use, supported values: "title", "anchor", "query", or "body" (optional, default: "body")
orderOfNgram
(integer) Which order of N-gram to use, supported values: 1L, 2L, 3L, 4L, or 5L (optional, default: 5L)
maxNumOfCandidatesReturned
(integer) Maximum number of candidates to return (optional, default: 5L)
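
Because every call is sent to a remote API, it can help to check the arguments locally first. The sketch below is an illustration only: checkBreakArgs() is a hypothetical helper, not a function of this package, and the lower bound of 1 for maxNumOfCandidatesReturned is an assumption that the documentation above does not state. The allowed values for modelToUse and orderOfNgram are copied from the argument descriptions.

```r
# Hypothetical pre-flight check (not part of mscsweblm4r).
checkBreakArgs <- function(textToBreak,
                           modelToUse = "body",
                           orderOfNgram = 5L,
                           maxNumOfCandidatesReturned = 5L) {
  if (!is.character(textToBreak) || length(textToBreak) != 1L ||
      is.na(textToBreak)) {
    stop("textToBreak must be a single character string")
  }
  # Any code point above 127 is outside the ASCII range
  if (any(utf8ToInt(enc2utf8(textToBreak)) > 127L)) {
    stop("textToBreak must contain ASCII characters only")
  }
  if (!isTRUE(modelToUse %in% c("title", "anchor", "query", "body"))) {
    stop("modelToUse must be \"title\", \"anchor\", \"query\", or \"body\"")
  }
  if (!isTRUE(orderOfNgram %in% 1:5)) {
    stop("orderOfNgram must be 1L, 2L, 3L, 4L, or 5L")
  }
  # Assumption: at least one candidate must be requested
  if (!is.numeric(maxNumOfCandidatesReturned) ||
      length(maxNumOfCandidatesReturned) != 1L ||
      maxNumOfCandidatesReturned < 1) {
    stop("maxNumOfCandidatesReturned must be a positive integer")
  }
  invisible(TRUE)
}

# Passes silently; an invalid value would stop with a message instead
checkBreakArgs("testforwordbreak", modelToUse = "query", orderOfNgram = 3L)
```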

Value

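An S3 object of class weblm. The results are stored in the results data frame inside this object, with one row per candidate breakdown: the words column holds the candidate and the probability column holds its log-probability. The object also carries the raw JSON response (json) and the request that was sent (request).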
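
As a rough sketch of how the returned data frame might be used, the code below picks the most likely candidate and splits it into words. The data frame is a stand-in built from the sample output in the Examples section, so it runs without an API key. With a real result, textWords$results would take its place.

```r
# Stand-in for textWords$results, copied from the sample output in the
# Examples section (column names: words, probability).
results <- data.frame(
  words = c("test for word break", "test for wordbreak",
            "testfor word break", "test forword break",
            "testfor wordbreak"),
  probability = c(-13.83, -14.63, -15.94, -16.72, -17.41),
  stringsAsFactors = FALSE
)

# Log-probabilities are negative; the one closest to zero is the most likely
best <- results$words[which.max(results$probability)]
best
#> [1] "test for word break"

# Split the winning candidate into individual words
strsplit(best, " ", fixed = TRUE)[[1]]
#> [1] "test"  "for"   "word"  "break"
```

Using which.max() rather than taking the first row avoids relying on the order in which candidates are returned.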

Examples

## Not run: 
#  tryCatch({
# 
#    # Break a sentence into words
#    textWords <- weblmBreakIntoWords(
#      textToBreak = "testforwordbreak", # ASCII only
#      modelToUse = "body",              # "title"|"anchor"|"query"|"body"(default)
#      orderOfNgram = 5L,                # 1L|2L|3L|4L|5L(default)
#      maxNumOfCandidatesReturned = 5L   # Default: 5L
#    )
# 
#    # Class and structure of textWords
#    class(textWords)
#    #> [1] "weblm"
# 
#    str(textWords, max.level = 1)
#    #> List of 3
#    #>  $ results:'data.frame':  5 obs. of  2 variables:
#    #>  $ json   : chr "{"candidates":[{"words":"test for word break", __truncated__ }]}
#    #>  $ request:List of 7
#    #>   ..- attr(*, "class")= chr "request"
#    #>  - attr(*, "class")= chr "weblm"
# 
#    # Print results (pandoc.table() comes from the pander package)
#    pander::pandoc.table(textWords$results)
#    #> ---------------------------------
#    #>       words          probability
#    #> ------------------- -------------
#    #> test for word break    -13.83
#    #>
#    #>  test for wordbreak    -14.63
#    #>
#    #>  testfor word break    -15.94
#    #>
#    #>  test forword break    -16.72
#    #>
#    #>   testfor wordbreak    -17.41
#    #> ---------------------------------
# 
#  }, error = function(err) {
# 
#    # Print the message of the error that was caught
#    message(conditionMessage(err))
# 
#  })
# ## End(Not run)
