qdapRegex (version 0.7.8)

rm_url: Remove/Replace/Extract URLs

Description

rm_url - Remove/replace/extract URLs from a string.

rm_twitter_url - Remove/replace/extract Twitter Short URLs from a string.

Usage

rm_url(
  text.var,
  trim = !extract,
  clean = TRUE,
  pattern = "@rm_url",
  replacement = "",
  extract = FALSE,
  dictionary = getOption("regex.library"),
  ...
)

rm_twitter_url(
  text.var,
  trim = !extract,
  clean = TRUE,
  pattern = "@rm_twitter_url",
  replacement = "",
  extract = FALSE,
  dictionary = getOption("regex.library"),
  ...
)

ex_url(
  text.var,
  trim = !extract,
  clean = TRUE,
  pattern = "@rm_url",
  replacement = "",
  extract = TRUE,
  dictionary = getOption("regex.library"),
  ...
)

ex_twitter_url(
  text.var,
  trim = !extract,
  clean = TRUE,
  pattern = "@rm_twitter_url",
  replacement = "",
  extract = TRUE,
  dictionary = getOption("regex.library"),
  ...
)

Value

Returns a character string with URLs removed or replaced. If extract = TRUE, the URLs are instead extracted into a list of vectors.
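
The two return shapes can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: it reuses the liberal default pattern quoted in Details, and the variable names are illustrative. Base R returns character(0) for an element without matches; qdapRegex may represent that case differently.

```r
# Illustrative pattern only (the liberal default quoted in Details)
url_pattern <- "(http[^ ]*)|(www\\.[^ ]*)"

x <- c("see http://example.com and www.example.org", "no links here")

# Removal keeps one string per input element
removed <- gsub(url_pattern, "", x, perl = TRUE)

# Extraction yields a list with one character vector per input element
extracted <- regmatches(x, gregexpr(url_pattern, x, perl = TRUE))

str(removed)
str(extracted)
```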

Arguments

text.var

The text variable.

trim

logical. If TRUE, removes leading and trailing white space.

clean

logical. If TRUE, extra white spaces and escaped characters will be removed.

pattern

A character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. The default, "@rm_url", uses the rm_url regex from the regular expression dictionary supplied via the dictionary argument.

replacement

Replacement for matched pattern.

extract

logical. If TRUE the URLs are extracted into a list of vectors.

dictionary

A dictionary of canned regular expressions to search within if pattern begins with "@rm_".

...

Other arguments passed to gsub.
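
The interplay of pattern and dictionary described above can be pictured as a named lookup. The sketch below is an assumption about the general idea, not qdapRegex's actual code: toy_dictionary and resolve_pattern() are hypothetical names, and only the rm_url entry comes from this page.

```r
# Hypothetical stand-in for the regex dictionary; only the rm_url entry is
# taken from this page (the default quoted in Details)
toy_dictionary <- list(rm_url = "(http[^ ]*)|(www\\.[^ ]*)")

# Hypothetical helper: resolve "@rm_..." against the dictionary, otherwise
# treat the pattern as an ordinary regular expression
resolve_pattern <- function(pattern, dictionary) {
  if (startsWith(pattern, "@rm_")) {
    key <- substring(pattern, 2)  # drop the leading "@"
    if (is.null(dictionary[[key]])) stop("no dictionary entry named ", key)
    return(dictionary[[key]])
  }
  pattern
}

resolved <- resolve_pattern("@rm_url", toy_dictionary)
literal  <- resolve_pattern("https?://\\S+", toy_dictionary)

# The resolved regex is then handed to gsub, as the ... argument suggests
cleaned <- gsub(resolved, "", "visit www.example.org today", perl = TRUE)
```

Note that cleaned still contains a double space where the URL was; collapsing it is the kind of cleanup the trim and clean arguments describe.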

Details

The default regex pattern "(http[^ ]*)|(www\.[^ ]*)" is liberal: it matches any run of non-space characters that starts with "http" or "www.". More constrained versions can be accessed via pattern = "@rm_url2" and pattern = "@rm_url3" (see Examples).
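
What "liberal" means in practice can be checked with base R. This is a minimal sketch that applies the default pattern directly with regmatches() rather than through qdapRegex; the sample strings are made up for illustration.

```r
# Default pattern from Details, written as an R string
url_pattern <- "(http[^ ]*)|(www\\.[^ ]*)"

x <- c(
  "see http://example.com.",            # trailing period is swallowed
  "a (www.example.org) link",           # closing parenthesis is swallowed
  "Another url ftp://www.example.com",  # only the part from "www." matches
  "an httpd server"                     # any word starting with "http" matches
)

# One character vector of matches per input string, flattened for display
matches <- unlist(regmatches(x, gregexpr(url_pattern, x, perl = TRUE)))
print(matches)
```

Cases like these are where the more constrained "@rm_url2" and "@rm_url3" patterns may be the better choice.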

References

The more constrained URL regular expressions ("@rm_url2" and "@rm_url3") were adapted from imme_emosol's response: https://mathiasbynens.be/demo/url-regex

See Also

gsub, stri_extract_all_regex

Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_caps(), rm_citation_tex(), rm_citation(), rm_city_state_zip(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_white(), rm_zip()

Examples

x <- " I like www.talkstats.com and http://stackoverflow.com"
rm_url(x)
rm_url(x, replacement = '\\1')
ex_url(x)

ex_url(x, pattern = "@rm_url2")
ex_url(x, pattern = "@rm_url3")

## Remove Twitter Short URL
x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com for more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_twitter_url(x)
ex_twitter_url(x)

## Combine removing Twitter URLs and standard URLs
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
