LIHKGr
The goal of LIHKGr is to scrape text data on the LIHKG, the Hong Kong version of Reddit, for analysis. LIHKG has gained popularity in 2016 and become a popular research data source during recent years. LIHKG is currently protected by Google's reCAPTCHA, this package currently builds on RSelenium and adopts a semi-manual approach to bypass it.
Installation
devtools::install_github("justinchuntingho/LIHKGr")Instructions
lihkgr.R contains all the required functions. Please install the following packages: RSelenium, raster, magrittr rvest, and purrr. Follow the following workflow:
Step 1: Create a scraper
For RSelenium to work, you need to specify the browser. If you are using Chrome, you need to also specify the version. For example,create_lihkg(browser = "chrome", chromever = "83.0.4103.39"). If a version is not supplied, by default it will run the most recent version. To see Chrome version currently sourced run binman::list_versions("chromedriver").
## Creating a Firefox instance with a random port.
lihkg <- create_lihkg(browser = "firefox", port = sample(10000:60000, 1), verbose = FALSE)Step 2: Scrape
# It can accept a single post id
lihkg$scrape(2091171)
# Or a vector
lihkg$scrape(1610753:1610755)
# Another way to do it
postids <- c(1610753, 2091171)
lihkg$scrape(postids)Step 2.1: If any post id cannot be scraped, retry
lihkg$retry()Step 3: Get / Save the data
To obtain the dataframe:
lihkg$bagTo save as .RDS:
lihkg$save("lihkg.RDS")If you don't want to save the data as RDS, you can just save the bag as any format you like. It is just a regular data frame / tibble:
rio::export(lihkg$bag, "lihkg.xlsx")Step 4: Destroy the scraper
lihkg$finalize()Contributors
- Justin Chun-ting Ho
- Nick H. K. Or
- Chung-hong Chan
- Elgar Teo
Citation
Ho, J.C. & Or, N.H.K. (2020). LIHKGr. An application for scraping LIHKG. Source code and releases available at https://github.com/justinchuntingho/LIHKGr.