textpress

A lightweight, versatile NLP package for R, focused on search-centric workflows with minimal dependencies and easy data-frame integration. This package provides key functionalities for:

  • Web Search: Perform search engine queries to retrieve relevant URLs.

  • Web Scraping: Extract URL content, including some relevant metadata.

  • Text Processing & Chunking: Segment text into meaningful units, e.g., sentences, paragraphs, and larger chunks. Designed to support tasks related to retrieval-augmented generation (RAG).

  • Corpus Search: Perform keyword, phrase, and pattern-based searches across processed corpora, supporting both traditional in-context search techniques (e.g., KWIC, regex matching) and advanced semantic searches using embeddings.

  • Embedding Generation: Generate embeddings using the HuggingFace API for enhanced semantic search.

Ideal for users who need a basic, unobtrusive NLP toolkit in R.

Installation

devtools::install_github("jaytimm/textpress")

Usage

Web search

library(dplyr)  # provides select(), sample_n(), mutate(), rename(), left_join()

sterm <- 'AI and education'

yresults <- textpress::web_search(search_term = sterm, 
                                  search_engine = "Yahoo News", 
                                  num_pages = 5)

yresults |> select(2) |>  sample_n(5) |> knitr::kable()
raw_url
https://gulfbusiness.com/ai-is-transforming-healthcare-heres-how/
https://campustechnology.com/Articles/2023/04/06/What-the-Past-Can-Teach-Us-About-the-Future-of-AI-and-Education.aspx
https://natlawreview.com/article/decoding-californias-recent-flurry-ai-laws
https://www.zdnet.com/article/pearson-launches-new-ai-certification-with-focus-on-practical-use-in-the-workplace/
https://campustechnology.com/Articles/2024/02/21/Creating-Guidelines-for-the-Use-of-Gen-AI-Across-Campus.aspx
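
Before scraping, it can help to see which sites the search results come from. The following is a minimal base-R sketch, not part of textpress; it assumes only that the results are available as a character vector of URLs (here, the five sample URLs shown above), and the variable names are illustrative.

```r
# URLs copied from the sample output above
urls <- c(
  "https://gulfbusiness.com/ai-is-transforming-healthcare-heres-how/",
  "https://campustechnology.com/Articles/2023/04/06/What-the-Past-Can-Teach-Us-About-the-Future-of-AI-and-Education.aspx",
  "https://natlawreview.com/article/decoding-californias-recent-flurry-ai-laws",
  "https://www.zdnet.com/article/pearson-launches-new-ai-certification-with-focus-on-practical-use-in-the-workplace/",
  "https://campustechnology.com/Articles/2024/02/21/Creating-Guidelines-for-the-Use-of-Gen-AI-Across-Campus.aspx"
)

# Drop the scheme and an optional "www." prefix, then everything after the host
domains <- sub("^https?://(www\\.)?", "", urls)
domains <- sub("/.*$", "", domains)

# Count results per domain, most frequent first
domain_counts <- sort(table(domains), decreasing = TRUE)
print(domain_counts)
```

With real results, the same two sub() calls could be applied to yresults$raw_url.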

Web Scraping

arts <- yresults$raw_url |> 
  textpress::web_scrape_urls(cores = 4)

Text Processing & Chunking

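The three functions are applied in sequence, each working on the output of the previous one: nlp_split_paragraphs(), then nlp_split_sentences(), then nlp_build_chunks().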

articles <- arts |>  
  mutate(doc_id = row_number())|>
  
  textpress::nlp_split_paragraphs(paragraph_delim = "\\n+") |>
  textpress::nlp_split_sentences(text_hierarchy = c('doc_id', 
                                                    'paragraph_id')) |>
  
  textpress::nlp_build_chunks(text_hierarchy = c('doc_id', 
                                                 'paragraph_id', 
                                                 'sentence_id'),
                              chunk_size = 1,
                              context_size = 1) |>
  
  mutate(id = paste(doc_id, paragraph_id, chunk_id, sep = '.'))
id | chunk | chunk_plus_context
--- | --- | ---
1.1.1 | ‘TO AI OR NOT TO AI?’ | ‘TO AI OR NOT TO AI?’ This is one of the most pressing questions that today’s educators and higher education leaders face.
1.1.2 | This is one of the most pressing questions that today’s educators and higher education leaders face. | ‘TO AI OR NOT TO AI?’ This is one of the most pressing questions that today’s educators and higher education leaders face. While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom.
1.1.3 | While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom. | This is one of the most pressing questions that today’s educators and higher education leaders face. While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom. In a study conducted by global educational technology or edtech leader Anthology, 30% or three in every 10 university leaders in the Philippines see generative AI as unethical and should be banned from being used in educational settings.
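
The sample output suggests that, with chunk_size = 1 and context_size = 1, each chunk is one sentence and chunk_plus_context adds the neighbouring sentence on either side. The base-R sketch below reproduces that windowing on toy sentences. It is an illustration only, not the package's implementation: build_chunks_sketch() is a hypothetical helper, and it assumes that context is counted in sentences within a single paragraph.

```r
# Hypothetical helper: group sentences into chunks and attach surrounding context.
# Assumption: context_size counts sentences before and after each chunk.
build_chunks_sketch <- function(sentences, chunk_size = 1, context_size = 1) {
  stopifnot(length(sentences) > 0, chunk_size >= 1, context_size >= 0)
  n <- length(sentences)
  starts <- seq(1, n, by = chunk_size)
  rows <- lapply(seq_along(starts), function(k) {
    first <- starts[k]
    last <- min(first + chunk_size - 1, n)
    # Clamp the context window to the available sentences
    ctx_first <- max(first - context_size, 1)
    ctx_last <- min(last + context_size, n)
    data.frame(
      chunk_id = k,
      chunk = paste(sentences[first:last], collapse = " "),
      chunk_plus_context = paste(sentences[ctx_first:ctx_last], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy sentences, invented for illustration
sents <- c(
  "AI is changing classrooms.",
  "Many educators remain skeptical.",
  "Some universities have banned it.",
  "Others are writing guidelines."
)

chunks <- build_chunks_sketch(sents, chunk_size = 1, context_size = 1)
print(chunks)
```

The first and last chunks get context on one side only, which matches the first row of the table above, where the context holds two sentences rather than three.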

KWIC Search

sterm2 <- c('\\bhigher education\\b',
            '\\bsecondary education\\b')
            # '\\S+ education\\b',
            # '\\b\\w{4,}\\b education\\b')

kwics <- articles |>
  rename(text = chunk) |>
  textpress::sem_search_corpus(search = sterm2,
                               text_hierarchy = c('doc_id', 
                                                  'paragraph_id', 
                                                  'chunk_id'))

kwics |> 
  mutate(id = paste(doc_id, 
                    paragraph_id, 
                    chunk_id, 
                    sep = '.')) |>
  select(id, pattern, text) |> 
  sample_n(5) |> knitr::kable()
id | pattern | text
--- | --- | ---
1.3.1 | higher education | The study conducted across 11 countries including the Philippines involved 5,000 higher education leaders and students.
1.8.1 | higher education | AI is a game-changer in higher education, bridging gaps in accessibility and quality.
1.2.3 | higher education | It revealed that university leaders have certain reservations around allowing AI in higher education, perceiving it as being unethical.
9.2.1 | Higher Education | In 1998, noted technology critic and historian of automation David Noble published his influential article “Digital Diploma Mills: The Automation of Higher Education,” in which he warned about the negative impacts the internet would have on education.
15.4.2 | secondary education | “It underscores the urgent need to address the looming AI knowledge gap in schools—for both students and teachers—to raise parental awareness and increase their involvement in AI conversations, and push for stronger AI integration in American primary and secondary education.”
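
The output above includes a "Higher Education" match, which suggests the search is case-insensitive. The base-R sketch below shows the same kind of regex lookup over a small data frame of chunks. It is only an approximation of what a pattern search does, not the internals of sem_search_corpus(); the toy data and all variable names are invented for illustration.

```r
# Toy chunks, invented for illustration
chunks <- data.frame(
  id = c("1.3.1", "9.2.1", "15.4.2", "2.1.1"),
  text = c(
    "The study involved 5,000 higher education leaders and students.",
    "He published an article on the automation of Higher Education.",
    "Push for stronger AI integration in primary and secondary education.",
    "AI is transforming healthcare."
  ),
  stringsAsFactors = FALSE
)

# Same patterns as sterm2 above, combined into one alternation
patterns <- c("\\bhigher education\\b", "\\bsecondary education\\b")
combined <- paste(patterns, collapse = "|")

# Locate every match in every chunk, ignoring case
m <- gregexpr(combined, chunks$text, ignore.case = TRUE, perl = TRUE)
found <- regmatches(chunks$text, m)

# One row per match; chunks without a match contribute zero rows
hits <- data.frame(
  id = rep(chunks$id, lengths(found)),
  pattern = unlist(found),
  text = rep(chunks$text, lengths(found)),
  stringsAsFactors = FALSE
)
print(hits)
```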

Semantic search

HuggingFace embeddings

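# api_token is assumed to already hold your Hugging Face API token; it is not defined in this example
api_url <- "https://api-inference.huggingface.co/models/BAAI/bge-base-en-v1.5"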

vstore <- articles |>
  rename(text = chunk) |>
  textpress::api_huggingface_embeddings(
    text_hierarchy = c('doc_id', 
                       'paragraph_id',
                       'chunk_id'),
    verbose = FALSE,
    api_url = api_url,
    dims = 768, #1024, 768, 384
    api_token = api_token)

Embed query

“How can AI personalize learning experiences for students?”

q <- "How can AI personalize learning experiences for students?"

query <- textpress::api_huggingface_embeddings(
  query = q,
  api_url = api_url,
  dims = 768,
  api_token = api_token)
rags <- textpress::sem_nearest_neighbors(
  x = query,
  matrix = vstore,
  n = 20) |>
  left_join(articles, by = c("term2" = "id"))

Relevant chunks

id | cos_sim | chunk_plus_context
--- | --- | ---
14.10.2 | 0.844 | 1. Personalized learning: AI can analyze data to understand each student’s learning style, strengths and areas for improvement. For example, an AI-driven platform could identify that a particular student struggles with reading comprehension and then provide tailored exercises that improve the student’s skills.
7.8.2 | 0.836 | There’s a better way. This is where AI-assisted learning steps in to create personalized lesson plans. In our schools, we’ve transformed the traditional teacher’s role into that of a “guide.”
22.2.2 | 0.835 | Artificial intelligence has permeated nearly every industry, and higher education is no exception. AI-powered solutions promise to revolutionize learning by providing personalized and adaptive experiences.
22.7.5 | 0.810 | Consider how new tools integrate with existing platforms and map to the entire learner lifecycle. AI should simplify, not complicate, the student experience. With thoughtful implementation, these intelligent technologies can personalize learning and improve outcomes from start to finish.
7.10.1 | 0.810 | AI is revolutionizing the role of teachers by excelling at delivering personalized learning experiences. These advanced AI programs can swiftly and accurately pinpoint what a student knows and doesn’t know in each subject, allowing lessons to be designed around their unique aptitudes without any judgment.
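
The cos_sim column is a cosine similarity between the query embedding and each chunk embedding. As a rough illustration of that ranking step, the base-R sketch below scores a query vector against a tiny hand-made embedding matrix. It is not the implementation of sem_nearest_neighbors(): cosine_top_n() is a hypothetical helper, and the three-dimensional vectors stand in for real 768-dimensional embeddings.

```r
# Hypothetical helper: rank rows of an embedding matrix by cosine similarity to a query
cosine_top_n <- function(query, mat, n = 2) {
  q <- query / sqrt(sum(query^2))   # unit-length query
  m <- mat / sqrt(rowSums(mat^2))   # unit-length rows (recycled row-wise)
  sims <- as.vector(m %*% q)        # dot products of unit vectors = cosine similarity
  names(sims) <- rownames(mat)
  top <- head(sort(sims, decreasing = TRUE), n)
  data.frame(
    id = names(top),
    cos_sim = round(unname(top), 3),
    stringsAsFactors = FALSE
  )
}

# Toy "vector store": one row per chunk id (values invented for illustration)
toy_store <- rbind(
  "1.1.1" = c(1.0, 0.0, 0.0),
  "1.1.2" = c(0.9, 0.1, 0.0),
  "2.1.1" = c(0.0, 1.0, 0.0)
)
toy_query <- c(1, 0, 0)

res <- cosine_top_n(toy_query, toy_store, n = 2)
print(res)
```

The resulting ids can then be joined back to the chunk table, as the left_join() call above does with term2 and id.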

Summary

  • Install: install.packages('textpress')

  • Monthly Downloads: 192

  • Version: 1.0.0

  • License: MIT + file LICENSE

  • Maintainer: Jason Timm

  • Last Published: October 14th, 2024

Functions in textpress (1.0.0)

  • nlp_split_sentences: Split Text into Sentences

  • .translate_query: Translate Search Query

  • .decode_duckduckgo_urls: Decode DuckDuckGo Redirect URLs

  • .extract_links: Extract links from a search engine result page

  • nlp_tokenize_text: Tokenize Text Data (mostly) Non-Destructively

  • web_search: Process search results from multiple search engines

  • sem_nearest_neighbors: Find Nearest Neighbors Based on Cosine Similarity

  • sem_search_corpus: NLP Search Corpus

  • standardize_date: Standardize Date Format

  • textpress-package: textpress: A Lightweight and Versatile NLP Toolkit

  • web_scrape_urls: Scrape News Data from Various Sources

  • .process_bing: Process Bing search results

  • api_huggingface_embeddings: Call Hugging Face API for Embeddings

  • .insert_highlight: Insert Highlight in Text

  • abbreviations: Common Abbreviations for Sentence Splitting

  • .process_duckduckgo: Process DuckDuckGo search results

  • .process_yahoo: Process Yahoo News search results

  • .get_site: Get Site Content and Extract HTML Elements

  • extract_date: Extract Date from HTML Content

  • nlp_build_chunks: Build Chunks for NLP Analysis

  • nlp_cast_tokens: Convert Token List to Data Frame

  • nlp_melt_tokens: Tokenize Data Frame by Specified Column(s)

  • nlp_split_paragraphs: Split Text into Paragraphs