Fetches each URL and returns a structured data frame with one row per page node
(headings, paragraphs, list items). Like read_csv() or read_html(), it brings
an external resource into R. It follows fetch_urls() or fetch_wiki_urls()
in the pipeline: fetch gets locations, read gets text.
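The fetch-then-read handoff can be sketched as follows. fetch_wiki_urls() is named above, but the reading function documented here is not named on this page, so read_nodes() below is a hypothetical stand-in, and the argument to fetch_wiki_urls() is invented for illustration:

```r
# Hypothetical pipeline sketch: fetch_wiki_urls() is mentioned in this
# documentation; read_nodes() is a placeholder name for the function
# documented on this page.
urls  <- fetch_wiki_urls("R (programming language)")  # fetch: get locations
nodes <- read_nodes(urls,
                    cores              = 2,           # read: get text
                    detect_boilerplate = TRUE,
                    remove_boilerplate = FALSE)
# One row per node (headings, paragraphs, list items), with an
# is_boilerplate column because remove_boilerplate = FALSE.
```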
Value
A data frame with columns url, h1_title, date, type, node_id, parent_heading, text, and, optionally, is_boilerplate.
Arguments
x
A character vector of URLs.
cores
Number of cores for parallel requests (default 1).
detect_boilerplate
Logical. Whether to flag boilerplate nodes (e.g. sign-up prompts, related-links blocks).
remove_boilerplate
Logical. If detect_boilerplate is TRUE, remove boilerplate rows; if FALSE, keep them and add an is_boilerplate column.
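To illustrate how the two boilerplate arguments interact: with detect_boilerplate = TRUE and remove_boilerplate = FALSE, flagged rows are kept and can be filtered afterwards. A minimal base-R sketch on a toy data frame shaped like the documented return value (all values here are invented):

```r
# Toy stand-in for the returned data frame; columns follow the
# documentation above, rows are made up for illustration.
nodes <- data.frame(
  url            = "https://example.org/a",
  node_id        = 1:3,
  type           = c("heading", "paragraph", "paragraph"),
  text           = c("Intro", "Real content.", "Sign up for our newsletter!"),
  is_boilerplate = c(FALSE, FALSE, TRUE)
)

# Equivalent of remove_boilerplate = TRUE, applied after the fact:
clean <- nodes[!nodes$is_boilerplate, ]
nrow(clean)  # 2 rows survive
```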
Details
Wikipedia pages are handled with high-fidelity selectors: content is read from
div.mw-parser-output and the h2/h3/h4 heading hierarchy. Use parent_heading to see
which section each node belongs to. The “External links” section and
rows with empty text are omitted.
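The parent_heading column makes it straightforward to recover the section structure, for example by collapsing all paragraph text under each heading. A small base-R sketch on invented data shaped like the documented output:

```r
# Invented rows with the documented parent_heading / type / text columns.
nodes <- data.frame(
  parent_heading = c("History", "History", "Usage"),
  type           = "paragraph",
  text           = c("Founded in 1993.", "Grew quickly.", "Widely used.")
)

# One string of body text per section:
sections <- tapply(nodes$text, nodes$parent_heading, paste, collapse = " ")
sections[["History"]]  # "Founded in 1993. Grew quickly."
```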