analyze_scientific_content: Enhanced scientific content analysis with citation extraction

Description

Comprehensive analysis of scientific documents including citation extraction, reference matching, text analysis, and bibliometric indicators.

Usage

analyze_scientific_content(
  text,
  doi = NULL,
  mailto = NULL,
  citation_type = c("all", "numeric_superscript", "numeric_bracketed", "author_year"),
  window_size = 10,
  min_word_length = 3,
  remove_stopwords = TRUE,
  language = "en",
  custom_stopwords = NULL,
  ngram_range = c(1, 3),
  parse_multiple_citations = TRUE,
  use_sections_for_citations = "auto",
  n_segments_citations = 10
)

Value

List with class "enhanced_scientific_content_analysis" containing:

text_analytics: Basic statistics and word frequencies
citations: All extracted citations with metadata
citation_contexts: Citations with surrounding text
citation_metrics: Citation type distribution, density, etc.
citation_references_mapping: Matched citations to references
parsed_references: Structured reference list
word_frequencies: Word frequency table
ngrams: N-gram frequency tables
network_data: Citation co-occurrence data
summary: Overall analysis summary

Arguments

text

Character string or named list. Document text or text with sections.

doi

Character string or NULL. DOI for CrossRef reference retrieval.

mailto

Character string or NULL. Email for CrossRef API.

citation_type

Character string. Which citation types to extract:

"all": Extract all citation types (default)
"numeric_superscript": Only numeric citations (bracketed and superscript), plus narrative citations
"numeric_bracketed": Only bracketed numeric citations, plus narrative citations
"author_year": Only author-year citations, plus narrative citations

window_size

Integer. Number of words to keep before and after each citation as context (default: 10).
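
To illustrate what a word window around a citation means, here is a minimal base R sketch. The helper context_window() is hypothetical and is not the package's implementation; it assumes whitespace tokenization and a bracketed numeric citation pattern.

```r
# Hypothetical helper (not part of the package): return up to `window_size`
# words on each side of every token that matches `pattern`.
context_window <- function(text, pattern, window_size = 10) {
  words <- strsplit(text, "\\s+")[[1]]
  hits <- grep(pattern, words)
  lapply(hits, function(i) {
    list(
      citation = words[i],
      # tail()/head() handle citations near the start or end of the text
      before = paste(tail(words[seq_len(i - 1)], window_size), collapse = " "),
      after  = paste(head(words[-seq_len(i)], window_size), collapse = " ")
    )
  })
}

txt <- "Prior studies report similar results [3] across several large cohorts"
ctx <- context_window(txt, "\\[[0-9]+\\]", window_size = 2)
ctx[[1]]$before  # "similar results"
ctx[[1]]$after   # "across several"
```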

min_word_length

Integer. Minimum word length for analysis (default: 3).

remove_stopwords

Logical. Remove stopwords (default: TRUE).

language

Character. Language for stopwords (default: "en").

custom_stopwords

Character vector or NULL. Additional stopwords to remove (default: NULL).
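
The combined effect of min_word_length, remove_stopwords, and custom_stopwords can be pictured with a small base R sketch. The function filter_tokens() and its tiny stopword list are assumptions made for illustration; the package may tokenize and filter differently.

```r
# Hypothetical helper (not part of the package): lowercase, tokenize on
# non-letters, drop short tokens, then drop stopwords.
filter_tokens <- function(text,
                          min_word_length = 3,
                          stopwords = c("the", "and", "of"),  # illustrative list
                          custom_stopwords = NULL) {
  tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  tokens <- tokens[nchar(tokens) >= min_word_length]
  tokens[!tokens %in% c(stopwords, custom_stopwords)]
}

filter_tokens("The analysis of citation networks and the data")
# "analysis" "citation" "networks" "data"

filter_tokens("The analysis of citation networks and the data",
              custom_stopwords = "data")
# "analysis" "citation" "networks"
```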

ngram_range

Integer vector of length 2. Minimum and maximum n-gram size (default: c(1, 3), i.e. unigrams through trigrams).
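
As a rough picture of what an n-gram range of c(1, 3) produces, the following base R sketch builds one frequency table per n. The helper make_ngrams() and the "1gram"/"2gram"/"3gram" names are hypothetical; the structure of the ngrams element returned by the package may differ.

```r
# Hypothetical helper (not part of the package): all contiguous n-word sequences.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("citation", "context", "analysis", "citation", "context")
ngram_range <- c(1, 3)
sizes <- seq(ngram_range[1], ngram_range[2])

# One frequency table per n, most frequent n-grams first
ngram_tables <- lapply(sizes, function(n) {
  sort(table(make_ngrams(words, n)), decreasing = TRUE)
})
names(ngram_tables) <- paste0(sizes, "gram")

ngram_tables[["2gram"]]["citation context"]  # appears twice
```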

parse_multiple_citations

Logical. Parse complex citations that contain multiple references (default: TRUE).

use_sections_for_citations

Logical or "auto". Whether to use document sections when mapping citations (default: "auto").

n_segments_citations

Integer. Number of segments to divide the text into when sections are not used (default: 10).
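
One way to picture segment-based mapping is to divide the text into equal-length segments and assign each citation to one by its character position. This is an assumption about the general idea, not the package's documented algorithm; assign_segment() is a hypothetical helper.

```r
# Hypothetical helper (not part of the package): map character positions to
# one of `n_segments` equal-length segments of a text with `text_length` chars.
assign_segment <- function(position, text_length, n_segments = 10) {
  seg <- ceiling(position / text_length * n_segments)
  # Clamp to the valid range in case of positions at the boundaries
  pmin(pmax(seg, 1), n_segments)
}

# Citations found at characters 1, 250 and 1000 of a 1000-character text
assign_segment(c(1, 250, 1000), text_length = 1000, n_segments = 10)
# 1 3 10
```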

Details

This function performs:

Citation extraction (numbered, author-year, narrative, parenthetical)
Reference parsing (from text or CrossRef API)
Citation-reference matching
Text analysis (word frequencies, n-grams)
Citation context extraction
Bibliometric indicators

The citation_type parameter filters which citation patterns to search for, reducing false positives. Narrative citations are always included as they are context-dependent.
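
The filtering idea can be sketched in base R: keep one regular expression per citation style and search only with the patterns selected by citation_type. The two patterns and the extract_citations() helper below are simplified assumptions for illustration, not the patterns the package uses, and narrative citations are left out for brevity.

```r
# Illustrative patterns only; real citation formats are far more varied.
patterns <- list(
  numeric_bracketed = "\\[[0-9]+(?:[,-][0-9]+)*\\]",
  author_year       = "\\([A-Z][A-Za-z]+(?: et al\\.)?,? [0-9]{4}\\)"
)

# Hypothetical helper (not part of the package): search only the selected patterns.
extract_citations <- function(text,
                              citation_type = c("all", "numeric_bracketed",
                                                "author_year")) {
  citation_type <- match.arg(citation_type)  # first choice, "all", is the default
  active <- if (citation_type == "all") patterns else patterns[citation_type]
  lapply(active, function(p) {
    regmatches(text, gregexpr(p, text, perl = TRUE))[[1]]
  })
}

txt <- "Results agree with earlier work [1,2] and with (Smith et al., 2020)."
extract_citations(txt)                 # both styles are searched
extract_citations(txt, "author_year")  # bracketed numbers are ignored
```

Restricting the active patterns means that, for example, a bracketed year range in an author-year paper is never mistaken for a numeric citation.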

Examples

if (FALSE) {
# For documents with numeric citations
doc <- pdf2txt_auto("paper.pdf", citation_type = "numeric_bracketed")
analysis <- analyze_scientific_content(
  doc,
  citation_type = "numeric_bracketed",
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"
)

# For documents with author-year citations
doc <- pdf2txt_auto("paper.pdf", citation_type = "author_year")
analysis <- analyze_scientific_content(
  doc,
  citation_type = "author_year"
)

# Inspect the results
summary(analysis)
head(analysis$citations)
table(analysis$citation_metrics$type_distribution)
}
