contentanalysis (version 0.2.1)

analyze_scientific_content: Enhanced scientific content analysis with citation extraction

Description

Performs a comprehensive analysis of scientific documents, including citation extraction, reference matching, text analysis, and bibliometric indicators.

Usage

analyze_scientific_content(
  text,
  doi = NULL,
  mailto = NULL,
  citation_type = c("all", "numeric_superscript", "numeric_bracketed", "author_year"),
  window_size = 10,
  min_word_length = 3,
  remove_stopwords = TRUE,
  language = "en",
  custom_stopwords = NULL,
  ngram_range = c(1, 3),
  parse_multiple_citations = TRUE,
  use_sections_for_citations = "auto",
  n_segments_citations = 10
)

Value

List with class "enhanced_scientific_content_analysis" containing:

  • text_analytics: Basic statistics and word frequencies

  • citations: All extracted citations with metadata

  • citation_contexts: Citations with surrounding text

  • citation_metrics: Citation type distribution, density, etc.

  • citation_references_mapping: Matched citations to references

  • parsed_references: Structured reference list

  • word_frequencies: Word frequency table

  • ngrams: N-gram frequency tables

  • network_data: Citation co-occurrence data

  • summary: Overall analysis summary

Arguments

text

Character string or named list. The document text as a single string, or a named list holding the text of each section.

doi

Character string or NULL. DOI of the document, used to retrieve its references from CrossRef.

mailto

Character string or NULL. Email address supplied with CrossRef API requests.

citation_type

Character string. Type of citations to extract:

  • "all": Extract all citation types (default)

  • "numeric_superscript": Only numeric citations (brackets and superscript) + narrative

  • "numeric_bracketed": Only bracketed numeric citations + narrative

  • "author_year": Only author-year citations + narrative

window_size

Integer. Number of words before and after each citation to keep as context (default: 10).

min_word_length

Integer. Minimum word length for analysis (default: 3).

remove_stopwords

Logical. Remove stopwords (default: TRUE).

language

Character. Language for stopwords (default: "en").

custom_stopwords

Character vector. Additional stopwords to remove (default: NULL).

ngram_range

Integer vector. N-gram range, e.g. c(1,3) (default: c(1,3)).

parse_multiple_citations

Logical. Parse complex citations (default: TRUE).

use_sections_for_citations

Logical or "auto". Whether to use document sections when mapping citations (default: "auto").

n_segments_citations

Integer. Number of segments to divide the text into when sections are not used (default: 10).
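
To make the window_size argument concrete, here is a minimal base-R sketch of collecting the words on either side of a citation. It is an illustration only, not the package's implementation: the helper name citation_context and the simplified bracketed-citation pattern are assumptions made for this example.

```r
# Hypothetical helper (not part of contentanalysis): return the `window_size`
# words found before and after each bracketed numeric citation token.
citation_context <- function(text, window_size = 10) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Simplified pattern: tokens containing e.g. [1], [2,3] or [4-6]
  hits <- grep("\\[[0-9,-]+\\]", words)
  do.call(rbind, lapply(hits, function(i) {
    before <- words[seq_len(i - 1)]
    after  <- words[seq_len(length(words) - i) + i]
    data.frame(
      citation     = words[i],
      words_before = paste(tail(before, window_size), collapse = " "),
      words_after  = paste(head(after, window_size), collapse = " "),
      stringsAsFactors = FALSE
    )
  }))
}

txt <- paste(
  "Deep learning has improved many tasks [1] and recent surveys",
  "confirm this trend [2,3] across domains."
)
ctx <- citation_context(txt, window_size = 3)
print(ctx)
```

A larger window_size yields more context per citation at the cost of more overlap between neighbouring citations. In this sketch, a text without any citation returns NULL.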

Details

This function performs:

  • Citation extraction (numbered, author-year, narrative, parenthetical)

  • Reference parsing (from text or CrossRef API)

  • Citation-reference matching

  • Text analysis (word frequencies, n-grams)

  • Citation context extraction

  • Bibliometric indicators
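
The text-analysis step listed above can be pictured with a short base-R sketch showing how min_word_length, stopword removal, and ngram_range interact. The helpers tokenize and make_ngrams are hypothetical names introduced for this example, and the package may tokenize and count differently.

```r
# Hypothetical helpers (not part of contentanalysis).

# Lowercase, split on non-letters, drop short words and stopwords.
tokenize <- function(text, min_word_length = 3, stopwords = character(0)) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nchar(tokens) >= min_word_length]
  tokens[!tokens %in% stopwords]
}

# Build all contiguous n-grams of size n from a token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

txt <- paste(
  "Citation analysis supports science mapping.",
  "Citation analysis also supports research evaluation."
)
tokens <- tokenize(txt, min_word_length = 3, stopwords = c("also"))

# Word frequency table, most frequent first
word_freq <- sort(table(tokens), decreasing = TRUE)

# One frequency table per n in ngram_range
ngram_range <- c(1, 3)
sizes <- seq(ngram_range[1], ngram_range[2])
ngrams <- lapply(sizes, function(n) {
  sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
})
names(ngrams) <- paste0(sizes, "gram")

print(head(word_freq, 3))
print(head(ngrams[["2gram"]], 3))
```

Note that this sketch forms n-grams after filtering, so an n-gram can span a removed stopword or a sentence boundary; that is a simplification chosen for brevity.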

The citation_type argument restricts which citation patterns are searched for, which reduces false positives. Narrative citations are always included, whatever the value, because they are context-dependent.
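
The filtering behaviour described here can be sketched in base R as follows. The regular expressions are deliberately simplified assumptions, the superscript style is left out for brevity, and the function names select_patterns and extract_citations are hypothetical; the package's own patterns are likely more elaborate.

```r
# Simplified, hypothetical patterns (not the package's own expressions).
patterns <- list(
  numeric_bracketed = "\\[[0-9]+(?:\\s*[,-]\\s*[0-9]+)*\\]",
  author_year       = "\\([A-Z][A-Za-z]+(?: et al\\.| and [A-Z][A-Za-z]+)?,? [0-9]{4}\\)",
  narrative         = "[A-Z][A-Za-z]+(?: et al\\.)? \\([0-9]{4}\\)"
)

# Keep every pattern for "all"; otherwise keep the requested style
# plus the narrative pattern, which is always searched.
select_patterns <- function(citation_type = c("all", "numeric_bracketed", "author_year")) {
  citation_type <- match.arg(citation_type)
  if (citation_type == "all") return(patterns)
  patterns[c(citation_type, "narrative")]
}

# Return the matches for each selected pattern as a named list.
extract_citations <- function(text, citation_type = "all") {
  selected <- select_patterns(citation_type)
  lapply(selected, function(p) {
    regmatches(text, gregexpr(p, text, perl = TRUE))[[1]]
  })
}

txt <- paste(
  "Earlier work [1, 2] disagrees with (Smith et al., 2020).",
  "Jones (2019) reports similar results [3]."
)

res_numeric <- extract_citations(txt, "numeric_bracketed")
res_author  <- extract_citations(txt, "author_year")
str(res_numeric)
str(res_author)
```

Restricting the search to one style means that text resembling another style, such as a parenthetical year in a numerically cited paper, is never considered as a candidate citation.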

Examples

if (FALSE) {
# For documents with numeric citations
doc <- pdf2txt_auto("paper.pdf", citation_type = "numeric_bracketed")
analysis <- analyze_scientific_content(
  doc,
  citation_type = "numeric_bracketed",
  doi = "10.xxxx/xxxxx",
  mailto = "your@email.com"
)

# For documents with author-year citations
doc <- pdf2txt_auto("paper.pdf", citation_type = "author_year")
analysis <- analyze_scientific_content(
  doc,
  citation_type = "author_year"
)

# Inspect the results of the most recent analysis
summary(analysis)
head(analysis$citations)
table(analysis$citation_metrics$type_distribution)
}