pitchRx (version 1.0)

urlsToDataFrame: Parse XML files into data frame(s)

Description

This function is deprecated as of version 1.0.

Usage

urlsToDataFrame(urls, tables = list(), add.children = FALSE,
  use.values = FALSE)

Arguments

urls
set of urls for parsing
tables
list of character vectors with appropriate names. The list names should correspond to XML nodes of interest within the XML files.
add.children
logical parameter specifying whether to scrape the XML children of the node(s) specified in tables.
use.values
logical parameter specifying whether to extract XML attributes or values of the node(s).

Value

  • Returns a data frame if the length of tables is one. Otherwise, it returns a list of data frames.

Details

This function takes a set of XML files (i.e., URLs) and shapes them into a data frame or a list of data frames.

urlsToDataFrame coerces either XML attributes or XML values into a data frame. The XML nodes (i.e., tags) of interest need to be specified as the name(s) of the tables parameter. Each element of tables should be a character vector that defines the field names for the respective data frame. These field names should match XML attributes or tags.

When use.values = FALSE, the length of tables equals the number of data frames returned, and each element of tables gives the fields for the corresponding data frame. If an element of tables is NULL, the function automatically determines the most complete set of fields and fills in NAs where information is missing. If add.children = TRUE, the elements of tables should be NULL, since the child attributes are used to name the fields (with the relevant node name as a suffix).
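
The idea of finding the most complete set of fields and filling gaps with NAs can be illustrated with a minimal sketch that uses the XML package directly. This is not the pitchRx implementation; the XML snippet, the Player attribute names, and the variable names are hypothetical and chosen only for illustration.

```r
library(XML)

# Hypothetical XML: the second Player node lacks the "avg" attribute.
xml_text <- '<game>
  <Player id="1" name="A" avg=".300"/>
  <Player id="2" name="B"/>
</game>'

doc <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//Player")

# Attributes of each node as a named character vector.
attrs <- lapply(nodes, xmlAttrs)

# The most complete set of fields is the union of all attribute names.
fields <- unique(unlist(lapply(attrs, names)))

# Start each row as all-NA, then fill in the attributes the node actually has.
rows <- lapply(attrs, function(a) {
  out <- setNames(rep(NA_character_, length(fields)), fields)
  out[names(a)] <- a
  out
})

players <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

In this sketch, players has one row per Player node and one column per attribute seen anywhere in the document; the row for the second node holds NA in the avg column.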

When use.values = TRUE, the value(s) of tables are ignored. The XML children of the specified node are the fields. If the children are inconsistent, missing values are filled with NAs.
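
For contrast, the use.values = TRUE case (child tags as fields, child text as values) can be sketched the same way. Again, this is only an illustration with the XML package under assumed inputs, not the function's actual code; the XML snippet and all names in it are hypothetical.

```r
library(XML)

# Hypothetical XML: fields are child elements rather than attributes,
# and the second Player node has no <avg> child.
xml_text <- '<game>
  <Player><id>1</id><name>A</name><avg>.300</avg></Player>
  <Player><id>2</id><name>B</name></Player>
</game>'

doc <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//Player")

# For each node, take the text value of every child, named by the child's tag.
vals <- lapply(nodes, function(n) sapply(xmlChildren(n), xmlValue))

# Union of child tag names across nodes gives the full set of fields.
fields <- unique(unlist(lapply(vals, names)))

# Fill missing children with NA so every row has the same fields.
rows <- lapply(vals, function(v) {
  out <- setNames(rep(NA_character_, length(fields)), fields)
  out[names(v)] <- v
  out
})

players_by_value <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Here the column names come from the child tags, and the inconsistent second node yields NA for avg.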

Examples

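# Obtain "batting" stats going into a game played on May 6th, 2008:
library(pitchRx)
library(XML)  # provides htmlParse(), getNodeSet() and xmlValue()
data(urls)
# Swap the players.xml file name for the batters/ directory of the same game
dir <- gsub("players.xml", "batters/",
            urls$url_player[1000])
doc <- htmlParse(dir)
# Collect the text of every link in the directory listing and strip spaces
nodes <- getNodeSet(doc, "//a")
values <- gsub(" ", "",
               sapply(nodes, xmlValue))
# Keep only the entries that contain digits
ids <- values[grep("[0-9]+", values)]
filenames <- paste(dir, ids, sep="")
# Parse each file; tables is NULL because add.children = TRUE
stats <- urlsToDataFrame(filenames,
                         tables=list(Player=NULL),
                         add.children=TRUE)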