pitchRx (version 1.0)

scrape: Scrape Major League Baseball's Gameday Data

Description

Function for obtaining PITCHf/x and other related Gameday data. scrape currently supports files ending with inning/inning_all.xml, inning/inning_hit.xml, players.xml, or miniscoreboard.xml (for example, http://gd2.mlb.com/components/game/mlb/year_2011/month_04/day_04/gid_2011_04_04_minmlb_nyamlb_1/inning/inning_all.xml). Note that inning/inning_all.xml is the file that contains the PITCHf/x data itself; the other files can complement that data, depending on the goal of the analysis. Any combination of these file names may be passed to the suffix argument, and scrape will retrieve data from a (possibly large) number of files based on either a window of dates or a set of game.ids.

Usage

scrape(start, end, game.ids, suffix = "inning/inning_all.xml", connect)

Arguments

start
character date ("yyyy-mm-dd") at which to begin scraping.
end
character date ("yyyy-mm-dd") at which to end scraping.
game.ids
character vector of gameday_links. If this option is used, start and end are ignored. See data(gids, package="pitchRx") for examples.
suffix
character vector with suffix of the XML files to be parsed. Currently supported options are: 'players.xml', 'miniscoreboard.xml', 'inning/inning_all.xml', 'inning/inning_hit.xml'.
connect
A database connection object. The class of the object should be "MySQLConnection" or "SQLiteConnection". If a valid connection is supplied, tables will be copied to the database rather than returned, which results in much better memory management when scraping in bulk.

Value

  • Returns a list of data frames (or nothing if writing to a database).

Details

If collecting data in bulk, it is strongly recommended that one establish a database connection and supply it to the connect argument. See the examples section for a simple demonstration of how to do so.
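The examples below create the connection with dplyr's src_sqlite helper; a connection of class "SQLiteConnection" can equally be created directly with DBI and RSQLite. A minimal sketch, assuming the DBI and RSQLite packages are installed (the file name pitchfx.sqlite3 is arbitrary):

```r
library(DBI)
library(RSQLite)

# Open (or create) a SQLite database file on disk
con <- dbConnect(SQLite(), dbname = "pitchfx.sqlite3")

# With connect supplied, tables (e.g. 'pitches', 'atbats') are
# written to the database instead of being returned as data frames
scrape(start = "2013-08-01", end = "2013-08-31", connect = con)

dbListTables(con)  # inspect the tables that were created
dbDisconnect(con)
```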

See Also

XML2R::XML2Obs

Examples

# Collect PITCHf/x (and other data from inning_all.xml files) for August 1st, 2013
dat <- scrape(start = "2013-08-01", end = "2013-08-01")
# Alternatively, use the game.ids argument to target specific games (here, May 1st, 2012)
data(gids, package="pitchRx")
dat2 <- scrape(game.ids=gids[grep("2012_05_01", gids)])

# Scrape PITCHf/x from the Minnesota Twins' 2011 season
twins11 <- gids[grepl("min", gids) & grepl("2011", gids)]
dat <- scrape(game.ids=twins11)

# Create a SQLite database, then collect and store data in that database
library(dplyr)
my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
scrape(start = "2013-08-01", end = "2013-08-01", connect=my_db$con)

# Simple example of a database query using dplyr
# Note that 'num' and 'url' together form a key that allows us to join these tables
locations <- select(tbl(my_db, "pitches"), px, pz, des, num, url)
names <- select(tbl(my_db, "atbats"), pitcher_name, batter_name, num, url)
que <- inner_join(locations, filter(names, batter_name == "Paul Goldschmidt"),
                  by = c("num", "url"))
que$query #refine sql query if you'd like
pitchfx <- collect(que) #submit query and bring data into R

# Collect PITCHf/x and other complementary data
files <- c("inning/inning_all.xml", "inning/inning_hit.xml",
             "miniscoreboard.xml", "players.xml")
dat3 <- scrape(start = "2012-05-01", end = "2012-05-01", suffix = files)
