Introducing the nflscrapR Package

This package was built to allow R users to access and analyze data from the National Football League (NFL) API. Its functions support analysis at the play and game levels, for single games and entire seasons. By parsing the play-by-play data recorded by the NFL, the package lets NFL data enthusiasts examine each facet of the game in finer detail. It puts granular data into the hands of any R user interested in analyzing American football. With open-source data, reproducible advanced NFL metrics can be developed more rapidly, which in turn helps grow the football analytics community.

Note: Data is only available from the 2009 season onward, for now.

Installation

# First install the devtools package if needed, using the commented-out line below:
# install.packages("devtools")

# Then can install using the devtools package from either of the following:
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
# or the following (these are the exact same packages):
devtools::install_github(repo = "ryurko/nflscrapR")
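
Before running the install commands above, it can be useful to check which packages are already present. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name used only for illustration, not a function from devtools or nflscrapR.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
# requireNamespace() loads a namespace if it is available, without attaching
# it, and returns FALSE (rather than failing) when the package is missing.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "nflscrapR"))

# The install steps are left commented out so the sketch has no side effects:
# if ("devtools" %in% to_install) install.packages("devtools")
# if ("nflscrapR" %in% to_install) {
#   devtools::install_github(repo = "maksimhorowitz/nflscrapR")
# }
```

Keeping the check separate from the install calls makes it easy to see what would change before anything is downloaded.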

Gather game ids

The scrape_game_ids function returns the games for a specified season (pre-, regular, or post-season) and can optionally be restricted to particular weeks and teams. The code below returns a data frame containing the games for week 2 of the 2018 NFL season:

# First load the package:
library(nflscrapR)
#> Loading required package: nnet
#> Loading required package: magrittr

week_2_games <- scrape_game_ids(2018, weeks = 2)
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops
# Display using the pander package:
# install.packages("pander")
week_2_games %>%
  pander::pander()

Example play-by-play analysis

Here is an example of scraping the week 2 matchup of the 2018 NFL season between the Kansas City Chiefs and the Pittsburgh Steelers. First, load the tidyverse to select the game id, then use the scrape_json_play_by_play function to return the play-by-play data for the game:

# install.packages("tidyverse")
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.6
#> ✔ tidyr   0.8.1     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> Warning: package 'dplyr' was built under R version 3.5.1
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ tidyr::complete()  masks RCurl::complete()
#> ✖ tidyr::extract()   masks magrittr::extract()
#> ✖ dplyr::filter()    masks stats::filter()
#> ✖ dplyr::lag()       masks stats::lag()
#> ✖ purrr::set_names() masks magrittr::set_names()

# Now generate the play-by-play dataset for the game:
kc_vs_pit_pbp <- week_2_games %>%
  filter(home_team == "PIT") %>%
  pull(game_id) %>%
  scrape_json_play_by_play()
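
The pipeline above relies on knowing that Pittsburgh was the home team and on the filter returning exactly one game. As a hedged sketch of a more defensive lookup, the base R code below works on a toy data frame: `find_game_id` is a hypothetical helper, the `away_team` column is an assumption about the game table, and the `game_id` values are invented so the example runs without network access.

```r
# Toy stand-in for the data frame returned by scrape_game_ids().
# game_id and home_team appear in the pipeline above; away_team is assumed,
# and the id values are invented for illustration.
toy_games <- data.frame(
  game_id   = c("g001", "g002", "g003"),
  home_team = c("PIT", "JAX", "DEN"),
  away_team = c("KC", "NE", "OAK"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the single game between two teams,
# regardless of which one is the home side.
find_game_id <- function(games, team_a, team_b) {
  is_match <- (games$home_team == team_a & games$away_team == team_b) |
    (games$home_team == team_b & games$away_team == team_a)
  ids <- games$game_id[is_match]
  if (length(ids) != 1) {
    stop("Expected exactly one matching game, found ", length(ids))
  }
  ids
}

kc_pit_id <- find_game_id(toy_games, "KC", "PIT")
```

Failing early when zero or several games match avoids passing an unexpected value on to the scraping step.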

Now, using the estimates from the nflscrapR expected points and win probability models, we can generate visuals summarizing the game. For example, the win probability chart below shows how the Chiefs' early lead faded in the second quarter, before they sealed the game in the second half:

# Install the awesome teamcolors package by Ben Baumer and Gregory Matthews:
# install.packages("teamcolors")
library(teamcolors)

# Pull out the Steelers and Chiefs colors:
nfl_teamcolors <- teamcolors %>% filter(league == "nfl")
pit_color <- nfl_teamcolors %>%
  filter(name == "Pittsburgh Steelers") %>%
  pull(primary)
kc_color <- nfl_teamcolors %>%
  filter(name == "Kansas City Chiefs") %>%
  pull(primary)

# Now generate the win probability chart:
kc_vs_pit_pbp %>%
  filter(!is.na(home_wp),
         !is.na(away_wp)) %>%
  dplyr::select(game_seconds_remaining,
                home_wp,
                away_wp) %>%
  gather(team, wpa, -game_seconds_remaining) %>%
  ggplot(aes(x = game_seconds_remaining, y = wpa, color = team)) +
  geom_line(size = 2) +
  geom_hline(yintercept = 0.5, color = "gray", linetype = "dashed") +
  scale_color_manual(labels = c("KC", "PIT"),
                     values = c(kc_color, pit_color),
                     guide = "none") +
  scale_x_reverse(breaks = seq(0, 3600, 300)) + 
  annotate("text", x = 3000, y = .75, label = "KC", color = kc_color, size = 8) + 
  annotate("text", x = 3000, y = .25, label = "PIT", color = pit_color, size = 8) +
  # Dashed black lines mark the end of each quarter:
  geom_vline(xintercept = 900, linetype = "dashed", color = "black") +
  geom_vline(xintercept = 1800, linetype = "dashed", color = "black") +
  geom_vline(xintercept = 2700, linetype = "dashed", color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  labs(
    x = "Time Remaining (seconds)",
    y = "Win Probability",
    title = "Week 2 Win Probability Chart",
    subtitle = "Kansas City Chiefs vs. Pittsburgh Steelers",
    caption = "Data from nflscrapR"
  ) + theme_bw()
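
The dashed vertical lines at 900, 1800, and 2700 seconds split the chart into quarters. The sketch below shows, on invented data, one way to make that mapping explicit and to count how often the favored team changes. Only the column names `game_seconds_remaining` and `home_wp` come from the chart code; the helper names and values are hypothetical, and the quarter mapping assumes regulation play with no overtime.

```r
# Toy win probability series; the values are invented for illustration.
toy_wp <- data.frame(
  game_seconds_remaining = c(3600, 3000, 2700, 2000, 1500, 900, 300, 0),
  home_wp = c(0.50, 0.30, 0.20, 0.45, 0.55, 0.40, 0.25, 0.00)
)

# Map seconds remaining to a regulation quarter (1 to 4), assuming
# 900-second quarters. A boundary value such as 2700 is assigned to the
# quarter that is about to start.
quarter_from_seconds <- function(seconds_remaining) {
  pmin(4, 5 - ceiling(seconds_remaining / 900))
}

toy_wp$quarter <- quarter_from_seconds(toy_wp$game_seconds_remaining)

# Count how often home_wp moves from one side of 0.5 to the other between
# consecutive rows. Rows at exactly 0.5 are dropped first because they
# favor neither team.
count_wp_flips <- function(home_wp) {
  side <- sign(home_wp - 0.5)
  side <- side[side != 0]
  sum(diff(side) != 0)
}

n_flips <- count_wp_flips(toy_wp$home_wp)
```

A summary like this can complement the chart when comparing many games, where reading each plot individually is impractical.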

Example of gathering season data

You can also use the scrape_season_play_by_play function to scrape all the play-by-play data meeting your desired criteria for a particular season. Note that this function can take a long time to run, since it may pull an entire season's worth of data. The code below demonstrates how to access all play-by-play data from the 2018 pre-season:

preseason_pbp_2018 <- scrape_season_play_by_play(2018, "pre")
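
Because a season scrape can take a long time, it may be worth saving the result so the scrape runs only once. The following is a minimal sketch of that idea in base R. `cached` and `slow_scrape` are hypothetical names, and `slow_scrape` is a stand-in that returns invented rows so the example runs without network access.

```r
# Hypothetical helper: run `compute()` once and store the result on disk as
# an RDS file; later calls with the same path read the file instead.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for a slow scrape. In practice `compute` could wrap a call such as
# scrape_season_play_by_play(2018, "pre").
slow_scrape <- function() {
  data.frame(play_id = 1:3, desc = c("kickoff", "pass", "run"))
}

cache_file <- file.path(tempdir(), "preseason_pbp_2018.rds")
pbp_first  <- cached(cache_file, slow_scrape)  # computes unless the file exists
pbp_second <- cached(cache_file, slow_scrape)  # reads the cached copy
```

The sketch writes to `tempdir()`, which is cleared when the R session ends; a path inside the project directory would keep the data across sessions.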

Version

1.8.3

License

CC0

Last Published

April 3rd, 2020

Functions in nflscrapR (1.8.3)

buildURL: Building URL to scrape player season stat pages
get_season_rosters: Get team roster for a single season
calculate_expected_points: Compute expected points for provided play-by-play dataset
find_page_player_id: Find the GSIS ID for each player on the provided page.
game_play_by_play: Parsed Descriptive Play-by-Play Dataset for a Single Game
find_page_player_birthdate: Find the birthdate for each player on the provided page.
getGSISID: For a player's href, get their GSIS ID from their personal url.
getPageNumbers: Get Number of Player Position Pages
add_air_yac_ep_variables: Calculate and add the air and yac expected points variables to include in a `nflscrapR` play-by-play data frame
create_game_json_url: Create the url with the location of NFL game JSON data
add_ep_variables: Calculate and add the expected points variables to include in a `nflscrapR` play-by-play data frame
playerstats11: NFL Team Names and Abbreviations
getPlayers: Scrape Player Names and Positions
win_probability: Win probability function to add win probability columns for the home and away teams for each play in the game
add_wp_variables: Calculate and add the win probability variables to include in a `nflscrapR` play-by-play data frame
drive_summary: Drive Summary and Results
playerstats12: NFL Team Names and Abbreviations
season_play_by_play: Parsed Descriptive Play-by-Play Function for a Full Season
simple_boxscore: Simple Game Boxscore
player_game: Detailed Boxscore for Single NFL Game
scrape_json_play_by_play: Scrape an individual game's JSON play-by-play data from NFL.com
nflteams: Dataset of NFL team names, abbreviations, and colors
agg_player_season: Detailed Player Aggregate Season Statistics
add_air_yac_wp_variables: Calculate and add the air and yac win probability variables to include in a `nflscrapR` play-by-play data frame
playerstats13: NFL Team Names and Abbreviations
get_birthdate: For a player's href, get their birthdate from their personal url.
buildNameAbbr: Build formatted player name from full player name
create_game_html_url: Create the url with the location of raw NFL play-by-play HTML
playerstats14: NFL Team Names and Abbreviations
get_players: Scrape player names and positions
extracting_gameids: Extract GameIDs for each game in a given NFL season
expected_points: Expected point function to calculate expected points for each play in the play by play, and the expected points added in three ways, basic EPA, air yards EPA, and yards after catch EPA
calculate_win_probability: Compute win probability for the provided play-by-play dataset.
get_gsis_id: For a player's href, get their GSIS ID from their personal url.
scrape_season_play_by_play: Scrape season play-by-play for a given NFL season (either pre, regular, or post-season)
get_page_numbers: Get number of player position pages
playerstats09: NFL Team Names and Abbreviations
season_rosters: Season Rosters for Teams
season_games: Game Information for All Games in a Season
proper_jsonurl_formatting: Formatting URL for location of NFL Game JSON Data
season_player_game: Boxscore for Each Game in the Season - One line per player per game
playerstats10: NFL Team Names and Abbreviations
scrape_game_ids: Scrape game ids for a given NFL season (either pre, regular, or post-season)
scrape_game_play_by_play: Scrape an individual game's play-by-play data from NFL.com
playerstats15: NFL Team Names and Abbreviations
build_name_abbr: Build formatted player name from full player name
build_url: Build URL to scrape player season stat pages
findPagePlayerID: Find the GSIS ID for each player on the provided page.