packageRank: compute and visualize package download counts and rank percentiles
Features
- compute and visualize the counts and ranks (nominal and percentile) of downloads from RStudio’s CRAN mirror and Bioconductor.
- compute and visualize a package’s position in the overall distribution of download counts for a given day (cross-sectionally) or over time (longitudinally).
- compute and visualize the downloads of the R application.
NOTE: ‘packageRank’ relies on the ‘cranlogs’ package and an active internet connection. RStudio CRAN logs for the previous day are generally posted at 17:00 UTC (GMT+2); results for functions that rely on ‘cranlogs’ are available soon after.
Getting started
To install ‘packageRank’ from CRAN:
install.packages("packageRank")
To install the latest development version from GitHub:
# You may need to first install the 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)
I - Background
The ‘cranlogs’ package computes the number of downloads, packages and R itself, from RStudio’s CRAN mirror.
For example, we can see that on the first day of 2019 the ‘HistData’ package was downloaded 51 times:
cranlogs::cran_downloads(packages = "HistData", from = "2019-01-01", to = "2019-01-01")
> date count package
> 1 2019-01-01 51 HistData
Lurking in the background is the “compared to what?” question. Is 51 downloads large or small? The objective of ‘packageRank’ is to help answer such questions by putting counts into context.
II - Computation of downloads
To compute package or R downloads, use cranDownloads()
. In contrast to
cranlogs::cran_downloads()
, cranDownloads()
offers more convenient
interface. You can pass dates as “yyyy-mm-dd”, “yyyy-mm” or “yyyy”:
# Downloads from December 31, 2018 through June 25, 2019
cranDownloads(packages = "HistData", from = "2018-12-31", to = "2019-06-25")
# Downloads from June 2015 through June 2019
cranDownloads(packages = "HistData", from = "2015-06", to = "2016-01")
# 2016 to 2019
cranDownloads(packages = "HistData", from = "2015", to = "2019")
# Year-to-date
cranDownloads(packages = "HistData", from = "2019")
To compute the downloads for multiple packages, pass a character vector of package names.
cranDownloads(packages = c("Rcpp", "rlang", "data.table"))
III - Visualization of downloads
When dealing with many observations (e.g., longer time series),
visualization can be useful. To do so, use cranDownloads()
’s plot
method:
# Downloads from December 31, 2018 throught June 25, 2019
plot(cranDownloads(packages = "HistData", from = "2018-12-31", to = "2019-06-25"))
# Downloads from June 2015 through June 2019
plot(cranDownloads(packages = "HistData", from = "2015-06", to = "2016-01"))
# 2016 to 2019 (either character or numeric are OK)
plot(cranDownloads(packages = "HistData", from = "2015", to = "2019"))
# Year-to-date
plot(cranDownloads(packages = "HistData", from = "2019"))
multiple packages
When you pass a vector of multiple packages, you’ll get a dot chart for a single date and a ‘ggplot2’ figure with multiple facets for multiple dates:
plot(cranDownloads(packages = c("Rcpp", "rlang", "data.table")))
plot(cranDownloads(packages = c("Rcpp", "rlang", "data.table"), when = "last-month"))
If you want separate plots for each package, set the ‘graphics’ argument to “base”:
plot(cranDownloads(packages = c("Rcpp", "rlang", "data.table"), when = "last-month"), graphics = "base")
smoothers, confidence intervals, and R release dates
You can also plot a smoother with the ‘smooth’ argument and release dates for R with the ‘r.version’ argument:
plot(cranDownloads(packages = "rstan", from = "2019"), smooth = TRUE, r.version = TRUE)
With ‘ggplot2’ figures, the ‘se’ argument adds confidence intervals:
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"), when = "last-month"), smooth = TRUE,
se = TRUE)
Bioconductor
For Bioconductor packages, use bioconductorDownloads()
. Note that
because logs are aggregated to the month or calendar year, dates must
either be “yyyy-mm” or “yyyy”:
# Downloads from June 2015 through June 2019
plot(bioconductorDownloads(packages = "monocle", from = "2015-06", to = "2019-06"))
# Year-to-date
plot(bioconductorDownloads(packages = "monocle", from = "2019"))
Because of the aggregation, red squares are used to indicate in-progress observations.
IV - Computing percentiles and ranks
One way to put download counts into context is to compute its rank
percentile. This statistic, familiar to those who’ve taken a
standardized test, tell us the packages had fewer downloads. In this
way, the rank percentile give us an idea of a package’s place in the
overall distribution of downloads. To compute this statistic, use
packageRank()
:
packageRank(packages = "HistData", date = "2019-01-01")
> date packages downloads percentile rank
> 1 2019-01-01 HistData 51 93.4 920 of 14,020
Here, we see that 51 downloads on January 1, 2019 put ‘HistData’ in the 93rd percentile: 93% of packages had fewer downloads than ‘HistData’.
Because packages with zero downloads are not recorded in the log, there is a potential censoring problem. However, my analysis indicates that the number of packages on CRAN without any downloads on a given day is actually pretty small. In fact, the number of archived packages, not on CRAN, that are downloaded is often pretty high. (More about these numbers in future releases.)
computing rank percentile
To compute the rank percentile, I tabulate the number of downloads per package recorded in the log. Then for a given download count, I compute the percentage of packages with fewer download counts:
pkg.rank <- packageRank(packages = "HistData", date = "2019-01-01")
downloads <- pkg.rank$crosstab
round(100 * mean(downloads < downloads["HistData"]), 1)
> [1] 93.4
# OR
(pkgs.with.fewer.downloads <- sum(downloads < downloads["HistData"]))
> [1] 13092
(tot.pkgs <- length(downloads))
> [1] 14020
round(100 * pkgs.with.fewer.downloads / tot.pkgs , 1)
> [1] 93.4
For the example above, 51 downloads puts ‘HistData’ in 920th place among the 14,020 packages downloaded. This rank is “nominal” because multiple packages can have the same number of downloads. As a result, a package’s nominal rank (but not its rank percentile) can be affected by its name: packages with the same number of downloads are sorted in alphabetical order. Thus, ‘HistData’ benefits from the fact that it is second in the list of packages with 51 downloads:
pkg.rank <- packageRank(packages = "HistData", date = "2019-01-01")
downloads <- pkg.rank$crosstab
downloads[downloads == 51]
>
> dynamicTreeCut HistData kimisc NeuralNetTools
> 51 51 51 51
> OpenStreetMap pkgKitten plotlyGeoAssets spls
> 51 51 51 51
> webutils zoom
> 51 51
warning message
With R >= 3.6, you’re likely to see a warning message the first time
you run either packageRank()
or bioconductorRank()
:
Registered S3 method overwritten by 'R.oo':
method from
throw.default R.methodsS3
This is a consequence of an upstream, higher-order package dependency. For more information, see R.methodsS3: Issue 15 and R.utils: Issue 95 on Henrik Bengtsson’s GitHub pages for details.
Bioconductor
For Bioconductor packages, use bioconductorRank()
:
bioconductorRank(packages = "cicero", date = "2019-09")
> date packages downloads percentile rank
> 1 2019-09 cicero 171 77.3 434 of 1,913
V - Visualizing percentiles and ranks (cross-sectional)
To visualize the rank percentile, use packageRank()
’s plot
method:
plot(packageRank(packages = "HistData", date = "2019-01-01"))
This provides a cross-sectional view that plots a package’s rank (x-axis) against the base 10 logarithm of its downloads (y-axis), and highlights the package’s position in the overall distribution of downloads. In addition, the plot illustrates 1) a package’s rank percentile and raw count of downloads (in red); 2) the location of the 75th, 50th and 25th percentiles (dotted gray vertical lines); 3) the package with the most downloads, in this case ‘WGCNA’ (in blue, top left); and 4) the total number of downloads (2,982,767 for CRAN) (in blue, top right).
You can also do this with
bioconductorRank()
:
plot(bioconductorRank(packages = "cicero", date = "2019-01"))
And you can also pass a vector of packages:
plot(packageRank(packages = c("cholera", "HistData", "regtools"), date = "2019-01-01"))
VI - Visualizing percentiles and ranks (longitudinal)
To visualize a package’s relative position over time, use
packageRankTime()
:
plot(packageRankTime(packages = "HistData", when = "last-month"), graphics_pkg = "base")
This longitudinal view plots the date (x-axis) against the logarithm of a package’s downloads (y-axis).
In the background, the same variable are plotted (in gray) for a stratified random sample of packages: within each 5% interval of rank percentiles (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked over time. This sample approximates the “typical” pattern of downloads for that time period.
Note: only two time frames are currently available: “last-week” and “last-month”. Also, a version for Bioconductor packages is not currently available.
VII Graphics: base R and ‘ggplot2’
Plots are available as both base R and ‘ggplot2’ graphs. By default,
plots for one package or day use base graphics while those with multiple
packages or days use ‘ggplot2’. You can override these defaults by using
the “graphics” argument in the plot()
method.
VIII - Memoization
To avoid the bottleneck of downloading multiple log files,
packageRank()
is currently limited to individual days or observations.
However, to reduce the need to re-download logs for a given day,
‘packageRank’ makes use of memoization via the ‘memoise’ package.
Here’s relevant code:
fetchLog <- function(x) data.table::fread(x)
mfetchLog <- memoise::memoise(fetchLog)
if (RCurl::url.exists(url)) {
cran_log <- mfetchLog(url)
}
If you use fetchLog()
, the log file, which can sometimes be as large
as 50 MB, will be downloaded every time you call the function. If you
use mfetchLog()
, logs are intelligently cached; those that have
already been downloaded, in your current R session, will not be
downloaded again.