wakefield
<img src="inst/wakefield_logo/r_wakefield.png" width="60%", alt="">
Table of Contents
Installation
To download the development version of wakefield:
Download the zip
ball or tar
ball, decompress
and run R CMD INSTALL
on it, or use the pacman package to install
the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)
Help
Contact
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/wakefield/issues
- send a pull request on: https://github.com/trinker/wakefield/
- compose a friendly e-mail to: tyler.rinker@gmail.com
Demonstration
Getting Started
The r_data_frame
function (random data frame) takes n
(the number of
rows) and any number of variables (columns). These columns are typically
produced from a wakefield variable function. Each of these variable
functions has a pre-set behavior that produces a named vector of n
length, allowing the user to lazily pass unnamed functions (optionally,
without call parenthesis). The column name is hidden as a varname
attribute. For example here we see the race
variable function:
race(n=10)
## [1] White White White White White White Black White White Black
## Levels: White Hispanic Black Asian Bi-Racial Native Other Hawaiian
attributes(race(n=10))
## $levels
## [1] "White" "Hispanic" "Black" "Asian" "Bi-Racial" "Native"
## [7] "Other" "Hawaiian"
##
## $class
## [1] "variable" "factor"
##
## $varname
## [1] "Race"
When this variable is used inside of r_data_frame
the varname
is
used as a column name. Additionally, the n
argument is not set within
variable functions but is set once in r_data_frame
:
r_data_frame(
n = 500,
race
)
## Source: local data frame [500 x 1]
##
## Race
## (fctr)
## 1 White
## 2 Black
## 3 White
## 4 White
## 5 White
## 6 White
## 7 Hispanic
## 8 White
## 9 Black
## 10 White
## .. ...
The power of r_data_frame
is apparent when we use many modular
variable functions:
r_data_frame(
n = 500,
id,
race,
age,
sex,
hour,
iq,
height,
died
)
## Source: local data frame [500 x 8]
##
## ID Race Age Sex Hour IQ Height Died
## (chr) (fctr) (int) (fctr) (tims) (dbl) (dbl) (lgl)
## 1 001 White 35 Male 00:00:00 100 65 TRUE
## 2 002 White 35 Male 00:00:00 114 79 TRUE
## 3 003 White 20 Male 00:00:00 116 63 TRUE
## 4 004 White 20 Female 00:00:00 91 77 TRUE
## 5 005 White 23 Female 00:00:00 101 64 FALSE
## 6 006 Asian 27 Female 00:00:00 119 72 TRUE
## 7 007 Hispanic 26 Female 00:00:00 89 66 FALSE
## 8 008 White 24 Male 00:00:00 84 63 TRUE
## 9 009 White 26 Female 00:00:00 108 67 TRUE
## 10 010 Hispanic 30 Female 00:30:00 100 72 FALSE
## .. ... ... ... ... ... ... ... ...
There are 49 wakefield based variable functions to chose from,
spanning R's various data types (see ?variables
for details).
r_data_frame(
n = 500,
id,
Scoring = rnorm,
Smoker = valid,
race,
age,
sex,
hour,
iq,
height,
died
)
## Source: local data frame [500 x 10]
##
## ID Scoring Smoker Race Age Sex Hour IQ Height
## (chr) (dbl) (lgl) (fctr) (int) (fctr) (tims) (dbl) (dbl)
## 1 001 0.35729131 TRUE White 28 Male 00:00:00 80 72
## 2 002 0.03406854 TRUE White 20 Male 00:00:00 101 75
## 3 003 -0.21320876 TRUE White 30 Male 00:00:00 81 64
## 4 004 -0.14670245 TRUE White 32 Female 00:00:00 87 69
## 5 005 -0.30578953 FALSE White 33 Male 00:00:00 92 73
## 6 006 -0.31817961 TRUE White 29 Male 00:00:00 90 74
## 7 007 1.14035908 FALSE Black 31 Male 00:00:00 103 61
## 8 008 -1.72324249 FALSE Asian 22 Male 00:00:00 88 63
## 9 009 0.10694555 TRUE White 28 Male 00:00:00 111 66
## 10 010 -0.24042020 TRUE White 33 Male 00:00:00 94 70
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: Died (lgl)
r_data_frame(
n = 500,
id,
age, age, age,
grade, grade, grade
)
## Source: local data frame [500 x 7]
##
## ID Age_1 Age_2 Age_3 Grade_1 Grade_2 Grade_3
## (chr) (int) (int) (int) (dbl) (dbl) (dbl)
## 1 001 32 21 25 86.0 90.8 90.6
## 2 002 29 27 35 89.7 89.0 86.1
## 3 003 21 35 34 85.0 83.6 85.7
## 4 004 28 20 32 91.1 79.3 81.0
## 5 005 30 31 31 89.3 92.8 96.4
## 6 006 23 33 22 91.1 79.8 89.7
## 7 007 30 31 32 87.1 91.2 90.3
## 8 008 22 32 31 85.9 87.9 89.3
## 9 009 31 23 23 85.9 91.1 90.0
## 10 010 22 24 32 81.0 88.7 84.8
## .. ... ... ... ... ... ... ...
While passing variable functions to r_data_frame
without call
parenthesis is handy, the user may wish to set arguments. This can be
done through call parenthesis as we do with data.frame
or
dplyr::data_frame
:
r_data_frame(
n = 500,
id,
Scoring = rnorm,
Smoker = valid,
`Reading(mins)` = rpois(lambda=20),
race,
age(x = 8:14),
sex,
hour,
iq,
height(mean=50, sd = 10),
died
)
## Source: local data frame [500 x 11]
##
## ID Scoring Smoker Reading(mins) Race Age Sex Hour
## (chr) (dbl) (lgl) (int) (fctr) (int) (fctr) (tims)
## 1 001 -1.37609383 FALSE 18 White 11 Female 00:00:00
## 2 002 2.03200096 FALSE 14 White 12 Male 00:00:00
## 3 003 -0.13097071 FALSE 21 Black 13 Male 00:00:00
## 4 004 0.82349800 TRUE 10 White 9 Male 00:00:00
## 5 005 0.05255655 FALSE 10 Black 8 Male 00:00:00
## 6 006 -0.10398656 TRUE 12 White 10 Female 00:00:00
## 7 007 0.56926154 FALSE 23 White 14 Male 00:00:00
## 8 008 0.12042217 FALSE 23 Black 8 Female 00:00:00
## 9 009 0.20717877 TRUE 11 White 14 Female 00:30:00
## 10 010 2.21729855 FALSE 26 White 10 Female 00:30:00
## .. ... ... ... ... ... ... ... ...
## Variables not shown: IQ (dbl), Height (dbl), Died (lgl)
Random Missing Observations
Often data contains missing values. wakefield allows the user to add
a proportion of missing values per column/vector via the r_na
(random
NA
). This works nicely within a dplyr/magrittr %>%
then
pipeline:
r_data_frame(
n = 30,
id,
race,
age,
sex,
hour,
iq,
height,
died,
Scoring = rnorm,
Smoker = valid
) %>%
r_na(prob=.4)
## Source: local data frame [30 x 10]
##
## ID Race Age Sex Hour IQ Height Died Scoring
## (chr) (fctr) (int) (fctr) (tims) (dbl) (dbl) (lgl) (dbl)
## 1 01 NA NA NA <NA> 92 NA FALSE 0.1645463
## 2 02 NA NA NA 01:30:00 NA 67 TRUE -0.9523500
## 3 03 White 29 Female <NA> 100 NA FALSE -1.1061811
## 4 04 White NA NA <NA> NA 67 FALSE -0.4119794
## 5 05 NA 31 Male <NA> NA 64 NA 1.6834534
## 6 06 Hispanic 30 Female 03:30:00 NA 68 NA -0.3526578
## 7 07 Black 25 NA <NA> 84 71 FALSE 0.3983277
## 8 08 NA NA NA 04:00:00 NA NA TRUE NA
## 9 09 NA 33 Female 04:30:00 95 73 FALSE NA
## 10 10 White 30 NA <NA> NA 68 NA NA
## .. ... ... ... ... ... ... ... ... ...
## Smoker
## (lgl)
## 1 NA
## 2 NA
## 3 NA
## 4 TRUE
## 5 NA
## 6 TRUE
## 7 TRUE
## 8 TRUE
## 9 TRUE
## 10 TRUE
## .. ...
Repeated Measures & Time Series
The r_series
function allows the user to pass a single wakefield
function and dictate how many columns (j
) to produce.
set.seed(10)
r_series(likert, j = 3, n=10)
## Source: local data frame [10 x 3]
##
## Likert_1 Likert_2 Likert_3
## (fctr) (fctr) (fctr)
## 1 Neutral Disagree Strongly Disagree
## 2 Agree Neutral Disagree
## 3 Neutral Strongly Agree Disagree
## 4 Disagree Neutral Agree
## 5 Strongly Agree Agree Neutral
## 6 Agree Neutral Disagree
## 7 Agree Strongly Agree Strongly Disagree
## 8 Agree Agree Agree
## 9 Disagree Agree Disagree
## 10 Neutral Strongly Disagree Agree
Often the user wants a numeric score for Likert type columns and similar
variables. For series with multiple factors the as_integer
converts
all columns to integer values. Additionally, we may want to specify
column name prefixes. This can be accomplished via the variable
function's name
argument. Both of these features are demonstrated
here.
set.seed(10)
as_integer(r_series(likert, j = 5, n=10, name = "Item"))
## Source: local data frame [10 x 5]
##
## Item_1 Item_2 Item_3 Item_4 Item_5
## (int) (int) (int) (int) (int)
## 1 3 2 1 3 4
## 2 4 3 2 5 4
## 3 3 5 2 5 5
## 4 2 3 4 1 2
## 5 5 4 3 3 4
## 6 4 3 2 2 5
## 7 4 5 1 1 5
## 8 4 4 4 1 3
## 9 2 4 2 2 5
## 10 3 1 4 3 1
r_series
can be used within a r_data_frame
as well.
set.seed(10)
r_data_frame(n=100,
id,
age,
sex,
r_series(likert, 3, name = "Question")
)
## Source: local data frame [100 x 6]
##
## ID Age Sex Question_1 Question_2
## (chr) (int) (fctr) (fctr) (fctr)
## 1 001 28 Male Agree Agree
## 2 002 24 Male Neutral Strongly Agree
## 3 003 26 Male Disagree Neutral
## 4 004 31 Male Strongly Disagree Neutral
## 5 005 21 Female Strongly Agree Strongly Disagree
## 6 006 23 Female Disagree Disagree
## 7 007 24 Female Disagree Strongly Agree
## 8 008 24 Male Strongly Disagree Agree
## 9 009 29 Female Agree Strongly Agree
## 10 010 26 Male Strongly Disagree Strongly Disagree
## .. ... ... ... ... ...
## Variables not shown: Question_3 (fctr)
set.seed(10)
r_data_frame(n=100,
id,
age,
sex,
r_series(likert, 5, name = "Item", integer = TRUE)
)
## Source: local data frame [100 x 8]
##
## ID Age Sex Item_1 Item_2 Item_3 Item_4 Item_5
## (chr) (int) (fctr) (int) (int) (int) (int) (int)
## 1 001 28 Male 4 4 1 1 1
## 2 002 24 Male 3 5 2 1 2
## 3 003 26 Male 2 3 2 1 2
## 4 004 31 Male 1 3 2 4 3
## 5 005 21 Female 5 1 1 5 4
## 6 006 23 Female 2 2 4 3 4
## 7 007 24 Female 2 5 1 5 2
## 8 008 24 Male 1 4 4 5 5
## 9 009 29 Female 4 5 5 4 3
## 10 010 26 Male 1 1 4 1 2
## .. ... ... ... ... ... ... ... ...
Related Series
The user can also create related series via the relate
argument in
r_series
. It allows the user to specify the relationship between
columns. relate
may be a named list of or a short hand string of the
form of "fM_sd"
where:
f
is one of (+, -, *, /)M
is a mean valuesd
is a standard deviation of the mean value
For example you may use relate = "*4_1"
. If relate = NULL
no
relationship is generated between columns. I will use the short hand
string form here.
Some Examples With Variation
r_series(grade, j = 5, n = 100, relate = "+1_6")
## Source: local data frame [100 x 5]
##
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5
## (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 84.5 92.5 91.6 87.4 76.7
## 2 93.1 85.0 81.8 87.8 91.3
## 3 81.6 67.5 52.6 48.8 56.8
## 4 92.5 89.3 95.3 102.2 94.5
## 5 96.6 95.9 98.7 115.9 114.7
## 6 89.7 88.1 88.8 89.0 86.4
## 7 92.8 91.7 98.3 98.7 101.6
## 8 92.1 92.9 92.6 85.5 93.1
## 9 90.6 96.9 103.9 107.6 106.2
## 10 96.0 94.8 84.3 91.1 106.6
## .. ... ... ... ... ...
r_series(age, 5, 100, relate = "+5_0")
## Source: local data frame [100 x 5]
##
## Age_1 Age_2 Age_3 Age_4 Age_5
## (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 24 29 34 39 44
## 2 24 29 34 39 44
## 3 27 32 37 42 47
## 4 22 27 32 37 42
## 5 32 37 42 47 52
## 6 27 32 37 42 47
## 7 21 26 31 36 41
## 8 29 34 39 44 49
## 9 35 40 45 50 55
## 10 33 38 43 48 53
## .. ... ... ... ... ...
r_series(likert, 5, 100, name ="Item", relate = "-.5_.1")
## Source: local data frame [100 x 5]
##
## Item_1 Item_2 Item_3 Item_4 Item_5
## (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 2 1 0 -1 -1
## 2 3 2 1 1 0
## 3 1 1 1 0 0
## 4 4 3 3 2 1
## 5 2 1 1 0 0
## 6 2 1 1 1 0
## 7 1 0 0 -1 -2
## 8 2 2 1 1 0
## 9 2 2 1 0 0
## 10 3 3 3 3 3
## .. ... ... ... ... ...
r_series(grade, j = 5, n = 100, relate = "*1.05_.1")
## Source: local data frame [100 x 5]
##
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5
## (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 85.7 94.27 113.124 113.1240 113.1240
## 2 86.4 77.76 77.760 85.5360 85.5360
## 3 90.6 99.66 89.694 98.6634 108.5297
## 4 89.1 89.10 89.100 71.2800 71.2800
## 5 87.0 95.70 114.840 103.3560 113.6916
## 6 93.9 103.29 123.948 136.3428 136.3428
## 7 80.1 72.09 64.881 84.3453 84.3453
## 8 91.7 110.04 132.048 132.0480 145.2528
## 9 87.4 96.14 96.140 105.7540 116.3294
## 10 92.9 92.90 83.610 91.9710 101.1681
## .. ... ... ... ... ...
Adjust Correlations
Use the sd
command to adjust correlations.
round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1 1.00 0.85 0.64 0.39 0.28 0.25 0.28 0.15
## Grade_2 0.85 1.00 0.86 0.68 0.61 0.56 0.56 0.47
## Grade_3 0.64 0.86 1.00 0.77 0.70 0.80 0.86 0.78
## Grade_4 0.39 0.68 0.77 1.00 0.94 0.80 0.65 0.74
## Grade_5 0.28 0.61 0.70 0.94 1.00 0.85 0.69 0.73
## Grade_6 0.25 0.56 0.80 0.80 0.85 1.00 0.92 0.89
## Grade_7 0.28 0.56 0.86 0.65 0.69 0.92 1.00 0.91
## Grade_8 0.15 0.47 0.78 0.74 0.73 0.89 0.91 1.00
round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1 1 1 1 1 1 1 1 1
## Grade_2 1 1 1 1 1 1 1 1
## Grade_3 1 1 1 1 1 1 1 1
## Grade_4 1 1 1 1 1 1 1 1
## Grade_5 1 1 1 1 1 1 1 1
## Grade_6 1 1 1 1 1 1 1 1
## Grade_7 1 1 1 1 1 1 1 1
## Grade_8 1 1 1 1 1 1 1 1
round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1 1.00 0.26 0.27 0.40 0.21 -0.21 -0.36 -0.41
## Grade_2 0.26 1.00 0.77 0.60 0.64 0.50 0.53 0.46
## Grade_3 0.27 0.77 1.00 0.78 0.76 0.66 0.62 0.66
## Grade_4 0.40 0.60 0.78 1.00 0.95 0.76 0.59 0.55
## Grade_5 0.21 0.64 0.76 0.95 1.00 0.82 0.65 0.61
## Grade_6 -0.21 0.50 0.66 0.76 0.82 1.00 0.90 0.82
## Grade_7 -0.36 0.53 0.62 0.59 0.65 0.90 1.00 0.94
## Grade_8 -0.41 0.46 0.66 0.55 0.61 0.82 0.94 1.00
round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)
## Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1 1.00 -0.10 -0.50 -0.39 -0.25 -0.52 -0.26 -0.31
## Grade_2 -0.10 1.00 0.74 0.50 0.13 0.03 0.36 0.46
## Grade_3 -0.50 0.74 1.00 0.81 0.48 0.41 0.71 0.78
## Grade_4 -0.39 0.50 0.81 1.00 0.75 0.66 0.58 0.75
## Grade_5 -0.25 0.13 0.48 0.75 1.00 0.91 0.70 0.74
## Grade_6 -0.52 0.03 0.41 0.66 0.91 1.00 0.58 0.57
## Grade_7 -0.26 0.36 0.71 0.58 0.70 0.58 1.00 0.78
## Grade_8 -0.31 0.46 0.78 0.75 0.74 0.57 0.78 1.00
Visualize the Relationship
dat <- r_data_frame(12,
name,
r_series(grade, 100, relate = "+1_6")
)
dat %>%
gather(Time, Grade, -c(Name)) %>%
mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
geom_line(size=.8) +
theme_bw()
Expanded Dummy Coding
The user may wish to expand a factor
into j
dummy coded columns. The
r_dummy
function expands a factor into j
columns and works similar
to the r_series
function. The user may wish to use the original factor
name as the prefix to the j
columns. Setting prefix = TRUE
within
r_dummy
accomplishes this.
set.seed(10)
r_data_frame(n=100,
id,
age,
r_dummy(sex, prefix = TRUE),
r_dummy(political)
)
## Source: local data frame [100 x 9]
##
## ID Age Sex_Male Sex_Female Democrat Republican Constitution
## (chr) (int) (int) (int) (int) (int) (int)
## 1 001 28 1 0 1 0 0
## 2 002 24 1 0 1 0 0
## 3 003 26 1 0 0 1 0
## 4 004 31 1 0 0 1 0
## 5 005 21 0 1 1 0 0
## 6 006 23 0 1 0 1 0
## 7 007 24 0 1 0 1 0
## 8 008 24 1 0 0 0 0
## 9 009 29 0 1 1 0 0
## 10 010 26 1 0 0 1 0
## .. ... ... ... ... ... ... ...
## Variables not shown: Libertarian (int), Green (int)
Visualizing Column Types
It is helpful to see the column types and NA
s as a visualization. The
table_heat
(also the plot
method assigned to tbl_df
as well) can
provide visual glimpse of data types and missing cells.
set.seed(10)
r_data_frame(n=100,
id,
dob,
animal,
grade, grade,
death,
dummy,
grade_letter,
gender,
paragraph,
sentence
) %>%
r_na() %>%
plot(palette = "Set1")