fasteR: Fast Lane to Learning R!

"Becoming productive in R, as fast as possible"

Norm Matloff, Prof. of Computer Science, UC Davis; my bio

(See notice at the end of this document regarding copyright.)

This site is for those who know nothing of R, and maybe even nothing of programming, and seek QUICK, PAINLESS! entree to the world of R.

FAST: You'll already be doing good stuff in R -- useful data analysis

-- in your very first lesson.

For nonprogrammers: If you're comfortable with navigating

the Web and viewing charts, you're fine. This tutorial is aimed at you, not experienced C or Python coders.

Motivating: Every lesson centers around a real problem to be

solved, on real data. The lessons do not consist of a few toy examples, unrelated to the real world. The material is presented in a conversational, story-telling manner.

Just the basics, no frills or polemics:
- Notably, in the first few lessons, we do NOT use Integrated Development Environments (IDEs). RStudio, ESS etc. are great, but

you shouldn't be burdened with learning R and learning an IDE at the same time, a distraction from the goal of becoming productive in R as fast as possible.

 Note that even the excellent course by [R-Ladies

Sydney](https://threadreaderapp.com/thread/1119025557830684673.html), which does start with RStudio, laments that RStudio can be "way too overwhelming."

 So, in the initial lessons,  we stick to the R command line, and

focus on data analysis, not tools such as IDEs, which we will cover as an intermediate-level topic. (Some readers of this tutorial may already be using RStudio or an external editor, and the treatment here will include special instructions for them when needed.)

- Coverage is mainly limited to base R. So for instance the popular
  but self-described "opinionated" Tidyverse is not treated, partly

due to its controversial nature (I am a skeptic), but again mainly because it would be an obstacle to your becoming productive in R quickly.

While you can learn a few simple things in Tidy quickly, thinking

you are learning a lot, those things are quite limited in scope, and Tidy learners often find difficulty in applying R to real world data. Our tutorial here is aimed at learners whose goal is to USE the R system productively in their own data analysis.

Nonpassive approach: Passive learning, just watching the screen, is NO

learning. There will be occasional Your Turn sections, in which you the learner must devise and try your own variants on what has been presented. Sometimes the tutorial will be give you some suggestions, but even then, you should cook up your own variants to try. Remember: You get out what you put in! The more actively you work the Your Turn sections, the more powerful you will be as an R coder.

PART I

Lesson 1: Getting Started
Lesson 2: First R Steps
Lesson 3: Vectors and Indices
Lesson 4: More on Vectors
Lesson 5: On to Data Frames!
Lesson 6: The R Factor Class
Lesson 7: Extracting Rows/Columns from Data Frames
Lesson 8: More Examples of Extracting Rows, Columns
Lesson 9: The tapply Function
Lesson 10: Data Cleaning
Lesson 11: The R List Class
Lesson 12: Another Look at the Nile Data
Lesson 13: Pause to Reflect
Lesson 14: Introduction to Base R Graphics
Lesson 15: More on Base Graphics
Lesson 16: Writing Your Own Functions
Lesson 17: 'For' Loops
Lesson 18: Functions with Blocks
Lesson 19: Text Editing and IDes
Lesson 20: If, Else, Ifelse
Lesson 21: Do Pro Athletes Keep Fit?
Lesson 22: Linear Regression Analysis, I
Lesson 23: S3 Classes
Lesson 24: Baseball Player Analysis (cont'd.)
Lesson 25: R Packages, CRAN, Etc.

PART II

Lesson 26: A Pause, Before Going on to Advanced Topics
Lesson 27: A First Look at ggplot2
Lesson 28: Should You Use Functional Programming?
Lesson 29 Simple Text Processing, I
Lesson 30: Simple Text Processing, II
Lesson 31: Linear Regression Analysis, II
Lesson 32: Working with the R Date Class
Lesson 33: Tips on R Coding Style and Strategy
Lesson 34: The Logistic Model
Lesson 35: Files and Directories
Lesson 36: R 'while' Loops
To Learn More
Thanks
Appendix: Installing and Using RStudio

Lesson 1: Getting Started

For the time being, the main part of this online course will be this README.md file. It is set up as a potential R package, though, and I may implement that later.

The color figure at the top of this file was generated by our prVis package, run on a famous dataset called Swiss Roll.

Please note again:

Nonpassive learning is absolutely key! So even if the output of an R command is shown here, run the command yourself in your R console, by

copy-and-pasting from this document into the R console. You will get out of this tutorial what you put in.

Similarly, the Your Turn sections are absolutely crucial. Devise your own little examples, and try them! "When in doubt, Try it

out!" is a motto I devised for teaching. If you are unclear or curious about something, try it out! Just devise a little experiment, and type in the code. Don't worry -- you won't "break" things.

Tip: I cannot teach you how to

program. I can merely give you the tools, e.g. R vectors, and some examples. For a given desired programming task, then, you must creatively put these tools together to attain the goal. Treat it like a puzzle! I think you'll find that if you stick with it, you'll find you're pretty good at it. After all, we can all work puzzles.

Starting out:

You'll need to install R, from the R Project site. Start up R, either by clicking an icon or typing 'R' in a terminal window. We are not requiring RStudio here, but if you already have it, start it; you'll be typing into the R console, the Console pane.

As noted, this tutorial will be "bare bones." In particular, there is no script to type your command for you. Instead, you will either copy-and-paste from the text here into the R console, or type them there by hand. (Note that the code lines here will all begin with the R interactive prompt, '>'; that should not be typed.)

This is a Markdown file. You can read it right there on GitHub, which has its own Markdown renderer. Or you can download it to your own machine in Chrome and use the Markdown Reader extension to view it (be sure to enable Allow Access to File URLs).

When you end your R session, exit by typing 'quit()'.

Good luck! And if you have any questions, feel free to e-mail me, at matloff@cs.ucdavis.edu

Lesson 2: First R Steps

The R command prompt is '>'. Again, it will be shown here, but you don't type it. It is just there in your R window to let you know R is inviting you to submit a command. (If you are using RStudio, you'll see it in the Console pane.)

So, just type '1+1' then hit Enter. Sure enough, it prints out 2 (you were expecting maybe 12108?):

> 1 + 1
[1] 2

But what is that '[1]' here? It's just a row label. We'll go into that later, not needed quite yet.

Example: Nile River data

R includes a number of built-in datasets, mainly for illustration purposes. One of them is Nile, 100 years of annual flow data on the Nile River.

Let's find the mean flow:

> mean(Nile)
[1] 919.35

Here mean is an example of an R function, and in this case Nile is an argument -- fancy way of saying "input" -- to that function. That output, 919.35, is called the return value or simply value. The act of running the function is termed calling the function.

Another point to note is that we didn't need to call R's print function. We could have typed,

> print(mean(Nile))

but whenever we are at the R '>' prompt, any expression we type will be printed out.

Since there are only 100 data points here, it's not unwieldy to print them out. Again, all we have to do is type ``Nile,'' no need to call print:

> Nile
Time Series:
Start = 1871 
End = 1970 
Frequency = 1 
  [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020
 [16]  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220 1030 1100  774  840
 [31]  874  694  940  833  701  916  692 1020 1050  969  831  726  456  824  702
 [46] 1120 1100  832  764  821  768  845  864  862  698  845  744  796 1040  759
 [61]  781  865  845  944  984  897  822 1010  771  676  649  846  812  742  801
 [76] 1040  860  874  848  890  744  749  838 1050  918  986  797  923  975  815
 [91] 1020  906  901 1170  912  746  919  718  714  740

Now you can see how the row labels work. There are 15 numbers per row here, so the second row starts with the 16th, indicated by '[16]'. The third row starts with the 31st output number, hence the '[31]' and so on.

Note that a set of numbers such as Nile is called a vector.

A first graph

R has great graphics, not only in base R but also in wonderful user-contributed packages, such as ggplot2 and lattice. But we'll stick with base-R graphics for now, and save the more powerful yet more complex ggplot2 for a later lesson.

We'll start with a very simple, non-dazzling one, a no-frills histogram:

> hist(Nile)

No return value for the hist function (there is one, but it is seldom used, and we won't go into it here), but it does create the graph.

Your Turn: The hist function draws 10 bins for this dataset in the histogram by default, but you can choose other values, by specifying an optional second argument to the function, named breaks. E.g.
> hist(Nile,breaks=20)
would draw the histogram with 20 bins. Try plotting using several different large and small values of the number of bins.
Note: The hist function, as with many R functions, has many different options, specifiable via various arguments. For now, we'll just keep things simple, and resist the temptation to explore them all.

R has lots of online help, which you can access via '?'. E.g. typing

> ?hist

will tell you to full story on all the options available for the hist function. Again, there are far too many for you to digest for now (most users don't ever find a need for the more esoteric ones), but it's a vital resource to know.

Your Turn: Look at the online help for mean and Nile.

Lesson 3: Vectors and Indices

Say we want to find the mean river flow after year 1950.

The above output said that the Nile series starts in 1871. That means 1951 will be the 81st year, and the 100th will be 1970. How do we designate the 81st through 100th elements in this data?

Individual elements can be accessed using subscripts or indices (singular is index), which are specified using brackets, e.g.

> Nile[2]
[1] 1160

for the second element (the output we saw earlier shows that the second element is indeed 1160). The value 2 here is the index.

The c ("concatenate") function builds a vector, stringing several numbers together. E.g. we can get the 2nd, 5th and 6th elements of Nile:

> Nile[c(2,5,6)]
[1] 1160 1160 1160

If we wish to build a vector of consecutive numbers, we can use the "colon" notation:

> Nile[c(2,3,4)]
[1] 1160  963 1210
> Nile[2:4]
[1] 1160  963 1210

As you can see, 2:4 is shorter way to specify the vector c(2,3,4).

So, 81:100 means all the numbers from 81 to 100. Thus Nile[81:100] specifies the 81st through 100th elements in the Nile vector.

Then to answer the above question on the mean flow during 1951-1971, we can do

> mean(Nile[81:100])
[1] 877.05

NOTE: Observe how the above reasoning process worked. We had a goal, to find the mean river flow after 1950. We knew we had some tools available to us, namely the mean function and R vector indices. We then had to figure out a way to combine these tools in a manner that achieves our goal, which we did.
This is how use of R works in general. As you go through this tutorial, you'll add more and more to your R "toolbox." Then for any given goal, you'll rummage around in that toolbox, and eventually figure out the right set of tools for the goal at hand. Sometimes this will require some patience, but you'll find that the more you do, the more adept you become.

If we plan to do more with that time period, we should make a copy of it:

> n81100 <- Nile[81:100]
> mean(n81100)
[1] 877.05
> sd(n81100)
[1] 125.5583

The function sd finds the standard deviation.

Note that we used R's assignment operator here to copy ("assign") those particular Nile elements to n81100. (In most situations, you can use "=" instead of "<-", but why worry about what the exceptions might be? They are arcane, so it's easier just to always use "<-". And though "keyboard shortcuts" for this are possible, again let's just stick to the basics for now.)

Note too that though we will speak of the above operation as having "extracted" the 81st through 100th elements of Nile, we have merely made a copy of those elements. The original vector Nile remains intact.

Tip: We can pretty much choose any name we want; "n81100" just was chosen to easily remember this new vector's provenance. (But names can't include spaces, and must start with a letter.)

Note that n81100 now is a 21-element vector. Its first element is now element 81 of Nile:

> n81100[1]
[1] 744
> Nile[81]
[1] 744

Keep in mind that although Nile and n81100 now have identical contents, they are separate vectors; if one changes, the other will not.

Your Turn: Devise and try variants of the above, say finding the mean over the years 1945-1960.

Another oft-used function is length, which gives the number of elements in the vector, e.g.

> length(Nile)
[1] 100

Can you guess the value of length(n81100)? Type this expression in at the '>' prompt to check your answer.

Leave R by typing 'q()' or ctrl-d. (Answer no to saving the workspace.)

Recap: What have we learned in these first two lessons?

Starting and existing R.
Some R functions: mean, hist, length.
R vectors, and vector indices.
Extracting vector subsets.
Forming vectors, using c() and ":".

Not bad for Lesson 1! And needless to say, you'll be using all of these frequently in the subsequent lessons and in your own usage of R.

Lesson 4: More on Vectors

Continuing along the Nile, say we would like to know in how many years the level exceeded 1200. Let's first introduce R's sum function:

> sum(c(5,12,13))
[1] 30

Here the c function built a vector consisting of 5, 12 and 13. That vector was then fed into the sum function, returning 5+12+13 = 30.

By the way, the above is our first example of function composition, where the output of one function, c here, is fed as input into another, sum in this case.

We can now use this to answer our question on the Nile data:

> sum(Nile > 1200)
[1] 7

The river level exceeded 1200 in 7 years.

But how in the world did that work? Bear with me a bit here. Let's look at a small example first:

> x <- c(5,12,13)
> x
[1]  5 12 13
> x > 8
[1] FALSE  TRUE  TRUE
> sum(x > 8)
[1] 2

First, notice something odd here, in the expression x > 8. Here x is a vector, 3 elements in length, but 8 is just a number. It would seem that it's nnonsense to ask whether a vector is greater than a number; they're different animals.

But R makes them "the same kind" of animal, by extending that number 8 to a 3-element vector 8,8,8. This is called recycling. This sets up an element-by-element comparison: Then, the 5 in x is compared to the first 8, yielding FALSE i.e. 5 is NOT greater than 8. Then 12 is compared to the second 8, yielding TRUE, and then the comparison of 13 to the third 8 yields another TRUE. So, we get the vector FALSE,TRUE,TRUE.

Fine, but how will sum add up some TRUEs and FALSEs? The answer is that R, like most computer languages, treats TRUE and FALSE as 1 and 0, respectively. So we summed the vector (0,1,1), yielding 2.

Getting back to the question of the number of years in which the Nile flow exceeded 1200, let's look at that expression again:

> sum(Nile > 1200)

Since the vector Nile has length 100, that number 1200 will be recycled into a vector of one hundred copies of 1200. The '>' comparison will then yield 100 TRUEs and FALSEs, so summing gives us the number of TRUEs, exactly what we want.

Your Turn: Try a few other experiments of your choice using sum. I'd suggest starting with finding the sum of the first 25 elements in Nile. You may wish to start with experiments on a small vector, say (2,1,1,6,8,5), so you will know that your answers are correct. Remember, you'll learn best nonpassively. Code away!

A question related to how many years had a flow above 1200 is which years had that property. Well, R actually has a which function:

> which(Nile > 1200)
[1]  4  8  9 22 24 25 26

So the 4th, 8th, 9th etc. elements in Nile had the queried property. (Note that those were years 1875, 1879 and so on.)

In fact, that gives us another way to get the count of the years with that trait:

> which1200 <- which(Nile > 1200)
> which1200
[1]  4  8  9 22 24 25 26
> length(which1200)
[1] 7

Of course, as usual, my choice of the variable name "which1200" was arbirary, just something to help me remember what is stored in that variable.

R's length function does what it says, i.e. finding the length of a vector. In our context, that gives us the count of years with flow above 1200.

And, what were the river flows in those 7 years?

> which1200 <- which(Nile > 1200)
> Nile[which1200]
[1] 1210 1230 1370 1210 1250 1260 1220

Finally, something a little fancier. We can combine steps above:

> Nile[Nile > 1200]
[1] 1210 1230 1370 1210 1250 1260 1220

We just "eliminated the middle man," which1200. The R interpreter saw our "Nile > 1200", and thus generated the corresponding TRUEs and FALSEs. The R interpreter then treated those TRUEs and FALSEs as subscripts in Nile, thus extracting the desired data.

Now, we might add here, "Don't try this at home, kids." For beginners, it's really easier and more comfortable to break things into steps. Once, you become experienced at R, you may wish to start skipping steps.

Less bold is the notion of negative indices, e.g.

> x <- c(5,12,13,8)
> x[-1]  
[1] 12 13  8

Here we are asking for all of x except for x[1]. Can you guess what x[c(-1,-4)] evaluates to? Guess first, then try it out.

Recap: What have we learned in this lesson?

Here you've refined your skillset for R vectors, learning R's recycling feature, and two tricks that R users employ for finding counts of things.

Once again, as you progress through this tutorial, you'll see that these things are used a lot in R.

Lesson 5: On to Data Frames!

Right after vectors, the next major workhorse of R is the data frame. It's a rectangular table consisting of one row for each data point.

Say we have height, weight and age on each of 100 people. Our data frame would have 100 rows and 3 columns. The entry in, e.g., the second row and third column would be the age of the second person in our data. The second row as a whole would be all the data for that second person, i.e. the height, weight and age of that person.

Note that that row would also be cnsidered a vector. The third column as a whole would be the vector of all ages in our dataset.

As our first example, consider the ToothGrowth dataset built-in to R. Again, you can read about it in the online help by typing

> ?ToothGrowth
``` (The data turn out to be on guinea pigs, with orange juice or
Vitamin C as growth supplements.)  Let's take a quick look from the
command line.

``` r
> head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

R's head function displays (by default) the first 6 rows of the given dataframe. We see there are length, supplement and dosage columns, which the curator of the data decided to name 'len', 'supp' and 'dose'. Each of column is an R vector, or in the case of the second column, a vector-like object called a factor, to be discussed shortly).

Tip: To avoid writing out the long words repeatedly, it's handy to make a copy with a shorter name.

> tg <- ToothGrowth

Dollar signs are used to denote the individual columns, e.g. ToothGrowth$dose for the dose column. So for instance, we can print out the mean tooth length:

> mean(tg$len)
[1] 18.81333

Subscripts/indices in data frames are pairs, specifying row and column numbers. To get the element in row 3, column 1:

> tg[3,1]
[1] 7.3

which matches what we saw above in our head example. Or, use the fact that tg$len is a vector:

> tg$len[3]
[1] 7.3

The element in row 3, column 1 in the data frame tg is element 3 in the vector tg$letn. This duality between data frames and vectors is often exploited in R.

Your Turn: The above examples are fundamental to R, so you should conduct a few small experiments on your own this time, little variants of the above. The more you do, the better!

For any subset of a data frame d, we can extract whatever rows and columns we want using the format

d[the rows we want, the columns we want]

Some data frames don't have column names, but that is no obstacle. We can use column numbers, e.g.

> mean(tg[,1])
[1] 18.81333

Note the expression '[,1]'. Since there is a 1 in the second position, we are talking about column 1. And since the first position, before the comma, is empty, no rows are specified -- so all rows are included. That boils down to: all of column 1.

A key feature of R is that one can extract subsets of data frames, just as we extracted subsets of vectors earlier. For instance,

> z <- tg[2:5,c(1,3)]
> z
   len dose
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5

Here we extracted rows 2 through 5, and columns 1 and 3, assigning the result to z. To extract those columns but keep all rows, do

> y <- tg[ ,c(1,3)]

i.e. leave the row specification field empty.

By the way, note that the three columns are all of the same length, a requirement for data frames. And what is that common length in this case? R's nrow function tells us the number of rows in any data frame:

> nrow(ToothGrowth)
[1] 60

Ah, 60 rows (60 guinea pigs, 3 measurements each).

Or, alternatively:

> tg <- ToothGrowth
> length(tg$len)
[1] 60
> length(tg$supp)
[1] 60
> length(tg$dose)
[1] 60

So now you know four ways to do the same thing. But isn't one enough? Of course. But in this get-acquainted period, reading all four will help reinforce the knowledge you are now accumulating about R. So, make sure you understand how each of those four approaches produced the number 60.

The head function works on vectors too:

>  head(ToothGrowth$len)
[1]  4.2 11.5  7.3  5.8  6.4 10.0

Like many R functions, head has an optional second argument, specifying how many elements to print:

> head(ToothGrowth$len,10)
 [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0

You can create your own data frames -- good for devising little tests of your understanding -- as follows:

> x <- c(5,12,13)
> y <- c('abc','de','z')
> d <- data.frame(x,y)
> d
   x   y
1  5 abc
2 12  de
3 13   z

Look at that second line! Instead of vectors consisting of numbers, one can form vectors of character strings, complete with indexing capability, e.g.

> y <- c('abc','de','z')
> y[2]
[1] "de"

As noted, all the columns in a data frame must be of the same length. Here x consists of 3 numbers, and y consists of 3 character strings. (The string is the unit in the latter. The number of characters in each string is irrelevant.)

One can use negative indices for rows and columns as well, e.g.

> z <- tg[,-2]
> head(z)
   len dose
1  4.2  0.5
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5
6 10.0  0.5

Your Turn: Devise your own little examples with the ToothGrowth data. For instance, write code that finds the number of cases in which the tooth length was less than 16. Also, try some examples with another built-in R dataset, faithful. This one involves the Old Faithful geyser in Yellowstone National Park in the US. The first column gives duration of the eruption, and the second has the waiting time since the last eruption. As mentioned, these operations are key features of R, so devise and run as many examples as possible; err on the side of doing too many!

Recap: What have we learned in this lesson?

As mentioned, the data frame is the fundamental workhorse of R. It is made up of columns of vectors (of equal lengths), a fact that often comes in handy.

Unlike the single-number indices of vectors, each element in a data frame has 2 indices, a row number and a column number. One can specify sets of rows and columns to extra subframes.

One can use the R nrow function to query the number of rows in a data frame; ncol does the same for the number of columns.

Lesson 6: R Factor Class

Each object in R has a class. The number 3 is of the 'numeric' class, the character string 'abc' is of the 'character' class, and so on. (In R, class names are quoted; one can use single or double quotation marks.) Note that vectors of numbers are of 'numeric' class too; actually, a single number is considered to be a vector of length 1. So, c('abc','xw'), for instance, is 'character' as well.

Tip: Computers require one to be very, very careful and very, very precise. In that expression c('abc','xw') above, one might wonder why it does not evaluate to 'abcxw'. After all, didn't I say that the 'c' stands for "concatenate"? Yes, but the c function concatenates vectors. Here 'abc' is a vector of length 1 -- we have one character string, and the fact that it consists of 3 characters is irrelevant -- and likewise 'xw' is one character string. So, we are concatenating a 1-element vector with another 1-element vector, resulting in a 2-element vector.

What about tg and tg$supp in the Vitamin C example above? What are their classes?

> class(tg)
[1] "data.frame"
> class(tg$supp)
[1] "factor"

R factors are used when we have categorical variables. If in a genetics study, say, we have a variable for hair color, that might comprise four categories: black, brown, red, blond. We can find the list of categories for tg$supp as follows:

> levels(tg$supp)
[1] "OJ" "VC"

The categorical variable here is supp, the name the creator of this dataset chose for the supplement column. We see that there are two categories (levels), either orange juice or Vitamin C.

Note carefully that the values of an R factor must be quoted. Either single or double quote marks is fine (though the marks don't show up when we use head).

Factors can sometimes be a bit tricky to work with, but the above is enough for now. Let's see how to apply the notion in the current dataset.

Lesson 7: Extracting Rows/Columns from Data Frames

(The reader should cover this lesson especially slowly and carefully. The concepts are simple, but putting them together requires careful inspection.)

First, let's review what we saw in a previous lesson:

> which1200 <- which(Nile > 1200)
> Nile[which1200]
[1] 1210 1230 1370 1210 1250 1260 1220

There, we saw how to extract vector elements. We can do something similar to extract data frame rows or columns. Here is how:

Continuing the Vitamin C example, let's compare mean tooth length for the two types of supplements. Here is the code:

> whichOJ <- which(tg$supp == 'OJ')
> whichVC <- which(tg$supp == 'VC')
> mean(tg[whichOJ,1])
[1] 20.66333
> mean(tg[whichVC,1])
[1] 16.96333

In the first two lines above, we found which rows in tg (or equivalently, which elements in tg$supp) had the OJ supplement, and recorded those row numbers in whichOJ. Then we did the same for VC.

Now, look at the expression tg[whichOJ,1]. Remember, data frames are accessed with two subscript expressions, one for rows, one for colums, in the format

d[the rows we want, the columns we want]

So, tg[whichOJ,1] says to restrict attention to the OJ rows, and only column 1, tooth length. We then find the mean of those restricted numberss. This turned out to be 20.66333. Then do the same for VC.

Again, if we are pretty experienced, we can skip steps:

> tgoj <- tg[tg$supp == 'OJ',]
> tgvc <- tg[tg$supp == 'VC',]
> mean(tgoj$len)
[1] 20.66333
> mean(tgvc$len)
[1] 16.96333

Either way, we have the answer to our original question: Orange juice appeared to produce more growth than Vitamin C. (Of course, one might form a confidence interval for the difference etc.)

Recap: What have we learned in this lesson?

Just as we learned earlier how to use a sequence of TRUE and FALSE values to extract a parts of a vector, we now see how to do the analogous thing for data frames: We can use a sequence of TRUE and FALSE values to extract a certain rows or columns from a data frame.

It is imperative that the reader fully understand this lesson before continuing, trying some variations of the above example on his/her own. We'll be using this technique often in this tutorial, and it is central to R usage in the real world.

Your turn: Try some of these operations on R's built-in faithful dataset. For instance, find the number of eruptions for which the waiting time was more than 80 minutes.

Lesson 8: More Examples of Extracting Rows, Columns

Often we need to extract rows or columns from a data frame, subject to more than one condition. For instance, say we wish to extract from tg the sub-data frame consisting of OJ and length less than 8.8.

We could do this, using the ampersand symbol '&', which means a logical AND operation:

> tg[tg$supp=='OJ' & tg$len < 8.8,]
   len supp dose
37 8.2   OJ  0.5

Ah, it turns out that only one case satisfied both conditions.

If we want all rows that satisfy at least one of the conditions, not necessarily both, then we use the OR operator, '|'. Say we want to obtain all rows in which either len is greater than 28 or the treatment dose was 1.0 or both:

> tg[tg$len > 28 | tg$dose == 1.0,]
    len supp dose
11 16.5   VC    1
12 16.5   VC    1
13 15.2   VC    1
14 17.3   VC    1
15 22.5   VC    1
16 17.3   VC    1
17 13.6   VC    1
18 14.5   VC    1
19 18.8   VC    1
20 15.5   VC    1
23 33.9   VC    2
26 32.5   VC    2
30 29.5   VC    2
41 19.7   OJ    1
42 23.3   OJ    1
43 23.6   OJ    1
44 26.4   OJ    1
45 20.0   OJ    1
46 25.2   OJ    1
47 25.8   OJ    1
48 21.2   OJ    1
49 14.5   OJ    1
50 27.3   OJ    1
56 30.9   OJ    2
59 29.4   OJ    2

By the way, note that the original row numbers are displayed too. For example, the first case satisfying the conditions was row number 11 in the original data frame tg.

> w <- tg[tg$len > 28 | tg$dose == 1.0,]

Again, I chose the name 'w' arbitrarily. Names must begin with a letter, and consist only of letters, digits and a few special characters such as '-' or '.'

Note that w is a new data frame, on which we can perform the usual operations, e.g.

> head(w)
    len supp dose
11 16.5   VC    1
12 16.5   VC    1
13 15.2   VC    1
14 17.3   VC    1
15 22.5   VC    1
16 17.3   VC    1
> nrow(w)
[1] 25

We may only be interested in say, how many cases satisfied the given conditions. As before, we can use nrow for that, as seen here.

As seen early, we can also extract columns. Say our analysis will use only tooth length and dose. We write 'c(1,3)' in the "what columns we want" place, indicating columns 1 and 3:

> lendose <- tg[,c(1,3)]
> head(lendose)
   len dose
1  4.2  0.5
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5
6 10.0  0.5

From now on, we would work with lendose instead of tg.

It's a little nicer, though, the specify the columns by name instead of number:

> lendose <- tg[,c('len','dose')]
> head(lendose)
   len dose
1  4.2  0.5
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5
6 10.0  0.5

The logical operations work on vectors too. For example, say in the Nile data we wish to know how many years had flows in the extremes, say below 500 or above 1200:

> exts <- Nile[Nile < 800 | Nile > 1300]
> head(exts)
[1] 1370  799  774  694  701  692
> length(exts)
[1] 27

By the way, if this count were our only interest, i.e. we have no further use for exts, we can skip assigning to exts, and do things directly:

> length(Nile[Nile < 800 | Nile > 1300])
[1] 27

This is fine for advanced, experienced R users, but really, "one step at a time" is better for beginners.

Recap: What we've learned in this lesson

Here we got more practice in manipulating data frames, and were introduced to the logical operators '&' and '|'. We also saw another example of using nrow as a means of counting how many rows satisfy given conditions.

Again, these are all "bread and butter" operations that arise quite freqently in real world R usage.

By the way, note how the essence of R is "combining little things in order to do big things," e.g. combining the subsetting operation, the '&' operator, and nrow to get a count of rows satisfying given conditions. This too is the "bread and butter" of R. It's up to you, the R user, to creatively combine R's little operations (and later, some big ones) to achieve whatever goals you have for your data. Programming is a creative process. It's like a grocery store and cooking: The store has lots of different potential ingredients, and you decide which ones to buy and combine into a meal.

Your turn: Try some of these operations on R's built-in faithful dataset. For instance, find the number of eruptions for which 'eruptions' was greater than 3 and waiting time was more than 80 minutes.

Lesson 9: The tapply Function

Tip: Often in R there is a shorter, more compact way of doing things. That's the case here; we can use the magical tapply function in the above example. In fact, we can do it in just one line.

> tapply(tg$len,tg$supp,mean)
      OJ       VC 
20.66333 16.96333

In English: "Split the vector tg$len into two groups, according to the value of tg$supp, then apply mean to each group." Note that the result was returned as a vector, which we could save by assigning it to, say z:

> z <- tapply(tg$len,tg$supp,mean)
> z[1]
      OJ 
20.66333 
> z[2]
      VC 
16.96333

By the way, z is not only a vector, but also a named vector, meaning that its elements have names, in this case 'OJ' and 'VC'.

Saving can be quite handy, because we can use that result in subsequent code.

To make sure it is clear how this works, let's look at a small artificial example:

> x <- c(8,5,12,13)
> g <- c('M',"F",'M','M')

Suppose x is the ages of some kids, who are a boy, a girl, then two more boys, as indicated in g. For instance, the 5-year-old is a girl.

Let's call tapply:

> tapply(x,g,mean)
 F  M 
 5 11

That call said, "Split x into two piles, according to the corresponding elements of g, and then find the mean in each pile.

Note that it is no accident that x and g had the same number of elements above, 4 each. If on the contrary, g had 5 elements, that fifth element would be useless -- the gender of a nonexistent fifth child's age in x. Similarly, it wouldn't be right if g had had only 3 elements, apparently leaving the fourth child without a specified gender.

Tip: If g had been of the wrong length, we would have gotten an error, "Arguments must be of the same length." This is a common error in R code, so watch out for it, keeping in mind WHY the lengths must be the same.

Instead of mean, we can use any function as that third argument in tapply. Here is another example, using the built-in dataset PlantGrowth:

> tapply(PlantGrowth$weight,PlantGrowth$group,length)
ctrl trt1 trt2 
  10   10   10

Here tapply split the weight vector into subsets according to the group variable, then called the length function on each subset. We see that each subset had length 10, i.e. the experiment had assigned 10 plants to the control, 10 to treatment 1 and 10 to treatment 2.

Your Turn: One of the most famous built-in R datasets is mtcars, which has various measurements on cars from the 60s and 70s. Lots of opportunties for you to cook up little experiments here! You may wish to start by comparing the mean miles-per-gallon values for 4-, 6- and 8-cylinder cars. Another suggestion would be to find how many cars there are in each cylinder category, using tapply. As usual, the more examples you cook up here, the better!

By the way, the mtcars data frame has a "phantom" column.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

That first column seems to give the make (brand) and model of the car. Yes, it does -- but it's not a column. Behold:

> head(mtcars[,1])
[1] 21.0 21.0 22.8 21.4 18.7 18.1

Sure enough, column 1 is the mpg data, not the car names. But we see the names there on the far left! The resolution of this seeming contradiction is that those car names are the row names of this data frame:

> row.names(mtcars)
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"           
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"

So 'Mazda RX4' was the name of row 1, but not part of the row.

As with everything else, row.names is a function, and as you can see above, its return value here is a 32-element vector (the data frame had 32 rows, thus 32 row names). The elements of that vector are of class 'character', as is the vector itself.

You can even assign to that vector:

> row.names(mtcars)[7]
[1] "Duster 360"
> row.names(mtcars)[7] <- 'Dustpan'
> row.names(mtcars)[7]
[1] "Dustpan"

Inside joke, by the way. Yes, the example is real and significant, but the "Dustpan" thing came from a funny TV commercial at the time.

(If you have some background in programming, it may appear odd to you to have a function call on the left side of an assignment. This is actually common in R. It stems from the fact that '<-' is actually a function! But this is not the place to go into that.)

Your Turn: Try some experiments with the mtcars data, e.g. finding the mean horsepower for 6-cylinder cars.

Tip: As a beginner (and for that matter later on), you should NOT be obsessed with always writing code in the "optimal" way, including in terms of compactness of the code. It's much more important to write something that works and is clear; one can always tweak it later. In this case, though, tapply actually aids clarity, and it is so ubiquitously useful that we have introduced it early in this tutorial. We'll be using it more in later lessons.

Lesson 10: Data Cleaning

Most real-world data is "dirty," i.e. filled with errors. The famous New York taxi trip dataset, for instance, has one trip destination whose lattitude and longitude place it in Antartica! The impact of such erroneous data on one's statistical analysis can be anywhere from mild to disabling. Let's see below how one might ferret out bad data. And along the way, we'll cover several new R concepts.

We'll use the famous Pima Diabetes dataset. Various versions exist, but we'll use the one included in faraway, an R package compiled by Julian Faraway, author of several popular books on statistical regression analysis in R.

I've placed the data file, Pima.csv, on my Web site. Here is how you can read it into R:

> pima <- read.csv('http://heather.cs.ucdavis.edu/FasteR/data/Pima.csv',header=TRUE)

The dataset is in a CSV ("comma-separated values") file. Here we read it, and assigned the resulting data frame to a variable we chose to name pima.

Note that second argument, 'header=TRUE'. A header in a file, if one exists, is in the first line in the file. It states what names the columns in the data frame are to have. If the file doesn't have one, set header to FALSE. You can always add names to your data frame later (future lesson).

Tip: It's always good to take a quick look at a new data frame:

> head(pima)
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0
> dim(pima)
[1] 768   9

The dim function tells us that there are 768 people in the study, 9 variables measured on each.

Since this is a study of diabetes, let's take a look at the glucose variable. R's table function is quite handy.

> table(pima$glucose)

  0  44  56  57  61  62  65  67  68  71  72  73  74  75  76  77  78  79  80  81 
  5   1   1   2   1   1   1   1   3   4   1   3   4   2   2   2   4   3   6   6 
 82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 
  3   6  10   7   3   7   9   6  11   9   9   7   7  13   8   9   3  17  17   9 
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 
 13   9   6  13  14  11  13  12   6  14  13   5  11  10   7  11   6  11  11   6 
122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 
 12   9  11  14   9   5  11  14   7   5   5   5   6   4   8   8   5   8   5   5 
142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 
  5   6   7   5   9   7   4   1   3   6   4   2   6   5   3   2   8   2   1   3 
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 
  6   3   3   4   3   3   4   1   2   3   1   6   2   2   2   1   1   5   5   5 
182 183 184 186 187 188 189 190 191 193 194 195 196 197 198 199 
  1   3   3   1   4   2   4   1   1   2   3   2   3   4   1   1

Be careful here; the first, third, fifth and so on lines are the glucose values, while the second, fourth, sixth and so on lines are the counts of women having those values. For instance, 3 women had the glucose = 68.

Uh, oh! 5 women in the study had glucose level 0. And 1 had level 44, etc. Presumably 0 is not physiologically possible, and maybe not 44 either.

Let's consider a version of the glucose data that at least excludes these 0s.

> pg <- pima$glucose
> pg1 <- pg[pg > 0]
> length(pg1)
[1] 763

As before, the expression "pg > 0" creates a vector of TRUEs and FALSEs. The filtering "pg[pg > 0]" will only pick up the TRUE cases, and sure enough, we see that pg1 has only 763 cases, as opposed to the original 768.

Did removing the 0s make much difference? Turns out it doesn't:

> mean(pg)
[1] 120.8945
> mean(pg1)
[1] 121.6868

But still, these things can in fact have major impact in many statistical analyses.

R has a special code for missing values, NA, for situations like this. Rather than removing the 0s, it's better to recode them as NAs. Let's do this, back in the original dataset so we keep all the data in one object:

> pima$glucose[pima$glucose == 0] <- NA

Tip: That's pretty complicated. It's clearer to break things up into smaller steps (I recommend this especially for beginners), as follows:

> glc <- pima$glucose
> z <- glc == 0
> glc[z] <- NA
> pima$glucose <- glc

Here is what the code does:

That first line just makes a copy of the original vector, to avoid

clutter in the code.

The second line determines which elements of glc

are 0s, resulting in z being a vector of TRUEs and FALSEs.

The third line then assigns NA to those elements in glc corresponding to

the TRUEs. (Note the recycling of NA.)

Finally, we need to have the changes in the original data, so we copy glc to it.

Tip: Note again the double-equal sign! If we wish to test whether, say, a and b are equal, the expression must be "a == b", not "a = b"; the latter would do "a <- b". This is a famous beginning programmer's error.

As a check, let's verify that we now have 5 NAs in the glucose variable:

> sum(is.na(pima$glucose))
[1] 5

Here the built-in R function is.na will return a vector of TRUEs and FALSEs. Recall that those values can always be treated as 1s and 0s, thus summable. Thus we got our count, 5.

Let's also check that the mean comes out right:

> mean(pima$glucose)
[1] NA

What went wrong? By default, the mean function will not skip over NA values; thus the mean was reported as NA too. But we can instruct the function to skip the NAs:

> mean(pima$glucose,na.rm=TRUE)
[1] 121.6868

Your Turn: Determine which other columns in pima have suspicious 0s, and replace them with NA values.
Now, look again at the plot we made earlier of the Nile flow histogram. There seems to be a gap between the numbers at the low end and the rest. What years did these correspond to? Find the mean of the data, excluding these cases.

Lesson 11: The R List Class

We saw earlier how handy the tapply function can be. Let's look at a related one, split.

Earlier we mentioned the built-in dataset mtcars, a data frame. Consider mtcars$mpg, the column containing the miles-per-gallon data. Again, to save typing and avoid clutter in our code, let's make a copy first:

> mtmpg <- mtcars$mpg

Suppose we wish to split the original vector into three vectors, one for 4-cylinder cars, one for 6 and one for 8. We could do

> mt4 <- mtmpg[mtcars$cyl == 4]

and so on for mt6 and mt8.

Your Turn: In order to keep up, make sure you understand how that line of code works, with the TRUEs and FALSEs etc. First print out the value of mtcars$cyl == 4, and go from there.

But there is a cleaner way:

> mtl <- split(mtmpg,mtcars$cyl)
> mtl
$`4`
 [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

$`6`
[1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7

$`8`
 [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0
> class(mtl)
[1] "list"

In English, the call to split said, "Split mtmpg into multiple vectors, with the splitting criterion being the correspond values in mtcars$cyl."

Now mtl, an object of R class "list", contains the 3 vectors. We can access them individually with the dollar sign notation:

> mtl$`4`
 [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

Or, we can use indices, though now with double brackets:

> mtl[[1]]
 [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

Looking a little closer:

> head(mtcars$cyl)
[1] 6 6 4 6 8 6

We see that the first car had 6 cylinders, so the first element of mtmpg, 21.0, was thrown into the 6 pile, i.e. mtl[[2]] (see above printout of mtl), and so on.

And of course we can make copies for later convenience:

> m4 <- mtl[[1]]
> m6 <- mtl[[2]]
> m8 <- mtl[[3]]

Lists are especially good for mixing types together in one package:

> l <- list(a = c(2,5), b = 'sky')
> l
$a
[1] 2 5

$b
[1] "sky"

Note that here we can give names to the list elements, 'a' and 'b'. In forming mtl using split above, the names were assigned according to the values of the vector beiing split. (In that earlier case, we also needed backquotes , since the names were numbers.)

If we don't like those default names, we can change them:

> names(mtl) <- c('four','six','eight')
> mtl
$four
 [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

$six
[1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7

$eight
 [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8
15.0

What if we want, say, the MPG for the third car in the 6-cylinder category?

> mtl[[2]][3]
[1] 21.4

The point is that mtl[[2]] is a vector, so if we want element 3 of that vector, we tack on [3].

Or,

> mtl$six[3]
[1] 21.4

By the way, it's no coincidence that a dollar sign is used for delineation in both data frames and lists; data frames are lists. Each column is one element of the list. So for instance,

> mtcars[[1]]
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Here we used the double-brackets list notation to get the first element of the list, which is the first column of the data frame.

Your Turn Try using split on the ToothGrowth data, say splitting into groups according to the supplement, and finding various quantities.

Lesson 12: Another Look at the Nile Data

Here we'll learn several new concepts, using the Nile data as our starting point.

If you look again at the histogram of the Nile we generated, you'll see a gap between the lowest numbers and the rest. In what year(s) did those really low values occur? Let's plot the data against time:

> plot(Nile)

Looks like maybe 1912 or so was much lower than the rest. Is this an error? Or was there some big historical event then? This would require more than R to track down, but at least R can tell us which exact year or years correspond to the unusually low flow. Here is how:

We see from the graph that the unusually low value was below 600. We can use R's which function to see when that occurred:

> which(Nile < 600)       
[1] 43

As before, make sure to understand what happened in this code. The expression "Nile < 600" yields 100 TRUEs and FALSEs. The which then tells us which of those were TRUEs.

So, element 43 is the culprit here, corresponding to year 1871+42=1913. Again, we would have to find supplementary information in order to decide whether this is a genuine value or an error, but at least now we know the exact year.

Of course, since this is a small dataset, we could have just printed out the entire data and visually scanned it for a low number. But what if the length of the data vector had been 100,000 instead of 100? Then the visual approach wouldn't work.

Tip: Remember, a goal of programming is to automate tasks, rather than doing them by hand.

Your Turn: There appear to be some unusually high values as well, e.g. one around 1875. Determine which year this was, using the techniques presented here.
Also, try some similar analysis on the built-in AirPassengers data. Can you guess why those peaks are occurring?

Here is another point: That function plot is not quite so innocuous as it may seem. Let's run the same function, plot, but with two arguments instead of one:

> plot(mtcars$wt,mtcars$mpg)

In contrast to the previous plot, in which our data were on the vertical axis and time was on the horizontal, now we are plotting two vectors, against each other. This enables us to explore the relation between car weight and gas mileage.

There are a couple of important points here. First, as we might guess, we see that the heavier cars tended to get poorer gas mileage. But here's more: That plot function is pretty smart!

Why? Well, plot knew to take different actions for different input types. When we fed it a single vector, it plotted those numbers against time (or, against index). When we fed it two vectors, it knew to do a scatter plot.

In fact, plot was even smarter than that. It noticed that Nile is not just of 'numeric' type, but also of another class, 'ts' ("time series"):

> is.numeric(Nile)
[1] TRUE
> class(Nile)
[1] "ts"

So, plot put years on the horizontal axis, instead of indices 1,2,3,...

And one more thing: Say we wanted to know the flow in the year 1925. The data start at 1871, so 1925 is 1925 - 1871 = 54 years later. Since the 1871 number is in element 1 of the vector, that means the flow for the year 1925 is in element 1+54 = 55.

> Nile[55]
[1] 698

OK, but why did we do this arithmetic ourselves? We should have R do it:

> Nile[1 + 1925 - 1871]
[1] 698

R did the computation 1925 - 1871 + 1 itself, yielding 55, then looked up the value of Nile[55]. This is the start of your path to programming -- we try to automate things as much as possible, doing things by hand as little as possible.

Lesson 13: Pause to Reflect

Tip: Repeating an earlier point: How does one build a house? There of course is no set formula. One has various tools and materials, and the goal is to put these together in a creative way to produce the end result, the house.

It's the same with R. The tools here are the various functions, e.g. mean and which, and the materials are one's data. One then must creatively put them together to achieve one's goal, say ferreting out patterns in ridership in a public transportation system. Again, it is a creative process; there is no formula for anything. But that is what makes it fun, like solving a puzzle.

And...we can combine various functions in order to build our own functions. This will come in future lessons.

Lesson 14: Introduction to Base R Graphics

One of the greatest things about R is its graphics capabilities. There are excellent graphics features in base R, and then many contributed packages, with the best known being ggplot2 and lattice. These latter two are quite powerful, and will be the subjects of future lessons, but for now we'll concentrate on the base.

As our example here, we'll use a dataset I compiled on Silicon Valley programmers and engineers, from the US 2000 census. Let's read in the data and take a look at the first records:

> pe <- 
   read.table('https://raw.githubusercontent.com/matloff/fasteR/master/data/prgeng.txt',header=TRUE)
> head(pe)
       age educ occ sex wageinc wkswrkd
1 50.30082   13 102   2   75000      52
2 41.10139    9 101   1   12300      20
3 24.67374    9 102   2   15400      52
4 50.19951   11 100   1       0      52
5 51.18112   11 100   2     160       1
6 57.70413   11 100   1       0       0

We used read.table here because the file is not of the CSV type. It uses blank spaces rather than commas as its delineator between fields.

Here educ and occ are codes, for levels of education and different occupations. For now, let's not worry about the specific codes. (You can find them in the Census Bureau document. For instance, search for "Educational Attainment" for the educ variable.)

Let's start with a scatter plot of wage vs. age:

> plot(pe$age,pe$wageinc)

Oh no, the dreaded Black Screen Problem! There are about 20,000 data points, thus filling certain parts of the screen. So, let's just plot a random sample, say 2500. (There are other ways of handling the problem, say with smaller dots or alpha blending.)

> indxs <- sample(1:nrow(pe),2500)
> pe2500 <- pe[indxs,]

Recall that the nrow() function returns the number of rows in the argument, which in this case is 20090, the number of rows in pe.

R's sample function does what its name implies. Here it randomly samples 2500 of the numbers from 1 to 20090. We then extracted those rows of pe, in a new data frame pe2500.

Tip: Note again that it's clearer to break complex operations into simpler, smaller ones. I could have written the more compact

> pe2500 <- pe[sample(1:nrow(pe),2500),]

but it would be hard to read that way. I also use direct function composition sparingly, preferring to break

h(g(f(x),3)

into

y <- f(x) 
z <- g(y,3) 
h(z)

So, here is the new plot:

> plot(pe2500$age,pe2500$wageinc)

OK, now we are in business. A few things worth noting:

The relation between wage and age is not linear, indeed not even

monotonic. After the early 40s, one's wage tends to decrease. As with any observational dataset, the underlying factors are complex, but it does seem there is an age discrimination problem in Silicon Valley. (And it is well documented in various studies and litigation.)

Note the horizontal streaks at the very top and very bottom of the picture. Some people in the census had 0 income (or close to it), as

they were not working. And the census imposed a top wage limit of $350,000 (probably out of privacy concerns).

We can break things down by gender, via color coding:

> plot(pe2500$age,pe2500$wageinc,col=as.factor(pe2500$sex))

The col argument indicates we wish to color code, in this case by gender. Note that pe2500$sex is a numeric vector, but col requires an R factor; the function as.factor does the conversion.

The red dots are the women. (Details below.) Are they generally paid less than men? There seems to be a hint of that, but detailed statistical analysis is needed (a future lesson).

It would be good to have better labels on the axes, and maybe smaller dots:

> plot(pe2500$age,pe2500$wageinc,col=as.factor(pe2500$sex),xlab='age',ylab='wage',cex=0.6)

Here 'xlab' meant "X label" and similarly for 'ylab'. The argument 'cex = 0.6' means "Draw the dots at 60% of default size."

Now, how did the men's dots come out black and the women's red? The men were coded 1, the women 2. So men got color 1 in the default palette, black, and the women color 2, red.

There are many, many other features. More in a future lesson.

Your Turn: Try some scatter plots on various datasets. I suggest first using the above data with wage against age again, but this time color-coding by education level. (By the way, 1-9 codes no college; 10-12 means some college; 13 is a bachelor's degree, 14 a master's, 15 a professional degree and 16 is a doctorate.)

Lesson 15: More on Base Graphics

We can also plot multiple histograms on the same graph. But the pictures are more effective using a smoothed version of histograms, available in R's density function. Let's compare men's and women's wages in the census data.

First we use split to separate the data by gender:

> wageByGender <- split(pe$wageinc,pe$sex)
> dm <- density(wageByGender[[1]])
> dw <- density(wageByGender[[2]])

So, wageByGender[[1]] will now be the vector of men's wages, and similarly wageByGender[[2]] will have the women's wages.

The density function does not automatically draw a plot; it has the plot information in a return value, which we've assigned to dm and dw here. We can now plot the graph:

> plot(dw,col='red')
> points(dm,cex=0.2)

Why did we call the points function instead of plot in that second line? The issue is that calling plot again would destroy the first plot; we merely want to add points to the existing graph.

And why did we plot the women's data first? As you can see, the women's curve is taller, so if we plotted the men first, part of the women's curve would be cut off. Of course, we didn't know that ahead of time, but graphics often is a matter of trial-and-error to get to the picture we really want. (In the case of ggplot2, this is handled automatically by the software.)

Well, then, what does the graph tell us? The peak for women, occurring at a little less than $50,000, seems to be at a lower wage than that for men, at something like $60,000. At salaries around, say, $125,000, there seem to be more men than women. (Black curve higher than red curve. Remember, the curves are just smoothed histograms, so, if a curve is really high at, say 168.0, that means that 168.0 is a very frequently-occurring value.)

Your Turn: Try plotting multiple such curves on the same graph, for other data.

Lesson 16: Writing Your Own Functions

We've seen a number of R's built-in functions so far, but here comes the best part -- you can write your own functions.

Recall a line we had earlier:

> sum(Nile > 1200)

This gave us the count of the elements in the Nile data larger than 1200.
Now, say we want the mean of those elements:

> gt1200 <- which(Nile > 1200)
> nileSubsetGT1200 <- Nile[gt1200]
> mean(nileSubsetGT1200)
[1] 1250

As before, we could instead write a more compact version,

> mean(Nile[Nile > 1200])
[1] 1250

But it's best to do it step by step at first. Let's see how those steps work. Writing the code with line numbers for reference, the code is

1  gt1200Indices <- which(Nile > 1200)
2  nileSubsetGT1200 <- Nile[gt1200Indices]
3  mean(nileSubsetGT1200)

Let's review how this works:

In line 1, we find the indices in Nile for the elements larger than 1200.

fasteR: Fast Lane to Learning R!

"Becoming productive in R, as fast as possible"

Norm Matloff, Prof. of Computer Science, UC Davis; my bio

Table of Contents

Lesson 1: Getting Started

Please note again:

Starting out:

Lesson 2: First R Steps

Example: Nile River data

A first graph

Lesson 3: Vectors and Indices

Recap: What have we learned in these first two lessons?

Lesson 4: More on Vectors

Recap: What have we learned in this lesson?

Lesson 5: On to Data Frames!

Recap: What have we learned in this lesson?

Lesson 6: R Factor Class

Lesson 7: Extracting Rows/Columns from Data Frames

Recap: What have we learned in this lesson?

Lesson 8: More Examples of Extracting Rows, Columns

Recap: What we've learned in this lesson

Lesson 9: The tapply Function

Lesson 10: Data Cleaning

Lesson 11: The R List Class

Lesson 12: Another Look at the Nile Data

Lesson 13: Pause to Reflect

Lesson 14: Introduction to Base R Graphics

Lesson 15: More on Base Graphics

Lesson 16: Writing Your Own Functions

Copy Link

Version

Version

License

Maintainer

Last Published