The sample analysis was shown only in class and is not viewable in this version of the notes.
Fall 2020
Course overview
Introduction to R, RStudio and R Markdown
Programming basics
No programming knowledge presumed
Synchronous attendance is encouraged, but not required
Class will be very cumulative
Assignments, office hours, class notes, grading policies, useful references on R: http://www.andrew.cmu.edu/~achoulde/94842/
Canvas for gradebook and for turning in homework
Check the class website for everything else
This class will teach you to use R to:
Generate graphical and tabular data summaries
Efficiently manipulate data using tidyverse libraries
Perform statistical analyses (e.g., hypothesis testing, regression modeling)
Produce reproducible statistical reports using R Markdown
Free (open-source)
Programming language (not point-and-click)
Excellent graphics
Offers broadest range of statistical tools
Easy to generate reproducible reports
Easy to integrate with other tools
Basic interaction with R is through typing in the console
This is the terminal or command-line interface
You type in commands, R gives back answers (or errors)
Menus and other graphical interfaces are extras built on top of the console
We will use RStudio in this class
Download R: http://lib.stat.cmu.edu/R/CRAN
Then download RStudio: http://www.rstudio.com/
RStudio has 4 main windows (aka ‘panes’):
Source pane: create a file that you can save and run later
Console pane: type or paste in commands to get output from R
Workspace/History pane: see a list of variables or previous commands
Files/Plots/Packages/Help pane: see plots, help pages, and other items in this window.
Use the Console pane to type or paste commands to get output from R
To look up the help file for a function, type ?function into the Console (e.g., ?mean)
Use the tab key to auto-complete function and object names
Use the Source pane to create and edit R and Rmd files
If you request a help file (e.g., ?mean), the documentation will appear in the Help tab
R Markdown allows the user to integrate R code into a report
When data changes or code changes, so does the report
No more need to copy-and-paste graphics, tables, or numbers
Can output report in HTML (default), Microsoft Word, or PDF
To integrate R output into your report, you need to use R code chunks
Open RStudio on your machine
In a new R Markdown file, change summary(cars) in the first code block to print("Hello world!")
Click Knit HTML to produce an HTML file
Save your Rmd file as helloworld.Rmd
All of your Homework assignments and many of your Labs will take the form of a single Rmd file, which you will edit to include your solutions and then submit on Canvas
Everything we’ll do comes down to applying functions to data
Data: things like 7, “seven”, \(7.000\), the matrix \(\left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right]\)
Functions: things like \(\log{}\), \(+\) (two arguments), \(<\) (two), \(\mod{}\) (two), mean (one)
A function is a machine which turns input objects (arguments) into an output object (return value), possibly with side effects, according to a definite rule
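As a minimal sketch of this idea, the function below takes one argument, returns one value, and prints a message as a side effect. The name mean.with.message is hypothetical and chosen only for illustration; it is not part of R or of the course materials.

```r
# Hypothetical helper, for illustration only.
# Input (argument): a numeric vector.
# Output (return value): the mean of that vector.
# Side effect: a message printed to the console.
mean.with.message <- function(values) {
  result <- sum(values) / length(values)        # the "definite rule"
  cat("Averaged", length(values), "values\n")   # side effect
  result                                        # return value
}

avg <- mean.with.message(c(7, 7, 10))
avg   # 8
```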
You’ll encounter different kinds of data types
Booleans: direct binary values, TRUE or FALSE in R
Integers: whole numbers (positive, negative or zero)
Characters: fixed-length blocks of bits, with special coding; strings = sequences of characters
Floating point numbers: a fraction (with a finite number of bits) times an exponent, like \(1.87 \times {10}^{6}\)
Missing or ill-defined values: NA, NaN, etc.
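A short sketch of how these kinds of data look in R. The variable names are made up for this example. Note that R calls Booleans "logical" and floating point numbers "double".

```r
bool.value      <- TRUE      # Boolean ("logical" in R)
int.value       <- 7L        # the L suffix asks for an integer
double.value    <- 1.87e6    # floating point: 1.87 x 10^6
char.value      <- "seven"   # character string
missing.value   <- NA        # missing value
undefined.value <- 0 / 0     # ill-defined result: NaN

typeof(bool.value)       # "logical"
typeof(int.value)        # "integer"
typeof(double.value)     # "double"
typeof(char.value)       # "character"
is.na(missing.value)     # TRUE
is.nan(undefined.value)  # TRUE
```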
You can use R as a very, very fancy calculator
Command | Description |
---|---|
+, -, *, / | add, subtract, multiply, divide |
^ | raise to the power of |
%% | remainder after division (ex: 8 %% 3 = 2) |
( ) | change the order of operations |
log(), exp() | natural logarithm and exponential (ex: log(10) = 2.302585) |
sqrt() | square root |
round() | round to the nearest whole number (ex: round(2.3) = 2) |
floor(), ceiling() | round down or round up |
abs() | absolute value |
7 + 5 # Addition
## [1] 12
7 - 5 # Subtraction
## [1] 2
7 * 5 # Multiplication
## [1] 35
7 ^ 5 # Exponentiation
## [1] 16807
7 / 5 # Division
## [1] 1.4
7 %% 5 # Modulus
## [1] 2
7 %/% 5 # Integer division
## [1] 1
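The arithmetic operators are demonstrated above; the remaining entries of the command table can be tried the same way. This is only a sketch with arbitrary inputs.

```r
(7 + 5) * 2    # parentheses change the order of operations: 24
7 + 5 * 2      # without them, multiplication goes first: 17
sqrt(49)       # 7
abs(-7)        # 7
round(2.3)     # 2
floor(2.7)     # 2
ceiling(2.3)   # 3
exp(log(10))   # log() is the natural log, so exp() undoes it: 10 (up to rounding)
```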
Comparisons are also binary operators; they take two objects, like numbers, and give a Boolean
7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE
7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE
Boolean operators: basically “and” (&) and “or” (|):
(5 > 7) & (6*7 == 42)
## [1] FALSE
(5 > 7) | (6*7 == 42)
## [1] TRUE
(We will see the special doubled forms, && and ||, later)
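As a brief preview, and only a sketch: the doubled forms work on single TRUE/FALSE values and evaluate left to right, stopping as soon as the answer is known. The stop() calls below would raise an error if they were ever evaluated; they are there only to show that the right-hand side is skipped.

```r
x <- 5

# Left side is FALSE, so && never evaluates the right side
short.and <- (x > 7) && stop("never evaluated")
short.and   # FALSE

# Left side is TRUE, so || never evaluates the right side
short.or <- (x < 7) || stop("never evaluated")
short.or    # TRUE
```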
typeof() function returns the type
is.foo() functions return Booleans for whether the argument is of type foo
as.foo() (tries to) “cast” its argument to type foo, that is, to translate it sensibly into a foo-type value
Special case: as.factor() will be important later for telling R when numbers are actually encodings and not numeric values. (E.g., 1 = High school grad; 2 = College grad; 3 = Postgrad)
typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.na(7)
## [1] FALSE
is.character(7)
## [1] FALSE
is.character("7")
## [1] TRUE
is.character("seven")
## [1] TRUE
is.na("seven")
## [1] FALSE
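The is.foo() checks are shown above; the as.foo() casts can be sketched the same way. The education vector is made-up data that follows the 1/2/3 coding mentioned earlier.

```r
as.numeric("7") + 1    # "7" is cast to the number 7, so the result is 8
as.character(7)        # "7"
as.integer(7.9)        # 7: the fractional part is dropped

# Hypothetical survey codes: 1 = High school grad, 2 = College grad, 3 = Postgrad
education <- c(1, 3, 2, 1, 3)
education.factor <- as.factor(education)

levels(education.factor)       # "1" "2" "3"
is.numeric(education.factor)   # FALSE: R now treats the codes as categories
```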
We can give names to data objects; these give us variables
A few variables are built in:
pi
## [1] 3.141593
Variables can be arguments to functions or operators, just like constants:
pi*10
## [1] 31.41593
cos(pi)
## [1] -1
Most variables are created with the assignment operator, <- or =
time.factor <- 12
time.factor
## [1] 12
time.in.years = 2.5
time.in.years * time.factor
## [1] 30
The assignment operator also changes values:
time.in.months <- time.in.years * time.factor
time.in.months
## [1] 30
time.in.months <- 45
time.in.months
## [1] 45
Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read
Avoid “magic constants”; use named variables
num.students <- 35 # Good: the name says what the value means
ns <- 35 # Bad: the name is cryptic
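A small sketch of the difference, using the years-to-months conversion as the example. The names months.per.year, duration.in.years and duration.in.months are chosen here for illustration.

```r
# With a magic constant, the reader has to guess what 12 stands for
2.5 * 12

# With named variables, the calculation explains itself
months.per.year    <- 12
duration.in.years  <- 2.5
duration.in.months <- duration.in.years * months.per.year
duration.in.months   # 30
```

If the conversion factor ever changes, only the one line defining months.per.year needs editing.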
What names have you defined values for?
ls()
## [1] "time.factor" "time.in.months" "time.in.years"
Getting rid of variables:
rm("time.in.months")
ls()
## [1] "time.factor" "time.in.years"
Group related data values into one object, a data structure
A vector is a sequence of values, all of the same type
c() function returns a vector containing all its arguments in order
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm <- c(80, 90, 93, 82, 95)
students
## [1] "Sean" "Louisa" "Frank" "Farhad" "Li"
vec[1] is the first element, vec[4] is the 4th element of vec
students
## [1] "Sean" "Louisa" "Frank" "Farhad" "Li"
students[4]
## [1] "Farhad"
vec[-4] is a vector containing all but the fourth element
students[-4]
## [1] "Sean" "Louisa" "Frank" "Li"
Operators apply to vectors “pairwise” or “elementwise”:
final <- c(78, 84, 95, 82, 91) # Final exam scores
midterm # Midterm exam scores
## [1] 80 90 93 82 95
midterm + final # Sum of midterm and final scores
## [1] 158 174 188 164 186
(midterm + final)/2 # Average exam score
## [1] 79 87 94 82 93
course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades
## [1] 78.8 86.4 94.2 82.0 92.6
Is the final score higher than the midterm score?
midterm
## [1] 80 90 93 82 95
final
## [1] 78 84 95 82 91
final > midterm
## [1] FALSE FALSE TRUE FALSE FALSE
Boolean operators can be applied elementwise:
(final < midterm) & (midterm > 80)
## [1] FALSE TRUE FALSE FALSE TRUE
Command | Description |
---|---|
sum(vec) | sums up all the elements of vec |
mean(vec) | mean of vec |
median(vec) | median of vec |
min(vec), max(vec) | the smallest or largest element of vec |
sd(vec), var(vec) | the standard deviation and variance of vec |
length(vec) | the number of elements in vec |
pmax(vec1, vec2), pmin(vec1, vec2) | example: pmax(quiz1, quiz2) returns the higher of quiz 1 and quiz 2 for each student |
sort(vec) | returns vec in sorted order |
order(vec) | returns the indexes that sort the vector vec |
unique(vec) | lists the unique elements of vec |
summary(vec) | gives a five-number summary, plus the mean |
any(vec), all(vec) | useful on Boolean vectors |
course.grades
## [1] 78.8 86.4 94.2 82.0 92.6
mean(course.grades) # mean grade
## [1] 86.8
median(course.grades)
## [1] 86.4
sd(course.grades) # grade standard deviation
## [1] 6.625708
sort(course.grades)
## [1] 78.8 82.0 86.4 92.6 94.2
max(course.grades) # highest course grade
## [1] 94.2
min(course.grades) # lowest course grade
## [1] 78.8
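The table entries not exercised above (pmax, order, unique, length, any, all, summary) can be sketched with the same exam scores. The vectors are redefined here so the block runs on its own; best.exam and ranking are names chosen for this example.

```r
midterm <- c(80, 90, 93, 82, 95)
final   <- c(78, 84, 95, 82, 91)

best.exam <- pmax(midterm, final)   # higher of the two scores, student by student
best.exam                           # 80 90 95 82 95

ranking <- order(final)             # positions that would sort final
ranking                             # 1 4 2 5 3
final[ranking]                      # same result as sort(final)

unique(c(midterm, final))           # each distinct score listed once
length(final)                       # 5

any(final > midterm)                # TRUE: at least one student improved
all(final > midterm)                # FALSE: not every student improved

summary(final)                      # min, quartiles, median, mean, max
```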
students
## [1] "Sean" "Louisa" "Frank" "Farhad" "Li"
Vector of indices:
students[c(2,4)]
## [1] "Louisa" "Farhad"
Vector of negative indices
students[c(-1,-3)]
## [1] "Louisa" "Farhad" "Li"
which() returns the indexes of the TRUE entries of a Boolean vector:
course.grades
## [1] 78.8 86.4 94.2 82.0 92.6
a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans
## [1] FALSE FALSE TRUE FALSE TRUE
a.students <- which(course.grades >= a.threshold) # Applying which()
a.students
## [1] 3 5
students[a.students] # Names of A students
## [1] "Frank" "Li"
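As a related sketch, a Boolean vector can also be used directly as an index, which selects the same elements as going through which(). The data is redefined here so the block is self-contained; via.which, via.boolean and num.a.students are names chosen for this example.

```r
students      <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
course.grades <- c(78.8, 86.4, 94.2, 82.0, 92.6)
a.threshold   <- 90

via.which   <- students[which(course.grades >= a.threshold)]  # index by position
via.boolean <- students[course.grades >= a.threshold]         # index by TRUE/FALSE
identical(via.which, via.boolean)                             # TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the A students
num.a.students <- sum(course.grades >= a.threshold)
num.a.students   # 2
```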
You can give names to elements or components of vectors
students
## [1] "Sean" "Louisa" "Frank" "Farhad" "Li"
names(course.grades) <- students # Assign names to the grades
names(course.grades)
## [1] "Sean" "Louisa" "Frank" "Farhad" "Li"
course.grades[c("Sean", "Frank","Li")] # Get final grades for 3 students
##  Sean Frank    Li
##  78.8  94.2  92.6
Note the labels in what R prints; these are not actually part of the value
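A minimal sketch of that point: the names travel with the vector as labels, but comparisons and arithmetic work on the underlying numbers, and unname() strips the labels. The grades vector is a small made-up subset built with named arguments to c().

```r
grades <- c(Sean = 78.8, Frank = 94.2, Li = 92.6)

grades["Frank"]            # printed with its label
unname(grades["Frank"])    # the bare value, 94.2
grades["Frank"] > 90       # the comparison uses the value, not the label

bare.grades <- unname(grades)
names(bare.grades)         # NULL: the labels are gone, the values are unchanged
```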
Keystroke | Description |
---|---|
<tab> | autocompletes commands and filenames, and lists arguments for functions. Highly useful! |
<up> | cycle through previous commands in the console prompt |
<ctrl-up> | lists history of previous commands matching an unfinished one |
<ctrl-enter> | paste current line from source window to console. Good for trying out ideas from a source file. |
<ESC> | abort an unfinished command and get out of the + prompt |
“Homework” 0: Course survey
Lab 1: http://www.andrew.cmu.edu/~achoulde/94842/