Prof. Alexandra Chouldechova
94-842
Here's a sample analysis.
The analysis was shown only in class and is not viewable in this version of the notes.
Course overview
Introduction to R, RStudio and R Notebooks/R Markdown
Programming basics
No programming knowledge presumed
Some stats knowledge presumed. E.g.:
Class attendance is mandatory
Class will be very cumulative
Class participation (10%)
Quizzes (10%)
Homework assignments (35%)
Final project (45%)
Assignments, office hours, class notes, grading policies, useful references on R: http://www.andrew.cmu.edu/~achoulde/94842/
Canvas for gradebook and for turning in homework
Piazza for forum
Check the class website for everything else
No required textbook, but several are recommended:
This class will teach you to use R to:
- Generate graphical and tabular data summaries
- Perform statistical analyses (e.g., hypothesis testing, regression modeling)
- Produce reproducible statistical reports using R Markdown and R Notebooks
- Integrate R with other tools (e.g., databases, web, etc.)
Basic interaction with R is through typing in the console
This is the terminal or command-line interface
You type in commands, R gives back answers (or errors)
Menus and other graphical interfaces are extras built on top of the console
We will use RStudio in this class
Download R: http://lib.stat.cmu.edu/R/CRAN
Then download RStudio: http://www.rstudio.com/
RStudio is an IDE for R
RStudio has 4 main windows ('panes'):
- Source pane: create a file that you can save and run later
- Console pane: type or paste in commands to get output from R
- Workspace/History pane: see a list of variables or previous commands
- Files/Plots/Packages/Help pane: see plots, help pages, and other items in this window.
Use the Console pane to type or paste commands to get output from R
To look up the help file for a function or data set, type ?function into the Console, e.g. ?mean
Use the tab key to auto-complete function and object names
When you look up a help file (e.g., ?mean), the documentation will appear in the Help tab
R Markdown allows the user to integrate R code into a report
When data changes or code changes, so does the report
No more need to copy-and-paste graphics, tables, or numbers
Creates reproducible reports
Can output report in HTML (default), Microsoft Word, or PDF
Open RStudio on your machine
File > New File > R Markdown …
Change summary(cars) in the first code block to print("Hello world!")
Click Knit HTML to produce an HTML file.
Save your Rmd file as helloworld.Rmd
All of your Homework assignments and many of your Labs will take the form of a single Rmd file, which you will edit to include your solutions and then submit on Canvas.
Everything we'll do comes down to applying functions to data
Data: things like 7, “seven”, \( 7.000 \), the matrix \( \left[ \begin{array}{ccc} 7 & 7 & 7 \\ 7 & 7 & 7\end{array}\right] \)
Functions: things like \( \log{} \), \( + \) (two arguments), \( < \) (two), \( \mod{} \) (two), mean (one)
A function is a machine which turns input objects (arguments) into an output object (return value), possibly with side effects, according to a definite rule
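As a minimal sketch of this definition, the block below defines a hypothetical function, fahrenheit.to.celsius (not part of the course notes), which takes one argument, applies a fixed rule, prints a message as a side effect, and gives back a return value.

```r
# Hypothetical helper for illustration only (not from the course notes)
fahrenheit.to.celsius <- function(temp.f) {
  temp.c <- (temp.f - 32) * 5 / 9          # the definite rule
  message("Converting ", temp.f, " F")     # side effect: prints a message
  temp.c                                   # return value: the last expression
}

boiling.c <- fahrenheit.to.celsius(212)    # argument: 212
boiling.c                                  # 100
```

The message is printed whether or not you store the result; only the return value ends up in boiling.c.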
You'll encounter different kinds of data types
- Booleans: TRUE or FALSE in R
- Missing or ill-defined values: NA, NaN, etc.
You can use R as a very, very fancy calculator
Command | Description |
---|---|
+, -, *, / | add, subtract, multiply, divide |
^ | raise to the power of |
%% | remainder after division (ex: 8 %% 3 = 2) |
( ) | change the order of operations |
log(), exp() | natural logarithms and exponentials (ex: log(10) = 2.303) |
sqrt() | square root |
round() | round to the nearest whole number (ex: round(2.3) = 2) |
floor(), ceiling() | round down or round up |
abs() | absolute value |
7 + 5 # Addition
[1] 12
7 - 5 # Subtraction
[1] 2
7 * 5 # Multiplication
[1] 35
7 ^ 5 # Exponentiation
[1] 16807
7 / 5 # Division
[1] 1.4
7 %% 5 # Modulus
[1] 2
7 %/% 5 # Integer division
[1] 1
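Integer division and modulus fit together: the quotient times the divisor, plus the remainder, gives back the original number. A small sketch using the same values as above (the variable names are chosen here for illustration):

```r
a <- 7
b <- 5

quotient  <- a %/% b   # integer division: 1
remainder <- a %% b    # modulus: 2

# Rebuilding a from its quotient and remainder
quotient * b + remainder == a   # TRUE
```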
Comparisons are also binary operators; they take two objects, like numbers, and give a Boolean
7 > 5
[1] TRUE
7 < 5
[1] FALSE
7 >= 7
[1] TRUE
7 <= 5
[1] FALSE
7 == 5
[1] FALSE
7 != 5
[1] TRUE
Boolean operators & and | are basically “and” and “or”:
(5 > 7) & (6*7 == 42)
[1] FALSE
(5 > 7) | (6*7 == 42)
[1] TRUE
(will see special doubled forms, && and ||, later)
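It can help to store the result of each comparison in a named variable before combining them. The sketch below repeats the comparisons above with hypothetical variable names, and also shows the negation operator !, which is part of R but not covered in these notes so far.

```r
# Hypothetical names for the two comparisons used above
five.is.bigger <- 5 > 7         # FALSE
product.is.42  <- 6 * 7 == 42   # TRUE

five.is.bigger & product.is.42  # "and": FALSE, because one side is FALSE
five.is.bigger | product.is.42  # "or":  TRUE, because one side is TRUE
!five.is.bigger                 # "not": TRUE
```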
typeof() function returns the type
is.foo() functions return Booleans for whether the argument is of type foo
as.foo() (tries to) “cast” its argument to type foo, i.e., to translate it sensibly into a foo-type value
Special case: as.factor() will be important later for telling R when numbers are actually encodings and not numeric values. (E.g., 1 = High school grad; 2 = College grad; 3 = Postgrad)
typeof(7)
[1] "double"
is.numeric(7)
[1] TRUE
is.na(7)
[1] FALSE
is.character(7)
[1] FALSE
is.character("7")
[1] TRUE
is.character("seven")
[1] TRUE
is.na("seven")
[1] FALSE
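The as. functions are not demonstrated above, so here is a short sketch of casting. The vector education.codes is made-up data for illustration, using the 1/2/3 encoding mentioned earlier.

```r
as.numeric("7")       # the string "7" becomes the number 7
as.character(7)       # the number 7 becomes the string "7"

# A cast that cannot be done sensibly gives NA (with a warning, silenced here)
not.a.number <- suppressWarnings(as.numeric("seven"))
is.na(not.a.number)   # TRUE

# Hypothetical education codes: 1 = High school, 2 = College, 3 = Postgrad
education.codes <- c(1, 3, 2, 1)
education <- as.factor(education.codes)
levels(education)      # "1" "2" "3"
is.numeric(education)  # FALSE: R now treats the codes as categories
```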
We can give names to data objects; these give us variables
A few variables are built in:
pi
[1] 3.141593
Variables can be arguments to functions or operators, just like constants:
pi*10
[1] 31.41593
cos(pi)
[1] -1
Most variables are created with the assignment operator, <- or =
time.factor <- 12
time.factor
[1] 12
time.in.years = 2.5
time.in.years * time.factor
[1] 30
The assignment operator also changes values:
time.in.months <- time.in.years * time.factor
time.in.months
[1] 30
time.in.months <- 45
time.in.months
[1] 45
Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read
Avoid “magic constants”; use named variables
Use descriptive variable names
num.students <- 35  # descriptive: easy to tell what the value means
ns <- 35            # cryptic: avoid names like this
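To illustrate the point about magic constants, compare a bare calculation with one built from named variables. The group size below is a hypothetical value chosen for this sketch.

```r
# With magic constants: the reader has to guess what 35 and 5 mean
35 / 5

# With named variables: the calculation documents itself
num.students <- 35        # class size, as in the example above
students.per.group <- 5   # hypothetical group size
num.groups <- num.students / students.per.group
num.groups                # 7
```

If the class size changes, only the line defining num.students needs to be edited. Note that running this adds these names to your workspace, so a later call to ls() will list them too.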
What names have you defined values for?
ls()
[1] "time.factor" "time.in.months" "time.in.years"
Getting rid of variables:
rm("time.in.months")
ls()
[1] "time.factor" "time.in.years"
Group related data values into one object, a data structure
A vector is a sequence of values, all of the same type
c() function returns a vector containing all its arguments in order
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm <- c(80, 90, 93, 82, 95)
students
[1] "Sean" "Louisa" "Frank" "Farhad" "Li"
vec[1] is the first element, vec[4] is the 4th element of vec
students
[1] "Sean" "Louisa" "Frank" "Farhad" "Li"
students[4]
[1] "Farhad"
vec[-4] is a vector containing all but the fourth element
students[-4]
[1] "Sean" "Louisa" "Frank" "Li"
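A few further indexing behaviours are worth trying in the console. This sketch redefines students so it runs on its own; it shows that indexing returns a new vector and leaves the original unchanged, and that asking for an element past the end gives NA.

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")

students[length(students)]   # last element: "Li"

all.but.fourth <- students[-4]
all.but.fourth               # "Sean" "Louisa" "Frank" "Li"
length(students)             # still 5: the original vector is unchanged

students[6]                  # NA: there is no sixth element
```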
Operators apply to vectors “pairwise” or “elementwise”:
final <- c(78, 84, 95, 82, 91) # Final exam scores
midterm # Midterm exam scores
[1] 80 90 93 82 95
midterm + final # Sum of midterm and final scores
[1] 158 174 188 164 186
(midterm + final)/2 # Average exam score
[1] 79 87 94 82 93
course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades
[1] 78.8 86.4 94.2 82.0 92.6
Is the final score higher than the midterm score?
midterm
[1] 80 90 93 82 95
final
[1] 78 84 95 82 91
final > midterm
[1] FALSE FALSE TRUE FALSE FALSE
Boolean operators can be applied elementwise:
(final < midterm) & (midterm > 80)
[1] FALSE TRUE FALSE FALSE TRUE
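Elementwise comparisons produce a Boolean vector, which can then be summarised. The sketch below reuses the exam scores from above; the variable name improved is chosen here for illustration.

```r
midterm <- c(80, 90, 93, 82, 95)
final   <- c(78, 84, 95, 82, 91)

final - midterm               # change in score: -2 -6 2 0 -4

improved <- final > midterm   # FALSE FALSE TRUE FALSE FALSE
any(improved)                 # TRUE: at least one student improved
all(improved)                 # FALSE: not every student improved
sum(improved)                 # 1: TRUE counts as 1, FALSE as 0
```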
Command | Description |
---|---|
sum(vec) | sums up all the elements of vec |
mean(vec) | mean of vec |
median(vec) | median of vec |
min(vec), max(vec) | the smallest or largest element of vec |
sd(vec), var(vec) | the standard deviation and variance of vec |
length(vec) | the number of elements in vec |
pmax(vec1, vec2), pmin(vec1, vec2) | elementwise maximum or minimum (example: pmax(quiz1, quiz2) returns the higher of quiz 1 and quiz 2 for each student) |
sort(vec) | returns vec in sorted order |
order(vec) | returns the indices that sort the vector vec |
unique(vec) | lists the unique elements of vec |
summary(vec) | gives a five-number summary plus the mean |
any(vec), all(vec) | useful on Boolean vectors: is any element TRUE, or are all elements TRUE? |
course.grades
[1] 78.8 86.4 94.2 82.0 92.6
mean(course.grades) # mean grade
[1] 86.8
median(course.grades)
[1] 86.4
sd(course.grades) # grade standard deviation
[1] 6.625708
sort(course.grades)
[1] 78.8 82.0 86.4 92.6 94.2
max(course.grades) # highest course grade
[1] 94.2
min(course.grades) # lowest course grade
[1] 78.8
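The table also lists order() and pmax(), which the examples above do not exercise. Here is a sketch using the same scores; the names ranking and best.exam are chosen for illustration.

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm  <- c(80, 90, 93, 82, 95)
final    <- c(78, 84, 95, 82, 91)
course.grades <- 0.4 * midterm + 0.6 * final

# order() gives positions, not values: which element comes 1st, 2nd, ...
ranking <- order(course.grades)
ranking                       # 1 4 2 5 3
students[ranking]             # students from lowest to highest grade

# pmax() compares the two vectors element by element
best.exam <- pmax(midterm, final)
best.exam                     # 80 90 95 82 95
```

Using order() rather than sort() is what lets you rearrange a second vector (here, students) to match.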
students
[1] "Sean" "Louisa" "Frank" "Farhad" "Li"
Vector of indices:
students[c(2,4)]
[1] "Louisa" "Farhad"
Vector of negative indices:
students[c(-1,-3)]
[1] "Louisa" "Farhad" "Li"
which() returns the TRUE indexes of a Boolean vector:
course.grades
[1] 78.8 86.4 94.2 82.0 92.6
a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans
[1] FALSE FALSE TRUE FALSE TRUE
a.students <- which(course.grades >= a.threshold) # Applying which()
a.students
[1] 3 5
students[a.students] # Names of A students
[1] "Frank" "Li"
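A Boolean vector can also be used as an index directly, without going through which(). This sketch recomputes the grades so it runs on its own; the name is.a.student is chosen for illustration.

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm  <- c(80, 90, 93, 82, 95)
final    <- c(78, 84, 95, 82, 91)
course.grades <- 0.4 * midterm + 0.6 * final
a.threshold <- 90

# Keep the elements of students where the condition is TRUE
is.a.student <- course.grades >= a.threshold
students[is.a.student]          # "Frank" "Li"

# Same result as indexing with the positions from which()
students[which(is.a.student)]   # "Frank" "Li"
```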
You can give names to elements or components of vectors
students
[1] "Sean" "Louisa" "Frank" "Farhad" "Li"
names(course.grades) <- students # Assign names to the grades
names(course.grades)
[1] "Sean" "Louisa" "Frank" "Farhad" "Li"
course.grades[c("Sean", "Frank","Li")] # Get final grades for 3 students
Sean Frank Li
78.8 94.2 92.6
Note the labels in what R prints; these are not actually part of the value
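One way to see that the names are labels rather than part of the value is to strip them or to extract a single bare number. A small sketch, with the grades typed in directly so it runs on its own:

```r
course.grades <- c(78.8, 86.4, 94.2, 82.0, 92.6)
names(course.grades) <- c("Sean", "Louisa", "Frank", "Farhad", "Li")

course.grades["Li"]      # prints the label "Li" above 92.6
course.grades[["Li"]]    # double brackets give just the number 92.6

unname(course.grades)    # same values with the labels removed
mean(course.grades)      # calculations ignore the names
```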
Keystroke | Description |
---|---|
<tab> | autocompletes commands and filenames, and lists arguments for functions. Highly useful! |
<up> | cycle through previous commands in the console prompt |
<ctrl-up> | lists history of previous commands matching an unfinished one |
<ctrl-enter> | execute current line |
<ESC> | abort an unfinished command and get out of the + prompt |
“Homework” 0: Course survey
Lab 1: http://www.andrew.cmu.edu/~achoulde/94842/