Lecture 12 - Interfacing R with other tools

94-842
February 19, 2015

Agenda

  • Common import/export tasks

    • Importing data from other common statistics software
    • Exporting data from R
    • Exporting graphics from R
  • R with SQL

    • SQL queries on data frames using the sqldf package
    • RSQLite, RMySQL

Common import/export tasks

In the next set of slides we'll discuss the following tasks

  • Import/export from Excel spreadsheets (xlsx files)

  • Import/export from other common statistical analysis software

  • Saving data from your R workspace

  • Saving images and tables

Excel import/export: xlsx package

  • xlsx package contains useful functions for export/import/modification of xlsx files

  • One approach: export your xlsx file to csv then using read.csv()

    • Exporting to csv generally fails if your spreadsheet has formulas or special formatting
  • Better approach: use the read.xlsx() function from the xlsx package

Excel import/export: xlsx package

  read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
  startRow=NULL, endRow=NULL, colIndex=NULL,
  as.data.frame=TRUE, header=TRUE, colClasses=NA,
  keepFormulas=FALSE, encoding="unknown", ...)
  • A basic import call for a 1-sheet xlsx file with headers called spreadsheet.xlsx would simply look like
library(xlsx)
my.df <- read.xlsx("spreadsheet.xlsx", header = TRUE)
  • To output your data frame to an xlsx file, you can use
write.xlsx(x = my.df, file = "spreadsheet.xlsx", sheetName="Sheet1", ...) 
  • useful if you've put a lot of effort into cleaning your data in R, and you want to save your work in an Excel spreadsheet.

foreign package

  • The foreign package allows you to import data from various other kinds of statistical software.
Function name Description
read.dta Read Stata binary files
read.mtp Read a Minitab Portable Worksheet
read.spss Read an SPSS data file
read.ssd Obtain a Data Frame from a SAS Permanent Dataset, via read.xport
read.xport Read a SAS XPORT Format Library
write.dta Write Files in Stata Binary Format
write.foreign Write text files and code to read them
  • These functions are particularly useful for going back and forth between R and another language (e.g., analyze panel data in Stata, produce graphics in R)

Saving variables from your workspace

  • Fitting models to large data sets can take a lot of computation time

  • Instead of re-running all of your code every time you re-open R, you can save variables from your workspace (e.g., save that list of linear models that took plyr 30 minutes to produce)

  • To save your entire workspace, use

save.image(file = "myworkspace.RData")
  • To save a particular set of variables, use
save(..., file = "myvariables.RData")
  • (Replace … with variable names)

Saving variables from your workspace

  • E.g., if you have a data frame df and a linear model list lm.list that you want to store for later use, you can save by writing
save(df, lm.list, file = "myvariables.RData")
  • To load saved variables, just use the load() function
load(file = "myvariables.RData")

Exporting tables

  • Suppose you have a data frame or matrix that contains variable summaries (e.g., you have some output from your generateDataSummary() function from Homework 4)

  • To save the table, you can use write.table() or write.csv()

data.summary <- generateDataSummary(df, var.names, group.name)

# write.table approach, output as tab-delimited 
write.table(data.summary, file = "mytable.txt", sep = "\t")

# write.csv approach, output as csv file
write.table(data.summary, file = "mytable.csv")

Exporting graphics

  • By default, plots are output to your graphics window (Plots tab in RStudio)

  • You can save plots manually from the Plots tab by clicking Export > Save as PDF

  • To save them automatically you can use bmp(), jpeg(), png(), tiff() or pdf().

  • pdf() gives nice vectorized graphics, so it's often the best one to use

Exporting graphics: pdf()

  • The basic structure is as follows
pdf(file = "myplot.pdf", width = 6, height = 6) # Start a pdf graphics device
    # Code 
    # to
    # generate
    # plot
dev.off()  # Close the pdf graphics device

Exporting graphics: pdf()

Here's an example that saves plots in a loop

library(ggplot2)
# Generate and save a scatterplot of price vs carat for every level of cut
for(cut.level in levels(diamonds[["cut"]])) {
  file.name <- paste("./figures/", cut.level, ".pdf", sep = "") # Define file name based on cut
  pdf(file = file.name, width = 6, height = 5)  # Open pdf graphics device
  print(qplot(carat, price, colour = color, data = subset(diamonds, subset = cut == cut.level)))
  dev.off() # Close pdf graphics device
}

R with SQL

  • Most businesses use some kind of database for storing their data (commonly SQL)

  • We'll discuss two topics:

    • Using sqldf package to interact with R data frames with SQL queries
    • Using RSQLite and RMySQL to interact with SQL databases
  • Why bother with databases?

    • R data frames exist in memory, not on your hard disk
    • If you're only using a small fraction of your data, this is wasteful
    • Databases store data on your disk, and let you load into memory only the parts you need

sqldf package

  • If you already know some SQL, you might find R's data manipulation functions somewhat unnatural

  • sqldf allows you to use SQL queries on your data frames

  • The data frame is still stored in memory! (You can just query it in a different way)

sqldf

# Note, if you don't have X11 on your Mac, loading sqldf will crash your machine
library(sqldf)
  • Let's go back to our diamonds data set and see how we can interact with it using sqldf()

  • This is easier to do in a demo than with slides…