Empirical Research Methods II

Spring 2000

 

Lab 2: Regression

 

 

1)      FTP and Course Data Sets

2)      Turning in Assignments

3)      Normal Distributions and the Central Limit Theorem

4)      Correlation

5)      Correlation and Regression

6)      Regression and Measurement Error

 

 

1. FTP and Course Data Sets

The data sets we will be using for the course are all stored in:

/afs/andrew.cmu.edu/course/88/241/data

 

So the first thing we will do is use ftp to download a file from this directory.

 

 

Figure 1. Calling FTP within a Command Prompt

 

1)     Go to the Programs menu under Start (bottom left of the screen). Find and select Command Prompt.

2)     At the prompt inside the Command Prompt window, change directories to c:\temp, a local directory you can write to while you are on the machine in this cluster, by typing: cd   c:\temp

3)     Now invoke ftp by typing: ftp  unix.andrew.cmu.edu  (see Figure 1). Login with your andrew id and password.

4)     Change directories on the andrew side by typing:

cd   /afs/andrew.cmu.edu/course/88/241/data

5)     Go into binary mode by typing: bin

6)     Get the file lab2qs.doc by typing: get  lab2qs.doc

7)     Don't quit ftp

 

2. Turning In Assignments

Lab assignments should be turned in as MS Word files. To make things easier on everyone, we want you to answer the questions for this lab by inserting your answers into an MS Word file that already has the questions, then by saving it with the appropriate name, and finally turning it in via ftp as you have already been doing.  To do this:

1.      Open MS Word (not in ftp, but rather from the Start menu), and open the file you just downloaded: lab2qs.doc, which should be sitting in the c:\temp directory.

2.      In Word, fill in your name and email id, and then go to the File menu, choose "Save As", and name the file <your-email-id>-lab2.doc.  E.g., since my email address is scheines@andrew.cmu.edu, my file name would be: scheines-lab2.doc

 

You can copy Minitab results and graphs into this file, and turn it in electronically at any time before the beginning of the next class.  To turn in your file electronically, use ftp in binary mode and deposit the file in the /afs/andrew.cmu.edu/course/88/241/handin directory.   If you already have ftp running, follow steps 4, 5 and 6 below.  To do it from scratch, follow all of these steps:

 

1.      Open a Command Prompt.

2.      cd into the directory in which your Word file resides (e.g., c:\temp).

3.      Type: ftp  unix.andrew.cmu.edu  and log in with your andrew id and password.

4.      To enter binary transfer mode, type: bin

5.      To change the remote target directory,

type: cd   /afs/andrew.cmu.edu/course/88/241/handin

6.      To transfer the file, type: put <file-name>
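For readers who prefer a script to the interactive ftp session, the steps above can be sketched in Python with the standard ftplib module. This is only an illustration: the function names handin_name and turn_in are hypothetical, and the sketch assumes the server still accepts plain FTP logins as described in this lab.

```python
import os
from ftplib import FTP

# Host and hand-in directory as given in this lab.
HOST = "unix.andrew.cmu.edu"
HANDIN_DIR = "/afs/andrew.cmu.edu/course/88/241/handin"


def handin_name(email_id):
    """Build the required file name, e.g. 'scheines' -> 'scheines-lab2.doc'."""
    return f"{email_id}-lab2.doc"


def turn_in(local_path, user, password, host=HOST, remote_dir=HANDIN_DIR):
    """Upload one file in binary mode (hypothetical helper, not a course tool)."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)   # step 3: log in
        ftp.cwd(remote_dir)                     # step 5: change remote directory
        with open(local_path, "rb") as handle:
            # storbinary transfers in binary (image) mode, like 'bin' + 'put'.
            ftp.storbinary(f"STOR {os.path.basename(local_path)}", handle)
```

A call would look like turn_in(r"c:\temp\scheines-lab2.doc", "scheines", password), with the password read interactively (for example with getpass.getpass()) rather than typed into the script.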

 

You cannot overwrite your Word file once it is turned in, so make sure it is finished before you transfer it over. Uri will send you email with the grade on your lab.

 

Keep a copy of your file, in case the transfer fails or it somehow gets corrupted. 

 

 

3. Normal Distributions and the Central Limit Theorem

 

The normal distribution is a central part of statistical theory. It plays such a central role because of the Central Limit Theorem, which provides a basis for how normal distributions can arise in nature. This exercise is meant to give you a hands-on view of how the normal distribution can be produced, and also to give you experience with Minitab’s excellent data simulation facilities.

Intuitively, the Central Limit Theorem accounts for how normal distributions can be produced from the sum of many independent random variables.  The power of the theorem comes from placing very few constraints on the form of the distributions of the variables summed. However, if the distributions are quite different, particularly in the range of values with substantial probability, then a large number of variables needs to contribute before a normally distributed variable emerges.  To avoid the tedium of creating dozens of samples, we will restrict the distributions to the uniform and limit their range.

For this exercise, you are first to create several random samples, each of which will be drawn from a different uniform distribution.  To do this:

 

1.      Open Minitab 12

2.      Draw a pseudo-random sample of size 1000 from each of 8 different uniform distributions.  Do this for each distribution by:

a.      Calc → Random Data → Uniform

b.      This opens a dialog box (Figure 2), in which you will specify the parameters of the Uniform distribution. 

 

Figure 2

 

c.       Specify the sample size by clicking and typing 1000 in the Generate __ Rows of Data box.

d.      Give the column in which the sample will be stored. Use C1 for the first, C2 for the second, etc., up to C8.

e.       Specify the lower endpoint and the upper endpoint. Don’t use the same ones for each sample, but restrict the lower endpoint to be between 0 and 20, and the upper endpoint to be between 30 and 50.

f.        When you finish this part, you should have eight variables in your Minitab worksheet - each with 1000 rows.

 

3.  Now create four different variables that are simple sums of the variables you already sampled from. 

 

a.       First, create a new variable that is the sum of any two of the variables you sampled from. For example, you can do this by executing:

1.      Calc → Calculator

2.      Create a new variable called ‘sum12’ that equals the sum of c1 and c2 (Figure 3)

 

Figure 3

 

b.      Now create a different variable that is the sum of any three.

c.       Create a different variable that is the sum of any four.

d.      Finally, create a different variable that is the sum of all eight uniforms.
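The same construction can be sketched outside Minitab, for example in Python using only the standard library. The endpoints below are arbitrary choices within the limits given in step 2e, the function names simulate_sums and excess_kurtosis are hypothetical, and excess kurtosis is used here only as one rough numeric companion to the histograms (it is 0 for a normal distribution and -1.2 for a single uniform).

```python
import random


def simulate_sums(n_rows=1000, seed=0):
    """Mimic the Minitab steps: eight uniform columns, then four summed columns."""
    rng = random.Random(seed)
    # Arbitrary endpoints: lower between 0 and 20, upper between 30 and 50.
    endpoints = [(0, 30), (5, 35), (10, 40), (15, 45),
                 (20, 50), (2, 48), (8, 32), (18, 44)]
    columns = [[rng.uniform(lo, hi) for _ in range(n_rows)]
               for lo, hi in endpoints]
    sums = {}
    for k in (2, 3, 4, 8):
        # Row-wise sum of the first k columns, like c1 + ... + ck in the Calculator.
        sums[f"sum{k}"] = [sum(row) for row in zip(*columns[:k])]
    return sums


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


if __name__ == "__main__":
    for name, values in simulate_sums().items():
        print(name, round(excess_kurtosis(values), 3))
```

The histograms with a superimposed normal curve remain the main tool for this exercise; the printed numbers only summarize how heavy or light the tails of each sum are relative to a normal distribution.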

 

4. Now compare the distributions of the four summed variables you created by looking at histograms of each with a normal curve superimposed.  This can be done using:

Stat → Basic Statistics → Display Descriptive Statistics

 

To get a histogram with a normal curve superimposed, click on the Graphs button in the Descriptive Statistics dialog box, and then make sure the "Histogram of Data, with Normal Curve" box is checked.

 

Question 3a. Copy these histograms into your Word file, and discuss in a few sentences the extent to which the four different summed variables conform to the normal distribution.  Based on this analysis, how many independent uniformly distributed variables do you need to sum to obtain a variable that is approximately normally distributed?

 

 

4. Correlation

Correlation is a statistical measure of linear association.  When two variables have zero correlation, they have no linear association.  When they have a correlation of 1.0, they have a perfect positive linear association, and when they have a correlation of –1.0, a perfect negative linear association.  To see what sorts of scatterplots correspond to what sorts of correlations, open the following web site:

 

http://www.phil.cmu.edu:8080/jcourse/csr/applets/correlation.html

 

The applet on this page allows you to draw samples of any size from populations in which Y is a linear function of X, i.e., Y = βX + ε, and you choose the correlation you wish to hold between X and Y in the population. Draw a sample of size 1,000 from populations corresponding to the following correlations (you don't have to be exact in the population value you set):

Correlation = -1.0,  -0.8,  -0.3, 0.0, 0.2, 0.5, 0.9.

 

These samples show you how correlation measures the degree of association between two variables that are linearly related.  Knowing that two variables have a zero correlation doesn’t mean they are not associated, however; it just means they are not linearly associated.  Keep this in mind as you answer the following questions.

 

Open a fresh worksheet in Minitab.

 

Generate a sample of size 1,000 from a normally distributed variable with mean 0 and standard deviation 2.0.  To do this:

 

1.      Calc → Random Data → Normal

2.      Set the number of observations to 1,000 rows of data.

3.      Leave the mean at 0, and change the standard deviation to 2.0.

4.      Specify the column (C1) in which the sample will be stored.

 

·          Repeat this, but store the next sample in C2.

·          Now, using Calc → Calculator, create a new variable, C3, such that C3 = (3.0*C1) + 4.0 (i.e., C3 equals 3 times C1 plus 4).

·          Now create a new variable, C4, such that C4 = C1*C1  (i.e., C1 squared)

·          Now create a new variable, C5, such that C5 = C1 + C2.
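As a cross-check on the Minitab worksheet, the five columns can be sketched in Python with the standard library. The names make_columns and pearson are hypothetical, and the seed is an arbitrary choice, so the numbers will not match your Minitab sample exactly.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_columns(n_rows=1000, seed=0):
    """Build C1..C5 as described above: two normals and three derived columns."""
    rng = random.Random(seed)
    c1 = [rng.gauss(0.0, 2.0) for _ in range(n_rows)]   # mean 0, s.d. 2.0
    c2 = [rng.gauss(0.0, 2.0) for _ in range(n_rows)]   # independent repeat
    c3 = [3.0 * a + 4.0 for a in c1]                    # linear function of C1
    c4 = [a * a for a in c1]                            # C1 squared
    c5 = [a + b for a, b in zip(c1, c2)]                # C1 plus C2
    return {"C1": c1, "C2": c2, "C3": c3, "C4": c4, "C5": c5}


if __name__ == "__main__":
    cols = make_columns()
    for other in ("C2", "C3", "C4", "C5"):
        print(f"corr(C1, {other}) = {pearson(cols['C1'], cols[other]):.3f}")
```

Writing down your predictions for the questions below before running anything keeps the comparison honest.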

 

Question 4a. Predict the correlation between C1 and C2. Write your prediction in your Word file for this lab and explain the basis for your prediction.

 

Question 4b. Compute the actual correlation between C1 and C2 using:

1.   Stat → Basic Statistics → Correlation

2.   Select variables C1 and C2 by double clicking them so they appear in the Variables: box.

3.   Click OK

 

Record the answer in your Word file.  Was the correlation equal to your prediction? Why or why not?  Plot the relationship between C1 and C2 using a scatterplot:

Graph → Plot

 

Put the plot into your Word file.

 

Question 4c. Predict the correlation between C1 and C3. Write your prediction in your Word file and explain the basis for your prediction. Compute the correlation. Was the correlation equal to your prediction? Why or why not? Plot the relationship between C1 and C3 using a scatterplot and put the scatterplot into your Word file.

 

Question 4d. Do you expect the correlation between C1 and C5 to be zero? Explain.  Compute the correlation, and record the answer in your Word file.  Plot the relationship between C1 and C5 using a scatterplot and put the scatterplot into your Word file.  Compare the three scatterplots you have produced so far: C1 – C2, C1 – C3, and C1 – C5.  Describe the differences between the plots, and how these differences relate to the correlations.

 

Question 4e. Predict the correlation between C1 and C4. Write your prediction in your Word file, and explain. Compute the correlation. Was the correlation equal to your prediction? Plot the relationship between C1 and C4 using a scatterplot, and put it into your Word file.

 

Question 4f.  Suppose that the distribution from which you drew the sample in C1 had a mean other than 0.  Would you expect the same result for question 4e? Explain why or why not.

 

Keep this dataset open.

 


 

5. Correlation and Regression

 

Question 5a. Estimate a regression in which C5 is the dependent variable (response), and C1 is the independent variable (predictor):

 

1.      Stat → Regression → Regression

2.      Click in the "Response:" box, and insert C5 by double clicking it in the list on the left side.

3.      Click in the "Predictors:" box, and insert C1 by double clicking it in the list on the left.

4.      Click OK.

 

The results of the regression will be in the "Session" window, and they are followed by what can be a long list of "Unusual Observations."  Ignore these, and look at the results by scrolling up in the "Session" window until you see the section titled "Regression Analysis".

 

How does the R² of the regression relate to the correlation between C1 and C5 you computed in question 4d?  Give both values and your answer in the Word file.
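If you want to check the two Minitab numbers against an independent calculation, a minimal Python sketch follows. It computes the sample correlation and the R-squared of a one-predictor least-squares fit from the same sums of squares; the function name correlation_and_r_squared is hypothetical, and the data are simulated afresh with an arbitrary seed rather than read from your worksheet.

```python
import math
import random


def correlation_and_r_squared(xs, ys):
    """Return (r, R-squared) for the least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-squared = 1 - (residual sum of squares) / (total sum of squares).
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return r, 1.0 - sse / syy


if __name__ == "__main__":
    rng = random.Random(0)
    c1 = [rng.gauss(0.0, 2.0) for _ in range(1000)]
    c2 = [rng.gauss(0.0, 2.0) for _ in range(1000)]
    c5 = [a + b for a, b in zip(c1, c2)]
    r, r2 = correlation_and_r_squared(c1, c5)
    print(f"r = {r:.4f}   r*r = {r * r:.4f}   R-squared = {r2:.4f}")
```

The script prints the correlation, its square, and R-squared side by side so that they can be compared directly.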

 

6. Regression and Measurement Error

Regression is a powerful tool, but it has limits.  In particular, when the variables of interest cannot be perfectly measured, the quantities estimated by a regression can be distorted.  This exercise is meant to sensitize you to when measurement error can create serious distortions with respect to causal relationships and when it isn’t a big concern.

Later in the course, we will be reviewing a study of the relationship between smoking by youths and cigarette prices, among other factors.  The data on youth smoking for this study come from a survey administered to many youth smokers across the U.S. Each smoker was asked to report how many packs of cigarettes per week he or she smoked on average in the past year. The data on prices of cigarette packs are from a published list of state prices which differ according to the state tax on cigarettes. Assume that all individuals in a state pay the same (published) price for cigarettes.

Suppose the true relationship between the actual quantity smoked (Y) and the actual price of cigarettes (X) is linear, and satisfies the usual assumptions in a regression analysis. Thus the relationship can be written as:  Y = β₀ + β₁X + ε, where ε represents all other causes of youth smoking. We will assume ε (epsilon) has mean 0 and is distributed normally and independently of X.

First, we want you to simulate data on this relationship.

 

1)      Draw a pseudo-random sample of size 5000 for X assuming that X is normally distributed with mean 2.70 and standard deviation of 0.20. To do this,

a)      Execute: Calc → Random Data → Normal

b)      Put the data in column C6, set the number of observations = 5000, the mean = 2.70, and the s.d. = 0.20

c)      Label C6 as X.

2)      Draw a pseudo-random sample of size 5000 for ε assuming it’s normally distributed with mean 0 and standard deviation of 0.20. Store this sample in column C7 and label it ‘epsilon’.

3)      Now construct the dependent variable Y as a function of X, epsilon, and an intercept:   Y = 5.7 – 1.0*X + epsilon. Use Calc → Calculator to compute Y, which you should store in C8 and label as Y.
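The three steps above can also be sketched in Python as a sanity check on the Minitab worksheet. The function names simulate_xy and fit_line are hypothetical and the seed is arbitrary. S is computed as the square root of the residual sum of squares divided by n - 2, the usual convention for a regression with one predictor.

```python
import math
import random


def simulate_xy(n_rows=5000, seed=0):
    """Simulate price X, error term epsilon, and Y = 5.7 - 1.0*X + epsilon."""
    rng = random.Random(seed)
    x = [rng.gauss(2.70, 0.20) for _ in range(n_rows)]
    epsilon = [rng.gauss(0.0, 0.20) for _ in range(n_rows)]
    y = [5.7 - 1.0 * xi + ei for xi, ei in zip(x, epsilon)]
    return x, y


def fit_line(xs, ys):
    """Return (intercept, slope, R-squared, S) for a one-predictor least-squares fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, 1.0 - sse / syy, math.sqrt(sse / (n - 2))


if __name__ == "__main__":
    x, y = simulate_xy()
    b0, b1, r2, s = fit_line(x, y)
    print(f"intercept={b0:.3f} slope={b1:.3f} R-squared={r2:.3f} S={s:.3f}")
```

Rerunning with a different seed or a different n_rows shows how much the estimates move from one sample to the next.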

 

Question 6a.  Suppose you estimated a regression in which Y is the dependent variable and X the independent variable, using the sample you just created.  Would the estimated intercept and slope coefficient equal 5.7 and –1.0 respectively, and would the standard error of the regression (S) equal the standard deviation of epsilon, i.e., 0.2?  Explain your answer, do the analysis, and report the results in your Word file. Include the estimated intercept, slope coefficient, R², and S (the standard error of the regression).[1]

 

You should have found that your estimates deviated only slightly from their true counterparts. The deviations arise because you have only a finite sample of 5000 observations. Even though X and epsilon are independent in the population from which you drew this sample, in any finite sample there will be some correlation between X and epsilon that will cause the regression estimates to differ from their true counterparts. In a sample of size 5000, these differences will be quite small.
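One way to see where this deviation comes from is to write out the least-squares slope for the model above. In the sketch below, hats denote sample quantities, and β₀, β₁ and ε stand for the intercept, slope and error term of the model; the notation is ours, not Minitab's.

```latex
\hat{\beta}_1
  = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)}
  = \frac{\widehat{\mathrm{Cov}}(X,\ \beta_0 + \beta_1 X + \varepsilon)}{\widehat{\mathrm{Var}}(X)}
  = \beta_1 + \frac{\widehat{\mathrm{Cov}}(X, \varepsilon)}{\widehat{\mathrm{Var}}(X)}
```

The last term is zero in the population, because X and ε are independent, but in a finite sample the sample covariance between X and ε is only approximately zero, so the estimated slope is only approximately equal to the true slope.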

 

Whereas epsilon, the “error term” in a regression, represents all the other unmeasured causes of Y, we will now consider a different type of error that can affect a regression analysis: “measurement error.”  In this section of the lab, we will examine how measurement error affects the discrepancy between the regression estimates and their true counterparts. 

 

In the study on youth smoking, the data were not actual amounts smoked, but rather the self-reported amounts of smoking, which we will call Ys to distinguish it from the actual amount of smoking Y.

 

Question 6b. People do not perfectly recollect how much they smoked, so there is measurement error in the self-reports. Suppose the difference between self-reported and actual amounts of smoking is on average equal to 0, normally distributed, and unrelated to the actual price youths pay for cigarettes.  Suppose you regressed the self-reported amount of smoking Ys on the actual price X in order to find out the relation between actual price X and actual smoking Y. Describe how you think the estimates of the intercept, slope coefficient, R² and standard error of the regression from the regression of Ys on X would differ, if at all, from those in the regression of Y on X.

 

To examine your prediction, we will again simulate data.  To do this, we need to construct Ys. Since Ys differs from Y by just a random noise term, we can generate the random noise term first and then construct Ys by adding it to Y.   To construct Ys:

 

1)      Construct the random noise term. Use Calc → Random Data → Normal to create a sample of 5000 observations with mean 0 and standard deviation 0.3.  Put the observations in column C9, and label the variable ‘NoiseY’.

2)      With Calc → Calculator, create a new variable Ys in column C10 by adding Y and NoiseY. That is, set Ys = Y + NoiseY.

 

Question 6c. Regress Ys on X, and compare the estimates of this regression to those in the regression of Y on X that you did in question 6a. Report this in the Word file. How does “random noise” or “random measurement error” in the dependent variable appear to affect the regression estimates?

 

Now we will consider the effect of measurement error that is not entirely random. People tend to underreport their vices, and smoking is no exception.  Suppose the average youth understates the amount he or she smokes per week by ½ a pack. Some understate it by more and some by less, but on average the underreporting is ½ a pack and not 0 as we assumed in question 6b.  Suppose the amount by which their underreporting differs from ½ a pack is just random noise with mean 0 that is unrelated to the actual price X. Thus the amount reported, call it Yus, is just equal to the true amount smoked, minus ½ a pack, plus random noise.

 

Question 6d. Suppose you regressed Yus on the actual price X, in order to find out the relation between actual price X and actual smoking Y.  Describe how you think the estimates of the intercept, slope coefficient, R² and standard error of the regression from the regression of Yus on X would differ, if at all, from those in the regression of Y on X.

 

Analyze your prediction by constructing Yus.

1)      Create a new random noise term, labeled NoiseY2, using Calc → Random Data → Normal to create a sample of 5000 observations drawn from a normal distribution with mean 0 and standard deviation 0.3.

2)      With Calc → Calculator, create Yus by subtracting ½ from Y and then adding NoiseY2. That is, set Yus = Y – 0.5 + NoiseY2.

3)      Regress Yus on X
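For comparison with your Minitab output, the whole measurement-error experiment (Y, Ys and Yus regressed on the same X) can be sketched in a few lines of Python. The names simulate_reports and fit_line are hypothetical, the seed is arbitrary, and S is again taken as the square root of the residual sum of squares over n - 2.

```python
import math
import random


def fit_line(xs, ys):
    """Return (intercept, slope, R-squared, S) for a one-predictor least-squares fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, 1.0 - sse / syy, math.sqrt(sse / (n - 2))


def simulate_reports(n_rows=5000, seed=0):
    """Simulate X plus the true (Y), noisy (Ys) and underreported (Yus) responses."""
    rng = random.Random(seed)
    x = [rng.gauss(2.70, 0.20) for _ in range(n_rows)]
    y = [5.7 - 1.0 * xi + rng.gauss(0.0, 0.20) for xi in x]   # true smoking
    ys = [yi + rng.gauss(0.0, 0.3) for yi in y]               # Y + NoiseY
    yus = [yi - 0.5 + rng.gauss(0.0, 0.3) for yi in y]        # Y - 0.5 + NoiseY2
    return x, {"Y": y, "Ys": ys, "Yus": yus}


if __name__ == "__main__":
    x, responses = simulate_reports()
    for name, values in responses.items():
        b0, b1, r2, s = fit_line(x, values)
        print(f"{name:>3}: intercept={b0:.3f} slope={b1:.3f} "
              f"R-squared={r2:.3f} S={s:.3f}")
```

Each printed row corresponds to one of the three regressions in this section, which makes it easy to see which estimates change between them and which do not.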

 

Question 6e. Report the estimates of the intercept, slope coefficient, R² and standard error of the regression from the regression of Yus on X. How do they differ from the estimates in the regression of Y on X?

 

 



[1] The standard error of the regression is the standard deviation of the residuals: Ypred - Yactual.