Workshop 6 The Central Limit Theorem for Census Data
In this workshop, we will analyze the census data from your casebook (you can find it on your casebook floppy disk, or open it directly from the R: drive) and explore what the central limit theorem really means.
The following part briefly shows how the shape of the distribution changes as the sample size changes. These notes cover the basic Minitab operations you will need to answer the questions. For your homework, you will need to run more simulations and write up your findings in detail.
Step 1: Open the data set.
Note: you can open the data set either from your floppy or from r:\academic\90786\chatterjee et all\census1.dat. I have previously shown you how to open a text data file in Excel, how to split the plain text file into columns, and how to read those columns. This time we will use Minitab to complete a similar task.
A pop-up dialog will appear.
Note that the plain text file contains only one column of data (you can check this by opening the file in WordPad or any other word processor). So enter C1 as the column in your Minitab worksheet in which to store the data.
(If you want Minitab to read data from a multi-column plain text file, simply select multiple columns in the pop-up dialog so that each column of the file is stored in its own worksheet column. Remember, if you select only one column, the program reads the first element of every row in the plain text file and puts it in the selected column.)
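The single-column behaviour described above can be mimicked outside Minitab. The following is a minimal Python sketch, not Minitab's own logic: the helper name read_first_column and the sample text are hypothetical, and the values are made up rather than taken from census1.dat.

```python
import io

# Hypothetical stand-in for a plain text data file; the values are made up
# and are NOT taken from census1.dat.
sample_text = """12.5 99.0
7.25
30.0 1.0 2.0
"""

def read_first_column(handle):
    """Return the first whitespace-separated value of each non-empty line."""
    values = []
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            values.append(float(fields[0]))
    return values

c1 = read_first_column(io.StringIO(sample_text))
print(c1)  # [12.5, 7.25, 30.0]
```

To try this on a real file, replace the io.StringIO object with an open file handle.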
Step 2: Look at the descriptive statistics and histogram of the data set.
· In Minitab, select Stat>Basic Statistics>Descriptive Statistics.
· Select the column for which you want to compute descriptive statistics.
OUTPUT: Mean of C1 = 19.897
400 rows read.
5000 rows read.
Descriptive Statistics
Variable      N     Mean   Median   TrMean    StDev  SE Mean
C1         5000   19.897   16.990   18.709   12.573    0.178
Variable   Minimum   Maximum       Q1       Q3
C1           0.292   220.835   12.123   24.418
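As a quick check on the output above, the SE Mean column is the standard deviation divided by the square root of N. A minimal Python sketch of that arithmetic, using only the figures printed in the table (the variable names are just for illustration):

```python
import math

n = 5000          # N from the Minitab output
st_dev = 12.573   # StDev from the Minitab output

# Standard error of the mean: StDev / sqrt(N)
se_mean = st_dev / math.sqrt(n)
print(round(se_mean, 3))  # 0.178
```

This matches the SE Mean of 0.178 reported by Minitab.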
· In Minitab, select Graph>Histogram.
· Choose column 1 (C1), then click OK.
The histogram shows that the original data set is highly skewed. According to the Central Limit Theorem, the probability distribution of the sample mean will look more and more normal in shape as the sample size increases.
So let us see whether the skewness is reduced when the sample size increases from 1 to 3.
· Sample 200 observations from the original data set.
In Minitab, select Calc>Random Data>Sample from Column, then choose Column 1 as the sampling source and Column 2 to store the output.
Click OK. You will see 200 sampled observations appear in column 2.
· Repeat the sampling to store 200 observations each in C3 and C4, so that we have 600 sampled observations in total. Each row of C2, C3 and C4 is then a sample of size 3.
· For each row, compute the sample mean:
In Minitab, select Calc>Row Statistics.
Select ‘Mean’ as the statistic, select C2, C3 and C4 as the input variables, and tell the program to store the mean of every row in C5.
Click OK.
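The sampling and row-mean steps above can also be sketched in Python. This is an illustration under stated assumptions, not a reproduction of the Minitab run: the population is a made-up lognormal data set standing in for C1 (the census file is not used), the draws are made without replacement, and the seed and lognormal parameters are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population standing in for C1 (NOT the census data).
c1 = [random.lognormvariate(2.8, 0.6) for _ in range(5000)]

# Three separate draws of 200 observations each, like C2, C3 and C4.
c2 = random.sample(c1, 200)
c3 = random.sample(c1, 200)
c4 = random.sample(c1, 200)

# Row means: each row of (C2, C3, C4) is a sample of size 3, stored in C5.
c5 = [statistics.mean(row) for row in zip(c2, c3, c4)]

def mean_median_gap(values):
    """Mean minus median, a rough indicator of right skew."""
    return statistics.mean(values) - statistics.median(values)

print("population gap:  ", round(mean_median_gap(c1), 3))
print("row-mean gap:    ", round(mean_median_gap(c5), 3))
print("population StDev:", round(statistics.stdev(c1), 3))
print("row-mean StDev:  ", round(statistics.stdev(c5), 3))
```

Changing the number of sampled columns from 3 to a larger value is one way to explore bigger sample sizes for the homework.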
· Descriptive statistics
Descriptive Statistics
Variable      N     Mean   Median   TrMean    StDev  SE Mean
C5          200   19.589   18.234   19.023    6.848    0.484
Variable   Minimum   Maximum       Q1       Q3
C5           8.165    46.970   15.005   22.589
Note: The difference between the mean and the median falls from about 2.9 (19.897 - 16.990) to about 1.4 (19.589 - 18.234) when we increase the sample size from 1 to 3. This suggests that the skewness is reduced by increasing the sample size.
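The comparison in this note can be checked directly from the two Descriptive Statistics outputs. The short Python sketch below recomputes the mean-median gaps and, as an extra check that is not part of the original notes, compares the observed StDev of C5 with the value the theory predicts for means of 3 observations, StDev of C1 divided by the square root of 3.

```python
import math

# Values copied from the two Descriptive Statistics outputs.
mean_c1, median_c1, st_dev_c1 = 19.897, 16.990, 12.573
mean_c5, median_c5, st_dev_c5 = 19.589, 18.234, 6.848

gap_c1 = mean_c1 - median_c1
gap_c5 = mean_c5 - median_c5
print(round(gap_c1, 3), round(gap_c5, 3))  # 2.907 1.355

# Spread predicted for means of 3 observations: StDev / sqrt(3).
predicted = st_dev_c1 / math.sqrt(3)
print(round(predicted, 3))  # 7.259
```

The observed StDev of 6.848 for C5 is reasonably close to the predicted 7.259, given that C5 is based on only 200 sampled rows.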
· Histogram
A histogram is the most commonly used and most direct way to show how the shape of a distribution changes.
In this case, when the sample size increases to 3, the skewness is visibly reduced.