CTI Statistics

Teaching and Learning with MS EXCEL Data Sets


Rodney Deakin
Deakin University


This article appears in the May 1999 issue of the newsletter Maths&Stats

Introduction | Creating data | Edit | Fill | Series Command | Data from a specified Distribution | Sampling from a population | Data sets with "arbitrary" distributions | Simulating data for more variables | Constructing data sets for two numerical variables | Data sets involving more variables | Conclusion | Figures

Introduction

This article describes how Excel, Microsoft's widely-used spreadsheet package, can be used to help build data sets for teaching purposes. Along the way readers are presented with some ideas for building "fancy" spreadsheets for carrying out special tasks - I can't turn you into an Excel expert in such a short time, but by the end you should appreciate the types of things that can be done - the types of things that make Excel into such an incredibly powerful and flexible tool for teaching and learning statistics.

There are many sources of data available for teaching purposes, but there is still a need for "made up" data sets that contain special features we want to talk about or use in teaching. So we go through a series of sections covering the building of (progressively more complicated) such data sets. "By hand" working is presented with some technical details and even some VB code - details are for Excel 5/95, but much the same results can be obtained in other versions. Also, we "demonstrate" the use of a workbook called SampSel.xls that forms part of the XLStatistics set - this workbook has tools for carrying out some common tasks quickly and efficiently. (XLStatistics is a set of Excel 5+ workbooks for statistical analysis of data - SampSel.xls is just one of the over 60 workbooks in the package - here we demonstrate using version 4.49 of the package.)

Creating data for a single (numerical) variable

Probably the most commonly-needed data set is for a single numerical variable. Such data sets are used for teaching many basic statistical concepts and are also commonly combined in various ways to build more complicated data sets. The first few sections cover techniques for building such data sets with various properties.

Edit | Fill | Series Command

For making a data set the simplest thing you can do is use Excel's Edit | Fill | Series command from the menu. There are a number of options, but the most common is a linear series, where each successive value is obtained from the previous one by adding the "Step value". Of course the same thing can be achieved by placing a starting value in a cell, say A1, a "step value" in another cell, say B1, then entering the formula =A1 + B$1 into A2 and filling down as far as required. You end up with a set of values that are equally-spaced over an interval - quite a commonly-needed data set, but not very exciting!

Data from a specified Distribution

A common data set is one where the numbers are chosen from a given distribution. Excel's RAND() function gives a number uniformly distributed over [0,1], and by combining with the inverse (cumulative) probability function of a given probability distribution we can generate numbers from that distribution. For example, in Figure 1 is a worksheet that uses the NORMINV function, Excel's inverse of the cumulative normal distribution, to generate values from a normal distribution.

Notes

  1. Note the use of cells containing the mean and standard deviation (B1 and B2). It is useful to put values of parameters like this in cells so that modifications to the basic data sets can be carried out quickly and easily.
  2. As it stands, the working shown in Figure 1 generates data for a continuous numerical variable - many decimal places are shown. To generate data for a discrete (integer- valued) variable you can use the ROUND function, so you might have, for example
    =ROUND(NORMINV(RAND(),$B$1,$B$2),0) in C1:C11.
  3. Excel comes with many common inverse distribution functions that can be used this way: NORMINV, GAMMAINV, CRITBINOM, etc. They can all be found in the Statistical functions section when using Excel's function wizard.
  4. For teaching purposes it is sometimes necessary to show formulas rather than values, as in the left panel of Figure 1.
You can use Tools | Options | View | Formulas (as was done for Figure 1), but if you want to just show the formula in a particular cell without having to go through the options you might use the following useful macros:
Sub SetDoubleClick()
ActiveSheet.OnDoubleClick = "DisplayFormula"
End Sub

Sub DisplayFormula()
MsgBox "The formula in cell " & Selection.Address & " is " & Selection.Formula
End Sub
These are entered as shown into a new Module sheet in a workbook. Run the first one (under Tools | Macros). Then double-clicking on any cell on the current worksheet will bring up a message box with a useful little message like that shown in Figure 2.

Normally students would initially be asked to make a data chosen from, for example, the normal distribution "by hand" as in Figure 1 (and then do something with it), but having gone through this operation once or twice, there's little point in getting them to do it again and again - when data is needed for other purposes you need to be able to generate it quickly. For your own purposes, it is also convenient to be able to do it quickly. To do this you might produce a template but you usually need to add some Visual Basic procedures to automate things that bare-bones Excel does not do. The following image taken from the SampSel.xls workbook in XLStatistics illustrates typical things that need to be done:

Note. Option buttons for selecting the various distributions have been added to the worksheet pictured in Figure 3. These can be found on the Forms menu (or the Control Toolbox in Excel 97). Also, on the worksheet the desired sample size is entered into B3 and the requisite number of values appears in Column I. This automation is achieved by assigning a macro that does the necessary filing-down to the OnCalculate event of the worksheet. With a little experience it is quite easy to do this type of thing; or you can use a product such as the XLStatistics workbooks or use one of the many vailable Excel add-ins. Examples of data sets produced using the Random Numbers worksheet.

  1. Click on Uniform. Set a = 0.5, b = 6.5. Set Sample Size = 500, Decimal Places = 0. Now you have data simulating 500 rolls of a die.
  2. Click on Poisson (this is actually not based on one of the inverse probability functions that comes supplied with Excel - it is uses a user- defined function supplied with SampSel.xls). Set the average number of events (l)=5, Sample Size = 100. Now you have data simulating the number of arrivals per minute at a queue on 100 separate occasions.
  3. Click on Normal. Set m = 65, s = 10. Sample Size = 50. Now you have data that might simulate the scores of students on an ideal exam.

Sampling from a population

Sometimes you need draw random samples of data from census data - it's a good way of illustrating sample-to-sample variation, for example. Let's say we have data in Column A and want to draw a random sample of size 10 from it. One way of doing it is to INDEX into the data set by a random amount 10 times, extracting the corresponding data items. Figure 4 illustrates a possible way of doing this.

This actually gives a random sample with replacement. It's a bit harder to obtain a sample without replacement (for simulation of a lottery drawing, for example). In Figure 5 is an image of the Sample Selection worksheet in SampSel.xls with a "Without replacement" option. There are quite a number of algorithms that can be used to sample without replacement (see the discussion in the ASSUME mailist in March and April 1998 found at http://www.mailbase.ac.uk/lists /assume/archive.html, for example) - in this workbook the sample is selected using the same INDEX function as above, but the indices of selected items are recorded and there is code to recalculate if there is a repeated index.

This worksheet also has code to automatically extract sample of specified sizes - as previously it is important to be able to carry out operations quickly and efficiently.

Example. Enter the numbers 1, 2, 3, ..., 40 as the population in Column A. Select a sample of size 6 "Without replacement". Now you have the results of a lottery drawing 6 numbers from 40.

Data sets with "arbitrary" distributions

Data sets with more complicated distributions can be formed by combining data obtained using the above methods. For example, a typical bimodal data set (of size 100, say) could be formed by combining two samples of size 50 from normal distributions, one with m = 25, s = 10, the other with m = 75, s = 10. But it is fairly time consuming to do this type of work, especially if the data set needs fine-tuning. As an alternative, there is a sheet in SampSel.xls that can be used to easily build and modify arbitrary (single-variable) data sets. It is pictured in Figure 6.

On this worksheet you simply enter the total range in which you wish to have your data (in E1 and E2) and then move the scroll bars up or down to directly alter the counts in each of 20 subintervals over this range. A sample with the desired distribution automatically appears in Column K; the frequency chart gives a clearer picture of the distribution of the sample. Because of its flexibility, this worksheet is much more commonly-used for generating data sets than any of the techniques discussed previously, unless a simulation from a specific distribution is required.

Simulating data for more variables

Up until now all the techniques discussed have been for producing data sets for a single numerical variable. Often we need data sets for a numerical variable across different groups that we might want for comparison purposes, or for two numerical variables we might want for regression- or smoothing-type exercises, or maybe larger data sets involving many variables containing more subtle or complicated relationships. Data sets for different groups can easily be constructed using the above techniques then combined - that is relatively easy and will not be discussed further. Data sets for two numerical variables are also quite easy to construct using Excel - we give some pointers in the next section. And similar techniques can be used for more variables, but it becomes progressively harder (or at least more time consuming) as the number of variables increases.

Constructing data sets for two numerical variables

Let's call the variables "X" and "Y". Probably the most common requirement is for the data to display some type of relationship of the form Y = f(X) + error; the simplest is, of course, f(X) = aX + b for constants a and b. This is easy to achieve in Excel - it might go like this:
  1. In one column (A, say) create data for X (using one of the techniques discussed previously, for example)
  2. In another column (B, say) enter the desired (exact) formula for f(X), referring directly to the X values in the first column.
  3. In another column (C, say) enter the desired "errors" using, maybe, the Random Numbers sheet in SampSel.xls.
  4. In a final column (D, say) (row-wise) sum the results in the columns in steps 2 and 3 - these are the "Y" values.
Flexibility can be added by having cells that control the standard deviation of the error term, for example.
To gain even more flexibility, if a scatterplot in Excel is based on actual values (not formulas) points can be moved either vertically or horizontally on the actual scatterplot and the data itself alters accordingly. This allows very rapid development of data sets displaying particular features - all you need do is create some data with vaguely the right features, then move the points around to get what you really want. In Figure 7 is a snapshot of this process being carried out using the "Make data - 2 Var" worksheet in SampSel.xls.

This worksheet allows you to first enter an initial sample of X-Y values allowing for commonly-needed options like the initial X- or Y- ranges to be set (the picture in Figure 7 has the Y-values initially assigned using Y = -X + 5 + error). Then you use Excel's move-the-data-points-in-the-scatterplot feature to modify the data set (the picture in Figure 7 shows the 4th data point being made into an "outlier" by moving it vertically to a Y-value of 0.7 from its initial value of 2.84).

Data sets involving more variables

The techniques discussed previously are useful when constructing data sets with more variables, especially when you are constructing data sets with specified relationships between pairs of variables - it's kind of like a backwards versions of the break-it-down-to-a-series-of-only-a-few-variables maxim that is often applied to analysis of multivarible data sets. Of course, if you want to build data sets with complicated interactions between many variables life gets harder, just as with the analysis of such data. The technique of entering formulas and placing parameters in separate cells, like in "Constructing data sets for two numerical variables" above, can be especially useful. For three variables with two numerical and a categorical one you can simply combine data sets obtained using the technique in the previous section, or you could work with a multi-series scatterplot as pictured in Figure 8, where we are producing a data set to illustrate some serious interaction.

Of course, if we need to alter the X-values of the individual series we would need to have separate X-values for each, as in Figure 8.

Conclusion

We have covered some of the basic techniques that can be used to construct data sets using MS Excel. The "by hand" techniques are quite useful, but use of a package such as SampSel.xls in XLStatistics can be used for many standard exercises.

XLStatistics can be obtained from the author.

Further Information