This article describes how Excel, Microsoft's widely-used spreadsheet package, can be used to help build data sets for teaching purposes. Along the way readers are presented with some ideas for building "fancy" spreadsheets for carrying out special tasks - I can't turn you into an Excel expert in such a short time, but by the end you should appreciate the types of things that can be done - the types of things that make Excel into such an incredibly powerful and flexible tool for teaching and learning statistics.
There are many sources of data available for teaching purposes, but there is still a need for "made up" data sets that contain special features we want to talk about or use in teaching. So we go through a series of sections covering the building of (progressively more complicated) such data sets. "By hand" working is presented with some technical details and even some VB code - details are for Excel 5/95, but much the same results can be obtained in other versions. Also, we "demonstrate" the use of a workbook called SampSel.xls that forms part of the XLStatistics set - this workbook has tools for carrying out some common tasks quickly and efficiently. (XLStatistics is a set of Excel 5+ workbooks for statistical analysis of data - SampSel.xls is just one of the over 60 workbooks in the package - here we demonstrate using version 4.49 of the package.)
Probably the most commonly-needed data set is for a single numerical variable. Such data sets are used for teaching many basic statistical concepts and are also commonly combined in various ways to build more complicated data sets. The first few sections cover techniques for building such data sets with various properties.
For making a data set the simplest thing you can do is use Excel's Edit | Fill | Series command from the menu. There are a number of options, but the most common is a linear series, where each successive value is obtained from the previous one by adding the "Step value". Of course the same thing can be achieved by placing a starting value in a cell, say A1, a "step value" in another cell, say B1, then entering the formula =A1 + B$1 into A2 and filling down as far as required. You end up with a set of values that are equally-spaced over an interval - quite a commonly-needed data set, but not very exciting!
A common data set is one where the numbers are chosen from a given distribution. Excel's RAND() function gives a number uniformly distributed over [0,1], and by combining with the inverse (cumulative) probability function of a given probability distribution we can generate numbers from that distribution. For example, in Figure 1 is a worksheet that uses the NORMINV function, Excel's inverse of the cumulative normal distribution, to generate values from a normal distribution.
Notes
Data from a specified Distribution
=ROUND(NORMINV(RAND(),$B$1,$B$2),0)
in C1:C11.
These are entered as shown into a new Module sheet in a workbook. Run the first one (under Tools | Macros). Then double-clicking on any cell on the current worksheet will bring up a message box with a useful little message like that shown in Figure 2.
ActiveSheet.OnDoubleClick = "DisplayFormula"
End Sub
Sub DisplayFormula()
MsgBox "The formula in cell " & Selection.Address & " is " & Selection.Formula
End Sub
Normally students would initially be asked to make a data chosen from, for example, the normal distribution "by hand" as in Figure 1 (and then do something with it), but having gone through this operation once or twice, there's little point in getting them to do it again and again - when data is needed for other purposes you need to be able to generate it quickly. For your own purposes, it is also convenient to be able to do it quickly. To do this you might produce a template but you usually need to add some Visual Basic procedures to automate things that bare-bones Excel does not do. The following image taken from the SampSel.xls workbook in XLStatistics illustrates typical things that need to be done:
Note. Option buttons for selecting the various distributions have been added to the worksheet pictured in Figure 3. These can be found on the Forms menu (or the Control Toolbox in Excel 97). Also, on the worksheet the desired sample size is entered into B3 and the requisite number of values appears in Column I. This automation is achieved by assigning a macro that does the necessary filing-down to the OnCalculate event of the worksheet. With a little experience it is quite easy to do this type of thing; or you can use a product such as the XLStatistics workbooks or use one of the many vailable Excel add-ins. Examples of data sets produced using the Random Numbers worksheet.
Sometimes you need draw random samples of data from census data - it's a good way of illustrating sample-to-sample variation, for example. Let's say we have data in Column A and want to draw a random sample of size 10 from it. One way of doing it is to INDEX into the data set by a random amount 10 times, extracting the corresponding data items. Figure 4 illustrates a possible way of doing this.
This actually gives a random sample with replacement. It's a bit harder to obtain a sample without replacement (for simulation of a lottery drawing, for example). In Figure 5 is an image of the Sample Selection worksheet in SampSel.xls with a "Without replacement" option. There are quite a number of algorithms that can be used to sample without replacement (see the discussion in the ASSUME mailist in March and April 1998 found at http://www.mailbase.ac.uk/lists /assume/archive.html, for example) - in this workbook the sample is selected using the same INDEX function as above, but the indices of selected items are recorded and there is code to recalculate if there is a repeated index.
This worksheet also has code to automatically extract sample of specified sizes - as previously it is important to be able to carry out operations quickly and efficiently.
Example. Enter the numbers 1, 2, 3, ..., 40 as the population in Column A. Select a sample of size 6 "Without replacement". Now you have the results of a lottery drawing 6 numbers from 40.
Data sets with "arbitrary" distributions
Data sets with more complicated distributions can be formed by combining data obtained using the above methods. For example, a typical bimodal data set (of size 100, say) could be formed by combining two samples of size 50 from normal distributions, one with m = 25, s = 10, the other with m = 75, s = 10. But it is fairly time consuming to do this type of work, especially if the data set needs fine-tuning. As an alternative, there is a sheet in SampSel.xls that can be used to easily build and modify arbitrary (single-variable) data sets. It is pictured in Figure 6.
On this worksheet you simply enter the total range in which you wish to have your data (in E1 and E2) and then move the scroll bars up or down to directly alter the counts in each of 20 subintervals over this range. A sample with the desired distribution automatically appears in Column K; the frequency chart gives a clearer picture of the distribution of the sample. Because of its flexibility, this worksheet is much more commonly-used for generating data sets than any of the techniques discussed previously, unless a simulation from a specific distribution is required.
Up until now all the techniques discussed have been for producing data sets for a single numerical variable. Often we need data sets for a numerical variable across different groups that we might want for comparison purposes, or for two numerical variables we might want for regression- or smoothing-type exercises, or maybe larger data sets involving many variables containing more subtle or complicated relationships. Data sets for different groups can easily be constructed using the above techniques then combined - that is relatively easy and will not be discussed further. Data sets for two numerical variables are also quite easy to construct using Excel - we give some pointers in the next section. And similar techniques can be used for more variables, but it becomes progressively harder (or at least more time consuming) as the number of variables increases.
Simulating data for more variables
Constructing data sets for two numerical variables
To gain even more flexibility, if a scatterplot in Excel is based on actual values (not formulas) points can be moved either vertically or horizontally on the actual scatterplot and the data itself alters accordingly. This allows very rapid development of data sets displaying particular features - all you need do is create some data with vaguely the right features, then move the points around to get what you really want. In Figure 7 is a snapshot of this process being carried out using the "Make data - 2 Var" worksheet in SampSel.xls.
Flexibility can be added by having cells that control the standard deviation of the error term, for example.
This worksheet allows you to first enter an initial sample of X-Y values allowing for commonly-needed options like the initial X- or Y- ranges to be set (the picture in Figure 7 has the Y-values initially assigned using Y = -X + 5 + error). Then you use Excel's move-the-data-points-in-the-scatterplot feature to modify the data set (the picture in Figure 7 shows the 4th data point being made into an "outlier" by moving it vertically to a Y-value of 0.7 from its initial value of 2.84).
Data sets involving more variables
The techniques discussed previously are useful when constructing data sets with more variables, especially when you are constructing data sets with specified relationships between pairs of variables - it's kind of like a backwards versions of the break-it-down-to-a-series-of-only-a-few-variables maxim that is often applied to analysis of multivarible data sets. Of course, if you want to build data sets with complicated interactions between many variables life gets harder, just as with the analysis of such data. The technique of entering formulas and placing parameters in separate cells, like in "Constructing data sets for two numerical variables" above, can be especially useful. For three variables with two numerical and a categorical one you can simply combine data sets obtained using the technique in the previous section, or you could work with a multi-series scatterplot as pictured in Figure 8, where we are producing a data set to illustrate some serious interaction.
Of course, if we need to alter the X-values of the individual series we would need to have separate X-values for each, as in Figure 8.
Conclusion
We have covered some of the basic techniques that can be used to construct data sets using MS Excel. The "by hand" techniques are quite useful, but use of a package such as SampSel.xls in XLStatistics can be used for many standard exercises.
XLStatistics can be obtained from the author.
Further Information