CALIFORNIA STATE UNIVERSTIY, LONG BEACH

GEOG 400
Geographical Analysis

Project 2: Multiple Regression Using Excel

==========

The purpose of this lab is:

  • to introduce you to multiple regression, in which you build a model that explains variation in a dependent variable by reference to the weighted influence of two or more independent variables.
  • to familiarize you with Excel's multiple regression capacities (you may not have access to expensive statistical software in your professional future, so it's good to know that Excel can perform some multivariate analysis, including the ubiquitous multiple regression)
  • to alert you to shortcomings in even the most official of datasets: caveat emptor
  • to give you practice in judging among several good models

This project has the following deliverables:

  • the answer sheet to this lab, printed, filled out, and autographed
  • your Excel spreadsheet, autographed, showing:
    • your six simple linear regressions' output for all 46 records (R, t, b, a, R2, and prob-value)
    • your three multiple regression models for all 46 records (kitchen sink, reduced and reduced further through backwards elimination)
    • six scatterplots with linear trendlines added in for all 46 records, showing the association between Y and each of the Xi variables
    • your six simple linear regressions' output for all but the extreme outlier (n=45)
    • two more multiple regression models for this reduced dataset (kitchen sink and the single obvious reduced model)

Your data set this time pertains to an important social issue that varies in context and impact geographically at a variety of scales, international, intranational, and intrastate: murder involving gun use. The data come from a variety of sources, including U.S. Census estimates of the population in 1997, the FBI Uniform Crime Report for 1997, the L.A. Times' report on the 2000 presidential election results, and the National Rifle Association reports on the status of gun control laws in 1999. The scale of analysis is intranational: variations from state to state.

Unfortunately, these data, the best available at the time I put the database together back in 2001, are buggy, which is unfortunate, given the importance of the issue and the demand for risk management policy. For example, I could not find NRA data on gun control status for years earlier than 1999; ideally, gun control data should be older than the dependent variable. We can proceed with this as a pædagogical exercise, hoping that there have not been significant changes in the stance various states have taken towards concealed-carry laws from 1996 or 1997 to 1999.

Another example often widely discussed is the flawed nature of the Uniform Crime Report. The FBI must depend on local police agencies to report crime broken down the way the FBI defines various crimes. Political agendas vary among police departments, which may affect their tendency to underreport or overreport crime rates in their jurisdictions. The 1997 data also feature several missing states, so n=46, instead of 51.

Yet another problem is the Census data. They are not hard counts in the sense of the 1990 and 2000 censuses (even the nose-counts are themselves subject to underrepresenting the population in large cities, which affects their share of Federal grants). The 1997 data are U.S. Census estimates of the population, based on the agency's mathematical projections. The per capita income and urbanization data, however, are for 2000 and 1999, respectively.

The L.A. Times' data on the presidential election represent the percentage of voters choosing Bush out of those voting for Bush, Gore, or Nader (Buchanan, Hagelin, other third party candidates, or write-in candidates were not included).

The NRA data are in an ordinal format. The degree of permissiveness toward concealed carry is not shown as a scalar continuum, which is ideal for regression, but as an ordinal ranking of difficulty in getting the right to carry a gun concealed on your person when you venture out into public space. 0 stands for it's completely legal and no permit is needed (Vermont is the only state that does that, but it's not in this data set); 1 for "shall issue" states (meaning you ask for a permit and you get it, but the police know you have it, as in Connecticut and Utah); 2 means the state offers concealed carry permits but county sheriffs have the final say and there's a lot of variation within state (e.g., California); 3 means that restrictions are extremely tight, as in Illinois and Missouri, and there is little to no chance of earning a permit (the state may not allow concealed carry at all). Regression, however, is pretty robust with ordinal X variables (but not Y variables, which is where logistic regression came from), so we'll treat these numbers as if they were scalar and just go ahead.

So, with these caveats of mismatches among scales and timing and levels of measurement, let's work through an "everything but the kitchen sink" example anyhow just to see how you can use Excel to do multiple regression analysis.

==============================

Getting the Data

==========

The data are again in an Excel spreadsheet formatted with variable names across the top. There are several columns: "A" is "State" (the record name); "B" is your Y variable: "Gunmrd97%" (percentage of the 1997 population murdered with guns). Your X1 through X6 variables are as follows: "C" is "%cen cty 99" (percentage of a state's 1997 population that resides in a central city); "D" is "%gbn Bush" (Y2K voters for Gore/Bush/Nader who voted for Bush, to operationalize the conservative vote); "E" is "2K per cap" (per capita income in 2000); "F" is "%>24:BA" (percentage of the 1997 population 25 or older which has completed at least a baccalaureate degree); "G" is "NRA carry99" (NRA data on the 1999 status of concealed-carry laws); and "H" is "% yngmen" (percentage of the 1997 population consisting of young men aged 15-24 years, who are disproportionately represented among arrestees for violent crimes in general).

Again, click here to download the file. If it downloads and opens up in Excel, great. The lab browsers may warn you about the potential security hazard. You can either open the file or save it and then navigate to where you saved it and open it in Excel. Whatever you do, make sure to pick the "Save as" option at some point and save the file to your diskette (e.g., a:\gunmurder.xls). Remember, Woody purges the student storage area once a week or so.

==============================

Hypothesis Formulation

==========

State what you think the relationship is going to be between each of the X variables and the Y variable: direct or inverse.

Y and X1:
______________________________________________________________________________  

Y and X2:

______________________________________________________________________________  

Y and X3:

______________________________________________________________________________  

Y and X4:

______________________________________________________________________________  

Y and X5:

______________________________________________________________________________  

Y and X6:

______________________________________________________________________________  
Because each of you may come to different expectations and be able to defend them theoretically, we'll evaluate the significance of these associations using two-tailed tests (half the alpha goes on each end of the normal curve to define two rejection regions). That is, the null version of all of these expectations will be "there is no significant association between Xi and Y."

Pick an alpha level that you think optimizes the hazards of getting all excited over random associations or missing associations that might be there in our buggy data (Type I and Type II error balancing act). Briefly defend your "truth standard." Think about whether this is a more exploratory study with weaker standards or a final polished study with Type I consequences for policy that are so severe they require more rigorous standards.

alpha = __________

______________________________________________________________________________

______________________________________________________________________________

==============================

Building Simple Linear Models of Each of the Associations

==========

Let's have a look at the array of X variables to see if any of them are likely to be significantly associated with the percentage of gun murders. In cell b49, enter "r"; in b50, "t"; in b51, "b"; in b52, "a"; in b53, "r sq"; and in b54, something new: "alpha 45, 2, .05" (for prob-value of your tcalc, at 45 degrees of freedom and 2 tails).

Now, copy the following formulas in the following cells.

  • In c49, put in =CORREL($b$2:$b$47,c2:c47)
  • In c50, put in =(C49*(SQRT(46-2)))/SQRT(1-C49^2)
  • In c51, put in =SLOPE($B$2:$B$47,C2:C47)
  • In c52, put in =INTERCEPT($B$2:$B$47,C2:C47)
  • In c53, put in =C49^2
  • In c54, put in =TDIST(ABS(C50),45,2)

The $ signs force Excel to keep absolute references to the Y variable in column B, so when you copy the formulas to cover the other five X variables, you'll be correlating each of them to Y (instead of to the preceding X variable!).

TDIST is a prob-value generator built into Excel. Thought you'd appreciate that after struggling with the t-tables in your basic statistics class! The first number in the parentheses is your tcalc value. You can enter it in manually or, in this case, by reference to a cell where it was calculated. The next number after the comma is degrees of freedom (n-2, losing one degree of freedom for each variable in your bivariate analysis). The last number in the parentheses is the number of tails to be used in the test (1 or, in this case, 2).

The ABS before (C50), by the way, means that Excel should calculate the t score on the absolute value of C50's contents, whether it's negative or positive. Excel can't calculate negative t scores for some boneheaded reason and, since these are two-tailed tests you're doing, the absolute value will work just fine. Dealing with software not written by statisticians (as here) or cartographers (as with GIS packages) sometimes means you have to find creative workarounds for their inadequate programming. This ABS bit is an example.

Having written your formula in C49, copy c49 to d49:h49 by clicking on cell c49 and moving the cursor until it turns into a skinny black one, holding down the left mouse button, dragging the mouse to cell h49 and letting the button go. Do the same thing for c50 through c54.

Format the whole block from b49 to h54 as numbers at a consistent level of decimal accuracy, perhaps 3 or 4 decimal places of accuracy.

==============================

Interpretation of Your Simple Linear Regressions

==========

Okay, now interpret your results. Which associations turned out significant at your chosen alpha level? Which associations turned out to have the direction you had predicted (direct or inverse)? Are you surprised at your results? Why? What do you think is going on (other than buggy data)?


____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================

Building a First Draft Multiple Regression Model

==========

Let's see if we can improve the explanation offered by any one significant X variable by adding other X variables in with it. You build multiple regression models either by throwing everything you have in there and then eliminating variables that do not contribute significantly to the overall effect (backwards elimination) or by adding one X variable at a time, noting whether it has added anything significant to the model and, if it does, adding another and another until there is no more significant improvement (forward stepwise).

Let's do this by backward elimination. Let's start with everything thrown in, running the regression with all six X variables. We'll note the multiple correlation co-efficient and the adjusted co-efficient of multiple determination and compare them with the original best single factor model to see if the improvement in the model warrants keeping everything in there.

To run a multiple regression with everything, bring down the Tools menu and choose Data Analysis. In the dialogue box, select Regression. This brings up another dialogue box.

For Input Y Range, put in b1:b47; for Input X Range, enter c1:h47. Click in the Labels box to let Excel know that row 1 contains the names of the variables. Under Output Options, click on Output Range and fill in the name of the cell you would like the regression output to begin. This should be below the rest of your spreadsheet and there should be plenty of empty rows below it and empty columns to its right, as you're about to get a bunch of regression stats! Hit Okay and stand back!

Now, to interpret that "Summary Output" mess. At the top of the output, you get a block labelled "Regression Statistics." This includes the multiple correlation co-efficient (R), the co-efficient of multiple determination (R2), and, more importantly, the sample-size-and-variable adjusted R2. Compare the R and adjusted R2 to the best of your original simple linear regression model's r and r2.

Name of variables in your best simple linear regression model:

_________________________ Y variable (dependent, criterion variable)

_________________________ X variable (independent, predictor variable)

Best simple linear regression model r = __________

Everything-in-it regression model R = __________

Best simple linear regression model r2 = __________

Multiple regression adjusted R2 = __________

So, did this exercise in building the more elaborate model pay off in a noticeably better explanation of gun murder death rates?

______________________________________________________________________________





The next block in the output table is labelled "ANOVA" (analysis of variance). You may remember doing an ANOVA in Geography 200, and it is commonly taught in other basic statistics classes. At this point, we are simply concerned with the F statistic that the ANOVA yields to test the significance of the entire model (the way we used the t statistic to test the significance of individual simple linear regression models). Next to the F statistic is a value labelled "Significance" or prob-value. This tells you the probability of getting results as extreme as those in your model if there were only a chance association among your various variables (if the null hypothesis were true). Compare that with the prob-value you got from your best simple linear regression model (oh, Excel will often show really teeny prob-values in scientific notation, e.g., 2.9269E-06, for 2.9269 x 10 to the -06 power. Just move the decimal place, in this example, six decimal places to the left of where it is now shown: 0.0000029269).

At four decimal places of accuracy, is there any noticeable improvement in significance from your best simple linear regression model to the everything-in-it multiple regression model?


______________________________________________________________________________

The last block gives you all sorts of goodies. The first column is your variable names (the labels you told it to keep track of in the original regression dialogue box). The second column is your regression co-efficients. The one beside Intercept is, well, your intercept. The number beside each of the named X variables is its partial correlation co-efficients (bi). With the information in that column, you can now write up your model in the Y = a + bX1 + bX2 + ... bX6 format. Oh, heck, why don't you do that now?

Y = ____ + ____X1 + ____X2 + ____X3 + ____X4 + ____X5 + ____X6

 
That block also gives you the t score for each partial correlation co-efficient and the associated prob-value. This way you can see the significance of each X variable, with the indirect influences it exerts through all other X variables partialed out or controlled.

Have a look at those prob-values and compare them with the ones you got from doing all those simple linear regressions. Two of the variables had very low prob-values when considered in stand-alone simple linear regressions but the highest prob-values when all the variables are considered together (meaning that they show high collinearity with another variable). Which two factors are they?


______________________________________________________________________________

==============================

Refining the Model by Backward Elimination

==========

Well, it seems that not all the variables in our original model are worth bothering with. This was apparent back when we did all those simple linear regressions: There were some high prob-values for individual X variables and gun murder rates.

The collection of useless variables grew upon close examination of the prob-values of the individual X variables in the everything-in-it multiple regression. A couple of the X variables had low prob-values considered alone in their association with Y, but those prob-values ballooned when all X variables were regressed with Y. This was because of collinearity: These X variables produce some of their effects on Y through their high correlations with other X variables.

So, let's whip out the pruning shears and dump all X variables that contribute trivially to understanding gun murder rates. In other words, identify the X variables that do not produce prob-values less than your pre-selected alpha standards. Using that standard, which variables are you going to eliminate?


______________________________________________________________________________

Now, rerun the multiple regression with the surviving variables. Excel will recalculate all the statistics, including the new partial correlation co-efficients. Before you do that, make sure that your surviving X variables are grouped together in adjacent columns, because Excel will only handle multiple X variables if they form an uninterrupted array or block. So, gun murder rates should stay in column B, and your surviving X variables should start in column C and go right from there.

Again, pull down the Tools menu and choose Data Analysis and then Regression. Again, make sure b1:b47 is your Y array, and your X input array should be c1 to whatever47. Again, click on labels. Again, pick some cell at the bottom of your spreadsheet to be the starting cell for your regression output. Hit okay.

Now, concentrate on the R and adjusted R2 statistic. The removal of some X variables will cause a drop in R, to be sure, but the question is what happens to the adjusted R2? Did it rise, stay about the same, or decline? In other words, did the reduced model improve explanation or hurt explanation? Another way to check on model performance is to look at the F statistic and the F Significance or prob-value (the bigger the F stat, the lower the prob-value). Do you see an increase, decrease, or stagnation in the F statistic?


______________________________________________________________________________
Examine the regression co-efficients block and write your new model:

Y = ______________________________________________

Now, look at the prob-values associated with your surviving X variables. Did the reconfiguration of the model cause any of them to get bumped above your alpha standard? If so, which one(s)?


______________________________________________________________________________
Remove any X variable that no longer meets your truth standard and rerun the regression the way you did before (you may be down to one variable, but do the regression again: Excel can do simple linear regression through the same dialogue box -- and it yields the adjusted R2, which I didn't have you calculate before in the simple linear regression section).

Using the regression co-efficients from this iteration, re-write your regression model:


______________________________________________________________________________
Now, let's have a look at the new R (which you expect to decline) and the new adjusted R2? Did the adjusted R2 rise, stay about the same, or decline compared to the everything-in-it model and the first reduced model? In other words, did the latest iteration improve explanation or hurt explanation?

______________________________________________________________________________

==============================

Getting Suspicious
Which, Really, You Should Have Been before Actually Doing the Regression

==========

Okay, now let's have a closer look at these data and models. Do X-Y scatterplots for all six simple linear regressions. Fit them with linear trendlines. Not exactly your homoscedastic balance of points on either side of the trendline, is it?

What you have is an "extreme outlier," a bizarre case that is skewing the associations you found out of all common sense. By examining your data table, can you figure out which record is the extreme outlier on all six graphs? What is the outlier?


______________________________________________________________________________
Okay, let's have a look at its effect. First, highlight your whole data table including your simple linear regression outputs from Cell A1 through Cell H 54, then copy it (Control-C), then click on Sheet 2 at the bottom of your spreadsheet, which will put you in an empty "sheet" or page of your spreadsheet workbook. In Cell A of Sheet 2, paste your spreadsheet in (Control-V). Cool idea, no? You can run different scenarios or experiments on the same data in different sheets and keep everything in the same workbook.

Now, highlight the row containing the extreme outlier. Delete it (Edit -- Delete). You now have 45 records, instead of 46. And look what happens to all those simple linear regression outputs! Everything changes. Drastically.

Do a kitchen sink multiple regression. Not so impressive anymore, is it?

Eyeball those prob-values. Does any variable in the kitchen sink produce a prob-value below your alpha standard? Which one is it?


______________________________________________________________________________
So, now re-do your regression for just that variable. Of the two regressions you did this way, which one produces the better overall significance (F sig or prob-value)?

______________________________________________________________________________


==============================

Interpretation

==========

In clear, readable English below, summarize the highlights of this process. All three models produced in the original multiple regression and backwards elimination section were highly significant, but only one of them combines explanatory power with economy of modelling. Which of the first round of regression models (before you removed the outlier) produced the best F-statistic and proved the most powerful and economical modelling exercise?

How did your analysis change when you threw out the extreme outlier?

The gun control issue will certainly not be resolved by our quick 'n' dirty analysis of questionable and skewed databases, but, within the limits of these data, how did the relationship between gun control and gun murder rates turn out in either of the kitchen sink models? What seems to drive gun murder rates more than anything else? Why did this variable seem so insignificant in the original kitchen sink regression?


____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================
GEOG 400 Home | GEOG 400 Syllabus | Dr. Rodrigue's Home | Geography Home | Scientific Calculator

==============================

This document is maintained by Dr. Rodrigue
First placed on Web: 03/22/01
Last Updated: 02/12/08

==============================