CALIFORNIA STATE UNIVERSITY, LONG BEACH
GEOG 400
Geographical Analysis
Project 3: Logistic Regression Using SPSS
The purpose of this lab is:
- to acquaint you with bivariate and multivariate binomial logistic regression analysis (say that fast a few times)
- to introduce you to a dedicated statistical software package, SPSS (since Excel does not support logistic regression, this is a good point to move over to SPSS, which you'll find much easier to use)
This lab project has the following deliverables:
- the answer sheet to this lab, printed, filled out, and autographed
- your SPSS output (the .spv file), autographed, showing your models:
- bivariate model of elevation difference (either N/S or W/E) and likelihood of finding an archæological site in a given quadrat
- multivariate model resulting from backwards logistic regression
- Excel output:
- Excel spreadsheet of your bivariate logistic regression model, showing, for each Xi and Y: Z, e^Z, Y' (expected probability), and the expected odds of finding a site in a quadrat with that value of terrain steepness (Xi)
- graphs of the probabilities and the odds of finding archæological sites in a given quadrat on the map, depending on terrain steepness (scatterplots with fitted trendlines)
Background Information
You are presented with a 1:24,000 USGS contour map of a stretch along the California coast, which shows the location of 30 Native Californian archæological sites found during a "survey" of the area (actually, I made all of them up). These sites are not differentiated on the map or in the database. The culture involved was a gathering-hunting-fishing people who did not practice agriculture, though they may have altered their environment through the use of fire. Some of the sites are permanent villages; most are temporary base camps or specialized camps. Some people maintained essentially year-round occupance of the villages; others had a seasonally migratory activity pattern allowing utilization of a variety of seasonally and spatially variable resources. Resources moved across space and time through the movements of the more migratory members of a clan and through more formal trade and ritual exchanges among people with fewer kinship ties with one another.
Archæology has always been interested in particular sites. Beginning in the 1950s with the work of Gordon Willey in the Virú Valley of Peru and of Philip Phillips, James Ford, and James Griffin in the Lower Mississippi Valley, however, archæology, like human geography, has developed an interest in systems of sites, how they interact, how they are differentiated into higher order and lower order sites, and how they express the interaction between a particular culture and the environments on which it depended. Even as cultural anthropology and cultural geography share common roots and many interests going back to the 1920s (e.g., Carl Sauer and Alfred Kroeber), archæologists and, first, economic geographers and, later, GIScientists began to converge in the "quantitative revolution" that overtook both disciplines in the 1950s and 1960s and continues to the present day. And, even as physical geography and geology overlap in interests and techniques, archæology now overlaps greatly with both of these fields. This growing amalgamation among physical geography, geology, geophysics, geochemistry, and archæology has come to be recognized as "geoarchæology"!
One outgrowth of all this interdisciplinary cross-fertilization has been the interest of archæologists in "predictive modelling," sometimes done in a GIS or exported to a GIS. This involves the use of statistical techniques to characterize the sites and spatial situations of archæological sites in order to begin probabilistic prediction of where more such sites might be found. One of the most common statistical techniques used is logistic regression. Here are links to a couple of basic sources if you are curious about this development:
- Banning (2002), Archæological Survey
- Baxter (2003), Statistics in Archæology
Logistic regression is very similar to simple linear regression and to multiple regression in that it models a Y variable's response to one or more X variables' variations. Like these previously introduced regression systems, multiple logistic regression can also deal with the occasional non-scalar X variable. Now that I've lulled you into your comfort zone, let me introduce you to the kicker in logistic regression.
Y (as observed) is not your basic garden variety scalar dependent variable. It is a binary variable. There are only two states that Y can take: on-off, present-absent, yes-no, 0-1. What you're trying to do with logistic regression is predict the change in the odds of getting one or the other of the two outcomes, depending on which value X takes (X is often scalar, but it can be ordinal or even nominal). You fit a logistic (S-shaped) curve to the probabilities of getting either of the Y states, depending on the values of X. The reason it takes the logistic shape is to keep the curve from undershooting or overshooting the maximum (1) and minimum (0) values that the Y value can take, which is what a linear curve will do. The logistic curve will move between the two states but approach each asymptotically (never quite getting to 0 or 1).
It gets worse. Odds are not the same thing as probability or chance as we've dealt with them all semester. Remember how in probability you divide the number of trials that come out fitting some requirement by the total number of trials? That is, if you get 20 floods of a given magnitude out of 100 years of stream data, you figure you have yourself a 20% chance or a 0.20 probability of getting at least such a flood next year. Odds have a different denominator. It's not the total number of trials but the number of trials having the opposite outcome. It would be like saying that you have 20:80 or 1:4 (that is, 0.25) odds of getting that level of flood.
As if that weren't bad enough to keep in mind, there's another plot complication with logistic regression, and that is the meaning of the b coëfficient. In regular old regression, b can be thought of as the slope of the regression line or plane, and it is a constant. In logistic regression, b is the natural logarithm (to the base e or 2.71828) of the odds ratio. The odds ratio is constant throughout the range of X, so, in that sense it is kind of like a slope. Sort of. Oy, vey!
Probability: p/(p+q)
Odds: p/q
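If it helps to see the two denominators side by side, here is a minimal Python sketch (Python is not part of this lab; it's just standing in as a calculator). It uses the flood example from above, and the function names are made up for illustration.

```python
def probability(hits, misses):
    """Probability: p / (p + q), hits over ALL trials."""
    return hits / (hits + misses)


def odds(hits, misses):
    """Odds: p / q, hits over trials with the OPPOSITE outcome."""
    return hits / misses


def odds_from_probability(prob):
    """Convert a probability into odds: prob / (1 - prob)."""
    return prob / (1.0 - prob)


def probability_from_odds(o):
    """Convert odds back into a probability: o / (1 + o)."""
    return o / (1.0 + o)


# The flood example: 20 flood years and 80 non-flood years out of 100.
flood_years, other_years = 20, 80
p = probability(flood_years, other_years)
o = odds(flood_years, other_years)
print("probability:", p)
print("odds:", o)
```

The two conversion functions are the ones you will lean on later, when you turn expected probabilities into expected odds.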
The odds ratio is the change in the odds of getting a given outcome, for every one-unit change in the X variable. To compute the odds ratio for an X variable, you multiply the b coëfficient by the size of the increase in X (e.g., 1) and then raise e to the power of that product. That is, for a one-unit increase, e^b = odds ratio.
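Here is that odds ratio arithmetic as a short Python sketch. The coëfficient is invented purely for illustration (it is not from the lab data), and `odds_ratio` is a made-up helper name.

```python
import math


def odds_ratio(b, step=1.0):
    """Multiply b by the size of the increase in X, then raise e to that power."""
    return math.exp(b * step)


b = -0.643  # hypothetical coefficient, for illustration only

print("odds ratio for a 1-unit increase in X:", odds_ratio(b))
print("odds ratio for a 10-unit increase in X:", odds_ratio(b, step=10))
# Going the other way, the natural log of the odds ratio gives b back.
print("ln(odds ratio):", math.log(odds_ratio(b)))
```

An odds ratio below 1 (which is what a negative b gives you) means the odds shrink as X grows; above 1 means they grow.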
So, read chapter 7 in Grimm and Yarnold.
Back to archæology and site location analysis/predictive modelling. You will use logistic regression to compute the probability and the odds of finding a site in a given areal unit (a quadrat 1/100th of a square kilometer or 2.5 acres) depending on the steepness of the terrain in that quadrat and then evaluate the model. You will also build a multivariate logistic regression analysis of several environmental variables and again compute the probability and the odds of finding a site in a given quadrat.
You can then turn the analysis around and try to create a set of guidelines for site prospecting surveys along other stretches of coastline in the culture zone. Where should prospecting teams look for signs of possible archæological occupations for further investigation? What, if anything, was important to the pre-Spanish Indians of this stretch of California coastline as they set about emplacing their villages and camps?
Getting Your Data
You can view the map at
- http://www.csulb.edu/~rodrigue/geog400/logitregrdatamapgrid.jpg or
- http://www.csulb.edu/~rodrigue/geog400/logitregrdatamapgrid.pdf
Your data sheet is at
You can open the file in Excel to have a look at it. Or not.
Now, you can import the spreadsheet into SPSS. Fire up SPSS (Start button on your lab computer, then All Programs, then SPSS for Windows, then SPSS for Windows 17.0).
Make sure that you don't have your spreadsheet open in Excel, or SPSS will have a cow.
When SPSS comes up, select Open an Existing Data Source. Browse over to where you saved that spreadsheet and then you'll see an opening box asking if you want it to Read Variable Names from the First Row of Data (yes, make sure it's clicked) and identifying the Worksheet. You can leave the Range blank. Hit OK.
All the variable names (which have to be 8 characters or less, by the way) are now embedded in the grey box at the top of each column. You can change the width of the columns, just like in Excel.
A note about those very short variable names.
- Quadrat is the designation of each square in the map you downloaded, the names arranged just like a spreadsheet. Quadrat B3 is the square in the second column and the third row.
- CentElev is the elevation in feet of the exact center of each quadrat. From that, four measures of terrain steepness are derived:
- two having to do with slope gradients (the gradient in either case is the rise (or fall) in feet over the run in feet: how many feet of elevation difference divided by the 0.1 km of a quadrat's sides, converted into 328 feet):
- NSGrad is the gradient from the center of a quadrat to the center of the quadrat just south of it.
- WEGrad is the gradient from the center of a quadrat to the center of the quadrat just east of it.
- Also derived from CentElev are two sets of raw elevation differences:
- DiffSQd is the difference in feet from one quadrat to the one just south of it
- DiffEQd is the difference in feet from one quadrat to the one just east of it
- Stream means a stream is present anywhere in the quadrat (1) or not (0)
- SitePres means an archæological site is present (1) or not (0).
- SiteID gives numbers for each of the 30 sites, which are keyed to the magenta numbers on the map.
- SiteElev is the elevation of a given site in feet
- DistWtr is the distance from a site to the nearest source (on the map, anyway) of fresh water
- ElevWtr is the elevation difference between a site and the nearest source of fresh water
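To make the gradient arithmetic in the list above concrete, here is a small Python sketch of how NSGrad or WEGrad could be derived from two center elevations. The elevations are invented, and the sign convention (this quadrat minus its neighbor) is an assumption; check the spreadsheet to see which way the differences were actually taken.

```python
QUADRAT_SIDE_FT = 328  # 0.1 km converted to feet, rounded as in the handout


def elevation_difference(center_elev_ft, neighbor_elev_ft):
    """Raw difference in feet (like DiffSQd or DiffEQd).

    Assumption: this quadrat minus its neighbor to the south or east.
    """
    return center_elev_ft - neighbor_elev_ft


def gradient(center_elev_ft, neighbor_elev_ft, run_ft=QUADRAT_SIDE_FT):
    """Rise (or fall) in feet over the run in feet (like NSGrad or WEGrad)."""
    return elevation_difference(center_elev_ft, neighbor_elev_ft) / run_ft


# Hypothetical center elevations for a quadrat and the one just south of it.
diff = elevation_difference(400, 318)
grad = gradient(400, 318)
print("difference (ft):", diff)
print("gradient:", grad)
```

Since the gradient is just the raw difference divided by a constant, the two kinds of steepness measure carry the same information on different scales.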
You may notice that we do not have data for every single one of the 304 quadrats on the map. Obviously, 40 are in ocean water, and we're not doing marine geoarchæology here! Also, you will notice that the columns on the far right of the spreadsheet are really sparse. These are data strictly for the 30 quadrats containing the archæological sites themselves, not for the quadrats without sites. We will not be dealing with those (columns I:L) in this lab (whew! but make sure you don't put them in the "kitchen sink" later!).
Hypotheses
- So, what do you suppose is the relationship of the steepness of terrain and the likelihood of finding an archæological site? Is it direct or inverse? What makes you think so?
______________________________________________________________________________ ______________________________________________________________________________
- What would be the null version of this hypothesis?
______________________________________________________________________________ ______________________________________________________________________________
- Select and justify an appropriate alpha level for this test.
______________________________________________________________________________ ______________________________________________________________________________
Analyzing Your Data in SPSS
Pick "Analyze" from the SPSS Data Editor (the spreadsheet in SPSS), which is a button about in the middle of the ruler bar across the top. Then, choose "Regression." Under that, pick "Logistic."
Choose your dependent variable (SitePres) and your first "covariate," which would be one raw measure of terrain steepness ("DiffSQd" OR "DiffEQd"). Hit "OK." That's all there is to it! After all that build-up!
Before you get tooooo comfortable, do it AGAIN! This time use "Stream" as your independent "covariate."
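SPSS does the fitting for you, but if you are curious about what is going on under the hood, here is a rough Python sketch of fitting Z = a + bX by maximum likelihood with Newton-Raphson iteration. It illustrates the general technique and is not a reproduction of SPSS's own routine. The ten X and Y values are invented (they are not the lab data), and `fit_logistic` is a made-up name.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit Z = a + b*X to binary ys by Newton-Raphson maximum likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # 2x2 information matrix
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))  # current expected probability
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Invented toy data: X is a steepness-like measure, Y is site present (1) or not (0).
xs = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
ys = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

a, b = fit_logistic(xs, ys)
print("a (Constant):", a)
print("b (slope):", b)
```

The toy data deliberately overlap (some 1s at high X, some 0s at low X); with perfectly separated data the estimates would run off toward infinity, which is a known quirk of logistic regression.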
Making Sense of the Output
A whole pile of computer schmutz will appear in the output box each time you do this. You might want to save it at this point. SPSS will save this as a .spv file. At the very bottom each time, there should be a box with two rows of output and seven columns. The column on the left side, titled "B," gives you the constants, b and a (or b1 and b0), for the simple linear regression model that starts the whole logistic regression process (Z = a + bX). The bottom number ("Constant") is a, the Y intercept. The top number is b, the slope (they're both called B, because many people refer to a and b as b0 and b1, respectively: Z = b0 + b1X). You now have enough to show your logistic regression model as Z = a + b1X1. And, after doing it for the other "covariate," you can write out Z = a + b2X2.
- Z1 = ________ + ________X1
Z2 = ________ + ________X2
From Z, you are now in a position to calculate the expected Y or Y', which in logistic regression is the probability of a quadrat having an archæological site (a Y of 1). To do that, put Z into the following formulæ:
- Y'1 = e^(________) / (1 + e^(________))
Y'2 = e^(________) / (1 + e^(________))
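As a quick illustration of how Z turns into Y', here is a minimal Python sketch. The constants a and b are invented for illustration (use the ones from your own SPSS output), and `expected_probability` is a made-up name.

```python
import math


def expected_probability(a, b, x):
    """Y' = e^Z / (1 + e^Z), where Z = a + b*X."""
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))


a, b = 3.338, -0.643  # hypothetical constants, for illustration only

for x in [0, 5, 10, 20]:  # hypothetical X values
    print("X =", x, " Y' =", expected_probability(a, b, x))
```

However large or small X gets, Y' stays strictly between 0 and 1, which is the whole point of the S-shaped curve.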
We're now in a position to have a bit of fun with this and make it "come alive"! At this point, open your original Excel spreadsheet. To keep your analyses separate from the original data, we're going to open a new worksheet inside that spreadsheet, one for each of your two X variables.
To do this, first highlight the column containing SitePres and the terrain steepness variable you used above (DiffSQd or DiffEQd). Then, hit Control C to copy them onto the clipboard. Then, click on Sheet 2 at the bottom of the screen. Once you're in Sheet 2, put your cursor at A1 and hit Control V to copy your columns onto the new sheet. You can click on the Sheet 2 tag and right click and rename it to DiffSQd or DiffEQd, so you can find it more easily later. Do the same in Sheet 3, but this time use Stream and SitePres (and maybe rename Sheet 3 as Stream).
In each sheet, make sure that your X variable is in the A column and SitePres is in your B column. Now, highlight cell C1 and name it Z. Z = a + bX. In C2, type a formula for Z. Start with =, followed by whatever you got for a (up in #4 above), followed by +, followed by whatever you got for b1, followed by *, followed by A2. It should come out looking something like this (I'm making up constants here): =3.338-0.643*A2.
Moving to cell D1, type in Y'. This is the expected probability of getting a 1 or a yes or, here, an archæological site. It is calculated as e (2.718282) raised to the Z, divided by 1 plus e raised to the Z. In D2, type =(exp(1)^C2)/(1+exp(1)^C2) or, equivalently, =EXP(C2)/(1+EXP(C2)). This will render your formula above (in #5) in Excel-speak.
Now that you've calculated the expected probability of a site being found in a given quadrat, depending on terrain steepness, you can calculate the odds of finding one. To do this, you divide the probability or the Y' for any given X by 1 minus the probability. So, in E2, it would be =D2/(1-D2), and, voilà, the odds that a site will be found for a given value of X.
A nice way of checking your work is to calculate the log odds and put them in Column F. This is LN(Y'/[1-Y']), which is equal to Z, which is equal to a + bX. Label F1 "log odds" and, then, in Excel-speak, type in =LN(E2) or, alternatively, =LN(D2/(1-D2)). Now, it had better be equal to the answers in C2. If it isn't, you made a boo-boo somewhere and you need to de-bug it.
Having done all that (with no "accidents"), now copy each formula (C2, D2, E2, and F2) down the whole column of numbers. You can do that by holding down the Control key and highlighting C2:F2 and then moving the cursor to the lower right corner of cell F2 until it changes from the fat white cursor to the skinny black one and then drag down to row 265.
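If you would like a second opinion on your spreadsheet, the Excel columns described above (Z in C, Y' in D, odds in E, log odds in F) can be sketched in Python like this. The constants are the same sort of made-up numbers as in the example formula above, the X values are invented, and `spreadsheet_row` is a hypothetical helper.

```python
import math

A, B1 = 3.338, -0.643  # made-up constants; substitute your own a and b


def spreadsheet_row(x, a=A, b=B1):
    """Return (Z, Y', odds, log odds) for one X, mirroring columns C through F."""
    z = a + b * x                                    # column C: Z = a + bX
    y_prime = math.exp(z) / (1.0 + math.exp(z))      # column D: expected probability
    odds = y_prime / (1.0 - y_prime)                 # column E: expected odds
    log_odds = math.log(odds)                        # column F: should equal Z
    return z, y_prime, odds, log_odds


for x in [0, 2, 5, 8, 12]:  # hypothetical terrain steepness values
    z, y_prime, odds, log_odds = spreadsheet_row(x)
    matches = math.isclose(z, log_odds, abs_tol=1e-9)
    print(x, z, y_prime, odds, log_odds, "F matches C:", matches)
```

If the last column ever comes out False, that is the Python equivalent of the boo-boo check in Column F.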
That done, you can make X-Y scatterplots of probabilities (columns A and D) and of odds (columns A and E) as they vary with terrain steepness. There's no particular point to doing this with the Stream and SitePres relationship, so don't bother: You get two dots, each in opposite corners. But with the steepness and SitePres relationship, it can be done. You should faintly see the logistic (S-shaped) curve on the probabilities with steepness graph and a clearly concave pattern with the odds. Looking at your regression significance values in the SPSS output, want to speculate on why the logistic trend is so faint?
Leaving Excel and squinting at the SPSS output, another column, on the right, is entitled "Exp(B)" and this stands for "exponentiated B," or e raised to the b AKA the odds ratio (the natural logarithm [ln] of the odds ratio is b). This tells you how much each foot increase in terrain steepness from one quadrat to the next alters the odds that a site was found in that quadrat. This change in odds is constant throughout the range of X: In this sense, logistic regression is a kind of linearization of a binary outcome and a logistic probability curve. So, what is the odds ratio for your model? Enter your Exp(B):
Exp(B) = ________
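To see for yourself that the change in odds really is constant across the range of X, here is a small Python sketch. It compares the odds at X + 1 with the odds at X for a run of invented X values, using invented constants (not the lab's), and the helper name `odds_at` is hypothetical.

```python
import math

a, b = 3.338, -0.643  # hypothetical constants, for illustration only


def odds_at(x):
    """Expected odds at a given X: e raised to Z, where Z = a + b*X."""
    return math.exp(a + b * x)


# Ratio of the odds one unit apart, at several different starting points.
ratios = [odds_at(x + 1) / odds_at(x) for x in range(0, 10)]
for x, r in zip(range(0, 10), ratios):
    print("X =", x, "to", x + 1, " ratio of odds:", r)

print("Exp(B):", math.exp(b))
```

Every ratio comes out the same as Exp(B), even though the probabilities themselves change by different amounts at different points along the S-curve.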
There's another column in the SPSS output entitled "sig" and this is the significance or the prob-value attached to your variables. SPSS calculates it as b divided by the standard error (S.E.) of b, which is then squared to give the "Wald statistic." The calculated Wald statistic is then compared with a critical Chi-square value to estimate the probability that random sorting would produce a pattern as extreme as yours, with a calculated Wald statistic as big as yours (in other words, the probability of a Type I error).
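The Wald arithmetic described above can be sketched in a few lines of Python. The b and S.E. values are invented for illustration. The prob-value line uses the fact that a Chi-square with 1 degree of freedom is a squared standard normal, so its upper tail can be written with the complementary error function; that shortcut only works for 1 df.

```python
import math


def wald_test(b, se):
    """Return the Wald statistic (b / S.E.) squared and its 1-df Chi-square prob-value."""
    wald = (b / se) ** 2
    # Upper-tail probability of a Chi-square with 1 df at the value `wald`.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


b, se = -0.643, 0.25  # hypothetical coefficient and standard error
wald, sig = wald_test(b, se)
print("Wald:", wald)
print("sig:", sig)
```

A handy sanity check: a b that is 1.96 standard errors from zero gives a Wald of about 3.84 and a prob-value of about 0.05.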
Actually, there's quite a bit of ferment in statistics about coming up with reliable estimates of prob-value, or significance, for logistic regression. The Wald statistic/Chi-square approach may possibly inflate prob-values in certain situations involving big b values, thus increasing the likelihood of a Type II error (missing something important going on). Pending resolution of that issue, the Wald/Chi-square statistic is the best we have to go on, and that's why it's used in SPSS. The Wald is reported as "Score" and its significance is reported in the Variables box (either "in the equation" or "not in the equation," depending on how well it did) next to "df" (or degrees of freedom).
Anyhow, with that caveat, hopefully, your sig for your X variable is less than your chosen alpha level. Is it?
- ________Yes ________No
Which helps put that not-so-logistic curve in context.
Another handy bit in that output mess is the little classification table, right above the model itself. It compares the actual observations of sites found and not found with the predictions from your model. That is, it classifies the X value (here, Diff in elevation) by likelihood of a site being found in a quadrat, setting the cutoff at the value where the predicted probability of a site is 0.5, that is, where the odds for finding a site are 50:50 (that cutoff works out to -a/b). If b is negative, any X smaller than that value will be assumed to have a site (Y=1) and any X larger than that value will be assumed to be barren of found sites (Y=0); if b is positive, it's the other way around. The counts in the table, then, are the number of quadrats with observations of actual sites compared to those that don't, cross-tabulated with the model's prediction (those quadrats having X values above the 50:50 cutoff and those below the cutoff).
This gives you a picture of how powerful your classification model is. The significance of the model as a whole is presented in the "Omnibus Tests of Model Coefficients" block above the table. That Chi-square is a likelihood-ratio test: it measures how much your model improves the fit (the drop in -2 log likelihood) over a model containing only the constant, and its 1 df comes from the single X variable you added. The model significance should be lower than your chosen alpha level above. Is it significant?
- ________yes ________no
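The classification table described above can be sketched in Python as follows. Everything here is invented for illustration: the constants, the ten X values, and the ten observed Y values are not the lab data, and `classification_table` is a made-up name.

```python
import math


def classification_table(xs, ys, a, b, cutoff_prob=0.5):
    """Cross-tabulate observed Y against predicted Y, keyed as (observed, predicted)."""
    table = {(obs, pred): 0 for obs in (0, 1) for pred in (0, 1)}
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))  # expected probability Y'
        pred = 1 if p >= cutoff_prob else 0
        table[(y, pred)] += 1
    return table


a, b = 3.338, -0.643                # hypothetical constants
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical X values
ys = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical observations (1 = site present)

table = classification_table(xs, ys, a, b)
correct = table[(0, 0)] + table[(1, 1)]
print("X value at the 50:50 cutoff (-a/b):", -a / b)
print("observed 0:", table[(0, 0)], "predicted 0;", table[(0, 1)], "predicted 1")
print("observed 1:", table[(1, 0)], "predicted 0;", table[(1, 1)], "predicted 1")
print("percent correct:", 100.0 * correct / len(ys))
```

The diagonal cells (observed and predicted agree) are the model's successes; the off-diagonal cells are its misses.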
There's another box, one called "Model Summary." It includes two measures floating around the statistical universe, which are supposed to be "like" an R2 coëfficient of determination in regular linear regression. They're supposed to range from 0 to 1, so you can interpret them similarly to ordinary R2adj (although the Cox and Snell version can never quite reach 1, which is what the Nagelkerke version rescales it to fix). That this is still a "work in progress" can be seen in the discrepancy between the two measures provided by SPSS (Cox and Snell, Nagelkerke). Soooo, how much does variation in terrain steepness affect likelihood of finding an archæological site in a given quadrat? Report the two R2 stats provided:
- Cox & Snell R2: ________
- Nagelkerke R2: ________
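For the curious, here is a Python sketch of how these two measures are commonly defined, starting from the -2 log likelihood of a constant-only model and of your fitted model. The numbers are invented, `pseudo_r2` is a made-up name, and this is offered as the textbook definitions rather than as a guaranteed match to every digit of SPSS output.

```python
import math


def pseudo_r2(neg2ll_null, neg2ll_model, n):
    """Return (Cox & Snell R2, Nagelkerke R2) from two -2 log likelihood values.

    neg2ll_null:  -2 log likelihood of the constant-only model
    neg2ll_model: -2 log likelihood of the fitted model
    n:            number of cases (quadrats)
    """
    cox_snell = 1.0 - math.exp(-(neg2ll_null - neg2ll_model) / n)
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)  # ceiling Cox & Snell can reach
    nagelkerke = cox_snell / max_cox_snell
    return cox_snell, nagelkerke


# Hypothetical values, for illustration only.
cs, nk = pseudo_r2(neg2ll_null=13.46, neg2ll_model=11.29, n=10)
print("Cox & Snell R2:", cs)
print("Nagelkerke R2:", nk)
```

Because Nagelkerke is Cox and Snell divided by a ceiling smaller than 1, it always comes out at least as large, which accounts for the gap you see between the two in the Model Summary box.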
Interpretation
Briefly summarize in regular English the nature of the association you looked at, starting with how significant it is. If it's significant, also comment on its direction (direct or inverse) and on how strongly the likelihood of a quadrat having an archæological site varies with terrain steepness, in other words, its strength as reported in the R2 measures.
____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________
Multivariate Binomial Logistic Regression
As mentioned in class, one of the extensions of logistic regression is into a multivariate format! Oh, NO!
This time, go back to Analyze and, this time, move SitePres over to the Dependent variable box and CentElev, DiffSQd, NSGrad, DiffEQd, WEGrad, and Stream into the Independent box. Yes, we're starting with a sort of kitchen sink (well, everything except columns I through L) and working backward, tossing in all the spaghetti to see what sticks to the walls. So, you need to pick a Method. In the drop down menu in the box below the variable selection boxes, please select Backward LR and then OK.
Just as in ordinary multiple regression, you'll get a series of models, four of them. Each one is checked by SPSS for significance of all variables left in the model, with the insignificant ones dropped systematically in each round to build the next model. The end result (Step 4) is a model that explains the likelihood of each of the two possible outcomes with the greatest power, yet with the fewest possible variables. It will be shown in the "Variables in the Equation" box, at the bottom, beside Step 4. If you're feeling particularly frisky (not required, but perversely amusing), you might try running the model over by entering all the variables the way you just did, but this time select Forward LR as the method. Did you get a different end-state model?
So, when all is said and done, what is the model distilled out of the backwards elimination approach? What seems to drive Native Californian location selection?
So, in this lab, I've walked you through a geoarchæological example of logistic regression. Can you come up with another application of this technique in some other area of interest to you? Cultural geography or anthropology? Economic geography or planning? Geology or geomorphology? Biogeography or ecology? Crime analysis or social geography? Political geography or political science? Marketing or business location analysis?
____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________
Think up some situation where predicting a binary outcome in this way might be useful and describe it briefly. I'm trying to get you to think about these techniques beyond the particular narrow application I build the labs around (and that might help you in your independent analysis of something up your own alley later in the semester ...).
____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________ ____________________________________________________________________________________________________
This document is maintained by Dr. Rodrigue
First placed on Web: 02/17/08
Last Updated: 02/20/11