CALIFORNIA STATE UNIVERSITY, LONG BEACH

GEOG 400
Geographical Analysis

Project 3: Logistic Regression Using SPSS

==========

The purpose of this lab is:

  • to acquaint you with bivariate and multivariate binomial logistic regression analysis (say that fast a few times)
  • to introduce you to a dedicated statistical software package, SPSS (since Excel does not support logistic regression, this is a good point to move over to SPSS, which you'll find much easier to use)

This lab project has the following deliverables:

  • the answer sheet to this lab, printed, filled out, and autographed
  • your SPSS output (the .spv file), autographed, showing your models:
    • bivariate model of elevation difference (either N/S or W/E) and likelihood of finding an archæological site in a given quadrat
    • multivariate model resulting from backwards logistic regression
  • Excel output:
    • Excel spreadsheet of your bivariate logistic regression model, showing, for each Xi and Y, Z, e^Z, Y' (expected probability), and the expected odds of finding a site in a quadrat with that value of terrain steepness (Xi)
    • graphs of the probabilities and the odds of finding archæological sites in a given quadrat on the map, depending on terrain steepness (scatterplots with fitted trendlines)

==============================

Background Information

==========

You are presented with a 1:24,000 USGS contour map of a stretch along the California coast, which shows the location of 30 Native Californian archæological sites found during a "survey" of the area (actually, I made all of them up). These sites are not differentiated on the map or in the database. The culture involved was a gathering-hunting-fishing people who did not practice agriculture, though they may have altered their environment through the use of fire. Some of the sites are permanent villages; most are temporary base camps or specialized camps. Some people maintained essentially year-round occupance of the villages; others had a seasonally migratory activity pattern allowing utilization of a variety of seasonally and spatially variable resources. Resources moved across space and time through the movements of the more migratory members of a clan and through more formal trade and ritual exchanges among people with fewer kinship ties with one another.

Archæology has always been interested in particular sites. Beginning in the 1950s with the work of Gordon Willey in the Virú Valley of Peru and of Philip Phillips, James Ford, and James Griffin in the Lower Mississippi Valley, however, archæology, like human geography, has developed an interest in systems of sites, how they interact, how they are differentiated into higher order and lower order sites, and how they express the interaction between a particular culture and the environments on which it depended. Even as cultural anthropology and cultural geography share common roots and many interests going back to the 1920s (e.g., Carl Sauer and Alfred Kroeber), archæologists and, first, economic geographers and, later, GIScientists began to converge in the "quantitative revolution" that overtook both disciplines in the 1950s and 1960s and continues to the present day. And, even as physical geography and geology overlap in interests and techniques, archæology now overlaps greatly with both of these fields. This growing amalgamation among physical geography, geology, geophysics, geochemistry, and archæology has come to be recognized as "geoarchæology"!

One outgrowth of all this interdisciplinary cross-fertilization has been the interest of archæologists in "predictive modelling," sometimes done in a GIS or exported to a GIS. This involves the use of statistical techniques to characterize the sites and spatial situations of archæological sites in order to begin probabilistic prediction of where more such sites might be found. One of the most common statistical techniques used is logistic regression. Here are links to a couple basic sources if you are curious about this development:

Logistic regression is very similar to simple linear regression and to multiple regression in that it models a Y variable's response to one or more X variables' variations. Like these previously introduced regression systems, multiple logistic regression can also deal with the occasional non-scalar X variable. Now that I've lulled you into your comfort zone, let me introduce you to the kicker in logistic regression.

Y (as observed) is not your basic garden variety scalar dependent variable. It is a binary variable. There are only two states that Y can take: on-off, present-absent, yes-no, 0-1. What you're trying to do with logistic regression is predict the change in the odds of getting one or the other of the two outcomes, depending on which value X takes (X is often scalar, but it can be ordinal or even nominal). You fit a logistic (S-shaped) curve to the probabilities of getting either of the Y states, depending on the values of X. The reason it takes the logistic shape is to keep the curve from undershooting the minimum (0) or overshooting the maximum (1) value that Y can take, which is exactly what a straight regression line will do. The logistic curve will move between the two states but approach each asymptotically (never quite getting to 0 or 1).
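If you'd like to see that undershooting and overshooting for yourself, here is a minimal Python sketch. The coefficients and X values are made up purely for illustration (they are not from the lab data), and the function names are my own.

```python
import math

def linear(x, a, b):
    """Ordinary straight-line prediction: a + b*x."""
    return a + b * x

def logistic(x, a, b):
    """Logistic prediction: 1 / (1 + e^-(a + b*x)), always between 0 and 1."""
    z = a + b * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to make the contrast visible.
A_LIN, B_LIN = 0.5, 0.1
A_LOG, B_LOG = 0.0, 0.4

xs = [-20, -10, 0, 10, 20]  # made-up X values
linear_preds = [linear(x, A_LIN, B_LIN) for x in xs]
logistic_preds = [logistic(x, A_LOG, B_LOG) for x in xs]

for x, lin, log in zip(xs, linear_preds, logistic_preds):
    print(f"x={x:>4}  line={lin:6.2f}  logistic={log:.4f}")
```

The straight line happily predicts "probabilities" below 0 and above 1 at the ends of the range, while the logistic values crowd up against 0 and 1 without ever reaching them.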

It gets worse. Odds are not the same thing as probability or chance as we've dealt with them all semester. Remember how in probability you divide the number of trials that come out fitting some requirement by the total number of trials? That is, if you get 20 floods of a given magnitude out of 100 years of stream data, you figure you have yourself a 20% chance or a 0.20 probability of getting at least such a flood next year. Odds have a different denominator. It's not the total number of trials but the number of trials having the opposite outcome. It would be like saying that you have 20:80 or 1:4 odds (that is, odds of 0.25) of getting that level of flood.

  • Probability: p/(p+q)

  • Odds: p/q
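Here is the flood example worked as a small Python sketch, treating p as the count of trials with the outcome and q as the count with the opposite outcome. The function and variable names are my own, not anything SPSS or Excel uses.

```python
def probability(p_count, q_count):
    """Probability: outcomes of interest over ALL trials, p/(p+q)."""
    return p_count / (p_count + q_count)

def odds(p_count, q_count):
    """Odds: outcomes of interest over OPPOSITE outcomes, p/q."""
    return p_count / q_count

def odds_from_probability(prob):
    """Convert a probability to odds: prob / (1 - prob)."""
    return prob / (1.0 - prob)

def probability_from_odds(o):
    """Convert odds back to a probability: o / (1 + o)."""
    return o / (1.0 + o)

flood_years, non_flood_years = 20, 80   # 20 floods in 100 years of record
flood_probability = probability(flood_years, non_flood_years)
flood_odds = odds(flood_years, non_flood_years)

print(f"probability = {flood_probability:.2f}")  # 20/100
print(f"odds        = {flood_odds:.2f}")         # 20/80
```

The two conversion functions are there because you will be going back and forth between probabilities and odds when you build your spreadsheet.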

As if that weren't bad enough to keep in mind, there's another plot complication with logistic regression, and that is the meaning of the b coëfficient. In regular old regression, b can be thought of as the slope of the regression line or plane, and it is a constant. In logistic regression, b is the natural logarithm (to the base e, or about 2.71828) of the odds ratio. The odds ratio is constant throughout the range of X, so, in that sense, it is kind of like a slope. Sort of. Oy, vey!

The odds ratio is the factor by which the odds of getting a given outcome change for every one-unit change in the X variable. To compute the odds ratio for an X variable, you multiply the b coëfficient by the size of the increase in X (e.g., 1) and then raise e to the power of that product. That is, for a one-unit increase in X, e^b = odds ratio.
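To convince yourself that the odds ratio really is the same all along the range of X, here is a short Python sketch. The intercept a and coefficient b are hypothetical numbers I picked for illustration; your own will come out of SPSS.

```python
import math

def odds_at(x, a, b):
    """Model odds at a given X: e^(a + b*x)."""
    return math.exp(a + b * x)

def odds_ratio(b, delta_x=1.0):
    """Odds ratio for an increase of delta_x in X: e^(b * delta_x)."""
    return math.exp(b * delta_x)

# Hypothetical coefficients, NOT results from the lab data.
A, B = -2.0, 0.05

# Ratio of the odds at X+1 to the odds at X, at three different starting points.
ratios = [odds_at(x + 1, A, B) / odds_at(x, A, B) for x in (0, 10, 50)]

for x, r in zip((0, 10, 50), ratios):
    print(f"X={x:>2} -> X={x + 1:>2}: odds multiply by {r:.5f}")

print(f"e^b                = {odds_ratio(B):.5f}")
print(f"10-unit odds ratio = {odds_ratio(B, 10):.5f}")
```

Wherever you start on X, a one-unit step multiplies the odds by the same e^b, which is the sense in which the odds ratio behaves "kind of like a slope."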

So, read chapter 7 in Grimm and Yarnold.

Back to archæology and site location analysis/predictive modelling. You will use logistic regression to compute the probability and the odds of finding a site in a given areal unit (a quadrat 1/100th of a square kilometer or 2.5 acres) depending on the steepness of the terrain in that quadrat and then evaluate the model. You will also build a multivariate logistic regression analysis of several environmental variables and again compute the probability and the odds of finding a site in a given quadrat.
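As a rough preview of the arithmetic behind the bivariate spreadsheet (Z, e^Z, Y', and the odds for each X), here is a minimal Python sketch. Everything numeric in it is an assumption: the coefficients a and b and the steepness values are invented, and the column names are just labels I chose. Your real a and b come from your SPSS output.

```python
import math

# Hypothetical intercept and coefficient, NOT from the lab data.
A = -3.0
B = 0.04

def model_row(x, a=A, b=B):
    """Compute the spreadsheet columns for one value of X."""
    z = a + b * x                     # Z: the linear predictor
    e_z = math.exp(z)                 # e^Z
    y_prime = e_z / (1.0 + e_z)       # Y': expected probability of a site
    odds = y_prime / (1.0 - y_prime)  # expected odds, Y' / (1 - Y')
    return {"X": x, "Z": z, "eZ": e_z, "Y_prime": y_prime, "odds": odds}

steepness_values = [0, 25, 50, 75, 100]  # made-up terrain steepness values
rows = [model_row(x) for x in steepness_values]

print(f"{'X':>5} {'Z':>7} {'e^Z':>8} {'Y-prime':>8} {'odds':>8}")
for r in rows:
    print(f"{r['X']:>5} {r['Z']:>7.2f} {r['eZ']:>8.4f} "
          f"{r['Y_prime']:>8.4f} {r['odds']:>8.4f}")
```

Notice that the odds column comes out equal to the e^Z column (give or take rounding), which is a handy check on your own spreadsheet formulas.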

You can then turn the analysis around and try to create a set of guidelines for site prospecting surveys along other stretches of coastline in the culture zone. Where should prospecting teams look for signs of possible archæological occupations for further investigation? What, if anything, was important to the pre-Spanish Indians of this stretch of California coastline as they set about emplacing their villages and camps?

==============================

Getting Your Data

==========

You can view the map at

Your data sheet is at

You can open the file in Excel to have a look at it. Or not.

Now, you can import the spreadsheet into SPSS. Fire up SPSS (Start button on your lab computer, then All Programs, then SPSS for Windows, then SPSS for Windows 17.0).

Make sure that you don't have your spreadsheet open in Excel, or SPSS will have a cow.

When SPSS comes up, select Open an Existing Data Source. Browse over to where you saved that spreadsheet and then you'll see an opening box asking if you want it to Read Variable Names from the First Row of Data (yes, make sure it's clicked) and identifying the Worksheet. You can leave the Range blank. Hit OK.

All the variable names (which have to be 8 characters or fewer, by the way) are now embedded in the grey box at the top of each column. You can change the width of the columns, just like in Excel.

A note about those very short variable names.