CALIFORNIA STATE UNIVERSTIY, LONG BEACH

GEOG 400
Geographical Analysis

Project 1: Linear Regression with Transformed Variables

==========

This lab has the following purposes:

  • to (re)familiarize you with the use of Excel for basic statistics;
  • to refresh your memory of simple linear regression from your basic stats class;
  • to introduce you to the use of mathematical transformations to linearize certain kinds of curvilinear associations between variables; and
  • to illustrate a fundamental physical geography axiom about stream discharge levels and stream velocity.
Project deliverables:
  • a paper copy of this lab filled in and autographed;
  • a paper copy of your spreadsheet, including your three regression models (simple linear, logarithmic, and power), autographed;
  • two histograms, autographed; and
  • three X-Y scatterplots with trendlines, autographed.

==============================

Background

==============================

A bivariate association is a relationship between two variables, something like
  • the heavier the soil, the greater the cation exchange capacity;
  • the greater the annual precipitation in an area, the lower the soil pH;
  • the greater the distance from the freeway, the lesser the concentration of lead in the soil.
Simple linear regression gives us the ability to describe bivariate associations in terms of the familiar statistical concepts of significance testing. Unfortunately, a lot of bivariate associations are not linear, so all of these advantages can't be used all the time.

Luckily, however, sometimes simple mathematical transformations of one or both variables can turn a curved scatterplot into a straight one. If you can do this, you may have found an inherently linear relationship. So? Well, if you can linearize an association, then you can validly use simple linear regression to analyze its direction, strength, and significance (because of algebraic invariance). This can be shaky, if the linearization makes the scatterplot heteroskedastic, so there is some controversy over this, but it remains a very popular simplification. Linearization can make the statistical analysis of some curvilinear relationships not much more complicated than simple linear regression.

Your data set for this project comes from physical geography, having to do with the association between stream discharge and stream velocity. You might want to go over your introductory physical geography textbook and class notes on the various associations among slope angle, stream discharge, bed shape and roughness, and velocity, in case you don't remember all the interactions. Here's an optional link to my own Geography 140 notes on the topic:
http://www.csulb.edu/~rodrigue/geog140/lectures/streams.html.

          Since regression is about specifying the direction of influence, which makes more physical sense?

  1. A. Velocity of stream flow governs discharge?

    B. Discharge of stream flow affects velocity?

    So, which is the X (independent) variable and which is the Y (dependent) variable?

  2. A. Velocity

    B. Discharge

This particular data set is created by the United States Geological Survey (USGS) from remotely sensed stream gaging stations, which record the height of the stream. Height of the stream can be converted to discharge, if prior studies have been done on the local riverbed structure (wetted bed channel perimeter, depth, velocity). These data are coupled with readings of stream flow velocity. This particular data set, from Soos Creek, Washington, was compiled from USGS data by Profs. Greg Langkamp and Joe Hull, Seattle Central Community College, as part of their NSF-funded Quantitative Environmental Learning Project (QELP) grant (NSF Division of Undergraduate Education #9980740).

In this lab, you will build models of the association of stream discharge and stream flow velocity, by playing around with ways of linearizing the data to find the best model. In this project, we are less interested in inferential uses of regression modelling (testing hypotheses) as we are in using regression to describe an association. But, we might as well pick an alpha standard anyhow, to avoid wasting time on insignificant associations. In case you've forgotten, alpha is the probability of seeing a pattern that really isn't there (a Type I error). What's the standard-issue prob-value used for alpha in a scientific study, in which human life is not at direct risk but you don't want to delude yourself into getting excited over patterns that don't exist (Type I error)?

  1. 
    _________________________________
    
    

==============================

Getting the Data

==============================

The data are in an Excel spreadsheet, soosdata.xls. It is formatted with variable names across the top. There are three columns: "A" is "Observations" (to keep track of the 50 random readings reported in this data set); "B" is "Discharge"; and "C" is "velocity." Discharge is given in cubic feet per second, and velocity in feet per second. Again, click here to download the file. It may just open up in Excel. Great. Save it to your own storage device at this point. Alternatively, the lab browsers may warn you about the potential security hazard. If that happens, pick the "Save as" option and save the file to your storage device (e.g., a:\soosdata.xls). You can download your file to the lab hard drives if you've forgotten a flash drive, but be aware that the data are purged each week. If you have to save it on the lab computers, rename the file to your name and soosdata, so you'll be able to find it when you remember to bring your own storage device.

If you've never used Excel before or forgotten everything you learned in my GEOG 200 class, here's a quick refresher tutorial from the University of South Dakota.

==============================

Visualizing the Data

==============================

First, let's get an idea of how much each of the variables, well, vary. In Cell A53, type "minima"; in Cell A54, type "maxima" (minus the quotation marks). In Cell B53, type "min(b2:b51)"; in Cell B54, type "max(b2:b51)". We'll be making use of this later.

To make your lives easier, use Excel's graphics functions to create and view scatterplots of discharge and velocity. To do this, highlight cells B2:C51 (left-click your mouse as you drag it from B2 down to C51). Then click on the "Chart Wizard" box (the little icon with a multi-colored bar chart on it). Up will come the "Chart Wizard ... Chart Type" box.

Pick "X-Y (Scatter)." Then pick the box (probably upper left one) that shows your X and Y axes and a cloud of dots, but no lines. Then click "Next."

Accept Data range: soosdata:$b$2:$c$51 and "series in columns." The dollar signs tell Excel that b2 through c51 are absolute positions (don't worry about that). Note the first draft of your graph coming up just like that. Yes, this is a non-linear association. Click "Next."

At this point, you can customize the graph a lot. You might want to give it a snappy title, such as "Raw discharge and velocity, Soos Creek, WA," or some such (you want to keep track of these scatterplots, so titles and axes are useful). For the X axis, you could label that "Discharge (cfs)" and the Y axis would be something like "Velocity (ft/sec)." When you're done with your creative artistry, hit "Next."

At this point, you can ask Excel to put your graphs in a separate sheet within the spreadsheet or as an object within the main working screen of your spreadsheet. Moi, I prefer to keep everything on one sheet. Hit "Finish," and your graph will appear as a free-floating box within your spreadsheet. The grid of black dots at the corners tells you it's active, so you can put your cursor on a blank white area in it somewhere and, depressing the left mouse button, move the graph someplace convenient and maybe resize it (by clicking one of the corner dots, left-clicking your mouse, and moving it around). You can also get rid of that annoying Series 1 legend box by right-clicking on it and picking "Clear." I stuck my graphs at the right end of the spreadsheet. So, plop it somewhere out of your way.

You can also ask Excel to put a trendline through your data set. Once your graph is created and sitting in your spreadsheet somewhere, put your mouse exactly over one of the data points and click your right mouse button. Select "Add Trendline" and choose the linear version. Painless (but kind of ugly). Look how the trendline starts off with most the dots below it, then, in the middle stretch of the line, most of the dots are above it, and the last stretch has the dots below the line again. Obviously, the linear model is not the happiest description of the data cloud: The data cloud is curved.

Right click the line, pick Format Trendline, and then the Type tab. You can now experiment with some of the other mathematical functions Excel provides for you when you right-click. Which one produces the prettiest curve, the one that seems to reproduce the curving trend in the scatterplot?

  1. 
    _________________________________
    
    

==============================

Building a Simple Linear Model of the Raw Data

==============================

At this point, let's build a regression model of what you have, curvilinear and all. We'll use this model as the baseline for comparing our other experimental models later.

Select the Tools menu from the top of Excel. In it, pick Data Analysis* ???! and then Regression and then OK. For Input Y Range, type C2:C51; for Input X Range, type B2:B51. Under Output Options, pick Output Range and enter a cell far from the scene of action (e.g., U2). Hit OK. A big messy table will appear at Cell U2: That is your simple linear regression model. Pretty simple, eh? It's also pretty ugly. While it's still highlighted, you might want to pick Format from the main menu and then Cells and then Number and accept 2 decimal places. That will make the table a little more readable. You might want to widen a column or two in there so you can read everything (if you don't remember how to do that, just move the mouse into the grey box above the offending column until the cursor changes shape into something that looks like an anticline symbol on a geological map and then left-click your mouse and drag the boundary of the column around).

Now, to intepret it. A simple linear regression model takes the form:

Y = a + bX + e

Where X is the independent or forcing variable (on the horizontal axis), Y is the dependent variable (on the vertical axis) that we're trying to predict with the model, a is the Y intercept (or point on the Y axis where the regression line would touch), b is the slope of the regression line (positive for a direct relationship and negative for an inverse one), and e is the error term (or the amount of residual variation in Y that can't be explained by variation in X). The error term implies that there are other X variables that we could explore in a more elaborate, multivariate model. We'll worry about that later. Most commonly, you'll see the equation shortened to (since everyone knows that there's bound to be at least some unexplained residual error):

Y = a + bX

Excel doesn't call things by their standard notation >:-( . So, look at the Coefficients column at the bottom of the model: a is called Intercept (clear enough) and b is called X Variable 1 (instead of the slope of X Variable 1). So, at this point, write in your model (leaving Y and X in there):

  1. Y = ______ + ______ X

Something else reported in a regression analysis is the correlation co-efficient, or r. That's given in Excel's Regression statistics as Multiple R. What did you get?

  1. 
    _________________________________
    
    
    

Below that is the co-efficient of determination. This is the proportion of variation in Y, which is explained by variation in X. It is, basically, the square of the correlation co-efficient, or r2. In Excel, it's called R Square. What did you get?

  1. 
    _________________________________
    
    

Okay, now just how significant is this model? What is the probability that a random mish-mash of 50 pairs of numbers would have produced results as extreme as this? You get that from the ANOVA part of the box. It's reported through the F statistic. What did you get for the F ratio?

  1. 
    _________________________________
    
    

Looking at the "Significance F" entry, what did you get for the prob-value (prob-value is the significance of the F statistic, which you can compare with your alpha)?

  1. 
    _________________________________
    
    

So, using the alpha standard you wrote down in the first part of this project, is this simple linear regression model significant?

  1. 
    _____  yes     _____  no
    
    

==============================

Interpretation

==============================

Briefly summarize in regular English what the nature of this association is in terms of direction (direct or inverse), how much of the variation in velocity is explained by variation in discharge, and how significant the association is.

  1.  

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================

Data Structure

==============================

Let's see if we can improve an already pretty strong model by transforming one of the variables. Those of you who actually played around with the trendline formatting function have already figured out that a logarithmic transformation might do the trick. Let's do a logarithmic transformation of one variable. But which one?

At this point, let's use Excel's graphing ability again to create a histogram, so that we can get a handle on which variable is the least conformal to the normal distribution. Simple linear regression should be used on scalar data from fairly normal distributions.

Excel is not exactly fast in this histogram regard, unfortunately. To make a histogram of the data structure, we need to come up with the number of bars (called "bins" here), the width of the bins, and then have it work out a graph for us. It's a pain. And it doesn't help that there's an actual bug in Excel.

Let's work on Discharge (Column B) first. What's the highest and the lowest value? What we want to do is come up with a nice round number below the lowest value and above the highest value and then divide the range between the rounded high and low values into a useful number of bins or categories, say, about 10. It helps if each category is some nice round number, too. Okay, let's pick 0 to 750, which creates 10 bins of 75 each. In Cell A63, write "BINS." In Cell B63, enter 0. In Cell B64, write "=b63+75" and hit enter. Now, drag Cell B64 down to Cell B85. There are your bins.

Let's use Excel's tools to count up how many observations fit in each bin. In Cell A75, type "Discharge." Click Tools and Data Analysis, but this time pick Histogram. In Input Range, type "b2:b51". In Bin Range, type "b63:b73" and, for Output Range, put in "b75". Instant counting. But the counts are in the wrong rows (this is that bug I was telling you about). Look at how many observations Excel counted for the bin ranging from 0 to 75. Zero. Now, go back up to your data in B2:b51, and you'll find that, in fact, if you hand- counted, there are 23 observations with discharges below 75 cfs. What to do? Well, this'll work. Highlight cells C77 to C87 and then hit Control-X. Put your cursor in Cell C76 and hit Control-V. You've moved the whole set up one bin.

Now, let's turn this into an actual histogram. Click on the garish little graphics button up top. Select Column and the first sub-type and Next. For Data Range, pick B76:C87 and click the Series in Columns button. Hit Next. Fill in a Chart title and then Category (X) axis and Value (Y) axis titles and hit Finish. Voilà!

Now, do the same thing for Velocity (Column C). Come up with a nice round data range that will take in everything in the actual data range. Divide it up into ten nice round bin values. Write in Velocity in Cell A88. Use the Tools --> Histogram function to set up your histogram bin counts, and then create a graphic the way you did for Discharge.

These are supposed to be histograms: The bars are supposed to touch to show you have continuous data on the X axis. They don't. They look like a bar chart for nominal data. Right-click the bars and pick Format Data Series and then click the Options tab and turn the Gap Width from 150 pixels to 0. They'll touch now.

Now, look at your two hard-earned graphs. Which one does not show a more bell-shaped distribution that might be from a normal distribution (especially if you had a big enough sample to fill in)? In other words, is there one that forms a really concave distribution and is likelier, therefore, to be creating that curvilinearity in the association? Okay, we're going to log that one.

  1. 
    _________________________________
    
    
    

==============================

Toward a Better Mousetrap: Transforming the Concave Data Distribution

==============================

First, in Cell D1, write "Log D." Then, look up and down Column B to make sure there are no zero values. How come? Well, try typing "=log(0)" in some random cell. You can't: There is no log of 0. We have to make sure none of our values are exactly 0. Having assured yourself of that, in Cell D2, write =log(b2). Voilà: your first log transform?

Now, copy that formula from D2 down to D51. First, move your cursor around in D2 until the white fat cross becomes a black skinny cross. Then, hold down the left button of the mouse and drag it down to D51 and let go. Instant painless log transform.

Go on and start another X-Y graph using C2:D51 just the way you did your first scatterplot, but, this time, you have to click the Series tab after you've entered your Data Range information. Excel assumes the X variable is on the left and the Y variable is on the left, but we've just switched that. So, under Series, substitute C and D in the X and Y positions. Hit Next.

Create a title, remembering to put Log Discharge somewhere in the title and X axis, so you can tell this graph from the other one). This time, again, move the completed graph somewhere out of the way, someplace where you can compare it with your first scatterplot.

Ta da! Quite an improvement over your first graph, eh? Well, not so fast: First let's see if the visual improvement is echoed in the statistics (sometimes an improved visual rendering can actually be less effective statistically).

Go on ahead and fit a linear trendline to your data. Looks pretty good, hmmm? The line goes pretty much right through the center of the data cloud at any level of X values: There is no segregation of dots below the line in one area and above the line in another. The data seem to be homoskedastic, which makes us (who?) feel better about using the logarithmic transform.

==============================

Building a Logarithmic Regression Model of the Raw Data

==============================

Once more, let's do a regression model. From the Tools menu up top, pick Data Analysis and then Regression and OK. For Input Y Range, it's C2:C51; for Input X Range, it's D2:D51. For Output Range, how about U22?

Enter your new values in your regression model (Y = a + b(logX)):

  1. Y = ______ + ______ log(X)

If you wanted to read this relationship, you'd have to remember that Y reflects the LOG of X, not X itself. To figure out which X value predicted a given value of Y, you'd have to take the antilog of the value on the X axis or be sure to go back to the original column of X values in your spreadsheet.

Now, the new correlation co-efficient (holy Toledo)!

  1. 
    _________________________________
    
    

And the new co-efficient of determination:

  1. 
    _________________________________
    
    

And the new F statistic?

  1. 
    _________________________________
    
    
    

And the new prob-value or alpha (okay, you can't tell the difference with so few decimal places, but let's complete the set)?

  1. 
    _________________________________
    
    

==============================

Interpretation

==============================

Again, in clear, readable English, summarize the highlights of this model. How much of the variation in stream flow velocity is accounted for in this model? How did the significance of this association change (all you have to go on is the size of the F values) in comparison with your first, simple linear regression model? How æsthetically satisfying is the logarithmic regression line when drawn through the data points as compared to the first model? Would you say that the extra hassle involved in doing a logarithmic transformation of X is worthwhile in terms of additional descriptive power and significance?

  1.  

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________


==============================

If Logging One Variable is Great, What about Logging TWO?

==============================

Let's see if we can improve this model even MORE by logging both variables and creating a power curve regression model. Where a simple linear regression model is reported as Y = a + bX, and a logarithmic regression model is reported as Y = a + b(logX), a power curve is reported as Y = aXb. This really is nothing more than a more elegant way of saying logY = loga + b(logX).

First, make sure there are no perfect zeros in Column C. Then, in E1, write Log V. In E2, type "log(c2)" and then copy that all the way down to E51.

Go on and build another X-Y graph just the way you did your first two (remembering to put Log Discharge and Log Velocity somewhere in the title and X and Y axes, so you can tell this graph from the other two). This time, however, select cells D2 through E51. Again, move the completed graph somewhere out of the way, below the other graphs at the bottom of the spreadsheet.

Add a linear trendline to your scatterplot (first intimations of trouble). Now, do a regression model with Column D as X and Column E as Y and put the result in U42.

Enter your new values in your regression model (Y = aXb or logY = loga + b(logX)):

  1. Y = ______X______

New correlation co-efficient:

  1. 
    _________________________________
    
    
    

New co-efficient of determination:

  1. 
    _________________________________
    
    

New F ratio:

  1. 
    _________________________________
    
    

And the new prob-value (yeah, yeah, you still can't tell the difference with so few decimal places, but...)?

  1. 
    _________________________________
    
    

==============================

Interpretation

==============================

Again, in clear, readable English, summarize the highlights of this model. How much of the variation in stream flow velocity is accounted for in this model? How did the significance of this association change (all you have to go on it the size of the F values) in comparison with your first two regression models? How æsthetically satisfying is the logarithmic regression line when drawn through the data points as compared to the first model? Would you say that the extra hassle involved in doing a logarithmic transformation of BOTH X and Y is worthwhile in terms of descriptive power and significance, compared to just logging X?

  1.  

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================

Grand Overall Analysis

==============================

Again, in clear English, compare all three models with one another. Which one produced the highest co-efficient of determination (explained the most variation in velocity of stream flow) and the highest F ratio (key to significance)? Which model performed the least well? Was the weakest model still a decent description (look at the F ratio value and associated prob-value)? Which of the three would be the best choice in terms of effort and rewards (quality of descriptiveness)? How could we have saved ourselves a little work in scatterplotting and model-building ahead of time and estimated which of the three models might be the most rewarding before actually doing the work?

Let's get more adventuresome. Why is it that increasing discharge of a stream results in faster-flowing water? Time to dust off your old Geography 140 or introductory geology textbooks! Think in terms of the flow patterns in water as the wetted perimeter declines in comparison with the stream cross-section as discharge increases. The point of this is we have to link our statistical analyses with the processes they describe.

  1.  

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================

* What if you don't see Data Analysis among your options when you click on Tools? It just means that the Data Analysis pack hasn't been installed on your machine's copy of Excel. Not to worry. It hasn't been installed, but it is actually there. To install it, click on Add-Ins. A box comes up. Put a check next to Analysis TookPak and another next to Analysis TookPak - VBA. That's all there is to it. Now, when you click on the Tools menu, you'll see Data Analysis toward the bottom. Back to our regularly scheduled lab.

==============================

GEOG 400 Home   |   Dr. Rodrigue's Home   |   Geography Home   |   Scientific Calculator   |   QELP

==============================

This document is maintained by Dr. Rodrigue
First placed on Web: 02/20/01
Last Updated: 01/27/09

==============================