California State University, Long Beach
Department of Geography

Hazards Lab

Flood Magnitude-Frequency Relationships

==========

This lab will familiarize you with magnitude-frequency relationships in risk analysis. Many natural hazards show a pattern where low magnitude events happen pretty frequently, while high magnitude events are rare. Earthquakes, taken as a whole from around the world, fit this pattern. Asteroid and comet impacts with Earth or its atmosphere are another hazard that behaves this way. So do floods along a particular stretch of river.

Floods (and other hazards) can be characterized by magnitude, how large they are. With floods, a common measure of magnitude is discharge, or how much water in the stream bed is flowing past a certain point in a given period of time. For example, discharge could be given in cumecs or cubic meters per second or, as in this lab, as cfs or cubic feet per second.

If you have a long run of discharge data for a particular stream, you can rank each year or event by its discharge. So, the biggest flood would be ranked #1, the second biggest as #2, the third biggest as #3, and so on until you have ranked even the tiniest event.

Once you've ranked your data set, you are in a position to calculate recurrence intervals or frequencies. Recurrence interval means the average number of years between one flood of a given magnitude and the next flood that is as big or bigger. If you were interested in a flood of, say, magnitude 5,000 cumecs, you would figure out the average time it would take to get another flood of at least 5,000 cumecs.

Once you have recurrence interval, you can figure out the probability that a flood of a given magnitude will hit during any particular year. Probability is merely the reciprocal of recurrence interval, that is, 1 divided by the recurrence Interval ( 1 / I ).

Something else you can figure out from recurrence interval is the magnitude of a specified level of flood. In other words, you can estimate the size of the 100 year flood (the flood level that has a 0.01 or 1% chance of happening in any given year: 0.01 is the reciprocal of 100). Planners and insurance companies like to think in nice, round recurrence intervals (e.g., the "100 year flood" or the "500 year flood").

Use of regression modelling can help you predict that kind of flood, even if you don't have 100 years of data (which we don't in this lab). The idea is you model the relationship between magnitude and frequency from the data you have, using simple linear regression, and then extrapolate that relationship to times outside the years of your data set.

Extrapolation can be a bit hazardous, because there is always some uncertainty in this kind of modelling, and the uncertainties increase the farther you extrapolate beyond your data. Keeping that in mind, extrapolation can give us at least a ball-park idea of the kinds of floods we might have to face.

The data used for this lab come from the USGS WATSTORE water data warehouse. They represent an 89 year run of peak discharge data taken from station 11098000 in Arroyo Seco near Pasadena for every flood year from 1914 (1 October 1913 through 30 September 1914), except for 1933. The peak discharge event for each water year is listed by the date it was recorded. Peak discharge is given in cubic feet per second (cfs).

What's a "water year"? It is conventionally defined as the twelve month period from October 1st through September 30th. The water year is named for the calendar year in which it ends (which includes nine of the twelve months). For example, the year ending on September 30th of 2005 is called the "2005 water year." Because of this water year business, you will notice a couple of peak discharge events may have occurred during the same calendar year (e.g., 1996) and some calendar years are missing an event (e.g., 1997)

Skills developed in this lab include:

  • basic spreadsheet functions, such as sorting, counting, writing equations, copying formulas, and formatting
  • using Excel to create graphs
  • building a simple linear regression model in Excel
Materials needed for this lab:
  • access to Excel or a comparable spreadsheet program
  • access to the Internet
Deliverables include:
  • answers filled in the blanks below
  • a print-out of your spreadsheet's regression models and
  • graphs with your regression lines drawn on them by hand or by Excel's Add Trendline function
  • your name is an important feature!

==========

Part 1: Download Your Data

By clicking http://www.csulb.edu/geography/labs/data/floodLA02.xls, you can download your data in the form of an Excel for Windows 5.0/95 workbook (which should be usable by any version of Excel you have from that day to this). You may be warned that you are about to download a file of type x-excel. If so, select Save File and then be sure to pick your own Zip, floppy disk, or USB flash drive to save it in or, if you're at home, wherever you save your "stuff."

If it opens when you click on the link, great. Otherwise, if you had to Save File, go on and open it using Excel.

You will find three columns of data: Date in Column A and Discharge in Column E and again in Column G, the data running from row 2 through row 90. Below that, you'll find a few rows of "metadata," data describing the data above. These are in comment lines preceded by an #, and you don't need to worry about them. You'll see that there are four empty columns, headed by Rank, R Prob, RI, and Log RI. These are the columns that you'll be filling in during this lab.

Let's make sure that the spreadsheet will show your calculations in an attractive format, rounded to two decimal places (e.g., 22.04). Holding down the Control key, tap the grey boxes on top labelled B, C, D, and F. This will cause the four columns to be highlighted. Then, touch the Format box up in the menu on top of Excel. When the menu drops down, choose Cells. A grey box should come up and the Number tab should be active (if not, click on Number). Under Category, choose Number and then OK. Now, any number that you put in any of those four columns will be displayed with two decimal places.

==========

Part 2: Calculate Estimated Recurrence Intervals

The formula for recurrence intervals is:
I = (n + 1) / r
Where:
I = Interval of recurrence
n = number of years for which you have data (don't forget that water year 1933 has no data)
r = rank of a particular magnitude of event, e.g., 509 cfs
To estimate probabilities of recurrence in any one year, the formula is:
P = 1/ I
Where:
P = Probability of recurrence
I = Interval of recurrence
To make sense of this, go to the spreadsheet, put your cursor in the grey box that labels row number 2, and then left-click and drag the cursor to the grey box labelled row 90. All rows with data in them should now appear highlighted.

On the upper Excel menu bar, click on Data and then on Sort. Up will come a grey box, asking you to sort by one of the columns of data. If you click My List Has a Header Row, you will be able to sort by the named column, "Discharge." If you click on No Header Row, then you'll be given the option to sort by a column (choose E). Click on Descending, so your data are arranged from highest discharge to lowest. Then click OK. Voil`! Your peak flow data are neatly ranked from the largest to the smallest peak discharge event. Now save your file (Control-S), either as a Excel for Windows 5.0/95 file or the format your version of Excel prefers.

==========

Part 3: Using Excel's Formula Function

At this point, you can assign ranks to your floods. You'll see that Column B is already labelled "Rank" up in cell B1. The easiest way to rank your sorted data is to put the number 1 in cell B2. Then, in cell B3, enter the following: =b2+1 and hit Enter.

Now, put the cursor back in cell B3. Move the cursor to the lower right corner of that cell and you'll see it turn from a fat white cross to a skinny black cross. As soon as that happens, hold down the left-click button on the mouse and then drag the mouse down the column to cell B89 (don't bother with 1933 in cell B90). All your ranking will be done automatically.

What, then, is n (not counting 1933)? __________ >

(a short cut: Type =count(b2:b89) in cell B92 -- it'll count them up for you)

Now that you know the ranks for each water year, use the formulas above to calculate recurrence intervals in years, and probabilities of recurrence in any one year for the following discharges. To do this, please put your cursor in cell D2 under the "RI" heading. Type in =(88+1)/B2 and hit Enter. You now have the recurrence interval for the largest flood in your data set. Again, put your cursor back on cell D2 and move it to the lower right corner of the cell until it turns back into a skinny black cursor and then hold down the left mouse button and drag the formula all the way down to cell D89. Almost painless calculation of recurrence intervals for 88 records!

From recurrence interval for each rank of flood event, you can calculate the probability of a particular level of flood hazard. All you need to do is divide 1 by the recurrence interval for the flood you're interested in. So, in cell C2 under "R Prob," type in =1/D2. Again, put your cursor back on cell C2 and move it to the lower left corner until it changes to a skinny black cursor and left-drag the mouse all the way to C89.

So, what would be the discharge associated with the 80th ranked flood?
_______________ cfs

What is the recurrence interval for a flood of 230 cfs?

_______________ years

What is the probability that next year will equal or exceed 1,400 cfs?

_______________ 


What is the rank of the 1,800 cfs flood?

_______________ 
What is the discharge associated with the 11th ranked flood?

_______________ cfs
What is the recurrence interval of the flood that has a 0.03 probability of occurring next year?

_______________ years
What is the probability of getting a flood of at least 8,540 cfs next year?

_______________ 

==========

Part 4: Graph Recurrence Interval and Magnitude

Let's graph that, while we're at it. Click the icon on the top of the Excel screen, which looks like a miniature, loudly colored bar chart. Select X-Y (Scatter), and then Next to accept the default scatterplot. For data range, type in d2:e89 and click on the Columns button. Cool, huh?

Then hit Next again. For Chart Title, try Discharge and Recurrence Interval. For Value (X) Axis, put in Recurrence Interval. For Value (Y) Axis, put in Discharge. Hit Next. In the next box, select As Object In and hit Finish.

A graph comes up, inconveniently located in your spreadsheet. Click on a white area within the graph, so a bunch of little pin boxes come up around the frame of the graph, then, holding down the left mouse button, move it somewhere more useful, such as cell I1 or I2.

==========

Part 5: Analysis of Your Graph with Simple Linear Regression

How would you characterize the relationship between recurrence interval and discharge? A direct relationship exists when one variable goes up as the other one goes up in value, so it creates a curve going from the lower left of an X-Y graph to the upper right. An inverse relationship exists when one variable drops as the other one climbs in value. It creates a curve going from the upper left to the lower right. Is the relationship between recurrence interval and discharge direct or inverse?
Direct or inverse? __________________________
The longer the recurrence interval, the worse the flood discharge associated with it is. This relationship is the basis of predicting areas subject to the "100 year flood" for insurance purposes.

Let's use simple linear regression to fit a curve to these data, so we can come up with very rough estimates of the flood magnitude associated with a particular recurrence interval.

Go to the Tools menu and select Data Analysis*. In the grey box that comes up, select Regression and OK. For Input Y Range, write in e2:e89; for Input X Range, write in d2:d89; for Output Options, select Output Range and use cell i18.

All of a sudden, a box with a mess of numbers in it will come up. This is your regression model, with Recurrence Interval predicting Discharge. Here's what these numbers mean.

The "Multiple" R is the correlation co-efficient. A correlation co-efficient describes the strength of the association between recurrence interval and discharge. R can vary from -1 through 0 to +1. The larger the correlation co-efficient, the stronger the association. A -1 describes a perfect inverse relationship; a +1 describes a perfect direct relationship; and a 0 means there is absolutely no association whatsoever between the two variables.

The "Multiple R Squared" tells you the proportion of variation in Discharge explained by variation in Recurrence Interval. Again, the closer to +1, the better your model. If it's under +0.25, X is generally not really a significant predictor of Y; if it's somewhere around +0.25 to, oh, +0.50, it's a weak influence though it may be real (significant); if it's +0.50 to +0.75, you might think of X as a moderate influence on Y; if it's somewhere around +0.75 to 1.00, X looks like a pretty strong predictor of Y.

What is R? __________

What is R2? __________

A regression model has the form:

Y = a + bX
Where:
  • Y = Expected or modelled Discharge
  • a = Y intercept, or where the curve touches the Y or vertical axis (that is the co-efficient next to Intercept at the bottom of the regression box that popped up in Excel)
  • b = slope of the equation, or the amount of change in Y seen with a given change in X (that is the co-efficient appearing next to X Variable 1)
  • X = Recurrence Interval
What is a? __________

What is b? __________

What good is this? Well, we can use this formula to estimate the discharge (Y) of the 100 year flood by plugging numbers into it, now that your spreadsheet has calculated the a and b constants. Multiply 100 by the b co-efficient and then add the answer to the a co-efficient. That is the estimated discharge for the 100 year flood, the one that has a 0.01 probability of occurring in any one year. That's all there is to it!

That discharge for the 100 year flood is? __________ cfs

What about the 500 year flood? __________ cfs

And the 1,000 year flood? __________ cfs

Now, do this estimate for the 29.67 year flood: __________ cfs

This should bother you a little: Compare the estimated 29.67 year flood with the real 29.67 year flood. Why do you suppose you got somewhat surprising results? That is, why isn't the estimated discharge the same value as the real or observed discharge data given for the 88 year flood in your table of original data? (helpful hints in Note below):

_________________________________________________________________________________________

_________________________________________________________________________________________

_________________________________________________________________________________________

Note: Disparities between actual values and expected values demonstrate that your independent (X) value does not completely account for the variation in the dependent (Y) variable. This is why, in the real world, correlation co-efficients (Multiple R in Excel's language) are not +1.00. When you square the correlation co-efficient, you get the co-efficient of determination (R Square in Excel), which tells you how much of the variation in Y can be accounted for by variations in X. Take a look at your regression model's R Square -- you can see that a pretty sizable amount of variation in discharge remains to be explained after you've accounted for recurrence interval.

This can be the result of one or more factors independently affecting Y. You can try to build a more complex model by bringing in other variables if you have reason to suspect their presence (multiple regression models can handle multiple independent variables, but that's beyond what we're doing in this lab). Another reason may be that we're applying the wrong model (simple linear regression), but let's not worry about that right now (though we will in a bit).

The difference between an actual value of Y and the expected value of Y for a particular value of X is called the residual. A positive residual is an actual value which lies above the regression line of expected values (is larger than expected); a negative residual is an actual value lying below the regression line (is smaller than expected). Analyzing residuals can often lead to identification of other factors that might soak up some of the unexplained variation in Y. But that's another story for another day!

==========

Part 6: Analysis of Appropriateness of Simple Linear Regression

To see another problem with this particular example of simple linear regression, put a dot on the Y axis at the value of a. Now, on the other end of the graph, put a dot at the expected value of Y for a recurrence interval of 88 (from your answer above), not the actual value. Now, draw a straight line from a through the expected value for 88 years recurrence interval and continue it to the rightmost edge of the graph. This is your regression line.

You can cheat a little, using Excel's Add Trendline function. Instead of manually drawing your regression line, you can simply click on one of the dots in your graph and then right-click your mouse. Pick Add Trendline. Pick Linear (we're doing a "simple" linear regression, after all). Under Options, ask it to add 12 units under Forward (88 + 12 = 100). Click Okay. Pretty handy, eh?

Notice how the actual values are consistently below the line on the leftmost side of the graph? And how they are consistently above the line on the right? This kind of consistency in the residuals tells you that the relationship between recurrence interval and discharge is actually not a linear relationship best described by a simple linear regression model and line. It's actually curvilinear. So what?

This means that our predicted flood discharges are going to be increasingly off as we try to extrapolate way beyond the 88 years for which we have data. And we do want to extrapolate to those round number recurrence intervals, such as the 100 year flood, the 500 year flood, and so on.

If we wanted to improve our model, we can try doing various mathematical transformations on one or both variables. Let's try one of these, which will actually improve our model enough to justify the bother. We're going to do a logarithmic transformation on the X variable, Recurrence Interval.

==========

Part 7: Improving a Simple Linear Regression to Handle a Curvilinear Relationship

What we're going to do is take the common logarithm of Recurrence Interval. A common logarithm is the power you'd need to raise 10 by to get your original Recurrence Interval. If your recurrence interval were, say, 1,000 years, the log of 1,000 would be 3. If you raised 10 to the third power (multiplied 10 by itself 3 times: 10 x 10 x 10), you'd get 1,000. If your recurrence interval were 100, the log would be 2. Try taking the log of 50 by entering 50 on a calculator and hitting the LOG button: 1.699. Okay, let's have Excel do the heavy lifting for us.

In cell f2, type =log(d2). That's all there is to that. Now, drag the formula all the way down to f89.

Let's try this simple linear regression thing again, but this time on discharge and the logarithm of recurrence interval. Again, go to the Tools menu and select Data Analysis and then Regression. This time, for Input Y Range, type in g2:g89. For Input X range, put in f2:f89. For Output range, try cell i60. Hit OK.

Take a look at the Multiple R and R Square co-efficients. Quite an improvement, eh? Let's see why that might be.

Click the chart icon on the top of the Excel screen. Again, select X-Y (Scatter), and then Next to accept the default scatterplot. For data range, type in f2:g89 and click on the Columns button.

Then hit Next again. For Chart Title, try Discharge and Logarithm of Recurrence Interval. For Value (X) Axis, put in Log Recurrence Interval. For Value (Y) Axis, put in Discharge. Hit Next. In the next box, select As Object In and hit Finish.

Once more, the graph comes up, no doubt inconveniently located in your spreadsheet. Click on a white area within the graph, and drag it somewhere under your first model, maybe cell I44.

Again, click one of the dots and right-click your mouse and pick Add Trendline. Again, choose Linear (because what we've done is used a logarithmic transformation to linearize or straighten a curvilinear relationship).

Look how much better your new regression line fits your data points. It's not perfect, of course, but it's a lot better and the residuals are a lot smaller. This is a much tighter model. It would be safer to use this model than the first one to extrapolate beyond your data to make rough estimates of expected 100 year floods.

Let's try this. We have to remember to convert the logarithm of recurrence interval back into the original recurrence interval units. A small extra step to pay for a much better model. To get this for each of the X values below, type =J76+J77*(log(X)). Don't actually type X: Type the numbers asked for below instead of X:

What would the discharge for the 29.67 year flood be? __________ cfs

What about the 100 year flood? __________ cfs

What about the 500 year flood? __________ cfs

And the 1,000 year flood? __________ cfs

A little trickier, what would the recurrence interval be for a flood with a discharge of 3,080 cfs? (this entails looking up that discharge and noting the logarithm of RI in Column F and then taking its antilog -- or, since you have the original data, looking up the corresponding RI in Column D)
__________ years
Let's try this on a discharge not actually listed in Column G. What would the recurrence interval be for a 10,000 cfs flood? Plot complication: You have the Y value and you need the X value now. The formula you have, however, is Y = a + b(logX). We need to get logX all by itself on one side of the equation and then get it converted back to a normal number (recurrence interval in years). Not to worry: That's simple algebra.
     Y = a + b(logX)

     Y - a = b(logX)
    
     (Y - a) = logX
        b

     X = antilog(logX)

     X = 10logX 
So, plug in the Intercept (a) and the X Variable 1 (b) from your second model and enter 10,000 as your Y. Subtract a from it and then divide the answer by b. That gives you logX.
What is logX (logRI)? __________
Raise 10 to that power (10logX), which you can do on a good calculator or else you can just take some empty cell in your spreadsheet and type =10^#, where "#" stands for the logRI or logX you got for a discharge of 10,000 cfs. The answer is called the antilog.
What is the antilog of your expected Y (logRI)? __________ years. This is the recurrence interval corresponding to a flood matching or topping 10,000 cfs.

==========

Conclusions

So, that's how you can build a model of flood frequency and magnitude and use it to estimate, at least roughly, how big a flood you can expect in a century or a millenium. This, then, helps society plan for life on a floodplain. What kind of land use is appropriate next to a river? How high does a system of levees have to reach to protect human beings and their homes and possessions? How much is society willing to pay for protection against larger (and rarer) floods? How high should your insurance premium be if you choose to live in an area subject to the 100 year (1% chance each year) flood?

Unfortunately, most people do not understand the concept of a "100 year flood." They think that, if we had a 100 year flood last year, we are safe from that level of flood and suffering for the rest of their lives. Noooo, it is quite possible to get two 100 year floods back-to-back! A better way of thinking of it is the 1% flood. This is the level of flooding that there is ALWAYS a small chance, a 1% chance, of experiencing any year. The risk is always there, though it may be only 1 small percent.

Another plot complication is that predicting 100 year floods on a given stretch of river assumes a lot of things:

  • Are the years for which you have data actually representative of the climate rigime you're living in? That is, are the data from a run of unusually wet years or unusually dry years?
  • Is the climate changing in a way that could change the flooding behavior of a stream in your part of the world?
  • Is there urban construction or deforestation or overgrazing going on upstream from you? If so, that can drastically alter the flood behavior of a stream by changing the timing of when precipitation hits the stream (these human-altered surfaces don't allow water to percolate as efficiently into groundwater, so it all comes roaring down asphalt and concrete and exposed dirt surfaces right into the stream, overwhelming its banks and creating bigger floods that hit faster after a rain).

==========

* What if there is no Data Analysis option under the Tools menu? Not to worry, you can install the Data Analysis packages yourself, almost instantly. On the Tools menu, select Add-Ins.... Put a check by Analysis ToolPak and Analysis ToolPak -- VBA. Now, hit OK. Click your cursor onto any cell and now try the Tools menu -- you should see Data Analysis there towards the bottom. Piece of cake!

==========

first placed on the web: 01/25/03
last revised: 02/10/07
) Dr. Christine M. Rodrigue

==========