I have a problem that I want to pose to my Statistics students, and if you are reading this, I would love to get some feedback in the comments section.

Prior to this activity, students will have studied:

– Mean, Median, Mode, Range

– Standard Deviation of a population (but not a sample)

– *Very* brief references to normal distribution

– Correlation coefficient

– Linear regressions, and some curve-fitting.

I would like to use this to reinforce measures of central tendency, and also as a jumping-off point for box-whisker plots, the interquartile range, and the standard deviation of a sample. I also want to get to normal distributions in more depth.

Let me provide the problem, and then you’ll see some questions at the bottom:

WHAT EXACTLY IS “MIDDLE CLASS”?

In a recent interview, Romney commented upon the middle class, in talking about his tax plan. From the San Francisco Chronicle:

Romney’s comments came in an interview broadcast Friday on ABC’s “Good Morning America.”

“No one can say my plan is going to raise taxes on middle-income people, because principle number one is (to) keep the burden down on middle-income taxpayers,” Romney told host George Stephanopoulos.

“Is $100,000 middle income?” Stephanopoulos asked.

“No, middle income is $200,000 to $250,000 and less,” Romney responded.

His campaign later clarified that Romney was referencing household income, not individual income.

ATTACHED

You will see a spreadsheet that has data on two different sheets:

Sheet One, entitled Income by Each Household: This shows incomes of 177 households. These are numbers based upon US Census data, and reflect the distribution of income in the United States in 2011.

Sheet Two, entitled US Census Data: This shows the data that actually comes from the US Census Bureau. The data you should look at and use is highlighted in yellow.

YOUR TASK

Define the range of incomes for “Middle Class” in America. It is up to you to define your criteria, but you must describe your methodology fully.

Level 1: Use the first sheet, “Income by Each Household”

Level 2: Use the data from the Census Bureau

Both can land you full credit – it is just that Level 2 is more challenging.

My Questions:

- The data in the first sheet is *made up*. All I had was the US Census data, not individual household data – so I used the RANDBETWEEN function in Excel to generate the incomes, so that they average and distribute in the same way as the US Census information. So the data is representative of US data, but not actual data. How do I communicate this to the students? Is the above description ok?
- Thoughts on taking this to normal distribution? The data is not at all normally distributed.
- I hope that students will be able to at least sort the data and find the mean (around $71,000), median (around $50,000), and std dev (around $81,000. Yikes!). Because the std dev is larger than the mean, students will not be able to say, “A typical range of incomes will be the mean plus or minus the standard deviation” with much meaning, since the mean minus the std dev gives someone a negative income.
- Any other thoughts that occur to you as you look at this would be great.
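
To make the "mean minus std dev goes negative" worry concrete, here is a minimal Python sketch. The twelve incomes are made up for illustration (they are not the spreadsheet or Census values), and the quartile-based range is just one possible criterion a student might choose, not a recommended definition of "middle class".

```python
import statistics

# Made-up household incomes in dollars (NOT the spreadsheet or Census data),
# chosen only so that one very large value skews the sample to the right.
incomes = [8_000, 15_000, 22_000, 30_000, 38_000, 45_000,
           52_000, 60_000, 75_000, 95_000, 130_000, 400_000]

mean = statistics.fmean(incomes)
median = statistics.median(incomes)
sd = statistics.pstdev(incomes)  # population standard deviation

# Quartiles (Python's default "exclusive" method); q1..q3 spans the middle half.
q1, q2, q3 = statistics.quantiles(incomes, n=4)

print(f"mean   = {mean:,.0f}")
print(f"median = {median:,.0f}")
print(f"std    = {sd:,.0f}")
print(f"mean - std = {mean - sd:,.0f}  (negative, so not a usable lower bound)")
print(f"middle half of households: {q1:,.0f} to {q3:,.0f}")
```

With skewed data like this, the quartiles stay inside the range of real incomes while mean minus std dev does not, which is one way to motivate the interquartile range.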


This is a fantastic idea to introduce the concept of normal distribution and standard deviation. In fact, I would consider taking the concept of a normal distribution head-on. I think it's ok that you are using fabricated data here for now. I'm guessing the actual income distribution will be heavily skewed. In fact, that might make a good follow-up: you've seen this sample distribution, here's a more representative sample, what does that imply about the distribution of household income? Does it mean there are a relatively few super-rich people, or that there are relatively few poor people? Neither? Both? Love this idea.

–Geoff


I think this is a great activity.

Your first comment: There's nothing wrong with making up data that accurately reflects the population. Statisticians often do it in the form of simulations, and it is quite common when it's difficult to gather data or when data gathering is risky to the subjects. Communicating that this is another tool would be a good way of getting this across to them.

Second: While the data itself is not normal, you might benefit from discussing what they imagine an ideal distribution of this data looks like. When I think of a curve, I think of grades: not many on the ends, with the bulk of kids falling somewhere in the middle. “How does the normal distribution relate to fairness?” is a question that might get some good debate. Then the kids can compare the distribution of the data you provided to their ideal distribution.

I think this is a great activity that offers insight into a big debate this election while also introducing concepts of statistics. Speaking of which, all four domains of statistics as determined by AP are covered in this activity: I. Exploring data and distributions II. Sampling & Experimentation III. Anticipating patterns and randomness IV. Statistical inference on a population. Have the kids debate whether this distribution is good for the country. If not, how could it be “solved?”

-Benji


Benji,

So great to hear from you! Thanks for the super valuable thoughts. We are doing a project on the election, so that was the inspiration for this. Trying to decide if I just introduce normal distribution before discussing the binomial distribution. Leaning toward “yes”.

-KG


Re: the description — I would be more explicit that the data in the first spreadsheet was fabricated. And then offer even more credit for a plausible scheme for accurately generating a similar table from the Census data (especially if it isn't the same scheme you used — which isn't meant as a ding on your methodology in any way, but a nod to innovation).

I would try to consider whether the students noticed and accounted for the fact that the numbers (i.e. the counts) are in thousands, but not the dollars. It's just a common-sense sanity check on the data; “Hmmm … is the census data saying that there are only 121,000 households in the entire country?” It won't change the shape of the analysis, but they should maintain an awareness of the scale.


Mr. Gant (sorry, I can't call you Kevin),

I agree that this is a very practical and relevant way to add depth to your statistics curriculum. It sure beats analyzing a table of unitless values.

I think your current description is fine. Your method for generating the data isn't that important…but I think the shape of the data is. My recommendation is to replace the data on the first table with bootstrapped data that has the same distribution as US Census income data (say, log-normal). Besides being more realistic, it provides you with another learning opportunity, in that the students will expect the mean, median and mode to be relatively close to each other, but in fact they will be separated. That leads you into a discussion of different distributions that better characterize the data.
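
For reference, the separation of mean, median and mode can be written down exactly for a log-normal distribution. The symbols here are an assumption introduced for illustration: suppose the natural log of income is normally distributed with mean $\mu$ and standard deviation $\sigma$. Then:

```latex
\text{mode} = e^{\mu - \sigma^{2}}, \qquad
\text{median} = e^{\mu}, \qquad
\text{mean} = e^{\mu + \sigma^{2}/2}
```

Since $\sigma^{2} > 0$, this always gives mode < median < mean, and the gaps widen as $\sigma$ grows, which is the right-skew pattern described above.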

Another thought is to have them analyze data from both the log-normal and normal distributions. You can then have an in-class discussion about, “which data set do you think is a better fit for actual US incomes?” Here you can highlight why it's important to understand the difference between mean, mode, and median, as shown in this link:

http://upload.wikimedia.org/wikipedia/commons/d/de/Comparison_mean_median_mode.svg

Again, very good idea. Your students are very lucky. Thanks for the opportunity to help.

v/r

Neil


Neil! Great to hear from you! And awesome idea with the log-normal. I just created a histogram of the natural log of the first set of data, and indeed – it looks much more like a normal distribution.
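
A rough way to check the "log looks more normal" observation without eyeballing a histogram is to compare skewness before and after taking the natural log. The sketch below uses simulated log-normal values, not the spreadsheet data, and the parameters 10.8 and 0.9 are arbitrary assumptions chosen only to give income-sized numbers.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed z-scores (near 0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated incomes: ln(income) ~ Normal(10.8, 0.9). Hypothetical parameters.
raw_incomes = [rng.lognormvariate(10.8, 0.9) for _ in range(2000)]
log_incomes = [math.log(x) for x in raw_incomes]

print(f"skewness of raw incomes: {skewness(raw_incomes):.2f}")
print(f"skewness of ln(income):  {skewness(log_incomes):.2f}")
```

The raw values should show strong positive skew, while the logged values should sit close to zero. The same two-line comparison could be run on the actual first sheet once it is exported.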

Regarding the shape: I think it is important for the students to understand what a distribution from a randomly determined variable looks like (the measured period of a pendulum taken over several trials, for instance), vs. a not-so-random variable, like income in the US. That said, we can take a look at the log-normal, and parse that out.

Just to be clear – both the first set and the second set of data, as they are now, distribute the same way – both histograms look almost identical except for scale. I think I know what you mean by “bootstrapped”, and if so, I effectively bootstrapped data from each range given in the US Census data. I used RANDBETWEEN to generate data in each of those ranges, and made sure that the number of incomes in the first set was proportional to the frequency in each income range from the Census data, and that the average income in each range was about the same for both sets. If you download them to Excel, you can check it out.
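
For anyone who wants to try the same idea outside Excel, here is a minimal Python sketch of the bracket-by-bracket approach, with `random.randint` standing in for RANDBETWEEN. The brackets and counts are hypothetical placeholders, not the Census figures, and the sketch does not include the extra step of matching the average income within each range.

```python
import random

# Hypothetical brackets: (low, high, number of households to generate).
# These are placeholders, NOT the actual Census ranges or frequencies.
BRACKETS = [
    (0, 24_999, 44),
    (25_000, 49_999, 44),
    (50_000, 99_999, 53),
    (100_000, 199_999, 29),
    (200_000, 500_000, 7),
]

def generate_incomes(brackets, seed=0):
    """Draw `count` uniform integer incomes inside each bracket, then sort."""
    rng = random.Random(seed)  # fixed seed so the table can be regenerated
    incomes = []
    for low, high, count in brackets:
        incomes.extend(rng.randint(low, high) for _ in range(count))
    return sorted(incomes)

incomes = generate_incomes(BRACKETS)
print(len(incomes), "households generated")
print("lowest:", incomes[0], "highest:", incomes[-1])
```

Because the draw inside each bracket is uniform, the histogram keeps the bracket-level shape but is flat within each bracket. That is worth keeping in mind when comparing it to a smooth curve like the log-normal.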

Thanks for the comment! Nerd.
