The So Called Middle Class

I have a problem that I want to pose to my Statistics students, and if you are reading this, I would love to get some feedback in the comments section.

Prior to this activity, students will have studied:
– Mean, Median, Mode, Range
– Standard Deviation of a population, (but not a sample)
– Very brief references to the normal distribution
– Correlation coefficient
– Linear regressions, and some curve-fitting.

I would like to use this to reinforce measures of central tendency, and also as a jumping-off point for box-and-whisker plots, the interquartile range, and the standard deviation of a sample. I also want to get to normal distributions in more depth.

Let me provide the problem, and then you’ll see some questions at the bottom:

In a recent interview, Romney commented upon the middle class, in talking about his tax plan.  From the San Francisco Chronicle:
Romney’s comments came in an interview broadcast Friday on ABC’s “Good Morning America.”
“No one can say my plan is going to raise taxes on middle-income people, because principle number one is (to) keep the burden down on middle-income taxpayers,” Romney told host George Stephanopoulos.
“Is $100,000 middle income?” Stephanopoulos asked.
“No, middle income is $200,000 to $250,000 and less,” Romney responded.
His campaign later clarified that Romney was referencing household income, not individual income.

Sheet One, entitled Income by Each Household:  This shows incomes of 177 households.  These are numbers based upon US Census data, and reflect the distribution of income in the United States in 2011.
Sheet Two, entitled US Census Data:  This shows the data that actually comes from the US Census Bureau.  The data you should look at and use is highlighted in yellow.
Define the range of incomes for “Middle Class” in America.   It is up to you to define your criteria, but you must describe your methodology fully.
Level 1:  Use the first sheet, “Income by Each Household”
Level 2:  Use the data from the Census Bureau
Both can land you full credit – it is just that Level 2 is more challenging.

My Questions:

  • The data in the first sheet is made up. All I had was the US Census data, not individual household data – so I used the RANDBETWEEN function in Excel to generate the incomes, so that their average and distribution match the US Census information. So the data is representative of US data, but it is not actual data. How do I communicate this to the students? Is the above description OK?
  • Thoughts on taking this to normal distribution?   The data is not at all normally distributed.  
  • I hope that students will be able to at least sort the data and find the mean (around $71,000), median (around $50,000), and std dev (around $81,000. Yikes!). Because the std dev is so large compared to the mean, students will not be able to say, “A typical range of incomes will be the mean plus or minus the standard deviation” with much meaning, since the mean minus the std dev gives someone a negative income.
  • Any other thoughts that occur to you as you look at this would be great.
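
The mean-minus-standard-deviation problem from the third question can be seen in a minimal Python sketch. The eight incomes below are invented for illustration (they are not the spreadsheet or Census figures), and `summarize` is a hypothetical helper name. The quartiles are included only because they are one possible alternative for describing a "typical" range, not the required answer to the assignment.

```python
import statistics

def summarize(incomes):
    """Return mean, median, population std dev, and quartiles of a list."""
    q1, _, q3 = statistics.quantiles(incomes, n=4)
    return {
        "mean": statistics.mean(incomes),
        "median": statistics.median(incomes),
        "pstdev": statistics.pstdev(incomes),  # population std dev, as studied so far
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Hypothetical right-skewed incomes, NOT the data from either sheet.
incomes = [12_000, 25_000, 38_000, 50_000, 62_000, 85_000, 120_000, 400_000]
s = summarize(incomes)

# One very large income pulls the mean up and inflates the std dev,
# so "mean minus std dev" falls below zero.
print("mean +/- std dev:", s["mean"] - s["pstdev"], "to", s["mean"] + s["pstdev"])

# The quartiles stay inside the observed incomes.
print("Q1 to Q3:", s["q1"], "to", s["q3"])
```

With data this skewed, the quartile-based range is always a pair of plausible incomes, while the lower end of the mean-based range is negative.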


6 thoughts on “The So Called Middle Class”

  1. This is a fantastic idea to introduce the concept of normal distribution and standard deviation. In fact, I would consider taking the concept of a normal distribution head-on. I think it's ok that you are using fabricated data here for now. I'm guessing the actual income distribution will be heavily skewed. In fact, that might make a good follow-up: you've seen this sample distribution, here's a more representative sample, what does that imply about the distribution of household income? Does it mean there are relatively few super-rich people, or that there are relatively few poor people? Neither? Both? Love this idea.



  2. I think this is a great activity.

    Your first comment: There's nothing wrong with making up data that accurately reflects the population. Statisticians do it often in the form of simulations, and it is quite common when it's difficult to gather data or when data gathering is risky to the subjects. Communicating that this is another tool would be a good way of getting this across to them.

    Second: While the data itself is not normal, you might benefit from discussing what they imagine an ideal distribution of this data looks like. When I think of a curve, I think of grades: not many on the ends, with the bulk of kids falling somewhere in the middle. How does the normal distribution relate to “fairness?” is a question that might get some good debate. Then the kids can compare the distribution of the data you provided to their ideal distribution.

    I think this is a great activity that offers insight into a big debate this election while also introducing concepts of statistics. Speaking of which, all four domains of statistics as determined by AP are covered in this activity: I. Exploring data and distributions II. Sampling & Experimentation III. Anticipating patterns and randomness IV. Statistical inference on a population. Have the kids debate whether this distribution is good for the country. If not, how could it be “solved?”



  3. Benji,

    So great to hear from you! Thanks for the super valuable thoughts. We are doing a project on the election, so that was the inspiration for this. Trying to decide if I just introduce normal distribution before discussing the binomial distribution. Leaning toward “yes”.



  4. Re: the description — I would be more explicit that the data in the first spreadsheet was fabricated. And then offer even more credit for a plausible scheme for accurately generating a similar table from the Census data (especially if it isn't the same scheme you used — which isn't meant as a ding on your methodology in any way, but a nod to innovation).

    I would try to consider whether the students noticed and accounted for the fact that the numbers (i.e. the counts) are in thousands, but not the dollars. It's just a common-sense sanity check on the data; “Hmmm … is the census data saying that there are only 121,000 households in the entire country?” It won't change the shape of the analysis, but they should maintain an awareness of the scale.


  5. Mr. Gant (sorry, I can't call you Kevin),

    I agree that this is a very practical and relevant way to add depth to your statistics curriculum. It sure beats analyzing a table of unitless values.

    I think your current description is fine. Your method for generating the data isn't that important…but I think the shape of the data is. My recommendation is to replace the data on the first table with bootstrapped data that has the same distribution as US Census income data (say, log-normal). Besides being more realistic, it provides you with another learning opportunity, in that the students will expect the mean, median and mode to be relatively close to each other, but in fact they will be separated. That leads you into a discussion of different distributions that better characterize the data.
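
The separation between mean and median described here can be illustrated with a small simulation. This is only a sketch: the distribution parameters are arbitrary assumptions, not values fitted to Census data, and `mean_median_gap` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed parameters, chosen only for illustration (not fitted to real incomes).
normal_sample = [rng.gauss(50_000, 10_000) for _ in range(10_000)]
lognormal_sample = [rng.lognormvariate(10.8, 0.9) for _ in range(10_000)]

def mean_median_gap(data):
    """Return (mean - median) / median, a unitless measure of separation."""
    med = statistics.median(data)
    return (statistics.mean(data) - med) / med

normal_gap = mean_median_gap(normal_sample)
lognormal_gap = mean_median_gap(lognormal_sample)

print(f"normal sample:     mean exceeds median by {normal_gap:.1%}")
print(f"log-normal sample: mean exceeds median by {lognormal_gap:.1%}")
```

In the symmetric sample the gap should be close to zero, while in the log-normal sample the long right tail pulls the mean well above the median.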

    Another thought is to have them analyze data from both the log-normal and normal distributions. You can then have an in-class discussion about, “which data set do you think is a better fit for actual US incomes?” Here you can highlight why it's important to understand the difference between mean, mode, and median (as shown in this link:

    Again, very good idea. Your students are very lucky. Thanks for the opportunity to help.




  6. Neil! Great to hear from you! And awesome idea with the log-normal. I just created a histogram of the natural log of the first set of data, and indeed – it looks much more like a normal distribution.
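
A rough way to check the "looks much more like a normal distribution" observation without a histogram is to compare skewness before and after taking the natural log. The sketch below uses simulated log-normal values with assumed parameters, not the spreadsheet data, and `skewness` is a hand-written helper rather than a library function.

```python
import math
import random
import statistics

def skewness(data):
    """Population skewness: the average cubed z-score (about 0 for symmetric data)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated incomes with assumed parameters; NOT the data from the first sheet.
incomes = [rng.lognormvariate(10.8, 0.9) for _ in range(5_000)]
log_incomes = [math.log(x) for x in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

print(f"skewness of incomes:      {raw_skew:.2f}")
print(f"skewness of log(incomes): {log_skew:.2f}")
```

A strongly positive skewness for the raw values and a value near zero for the logs would match what the histogram shows.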

    Regarding the shape: I think it is important for the students to understand what a distribution from a randomly-determined variable looks like (measured period of a pendulum taken over several trials, for instance), vs a not-so random variable, like income in the US. That said, we can take a look at the log-normal, and parse that out.

    Just to be clear – both the first set, and the second set of data as they are now, distribute the same way – both histograms look almost identical except for scale. I think I know what you mean by “bootstrapped”, and if so, I effectively bootstrapped data from each range given in the US Census data. I used RANDBETWEEN to generate data in each of those ranges, and made sure that the number of incomes in the first set was proportional to the frequency in each income range from the Census data, and that the average income in each range was about the same for both sets. If you download them to Excel, you can check it out.
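
For anyone who wants to try the bracket-by-bracket approach outside Excel, here is a minimal Python sketch of the same idea. The brackets and counts are invented placeholders, not the Census figures, and `simulate_households` is a hypothetical function name. `random.randint` plays the role of RANDBETWEEN (both include the two endpoints). The sketch does not adjust the values so that each bracket's average matches the Census average, and the cap on the top bracket is an assumption, since the real top bracket is open-ended.

```python
import random

def simulate_households(brackets, n_households, seed=0):
    """Draw integer incomes uniformly inside each bracket.

    brackets: list of (low, high, count) tuples; the number of simulated
    households in each bracket is proportional to its count.
    """
    rng = random.Random(seed)
    total = sum(count for _, _, count in brackets)
    incomes = []
    for low, high, count in brackets:
        k = round(n_households * count / total)  # this bracket's share
        incomes.extend(rng.randint(low, high) for _ in range(k))
    return sorted(incomes)

# Hypothetical brackets: (low, high, households in thousands). NOT Census data.
brackets = [
    (0, 24_999, 30_000),
    (25_000, 49_999, 30_000),
    (50_000, 99_999, 35_000),
    (100_000, 199_999, 20_000),
    (200_000, 500_000, 5_000),  # assumed cap on the top bracket
]

sample = simulate_households(brackets, 120)
print(len(sample), "simulated households")
print("lowest:", sample[0], "highest:", sample[-1])
```

Because each bracket's share is rounded to a whole number of households, the total can differ slightly from the requested size when the proportions do not divide evenly.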

    Thanks for the comment! Nerd.

