Monday, September 26, 2016

Mean, Median, Mode - Mathematical Description


One thing that is always hard for new statistics students is the idea that the mean, median, and mode each have two different (but related) meanings.  Each can refer to a property of a specific set of data that has been collected or to a property of an entire distribution.  This post deals with the first meaning:  a specific set of data.

Mode:  This one is pretty straightforward.  If you have a set of numbers (i.e. data points), then the mode is the number (or numbers) appearing most frequently.  It doesn't matter if the most frequent number is the smallest in the list, the largest, or anywhere in the middle.  If it shows up on the list the largest number of times, it is the mode.
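Here is a minimal Python sketch of that counting idea.  The function name `modes` and the sample lists are made up for illustration; the point is only that ties are allowed, so the result is a list.

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest count, in first-seen order."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([7, 1, 3, 7, 9, 3, 7]))  # [7]  (7 appears three times)
print(modes([1, 1, 2, 2, 5]))        # [1, 2]  (a tie, so there are two modes)
```

Notice that the position of the value in the list never enters the calculation, only how often it appears.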

Median:  This one is also pretty easy to understand.  You take your list of numbers, put them in order from smallest to largest, and then pick the one that is exactly in the middle.  That's the median.  There are some details left to nail down that aren't very interesting (and I'll discuss them at the end of this part), but the big idea is that the median has the property that half of the data points are below it and half are above it. 

Now notice that it could be the case that the small numbers on the list are really "bunched up" while the larger numbers are really spread out (or vice versa) -- something like:  1, 2, 3, 4, 10, 100, 1000.  The median is 4 because 1, 2, 3 are below it and 10, 100, 1000 are above it.  So the median can be really close to the smallest numbers and really far from the largest numbers (or vice versa).

One uninteresting detail has to do with the number of data points.  If the number of data points is odd, then there is always a data point that is the median (like in the example above).  However, if the number of data points is even -- for example: 1, 4, 5, 11 -- what do you do?  In this case, people typically find the two numbers in the middle of the list (4 and 5 from my example) and average them to get the median.  So it would be 4.5 for this example.  Notice that 4.5 has the desired property that half the data points are less than 4.5 and half are greater.  So the median need not be one of the data points. 
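The odd and even cases can be sketched in a few lines of Python.  This is just one way to write it, using the "average the two middle numbers" convention described above; the function name `median` is my own choice.

```python
def median(data):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: a real data point
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle pair

print(median([1, 2, 3, 4, 10, 100, 1000]))  # 4
print(median([1, 4, 5, 11]))                # 4.5
```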

The other uninteresting detail has to do with repeated data points.  For example: 1, 2, 3, 3, 3, 3, 4, 4, 5.  In this sorted list, the third 3 is the middle data point so it is the median.  If we include two of the threes as "smaller than the median" and one of the threes as "larger than the median" then exactly half are larger and half are smaller.  The more natural idea of only looking at data points strictly larger than three and strictly smaller fails since 2 data points are actually smaller and 3 are actually larger.  There does not exist a number that has the property that exactly half the data are strictly larger.  Nevertheless, the median is three.
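If you want to check those counts yourself, here is a small Python sketch using the standard `statistics` module.  The variable names are just for illustration.

```python
import statistics

data = [1, 2, 3, 3, 3, 3, 4, 4, 5]
m = statistics.median(data)

below = sum(1 for x in data if x < m)    # strictly smaller than the median
equal = sum(1 for x in data if x == m)   # tied with the median
above = sum(1 for x in data if x > m)    # strictly larger than the median

print(m, below, equal, above)  # 3 2 4 3
```

The four tied threes are why the "strictly smaller" and "strictly larger" counts come out uneven.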

Mean:  It's a little harder to understand the "physical significance" of the mean.  With the median, we had a notion of the "middle" of some data that had the property that it didn't matter how much a data point was above (or below) the "middle" only that it was above (or below) the "middle."  In the 1, 2, 3, 4, 10, 100, 1000 example, that notion makes the "middle" of this set of numbers equal to 4.  That "middle" wouldn't change if the 1000 data point were changed to 100,000.  In some ways, that is a nice property to have, but in other ways, it seems rather odd.  I mean, 4 doesn't really seem to capture the idea that the numbers are getting really spread out on the top end. 

The mean is a way to describe the "middle" so that numbers far from the middle have a bigger influence than numbers close to it.  The best physical analog is the idea of the center of mass, but if you haven't thought carefully about physics, this analog won't help much.  Instead, let me ask this question.  If you are generally an "A student" in a class, and you happen to fail an exam, would you rather fail it with a 50% or with a 0%?  Why?  In either case, that failing grade will be your lowest grade, so why should it matter whether it was bad or REALLY bad?  The reason is that the 0% score brings down the average more than the 50% score does.  (They have the same impact on the median.)  The mean is an idea of the "middle" that is sensitive to how "stretched out" the data is.
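To put numbers on the exam question, here is a short Python sketch.  The four "A student" scores are invented for illustration; only the failing score changes between the two lists.

```python
import statistics

good_scores = [95, 92, 98, 90]   # hypothetical "A student" exam scores

with_50 = good_scores + [50]     # failed one exam with a 50%
with_0 = good_scores + [0]       # failed one exam with a 0%

mean_50 = sum(with_50) / len(with_50)
mean_0 = sum(with_0) / len(with_0)

print(mean_50, mean_0)                                         # 85.0 75.0
print(statistics.median(with_50), statistics.median(with_0))   # 92 92
```

The mean drops by ten points when the 50% becomes a 0%, while the median does not move at all.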


Finally We Have:
Three Ways to Find an Average:
1. Mean: Add up all the numbers and divide by how many numbers there are. For example, (40+12+8+7+1+1+1)/7 = 70/7 = 10. This is the most commonly used average.
2. Median: Arrange all the numbers in order from smallest to largest. Pick the middle number. If you have an even number of data points, take the mean of the two middle numbers.
3. Mode: Look through all the numbers and count how often each number occurs. Pick the number (or numbers) that occurs most often.
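As a quick check, all three recipes can be run on the numbers from the mean example using Python's standard `statistics` module.  This is only a sketch; `multimode` needs Python 3.8 or later.

```python
import statistics

data = [40, 12, 8, 7, 1, 1, 1]

mean_value = statistics.mean(data)        # 70 / 7, which is 10
median_value = statistics.median(data)    # middle of 1, 1, 1, 7, 8, 12, 40, which is 7
mode_values = statistics.multimode(data)  # list of the most frequent values: [1]

print(mean_value, median_value, mode_values)
```

For this one list, the three "averages" are 10, 7, and 1, which shows how different they can be when the data are bunched up at one end.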

