Back to basics – what’s the “summary” stuff?
Number of service requests to NYC’s 311 from 2004 to 2016: 24.54 million
> summary(Days To Close)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-42440.0 0.1 2.9 9.0 10.0 365200.0
Take the numbers 1, 2, 4, 7, 13, 16, 21 as an example. 1 is the smallest number in the set – the “Min.” and 21 is the largest, the “Max.”
The minimum number of days to close out a complaint is, apparently, -42440 days. How can anything take less than zero days to complete? Someone put the wrong date under “Closed Date.” That’s how. Similarly, the “Max.”, the most number of days it took to close out a ticket is allegedly 365200 days, which is a 1000 years and some months. Again, easy to do if you punch in 3016 instead of 2016 for year, or 2102 instead of 2012, or various other combinations of typos. That takes care of the min and max for now. We will get back to what to do with these extreme values later.
The second item “1st Qu.” is short for 1st quartile. Take a piece of string and cut it into four equal parts. You have four quartiles when it’s divided into four equal parts. Mark out five equal parts and you have five quantiles, and so on. If you heard “percentile,” say in reference to your grades/scores, that means a 100 equal parts. If your score was reported as being at the 87th percentile that means your score is higher than 86% of people who took that test.
Take a 101 dalmatians, count and tag the number of dots on each, and stand them in a single line for a photo. The dalmatian with the least number of spots is the first one on the left, then the next higher number and so on, and the dalmatian with the most number of spots is at the right end of the line. We want to see if there is any pattern to the number of spots and we decided to divide the 101 into four equal groups. Round down or up, don’t take an axe to the dogs! Counting from the first, identify the 25th dog. Say it has 42 spots. The 1st quantile is 42, meaning a quarter of the dalmatians have 42 or fewer spots. And, by extension, three-quarters have more than 42.
How you get 101 dalmatians to sit in a line for any length of time without chasing them after something is beyond the scope of this discussion, but seeing as they are still sitting there – find the 51st and see what the tag says. It says 59 spots? Ok, so 59 is the second quartile, also the same as something called the median. This is an important measure. The median, much like the one on the highway, divides a thing into two halves. Half the dogs have less than the median number of spots and half have more than the median number of spots. The dog at the 51st position divides the 101 into two – fifty dogs to the left of dog #51 and fifty dogs to the right of #51. The median is also an “average” but different from the ‘mean’ which is also an average. More on that later.
The 76th dog would be the 3rd quartile. If you divvied them up into 5 quantiles you would count the 21st, 41st, 61st and 81st dogs to mark each of the 5 buckets, er, 5 quantiles. In this case, or in any case where you have less or more than 4 groups, the 2nd quantile, the 41st dog, will not be the same as the median, which is the 51st dog. The median is at the center of two equal halves, always.
What happens if one of the dogs ran away? You now have a 100 dalmatians and taking the 51st dog as the median marker would give you two unequal buckets, er groups – one with 50 and the other with 49 dogs. So what’s the median number of spots on these 100 dalmatians? Take the 50th and 51st dogs, average out the number of spots they have, and that result is your median. Say the 50th dalmatian has 58 spots. We know from before that the 51st has 59 spots, so that’s a total of 117 spots, divide by two, and the average spots on the two dogs is 58.5. That’s your median for the 100 dogs – 50 dogs have fewer than 58.5 spots and 50 have more.
Let’s see what we have after median:
> summary(Days.To.Close)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-42440.0 0.1 2.9 9.0 10.0 365200.0
Mean: This is your every-day average. Strictly speaking, mean and median, and the third leg of that triad, “mode”, are all averages, but in everyday usage when someone says “average” it is referring to the mean, not median or mode.
The Census Bureau reports median incomes, not average incomes, for a reason. Take a common example – the median income for your ZIP code is say $50,000. That is, half the people in your ZIP code earn less and half earn more. Bill Gates moves into your neck of the woods. Your ZIP Code average income, the mean, will jump. Because you are adding his income to the income of each of the other residents and dividing that total income by the number of people in the ZIP code. The median income will stay roughly the same because we are not adding Bill Gates’ income, we are only adding him as one extra person, at the super-high end of the income line, the dalmatian with an outsize number of spots.
The median helps as a counter-check to the mean: with Bill Gates on your street the ‘average income’ on the street would be in the millions even if none of the other residents makes a cent over 50k a year. If you take the median instead, you might actually have a number that is more representative of the average income on the street.
Going back to the dogs, say the 101st dalmatian had 4 bazillion spots. As far as median goes, we don’t care. The median is still the number of spots on the 51st dog. Though the median stays the same, this 101st dalmatian with the bazillion spots will send the “mean”, the everyday average, skyrocketing. Because you are adding that bazillion to the total spots on the other 100 dogs and dividing by 101. Boom! Through the roof. If #101 runs off to chase squirrels, your “mean”, the everyday average, will drop quite a lot, but the median only goes from 59 to 58.5. Big whoop.
And thereby hangs a tale. The closer the mean and median are, the better. When the mean is way higher than the median, somebody or something has a lot more of whatever it is that you are counting. The bazillion spots on dog 101, and Bill Gates’ wealth are extreme values. When the difference between median and mean is very large, it pays to poke around a bit and figure out why. It is normal in many statistical exercises to omit a certain number of values at either end of the chain. If you are trying to make policy decisions about school lunches in a given ZIP code based on average incomes, you want leave out Bill Gates so you get a truer picture of incomes in the area and the potential demand for subsidized lunches. On the other hand, if you are studying a new drug, a side-effect called “death” would be an extreme value but one you want to keep, not throw away.
Back to “Days To Close”:
> summary(Days To Close)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-42440.0 0.1 2.9 9.0 10.0 365200.0
We have extreme values on both ends. The minimum can be a negative number, but in this case, it is nonsense. You cannot close out a complaint before it was made. And likewise, complaints that get closed out years and decades into the future, we intuitively understand, is also nonsense beyond a point.
The minimum side is easy to deal with. Ignore all instances when a complaint was closed out before it was made (which is how you get a negative number here). We take out these and get:
> summary(DaysToClose)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.2 3.0 19.5 10.0 365200.0
The most significant change by setting a floor of zero days to close? The mean/average jumps by 10.5 days! Earlier we were subtracting a fairly large number of days from the total, now we are not, so the average jumps.
Dealing with the max side in this data is a bit tricker. Say a ticket is opened on Jan 1, 2004, and closed out on Feb 13, 2104, a hundred years into the future. That is clearly an error. We take out numbers that involve time-travel. But what if that ticket from Jan 2004 is shown closed out ten years later in 2014? What is the harm in leaving that in? Maybe some problems do take the city years and decades to resolve?
This gets us to the point behind this statistical analysis. It is to get a reasonable estimate of the median and mean number of days it takes city agencies to attend to a complaint. Say tickets were not closed out because somebody forgot, a manager notices that and closes them out, but months or years later. If you leave them as is, are you getting a reasonable estimate of how long it takes the city to take care of complaints? You are coming to vastly different conclusions if it takes the city an average of 9 days to address a problem versus 20 days, as in the above example.
Let’s see what happens if we take out the obvious errors. We exclude closed dates that occur after Feb 19, 2017, the date when the 2004 to 2016 data was downloaded (that is, they are in the future).
> summary(DaysToClose)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.2 3.0 19.5 10.0 2926.0
Nothing much happened except the Max is now “only” 2926 days – which is 8+ years! Versus 365200 days or a 1000+ years before. Clearly we need to use our judgement on this. Let’s leave out complaints that were closed more than a year after being filed and see what happens.
>summary(DaysToClose)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.2 3.0 12.9 10.0 365.0
The average/mean drops from 19.5 to 12.9. That is quite significant. Is 365 days the right cut-off to use? The average is still 10 days over the median, which is a pretty big gap. We lose 204,019 rows by taking out anything that was closed out after a gap of 365 days. By contrast, we lost 1.15 million rows by taking out tickets that were closed out before they were created. The difference is simple – there is no way a ticket can be closed before it is created. The close-out date is, therefore, without doubt an error, and we take out such errors. We know zero about how long those complaints really took so we cannot arbitrarily assign a number of days it took.
But a complaint can take more than 365 days to address and close out, however unlikely. Taking out errors is one thing, and taking out the unlikely, which are not necessarily errors, is something else. I tried a cut off of 180 days. The median drops from 3.0 to 2.9 and the mean from 12.9 to 10.5 days. What now? The drop from 12.9 to 10.5 is not as large as a drop of 6.5 days, from 19.5 to 12.9.
Looking only at tickets that were closed out 365 days or more after they were created, the average life of such a ticket was 775 days and the median fairly close at 660 days.
>summary(DaysToClose)
Min. 1st Qu. Median Mean 3rd Qu. Max.
365.0 476.5 660.0 775.3 992.3 365200.0
By lowering the limit to 180 days we may actually be crossing over from cleaning up errors to tampering with data, “massaging” it. So 365 days it is.
That’s “summary” statistics. Maybe more detail than you bargained for but hopefully with enough clarity.
311 data from NYC Open Data downloaded at various times. As of Feb 19, 2017, 24.5 million rows, 52 variables till 2009, 53 variables from 2010. Parsed using R v3.3.1 in RStudio v1.0.136