
Forensics of DOT’s time travel

Number of service requests to NYC’s 311 from 2004 to 2016: 24.54 million
So I ran into an interesting problem today. I was trying to compute the average life of an NYC 311 service request – the time between when a request is created and when it is closed out. This called for basic summary stats of the variable. (What are summary stats? See the note at the end.)

I noticed that the “min” was a negative number. So I switched from looking at the lifespan of a service request in days to looking at it in hours, just in case something going on under the R hood was eluding me and messing up the days calculation. But no, the min for hours was negative too. As with days, it was not an insignificant number. A quick check showed 457,472 instances where a service request was closed out before it was created: 1.86% of all cases from 2004 onwards. To make sure this wasn’t some “recycling of unequal columns” while reading in the data via data.table, I verified a random handful against NYC Open Data (on Feb 22, 2017), and they checked out.
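The check described above can be sketched in a few lines. This is only an illustration in Python, not the R/data.table code actually used for this post; the three sample rows are made up, and the field names and timestamp format are assumptions about how the dates are stored.

```python
from datetime import datetime

# Assumed timestamp format, e.g. "02/10/2016 09:15:00 AM".
FMT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical rows, invented for illustration (not real 311 records).
requests = [
    {"agency": "DOT", "created": "02/10/2016 09:15:00 AM", "closed": "02/09/2016 04:00:00 PM"},
    {"agency": "NYPD", "created": "02/10/2016 10:00:00 PM", "closed": "02/11/2016 01:30:00 AM"},
    {"agency": "DSNY", "created": "03/01/2016 08:00:00 AM", "closed": "03/04/2016 08:00:00 AM"},
]

def lifespan_hours(row):
    """Hours from creation to close-out; negative if closed before created."""
    created = datetime.strptime(row["created"], FMT)
    closed = datetime.strptime(row["closed"], FMT)
    return (closed - created).total_seconds() / 3600

hours = [lifespan_hours(r) for r in requests]

# Rows where the request was closed out before it was created.
negative = [r for r, h in zip(requests, hours) if h < 0]

print("min lifespan (hours):", min(hours))
print("closed before created:", len(negative))

# Sanity check on the share quoted in the post, using the rounded
# total of 24.54 million requests.
share_pct = round(100 * 457_472 / 24_540_000, 2)
print("share of all requests (%):", share_pct)
```

Computing the lifespan in hours rather than days, as in the post, makes no difference to the sign: a request closed before it was created comes out negative in any unit.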

Of the 457,472, most were related to the DOT (Department of Transportation), followed by the DOHMH (Department of Health and Mental Hygiene). Here is the count for the top 5:

1:    DOT       424,313
2:    DOHMH      30,677
3:    DEP         1,121
4:    DSNY          774
5:    DPR           445
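A ranking like the one above comes from counting the negative-lifespan requests per agency. Here is a minimal sketch in Python (again, not the original R code), using made-up (agency, hours) pairs rather than the real data:

```python
from collections import Counter

# Hypothetical (agency, lifespan in hours) pairs, invented for illustration.
lifespans = [
    ("DOT", -17.25),
    ("DOT", -3.0),
    ("DOHMH", -48.0),
    ("DOT", 5.5),
    ("DSNY", 72.0),
    ("DEP", -1.0),
]

# Count only the requests that were closed before they were created.
negative_by_agency = Counter(agency for agency, hours in lifespans if hours < 0)

# Keep the five agencies with the most negative-lifespan requests.
top = negative_by_agency.most_common(5)
for rank, (agency, count) in enumerate(top, start=1):
    print(f"{rank}: {agency:<8}{count:>8,}")
```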

Want a list of all 457,472? Email me or leave a comment and I’ll stick it on Google Drive.

(Summary stats give you an overview of the data. Say you are talking about the height of students in a class. You list the shortest and tallest heights in the group. You find and list the average: add up the heights and divide by the number of students. You list the median: line up the students from shortest to tallest and pick the student in the middle, the value that divides the class into two equal halves, like the median on the road; half the students are taller than the median and half are shorter. If a prankster substituted the height of the tallest student with the height of Godzilla in the table, your median would stay the same, but your average height would be way over anything you’ve ever seen (because you are adding them all up), so you take out or adjust for that kind of extreme value. In the above case, we know something is wrong when we see a minimum time that is negative. That’s the same as finding that the shortest height in the class is less than zero – you know something is wrong because no one can measure less than zero in height.)
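To make the Godzilla example concrete, here is a small sketch in Python. The heights are made up, and the figure used for Godzilla (10,000 cm) is just an arbitrary, absurdly large stand-in.

```python
from statistics import mean, median

# Hypothetical class heights in centimetres, invented for illustration.
heights = [150, 155, 160, 165, 170]

# The prankster swaps the tallest student's height for "Godzilla".
# 10,000 cm is an arbitrary placeholder, not a claim about Godzilla.
pranked = sorted(heights)[:-1] + [10_000]

summary = {
    "min": min(heights),
    "max": max(heights),
    "mean": mean(heights),
    "median": median(heights),
}
pranked_mean = mean(pranked)
pranked_median = median(pranked)

print("original:", summary)
print("pranked mean:", pranked_mean)      # dragged far above any real student
print("pranked median:", pranked_median)  # still the middle student
```

The median does not move because the middle student is still the middle student, while the mean is pulled up by the single extreme value.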

311 data from NYC Open Data downloaded at various times. As of Feb 19, 2017, 24.5 million rows, 52 variables till 2009, 53 variables from 2010. Parsed using R v3.3.1 in RStudio v1.0.136
