Forensics of DOT’s time travel
Number of service requests to NYC’s 311 from 2004 to 2016: 24.54 million
I noticed that the “min” was a negative number. So I switched from looking at the lifespan of a service request in days to looking at it in hours – just in case something going on under the R hood was eluding me and messing up the days calculation. But no, the min for hours was negative too. Like days, it was not an insignificant number. A quick check showed 457,472 instances where a service request was closed out before it was created: 1.86% of all cases from 2004 onwards. To make sure it wasn’t some “recycling of unequal columns” while reading in the data via data.table, I checked a random handful against NYC Open Data (on Feb 22, 2017), and they checked out.
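For readers who want to try the same check, here is a minimal sketch in Python (not the R/data.table code I actually ran). The three sample rows and the timestamp format are made up for illustration, so treat them as assumptions rather than the real 311 schema. The idea is simply: subtract the created time from the closed time, express it in hours, and count how many results fall below zero.

```python
from datetime import datetime

# Assumed timestamp format (MM/DD/YYYY hh:mm:ss AM/PM); adjust to match your download.
FMT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical (created, closed) pairs, invented for this example.
rows = [
    ("01/05/2015 09:00:00 AM", "01/07/2015 09:00:00 AM"),  # closed 48 hours after creation
    ("03/10/2016 02:30:00 PM", "03/10/2016 01:30:00 PM"),  # closed 1 hour before creation
    ("06/01/2014 12:00:00 AM", "05/31/2014 12:00:00 AM"),  # closed 24 hours before creation
]

def lifespan_hours(created, closed):
    """Return closed minus created, in hours (negative means 'time travel')."""
    delta = datetime.strptime(closed, FMT) - datetime.strptime(created, FMT)
    return delta.total_seconds() / 3600

hours = [lifespan_hours(created, closed) for created, closed in rows]
negative = [h for h in hours if h < 0]

print("min lifespan in hours:", min(hours))
print("closed before created:", len(negative), "of", len(hours))

# The share quoted above: 457,472 negative cases out of roughly 24.54 million requests.
share_pct = round(100 * 457_472 / 24_540_000, 2)
print("share of all requests (%):", share_pct)
```

Working in hours rather than days changes only the unit, which is why a negative minimum in days stays negative in hours.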
Of the 457,472, most were related to the DOT (Department of Transportation), followed by the Department of Health and Mental Hygiene (DOHMH). Here is the count for the top 5:
1: DOT 424,313
2: DOHMH 30,677
3: DEP 1,121
4: DSNY 774
5: DPR 445
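A ranking like the one above comes from grouping the negative-lifespan requests by agency and counting them. The sketch below shows that step in Python with a handful of invented records; the agency codes are real, but the rows and their hours are hypothetical and stand in for the full dataset.

```python
from collections import Counter

# Hypothetical (agency, lifespan_hours) pairs, invented for this example.
records = [
    ("DOT", -3.0),
    ("DOT", -0.5),
    ("DOT", -7.25),
    ("DOHMH", -12.0),
    ("DOHMH", -2.0),
    ("DEP", -1.0),
    ("DOT", 5.0),    # positive lifespan, so not counted
    ("DSNY", 2.0),   # positive lifespan, so not counted
]

# Count only the requests that were closed before they were created.
by_agency = Counter(agency for agency, hours in records if hours < 0)

# Rank agencies from most to fewest negative-lifespan requests.
top = by_agency.most_common(5)
for rank, (agency, count) in enumerate(top, start=1):
    print(f"{rank}: {agency} {count:,}")
```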
Want a list of all 457,472? Email me or leave a comment and I’ll stick it on Google Drive.
(Summary stats give you an overview of the data. Say you are talking about the height of students in a class. You list the shortest and tallest heights in the group. You find and list the average – add up the heights and divide by the number of students. You list the median – line up the students from shortest to tallest and pick the student in the middle, the value that divides the class into two equal halves, like the median on the road; half the students are taller than the median and half are shorter. If a prankster replaced the height of the tallest student with the height of Godzilla in the table, your median would stay the same, but your average height would be way over anything you’ve ever seen (because you are adding them all up), so you take out or adjust for that kind of extreme value. In the above case, we know something is wrong when we see an average time that is negative. That’s the same as getting an average height for the class that is less than zero – you know something is wrong because no one can measure less than zero in height.)
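The Godzilla prank is easy to try for yourself. Below is a small Python sketch using only the standard library; the five class heights and the 10,000 cm stand-in for Godzilla are made-up numbers chosen to keep the arithmetic simple.

```python
from statistics import mean, median

# Hypothetical class heights in centimeters, shortest to tallest.
heights_cm = [150, 155, 160, 165, 170]

# The prank: swap the tallest student's height for Godzilla's
# (10,000 cm is an assumed, illustrative figure).
pranked_cm = heights_cm[:-1] + [10_000]

print("before:", "mean =", mean(heights_cm), "median =", median(heights_cm))
print("after: ", "mean =", mean(pranked_cm), "median =", median(pranked_cm))
```

The median stays at 160 cm because the middle student has not moved, while the mean jumps from 160 cm to 2,126 cm because the one extreme value is added into the total.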
311 data from NYC Open Data downloaded at various times. As of Feb 19, 2017, 24.5 million rows, 52 variables till 2009, 53 variables from 2010. Parsed using R v3.3.1 in RStudio v1.0.136