Saturday, February 13, 2016

NYC Open Data project: NYC Health Inspections

Since moving to New York in 2009, I've had a mouse or roach problem in 2 out of 3 of my apartments. Am I dirty, or is it just this city? I prefer to believe the latter. Call me a hypocrite, but I would think twice about eating in a restaurant that had the same problem. I have definitely eaten at questionable places before, and I don't think that cleanliness is necessarily the most important factor when choosing a restaurant. Some of the best food on this Earth is served out of hole-in-the-wall joints. So take the following project with a grain of salt.

Ignorance is bliss, but I was very interested when NYC opened up the data from its restaurant inspections. While scanning the rows, I noticed that a good number of restaurants had violated my most feared codes, 04L ("Evidence of mice or live mice") and 04M ("Live roaches present"), but still managed to earn an "A" grade at some point. Since restaurants get two chances to earn an A grade, a restaurant might have cleaned up its act in between, so the A may not have come from the same inspection where the mice or roaches were spotted. Also, the number of points docked for one of these violations is not enough, in and of itself, to knock a grade down to a B. The restaurant must commit other violations in order to be given a B or lower.

For people who share my fear of tiny critters, I made a little app that lists the restaurants in your zip code that had live roaches present during one or more inspections, but still managed to earn an "A" grade.

I don't want to punish restaurants for having a roach problem...it happens to the best of us. Plus, who knows how many restaurants have roaches, but didn't happen to get caught on the day of the inspection? I don't want this information to get blown out of proportion.

However, I believe that people should have access to information if they want it. They should be able to easily find out if roaches were spotted at any given restaurant's health inspection, and make an informed decision about whether this information bothers them before proceeding to eat at said establishment. That's why I created Dine Decisive.

I chose to focus on Manhattan for Dine Decisive. Today, I decided to dive into the inspection data to explore how the five boroughs compare to each other in terms of roaches.

At first, I queried the data in the form of JSON via their API, but I quickly realized that some restaurants appeared in the portal but were missing from the JSON data. For example, I couldn't find "domodomo" when doing a quick control-F on the JSON link above. However, when I looked at the data on Socrata and searched for "domodomo," 5 results came up:


What the heck was going on?

I decided to download the data in the form of a CSV and was relieved to see that "domodomo" was in that file. 

So I suggest that you download the data in the form of a CSV rather than query the API (plus, Socrata often does maintenance on the weekends so you won't be able to access the data during certain times). You can actually see a popup about scheduled maintenance in the screenshot above.

At the top of the CSV file, you will see a helpful header row that tells you what each column means:
CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE

In Python, process the csv file by writing this:

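    import csv
    with open(nyc_csv, newline='') as csvfile:  # Python 3; in Python 2, use open(nyc_csv, 'rb')
        my_data = csv.reader(csvfile)
        for row in my_data:
            print(row)  # each row is a list of strings, one per column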

If the csv file is in the same directory as your script, you don't need to specify a directory. nyc_csv will just be the name of your csv file in the form of a string (e.g. 'nyc_csv.csv').

Some things to note about the file:
Every violation gets its own row, so a restaurant with several violations appears several times. I wanted to count how many total restaurants there were in each borough, so I didn't want to add any one restaurant more than once. Therefore, I created a list called rest_ids to keep track of which restaurants I had processed so far, and appended each new restaurant's CAMIS (row[0]) to it. I used CAMIS rather than DBA (the name of the restaurant) so that multiple locations of the same restaurant (e.g. chains like Dunkin Donuts) are counted separately.
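Here is a minimal sketch of that de-duplication step. It is not the exact script behind the results in this post: the function name and the sample rows are made up for illustration, and it uses a set instead of a list for the seen IDs, which makes the "have I seen this one?" check faster. It assumes CAMIS is column 0 and BORO is column 2, as in the header row.

```python
from collections import Counter

CAMIS_COL = 0  # unique restaurant ID (first column of the header)
BORO_COL = 2   # borough name (third column of the header)

def count_restaurants_by_boro(rows):
    """Count each restaurant once per borough, keyed on CAMIS."""
    seen_ids = set()    # CAMIS values already counted
    totals = Counter()  # borough -> number of distinct restaurants
    for row in rows:
        camis = row[CAMIS_COL]
        if camis in seen_ids:
            continue    # another violation row for the same restaurant
        seen_ids.add(camis)
        totals[row[BORO_COL]] += 1
    return totals

# Made-up sample rows (CAMIS, DBA, BORO), not real inspection data.
sample_rows = [
    ["1001", "DUNKIN DONUTS", "MANHATTAN"],
    ["1001", "DUNKIN DONUTS", "MANHATTAN"],  # second violation, same location
    ["1002", "DUNKIN DONUTS", "QUEENS"],     # same chain, different location
]
boro_totals = count_restaurants_by_boro(sample_rows)
```

With the real file you would pass in the rows from csv.reader, skipping the header row first.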

I only wanted to check recent inspections, so I filtered for inspection dates in the year 2015. You can update this in the code to look for inspections from prior years, if you're interested in seeing those results instead.
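A rough sketch of that filter might look like the following. The column positions come from the header row (INSPECTION DATE is column 8, VIOLATION CODE is column 10), but the MM/DD/YYYY date format is an assumption about the file, and the function and helper names are hypothetical. It checks only for code 04M (live roaches).

```python
from datetime import datetime

INSPECTION_DATE_COL = 8   # "INSPECTION DATE" in the header
VIOLATION_CODE_COL = 10   # "VIOLATION CODE" in the header
ROACH_CODE = "04M"        # "Live roaches present"

def is_roach_violation_in_year(row, year=2015):
    """True if the row records a live-roach violation inspected in `year`."""
    try:
        # Assumed date format: MM/DD/YYYY
        inspected = datetime.strptime(row[INSPECTION_DATE_COL], "%m/%d/%Y")
    except ValueError:
        return False  # header row, or a blank/malformed date
    return inspected.year == year and row[VIOLATION_CODE_COL] == ROACH_CODE

def make_row(date, code):
    """Hypothetical helper: build an 18-column row with only two fields set."""
    row = [""] * 18
    row[INSPECTION_DATE_COL] = date
    row[VIOLATION_CODE_COL] = code
    return row

in_2015 = is_roach_violation_in_year(make_row("06/15/2015", "04M"))
wrong_year = is_roach_violation_in_year(make_row("06/15/2014", "04M"))
wrong_code = is_roach_violation_in_year(make_row("06/15/2015", "04L"))
```

To look at a different year, pass year=2014 (or whichever year you like) instead of relying on the default.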

The results are in...drumroll, please!


Manhattan: 788 out of 9422 (~8.36%)
Staten Island: 38 out of 828 (~4.59%)
Queens: 615 out of 5361 (~11.47%)
Brooklyn: 677 out of 5710 (~11.86%)
Bronx: 238 out of 2237 (~10.64%)

The percentage refers to the number of restaurants with roaches divided by the total number of restaurants in that borough. I divided and multiplied by 100, then rounded to 2 decimal places. Make sure to convert the numbers to floats before dividing, since Python 2 drops the remainder when you divide two integers.
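As a quick sketch of that calculation, using the Manhattan and Staten Island counts from the list above (the function name is just for illustration): the float() call matters in Python 2 and is harmless in Python 3.

```python
def roach_percentage(with_roaches, total):
    """Percentage of restaurants with roaches, rounded to 2 decimal places."""
    return round(float(with_roaches) / total * 100, 2)

manhattan = roach_percentage(788, 9422)     # 8.36
staten_island = roach_percentage(38, 828)   # 4.59
```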

It's interesting to see that Staten Island scored the best, and Manhattan was in 2nd place. Brooklyn fared the worst, but was only slightly worse than Queens.
