Exploring New York City water tank inspection data.
My approach to exploring, analyzing and visualizing real estate data using Python and Plotly.
Lonnie HalpernBlockedUnblockFollowFollowingJun 25Typical water tank.
Photo by author.
I have been exploring New York City real estate data and experimenting with various methods of data analysis and visualization for the last few years.
Along the way I have learned much about New York City real estate data, in particular how to obtain, clean, analyze, and visualize / map it.
I have spent a good amount of time learning about data and various programming languages and techniques, searching Q&A’s on Stackoverflow, articles on Medium and various data tutorials.
I hope that by summarizing a project that I have been working on and outlining my process for exploring a data driven real estate question, sources of real estate data and tools for data analysis, I will illustrate that insight can be gained through data analytics and be helpful to those exploring their own data driven questions.
I spy over 12 water tanks in this photo.
Photo by author.
I have always been interested in building water tanks.
Despite all the advances in science, these wooden barrels which hold a building’s water supply have remained largely unchanged for the last century.
In New York City they are necessary to maintain water pressure for buildings over 6 stories.
Step 1 — Identify your question and data sourcesI am curious about how clean the water tanks are and whether there are any identifiable patterns to this.
Through some research, I learned that in May 2015 Chapter 31 of New York City’s Administrative Code went into affect which requires building owners to conduct annual inspections of their water tanks and report the results to the NYC Health Department.
Chapter 31 requires that each building accomplish this by January 15th of the following year and that all results are to be made publicly available.
The NYC Health Department maintains an online tool to look up the inspection results of any building in New York City.
When city data is available in an online based tool, there is a good chance that the underlying data is also available via the NYC open data portal (and if not, it should be).
This is because New York City recognized that data is valuable and should be made accessible to everyone and in 2012 passed Local Law 11 which called for the identification and aggregation of city wide data by 2018.
The on-ramp for searching this expansive data resource is https://opendata.
cityofnewyork.
us/Various ways to search for datasets on https://opendata.
cityofnewyork.
us/A simple search for water tanks quickly leads you to the data we are looking for which can be downloaded as a .
csv file.
Step 2 — Clean and explore the dataI generally use Python to analyze data.
The reasons I like Python (and why it is enormously popular) is that its constantly being developed and new libraries added, there is enormous amount of resources for learning and chances are that for anything you are trying to accomplish there are multiple discussions on StackOverflow to guide you.
If you are going to follow along, I suggest that you install Anaconda, its a very popular Python package which includes many of the must-have Python libraries preinstalled.
One such library is Jupyter which allows you to run Python commands in your browser via a notebook, which makes it easy to experiment and share your notebooks with others.
The Jupyter notebook corresponding to this article can be viewed via nbviewer here and all files (including underlying data) can be found in this GitHub repository.
Another popular library (also part of the Jupyter package) is Pandas which simplifies how you can organize, search and modify tabular data.
After the libraries are loaded the data (which was previously downloaded) can be read into a Panda’s dataframe.
Created a dataframe called data with the tank inspection resultsThe command shape tells us that the data consists of 19,752 rows with 48 columns, we can get a sense of the data (and the type of data) that it contains below.
One of the advantages of Python is that you can quickly summarize the data, the code below loops through certain columns of data to determine the number of results in each category.
From this we learn that there were 17,002 inspections in Manhattan during the history of the inspection program, 5,461 inspection in 2018, and more observations of biological growth and bird/insect debris than I would like to see (although on a percentage basis its extremely low).
It is also fairly simple to visualize data using Python.
When we do, a few observations stand out such as an increase in inspections in the last four years (with the biggest increase being between 2017 and 2018, the majority of inspections taking place in Manhattan and the dominance of a few firms in conducting inspection.
When I originally started this analysis in May of 2019, the data was last updated on October 2018 and consequently didn’t include the last three months of reporting in 2018.
I was hopefully that the reports were done and filed and reached out to NYC Open Data to request that that data be updated (NYC Open Data has a procedure for requesting additional information on its data sets).
In early June the data was updated (as of June 5th) and it appears that it is now being updated weekly!In addition to identifying missing data while exploring the data you often come across aspects of the data that need to be cleaned.
For instance, when I was working with the October 2018 data I noticed that there was no data for the Latitude column, and this led me to realize that the Latitude data was incorrectly labeled as Longitude and each of the columns to the right were skewed by one column.
This was subsequently corrected in the June 2019 data release.
In addition, the names of inspecting firms vary, “ISSEKS BROS INC” is also referred to as “Isseks Bros Inc” and “Isseks”.
If analyzing this is important to your research, you will want to make the data consistent.
Step 3 — Analyze and visualize the dataAs a result of my initial exploration of the data, I decided to change the focus of my analysis from the results of the inspections to which water tanks were not in compliance with the annual reporting obligations.
Group Data — Python has a powerful function GROUPBY which allows you to efficiently group data by certain attributes and run various calculations on the results (such as sum, average and count).
Since we are limiting our inquiry to whether a building has had its annual inspection, we will first select the data we need.
Since there is one row of data for each report, we will filter out multiple reports for the same year with the same BBL (keeping the last one).
Combine Data — Data analysis usually gets more interesting when you combine multiple data sets to look for patterns.
To illustrate this, I am going to introduce the PLUTO dataset which combines geographical data with information on each building.
For this analysis, I am going to focus on residential cooperative and condominium buildings in Manhattan.
I have created a .
csv file which contains all 6,065 cooperative and condominium buildings in Manhattan geocoded with Latitude and Longitude information.
Based on our understanding of the situation in which water tanks are necessary, I filtered this data set to only include buildings that are over six stories — there are 2,919 of them.
My plan is to combine the dataframe on water tank inspections with the dataframe on coop/condo buildings to create one dataframe.
In order to do this we need a key or unique identifier in each dataframe to match the data.
In our case we have BBL which is a unique identifier for each building (based on the borough / block / lot of each building).
We can then use Pandas to merge the df_cc dataframe with the df_summary dataframe.
We can then determine how many coop/condo buildings in Manhattan which should have water towers were not inspected in the last 5 years — 1,232 of them!Finally, using Plotly express we can create an interactive map (in one line of code!) which identifies Manhattan coops and condo buildings and color codes those which have not been inspected in the last 5 years in red.
px.
scatter_mapbox(dfnew, lat=”Latitude”, lon=”Longitude”, color=”Reported”, hover_name=”Address”, hover_data=a, color_discrete_map = {“Never”: “red”, “Yes”: “green”}, zoom=14)Conclusions / Next StepsBased on our preliminary analysis, there are 2,919 coop / condo buildings with water tanks in Manhattan and 1,232 never had a tank inspection — 42.
2% .
This is a very significant number and hopefully an overstatement!My next step is to test a sampling of the buildings to (i) confirm the existence of a water tank (google map’s satellite view or similar service can be used), I suspect that the premise that buildings greater than six stories require a water tank was an over simplification, and (ii) compare my results against the information provided via the NYC Health building tank search tool to confirm the findings.
I welcome feedback, suggestions and am particularly interested to see how others build on this.
I can be reached on Twitter @LonnieHalpern.