The art of parsing Zillow property data
Image of a letter addressed to Harry Potter with the address visible
Introduction
Extracting data from the web is a very common task for data scientists and data engineers. In this article, I will explain how I parse addresses from Zillow properties and why I do it this way. To start with, we need to forget every assumption that has ever been made about what constitutes valid data.
Zillow does its best to represent the data it has. But it is not foolproof: errors, spelling mistakes and other issues sometimes creep in. This is why we need to be prepared for the worst and hope for the best.
Zillow Data Exporter works well because I always assume that some things are going to go wrong when I try to extract the data. Based on this assumption, I build the software to handle mistakes and program it in a defensive manner. If I cannot assert with 99% confidence that a piece of data is correct, I will not use it.
That is why you will see empty cells in your spreadsheet. That is why you will see some properties that are not parsed correctly. There is no way to get this kind of software to work 100% of the time. But I can get it to work 99% of the time.
Usually in software we use the Pareto principle: 80% of the work is done by 20% of the code. In this case, 99% of the work is done by 1% of the code. The rest is just defensive programming.
To understand why I do it this way, we need to understand what an address is.
Myths about data
There are a lot of myths about data. One of the most common ones is that data is always correct. This is not true. Data is always wrong. It is just a matter of how wrong it is. If I asked you to write down your address, you would probably write it down correctly.
But if I asked a million people to write down their addresses, I would get a data set containing many mistakes, including misspellings, duplicates, and completely wrong addresses.
It's just human nature to make mistakes, and this is why we need to be prepared for the worst.
Let's look at an example. The address below seems fairly simple. It is an imaginary address in a city in the US. It has a street number, a street name, a city, a state and a zip code:
1234 Main Street, New York, NY 10001
To you, a human, it looks perfectly fine and you can accurately guess that:
- it is located on Main Street
- it is located in the city of New York
- it is located in the state of New York
- it is located in the zip code 10001
- it has the street number 1234
To a computer it is a lot more difficult. It is just a string of characters, and nothing in it says which part is the street name, the street number, the city, the state or the zip code. This is why we need to be prepared for the worst. We need to assume that the data is wrong or malformed and be ready to handle it.
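To make the problem concrete, here is a minimal sketch of the most naive approach: splitting the string on commas. The function name `naive_split` and the second address (with an apartment number) are made up for illustration. This is not how Zillow Data Exporter works. It only shows how quickly a positional rule breaks.

```python
from typing import List


def naive_split(address: str) -> List[str]:
    """Split an address on commas and trim the whitespace around each part."""
    return [part.strip() for part in address.split(",")]


# The example address splits into three tidy parts:
# street, city, and "state + zip".
simple = naive_split("1234 Main Street, New York, NY 10001")

# A hypothetical address with an apartment number splits into four parts,
# so "the second part is the city" is no longer true.
tricky = naive_split("1234 Main Street, Apt 5, New York, NY 10001")

print(simple)
print(tricky)
```

In the first case the second part is the city. In the second case it is the apartment, so any rule based on position alone silently produces wrong data.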
What is an address?
If I asked you to give me a definition of an address, you would probably say something like this: an address is a string of characters that contains:
- a street name
- a street number
- a city
- a state (optional, some countries don't have states)
- zip code
- a country (optional because in Zillow's case, we know that each property is located in the USA)
If your answer is similar to the one above, then unfortunately you are wrong.
To understand more about addresses, please read this very interesting article about the many assumptions that we have about addresses in our daily lives.
Some of those assumptions are:
- An address will start with, or at least include, a building number
- A building number will only be used once per street
- A street name won't include a number ...
So where does this leave us and how do we parse addresses?
There are many ways to parse addresses. But they are not perfect and they are not 100% accurate.
If we are dealing with US addresses, we want to parse the most obvious parts of the address: the street name, the street number, the city, the state and the zip code.
The zip code
The zip code is the easiest part of the address to parse. It is usually 5 digits long. It is usually located at the end of the address. It is usually separated by a space or a comma from the rest of the address. We can therefore assume that if we find a string of 5 digits in the address, separated by a space or a comma, then there is a high chance that it is the zip code.
This works 99% of the time. But it does not work 100% of the time. Sometimes the zip code is not separated by a space or a comma. Sometimes the zip code is not 5 digits long, for example when it is written in the extended ZIP+4 form (5 digits, a hyphen, then 4 more digits).
If we can't be certain that this is a zip code, Zillow Data Exporter will not use it and your spreadsheet row will contain an empty cell.
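The rule described above can be sketched with a regular expression. This is a minimal illustration, not the actual code of Zillow Data Exporter: the names `ZIP_AT_END` and `extract_zip` are hypothetical, and accepting an optional ZIP+4 suffix is my own assumption.

```python
import re
from typing import Optional

# Exactly 5 digits, preceded by a space or comma (or the start of the string),
# optionally followed by a ZIP+4 suffix, and located at the very end.
ZIP_AT_END = re.compile(r"(?:^|[\s,])(\d{5})(?:-\d{4})?\s*$")


def extract_zip(address: str) -> Optional[str]:
    """Return the 5-digit zip code, or None when the match is not clear-cut."""
    match = ZIP_AT_END.search(address)
    return match.group(1) if match else None


print(extract_zip("1234 Main Street, New York, NY 10001"))
print(extract_zip("1234 Main Street, New York, NY 1000"))
```

Returning `None` instead of a best guess is what ends up as an empty cell in the spreadsheet: a 4-digit or 6-digit run at the end of the string is rejected rather than trimmed to fit.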
The state
The state is a bit more difficult to parse. In its abbreviated form, it is always 2 letters long.
It is usually located towards the end of the address. It is usually separated by a space or a comma.
To get the state, we can therefore assume with a high probability that if we find 2 capital letters in the address, separated by a space or a comma, they are the state. To confirm that it is indeed a state, we can check whether the 2 capital letters are in a list of all known states in the US.
If we can't be certain that this is a state, Zillow Data Exporter will not use it and your spreadsheet will contain an empty cell.
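Here is a minimal sketch of that two-step check: pattern matching first, then a lookup in a list of known states. The names `US_STATES`, `STATE_TOKEN` and `extract_state` are hypothetical. Including DC in the list and preferring the last candidate (because the state usually sits towards the end) are my own assumptions for this example.

```python
import re
from typing import Optional

# Two-letter abbreviations of the 50 US states, plus DC (an assumption here).
US_STATES = frozenset("""
    AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
    MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
    SD TN TX UT VT VA WA WV WI WY DC
""".split())

# Exactly 2 capital letters, delimited by whitespace, commas,
# or the start/end of the string.
STATE_TOKEN = re.compile(r"(?:^|[\s,])([A-Z]{2})(?=[\s,]|$)")


def extract_state(address: str) -> Optional[str]:
    """Return the last 2-letter token that is a known state, or None."""
    candidates = [t for t in STATE_TOKEN.findall(address) if t in US_STATES]
    return candidates[-1] if candidates else None


print(extract_state("1234 Main Street, New York, NY 10001"))
print(extract_state("1234 NE Main Street, Portland, OR 97201"))
```

The second call shows why the list lookup alone is not enough: `NE` is both a street direction and the abbreviation for Nebraska, so the sketch falls back on position and keeps the candidate closest to the end.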
The city
The city is a lot more difficult.
It can, in theory, be any sequence of characters, so pattern matching is no help here. To be certain that we have the correct city, we need to check whether the city is in a list of all known cities in the US.
Unfortunately, doing a check like this is very slow and costly. A few services exist that provide a list of all the cities in the US. But they are not free and they are not cheap.
That means that at the moment I do not validate the name of the city. I just assume that the city is correct. If you want to validate the city, you can use a service like GeoNames to get a list of all the cities in the US and then use that list to validate the city.
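If you do obtain a list of cities, the validation itself can be a simple set lookup. The sketch below assumes you have already turned that list into (city, state) pairs. The names `normalize`, `build_city_index` and `is_known_city` are hypothetical, and `SAMPLE_ROWS` is a tiny placeholder, not real reference data.

```python
from typing import Iterable, Set, Tuple


def normalize(name: str) -> str:
    """Collapse repeated whitespace and ignore letter case."""
    return " ".join(name.split()).casefold()


def build_city_index(rows: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Build a lookup set of (normalized city, upper-case state) pairs."""
    return {(normalize(city), state.upper()) for city, state in rows}


def is_known_city(city: str, state: str, index: Set[Tuple[str, str]]) -> bool:
    """Return True only if this city/state pair appears in the index."""
    return (normalize(city), state.upper()) in index


# Placeholder rows; a real index would be loaded from a full data set.
SAMPLE_ROWS = [("New York", "NY"), ("Portland", "OR"), ("Portland", "ME")]
index = build_city_index(SAMPLE_ROWS)

print(is_known_city("new  york", "NY", index))
print(is_known_city("Portland", "TX", index))
```

Keying on the city and state together matters because many city names exist in more than one state. A lookup by name alone would accept a Portland paired with the wrong state.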
The street name
The street name is the most difficult part of the address to parse. It can be any string of characters. It can contain numbers. It can contain special characters. It can be a single word or multiple words. It can include a street type (such as Street or Avenue) or a street direction (such as North or SW).
It is very hard to verify and therefore I do not verify it. I just assume that the street name is correct.
Please verify the street name manually. If you find a mistake, please report it to me and I will try to fix it.
The street number
Once again, we can use pattern matching to find the street number. We look for numbers at the beginning of the address and we hope that they are the street number. Once again, there is no real way to guarantee that the number we found is the street number.
Always double check the street number in the exported spreadsheet. If you find a mistake, you can correct it manually.
Zillow Data Exporter makes no promises about the accuracy of the street number. It is just an educated guess.
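As a rough illustration of that educated guess, the sketch below takes the digits at the very start of the address, but only when they are followed by whitespace. The names `LEADING_NUMBER` and `extract_street_number` are hypothetical, and the whitespace requirement is my own assumption, chosen so that ambiguous forms are rejected rather than guessed.

```python
import re
from typing import Optional

# Digits at the very start of the address, immediately followed by whitespace.
LEADING_NUMBER = re.compile(r"^\s*(\d+)(?=\s)")


def extract_street_number(address: str) -> Optional[str]:
    """Return the leading run of digits, or None if the start is ambiguous."""
    match = LEADING_NUMBER.match(address)
    return match.group(1) if match else None


print(extract_street_number("1234 Main Street, New York, NY 10001"))
print(extract_street_number("1st Avenue, New York, NY 10001"))
```

With this rule, `1st Avenue`, `12-14 Main Street` and `123A Main Street` all yield `None`, because the digits are not a clean standalone token. A number such as `42` in `42 1st Avenue` is still accepted, and it is still only a guess.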
Conclusion
Zillow Data Exporter will try to extract data as best as it can, but it does not guarantee that the data is correct. No tool can guarantee that the data is correct.
It is in my opinion better to remove the invalid data than to have incorrect data. If you have incorrect data, you can't trust it. If you can't trust it, you can't use it.
I hope this blog post will help you understand why Zillow Data Exporter does not guarantee the accuracy of the data. I also hope that it explains why you sometimes see rows of data with empty cells.