Brief Overall Description of the Dataset:
Zillow is a company focused on providing citizens and homeowners with information on the real estate market in an easy and understandable way. Its dataset consists of data on median housing values, demography, mortgage rates, home value history and building characteristics. Zillow collects data from multiple sources, including sales records, county records, tax assessments, real estate listings and mortgage information, and users can update the dataset. Zillow uses most of this data to compute its ‘zestimates’, which are continually updated housing values for individual properties based on previous and current sales in their area. While the data is recorded in detail, including property characteristics and addresses, little of this information is available for purchase or use by outside industry, due to confidentiality rules and the terms of agreement accepted when registering to use the API. Upon further investigation, Zillow data was deemed not to be of use for this project due to heavy access restrictions.
Link: http://www.zillow.com/howto/api/APIOverview.htm, http://www.zillow.com/blog/research
Date Inventory Completed: 5/21/2015
Screening
- Is the data collected opinion-based?
- Is the data collection recurring?
- Is there data available for 2013?
- Is the data collected at the property or housing unit level?
- Can we access the data by August 15th?
Purpose
What is the purpose of the organization collecting the data?
The purpose of the Zillow database is housing market analysis and research.
Why is it collected and how does the organization use it?
It is collected so that realtors, government officials and individual homeowners have access to property history and monthly median property values. It is also useful for monitoring real estate trends across the country.
Who else uses the data?
Businesses, realtors
Who do they sell the data to?
Individual homeowners, businesses
Method
What is the data collection method?
The data is built from Zillow's “Zestimate”, the estimated market value of every home. First, raw median zestimates are calculated. These are then adjusted for any residual systematic error. Next, a Henderson moving average filter is applied, followed by a seasonal adjustment and a final quality control check.
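The first and third steps above (raw medians, then Henderson smoothing) can be sketched as follows. This is a minimal illustration, not Zillow's implementation: the sample values, the function names and the 5-term window are assumptions, the residual-error and seasonal adjustments are omitted, and the end points of the series, which in practice need asymmetric weights, are left unsmoothed.

```python
from statistics import median


def henderson_weights(terms):
    """Symmetric Henderson weights for an odd filter length (5, 7, 9, ...)."""
    if terms < 5 or terms % 2 == 0:
        raise ValueError("terms must be an odd integer >= 5")
    m = (terms - 1) // 2
    n = m + 2
    denom = (8 * n * (n**2 - 1) * (4 * n**2 - 1)
             * (4 * n**2 - 9) * (4 * n**2 - 25))
    weights = []
    for j in range(-m, m + 1):
        num = (315 * ((n - 1) ** 2 - j**2) * (n**2 - j**2)
               * ((n + 1) ** 2 - j**2) * (3 * n**2 - 16 - 11 * j**2))
        weights.append(num / denom)
    return weights


def henderson_smooth(series, terms=5):
    """Smooth interior points; the first and last terms // 2 points stay raw."""
    weights = henderson_weights(terms)
    m = terms // 2
    smoothed = list(series)
    for t in range(m, len(series) - m):
        window = series[t - m : t + m + 1]
        smoothed[t] = sum(w * x for w, x in zip(weights, window))
    return smoothed


# Hypothetical home-level zestimates for one region, keyed by month.
zestimates_by_month = {
    "2015-01": [180000, 200000, 250000],
    "2015-02": [182000, 204000, 251000],
    "2015-03": [181000, 199000, 255000],
    "2015-04": [185000, 207000, 256000],
    "2015-05": [186000, 206000, 260000],
    "2015-06": [188000, 210000, 262000],
}

# Step 1: raw median zestimate per month.
raw_medians = [median(values) for values in zestimates_by_month.values()]

# Step 3: Henderson moving average over the monthly medians.
smoothed_medians = henderson_smooth(raw_medians, terms=5)
```

A longer window (for example 9 or 13 terms) smooths more heavily; the window length actually used is not stated in this inventory.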
What is the type of data collected?
Administrative records and model-based estimates.
If designed, who created the questions?
What is the raw source of the collected data (prior to any aggregation)?
The raw data comes from prior sales records, county records, tax assessments, real estate listings, mortgage information and GIS data. Over one third of the homes in the database have been updated by users.
Description
What is the general topic of the data (1-2 words)?
Housing prices
What are the earliest and latest dates for which data is available?
1997-2015
Is data collected and available periodically?
Yes, monthly
How soon after a reference period ends can a data source be prepared and provided?
18-23 days
Selectivity
What is the universe (e.g., population) that the data represents?
Housing market (single-family homes, condos and co-ops) in the United States
Accessibility
How is the data accessed?
Is it open data?
Partly. Summarized data is available as .csv files at www.zillow.com/research/data, and data can also be retrieved through the API. Under the API Terms of Agreement, however, Zillow branding must be displayed wherever the API is used, and data from the API cannot be downloaded and stored locally.
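As an illustration of working with the summarized .csv files, the sketch below reshapes a wide table (one row per region, one column per month) into long time-series records. The layout, column names and values are hypothetical and are not taken from an actual Zillow file; check the header of the downloaded file before adapting it.

```python
import csv
import io

# Hypothetical sample in an assumed wide layout: identifier columns first,
# then one column per month. An empty cell means no value for that month.
SAMPLE_CSV = """RegionName,State,2015-01,2015-02,2015-03
Exampleville,VA,200000,201000,
Sampletown,MD,310000,309500,311000
"""

ID_COLUMNS = ("RegionName", "State")  # assumed identifier columns


def wide_to_long(csv_file, id_columns=ID_COLUMNS):
    """Yield one record per region and month, skipping empty cells."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        for column, value in row.items():
            if column in id_columns or value in ("", None):
                continue
            record = {key: row[key] for key in id_columns}
            record["month"] = column
            record["value"] = float(value)
            yield record


records = list(wide_to_long(io.StringIO(SAMPLE_CSV)))
for record in records:
    print(record)
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `open(path, newline="")`, in place of `io.StringIO(SAMPLE_CSV)`.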
Any legal, regulatory, or administrative restrictions on accessing the data source?
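Yes. The API Terms of Agreement require Zillow branding and do not allow API data to be downloaded locally, and property-level data cannot be distributed in bulk.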
Cost? - One time or annual or project based payment?
The available data is free; pricing for more in-depth data is not listed.
Does this dataset appear to meet our needs for the Census study? YES
Full Inventory
Description
- What are the general contents of the data source?
Demographics, housing costs, property history
- Features
- What is the temporal nature of the data: longitudinal, time-series, or one time point?
Time-series
- Geospatial? If Yes, at what level?
Yes, State, Metro/US, County, City, Zip Code, Neighborhood
Metadata
- Is there information available to assess the transparency and soundness of the methods to gather the data for our purposes?
Yes, the methods are clearly laid out, with equations for estimations. http://www.zillow.com/research/zhvi-methodology-6032/
- Is there a description of each variable in the source along with their valid values?
Not available
- Are there unique IDs for unique elements that can be used for linking data?
Addresses could be used for linking, if personally identifiable information (PII) were available.
- Is there a data dictionary or codebook?
Not available
Selectivity
- What unit is represented at the record level of the data source?
Property
- Does this universe match the stated intentions for the data collection? If not, what has been included or excluded and why?
Only properties that have been listed on Zillow by a realtor are included with full information. Otherwise, unknown.
- What is the sampling technique used (if applicable)?
Not available
- What was the coverage?
95% of US housing stock by market value
Stability/Coherence
- Were there any changes to the universe of data being captured (including geographical areas covered) and if so what were they?
Yes. The regions measured shift depending on the availability of timely and reliable data, which can change the number of geographies measured.
- Were there any changes in the data capture method and if so what were they? (e.g., revised questions, data collection mode, classification categories, algorithms for social media data)
None known
- Were there any changes in the sources of data and if so what were they?
No
Accuracy
- Any known sources of error?
The housing values are estimates based on Zillow’s “Zestimate”; the error is “just as likely to be above the actual sale price of the home as below.”
- Describe any quality control checks performed by the data’s owner.
There is a final quality control step, in which the zestimates are evaluated against a four-star quality rating function. The variables evaluated are the number of zestimates, the number of transactions in the most recent three months, temporal volatility, the number of outliers, gaps, jumps, and disclosure/non-disclosure states. Those rated below two stars are then discarded.
Accessibility
- Any records or fields collected but not included in the data source (such as for confidentiality reasons)?
Addresses, zestimates, sale prices and home characteristics
- Is there a subset of variables and/or data that must be obtained through a separate process? If yes, are there separate legal, regulatory, or administrative restrictions on accessing the data source? Cost? - One time or annual or project based payment?
No
Privacy and security
- Was consent given by participant? If so, how was consent given?
No. The data covers only properties, not the people living in them, so it is all public information.
- Are there legal limitations or restrictions on the use of the data?
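Yes. The API Terms of Agreement restrict how the data may be used, and property-level data cannot be distributed in bulk.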
- What confidentiality policies does the source have?
“Bulk distribution of property-level data (i.e. addresses, zestimates, sale prices, home characteristics, etc.) cannot be given or sold due to, among other reasons, contractual limitations”.
Research
- What research has been done with this dataset? (e.g., impact of policies, predictors of student success)
Unknown
- Include any links to research if provided:
N/A
- List any other data use notes provided by the supplier.
Gaps/Concerns
- Feasibility - can all jurisdiction levels provide the data (if applicable)?
Yes, this data is available for most of the United States.
- Data ownership - a lack of clarity in legal guidance stemming from a lack of clarity with who owns digital data?
- Data collection authority - what data is reasonably private and what constitutes unwarranted intrusion?
- Describe any other notes you have or any gaps/concerns you see with this dataset:
This data relies on estimates. Although the methodology is well supported, the data is not purely raw data points.