Profiling
Codebook created (the data came with no documentation, so a codebook had to be built before the work could proceed)
- See codebook here: WMLS Codebook
Each variable was profiled for quality (completeness, validity, consistency, and uniqueness). The results are documented in the codebook.
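As a rough illustration of how three of these quality dimensions could be computed for a single column, here is a minimal sketch. The NA spellings in `MISSING_TOKENS`, the function names, and the sample values are all assumptions made for this example, not the procedure actually used for the codebook. Consistency is left out because it depends on variable-specific coding rules.

```python
# Assumed spellings of "missing"; the actual NA codes in the data may differ.
MISSING_TOKENS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """Treat None and any assumed NA spelling as missing."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def profile_column(values, is_valid=None):
    """Return completeness, validity and uniqueness as fractions in [0, 1]."""
    present = [v for v in values if not is_missing(v)]
    completeness = len(present) / len(values) if values else 0.0
    if is_valid is None or not present:
        validity = None  # no rule supplied, or nothing to check
    else:
        validity = sum(1 for v in present if is_valid(v)) / len(present)
    uniqueness = len(set(present)) / len(present) if present else None
    return {
        "completeness": completeness,
        "validity": validity,
        "uniqueness": uniqueness,
    }


# Illustrative values only, not taken from the dataset.
city = ["Toano", "Williamsburg", "NA", None, "Toano"]
allowed_cities = {"Lanexa", "Newport News", "Toano", "Williamsburg"}
report = profile_column(city, is_valid=lambda v: v in allowed_cities)
print(report)
```

Here validity is measured only over non-missing values, so a column can be fully valid and still incomplete.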
Overall Data Description:
- Number of observations: 10,674
- 10 complete years of data (2005 to 2014), plus 2015 through about June.
- Unique identifier: “Parcel.ID” (see the data preparation page for more on the quality of this identifier)
Profiling summary
- Duplicates: 1 duplicated entry (Listing.ID 22010703 and 22010704)
- All variables are identical except Listing.ID
- In each table, a row is one sale transaction of a housing unit. A housing unit can appear more than once.
- The Parcel.ID variable needs to be cleaned before it can serve as a unique identifier.
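A duplicate of the kind described above (every variable identical except the listing ID) could be detected along these lines. This is a sketch only: the function name, the `Listing.ID` default, and the sample rows are invented for illustration, and rows are assumed to be dictionaries with hashable values.

```python
from collections import defaultdict


def find_duplicates(rows, id_field="Listing.ID"):
    """Group rows that match on every field except id_field.

    Returns a list of ID lists, one per group containing more than one row.
    """
    groups = defaultdict(list)
    for row in rows:
        # Key on all fields other than the ID; sort by field name so that
        # the order in which fields were stored does not matter.
        key = tuple(sorted((k, v) for k, v in row.items() if k != id_field))
        groups[key].append(row[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented rows for illustration; they do not come from the dataset.
sample_rows = [
    {"Listing.ID": 101, "City": "Toano", "Sold.Price": 250000},
    {"Listing.ID": 102, "City": "Toano", "Sold.Price": 250000},
    {"Listing.ID": 103, "City": "Lanexa", "Sold.Price": 180000},
]
duplicate_groups = find_duplicates(sample_rows)
print(duplicate_groups)
```

Because housing units can legitimately repeat across sales, the comparison uses all fields rather than the parcel or address alone.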
The tables below summarize the profiling results for key variables. For more detail, and for the profiling of all variables, see the codebook.
Quality: Location
- PUMA: 21 of the sales occurred outside of the JCC PUMA
- “List.Number”: no duplications

Variables | Completeness | Validity | Uniqueness | Consistency |
---|---|---|---|---|
“str_number” | 100% | 100% | 100% | |
“st_direction” | Being updated as encountered | Invalid entries (e.g. numbers) | Inconsistent levels (e.g. E., East, E) | |
“Street.Name” | 100% | Some streets are combined with the suffix | | |
“Street.Suffix” | 45% missing (some are combined with Street.Name); being fixed as encountered | Inconsistent NA coding | | |
“City” | 100% | 100% | Levels: “Lanexa, Newport News, Toano, Williamsburg” | 100% |
“Parcel.ID” | 2.7% missing (initially about 680, currently down to about 255) | Three TBD categories | Inconsistent NA coding. Inconsistent labeling (e.g. with or without dashes). Parcel IDs follow several different rules, all of which are acceptable. The first step is to fill in missing values, then go back and clean the entries that do not follow any of the four rules (10 consecutive digits; 11 consecutive digits; 10 consecutive digits followed by 1 letter; 10 digits with one letter in between) | |
“geo_lat” | 11 (<1%) missing; these addresses cannot be matched to any current address in Google Maps | Needs to be multiplied by 0.000001 | 100% | |
“geo_lon” | 11 (<1%) missing (same records as above; collected by the same process as lat) | Needs to be multiplied by 0.000001 | 100% | |
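The four Parcel.ID rules listed above lend themselves to a pattern check. The sketch below is one possible encoding, under several assumptions: dashes are stripped before matching, "10 digits with one letter in between" is read as 11 characters with a single letter that is neither first nor last, and the NA spellings and label names are hypothetical.

```python
import re

# One possible encoding of the four accepted formats (an interpretation,
# not a confirmed specification of the parcel ID rules).
PARCEL_RULES = {
    "10_digits": re.compile(r"\d{10}"),
    "11_digits": re.compile(r"\d{11}"),
    "10_digits_then_letter": re.compile(r"\d{10}[A-Za-z]"),
    # 11 characters in total: digits, one interior letter, digits.
    "10_digits_letter_inside": re.compile(r"(?=.{11}$)\d+[A-Za-z]\d+"),
}

# Assumed NA spellings; the data is reported to code NA inconsistently.
NA_TOKENS = {"", "NA", "N/A"}


def classify_parcel_id(raw):
    """Return the name of the matching rule, 'missing', or 'needs_review'."""
    if raw is None:
        return "missing"
    cleaned = raw.strip().replace("-", "")  # assumption: dashes are cosmetic
    if cleaned.upper() in NA_TOKENS:
        return "missing"
    for name, pattern in PARCEL_RULES.items():
        if pattern.fullmatch(cleaned):
            return name
    return "needs_review"


# Invented IDs for illustration only.
for example in ["12-345-67890", "12345A67890", "TBD", None]:
    print(example, "->", classify_parcel_id(example))
```

Entries labeled `needs_review` would be the ones to revisit after the missing values have been filled in.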
Quality: Housing
- “List.Number”: no duplications

Variables | Completeness | Validity | Uniqueness | Consistency |
---|---|---|---|---|
“Assessed.Value” | 60% missing | Assessed values of 0 or nearly 0 | 100% | |
“Year.Built” | 100% | 2 invalid entries (e.g. 0, 3, 19, 2942) | 100% | |
“Total.Rooms” | 9% missing | 100% | 100% | |
“Total.Bedrooms” | 100% | 100% | 100% | |
“total_bath” | 100% | 100% | 100% | |
“Type” | 100% | 100% | Levels: “Mobile Home, Single Family Detach/Single Family Attach, Other” | 100% |
“Ownership” | 100% | 100% | Levels: “Fee simple, Condominium, Coop” | 100% |
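Invalid “Year.Built” entries such as those listed above could be flagged with a simple range check. This is a sketch: the bounds are assumptions (the upper bound of 2015 reflects the last year in the data, while the lower bound of 1600 is an arbitrary placeholder), and the function name is hypothetical.

```python
def invalid_years(years, earliest=1600, latest=2015):
    """Return the entries that fall outside the assumed plausible range."""
    return [y for y in years if not (earliest <= y <= latest)]


# The first four values are the invalid examples named in the table;
# the last two are invented valid years added for contrast.
flagged = invalid_years([0, 3, 19, 2942, 1998, 2015])
print(flagged)
```

A similar threshold check could be applied to “Assessed.Value” to isolate the values that are 0 or nearly 0, once a cutoff for "nearly 0" has been chosen.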
Quality: Selling
- “List.Number”: no duplications

Variables | Completeness | Validity | Uniqueness | Consistency |
---|---|---|---|---|
“Sold.Date” | 100% | 100% | 100% | |
“Sold.Price” | 100% | 100% | 100% | |
“Occupied.By” | 13% missing | 100% | Levels: “Owner, Tenant, Vacant” | 100% |
“How.Sold” | 100% | 100% | Levels: “Assumption, Cash, Contract For Deed, Conventional, Farmers Home, FHA, Other, OwnerHeld, VA” | 100% |
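For categorical variables, the listed levels can double as a consistency check: any value outside the expected set points to a coding variant. The sketch below uses the “How.Sold” levels from the table. The function name and the sample values are invented for illustration.

```python
from collections import Counter

# Levels of "How.Sold" as listed in the table above.
HOW_SOLD_LEVELS = {
    "Assumption", "Cash", "Contract For Deed", "Conventional",
    "Farmers Home", "FHA", "Other", "OwnerHeld", "VA",
}


def unexpected_levels(values, allowed):
    """Count non-missing values that are not among the allowed levels."""
    return Counter(v for v in values if v is not None and v not in allowed)


# Invented values: "cash" and "FHA " mimic case and whitespace variants.
sample = ["Cash", "cash", "VA", None, "FHA "]
variants = unexpected_levels(sample, HOW_SOLD_LEVELS)
print(variants)
```

The same check, run with a set of accepted direction codes, would surface variants like the "E., East, E" spellings noted for “st_direction”.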