Profiling

A codebook was created because the data came with no documentation and one was needed in order to proceed.

Each variable was profiled for quality (completeness, validity, consistency, and uniqueness). The results are documented in the codebook.
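As a rough illustration of what a completeness and uniqueness check on one variable can look like, here is a minimal Python sketch. It is not the procedure used to build the codebook: the `MISSING` markers, the `profile_column` helper, and the toy `city` column are all assumptions made for the example.

```python
from collections import Counter

# Assumed markers for missing values; the real data may code NA differently.
MISSING = {None, "", "NA", "N/A"}

def profile_column(values):
    """Summarize completeness and uniqueness for one column of values."""
    total = len(values)
    present = [v for v in values if v not in MISSING]
    counts = Counter(present)
    return {
        # Share of rows that carry a non-missing value.
        "completeness": len(present) / total if total else 0.0,
        # Number of distinct non-missing values (the "levels").
        "distinct": len(counts),
        # Values that occur more than once.
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }

# Toy column for illustration only, not taken from the real data.
city = ["Toano", "Williamsburg", "NA", "Toano"]
report = profile_column(city)
print(report)
```

Validity and consistency checks depend on the rules for each variable, so they are not covered by this generic helper.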

Overall Data Description:

  • Number of observations: 10,674
  • 10 complete years of data (2005 to 2014), plus 2015 up to about June.
  • Unique identifier: “Parcel.ID” (see the data preparation page for more about the quality of this identifier)

Profiling summary

  • Duplications: 1 duplicated entry (Listing.ID 22010703 and 22010704)
    • All variables are the same except Listing.ID
  • In each table, a row is a sale transaction for a housing unit. Housing units can repeat.
    • The Parcel.ID variable needs to be cleaned before it can serve as a unique identifier.
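The duplicate check described above (rows that agree on every variable except the listing ID) can be sketched in Python as follows. This is only an illustration: `find_duplicates` is a hypothetical helper, and apart from the two Listing.ID values reported above, the rows and their values are invented for the example.

```python
from collections import defaultdict

def find_duplicates(rows, id_field):
    """Group the IDs of rows that match on every field except id_field."""
    groups = defaultdict(list)
    for row in rows:
        # Build a hashable key from all fields other than the identifier.
        key = tuple(sorted((k, v) for k, v in row.items() if k != id_field))
        groups[key].append(row[id_field])
    # Keep only the groups that contain more than one listing.
    return [ids for ids in groups.values() if len(ids) > 1]

# Toy rows: only the two Listing.ID values come from the profiling summary.
rows = [
    {"Listing.ID": 22010703, "Sold.Price": 200000, "City": "Toano"},
    {"Listing.ID": 22010704, "Sold.Price": 200000, "City": "Toano"},
    {"Listing.ID": 1, "Sold.Price": 150000, "City": "Lanexa"},
]
duplicate_groups = find_duplicates(rows, "Listing.ID")
print(duplicate_groups)
```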

The tables below summarize the profiling results for key variables. For more detail, and for the profiling of all variables, see the codebook.

Quality: Location
PUMA

21 of the sales occurred outside the JCC PUMA

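“List.Number”: No duplications

Columns: Variables, Completeness, Validity, Uniqueness, Consistency

“str_number”
  • Completeness: 100%
  • Validity: 100%
  • Consistency: 100%

“st_direction”
  • Completeness: currently updating these as they are encountered
  • Validity: invalid entries (e.g. numbers)
  • Consistency: inconsistent levels (e.g. E., East, E)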
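One possible way to collapse the inconsistent st_direction levels (E., East, E) into a single code, and to flag invalid entries such as numbers, is sketched below. The mapping table and the `normalize_direction` helper are assumptions for illustration; the full set of variants in the data is not listed here.

```python
# Assumed mapping from spelling variants to a one-letter code.
DIRECTION_CODES = {
    "n": "N", "north": "N",
    "s": "S", "south": "S",
    "e": "E", "east": "E",
    "w": "W", "west": "W",
}

def normalize_direction(raw):
    """Return a one-letter direction code, or None if missing or invalid."""
    if raw is None:
        return None
    # Ignore surrounding spaces, a trailing period, and letter case.
    key = raw.strip().rstrip(".").lower()
    return DIRECTION_CODES.get(key)

for value in ["E.", "East", "E", "12"]:
    print(repr(value), "->", normalize_direction(value))
```

In this sketch missing and invalid entries both come back as `None`, so a real cleaning pass would probably want to log the invalid ones separately before overwriting them.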

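“Street.Name”
  • Completeness: 100%
  • Consistency: some streets are combined with the suffix

“Street.Suffix”
  • Completeness: 45% missing (some are stored with Street.Name); fixing these as they are encountered
  • Consistency: inconsistent NA coding

“City”
  • Completeness: 100%
  • Validity: 100%
  • Uniqueness: levels are “Lanexa, Newport News, Toano, Williamsburg”
  • Consistency: 100%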

“Parcel.ID”
  • Completeness: 2.7% missing (starting at about 680, currently down to about 255)
  • Validity: three TBD categories
  • Consistency: inconsistent NA coding; inconsistent labeling (e.g. with or without dashes)

Parcel IDs follow several different rules, all of which are acceptable. The first step is to fill in the missing values, then go back and clean the codes that do not follow one of the four rules: 10 consecutive digits; 11 consecutive digits; 10 consecutive digits followed by 1 letter; or 10 digits with one letter in between.
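The four parcel ID rules can be expressed as regular expressions, which makes it easy to list the codes that still need cleaning. The sketch below is one reading of the rules, not the cleaning script itself: it assumes that "one letter in between" means a single letter that is neither the first nor the last character, and that IDs containing dashes count as not yet cleaned. The rule names, `parcel_rule`, and the sample IDs are made up for the example.

```python
import re

# One pattern per rule; the rule names are labels invented for this sketch.
PARCEL_PATTERNS = {
    "10 digits": re.compile(r"\d{10}"),
    "11 digits": re.compile(r"\d{11}"),
    "10 digits + trailing letter": re.compile(r"\d{10}[A-Za-z]"),
    # Lookahead fixes the total length at 11; the letter must be interior.
    "10 digits with interior letter": re.compile(r"(?=.{11}$)\d+[A-Za-z]\d+"),
}

def parcel_rule(parcel_id):
    """Return the name of the first rule the ID satisfies, or None."""
    for name, pattern in PARCEL_PATTERNS.items():
        if pattern.fullmatch(parcel_id):
            return name
    return None

# Made-up IDs for illustration only.
samples = ["1234567890", "12345678901", "1234567890A", "12345A67890", "123-456-7890"]
needs_cleaning = [s for s in samples if parcel_rule(s) is None]
print(needs_cleaning)
```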

“geo_lat”
  • Completeness: 11 (<1%) missing; these addresses cannot be matched to any current address in Google Maps
  • Validity: values need to be multiplied by 0.000001
  • Consistency: 100%

“geo_lon”
  • Completeness: 11 (<1%) missing (the same records as geo_lat; the collection process is the same as for latitude)
  • Validity: values need to be multiplied by 0.000001
  • Consistency: 100%
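The rescaling of geo_lat and geo_lon can be sketched as below. The code divides by 1,000,000, which is the same operation as multiplying by 0.000001 but avoids a little floating-point noise. The helper names and the sample values are assumptions for illustration, not records from the data.

```python
def scale_coordinate(raw):
    """Convert a stored integer coordinate to decimal degrees; None stays None."""
    if raw is None:
        return None
    return raw / 1_000_000

def is_plausible(lat, lon):
    """Check that scaled values fall inside the valid latitude/longitude ranges."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

# Made-up raw values for illustration only.
raw_lat, raw_lon = 37270700, -76707500
lat, lon = scale_coordinate(raw_lat), scale_coordinate(raw_lon)
print(lat, lon, is_plausible(lat, lon))
```

Running the range check before and after scaling is a quick way to confirm that every record was stored in the unscaled form.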
Quality: Housing

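“List.Number”: No duplications

Columns: Variables, Completeness, Validity, Uniqueness, Consistency

“Assessed.Value”
  • Completeness: 60% missing
  • Validity: assessed values of 0 or nearly 0
  • Consistency: 100%

“Year.Built”
  • Completeness: 100%
  • Validity: 2 invalid entries (e.g. 0, 3, 19, 2942)
  • Consistency: 100%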
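Invalid Year.Built entries like the examples above (0, 3, 19, 2942) can be caught with a simple range check. The sketch below is illustrative only: the lower bound of 1600 is an arbitrary assumption, and the upper bound of 2015 is taken from the last year covered by the data.

```python
def is_valid_year_built(year, earliest=1600, latest=2015):
    """Return True if year is an integer inside the assumed plausible range."""
    return isinstance(year, int) and earliest <= year <= latest

# The first four values are the invalid examples from the profile; 1998 is made up.
years = [0, 3, 19, 2942, 1998]
invalid_years = [y for y in years if not is_valid_year_built(y)]
print(invalid_years)
```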

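“Total.Rooms”
  • Completeness: 9% missing
  • Validity: 100%
  • Consistency: 100%

“Total.Bedrooms”
  • Completeness: 100%
  • Validity: 100%
  • Consistency: 100%

“total_bath”
  • Completeness: 100%
  • Validity: 100%
  • Consistency: 100%

“Type”
  • Completeness: 100%
  • Validity: 100%
  • Uniqueness: levels are “Mobile Home, Single Family Detach/Single Family Attach, Other”
  • Consistency: 100%

“Ownership”
  • Completeness: 100%
  • Validity: 100%
  • Uniqueness: levels are “Fee simple, Condominium, Coop”
  • Consistency: 100%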
Quality: Selling

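“List.Number”: No duplications

Columns: Variables, Completeness, Validity, Uniqueness, Consistency

“Sold.Date”
  • Completeness: 100%
  • Validity: 100%
  • Consistency: 100%

“Sold.Price”
  • Completeness: 100%
  • Validity: 100%
  • Consistency: 100%

“Occupied.By”
  • Completeness: 13% missing
  • Validity: 100%
  • Uniqueness: levels are “Owner, Tenant, Vacant”
  • Consistency: 100%

“How.Sold”
  • Completeness: 100%
  • Validity: 100%
  • Uniqueness: levels are “Assumption, Cash, Contract For Deed, Conventional, Farmers Home, FHA, Other, OwnerHeld, VA”
  • Consistency: 100%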

Attachments:

WMLS Data Dictionary.docx