In order to create estimates at the sub-county level, it was necessary to geocode the final set of parcels. By getting the latitude and longitude, we would then be able to place them within the appropriate tract and block group.

Problem: External data came with inconsistent or missing geocodes.

Goal: Create a set of geocodes that would be consistent within a dataset across years and also across datasets.

Addresses

In order to geocode, we needed a clean set of addresses. To get this list of clean addresses, we followed some basic rules for parcels that had different addresses across the years:

  • Parcels that have no street number + street, then street number + same street – add the street number to those missing it
  • Parcels that have no street number + street, then street number + different street – change those without a street number to the more complete address (number and street)
  • Parcels that have street number + street, then different street number + different street + no information – change to match the last appearing address
  • Parcels that have street number + street, then different street number + different street + a year built change that coincides with the address change – keep both addresses
  • Parcels that have minor differences in address (e.g. one number off, st. instead of rd.) – change to match the majority
  • Parcels that have the same street number + street, then different formats of unit numbering – change to match the last appearing unit formatting
  • Parcels that do not have an address or year built following a set of consistent addresses – keep NAs (inactive parcel)
  • Parcels that do not have an address or year built before a set of consistent addresses – keep NAs (new parcel)
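As an illustration of the first rule only, here is a minimal Python sketch. It is not the code used for this project: the function names, the upper-casing of addresses, and the simple "leading digits are the street number" parsing are all assumptions made for the example.

```python
import re

def split_address(address):
    """Split an address into (street_number, street); number is None if absent."""
    if not address:
        return None, None
    match = re.match(r"^\s*(\d+)\s+(.*\S)\s*$", address)
    if match:
        return match.group(1), match.group(2).upper()
    return None, address.strip().upper()

def fill_missing_numbers(addresses_by_year):
    """Apply the first rule to one parcel's addresses, ordered by year.

    If some years have only a street name and other years have a street
    number on that same street, copy the number into the incomplete years.
    Missing addresses (None) are left untouched.
    """
    parsed = [split_address(a) for a in addresses_by_year]
    numbers = {}  # street -> set of street numbers seen with it
    for number, street in parsed:
        if number is not None:
            numbers.setdefault(street, set()).add(number)
    cleaned = []
    for original, (number, street) in zip(addresses_by_year, parsed):
        if street is None:
            cleaned.append(original)            # keep NA as-is
        elif number is None and len(numbers.get(street, ())) == 1:
            only_number = next(iter(numbers[street]))
            cleaned.append(f"{only_number} {street}")
        elif number is None:
            cleaned.append(street)              # ambiguous or no number known
        else:
            cleaned.append(f"{number} {street}")
    return cleaned

# Made-up example parcel: the first year lacks the street number.
print(fill_missing_numbers(["Main St", "123 Main St", None]))
```

The sketch only fills a number when exactly one number has been seen for that street on the parcel; the other rules (majority matching, unit formats, year built changes) would need their own logic.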

The Geocodes

BKFS, CoreLogic, and WMLS came with their own sets of coordinates. After checking BlackKnight's Arlington data, we found that these coordinates change across the years for the same parcel. We received the following information from BKFS about their geocodes: "Latitude and Longitude are populated using Pitney Bowes Geostan address standardization software. The database used in the software is updated by PB on a monthly basis. The latitude and longitude values represent data provided by PB from different periods of time." See below for more on CoreLogic's geocodes.

GoogleMaps API

To place parcels into census tracts and to merge data sets together (when no parcel number is available), we use latitude and longitude. By making our own set of geocodes and not relying on the ones in BKFS or CoreLogic, we were able to have a stable set of coordinates both within and across datasets.

A "Master Address List" was formed, which includes addresses and geocoded information. New data sets will be merged with this master list to get the necessary geocoded information. Addresses that are not already in the master list will be geocoded and added to it.
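The master-list workflow amounts to a lookup table that only calls the geocoder for addresses it has not seen. A minimal sketch in Python, assuming the list is held as a dictionary keyed by cleaned address; `update_master_list` and `fake_geocode` are hypothetical names, and `fake_geocode` is a stand-in that returns dummy values instead of calling any real service.

```python
def update_master_list(master, new_addresses, geocode):
    """Look up addresses in the master list, geocoding only unseen ones.

    master: dict mapping address -> geocode record (mutated in place).
    geocode: callable taking an address and returning a record.
    Returns the list of addresses that had to be newly geocoded.
    """
    added = []
    for address in new_addresses:
        if address not in master:
            master[address] = geocode(address)
            added.append(address)
    return added


def fake_geocode(address):
    # Hypothetical stand-in for a real geocoding call; returns dummy values.
    return {"lat": 0.0, "lng": 0.0, "loctype": "rooftop"}


# Made-up addresses: one is already in the list, one is new (and repeated).
master = {"100 EXAMPLE ST": {"lat": 1.0, "lng": 2.0, "loctype": "rooftop"}}
incoming = ["100 EXAMPLE ST", "200 EXAMPLE ST", "200 EXAMPLE ST"]
added = update_master_list(master, incoming, fake_geocode)
print(added)
```

Because existing entries are never overwritten, an address keeps the same coordinates no matter which data set it later appears in, which is the stability property described above.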

Addresses were geocoded through the Google Maps API via R, subject to the following limits:

  • Daily max of 2,500 queries per IP address

  • Paid max is 100,000 queries per account. The first 2,500 are free, then $0.50 per 1,000

  • Limit of 10 queries per second
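To see what these limits imply for a large batch, the arithmetic can be sketched as follows. This is an illustration only: it assumes the free allowance is applied once to a single-day run and that billing is pro rata per query, neither of which is confirmed by the notes above. The batch size of 61,341 is the Arlington County address count reported later in this section.

```python
import math

FREE_PER_DAY = 2_500     # free daily queries, from the limits above
PRICE_PER_1000 = 0.50    # dollars per 1,000 queries beyond the free allowance
MAX_PER_SECOND = 10      # rate limit, queries per second

def days_on_free_tier(n_queries):
    """Days needed if only the free daily allowance is used."""
    return math.ceil(n_queries / FREE_PER_DAY)

def cost_single_day(n_queries):
    """Dollar cost of running all queries in one day (assumed billing model)."""
    billable = max(0, n_queries - FREE_PER_DAY)
    return billable / 1000 * PRICE_PER_1000

def min_seconds(n_queries):
    """Lower bound on run time implied by the per-second rate limit."""
    return n_queries / MAX_PER_SECOND

n = 61_341
print(days_on_free_tier(n), round(cost_single_day(n), 2), min_seconds(n))
```

The run-time figure is only a floor set by the rate limit; actual runs can take considerably longer because of network latency and retries.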

There are three main types of geocoding results: rooftop, approximate, and interpolated. The goal is to have them all be rooftop, as it is the most accurate. Approximate is when Google Maps does not have that address or a close match, so it takes the centroid of the smallest known administrative area. Interpolated is when there is no matching street number or close street number, so it takes the center between two points.

The output includes the following variables:

  • "lng" – longitude
  • "lat" – latitude 
  • "type" – type of address (premise, route, street_address, subpremise)
  • "loctype" – type of geocoding (rooftop, range_interpolated, geometric_center, approximate)
  • "address" – the address Google Maps ultimately searched for (can be different from the input)
  • "lat.1", "lat.2", "lng.1", "lng.2" – latitudes and longitudes of the viewport ("the recommended viewport for displaying the returned result, specified as two latitude,longitude values defining the southwest and northeast corner of the viewport bounding box. Generally the viewport is used to frame a result when displaying it to a user.")
  • street_address – a precise street address (the address Google Maps ultimately searched for, which can be different from the input)
  • route – street/route address is located on
  • country – the national political entity the address is located in
  • administrative_area_level_1 – state the address is located in
  • administrative_area_level_2 – county the address is located in
  • administrative_area_level_3 – district the address is located in 
  • locality – city the address is located in
  • neighborhood – a named neighborhood the address is located in
  • premise – a named location that is located at the address, e.g. a building or collection of buildings with a common name
  • postal_code – a postal code for the address
  • postal_code_suffix – the postal code +4 for the address
  • natural_feature – a prominent natural feature listed at the address
  • airport – an airport listed at the address
  • park – a named park listed at the address
  • point_of_interest – a named point of interest listed at the address.

The data fields listed here are fields we received back. For more potential fields, see: https://developers.google.com/maps/documentation/geocoding/intro
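For orientation, the sketch below shows how one entry of the Geocoding API's JSON `results` array could be flattened into a record with fields like those listed above. The project itself used R, so this Python version is an illustration rather than the code that produced the output. The sample response is made up, and the output names `viewport_sw` and `viewport_ne` are assumptions (the notes do not say which corner "lat.1"/"lng.1" refer to).

```python
def flatten_result(result):
    """Flatten one entry of a Geocoding API `results` array into a flat dict."""
    geometry = result["geometry"]
    record = {
        "lat": geometry["location"]["lat"],
        "lng": geometry["location"]["lng"],
        "loctype": geometry["location_type"].lower(),
        "type": result["types"][0] if result.get("types") else None,
        "address": result.get("formatted_address"),
    }
    viewport = geometry.get("viewport")
    if viewport:
        sw, ne = viewport["southwest"], viewport["northeast"]
        record["viewport_sw"] = (sw["lat"], sw["lng"])
        record["viewport_ne"] = (ne["lat"], ne["lng"])
    # Each address component carries one or more type tags; store its
    # long name under every tag except the generic "political" one.
    for component in result.get("address_components", []):
        for tag in component["types"]:
            if tag != "political":
                record[tag] = component["long_name"]
    return record

# Made-up response entry, shaped like the API's documented JSON output.
sample = {
    "formatted_address": "100 Example St, Arlington, VA 22201, USA",
    "types": ["street_address"],
    "geometry": {
        "location": {"lat": 38.88, "lng": -77.10},
        "location_type": "ROOFTOP",
        "viewport": {
            "southwest": {"lat": 38.879, "lng": -77.101},
            "northeast": {"lat": 38.881, "lng": -77.099},
        },
    },
    "address_components": [
        {"long_name": "Example Street", "short_name": "Example St", "types": ["route"]},
        {"long_name": "Arlington", "short_name": "Arlington", "types": ["locality", "political"]},
        {"long_name": "Virginia", "short_name": "VA", "types": ["administrative_area_level_1", "political"]},
        {"long_name": "22201", "short_name": "22201", "types": ["postal_code"]},
    ],
}
record = flatten_result(sample)
print(record["loctype"], record["locality"], record["postal_code"])
```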

When geocoding, there is an inherent level of error (e.g. error created due to satellite technology). Thus, some output coordinates fell outside Arlington County's boundary even though the address is an Arlington County residence.

Arlington County

Arlington County Real Estate Data formed the base of this "Master Address List" with 61,341 unique addresses.

  • Took approximately 17 hours to run.
  • Of these, 415 came back as errors
    • 6 came back with no results
    • 17 came back with results outside of Virginia
    • 368 came back with "approximate" loctype (how Google found the coordinates) and gave the center of Arlington County
  • These were re-run with the state included in the address. 380 errors remained
    • None were missing
    • 11 came back with results outside of Virginia
    • 383 came back with "approximate" or "range interpolated" loctype and gave the center of Arlington County
  • The errors were manually edited by removing the unit number and re-run. 4 errors remained
    • We compared each address to the one that Google ultimately used and deemed them acceptable
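The error categories in this list can be expressed as a small classification step run after each geocoding pass. The following is a sketch under assumptions: records are dictionaries with the field names from the output list above, a missing result is represented as `None`, and the labels `no_result`, `outside_state`, and `low_precision` are names invented for the example.

```python
from collections import Counter

# Location types treated as errors in the runs described above.
LOW_PRECISION = {"approximate", "range_interpolated"}

def classify(record, expected_state="Virginia"):
    """Return a label describing whether a geocode record needs a re-run."""
    if record is None:
        return "no_result"
    if record.get("administrative_area_level_1") != expected_state:
        return "outside_state"
    if record.get("loctype") in LOW_PRECISION:
        return "low_precision"
    return "ok"

def summarize(records):
    """Count records per label so each pass can be compared with the last."""
    return Counter(classify(r) for r in records)

# Made-up records covering each category.
records = [
    None,
    {"administrative_area_level_1": "Maryland", "loctype": "rooftop"},
    {"administrative_area_level_1": "Virginia", "loctype": "approximate"},
    {"administrative_area_level_1": "Virginia", "loctype": "rooftop"},
]
print(summarize(records))
```

Addresses with any label other than "ok" would be edited (for example by adding the state or removing the unit number) and sent through the geocoder again.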

The geocoded addresses were then placed within census tracts. 56 had a latitude and longitude outside of the Arlington County boundaries (multiple jurisdictions); these did not receive a census tract.
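Placing a coordinate within a tract is a point-in-polygon test. The sketch below uses a plain ray-casting test on hypothetical square "tracts" to show the idea, including how a point outside every polygon ends up with no tract. It is not the procedure used here; real tract boundaries would come from census boundary files and would normally be handled with GIS software.

```python
def point_in_polygon(lng, lat, polygon):
    """Ray-casting test: True if (lng, lat) is inside the polygon.

    polygon is a list of (lng, lat) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside

def assign_tract(lng, lat, tracts):
    """Return the id of the first tract containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lng, lat, polygon):
            return tract_id
    return None

# Hypothetical tracts: two adjacent unit squares, not real boundaries.
tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(assign_tract(0.5, 0.5, tracts), assign_tract(1.5, 0.5, tracts), assign_tract(3.0, 0.5, tracts))
```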

BlackKnight Addresses

After merging the addresses from BKFS with Arlington County's, there were 5,722 new addresses to geocode. These were added to the Master Address List.

MRIS Addresses

After merging the addresses from MRIS with Arlington County's, there were ~8,000 new addresses to geocode. Many of these were new because of differences in the street name (e.g. N Fenwick st vs Fenwick St). It was also necessary to remove the "#" from the units in the addresses, as the Google API did not fully recognize them and would misplace the address.
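A minimal sketch of this kind of cleaning step is shown below. The function name and the exact rules (upper-casing, dropping periods and commas, collapsing whitespace) are assumptions for illustration, and the house and unit numbers in the example are made up. The sketch does not resolve the directional difference (N Fenwick St vs Fenwick St), which needs a reference list of valid street names.

```python
import re

def normalize_address(address):
    """Upper-case, drop '#' from unit designators, and collapse whitespace."""
    cleaned = address.upper().replace("#", " ")
    cleaned = re.sub(r"[.,]", " ", cleaned)   # drop periods and commas
    return re.sub(r"\s+", " ", cleaned).strip()

print(normalize_address("1200 N Fenwick st. #302"))
```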

During the first run, 1,294 addresses needed to be cleaned and re-run because they had non-rooftop locations.

These were still geocoded and added to the Master Address List. Another option would have been probabilistic matching if the data set were larger. Due to missing APNs in MRIS, geocoding everything allowed the datasets to be merged with MRIS by latitude and longitude.

CoreLogic

CoreLogic (CL) data came with latitudes and longitudes. When checking Arlington County's CL data, we discovered that the coordinates are consistent from 2009 to 2012. In 2013, 4,717 (8%) changed. This difference is on average equivalent to about 1 foot; the largest difference is about 980 feet.
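One way to express a coordinate change as a ground distance is the haversine formula, sketched below in Python. The notes do not say how the distances above were computed, so this is an assumed method; it uses a mean Earth radius of 6,371 km converted to feet.

```python
import math

EARTH_RADIUS_FEET = 6_371_000 / 0.3048   # mean Earth radius, metres to feet

def haversine_feet(lat1, lng1, lat2, lng2):
    """Great-circle distance in feet between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))

# Made-up pair of coordinates for the same parcel in two different years.
print(haversine_feet(38.88000, -77.10000, 38.88001, -77.10000))
```

Applying such a function to each parcel's coordinates in consecutive years gives a per-parcel shift that can then be averaged or ranked.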

Because addresses were difficult to merge on due to their structure (~16,000 did not match at some level), we merged by APN instead.

Between 2009 and 2013, 75 APNs in CoreLogic were not in any of the county's final data (they were removed when going from parcels to housing units). In addition, 24 were not geocoded as within Arlington (they are on the border).

For James City County:

  • 6 did not have any address, so they were dropped.
  • 24 street names were "The" with the mode "GRN". These were changed to "the green".
  • 6 street names were "east" with the mode "Lndg". These were changed to "the landing".
  • 1 had an address that Google could not find near the area, and we could not find a fix. It was removed.