In this tutorial, you will learn how to geocode a street, location, and city into latitude and longitude coordinates.
For the sample data, I use a portion of the Metro Manila spatial traffic accident data that is publicly available on GitHub. To keep the tutorial clear, I use only a small portion of the almost one million rows of data.
To use the code in this tutorial, I assume you have installed googlemaps:
pip install googlemaps
Geocoding is the operation of transforming a location, such as a street and a city, into longitude and latitude coordinates on the earth. Once you know the coordinates (long, lat), you gain precision and can do more spatial data analysis, such as computing the risk probability of an accident, and more data visualization, such as a heatmap.
When you go to Google Maps and type a street name and a city, Google Maps displays a marker at the location. For example, I use the first record of the Metro Manila spatial accident data: Central Bicutan, Upper Bicutan, Taguig, which represent the street, the location, and the city.
Notice that the address in the browser contains numbers like .../@14.4889308,121.0517255,15z/.... These are the coordinates of the marker and the zoom level. If you type the coordinates back into Google Maps, it returns the approximate reverse geocoding, that is, the actual location on the map. It is not the exact location, but it is close to it.
Now, if you have a large amount of street, location, and city data, how do you automate the geocoding process? The answer is to use the Google Maps Geocoding API. To use the API, you need to sign up with Google Maps, specify your project, and get the API_KEY. At the time of writing, Google Maps allows 2,500 free requests per day. Beyond that, you pay USD 0.50 per 1,000 additional requests, up to 100,000 requests daily.
Once we have the API_KEY, we can import the necessary modules and set up the Google Maps client object.
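To make the structure of that address concrete, here is a minimal sketch that pulls the latitude, longitude, and zoom level out of the fragment quoted above. The function name parse_map_url and the regular expression are my own illustration, and they assume the address keeps the /@lat,lng,zoomz/ shape shown here, which Google Maps may change.

```python
import re

# Matches the "/@<lat>,<lng>,<zoom>z" part of a Google Maps address.
# This pattern is an assumption based on the fragment shown in the text.
COORD_PATTERN = re.compile(
    r"/@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(\d+(?:\.\d+)?)z"
)


def parse_map_url(url):
    """Return (latitude, longitude, zoom) found in the address, or None."""
    match = COORD_PATTERN.search(url)
    if match is None:
        return None
    lat, lng, zoom = (float(group) for group in match.groups())
    return lat, lng, zoom


# The address fragment from the text, with the elided parts left as they are.
print(parse_map_url(".../@14.4889308,121.0517255,15z/..."))
```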
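As a rough worked example of the pricing quoted above, the sketch below estimates the cost of one day of requests. It assumes the figures in the text still apply and that additional requests are billed pro rata, which may not match current Google pricing. The function name estimate_daily_cost is hypothetical.

```python
FREE_REQUESTS_PER_DAY = 2500     # free daily quota quoted in the text
PRICE_PER_1000_EXTRA = 0.50      # USD per 1,000 additional requests, as quoted
MAX_REQUESTS_PER_DAY = 100_000   # daily ceiling quoted in the text


def estimate_daily_cost(requests_per_day):
    """Estimate the USD cost of one day of requests under the quoted pricing."""
    if requests_per_day > MAX_REQUESTS_PER_DAY:
        raise ValueError("more requests than the quoted daily maximum")
    extra = max(0, requests_per_day - FREE_REQUESTS_PER_DAY)
    # Assumption: additional requests are billed pro rata.
    return extra / 1000 * PRICE_PER_1000_EXTRA


# 250 rows with up to three queries per row (an assumption) stay within the free quota.
print(estimate_daily_cost(250 * 3))
# 10,000 requests in a day means 7,500 paid requests.
print(estimate_daily_cost(10_000))
```

With these numbers, the 250 rows used in this tutorial fit within the free quota, while geocoding the full data set of almost a million rows would have to be spread over many days.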
import pandas as pd
import numpy as np
import googlemaps
import gmaps
API_KEY = 'AI....'
gm = googlemaps.Client(key=API_KEY)
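If you share your notebook, you may prefer not to write the key into the code. One possible approach, sketched below, is to read it from an environment variable. The variable name GOOGLE_MAPS_API_KEY and the helper load_api_key are my own choices for illustration and are not required by the googlemaps package.

```python
import os


def load_api_key(env_name="GOOGLE_MAPS_API_KEY"):
    """Read the API key from an environment variable (the name is hypothetical)."""
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_name} environment variable first")
    return key


# Usage, assuming the variable is set and googlemaps is installed:
# API_KEY = load_api_key()
# gm = googlemaps.Client(key=API_KEY)
```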
Now we read the spatial data. For this tutorial, I use only the first 250 rows out of almost a million rows of the Metro Manila spatial traffic accident data that is publicly available on GitHub.
For your own purposes, you just need to replace the data, keeping the same format and the same variable names.
To display the geocoded locations as a heatmap, go to the next tutorial.
data = pd.read_csv('MetroManilaSpatialAccidentData.csv', encoding = "ISO-8859-1", low_memory=False, index_col = 'Key')
data = data.fillna('') # fill empty entries with ''
print(list(data)) # print Variable Name
data.head() # show some data
Observe that this data has 4 variables and 250 rows.
[maxRow,maxCol]=data.shape
maxRow,maxCol
The following functions geocode the street, location, and city. The algorithm goes as follows: the code first searches for the street, location, and city together. If that is not successful, it searches for the location and the city. If that is still unsuccessful, it searches for the street and the city. If all of these attempts fail, it reports a geocoding failure for that row. If any of the attempts is successful, it appends the result to the lists of latitudes and longitudes, which the function returns at the end.
def Geocode(query):
    # do geocoding
    try:
        geocode_result = gm.geocode(query)[0]
        latitude = geocode_result['geometry']['location']['lat']
        longitude = geocode_result['geometry']['location']['lng']
        return latitude,longitude
    except IndexError:
        return 0
def GeocodeStreetLocationCity(data):
    lat=[] # initialize latitude list
    lng=[] # initialize longitude list
    start = data.index[0] # start from the first data
    end = data.index[maxRow-1] # end at maximum number of row
    for i in range(start,end+1,1): # iterate all rows in the data
        isSuccess=True # initial Boolean flag
        query = data.Street[i] + ' ' + data.Location[i] + ' ' + data.City[i] # try set up our query street-location-city
        result=Geocode(query)
        if result==0: # if not successful,
            query = data.Location[i] + ' ' + data.City[i] # try set up another query location-city
            result=Geocode(query)
            if result==0: # if still not successful,
                query = data.Street[i] + ' ' + data.City[i] # try set up another query street-city
                result=Geocode(query)
                if result==0: # if still not successful (reuse the result, no extra request),
                    isSuccess=False # mark as unsuccessful
                    print(i, 'is failed')
                else:
                    print(i, result)
            else:
                print(i, result)
        else:
            print(i, result)
        if isSuccess==True: # if geocoding is successful,
            # store the results
            lat.append(result[0]) # latitude
            lng.append(result[1]) # longitude
    return lat,lng
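The same fallback idea can also be written as a loop over candidate queries, which avoids the nested if blocks. This is only a sketch of an alternative, not a replacement for the function above: geocode_with_fallback is a hypothetical name, it expects a geocoder that returns None on failure (rather than 0), and fake_geocode is a stand-in with made-up coordinates so the example runs without an API key.

```python
def geocode_with_fallback(street, location, city, geocode):
    """Try progressively simpler queries; return (lat, lng) or None.

    `geocode` is any function that takes a query string and returns
    (lat, lng) on success or None on failure.
    """
    candidates = [
        f"{street} {location} {city}",  # street-location-city
        f"{location} {city}",           # location-city
        f"{street} {city}",             # street-city
    ]
    for query in candidates:
        result = geocode(query)
        if result is not None:
            return result
    return None


def fake_geocode(query):
    """Stand-in geocoder for demonstration; the coordinates are made up."""
    known = {"Upper Bicutan Taguig": (14.5, 121.05)}
    return known.get(query)


print(geocode_with_fallback("Central Bicutan", "Upper Bicutan", "Taguig", fake_geocode))
print(geocode_with_fallback("Nowhere", "Unknown", "Atlantis", fake_geocode))
```

In the first call, the full query fails and the location-city query succeeds. In the second call, all three queries fail and the function returns None.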
Now we call the geocoding function and store the result in two lists named [lat,lng]. Then we put the result into a pandas data frame for further data cleaning.
# call the geocoding function
[lat,lng]=GeocodeStreetLocationCity(data)
# we put the list of latitude,longitude into pandas data frame
df = pd.DataFrame(
    {'latitude': lat,
     'longitude': lng
    })
Notice in the printed geocoding process above that not every geocoding attempt is successful. Even when an attempt is successful, some geocoded locations are clearly outside the boundary of the city and need to be cleaned further.
We need to delete the geocoded locations that are outside the boundary of the city. For that purpose, we first need to define the center of the city and the bounding box coordinates of the city. Then, we create a filter mask to remove the location coordinates that are outside of the city bounding box.
To draw a heat map, we need to know how frequently the event (in this case, an accident) happened in each place, and we store that as the weight.
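The bounding box test itself is simple enough to sketch without pandas. The function in_bounding_box below is a hypothetical helper, and the corner values are made up for illustration. They only mimic the southwest and northeast corners that the geocoding result provides.

```python
def in_bounding_box(lat, lng, southwest, northeast):
    """Return True if (lat, lng) lies strictly inside the box given by two corners."""
    return (southwest["lat"] < lat < northeast["lat"]
            and southwest["lng"] < lng < northeast["lng"])


# Illustrative corners; the numbers are made up for this example.
sw = {"lat": 14.0, "lng": 120.5}
ne = {"lat": 15.0, "lng": 121.5}

points = [(14.49, 121.05), (10.3, 123.9), (14.6, 122.0)]
inside = [p for p in points if in_bounding_box(p[0], p[1], sw, ne)]
print(inside)
```

Only the first point survives: the second is outside in both latitude and longitude, and the third is outside in longitude. This simple comparison assumes the box does not cross the 180° meridian, which holds for a city such as Metro Manila.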
# do geocode for the whole mega city
geocode_result = gm.geocode('Metro Manila')[0] # change the name into your city of interest
# get the center of the city
center_lat=geocode_result['geometry']['location']['lat']
center_lng=geocode_result['geometry']['location']['lng']
print('center=',center_lat,center_lng)
# get the bounding box of the city i.e. Metro Manila
bounding_box = geocode_result['geometry']['bounds']
lat_boundary = [bounding_box['southwest']['lat'], bounding_box['northeast']['lat']]
lng_boundary = [bounding_box['southwest']['lng'], bounding_box['northeast']['lng']]
print('boundary:',lat_boundary,lng_boundary)
# remove the list that is outside of the city bounding box
mask=(df.latitude>lat_boundary[0]) & \
     (df.latitude<lat_boundary[1]) & \
     (df.longitude>lng_boundary[0]) & \
     (df.longitude<lng_boundary[1])
df=df[mask].copy() # copy so that adding a column later does not trigger a pandas warning
# add frequency of accidents in the location
df['weight']=1
# save into csv file
df.to_csv('locations.csv')
print('saved geocoded locations to "locations.csv"')
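The code above gives every row a weight of 1, so a place with several accidents appears as several rows. If you would rather have one row per place with the accident count as its weight, one possible approach is to count the repeated coordinates. The sketch below is an assumption on my part, not part of the original workflow. The name count_weights, the rounding precision, and the sample coordinates are all illustrative.

```python
from collections import Counter


def count_weights(coordinates, precision=5):
    """Count how many events share each (lat, lng) after rounding."""
    rounded = [(round(lat, precision), round(lng, precision))
               for lat, lng in coordinates]
    return Counter(rounded)


# Made-up sample: two events at the same place and one elsewhere.
coords = [(14.48893, 121.05173), (14.48893, 121.05173), (14.55, 121.02)]
for (lat, lng), weight in count_weights(coords).items():
    print(lat, lng, weight)
```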
last update: September 2017
Cite this tutorial as: Teknomo, K. (2017) Automatic Geocoding using Python (http://people.revoledu.com/kardi/tutorial/Python/)
Copyright © 2017 Kardi Teknomo
Permission is granted to share this notebook as long as the copyright notice is intact.