Automatic Geocoding using Python

by Kardi Teknomo

In this tutorial, you will learn how to geocode a street, a location, and a city into latitude and longitude coordinates.

For the sample data, I use a portion of the Metro Manila spatial traffic accident data that is publicly available on GitHub. To keep this tutorial clear, I use only a small number of rows out of almost a million rows of data.

Installation

To use the code in this tutorial, I assume you already have pandas and have installed googlemaps:

pip install googlemaps

Understanding Geocoding using Google Maps

Geocoding is an operation that transforms a location, such as a street and a city, into latitude and longitude coordinates on the earth. Once you know the (latitude, longitude) coordinates, you have more precision and can do more spatial data analysis, such as computing the risk probability of an accident, and more data visualization, such as drawing a heat map.

When you go to Google Maps and type a street name and a city, Google Maps displays an icon at the location. For example, I use the first row of the Metro Manila spatial accident data: Central Bicutan, Upper Bicutan, Taguig, which represent the street, the location, and the city.

[Figure: Geocode1]

Notice that the URL in the address bar contains numbers such as .../@14.4889308,121.0517255,15z/.... These are the latitude and longitude of the displayed location and the zoom level. If you type these coordinates back into Google Maps, it gives you an approximate reverse geocoding, that is, the corresponding place on the map. It is not the exact location, but it is close to the actual one.
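As a small illustration of the URL fragment described above, the sketch below pulls the latitude, longitude, and zoom level out of such an address with a regular expression. The function name `parse_maps_url` and the example URL are made up for this illustration, and the `@lat,lng,zoomz` layout is only what can be observed in the browser, not a documented contract.

```python
import re

# Pattern for the "@latitude,longitude,zoomz" fragment seen in the address bar.
# Assumption: this layout is an observation, not a guaranteed format.
AT_SEGMENT = re.compile(r'@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(\d+(?:\.\d+)?)z')

def parse_maps_url(url):
    """Return (latitude, longitude, zoom) found in the URL, or None if absent."""
    match = AT_SEGMENT.search(url)
    if match is None:
        return None
    lat, lng, zoom = (float(group) for group in match.groups())
    return lat, lng, zoom

# Hypothetical URL built around the fragment quoted in the text.
example_url = 'https://www.google.com/maps/place/example/@14.4889308,121.0517255,15z/data'
print(parse_maps_url(example_url))
```

For a URL without the fragment, the function returns None instead of raising an error.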

[Figure: Geocode2]

Now, if you have a lot of street, location, and city data, how do you automate the geocoding process? The answer is to use the Google Maps Geocoding API. To use the API, you need to sign up with Google Maps, specify your project, and get an API_KEY. At the time of writing (September 2017), Google Maps allows 2,500 free requests per day. Beyond that, you pay $0.50 USD per 1,000 additional requests, up to 100,000 requests daily.
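To get a feeling for what these numbers mean for a large data set, here is a rough estimate written as code. It uses only the figures quoted above; the function names are made up for this sketch, and it assumes that the 100,000 daily limit counts all requests in a day, free ones included.

```python
import math

FREE_PER_DAY = 2500         # free requests per day (figure quoted above)
MAX_PER_DAY = 100000        # daily limit quoted above (assumed to include the free ones)
USD_PER_1000 = 0.50         # price per 1,000 additional requests

def daily_cost(requests):
    """Cost in USD of sending `requests` geocoding requests in one day."""
    if requests > MAX_PER_DAY:
        raise ValueError('more requests than the daily limit allows')
    billable = max(0, requests - FREE_PER_DAY)
    return billable / 1000 * USD_PER_1000

def days_on_free_quota(total_requests):
    """Number of days needed if you never exceed the free daily quota."""
    return math.ceil(total_requests / FREE_PER_DAY)

print(daily_cost(250))              # the 250 sample rows fit in the free quota
print(daily_cost(100000))           # a full day at the daily limit
print(days_on_free_quota(1000000))  # about a million rows, one request per row
```

The fallback strategy used later in this tutorial can send up to three requests per row, so the real number of requests may be higher than the number of rows.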

Automatic Geocoding

Once we have the API_KEY, we can import the necessary modules and set up the googlemaps client object.

In [1]:
import pandas as pd        # data frame handling
import googlemaps          # Google Maps API client

API_KEY = 'AI....' 
gm = googlemaps.Client(key=API_KEY)
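If you share your notebook, you may not want the key to appear in the code. One possible approach, sketched below, is to read it from an environment variable. The variable name GOOGLE_MAPS_API_KEY and the helper `load_api_key` are my own choices for this illustration; they are not something googlemaps requires.

```python
import os

def load_api_key(env_var='GOOGLE_MAPS_API_KEY'):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, '').strip()
    if not key:
        raise RuntimeError('Set the environment variable ' + env_var + ' first')
    return key

# Usage with the client from the cell above:
# gm = googlemaps.Client(key=load_api_key())
```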

Now we read the spatial data. For this tutorial, I use only the first 250 rows out of almost a million rows of the Metro Manila spatial traffic accident data that is publicly available on GitHub.

For your own data, keep the same format and the same variable names.
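Before spending API requests on your own file, you can check that it has the expected layout. The sketch below does this with the standard csv module instead of pandas, only to keep the example self-contained. The column names are the ones read by this notebook (the Key index plus District, City, Street, and Location); the helper `missing_columns` is a made-up name.

```python
import csv
import io

# Columns read by the notebook: 'Key' is the index, the other four are the variables.
REQUIRED_COLUMNS = ['Key', 'District', 'City', 'Street', 'Location']

def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in REQUIRED_COLUMNS if name not in header]

# One-row sample in the expected layout (first row of the tutorial data).
sample = (
    'Key,District,City,Street,Location\n'
    '1,Southern,Taguig,Central Bicutan,Upper Bicutan\n'
)
print(missing_columns(sample))                # nothing is missing
print(missing_columns('Key,City,Street\n'))   # District and Location are missing
```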

To display the geocoded locations as a heat map, go to the next tutorial.

In [2]:
data = pd.read_csv('MetroManilaSpatialAccidentData.csv', encoding = "ISO-8859-1", low_memory=False, index_col = 'Key')
data = data.fillna('')                   # fill empty entries with ''
print(list(data))                        # print Variable Name
data.head()                              # show some data
['District', 'City', 'Street', 'Location']
Out[2]:
     District  City       Street                               Location
Key
1    Southern  Taguig     Central Bicutan                      Upper Bicutan
2    Southern  Makati     Antonio S. Arnaiz Ave. (Pasay Road)  Makati Ave.
3    Central   Quezon     IBP Road                             Kalayaan B
4    Eastern   San Juan   ICA                                  at parking area
5    Southern  Las Piñas  Naga Road                            before Baga Bridge, Tramo

Observe that this data has 4 variables and 250 rows.

In [3]:
[maxRow,maxCol]=data.shape
maxRow,maxCol
Out[3]:
(250, 4)

The following functions geocode the street, location, and city. The algorithm goes as follows: the code first searches for the street, location, and city together. If that is not successful, it searches for the location and the city. If that is still unsuccessful, it searches for the street and the city. If all of these attempts fail, the row is reported as failed and skipped. If any attempt is successful, the latitude and longitude are appended to two lists, which the function returns. Because failed rows are skipped, the returned lists can be shorter than the data and no longer line up with it row by row.

In [4]:
def Geocode(query):
    # do geocoding
    try:
        geocode_result = gm.geocode(query)[0]       
        latitude = geocode_result['geometry']['location']['lat']
        longitude = geocode_result['geometry']['location']['lng']
        return latitude,longitude
    except IndexError:
        return 0
        
def GeocodeStreetLocationCity(data):
    lat=[]                            # initialize latitude list
    lng=[]                            # initialize longitude list
    for i in data.index:              # iterate over all row keys in the data
        isSuccess=True                # initial Boolean flag
        query = data.Street[i] + ' ' + data.Location[i] + ' ' + data.City[i]  # try set up our query street-location-city 
        result=Geocode(query)
        if result==0:         # if not successful,
            query = data.Location[i] + ' ' + data.City[i]                     # try set up another query location-city
            result=Geocode(query)
            if result==0:     # if still not successful,
                query =  data.Street[i] + ' ' + data.City[i]                  # try set up another query street-city
                result=Geocode(query)
                if result==0:         # if still not successful,
                    isSuccess=False                                           # mark as unsuccessful
                    print(i, 'is failed')
                else:
                    print(i, result)
            else:
                print(i, result)
        else:
            print(i, result)
        if isSuccess==True:           # if geocoding is successful,
            # store the results
            lat.append(result[0])     # latitude
            lng.append(result[1])     # longitude
    return lat,lng
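The fallback order in the function above can be tried without any API request by replacing the geocoder with a stand-in. The sketch below is a simplified variant, not the code of this tutorial: `build_queries`, `first_successful`, and `fake_geocode` are made-up names, the lookup table is invented for the illustration, and failure is signalled with None instead of 0.

```python
def build_queries(street, location, city):
    """Queries in the same order as the tutorial: most specific first."""
    return [
        ' '.join([street, location, city]),   # street-location-city
        ' '.join([location, city]),           # location-city
        ' '.join([street, city]),             # street-city
    ]

def first_successful(queries, geocode):
    """Return the first result that is not None, or None if every query fails."""
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
    return None

# Stand-in geocoder: a lookup table instead of the Google Maps client.
# Which query string really produces these coordinates is an assumption.
FAKE_INDEX = {'Upper Bicutan Taguig': (14.4906694, 121.0539203)}

def fake_geocode(query):
    return FAKE_INDEX.get(query)

queries = build_queries('Central Bicutan', 'Upper Bicutan', 'Taguig')
print(first_successful(queries, fake_geocode))
```

Here the first query is not in the table, so the result comes from the second one. Returning None for a failed row, and storing it, is one way to keep the results lined up with the rows of the data.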

Now we call the geocoding function, store the result in two lists named lat and lng, and put them into a pandas data frame for further data cleaning.

In [5]:
# call the geocoding function
[lat,lng]=GeocodeStreetLocationCity(data)

# we put the list of latitude,longitude into pandas data frame
df = pd.DataFrame(
    {'latitude': lat,
     'longitude': lng
    })
1 (14.4906694, 121.0539203)
2 (14.5473703, 121.0266411)
3 (14.6908755, 121.0904757)
4 (18.4646092, -66.1131414)
5 (14.4648525, 120.9848185)
6 (14.6059915, 121.0280523)
7 (14.6031718, 121.0449142)
8 (14.609836, 121.024398)
9 (14.5973214, 121.0416502)
10 (14.600964, 121.050945)
11 (14.6030544, 121.0447426)
12 (18.4643424, -66.1135206)
13 (14.533729, 121.019789)
14 (14.5247621, 121.0130761)
15 (18.4655394, -66.1057355)
16 (14.540074, 121.0167916)
17 (14.5665013, 121.0235176)
18 (14.63895, 121.0415387)
19 (14.6163291, 121.0087231)
20 (14.5195775, 121.0295749)
21 (14.5377516, 121.0013794)
22 (14.7261006, 121.1142911)
23 (14.6679612, 121.0343384)
24 (14.6568349, 121.0258305)
25 (14.6658784, 121.031513)
26 (14.6780511, 121.0320825)
27 (14.666284, 121.021652)
28 (14.6471575, 121.0413032)
29 (14.6926062, 121.0295935)
30 (14.6580629, 121.0244324)
31 (14.7022046, 121.087096)
32 (14.6290209, 121.0398585)
33 (14.6122428, 121.0089184)
34 (14.6893721, 121.0938472)
35 (14.6952349, 121.08659)
36 (14.666911, 121.073833)
37 (14.699944, 121.0655779)
38 (14.7055283, 121.0729646)
39 (14.68479, 121.086693)
40 (14.503567, 120.999019)
41 (14.5053609, 120.9938534)
42 (14.5170069, 120.9943476)
43 (14.466633, 121.0160936)
44 (14.479429, 120.998694)
45 (14.480046, 120.997699)
46 (14.4576786, 121.0340585)
47 (14.483542, 120.993134)
48 (14.5271423, 120.9961197)
49 (14.530833, 120.993663)
50 (14.5229333, 120.9911273)
51 (14.5162994, 121.0020029)
52 (14.4881394, 120.9822995)
53 (14.5053609, 120.9938534)
54 (14.5053609, 120.9938534)
55 (14.5053609, 120.9938534)
56 (14.503567, 120.999019)
57 (14.494603, 120.991926)
58 (14.4666206, 121.0158502)
59 is failed
60 (14.4644577, 121.0171425)
61 (14.4426052, 121.0008785)
62 (14.444546, 120.9938736)
63 (14.444546, 120.9938736)
64 (14.4482601, 120.9858195)
65 (14.4403551, 120.9910369)
66 (14.4363842, 121.0060787)
67 (14.424349, 120.999389)
68 (14.4735281, 120.9794314)
69 (14.4363842, 121.0060787)
70 (14.451903, 120.977878)
71 (14.4525317, 120.929924)
72 (14.444546, 120.9938736)
73 (14.429455, 121.016095)
74 (14.4060165, 121.0467584)
75 (14.453507, 120.986851)
76 (14.390941, 121.044365)
77 (14.4557199, 121.0451145)
78 (14.4174455, 121.0283875)
79 (14.4557199, 121.0451145)
80 (14.5764261, 121.0851562)
81 (14.5508918, 121.0770532)
82 (14.56548, 121.085981)
83 (14.595911, 121.091862)
84 (14.6096664, 121.0948879)
85 (14.601452, 121.092083)
86 (14.5981621, 121.0750718)
87 (14.602702, 121.050722)
88 (14.5854063, 121.0602225)
89 (14.65073, 121.1028546)
90 (14.6438405, 121.1145683)
91 (14.640704, 121.102523)
92 (14.6170874, 121.1354571)
93 (14.664604, 121.108051)
94 (14.6561218, 121.1045501)
95 is failed
96 (14.6170874, 121.1354571)
97 (14.6312221, 121.0821643)
98 (14.6393109, 121.1009351)
99 (14.6136845, 121.3356469)
100 (14.65752, 121.108376)
101 (14.6170874, 121.1354571)
102 (14.6382441, 121.1091564)
103 (14.6690521, 121.1287129)
104 (14.6292307, 121.0782002)
105 (14.643334, 121.1144743)
106 (14.664457, 121.1080278)
107 (14.643243, 121.10931)
108 (14.6413945, 121.1019515)
109 (14.6263181, 121.0991328)
110 (14.6393109, 121.1009351)
111 (14.6295032, 121.1000325)
112 (14.6352916, 121.0909397)
113 (14.636002, 121.095841)
114 (14.6226341, 121.1027126)
115 (14.6522306, 121.1204848)
116 (14.6325299, 121.0996633)
117 (14.6289928, 121.0986635)
118 (14.63314, 121.093097)
119 (14.6585333, 121.1112249)
120 (14.662523, 121.123667)
121 (14.632587, 121.07779)
122 (14.6215909, 121.1019809)
123 (14.653782, 121.118877)
124 (14.6299628, 121.0969868)
125 (14.6593553, 121.1135705)
126 (14.6386321, 121.1254008)
127 (14.6295032, 121.1000325)
128 (14.6501054, 121.1069515)
129 (14.5705268, 120.9845386)
130 (14.6297402, 120.9795193)
131 (14.607676, 120.981138)
132 (14.6285283, 120.9845885)
133 (14.568429, 120.984465)
134 (14.582402, 121.014191)
135 (14.6382145, 121.1226881)
136 (14.4814948, 120.9965911)
137 (14.668831, 120.947793)
138 is failed
139 (14.6561809, 120.9515114)
140 (14.675943, 120.944376)
141 (14.6693621, 120.9603839)
142 (14.8207405, 120.9032569)
143 (14.6578192, 120.9609761)
144 (14.663086, 120.9834167)
145 (14.689373, 120.953979)
146 (14.6736831, 120.9629247)
147 (14.7064227, 120.9937569)
148 (14.6909548, 120.9732872)
149 (14.540998, 121.018749)
150 (14.6734083, 120.9821202)
151 (14.692939, 120.9681965)
152 (14.6853237, 120.9772537)
153 (15.6081782, 120.6042891)
154 (14.6494902, 120.984625)
155 (14.6587001, 120.9839805)
156 (14.6573373, 120.9917185)
157 (14.6573934, 120.9802038)
158 (14.6437934, 120.991254)
159 (14.657606, 120.9961366)
160 (14.6539987, 120.983832)
161 (14.6371921, 120.9723798)
162 (14.656937, 120.997471)
163 (14.7594002, 121.052143)
164 (14.7566441, 121.0579634)
165 (14.7754612, 121.052429)
166 (14.740101, 121.0326872)
167 (14.7684203, 121.080971)
168 (14.5665013, 121.0235176)
169 (14.543346, 121.059857)
170 (14.558961, 121.019497)
171 (14.5567621, 121.0194522)
172 (14.558961, 121.019497)
173 (14.5624374, 121.0277578)
174 (14.570635, 121.023437)
175 (14.5570054, 121.0372487)
176 (14.558961, 121.019497)
177 (14.5582074, 121.022241)
178 (14.537669, 121.0021747)
179 (14.5470862, 120.9984773)
180 (14.548029, 120.987526)
181 (14.5451858, 121.0685089)
182 (14.5447508, 121.0665593)
183 (14.5484822, 121.0676274)
184 (14.5437203, 121.0676331)
185 (48.0509756, -119.9034008)
186 (14.5685317, 121.0281027)
187 (14.559857, 121.016931)
188 (14.5521994, 121.0210759)
189 (14.566206, 121.0125691)
190 (14.574442, 121.011934)
191 (14.547488, 121.040543)
192 (14.5518921, 121.0275105)
193 (14.554729, 121.0244452)
194 (14.5679113, 121.030147)
195 (14.558961, 121.019497)
196 (14.5654707, 121.0474543)
197 (14.561039, 121.025448)
198 (14.553727, 121.015802)
199 (14.5562844, 121.0258626)
200 (14.5668923, 121.0122099)
201 (14.563696, 121.028776)
202 (14.5665013, 121.0235176)
203 (14.564405, 121.029359)
204 (14.5616906, 121.0114426)
205 (14.5566901, 121.0048228)
206 (14.563315, 121.012534)
207 (14.559955, 121.041387)
208 (14.5681196, 121.0293973)
209 (14.5619792, 121.026889)
210 (14.5639569, 121.043231)
211 (14.560454, 121.012098)
212 (14.558961, 121.019497)
213 (14.5597856, 121.0109022)
214 (14.5610709, 121.0289089)
215 (14.5665839, 121.0516121)
216 (14.554729, 121.0244452)
217 (14.5596325, 121.031366)
218 (14.5609722, 121.0224008)
219 (14.558429, 121.0224254)
220 (14.549372, 121.029081)
221 (14.55973, 121.0270887)
222 (14.554729, 121.0244452)
223 (14.566206, 121.0125691)
224 (14.544293, 121.016703)
225 (14.5596325, 121.031366)
226 (14.5576121, 121.0134854)
227 (14.533729, 121.019789)
228 (14.5554563, 121.0021878)
229 (14.559436, 121.008263)
230 (14.5561004, 121.004504)
231 (14.5524156, 121.0276229)
232 (14.533729, 121.019789)
233 (14.5473703, 121.0266411)
234 (14.5491802, 121.0279696)
235 (14.5571584, 121.0085076)
236 (14.5584848, 121.0144262)
237 (14.5610709, 121.0289089)
238 (14.5401512, 121.0200663)
239 (14.5615986, 121.0230249)
240 (14.5504311, 121.0150133)
241 (14.5546141, 121.0239416)
242 (14.556689, 121.020681)
243 (14.5558142, 121.0228987)
244 (14.569501, 121.01641)
245 (14.557619, 121.013325)
246 (14.5647074, 121.0446476)
247 (14.5587858, 121.0269604)
248 (14.5462633, 121.0648054)
249 (14.476603, 120.99994)
250 (14.5261728, 120.9936322)

Spatial Data Cleaning

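Look at the printed geocoding output above and you can see that not every geocoding attempt is successful (rows 59, 95, and 138 failed). Even when it is successful, some geocoded locations are clearly outside the boundary of the city and need to be cleaned further. For example, row 4 is geocoded to (18.4646092, -66.1131414) and row 185 to (48.0509756, -119.9034008), both far away from Metro Manila at about (14.6, 121.0).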

We need to delete the geocoded locations that are outside the boundary of the city. For that purpose, we first need the center of the city and the bounding box coordinates of the city. Then we create a filter mask to remove the coordinates that fall outside the city bounding box.
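The bounding box test itself is simple enough to sketch without pandas and without an API request. In the example below, `city_result` is a hand-written dictionary shaped like one geocoding result, filled with the center and boundary values that this notebook prints for Metro Manila. The helpers `bounding_box` and `inside` are made-up names. The fallback to 'viewport' is a precaution on my part, on the understanding that the 'bounds' entry is not returned for every result.

```python
def bounding_box(result):
    """Return (south, west, north, east) from one geocoding result."""
    geometry = result['geometry']
    # Prefer 'bounds'; fall back to 'viewport' if 'bounds' is absent (assumption).
    box = geometry.get('bounds') or geometry['viewport']
    return (box['southwest']['lat'], box['southwest']['lng'],
            box['northeast']['lat'], box['northeast']['lng'])

def inside(lat, lng, box):
    """True if the point lies strictly inside the bounding box."""
    south, west, north, east = box
    return south < lat < north and west < lng < east

# Hand-written stand-in for gm.geocode('Metro Manila')[0].
city_result = {
    'geometry': {
        'location': {'lat': 14.6090537, 'lng': 121.0222565},
        'bounds': {
            'southwest': {'lat': 14.3480961, 'lng': 120.9062111},
            'northeast': {'lat': 14.7874961, 'lng': 121.1350759},
        },
    }
}

# Rows 1, 4 and 185 of the printed geocoding output.
points = [(14.4906694, 121.0539203), (18.4646092, -66.1131414), (48.0509756, -119.9034008)]
box = bounding_box(city_result)
kept = [point for point in points if inside(point[0], point[1], box)]
print(kept)
```

Only the first point survives the filter. Note that a rectangular box is a coarse boundary: a point can be inside the box and still be outside the city.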

To draw a heat map, we need a weight that says how frequently the event (in this case, an accident) happened in one place. Here each geocoded row gets a weight of 1, so a place with several accidents appears as several rows.

In [6]:
# do geocode for the whole mega city
geocode_result = gm.geocode('Metro Manila')[0]  # change the name into your city of interest

# get the center of the city
center_lat=geocode_result['geometry']['location']['lat']
center_lng=geocode_result['geometry']['location']['lng']
print('center=',center_lat,center_lng)

# get the bounding box of the city i.e. Metro Manila
bounding_box = geocode_result['geometry']['bounds']
lat_boundary = [bounding_box['southwest']['lat'], bounding_box['northeast']['lat']]
lng_boundary = [bounding_box['southwest']['lng'], bounding_box['northeast']['lng']]
print('boundary:',lat_boundary,lng_boundary)

# remove the list that is outside of the city bounding box
mask=(df.latitude>lat_boundary[0])  & \
     (df.latitude<lat_boundary[1])  & \
     (df.longitude>lng_boundary[0]) & \
     (df.longitude<lng_boundary[1])
df=df[mask].copy()   # copy, so that adding a column below does not write into a slice of the old data frame

# add frequency of accidents in the location
df['weight']=1

# save into csv file
df.to_csv('locations.csv')
print('saved geocoded locations to "locations.csv"')
center= 14.6090537 121.0222565
boundary: [14.3480961, 14.7874961] [120.9062111, 121.1350759]
saved geocoded locations to "locations.csv"
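The notebook gives every row a weight of 1, so repeated places are counted through repeated rows. An alternative, sketched below and not used in this tutorial, is to collapse identical coordinates into one row whose weight is the number of occurrences. The helper name `weighted_locations` is made up, and the points are rows 53 to 57 of the printed geocoding output.

```python
from collections import Counter

# Rows 53 to 57 of the printed geocoding output; the first three are identical.
points = [
    (14.5053609, 120.9938534),
    (14.5053609, 120.9938534),
    (14.5053609, 120.9938534),
    (14.503567, 120.999019),
    (14.494603, 120.991926),
]

def weighted_locations(points):
    """Collapse identical (lat, lng) pairs into (lat, lng, weight) rows."""
    counts = Counter(points)   # counts exact repeats only
    return [(lat, lng, weight) for (lat, lng), weight in counts.items()]

for row in weighted_locations(points):
    print(row)
```

This only merges coordinates that are exactly equal. Accidents that happened a few meters apart still end up as separate rows.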

last update: September 2017

Cite this tutorial as: Teknomo, K. (2017) Automatic Geocoding using Python (http://people.revoledu.com/kardi/tutorial/Python/)

See Also:

Visit www.Revoledu.com for more tutorials in Data Science

Copyright © 2017 Kardi Teknomo

Permission is granted to share this notebook as long as the copyright notice is intact.