If you know what geocoding is and want to start building your own geocoding databases, I will show you how to build geocoding databases in this article.
Article Contents
Build geocoding databases
To make a geocoding database, you will need a data source. You can use data provided by Geonames, or search for any source that fits your requirements.
I will use http://download.geonames.org/export/zip/ to make a database.
The readme file shows that each country zipped in a compressed file in format COUNTRY_CODE.zip
.
Inside the ZIP package, there is a CSV file with TXT extension that contains everything we need for a geocoding database (well, most of).
The CSV file uses t
as separator character and has following fields,
country code : iso country code, 2 characters
postal code : varchar(20)
place name : varchar(180)
admin name1 : 1. order subdivision (state) varchar(100)
admin code1 : 1. order subdivision (state) varchar(20)
admin name2 : 2. order subdivision (county/province) varchar(100)
admin code2 : 2. order subdivision (county/province) varchar(20)
admin name3 : 3. order subdivision (community) varchar(100)
admin code3 : 3. order subdivision (community) varchar(20)
latitude : estimated latitude (wgs84)
longitude : estimated longitude (wgs84)
accuracy : accuracy of lat/lng from 1=estimated to 6=centroid
Up to this point, I think you should know how to do it already.
These are the steps:
- Download the ZIP package.
- Decompressed to get CSV file.
- Parse the CSV file (inc. filter if necessary)
- Import into database (ex. MySQL, Postgresql, Sqlite..)
Those can be done in a simple Python script, or any tech that you’re good at.
Step 1: Download the ZIP package
This can be done easily in Python.
import wget
wget.download(FILE_URL, FILE_ZIP, bar=bar_progress)
The wget module provides a pretty handy progress bar for tracking current download progress.
Step 2: Extract the ZIP package
Now, we need to extract the compressed file to get the CSV file.
With zipfile module, it cannot get easier.
from zipfile import ZipFile
z = ZipFile('US.zip', 'r')
z.extractall()
z.close()
After that, you will see two files in current directory, US.txt
and readme.txt
.
Step 3: Parse the CSV file
By using pandas, we can parse this CSV file into a 2-dimensional DataFrame and it is easy to manipulate. Read CSV file using pandas is a common task in data science.
import pandas
df = pandas.read_csv('data/US.txt', delimiter='\t')
Because the data from Geonames uses tab \t
as separator, we need to specify it explicitly.
Moreover, it is better to assign name to columns:
df = pd.read_csv('./data/US.txt', sep='\t', names=['country', 'postal_code', 'name', 'state_name', 'state_code',
'county_name', 'county_code', 'comm_name', 'comm_code',
'latitude', 'longitude', 'accuracy'])
You can also filter DataFrame as you want. For example, you might not need to use comm_name
, comm_code
and accuracy
at all, you can drop columns from DataFrame in pandas using just one line of code.
df.drop(['comm_name', 'comm_code', 'accuracy'], axis=1)
Step 4: Import into databases
After filtering and having nice dataset to use, we need to put them into a database. We will use MySQL in this tutorial.
The quickest solution is to use SQLAlchemy to handle importing process, to move a DataFrame into a MySQL table.
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://root:@localhost:3306/')
df.to_sql('geocodes', con=engine, if_exists='replace', schema='geo_db', index_label='id', chunksize=100)
The noticable parameter using here is chunksize
.
Because the data in DataFrame might be too large and take a long time to import into database through single commit, we use chunksize
to put data into table so it can be faster.
Conclusion
Well done, you have a geocoding database that is ready to use now. The process is simple and easy, any programmer can do this.
If you need references, check out my demo code here, it might give you some ideas, https://github.com/petehouston/geonames-db