Using the MaxMind GeoLite City data with MySQL

Last updated 2009-07-21
Introduction
The database schema
Efficient search
References

Abstract

This is a brief tutorial on using the MaxMind GeoLite City data for geolocation, using a more efficient query to search on the database than the one suggested by MaxMind.

Introduction

MaxMind provides data to do geolocation of IP addresses. (See http://www.maxmind.com/app/ip-locate . The information provided by MaxMind on (http://www.maxmind.com/app/csv) only applies to GeoLite Country . This article discusses the use of GeoLite City , which is more accurate and not well documented.

They provide the data, both in a binary format, accessed through an API, and a CSV (comma-separated-values) bulk of data. Maxmind recommends using the binary format, but this is not feasible in some circumstances:

The database schema

When using the MaxMind CSV files, you will find they are just data, there are no database schemas to fit the data on the database. This makes them suitable to be included in any database system.

The CSV files you should download are the ones provided on http://www.maxmind.com/app/geolitecity

Following the link Download the latest GeoLite City CSV Format, you will download an archive of the form GeoLiteCity_YYYYMMDD.zip This archive contains two files, GeoLiteCity-Blocks.csv, and GeoLiteCity-Location.csv

The first one is a list of IP addresses intervals, and a location id (locId), linking them to the Location table The table schema can be as follows:

	(Logging in mysql as root)
	CREATE DATABASE geoip;
	USE geoip;
	CREATE TABLE blocks (
		startIpNum int(10) unsigned NOT NULL,
		endIpNum int(10) unsigned NOT NULL,
		locId int(10) unsigned NOT NULL,
		PRIMARY KEY (endIpNum)
	);
   

The second file is a list of locations, holding information about the location.

	DROP TABLE IF EXISTS location;
	CREATE TABLE location(
		locId int(10) unsigned NOT NULL,
		country char(2) NOT NULL,
		region char(2) NOT NULL,
		city varchar(50),
		postalCode char(5) NOT NULL,
		latitude float,
		longitude float,
		dmaCode integer,
		areaCode integer,
		PRIMARY KEY (locId)
	);
   

To import the data into the two tables, we use the LOAD DATA INFILE command, using the local csv files we have unzipped. Assuming you have unpacked them in /tmp:

	LOAD DATA LOCAL INFILE '/tmp/GeoLiteCity-Blocks.csv'
		INTO TABLE geoip.blocks
		FIELDS TERMINATED BY ','
		OPTIONALLY ENCLOSED BY '\"'
		LINES TERMINATED BY '\n';
	LOAD DATA LOCAL INFILE '/tmp/GeoLiteCity-Location.csv'
		INTO TABLE geoip.location
		FIELDS TERMINATED BY ','
		OPTIONALLY ENCLOSED BY '\"'
		LINES TERMINATED BY '\n';
   

To update the data, you should block the tables with, as the geolocating program could be reading them at the same time, and delete the data, with:

   	LOCK TABLES blocks READ, location READ;
	DELETE from blocks;
	DELETE from location;
   

After creating the tables, you will want to add a user with read-only permissions to this tables, with GRANT READ on geoip.* TO [user] identified by [password] Check the MySQL manual for details.

Efficient search

In the the MaxMind help page the query recommended is:

    SELECT ip_country FROM geoip 
      WHERE [ip_num] >= begin_ip_num
      AND [ip_num] <= end_ip_num
   

The [ip_num] is calculated by converting the four bytes of the IP address to a decimal number;

This is recommended for the GeoLite Country database. There is no information about GeoLite City. If we translate this to our setup, we can do:

    SELECT l.country,l.region,l.city 
      FROM location l JOIN blocks b ON (l.locId=b.locId)
      WHERE [ip_num] >= b.beginIpNum
      AND [ip_num] <= b.endIpNum;
   

When trying to optimize this query, the first thing we find is that two fields are used in the query, and that MySQL uses only one key per query. This is, one of the two values will be quickly found, but the other has to be scanned and compared without sorting.

This is far from optimal, and the GeoIP API is faster than this query by two orders of magnitude. So, you are asking by now, Is there a way to use only one key in the query? Yes, there is.

Note that this query searches using the two limits as if the same IP address could be contained in two different intervals. But the intervals do not overlap. So...

If we search for the first occurrence of endIpNum >= [ip_num], we are using only one key, the endIpNum that I used before as PRIMARY KEY strategically, and the result is exactly the same:

     SELECT l.country,l.region,l.city
       FROM location l JOIN blocks b ON (l.locId=b.locId)
       WHERE b.endIpNum >= %s order by b.endIpNum limit 1;
   

This way is more efficient than searching using BETWEEN with two indexes.

References

Pablo Martin