High-volume reverse geocoding with Nominatim

A recent project required us to map a large number of GPS coordinates to their respective municipality names. This process, known as reverse geocoding, involved an initial dataset of over 3 million coordinate pairs from a few specific regions in Switzerland and its adjacent border areas, with about a thousand new pairs added daily.

Evaluating geocoding solutions: cost vs. control

When dealing with this volume of data, the choice of a geocoding service has significant implications.

Commercial services offer convenience and are simple to integrate. However, the costs can be substantial. Based on the public pricing calculator of a major mapping provider, our 3 million lookups would cost approximately $6,500, with ongoing monthly costs of around $100 for the daily additions. Furthermore, licensing terms often place restrictions on the permanent storage and reuse of the results.

Free services, such as the one offered by the Swiss government, are valuable but might not be a complete fit if your data, like in our case, includes locations in neighboring countries.

This led us to explore Nominatim, an open-source geocoding engine that uses OpenStreetMap data. While Nominatim provides a free public API, its usage policy is designed for occasional, non-bulk use, with a limit of one request per second. Attempting to process a large dataset through the public API would be slow (at one request per second, 3 million lookups would take over a month) and would not align with their fair use policy.

The most viable path forward was to follow Nominatim’s recommendation for power users: host our own instance.

A self-hosted Nominatim instance with a smart caching strategy

By hosting our own Nominatim server, we can address the core requirements of our scenario: processing a large volume of requests without rate limits and having unrestricted rights to store and reuse the resulting data.

Because our data is geographically concentrated around known regions in Switzerland, we could implement a “smart cache” to dramatically reduce the number of lookups required.

  1. Pre-computation and Caching: Instead of processing every coordinate individually, we created a comprehensive grid of points to cover the main areas of interest. These points were rounded to three decimal places, which added a mean error of approximately 36 meters, a fair trade-off between accuracy on the municipality level and cache size. We then combined this pre-calculated grid with any unique, rounded coordinates from our dataset that fell outside these primary zones. This approach reduced the initial 3 million coordinates to a set of approximately 200,000 unique points to be geocoded. We then ran this set through our local Nominatim server and stored the results in a cache.
  2. Optimized Daily Lookups: For the thousand new coordinates added each day, we apply the same rounding logic. The vast majority of these points, once rounded, will already be in our pre-populated cache. For the few that are truly new, we query the public Nominatim server and add the result to the cache, continuously improving its coverage.
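The daily lookup flow can be sketched as a thin wrapper around any reverse-geocoding function. This is an illustrative sketch, not part of Nominatim: the `make_cached_lookup` helper and the in-memory dictionary stand in for whatever persistent cache backend is used in production.

```python
def make_cached_lookup(geocode, digits=3):
    """Wrap a reverse-geocoding function with a coordinate-rounding cache.

    `geocode` is any callable (lat, lon) -> (city, country). Cache keys are
    coordinates rounded to `digits` decimal places, so nearby points share
    a single lookup.
    """
    cache = {}

    def lookup(lat, lon):
        key = (round(lat, digits), round(lon, digits))
        if key not in cache:  # only truly new points hit the geocoder
            cache[key] = geocode(*key)
        return cache[key]

    return lookup
```

In production, the cache would be pre-populated with the grid results and persisted between runs, but the round-then-look-up logic stays the same.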

This pre-computation and rounding strategy is the key to handling a large volume of data efficiently.
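As a quick sanity check on the roughly 36-meter figure, we can estimate the mean rounding error by simulation, assuming that at Swiss latitudes a 0.001° step corresponds to about 111 m of latitude and about 75 m of longitude:

```python
import math
import random

random.seed(0)

# Rounding to three decimals shifts a point by at most half a grid step
# in each direction: ~55.5 m in latitude, ~37.5 m in longitude.
half_lat_m = 111.0 / 2
half_lon_m = 75.0 / 2

n = 200_000
total = 0.0
for _ in range(n):
    dy = random.uniform(-half_lat_m, half_lat_m)
    dx = random.uniform(-half_lon_m, half_lon_m)
    total += math.hypot(dx, dy)

mean_error_m = total / n
print(f"mean rounding error: {mean_error_m:.1f} m")  # ≈ 36 m
```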

Setting up a local Nominatim server

The official project provides detailed installation instructions. The process is well-documented for various Linux distributions.

As my development machine runs Windows, I used the Windows Subsystem for Linux (WSL) to set up an appropriate environment: wsl --install --distribution Ubuntu-24.04 --name Nominatim

A full Nominatim installation with worldwide data requires nearly 1 TB of storage. However, our tutorial’s scope is limited. We can use regional data extracts from Geofabrik to significantly reduce the storage and processing footprint. In the following, we’ll focus on the Swiss town of Baden, which lies close to the German state of Baden-Württemberg. Therefore, we will need data for both regions to ensure complete coverage. We can import the relevant files:

nominatim import --osm-file switzerland-latest.osm.pbf --osm-file baden-wuerttemberg-latest.osm.pbf

After the import, we can start the server by running nominatim serve.

Verifying the local instance

To ensure our local server produces the same results as the public API, we can run a test query for an office location in Baden. We’ll request the municipality (zoom=10), specify the language (accept-language=de-CH), and use the stable geocodejson format.

Query against the public API:

$ curl "https://nominatim.openstreetmap.org/reverse?format=geocodejson&lat=47.4798817&lon=8.3052468&zoom=10&accept-language=de-CH"
# Returns "name":"Baden"

Query against our local server:

$ curl "http://localhost:8088/reverse?format=geocodejson&lat=47.4798817&lon=8.3052468&zoom=10&accept-language=de-CH"
# Also returns "name":"Baden"

The results are identical. Our local instance is functioning correctly, ready to handle requests without external dependencies or rate limits.

Grid scan around Baden

To showcase the performance, we’ll run a practical test: a reverse geocoding grid scan of the area around Baden. We can write a simple Python script to query 40,000 points in a 200 × 200 grid covering the region. For each point, we save the municipality and country name in a comma-separated values file.

import requests
import numpy as np

# url = "https://nominatim.openstreetmap.org/reverse"  # public API (rate-limited)
url = "http://localhost:8088/reverse"  # our local instance

def get_city_and_country(lat: float, lon: float):
    """Reverse-geocode a coordinate pair to (municipality, country)."""
    params = {
        'format': 'geocodejson',
        'zoom': 10,                  # municipality level
        'accept-language': 'de-CH',
        'lat': lat,
        'lon': lon,
    }

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    geocoding = data['features'][0]['properties']['geocoding']
    return geocoding['name'], geocoding['country']

with open("baden.csv", 'w') as f:
    f.write('latitude,longitude,city,country\n')
    # 200 x 200 grid with 0.001° spacing covering the Baden region
    for lon in np.arange(8.205, 8.405, 0.001):
        for lat in np.arange(47.380, 47.580, 0.001):
            city, country = get_city_and_country(lat=lat, lon=lon)
            f.write(f"{lat:.3f},{lon:.3f},{city},{country}\n")

$ time python baden.py

real    3m29.410s
user    0m34.029s
sys     0m4.458s

The test completes in about 3.5 minutes on a standard laptop, which works out to roughly 191 requests per second. At this rate, processing the 200,000 unique coordinates for our cache would take less than 20 minutes. This real-world performance confirms the viability of the approach.

Result

The script above produces a data file with one row per grid point.

Let’s visualize the coordinates on a map. We can see the individual data points around our office:

A visualization of coordinates in a rectangular grid over a city map of Baden.
Figure 1 Individual data points around our Baden office.

The grid density of 0.001° by 0.001° is sufficient to identify the municipality. For street- or house-level identification, a finer grid would be necessary. If we zoom out, we can see the extent of the data we covered:

Data points in a grid over a map of Baden
Figure 2 All data points around Baden. Due to the high density, the individual data points are not visible and appear as a rectangle.

The 200 × 200 dot grid appears oblong rather than square because Switzerland is not on the equator: in Switzerland, a step of 0.001° in longitude corresponds to a distance of only about 75 m, while a step of 0.001° in latitude corresponds to about 111 m (as it does everywhere in the world).
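Both figures follow directly from the geometry of the coordinate grid. A short computation, using the common approximation of roughly 111,320 m per degree of latitude, reproduces them:

```python
import math

M_PER_DEG_LAT = 111_320  # metres per degree of latitude (roughly constant)

def step_in_metres(lat_deg, step_deg=0.001):
    """Ground distance of one longitude and one latitude grid step at a given latitude."""
    lon_m = M_PER_DEG_LAT * math.cos(math.radians(lat_deg)) * step_deg
    lat_m = M_PER_DEG_LAT * step_deg
    return lon_m, lat_m

lon_m, lat_m = step_in_metres(47.5)
print(f"0.001° lon ≈ {lon_m:.0f} m, 0.001° lat ≈ {lat_m:.0f} m")  # 75 m and 111 m
```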

Conclusion

We demonstrated that for tasks involving high-volume reverse geocoding, commercial services are not the only option. Hosting a private Nominatim instance presents a practical and cost-effective alternative.

The initial setup requires some technical effort, but combining it with a smart caching strategy like coordinate rounding can dramatically reduce the computational workload. This approach offers full control over performance, no rate limits, and unrestricted use of the geocoded data. We are grateful to the OpenStreetMap and Nominatim communities for providing the powerful open-source tools that make this possible.

40 digits: You are optimistic about our understanding of the nature of distance itself.
“Coordinate Precision” (Comic #2170) by Randall Munroe, licensed under CC BY-NC 2.5.