Geospatial Indexing
Finding every restaurant within 2 km, or the nearest available driver, sounds simple — but with millions of rows a naïve SELECT … WHERE distance(lat, lng, …) < 2 scans every row. Geospatial indexes solve this by translating two-dimensional coordinates into structures a database can prune efficiently, reducing a full table scan to a small set of candidates.
Why Spatial Indexing
Standard B-tree indexes work on a single ordered dimension. Latitude and longitude form a 2-D plane — a B-tree on lat alone still forces a scan over all longitudes that match, and vice versa. Spatial indexes solve this by partitioning the 2-D space itself so a query like "find all points within this bounding box" can skip entire regions without reading them.
Core operations
- Nearest-neighbour (k-NN) — find the k closest points to a query location.
- Range query — find all points inside a circle or bounding box.
- Contains / intersects — does a polygon contain a point, or do two polygons overlap?
Geohashing
Geohashing maps a (latitude, longitude) pair to a short alphanumeric string by interleaving the binary representations of the two coordinates and then base-32 encoding the result. The key insight is that a shared prefix means geographic proximity — two hashes that start with u09t are in the same cell, and cells with longer shared prefixes are smaller and closer together.
How encoding works
- Divide the latitude range [−90, 90] in half repeatedly, encoding each bisection as a bit (0 = left half, 1 = right half). Do the same for longitude [−180, 180].
- Interleave the longitude bits and latitude bits:
lon₁ lat₁ lon₂ lat₂ … - Group the resulting bits into 5-bit chunks and map each to one of the 32 base-32 characters (
0–9,b–zexcludinga,i,l,o).
Precision scales with length — each extra character roughly halves the cell size in both dimensions:
4 chars≈ 39 × 20 km6 chars≈ 1.2 km × 0.6 km8 chars≈ 38 × 19 m12 chars≈ centimetre precision
Proximity search
To find all points near a location, compute the query point's geohash and expand the search to the 8 neighbouring cells — always 9 cells total (the target cell + 8 neighbours). This is necessary because two points right next to each other can land in different cells if they straddle a cell boundary.
- Hash the query location to precision p.
- Compute all 8 neighbours using geohash library functions.
- Query the index for rows whose geohash column starts with any of the 9 prefixes.
- Post-filter by exact distance to discard corner cases that fall outside the radius.
Strengths and limitations
- Simple to store — a geohash is a regular string, indexable with any B-tree. Works in any database with a
LIKE 'u09t%'or prefix query. - Easy to cache — the hash of a cell is deterministic; results can be cached at cell granularity. Used by Redis
GEOADD/GEODIST. - Edge discontinuities — the Hilbert-like curve has "wrap-around" edges (the equator, prime meridian, poles) where geographically close points can have very different hashes. The 9-cell search mitigates but does not eliminate this.
- Rectangles, not circles — cells are rectangles; a circle query still requires a post-filter step.
Quadtrees
A Quadtree is a tree in which every internal node has exactly four children, each representing one quadrant (NW, NE, SW, SE) of the parent's bounding box. The tree recursively subdivides 2-D space until each leaf cell satisfies a capacity constraint (e.g. ≤ 100 points).
Variants
- Point Quadtree — each node stores one point and partitions the space at that point's coordinates. Simple but unbalanced in skewed data.
- Region Quadtree (PR Quadtree) — space is split into equal quadrants regardless of point positions, down to a fixed minimum cell size or point density. More predictable depth. Used in most geospatial systems.
- Compressed / Loose Quadtree — skips levels where only one child is occupied, reducing tree height for sparse data.
Building a Region Quadtree
- Start with the root cell covering the full map bounds.
- If the cell contains more than k points, split it into four equal quadrants and redistribute the points.
- Repeat recursively until all leaves satisfy the capacity constraint or the minimum cell size is reached.
Range and nearest-neighbour queries
Range query: traverse the tree top-down. At each node, test whether the node's bounding box intersects the query region — if not, prune the entire subtree. If it does, recurse into children. Only leaves are checked for individual points.
k-NN query: use a priority queue ordered by the minimum possible distance from the query point to each cell's bounding box. Pop the closest candidate (cell or point), expand cells, collect points, and stop when the queue's top entry is farther than the k-th nearest point found so far.
Use cases
- In-memory spatial partitioning for game engines and map renderers.
- Dynamic segmentation of ride-share driver locations (Uber, Lyft).
- Collision detection and viewport culling.
R-Trees
An R-Tree generalises the B-tree to multiple dimensions. Each node stores a set of minimum bounding rectangles (MBRs) — the smallest axis-aligned rectangle that encloses all entries in a subtree. Unlike Quadtrees, R-Trees are disk-based, balanced, and designed for large datasets.
Structure
- Leaf nodes hold the actual spatial objects (points, lines, polygons) along with their MBRs.
- Internal nodes hold the MBR of all entries in their subtree and pointers to child nodes.
- The tree is always height-balanced; all leaves are at the same depth (just like B+Trees).
Insertion and splits
Inserting a new entry walks the tree choosing the child whose MBR needs the least enlargement to accommodate the new entry. If a node overflows (exceeds M entries), it is split into two nodes. The most common split algorithms are:
- Quadratic split (Guttman 1984) — finds the pair of entries that waste the most area if grouped together and builds two groups from there. O(M²).
- Linear split — approximates the above in O(M) but produces lower quality splits.
- R*-Tree — re-inserts overflowing entries before splitting, producing tighter MBRs and better query performance. The de-facto standard in modern systems.
Querying
R-Tree search is identical in spirit to Quadtree search: test each node's MBR against the query window and prune non-overlapping branches. Because MBRs can overlap between sibling nodes (unlike Quadtrees which partition space disjointly), multiple branches may need to be explored, but the pruning is still highly effective in practice.
R-Trees are the index behind PostGIS, SQLite's SpatiaLite, MySQL's spatial indexes, and Oracle Spatial.
Comparison
| Property | Geohash | Quadtree | R-Tree |
|---|---|---|---|
| Data type | Points only | Points / regions | Points, lines, polygons |
| Storage | String column in any DB | In-memory tree / custom store | On-disk balanced tree |
| Edge discontinuities | Yes (wrap-around artefacts) | No | No |
| Dynamic updates | Easy (just re-hash) | Moderate (may re-split cells) | Good (B-Tree-style rebalance) |
| Polygon / shape queries | Approximate only | Limited | Native |
| Typical usage | Redis, ride-share proximity | In-process / game engines | PostGIS, MySQL, Oracle Spatial |
Spatial Queries
Radius / circle search
Find all points within distance r of (lat, lng). Strategy with a spatial index:
- Compute a bounding box that circumscribes the circle (min/max lat and lng offsets).
- Query the index for all points inside the bounding box.
- Post-filter with the exact Haversine or Euclidean distance formula.
k-Nearest Neighbours
Most spatial databases implement k-NN as a first-class operator:
- PostGIS:
ORDER BY geom <-> ST_MakePoint(lng, lat) LIMIT k - MySQL:
ST_Distance_Spherewith an R-Tree index on aPOINTcolumn. - Redis:
GEORADIUS/GEOSEARCH(geohash-backed).
Bounding-box query
The fastest geospatial query — filter rows whose coordinates fall inside a rectangle defined by (min_lat, min_lng, max_lat, max_lng). Directly maps to an index range scan on two dimensions, so it is used as the first-pass filter before any shape-based refinement.
Point-in-polygon
Determine whether a coordinate falls inside a geofence (e.g. a delivery zone, a country border). The standard algorithm is ray-casting: cast a horizontal ray from the point and count how many polygon edges it crosses — an odd count means inside. Spatial databases expose this as ST_Within or ST_Contains and use the R-Tree to find candidate polygons before running the precise test.
Coordinate systems and the Haversine formula
Earth is not flat — distances between GPS coordinates must account for curvature. The Haversine formula computes the great-circle distance between two points on a sphere:
- Convert lat/lng differences to radians.
a = sin²(Δlat/2) + cos(lat₁)·cos(lat₂)·sin²(Δlng/2)distance = 2R · arcsin(√a), where R ≈ 6 371 km.
Haversine has ~0.5% error near the poles due to Earth's oblateness. For centimetre precision use the Vincenty formula or a projected coordinate system (e.g. UTM).
Spatial Databases
PostGIS (PostgreSQL extension)
The most feature-complete open-source spatial database. Adds geometry and geography types, hundreds of spatial functions (ST_Distance, ST_Intersects,ST_Buffer, …), and a GiST-based R-Tree index. Use the geographytype (spherical) for GPS coordinates and geometry (planar) for local projections.
Redis GEO
Redis stores geospatial members in a Sorted Set where the score is the 52-bit geohash of the coordinate. Commands:
GEOADD key lng lat member— add a location.GEODIST key m1 m2 [unit]— great-circle distance between two members.GEOSEARCH key FROMMEMBER m BYRADIUS r km ASC COUNT k— k-NN within radius.
Redis GEO is ideal for real-time use cases (driver locations, delivery tracking) because writes and reads are O(log N) in-memory operations. Persistence and replication are inherited from Redis.
Elasticsearch Geo
Elasticsearch supports geo_point and geo_shape field types backed by a BKD-tree (a multi-dimensional variant of k-d trees). It is well-suited for combining full-text search with proximity filters — e.g. "coffee shops near me that mention oat milk".
MySQL / MariaDB
Both support the POINT, LINESTRING, and POLYGON types with an R-Tree spatial index via CREATE SPATIAL INDEX. Coverage of spatial functions is narrower than PostGIS but sufficient for basic proximity queries.
Choosing the right tool
- Real-time point tracking (drivers, devices) — Redis GEO. Sub-millisecond writes; geohash-backed sorted set handles millions of members.
- Complex geofences and shape queries — PostGIS. Full spatial SQL, coordinate reprojection, topology operations.
- Search + proximity hybrid — Elasticsearch with
geo_point. - Embedded / mobile — SQLite + SpatiaLite, which exposes R-Tree indexes and a subset of spatial functions.