SLAM: Simultaneous Localisation and Mapping

4 minute read

Published:

TL;DR: SLAM is the problem of building a map of an unknown environment while simultaneously tracking the robot's location within it. Classical approaches use EKF or particle filters; modern systems use factor graphs (g2o, GTSAM) for globally consistent optimisation. ORB-SLAM3 represents the feature-based state of the art, while neural SLAM systems like iMAP use implicit neural representations for photorealistic mapping.
SLAM in robot navigation
Navigation and localisation in learned robot systems (Brohan et al., 2022)

The SLAM Problem

SLAM (Simultaneous Localisation and Mapping) is one of the most fundamental problems in mobile robotics. The challenge is circular: accurate localisation requires a map of the environment, but building an accurate map requires knowing your location.

Formally, SLAM estimates the joint posterior:

$$p(\mathbf{x}_{1:t}, \mathbf{m} \mid \mathbf{z}_{1:t}, \mathbf{u}_{1:t})$$

where \(\mathbf{x}_{1:t}\) is the robot trajectory, \(\mathbf{m}\) is the map, \(\mathbf{z}_{1:t}\) are observations (laser scans, images), and \(\mathbf{u}_{1:t}\) are control inputs (odometry). The map can be represented as a set of landmarks \(\mathbf{m} = \{l_1, \dots, l_K\}\) in feature-based SLAM, or as an occupancy grid in metric SLAM.

EKF-SLAM

Extended Kalman Filter SLAM maintains a joint Gaussian over the robot pose and all landmark positions: state vector \(\mathbf{x} = [\mathbf{x}_r^T, l_1^T, \dots, l_K^T]^T \in \mathbb{R}^{3 + 2K}\) (for a 2D robot with 2D landmarks). The covariance matrix \(\Sigma \in \mathbb{R}^{(3+2K) \times (3+2K)}\) captures uncertainty and correlations.

The EKF proceeds in two steps per timestep:

  • Predict: propagate state through motion model \(f(\mathbf{x}_r, \mathbf{u})\), inflate uncertainty via process noise \(Q\).
  • Update: incorporate each new landmark observation \(\mathbf{z}_i = h(\mathbf{x}_r, l_i) + \mathbf{n}\) via Kalman gain.

EKF-SLAM scales quadratically with the number of landmarks โ€” the full covariance matrix grows as \(O(K^2)\), making it impractical for large environments (\(K > 1000\)).

FastSLAM: Particle Filters

FastSLAM (Montemerlo et al. 2002) exploits a key conditional independence: given the robot trajectory, each landmark can be estimated independently. Using the Rao-Blackwellised particle filter, each particle maintains its own trajectory estimate plus \(K\) independent 2D Kalman filters for landmarks:

$$p(\mathbf{x}_{1:t}, \mathbf{m} \mid \mathbf{z}_{1:t}) = p(\mathbf{x}_{1:t} \mid \mathbf{z}_{1:t}) \prod_{k=1}^K p(l_k \mid \mathbf{x}_{1:t}, \mathbf{z}_{1:t})$$

This reduces complexity to \(O(KM)\) where \(M\) is the number of particles, enabling SLAM in larger environments.

Graph-Based SLAM

Modern SLAM systems use a pose graph formulation. Robot poses \(x_1, \dots, x_T\) are nodes; relative pose measurements (from odometry or visual matching) are edges with associated uncertainty. Loop closure โ€” recognising a previously visited place โ€” adds long-range edges that correct accumulated drift.

SLAM then becomes a maximum a posteriori (MAP) estimation problem solved by nonlinear least squares:

$$\mathbf{x}^* = \arg\min_\mathbf{x} \sum_{(i,j) \in \mathcal{E}} \|h(x_i, x_j) - z_{ij}\|^2_{\Omega_{ij}}$$

where \(\Omega_{ij}\) is the information matrix of measurement \((i,j)\). Libraries g2o and GTSAM solve this efficiently using sparse Cholesky factorisation, exploiting the graphโ€™s sparsity. iSAM2 extends this to incremental updates.

ORB-SLAM3 (Campos et al. 2021) is the state-of-the-art feature-based SLAM system supporting monocular, stereo, and RGB-D cameras. It extracts ORB (Oriented FAST and Rotated BRIEF) features, tracks them across frames, builds a pose graph, and performs bundle adjustment. It achieves centimetre-level accuracy in real-time on consumer hardware.

Neural SLAM

Recent systems replace classical map representations with implicit neural representations. iMAP (Sucar et al. 2021) represents the scene as a neural radiance field (NeRF): a small MLP \(f_\theta(x, y, z) \to (\text{color}, \text{density})\). The pose and map are jointly optimised by minimising photometric and depth rendering losses.

NICE-SLAM (Zhu et al. 2022) extends iMAP with a hierarchical feature grid for faster optimisation in larger environments. Neural SLAM produces photorealistic reconstructions and generalises well to novel viewpoints, but remains slower than classical SLAM and harder to scale to large scenes.

Key Insight: Loop closure is the crux of SLAM. Without it, odometry drift accumulates into global inconsistency. Reliable loop closure requires place recognition โ€” recognising that two sensor observations are from the same location despite viewpoint and lighting changes. Deep learning (NetVLAD, SuperPoint+SuperGlue) has dramatically improved place recognition and is now standard in modern SLAM systems.

References

  1. Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic Robotics. MIT Press (Ch. 10: SLAM).
  2. Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015). ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE TRO. arXiv:1502.00956.
  3. Sucar, E., Liu, S., Ortiz, J., & Davison, A. J. (2021). iMAP: Implicit mapping and positioning in real-time. ICCV. arXiv:2103.12352.