GNNs for Computer Vision: Scene Graphs and Beyond

Published: May 27, 2024

TL;DR: Vision tasks require relational understanding: "the cat is sitting on the mat," "the person is holding a cup," "joint 3 moves because joint 1 moved." Scene graph generation, VQA, skeleton action recognition, and 3D point cloud analysis all use GNNs to encode these relations — going beyond the pixel-level features that CNNs provide.

Figure: GNNs for scene graph generation and visual reasoning (Xu et al., 2019)

Vision Is Relational

Intuition First: A CNN classifying cats sees pixels and textures, so it answers “is there a cat-shaped pattern here?” A scene graph GNN answers “is the cat sitting on the mat, or next to it, or under it?” These are different questions. The relation matters for image captioning, for VQA (“Is there anything on the mat?”), and for robotics (“pick up the object on top of the red box”). A detector's per-region features describe each object largely on its own; a GNN passes messages between objects so that each object’s representation is informed by its relational context.

A single-object CNN classifier answers “what is in this image?” A relational vision system answers “how do the objects relate?” The latter is required for:

Image captioning: “a person riding a bicycle on a road” — requires knowing the person-bicycle relation
Visual question answering: “Is the cup to the left of the plate?” — spatial relation query
Action recognition: “throwing” vs “catching” — involves interaction between multiple body parts
3D scene understanding: robot navigation requires knowing object spatial relations

GNNs are the natural tool for encoding and reasoning over these relational structures.

Application 1: Scene Graph Generation

A scene graph represents an image as a graph where:

Nodes = detected objects (person, dog, cup, table)
Edges = predicate relations (holding, sitting-on, next-to)
Node features = visual features from bounding boxes

Task: given an image, predict the scene graph.
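The structure above can be written down as a small container. This is a minimal sketch, not a standard API: the class names `SceneObject` and `SceneGraph`, the box format, and the example objects are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    label: str                                     # object class, e.g. "person"
    box: tuple                                     # assumed (x_min, y_min, x_max, y_max) in pixels
    features: list = field(default_factory=list)   # visual feature vector from the detector


@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)    # nodes
    relations: list = field(default_factory=list)  # edges: (subject_idx, predicate, object_idx)

    def add_object(self, label, box, features=None):
        self.objects.append(SceneObject(label, box, features or []))
        return len(self.objects) - 1               # node index used by relations

    def add_relation(self, subj, predicate, obj):
        self.relations.append((subj, predicate, obj))

    def triples(self):
        # Human-readable (subject, predicate, object) view of the edges.
        return [(self.objects[s].label, p, self.objects[o].label)
                for s, p, o in self.relations]


graph = SceneGraph()
person = graph.add_object("person", (40, 30, 120, 220))
cup = graph.add_object("cup", (100, 110, 130, 140))
table = graph.add_object("table", (20, 180, 300, 260))
graph.add_relation(person, "holding", cup)
graph.add_relation(person, "next-to", table)
print(graph.triples())
```

Scene graph generation then amounts to filling `objects` from a detector and predicting the entries of `relations`.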

GNN approach:

Detect objects with a detector (Faster R-CNN) → bounding boxes + features
Build a fully connected graph over detected objects
Run GNN (message passing between object nodes)
Predict relation label for each edge: (person, dog, walking) vs (person, cup, holding)

The GNN refines object representations by incorporating context from other objects — “a bounding box near a computer on a desk is more likely a keyboard than a random rectangle.”
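Steps 2 to 4 of the pipeline can be sketched in plain Python. This is a toy under stated assumptions: the message passing is a fixed 50/50 mix with the mean of the other nodes (real models learn this update), the relation head is a single linear layer, and the features, weights, and predicate names are made up.

```python
import math


def mean_message_pass(feats):
    """One round on a fully connected graph (needs at least two nodes).

    Each node averages the features of all other nodes and mixes that
    context 50/50 with its own features. No learned weights are involved.
    """
    n, dim = len(feats), len(feats[0])
    updated = []
    for i in range(n):
        ctx = [sum(feats[j][d] for j in range(n) if j != i) / (n - 1)
               for d in range(dim)]
        updated.append([0.5 * feats[i][d] + 0.5 * ctx[d] for d in range(dim)])
    return updated


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_relation(h_subj, h_obj, weights, predicates):
    """Score every predicate from the concatenated (subject, object) pair."""
    pair = h_subj + h_obj
    logits = [sum(w * x for w, x in zip(row, pair)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(predicates)), key=lambda k: probs[k])
    return predicates[best], probs


# Toy detector features for three objects (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = mean_message_pass(feats)

predicates = ["walking", "holding", "next-to"]
weights = [[0.5, -0.2, 0.1, 0.3],    # one row per predicate, untrained
           [-0.1, 0.4, 0.2, -0.3],
           [0.0, 0.1, -0.2, 0.2]]
label, probs = predict_relation(updated[0], updated[1], weights, predicates)
print(label, [round(p, 3) for p in probs])
```

The point of the sketch is the data flow: relation scores are computed from context-refined node features rather than from the raw detector outputs.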

Application 2: Visual Question Answering (VQA)

Task: given image + question text → answer.

“How many objects are to the left of the red cube?”

Relation networks / GNN approach:

Extract object-level features (not just global image feature)
Build a scene graph (or dense pairwise graph)
GNN propagates information between object nodes
Answer predicted from aggregated graph embedding + question encoding
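One way to realise the last two steps is a relation-network-style readout: apply a function g to every ordered pair of objects together with the question encoding, sum the results, and map the sum to answer logits with a function f. The sketch below assumes single linear layers for g and f, random untrained weights, and made-up object and question vectors.

```python
import random
from itertools import permutations


def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, x) for x in v]


def relation_readout(objects, question, w_g, w_f):
    """Sum g(o_i, o_j, q) over ordered object pairs, then apply f."""
    pooled = [0.0] * len(w_g)
    for o_i, o_j in permutations(objects, 2):
        g_out = relu(linear(w_g, o_i + o_j + question))  # list concatenation
        pooled = [p + g for p, g in zip(pooled, g_out)]
    return linear(w_f, pooled)


rng = random.Random(0)  # fixed seed so the toy weights are reproducible
obj_dim, q_dim, hidden, num_answers = 3, 2, 4, 2
w_g = [[rng.uniform(-1, 1) for _ in range(2 * obj_dim + q_dim)]
       for _ in range(hidden)]
w_f = [[rng.uniform(-1, 1) for _ in range(hidden)]
       for _ in range(num_answers)]

objects = [[0.2, 0.1, 1.0], [0.9, 0.4, 0.0], [0.5, 0.8, 1.0]]  # hypothetical
question = [1.0, 0.0]                                           # hypothetical
logits = relation_readout(objects, question, w_g, w_f)
print(logits)
```

Because the pairwise terms are summed, the answer does not depend on the order in which the detector lists the objects.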

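GNN-based VQA outperforms a global image feature + LSTM baseline by a wide margin on CLEVR (a spatial/compositional reasoning benchmark), because relational reasoning requires explicit object-to-object information flow. Relation Networks (Santoro et al., 2017), for example, report about 95% accuracy where a CNN+LSTM baseline reaches roughly 52%.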

CLEVR benchmark: CLEVR tests compositional visual reasoning: "Is there any rubber thing that is the same size as the green sphere and to the right of the cyan cylinder?" Solving this requires tracking multiple objects and their spatial relations, which is very hard for models that compress the image into a single global feature. Models that build and query explicit object-level relations achieve near-perfect performance on CLEVR.

Application 3: Skeleton Action Recognition

Human skeletons are natural graphs: joints (wrists, elbows, shoulders) are nodes; bones are edges. Action recognition from skeleton data (motion capture, Kinect, pose estimation) is a spatio-temporal GNN problem.

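ST-GCN (Yan et al., 2018): a spatio-temporal GCN on skeleton graphs. At each timestep it runs a graph convolution over the skeleton joints (18 joints for OpenPose skeletons on Kinetics, 25 for NTU RGB+D). A temporal convolution across timesteps then captures motion dynamics.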
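The two ingredients, a graph convolution per frame followed by a convolution along time, can be sketched without any learned weights. This is a simplification and not the ST-GCN architecture itself: the 5-joint skeleton, the symmetric normalisation of the adjacency matrix, and the fixed smoothing kernel are assumptions for illustration.

```python
import math


def normalized_adjacency(num_joints, bones):
    """D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def spatial_step(a_hat, frame):
    """Mix each joint with its neighbours: a_hat @ frame for one timestep."""
    n, dim = len(frame), len(frame[0])
    return [[sum(a_hat[i][j] * frame[j][d] for j in range(n))
             for d in range(dim)] for i in range(n)]


def temporal_step(seq, kernel):
    """1D convolution along time, zero-padded, applied per joint and channel."""
    steps, n, dim = len(seq), len(seq[0]), len(seq[0][0])
    half = len(kernel) // 2
    out = []
    for t in range(steps):
        frame = [[0.0] * dim for _ in range(n)]
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < steps:
                for j in range(n):
                    for d in range(dim):
                        frame[j][d] += w * seq[s][j][d]
        out.append(frame)
    return out


# Hypothetical skeleton: joint 1 (neck) connects to joints 0, 2, 3 and 4.
bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
a_hat = normalized_adjacency(5, bones)

# 4 frames, 5 joints, 2 channels of made-up coordinates.
seq = [[[float(t + j), float(j)] for j in range(5)] for t in range(4)]
spatial = [spatial_step(a_hat, frame) for frame in seq]
out = temporal_step(spatial, kernel=[0.25, 0.5, 0.25])
print(len(out), len(out[0]), len(out[0][0]))
```

A trained model would multiply by learned weight matrices in both steps and stack several such blocks before classifying the action.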

Applications: action recognition (running, jumping, waving), fall detection, sports analysis, rehabilitation monitoring.

Advantage: unlike CNN on RGB video, skeleton GNNs are:

More robust to viewpoint (joints are 3D positions, not pixel patterns)
Background-invariant (ignores visual clutter)
Interpretable (which joint contributed to which prediction?)

Application 4: 3D Point Cloud Processing

Point clouds from LiDAR/depth sensors are unordered sets of 3D points — no natural grid structure. GNNs handle this naturally: construct a graph (k-nearest neighbours in 3D space), run message passing.
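The graph construction step can be sketched as a brute-force k-nearest-neighbour search. The function name `knn_graph` and the example points are assumptions; a real pipeline would use a spatial index such as a k-d tree, since this version compares every pair of points.

```python
import math


def knn_graph(points, k):
    """Return directed edges (i, j) where j is one of the k nearest points to i."""
    edges = []
    for i, p in enumerate(points):
        # Distance to every other point, sorted ascending.
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Two small clusters of made-up 3D points.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
          (5.0, 5.0, 5.0), (5.0, 5.0, 6.0)]
edges = knn_graph(points, k=2)
print(edges)
```

Note that k-NN edges are directed: point 3 lists a point from the far cluster as its second neighbour, but that point does not list point 3 back.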

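PointNet++ and DGCNN: process point clouds through local neighbourhoods. PointNet++ groups points hierarchically around sampled centroids, and DGCNN builds an explicit k-NN graph that it recomputes in feature space at each layer. Applications: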

Autonomous driving: 3D object detection (cars, pedestrians, cyclists)
Indoor mapping: furniture segmentation
Medical: 3D organ segmentation from CT/MRI
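DGCNN's core operation, EdgeConv (Wang et al., 2019), can be sketched as follows: for each point, an edge function is applied to the point's own feature and the offset to each neighbour, and the results are max-pooled. The sketch assumes a single linear layer with ReLU as the edge function, a hand-written neighbour list, and weights chosen by hand so the arithmetic is easy to follow.

```python
def edge_conv(feats, neighbours, weights):
    """EdgeConv-style update; every point needs at least one neighbour.

    For each neighbour j of point i, apply linear + ReLU to [x_i, x_j - x_i],
    then take the channel-wise maximum over the neighbours.
    """
    out = []
    for i, x_i in enumerate(feats):
        messages = []
        for j in neighbours[i]:
            edge_in = list(x_i) + [b - a for a, b in zip(x_i, feats[j])]
            messages.append([max(0.0, sum(w * v for w, v in zip(row, edge_in)))
                             for row in weights])
        out.append([max(channel) for channel in zip(*messages)])
    return out


feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # made-up point features
neighbours = [[1, 2], [0, 2], [0, 1]]          # neighbour indices per point
# Hand-picked weights: each output channel adds x_i and (x_j - x_i),
# which recovers x_j, so the max-pool is easy to check by hand.
weights = [[1, 0, 1, 0],
           [0, 1, 0, 1]]
print(edge_conv(feats, neighbours, weights))
```

In DGCNN the neighbour lists are recomputed from the updated features after each layer, which is what makes the graph dynamic.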

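Equivariant GNNs (EGNN): point cloud processing that is equivariant to rotations and translations (EGNN is E(n)-equivariant, which covers the SE(3) rigid motions), so predictions transform consistently regardless of sensor orientation. This is critical for robotics, where the sensor may be mounted in various orientations.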
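Equivariance can be checked numerically. The sketch below uses an EGNN-style coordinate update in which each point moves along the directions to the other points, scaled by a function of squared distance only. The fixed function `phi` stands in for a learned network and is an assumption, as are the function names and test points; the rotation is about the z-axis for simplicity.

```python
import math


def egnn_style_update(coords, phi=lambda d2: 1.0 / (1.0 + d2)):
    """x_i <- x_i + mean over j != i of (x_i - x_j) * phi(|x_i - x_j|^2)."""
    n = len(coords)
    updated = []
    for i, x_i in enumerate(coords):
        shift = [0.0, 0.0, 0.0]
        for j, x_j in enumerate(coords):
            if j == i:
                continue
            diff = [a - b for a, b in zip(x_i, x_j)]
            weight = phi(sum(c * c for c in diff))  # depends on distance only
            shift = [s + weight * c for s, c in zip(shift, diff)]
        updated.append([a + s / (n - 1) for a, s in zip(x_i, shift)])
    return updated


def rigid_transform(coords, angle, translation):
    """Rotate about the z-axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [[c * x - s * y + tx, s * x + c * y + ty, z + tz]
            for x, y, z in coords]


coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 3.0]]
angle, shift = 0.7, (2.0, -1.0, 0.5)

moved_then_updated = egnn_style_update(rigid_transform(coords, angle, shift))
updated_then_moved = rigid_transform(egnn_style_update(coords), angle, shift)
max_gap = max(abs(p - q)
              for u, v in zip(moved_then_updated, updated_then_moved)
              for p, q in zip(u, v))
print(max_gap < 1e-9)
```

The two orders agree up to floating-point error because the update only uses coordinate differences and distances, both of which behave predictably under rotation and translation.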

Key Insight: The CLEVR benchmark makes the advantage of relational models concrete: questions like "Is the large metallic cube to the left of the small rubber sphere?" require tracking two objects and computing a spatial relation, which is very hard for a model that reads the image as a single global feature vector. Models that build an explicit object graph and pass messages between nodes achieve near-perfect accuracy on CLEVR, while CNN+LSTM baselines stay below 70%. The difference is not model capacity; it is whether spatial relations are explicitly represented.

Application 5: Object Detection with Region-Relation Reasoning

Relation networks for object detection (Hu et al., 2018): detect objects and then refine each object's features and detection score by aggregating context from the other detected objects, weighted by appearance and geometry. A car-shaped box next to a road is more likely to be a car than the same box in the middle of a forest.

GNN over detected regions:

Nodes = detected bounding boxes
Edges = spatial proximity or semantic similarity
Message passing → refined detection scores
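A toy version of this refinement uses geometry only: attention weights come from the distance between box centres, and each box adds the weighted features of the other boxes to its own. This is not the module from Hu et al., which also uses appearance and learned projections; the function names, temperature, boxes, and features are assumptions.

```python
import math


def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def geometric_attention(boxes, temperature=50.0):
    """Row i: softmax weights over the other boxes, larger for closer centres.

    Needs at least two boxes; a box never attends to itself.
    """
    centers = [box_center(b) for b in boxes]
    weights = []
    for i, c_i in enumerate(centers):
        logits = [-math.inf if j == i else -math.dist(c_i, c_j) / temperature
                  for j, c_j in enumerate(centers)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def refine_features(feats, weights):
    """Residual update: own feature plus attention-weighted context."""
    dim = len(feats[0])
    return [[feats[i][d] + sum(w * feats[j][d] for j, w in enumerate(row))
             for d in range(dim)]
            for i, row in enumerate(weights)]


# Hypothetical detections: car, road, tree as (x_min, y_min, x_max, y_max).
boxes = [(100, 200, 200, 260), (0, 240, 400, 300), (350, 0, 400, 80)]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attn = geometric_attention(boxes)
refined = refine_features(feats, attn)
print([round(w, 3) for w in attn[0]])
```

In this example the car box attends mostly to the nearby road box, so its refined feature carries road context into the final scoring layer.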

This relation module improves mAP by roughly 2-3 points on COCO, a significant gain.

Summary

Application	Graph structure	GNN role
Scene graph generation	Object-relation graph	Encode context for relation prediction
Visual QA	Scene graph + question	Relational reasoning over objects
Skeleton action	Joint-bone kinematic graph	Spatio-temporal action recognition
Point cloud	k-NN in 3D space	Unordered 3D processing
Object detection	Spatial proximity graph	Context-aware refinement

GNNs bring relational reasoning to computer vision — moving beyond “what objects are present” to “how do objects relate.” This shift is enabling vision systems that answer compositional questions, understand actions, and reason about 3D spatial structure — capabilities that are increasingly central to real-world visual intelligence.

References

Yan, S., Xiong, Y., & Lin, D. (2018). Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI 2018 (ST-GCN: spatio-temporal GNN on human skeleton joint graphs for action recognition from pose sequences).
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., & Solomon, J. M. (2019). Dynamic Graph CNN for Learning on Point Clouds. ACM Transactions on Graphics 2019 (DGCNN: EdgeConv on dynamically recomputed k-NN graphs in feature space for 3D point cloud classification).
Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph R-CNN for Scene Graph Generation. ECCV 2018 (Graph R-CNN: end-to-end scene graph generation using GNNs to reason over detected object relations).

Alessio Borgi
