Inferring and understanding the underlying objects in a scene is a well-researched task in the domain of AI. However, robustly understanding the relations among those objects remains challenging.

A recent study on arXiv.org proposes an approach to represent and factorize individual visual relations.

Example of automated object detection and object recognition. Image credit: MTheiler via Wikimedia, CC-BY-SA-4.0

A scene description is represented as the product of the individual probability distributions across relations, so interactions between multiple relations can be modeled. The framework can reliably capture and generate images with multiple relational descriptions. Furthermore, images can be edited to have a desired set of relations. The approach can generalize to a previously unseen relation description, even if the underlying objects and descriptions are not seen during training.
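
The factorization can be written out roughly as follows. The notation is assumed here for illustration and is not taken from the paper: x is an image, R is a scene description made of K individual relations r_1, ..., r_K, and E_θ is a learned energy function that scores how well an image matches a single relation (lower is better).

```latex
p(x \mid R) \;\propto\; \prod_{i=1}^{K} p(x \mid r_i),
\qquad
p(x \mid r_i) \;\propto\; e^{-E_\theta(x \mid r_i)}
\;\;\Longrightarrow\;\;
p(x \mid R) \;\propto\; \exp\!\Big(-\sum_{i=1}^{K} E_\theta(x \mid r_i)\Big)
```

Under this reading, multiplying the per-relation distributions is the same as adding their energies, which is why relations can be combined or swapped without retraining a single holistic encoder.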

The method can be used to infer the objects and their relations in a scene in tasks such as image-to-text retrieval and classification.
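
A toy sketch of how such retrieval could work: score each candidate description by the sum of its per-relation energies and pick the lowest. Everything here is a hypothetical stand-in. The scene is reduced to object centres with y increasing upward, and `relation_energy` is a hand-written hinge penalty, whereas the paper's energies are learned from images.

```python
# Toy image-to-text retrieval by summed relation energies.
# All names and the hinge-style energy are illustrative assumptions.

def relation_energy(scene, relation, margin=1.0):
    """Energy of one (subject, predicate, object) triple; 0 means satisfied."""
    subj, pred, obj = relation
    sx, sy = scene[subj]
    ox, oy = scene[obj]
    if pred == "left of":
        gap = ox - sx
    elif pred == "right of":
        gap = sx - ox
    elif pred == "above":
        gap = sy - oy
    elif pred == "below":
        gap = oy - sy
    else:
        raise ValueError(f"unknown predicate: {pred}")
    # Penalize the relation when the gap is smaller than the margin.
    return max(0.0, margin - gap) ** 2


def description_energy(scene, description):
    """A description is a list of relations; its energy is the sum of theirs."""
    return sum(relation_energy(scene, r) for r in description)


def retrieve(scene, candidates):
    """Return the index of the lowest-energy description and all energies."""
    energies = [description_energy(scene, d) for d in candidates]
    best = min(range(len(candidates)), key=energies.__getitem__)
    return best, energies


scene = {"cube": (0.0, 0.0), "sphere": (2.0, 0.0), "cylinder": (2.0, 3.0)}
candidates = [
    [("cube", "left of", "sphere"), ("cylinder", "above", "sphere")],
    [("cube", "right of", "sphere"), ("cylinder", "above", "sphere")],
    [("cube", "left of", "sphere"), ("cylinder", "below", "sphere")],
]
best, energies = retrieve(scene, candidates)
print(best, energies)  # 0 [0.0, 9.0, 16.0]
```

Because each relation is scored separately, a description with one wrong relation still gets credit for the relations it gets right, and the total energy shows how far it is from matching the scene.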

The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure. Project page at: this https URL.
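
The idea of composing relations as separate unnormalized densities can be sketched on a toy problem. The code below is an assumption-laden illustration, not the authors' implementation: the "scene" is just the 1-D positions of three objects, each relation is a hand-written hinge energy rather than a learned network, and sampling uses noisy gradient descent in the style of Langevin dynamics, a common choice for energy-based models. The names `left_of`, `compose`, and `langevin_sample` are hypothetical.

```python
import numpy as np


def left_of(i, j, margin=1.0):
    """Return (energy, grad) callables for 'object i is left of object j'."""
    def energy(x):
        return max(0.0, x[i] - x[j] + margin) ** 2

    def grad(x):
        g = np.zeros_like(x, dtype=float)
        violation = max(0.0, x[i] - x[j] + margin)
        g[i] = 2.0 * violation
        g[j] = -2.0 * violation
        return g

    return energy, grad


def compose(relations):
    """Multiplying unnormalized densities corresponds to summing energies."""
    def energy(x):
        return sum(e(x) for e, _ in relations)

    def grad(x):
        return sum(g(x) for _, g in relations)

    return energy, grad


def langevin_sample(grad, x0, steps=500, step_size=0.05, noise=0.01, seed=0):
    """Noisy gradient descent on the composed energy (Langevin-style)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - step_size * grad(x) + noise * rng.standard_normal(x.shape)
    return x


# "Object 0 is left of object 1" AND "object 1 is left of object 2".
energy, grad = compose([left_of(0, 1), left_of(1, 2)])
scene = langevin_sample(grad, x0=[0.0, 0.0, 0.0])
print(scene, energy(scene))
```

Adding or removing a relation only changes the list passed to `compose`; the sampler stays the same, which mirrors how a factorized model could edit a scene by changing one relation at a time.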

Research paper: Liu, N., Li, S., Du, Y., Tenenbaum, J. B., and Torralba, A., “Learning to Compose Visual Relations”, 2021. Link: https://arxiv.org/abs/2111.09297