Humans have little trouble recognizing objects and reasoning about their behavior; the ability is at the core of their cognitive development. Even as children, they group segments into objects based on motion and use concepts of object permanence, solidity, and continuity to explain what has happened and imagine what would happen in other scenarios. Inspired by this, a team of researchers from the MIT-IBM Watson AI Lab, MIT’s Computer Science and Artificial Intelligence Laboratory, Alphabet’s DeepMind, and Harvard University sought to simplify the problem of visual recognition by introducing a benchmark, CoLlision Events for Video REpresentation and Reasoning (CLEVRER), that draws inspiration from developmental psychology.
CLEVRER contains over 20,000 five-second videos of colliding objects (three shapes, two materials, and eight colors) generated by a physics engine, along with more than 300,000 questions and answers. The questions focus on four elements of logical reasoning: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”). The benchmark comes with ground-truth motion traces and event histories for each object in the videos, and each question is paired with a functional program representing its underlying logic.
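To make the shape of such a benchmark concrete, here is a minimal Python sketch of how one question-and-answer entry might be represented. It is not CLEVRER's actual file format: the class names, field names, program tokens, and sample questions are all hypothetical, chosen only to mirror the four question types and the question-to-program pairing described above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List


class QuestionType(Enum):
    """The four reasoning categories named in the article."""
    DESCRIPTIVE = "descriptive"        # e.g. "what color"
    EXPLANATORY = "explanatory"        # "what's responsible for"
    PREDICTIVE = "predictive"          # "what will happen next"
    COUNTERFACTUAL = "counterfactual"  # "what if"


@dataclass
class QAItem:
    """Hypothetical record: one question about one video."""
    video_id: int
    question: str
    qtype: QuestionType
    program: List[str]  # functional program paired with the question (made-up tokens)
    answer: str


def count_by_type(items):
    """Tally how many questions fall into each reasoning category."""
    return Counter(item.qtype for item in items)


# Invented sample entries, for illustration only.
sample = [
    QAItem(0, "What color is the cube?", QuestionType.DESCRIPTIVE,
           ["filter_shape:cube", "query_color"], "red"),
    QAItem(0, "How many spheres are there?", QuestionType.DESCRIPTIVE,
           ["filter_shape:sphere", "count"], "2"),
    QAItem(0, "Will the cube hit the sphere without the cylinder?",
           QuestionType.COUNTERFACTUAL,
           ["remove_shape:cylinder", "simulate", "check_collision"], "yes"),
]

counts = count_by_type(sample)
```

Grouping entries by category this way is what lets a benchmark report accuracy separately for each kind of question.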
The researchers analyzed CLEVRER to identify the elements necessary to excel not only at the descriptive questions, which state-of-the-art visual reasoning models can already handle, but at the explanatory, predictive, and counterfactual questions as well. They found three elements to be the most important: recognition of the objects and events in the videos, modeling of the dynamics and causal relations between the objects and events, and understanding of the symbolic logic behind the questions. They then developed a model, Neuro-Symbolic Dynamic Reasoning (NS-DR), that explicitly joins the three through a symbolic representation.
NS-DR is actually four models in one: a video frame parser, a neural dynamics predictor, a question parser, and a program executor. Given an input video, the video frame parser detects objects in the scene and extracts both their traces and attributes (i.e., position, color, shape, and material). These form an abstract representation of the video, which is sent to the neural dynamics predictor to anticipate the motions and collisions of the objects. The question parser converts the input question into a functional program representing its logic. The symbolic program executor then runs that program on the dynamic scene and outputs an answer.
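The final stage, the symbolic program executor, is the easiest to illustrate in code. The toy Python sketch below runs a small step-by-step program against an abstract scene of objects and collision events. It is an assumption-laden illustration, not the researchers' implementation: the operation names (`filter_color`, `filter_collided`, and so on), the data classes, and the scene contents are all invented, and the neural components that would produce the scene and the program are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One detected object; attribute values here are made up."""
    obj_id: int
    color: str
    shape: str
    material: str


@dataclass(frozen=True)
class Collision:
    """A collision event between two objects at a given video frame."""
    frame: int
    a: int
    b: int


def execute(program, objects, collisions):
    """Run a list of (op, arg) steps; each step narrows or queries the state."""
    state = list(objects)
    for op, arg in program:
        if op == "filter_color":
            state = [o for o in state if o.color == arg]
        elif op == "filter_shape":
            state = [o for o in state if o.shape == arg]
        elif op == "filter_collided":
            # Keep only objects that take part in at least one collision.
            hit = {c.a for c in collisions} | {c.b for c in collisions}
            state = [o for o in state if o.obj_id in hit]
        elif op == "query_material":
            if len(state) != 1:
                raise ValueError("query_material needs exactly one object")
            return state[0].material
        elif op == "count":
            return str(len(state))
        else:
            raise ValueError(f"unknown op: {op}")
    raise ValueError("program ended without producing an answer")


# Hypothetical scene, standing in for the parser and predictor outputs.
objects = [
    SceneObject(0, "red", "cube", "metal"),
    SceneObject(1, "blue", "sphere", "rubber"),
    SceneObject(2, "red", "sphere", "rubber"),
]
collisions = [Collision(frame=40, a=0, b=1)]

# "What material is the red object that collides?"
answer = execute(
    [("filter_color", "red"), ("filter_collided", None), ("query_material", None)],
    objects,
    collisions,
)
```

Because every step is an explicit operation on a symbolic scene, the intermediate state can be inspected after each step, which is part of what makes this style of reasoning easy to trace.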
The team reports that their model achieved 88.1% accuracy when the question parser was trained on 1,000 programs, outperforming other baseline models. On explanatory, predictive, and counterfactual questions, its gain over those baselines was “more significant.”
“NS-DR [incorporates a] dynamics planner into the visual reasoning task, which directly enables predictions of unobserved motion and events, and enables the model for the predictive and counterfactual tasks,” noted the researchers. “This suggests that dynamics planning has great potential for language-grounded visual reasoning tasks, and NS-DR takes a preliminary step toward this direction. Second, symbolic representation provides a powerful common ground for vision, language, dynamics, and causality. By design, it empowers the model to explicitly capture the compositionality behind the video’s causal structure and the question logic.”
The researchers concede that even though the amount of data required for training is relatively small, such data is hard to come by in real-world applications. Additionally, NS-DR’s performance decreased on tasks that required long-term dynamics prediction, such as the counterfactual questions, which they say suggests the need for a better dynamics model capable of generating more stable and accurate trajectories.