Introduction and Motivation
The future of AI is widely believed to be multimodal: instead of acting on only text, or only images, or only sounds, future AI would be able to interact with all of these in just one model. DeepMind tries to achieve that with Gato, “a generalist agent”.
Motivation: Larger, more general models tend to outperform smaller, specialized ones given enough data. Can there be a generalist agent that is capable of many tasks, and can be adapted with little data to handle an even larger number of tasks?
Development Details
Core guiding principle: Train Gato on the widest variety of data possible. Convert all data into a flat token sequence and train on it, much like an LLM.
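As a concrete illustration, here is a minimal sketch of how continuous control data could be flattened into tokens, using the mu-law companding and binning idea the paper describes; the constants, vocabulary offset, and helper names are my assumptions for illustration, not Gato’s exact implementation:

```python
import numpy as np

# Illustrative constants; Gato's real vocabulary layout differs.
NUM_BINS = 1024       # discretization bins for continuous values
CONT_OFFSET = 32_000  # continuous-value tokens placed after the text vocab

def mu_law(x, mu=100.0, m=256.0):
    # Companding: squash large-magnitude values so uniform bins cover them well.
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(values):
    # Map each (companded) float in [-1, 1] to an integer token id.
    squashed = np.clip(mu_law(np.asarray(values, dtype=float)), -1.0, 1.0)
    bins = np.floor((squashed + 1.0) / 2.0 * (NUM_BINS - 1)).astype(int)
    return (bins + CONT_OFFSET).tolist()

def flatten_episode(observations, actions):
    # Interleave observation and action tokens into one flat sequence, so a
    # decoder-only transformer can be trained on it like an LLM.
    seq = []
    for obs, act in zip(observations, actions):
        seq += tokenize_continuous(obs)  # e.g. joint angles, velocities
        seq += tokenize_continuous(act)  # e.g. motor torques
    return seq
```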
One of the goals is to maximize “out of distribution transfer”: the ability to learn tasks outside the training distribution faster.
The trained agent is then used as a control policy: it accepts a “fixed prompt” (unchanging environment data) plus the latest “observations”, predicts an action, and that action is applied to the environment to produce the next observation.
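A rough sketch of that control loop; `model`, `env`, and `decode_action` are hypothetical stand-ins rather than Gato’s actual API, and `tokenize_continuous` is the toy helper from the sketch above:

```python
def run_episode(model, env, fixed_prompt_tokens, max_steps=200):
    # Hypothetical interfaces: model.sample_action and decode_action are
    # stand-ins for whatever decoding/detokenization the real system uses.
    # The fixed prompt (e.g. a tokenized demonstration of the task) stays
    # constant; observation and action tokens accumulate behind it.
    context = list(fixed_prompt_tokens)
    obs = env.reset()
    for _ in range(max_steps):
        context += tokenize_continuous(obs)           # append latest observation
        action_tokens = model.sample_action(context)  # autoregressive decoding
        context += action_tokens
        obs, reward, done, info = env.step(decode_action(action_tokens))
        if done:
            break
```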
The data distribution is heavily skewed toward 2D and 3D control environments; only about 15% of it is text and images.
Evaluation
The biggest complaint levied against Gato is that its results feel somewhat subpar compared to the latest advances in each of the specific fields it attempts to generalize over.
Here are Gato’s (less than stellar) attempts to caption images:
But Gato is quite impressive in robotic tasks. The statistics below show that Gato performs roughly 50% as well as expert demonstrations.
Also note that Gato is significantly larger than the baseline comparison (BC-IMP) in terms of parameters and training data, which makes this a somewhat unfair comparison.
The hope with Gato was that there would be positive transfer: that seeing many different tasks would make performing any specific task easier. The data seems to suggest this isn’t the result.
Instead, some of this data suggests that there could be negative transfer: generalizing makes it harder to perform specific tasks. Note that “all data” sometimes does worse than “same domain only data” below:
Limitations and Future Work
Gato does not yet output image tokens or non-textual observations, but there is no reason it could not do so.
Gato learns via imitation learning, and thus relies heavily on high-quality expert demonstration data for the vast majority of its tasks.
The model currently has its prompt length limited to 1024 tokens; as mentioned previously, this is often a bottleneck when the model is used as an RL agent.
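To make the bottleneck concrete: every observation and action consumes tokens, so in long episodes older history has to be dropped. A sketch of the sliding-window truncation this forces (the limit comes from the paper; the helper name is mine):

```python
CONTEXT_LIMIT = 1024  # Gato's stated context length

def trim_context(fixed_prompt, history, limit=CONTEXT_LIMIT):
    # Keep the fixed prompt intact and retain only the newest interaction
    # tokens; older history falls out of the window first.
    budget = max(0, limit - len(fixed_prompt))
    start = max(0, len(history) - budget)
    return fixed_prompt + history[start:]
```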
In many of the comparisons to baselines, the size of the baseline model is not mentioned, which could make the comparisons unfair.
The paper also does not explicitly highlight the model’s potential “negative transfer”: the data suggesting such negative effects can only be found in the appendix of the paper.
While I appreciate the motivation of the paper, I think it makes the crucial error of brushing over weaknesses to highlight its strengths (something very common in CS research). I am not the biggest fan of such an approach.