Several recent vision-language models have demonstrated remarkable multimodal generation capabilities, but they usually require training enormous models on massive datasets. As a scalable alternative, the researchers present Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of pre-trained domain experts. By inheriting most of its weights from publicly available pre-trained domain experts and freezing them during training, Prismer requires only a few components to be trained.
Large pre-trained models exhibit exceptional generalization across many different tasks. However, these capabilities come at a high price: they require massive training data and extensive computational resources for both training and inference. Models with hundreds of billions of trainable parameters are common in the language domain, and they typically demand a yottaFLOP-scale compute budget.
Vision-language learning poses even harder problems. This domain is a superset of language processing, but it also requires expertise in visual and multimodal reasoning. Prismer is a data-efficient vision-language model that leverages a wide array of pre-trained experts through its projected multimodal signals. It can handle vision-language reasoning tasks such as visual question answering and image captioning. In analogy with a prism, Prismer splits a general reasoning task into smaller, more manageable pieces.
Two of Prismer's most important design features are (i) vision-only and language-only models with web-scale knowledge that serve as the core network backbones, and (ii) modality-specific vision experts that encode multiple types of visual information, from low-level vision cues like depth to high-level vision cues like instance and semantic labels, as auxiliary knowledge taken directly from their corresponding network outputs. The researchers developed a visually conditioned autoregressive text generation model to better utilize these pre-trained domain experts for exploratory vision-language reasoning tasks.
Even though Prismer was trained on only 13 million publicly available image/alt-text pairs, it exhibits strong multimodal reasoning performance on tasks such as image captioning, image classification, and visual question answering, competitive with many state-of-the-art vision-language models. The researchers conclude with an in-depth investigation of Prismer's learning behavior, where they identify several desirable characteristics.
Prismer, presented in an encoder-decoder transformer version, relies on a large pool of pre-trained subject-matter experts to accelerate training. The system consists of a vision encoder and an autoregressive language decoder. The vision encoder receives a sequence of RGB patches and multimodal labels (depth, surface normal, and segmentation labels predicted by frozen pre-trained experts) as input, and produces a sequence of RGB and multimodal features as output. Through cross-attention, the language decoder is conditioned on these features to generate a sequence of text tokens.
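The conditioning step described above can be sketched as a single cross-attention operation, where text-token queries attend over the vision encoder's fused RGB and multimodal features. This is a minimal, hypothetical numpy sketch (dimensions, weight initialization, and the single-head simplification are our assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, vision_features, d=16):
    # Toy single-head cross-attention: text queries attend to the
    # vision encoder's output (RGB + depth/normal/segmentation features).
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((text_tokens.shape[-1], d))
    Wk = rng.standard_normal((vision_features.shape[-1], d))
    Wv = rng.standard_normal((vision_features.shape[-1], d))
    Q, K, V = text_tokens @ Wq, vision_features @ Wk, vision_features @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # each text token weights vision tokens
    return attn @ V

# Toy inputs: 5 text tokens, 10 vision tokens, embedding width 32
text = np.random.default_rng(1).standard_normal((5, 32))
vision = np.random.default_rng(2).standard_normal((10, 32))
out = cross_attention(text, vision)
print(out.shape)  # (5, 16): one vision-conditioned feature per text token
```

In the real model this happens inside every decoder block of a full transformer; the sketch only illustrates the conditioning mechanism itself.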
- One of Prismer's most notable advantages is its exceptional data efficiency during training. To achieve this, Prismer is built on pre-trained vision-only and language-only models, dramatically decreasing the GPU hours required to reach performance comparable to other state-of-the-art vision-language models. These pre-trained parameters let the model draw on the massive amounts of knowledge available on the web.
- The researchers also developed a multimodal signal input for the vision encoder. The resulting multimodal auxiliary knowledge better captures the semantics of the input image. Prismer's architecture is optimized to make maximal use of trained experts while keeping the number of trainable parameters small.
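The freeze-the-backbones, train-the-rest idea can be illustrated with a toy optimizer step. The parameter names here ("vision_backbone", "language_backbone", "adaptor") are illustrative placeholders, not the paper's actual module names:

```python
import numpy as np

# Hypothetical parameter store: the pretrained backbone weights are frozen,
# and only the small connecting ("adaptor"-style) weights receive updates.
rng = np.random.default_rng(0)
params = {
    "vision_backbone.w": rng.standard_normal((4, 4)),
    "language_backbone.w": rng.standard_normal((4, 4)),
    "adaptor.w": rng.standard_normal((4, 4)),
}
frozen = {"vision_backbone.w", "language_backbone.w"}

def sgd_step(params, grads, lr=0.1):
    # Apply gradients only to trainable parameters; frozen weights pass through.
    return {
        name: (w if name in frozen else w - lr * grads[name])
        for name, w in params.items()
    }

grads = {name: np.ones_like(w) for name, w in params.items()}
before = {k: v.copy() for k, v in params.items()}
after = sgd_step(params, grads)

trainable = sum(v.size for k, v in params.items() if k not in frozen)
print(trainable)  # 16 trainable parameters out of 48 total
```

In a deep-learning framework the same effect is typically achieved by disabling gradient tracking on the frozen modules; the dictionary version above just makes the bookkeeping explicit.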
The researchers included two varieties of pre-trained experts in Prismer:
- Backbone experts: the pre-trained vision-only and language-only models responsible for translating images and text into meaningful sequences of tokens.
- Modality experts: models that label images with task-specific signals such as depth, surface normals, or segmentation, depending on the data used in their training.
- More experts, better results: as the number of modality experts in Prismer increases, its performance improves.
- Better experts, better results: to evaluate the effect of expert quality on Prismer's performance, the researchers replace a portion of the predicted depth labels with random noise drawn from a uniform distribution, creating a corrupted depth expert.
- Robustness to noisy experts: the results further demonstrate that Prismer's performance remains stable even when noise-predicting experts are incorporated.
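The corrupted-depth-expert ablation can be sketched in a few lines. This is our own reconstruction of the described procedure (function name and corruption fraction are illustrative, not from the paper):

```python
import numpy as np

def corrupt_depth(depth, frac, rng):
    # Replace a fraction `frac` of the expert's predicted depth labels
    # with random noise drawn from a uniform distribution over the
    # observed depth range, simulating a lower-quality expert.
    corrupted = depth.copy()
    mask = rng.random(depth.shape) < frac
    corrupted[mask] = rng.uniform(depth.min(), depth.max(), size=mask.sum())
    return corrupted

rng = np.random.default_rng(0)
depth = rng.uniform(0.0, 10.0, size=(8, 8))  # toy depth map from a frozen expert
noisy = corrupt_depth(depth, frac=0.5, rng=rng)
print(round(float((noisy != depth).mean()), 2))  # roughly half the labels change
```

Feeding `noisy` instead of `depth` into the vision encoder at varying values of `frac` would then measure how downstream accuracy degrades with expert quality.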
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.