Sugar

Unified Generative and Discriminative Training for
Multi-modal Large Language Models

1Zhejiang University, 2National University of Singapore,
3Nanyang Technological University, 4Singapore Management University

Introduction

We propose Sugar, a structure-induced approach that unifies the generative and discriminative paradigms. It uses discriminative training to acquire two abilities, capturing global semantics and distinguishing fine-grained semantics, while harnessing the potential of generative training for complex discriminative tasks such as interleaved image-text retrieval and fine-grained retrieval.


(a) Dynamic Sequence Alignment. Semantically matched slices are connected with a blue dashed line. The arrows indicate the direction of the ordered temporal alignment path. With these alignments, we can obtain the similarity between two interleaved inputs for training.
(b) Sugar Framework. Sugar supports both multi-modal generation and retrieval simultaneously.


Method

Specifically, we explicitly impose the semantic relationships between different input samples as an induced structural constraint on the hidden states of multi-modal large language models (MLLMs). We treat the interleaved image-text sequence as the general format of input samples, and formulate the relationship between any two samples as a dynamic sequence alignment problem within the Dynamic Time Warping framework. In this way, we can explicitly modulate the hidden states of the MLLM using the semantic relationships between interleaved input sequences, thereby encouraging the MLLM to fully capture the global semantics of the multi-modal inputs.
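To make the alignment idea concrete, here is a minimal sketch of classic Dynamic Time Warping over two sequences of slice embeddings. It is an illustration under assumptions, not Sugar's actual formulation: the function names `cosine_distance` and `dtw_cost`, the use of cosine distance as the per-slice cost, and the toy two-dimensional embeddings are all hypothetical choices made for this example.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed per-slice cost)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dtw_cost(seq_a, seq_b, dist=cosine_distance):
    """Accumulated cost of the best monotonic alignment between two sequences.

    seq_a and seq_b are lists of embedding vectors, one per text or image
    slice. A lower cost means the two interleaved inputs are more similar.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j] holds the best cost of aligning the first i slices of seq_a
    # with the first j slices of seq_b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            # The path may advance in seq_a, in seq_b, or in both,
            # which keeps the alignment ordered in time.
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

# Toy example: the second sequence repeats its first slice.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cost = dtw_cost(a, b)
```

Because warping lets one slice match several consecutive slices in the other sequence, sequences of different lengths can still align at low cost. A training objective could then turn such a cost into a similarity score between samples, though how Sugar does this is not specified here.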

To further enhance the ability to differentiate fine-grained semantics, we integrate a novel kernel into the Dynamic Time Warping framework. Leveraging the strengths of various discriminative pre-trained models, it performs dynamic sequence alignment for diverse embeddings tailored to specific contexts, thus addressing the inherent limitations in fully utilizing input semantics.


Our structure-induced joint training strategy for generative and discriminative training.


Results

Sugar is capable of producing compelling results on various vision-language tasks and has demonstrated some emergent abilities.


Multimodal Comprehension on 11 Benchmarks


Comparison with state-of-the-art methods on 11 vision-language benchmarks.


Complicated Multimodal Comprehension on DEMON


Comparison with state-of-the-art methods on the DEMON benchmark.


Zero-shot Cross-modal Information Retrieval


Retrieval results compared with previous models, reported by Recall@k for (a)(b) and Accuracy (%) for (c). (a) MSCOCO for image-text retrieval: FROMAGe(d) indicates the FROMAGe model pre-trained only with discriminative loss, and FROMAGe(g+d) indicates joint training with both discriminative and generative losses. (b) VIST for interleaved retrieval: † indicates retrieval over images not previously seen in the story sequence. "5c+4i" is shorthand for 5 captions and 4 images, and "5c" is shorthand for 5 captions. (c) Winoground for fine-grained retrieval.
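For readers unfamiliar with the metric, Recall@k is the fraction of queries whose correct item appears among the top k retrieved candidates. The sketch below assumes one correct item per query; the function name `recall_at_k` and the toy identifiers are hypothetical and are not taken from the Sugar evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold item is within the top-k ranked results.

    ranked_ids: one list of candidate ids per query, best match first.
    gold_ids:   the single correct id for each query (an assumption here).
    """
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy example with two queries and three candidates each.
ranked = [["a", "b", "c"], ["c", "a", "b"]]
gold = ["a", "b"]
r1 = recall_at_k(ranked, gold, 1)  # only the first query hits at rank 1
r3 = recall_at_k(ranked, gold, 3)  # both gold items appear within the top 3
```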


Retrieval-Augmented Generation


Retrieval-Augmented Generation.


Retrieval for Knowledge-based VQA


Comparison between the independent generator + retriever and Sugar on knowledge-based VQA. '/' indicates not applicable.


Qualitative Results

Below are selected examples for various image-text tasks. The pink background indicates retrieval results, while the blue background indicates generated results.


Sensitivity with Detailed Semantics



World Knowledge



Fine-grained Image Discrimination



Multimodal Concept Composition



Retrieval and Dialog



Retrieval at Different Places


BibTeX


      @inproceedings{
          TODO
      }