The review is led by Mr. Shukang Yin (University of Science and Technology of China), Dr. Chaoyou Fu (Nanjing University), and Dr. Sirui Zhao (University of Science and Technology of China). Given the great potential of MLLMs demonstrated by the GPT-4 series, Dr. Fu launched the project and set up a GitHub page to keep track of the latest advances in the field. As expected, research on MLLMs has since blossomed.
This review meticulously summarizes the progress in the field, highlighting the new possibilities enabled by the unprecedented capabilities of MLLMs. Beyond traditional tasks such as visual question answering, MLLMs enable advanced applications such as intelligent agents, tools for analyzing charts and documents, and coding assistants.
Besides scaling laws, which once again work their magic, new training paradigms and techniques play pivotal roles in developing strong MLLMs. As outlined in the paper, the training of MLLMs generally involves three stages: pre-training, instruction tuning, and alignment tuning. During the pre-training stage, large amounts of paired data are used to align multimodal information with the LLM's representation space. Instruction tuning enhances the model's ability to understand and follow new instructions. Alignment tuning, often implemented with reinforcement learning techniques, ensures that the model aligns with human values and specific requirements, such as reducing hallucinations.
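To make the three-stage pipeline concrete, the following minimal Python sketch illustrates which components are typically updated at each stage. The toy modules, dimensions, and data are illustrative assumptions for exposition, not the code or recipe of any specific MLLM from the survey.

import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Toy stand-in for an MLLM: vision encoder -> projector -> LLM."""
    def __init__(self, vis_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stands in for a pretrained ViT
        self.projector = nn.Linear(vis_dim, llm_dim)        # maps vision features into the LLM space
        self.llm = nn.Linear(llm_dim, vocab)                 # stands in for a pretrained LLM head

    def forward(self, image_feats):
        return self.llm(self.projector(self.vision_encoder(image_feats)))

def train_step(model, trainable_params, inputs, targets):
    # One gradient step on the parameters unfrozen for the current stage.
    opt = torch.optim.AdamW(trainable_params, lr=1e-4)
    loss = nn.functional.cross_entropy(model(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = ToyMLLM()
images = torch.randn(8, 32)                 # toy "image" features (paired data)
labels = torch.randint(0, 100, (8,))        # toy token targets

# Stage 1: pre-training on large-scale paired data; typically only the projector
# is trained so multimodal features are aligned with the frozen LLM's space.
train_step(model, model.projector.parameters(), images, labels)

# Stage 2: instruction tuning on instruction-response data; the projector and
# often the LLM are updated so the model follows new instructions.
train_step(model,
           list(model.projector.parameters()) + list(model.llm.parameters()),
           images, labels)

# Stage 3: alignment tuning, often with RL-based methods on preference data,
# to align outputs with human values and reduce hallucinations (not shown here).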
Open problems and promising directions still abound, as the paper suggests, leaving ample room for further exploration: multimodal hallucination, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, novel architectures that support native generation of multimodal content, multilingual support to benefit a wider audience, efficient deployment on end devices such as cellphones, and leveraging MLLMs for interdisciplinary research.
As the era of MLLMs has just begun, the insights and guidance in this review shall pave the way for future endeavors to break new ground.
See the article and the associated GitHub page:
A Survey on Multimodal Large Language Models
https://doi.org/10.1093/nsr/nwae403
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Journal
National Science Review