The review is led by Mr. Shukang Yin (University of Science and Technology of China), Dr. Chaoyou Fu (Nanjing University), and Dr. Sirui Zhao (University of Science and Technology of China). Given the great potential of MLLMs demonstrated by the GPT-4 series, Dr. Fu launched the project and set up a GitHub page to keep track of the latest advances in the field. As expected, research on MLLMs has since blossomed.
This review meticulously summarizes the progress in the field, highlighting the new possibilities enabled by the unprecedented capabilities of MLLMs. Beyond traditional tasks such as visual question answering, MLLMs enable advanced applications such as intelligent agents, tools for analyzing charts and documents, and coding assistants.
Besides scaling laws, which once again work their magic, new training paradigms and techniques play pivotal roles in developing strong MLLMs. As outlined in the paper, the training of MLLMs generally involves three stages: pre-training, instruction tuning, and alignment tuning. During the pre-training stage, large amounts of paired data are used to align multimodal information with the LLM's representation space. Instruction tuning enhances the model's ability to understand and follow new instructions. Alignment tuning, often implemented with reinforcement learning techniques, ensures that the model aligns with human values and specific requirements, such as reducing hallucinations.
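To make the three-stage pipeline concrete, the following minimal Python sketch illustrates which components are typically updated at each stage. The toy modules, dimensions, and data are illustrative assumptions for exposition, not the code or recipe of any specific MLLM from the survey.

import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Toy stand-in for an MLLM: vision encoder -> projector -> LLM."""
    def __init__(self, vis_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stands in for a pretrained ViT
        self.projector = nn.Linear(vis_dim, llm_dim)        # maps vision features into the LLM space
        self.llm = nn.Linear(llm_dim, vocab)                 # stands in for a pretrained LLM head

    def forward(self, image_feats):
        return self.llm(self.projector(self.vision_encoder(image_feats)))

def train_step(model, trainable_params, inputs, targets):
    # One gradient step on the parameters unfrozen for the current stage.
    opt = torch.optim.AdamW(trainable_params, lr=1e-4)
    loss = nn.functional.cross_entropy(model(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = ToyMLLM()
images = torch.randn(8, 32)                 # toy "image" features (paired data)
labels = torch.randint(0, 100, (8,))        # toy token targets

# Stage 1: pre-training on large-scale paired data; typically only the projector
# is trained so multimodal features are aligned with the frozen LLM's space.
train_step(model, model.projector.parameters(), images, labels)

# Stage 2: instruction tuning on instruction-response data; the projector and
# often the LLM are updated so the model follows new instructions.
train_step(model,
           list(model.projector.parameters()) + list(model.llm.parameters()),
           images, labels)

# Stage 3: alignment tuning, often with RL-based methods on preference data,
# to align outputs with human values and reduce hallucinations (not shown here).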
Open problems and promising directions still abound, as the paper suggests, leaving ample room for further exploration: multimodal hallucination, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, novel architectures that support native generation of multimodal content, multilingual support to benefit a wider audience, efficient deployment on end devices such as cellphones, and leveraging MLLMs for interdisciplinary research.
As the era of MLLMs has just begun, the insights and guidance in this review shall pave the way for future endeavors to break new ground.
See the article and the associated GitHub page:
A Survey on Multimodal Large Language Models
https://doi.org/10.1093/nsr/nwae403
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Journal
National Science Review