Large language models (LLMs), such as GPT-3, Gopher, PaLM, Chinchilla, GLM-130B, LLaMA, and GPT-4, have demonstrated unprecedented capabilities on a variety of language tasks. After being aligned with human preferences, these LLMs can serve as capable AI assistants that are helpful across many domains. Such AI assistants are usually trained to interact with users in a conversational manner, attracting widespread attention not only from the research community but also from the public.
The unprecedented intelligence that LLMs exhibit goes beyond what some of the authors would expect a probabilistic model to have. One speculative explanation is that the model learns concepts abstract and general enough to function in more situations than it has been trained on. The impetus for LLMs to learn such a complicated structure of world knowledge by mere imitation, i.e., by minimizing the next-token prediction loss, may lie in the fact that language itself encodes human cognition of the world's logic. As the model scales up, it learns more general concepts and can handle more situations, i.e., it achieves a better compression of knowledge. Training MOSS serves as a first step toward validating this hypothesis and, if it holds to an acceptable extent, implementing a Chinese prototype of this promising path.
Despite the success and popularity of large-scale AI assistants, few studies had been publicly disclosed at the time this work started, owing to the expensive annotation and training costs. To that end, the team of researchers from Fudan University presents MOSS, an open-sourced conversational LLM with 16B parameters, in this paper. The development of MOSS comprises three stages: cross-lingual pre-training, supervised fine-tuning, and preference-aware training. Compared with existing efforts (e.g., LLaMA and Stanford Alpaca) in the open-source community, MOSS is distinguished by the following features:
1. Cross-lingual pre-training. At the inception of the MOSS project, researchers encountered significant challenges in training a large-scale, purely Chinese model (such as CPT or Chinese BART) to function as a versatile AI assistant. To address this, researchers initiated the pre-training of the MOSS base model on a diverse dataset comprising 360B English tokens (predominantly sourced from the Pile), 100B Chinese tokens (largely derived from proprietary datasets), and 220B code tokens (mainly extracted from the Pile, BigQuery, and BigPython). This strategy was instrumental in validating their hypothesis that knowledge transfer between Chinese and English is feasible, even in the absence of direct sentence-level alignment between the two languages.
2. Helpful, honest, and harmless (HHH). In contrast to most existing open-sourced models, which mainly focus on improving helpfulness, MOSS is also designed to be honest and harmless. Researchers collect and extend honesty- and harmlessness-related conversational data for supervised fine-tuning (SFT). In addition, researchers perform preference-aware training on additional data to ensure that MOSS is aware of the helpfulness quality of its responses.
3. Alignment with the real-world distribution of user intents. Real-world user prompts are inevitably diverse, making it difficult to optimize LLMs for actual user intents. To this end, researchers deployed an early version of MOSS and collected 100K user prompts submitted through the web application. Their SFT data and preference data are synthesized from a filtered subset of these user prompts, ensuring that the training data of MOSS and the real-world user intents are identically distributed.
4. Preference-aware training. Aligning LLMs with human preferences is becoming a necessary step before public release, as it significantly improves model usability and harmlessness. Existing alignment research usually requires a preference model (also referred to as a reward model) trained from human feedback or AI feedback to measure the quality of model responses with respect to human preferences. The preference model can then be used for rejection sampling or reinforcement learning. The former approach is inefficient, as it requires the model to generate multiple responses at inference time. The latter, a.k.a. reinforcement learning from human feedback (RLHF), is sensitive to hyperparameters and is therefore hard to tune in practice. Instead, researchers employ a preference model to tag model responses with their overall quality. These tags are prepended to the model responses for each round of the conversation (see the sketch after this list). By performing conventional fine-tuning on such preference-tagged conversational data, MOSS is capable of distinguishing high-quality responses from low-quality ones. At inference time, MOSS can generate desired responses by conditioning on specific preference tags, for instance, <quality: 100>.
5. Augmentation with tools. Probabilistic language models are notorious for suffering from "hallucinations", e.g., they often generate outputs containing factual errors or basic arithmetic mistakes. Inspired by recent work on tool-augmented LLMs, researchers perform tool-oriented training to augment MOSS with several tools, namely a search engine, a calculator, an equation solver, and a text-to-image generator. Though the capability of the model itself is not fundamentally improved, researchers observe significant benefits when allowing MOSS to access external tools to answer user queries.
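For concreteness, the preference-tagging scheme in item 4 above can be sketched in a few lines. The tag format <quality: N> is from the paper; the helper names, the 0-100 score scaling, and the stand-in scorer below are illustrative assumptions rather than the authors' actual implementation.

```python
# Hypothetical sketch of preference-tagged conversational data (assumed details).

def tag_conversation(conversation, score_response):
    """Prepend a quality tag to every assistant turn of a conversation.

    score_response(history, response) is assumed to return a preference-model
    score in [0, 1] for a response given the dialogue history.
    """
    tagged = []
    for turn in conversation:
        if turn["role"] == "assistant":
            score = score_response(tagged, turn["content"])
            quality = int(round(100 * score))  # map a [0, 1] score to a 0-100 tag
            tagged.append({"role": "assistant",
                           "content": f"<quality: {quality}> {turn['content']}"})
        else:
            tagged.append(turn)
    return tagged


def build_inference_prompt(history, desired_quality=100):
    """At inference time, condition generation on a desired quality tag."""
    prompt = "".join(f"{t['role']}: {t['content']}\n" for t in history)
    return prompt + f"assistant: <quality: {desired_quality}> "


if __name__ == "__main__":
    dummy_scorer = lambda history, response: 0.87  # stand-in for a preference model
    conversation = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
    print(tag_conversation(conversation, dummy_scorer))
    print(build_inference_prompt(conversation[:1]))
```

Because the tags are simply part of the training sequences, standard next-token fine-tuning suffices to teach the model the association between tag values and response quality, with no reinforcement learning objective involved.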
Researchers conduct automatic evaluations for MOSS, demonstrating significant improvement over its base model and concurrent chat models in terms of model capabilities and real-world user experiences.
Section 2 reviews related large language models, presenting a brief overview of LLMs published within 8 months after the release of ChatGPT. MOSS is one of the pioneering conversational LLMs. In comparison, LLMs released after MOSS were usually pre-trained on far more tokens and with more advanced architectures (e.g., the LLaMA Transformer architecture). Nevertheless, MOSS achieved a competitive interactive experience by employing strong language models to generate complicated training signals, including multi-turn conversational data, preference data, and tool-augmented data. Without any human annotation or RL algorithms, MOSS achieved good performance as an AI assistant, providing a practical recipe for building conversational language models in a cheaper and faster fashion.
Section 3 covers cross-lingual pre-training. The pre-trained MOSS-base exhibited strong capabilities in both Chinese and English after bilingual pre-training. More importantly, the base model can express knowledge in Chinese even when that knowledge exists only in English, not in Chinese, in the pre-training corpora. These preliminary results validated the researchers' hypothesis that knowledge can be transferred between Chinese and English.
The MOSS alignment is presented in Section 4. The alignment is divided into three key phases: model warmup, model alignment with real-world data, and preference modeling. Moreover, researchers investigated the potential of augmenting MOSS with external tools to further enhance its capabilities.
Section 5 describes model warmup with synthetic data. Converting a base language model into an instruction-following assistant is a critical step for LLMs. Typically, existing research accomplishes this through four main methods: in-context prompting, instruction tuning, supervised fine-tuning on labeler demonstrations, and self-instruct. In this work, researchers adopt a method similar to self-instruct for extending prompts, which are then used as the user inputs of the first round of conversations. This section has three parts: the first covers user prompts; the second covers conversational data, in which the generated user prompts are used to construct multi-turn conversations; and the third covers supervised fine-tuning (SFT) on the collected conversational data.
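A rough sketch of the self-instruct-style prompt extension is given below. The sampling sizes, the query_llm callable, and the duplicate filter are assumptions made for illustration; the paper's actual expansion and filtering procedure may differ.

```python
import random

def expand_prompts(seed_prompts, query_llm, rounds=3, examples_per_round=5):
    """Self-instruct-style prompt expansion (illustrative sketch).

    query_llm(instruction) is assumed to return a list of newly generated
    user prompts from a strong LLM given a few in-context examples.
    """
    pool = list(seed_prompts)
    for _ in range(rounds):
        examples = random.sample(pool, k=min(examples_per_round, len(pool)))
        instruction = ("Here are some example user prompts:\n"
                       + "\n".join(f"- {p}" for p in examples)
                       + "\nWrite new, diverse user prompts in the same style.")
        for candidate in query_llm(instruction):
            # Trivial exact-duplicate filter; a real pipeline would use a
            # similarity-based filter and additional quality checks.
            if all(candidate.lower() != p.lower() for p in pool):
                pool.append(candidate)
    return pool

# Usage with a stand-in generator (a strong LLM would be queried in practice):
new_prompts = expand_prompts(["Translate this sentence into French."],
                             query_llm=lambda _: ["Summarize this paragraph."])
```

The expanded prompts then serve as first-round user inputs when constructing the multi-turn conversations used for warmup SFT.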
Section 6 is about alignment with the real-world distribution. After the warmup SFT, the model is capable of carrying out multi-turn conversations with humans and following input instructions. However, the topic distribution of the SFT data largely follows that of the human-written seed prompts, which inevitably fail to cover the diverse user intents found in the real world. To that end, researchers deploy the warmed-up model in a web application to serve public users and collect user data.
Preference modeling is introduced in Section 7. Inspired by the success of InstructGPT and ChatGPT, researchers explore aligning their SFT model with human preferences. Unlike the common practice of collecting human-annotated preference data and performing reinforcement learning from human feedback (RLHF), researchers simulate human preference data with responses of varying quality generated by multiple models and perform preference-aware fine-tuning, instead of PPO, to align MOSS with human preferences.
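This data-simulation step can be pictured roughly as follows: candidate responses of varying quality are collected from several models and scored by a preference model, after which they can be converted into the tagged fine-tuning format sketched earlier. The generator and scorer interfaces below are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative sketch of simulating preference data without human annotation.

def build_preference_examples(prompts, generators, score_response):
    """Collect candidate responses from several models and score each one.

    generators maps a model name to a callable prompt -> response;
    score_response(prompt, response) is assumed to return a quality score.
    """
    examples = []
    for prompt in prompts:
        for name, generate in generators.items():
            response = generate(prompt)
            examples.append({"prompt": prompt,
                             "model": name,
                             "response": response,
                             "score": score_response(prompt, response)})
    return examples
```

Fine-tuning on the resulting score-tagged responses replaces the PPO stage used in typical RLHF pipelines.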
Though scaling up has endowed LLMs with powerful capabilities in language-related tasks, there are still inherent deficiencies that scaling up cannot address. For instance, LLMs cannot access real-time information and are prone to issues such as "hallucinations". They exhibit relatively poor performance on mathematical problems such as numerical calculation and equation solving. In addition, LLMs can only engage in natural language interaction and are incapable of generating data in other modalities such as images. To address these issues, inspired by Toolformer, researchers explore fine-tuning MOSS to incorporate external tools such as search engines, calculators, equation solvers, and drawing tools. In Section 8, researchers provide details of the construction of their tool-augmentation data, the fine-tuning strategy, the implementation details of these tools, and tool-targeted evaluation results.
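To make the tool-augmentation idea concrete, the sketch below shows one plausible way tool calls could be embedded in model outputs and executed; the special tokens, the command syntax, and the calculator executor are assumptions for illustration and are not the paper's exact Toolformer-inspired format.

```python
import ast
import operator

# Hypothetical tool-call markup; the actual special tokens used by MOSS may differ.
CALL_OPEN, CALL_CLOSE = "<|call|>", "<|result|>"

def run_calculator(expression):
    """Safely evaluate a basic arithmetic expression (assumed calculator tool)."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

    def eval_node(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](eval_node(node.left), eval_node(node.right))
        if isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](eval_node(node.operand))
        raise ValueError("unsupported expression")

    return eval_node(ast.parse(expression, mode="eval").body)

def fill_tool_results(model_output):
    """Replace each embedded calculator call with its executed result."""
    while CALL_OPEN in model_output:
        start = model_output.index(CALL_OPEN) + len(CALL_OPEN)
        end = model_output.index(CALL_CLOSE, start)
        expression = model_output[start:end].removeprefix("calculator:").strip()
        result = run_calculator(expression)
        model_output = (model_output[:model_output.index(CALL_OPEN)]
                        + str(result) + model_output[end + len(CALL_CLOSE):])
    return model_output

print(fill_tool_results("The total is <|call|>calculator: 12 * (3 + 4)<|result|>."))
# -> "The total is 84."
```

In this style, the fine-tuned model learns when to emit a tool call, and an external executor fills in the result before the response is returned to the user.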
In this paper, researchers present MOSS, an open-sourced conversational large language model with 16B parameters. The development of MOSS contains three stages: cross-lingual pre-training, supervised fine-tuning, and preference-aware training. Firstly, researchers significantly improved the quality and efficiency of MOSS in generating Chinese texts through vocabulary extension, gradual parameter unfreezing, and cross-lingual pre-training. Secondly, researchers deployed an early version of MOSS as an online application service and synthesized conversational data based on the collected user data, aligning the distribution of the training data with the distribution of real-world user intents. Thirdly, researchers performed preference-aware training to further improve generation quality based on AI feedback. In addition, they also explored training MOSS to use external tools, including a search engine, calculator, equation solver, and text-to-image generator. In conclusion, as an early practice in Chinese conversational large language models, this paper verifies the feasibility of building such models with instruction-following and multi-turn Chinese dialogue capabilities by making full use of relatively small language models and high-quality synthetic data.
See the article:
MOSS: An Open Conversational Large Language Model
http://doi.org/10.1007/s11633-024-1502-8
Journal: Machine Intelligence Research
Article Title: MOSS: An Open Conversational Large Language Model
Article Publication Date: 20-May-2024