1. Overview of the Released LLM
(1) Computing Resources
- Pre-training on approximately 0.4 trillion tokens was conducted using cloud resources provided by Google Cloud Japan with the support of the Ministry of Economy, Trade and Industry’s GENIAC project.
- Subsequently, pre-training was continued up to about 0.7 trillion tokens, and tuning was carried out, using cloud resources from SAKURA Internet procured with a grant from the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
(2) Model Training Corpus(*1)
- The corpus consists of approximately 2.1 trillion tokens, of which about one-third (roughly 0.7 trillion tokens) has been covered by the pre-training completed so far.
- Japanese: About 592 billion tokens
▪️Japanese text extracted from the full Common Crawl (CC) archive
▪️Data crawled from websites based on URLs collected by the National Diet Library’s Web Archiving Project (WARP), with URL lists provided by the Library
▪️Japanese Wikipedia
▪️The summary text of each research project in KAKEN (the Database of Grants-in-Aid for Scientific Research)
- English: About 950 billion tokens (Dolma, etc.)
- Other languages: About 1 billion tokens (Chinese and Korean)
- Program code: About 114 billion tokens
- These add up to around 1.7 trillion tokens; approximately 0.4 trillion of the Japanese tokens are used twice, resulting in about 2.1 trillion tokens (see the arithmetic sketch below).
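The quoted totals can be verified with simple arithmetic. The sketch below is a minimal sanity check using the token counts listed above; the rounding to 1.7 and 2.1 trillion follows the text.

```python
# Corpus size sanity check, using the token counts quoted above (in billions).
japanese = 592   # Common Crawl (Japanese), WARP-based crawl, Wikipedia, KAKEN
english = 950    # Dolma, etc.
other = 1        # Chinese and Korean
code = 114       # program code

unique_total = japanese + english + other + code
print(f"Unique tokens: {unique_total / 1000:.2f} trillion")      # 1.66 -> "around 1.7 trillion"

# Approximately 0.4 trillion of the Japanese tokens are used twice.
reused = 400
training_total = unique_total + reused
print(f"Training tokens: {training_total / 1000:.2f} trillion")  # 2.06 -> "about 2.1 trillion"
```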
(3) Model
- Number of parameters(*2): Approximately 172 billion (172B)
- Model architecture: Llama-2 based
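As a rough illustration of how a count on this scale arises, the sketch below estimates the size of a Llama-2-style decoder-only transformer. The hyperparameter values are assumptions chosen to land near 172B, not the released model’s published configuration.

```python
# Parameter-count estimate for a Llama-2-style decoder-only transformer.
# The breakdown follows the Llama-2 architecture (RMSNorm, SwiGLU MLP,
# untied input/output embeddings); the hyperparameter values below are
# illustrative assumptions, NOT the released model's actual configuration.
def llama2_style_params(vocab: int, d: int, layers: int, d_ff: int) -> int:
    embeddings = 2 * vocab * d  # input embeddings + untied LM head
    attn = 4 * d * d            # Q, K, V, O projections per layer
    mlp = 3 * d * d_ff          # SwiGLU: gate, up, and down projections
    norms = 2 * d               # two RMSNorm weight vectors per layer
    return embeddings + layers * (attn + mlp + norms) + d  # + final RMSNorm

# Hypothetical GPT-3-scale settings that land near 172B:
print(llama2_style_params(vocab=96_000, d=12_288, layers=96, d_ff=31_744))
# -> 172,683,964,416, i.e. approximately 172 billion parameters
```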
(4) Tuning
- Tuning experiments were conducted using 13 types of Japanese instruction data and translations of English instruction data.
(5) Evaluation
- The model was evaluated using "llm-jp-eval v1.3.1," a cross-evaluation framework developed by LLM-jp based on 22 types of existing Japanese language resources. The pre-trained model, trained on 0.7 trillion tokens, achieved a score of 0.548.
- It was also evaluated using the "llm-leaderboard (g-leaderboard branch)," the framework used for performance assessment in the GENIAC project. The tuned model, trained on 0.7 trillion tokens, achieved a score of 0.529.
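For readers unfamiliar with headline scores such as 0.548, the sketch below illustrates the general idea of a cross-evaluation aggregate: each task yields a metric normalized to [0, 1], and the headline score is their mean. The task names and values are toy examples, not llm-jp-eval’s actual task list, metrics, or weighting.

```python
# Toy illustration of a cross-evaluation aggregate score. Task names and
# values are hypothetical; this is NOT llm-jp-eval's actual computation.
task_scores = {
    "reading_comprehension": 0.61,
    "natural_language_inference": 0.55,
    "question_answering": 0.49,
    "summarization": 0.53,
}
aggregate = sum(task_scores.values()) / len(task_scores)
print(f"aggregate score: {aggregate:.3f}")  # 0.545 for these toy values
```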
(6) URL for the Released Model, Tools, and Corpus
- https://llm-jp.nii.ac.jp/en/release/
- Note: Although the released model has undergone tuning for safety, it is still in the preview stage and not intended for direct use in practical services. The preview version will be provided under a limited license to approved applicants.
2. Future Plans
- To utilize LLMs effectively in society, ensuring transparency and reliability is essential, and as models become more advanced, safety considerations will become increasingly important. In response, NII established the Research and Development Center for Large Language Models in April 2024 with support from MEXT’s project "R&D Hub Aimed at Ensuring Transparency and Reliability of Generative AI Models" (p. 7, https://www.mext.go.jp/content/20240118-ope_dev03-000033586-11.pdf, in Japanese). NII will continue to promote research and development using the released models and those yet to be built, contributing to the advancement of LLM research.
- For the model released this time, all checkpoints have been saved, from every 1,000 steps along the way up to the final checkpoint (100k steps), and will be made available.
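If the checkpoints are published through a Hugging Face-style hub once available, an intermediate step could plausibly be loaded by revision, as in the minimal sketch below. The repository ID and revision tag are placeholders, not confirmed identifiers; the release page above is the authoritative source, and access may require approval under the limited license noted earlier.

```python
# Hedged sketch: loading an intermediate checkpoint with the Hugging Face
# `transformers` library, assuming each saved step is exposed as a git
# revision. The repository ID and revision naming scheme are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "llm-jp/llm-jp-172b"  # placeholder repository ID
step = "step-50000"          # placeholder tag for one of the 1k-step checkpoints

# Access may require an auth token if the preview license gates downloads.
tokenizer = AutoTokenizer.from_pretrained(repo, revision=step)
model = AutoModelForCausalLM.from_pretrained(repo, revision=step)
```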
(Reference 1) Overview of LLM-jp
- LLM-jp, organized by NII, consists of over 1,700 participants (as of September 17, 2024) from universities, companies, and research institutions, centered on researchers in natural language processing and computer systems. LLM-jp shares information on LLM research and development through hybrid meetings, online sessions, and Slack, and conducts joint research on building LLMs. Specific activities include:
- Promoting the development of open LLMs proficient in Japanese and related research.
- Regular information exchange on model building expertise and the latest research developments.
- Fostering collaboration across institutions by sharing data and computing resources.
- Publishing outcomes such as models, tools, and technical documentation.
- LLM-jp has established working groups such as the "Corpus Construction WG," "Model Construction WG," "Fine-tuning & Evaluation WG," "Safety WG," "Multi-modal WG," and "Real Environment Interaction WG." These groups, led respectively by Professor Daisuke Kawahara of Waseda University, Professor Jun Suzuki of Tohoku University, Professor Yusuke Miyao of the University of Tokyo, Project Professor Satoshi Sekine of NII, Professor Naoaki Okazaki of Tokyo Tech, and Professor Tetsuya Ogata of Waseda University, are engaged in research and development activities. The initiative is also propelled by the contributions of many others, including Professor Kenjiro Taura of the University of Tokyo, Associate Professor Yohei Kuga of the University of Tokyo (computational-resource utilization technologies), and Professor Rio Yokota of Tokyo Tech (parallel computation methods).
- For more details, visit the website: https://llm-jp.nii.ac.jp/en/
(Reference 2) Support for This Achievement
This achievement was made possible through a grant from the New Energy and Industrial Technology Development Organization (NEDO) and a subsidy from MEXT.
---------------------------------------------
(*1) Corpus: A database that stores large amounts of natural language text in a structured manner.
(*2) Number of Parameters: Large language models are massive neural networks trained on language data, and the number of parameters is one indicator of a network’s size. More parameters are generally believed to indicate higher performance.
###
About the National Institute of Informatics (NII)
NII is Japan's only academic research institute dedicated to the new discipline of informatics. Its mission is to "create future value" in informatics. NII conducts both long-term basic research and practical research aimed at solving social problems in a wide range of informatics research fields, from fundamental theories to the latest topics, such as artificial intelligence, big data, the Internet of Things, and information security.
As an inter-university research institute, NII builds and operates academic information infrastructure essential for the research and educational activities of the entire academic community (including the Science Information Network), and develops services such as those enabling the provision of academic content and service platforms. https://www.nii.ac.jp/en/
About the Research Organization of Information and Systems (ROIS)
ROIS is the parent organization of four national institutes (the National Institute of Polar Research, the National Institute of Informatics, the Institute of Statistical Mathematics, and the National Institute of Genetics) and the Joint Support-Center for Data Science Research. ROIS's mission is to promote integrated, cutting-edge research that goes beyond the barriers of these institutions, in addition to facilitating their research activities as inter-university research institutes.