OpenGPT-X Unveils Multilingual Open Source Model

Forschungszentrum Juelich

26 November 2024

The large language model of the OpenGPT-X research project is now available for download: "Teuken-7B" contains seven billion parameters and has been trained from scratch in all 24 official languages of the European Union (EU), with the help of experts from Forschungszentrum Jülich and the JUWELS supercomputer. Researchers and companies can leverage this commercially usable open source model for their own artificial intelligence (AI) applications. Funded by the German Federal Ministry of Economic Affairs and Climate Action (BMWK), the OpenGPT-X consortium - led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS - has developed a large language model that is open source and has a distinctly European perspective.

The JUWELS supercomputer at the Jülich Supercomputing Centre (JSC) was used, among other systems, to train Teuken-7B. Copyright: Forschungszentrum Jülich / Sascha Kreklau

Teuken-7B is currently one of the few large language models developed multilingually from the ground up. Approximately 50 percent of its pre-training data is non-English, and the model has proven stable and reliable in its performance across multiple languages. This provides added value, particularly for international companies and organizations with multilingual communication requirements, products and services. The open source model allows companies and organizations to run their own customized models in real-world applications, so sensitive corporate data can remain within the company.
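To make the "data stays in-house" point concrete, the following is a minimal sketch of running the model locally with the Hugging Face transformers library. The model ID, the trust_remote_code flag and the generation settings are assumptions based on typical Hugging Face usage rather than details from this press release; the openGPT-X page on Hugging Face has the authoritative instructions.

```python
# Minimal local-inference sketch. Assumptions: the model ID and the
# trust_remote_code requirement are taken from typical Hugging Face
# usage, not from this press release; verify against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"  # assumed ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # place weights on the available GPU(s)
    trust_remote_code=True,
)

# Prompts and outputs never leave the local machine.
prompt = "Fasse die Vorteile mehrsprachiger Sprachmodelle kurz zusammen."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```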

In addition to model training, the OpenGPT-X team also addressed a number of research questions, such as how to train and operate multilingual AI language models in a more energy- and cost-efficient way. To this end, the project developed a multilingual "tokenizer". The task of a tokenizer is to break down words into individual word components - the fewer tokens required, the more energy-efficiently and quickly a language model can generate an answer. The new tokenizer reduces training costs compared to the tokenizers of other multilingual models such as Llama 3 or Mistral. This is particularly valuable for European languages with longer word structures such as German, Finnish or Hungarian.
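To illustrate the efficiency claim, here is a hedged sketch that measures tokenizer "fertility" (average tokens per word) on a German sentence with long compound words; fewer tokens per word generally means fewer generation steps for the same text. The model IDs are illustrative assumptions and should be checked against the respective Hugging Face pages.

```python
# Compare tokens-per-word ("fertility") across tokenizers.
# Lower fertility = fewer tokens to process for the same text.
# The model IDs below are illustrative assumptions.
from transformers import AutoTokenizer

text = ("Die Donaudampfschifffahrtsgesellschaft veröffentlichte ihren "
        "Geschäftsbericht über die Binnenschifffahrtsabgaben.")

for model_id in [
    "openGPT-X/Teuken-7B-instruct-commercial-v0.4",  # assumed ID
    "mistralai/Mistral-7B-v0.1",
]:
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    print(f"{model_id}: {n_tokens} tokens, "
          f"{n_tokens / n_words:.2f} tokens/word")
```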

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing large amounts of data, leveraging powerful European HPC infrastructure and performing efficient model training. Teuken-7B was trained on the JUWELS supercomputer at Forschungszentrum Jülich. This computer is currently the fastest of its kind in Germany and is equipped with 3,744 NVIDIA A100 GPUs for training large AI models. The expertise gained from the OpenGPT-X project has already been applied to the procurement of the first European exascale supercomputer, JUPITER, which is currently being built at Forschungszentrum Jülich. Starting next year, it will offer many times greater performance for developing complex AI models in Germany and Europe.

In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the consortium's partners include TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert, Westdeutscher Rundfunk (WDR) and the German AI Association (KI Bundesverband). The technology developed in OpenGPT-X will also provide the partners with a basis for training their own models in the future.

>>> Press release of the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
