Kielipankki - The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Aku Rouhe tells us about his research on speech recognition.
His current work includes, among other things, fine-tuning large language models that are optimized for Finnish and Nordic languages. These openly available LLMs have been created through successful academia-enterprise collaboration.
Who are you?
I am Aku Rouhe. For several years, I did research in the Aalto University Speech Recognition research group, and defended my doctoral thesis there this past February. After Aalto, I moved to Silo AI (now owned by AMD), where I work with large language models (LLMs) - I have moved from speech to text. My interest in language also carries over into my free time, which I spend on creative writing.
What is your research topic?
In my doctoral thesis, I compared end-to-end models with more traditional multi-model decomposed systems. In recent years, both academic research and commercial deployments in speech recognition have largely moved to end-to-end models. However, my work showed that multi-model decomposed systems remain a competitive alternative, for instance in terms of recognition accuracy. Indeed, the main advantage of end-to-end models is probably their simplicity.
End-to-end models often require vast training resources. Thus, it was important for me to study end-to-end models applied to under-resourced languages as well.
My current work at Silo is on fine-tuning large language models such as Poro and Viking, which are optimized for Finnish and Nordic languages. These LLMs were developed in a collaborative research project between Silo and TurkuNLP.
How is your research related to Kielipankki?
End-to-end models are data-hungry, so large corpora are needed. I was involved in compiling the Aalto Finnish Parliament ASR Corpus 2008-2020, which consists of Finnish Parliament plenary session recordings, and also in the Lahjoita Puhetta project, where volunteers donated their speech to produce the Puhelahjat corpus. I got to combine both of these large speech corpora in an article that was published when I was finalizing my PhD, at a time when I was involved with the LAREINA project. Nowadays, Finnish speech recognition resources are respectable for a language with so few speakers.
Recent publications
Rouhe, A., Grósz, T., Kurimo, M. 2024. Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 623-638.
Virkkunen, A., Rouhe, A., Phan, N. et al. 2023. Finnish parliament ASR corpus. Language Resources and Evaluation, vol. 57, pp. 1645-1670.
Moisio, A., Porjazovski, D., Rouhe, A. et al. 2023. Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Language Resources and Evaluation, vol. 57, pp. 1295-1327.
Rouhe, A., Virkkunen, A., Leinonen, J., Kurimo, M. 2022. Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0. Proc. Interspeech 2022, pp. 3543-3547.