Italian Joins Semantic Web with LiITA Project

Università Cattolica del Sacro Cuore

Italian texts, lexicons, and dictionaries are just a click away, interacting seamlessly to form a bridge between words and knowledge. Users can see where words are used in texts (their occurrences) and, from that, anticipate where they will be used, within a network that interlinks Italian language resources. This network fosters dialogue between resources, revealing new perspectives and enabling the development of artificial intelligence models for advanced linguistic analyses. These are some of the main objectives of LiITA (Linking Italian), an observatory for digital linguistic data on the Italian language, both textual and lexical. The project is building an interoperable Knowledge Base (KB) for Italian language resources (from dictionaries to texts, ancient and modern), following the principles of the Linked Data paradigm used in the Semantic Web.

The LiITA project will be presented at CLiC-it 2024, the Tenth Italian Conference on Computational Linguistics, scheduled to take place in Pisa from December 4 to 6, 2024 (https://clic2024.ilc.cnr.it/). A paper titled "The Lemma Bank of the LiITA Knowledge Base of Interoperable Resources for Italian" will also appear in the conference proceedings.

Funded by Italy's Ministry of University and Research through a PRIN-2022 PNRR grant of €237,695, the LiITA project is led by the Catholic University of the Sacred Heart, Milan campus, under the coordination of Dr. Eleonora Litta, in collaboration with the University of Turin. According to Professor Marco Passarotti, a computational linguistics expert at the Faculty of Linguistic Sciences and Foreign Literatures at the Catholic University, the architecture of the LiITA Knowledge Base is straightforward and portable to any language.

The foundation of LiITA is a large collection of lemmas, the canonical citation forms of words (as they appear in dictionary entries). Each lemma will be linked online to its occurrences in various Italian textual corpora connected to the Knowledge Base, as well as to its entries in different lexicons and dictionaries. Professor Passarotti explains: "The result will be a vast knowledge graph made up of nodes (e.g., lemmas and their occurrences) and the relationships between them."

"This graph will not only facilitate the extraction of information from the interoperable language resources enabled by LiITA, but will also help fine-tune artificial intelligence models, supporting the development of applications for Italian language analysis across various fields, from research to publishing, from medicine to the web," he adds.

Professor Passarotti emphasizes that projects like LiITA, which merge data and technology, represent a pivotal moment in linguistics, driven by the rise of artificial intelligence, itself based on models of natural language processing. "We are witnessing the first industrial-technological revolution that directly affects the most humanistic object of study: language. It's a paradigm shift: the discipline that studies it cannot ignore it," he stresses.

This new approach is also shaping the teaching of tomorrow, underscoring the urgency of innovation to counteract the worrying decline in interest in linguistic studies. Data show, for example, that Italian universities offering degrees in Linguistic Mediation, Linguistic Sciences, and Linguistic Sciences for International Relations saw applications drop by an average of 9.4% from the 2022–23 academic year to 2023–24. Similar declines are observed in Modern Languages and Literatures and other humanities disciplines.

However, in the era of artificial intelligence, linguistics is both the present and the future. Studying it will unlock new technologies beneficial to all sectors.

A Predecessor: The LiLa Project

LiITA builds on the LiLa (Linking Latin) project, which developed a similar knowledge base for Latin, supported by a €2 million ERC grant. LiLa has built a collection of over 200,000 lemmas and integrated diverse Latin linguistic resources, assigning unique and persistent identifiers to each lemma, textual occurrence, and dictionary entry. LiLa's Knowledge Base is continuously expanding.

LiLa integrates several language resources for Latin (corpora, dictionaries, lexical resources) into a unified structure, using lemmas as the central node to interlink data from different sources. "Each lemma, word occurrence in texts, and lexical entry in dictionaries is assigned an identifier," Professor Passarotti explains, "thus enabling their interaction on the basis of relations whose meaning is processable by machines."
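The idea of a lemma acting as the central node can be illustrated with a minimal Python sketch. It is not LiLa's or LiITA's actual data model: the class name `LemmaHub`, the resource names (`corpus_a`, `dictionary_x`), and the identifiers are all hypothetical, chosen only to show how one lemma can gather occurrences and dictionary entries from separate resources.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LemmaHub:
    """Toy stand-in for a lemma bank: each lemma points to linked resources."""

    # lemma -> list of (corpus name, token position); all names are made up
    occurrences: dict = field(default_factory=lambda: defaultdict(list))
    # lemma -> list of (dictionary name, entry id); all names are made up
    entries: dict = field(default_factory=lambda: defaultdict(list))

    def link_occurrence(self, lemma, corpus, position):
        """Record that a token in some corpus is an occurrence of the lemma."""
        self.occurrences[lemma].append((corpus, position))

    def link_entry(self, lemma, dictionary, entry_id):
        """Record that a dictionary entry describes the lemma."""
        self.entries[lemma].append((dictionary, entry_id))

    def lookup(self, lemma):
        """Return everything linked to one lemma, across all resources."""
        return {
            "occurrences": list(self.occurrences.get(lemma, [])),
            "entries": list(self.entries.get(lemma, [])),
        }


hub = LemmaHub()
hub.link_occurrence("verbum", "corpus_a", 12)
hub.link_occurrence("verbum", "corpus_b", 87)
hub.link_entry("verbum", "dictionary_x", "e001")

result = hub.lookup("verbum")
print(result)
```

Because every resource attaches to the lemma rather than to another resource, adding a new corpus or dictionary in this sketch does not require touching the ones already linked.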

"The architecture of LiLa is language-independent and can be adopted for any language," the professor points out. "Everything is based on triples featuring a subject, an object, and a relation."
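As a rough illustration of the triple structure the professor describes, the following Python sketch stores a few subject-relation-object statements and queries them by pattern. The `http://example.org/` namespace, the relation names `hasLemma` and `describes`, and the identifiers are assumptions invented for this example; they are not the identifiers or vocabulary used by LiLa or LiITA. The Italian lemma "parola" ("word") is used only as sample data.

```python
# Hypothetical namespace and relation names, invented for this sketch.
BASE = "http://example.org/"


def uri(kind, local_id):
    """Build an illustrative identifier for a node or a relation."""
    return f"{BASE}{kind}/{local_id}"


HAS_LEMMA = uri("relation", "hasLemma")
DESCRIBES = uri("relation", "describes")
PAROLA = uri("lemma", "parola")

# Each statement is a (subject, relation, object) triple.
triples = {
    (uri("token", "corpus_a_12"), HAS_LEMMA, PAROLA),
    (uri("token", "corpus_b_87"), HAS_LEMMA, PAROLA),
    (uri("entry", "dictionary_x_e001"), DESCRIBES, PAROLA),
}


def match(store, subject=None, relation=None, obj=None):
    """Return the sorted triples that fit the pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    )


# Which tokens, in any corpus, are occurrences of the lemma "parola"?
tokens_of_parola = match(triples, relation=HAS_LEMMA, obj=PAROLA)
for subject, _, _ in tokens_of_parola:
    print(subject)

# Everything that points at the lemma, whatever the relation.
linked_to_parola = match(triples, obj=PAROLA)
```

Nothing in the sketch is specific to Italian or Latin: swapping in lemmas from another language changes the data but not the structure, which is the sense in which such an architecture is language-independent.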

Professor Passarotti concludes: "The beauty of knowledge bases like LiLa or LiITA is that they can be used as a source of data, metadata, and explicit relationships between them to refine artificial intelligence models."
