ChatGPT claims to be able to “generate coherent and understandable text” in about 25 languages. Basque is not on that list. The language model has some ability to interact in Basque, but lacks the precision with which it produces text in languages with billions of speakers. It seems unlikely that OpenAI counts among its priorities improving the model’s skills in a language spoken by some 800,000 people worldwide. This feeds the “digital divide” that exists in this type of technology, according to Eneko Agirre, director of the Basque Center for Language Technology (HiTZ) at the University of the Basque Country (UPV). For this reason, the group of specialists he leads is working on a chatbot specific to Basque, which they have named Latxa and which already surpasses GPT-3.5 “in all evaluations.” But they don’t stop there: “We will be the first to create a model as good as GPT-4.”
Agirre has dedicated his entire professional career as a computer scientist to language processing. At 21, while pursuing his degree at the UPV, he obtained a scholarship to work on the first analyzer for Basque. “It is a very attractive topic if you are intellectually curious about how thinking, speaking and languages work, why there are so many different languages and how to computerize language,” he explains by video call.
Since 2020, this computer scientist has directed HiTZ, a center that aims to promote research, training, technology transfer and innovation in artificial intelligence focused on language and speech. The multidisciplinary team is made up of computer scientists, linguists and engineers. The project to create Latxa was born from the concern that languages like Basque lack the digital tools that majority languages already have.
“There is 1,000 times more data for English than for Basque and 100 times more for Spanish than for Basque. We were worried that for this language there were no tools for people to use, because this could widen the digital divide between the largest and smallest languages,” says the director of HiTZ. Agirre claims that the smaller the language, the “worse” ChatGPT works. In the case of Basque, he says, although it can generate text, “there are always grammatical errors.”
Latxa (wool in Basque) was given that name because it was inspired by the LLaMA model from the company Meta. Agirre says that they did not want to hide that they were inspired by LLaMA and that, since this name recalls the animal, they associated it with the wool of the latxa sheep of the Basque Country.
Feeding it with text
To create a system like Latxa, explains Agirre, three elements are needed. The first is a team of “leading researchers and engineers,” because “there are not many people in the world who can do it.” The second is text: the more text the model consumes, the better the quality of the results. The third is supercomputing, because processing all these texts can only be done with this technology. For Latxa, the HiTZ team managed to gain access to the LEONARDO supercomputer, located at the Tecnopolo in Bologna (Italy).
Regarding the algorithm, Agirre points out that it is the same one used by all language models. With this algorithm, a process must be carried out so that it “learns about the world,” which consists of providing it with texts so that it processes the information and learns to make connections between words. “So what the algorithm learns is which words are most likely to come next in any given stretch of text. It seems like little, but to do that you have to learn a lot about grammar, morphology and the world,” says the computer scientist.
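To make the idea of “learning which words are most likely” concrete, here is a deliberately tiny sketch in Python. It is not the algorithm behind Latxa or ChatGPT, which rely on large neural networks; it is a simple word-pair counter (a bigram model), swapped in purely for illustration. The function names and the three-sentence toy corpus are hypothetical and do not come from HiTZ.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair each word with the word that comes right after it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw counts for one word into probabilities of the next word."""
    followers = counts.get(word.lower())
    if not followers:
        return {}  # the word never appeared with a follower in the corpus
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()}


# Hypothetical toy corpus; a real model reads billions of words.
toy_corpus = [
    "the sheep eats grass",
    "the sheep gives wool",
    "the shepherd shears the sheep",
]

model = train_bigram_model(toy_corpus)
probs = next_word_probabilities(model, "the")
print(probs)  # {'sheep': 0.75, 'shepherd': 0.25}
```

Even this toy version shows why the amount of text matters: a word the model has never seen followed by anything yields no prediction at all, which is roughly the situation a low-resource language faces at a much larger scale.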
According to Agirre, “almost everything” that ChatGPT knows how to do has been learned through this process of reading and acquiring common sense, which is the first big step. It is also the most expensive one, since according to the expert it requires millions of dollars. In the case of HiTZ, the center obtained resources from the Basque Government and the European Recovery Funds to develop the project.
Once the system can understand the language, Agirre explains, what follows is “teaching it how to interact with users,” a broad process that ranges from not saying “bad words” to not explaining “how to make a bomb or how to kill your father-in-law.”
The language model race
In May, the Minister for Digital Transformation and Public Service, José Luis Escrivá, presented a government initiative to promote the implementation of a language model in Spanish and the co-official languages. Escrivá and the Minister for Culture, Ernest Urtasun, also chaired the first meeting of the institutions involved in the Governance Agreement to Generate Models and Corpus for a public infrastructure of Language Models. Since the new revolution of generative artificial intelligence began, the European Union has been concerned about not being left behind in the development and regulation of this technology.
“Technology itself is an end, because right now there is a race all over the world to master it. If a country does not invest in these models, it will not have trained people,” explains Agirre. For the computer scientist, there is no reason to sit idly by waiting for OpenAI to develop a good model for Spanish or any other language. The director of HiTZ believes that this power should not remain in only a few hands and that open models must be developed that companies in Spain and Europe can use without having to depend on Silicon Valley.
Latxa is an example of what can be achieved locally by investing in language models. “Not only will we be one of the first groups to create a language model that is as good as GPT-4 in linguistic competence, but we are already better than GPT-4 in Basque grammar,” explains Agirre.
The director of HiTZ is clear that the development of this technology has cultural and identity significance: “Just as it was important to have press, radio, television or education in a language, technology is also important, because otherwise the gap that exists between widely used and less used languages will increase.”