Despite growing interest in AI in the Middle East, Arabic-language models have lagged behind. But a team of academics, researchers and engineers in the United Arab Emirates (UAE) recently unveiled a powerful tool tailored to the world’s Arabic speakers, which its creators say could pave the way for large language model (LLM) systems in other languages that are “underrepresented in mainstream AI.”
Named after the UAE’s highest mountain, “Jais” was created in collaboration between Abu Dhabi’s Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Silicon Valley-based Cerebras Systems, and Inception, a subsidiary of UAE-based AI company G42.
Although ChatGPT, Meta’s LLaMA and other LLMs have Arabic-language capabilities, they were mostly trained on English data on the internet, according to Timothy Baldwin, acting provost and professor of natural language processing at MBZUAI.
Instead, Jais used English and Arabic datasets, with a focus on content from the Middle East, allowing it to go beyond “what anyone else has been able to achieve for Arabic,” Baldwin says.
Languages that use the Latin alphabet dominate the internet, with English by far the most-used. That means datasets are largest in those languages, according to Mohammed Soliman, director of strategic technologies and the cyber security program at the Middle East Institute, in Washington DC.
“Making access to AI tools exclusive to those who can speak specific languages could prevent disadvantaged cross-sections of societies from reaping the benefits of AI,” he told CNN.
Typically, language models trained in English have Western-centric data sets. “[These LLMs] lack awareness of other cultures, adversely affecting the user experience for people of diverse backgrounds,” Soliman added.
As a result of its training, Jais understands cultural nuances and dialects, according to MBZUAI, enabling it to be used more widely across different industries. In future releases, the team aims to have Jais work with images, graphs or tabular data instead of just text, broadening its uses and potentially enabling it to interpret medical scans, investment data or data from satellites.
Different dialects
Arabic is the sixth most-spoken language in the world and is rich with a “constellation” of different dialects, which adds to the complexity of training a language model, Baldwin said. Modern Standard Arabic is typically used for official documents and formal writing, but local dialects are often used on blogs or social media. By training on a diverse set of data, Jais can usually switch between dialects, Baldwin said.
“There’s certainly room for improvement there, but the focus has been more on the robustness in terms of being able to understand if we do have more informal inputs to the model,” Baldwin added.
A recent update allows Google’s Bard to also understand questions in over a dozen Arabic dialects, including Egyptian colloquial Arabic and Saudi colloquial Arabic; the responses are then returned in Modern Standard Arabic.
Jais has 13 billion parameters, and a 30-billion parameter update is in the works, Baldwin said. Parameters quantify the size of a language model, but not necessarily the accuracy. ChatGPT-3.5 has around 175 billion parameters, according to OpenAI.
Jais, like other generative AI models, uses instruction tuning to prevent it from creating “toxic” or “harmful” answers, Baldwin said. It won’t generate anything that could lead to self-harm or damage to others, or that is suggestive of addiction. The responses it generates adhere to local rules and customs on topics such as homosexuality and drugs.
MBZUAI had “various dialogues” with the UAE government and other institutions around responsible AI, which were referenced when developing Jais, according to Baldwin.
Regional developments
There have been growing efforts in the UAE to develop generative AI systems. It was the first country in the world to appoint a minister of AI, in 2017, and the region’s largest generative AI model, Falcon, was unveiled by Abu Dhabi’s Advanced Technology Research Council and the Technology Innovation Institute (TII) in March, with a new iteration released in September.
Although not currently available in Arabic, Falcon is more powerful than Jais in English, with 180 billion parameters, and outperforms competitors such as Meta’s LLaMA 2 based on its ability to reason, code and complete knowledge tests, according to TII. Unlike Google’s Bard and ChatGPT, Falcon and Jais are open-source, which means their code is available for anyone to use or change.
A 2018 report by consulting firm PwC estimated that the Middle East could accrue up to $320 billion in benefits from AI by 2030. The region wants to make sure it has its “own capabilities” in terms of AI, says Ali Hosseini, PwC’s Middle East chief digital officer.
“Some of the best open-source models are actually developed in our region,” Hosseini added, referencing Falcon and Jais.
Its makers hope that Jais will further the development of generative AI in the Middle East. “This is kind of step one of many future steps,” Baldwin said. “Not just for Arabic large language models, but elsewhere.”