Spoken by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the challenges posed by this linguistic diversity. It encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmark’s design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.
Category: Sector: Inclusion
Hard Questions: Inclusion
-

Data Journalism Appropriation in African Newsrooms: A Comparative Study of Botswana and Namibia
Data journalism has received relatively limited academic attention in Southern Africa, with even less focus on smaller countries such as Botswana and Namibia. This article seeks to address this gap by exploring how selected newsrooms in these countries have engaged with data journalism, the ways it has enhanced their daily news reporting, and its impact on newsgathering and production routines. The study reveals varied patterns in the adoption of technology for data journalism across the two contexts. While certain skills remain underdeveloped, efforts to train journalists in data journalism have been evident. These findings support the argument that in emerging economies, the uneven adoption of data journalism technologies is influenced by exposure to these tools and practices.
-

Advancements in Modern Recommender Systems: Industrial Applications in Social Media, E-commerce, Entertainment, and Beyond
In the current digital era, the proliferation of online content has overwhelmed users with vast amounts of information, necessitating effective filtering mechanisms. Recommender systems have become indispensable in addressing this challenge, tailoring content to individual preferences and significantly enhancing user experience. This paper delves into the latest advancements in recommender systems, analyzing 115 research papers and 10 articles, and dissecting their application across various domains such as e-commerce, entertainment, and social media. We categorize these systems into content-based, collaborative, and hybrid approaches, scrutinizing their methodologies and performance. Despite their transformative impact, recommender systems grapple with persistent issues like scalability, cold-start problems, and data sparsity. Our comprehensive review not only maps the current landscape of recommender system research but also identifies critical gaps and future directions. By offering a detailed analysis of datasets, simulation platforms, and evaluation metrics, we provide a robust foundation for developing next-generation recommender systems that deliver more accurate, efficient, and personalized user experiences, and we aim to inspire innovative solutions that drive forward the evolution of recommender technology.
-

Unravelling socio-technological barriers to AI integration: A qualitative study of Southern African newsrooms
This study explores the socio-technological barriers to the adoption of artificial intelligence (AI)-powered solutions in five countries of the Global South: South Africa, Lesotho, Eswatini, Botswana and Zimbabwe. Through 20 in-depth interviews with key stakeholders, it examines the distribution and circulation of AI technologies within selected newsrooms. Furthermore, the article explores socio-technological obstacles to the integration of AI among journalists. Lastly, it examines the consequences of these socio-technological obstacles for journalism. The article specifically seeks to answer three questions: How are AI technologies integrated in southern African newsrooms? What are the socio-technological barriers attendant to the use of AI in selected news organisations of sub-Saharan Africa? What are the implications of these socio-technological barriers for the process of news production in these newsrooms?
-

Decoding the Diversity: A Review of the Indic AI Research Landscape
This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper takes a deep dive into the recent advancements in Indic generative modeling, contributing a taxonomy of research directions and tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the distinctive linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.





