Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models

1 Indian Institute of Technology Kharagpur, India
2 Microsoft Corporation, India
3 INSAIT, Sofia University "St. Kliment Ohridski"
NAACL 2025 (Main track)

Our evaluation highlights the disparity in cultural harms produced by the Llama-2(7B) model across the globe. Shade darkness represents propensity toward cultural harm.

Abstract

As LLMs are increasingly deployed in global applications, the importance of cultural sensitivity becomes paramount, ensuring that users from diverse backgrounds feel respected and understood. Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. This work addresses the challenges of ensuring cultural sensitivity in LLMs, especially in small-parameter models that often lack the extensive training data needed to capture global cultural nuances. We present two key contributions: (1) A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and (2) A culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators. These datasets facilitate the evaluation and enhancement of LLMs, ensuring their ethical and safe deployment across different cultural landscapes. Our results show that integrating culturally aligned feedback leads to a marked improvement in model behavior, significantly reducing the likelihood of generating culturally insensitive or harmful content. Ultimately, this work paves the way for more inclusive and respectful AI systems, fostering a future where LLMs can safely and ethically navigate the complexities of diverse cultural landscapes.

Cultural Safety Dataset

We select twelve distinct areas from the World Values Survey (WVS) and Candle that are potentially sensitive in nature and reflect critical social concerns. The WVS is an international research program devoted to the scientific and academic study of the social, political, economic, religious, and cultural values of people across the world, while Candle is a large-scale collection of cultural commonsense knowledge extracted from the web.

Pie charts show the 11 main cultures in our dataset, while the list on the right outlines the key areas that could lead to potential harm within each culture.

Singleturn Global and Local Dataset


Singleturn examples

Multiturn Global and Local Dataset

Multiturn examples

Evaluation on the Cultural Safety Dataset

Key insight for single turn
Overall, in the single-turn setup, for a large majority of models the average attack success rate (ASR) is substantially higher for the Local TestSet than for the Global TestSet. In other words, it is easier to elicit harmful responses when the questions are predominantly local to a culture. We believe the key reason behind this observation is that LLMs are not safety-trained to be sensitive to most of the nuances of individual cultures.
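
To make the metric concrete, here is a minimal sketch of how such an attack success rate could be computed, assuming each model response has already been judged harmful or safe (e.g., by a judge model or annotators). The `EvalRecord` layout and function names are illustrative and not from the paper.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    culture: str      # e.g. "Hindi", "Arabic"
    testset: str      # "global" or "local"
    is_harmful: bool  # judged label for the model's response to one prompt

def attack_success_rate(records, culture=None, testset=None):
    """ASR = fraction of prompts whose response was judged harmful."""
    selected = [
        r for r in records
        if (culture is None or r.culture == culture)
        and (testset is None or r.testset == testset)
    ]
    if not selected:
        return 0.0
    return sum(r.is_harmful for r in selected) / len(selected)

# Toy example: compare Local vs. Global ASR for one culture.
records = [
    EvalRecord("Hindi", "local", True),
    EvalRecord("Hindi", "local", False),
    EvalRecord("Hindi", "global", False),
]
print(attack_success_rate(records, culture="Hindi", testset="local"))   # 0.5
print(attack_success_rate(records, culture="Hindi", testset="global"))  # 0.0
```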


Key insight for multi turn
In summary, we note that heightened vulnerability to adversarial prompts over sustained conversations increases the ASR for the Global TestSet. This suggests that multi-turn dialogues, by introducing greater complexity and context, make models more susceptible to generating harmful responses. For the Local TestSet, on the other hand, extended interactions promote safer responses.
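
The sketch below illustrates, under our own assumptions rather than the paper's exact protocol, how a multi-turn probe differs from a single-turn one: the full conversation history is fed back at every turn, which introduces the extra context that can steer a model off course. Here `generate` and `is_harmful` are placeholders for any chat-model wrapper and safety judge.

```python
def multiturn_probe(generate, turns, is_harmful):
    """Run one multi-turn test scenario against a chat model.

    generate(history) -> assistant reply string (any chat LLM wrapper)
    turns             -> ordered list of user messages for the scenario
    is_harmful(text)  -> safety judge for a single reply

    Unlike the single-turn setup, the full dialogue history is carried
    into every new turn, so earlier exchanges can steer the model.
    """
    history = []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})
        if is_harmful(reply):
            return True   # the attack succeeded at some turn
    return False          # the model stayed safe for the whole conversation
```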

Image 1
Single- and multi-turn performance comparison across various cultures for the Global TestSet.
Image 2
Single- and multi-turn performance comparison across various cultures for the Local TestSet.

Single- and multi-turn performance comparison across various cultures for the Global TestSet. Shade darkness represents propensity toward cultural harm. P4B: Phi(4B), M7B: Mistral-v0.2(7B), Z7B: Zephyr(7B), Q7B: Qwen-2(7B), L27B: Llama-2(7B), L38B: Llama-3(8B), L3.18B: Llama-3.1(8B), L213B: Llama-2(13B), V13B: Vicuna(13B), A: Arabic, B: Bengali, C: Chinese, H: Hindi, J: Japanese, R: Russian, G: German, K: Korean, S: Spanish, P: Portuguese, E: English (US). The same notations are used in the subsequent tables.

Cultural Preference Dataset

To prepare the preference dataset, we follow a procedure similar to that used for the cultural safety dataset, generating questions for both the global and local sets. Distinct seed questions are used, different from those in the evaluation dataset. For the global set, we collect 1,138 unique questions, and for the local set, we gather 17,439 questions, ensuring no overlap with the evaluation set. Along with harmful questions, we also sample ~6,700 safe questions and their answers from the CultureBank dataset. Incorporating these provides a balanced framework that allows for effective training and assessment of models in distinguishing between harmful and safe content.
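
To make the data layout concrete, the sketch below shows one plausible way to assemble such a preference set, pairing each harmful question with a culturally sensitive (chosen) and an insensitive (rejected) response and mixing in the safe Q&A pairs. The DPO-style field names and the handling of safe questions are our assumptions, not the released schema.

```python
import json
import random

def build_preference_set(harmful_items, safe_items, out_path="cultural_prefs.jsonl"):
    """Merge harmful preference pairs and safe Q&A into one DPO-style JSONL file.

    harmful_items: [{"question", "culture", "chosen", "rejected"}, ...]
    safe_items:    [{"question", "culture", "answer"}, ...]
    (Field names are illustrative, not the released schema.)
    """
    records = []
    for item in harmful_items:
        records.append({
            "prompt": item["question"],
            "culture": item["culture"],
            "chosen": item["chosen"],      # culturally sensitive response
            "rejected": item["rejected"],  # culturally insensitive response
        })
    for item in safe_items:
        records.append({
            "prompt": item["question"],
            "culture": item["culture"],
            "chosen": item["answer"],               # safe questions should simply be answered
            "rejected": "I can't help with that.",  # over-refusal treated as the worse option
        })
    random.shuffle(records)
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return records
```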

Image 1
Results obtained from different alignment methods. Shade darkness represents propensity toward cultural harm. Green in the Average column indicates a reduction in ASR.

BibTeX

@misc{banerjee2024navigatingculturalkaleidoscopehitchhikers,
      title={Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models},
      author={Somnath Banerjee and Sayan Layek and Hari Shrawgi and Rajarshi Mandal and Avik Halder and Shanu Kumar and Sagnik Basu and Parag Agrawal and Rima Hazra and Animesh Mukherjee},
      year={2024},
      eprint={2410.12880},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12880}
}