ChatGPT Biases – How Diverse Data Shapes a Language Model

The widespread application of advanced AI language models like OpenAI’s ChatGPT, based on the GPT-4 architecture, has transformed fields like virtual personal assistants and content generation. While ChatGPT’s capabilities are impressive, its accuracy and reliability are repeatedly called into question when it answers queries in different languages. What is behind these doubts?

NewsGuard, a fact-checking organization, recently reported that ChatGPT is more likely to generate false information in Chinese than in English. According to the report, during an April 2023 evaluation NewsGuard gave ChatGPT-3.5 seven different prompts in English, simplified Chinese, and traditional Chinese. The objective of the test was to have the chatbot craft news articles espousing prevalent China-related misinformation narratives.

In English, ChatGPT declined to produce the false claims for six of the seven prompts, even when pressed repeatedly with leading questions. In stark contrast, the chatbot generated the false claims all seven times in both simplified and traditional Chinese.
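To put those counts side by side, the figures reported above can be turned into per-language rates. This is only a minimal sketch: the counts come from the NewsGuard figures quoted in this article (one false claim in English is inferred from six refusals out of seven), while the dictionary layout and the function name `false_claim_rate` are illustrative assumptions, not part of NewsGuard's methodology.

```python
# Counts taken from the figures quoted in the article; names and layout
# here are hypothetical and chosen only for illustration.
PROMPTS_PER_LANGUAGE = 7

false_claims = {
    "English": 1,               # six of seven prompts refused, so one produced
    "Simplified Chinese": 7,    # false claim produced for every prompt
    "Traditional Chinese": 7,   # false claim produced for every prompt
}


def false_claim_rate(produced, total=PROMPTS_PER_LANGUAGE):
    """Return the share of prompts that yielded a false claim."""
    return produced / total


for language, produced in false_claims.items():
    rate = false_claim_rate(produced)
    print(f"{language}: {produced}/{PROMPTS_PER_LANGUAGE} ({rate:.0%})")
```

With only seven prompts per language, these percentages describe this one test rather than ChatGPT's behavior in general.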

Data and Training – The Backbone of AI Language Models

According to experts, the primary reason behind ChatGPT’s uneven performance across languages is the data and training process. Language models are constructed using massive text datasets from diverse sources like books, articles, and websites. The quality and quantity of data available for different languages directly impact the AI model’s performance.

“The more data available for a language, the better the model can learn its intricacies and provide accurate and reliable responses. Unfortunately, not all languages have equal representation in the available data,” said Maria Toneva, an AI and NLP researcher.

Experts also note that while these models have multilingual capabilities, the languages do not inherently influence each other. They coexist as separate yet connected portions of the dataset, and the model currently lacks a mechanism to compare phrases or predictions across these distinct areas.

Given this, the model is more likely to produce inaccurate or misleading information in languages with a smaller online presence, less diverse data sources, or complex grammar and syntax. In some cases, the AI model may generate outputs that seem to “lie” because it fails to grasp the nuances of the language.

Another contributing factor to ChatGPT’s language-based disparity may be the training data’s cultural nuances and inherent biases.

Since the AI model learns from existing text, it may inadvertently absorb and reproduce cultural biases and stereotypes in the data. Consequently, the AI system may sometimes provide biased or culturally insensitive responses in certain languages.

Addressing the Challenges

Addressing the disparity in ChatGPT’s performance requires a multi-faceted approach. Researchers and developers are actively working to improve data quality and expand the representation of underrepresented languages. One such effort involves the collection of more diverse, high-quality data sources that accurately reflect linguistic variations and cultural nuances.

It’s not merely about more propaganda in one language versus another but also about subtle biases or beliefs.

Additionally, developers are focusing on addressing the biases present in the training data. Techniques like fairness-aware machine learning and the implementation of external human feedback loops can help mitigate bias and improve the overall performance of AI systems across languages.

Collaboration between academia, industry, and communities is also essential to raise awareness of the challenges faced by AI language models and to share knowledge, resources, and best practices in developing inclusive AI systems.

This report serves as a reminder that when ChatGPT or a comparable model provides an answer, it is essential to question the source of that answer and the trustworthiness of the data upon which it is based instead of solely relying on the model’s response.

Diego Lupo
