The Fault in Our Data
"Siri, call Bonnie Sun."
I was recently trying to get my voice assistant to call my friend. Siri responded, "Sorry, I didn't get that."
I tried again. No luck. I grew frustrated. And then I realized the problem: my friend's last name is pronounced like sun rather than tsun (with a slight T sound at the beginning and a short U sound in the middle), as I'd say it in my native Chinese. Finally, in a flurry of frustration, I switched the phone's settings from English to Mandarin, confident that Siri would at last understand what I was saying. I tried again, and she still didn't understand me.
I probably wouldn't have thought twice about this if I hadn't noticed how accurately Siri understands my boyfriend, a native English speaker. The episode made me wonder: Why do AI systems treat languages differently? What is missing during development that leaves these systems unable to recognize people speaking fluently in different languages?
The answer, I've learned, lies largely in uneven data collection.
TIME magazine recently published a story, based on findings from a UNESCO report, about how voice assistants often bolster gender bias. The story underscores that "the female voices and personalities projected onto AI technology reinforces the impression that women typically hold assistant jobs and that they should be docile and servile." And as my own experience taught me, this bias can also be racial in nature.
Indeed, many researchers have demonstrated how existing inequalities can be deepened by the ballooning presence of AI systems. More specifically, the development of this particular technology has tended to benefit already-advantaged individuals, bringing more convenience to their daily lives while leaving many others without access to some technical features.
The question, then, is: Where do these AI problems come from in the first place?
Many of AI's biases are essentially invisible: difficult to detect because they're embedded in data and code. As algorithms become more sophisticated and "smarter," sometimes even the programmers who build them cannot fully explain how a given decision was reached. And with high-level, low-explainability techniques, such as natural-language processing, people have even more difficulty recognizing biases in an AI system.
What this means is that, even though larger volumes and wider varieties of data, coupled with new techniques for analysis, are now available, various defects persist during data collection, data mining, and algorithmic processing. In particular, there are three primary ways that data analytics can discriminate: oversampling and overrepresenting specific demographic groups, "inheriting" prejudice from past data patterns, and using proxies when selecting individuals.
In simplest terms, oversampling and overrepresentation mean that too much data is disproportionately collected from certain demographic groups. One example is Google's facial recognition function, which labeled images of black people as gorillas due to a lack of black American women's faces in the training data. (The facial recognition application used white men's faces to train the machine.) Think of it this way: If the training data is defective, the system can be defective. As they say in some professions, "garbage in, garbage out."
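The "garbage in, garbage out" effect can be reduced to a toy sketch. Everything below is hypothetical: two made-up groups, a deliberately skewed training set, and a maximally naive model that learns only the base rates of its data. The point is simply that a model trained on lopsided data can work perfectly for the dominant group and fail completely for everyone else.

```python
from collections import Counter

# Hypothetical training set: 9 in 10 samples come from group "A",
# mirroring a dataset dominated by one demographic.
train_labels = ["A"] * 90 + ["B"] * 10

# A maximally naive "model" that learned only the base rates of its
# training data: it always predicts the most common training label.
majority = Counter(train_labels).most_common(1)[0][0]

def predict(_features):
    return majority

# On a balanced test set, every group-B example is misclassified.
test = [("x", "A")] * 50 + [("x", "B")] * 50
correct = {"A": 0, "B": 0}
counts = {"A": 0, "B": 0}
for features, group in test:
    counts[group] += 1
    if predict(features) == group:
        correct[group] += 1

print(correct["A"] / counts["A"])  # 1.0 for the overrepresented group
print(correct["B"] / counts["B"])  # 0.0 for the underrepresented group
```

Real systems fail more gradually than this caricature, but the direction of the failure is the same: accuracy concentrates where the data does.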
Another way training data can be defective is when prejudice from previous, already-biased data sets and algorithmic patterns is transferred to future ones. Imagine hiring software that thinks men are, on the whole, more qualified for a certain position, and that a man is then selected. The computer takes this selection into consideration and concludes: Yes, men are more suitable for this role. But that might not be true; women may be just as suitable. The point is that algorithms can produce a vicious cycle when past defective data is baked into the new data-analytics process.
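That vicious cycle can also be simulated in a few lines. The sketch below is purely illustrative, with invented numbers: a "model" that retrains on its own selections turns a mild historical skew (55 percent of past hires were men) into a much stronger one, because each of its picks becomes new training data.

```python
# Hypothetical feedback loop, reduced to a toy: hiring software that
# retrains on its own past selections. The history starts mildly skewed.
hires = ["M"] * 55 + ["W"] * 45  # 55% of past hires were men

for _ in range(100):
    # Each round, the "model" hires from whichever group dominates its
    # training history -- and its own pick then joins that history.
    majority = "M" if hires.count("M") >= hires.count("W") else "W"
    hires.append(majority)

# The mild 55% skew has hardened: all 100 new hires were men.
print(hires.count("M"), "of", len(hires))  # 155 of 200
```

No real hiring system is this crude, but the mechanism (yesterday's biased outputs becoming tomorrow's training inputs) is exactly the feedback loop described above.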
Algorithmic use of proxies is a source of blatant unfairness, too. For instance, zip codes and neighborhoods are used as proxies for race; the reputation of the school someone graduated from is often used as a proxy for job qualifications. Proxies represent a straightforward and cheap way to predict future outcomes, but they're frequently predicated on present, troublesome realities (like high-interest loans for people living in certain areas, or higher risk-assessment scores for people without a college degree) and, consequently, they tend to replicate a variety of inequalities in the future.
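The subtlety of proxies is that the sensitive attribute never has to appear in the model at all. In the toy sketch below (the zip codes, groups, and approval rule are all made up), a lending rule keyed only on zip code still produces sharply different approval rates by group, simply because zip code and group are correlated.

```python
# Hypothetical proxy effect: the rule never sees "group", only zip code,
# yet because zip code correlates with group, outcomes split by group.
applicants = (
    [{"zip": "11111", "group": "X"}] * 80 + [{"zip": "22222", "group": "X"}] * 20 +
    [{"zip": "22222", "group": "Y"}] * 80 + [{"zip": "11111", "group": "Y"}] * 20
)

def approve(app):
    # A made-up lending rule keyed only on zip code -- the proxy.
    return app["zip"] == "11111"

def approval_rate(group):
    pool = [a for a in applicants if a["group"] == group]
    return sum(approve(a) for a in pool) / len(pool)

print(approval_rate("X"))  # 0.8
print(approval_rate("Y"))  # 0.2
```

This is why simply deleting the sensitive column from a dataset doesn't remove the bias: any correlated feature can quietly stand in for it.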
Because humans create algorithms, the biases of the algorithms largely reflect the flaws of their creators and the environments they live in. People speaking other languages are systematically underrepresented and disadvantaged in home-assistance systems largely because the decisions made by Siri programmers favor native English speakers. According to one news report, while AI is taught to recognize different accents, "too many of the people training, testing and working with the systems all sound the same." In the end, the more common "broadcast English," the "predominantly white, nonimmigrant, non-regional dialect of TV newscasters," is more likely to be understood. Siri, in short, has essentially excluded many non-native English speakers from using their native language for voice commands.
Use of AI is becoming increasingly common in contexts beyond home assistants: think of its prevalence in criminal justice, credit scoring, cybersecurity, automated vehicles, and financial services. Yet even these applications leave out or harm some marginalized groups.
In some U.S. courts, risk assessment scores calculated by an AI system developed by the private firm NorthPointe are used in sentencing. The risk assessment score predicts the likelihood that a person will commit a crime in the future, and thus is used to decide jail time in the justice system. Yet a ProPublica study showed that the software systematically gives higher risk scores to black Americans than to similarly positioned white Americans. This goes against the original ideal of using AI to improve human welfare, and to make life easier and more efficient.
So, how to chart a more equitable path forward?
The solution isn't to "turn off" certain features or stymie technological development simply to avoid being called biased, as Google did in an effort to fix its racist algorithm. (It removed the gorilla category from Google Photos without actually fixing the deeper issue.)
Rather, to make meaningful moves toward preventing AI systems from reproducing inequality, programmers, policymakers, and users more broadly ought to be aware of the possible biases of the system, the causes of these biases, and the attendant risks for certain groups.
The solution, in other words, lies in building awareness of the data fed into AI systems, and of the subtle ways datasets can entrench bias in algorithms.