Artificial intelligence has advanced at remarkable speed, but its progress has been shaped by a narrow foundation of data. Most large language models are trained on internet text, books, and online forums. This scale is impressive, but it is not representative. The voices that dominate these sources are often urban, wealthy, educated, and speakers of English or other globally dominant languages. When models learn only from them, the risk is obvious: bias in, bias out. The result is AI that works well for some, and poorly for many.
Representative AI requires something different. It demands that models hear the breadth of human experience and language variation, not just the loudest or most connected groups. That begins with representative data. For decades, survey science has developed the tools to measure populations accurately through sampling, stratification, and weighting. Unlike scraped web data, which reflects who chooses to publish, survey research ensures inclusion of those who might otherwise be invisible.
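To make the weighting step concrete, here is a minimal sketch of post-stratification, one common survey-weighting technique. It is an illustration only, not a description of GeoPoll's production method. The function name, the two strata, and all the numbers are hypothetical. Each respondent receives a weight equal to their stratum's share of the population divided by its share of the sample, so over-sampled groups count for less and under-sampled groups count for more.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Return one weight per respondent so that weighted stratum
    shares in the sample match the given population shares.

    sample_strata: list of stratum labels, one per respondent.
    population_shares: dict mapping stratum label -> population share (sums to 1).
    Raises KeyError if a respondent's stratum has no population share.
    """
    n = len(sample_strata)
    counts = Counter(sample_strata)
    weights = []
    for stratum in sample_strata:
        sample_share = counts[stratum] / n
        weights.append(population_shares[stratum] / sample_share)
    return weights


# Hypothetical example: urban respondents are over-sampled (70% of the
# sample) relative to an assumed population that is 40% urban.
sample = ["urban"] * 70 + ["rural"] * 30
population = {"urban": 0.4, "rural": 0.6}
weights = poststratification_weights(sample, population)

# Weighted share of rural respondents after adjustment.
rural_share = sum(w for s, w in zip(sample, weights) if s == "rural") / sum(weights)
```

In this example each rural respondent carries a weight of 2.0 and each urban respondent a weight of about 0.57, which brings the weighted rural share to the assumed population value of 60 percent.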
This is where GeoPoll’s work is unique. We operate primarily in low-income countries across Africa, Latin America, and Asia. These regions are systematically underrepresented in global datasets. Our surveys reach communities that are often excluded from the digital traces AI relies on. Beyond geography, our sampling design incorporates income and education as core criteria, ensuring that the perspectives of low-income and less-educated populations are captured alongside those of more affluent groups. This intentional inclusion is critical because these voices are most often absent from the data that feeds AI systems.
Representative Survey Research Data for AI
Our approach is grounded in scale and depth. Every year, we conduct hundreds of thousands of telephone-based interviews that extend into rural villages, low-connectivity areas, and places where literacy rates are low and internet access is scarce. These conversations are live and unscripted, capturing how people actually communicate, with the slang, cadence, accents, and evolving language that web-based datasets overlook. The result is a corpus of representative audio that reflects the daily realities of underserved populations.
This data has unique value for AI training. Unlike scripted phrases or synthetic samples, GeoPoll’s representative audio captures natural variation across cultures and regions. When used to train or fine-tune models, it consistently outperforms curated voice datasets because it is drawn from the real world rather than produced in a studio. It gives models the ability to recognize speech patterns as they exist in daily life, not as they appear in filtered or idealized forms.
Contrast this with the risks in today’s AI pipelines. Web-scraped data carries selection bias, temporal bias, and cultural bias. It reflects what gets published, not how people live and speak. Models then amplify those distortions, producing outputs that misinterpret slang, misrecognize dialects, or stereotype entire groups. Left unchecked, these gaps compound and erode trust in AI systems, hindering adoption in emerging markets and widening the divide.
The science of sampling provides the corrective. By embedding representative data into AI pipelines, researchers can fill blind spots and build systems that perform consistently across diverse populations. This approach also provides a benchmark: survey data can test model outputs, reveal where failures occur, and guide targeted fine-tuning. It creates a feedback loop where AI evolves alongside the societies it is meant to serve.
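The benchmarking idea can be sketched in a few lines: score a model's outputs separately for each population group, then flag the groups that fall below a chosen bar as candidates for targeted fine-tuning. This is a simplified illustration under stated assumptions. The group labels, the pass/fail scoring, the 0.8 threshold, and both function names are hypothetical, and a real speech evaluation would more likely use a metric such as word error rate.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group.

    records: iterable of (group, is_correct) pairs, where is_correct
    says whether the model's output matched the reference for that item.
    Returns a dict mapping group -> fraction of correct items.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def groups_below(accuracies, threshold):
    """Return the groups whose accuracy is under the threshold, sorted by name."""
    return sorted(g for g, acc in accuracies.items() if acc < threshold)


# Hypothetical evaluation results, for illustration only.
records = (
    [("urban", True)] * 9 + [("urban", False)] * 1
    + [("rural", True)] * 6 + [("rural", False)] * 4
)
accuracies = accuracy_by_group(records)
needs_attention = groups_below(accuracies, threshold=0.8)
```

Here the model is correct on 90 percent of the urban items but only 60 percent of the rural ones, so the rural group is flagged. An aggregate accuracy of 75 percent would have hidden that gap.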
If AI is to be truly global, it must be trained on datasets that reflect the global population. That requires more than volume. It requires representativity. Survey science has perfected the methods to listen to everyone, not just the few. Now it offers AI what it has always lacked: balance, diversity, and authenticity. The companies that focus on the quality and representativeness of their training data will be the ones that meet users where they are. Just as WhatsApp became ubiquitous by working for people everywhere, the companies that build representative AI will gain the most users and will emerge as the clear global leaders.
Nick Becker is GeoPoll’s CEO.