Small Data: The Key to Successful Data for Development
This article was originally published on Thomson Reuters Foundation.
Data for Development (D4D) is gaining traction, and for good reason. Machine learning, cloud computing and a constantly expanding universe of data are bringing new approaches to our understanding of human needs and behavior. Applying these tools to the challenge of poverty alleviation offers the rigor of a “data-driven” approach without the cost and complication of traditional survey-based data. It’s no surprise that D4D is becoming a crucial factor in a growing range of programs, from agriculture and health to financial services.
Being data-driven is not without its dangers, however. D4D efforts are often organized around residual datasets—datasets captured as a byproduct of other activities, like mobile phone usage, social media activity and digital commerce. Yet, residual datasets often fail to capture the complex, human dimension at the center of our work. For example, residual data tends to ignore communities with limited access to the digital universe. For those of us working to serve the base of the world’s poor, this exclusion constitutes a fatal flaw. How, then, can we be sure our data comprehensively captures the scope of the human challenges D4D seeks to address?
Purposeful data, also called small data, starts with questions and then gathers responses from a target population. This approach can provide a base understanding that helps to correct the errors and exclusions of residual data. It enriches the context of D4D projects, ensuring they are oriented toward the right problems, with all the necessary information to produce the best solutions. For example, advocates contend that an over-reliance on credit scoring, a key use of big data, is leading to deeper racial disparities in access to financial services. Moreover, employers routinely use credit reports to screen job applicants, even though there is no correlation between credit report data and job performance. Meanwhile, other survey data shows that consumers’ understanding of credit scores is deteriorating. In aggregate, these “small” data points provide important context for understanding the potential negative impact of big data, and guidance to improve its use.
By giving voice to individuals, purposeful data also enacts a vital premise: that locally impacted communities understand their lives better than anyone else. For example, in an impact study of BrightLife, a social enterprise by FINCA International that provides clean energy solutions to off-grid customers, we began data collection with qualitative focus group interviews. While most discussions of off-grid products emphasize cost savings from the displacement of traditional fuels, BrightLife customers spoke about perceived positive health outcomes. Building on this finding, the research team crafted a quantitative survey, which helped to pinpoint primary customer benefits: better eye and respiratory health and improved sleep quality. At the same time, through these customer surveys, we learned that individuals in the most rural areas were struggling to use our product, simply due to the limited availability of properly sized wood. These kinds of granular insights are critical for funders and front-line organizations, because they provide a clear view of how one’s program is functioning in the context of real life.
Purposeful data collection presents its own challenges, namely the cost and difficulties involved in capturing direct, unbiased answers. But the science of survey design is constantly improving its ability to detect inevitable human flaws. Our data collection platform, ValiData, was created to identify and correct survey errors in real time, using AI to detect surveyor bias. For example, using ValiData, we found a surveyor was estimating respondents’ ages, rather than asking them. Once we pinpointed that problem, we were able to address it head-on. While big data may be appealing for its sheer volume, ready availability and seeming objectivity, its biases can operate in more insidious ways, making flaws harder to correct.
D4D should not lead us away from purposeful surveys; it should instead inspire growing demand for the very type of data that only individuals and communities can provide. By ensuring that data science and development solutions are informed by, and responsive to, actual problems, we will become better equipped to serve those in need with full and accurate context. With this knowledge, D4D will enable us to drive toward the aspirations of all people, not just the digitally connected.