More Data, Fewer Problems?
How one computer science professor is using data to fight development aid fraud.
The work of lifting countries out of poverty is never easy. But Rozier thought he could make it far more effective.
Rozier is a computer science professor at Iowa State with a penchant for who saw a big problem at the World Bank. Every year the Bank issues development loans across the world. Some are earmarked for projects that help countries or regions improve infrastructure and basic living standards, such as access to clean water and medicine. But each project is complex and requires many international suppliers to provide the goods and services that make it a reality. Unsurprisingly, the supplier selection process is competitive. In an ideal world, the contract would go to the company that offers the best combination of price and quality, and the development project would then get underway.
But all too often, that's not how it works. The Bank sees a non-trivial amount of fraud in the supplier bid and selection process. From setting up fake companies to collusion and price fixing, fraud in this process can lead to the embezzlement or siphoning of funds. That could mean no road, no well, or no access to medicine for the intended country or region.
In 2014, Rozier started a research project with (DSSG) to "fight data with data." His project focused on helping the Bank identify fraud scenarios by building an automatic reasoning system that studies data patterns and identifies what he calls "data integrity attacks," which correlate highly with potential fraud.
Rozier sees this as an attacker-vs-defender problem. The attackers (the fraudsters) aim to inject bad data into the system or subvert its decision process. The defender's task is to spot the bad data and/or the patterns of subversion.
An example of a data integrity attack, Rozier says, is collusion. In other words, multiple entities may work together to artificially inflate their bids on a loan in order to make a moderately high bid look competitive. In some cases, a single fraudster submits nearly all of the bids, sometimes over 90% of them, via fake companies. In others, a supplier colludes with other bidders to submit artificially inflated bids, which then dictate what counts as a "reasonably priced" bid.
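The "one fraudster behind most of the bids" pattern can be sketched as a simple concentration check. This is only an illustration, not Rozier's system: the record layout, the `contact_email` field, and the 0.9 threshold (borrowed from the "over 90%" figure above) are assumptions made for the example.

```python
from collections import Counter

def dominant_submitter(bids, key="contact_email"):
    """Return (value, share) for the most common `key` value among the bids."""
    counts = Counter(bid[key] for bid in bids)
    value, count = counts.most_common(1)[0]
    return value, count / len(bids)

def flag_concentrated_tenders(tenders, threshold=0.9):
    """Map tender id -> (value, share) when one source dominates the bids."""
    flagged = {}
    for tender_id, bids in tenders.items():
        if not bids:
            continue  # nothing to measure for a tender with no bids
        value, share = dominant_submitter(bids)
        if share >= threshold:
            flagged[tender_id] = (value, share)
    return flagged

# Hypothetical data: 19 of 20 bids on T-001 share one contact address.
tenders = {
    "T-001": [{"bidder": f"Shell Co {i}", "contact_email": "x@example.com"}
              for i in range(19)]
             + [{"bidder": "Honest Works", "contact_email": "bids@honest.example"}],
    "T-002": [{"bidder": "A", "contact_email": "a@example.com"},
              {"bidder": "B", "contact_email": "b@example.com"},
              {"bidder": "C", "contact_email": "c@example.com"}],
}
flagged = flag_concentrated_tenders(tenders)
```

A real check would probably combine several shared attributes (addresses, phone numbers, bank details) rather than rely on a single field that a fraudster can easily vary.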
Another commonly seen fraud tactic involves companies that the World Bank has put on its "debarred" list, which names entities prohibited from submitting bids because of past violations or dubious practices. To get around the ban, a debarred company may obfuscate its identity by taking on a new name that resembles a well-recognized, legitimate organization (think "PricewaterhoseCoopers" for "PricewaterhouseCoopers") or by changing its own name slightly to confuse automated identification algorithms (think "Amce inc" for "Acme Inc").
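One way to catch near-miss names like these is an edit distance that counts a swap of two adjacent letters as a single change. The sketch below uses the optimal string alignment distance; it stands in for whatever matching the Bank actually uses, and the suffix list, function names, and `max_distance=1` cutoff are assumptions for illustration.

```python
import re

LEGAL_SUFFIXES = {"inc", "ltd", "llc", "corp", "corporation", "co"}

def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def osa_distance(a, b):
    """Edit distance where an adjacent transposition costs one edit."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]

def near_matches(name, known_names, max_distance=1):
    """Return (known_name, distance) pairs within max_distance of `name`."""
    target = normalize(name)
    hits = []
    for known in known_names:
        d = osa_distance(target, normalize(known))
        if d <= max_distance:
            hits.append((known, d))
    return hits

debarred = ["Acme Inc", "Globex Corporation"]
hits = near_matches("Amce inc", debarred)
```

The same function can be pointed at a list of well-known legitimate firms: a bidder whose name is one edit away from "PricewaterhouseCoopers" but not identical to it deserves a closer look.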
And here, we get back to "fighting data with data." Rozier saw that "the defender," in this case the World Bank, had fundamentally more data than a typical fraudster, and can leverage this fact to fight data fraud. For instance, the World Bank has the entire history of all bids submitted, both winning and losing ones, while the fraudster may have only a partial view of that history. In one case, Rozier's study identified a number of fake bids that stood out because they followed a model of what bids looked like 10 years ago. "The bidding process has changed significantly in the last 5 years," Rozier said. "These bids had the wrong parameters and patterns, and it was clear that they were manufactured bids."
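The "bids that look ten years old" finding suggests a simple test the defender can run only because it holds the full history: compare each new bid with what bids looked like in each era and see which era it resembles most. The sketch below is a toy version under stated assumptions. The two numeric features, the era labels, and the values are invented for illustration, and a real model would need scaled features and far more of them.

```python
from math import dist
from statistics import fmean

def centroid(rows):
    """Column-wise mean of a list of equal-length feature tuples."""
    return tuple(fmean(col) for col in zip(*rows))

def closest_era(bid, history):
    """Return the era whose average bid is nearest to `bid` (Euclidean)."""
    centroids = {era: centroid(rows) for era, rows in history.items()}
    return min(centroids, key=lambda era: dist(bid, centroids[era]))

def looks_outdated(bid, history, current_era):
    """True when the bid resembles some era other than the current one."""
    return closest_era(bid, history) != current_era

# Hypothetical features: (number of line items, days from tender to bid).
history = {
    "2004": [(4.0, 30.0), (5.0, 28.0), (3.0, 31.0)],
    "2014": [(12.0, 10.0), (14.0, 12.0), (13.0, 9.0)],
}
suspicious = looks_outdated((4.0, 29.0), history, current_era="2014")
ordinary = looks_outdated((13.0, 11.0), history, current_era="2014")
```

A fraudster with only a partial view of the history cannot easily tell which era their fabricated bids resemble, which is the asymmetry the defender exploits.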
Rozier had discovered the key to fighting attacks: "You need a superior data model than your adversaries."
Understanding how to preserve data integrity, and to build a superior data model to fight data fraud, is becoming ever more crucial. We rely on the integrity of financial records, election results, news, and medical records, to name just a handful of sources. If these data and the algorithms that process them become compromised, the very foundation of our society may be at risk. "The question is: how vulnerable is your system to data manipulation, gaming, and other forms of data integrity attacks, and what will you do about it?" Rozier asks.
It's a question that's just beginning to be answered. Data science for fraud and security is a relatively young field. Rozier's work using semantic and syntactic clustering to resolve name conflicts, when tested on a World Bank data set, greatly improved upon previous results obtained with open-source tools like OpenRefine.
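To give a feel for the syntactic half of that idea, the sketch below groups spelling variants of supplier names with a greedy pass over a string-similarity score from Python's standard `difflib`. It is not Rozier's method and covers no semantic clustering; the 0.85 threshold and the sample names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_names(names, threshold=0.85):
    """Greedily group names; each cluster's first name is its representative."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no close representative: new cluster
    return clusters

names = [
    "World Health Supplies Ltd",
    "Acme Inc",
    "World Health Suplies Ltd",
    "ACME Inc.",
]
clusters = cluster_names(names)
```

Greedy clustering depends on input order and on the threshold, so the groups it produces are best treated as candidates for human review rather than as resolved identities.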
In addition, recent advances in deep learning applied in conjunction with data science, such as those seen in Google's successful AI-driven Go game against the world's best Go player, are "incredibly encouraging," Rozier said. "The same deep learning techniques can be applied to fraud detection: We can label the data, learn something about the governing dynamics, refine the model, iterate, and eventually build sound analysis."
Rozier is deep in the where he aims to apply deep learning principles to tackle data integrity. He believes that data science can help to effectively eliminate supplier fraud. "The hope is more projects can go forward unimpeded, and we'll see more clean water, better infrastructure, and improved healthcare for more regions sooner rather than later."