Follow the Data! Algorithmic Transparency Starts with Data Transparency
This is part of The Ethical Machine: Big Ideas for Designing Fairer AI and Algorithms, an ongoing series about AI and ethics curated by Dipayan Ghosh, a former Public Interest Technology fellow. The full series is available online.
JULIA STOYANOVICH
ASSISTANT PROFESSOR, COMPUTER SCIENCE AND ENGINEERING, TANDON SCHOOL OF ENGINEERING; CENTER FOR DATA SCIENCE, NEW YORK UNIVERSITY; AND APPOINTED MEMBER OF THE NEW YORK CITY AUTOMATED DECISION SYSTEMS TASK FORCE
BILL HOWE
ASSOCIATE PROFESSOR, INFORMATION SCHOOL; AND DIRECTOR, URBANALYTICS LAB, UNIVERSITY OF WASHINGTON
The data revolution that is transforming every sector of science and industry has been slow to reach the local and municipal governments and NGOs that deliver vital human services in health, housing, and mobility [1-2]. Urbanization has made the issue acute: in 2016, a large and growing share of the world's population lived in cities with at least 500,000 inhabitants [3]. The opportunities that Big Data presents in this context have long been recognized, evidenced by the remarkable progress around Open Data and open access to it [4]; the digitization of government records and processes [5]; and, perhaps most visibly, smart city efforts that emphasize using sensors to optimize city processes [6].
Despite this progress, the public sector has been slow to adopt predictive analytics because of its mandate for responsibility: any decision made by an algorithm must stand up to scrutiny by the individuals and organizations it affects, just as taxpayers must be able to verify that resources are being distributed equitably.
Recent reports on data-driven decision-making underscore that equitable treatment of individuals and groups is difficult to achieve [7], and that transparency and accountability of algorithmic processes are indispensable but rarely enacted [8-9]. As a society, we cannot afford the status quo: algorithmic bias in administrative processes limits access to resources for those who need them most, and amplifies the effects of systemic historical discrimination. Lack of transparency and accountability threatens the democratic process itself.
In response to these threats, New York City recently passed a law requiring that a task force be put in place to survey the current use of “automated decision systems,” defined as “computerized implementations of algorithms, including those derived from machine learning or other data processing or artificial intelligence techniques, which are used to make or assist in making decisions” in City agencies [10]. The task force will develop a set of recommendations for enacting algorithmic transparency by the agencies, and will propose procedures for:
- interrogating automated decision systems for bias and discrimination against members of legally protected groups, and addressing instances in which a person is harmed based on membership in such groups (Sections 3 (c) and (d));
- requesting and receiving an explanation of an algorithmic decision affecting an individual (Section 3 (b));
- assessing how automated decision systems function and are used, and archiving the systems together with the data they use (Sections 3 (e) and (f)).
New York is the first US city to pass an algorithmic transparency law, and we expect other municipalities to follow with similar legal frameworks or recommendations. Of utmost importance as this happens is recognizing the central role of data transparency in any algorithmic transparency framework. Meaningful transparency of algorithmic processes cannot be achieved without transparency of data.
What Is Data Transparency?
In applications involving predictive analytics, data is used to customize generic algorithms for specific situations; that is to say, algorithms are trained using data. The same algorithm may exhibit radically different behavior (making different predictions, a different number of mistakes, and even different kinds of mistakes) when trained on two different data sets. In other words, without access to the training data, it is impossible to know how an algorithm would actually behave.
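This dependence is easy to demonstrate. The sketch below (in Python, with made-up data) trains the same simple learning procedure on two different data sets and shows it making opposite predictions on the same input:

```python
def train_threshold_classifier(examples):
    """Learn a 1-D threshold rule: predict 1 when x is at least the mean
    of the positive training examples."""
    positives = [x for x, label in examples if label == 1]
    threshold = sum(positives) / len(positives)
    return lambda x: 1 if x >= threshold else 0

# Two training sets describing different populations (made-up numbers).
data_a = [(1.0, 0), (2.0, 0), (6.0, 1), (8.0, 1)]  # positives cluster high
data_b = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]  # positives cluster lower

model_a = train_threshold_classifier(data_a)  # learned threshold: 7.0
model_b = train_threshold_classifier(data_b)  # learned threshold: 3.5

# The identical algorithm disagrees on the same input once the data differs.
print(model_a(5.0))  # 0
print(model_b(5.0))  # 1
```

The algorithm itself is fully visible here, yet its behavior on any input is unknowable without the training data.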
Algorithms and corresponding training data are used, for example, in predictive policing applications to target areas or people deemed high-risk. But as has been shown extensively, when the data used to train these algorithms reflects systemic historical bias against poor and predominantly African-American neighborhoods, the predictions simply reinforce the status quo rather than provide any new insight into crime patterns [11]. Transparency of the algorithm is neither necessary nor sufficient to understand and counteract these particular errors. Rather, the conditions under which the data was collected must be retained and made available to make the decision-making process transparent.
Even decision-making applications that do not explicitly attempt to predict future behavior based on past behavior are still heavily influenced by the properties of the underlying data. For example, the Vulnerability Index-Service Prioritization Decision Assistance Tool (VI-SPDAT) [12], used to prioritize homeless individuals for receiving services, does not involve machine learning, but still assigns a risk score based on survey responses, a score that cannot be interpreted without understanding the conditions under which the data was collected. As another example, matchmaking methods such as those used by the Department of Education to assign children to spots in public schools are designed and validated using data sets; if these data sets are not made available, the matchmaking method itself cannot be considered transparent.
What is data transparency, and how can we achieve it? One immediate interpretation of this term is “making the training and validation data sets publicly available.” However, while data should be made open whenever possible, much of it is sensitive and cannot be shared directly. That is, data transparency is in tension with the privacy of individuals who are included in the data set. In light of this, we offer an alternative interpretation of data transparency:
- In addition to releasing training and validation data sets whenever possible, agencies shall make publicly available summaries of relevant statistical properties of the data sets that can aid in interpreting the decisions made using the data, while applying state-of-the-art methods to preserve the privacy of individuals.
- When appropriate, privacy-preserving synthetic data sets can be released in lieu of real data sets to expose certain features of the data, if real data sets are sensitive and cannot be released to the public.
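As a concrete illustration of the first point, a standard way to release a summary statistic while protecting individuals is the Laplace mechanism from differential privacy. The Python sketch below (with a made-up count) adds noise calibrated to the sensitivity of a counting query:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon):
    """Laplace mechanism: a counting query has sensitivity 1, so adding
    Laplace noise with scale 1/epsilon makes the released value
    epsilon-differentially private."""
    return true_count + laplace_noise(1.0 / epsilon)

# An agency could publish how many surveyed individuals fall in a category
# without exposing whether any particular person is in the data.
released = noisy_count(true_count=1304, epsilon=0.5)
print(round(released))  # close to 1304, but randomized
```

Smaller values of epsilon give stronger privacy at the cost of noisier released statistics; choosing that trade-off is itself a policy decision that should be documented.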
An important aspect of data transparency is interpretability: surfacing the statistical properties of a data set, the methodology that was used to produce it, and, ultimately, substantiating its “fitness for use” in the context of a specific automated decision system or task. This consideration of a specific use is particularly important because data sets are increasingly used outside the original context for which they were intended. This compels us to augment our interpretation of data transparency in the public sector to include:
- Agencies shall make publicly available information about the data collection and preprocessing methodology, in terms of assumptions, inclusion criteria, known sources of bias, and data quality.
Data transparency is important both when an automated decision system is interrogated for systematic bias and discrimination, and when it is asked to explain an algorithmic decision that affects an individual. For example, suppose that a system scores and ranks individuals for access to a service. If an individual enters her data and receives the result, say, a score of 42, this number alone provides no information about why she was scored in this way, how she compares to others, and what she can do to potentially improve her outcome.
To facilitate transparency, the explanation given to an individual should be interpretable, insightful, and actionable. As part of the result, data that pertains to other individuals, or a summary of such data, may need to be released: for example, to explain which other individuals or groups of individuals receive higher scores or more favorable outcomes. This functionality requires the data transparency mechanisms discussed in our alternative interpretation above.
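A minimal sketch of such an explanation, in Python with hypothetical scores and a hypothetical eligibility cutoff, might bundle the bare score with comparative and actionable context:

```python
def explain_score(my_score, all_scores, eligibility_cutoff):
    """Return a score together with comparative and actionable context."""
    below = sum(1 for s in all_scores if s < my_score)
    percentile = 100.0 * below / len(all_scores)
    gap = eligibility_cutoff - my_score
    return {
        "score": my_score,
        "percentile": percentile,                    # how I compare to others
        "eligible": my_score >= eligibility_cutoff,  # what the score means here
        "points_to_cutoff": max(gap, 0),             # what would change my outcome
    }

cohort = [12, 25, 30, 42, 55, 61, 70, 88]  # other applicants' scores (made up)
print(explain_score(42, cohort, eligibility_cutoff=50))
```

Note that even this toy explanation requires a summary of other individuals' data (the cohort of scores), which is exactly why explanation depends on data transparency mechanisms.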
Toward Data Transparency by Design
Enacting algorithmic and data transparency challenges the state of the art in data science research and practice, and will require significant technological effort on the part of agencies. It will require careful planning, financial resources, and time.
Two recent public actions of a similar nature illustrate the timescale: the French Digital Republic Act came into effect in October 2016, following a yearlong process [13], while the EU General Data Protection Regulation (GDPR) was adopted in April 2016 and became enforceable in May 2018, more than two years later [14].
How can we enable data transparency in complex data-driven administrative processes? The research community is actively working on methods for enabling fairness, accountability, and transparency (FAT) of specific algorithms and their outputs [15-21]. While important, these approaches focus solely on the final step of the data science lifecycle (called “analysis and validation” in Figure 1), and are limited by the assumption that input data sets are clean and reliable.
Figure 1. The data usage lifecycle
In challenging this assumption, we observe that additional information and intervention methods become available if we consider the input data itself [22]. Appropriately annotating data sets when they are shared, and maintaining information about how data sets are acquired and manipulated, allows us to provide data transparency: to explain the statistical properties of the data sets, uncover any sources of bias, and make statements about data quality and fitness for use. Put another way: if we have no information about how a data set was generated and acquired, we cannot convincingly argue that it is appropriate for use by an automated decision system.
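One way to make this concrete is a data set abstraction that carries annotations through every transformation. The Python sketch below is illustrative only; all names and records are hypothetical:

```python
class AnnotatedDataset:
    """A data set that carries provenance annotations through transformations."""

    def __init__(self, rows, annotations=None):
        self.rows = rows
        self.annotations = list(annotations or [])

    def filter(self, predicate, reason):
        """Filtering returns a new data set that records what was dropped and why."""
        kept = [r for r in self.rows if predicate(r)]
        note = (f"filter '{reason}' removed "
                f"{len(self.rows) - len(kept)} of {len(self.rows)} rows")
        return AnnotatedDataset(kept, self.annotations + [note])

records = [
    {"age": 34, "income": 21000},
    {"age": 19, "income": None},
    {"age": 52, "income": 30000},
]
ds = AnnotatedDataset(records, annotations=["source: hypothetical intake survey"])
clean = ds.filter(lambda r: r["income"] is not None, reason="drop missing income")

# Downstream users can now see that missing-income records were excluded,
# a choice that may bias any income-based analysis.
print(clean.annotations)
```

A production system would record far richer provenance (queries, joins, timestamps), but the principle is the same: the annotations travel with the data, not in an analyst's head.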
To achieve algorithmic transparency, there is a need to develop generalizable data transparency methodologies for all stages of the data lifecycle [23], and to build tools that implement these methodologies [24-25]. Such tools should be placed in the hands of data practitioners in the public sector. Importantly, the requirement of data transparency cannot be handled as an afterthought, but must be provisioned for at design time.
To make this discussion concrete, let's consider an example. The growing homelessness crisis is a deeply complex challenge for urban communities. A variety of services are available to homeless citizens, including emergency shelter, temporary rehousing, and permanent supportive housing. The goal is to enable an individual to transition into stable housing after an episode of homelessness.
Social service agencies are beginning to collect, share, and analyze data in an effort to provide better targeted interventions. Broadly speaking, these agencies perform two categories of data analysis. The first category is personalized prediction and recommendation of services. For example, previously incarcerated citizens may benefit more from supportive housing, while families with a history of substance abuse may be directed to harm-reduction programs. Data can also be used to predict frequent service users, recommend treatment for sufferers of substance abuse and of other mental-health issues, and provide protection for victims of domestic violence. The second category is measurement and evaluation of the effectiveness of specific interventions, and of the overall system of homeless assistance. Both kinds of analysis are done using complex data-driven models, and rest on the availability, interoperability, and statistical validity of data collected from numerous local communities.
Communities use Homeless Management Information Systems (HMIS) to collect data [26]. Data sets produced by an HMIS are typically “weakly structured”: rectangular, with rows and columns, but otherwise with no guaranteed properties. For example, it is often the case that columns contain data of mixed types, that missing values are abundant, and that column names are not meaningful. HMIS data is anonymized and then shared, and it must be post-processed in various ways to make it appropriate for analysis.
An analyst's set of candidate weakly structured data sets is formed from a number of sources: open data portals, queries against other agencies' APIs, and locally derived data sets. In this context, relevant data sets are identified, repaired, restructured, and aligned, so-called “data wrangling.” A crucial dimension of data acquisition and curation that is often overlooked, and for which hardly any technical support exists in current systems, concerns the statistical properties of the data. For example, removing records with missing values or joining two data sets may introduce bias. This bias should be tracked and carried with the data set to inform downstream analysis. In some cases, relevant properties can be computed directly (e.g., geographic coverage). In other cases, the data set must be explicitly annotated (e.g., records missing due to a system outage or to rules that prevent disclosure).
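The effect of such a wrangling step on statistical properties can be made visible directly. The Python sketch below, using made-up records, shows how dropping rows with missing values shifts the composition of the data set, a shift that should travel with the data:

```python
def group_shares(rows, key):
    """Fraction of rows belonging to each group under the given key."""
    counts = {}
    for r in rows:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return {g: c / len(rows) for g, c in counts.items()}

rows = [
    {"borough": "Queens",    "income": 18000},
    {"borough": "Queens",    "income": None},
    {"borough": "Queens",    "income": None},
    {"borough": "Brooklyn",  "income": 25000},
    {"borough": "Brooklyn",  "income": 22000},
    {"borough": "Manhattan", "income": 40000},
]

before = group_shares(rows, "borough")
complete = [r for r in rows if r["income"] is not None]
after = group_shares(complete, "borough")

# Queens drops from half of the records to a quarter once rows with
# missing income are removed; this shift should be annotated on the data set.
print(before["Queens"], after["Queens"])  # 0.5 0.25
```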
A data set derived in this manner may be further filtered, scored, and ranked to prioritize analysis. Filtering and ranking operations may introduce further bias, and must be tracked to explain properties of the data set they produce. For example, it may be required that the filtered data set contain homelessness data for Queens, Brooklyn, and Manhattan, and that its representation of age and gender categories agree with a given population model (e.g., with what is expected based on the census). Further, if data is returned in sorted order, it must be guaranteed that no single ethnic group dominates the top ranks of the list. Restating the filtering and ranking tasks to capture the data analyst's intent, while at the same time ensuring that several possibly competing objectives hold over the result, is difficult and requires support from the system to be done effectively. The result of this stage is a data set that is used as input for the data analysis stage.
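A coverage constraint of this kind can be checked mechanically. The sketch below (with hypothetical group labels) flags any group whose share of the top-k positions of a ranking exceeds a threshold:

```python
def top_k_dominated(ranked_groups, k, max_share):
    """Return the groups whose share of the top-k ranks exceeds max_share.

    ranked_groups: the group label of each individual, best rank first.
    """
    k = min(k, len(ranked_groups))
    top = ranked_groups[:k]
    return {g for g in set(top) if top.count(g) / k > max_share}

# Group labels in rank order (hypothetical).
ranking = ["A", "A", "A", "B", "A", "C", "B", "C"]
print(top_k_dominated(ranking, k=4, max_share=0.5))  # {'A'}: 3 of the top 4
```

A real system would go further and re-rank to satisfy the constraint while preserving score order as much as possible, but even a simple check like this makes the objective explicit and auditable.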
During data analysis, a predictive model is learned from the data, or an available model is invoked to make predictions. Data analysis is often coupled with validation, in which confidences or error rates are produced alongside predictions. Based on these validation results, data analytics can be instrumented to quantify accuracy, confidence, and even fairness at the level of sub-populations [27-29]. Feedback from the data analytics stage, such as a high error rate on a specific sub-population, may be used to state additional objectives, and to iteratively refine the process upstream.
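Instrumenting validation in this way can be as simple as disaggregating error rates by group, as in the following sketch with made-up predictions:

```python
def error_rates_by_group(records):
    """records: (group, true_label, predicted_label) triples."""
    errors, totals = {}, {}
    for group, truth, pred in records:
        totals[group] = totals.get(group, 0) + 1
        if truth != pred:
            errors[group] = errors.get(group, 0) + 1
    return {g: errors.get(g, 0) / totals[g] for g in totals}

# Hypothetical validation results: (group, truth, prediction).
results = [
    ("group1", 1, 1), ("group1", 0, 0), ("group1", 1, 1), ("group1", 0, 1),
    ("group2", 1, 0), ("group2", 0, 1), ("group2", 1, 1), ("group2", 1, 0),
]
print(error_rates_by_group(results))  # group2's error rate is 3x group1's
```

An aggregate error rate of 0.5 over this data would hide the disparity entirely; the disaggregated view is what enables the upstream feedback described above.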
Takeaways
Given the close link between algorithms and the data on which they are trained, data transparency is the next frontier. As we have noted, that does not simply mean releasing raw data sets: doing so is often unnecessary, and usually insufficient, to quantify fitness for use in the context of a particular automated decision system or task.
Enacting algorithmic and data transparency will require a significant shift in culture on the part of relevant agencies. It will require careful planning, financial resources, and time. Equally important, algorithmic and data transparency will require a paradigm shift in the way we think about data-driven algorithmic decision-making in the public sector.
First, we must accept that “accuracy” and “utility” cannot be the sole objectives. They must be balanced with equitable treatment of members of historically disadvantaged groups, and with accountability and transparency to the individuals affected by algorithmic decisions and to the general public. Second, we must recognize that automated decision systems cannot be “patched” after the fact to become transparent and accountable. Rather, we must provision for transparency and accountability at design time, which has clear implications for how we build and procure software systems for agency use. Perhaps less obviously, provisioning for data transparency also shapes how municipalities structure their Open Data efforts. It is no longer sufficient to publish a data set on a city's open data portal. Rather, the public must be informed about the data set's composition, the methodology used to produce it, and its fitness for any particular use. It bears repeating that data transparency is a property not only of the data itself, but of how the data is deployed in a particular context.
References
1. Stephen Goldsmith and Susan Crawford, The Responsive City: Engaging Communities through Data-Smart Governance (San Francisco: John Wiley & Sons, 2014).
2. Marcus R. Wigan and Roger Clarke, “Big Data's Big Unintended Consequences,” Computer 46, no. 6 (2013): 46–53.
3. United Nations, “The World's Cities in 2016,” 2016, http://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf.
4. Stefan Baack, “Datafication and Empowerment: How the Open Data Movement Rearticulates Notions of Democracy, Participation, and Journalism,” Big Data & Society 2, no. 2 (2015).
5. Annalisa Cocchia, “Smart and Digital City: A Systematic Literature Review,” in Smart City, eds. Renata Paola Dameri and Camille Rosenthal-Sabroux (New York: Springer International Publishing, 2014), 13–43.
6. Ibrahim Abaker Targio Hashem et al., “The Role of Big Data in Smart City,” International Journal of Information Management 36, no. 5 (2016): 748–758.
7. MetroLab Network, “First, Do No Harm: Ethical Guidelines for Applying Predictive Tools within Human Services,” 2017, https://metrolabnetwork.org/data-science-and-human-services-lab/.
8. Robert Brauneis and Ellen P. Goodman, “Algorithmic Transparency for the Smart City,” Yale Journal of Law & Technology 20, no. 103 (2018), http://dx.doi.org/10.2139/ssrn.3012499.
9. Julia Angwin et al., “Machine Bias: Risk Assessments in Criminal Sentencing,” ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
10. The New York City Council, “Int. No. 1696-A: A Local Law in Relation to Automated Decision Systems Used by Agencies,” 2017, https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3137815&GUID=437A6A6D-62E1-47E2-9C42-461253F9C6D0.
11. Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Publishing Group, 2016).
12. Partners Ending Homelessness, “Vulnerability Index–Service Prioritization Decision Assistance Tool (VI-SPDAT),” http://pehgc.org/.
13. La République Numérique, “The Digital Republic Bill: Overview,” https://www.republique-numerique.fr/pages/in-english.
14. The European Union, “Regulation (EU) 2016/679: General Data Protection Regulation (GDPR),” https://gdpr-info.eu/.
15. Cynthia Dwork et al., “Fairness through Awareness,” in Innovations in Theoretical Computer Science, Cambridge, Massachusetts, January 8–10, 2012.
16. Michael Feldman et al., “Certifying and Removing Disparate Impact,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 10–13, 2015.
17. Sara Hajian and Josep Domingo-Ferrer, “A Methodology for Direct and Indirect Discrimination Prevention in Data Mining,” IEEE Transactions on Knowledge and Data Engineering 25, no. 7 (2013): 1445–1459, https://ieeexplore.ieee.org/document/6175897.
18. Faisal Kamiran, Indre Zliobaite, and Toon Calders, “Quantifying Explainable Discrimination and Removing Illegal Discrimination in Automated Decision Making,” Knowledge and Information Systems 35, no. 3 (2013): 613–644.
19. Andrea Romei and Salvatore Ruggieri, “A Multidisciplinary Survey on Discrimination Analysis,” The Knowledge Engineering Review 29, no. 5 (2014): 582–638, https://doi.org/10.1017/S0269888913000039.
20. Richard S. Zemel et al., “Learning Fair Representations,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, 2013, 325–333, http://proceedings.mlr.press/v28/zemel13.pdf.
21. Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” in 8th Innovations in Theoretical Computer Science Conference, Berkeley, California, January 9–11, 2017, https://doi.org/10.4230/LIPIcs.ITCS.2017.43.
22. Keith Kirkpatrick, “It's Not the Algorithm, It's the Data,” Communications of the ACM 60, no. 2 (2017): 21–23, https://cacm.acm.org/magazines/2017/2/212422-its-not-the-algorithm-its-the-data/abstract.
23. Julia Stoyanovich et al., “Fides: Towards a Platform for Responsible Data Science,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, Illinois, June 27–29, 2017, 26:1–26:6.
24. Haoyue Ping, Julia Stoyanovich, and Bill Howe, “DataSynthesizer: Privacy-Preserving Synthetic Datasets,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, Illinois, June 27–29, 2017, 42:1–42:5.
25. Ke Yang et al., “A Nutritional Label for Rankings,” in Proceedings of ACM SIGMOD 2018.
26. HUD Exchange, “HMIS Data and Technical Standards,” https://www.hudexchange.info/programs/hmis/hmis-data-and-technical-standards/.
27. Florian Tramèr et al., “FairTest: Discovering Unwarranted Associations in Data-Driven Applications,” 2015, https://arxiv.org/abs/1510.02377.
28. Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou, “Fairness Testing: Testing Software for Discrimination,” in Proceedings of the 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4–8, 2017, https://doi.org/10.1145/3106237.3106277.
29. Anupam Datta, Shayak Sen, and Yair Zick, “Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems,” in 2016 IEEE Symposium on Security and Privacy, San Jose, California, May 22–26, 2016, https://www.andrew.cmu.edu/user/danupam/datta-sen-zick-oakland16.pdf.