Wherever new technology is introduced, ethics and legislation trail behind the applications. From a technical point of view, the field of data science can no longer be called new, but it has not yet reached maturity in terms of ethics and legislation. As a result, the field is especially prone to harmful ethical missteps.
How do we prevent these missteps right now, while we wait for — or even better: work on — ethical and legislative maturity?
I propose that the solution lies in taking responsibility as a data scientist yourself. Before I reach this conclusion, I will give you a brief introduction to data ethics and legislation. I will also share a best practice from my own team that gives concrete actions to make your team ethics-ready.
“But data and models are neutral in themselves, so why worry about good and bad?”
If 2012 marked the kickoff of the golden age of data science applications, through the crowning of data science as the ‘Sexiest Job of the 21st Century’, then 2018 might mark the start of the age of data ethics. It was the year in which the whole world started forming an opinion on how data may and may not be used.
The Cambridge Analytica goal of influencing politics clearly fell in the ‘may not’ camp.
This scandal opened up a major discussion about the ethics of data use. Multiple articles have since discussed situations where the bad of algorithms outweighed the good. The many examples include image recognition AI erroneously labelling humans as gorillas, the chatbot Tay, which became too offensive for Twitter within 24 hours, and male-preferring HR algorithms (which raises the question: is data science the sexiest, or the most sexist, job of the 21st century?).
Clearly, data applications have left neutral ground.
In addition to — or maybe caused by — attention from the public, large (governmental) organisations such as Google, the EU and the UN now also see the importance of data ethics. Many ‘guidelines of data/AI/ML’ have been published, which can provide ethical guidance when working with data and analytics.
It is not necessary to enter the time-consuming endeavour of reading every single one of these. A meta-study of 39 different authors of guidelines shows strong overlap on the following topics:
1) Privacy
2) Accountability
3) Safety and security
4) Transparency and explainability
5) Fairness and non-discrimination
This is a good list of topics to start thinking and reading about. I highly encourage you to investigate these more deeply yourself, as this article will not explain them as deeply as their importance deserves.
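To make one of these topics concrete: fairness and non-discrimination can be operationalised in code. The sketch below, using hypothetical hiring-model outputs and group labels, computes per-group selection rates and the demographic parity difference, one simple fairness measure among many; the function names and data are my own illustration, not from any particular library.

```python
# A minimal sketch of one fairness check: compare a model's
# positive-prediction rates across groups (demographic parity).
# Group labels and predictions below are hypothetical.

def selection_rates(predictions, groups):
    """Fraction of positive predictions per group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred else 0)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rates.
    0.0 means equal rates; larger values flag potential bias."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical outputs of a hiring model for two applicant groups:
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, groups))  # 0.75 - 0.25 = 0.5
```

A gap of 0.5 means one group is selected at a 50-percentage-point higher rate than another, which would at minimum warrant the kind of team discussion this article advocates. Demographic parity is deliberately crude; which fairness definition is appropriate is itself an ethical question.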
Legal governance: are we there yet?
The discussion on the ethics of data is an important step in the journey towards appropriate data regulation. Ideally, laws are based on shared values, which can be found by thinking and talking about data ethics. Writing legislation without prior philosophical contemplation would be like blindly pressing buttons on a vending machine and hoping your favourite snack comes out.
The first pieces of legislation aimed at the ethics of data are already in place. Think of the GDPR, which regulates data privacy in the EU. Even though this regulation is not (yet) fully capable of strictly governing privacy, it does propel privacy, and data ethics as a whole, to the centre of the debate. It is not the endpoint, but an important step in the right direction.
At this moment, we find ourselves in an in-between situation in the embedding of modern data technology in society:
- Technically, we are capable of many potentially worthwhile applications.
- Ethically, we are reaching the point where we can mostly agree on what is and what is not acceptable.
- However, legally, we are not in a place where we can suitably ensure that the harmful applications of data are prevented: most data-ethical scandals are solved in the public domain, and not yet in the legal domain.
Responsibility currently rests (mostly) on the shoulders of data scientists
So, the field of data cannot be ethically governed (yet) through legislation. I think that the most promising alternative is self-regulation by those with the most expertise in the field: data science teams themselves.
You might argue that self-regulation brings up the problem of partiality. I propose it, however, as an in-between solution for the in-between situation we find ourselves in. As soon as legislation on data use matures, less self-regulation (though never none) will be necessary.
Another struggle is that many data scientists find themselves torn between acting ethically and building the most accurate model. By taking ethical responsibility, data scientists also take on the responsibility to resolve this tension.
I can be persuaded by the argument that the unethical alternative might be more expensive in terms of money (e.g. GDPR fines) or damage to company image. Your employer or client may be harder to convince. “How to persuade your stakeholders to use data ethically” sounds like a good topic for a future article.
My proposal has an important consequence for data science teams: next to technical skills, they would also need knowledge on data ethics. This knowledge cannot be assumed to be present automatically, as software firm Anaconda found that just 18% of data science students say they received education on data ethics in their studies.
Moreover, a single person with ethical knowledge wouldn’t be enough: every practitioner of data science must have basic skill in identifying the potential ethical threats of their work. Otherwise, the risk of ethical accidents remains substantial. But how do you reach overall ethical know-how in your team?
Two concrete actions towards ethical knowledge
Within my own team, we take a two-step approach:
1) a group-wide discussion on what each member finds ethically important when dealing with data and algorithms;
2) the construction of a group-wide accepted ethical doctrine based on this discussion.
In the first step, we educate the group on the current state of data ethics in both academia and business. This includes discussing data-ethics problems in the news, explaining the most prevalent ethical frameworks, and talking about how ethical problems may arise in daily work. This should enable each member to form their own opinion on data ethics.
The team-wide ethical data guidelines constructed in the second step should give our data scientists a strong grounding in identifying potential threats. The guidelines shouldn’t be constructed top-down; the individual input from the group-wide discussions forms a much better basis. This way, general guidelines that represent every data scientist can be constructed.
The doctrine will not succeed if constructed as a detailed step-by-step list. Instead, it should serve as a general guideline that helps to identify which individual cases should be further discussed.
Precisely that should be a task of the data scientist: ensuring that potentially unethical data usage does not go unnoticed. This means watching not only the work of data scientists, but also that of all colleagues who may use data in their work. This way, awareness of data ethics is raised, which enables companies to responsibly leverage the power of data.
In short: start talking about data ethics
We are technically capable of life-changing data applications, but a safety net in the form of legislation is not yet in place. Data scientists walk a tightrope over a deep valley of harmful applications, where overall knowledge of ethics acts as the pole that helps them balance. By initiating the proper discussions, your data science team gains the tools to prevent expensive ethical missteps.
As I argue in this article, discussion on data ethics propels the field towards maturity, such that we can arrive at a “rigorous and complex ethical examination” of data science. So, engage in discussion: be critical of this content, form an opinion, talk about it, and change your opinion as often as you encounter novel information. This not only makes you a better data scientist; it makes the whole field better.