I sat down with Eugenio Zuccarelli, Innovation Data Scientist at CVS Health and member of Data Science for Fortune 100 as well as Forbes 30 Under 30, to discuss Artificial Intelligence (AI) and the wider data landscape today. In this discussion we cover what data scientists get wrong about business, the ongoing data talent shortage, data problems within healthcare as well as some of the developments across AI that are piquing Zuccarelli’s interest such as synthetic data and explainable AI.
Elliot Leavy: I thought it'd be great to start with your ideas about performance metrics and machine learning and why in the past you think that many data scientists have been going about it the wrong way.
Eugenio Zuccarelli: So this might be slightly controversial, but my philosophy is that a lot of people, particularly data scientists, focus too much on machine learning metrics such as accuracy, ROC curves, and so on.
But in an industry setting the most important metrics are the ones that really matter for the business. In this sense, usually the most important metrics are the operational impacts that your model has on both the business and wider society.
Instead, too much time is spent on maximizing these metrics at the expense of the operational impacts. Of course performance metrics are important, but sometimes what is even more important is the positive impact your work can have on the business or its customers.
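As an illustration of the point about accuracy versus operational impact, here is a minimal sketch in Python. Everything in it is an assumption made up for the example: the ten outcomes, the two sets of predictions, and the cost figures do not come from the interview or from any real system.

```python
# Hypothetical outcomes: 1 = the event happened, 0 = it did not.
actual  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # cautious: misses half of the events
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # catches every event, raises false alarms

# Assumed business costs, invented purely for illustration.
COST_MISSED_EVENT = 100  # cost of a false negative
COST_FALSE_ALARM = 10    # cost of a false positive


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual outcome."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def operational_cost(actual, predicted):
    """Total assumed cost of missed events plus false alarms."""
    missed = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    false_alarms = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    return missed * COST_MISSED_EVENT + false_alarms * COST_FALSE_ALARM


for name, predicted in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(actual, predicted), operational_cost(actual, predicted))
```

Under these assumed costs, the model with the higher accuracy is also the more expensive one to operate, which is the kind of gap between a machine learning metric and a business metric described above.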
Elliot Leavy: How has the obsession with metrics become part of the data culture?
Eugenio Zuccarelli: It’s likely because data scientists, by training, are not always domain knowledge experts. So it's very easy for a data scientist to get stuck in their own way of seeing the world. In this sense, most data scientists will view success as the highest accuracy possible, but often they lose the bigger picture that, in most cases, we use data science in order to achieve something else, be that a business objective or a higher purpose. It is almost an art form to be able to forget the scientific components, keep the North Star business-focused and remind oneself of the purpose behind the science in the first place.
Elliot Leavy: But how do you bridge this disconnect between data science mindsets and business, then?
Eugenio Zuccarelli: One way is to look at how we train data scientists, tackling the problem right at the source, and assess how business-savvy university courses or online courses are. That means reinforcing the idea that while data science metrics are extremely important, data scientists serve another purpose for a stakeholder or product team, and reminding them that there is an overarching research goal.
This is why I have some reservations about online courses, because although they are undoubtedly a great tool, they often miss the depth of a design problem. It's easy to get stuck simply copying and pasting code from online resources without understanding the ultimate why behind it all. The result is a disconnect between the product teams and the scientists, so there needs to be a bit more interest company-side in improving the communication between the stakeholders and those making the algorithms. This will undoubtedly help facilitate a better understanding of priorities.
Elliot Leavy: In our recent talent report, it seemed that data talent is in short supply and that businesses aren’t expecting the situation to improve any time soon. What are your thoughts on this, what is the outlook?
Eugenio Zuccarelli: Right now a lot of universities are expanding their data science programs. A programme at MIT, for example, only had sixty students in 2020; now it has over one hundred in order to address this lack of talent. I also think that enrollment in universities is increasing, but there will always be, to some extent, a shortage in the future, especially because there are so many different specific domains. If you think about it, there are specialities ranging from natural language processing to image recognition, so it will take quite some time to get a lot of people deeply skilled in these niches.
Elliot Leavy: In terms of those niches, what is exciting you at the moment?
Eugenio Zuccarelli: What is really interesting me is the rapid advancement of AI-generated imagery through tools such as DALL-E 2. These developments are creating new questions and new answers that we maybe have never thought of before, and it really has developed at a pace that no-one really expected. What fascinates me the most is how we're unlocking a lot of different areas and topics, almost disrupting industries from one day to another without even realizing it. I was actually just playing around with it yesterday, and realised that these tools can create images that they themselves can then be trained on.
Elliot Leavy: Yeah, it's fascinating, training models on images which they themselves created. Does that sort of broach into synthetic data? Could you explain synthetic data to me and what is happening in that space?
Eugenio Zuccarelli: Essentially it is when machines generate datasets, creating a type of catalytic process in which they are then trained on these datasets themselves, and we then use this synthetically generated data to get the algorithms to complete specific tasks. While this fascinates me, it is also such a powerful tool that we have to make sure that it is properly regulated, because it would be easy for someone to just use these tools without fully understanding the implications. So if the model itself has bias, simply because it has picked it up from the data it has been trained on, then when we generate synthetic images or synthetic data we might be propagating this bias or even exacerbating it.
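The point about bias carrying over into synthetic data can be sketched with a toy example in Python. This is an illustration under stated assumptions only: the "generator" here simply resamples group labels at the frequencies it observed, which is far simpler than any real generative model, and the group names and the 90/10 split are hypothetical.

```python
import random
from collections import Counter


def fit_generator(records):
    """Toy 'generator': learn how often each group label appears."""
    counts = Counter(records)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return labels, weights


def generate(generator, n, seed=0):
    """Draw n synthetic labels at the learned frequencies."""
    labels, weights = generator
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.choices(labels, weights=weights, k=n)


def share(records, label):
    """Fraction of records carrying the given label."""
    return records.count(label) / len(records)


# Hypothetical skewed dataset: group_b is heavily under-represented.
real = ["group_a"] * 90 + ["group_b"] * 10
synthetic = generate(fit_generator(real), n=10_000)

print("group_b share in real data:     ", share(real, "group_b"))
print("group_b share in synthetic data:", share(synthetic, "group_b"))
```

Because the toy generator only knows what it was shown, the under-representation in the original data reappears in the synthetic data at roughly the same level. Any model later trained on that output would inherit the same skew.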
Elliot Leavy: Well with the incoming fragmentation of regulation across the world, AI auditing is becoming more pertinent.
Eugenio Zuccarelli: Yes, and I would say it's extremely needed. It’s one of those new areas developing out of the realisation that we desperately need to stress test these models. Currently there are a lot of models already in production being used to make decisions that affect people’s lives more than they should, often with detrimental outcomes.
Elliot Leavy: Interpretability is the phrase, correct?
Eugenio Zuccarelli: Yes, and it’s one of the most important aspects of what is happening in AI right now, because while a lot of data scientists are fascinated by, you know, the complexity of things such as DALL-E, we often sacrifice interpretability on the altar of complexity. With interpretability we can gain a lot more business-related outcomes and, as we mentioned already, sometimes data scientists are a bit stuck in their own worldview and lose meaning in the process. This of course isn’t a criticism; data science is a fascinating topic, but it helps to understand that complexity for complexity’s sake sometimes does not fully benefit society or businesses. Once you realize that, you're able to unlock a lot of potential in terms of actionable insights.
Elliot Leavy: To return to synthetic data: is there any industry in which you would specifically expect it to be more useful than in others? Why?
Eugenio Zuccarelli: Essentially it comes down to any industry where there is a lack of data. Healthcare, for example, benefits twofold from such data, as it can help with rare diseases but also has the added benefit of solving the issue of privacy, as the data is not actually anyone’s and so cannot be abused in any way. The only problem occurs when the models generating the data lack interpretability and run the risk of propagating biases inherited from the datasets they were created from in the first place.
Elliot Leavy: Is healthcare the place to watch for the best innovations going on in AI today?
Eugenio Zuccarelli: Yeah, it's fascinating because, if you think about it, the healthcare industry is a huge vertical. It's massive, and it affects each and every one of us, as we all have access to hospitals, doctors, and so on.
But if you think about it, the basis of machine learning and AI is data, and the main problem facing the healthcare sector in terms of innovation is that the masses of data within the industry are stuck in hospital databases, smart devices, genetic testing services, and even handwritten notes. So we still have to fix this data foundation issue. When we do, we're going to have access to a sea of data and a wave of innovations will follow.
Elliot Leavy: So the day when an AI can read those unintelligible notes from the doctors is the one to look out for?
Eugenio Zuccarelli: Exactly.