Every data science project is composed of an algorithm (the model) and the data. While data scientists naturally focus on math and models, experience shows that for a data science project to succeed in real life, much more than a deep understanding of math is needed.
Gilad Barkan and Noa Stanger take us through their team’s journey from a model-centric approach to a data-centric one.
In this episode of the Wix Engineering Podcast: how a data science team upgraded their AI by refocusing, changing their priorities, and creating a new workflow focusing on data.
You can also listen to this episode on Apple Podcasts, Spotify, Google, or the Wix Engineering site. You can also read the full episode here:
Hi and welcome to the Wix Engineering podcast, I’m Ran Levi.
There are few things in life more challenging, more technical than artificial intelligence. Building software is hard enough, but at least computer programs follow simple rules when you tell them to. Artificial intelligence requires not just good coding, but an effective use of massive datasets.
Because AI is so tough, it tends to attract only super intelligent people--math gurus, PhDs and such. But even in a field like this, being super smart isn’t enough.
Gilad: Hi. I’m Gilad Barkan. I’m team lead in the Data Science Group at Wix.
Gilad’s been working in the data science group at Wix for a few years now.
Gilad: At the beginning of our group, we were basically – developed quite basic models around analytics of Wix’s business performance and then we started to go for more complex problems to make an impact within the Wix product.
As the data science team took on more complex and important tasks, they needed more people. But not just the PhDs.
Gilad: The experience told us that in order for a data science project to succeed in real life, you will need to have much more than math skills and much more than a deep understanding of the math is needed. So we understood that there should be additional skills in our group.
Other kinds of skills. Different ways of approaching the models, dealing with the data, looking at things in general. That’s how they found Noa.
Noa: So hi. My name is Noa Stanger.
This podcast is going to follow Gilad and his PhDs, teaming up with Noa and a band of other multi-talented people, to create a different kind of data science operation. One that combines left brain and right brain--different kinds of thinking, and seeing. Along the way they’ll discover a different, better way to do data science--one that invests heavily not only in state-of-the-art models, but also in deep and well-defined data processes, which are the key to the success of their projects.
…
EARLY TROUBLES
A lot of you listening to this are programmers, so you’ll be familiar with the concept of DevOps. Nowadays, in the field of data science specifically, there’s something called “DataOps.” It’s the same basic idea as DevOps: improving quality and efficiency in the process of doing data science. The concept has been around since 2014, but only in the past few years or so has it gained traction.
In the early days, Gilad and his colleagues didn’t necessarily have that perfectly laid out framework for how to improve their data science operations. So they were doing fine, but struggling to get the results they strived for. For example, one of their big projects at the time was to design an algorithm that could remove backgrounds from photographs. Seems simple, sure, but after all kinds of iterating and reiterating, the models just weren’t reliable enough.
So…
NOA JOINS THE TEAM
Noa: I was brought to the team.
Noa. She wasn’t hired to be a data scientist, though. Her job would be a bit different.
Noa: Like take care of the logistics and see what you can make out of it.
I lead the entire operation around everything, which is not purely data science, which means the data, the infrastructure and the product.
And then once I started to go into the data and go into the results and see it, we understood that there is something and it was a joint understanding of me and the data scientists and my managers, that we have something really, really unique.
Something really unique. Like a secret superpower they just weren’t using yet. Why hadn’t they seen it before?
NOA’S NEW PERSPECTIVE
To see what Noa saw in Gilad’s team, it helps to know a bit about her.
When she joined the Wix data science team, as we mentioned, they were looking for some out of the box thinking.
Noa: There was this understanding that we need something a bit different. The team at the time was mostly data scientists and engineers.
I was the first non-data scientist and non-engineer to join the team.
I studied architecture.
I came from a whole different domain that, coming to think about it, does actually connect this creative thinking and practical things.
So I think it did help me look at these problems in a bit of a different way and try to think a bit out of the box how to solve it and not accept, OK, these are the KPIs, these are the numbers. But ask myself how can I look at these numbers and how can I look at this data and improve it.
For example, I can give this example of the cut-out project.
IMAGE CUT-OUT
The cut-out project. We mentioned it briefly earlier.
Noa: Image segmentation which takes the image and removes the background. It can be very relevant for ecommerce stores for example. If I have a product and I want it to be on a white coherent background.
It seems so simple but, from a machine learning point of view, it wasn’t.
When Noa first joined, Gilad’s team was working on this image cut-out problem. They were approaching it like any data science team would: by building and reworking their algorithm.
Noa: What we did is the entire process in data science is very iterative. You train the model. You see the results. And then you try and understand where the error is and you fix it in the next iteration. The next iteration that we did was fixing the model, it was going back to the algorithm and going back to the code and trying to fix it. And this is what, at the initial point, we understood was a bit incorrect.
But when I looked at the data visually, I could see it.
Gilad and the data scientists were looking at metrics, at KPIs. And Noa, the architect, looked at the photos themselves. Maybe there was something to be gleaned from them. She was more used to dealing with images, anyway.
Noa: We did it on portrait images. When I went and looked at the data and reviewed like hundreds of images, I saw that all the hats, all the people that have hats in the images, their hats are always cut off.
It turned out that the program could distinguish people in the foreground, but it would read their hats as part of the background. Such a simple thing. So obvious that the engineers hadn’t even thought of it.
Noa: So these are the sort of things that a data scientist wouldn’t see when one just looks at the pure metrics.
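The episode doesn’t describe the tooling the team used, but the hat example can be sketched in code: an overall metric can look healthy while one slice of the data fails consistently. The following is a minimal sketch, assuming each image carries hand-assigned tags and that masks are flat lists of 0/1 pixels. The function names, the tags, and the toy data are all hypothetical, and intersection-over-union is just one common segmentation metric, not necessarily the one Wix used.

```python
from collections import defaultdict


def iou(pred, truth):
    """Intersection-over-union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0


def iou_by_tag(samples):
    """Average IoU per tag; each sample is (tags, predicted_mask, true_mask)."""
    scores = defaultdict(list)
    for tags, pred, truth in samples:
        score = iou(pred, truth)
        scores["all"].append(score)  # overall metric
        for tag in tags:
            scores[tag].append(score)  # per-slice metric
    return {tag: sum(vals) / len(vals) for tag, vals in scores.items()}


# Hypothetical toy data: three tiny 4-pixel "images".
samples = [
    (["portrait"],        [1, 1, 1, 0], [1, 1, 1, 0]),
    (["portrait"],        [1, 1, 0, 0], [1, 1, 0, 0]),
    (["portrait", "hat"], [0, 1, 1, 0], [1, 1, 1, 0]),  # hat pixel dropped
]

report = iou_by_tag(samples)
for tag, score in sorted(report.items()):
    print(f"{tag:10s} {score:.2f}")
```

In this toy data the overall score stays fairly high while the "hat" slice scores clearly lower. That is the kind of gap that only shows up once someone has looked at the images and decided which tags are worth tracking.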
The program wasn’t all fixed just because of some fixed hats, of course. But it was evidence that having people around who see things in unique ways--who approach problems differently--can help even with such highly technical work.
PROS OF A NEW PERSPECTIVE
Interviewer: Is there anything about an architecture background in particular that you think helps?
Noa: Oh, wow. That’s an interesting question.
Machine learning, AI. It’s very big buzzwords that people think are very intense and intimidating, when in the end it comes down to breaking it into the different parts and I think this data process is a very significant part that I was able to break it into and really helped us improve.
To take something that is very technical and hard and many things that need to work together like a building and try to simplify it and explain it and put it into ground, into something that can be very explainable.
We should pause for a moment, because there’s a risk in oversimplifying the matter.
It isn’t that Gilad and his brainiac coders were too blinded by their fancy numbers, and Noa magnificently solved all their problems in one stroke of pure genius. It’s much more practical than that.
By adding someone new and different to the team--with her own skill set, and a different way of looking at things--they all, together, were able to operate more efficiently. They were able to solve problems from more than one angle, cover for each other’s weaknesses, and leverage each other’s strengths.
IMPORTANCE OF THE DATA
And it helped them discover something about data science in general.
Noa: This is the beginning of this paradigm shift of “wait, maybe it’s not only the model. Maybe it’s also the data”.
The data. The images. The hats.
Gilad: So Google more than 10 years ago coined a phrase that “more data beats better algorithms” or “trumps better algorithms” and since in the industry, (not Google) having big label data is a very high barrier for most of the non-Google-like companies.
“More data beats better algorithms.” Still, most companies aren’t Google. So for most companies--even Wix, which has plenty of data--feeding the hungry machine learning algorithms with big, labeled data is too high a barrier.
Gilad: So we realized that if you have small data even, not big data but small data, but it’s small quality data, you can get along very, very, very far.
Small data. A bit counterintuitive, a bit of a rebellion against the trend towards more and more.
Gilad: So that’s why we invest a lot in having very good quality data. Machine learning is like a darts game, you know. You are throwing darts into a target, into the board. So we are investing a lot in pointing out where the correct target is. So that when we are throwing the darts, which are our models, it will be much easier for them and we can use the best of the state-of-the-art algorithms that are out there. If you feed them good data, they will be on target.
Gilad and his team used to spend around 80 percent of their time on the models and 20 percent on the data. They decided they needed to flip that ratio: 20 percent models, 80 percent data.
Noa: Our jump from having a good model to like an excellent model was really done only through the data. Like our technical attempts could go to some extent and then the major improvement into perfecting it was actually through the data.
ADDING LABELING, CURATING
The new data-first approach began a sea change. A change not just of priorities, but also of who was going to be part of their team.
Gilad: It was not like a spark. It was an evolution of – a process of our groups. But slowly we extended our group not only having more data scientists but having more data people and we started creating a group two teams within our group. One team is about the labeling, data labeling.
Labeling data. Nothing sexy about it. But you know what? If you take your data seriously, it suddenly starts to seem like an absolutely vital part of any project.
Gilad: For example, a good example can be - we have a couple of projects which deal with – to understand what is beauty, because our website, we want our users’ websites to be stunning and we try to help them, their images, the layout of the website.
We want to help them be more stunning. So we need to understand what is beauty. And beauty is a very, very subjective task, you may imagine. So this subjectiveness, how do you give a label for this – such kind of subjective task as beauty?
So we had been through a long research and we’ve done very cool projects and very cool developments around understanding beauty and giving a label for such subjective tasks.
So this label is the key of the whole process and we realized that the label is hugely underestimated in the industry.
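Gilad doesn’t say how the team actually labels something as subjective as beauty. One common approach, shown here purely as an illustration, is to collect ratings from several annotators, use the average as the label, and flag items where the annotators disagree strongly so a person can review them. The 1-to-5 scale, the `max_spread` threshold, and all names below are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings, max_spread=1.0):
    """Turn per-annotator scores into one label per item.

    ratings: {item_id: [score, ...]} on a hypothetical 1-5 scale.
    Items whose scores spread more than max_spread are flagged for review.
    """
    result = {}
    for item_id, scores in ratings.items():
        spread = pstdev(scores)  # population standard deviation
        result[item_id] = {
            "label": mean(scores),
            "spread": spread,
            "needs_review": spread > max_spread,
        }
    return result


# Hypothetical ratings from three annotators per image.
ratings = {
    "img_a": [4, 5, 4],  # annotators broadly agree
    "img_b": [1, 5, 3],  # annotators disagree
}

labels = aggregate_ratings(ratings)
for item_id, info in labels.items():
    print(item_id, round(info["label"], 2), info["needs_review"])
```

The design choice here is to treat disagreement as information rather than as noise: an item the annotators cannot agree on goes back to a person instead of going into the training set with an unreliable label.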
In addition to labeling, they added one more team.
Gilad: The other team is about – we call it data curation, exactly from the – from curating arts, images.
Managing the data that would be used, ensuring its quality, making sure it all works together towards the correct business goals.
Gilad: Which is a lot about quality, QA, and the processes of the data.
Noa: In the past, data scientists used to be responsible for the entire thing and all of a sudden we distribute it into quite a few people along the process.
GROWING, SUCCEEDING
Gilad: Now the team is – I think consists of more than 50 percent of the whole group, these two teams that their daily job is only about the data.
Noa: We did understand that we have something really unique here, that is working and the team grew and grew because as we started going into more complex projects, we understood that having a data curator assigned to each project is critical for our success and then it also opened a lot of opportunities to do things we didn’t do before, types of analysis, types of ways to look at the data.
They began as a data science team. Now they were a data science operation.
Noa: Today in most of our projects, we even develop our in-house tools for labeling the data and for managing the data, which is huge. We have like a whole – an engineering and the internal product team that create internal products for us in order to actually facilitate these data processes.
So we started formalizing how we do that. How do we outsource labeling jobs which is also something major? Because if I need tens of thousands of examples, it’s going to be hard for me to create all of it sometimes in-house in specific projects.
So for all of these, we started forming the methodologies and we started actually training the team to do it. But it always – we always need to keep on reinventing ourselves in every project that we start which is unknown and trying to invent and do something that no one has done before.
So we need to define the KPIs, define how we collect the data, how we measure ourselves, how we go through the results. So it’s an ongoing process that we are in all the time.
With the new focus on data first, and all the new people bringing their unique skill sets to the table, Gilad and Noa are doing a lot more than just cutting out backgrounds from images.
Gilad: So if we started in computer vision maybe, now the data science at Wix is impacting more and more areas. Wix product is very diverse, very complex and we are impacting in a lot of other places with very – it’s from chat bot support, forecasting. We are doing a lot in a lot of areas and I think these best practices and infrastructure of data and engineering make difference and we are gaining from it now.
…
CONCLUSION: UNICORNS
When Wix’s data science team was struggling with cutting out backgrounds from images, it was partly because they were so focused on their deep, complex mathematical models. But good, modern AI requires more than just the really tough math.
Interviewer: What are the skills that we should be seeing from newer, modern data scientists?
Gilad: So, before data science came into the game, which is I think like six or eight years, there was machine learning and machine learning was about the confluence of research and engineering, programming and the math.
Basically it was about a very engineering world, different systems, stuff like this. Now when data science came into the game, because AI came into all business, all around, so now the data science is – the machine learning added the business skill to it.
So I think data scientists, so-called data scientists, but data scientists are very different from company to company. Each company defines data scientists as doing different things. But I think in general, the skill set of the data scientist now is – we call it a unicorn because it should be both the programmer and the researcher and the businessmen.
To do data science the right way--to build AI that’s really good, in an efficient manner--you need people with different kinds of skills. Left brain and right brain. Different perspectives from which to solve problems, different eyes to spot different hidden solutions in the data.
In short, you need Gilad and Noa.
That’s it for this episode, thanks for listening. For a full list of our previous episodes, visit wix.engineering/podacst. The Wix Engineering Podcast is produced by PI Media: written by Nate Nelson, produced by Yotam Halachmi, and narrated and edited by me, Ran Levi. Special thanks to Moard Stern from Wix. See you again next episode, bye bye.