In this episode, I'm speaking with Ran Romano from Qwak.ai. Ran built the ML platform at Wix, and we discuss the various data roles, when organizations should focus on ML infrastructure, solving the hard problems of feature stores, and one approach to building an end-to-end ML platform.
Listen to the Audio
Read the transcription
Dean: Hi everyone, welcome to the MLOps podcast. I'm Dean, your host. Today I have Ran Romano with me. Ran is VP Engineering at an exciting MLOps startup called Qwak AI. Before that, he led the data and machine learning engineering team at Wix, where he built a really impressive end-to-end machine learning platform. He also has a master's degree in computer science.
Let's start by giving us a bit of background on how you got into the world of machine learning and machine learning engineering specifically.
Ran: My background is in data engineering, which was my profession at Wix. This is what I started with, and during my first degree I started getting more and more into machine learning. I took some courses, saw Andrew Ng's course at Stanford, and I really liked it but didn't really want to be a data scientist. I really like both worlds: the world of engineering and that of data science. So when I came into Wix and I had the option to do things in between, to be the engineering front end of the data science group, that felt like a really good match.
Dean: Interesting. Where do you see the border? Because there's obviously a lot of talk about titles today. Like, you have the data scientist, you have data engineers, now you have machine learning engineer... Where do you see the line between these titles, and where did you put yourself when you had the choice?
Ran: That's a really good question. When we started at Wix with the entire machine learning platform, machine learning infrastructure, basically, there was no title called machine learning engineer. What I called myself, or the team that I built, was actually machine learning infrastructure. Same for data infrastructure. Actually, the titles were software engineer/data infrastructure and software engineer/ML infrastructure, because this is what we did. We didn't really build the models themselves. We built the systems around the models, we built an ML platform. We built a system, we built a product that is designed for data scientists to bring their models to production better. So, in my eyes, I call that machine learning infrastructure. These days, you have the border between data scientists and machine learning engineers, right? People that only build models and people that need to put them into production. I don't really see it that way.
Dean: Do you think that there's a clear trend there? In the future, will most data scientists become machine learning engineers as well, or do you think that it's going to be more and more separated?
Ran: I think the trend right now is that it's becoming more and more separated. We are actually seeing more and more companies hiring machine learning engineers. But that's a good question, because it might go the way of, let's say, software engineers and DevOps, where software engineers are the ones bringing software into production. It's not like they hand off the code to DevOps engineers; DevOps engineers work on the infrastructure. Maybe it will go in that direction. This is what I personally believe in, but again, it's not based on actual data. There will be people that build more of the infrastructure, with better tooling, and the data scientists will also be responsible for deploying the models to production. That separation between data scientists that only need to build models and machine learning engineers that only need to put those models in production? I don't know, mixed feelings.
Dean: I understand. I think it's a good distinction because a lot of times, it feels like the discussion becomes shallower, saying either data scientists should do everything and be superhuman, or they only need to do research and everyone else around them will take care of the dirty work, and it actually makes sense. DevOps don't do the hard work of bringing software into production, they build the infrastructure that lets the software developers do it. So possibly, it's not a contradiction. Data scientists will look more like what we imagine ML engineers are doing today, but also you'll have some MLOps or ML infrastructure building the tooling for them. That makes sense.
Ran: At Wix, we had people building automation around that, because being a machine learning engineer who needs to take code that he didn't write, models that he didn't write, and put them into production is something a bit tedious. I think the trend will be automation and ML infrastructure.
Dean: I think it's also a good mental model for people that are listening and are trying to imagine what data or ML organization they need to build within their company. You need to have people that are making the process itself smoother, but you can't shift all of the work of doing the process to those people. If you're building your data teams, maybe the takeaway here is: you want to hire not just either data scientists or ML engineers, but you want someone who's actually in charge of building proper infrastructure and fine-tuning and optimizing those processes. Is there a specific part of this entire process of taking machine learning models into production that you're especially excited about, or that you were especially excited about and that led you to your current position?
Ran: I think there are two parts I'm really fond of. One is, how do we model a CICD system for ML? This is what I really started with at Wix. We had a really big problem with this handoff. We had a group of 30 data scientists and they needed to bring more and more models into production, and we only had a small group of machine learning engineers. They were software engineers, but today they would be called machine learning engineers. So that handoff procedure, being a data scientist that was dependent on the software engineering part, was a real issue for us. So we wanted to automate this process. It's simple, just like we do for software engineering. So the question is, how do we model the same structure, the same methodologies, but including the ML-specific parts, you know, integrating with data sources and everything else? How do we model the CICD process to make the data scientists as independent as possible, so that they will actually hit the deploy button and put a live endpoint into production? This is something that we really worked hard on for the first year or so while building the ML platform, the entire CICD. It wasn't just the tools, it was also the mindset. Because these people were not software engineers, they had more of a statistics background, physics... and now you come to them and say, listen, everything needs to be versioned, we need to understand what's in production. So it's a shift. No more Jupyter Notebooks, or no more Jupyter Notebooks in production...
Dean: Half of the audience now has fire in their eyes. I tend to agree and I think this is a recurring theme in this podcast. It's meaningful. Nowadays, there's definitely a clear shift. When we started DAGsHub two years ago, Jupyter Notebooks in production was a real contentious topic, and now I think it's pretty clear that most organizations are understanding that it's a great tool for some things, but bringing models into production is not one of those things.
Ran: I actually tried building the CI system based on Papermill. There's a famous series of blog posts that Netflix wrote about productionizing Jupyter notebooks. I really liked the record system, really tried to do that, but it didn't work out very well. It wasn't really something that held up at production scale.
Dean: Was there a specific reason? If I'm now a company that is trying to bring notebooks into production, what should I notice that will tell me that this is a bad idea?
Ran: I think Papermill, as a tool, was great. But I think the viewer wasn't that ready, and it felt like the entire process was really immature. This is what I remember from trying it out.
I don't remember the specifics, but it was a project that I personally believed in: productionizing notebooks, this is how we're going to model the CI system. Every model build, every model CI procedure would end in a notebook with nice visualizations. But it wasn't really valuable, and the flexibility of a Jupyter notebook when integrating with other production systems was something that really, really bugged us.
Dean: Interesting. You mentioned CICD, and this is actually something that comes up to my mind because there was a discussion about this a week ago. Some people are saying that CICD is not necessarily an appropriate mental model for machine learning and I feel like you are more qualified than many people to talk about that. So what was the conclusion? How did you manage to build a CICD system which is focused on machine learning?
Ran: CICD is a very broad term. What I imagined as important for CI is model versioning and data versioning. This is what we really wanted to build. We wanted to get to a place where we can reproduce every model that is now running in production. So we versioned the models, we versioned the data pipelines. This is the most important thing in what we thought of as the CI procedure for machine learning. A build for ML, I imagined it as something that runs tests on your model, tests that the data scientist provided, and that trains your model. It is not mandatory or standard to train the model as part of the CI process; this is something that I personally believe in, but again, I understand the cons. Then it serializes the model and pushes it to some repository. What's wrong with this procedure for ML? I didn't see any trouble doing that. You have one command, a build command, that takes your model from Git, or takes your model from your local directory, builds it, trains it, versions it, does all the serialization, and pushes a Docker image, something that is deployable. This is a full-blown CI system for ML.
Dean: I am biased, but I agree with everything that you said, especially the focus on coming to the CI mental model from the point that models and experiments should be reproducible and that they could be automatically built, and then that they could be automatically pushed to production. Most of the opposition to CICD for ML is just a terminology issue. Definitely, there are going to be differences from software, but there are differences from software in every other... software is also a broad term. There are a lot of different types of development and you have different types of CI. So, it makes sense that you'll automate at some point as well.
Ran: I think that the important part is automation. Specifically, we also wanted reproducibility, but for all the models that we had, we also had automatic training and deployment pipelines, because we didn't really want to build and deploy those models manually every time for every model that we built. For a lot of the models that we built, we also had Airflow DAGs that automatically build and deploy the models on a weekly or daily schedule. You can't really do that without some notion of a CICD system. I'm not sticking with the exact DevOps-like CICD, but CICD as a concept is something I still strongly believe in. I saw the benefits.
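Editor's note: the single build command Ran describes (test, train, version, serialize, push) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the Wix implementation: `MeanModel`, `run_tests`, `build`, and `load` are hypothetical names, a temporary directory stands in for the artifact repository, a JSON file stands in for the deployable image, and the version is simply a hash of the training data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class MeanModel:
    """Toy model (hypothetical): always predicts the mean of the training labels."""

    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, labels):
        self.mean = sum(labels) / len(labels)
        return self

    def predict(self, n_rows):
        return [self.mean] * n_rows


def run_tests(model):
    """Stand-in for the tests a data scientist ships with the model."""
    assert model.mean is not None, "model was not trained"
    assert model.predict(2) == [model.mean, model.mean]


def build(labels, registry_dir):
    """Train, test, version, serialize, and store one model artifact."""
    model = MeanModel().fit(labels)
    run_tests(model)
    # Assumption: the version is a hash of the training data, so identical
    # inputs always map to the same version string.
    version = hashlib.sha256(json.dumps(labels).encode()).hexdigest()[:12]
    artifact = Path(registry_dir) / f"model-{version}.json"
    artifact.write_text(json.dumps({"mean": model.mean}))
    return version, artifact


def load(artifact):
    """Rebuild the model from a stored artifact."""
    return MeanModel(**json.loads(Path(artifact).read_text()))


registry = tempfile.mkdtemp()  # stands in for a real artifact repository
version, artifact = build([1.0, 2.0, 3.0], registry)
restored = load(artifact)
print(version, restored.predict(2))
```

A real build would likely hash the code and pipeline definitions as well as the data, so that any change to either produces a new version.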
Dean: There are a few questions, but when you joined Wix, did you join specifically to build out this team, or did you join as an individual contributor, and then the team was formed?
Ran: No, I actually joined as an individual contributor to the data engineering group. As part of the data engineering group, my boss at the time and I saw that we're really having trouble scaling the ML, the data science team lacked the skillset to build these things themselves, and then we started the project so it was much more of an individual initiative. It was more this than someone buying into the idea "let's build an ML platform because data scientists need to scale their models". It usually doesn't go like this.
Dean: Did you need to do work to convince people that it makes sense? If someone is now a data engineer in a company, obviously there's a scale issue, in the sense that if you're a really small start-up and there are two data people, you probably don't need an MLOps team at that point, but maybe you need to start learning the MLOps basics. But if someone listening is a data engineer in a larger organization, what should they notice that tells them they need to start raising the flag, that maybe what we need to solve these issues is an ML infrastructure team? And how do you recommend they convince the bosses?
Ran: From my experience, one of the first troubles that we saw was scaling. Taking the first model or the second model to production took a few months, but taking the fifth, the sixth, and the seventh model to production took the same amount of time. The problems remained the same. This is where we started seeing a problem, that we needed the same amount of time for these repetitive tasks. This was the clear thing that we noticed. Another was more related to the data pipelines issue, more of the feature store. We really had trouble deploying models online. Batch (predictions and training) was a lighter issue, but deploying models online, using them as REST endpoints, for example, meant that the same data pipelines that were used for training now had to be modeled in production, which was a really big task that took a lot of the engineering effort. It really coupled the engineering and data science teams together, because every change in the data pipeline, in a feature (and a feature equals a lot of pipelines in that case), every change in a feature that the data scientists wanted to make required an integration effort from the data engineer or machine learning engineer, it doesn't really matter. This was something that really held us back, because there was only a certain number of features that we could use. A data scientist couldn't use 100 features in his model because no one could have extracted them for him.
Automating these extractions, these data pipelines, the training-serving skew, how do I use the same data? How do I get it in production? This is why we built a feature store, which is a really big part of the MLOps platform. Kind of the hardest part, I would say.
Dean: To share a personal story: the first time I saw you, there was a Meetup at Wix in the Tel Aviv harbor. There was, I think, another talk from Databricks before or after you were speaking. But I remember listening to you explain the architecture of the platform that you built, and this was the early days of DAGsHub. It was me and my co-founder Guy, we went to a bunch of Meetups to mingle with the community in Israel, and I remember that after the talk we went up to ask you a few questions about what you built. Afterward, we were saying, man, they built so many things, it sounds super impressive. I wonder how it actually feels to work with that in reality. So first I'll ask, what part of the platform that you built are you most proud of?
Ran: I think there are two parts. The CICD pipeline: by the end, we really nailed it, and I think the evidence was the number of models that we had. We started with five, six, seven models in production at Wix, and after a year or something like that, we had something around 300. I remember asking, where did all those models come from? What did you do? It was because it was really easy to add another model and another model and another model, just another API, just another endpoint. I was really proud of that. And the feature store. I think, engineering-wise, it was the hardest task, the hardest piece to design correctly, and we really had almost zero references for a feature store. We had bits of articles here and there, but it was really hard to build and to get right.
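Editor's note: one common way to reduce training-serving skew is to keep a single feature definition and call it from both the offline (training) path and the online (serving) path. The sketch below illustrates that idea only; the feature `events_last_7_days`, the event layout, and the helper names are assumptions made for illustration, not details from the Wix platform.

```python
from datetime import datetime, timedelta


def events_last_7_days(events, user_id, as_of):
    """Single feature definition shared by training and serving."""
    window_start = as_of - timedelta(days=7)
    return sum(
        1
        for event in events
        if event["user_id"] == user_id and window_start < event["ts"] <= as_of
    )


def build_training_row(events, user_id, label, as_of):
    """Offline path: compute the feature as of a chosen point in time."""
    return {"events_7d": events_last_7_days(events, user_id, as_of), "label": label}


def build_serving_row(events, user_id, now):
    """Online path: the same function, evaluated at request time."""
    return {"events_7d": events_last_7_days(events, user_id, now)}


now = datetime(2021, 6, 15)
events = [
    {"user_id": "u1", "ts": datetime(2021, 6, 14)},
    {"user_id": "u1", "ts": datetime(2021, 6, 10)},
    {"user_id": "u1", "ts": datetime(2021, 5, 1)},  # outside the 7-day window
    {"user_id": "u2", "ts": datetime(2021, 6, 14)},
]

training_row = build_training_row(events, "u1", label=0, as_of=now)
serving_row = build_serving_row(events, "u1", now)
```

Because both paths go through the same function, a change to the feature logic reaches training and serving together instead of requiring a separate re-implementation in production.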
Specifically, this training serving skew, generating the automatic data pipelines between training and serving. This was 50% of what we did at the ML platform.
Dean: The takeaway from that is one metric: if you're the ML infrastructure team, you want to show growth in the number of models in production. If you can show that, then you've proved your worth.
Metrics are always problematic because you don't know whether people will start optimizing for the wrong thing once you set the metric. I think it's the Graham principle (Edit note: It's actually Goodhart's law). Guy is better at this than I am... but in a sense, if there's a real need within the organization to productionize models and you made it easier, then you would expect growth in the number of models in production. So that makes sense. Maybe the other buzzword that has been said here a bunch of times is the feature store. I know that what you did is really impressive, but obviously today in the market this word is thrown around a lot. So first, can you define what a feature store is, in your opinion? And then maybe we'll dive into it a bit more.
Ran: Yeah, it's a tricky metric, because measuring infrastructure is always a bit tricky, measuring infrastructure that is used by someone else. Because what exactly do you measure? Do you measure the number of models? This is something that I really thought about. How do I measure the success of my team? How do you measure the success of an ML infrastructure team? The number of models? The number of predictions? These numbers are not actually up to me, in a way. But I think it's the number of models and the simplicity. We actually defined it, at the end of the day, as time. We wanted a data scientist to build, deploy, and have eyes on the model in something like an hour. Instead of the few months that we saw at the time, we wanted to get that down to an hour. It was really hard to measure directly, but I think at the end of the day, it was a successful project.
I can easily define what I wanted the feature store to be when we first started building it. First of all, we wanted to have a single, curated, and discoverable source of truth for features. We imagined it as a data catalog for features, because the problem we initially set out to solve was a problem of dataset reproducibility. We had a problem with one of the models that we constantly worked on: in every iteration of the model, now and then, a new data scientist came and started working with this very important model for Wix. He started from scratch. He went through all the lines of SQL that generate features, and in some cases the tables that he read from were there, in some cases they had changed completely, and in other cases he didn't understand how to reproduce the datasets. So the feature engineering part, the feature extraction path for training, was something that we wanted to solve. This is what we started with, a data catalog for features. This is how you now model a feature. This is how the feature interacts with our data, really a structure for a given feature. This is the first thing that we wanted to solve, and it made sharing features between projects much easier, because one data scientist defined a feature for his model, let's say, in a Wix example, how many times the user clicked 'publish' on his site in the last X days, 7 days, something like that. Then this feature was reused in another model, let's say the churn model and the premium model. Wix is a freemium-based company, so we have churn and we have premium subscribers, and a lot of the features that are used for the churn model are also good for the premium model, for example. So this was the first thing we set out to solve, a single discoverable source of truth, and the second was this training-serving skew. How do you make the features that the data scientist trained his model on accessible in production?
This is what I wanted to build the feature store to solve.
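Editor's note: the "data catalog for features" idea can be sketched as a small registry in which every feature carries its own metadata and can be found by keyword. This is a hedged illustration only; `FeatureDefinition`, `FeatureCatalog`, the field names, and the table names are all hypothetical, loosely based on the 'publish' clicks example Ran gives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Metadata describing one feature (all fields are illustrative)."""
    name: str
    owner: str
    description: str
    source_table: str
    expression: str


class FeatureCatalog:
    """Single place to register, look up, and discover features."""

    def __init__(self):
        self._features = {}

    def register(self, feature):
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def get(self, name):
        return self._features[name]

    def search(self, keyword):
        """Return names of features whose name or description mentions keyword."""
        keyword = keyword.lower()
        return sorted(
            f.name
            for f in self._features.values()
            if keyword in f.name.lower() or keyword in f.description.lower()
        )


catalog = FeatureCatalog()
catalog.register(FeatureDefinition(
    name="publish_clicks_7d",
    owner="churn-team",
    description="Times the user clicked publish in the last 7 days",
    source_table="site_events",  # hypothetical table name
    expression="COUNT(*) FILTER (WHERE action = 'publish')",
))
catalog.register(FeatureDefinition(
    name="sites_created",
    owner="premium-team",
    description="Number of sites the user has created",
    source_table="sites",  # hypothetical table name
    expression="COUNT(*)",
))
print(catalog.search("publish"))
```

A data scientist starting on a new model could search the catalog first and reuse `publish_clicks_7d` rather than rewriting the SQL from scratch.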
Dean: I tend to agree with that terminology. I feel like, usually, when you look at the feature store solutions that you can buy, they do a better job at solving the first part, and then most of them don't really live up to the second standard, which I think is objectively harder to do... so first, it's impressive that you succeeded in doing that. I guess everyone usually refers to the Uber platform, right? Like Michelangelo, that's the reference that people usually use, but that's obviously not open source, at least not as far as I know. And so, one question that I have about this is, what part of the success of the second feature, the training-serving skew, is solved by workflow? You need to convince the data scientists to work in a certain way to get these results, and how did you manage that within the organization? I don't know how many data scientists Wix has, but I'm guessing it's more than a few.
Ran: More than a few. So, how do I convince the data scientists to use the feature store? This is what you're asking.
Dean: Or maybe how do you make it so easy that it's not a hard choice for them to do?
Ran: I think it was mostly the second part that was attractive for data scientists, because when a data scientist wanted to deploy his model into production, the feature store was actually really integrated into the ML platform. So when a data scientist built his feature and then declared it in the schema of his model, the platform would auto-extract it for him, and then he was decoupled from the engineer. He didn't really have to talk with anyone to deploy his model, he didn't have to talk with anyone to get the features for him. So this is really valuable. This was one of the main reasons they started using it, because the organizational perspective was, listen, we're not going to extract any more features for you. You have a clear way of how to do that. You want to deploy models into production, you want to use hundreds of features in your model? This is the way, the feature store is going to solve that case. Otherwise, you're going to get stuck in integration, you're going to have to ask people for favors to extract the features for you, and the feature store really solved that. So this was a sweet spot. Do you want more features in your model, do you want to iterate faster, do you want to be more independent? This is the way to do that.
Dean: And your requirement from them was just to fulfill a contract? I.e., build out the features and interface?
Ran: Build out the features using the SDK that I gave them. Now, it wasn't really easy; not everything could be modeled using the DSL. So a lot of the pain was, okay, I need this specific set of features, or I need these aggregations. So a lot of what we did was, how do we support these feature families? How do we support feature families that are based on the site content, the content of the Wix site itself? How do we base features on data that we saved in Snowflake? And we always wanted to offer an infrastructure for that. We didn't really want to pull this specific feature or that specific feature. As an infrastructure team, and this is what I considered my team, an infrastructure team, we wanted to solve the general problem. How do I model features from Snowflake? How do I model features from this source, from that source, or another source?
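Editor's note: the workflow Ran describes, where a model declares the features it needs in its schema and the platform fetches them automatically at prediction time, might look roughly like the sketch below. Everything here is a hypothetical stand-in: `ONLINE_STORE` is a plain dictionary playing the role of the online feature store, and `ChurnModel`, `extract_features`, and `serve` are illustrative names.

```python
# Hypothetical online store: entity id -> precomputed feature values.
ONLINE_STORE = {
    "user-1": {"publish_clicks_7d": 3, "sites_created": 2, "days_since_signup": 40},
    "user-2": {"publish_clicks_7d": 0, "sites_created": 1, "days_since_signup": 400},
}


class ChurnModel:
    """Toy model that only declares which features it needs."""

    schema = ["publish_clicks_7d", "days_since_signup"]

    def predict(self, features):
        # Toy rule: users with no recent publish clicks are flagged as churn risk.
        return 1 if features["publish_clicks_7d"] == 0 else 0


def extract_features(schema, entity_id, store):
    """Platform side: resolve every feature named in the schema."""
    row = store[entity_id]
    missing = [name for name in schema if name not in row]
    if missing:
        raise KeyError(f"features not available online: {missing}")
    return {name: row[name] for name in schema}


def serve(model, entity_id, store=ONLINE_STORE):
    """Serving endpoint: the caller passes only an id, never raw features."""
    return model.predict(extract_features(model.schema, entity_id, store))


print(serve(ChurnModel(), "user-1"), serve(ChurnModel(), "user-2"))
```

The data scientist only edits `schema`; adding a feature to the model does not require anyone to write a new extraction pipeline by hand, provided the feature already exists in the store.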
Dean: So once the initial platform was built, most of the work was building connectors for more data types and things like that?
Ran: Building connectors for all data types. A lot of the time we spent was on performance, on the performance of the feature store. I'm not sure how technical you want me to get, but the way that we built the feature store was that features are not actually materialized. This is something I see in feature stores today. We had 5,000 features at Wix, but there weren't 5,000 ETLs computing these features all the time. The feature was only a concept, only metadata, once it was created. We only materialized it once a data scientist requested to generate a training dataset. So basically, a feature was just, in the offline world (the training world), a Spark SQL query. A feature translated into Spark SQL that we invoked on the data lake once the data scientist requested to generate a dataset. But again, because features were not precomputed, the trade-off was time. If nothing is precomputed, if everything you need is computed on the fly, at Wix's data lake scale, this was a big issue. For every training dataset, we invoked a really big Spark cluster, and a lot of what we did was performance optimization. What should the size of the Spark cluster be? What is the compute power needed to generate the training sets in viable, meaningful times? That sort of thing.
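Editor's note: the "features are only metadata until a training set is requested" design can be illustrated with a tiny sketch. Here the standard-library `sqlite3` module is swapped in for Spark SQL purely so the example runs anywhere; the `FEATURES` mapping, the `events` table, and the function names are assumptions, not the actual Wix DSL.

```python
import sqlite3

# Features exist only as metadata: a name mapped to a SQL aggregate expression.
# Nothing is computed when a feature is defined.
FEATURES = {
    "publish_clicks": "SUM(CASE WHEN action = 'publish' THEN 1 ELSE 0 END)",
    "total_events": "COUNT(*)",
}


def compile_training_query(feature_names):
    """Turn the requested feature names into one query over the event table."""
    # Expressions come from trusted metadata, not from user input.
    columns = ", ".join(f"{FEATURES[name]} AS {name}" for name in feature_names)
    return f"SELECT user_id, {columns} FROM events GROUP BY user_id ORDER BY user_id"


def generate_training_set(conn, feature_names):
    """Materialize features on demand, only when a dataset is requested."""
    return conn.execute(compile_training_query(feature_names)).fetchall()


conn = sqlite3.connect(":memory:")  # stands in for the data lake
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "publish"), ("u1", "edit"), ("u2", "edit")],
)

rows = generate_training_set(conn, ["publish_clicks", "total_events"])
print(rows)
```

The trade-off Ran mentions shows up directly: defining a feature is free, but every call to `generate_training_set` pays the full cost of scanning the underlying data.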
Dean: But doesn't that also risk losing reproducibility? You need to ensure that the underlying data isn't changing, right? So did you work on that?
Ran: The underlying data doesn't change. This is a guarantee of how the data lake at Wix is modelled. The data lake at Wix is very organized, with, you know, a specific catalog of how data is inserted, and no one is changing that. It's append-only and immutable. The ML platform was built on top of the data platform. We had a lot of the data infrastructure, the data tools, at our disposal, and we really built the ML platform on top of the data platform, which was really mature at that point.
Dean: So these are all good tips if someone is thinking about building their own equivalent solution. You need to solve for data and then you can solve for machine learning. You built this very sophisticated system and you gave us a few examples of the metrics that proved that it was useful to the organization. Are there some criteria, for someone who is listening, for when they shouldn't build such a thing? Specifically, the feature store seems very complex, so when do you think it's not necessary, or when is it too early to build out this thing?
Ran: Data is definitely first. I think it's too early to try to build yourself an ML platform if you only have one, two, or three models, or if these models are something that you need to really tweak and build the entire business on. Ours was a generic platform, and building such a generic platform has drawbacks. We had to work hard to add connectors, to add the frameworks that we support, to add all of these things. So having a generic platform for all the models in the organization has drawbacks, of course. If you need very few models of something very, very specific, say you want to build an entire recommendation system from scratch and you need it with very specific requirements, then I'm not sure building an MLOps platform and then writing your very complex ML system on top of that is a good idea.
Dean: Fair enough. For me personally, for us at DAGsHub as well, open source is a topic very close to heart, and obviously, in machine learning, everyone is using one of TensorFlow, PyTorch, or scikit-learn, which are all open-source packages. But even when you go into the realm of ML tools and ML platforms, you have open-source solutions, so I'm curious to hear, within the platform that you built, what parts did you decide not to build yourself? Did you use open-source solutions to fill in the gaps? And how did you choose the tools that you used?
Ran: The two main tools that we used, and again, it's a bit of a tricky question because that was 3.5 years ago, something around that. The ML tools that are available today weren't really available back in the day. I think I was the first user, at least in Israel, of MLflow. This was in the early days, when I wanted to build the platform. I went to London, to the Databricks convention, and I saw MLflow and said, this is how I'm going to solve the serialization issue. This is going to solve the experiment management. This is how I'm going to build a CI system on top of it. This was a naive idea at the time, but it worked. It required a few iterations, but eventually it worked. So we used MLflow, and again, I think I was the first one to use MLflow in production, I would dare say that, and it was really immature. We had to customize the hell out of MLflow, basically. So using that project for what we saw as an MLOps platform for Wix... in retrospect, it was a leap of faith, and it eventually paid off, I think, but again, the road wasn't easy. So we used MLflow for experiment tracking, we used it to build a CI system on top, and we used SageMaker as a hosting mechanism. These were two of the many tools that we used, and Spark. Spark is the offline feature store, or the offline feature store engine, based on our data lake, on Parquet files on S3, and for the online feature store, it was Redis. So it was a lot of open source.
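Editor's note: why an append-only, immutable store helps reproducibility can be shown with a small sketch. A training set built "as of" a given point returns the same rows no matter how much data arrives later. `AppendOnlyLog` and its integer ingestion timestamps are hypothetical and only illustrate the guarantee, not how the Wix data lake is implemented.

```python
class AppendOnlyLog:
    """Rows can be added but never updated or deleted."""

    def __init__(self):
        self._rows = []

    def append(self, ingested_at, record):
        if self._rows and ingested_at < self._rows[-1][0]:
            raise ValueError("rows must be appended in ingestion order")
        # Store a copy so later changes to the caller's dict cannot leak in.
        self._rows.append((ingested_at, dict(record)))

    def snapshot(self, as_of):
        """Return copies of every row ingested at or before `as_of`."""
        return [dict(record) for ts, record in self._rows if ts <= as_of]


log = AppendOnlyLog()
log.append(1, {"user_id": "u1", "action": "publish"})
log.append(2, {"user_id": "u2", "action": "edit"})

first_run = log.snapshot(as_of=2)
log.append(3, {"user_id": "u1", "action": "edit"})  # data keeps arriving
second_run = log.snapshot(as_of=2)
```

Since nothing before the cutoff can change, regenerating a dataset with the same `as_of` value yields the same input rows, which is what makes lazily computed features reproducible.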
Continuing that question, why I left to build Qwak: stitching these open sources together, building an MLOps platform on top of these and other tools, was a really big task, and it required resources, it required a full team of six, seven, eight people, depending on which period of time. Building a team of eight people to build an MLOps platform for your organization is something that... Wix started having these problems early, so we had to build it. But I don't think that should be the way to go today, to build your own MLOps platform from scratch... that's a big thing to do if that's not your main business.
Dean: That's a fair statement. There's a lot of knowledge to inject into that patching together, because things need to not just work on the same Kubernetes cluster, or whatever you're running it on, it actually needs to have a logical interface, so people can transition between the stages of the process logically. But I guess this is a good segue, so maybe you can share a bit of what you're working on at the moment?
Ran: This is the main point, actually. I'm the VP of Engineering and one of the cofounders of Qwak. What we're building is basically an end-to-end MLOps platform, similar to what I had to build at Wix, to what our CTO, Yuval, saw from the SageMaker side, or Alon, our CEO, who was the VP Data at Payoneer. All of us had to kind of reinvent the wheel in different places, and I don't think that's going to be the case anymore. I think this time, building an MLOps platform for your organization is something that we want to help companies do.
Dean: When you say end-to-end, where do you draw the lines?
Ran: When I say end-to-end, our areas of focus right now are the CICD part: how do you build more models faster and deploy them into production? That's the mission statement, as I see it, for the CICD. And the feature store, which I think is really an integral part of what an MLOps platform should include, because it is the hardest part of building an MLOps platform, especially for online models, real-time models, and I think the trend is going there, building more and more real-time models. So I think a feature store, specifically solving training-serving skew, is something that an MLOps platform must have. We have two notions in Qwak, two first-class citizens. One is a feature and the second is a model. It's many-to-many: the idea is that you can see which features a model is using, and the same for a feature, which models this feature is being used in. Having a feature store alone doesn't allow you this level of discoverability or this level of integration. At Wix, when a data scientist needed to build a new model, the first thing he did, and I know that from experience, was to go through the feature store and see, for other models, or for a model that he thinks is similar, what features are they using? And then he built the first version of his model. Oh, so this sounds like an interesting feature, because this and that model are using it. So we had a very rich ecosystem of features. You just had to pick and choose which features you were going to use in this model, generate the training set, and see how it goes.
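Editor's note: the many-to-many relationship between features and models can be sketched as a two-way index, so both questions ("which features does this model use?" and "which models use this feature?") are cheap to answer. `FeatureModelIndex` and the model and feature names are hypothetical and are not Qwak's actual data model.

```python
from collections import defaultdict


class FeatureModelIndex:
    """Two-way index over a many-to-many feature/model relationship."""

    def __init__(self):
        self._features_by_model = defaultdict(set)
        self._models_by_feature = defaultdict(set)

    def link(self, model, feature):
        """Record that `model` consumes `feature` (kept in both directions)."""
        self._features_by_model[model].add(feature)
        self._models_by_feature[feature].add(model)

    def features_of(self, model):
        # .get avoids creating empty entries for unknown models.
        return sorted(self._features_by_model.get(model, set()))

    def models_using(self, feature):
        return sorted(self._models_by_feature.get(feature, set()))


index = FeatureModelIndex()
index.link("churn_model", "publish_clicks_7d")
index.link("churn_model", "days_since_signup")
index.link("premium_model", "publish_clicks_7d")

print(index.features_of("churn_model"))
print(index.models_using("publish_clicks_7d"))
```

A data scientist starting a new model could call `features_of` on a similar existing model to get a first candidate feature list, which is the discovery workflow Ran describes.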
Dean: I'm guessing what you're talking about is very oriented towards tabular data, right? Because then people are looking at the model and choosing the feature calculations that they need and then it's being pulled into the model. So this is something that we're thinking a lot about at DAGsHub as well, but more in the realm of unstructured data, and how do you do that?
Ran: Actually, a feature store for unstructured data is also an interesting concept, or an embedding feature store.
Dean: It's definitely earlier in the lifecycle of unstructured feature stores, but it's something that we're thinking about. What does it mean to do all of this work where I have this model, I want to be inspired by it, to understand what the smart people that came before me did to create it, and then, hopefully, improve it, because everyone has something to contribute and maybe you're looking at the problem from a different perspective, but you have to have that context. So I definitely understand what you're talking about. One trade-off that we've seen come up a bunch of times, and I'd like to hear your perspective on is: End-to-end versus integrated. You said that your choice was to go after an end-to-end solution. How do you view this trade-off? When is this solution better? When is the other option the right one?
Ran: It's a question of best-of-suite vs. best-in-class. When you're really trying to build an ML platform, you're starting your ML journey and you now need to build an ML platform, you need to build some parts of a feature store, some part of a deployment mechanism, some part of the workflow orchestration, and then try to get all of these components together and glue them. It's something that can be very hard to do. I know that because this is what I had to do. So you need a team to build that, you need personnel, you need people to glue that thing together into a platform that makes sense for the organization.
Dean: Thank you very much for your time. It was a pleasure having you here, and hopefully we can talk again in the future.
Ran: Yeah, of course, thank you very much.