
MLOps for NLP Systems with Charlene Chambliss

MLOps Podcast May 16, 2022

In this episode, I'm speaking with Charlene Chambliss, Software Engineer at Aquarium. Charlene has vast experience getting NLP models to production. We dive into the intricacies of these models and how they differ from other ML subfields, the challenges in productionizing them, and how to get excited about data quality issues.

Listen to the Audio

Read the transcription

Dean: Hi everyone, and welcome to the MLOps Podcast. My name is Dean, your host, and today I have with me Charlene Chambliss. Charlene holds a Master's in statistics from CSU and a Bachelor's in psychology from Stanford. She was a data analyst at CrunchBase, a data scientist at Curology, and a machine learning fellow at SharpestMinds. She was then a senior software engineer at Primer AI, where she took part in building Primer Automate, an AutoML platform for NLP tasks. She is now an incoming senior software engineer at Aquarium Learning, where she's building other amazing things. Charlene, you are an expert in the area of getting NLP systems into production. I'm curious: were you always especially drawn to NLP as a field, or did you happen to find yourself working in it?

Charlene: Well, I started out in data science more generally. I got interested because I saw all of the cool ways it was being used in socially impactful ways, different applications of ML like remote sensing for deforestation, civic data analysis to reduce traffic fatalities, all that kind of cool stuff. And so I was like, how do I get deeper into ML after my data science experience at Curology? Then I was thinking about the different subfields, and I realized that I have an affinity for language. I've always really liked analyzing the nuances of language; I was a big reader as a kid. So it seemed like a natural fit. Also because I saw NLP as a little bit more welcoming for people from slightly more unconventional backgrounds, like myself, because I majored in psychology and not computer science or electrical engineering or something, which tends to be more common in general ML or the other subfields. And I noticed that a lot of people in NLP had pretty diverse backgrounds but had still made really interesting contributions.

Dean: Do you think that's a people thing, or do the specifics of NLP make it more amenable to varied backgrounds? The most complex models that we've built so far have been NLP models, so you couldn't say that NLP is a less complex area. Otherwise you might say, maybe it's less complex, so people come into it. What are your thoughts? Why is it more diverse compared to other areas of machine learning?

Charlene: I think it's a combination of what you said. One thing is the people, the kinds of people who tend to be interested in doing things with language come from all sorts of different backgrounds. They might come from a literature background or linguistics background or a language major background or something like that, and then they moved into NLP from that academic approach.

Dean: I've met a few people that have studied linguistics, and I don't know why, this might be just the random group of people that I've met, but I feel like people that study linguistics are some of the most excited, enthusiastic people that I've met. So maybe that means that it's a draw. I also think that language has something about it which makes it unique. It's sort of what makes us unique. So maybe that's part of the draw. It's a more elementary part of being human. So it makes sense that AI sort of has stronger foundations there, but also that it's a draw for a lot of people because it's very human to think about language.

Charlene: One more point is some of the details of the ML. For one thing, in computer vision models, you have to understand a lot about convolutions and Fourier transforms and stuff that you would only have been exposed to if you had a physics background. That can be a little bit intimidating. If it's basically just matrix multiplication that you have to think about, that simplifies things a little bit. Not to mention that with NLP, we have HuggingFace, which has given us this beautiful, very high-level, abstract interface that you can use to build these models. I think there is still a little bit more PyTorch fiddling and stuff that you have to do for computer vision.

Dean: This is a personal thing, but Fourier transforms get a bad rep. I guess they are objectively more complicated than other things in the math world that relate to meaningful, applicable things in the real world. But I do want to recommend the YouTube channel 3Blue1Brown; they have a great video on Fourier transforms. I did study physics, and I remember going into the first class of my second year without having studied them in my first year, as opposed to many of the people that I studied with. It was a culture shock. But in the end, it's fun to understand, and there are really good videos today. But we're here to talk about ML and MLOps. Obviously, getting models to production has its challenges no matter what type of model you work on. But one of the things that surprised me when I was speaking to you the first time was that there are a lot of unique things about getting machine learning models into production when we're talking about NLP systems. This is not necessarily related just to the model, but also to working with the data. What do you see as the main differences and constraints between working with NLP and other ML subfields?

Charlene: Other ML subfields have had a lot of success solving problems with small to moderately sized models that are somewhat more easily deployable. Whereas all of the best models for NLP nowadays, like RoBERTa, GPT-3 and all of its descendants, basically any large transformer variant, are massive. They can take up a gigabyte of memory on your virtual machine. That can be pretty tough to work with operationally. It requires very powerful GPUs to run inference in a reasonable amount of time. The compute costs of running these kinds of models can be very high and really eat into your margins if you're a business trying to actually make use of these things, and it gets even worse if you're the type of business where your product is only valuable if every customer gets their own custom model, and then you have to manage the cost of deploying all those models for inference. So that can be rough. But people right now are looking into different ways to shrink models, such as pruning, distillation, quantization, and running examples through only some layers. One thing we used at Primer is something called inference triage, where you use a smaller model to gatekeep your larger model. It handles the easy examples, and if something is a more difficult example, it passes it on to the big model. That can save you a lot of time and cost.

The second thing that makes MLOps challenging for NLP is that it's really hard to do data augmentation in a way that's actually semantically valid and automated, because language is discrete and contextual. If you change a given word in a sentence to try and make a second, slightly different example, say "I love this person" vs. "I adore this person", those two words are synonyms, but the connotation of each is different enough that they're not quite the same. Whereas with something like computer vision, if you cut part of the image off, it's still more or less the same image, just with that part missing.

Another thing is that a lot of tasks are actually really hard to automate or evaluate without having humans in the loop, and those are also the most useful applications of NLP, like translation and summarization, particularly anything that takes a sequence and generates another sequence. There's no such thing as an automated fact-checking algorithm. So if you have a summarization model and it's outputting completely made-up facts that were not in the original article, you have to have a human read that and notice it, because if you're using ROUGE score or one of the other traditional metrics, it's just going to measure token overlap. It has no sense of whether the output is actually true and useful to a human.

Dean: I have a few follow-up questions. I'm in Israel, we speak Hebrew. I was reading a Facebook discussion today where someone was trying to train a large language model for Hebrew. Apparently, there are not that many languages in the world that have large language models actually trained for them from scratch, so I'm not talking about finetuning and things like that. He was asking for advice from people who have had experience training large language models. Basically, the overwhelming advice was: everything you know from smaller models doesn't apply to larger models. It's a different class of problems. And some of those things are the things that you just described now. The first thing you said that was interesting is the model triage. The idea is very intuitive, but I'm very curious about how that sort of application actually looks. I'm guessing that there are no frameworks for that. Is this something that you had to custom tailor for your use case? Does it work like this: the smaller model predicts and also provides a confidence estimate, and then depending on that, the example either goes to the larger model or not? Does that make sense?

Charlene: At Primer, for any kind of custom pipeline like that that we would have to build, they have their own ML platform to handle it, so we compose it out of pieces. We have the deep learning library, the deep learning library has the individual models, and on top of that you have a pipeline orchestrator that's running the documents through the different models.
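To make the inference triage idea concrete, here is a minimal sketch of confidence-gated routing in Python. The checkpoint names, the 0.9 threshold, and the use of HuggingFace pipelines are illustrative assumptions, not Primer's actual implementation.

```python
# Hedged sketch of confidence-gated inference triage: a small model answers the
# easy examples and only escalates uncertain ones to the large model.
# Checkpoint names and the 0.9 threshold are illustrative placeholders.
from transformers import pipeline

small_model = pipeline("text-classification", model="distilbert-base-uncased")  # stand-in for a fine-tuned small model
large_model = pipeline("text-classification", model="roberta-large")            # stand-in for a fine-tuned large model

CONFIDENCE_THRESHOLD = 0.9  # tune on a validation set

def triaged_predict(text: str) -> dict:
    """Return the small model's answer when it is confident, otherwise the large model's."""
    small_pred = small_model(text)[0]  # e.g. {"label": "LABEL_0", "score": 0.97}
    if small_pred["score"] >= CONFIDENCE_THRESHOLD:
        return {**small_pred, "source": "small_model"}
    return {**large_model(text)[0], "source": "large_model"}

# Example: print(triaged_predict("Acme Corp announced a merger with Globex."))
```

In practice the threshold would be chosen on held-out data so the small model only keeps examples it almost never gets wrong.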

Dean: When I first heard about the challenges of augmentation for NLP from you, the thing that made sense to me is that in computer vision, most of the applications people know handle natural images. If you have a tree with a bird on it and then you remove the bird, the tree still makes sense. Anyone who speaks more than one language, one of them natively, might look at a sentence that someone else wrote and it just doesn't make sense, but they can't pinpoint why, whether it's something grammatical or whatever, it just doesn't click. You know to point it out, but you don't know exactly how to explain why. It's much more intuitive to people in the context of images or videos than it is in the context of language. Since we can't define the problem properly, it's hard to pass that on to a computer. Deep learning tries to get around that issue by letting the computer understand for itself, but when you need to define the data, that task is still on you to a certain extent. When you do work on data augmentations in NLP, is there a way, or do you have recommendations, for actually understanding what's going on and making sense of whether an augmentation makes sense, especially if it's unfeasible to just look at everything?

Charlene: First, I want to mention that there are some frameworks nowadays that are making data augmentation for NLP a little bit easier. They'll have a bunch of different rules that you can use and customize to generate new examples from a given example. You still have to look through those manually to make sure that they're not too weird or crazy or different, but it helps with the process a lot. If you don't want to go that route, what you can do is expand the space around your current examples and get more documents that look somewhat like the documents that your model found challenging. You can project your documents into an embedding space, do a nearest neighbor search around a challenging document's vector, and then sample the documents nearby. This, of course, requires having access to a very large unlabeled dataset that happens to be from the same distribution as whatever documents your model is not doing well on. Oftentimes people will just use the output from the hidden layers of the model as the embedding, for example. There are also all the standard embedding models that are accessible now, like Word2Vec and things like that.

Dean: Data somehow always becomes the issue. I know that data labeling is part of the platform that you're working on. Can you share what role that has within the broader workflow and how important that is?
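As an aside, here is a hedged sketch of the embedding-space mining described above: embed a large unlabeled pool, then pull the nearest neighbors of the documents the model struggled with. The encoder choice (sentence-transformers) and the number of neighbors are illustrative assumptions; any document embedding would do.

```python
# Hedged sketch: embed a large unlabeled pool, then pull the nearest neighbors of
# the documents the current model struggled with, and send those for labeling.
# The encoder choice and k are illustrative; any document embedding works.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

encoder = SentenceTransformer("all-MiniLM-L6-v2")

hard_examples = ["a document the model got wrong ..."]                     # from error analysis
unlabeled_pool = ["candidate document 1 ...", "candidate document 2 ..."]  # large unlabeled corpus

pool_embeddings = encoder.encode(unlabeled_pool)
hard_embeddings = encoder.encode(hard_examples)

k = min(5, len(unlabeled_pool))
index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(pool_embeddings)
_, neighbor_ids = index.kneighbors(hard_embeddings)

# Deduplicated candidates to hand to the labeling team next
to_label = {unlabeled_pool[i] for row in neighbor_ids for i in row}
```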

Charlene: There's been kind of a paradigm shift in NLP over the last few years. Initially, in order to make NLP models work, you were training an RNN or a CNN, and you needed tens of thousands of examples of the specific task that you were doing. Nowadays, we have these giant transformer models that have been pretrained on all of this internet text, so they already have enough context to roughly understand the rules of language, how things in a sentence tend to relate to each other, etc. Nowadays you only need a few thousand examples to get to an 80 to 85 F1 score. You can get a pretty good model with relatively few examples. But the thing that's different is that instead of creating a bunch of potentially noisy, unsupervised or semi-supervised data, like you could when you had 100,000 examples, now you need to be a lot more careful about which examples you give the model, because it's so smart that it can be swayed very easily by wrong data. Extra attention needs to be paid to the data labeling process. There needs to be a lot more QA, and there need to be much stronger task definitions. Everyone needs to be exactly on the same page about how to handle certain edge cases. That becomes very important in the new NLP land.

Dean: In the same discussion I described about training large language models, the first thing that was mentioned about the data was that, counterintuitively, you want to go quality over quantity, because models can learn so many bad things the larger they get that you have to make sure you're throwing out the bad examples. Otherwise, you can spend a lot of money and time training a big model and then have it learn a lot of noise, which is useless. This is an "all roads lead to Rome" sort of thing, where no matter which avenue of machine learning you go down, you get to the point where data quality matters. The first time you and I spoke, I was really excited about the fact that you were really excited about data quality. Why are you excited about it, and how do you get more people excited about it?

Charlene: Caring about data quality is actually really empowering, because personally, in my ML career, what I've noticed is that getting more data and better data for your model has a much higher impact on model performance than any amount of hyperparameter tuning or architecture tweaking or any sort of more pure ML staying-in-the-code sort of fanciness. I would rather have a thousand more labeled examples than a comprehensive hyperparameter search any day of the week. The 1,000 examples could give me another ten F1 points or so, whereas hyperparameters are going to give me maybe two or three F1 points max. That's exciting not just for your potential impact as an ML engineer, but also from the business perspective, because you don't need an army of research data scientists and ML engineers just to get to a pretty good model. In NLP, all you need is the rigor and patience to create a well-defined task and to reinforce and adjust those definitions carefully as new edge cases emerge. The NLP tooling is just so good nowadays that NLP models are within reach for anyone with good critical thinking skills.

Dean: That's a strong snippet from this episode. If your dream is to do hyperparameter tuning and play around with architectures, there are still cases where that makes sense. But in the majority of cases, you need to start by working on the data.

Charlene: If you do want to do that, go work for Google or Facebook or OpenAI, but most places are not going to need that for the kind of tasks that they're tackling.

Dean: In most companies, you're trying to build some product, and machine learning should help that product get built, but it is not an end in and of itself. To get to the end of having a good product that is useful to people, usually you need good data. With your experience, what are your tips and best practices for doing data labeling better, in NLP and in general?

Charlene: I spent a long time early on at Primer managing the data labeling process for the applied research team. There's a specific step-by-step process that we've seen work really well. First, the engineer building the model should label 50-100 examples by themselves, by hand, because that'll give you a firsthand look at what's actually in the data and what types of distinctions you'll need to make between different types of examples. It will give you a sense of the different classes that exist, for example, if you have a classification task, and whether or not you've forgotten any classes in your initial conceptualization of the task and need to add them in later. Then you go in and write down solid definitions of the classes and concepts that you're trying to get the model to identify. You should include instructions on how to handle specific edge cases. Ideally, you want to describe types of edge cases, as opposed to very specific edge cases, like "always tag University of Arizona as a location" or something like that. Instead, you should say "tag universities as locations", because that's a more useful general rule. Then, as you have better instructions, you can iteratively get larger batches of documents labeled by external partners or by other folks within the company who are willing to help with labeling, like 100 or 200 at a time. This gives you the opportunity to spot and correct any misconceptions that labelers have about the task, or that you have about the task, and to add any necessary clarifications to the instructions that become apparent once you've been exposed to more of the data distribution. Iteratively, as the labeling team understands the task better, you can increase the batch size so you don't have to keep doing small batches of 100 or so. When the labelers really get the task, you can give them a thousand and they'll come back just as high quality as before.

Dean: That's so practical. It just makes sense. The next logical step is once you do that, you want to start automating stuff, right? How do you automate the process in general? And then when does that make sense? And when does it not make sense? Who shouldn't be automating the labeling process?

Charlene: You want to tread carefully with automation in the data labeling and curation process. Things that can be automated: sampling new documents to be labeled based on some prespecified criteria, like similarity between document vectors for documents that your model got wrong previously or found challenging in the last training run. You can also automate applying pre-labels to your documents. This is a semi-supervised approach: basically, you take your existing model, apply it to the new documents that you want to get labeled, and then give the labelers the model's labels for those documents. You don't want to do this too early on, though, because it could actually bias the labelers to include your model's mistakes in the data and not correct them, since they're inclined to listen to the model. Ideally, you want to do this when your model is already pretty good and you're going that last mile to make it amazing. You can also automate gathering feedback from end users. This is particularly relevant for people who make consumer-facing products, but it is also relevant for B2B products, because you can build a feedback mechanism straight into the product and then ingest those results to look through later, in terms of what users found incongruous in the model's output. You should give them the opportunity to tell you that, even if you're not going to use it right away.

Things that you can't automate would be doing the actual error analysis and recognizing the patterns in the examples that your model got wrong. This is all still entirely manual, and it's one of the things that we want to make much easier at Aquarium: being able to explore your test dataset and figure out what the commonalities are between the examples your model got wrong. That's still a very cognitive process that you have to do, because there just isn't an automated way of doing it right now. Lastly, getting a sense of whether or not the model is actually good enough for what you need to use it for. Usually, that requires input from some stakeholders. You can't really automate "oh, my model has 99% accuracy, therefore it's good enough." That's not guaranteed. It depends on the task. It depends on what's in that last 1% of accuracy. If your model has 100% precision but 0% recall, that's not very useful. You definitely need to have human eyes on the model's outputs rather than just relying on the metrics.
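Here is a hedged sketch of the pre-labeling step described above: run the current model over the next batch and hand its predictions to labelers as editable suggestions. The checkpoint path, pipeline task, and field names are illustrative assumptions about a generic classifier, not a specific labeling platform.

```python
# Hedged sketch of automated pre-labeling: run the current model over the next
# batch and attach its predictions as editable suggestions for the labelers.
# The checkpoint path and output fields are placeholders for a generic classifier.
from transformers import pipeline

current_model = pipeline("text-classification", model="path/to/current-model")  # hypothetical checkpoint

def prelabel_batch(documents: list[str]) -> list[dict]:
    """Pair each document with the model's suggested label, clearly marked as a suggestion."""
    predictions = current_model(documents)
    return [
        {
            "text": doc,
            "suggested_label": pred["label"],
            "model_confidence": pred["score"],  # lets QA review the low-confidence documents first
        }
        for doc, pred in zip(documents, predictions)
    ]
```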

Dean: The things that you said we can't automate also really relate to the whole realm of monitoring, because you need to anchor the results of your model in reality, and that's usually still hard to automate; you just need a person to look at these things. Part of the idea here is that a metric is probably never going to be good enough, especially for complex tasks. You're still going to want to make predictions and then qualitatively assess those predictions to a certain extent. And then the second part, the decision to deploy a model or to switch the main model with the new model, etc., is something people would want to automate, but I haven't heard of anyone successfully doing that. Maybe Google with ad models and things like that. But it's relatively uncommon, and most companies still do this manually.

Charlene: There are ways of monitoring whether your model has drifted too much from the last version to the new version. You can analyze the distribution of classes of your model's predictions. There are things you can do. People have made automated tests for the easier examples, examples that your model should definitely get right. And then you run the model through those before you deploy it. And sometimes people catch things like, oh, we had a bug in the training code and now the model is acting completely crazy. So you can catch certain pathological examples like that. But sometimes the differences can be more subtle.
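A hedged sketch of the pre-deployment checks Charlene mentions: a small set of easy examples the candidate model must get right before it is promoted. The example texts, labels, and predict() signature are placeholders, not a real test suite.

```python
# Hedged sketch of a pre-deployment sanity check: easy examples the candidate
# model must always get right. The texts, labels, and predict() signature are
# placeholders for your own task and serving code.
CANARY_EXAMPLES = [
    ("Acme Corp announced a merger with Globex.", "business"),
    ("It rained all weekend in Seattle.", "weather"),
]

def run_canary_checks(predict) -> None:
    """Raise if the candidate model misses any of the trivially easy examples."""
    failures = []
    for text, expected in CANARY_EXAMPLES:
        got = predict(text)
        if got != expected:
            failures.append((text, expected, got))
    if failures:
        raise AssertionError(f"Candidate model regressed on easy examples: {failures}")

# Usage: run_canary_checks(lambda text: my_model_client.classify(text))  # hypothetical client
```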

Dean: There are a few good packages for these things. I think Great Expectations and, more recently, Deepchecks are doing a great job on this front, and it's very necessary. It's the same old issue from software development: you won't find bugs if you don't write any tests. Unit testing for data and models is a great idea. For the basic mistakes, it's usually almost effortless to do, and it has the potential to save you so much grief. I had the chance to speak with a bunch of people who are working on ways to help you, as a user, understand the examples that your model is having a hard time with, or anticipate examples that your model is going to have a hard time with in production, and it seems like part of what you're doing at Aquarium relates to this. I feel like that's still very early, but I'm excited about it. There is potential to improve that experience. It also ties back into labeling, because ideally, you'd want to automate that entire process: find those hard samples, label or relabel them, push them back into the training set, and improve your model. There's a lot of good work to be done there, and there are a lot of things that will stay manual, and that's okay. You probably still want the human in the loop just to make sure that the model is doing something that is actually useful to humans. I try to avoid the robot takeover stuff, but we'll be in a very bad position if we don't need humans anymore because robots decide what's good for them and we don't matter.

Charlene: I prefer not to think about that, it's pretty far in the future.

Dean: A lot of the things that you said relate to the fact that there's a lot in common between good data labeling, or good data treatment in general, and product management: treating your data as if it were a product that you need to work on. There's the whole MVP stage, where you do things yourself, manually, before you try to set out rules, automate them, and extend the process to other people on your team. One important thing in product management, or maybe this is true for entrepreneurship, is that you want to shorten the iteration cycle, be able to learn quickly, and then improve the product that you're working on. So how long is an iteration cycle for data labeling? How fast could it work if you have done a good job of automating things?

Charlene: That's a great observation about the connection between this iteration cycle and product management. Data should definitely be a first-class object in your ML project management. In terms of how long an iteration cycle is, hopefully a few days or less, if you have a really tight communication loop with your labeling team. Maybe you have them in a connected Slack channel, so if the labelers have questions while they're labeling, they can just ask you and you can tell them, if you're online, of course. And it gets shorter over time. Like you said, in the MVP stage, when you're really trying to hammer out the right definitions of everything and catch the 20% of edge cases that are going to handle 80% of the examples, those cycles are a little bit longer. But once your labeling team starts to understand the task really well, they can get the documents back to you quickly, and QA can be done quickly.

Dean: Fair enough. I think that's a good rule of thumb or order of magnitude if you're thinking about building out a labeling function within your organization or working with people. This is either a good place to aim for or it will give you a range of what you're looking at when you're iterating on data.

Charlene: I can't speak to the computer vision space where the datasets tend to be larger. But for NLP documents, you can generally get a good improvement with your model with another 300-500 documents.

Dean: I'll have to have a discussion with someone working in the computer vision space to compare notes, but I think that this is still a good order of magnitude. You also have transfer learning in computer vision, so you potentially have shorter iteration cycles there as well. But it's interesting. So in the end, one of the challenges here is that most of the people labeling data for NLP tasks are not the ones reading NLP papers. That means that you have stakeholders who are maybe domain experts, but they're not from the area of expertise that you are an expert in. There is this question of how you make this accessible or optimized for shortening the iteration cycle. I'd be curious to hear your thoughts about that as well.

Charlene: I've worked with a lot of outsourced teams, a lot of labeling teams, and I've personally labeled a lot of data in various labeling tools. When you're designing the labeling process, you should help your user do the correct thing. For example, if you have your own in-house labeling platform or you're trying to select a labeling tool, look for something like snapping to the span, if your task is to highlight spans, instead of allowing whitespace on the edges or anything like that. You can do things like highlighting possible errors; if some punctuation has been included in the span, you can call that out visually. Keyboard shortcuts and navigation make things so much faster. You can also use information about the task to prevent doing the wrong thing. For example, for relationship extraction, you have two entities that can be related to each other, and often those will be specific types of entities, say, organizations collaborating with other organizations on a joint venture. As you're designing the highlighting for that, if someone has an organization selected and the relationship they're going to make is "collaborated with organization", you can make the non-organization entities less prominent and the organization entities more visible, and then it's easier to quickly scan the page and find the one that you want to connect to. You can also completely prevent making relationships between the wrong entity types, because that's just something that you'd have to go back and fix in your data later, so you might as well prevent it at the labeling stage. Lastly, you should try to minimize the number and complexity of actions needed to label. A big pet peeve of mine is actions that require click and drag, because some folks working on annotation teams are using a touchpad to label. They're working on a laptop, not at a desktop computer, and click and drag is a lot less accessible for them, not to mention it just takes longer. If you can do "click here and then click there" for the destination, that can be a lot better than clicking and dragging.

Dean: Design plays such an important role in everything that we do, whether it's software design or interface design, and all those things make a huge difference in how things are done and how we can make them better for the people doing the work, but also for the results. Usually this translates into better labels. If we have listeners from any of the labeling tools, there are a lot of great ideas to implement here. One of the main topics that we like to discuss on this podcast is deploying machine learning models to production. We've discussed a bunch of more complex setups for deployment. How do you deploy complex NLP pipelines into production, especially when you have a combination of models and data processing steps?

Charlene: Usually, models are deployed in a containerized fashion. You'll usually have one model or model type per container. But sometimes you'll need to chain two models together, for example, for relationship extraction. First, you have to extract the named entities from the document using a named entity recognition model, and then you also want to extract relationships between those entities, usually with a relationship classification model that takes entity pairs and tells you whether or not they're related in that particular way. For these kinds of model stacks or cascades or pipelines, or whatever you like to call them, it can be ideal to deploy all the models in one container if they'll actually fit. If you can fit all the models on a GPU with enough memory left over to perform inference reasonably quickly, that allows you to minimize latency, and this is particularly important if you have a real-time application, as opposed to a long-running processing pipeline. It also allows you to create a more consistent interface for how you're sending and receiving data to the model servers, because you don't need to be sending some servers structured inference data from a different model. You can just send a document and get the results back. That can make it a lot easier for the services that interact with those deployed models.
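For illustration, here is a hedged sketch of serving an NER model and a relationship classification model from one container behind a single document-in, results-out endpoint. The framework choice (FastAPI), the NER checkpoint, and the relation classifier's path and input format are assumptions, not Primer's actual stack.

```python
# Hedged sketch of serving a two-model NER -> relation-classification cascade from
# one container, behind a single document-in, results-out endpoint. The framework
# (FastAPI), NER checkpoint, and relation classifier path/input format are assumptions.
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
ner_model = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
relation_model = pipeline("text-classification", model="path/to/relation-classifier")  # hypothetical

@app.post("/extract")
def extract(payload: dict) -> dict:
    text = payload["text"]
    entities = [
        {"word": e["word"], "type": e["entity_group"], "score": float(e["score"])}
        for e in ner_model(text)
    ]
    relations = []
    # Naive pairwise relation classification over the extracted entities.
    for i, head in enumerate(entities):
        for tail in entities[i + 1:]:
            pred = relation_model(f"{head['word']} [SEP] {tail['word']} [SEP] {text}")[0]
            relations.append(
                {"head": head["word"], "tail": tail["word"],
                 "relation": pred["label"], "score": pred["score"]}
            )
    return {"entities": entities, "relations": relations}

# Run with: uvicorn app_module:app --host 0.0.0.0 --port 8000
```

Loading both models once at startup keeps everything on one GPU and lets callers send only raw documents, which matches the latency and interface points above.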

Dean: If you're deploying these systems into production, the interfaces that you just described, do you enforce them culturally, or is that something that you build into the system somehow to make it so that you have these standardized interfaces and things like that? Also, what happens if you can't fit multiple models in one container? Do you just give up on the standardized interface, or is there any other creative solution?

Charlene: For your last question, I'm not sure if there are any creative ways around that. You'll just end up having to do the non-ideal thing, and then maybe from the outside, make it look as if you're sending in the document and getting the results back. In terms of how you can enforce that across the organization, I mean, it probably depends on how big the organization is. If you're a relatively small company with only a few hundred people and you have a single platform team managing all the ML deployments, they can usually standardize that interface, and everyone just has to use that. But once you have many different teams proliferating their own ML solutions, that can be a little bit more difficult to standardize.

Dean: I've also had the chance to speak with larger organizations, and they also try to standardize it culturally. When I say culturally, I mean someone defines an interface or a set of interfaces, and your model or transformation has to comply with one of those interfaces, otherwise it won't get deployed, for a bunch of different reasons. One of those reasons could be that it's harder to work with, but a more important reason is that you want to optimize performance, and if you have random interfaces, you might not be able to do that. The simplest example of this is the whole Pandas <> Spark compatible APIs situation. Usually when you get to deploying at large scale, you have to trade off certain actions or functions in these frameworks because they're not efficient and there's no good way to make them efficient. So the limitation is: you can't use this set of functions if you want to be deployed, otherwise latency will be too high and we can't afford that. Just to offer one idea on how you can still standardize even if you do need to chain models: you can have one input interface which receives the document and one output interface which outputs whatever it is you want to output. Between those gateways, you have a standardized model interface that gets some standardized input and returns some standardized output, and then you can move that along between the different models. Usually, the scientists, or whoever is deploying their specific model within this larger system, don't need to worry about the first input and the last output. They just need to worry about having a compliant interface for the prediction part. It's not always possible, because you can say, what if the dimensions of my tensor are different per model, and things like that, but you can relatively easily generalize that part because it's just different dimensions for the same type. That's different from the case where the user uploads a document; obviously, you're not working with raw documents in your models, and you need some solution for getting that into the language of the model. There are a lot of interesting things that are maybe too specific to get into here, but you covered a lot of them, and that's really awesome. I want to end with a few higher-level questions that I like to ask all my guests. The first is, what are you personally excited about? What are the strongest, most exciting trends in ML and MLOps?
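Before moving on, here is a hedged sketch of the gateway-plus-contract pattern Dean describes: shared ingest and publish gateways, with each model only required to satisfy a small predict contract. All names and types are illustrative, not any specific company's platform.

```python
# Hedged sketch of the gateway-plus-contract idea: shared ingest/publish gateways,
# and every model only has to satisfy one small predict contract. Names and types
# are illustrative, not a specific company's platform.
from typing import Any, Protocol

class ModelInterface(Protocol):
    def predict(self, payload: dict[str, Any]) -> dict[str, Any]:
        """Standardized input/output contract every deployed model must satisfy."""
        ...

def ingest(raw_document: str) -> dict[str, Any]:
    """Shared input gateway: turn a raw document into the standardized payload."""
    return {"text": raw_document.strip()}

def publish(payload: dict[str, Any]) -> dict[str, Any]:
    """Shared output gateway: wrap the final payload in the response services expect."""
    return {"result": payload, "schema_version": "v1"}

def run_pipeline(raw_document: str, models: list[ModelInterface]) -> dict[str, Any]:
    payload = ingest(raw_document)
    for model in models:  # models can be chained as long as each honors the contract
        payload = model.predict(payload)
    return publish(payload)
```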

Charlene: There are so many great things going on in ML right now. One trend that I'm seeing, of course, as someone who has worked on one of these, is low-code and no-code tools. It's becoming easier and easier for a business user to have access to these very powerful models and model deployments, and you don't have to have two years of experience in statistics and writing Python in order to build and deploy a model that could be useful for your team. I'm also noticing that tooling for technical users is becoming higher level. Something like Aquarium is a great example of this. It's an interface that you can actually use to explore your data and to easily create issues around certain patterns that you're seeing. Previously, when I was trying to do error analysis on my models, I was manually printing the documents, looking for what kind of mistake the model made, and just scrolling through them in my Jupyter notebook, which obviously requires me to switch between writing code and thinking. There's a lot of cognitive overhead with that, and it makes things slower. Having these higher-level interfaces is going to really streamline this iteration step for technical users. The ML tooling space is also growing generally. I'm noticing a huge proliferation of ML tools, and many of the pain points that I described, for example, the manual-ness of dataset exploration, data augmentation, and error analysis, will eventually be a thing of the past in terms of how tedious they are and how much manual work they require. There are also more platforms that are trying to abstract away some of the DevOps parts of MLOps, like model deployments. Verda is a good example of this, making deploying, managing, and monitoring your models significantly easier. The data infrastructure space is also consolidating; that's another cool trend that I'm seeing, and it has the benefit of allowing ML engineers to spend a lot less time thinking about the data engineering and pipelining aspects of getting data into and out of their models, because the data storage and processing providers are starting to provide adjacent integrated services to make their products stickier. Snowflake's acquisition of Streamlit is a good example of this.

Dean: I find your last point very interesting. It was a bit counterintuitive for me in the beginning, but it makes sense in hindsight. Up until relatively recently, it felt like people were not really sure what the proper components of a machine learning workflow are. We're still not there, I still feel like there's work to be done on making the ML realm less fuzzy, but I think we're heading in that direction, and the main advantage for users, in that case, is that if you know the boundaries of a certain task, you can do it much better. The tools are going to be better, as you say, and that's why everyone's going to benefit. You're going to have less cognitive overload in deciding what tools you want, because it will be clearer what value they're going to give you as part of your workflow. The tools are also going to be better at what they do, because they won't have to claim that they do everything in order for you to perceive them as valuable. If you are one of those people who feel like everything is confusing and you can't make sense of anything going on in the field, the next few years are going to change that, and things are going to be clearer and more well defined. So hold on, the good part is coming. This is something that comes up with every person I speak to in the field: as you say, so much is happening. How do you keep up to date with things? What channels do you follow? What things do you do on a regular basis to stay up to date?

Charlene: I'm notorious for being an information junkie among friends and coworkers, so hopefully I can be helpful with this. Particularly for NLP, one of the sources I follow is NLP Highlights. That's a really great podcast. Researchers will often go on there to discuss their interesting new findings, and they usually discuss them in more accessible terms than the actual paper will. Especially if you plan on reading the paper later, having that initial higher-level introduction can be really helpful. It's been on hiatus for the last few months, but the archive is well worth visiting because there are some really good episodes on there. Sebastian Ruder has an amazing blog and newsletter. He's so good, it's an absolute treasure trove. He doesn't update as frequently now, but the updates are still very juicy when they do come out, and they make up for the time that you waited. Software Engineering Daily is also really good, a great podcast for keeping tabs on what's going on in software generally, keeping track of trends in containerization and DevOps, in addition to ML. They specifically have an ML channel that you can subscribe to if you don't care that much about the other stuff. There's also TWIML AI, another podcast that I've been fortunate to guest on, which is awesome and covers tons of different areas of ML, not just NLP. Talk Python To Me is great for general Python stuff, and they often have ML-specific or data science-specific guests. I also have a couple of articles that I can send along for you to put in the show notes. Taming the Tail by a16z is a really great read on the challenges of running an ML-first business, particularly how it compares to running a more standard SaaS business, and how you can think more clearly about what to expect for your margins and where most of your costs will be. Every year I also check out the State of AI report, which I thought about particularly when you were talking about someone who's looking on and seeing this massive proliferation of things. There's an infamous image from those reports that is just a big square with multiple smaller squares of different companies in different areas of ML. There are more and more companies every year, but it's a great way to help make sense of what the different areas are right now and what changes we're observing. That's part of how I notice some of these higher-level trends, just hearing about what's going on with other companies, because otherwise I'm pretty zoomed in on what I'm working on and what's directly adjacent to it. I would also suggest following any researchers and research groups that you like on Twitter, so that you can hear about it when a splashy new paper comes out.

Dean: Sebastian Ruder is awesome. I really love his newsletter. Every time I see another Lumascape, it excites me and bothers me at the same time. There are so many, it's just ridiculous. It depends on which list you want to believe, but it's somewhere between 300 and 5,000 tools. I understand the frustration people have when they see those. You gave a bunch of recommendations that are machine learning related, but I'll ask if you have any other recommendations that are maybe not related to machine learning or your job at all, whether it's a Netflix show or whatever you want to recommend.

Charlene: I do have a general mindset recommendation that I can give, which is if you're listening to this and you're new to the space, whether that's like ML or NLP or anything else, if this seems overwhelming, like we just mentioned, you would be amazed by how much you can just learn by osmosis, by listening to things like this. I actually attribute a lot of my own success in transitioning into data science from a less technical background to the fact that I consumed tons of data science and NLP related content. And eventually, you learn the vocab, you learn what the main problems in the space are. You learn how people address certain issues. You learn about which tools and frameworks are actually the ones that people use instead of just the ones that exist out there in theory and that you might have to learn someday, and a lot of other practical details that will help you in your day to day journey just by listening to practitioners talk about it. You can also do informational interviews, and things like that. I did also read textbooks and everything, but to get a picture of what things actually look like on the day-to-day, the podcasts, the conference talks, and other kinds of conversations are super helpful, because without these, you can really get the impression that you need to know every possible thing before you can be even remotely competent in this field. But that's not the case at all. There are so many of us who are experts in one particular area, but if you asked us to do something in some other area, we'd be like, give me a week to go learn it. As long as you're scrappy and you keep a beginner's mind, you can learn whatever you need to learn to make really cool stuff.

Dean: That's an awesome point. The skill, if you will, that you have to possess is just wanting to learn. There are so many things you can do if you just want to learn. Part of that is being genuinely excited about a field. It's harder if you're forcing yourself to do it because you heard that software developers make a lot of money if they work at Google; then you're going to have a harder time. It's not impossible, some people are talented enough, but I think if you're actually enthusiastic, then becoming an information junkie, as you called it earlier, makes a lot of sense. I'll add one tip in that vein, which I think is true regardless of what you want to do; it doesn't have to be machine learning development or anything like that. With COVID, we've had a stretch where all the Meetups and events were virtual. I'm still advising everyone to keep safe and not get COVID, it's still not fun, but if you have an opportunity to go to a Meetup that has an in-person meeting... the talks are important, but if you speak to other people who are either in your position or a few years ahead of you, you can learn from their experience and build those relationships, and that is super important, especially if you don't come with a background that is, as you say, computer science or technical or something like that. Building relationships is super important no matter what you do, and the ability to find people who have gone a few miles in your future shoes and then learning from their experiences is really invaluable. If you can't go in person, go to the virtual Meetup; sometimes they do have mingling in the beginning, even though on Zoom it's not as fun and there's no pizza unless you order it in advance. But you should totally do that. And if you go to the Meetup, don't only eat, also speak to people. That's also important. This is speaking from experience.

Charlene: I do have one more tip I can give. It's meta advice about learning generally. Learning how to learn is actually a skill, and one that you can improve at very effectively. There are certain techniques that you can use, like spaced repetition and forcing yourself to recall answers instead of just reading your notes and saying, oh yeah, I remember that. Two people who I can recommend who are very good at teaching this material are Scott Young, the blogger king of learning how to learn, who did a four-year MIT CS degree in a very short amount of time from home, and Cal Newport, who is also really good to listen to in this area. Ultralearning by Scott Young is a really good book that encapsulates most of his recommendations. Michael Nielsen also has a really good blog post on how to use spaced repetition with Anki, a flashcard app that's highly customizable to various use cases. These will help you learn a lot of things in a very short amount of time instead of banging your head against the same material and studying it less effectively.

Dean: So this has been awesome. I really had a lot of fun so thank you, Charlene, for taking part.

Charlene: Thanks for inviting me.

Dean: And thank you to all the listeners or viewers. It was a pleasure having you here as well and I'll see you in the next episode.


Dean Pleban

Co-Founder & CEO of DAGsHub. Building the home for data science collaboration. Interested in machine learning, physics and philosophy. Join https://DAGsHub.com
