Urszula Czerwińska is an NLP engineer with a Ph.D. in Systems Biology. She has a varied background in academic research, data science consulting, industry, and the public sector. The following is a transcript of the podcast about her experience in the field. You can listen here or watch on YouTube.
Tell me a little bit about your background as a DS-ML? How did you become a Data Scientist?
I studied biology and mathematics as an undergraduate, then interdisciplinary studies applied to Life Sciences, and I finished with a Ph.D. in Systems Biology. It may sound exotic but it was mostly maths, statistics and CS applied to life science or healthcare problems such as genomics. I learned the whole basic data science toolset with the aim of helping to cure diseases or save animal species. Then I realized a research career doesn’t suit me and I decided to focus on CS and data, the part of the job I like the most. I participated in some DS boot camps to fill the gaps between my academic experience and the market needs, and I signed my first contract before getting the Ph.D. diploma. I worked for a startup optimizing budgets for big companies. It was not a fit. Then I worked for a Data Consulting company where I could get a taste of corporate culture and varied Data Science projects. Finally, I chose my discipline of interest and I landed my current position.
What kind of DS/ML projects do you specialize in (NLP, vision, deep learning, etc)? Can you give examples of some of the projects you’ve worked on and their business outcomes?
In research I worked with multidimensional problems, trying to deconvolute signals, sort of seq2seq tasks. Then, through my consulting experience, I could face a plethora of topics, such as DS project management, APIs, interpretability, and finally NLP. I decided to specialize in NLP, with a focus on deep learning, as it was giving me the most satisfaction. Being able to evaluate the outcome of the algorithm as a human reader helps me with my work. So far I have worked only with French and English, so I could read the prediction results and see if my work gives what I expected, in addition to the formal evaluation. This is much harder to capture when working with numbers, I guess, because it requires deep domain knowledge to see if the numbers make sense.
To give you an example of the projects I worked on: my company was asked to construct a guide on the interpretability of ML models by a client from the banking sector. I worked on demos and code snippets that would allow all the data scientists from the group to use it for their projects and, most importantly, understand what happens behind the scenes of the interpretability libraries. The materials I collected for this project were used to write a Medium post that got quite popular, with over 13k views.
The project that made me fall in love with NLP was for a pharmaceutical company building a tool that would help their text analysts be more efficient. I built a classification algorithm that would automatically select patents concerning a disease of interest. Another request was to identify the chemical molecules that were targeted by the patented drug. For this second task, I used the NER algorithm from the spaCy library. I wrote a small blog post in French, which got nearly 3K views, on how we constructed the pipeline.
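To make the molecule-detection task more concrete, here is a minimal sketch of dictionary-based (gazetteer) entity matching in plain Python. It is a much simpler stand-in for the statistical NER model mentioned above, not the pipeline that was actually built. The term list, the function names, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical gazetteer: in a real project this list would come from a
# curated chemical dictionary, not from three hand-typed names.
CHEMICALS = ["acetylsalicylic acid", "ibuprofen", "metformin"]


def build_matcher(terms):
    """Compile one case-insensitive regex that matches any term as a whole phrase."""
    # Longest terms first so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, flags=re.IGNORECASE)


def find_molecules(text, matcher):
    """Return (start, end, matched_text) for every gazetteer hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


matcher = build_matcher(CHEMICALS)
sample = "The claimed composition combines Ibuprofen with metformin."
hits = find_molecules(sample, matcher)
for start, end, name in hits:
    print(start, end, name)
```

A lookup like this only finds names it already knows. A trained NER model is used precisely because it can also recognize molecule names that are absent from any list.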
Tell me more about your current role? What kind of problems does your ds/ml team solve in the organization?
Currently, I am working in a public institution. My main task is quite similar to chemical molecule name detection. The documents I work with are quite long and I need to detect specific information from the text. I am using different tools because the documents are written in French.
What are some of the current DS projects you are working on?
The main ML project is the one just mentioned. I also work on sequential paragraph classification. It is still a work in progress, but the subject is quite interesting as no trivial solution exists so far.
Can you share your thoughts about getting machine learning from research to production? What is the ideal pipeline and stack for you? If you had to choose one – what tool can’t you live without?
It is a huge subject. There is a huge gap between research, or even POCs, and ML models in production. I am not a specialist, but I saw in many organizations that great models were stuck at the POC stage for many reasons. One of them was the lack of communication between the team working on the model and the one working on the production environment. Another one was that the models were answering a specific need of a group of users whose needs were not in line with those of all the users of the tool in production. Then there are all the difficulties concerning the heterogeneity of infrastructures and processes, and the impact on other parts of the system.
At the moment I create production-ready code that is served through an API to other parts of the system. It is possible because all the solutions I use are open source and quite independent of the operating system.
As a data scientist, I am avoiding complex topics of integration, leaving this to team members that are more qualified than me (devs). I make sure that my code is well tested, that it produces logs that can be used when needed, and that the API behavior is well described.
An ideal pipeline would allow running experiments, testing, and deploying models from one place, keeping track of actions, and making rollback easy if needed. I know that some cloud platforms offer a sort of continuous integration for ML models, but I have never tested them in practice. I guess it is possible to build great pipelines from scratch, but it is much more difficult to integrate them into an existing system.
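The three habits named above (tests, usable logs, documented API behavior) can be illustrated with a small sketch. The handler, its payload format, and the logger name are hypothetical and chosen only for illustration. They are not the interface of the system described in the interview.

```python
import logging

logger = logging.getLogger("extraction_api")  # hypothetical logger name


def handle_request(payload):
    """Hypothetical API handler with its contract described in one place.

    Input:  {"text": <non-empty str>}
    Output: {"status": "ok", "n_tokens": <int>} or
            {"status": "error", "message": <str>}
    """
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        # Log rejected inputs so that failures can be investigated later.
        logger.warning("Rejected request: missing or empty 'text' field")
        return {"status": "error", "message": "'text' must be a non-empty string"}
    n_tokens = len(text.split())  # whitespace tokens, a placeholder for real work
    logger.info("Processed request with %d whitespace tokens", n_tokens)
    return {"status": "ok", "n_tokens": n_tokens}


def test_handle_request():
    """Check the documented behavior for valid and invalid payloads."""
    assert handle_request({"text": "un deux trois"}) == {"status": "ok", "n_tokens": 3}
    assert handle_request({"text": "   "})["status"] == "error"
    assert handle_request(None)["status"] == "error"


test_handle_request()
```

Keeping the contract in the docstring and mirroring it in the test means the developers who integrate the service can rely on both staying in sync.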
The tool I couldn’t work without is a decent code editor with a debugger and Git.
What process do the teams you’ve worked in use for collaboration?
It all depends on the project and organization. Often the tools are imposed because they are the company's formal communication channels, and they are not always the best suited for Data Science teams.
I think the most important collaboration tool remains a Git interface such as GitHub, as it allows you to raise issues, set milestones, and of course share the code. It is usually complemented by emails, which help to communicate with the less techy part of the team.
Talk to me about the team structures you’ve worked in - were there data engineers, analysts, domain experts? What's the division of work? What is working and what could be improved?
I often worked on projects that were quite challenging technically but not deployed on a huge scale. Therefore I usually worked with other data scientists, designers, developers, and domain experts. I never worked directly with data engineers because either I used databases that were already well established or I was using public data that could be cleaned and organized by myself.
I think the communication in teams can always be improved. Depending on the size of the team, agile methodology or other strategies can be adopted. Documentation and testing are really important and often done without a predefined framework. I often worked with companies that did not have much of a clue how to manage data science projects. It is quite a bit of work to design and implement those processes. I would love to work one day in a tech company that has developed great processes for machine learning engineering.
Name some of the most interesting repositories and tools you have used recently?
Recently I have focused a lot on NLP, and I have been discovering AllenNLP. It is a framework that facilitates building NLP pipelines and helps to make them more plug-and-play. It efficiently handles part of the tedious work that you would otherwise need to do yourself in PyTorch.
How do you keep up to date?
There was an amazing NLP newsletter from DAIR-AI but it stopped in August 2020. I usually follow Medium posts, LinkedIn, or Twitter. However, the best source of solid information is Papers With Code and the actual code released with the scientific articles. I also follow the GitHub repositories of HuggingFace, Flair, AllenNLP, Stanford NLP, and more to see any new models or major changes.
What are the biggest challenges for you in DS/ML?
As data science is a relatively new field that raises a lot of enthusiasm but is also a bit chaotic by construction, one of the biggest challenges, in my opinion, is the organization of ML projects. Dealing with POCs and demos on a few thousand files does not compare to real-time processing of terabytes of data. Testing the 10 most popular algorithms or approaches of the moment is not the same as keeping up with the state of the art over the years. More and more systems are based on ML nowadays, so more responsibility rests on those systems.
I think the key is to organize and structure ML projects to make them readable to anyone without deep knowledge of the context. Another challenge is to use ML or DL only when it is really necessary and not because it is boosting the CEO’s pride.
Therefore, tools such as DVC and the interfaces such as DAGsHub are a great help to structure ML projects and make them reproducible.
What things have you learned recently?
Recently, I worked on the sequential classification of paragraphs. I found a publication with code in the AllenNLP framework, which I adapted to my needs. The next step is to play a bit further with the algorithms and try some new ideas. It is interesting because it makes me revise all my knowledge about NN and PyTorch, as well as some state-of-the-art solutions in NLP.
What's your favorite and least favorite thing about being a data scientist?
I like how versatile this job is and how I have the flexibility to bring my creativity to the work I do. I also love the fact that there is a huge community and lots of open source work. Whenever I am stuck I can open an issue on GitHub or write a question on Stack Overflow and most probably I will get help in a few days or even hours. I also like it when my work has a real impact on the users, and how the tools I create change the way people work.
The least favorite part of it is that Data Scientists are seen by some companies as a mascot or trophy. They are overqualified for the tasks they are assigned and there is a lot of misunderstanding around their responsibilities.
What’s a good paper you have read recently?
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching by Yang et al. (with Marc Najork among the co-authors).
Working with long documents, it has always been frustrating that transformer models such as BERT accept sequences of at most 512 tokens. In this paper authors …
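For context on the 512-token limit, here is a minimal sketch of one common workaround: cutting a long document into overlapping windows that each fit the model. This is a generic sliding-window technique, not the hierarchical encoder proposed in the paper. The function name, the default `stride`, and the dummy tokens are assumptions made for illustration.

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so with stride < max_len
    every window shares some context with the previous one.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


# Dummy document of 1200 tokens; a real pipeline would use the model's tokenizer
# and leave room in max_len for any special tokens the model adds.
tokens = [f"tok{i}" for i in range(1200)]
windows = sliding_windows(tokens)
print(len(windows), [len(w) for w in windows])
```

Each window is then processed separately and the per-window predictions have to be merged, which is exactly the kind of bookkeeping that long-document architectures try to avoid.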
Do you have any tips to improve:
- Model performance:
The best way to increase model performance is to ensure the quality of the data; mislabelled training data in particular causes a lot of pain. I often fixed lots of problems in my models by going back to the data, identifying problems, and focusing on fixing them. One of the advantages I had was working with text data in English and French, which I’m familiar with, but many people work with data they can’t directly read. This raises another point: you need access to someone who understands the data and can tell you if you have problems with it.
- Versioning & Experimentation:
Using productivity tools such as Git, DVC and related interfaces such as DAGsHub is a game-changer. It brings a new quality to the experimentation and keeps track of the progress. You put some effort into making things organized in the beginning but then you have a reproducible pipeline, which you’ll use over and over again, as well as automatic experiment tracking. It brings real visibility and order.
- Bringing models to production:
Communication and collaboration with all the relevant teams, and studying the impact your application has on other systems within your product. There are a lot of cascading effects, and it’s important to discover these things as you create your projects.
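As an illustration of the data-quality tip above, here is a minimal sketch of one cheap check for mislabelled training data: flagging texts that appear more than once with different labels. The example records, label names, and function names are hypothetical, and the check only catches one kind of labelling error.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def find_label_conflicts(examples):
    """Return {normalized_text: sorted labels} for texts labelled inconsistently."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[normalize(text)].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}


# Hypothetical labelled examples: the first two differ only in case and spacing.
examples = [
    ("Method for treating asthma", "relevant"),
    ("method for treating  asthma", "irrelevant"),
    ("Bicycle brake assembly", "irrelevant"),
]
conflicts = find_label_conflicts(examples)
print(conflicts)
```

The flagged texts still need to be read by someone who understands the data, which is where being able to read the language, or having a domain expert at hand, comes in.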
Any other recommendations you have for the audience?
The important point is that data science and machine learning form a community, and it’s important not only to take but also to give back. If you’re doing something cool, share it, write a Medium post, release your code. It might be really useful for someone who is learning or is stuck, and you might not even realize it. Someone might end up seeing your code or writing and it will be a great help. Be curious, explore, and give back to the community!