
🔴 Building, Deploying and Monitoring Large Language Models with Jinen Setpal

Co-Founder & CEO of DagsHub. Building the home for data science collaboration. Interested in machine learning, physics, and philosophy. Join https://DAGsHub.com

    In this live episode, I speak with Jinen Setpal, ML Engineer at DagsHub, about actually building, deploying, and monitoring large language model applications. We discuss DPT, a chatbot that runs in production on the DagsHub Discord server and helps answer support questions, and we cover the process and challenges involved in building it. We dive into evaluation methods, ways to reduce hallucinations, and much more. We also answer the audience's great questions.

    Watch the Video

    Listen to the Audio

    Highlights & Summary

    In this episode of the MLOps podcast, the host, Dean, speaks with Jinen Setpal, a machine learning engineer at DagsHub. They discuss the applications of large language models (LLMs) and the challenges of working with them, such as hallucinations, as well as the development stack and tools for building LLM applications and monitoring LLMs in production. This blog post summarizes the key points from the episode.

    Introduction to LLMs

    Jinen explains that LLMs have been around for quite some time, but the introduction of ChatGPT by OpenAI marked a significant moment when LLMs became more than just a theoretical concept. The utility and intelligence of LLMs became apparent, and their potential for real-world applications became evident.

    DPT - DagsHub's Documentation Chatbot

    Jinen discusses DPT, DagsHub's documentation chatbot. DPT leverages GPT-3.5 Turbo to answer user queries, grounding its responses in semantic search over DagsHub's documentation, and it uses prompt engineering and domain adaptation techniques to generate accurate and helpful responses.
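
    To make that pattern concrete, here is a minimal retrieval-plus-prompting sketch in Python. It is an illustration rather than DPT's actual code: the embedding model, snippet contents, prompt wording, and `answer` helper are all assumptions, and it presumes the `openai` and `sentence-transformers` packages with an `OPENAI_API_KEY` in the environment.

    ```python
    # Minimal retrieval-augmented chat sketch (illustrative; not DPT's actual code).
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

    # Stand-in documentation snippets; DPT indexes the real DagsHub docs.
    docs = [
        "To create a repository on DagsHub, click 'Create' in the top navigation bar.",
        "DagsHub experiment tracking integrates with MLflow via your repo's tracking URI.",
    ]
    doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def answer(question: str) -> str:
        # Semantic search: rank documentation snippets by embedding similarity.
        q_emb = embedder.encode(question, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, doc_embeddings, top_k=2)[0]
        context = "\n\n".join(docs[hit["corpus_id"]] for hit in hits)

        # Prompt engineering: constrain the model to the retrieved context.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer using only this documentation. If it does not "
                            "contain the answer, say you don't know.\n\n" + context},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content
    ```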

    Evaluating LLMs

    Evaluating LLMs is challenging, and each existing metric has its limitations. Dean and Jinen discuss the trade-off between automated metrics and human evaluators, with the latter being more reliable but also more expensive. They also touch on biases in evaluation and the need for careful annotation and quality control.
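
    To illustrate the automated side of that trade-off, the sketch below scores a generated answer against a reference with ROUGE-L via the `rouge-score` package. The metric choice and example strings are assumptions, not something prescribed in the episode; the point is that such metrics are cheap and repeatable but only measure lexical overlap, which is exactly the limitation that keeps human evaluation in the loop.

    ```python
    # Cheap automated evaluation via lexical overlap (ROUGE-L).
    # Fast and free compared to human raters, but blind to meaning:
    # a fluent, factually wrong answer can still score well.
    from rouge_score import rouge_scorer

    reference = "Point MLflow at your DagsHub repo's tracking URI to log experiments."
    prediction = "Log experiments by pointing MLflow at the repo's tracking URI on DagsHub."

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    score = scorer.score(reference, prediction)["rougeL"]
    print(f"ROUGE-L F1: {score.fmeasure:.2f}")  # overlap-based F1 in [0, 1]
    ```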

    They highlight the importance of interpretability in LLM research, as it can help identify biases and provide insight into model behavior. Jinen suggests that future metrics for LLMs could be based on intrinsic interpretability, which would allow for unbiased estimates of model performance. They also discuss privacy and security concerns, noting that privacy-preserving techniques are still a work in progress and that fine-tuning and prompt engineering are currently the most common approaches to combating LLM limitations.

    Overall, they emphasize the need for ongoing research and development to improve evaluation metrics and to address privacy and security concerns.

    Challenges of Hallucinations

    Hallucinations occur when an LLM generates a response that sounds confident but is inaccurate. Jinen attributes this to misaligned incentives: the model is rewarded for producing responses that sound plausible rather than for being accurate. Domain adaptation and prompt engineering are two approaches to mitigating hallucinations, but the problem remains open.
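
    As one concrete example of the prompt-engineering approach, a system prompt can make "I don't know" an explicitly acceptable answer, countering the incentive to sound plausible. The wording and helper below are a hypothetical sketch, not the prompt DPT actually uses.

    ```python
    # Hypothetical hallucination-mitigation prompt: the model is told that
    # declining to answer is preferable to a plausible-sounding guess.
    SYSTEM_PROMPT = """\
    You are a support assistant for DagsHub.
    Answer ONLY from the documentation excerpts below.
    If the excerpts do not contain the answer, reply exactly:
    "I don't know, please ask in the support channel."
    Never guess.

    Documentation excerpts:
    {context}
    """

    def build_messages(context: str, question: str) -> list[dict]:
        """Assemble a chat request that grounds the model in retrieved docs."""
        return [
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question},
        ]
    ```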

    Privacy and Security

    Privacy and security are aspects of LLMs that need to be taken into account. OpenAI and other providers have measures in place to protect user data, but self-hosting models allows for greater control over privacy and security. Self-hosting, however, involves a trade-off between performance, scalability, and security.

    Monitoring LLMs in Production

    Monitoring LLMs in production involves infrastructure management, scalability, and performance monitoring. Tools like Terraform, AWS Auto Scaling Groups, and ECS services can streamline the infrastructure side, but evaluating and monitoring model quality is more complex and often requires manual intervention and human evaluation.
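
    On the application side, that split can be sketched as a simple quality gate: automated signals such as latency and a cheap relevance score are logged continuously, while low-scoring answers are flagged for the human evaluation described above. The function names and threshold below are assumptions for illustration, not part of DagsHub's monitoring stack.

    ```python
    # Hypothetical monitoring wrapper: infrastructure metrics are automated,
    # while answer quality is routed to humans when a cheap score is low.
    import logging
    import time

    logger = logging.getLogger("dpt.monitoring")
    REVIEW_THRESHOLD = 0.5  # assumed cutoff: below this, flag for human review

    def monitored_answer(question, answer_fn, score_fn):
        start = time.monotonic()
        answer = answer_fn(question)
        latency = time.monotonic() - start

        score = score_fn(question, answer)  # e.g., query/answer embedding similarity
        logger.info("latency=%.2fs relevance=%.2f", latency, score)

        if score < REVIEW_THRESHOLD:
            # Quality gate: queue low-scoring answers for manual evaluation.
            logger.warning("flagged for human review: %r", question)
        return answer
    ```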

    Conclusion

    Large language models like ChatGPT have transformed the field of natural language processing and have vast potential across applications. However, challenges such as hallucinations and privacy and security concerns remain open problems. Monitoring LLMs in production requires a combination of infrastructure management, scalability, and manual evaluation to ensure accurate and reliable results. As LLMs continue to evolve, advances in interpretability and privacy-preserving techniques will shape their future use and impact.