🧠 From Prototype to Production: 6 Essential Fixes for Your LLMService Class 🚀
Mai Chi Bao


Publish Date: Jun 20

"Your LLM code works... until it doesn’t — especially on someone else’s machine."
That was me last month, confidently shipping a prototype only to watch it crumble in different environments. No GPU? Boom. Slight change in model prompt? Silent failure.

I realized I wasn’t writing production-ready code. I was building a proof of concept held together with hopes and hot glue.

This post is a deep dive into how I took a basic LLMService class and leveled it up by identifying six critical (but often overlooked) issues. These are fundamental improvements that every LLM project should include — whether you're building a chatbot, an API, or just experimenting.


📚 Table of Contents

  • 🧪 Original Code
  • 🧭 Why These Fixes Matter
  • 🔧 Basic Improvements for Stability and Flexibility
    1. No GPU Availability Check
    2. Missing Error Handling for Model Loading
    3. Hardcoded Prompt Formatting
    4. Fixed Generation Parameters
    5. No Input Validation
    6. Hardcoded Values
  • ✅ Conclusion: First Fixes First

🧪 Original Code

Here’s the starting point — a working LLMService class for running local generation with Meta’s Llama-2 7B model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMService:
    def __init__(self):
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model.to("cuda")

    def generate_response(self, user_input):
        # Format the prompt for a chat model
        prompt = f"User: {user_input}\nAssistant:"

        # Tokenize the input
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to("cuda")

        # Generate output
        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                max_length=2048,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        # Decode output
        output = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        answer = output.split("Assistant:")[1].strip()

        # Return everything after "Assistant:"
        return answer

    def batch_generate(self, user_inputs):
        responses = []
        for user_input in user_inputs:
            responses.append(self.generate_response(user_input))
        return responses

# Example usage
if __name__ == "__main__":
    service = LLMService()

    # Process a single query
    response = service.generate_response("What is machine learning?")
    print(response)

    # Process multiple queries
    responses = service.batch_generate([
        "What is deep learning?",
        "Explain natural language processing.",
        "How do transformers work?"
    ])

    for resp in responses:
        print(resp)
        print("-" * 50)

🚨 This worked… until it didn’t:

  • ❌ Crashed on CPU-only systems
  • ❌ Hard to reuse
  • ❌ Silent failures when input changed

So I did a full code review and made six basic improvements that instantly made the service more reliable and flexible.


🧭 Why These Fixes Matter

Production-grade software isn’t just about output — it’s about how well it handles failure, adapts to change, and communicates clearly.

These improvements don’t require deep ML knowledge. But they unlock stability, hardware compatibility, and user trust — everything that brittle prototypes lack.


🔧 Basic Improvements for Stability and Flexibility

🖥️ 1. No GPU Availability Check

🔍 Problem

"It works on my machine."
That's what I said — right before a teammate tried it on their MacBook and it exploded with a CUDA error. The code blindly assumed everyone had a powerful GPU. Spoiler: they don’t.

self.model.to("cuda")  # 💥 Instant crash on CPU/M1 systems

✅ Fix
Detect the available device instead of assuming:

from typing import Optional

def _get_device(self, device: Optional[str] = None) -> str:
    if device:
        return device
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch, "has_mps", False) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
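
For completeness, here's a minimal sketch (assuming the constructor stores the result as self.device) of how the detected device threads through the rest of the class:

# In __init__: resolve the device once, then move the model there
self.device = self._get_device(device)
self.model.to(self.device)

# In generate_response: keep inputs on the same device as the model
input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)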

❌ 2. Missing Error Handling for Model Loading

🔍 Problem

One day, Hugging Face went down for maintenance. My app did too.
There was no error handling when downloading the model or tokenizer — so if anything failed, the whole service collapsed without explanation.

self.model = AutoModelForCausalLM.from_pretrained(...)  # ❌ No fallback, no logs

✅ Fix
Gracefully catch and log issues so you’re not debugging blind:

try:
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForCausalLM.from_pretrained(model_name)
except Exception as e:
    # logger = logging.getLogger(__name__), configured once at module level
    logger.error(f"Model loading failed: {str(e)}")
    raise
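
If the Hub is unreachable but the weights are already cached, a retry with local_files_only=True (a standard from_pretrained argument) can keep the service alive. This is a sketch rather than the post's original code:

try:
    self.model = AutoModelForCausalLM.from_pretrained(model_name)
except Exception as e:
    # Network or Hub outage: fall back to whatever is already in the local cache
    logger.warning(f"Download failed ({e}); retrying from local cache")
    self.model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True)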

🧱 3. Hardcoded Prompt Formatting

🔍 Problem

I swapped the model. Suddenly, the outputs were gibberish.
Turns out, each model expects its own prompt style. But I’d hardcoded a single one — breaking everything as soon as I changed models.

prompt = f"User: {user_input}\nAssistant:"  # 🧃Works only for one model flavor

✅ Fix
Let the tokenizer handle formatting so each model gets the prompt style it was trained on, for example via Hugging Face's chat templates:

def format_prompt(self, user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    # apply_chat_template renders the chat format shipped with this model's tokenizer
    return self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
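
A quick check (assuming the class exposes format_prompt as above) confirms the rendered prompt now matches the model card instead of a hand-written template:

service = LLMService()
print(service.format_prompt("What is machine learning?"))
# For Llama-2-chat this prints the [INST] ... [/INST] format the model was trained on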

🎛️ 4. Fixed Generation Parameters

🔍 Problem

I wanted it to be more creative… but the outputs never changed.
I kept adjusting the temperature but nothing happened — because the code didn’t let me! All generation settings were hardwired in.

temperature = 0.7  # Locked in 🔒

✅ Fix
Expose generation settings as parameters:

def generate_response(self, user_input: str, max_length: int = 2048, temperature: float = 0.7):
    ...
    # do_sample=True is required for temperature to have any effect
    output_ids = self.model.generate(input_ids, max_length=max_length, temperature=temperature,
                                     do_sample=True, pad_token_id=self.tokenizer.eos_token_id)
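
If the list of knobs keeps growing, one option is to forward keyword arguments straight to model.generate so the signature stays small. A sketch under that assumption:

def generate_response(self, user_input: str, **generation_kwargs):
    ...
    # Defaults that callers can override per request
    params = {"max_length": 2048, "temperature": 0.7, "do_sample": True,
              "pad_token_id": self.tokenizer.eos_token_id}
    params.update(generation_kwargs)
    output_ids = self.model.generate(input_ids, **params)

# Usage: tweak creativity and length per call
service.generate_response("Tell me a story", temperature=1.2, max_length=512)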

🛡️ 5. No Input Validation

🔍 Problem

One API test sent an empty string. The model returned... silence.
The function just trusted that input would always be clean. But it wasn’t. And that led to weird results, or worse — crashes.

response = generate_response("")  # 😶 awkward

✅ Fix
Check input before processing:

if not user_input or not isinstance(user_input, str):
    return "Please provide a valid text input."
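
A stricter variant (the helper name _validate_input is mine, not part of the original class) raises instead of returning a canned string, which is easier to surface as a 400 error in an API layer:

def _validate_input(self, user_input) -> str:
    # Reject non-strings and whitespace-only inputs up front
    if not isinstance(user_input, str) or not user_input.strip():
        raise ValueError("user_input must be a non-empty string")
    return user_input.strip()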

🔢 6. Hardcoded Values

🔍 Problem

I wanted to try a smaller model — but the class refused to budge.
The model name, device, config… all hardcoded. Great for demos. Terrible for flexibility.

AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # Locked in

✅ Fix
Make everything configurable via __init__:

def __init__(self, model_name="meta-llama/Llama-2-7b-chat-hf", device=None):
    self.model_name = model_name
    self.device = self._get_device(device)
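
With that in place, swapping models or forcing CPU is a one-liner (the smaller model name below is just an example):

# Try a smaller chat model on CPU without touching the class internals
service = LLMService(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cpu")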

✅ Conclusion: First Fixes First

Each of these six fixes might seem small in isolation — but together, they elevate your LLMService from a fragile prototype to a flexible, production-ready tool. This is your foundation — stable, adaptable, and ready to scale.

Whether you're deploying a chatbot, building an AI assistant, or just trying to avoid those “why is this breaking now?” moments — these are the must-have first steps.
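
Putting the pieces together, the constructor ends up looking roughly like this. This is a simplified sketch of fixes 1, 2, and 6, not the full class:

import logging
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

class LLMService:
    def __init__(self, model_name: str = "meta-llama/Llama-2-7b-chat-hf",
                 device: Optional[str] = None):
        self.model_name = model_name                # fix 6: configurable, not hardcoded
        self.device = self._get_device(device)      # fix 1: never assume CUDA
        try:                                        # fix 2: fail loudly, with logs
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(model_name)
        except Exception as e:
            logger.error(f"Model loading failed: {e}")
            raise
        self.model.to(self.device)

    def _get_device(self, device: Optional[str] = None) -> str:
        if device:
            return device
        return "cuda" if torch.cuda.is_available() else "cpu"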


🚀 Coming up next: In Part 2, we’ll dive deeper with advanced upgrades like batch optimization, smarter response parsing, and model quantization to make your service faster and more efficient.


📢 If this breakdown was helpful,
👍 Like it, 💬 drop a comment, and 🔁 share it with your fellow devs.
👉 Follow me for more deep dives into LLM development, debugging tips, and clean code practices — part two is just around the corner.

Comments (3)

  • Mai Chi Bao (Jun 13, 2025)

    I ran into the "cuda not available" crash last week — wish I had read this earlier! 😅 Great job covering both the what and the why.

  • Dotallio (Jun 21, 2025)

    I’ve run into half of these pitfalls myself, especially with hardcoded configs tripping me up later. Did you try loading model config from environment variables too, or is class init all you need?

    • Mai Chi Bao (Jun 24, 2025)

      Absolutely — DO NOT HARDCODE anything, especially values that can change between environments.

      Here’s how I usually structure it:

      .env files are for secrets or sensitive values: API keys, encoded credentials, passwords, and environment-specific variables like PORT, DEBUG, etc. These are meant to differ between local/dev and production, so they’re best kept outside of your codebase.

      Config files (YAML, JSON, or even Python/TS config classes) are for general model/config parameters: things like temperature, padding_mode, batch_size, or paths to resources. These aren’t secret, but they might still need tweaking without changing the code.

      Class init args are more like safe defaults — they help signal how to use a class and ensure it’s usable out-of-the-box. But I rarely rely on them for anything dynamic. In collaborative projects, you want those parameters clearly separated and adjustable.

      So yes, I do load model configs from .env when they’re secret or env-specific, and from config files for the rest. That way I can change behavior without editing the code, and it’s also clearer for teammates or future-you.
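
      For concreteness, here's a small sketch of that split using PyYAML. The file name config/llm.yaml and the LLM_DEVICE variable are made up for the example:

      import os
      import yaml

      # Non-secret tunables live in a config file
      with open("config/llm.yaml") as f:
          cfg = yaml.safe_load(f)

      service = LLMService(
          model_name=cfg.get("model_name", "meta-llama/Llama-2-7b-chat-hf"),
          device=os.getenv("LLM_DEVICE"),  # env-specific override; None falls back to auto-detect
      )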
