"Your LLM code works... until it doesn’t — especially on someone else’s machine."
That was me last month, confidently shipping a prototype only to watch it crumble in different environments. No GPU? Boom. A slight change in the prompt format? Silent failure.
I realized I wasn’t writing production-ready code. I was building a proof of concept held together with hopes and hot glue.
This post is a deep dive into how I took a basic LLMService class and leveled it up by identifying six critical (but often overlooked) issues. These are fundamental improvements that every LLM project should include — whether you're building a chatbot, an API, or just experimenting.
🧪 Original Code
Here’s the starting point — a working LLMService class for running local generation with Meta’s Llama-2 7B model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMService:
    def __init__(self):
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model.to("cuda")

    def generate_response(self, user_input):
        # Format the prompt for a chat model
        prompt = f"User: {user_input}\nAssistant:"
        # Tokenize the input
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to("cuda")
        # Generate output
        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                max_length=2048,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        # Decode output
        output = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Return everything after "Assistant:"
        answer = output.split("Assistant:")[1].strip()
        return answer

    def batch_generate(self, user_inputs):
        responses = []
        for user_input in user_inputs:
            responses.append(self.generate_response(user_input))
        return responses

# Example usage
if __name__ == "__main__":
    service = LLMService()

    # Process a single query
    response = service.generate_response("What is machine learning?")
    print(response)

    # Process multiple queries
    responses = service.batch_generate([
        "What is deep learning?",
        "Explain natural language processing.",
        "How do transformers work?"
    ])
    for resp in responses:
        print(resp)
        print("-" * 50)
🚨 This worked… until it didn’t:
- ❌ Crashed on CPU-only systems
- ❌ Hard to reuse
- ❌ Silent failures when input changed
So I did a full code review and made six basic improvements that instantly made the service more reliable and flexible.
🧭 Why These Fixes Matter
Production-grade software isn’t just about output — it’s about how well it handles failure, adapts to change, and communicates clearly.
These improvements don’t require deep ML knowledge. But they unlock stability, hardware compatibility, and user trust — everything that brittle prototypes lack.
🔧 Basic Improvements for Stability and Flexibility
🖥️ 1. No GPU Availability Check
🔍 Problem
"It works on my machine."
That's what I said — right before a teammate tried it on their MacBook and it exploded with a CUDA error. The code blindly assumed everyone had a powerful GPU. Spoiler: they don’t.
self.model.to("cuda") # 💥 Instant crash on CPU/M1 systems
✅ Fix
Detect the available device instead of assuming:
def _get_device(self, device: Optional[str] = None) -> str:
    if device:
        return device
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch, "has_mps", False) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
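Here's how the check slots into the rest of the class — a minimal sketch, assuming the constructor takes an optional device argument and stores the result on self.device (as in fix #6 below):

# In __init__ — move the model to whatever device was detected
self.device = self._get_device(device)
self.model.to(self.device)

# In generate_response — keep the input tensors on the same device as the model
input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)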
❌ 2. Missing Error Handling for Model Loading
🔍 Problem
One day, Hugging Face went down for maintenance. My app did too.
There was no error handling when downloading the model or tokenizer — so if anything failed, the whole service collapsed without explanation.
self.model = AutoModelForCausalLM.from_pretrained(...) # ❌ No fallback, no logs
✅ Fix
Gracefully catch and log issues so you’re not debugging blind:
try:
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForCausalLM.from_pretrained(model_name)
except Exception as e:
    logger.error(f"Model loading failed: {str(e)}")
    raise
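The snippet assumes a module-level logger already exists; a minimal setup (one way to do it, using the standard library) looks like this:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)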
🧱 3. Hardcoded Prompt Formatting
🔍 Problem
I swapped the model. Suddenly, the outputs were gibberish.
Turns out, each model expects its own prompt style. But I’d hardcoded a single one — breaking everything as soon as I changed models.
prompt = f"User: {user_input}\nAssistant:" # 🧃Works only for one model flavor
✅ Fix
Use a method that adapts prompt formatting per model:
def format_prompt(self, user_input: str) -> str:
    return f"User: {user_input}\nAssistant:"
🎛️ 4. Fixed Generation Parameters
🔍 Problem
I wanted it to be more creative… but the outputs never changed.
I kept adjusting the temperature but nothing happened — because the code didn’t let me! All generation settings were hardwired in.
temperature = 0.7 # Locked in 🔒
✅ Fix
Expose generation settings as parameters:
def generate_response(self, user_input: str, max_length=2048, temperature=0.7):
    ...
    output_ids = self.model.generate(input_ids, max_length=max_length, temperature=temperature)
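Now the caller decides how creative the output should be. For example (assuming service is an instance of the updated class):

# Same service, two very different sampling behaviours
factual = service.generate_response("What is machine learning?", temperature=0.2)
creative = service.generate_response("Write a haiku about GPUs.", temperature=1.0, max_length=256)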
🛡️ 5. No Input Validation
🔍 Problem
One API test sent an empty string. The model returned... silence.
The function just trusted that input would always be clean. But it wasn’t. And that led to weird results, or worse — crashes.
response = generate_response("") # 😶 awkward
✅ Fix
Check input before processing:
if not user_input or not isinstance(user_input, str):
    return "Please provide a valid text input."
🔢 6. Hardcoded Values
🔍 Problem
I wanted to try a smaller model — but the class refused to budge.
The model name, device, config… all hardcoded. Great for demos. Terrible for flexibility.
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") # Locked in
✅ Fix
Make everything configurable via __init__:
def __init__(self, model_name="meta-llama/Llama-2-7b-chat-hf", device=None):
    self.model_name = model_name
    self.device = self._get_device(device)
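With that in place, swapping models or pinning a device is a one-liner. The smaller checkpoint named here is just an example — any causal LM checkpoint you have access to works:

# Default: Llama-2 7B chat on whatever hardware is available
service = LLMService()

# Or point it at a smaller model and force CPU explicitly
small_service = LLMService(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cpu")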
✅ Conclusion: First Fixes First
Each of these six fixes might seem small in isolation — but together, they elevate your LLMService from a fragile prototype to a flexible, production-ready tool. This is your foundation — stable, adaptable, and ready to scale.
Whether you're deploying a chatbot, building an AI assistant, or just trying to avoid those “why is this breaking now?” moments — these are the must-have first steps.
🚀 Coming up next: In Part 2, we’ll dive deeper with advanced upgrades like batch optimization, smarter response parsing, and model quantization to make your service faster and more efficient.
📢 If this breakdown was helpful,
👍 Like it, 💬 drop a comment, and 🔁 share it with your fellow devs.
👉 Follow me for more deep dives into LLM development, debugging tips, and clean code practices — part two is just around the corner.