If you thought fixing the basics was enough — think again.
In the previous post From Prototype to Production: 6 Essential Fixes for Your LLMService Class, we cleaned up the code to be more stable and flexible. But what if you're dealing with real users, high concurrency, or limited hardware?
This post dives into advanced-level optimizations that address performance, memory usage, and production reliability. Whether you're scaling up or shipping to prod, these are the next steps to make your LLMService truly battle-ready.
📦 Table of Contents
- 📄 Original Code
- ⚡ 1. Inefficient Batch Processing
- 🧹 2. Lack of Resource Management
- 🧩 3. Fragile Response Parsing (Redux)
- 🪶 4. Missing Tokenizer Padding Configuration
- 📉 5. No Model Quantization
- 🧠 Final Thoughts
📄 Original Code
The original LLMService class provides a foundation for LLM interactions but requires enhancements for production use. Below is the original implementation:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMService:
    def __init__(self):
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        self.model.to("cuda")

    def generate_response(self, user_input):
        # Format the prompt for a chat model
        prompt = f"User: {user_input}\nAssistant:"
        # Tokenize the input
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to("cuda")
        # Generate output
        with torch.no_grad():
            output_ids = self.model.generate(
                input_ids,
                max_length=2048,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        # Decode output
        output = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Return everything after "Assistant:"
        answer = output.split("Assistant:")[1].strip()
        return answer

    def batch_generate(self, user_inputs):
        responses = []
        for user_input in user_inputs:
            responses.append(self.generate_response(user_input))
        return responses

# Example usage
if __name__ == "__main__":
    service = LLMService()

    # Process a single query
    response = service.generate_response("What is machine learning?")
    print(response)

    # Process multiple queries
    responses = service.batch_generate([
        "What is deep learning?",
        "Explain natural language processing.",
        "How do transformers work?"
    ])
    for resp in responses:
        print(resp)
        print("-" * 50)
⚡ 1. Inefficient Batch Processing
🔍 The Problem
I launched a test with 100 prompts. It choked after 3.
The original batch_generate simply looped through inputs one by one. Zero batching. Zero performance gain.
def batch_generate(self, user_inputs):
    responses = []
    for user_input in user_inputs:
        responses.append(self.generate_response(user_input))
    return responses
💡 The Fix
Let's use a single forward pass with proper tensor batching — it's faster, cleaner, and takes full advantage of GPU acceleration.
def batch_generate(self, user_inputs: List[str], ...):
    prompts = [self.format_prompt(u) for u in user_inputs]
    batch_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True)
    batch_inputs = {k: v.to(self.device) for k, v in batch_inputs.items()}
    with torch.no_grad():
        output_ids = self.model.generate(**batch_inputs, ...)
    outputs = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [self.extract_response(o, p) for o, p in zip(outputs, prompts)]
🧠 Pro Tip: Batch size has a huge impact on latency. Experiment, benchmark, and tune!
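If you want to cap how many prompts hit the GPU at once, one option is a thin wrapper that chunks the inputs. This is a sketch; the batch_size parameter and the wrapper name are my own additions, not part of the original class:

from typing import List

def batch_generate_chunked(self, user_inputs: List[str], batch_size: int = 8) -> List[str]:
    # Feed the model fixed-size chunks so one oversized request
    # can't exhaust GPU memory in a single forward pass.
    responses: List[str] = []
    for start in range(0, len(user_inputs), batch_size):
        chunk = user_inputs[start:start + batch_size]
        responses.extend(self.batch_generate(chunk))
    return responses

Benchmark a few chunk sizes on your own hardware; the sweet spot depends on prompt length and available VRAM.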
🧹 2. Lack of Resource Management
🔍 The Problem
I walked away from my app. Came back an hour later… GPU at 100%.
Turns out, PyTorch's CUDA caching allocator holds on to GPU memory even after tensors go out of scope. Without explicit cleanup, usage keeps creeping upward.
💡 The Fix
We added both an explicit cleanup method and a destructor to keep GPU usage lean.
def clear_cuda_cache(self):
    if self.device == "cuda":
        torch.cuda.empty_cache()
        gc.collect()
        logger.info("CUDA cache cleared")

def __del__(self):
    try:
        self.clear_cuda_cache()
    except Exception:
        # Never let cleanup errors propagate out of the destructor.
        pass
🧪 Even in short scripts, always clean up your CUDA mess. Your future self will thank you.
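As a quick usage sketch (assuming the LLMService built up in this series), wrapping generation in try/finally makes the cleanup unconditional:

service = LLMService()
try:
    responses = service.batch_generate([
        "What is deep learning?",
        "How do transformers work?",
    ])
finally:
    # Free cached GPU memory even if generation raised.
    service.clear_cuda_cache()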
🧩 3. Fragile Response Parsing (Redux)
🔍 The Problem
Assistant: What do you mean by "Assistant"?
Yep. If your model repeats the word "Assistant:" inside its response — your parsing logic breaks.
output = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
answer = output.split("Assistant:")[1].strip() # Not safe!
💡 The Fix
We made parsing bulletproof — handling multiple cases gracefully.
def extract_response(self, full_output: str, prompt: str) -> str:
    # Case 1: the model echoed the prompt verbatim -> drop the prefix.
    if full_output.startswith(prompt):
        return full_output[len(prompt):].strip()
    # Case 2: no exact echo, but the marker is present -> split on the first one only.
    elif "Assistant:" in full_output:
        return full_output.split("Assistant:", 1)[-1].strip()
    # Case 3: nothing recognizable -> return the raw output.
    return full_output
🧠 Rule of thumb: never assume your model follows your format perfectly. Prepare for chaos.
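A quick illustration of the three branches (the strings below are made up for the example, and service is an instance of the class above):

service = LLMService()
prompt = "User: What is ML?\nAssistant:"

# Output echoes the prompt -> the prefix is stripped.
print(service.extract_response("User: What is ML?\nAssistant: A field of AI.", prompt))
# No exact echo, but the marker appears twice -> only the first one is consumed.
print(service.extract_response("Assistant: I said Assistant: on purpose.", prompt))
# Nothing recognizable -> the raw output comes back untouched.
print(service.extract_response("A field of AI.", prompt))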
🪶 4. Missing Tokenizer Padding Configuration
🔍 The Problem
My batch inference worked… until it didn’t.
If your tokenizer doesn't have a pad_token, it'll throw a cryptic error the first time you try batch padding.
💡 The Fix
We set a fallback to use the eos_token if needed.
if self.tokenizer.pad_token is None:
    logger.info("Setting pad_token to eos_token")
    self.tokenizer.pad_token = self.tokenizer.eos_token
🔧 This tiny fix can prevent hours of silent failure. Add it and forget the headaches.
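One closely related setting worth checking, and an assumption about your setup rather than something from the original code: decoder-only models like LLaMA are usually batched with left padding, and transformers warns if it detects right padding during generation.

# Decoder-only models append new tokens at the end of the sequence,
# so left padding keeps every prompt flush against its generated text.
self.tokenizer.padding_side = "left"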
📉 5. No Model Quantization
🔍 The Problem
My GPU was crying.
The default model loaded in full precision — heavy on memory and unnecessary unless you're training.
# No quantization logic at all
💡 The Fix
Load models in 8-bit mode with load_in_8bit, cutting memory usage dramatically.
quantization_config = {}
if self.load_in_8bit and self.device != "cpu":
    quantization_config = {"load_in_8bit": True}

self.model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=self.device if self.device != "cpu" else None,
    torch_dtype=torch.float16 if self.device != "cpu" else torch.float32,
    use_cache=use_cache,
    **quantization_config
)
🧠 This one change let me run LLaMA-2 on a 12GB card. Highly recommended.
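On recent transformers releases the same idea is usually expressed through BitsAndBytesConfig instead of a bare load_in_8bit kwarg. Here's a minimal sketch, assuming the bitsandbytes and accelerate packages are installed and a CUDA device is available:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,  # 8-bit weights via bitsandbytes
    device_map="auto",               # let accelerate place layers on available devices
)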
🧠 Final Thoughts
Even well-written code can fail at scale. These advanced improvements aren't just optimizations — they’re requirements if you're planning to:
- Host LLM services in production
- Run on limited resources
- Scale to multiple users
- Avoid mysterious crashes
💬 Let me know — which of these fixes saved your day?
🔁 If you found this helpful, like, share, and follow for more behind-the-scenes breakdowns like this.