Unleashing the Power of LLMLingua: Prompt Compression Performance Testing
Shannon Lal


Publish Date: Apr 3 '24

Prompt compression is a technique for reducing the number of tokens sent to a large language model (LLM), with the goal of lowering inference cost and latency. LLMLingua (https://llmlingua.com/llmlingua2.html) is a prompt compression method that combines context filtering, sentence filtering, and token-level filtering to achieve high compression ratios while preserving semantic information. Because inference cost grows with prompt length, a shorter prompt needs fewer computational resources, which makes LLMs more accessible and cost-effective. As an AI developer integrating different LLMs, I wanted to measure the performance impact of adding prompt compression to my workflow and see whether it could reduce costs and speed up LLM processing times.
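To make the idea of token-level filtering concrete, here is a toy sketch in Python. It is not LLMLingua's algorithm: LLMLingua scores tokens with a trained model, whereas this stand-in (the hypothetical `toy_token_filter`) simply treats words that are rarer within the prompt as more informative and drops the most repeated ones until a token budget is met.

```python
import math
from collections import Counter


def toy_token_filter(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-scoring tokens, preserving their original order.

    The score is a stand-in (self-information from in-prompt word frequency),
    not the model-based score a real prompt compressor would use.
    """
    tokens = text.split()
    if not tokens:
        return ""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    # Rarer tokens get a higher score: -log(relative frequency).
    scores = [-math.log(counts[t.lower()] / total) for t in tokens]
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Pick the n_keep best tokens; ties are broken by earlier position.
    best = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:n_keep]
    return " ".join(tokens[i] for i in sorted(best))


print(toy_token_filter("the cat sat on the mat and the dog sat on the rug"))
# cat sat mat and dog rug
```

The compressed string is no longer grammatical, which is typical of token-level filtering: the aim is to keep what the downstream LLM needs, not what a human reader would prefer.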

To evaluate the impact of prompt compression on LLM processing times and costs, I ran experiments with LLMLingua on prompts of three sizes: small (1-2k tokens), medium (4k tokens), and large (58k tokens). I applied context filtering, sentence filtering, token-level filtering, and a combination of all three, and measured the execution time and compression ratio of each. I ran these experiments on an NVIDIA L4 GPU with 24 GB of memory and used the Llama 7B model for token-level filtering. The following is a summary of my tests:

| Compression Technique | Prompt Size | Execution Time (s) | Original Tokens | Compressed Tokens | Compression Ratio | Tokens Compressed/Second |
|---|---|---|---|---|---|---|
| Context Filtering | Small | 0.946 | 1395 | 692 | 2.36 | 743.13 |
| Sentence Filtering | Small | 3.45 | 1395 | 636 | 2.3 | 220.00 |
| Token Level | Small | 1.725 | 1395 | 421 | 3.27 | 564.64 |
| All | Small | 3.043 | 1395 | 474 | 3.08 | 302.66 |
| Context Filtering | Medium | 2.869 | 4146 | 1087 | 4.15 | 1066.23 |
| Sentence Filtering | Medium | 10.623 | 4146 | 1206 | 3.62 | 276.76 |
| Token Level | Medium | 6.02 | 4146 | 854 | 5.9 | 546.84 |
| All | Medium | 6.701 | 4146 | 918 | 4.71 | 481.72 |
| Context Filtering | Large | 21.358 | 58990 | 27937 | 2.11 | 1453.93 |
| Sentence Filtering | Large | 48.514 | 58990 | 11087 | 5.32 | 987.41 |
| Token Level | Large | 57.653 | 58990 | 1224 | 48.19 | 1001.96 |
| All | Large | 70.258 | 58990 | 7570 | 7.79 | 731.87 |
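For readers who want to reproduce this kind of measurement, the sketch below shows one way to time each technique. It is an assumption about the setup, not the code used for the table. The three `use_*_level_filter` flags and the `origin_tokens` / `compressed_tokens` result keys come from LLMLingua's `PromptCompressor.compress_prompt`; the names `TECHNIQUE_FLAGS`, `derived_metrics`, `time_compression`, and `make_llmlingua_compressor` are hypothetical, and `rate=0.5` is a placeholder target. The tokens-compressed-per-second column appears to be original tokens minus compressed tokens, divided by execution time, which is what `derived_metrics` computes.

```python
import time
from typing import Callable, Dict, List

# Keyword flags accepted by LLMLingua's PromptCompressor.compress_prompt.
# Mapping them onto the four table rows is an assumption, not the author's config.
TECHNIQUE_FLAGS: Dict[str, Dict[str, bool]] = {
    "Context Filtering": {"use_context_level_filter": True,
                          "use_sentence_level_filter": False,
                          "use_token_level_filter": False},
    "Sentence Filtering": {"use_context_level_filter": False,
                           "use_sentence_level_filter": True,
                           "use_token_level_filter": False},
    "Token Level": {"use_context_level_filter": False,
                    "use_sentence_level_filter": False,
                    "use_token_level_filter": True},
    "All": {"use_context_level_filter": True,
            "use_sentence_level_filter": True,
            "use_token_level_filter": True},
}


def derived_metrics(seconds: float, original: int, compressed: int) -> Dict[str, float]:
    """Token ratio and tokens removed per second for one run."""
    return {
        "token_ratio": original / compressed,
        "tokens_removed_per_second": (original - compressed) / seconds,
    }


def time_compression(compress: Callable[..., dict], contexts: List[str],
                     technique: str) -> Dict[str, float]:
    """Run one technique once and report wall-clock time plus derived metrics."""
    start = time.perf_counter()
    result = compress(contexts, **TECHNIQUE_FLAGS[technique])
    seconds = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    row = {
        "seconds": seconds,
        "origin_tokens": result["origin_tokens"],
        "compressed_tokens": result["compressed_tokens"],
    }
    row.update(derived_metrics(seconds, row["origin_tokens"], row["compressed_tokens"]))
    return row


def make_llmlingua_compressor(rate: float = 0.5) -> Callable[..., dict]:
    """Hypothetical adapter; requires `pip install llmlingua` and a model download."""
    from llmlingua import PromptCompressor  # lazy import so the harness runs without it

    compressor = PromptCompressor()  # loads the library's default model once

    def compress(contexts: List[str], **flags: bool) -> dict:
        # Check the keyword names against your installed version before relying on them.
        return compressor.compress_prompt(contexts, rate=rate, **flags)

    return compress
```

Feeding the first table row through `derived_metrics(0.946, 1395, 692)` gives about 743.13 tokens removed per second, matching the table. The same call gives a token ratio of about 2.02, lower than the reported 2.36, so the reported ratio for the small and medium prompts was presumably measured differently (for example with a different tokenizer); the post does not say. For the large prompts, the reported ratio equals original tokens divided by compressed tokens.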

These experiments show how prompt compression with LLMLingua trades compression time against prompt size. Every technique reduced every prompt to less than half of its original token count, which should translate into faster LLM processing and lower cost. Token-level filtering achieved the highest compression ratio at every prompt size, most dramatically on the large prompt (48.19x, from 58,990 to 1,224 tokens). Context filtering was the fastest technique at every prompt size (0.946 s, 2.869 s, and 21.358 s) while still reaching compression ratios between 2.11 and 4.15. The right choice depends on the application: it has to balance compression ratio against execution time. On the large prompt, for example, token-level filtering took 57.653 s compared with 21.358 s for context filtering. These findings are a starting point for further optimization of prompt compression, with the aim of more accessible, efficient, and cost-effective LLM applications.
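The trade-off between compression ratio and execution time can be made concrete with a little arithmetic. The sketch below is illustrative only: the token counts and compression times are taken from the table, but the price per million input tokens and the LLM's input-processing speed are hypothetical placeholders, as are the function names.

```python
def input_cost_saved(original_tokens: int, compressed_tokens: int,
                     usd_per_million_tokens: float) -> float:
    """Input-token cost avoided per request, in USD."""
    return (original_tokens - compressed_tokens) * usd_per_million_tokens / 1_000_000


def net_latency_change(compress_seconds: float, original_tokens: int,
                       compressed_tokens: int, llm_tokens_per_second: float) -> float:
    """Compression time minus LLM input time saved; positive means slower overall."""
    llm_seconds_saved = (original_tokens - compressed_tokens) / llm_tokens_per_second
    return compress_seconds - llm_seconds_saved


PRICE_USD_PER_MILLION = 10.0    # hypothetical input price, not a real quote
LLM_TOKENS_PER_SECOND = 2000.0  # hypothetical input-processing speed

# (technique, compression seconds, original tokens, compressed tokens) for the large prompt
large_rows = [
    ("Context Filtering", 21.358, 58990, 27937),
    ("Token Level", 57.653, 58990, 1224),
]
for name, seconds, original, compressed in large_rows:
    saved = input_cost_saved(original, compressed, PRICE_USD_PER_MILLION)
    delta = net_latency_change(seconds, original, compressed, LLM_TOKENS_PER_SECOND)
    print(f"{name}: ${saved:.4f} saved per request, net latency change {delta:+.2f} s")
```

The sign of `net_latency_change` depends entirely on the assumed LLM speed, so it is worth measuring with your own model before picking a technique.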

Comments 1 total

  • Matthew Cummins, Jul 16, 2025

    PS C:\Users\matt2\Desktop\Software\universal-meta-language (6)> activate
    (.venv) PS C:\Users\matt2\Desktop\Software\universal-meta-language (6)> python main.py
    Loading Wikitext‑2 dataset…
    Sampled 10000 sentences.
    Computing embeddings with Sentence‑BERT (all‑MiniLM‑L6‑v2)…
    Batches: 100%|██████████| 79/79 [00:37<00:00, 2.12it/s]
    Testing codebook sizes: [50, 100, 200, 500, 1000, 2000]
    → Clustering with K=50…
    C:\Users\matt2\Desktop\Software\universal-meta-language (6)\.venv\Lib\site-packages\joblib\externals\loky\backend\context.py:131: UserWarning: Could not find the number of physical cores for the following reason:
    [WinError 2] The system cannot find the file specified
    Returning the number of logical cores instead. You can silence this warning by setting LOKY_MAX_CPU_COUNT to the number of cores you want to use.
      warnings.warn(
      File "C:\Users\matt2\Desktop\Software\universal-meta-language (6)\.venv\Lib\site-packages\joblib\externals\loky\backend\context.py", line 247, in _count_physical_cores
        cpu_count_physical = _count_physical_cores_win32()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\matt2\Desktop\Software\universal-meta-language (6)\.venv\Lib\site-packages\joblib\externals\loky\backend\context.py", line 299, in _count_physical_cores_win32
        cpu_info = subprocess.run(
                   ^^^^^^^^^^^^^^^
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 548, in run
        with Popen(*popenargs, **kwargs) as process:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1026, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1538, in _execute_child
        hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
    To avoid this warning, please rebuild your copy of OpenBLAS with a larger NUM_THREADS setting
    or set the environment variable OPENBLAS_NUM_THREADS to 24 or lower
    → Clustering with K=100…
    → Clustering with K=200…
    → Clustering with K=500…
    → Clustering with K=1000…
    → Clustering with K=2000…

    [DEBUG] Expected rows = 6, Collected rows = 6
    [DEBUG] First few result entries:
    [{'K': 50,
      'ambiguity_rate': 0.995,
      'bits_per_code': np.float64(5.643856189774724),
      'compression_ratio': np.float64(251.71905783489265),
      'silhouette_score': 0.03647768124938011},
     {'K': 100,
      'ambiguity_rate': 0.99,
      'bits_per_code': np.float64(6.643856189774724),
      'compression_ratio': np.float64(213.83156439060306),
      'silhouette_score': 0.04587062820792198},
     {'K': 200,
      'ambiguity_rate': 0.98,
      'bits_per_code': np.float64(7.643856189774724),
      'compression_ratio': np.float64(185.85725939561272),
      'silhouette_score': 0.05711694434285164}]

    Final DataFrame:
          K  ambiguity_rate  bits_per_code  compression_ratio  silhouette_score
    0    50           0.995       5.643856         251.719058          0.036478
    1   100           0.990       6.643856         213.831564          0.045871
    2   200           0.980       7.643856         185.857259          0.057117
    3   500           0.950       8.965784         158.454198          0.075950
    4  1000           0.900       9.965784         142.554376          0.081304
    5  2000           0.800      10.965784         129.554451          0.094293

    Saved results to semantic_compression_results.csv
    Plot saved to semantic_compression_metrics.png

    #!/usr/bin/env python3
    """
    main.py

    AI‑Driven Semantic Compression Experiment on Wikitext‑2
    Includes debug prints to ensure the results list is built correctly.
    """
    import random
    from pprint import pprint

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score


    def main():
        # 1. Load Wikitext-2 and sample 10k sentences
        print("Loading Wikitext‑2 dataset…")
        ds = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
        sentences = [s for s in ds['text'] if len(s.split()) > 5]
        if len(sentences) < 10000:
            raise ValueError(f"Not enough sentences (found {len(sentences)}).")
        random.seed(42)
        corpus = random.sample(sentences, k=10000)
        N = len(corpus)
        print(f"Sampled {N} sentences.")

        # 2. Compute Sentence-BERT embeddings
        print("Computing embeddings with Sentence‑BERT (all‑MiniLM‑L6‑v2)…")
        model = SentenceTransformer('all-MiniLM-L6-v2')
        embeddings = model.encode(corpus, batch_size=128, show_progress_bar=True)

        # 3. Define codebook sizes to test
        K_values = [50, 100, 200, 500, 1000, 2000]
        K_values = [K for K in K_values if K <= N]
        print(f"Testing codebook sizes: {K_values}")

        # 4. Clustering & metrics computation
        results = []
        avg_tokens = np.mean([len(s.split()) for s in corpus])
        orig_bits_per_sentence = avg_tokens * np.log2(30000)  # rough estimate

        for K in K_values:
            print(f" → Clustering with K={K}…")
            kmeans = KMeans(n_clusters=K, random_state=42)
            labels = kmeans.fit_predict(embeddings)

            # Ambiguity: collisions
            unique_clusters = len(set(labels))
            collisions = N - unique_clusters
            ambiguity_rate = collisions / N

            # Compression ratio
            bits_per_code = np.log2(K)
            compression_ratio = orig_bits_per_sentence / bits_per_code

            # Silhouette score
            sil = silhouette_score(embeddings, labels) if K > 1 else float('nan')

            results.append({
                'K': K,
                'ambiguity_rate': ambiguity_rate,
                'bits_per_code': bits_per_code,
                'compression_ratio': compression_ratio,
                'silhouette_score': sil
            })

        # --- DEBUG CHECKS ---
        print(f"\n[DEBUG] Expected rows = {len(K_values)}, Collected rows = {len(results)}")
        print("[DEBUG] First few result entries:")
        pprint(results[:3])

        # 5. Build DataFrame & save
        df = pd.DataFrame(results)
        print("\nFinal DataFrame:")
        print(df)
        df.to_csv("semantic_compression_results.csv", index=False)
        print("\nSaved results to semantic_compression_results.csv")

        # 6. Plot metrics
        plt.figure(figsize=(8, 5))
        plt.plot(df['K'], df['ambiguity_rate'], marker='o', label='Ambiguity Rate')
        plt.plot(df['K'], df['compression_ratio'], marker='x', label='Compression Ratio')
        plt.plot(df['K'], df['silhouette_score'], marker='s', label='Silhouette Score')
        plt.xscale('log')
        plt.xlabel('Codebook Size (K)')
        plt.title('Semantic Compression Metrics vs. K')
        plt.legend()
        plt.tight_layout()
        plt.savefig("semantic_compression_metrics.png")
        print("Plot saved to semantic_compression_metrics.png")
        plt.show()


    if __name__ == "__main__":
        main()
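As a reading aid for the log and script in this comment: three of the four logged metrics have closed forms that do not depend on the embeddings. The sketch below recomputes them with the standard library only. The function names are hypothetical, and it assumes that KMeans left no cluster empty, in which case the script's `ambiguity_rate` reduces to (N - K) / N.

```python
import math


def codebook_metrics(k: int, n: int, avg_tokens: float, vocab_size: int = 30000):
    """Recompute the script's closed-form metrics without any clustering."""
    ambiguity_rate = (n - k) / n  # matches the script when no cluster is empty
    bits_per_code = math.log2(k)
    orig_bits_per_sentence = avg_tokens * math.log2(vocab_size)
    return ambiguity_rate, bits_per_code, orig_bits_per_sentence / bits_per_code


def implied_avg_tokens(compression_ratio: float, k: int, vocab_size: int = 30000) -> float:
    """Back out the average whitespace-token count from a logged compression ratio."""
    return compression_ratio * math.log2(k) / math.log2(vocab_size)


# The logged ratio for K=50 fixes the average sentence length used by the script.
avg = implied_avg_tokens(251.71905783489265, 50)
for k in (50, 100, 200, 500, 1000, 2000):
    amb, bits, ratio = codebook_metrics(k, 10000, avg)
    print(f"K={k:4d} ambiguity={amb:.3f} bits={bits:.6f} ratio={ratio:.6f}")
```

The implied average is roughly 95.5 whitespace-separated tokens per sampled line, and the loop reproduces the ambiguity, bits-per-code, and compression-ratio columns of the logged DataFrame to within rounding. Of the logged metrics, only the silhouette score depends on how the embeddings actually cluster.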

    Comprehensive Expansion of AI-Driven Semantic Compression Experiment

    This expanded report extends the original Wikitext-2 clustering-based semantic compression experiment to a broader scope, investigating additional datasets, advanced quantization techniques, hybrid models, and downstream task evaluations. It is structured as follows:

    1. Objectives and Scope
        • Expand dataset diversity (text, code, multilingual).
        • Integrate advanced quantization methods (PQ, OPQ, AQ).
        • Evaluate end-to-end semantic reconstruction on downstream tasks (QA, summarization, translation).
        • Analyze trade-offs across codebook strategies and hybrid pipelines.

    2. Datasets
        • Wikitext-2 (English Wikipedia): baseline.
        • OpenWebText (45 GB scraped web data): large-scale text.
        • CodeSearchNet (Python code): code semantics.
        • Multilingual TED Talks: multilingual text.
        • ImageNet Captions: vision-language pairs for multimodal experiments.

    3. Embedding Models
        • Sentence-BERT variants:
            • all-MiniLM-L6-v2 (384-dim): fast baseline.
            • all-mpnet-base-v2 (768-dim): richer embeddings.
            • xlm-r-100langs (512-dim): multilingual.
        • CLIP-ViT-B/32: image-text embeddings.

    4. Quantization & Clustering Methods
        • KMeans & MiniBatchKMeans: baseline discrete codes.
        • Product Quantization (PQ): FAISS PQ with 8–16 subquantizers.
        • Optimized PQ (OPQ): learned rotation before PQ.
        • Additive Quantization (AQ): multi-codebook additive representation.
        • Vector-Quantized Autoencoders: VQ-VAE style latent quantization.

    5. Experimental Pipeline
        • Preprocessing: sentence splitting or SentencePiece tokenization; vector normalization.
        • Embedding Extraction: batched inference with GPU acceleration.
        • Quantization & Indexing: training quantizers, measuring code size and storage footprint.
        • Reconstruction:
            • Indirect: nearest-centroid decoding, feeding reconstructed embeddings into a decoder LLM.
            • Direct: reconstruct text via retrieval-augmented generation (RAG) against the original corpus.

    6. Metrics
        • Intrinsic: Ambiguity Rate, Bits/code, Compression Ratio, Silhouette Score, Quantization MSE.
        • Extrinsic: QA accuracy (e.g., SQuAD, TyDi QA), ROUGE and BERTScore for summarization, BLEU for translation, code generation accuracy (CodeBLEU).
        • Latency & Throughput: encode/decode speed, memory footprint.

    7. Results & Analysis
        • Comparative tables and plots of intrinsic metrics across methods and datasets.
        • Downstream task performance vs. compression ratio curves.
        • Analysis of semantic drift and failure cases.

    8. Technical Challenges & Solutions
        • High-dimensional quantizer training on large-scale datasets.
        • Hybrid model integration complexities.
        • Balancing offline quantizer training vs. online adaptation.

    9. Conclusions & Future Work
        • Summary of best-performing pipelines by use case.
        • Recommendations for real-world deployment (e.g., IoT, edge devices).
        • Directions: adaptive codebooks, reinforcement-learned quantization, cross-modal semantic compression.

    This document serves as a blueprint for implementing and reporting a thorough, end-to-end AI-driven semantic compression study across modalities and tasks.
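The outline's "measuring code size and storage footprint" step can be estimated before any quantizer is trained. The sketch below is a rough calculation under stated assumptions: embeddings stored as 32-bit floats, one KMeans centroid index per vector, and 8 bits per PQ subquantizer (256 centroids each). The function names are hypothetical, and the cost of storing the codebooks themselves is ignored.

```python
import math


def kmeans_code_bits(k: int) -> int:
    """Whole bits needed to store the index of one of k centroids."""
    return math.ceil(math.log2(k))


def pq_code_bits(num_subquantizers: int, bits_per_subquantizer: int = 8) -> int:
    """Bits per vector for product quantization: one sub-code per subquantizer."""
    return num_subquantizers * bits_per_subquantizer


def raw_embedding_bits(dim: int, bits_per_value: int = 32) -> int:
    """Bits per vector for an uncompressed embedding."""
    return dim * bits_per_value


raw = raw_embedding_bits(384)  # all-MiniLM-L6-v2 embedding size from the outline
for label, bits in [
    ("KMeans, K=2000", kmeans_code_bits(2000)),
    ("PQ, 8 subquantizers", pq_code_bits(8)),
    ("PQ, 16 subquantizers", pq_code_bits(16)),
]:
    print(f"{label}: {bits} bits per vector, {raw / bits:.0f}x smaller than raw")
```

A PQ code is larger than a single KMeans index, but it can address far more distinct reconstructions, which is the trade-off the proposed comparison would measure.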
