Building an LLM Benchmarking Setup — Unified Evaluation of Cloud and Local Models, and the Security Pitfalls Along the Way

Intro.

I wanted to compare cloud LLMs (Claude, OpenAI, Gemini, etc.) and local LLMs (mid-sized models via Ollama) on the same playing field. I also wanted to measure CPU, GPU, and memory utilization for the local models. This post documents what I learned along the way.

The short version: don't build from scratch. A thin wrapper around existing OSS is the most practical approach. Along the way I also ran into LiteLLM's security incidents, which reminded me why auditing dependencies matters.

Target audience: developers with a basic understanding of LLMs and evaluation frameworks.

Full source code: github.com/devdama/llm-benchmark

Choosing a Benchmark Framework

The State of Existing OSS

Accuracy evaluation tools are mature, leaving little room to build from scratch.

Tool                                Strengths
----------------------------------  ------------------------------------------------------------
lm-evaluation-harness (EleutherAI)  The de facto standard. Dozens of tasks including ARC, HellaSwag, MMLU. Powers the Hugging Face Open LLM Leaderboard
HELM (Stanford CRFM)                42 scenarios × 59 metrics. Evaluates accuracy, bias, toxicity, and efficiency holistically
bigcode-evaluation-harness          Coding-focused. HumanEval, MBPP, MultiPL-E, BigCodeBench
EvalPlus                            HumanEval+ (80× more test cases) / MBPP+ (35× more test cases)
OpenCompass                         Strong on Chinese-origin models (Qwen, DeepSeek)

Local LLM resource measurement has its own tools like ollama-benchmark and llm-perf-leaderboard, but it's standard practice to treat resource measurement as a separate layer from accuracy evaluation.

The State of Coding Evaluation (2026)

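HumanEval is saturated (frontier models hit 93%+), so modern coding evaluation has moved on to benchmarks such as EvalPlus (more rigorous tests), LiveCodeBench (contamination resistance), and BigCodeBench (practical tasks).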

For evaluating mid-sized local LLMs (8B–30B), a realistic combination is HumanEval+, LiveCodeBench, and BigCodeBench. SWE-Bench tends to produce very low scores for mid-sized models and takes a long time to run, so treat it as optional.

Architecture Design

I went with an "existing OSS + ~200 lines of orchestration" approach.

┌─────────────────────────────────────────────────────┐
│  Custom orchestration layer (Python)                 │
│  - Centralized model configuration                   │
│  - Resource monitor start/stop                       │
│  - Result aggregation to CSV                         │
└─────────────────────────────────────────────────────┘
            ↓                ↓                ↓
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │ EvalPlus /   │  │ pynvml +     │  │ LiteLLM      │
   │ bigcode-eval │  │ psutil       │  │ (unified API)│
   └──────────────┘  └──────────────┘  └──────────────┘
            ↓
   Claude / OpenAI / Gemini / Ollama / vLLM

Build vs. Reuse Decisions

Component                         Choice           Reason
--------------------------------  ---------------  ----------------------------------------------------------
Benchmark datasets                Reuse            Custom datasets become incomparable with published results
Unified model API                 Reuse (LiteLLM)  Reinventing per-provider abstraction is wasteful
Resource monitoring               Build            Use-case-specific customization pays off. ~50 lines
Result aggregation/visualization  Build            Pandas + matplotlib is enough
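
To make the "centralized model configuration" and "result aggregation to CSV" pieces concrete, here is a minimal sketch using only the standard library. The names ModelConfig, RunResult, and results_to_csv, and the column set, are illustrative assumptions on my part, not the code in the linked repository.

```python
import csv
import io
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class ModelConfig:
    """One entry per model under test (hypothetical schema)."""
    name: str                       # LiteLLM-style model string
    is_local: bool                  # local models also get resource monitoring
    api_base: Optional[str] = None  # only needed for self-hosted endpoints

@dataclass
class RunResult:
    """One row of the aggregated results (hypothetical columns)."""
    model: str
    benchmark: str
    pass_at_1: float
    mean_latency_s: float

# Single place where every model in the comparison is declared.
MODELS = [
    ModelConfig("anthropic/claude-sonnet-4-6", is_local=False),
    ModelConfig("ollama/qwen2.5-coder:7b", is_local=True,
                api_base="http://localhost:11434"),
]

def results_to_csv(results):
    """Serialize RunResult rows to CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["model", "benchmark", "pass_at_1", "mean_latency_s"],
        lineterminator="\n",
    )
    writer.writeheader()
    for result in results:
        writer.writerow(asdict(result))
    return buf.getvalue()
```

Keeping the model list in one structure means the evaluation loop and the resource monitor can both read is_local from the same source instead of duplicating that decision.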

LiteLLM — The Key to Unifying Cloud and Local

litellm exposes 100+ providers through a single, OpenAI-format API. You switch providers just by changing the model name.

import litellm

# Claude
response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}]
)

# Ollama (local)
response = litellm.completion(
    model="ollama/qwen2.5-coder:7b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Hello"}]
)

# Same access pattern in both cases
print(response.choices[0].message.content)

Response Structure

litellm.completion() returns a ModelResponse object that mirrors the OpenAI SDK's chat completion response, plus a few LiteLLM-specific fields:

ModelResponse(
    choices=[
        Choices(
            finish_reason="stop",       # "stop" | "length" | "tool_calls" | "content_filter"
            index=0,
            message=Message(
                role="assistant",
                content="Hello! ...",   # Optional[str] - can be None
                tool_calls=None,
            ),
        ),
    ],
    usage=Usage(
        prompt_tokens=12,
        completion_tokens=24,
        total_tokens=36,
    ),
    _hidden_params={
        "response_cost": 0.000234,   # LiteLLM-specific: estimated USD cost
        "custom_llm_provider": "anthropic",
    },
)

Both attribute access (response.choices[0].message.content) and dict access are supported. In streaming mode, message becomes delta and each chunk carries partial content.
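
Because content can be None (for example on a tool call) and streaming chunks use delta instead of message, a small pair of helpers keeps benchmark code from crashing on those cases. This is a sketch that relies only on the attribute layout shown above; extract_text and join_stream are names I made up, and the SimpleNamespace object is a stand-in rather than a real API response.

```python
from types import SimpleNamespace

def extract_text(response):
    """Return (text, finish_reason) from an OpenAI-format response.

    content is Optional[str], so None is normalized to an empty string.
    """
    choice = response.choices[0]
    content = choice.message.content
    return (content if content is not None else ""), choice.finish_reason

def join_stream(chunks):
    """Concatenate the partial content carried by streaming chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # some chunks may carry no choices at all
            continue
        piece = chunk.choices[0].delta.content
        if piece:                  # skip None / empty deltas
            parts.append(piece)
    return "".join(parts)

# Stand-in shaped like the structure above (not a real API call).
fake = SimpleNamespace(choices=[SimpleNamespace(
    finish_reason="tool_calls",
    message=SimpleNamespace(role="assistant", content=None, tool_calls=[]),
)])
text, reason = extract_text(fake)
```

Returning finish_reason alongside the text makes it easy to log truncated or filtered generations instead of silently scoring them as wrong answers.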

Key Parameters

For a fair comparison, hold these parameters constant across every model you compare:

Parameter    Role                       Benchmark recommendation
-----------  -------------------------  ------------------------------------------------------
temperature  Output randomness          0.0 (pass@1) / 0.6 (pass@k)
max_tokens   Maximum output length      Code: 2048 / Reasoning models: 8192+
seed         Random seed                Fix if possible (OpenAI/vLLM support; Anthropic doesn't)
top_p        Token candidate filtering  1.0 (don't combine with temperature)
stop         Stop sequences             Task-dependent
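
The table separates pass@1 from pass@k. For reference, here is a minimal sketch of the unbiased pass@k estimator popularized by the original HumanEval work, assuming you generate n samples per problem and c of them pass the tests. The function name and argument checks are mine; the harnesses listed earlier ship their own implementations.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures: any k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all problems. With temperature 0.0 and n = 1 it reduces to plain accuracy, which is why greedy decoding is the usual pass@1 setting.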

Important: reasoning models (Claude Extended Thinking, OpenAI o1/o3, DeepSeek-R1) consume completion_tokens for the reasoning trace itself. max_tokens=1024 may not even leave room for the answer. Use 8192 or more.

If you see many finish_reason == "length" results, that's a sign your max_tokens is too small.
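
One way to enforce identical parameters and to watch for truncation is to build every request from a single shared dictionary and track how often finish_reason comes back as "length". The sketch below is plain Python; SHARED_PARAMS, build_request, truncation_rate, and the seed value 1234 are illustrative assumptions. The returned dict is meant to be passed as litellm.completion(**kwargs).

```python
from collections import Counter

# One source of truth for generation settings (values follow the table above;
# the seed itself is an arbitrary placeholder).
SHARED_PARAMS = {"temperature": 0.0, "max_tokens": 2048, "top_p": 1.0, "seed": 1234}

def build_request(model, prompt, supports_seed=True, **overrides):
    """Build completion kwargs so every model gets the same settings.

    overrides lets a reasoning model raise max_tokens explicitly, which keeps
    the deviation visible in the code instead of hidden in a default.
    """
    params = {**SHARED_PARAMS, **overrides}
    if not supports_seed:
        params.pop("seed", None)   # e.g. providers without seed support
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }

def truncation_rate(finish_reasons):
    """Fraction of generations cut off by max_tokens (finish_reason == 'length')."""
    reasons = list(finish_reasons)
    if not reasons:
        return 0.0
    return Counter(reasons)["length"] / len(reasons)
```

If truncation_rate comes back well above zero for one model only, the comparison is measuring the token budget rather than the model, and max_tokens should be raised for the whole run.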

Resource Monitoring Implementation

Wrap pynvml + psutil in a contextmanager and you can instrument inference code with minimal changes.

from contextlib import contextmanager
import threading, time, psutil, pynvml

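try:
    pynvml.nvmlInit()
    GPU_COUNT = pynvml.nvmlDeviceGetCount()
except pynvml.NVMLError:
    GPU_COUNT = 0  # no NVIDIA driver/GPU available: sample CPU and RAM only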

@contextmanager
def monitor_resources(sample_interval_ms=200, enabled=True):
    if not enabled:
        yield lambda: []  # monitoring disabled: callers still get an (empty) list
        return
    samples = []
    stop = threading.Event()

    def loop():
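        psutil.cpu_percent(interval=None)  # prime the counter; the first call returns a meaningless 0.0
        stop.wait(sample_interval_ms / 1000.0)  # let one interval elapse so the first real sample is valid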
        while not stop.is_set():
            sample = {
                "cpu": psutil.cpu_percent(interval=None),
                "ram_gb": psutil.virtual_memory().used / 1024**3,
                "gpu": [],
            }
            for i in range(GPU_COUNT):
                h = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(h)
                mem = pynvml.nvmlDeviceGetMemoryInfo(h)
                power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
                sample["gpu"].append({
                    "util": util.gpu,
                    "mem_gb": mem.used / 1024**3,
                    "power_w": power,
                })
            samples.append(sample)
            stop.wait(sample_interval_ms / 1000.0)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    try:
        yield lambda: samples
    finally:
        stop.set()
        t.join(timeout=2.0)

# Usage
with monitor_resources(enabled=is_local_model) as get_samples:
    response = client.complete(prompt)
stats = get_samples()  # retrieve the sample list
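
The raw sample list is easier to compare across models once it is reduced to a few numbers. Here is a minimal sketch of that reduction, assuming the sample layout produced by monitor_resources above; the summarize name, the output keys, and the energy approximation (sum of power readings times the sampling interval) are my own choices.

```python
def summarize(samples, sample_interval_ms=200):
    """Reduce monitor samples to mean/peak figures per resource.

    Returns an empty dict when there are no samples (monitoring disabled
    or the call finished before the first sample was taken).
    """
    if not samples:
        return {}
    n = len(samples)
    summary = {
        "n_samples": n,
        "cpu_mean": sum(s["cpu"] for s in samples) / n,
        "cpu_peak": max(s["cpu"] for s in samples),
        "ram_peak_gb": max(s["ram_gb"] for s in samples),
    }
    gpu_count = len(samples[0]["gpu"])
    for i in range(gpu_count):
        readings = [s["gpu"][i] for s in samples]
        summary[f"gpu{i}_util_mean"] = sum(r["util"] for r in readings) / n
        summary[f"gpu{i}_mem_peak_gb"] = max(r["mem_gb"] for r in readings)
        # Rough energy estimate in joules: each power reading (watts) is
        # assumed to hold for one sampling interval.
        summary[f"gpu{i}_energy_j"] = (
            sum(r["power_w"] for r in readings) * sample_interval_ms / 1000.0
        )
    return summary
```

Peak memory is usually the figure that decides whether a model fits on a given GPU, while mean utilization and energy are more useful for comparing throughput efficiency between models.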

Measurement Tips

The Trap I Fell Into — LiteLLM Security Incidents

While developing, I ran a security check on the dependencies and found problems.

Past Incidents

LiteLLM experienced multiple serious security issues in 2026:

March 24, 2026: Supply Chain Attack
April 2026: Multiple CVEs

Running the Audit

pip install pip-audit
pip-audit

Actual output I got:

Found 2 known vulnerabilities in 2 packages
Name          Version ID             Fix Versions
------------- ------- -------------- ------------
litellm       1.83.7  CVE-2026-40217 1.83.10
python-dotenv 1.0.1   CVE-2026-28684 1.2.2

Version 1.83.7 had no known vulnerabilities when I installed it the week before, but a new CVE (published April 10) was disclosed afterward. This is exactly the gap pip-audit covers: version pinning alone cannot tell you that a pinned version has since become vulnerable.

Impact Assessment

CVE                             Real-world impact in my library use case
------------------------------  ----------------------------------------
CVE-2026-40217 (litellm)        Low (the vulnerable endpoint only exists when running in Proxy mode)
CVE-2026-28684 (python-dotenv)  Near zero (I don't call set_key/unset_key, and exploitation requires a local attacker)

That said, upgrade anyway. Future Proxy mode usage, transitive dependency confusion, and CI environment abuse are all realistic risks.

Remediation

# Upgrade
pip install --upgrade "litellm>=1.83.10" "python-dotenv>=1.2.2"

# Verify
pip show litellm | grep Version
pip-audit
# → "No known vulnerabilities found" = success

Best Practices for Secure Operations

Lessons from this incident:

Version Pinning and Hash Verification

# ✗ Dangerous: could pull a compromised version
pip install litellm

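# ✓ Recommended: explicit minimum version (excludes the known-vulnerable releases,
#   but still accepts whatever newer release is published next)
pip install "litellm>=1.83.10"

# ✓ Even safer: exact pin + hash verification
#   (--require-hashes only works with a requirements file in which every entry
#    is pinned with == and carries a --hash=sha256:... value, for example one
#    generated by pip-compile --generate-hashes from pip-tools)
pip install --require-hashes -r requirements.txt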

Regular Auditing

# Monthly routine
pip-audit                            # detect known CVEs
pip list --outdated | grep litellm   # check for new versions

# For CI/CD integration
pip-audit --strict --requirement requirements.txt
# ↑ non-zero exit code on vulnerability → CI fails
#   (--strict additionally fails the run if any dependency cannot be audited)
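
For the monthly routine it can help to turn the audit into data instead of reading the table by eye. The sketch below parses the output of pip-audit --format json. The JSON layout it expects (a "dependencies" list whose entries have "name", "version", and "vulns" with "id" and "fix_versions") is my understanding of recent pip-audit versions and should be treated as an assumption; list_vulns is a hypothetical helper name.

```python
import json

def list_vulns(report_json):
    """Flatten pip-audit JSON output into (name, version, vuln_id, fix_versions) rows.

    Assumed input shape: {"dependencies": [{"name", "version", "vulns": [...]}]}.
    A bare top-level list of dependencies is accepted as well.
    """
    report = json.loads(report_json)
    deps = report["dependencies"] if isinstance(report, dict) else report
    rows = []
    for dep in deps:
        for vuln in dep.get("vulns", []):
            rows.append((
                dep["name"],
                dep.get("version"),
                vuln["id"],
                list(vuln.get("fix_versions", [])),
            ))
    return rows

def needs_upgrade(rows):
    """Map package name -> sorted fix versions reported across all its findings."""
    fixes = {}
    for name, _version, _vuln_id, fix_versions in rows:
        fixes.setdefault(name, set()).update(fix_versions)
    return {name: sorted(versions) for name, versions in fixes.items()}
```

A typical use would be to run pip-audit with JSON output in a scheduled job, feed the text to list_vulns, and open a ticket for every package returned by needs_upgrade.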

API Key Management

Isolated Execution Environments

# venv (minimum)
python3 -m venv .venv && source .venv/bin/activate

# Docker (recommended)
docker run --rm -it \
  -v $(pwd):/work -w /work \
  -e ANTHROPIC_API_KEY \
  python:3.12-slim \
  bash -c "pip install -r requirements.txt && python run_benchmark.py"

Especially important for benchmarks: you're executing model-generated code. subprocess isolation alone is insufficient; running inside a Docker container is the responsible choice.
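
As an illustration of what container isolation for generated code can look like, here is a sketch that only builds the docker run argument list (it does not execute anything). The flags used are standard docker run options; the specific limits, the candidate.py file name, and the reliance on the coreutils timeout command being present in the image are assumptions for this example.

```python
def sandbox_command(workdir, script="candidate.py",
                    image="python:3.12-slim", timeout_s=30):
    """Build a docker argv that runs one generated script with tight limits.

    No API keys are passed in and the network is disabled, so the
    generated code cannot exfiltrate credentials or call out.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",         # no network access
        "--memory", "512m",          # cap RAM (placeholder value)
        "--cpus", "1",               # cap CPU (placeholder value)
        "--pids-limit", "128",       # guard against fork bombs
        "--read-only",               # root filesystem is read-only
        "--tmpfs", "/tmp",           # scratch space that vanishes on exit
        "-v", f"{workdir}:/work:ro", # candidate code mounted read-only
        "-w", "/work",
        image,
        "timeout", str(timeout_s),   # kill runaway programs inside the container
        "python", script,
    ]
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True). Note that this is a separate container from the one that holds the API keys and talks to the model providers, which keeps credentials out of reach of the code being tested.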

Summary

Technical Takeaways

  1. The best approach for LLM benchmarking is "existing OSS + thin custom layer". Building everything from scratch costs months of effort and produces results incomparable to published benchmarks
  2. LiteLLM is invaluable for unifying cloud and local. A single completion() call works across providers
  3. HumanEval alone is insufficient for coding evaluation. Use LiveCodeBench for contamination resistance, BigCodeBench for practicality, EvalPlus for rigor
  4. Fair comparison requires fixing temperature / max_tokens / seed
  5. Resource monitoring is clean as a contextmanager. pynvml for GPU, psutil for CPU/RAM, sample at 100–500ms

Security Takeaways

  1. LLM-adjacent libraries are attractive attack targets (LiteLLM, with 95M monthly downloads, was hit multiple times in 2026)
  2. Make pip-audit a habit — it catches new CVEs discovered after you install
  3. Pin versions + audit regularly + isolate execution is the basic recipe
  4. Always assess real impact. A CVE may not affect your specific use case, but upgrade anyway as a rule

Next Steps

To evolve this benchmarking foundation:

Benchmarks become meaningful only when you "run them correctly", "rigorously match comparison conditions", and "ensure execution environment security". Boring as it sounds, neglecting any of these undermines the credibility of the measurements themselves.