Building an LLM Benchmarking Setup — Unified Evaluation of Cloud and Local Models, and the Security Pitfalls Along the Way


Intro.

I wanted to compare cloud LLMs (Claude, OpenAI, Gemini, etc.) and local LLMs (mid-sized models via Ollama) on the same playing field. I also wanted to measure CPU, GPU, and memory utilization for the local models. This post documents what I learned along the way.

The short version: don't build from scratch — a thin wrapper around existing OSS is the optimal solution. But during the process, I ran into LiteLLM's security incidents and was reminded of the importance of auditing dependencies.

Target audience: developers with a basic understanding of LLMs and evaluation frameworks.

Full source code: github.com/devdama/llm-benchmark

Choosing a Benchmark Framework

The State of Existing OSS

Accuracy evaluation tools are mature, leaving little room to build from scratch.

Tool	Strengths
lm-evaluation-harness (EleutherAI)	The de facto standard. Dozens of tasks including ARC, HellaSwag, MMLU. Powers the Hugging Face Open LLM Leaderboard
HELM (Stanford CRFM)	42 scenarios × 59 metrics. Evaluates accuracy + bias + toxicity + efficiency holistically
bigcode-evaluation-harness	Coding-focused. HumanEval, MBPP, MultiPL-E, BigCodeBench
EvalPlus	HumanEval+ (80× more test cases than HumanEval) / MBPP+ (35× more than MBPP)
OpenCompass	Strong on Chinese-origin models (Qwen, DeepSeek)

Local LLM resource measurement has its own tools like ollama-benchmark and llm-perf-leaderboard, but it's standard practice to treat resource measurement as a separate layer from accuracy evaluation.

The State of Coding Evaluation (2026)

HumanEval is saturated (frontier models hit 93%+), and modern coding evaluation has moved on:

LiveCodeBench: Continuously adds new problems to avoid training data contamination
SWE-Bench Verified: Tasks the model with solving real GitHub issues
BigCodeBench: Includes practical library calls

For evaluating mid-sized local LLMs (8B–30B), a realistic combination is HumanEval+, LiveCodeBench, and BigCodeBench. SWE-Bench tends to produce very low scores for mid-sized models and takes a long time to run, so treat it as optional.

Architecture Design

I went with an "existing OSS + ~200 lines of orchestration" approach.

┌─────────────────────────────────────────────────────┐
│  Custom orchestration layer (Python)                 │
│  - Centralized model configuration                   │
│  - Resource monitor start/stop                       │
│  - Result aggregation to CSV                         │
└─────────────────────────────────────────────────────┘
            ↓                ↓                ↓
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │ EvalPlus /   │  │ pynvml +     │  │ LiteLLM      │
   │ bigcode-eval │  │ psutil       │  │ (unified API)│
   └──────────────┘  └──────────────┘  └──────────────┘
            ↓
   Claude / OpenAI / Gemini / Ollama / vLLM

Build vs. Reuse Decisions

Component	Choice	Reason
Benchmark datasets	Reuse	Custom datasets become incomparable with published results
Unified model API	Reuse (LiteLLM)	Reinventing per-provider abstraction is wasteful
Resource monitoring	Build	Use-case-specific customization pays off. ~50 lines
Result aggregation/visualization	Build	Pandas + matplotlib is enough
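
As a rough illustration of the "centralized model configuration" piece, here is a minimal sketch. The names `GenerationParams`, `ModelSpec`, `supports_seed`, and `completion_kwargs` are hypothetical and not taken from the repository; the idea is only that every model receives its sampling settings from one shared object, and that per-provider quirks live in the model entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationParams:
    """Sampling settings shared by every model in a run (hypothetical helper)."""
    temperature: float = 0.0
    max_tokens: int = 2048
    top_p: Optional[float] = None   # left unset so it is never combined with temperature
    seed: Optional[int] = 0


@dataclass(frozen=True)
class ModelSpec:
    """One row of the model list (hypothetical helper)."""
    name: str                       # LiteLLM-style "provider/model" string
    is_local: bool = False          # local models also get resource monitoring
    api_base: Optional[str] = None
    supports_seed: bool = True      # assumption: maintained by hand per provider


MODELS = [
    ModelSpec("anthropic/claude-sonnet-4-6", supports_seed=False),
    ModelSpec("ollama/qwen2.5-coder:7b", is_local=True,
              api_base="http://localhost:11434"),
]


def completion_kwargs(spec, params, messages):
    """Build keyword arguments for a completion call from the shared settings."""
    kwargs = {
        "model": spec.name,
        "messages": messages,
        "temperature": params.temperature,
        "max_tokens": params.max_tokens,
    }
    if params.top_p is not None:
        kwargs["top_p"] = params.top_p
    if spec.api_base is not None:
        kwargs["api_base"] = spec.api_base
    if spec.supports_seed and params.seed is not None:
        kwargs["seed"] = params.seed
    return kwargs
```

The resulting dict is meant to be splatted into the unified API call, for example `litellm.completion(**completion_kwargs(spec, params, messages))`.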

LiteLLM — The Key to Unifying Cloud and Local

litellm exposes 100+ providers through a single, OpenAI-format API. You switch providers just by changing the model name.

import litellm

# Claude
response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}]
)

# Ollama (local)
response = litellm.completion(
    model="ollama/qwen2.5-coder:7b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Hello"}]
)

# Same access pattern in both cases
print(response.choices[0].message.content)

Response Structure

litellm.completion() returns a ModelResponse object, structurally identical to the OpenAI SDK's response:

ModelResponse(
    choices=[
        Choices(
            finish_reason="stop",       # "stop" | "length" | "tool_calls" | "content_filter"
            index=0,
            message=Message(
                role="assistant",
                content="Hello! ...",   # Optional[str] - can be None
                tool_calls=None,
            ),
        ),
    ],
    usage=Usage(
        prompt_tokens=12,
        completion_tokens=24,
        total_tokens=36,
    ),
    _hidden_params={
        "response_cost": 0.000234,   # LiteLLM-specific: estimated USD cost
        "custom_llm_provider": "anthropic",
    },
)

Both attribute access (response.choices[0].message.content) and dict access are supported. In streaming mode, message becomes delta and each chunk carries partial content.
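
Because the shape is the same for every provider, one small function can flatten a response into a row for the results table. The sketch below is an assumption about how you might do it, not code from the repository: `extract_result` is a hypothetical name, and the example runs against a stand-in object built with `SimpleNamespace` so it works without any API call. It relies only on the fields shown in the structure above.

```python
from types import SimpleNamespace


def extract_result(response):
    """Flatten a ModelResponse-shaped object into a plain dict (hypothetical helper)."""
    choice = response.choices[0]
    hidden = getattr(response, "_hidden_params", None) or {}
    return {
        # content is Optional[str]; normalize None to an empty string
        "text": choice.message.content or "",
        "finish_reason": choice.finish_reason,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        # May be absent (for example for local models), so use .get()
        "cost_usd": hidden.get("response_cost"),
    }


# Stand-in for a real response, mirroring the structure shown above
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="stop",
        message=SimpleNamespace(role="assistant", content=None, tool_calls=None),
    )],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=24, total_tokens=36),
    _hidden_params={"response_cost": 0.000234},
)

row = extract_result(fake_response)
```

Normalizing `content` up front matters for benchmarks: a `None` that reaches the scoring code tends to surface as a confusing `TypeError` far from the call that caused it.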

Key Parameters

For a fair comparison, keep these parameters identical across every model you compare:

Parameter	Role	Benchmark recommendation
`temperature`	Output randomness	0.0 (pass@1) / 0.6 (pass@k)
`max_tokens`	Maximum output length	Code: 2048 / Reasoning models: 8192+
`seed`	Random seed	Fix if possible (OpenAI/vLLM support; Anthropic doesn't)
`top_p`	Token candidate filtering	1.0 (leave at the default; adjust temperature or top_p, not both)
`stop`	Stop sequences	Task-dependent
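
For reference, pass@k in the table is usually computed with the unbiased estimator popularized alongside HumanEval: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). A minimal sketch follows; the function name and the argument checks are my own choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. With temperature 0.0 and a single sample per problem (n = k = 1), it reduces to the plain fraction of problems solved.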

Important: reasoning models (Claude Extended Thinking, OpenAI o1/o3, DeepSeek-R1) consume completion_tokens for the reasoning trace itself. max_tokens=1024 may not even leave room for the answer. Use 8192 or more.

If you see many finish_reason == "length" results, that's a sign your max_tokens is too small.
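
A quick way to turn that check into a number is to compute the truncation rate per model. This is a sketch under my own assumptions: the function names are hypothetical and the 5% threshold is an arbitrary illustration, not a recommendation from any framework.

```python
def truncation_rate(finish_reasons):
    """Fraction of responses that stopped because they hit max_tokens."""
    reasons = list(finish_reasons)
    if not reasons:
        return 0.0
    return sum(1 for r in reasons if r == "length") / len(reasons)


def is_max_tokens_too_small(finish_reasons, threshold=0.05):
    """True when truncated responses exceed the (arbitrary) threshold."""
    return truncation_rate(finish_reasons) > threshold
```

Reporting this rate next to the accuracy score helps separate "the model got it wrong" from "the model ran out of room".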

Resource Monitoring Implementation

Wrap pynvml + psutil in a contextmanager and you can instrument inference code with minimal changes.

from contextlib import contextmanager
import threading, time, psutil, pynvml

pynvml.nvmlInit()
GPU_COUNT = pynvml.nvmlDeviceGetCount()

@contextmanager
def monitor_resources(sample_interval_ms=200, enabled=True):
    if not enabled:
        # Return an empty list (not None) so callers can treat both paths alike
        yield lambda: []
        return
    samples = []
    stop = threading.Event()

    def loop():
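        psutil.cpu_percent(interval=None)  # prime the counter; the first call returns a meaningless 0.0
        stop.wait(sample_interval_ms / 1000.0)  # let one interval elapse so the first sample is meaningful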
        while not stop.is_set():
            sample = {
                "cpu": psutil.cpu_percent(interval=None),
                "ram_gb": psutil.virtual_memory().used / 1024**3,
                "gpu": [],
            }
            for i in range(GPU_COUNT):
                h = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(h)
                mem = pynvml.nvmlDeviceGetMemoryInfo(h)
                power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
                sample["gpu"].append({
                    "util": util.gpu,
                    "mem_gb": mem.used / 1024**3,
                    "power_w": power,
                })
            samples.append(sample)
            stop.wait(sample_interval_ms / 1000.0)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    try:
        yield lambda: samples
    finally:
        stop.set()
        t.join(timeout=2.0)

# Usage
with monitor_resources(enabled=is_local_model) as get_samples:
    response = client.complete(prompt)
stats = get_samples()  # retrieve the sample list
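
The raw sample list still has to be reduced to one row per run before it can go into the CSV. Here is a minimal sketch of that step, assuming the sample layout produced by the monitor above. `summarize` and its output keys are hypothetical names, and the energy figure is only an approximation (power × nominal interval; the real interval is slightly longer because of sampling overhead).

```python
from statistics import mean


def summarize(samples, sample_interval_ms=200):
    """Reduce monitor samples to a flat dict suitable for one CSV row."""
    if not samples:          # covers both None and an empty list
        return {}
    summary = {
        "n_samples": len(samples),
        "cpu_mean": mean(s["cpu"] for s in samples),
        "cpu_peak": max(s["cpu"] for s in samples),
        "ram_peak_gb": max(s["ram_gb"] for s in samples),
    }
    interval_s = sample_interval_ms / 1000.0
    for i in range(len(samples[0]["gpu"])):
        per_gpu = [s["gpu"][i] for s in samples]
        summary[f"gpu{i}_util_mean"] = mean(g["util"] for g in per_gpu)
        summary[f"gpu{i}_mem_peak_gb"] = max(g["mem_gb"] for g in per_gpu)
        # Approximate energy in joules: sum of (watts x seconds per sample)
        summary[f"gpu{i}_energy_j"] = sum(g["power_w"] for g in per_gpu) * interval_s
    return summary
```

Peak values matter more than means for memory (they decide whether a model fits at all), while means are the more useful figure for utilization.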

Measurement Tips

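Sampling interval: 1 second misses short bursts. 100–500ms is better. Note that NVML's GPU utilization figure is itself averaged over a driver-defined sample period (between about 1/6 s and 1 s depending on the GPU), so polling faster than that does not make the utilization reading any finer-grained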
Apple Silicon: pynvml doesn't work. You need to call powermetrics separately
MoE models: routing randomness adds variance. Multiple trials are needed for stable estimates

The Trap I Fell Into — LiteLLM Security Incidents

While developing, I ran a security check on the dependencies and found problems.

Past Incidents

LiteLLM experienced multiple serious security issues in 2026:

March 24, 2026: Supply Chain Attack

Affected versions: 1.82.7, 1.82.8
Window of exposure: ~40 minutes starting 10:39 UTC
Attack details: TeamPCP compromised the Trivy CI/CD pipeline. The three-stage payload harvested SSH keys, cloud credentials, and .env files, then moved laterally through Kubernetes and installed a systemd persistence backdoor
Blast radius: LiteLLM has 95M+ monthly downloads

April 2026: Multiple CVEs

CVE-2026-42208 (CVSS 9.3): Pre-auth SQLi during proxy API key verification → fixed in 1.83.7
CVE-2026-35030 (Critical): JWT auth bypass (only when enable_jwt_auth is on) → fixed in 1.83.0
CVE-2026-40217 (CVSS 8.8): Sandbox escape RCE via /guardrails/test_custom_code → fixed in 1.83.10
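
The version numbers above can be encoded as a quick self-check. This is only a sketch: `assess` and the two tables are hypothetical names, the data is copied from the lists above, and it assumes that every release older than a "fixed in" version is affected, which the advisories may not actually state. It is not a substitute for pip-audit.

```python
import re

# Data copied from the incident lists above
COMPROMISED_RELEASES = {(1, 82, 7), (1, 82, 8)}
FIXED_IN = {
    "CVE-2026-42208": (1, 83, 7),
    "CVE-2026-35030": (1, 83, 0),
    "CVE-2026-40217": (1, 83, 10),
}


def parse_version(version):
    """Parse the leading X.Y.Z of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.groups())


def assess(version):
    """Return the findings from the lists above that apply to this version."""
    parsed = parse_version(version)
    findings = []
    if parsed in COMPROMISED_RELEASES:
        findings.append("supply-chain: compromised release")
    # Assumption: anything older than the fix version is affected
    findings.extend(cve for cve, fixed in sorted(FIXED_IN.items()) if parsed < fixed)
    return findings
```

To check the installed package, pass in `importlib.metadata.version("litellm")`. Comparing integer tuples rather than strings is the important detail: as strings, "1.83.7" sorts after "1.83.10".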

Running the Audit

pip install pip-audit
pip-audit

Actual output I got:

Found 2 known vulnerabilities in 2 packages
Name          Version ID             Fix Versions
------------- ------- -------------- ------------
litellm       1.83.7  CVE-2026-40217 1.83.10
python-dotenv 1.0.1   CVE-2026-28684 1.2.2

Version 1.83.7 had no known vulnerabilities when I installed it last week; CVE-2026-40217 was published afterward (April 10). This is exactly why pip-audit exists: it caught what version pinning alone couldn't.

Impact Assessment

CVE	Real-world impact in my library use case
CVE-2026-40217 (litellm)	Low (the vulnerable endpoint only exists when running Proxy mode)
CVE-2026-28684 (python-dotenv)	Near zero (I don't call `set_key`/`unset_key` + requires local attacker)

That said, upgrade anyway. Future Proxy mode usage, transitive dependency confusion, and CI environment abuse are all realistic risks.

Remediation

# Upgrade
pip install --upgrade "litellm>=1.83.10" "python-dotenv>=1.2.2"

# Verify
pip show litellm | grep Version
pip-audit
# → "No known vulnerabilities found" = success

Best Practices for Secure Operations

Lessons from this incident:

Version Pinning and Hash Verification

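# ✗ Dangerous: resolves to the newest release, including a compromised one
pip install litellm

# △ Better, but not a pin: a lower bound excludes known-vulnerable versions
#   yet still floats to whatever release is newest
pip install "litellm>=1.83.10"

# ✓ Recommended: exact pin
pip install "litellm==1.83.10"

# ✓ Even safer: exact pins + hash verification
#   --require-hashes only works with a requirements file in which every package
#   (transitive dependencies included) has a --hash entry, e.g. one generated
#   with pip-tools: pip-compile --generate-hashes
pip install --require-hashes -r requirements.txt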

Regular Auditing

# Monthly routine
pip-audit                            # detect known CVEs
pip list --outdated | grep litellm   # check for new versions

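# For CI/CD integration
pip-audit --strict --requirement requirements.txt
# ↑ pip-audit exits non-zero whenever it finds a vulnerability, so CI fails;
#   --strict additionally fails the run if any dependency could not be audited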

API Key Management

Don't keep .env files in the same directory as your repo
Don't write keys directly into global env (~/.bashrc)

Inject only at runtime:

ANTHROPIC_API_KEY=$(pass anthropic/api-key) python run_benchmark.py

Set spending limits on API keys (configurable in the provider's console)
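
On the Python side, the runtime-injection approach pairs well with failing fast when a key is missing, rather than silently falling back to a .env file. A minimal sketch follows; `require_api_key` and `redact` are hypothetical helpers, not part of any library.

```python
import os


def require_api_key(name, environ=None):
    """Return the key from the process environment or fail with a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; inject it when launching the benchmark "
            "instead of storing it in the repository"
        )
    return value


def redact(secret, keep=4):
    """Mask a secret for logs, keeping only the first few characters."""
    if len(secret) <= keep:
        return "*" * len(secret)
    return secret[:keep] + "*" * (len(secret) - keep)
```

Calling `require_api_key("ANTHROPIC_API_KEY")` once at startup surfaces a missing key before any benchmark time is spent, and `redact` keeps the full value out of run logs that may end up in CI artifacts.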

Isolated Execution Environments

# venv (minimum)
python3 -m venv .venv && source .venv/bin/activate

# Docker (recommended)
docker run --rm -it \
  -v "$(pwd)":/work -w /work \
  -e ANTHROPIC_API_KEY \
  python:3.12-slim \
  bash -c "pip install -r requirements.txt && python run_benchmark.py"

Especially important for benchmarks: you're executing model-generated code. subprocess isolation alone is insufficient; running inside a Docker container is the responsible choice.
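
For the step that actually executes model-generated solutions, a tighter container than the one above is worth considering, since that step needs neither the network nor the API key. The sketch below only builds the `docker run` argument list. `sandbox_command` and its default limits are assumptions of mine, chosen for illustration; the flags themselves are standard Docker options.

```python
def sandbox_command(workdir, script="solution.py", image="python:3.12-slim",
                    memory="512m", cpus="1", pids_limit=128):
    """Build a docker run command for executing one generated script."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                     # generated code gets no network
        "--read-only",                           # root filesystem is read-only
        "--tmpfs", "/tmp",                       # scratch space that vanishes on exit
        "--cap-drop", "ALL",                     # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--memory", memory,
        "--cpus", cpus,
        "--pids-limit", str(pids_limit),         # blunt protection against fork bombs
        "-v", f"{workdir}:/work:ro",             # solution and tests mounted read-only
        "-w", "/work",
        image,
        "python", script,
    ]
```

Pass the list to `subprocess.run(cmd, capture_output=True, text=True, timeout=30)` or similar. Note that no `-e ANTHROPIC_API_KEY` appears here: the key is only needed by the process that calls the model, not by the one that runs its output.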

Summary

Technical Takeaways

The best approach for LLM benchmarking is "existing OSS + thin custom layer". Building everything from scratch costs months of effort and produces results incomparable to published benchmarks
LiteLLM is invaluable for unifying cloud and local. A single completion() call works across providers
HumanEval alone is insufficient for coding evaluation. Use LiveCodeBench for contamination resistance, BigCodeBench for practicality, EvalPlus for rigor
Fair comparison requires fixing temperature / max_tokens / seed
Resource monitoring is clean as a contextmanager. pynvml for GPU, psutil for CPU/RAM, sample at 100–500ms

Security Takeaways

LLM-adjacent libraries are attractive attack targets (LiteLLM, with 95M monthly downloads, was hit multiple times in 2026)
Make pip-audit a habit — it catches new CVEs discovered after you install
Pin versions + audit regularly + isolate execution is the basic recipe
Always assess real impact. A CVE may not affect your specific use case, but upgrade anyway as a rule

Next Steps

To evolve this benchmarking foundation:

Replace the homemade tasks with the evalplus package (HumanEval+ has 80× more test cases than HumanEval)
Add the latest LiveCodeBench cutoff problems (contamination resistance)
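Switch to vLLM for higher throughput (reportedly around 6× Ollama, though the gap depends heavily on concurrency and hardware, so measure it on your own setup)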
Add a cost column using response._hidden_params["response_cost"]
Integrate pip-audit --strict into CI

Benchmarks become meaningful only when you "run them correctly", "rigorously match comparison conditions", and "ensure execution environment security". Boring as it sounds, neglecting any of these undermines the credibility of the measurements themselves.