I wanted to compare cloud LLMs (Claude, OpenAI, Gemini, etc.) and local LLMs (mid-sized models via Ollama) on the same playing field. I also wanted to measure CPU, GPU, and memory utilization for the local models. This post documents what I learned along the way.
The short version: don't build from scratch. A thin wrapper around existing OSS turned out to be the most practical approach. Along the way, I also ran into LiteLLM's security incidents, which reminded me how much dependency auditing matters.
Target audience: developers with a basic understanding of LLMs and evaluation frameworks.
Full source code: github.com/devdama/llm-benchmark
Accuracy evaluation tools are mature, leaving little room to build from scratch.
| Tool | Strengths |
|---|---|
| lm-evaluation-harness (EleutherAI) | The de facto standard. Dozens of tasks including ARC, HellaSwag, MMLU. Powers the Hugging Face Open LLM Leaderboard |
| HELM (Stanford CRFM) | 42 scenarios × 59 metrics. Evaluates accuracy + bias + toxicity + efficiency holistically |
| bigcode-evaluation-harness | Coding-focused. HumanEval, MBPP, MultiPL-E, BigCodeBench |
| EvalPlus | HumanEval+ / MBPP+ (80× more test cases) |
| OpenCompass | Strong on Chinese-origin models (Qwen, DeepSeek) |
Local LLM resource measurement has its own tools like ollama-benchmark and llm-perf-leaderboard, but it's standard practice to treat resource measurement as a separate layer from accuracy evaluation.
HumanEval is saturated (frontier models hit 93%+), and modern coding evaluation has moved on.
For evaluating mid-sized local LLMs (8B–30B), a realistic combination is HumanEval+ + LiveCodeBench + BigCodeBench. SWE-Bench tends to produce very low scores for mid-sized models and takes a long time to run, so treat it as optional.
I went with an "existing OSS + ~200 lines of orchestration" approach.
┌─────────────────────────────────────────────────────┐
│ Custom orchestration layer (Python)                 │
│ - Centralized model configuration                   │
│ - Resource monitor start/stop                       │
│ - Result aggregation to CSV                         │
└─────────────────────────────────────────────────────┘
       ↓                  ↓                  ↓
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ EvalPlus /   │   │ pynvml +     │   │ LiteLLM      │
│ bigcode-eval │   │ psutil       │   │ (unified API)│
└──────────────┘   └──────────────┘   └──────────────┘
                                             ↓
               Claude / OpenAI / Gemini / Ollama / vLLM
| Component | Choice | Reason |
|---|---|---|
| Benchmark datasets | Reuse | Custom datasets become incomparable with published results |
| Unified model API | Reuse (LiteLLM) | Reinventing per-provider abstraction is wasteful |
| Resource monitoring | Build | Use-case-specific customization pays off. ~50 lines |
| Result aggregation/visualization | Build | Pandas + matplotlib is enough |
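The two "Build" rows can stay small. As a minimal sketch, assuming hypothetical names (`ModelSpec`, `MODELS`, `write_results`) and illustrative column names that are not taken from the repository, centralized model configuration and CSV aggregation might look like this:

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    """One entry in the centralized model configuration (hypothetical shape)."""
    name: str                       # LiteLLM-style model string
    is_local: bool                  # only local models get resource monitoring
    api_base: Optional[str] = None  # set for self-hosted endpoints


# Illustrative entries, reusing the model strings that appear in this post.
MODELS = [
    ModelSpec("anthropic/claude-sonnet-4-6", is_local=False),
    ModelSpec("ollama/qwen2.5-coder:7b", is_local=True,
              api_base="http://localhost:11434"),
]


def write_results(rows, fileobj):
    """Write one CSV row per (model, benchmark) result dict."""
    fieldnames = ["model", "benchmark", "pass_at_1", "peak_gpu_mem_gb"]
    # restval="" leaves a column empty when a row has no value for it,
    # e.g. GPU memory for a cloud model.
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


buf = io.StringIO()
write_results(
    # 0.5 is a placeholder, not a measured result.
    [{"model": MODELS[1].name, "benchmark": "humaneval+", "pass_at_1": 0.5}],
    buf,
)
csv_text = buf.getvalue()
print(csv_text)
```

Keeping the model list in one place means the evaluation runner and the resource monitor read the same `is_local` flag instead of each guessing from the model name.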
litellm exposes 100+ providers through a single, OpenAI-format API. You switch providers just by changing the model name.
import litellm

# Claude
response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
)

# Ollama (local)
response = litellm.completion(
    model="ollama/qwen2.5-coder:7b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Hello"}],
)

# Same access pattern in both cases
print(response.choices[0].message.content)
litellm.completion() returns a ModelResponse object, structurally identical to the OpenAI SDK's response:
ModelResponse(
    choices=[
        Choices(
            finish_reason="stop",  # "stop" | "length" | "tool_calls" | "content_filter"
            index=0,
            message=Message(
                role="assistant",
                content="Hello! ...",  # Optional[str] - can be None
                tool_calls=None,
            ),
        ),
    ],
    usage=Usage(
        prompt_tokens=12,
        completion_tokens=24,
        total_tokens=36,
    ),
    _hidden_params={
        "response_cost": 0.000234,  # LiteLLM-specific: estimated USD cost
        "custom_llm_provider": "anthropic",
    },
)
Both attribute access (response.choices[0].message.content) and dict access are supported. In streaming mode, message becomes delta and each chunk carries partial content.
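Because `content` can be `None` and the cost lives in a LiteLLM-specific field, it can help to flatten each response into a plain dict before aggregating. The helper below is a sketch: `summarize_response` is a hypothetical name, and it relies only on the attribute layout shown above. A `SimpleNamespace` stand-in is used so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Flatten the fields a benchmark run usually records."""
    choice = response.choices[0]
    # _hidden_params is LiteLLM-specific, so treat it as optional.
    hidden = getattr(response, "_hidden_params", None) or {}
    return {
        "text": choice.message.content or "",      # content can be None
        "finish_reason": choice.finish_reason,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cost_usd": hidden.get("response_cost"),   # None when not reported
    }


# Stand-in object mirroring the structure shown above (not a real API call).
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="length",
        message=SimpleNamespace(role="assistant", content=None),
    )],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=24),
    _hidden_params={"response_cost": 0.000234},
)

row = summarize_response(fake_response)
print(row)
```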
To ensure fair comparison across models, these parameters must be identical across all models being compared:
| Parameter | Role | Benchmark recommendation |
|---|---|---|
| temperature | Output randomness | 0.0 (pass@1) / 0.6 (pass@k) |
| max_tokens | Maximum output length | Code: 2048 / Reasoning models: 8192+ |
| seed | Random seed | Fix if possible (OpenAI/vLLM support; Anthropic doesn't) |
| top_p | Token candidate filtering | 1.0 (don't combine with temperature) |
| stop | Stop sequences | Task-dependent |
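The table refers to pass@1 and pass@k without spelling out how they are computed. A common choice is the unbiased estimator introduced with HumanEval: generate n samples per problem, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). Whether this post's harness uses exactly this estimator is an assumption; the sketch below only illustrates the formula for a single problem.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With greedy decoding (temperature 0.0) and n = 1, pass@1 is simply 0 or 1.
print(pass_at_k(n=1, c=1, k=1))
# With sampling, e.g. 10 samples of which 3 pass:
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark score is the mean of this value over all problems.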
Important: reasoning models (Claude Extended Thinking, OpenAI o1/o3, DeepSeek-R1) consume completion_tokens for the reasoning trace itself. max_tokens=1024 may not even leave room for the answer. Use 8192 or more.
If you see many finish_reason == "length" results, that's a sign your max_tokens is too small.
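One way to turn that check into a number is to compute the share of responses that stopped on the length limit. This is a minimal sketch: the function names and the 5% warning threshold are assumptions for illustration, not values from the benchmark code.

```python
from collections import Counter

# Assumed alert threshold; tune it for your own runs.
TRUNCATION_WARN_RATE = 0.05


def truncation_rate(finish_reasons):
    """Fraction of responses whose finish_reason is "length"."""
    counts = Counter(finish_reasons)
    total = sum(counts.values())
    return counts["length"] / total if total else 0.0


def needs_larger_max_tokens(finish_reasons, warn_rate=TRUNCATION_WARN_RATE):
    """True when truncation is frequent enough to distort the scores."""
    return truncation_rate(finish_reasons) > warn_rate


reasons = ["stop", "length", "stop", "length"]
print(f"{truncation_rate(reasons):.0%} truncated")
```

Tracking this per model matters because a truncated answer is scored as a failure even when the model was on the right track.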
Wrap pynvml + psutil in a contextmanager and you can instrument inference code with minimal changes.
from contextlib import contextmanager
import threading, time, psutil, pynvml

pynvml.nvmlInit()
GPU_COUNT = pynvml.nvmlDeviceGetCount()

@contextmanager
def monitor_resources(sample_interval_ms=200, enabled=True):
    if not enabled:
        yield lambda: None
        return
    samples = []
    stop = threading.Event()

    def loop():
        psutil.cpu_percent(interval=None)  # discard the initial 0
        while not stop.is_set():
            sample = {
                "cpu": psutil.cpu_percent(interval=None),
                "ram_gb": psutil.virtual_memory().used / 1024**3,
                "gpu": [],
            }
            for i in range(GPU_COUNT):
                h = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(h)
                mem = pynvml.nvmlDeviceGetMemoryInfo(h)
                power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
                sample["gpu"].append({
                    "util": util.gpu,
                    "mem_gb": mem.used / 1024**3,
                    "power_w": power,
                })
            samples.append(sample)
            stop.wait(sample_interval_ms / 1000.0)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    try:
        yield lambda: samples
    finally:
        stop.set()
        t.join(timeout=2.0)

# Usage
with monitor_resources(enabled=is_local_model) as get_samples:
    response = client.complete(prompt)
stats = get_samples()  # retrieve the sample list
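The raw sample list still has to be reduced to a few numbers per run before it goes into the CSV. The sketch below assumes the sample layout produced by `monitor_resources` above; `summarize_samples` and its output keys are hypothetical names, and the energy figure is a rough estimate (power multiplied by the sampling interval), not a calibrated measurement.

```python
def summarize_samples(samples, sample_interval_ms=200):
    """Reduce monitor samples to per-run summary statistics."""
    if not samples:
        # Covers both an empty list and the None returned when disabled.
        return {}
    cpu = [s["cpu"] for s in samples]
    summary = {
        "n_samples": len(samples),
        "cpu_mean_pct": sum(cpu) / len(cpu),
        "cpu_peak_pct": max(cpu),
        "ram_peak_gb": max(s["ram_gb"] for s in samples),
    }
    gpu_samples = [s["gpu"] for s in samples if s["gpu"]]
    if gpu_samples:
        dt_s = sample_interval_ms / 1000.0
        summary["gpu_util_peak_pct"] = max(
            g["util"] for gpus in gpu_samples for g in gpus
        )
        # Memory and power are summed across all GPUs in each sample.
        summary["gpu_mem_peak_gb"] = max(
            sum(g["mem_gb"] for g in gpus) for gpus in gpu_samples
        )
        summary["gpu_energy_j"] = sum(
            sum(g["power_w"] for g in gpus) * dt_s for gpus in gpu_samples
        )
    return summary


# Hand-written samples for illustration only (not measured values).
example = [
    {"cpu": 10.0, "ram_gb": 8.0,
     "gpu": [{"util": 50, "mem_gb": 4.0, "power_w": 100.0}]},
    {"cpu": 30.0, "ram_gb": 9.0,
     "gpu": [{"util": 90, "mem_gb": 6.0, "power_w": 200.0}]},
]
summary = summarize_samples(example)
print(summary)
```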
On machines without an NVIDIA GPU, such as Apple Silicon Macs, pynvml doesn't work. You need to call powermetrics separately.

While developing, I ran a security check on the dependencies and found problems.
LiteLLM experienced multiple serious security issues in 2026:
- … .env files (credential harvesting → Kubernetes lateral movement → systemd persistence backdoor)
- … enable_jwt_auth is on) → fixed in 1.83.0
- … /guardrails/test_custom_code → fixed in 1.83.10

pip install pip-audit
pip-audit
Actual output I got:
Found 2 known vulnerabilities in 2 packages
Name Version ID Fix Versions
------------- ------- -------------- ------------
litellm 1.83.7 CVE-2026-40217 1.83.10
python-dotenv 1.0.1 CVE-2026-28684 1.2.2
Version 1.83.7 was safe at the time I installed it (last week), but a new CVE (published April 10) was disclosed afterward. This is exactly why pip-audit exists — it caught what version pinning alone couldn't.
| CVE | Real-world impact in my library use case |
|---|---|
| CVE-2026-40217 (litellm) | Low (the vulnerable endpoint only exists when running Proxy mode) |
| CVE-2026-28684 (python-dotenv) | Near zero (I don't call set_key/unset_key + requires local attacker) |
That said, upgrade anyway. Future Proxy mode usage, transitive dependency confusion, and CI environment abuse are all realistic risks.
# Upgrade
pip install --upgrade "litellm>=1.83.10" "python-dotenv>=1.2.2"
# Verify
pip show litellm | grep Version
pip-audit
# → "No known vulnerabilities found" = success
Lessons from this incident:
# ✗ Dangerous: could pull a compromised version
pip install litellm
# ✓ Recommended: explicit minimum version
pip install "litellm>=1.83.10"
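# ✓ Even safer: exact pins + hash verification
# (requirements.txt must pin every package with == and include --hash=sha256:... entries,
#  e.g. generated with pip-compile --generate-hashes)
pip install --require-hashes -r requirements.txt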
# Monthly routine
pip-audit # detect known CVEs
pip list --outdated | grep litellm # check for new versions
# For CI/CD integration
pip-audit --strict --requirement requirements.txt
# ↑ non-zero exit code on vulnerability → CI fails
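If CI needs more control than a pass/fail exit code, for example to ignore a finding that was reviewed and judged not applicable, the report can be post-processed. The sketch below assumes the JSON layout that recent pip-audit versions emit with `--format json` (a top-level `dependencies` list whose entries carry `name`, `version`, and `vulns`); verify that shape against your installed version. `vulnerable_packages` is a hypothetical helper, and the sample data reuses the findings shown earlier.

```python
import json


def vulnerable_packages(report_json, ignore_ids=()):
    """Return (name, version, vuln_id, fix_versions) for each finding."""
    report = json.loads(report_json)
    # Assumption: newer reports are a dict with "dependencies";
    # fall back to a bare list in case an older layout is encountered.
    deps = report["dependencies"] if isinstance(report, dict) else report
    findings = []
    for dep in deps:
        for vuln in dep.get("vulns", []):
            if vuln["id"] in ignore_ids:
                continue
            findings.append(
                (dep["name"], dep["version"], vuln["id"],
                 vuln.get("fix_versions", []))
            )
    return findings


# Sample report built from the two findings listed in this post.
SAMPLE_REPORT = json.dumps({
    "dependencies": [
        {"name": "litellm", "version": "1.83.7",
         "vulns": [{"id": "CVE-2026-40217", "fix_versions": ["1.83.10"]}]},
        {"name": "python-dotenv", "version": "1.0.1",
         "vulns": [{"id": "CVE-2026-28684", "fix_versions": ["1.2.2"]}]},
    ],
    "fixes": [],
})

findings = vulnerable_packages(SAMPLE_REPORT, ignore_ids={"CVE-2026-28684"})
for name, version, vuln_id, fixes in findings:
    print(f"{name} {version}: {vuln_id} (fix: {', '.join(fixes) or 'none'})")
```

An ignore list like this should be short and reviewed regularly, since the impact assessment behind each entry can go stale.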
- … .env files in the same directory as your repo
- … ~/.bashrc)

ANTHROPIC_API_KEY=$(pass anthropic/api-key) python run_benchmark.py
# venv (minimum)
python3 -m venv .venv && source .venv/bin/activate
# Docker (recommended)
docker run --rm -it \
  -v "$(pwd)":/work -w /work \
  -e ANTHROPIC_API_KEY \
  python:3.12-slim \
  bash -c "pip install -r requirements.txt && python run_benchmark.py"
Especially important for benchmarks: you're executing model-generated code. subprocess isolation alone is insufficient; running inside a Docker container is the responsible choice.
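For the step that actually executes generated code, the container can be locked down further than the general-purpose command above. The sketch below only builds the argument list; `sandbox_command`, the script name, and the resource limits are illustrative assumptions, while the flags themselves are standard `docker run` options.

```python
def sandbox_command(workdir, script="solution.py", image="python:3.12-slim"):
    """Build a docker run command for executing untrusted generated code."""
    return [
        "docker", "run", "--rm",
        "--network", "none",         # no network access from generated code
        "--memory", "512m",          # assumed limit; adjust per benchmark
        "--cpus", "1",
        "--pids-limit", "128",       # guards against fork bombs
        "--read-only",               # container filesystem is read-only
        "-v", f"{workdir}:/work:ro", # mount the task directory read-only
        "-w", "/work",
        image,
        "python", script,
    ]


cmd = sandbox_command("/tmp/task-001")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, timeout=...)` so that a hanging solution is killed. Note that no API keys are passed into this container: the generated code has no need for them.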
- … completion() call works across providers
- … pynvml for GPU, psutil for CPU/RAM, sample at 100–500ms
- … pip-audit a habit: it catches new CVEs discovered after you install

To evolve this benchmarking foundation:
- … evalplus package (80× more test cases)
- … response._hidden_params["response_cost"]
- … pip-audit --strict into CI

Benchmarks become meaningful only when you "run them correctly", "rigorously match comparison conditions", and "ensure execution environment security". Boring as it sounds, neglecting any of these undermines the credibility of the measurements themselves.