How the calculation works
InferCost models the total cost of ownership for local LLM inference and compares it to cloud API pricing at your actual token volume.
Local costs have two components. The first is hardware amortization: the purchase price divided evenly over the number of months you plan to use the hardware. A $2,799 Mac Studio M4 Max amortized over 36 months adds $77.75/month to your cost basis. The second is electricity: your hardware only draws power while it is actively generating tokens, so the calculator derives daily compute hours from your throughput (tokens/second) and daily token volume, then multiplies by wattage and your electricity rate.
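The local-cost arithmetic above can be sketched as follows. The token volume, wattage, and electricity rate in the example call are illustrative assumptions, not defaults from the calculator:

```python
def monthly_local_cost(hw_price, amort_months, daily_tokens,
                       tok_per_s, watts, kwh_rate, days_per_month=30):
    # Hardware amortization: purchase price spread evenly over the term.
    amortization = hw_price / amort_months
    # Active compute hours per day: power is drawn only while generating.
    hours_per_day = daily_tokens / tok_per_s / 3600
    # Electricity: hours x kW x $/kWh, scaled to a month.
    electricity = hours_per_day * (watts / 1000) * kwh_rate * days_per_month
    return amortization + electricity

# $2,799 Mac Studio over 36 months, generating 2M tokens/day at 45 tok/s,
# drawing 200 W under load at $0.15/kWh (illustrative numbers):
cost = monthly_local_cost(2799, 36, 2_000_000, 45, 200, 0.15)
# -> $77.75 amortization + ~$11.11 electricity = ~$88.86/month
```

Note that amortization dominates at this volume: roughly 12 hours of daily generation still costs only about $11/month in power.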
Cloud API costs are calculated from the published input and output token prices for the selected model, blended by your input:output ratio. At 50/50, Claude 3.7 Sonnet ($3/$15 per 1M) costs $9/1M blended. If you mostly generate (more output), your blended rate is higher.
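The blended rate is a simple weighted average, sketched here with the Claude 3.7 Sonnet prices from the table:

```python
def blended_rate(input_price, output_price, input_frac):
    # Price per 1M tokens, weighted by the input:output split.
    return input_price * input_frac + output_price * (1 - input_frac)

blended_rate(3.00, 15.00, 0.50)  # 50/50 split  -> $9.00 per 1M tokens
blended_rate(3.00, 15.00, 0.25)  # output-heavy -> $12.00 per 1M tokens
```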
Break-even is the month at which your cumulative cloud API spend exceeds your cumulative local spend (hardware + electricity). Before that month, cloud is cheaper in total. After it, local has paid for itself. The chart shows both lines so you can see how the curves diverge over time.
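A minimal sketch of the break-even search, assuming the hardware price is treated as paid up front rather than amortized (the monthly figures in the example call are illustrative):

```python
def break_even_month(hw_price, monthly_electricity, monthly_api_cost,
                     max_months=120):
    # First month where cumulative cloud spend exceeds cumulative
    # local spend (hardware up front + electricity each month).
    for month in range(1, max_months + 1):
        local = hw_price + monthly_electricity * month
        cloud = monthly_api_cost * month
        if cloud > local:
            return month
    return None  # cloud stays cheaper within the horizon

# $2,799 hardware, ~$11/month electricity, vs. ~$540/month in API spend
# (2M tokens/day at a $9/1M blended rate):
break_even_month(2799, 11.11, 540)  # -> 6
```

If the monthly API cost never exceeds electricity alone, the curves never cross and the function returns `None`.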
Note that the calculator does not account for opportunity cost of capital, model quality differences between local and cloud models, maintenance time, or the fact that hardware can be resold. It also assumes 100% utilization of local hardware — if you already own the hardware for other purposes, the amortization cost is effectively zero.
API pricing reference
| Model | Provider | Input / 1M | Output / 1M | Tier |
|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | mid |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | cheap |
| o3 | OpenAI | $10.00 | $40.00 | expensive |
| o4-mini | OpenAI | $1.10 | $4.40 | mid |
| Claude 3.7 Sonnet | Anthropic | $3.00 | $15.00 | mid |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | cheap |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | expensive |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | mid |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | cheap |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 | cheap |
| Llama 3.3 70B | Together AI | $0.90 | $0.90 | cheap |
| DeepSeek V3 | Together AI | $0.30 | $0.30 | cheap |
| Qwen 2.5 72B | Together AI | $0.40 | $0.40 | cheap |
Prices are approximate as of early 2026. Cloud providers change pricing regularly — verify before making hardware purchasing decisions.
Frequently asked questions
- What throughput should I use for my hardware?
- Throughput depends on the model size and quantization level. On a Mac Studio M4 Max (128 GB), expect ~45 tok/s for a 70B Q4_K_M model and ~120 tok/s for a 13B model via Ollama. A single RTX 4090 (24 GB VRAM) is fast for models that fit in VRAM; a 70B Q4 model (~40 GB) must be split across two cards, where ~60 tok/s is achievable. Run `ollama run [model] --verbose` to see your actual tok/s, or check SiliconBench for verified benchmarks on Apple Silicon hardware.
- Is the break-even calculation accurate?
- It is an economic break-even — the month when cumulative API spend exceeds cumulative hardware+electricity spend. It does not account for time value of money, hardware resale value, quality differences between local and cloud models, or maintenance overhead. Treat it as a first-order estimate, not a business case.
- What if I already own the hardware?
- Set hardware cost to $0. The only ongoing cost is electricity, and local inference will almost always win immediately. This is the strongest case for local inference — existing hardware that would otherwise sit idle.
- Does this include API rate limits or reliability considerations?
- No. The calculator only models direct monetary cost. Cloud APIs have rate limits, potential downtime, and latency that may affect your application. Local inference has no rate limits but requires hardware management. These factors are real but not quantifiable without specific usage patterns.
- What about models that do not fit in local VRAM?
- If your chosen model does not fit in unified memory or VRAM, inference will be slow or impossible. A 70B model in Q4 quantization requires ~40 GB of RAM. Check that your hardware has enough memory before assuming the throughput figures apply. The Mac Mini M4 with 16 GB is suitable for 7B models, not 70B.
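The ~40 GB rule of thumb above follows from bits per weight. This sketch estimates weight storage only; the ~4.5 bits/weight figure for Q4_K_M (including quantization metadata) is an assumption:

```python
def model_weights_gb(params_billions, bits_per_weight):
    # Weight storage only -- the KV cache and runtime buffers
    # add several GB on top, so leave headroom.
    return params_billions * bits_per_weight / 8

model_weights_gb(70, 4.5)  # ~39 GB, in line with the ~40 GB figure above
```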
- What is the best hardware for local inference in 2026?
- Apple Silicon (Mac Studio M4 Max / M4 Ultra) offers the best performance-per-watt for consumer hardware and supports large models via unified memory. The Mac Studio M4 Max at 128 GB handles Llama 3.3 70B comfortably. For NVIDIA, the RTX 4090 with 24 GB VRAM is the fastest single-card option for models that fit. For multi-GPU, two RTX 4090s or a 3090 Ti pair can run 70B models split across cards.