How the calculation works
InferCost models the total cost of ownership for local LLM inference and compares it to cloud API pricing at your actual token volume.
Local costs have two components. The first is hardware amortization: the purchase price divided evenly over the number of months you plan to use the hardware. A $2,799 Mac Studio M4 Max amortized over 36 months adds $77.75/month to your cost basis. The second is electricity: your hardware only draws power while it is actively generating tokens, so the calculator derives daily compute hours from your throughput (tokens/second) and daily token volume, then multiplies by wattage and your electricity rate.
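The local-cost arithmetic above can be sketched as follows. The token volume, wattage, and electricity rate in the example call are illustrative assumptions, not defaults from the calculator:

```python
def monthly_local_cost(hw_price, amort_months, daily_tokens,
                       tok_per_s, watts, kwh_rate, days_per_month=30):
    # Hardware amortization: purchase price spread evenly over the term.
    amortization = hw_price / amort_months
    # Active compute hours per day: power is drawn only while generating.
    hours_per_day = daily_tokens / tok_per_s / 3600
    # Electricity: hours x kW x $/kWh, scaled to a month.
    electricity = hours_per_day * (watts / 1000) * kwh_rate * days_per_month
    return amortization + electricity

# $2,799 Mac Studio over 36 months, generating 2M tokens/day at 45 tok/s,
# drawing 200 W under load at $0.15/kWh (illustrative numbers):
cost = monthly_local_cost(2799, 36, 2_000_000, 45, 200, 0.15)
# -> $77.75 amortization + ~$11.11 electricity = ~$88.86/month
```

Note that amortization dominates at this volume: roughly 12 hours of daily generation still costs only about $11/month in power.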
Cloud API costs are calculated from the published input and output token prices for the selected model, blended by your input:output ratio. At 50/50, Claude 3.7 Sonnet ($3/$15 per 1M) costs $9/1M blended. If you mostly generate (more output), your blended rate is higher.
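The blended rate is a simple weighted average, sketched here with the Claude 3.7 Sonnet prices from the table:

```python
def blended_rate(input_price, output_price, input_frac):
    # Price per 1M tokens, weighted by the input:output split.
    return input_price * input_frac + output_price * (1 - input_frac)

blended_rate(3.00, 15.00, 0.50)  # 50/50 split  -> $9.00 per 1M tokens
blended_rate(3.00, 15.00, 0.25)  # output-heavy -> $12.00 per 1M tokens
```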
Break-even is the month at which your cumulative cloud API spend exceeds your cumulative local spend (hardware + electricity). Before that month, cloud is cheaper in total. After it, local has paid for itself. The chart shows both lines so you can see how the curves diverge over time.
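A minimal sketch of the break-even search, assuming the hardware price is treated as paid up front rather than amortized (the monthly figures in the example call are illustrative):

```python
def break_even_month(hw_price, monthly_electricity, monthly_api_cost,
                     max_months=120):
    # First month where cumulative cloud spend exceeds cumulative
    # local spend (hardware up front + electricity each month).
    for month in range(1, max_months + 1):
        local = hw_price + monthly_electricity * month
        cloud = monthly_api_cost * month
        if cloud > local:
            return month
    return None  # cloud stays cheaper within the horizon

# $2,799 hardware, ~$11/month electricity, vs. ~$540/month in API spend
# (2M tokens/day at a $9/1M blended rate):
break_even_month(2799, 11.11, 540)  # -> 6
```

If the monthly API cost never exceeds electricity alone, the curves never cross and the function returns `None`.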
Note that the calculator does not account for opportunity cost of capital, model quality differences between local and cloud models, maintenance time, or the fact that hardware can be resold. It also assumes 100% utilization of local hardware — if you already own the hardware for other purposes, the amortization cost is effectively zero.
API pricing reference
| Model | Provider | Input / 1M | Output / 1M | Tier |
|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | mid |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | cheap |
| o3 | OpenAI | $10.00 | $40.00 | expensive |
| o4-mini | OpenAI | $1.10 | $4.40 | mid |
| Claude 3.7 Sonnet | Anthropic | $3.00 | $15.00 | mid |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | cheap |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | expensive |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | mid |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | cheap |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 | cheap |
| Llama 3.3 70B | Together AI | $0.90 | $0.90 | cheap |
| DeepSeek V3 | Together AI | $0.30 | $0.30 | cheap |
| Qwen 2.5 72B | Together AI | $0.40 | $0.40 | cheap |
Prices are approximate as of early 2026. Cloud providers change pricing regularly — verify before making hardware purchasing decisions.
Frequently asked questions
- What throughput should I use for my hardware?
- Throughput depends on the model size and quantization level. On a Mac Studio M4 Max (128 GB), expect ~45 tok/s for a 70B Q4_K_M model and ~120 tok/s for a 13B model via Ollama. A single RTX 4090 (24 GB VRAM) is fast for models that fit in VRAM; a 70B Q4 model (~40 GB) must be split across two cards, where ~60 tok/s is achievable. Run `ollama run [model] --verbose` to see your actual tok/s, or check SiliconBench for verified benchmarks on Apple Silicon hardware.
- Is the break-even calculation accurate?
- It is an economic break-even — the month when cumulative API spend exceeds cumulative hardware+electricity spend. It does not account for time value of money, hardware resale value, quality differences between local and cloud models, or maintenance overhead. Treat it as a first-order estimate, not a business case.
- What if I already own the hardware?
- Set hardware cost to $0. The only ongoing cost is electricity, and local inference will almost always win immediately. This is the strongest case for local inference — existing hardware that would otherwise sit idle.
- Does this include API rate limits or reliability considerations?
- No. The calculator only models direct monetary cost. Cloud APIs have rate limits, potential downtime, and latency that may affect your application. Local inference has no rate limits but requires hardware management. These factors are real but not quantifiable without specific usage patterns.
- What about models that do not fit in local VRAM?
- If your chosen model does not fit in unified memory or VRAM, inference will be slow or impossible. A 70B model in Q4 quantization requires ~40 GB of RAM. Check that your hardware has enough memory before assuming the throughput figures apply. The Mac Mini M4 with 16 GB is suitable for 7B models, not 70B.
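The ~40 GB rule of thumb above follows from bits per weight. This sketch estimates weight storage only; the ~4.5 bits/weight figure for Q4_K_M (including quantization metadata) is an assumption:

```python
def model_weights_gb(params_billions, bits_per_weight):
    # Weight storage only -- the KV cache and runtime buffers
    # add several GB on top, so leave headroom.
    return params_billions * bits_per_weight / 8

model_weights_gb(70, 4.5)  # ~39 GB, in line with the ~40 GB figure above
```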
- What is the best hardware for local inference in 2026?
- Apple Silicon (Mac Studio M4 Max / M4 Ultra) offers the best performance-per-watt for consumer hardware and supports large models via unified memory. The Mac Studio M4 Max at 128 GB handles Llama 3.3 70B comfortably. For NVIDIA, the RTX 4090 with 24 GB VRAM is the fastest single-card option for models that fit. For multi-GPU, two RTX 4090s or a 3090 Ti pair can run 70B models split across cards.