How to calculate the cost per output token of a local model compared to enterprise model API access
Recently I've been experimenting with Claude and feeling the burn on premium API usage, so I wanted to know how much cheaper my local LLM is in terms of cost per output token.
Claude Sonnet is a good reference at $15 per 1 million output tokens, so I wanted to know how many tokens $15 worth of electricity powering my rig would generate by comparison.
(These calculations are just raw token generation, by the way; in the real world there's the cost of the initial hardware, ongoing maintenance as parts fail, and the human time to set it all up, which is much harder to factor into the equation.)
So how does one even calculate such a thing? Well, you need to know
- how many watts your inference rig consumes at load
- how many tokens per second it generates on average while inferencing (with the context relatively filled up, since we want conservative estimates)
- the electricity rate on your utility bill, in dollars per kilowatt-hour
Once you have those constants, you can work out how many kilowatt-hours of runtime $15 of electricity buys, then figure out the total number of tokens you'd expect to generate over that time given the TPS.
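For anyone who wants to sanity-check the math, here's a minimal sketch of the calculation in Python. The wattage, TPS, and electricity rate below are illustrative placeholders, not measurements from my rig, so swap in your own numbers:

```python
# Rough cost-per-token estimate for a local inference rig.
# All input values are illustrative assumptions -- substitute your own measurements.

API_PRICE_PER_M = 15.00    # $ per 1M output tokens (Claude Sonnet reference)
BUDGET_USD = 15.00         # how much electricity we're "spending"
RATE_USD_PER_KWH = 0.15    # assumed utility rate, $ per kWh
RIG_WATTS = 250            # assumed wall draw under inference load
TPS = 20                   # assumed sustained tokens/sec with context mostly filled

kwh_budget = BUDGET_USD / RATE_USD_PER_KWH        # kWh that $15 buys
runtime_hours = kwh_budget / (RIG_WATTS / 1000)   # hours of inference that energy powers
total_tokens = runtime_hours * 3600 * TPS         # tokens generated over that runtime
local_price_per_m = BUDGET_USD / (total_tokens / 1e6)

print(f"{kwh_budget:.1f} kWh -> {runtime_hours:.0f} h -> {total_tokens / 1e6:.1f}M tokens")
print(f"local: ${local_price_per_m:.2f}/M tokens vs API: ${API_PRICE_PER_M:.2f}/M tokens")
```

With these made-up numbers it works out to roughly 29M tokens for the $15, around $0.52 per million output tokens versus $15 per million from the API; plug in your own wattage and TPS to see where you land.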
The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol' 1070 Ti 8 GB. But even with the partially offloaded numbers for 22-32B models at 1-3 TPS, it's still a better deal overall.
I plan to offer the calculator as a tool on my site and release it under a permissive license like GPL if anyone is interested.
Not to be that guy (he says as he becomes that guy), but the GPL is not a permissive license; BSD and MIT are. Tho IMO the GPL is the better and probably the best license.
Also what models and use cases did you run it for? And what was your context window?
Thanks for being that guy, good to know. Those specific numbers were just run tonight with DeepHermes 8B Q6_K_M (finetuned from Llama 3.1 8B) at max context of 8192; in the past, before I reinstalled, I managed to squeeze ~10k context out of the 8B by booting without a desktop environment. I happen to know that DeepHermes 22B IQ3 (finetuned from Mistral Small) runs at around 3 TPS partially offloaded with 4-5k context.
DeepHermes 8B is the fast, efficient generalist I use for everyday conversation, basic web search, RAG, data table formatting/basic Markdown generation, and simple computations with the DeepSeek R1 distill reasoning CoT turned on.
DeepHermes 22B is the local powerhouse model I use for more complex tasks requiring either more domain knowledge or reasoning ability, for example to help break down legacy code and boilerplate simple functions for game creation.
I have a vision model + TTS pipeline for OCR scanning and narration using Qwen 2.5 VL 7B + OuteTTS + WavTokenizer, which I was considering trying to calculate too, though I'd need to account for both the LLM TPS and the audio TTS throughput.
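One way to fold the two stages together (just a hedged sketch, not how your pipeline necessarily measures things) is to charge the rig's power draw against the total wall-clock time of both stages, since electricity is burned through each. The stage throughputs, wattage, and job size below are placeholder assumptions:

```python
# Effective pipeline cost: energy is spent across both stages, so charge
# the whole wall-clock time against the final narrated output.
# All numbers are placeholder assumptions for illustration.

RATE_USD_PER_KWH = 0.15
RIG_WATTS = 250

def stage_seconds(units, units_per_sec):
    """Wall time a stage needs to produce `units` of output at `units_per_sec`."""
    return units / units_per_sec

# Example job: OCR/describe a page (~500 LLM tokens), then narrate it
# (~60 s of audio generated at an assumed 0.5 s-of-audio per second).
llm_time = stage_seconds(500, 15)     # assumed 15 TPS for the vision LLM
tts_time = stage_seconds(60, 0.5)     # assumed TTS generation speed

total_hours = (llm_time + tts_time) / 3600
energy_kwh = total_hours * RIG_WATTS / 1000
cost = energy_kwh * RATE_USD_PER_KWH
print(f"~{llm_time + tts_time:.0f} s wall time, ~${cost:.4f} per narrated page")
```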
I plan to load up a Stable Diffusion model and see how image generation compares, but the calculations will probably be slightly different.
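For image generation the unit presumably just changes from tokens to images: time a generation at your chosen steps/resolution and charge the wall draw over that interval. A tiny sketch with placeholder numbers only:

```python
# Cost per generated image; wattage, rate, and seconds-per-image are assumed values.
RATE_USD_PER_KWH = 0.15
RIG_WATTS = 250
SECONDS_PER_IMAGE = 45    # assumed time per image at your settings

cost_per_image = (RIG_WATTS / 1000) * (SECONDS_PER_IMAGE / 3600) * RATE_USD_PER_KWH
print(f"~${cost_per_image:.4f} per image")   # roughly $0.0005 with these numbers
```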
I hear there are one or two local models floating around that work with Roo-Cline for advanced tool usage; if I can find a local model in the 14B range that works with Roo, even if just for basic stuff, that will be incredible.
Hope that helps inform you, sorry if I missed something.
You're good. I'm trying to get larger context windows on my models, so I'm figuring that out while balancing token throughput. I do appreciate your insights into the different use cases.
Have you tried larger 70B models? Or compared against larger MoE models?