
Posts
2
Comments
5
Joined
2 wk. ago

  • Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: pay way more for high-end cards like the H200 for greater efficiency, or pay less and deal with these issues.

    • I know that more bandwidth is better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose with PCIe 4.0 x16 and no P2P, but it goes as follows: a single 4090 gets 40.9 t/s while two get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM. I am really curious how this scales across more than two PCIe 5.0 cards with P2P, which all the cards listed here except the 5090 support.
    • The theory goes that while the H200 has a very impressive bandwidth of 4.89 TB/s, for the same price you can get 37 TB/s spread across 58 RX 9070s. Whether this actually works in practice, I don't know.
    • I don't need to build a datacenter; I'm fine with building a rack myself in my garage. And I don't think that requires higher volumes than I can get by just purchasing at different retailers.
    • I intend to run at FP8, so I wanted to show that instead of FP16, but it's surprisingly difficult to find the numbers for it. Only the H200 datasheet clearly displays FP8 Tensor Core performance. The RTX PRO 6000 datasheet keeps it vague, mentioning only AI TOPS, which they define as "Effective FP4 TOPS with sparsity", and they didn't even bother writing a datasheet for the 5090, only saying 3352 AI TOPS, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix; whether INT8 matrix is equal to FP8, I don't know. So FP16 was the common denominator for all the cards that I could find without comparing apples with oranges.
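To put the numbers above in perspective, here is a rough sketch of the arithmetic in Python. It only reuses the figures quoted in the comment (40.9 t/s, 58.5 t/s, 4.89 TB/s, 37 TB/s over 58 cards); the function and variable names are made up for illustration, and "efficiency" here just means speedup divided by GPU count, which is one common way to express it.

```python
def tp_scaling(single_tps: float, multi_tps: float, n_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for a tensor-parallel run.

    speedup    = multi-GPU throughput / single-GPU throughput
    efficiency = speedup / number of GPUs (1.0 would be perfect scaling)
    """
    speedup = multi_tps / single_tps
    return speedup, speedup / n_gpus


# Measured on 2x 4090, PCIe 4.0 x16, no P2P (figures from the comment above).
speedup, efficiency = tp_scaling(single_tps=40.9, multi_tps=58.5, n_gpus=2)
print(f"2-GPU speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")

# Aggregate memory bandwidth comparison, using the quoted totals as-is.
H200_BANDWIDTH_TBPS = 4.89      # figure quoted in the comment
CLUSTER_BANDWIDTH_TBPS = 37.0   # quoted total for the RX 9070 cluster
CLUSTER_CARDS = 58

per_card_tbps = CLUSTER_BANDWIDTH_TBPS / CLUSTER_CARDS
paper_ratio = CLUSTER_BANDWIDTH_TBPS / H200_BANDWIDTH_TBPS
print(f"per card: {per_card_tbps:.2f} TB/s, on-paper advantage: {paper_ratio:.1f}x")

# Hypothetical what-if: the 2-GPU efficiency is NOT known to hold at 58 cards.
# This only shows how quickly the on-paper advantage shrinks if it did.
print(f"advantage at {efficiency:.0%} efficiency: {paper_ratio * efficiency:.1f}x")
```

So two cards give about 1.43x, roughly 72% scaling efficiency. The last line is only a what-if: applying a 2-GPU efficiency to 58 cards is a guess, since interconnect overhead normally grows with the number of cards.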
  • Well, a scam for selfhosters; for datacenters it's different, of course.

    I'm looking to upgrade to my first dedicated server build, coming from only SBCs, so I'm not sure how much of a concern heat will be, but space and power shouldn't be an issue (within reason, of course).

  • Yeah, I should have specified "for at home" when saying it's a scam. I honestly doubt the companies that are buying thousands of B200s for datacenters are even looking at the price tags lmao.

    Anyway, the end goal is to run something like Qwen3-235B at FP8. With some very rough napkin math, 300 GB of VRAM using the cheapest option, the 9060 XT, comes to €7126 for 18 cards, which is very affordable. But of course, the fact that this is theoretically possible does not mean it will actually work in practice, which is what I'm curious about.

    The inference engine I'm using, vLLM, supports ROCm, so CUDA should not be strictly required.
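The napkin math above can be written out as a small Python sketch. The 16 GB per card is an assumption (the comment does not say which 9060 XT variant is meant), as is the rule of thumb of about 1 byte per parameter at FP8; the €7126 and 18 cards are the quoted figures, and all names are made up for illustration.

```python
import math


def cards_needed(target_vram_gb: float, vram_per_card_gb: float) -> int:
    """Smallest whole number of cards whose pooled VRAM reaches the target."""
    return math.ceil(target_vram_gb / vram_per_card_gb)


TARGET_VRAM_GB = 300      # rough target from the comment
VRAM_PER_CARD_GB = 16     # ASSUMPTION: 16 GB variant of the 9060 XT
QUOTED_TOTAL_EUR = 7126   # quoted total price
QUOTED_CARDS = 18         # quoted card count

price_per_card = QUOTED_TOTAL_EUR / QUOTED_CARDS
pooled_vram_gb = QUOTED_CARDS * VRAM_PER_CARD_GB
needed = cards_needed(TARGET_VRAM_GB, VRAM_PER_CARD_GB)

# ASSUMPTION: ~1 byte per parameter at FP8, so ~235 GB for the weights alone.
# KV cache, activations and per-GPU overhead come on top of that.
WEIGHTS_GB = 235
headroom_gb = pooled_vram_gb - WEIGHTS_GB

print(f"price per card: EUR {price_per_card:.0f}")
print(f"{QUOTED_CARDS} cards pool {pooled_vram_gb} GB, "
      f"{needed} cards needed for {TARGET_VRAM_GB} GB")
print(f"headroom after weights with {QUOTED_CARDS} cards: {headroom_gb} GB")
```

Under the 16 GB assumption, 18 cards pool 288 GB, so a full 300 GB would take 19 cards. The weights alone would probably still fit in 288 GB, but how much room is left for KV cache depends on context length and batch size.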

  • LocalLLaMA @sh.itjust.works

    Very large amounts of gaming gpus vs AI gpus

    Selfhosted @lemmy.world

    Very large amounts of gaming gpus vs AI gpus