I have tried them, and to be honest I was not surprised. The hosted service was better at longer code snippets, and in particular I found it was consistently better at producing valid chains of thought (I've found that a lot of simpler models, including the distills, tend to produce shallow reasoning chains, even when they get the answer right).
I'm aware of how these models work; I work in this field and have been developing a benchmark for reasoning capabilities in LLMs. The distills are certainly still technically impressive and it's nice that they exist, but the gap between them and the hosted version is unfortunately nontrivial.
It might be trivial to a tech-savvy audience, but considering how popular ChatGPT itself is and considering DeepSeek's ranking on the Play and iOS App Stores, I'd honestly guess most people are using DeepSeek's servers. Plus, you'd be surprised how many people naturally trust the service more after hearing that the company open sourced the models. Accordingly I don't think it's unreasonable for Proton to focus on the service rather than the local models here.
I'd also note that people who want the highest quality responses aren't using a local model, as anything you can run locally is a distilled version that is significantly smaller (at a small but non-trivial overall performance cost).
TBF you almost certainly can't run R1 itself. The model is way too big and compute intensive for a typical system. You can only run the distilled versions which are definitely a bit worse in performance.
Lots of people (if not most) are using the service hosted by DeepSeek themselves, as evidenced by DeepSeek's ranking on both the iOS App Store and Google Play.
Part of this was an optimization that was necessary due to their resource restrictions. Chinese firms can only purchase H800 GPUs instead of H200 or H100. These have much slower inter-GPU communication (less than half the bandwidth!) as a result of export bans by the US government, so this optimization was done to try and alleviate some of that bottleneck. It's unclear to me if this type of optimization would make as big of a difference for a lab using H100s/H200s; my guess is that it probably matters less.
I think the thing that Jensen is getting at is that CUDA is merely a set of APIs. Other hardware manufacturers can re-implement the CUDA APIs if they really wanted to (especially since AFAIK, Google v Oracle ruled that APIs cannot be copyrighted). In fact, AMD's HIP implements many of the same APIs as CUDA, and they ship a tool (HIPIFY) to convert code written for CUDA for HIP instead.
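To make the "CUDA is just a set of APIs" point concrete: the source-to-source part of porting is largely mechanical renaming. Here's a toy Python sketch of what a HIPIFY-style pass does (the real tool is clang-based and covers far more than string substitution):

```python
# A few real CUDA -> HIP renamings that HIPIFY performs; the actual
# tool covers hundreds of APIs and parses the source properly.
RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cuda_runtime.h": "hip/hip_runtime.h",
}

def toy_hipify(source: str) -> str:
    """Naive textual CUDA -> HIP translation (illustration only)."""
    for cuda_name, hip_name in RENAMES.items():
        source = source.replace(cuda_name, hip_name)
    return source

cuda_src = "#include <cuda_runtime.h>\ncudaMalloc(&ptr, n); cudaFree(ptr);"
print(toy_hipify(cuda_src))
```

The fact that a trivial renamer gets you this far is exactly why the API layer itself isn't the moat.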
Of course, this does not guarantee that code originally written for CUDA is going to perform well on other accelerators, since it likely was implemented with NVIDIA's compute model in mind.
What I'm curious to see is how well these types of modifications scale with compute. DeepSeek is restricted to H800s instead of H100s or H200s. These are gimped cards to get around export controls, and accordingly they have lower memory bandwidth (~2 vs ~3 TB/s) and, most notably, much slower GPU to GPU communication (something like 400 GB/s vs 900 GB/s). The specific reason they used PTX in this application was to help alleviate some of the bottlenecks due to the limited inter-GPU bandwidth, so I wonder if that would still improve performance on H100 and H200 GPUs where bandwidth is much higher.
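To put rough numbers on the interconnect gap (a back-of-envelope sketch using the ~400 vs ~900 GB/s figures above; the gradient size and GPU count are made up for illustration):

```python
def ring_allreduce_seconds(param_bytes: float, n_gpus: int,
                           link_gbps: float) -> float:
    """Ideal ring all-reduce time: each GPU sends/receives
    2*(n-1)/n times the buffer size over the interconnect."""
    traffic = 2 * (n_gpus - 1) / n_gpus * param_bytes
    return traffic / (link_gbps * 1e9)

grad_bytes = 10e9  # pretend 10 GB of gradients per step (made-up size)
for name, bw in [("H800 (~400 GB/s)", 400), ("H100 (~900 GB/s)", 900)]:
    t = ring_allreduce_seconds(grad_bytes, n_gpus=8, link_gbps=bw)
    print(f"{name}: {t * 1e3:.1f} ms per all-reduce")
```

Even in this idealized model the H800 spends over twice as long per collective, which is the bottleneck the PTX-level scheduling was trying to hide.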
IIRC Zluda does support compiling PTX. My understanding is that this is part of why Intel and AMD eventually didn't want to support it - it's not a great idea to tie yourself to someone else's architecture that you have no control over or license to.
OTOH, CUDA itself is just a set of APIs and their implementations on NVIDIA GPUs. Other companies can re-implement them. AMD has already done this with HIP.
My stance on Proton is my stance on GrapheneOS: just because the creator is bad doesn’t mean the software is bad. As long as the software is better than the alternatives, I see no reason to stop using it.
I think the major difference is that for a software package or operating system like GrapheneOS, theoretically people can audit the code and verify that it is secure (of course in practice this is not something that 99% of people will ever do). So to some extent, you technically don't have to put a ton of trust into the GrapheneOS devs, especially with features like reproducible builds allowing you to verify that the software you're running is the same software as the repository.
For something like Proton where you're using a service someone else is running, you sort of have to trust the provider by default. You can't guarantee that they're not leaking information about you, since there's no way for you to tell what their servers are doing with your data. Accordingly, to some extent, if you don't trust the team behind the service, it isn't unreasonable to start doubting the service.
Huh. Everything I'm reading seems to imply it's more like a DSP ASIC than an FPGA (even down to the fact that it's a VLIW processor) but maybe that's wrong.
I'm curious what kind of work you do that's led you to this conclusion about FPGAs. I'm guessing you specifically use FPGAs for this task in your work? I'd love to hear which ops you specifically find speedups in. I can imagine many exist; otherwise there wouldn't be a need for features like tensor cores and transformer acceleration on the latest NVIDIA GPUs, since those features must exploit some inefficiency in GPGPU architectures (up to the limits of memory bandwidth, of course). But I also wonder how much benefit you can actually get, since in practice a lot of workloads end up limited by memory bandwidth, and unless you have a gigantic FPGA I imagine that will be an issue there as well.
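The memory-bandwidth argument is just the standard roofline model; a quick sketch with illustrative numbers (roughly A100-class fp16 peaks, not any FPGA's):

```python
def bound(flops: float, bytes_moved: float,
          peak_flops: float, peak_bw: float) -> str:
    """Roofline: an op is memory-bound when its arithmetic intensity
    (FLOPs per byte) falls below the hardware's FLOPs-per-byte ratio."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative peaks only (ballpark A100 fp16 tensor / HBM numbers).
PEAK_FLOPS = 312e12   # ~312 TFLOP/s
PEAK_BW = 2e12        # ~2 TB/s

# Elementwise add of two fp16 vectors: 1 FLOP per element, 6 bytes moved.
print(bound(flops=1, bytes_moved=6,
            peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW))
# Large fp16 matmul: intensity grows with matrix size, e.g. N=4096 tiles.
print(bound(flops=2 * 4096**3, bytes_moved=6 * 4096**2,
            peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW))
```

The point being: a smarter compute fabric only helps on the compute-bound ops; the memory-bound ones are pinned to whatever DRAM bandwidth the FPGA board has.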
I haven't seriously touched FPGAs in a while, but I work in ML research (namely CV) and I don't know anyone on the research side bothering with FPGAs. Even dedicated accelerators are still mostly niche products because in practice, the software suite needed to run them takes a lot more time to configure. For us on the academic side, you're usually looking at experiments that take at most a few days to run. If you're now spending an extra day or two writing RTL instead of just slapping together a few lines of python that implicitly call CUDA kernels, you're not really benefiting from the potential speedup of FPGAs. On the other hand, I know accelerators are handy for production environments (and in general they're more popular for inference than training).
I suspect it's much easier to find someone who can write quality CUDA or PTX than someone who can write quality RTL, especially with CS being much more popular than ECE nowadays. At a minimum, the whole FPGA skillset seems much less common among my peers. Maybe it'll be more crucial in the future (which will definitely be interesting!) but it's not something I've seen yet.
Good point! For my use case (on a different brand, Sony) I'm fine with the lowered resolution since I just use it for video conferencing, in which case the raw resolution is limited anyway. But for users who need higher resolution, an HDMI capture card might be the better option as a one-time fee rather than a subscription.
Is mechanical shutter necessary for max bit depth on your camera? It isn't on mine (Sony), but bit depth reduces to 12 bit if you max out the framerate. You might still be able to get full 14 bit RAWs if you drop the framerate.
You can do this on Linux using gphoto2, ffmpeg, and v4l2loopback. You probably won't get full resolution but the quality will still be good enough for video conferencing. See here for a guide.
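Here's a minimal sketch of that pipeline in Python (the device node is an assumption — v4l2loopback may create a different /dev/videoN on your system, and you'd load the module first with `sudo modprobe v4l2loopback`):

```python
import shlex
import subprocess

def webcam_pipeline(loopback_dev: str = "/dev/video0"):
    """Build the gphoto2 -> ffmpeg -> v4l2loopback command pair.
    Shell equivalent: gphoto2 --stdout --capture-movie |
      ffmpeg -i - -vcodec rawvideo -pix_fmt yuv420p -f v4l2 <dev>"""
    capture = shlex.split("gphoto2 --stdout --capture-movie")
    encode = shlex.split(
        f"ffmpeg -i - -vcodec rawvideo -pix_fmt yuv420p -f v4l2 {loopback_dev}")
    return capture, encode

def run(loopback_dev: str = "/dev/video0"):
    """Actually wire the two commands together (needs the camera attached)."""
    capture, encode = webcam_pipeline(loopback_dev)
    cam = subprocess.Popen(capture, stdout=subprocess.PIPE)
    subprocess.run(encode, stdin=cam.stdout)

if __name__ == "__main__":
    cap, enc = webcam_pipeline()
    print(" ".join(cap), "|", " ".join(enc))
```

gphoto2 streams the camera's live view to stdout, ffmpeg decodes it to raw video, and v4l2loopback exposes that as a virtual webcam any conferencing app can pick up.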
Not that unusual IMO, lots of people start their PhD directly after completing their Bachelor's. If they weren't born in the first half of the year, then they'll have completed their BS by 21 and start the PhD at either 21 or 22.
But regulators could at least force NVIDIA to open their CUDA library and allow translation layers like ZLUDA.
I don't believe there's anything stopping AMD from re-implementing the CUDA APIs; in fact, I'm pretty sure this is exactly what HIP is for, even though it's not 100% automatic. AMD probably can't link against CUDA libraries like cuDNN and cuBLAS, but I don't know that doing so would be useful anyway, since I'm fairly certain those libraries have GPU-specific optimizations. AMD makes its own replacements for them anyway.
IMO, the biggest annoyance with ROCm is that the consumer GPU support is very poor. On CUDA you can use any reasonably modern NVIDIA GPU and it will "just work." This means if you're a student, you have a reasonable chance of experimenting with compute libraries or even GPU programming if you have an NVIDIA card, but less so if you have an AMD card.
I work in CV and I have to agree that AMD is kind of OK-ish at best there. The core DL libraries like torch will play nice with ROCm, but you don't have to look far to find third party libraries explicitly designed around CUDA or NVIDIA hardware in general. Some examples are the super popular OpenMMLab/mmcv framework, tiny-cuda-nn and nerfstudio for NeRFs, and Gaussian splatting. You could probably get these to work on ROCm with HIP but it's a lot more of a hassle than configuring them on CUDA.
I've tried Overture, Creality, and Inland (all black though, not transparent) and Overture printed the best for me (at least for functional parts where I cared about print quality and tolerances). Inland's PETG+ and High Speed PETG were even better though.
Not quite the same thing, but modern high-end cameras use CFexpress (as in CompactFlash). The cards communicate over PCIe using the same protocol as NVMe drives, but with fewer lanes and in a smaller form factor. The tricky part is that at that size, you don't have room to cram as many flash chips onto a card as on a 2280 NVMe drive.
I don't know the architecture of the AI accelerator in Ryzen processors, but I do know a fair number of image deblurring and denoising tools run on the Neural Engine on Apple Silicon. The Neural Engine is good enough for a lot of tasks, provided that your model only uses relatively simple operators and doesn't need full precision.
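On the precision point: these accelerators typically run fp16 or int8 rather than fp32 (I'm not claiming anything specific about the Ryzen part here). You can see what half precision costs with just the Python stdlib, since `struct` supports the IEEE-754 half format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# fp16 keeps only ~3 decimal digits: fine for many vision models,
# a problem for models that genuinely need fp32.
print(to_fp16(0.1))            # close to 0.1, but not exact
print(to_fp16(2049.0))         # → 2048.0: integers past 2048 lose exactness
```

That rounding is usually invisible in a denoiser's output but can break models that rely on accumulating small values precisely.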