
  • It's not that clear-cut a problem. There seem to be two elements: the kernel driver had a memory-safety bug, and a definitions file was deployed incorrectly, triggering that bug. The kernel driver definitely deserves a lot of scrutiny, and static analysis should have told them this bug existed. The live updates are a bit different, since this is a real-time response system. If malware starts actively exploiting a software vulnerability, they can't wait for distribution maintainers to package the mitigation; it has to be deployed ASAP. They certainly should roll out definitions progressively and monitor for anything anomalous, but it has to be quick or the malware could beat them to it.

    This is more a code-safety issue than a CI/CD one. The bug was in the driver all along, but it had never been triggered before, so it passed the tests and got rolled out to everyone. Critical code like this ought to be written in a memory-safe language like Rust.
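
    As a rough illustration of the difference, here's a minimal Rust sketch of parsing a definitions record with bounds-checked access. The record layout (a length-prefixed payload) and field names are assumptions for illustration, not CrowdStrike's actual format; the point is that a corrupt file becomes an error value rather than an out-of-bounds read.

    ```rust
    // Hypothetical definitions record: [u32 id][u32 payload_len][payload bytes].
    // The layout is invented for illustration only.
    fn parse_record(buf: &[u8]) -> Result<(u32, &[u8]), String> {
        // get() returns None instead of reading out of bounds, so a short or
        // corrupt file becomes an Err for the caller to handle, not a page fault.
        let id = buf.get(0..4).ok_or("truncated id")?;
        let len = buf.get(4..8).ok_or("truncated length")?;
        let id = u32::from_le_bytes(id.try_into().unwrap());
        let len = u32::from_le_bytes(len.try_into().unwrap()) as usize;
        let payload = buf.get(8..8 + len).ok_or("payload shorter than declared")?;
        Ok((id, payload))
    }

    fn main() {
        // A malformed file that declares 100 payload bytes but provides none
        // yields an Err instead of an out-of-bounds read.
        let bad = [1u8, 0, 0, 0, 100, 0, 0, 0];
        println!("{:?}", parse_record(&bad));
    }
    ```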

  • The Deep

    Jump
  • I'd unsubscribe from !linux@lemmy.ml for a start.

    I'm pretty sure this update didn't get pushed to Linux endpoints, but sure, Linux machines running the CrowdStrike driver are probably vulnerable to panicking on malformed config files. There are a lot of weirdos claiming this is a uniquely Windows issue.

  • IFERROR(;0)

    Maybe they should use a more appropriate development tool for their critical security platform than Excel.

  • This error isn't an intentional crash in response to a security risk, though that can happen. It's a null pointer dereference, and there were no static or runtime checks in place that could have prevented it or handled it more gracefully. This was presumably a bug in the driver for a long time; then a faulty config file came along and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect; how these live config updates are deployed and monitored is another.

  • You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it'd be to send an error report back to the developers that an unexpected condition is being hit and continue without loading that one faulty definition file.
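
    A minimal sketch of that skip-and-report behaviour, written as a user-space loop for illustration; the directory name, load_definition function, and error reporting are hypothetical stand-ins, not CrowdStrike's actual mechanism:

    ```rust
    use std::fs;

    // Hypothetical check of one definition file; errors on malformed content.
    fn load_definition(bytes: &[u8]) -> Result<(), String> {
        if bytes.is_empty() {
            return Err("empty definition file".into());
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        // Load every channel file; skip and report the bad ones instead of
        // letting one malformed file take the whole system down.
        for entry in fs::read_dir("channel_files")? {
            let path = entry?.path();
            let bytes = fs::read(&path)?;
            if let Err(e) = load_definition(&bytes) {
                // Stand-in for "send an error report back to the developers".
                eprintln!("skipping {}: {}", path.display(), e);
            }
        }
        Ok(())
    }
    ```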

  • A page fault can be what triggers a catch, but you can't unwind what a loaded module (the CrowdStrike driver) did before it crashed. It could have messed with Windows kernel internals and left them in a state that is not safe to continue from. Rather than potentially damage the system, Windows stops with a BSOD. The only alternative would be to not allow code to be loaded into the kernel at all, but that would make hardware drivers basically impossible.

  • The driver is in kernel mode. If it crashes, the kernel has no idea if any internal structures have been left in an inconsistent state. If it doesn't halt then it has the potential to cause all sorts of damage.

  • I don't think the kernel could continue like that. The driver runs in kernel mode and took a null pointer exception. The kernel can't know how badly it's been screwed by that, so the only feasible option is to BSOD.

    The driver itself is where the error handling should take place. First off, it ought to have static checks to prove it can't have trivial memory errors like this. Secondly, if a configuration file fails to load, it should make a determination about whether it's safe to continue or whether to halt the system to prevent a potential exploit. You know, instead of shitting its pants and letting Windows handle it.
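
    As a sketch of what that determination could look like, with an invented policy type (nothing here is from the actual driver):

    ```rust
    // Hypothetical policy for what to do when a definitions file fails validation.
    enum FailurePolicy {
        // Keep running with the last known-good definitions (risk: reduced protection).
        ContinueWithLastGood,
        // Refuse to continue (risk: availability, as this outage showed).
        HaltSystem,
    }

    fn on_bad_config(policy: FailurePolicy) {
        match policy {
            FailurePolicy::ContinueWithLastGood => {
                eprintln!("bad definitions file: ignoring it, keeping previous set");
            }
            FailurePolicy::HaltSystem => {
                // In a kernel driver this would be a deliberate, controlled stop,
                // not an accidental null pointer dereference.
                panic!("bad definitions file: halting by policy");
            }
        }
    }

    fn main() {
        on_bad_config(FailurePolicy::ContinueWithLastGood);
    }
    ```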

  • This doesn't really answer my question, but CrowdStrike does explain a bit here: https://www.crowdstrike.com/blog/technical-details-on-todays-outage/

    These channel files are configuration for the driver and are pushed several times a day. It seems the driver can take a page fault if certain conditions are met. A mistake in a config file triggered this condition and put a lot of machines into a BSOD bootloop.

    I think it makes sense that this was a preexisting bug in the driver which was triggered by an erroneous config. What I still don't know is whether these channel updates have a staged deployment (presumably driver updates do), and what fraction of machines that got the bad update actually had a BSOD.

    Anyway, they should rewrite it in Rust.
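
    On the staged-deployment question, here's a back-of-the-envelope sketch of how a progressive rollout gate for channel files could work. The hash-based cohorts and the 1% threshold are assumptions for illustration, not how CrowdStrike actually pushes updates:

    ```rust
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Deterministically place a machine into a rollout bucket 0..=99.
    fn rollout_bucket(machine_id: &str, update_id: &str) -> u64 {
        let mut h = DefaultHasher::new();
        (machine_id, update_id).hash(&mut h);
        h.finish() % 100
    }

    fn main() {
        // Phase 1: only machines in the first 1% of buckets get the new channel
        // file; widen the percentage only if crash telemetry stays clean.
        let rollout_percent = 1u64;
        for machine in ["host-a", "host-b", "host-c", "host-d"] {
            let update_now = rollout_bucket(machine, "channel-291") < rollout_percent;
            println!("{machine}: {}", if update_now { "update now" } else { "wait" });
        }
    }
    ```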

  • Does anyone know how these CrowdStrike updates are actually deployed? Presumably the software has its own update mechanism to react to emergent threats without waiting for Patch Tuesday. Can users control the update policy for these 'channel files' themselves?

  • The switches do suck, but they can usually be revived with contact cleaner. If you open the mouse, you can spray around the switch plunger or, better yet, pop off the top half of the switch case and spray the contact directly. That completely cleared up the double-click on my G402 and even revived an old MX510 that was missing clicks.

  • So build concentrated solar power and store the heat for after the sun sets. Bonus: thermal power plant turbines give inertia to the grid, which photovoltaics don't.

  • Because then it would be 'a;imodo not qazimodo.

  • I don't think that was a malfunction...
    That was 'working as intended'.

  • wouldn’t changing it just end up performative

    Exactly. Sidereal time does get rid of time zones and leap years, but it's still referenced to a single physical object and relies on an arbitrary choice of start point. So it doesn't create some perfect cosmic time standard.

    The international date line doesn't help since that's just 180° offset from Greenwich itself.

    The point of standards is that they can be followed by everyone. The AD/BC epoch is fine. The Greenwich meridian is fine. UTC is fine. Changing them would cause so much disruption that it cannot be worth it.

    Daylight savings can go die in a ditch though.