Any ideas about what I can do about some crashing related to encoding video?
Long story short, within the past week I have started having instability in my system. Initially, it started as an issue when I was ripping a movie, but eventually I noticed it was happening even when just trying to re-encode a video into a more-easily-edited codec. Running ffmpeg to encode a video causes the issue to reproduce as long as the encode takes a sufficient amount of time (I’d say at least ten minutes of video encoding; shorter videos seem to work without error, longer ones always fail).
More details:
The only symptom of the issue apparent to a user is a system hang: the keyboard stops responding, even to toggle keys, audio loops, and the display either freezes or goes black. The journals show a kernel null pointer dereference as the last event, sometimes with a full stack trace and sometimes with less information.
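For anyone who wants to pull the same information out of their own journal, here is a rough sketch of how the kernel messages from the previous boot could be filtered for oops-related lines. The marker strings and the function names are my own assumptions for illustration, not anything journald defines, and the output will differ from system to system.

```python
import subprocess

# Substrings that commonly appear in kernel oops output.
# This list is an assumption for illustration; adjust it to what your log shows.
OOPS_MARKERS = ("NULL pointer dereference", "Oops:", "Call Trace:")


def previous_boot_kernel_cmd(boots_back: int = 1) -> list[str]:
    """Build a journalctl command for the kernel messages of an earlier boot."""
    return ["journalctl", "--no-pager", "-k", f"--boot=-{boots_back}"]


def find_oops_lines(journal_text: str) -> list[str]:
    """Return only the lines that contain one of the oops markers."""
    return [
        line
        for line in journal_text.splitlines()
        if any(marker in line for marker in OOPS_MARKERS)
    ]


def print_previous_boot_oops(boots_back: int = 1) -> None:
    """Run journalctl and print the matching lines (may need journal read access)."""
    result = subprocess.run(
        previous_boot_kernel_cmd(boots_back),
        capture_output=True,
        text=True,
        check=False,
    )
    for line in find_oops_lines(result.stdout):
        print(line)
```

Calling `print_previous_boot_oops()` right after rebooting from a hang should show whatever part of the trace made it to disk.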
I may have also reproduced this once by playing Starfield. The Starfield hang didn’t repro the null pointer part, which is why I’m not sure if it is related; I’m including it for posterity. The behavior outside of the journal log was identical though.
Since I don’t exactly encode video on a regular basis, I’m not sure of exactly what may have changed to start the problem - my theory is some type of driver or kernel update, but it could very well be a hardware fault that just started. The last time I did any video rendering (successfully) was around a month ago.
Potentially-unrelated information:
While I’m running the amdgpu drivers now, I didn’t always; I used to run DaVinci Resolve Studio, which meant amdgpu-pro drivers for GPU acceleration. I wouldn’t be too surprised if there is some remnant of that left behind that I missed, but I don’t see anything.
I do have an active DP 2.0 to HDMI 2.1 adapter, which causes all sorts of weird things to happen via graphical display output to that device (AMD and HDMI 2.1 is a match made in the underworld, but it mostly-works for me), but this can happen even when the TV is unplugged.
I have not noticed any overheating going on. My fan curves tend toward letting things get a little warm before kicking up, but this case has a lot of room and airflow.
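In case it helps rule overheating in or out, this is a minimal sketch for reading temperatures from the kernel's hwmon sysfs interface, assuming the sensors are exposed as `temp*_input` files under `/sys/class/hwmon` (values in millidegrees Celsius). The function name is hypothetical, and which chips show up depends on the loaded drivers.

```python
from pathlib import Path


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> dict[str, float]:
    """Return {"hwmonN:chip/tempM_input": degrees_celsius} for readable sensors."""
    readings: dict[str, float] = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        try:
            chip_name = (chip / "name").read_text().strip()
        except OSError:
            chip_name = "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            key = f"{chip.name}:{chip_name}/{sensor.name}"
            readings[key] = millidegrees / 1000.0
    return readings
```

Printing `read_hwmon_temps()` every few seconds during an encode would show whether the CPU or GPU is climbing before the hang.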
No other crashes with either other games or LLMs running locally. To be fair, I mostly play indie games that wouldn’t tax my GPU particularly hard and my LLMs are just basic code assistance models that aren’t doing much either.
Thanks for reading this far and I hope you all can help me!
Ah okay, no clue why, but there is a mismatch between my microcode revision and the current one. It doesn’t seem to be related, though, as I’ve got a different CPU model.
Not too surprising, I tend toward the “only update the BIOS when necessary” approach to keeping that up to date. I probably last updated it when I put in the 5900X (it was originally a 3900X that had a hardware fault).
Hrm. I mean, I can monitor my memory usage and trigger it again, but I would be a little surprised if I ran out of RAM at 64 GB of RAM.
The ways I can easily reproduce the issue are either re-encoding a half-hour-long recording I made on my phone (1080p/60, I think the source codec is H.264) or ripping a 4K Blu-ray that I own (so… 4K/24 at whatever codec 4K Blu-rays use). I’m pretty sure ffmpeg is in use either way.
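Since the hang takes a while to show up, a small wrapper that writes a heartbeat to disk while the encode runs can show roughly how far in it got. The sketch below is only an illustration: the ProRes/PCM target (`prores_ks`, `pcm_s16le`) is my assumption for an "easily edited" codec, not necessarily the command actually used, and the function names and log path are hypothetical.

```python
import os
import subprocess
import time


def build_encode_cmd(src: str, dst: str) -> list[str]:
    """Assumed re-encode command: video to ProRes, audio to 16-bit PCM."""
    return ["ffmpeg", "-y", "-i", src, "-c:v", "prores_ks", "-c:a", "pcm_s16le", dst]


def append_synced(path: str, message: str) -> None:
    """Append one line and force it to disk so it survives a hard hang."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(message + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run_with_heartbeat(cmd: list[str], log_path: str, interval: float = 10.0) -> int:
    """Run cmd, logging elapsed time every `interval` seconds until it exits."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    while True:
        try:
            code = proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            append_synced(log_path, f"alive after {time.monotonic() - start:.0f}s")
            continue
        append_synced(log_path, f"exited with {code} after {time.monotonic() - start:.0f}s")
        return code
```

Something like `run_with_heartbeat(build_encode_cmd("phone.mp4", "out.mov"), "encode.log")` would leave the last "alive after" line as a rough timestamp for the hang; the file names here are placeholders.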
I haven’t checked the health of my RAM recently (the last time I did would have been around nine months ago), but I have checked my SSDs more recently - no issues there, and I can encode video to any of my four SSDs and still reproduce the issue. I’ll do some memory testing overnight to check that out.
Thank you both for help with troubleshooting, by the way!
Monitored RAM and triggered the issue again - I was using 13ish GB of RAM when it crashed (out of my 64 GB of RAM) and didn’t touch swap at all. I’m not sure why that one crash was swapping - maybe something started looping or something else running had a really large memory leak and was unrelated?
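For reference, the RAM and swap numbers can be read from `/proc/meminfo`. Here is a minimal sketch of that, assuming the usual `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` fields are present (reported in kB); the function names are my own.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def usage_summary(info: dict[str, int]) -> dict[str, float]:
    """Summarise used RAM and used swap in GiB."""
    kib_per_gib = 1024 ** 2
    used_ram = info["MemTotal"] - info["MemAvailable"]
    used_swap = info["SwapTotal"] - info["SwapFree"]
    return {
        "ram_used_gib": round(used_ram / kib_per_gib, 1),
        "swap_used_gib": round(used_swap / kib_per_gib, 1),
    }


def current_usage(path: str = "/proc/meminfo") -> dict[str, float]:
    """Read the live values from the running system."""
    with open(path, encoding="utf-8") as fh:
        return usage_summary(parse_meminfo(fh.read()))
```

Logging `current_usage()` to a file every few seconds during an encode would show whether memory or swap is moving at all before the hang.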
Anyway, I’ll memtest overnight to see if anything crops up.
I’ve checked the last journal, but unfortunately there was not much detail in it. I guess the journal entries were not synced to disk when the error happened. You can try changing the sync interval to every 5 seconds, instead of the default 5 minutes, to see if we get more info.
Edit /etc/systemd/journald.conf, uncomment the SyncIntervalSec line, set it to 5s instead of 5m, and reboot.
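If you would rather script that change than edit by hand, the following is a rough sketch of the text transformation, assuming the file contains the stock commented-out `#SyncIntervalSec=5m` line under `[Journal]`. The function name is hypothetical, and it only returns the new text; writing it back to /etc needs root and is left out on purpose.

```python
import re

# Matches an active or commented-out SyncIntervalSec line.
_SYNC_LINE = re.compile(r"^[ \t]*#?[ \t]*SyncIntervalSec[ \t]*=.*$", re.MULTILINE)


def set_sync_interval(conf_text: str, value: str = "5s") -> str:
    """Return conf_text with SyncIntervalSec uncommented and set to `value`."""
    new_line = f"SyncIntervalSec={value}"
    if _SYNC_LINE.search(conf_text):
        # Replace only the first occurrence so the rest of the file is untouched.
        return _SYNC_LINE.sub(new_line, conf_text, count=1)
    # No such line: append it, assuming [Journal] is the only section in the file.
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"
```

After saving the result to /etc/systemd/journald.conf, reboot as described above so the new interval is picked up.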
Some other things to consider: did you overclock your CPU? Have you tried using the LTS kernel?
You say that you can reliably reproduce the issue, so with a little luck we should be able to track it down.
I mean, it takes around 20-30 minutes to reproduce, but yes - I have a 100% repro rate and relative stability for just about everything else. That’s why I’m hoping I can be helpful / can get help. I’ve changed my journald settings, we’ll see if my next crash tells us more.
No CPU overclocking outside of default turboing behavior. RAM is technically overclocked, but I’m just using the XMP/DOCP profile. Even then, I was actually running on stock DDR4 speeds when I updated my BIOS since I forgot to re-enable XMP, so that means I’ve accidentally tested that as well (and was still able to repro).
I’m not using the LTS kernel, but I suppose I could give it a shot to see if I can still reproduce. It’ll at least tell me if this is a kernel-level change. I’ll test it out tomorrow since I don’t want to sit around for the 20-30 minutes while it encodes this late at night.
I can’t even boot with the LTS kernel, just with the regular and zen kernels. I’m guessing I must have something installed that has a dependency on the newer kernel version? I’m not all that experienced in changing versions (and EndeavourOS is my first rolling distro), so I’m not entirely sure.