Kernel null pointer dereference when encoding video?

Hardware info: https://0x0.st/85LL.txt
Journal from last boot (which ended in a crash but no stack trace): https://0x0.st/85L9.txt
Journal from five boots ago (which had stack traces - same glitch, but the logs are much larger): https://0x0.st/85LG.txt

Any ideas about what I can do about some crashing related to encoding video?

Long story short, within the past week I have started having instability in my system. Initially, it started as an issue when I was ripping a movie, but eventually I noticed it was happening even when just trying to re-encode a video into a more-easily-edited codec. Running ffmpeg to encode a video causes the issue to reproduce as long as the encode takes a sufficient amount of time (I’d say at least ten minutes of video encoding; shorter videos seem to work without error, longer ones always fail).

More details:

The only symptom apparent to a user is a system hang: the keyboard stops responding even to toggle keys, audio loops, and the display either freezes or goes black. The journals show a kernel null pointer dereference as the last event, sometimes with a full stack trace, occasionally with less information.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page

I may have also reproduced this once by playing Starfield. The Starfield hang didn’t repro the null pointer part, which is why I’m not sure if it is related; I’m including it for posterity. The behavior outside of the journal log was identical though.

Since I don’t exactly encode video on a regular basis, I’m not sure of exactly what may have changed to start the problem - my theory is some type of driver or kernel update, but it could very well be a hardware fault that just started. The last time I did any video rendering (successfully) was around a month ago.

Potentially-unrelated information:

  • While I’m running the amdgpu drivers now, I didn’t always; I used to run DaVinci Resolve Studio, which meant amdgpu-pro drivers for GPU acceleration. I wouldn’t be too surprised if there is some remnant of that left behind that I missed, but I don’t see anything.
  • I do have an active DP 2.0 to HDMI 2.1 adapter, which causes all sorts of weird things to happen via graphical display output to that device (AMD and HDMI 2.1 is a match made in the underworld, but it mostly-works for me), but this can happen even when the TV is unplugged.
  • I have not noticed any overheating going on. My fan curves tend toward letting things get a little warm before kicking up, but this case has a lot of room and airflow.
  • No other crashes with either other games or LLMs running locally. To be fair, I mostly play indie games that wouldn’t tax my GPU particularly hard and my LLMs are just basic code assistance models that aren’t doing much either.

Thanks for reading this far and I hope you all can help me! :slight_smile:

What are the current outputs of:

journalctl -k --grep='microcode:'
and
yay -Qs amd-ucode ?

Microcode:

Jul 26 16:59:07 Naisa-Linux kernel: microcode: Current revision: 0x0a201211
Jul 26 16:59:07 Naisa-Linux kernel: microcode: Updated early from: 0x0a201205

amd-ucode:

local/amd-ucode 20250708-1
    Microcode update image for AMD CPUs

Ah okay, no clue why, but your current microcode revision doesn’t match mine. That’s probably unrelated, though, since I’ve got a different CPU model.

Anyway:

Machine:
  Type: Desktop Mobo: ASRock model: X570 Taichi serial: <superuser required>
    UEFI: American Megatrends v: P4.80 date: 02/16/2022

You’ve got the latest microcode. But the BIOS you’re using seems to be quite old.

Not too surprising, I tend toward the “only update the BIOS when necessary” approach to keeping that up to date. I probably last updated it when I put in the 5900X (it was originally a 3900X that had a hardware fault).

Think it might help to flash the BIOS?

There is at least a non-beta BIOS revision 5.60 available for your Motherboard.

All right, flashed BIOS, can still reproduce the problem.

EDIT: Here’s the most recent log: https://0x0.st/85IX.txt

Interestingly enough, the crash seems to be slightly different now.

Jul 27 13:42:34 Naisa-Linux kernel: #PF: error_code(0x0011) - permissions violation
Jul 27 13:42:34 Naisa-Linux kernel: #PF: supervisor instruction fetch in kernel mode
Jul 27 13:42:34 Naisa-Linux kernel: BUG: unable to handle page fault for address: ffffffffc060d5a0
Jul 27 13:42:34 Naisa-Linux kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
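
For reference, the error_code values in these #PF lines are x86 page-fault bit flags. Here is a minimal sketch of decoding the low bits, assuming the value is passed in hex with a 0x prefix. decode_pf_error is a hypothetical helper name, not an existing tool, and only bits 0, 1, 2 and 4 are handled.

```shell
# Decode the low bits of an x86 page-fault error code, as printed in
# "#PF: error_code(0x....)". Hypothetical helper; ignores the rarer bits.
decode_pf_error() {
    code=$(( $1 ))    # expects hex with a 0x prefix, e.g. 0x0011

    # Bit 0: 0 = page not present, 1 = protection (permissions) violation
    if [ $(( code & 1 )) -ne 0 ]; then
        presence="permissions violation"
    else
        presence="not-present page"
    fi

    # Bit 4: instruction fetch; otherwise bit 1 separates write from read
    if [ $(( code & 16 )) -ne 0 ]; then
        access="instruction fetch"
    elif [ $(( code & 2 )) -ne 0 ]; then
        access="write access"
    else
        access="read access"
    fi

    # Bit 2: 0 = supervisor (kernel) mode, 1 = user mode
    if [ $(( code & 4 )) -ne 0 ]; then
        mode="user mode"
    else
        mode="kernel mode"
    fi

    printf '%s, %s, %s\n' "$presence" "$access" "$mode"
}

# Usage:
#   decode_pf_error 0x0010
#   decode_pf_error 0x0011
```

The only difference between 0x0010 and 0x0011 is bit 0: the first fault was an instruction fetch from an unmapped address (NULL), the second was a fetch from a page that is mapped but not executable.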

Hi,

Based on the journal from five boots ago, the culprit seems to be that you are swapping hard when the kernel crashes:

Jul 21 22:22:52 Naisa-Linux kernel: Oops: Oops: 0002 [#1] SMP NOPTI
Jul 21 22:22:52 Naisa-Linux kernel: CPU: 1 UID: 0 PID: 191 Comm: kswapd0 Tainted: G
...
Jul 21 22:22:52 Naisa-Linux kernel:  ? dm_create+0x2f0/0x5a0 [dm_mod e4bb4525dc837c921740f257aa76c3a30c4bfe68]
Jul 21 22:22:52 Naisa-Linux kernel:  ? kswapd+0x1d9/0x370

kswapd kicks in and tries to “rebalance” memory by paging out to disk.

So, it looks like you are running out of memory, and fast, and the crash happens during heavy swapping.

What kind of videos are you encoding? Encoding can be very, very memory intensive.

Also, don’t forget to check the health of your disk and RAM.

Hrm. I mean, I can monitor my memory usage and trigger it again, but I would be a little surprised if I ran out of RAM at 64 GB of RAM.

The ways I can easily reproduce the issue are either re-encoding a half-hour-long recording I made on my phone (1080p/60, I think the source codec is H.264) or ripping a 4k BR that I own (so… 4k/24 at whatever codec 4k blu-rays use). I’m pretty sure ffmpeg is in use either way.

I haven’t checked the health of my RAM recently (the last time I did would have been around nine months ago), but I have checked my SSDs more recently - no issues there, and I can encode video to any of my four SSDs and still reproduce the issue. I’ll do some memory testing overnight to check that out.

Thank you both for help with troubleshooting, by the way!

Heh, you’d be surprised. I have handbrake right now eating about 40 GB of RAM :smiley:
The system would not be swapping if it did not “feel” memory pressure.

But this one is very interesting

Jul 27 13:42:34 Naisa-Linux kernel: #PF: error_code(0x0011) - permissions violation
Jul 27 13:42:34 Naisa-Linux kernel: #PF: supervisor instruction fetch in kernel mode
Jul 27 13:42:34 Naisa-Linux kernel: BUG: unable to handle page fault for address: ffffffffc060d5a0
Jul 27 13:42:34 Naisa-Linux kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)

Can you provide the full journal?

https://0x0.st/85IX.txt
That should be the most recent one.

Monitored RAM and triggered the issue again - I was using 13ish GB of RAM when it crashed (out of my 64 GB of RAM) and didn’t touch swap at all. I’m not sure why that one crash was swapping - maybe something started looping or something else running had a really large memory leak and was unrelated?
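
For anyone wanting to repeat this kind of check, here is a rough sketch of logging memory and swap usage while the encode runs. It assumes the usual /proc/meminfo fields; mem_snapshot and the log path in the usage comment are made-up names for illustration.

```shell
# Print available memory and used swap in MiB.
# Reads /proc/meminfo by default; another file can be passed for testing.
mem_snapshot() {
    meminfo=${1:-/proc/meminfo}
    awk '
        $1 == "MemAvailable:" { avail = $2 }        # values are in kB
        $1 == "SwapTotal:"    { swap_total = $2 }
        $1 == "SwapFree:"     { swap_free = $2 }
        END {
            printf "MemAvailable=%d MiB SwapUsed=%d MiB\n",
                   avail / 1024, (swap_total - swap_free) / 1024
        }
    ' "$meminfo"
}

# Usage while reproducing (log path is an assumption). The sync after each
# line is there so the last entries have a chance to survive a hard hang:
#   while sleep 5; do
#       printf '%s %s\n' "$(date +%T)" "$(mem_snapshot)" >> "$HOME/mem.log"
#       sync
#   done
```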

Anyway, I’ll memtest overnight to see if anything crops up.

Memtest came back clean overnight.

I’ve checked the last journal, but unfortunately there was not much detail in it. I guess the journal entries were not synced to disk when the error happened. You can try changing the sync interval to every 5 seconds, instead of the default 5 minutes, to see if we get more info.

Edit /etc/systemd/journald.conf, uncomment the SyncIntervalSec line, set it to 5s instead of 5m, and reboot.
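
An alternative to editing the main file is a drop-in, which keeps the change easy to revert. This is a minimal sketch, assuming the standard /etc/systemd/journald.conf.d directory; the helper name and the 10-sync-interval.conf file name are arbitrary choices for illustration.

```shell
# write_journald_sync_dropin CONF_DIR [INTERVAL]
# Hypothetical helper: writes a journald drop-in overriding SyncIntervalSec.
write_journald_sync_dropin() {
    conf_dir=$1
    interval=${2:-5s}    # defaults to the 5s suggested above

    mkdir -p "$conf_dir" || return 1
    printf '[Journal]\nSyncIntervalSec=%s\n' "$interval" \
        > "$conf_dir/10-sync-interval.conf"
}

# On a real system, run as root:
#   write_journald_sync_dropin /etc/systemd/journald.conf.d 5s
#   systemctl restart systemd-journald    # or reboot, as suggested above
```

Deleting the drop-in file and restarting journald again restores the default interval.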

Some other things to consider: did you overclock your CPU? Have you tried using the LTS kernel?

You are saying that you can reliably reproduce the issue, so with a little luck we should be able to track it down.

I mean, it takes around 20-30 minutes to reproduce, but yes - I have a 100% repro rate and relative stability for just about everything else. That’s why I’m hoping I can be helpful / can get help. :slight_smile: I’ve changed my journald settings, we’ll see if my next crash tells us more.

No CPU overclocking outside of default turboing behavior. RAM is technically overclocked, but I’m just using the XMP/DOCP profile. Even then, I was actually running on stock DDR4 speeds when I updated my BIOS since I forgot to re-enable XMP, so that means I’ve accidentally tested that as well (and was still able to repro).

I’m not using the LTS kernel, but I suppose I could give it a shot to see if I can still reproduce. It’ll at least tell me if this is a kernel-level change. I’ll test it out tomorrow since I don’t want to sit around for the 20-30 minutes while it encodes this late at night.

I can’t even boot with the LTS kernel, just with the regular and zen kernels. I’m guessing I must have something installed that has a dependency on the newer kernel version? I’m not all that experienced in changing versions (and EndeavourOS is my first rolling distro), so I’m not entirely sure.

Annnd it is working again without issue.

I’m guessing some kernel update, perhaps? Mostly mentioning in case someone else has a similar problem and searches.
