Desktop hanging and OOM Killer invoked despite little active RAM use

Hi! I’m afraid I’ve hit the limits of my googling abilities. On a fresh session, my system seems to be fine, but after it’s been up for a day or so GNOME starts to hang intermittently until eventually it just gives up and crashes back to the login screen.

Logs show the out-of-memory killer getting invoked during many of the hangs, and a lot of times right before a crash. System monitors show memory use being very high, but it seems to be mostly cache. Example output of cat /proc/meminfo:

MemTotal:       16048316 kB
MemFree:          251668 kB
MemAvailable:     517940 kB
Buffers:           15180 kB
Cached:         10206088 kB
SwapCached:            0 kB
Active:          3105224 kB
Inactive:        2092108 kB
Active(anon):    2974588 kB
Inactive(anon):  1994528 kB
Active(file):     130636 kB
Inactive(file):    97580 kB
Unevictable:     9739436 kB
Mlocked:             288 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              2812 kB
Writeback:             0 kB
AnonPages:       4711480 kB
Mapped:           266304 kB
Shmem:           9993056 kB
KReclaimable:     380400 kB
Slab:             557468 kB
SReclaimable:     380400 kB
SUnreclaim:       177068 kB
KernelStack:       21472 kB
PageTables:        63660 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8024156 kB
Committed_AS:   22766032 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       67620 kB
VmallocChunk:          0 kB
Percpu:             6304 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1339392 kB
ShmemHugePages:  7927808 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      411540 kB
DirectMap2M:    13934592 kB
DirectMap1G:     3145728 kB

This was taken shortly before a crash of my desktop session, when I had 40ish browser tabs open (the majority of them unloaded) and Lutris and Discord left sitting open in a different workspace, as I recall.

Any advice on where to start troubleshooting this? I’ve found lots of advice reassuring me that large memory usage for cache is fine and good, because it’ll get reclaimed without delay if an active application needs it, but the issue is that that doesn’t… seem to be happening? Or something else is going on. I don’t suppose I just need to allocate swap?

If I have done my conversions right, this says you have 16 gigs of ram and about 10 gigs of that is being cached. However, for some reason you aren’t showing any swap use. Do you even have any swap space?

Anyways, the important bit of info here is the memory in the Unevictable column. Red Hat has a good breakdown of the meminfo file on their site: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-meminfo.

That column says that approximately 10 gigs of your ram is being used by actual programs. So about 10 gigs that can’t be released plus 10 gigs being cached. Something funky went down after that. I’m no expert at this. I’m just researching as I go and explaining what I find.
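
For reference, my conversions (meminfo’s “kB” are really KiB, so dividing by 1024² gives GiB):

MemTotal:     16048316 kB ≈ 15.3 GiB  (16 GiB minus whatever the firmware/iGPU reserves)
Cached:       10206088 kB ≈  9.7 GiB
Unevictable:   9739436 kB ≈  9.3 GiB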

However, for some reason you aren’t showing any swap use. Do you even have any swap space?

I don’t! I saw conflicting opinions on whether I needed it, and I have a slightly cramped SSD, so I haven’t set up any yet.

Anyways, the important bit of info here is the memory in the Unevictable column. Red Hat has a good breakdown of the meminfo file on their site:

That column says that approximately 10 gigs of your ram is being used by actual programs. So about 10 gigs that can’t be released plus 10 gigs being cached. Something funky went down after that.

That is a good point! I didn’t see how large that value was. However, I also see that Mlocked is only 288 kB, which according to the Red Hat page should mean it’s not userspace programs locking it down? Active(file) and Inactive(file) also seem small.

I also notice, however, that Shmem is very large: larger than Unevictable and nearly the size of the whole cache. I’m not sure what that indicates. An issue with my integrated Intel graphics?
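
(If I’ve read the proc docs right, each process’s share of shmem is reported as RssShmem in /proc/<pid>/status, so something like the line below should show whether any single process accounts for it. My understanding is that shmem owned on the kernel/driver side, like GEM buffers, wouldn’t be charged to any process, though.)

# rank processes by shared-memory RSS
grep -H '^RssShmem' /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -nr | head -15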

Some further searching with these new keywords turns up this: https://gitlab.freedesktop.org/drm/intel/-/issues/2924

I haven’t read that whole discussion carefully yet, but they suggest a ‘GEM objects leak’. I don’t know what that means, but:

[root@midnight 1]# cat i915_gem_objects
1686 shrinkable [0 free] objects, 7722487808 bytes
system: total:0x00000003d382f000 bytes
stolen-system: total:0x0000000004000000 bytes

7722487808 bytes is about the size of my current Unevictable: 7602144 kB. So it… seems like Intel graphics might just have serious memory leak issues on Linux? I am running a laptop (Samsung NP930QDB) docked to an external (1440p) display…
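
(For anyone wanting to check the same counter: I’m reading it out of the i915 debugfs directory. I believe the usual path is the one below, though the DRI card number can differ and it needs root:)

sudo cat /sys/kernel/debug/dri/0/i915_gem_objects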

This would be in line with similar cases I have seen regarding this issue, but I’ll try to dig some more and see what else I can come up with.

When you get a chance, can you see what oom_killer is killing off first?

Edit: Upon further research, I have found that some memory can be reported as cache and yet still land in the unevictable column because it isn’t actually droppable. This apparently applies to a lot of the Anon entries.

I found an old shell trick for forcing the cache to be flushed on my system: I was using head -c 50G /dev/zero | tail to load about 50 gigs into RAM (on a 64 gig system). You could do the same but change the size to something like 10G or 12G. That way you can see whether that cache is truly unevictable, by forcing a situation where it would normally be reclaimed.

Edit: this is only something you can try when the system has been running for some time and you actually have something in your cache. Also, I do think having swap is important. In almost every thread I found, issues like this seem to be reported by people with no swap. I’m no expert on the subject, so I don’t know why it is needed when you have sufficient RAM, but maybe it prevents some kind of problem.
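
If you do set one up, the basic swap file steps are roughly this (the usual recipe from the Arch wiki; adjust the size and path to taste, and note that btrfs needs some extra care):

sudo dd if=/dev/zero of=/swapfile bs=1M count=4096 status=progress   # create a 4 GiB file
sudo chmod 600 /swapfile                                             # readable by root only
sudo mkswap /swapfile                                                # format it as swap
sudo swapon /swapfile                                                # enable it for this boot
echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab     # keep it across reboots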

When you get a chance, can you see what oom_killer is killing off first?

Looking back through, it seems like steamwebhelper is the most common thing to get killed off first, but it’s not universal (sometimes I don’t leave it open). Krita is also a very common target if I’ve left it open without working in it. Brave doesn’t usually get outright killed until memory has gotten very tight, but it does become unresponsive when RAM is very full. It almost never touches Firefox, but I imagine that’s because I’m usually using it.

Example of OOM messages from the other day:

Oct 02 11:51:42 midnight kernel: Out of memory: Killed process 591923 (steamwebhelper) total-vm:5005596kB, anon-rss:377320kB, file-rss:140kB, shmem-rss:32524kB, UID:1000 pgtables:3204kB oom_score_adj:300
Oct 02 13:26:58 midnight kernel: Out of memory: Killed process 595314 (brave) total-vm:1193279704kB, anon-rss:183172kB, file-rss:256kB, shmem-rss:1184kB, UID:1000 pgtables:1476kB oom_score_adj:200
Oct 02 14:29:36 midnight kernel: Out of memory: Killed process 589436 (gnome-software) total-vm:1056080kB, anon-rss:92924kB, file-rss:588kB, shmem-rss:0kB, UID:1000 pgtables:540kB oom_score_adj:200
Oct 02 14:30:04 midnight kernel: Out of memory: Killed process 589262 (pipewire-pulse) total-vm:213832kB, anon-rss:57260kB, file-rss:596kB, shmem-rss:548kB, UID:1000 pgtables:236kB oom_score_adj:200
Oct 02 14:30:17 midnight kernel: Out of memory: Killed process 589753 (OpenTabletDrive) total-vm:4364192kB, anon-rss:41072kB, file-rss:112kB, shmem-rss:0kB, UID:1000 pgtables:488kB oom_score_adj:200
Oct 02 14:30:36 midnight kernel: Out of memory: Killed process 589260 (pipewire) total-vm:311760kB, anon-rss:31672kB, file-rss:780kB, shmem-rss:312kB, UID:1000 pgtables:208kB oom_score_adj:200
Oct 02 14:30:37 midnight kernel: Out of memory: Killed process 589400 (firewall-applet) total-vm:673184kB, anon-rss:30304kB, file-rss:364kB, shmem-rss:0kB, UID:1000 pgtables:348kB oom_score_adj:200
Oct 02 14:30:58 midnight kernel: Out of memory: Killed process 615877 (python3) total-vm:493468kB, anon-rss:27580kB, file-rss:584kB, shmem-rss:0kB, UID:1000 pgtables:264kB oom_score_adj:200
Oct 02 14:30:58 midnight kernel: Out of memory: Killed process 589708 (gsd-xsettings) total-vm:1143456kB, anon-rss:17140kB, file-rss:352kB, shmem-rss:0kB, UID:1000 pgtables:464kB oom_score_adj:200
Oct 02 14:30:58 midnight kernel: Out of memory: Killed process 589451 (xpad) total-vm:743840kB, anon-rss:11100kB, file-rss:272kB, shmem-rss:0kB, UID:1000 pgtables:236kB oom_score_adj:200

You can see it kind of flails around, frantically killing random things during the hang before a crash, though that doesn’t happen every time; sometimes it just kills one thing before a crash. I was wondering if it’s gotten worse in the last week or so because I’ve been playing more games, but looking back through journalctl, the periods with no OOM killer messages don’t seem to correlate with that very well.

I’ll try manually filling the memory, but given that the system just crashes if I try to launch a heavy application when the cache is very large, I’m not really expecting much. I can try to add a swap file (and probably shrink my Windows partition to make room); it seems like people in the GitLab thread saw some improvement from that (though the problem persists, just more slowly). I’m just worried that it’s going to fill up anyway and I’ll just wind up wearing down my SSD. We’ll see.

It’s not actually random, though. The oom_score_adj value at the end of every line of your OOM messages is an adjustment to a score the kernel gives each process, which takes into consideration the memory used by the process and its children in order to determine which one to kill. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html/optimizing_rhel_8_for_real_time_for_low_latency_operation/assembly_managing-out-of-memory-states_optimizing-rhel8-for-real-time-for-low-latency-operation
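
If you’re curious how the kernel is ranking things at any given moment, each process exposes its computed badness in /proc/<pid>/oom_score. A rough one-liner to list the current top candidates (names come from comm, so they get truncated to 15 characters):

# print "score  adj  name" for every process, worst OOM candidates first
for d in /proc/[0-9]*; do
    printf '%s\t%s\t%s\n' "$(cat $d/oom_score)" "$(cat $d/oom_score_adj)" "$(cat $d/comm)"
done 2>/dev/null | sort -nr | head -15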

Fair enough, I suppose.

Running head -c 10G /dev/zero | tail, it just returns Killed, and this appears in the logs:

Oct 04 23:08:36 midnight kernel: Out of memory: Killed process 1133008 (tail) total-vm:929744kB, anon-rss:924032kB, file-rss:512kB, shmem-rss:0kB, UID:1000 pgtables:1856kB oom_score_adj:200

Same if I try 5G or 1G, though it will accept 500M, with current MemAvailable sitting at 1003532 kB.
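
Out of curiosity I may also try telling the kernel to drop the reclaimable caches directly; if I’m reading the docs right, this only frees clean page cache and reclaimable slab, not shmem, so if the Cached number barely moves that would point at shmem again:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache + reclaimable slab (doesn't touch shmem/tmpfs)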

I’ll try making a swapfile tomorrow, maybe.

Alright, but which OOM killer are you using?

I… dunno? I haven’t modified it? Whatever Endeavour ships with? I figured it was just, like, part of the kernel.

EnOS doesn’t ship with any OOM managing software.

There is systemd-oomd.service, which is part of systemd. That should be disabled by default on a vanilla Arch install. Not sure if EnOS enables it by default.
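
A quick way to check on your install (just the standard systemd tooling):

systemctl is-enabled systemd-oomd.service   # enabled / disabled / static
systemctl is-active systemd-oomd.service    # whether it is actually running right now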

Personally, I would experiment with running the system without oomd enabled to see how it fares.

I would also experiment with it enabled or disabled combined with:

zram with or without a swap space (swap file or partition), or
zswap with a swap space.
(A minimal zram example is sketched right below.)
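
Assuming the zram-generator package (which is the usual route on Arch-based systems; sizes are in MiB, if I remember the config format correctly):

# /etc/systemd/zram-generator.conf
[zram0]
zram-size = min(ram / 2, 4096)      # zram device sized at half of RAM, capped at 4 GiB
compression-algorithm = zstd

Then reboot (or daemon-reload and start systemd-zram-setup@zram0.service) and swapon --show should list the zram device.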

If you are up for testing, you could have a terminal open with the following command running to monitor your memory (and swap) usage:

watch -n 2 free -m

https://wiki.archlinux.org/title/Improving_performance#RAM,_swap_and_OOM_handling

There is a kernel-level OOM killer, but some people use the systemd one; Pebcak has confirmed that that one is off by default, though.
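
If you want to be sure which one is doing the killing, the two log differently (a rough check, assuming journalctl’s --grep is available on your build):

journalctl -k -g 'Out of memory' --since today    # kills done by the kernel OOM killer
journalctl -u systemd-oomd --since today          # anything systemd-oomd decided to do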

Almost every recent article I read about swap says to enable it because it does more than just use the disk when memory is insufficient.

It can be just a very small swap; it doesn’t really matter, as long as it’s active.

Added a 1G swap file this morning. Also ran systemctl disable --now systemd-oomd, but things still seem to be getting killed?

Oct 05 16:32:37 midnight kernel: Out of memory: Killed process 1169128 (xdg-dbus-proxy) total-vm:1412640kB, anon-rss:1109320kB, file-rss:1648kB, shmem-rss:0kB, UID:1000 pgtables:2260kB oom_score_adj:200
Oct 05 18:47:12 midnight kernel: Out of memory: Killed process 1165347 (flatpak-session) total-vm:1118212kB, anon-rss:872836kB, file-rss:512kB, shmem-rss:0kB, UID:1000 pgtables:1808kB oom_score_adj:200
Oct 05 18:57:15 midnight kernel: Out of memory: Killed process 1165740 (brave) total-vm:1193271172kB, anon-rss:218456kB, file-rss:384kB, shmem-rss:816kB, UID:1000 pgtables:1540kB oom_score_adj:200
Oct 05 19:02:15 midnight kernel: Out of memory: Killed process 1886395 (Isolated Web Co) total-vm:3712996kB, anon-rss:665896kB, file-rss:316kB, shmem-rss:6364kB, UID:1000 pgtables:4596kB oom_score_adj:167

Swap use was very small until around 15:10, when it seemed to have suddenly filled very quickly since I’d last checked. Shared memory use grew slowly over time as normal.

Unevictable memory seems to be staying relatively small now: Unevictable: 1519632 kB at the time of writing (actually, as I’ve been poking around while writing this post it’s shrunk down further), and checking i915_gem_objects, that’s sitting at 1.1 gigs.

The system doesn’t seem very interested in evicting it, though. If I try to, for example, launch Steam, the system lags and then steamwebhelper gets sniped by the OOM killer.

Current /proc/meminfo:

Every 20.0s: cat /proc/meminfo                                 midnight: Thu Oct  5 19:36:43 2023

MemTotal:       16048316 kB
MemFree:          935600 kB
MemAvailable:    1397832 kB
Buffers:            4504 kB
Cached:         10378448 kB
SwapCached:         7528 kB
Active:          7075432 kB
Inactive:        6666372 kB
Active(anon):    6952672 kB
Inactive(anon):  6608468 kB
Active(file):     122760 kB
Inactive(file):    57904 kB
Unevictable:      216000 kB
Mlocked:             184 kB
SwapTotal:       1048572 kB
SwapFree:              8 kB
Zswap:            110136 kB
Zswapped:         991228 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       3566936 kB
Mapped:           199404 kB
Shmem:          10202364 kB
KReclaimable:     623912 kB
Slab:             815792 kB
SReclaimable:     623912 kB
SUnreclaim:       191880 kB
KernelStack:       18352 kB
PageTables:        57080 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     9072728 kB
Committed_AS:   21021944 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       71364 kB
VmallocChunk:          0 kB
Percpu:             6336 kB
HardwareCorrupted:     0 kB
AnonHugePages:    935936 kB
ShmemHugePages:  3317760 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      411540 kB
DirectMap2M:    13934592 kB
DirectMap1G:     3145728 kB

All this points towards a memory leak. One of your installed programs is probably going berserk and gobbling up your memory due to a programming error. The htop command should help you identify the culprit if you sort by memory (RES column).

I agree with nate on the memory leak. I’m particularly interested in the ~6 gigs allocated in the Active(anon) column as well as the additional ~6 gigs allocated in the Inactive(anon) column. From my reading, anon (anonymous) memory is usually dynamic memory allocated by a program and initialized to zero, and it doesn’t get charged to the process until the process actually starts touching that memory.

This poses a problem because we don’t know which process is allocating all of this memory, but if we could find it and kill it, that memory should be immediately returned to the system.

I found this page explaining why there is a difference between Active(anon) + Inactive(anon) and AnonPages: https://unix.stackexchange.com/questions/525092/activeanon-inactiveanon-is-more-than-anonpages

I saw another thread that pointed to a bunch of huge log files being stored in ram, but I can’t seem to find that one again.

All in all, something is leaking memory.
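
One more thing that might be worth a look, though take it with a grain of salt: I believe newer kernels report each process’s anonymous memory as RssAnon in /proc/<pid>/status, so something like this should rank processes by it (shmem held on the kernel/driver side, like those GEM objects, won’t be charged to any process here):

# top 15 processes by anonymous resident memory
grep -H '^RssAnon' /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -nr | head -15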

I did in fact think to check whether there was any process obviously using way too much memory; there doesn’t seem to be. Top lines of htop right now:

    PID USER       PRI  NI  VIRT   RES▽  SHR S  CPU% MEM%   TIME+  Command
 123028 jade        20   0 5743M  767M 80276 S   0.0  4.9  0:16.32 /usr/bin/krita
 776156 jade        20   0 3442M  686M 67944 S   0.7  4.4  3:36.95 /usr/lib/firefox/firefox -contentproc -childID 56 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
 705836 jade        20   0 3301M  583M 68156 S   0.0  3.7  1:07.33 /usr/lib/firefox/firefox -contentproc -childID 44 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
   3429 jade        20   0 3622M  542M 85992 S  30.8  3.5  4:08.77 /usr/lib/firefox/firefox -contentproc -childID 7 -isForBrowser -prefsLen 32594 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appom
   3008 jade        20   0 12.4G  484M  130M S   3.9  3.1 31:48.45 /usr/lib/firefox/firefox
 705977 jade        20   0 3322M  445M 62416 S   0.7  2.8  0:41.45 /usr/lib/firefox/firefox -contentproc -childID 45 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
   3168 jade        20   0 3192M  384M 56452 S   2.0  2.5 19:55.93 /usr/lib/firefox/firefox -contentproc -childID 2 -isForBrowser -prefsLen 27916 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appom
   3172 jade        20   0 2791M  341M 55288 S   0.0  2.2  3:01.89 /usr/lib/firefox/firefox -contentproc -childID 3 -isForBrowser -prefsLen 27916 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appom
   1910 jade        20   0 11.0G  340M 47092 S   5.3  2.2 12:12.67 /usr/bin/gnome-shell
 759850 jade        20   0 3000M  305M 65052 S   0.0  2.0  0:43.61 /usr/lib/firefox/firefox -contentproc -childID 54 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
  60886 jade        20   0 2716M  288M 67624 S   0.0  1.8  0:28.79 /usr/lib/firefox/firefox -contentproc -childID 24 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
 737080 jade        20   0 3055M  254M 50684 S   0.0  1.6  0:52.33 /usr/lib/firefox/firefox -contentproc -childID 50 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
 680389 jade        20   0 2615M  197M 72788 S   0.0  1.3  0:20.38 /usr/lib/firefox/firefox -contentproc -childID 43 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
 706611 jade        20   0 2568M  186M 69264 S   0.0  1.2  0:35.87 /usr/lib/firefox/firefox -contentproc -childID 46 -isForBrowser -prefsLen 32831 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -appo
 997257 jade        20   0 2031M  156M 16136 S   3.9  1.0  0:38.32 /usr/bin/kgx --gapplication-service
 997067 jade        20   0 2559M  155M 51196 S   0.0  1.0  0:10.15 /usr/lib/firefox/firefox -contentproc -childID 101 -isForBrowser -prefsLen 32897 -prefMapSize 234728 -jsInitLen 234236 -parentBuildID 20230928170531 -greomni /usr/lib/firefox/omni.ja -app

Not sure how I’d go about investigating an anonymous memory leak, but I’ll try to look into it later.

The problem, though, is that if the memory is anonymous then it hasn’t been associated with a process yet, so you wouldn’t be able to find it. That was what I was trying to get across with my last message.