NFS /home client hopeless since Linux kernel 6.2

We've been running NFS /home on our Arch-based clients for years, and having more recently upgraded the hardware to Asus PN51-E1s, we moved to EndeavourOS… But for the past week it has been impossible to run mainline; luckily linux-lts is unaffected.

Currently running KDE Plasma:

Operating System: EndeavourOS 
KDE Plasma Version: 5.27.2
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 6.1.15-1-lts or 6.2.2-arch1-1 (64-bit)
Graphics Platform: X11
Processors: 12 × AMD Ryzen 5 5500U with Radeon Graphics
Memory: 30.7 GiB of RAM
Graphics Processor: AMD Radeon Graphics
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: MINIPC PN51-E1
System Version: 0505

The resulting mount (from /etc/fstab) is:
server:/home/richard on /home/richard type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=14,retrans=2,sec=sys,clientaddr=192.168.0.41,local_lock=none,addr=192.168.0.1)

/etc/fstab uses https://wiki.archlinux.org/title/NFS#Mount_using_/etc/fstab_with_systemd
plus, in /etc/nfsmount.conf:

[ MountPoint "/export/home" ]
background=True
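
For reference, that wiki method boils down to an fstab entry roughly like the following; this is the wiki's pattern with an illustrative mount point, not my literal line:

server:/home/richard   /home/richard   nfs4   noauto,x-systemd.automount,x-systemd.mount-timeout=10,timeo=14,_netdev 0 0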

The server is pure Arch x86_64 running 6.1.15-1-lts on a Supermicro H8SGL-F.

What are the symptoms?

Extremely long login times: on LTS it's nearly interactive (< 5 seconds) vs. a minute or two.
Using Dolphin to access /home or other NFS shares is excruciating, mostly on first access (long pauses as opposed to nearly interactive), and file accesses are clearly slower, though maybe not as exaggerated as folder openings.

Booting back and forth between LTS and mainline reproduces the problem.

It seems like perhaps a serious caching problem, or worse.

Does anybody else have a similar configuration with the same symptoms?

If it has appeared after a long time with no issues, then my first thought is: is the network time skewed?

NFS is highly sensitive to the server and client clocks being in sync.
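
A quick way to eyeball the skew from the client (assuming SSH access to the server; 'server' is a placeholder hostname):

$ date -u +%s ; ssh server date -u +%s
# If the two epoch values differ by more than a second or two, fix NTP first.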

Is that really needed?

I'm using an NFS mount on a Debian client to an EnOS server, and my client's fstab looks like this:

server:/mnt/nextcloudpi/data /mnt/nextcloudpi-data nfs defaults,soft,nfsvers=4,async 0 0

Resulting in this mount:

server:/mnt/nextcloudpi/data /mnt/nextcloudpi-data nfs4 rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.82.133,local_lock=none,addr=172.16.82.1 0 0

I serve the clients NTP from the server via DHCP. On the client:

$ timedatectl
               Local time: Mon 2023-03-06 16:54:31 CET
           Universal time: Mon 2023-03-06 15:54:31 UTC
                 RTC time: Mon 2023-03-06 15:54:31
                Time zone: Europe/Paris (CET, +0100)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
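
(If it helps anyone: on my server the DHCP side is a one-liner; this assumes dnsmasq is the DHCP server, and the address is the one from my setup above.)

dhcp-option=option:ntp-server,172.16.82.1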

Hmm, I use v4 defaults… my fstab looks something like this:

server:/home   /home  nfs4  _netdev,noauto,x-systemd.automount,x-systemd.mount-timeout=10,timeo=14 0 0

BTW, I'm way too scared to run 'async' in production!

I am confused. To which LTS are you referring?

6.1 is LTS. I assume the next reference is the client: is it also 6.1, or perhaps 5.15?

Is the increased login time related to the automatic roll from 5.15 to 6.1?

As indicated above, the client is running either 6.1.15-1-lts or 6.2.2-arch1-1,
where 6.1.15-1-lts is LTS (naturally)… mainline is already at 6.2.2.
The problem manifests itself when running mainline and is alleviated by running LTS.

BTW, I'm not sure what you're asking by 'related to the automatic roll…'?

Perhaps a regression with mainline - it wouldn't be the first time 🙂

I'm referring to linux-lts, which automatically rolls from one LTS series to the next.

Most recently 5.15.x → 6.1.x

My client and server are on the same machine; the client is a VM. I have to use NFS for the client/server connection because vmgfx has such lousy performance.

And async mode is way faster than sync mode. Write speed in async mode is more than 10 times faster with my setup (15 MB/s sync vs. 200 MB/s async, measured with fio).
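
For anyone who wants to reproduce the comparison, a minimal fio run along these lines is what I mean; the directory and sizes are illustrative:

$ fio --name=seqwrite --directory=/mnt/nextcloudpi-data --rw=write \
      --bs=1M --size=512M --ioengine=libaio --end_fsync=1
# Run once with the share mounted async and once with sync, and compare the
# reported write bandwidth; --end_fsync=1 makes sure the data actually
# reaches the server before the number is taken.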

Not having much time to spend on this, since running LTS doesn't exhibit the problem, this morning I believe I've determined that Qt may be at fault.
That is, Plasma ships Dolphin by default, which has the problem. So I installed pcmanfm and pcmanfm-qt to compare alternatives, because access to NFS from a terminal seems just fine.

pcmanfm on both LTS and mainline seems just fine.
pcmanfm-qt behaves like Dolphin: fine on LTS but dog-slow on mainline.

Can anybody else verify that? Or does anyone know a way to tune it? Perhaps it needs a larger read_ahead_kb or something.
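
In case someone wants to experiment, readahead for an NFS mount can be inspected and bumped via its BDI entry in sysfs; the MAJ:MIN value and sizes below are illustrative, yours will differ:

$ findmnt -no MAJ:MIN /home/richard
0:52
$ cat /sys/class/bdi/0:52/read_ahead_kb
128
$ echo 1024 | sudo tee /sys/class/bdi/0:52/read_ahead_kb   # try a larger value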

[added] And maybe it's just me, but it seems that on LTS, Dolphin loads fast and then asynchronously fills in the folder sizes later, a bit at a time… on mainline it looks perhaps synchronous, evidently taking a really long time to interrogate the NFS server for all of that… possibly a difference in the semantics of the underlying calls?
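
One way to compare the underlying calls would be to count the directory/stat syscalls Dolphin makes on each kernel; a rough sketch, with the path illustrative:

$ strace -c -f -e trace=getdents64,statx,newfstatat dolphin /home/richard
# Compare the call counts and time columns between the LTS and mainline boots.
# Counts may be incomplete if Plasma hands the work off to separate KIO
# worker processes rather than children of this strace.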

Bug filed https://bugs.kde.org/show_bug.cgi?id=467561