Docker not switching context into the container after restoring snapshot, leading to root access to the host system

When I run a container, it seems to have no effect: I still have access to my shell history, I can still run any program, and so on. When I run an Ubuntu container and then run neofetch, I get the output for EndeavourOS. I already tried sudo pacman -Rns docker; sudo pacman -S docker, but the same thing happens again. I also tried booting a different kernel.

$ sudo docker run -it ubuntu
[root@0649c24117eb /]# neofetch
                     ./o.                  root@0649c24117eb
                   ./sssso-                -----------------
                 `:osssssss+-              OS: EndeavourOS Linux x86_64
               `:+sssssssssso/.            Host: MS-7C91 2.0
             `-/ossssssssssssso/.          Kernel: 6.4.3-273-tkg-pds-llvm
           `-/+sssssssssssssssso+:`        Uptime: 2 hours, 41 mins
         `-:/+sssssssssssssssssso+/.       Packages: 2514 (pacman), 46 (nix-user), 46 (nix-default),
       `.://osssssssssssssssssssso++-      Shell: bash 5.1.16
      .://+ssssssssssssssssssssssso++:     Resolution: 1920x1080
    .:///ossssssssssssssssssssssssso++:    CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
  `:////ssssssssssssssssssssssssssso+++.   GPU: AMD ATI Radeon RX 5600 OEM/5600 XT / 5700/5700 XT
`-////+ssssssssssssssssssssssssssso++++-   Memory: 25011MiB / 32018MiB
 `..-+oosssssssssssssssssssssssso+++++/`
   ./++++++++++++++++++++++++++++++/:.
  `:::::::::::::::::::::::::------``


[root@0649c24117eb /]#

journalctl output from the time when I start the container:

Jul 17 00:39:57 user-pc kernel: audit: type=1110 audit(1689547197.830:963): pid=27976 uid=1000 auid=1000 ses=4 subj=unconfined msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/4 res=success'
Jul 17 00:39:57 user-pc kernel: audit: audit_lost=2281 audit_rate_limit=0 audit_backlog_limit=64
Jul 17 00:39:57 user-pc kernel: audit: kauditd hold queue overflow
Jul 17 00:39:57 user-pc sudo[27976]: pam_unix(sudo:session): session opened for user root(uid=0) by user(uid=1000)
Jul 17 00:39:57 user-pc kernel: audit: type=1105 audit(1689547197.830:964): pid=27976 uid=1000 auid=1000 ses=4 subj=unconfined msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/4 res=success'
Jul 17 00:39:57 user-pc kernel: audit: audit_lost=2282 audit_rate_limit=0 audit_backlog_limit=64
Jul 17 00:39:57 user-pc kernel: audit: kauditd hold queue overflow
Jul 17 00:39:58 user-pc NetworkManager[936]: <info>  [1689547198.9512] manager: (veth18649e3): new Veth device (/org/freedesktop/NetworkManager/Devices/36)
Jul 17 00:39:58 user-pc kernel: docker0: port 1(vethb03dbd5) entered blocking state
Jul 17 00:39:58 user-pc kernel: docker0: port 1(vethb03dbd5) entered disabled state
Jul 17 00:39:58 user-pc kernel: vethb03dbd5: entered allmulticast mode
Jul 17 00:39:58 user-pc kernel: vethb03dbd5: entered promiscuous mode
Jul 17 00:39:58 user-pc kernel: audit: type=1700 audit(1689547198.950:965): dev=vethb03dbd5 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
Jul 17 00:39:58 user-pc kernel: docker0: port 1(vethb03dbd5) entered blocking state
Jul 17 00:39:58 user-pc kernel: docker0: port 1(vethb03dbd5) entered forwarding state
Jul 17 00:39:58 user-pc kernel: docker0: port 1(vethb03dbd5) entered disabled state
Jul 17 00:39:58 user-pc NetworkManager[936]: <info>  [1689547198.9518] manager: (vethb03dbd5): new Veth device (/org/freedesktop/NetworkManager/Devices/37)
Jul 17 00:39:59 user-pc containerd[967]: time="2023-07-17T00:39:59.011481118+02:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Jul 17 00:39:59 user-pc containerd[967]: time="2023-07-17T00:39:59.011514888+02:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 17 00:39:59 user-pc containerd[967]: time="2023-07-17T00:39:59.011524428+02:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Jul 17 00:39:59 user-pc containerd[967]: time="2023-07-17T00:39:59.011531128+02:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 17 00:39:59 user-pc systemd[1]: Started libcontainer container <some-ID>.
Jul 17 00:39:59 user-pc kernel: eth0: renamed from veth18649e3
Jul 17 00:39:59 user-pc kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethb03dbd5: link becomes ready
Jul 17 00:39:59 user-pc kernel: docker0: port 1(vethb03dbd5) entered blocking state
Jul 17 00:39:59 user-pc kernel: docker0: port 1(vethb03dbd5) entered forwarding state
Jul 17 00:39:59 user-pc NetworkManager[936]: <info>  [1689547199.0942] device (vethb03dbd5): carrier: link connected
Jul 17 00:39:59 user-pc NetworkManager[936]: <info>  [1689547199.0945] device (docker0): carrier: link connected
Jul 17 00:40:00 user-pc avahi-daemon[928]: Joining mDNS multicast group on interface vethb03dbd5.IPv6 with address <some-IP>.
Jul 17 00:40:00 user-pc avahi-daemon[928]: New relevant interface vethb03dbd5.IPv6 for mDNS.
Jul 17 00:40:00 user-pc avahi-daemon[928]: Registering new address record for <some-IP> on vethb03dbd5.*.

It changes the shell to bash, it changes the hostname, and it changes the user to root. I even have access to the home directory of the root user (Edit: on the host system), but I don’t have access to the home directory of any other user, including my own.

I’m very sure that this is not at all what is supposed to happen, and something is seriously fucked up.

It used to work normally a couple of weeks ago.

This could be related to a btrfs restore operation that I did a couple of weeks ago (I used this tutorial: https://wiki.archlinux.de/title/Arch_auf_BtrFS).

I also deleted containers before that, using docker system prune and a couple of other commands that I don’t remember. I did that because I ran out of space on my root partition while doing an update. This led to the system failing to boot, which was the reason for the btrfs restore operation.

What is also weird is that I can’t restart an old container; it fails with a permission error:

$ sudo docker container restart peaceful_fermi
Error response from daemon: Cannot restart container peaceful_fermi: mkdir /var/lib/docker/btrfs/subvolumes/<some_ID>/var: operation not permitted

This is very weird, because the permissions of the parent folder seem to be fine:

$ ls -lisah
total 0
       2 0 drwxr-xr-x 1 root root   0 17. Jul 00:49 ./
12364829 0 drwx--x--- 1 root root 24K 17. Jul 00:39 ../
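
I suppose I could also check whether btrfs still recognizes that path as a subvolume, with something like (same redacted ID as above):

$ sudo btrfs subvolume show /var/lib/docker/btrfs/subvolumes/<some_ID>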

The following command completed a few minutes ago (before I successfully ran a Docker container), but now it doesn’t complete at all and just hangs. It also hangs after a reboot:

$ sudo du -sh /var/lib/docker/
73M /var/lib/docker/

Other things that I noticed, which may or may not be related:

  • My system crashed twice during the last week.
  • Yesterday I had the problem that I couldn’t put my system into sleep mode.

Is there any way to recover from this? I’ve been using this system for three years now, and I really want to save it. Reinstalling would be a last resort for me.

I still have a couple of btrfs snapshots, but in its current state, I’m afraid to touch anything without consulting someone with more knowledge.

Sounds like a perfectly normal container run to me. Everything is isolated from the Docker host; if you want to share things, you have to configure that. It seems like you need to do some background reading to familiarize yourself with what a container is, what you should expect, and how you can expose host resources to the container. Read the Docker documentation, or one of the wide range of web pages and videos.

I think you don’t understand what I’m saying. My problem is that the container is not containerized and not isolated from the host. It basically is not a container. It has (almost) full root access to my host system. The context is not switched into the container. The root user in the container has access to my bash history, it has access to all my installed programs, and it can view files that are saved in the home directory of the root user on the host system. At the same time, it has no access to the root filesystem that should be inside the container. For example, I don’t have access to apt in an Ubuntu container; instead, I still have access to pacman:

$ sudo docker run -it ubuntu
[root@ecbc8aec5ec7 /]# apt
bash: apt: command not found
[root@ecbc8aec5ec7 /]# pacman
error: no operation specified (use -h for help)
[root@ecbc8aec5ec7 /]#
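
To double-check that the container really is in the host’s mount namespace, I suppose I could compare the namespaces of the container’s main process and of PID 1 on the host. A rough sketch (the sleep container is just a throwaway for the check):

$ CID=$(sudo docker run -d ubuntu sleep 60)
$ CPID=$(sudo docker inspect -f '{{.State.Pid}}' "$CID")
$ sudo readlink /proc/1/ns/mnt /proc/$CPID/ns/mnt   # identical mnt:[...] inodes would mean no mount namespace isolation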

OK, my bad.

I take it you have already used docker inspect to look for oddities?

I take it this happens with any Ubuntu container you run?
Have you tried with other containers like alpine or debian?
Tried removing the image and repulling?

Sounds like the namespace isolation isn’t being engaged, which you know, but why?

Do the systemd logs for docker or the containers show anything?
Kernel logs?
What about /etc/docker/daemon.json? How does that compare to the running dockerd arguments?
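
For example, something along these lines (a sketch; adjust as needed):

$ cat /etc/docker/daemon.json                              # the daemon config file, if it exists
$ ps -o args= -C dockerd                                   # the flags the running daemon was actually started with
$ sudo docker info | grep -iE 'storage driver|docker root' # the effective storage driver and data root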

I have had problems in the past trying to restore zfs snapshots of the Docker volumes, and ended up removing all the volumes etc. and rebuilding from scratch by destroying /var/lib/docker. Since I didn’t have any required data in Docker volumes and had scripting to rebuild, it was relatively painless.

Maybe you triggered this with your btrfs restore, if /var/lib/docker was within the scope of the restore.
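
If it comes to that, the rough shape of what I did was something like this (from memory, untested; keep a backup rather than deleting outright):

$ sudo systemctl stop docker.service docker.socket containerd
$ sudo mv /var/lib/docker /var/lib/docker.bak   # keep the old state around, just in case
$ sudo systemctl start docker                   # recreates a fresh /var/lib/docker
$ sudo docker run --rm hello-world              # sanity check with a freshly pulled image

Note that with the btrfs storage driver the old directory contains nested subvolumes, so cleaning up the backup later may need btrfs subvolume delete rather than a plain rm -rf.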


No problem. I guess this is a very weird issue; it took me a couple of minutes, too. I’m also not sure if this is a configuration problem, or maybe even a very serious bug in docker. Edit: Or in btrfs.

Sounds like the namespace isolation isn’t being engaged, which you know, but why?

That’s exactly what I’m thinking.

I just tried alpine, debian and ubuntu:focal, and they seem to be fine.

Maybe it really is just existing containers. It’s happening with ubuntu (which I think resolves to latest) and an ubuntu-based ROCm container.

I have the feeling that removing all images and downloading them again would fix the issue. The question is: WTF is happening, and can this happen again? I therefore don’t want to remove them to try it, at least for now.
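
If I do end up trying it later, I guess it would be roughly something like this (untested):

$ sudo docker container prune                       # remove stopped containers first
$ sudo docker image rm $(sudo docker image ls -q)   # then remove all images
$ sudo docker pull ubuntu                           # and re-pull what I need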

I couldn’t find anything weird in systemctl or journalctl. I didn’t check the kernel logs yet.

I don’t have a /etc/docker/daemon.json.

Maybe you triggered this with your btrfs restore if varlibdocker was within the context of the restore.

That could very well be. The timing fits with when the issue occurred, and it sounds like the kind of operation that changes enough for something like this to happen.

I take it you have already used docker inspect to look for oddities?

No, this is the first time I’ve heard of that command. What would you suggest doing with it?


If it is what I had with my zfs snapshot restore, then I suggest you wipe /var/lib/docker and rebuild from scratch; there is a lot of config in there, and you never know what is affected. I had a network that I could see but could not find, could not use, and could not remove. I can only surmise that it or its hash changed in some way, or that there was corruption.

The docker container inspect command allows you to look at the config JSON. It’s usually big, but you might find something, for example in the volumes, networks, etc., as pulled from the command line or Compose.
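
For instance, something like this (the container name is from your earlier post; just a sketch):

$ sudo docker container inspect peaceful_fermi | less                       # the full config JSON
$ sudo docker container inspect -f '{{.GraphDriver.Name}}' peaceful_fermi   # the storage driver backing the container
$ sudo docker container inspect -f '{{json .Mounts}}' peaceful_fermi        # volumes and bind mounts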

I don’t really have any concrete ideas what to look for; just anomalies.

I guess in that case I should move this topic to the bug report area?

Maybe, I’m still learning my way around here.

Pretty indistinct as a bug. Btrfs? Kernel? Docker? Containerd?

I don’t know that docker makes any data-consistency guarantees with respect to snapshots.

I don’t know much about btrfs either, as I use zfs almost exclusively (and it is much more mature than btrfs). I expected my restore to work, since it should be no different from a crash: the later data effectively was never written to that area. But it didn’t work. I tried things for a few hours, searched, and the nuclear option was the only timely one.

Even if docker doesn’t specifically say that it works with snapshots, I would think the default assumption is that it does, unless stated otherwise. Suddenly having root access from within a container can’t possibly be expected behavior, right? That’s not even a data-consistency problem; that’s a security issue.

If a similar problem occurs on both btrfs and zfs, I think we can rule out the file system?

Two instances are hardly a statistical population, but still…

I guess there was buffered data that wasn’t written in a safe and consistent manner. I don’t know what docker uses for config storage and whether it is relevant, or if it is something in the volume config itself.

It would be interesting to hear if anyone else has encountered this.

Certainly I found very little when I searched for the network problem. Since I could not use or remove the network, I didn’t have much choice but to destroy everything.

You seem to be able to address the object, so hopefully you can remove it. The question is: what else was affected? How can you be sure?

I found this explanation of how docker and btrfs interact. But I don’t really see how that would result in the problems that I’m experiencing.

I have the following subvolumes:

$ sudo btrfs subvolume list /
ID 256 gen 901393 top level 5 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@
ID 257 gen 866081 top level 5 path @root
ID 258 gen 866087 top level 256 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@/srv
ID 259 gen 866087 top level 256 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@/var/cache
ID 260 gen 866087 top level 256 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@/var/log
ID 261 gen 866087 top level 256 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@/var/spool
ID 262 gen 866087 top level 256 path timeshift-btrfs/snapshots/2022-02-05_09-57-38/@/var/tmp
ID 263 gen 901975 top level 5 path @username
ID 1125 gen 901393 top level 5 path timeshift-btrfs/snapshots/2022-02-05_09-57-17/@
ID 1421 gen 866081 top level 5 path snapshots/@_21.03.2022
ID 1529 gen 866081 top level 5 path snapshots/@root_05.06.22
ID 1530 gen 866082 top level 5 path snapshots/@username_05.06.22
ID 3298 gen 901393 top level 5 path timeshift-btrfs/snapshots/2023-06-21_23-00-02/@
ID 3300 gen 901975 top level 5 path @
ID 3301 gen 866081 top level 3300 path srv
ID 3302 gen 901642 top level 3300 path var/cache
ID 3303 gen 901968 top level 3300 path var/log
ID 3304 gen 901755 top level 3300 path var/spool
ID 3305 gen 901645 top level 3300 path var/tmp
ID 3317 gen 901393 top level 5 path timeshift-btrfs/snapshots/2023-07-12_23-00-01/@
ID 3318 gen 901522 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID1>-init
ID 3319 gen 901527 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID1>
ID 3320 gen 901526 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID2>-init
ID 3321 gen 901528 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID2>
$ cat /etc/fstab
UUID=<UUID> /boot/efi vfat umask=0077 0 2
UUID=<UUID> / btrfs rw,ssd,discard,noatime,subvol=@,subvolid=3300,compress=zstd 0 1
UUID=<UUID> / btrfs rw,ssd,discard,noatime,subvol=@root,subvolid=257,compress=zstd 0 1
UUID=<UUID> /home/username btrfs rw,ssd,discard,noatime,subvol=@username,subvolid=263,compress=zstd 0 2
UUID=<UUID> /root/arch_pool btrfs noauto,rw,ssd,discard,noatime,compress=zstd 0 2

Interesting explanation, but I don’t see how it relates to your failure mode.
To my btrfs-inexperienced eye, your volume structure doesn’t look like it is using btrfs subvols for docker as described on reddit?

Explicit docker volumes that you can independently roll back and forward do seem like a good idea, and that is what I am already using with zfs, but mounted on /local/docker; that is what I use /etc/docker/daemon.json for, to tell docker where the volumes are stored. But it didn’t help me. I rebooted, and the network object didn’t work, and I couldn’t delete it. Stopping docker and rolling back my docker configuration and volumes to a snapshot didn’t work either, even though I expected it to, which was frustrating; maybe I just didn’t go back far enough.
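
Mine is roughly just this (path as mentioned above; the data-root key relocates docker’s whole state directory, volumes included):

$ cat /etc/docker/daemon.json
{
  "data-root": "/local/docker"
}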

Neither of our issues sounds like it matches the reddit description. Your problem was that / was not chrooted in the docker container. The way that usually works is that docker has an image accessible as /a/b/c/d, and it does a chroot(/a/b/c/d); now / for the container is /a/b/c/d on the host. That didn’t happen for you. Now, if / had somehow ended up mounted on top of /a/b/c/d rather than the docker volume, it would kind of make sense; I could easily enough do a mount --bind to accomplish that. How the docker volume setup could fail that way is much more murky. It could be something to do with an inconsistency between the rollback and the remaining docker volumes, but I don’t know enough to even guess what happened here.
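
As a contrived illustration of that failure mode (definitely not something to run on a system you care about), a bind mount of the host root in place of the image root would behave exactly like what you describe:

$ sudo mount --bind / /a/b/c/d      # the "image root" is now just the host root again
$ sudo chroot /a/b/c/d /bin/bash    # the "container" sees the host filesystem as /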

Look at the last lines:

ID 3318 gen 901522 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID1>-init
ID 3319 gen 901527 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID1>
ID 3320 gen 901526 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID2>-init
ID 3321 gen 901528 top level 3300 path var/lib/docker/btrfs/subvolumes/<ID2>

I think what might be happening is that docker chroots into the wrong subvolume (into my root subvolume). On the other hand, why do I still have access to my bash history?
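
Maybe I can verify that by looking at what a container process actually has as its root. A sketch (the container name is a placeholder):

$ CPID=$(sudo docker inspect -f '{{.State.Pid}}' <container>)
$ sudo ls /proc/$CPID/root/    # the tree the container really sees as /
$ sudo findmnt -N $CPID /      # what is mounted at / in its mount namespace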

Ah … masked by my narrow screen width, and probably just not looking closely enough … but I still don’t see how rolling back or not rolling back your docker/btrfs volumes would have that effect. I guess you could test the edge cases manually, making some dummy volumes and paths and seeing if you get something weird after a timeshift back?

I didn’t use timeshift to do the restore; I did the following:

sudo btrfs subvolume snapshot timeshift-btrfs/snapshots/2023-06-21_23-00-02/@ @
cd @
sudo rmdir srv var/cache var/log var/spool var/tmp

sudo btrfs subvol create srv
sudo btrfs subvol create var/cache
sudo btrfs subvol create var/log
sudo btrfs subvol create var/spool
sudo btrfs subvol create var/tmp

After that, I manually edited fstab and regenerated grub.
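
For the grub part, if I remember correctly, that was just the usual:

$ sudo grub-mkconfig -o /boot/grub/grub.cfg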

I’ll see whether I have time to experiment more with this, but currently I don’t have much. Maybe someone with more time could try to reproduce this on another system?