Can Linux even cause these problems?

hesxenon · July 28, 2020, 11:22am

So this topic has a bit of a story already attached to it. I bought a new laptop recently, and apart from some initial trouble getting the RTX 2060 to work while using the internal display, it’s been good.

But a few weeks ago it started crashing and refusing to POST, and eventually it even crashed in the BIOS with no bootable storage installed at all. OK, obviously that’s a manufacturer problem: I sent it in, they replaced the motherboard, I got it back, installed my SSD again and just kept using it, no problem there.

Now it just crashed again, no tty switching or anything possible. So I reset it, booted again, no POST. Reset and boot again, POST and grub, initial log shows a kernel panic (before reaching DM or any graphical target). Reset and boot again, POST, grub, no kernel panic but screen stays black and suddenly a bunch of green lines came up covering the whole screen. Again, no tty switching or anything. Reset and boot again, now it seems to be stable.

So my question would be, can Linux even cause these problems? Like, what would have to happen for Linux to prevent a computer from POSTing? Or what could cause this complete lock-up without any logs? Could it maybe be overheating? Can I look for anything in the logs?

Should I just try to send it in again?

Kresimir · July 28, 2020, 11:24am

I’d send it in again, if it’s under warranty. It seems like a hardware issue.

Try booting a live image from http://www.memtest.org/ and run the tests.

JR29 · July 28, 2020, 11:50am

What is the make and model of the laptop? Some PCs have factory defects, so you might have the same problems even with Windows.

hesxenon · July 28, 2020, 12:13pm

It’s an XMG Neo 15, i7-9750H + 32GB RAM + RTX 2060. It just crashed again after a few hours. I let it cool for a few minutes and it started up without problems. I think I remember the kernel panic log mentioning “mktemp” and “can’t…temp zone”.

ricklinux · July 28, 2020, 12:20pm

Sounds like hardware issues to me. Did they replace it with a brand new motherboard or did they use one that was repaired? Is it overheating?

keybreak · July 28, 2020, 12:21pm

That must be some hardware issue like overheating / throttling / undervoltage from the PSU.

Try to get it completely replaced or get your money back, since you still have warranty, and search the internet to see if there’s a high fault rate for this particular model; sometimes it’s just poorly made cooling / bad parts.

I’ve seen that on some MSI laptops with 1070, for example.

Also, in theory it can be a dying SSD (because when that happens the whole system behaves completely randomly). What SSD do you have? Maybe you should try to boot with something else, if you have spare parts… Just to be sure.

keybreak · July 28, 2020, 12:24pm

Oh, and just to answer your question - yes, you can completely kill Linux by some stupid or accidental actions, so it won’t even go to TTY. For example if you change permissions on /bin or some other system folders or remove them with sudo…

But that’s definitely not your case here, surely you’d notice consequences of such actions immediately

P.S. Sorry to hear that you have such problems with new hardware…It’s always very frustrating

hesxenon · July 28, 2020, 12:51pm

Thx all, I think they replaced it with a new motherboard altogether.

I already did a memtest from a live-usb which reported no errors.

I don’t think it’s overheating, but then again I think I saw something about a tempzone being unreadable, so maybe the GPU overheats without me noticing? The CPU is at 50ºC so it should be OK, no?

My SSD is a Toshiba/OCZ RD400 (I think…) it’s a few years old so yeah, SSD death is a possibility by now I’d say. But could that lead to the system not posting at all? I mean, when I took out the SSD it still posted and reported “no boot medium found”.

I’m writing another ticket now, honestly it’s a great package with huge issues. Think if they’d revise it it would be a killer deal, but not like that.

Stephane · July 28, 2020, 12:57pm

Recheck all the options in the BIOS; the SSD should be on AHCI.
Have you installed lm_sensors?
Have you checked the log for any errors on boot?

keybreak · July 28, 2020, 12:59pm

50ºC for CPU is great, if readings are correct of course (which i assume they’re not, coz something is very off here).

Although with the Max-Q cooling design the CPU is connected to the GPU by the same heatpipe, so in theory if one of those parts got very hot the other one would too…

I’d recommend if you want something reliable for future go for Samsung EVO / pro, they’re little more pricey but damn reliable.

Yes, once an SSD starts to die, usually the whole system goes completely nuts; I’ve seen that a couple of times. But that is a relatively fast process, so once it’s dead you usually just can’t boot with the SSD connected…

Btw, to exclude the SSD completely, just remove it and boot from a live USB (see if it’ll POST at least), then stress test it with something heavy to see if it crashes from the live USB as well.

hesxenon · July 28, 2020, 1:07pm

It doesn’t seem to be stress related. Last time I tested gaming from the live disk with the SSD removed and it didn’t crash, although the temps were ~85ºC; then it crashed while I was in the BIOS menu…

SSD is on AHCI

lm_sensors is installed, anything I should do with it specifically?

which errors should I check for? There are a few lines in red and every now and then I see something with “error” in its message, but those seem to be present on every boot

I’d recommend if you want something reliable for future go for Samsung EVO / pro, they’re little more pricey but damn reliable.

Yeah, I was looking into the 970 Evo, but let’s see what IBM’s flash programme produces. I don’t really like Samsung as a company, but that’s another topic.

keybreak · July 28, 2020, 1:17pm

Can you repeat the same procedure on battery only?
I wonder if it may be some low-power state which can’t be satisfied due to under / over voltage from the PSU…

I’d say never trust corporation - trust product at hand!

Any of them; just post the results of these commands here in code tags:

sudo dmesg | grep -iE "error|failed"
journalctl -p3 -xb
systemctl --failed

Maybe something useful will come up
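
The `dmesg` filter above tends to print the same message many times over. As a rough sketch (the function name `summarize_errors` is made up for illustration and is not part of any tool), the timestamps can be stripped and identical lines counted, so that one-off errors stand out from the ones that repeat on every boot:

```shell
# Collapse repeated kernel messages: keep only error/failed lines,
# strip the leading "[    0.341674] " timestamp, then count identical
# lines and list the most frequent first.
summarize_errors() {
    grep -iE "error|failed" \
        | sed -E 's/^\[[ 0-9.]+\] //' \
        | sort | uniq -c | sort -rn
}

# Usage (dmesg needs root on most systems):
#   sudo dmesg | summarize_errors
```

Lines with a count of 1 are usually the ones worth reading first; messages that show up on every boot of a machine that otherwise works are less likely to explain a sudden lock-up.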

hesxenon · July 28, 2020, 1:36pm

# sudo dmesg | grep -iE "error|failed"
[    0.281255] ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [SAC1] at bit offset/length 64/64 exceeds size of target Buffer (64 bits) (20200326/dsopcode-198)
[    0.341674] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.341676] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.352226] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.352228] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.355288] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.355290] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.357423] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.357425] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.359349] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.359351] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.361211] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.361212] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.363649] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.363651] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.365641] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.365643] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.368544] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.368545] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.370560] ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
[    0.370561] ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
[    0.738705] pcieport 0000:00:1b.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
[    0.739248] pcieport 0000:00:1d.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
[    0.739706] pcieport 0000:00:1d.5: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
[    0.765145] RAS: Correctable Errors collector initialized.
[    3.515814] r8168: module verification failed: signature and/or required key missing - tainting kernel
[    3.776024] iwlwifi 0000:00:14.3: Direct firmware load for iwl-debug-yoyo.bin failed with error -2
[    4.149637] thermal thermal_zone3: failed to read out thermal zone (-61)
[    5.218482] uvcvideo: Failed to initialize entity for entity 6
[    5.218483] uvcvideo: Failed to register entities (-22).
[    5.421227] nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
[    5.421229] ucsi_ccg 0-0008: i2c_transfer failed -110
[    5.421231] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[    5.421235] ucsi_ccg: probe of 0-0008 failed with error -110
[    7.746623] audit: type=1103 audit(1595942003.485:39): pid=961 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=? acct="lightdm" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  107.689796] i915 0000:00:02.0: [drm] *ERROR* CPU pipe A FIFO underrun

# journalctl -p3 -xb
-- Logs begin at Mon 2020-04-20 12:09:33 CEST, end at Tue 2020-07-28 15:33:37 CEST. --
Jul 28 15:13:18 firstecho kernel: ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [SAC1] at bit offset/length 64/64 exceeds size of target Buffer (64 bits) (20200326/dsopcode-198)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Opcode is not deferred [SAC1] (BufferField) (20200326/nsinit-411)
Jul 28 15:13:18 firstecho kernel: ACPI Error: AE_TYPE, Could not execute arguments for [SAC1] (BufferField) (20200326/nsinit-419)
Jul 28 15:13:18 firstecho kernel: usbhid 1-6:1.0: couldn't find an input interrupt endpoint
Jul 28 15:13:21 firstecho kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
Jul 28 15:13:21 firstecho kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Jul 28 15:13:21 firstecho kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Jul 28 15:13:21 firstecho libvirtd[881]: direct firewall backend requested, but /usr/bin/ebtables is not available: No such file or directory
Jul 28 15:13:21 firstecho libvirtd[881]: internal error: Failed to initialize a valid firewall backend
Jul 28 15:13:21 firstecho libvirtd[881]: Cannot find 'dmidecode' in path: No such file or directory
Jul 28 15:13:21 firstecho libvirtd[881]: Cannot find 'dmidecode' in path: No such file or directory
Jul 28 15:13:31 firstecho lightdm[1005]: gkr-pam: unable to locate daemon control file
Jul 28 15:15:03 firstecho kernel: i915 0000:00:02.0: [drm] *ERROR* CPU pipe A FIFO underrun

# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

As for the battery test, I’ll do that when I get home in a few hours, thanks for the suggestion
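
The `thermal_zone3: failed to read out thermal zone (-61)` line above can be cross-checked by reading the kernel's thermal zones directly from sysfs; error 61 is ENODATA on Linux, i.e. the driver had no reading to give. A minimal sketch follows, where the function name `read_thermal_zones` and its directory parameter are illustrative choices, not part of lm_sensors. Which device sits behind zone 3 on this laptop is not known from the logs; the `type` file will tell.

```shell
# Print every thermal zone's name, type and temperature in degrees Celsius.
# The base directory is a parameter so the function can be pointed at a
# test tree; it defaults to the real sysfs location.
read_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # "temp" holds millidegrees Celsius; the read itself can fail
        # (for example with ENODATA), so handle that case explicitly.
        if temp=$(cat "$zone/temp" 2>/dev/null) && [ -n "$temp" ]; then
            echo "${zone##*/} $ztype $((temp / 1000))C"
        else
            echo "${zone##*/} $ztype unreadable"
        fi
    done
}

# Usage:
#   read_thermal_zones
```

A zone that is always unreadable is often just a device without a sensor value at that moment, so it is the zones that do report, and how high they go, that matter for an overheating theory.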

keybreak · July 28, 2020, 1:52pm

Looks like all those kernel errors / nvidia / pcieport ones might be something, but i have no idea…Maybe @Stephane and others will!

Jul 28 15:15:03 firstecho kernel: i915 0000:00:02.0: [drm] *ERROR* CPU pipe A FIFO underrun

This looks like some Intel GPU stuff; try changing to UXA as advised here https://bbs.archlinux.org/viewtopic.php?pid=1533863#p1533863

On ACPI errors, read this:

I doubt it’s the reason for such crashes, and which parameters are appropriate for your machine is a matter of experimentation, but it’s worth knowing…

ricklinux · July 28, 2020, 4:09pm

I think I would be leaning towards the drive being the issue. I personally would try a new drive and do a full new install or try first a new install. I don’t think the mktemp is related to temperature but rather file system related.

Looks like a lot of ACPI errors. Sometimes that’s just a bad implementation of the BIOS or UEFI, and may not be an issue but rather just information. Using a kernel parameter may rid you of some of those messages.

hesxenon · July 29, 2020, 2:48pm

So I just bought a 970 EVO 1TB. Starting it from the live usb with only the 970 installed resulted in a crash within minutes.

However, after inspecting the BIOS a little more carefully, I found that the CPU was undervolted from the factory settings (-50 mV). I changed this to 0 and was able to create a new live USB with the latest ISO, install it, and have pretty much completed my setup process (dotfiles, plugins etc). It has been running stable for the last 3.5 hrs.

Let’s see how this goes, will install the old ssd soon to get some files not included in the last backup

keybreak · July 29, 2020, 3:10pm

Yes, that very well may be the problem!
I’d advise letting it idle for some time and then stressing it with some heavy test for 2-10 hours (overnight) to see if there are 0 errors (otherwise try tweaking the voltage again).

Hopefully you’ll pass all that and be good to go

Great! I’m sure you’ll have a lot of fun with it for years to come, once all those crash problems are sorted
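
For an overnight stress run like the one suggested above, it can help to record temperatures to disk as you go, since a hard lock-up leaves nothing on screen. This is only a sketch under assumptions: the function name `log_temps_once`, the log path and the 5-second interval are arbitrary, and it reads the raw sysfs values (millidegrees Celsius) rather than going through lm_sensors.

```shell
# Append one timestamped line with all readable zone temperatures
# (raw millidegrees Celsius) to a log file, then flush it to disk.
log_temps_once() {
    base="$1"
    logfile="$2"
    line=$(date '+%F %T')
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        t=$(cat "$zone/temp" 2>/dev/null) || t="NA"
        line="$line ${zone##*/}=$t"
    done
    echo "$line" >> "$logfile"
    sync   # push the line to disk so it survives a hard lock-up
}

# Usage: sample every 5 seconds while the stress test runs elsewhere.
#   while true; do log_temps_once /sys/class/thermal "$HOME/temps.log"; sleep 5; done
```

After a crash, the last line of the log shows roughly how hot things were just before the freeze, which is useful evidence either way for a warranty ticket.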

ricklinux · July 29, 2020, 4:38pm

@hesxenon
How did those settings get changed?

keybreak · July 29, 2020, 4:48pm

I’d not be surprised if some “wise” technicians from support returned it that way, just in case so CPU/GPU won’t burn or something…

They must know more about that model!

What’s more surprising is that this BIOS is so advanced that it has voltage controls on a laptop, which is really rare without modding

hesxenon · August 1, 2020, 3:19pm

I can’t even… I’ve just been at a friend’s place: continuous freezes and the like, boot loop, no POST. Got home, thought “before I open this thing again to take out the SSD, let’s try one more time”. No hiccups so far.

Am going to run stress and furmark for a few hours over this night, but please can any of you think of any way that linux (mind the new SSD) would cause a system to stop POSTing?

This is important to me because it seems to be very hard to determine whether something like this is the manufacturer’s problem or my own. I don’t want them to be able to say “well, you installed Linux, that was your choice, now get lost”. I mean, apart from Linux damaging the hardware or corrupting the BIOS (possible?) I can’t think of anything that would make Linux the culprit here. Mind you, I’m only talking about the “no POST” issue.