Desktop freezing randomly multiple times a day: how to know if it's the CPU or the motherboard

My desktop is freezing randomly multiple times a day. When this happens nothing works, so I enabled SysRq. When I press REISUB I see:

Not all CPUs entered broadcast exception handler

The next time this happens I will try to take a photo quickly.

I don’t have a lot of money right now.
Before I spend cash I want to be 100% sure. Is my CPU dying or is my motherboard faulty ?

CPUs rarely go bad. I would use memtest and run a demanding memory test for a good length of time or until you get errors. See if you have a bad memory stick first. If you get any errors, the trick is to figure out which stick is bad: test each stick, adding one at a time, and make sure they are installed and seated properly. Once you figure out which one is giving the errors, replace it or remove it. You’ll have less memory but at least it should work. Did you check the logs? If the memory is not bad, is the CPU overheating? Then you need to check the power supply as well. Narrow it down to the motherboard one thing at a time.
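
For the "check the logs" step, a rough sketch of a filter for the kernel log might look like the one below. The message in the first post comes from the kernel's machine-check handling code, so that is what the patterns look for. The function name `find_hw_errors` and the exact patterns are illustrative assumptions, and a hard freeze may prevent the message from ever reaching the journal.

```shell
# Filter kernel log text on stdin for lines that typically accompany
# machine-check exceptions or reported hardware errors.
# The pattern list is an assumption; extend it as needed.
find_hw_errors() {
    grep -i -E 'mce:|machine check|hardware error|broadcast exception' || true
}

# Usage (assumes systemd's journalctl with a persistent journal;
# -k limits output to kernel messages, -b -1 selects the previous boot):
#   journalctl -k -b -1 | find_hw_errors
```

If this prints nothing for the boot that froze, that does not prove the hardware is fine; it only means nothing was written to disk before the freeze.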

Let’s not forget that SSDs can die unexpectedly and completely, especially cheap ones.

So check:

  1. SSD with SMART

  2. RAM with memtest

  3. If nothing obvious turns up, rule out whatever you can by disconnecting everything except the essential components: MB + PSU + CPU + 1 stick of RAM + GPU (if you have no integrated one), and boot from a live USB.

I opened my desktop and removed the RAM stick (I have only one). I cleaned the RAM connector, then put the stick back. When I booted there was no beep and no display. Only the power indicator was glowing and the CPU and PSU fans were running. Then I removed the CPU heat sink and fan, cleaned the CPU socket, and reinstalled the CPU.

After I reinstalled the CPU my desktop booted.

Can I check SMART using the CLI? I have never done this before. Can you please tell me the command?
I will search now for how to run memtest.

Love this.

Is this a repeat of the same problem you were having before:

No. There is no considerable HDD activity this time.

sudo smartctl -a /dev/sdX

(c) https://wiki.archlinux.org/title/S.M.A.R.T.
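
The `sdX` in the command above is a placeholder for the actual drive. As a small sketch of how one might find the real name, the helper below turns `lsblk` output into device paths. The function name `disk_paths` is made up for illustration.

```shell
# Read "NAME TYPE" pairs on stdin (as printed by `lsblk -dn -o NAME,TYPE`)
# and print the /dev path of every whole disk, skipping optical drives etc.
disk_paths() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Usage:
#   lsblk -dn -o NAME,TYPE | disk_paths
# Then pass one of the printed paths to smartctl, e.g.:
#   sudo smartctl -a /dev/sda
```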

But it’s the same system that was having problems before, where you were also asked to check the health of the disk but didn’t reply with the information?

Yes

You shouldn’t need to remove the CPU, but there’s no harm in doing that. If you suspect overheating, what you should be doing is removing the heat sink and fan, cleaning them, and applying new thermal compound. The RAM has to be inserted properly until both clips snap into place on their own. That’s how you know you have inserted and seated it properly. You don’t use the clips on the ends of the slot to seat it. You press the RAM stick into the slot firmly until they snap in. A lot of people do this wrong, the RAM isn’t seated, and then it doesn’t boot or they get memory errors.

Edit: Also, as @keybreak mentioned, check your drives.

@ricklinux @keybreak

$ sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.1-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 750 EVO 120GB
Serial Number:    S33MNB0H582786V
LU WWN Device Id: 5 002538 d40eadaf3
Firmware Version: MAT01B6Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Apr  1 17:51:59 2022 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  64) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       14161
 12 Power_Cycle_Count       0x0032   094   094   000    Old_age   Always       -       5230
177 Wear_Leveling_Count     0x0013   076   076   000    Pre-fail  Always       -       119
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   043   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       13
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       260
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       18629203034

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               40%      7906         -
# 2  Extended offline    Aborted by host               90%      5986         -
# 3  Short offline       Completed without error       00%      5042         -
# 4  Short offline       Completed without error       00%      5041         -
# 5  Short offline       Completed without error       00%      5033         -
# 6  Short offline       Completed without error       00%      5032         -
# 7  Short offline       Completed without error       00%      5026         -
# 8  Short offline       Completed without error       00%      5025         -
# 9  Short offline       Completed without error       00%      5024         -
#10  Short offline       Completed without error       00%      5018         -
#11  Short offline       Completed without error       00%      5018         -
#12  Short offline       Completed without error       00%      5017         -
#13  Short offline       Completed without error       00%      5013         -
#14  Short offline       Completed without error       00%      5011         -
#15  Short offline       Completed without error       00%      5010         -
#16  Short offline       Completed without error       00%      5009         -
#17  Short offline       Completed without error       00%      5008         -
#18  Extended offline    Aborted by host               30%      5007         -
#19  Short offline       Completed without error       00%      5002         -
#20  Short offline       Completed without error       00%      5000         -
#21  Short offline       Completed without error       00%      4999         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I will run memtest as soon as I reboot. Is my SSD okay?

Looks good, except maybe the CRC error count (attribute 199 CRC_Error_Count, raw value 13).

@jonathon what would you say: should he change the SATA cable as the cause of the failures, or could it be an effect of those random freezes?
I’d say it looks more like an effect than a cause.
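
To pick out lines like that CRC count without reading the whole table, a sketch along these lines could be used. The function name `flag_smart_attrs` and the watch list of attribute IDs are assumptions for illustration, not an official list, and a non-zero raw value is a hint to look closer rather than proof of a fault.

```shell
# Read smartctl attribute-table output on stdin and print the name and raw
# value of watched attributes whose RAW_VALUE is greater than zero.
# Watched IDs (an assumption): 5, 179, 181, 182, 183, 187, 199.
# NF >= 10 restricts matching to full attribute rows (10 columns).
flag_smart_attrs() {
    awk 'NF >= 10 && $1 ~ /^(5|179|181|182|183|187|199)$/ && $NF + 0 > 0 { print $2, $NF }'
}

# Usage (-A prints only the attribute table):
#   sudo smartctl -A /dev/sda | flag_smart_attrs
```

Run against the output posted above, this would print only the `CRC_Error_Count` row, since the other watched attributes have a raw value of 0.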

Memtest is run from the live environment, right? How should I attach the result here? By taking a photo of the screen?

You don’t need to attach anything. If there are errors, it’s most likely a RAM problem, or a motherboard settings issue if, for example, you have manually overclocked or over/undervolted the RAM too much.

Be right back after the test.

Could be either, but also the raw value doesn’t necessarily mean anything. CRC errors should appear in the journal so if it’s a systematic issue it will show up there.
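
A rough way to look for such link errors in the journal is sketched below. The function name `find_ata_errors` is hypothetical, and the patterns are only examples of the kind of `ataN` messages the kernel logs for SATA link trouble; they are not exhaustive.

```shell
# Filter kernel log text on stdin for SATA/ATA link error messages.
# The patterns are illustrative assumptions, not a complete list.
find_ata_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: .*(exception|error|SError|hard resetting link)' || true
}

# Usage (kernel messages from the current boot):
#   journalctl -k -b | find_ata_errors
```

Repeated hits across several boots would point toward the cable or port; no hits would fit the idea that the 13 CRC errors are old or a side effect of the freezes.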

It’s been a long time since your last successful SMART self-test: 14161 - 5042 = 9119 power-on hours ago.

9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       14161
...
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               40%      7906         -
# 2  Extended offline    Aborted by host               90%      5986         -
# 3  Short offline       Completed without error       00%      5042         -

I would run a new self-test just to be sure.

sudo smartctl -t short /dev/sda
(OR sudo smartctl -t long  /dev/sda)

When the test is over you can read the result with:
sudo smartctl -a /dev/sda
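
The subtraction at the top of this post can also be done from the smartctl output directly. This is only a sketch: the function name `hours_since_last_good_test` is made up, and it assumes the self-test log is listed newest first and that Power_On_Hours has a plain numeric raw value, as in the output posted above.

```shell
# Read `smartctl -a` output on stdin and print the number of power-on hours
# elapsed since the most recent self-test that completed without error.
hours_since_last_good_test() {
    awk '
        # Attribute 9: the raw value is the last column.
        $1 == "9" && $2 == "Power_On_Hours" { poh = $NF }
        # First (newest) completed log entry: LifeTime(hours) is the
        # second-to-last column, before LBA_of_first_error.
        /^#/ && /Completed without error/ && last == "" { last = $(NF-1) }
        END { if (poh != "" && last != "") print poh - last }
    '
}

# Usage:
#   sudo smartctl -a /dev/sda | hours_since_last_good_test
```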

@ricklinux @keybreak

memtest result

IMG_20220402_075220
