Debugging a Crashed CentOS Server with Kdump

This is a simple guide on how to debug a crashed CentOS server using a kernel crash dump. The tool we use is Kdump, a feature of the Linux kernel that captures a crash dump when the server crashes. When the kernel crashes or panics, Kdump dumps the kernel memory into a vmcore file that we can use to troubleshoot the cause of the crash. By analyzing the dump we can determine the root cause of the crash, or we can send the dump to someone else if we are not familiar with reading it. The crash dump can be accessed directly via /proc/vmcore during the kernel crash, or it can be saved automatically to a locally accessible file system, to a raw device, or to a remote system accessible over the network.
The packages required for debugging a kernel crash are kexec-tools, crash and kernel-debuginfo. Note that kernel-debuginfo is published in the debuginfo repository, which may need to be enabled first, and its version must match the kernel that crashed. Install the packages using the commands:

yum install kexec-tools
yum install crash
yum install kernel-debuginfo

Then edit the GRUB configuration to reserve memory for the crash kernel. Add ‘crashkernel=128M’ to the GRUB_CMDLINE_LINUX line in /etc/default/grub, then regenerate the GRUB configuration using the command ‘grub2-mkconfig -o /boot/grub2/grub.cfg’.
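The edit can also be scripted. The following is only a minimal sketch: add_crashkernel is a hypothetical helper name, and it assumes the file contains a single double-quoted GRUB_CMDLINE_LINUX="..." line and that GNU sed is available. It leaves the file alone if a crashkernel= parameter is already there.

```shell
#!/bin/sh
# Sketch only: add_crashkernel is a hypothetical helper, not part of any package.
# Assumes a single line of the form GRUB_CMDLINE_LINUX="..." and GNU sed (-i).
add_crashkernel() {
    file=${1:-/etc/default/grub}   # grub defaults file (assumed location)
    size=${2:-128M}                # memory to reserve for the crash kernel

    # Leave the file untouched if a crashkernel= parameter is already set.
    if grep -q '^GRUB_CMDLINE_LINUX=.*crashkernel=' "$file"; then
        return 0
    fi

    # Append the parameter just before the closing double quote.
    sed -i "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"\$/\1 crashkernel=${size}\"/" "$file"
}

# Usage (as root), followed by regenerating grub.cfg:
#   add_crashkernel /etc/default/grub 128M
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

It is a good idea to back up the file before running something like this against it.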
After that we need to start the Kdump service and enable it to run at boot time.

systemctl start kdump.service
systemctl enable kdump.service

Reboot the server.
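After the reboot it is worth confirming that the memory was really reserved and the crash kernel is loaded. A minimal sketch, assuming the usual locations /proc/cmdline and /sys/kernel/kexec_crash_loaded (the second file contains 1 when a crash kernel is loaded); check_kdump_ready is a hypothetical helper name, and both paths can be overridden.

```shell
#!/bin/sh
# Sketch only: check_kdump_ready is a hypothetical helper name.
# Arguments are optional and default to the standard kernel interfaces.
check_kdump_ready() {
    cmdline=${1:-/proc/cmdline}
    loaded=${2:-/sys/kernel/kexec_crash_loaded}

    # The running kernel must have been booted with a crashkernel= reservation.
    if ! grep -q 'crashkernel=' "$cmdline"; then
        echo "no crashkernel= on kernel command line"
        return 1
    fi

    # The kernel reports 1 here once the capture kernel has been loaded.
    if [ "$(cat "$loaded" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi

    echo "kdump ready"
}

# Usage:
#   check_kdump_ready && systemctl is-active kdump.service
```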
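From now on, every time the server crashes, a dump is saved and we can troubleshoot the crash.
Here is an example of a server with a crash issue. We can see the crashes by running the ‘last’ command in the console; sessions that end in ‘crash’ were cut off by an unclean shutdown.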

root     pts/0        192.168.0.55     Tue Jul 31 11:35 - 11:35  (00:00)
root     pts/0        192.168.0.55     Mon Jul 30 22:48 - 10:55  (12:07)
root     pts/1        192.168.0.55      Mon Jul 30 15:46   still logged in
root     pts/0        192.168.0.55     Mon Jul 30 15:44 - 22:05  (06:21)
reboot   system boot  3.10.0-714.10.2. Mon Jul 30 15:42 - 13:04  (21:21)
root     pts/1        192.168.0.55      Mon Jul 30 14:06 - crash  (01:36)
root     pts/0        192.168.0.55     Mon Jul 30 14:03 - crash  (01:39)
reboot   system boot  3.10.0-714.10.2. Mon Jul 30 13:59 - 13:04  (23:04)
root     pts/0        192.168.0.55      Mon Jul 30 10:41 - crash  (03:17)
root     pts/0        192.168.0.55     Mon Jul 30 06:30 - 10:14  (03:43)
reboot   system boot  3.10.0-714.10.2. Mon Jul 30 06:28 - 13:04 (1+06:35)
root     pts/2        192.168.0.55   Mon Jul 30 06:09 - crash  (00:19)
root     pts/1        192.168.0.55     Mon Jul 30 05:21 - crash  (01:06)
root     pts/0        192.168.0.55     Mon Jul 30 05:21 - crash  (01:07)
reboot   system boot  3.10.0-714.10.2. Mon Jul 30 04:58 - 13:04 (1+08:05)
root     pts/0        192.168.0.55     Mon Jul 30 02:17 - crash  (02:40)
reboot   system boot  3.10.0-714.10.2. Mon Jul 30 02:16 - 13:04 (1+10:47)
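To get a quick number instead of reading the listing by eye, the output of ‘last’ can be piped through a small filter. This is just a sketch; count_crash_sessions is a hypothetical helper name, and it assumes the ‘ - crash ’ marker appears exactly as in the listing above.

```shell
#!/bin/sh
# Sketch only: count_crash_sessions is a hypothetical helper name.
# Reads `last` output on stdin and prints how many login sessions
# ended in "crash" (i.e. were cut off by an unclean shutdown).
count_crash_sessions() {
    awk '$0 ~ / - crash / { n++ } END { print n + 0 }'
}

# Usage:
#   last | count_crash_sessions
```

For the listing above this should print 7.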

This server definitely needs to be checked. Find the location of the crash dump files by checking the ‘path’ setting in /etc/kdump.conf. Normally the location is ‘/var/crash’.
Go to that directory.
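The lookup can be sketched as a small function. kdump_path is a hypothetical helper name; it assumes the ‘path’ directive is written as an uncommented ‘path <directory>’ line and falls back to /var/crash when no such line exists. If a separate dump target is configured in kdump.conf, the path is relative to that target, which this sketch does not handle.

```shell
#!/bin/sh
# Sketch only: kdump_path is a hypothetical helper name.
# Prints the dump directory configured in kdump.conf, or the
# default /var/crash when no uncommented "path" line is found.
kdump_path() {
    conf=${1:-/etc/kdump.conf}

    # Keep the value of the last uncommented "path" directive, if any.
    p=$(awk '$1 == "path" { v = $2 } END { print v }' "$conf" 2>/dev/null) || p=""

    echo "${p:-/var/crash}"
}

# Usage:
#   cd "$(kdump_path)" && ls -lath
```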

[root@pro3 crash]# pwd
/var/crash
[root@pro3 crash]# ls -lath
total 12K
drwxr-xr-x. 25 root root 4.0K Jul 30 15:43 ..
drwxr-xr-x   2 root root 4.0K Jul 28 00:33 127.0.0.1-2018-07-28-00:33:49
drwxr-xr-x.  3 root root 4.0K Jul 28 00:33 .
[root@pro3 crash]# cd 127.0.0.1-2018-07-28-00\:33\:49/
[root@pro3 127.0.0.1-2018-07-28-00:33:49]#
[root@pro3 127.0.0.1-2018-07-28-00:33:49]# ls
vmcore  vmcore-dmesg.txt
[root@pro3 127.0.0.1-2018-07-28-00:33:49]# crash vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

crash 7.2.0-6.el7.cloudlinux
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/3.10.0-714.10.2.lve1.5.17.1.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Sat Jul 28 00:20:48 2018
UPTIME: 04:24:55
LOAD AVERAGE: 0.75, 0.53, 0.57
TASKS: 488
NODENAME: pro3.internet.com
RELEASE: 3.10.0-714.10.2.lve1.5.17.1.el7.x86_64
VERSION: #1 SMP Tue May 22 10:39:25 EDT 2018
MACHINE: x86_64 (1995 Mhz)
MEMORY: 31.9 GB
PANIC: "Kernel panic - not syncing: Fatal machine check on current CPU"
PID: 0
COMMAND: "swapper/6"
TASK: ffff880174bcef90 (1 of 12) [THREAD_INFO: ffff880174bec000]
CPU: 6
STATE: TASK_RUNNING (PANIC)

crash>  files
PID: 0      TASK: ffff880174bcef90  CPU: 6   COMMAND: "swapper/6"
ROOT: /    CWD: /
No open files

crash> sys
KERNEL: /usr/lib/debug/lib/modules/3.10.0-714.10.2.lve1.5.17.1.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Sat Jul 28 00:20:48 2018
UPTIME: 04:24:55
LOAD AVERAGE: 0.75, 0.53, 0.57
TASKS: 488
NODENAME: pro3.internet.com
RELEASE: 3.10.0-714.10.2.lve1.5.17.1.el7.x86_64
VERSION: #1 SMP Tue May 22 10:39:25 EDT 2018
MACHINE: x86_64 (1995 Mhz)
MEMORY: 31.9 GB
PANIC: "Kernel panic - not syncing: Fatal machine check on current CPU"
crash>
crash> bt
PID: 0      TASK: ffff880174bcef90  CPU: 6   COMMAND: "swapper/6"
#0 [ffff88085f38bc78] machine_kexec at ffffffff8105b6bb
#1 [ffff88085f38bcd8] __crash_kexec at ffffffff81117272
#2 [ffff88085f38bda8] panic at ffffffff81696c18
#3 [ffff88085f38be28] mce_panic at ffffffff810432ea
#4 [ffff88085f38be68] do_machine_check at ffffffff81044a9a
#5 [ffff88085f38bf50] machine_check at ffffffff816a594f
[exception RIP: intel_idle+246]
RIP: ffffffff813a1946  RSP: ffff880174befe10  RFLAGS: 00010046
RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
RDX: 0000000000000000  RSI: ffff880174beffd8  RDI: 0000000000000006
RBP: ffff880174befe40   R8: 00000000000003b3   R9: 0000000000000018
R10: 00000000000003d6  R11: 0000000000000000  R12: ffff880174beffd8
R13: 0000000000000005  R14: 0000000000000030  R15: ffffffff81aa2710
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
---  ---
#6 [ffff880174befe10] intel_idle at ffffffff813a1946
#7 [ffff880174befe48] cpuidle_enter_state at ffffffff81529500
#8 [ffff880174befe80] cpuidle_idle_call at ffffffff81529659
#9 [ffff880174befec0] arch_cpu_idle at ffffffff8103658e
#10 [ffff880174befed0] cpu_startup_entry at ffffffff810f0845
#11 [ffff880174beff28] start_secondary at ffffffff81050aea
[15894.694014] mce: [Hardware Error]: CPU 6: Machine Check Exception: 4 Bank 3: fe00000000800400
[15894.689003]  [] bit_clear+0xdd/0x120
[15894.689003] mce: [Hardware Error]: TSC 1dfa85008bc6 MISC 3ffff
[15894.694014]
[15894.689003]  [] fbcon_clear+0x1b1/0x1f0
[15894.694014] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1532708447 SOCKET 0 APIC 1 microcode 713
[15894.689003]  [] fbcon_scroll+0x21d/0xd10
[15894.694014] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[15894.689003]  [] scrup+0x16c/0x180
[15894.694014] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[15894.694014] mce: [Hardware Error]: Machine check: Processor context corrupt
[15894.689003]  [] lf+0xa0/0xb0
[15894.694014] Kernel panic - not syncing: Fatal machine check on current CPU

 

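In this case the machine check exception (‘Fatal machine check on current CPU’, ‘Processor context corrupt’) points to a hardware fault in the CPU. We replaced the CPU and the problem was gone.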
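The status value in the ‘Machine Check Exception’ line (fe00000000800400 for Bank 3 here) can be decoded by hand as a sanity check. The sketch below only looks at the top flag bits of the status register as documented for Intel's IA32_MCi_STATUS (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). decode_mci_status is a hypothetical helper name, and this is not a replacement for ‘mcelog --ascii’, which the kernel message itself recommends.

```shell
#!/bin/sh
# Sketch only: decode_mci_status is a hypothetical helper name.
# Takes a 16-digit hex machine check status value and prints the
# names of the flag bits (63..57) that are set.
decode_mci_status() {
    status=${1#0x}
    if [ "${#status}" -ne 16 ]; then
        echo "expected 16 hex digits" >&2
        return 1
    fi

    # Work on the upper 32 bits only, to stay clear of signed 64-bit overflow.
    hi=$((0x${status%????????}))

    flags=""
    # Bit positions are relative to the upper word: bit 31 here is bit 63 overall.
    for pair in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( ($hi >> $bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done

    echo "${flags# }"
}

# Usage:
#   decode_mci_status fe00000000800400
```

For the value in this dump, UC (uncorrected error) and PCC (processor context corrupt) are both set, which matches the ‘Processor context corrupt’ line in the log.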
