This is a simple guide on how to debug a crashed CentOS server using a kernel crash dump. The tool we use is Kdump, a feature of the Linux kernel that creates a crash dump when the server crashes. When the kernel crashes or panics, Kdump dumps the contents of memory into a vmcore file that we can use to troubleshoot the cause of the crash. By analyzing the dump we can determine the root cause of the crash, or we can send the dump to someone else if we are not familiar with reading it. The crash dump can be accessed directly via /proc/vmcore from the capture kernel that boots right after the crash, or it can be saved automatically to a locally accessible file system, to a raw device, or to a remote system accessible over the network.
The packages required for debugging a kernel crash are kexec-tools, crash and kernel-debuginfo. Install them with the following commands:
yum install kexec-tools
yum install crash
yum install kernel-debuginfo
Then edit the grub file to reserve memory for the crash dump. Just add ‘crashkernel=128M’ to the kernel command line, then update grub using the command ‘grub2-mkconfig -o /boot/grub2/grub.cfg’.
After that we need to start the Kdump service and enable it so that it runs at boot time.
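As a rough sketch of that edit, the helper below sets crashkernel=128M on the GRUB_CMDLINE_LINUX line. It assumes the grub file is /etc/default/grub (the usual location on CentOS 7) and that the value is wrapped in double quotes. The function name add_crashkernel is made up for this example. The line may already contain something like crashkernel=auto, in which case the value is replaced instead of appended.

```shell
#!/bin/sh
# Hypothetical helper: set crashkernel=128M in a grub defaults file.
# Assumes a line of the form GRUB_CMDLINE_LINUX="..." exists in the file.
add_crashkernel() {
    grub_file="${1:-/etc/default/grub}"
    # Keep a backup copy next to the original before changing anything.
    cp "$grub_file" "$grub_file.bak" || return 1
    if grep -q '^GRUB_CMDLINE_LINUX=.*crashkernel=' "$grub_file.bak"; then
        # Replace the existing value, for example crashkernel=auto.
        sed 's/^\(GRUB_CMDLINE_LINUX=.*\)crashkernel=[^ "]*/\1crashkernel=128M/' \
            "$grub_file.bak" > "$grub_file"
    else
        # Append the parameter just before the closing double quote.
        sed 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 crashkernel=128M"/' \
            "$grub_file.bak" > "$grub_file"
    fi
}
```

Run ‘add_crashkernel’ as root, check the result by eye, and then regenerate the configuration with the grub2-mkconfig command above.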
systemctl start kdump.service
systemctl enable kdump.service
Reboot the server.
Now every time the server crashes, Kdump saves a dump that we can use to troubleshoot that server.
Here is an example of a server that has a crash issue. We can tell by running the ‘last’ command in the console.
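Before relying on it, it is worth checking that Kdump is really armed after the reboot. The sketch below looks for crashkernel= on the kernel command line and checks that the crash kernel is loaded (the kernel reports this in /sys/kernel/kexec_crash_loaded). The function name kdump_ready and its two optional arguments are assumptions for this example, added so that the check can be tried against other files.

```shell
#!/bin/sh
# Hypothetical helper: report whether kdump appears ready to capture a crash.
kdump_ready() {
    cmdline_file="${1:-/proc/cmdline}"
    loaded_file="${2:-/sys/kernel/kexec_crash_loaded}"
    # Memory must have been reserved at boot with crashkernel=...
    if ! grep -q 'crashkernel=' "$cmdline_file"; then
        echo "no crashkernel= on the kernel command line"
        return 1
    fi
    # The kernel writes 1 here once the capture kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel is not loaded"
        return 1
    fi
    echo "kdump looks ready"
}
```

To test the whole chain on a machine that can afford downtime, a panic can be forced as root with ‘echo c > /proc/sysrq-trigger’. This crashes the server immediately, so do not run it on a production box.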
root pts/0 192.168.0.55 Tue Jul 31 11:35 - 11:35 (00:00)
root pts/0 192.168.0.55 Mon Jul 30 22:48 - 10:55 (12:07)
root pts/1 192.168.0.55 Mon Jul 30 15:46 still logged in
root pts/0 192.168.0.55 Mon Jul 30 15:44 - 22:05 (06:21)
reboot system boot 3.10.0-714.10.2. Mon Jul 30 15:42 - 13:04 (21:21)
root pts/1 192.168.0.55 Mon Jul 30 14:06 - crash (01:36)
root pts/0 192.168.0.55 Mon Jul 30 14:03 - crash (01:39)
reboot system boot 3.10.0-714.10.2. Mon Jul 30 13:59 - 13:04 (23:04)
root pts/0 192.168.0.55 Mon Jul 30 10:41 - crash (03:17)
root pts/0 192.168.0.55 Mon Jul 30 06:30 - 10:14 (03:43)
reboot system boot 3.10.0-714.10.2. Mon Jul 30 06:28 - 13:04 (1+06:35)
root pts/2 192.168.0.55 Mon Jul 30 06:09 - crash (00:19)
root pts/1 192.168.0.55 Mon Jul 30 05:21 - crash (01:06)
root pts/0 192.168.0.55 Mon Jul 30 05:21 - crash (01:07)
reboot system boot 3.10.0-714.10.2. Mon Jul 30 04:58 - 13:04 (1+08:05)
root pts/0 192.168.0.55 Mon Jul 30 02:17 - crash (02:40)
reboot system boot 3.10.0-714.10.2. Mon Jul 30 02:16 - 13:04 (1+10:47)
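For a quick number instead of reading the list by eye, the sessions that ended with ‘crash’ can be counted. This is only a small sketch; count_crashes is a made-up name and it simply counts lines of ‘last’ output that contain ‘- crash’.

```shell
#!/bin/sh
# Hypothetical helper: count login sessions that `last` marks as ended by a crash.
# Reads `last` output on stdin and prints the number of matching lines.
count_crashes() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c -- '- crash' || true
}
```

Usage would be ‘last | count_crashes’. Several sessions can end in the same crash, so the number counts interrupted sessions, not distinct crashes.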
This server definitely needs to be checked. Find the location of the crash dump file by checking the ‘path’ setting in /etc/kdump.conf. Normally the location is ‘/var/crash’.
Go to that directory.
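The lookup can also be scripted. The sketch below reads the ‘path’ directive from the config file and falls back to /var/crash when the directive is not set. The function name kdump_path is an assumption for this example, and it only handles the simple case of a dump saved to the local file system.

```shell
#!/bin/sh
# Hypothetical helper: print the dump directory configured in kdump.conf.
kdump_path() {
    conf="${1:-/etc/kdump.conf}"
    # Take the last uncommented "path" line; commented lines start with "#"
    # and therefore never have "path" as their first field.
    p=$(awk '$1 == "path" { v = $2 } END { print v }' "$conf")
    # Fall back to /var/crash when no path directive is set.
    echo "${p:-/var/crash}"
}
```

For example, ‘ls -lt "$(kdump_path)"’ lists the saved dumps with the newest first.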
[root@pro3 crash]# pwd
/var/crash
[root@pro3 crash]# ls -lath
total 12K
drwxr-xr-x. 25 root root 4.0K Jul 30 15:43 ..
drwxr-xr-x 2 root root 4.0K Jul 28 00:33 127.0.0.1-2018-07-28-00:33:49
drwxr-xr-x. 3 root root 4.0K Jul 28 00:33 .
[root@pro3 crash]# cd 127.0.0.1-2018-07-28-00\:33\:49/
[root@pro3 127.0.0.1-2018-07-28-00:33:49]#
[root@pro3 127.0.0.1-2018-07-28-00:33:49]# ls
vmcore vmcore-dmesg.txt
Open the vmcore with the ‘crash’ utility, together with the debug vmlinux that kernel-debuginfo installed.
[root@pro3 127.0.0.1-2018-07-28-00:33:49]# crash vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

crash 7.2.0-6.el7.cloudlinux
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/3.10.0-714.10.2.lve1.5.17.1.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Sat Jul 28 00:20:48 2018
UPTIME: 04:24:55
LOAD AVERAGE: 0.75, 0.53, 0.57
TASKS: 488
NODENAME: pro3.internet.com
RELEASE: 3.10.0-714.10.2.lve1.5.17.1.el7.x86_64
VERSION: #1 SMP Tue May 22 10:39:25 EDT 2018
MACHINE: x86_64 (1995 Mhz)
MEMORY: 31.9 GB
PANIC: "Kernel panic - not syncing: Fatal machine check on current CPU"
PID: 0
COMMAND: "swapper/6"
TASK: ffff880174bcef90 (1 of 12) [THREAD_INFO: ffff880174bec000]
CPU: 6
STATE: TASK_RUNNING (PANIC)
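One thing to watch in the command above: `uname -r` is the kernel that is running now, which may not be the kernel that crashed (for example after a kernel update). A possible way around this is to ask crash for the release recorded in the dump with its ‘--osrelease’ option and build the vmlinux path from that. The helper name debug_vmlinux_for and its optional second argument are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: print the debug vmlinux path for a kernel release,
# or fail with a hint when the matching kernel-debuginfo is not installed.
debug_vmlinux_for() {
    release="$1"
    debug_root="${2:-/usr/lib/debug}"
    vmlinux="$debug_root/lib/modules/$release/vmlinux"
    if [ -f "$vmlinux" ]; then
        echo "$vmlinux"
    else
        echo "missing $vmlinux (install the matching kernel-debuginfo)" >&2
        return 1
    fi
}

# Usage, run from the dump directory (not executed here):
#   release=$(crash --osrelease vmcore)
#   crash "$(debug_vmlinux_for "$release")" vmcore
```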
Inside crash, ‘files’ lists the open files of the task that panicked, ‘sys’ prints the system summary again, and ‘bt’ shows the backtrace of that task.
crash> files
PID: 0 TASK: ffff880174bcef90 CPU: 6 COMMAND: "swapper/6"
ROOT: / CWD: /
No open files

crash> sys
KERNEL: /usr/lib/debug/lib/modules/3.10.0-714.10.2.lve1.5.17.1.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Sat Jul 28 00:20:48 2018
UPTIME: 04:24:55
LOAD AVERAGE: 0.75, 0.53, 0.57
TASKS: 488
NODENAME: pro3.internet.com
RELEASE: 3.10.0-714.10.2.lve1.5.17.1.el7.x86_64
VERSION: #1 SMP Tue May 22 10:39:25 EDT 2018
MACHINE: x86_64 (1995 Mhz)
MEMORY: 31.9 GB
PANIC: "Kernel panic - not syncing: Fatal machine check on current CPU"
crash>

crash> bt
PID: 0 TASK: ffff880174bcef90 CPU: 6 COMMAND: "swapper/6"
 #0 [ffff88085f38bc78] machine_kexec at ffffffff8105b6bb
 #1 [ffff88085f38bcd8] __crash_kexec at ffffffff81117272
 #2 [ffff88085f38bda8] panic at ffffffff81696c18
 #3 [ffff88085f38be28] mce_panic at ffffffff810432ea
 #4 [ffff88085f38be68] do_machine_check at ffffffff81044a9a
 #5 [ffff88085f38bf50] machine_check at ffffffff816a594f
    [exception RIP: intel_idle+246]
    RIP: ffffffff813a1946 RSP: ffff880174befe10 RFLAGS: 00010046
    RAX: 0000000000000030 RBX: 0000000000000010 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: ffff880174beffd8 RDI: 0000000000000006
    RBP: ffff880174befe40 R8: 00000000000003b3 R9: 0000000000000018
    R10: 00000000000003d6 R11: 0000000000000000 R12: ffff880174beffd8
    R13: 0000000000000005 R14: 0000000000000030 R15: ffffffff81aa2710
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- ---
 #6 [ffff880174befe10] intel_idle at ffffffff813a1946
 #7 [ffff880174befe48] cpuidle_enter_state at ffffffff81529500
 #8 [ffff880174befe80] cpuidle_idle_call at ffffffff81529659
 #9 [ffff880174befec0] arch_cpu_idle at ffffffff8103658e
#10 [ffff880174befed0] cpu_startup_entry at ffffffff810f0845
#11 [ffff880174beff28] start_secondary at ffffffff81050aea
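If the dump has to be sent to someone else, the same crash commands can be collected into a text file without typing them by hand. The sketch below uses crash's ‘-s’ (silent) and ‘-i’ (read commands from a file) options. The function name crash_report, the default report name and the CRASH_BIN variable are assumptions for this example; CRASH_BIN only exists so that a different binary can be substituted.

```shell
#!/bin/sh
# Hypothetical helper: run a fixed set of crash commands and save the output.
crash_report() {
    vmlinux="$1"
    vmcore="$2"
    report="${3:-crash-report.txt}"
    cmds=$(mktemp)
    # One crash command per line; the final "exit" ends the session.
    printf '%s\n' sys bt files exit > "$cmds"
    "${CRASH_BIN:-crash}" -s -i "$cmds" "$vmlinux" "$vmcore" > "$report"
    status=$?
    rm -f "$cmds"
    return "$status"
}
```

Called as ‘crash_report /path/to/vmlinux vmcore’, it should leave the system summary, backtrace and open files in crash-report.txt.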
The kernel messages saved with the dump show the machine check exception. The lines from two CPUs are interleaved, so follow the ones marked ‘mce: [Hardware Error]’.
[15894.694014] mce: [Hardware Error]: CPU 6: Machine Check Exception: 4 Bank 3: fe00000000800400
[15894.689003] [] bit_clear+0xdd/0x120
[15894.689003] mce: [Hardware Error]: TSC 1dfa85008bc6 MISC 3ffff
[15894.694014]
[15894.689003] [] fbcon_clear+0x1b1/0x1f0
[15894.694014] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1532708447 SOCKET 0 APIC 1 microcode 713
[15894.689003] [] fbcon_scroll+0x21d/0xd10
[15894.694014] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[15894.689003] [] scrup+0x16c/0x180
[15894.694014] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[15894.694014] mce: [Hardware Error]: Machine check: Processor context corrupt
[15894.689003] [] lf+0xa0/0xb0
[15894.694014] Kernel panic - not syncing: Fatal machine check on current CPU
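To pull only the hardware error lines out of the log and get a rough reading of the bank status value, something like the sketch below could be used. Both function names are made up for this example. The flag names follow the top bits of the x86 machine check status register (bit 63 VAL down to bit 57 PCC), and the decoder assumes the status is printed as 16 hex digits, as it is in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print only the machine check lines from a kernel log.
mce_lines() {
    grep 'Hardware Error' "${1:-vmcore-dmesg.txt}"
}

# Hypothetical helper: name the flags set in bits 63..57 of a bank status.
# $1 must be the full 16-hex-digit value, e.g. fe00000000800400.
mce_status_flags() {
    # The first two hex digits hold bits 63..56 of the status value.
    top=$(( 0x$(printf '%s' "$1" | cut -c1-2) ))
    flags=""
    if [ $(( top & 0x80 )) -ne 0 ]; then flags="$flags VAL"; fi    # bit 63: entry is valid
    if [ $(( top & 0x40 )) -ne 0 ]; then flags="$flags OVER"; fi   # bit 62: an earlier error was overwritten
    if [ $(( top & 0x20 )) -ne 0 ]; then flags="$flags UC"; fi     # bit 61: uncorrected error
    if [ $(( top & 0x10 )) -ne 0 ]; then flags="$flags EN"; fi     # bit 60: error reporting enabled
    if [ $(( top & 0x08 )) -ne 0 ]; then flags="$flags MISCV"; fi  # bit 59: MISC register valid
    if [ $(( top & 0x04 )) -ne 0 ]; then flags="$flags ADDRV"; fi  # bit 58: ADDR register valid
    if [ $(( top & 0x02 )) -ne 0 ]; then flags="$flags PCC"; fi    # bit 57: processor context corrupt
    echo "${flags# }"
}
```

For the value in this dump, ‘mce_status_flags fe00000000800400’ prints ‘VAL OVER UC EN MISCV ADDRV PCC’. The UC and PCC flags agree with the ‘Processor context corrupt’ line in the log, which is why the kernel had to panic.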
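In this case the problem appears to be the CPU: the panic is a fatal machine check, which is a hardware error reported by the processor itself. We replaced the CPU and the problem was gone.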