Normal shutdown procedure
1. Stop the business workloads first; once no more I/O is being issued, umount the file system on every node.
2. Run mmshutdown -a to stop GPFS on all cluster nodes, then use mmgetstate -a to check that every node is in the down state.
3. Once mmgetstate -a confirms that all nodes are down, the servers can be shut down (see the command sketch below).
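For reference, the sequence above corresponds roughly to the following commands (a minimal sketch; /sdses is the mount point seen later in this case, and the actual mount point and node list depend on your cluster):

# on every node, after all business I/O has stopped
umount /sdses                 # unmount the GPFS file system on this node
# from any one cluster node
mmshutdown -a                 # stop GPFS on all cluster nodes
mmgetstate -a                 # confirm every node reports "down"
# then power off each server
shutdown -h now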
The message "The /lib/modules/5.15.0-86-generic/extra/mmfslinux.ko kernel extension does not exist." typically has one of the following causes:
(1) The cluster was not shut down through the normal procedure but was rebooted after an unexpected power outage in the machine room; this shows up as all or some of the private clients being unable to access the storage.
(2) Some clients received a minor kernel version upgrade, and the GPL module could not be rebuilt automatically for the new kernel on those clients.
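A quick way to tell whether a client falls under case (2) is to check whether the GPL module exists for the kernel it is currently running (a hedged check derived from the path in the error message):

uname -r                                          # kernel the node is running now
ls /lib/modules/$(uname -r)/extra/mmfslinux.ko    # "No such file" means the GPL module was never built for this kernel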
1. Log in to each private client that cannot access the storage as the root user, rebuild the GPL module, and restart the GPFS service:
(base) sdses@client1:~$ su root
Password:
root@client1:/home/sdses# /usr/lpp/mmfs/bin/mmbuildgpl << rebuild the GPL module
--------------------------------------------------------
mmbuildgpl: Building GPL (5.1.6.1) module begins at Sat Oct  7 15:37:58 CST 2023.
--------------------------------------------------------
Verifying Kernel Header...
kernel version = 51500083 (515000083000000, 5.15.0-83-generic, 5.15.0-83)
module include dir = /lib/modules/5.15.0-83-generic/build/include
module build dir = /lib/modules/5.15.0-83-generic/build
kernel source dir = /usr/src/linux-5.15.0-83-generic/include
Found valid kernel header file under /lib/modules/5.15.0-83-generic/build/include
Getting Kernel Cipher mode...
Will use skcipher routines
Verifying Compiler...
make is present at /bin/make
cpp is present at /bin/cpp
gcc is present at /bin/gcc
g++ is present at /bin/g++
ld is present at /bin/ld
make World ...
make InstallImages ...
--------------------------------------------------------
mmbuildgpl: Building GPL module completed successfully at Sat Oct  7 15:38:16 CST 2023.
--------------------------------------------------------
root@client1:/home/sdses# /usr/lpp/mmfs/bin/mmstartup << restart the GPFS service
Sat Oct  7 15:38:32 CST 2023: mmstartup: Starting GPFS ...
root@client1:/home/sdses# /usr/lpp/mmfs/bin/mmgetstate << check the GPFS service state
Node number Node name GPFS state
-------------------------------------
6 client1 active
root@client1:/home/sdses# df -h << the GPFS mount point is visible again
Filesystem      Size  Used Avail Use% Mounted on
udev 504G 0 504G 0% /dev
tmpfs 101G 3.6M 101G 1% /run
/dev/sda2 879G 94G 741G 12% /
tmpfs 504G 85M 504G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 504G 0 504G 0% /sys/fs/cgroup
/dev/loop0 128K 128K 0 100% /snap/bare/5
/dev/loop2 41M 41M 0 100% /snap/snapd/19993
/dev/loop1 62M 62M 0 100% /snap/core20/1611
/dev/loop3 64M 64M 0 100% /snap/core20/2015
/dev/loop4 74M 74M 0 100% /snap/core22/864
/dev/loop5 350M 350M 0 100% /snap/gnome-3-38-2004/143
/dev/loop6 347M 347M 0 100% /snap/gnome-3-38-2004/115
/dev/loop7 13M 13M 0 100% /snap/snap-store/959
/dev/loop11 92M 92M 0 100% /snap/gtk-common-themes/1535
/dev/loop12 41M 41M 0 100% /snap/snapd/20092
/dev/loop10 55M 55M 0 100% /snap/snap-store/558
/dev/loop9 486M 486M 0 100% /snap/gnome-42-2204/126
/dev/loop8 74M 74M 0 100% /snap/core22/858
/dev/sda1 511M 6.1M 505M 2% /boot/efi
tmpfs 101G 20K 101G 1% /run/user/125
/dev/loop13 497M 497M 0 100% /snap/gnome-42-2204/141
tmpfs 101G 36K 101G 1% /run/user/1000
synthesis01 128T 308G 128T 1% /sdses
tmpfs 101G 0 101G 0% /run/user/0
root@client1:/home/sdses# /usr/lpp/mmfs/bin/mmces service list
Enabled services: NFS
NFS is running
Following the same steps, run /usr/lpp/mmfs/bin/mmbuildgpl and /usr/lpp/mmfs/bin/mmstartup on each of the other private clients that cannot access the storage, and confirm that their GPFS state is active and the storage mount is visible (a scripted variant is sketched below).
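If several clients are affected, the same two commands can be scripted from a node that has passwordless root SSH to them (a hedged sketch; the host names are only an example taken from the node names that appear in mmgetstate later in this case):

for h in client2 client3 client4; do
    ssh root@$h '/usr/lpp/mmfs/bin/mmbuildgpl && /usr/lpp/mmfs/bin/mmstartup'
done
/usr/lpp/mmfs/bin/mmgetstate -a    # every client should eventually report "active"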
2. Log in to the storage control node, restart the GPFS service, and rebuild the GPL module:
Last login: Mon Sep 18 18:24:22 2023
[root@ece1 ~]# mmhealth cluster show node << check the status of each cluster node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY ces_ips_all_unassigned
NODE ***.*** HEALTHY -
NODE ***.*** FAILED gpfs_down,quorum_down,gui_pmsensors_connection_failed
NODE ***.*** FAILED nfsd_down,gpfs_down,local_exported_fs_unavail
NODE ***.*** FAILED gpfs_down,quorum_down,unmounted_fs_check
NODE ***.*** FAILED nfsd_down,gpfs_down,local_exported_fs_unavail
NODE ***.*** HEALTHY -
[root@ece1 ~]# mmhealth cluster show << check the overall cluster status
Component Total Failed Degraded Healthy Other
-----------------------------------------------------------------------------------------------------------------
NODE 9 4 0 5 0
GPFS 9 4 0 5 0
NETWORK 9 0 0 9 0
FILESYSTEM 1 0 1 0 0
DISK 16 0 0 16 0
CES 2 2 0 0 0
CESIP 1 1 0 0 0
FILESYSMGR 1 0 0 1 0
GUI 1 0 1 0 0
NATIVE_RAID 4 0 0 4 0
PERFMON 5 0 0 5 0
THRESHOLD 5 0 0 5 0
[root@ece1 ~]# mmstartup -a << restart the GPFS service on all nodes
Sun Oct 8 15:08:30 CST 2023: mmstartup: Starting GPFS ...
***.***: The GPFS subsystem is already active.
***.***: The GPFS subsystem is already active.
***.***: The GPFS subsystem is already active.
***.***: The GPFS subsystem is already active.
***.***: The GPFS subsystem is already active.
***.***: mmremote: startSubsys: The /lib/modules/5.15.0-86-generic/extra/mmfslinux.ko kernel extension does not exist. Use mmbuildgpl command to create the needed kernel extension for your kernel or copy the binaries from another node with the identical environment. << GPFS asks you to run mmbuildgpl to build the kernel extension for this kernel, or to copy the binaries from another node with an identical environment.
***.***: mmremote: startSubsys: Unable to verify kernel/module configuration.
***.***: mmremote: startSubsys: The /lib/modules/5.15.0-83-generic/extra/mmfslinux.ko kernel extension does not exist. Use mmbuildgpl command to create the needed kernel extension for your kernel or copy the binaries from another node with the identical environment.
mmdsh: ***.*** remote shell process had return code 1.
***.***: mmremote: startSubsys: Unable to verify kernel/module configuration.
***.***: mmremote: startSubsys: The /lib/modules/5.15.0-83-generic/extra/mmfslinux.ko kernel extension does not exist. Use mmbuildgpl command to create the needed kernel extension for your kernel or copy the binaries from another node with the identical environment.
***.***: mmremote: startSubsys: Unable to verify kernel/module configuration.
mmdsh: ***.*** remote shell process had return code 1.
mmdsh: ***.*** remote shell process had return code 1.
mmstartup: Command failed. Examine previous error messages to determine cause.
[root@ece1 ~]# mmhealth cluster show
Component Total Failed Degraded Healthy Other
-----------------------------------------------------------------------------------------------------------------
NODE 9 4 0 5 0
GPFS 9 4 0 5 0
NETWORK 9 0 0 9 0
FILESYSTEM 1 0 1 0 0
DISK 16 0 0 16 0
CES 2 2 0 0 0
CESIP 1 1 0 0 0
FILESYSMGR 1 0 0 1 0
GUI 1 0 1 0 0
NATIVE_RAID 4 0 0 4 0
PERFMON 5 0 0 5 0
THRESHOLD 5 0 0 5 0
[root@ece1 ~]# mmbuildgpl << rebuild the GPL module
--------------------------------------------------------
mmbuildgpl: Building GPL (5.1.6.1) module begins at Sun Oct 8 15:17:09 CST 2023.
--------------------------------------------------------
Verifying Kernel Header...
kernel version = 41800305 (418000305003001, 4.18.0-305.3.1.el8.x86_64, 4.18.0-305.3.1)
module include dir = /lib/modules/4.18.0-305.3.1.el8.x86_64/build/include
module build dir = /lib/modules/4.18.0-305.3.1.el8.x86_64/build
kernel source dir = /usr/src/linux-4.18.0-305.3.1.el8.x86_64/include
Found valid kernel header file under /usr/src/kernels/4.18.0-305.3.1.el8.x86_64/include
Getting Kernel Cipher mode...
Will use skcipher routines
Verifying Compiler...
make is present at /bin/make
cpp is present at /bin/cpp
gcc is present at /bin/gcc
g++ is present at /bin/g++
ld is present at /bin/ld
Verifying libelf devel package...
Verifying elfutils-libelf-devel is installed ...
Command: /bin/rpm -q elfutils-libelf-devel
The required package elfutils-libelf-devel is installed
Verifying Additional System Headers...
Verifying kernel-headers is installed ...
Command: /bin/rpm -q kernel-headers
The required package kernel-headers is installed
make World ...
make InstallImages ...
--------------------------------------------------------
mmbuildgpl: Building GPL module completed successfully at Sun Oct 8 15:17:30 CST 2023.
--------------------------------------------------------
Keep checking the node and cluster status; the startup process takes a while, so be patient (see the polling sketch below).
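Instead of re-running the command by hand, the state can be polled periodically (a small convenience not part of the original session):

watch -n 30 '/usr/lpp/mmfs/bin/mmgetstate -a'    # refresh the per-node GPFS state every 30 seconds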
[root@ece1 ~]# mmhealth cluster show
Component Total Failed Degraded Healthy Other
-----------------------------------------------------------------------------------------------------------------
NODE 9 3 0 6 0
GPFS 9 3 0 6 0
NETWORK 9 0 0 9 0
FILESYSTEM 1 0 1 0 0
DISK 16 0 0 16 0
CES 2 2 0 0 0
CESIP 1 1 0 0 0
FILESYSMGR 1 0 0 1 0
GUI 1 0 1 0 0
NATIVE_RAID 4 0 0 4 0
PERFMON 5 0 0 5 0
THRESHOLD 5 0 0 5 0
[root@ece1 ~]# mmhealth cluster show node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY ces_ips_all_unassigned
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY gui_pmsensors_connection_failed,time_not_in_sync,gui_refresh_task_failed
NODE ***.*** FAILED nfsd_down,gpfs_down,local_exported_fs_unavail
NODE ***.*** FAILED gpfs_down,quorum_down,unmounted_fs_check
NODE ***.*** FAILED nfsd_down,gpfs_down,local_exported_fs_unavail
NODE ***.*** HEALTHY -
[root@ece1 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------
1 ece1 active
2 ece2 active
3 ece3 active
4 ece4 active
5 gui active
6 client1 active
7 client2 down
8 client3 down
9 client4 active
[root@ece1 ~]# mmhealth cluster show node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY gui_pmsensors_connection_failed,time_not_in_sync,gui_refresh_task_failed
NODE ***.*** TIPS nfs_in_grace,numactl_not_installed
NODE ***.*** FAILED gpfs_down,quorum_down,unmounted_fs_check
NODE ***.*** FAILED nfsd_down,gpfs_down,local_exported_fs_unavail
NODE ***.*** HEALTHY -
[root@ece1 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------
1 ece1 active
2 ece2 active
3 ece3 active
4 ece4 active
5 gui active
6 client1 active
7 client2 active
8 client3 arbitrating
9 client4 active
[root@ece1 ~]# mmgetstate -a
Node number Node name GPFS state
-------------------------------------
1 ece1 active
2 ece2 active
3 ece3 active
4 ece4 active
5 gui active
6 client1 active
7 client2 active
8 client3 active
9 client4 active
3. Confirm the node status and carry out the actions indicated by the TIPS entries:
[root@ece1 ~]# mmhealth cluster show node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY gui_pmsensors_connection_failed,time_not_in_sync,gui_refresh_task_failed
NODE ***.*** TIPS numactl_not_installed
NODE ***.*** HEALTHY -
NODE ***.*** TIPS ces_network_ips_down,numactl_not_installed
NODE ***.*** HEALTHY -
[root@ece1 ~]# mmhealth cluster show
Component Total Failed Degraded Healthy Other
-----------------------------------------------------------------------------------------------------------------
NODE 9 0 0 7 2
GPFS 9 0 0 7 2
NETWORK 9 0 0 9 0
FILESYSTEM 1 0 0 1 0
DISK 16 0 0 16 0
CES 2 0 2 0 0
CESIP 1 0 0 1 0
FILESYSMGR 1 0 0 1 0
GUI 1 0 1 0 0
NATIVE_RAID 4 0 0 4 0
PERFMON 5 0 0 5 0
THRESHOLD 5 0 0 5 0
[root@ece1 ~]# mmhealth cluster show node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY gui_pmsensors_connection_failed,time_not_in_sync,gui_refresh_task_failed
NODE ***.*** TIPS numactl_not_installed
NODE ***.*** HEALTHY -
NODE ***.*** TIPS numactl_not_installed
NODE ***.*** HEALTHY -
root@client1:/home/sdses# apt-get install numactl
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  gir1.2-goa-1.0 nvidia-firmware-535-535.86.05
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  numactl
0 upgraded, 1 newly installed, 0 to remove and 14 not upgraded.
Need to get 38.5 kB of archives.
After this operation, 150 kB of additional disk space will be used.
Get:1 ***.***/ubuntu focal/main amd64 numactl amd64 2.0.12-1 [38.5 kB]
Fetched 38.5 kB in 0s (211 kB/s)
Selecting previously unselected package numactl.
(Reading database ... 217894 files and directories currently installed.)
Preparing to unpack .../numactl_2.0.12-1_amd64.deb ...
Unpacking numactl (2.0.12-1) ...
Setting up numactl (2.0.12-1) ...
Processing triggers for man-db (2.9.1-1) ...
root@client1:/home/sdses#
[root@ece1 ~]# mmhealth cluster show node
Component Node Status Reasons
------------------------------------------------------------------------------------------
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY gui_pmsensors_connection_failed,time_not_in_sync,gui_refresh_task_failed
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
NODE ***.*** HEALTHY -
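As an optional final check that every node sees the file system again (not shown in the original session), mmlsmount lists the nodes on which each GPFS file system is mounted:

/usr/lpp/mmfs/bin/mmlsmount all -L    # per file system, shows every node that currently has it mounted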