Zabbix Server Cluster Deployment Best Practices
Architecture Design

Software used
- RHEL 8.4
- MySQL 8.0
- Zabbix 5.4

IP plan
- Cluster VIPs
```
192.168.2.28 zabbix-ha-db    # database VIP
192.168.2.29                 # Zabbix server VIP
```
- DB nodes
```
192.168.2.24 zabbix-db1
192.168.2.25 zabbix-db2
```
- Web nodes
```
192.168.2.26 zabbix-server1
192.168.2.27 zabbix-server2
```
General Server Configuration
- Time synchronization (root crontab)
```
0 0 * * * /usr/sbin/ntpdate ntpserver >> /root/ntpdate.log 2>&1 ; /sbin/hwclock -w
```
- Disable the firewall
```
systemctl stop firewalld
```
- Disable SELinux
```
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
```
- Configure /etc/hosts
```
# vips for cluster
192.168.2.28 zabbix-ha-db
# db nodes
192.168.2.24 zabbix-db1
192.168.2.25 zabbix-db2
# web nodes
192.168.2.26 zabbix-server1
192.168.2.27 zabbix-server2
```
- Configure the yum repository
```
[base]
```
Database HA Cluster
Cluster installation
Run on all nodes:
- Install the HA components
```
yum install pcs pacemaker fence-agents-all
systemctl start pcsd.service
systemctl enable pcsd.service
```
- Set a password for the hacluster user (ideally the same on every node)
```
echo hacluster | passwd --stdin hacluster
```
Run on any one node:
- Authenticate all nodes with that password
```
pcs host auth zabbix-db1 zabbix-db2 -u hacluster -p hacluster
```
- Create the database cluster and add resources
```
pcs cluster setup zabbix_db_cluster zabbix-db1 zabbix-db2
```
- Start the cluster
```
pcs cluster start --all
pcs cluster enable --all
```
- Check the cluster status
```
[root@zabbix-db1 pcsd]# pcs status
```
Cluster parameter configuration
- Disable fencing
```
pcs property set stonith-enabled=false
```
- Ignore quorum state
```
pcs property set no-quorum-policy=ignore
```
- Configure the failover threshold
```
pcs resource defaults migration-threshold=1
```
Creating the Service and Testing Failover
- Create the VIP resource
```
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.2.28 op monitor interval=5s --group zabbix_db_cluster
```
```
[root@zabbix-db1 pcsd]# pcs status
```
The VIP answers ping:
```
[root@zabbix-db1 pcsd]# ping 192.168.2.28
```
and shows up in the interface addresses:
```
[root@zabbix-db1 pcsd]# ip addr
```
- Force-stop the resource with crm_resource
```
crm_resource --resource VirtualIP --force-stop
```
Watch the resources with crm_mon:
```
# crm_mon
```
Here the VirtualIP resource moves to the zabbix-db2 node automatically, which verifies automatic failover:
```
[root@zabbix-db2 pcsd]# ip addr
```
- Avoid frequent failovers: once the resource has moved to node 2 it stays on node 2 until node 2 itself fails
```
pcs resource defaults resource-stickiness=100
```
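The failover check above can be scripted. A minimal sketch: the helper below tests whether a node currently holds the VIP by inspecting `ip addr` output (the function name and the idea of passing the output in as an argument, for testability, are my own; on a live node you would pass `"$(ip addr)"`):

```shell
#!/bin/sh
# Return success if the given VIP appears in the supplied `ip addr` output.
# Usage on a live node:  node_owns_vip 192.168.2.28 "$(ip addr)"
node_owns_vip() {
    vip="$1"
    addr_output="$2"
    printf '%s\n' "$addr_output" | grep -q "inet ${vip}/"
}

# Example against address lines captured from zabbix-db1:
sample='inet 192.168.2.24/24 brd 192.168.2.255 scope global ens192
inet 192.168.2.28/24 brd 192.168.2.255 scope global secondary ens192'

node_owns_vip 192.168.2.28 "$sample" && echo "VIP is on this node"
```

Running this from cron or a monitoring item on both nodes makes it easy to see where the VIP landed after a forced stop.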
MySQL Installation
Single-node installation
- Install the newer MySQL 8.0
```
yum install mysql-community-server
```
Because this RHEL release is quite new (8.4), the packages had to be installed manually from local RPMs:
```
[root@zabbix-db1 ~]# ll *.rpm
```
- Start MySQL
```
systemctl start mysqld
```
- Change the initial password
```
sudo grep 'temporary password' /var/log/mysqld.log
```
- Edit my.cnf
```
[client]
```
- Adjust the configuration on node 2
```
server_id = 2 ## Last number of IP
```
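The surviving `my.cnf` excerpt is truncated. A minimal sketch of the replication-related `[mysqld]` keys such a master-master setup needs (log file names and retention here are assumptions, not values from the source):

```ini
[mysqld]
server_id = 1              # must be unique per node; 2 on zabbix-db2
log_bin = mysql-bin        # binary log, required to act as a master
binlog_format = ROW
log_slave_updates = 1      # replicate changes received as a slave (master-master)
relay_log = relay-bin
expire_logs_days = 7       # assumed binlog retention
```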
MySQL Replication Configuration
Log in to zabbix-db1:
```
# mysql -uroot -p
```
Log in to zabbix-db2.
Configure db2 as a slave of db1:
```
mysql -uroot -p<MYSQL_ROOT_PASSWORD>
```
Create the replication account on db2:
```
mysql> create user 'rep'@'192.168.2.24' identified by 'MyNewPass4!';
```
Reset db2's master status and start the slave:
```
RESET MASTER;
```
Check db2's master status:
```
mysql> show master status\G
```
Log in to zabbix-db1.
Configure db1 as a slave of db2:
```
CHANGE MASTER TO MASTER_HOST = '192.168.2.25', MASTER_USER = 'rep', MASTER_PASSWORD='MyNewPass4!', MASTER_LOG_FILE = 'mysql-bin.000001', MASTER_LOG_POS = 156;
START SLAVE;
```
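After starting the slave, both replication threads should report `Yes` in `SHOW SLAVE STATUS\G`. A small helper to check that from a script (a sketch of my own, not part of the original steps; on a live node you would feed it the output of `mysql -uroot -p -e 'SHOW SLAVE STATUS\G'`):

```shell
#!/bin/sh
# Succeed only if both the IO and SQL replication threads are running,
# judging from `SHOW SLAVE STATUS\G` output passed in as an argument.
replication_healthy() {
    status="$1"
    printf '%s\n' "$status" | grep -q 'Slave_IO_Running: Yes' &&
    printf '%s\n' "$status" | grep -q 'Slave_SQL_Running: Yes'
}

# Example with captured output:
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0'

replication_healthy "$sample" && echo "replication OK"
```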
Zabbix Server MySQL Performance Tuning
For Zabbix, the biggest bottleneck is usually the database: the constant stream of reads and writes puts it under heavy load, so the history tables should be partitioned.
Even with housekeeping disabled in the frontend, zabbix server still writes to the housekeeper table, so neutralize it:
```
ALTER TABLE housekeeper ENGINE = BLACKHOLE;
```
First, initialize partitioning on the seven history and trends tables:
```
ALTER TABLE `history` PARTITION BY RANGE ( clock)
```
Enable the event scheduler:
```
mysql> show variables like '%event_scheduler%';
```
Use a stored procedure to add and drop partitions automatically on a schedule:
```
USE `zabbix`;
```
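The initialization statement above is truncated in the source. A fuller version might look like this, with daily range partitions keyed on the `clock` column (partition names and date boundaries are illustrative only):

```sql
ALTER TABLE `history` PARTITION BY RANGE (clock) (
    PARTITION p2021_06_23 VALUES LESS THAN (UNIX_TIMESTAMP("2021-06-24 00:00:00")),
    PARTITION p2021_06_24 VALUES LESS THAN (UNIX_TIMESTAMP("2021-06-25 00:00:00")),
    PARTITION p2021_06_25 VALUES LESS THAN (UNIX_TIMESTAMP("2021-06-26 00:00:00"))
);
```

The same statement is repeated for `history_uint`, `history_str`, `history_text`, `history_log`, `trends` and `trends_uint`.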
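The stored-procedure block is truncated in the source. Assuming the widely used community partition-maintenance procedures (such as `partition_maintenance_all`) have been loaded into the `zabbix` schema, a scheduled event to run them nightly could look like this (procedure name and schedule are assumptions):

```sql
SET GLOBAL event_scheduler = ON;

USE `zabbix`;
CREATE EVENT IF NOT EXISTS zbx_partitioning
    ON SCHEDULE EVERY 1 DAY STARTS '2021-06-25 02:00:00'
    DO CALL partition_maintenance_all('zabbix');
```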
Zabbix Proxy MySQL Performance Tuning
```
# Rebuild the proxy_history table
```
Zabbix Database Preparation
Create the zabbix database:
```
# mysql -uroot -p
```
Import the Zabbix schema and data:
```
## create.sql.gz is copied over from the zabbix-server host
```
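A sketch of the two steps above. The database character set and collation follow Zabbix's documented defaults; the user name and the `<DB_ZABBIX_PASS>` placeholder match the zabbix_server.conf shown later, and the grant scope is an assumption to adjust for your network:

```shell
# Create the database and replication-aware user (run as MySQL root)
mysql -uroot -p <<'SQL'
CREATE DATABASE zabbix CHARACTER SET utf8 COLLATE utf8_bin;
CREATE USER 'zabbix'@'%' IDENTIFIED BY '<DB_ZABBIX_PASS>';
GRANT ALL PRIVILEGES ON zabbix.* TO 'zabbix'@'%';
SQL

# Import the schema copied from the zabbix-server host
zcat create.sql.gz | mysql -uzabbix -p zabbix
```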
Zabbix Server HA Cluster
Cluster installation
Run on all nodes:
- Install the HA components
```
yum install pcs pacemaker fence-agents-all
systemctl start pcsd.service
systemctl enable pcsd.service
```
- Set a password for the hacluster user (ideally the same on every node)
```
echo hacluster | passwd --stdin hacluster
```
Run on any one node:
- Authenticate all nodes with that password
```
pcs host auth zabbix-server1 zabbix-server2 -u hacluster -p hacluster
```
- Create the server cluster and add resources
```
pcs cluster setup zabbix_server_cluster zabbix-server1 zabbix-server2
```
- Start the cluster
```
pcs cluster start --all
# Start the pacemaker service
systemctl start pacemaker.service
pcs cluster enable --all
systemctl enable pacemaker.service
```
- Check the cluster status
```
[root@zabbix-server1 ~]# pcs status
Cluster name: zabbix_server_cluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync
  * Current DC: zabbix-server2 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Jun 24 16:16:52 2021
  * Last change:  Thu Jun 24 16:16:52 2021 by hacluster via crmd on zabbix-server2
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ zabbix-server1 zabbix-server2 ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```
Cluster parameter configuration
- Disable fencing
```
pcs property set stonith-enabled=false
```
- Ignore quorum state
```
pcs property set no-quorum-policy=ignore
```
- Configure the failover threshold
```
pcs resource defaults migration-threshold=1
```
Creating the Service and Testing Failover
- Create the VIP resource
```
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.2.29 op monitor interval=5s --group zabbix_server_cluster
```
```
[root@zabbix-server1 ~]# pcs status
Cluster name: zabbix_server_cluster

Cluster Summary:
  * Stack: corosync
  * Current DC: zabbix-server2 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Jun 24 16:19:07 2021
  * Last change:  Thu Jun 24 16:19:00 2021 by root via cibadmin on zabbix-server1
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ zabbix-server1 zabbix-server2 ]

Full List of Resources:
  * Resource Group: zabbix_server_cluster:
    * VirtualIP (ocf::heartbeat:IPaddr2): Started zabbix-server1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```
The VIP answers ping:
```
[root@zabbix-server1 ~]# ping 192.168.2.29
PING 192.168.2.29 (192.168.2.29) 56(84) bytes of data.
64 bytes from 192.168.2.29: icmp_seq=1 ttl=64 time=0.023 ms
64 bytes from 192.168.2.29: icmp_seq=2 ttl=64 time=0.037 ms
64 bytes from 192.168.2.29: icmp_seq=3 ttl=64 time=0.035 ms
```
and shows up in the interface addresses:
```
[root@zabbix-server1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:94:1c:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.26/24 brd 192.168.2.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 192.168.2.29/24 brd 192.168.2.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe94:1c50/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
- Force-stop the resource with crm_resource
```
crm_resource --resource VirtualIP --force-stop
```
Watch the resources with crm_mon:
```
# crm_mon
Cluster Summary:
  * Stack: corosync
  * Current DC: zabbix-server2 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Jun 24 16:20:09 2021
  * Last change:  Thu Jun 24 16:19:00 2021 by root via cibadmin on zabbix-server1
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ zabbix-server1 zabbix-server2 ]

Active Resources:
  * Resource Group: zabbix_server_cluster:
    * VirtualIP (ocf::heartbeat:IPaddr2): Started zabbix-server2

Failed Resource Actions:
  * VirtualIP_monitor_5000 on zabbix-server1 'not running' (7): call=7, status='complete', exitreason='', last-rc-change='2021-06-24 16:20:06 +08:00', queued=0ms, exec=0ms
```
Here the VirtualIP resource moves to the zabbix-server2 node automatically, which verifies automatic failover:
```
[root@zabbix-server2 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:94:27:39 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.27/24 brd 192.168.2.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 192.168.2.29/24 brd 192.168.2.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe94:2739/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
- Avoid frequent failovers: once the resource has moved to node 2 it stays on node 2 until node 2 itself fails
```
pcs resource defaults resource-stickiness=100
```
Zabbix Installation
- Configure the zabbix yum repository
```
[zabbix]
name=zabbix
baseurl=http://yumserver/zabbix/zabbix/5.4/rhel/8/x86_64/
enabled=1
gpgcheck=0
```
- Install zabbix server, frontend and agent
```
dnf install zabbix-server-mysql zabbix-web-mysql zabbix-nginx-conf zabbix-sql-scripts zabbix-agent
```
- Configure zabbix_server.conf
```
# Set SourceIP to the cluster VIP
SourceIP=192.168.2.29
# Set DBHost to the database VIP
DBHost=192.168.2.28
DBName=zabbix
DBUser=zabbix
DBPassword=<DB_ZABBIX_PASS>
```
- Create the ZabbixServer resource on the zabbix server nodes
```
pcs resource create zabbixserver systemd:zabbix-server op monitor interval=10s --group zabbix_server_cluster
```
- The two zabbix servers must never run at the same time, so make sure zabbix server is active on only one node
```
pcs constraint colocation add VirtualIP with zabbixserver INFINITY
```
- Make sure VirtualIP starts before zabbixserver
```
pcs constraint order VirtualIP then zabbixserver
```
- Configure resource operation timeouts
```
pcs resource op add zabbixserver start interval=0s timeout=60s
pcs resource op add zabbixserver stop interval=0s timeout=120s
```
- Check the resource status
```
[root@zabbix-server1 zabbix]# pcs status
Cluster name: zabbix_server_cluster

Cluster Summary:
  * Stack: corosync
  * Current DC: zabbix-server1 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
  * Last updated: Thu Jun 24 17:58:18 2021
  * Last change:  Thu Jun 24 17:56:40 2021 by root via crm_resource on zabbix-server1
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ zabbix-server1 zabbix-server2 ]

Full List of Resources:
  * Resource Group: zabbix_server_cluster:
    * VirtualIP (ocf::heartbeat:IPaddr2): Started zabbix-server1
    * zabbixserver (systemd:zabbix-server): Started zabbix-server1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```
- Edit /etc/nginx/conf.d/zabbix.conf
```
listen 80;
server_name 192.168.2.29;
```
- Start zabbix server and agent
```
systemctl restart zabbix-server zabbix-agent nginx php-fpm
systemctl enable zabbix-server zabbix-agent nginx php-fpm
```
- Disable IPv6
```
# Disable temporarily
sysctl -w net.ipv6.conf.all.disable_ipv6=1
# Disable IPv6 in NetworkManager
nmcli connection modify ens192 ipv6.method "disabled"
```
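The `sysctl -w` call above does not survive a reboot. To make it persistent, the same keys can go into a sysctl drop-in (the file name below is an assumption):

```ini
# /etc/sysctl.d/90-disable-ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```

Apply the file without rebooting via `sysctl --system`.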
- Change the default timezone
```
[root@zabbix-server1 ~]# vim /etc/php.ini
date.timezone = Asia/Shanghai
# Restart the service
systemctl restart php-fpm
```
- zabbix server parameter configuration (for reference)
```
[root@zabbix-server1 include]# egrep -v '^#|^$' /etc/zabbix/zabbix_server.conf
SourceIP=192.168.2.29
LogFile=/var/log/zabbix/zabbix_server.log
LogFileSize=0
PidFile=/var/run/zabbix/zabbix_server.pid
SocketDir=/var/run/zabbix
DBHost=192.168.2.28
DBName=zabbix
DBUser=zabbix
DBPassword=zabbix
StartPollers=200
StartPreprocessors=20
StartPollersUnreachable=5
StartTrappers=20
StartPingers=5
StartDiscoverers=5
StartHTTPPollers=5
StartTimers=5
StartEscalators=5
StartAlerters=5
SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
StartSNMPTrapper=1
CacheSize=2G
StartDBSyncers=20
HistoryCacheSize=1G
HistoryIndexCacheSize=512M
TrendCacheSize=512M
TrendFunctionCacheSize=128M
ValueCacheSize=128M
Timeout=30
LogSlowQueries=3000
StartLLDProcessors=20
AllowRoot=1
StatsAllowedIP=127.0.0.1
```
Troubleshooting
Node authentication fails with "Unable to communicate"
```
rm -rf /var/lib/pcsd/
```
Creating the cluster fails with "Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'"
```
pcs cluster destroy
```
Authentication plugin 'caching_sha2_password' reported error: Authentication requires secure connection
Request the server's public key with the replication user:
```
mysql -u rep -p -h 192.168.2.24 -P3306 --get-server-public-key
```
In this case the server sends its RSA public key to the client, which uses it to encrypt the password and returns the result to the server. The plugin decrypts the password with the server-side RSA private key and accepts or rejects the connection depending on whether the password is correct.
Re-run CHANGE MASTER TO on the slave and start the slave; replication then comes up normally:
```
# Stop replication
```
Missing fonts
Check the locales installed on the system:
```
locale -a
```
Install the missing language packs:
```
yum install langpacks-en.noarch
```
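The block above is truncated after its first comment. The usual fix for the `caching_sha2_password` error is to reconfigure the slave so it can fetch the master's RSA public key itself via the `GET_MASTER_PUBLIC_KEY` option; a sketch that mirrors the earlier CHANGE MASTER example (host, user and password values are the ones used previously in this guide):

```sql
-- Stop replication
STOP SLAVE;
-- Reconfigure, letting the client request the master's RSA public key
CHANGE MASTER TO
    MASTER_HOST = '192.168.2.25',
    MASTER_USER = 'rep',
    MASTER_PASSWORD = 'MyNewPass4!',
    GET_MASTER_PUBLIC_KEY = 1;
-- Restart replication
START SLAVE;
```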