GaussDB Cluster Troubleshooting Outline

1          Principles

1.1         Communication principles => https://bbs.huaweicloud.com/blogs/239971

1.2         Communication views => https://bbs.huaweicloud.com/blogs/247543

1.3         Resource workload management => https://bbs.huaweicloud.com/blogs/239960

1.4         Cluster management (CM) => https://bbs.huaweicloud.com/blogs/244355

1.5         Cluster management (CM) => https://bbs.huaweicloud.com/blogs/224005

1.6         CMS communication mechanism => https://bbs.huaweicloud.com/blogs/241853

1.7         LVS fundamentals => https://bbs.huaweicloud.com/blogs/238621

1.8         CPU resource management => https://bbs.huaweicloud.com/blogs/237550

2          Connections

2.1         JDBC connection errors => https://bbs.huaweicloud.com/blogs/244348

Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections

Invalid username/password, login denied

No suitable driver found for XXXX

No pg_hba.conf entry for host

conflict

Terminating connection due to administrator command, Session unused timeout

SSL error: Connection reset

Connection refused: connect

Connections could not be acquired from the underlying database
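When triaging a JDBC stack trace, a first step is usually to find which of the messages above it contains. The following is a minimal sketch of that lookup, not part of the linked article: the signature strings are copied from the list above (the bare "conflict" entry is left out because it is too generic to match safely), while the function name `match_jdbc_error` and the case-insensitive substring matching are assumptions made for illustration.

```python
# Signatures copied from the JDBC error list above.
# The generic "conflict" entry is intentionally omitted (assumption: too broad to match).
JDBC_ERROR_SIGNATURES = [
    "Check that the hostname and port are correct",
    "Invalid username/password",
    "No suitable driver found",
    "No pg_hba.conf entry for host",
    "Terminating connection due to administrator command",
    "SSL error: Connection reset",
    "Connection refused",
    "Connections could not be acquired from the underlying database",
]


def match_jdbc_error(message):
    """Return the first known signature contained in `message`, or None.

    Hypothetical helper: matching is a case-insensitive substring test.
    """
    lowered = message.lower()
    for signature in JDBC_ERROR_SIGNATURES:
        if signature.lower() in lowered:
            return signature
    return None


if __name__ == "__main__":
    line = 'FATAL: no pg_hba.conf entry for host "10.0.0.1", user "app"'
    print(match_jdbc_error(line))
```

The matched signature can then be looked up in the article linked under 2.1.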

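2.2         LVS anomalies => https://bbs.huaweicloud.com/blogs/247267 || https://bbs.huaweicloud.com/blogs/244340

Installation errors

Installation reports insufficient permission to write a file

ipvsadm -Ln shows no CN information

gsql reports an error when the client connects

Client connections through LVS are not distributed round-robin

Client cannot connect to the CN through the virtual IP

Uninstalling LVS causes the machine to reboot

Checking for virtual_router_id conflicts

A machine reboot loses the floating IP, so the CN fails to start

2.3         Disconnections => https://bbs.huaweicloud.com/blogs/239471 || http://3ms.huawei.com/km/blogs/details/8697907 || https://bbs.huaweicloud.com/blogs/205970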

Too many clients already, active/non_active: xxxx/xxxx

An I/O error occurred while sending to the backend

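Client takes a long time to connect to the CN

Kerberos authentication failure

Connection errors inside the cluster

3          Network

3.1         Retransmission or packet loss => https://bbs.huaweicloud.com/blogs/235237

3.2         Communication anomalies => http://3ms.huawei.com/km/blogs/details/2431967?l=zh-cn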

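Cluster abnormal -> environment abnormal -> environment issue: firewall / MTU / NIC hardening, etc.

Cluster abnormal -> environment normal -> configuration issue: listening port / bind address / permissions, etc.

Cluster normal -> intermittent fault -> core dump / OS / insufficient memory / NIC failure / LVS, etc.

Cluster normal -> persistent fault -> deadlock / abnormal node / connection limit reached, etc.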
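The four branches above form a small decision tree. As a rough sketch, they can be written as a lookup function; the function name `triage`, its parameters, and the returned tuple shape are illustrative assumptions, and only the branch contents come from the list above.

```python
def triage(cluster_ok, environment_ok=None, intermittent=None):
    """Map observations to (category, suspects), following the four branches above.

    cluster_ok:     is the cluster status normal?
    environment_ok: only consulted when the cluster is abnormal
    intermittent:   only consulted when the cluster is normal
    """
    if not cluster_ok:
        if environment_ok is None:
            raise ValueError("environment_ok is required when the cluster is abnormal")
        if not environment_ok:
            return ("environment issue", ["firewall", "MTU", "NIC hardening"])
        return ("configuration issue", ["listening port", "bind address", "permissions"])

    if intermittent is None:
        raise ValueError("intermittent is required when the cluster is normal")
    if intermittent:
        return ("intermittent fault",
                ["core dump", "OS", "insufficient memory", "NIC failure", "LVS"])
    return ("persistent fault", ["deadlock", "abnormal node", "connection limit reached"])


if __name__ == "__main__":
    category, suspects = triage(cluster_ok=True, intermittent=False)
    print(category, suspects)
```

Each suspect list ends with "etc." in the outline, so the returned items are starting points rather than an exhaustive set.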

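3.3         Communication performance => https://bbs.huaweicloud.com/blogs/248843

NIC multi-queue

Network traffic

Communication library memory

System calls

4          Resources

4.1         Resource management configuration => https://bbs.huaweicloud.com/blogs/244671

Invalid service name / unknown internal exception / insufficient CPU quota, etc.

Issues between OMS and the primary CMS node

Tenant creation fails; the backend log reports insufficient permissions

Tenant creation fails; the log reports that modifying the resource pool failed

4.2         Memory anomalies => https://bbs.huaweicloud.com/forum/thread-110215-1-1.html || https://bbs.huaweicloud.com/forum/thread-82838-1-1.html || https://bbs.huaweicloud.com/forum/thread-85225-1-1.html || https://bbs.huaweicloud.com/forum/thread-94896-1-1.html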

memory temporarily unavailable

4.3         CPU anomalies => https://bbs.huaweicloud.com/forum/thread-76364-1-1.html || https://bbs.huaweicloud.com/forum/thread-79937-1-1.html || https://bbs.huaweicloud.com/forum/thread-70297-1-1.html || https://bbs.huaweicloud.com/forum/thread-73291-1-1.html

CPU usage exceeds the threshold

Multi-tenant CPU resource management

5          Parameters

5.1         Communication parameters => https://bbs.huaweicloud.com/blogs/239863

tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count

comm_max_datanode, comm_max_stream, comm_max_receiver

enable_stateless_pooler_reuse, comm_cn_dn_logic_conn

comm_quota_size, comm_usable_memory

net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, net.ipv4.tcp_max_tw_buckets

net.ipv4.tcp_syn_retries, net.ipv4.tcp_synack_retries

net.ipv4.tcp_retries1, net.ipv4.tcp_retries2
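Before tuning, it helps to see what the `net.ipv4.*` kernel parameters are currently set to on a node. The sketch below reads a subset of the parameters listed above straight from `/proc/sys`, where a dotted sysctl name maps to a path (`net.ipv4.tcp_retries2` is `/proc/sys/net/ipv4/tcp_retries2`). The function name `read_kernel_params` and the `proc_sys` argument are illustrative assumptions. A value of `None` means the file could not be read; for example, `net.ipv4.tcp_tw_recycle` was removed in Linux 4.12 and is absent on newer kernels.

```python
from pathlib import Path

# Subset of the kernel parameters listed above.
KERNEL_PARAMS = [
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_max_tw_buckets",
    "net.ipv4.tcp_syn_retries",
    "net.ipv4.tcp_synack_retries",
    "net.ipv4.tcp_retries2",
]


def read_kernel_params(names=KERNEL_PARAMS, proc_sys="/proc/sys"):
    """Return {name: value string, or None if the parameter cannot be read}."""
    values = {}
    for name in names:
        # "net.ipv4.tcp_retries2" -> <proc_sys>/net/ipv4/tcp_retries2
        path = Path(proc_sys, *name.split("."))
        try:
            values[name] = path.read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_kernel_params().items():
        print(f"{name} = {value if value is not None else '(not available)'}")
```

This only reads values; the database-level parameters (`tcp_keepalives_*`, `comm_*`) are set through the database configuration and are not covered here.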

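6         Tools

6.1         Network traffic / retransmission / packet loss => gsar.sh

6.2         Client connection status monitoring => clients.py

6.3         Network throughput testing => speed_test_x86.sh/speed_test_arm

6.4         NIC multi-queue query/setup => get_irq_affinity.sh/set_irq_affinity.sh

6.5         Network monitoring => network_monitor.py

6.6         General statement monitoring => general.sh

7         Summary

To be continued => additions are welcome

(End)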