yum install python3-devel -y && pip install krb5 cython six ecdsa pytest-runner

0x00 Missing the wget command

The original OS could install wget and similar commands without issue, but after running bash repair initAll the original repo files are moved into a backup directory. If installation then fails because of some repo problem and we need wget to download something, that fails too. So we need to back up the current usdp.repo, move the **.repo** files from /etc/yum.repos.d/backup/ back into /etc/yum.repos.d/, and then rebuild the repo cache, as sketched below.
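A minimal sketch of the restore step (the /root/repo-backup path is an arbitrary choice for keeping a copy of usdp.repo; adjust to taste):

$ mkdir -p /root/repo-backup
$ mv /etc/yum.repos.d/usdp.repo /root/repo-backup/        # keep the USDP repo file somewhere safe
$ mv /etc/yum.repos.d/backup/*.repo /etc/yum.repos.d/     # restore the original repo files

After that, rebuild the cache: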

$ yum clean all
$ yum makecache

0x01 usdp-base repo missing repodata

During installation, repodata/repomd.xml is missing under the epel package directory. We need to generate the corresponding repomd.xml under /var/www/html/epel/7/x86_64/, which requires installing one command first: createrepo

$ yum install createrepo -y

Then go to the /var/www/html/epel/7/x86_64/ directory and run:

$ createrepo /var/www/html/epel/7/x86_64
Spawning worker 0 with 1680 pkgs
Spawning worker 1 with 1680 pkgs
Spawning worker 2 with 1679 pkgs
Spawning worker 3 with 1679 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

Why /var/www/html/epel/7/x86_64? Because /etc/yum.repos.d/usdp.repo contains the following:

[usdp-base]
name=usdp-base
baseurl=http://mirrors.ucloud.cn:8000/centos/7/os/x86_64/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

[usdp-updates]
name=usdp-updates
baseurl=http://mirrors.ucloud.cn:8000/centos/7/updates/x86_64/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

[usdp-extras]
name=usdp-extras
baseurl=http://mirrors.ucloud.cn:8000/centos/7/extras/x86_64/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

[usdp-epel]
name=usdp-epel
baseurl=http://mirrors.ucloud.cn:8000/epel/7/x86_64/
failovermethod=priority
enabled=1
gpgcheck=0

From this we can also see that USDP starts an httpd service to act as the repository host; /etc/hosts contains the binding 127.0.0.1 mirrors.ucloud.cn.
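To sanity-check this setup (a quick sketch; port 8000 and the repo path come from the baseurl entries above), confirm the hosts binding and that the local mirror answers:

$ grep mirrors.ucloud.cn /etc/hosts
$ curl -sI http://mirrors.ucloud.cn:8000/epel/7/x86_64/repodata/repomd.xml | head -n 1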

0x02 Installing the parallel transfer tool pssh

$ wget https://files.pythonhosted.org/packages/60/9a/8035af3a7d3d1617ae2c7c174efa4f154e5bf9c24b36b623413b38be8e4a/pssh-2.3.1.tar.gz
$ mkdir -p /usr/local/pssh
$ tar xf pssh-2.3.1.tar.gz -C /usr/local/pssh
$ cd /usr/local/pssh/pssh-2.3.1/
$ python setup.py install

If you are in a Python 3 environment and hit ModuleNotFoundError: No module named 'version', see my other post: "PSSH reports a missing 'version' module under Python 3".

0x03 The installer requires the root password

During initial configuration, passwordless SSH had been set up between the servers, but installation still failed. From other people's Q&A in the usdp community, the official answer is that only password-based installation is currently supported; passwordless (key-based) installation does not work 😖 (see the thread "usdp使用ssh密钥部署失败").

Reconfigure the repair.properties and repair-host-info.properties files and put the root user's password into them.

0x04 ERROR: ‘10.20.210.49’ set libxslt devel failed

Running the install command manually, yum install libxslt-devel, produces the error below.

Error:  Multilib version problems found. This often means that the root
cause is something else and multilib version checking is just
pointing out that there is a problem. Eg.:

1. You have an upgrade for libxml2 which is missing some
dependency that another package requires. Yum is trying to
solve this by installing an older version of libxml2 of the
different architecture. If you exclude the bad architecture
yum will tell you what the root cause is (which package
requires what). You can try redoing the upgrade with
--exclude libxml2.otherarch ... this should give you an error
message showing the root cause of the problem.

2. You have multiple architectures of libxml2 installed, but
yum can only see an upgrade for one of those architectures.
If you don't want/need both architectures anymore then you
can remove the one with the missing update and everything
will work.

3. You have duplicate versions of libxml2 installed already.
You can use "yum check" to get yum show these errors.

...you can also use --setopt=protected_multilib=false to remove
this checking, however this is almost never the correct thing to
do as something else is very likely to go wrong (often causing
much more problems).

Protected multilib versions: libxml2-2.9.1-6.el7.5.i686 != libxml2-2.9.1-6.el7_9.6.x86_64

Appending --setopt=protected_multilib=false to the command above and installing libxslt-devel again yields the following error:

Transaction check error:
package libxml2-2.9.1-6.el7_9.6.x86_64 (which is newer than libxml2-2.9.1-6.el7.5.i686) is already installed

That leaves uninstalling libxml2 and then installing libxslt-devel again, but because of dependencies it cannot be removed:

$ yum remove libxml2

So remove the package with rpm instead: rpm -e --nodeps libxml2. The package is indeed gone now, however:

$ yum install libxslt-devel
There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:

libxml2.so.2: cannot open shared object file: No such file or directory

Please install a package which provides this module, or
verify that the module is installed correctly.

It's possible that the above module doesn't match the
current version of Python, which is:
2.7.5 (default, Oct 14 2020, 14:45:30)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

If you cannot solve this problem yourself, please go to
the yum faq at:
http://yum.baseurl.org/wiki/Faq


GG.

Copy a libxml2.so file from another server into /usr/lib64/ on the 10.20.210.49 machine:

# run on the 10.20.210.41 machine
$ find / -name libxml2.so
$ cd /usr/lib64/
$ ll | grep libxml2
$ scp libxml2.so.2.9.1 root@10.20.210.49:/usr/lib64/

# run on the 10.20.210.49 machine
$ ln -s libxml2.so.2.9.1 libxml2.so
$ ln -s libxml2.so.2.9.1 libxml2.so.2

Re-run the initialization and the problem is solved.
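In hindsight, a less destructive route (a sketch, not verified on this cluster; it assumes the 32-bit package is not actually needed) would have been to remove only the stale i686 package flagged by the multilib check rather than libxml2 itself:

$ rpm -qa libxml2                 # list every installed version/architecture of libxml2
$ yum remove libxml2.i686 -y      # drop only the 32-bit copy the multilib check complains about
$ yum install libxslt-devel -y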

0x05 usdp-server and mysql deployed on different nodes

When usdp-server and mysql are deployed on different server nodes, the repair-stage script does not, while copying data, sync the files under /opt/usdp-srv/usdp/sql to the node where mysql is installed, so usdp-server cannot start.

After repair finishes, scp a copy of the data from the usdp-server node to the mysql node, then log in to mysql and, in the mysql console, run use db_udp followed by source init_db_udp.sql to initialize the database.
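A minimal sketch of that fix, assuming the mysql node is reachable as mysql-node (hypothetical hostname) and the paths and file names are the ones mentioned above:

# on the usdp-server node
$ scp -r /opt/usdp-srv/usdp/sql root@mysql-node:/tmp/

# on the mysql node
$ mysql -uroot -p
mysql> use db_udp;
mysql> source /tmp/sql/init_db_udp.sql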

0x06 Commenting out part of /etc/hosts

Note: before installing, comment out the following two lines in /etc/hosts on every service node, otherwise they will cause NameNode communication problems:

#127.0.0.1   pdh01  localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

Concretely, the "[HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录" step got stuck, and the troubleshooting took quite a few twists and turns.

(Screenshot: HDFS add-directory page)

We located the corresponding script and found it runs a command along these lines:

path1=/user
version=2.0.0.0
su -s /bin/bash hadoop -c "/srv/udp/$version/hdfs/bin/hdfs dfs -mkdir -p $path1"

On the node that is running the command, switch to the hadoop user and run hdfs dfs -ls /; you will find that port 8020 on the NameNode is unreachable. The error looks like this:

Last login: Thu Jul 28 16:19:10 CST 2022
-bash-4.2$ hdfs dfs -ls /
2022-07-28 16:21:11 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: java.net.ConnectException: Call From pdh01/10.20.210.41 to ds04:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over ds04/10.20.210.48:8020 after 1 failover attempts. Trying to failover after sleeping for 774ms.
2022-07-28 16:21:12 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1952)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1423)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3085)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1154)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:966)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over ds03/10.20.210.47:8020 after 2 failover attempts. Trying to failover after sleeping for 2547ms.

But the error actually returned in the execution UI is the one below; you need to click into **[执行][HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录** to see it:

2022-07-27 08:09:58 [AsyncTask] Task Started: [HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录
TaskInfo:
[
hostname: pdh01,
ipv4: 10.20.210.41,
ipv6: null,
name: ChmodHDFS777Task,
desc: [HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录,
exec: chmod-hdfs.sh,
timeout: null,
args: [/user, /tmp, /proya-base/, /proya-base/tmp, /proya-base/user, 2.0.0.0],
interactArgs: null,
state: RUNNING,
skippable: false
]
2022-07-27 08:09:58 [AsyncTask] AgentUrl: http://pdh01:8001/api/v1/udp/agent/exec?

t=059d12bacdd017f2def577eaf51f7550&r=e9a0e54c7c3be8820eddd6d10f3d92f9&s=f34874a662ff4d4f038621498f0cd33f0923dd5e535a63f7b3d967de99f07395
2022-07-27 08:29:58 [AsyncTask] Task Failed: [HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录
TaskInfo:
[
hostname: pdh01,
ipv4: 10.20.210.41,
ipv6: null,
name: ChmodHDFS777Task,
desc: [HDFS] 为 pdh01 HDFS 添加 /tmp,/user 目录,
exec: chmod-hdfs.sh,
timeout: null,
args: [/user, /tmp, /proya-base/, /proya-base/tmp, /proya-base/user, 2.0.0.0],
interactArgs: null,
state: FAILED,
skippable: false
]
null
2022-07-27 08:29:58 [AsyncTask] org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://pdh01:8001/api/v1/udp/agent/exec": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:751) at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:677) at org.springframework.web.client.RestTemplate.postForObject(RestTemplate.java:421) at cn.ucloud.udp.async.task.impl.service.hdfs.ChmodHDFS777Task.execute(ChmodHDFS777Task.java:42) at cn.ucloud.udp.async.task.AbstractTask.run(AbstractTask.java:206) at cn.ucloud.udp.async.task.AbstractTask.call(AbstractTask.java:192) at cn.ucloud.udp.async.task.AbstractTask.call(AbstractTask.java:68) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at org.springframework.http.client.HttpComponentsStreamingClientHttpRequest.executeInternal(HttpCompone

Switch to the hadoop user with su - hadoop, then run hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2 on the two NameNode nodes:

-bash-4.2$ hdfs haadmin -getServiceState nn1
standby
-bash-4.2$ hdfs haadmin -getServiceState nn2
2022-07-28 16:30:45 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ds04/10.20.210.48:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From ds03/10.20.210.47 to ds04:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
-bash-4.2$

We can see that port 8020 on ds04, i.e. namenode2, is unreachable. This problem is described on the Apache Hadoop website (unfortunately I did not save the link).

Now comment out the /etc/hosts entries as described at the beginning, then restart the NameNode on the ds04 server:

$ cd /opt/usdp-srv/srv/udp/2.0.0.0/hdfs/sbin/
$ bash hadoop-daemon.sh stop namenode
$ bash hadoop-daemon.sh start namenode

Once the NameNode has restarted, run the hdfs haadmin -getServiceState commands again to check both NameNodes:

-bash-4.2$ hdfs haadmin -getServiceState nn2
standby
-bash-4.2$ hdfs haadmin -getServiceState nn1
active

Everything is fine now, which can be verified with hdfs dfs -ls /.

(Screenshot: hdfs dfs -ls /)

0x07 ZKFC automatic failover configuration problem

This problem is also caused by the /etc/hosts entries discussed above, but it shows up differently.

If port 8020 is still unreachable after restarting the NameNode, hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2 will show that both NameNode1 and NameNode2 are standby. We try to fix this with the command below.

-bash-4.2$ hdfs zkfc -formatZK   # note: run this as the hadoop user

It failed. I did not save the error log at the time, but an almost identical one found online is shown below (the failure is to be expected; if it had not failed, we would not have ended up with two standby NameNodes):

16/06/27 13:41:11 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=node1:2181,node2:2181,node3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@125a6d70
16/06/27 13:41:11 INFO zookeeper.ClientCnxn: Opening socket connection to server node1/192.168.245.11:2181. Will not attempt to authenticate using SASL (unknown error)
16/06/27 13:41:11 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
16/06/27 13:41:11 INFO zookeeper.ClientCnxn: Opening socket connection to server node2/192.168.245.12:2181. Will not attempt to authenticate using SASL (unknown error)
16/06/27 13:41:11 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
16/06/27 13:41:11 INFO zookeeper.ClientCnxn: Opening socket connection to server node3/192.168.245.13:2181. Will not attempt to authenticate using SASL (unknown error)
16/06/27 13:41:11 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Judging from the log, the ZooKeeper connection is being refused. Go to the zk install directory and check its status (after ruling out the firewall):


[root@ds01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)

# ds02 node
[root@ds02 ~]#
[root@ds02 ~]# cd /opt/usdp-srv/srv/udp/2.0.0.0/zookeeper
[root@ds02 zookeeper]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

# ds03 node
[root@ds03 bin]# zkServer.sh restart
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@ds03 bin]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Mode: leader

So zk is indeed down. The root cause is /etc/hosts, as described earlier: after commenting out those lines and restarting zk on every node, the problem goes away.

[root@ds02 zookeeper]# zkServer.sh restart
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@ds02 zookeeper]#
[root@ds02 zookeeper]#
[root@ds02 zookeeper]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /srv/udp/2.0.0.0/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[root@ds02 zookeeper]#

Now run the automatic failover format command again.

[root@ds02 zookeeper]# su - hadoop
Last login: Thu Jul 28 14:28:34 CST 2022
-bash-4.2$ hdfs zkfc -formatZK
2022-07-28 14:29:08 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: host = ds02/10.20.210.46
STARTUP_MSG: args = [-formatZK]
STARTUP_MSG: version = 3.1.1
STARTUP_MSG: classpath = /opt/usdp-srv/srv/udp/2.0.0.0/hdfs/etc/hadoop:/opt/usdp-srv/srv/udp/2.0.0.0/hdfs/share/hadoop/common/lib/accessors-smart-1.2.jar:/opt/usdp-srv/srv/udp/2.0.0.0/hdfs/share/ha 。。。。
。。。。
STARTUP_MSG: build = Unknown -r Unknown; compiled by 'hadoop' on 2020-11-15T04:36Z
STARTUP_MSG: java = 1.8.0_202
************************************************************/
2022-07-28 14:29:09 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
===============================================
The configured parent znode /hadoop-ha/proya-base already exists.
Are you sure you want to clear all failover information from
ZooKeeper?
WARNING: Before proceeding, ensure that all HDFS services and
failover controllers are stopped!
===============================================
Proceed formatting /hadoop-ha/proya-base? (Y or N) y
2022-07-28 14:29:12 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0x30004a09f880001
2022-07-28 14:29:12 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at ds02/10.20.210.46
************************************************************/

Final verification:

-bash-4.2$ hdfs haadmin -getServiceState nn2
standby
-bash-4.2$ hdfs haadmin -getServiceState nn1
active
-bash-4.2$
-bash-4.2$
-bash-4.2$ jps
96594 HttpFSServerWebServer
419408 DFSZKFailoverController
354653 NameNode
421882 Jps
354648 JournalNode
354635 DataNode

0x08 DataNode node installation failure

(Screenshot: DataNode node installation failure)

The error information shows that the service on port 8001 timed out. We already know 8001 is the agent, so check the agent on that node. It turns out the agent is not running, even though it should have been started as a service.
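A few quick checks on the failing node (a sketch; only standard tools are used here):

$ ss -lntp | grep 8001            # is anything listening on the agent port?
$ jps -l                          # is the agent JVM among the running Java processes?
$ free -m                         # is there enough memory left for it to start?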

Restarting it manually fails with Cannot allocate memory:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000ce000000, 838860800, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 838860800 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/usdp-srv/usdp/bin/hs_err_pid459521.log

OK, so there is no memory left:

[root@ds02 zookeeper]# free -m
total used free shared buff/cache available
Mem: 7821 5710 1872 8 238 1872
Swap: 0 0 0
[root@ds02 zookeeper]#

Ouch. Nothing for it but to start over and re-plan the resources.
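If a proper re-plan cannot happen right away, a temporary swap file (a sketch; the 4 GB size is arbitrary) can at least let the agent JVM start:

$ dd if=/dev/zero of=/swapfile bs=1M count=4096
$ chmod 600 /swapfile
$ mkswap /swapfile
$ swapon /swapfile
$ free -m                         # Swap should no longer show 0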

0x09 USDP cannot fetch configuration when confirming cluster nodes

(Screenshot: USDP cannot fetch configuration when confirming cluster nodes)

As shown in the screenshot above, the symptom is that configuration cannot be fetched when confirming cluster nodes, and the error reads "unable to log in". After investigation, this turns out to be a problem with the pssh installation. What kind of problem? That needs some analysis.

If the server's Python is Python 2, install pssh 2.3.1 with python setup.py install. Note that before installing you need to delete any previously installed pssh binaries; the removal commands are as follows:

[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/pssh
[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/pssh-askpass
[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/pscp
[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/pslurp
[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/pnuke
[root@cyhl pssh-2.3.1]# rm -rf /usr/local/bin/prsync

Also, make sure only Python 2 is present on the system; remove Python 3 entirely:

rpm -qa|grep python3|xargs rpm -ev --allmatches --nodeps 
whereis python3 |xargs rm -frv

If the server's Python is Python 3, make sure the Python 3 version is no higher than 3.2.1, otherwise pssh fails with errors like the one below:

Traceback (most recent call last):
File "/usr/local/bin/pssh", line 118, in <module>
do_pssh(hosts, cmdline, opts)
File "/usr/local/bin/pssh", line 71, in do_pssh
manager = Manager(opts)
File "/usr/local/lib/python3.6/site-packages/psshlib/manager.py", line 42, in __init__
self.iomap = IOMap()
File "/usr/local/lib/python3.6/site-packages/psshlib/manager.py", line 215, in __init__
signal.set_wakeup_fd(wakeup_writefd)
ValueError: the fd 4 must be in non-blocking mode
[WARN] Could not import version package in /usr/local/lib/python3.6/site-packages/psshlib/cli.py
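Before re-running the deployment, it helps to confirm which interpreter the installed pssh script actually uses (a quick sketch; the paths are those produced by python setup.py install above):

$ which pssh
$ head -n 1 "$(which pssh)"       # the shebang shows which Python pssh will run under
$ python -V; python3 -V 2>/dev/null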