1. Problem Description
Hadoop version 2.6.5, Flink version 1.11.6
Background: the Standalone cluster was set up without any problems and started normally.
Setting up Flink on YARN also went smoothly.
But setting up Flink on YARN HA kept failing.
2. Flink on YARN HA Configuration
1. Add the following configuration to yarn-site.xml:
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>10</value>
</property>
This sets how many times YARN will automatically retry the ApplicationMaster of a submitted application after it fails (there is a matching Flink-side setting, sketched after this list).
2. Modify the flink-conf.yaml file under FLINK_HOME/conf:
Be sure to read to the end, because this configuration has a problem!!!
high-availability: zookeeper
high-availability.storageDir: hdfs://node02:9000/flink/ha/   # here I am using node02
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
3. Download the Hadoop support jar and copy it to FLINK_HOME/lib on every node. This Flink version no longer bundles the jars for talking to Hadoop, so you have to add them yourself.
Download link:
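One related setting worth noting (a minimal sketch, not part of the original setup): Flink's own yarn.application-attempts option controls how many ApplicationMaster attempts Flink requests, and it cannot usefully exceed the yarn.resourcemanager.am.max-attempts value configured above. Assuming the value 10 from yarn-site.xml, the flink-conf.yaml counterpart would look like:
# hypothetical addition to flink-conf.yaml; keep it <= yarn.resourcemanager.am.max-attempts
yarn.application-attempts: 10
With HA enabled, this is what allows a failed JobManager container to be restarted by YARN instead of taking the whole session down.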
3. Troubleshooting Walkthrough
First, look at the command-line output:
2022-07-05 20:43:01,918 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Error while running the Flink session.
org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:392) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:636) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$4(FlinkYarnSessionCli.java:895) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:895) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1657020068308_0004 failed 2 times due to AM Container for appattempt_1657020068308_0004_000004 exited with exitCode: 1
For more detailed output, check application tracking page:http://node05:8088/proxy/application_1657020068308_0004/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1657020068308_0004_04_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is the error on the command line. Roughly, it says the container failed to launch, but it does not explain why. For the details it points to the tracking page: http://node05:8088/proxy/application_1657020068308_0004/
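Besides the tracking page, the aggregated container logs can usually be pulled straight from the command line (assuming YARN log aggregation is enabled; the application id is the one from the error above):
yarn logs -applicationId application_1657020068308_0004
This dumps the stdout/stderr and log files of every container of the application, which is often faster than clicking through the web UI.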
Check the container logs
They contain nothing useful either; it is the same as the command-line output.
But you can see that launching this container was attempted several times, so the log of each attempt can tell us why every attempt failed.
Check the retry logs
There is an error here, and my first instinct was to look at this log:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1657020068308_0004/filecache/15/log4j-slf4j-impl-2.16.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
This is an SLF4J binding conflict: two bindings were found on the classpath. So I located the two jars and went through the usual renaming and replacing, and the error did indeed go away, but the cluster still would not start!!!
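As a rough sketch of that step (the jar path comes straight from the SLF4J messages above; which binding you keep depends on your own setup), moving the Hadoop-side binding out of the way on the affected node looks something like:
mv /opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar /opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar.bak
As the rest of this post shows, though, this warning was not the real cause of the failure.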
It took me quite a while to realize I had skipped one log. Careless. A gut reaction can easily lead you astray.
Check the full jobmanager.log
The page does not show all of it; you have to click through to the full log. At first glance everything looks fine, until this shows up:
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnSessionClusterEntrypoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:577) [flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:82) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: java.net.ConnectException: Call From node04/192.168.76.96 to node03:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_152]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_152]
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1474) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1401) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy26.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_152]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_152]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy27.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2742) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2713) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:172) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.FileSystemBlobStore.<init>(FileSystemBlobStore.java:64) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createFileSystemBlobStore(BlobUtils.java:98) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createBlobStoreFromConfig(BlobUtils.java:76) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:115) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createHaServices(ClusterEntrypoint.java:335) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:293) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:223) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:177) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:174) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
... 2 more
Caused by: java.net.ConnectException: Connection refused
Found the problem!! It is a failure to connect to HDFS, because the HA setup references the HDFS address, the ZooKeeper address, and so on.
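A quick way to confirm this (a sketch, assuming the commands are run on one of the cluster nodes): ask HDFS for the NameNode RPC addresses it is actually configured with and compare them to the node03:9000 address Flink tried to reach:
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2
If the host:port in high-availability.storageDir does not match a running NameNode, the mkdirs call in the stack trace above will fail with a connection refused just like the one here.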
4. Solution
high-availability: zookeeper
high-availability.storageDir: hdfs://mycluster/flink/ha/
Right here is the key change: with HDFS HA, a fixed IP + port is not a reliable way to reach HDFS, so use the nameservice instead.
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
Because we have the following configuration in hdfs-site.xml, the address has to be written as mycluster, and the cluster then resolves the active NameNode's address for us:
<!-- nameservice used to look up the NameNodes, a key-value mapping -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<!-- the concrete NameNode addresses -->
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>node01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>node02:8020</value>
</property>
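Before restarting the YARN session, the nameservice path can be sanity-checked from the shell (assuming HDFS is up and the directory matches your storageDir):
hdfs dfs -mkdir -p hdfs://mycluster/flink/ha/
hdfs dfs -ls hdfs://mycluster/flink/
If these commands work from every node that may host the JobManager, the FileSystemBlobStore initialization from the stack trace above should succeed as well.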
5. Summary
A cluster has a lot of configuration and a lot of logs, so troubleshooting is fairly involved. Don't give up; keep digging and you will find it. This is especially true when you are learning by building things yourself: many configuration guides online are simply wrong, or may just conflict with the HDFS and YARN clusters you set up earlier. Working through the errors across all the components once is itself a big step forward!!!