1. Problem Description
Hadoop version 2.6.5, Flink version 1.11.6
Background: the Standalone cluster was set up without any problems and started normally.
Setting up Flink on YARN also went smoothly.
But setting up Flink on YARN HA kept failing.
2. Flink on YARN HA Configuration
1. Add the following configuration to yarn-site.xml:
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>10</value>
</property>
This sets how many times YARN will automatically retry the ApplicationMaster of a submitted application after it fails (there is a matching Flink-side setting, sketched after this list).
2. Modify the flink-conf.yaml file under FLINK_HOME/conf:
Be sure to read to the end, because this configuration has a problem!!!
high-availability: zookeeper
high-availability.storageDir: hdfs://node02:9000/flink/ha/   # here I am using node02
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
3. Download the Hadoop support jar and copy it to FLINK_HOME/lib on every node. This Flink version no longer bundles the jars for talking to Hadoop, so you have to add them yourself.
Download link:
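One related setting worth noting (a minimal sketch, not part of the original setup): Flink's own yarn.application-attempts option controls how many ApplicationMaster attempts Flink requests, and it cannot usefully exceed the yarn.resourcemanager.am.max-attempts value configured above. Assuming the value 10 from yarn-site.xml, the flink-conf.yaml counterpart would look like:
# hypothetical addition to flink-conf.yaml; keep it <= yarn.resourcemanager.am.max-attempts
yarn.application-attempts: 10
With HA enabled, this is what allows a failed JobManager container to be restarted by YARN instead of taking the whole session down.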
3. Troubleshooting Walkthrough
First, look at the command-line output:
2022-07-05 20:43:01,918 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Error while running the Flink session.
org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:392) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:636) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$4(FlinkYarnSessionCli.java:895) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:895) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1657020068308_0004 failed 2 times due to AM Container for appattempt_1657020068308_0004_000004 exited with exitCode: 1
For more detailed output, check application tracking page:http://node05:8088/proxy/application_1657020068308_0004/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1657020068308_0004_04_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is the error on the command line. Roughly, it says the container failed to launch, but it does not explain why. For the details it points to the tracking page: http://node05:8088/proxy/application_1657020068308_0004/
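Besides the tracking page, the aggregated container logs can usually be pulled straight from the command line (assuming YARN log aggregation is enabled; the application id is the one from the error above):
yarn logs -applicationId application_1657020068308_0004
This dumps the stdout/stderr and log files of every container of the application, which is often faster than clicking through the web UI.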
Check the container logs
They contain nothing useful either; it is the same as the command-line output.
But you can see that launching this container was attempted several times, so the log of each attempt can tell us why every attempt failed.
Check the retry logs
There is an error here, and my first instinct was to look at this log:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1657020068308_0004/filecache/15/log4j-slf4j-impl-2.16.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
This is an SLF4J binding conflict: two bindings were found on the classpath. So I located the two jars and went through the usual renaming and replacing, and the error did indeed go away, but the cluster still would not start!!!
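As a rough sketch of that step (the jar path comes straight from the SLF4J messages above; which binding you keep depends on your own setup), moving the Hadoop-side binding out of the way on the affected node looks something like:
mv /opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar /opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar.bak
As the rest of this post shows, though, this warning was not the real cause of the failure.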
It took me quite a while to realize I had skipped one log. Careless. A gut reaction can easily lead you astray.
Check the full jobmanager.log
The page does not show all of it; you have to click through to the full log. At first glance everything looks fine, until this shows up:
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnSessionClusterEntrypoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:577) [flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:82) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: java.net.ConnectException: Call From node04/192.168.76.96 to node03:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_152]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_152]
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1474) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1401) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy26.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_152]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_152]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy27.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2742) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2713) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:172) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.FileSystemBlobStore.<init>(FileSystemBlobStore.java:64) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createFileSystemBlobStore(BlobUtils.java:98) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createBlobStoreFromConfig(BlobUtils.java:76) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:115) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createHaServices(ClusterEntrypoint.java:335) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:293) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:223) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:177) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:174) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
... 2 more
Caused by: java.net.ConnectException: Connection refused
Found the problem!! It is a failure to connect to HDFS, because the HA setup references the HDFS address, the ZooKeeper address, and so on.
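A quick way to confirm this (a sketch, assuming the commands are run on one of the cluster nodes): ask HDFS for the NameNode RPC addresses it is actually configured with and compare them to the node03:9000 address Flink tried to reach:
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2
If the host:port in high-availability.storageDir does not match a running NameNode, the mkdirs call in the stack trace above will fail with a connection refused just like the one here.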
4. Solution
high-availability: zookeeper
high-availability.storageDir: hdfs://mycluster/flink/ha/
Right here is the key change: with HDFS HA, a fixed IP + port is not a reliable way to reach HDFS, so use the nameservice instead.
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
Because we have the following configuration in hdfs-site.xml, the address has to be written as mycluster, and the cluster then resolves the active NameNode's address for us:
<!-- nameservice used to look up the NameNodes, a key-value mapping -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<!-- the concrete NameNode addresses -->
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>node01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>node02:8020</value>
</property>
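Before restarting the YARN session, the nameservice path can be sanity-checked from the shell (assuming HDFS is up and the directory matches your storageDir):
hdfs dfs -mkdir -p hdfs://mycluster/flink/ha/
hdfs dfs -ls hdfs://mycluster/flink/
If these commands work from every node that may host the JobManager, the FileSystemBlobStore initialization from the stack trace above should succeed as well.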
5. Summary
A cluster has a lot of configuration and a lot of logs, so troubleshooting is fairly involved. Don't give up; keep digging and you will find it. This is especially true when you are learning by building things yourself: many configuration guides online are simply wrong, or may just conflict with the HDFS and YARN clusters you set up earlier. Working through the errors across all the components once is itself a big step forward!!!