Click here to Skip to main content
15,892,005 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I have a cluster made by two slaves and one master and set up and I submit a jar (scala) to the spark master (192.168.1.64):

Java
spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar


After quite sometime running just fine it stops abruptly with the last lines on the terminal being

Java
...
  15/08/19 17:45:24 INFO scheduler.TaskSchedulerImpl: Adding task set 411292.0 with 6 tasks
  15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task of very large size (2762 KB). The maximum recommended task size is 100 KB.
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 411292.0 (TID 1832, 192.168.1.64, PROCESS_LOCAL, 2828792 bytes)
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 411292.0 (TID 1833, 192.168.1.62, PROCESS_LOCAL, 2310009 bytes)
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 411292.0 (TID 1834, 192.168.1.64, PROCESS_LOCAL, 2669188 bytes)
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 411292.0 (TID 1835, 192.168.1.62, PROCESS_LOCAL, 2295676 bytes)
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 411292.0 (TID 1836, 192.168.1.64, PROCESS_LOCAL, 2847786 bytes)
  15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 411292.0 (TID 1837, 192.168.1.64, PROCESS_LOCAL, 2913528 bytes)


and the error occurring at the master log is the following:

Java
...
    15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
    15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
    15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.
    15/08/19 16:15:44 INFO master.Master: Removing app app-20150819160949-0001
    15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
    15/08/19 16:15:44 WARN master.Master: Application testPageRank is still in progress, it may be terminated abnormally.
    ...


Both workers have in their logs something like this

Java
...
   15/08/19 16:15:49 INFO worker.Worker: Executor app-20150819160949-0001/0 finished with state EXITED message Command exited with code 1 exitStatus 1
   15/08/19 16:15:50 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.64:54799] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

and

Java
   ...
15/08/19 16:15:43 INFO worker.Worker: Executor app-20150819160949-0001/1 finished with state EXITED message Command exited with code 1 exitStatus 1
    15/08/19 16:15:43 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.62:53325] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].


respectively. The work/app files contain something like this

Java
...
  15/08/19 16:15:41 INFO executor.Executor: Finished task 1.0 in stage 387758.0 (TID 1803). 1911 bytes result sent to driver
  15/08/19 16:15:41 INFO executor.Executor: Finished task 4.0 in stage 387758.0 (TID 1806). 1911 bytes result sent to driver
  15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_5 locally
  15/08/19 16:15:41 INFO executor.Executor: Finished task 5.0 in stage 387758.0 (TID 1807). 1911 bytes result sent to driver
  15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_3 locally
  15/08/19 16:15:41 INFO executor.Executor: Finished task 3.0 in stage 387758.0 (TID 1805). 1911 bytes result sent to driver
  15/08/19 16:15:44 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
  15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
  15/08/19 16:15:45 INFO storage.DiskBlockManager: Shutdown hook called
  15/08/19 16:15:46 INFO util.Utils: Shutdown hook called



and

Java
...
  15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_0 locally
  15/08/19 16:15:41 INFO executor.Executor: Finished task 2.0 in stage 387758.0 (TID 1804). 1911 bytes result sent to driver
  15/08/19 16:15:41 INFO executor.Executor: Finished task 0.0 in stage 387758.0 (TID 1802). 1911 bytes result sent to driver
  15/08/19 16:15:42 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
  15/08/19 16:15:42 INFO storage.DiskBlockManager: Shutdown hook called
  15/08/19 16:15:42 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
  15/08/19 16:15:42 INFO util.Utils: Shutdown hook called


respectively. There seem to be no other error in hdfs or spark.

I am suspecting that the error lies in the master log, the third line (`15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.`) but I can't figure out why. I tried changing the `spark.akka.heartbeat.interval` to 100 as suggested in some posts but no luck. Anyone would know why it happens and how to solve this? Thanks so much.
Posted
Updated 20-Aug-15 5:18am
v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900