I am new to Spark. I want to use the function take(3) to display the first 3 rows of my CSV file. Reading the file and calling collect() retrieves the records correctly:

raw_data = sc.textFile("day.csv")
raw_data.collect()


Quote:
['Name,Abbreviation,Numeric,Numeric-2', 'Monday,Mon.,1,01', 'Tuesday,Tue.,2,02', 'Wednesday,Wed.,3,03', 'Thursday,Thu.,4,04', 'Friday,Fri.,5,05', 'Saturday,Sat.,6,06', 'Sunday,Sun.,7,07']


What I have tried:

I am reading the CSV file like this:

raw_data = sc.textFile("day.csv")
raw_data.take(3)  # --> raises the error below


The error message is:

Quote:
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
23/05/27 22:29:16 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 17)
org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:192)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:694)
	at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:738)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:690)
	at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:655)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:631)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:588)
	at java.base/java.net.ServerSocket.accept(ServerSocket.java:546)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:179)
	... 15 more
23/05/27 22:29:16 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 17) (DESKTOP-SHEBUHK executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	[same stack trace as above]
23/05/27 22:29:16 ERROR TaskSetManager: Task 0 in stage 11.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Spark\spark-3.4.0-bin-hadoop3\python\pyspark\rdd.py", line 2836, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "C:\Spark\spark-3.4.0-bin-hadoop3\python\pyspark\context.py", line 2319, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "C:\Spark\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py", line 1322, in __call__
  File "C:\Spark\spark-3.4.0-bin-hadoop3\python\pyspark\errors\exceptions\captured.py", line 169, in deco
    return f(*a, **kw)
  File "C:\Spark\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 17) (DESKTOP-SHEBUHK executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	[same stack trace as above]

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:179)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	[same stack trace as above, ending with "... 1 more"]
Caused by: java.net.SocketTimeoutException: Accept timed out
	[same stack trace as above, ending with "... 15 more"]


How can I resolve this error and display the first 3 rows of the CSV file?

1 solution

Look at the first line of the error message:

Quote:
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
org.apache.spark.SparkException: Python worker failed to connect back.


Either you do not have Python installed on your system, or it is installed in a location that Spark cannot find. On Windows, the "Python was not found" message comes from the Microsoft Store stub that sits on the PATH in place of a real python.exe: either install Python properly (and add it to your PATH), or disable the stub under Settings > Manage App Execution Aliases, exactly as the message suggests. That is also likely why collect() worked while take(3) failed: for a plain textFile RDD, collect() is served entirely by the JVM and only deserialized in the driver process, whereas take() ships a small Python function to the executors, which then have to spawn a python.exe worker on the machine; that spawn is what times out here.
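If Python is installed but Spark still cannot find it, you can also point Spark at the interpreter explicitly through the PYSPARK_PYTHON (and optionally PYSPARK_DRIVER_PYTHON) environment variables before the SparkContext is created. Here is a minimal sketch; the path C:\Python311\python.exe is a placeholder, so substitute your actual installation:

import os

# Hypothetical install path; point this at your real python.exe.
# Must be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = r"C:\Python311\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python311\python.exe"

from pyspark import SparkContext

sc = SparkContext("local[*]", "take-demo")
raw_data = sc.textFile("day.csv")
print(raw_data.take(3))  # first three lines, header included
sc.stop()

Once the worker can start, take(3) should return the first three lines of the file you showed: ['Name,Abbreviation,Numeric,Numeric-2', 'Monday,Mon.,1,01', 'Tuesday,Tue.,2,02'].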
 
