I want to save data in libsvm format using Python, so I chose PySpark for the task. However, the data I saved is not in libsvm format. Here is my code:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
d = c.map(lambda line: LabeledPoint(line[0],[line[1:]]))
MLUtils.saveAsLibSVMFile(d, "D://spark-warehouse/part1")
When I run print(d.take(3)), it shows the following, which is the LabeledPoint format:
[LabeledPoint(-0.05643994211287995, [0.0142684401451,-0.0072049689441,-0.929159510172,-0.893124442121,-0.996100725507]), LabeledPoint(-0.02315484804630974, [0.0408706166868,-0.00372670807453,-0.891585462256,-0.839681870708,-0.96168588986]), LabeledPoint(0.03039073806078152, [0.0577992744861,-0.00621118012422,-0.898020043313,-0.847917899172,-0.968368717236])]
However, when I check the saved data, it is not in libsvm format. Each row has only a single index, 1, and the whole feature array appears as its value:
''.join(sorted(input(glob("D://spark-warehouse/part2" + "/part-0000*"))))
-0.05643994211287995 1:[ 0.01426844 -0.00720497 -0.92915951 -0.89312444 -0.99610073]\n-0.02315484804630974 1:[ 0.04087062 -0.00372671 -0.89158546 -0.83968187 -0.96168589]\n0.03039073806078152 1:[ 0.05779927 -0.00621118 -0.89802004 -0.8479179 -0.96836872]\n
The correct format should look like this:
-0.05643994211287995 1:0.01426844 2:-0.00720497 3:-0.92915951 4:-0.89312444 5:-0.99610073\n ...
My Python version is 3.5.2 and my PySpark version is 2.0.1. I have searched online for a long time without success. Please help, or suggest how to achieve this.
Note: I want to do SVR (support vector regression), so my labels are real-valued rather than integers.
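To make the difference between the two outputs concrete, here is a minimal plain-Python sketch (no Spark required) of how one row maps to a libsvm line. The helper name to_libsvm_line and the sample row are hypothetical and only for illustration; this is not PySpark's own implementation. It suggests that wrapping the features in an extra list, as in [line[1:]], yields a single entry under index 1, while passing the flat sequence gives one index per feature. If that is the cause, the likely change in the PySpark code is LabeledPoint(line[0], line[1:]), though this is an assumption based on the output shown and has not been verified against Spark 2.0.1.

```python
def to_libsvm_line(label, features):
    """Format one record as 'label 1:v1 2:v2 ...' with 1-based indices.

    Hypothetical helper for illustration; not part of pyspark.
    """
    items = [str(label)]
    for i, value in enumerate(features, start=1):
        items.append("{}:{}".format(i, value))
    return " ".join(items)


# Made-up sample row: first element is the label, the rest are features.
row = [-0.5, 0.25, -0.75, 1.5]

# Flat feature sequence: one index per feature value.
flat = to_libsvm_line(row[0], row[1:])

# Features wrapped in an extra list: the whole list becomes feature 1.
nested = to_libsvm_line(row[0], [row[1:]])

print(flat)
print(nested)
```

The first printed line has indices 1 through 3, while the second has only index 1 followed by the whole list, which mirrors the shape of the saved file in the question.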