Grouping values that are sequence in two columns pyspark

Question

0.00/5 (No votes)

See more:

I have the follow df:

index      initial_range	  final_range
1            1000000	          5999999
2            6000000	          6299999
3            6300000	          6399999
4            6400000	          6499999
5            6600000	          6699999
6            6700000	          6749999
7            6750000	          6799999
8            7000000	          7399999
9            7600000	          7699999
10           7700000	          7749999
11           7750000	          7799999
12           6500000              6549999

See that the 'initial_range' field and 'final_range' field are intervals of abrangency.
When we compare row index 1 and index 2, we observe that the end of the value of the 'final_range' field starts in the next one as sequence+1 in the 'initial_range' index 2. So, in the example ended in 5999999 and started in 6000000 in index 2. I need grouping this cases and return the follow df:

index      initial_range	  final_range       grouping
1            1000000	          5999999    1000000-6549999
2            6000000	          6299999    1000000-6549999
3            6300000	          6399999    1000000-6549999
4            6400000	          6499999    1000000-6549999
5            6600000	          6699999    6600000-6799999
6            6700000	          6749999    6600000-6799999
7            6750000	          6799999    6600000-6799999
8            7000000	          7399999    7000000-7399999
9            7600000	          7699999    7600000-7799999
10           7700000	          7749999    7600000-7799999
11           7750000	          7799999    7600000-7799999
12           6500000              6549999    1000000-6549999

See, that the grouping field there are a news abrangencies, that are the values min(initial) and max(final), until the sequence is broken.

Some details:

- The index 4 for 5 the sequence+1 is broken, so the new 'grouping' change. In other words, every time the sequence is broken a new sequence needs to be written.
- In index 12 the grouping 1000000-6549999 appear again, because the 6500000 is the next number of 6499999 in index 4.

I tried this code:

```
comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
```

But, the logic sequence, don't worked.

Can anyone help me?

What I have tried:

comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']

But, the logic sequence, don't worked.

Can anyone help me?

Posted 9-Feb-23 9:57am

Vivian Scarpioni

Updated 28-Feb-23 6:27am

Add a Solution

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Maciej Los · Answer 1 · 2023-02-28T06:27:00

Well, an algoritm is not so complicated...

Steps to do:
1) You need to sort data by initial_range.
2) Then you need to add 2 extra fields, where ID1 starts from 0 (zero) and ID2 starts from 1 (one). This part of job can be done by using sequence number. See: PySpark: Dataframe Sequence Number[^]

index	initial_range	final_range	ID1	ID2
1		1000000			5999999		0	1
2		6000000			6299999		1	2
3		6300000			6399999		2	3
4		6400000			6499999		3	4
12		6500000			6549999		4	5
5		6600000			6699999		5	6
6		6700000			6749999		6	7
7		6750000			6799999		7	8
8		7000000			7399999		8	9
9		7600000			7699999		9	10
10		7700000			7749999		10	11
11		7750000			7799999		11	12

3) Now, you have to make two left-self-joins based on ID1 and ID2:
a) to the next value (where ID2 = ID1) to identify end of group (EOG)
b) to the prev value (where ID1 = ID2) to identify beginning of group (BOG).
To define EOG and BOG - use sequence number :)
See: left-self-joins.png (43.4 KB)
4) Finally, you have to join a) and b) based on the EOG and BOG.
Result:

index	initial_range	final_range	BOG	EOG
1		1000000			6549999		1	1
5		6600000			6799999		2	2
8		7000000			7399999		3	3
9		7600000			7799999		4	4