Click here to Skip to main content
15,886,074 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I have the follow df:

index      initial_range	  final_range
1            1000000	          5999999
2            6000000	          6299999
3            6300000	          6399999
4            6400000	          6499999
5            6600000	          6699999
6            6700000	          6749999
7            6750000	          6799999
8            7000000	          7399999
9            7600000	          7699999
10           7700000	          7749999
11           7750000	          7799999
12           6500000              6549999


See that the 'initial_range' field and 'final_range' field are intervals of abrangency.
When we compare row index 1 and index 2, we observe that the end of the value of the 'final_range' field starts in the next one as sequence+1 in the 'initial_range' index 2. So, in the example ended in 5999999 and started in 6000000 in index 2. I need grouping this cases and return the follow df:

index      initial_range	  final_range       grouping
1            1000000	          5999999    1000000-6549999
2            6000000	          6299999    1000000-6549999
3            6300000	          6399999    1000000-6549999
4            6400000	          6499999    1000000-6549999
5            6600000	          6699999    6600000-6799999
6            6700000	          6749999    6600000-6799999
7            6750000	          6799999    6600000-6799999
8            7000000	          7399999    7000000-7399999
9            7600000	          7699999    7600000-7799999
10           7700000	          7749999    7600000-7799999
11           7750000	          7799999    7600000-7799999
12           6500000              6549999    1000000-6549999

See, that the grouping field there are a news abrangencies, that are the values min(initial) and max(final), until the sequence is broken.

Some details:

- The index 4 for 5 the sequence+1 is broken, so the new 'grouping' change. In other words, every time the sequence is broken a new sequence needs to be written.
- In index 12 the grouping 1000000-6549999 appear again, because the 6500000 is the next number of 6499999 in index 4.

I tried this code:

```
comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
```

But, the logic sequence, don't worked.

Can anyone help me?

What I have tried:

comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']


But, the logic sequence, don't worked.

Can anyone help me?
Posted
Updated 28-Feb-23 6:27am

1 solution

Well, an algoritm is not so complicated...

Steps to do:
1) You need to sort data by initial_range.
2) Then you need to add 2 extra fields, where ID1 starts from 0 (zero) and ID2 starts from 1 (one). This part of job can be done by using sequence number. See: PySpark: Dataframe Sequence Number[^]
index	initial_range	final_range	ID1	ID2
1		1000000			5999999		0	1
2		6000000			6299999		1	2
3		6300000			6399999		2	3
4		6400000			6499999		3	4
12		6500000			6549999		4	5
5		6600000			6699999		5	6
6		6700000			6749999		6	7
7		6750000			6799999		7	8
8		7000000			7399999		8	9
9		7600000			7699999		9	10
10		7700000			7749999		10	11
11		7750000			7799999		11	12

3) Now, you have to make two left-self-joins based on ID1 and ID2:
a) to the next value (where ID2 = ID1) to identify end of group (EOG)
b) to the prev value (where ID1 = ID2) to identify beginning of group (BOG).
To define EOG and BOG - use sequence number :)
See: left-self-joins.png (43.4 KB)
4) Finally, you have to join a) and b) based on the EOG and BOG.
Result:
index	initial_range	final_range	BOG	EOG
1		1000000			6549999		1	1
5		6600000			6799999		2	2
8		7000000			7399999		3	3
9		7600000			7799999		4	4
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900