Say I have a list of lists where each nested list has two values: a start date of a range and an end date of a range. So something like this:
ranges_list = [
['2020-03-12', '2020-06-12'],
['2020-03-13', '2020-06-13'],
['2020-03-14', '2020-06-14']
]
This represents 3 ranges:
- March 12th 2020 – June 12th 2020
- March 13th 2020 – June 13th 2020
- March 14th 2020 – June 14th 2020
Say I also have some dataframe d, which has multiple columns, one of which is a column called 'occurence_date' that contains datetimes.
Say the dataframe d looks like:
ID LinkID PC occurence_date
10R46 R*1005 8017 2020-03-12
10R46 R*10335 5019 2020-04-15
100R91 R*1005 8017 2020-04-15
10R91 R*243 8870 2020-06-14
I want to group dataframe d using the occurence_date column on the ranges specified in ranges_list.
So, something like:
grouped = d.groupby('occurence_date', ranges=ranges_list)
Obviously, this groupby code is not valid, but it illustrates what I want to do.
At the end of it all, the grouped object should have 3 distinct groups that look like:
group: ('2020-03-12', '2020-06-12')
ID LinkID PC occurence_date
10R46 R*1005 8017 2020-03-12
10R46 R*10335 5019 2020-04-15
100R91 R*1005 8017 2020-04-15
group: ('2020-03-13', '2020-06-13')
ID LinkID PC occurence_date
10R46 R*10335 5019 2020-04-15
100R91 R*1005 8017 2020-04-15
group: ('2020-03-14', '2020-06-14')
ID LinkID PC occurence_date
10R46 R*10335 5019 2020-04-15
100R91 R*1005 8017 2020-04-15
10R91 R*243 8870 2020-06-14
How can I go about accomplishing this?
The following interactive session shows how to get to a point where you have the necessary data to group records together the way you want. There may be a more efficient way, since this iterates over len(d) * len(dranges) pairs, but if you do not have a lot of data, it is a straightforward solution.
>>> d
ID LinkID PC occurence_date
0 10R46 R*1005 8017 2020-03-12
1 10R46 R*10335 5019 2020-04-15
2 100R91 R*1005 8017 2020-04-15
3 10R91 R*243 8870 2020-06-14
>>> dranges
0 1
0 2020-03-12 2020-06-12
1 2020-03-13 2020-06-13
2 2020-03-14 2020-06-14
>>> d['overlaps'] = d.apply(
...     lambda row: [f'{dr[0]} to {dr[1]}'
...                  for _, dr in dranges.iterrows()
...                  if dr[0] <= row['occurence_date'] <= dr[1]],
...     axis=1)
>>> d.explode('overlaps').sort_values('overlaps')
ID LinkID PC occurence_date overlaps
0 10R46 R*1005 8017 2020-03-12 2020-03-12 to 2020-06-12
1 10R46 R*10335 5019 2020-04-15 2020-03-12 to 2020-06-12
2 100R91 R*1005 8017 2020-04-15 2020-03-12 to 2020-06-12
1 10R46 R*10335 5019 2020-04-15 2020-03-13 to 2020-06-13
2 100R91 R*1005 8017 2020-04-15 2020-03-13 to 2020-06-13
1 10R46 R*10335 5019 2020-04-15 2020-03-14 to 2020-06-14
2 100R91 R*1005 8017 2020-04-15 2020-03-14 to 2020-06-14
3 10R91 R*243 8870 2020-06-14 2020-03-14 to 2020-06-14
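To turn the exploded frame into the three groups themselves, one way (a sketch that rebuilds the same `overlaps` column and then does a plain groupby on it) is:

```python
import pandas as pd

d = pd.DataFrame({
    'ID': ['10R46', '10R46', '100R91', '10R91'],
    'LinkID': ['R*1005', 'R*10335', 'R*1005', 'R*243'],
    'PC': [8017, 5019, 8017, 8870],
    'occurence_date': ['2020-03-12', '2020-04-15', '2020-04-15', '2020-06-14'],
})
dranges = pd.DataFrame([
    ['2020-03-12', '2020-06-12'],
    ['2020-03-13', '2020-06-13'],
    ['2020-03-14', '2020-06-14'],
])

# ISO-formatted date strings compare correctly as plain strings,
# so no datetime conversion is needed here.
d['overlaps'] = d.apply(
    lambda row: [f'{dr[0]} to {dr[1]}'
                 for _, dr in dranges.iterrows()
                 if dr[0] <= row['occurence_date'] <= dr[1]],
    axis=1)

# Explode so each row appears once per matching range, then group.
grouped = d.explode('overlaps').groupby('overlaps')
for label, grp in grouped:
    print(label)
    print(grp.drop(columns='overlaps'))
```

Each row is duplicated once per range it falls into, so overlapping ranges come out as separate, complete groups.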
You can group by pd.IntervalIndex:
ranges_list = [
(pd.Timestamp('2020-03-12'), pd.Timestamp('2020-06-12')),
(pd.Timestamp('2020-03-13'), pd.Timestamp('2020-06-13')),
(pd.Timestamp('2020-03-14'), pd.Timestamp('2020-06-14'))
]
idx = pd.IntervalIndex.from_tuples(ranges_list, closed='both')
def in_ranges(x, bins):
    rv = []
    for b in bins:
        if x in b:
            rv.append(b)
    return rv
df['occurence_date'] = pd.to_datetime(df['occurence_date'])  # the `in` test needs Timestamps, not strings
df['groups'] = df['occurence_date'].apply(lambda x: in_ranges(x, idx))
for g in df.explode('groups').groupby('groups'):
    print(g[0])
    print('-' * 80)
    print(g[1][['ID', 'LinkID', 'PC', 'occurence_date']])
    print()
Prints:
[2020-03-12, 2020-06-12]
--------------------------------------------------------------------------------
ID LinkID PC occurence_date
0 10R46 R*1005 8017 2020-03-12
1 10R46 R*10335 5019 2020-04-15
2 100R91 R*1005 8017 2020-04-15
[2020-03-13, 2020-06-13]
--------------------------------------------------------------------------------
ID LinkID PC occurence_date
1 10R46 R*10335 5019 2020-04-15
2 100R91 R*1005 8017 2020-04-15
[2020-03-14, 2020-06-14]
--------------------------------------------------------------------------------
ID LinkID PC occurence_date
1 10R46 R*10335 5019 2020-04-15
2 100R91 R*1005 8017 2020-04-15
3 10R91 R*243 8870 2020-06-14
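Both answers test range membership once per row. If the number of ranges is small, an alternative sketch (not from either answer) is to loop over the ranges instead and build each group with one vectorized boolean mask:

```python
import pandas as pd

d = pd.DataFrame({
    'ID': ['10R46', '10R46', '100R91', '10R91'],
    'LinkID': ['R*1005', 'R*10335', 'R*1005', 'R*243'],
    'PC': [8017, 5019, 8017, 8870],
    'occurence_date': pd.to_datetime(
        ['2020-03-12', '2020-04-15', '2020-04-15', '2020-06-14']),
})
ranges_list = [
    ('2020-03-12', '2020-06-12'),
    ('2020-03-13', '2020-06-13'),
    ('2020-03-14', '2020-06-14'),
]

# One vectorized comparison per range instead of a Python-level
# membership test per row; overlapping ranges are handled naturally
# because each range selects its rows independently.
groups = {}
for start, end in ranges_list:
    mask = d['occurence_date'].between(start, end)  # inclusive on both ends
    groups[(start, end)] = d[mask]
```

This avoids duplicating rows via explode, at the cost of producing a dict of frames rather than a single groupby object.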