Pandas: Filtering data based on information from multiple dataframes


I am analyzing clinical data and am trying to filter out information in one dataframe based on information in another dataframe.

One of the dataframes lists the dates on which patients come in for treatment:

    import pandas as pd

    dftreatments = pd.DataFrame({
        'patientid': [4, 4, 4, 9, 9, 9, 11, 11, 11],
        'treatmentdate': ['2016-01-01', '2016-01-15', '2016-03-25',
                          '2016-01-01', '2016-01-15', '2016-01-29',
                          '2016-01-01', '2016-03-15', '2016-03-25']})
    dftreatments['treatmentdate'] = pd.to_datetime(dftreatments['treatmentdate'])

       patientid treatmentdate
    0          4    2016-01-01
    1          4    2016-01-15
    2          4    2016-03-25
    3          9    2016-01-01
    4          9    2016-01-15
    5          9    2016-01-29
    6         11    2016-01-01
    7         11    2016-03-15
    8         11    2016-03-25

The other dataframe lists the dates on which patients visit the hospital for a complication:

    dfhospitalvisits = pd.DataFrame({
        'patientid': [4, 4, 9, 11],
        'hospitalvisitdate': ['2016-01-14', '2016-03-10', '2016-01-28', '2016-01-03']})
    dfhospitalvisits['hospitalvisitdate'] = pd.to_datetime(dfhospitalvisits['hospitalvisitdate'])

      hospitalvisitdate  patientid
    0        2016-01-14          4
    1        2016-03-10          4
    2        2016-01-28          9
    3        2016-01-03         11

In our study, we want to exclude hospital visits from our analysis if the patient does not receive treatment for 20 days. We start excluding them at the last treatment before the 20-day gap. E.g., exclude hospital visits for patient 4 after 2016-01-15.

In this example, patient 4's second hospital visit and patient 11's hospital visit would be removed from dfhospitalvisits.

Edit: @Merlin, so far I have used dftreatments.groupby('patientid')['treatmentdate'].diff(), which gives me the gaps in treatment dates grouped by patient. The part where I am stuck is that I don't know how to use the differences in treatment dates > 20 to filter out the values in dfhospitalvisits.
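For illustration, here is a minimal sketch of what that diff() call yields on the sample data. The names gaps and long_gaps are made up for this example. Note that diff() looks backward, so a long gap is attached to the treatment that ends it, while the exclusion rule needs the treatment that starts it.

```python
import pandas as pd

dftreatments = pd.DataFrame({
    'patientid': [4, 4, 4, 9, 9, 9, 11, 11, 11],
    'treatmentdate': pd.to_datetime([
        '2016-01-01', '2016-01-15', '2016-03-25',
        '2016-01-01', '2016-01-15', '2016-01-29',
        '2016-01-01', '2016-03-15', '2016-03-25']),
})

# Time since the previous treatment of the same patient
# (NaT for each patient's first treatment).
gaps = dftreatments.groupby('patientid')['treatmentdate'].diff()

# Rows whose preceding gap is longer than 20 days. These are the treatments
# that END a long gap, not the ones that start it.
long_gaps = dftreatments[gaps > pd.Timedelta(days=20)]
print(long_gaps)
```

On the sample data this selects the 2016-03-25 treatment of patient 4 and the 2016-03-15 treatment of patient 11, i.e. the first treatment after each long gap.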

I suggest the following:

    # Make a sorted dataframe to calculate the time gap before the next treatment
    dftreatments_sorted = dftreatments.sort_values(['patientid', 'treatmentdate'], ascending=False)

    # Calculate the time gap before the next treatment
    # (negative here, because the rows are in descending date order)
    df_diff = (dftreatments_sorted.groupby('patientid').treatmentdate
               .diff(periods=1).rename('gap_before_next_treatment'))

    # Add the time gaps as a new column to the existing dftreatments dataframe
    dftreatments = pd.concat([dftreatments, -df_diff], axis=1, join='inner').sort_index()

    # Join dftreatments and dfhospitalvisits into a new dataframe (df)
    df = dfhospitalvisits.set_index('patientid').join(dftreatments.set_index('patientid'))

    # Select only the combinations where treatmentdate is before the corresponding hospitalvisitdate
    df = df[(df.hospitalvisitdate > df.treatmentdate)]

    # The only treatmentdate that matters is the latest one before hospitalvisitdate
    df = df.reset_index().groupby(['patientid', 'hospitalvisitdate']).max()

    # Now we can filter the hospital visits given the calculated time gap
    df = df[df.gap_before_next_treatment < pd.Timedelta(days=20)].reset_index()[['patientid', 'hospitalvisitdate']]
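As an alternative to the join-based approach above, the rule from the question can also be sketched with one cutoff date per patient. This is only a sketch under a few assumptions: the names max_gap, gap_to_next, cutoff and filtered are made up for the example, a visit on the cutoff date itself is kept, and the time after a patient's final recorded treatment is not counted as a gap.

```python
import pandas as pd

dftreatments = pd.DataFrame({
    'patientid': [4, 4, 4, 9, 9, 9, 11, 11, 11],
    'treatmentdate': pd.to_datetime([
        '2016-01-01', '2016-01-15', '2016-03-25',
        '2016-01-01', '2016-01-15', '2016-01-29',
        '2016-01-01', '2016-03-15', '2016-03-25']),
})
dfhospitalvisits = pd.DataFrame({
    'patientid': [4, 4, 9, 11],
    'hospitalvisitdate': pd.to_datetime(
        ['2016-01-14', '2016-03-10', '2016-01-28', '2016-01-03']),
})

max_gap = pd.Timedelta(days=20)  # assumed threshold, taken from the question

# Sort so that shift(-1) within a patient returns the next treatment date.
t = dftreatments.sort_values(['patientid', 'treatmentdate'])
gap_to_next = t.groupby('patientid')['treatmentdate'].shift(-1) - t['treatmentdate']

# Cutoff per patient: the earliest treatment that is followed by a long gap.
cutoff = (t[gap_to_next > max_gap]
          .groupby('patientid')['treatmentdate'].min()
          .rename('cutoffdate')
          .reset_index())

# Attach the cutoff to every visit; patients without a long gap get NaT.
visits = dfhospitalvisits.merge(cutoff, on='patientid', how='left')

# Keep visits of patients without a cutoff, or visits up to the cutoff date.
keep = visits['cutoffdate'].isna() | (visits['hospitalvisitdate'] <= visits['cutoffdate'])
filtered = visits.loc[keep, ['patientid', 'hospitalvisitdate']].reset_index(drop=True)
print(filtered)
```

On the sample data this keeps patient 4's visit on 2016-01-14 and patient 9's visit on 2016-01-28, and drops patient 4's second visit and patient 11's visit, which matches the expected result described in the question.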
