Say I have a dataframe df
with P columns, where data can be missing at different rows in different columns, e.g. the first row might be available for column 1 but not for column 2, and vice versa for other rows. For each column separately, I want to select the data between that column's first and last valid index (which, again, can differ across columns) and check whether any NaNs remain in that span. I then want to exclude the columns that still contain NaNs.
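For example (illustrative data; the column names and values are made up):

import numpy as np
import pandas as pd

# 'a' has a NaN inside its valid span -> should be dropped;
# 'b' only has leading/trailing NaNs -> should be kept.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, 2.0, 3.0, np.nan],
})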
My code works this way:
good_data = []  # columns whose trimmed span is NaN-free
for i in df.columns:
    df_i = df[i]
    # keep only the span between the first and last valid index
    trimmed_i = df_i.loc[df_i.first_valid_index():df_i.last_valid_index()]
    if np.any(trimmed_i.isnull()):
        continue  # NaNs inside the valid span -> drop this column
    good_data.append(i)
df = df.loc[:, good_data]
The problem is that the loop is slow if I have many columns. Is there a more efficient way to do it, maybe avoiding loops?
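For reference, one vectorized idea I've been playing with (a rough sketch; I believe ffill/bfill are non-null exactly within each column's first-to-last valid span, but I'm not sure it treats all-NaN columns the same way as the loop does):

# Mask marking, per column, the positions between the first and last valid index
inside_span = df.ffill().notna() & df.bfill().notna()
# Per-column boolean: does a NaN occur inside that span?
has_internal_nan = (df.isna() & inside_span).any()
df = df.loc[:, ~has_internal_nan]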