EN VI

slicing pandas columns individually between first and last valid index?

How to slicing pandas columns individually between first and last valid index

Say I have a dataframe df with P columns where there can be missing data at different rows, e.g. first row data available for column 1 but for for 2, and potentially vice-versa for other rows. I want to select, for each column separately, the column data between first and last valid index (which, again, can differ across columns), and check if there are NaNs left. Then, I would like to exclude those columns that meet such condition.

My code works this way:

for i in df.columns:
    df_i = df[i]
    trimemd_i = df_i.loc[df_i.first_valid_index():df_i.last_valid_index()]
    if np.any(trimmed_i.isnull()):
        continue
    else:
        good_data.append(i)
df = df.loc[:, good_data]

The problem is that the loop is slow if I have many columns. Is there a more efficient way to do it, maybe avoiding loops?

Solution:

Assuming such an input, in which you want to remove "C":

     A    B    C
0  NaN  NaN  NaN
1  1.0  1.0  1.0
2  2.0  2.0  NaN
3  3.0  3.0  3.0
4  NaN  4.0  NaN

You can remove the columns with inner NaNs using isna a double cummax to identify the inner NaNs, then boolean indexing with all to select the columns that have no inner NaNs:

tmp = df.notna()

out = df.loc[:, tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()]

Output:

     A    B
0  NaN  NaN
1  1.0  1.0
2  2.0  2.0
3  3.0  3.0
4  NaN  4.0

Intermediates:

# getting NaNs
# tmp
       A      B      C
0  False  False  False
1   True   True   True
2   True   True  False
3   True   True   True
4  False   True  False

# removing inner NaNs
# tmp.cummax() & tmp[::-1].cummax()
       A      B      C
0  False  False  False
1   True   True   True
2   True   True   True
3   True   True   True
4  False   True  False

# identifying inner NaNs (as False)
# tmp.eq(tmp.cummax() & tmp[::-1].cummax())
      A     B      C
0  True  True   True
1  True  True   True
2  True  True  False
3  True  True   True
4  True  True   True

# columns to keep (no inner NaNs)
# tmp.eq(tmp.cummax() & tmp[::-1].cummax()).all()
A     True
B     True
C    False
dtype: bool