
Python - How to select rows based on a specified set of values in one column which share a value in another column with pandas?

2024-03-12 21:00:10

In reality I have a dataset with 5281 rows and ~40 columns. From this I need to select the rows with a certain set of values in one column that share a duplicated value in another column.

To simplify, I broke it down to a df with 2 columns, A and B.

import pandas as pd

d = {'A': [2, 1, 2, 2, 1, 1, 3, 1], 'B': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']}
df = pd.DataFrame(d)

The image in the original post shows the df with the rows I want marked: I want the rows where the set A = (1, 2) shares the same value in B.

A little context: I need to drop rows that have duplicates in one column (here col B), but only if the duplicates have a certain set of values in another column (here the set 1, 2 in A). I would like to do all of this directly on the df.

Solution:

Try this:

In this solution, df.groupby('B')['A'].agg(set) returns a Series indexed by the values of B, whose values are the sets of A values found in each group. We then check whether each set is equal to the target set. Lastly, we map that boolean Series back onto df['B'], so it has one entry per original row, and use it to select just the rows we need.

s = {1,2}
df.loc[df['B'].map(df.groupby('B')['A'].agg(set).eq(s))]

Output:

   A  B
0  2  a
1  1  a
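
If the one-liner is hard to follow, here is a rough step-by-step sketch of the same idea on the sample df. The names sets_per_b, matches, row_mask, selected and remaining are made up for illustration, and the per-group comparison uses a plain Python == inside apply rather than .eq, which should behave the same for this data. Since the question is really about dropping those rows, the last line inverts the mask.

```python
import pandas as pd

d = {'A': [2, 1, 2, 2, 1, 1, 3, 1], 'B': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']}
df = pd.DataFrame(d)
s = {1, 2}  # target set of A values

# Step 1: one set of A values per distinct value of B
sets_per_b = df.groupby('B')['A'].agg(set)

# Step 2: True for each B whose set of A values equals the target set
matches = sets_per_b.apply(lambda values: values == s)

# Step 3: broadcast the per-group result back to one boolean per row
# (astype(bool) is a precaution so that ~ below is a logical NOT)
row_mask = df['B'].map(matches).astype(bool)

selected = df.loc[row_mask]     # rows whose B group has exactly the A values {1, 2}
remaining = df.loc[~row_mask]   # the df with those rows dropped
```

Note that the set comparison is exact: a group whose A values are {1, 2, 3} would not match. If "contains at least 1 and 2" is what you need, s.issubset(values) in step 2 would be the thing to try instead.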