
Python - Group dataframe and sample n rows with equal probability between groups?

2024-03-13 16:00:05

I have a pandas dataframe like this:

     ID  Value
0     a     2
1     a     4
2     b     6
3     c     8
4     c    10
5     c    12

I would like to sample equally from the ID groups. I know I can group the dataframe by ID and then specify the number of rows I want to sample from each group, like this:

df.groupby("ID").sample(n=2, replace=True)

However, I only want the probability of sampling from each group to be the same; the number of rows drawn from each group does not have to be identical.

Thanks in advance.

Solution:

If you want to sample N rows with roughly the same probability of drawing from each group, you could oversample each group and then sample again from the result:

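import math

N = 4  # total number of rows to return

# Draw ceil(N / number of groups) rows from every group (with replacement),
# then keep N of those rows at random
out = (df.groupby('ID').sample(n=math.ceil(N/df['ID'].nunique()), replace=True)
         .sample(N)
      )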

Example output:

  ID  Value
2  b      6
2  b      6
4  c     10
1  a      4

With N = 10:

  ID  Value
0  a      2
2  b      6
5  c     12
3  c      8
1  a      4
5  c     12
2  b      6
1  a      4
1  a      4
2  b      6

Proportion of each ID in the sample with N = 100:

ID
b    0.34
a    0.33
c    0.33
Name: proportion, dtype: float64
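
Note that the approach above first draws the same number of rows from every group, so the per-group counts in the result end up nearly equal rather than varying freely. If each draw should instead pick a group independently with equal probability, one possible alternative (not part of the original answer) is to weight every row by the inverse of its group size and call DataFrame.sample once. This is a minimal sketch: the weights name, the rebuilt df, and the random_state values are assumptions added for illustration and reproducibility.

```python
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({"ID": list("aabccc"), "Value": [2, 4, 6, 8, 10, 12]})

# Weight each row by 1 / (size of its group), so the rows of every group
# add up to the same total weight; pandas normalizes the weights itself
weights = 1 / df.groupby("ID")["ID"].transform("count")

N = 4
# random_state is set only to make the sketch reproducible
out = df.sample(n=N, replace=True, weights=weights, random_state=0)

# Empirical check on a large draw: each ID should appear in about 1/3 of rows
props = (df.sample(n=30_000, replace=True, weights=weights, random_state=0)["ID"]
           .value_counts(normalize=True))
print(props)
```

With this variant the number of rows per group varies from draw to draw, while each group has the same chance of being picked on every draw.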