
Python - How to Improve Efficiency in Random Column Selection and Assignment in Pandas DataFrame?

2024-03-15 21:00:05

I'm working on a project where I need to create a new DataFrame based on an existing one, where each row is assigned the name of one column randomly selected from a set of columns, with probability proportional to that row's value in each column.

However, my current implementation seems to be inefficient, especially when dealing with large datasets. I'm seeking advice on how to optimize this process for better performance.

Here's a simplified version of what I'm currently doing:

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)

# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']

# New DataFrame with the same number of rows
new_df = df.copy()

# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""

# Iterate over rows
for index, row in new_df.iterrows():
    # Copy columns not in relevant_col_list 
    for column in new_df.columns:
        if column not in relevant_col_list and column != 'iteration_1':
            new_df.at[index, column] = row[column]
    
    # Randomly select a column from relevant_col_list with probability proportional to the number in the column
    probabilities = df[relevant_col_list].iloc[index] / df[relevant_col_list].iloc[index].sum()
    chosen_column = np.random.choice(relevant_col_list, p=probabilities)
    
    # Write the name of the chosen column in the 'iteration_1' column
    new_df.at[index, 'iteration_1'] = chosen_column

print(new_df)

How can I speed this up?

Solution:

You could first rework your DataFrame: select the columns of interest, normalize the weights per row, then take a cumulative sum along the columns.

tmp = (df[relevant_col_list]
       .pipe(lambda x: x.div(x.sum(axis=1), axis=0))
       .cumsum(axis=1).to_numpy()
      )

# cumulated probabilities
array([[0.06666667, 0.4       , 1.        ],
       [0.11111111, 0.44444444, 1.        ],
       [0.14285714, 0.47619048, 1.        ],
       [0.16666667, 0.5       , 1.        ]])

After that, generate one random number between 0 and 1 per row, and identify the first column whose cumulative probability exceeds it with argmax:

r = np.random.random(len(df))

df['iteration_1'] = np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]

Output:

   dog  cat  parrot   owner iteration_1
0    1    5       9    fred      parrot
1    2    6      10     bob      parrot
2    3    7      11     jim         dog
3    4    8      12  jannet         cat
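As an aside (not part of the answer above), an equally vectorized alternative is the Gumbel-max trick: adding independent Gumbel noise to the log-weights and taking a per-row argmax samples each column with probability proportional to its weight, with no normalization or cumsum needed. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet'],
})
relevant_col_list = ['dog', 'cat', 'parrot']

w = df[relevant_col_list].to_numpy(dtype=float)
# Gumbel-max trick: argmax(log(w) + Gumbel noise) picks index i
# with probability w_i / w.sum() within each row
g = np.random.gumbel(size=w.shape)
df['iteration_1'] = np.array(relevant_col_list)[(np.log(w) + g).argmax(axis=1)]
```

A zero weight maps to `log(0) = -inf`, so such a column is simply never chosen, which matches the intended semantics.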

Intermediates:

# for reproducibility
np.random.seed(1)

tmp
# array([[0.06666667, 0.4       , 1.        ],
#        [0.11111111, 0.44444444, 1.        ],
#        [0.14285714, 0.47619048, 1.        ],
#        [0.16666667, 0.5       , 1.        ]])

r = np.random.random(len(df))
# array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01])

(tmp > r[:,None])
# array([[False, False,  True],
#        [False, False,  True],
#        [ True,  True,  True],
#        [False,  True,  True]])

(tmp > r[:,None]).argmax(axis=1)
# array([2, 2, 0, 1])

np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]
# array(['parrot', 'parrot', 'dog', 'cat'], dtype='<U6')
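Putting the steps together, the whole approach can be wrapped in a short helper (the function name `weighted_label` is my own):

```python
import numpy as np
import pandas as pd

def weighted_label(df, cols, col_name='iteration_1'):
    """Append a column holding one of `cols` per row, drawn with
    probability proportional to that row's values in `cols`."""
    out = df.copy()
    # row-normalized weights -> cumulative probabilities per row
    tmp = (out[cols]
           .pipe(lambda x: x.div(x.sum(axis=1), axis=0))
           .cumsum(axis=1).to_numpy())
    r = np.random.random(len(out))
    # first cumulative probability exceeding r selects the column
    out[col_name] = np.array(cols)[(tmp > r[:, None]).argmax(axis=1)]
    return out

df = pd.DataFrame({
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet'],
})

np.random.seed(1)  # for reproducibility
new_df = weighted_label(df, ['dog', 'cat', 'parrot'])
print(new_df['iteration_1'].tolist())
# ['parrot', 'parrot', 'dog', 'cat']
```

Since everything is done with whole-array NumPy operations, this scales far better than the row-by-row `iterrows` loop in the question.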