I'm working on a project where I need to build a new DataFrame from an existing one: for each row, one of several columns is randomly selected, with probability proportional to that row's value in the column. My current implementation is inefficient, especially on large datasets, and I'm looking for advice on how to speed it up.
Here's a simplified version of what I'm currently doing:
import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)

# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']

# New DataFrame with the same number of rows
new_df = df.copy()

# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""

# Iterate over rows
for index, row in new_df.iterrows():
    # Copy columns not in relevant_col_list
    for column in new_df.columns:
        if column not in relevant_col_list and column != 'iteration_1':
            new_df.at[index, column] = row[column]
    # Randomly select a column from relevant_col_list with probability
    # proportional to that row's value in the column
    probabilities = df[relevant_col_list].iloc[index] / df[relevant_col_list].iloc[index].sum()
    chosen_column = np.random.choice(relevant_col_list, p=probabilities)
    # Write the name of the chosen column in the 'iteration_1' column
    new_df.at[index, 'iteration_1'] = chosen_column

print(new_df)
How can I speed this up?
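For context, here is a vectorized sketch of the direction I've been considering, replacing the per-row loop with inverse-CDF sampling over NumPy arrays (one uniform draw per row, then picking the first column whose cumulative probability exceeds the draw). This is an untested assumption on my part, not working code from my project:

```python
import numpy as np
import pandas as pd

# Same sample data as above
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet'],
}
df = pd.DataFrame(data)
relevant_col_list = ['dog', 'cat', 'parrot']

# Row-wise weights as a 2-D float array, normalized so each row sums to 1
weights = df[relevant_col_list].to_numpy(dtype=float)
probs = weights / weights.sum(axis=1, keepdims=True)

# One uniform draw per row; argmax finds the first column whose
# cumulative probability exceeds the draw (inverse-CDF sampling)
cum = probs.cumsum(axis=1)
draws = np.random.rand(len(df))
choice_idx = (cum > draws[:, None]).argmax(axis=1)

# df.copy() already carries over the non-relevant columns, so the
# inner copy loop from my version disappears entirely
new_df = df.copy()
new_df['iteration_1'] = np.array(relevant_col_list)[choice_idx]
print(new_df)
```

The idea is that all the per-row work collapses into a handful of array operations, so nothing scales with a Python-level loop. I'd welcome corrections if there's a flaw in this approach or a cleaner pandas-native way.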