EN VI

Python - Take the last occurrence of a string value based on the latest date?

How to Python - Take the last occurrence of a string value based on the latest date

There are 3 columns, say: Item, Item type and Sales. Now item type keeps changing frequently. Sometimes an item (say, Apple) might have item type as one thing on a particular date ("Healthy" on Jan 1) but might have something else on another date ("Vegan" on Jan 5). There are more than 500 such items.

Eg dataframe:

Item	Item type	Date	Sales
Apple	Healthy	Jan 1	10
Apple	Vegan	Jan 5	5

The output needs to be:

Item	Item type	Sales
Apple	Vegan	15

(assuming Vegan is the last entry in the data) and the cumulative total sales is 15.

Now if on Jan 7 another entry is made (Apple, Fruit, 2) for Item, item type, sales, the output changes to: Eg dataframe:

Item	Item type	Date	Sales
Apple	Healthy	Jan 1	10
Apple	Vegan	Jan 5	5
Apple	Fruit	Jan 7	2

The required output:

Item	Item type	Sales
Apple	Fruit	17

I tried using the df.groupby .last function but it's only applicable on numerical data.

Solution:

The groupby last function should work in this use case, as long as you can confirm that your data is being entered chronologically you can just do

import pandas

df = pandas.DataFrame(columns=['item', 'type', 'date', 'sales'])
df.loc[0] = ["Apple", "unripe", "Jan 1", 10]
df.loc[1] = ["Apple", "ripe", "Jan 10", 5]
df.loc[2] = ["Orange", "ripe", "Jan 12", 6]
df.groupby(["item"]).agg({'type': 'last', 'date': 'last', 'sales': sum})

That groupby ensures that the latest entries of date and type are used while sales is the sum of all the grouped entries.

The output looks like

        type    date    sales
item            
Apple   ripe    Jan 10  15
Orange  ripe    Jan 12  6