EN VI

Pandas - Dynamic Query for Dataframe with multiple conditions and types?

2024-03-13 00:00:07
How to Pandas - Dynamic Query for Dataframe with multiple conditions and types

CONTEXT

I have a DataFrame I'm trying to filter using multiple different conditions / "rulesets". There are 3 different rules with a set of conditions as follows:

Rule_DF    Variable    Operator    Value
Rule1      Gender      ==          M
Rule1      DOB         >           01/01/2000
Rule1      Height      >           72    <-- "inches"
Rule2      Gender      ==          F
Rule2      DOB         >=          06/10/1998
Rule2      Height      <           70
etc...

EXAMPLE DATA

Name    Gender    DOB          Height(in)
Bob     M         01/01/2001   74
Jane    F         01/01/1999   64
etc...

EXPECTED RESULT

rule1_df should contain a list of all the names of people that fit rule 1's conditions. rule2_df should contain a list of all the names of people that fit rule 2's conditions.

CURRENT CODE

I am at the point where the rules are all grouped by the rule name and the query is written as such:

for item in grouped_rule_df:
    query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
    query = "Gender == 'M' & Height > 72 & DOB > '01/01/2000'"
    test = df.query(query)

PROBLEM

The query still returns a list of all the heights. If I run the query with just the height condition, it returns the correct list of people. Can you not combine data types in pandas .query? Or is PandaSQL/another package a better solution here?

Solution:

Assuming rule_df the rules DataFrame and df the data one:

# first ensure the correct column names are used
# and that dates are in a YYYY-MM-DD format
tmp = (df.rename(columns={'Height(in)': 'Height'})
         .assign(DOB=lambda d: pd.to_datetime(d['DOB']).dt.strftime('%Y-%m-%d'))
      )

for rule, g in rule_df.groupby('Rule_DF'):
    # convert dates to YYYY-MM-DD
    # keep numbers as numeric, quote strings
    date = (pd.to_datetime(g['Value'], errors='coerce')
              .dt.strftime('%Y-%m-%d')
            )
    value = (pd.to_numeric(g['Value'], errors='coerce')
               .fillna('"'+date.fillna(g['Value'])+'"')
               .astype(str)
            )

    # craft query string and select
    query = ' & '.join('('+g['Variable']+g['Operator']+value+')')

    out[rule] = tmp.query(query)

Output:

{'Rule1':    Name Gender         DOB  Height
          0   Bob      M  2001-01-01      74,
 'Rule2':    Name Gender         DOB  Height
          1  Jane      F  1999-01-01      64,
}

Queries:

# Rule1
(Gender=="M") & (DOB>"2000-01-01") & (Height>72.0)

# Rule 2
(Gender=="F") & (DOB>="1998-06-10") & (Height<70.0)
Answer

Login


Forgot Your Password?

Create Account


Lost your password? Please enter your email address. You will receive a link to create a new password.

Reset Password

Back to login