We have to compute percentiles for 100 columns in an data frame. In the example below, the column names that need percentiles are pctile_columns
. The criteria for receiving percentiles is (1) the column is not NA
, and (2) the min_pg
column is >= 12
. We are struggling to obtain the correct set of percentiles:
Data + Attempt
temp_df = structure(list(group_var = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
min_pg = c(11, 15, 19, 7, 5, 34, 32, 27, 24, 18, 13, 10),
stat1 = c(0.35, 0.32, 0.27, NA, NA, 0.42, 0.45, 0.47, 0.33, NA, 0.24, 0.39)),
row.names = c(NA, -12L), class = "data.frame")
library(dplyr)
pctile_columns <- c('stat1')
temp_output <- temp_df %>%
group_by(group_var) %>%
mutate(across(.cols = all_of(pctile_columns),
.fns = ~ if_else(is.na(.) | min_pg < 12, as.numeric(NA),
rank(., ties.method = "max")),
.names = "{.col}__rank")) %>%
mutate(across(.cols = all_of(pctile_columns),
.fns = ~ if_else(is.na(.) | min_pg < 12, as.numeric(NA),
round((rank(., ties.method = "max") - 1) / (n() - 1) * 100, 0)),
.names = "{.col}__pctile"))
Output
# Groups: group_var [1]
group_var min_pg stat1 stat1__rank stat1__pctile
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 0.35 NA NA
2 1 15 0.32 3 18
3 1 19 0.27 2 9
4 1 7 NA NA NA
5 1 5 NA NA NA
6 1 34 0.42 7 55
7 1 32 0.45 8 64
8 1 27 0.47 9 73
9 1 24 0.33 4 27
10 1 18 NA NA NA
11 1 13 0.24 1 0
12 1 10 0.39 NA NA
The problem with this output is that the ranks go from 1-9, whereas they should go from 1-7. Even though the stat1
values with min_pg < 12
are correctly being assigned an NA
value, these stat1
values are still being factored into the rank
equation when computing the ranks for all of the other rows. The correct set of ranks should be 1-7 in this instance, as there are 7 metrics that meet the criteria for stat1
to receive a rank/percentile.
How can we revise our code to compute ranks/percentiles properly per our criteria?