Piping in Pandas: Group By and Mutate

I am a big fan of the tidyverse in R but most of the time, I actually use Python. If the rest of your team uses Python, your production code is in Python, it simply doesn’t make much sense to use R. Anyway, I started to like working with pandas much better once I figured out how to pipe with pandas and how to “translate” from tidyverse to pandas. Then this code in R

df %>%
  filter(year >= 2009  & species %in% c("Adelie", "Gentoo")) %>%
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  group_by(island, sex) %>%
  summarise(avg_weight = mean(body_mass_kg), .groups='drop')

## # A tibble: 6 × 3
##   island    sex    avg_weight
##   <fct>     <fct>       <dbl>
## 1 Biscoe    female       4.40
## 2 Biscoe    male         5.17
## 3 Dream     female       3.36
## 4 Dream     male         3.94
## 5 Torgersen female       3.19
## 6 Torgersen male         3.78

becomes this code in Python:

(df
  .query('year >= 2009 & species.isin(["Adelie", "Gentoo"])')
  .assign(body_mass_kg = lambda x: x.body_mass_g / 1000 )
  .groupby(['island', 'sex'])
  .agg(avg_weight = ('body_mass_kg', 'mean'))
  .reset_index()
)

##       island     sex  avg_weight
## 0     Biscoe  female    4.403571
## 1     Biscoe    male    5.168103
## 2      Dream  female    3.357500
## 3      Dream    male    3.945000
## 4  Torgersen  female    3.193750
## 5  Torgersen    male    3.784375

There is however one type of transformation that whenever it comes up, I have to google it, read through the same question on StackOverflow again, and am, again, not very satisfied with the answer. That transformation is the group_by and mutate combo. In R, it works so seamlessly, one either uses a mutate or summarise after a group_by statement and everything is nice and neat. In Pandas, there is a confusing range of functions to be used after groupby and somehow they all work a bit different. So in this post, I want to go through the different ways how one can group_by and mutate in pandas.

One example where this task frequently came up for me is when normalizing count data. In the penguin data I’m using here, this could be for example the proportions of male/female penguins over the different islands for the three different species. In R, we would proceed as follows to compute this:

df %>%
  group_by(species, island, sex) %>%
  count() %>%
  group_by(species) %>%
  mutate(prop = n/ sum(n))

## # A tibble: 10 × 5
## # Groups:   species [3]
##    species   island    sex        n  prop
##    <fct>     <fct>     <fct>  <int> <dbl>
##  1 Adelie    Biscoe    female    22 0.151
##  2 Adelie    Biscoe    male      22 0.151
##  3 Adelie    Dream     female    27 0.185
##  4 Adelie    Dream     male      28 0.192
##  5 Adelie    Torgersen female    24 0.164
##  6 Adelie    Torgersen male      23 0.158
##  7 Chinstrap Dream     female    34 0.5  
##  8 Chinstrap Dream     male      34 0.5  
##  9 Gentoo    Biscoe    female    58 0.487
## 10 Gentoo    Biscoe    male      61 0.513

Et voila, we got counts and proportions for each species for the different islands and sexes, summing up to 1 per species.

In pandas, we got a few options to achieve the same result:

Using transform

One common suggestion for this problem is to use transform():

(df
  .groupby(['species', 'island', 'sex'], observed=True)
  .size().rename('n')
  .reset_index()
  .groupby('species')
  .n
  .transform(lambda x: x / x.sum() )
)

## 0    0.157534
## 1    0.164384
## 2    0.150685
## 3    0.150685
## 4    0.191781
## 5    0.184932
## 6    0.512605
## 7    0.487395
## 8    0.500000
## 9    0.500000
## Name: n, dtype: float64

Can you guess why I don’t like this approach? It returns a Series and not a DataFrame, meaning I have to assign it to its original data frame to see the species and island information. Buuuut, what original data frame? I cannot assign this Series to df because df actually has a different index. So I would have to save a copy of just the counts data and then assign this to the counts data frame and then I can keep working with that. Such a flow interruption! Surely, there must be better ways!

Combining transform with assign

One option is to put the whole groupby() and transform() action inside an assign() statement:

(df
  .groupby(['species', 'island', 'sex'], observed=True)
  .size().rename('n')
  .reset_index()
  .assign(prop = lambda x: (x
                            .groupby('species')
                            .n.transform(lambda x: x / x.sum() )
                            ) 
          )
)

##      species     island     sex   n      prop
## 0     Adelie  Torgersen    male  23  0.157534
## 1     Adelie  Torgersen  female  24  0.164384
## 2     Adelie     Biscoe    male  22  0.150685
## 3     Adelie     Biscoe  female  22  0.150685
## 4     Adelie      Dream    male  28  0.191781
## 5     Adelie      Dream  female  27  0.184932
## 6     Gentoo     Biscoe    male  61  0.512605
## 7     Gentoo     Biscoe  female  58  0.487395
## 8  Chinstrap      Dream    male  34  0.500000
## 9  Chinstrap      Dream  female  34  0.500000

Nice! No need to make some copies in between and if we wanted to, we could keep on chaining more functions after this. Downside: the groupby statement is quite nested and if we’d want to do multiple transformations for this grouping, we could also end up with quite a lot of repeated code.

Using apply and assign

A slight variant on the example above: this time we move the assign() inside the nested statement:

(df
  .groupby(['species', 'island', 'sex'], observed=True)
  .size().rename('n')
  .reset_index()
  .groupby('species')
  .apply(lambda grp: grp.assign(prop = lambda x: x.n / x.n.sum() ) )
)

##      species     island     sex   n      prop
## 0     Adelie  Torgersen    male  23  0.157534
## 1     Adelie  Torgersen  female  24  0.164384
## 2     Adelie     Biscoe    male  22  0.150685
## 3     Adelie     Biscoe  female  22  0.150685
## 4     Adelie      Dream    male  28  0.191781
## 5     Adelie      Dream  female  27  0.184932
## 6     Gentoo     Biscoe    male  61  0.512605
## 7     Gentoo     Biscoe  female  58  0.487395
## 8  Chinstrap      Dream    male  34  0.500000
## 9  Chinstrap      Dream  female  34  0.500000

A bit nicer, isn’t it? It is still a rather nested statement (compared with the beauty of the short three lines with dplyr) but it has a few advantages over the previous options: it easily allows multiple aggregate statements inside the assign and also, in my eyes the biggest advantage, it also allows to combine multiple columns.

If, for example, we have another columns with some kind of weights, then we could use them as follows:

(df
  .groupby(['species', 'island', 'sex'], observed=True)
  .size().rename('n')
  .reset_index()
  .assign(weight = range(10) )
  .groupby('species')
  .apply(lambda grp: grp.assign(prop = lambda x: x.weight * x.n / x.n.sum() ) )
)

##      species     island     sex   n  weight      prop
## 0     Adelie  Torgersen    male  23       0  0.000000
## 1     Adelie  Torgersen  female  24       1  0.164384
## 2     Adelie     Biscoe    male  22       2  0.301370
## 3     Adelie     Biscoe  female  22       3  0.452055
## 4     Adelie      Dream    male  28       4  0.767123
## 5     Adelie      Dream  female  27       5  0.924658
## 6     Gentoo     Biscoe    male  61       6  3.075630
## 7     Gentoo     Biscoe  female  58       7  3.411765
## 8  Chinstrap      Dream    male  34       8  4.000000
## 9  Chinstrap      Dream  female  34       9  4.500000

Honorable mentions: siuba and datar

The tidyverse has become so successful and popular that by now there is a whole range of ports from its packages to Python. For dplyr, the two biggest one are siuba and datar. They both work very similar, so I will only briefly show siuba. With these two packages, the task above becomes a breeze:

from siuba import group_by, filter, count, mutate, _
(df >> 
  group_by(_.species, _.island, _.sex) >>
  count() >>
  filter(_.n > 0 ) >>
  group_by(_.species) >>
  mutate(prop = _.n/ _.n.sum() )
)

## (grouped data frame)
##       species     island     sex   n      prop
## 0      Adelie     Biscoe  female  22  0.150685
## 1      Adelie     Biscoe    male  22  0.150685
## 2      Adelie      Dream  female  27  0.184932
## 3      Adelie      Dream    male  28  0.191781
## 4      Adelie  Torgersen  female  24  0.164384
## 5      Adelie  Torgersen    male  23  0.157534
## 8   Chinstrap      Dream  female  34  0.500000
## 9   Chinstrap      Dream    male  34  0.500000
## 12     Gentoo     Biscoe  female  58  0.487395
## 13     Gentoo     Biscoe    male  61  0.512605

The code is very similar to the R code and the package even provides a pipe operator as well!

Now, you might ask, why is this only a honorable mention? One problem is, that not everything has been ported to this package. Any of the pivot_* functions have not been ported yet though one would still need to use the pivot functionality from pandas. However, I could not find a way to make pandas functions work in the same chain as siuba functions (same with datar). Also, you might noticed that I added a line that filters out any rows with a count of 0. That is, by default, the pandas groupby uses observed=False and lists grouping combinations even the one that don’t exist in the data. siuba still uses the pandas implementation of groupby underneath but doesn’t actually forward any arguments to it.

So in the end, I still prefer working with pandas directly instead of using one of the ports. And while, in my eyes, it doesn’t quite reach the beauty of the tidyverse and dplyr, it gets you pretty close.

Menu

Samples of Thoughts

about data, statistics and everything in between