I am a big fan of the tidyverse
in R but most of the time, I actually use Python. If the rest of your team uses Python, your production code is in Python, it simply doesn’t make much sense to use R.
Anyway, I started to like working with pandas
much better once I figured out how to pipe with pandas and how to “translate” from tidyverse to pandas. Then this code in R
df %>%
filter(year >= 2009 & species %in% c("Adelie", "Gentoo")) %>%
mutate(body_mass_kg = body_mass_g / 1000) %>%
group_by(island, sex) %>%
summarise(avg_weight = mean(body_mass_kg), .groups='drop')
## # A tibble: 6 × 3
## island sex avg_weight
## <fct> <fct> <dbl>
## 1 Biscoe female 4.40
## 2 Biscoe male 5.17
## 3 Dream female 3.36
## 4 Dream male 3.94
## 5 Torgersen female 3.19
## 6 Torgersen male 3.78
becomes this code in Python:
(df
.query('year >= 2009 & species.isin(["Adelie", "Gentoo"])')
.assign(body_mass_kg = lambda x: x.body_mass_g / 1000 )
.groupby(['island', 'sex'])
.agg(avg_weight = ('body_mass_kg', 'mean'))
.reset_index()
)
## island sex avg_weight
## 0 Biscoe female 4.403571
## 1 Biscoe male 5.168103
## 2 Dream female 3.357500
## 3 Dream male 3.945000
## 4 Torgersen female 3.193750
## 5 Torgersen male 3.784375
There is however one type of transformation that whenever it comes up, I have to google it, read through the same question on StackOverflow again, and am, again, not very satisfied with the answer.
That transformation is the group_by
and mutate
combo. In R, it works so seamlessly, one either uses a mutate
or summarise
after a group_by
statement and everything is nice and neat. In Pandas, there is a confusing range of functions to be used after groupby
and somehow they all work a bit different. So in this post, I want to go through the different ways how one can group_by
and mutate
in pandas.
One example where this task frequently came up for me is when normalizing count data. In the penguin data I’m using here, this could be for example the proportions of male/female penguins over the different islands for the three different species. In R, we would proceed as follows to compute this:
df %>%
group_by(species, island, sex) %>%
count() %>%
group_by(species) %>%
mutate(prop = n/ sum(n))
## # A tibble: 10 × 5
## # Groups: species [3]
## species island sex n prop
## <fct> <fct> <fct> <int> <dbl>
## 1 Adelie Biscoe female 22 0.151
## 2 Adelie Biscoe male 22 0.151
## 3 Adelie Dream female 27 0.185
## 4 Adelie Dream male 28 0.192
## 5 Adelie Torgersen female 24 0.164
## 6 Adelie Torgersen male 23 0.158
## 7 Chinstrap Dream female 34 0.5
## 8 Chinstrap Dream male 34 0.5
## 9 Gentoo Biscoe female 58 0.487
## 10 Gentoo Biscoe male 61 0.513
Et voila, we got counts and proportions for each species for the different islands and sexes, summing up to 1 per species.
In pandas, we got a few options to achieve the same result:
Using transform
One common suggestion for this problem is to use transform()
:
(df
.groupby(['species', 'island', 'sex'], observed=True)
.size().rename('n')
.reset_index()
.groupby('species')
.n
.transform(lambda x: x / x.sum() )
)
## 0 0.157534
## 1 0.164384
## 2 0.150685
## 3 0.150685
## 4 0.191781
## 5 0.184932
## 6 0.512605
## 7 0.487395
## 8 0.500000
## 9 0.500000
## Name: n, dtype: float64
Can you guess why I don’t like this approach?
It returns a Series
and not a DataFrame
, meaning I have to assign it to its original data frame to see the species and island information. Buuuut, what original data frame? I cannot assign this Series
to df
because df
actually has a different index. So I would have to save a copy of just the counts data and then assign this to the counts data frame and then I can keep working with that. Such a flow interruption! Surely, there must be better ways!
Combining transform with assign
One option is to put the whole groupby()
and transform()
action inside an assign()
statement:
(df
.groupby(['species', 'island', 'sex'], observed=True)
.size().rename('n')
.reset_index()
.assign(prop = lambda x: (x
.groupby('species')
.n.transform(lambda x: x / x.sum() )
)
)
)
## species island sex n prop
## 0 Adelie Torgersen male 23 0.157534
## 1 Adelie Torgersen female 24 0.164384
## 2 Adelie Biscoe male 22 0.150685
## 3 Adelie Biscoe female 22 0.150685
## 4 Adelie Dream male 28 0.191781
## 5 Adelie Dream female 27 0.184932
## 6 Gentoo Biscoe male 61 0.512605
## 7 Gentoo Biscoe female 58 0.487395
## 8 Chinstrap Dream male 34 0.500000
## 9 Chinstrap Dream female 34 0.500000
Nice! No need to make some copies in between and if we wanted to, we could keep on chaining more functions after this. Downside: the groupby statement is quite nested and if we’d want to do multiple transformations for this grouping, we could also end up with quite a lot of repeated code.
Using apply and assign
A slight variant on the example above: this time we move the assign()
inside the nested statement:
(df
.groupby(['species', 'island', 'sex'], observed=True)
.size().rename('n')
.reset_index()
.groupby('species')
.apply(lambda grp: grp.assign(prop = lambda x: x.n / x.n.sum() ) )
)
## species island sex n prop
## 0 Adelie Torgersen male 23 0.157534
## 1 Adelie Torgersen female 24 0.164384
## 2 Adelie Biscoe male 22 0.150685
## 3 Adelie Biscoe female 22 0.150685
## 4 Adelie Dream male 28 0.191781
## 5 Adelie Dream female 27 0.184932
## 6 Gentoo Biscoe male 61 0.512605
## 7 Gentoo Biscoe female 58 0.487395
## 8 Chinstrap Dream male 34 0.500000
## 9 Chinstrap Dream female 34 0.500000
A bit nicer, isn’t it? It is still a rather nested statement (compared with the beauty of the short three lines with dplyr
) but it has a few advantages over the previous options: it easily allows multiple aggregate statements inside the assign
and also, in my eyes the biggest advantage, it also allows to combine multiple columns.
If, for example, we have another columns with some kind of weights, then we could use them as follows:
(df
.groupby(['species', 'island', 'sex'], observed=True)
.size().rename('n')
.reset_index()
.assign(weight = range(10) )
.groupby('species')
.apply(lambda grp: grp.assign(prop = lambda x: x.weight * x.n / x.n.sum() ) )
)
## species island sex n weight prop
## 0 Adelie Torgersen male 23 0 0.000000
## 1 Adelie Torgersen female 24 1 0.164384
## 2 Adelie Biscoe male 22 2 0.301370
## 3 Adelie Biscoe female 22 3 0.452055
## 4 Adelie Dream male 28 4 0.767123
## 5 Adelie Dream female 27 5 0.924658
## 6 Gentoo Biscoe male 61 6 3.075630
## 7 Gentoo Biscoe female 58 7 3.411765
## 8 Chinstrap Dream male 34 8 4.000000
## 9 Chinstrap Dream female 34 9 4.500000
Honorable mentions: siuba and datar
The tidyverse has become so successful and popular that by now there is a whole range of ports from its packages to Python. For dplyr
, the two biggest one are siuba
and datar
. They both work very similar, so I will only briefly show siuba
. With these two packages, the task above becomes a breeze:
from siuba import group_by, filter, count, mutate, _
(df >>
group_by(_.species, _.island, _.sex) >>
count() >>
filter(_.n > 0 ) >>
group_by(_.species) >>
mutate(prop = _.n/ _.n.sum() )
)
## (grouped data frame)
## species island sex n prop
## 0 Adelie Biscoe female 22 0.150685
## 1 Adelie Biscoe male 22 0.150685
## 2 Adelie Dream female 27 0.184932
## 3 Adelie Dream male 28 0.191781
## 4 Adelie Torgersen female 24 0.164384
## 5 Adelie Torgersen male 23 0.157534
## 8 Chinstrap Dream female 34 0.500000
## 9 Chinstrap Dream male 34 0.500000
## 12 Gentoo Biscoe female 58 0.487395
## 13 Gentoo Biscoe male 61 0.512605
The code is very similar to the R code and the package even provides a pipe operator as well!
Now, you might ask, why is this only a honorable mention? One problem is, that not everything has been ported to this package. Any of the pivot_*
functions have not been ported yet though one would still need to use the pivot functionality from pandas. However, I could not find a way to make pandas functions work in the same chain as siuba
functions (same with datar
). Also, you might noticed that I added a line that filters out any rows with a count of 0. That is, by default, the pandas groupby
uses observed=False
and lists grouping combinations even the one that don’t exist in the data. siuba
still uses the pandas implementation of groupby
underneath but doesn’t actually forward any arguments to it.
So in the end, I still prefer working with pandas directly instead of using one of the ports. And while, in my eyes, it doesn’t quite reach the beauty of the tidyverse and dplyr, it gets you pretty close.