
Memory Efficiency in Pandas

If you’re working with big data in pandas, you can run into memory problems very quickly. When working locally, your machine might slow down, or you might even get that lovely message asking you to please kill some applications. When working in the cloud, you can of course always ramp up memory, but trust me: having to restart a couple of thousand killed jobs because of Out-of-Memory errors is not fun, and it’s pricey too!

In this post, I will talk about how to measure the size of a data frame, then show with an example how to use categories and chunking to optimize memory usage, and finally give some tips on how to find the cause of an Out-of-Memory error.

Some Basics first: What’s your Type?

How much memory space a data frame needs is mostly determined by its content, and in particular by the type of that content.

There are four main data types that can be stored in a pandas data frame: integer, bool, float and object. Integers are nice since they only need a handful of bytes to represent any whole number in their range. A boolean is even simpler: conceptually it needs only a single bit, and pandas stores it in one byte. Floats are internally represented as a sign, an exponent and a significand, so they need a similar amount of memory as integers. object is usually used for strings and is a bit more tricky.

Let’s construct a small test data frame with these four types:

import pandas as pd
data = pd.DataFrame({'integer':[1], 'float': [2.0], 'bool': [True],
                     'object': ['some text']} )
data.dtypes
## integer      int64
## float      float64
## bool          bool
## object      object
## dtype: object

The 64 means the type uses 64 bits.
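
A quick way to see what this means in bytes per value is the itemsize of the underlying numpy dtype (a small sketch; narrower integer types exist as well if your numbers fit):

import numpy as np
np.dtype('int64').itemsize  # 8 bytes per value (64 bits)
np.dtype('int8').itemsize   # 1 byte per value for small whole numbers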

To check how much memory is used per column, we use memory_usage():

data.memory_usage(deep=True)
## Index      128
## integer      8
## float        8
## bool         1
## object      66
## dtype: int64

Note that by default, pandas reports the space in bytes (1 byte = 8 bits, so 8 bytes = 64 bits). Also, if we don’t pass deep=True, pandas won’t report the true memory space allocated to object columns.
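
For comparison, here is the same call without deep=True; the object column is then counted only as an 8-byte pointer per value and its true size stays hidden (a quick sketch):

data.memory_usage()  # shallow: the object column is reported as just 8 bytes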

Oh So Big

So the object type uses much more memory space than integer or float. Strings can also be arbitrarily long, which means arbitrarily large in terms of memory:

larger = pd.DataFrame({'integer':[1], 'float': [2.0], 
                        'object': ['some text, but really long '*100 ]} )
larger.memory_usage(deep=True)
## Index       128
## integer       8
## float         8
## object     2757
## dtype: int64

In most cases, memory issues are due to string columns.

Let’s look at some more realistic example data. I simulated survey data with 50,000 respondents from different European countries, with 12 attributes per respondent:

##               gender  age  ...      home_type  \
## 0             Female   38  ...  I own a house   
## 1  Don't want to say   30  ...   I own a flat   
## 
##                                     has_garden  
## 0  There is a large garden (bigger than 20sqm)  
## 1                           There is no garden  
## 
## [2 rows x 12 columns]

I then simulated some questions about political leaning and vote intentions for different candidates and parties. You can find the code to simulate the data here.

##               gender  age  ...  \
## 0             Female   38  ...   
## 1  Don't want to say   30  ...   
## 
##                        q_voted_party_Right_SE_before  \
## 0  I have voted party Right (SE) once or twice be...   
## 1  I have voted party Right (SE) once or twice be...   
## 
##              q_agrees_with_party_Right_SE  
## 0  I strongly agree with party Right (SE)  
## 1  I strongly agree with party Right (SE)  
## 
## [2 rows x 372 columns]

Since I simulated three questions per candidate, with 10 candidates per country and 10 countries, and similarly for the parties, we end up with quite a few question columns: 360 in total, plus the 12 attribute columns.

Even though the data is not very big in terms of rows (50,000 observations, does that even count as Big Data?), the data frame is already big enough to slow down most laptops.

To get the memory usage in a meaningful unit, we divide by 1e6 or 1e9 to get MB or GB respectively:

df.memory_usage(deep=True) / 1e6
## Index                            0.000128
## gender                           3.249436
## age                              0.400000
## country                          3.440040
## city_rural                       3.199799
##                                    ...   
## q_agrees_with_party_Right_PT     3.976533
## q_voted_party_Right_PO_before    4.170503
## q_agrees_with_party_Right_PO     3.972357
## q_voted_party_Right_SE_before    4.189754
## q_agrees_with_party_Right_SE     3.968368
## Length: 373, dtype: float64

Most columns are 3 to 5 MB in size. To get the total memory usage of the whole data frame, we use the info() function:

df.info(memory_usage='deep')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 50000 entries, 0 to 49999
## Columns: 372 entries, gender to q_agrees_with_party_Right_SE
## dtypes: int64(1), object(371)
## memory usage: 1.3 GB

Not only does this size significantly slow down your local machine, but worse, many operations and transformations need additional memory. This makes working with the data difficult, if not impossible.

Categorize It

One simple trick solves this problem almost instantly. Since all of the survey questions only have a limited number of answer options, we can map each answer option to an integer value. As we’ve seen before, integers can be stored much more efficiently.

The pandas type pd.Categorical does exactly that. It transforms strings into discrete categories and maps each category to an integer internally, thus storing the strings in a very efficient way:

str_cols = df.select_dtypes('object').columns  # select all string columns
df[str_cols] = df[str_cols].apply(pd.Categorical)  # convert each one to a categorical
df.info(memory_usage='deep')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 50000 entries, 0 to 49999
## Columns: 372 entries, gender to q_agrees_with_party_Right_SE
## dtypes: category(371), int64(1)
## memory usage: 18.2 MB

A little bit of magic! With just two lines, we went from more than 1 GB to less than 20 MB! These two lines can easily save you a few hundred dollars in cloud computing bills.
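
Under the hood, each categorical column now stores one small integer code per row plus a single list of the distinct answer strings. You can peek at this mapping directly (a quick sketch using one of the attribute columns):

df['gender'].cat.categories  # the distinct answer strings, stored only once
df['gender'].cat.codes       # one small integer code (typically int8) per row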

Obviously, it’s always better to start out with a memory-efficient type instead of converting to one later. In this example, it probably makes sense to either store the answers in a different format to begin with (e.g. integer coded) or to convert them to categories when loading the data.
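
For example, if the raw answers came from a CSV file, the conversion could happen directly at load time via the dtype argument of pd.read_csv (a sketch; the file name is made up and the answer column names would need to be known up front):

df = pd.read_csv('survey_data.csv',
                 dtype={col: 'category' for col in str_cols})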

Chunk Chunk Chunk

Sometimes, the data frame itself is already memory optimized but certain data transformations still lead to Out-of-Memory errors.

Imagine we want to unpivot all question columns so we can group by attributes and count how many people answered the different questions. This could be useful, for example, when we want to group together multiple questions that have the same format.

question_cols = [col for col in df.columns if col.startswith("q_")]
attribute_cols = [col for col in df.columns if not col.startswith("q_")]
melted = df.melt(id_vars=attribute_cols,
                    value_vars=question_cols)

But alas, it looks like things are slowing down again. Let’s check the memory usage:

melted.info(memory_usage='deep')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 18000000 entries, 0 to 17999999
## Data columns (total 14 columns):
##  #   Column             Dtype   
## ---  ------             -----   
##  0   gender             category
##  1   age                int64   
##  2   country            category
##  3   city_rural         category
##  4   edu                category
##  5   employment_status  category
##  6   owns_cat           category
##  7   likes_dogs         category
##  8   household_type     category
##  9   has_kids           category
##  10  home_type          category
##  11  has_garden         category
##  12  variable           object  
##  13  value              object  
## dtypes: category(11), int64(1), object(2)
## memory usage: 3.0 GB

What was a sweet 18 MB before has suddenly exploded to 3 GB! Crap!

While all the attribute columns are still categories and thus small in terms of memory, the two new columns are of type object again. Since melting the data frame also transformed it into long format, the frame now has 18 million rows with two columns of type object. Big enough to kill some machines.

The problem is that pd.melt() doesn’t preserve the category type of the question columns. Since each question column has different categories, pandas just merges them back into the more general object type.

The solution: we write a light wrapper around pd.melt() that works on smaller chunks of the data. After melting each chunk, we convert the object columns back into categories and then concatenate all melted chunks together:

import numpy as np
from pandas.api.types import union_categoricals

def chunked_melt(df, id_vars, value_vars, chunk_size=1000):
    dfs = []
    # union of the categories over all question columns, used for the 'value' column
    value_cat = union_categoricals([df[col] for col in value_vars]).categories
    for chunk in np.array_split(df, len(df) // chunk_size):
        d = pd.melt(chunk, id_vars=id_vars, value_vars=value_vars)
        # melt returns object columns, so convert them back to categories
        d['variable'] = pd.Categorical(d['variable'], categories=value_vars)
        d['value'] = pd.Categorical(d['value'], categories=value_cat)
        dfs.append(d)
    return pd.concat(dfs, axis=0, ignore_index=True)

Note that we now need to set the categories explicitly. A single chunk might not contain all categories of the final data frame, so we set the union of all categories for each chunk. If any chunk had different categories than the others, pd.concat() would convert the column back to type object.
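
To see why the explicit categories matter, here is a minimal sketch of what pd.concat() does when the categories differ versus when they match (toy columns, made up for illustration):

a = pd.DataFrame({'value': pd.Categorical(['yes', 'no'])})
b = pd.DataFrame({'value': pd.Categorical(['maybe'])})
pd.concat([a, b])['value'].dtype  # object: differing categories, the dtype is lost

cats = ['yes', 'no', 'maybe']
a['value'] = pd.Categorical(a['value'], categories=cats)
b['value'] = pd.Categorical(b['value'], categories=cats)
pd.concat([a, b])['value'].dtype  # category: identical categories are preserved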

melted = chunked_melt(df, id_vars=attribute_cols, value_vars=question_cols)
melted.info(memory_usage='deep')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 18000000 entries, 0 to 17999999
## Data columns (total 14 columns):
##  #   Column             Dtype   
## ---  ------             -----   
##  0   gender             category
##  1   age                int64   
##  2   country            category
##  3   city_rural         category
##  4   edu                category
##  5   employment_status  category
##  6   owns_cat           category
##  7   likes_dogs         category
##  8   household_type     category
##  9   has_kids           category
##  10  home_type          category
##  11  has_garden         category
##  12  variable           category
##  13  value              category
## dtypes: category(13), int64(1)
## memory usage: 395.0 MB

The data frame is still relatively big (compared with the 20MB before) but much easier to handle.

Some Final Tips On Hunting Down Out-Of-Memory Errors

Once you have an Out-Of-Memory (OoM) error, it can be tricky to find its cause. When trying to identify a memory issue in a data science context, we’re interested in the peak memory of the whole program, that is, the moment where the most memory is needed. However, this moment might be different for small data than for bigger data.

So what can we do?

  • Generously log memory size of the main data frames.
  • Log the peak memory of the program with the built-in resource module (available on Unix systems):
import resource
# ... all your code
# at the end of the script:
peak_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# ru_maxrss is in kilobytes on Linux and in bytes on macOS

Obviously, you can only log the peak memory this way for runs that don’t get killed early.

  • Step through your code with an example data frame that is big enough to reveal memory problems but small enough to still work with comfortably.
  • Do some back-of-the-envelope calculations. Sometimes you can find the problem just by computing the expected memory of the final data frame from its types and size (see the sketch after this list).
  • In Jupyter, there are magic commands to profile memory: %memit for a single statement and %mprun for a line-by-line profile of a function. For more info on how to use them, see e.g. here. In my experience, however, they aren’t always accurate and can miss memory allocated outside of Python, e.g. numpy does most of its allocations in C. They can still give some good leads though.
  • Memory profilers like Fil promise to also track memory allocated from C and focus on peak memory, which makes them well-suited for OoM errors in data science. Personally, I often had the problem that the smaller sample data I used for profiling didn’t show the peaks that caused the error, while the bigger data was difficult to profile.
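
As an example of such a back-of-the-envelope calculation, here is a rough estimate for the melted data frame from above (the ~80 bytes per string is read off the per-column sizes shown earlier):

rows = 50_000 * 360    # one row per respondent and question after melting
bytes_per_cell = 80    # each question column used roughly 4 MB for 50,000 rows
object_cols = 2        # 'variable' and 'value' are object after a plain melt
rows * bytes_per_cell * object_cols / 1e9  # ~2.9 GB, close to the 3 GB we saw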

Finding the cause of an OoM error still remains a bit of detective work, but I hope these tips help a little with where and how to look.

Corrie
