As we all know, pandas is a fantastic data science tool. It provides us with the dataframe structure we need and powerful computational abilities, and does so in a very user-friendly format. It even has quality documentation and a large support network, making it easy to learn. It’s excellent.
But it’s not always particularly fast.
This can become a problem when many, many computations are involved. If the processing method is slow, it takes longer to run the program. And that gets very bothersome if millions of computations are required and total computation time stretches on and on.
This is a common problem in the work that I do. A big focus of my work is developing simulation models representing common equipment in buildings. This means I create functions emulating the heat transfer processes and control logic decisions in a piece of equipment, then pass data describing the building conditions and occupant behavior choices into those models. The models then predict what the equipment will do, how well it will satisfy the occupants’ needs, and how much energy it will consume.
In order to do that, the models need to be time-based. They need to be able to calculate what happens at one point in the simulation, and only then move on to the next set of calculations. That’s because the outputs at one time are the inputs at the next time. For example, imagine predicting the temperature in your oven at any point in time. Is it currently heating? How much has the temperature increased since the last time in question? What was the temperature at that time?
This dependence on the last time leads to a problem. We can’t use vector calculations. We must use the dreaded for loops. For loops are slow.
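To make that dependence concrete, here is a minimal sketch of the oven example as a time-stepping loop in pandas. The column names, setpoint, starting temperature, and heating and cooling rates are all made-up values for illustration, not figures from my models. The decision to heat at each step depends on the temperature calculated at the previous step, which is why the rows have to be processed one at a time.

```python
import pandas as pd

# All numbers below are assumed values chosen for illustration
setpoint = 180.0      # target temperature
heating_rate = 40.0   # degrees gained per time step while heating
cooling_rate = 5.0    # degrees lost per time step while not heating

# One row per time step; both columns are filled in by the loop
oven = pd.DataFrame({'Heating': [False] * 8, 'Temperature': [0.0] * 8})
previous_temperature = 20.0  # assumed starting temperature

for step in oven.index:
    # The control decision uses the output of the previous time step
    heating = previous_temperature < setpoint
    change = heating_rate if heating else -cooling_rate
    oven.loc[step, 'Heating'] = heating
    oven.loc[step, 'Temperature'] = previous_temperature + change
    # This step's output becomes the next step's input
    previous_temperature = oven.loc[step, 'Temperature']

print(oven)
```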
One solution, whether vectorizing calculations is possible or not, is to convert your calculations to numpy. According to Sofia Heisler at Upside Engineering Blog, numpy performs a lot of its background work using precompiled C code. This precompiled C code makes it significantly faster than pandas by avoiding much of the Python interpreter overhead, and by including pre-programmed speed optimizations. Additionally, numpy drops a lot of the information in pandas. Pandas keeps track of indexes and labels, aligns data, and performs additional type and error checking. All of these are very useful, but they are not necessary at this time and slow down the calculations. Numpy doesn’t do that, and can perform the same calculations significantly faster.
There are multiple ways to convert pandas data to numpy.
A series can be converted using the .values attribute. This returns the data in the series as a numpy array. For a simple example, see the following code:
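If you want to check the difference on your own machine, a rough sketch like the one below times the same element-by-element loop against a pandas series and against the equivalent numpy array. The function names and the array size are my own choices for illustration, and the timings will vary with your hardware and library versions, so treat the output as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# The same 10,000 values stored in pandas and in numpy
values_pandas = pd.Series(np.arange(10_000, dtype=float))
values_numpy = values_pandas.to_numpy()

def running_total_pandas():
    # Reads one element at a time through the pandas indexer
    total = 0.0
    for i in range(len(values_pandas)):
        total += values_pandas.iloc[i]
    return total

def running_total_numpy():
    # Reads one element at a time directly from the numpy array
    total = 0.0
    for i in range(len(values_numpy)):
        total += values_numpy[i]
    return total

time_pandas = timeit.timeit(running_total_pandas, number=5)
time_numpy = timeit.timeit(running_total_numpy, number=5)
print(f'pandas: {time_pandas:.3f} s, numpy: {time_numpy:.3f} s')
```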
import pandas as pd
Series_Pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])
Series_Numpy = Series_Pandas.values
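Series also have a .to_numpy() method, which the pandas documentation recommends over .values. The short sketch below, using variable names of my own choosing, checks that both routes give the same numpy array for this example.

```python
import numpy as np
import pandas as pd

Series_Pandas = pd.Series(data=[1, 2, 3, 4, 5, 6])

# Two ways of getting the underlying data as a numpy array
Series_From_Values = Series_Pandas.values
Series_From_To_Numpy = Series_Pandas.to_numpy()

print(type(Series_From_To_Numpy))
print(np.array_equal(Series_From_Values, Series_From_To_Numpy))
```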
A dataframe can be converted using the .to_numpy() method. This creates a numpy array with the same values (of type int64 in this example, since every value is an integer). Note that this does not keep the column names, so you need to create a dictionary converting your pandas column names to numpy column numbers. This can be accomplished with the following code:
import pandas as pd
import numpy as np
Dataframe_Pandas = pd.DataFrame(data=[[0,1], [2,3], [4,5]], columns = ['First Column', 'Second Column'])
Dataframe_Numpy = Dataframe_Pandas.to_numpy()
Column_Index_Dictionary = dict(zip(Dataframe_Pandas.columns, list(range(0,len(Dataframe_Pandas.columns)))))
That code converts the dataframe to a numpy array and provides all of the tools needed to iterate through each row, editing values in specific columns, in a user-friendly manner. Each cell can be accessed in a manner similar to using the pandas .loc indexer, but with numpy indexing, following the structure array[row, Dictionary['Pandas Column Name']]. For instance, if you want to set the value in the first row of ‘Second Column’ to 9 you can use the following code:
Dataframe_Numpy[0, Column_Index_Dictionary['Second Column']] = 9
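Putting those pieces together, the sketch below loops over the rows of the example array, calculating each row from the one before it, then wraps the result back into a dataframe. The update rule is an arbitrary one I picked for illustration. I have also passed copy=True to .to_numpy() as a precaution, so the loop works on its own writable array rather than one that may share memory with the original dataframe.

```python
import pandas as pd

Dataframe_Pandas = pd.DataFrame(data=[[0, 1], [2, 3], [4, 5]],
                                columns=['First Column', 'Second Column'])

# copy=True gives an independent, writable array to edit in the loop
Dataframe_Numpy = Dataframe_Pandas.to_numpy(copy=True)
Column_Index_Dictionary = dict(zip(Dataframe_Pandas.columns,
                                   range(len(Dataframe_Pandas.columns))))

# Look up the column numbers once, outside the loop
First = Column_Index_Dictionary['First Column']
Second = Column_Index_Dictionary['Second Column']

# Illustrative rule: each row's 'Second Column' is the previous row's
# 'Second Column' plus the current row's 'First Column'
for row in range(1, Dataframe_Numpy.shape[0]):
    Dataframe_Numpy[row, Second] = (Dataframe_Numpy[row - 1, Second]
                                    + Dataframe_Numpy[row, First])

# Restore the column names and index once the loop is finished
Result_Pandas = pd.DataFrame(Dataframe_Numpy,
                             columns=Dataframe_Pandas.columns,
                             index=Dataframe_Pandas.index)
print(Result_Pandas)
```

Because every column here holds integers, the array is an integer array. A dataframe with mixed column types converts to an array of a broader type, so it is worth checking Dataframe_Numpy.dtype before relying on this pattern.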
Of course, the speedup is going to vary from one situation to the next. Some scripts will see more improvement by switching to numpy than others. It depends on the types of calculations used in your script and the percentage of all calculations that are converted to numpy. But the results can be drastic.
For example, I recently used this to convert one of my simulation models from a pandas base to a numpy base. The original, pandas-based model required 362 seconds to perform an annual simulation. That’s a bit over 6 minutes. This isn’t terrible if you’re running one simulation with the model, but what if you’re running a thousand? After converting the core of the model to numpy, the same annual simulation required 32 seconds to calculate.
That’s 9% as much time to do the same thing: more than a 10x speedup, just from using pre-existing functions to convert my code from pandas to numpy.
Numpy covers the core numerical computation capabilities of pandas, but performs them without carrying as much overhead information and uses pre-compiled, optimized methods. As a result, it can be significantly faster than pandas.
Converting a dataframe from pandas to numpy is relatively straightforward. You can use the dataframe’s .to_numpy() method to automatically convert it, then create a dictionary of the column names to enable accessing each cell similarly to the pandas .loc indexer.
This simple change can yield significant results. In my case, the script using numpy executed in approximately 10% of the time required when using pandas.