Data handling with Pandas

Pandas is a library optimized for handling one or two dimensional data sources [1]. One dimensional data is stored in a Series object, and two dimensional data is stored in a DataFrame object.

Loading the library

It is customary to give the library a short handle ‘pd’ at import time:

import pandas as pd
pd.options.display.max_rows = 10 #this line aids in displaying the data concisely

Loading data from CSV files

Pandas gives us a comprehensive set of tools for loading data from a variety of sources, including CSV, Excel, SQL, JSON, and Stata, amongst others. In this demonstration, we’ll read a comma separated value file of global emissions data from the year 1751 until 2011.

The .read_csv method gives us options for how we want to format the data as we read it in. In reading in our data file, we want to skip the second row (indexed as 1!) and use the column Time as the index of our resulting DataFrame.

emissions = pd.read_csv('../../data/Climate/global_emissions.csv',
                        skiprows=[1], index_col='Year')
emissions  # Display the resulting DataFrame in the notebook
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) Carbon emissions from gas fuel consumption Carbon emissions from liquid fuel consumption Carbon emissions from solid fuel consumption Carbon emissions from cement production Carbon emissions from gas flaring Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1751 3 0 0 3 0 0 NaN
1752 3 0 0 3 0 0 NaN
1753 3 0 0 3 0 0 NaN
1754 3 0 0 3 0 0 NaN
1755 3 0 0 3 0 0 NaN
... ... ... ... ... ... ... ...
2007 8532 1563 3080 3442 382 65 1.28
2008 8740 1625 3107 3552 387 68 1.29
2009 8700 1582 3039 3604 412 63 1.27
2010 9140 1698 3100 3832 445 65 1.32
2011 9449 1760 3137 3997 491 63 1.35

261 rows × 7 columns

Selecting rows of data by name

Both DataFrame and Series objects have an index attribute which is used to identify their rows. We can access rows of data according to this index, using the .loc[...] syntax.

Between the brackets, we can select individual rows:

emissions.loc[1875]

or ranges of dates:

emissions.loc[1908:1920]

or ranges beginning or ending at a specific point:

emissions.loc[1967:]
emissions.loc[:1805]

Give these a try and become comfortable selecting index ranges.

emissions.loc[1985:1987]
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) Carbon emissions from gas fuel consumption Carbon emissions from liquid fuel consumption Carbon emissions from solid fuel consumption Carbon emissions from cement production Carbon emissions from gas flaring Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1985 5438 835 2186 2237 131 49 1.12
1986 5606 830 2293 2300 137 46 1.13
1987 5750 892 2306 2364 143 44 1.14

Selecting rows of data by position

In addition to selecting by row names, we can select by the row position using the .iloc syntax.

This syntax lets us select the first n rows: >emissions.iloc[:5]

or, if we wish, the last n, using a minus sign to indicate counting from the end of the DataFrame:

``` emissions.iloc[-5:]

or rows in the middle:
>```
emissions.iloc[10:20]
emissions.iloc[1:30]
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) Carbon emissions from gas fuel consumption Carbon emissions from liquid fuel consumption Carbon emissions from solid fuel consumption Carbon emissions from cement production Carbon emissions from gas flaring Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1752 3 0 0 3 0 0 NaN
1753 3 0 0 3 0 0 NaN
1754 3 0 0 3 0 0 NaN
1755 3 0 0 3 0 0 NaN
1756 3 0 0 3 0 0 NaN
... ... ... ... ... ... ... ...
1776 4 0 0 4 0 0 NaN
1777 4 0 0 4 0 0 NaN
1778 4 0 0 4 0 0 NaN
1779 4 0 0 4 0 0 NaN
1780 4 0 0 4 0 0 NaN

29 rows × 7 columns

Renaming columns

The column names given in the CSV file are too long to use conveniently in dealing with data. We can assign new column names from a list of strings, that will be applied in order as the new column names:

emissions.columns = ['Total Emissions', 'Gas Emissions', 'Liquid Emissions',
                     'Solid Emissions', 'Cement Emissions', 'Flare Emissions',
                     'Per Capita Emissions']
emissions.iloc[-3:]
Total Emissions Gas Emissions Liquid Emissions Solid Emissions Cement Emissions Flare Emissions Per Capita Emissions
Year
2009 8700 1582 3039 3604 412 63 1.27
2010 9140 1698 3100 3832 445 65 1.32
2011 9449 1760 3137 3997 491 63 1.35

Accessing specific columns

Each of the columns in the DataFrame can be accessed as its own Series object, using the same syntax we would use to access members of a python dictionary:

emissions[['Total Emissions']]
Total Emissions
Year
1751 3
1752 3
1753 3
1754 3
1755 3
... ...
2007 8532
2008 8740
2009 8700
2010 9140
2011 9449

261 rows × 1 columns

Passing a list of column names into this syntax returns a subset of the dataframe:

emissions[['Gas Emissions', 'Liquid Emissions']]
Gas Emissions Liquid Emissions
Year
1751 0 0
1752 0 0
1753 0 0
1754 0 0
1755 0 0
... ... ...
2007 1563 3080
2008 1625 3107
2009 1582 3039
2010 1698 3100
2011 1760 3137

261 rows × 2 columns

Element-wise Arithmetic

We can perform element-wise arithmetic on DataFrame columns using natural syntax.

emissions['Gas Emissions'] + emissions['Liquid Emissions']
Year
1751       0
1752       0
1753       0
1754       0
1755       0
        ...
2007    4643
2008    4732
2009    4621
2010    4798
2011    4897
Length: 261, dtype: int64

Array Operations

A number of simple operations are built into Pandas to facilitate working with the data. For example, we can show descriptive statistics such as the maximum value of each column:

print(emissions.idxmax(), emissions.max())
Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64 Total Emissions         9449.00
Gas Emissions           1760.00
Liquid Emissions        3137.00
Solid Emissions         3997.00
Cement Emissions         491.00
Flare Emissions          110.00
Per Capita Emissions       1.35
dtype: float64

The year in which this maximum value occurred:

emissions.idxmax()
Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64

Or the sum of each column:

emissions.sum()
Total Emissions         373729.0
Gas Emissions            49774.0
Liquid Emissions        131976.0
Solid Emissions         179160.0
Cement Emissions          9366.0
Flare Emissions           3456.0
Per Capita Emissions        65.5
dtype: float64
emissions['Per Capita Emissions'].loc[1930:]
Year
1930     NaN
1931     NaN
1932     NaN
1933     NaN
1934     NaN
        ...
2007    1.28
2008    1.29
2009    1.27
2010    1.32
2011    1.35
Name: Per Capita Emissions, Length: 82, dtype: float64

Merging Datasets

The dataset we have currently is missing data for per capita consumption before 1950. We have another dataset which gives us estimates of the world population which we can use to try and fill in some missing data. It too, however, has some missing values: before 1900, the data comes at 50 year intervals.

population = pd.read_csv('../../data/Climate/world_population.csv', index_col='Year')

What we need to do is first merge the two datasets together. Pandas gives us a merge function which allows us to align the datasets on their index values.

merged = pd.merge(emissions, population, how='outer', left_index=True, right_index=True)
merged.loc[1750:2011]
Total Emissions Gas Emissions Liquid Emissions Solid Emissions Cement Emissions Flare Emissions Per Capita Emissions World Population
Year
1750 NaN NaN NaN NaN NaN NaN NaN 8.115621e+08
1751 3.0 0.0 0.0 3.0 0.0 0.0 NaN NaN
1752 3.0 0.0 0.0 3.0 0.0 0.0 NaN NaN
1753 3.0 0.0 0.0 3.0 0.0 0.0 NaN NaN
1754 3.0 0.0 0.0 3.0 0.0 0.0 NaN NaN
... ... ... ... ... ... ... ... ...
2007 8532.0 1563.0 3080.0 3442.0 382.0 65.0 1.28 6.681607e+09
2008 8740.0 1625.0 3107.0 3552.0 387.0 68.0 1.29 6.763733e+09
2009 8700.0 1582.0 3039.0 3604.0 412.0 63.0 1.27 6.846480e+09
2010 9140.0 1698.0 3100.0 3832.0 445.0 65.0 1.32 6.929725e+09
2011 9449.0 1760.0 3137.0 3997.0 491.0 63.0 1.35 7.013427e+09

262 rows × 8 columns

Interpolating missing values

The merge operation creates NaN values in the rows where data is missing from the world population column. We can fill these using a cubic spline interpolation from the surrounding points:

interpolated = merged.interpolate(method='cubic')
interpolated.loc[1750:2011]
Total Emissions Gas Emissions Liquid Emissions Solid Emissions Cement Emissions Flare Emissions Per Capita Emissions World Population
Year
1750 NaN NaN NaN NaN NaN NaN NaN 8.115621e+08
1751 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.155185e+08
1752 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.194193e+08
1753 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.232672e+08
1754 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.270645e+08
... ... ... ... ... ... ... ... ...
2007 8532.0 1563.0 3080.0 3442.0 382.0 65.0 1.28 6.681607e+09
2008 8740.0 1625.0 3107.0 3552.0 387.0 68.0 1.29 6.763733e+09
2009 8700.0 1582.0 3039.0 3604.0 412.0 63.0 1.27 6.846480e+09
2010 9140.0 1698.0 3100.0 3832.0 445.0 65.0 1.32 6.929725e+09
2011 9449.0 1760.0 3137.0 3997.0 491.0 63.0 1.35 7.013427e+09

262 rows × 8 columns

Calculating per capita emissions

Now we can calculate a new value for per capita emissions. We multiply by 1,000,000 to convert from units of ‘Million Metric Tons’ as the Total Emissions are expressed, to merely ‘Metric Tons’, as the existing, incomplete estimate of per capita emissions is expressed.

interpolated['Per Capita Emissions 2'] = interpolated['Total Emissions'] / interpolated['World Population'] * 1000000
interpolated.loc[1751:2011]
Total Emissions Gas Emissions Liquid Emissions Solid Emissions Cement Emissions Flare Emissions Per Capita Emissions World Population Per Capita Emissions 2
Year
1751 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.155185e+08 0.003679
1752 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.194193e+08 0.003661
1753 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.232672e+08 0.003644
1754 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.270645e+08 0.003627
1755 3.0 0.0 0.0 3.0 0.0 0.0 NaN 8.308138e+08 0.003611
... ... ... ... ... ... ... ... ... ...
2007 8532.0 1563.0 3080.0 3442.0 382.0 65.0 1.28 6.681607e+09 1.276938
2008 8740.0 1625.0 3107.0 3552.0 387.0 68.0 1.29 6.763733e+09 1.292186
2009 8700.0 1582.0 3039.0 3604.0 412.0 63.0 1.27 6.846480e+09 1.270726
2010 9140.0 1698.0 3100.0 3832.0 445.0 65.0 1.32 6.929725e+09 1.318956
2011 9449.0 1760.0 3137.0 3997.0 491.0 63.0 1.35 7.013427e+09 1.347273

261 rows × 9 columns

Pandas and PySD

By default, PySD will return the results of model simulation as a Pandas DataFrame, with the column names representing elements of the model, and the index (row names) as timestamps in the model.

import pysd
model = pysd.read_vensim('../../models/Predator_Prey/Predator_Prey.mdl')
sim_result_df = model.run()
sim_result_df
FINAL TIME INITIAL TIME SAVEPER TIME STEP Prey Births Predation Rate Prey Deaths Prey Fertility Prey Population Predator Births Predator Deaths Predator Food Driven Fertility Predator Mortality Predator Population
0.000000 50 0 0.015625 0.015625 5.000000e+02 0.0001 2.500000e+00 2 2.500000e+02 2.500000e+01 1.000000 0.001 0.01 100.000000
0.015625 50 0 0.015625 0.015625 5.155469e+02 0.0001 2.587401e+00 2 2.577734e+02 2.587401e+01 1.003750 0.001 0.01 100.375000
0.031250 50 0 0.015625 0.015625 5.315769e+02 0.0001 2.678180e+00 2 2.657884e+02 2.678180e+01 1.007636 0.001 0.01 100.763598
0.046875 50 0 0.015625 0.015625 5.481049e+02 0.0001 2.772488e+00 2 2.740525e+02 2.772488e+01 1.011663 0.001 0.01 101.166319
0.062500 50 0 0.015625 0.015625 5.651466e+02 0.0001 2.870484e+00 2 2.825733e+02 2.870484e+01 1.015837 0.001 0.01 101.583713
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49.937500 50 0 0.015625 0.015625 8.048343e-234 0.0001 3.909033e-233 2 4.024172e-234 3.909033e-232 971.388185 0.001 0.01 97138.818477
49.953125 50 0 0.015625 0.015625 7.078281e-234 0.0001 3.437342e-233 2 3.539141e-234 3.437342e-232 971.236405 0.001 0.01 97123.640536
49.968750 50 0 0.015625 0.015625 6.225308e-234 0.0001 3.022651e-233 2 3.112654e-234 3.022651e-232 971.084650 0.001 0.01 97108.464968
49.984375 50 0 0.015625 0.015625 5.475271e-234 0.0001 2.658060e-233 2 2.737635e-234 2.658060e-232 970.932918 0.001 0.01 97093.291770
50.000000 50 0 0.015625 0.015625 4.815729e-234 0.0001 2.337510e-233 2 2.407864e-234 2.337510e-232 970.781209 0.001 0.01 97078.120943

3201 rows × 14 columns

In this case, may want to downsample the returned data to make it more manageable:

sim_result_df.loc[range(50)]
FINAL TIME INITIAL TIME SAVEPER TIME STEP Prey Births Predation Rate Prey Deaths Prey Fertility Prey Population Predator Births Predator Deaths Predator Food Driven Fertility Predator Mortality Predator Population
0 50 0 0.015625 0.015625 5.000000e+02 0.0001 2.500000e+00 2 2.500000e+02 2.500000e+01 1.000000 0.001 0.01 100.000000
1 50 0 0.015625 0.015625 3.537670e+03 0.0001 3.742022e+01 2 1.768835e+03 3.742022e+02 2.115529 0.001 0.01 211.552891
2 50 0 0.015625 0.015625 1.857105e+04 0.0001 2.417686e+04 2 9.285526e+03 2.417686e+05 260.371488 0.001 0.01 26037.148846
3 50 0 0.015625 0.015625 6.957765e-02 0.0001 5.403739e-01 2 3.478882e-02 5.403739e+00 1553.297342 0.001 0.01 155329.734184
4 50 0 0.015625 0.015625 1.911405e-08 0.0001 1.469722e-07 2 9.557026e-09 1.469722e-06 1537.844530 0.001 0.01 153784.453003
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45 50 0 0.015625 0.015625 1.348667e-215 0.0001 6.881968e-215 2 6.743337e-216 6.881968e-214 1020.558213 0.001 0.01 102055.821334
46 50 0 0.015625 0.015625 2.193421e-219 0.0001 1.108119e-218 2 1.096711e-219 1.108119e-217 1010.402700 0.001 0.01 101040.269999
47 50 0 0.015625 0.015625 4.004902e-223 0.0001 2.003148e-222 2 2.002451e-223 2.003148e-221 1000.348244 0.001 0.01 100034.824354
48 50 0 0.015625 0.015625 8.198319e-227 0.0001 4.059782e-226 2 4.099159e-227 4.059782e-225 990.393838 0.001 0.01 99039.383839
49 50 0 0.015625 0.015625 1.879055e-230 0.0001 9.212427e-230 2 9.395273e-231 9.212427e-229 980.538489 0.001 0.01 98053.848891

50 rows × 14 columns

Notes

[1]: While pandas can handle dimensions larger than two, it is clunky. Xarray is a package for handling multidimensional data that interfaces well with Pandas.

Resources