Data handling with Pandas
Pandas is a library optimized for handling one or two dimensional data
sources [1]. One dimensional data is stored in a Series
object, and
two dimensional data is stored in a DataFrame
object.
Loading the library
It is customary to give the library a short handle ‘pd
’ at import
time:
import pandas as pd
pd.options.display.max_rows = 10 #this line aids in displaying the data concisely
Loading data from CSV files
Pandas gives us a comprehensive set of tools for loading data from a variety of sources, including CSV, Excel, SQL, JSON, and Stata, amongst others. In this demonstration, we’ll read a comma separated value file of global emissions data from the year 1751 until 2011.
The .read_csv
method
gives us options for how we want to format the data as we read it in. In
reading in our data file, we want to skip the second row (indexed as
1
!) and use the column Time
as the index of our resulting
DataFrame
.
emissions = pd.read_csv('../../data/Climate/global_emissions.csv',
skiprows=[1], index_col='Year')
emissions # Display the resulting DataFrame in the notebook
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) | |
---|---|---|---|---|---|---|---|
Year | |||||||
1751 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1752 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1753 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1754 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1755 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
2007 | 8532 | 1563 | 3080 | 3442 | 382 | 65 | 1.28 |
2008 | 8740 | 1625 | 3107 | 3552 | 387 | 68 | 1.29 |
2009 | 8700 | 1582 | 3039 | 3604 | 412 | 63 | 1.27 |
2010 | 9140 | 1698 | 3100 | 3832 | 445 | 65 | 1.32 |
2011 | 9449 | 1760 | 3137 | 3997 | 491 | 63 | 1.35 |
261 rows × 7 columns
Selecting rows of data by name
Both DataFrame
and Series
objects have an index
attribute
which is used to identify their rows. We can access rows of data
according to this index, using the .loc[...]
syntax.
Between the brackets, we can select individual rows:
emissions.loc[1875]
or ranges of dates:
emissions.loc[1908:1920]
or ranges beginning or ending at a specific point:
emissions.loc[1967:]
emissions.loc[:1805]
Give these a try and become comfortable selecting index ranges.
emissions.loc[1985:1987]
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) | |
---|---|---|---|---|---|---|---|
Year | |||||||
1985 | 5438 | 835 | 2186 | 2237 | 131 | 49 | 1.12 |
1986 | 5606 | 830 | 2293 | 2300 | 137 | 46 | 1.13 |
1987 | 5750 | 892 | 2306 | 2364 | 143 | 44 | 1.14 |
Selecting rows of data by position
In addition to selecting by row names, we can select by the row position
using the .iloc
syntax.
This syntax lets us select the first n rows: >emissions.iloc[:5]
or, if we wish, the last n, using a minus sign to indicate counting from
the end of the DataFrame
:
``` emissions.iloc[-5:]
or rows in the middle:
>```
emissions.iloc[10:20]
emissions.iloc[1:30]
Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C) | Carbon emissions from gas fuel consumption | Carbon emissions from liquid fuel consumption | Carbon emissions from solid fuel consumption | Carbon emissions from cement production | Carbon emissions from gas flaring | Per capita carbon emissions (metric tons of carbon; after 1949 only) | |
---|---|---|---|---|---|---|---|
Year | |||||||
1752 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1753 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1754 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1755 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
1756 | 3 | 0 | 0 | 3 | 0 | 0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
1776 | 4 | 0 | 0 | 4 | 0 | 0 | NaN |
1777 | 4 | 0 | 0 | 4 | 0 | 0 | NaN |
1778 | 4 | 0 | 0 | 4 | 0 | 0 | NaN |
1779 | 4 | 0 | 0 | 4 | 0 | 0 | NaN |
1780 | 4 | 0 | 0 | 4 | 0 | 0 | NaN |
29 rows × 7 columns
Renaming columns
The column names given in the CSV file are too long to use conveniently in dealing with data. We can assign new column names from a list of strings, that will be applied in order as the new column names:
emissions.columns = ['Total Emissions', 'Gas Emissions', 'Liquid Emissions',
'Solid Emissions', 'Cement Emissions', 'Flare Emissions',
'Per Capita Emissions']
emissions.iloc[-3:]
Total Emissions | Gas Emissions | Liquid Emissions | Solid Emissions | Cement Emissions | Flare Emissions | Per Capita Emissions | |
---|---|---|---|---|---|---|---|
Year | |||||||
2009 | 8700 | 1582 | 3039 | 3604 | 412 | 63 | 1.27 |
2010 | 9140 | 1698 | 3100 | 3832 | 445 | 65 | 1.32 |
2011 | 9449 | 1760 | 3137 | 3997 | 491 | 63 | 1.35 |
Accessing specific columns
Each of the columns in the DataFrame
can be accessed as its own
Series
object, using the same syntax we would use to access members
of a python dictionary:
emissions[['Total Emissions']]
Total Emissions | |
---|---|
Year | |
1751 | 3 |
1752 | 3 |
1753 | 3 |
1754 | 3 |
1755 | 3 |
... | ... |
2007 | 8532 |
2008 | 8740 |
2009 | 8700 |
2010 | 9140 |
2011 | 9449 |
261 rows × 1 columns
Passing a list of column names into this syntax returns a subset of the dataframe:
emissions[['Gas Emissions', 'Liquid Emissions']]
Gas Emissions | Liquid Emissions | |
---|---|---|
Year | ||
1751 | 0 | 0 |
1752 | 0 | 0 |
1753 | 0 | 0 |
1754 | 0 | 0 |
1755 | 0 | 0 |
... | ... | ... |
2007 | 1563 | 3080 |
2008 | 1625 | 3107 |
2009 | 1582 | 3039 |
2010 | 1698 | 3100 |
2011 | 1760 | 3137 |
261 rows × 2 columns
Element-wise Arithmetic
We can perform element-wise
arithmetic
on DataFrame
columns using natural syntax.
emissions['Gas Emissions'] + emissions['Liquid Emissions']
Year
1751 0
1752 0
1753 0
1754 0
1755 0
...
2007 4643
2008 4732
2009 4621
2010 4798
2011 4897
Length: 261, dtype: int64
Array Operations
A number of simple operations are built into Pandas to facilitate working with the data. For example, we can show descriptive statistics such as the maximum value of each column:
print(emissions.idxmax(), emissions.max())
Total Emissions 2011
Gas Emissions 2011
Liquid Emissions 2011
Solid Emissions 2011
Cement Emissions 2011
Flare Emissions 1973
Per Capita Emissions 2011
dtype: int64 Total Emissions 9449.00
Gas Emissions 1760.00
Liquid Emissions 3137.00
Solid Emissions 3997.00
Cement Emissions 491.00
Flare Emissions 110.00
Per Capita Emissions 1.35
dtype: float64
The year in which this maximum value occurred:
emissions.idxmax()
Total Emissions 2011
Gas Emissions 2011
Liquid Emissions 2011
Solid Emissions 2011
Cement Emissions 2011
Flare Emissions 1973
Per Capita Emissions 2011
dtype: int64
Or the sum of each column:
emissions.sum()
Total Emissions 373729.0
Gas Emissions 49774.0
Liquid Emissions 131976.0
Solid Emissions 179160.0
Cement Emissions 9366.0
Flare Emissions 3456.0
Per Capita Emissions 65.5
dtype: float64
emissions['Per Capita Emissions'].loc[1930:]
Year
1930 NaN
1931 NaN
1932 NaN
1933 NaN
1934 NaN
...
2007 1.28
2008 1.29
2009 1.27
2010 1.32
2011 1.35
Name: Per Capita Emissions, Length: 82, dtype: float64
Merging Datasets
The dataset we have currently is missing data for per capita consumption before 1950. We have another dataset which gives us estimates of the world population which we can use to try and fill in some missing data. It too, however, has some missing values: before 1900, the data comes at 50 year intervals.
population = pd.read_csv('../../data/Climate/world_population.csv', index_col='Year')
What we need to do is first merge the two datasets together. Pandas gives us a merge function which allows us to align the datasets on their index values.
merged = pd.merge(emissions, population, how='outer', left_index=True, right_index=True)
merged.loc[1750:2011]
Total Emissions | Gas Emissions | Liquid Emissions | Solid Emissions | Cement Emissions | Flare Emissions | Per Capita Emissions | World Population | |
---|---|---|---|---|---|---|---|---|
Year | ||||||||
1750 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.115621e+08 |
1751 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | NaN |
1752 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | NaN |
1753 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | NaN |
1754 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2007 | 8532.0 | 1563.0 | 3080.0 | 3442.0 | 382.0 | 65.0 | 1.28 | 6.681607e+09 |
2008 | 8740.0 | 1625.0 | 3107.0 | 3552.0 | 387.0 | 68.0 | 1.29 | 6.763733e+09 |
2009 | 8700.0 | 1582.0 | 3039.0 | 3604.0 | 412.0 | 63.0 | 1.27 | 6.846480e+09 |
2010 | 9140.0 | 1698.0 | 3100.0 | 3832.0 | 445.0 | 65.0 | 1.32 | 6.929725e+09 |
2011 | 9449.0 | 1760.0 | 3137.0 | 3997.0 | 491.0 | 63.0 | 1.35 | 7.013427e+09 |
262 rows × 8 columns
Interpolating missing values
The merge operation creates NaN
values in the rows where data is
missing from the world population column. We can fill these using a
cubic spline interpolation from the surrounding points:
interpolated = merged.interpolate(method='cubic')
interpolated.loc[1750:2011]
Total Emissions | Gas Emissions | Liquid Emissions | Solid Emissions | Cement Emissions | Flare Emissions | Per Capita Emissions | World Population | |
---|---|---|---|---|---|---|---|---|
Year | ||||||||
1750 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.115621e+08 |
1751 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.155185e+08 |
1752 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.194193e+08 |
1753 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.232672e+08 |
1754 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.270645e+08 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2007 | 8532.0 | 1563.0 | 3080.0 | 3442.0 | 382.0 | 65.0 | 1.28 | 6.681607e+09 |
2008 | 8740.0 | 1625.0 | 3107.0 | 3552.0 | 387.0 | 68.0 | 1.29 | 6.763733e+09 |
2009 | 8700.0 | 1582.0 | 3039.0 | 3604.0 | 412.0 | 63.0 | 1.27 | 6.846480e+09 |
2010 | 9140.0 | 1698.0 | 3100.0 | 3832.0 | 445.0 | 65.0 | 1.32 | 6.929725e+09 |
2011 | 9449.0 | 1760.0 | 3137.0 | 3997.0 | 491.0 | 63.0 | 1.35 | 7.013427e+09 |
262 rows × 8 columns
Calculating per capita emissions
Now we can calculate a new value for per capita emissions. We multiply
by 1,000,000
to convert from units of ‘Million Metric Tons’ as the
Total Emissions are expressed, to merely ‘Metric Tons’, as the existing,
incomplete estimate of per capita emissions is expressed.
interpolated['Per Capita Emissions 2'] = interpolated['Total Emissions'] / interpolated['World Population'] * 1000000
interpolated.loc[1751:2011]
Total Emissions | Gas Emissions | Liquid Emissions | Solid Emissions | Cement Emissions | Flare Emissions | Per Capita Emissions | World Population | Per Capita Emissions 2 | |
---|---|---|---|---|---|---|---|---|---|
Year | |||||||||
1751 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.155185e+08 | 0.003679 |
1752 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.194193e+08 | 0.003661 |
1753 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.232672e+08 | 0.003644 |
1754 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.270645e+08 | 0.003627 |
1755 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | NaN | 8.308138e+08 | 0.003611 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2007 | 8532.0 | 1563.0 | 3080.0 | 3442.0 | 382.0 | 65.0 | 1.28 | 6.681607e+09 | 1.276938 |
2008 | 8740.0 | 1625.0 | 3107.0 | 3552.0 | 387.0 | 68.0 | 1.29 | 6.763733e+09 | 1.292186 |
2009 | 8700.0 | 1582.0 | 3039.0 | 3604.0 | 412.0 | 63.0 | 1.27 | 6.846480e+09 | 1.270726 |
2010 | 9140.0 | 1698.0 | 3100.0 | 3832.0 | 445.0 | 65.0 | 1.32 | 6.929725e+09 | 1.318956 |
2011 | 9449.0 | 1760.0 | 3137.0 | 3997.0 | 491.0 | 63.0 | 1.35 | 7.013427e+09 | 1.347273 |
261 rows × 9 columns
Pandas and PySD
By default, PySD will return the results of model simulation as a Pandas
DataFrame
, with the column names representing elements of the model,
and the index (row names) as timestamps in the model.
import pysd
model = pysd.read_vensim('../../models/Predator_Prey/Predator_Prey.mdl')
sim_result_df = model.run()
sim_result_df
FINAL TIME | INITIAL TIME | SAVEPER | TIME STEP | Prey Births | Predation Rate | Prey Deaths | Prey Fertility | Prey Population | Predator Births | Predator Deaths | Predator Food Driven Fertility | Predator Mortality | Predator Population | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.000000 | 50 | 0 | 0.015625 | 0.015625 | 5.000000e+02 | 0.0001 | 2.500000e+00 | 2 | 2.500000e+02 | 2.500000e+01 | 1.000000 | 0.001 | 0.01 | 100.000000 |
0.015625 | 50 | 0 | 0.015625 | 0.015625 | 5.155469e+02 | 0.0001 | 2.587401e+00 | 2 | 2.577734e+02 | 2.587401e+01 | 1.003750 | 0.001 | 0.01 | 100.375000 |
0.031250 | 50 | 0 | 0.015625 | 0.015625 | 5.315769e+02 | 0.0001 | 2.678180e+00 | 2 | 2.657884e+02 | 2.678180e+01 | 1.007636 | 0.001 | 0.01 | 100.763598 |
0.046875 | 50 | 0 | 0.015625 | 0.015625 | 5.481049e+02 | 0.0001 | 2.772488e+00 | 2 | 2.740525e+02 | 2.772488e+01 | 1.011663 | 0.001 | 0.01 | 101.166319 |
0.062500 | 50 | 0 | 0.015625 | 0.015625 | 5.651466e+02 | 0.0001 | 2.870484e+00 | 2 | 2.825733e+02 | 2.870484e+01 | 1.015837 | 0.001 | 0.01 | 101.583713 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49.937500 | 50 | 0 | 0.015625 | 0.015625 | 8.048343e-234 | 0.0001 | 3.909033e-233 | 2 | 4.024172e-234 | 3.909033e-232 | 971.388185 | 0.001 | 0.01 | 97138.818477 |
49.953125 | 50 | 0 | 0.015625 | 0.015625 | 7.078281e-234 | 0.0001 | 3.437342e-233 | 2 | 3.539141e-234 | 3.437342e-232 | 971.236405 | 0.001 | 0.01 | 97123.640536 |
49.968750 | 50 | 0 | 0.015625 | 0.015625 | 6.225308e-234 | 0.0001 | 3.022651e-233 | 2 | 3.112654e-234 | 3.022651e-232 | 971.084650 | 0.001 | 0.01 | 97108.464968 |
49.984375 | 50 | 0 | 0.015625 | 0.015625 | 5.475271e-234 | 0.0001 | 2.658060e-233 | 2 | 2.737635e-234 | 2.658060e-232 | 970.932918 | 0.001 | 0.01 | 97093.291770 |
50.000000 | 50 | 0 | 0.015625 | 0.015625 | 4.815729e-234 | 0.0001 | 2.337510e-233 | 2 | 2.407864e-234 | 2.337510e-232 | 970.781209 | 0.001 | 0.01 | 97078.120943 |
3201 rows × 14 columns
In this case, may want to downsample the returned data to make it more manageable:
sim_result_df.loc[range(50)]
FINAL TIME | INITIAL TIME | SAVEPER | TIME STEP | Prey Births | Predation Rate | Prey Deaths | Prey Fertility | Prey Population | Predator Births | Predator Deaths | Predator Food Driven Fertility | Predator Mortality | Predator Population | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50 | 0 | 0.015625 | 0.015625 | 5.000000e+02 | 0.0001 | 2.500000e+00 | 2 | 2.500000e+02 | 2.500000e+01 | 1.000000 | 0.001 | 0.01 | 100.000000 |
1 | 50 | 0 | 0.015625 | 0.015625 | 3.537670e+03 | 0.0001 | 3.742022e+01 | 2 | 1.768835e+03 | 3.742022e+02 | 2.115529 | 0.001 | 0.01 | 211.552891 |
2 | 50 | 0 | 0.015625 | 0.015625 | 1.857105e+04 | 0.0001 | 2.417686e+04 | 2 | 9.285526e+03 | 2.417686e+05 | 260.371488 | 0.001 | 0.01 | 26037.148846 |
3 | 50 | 0 | 0.015625 | 0.015625 | 6.957765e-02 | 0.0001 | 5.403739e-01 | 2 | 3.478882e-02 | 5.403739e+00 | 1553.297342 | 0.001 | 0.01 | 155329.734184 |
4 | 50 | 0 | 0.015625 | 0.015625 | 1.911405e-08 | 0.0001 | 1.469722e-07 | 2 | 9.557026e-09 | 1.469722e-06 | 1537.844530 | 0.001 | 0.01 | 153784.453003 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45 | 50 | 0 | 0.015625 | 0.015625 | 1.348667e-215 | 0.0001 | 6.881968e-215 | 2 | 6.743337e-216 | 6.881968e-214 | 1020.558213 | 0.001 | 0.01 | 102055.821334 |
46 | 50 | 0 | 0.015625 | 0.015625 | 2.193421e-219 | 0.0001 | 1.108119e-218 | 2 | 1.096711e-219 | 1.108119e-217 | 1010.402700 | 0.001 | 0.01 | 101040.269999 |
47 | 50 | 0 | 0.015625 | 0.015625 | 4.004902e-223 | 0.0001 | 2.003148e-222 | 2 | 2.002451e-223 | 2.003148e-221 | 1000.348244 | 0.001 | 0.01 | 100034.824354 |
48 | 50 | 0 | 0.015625 | 0.015625 | 8.198319e-227 | 0.0001 | 4.059782e-226 | 2 | 4.099159e-227 | 4.059782e-225 | 990.393838 | 0.001 | 0.01 | 99039.383839 |
49 | 50 | 0 | 0.015625 | 0.015625 | 1.879055e-230 | 0.0001 | 9.212427e-230 | 2 | 9.395273e-231 | 9.212427e-229 | 980.538489 | 0.001 | 0.01 | 98053.848891 |
50 rows × 14 columns
Notes
[1]: While pandas can handle dimensions larger than two, it is clunky. Xarray is a package for handling multidimensional data that interfaces well with Pandas.
Resources
Basic introduction to Pandas constructs
More advanced usage of Pandas syntax