Data handling with Pandas¶

Pandas is a library optimized for handling one or two dimensional data sources [1]. One dimensional data is stored in a Series object, and two dimensional data is stored in a DataFrame object.

Loading the library¶

It is customary to give the library a short handle ‘pd’ at import time:

import pandas as pd
pd.options.display.max_rows = 10 #this line aids in displaying the data concisely

Loading data from CSV files¶

Pandas gives us a comprehensive set of tools for loading data from a variety of sources, including CSV, Excel, SQL, JSON, and Stata, amongst others. In this demonstration, we’ll read a comma separated value file of global emissions data from the year 1751 until 2011.

The .read_csv method gives us options for how we want to format the data as we read it in. In reading in our data file, we want to skip the second row (indexed as 1!) and use the column Time as the index of our resulting DataFrame.

emissions = pd.read_csv('../../data/Climate/global_emissions.csv',
                        skiprows=[1], index_col='Year')
emissions  # Display the resulting DataFrame in the notebook

	Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)	Carbon emissions from gas fuel consumption	Carbon emissions from liquid fuel consumption	Carbon emissions from solid fuel consumption	Carbon emissions from cement production	Carbon emissions from gas flaring	Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1751	3	0	0	3	0	0	NaN
1752	3	0	0	3	0	0	NaN
1753	3	0	0	3	0	0	NaN
1754	3	0	0	3	0	0	NaN
1755	3	0	0	3	0	0	NaN
...	...	...	...	...	...	...	...
2007	8532	1563	3080	3442	382	65	1.28
2008	8740	1625	3107	3552	387	68	1.29
2009	8700	1582	3039	3604	412	63	1.27
2010	9140	1698	3100	3832	445	65	1.32
2011	9449	1760	3137	3997	491	63	1.35

261 rows × 7 columns

Selecting rows of data by name¶

Both DataFrame and Series objects have an index attribute which is used to identify their rows. We can access rows of data according to this index, using the .loc[...] syntax.

Between the brackets, we can select individual rows:

emissions.loc[1875]

or ranges of dates:

emissions.loc[1908:1920]

or ranges beginning or ending at a specific point:

emissions.loc[1967:]
emissions.loc[:1805]

Give these a try and become comfortable selecting index ranges.

emissions.loc[1985:1987]

	Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)	Carbon emissions from gas fuel consumption	Carbon emissions from liquid fuel consumption	Carbon emissions from solid fuel consumption	Carbon emissions from cement production	Carbon emissions from gas flaring	Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1985	5438	835	2186	2237	131	49	1.12
1986	5606	830	2293	2300	137	46	1.13
1987	5750	892	2306	2364	143	44	1.14

Selecting rows of data by position¶

In addition to selecting by row names, we can select by the row position using the .iloc syntax.

This syntax lets us select the first n rows: >emissions.iloc[:5]

or, if we wish, the last n, using a minus sign to indicate counting from the end of the DataFrame:

``` emissions.iloc[-5:]

or rows in the middle:
>```
emissions.iloc[10:20]

emissions.iloc[1:30]

	Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)	Carbon emissions from gas fuel consumption	Carbon emissions from liquid fuel consumption	Carbon emissions from solid fuel consumption	Carbon emissions from cement production	Carbon emissions from gas flaring	Per capita carbon emissions (metric tons of carbon; after 1949 only)
Year
1752	3	0	0	3	0	0	NaN
1753	3	0	0	3	0	0	NaN
1754	3	0	0	3	0	0	NaN
1755	3	0	0	3	0	0	NaN
1756	3	0	0	3	0	0	NaN
...	...	...	...	...	...	...	...
1776	4	0	0	4	0	0	NaN
1777	4	0	0	4	0	0	NaN
1778	4	0	0	4	0	0	NaN
1779	4	0	0	4	0	0	NaN
1780	4	0	0	4	0	0	NaN

29 rows × 7 columns

Renaming columns¶

The column names given in the CSV file are too long to use conveniently in dealing with data. We can assign new column names from a list of strings, that will be applied in order as the new column names:

emissions.columns = ['Total Emissions', 'Gas Emissions', 'Liquid Emissions',
                     'Solid Emissions', 'Cement Emissions', 'Flare Emissions',
                     'Per Capita Emissions']
emissions.iloc[-3:]

	Total Emissions	Gas Emissions	Liquid Emissions	Solid Emissions	Cement Emissions	Flare Emissions	Per Capita Emissions
Year
2009	8700	1582	3039	3604	412	63	1.27
2010	9140	1698	3100	3832	445	65	1.32
2011	9449	1760	3137	3997	491	63	1.35

Accessing specific columns¶

Each of the columns in the DataFrame can be accessed as its own Series object, using the same syntax we would use to access members of a python dictionary:

emissions[['Total Emissions']]

	Total Emissions
Year
1751	3
1752	3
1753	3
1754	3
1755	3
...	...
2007	8532
2008	8740
2009	8700
2010	9140
2011	9449

261 rows × 1 columns

Passing a list of column names into this syntax returns a subset of the dataframe:

emissions[['Gas Emissions', 'Liquid Emissions']]

	Gas Emissions	Liquid Emissions
Year
1751	0	0
1752	0	0
1753	0	0
1754	0	0
1755	0	0
...	...	...
2007	1563	3080
2008	1625	3107
2009	1582	3039
2010	1698	3100
2011	1760	3137

261 rows × 2 columns

Element-wise Arithmetic¶

We can perform element-wise arithmetic on DataFrame columns using natural syntax.

emissions['Gas Emissions'] + emissions['Liquid Emissions']

Year
     0
     0
     0
     0
     0
        ...
  4643
  4732
  4621
  4798
  4897
Length: 261, dtype: int64

Array Operations¶

A number of simple operations are built into Pandas to facilitate working with the data. For example, we can show descriptive statistics such as the maximum value of each column:

print(emissions.idxmax(), emissions.max())

Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64 Total Emissions         9449.00
Gas Emissions           1760.00
Liquid Emissions        3137.00
Solid Emissions         3997.00
Cement Emissions         491.00
Flare Emissions          110.00
Per Capita Emissions       1.35
dtype: float64

The year in which this maximum value occurred:

emissions.idxmax()

Total Emissions         2011
Gas Emissions           2011
Liquid Emissions        2011
Solid Emissions         2011
Cement Emissions        2011
Flare Emissions         1973
Per Capita Emissions    2011
dtype: int64

Or the sum of each column:

emissions.sum()

Total Emissions         373729.0
Gas Emissions            49774.0
Liquid Emissions        131976.0
Solid Emissions         179160.0
Cement Emissions          9366.0
Flare Emissions           3456.0
Per Capita Emissions        65.5
dtype: float64

emissions['Per Capita Emissions'].loc[1930:]

Year
   NaN
   NaN
   NaN
   NaN
   NaN
        ...
  1.28
  1.29
  1.27
  1.32
  1.35
Name: Per Capita Emissions, Length: 82, dtype: float64

Merging Datasets¶

The dataset we have currently is missing data for per capita consumption before 1950. We have another dataset which gives us estimates of the world population which we can use to try and fill in some missing data. It too, however, has some missing values: before 1900, the data comes at 50 year intervals.

population = pd.read_csv('../../data/Climate/world_population.csv', index_col='Year')

What we need to do is first merge the two datasets together. Pandas gives us a merge function which allows us to align the datasets on their index values.

merged = pd.merge(emissions, population, how='outer', left_index=True, right_index=True)
merged.loc[1750:2011]

	Total Emissions	Gas Emissions	Liquid Emissions	Solid Emissions	Cement Emissions	Flare Emissions	Per Capita Emissions	World Population
Year
1750	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8.115621e+08
1751	3.0	0.0	0.0	3.0	0.0	0.0	NaN	NaN
1752	3.0	0.0	0.0	3.0	0.0	0.0	NaN	NaN
1753	3.0	0.0	0.0	3.0	0.0	0.0	NaN	NaN
1754	3.0	0.0	0.0	3.0	0.0	0.0	NaN	NaN
...	...	...	...	...	...	...	...	...
2007	8532.0	1563.0	3080.0	3442.0	382.0	65.0	1.28	6.681607e+09
2008	8740.0	1625.0	3107.0	3552.0	387.0	68.0	1.29	6.763733e+09
2009	8700.0	1582.0	3039.0	3604.0	412.0	63.0	1.27	6.846480e+09
2010	9140.0	1698.0	3100.0	3832.0	445.0	65.0	1.32	6.929725e+09
2011	9449.0	1760.0	3137.0	3997.0	491.0	63.0	1.35	7.013427e+09

262 rows × 8 columns

Interpolating missing values¶

The merge operation creates NaN values in the rows where data is missing from the world population column. We can fill these using a cubic spline interpolation from the surrounding points:

interpolated = merged.interpolate(method='cubic')
interpolated.loc[1750:2011]

	Total Emissions	Gas Emissions	Liquid Emissions	Solid Emissions	Cement Emissions	Flare Emissions	Per Capita Emissions	World Population
Year
1750	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8.115621e+08
1751	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.155185e+08
1752	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.194193e+08
1753	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.232672e+08
1754	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.270645e+08
...	...	...	...	...	...	...	...	...
2007	8532.0	1563.0	3080.0	3442.0	382.0	65.0	1.28	6.681607e+09
2008	8740.0	1625.0	3107.0	3552.0	387.0	68.0	1.29	6.763733e+09
2009	8700.0	1582.0	3039.0	3604.0	412.0	63.0	1.27	6.846480e+09
2010	9140.0	1698.0	3100.0	3832.0	445.0	65.0	1.32	6.929725e+09
2011	9449.0	1760.0	3137.0	3997.0	491.0	63.0	1.35	7.013427e+09

262 rows × 8 columns

Calculating per capita emissions¶

Now we can calculate a new value for per capita emissions. We multiply by 1,000,000 to convert from units of ‘Million Metric Tons’ as the Total Emissions are expressed, to merely ‘Metric Tons’, as the existing, incomplete estimate of per capita emissions is expressed.

interpolated['Per Capita Emissions 2'] = interpolated['Total Emissions'] / interpolated['World Population'] * 1000000
interpolated.loc[1751:2011]

	Total Emissions	Gas Emissions	Liquid Emissions	Solid Emissions	Cement Emissions	Flare Emissions	Per Capita Emissions	World Population	Per Capita Emissions 2
Year
1751	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.155185e+08	0.003679
1752	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.194193e+08	0.003661
1753	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.232672e+08	0.003644
1754	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.270645e+08	0.003627
1755	3.0	0.0	0.0	3.0	0.0	0.0	NaN	8.308138e+08	0.003611
...	...	...	...	...	...	...	...	...	...
2007	8532.0	1563.0	3080.0	3442.0	382.0	65.0	1.28	6.681607e+09	1.276938
2008	8740.0	1625.0	3107.0	3552.0	387.0	68.0	1.29	6.763733e+09	1.292186
2009	8700.0	1582.0	3039.0	3604.0	412.0	63.0	1.27	6.846480e+09	1.270726
2010	9140.0	1698.0	3100.0	3832.0	445.0	65.0	1.32	6.929725e+09	1.318956
2011	9449.0	1760.0	3137.0	3997.0	491.0	63.0	1.35	7.013427e+09	1.347273

261 rows × 9 columns

Pandas and PySD¶

By default, PySD will return the results of model simulation as a Pandas DataFrame, with the column names representing elements of the model, and the index (row names) as timestamps in the model.

import pysd
model = pysd.read_vensim('../../models/Predator_Prey/Predator_Prey.mdl')
sim_result_df = model.run()
sim_result_df

	FINAL TIME	INITIAL TIME	SAVEPER	TIME STEP	Prey Births	Predation Rate	Prey Deaths	Prey Fertility	Prey Population	Predator Births	Predator Deaths	Predator Food Driven Fertility	Predator Mortality	Predator Population
0.000000	50	0	0.015625	0.015625	5.000000e+02	0.0001	2.500000e+00	2	2.500000e+02	2.500000e+01	1.000000	0.001	0.01	100.000000
0.015625	50	0	0.015625	0.015625	5.155469e+02	0.0001	2.587401e+00	2	2.577734e+02	2.587401e+01	1.003750	0.001	0.01	100.375000
0.031250	50	0	0.015625	0.015625	5.315769e+02	0.0001	2.678180e+00	2	2.657884e+02	2.678180e+01	1.007636	0.001	0.01	100.763598
0.046875	50	0	0.015625	0.015625	5.481049e+02	0.0001	2.772488e+00	2	2.740525e+02	2.772488e+01	1.011663	0.001	0.01	101.166319
0.062500	50	0	0.015625	0.015625	5.651466e+02	0.0001	2.870484e+00	2	2.825733e+02	2.870484e+01	1.015837	0.001	0.01	101.583713
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49.937500	50	0	0.015625	0.015625	8.048343e-234	0.0001	3.909033e-233	2	4.024172e-234	3.909033e-232	971.388185	0.001	0.01	97138.818477
49.953125	50	0	0.015625	0.015625	7.078281e-234	0.0001	3.437342e-233	2	3.539141e-234	3.437342e-232	971.236405	0.001	0.01	97123.640536
49.968750	50	0	0.015625	0.015625	6.225308e-234	0.0001	3.022651e-233	2	3.112654e-234	3.022651e-232	971.084650	0.001	0.01	97108.464968
49.984375	50	0	0.015625	0.015625	5.475271e-234	0.0001	2.658060e-233	2	2.737635e-234	2.658060e-232	970.932918	0.001	0.01	97093.291770
50.000000	50	0	0.015625	0.015625	4.815729e-234	0.0001	2.337510e-233	2	2.407864e-234	2.337510e-232	970.781209	0.001	0.01	97078.120943

3201 rows × 14 columns

In this case, may want to downsample the returned data to make it more manageable:

sim_result_df.loc[range(50)]

	FINAL TIME	INITIAL TIME	SAVEPER	TIME STEP	Prey Births	Predation Rate	Prey Deaths	Prey Fertility	Prey Population	Predator Births	Predator Deaths	Predator Food Driven Fertility	Predator Mortality	Predator Population
0	50	0	0.015625	0.015625	5.000000e+02	0.0001	2.500000e+00	2	2.500000e+02	2.500000e+01	1.000000	0.001	0.01	100.000000
1	50	0	0.015625	0.015625	3.537670e+03	0.0001	3.742022e+01	2	1.768835e+03	3.742022e+02	2.115529	0.001	0.01	211.552891
2	50	0	0.015625	0.015625	1.857105e+04	0.0001	2.417686e+04	2	9.285526e+03	2.417686e+05	260.371488	0.001	0.01	26037.148846
3	50	0	0.015625	0.015625	6.957765e-02	0.0001	5.403739e-01	2	3.478882e-02	5.403739e+00	1553.297342	0.001	0.01	155329.734184
4	50	0	0.015625	0.015625	1.911405e-08	0.0001	1.469722e-07	2	9.557026e-09	1.469722e-06	1537.844530	0.001	0.01	153784.453003
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
45	50	0	0.015625	0.015625	1.348667e-215	0.0001	6.881968e-215	2	6.743337e-216	6.881968e-214	1020.558213	0.001	0.01	102055.821334
46	50	0	0.015625	0.015625	2.193421e-219	0.0001	1.108119e-218	2	1.096711e-219	1.108119e-217	1010.402700	0.001	0.01	101040.269999
47	50	0	0.015625	0.015625	4.004902e-223	0.0001	2.003148e-222	2	2.002451e-223	2.003148e-221	1000.348244	0.001	0.01	100034.824354
48	50	0	0.015625	0.015625	8.198319e-227	0.0001	4.059782e-226	2	4.099159e-227	4.059782e-225	990.393838	0.001	0.01	99039.383839
49	50	0	0.015625	0.015625	1.879055e-230	0.0001	9.212427e-230	2	9.395273e-231	9.212427e-229	980.538489	0.001	0.01	98053.848891

50 rows × 14 columns

Notes¶

[1]: While pandas can handle dimensions larger than two, it is clunky. Xarray is a package for handling multidimensional data that interfaces well with Pandas.

Resources¶

Basic introduction to Pandas constructs
More advanced usage of Pandas syntax
Cookbook of Pandas Applications