Parallel Model Fitting

In this notebook, we’ll fit a simple aging model to every county in the United States. As before, we’ll use scipy.optimize to perform the fitting, but we’ll use Python’s multiprocessing library to run these optimizations in parallel.

When to use this technique

This technique is appropriate when we are modeling a large number of entirely independent but structurally identical systems. In this example, we conceptualize each county’s population as being driven by aging and by exogenous migration patterns. If we were to link the models together, for instance by requiring that the out-migration from one county be accounted for as the in-migration to another, we would instead need a single large-scale optimization, or some hybrid of the two approaches.

%pylab inline
import pandas as pd
import pysd
import scipy.optimize
import multiprocessing
import numpy as np
import seaborn
Populating the interactive namespace from numpy and matplotlib

Ingredients

Data

The first ingredient that we’ll use is census data from the 2000 and 2010 censuses:

data = pd.read_csv('../../data/Census/Males by decade and county.csv', header=[0,1], skiprows=[2])
data.head()
            2000                                                      2010
            dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9    dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
0   1   1    3375  3630  2461  3407  3283  2319  1637   825   284     3867  4384  3082  3598  4148  3390  2293  1353   454
1   1   3    9323 10094  7600  9725 10379  8519  6675  4711  1822    11446 12006  9976 11042 12517 12368 10623  6307  2911
2   1   5    2002  2198  2412  2465  2178  1699  1026   689   301     1673  1739  2260  2208  2233  1910  1490   739   324
3   1   7    1546  1460  1680  1762  1624  1237   774   475   187     1471  1577  1798  2016  1928  1581  1140   579   211
4   1   9    3741  3615  3393  3901  3773  3007  2227  1269   550     3741  4252  3312  3719  4129  3782  3052  1723   652
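
The two header rows are read in (via header=[0,1]) as a MultiIndex on the DataFrame’s columns, so all nine cohorts for a single census year can be selected in one indexing step. This is the slice that the error function below relies on:

data['2000'].head()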

Model

The model will be a simple aging chain that groups individuals into 10-year cohorts.

model = pysd.read_vensim('../../models/Aging_Chain/Aging_Chain.mdl')
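
Before fitting, we can run the model once with its default parameters as a quick sanity check. This is a minimal sketch: the returned column names depend on the variables defined in Aging_Chain.mdl.

baseline = model.run(return_timestamps=list(range(2000, 2011)))
baseline.head()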

The Recipe

As in our other optimization problems, we’ll construct an error function that calculates the sum of squared errors between our model prediction and the measured data. We’ll also construct a helper function called fit, which makes the call to the optimizer for a single county and formats the result as a Series that we can aggregate into a Pandas DataFrame.

param_names = ['dec_%i_loss_rate'%i for i in range(1,10)]

def error(param_vals, measurements):
    # run the model from the 2000 census values to 2010 with the candidate loss rates
    predictions = model.run(params=dict(zip(param_names, param_vals)),
                            initial_condition=(2000, measurements['2000']),
                            return_timestamps=2010,
                            rtol=1).loc[2010]

    errors = predictions - measurements['2010']
    return sum(errors.values[1:]**2)  # ignore first decade: no birth info

def fit(row):
    # fit the nine loss rates for a single county (one row of the census data)
    res = scipy.optimize.minimize(error, args=(row,),
                                  x0=[.05]*9,
                                  method='L-BFGS-B')
    return pd.Series(index=param_names, data=res['x'])
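
Before committing to the full dataset, we can spot-check the helper on a single county. This is an illustrative check using the first row of the census data; it should return the nine fitted loss rates for that county.

fit(data.iloc[0])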

At this point, fitting the model is a simple matter of applying the fit function to the data:

%%capture
county_params = data.apply(fit, axis=1)

On my 2014-era machine, this optimization takes about half an hour.
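
To estimate the runtime on your own machine before launching the full job, the notebook’s %time magic can time a small slice first (an illustrative check; ten counties only):

%time _ = data[:10].apply(fit, axis=1)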

We can plot the distributions of the fitted parameters across all of the counties as a histogram, to get a sense of the result. (Here we’re ignoring the first decade, which will not have reasonable parameters, as we have no information about births into the system.)

df2 = county_params.drop('dec_1_loss_rate', axis=1)
df2.plot(kind='hist', bins=np.arange(-.15,.4,.01), alpha=.4, histtype='stepfilled')
plt.xlim(-.15,.4)
plt.title('Fit yearly loss rates from each US county\n by age bracket from 2000 to 2010', fontsize=16)
plt.ylabel('Number of Counties', fontsize=16)
plt.xlabel('Yearly Loss Rate in 1% Brackets', fontsize=16)
plt.legend(frameon=False, fontsize=14)
plt.savefig('Loss_Histogram.svg')
[Figure: histogram of fitted yearly loss rates from each US county, by age bracket, 2000 to 2010]

Executing the optimization in parallel

We can take advantage of the multicore nature of most modern machines by using Python’s multiprocessing module to distribute the counties among the cores available for the calculation. The basic structure for this piece of code comes from this gist. Essentially, we create a helper function that applies the fit function to a subset of the census DataFrame, and call that helper once in each of our worker processes.

%%capture

def _apply_df(args):
    # unpack the (DataFrame chunk, function, kwargs) tuple sent to each worker
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, workers, **kwargs):
    # split the DataFrame into one chunk per worker, apply `func` to each chunk
    # in a separate process, and stitch the results back together
    pool = multiprocessing.Pool(processes=workers)
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    return pd.concat(list(result))

county_params = apply_by_multiprocessing(data[:10], fit, axis=1, workers=4)
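
For demonstration we only fit the first ten counties here; passing the full data frame distributes every county across the workers. As a sanity check (an illustrative comparison, assuming the serial fit function from above is still in scope), the parallel result should agree with a serial fit of the same rows, since L-BFGS-B is deterministic:

serial_check = data[:10].apply(fit, axis=1)
assert np.allclose(county_params.values, serial_check.values)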