Using SD models to understand the differences between population measures at varying levels of geographic disaggregation

In this recipe, we will use data at the national level to infer parameters for a population aging model. We’ll then try two different ways of using this fitted model to understand how the behavior of individual states varies.

About this technique

First, we’ll use the parameters fit at the national level to predict census data at the disaggregated (state) level, and compare these predicted state-level outputs with the measured values. This gives us a sense of how different the state populations are from what we would expect given our understanding of the national picture.

Second, we’ll fit the model’s parameters separately to each state’s measured census data, and then compare the fitted parameters with one another and with the national values. This analysis is helpful when the parameter itself (and its variation from state to state) is what we find interesting.

%pylab inline
import pandas as pd
import pysd
import scipy.optimize
import geopandas as gpd
Populating the interactive namespace from numpy and matplotlib

## Ingredients

Population data by age cohort

We start with data from the decennial census years 2000 and 2010, for the male population by state and county. We have aggregated the data into ten-year age cohorts (with the last cohort representing everyone over 80 years old). The data collection task is described here. In this analysis we will only use data at the state and national levels.

data = pd.read_csv('../../data/Census/Males by decade and county.csv', header=[0,1], index_col=[0,1])
data.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1.0 1.0 3375.0 3630.0 2461.0 3407.0 3283.0 2319.0 1637.0 825.0 284.0 3867.0 4384.0 3082.0 3598.0 4148.0 3390.0 2293.0 1353.0 454.0
3.0 9323.0 10094.0 7600.0 9725.0 10379.0 8519.0 6675.0 4711.0 1822.0 11446.0 12006.0 9976.0 11042.0 12517.0 12368.0 10623.0 6307.0 2911.0
5.0 2002.0 2198.0 2412.0 2465.0 2178.0 1699.0 1026.0 689.0 301.0 1673.0 1739.0 2260.0 2208.0 2233.0 1910.0 1490.0 739.0 324.0
7.0 1546.0 1460.0 1680.0 1762.0 1624.0 1237.0 774.0 475.0 187.0 1471.0 1577.0 1798.0 2016.0 1928.0 1581.0 1140.0 579.0 211.0
9.0 3741.0 3615.0 3393.0 3901.0 3773.0 3007.0 2227.0 1269.0 550.0 3741.0 4252.0 3312.0 3719.0 4129.0 3782.0 3052.0 1723.0 652.0
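For reference, the cohorts run from dec_1 (ages 0-9) up through dec_9 (ages 80 and over). The sketch below shows one way such a binning could be done with pandas; the raw_census dataframe and its age and population columns are hypothetical stand-ins for whatever the upstream collection task produces, not files shipped with this recipe.

# Hypothetical raw table: one row per (state, county, single year of age)
raw_census = pd.DataFrame({
    'state':      [1.0, 1.0, 1.0],
    'county':     [1.0, 1.0, 1.0],
    'age':        [4, 37, 85],
    'population': [350, 410, 40],
})

# Map each single year of age to a decade cohort, capping at dec_9 (80+)
raw_census['cohort'] = 'dec_' + (raw_census['age'] // 10 + 1).clip(upper=9).astype(str)

# Sum within each cohort and pivot the cohorts out into columns
cohorts = (raw_census
           .groupby(['state', 'county', 'cohort'])['population']
           .sum()
           .unstack('cohort'))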

### A model of an aging population

The model we’ll use to represent the population is a simple aging chain, with individuals aggregated into stocks by decade to match the aggregation we used for the data above. Each cohort is promoted with a timescale of 10 years, and net in-migration, out-migration, and death are subsumed into the loss flow associated with each cohort. This loss is controlled by a yearly fraction, and estimating that fraction will be our task.

../../_images/Aging_Chain.png
model = pysd.read_vensim('../../models/Aging_Chain/Aging_Chain.mdl')
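To make the stock-and-flow structure concrete, here is a minimal sketch (plain Python, not the Vensim model itself) of the net flow into a single cohort: individuals are promoted in from the previous cohort over a ten-year timescale, promoted out to the next cohort on the same timescale, and removed at the cohort’s yearly loss fraction. The function and argument names are illustrative, not the model’s actual variable names.

def cohort_net_flow(prev_cohort, this_cohort, loss_rate, promotion_time=10):
    """Yearly net change of one stock in the aging chain (illustrative only)."""
    promotion_in = prev_cohort / promotion_time    # arrivals from the younger cohort
    promotion_out = this_cohort / promotion_time   # departures to the older cohort
    loss = this_cohort * loss_rate                 # net migration and deaths
    return promotion_in - promotion_out - loss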

Our model is initially parameterized with 10 individuals in each stock, no births, and a uniform loss rate of 5%. We’ll use data to set the initial conditions and infer the loss rates. Estimating births is difficult, so for this analysis we’ll pay attention only to individuals born before the year 2000.

model.run().plot();
../../_images/Exploring_models_across_geographic_scales_8_0.png

Geography Information

This information comes to us as a shapefile (.shp) with its associated .dbf and .shx conspirator files. Let’s check the plotting functionality:

state_geo = gpd.read_file('../../data/Census/US_State.shp')
state_geo.set_index('StateFIPSN', inplace=True)
state_geo.plot();
state_geo.head(2)
OBJECTID FIPS FIPSNum StateFIPS StateName CensusReg CensusDiv XCentroid YCentroid Notes geometry
StateFIPSN
2 1 02000 2000 02 Alaska West Pacific -1.882092e+06 2.310348e+06 None MULTIPOLYGON (((-2247528.774 2237995.012, -224...
53 9 53000 53000 53 Washington West Pacific -1.837353e+06 1.340481e+06 None MULTIPOLYGON (((-2124362.243 1480441.851, -212...
../../_images/Exploring_models_across_geographic_scales_10_1.png

Recipe Part A: Predict state-level values from national model fit

Step 1: Initialize the model using census data

We can aggregate the county level data to the national scale by summing across all geographies. This is relatively straightforward.

country = data.sum()
country
2000  dec_1    20332536.0
      dec_2    20909490.0
      dec_3    19485544.0
      dec_4    21638975.0
      dec_5    21016627.0
      dec_6    15115009.0
      dec_7     9536197.0
      dec_8     6946906.0
      dec_9     3060483.0
2010  dec_1    20703935.0
      dec_2    21878666.0
      dec_3    21645336.0
      dec_4    20033352.0
      dec_5    21597437.0
      dec_6    20451686.0
      dec_7    13926846.0
      dec_8     7424945.0
      dec_9     4083435.0
dtype: float64

If we run the model using national data from the year 2000 as starting conditions, we can see how the cohorts develop, given our arbitrary loss rate values:

model.run(return_timestamps=range(2000,2011),
          initial_condition=(2000, country['2000'])).plot();
../../_images/Exploring_models_across_geographic_scales_14_0.png

Step 2: Fit the national level model to the remaining data

We’ve used half of our data (from the year 2000 census) to initialize our model. Now we’ll use an optimization routine to choose the loss rate parameters that best predict the census 2010 data. We’ll use the same basic operations described in the previous recipe: Fitting with Optimization.

To make this simple, we’ll write a function that takes a list of potential model parameters and returns the model’s prediction for the year 2010:

def exec_model(paramlist):
    params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], paramlist))
    output = model.run(initial_condition=(2000,country['2000']),
                       params=params, return_timestamps=2010)
    return output

Now we’ll define an error function that calls this executor and calculates a sum of squared errors value. Remember that because we don’t have birth information, we’ll only calculate error based upon age cohorts 2 through 9.

def error(paramlist):
    output = exec_model(paramlist)
    errors = output - country['2010']
    #don't tally errors in the first cohort, as we don't have info about births
    return sum(errors.values[0,1:]**2)
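Before handing this to the optimizer, it can be worth evaluating the error once at the uniform 5% starting guess, just to confirm the whole pipeline runs end to end (an optional check, not part of the original recipe):

# Evaluate the sum-of-squares error at the initial parameter guess
error([0.05] * 9)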

Now we can use the minimize function from scipy to find a vector of parameters which brings the 2010 predictions into alignment with the data.

res = scipy.optimize.minimize(error, x0=[.05]*9,
                              method='L-BFGS-B')
country_level_fit_params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], res['x']))
country_level_fit_params
{'dec_1_loss_rate': 0.05,
 'dec_2_loss_rate': 0.05,
 'dec_3_loss_rate': 0.05,
 'dec_4_loss_rate': 0.05,
 'dec_5_loss_rate': 0.05,
 'dec_6_loss_rate': 0.05,
 'dec_7_loss_rate': 0.05,
 'dec_8_loss_rate': 0.05,
 'dec_9_loss_rate': 0.05}

If we run the national model forward with these parameters, we see generally good behavior, except for the 0-9 yr demographic bracket, from whom we expect less self-control anyway (and for whom we lack birth data).

model.run(params=country_level_fit_params,
          return_timestamps=range(2000,2011),
          initial_condition=(2000, country['2000'])).plot();
../../_images/Exploring_models_across_geographic_scales_22_0.png

### Step 3: Make state-level predictions

If we want to look at the variances between the states and the national level, we can try making state-level predictions using state-specific initial conditions, but parameters fit at the national level.

states = data.groupby(level=0).sum()
states.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1.0 312841.0 329043.0 301076.0 315262.0 321447.0 246427.0 165327.0 109918.0 45131.0 312605.0 338568.0 321236.0 297502.0 321810.0 318358.0 229496.0 124070.0 56543.0
2.0 50687.0 53992.0 42537.0 51442.0 56047.0 35804.0 14974.0 7628.0 2325.0 53034.0 52278.0 58166.0 47753.0 51856.0 54170.0 29869.0 10392.0 4151.0
4.0 395110.0 384672.0 386486.0 391330.0 352471.0 257798.0 187193.0 144837.0 61148.0 463808.0 466275.0 455170.0 422447.0 418398.0 381076.0 300553.0 178849.0 89247.0
5.0 188589.0 201405.0 180801.0 187516.0 186931.0 149142.0 104621.0 72629.0 33050.0 201821.0 205074.0 196956.0 183761.0 192596.0 188081.0 143285.0 81138.0 38925.0
6.0 2669364.0 2588761.0 2556975.0 2812648.0 2495158.0 1692007.0 1002881.0 725610.0 331342.0 2573619.0 2780997.0 2849483.0 2595717.0 2655307.0 2336519.0 1489395.0 780576.0 456217.0

We can now generate a prediction by setting the model’s initial conditions with state-level data and using the parameters fit in the national case. I’ve created a model_runner helper function to make the code easier to read, but this could be done in a single line if we so chose.

def model_runner(row):
    result = model.run(params=country_level_fit_params,
                       initial_condition=(2000, row['2000']),
                       return_timestamps=2010,
                       return_columns=[
                           "dec_1", "dec_2", "dec_3",
                           "dec_4", "dec_5", "dec_6",
                           "dec_7", "dec_8", "dec_9"
                       ])
    return result.loc[2010]

state_predictions = states.apply(model_runner, axis=1)
state_predictions.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1.0 68817.239206 1.425135e+05 1.752830e+05 1.856499e+05 1.905074e+05 1.766502e+05 1.430809e+05 104576.864395 112780.892489
2.0 11149.879343 2.323983e+04 2.717858e+04 2.883606e+04 3.114106e+04 2.841718e+04 2.006445e+04 11731.576605 9263.036022
4.0 86914.373061 1.731934e+05 2.158222e+05 2.308789e+05 2.268514e+05 1.985338e+05 1.576861e+05 120469.144700 140567.692416
5.0 41484.889527 8.658169e+04 1.061958e+05 1.115457e+05 1.128048e+05 1.048007e+05 8.660780e+04 65340.896065 74727.981971
6.0 587193.689179 1.167877e+06 1.443928e+06 1.583683e+06 1.589019e+06 1.372822e+06 1.015240e+06 699665.480214 759574.642778

Step 4: Compare model predictions with measured data

Comparing the state-level predictions with the actual data, we can see where our model most under- or over-predicts the population for each region/cohort combination.

diff = state_predictions-states['2010']
diff.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1.0 -2.437878e+05 -1.960545e+05 -1.459530e+05 -1.118521e+05 -1.313026e+05 -141707.812835 -86415.080653 -19493.135605 56237.892489
2.0 -4.188412e+04 -2.903817e+04 -3.098742e+04 -1.891694e+04 -2.071494e+04 -25752.819389 -9804.550877 1339.576605 5112.036022
4.0 -3.768936e+05 -2.930816e+05 -2.393478e+05 -1.915681e+05 -1.915466e+05 -182542.183715 -142866.899500 -58379.855300 51320.692416
5.0 -1.603361e+05 -1.184923e+05 -9.076017e+04 -7.221535e+04 -7.979121e+04 -83280.336688 -56677.198680 -15797.103935 35802.981971
6.0 -1.986425e+06 -1.613120e+06 -1.405555e+06 -1.012034e+06 -1.066288e+06 -963697.077944 -474154.512342 -80910.519786 303357.642778

This is a little easier to understand if we weight it by the actual measured population:

diff_percent = (state_predictions-states['2010'])/states['2010']
diff_percent.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1.0 -0.779859 -0.579070 -0.454348 -0.375971 -0.408013 -0.445121 -0.376543 -0.157114 0.994604
2.0 -0.789760 -0.555457 -0.532741 -0.396141 -0.399470 -0.475407 -0.328252 0.128905 1.231519
4.0 -0.812607 -0.628559 -0.525843 -0.453473 -0.457810 -0.479018 -0.475347 -0.326420 0.575041
5.0 -0.794447 -0.577803 -0.460814 -0.392985 -0.414293 -0.442790 -0.395556 -0.194694 0.919794
6.0 -0.771841 -0.580051 -0.493267 -0.389886 -0.401568 -0.412450 -0.318354 -0.103655 0.664942
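Before mapping these values, we can also summarize how the relative errors are distributed across states for each cohort (an optional aside, not in the original recipe):

# Spread of the relative prediction error across states, by cohort
diff_percent.describe().loc[['mean', 'std', 'min', 'max']]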

### Step 5: Merge with geo data to plot

I’m using geopandas to manage the shapefiles, and it has its own plotting functionality. Unfortunately, that functionality is somewhat rough around the edges.

geo_diff = state_geo.join(diff_percent)
geo_diff.plot(column='dec_4')
geo_diff.head()
OBJECTID FIPS FIPSNum StateFIPS StateName CensusReg CensusDiv XCentroid YCentroid Notes geometry dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
StateFIPSN
2 1 02000 2000 02 Alaska West Pacific -1.882092e+06 2.310348e+06 None MULTIPOLYGON (((-2247528.774 2237995.012, -224... -0.789760 -0.555457 -0.532741 -0.396141 -0.399470 -0.475407 -0.328252 0.128905 1.231519
53 9 53000 53000 53 Washington West Pacific -1.837353e+06 1.340481e+06 None MULTIPOLYGON (((-2124362.243 1480441.851, -212... -0.792005 -0.585096 -0.510514 -0.430876 -0.426070 -0.459076 -0.393843 -0.135940 0.674966
23 10 23000 23000 23 Maine Northeast New England 2.068850e+06 1.172787e+06 None MULTIPOLYGON (((1951177.135 1127914.539, 19511... -0.765509 -0.556475 -0.409031 -0.338041 -0.430710 -0.473652 -0.409398 -0.164981 0.726818
27 11 27000 27000 27 Minnesota Midwest West North Central 1.310476e+05 9.821300e+05 None POLYGON ((-91052.168 1282100.079, -30035.391 1... -0.786822 -0.557537 -0.463915 -0.369641 -0.403840 -0.452106 -0.335580 -0.123440 0.645598
26 18 26000 26000 26 Michigan Midwest East North Central 8.425679e+05 8.094373e+05 None MULTIPOLYGON (((764918.731 786184.882, 772386.... -0.746814 -0.547370 -0.378482 -0.282202 -0.364176 -0.423316 -0.329958 -0.074694 0.747247
../../_images/Exploring_models_across_geographic_scales_33_1.png
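If we want a slightly more readable map, geopandas (in versions that support the legend keyword) lets us pass a colormap and request a colorbar directly; this refinement is my addition rather than part of the original recipe:

ax = geo_diff.plot(column='dec_4', cmap='RdBu', legend=True, figsize=(10, 6))
ax.set_title('Relative prediction error, ages 30-39 (dec_4)');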

Recipe Part B: Fit state-by-state models

Now let’s try optimizing the model’s parameters for each state individually, and compare the results with the national picture.

Step 1: Write the optimization functions to account for the state

We’ll start, as before, with functions that run the model and compute the error (this time taking an argument carrying the state’s census information), and add a function that optimizes and returns the best-fit parameters for each state.

def exec_model(paramlist, state):
    params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], paramlist))
    output = model.run(initial_condition=(2000,state['2000']),
                       params=params, return_timestamps=2010).loc[2010]
    return output

def error(paramlist, state):
    output = exec_model(paramlist, state)
    errors = output - state['2010']
    #don't tally errors in the first cohort, as we don't have info about births
    sse = sum(errors.values[1:]**2)
    return sse

### Step 2: Apply optimization to each state

We can wrap the optimizer in a function that takes census information about each state and returns an optimized set of parameters for that state. If we apply it to the states dataframe, we get back a similar dataframe containing the optimized parameters.

%%capture
def optimize_params(row):
    res = scipy.optimize.minimize(lambda x: error(x, row),
                                  x0=[.05]*9,
                                  method='L-BFGS-B');
    return pd.Series(index=['dec_%i_loss_rate'%i for i in range(1,10)], data=res['x'])

state_fit_params = states.apply(optimize_params, axis=1)
state_fit_params.head()
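With fitted parameters in hand for every state, a quick numeric comparison against the national fit from Part A is possible before we map anything; this is my own sketch, reusing the country_level_fit_params dictionary computed earlier:

# Difference between each state's fitted loss rates and the national-level fit
param_diff = state_fit_params - pd.Series(country_level_fit_params)
param_diff.head()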

Step 3: Merge with geographic data

Because the model parameters are themselves multiplied by populations to generate actual flows of people, we can compare the parameters directly between states without needing to normalize.

geo_diff = state_geo.join(state_fit_params)
geo_diff.plot(column='dec_4_loss_rate')
geo_diff.head(3)
OBJECTID FIPS FIPSNum StateFIPS StateName CensusReg CensusDiv XCentroid YCentroid Notes geometry dec_1_loss_rate dec_2_loss_rate dec_3_loss_rate dec_4_loss_rate dec_5_loss_rate dec_6_loss_rate dec_7_loss_rate dec_8_loss_rate dec_9_loss_rate
StateFIPSN
2 1 02000 2000 02 Alaska West Pacific -1.882092e+06 2.310348e+06 None MULTIPOLYGON (((-2247528.774 2237995.012, -224... 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
53 9 53000 53000 53 Washington West Pacific -1.837353e+06 1.340481e+06 None MULTIPOLYGON (((-2124362.243 1480441.851, -212... 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
23 10 23000 23000 23 Maine Northeast New England 2.068850e+06 1.172787e+06 None MULTIPOLYGON (((1951177.135 1127914.539, 19511... 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
../../_images/Exploring_models_across_geographic_scales_39_1.png