Using SD models to understand the differences between population measures at varying levels of geographic disaggregation

In this recipe, we will use data at the national level to infer parameters for a population aging model. We’ll then try two different ways of using this trained model to understand variation in behavior across the individual states.

About this technique

Firstly, we’ll use the parameters fit at the national level to predict census data at the disaggregated (state) level, and compare these predicted state-level outputs with the measured values. This gives us a sense of how different the states’ populations are from what we would expect given our understanding of the national picture.

Secondly, we’ll fit the model’s parameters to each state’s actual measured census data, and then compare the fitted parameters to each other and to the national values. This analysis is helpful when the parameter itself (and its variance across states) is what we find interesting.

%pylab inline
import pandas as pd
import pysd
import scipy.optimize
import geopandas as gpd
Populating the interactive namespace from numpy and matplotlib

Ingredients

Population data by age cohort

We start with data from the decennial census years 2000 and 2010, for the male population by state and county. We have aggregated the data into ten-year age cohorts (with the last cohort representing everyone over 80 years old). The data collection task is described here. In this analysis we will use data only at the state and national levels.

data = pd.read_csv('../../data/Census/Males by decade and county.csv', header=[0,1], index_col=[0,1])
data.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284 3867 4384 3082 3598 4148 3390 2293 1353 454
3 9323 10094 7600 9725 10379 8519 6675 4711 1822 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 2002 2198 2412 2465 2178 1699 1026 689 301 1673 1739 2260 2208 2233 1910 1490 739 324
7 1546 1460 1680 1762 1624 1237 774 475 187 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 3615 3393 3901 3773 3007 2227 1269 550 3741 4252 3312 3719 4129 3782 3052 1723 652
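
Because the columns carry a two-level index (census year, then decade cohort), selecting a year label pulls out just that census’s cohort columns. This indexing pattern (data['2000'], and later country['2000'], row['2000']) recurs throughout the recipe:

# select the year-2000 cohort columns for every county
data['2000'].head()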

A model of an aging population

The model we’ll use to represent the population is a simple aging chain, with individuals aggregated into stocks by decade, to match the aggregation we used for the data above. Each cohort is promoted with a timescale of 10 years, and net in-migration, out-migration, and deaths are subsumed into the loss flow associated with each cohort. This loss is controlled by a yearly fraction, and it is that fraction we’ll be trying to estimate.

../../_images/Aging_Chain.png
model = pysd.read_vensim('../../models/Aging_Chain/Aging_Chain.mdl')
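
To make the structure concrete, here is a minimal pure-Python sketch of one time step of such an aging chain. This is an illustration of the stock-and-flow logic described above, not the equations in Aging_Chain.mdl itself; the function name, the Euler step, and the assumption that the final 80+ cohort has no promotion outflow are choices made here for clarity:

import numpy as np

def aging_chain_step(cohorts, loss_rates, births=0.0, dt=1.0):
    # Advance every cohort one Euler step of `dt` years (illustrative only).
    cohorts = np.asarray(cohorts, dtype=float)
    promotions = cohorts / 10.0                              # ten-year promotion timescale
    losses = cohorts * np.asarray(loss_rates, dtype=float)   # net migration and deaths
    inflows = np.concatenate(([births], promotions[:-1]))    # each cohort receives the one below it
    outflows = promotions + losses
    outflows[-1] = losses[-1]                                 # assume the 80+ cohort is not promoted out
    return cohorts + dt * (inflows - outflows)

# ten years of the default parameterization: 10 people per stock, 5% yearly loss
pop = [10.0] * 9
for year in range(10):
    pop = aging_chain_step(pop, [0.05] * 9)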

Our model is initially parameterized with 10 individuals in each stock, no births, and a uniform loss rate of 5%. We’ll use census data to set the initial conditions, and infer the loss rates. Estimating births is difficult, so for this analysis we’ll pay attention only to individuals born before the year 2000.

model.run().plot();
../../_images/Exploring_models_across_geographic_scales_8_0.png

Geography Information

This information comes to us as a shapefile (.shp) with its associated .dbf and .shx conspirator files. Let’s check the plotting functionality:

state_geo = gpd.read_file('../../data/Census/US_State.shp')
state_geo.set_index('StateFIPSN', inplace=True)
state_geo.plot();
state_geo.head(2)
CensusDiv CensusReg FIPS FIPSNum Notes OBJECTID StateFIPS StateName XCentroid YCentroid geometry
StateFIPSN
2 Pacific West 02000 2000 None 1 02 Alaska -1882092.15195 2310348.392810 (POLYGON ((-2247528.774479387 2237995.01157197...
53 Pacific West 53000 53000 None 9 53 Washington -1837353.15317 1340481.223852 (POLYGON ((-2124362.24278068 1480441.850674443...
../../_images/Exploring_models_across_geographic_scales_10_1.png

Recipe Part A: Predict state-level values from national model fit

Step 1: Initialize the model using census data

We can aggregate the county level data to the national scale by summing across all geographies. This is relatively straightforward.

country = data.sum()
country
2000  dec_1    20332536
      dec_2    20909490
      dec_3    19485544
      dec_4    21638975
      dec_5    21016627
      dec_6    15115009
      dec_7     9536197
      dec_8     6946906
      dec_9     3060483
2010  dec_1    20703935
      dec_2    21878666
      dec_3    21645336
      dec_4    20033352
      dec_5    21597437
      dec_6    20451686
      dec_7    13926846
      dec_8     7424945
      dec_9     4083435
dtype: float64

If we run the model using national data from the year 2000 as starting conditions, we can see how the cohorts develop, given our arbitrary loss rate values:

model.run(return_timestamps=range(2000,2011),
          initial_condition=(2000, country['2000'])).plot();
../../_images/Exploring_models_across_geographic_scales_14_0.png

Step 2: Fit the national level model to the remaining data

We’ve used half of our data (from the year 2000 census) to initialize our model. Now we’ll use an optimization routine to choose the loss rate parameters that best predict the census 2010 data. We’ll use the same basic operations described in the previous recipe: Fitting with Optimization.

To keep this simple, we’ll write a function that takes a list of candidate model parameters and returns the model’s prediction for the year 2010.

def exec_model(paramlist):
    params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], paramlist))
    output = model.run(initial_condition=(2000,country['2000']),
                       params=params, return_timestamps=2010)
    return output

Now we’ll define an error function that calls this executor and calculates a sum of squared errors value. Remember that because we don’t have birth information, we’ll only calculate error based upon age cohorts 2 through 9.

def error(paramlist):
    output = exec_model(paramlist)
    errors = output - country['2010']
    #don't tally errors in the first cohort, as we don't have info about births
    return sum(errors.values[0,1:]**2)
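
As a quick sanity check before optimizing, the error function can be evaluated at the same uniform 5% guess we will hand to the optimizer (output not shown here):

# evaluate the sum of squared errors at the default 5% loss rates
error([0.05] * 9)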

Now we can use the minimize function from scipy to find a vector of parameters which brings the 2010 predictions into alignment with the data.

res = scipy.optimize.minimize(error, x0=[.05]*9,
                              method='L-BFGS-B')
country_level_fit_params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], res['x']))
country_level_fit_params
{'dec_1_loss_rate': 0.021183432598200467,
 'dec_2_loss_rate': -0.052101419562612286,
 'dec_3_loss_rate': -0.0014091019293939956,
 'dec_4_loss_rate': 0.0088436979759478375,
 'dec_5_loss_rate': -0.0072046351581388701,
 'dec_6_loss_rate': -0.011046250905098235,
 'dec_7_loss_rate': 0.017228650364514753,
 'dec_8_loss_rate': 0.063195268137886118,
 'dec_9_loss_rate': 0.16077452197707129}

If we run the national model forward with these parameters, we see generally good behavior, except for the 0-9yr demographic bracket, from whom we expect less self-control (and for whom we have no births data).

model.run(params=country_level_fit_params,
          return_timestamps=range(2000,2011),
          initial_condition=(2000, country['2000'])).plot();
../../_images/Exploring_models_across_geographic_scales_22_0.png

Step 3: Make state-level predictions

If we want to look at the variances between the states and the national level, we can try making state-level predictions using state-specific initial conditions, but parameters fit at the national level.

states = data.sum(level=0)
states.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1 312841 329043 301076 315262 321447 246427 165327 109918 45131 312605 338568 321236 297502 321810 318358 229496 124070 56543
2 50687 53992 42537 51442 56047 35804 14974 7628 2325 53034 52278 58166 47753 51856 54170 29869 10392 4151
4 395110 384672 386486 391330 352471 257798 187193 144837 61148 463808 466275 455170 422447 418398 381076 300553 178849 89247
5 188589 201405 180801 187516 186931 149142 104621 72629 33050 201821 205074 196956 183761 192596 188081 143285 81138 38925
6 2669364 2588761 2556975 2812648 2495158 1692007 1002881 725610 331342 2573619 2780997 2849483 2595717 2655307 2336519 1489395 780576 456217

We can now generate a prediction by setting the model’s initial conditions with state-level data, and using parameters fit in the national case. I’ve created a model_runner helper function to make the code easier to read, but this could be done in a single line if we so chose.

def model_runner(row):
    result = model.run(params=country_level_fit_params,
                       initial_condition=(2000,row['2000']),
                       return_timestamps=2010)
    return result.loc[2010]

state_predictions = states.apply(model_runner, axis=1)
state_predictions.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1 93117.371881 341167.213342 337071.875334 304311.201330 325833.277920 316445.464900 223627.149408 121168.476612 65739.525962
2 15087.025659 55697.612356 52613.730625 47359.482364 53391.883377 50862.893499 31762.129879 14297.875847 6133.949764
4 117604.805785 411745.021973 413024.709292 377589.072369 386843.833461 354697.832262 246306.408711 138031.543558 79499.152922
5 56133.665931 207553.388634 204417.938099 183004.303895 192875.524445 187772.888790 135135.556656 75209.777699 42570.643627
6 794538.317812 2775504.193632 2765302.000982 2586476.416275 2709538.756509 2450307.128284 1595568.711162 818892.701581 439955.979351

Step 4: Compare model predictions with measured data

Comparing the state-level predictions with the actual data, we can see where our model most under- or over-predicts the population for each state/cohort combination.

diff = state_predictions-states['2010']
diff.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1 -219487.628119 2599.213342 15835.875334 6809.201330 4023.277920 -1912.535100 -5868.850592 -2901.523388 9196.525962
2 -37946.974341 3419.612356 -5552.269375 -393.517636 1535.883377 -3307.106501 1893.129879 3905.875847 1982.949764
4 -346203.194215 -54529.978027 -42145.290708 -44857.927631 -31554.166539 -26378.167738 -54246.591289 -40817.456442 -9747.847078
5 -145687.334069 2479.388634 7461.938099 -756.696105 279.524445 -308.111210 -8149.443344 -5928.222301 3645.643627
6 -1779080.682188 -5492.806368 -84180.999018 -9240.583725 54231.756509 113788.128284 106173.711162 38316.701581 -16261.020649

This is a little easier to interpret if we normalize by the actual measured population:

diff_percent = (state_predictions-states['2010'])/states['2010']
diff_percent.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state
1 -0.702124 0.007677 0.049297 0.022888 0.012502 -0.006007 -0.025573 -0.023386 0.162647
2 -0.715522 0.065412 -0.095456 -0.008241 0.029618 -0.061051 0.063381 0.375854 0.477704
4 -0.746436 -0.116948 -0.092592 -0.106186 -0.075417 -0.069220 -0.180489 -0.228223 -0.109223
5 -0.721864 0.012090 0.037886 -0.004118 0.001451 -0.001638 -0.056876 -0.073063 0.093658
6 -0.691276 -0.001975 -0.029543 -0.003560 0.020424 0.048700 0.071286 0.049088 -0.035643

Step 5: Merge with geo data to plot

I’m using geopandas to manage the shapefiles, and it has its own plotting functionality. Unfortunately, that functionality is not particularly well developed.

geo_diff = state_geo.join(diff_percent)
geo_diff.plot(column='dec_4')
geo_diff.head()
CensusDiv CensusReg FIPS FIPSNum Notes OBJECTID StateFIPS StateName XCentroid YCentroid geometry dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
StateFIPSN
2 Pacific West 02000 2000 None 1 02 Alaska -1882092.151950 2310348.392810 (POLYGON ((-2247528.774479387 2237995.01157197... -0.715522 0.065412 -0.095456 -0.008241 0.029618 -0.061051 0.063381 0.375854 0.477704
53 Pacific West 53000 53000 None 9 53 Washington -1837353.153170 1340481.223852 (POLYGON ((-2124362.24278068 1480441.850674443... -0.718560 -0.006548 -0.058893 -0.069401 -0.018494 -0.031807 -0.047478 0.013500 -0.030573
23 New England Northeast 23000 23000 None 10 23 Maine 2068849.532637 1172786.748295 (POLYGON ((1951177.135094963 1127914.539498126... -0.682708 0.073037 0.147467 0.085983 -0.023217 -0.055249 -0.076098 -0.034226 -0.009394
27 West North Central Midwest 27000 27000 None 11 27 Minnesota 131047.575089 982130.006959 POLYGON ((-91052.16805501282 1282100.079225723... -0.711546 0.062669 0.034611 0.032717 0.020127 -0.019902 0.043175 0.022524 -0.060437
26 East North Central Midwest 26000 26000 None 18 26 Michigan 842567.889298 809437.260865 (POLYGON ((764918.7306621727 786184.882347865,... -0.657411 0.082088 0.197840 0.176409 0.088111 0.032552 0.050903 0.075986 0.011505
../../_images/Exploring_models_across_geographic_scales_33_1.png
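
If a readable color scale is wanted, a sketch along these lines adds one (this assumes the geo_diff frame above; cmap and legend are standard GeoDataFrame.plot keyword arguments, though legend support varies across geopandas versions):

# same data, with an explicit diverging colormap and a legend
ax = geo_diff.plot(column='dec_4', cmap='RdBu', legend=True)
ax.set_title('dec_4: (predicted - measured) / measured, 2010');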

Recipe Part B: fit state-by-state models

Now let’s try optimizing the model’s parameters separately for each state, and comparing the results with the national picture.

Step 1: Write the optimization functions to account for the state

We’ll start, as before, with functions that run the model and compute the error (this time taking an extra argument carrying the census information for the state in question), and add a function that optimizes and returns the best-fit parameters for each state.

def exec_model(paramlist, state):
    params = dict(zip(['dec_%i_loss_rate'%i for i in range(1,10)], paramlist))
    output = model.run(initial_condition=(2000,state['2000']),
                       params=params, return_timestamps=2010).loc[2010]
    return output

def error(paramlist, state):
    output = exec_model(paramlist, state)
    errors = output - state['2010']
    #don't tally errors in the first cohort, as we don't have info about births
    sse = sum(errors.values[1:]**2)
    return sse

Step 2: Apply optimization to each state

We can wrap the optimizer in a function that takes census information about each state and returns an optimized set of parameters for that state. If we apply it to the states dataframe, we can get out a similar dataframe that includes optimized parameters.

%%capture
def optimize_params(row):
    res = scipy.optimize.minimize(lambda x: error(x, row),
                                  x0=[.05]*9,
                                  method='L-BFGS-B');
    return pd.Series(index=['dec_%i_loss_rate'%i for i in range(1,10)], data=res['x'])

state_fit_params = states.apply(optimize_params, axis=1)
state_fit_params.head()
dec_1_loss_rate dec_2_loss_rate dec_3_loss_rate dec_4_loss_rate dec_5_loss_rate dec_6_loss_rate dec_7_loss_rate dec_8_loss_rate dec_9_loss_rate
state
1 0.021448 -0.051229 0.005996 0.009159 -0.006314 -0.012721 0.013077 0.061467 0.199157
2 0.024790 -0.045179 -0.021897 0.015292 -0.003120 -0.023723 0.039461 0.137920 0.200484
4 0.015650 -0.066072 -0.009319 -0.003544 -0.012930 -0.018060 -0.013545 0.033047 0.168508
5 0.022108 -0.050841 0.003923 0.005605 -0.006411 -0.011430 0.006571 0.054650 0.191023
6 0.020025 -0.052080 -0.006135 0.010332 -0.004042 -0.004572 0.025891 0.065019 0.148008

Step 3: Merge with geographic data

As the model parameters are fractions which are themselves multiplied by populations to generate the actual flows of people, we can compare the parameters across states directly, without needing to normalize.

geo_diff = state_geo.join(state_fit_params)
geo_diff.plot(column='dec_4_loss_rate')
geo_diff.head(3)
CensusDiv CensusReg FIPS FIPSNum Notes OBJECTID StateFIPS StateName XCentroid YCentroid geometry dec_1_loss_rate dec_2_loss_rate dec_3_loss_rate dec_4_loss_rate dec_5_loss_rate dec_6_loss_rate dec_7_loss_rate dec_8_loss_rate dec_9_loss_rate
StateFIPSN
2 Pacific West 02000 2000 None 1 02 Alaska -1882092.151950 2310348.392810 (POLYGON ((-2247528.774479387 2237995.01157197... 0.024790 -0.045179 -0.021897 0.015292 -0.003120 -0.023723 0.039461 0.137920 0.200484
53 Pacific West 53000 53000 None 9 53 Washington -1837353.153170 1340481.223852 (POLYGON ((-2124362.24278068 1480441.850674443... 0.021844 -0.053040 -0.010785 0.001392 -0.006107 -0.015620 0.010874 0.072777 0.150871
23 New England Northeast 23000 23000 None 10 23 Maine 2068849.532637 1172786.748295 (POLYGON ((1951177.135094963 1127914.539498126... 0.025517 -0.044513 0.016994 0.013372 -0.014052 -0.017986 0.007721 0.065287 0.161693
../../_images/Exploring_models_across_geographic_scales_39_1.png
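
To tie this back to the national picture from Part A, one simple comparison (a sketch that assumes the country_level_fit_params dictionary and state_fit_params dataframe constructed above) is to subtract the national fit from each state’s fit; positive values indicate cohorts a state loses faster than the country as a whole:

# deviation of each state's fitted loss rates from the national-level fit
param_diff = state_fit_params - pd.Series(country_level_fit_params)
param_diff.head()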