Collecting US decennial census data

In this notebook, we’ll collect demographic data from the US decennial census, by county.

The census website has an API, which is good, because everything else about the census website is close to unusable. The API is described here: http://www.census.gov/data/developers/data-sets/decennial-census-data.html

As a quick demonstration, we can use the API to get population data for every county in the US:

import pandas as pd

# The API returns a JSON array of rows, with the column names as the
# first row; promote that row to the header and drop it from the data.
df = pd.read_json('http://api.census.gov/data/2010/sf1?get=P0120001&for=county:*')
df.columns = df.iloc[0]
df.drop(df.index[0], inplace=True)
df.head()
P0120001 state county
1 54571 01 001
2 182265 01 003
3 27457 01 005
4 22915 01 007
5 57322 01 009
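
One practical note: the Census API hands out free API keys and may require one for heavier use; if so, it just gets appended to the query string, along these lines (the placeholder key is yours to fill in):

API_KEY = 'your-key-here'  # free signup on the census developers site
url = 'http://api.census.gov/data/2010/sf1?get=P0120001&for=county:*&key=%s' % API_KEY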

The census code descriptions can also be accessed via the API. A listing of the field names is available here: http://api.census.gov/data/2010/sf1/variables.html

pd.read_json('http://api.census.gov/data/2010/sf1/variables/P0120001.json', typ='series')
concept    P12. Sex By Age [49]
label          Total population
name                   P0120001
dtype: object
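
To look up labels for several fields at once, we can wrap that endpoint in a loop; here's a small convenience helper of my own (not part of the census API itself):

def field_label(field, year=2010, dataset='sf1'):
    # Fetch the metadata record for one field and return its human-readable label.
    url = 'http://api.census.gov/data/%i/%s/variables/%s.json' % (year, dataset, field)
    return pd.read_json(url, typ='series')['label']

pd.Series({f: field_label(f) for f in ['P0120001', 'P0120002']})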

Collect data on male population by age and county

For now I’m only going to look at males. This is probably a bad idea in general.

Start with the 2010 census

The male population is broken down into somewhat arbitrary age cohorts, each with its own field name. We want all of the fields from P0120003 through P0120025.

We’ll do some data munging to get the values into numeric format, and to take care of the labels and indices.

# Field codes P0120003 through P0120025 are the male age buckets in table P12.
fields = ['P01200%02i'%i for i in range(3,26)]
url = 'http://api.census.gov/data/2010/sf1?get=%s&for=county:*'%','.join(fields)
print(url)
pops2010 = pd.read_json(url)
pops2010.columns = pops2010.iloc[0]            # first row is the header
pops2010.drop(pops2010.index[0], inplace=True)
pops2010 = pops2010.applymap(float)            # values come back as strings
pops2010.set_index(['state', 'county'], inplace=True)
pops2010.head()
http://api.census.gov/data/2010/sf1?get=P0120003,P0120004,P0120005,P0120006,P0120007,P0120008,P0120009,P0120010,P0120011,P0120012,P0120013,P0120014,P0120015,P0120016,P0120017,P0120018,P0120019,P0120020,P0120021,P0120022,P0120023,P0120024,P0120025&for=county:*
P0120003 P0120004 P0120005 P0120006 P0120007 P0120008 P0120009 P0120010 P0120011 P0120012 ... P0120016 P0120017 P0120018 P0120019 P0120020 P0120021 P0120022 P0120023 P0120024 P0120025
state county
1 1 1866 2001 2171 1417 796 350 279 910 1543 1594 ... 1866 1524 494 785 418 596 807 546 295 159
3 5614 5832 6076 3704 2226 1013 862 2918 5183 5317 ... 6425 5943 2301 3427 2054 2841 3663 2644 1735 1176
5 847 826 820 559 360 190 192 666 1212 1162 ... 1000 910 358 501 280 351 436 303 195 129
7 712 759 771 513 293 122 167 522 987 1013 ... 847 734 294 390 188 268 347 232 138 73
9 1805 1936 2113 1340 799 340 294 943 1735 1730 ... 1972 1810 716 984 546 806 1039 684 418 234

5 rows × 23 columns
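
As a sanity check, these 23 buckets should add up to the male total in each county, which (if I'm reading the P12 variable listing right) is field P0120002:

males = pd.read_json('http://api.census.gov/data/2010/sf1?get=P0120002&for=county:*')
males.columns = males.iloc[0]
males.drop(males.index[0], inplace=True)
males = males.applymap(float)
males.set_index(['state', 'county'], inplace=True)
print((pops2010.sum(axis=1) == males['P0120002']).all())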

Get data from 2000

The 2000 census (logically) has different codes for its data, and (even more logically) breaks the cohorts down differently. In this case, we can get data for each age year with codes PCT012003 through PCT012104. The API limits us to 50 fields per request, so we'll fetch the data in chunks and stitch them together.

fields = ['PCT012%03i'%i for i in range(3,105)]

# Fetch the fields in chunks of 40 (under the 50-field limit) and
# stitch the pieces together on the (state, county) index.
dflist = []
chunkSize = 40
for i in range(0, len(fields), chunkSize):
    chunk = fields[i:i+chunkSize]
    url = 'http://api.census.gov/data/2000/sf1?get=%s&for=county:*'%','.join(chunk)
    print(url)
    df_chunk = pd.read_json(url)
    df_chunk.columns = df_chunk.iloc[0]            # first row is the header
    df_chunk.drop(df_chunk.index[0], inplace=True)
    df_chunk = df_chunk.applymap(float)            # values come back as strings
    df_chunk.set_index(['state', 'county'], inplace=True)
    dflist.append(df_chunk)

pops2000 = pd.concat(dflist, axis=1)
pops2000.head()
http://api.census.gov/data/2000/sf1?get=PCT012003,PCT012004,PCT012005,PCT012006,PCT012007,PCT012008,PCT012009,PCT012010,PCT012011,PCT012012,PCT012013,PCT012014,PCT012015,PCT012016,PCT012017,PCT012018,PCT012019,PCT012020,PCT012021,PCT012022,PCT012023,PCT012024,PCT012025,PCT012026,PCT012027,PCT012028,PCT012029,PCT012030,PCT012031,PCT012032,PCT012033,PCT012034,PCT012035,PCT012036,PCT012037,PCT012038,PCT012039,PCT012040,PCT012041,PCT012042&for=county:*
http://api.census.gov/data/2000/sf1?get=PCT012043,PCT012044,PCT012045,PCT012046,PCT012047,PCT012048,PCT012049,PCT012050,PCT012051,PCT012052,PCT012053,PCT012054,PCT012055,PCT012056,PCT012057,PCT012058,PCT012059,PCT012060,PCT012061,PCT012062,PCT012063,PCT012064,PCT012065,PCT012066,PCT012067,PCT012068,PCT012069,PCT012070,PCT012071,PCT012072,PCT012073,PCT012074,PCT012075,PCT012076,PCT012077,PCT012078,PCT012079,PCT012080,PCT012081,PCT012082&for=county:*
http://api.census.gov/data/2000/sf1?get=PCT012083,PCT012084,PCT012085,PCT012086,PCT012087,PCT012088,PCT012089,PCT012090,PCT012091,PCT012092,PCT012093,PCT012094,PCT012095,PCT012096,PCT012097,PCT012098,PCT012099,PCT012100,PCT012101,PCT012102,PCT012103,PCT012104&for=county:*
PCT012003 PCT012004 PCT012005 PCT012006 PCT012007 PCT012008 PCT012009 PCT012010 PCT012011 PCT012012 ... PCT012095 PCT012096 PCT012097 PCT012098 PCT012099 PCT012100 PCT012101 PCT012102 PCT012103 PCT012104
state county
1 1 264 305 293 331 309 364 342 374 382 411 ... 6 2 0 7 2 0 0 3 1 0
3 877 865 845 873 926 856 951 981 1031 1118 ... 31 28 14 13 2 5 6 4 5 0
5 185 184 196 173 191 223 187 236 193 234 ... 3 3 5 3 2 2 1 0 0 0
7 179 146 150 145 157 148 183 140 147 151 ... 6 2 2 2 3 2 0 0 0 0
9 344 347 374 394 376 377 400 361 402 366 ... 6 9 1 6 0 3 0 3 5 0

5 rows × 102 columns
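
Since pd.concat with axis=1 aligns the chunks on their indexes, it's worth confirming that every chunk came back with the same set of counties:

print(all(chunk.index.equals(dflist[0].index) for chunk in dflist))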

Align the datasets

As the two years use different cohorts, we need to do some summation before we can merge them into a single table. I'll break the data down into 10-year cohorts by picking out runs of columns to sum together. We'll set breakpoints using the numeric part of the field names, and label each new cohort by which decade of life it covers. The cohort names use 1-based indexing, so dec_1 covers ages 0–9.

pops2010d = pd.DataFrame(index=pops2010.index)

# Sum runs of the fine-grained 2010 buckets into nine 10-year cohorts.
decades = ['dec_%i'%i for i in range(1,10)]
breakpoints_2010 = [3, 5, 8, 12, 14, 16, 18, 22, 24, 26]
for dec, s, f in zip(decades, breakpoints_2010[:-1], breakpoints_2010[1:]):
    pops2010d[dec] = pops2010[['P0120%03i'%i for i in range(s,f)]].sum(axis=1)

pops2010d.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3867 4384 3082 3598 4148 3390 2293 1353 454
3 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 1673 1739 2260 2208 2233 1910 1490 739 324
7 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 4252 3312 3719 4129 3782 3052 1723 652

pops2000d = pd.DataFrame(index=pops2000.index)

# Same cohorts for 2000, summing runs of single-year-of-age columns.
# Note the final breakpoint is 105, not 104, so that the last fetched
# field, PCT012104, lands in dec_9 instead of being silently dropped.
decades = ['dec_%i'%i for i in range(1,10)]
breakpoints_2000 = [3, 13, 23, 33, 43, 53, 63, 73, 83, 105]
for dec, s, f in zip(decades, breakpoints_2000[:-1], breakpoints_2000[1:]):
    pops2000d[dec] = pops2000[['PCT012%03i'%i for i in range(s,f)]].sum(axis=1)

pops2000d.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284
3 9323 10094 7600 9725 10379 8519 6675 4711 1822
5 2002 2198 2412 2465 2178 1699 1026 689 301
7 1546 1460 1680 1762 1624 1237 774 475 187
9 3741 3615 3393 3901 3773 3007 2227 1269 550
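
Before merging the two years, let's double-check that the re-binning didn't lose anybody: the cohort sums should reproduce each raw table's row totals, so both of these should print True.

print((pops2010d.sum(axis=1) == pops2010.sum(axis=1)).all())
print((pops2000d.sum(axis=1) == pops2000.sum(axis=1)).all())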

Now that the data have been formatted in the same way, we’ll concatenate them. We also drop any rows that don’t show up in both datasets.

frame = pd.concat([pops2000d, pops2010d], keys=[2000, 2010], axis=1)
frame.dropna(inplace=True)   # keep only counties present in both censuses
frame.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284 3867 4384 3082 3598 4148 3390 2293 1353 454
3 9323 10094 7600 9725 10379 8519 6675 4711 1822 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 2002 2198 2412 2465 2178 1699 1026 689 301 1673 1739 2260 2208 2233 1910 1490 739 324
7 1546 1460 1680 1762 1624 1237 774 475 187 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 3615 3393 3901 3773 3007 2227 1269 550 3741 4252 3312 3719 4129 3782 3052 1723 652
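
With the census year as the outer level of the column index, pulling out a single year or comparing the two is a one-liner. For instance, the change in each cohort between the two censuses:

change = frame[2010] - frame[2000]
change.head()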

I’m happy with this format, so we’ll save it to CSV:

frame.to_csv('Males by decade and county.csv')

As our dataframe has a MultiIndex, we have to take care when re-importing from the CSV to get the index and header columns correct.

pd.read_csv('Males by decade and county.csv', header=[0,1], index_col=[0,1])
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284 3867 4384 3082 3598 4148 3390 2293 1353 454
3 9323 10094 7600 9725 10379 8519 6675 4711 1822 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 2002 2198 2412 2465 2178 1699 1026 689 301 1673 1739 2260 2208 2233 1910 1490 739 324
7 1546 1460 1680 1762 1624 1237 774 475 187 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 3615 3393 3901 3773 3007 2227 1269 550 3741 4252 3312 3719 4129 3782 3052 1723 652
11 840 885 1059 952 968 628 390 238 179 697 702 892 884 876 905 553 287 116
13 1497 1766 1076 1219 1496 1199 832 599 334 1440 1494 1112 1115 1188 1477 1120 600 292
15 7212 8032 7544 7613 8237 6544 4535 2846 1137 7586 8279 8185 7041 7723 8030 5842 3185 1305
17 2543 2540 2185 2437 2504 2088 1474 992 522 2062 2302 1905 1954 2325 2427 1907 1045 436
19 1552 1506 1424 1664 1726 1598 1322 717 285 1453 1729 1224 1504 1893 1999 1773 1002 311
21 2957 2904 2656 2872 2882 2317 1612 1002 379 3024 3137 2671 3015 2988 2970 2163 1181 459
23 1097 1164 839 956 1080 1010 710 430 203 875 931 673 755 914 1002 837 479 222
25 2130 2247 1496 1785 1904 1521 1120 654 323 1590 1975 1279 1371 1728 1751 1348 830 342
27 939 1025 854 992 983 844 664 432 217 850 982 716 815 988 972 787 493 230
29 965 1031 879 1061 1048 906 643 349 155 991 1062 827 893 1084 1043 916 466 171
31 2903 3246 2986 3006 3067 2628 1826 1189 452 3403 3448 3581 3279 3415 3065 2448 1430 615
33 3606 3773 3156 3740 3939 3310 2458 1632 697 3211 3654 3004 3076 3738 3837 2976 1778 885
35 1008 1069 762 838 914 865 579 458 174 801 965 668 634 833 994 855 423 239
37 813 872 814 1005 966 712 529 387 134 614 798 534 660 831 987 752 361 190
39 2403 2608 2068 2458 2643 2211 1809 1237 555 2431 2527 1984 2097 2452 2731 2119 1328 619
41 921 1026 766 807 944 826 549 431 193 909 1007 734 793 942 973 780 403 173
43 5230 5602 5008 5695 5596 4595 3351 2230 907 5177 5563 4825 5104 5649 5456 4369 2526 1050
45 3860 3656 3712 3650 3446 2659 1806 1162 405 3624 3417 3868 3224 3190 3259 2395 1308 540
47 3477 3901 2486 2587 3044 2317 1658 1121 501 3195 3298 2413 2124 2584 2902 2127 1129 472
49 4615 4491 4493 4697 4571 3768 2540 1663 670 5191 5134 4267 4732 4876 4574 3618 1925 796
51 4684 4939 4945 5530 5172 3825 2361 1336 548 5110 5556 5093 5410 5785 5262 3858 1890 725
53 2559 2757 2949 2955 3078 2224 1585 927 441 2337 2501 2582 3078 2895 2818 1928 1193 434
55 6842 7225 6406 6824 7328 6187 4167 3261 1266 6551 7323 6029 6317 7055 7222 5647 3039 1443
57 1199 1363 1071 1213 1310 1162 847 536 234 1057 1279 902 953 1187 1232 1017 615 265
59 2108 2303 2255 2221 2091 1758 1391 871 330 2266 2228 2109 2132 2130 2003 1565 978 417
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
55 129 911 1301 661 1017 1256 1107 891 615 312 879 907 655 791 1121 1332 1214 710 315
131 8537 9141 6305 9977 9934 6906 4025 2589 1194 8759 9196 6665 8111 11050 10028 6446 3288 1850
133 24815 27971 17718 26947 32031 22825 13003 8520 3653 24294 28104 19344 21650 30470 31667 19754 10176 5896
135 3451 4069 2595 3789 4098 2996 2080 1821 1000 3182 3531 2655 2964 3954 4179 2931 1726 1325
137 1373 1784 977 1510 1812 1456 1396 992 369 1250 1524 1282 1473 1977 2142 1633 1100 512
139 10095 11708 12046 12620 12521 8297 5210 3777 1875 10105 11085 13374 10920 12568 11968 7363 4065 2504
141 5065 5916 4018 5515 5946 4268 2874 2215 1213 4555 5065 4112 4095 5486 5714 3886 2366 1498
56 1 1666 2803 4997 1859 2005 1555 848 545 251 1890 2615 6309 2129 1645 1995 1380 603 331
3 860 1032 514 639 817 713 558 403 199 796 867 605 593 710 869 740 457 245
5 2728 3245 2306 2475 3515 1823 748 366 102 3896 3480 4008 3564 3295 3788 1528 497 202
7 985 1206 1027 1221 1523 1143 666 414 191 1150 999 1172 1143 1218 1313 912 434 212
9 900 1043 557 788 1118 772 441 286 102 994 995 796 852 972 1135 742 350 181
11 358 532 246 345 502 420 334 163 79 510 479 352 382 491 596 480 261 97
13 2539 3099 1872 2187 2880 2181 1511 1061 410 3032 2746 2529 2286 2490 3006 2168 1149 624
15 816 1022 674 716 974 779 576 442 235 735 921 858 766 888 1111 836 526 265
17 246 359 187 248 400 338 273 200 97 264 284 235 222 298 404 326 229 115
19 419 540 291 375 566 510 365 292 118 565 497 414 516 515 730 620 318 190
21 5759 6176 6018 6631 6274 4698 2751 1878 801 6631 6015 6869 5759 6306 6475 4473 2216 1131
23 1133 1391 642 873 1249 868 617 408 179 1487 1382 899 1257 1189 1410 984 485 209
25 4578 5435 4221 4335 5649 3824 2262 1950 618 5363 4945 5691 4987 4940 5877 3404 1677 1098
27 141 187 71 145 206 149 140 86 49 127 155 92 114 144 208 152 117 50
29 1508 2107 1287 1526 2154 1723 1130 765 362 1669 1844 1688 1453 1660 2319 1924 976 488
31 529 703 391 491 705 627 455 306 139 428 522 408 412 555 744 680 373 181
33 1549 2061 1402 1482 2249 1879 1116 844 419 1844 1840 1672 1691 1844 2485 1790 884 515
35 374 469 267 412 554 433 294 140 80 707 643 731 845 830 898 555 238 103
37 2770 3508 2326 2523 3628 2314 1096 598 263 3626 3189 3583 3212 2935 3421 1830 730 323
39 982 1112 2006 1731 1763 1222 582 250 85 1216 1018 1979 2079 1576 1623 1097 442 161
41 1701 2131 1121 1294 1842 1068 456 321 120 1821 1724 1288 1408 1300 1682 898 363 178
43 517 781 358 496 661 537 360 287 135 598 610 417 470 527 637 516 297 183
45 359 551 347 425 627 434 301 230 99 450 448 472 481 534 652 398 208 147

3137 rows × 18 columns
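
As a final check, the round trip through CSV preserves the values; note that the year level of the columns comes back as strings ('2000', '2010'), so it's easiest to compare the underlying arrays:

frame2 = pd.read_csv('Males by decade and county.csv', header=[0,1], index_col=[0,1])
print((frame2.values == frame.values).all())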