Collecting US decennial census data

In this notebook, we’ll collect demographic data from the US decennial census, by county.

The census website has an API, which is good, because everything else about the census website is close to unusable. The API is described here: http://www.census.gov/data/developers/data-sets/decennial-census-data.html

As a quick demonstration, we can use the API to get population data for every county in the US:

import pandas as pd

# The API returns a JSON array of rows, with the column names as the
# first row; promote that row to the header and drop it from the data.
df = pd.read_json('http://api.census.gov/data/2010/sf1?get=P0120001&for=county:*')
df.columns = df.iloc[0]
df.drop(df.index[0], inplace=True)
df.head()
P0120001 state county
1 54571 01 001
2 182265 01 003
3 27457 01 005
4 22915 01 007
5 57322 01 009
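
One practical note: the Census API hands out free API keys and may require one for heavier use; if so, it just gets appended to the query string, along these lines (the placeholder key is yours to fill in):

API_KEY = 'your-key-here'  # free signup on the census developers site
url = 'http://api.census.gov/data/2010/sf1?get=P0120001&for=county:*&key=%s' % API_KEY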

The census code descriptions can also be accessed via the API. A listing of the field names is available here: http://api.census.gov/data/2010/sf1/variables.html

pd.read_json('http://api.census.gov/data/2010/sf1/variables/P0120001.json', typ='series')
concept    P12. Sex By Age [49]
label          Total population
name                   P0120001
dtype: object
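
To look up labels for several fields at once, we can wrap that endpoint in a loop; here's a small convenience helper of my own (not part of the census API itself):

def field_label(field, year=2010, dataset='sf1'):
    # Fetch the metadata record for one field and return its human-readable label.
    url = 'http://api.census.gov/data/%i/%s/variables/%s.json' % (year, dataset, field)
    return pd.read_json(url, typ='series')['label']

pd.Series({f: field_label(f) for f in ['P0120001', 'P0120002']})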

Collect data on male population by age and county

For now I’m only going to look at males. This is probably a bad idea in general.

Start with the 2010 census

The male population is broken down into somewhat arbitrary age cohorts, each with its own field name. We want all of the fields from P0120003 through P0120025.

We’ll do some data munging to get the values into numeric format, and to take care of the labels and indices.

# Field codes P0120003 through P0120025 are the male age buckets in table P12.
fields = ['P01200%02i'%i for i in range(3,26)]
url = 'http://api.census.gov/data/2010/sf1?get=%s&for=county:*'%','.join(fields)
print(url)
pops2010 = pd.read_json(url)
pops2010.columns = pops2010.iloc[0]            # first row is the header
pops2010.drop(pops2010.index[0], inplace=True)
pops2010 = pops2010.applymap(float)            # values come back as strings
pops2010.set_index(['state', 'county'], inplace=True)
pops2010.head()
http://api.census.gov/data/2010/sf1?get=P0120003,P0120004,P0120005,P0120006,P0120007,P0120008,P0120009,P0120010,P0120011,P0120012,P0120013,P0120014,P0120015,P0120016,P0120017,P0120018,P0120019,P0120020,P0120021,P0120022,P0120023,P0120024,P0120025&for=county:*
P0120003 P0120004 P0120005 P0120006 P0120007 P0120008 P0120009 P0120010 P0120011 P0120012 ... P0120016 P0120017 P0120018 P0120019 P0120020 P0120021 P0120022 P0120023 P0120024 P0120025
state county
1 1 1866 2001 2171 1417 796 350 279 910 1543 1594 ... 1866 1524 494 785 418 596 807 546 295 159
3 5614 5832 6076 3704 2226 1013 862 2918 5183 5317 ... 6425 5943 2301 3427 2054 2841 3663 2644 1735 1176
5 847 826 820 559 360 190 192 666 1212 1162 ... 1000 910 358 501 280 351 436 303 195 129
7 712 759 771 513 293 122 167 522 987 1013 ... 847 734 294 390 188 268 347 232 138 73
9 1805 1936 2113 1340 799 340 294 943 1735 1730 ... 1972 1810 716 984 546 806 1039 684 418 234

5 rows × 23 columns
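
As a sanity check, these 23 buckets should add up to the male total in each county, which (if I'm reading the P12 variable listing right) is field P0120002:

males = pd.read_json('http://api.census.gov/data/2010/sf1?get=P0120002&for=county:*')
males.columns = males.iloc[0]
males.drop(males.index[0], inplace=True)
males = males.applymap(float)
males.set_index(['state', 'county'], inplace=True)
print((pops2010.sum(axis=1) == males['P0120002']).all())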

Get data from 2000

The 2000 census (logically) has different codes for its data, and (even more logically) breaks the cohorts down differently. In this case, we can get data for each age year with codes PCT012003 through PCT012104. The API limits us to 50 fields per request, so we'll fetch the data in chunks and stitch them together.

fields = ['PCT012%03i'%i for i in range(3,105)]

# Fetch the fields in chunks of 40 (under the 50-field limit) and
# stitch the pieces together on the (state, county) index.
dflist = []
chunkSize = 40
for i in range(0, len(fields), chunkSize):
    chunk = fields[i:i+chunkSize]
    url = 'http://api.census.gov/data/2000/sf1?get=%s&for=county:*'%','.join(chunk)
    print(url)
    df_chunk = pd.read_json(url)
    df_chunk.columns = df_chunk.iloc[0]            # first row is the header
    df_chunk.drop(df_chunk.index[0], inplace=True)
    df_chunk = df_chunk.applymap(float)            # values come back as strings
    df_chunk.set_index(['state', 'county'], inplace=True)
    dflist.append(df_chunk)

pops2000 = pd.concat(dflist, axis=1)
pops2000.head()
http://api.census.gov/data/2000/sf1?get=PCT012003,PCT012004,PCT012005,PCT012006,PCT012007,PCT012008,PCT012009,PCT012010,PCT012011,PCT012012,PCT012013,PCT012014,PCT012015,PCT012016,PCT012017,PCT012018,PCT012019,PCT012020,PCT012021,PCT012022,PCT012023,PCT012024,PCT012025,PCT012026,PCT012027,PCT012028,PCT012029,PCT012030,PCT012031,PCT012032,PCT012033,PCT012034,PCT012035,PCT012036,PCT012037,PCT012038,PCT012039,PCT012040,PCT012041,PCT012042&for=county:*
http://api.census.gov/data/2000/sf1?get=PCT012043,PCT012044,PCT012045,PCT012046,PCT012047,PCT012048,PCT012049,PCT012050,PCT012051,PCT012052,PCT012053,PCT012054,PCT012055,PCT012056,PCT012057,PCT012058,PCT012059,PCT012060,PCT012061,PCT012062,PCT012063,PCT012064,PCT012065,PCT012066,PCT012067,PCT012068,PCT012069,PCT012070,PCT012071,PCT012072,PCT012073,PCT012074,PCT012075,PCT012076,PCT012077,PCT012078,PCT012079,PCT012080,PCT012081,PCT012082&for=county:*
http://api.census.gov/data/2000/sf1?get=PCT012083,PCT012084,PCT012085,PCT012086,PCT012087,PCT012088,PCT012089,PCT012090,PCT012091,PCT012092,PCT012093,PCT012094,PCT012095,PCT012096,PCT012097,PCT012098,PCT012099,PCT012100,PCT012101,PCT012102,PCT012103,PCT012104&for=county:*
PCT012003 PCT012004 PCT012005 PCT012006 PCT012007 PCT012008 PCT012009 PCT012010 PCT012011 PCT012012 ... PCT012095 PCT012096 PCT012097 PCT012098 PCT012099 PCT012100 PCT012101 PCT012102 PCT012103 PCT012104
state county
1 1 264 305 293 331 309 364 342 374 382 411 ... 6 2 0 7 2 0 0 3 1 0
3 877 865 845 873 926 856 951 981 1031 1118 ... 31 28 14 13 2 5 6 4 5 0
5 185 184 196 173 191 223 187 236 193 234 ... 3 3 5 3 2 2 1 0 0 0
7 179 146 150 145 157 148 183 140 147 151 ... 6 2 2 2 3 2 0 0 0 0
9 344 347 374 394 376 377 400 361 402 366 ... 6 9 1 6 0 3 0 3 5 0

5 rows × 102 columns
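
Since pd.concat with axis=1 aligns the chunks on their indexes, it's worth confirming that every chunk came back with the same set of counties:

print(all(chunk.index.equals(dflist[0].index) for chunk in dflist))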

Align the datasets

As the two years use different cohorts, we need to do some summation before we can merge them into a single table. I'll break the data down into 10-year cohorts by picking out runs of columns to sum together. We'll set breakpoints using the numeric part of the field names, and label each new cohort by which decade of life it covers. The cohort names use 1-based indexing, so dec_1 covers ages 0–9.

pops2010d = pd.DataFrame(index=pops2010.index)

# Sum runs of the fine-grained 2010 buckets into nine 10-year cohorts.
decades = ['dec_%i'%i for i in range(1,10)]
breakpoints_2010 = [3, 5, 8, 12, 14, 16, 18, 22, 24, 26]
for dec, s, f in zip(decades, breakpoints_2010[:-1], breakpoints_2010[1:]):
    pops2010d[dec] = pops2010[['P0120%03i'%i for i in range(s,f)]].sum(axis=1)

pops2010d.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3867 4384 3082 3598 4148 3390 2293 1353 454
3 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 1673 1739 2260 2208 2233 1910 1490 739 324
7 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 4252 3312 3719 4129 3782 3052 1723 652

pops2000d = pd.DataFrame(index=pops2000.index)

# Same cohorts for 2000, summing runs of single-year-of-age columns.
# Note the final breakpoint is 105, not 104, so that the last fetched
# field, PCT012104, lands in dec_9 instead of being silently dropped.
decades = ['dec_%i'%i for i in range(1,10)]
breakpoints_2000 = [3, 13, 23, 33, 43, 53, 63, 73, 83, 105]
for dec, s, f in zip(decades, breakpoints_2000[:-1], breakpoints_2000[1:]):
    pops2000d[dec] = pops2000[['PCT012%03i'%i for i in range(s,f)]].sum(axis=1)

pops2000d.head()
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284
3 9323 10094 7600 9725 10379 8519 6675 4711 1822
5 2002 2198 2412 2465 2178 1699 1026 689 301
7 1546 1460 1680 1762 1624 1237 774 475 187
9 3741 3615 3393 3901 3773 3007 2227 1269 550
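
Before merging the two years, let's double-check that the re-binning didn't lose anybody: the cohort sums should reproduce each raw table's row totals, so both of these should print True.

print((pops2010d.sum(axis=1) == pops2010.sum(axis=1)).all())
print((pops2000d.sum(axis=1) == pops2000.sum(axis=1)).all())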

Now that the data have been formatted in the same way, we’ll concatenate them. We also drop any rows that don’t show up in both datasets.

frame = pd.concat([pops2000d, pops2010d], keys=[2000, 2010], axis=1)
frame.dropna(inplace=True)   # keep only counties present in both censuses
frame.head()
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284 3867 4384 3082 3598 4148 3390 2293 1353 454
3 9323 10094 7600 9725 10379 8519 6675 4711 1822 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 2002 2198 2412 2465 2178 1699 1026 689 301 1673 1739 2260 2208 2233 1910 1490 739 324
7 1546 1460 1680 1762 1624 1237 774 475 187 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 3615 3393 3901 3773 3007 2227 1269 550 3741 4252 3312 3719 4129 3782 3052 1723 652
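
With the census year as the outer level of the column index, pulling out a single year or comparing the two is a one-liner. For instance, the change in each cohort between the two censuses:

change = frame[2010] - frame[2000]
change.head()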

I’m happy with this format, so we’ll save it to CSV:

frame.to_csv('Males by decade and county.csv')

As our dataframe has a MultiIndex, we have to take care when re-importing from the CSV to get the index and header columns correct.

pd.read_csv('Males by decade and county.csv', header=[0,1], index_col=[0,1])
2000 2010
dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9 dec_1 dec_2 dec_3 dec_4 dec_5 dec_6 dec_7 dec_8 dec_9
state county
1 1 3375 3630 2461 3407 3283 2319 1637 825 284 3867 4384 3082 3598 4148 3390 2293 1353 454
3 9323 10094 7600 9725 10379 8519 6675 4711 1822 11446 12006 9976 11042 12517 12368 10623 6307 2911
5 2002 2198 2412 2465 2178 1699 1026 689 301 1673 1739 2260 2208 2233 1910 1490 739 324
7 1546 1460 1680 1762 1624 1237 774 475 187 1471 1577 1798 2016 1928 1581 1140 579 211
9 3741 3615 3393 3901 3773 3007 2227 1269 550 3741 4252 3312 3719 4129 3782 3052 1723 652
11 840 885 1059 952 968 628 390 238 179 697 702 892 884 876 905 553 287 116
13 1497 1766 1076 1219 1496 1199 832 599 334 1440 1494 1112 1115 1188 1477 1120 600 292
15 7212 8032 7544 7613 8237 6544 4535 2846 1137 7586 8279 8185 7041 7723 8030 5842 3185 1305
17 2543 2540 2185 2437 2504 2088 1474 992 522 2062 2302 1905 1954 2325 2427 1907 1045 436
19 1552 1506 1424 1664 1726 1598 1322 717 285 1453 1729 1224 1504 1893 1999 1773 1002 311
21 2957 2904 2656 2872 2882 2317 1612 1002 379 3024 3137 2671 3015 2988 2970 2163 1181 459
23 1097 1164 839 956 1080 1010 710 430 203 875 931 673 755 914 1002 837 479 222
25 2130 2247 1496 1785 1904 1521 1120 654 323 1590 1975 1279 1371 1728 1751 1348 830 342
27 939 1025 854 992 983 844 664 432 217 850 982 716 815 988 972 787 493 230
29 965 1031 879 1061 1048 906 643 349 155 991 1062 827 893 1084 1043 916 466 171
31 2903 3246 2986 3006 3067 2628 1826 1189 452 3403 3448 3581 3279 3415 3065 2448 1430 615
33 3606 3773 3156 3740 3939 3310 2458 1632 697 3211 3654 3004 3076 3738 3837 2976 1778 885
35 1008 1069 762 838 914 865 579 458 174 801 965 668 634 833 994 855 423 239
37 813 872 814 1005 966 712 529 387 134 614 798 534 660 831 987 752 361 190
39 2403 2608 2068 2458 2643 2211 1809 1237 555 2431 2527 1984 2097 2452 2731 2119 1328 619
41 921 1026 766 807 944 826 549 431 193 909 1007 734 793 942 973 780 403 173
43 5230 5602 5008 5695 5596 4595 3351 2230 907 5177 5563 4825 5104 5649 5456 4369 2526 1050
45 3860 3656 3712 3650 3446 2659 1806 1162 405 3624 3417 3868 3224 3190 3259 2395 1308 540
47 3477 3901 2486 2587 3044 2317 1658 1121 501 3195 3298 2413 2124 2584 2902 2127 1129 472
49 4615 4491 4493 4697 4571 3768 2540 1663 670 5191 5134 4267 4732 4876 4574 3618 1925 796
51 4684 4939 4945 5530 5172 3825 2361 1336 548 5110 5556 5093 5410 5785 5262 3858 1890 725
53 2559 2757 2949 2955 3078 2224 1585 927 441 2337 2501 2582 3078 2895 2818 1928 1193 434
55 6842 7225 6406 6824 7328 6187 4167 3261 1266 6551 7323 6029 6317 7055 7222 5647 3039 1443
57 1199 1363 1071 1213 1310 1162 847 536 234 1057 1279 902 953 1187 1232 1017 615 265
59 2108 2303 2255 2221 2091 1758 1391 871 330 2266 2228 2109 2132 2130 2003 1565 978 417
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
55 129 911 1301 661 1017 1256 1107 891 615 312 879 907 655 791 1121 1332 1214 710 315
131 8537 9141 6305 9977 9934 6906 4025 2589 1194 8759 9196 6665 8111 11050 10028 6446 3288 1850
133 24815 27971 17718 26947 32031 22825 13003 8520 3653 24294 28104 19344 21650 30470 31667 19754 10176 5896
135 3451 4069 2595 3789 4098 2996 2080 1821 1000 3182 3531 2655 2964 3954 4179 2931 1726 1325
137 1373 1784 977 1510 1812 1456 1396 992 369 1250 1524 1282 1473 1977 2142 1633 1100 512
139 10095 11708 12046 12620 12521 8297 5210 3777 1875 10105 11085 13374 10920 12568 11968 7363 4065 2504
141 5065 5916 4018 5515 5946 4268 2874 2215 1213 4555 5065 4112 4095 5486 5714 3886 2366 1498
56 1 1666 2803 4997 1859 2005 1555 848 545 251 1890 2615 6309 2129 1645 1995 1380 603 331
3 860 1032 514 639 817 713 558 403 199 796 867 605 593 710 869 740 457 245
5 2728 3245 2306 2475 3515 1823 748 366 102 3896 3480 4008 3564 3295 3788 1528 497 202
7 985 1206 1027 1221 1523 1143 666 414 191 1150 999 1172 1143 1218 1313 912 434 212
9 900 1043 557 788 1118 772 441 286 102 994 995 796 852 972 1135 742 350 181
11 358 532 246 345 502 420 334 163 79 510 479 352 382 491 596 480 261 97
13 2539 3099 1872 2187 2880 2181 1511 1061 410 3032 2746 2529 2286 2490 3006 2168 1149 624
15 816 1022 674 716 974 779 576 442 235 735 921 858 766 888 1111 836 526 265
17 246 359 187 248 400 338 273 200 97 264 284 235 222 298 404 326 229 115
19 419 540 291 375 566 510 365 292 118 565 497 414 516 515 730 620 318 190
21 5759 6176 6018 6631 6274 4698 2751 1878 801 6631 6015 6869 5759 6306 6475 4473 2216 1131
23 1133 1391 642 873 1249 868 617 408 179 1487 1382 899 1257 1189 1410 984 485 209
25 4578 5435 4221 4335 5649 3824 2262 1950 618 5363 4945 5691 4987 4940 5877 3404 1677 1098
27 141 187 71 145 206 149 140 86 49 127 155 92 114 144 208 152 117 50
29 1508 2107 1287 1526 2154 1723 1130 765 362 1669 1844 1688 1453 1660 2319 1924 976 488
31 529 703 391 491 705 627 455 306 139 428 522 408 412 555 744 680 373 181
33 1549 2061 1402 1482 2249 1879 1116 844 419 1844 1840 1672 1691 1844 2485 1790 884 515
35 374 469 267 412 554 433 294 140 80 707 643 731 845 830 898 555 238 103
37 2770 3508 2326 2523 3628 2314 1096 598 263 3626 3189 3583 3212 2935 3421 1830 730 323
39 982 1112 2006 1731 1763 1222 582 250 85 1216 1018 1979 2079 1576 1623 1097 442 161
41 1701 2131 1121 1294 1842 1068 456 321 120 1821 1724 1288 1408 1300 1682 898 363 178
43 517 781 358 496 661 537 360 287 135 598 610 417 470 527 637 516 297 183
45 359 551 347 425 627 434 301 230 99 450 448 472 481 534 652 398 208 147

3137 rows × 18 columns
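
As a final check, the round trip through CSV preserves the values; note that the year level of the columns comes back as strings ('2000', '2010'), so it's easiest to compare the underlying arrays:

frame2 = pd.read_csv('Males by decade and county.csv', header=[0,1], index_col=[0,1])
print((frame2.values == frame.values).all())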