Thursday, May 23, 2013

Census data with Bash

For fun I played with the 2012 US census data on metropolitan areas. I wanted to answer a simple questions, which "big" cities are growing the fastest?

First a note on file format. If you are splitting this line on comma, you want 3 fields:

apple,"banana,split",coconut

But it is easy to get 4 fields instead. My quick fix was to change comma to colon for field separators with this bit of python in a file named "csv-comma-to-colon.py":

import csv
import fileinput

for row in csv.reader(fileinput.input()):
    print(':'.join(row))

Simple, effective. An alternative I did not explore enough was csvprintf.

Once I have colon field separators, the rest writes itself. Seeing the CSV file from the census has non-data header and footer lines, I filter and sort:

sed '/^[^1-9]/d' CBSA-EST2012-05.csv \
    | python ./csv-comma-to-colon.py \
    | sort -t: -k4 -nr \
    | head -50 \
    | sort -t: -k6 -nr \
    | head -10 \
    | cut -d: -f2,4,6

Breaking it down:

  1. Skip non-data lines from the CSV file.
  2. Convert comma field separators to colon.
  3. Reorder the data by city size, descending.
  4. Keep only the top 50 biggest cities.
  5. Reorder the remaining data by growth rate.
  6. Keep only the 10 fastest growing cities.
  7. Display only these columns: city name, current population, growth rate.

There may be better ways, this was enough for some quick fun. I would be remiss not to show the results nicely formatted for the web:

Metropolitan Area Population Growth
Austin-Round Rock, TX 1,834,303 3.0
Raleigh, NC 1,188,564 2.2
Orlando-Kissimmee-Sanford, FL 2,223,674 2.2
Houston-The Woodlands-Sugar Land, TX 6,177,035 2.1
Dallas-Fort Worth-Arlington, TX 6,700,991 2.0
San Antonio-New Braunfels, TX 2,234,003 1.9
Phoenix-Mesa-Scottsdale, AZ 4,329,534 1.8
Denver-Aurora-Lakewood, CO 2,645,209 1.8
Nashville-Davidson--Murfreesboro--Franklin, TN 1,726,693 1.7
Las Vegas-Henderson-Paradise, NV 2,000,759 1.7

No comments: