For fun I played with the 2012 US census data on metropolitan areas. I wanted to answer a simple questions, which "big" cities are growing the fastest?
First a note on file format. If you are splitting this line on comma, you want 3 fields:
apple,"banana,split",coconut
But it is easy to get 4 fields instead. My quick fix was to change comma to colon for field separators with this bit of python in a file named "csv-comma-to-colon.py":
import csv import fileinput for row in csv.reader(fileinput.input()): print(':'.join(row))
Simple, effective. An alternative I did not explore enough was csvprintf
.
Once I have colon field separators, the rest writes itself. Seeing the CSV file from the census has non-data header and footer lines, I filter and sort:
sed '/^[^1-9]/d' CBSA-EST2012-05.csv \ | python ./csv-comma-to-colon.py \ | sort -t: -k4 -nr \ | head -50 \ | sort -t: -k6 -nr \ | head -10 \ | cut -d: -f2,4,6
Breaking it down:
- Skip non-data lines from the CSV file.
- Convert comma field separators to colon.
- Reorder the data by city size, descending.
- Keep only the top 50 biggest cities.
- Reorder the remaining data by growth rate.
- Keep only the 10 fastest growing cities.
- Display only these columns: city name, current population, growth rate.
There may be better ways, this was enough for some quick fun. I would be remiss not to show the results nicely formatted for the web:
Metropolitan Area | Population | Growth |
---|---|---|
Austin-Round Rock, TX | 1,834,303 | 3.0 |
Raleigh, NC | 1,188,564 | 2.2 |
Orlando-Kissimmee-Sanford, FL | 2,223,674 | 2.2 |
Houston-The Woodlands-Sugar Land, TX | 6,177,035 | 2.1 |
Dallas-Fort Worth-Arlington, TX | 6,700,991 | 2.0 |
San Antonio-New Braunfels, TX | 2,234,003 | 1.9 |
Phoenix-Mesa-Scottsdale, AZ | 4,329,534 | 1.8 |
Denver-Aurora-Lakewood, CO | 2,645,209 | 1.8 |
Nashville-Davidson--Murfreesboro--Franklin, TN | 1,726,693 | 1.7 |
Las Vegas-Henderson-Paradise, NV | 2,000,759 | 1.7 |
No comments:
Post a Comment