For fun I played with the 2012 US census data on metropolitan areas. I wanted to answer a simple questions, which "big" cities are growing the fastest?
First a note on file format. If you are splitting this line on comma, you want 3 fields:
But it is easy to get 4 fields instead. My quick fix was to change comma to colon for field separators with this bit of python in a file named "csv-comma-to-colon.py":
import csv import fileinput for row in csv.reader(fileinput.input()): print(':'.join(row))
Simple, effective. An alternative I did not explore enough was
Once I have colon field separators, the rest writes itself. Seeing the CSV file from the census has non-data header and footer lines, I filter and sort:
sed '/^[^1-9]/d' CBSA-EST2012-05.csv \ | python ./csv-comma-to-colon.py \ | sort -t: -k4 -nr \ | head -50 \ | sort -t: -k6 -nr \ | head -10 \ | cut -d: -f2,4,6
Breaking it down:
- Skip non-data lines from the CSV file.
- Convert comma field separators to colon.
- Reorder the data by city size, descending.
- Keep only the top 50 biggest cities.
- Reorder the remaining data by growth rate.
- Keep only the 10 fastest growing cities.
- Display only these columns: city name, current population, growth rate.
There may be better ways, this was enough for some quick fun. I would be remiss not to show the results nicely formatted for the web:
|Austin-Round Rock, TX||1,834,303||3.0|
|Houston-The Woodlands-Sugar Land, TX||6,177,035||2.1|
|Dallas-Fort Worth-Arlington, TX||6,700,991||2.0|
|San Antonio-New Braunfels, TX||2,234,003||1.9|
|Las Vegas-Henderson-Paradise, NV||2,000,759||1.7|