For fun I played with the 2012 US census data on metropolitan areas. I wanted to answer a simple questions, which "big" cities are growing the fastest?
First a note on file format. If you are splitting this line on comma, you want 3 fields:
apple,"banana,split",coconut
But it is easy to get 4 fields instead. My quick fix was to change comma to colon for field separators with this bit of python in a file named "csv-comma-to-colon.py":
import csv
import fileinput
for row in csv.reader(fileinput.input()):
print(':'.join(row)) Simple, effective. An alternative I did not explore enough was csvprintf.
Once I have colon field separators, the rest writes itself. Seeing the CSV file from the census has non-data header and footer lines, I filter and sort:
sed '/^[^1-9]/d' CBSA-EST2012-05.csv \
| python ./csv-comma-to-colon.py \
| sort -t: -k4 -nr \
| head -50 \
| sort -t: -k6 -nr \
| head -10 \
| cut -d: -f2,4,6 Breaking it down:
- Skip non-data lines from the CSV file.
- Convert comma field separators to colon.
- Reorder the data by city size, descending.
- Keep only the top 50 biggest cities.
- Reorder the remaining data by growth rate.
- Keep only the 10 fastest growing cities.
- Display only these columns: city name, current population, growth rate.
There may be better ways, this was enough for some quick fun. I would be remiss not to show the results nicely formatted for the web:
| Metropolitan Area | Population | Growth |
|---|---|---|
| Austin-Round Rock, TX | 1,834,303 | 3.0 |
| Raleigh, NC | 1,188,564 | 2.2 |
| Orlando-Kissimmee-Sanford, FL | 2,223,674 | 2.2 |
| Houston-The Woodlands-Sugar Land, TX | 6,177,035 | 2.1 |
| Dallas-Fort Worth-Arlington, TX | 6,700,991 | 2.0 |
| San Antonio-New Braunfels, TX | 2,234,003 | 1.9 |
| Phoenix-Mesa-Scottsdale, AZ | 4,329,534 | 1.8 |
| Denver-Aurora-Lakewood, CO | 2,645,209 | 1.8 |
| Nashville-Davidson--Murfreesboro--Franklin, TN | 1,726,693 | 1.7 |
| Las Vegas-Henderson-Paradise, NV | 2,000,759 | 1.7 |
No comments:
Post a Comment