Wrangle comma-separated files with intercalated headers and varied column counts—with pandas and HURDAT2
The National Hurricane Center data archive hosts hundreds of years of storm data. In search of a new dataset to explore, I pulled the Atlantic hurricane database (HURDAT2) and for a moment felt hopeless 😬. I wouldn’t call this file format cruel, but it is unusual, haha.
Our file is comma-separated, but not really: it alternates between single header rows and one or more data rows. Because these intercalated header rows have a different comma count than the data rows, pandas.read_csv
returns an unusable dataframe. We’ll fix this with the following approach:
- Ingest the file as a single column of strings
- Separate header rows from data rows
- Reorganize the data as arrays
- Join the header rows into the data rows
- concat everything into a single data frame 😅
- Save it as Parquet for speed and compactness
- Collect the results into a Quilt data package so that others can reuse our work and start with a clean dataframe
To follow along, you can clone the parse-hurdat2 repository, which includes all of the code in this article (and more) in a Jupyter notebook.
git clone https://github.com/quiltdata/parse-hurdat2.git
You will need pandas and quilt3:
pip install quilt3[pyarrow] pandas
If you just want to get the cleaned dataset (including data sources):
quilt3 install examples/hurdat2 --registry s3://quilt-example --dest ../data
And if you just want to pull the cleaned dataframe into memory:
import quilt3 as q3
hurdat2 = q3.Package.browse("examples/hurdat2", registry="s3://quilt-example")
# boom, dataframe
hurdat2['Atlantic-HURDAT2.parquet']()
Ingest and clean HURDAT2
One column of strings
The first trick is to find a character that’s NOT in our file, so pandas never actually splits a line on it. We check with grep; if it prints nothing, "%" is safe to use:
grep "%" external/hurdat2-1851-2020-052921.txt
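If you’d rather stay in Python, the same check is a simple membership test. Here is a minimal sketch; `find_unused_char` and its candidate list are mine, and the two fabricated HURDAT2-style lines stand in for the real file:

```python
def find_unused_char(text, candidates="%|;~^"):
    """Return the first candidate character absent from text: safe to
    use as a read_csv 'separator' that never actually splits a line."""
    for c in candidates:
        if c not in text:
            return c
    raise ValueError("no unused candidate character found")

# Two fabricated lines in HURDAT2's shape: a header row and a data row.
sample = "AL011851,            UNNAMED,     14,\n18510625, 0000,  , HU, 28.0N\n"
find_unused_char(sample)  # returns '%'
```

On the real file you would pass the whole text read from disk instead of `sample`.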
Now we do a read_csv with our non-separator:
df = pd.read_csv(
"../data/external/hurdat2-1851-2020-052921.txt",
sep="%",
header=None
)
Now we’ve got a dataframe, but it’s a not-so-useful monolith of header rows and data rows.
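To see what this buys us, here is a self-contained sketch on two fabricated HURDAT2-style lines (the real file has tens of thousands). Because "%" never occurs, each file line lands intact in a single string cell:

```python
import io
import pandas as pd

# Fabricated sample: one header row (3 commas), one data row (20 commas).
raw = (
    "AL011851,            UNNAMED,      1,\n"
    "18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, -999, -999, -999,"
    " -999, -999, -999, -999, -999, -999, -999, -999, -999, -999\n"
)

# sep='%' never matches, so every line becomes one field in one column.
df = pd.read_csv(io.StringIO(raw), sep="%", header=None)
df.shape  # (2, 1): two rows, one string column
```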
Check how many row types we have
Are there really just headers and data rows in here? Are they consistently formatted?
widths = set()
for s in df[0]:
    widths.add(len(s.split(',')))
Sweet, there are only two types of rows: headers (with 3 commas) and data rows (with 20 commas). We know this because widths is the set {4, 21} — splitting on commas yields one more token than there are commas.
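The same check reads nicely as a set comprehension. A sketch on a two-line fabricated sample, where the synthetic data row is just 21 joined tokens:

```python
lines = [
    "AL011851,            UNNAMED,      1,",  # header: 3 commas, 4 tokens
    ",".join(["x"] * 21),                     # data row: 20 commas, 21 tokens
]
# One pass over the column of raw lines; on the real data use df[0].
widths = {len(s.split(",")) for s in lines}
widths  # {4, 21}
```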
Split dataframe into a dict of arrays
We want to get to a single table with consistent columns, so we’re going to use the header rows as keys and a list of lists as the values:
storms = {}
header = None
for s in df[0]:
    tokens = s.split(',')
    assert len(tokens) in (4, 21)
    if len(tokens) == 4:
        header = s
        assert header not in storms
        storms[header] = []
    else:
        assert header
        row = [t.strip() for t in tokens]
        storms[header].append(row)
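Run against a tiny fabricated sample, the loop produces one key per storm header, each mapping to a list of stripped 21-token rows — a sketch (the two lines below are invented, not real HURDAT2 records):

```python
lines = [
    "AL011851,            UNNAMED,      1,",
    "18510625, 0000,  , HU, " + ", ".join(["-999"] * 17),
]
storms, header = {}, None
for s in lines:
    tokens = s.split(",")
    if len(tokens) == 4:       # header row: start a new storm
        header = s
        storms[header] = []
    else:                      # data row: attach to the current storm
        storms[header].append([t.strip() for t in tokens])

len(storms)  # 1 storm, holding one 21-token track entry
```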
Integrate headers as columns in data rows
We read a bit of the data docs to figure out what’s what:
frames = []
for k in storms:
    code, name, entries, _ = [t.strip() for t in k.split(',')]
    columns = {
        'Basin': code[:2],
        'ATCF Cyclone Number': code[2:5],
        'HYear': code[5:9],
        'Name': name,
        'Num. best track entries': entries
    }
    table = pd.DataFrame(storms[k], columns=[
        'YYYY-MM-DD',
        'TimeUTC',
        'Record identifier',
        'Status of system',
        'Latitude',
        'Longitude',
        'Max. sustained wind (knots)',
        'Min. pressure (millibars)',
        '34 kt wind max. NE (nautical miles)',
        '34 kt wind max. SE (nautical miles)',
        '34 kt wind max. SW (nautical miles)',
        '34 kt wind max. NW (nautical miles)',
        '50 kt wind max. NE (nautical miles)',
        '50 kt wind max. SE (nautical miles)',
        '50 kt wind max. SW (nautical miles)',
        '50 kt wind max. NW (nautical miles)',
        '64 kt wind max. NE (nautical miles)',
        '64 kt wind max. SE (nautical miles)',
        '64 kt wind max. SW (nautical miles)',
        '64 kt wind max. NW (nautical miles)',
        'DROPME'
    ])
    # add header columns
    for c in columns:
        table[c] = columns[c]
    frames.append(table)

# concat arrays of arrays in ONE shot, making ONE dataframe = muuuch faster
master = pd.concat(frames)
master
Boom, we’ve got HURDAT2 as a proper dataframe with nice, organized columns. Every row has the same structure and we can get to work.
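One detail worth knowing about the concat step: by default pd.concat keeps each frame’s own index, so the combined index repeats from zero for every storm. Pass ignore_index=True if you want a fresh RangeIndex instead. A sketch with toy frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})

# Default: per-frame indexes are kept, giving 0, 1, 0.
kept = pd.concat([a, b])
# ignore_index=True: a clean 0, 1, 2 index over the combined rows.
fresh = pd.concat([a, b], ignore_index=True)
```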
Notes on performance and correctness
Note that, as intermediates, we used Python lists rather than dataframes. Creating and discarding numerous temporary dataframes is approximately 100X slower. (I suspect this is because operations like appending to a dataframe return new copies rather than mutating in place, so we end up generating a lot of garbage objects. If you know better, please drop me a note in the comments.)
As for correctness, kudos to the National Hurricane Center for documenting their data. That makes reuse much easier. I must say, nevertheless, that intercalated headers make reuse quite difficult. Don’t release data like this. Prefer a consistent, columnar structure and a high-performance file format like Parquet, which in our case is more than 10 times smaller than the original.
Sanity check
Let’s do a quick query and see if the results make sense.
hurs2020 = master[
(master['HYear'] == '2020')
& (master['Status of system'] == 'HU')
]
hurs2020['Name'].unique()
We get 14 hurricanes for 2020:
['HANNA', 'ISAIAS', 'LAURA', 'MARCO', 'NANA', 'PAULETTE', 'SALLY', 'TEDDY', 'GAMMA', 'DELTA', 'EPSILON', 'ZETA', 'ETA', 'IOTA']
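For anyone adapting this filter pattern, one gotcha: every column came in as a string, so HYear must be compared against '2020', not the integer 2020. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "HYear": ["2019", "2020", "2020"],
    "Status of system": ["TS", "HU", "HU"],
    "Name": ["ARTHUR", "HANNA", "LAURA"],
})
# Combine boolean masks with &; each condition needs its own parentheses.
hurs = toy[(toy["HYear"] == "2020") & (toy["Status of system"] == "HU")]
sorted(hurs["Name"].unique())  # ['HANNA', 'LAURA']
```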
That checks out with NOAA (see Record-breaking Atlantic hurricane season draws to an end) 🎉.
Use Quilt to access a cleaned version of HURDAT2
The HURDAT2 dataset, normalized and wrangled into Parquet, is available through Quilt:
quilt3 install examples/hurdat2 --registry s3://quilt-example --dest ../data
Like all Quilt packages, examples/hurdat2 has a landing page, version control, and documentation. Check out HURDAT2 on Quilt here.
Conclusion
You’ve now got a few more methods to clean oddly-formatted datasets. I welcome questions in the comments.