Wrangle comma-separated files with intercalated headers and varied column counts—with pandas and HURDAT2

Aneesh Karve
Quilt
Jun 30, 2021

The National Hurricane Center data archive hosts hundreds of years of storm data. In search of a new dataset to explore, I pulled the Atlantic hurricane database (HURDAT2) and for a moment felt hopeless 😬. I wouldn't call this file format cruel, but it is unusual.

Intercalated header rows and data rows with varying widths are tricky to work with in pandas

Our file is comma-separated, but not really: it alternates between single header rows and one or more data rows. Because these intercalated header rows have a different comma count than the data rows, pandas.read_csv returns an unusable result. We'll fix this with the following approach:

  1. Ingest the file as a single column of strings
  2. Separate header rows from data rows
  3. Reorganize the data as arrays
  4. Join the header rows into the data rows
  5. concat everything into a single data frame 😅
  6. Save it as Parquet for speed and compactness
  7. Collect the results into a Quilt data package so that others can reuse our work and start with a clean dataframe

To follow along, you can clone the parse-hurdat2 repository, which includes all of the code in this article (and more) in a Jupyter notebook.

git clone https://github.com/quiltdata/parse-hurdat2.git

You will need pandas and quilt3:

pip install quilt3[pyarrow] pandas

If you just want to get the cleaned dataset (including data sources):

quilt3 install examples/hurdat2 --registry s3://quilt-example --dest ../data

And if you just want to pull the cleaned dataframe into memory:

import quilt3 as q3

hurdat2 = q3.Package.browse("examples/hurdat2", registry="s3://quilt-example")
# boom, dataframe
hurdat2['Atlantic-HURDAT2.parquet']()

Ingest and clean HURDAT2

One column of strings

The first trick is to find a character that's NOT in our file, so that read_csv will parse each line as a single string. We use grep:

grep "%" external/hurdat2-1851-2020-052921.txt

Now we do a read_csv with our non-separator:

import pandas as pd

df = pd.read_csv(
    "../data/external/hurdat2-1851-2020-052921.txt",
    sep="%",
    header=None
)

Now we’ve got a dataframe but it’s a not-so-useful monolith of header rows and data rows.

Check how many row types we have

Are there really just headers and data rows in here? Are they consistently formatted?

widths = set()
for s in df[0]:
    widths.add(len(s.split(',')))

Sweet, there are only two types of rows: headers (with 3 commas) and data rows (with 20 commas). We know this because widths is the set {4, 21}, since splitting on 3 commas yields 4 fields and splitting on 20 commas yields 21.
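If you prefer, the same check is a vectorized one-liner with pandas string methods:

# Count commas per row; two distinct counts means two row types
df[0].str.count(',').value_counts()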

Split dataframe into a dict of arrays

We want to get to a single table with consistent columns, so we’re going to use the header rows as keys and a list of lists as the values:

storms = {}
header = None
for s in df[0]:
    tokens = s.split(',')
    assert len(tokens) == 4 or len(tokens) == 21
    if len(tokens) == 4:
        header = s
        assert header not in storms
        storms[header] = []
    else:
        assert header
        row = [t.strip() for t in tokens]
        storms[header].append(row)
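To spot-check the parse, you can peek at one entry (inspection code for illustration; the exact key depends on the file):

# Each key is a raw header row; each value is that storm's best-track rows
key = next(iter(storms))
print(key)
print(len(storms[key]))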

Integrate headers as columns in data rows

We read a bit of the data docs to figure out what’s what:

frames = []
for k in storms:
    code, name, entries, _ = [t.strip() for t in k.split(',')]
    columns = {
        'Basin': code[:2],
        'ATCF Cyclone Number': code[2:5],
        'HYear': code[5:9],
        'Name': name,
        'Num. best track entries': entries
    }
    table = pd.DataFrame(storms[k], columns=[
        'YYYY-MM-DD',
        'TimeUTC',
        'Record identifier',
        'Status of system',
        'Latitude',
        'Longitude',
        'Max. sustained wind (knots)',
        'Min. pressure (millibars)',
        '34 kt wind max. NE (nautical miles)',
        '34 kt wind max. SE (nautical miles)',
        '34 kt wind max. SW (nautical miles)',
        '34 kt wind max. NW (nautical miles)',
        '50 kt wind max. NE (nautical miles)',
        '50 kt wind max. SE (nautical miles)',
        '50 kt wind max. SW (nautical miles)',
        '50 kt wind max. NW (nautical miles)',
        '64 kt wind max. NE (nautical miles)',
        '64 kt wind max. SE (nautical miles)',
        '64 kt wind max. SW (nautical miles)',
        '64 kt wind max. NW (nautical miles)',
        'DROPME'  # trailing comma in the raw file yields an empty 21st field
    ])
    # add header columns to every data row for this storm
    for c in columns:
        table[c] = columns[c]
    frames.append(table)

# concat the list of frames in ONE shot, making ONE dataframe = muuuch faster
master = pd.concat(frames)
master

Boom, we’ve got HURDAT2 as a proper dataframe with nice, organized columns. Every row has the same structure and we can get to work.

Notes on performance and correctness

Note that, as intermediates, we didn't use dataframes, but rather plain Python lists. Creating and discarding numerous temporary dataframes is approximately 100X slower. (I think that's because concat copies its inputs each time, so growing a dataframe incrementally generates a lot of garbage objects. If you know better, please drop me a note in the comments.)
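If you want to see the effect yourself, here's a rough micro-benchmark sketch on synthetic data (the exact ratio varies with data size; the ~100X figure above is from the real dataset):

import time
import pandas as pd

chunks = [[['x'] * 21 for _ in range(10)] for _ in range(500)]

# Slow: grow a dataframe by concatenating inside the loop
start = time.perf_counter()
slow = pd.DataFrame(chunks[0])
for chunk in chunks[1:]:
    slow = pd.concat([slow, pd.DataFrame(chunk)])
print('concat per iteration:', time.perf_counter() - start)

# Fast: accumulate plain lists, then concat ONCE
start = time.perf_counter()
fast = pd.concat([pd.DataFrame(chunk) for chunk in chunks])
print('one-shot concat:', time.perf_counter() - start)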

As for correctness, kudos to the National Hurricane Center for documenting their data; that makes reuse much easier. I must say, nevertheless, that intercalated headers make reuse quite difficult. Don't release data like this. Prefer a consistent, columnar structure and a high-performance file format like Parquet, which in our case is more than 10 times smaller than the original.
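Writing the cleaned dataframe to Parquet is one call (a minimal sketch; the output path is up to you, and the pyarrow extra installed earlier handles the encoding):

# Write the cleaned table as Parquet; requires pyarrow
master.to_parquet("../data/Atlantic-HURDAT2.parquet", index=False)

import os
os.path.getsize("../data/Atlantic-HURDAT2.parquet")  # compare to the source .txt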

Sanity check

Let’s do a quick query and see if the results make sense.

hurs2020 = master[
    (master['HYear'] == '2020')
    & (master['Status of system'] == 'HU')
]
hurs2020['Name'].unique()

We get 14 hurricanes for 2020:

['HANNA', 'ISAIAS', 'LAURA', 'MARCO', 'NANA', 'PAULETTE', 'SALLY', 'TEDDY', 'GAMMA', 'DELTA', 'EPSILON', 'ZETA', 'ETA', 'IOTA']

That checks out with NOAA (see Record-breaking Atlantic hurricane season draws to an end) 🎉.

Use Quilt to access a cleaned version of HURDAT2

The HURDAT2 dataset, normalized and wrangled into Parquet, is available through Quilt:

quilt3 install examples/hurdat2 --registry s3://quilt-example --dest ../data

Like all Quilt packages, examples/hurdat2 has a landing page, version control, and documentation. Check out HURDAT2 on Quilt here.
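For reference, publishing your own cleaned copy looks roughly like this (a sketch assuming you have write access to an S3 bucket registered with Quilt; the bucket name is a placeholder):

import quilt3

# Build a package from the local Parquet file and push it to a registry
pkg = quilt3.Package()
pkg.set("Atlantic-HURDAT2.parquet", "../data/Atlantic-HURDAT2.parquet")
pkg.push("examples/hurdat2", registry="s3://your-bucket", message="Cleaned HURDAT2 as Parquet")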

Conclusion

You’ve now got a few more methods for cleaning oddly formatted datasets. I welcome questions in the comments.

Data, visualization, machine learning, and abstract algebra. CTO and co-founder @QuiltData. Recent talks https://goo.gl/U9VYr5.