More binary data parsing
Way back when I was trying to read binary data, I came across ‘struct’ in an initial search. I ended up going with ‘bitstring’ instead, because I thought I would need it to read individual bits of data. But David from USGS told me last week that he uses struct to read at the byte level, then parses out what he needs afterwards. He showed me some of his code, and it really looks a lot more efficient than what I was doing with bitstring – something that would have taken me like 20 lines of code only took him 1! :-)
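To make the "read bytes, then pick out the bits" idea concrete, here is a minimal sketch. It is not David's code: the two-byte layout and the field names are invented for illustration, and it assumes Python 3. struct pulls out whole bytes, and ordinary shifts and masks take them apart afterwards.

```python
import struct

# Hypothetical 2-byte record: one flags byte followed by one counter byte.
data = bytes([0b10100101, 0x07])

# Read both bytes in a single call ("<" = little-endian, "B" = unsigned byte).
flags, counter = struct.unpack("<BB", data)

# Pick individual bits and bit fields out of the flags byte afterwards.
high_bit = (flags >> 7) & 0x1    # top bit
mid_bits = (flags >> 4) & 0x7    # bits 4 through 6
low_nibble = flags & 0x0F        # bottom four bits
```

For a handful of flag bits this stays short; a layout packed at odd bit offsets throughout might still be easier to express in bitstring.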
David gave me a sample file, and some tips to help me get started. Here are the highlights:
Michelle, the classes you are looking for are in the recordframe.py
file. The basic pattern is this:

class RecordType:
    def __init__(self, data):
This method reads the data record portion of the data frame
All the binary data is stored in a string of bytes called data. If the
record is pre-defined, I just use the struct package to read the bytes
in all at once. Frequently, though, I had to read a “size” field
before I knew how large the record was that needed to be read. You can
see it gets a little more messy in that case. It also might have been
cleaner to pass in the file handle directly rather than the data bytes
themselves; then I wouldn’t need all the code for indexing into the
data array. This is what Eric’s MATLAB code does. Might try that way
later and see if it is cleaner.

    def __str__(self):
This method allows you to print RecordTypes and it prints out whatever
__str__ returns. Just overrides Python’s default printing for an
object. You could just as easily have done something like a printString()
method, but this way makes code cleaner later on.
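Here is a rough sketch of that pattern as I understand it. This is not the code from recordframe.py: the record layout (a fixed header holding an id and a sample count, followed by that many 16-bit samples) and all the field names are made up for illustration. It shows the fixed part being read in one call, the size field deciding the format of the rest, and __str__ controlling what print shows.

```python
import struct


class RecordType:
    """Hypothetical record: fixed header (id, count), then `count` int16 samples."""

    HEADER = struct.Struct("<HI")  # uint16 record id, uint32 sample count

    def __init__(self, data):
        # Fixed-size part: read all the header fields in one call.
        self.record_id, count = self.HEADER.unpack_from(data, 0)
        # Variable-size part: the count just read determines the format,
        # and the offset indexes past the header into the data bytes.
        offset = self.HEADER.size
        self.samples = struct.unpack_from("<%dh" % count, data, offset)

    def __str__(self):
        # Whatever this returns is what print(record) displays.
        return "RecordType(id=%d, samples=%s)" % (self.record_id, list(self.samples))


# Build a fake record to try it out: id 7000, three samples.
data = struct.pack("<HI3h", 7000, 3, -1, 0, 1)
record = RecordType(data)
print(record)  # prints RecordType(id=7000, samples=[-1, 0, 1])
```

The offset bookkeeping in __init__ is the "indexing into the data array" that passing a file handle would avoid, since each read from a handle moves the position forward on its own.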
And later…
One thing about struct, there are two ways to use it. If the struct
isn’t going to change you should make it a Struct object and use it like this:

    s = struct.Struct("format string")
    for i in range(100000000):
        fields = s.unpack(data)

This pre-compiles the format and all the work is done in efficient C
code. Unfortunately, this won’t work for Reson data because the
structs have variable sizes that you need to figure out before you
read the data. In that case you do it like this:

    for i in range(100000000):
        fmt = ""  # format string, worked out again for each record
        fields = struct.unpack(fmt, data)

Notice that the struct module now takes the format string along with
the data on each and every iteration through the loop. This is much
slower. But if you need to change the fmt string on each trip through
the loop you don’t have a choice. The re module for regular expression
parsing has the same options: a pre-compiled and fast version and a
flexible slow version.
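The two styles can be sketched side by side like this. The layouts are invented for illustration (they are not the Reson formats), and it assumes Python 3. The first half reuses one compiled Struct for fixed-size records; the second reads a one-byte length before each record and rebuilds the format string every time. The last few lines show the matching pair in re.

```python
import re
import struct

# Fixed layout: compile the format once, then reuse it for every record.
point = struct.Struct("<Hf")  # hypothetical record: uint16 id, float32 value
blob = b"".join(point.pack(i, i * 0.5) for i in range(4))
fixed = [point.unpack_from(blob, off) for off in range(0, len(blob), point.size)]

# Variable layout: each record starts with a uint8 count of int16 values,
# so the format string has to be rebuilt on every pass through the loop.
stream = struct.pack("<B2h", 2, 10, 20) + struct.pack("<B3h", 3, 1, 2, 3)
variable = []
off = 0
while off < len(stream):
    (n,) = struct.unpack_from("<B", stream, off)   # read the size field first
    fmt = "<%dh" % n                               # format depends on that size
    variable.append(struct.unpack_from(fmt, stream, off + 1))
    off += 1 + struct.calcsize(fmt)

# The same split in re: a compiled pattern object vs. the module-level function.
digits = re.compile(r"\d+")
compiled_hits = digits.findall("a1b22")
direct_hits = re.findall(r"\d+", "a1b22")
```

One note on speed: the Python documentation says the struct module caches the compiled versions of recently used format strings, so the gap between the two styles may be smaller when only a few distinct formats come up. Timing it on real data would be the way to know.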