I am using Python to do some data file processing, converting data from a horrendously verbose, repetitive format to a nice, clean CSV format. The date and time are in two different fields, and the date is in MM/DD/YYYY format, except that the MM and DD might be one or two characters. That is, January is 1, not 01.

I am converting the timestamp to ISO format, so I was using time.strptime to parse the date/time and time.strftime to generate the proper ISO-formatted date, like so:

return time.strftime("%Y-%m-%d %H:%M:%S", time.strptime(ts, "%m/%d/%Y %H:%M:%S"))

On the smallest of my data files, the processing was taking 13 to 15 seconds. I profiled it and found that in a 13-second run, strptime was taking 8.755 seconds of that, and it was calling _getlang(), _parse_localename(), and the like every time.
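
For anyone who wants to reproduce that kind of measurement, here is a minimal sketch using the standard-library cProfile and pstats modules. The function name profile_conversion, the sample timestamp, and the iteration count are illustrative assumptions, not taken from the original script, and the exact helper functions that show up in the report will vary by Python version.

```python
import cProfile
import io
import pstats
import time


def convert_timestamp(ts):
    """The strptime/strftime conversion shown above."""
    return time.strftime("%Y-%m-%d %H:%M:%S",
                         time.strptime(ts, "%m/%d/%Y %H:%M:%S"))


def profile_conversion(n=20000):
    """Profile n conversions (hypothetical helper) and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n):
        convert_timestamp("1/5/2010 13:45:09")  # made-up sample value
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_conversion())
```

The cumulative-time column is the one to read here: it shows how much of the run is spent inside strptime and the functions it calls.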

So, thought I, regexes are pretty efficient, I wonder if that would reduce the run time any.

ts_re = re.compile(r'^(\d{1,2})/(\d{1,2})/(\d{4}) (\d{2}:\d{2}:\d{2})')
m = ts_re.match(ts).groups()
return ("%s-%02d-%02d %s") % (m[2], int(m[0]), int(m[1]), m[3])

(The re.compile() call is at the module level, outside the function, so it is only run once.)
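
Put together as a self-contained module, the regex approach might look like the sketch below. The ValueError on a non-matching string is my own assumption about how bad input should be handled; the snippet above would instead fail with an AttributeError when match() returns None.

```python
import re

# Compiled once at import time, outside the function.
TS_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4}) (\d{2}:\d{2}:\d{2})")


def convert_timestamp(ts):
    """Convert 'M/D/YYYY HH:MM:SS' to 'YYYY-MM-DD HH:MM:SS'."""
    match = TS_RE.match(ts)
    if match is None:
        # Assumption: reject unparseable input explicitly.
        raise ValueError("unrecognized timestamp: %r" % ts)
    month, day, year, clock = match.groups()
    # Zero-pad month and day; year and time are passed through as text.
    return "%s-%02d-%02d %s" % (year, int(month), int(day), clock)
```

One trade-off worth keeping in mind: unlike strptime, this pattern only checks the shape of the string, so a month of 13 or an hour of 99 would pass through unchanged.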

My overall run time dropped to about 5 seconds, a little over 1/3 of the time it took previously. My convert_timestamp() function, which previously had consumed nearly 10 seconds, was only taking about 1.3 seconds now.
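
A quick way to compare the two approaches in isolation is the standard-library timeit module. This is only a rough sketch: the function names, the sample timestamp, and the iteration count are assumptions for illustration, and the actual ratio will depend on the Python version and the machine.

```python
import re
import time
import timeit

TS_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4}) (\d{2}:\d{2}:\d{2})")


def convert_strptime(ts):
    """Parse with strptime, then format with strftime."""
    return time.strftime("%Y-%m-%d %H:%M:%S",
                         time.strptime(ts, "%m/%d/%Y %H:%M:%S"))


def convert_regex(ts):
    """Pull the fields out with a precompiled regex and reassemble them."""
    month, day, year, clock = TS_RE.match(ts).groups()
    return "%s-%02d-%02d %s" % (year, int(month), int(day), clock)


def compare(sample="1/5/2010 13:45:09", number=20000):
    """Return (strptime_seconds, regex_seconds) for `number` calls each."""
    # Both versions should agree before their timings mean anything.
    assert convert_strptime(sample) == convert_regex(sample)
    t_strptime = timeit.timeit(lambda: convert_strptime(sample), number=number)
    t_regex = timeit.timeit(lambda: convert_regex(sample), number=number)
    return t_strptime, t_regex


if __name__ == "__main__":
    t_strptime, t_regex = compare()
    print("strptime: %.3fs  regex: %.3fs" % (t_strptime, t_regex))
```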

Sometimes regexes are the answer.

