Tuesday, 12 February 2013

Batch Processes in Python on HPCs

As part of my PhD I had to analyse millions upon millions of WiFi packets that I had captured. These packets had to be assessed sequentially, so I couldn't really parallelise the analysis itself. To make matters worse, I needed to run the same analysis with potentially 40 different parameter values, as I wanted to show the impact of those values on the outcome of an algorithm.

This process was going to take weeks of constant running on my personal machine before I got even preliminary results, so I used my university's High Performance Computing (HPC) cluster. This allowed me to run 10-20 of the processes simultaneously, each with slightly different parameters, and to queue the remaining jobs so that they started as earlier ones finished. Suffice to say, I would still be doing the data processing now if I had to do it on my personal machine rather than on this resource.
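To give a feel for what "the same analysis with slightly different parameters" looks like in practice, here is a minimal Python 3 sketch that builds one command line per parameter value. Everything specific in it is an assumption for illustration: the script name `analysis.py`, the data path, the choice of the window length as the swept parameter, and the values themselves. It only prints the commands; how they are handed to the queue depends on the scheduler your cluster uses, which is not covered here.

```python
import shlex


def build_commands(script, data_dir, source, window_lengths,
                   delay_start=0, delay_length=0):
    """Return one shell command per window length.

    Argument order follows the six-argument call the script expects:
    capture folder, source name, delay start, delay length,
    request window length, response window length.
    """
    commands = []
    for window in window_lengths:
        args = ["python", script, data_dir, source,
                str(delay_start), str(delay_length),
                str(window), str(window)]
        # shlex.quote protects arguments containing shell-special characters
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical sweep: script name, path and values are placeholders.
for line in build_commands("analysis.py", "/path/to/captures/",
                           "(A11-P1-full)", range(5, 45, 5)):
    print(line)
```

Each printed line is a complete, independent job, which is what makes this kind of workload easy to spread across a cluster even when the analysis inside each job is strictly sequential.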

Getting the batch processing right, though, took a lot of revision and several begrudging rewrites. The easiest approach was a single Python file that took command-line arguments, supplied as text after the script name. Changing these arguments changed the testing parameters for that run. The script therefore needed to read these input parameters and convert them into properly typed variables.
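As an aside, the standard library's `argparse` module can do the counting, type conversion and usage message for you. The following is a sketch in Python 3, not the approach used in my script, which reads `sys.argv` by hand. The argument names are assumptions chosen to match the variables my script uses, and the `delay` flag mirrors its convention that a start of 0 and a length of 0 means "run the entire dataset".

```python
import argparse


def parse_args(argv):
    """Parse the six positional arguments into typed attributes."""
    parser = argparse.ArgumentParser(
        description="Batch analysis of captured WiFi packets")
    parser.add_argument("dir_path", help="folder holding the source capture data")
    parser.add_argument("source", help="identifier included in the output filename")
    parser.add_argument("delay_start", type=int)
    parser.add_argument("delay_length", type=int)
    parser.add_argument("req_window_length", type=int)
    parser.add_argument("rsp_window_length", type=int)
    args = parser.parse_args(argv)
    # A start of 0 and a length of 0 is taken to mean "use the whole dataset"
    args.delay = not (args.delay_start == 0 and args.delay_length == 0)
    return args


# Example call with placeholder values; a real script would pass sys.argv[1:]
args = parse_args(["/path/to/captures/", "A11-P1-full", "0", "0", "10", "10"])
print(vars(args))
```

With this layout a missing or non-numeric argument makes `argparse` print the expected format and exit with a non-zero status, so the usage string can never drift out of date with the code.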

Then, as the number of inputs grew (dataset location, identifier, timescale of analysis, etc.), I also wanted a standard default mode that simply checks the script runs. Half the effort with something like this goes into making sure it runs correctly and that you have caught every possible error. You definitely don't want to go through all of this only to find that one "+" in place of a "-" has destroyed weeks of analysis and graph generation. So I included a default running mode to check that any changes I had made hadn't created errors further down the pipeline before I committed all the jobs to the HPC.
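The idea of a default mode can be sketched on its own, separately from the argument handling. The Python 3 snippet below is an illustration under assumptions: the dictionary keys and the data path are placeholders, and the prompt and output functions are passed in as parameters purely so the check can be exercised without a keyboard.

```python
# Default values for a quick local test run (the path is a placeholder)
DEFAULTS = {
    "dir_path": "/path/to/test/data/",
    "source": "(A11-P1-full)",
    "delay_start": 0,
    "delay_length": 0,
    "req_window_length": 10,
    "rsp_window_length": 10,
}


def confirm_defaults(defaults, ask=input, out=print):
    """Show the defaults and return True only if the user answers 'y'."""
    out("Running with defaults:")
    for name, value in defaults.items():
        out("\t{} = {}".format(name, value))
    answer = ask("(y/n):\t")
    return answer.strip().lower() == "y"
```

A script would call `confirm_defaults(DEFAULTS)` when it is started without the full argument list, and exit with a non-zero status if the answer is anything other than "y". The confirmation step only makes sense on your own machine: a job on the cluster has nobody to answer the prompt, so the full argument list should always be supplied there.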

For your entertainment, here is the batch handling section of my Python script that managed all of these things.

<><><><><>


# For running on a batch system (HPC)
import sys

if allow_batch:
    if len(sys.argv) != 7:                   # Hard coded the necessary number of arguments to 7 in my case
        if len(sys.argv) == 2:               # If I only had 2 arguments then this was a test condition
            dir_path = '/home/jonny/Desktop/test/'                   # Dir path would be better called source data path
            results_path = '/home/jonny/Desktop/'
            source = '(A11-P1-full)'                   # This was ALWAYS included in the output filename
            delay_start = 0
            delay_end = 0
            req_window_length = 10
            rsp_window_length = 10
            print "NOT ENOUGH TERMS. Do you want to run with defaults:"   # Just double checking
            print "\tDir_path = " + dir_path
            print "\tSource = " + source
            print "\tDelay Start = " + str(delay_start)
            print "\tDelay End = " + str(delay_end)
            print "\tReq Window Length = " + str(req_window_length)
            print "\tRsp Window Length = " + str(rsp_window_length)
            yorn = raw_input('(y/n):\t')
            if yorn == 'y':                   # Confirm that you meant to run with only two, i.e. in testing
                pass
            else:
                i = 0
                for arg in sys.argv:          # It's a good idea to parrot the input, especially in HPC environments
                    print str(i) + '-' + arg
                    i += 1
                # Stating the expected format is a good idea too, as you will probably forget over time
                print '<>\nEXITING\n<>\nNot enough terms. Require each: (python:\'Script path\':\'capture folder\':' + \
                      '\'source name\':\'delay_start\':\'delay_length\':\'Req Window length\':\'Rsp Window length\')'
                sys.exit(1)
        # Didn't need an else catching non "== 2" conditions, as any missing variable values are caught by the script
    else:                   # If we have the correct number of inputs, parse them into the necessary variables
        dir_path = sys.argv[1]
        source = sys.argv[2]

        print "Dataset is: " + str(source.split('-')[1][0]) + " and channel is: " + str(source.split('-')[0][1])
        if int(sys.argv[3]) == 0 and int(sys.argv[4]) == 0:      # If I meant to run the entire dataset I used 0 start and 0 stop
            delay = False
        else:
            delay = True
            delay_start = int(sys.argv[3])
            delay_length = int(sys.argv[4])

        req_window_length = int(sys.argv[5])
        rsp_window_length = int(sys.argv[6])
        i = 0
        for arg in sys.argv:              # Again, even when successful it's good to parrot the inputs, especially with large runs
            print str(i) + '-' + arg
            i += 1
<><><><><>
