Tuesday, 12 February 2013

Batch Processes in Python on HPCs

As part of my PhD I had to analyse millions upon millions of WiFi packets that I had captured. These packets had to be assessed sequentially, so I couldn't really parallelise the system very much. To make matters worse, I needed to run the same analysis with potentially 40 different parameter values, as I wanted to show the impact of those values on the outcome of an algorithm.

This process was going to take weeks of constant running on my personal machine before I would get even preliminary results, so I used my university's High Performance Computing cluster (HPC). This allowed me to run 10-20 of the processes simultaneously with slightly different parameters, as well as queuing all subsequent jobs once those were finished. Suffice it to say, I would still be doing the data processing now if I had to do it on my personal machine rather than using this resource.

Getting there took a lot of revision and a few begrudged complete rewrites before I got the batch processing right. The easiest approach was to create a single Python script that took its testing parameters as command-line arguments appended after the filename. The script then needed to parse those input arguments into variables properly.

Then, as the number of inputs expanded (dataset location, identifier, timescale of analysis, etc.), you also want a standard, default operation as a sanity check. Half the effort with something like this goes into making sure it is running correctly and that you have caught every possible error. You definitely don't want to go through all this work only to find that a single "+" rather than "-" has destroyed weeks of analysis and graph generation. So I also included a default running mode just to check that any changes I had made hadn't created errors further down the pipe before I committed all the jobs to the HPC.

For your entertainment, here is the batch handling section of my Python script that managed all of these things.

<><><><><>


# For running on a batch system (HPC)

if allow_batch == True:
    if len(sys.argv) != 7:                  # Hard coded the necessary number of arguments to 7 in my case
        if len(sys.argv) == 2:              # If I only had 2 arguments then this was a test condition
            dir_path = '/home/jonny/Desktop/test/'  # dir_path would be better called source data path
            results_path = '/home/jonny/Desktop/'
            source = '(A11-P1-full)'        # This was ALWAYS included in the output filename
            delay = False                   # No delay window when running the full default dataset
            delay_start = 0
            delay_length = 0
            req_window_length = 10
            rsp_window_length = 10
            print "NOT ENOUGH TERMS. Do you want to run with defaults:"     # Just double checking
            print "\tDir_path = " + dir_path
            print "\tSource = " + source
            print "\tDelay Start = " + str(delay_start)
            print "\tDelay Length = " + str(delay_length)
            print "\tReq Window Length = " + str(req_window_length)
            print "\tRsp Window Length = " + str(rsp_window_length)
            yorn = raw_input('(y/n):\t')
            if yorn == 'y':                 # Confirm that you meant to run with only two i.e. in testing
                pass
            else:
                exit(1)                     # Anything other than 'y' aborts the run
        else:
            i = 0
            for arg in sys.argv:            # It's a good idea to parrot the input, especially in HPC environments
                print str(i) + '-' + arg
                i += 1
            # Stating the expected format is a good idea too, as you will probably forget over time
            print '<>\nEXITING\n<>\nNot enough terms. Require each: (python:\'Script path\':\'capture folder\':' + \
                '\'source name\':\'delay_start\':\'delay_length\':\'Req Window length\':\'Rsp Window length\')'
            exit(1)
        # Didn't need an else catching non "== 2" conditions as any missing variable values are caught by the script
    else:                                   # If we have the correct number of inputs parse into necessary variables
        dir_path = sys.argv[1]
        source = sys.argv[2]

        print "Dataset is: " + str(source.split('-')[1][0]) + " and channel is: " + str(source.split('-')[0][1])
        if int(sys.argv[3]) == 0 and int(sys.argv[4]) == 0:     # If I meant to run the entire dataset I used 0 start and 0 stop
            delay = False
        else:
            delay = True
            delay_start = int(sys.argv[3])
            delay_length = int(sys.argv[4])

        req_window_length = int(sys.argv[5])
        rsp_window_length = int(sys.argv[6])
        i = 0
        for arg in sys.argv:                # Again, even when successful it's good to parrot the inputs, especially with large runs
            print str(i) + '-' + arg
            i += 1
<><><><><>
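If I were writing this today, the same pattern (positional arguments with sensible defaults for a test run, then parroting the inputs back) could be sketched with the standard-library argparse module. The argument names mirror my script above, but this is an illustrative sketch, not the code I actually ran:

```python
import argparse

# Illustrative argparse equivalent of the hand-rolled sys.argv handling above;
# defaults mirror the "test condition" branch of my script
parser = argparse.ArgumentParser(description='Batch analysis runner (sketch)')
parser.add_argument('dir_path', nargs='?', default='/home/jonny/Desktop/test/')
parser.add_argument('source', nargs='?', default='(A11-P1-full)')
parser.add_argument('delay_start', nargs='?', type=int, default=0)
parser.add_argument('delay_length', nargs='?', type=int, default=0)
parser.add_argument('req_window_length', nargs='?', type=int, default=10)
parser.add_argument('rsp_window_length', nargs='?', type=int, default=10)

# [] -> use the defaults; drop the list to read the real command line
args = parser.parse_args([])

# Parrot the inputs back, as in the script above
for name, value in sorted(vars(args).items()):
    print(name, '=', value)
```

You also get "-h" usage text for free, which saves writing the "stating the expected format" print by hand.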

Friday, 8 February 2013

Let's Start Again

Alright,

Wow, two years since the last post. That's a long time! That's what a PhD does to you, I suppose. Finally finished and started a postdoc; less stress but more work!

I have finally decided that I am going to do this properly now too. One post a week, on a Monday. Every week. Two years of built up resources ought to keep that going for a while.

J

Friday, 3 December 2010

Installing the Gnuplot.py module for Python on Windows

I had far more trouble with this than is reasonable; the support documentation and guidance is a little misleading at times. I now have an installation of plain Gnuplot as well as the Python module on my machine...

1) Download Gnuplot.py (zip file) from here: http://sourceforge.net/projects/gnuplot-py/files/

2) Unzip and move the resulting folder to your working directory. Just drag and drop it. I happen to have this directory added to the Python default search path. While we are here, I managed that using:


import sys
sys.path.append('C:\\folder\\folder\\working-directory')


3) Open the command line by typing 'cmd' into Run


4) 'cd' your way into your working directory (using the double \\ convention for folders) and on into the top of the gnuplot-py directory. This is the one with the file "setup.py" in it.


5) Type: "python setup.py install" in the command line and it all ought to extract and build just fine.


Now you, well I, can open the python interpreter of choice and,


import Gnuplot


now ought to work just fine. Two hours of my life condensed into 5 steps. It always takes longer than you think...
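If you want to sanity-check the search-path trick from step 2 before a trial-and-error import, a modern Python 3 interpreter can ask the import machinery whether the module is findable first. The path below is the same illustrative one as above:

```python
import sys
import importlib.util

# Add the working directory to the search path (illustrative path, as in step 2)
sys.path.append('C:\\folder\\folder\\working-directory')

# Ask the import machinery whether the module can be located before importing it
spec = importlib.util.find_spec('Gnuplot')
if spec is not None:
    print('Gnuplot.py is importable')
else:
    print('Gnuplot.py not found on sys.path')
```

This avoids a noisy traceback when the folder ended up in the wrong place.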

Friday, 26 November 2010

Calculating a time in human readable form (Python)

I was trying to work out the total time covered by a pcap directory I was working on, but didn't want to see a huge number of seconds, so I have written something to change it into a human-readable form. It is largely the same as a version I found on the internet, but I can't remember where. Apologies.

<><><>

def human_time(secs):
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    days, hours = divmod(hours, 24)
    weeks, days = divmod(days, 7)
    return '%02d weeks %02d days %02d hours %02d mins %02d secs' % (weeks, days, hours, mins, secs)

<><><>
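As a quick check of the arithmetic, feeding in one week plus one day, one hour, one minute and one second worth of seconds gives (the function is repeated here so the snippet stands alone):

```python
def human_time(secs):
    # Peel off each unit in turn with divmod
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    days, hours = divmod(hours, 24)
    weeks, days = divmod(days, 7)
    return '%02d weeks %02d days %02d hours %02d mins %02d secs' % (weeks, days, hours, mins, secs)

# 604800 + 86400 + 3600 + 60 + 1 seconds
print(human_time(694861))  # -> 01 weeks 01 days 01 hours 01 mins 01 secs
```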

Monday, 22 November 2010

Calculating running Averages in Python

It took a while, but I got it to work. Again there was a nod in the direction of the internet (mostly here), the source of all my knowledge. I'll just throw up the code, as it ought to be fairly self-explanatory, but then I always say that.

<><><>


def make_av():
    def std_dev(value):
        from math import sqrt
        std_dev.tot += 1                    # Update the running totals
        std_dev.sum += value
        std_dev.sq_sum += (value * value)
        if std_dev.tot < 2:                 # Can't have a standard deviation with fewer than 2 packets
            return (std_dev.sum / std_dev.tot, 0.0)
        # Return a tuple of the running mean followed by the running standard deviation
        return (std_dev.sum / std_dev.tot,
                sqrt(abs((std_dev.tot * std_dev.sq_sum - std_dev.sum**2) / (std_dev.tot * (std_dev.tot - 1)))))
    std_dev.tot, std_dev.sum, std_dev.sq_sum = [0.0] * 3    # Reset variables
    return std_dev


<><><>


So you create an instance of make_av() and feed it each value you wish to average. The function returns a tuple of the running mean and standard deviation. Done.
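A quick worked example (the closure is repeated here so the snippet stands alone): for the values 1 to 4 the running mean is 2.5 and the sample standard deviation is about 1.291:

```python
from math import sqrt

def make_av():
    def std_dev(value):
        std_dev.tot += 1                    # Update the running totals
        std_dev.sum += value
        std_dev.sq_sum += value * value
        if std_dev.tot < 2:                 # No standard deviation with fewer than 2 values
            return (std_dev.sum / std_dev.tot, 0.0)
        # Running mean followed by the running (sample) standard deviation
        return (std_dev.sum / std_dev.tot,
                sqrt(abs((std_dev.tot * std_dev.sq_sum - std_dev.sum ** 2) /
                         (std_dev.tot * (std_dev.tot - 1)))))
    std_dev.tot, std_dev.sum, std_dev.sq_sum = [0.0] * 3    # Reset variables
    return std_dev

av = make_av()
for v in [1, 2, 3, 4]:
    mean, sd = av(v)
print(round(mean, 3), round(sd, 3))  # -> 2.5 1.291
```

Each call to make_av() gives an independent accumulator, so you can keep several running averages going at once.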