Wednesday, 20 October 2010

Reading Pcaps with Python

So, time to have a look at some of the other options. There are three main ones:

DPKT

http://code.google.com/p/dpkt/

http://jon.oberheide.org/blog/2008/10/15/dpkt-tutorial-2-parsing-a-pcap-file/

There are a few worked examples around the internet on using this, but again it lacks comprehensive documentation. As a result I tried implementing it, only to get seg faults (the most annoying of all errors). I attributed this to excessive memory consumption again.

PYPCAP

http://code.google.com/p/pypcap/

http://www.gradstein.info/python/how-to-understand-the-arp-queries-and-replies-fields-with-pypcap/

Pypcap is handily installed through the python-pypcap package on Ubuntu, but the examples given show how to handle packets as you receive them on the interface. Since I needed to extract information from an already received capture file, this wasn't terribly helpful. Also, since I have a very limited attention span for poorly documented endeavours, I kept on looking.

LIBPCAP

Now for most people this would be the most obvious choice, and so it should have been. This is essentially just a wrapper for the C library. Or so I am told; I can't even pretend to have tried the C implementation of packet capture reading. I expect I would have a lot less hair by the end of it.

The operations for getting the code working are fairly self explanatory. But then I would say that, having just done it. It certainly wasn't self explanatory at the time. So here goes:
  1. Import scapy.layers.all and pcap. The first is required for the "RadioTap()" class and the second, primarily, for the "pcapObject" class and all its derivatives.
  2. Calling "pcap.pcapObject()" defines "p" as an object which can hold pcap data, because obviously you need somewhere to keep it.
  3. As a result "p" has a series of methods that can be called on it, the primary one being to open the pcap. So for each pcap in the directory we open the file using the open_offline() method.
  4. In my case I need to look at the radiotap data as well as all subsequent data contained within each packet, so again we need somewhere to keep it. Hence "packet=RadioTap()".
  5. The second method of "p" we need is "p.next()", which steps through the packets in the file one at a time. I store the returned data in the variable "pkt".
  6. The result is that pkt is now a sequence which contains [0] *Not sure* [1] Undissected packet information [2] RadioTap received timestamp
  7. In order to dissect the information we have to run it through the RadioTap parser, hence "packet.dissect(pkt[1])".
  8. The result is a nicely parsed packet, with each of the fields available as an attribute, such as ".subtype" here.
  9. Hey presto, we can read packets. However, we can only ever move on to the next packet; we don't know how many packets are actually in there, unlike when we loaded the entire file in one go.
  10. We need a way of terminating the loop cleanly though, so we use a try-except. When p.next() runs out of packets, a TypeError is raised inside the try block, so catching this exception allows us to finish the capture file cleanly.
<><><>

## new_read_pcap.py


## This script reads the contents of pcap files in a directory and summarises the information contained within


from scapy.layers.all import *
import pcap, os


## To have the user import the directory


dir_path = raw_input ('Give the full path to the directory of the pcap files: ')


list_data = []
list_mgmt = []
list_ctrl = []
list_unkn = []
total_packets = 0


p=pcap.pcapObject()


for pcaps in os.listdir(dir_path):

    flag = True
    i = 0
    # os.path.join copes with dir_path lacking a trailing slash
    p.open_offline(os.path.join(dir_path, pcaps))

    while flag:
        # packet=RadioTap()   # Was originally here

        try:
            pkt=p.next()
            packet=RadioTap()   # But needs to go here!
            packet.dissect(pkt[1])
            i+=1   # only count packets that were actually read

            if packet.type == 2L:
                # type = data
                list_data.append(i)
            elif packet.type == 0L:
                # type = management
                list_mgmt.append(i)
            elif packet.type == 1L:
                # type = control
                list_ctrl.append(i)
            else:
                list_unkn.append(i)

        except TypeError:
            # end of the capture file reached
            flag = False

    total_packets += i


storage = open('/home/jonny/Python_work/workfile', 'w+a')


# Opens the file requested with read and append linked to variable storage


storage.write('Summary of the contents of the folder ' + dir_path + ' by the module read_pcap.py\nBy Jonny Milliken\n')
storage.write('Total number of pcap files found in the folder = ' + str(len(os.listdir(dir_path))) + '\n')
storage.write('<><><><><><><><><><><><><><><><><><><><><><><><><><>\n')
storage.write('The total number of data packets is ' + str(len(list_data)) + ' (' + str(len(list_data) * 100 / total_packets) + '%)\n')
storage.write('The total number of management packets is ' + str(len(list_mgmt)) + ' (' + str(len(list_mgmt) * 100 / total_packets) + '%)\n')
storage.write('The total number of control packets is ' + str(len(list_ctrl)) + ' (' + str(len(list_ctrl) * 100 / total_packets) + '%)\n')
storage.write('The total number of unknown packets is ' + str(len(list_unkn)) + ' (' + str(len(list_unkn) * 100 / total_packets) + '%)\n')
storage.close()

<><><>

But oh ho. The problems didn't stop there, oh no. Segfaults abound. Stupid memory. But what is the problem!? A single packet is loaded into memory, dissected and the timestamp extracted. Then this is all over-written, right? Right!? Nope.

Turns out that in the implementation above the operation "packet.dissect()" actually APPENDS to the existing "packet". So in this case "packet" gets increasingly large and causes the same issues as before. As a result I needed to redefine "packet" as an empty "RadioTap()" each time the try is invoked.

Now you ask why I included it at all, given that I have already solved the relatively simple problem. The answer being that I needed something to show for the day it took to work out...

Friday, 8 October 2010

Calculating single-pass standard deviation in Python

After yet another few days working on how to calculate an accurate average of the timestamps in each packet in a pcap, I found out that I was using the incorrect timestamp values. Duh! So the packet.timestamp values coming out of the Radiotap information for the beacons were likely just synchronisation info with little objective basis in real life. However, the timestamp of arrival was simply to be found in the tcpdump header using:

<><><>


import pcap


p=pcap.pcapObject()
p.open_offline("/home/file")
packet=p.next()
timestamp=packet[2]

<><><>

You can see why the Duh! then.
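
To show where that arrival timestamp actually lives, here is a minimal sketch that walks a classic pcap file using only the standard library. It is not the pcap module used above: the name iter_pcap_records and the in-memory demo_capture are made up for illustration, and the sketch assumes the classic microsecond-resolution pcap format (not pcapng or the nanosecond variant). Each packet record starts with a 16-byte header holding seconds, microseconds, captured length and original length, and only one record is held in memory at a time.

```python
import io
import struct

PCAP_GLOBAL_HEADER_LEN = 24   # magic, version, timezone, sigfigs, snaplen, linktype
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_usec, incl_len, orig_len


def iter_pcap_records(stream):
    """Yield (arrival_timestamp, raw_bytes) for each record, one at a time."""
    global_header = stream.read(PCAP_GLOBAL_HEADER_LEN)
    if len(global_header) < PCAP_GLOBAL_HEADER_LEN:
        raise ValueError("not a pcap file: global header too short")

    # The magic number reveals the byte order the file was written in.
    magic = global_header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("unsupported pcap magic number")

    while True:
        record_header = stream.read(PCAP_RECORD_HEADER_LEN)
        if len(record_header) < PCAP_RECORD_HEADER_LEN:
            return  # clean end of file
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", record_header)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return  # truncated final record
        yield ts_sec + ts_usec / 1e6, data


# Hypothetical two-packet capture built in memory (link type 127 = radiotap).
demo_capture = (
    struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 127)
    + struct.pack("<IIII", 1287500000, 250000, 4, 4) + b"\x00\x01\x02\x03"
    + struct.pack("<IIII", 1287500001, 500000, 2, 2) + b"\xaa\xbb"
)

for timestamp, raw in iter_pcap_records(io.BytesIO(demo_capture)):
    print(timestamp, len(raw))
```

For a real file the same generator could be fed open(path, "rb") instead of the BytesIO object.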

With that out of the way I needed to organise how to start calculating averages (specifically standard deviations and mean) from the data. Remembering of course that I couldn't cycle over the data a second time to calculate it. Well I could, but that would require at least a doubling of time spent in calculation. Admittedly this particular application isn't time sensitive, but with something like 25 million packets to look through any doubling needs to be avoided!!

With a quick google and more than a nod in this direction I found a C implementation. That's perfect only I'm writing in Python. Not too bad, but it did take me a while to get it converted so I thought I would share it.

<><><>


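from math import sqrt


def std_dev_run(a):
    # Single pass over the list: accumulate the sum and the sum of squares
    n = len(a)
    if n == 0:
        return 0
    total = 0.0
    sq_sum = 0.0
    for i in range(n):
        total += a[i]
        sq_sum += (a[i] * a[i])
    mean = total / n
    variance = sq_sum / n - mean * mean

    # abs() guards against a slightly negative variance caused by rounding
    return sqrt(abs(variance))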
<><><>

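The only appreciable difference is the inclusion of the abs() term. The math module kept throwing domain errors at me, and it turns out the variance was negative. Mathematically the variance can't be negative, but sq_sum / n and mean * mean are two large, nearly equal numbers, so floating-point rounding can push their difference slightly below zero. And of course negative square roots tend to make computers hate you a little.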
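
A small sketch of why the sum-of-squares formula misbehaves on timestamp-sized numbers. The function names and the three sample values are invented for illustration (they are not taken from the capture files); the values are deliberately large with a tiny spread, which is the situation with Unix timestamps. The naive result can come out wildly wrong or even negative, while subtracting the mean first stays close to the true variance of about 0.00667.

```python
def naive_variance(values):
    # Same formula as std_dev_run: mean of squares minus square of mean
    n = len(values)
    total = sum(values)
    sq_total = sum(v * v for v in values)
    mean = total / n
    return sq_total / n - mean * mean


def two_pass_variance(values):
    # First pass finds the mean, second pass sums squared deviations
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical timestamp-like values: huge offset, tiny spread
timestamps = [1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3]

print("naive:   ", naive_variance(timestamps))
print("two pass:", two_pass_variance(timestamps))
```

The squares here are around 1e18, where neighbouring floating-point numbers are more than a hundred apart, so a variance of a few thousandths is lost entirely in the rounding. abs() hides the domain error but not the loss of accuracy.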

Oh yeah, and THEN I realised I didn't actually need a single pass. After all, I don't want to store 25 million packets in an array, or the values from them, so a single pass over a list won't help. My intention is to produce single values from each packet and have those update the necessary stats before discarding them. The upshot being that what I need is actually an 'online' or 'running' standard deviation calculation. Best get started on that one now...
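
For reference, one common way to do an 'online' calculation is Welford's algorithm, sketched below. This is not the implementation that came out of this post; the class name RunningStats and the sample values are made up for illustration. It keeps only three numbers (count, mean and a running sum of squared deviations), so each value can be thrown away as soon as it has been fed in, and the accumulated sum never goes negative, so no abs() is needed.

```python
from math import sqrt


class RunningStats(object):
    """Running mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, value):
        # Fold one new value into the statistics, then it can be discarded
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self):
        # Population variance, matching sq_sum / n - mean * mean
        if self.count == 0:
            return 0.0
        return self._m2 / self.count

    def std_dev(self):
        return sqrt(self.variance())


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:   # made-up sample
    stats.update(value)

print(stats.count, stats.mean, stats.std_dev())
```

In the packet loop, stats.update() would be called once per packet with whatever value is being tracked (for example the gap between consecutive arrival timestamps).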

Adventures in Python

Back for my second post nearly a month later. And to think I was joking about the massive gap in postings. The massive pain in the interim has been trying to find any real documentation or guides on how to read pcap files into Python. Various different attempts went wrong for various reasons, and I suppose that seems like a good way to get going again.


<><><>

## read_pcap.py

## This script reads the contents of pcap files in a directory and summarises the information contained within


from scapy.all import *
import os

## To have the user import the directory

dir_path = raw_input ('Give the full path to the directory of the pcap files: ')

list_data = []
list_mgmt = []
list_ctrl = []
list_unkn = []
total_packets = 0


# For each of the pcaps in the directory

for pcaps in os.listdir(dir_path):

    # os.path.join copes with dir_path lacking a trailing slash
    pcktList = rdpcap(os.path.join(dir_path, pcaps))
    total_packets += len(pcktList)

    # For each of the packets in the loaded list
    for i in range(len(pcktList)):

        if pcktList[i].type == 2L:
            # type = data
            list_data.append(i)
        elif pcktList[i].type == 0L:
            # type = management
            list_mgmt.append(i)
        elif pcktList[i].type == 1L:
            # type = control
            list_ctrl.append(i)
        else:
            list_unkn.append(i)

storage = open('/home/jonny/Python_work/workfile', 'w+a')

# Opens the file requested with read and append linked to variable storage

storage.write('Summary of the contents of the folder ' + dir_path + ' by the module read_pcap.py\nBy Jonny Milliken\n')
storage.write('Total number of pcap files found in the folder = ' + str(len(os.listdir(dir_path))) + '\n')
storage.write('<><><><><><><><><><><><><><><><><><><><><><><><><><>\n')
storage.write('The total number of data packets is ' + str(len(list_data)) + ' (' + str(len(list_data) * 100 / total_packets) + '%)\n')
storage.write('The total number of management packets is ' + str(len(list_mgmt)) + ' (' + str(len(list_mgmt) * 100 / total_packets) + '%)\n')
storage.write('The total number of control packets is ' + str(len(list_ctrl)) + ' (' + str(len(list_ctrl) * 100 / total_packets) + '%)\n')

<><><>

This piece is simple enough: it loads the entirety of the pcap into a list, with each packet being held in memory and selected by its index value. This is a nearly perfect solution in theory, since all packets can be selected at will. But as the packet count increases (tested to a limit of around 200,000), the memory consumption on a 3GB RAM system peaks. Above this packet level the operating system, presumably, starts trying to page the information out to hard drive storage, which slows the operation to a crawl. Not a deal breaker if you are working with sufficiently small capture files, but at the moment I am working with a directory of 25GB of pcap files, each containing around 1.2 million packets....

Of course it would be that the most logical, documented and convenient method would be exactly the one that won't work in my case. There are alternatives though, after all Wireshark seems to open them just fine so it's clearly not impossible!!