Scripting 2005

Scripting 2005

BiRC / Courses / Scripting / Lecture Notes / Distributed Systems

Distributed Systems

In this lecture we learn how to write networking code and how to distribute programs to run on more than one computer.
We learn how to:
  • Connect processes on different machines through TCP/IP ports
  • Transfer data over ports
  • Build a simple protocol for communication between processes

Motivation

By distributing programs we can execute several tasks in parallel, thus completing jobs much faster than we otherwise could.

Since there are very few free lunches in life, it should not come as a surprise that this benefit does not come for free either, but that running distributed programs adds an extra layer of complexity.

This complexity shows itself in many ways: figuring out how to parallelize algorithms, how to synchronize processes, how to deal with different data representations on different architectures, and many more.

In this lecture we cannot cover but a very small portion of this, so we will concentrate on simple communication between processes, using the TCP/IP protocol — arguably the most widely used networking protocol for the kinds of applications you are likely to write in the future.

Socket Communication

We already know how to run several programs on different architectures — after all, we just start the programs there and let them run — so the only new thing we need is a way for the programs to communicate.

Remember how we used pipes in the Process Management lecture? By having an input and an output file for a process, we were able to chain programs together in pipelines, where the result of running one program could be used as input for the next. We also saw how we could use another process to do part of a computation, by sending data to its input file and reading the result from its output file.

As you can no doubt imagine, we could build arbitrarily complex communication patterns using just file objects like this, as long as the communicating processes agrees on the communicating pattern (see protocols below).

Since communicating through file-like objects like this is both a conceptually very simple idea, and still very powerful, it was generalized in UNIX for inter-process communication — build around so-called sockets — and later extended to inter-host communication in the TCP/IP protocol.

Communication over TCP/IP therefore looks very similar to the pipe communication we have already seen: two processes can communicate by writing to and reading from file like objects. The processes need not run on the same computer, and we are not restricted to a single input and output file per process, but aside from this it is very similar to what we have already seen.

Connecting Sockets and the Client/Server Architecture

Communication through sockets is designed around the so-called Client/Server architecture; two communicating processes are not treated identically, as in a peer-to-peer architecture, but rather one is considered the server and the other the client. The server creates a socket and starts listening on it; the client creates another port, but does not start listening itself but instead connects to the server and in that way initiates the communication.

You can think of this a bit like with telephones: one process is waiting at the telephone (the server) and the other is calling (the client). If both processes were just waiting, or if both processes were calling, no communication would be initiated, but if one waits and the other calls we have a connection. The analogy does not carry over completely, 'cause with telephones you do not actively have to wait for a call, and once the communication starts the "telephone" is not busy but can accept more calls if the server is built to handle that, but for now just go with the analogy.

Before we can call the server, we need it to have a phone number, or an address, in TCP/IP lingo. This address consists of two parts: a host address and a port (think area number and local number in our phone analogy). The host address is the address of the machine the server is running on — an IP number or a host name — and the port is an address on that host that distinguishes this server from other processes that are running on the host.

The server will listen on an address by binding a socket to the address and then start listening:

import socket

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)   # listen on the port, allowing 1 client in queue 

The server will always be on the local host — that is, after all, where it is running — so its address is on a port on this host. The port number can, basically, be chosen arbitrarily, but different protocols will wait on different ports (you need a fixed protocol for the clients to be able to find the server), and some low-number ports are reserved and cannot be used by users. Here we choose, arbitrarily, the port 50000. The address is then the pair of host and port.

A socket, s is created using the socket() function from module socket, here using default parameters that sets up a TCP/IP socket. The socket is bound to the address we created — which means that this address is where it will accept incoming calls from — and is then told to listen, allowing one client to queue at a time.

We have now connected the phone, but are not yet ready to accept calls. To do this, we call the accept() method:

conn, addr = s.accept() 

This call blocks until a client connects to the server, and returns a connection object, conn and the client's address, addr — think of it as caller id on a phone.

The client part of establishing a connection is slightly simpler: The client only needs to create the socket and then call the server. Assuming the server is running on the host "zebra03", this could look like:

import socket

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

Once again, it needs the server address to know where to connect to, but aside from that it just creates the socket and connects to the server.

Sending and Receiving Data

Once the connection is established, data can be send and received over it — this is like an input and output file in one. The client is communicating through the socket s it connected to the server, and the server is communicating through the connection socket, conn it got from the accept() call (the other socket, s is still listening on the server address, although the server is no longer accepting calls...if it calls accept() again later it can start communicating with new clients).

Let's say the client wants to send data to the server. Then the client code could look like this:

s.sendall('Hello, world')
s.close() 

while the server code looks like this.

data = []
while 1:
    next = conn.recv(1024)
    if not next: break
    data.append(next)
conn.close() 

The client sends a string using the sendall() method; this method takes care of sending the complete string given as input, splitting the data if not all can be send in a single message. The method send() gives you slightly more information on how much data was transfered if an error occurs, but with that you need to make sure that all data is send yourself, so you will usually stick to sendall().

The server uses the recv() method to read 1024 bytes from the connection at a time, and combine the read bytes after the connection is closed (signaled by next being empty).

Of course, the server could also be the one sending and the client receiving; once the connection is established the client/server architecture need no longer be in effect and the processes can be considered as peers.

In summary, we show the two complete client and server scripts:

import socket

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)

conn, addr = s.accept()
data = []
while 1:
    next = conn.recv(1024)
    if not next: break
    data.append(next)
conn.close()
s.close()

print ''.join(data) 
import socket

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

s.sendall('Hello, world')
s.close() 

EXERCISE DIST.1: Change the communication patter so the server is sending to the client instead.

EXERCISE DIST.2: Can you make the client send data to the server first and then the server sending to the client afterward? What happens if both send data and then tries to receive?

File Objects

The send()/sendall() and recv() methods, with their explicit buffering and such, are rather primitive. Sometimes you need such low-level control over the communication, but most often you do not and would prefer to work on sockets just as file objects — and earlier I promised you that socket communication was similar to pipes from earlier, so similar to that it shall be!

We can wrap a socket in a file object that changes the interface to that we know from file objects; this is done simply by calling the makefile() method on the socket.

Using file objects, we can write the server as this:

import socket

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)

conn, addr = s.accept()
f = conn.makefile()
for line in f:
    print line,
f.close()
s.close() 

and the client like this:

import socket

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

f = s.makefile()
print >> f, 'Hello, world'
print >> f, 'How are you today?'
f.close()

EXERCISE DIST.3: Write a program that copies a file from one host to another using sockets and file objects.

Data Transfer

Using file objects you can send text between processes, and for most purposes that is all you need. After all, it was all we needed for inter-process communication earlier; we have used simple text formats for data that we could parse as needed, and for most cases that is probably the best solution, but some times we need a bit more...

Since we are now dealing with two Python programs communicating, wouldn't it be nice if we were also able to send more structured data, like lists and dictionaries, for example, over the file objects?

This problem isn't really specific to networking, but just as relevant for persistent storage of data on files. The problem is: we need to serialize data from Python into a sequence of bytes (a string) in such a way that we can re-construct the data again from the bytes.

We could use simple text formats as before, or perhaps serialize data using XML. Both solutions are excellent for interprocess communication (as long as we are not transferring huge amounts of data), especially when communicating between programs written in different languages, but for transferring data between two Python programs there are other, easier, options.

Pickling

With a text format (or XML format) we still need to parse the data, but using the python pickle module, or the faster cPickle module, we can avoid writing our own serializer and parser.

Simply put, the two pickle modules lets you dump objects to file objects and then reconstruct the objects by loading them again. Whether the file object is an actual file, a pipe or a socket is not important.

To send a list from the client to the server, we can do this:

import socket, cPickle

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)

conn, addr = s.accept()
f = conn.makefile()
lst = cPickle.load(f)
f.close()
s.close()

print lst
import socket, cPickle

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

f = s.makefile()
lst = ['Hello, world', 'How are you today?']
cPickle.dump(lst,f)
f.close()

The pickling can also handle nested structures such as a list of list, and handles the case when the same object is referenced several places in the structure, as below where we send a list containing two references to the same other list; updating through one reference changes it through both references (of course).

import socket, cPickle

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)

conn, addr = s.accept()
f = conn.makefile()
lst = cPickle.load(f)
f.close()
s.close()

print lst
lst[0][0] = 'hello universe'
print lst
import socket, cPickle

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

f = s.makefile()
lst1 = ['Hello, world', 'How are you today?']
lst2 = [lst1, lst1]
cPickle.dump(lst2,f)
f.close()

The data is still copied to the server, though, so the changes are only reflected on the server side, the client's list is not changed.

We can also transfer objects of our own classes using pickle:

import socket, cPickle

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)

class C(object):
    def __init__(self,lst):
        self.lst = lst
    def changeFirst(self):
        self.lst[0] = 'foo!'


conn, addr = s.accept()
f = conn.makefile()
obj = cPickle.load(f)
f.close()
s.close()

print obj.lst
obj.changeFirst()
print obj.lst 
import socket, cPickle

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

f = s.makefile()

class C(object):
    def __init__(self,lst):
        self.lst = lst
    def changeFirst(self):
        self.lst[0] = 'foo!'

obj = C(['Hello, world', 'How are you today?'])

cPickle.dump(obj,f)
f.close()

We are only transferring the object, though, not the class, so the class needs to be known on both ends of the connection, either by explicitly placing it here, or more realistically by putting the class in a module present on both hosts.

import socket, cPickle

HOST = ''     # local host
PORT = 50000
SERVER_ADDRESS = HOST, PORT

# set up server socket
s = socket.socket()
s.bind(SERVER_ADDRESS)
s.listen(1)


conn, addr = s.accept()
f = conn.makefile()
obj = cPickle.load(f)
f.close()
s.close()

print type(obj)
print obj.lst
obj.changeFirst()
print obj.lst
import socket, cPickle
from C import C

HOST = 'zebra03'
PORT = 50000
SERVER_ADDRESS = HOST, PORT

s = socket.socket()
s.connect(SERVER_ADDRESS)

f = s.makefile()


obj = C(['Hello, world', 'How are you today?'])

cPickle.dump(obj,f)
f.close()

EXERCISE DIST.4: What happens if the class on the server and client side is not the same?

Not everything can be pickled — see the pickle documentation for limitations — but pickling is very powerful and can be used in most practical cases.

EXERCISE DIST.5: If, in the Clustal W exercise, you wrote classes for sequences and multiple alignments, try now to serialize these and send them over a socket.

Protocols

Before two process can communicate, they need to agree on how to communicate; agree on a common language, so to speak.

This is called the communication protocol between the two processes. The processes need to agree on where the client can find the server, and once it is connected to the server, which kinds of messages are expected from which process, and how the processes should react to different messages.

We already saw the problem that can arise if both processes are either reading or writing at the same time, in exercise DIST.2, but that is just one of the problems; even if the processes agree on when to read and write, respectively, they must also give the same semantics to the messages for the communication to be meaningful.

The protocol between the processes specify this. We have already used simple protocols in the examples above — although we didn't call it by that name, we wrote the client and sever in such a way that they agreed on who should read and who should write, and such that the receiver got its data on the form sent by the sender.

We actually also made heavy use of protocols just for setting up the sockets and sending bytes over the connections — that protocol was TCP/IP (which means the TCP protocol over the IP protocol, so we are really dealing with a whole layer of protocols here: our protocol on top of TCP on top of IP, and IP is not the bottom layer, there are protocols below it as well ... there is not turtles all the way down, though, but we end up with a protocol for how the electricity should flow between the hosts...)

But enough with the fancy talk; the message here is: if you want processes to communicate, you must also specify how they should communicate. A few examples should make it clear.

Echo

A very simple protocol, the echo protocol, could look like this:

The client, after establishing the connection on port 45000, writes a stream of bytes to the server, which then returns that stream of bytes, after which both processes closes the connection.

EXERCISE DIST.6: Is this a reasonable protocol? Try implementing it...

If you did exercise 6, did you notice that we never specified how the server would know when the message from the client was completed? We know that the client is done sending when it closes the connection, but if it does that we cannot send the reply back, and if it doesn't close it we cannot know that it is done, so we end up with the sever waiting indefinitely for more data.

No, this was not a reasonable protocol. You need to be careful with these specifications!

Try this one, then: The client sends one line at a time to the server, which replies with returning the line; when the client is done, it sends an empty line after which both processes closes the connection.

EXERCISE DIST.7: Implement this protocol. Remember to flush the file object after each written line (just as you needed with pipes).

File Copy

Another protocol specification could sound like this: The server listens on port 54000, the client connects and executes one of two services: put or get.

put: is run by sending a single line on the form:

put filename

followed by a stream of bytes, after which the client closes the connection. The server is responsible for saving the stream of bytes in a file called filename.

get: is run by sending a single line on the form:

get filename

after which the server returns a stream of bytes, read from the file filename after which it closes the connection.

The server should return to listening for new connections after executing a command.

Clustal W

For the final example, consider the Clustal W wrapper from the first week.

EXERCISE DIST.8: Design a protocol for executing Clustal W alignments; make sure to specify the format of the input sequences and the output alignments. How does the client signal when it is done with sending sequences? Do you use pickling or do you serialize the sequences and alignment manually?

Summary

We have learned how to connect processes running on different hosts through sockets and how to communicate using communication protocols and how to serialize data using pickling.

We have not quite touched on how this can be used to parallelize tasks, but we will in this weeks exercises.