
Theory of operation

Contents:

    Section   Title

     1        Introduction
     2        Clusters
     3        Fault handling in DIPC
     4        DIPC keys
     5        How dipcd works
     6        The Referee
     7        How data and information are transferred
     8        Creating and accessing IPC structures
     9        The Owner concept
     10       Removing IPC structures
     11       Deleting IPC structures
     12       Local Kernel information
     13       Security
     14       System calls
     15       Shared memories
     16       The proxy
     17       UDP/IP





1.0) Introduction

 Here is some information about DIPC internals. DIPC has two parts: the
main part runs in user space as an ordinary process with root privileges and
does the decision making. It is called 'dipcd'. This may bring the
performance down, but adds flexibility to the system: changes to dipcd can be
made with no need to alter the kernel. It also simplifies the design and saves
the kernel from becoming (even) more complex. dipcd creates many child
processes to do various things during the time it is active. The other part of
DIPC is inside the kernel, and provides the first part with the necessary
functionality and information to do its job. It is not possible to run dipcd
in a kernel with no DIPC support.

 The kernel parts of the DIPC are short-circuited when the dipcd program
is not running. In the absence of dipcd, DIPC calls in a program should 
behave as if they are normal System V IPC calls. dipcd itself uses ordinary 
means to access System V IPC mechanisms. These mechanisms are changed, so 
they treat dipcd differently than other user processes. For example, dipcd 
can gain access to a shared memory segment by a shmget(), even when it has 
been removed (shmctl() with the IPC_RMID command), but not deleted from the 
kernel. All the manipulations by dipcd to IPC structures are done locally,
without them being visible outside that machine. This is in contrast to normal
user processes, whose actions on distributed IPC structures may affect other
computers in the network.

 DIPC is only concerned with the transfer of data in a distributed 
environment. Starting suitable programs in different computers is the business 
of DIPC's user/programmer. This means that the programs to be executed may 
need to be present in the computer on which they are to execute. The programs 
could be placed in different computers once, and used many times after that. 
This implies that there is no overhead for transferring code in the network 
whenever a program is starting. Considering the fact that a program's code 
remains unchanged for relatively long times, while the data that is used by 
it changes frequently (maybe from run to run), the above should be considered 
a plus in most cases. 

 It is important to note that DIPC is a set of mechanisms. There is no talk 
of policy here: How the program is parallelized, where the processes are to be 
run, etc. are dealt with by the user/programmer. There are some other tools 
available for this kind of work, though no satisfactory solution is known to
me. 

 There are two kinds of activity in (D)IPC:
 1.1) Synchronous: Here the programmer uses system calls to carry out an
action. Examples are using xxxget() to gain access to an IPC structure, or
using msgrcv() system call to receive a message. The program makes the call
and waits for its completion before continuing. DIPC will always take some
action here.

 1.2) Asynchronous: Reading and writing a shared memory may cause an
asynchronous action to take place. The programmer cannot predict if and when
such a thing may take place. One example is reading from a shared memory, when 
the relevant pages are swapped out (IPC) or are not in the requesting machine
(DIPC). The needed pages are fetched from wherever they are and the program
resumes execution. These events may or may not occur for each reference to
the shared memory.
  






2.0) Clusters

 The system consists of a number of computers connected over a TCP/IP or
UDP/IP network. Some or all of these machines could be in one 'cluster'. 
There may be more than one cluster in a network, but each machine can belong 
to at most one of them. Clusters are logical entities: they can be created
or removed, or their members changed without the need to change any of the
network's physical properties.

 Computers on the same cluster can use DIPC to transfer data and synchronize 
themselves without interfering with the workings of machines on other
clusters at all, even though they may also be using DIPC and even the same
applications. In other words, as far as DIPC is concerned, computers never
interact in any way with machines outside their own cluster. 

 It is the ability to exchange data between programs running on different 
machines in a cluster that makes DIPC a distributed system, but it is also 
possible to run all the different processes of a DIPC-enabled application in 
a single machine. They should behave as if they were using normal System V IPC.
This is because there is no explicit reference to any particular computer in
DIPC. The same program may use different computers during different invocations 
to complete its job, freeing the program of being dependent on certain 
machines with certain addresses. This also means that programmers can use 
single machines to develop their application, and later run it in a 
multi-computer cluster. In other words, the user is able to do the
final mapping between the resources needed by a program and the physical
resources available. See the program in the examples/pi directory as an example.







3.0) Fault handling in DIPC

 Two kinds of error can be encountered by programs using DIPC:

 3.1) The same errors that can also happen in normal IPC. For example,
referring to a non-existent IPC structure, or trying to access a structure
without the proper permissions. Synchronous errors are detected using the 
return value of the system calls. Asynchronous errors are dealt with using
asynchronous mechanisms: signals.

 3.2) Errors that are specific to DIPC. They may be caused by the following
situations:

 3.2.1) A computer becomes unreachable to one or more machines in the 
cluster. This may happen if the network is damaged. But it can also happen 
when that computer is very slow, or the network is congested, in which case 
other computers 'think' that it is unreachable.

 3.2.2) A machine crashes. In this case that machine won't be able to take
 part in DIPC actions, causing some operations to fail.

 3.2.3) The dipcd program in a machine ceases to exist or malfunctions for 
any reason. In this case no DIPC operation can be done in that machine. 

 It is assumed that when a failure arises somewhere, no false message will be
transmitted (fail-stop), so a time out mechanism can be used to detect them,
and this is what DIPC does. There are a number of different time out values 
used by DIPC to do this.

 Trying to recover from an error in the system (such as a network
problem) is difficult and sometimes impossible, and adds to the complexity
of DIPC. So when DIPC detects an error, it does not try to overcome it, but
does exactly as IPC does: it tries to inform the application about the error
via a return code, or in case the process has been trying to access a shared 
memory segment, by a signal (SIGSEGV). The rest is up to the application.
You should remember that only the responsible process(es) will be informed
about a failure to do something, not all the processes of a distributed
program.

 DIPC tries everything at most once. Either it succeeds or it fails; in
either case no retries are made (at-most-once semantics). This means that
there is no possibility of the same request being serviced more than
once.

 It should be noted that errors may cause inconsistencies to appear in the
system. For example, an IPC structure may be deleted in one computer, but
others may fail to know about it. They think that that structure is still in
existence, which may later cause problems. See the programs in the tools
directory for a partial remedy to this problem.







4.0) DIPC keys

 Processes in a cluster can use the same 'key' to access an IPC structure.
This structure should first be created. This is done by one of the 'xxxget'
system calls (shmget(), semget() or msgget()). After creation and 
initialization, other processes, possibly in other machines, may be able to 
access this structure. So a key here can have the same meaning in all the 
computers of the cluster. Put another way, computers in the same cluster 
have a common DIPC key space. 

 Much legacy software with no accessible source code may use System V IPC
to do its work, and some programs may use the same keys. As these programs may
be needed on several machines in the cluster, it was important to make them
work without interference under DIPC. So it becomes necessary to allow
two different kinds of IPC keys in a cluster:

 * local keys, which are used only in a single computer, and are usable only
by processes on that computer.

 * distributed keys, which are used to refer to the same IPC structure in
the whole cluster.

 A distributed key has a unique meaning in all the machines in the cluster, 
and referring to it should denote the same IPC structure. This means that 
when an IPC structure with a distributed key is created, there should be no
other structure (local or distributed) with the same key in any other
computer. This makes it clear that to be able to decide about creating a
new structure, the keys of the existing IPC structures in the cluster should
be known in the whole cluster, even if they are local.

 So as to make older programs able to run, different IPC structures with the
same local key can exist in different computers. And for the same end, local 
keys are the default key type: creating a distributed key requires the
programmer to explicitly add the 'IPC_DIPC' flag to the other usual flags while
creating or gaining access to an IPC structure with an xxxget() system call.

 If it were not for the above requirement, DIPC could be used totally
transparently. Even now the ONLY thing the programmer has to do is to use the
IPC_DIPC flag.







5.0) How dipcd works

 This section describes how TCP is used. For more information about UDP 
please refer to the UDP/IP section of this file.
 
 In DIPC ordinary user programs interact only with the local kernel of the
computer they run on. Requests of user programs for DIPC actions, whether
synchronous ones like system calls that should be executed remotely, or
asynchronous ones like trying to read from a shared memory whose pages are
not available in the local computer, are routed inside the kernel. All such
requests are put in a single linked list. The process id of the requesting
user task is noted, and is used to find the original request when the results
come back. This way the results can be delivered to the correct user
program.

 The kernel will in turn refer to the user space part of DIPC (dipcd) to 
actually fulfill these requests. This means that the presence of dipcd is 
transparent to user programs, and as far as they are concerned, it is the
kernel that satisfies their requests.

 A part of dipcd is always waiting inside the kernel to collect these 
requests. Whenever it finds new ones, other parts of dipcd are activated 
to satisfy them. These parts get any necessary data (for example the
parameters for a system call) to do the request from inside the kernel
and return any results back to the kernel, where they will be delivered
to the suitable user process.

 As will be seen, it is dipcd that actually executes remote functions,
transfers any data over the network, or decides which computer can read or 
write to a distributed shared memory. It also keeps necessary information 
about the IPC structures in the system and arbitrates between processes on 
different machines wanting to access the same structures at the same time.

 It is apparent that there should be some provisions for dipcd to access
the needed information in the kernel structures. A new system call, 
multiplexed with other System V IPC calls, is added to Linux to do just
that. dipcd and other DIPC tools (like dipcker) use it to transfer data to
and from the kernel. This new system call (used in DIPC programs as dipc()), 
is strictly for use by the dipcd and other related programs. Ordinary user 
programs should not use it.

 It is important to remember that the IPC system calls initiated by dipcd 
are always executed inside the local kernel. This is true even when working 
on distributed IPC structures. This is part of 'special treatment' of dipcd
processes by a DIPC-enabled kernel.


5.1) dipcd processes
 dipcd forks several processes to do its job. You can see that all the 
processes have their own source files. The following is a list of processes
in dipcd. There are some cross references between them, so a 'multi-pass'
reading may be needed to understand it.

 * back_end (in dipcd/back_end.c): 
 The main() of dipcd is here. There is only one back_end in any computer of
the cluster. back_end and front_end (see below) are the only processes
in dipcd that use fork() to create other processes. back_end starts
the system: it registers itself with the kernel as the process responsible
for handling DIPC requests, reads the configuration file, initializes
the variables, and forks the front_end task. If the machine it is running
on is the same as the referee machine, it also forks the referee process. 

 After this, back_end gets in a loop, gathering requests and related data, 
whenever available, from inside the kernel. These requests can (among others) 
be for reading or writing to a distributed shared memory, or they can be 
inquiries about a certain IPC structure. back_end will then fork an employer
process (see below) to handle each of these requests.

 * front_end (in dipcd/front_end.c): 
 front_end has the responsibility of handling the incoming requests and data
over the network for a computer. There is only one front_end for each machine 
in a DIPC cluster. The reason is that in TCP/IP networks, to connect to a 
computer you have to know not only the IP address of that machine, but also 
a TCP port number. dipcd creates an unknown number of processes to do
its work, and many of them need to receive information over the network.
So as to get rid of problems encountered when each process uses its own TCP
port number, it was decided to handle all incoming requests by a single task.
Now each process knows which TCP port number should be used to connect to
another DIPC-enabled computer.

 Any process wanting to interact with another machine can connect directly
to the front_end of that computer (this does not include interacting with the
referee; see below). If the incoming connection is for doing work, the
front_end forks a worker process to handle it. Otherwise it just passes the 
incoming data to its destination process, which may be an employer waiting for 
results, or a shared memory manager (see shm_man).

 * referee (in dipcd/referee.c): 
 referee has the important role of keeping order in the system. There is
only one referee in the whole cluster, so only one computer has it running.
For DIPC to work, each computer should have the address of the machine that
runs the referee. Like the front_end, it has its own TCP port number, and
other processes can reach it directly.

 referee keeps all its information about the existing IPC structures in a 
number of linked lists (in DIPC documents referred to as referee tables).
Some are for IPC structures that have been created, and some are for
structures that do not yet exist, but that a remote process is trying to create.
This makes referee the name server of DIPC.

 referee provides a 'back door' mechanism (which uses a UNIX domain socket),
through which processes can send it commands and receive some information.
See the documentation for the dipcref program (in the tools directory) for more
information.

 See the section 'The Referee' below, for more information.

 * shm_man (in dipcd/shm_man.c): 
 This is created in machines that own a shared memory segment (the machine 
that created the corresponding structure first) as the manager of the shared
memory. There is one shm_man process for each distributed shared memory in 
the system. It is executed by the employer task that handled the successful 
shmget(). shm_man determines who can read or write from/to the shared memory, 
and keeps an account of them. It also manages the transfer of shared memory 
contents to computers that need them. shm_man will attach the shared memory to 
itself. This is to prevent the shared memory from being destroyed if it is 
removed (shmctl() with the IPC_RMID command) and all the processes in the 
creator machine detach it from themselves. It would cause problems if other 
processes in other computers still needed the shared memory.

 There are no processes forked to handle semaphore sets and message queues
in the computer that creates them (no sem_man and msg_man). Only shared 
memories get a dedicated process to manage them. The reasons for this are:

 1) Shared memories need information (i.e. the list of computers accessing
them for reading or writing) that is used for a long time. Keeping it across
many processes is not easy.
 
 2) More than one outstanding request for a shared memory read or write may 
be present at the same time. So a central decision making entity is needed.

 3) A shared memory should be attached to the address space of a process for
the system not to delete it. See above.

 * employer (in dipcd/employer.c): 
 It is started to handle the remote execution of a system call (such as
shmctl()) or to handle other kinds of requests (such as shared memory
read/write requests). It connects to the appropriate computer (for example the
one on which the IPC structure was first created) and after passing the 
necessary data to the responsible process there (for example the front_end), 
waits for a reply. It uses time outs to find out about possible network or
machine problems.


 * worker (in dipcd/worker.c): 
 This is started by the front_end process to execute requested actions. The
requests may come from employers, the referee, or a shm_man, and may include 
executing a remote system call or transferring the contents of a shared memory 
from computer to computer. To complete its work, a worker may connect to the
original requesting machine and deliver any results. A worker may turn into
a proxy and stay alive after doing its assignment. See the section on proxies 
for more information.


5.2) How dipcd processes are created
 Threads allow multiple tasks to run in parallel and share a common address
space. Unfortunately there are no threads in Linux. This means that if a
program wants to do multiple things at the same time, it has to use the
fork() system call and, considering that there is no common address space,
devise some means of communication between these processes.

 Another way around this problem is for a process to finish each assignment
as soon as possible; this way it can serve requests one after the other. dipcd
uses both methods: some processes are forked when they are expected to
be active for a long time (e.g. referee is forked), and some processes do
their work sequentially, hoping that the requests for service won't
overwhelm them (shm_man works like this).

 The following diagram shows the creation relationships between dipcd 
processes:

 user --1--> back_end --4--> employer --6--> shm_man
              |
              +--2--> front_end --5--> worker
              |
              +--3--> referee
(Diag. 1)

(1) dipcd should be executed by the user (or automatically in a startup
script). The back_end process gains control.

(2) The front_end process is forked by the back_end. 

(3) back_end forks the referee process if this computer should also house
the referee.

(4) back_end forks an employer when it discovers work for DIPC inside the
kernel.

(5) front_end forks a worker whenever an incoming request should be handled 
by a new process.

(6) The employer executes the shm_man code when it determines that it has
helped successfully create a distributed IPC structure for a shared memory.


5.3) How dipcd components talk to each other
 Different parts of dipcd use a structure called a 'message' (not to be
confused with IPC messages) to pass any information between themselves.
Included in these messages are the sender and the receiver IP address, local 
pid of the sender, the function to be performed, necessary arguments and so 
on. In some cases there may be a need for more data than can be put in
the message structure, so other information may follow the message. The
contents of an IPC message, or of a msqid_ds structure used in msgctl() with
the IPC_SET command, are two examples of data sent after a dipcd message.

 Each message has a request field, which determines its destination process.
It may be REQ_DOWORK for workers, REQ_SHMMAN for the shared memory manager 
and REQ_REFEREE for the referee. The reply to each message contains in its 
request field one of RES_DOWORK, RES_SHMMAN or RES_REFEREE, depending on 
who is answering the message.

 TCP/IP or UDP/IP sockets are the main way of transferring information from 
machine to machine. dipcd can be told to use either of these protocols (but
not both at the same time), with a command line option when it is started.

 UNIX domain sockets are used for communication between dipcd tasks that 
always run on a single computer. Using sockets in every case unifies the way 
processes exchange data among themselves. UNIX sockets are created in dipcd's 
home directory. 

 The processes that set up TCP/IP sockets are the referee and the front_end.
All inter-machine communications should involve one of them. The processes
that set up a UNIX domain socket are the employer (for receiving any results
from the front_end of the local machine), the referee (for the 'back door'
mechanism), the shared memory manager (to receive requests about the
distributed shared memory that it manages and also for the 'back door'
mechanism), and the worker, when it is going to execute a semop() system
call. In the case of the shm_man and the employer, any data from other
computers is first sent to the front_end's TCP/IP socket, and it will copy
them to the UNIX domain sockets. The referee's UNIX domain socket is accessed
locally by the dipcref tool. The worker process receives a file descriptor
corresponding to an Internet socket when it is forked by the front_end, so it
can directly communicate with remote processes. The proxies receive their
data from the front_end via their UNIX sockets.

 The name of the employer UNIX domain socket is determined like this: First
is the string 'DIPC', then comes the type of IPC structure it serves, 
('SEM' for semaphores, 'MSG' for message queues and 'SHM' for shared
memories) plus the corresponding IPC structure's key, and the process id of
the employer. This makes the socket name unique, so the results can be sent
back to the right employer. For example, DIPC_SEM.45.89 is for an employer
with process id 89, that is handling a semaphore request. The semaphore
set has the key 45. The shared memory manager's socket name is also determined 
like this, with the exception that it does not include its process-id number, 
because different processes may need to contact it without knowing this 
number. In fact the processes requesting to access the shared memory know 
nothing about the process that is managing the shared memory. See the
section on proxies to see how proxy processes set up UNIX domain sockets.

 Some processes have to use TCP/IP sockets for their information exchange
in most cases (for example between a shm_man and the front_end of a
computer that is a reader or writer of the shared memory), because they
usually reside on different computers. But it may happen that the two
processes are on the same machine. So as to keep the algorithms general, here
also TCP/IP sockets are used, but this time the addresses are changed to
local loop-back ones.







6.0) The Referee

 It is possible that more than one process in the cluster wants to create
an IPC structure with the same key. These processes may be in the same
computer or in different machines. So as to prevent unwanted interactions
due to different processes trying to do the same thing, and possibly creating
an inconsistent state in the whole system, the creation of an IPC structure
should be done atomically, meaning that while one process is trying to
create an IPC structure, no other process should try the same with an
identical key.

 The kernel parts of DIPC prevent more than one process in the same machine
from attempting to create an IPC structure at the same time. They are
serialized inside the kernel. Accomplishing the same across the cluster is
made possible by a special process responsible for this job: it plays the
referee among requests from different machines and registers the necessary
information about all the IPC structures in the system. It also controls the
attempts to remove, or otherwise manipulate, IPC structures. This process is
called the 'referee'.

 There is a single referee in a cluster. All the machines in the same cluster
should know on which computer the referee is currently running, and refer to 
it when needed. In fact, it is having the same referee that places two or more
machines in the same cluster. In other words, a cluster is made of the machines 
that have the same referee. The referee address can be assigned by the system 
administrator, using the dipc.conf configuration file. Changing the address 
of a referee in a computer places that computer in another cluster. A machine 
that is running the referee process can also act as any other machine in the 
cluster.







7.0) How data and information are transferred

 In ordinary IPC, system calls are executed locally. Data provided by
a process as parameters of a system call are copied inside the kernel and 
kept there. Each IPC call should return some result to the caller's address
space. The calling process will not be able to continue until it gets the
results back. During this time it is waiting inside the kernel. The amount 
of time between making a call and getting an answer can vary greatly for 
different system calls and the IPC structure state. Some calls (e.g. xxxctl()
with the IPC_STAT command) return soon; others (e.g. a semop() call) can take
a very long time, or forever, to return. See the following diagram.

 Making the call:    
  process --1--> local kernel
 
 Getting the results: 
  process <--2-- local kernel
(Diag. 2)

 First the parameters are copied to the kernel (1), and then the results
are returned (2).

 So an IPC system call requires two copies between the user and kernel
address spaces. Note that the amount of data (parameters or results) copied 
can be very small (a single integer, containing an error code) or it 
can be quite large (the contents of an IPC message).

 In the following, remember that the dipcd program runs in user space and
uses ordinary means to communicate with the kernel. We also assume that the
user process under discussion is not running on the owner machine. Here 'data'
means either input parameters, or the results.

 RPC (Remote Procedure Call) is used to execute a system call on a remote
computer. So as to ensure transparency, no process using DIPC should see any 
changes relative to the normal (local) IPC activities. Here too, data are 
copied from a process to kernel's memory. dipcd then brings this data to its 
address space and transfers it over the network to the computer that is
responsible for handling the request. This could be the computer on which an
IPC structure was first created, in which case it is called the owner of that 
structure (see the section on owners below for more information). The remote 
dipcd will copy the newly arrived data to the owner machine's kernel space. 
This is 3 copies and a network access for a process to simulate a system call 
in the destination computer. 

 After this, the system call can be executed by dipcd in the remote kernel
and the results will be sent back much the same way as described before: The
remote kernel will copy the data to user space, so that dipcd can transfer
them over the network. The dipcd at the original process's site receives this 
data and sends it to the local kernel. Now the data are copied from there to 
the original process's address space. See the following diagram.
 
 Making the call:                   Network
           (Requesting Computer)      |     (Responsible Computer)
    process -1-> local kernel         |
                  |                   |
                  +-2-> local dipcd --|-3--> owner dipcd -4-> remote kernel
                                      |          
 Getting the results:                 |
                       local dipcd <--|-6-- owner dipcd <-5- remote kernel
                                |     |
  process <-8- local kernel <-7-+
(Diag. 3)

 The process is suspended until the results come back.

 As can be seen, the user process interacts only with the local kernel,
seeing no changes in calling System V IPC routines. 

 Note: Remember that transferring data over the network also requires copies
to and from the kernel. This is due to the design of networking code inside 
the kernel and thus is inevitable. It will not be considered any further in 
the discussions.

 The following algorithm shows how the decision to execute an operation
remotely is taken.

Start
 1. If dipcd is present and the caller is not a dipcd process then
  1.1 If the operation is supported by DIPC then
   1.1.1 If the operation is on a valid distributed structure and this is
         not the owner computer then
    1.1.1.1 Execute the call remotely.

7.1) Reducing the copies
 Apparently not much can be done about the network access. Some operating
systems use address remapping (using the Memory Management Unit (MMU) to
give another process access to a part of a process's memory), so the data
transfers between the kernel and user address spaces can be done very
cheaply. Unfortunately Linux does not use this method in implementing System
V IPC mechanisms.

 One possible way to lessen the number of copies between user and kernel spaces 
is to refrain from copying data from a process's address space to the kernel 
memory, when that process is not running in the responsible machine, as it will
again be copied to the address space of dipcd to be sent over the network.
Also, it would help not to copy the data arrived over the network into the 
kernel, when we know it will be recopied to a user process's address space.

 To do so we could use user space stubs to replace ordinary System V IPC code
in the kernel. The stub code could examine the data and the destination, and
decide if it should send the data over the network, without bothering with
the local kernel. Another process on the responsible machine should receive
this data, carry out the operation, and send the results back, where they are
delivered to the original stub's process. For this to work, we should provide
stubs to replace ordinary IPC stubs. We could provide object files that had
to be linked to the program, or better than that, we could change the standard 
C libraries.

 Two problems should be addressed here:
 * How to find the waiting process: The stub of the waiting process will 
wait in the user space, and a mechanism for rendezvous should be devised.

 * What if another process has not yet claimed the data: The sending process
may continue execution as soon as the data is transferred, and since the
receiving process has not yet asked for the data, the data has to be kept
somewhere.

 Assuming that the above problems were solved, we would be able to save two
copies between user and kernel memories.


7.2) Why DIPC has not used the above methods

 The first reason is that solving the above-mentioned problems in an easy
and satisfactory manner may not be possible.

 The second reason is that having to link a special object file to a
program would be a nuisance and would cause errors if forgotten. Also, I did
not have access to the standard C library sources. In any case, an upgrade of
DIPC might have required the recompilation of programs using it. The other
drawback is that older programs not linked with the changed stubs would not
notify DIPC of their presence and would not give it the needed information.
DIPC should know about all the IPC keys in the cluster, even if they
represent ordinary IPC structures. Programs not compiled with the new stubs
would cause DIPC to malfunction.

 We should also keep in mind that people may want to use DIPC with languages
other than C, and there may be insoluble problems in providing altered
object files for all these languages.
 
 Using the kernel environment for information transfer and task sleeping has 
several advantages:
 * There is a known place where data could be kept and processes could be put 
to sleep, so that they could later be found easily.

 * All the programs, written in any language, old and new, will behave the 
same in regard to DIPC. Also, DIPC code can be changed without the need to 
recompile user programs.







8.0) Creating and accessing IPC structures

 The normal behavior for programs wishing to use System V IPC mechanisms
for data exchange is like this: A process first creates an IPC structure
by a suitable xxxget() system call using an agreed-upon key. Other processes
can now use xxxget() to gain access to the same structure created before. In
normal IPC, the first process causes the appropriate structures to be set up
inside the kernel. Subsequent xxxget() calls merely return a numerical ID
value that can be used to refer to that structure. All these processes use
and manipulate a single structure.
 
 DIPC tries to mimic this situation as much as possible. Processes on
different machines need to use the same key to be able to refer to the same
IPC structure. In DIPC, when a process wants to create or gain access to an 
IPC structure with a certain key, the local kernel is first searched to see 
if that key is already used. This is quite like normal IPC. If the structure
is found, the request is handled locally, with no reference to the referee. 
But if an IPC structure with that key is not found, then the referee should 
be consulted to find out if that key is already used in the cluster. 

 The referee searches its tables for the key, and tells the requester
whether the key was found, and if found, whether it is a distributed key.
referee can answer immediately if the key is present, or if the key is not 
present and no other machine has queried about it. After sending the 
information to the requesting computer, the referee expects a reply, informing 
it whether that machine locally created an IPC structure with that key or not, 
so that it can update its information, if necessary. But if the key was not
found and the referee had already sent a message telling so to another
machine, all further requests for that same key are not answered, until that
other machine tells the referee if it created an IPC structure or not. When
this information arrives, the referee can proceed to answer other waiting
requests.

 This centralized (serialized) algorithm is simple and easy to implement. A
distributed algorithm would be very complex and would require much message
passing. Also, consider that creating an IPC structure consumes a relatively
small portion of the total time a program runs.

 In the following two tables, the remote-key-type is the key type as
registered by the referee and returned to the requesting computer. The
requested-key-type is the type of key desired by the local requesting
computer. In all the cases, the key has not been found in the IPC structures
of the local kernel.

 The following table shows how a process decides if it can create a structure 
when it wants a previously non-existent IPC structure (xxxget() with an 
IPC_CREAT flag). 

 requested-key-type       remote-key-type             action

       local                non-existent               create
       local                   local                   create
       local                distributed                fail 
    distributed              non-existent              create
    distributed                local                   fail
    distributed              distributed               fail

 The following table shows how a process decides if it can create a structure
when it wants access to a previously created IPC structure (xxxget() with no 
IPC_CREAT flag).
 
 requested-key-type       remote-key-type             action

       local                non-existent               fail
       local                   local                   fail
       local                distributed                fail 
    distributed              non-existent              fail
    distributed                local                   fail
    distributed              distributed               check
 
 Check means to make sure that the specified flags in xxxget() and the
permissions of the distributed IPC structure allow the process to access it.
If so, the 'action' becomes 'create'.

 Here the rules for checking the flags are nearly the same as in ordinary
IPC. For example, specifying IPC_EXCL and IPC_DIPC in the same system call
fails if an IPC structure with that same distributed key is present in
another machine. The method used to check the permissions is this: The
effective user's login name, along with all the parameters of the xxxget(),
are sent to the machine which has the structure with the distributed key.
dipcd then executes the xxxget() system call on that computer. If it
succeeds, the same will be tried on the original requesting machine. If any
of the above actions fail, then the original xxxget() will fail. For more
information about the way DIPC handles user names, refer to the security
section.

 If the 'action' becomes 'create', a local structure will be created even if 
another structure with the same key is present somewhere else in the cluster. 
In short, in any machine in which some processes use a key to communicate,
there exists a structure with that key.

 The following diagram shows how DIPC handles a new IPC creation request:

 Phase one: Search
                                    Network
          (Requesting Computer)       |        (Referee Computer)
                                      |
       |-1->back_end >-2-+            |
       |                 |            |
kernel |<--------6--employer --3------|-> referee 
       |            |                 |        |       
       |            +-5-< front_end <-|------4-+
(Diag. 4)

 It is assumed that an IPC structure with the specified key does not reside
in the requesting machine. In this case the back_end finds the request to 
search for a key (1) and forks an employer to handle it (2). The employer
calls the referee and queries it (3). The referee sends the answer (found or
not...) to the front_end of the requesting computer (4) and from there it is
delivered to the original employer (5). The employer gives the results back
to the kernel (6). The kernel should determine if the IPC structure can
be created or not. Now the referee will start a timer to make sure that this
information will get to it in time. After this, Phase two (Commit) can begin.

 You can refer to the previous diagram for the second phase. In the commit
phase the referee is informed about the outcome of the xxxget() system call.
All the actions are the same as in the previous phase. The back_end finds
out about the outcome of the xxxget() (1) and informs the referee (2 and 3),
but this time the purpose of actions (4), (5) and (6) is to make sure that
the original requesting process continues only after the referee's
information is updated.

 Refer to the following algorithm to see how, as far as the kernel part of
DIPC is concerned, a new structure is created or accessed. The * indicates
that there may be a network access.

Start
 1.If the specified key is used in a local structure then
  1.1 End algorithm.
 2.If dipcd is present and the caller is not a dipcd process then
  2.1 Search the referee for the key (*)
  2.2 If the key was found then
   2.2.1 Test the compatibility of the requested structure and the remote
         structure. In case both are distributed, do a permission check (*).
         If everything was okay, try creating the structure.
  2.3 Inform the referee of the results (commit) (*).
END

 FAULTS: Problems in the back_end and the employer are promptly conveyed to
the kernel. Errors in other parts are detected by the employer's or the
referee's timeout.
  
 If it was not possible for the system to get the needed information from
the referee, it will be assumed that the key is unused, and the system
behaves as if the referee said so. This means that it will assume it is the
'owner' (see the next section). Though this allows the process to continue,
it may cause problems when there actually are IPC structures in other
machines with that same key. The same kind of problem can happen when the
commit does not get to the referee. In this case a timeout will occur and
the referee will assume that the requesting process failed to create the
structure, even though this may not be true. Now the referee may have wrong
information.




 


9.0) The Owner concept

 This concept has significance only for IPC structures with distributed
keys. When a process first creates an IPC structure with an xxxget() in the
cluster, the computer on which it runs becomes the 'owner' of it. All the 
operations for manipulating this structure are done at the owner machine. 
This effectively means that there is only one active instance of an IPC 
structure. This property greatly contributes to the simplicity of DIPC and
to the preservation of the IPC semantics.
 
 In DIPC, if a process is not running in the owner computer, all its IPC
system calls will actually be executed on the owner machine. This is done
using Remote Procedure Calls (RPC). Here all the necessary parameters
and data are sent to the owner machine and another process there executes
the call on behalf of the original one, and returns any results. DIPC system
calls in the owner machine are executed like normal IPC.
 
 In DIPC, a data producer and a data consumer process may be in the following 
situations:

 * Both are on the same machine, which is the owner.
 Here everything is like normal IPC calls.
 
 * Both are on the same machine, which is not the owner.
 In this situation, the producer will send its data to the owner computer.
The consumer will then refer to the owner for the data, which will then send
it back. The data makes a round trip.
 
 * Each is on a different machine, but one of these machines is the owner.
 One process has to send/receive data to/from the owner, but the other one
uses normal IPC techniques.

 * Each is on a different machine, neither of them being the owner.
 This is like the second situation, but here the data uses the owner
computer as a transit point to go from the producer computer to the consumer
machine.

 The following reasons were influential in deciding to involve the owner in
all DIPC activities:

 * This centralized approach simplifies the algorithms:
 A process knows where it should send the data it has produced, and where to
find the data it wants, without the need to query all the machines in the 
cluster. The situation would be worse if the requested data had not yet been
produced, because in this case the requester would not know where to wait
for them. A related problem would be how to find and inform all the waiting
processes in the cluster when some new data is available, and how to decide
which one of them should get the data.

 * This is more like the way ordinary IPC behaves:
 Suppose two processes on two different machines produce some data. In the
absence of synchronized clocks, it would be very hard to say which
data was produced first. The above method serializes producing and consuming
data. This means that relative to the owner machine, there is a definitive 
order between data production and consumption. This is very much like the 
semantics of ordinary IPC mechanisms.

 It is not possible to change the owner of an IPC structure after it is
set up. In other words, there is no migration in DIPC.







10.0) Removing IPC structures

 All requests for IPC structure removal (xxxctl() with the IPC_RMID command)
are sent to the owner machine, where they are executed. If the structure is
actually removed, the referee is notified, which then informs other machines
with IPC structures of that same key to also remove them. The removal is done
with superuser privileges.

The following diagram shows how a key removal is handled:
Phase one: Informing the referee:
                                    Network
               (Owner Computer)       |        (Referee Computer)
                                      |
       |-1->back_end >-2-+            |
       |                 |            |
kernel |            employer --3------|-> referee 
       |                              |
(Diag. 6)

 Here the owner computer has removed the IPC structure. The back_end finds
out about it (1) and forks an employer to handle it (2). The employer
informs the referee about the removal (3).

 The following diagram shows what happens after the referee receives this
information:

                          Network
 (Referee Computer)          |        (Computer With the Key)
                             |
                             |                              |
            referee --4------|-> front_end --5-->worker -6->| kernel
                             |                        |     |
                             |                        +-7--<|
(Diag. 7)
 
 The above is done for all the computers that have created an IPC structure
with that key: The referee sends a key removal request to the front_end
(4), which forks a worker to do an xxxctl() and remove the structure with
root privileges. The worker gets the results of this, but the referee does
not care about (or wait for) them (can this produce problems?).

The referee also checks to see if the IPC structure is that of a shared
memory and if there is only one computer that has the structure. If this is
the case, then it sends a request to the shared memory manager (shm_man) on
that machine to detach itself from the shared memory. This is because now we
know that no process in any other computer has the shared memory and
so will not refer to it, so the original reason for the attach (preventing
the shared memory from being destroyed when all the processes on the owner
computer have ended) does not hold any more. The following diagram shows the
actions that follow, in case the above is true.

                          Network
 (Referee Computer)          |        (Owner of the Shared Memory)
                             |
            referee --8------|-> front_end --9-->shm_man
                             |
(Diag. 8)

 The referee informs the shm_man to detach itself (8 and 9). Here also the 
referee does not wait for an acknowledgment that the shm_man has actually 
detached.

 The following diagram shows the rest:
                                     Network
               (Owner Computer)        |        (Referee Computer)
                                       |
       |                               |
kernel |<-------12--employer           |   referee 
       |            |                  |        |       
       |            +-11-< front_end <-|-----10-+
(Diag. 9)

 Now the referee informs the original employer that it can continue (10 and
11). The employer will then inform the kernel that the original user process
can continue (12).

 FAULTS: No special action is taken when the referee can not be notified
about the removal. In this case the owner will remove the structure,
while other machines having a structure with the same key won't know
anything about it.

 Also note that it is possible that some computers are temporarily
unreachable, or that they may not have removed their structure fast enough.
As mentioned, the referee does not care about these things: No special action
is taken if it can not connect to a computer and tell it to remove. Also, it
does not wait for an acknowledgment before continuing. See the programs in
the tools directory to find a partial remedy for these problems.







11.0) Deleting IPC structures

 After a structure is removed, it may be deleted (this is always the case
with message queues and semaphore sets). When an IPC structure is deleted,
the referee is notified about it, and it removes that machine from its
tables as a holder of that structure.

The following diagram shows what happens:
                                    Network
          (Requesting Computer)       |        (Referee Computer)
                                      |
       |-1->back_end >-2-+            |
       |                 |            |
kernel |            employer --3------|-> referee 
       |
(Diag. 10)

 When a computer deletes an IPC structure, the back_end will be informed (1).
It forks an employer (2), which brings the news to the referee (3). The
referee deletes its corresponding entries and checks to see if the structure
is that of a shared memory, and if there is only one remaining computer that
has it. This should be the owner. In this case the referee acts according
to the following diagram.
 
                          Network
 (Referee Computer)          |        (Owner of the Shared Memory)
                             |
            referee --4------|-> front_end --5-->shm_man
                             |
(Diag. 11)
 The referee informs the shm_man to detach (4 and 5). This is like the case
of removing a shared memory IPC structure. 


 Now comes the rest:
                                    Network
               (Owner Computer)       |        (Referee Computer)
                                      |
       |                              |
kernel |<--------8--employer          |   referee
       |            |                 |        |
       |            +-7-< front_end <-|------6-+
(Diag. 12)

 The referee informs the employer that all is done (6 and 7). The employer
informs the kernel, so that the original user process that caused the
deletion can continue.

 FAULTS: No special action is taken when the referee can not be notified
about the deletion. Here the referee will think that the computer still has
the structure, while this is not true. This may cause the referee to give
wrong answers about that structure to other machines. See the programs in
the tools directory for a partial remedy to this problem.







12.0) Local Kernel information

 UNIX was not designed as a distributed system. Much of the information you
find in it is produced independently of any other computer and has meaning
only in the machine that produced it. This includes the time, user-id,
group-id and process-id values. You can not refer to a process by a
process-id brought from another machine. There may not be any process on the
other computer with that number, or worse, you may find a totally unrelated
process.

 Some of these problems can be solved by using alternate means of
identification (like using a login name instead of a user-id; see the section
on security), but solving others is more difficult and may not be possible
without making a lot of modifications to the design and implementation of the
Linux kernel, which would mean losing compatibility. So the time and
process-number information in IPC structures, for example those returned by
xxxctl() system calls with the IPC_STAT command, have no meaning in other
machines.







13.0) Security

 Unix uses user-id values to differentiate between users. People get
permission to do things, based on what they are (owner, in the same group,
or other) relative to the objects they access. User-id values are assigned
independently in each computer, so two different users on two different
machines may be assigned the same id value. Programs using DIPC need to
execute code on other computers, mostly accessing kernel structures. This
calls for a measure to provide some degree of security in the system.
 
 Considering the uselessness of user-id values, and also to be as similar
to the general UNIX security mechanisms as possible, while requiring no
changes to the Linux kernel, it was decided that in DIPC, a user on whose
behalf a remote system call is executed should be known in that remote
machine by the same login name. The dipcd program executing the remote call
assumes the same effective user-id as that of this user name. Obviously, in
order for this to be useful, or even meaningful, this name should exist and
denote the same person in both machines.

 One exception to the above rule is the root user, which has extreme powers
in a UNIX computer. Being the root of one machine does not necessarily mean
being the root of any other computer, so a root user can not retain its
identity in another machine. For this reason, all root users are mapped
to a user with the name 'dipcd'. Administrators can control this user by
adjusting its user-id and its group-id relative to other groups. Programmers
can choose the IPC structure permissions to allow or disallow such accesses.

 Important note: Beware of accidental name clashes in the same cluster,
which may give unauthorized users the ability to affect other people's IPC
structures. Administrators should make sure that a unique name in the
cluster is used for one person only.


 The other security measure, suggested by Michael Schmitz, deals with the
problem of intruders in a DIPC cluster. The file '/etc/dipc.allow' lists the
addresses of computers that can be part of a DIPC cluster, i.e., trusted
machines. The referee will simply discard requests from computers whose
address can not be found in this file. The '-s' (secure) option causes the
back_end to likewise take no action on requests from untrusted computers.
Refer to the dipcd.doc file for more information.

 Faking an IP address will not work in DIPC. The reason is that dipcd always
uses a new TCP connection to reply to a request, meaning that the reply will
be routed to the real computer, not to the impostor.

 





14.0) System calls

 Any of the following conditions cause a DIPC system call to be executed 
locally, on the machine making the call:
* The dipcd process is not running.
* This machine is actually the owner.
* The requested command is not supported by DIPC for remote execution, like
  SHM_UNLOCK in shmctl().
* The process making the call belongs to dipcd.

 All other cases lead to the remote execution of the system call.

 The following diagram shows the way a user DIPC system call (for example
msgsnd()) is handled by RPC.
                                    Network
          (Calling Computer)          |        (Owner Computer)
                                      |
       |-1->back_end >-2-+            |                              |
       |                 |            |                              |
kernel |<--------9--employer --3------|-> front_end --4-->worker -5->| kernel
       |            |                 |                    |   |     |
       |            +-8-< front_end <-|--------------7-----+   +-6--<|
(Diag. 13)

 The back_end finds requests in the kernel (1); the address of the owner
machine should already have been determined by the kernel. The back_end then
forks an employer to handle the request (2). The employer connects to the
front_end on the owner machine (3). The front_end forks a worker (4), which
actually executes the system call (5) and gathers any results (6). The
results are sent back to the front_end of the calling computer (7), which
gives them back to the original employer (8). The employer then gives the
results back to the kernel (9), from where the user process receives them
(not shown).

 In a network the order of sending two requests may be different from the
order of receiving their replies. So as to ensure that no races can occur, the
operations on an IPC structure are serialized: While a process is waiting for
something in a remote machine, other processes wanting to use the same
structure will be put to sleep.

 FAULTS: Errors detected by the back_end and the employer result in an
appropriate error message being returned to the kernel. Errors in other parts
are ignored and only result in an error message being output by the dipcd
process that encountered it. The employer will then time out and inform the 
kernel about it.

 The following are the supported system calls:

14.1) msgctl(), msgsnd(), msgrcv(), semctl(), semop() and shmctl()
 These are actually executed on the owner machine. This is done using Remote 
Procedure Calls (RPC). These calls are intercepted inside the kernel and all
the required parameters and data are copied to the kernel memory. The rest is
shown in Diag. 13.
 
 As explained before, there may be many copy operations between kernel
and user spaces for one remote system call. This can hurt the performance,
especially when a great amount of data is copied. Fortunately, most of the
System V IPC system calls transfer data only one way. For example, msgrcv()
only receives data, so there is little to copy from the original task to the
kernel. Also, many other IPC system calls have only a few bytes of data as
their parameters. One example is xxxctl() with the IPC_RMID command. The cost
of copying the needed data and the results, and even of the network access,
is very small.

 The xxxctl() system calls with the IPC_RMID command in the owner machine, 
if successful, will result in communicating with the referee.

14.2) shmget(), semget() and msgget()
 These involve the referee, but not the kernel of the owner machine (unless
the system call is being made in the owner machine!).

14.3) shmat() 
 This is done locally.

14.4) shmdt()
 If causing the removal of a shared memory segment, it will result in 
communication with the referee.







15.0) Shared memories

 DIPC provides strict consistency. This means that a read will return the
most recently written value. This is very familiar to programmers. There can
be multiple readers of a shared memory (page) at a time, but only one
writer at a time. The shared memory manager, running on the owner of the
distributed shared memory, receives the requests to read or write the
whole segment or its individual pages. It decides who gets the right
to do the read or write, and if necessary, provides the requesting machine
with the relevant shared memory contents.

 In what follows, always remember that processes only express their need
to access a shared memory. They don't inform anybody when their access has
ended. This means that we can not wait for them to finish using the shared
memory before giving access rights to others.

 MMU tables are changed to make the pages of a shared memory write-protected.
The same is done to read-protect a page (DIPC versions prior to 2.0 would
swap out the page to make it read-protected). Any process trying to read or
write read-protected pages will encounter a page fault and will sleep in the
kernel. Before the pages can be made readable or writable again, the new
contents are brought over the network and replace the old ones. Then user
processes can access them.

 DIPC can consider multiple virtual memory pages as one, managing and
transferring all of them at the same time. This means that for any integer n,
n >= 1:  <DIPC's page size> = n * <virtual memory page size>

 The dipc_page_size entry in the file /etc/dipc.conf determines the
distributed page size.

NOTE: All the computers in the same cluster should have the same DIPC page 
size.

 A bigger DIPC page size will mean less overhead in transfers, and also
makes it possible for computers with different native page sizes to be able
to work with each other under DIPC. This value should be set to the maximum
virtual memory page size in all the computers of the cluster.

 Writers have priority over readers. This could be the other way around,
but the philosophy behind it is that writers provide new information, while
readers just use information. So a writer will get access to a shared memory
even if there are other would-be reader processes.

 Two signals are used to inform processes of when they become readers or
writers, so they can do any data conversions necessary in a heterogeneous
environment. The signal currently used for reads is SIGURG, and SIGPWR is
used for writes. They may change in the future, if I find out they are used
for other purposes. They can be referred to in programs by the names
DIPC_SIG_READER and DIPC_SIG_WRITER.

 All the processes on the same machine have the same state regarding a
shared memory. All of them can read or write it, or none of them can do
either. When a machine's access type to a shared memory changes, all
the processes on that machine are affected.

 DIPC can be configured to manage and transfer shared memories a page at a
time. This is called page transfer mode. It allows different computers in the
cluster to read or write different pages of a shared memory at the same time.
DIPC can also consider the whole segment as an indivisible unit, in which
case it is said to operate in segment transfer mode. Different nodes in a
DIPC cluster configured to use different transfer modes can work with each
other, though they may not always get what they have asked for. The following
shows how two computers with different transfer modes manage to work with
each other.

* The requesting computer and the shared memory manager both use pages: The
requesting computer sends a request for a page, and will receive that page. 

* The requesting computer uses pages, while the shared memory manager uses 
segments: The requesting computer sends a request for one page to the
manager, but the manager will send it the whole segment. Now the requester
can access all the shared memory.

* The requesting computer uses segments, while the shared memory manager
uses pages: The requesting computer asks for the whole segment, but the
manager interprets this as a request for the page within which the access
occurred, and sends that page only.

* The requesting computer and the shared memory manager both use segments:
Here the requesting computer asks for the whole segment, and gets it.


 In case DIPC is configured in segment transfer mode, any transfer of a
shared memory's contents involves the whole segment. This is in contrast to
some other DSM systems, which only transfer a page at a time. That is why
DIPC can be called a 'segment-based' DSM. This may slow down programs,
especially if the shared memory is big and/or the network is slow, and the
transfers are frequent. It could also increase the performance, namely when
the programmer places a lot of information in the shared memory and lets
others access the whole segment. Here the overhead will be less than in a
page by page access.

 The reasons for allowing DIPC to be able to transfer whole segments are:

 * It simplifies working in a heterogeneous environment with different
   page sizes.

 * In some networks, the transfer time over the network is much less than the 
   transfer setup time, so when the transfer is ready to begin, the amount 
   that is transferred has very little significance.

 * If a process needs all the shared memory, then sending it in one operation 
   reduces unnecessary overheads.

 * This idea is not used much and deserves to be evaluated.

 * A perfect match between the page size and the program's shared memory
   access pattern is unlikely anyway (false sharing). The fact that the
   page size differs between systems makes this point even clearer.


 In any case, the user is able to configure DIPC in a computer to follow the
more traditional paging scheme.

 The owner computer of a shared memory is its first writer, and it is always
among the readers (if there are any readers). This invariant is maintained
as follows: the owner starts as the writer; when a request to read arrives,
the owner is converted to a reader. If a process on another computer then
wants to write to the shared memory, the owner loses all access rights. As
soon as the next request to read arrives, the shared memory manager places a
request to read on behalf of the owner machine in front of the original read
request. In this way the owner becomes a reader first, getting the shared
memory contents from the current writer (see below). It then provides the
original reader (and possibly other requesters) with the contents.

 The above behavior simplifies the algorithms, as we always know where to
provide the shared memory contents from when there are multiple readers and
a new reader or writer is selected. If we had to use one of the other
readers instead, we would have to provide ways to cope with the possibility
of a network failure or that reader's failure.

 In short, in the present version of DIPC there is always exactly one machine
responsible for providing other computers with the shared memory contents: if
there is a writer, it is the writer machine; if there are one or more
readers, it is the owner machine. In both cases, if an error occurs, we know
we can do almost nothing. If a writer could not send the contents, there
would be no other place with up to date data. And if the owner fails to give
the requesting machine the data, either because of network problems or a
machine failure, then again we can do very little, because using another
reader would probably end with the same result.

 In both a request to read and a request to write, the requesting computer
(if alive) will eventually find out about the problem through a time-out and
send SIGSEGV (segmentation fault) to all processes that have attached the
shared memory to their address space. As can be seen, there is hardly an
easier way to cope with failures than the one described!
 
 The following four cases are possible when machines request access to the
shared memory.

15.1) There is a reader, and another machine wants to read: As the owner
is also a reader, it sends the relevant memory contents to the requesting
machine and allows it to continue.
 The following diagram shows what happens:
                                    Network
          (Requesting Computer)       |        (Owner Computer)
                                      |
       |-1->back_end >-2-+            |       <----------5----+      |
       |                 |            |       |               |      |
kernel |            employer --3------|-> front_end --4-->shm_man    | kernel
       |                              |       |                      |
       |                              |       +---6--> worker        |   
(Diag. 14)

 When a process wants to read from a shared memory whose relevant contents are 
not present, an exception occurs and the kernel part of DIPC is notified. A 
request is prepared inside the kernel which will cause the back_end to wake 
up (1), and fork an employer (2). The employer connects to the owner 
computer (3) and delivers the request to read the relevant shared memory 
contents to the shm_man (4). The shm_man sends a message to the front_end 
(using a loop-back Internet socket) (5). The front_end forks a worker (6) to 
do the actual transfer.

 Note: shm_man never transfers data itself, but orders workers on the
different machines to do so. To keep the design general, it does so even on
the machine it is running on.
 
The next diagram tells what happens next:
                                    Network
          (Requesting Computer)       |        (Owner Computer)
                                      |
       |                              |                              |
       |                              |                              |
kernel |       +-8-< front_end <------|----------7------- worker     | kernel
       |       |                      |                   |   |      |
       | <-11- worker <---------------|--10---------------+   +-9--< |
(Diag. 15)

  The worker on the owner machine gets the shared memory (by a shmget()
system call) and connects to the front_end of the requesting machine (7),
which forks another worker (8). Now the worker on the owner machine reads
the relevant shared memory contents (9) and sends them directly (with no
intervention by the front_end) to the newly forked worker (10), which places
them in the shared memory of the requesting computer (11). By this time the
requesting machine has at last received the information it wants.

 Here is the rest:
                                    Network
          (Requesting Computer)       |        (Owner Computer)
                                      |
       |       worker ---------12-----|--------------> front_end
       |                              |                        |
kernel |                  front_end <-|--14------ shm_man <-13-+
       |                   |          |                
       |<-16- employer<-15-+
(Diag. 16)

 After the worker has placed the contents in the shared memory, it sends an
acknowledgment to the shm_man (12 and 13). The shm_man then sends a go-ahead
message to the original employer that sent the request to read (14 and 15).
The employer informs the kernel about it (16), and the kernel will restart
the processes wanting to read from the shared memory (not shown). This is
the end.


15.2) There is a reader, and another machine wants to write: The manager sends
a read-protect message to all readers, and if needed (the new writer is not a
previous reader), provides the want-to-be writer with the relevant memory 
contents, then allows it to continue.

 See the following diagram: 
                                    Network
          (Requesting Computer)       |        (Owner Computer)
                                      |
       |-1->back_end >-2-+            |                              |
       |                 |            |                              |
kernel |            employer --3------|-> front_end --4-->shm_man    | kernel
       |                              |                              |   
(Diag. 17)

 The request to write is trapped inside the kernel and the back_end is
activated to handle it (1). This results in the shm_man on the owner machine
being informed (2, 3, and 4). Note that all the above can happen on the same
computer (if the owner wants to write to the shared memory), in which case
all the processes shown above run on the same machine.

Now the rest:
                                    Network
          (Reader Computer)           |        (Owner Computer)
       |                              |
kernel |<-7- worker <-6- front_end <--|------------5----- shm_man
       |                              |  
(Diag. 18)

 The shared memory manager sends a read-protect message to ALL the reader
computers (which may include the owner machine, in which case a loopback
address is used) (5). The front_end on each machine forks a worker (6) and
the request is executed (7). From now on all attempts to read or write the
relevant contents of the shared memory on any of these computers cause the
responsible process to stop. The above is repeated for every reader.

 If the computer requesting to write was among the readers, no contents
will be sent to it. Otherwise the owner is responsible for providing the
new requester with the shared memory contents. This is like Diag. 15.

 Now the shm_man waits for an acknowledgment and when one arrives, sends a
go-ahead to the employer that originally started the whole process. This is
like Diag. 16.
 

15.3) There is a writer, and another one wants to read: If neither the
writer nor the requester is the owner, the owner will place itself as a
requester to read the shared memory; this is to make sure it is always a
reader. After all is done, the previous writer will become a reader.

 Diag. 17 shows how the shm_man is informed about the request (but here it
is a read request).

 If the shm_man inserts the owner as a want-to-read machine, the following
will happen:

                                    Network
          (Writer Computer)           |        (Owner Computer)
       |                              |
kernel |     worker <-6- front_end <--|------------5----- shm_man
       |                              |  
(Diag. 19)

 The owner requests the current writer computer to send it the relevant
contents of the shared memory. This is done by sending a request to the
front_end of the writer (5), which in turn results in the forking of a
worker to do the transfer (6).

 The relevant shared memory contents are sent to the owner computer in a way 
like that shown in Diag. 15 (substitute the owner computer with the writer
computer, and the requesting computer with the owner computer).

 When the owner becomes a reader of the shared memory, it will attend to the
original read request. Now the situation is like that in case 15.1.


15.4) There is a writer, and another one wants to write: shm_man sends
the current writer a read-protect message and instructs it to send the
relevant shared memory contents to the requesting machine. The requesting
machine then becomes the new writer.

 Diag. 17 shows how the shm_man is informed about the request. It will then
send a message to the current writer to read-protect the relevant parts of
the shared memory and send the contents to the new writer. This is like
Diag. 15, but with (current writer) in place of (owner computer).

 You can see the rest in Diag. 16.


15.5) Shared memory scheduling

 shm_man places the requests for access in a linked list. Scheduling
decisions are made at fixed intervals: whenever a time quantum is up,
shm_man checks whether there are any requests for the shared memory. If
there are, it behaves as described above; if not, nothing happens. This
quantum (shm_hold_time) prevents a ping-pong effect when badly written
programs on different machines try to write to a shared memory without
using proper synchronization. In this way more work is done for each
transfer.

 A new reader for a page or the whole segment will be assigned if:

 * there is a computer wanting to read, and
 * there is no computer wanting to write, and
 * there is no pending acknowledgment from a writer.

 A new writer will be assigned if:

 * there is a computer wanting to write, and
 * there is no pending acknowledgment from a writer, and
 * there is no pending acknowledgment from a reader.







16.0) The proxy
 
 A proxy is a worker process that has executed a semop() system call with
the SEM_UNDO flag set. When a worker executes such a function, it does not
terminate, but notes the network address and the process id of the remote
process (the process on whose behalf it executed the system call) and then
continues running. Doing otherwise would cause the semop()'s effects to be
undone when the worker quits. This worker is now a proxy, and it sets up a
UNIX domain socket with the name
PROXY_<remote process IP address>_<remote process id>, using the information
noted before. One example is PROXY_191.72.2.0_234. From now on it will
execute all the semop() operations on behalf of the original remote process,
whether they specify the SEM_UNDO flag or not.

 The front_end checks for the presence of a suitable proxy before forking a
worker process to handle a semop() function. If a proxy is present, it is
contacted using the UNIX domain socket and instructed to do the semop().

 The proxy can execute any other function that a worker is capable of, but
in the current design the front_end process will not inform the proxy of
the need to do anything other than semop()s.

 When the original remote process stops executing, all its proxies are
informed to quit, "doing" all the necessary "undoings".

 





17.0) UDP/IP

 DIPC uses TCP/IP by default, but as of version 1.0 it can also work with
the UDP/IP networking protocol. Using UDP results in much higher speed in
transferring data over the network, mainly because the overheads of a
connection oriented protocol like TCP are not present.

 NOTE: All the computers in the same cluster should use the same networking
protocol.

 UDP is not reliable, meaning that due to permanent or temporary errors,
some data may not be delivered to its destination. UDP does not try to
overcome such errors. Its checksumming is also optional and may be disabled,
so undetected data corruption is possible. There are other services
available in TCP but missing in UDP.

 Another important point is that UDP does not fragment data packets that
are too large for the kernel buffers or the network. The current
implementation of DIPC does not employ flow control or fragmentation in
the dipcd code.

 The above means that:

 * Any problem can lead to the failure of the distributed application
   program.

 * It is not possible to transfer data that is larger than the limitations
   in the kernel. For example, when using SLIP it is not possible to
   transfer more than about 3200 bytes at a time with UDP, because of
   buffer size limitations in the kernel. This might force the programmer
   to refrain from using big shared memory segments or messages.

 The SLIP case mentioned can be improved by allocating more buffers 
(tx_queue_len in linux/drivers/net/slip.c). This solution is from Alan Cox.

17.1) How addresses are determined

 Because UDP is connectionless, each packet of data transferred must carry
its destination address with it.

 Also of importance is the fact that system-assigned port numbers are used
when opening a socket. This means that each process should somehow get the
complete address of the other process it wants to communicate with.

 Things are started using the referee's or the front_end's well known ports.
Reading from a UDP socket also provides the sender's address, so after the
first contact the processes exchange a few bytes of data, and thus get
each other's complete addresses.

