
  Section    Title

      0      Warning
      1      Introduction
      2      Hardware for parallel processing
      3      Software for parallel processing
      4      Differences in programming MCs and MPs
      5      An ideal parallel system
      6      A discussion of the MC architecture
      7      Ways to ease MC programming
      8      Requirements of Programming a distributed system
      9      Heterogeneous Programming in MCs
     10      DIPC



0.0) Warning
 The reader should note that the materials in this file are not definitive, 
and the categorizations are not absolute. There may also be some errors, but 
on the whole I hope the interested reader will find something useful here.


 
1.0) Introduction:
 Programmers are very familiar with uni-processor systems. These systems have
one processor and execute programs one instruction at a time, and do so 
sequentially, meaning that the execution of the next instruction won't begin 
until the previous one has finished (unless the two don't affect each other,
in which case they may overlap in execution). There is also some memory
attached to the processor of such a system, holding both the code to be
executed and the data to be processed (Von Neumann architecture). A value
produced by one part of the program can be used by another part of it; all
that is necessary is to access the same memory location in both parts,
assuming that the location is not changed before the second access.
 
 Parallel processing is in contrast to sequential processing, and is based on 
the idea of using more than one processing element at the same time to achieve 
more processing power.

 There are some reasons for the increasing interest in parallel processing.
Among them are:

1.1) There is an upper limit to the speed and sophistication of electronic 
components, like processors, and we are approaching it fast.

1.2) Many of the natural phenomena that we are interested in analyzing or 
controlling are intrinsically parallel. 

1.3) No matter how fast uni-processors become, we can still achieve more
speed by using more of these processors at the same time.

 This can be useful only when we have parallelized our application and 
succeeded in keeping the processing elements busy most of the time. 
Parallelizing a program includes detecting sections that can be executed at
the same time. These sections may need to read some shared data, and should
do so only after making sure that the data has already been produced. Producing
new data may also require making sure no one else is accessing the data while it
is being updated. This calls for some synchronization with the processes that
run in parallel.

 Unfortunately some programs cannot be parallelized in a satisfactory manner 
(i.e. a great percentage of the program still needs to be run sequentially). 
This may be due to the original algorithms used, or because the solution to 
the problem we are trying to solve is intrinsically sequential. Such programs 
won't benefit much from the presence of a parallel processing system. 
Programmers should always keep this in mind. 

 Distributed processing allows a program to use the resources of more than
one computer to reach its goals. Note that distributed and parallel
processing are not necessarily the same thing: parts of a distributed program
may or may not be executing in parallel. Also, different parts of a parallel
program may or may not run on different computers. In this discussion, it is
assumed that distributed programs can also be parallel programs.

 

2.0) Hardware for parallel processing
 There are three main architectures for parallel processing:

2.1) CPU-internal parallel processing
 CPUs described as Super Scalar, Pipelined, Vector Processing, or a
combination of these, can execute more than one instruction at a time
(super scalar and pipelined), or apply the same instruction to more than
one unit of data (vector processing), effectively doing more than one
thing at a time. We won't discuss these any further.

2.2) Multi-Processors (MP)
 This architecture includes systems that have more than one processor to
execute the programs. Any of these processors may in turn have internal
parallel processing capabilities.

 These computers usually have a shared memory, that is, memory that can be
accessed by all the processors. There may also be a single clock in the
system. When accessing the shared memory, special hardware does the
arbitration among processors, so that no two processors access the same
location in the shared memory simultaneously. 

 There are different schemes for connecting the processors to the shared 
memory. A bus creates a common access path for all the processors, but this 
limits the number of active processors on the bus, as there will be
collisions between processors trying to access the shared memory, even when
they are accessing different parts of it. As a remedy, some try to divide the 
shared memory into different modules and use more than one bus, so as to 
increase the chances of simultaneous access to different modules of the 
shared memory. The best results can be obtained by using a Crossbar 
Network, which allows any processor to access any module, as long as no 
two processors are accessing the same module. A Crossbar Network is much 
more complex and expensive than a single bus. As a compromise between a 
single bus and a 
Crossbar Network, some MPs use networks in which some access paths between 
processors and the shared memory are shared. In these systems it is possible 
that some processors can not simultaneously access their requested memory 
locations, even though they are in different modules. Examples include the
Butterfly or the shuffle networks.  

 So as to decrease the amount of contention for memory access, MPs usually
use some cache memory for each processor. In this way, parts of the program 
that do not need to be shared (program code and private variables) can 
be accessed freely. If shared data is only being read, then it too can be put 
in the cache and readily accessed. But there is a problem here: if a processor 
wants to write to shared data that is in other processors' caches, 
inconsistencies can arise. So methods must be devised to keep shared 
data consistent.

 Briefly, the following seem to hold for MPs:

2.2.1) In this architecture, shared data can be accessed by all processors. 
The main thing that the programmer should keep in mind is synchronizing this 
access.

2.2.2) Because of direct access to the shared memory, processes can exchange 
data at high speeds.

2.2.3) Because of hardware problems, adding large numbers of processors,
especially general purpose processors (as opposed to special purpose ones,
like single-bit processors), may not be easy.

2.2.4) MPs are usually designed with a pre-determined number of processors.
Adding more than that number to a system may be impossible without a complete 
redesign. 


2.3) Multi-Computers (MC)
 These are also called distributed systems. Here the system consists of a 
number of autonomous computers connected by a network. There is no shared 
memory or clock. Any exchange of information between these computers 
is done via the network. Each computer should have all the data it 
needs for its processing in its local memory. The programmer of such a
system should distribute the data at the proper time among the computers. 
Network interconnections can take various forms. You could connect each 
computer to all the others, giving each one the ability to exchange data 
with any other machine directly, but that may become very complex and 
expensive. Or you can connect them in a line, where a computer has a direct 
connection to only its two neighbors. Yet another way is to use a bus, where
each machine can contact any other computer directly, though simultaneous use
of the bus by more than one pair of computers may be limited. As with MPs,
there are some alternatives here; among the alternate topologies are Mesh,
Torus and Hypercube.

 Any of the computers in an MC can in itself be an MP. This makes it clear
that the MC architecture is an improvement over the MP, not a rival.

 The following points seem to hold about MCs:

2.3.1) The MC programmer must care not only about synchronization, but
also about the availability of data on the proper machine.

2.3.2) Data transfers are done over a network. As the current network
speeds are less than the speed of computer buses, data exchanges are not
very fast.

2.3.3) Designing and implementing an MC is easy. The designer can use
already-available computers and connect them using network adapters. The
main task here is choosing a suitable connection topology.

2.3.4) It is easy to add or remove computers from an MC. So it has good
scalability. It is also easy to rearrange the connection topology.



3.0) Software for parallel processing
 Parallel execution is possible at various levels. All the levels mentioned
below can be used in both MCs and MPs. But care should be taken that in an
MC the data transfer times do not exceed the execution times. For example it 
may not be a good idea to send a few machine instructions over the network to 
another computer, execute them there and then return the results. The other 
thing to consider is systems with different, incompatible processor types 
(heterogeneous systems). In these systems, processors cannot directly
execute programs written for other types of processors. Also, each
processor may have a different way to represent data types. This problem 
exists for both MPs and MCs, but it is more often encountered in MCs, as MP
designers usually avoid it by using the same kind of processor in their
designs. 

3.1) Parallelism at instruction level
 Pipelined processors can overlap the execution of many instructions. Super
Scalar processors actually execute more than one instruction at a time. Both
depend on the condition that the instructions don't have resource conflicts
(the same register, multiplication unit, or the result of a previous
instruction). Discovering the parallelism opportunities may be the
responsibility of the compiler (Intel Pentium), or the processor may take a
more active part (Intel Pentium Pro). As the compiler or chip knows all about
the machine instructions and the processor (needed resources, execution time
and so on), it can try to make the program execute in parallel as much as
possible. Because this is done automatically, either by the compiler or the
processor, it is easy to use and totally transparent.

3.2) Parallelism at loop level
 Loops give programs good opportunities for parallelization. If the loop
iterations are independent of each other, it is possible to execute
each of them on a different processor. The compiler or the programmer may be
able to detect these opportunities.

3.3) Parallelism at procedure level
 This kind of parallelizing depends on the algorithms used. Here procedures
in a single program may be executed in parallel. As it may be very difficult
for the compiler to ascertain the procedures' side effects and to guess their
running time, the main burden of detection is on the programmer's shoulders.
More than that, the programmer should also think about synchronizations.

3.4) Parallelism at the program level
 This means executing stand alone programs that run in parallel to solve a
single problem. This is done mainly by the decisions of the programmer
and/or the user.



4.0) Differences in programming MCs and MPs
 In uni-processor systems, data are passed between different parts of the
program in a shared memory, for example in the stack or in global variables.
Because MPs also use shared memory to pass data, they closely resemble the
uni-processor programming model. For the majority of programmers familiar 
with the uni-processor architecture, MPs are a natural extension of the
systems they already know. The main consideration for them is the 
synchronization problem. This is why MPs are so popular, and even PCs with 
multiprocessing capabilities are readily available.

 In MCs, the needed data must be transferred to the computers that are to
process them before they can proceed. This is very different from 
uni-processors. In most MC systems, the programmer is required to explicitly 
transfer any needed data over the network. The programmer should also consider 
problems like the possibility of a network or remote computer crash. This 
may prevent him/her from concentrating on the problem at hand. The other 
obstacle (and a big one) faced by the programmer is making sure the data in 
different machines are kept consistent.



5.0) An ideal parallel system
 In an ideal parallel system the programmer writes programs much as for
ordinary uni-processor machines. Different parts of this ideal system
(compiler, operating system and so on) should detect all parallelizable 
parts of the program at all levels, and execute them without any programmer 
or user intervention, among possibly heterogeneous processing elements. It 
should also handle all the required synchronizations and data transfers.

 Well, don't hold your breath waiting for such a system. A lot of improvements
in many different branches of computer science are needed before we are able
to build such a system. So the programmer has to consider many things when 
programming a parallel system. And this is much harder in the case of MCs. It 
seems that system performance and programmer's ease have an inverse 
relation: the more the programmer is involved, the better the performance. 
This seems true in many other, more familiar situations as well. For example, 
a programmer can choose to use overlaying to swap parts of his/her program to 
secondary storage. Here the programmer can swap exactly the parts (s)he knows 
are not needed for a long time. A paging system, on the other hand, may swap 
an immediately needed part of a program, resulting in wasteful access to
the slow secondary memory, costing performance. But such a system is much
easier to use. This may have to do with the fact that computers are not 
very intelligent (yet!). 

 In contrast to our ideal parallel system, another system may provide no
automatic services at all, holding the programmer responsible for
everything: (remote) program execution, data transfers, synchronization, and
error handling. This kind of system needs very few (if any) changes to
the underlying operating system, so it can be easily ported to different
environments. Although such systems are very difficult to use, the great
demand for more processing power means they are widely used and can be found
anywhere, from PCs to super computers.



6.0) A discussion of the MC architecture
 Some people use the phrase 'processor pool' when talking about distributed
processing. They believe that ideally, processing power should be used like
electricity. Electrical energy is generated in big power plants, and can be
used easily through a plug. So we should have some place with a lot of
processing resources, and users will 'plug' to this power source and use
these resources.

 I believe this is an unjustified imitation of a very different branch of
engineering. Power engineering equipment is usually very big and
expensive. Computers, on the other hand, use different technologies and offer 
lots of power, cheaply, and in very small spaces. A better approach 
would be using the processing power of stand alone computers scattered all 
over (the world?). Each user can use other people's facilities, while letting 
others use his/hers. 
   
 Designing an MC requires making some decisions about the level of programmer 
versus system (OS, compiler, and so on) involvement. Usually the programmer 
can make better use of the resources, at the expense of the time spent 
during development. But it seems that the number of programmers familiar 
with MC programming is growing much more slowly than the corresponding 
computing resources. More than that, we expect the performance of computer 
hardware to keep growing. It seems that preferring the programmer's ease is 
more logical.



7.0) Ways to ease MC programming
 Here are some possible solutions for making MC programming easier. All of
them are based on concealing the presence of multiple computers and a
network from the programmer and providing him or her with virtual direct
access to the memory in other computers. This is called 'Distributed Shared
Memory' or DSM.

7.1) We could use a compiler to handle the work for us. The programmer
declares all his shared data structures. The compiler then knows where these 
are read or written, and can insert code to send or receive information or to 
synchronize access to them. A Run Time System, or RTS, will be necessary. 
The programmer can assume (s)he is working with an MP. Here the programmer
should note and mark all shared data; (s)he is also restricted to using a 
certain compiler (and language). This requires a complex compiler, but
results in good performance.

7.2) The next thing we could do is to use 'Object Oriented' or OO concepts.
Here data transfers and synchronizations are hidden inside the objects. The
programmer can use these objects without regard to which machine they are
in. An RTS may need to be present here, but many newer OSs have support for
(heterogeneous) objects inside them. The programmer may be forced to use
a certain language and software construction methodology, but this approach 
has rather good performance. 

7.3) Another way is using a 'Memory Management Unit' or MMU to implement
DSM. This works very much like virtual memory in ordinary computers.
Multiple computers will have access to a shared region of memory. When a
program needs some data that are not in the computer's local memory, a fault
occurs and program execution is stopped. The contents of the required shared
memory are then transferred over the network by the OS, then the original
program is restarted.

 In some DSM systems, the memories of all the computers are considered as
unified. A pointer in such a memory means the same thing in all the machines. 
This can be of use when different processes can use a common address space, 
like in operating systems that support multi-threading. When each process 
has its own virtual address space (like in traditional UNIX), a pointer 
outside its owner's address space is meaningless to others, so sharing it 
would be of little use.

7.4) The last way mentioned does not impose a particular development
language or methodology on the programmer. It is also possible to implement a
distributed OO system on top of it. Many older MP programs can also be ported
easily.



8.0) Requirements of Programming a distributed system
 Here we discuss a multi computer system with program-level parallelism.

8.1) The ability to execute programs remotely.
 The code to be executed on a computer should be present there before anything 
can happen. Some distributed systems transfer executable code from machine to
machine and then execute it. This is a problem in heterogeneous systems. To
overcome it, the source of programs can be sent to a remote computer and
compiled, and then executed there. Another way is to send an intermediate form
of the program (see the Java language). 

 The easiest way is to place the programs 'by hand' on the computers where
they are needed. One could then start them on different machines manually, 
or (s)he could use the UNIX 'rsh' command to do the same thing remotely.

8.2) The ability to synchronize with remote processes.
 Different programs should make sure they access shared data at proper times.
This requires knowledge about other programs, and obtaining it usually needs 
some data exchanges. That is why many people consider synchronization a 
special form of data exchange (see 8.3). The important thing here is not to 
make synchronization an expensive operation. Remember that synchronization is 
not our goal; it is an overhead. Distributed programmers should always use
techniques that result in the minimum amount of data transfer. That is why
using methods like test-and-set is not very attractive. Semaphores are a very
good solution.

8.3) The ability to distribute the data needed by various remote processes and
to gather the processed information.
 This also should be done with the least amount of overhead: exchange only 
the data needed, and only at the right time. The other thing to consider is 
the effort put in by the programmer to do the exchange. The best thing would 
be to relieve the programmer of knowing about machines and network addresses, 
and to use logical means to identify the computers which need data.



9.0) Heterogeneous Programming in MCs
 Programming a heterogeneous system involves solving two problems:

9.1) The code problem
 Processors cannot execute programs that are compiled for other, incompatible 
ones. So executing a program on an arbitrary processor in the system may not 
be possible. One solution to this is to make sure that programs are always
run on an appropriate processor. Another way would be using software 
emulation.

9.2) The data problem
 Different processors may interpret data in different ways. Programmers
cannot just send their data to any machine. One way to overcome this is to
restrict data exchanges to compatible processors. Another is to convert
the data somewhere along the way. 
 
 Data conversion normally requires knowledge about data semantics (their
meaning and the way they are used), which are determined by the programmer. 
It would be difficult for the system software to guess these semantics 
without any help from the programmer. A very straightforward solution would 
be making the programmer responsible for any data conversions. A more complex 
solution is involving the compiler and the OS to convert the data using 
programmer-supplied information.


10.0) DIPC
 DIPC is a system for creating a multi-computer system from Linux computers
on a TCP or UDP network. It eases the data exchanges and the
synchronizations, and uses the processor's MMU to provide DSM capabilities.
DIPC provides parallelism at the program level. The programmer or the user
is responsible for invoking remote programs on computers with appropriate
machine architectures. There is some basic support for heterogeneous
environments.

