Rabu, 15 Juni 2011

Parallel Computing; Message Passing Interface

MPI (Message-Passing Interface) is a message-passing library interface speci cation. All parts of this de nition are signi cant. MPI addresses primarily the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process. (Extensions to the \classical" message-passing model are provided in collective operations, remote-memory access operations, dynamic process creation, and parallel I/O.) MPI is a speci cation, not an implementation; there are multiple implementations of MPI. This speci cation is for a library interface; MPI is not a language, and all MPI operations are expressed as functions, subroutines, or methods, according to the appropriate language bindings, which for C, C++, Fortran-77, and Fortran-95, are part of the MPI standard. The standard has been defined through an open process by a community of parallel computing vendors, computer scientists, and application developers. The next few sections provide an overview of the history of MPI's development.

The goal of the Message-Passing Interface simply stated is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and exible standard for message passing.
A complete list of goals follows.
  • Design an application programming interface (not necessarily for compilers or a system implementation library).
  • Allow efficient communication: Avoid memory-to-memory copying, allow overlap of computation and communication, and offload to communication co-processor, where available.
  • Allow for implementations that can be used in a heterogeneous environment.
  • Allow convenient C, C++, Fortran-77, and Fortran-95 bindings for the interface.
  • Assume a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.
  • De ne an interface that can be implemented on many vendor's platforms, with no signi cant changes in the underlying communication and system software.
  • Semantics of the interface should be language independent.
  • The interface should be designed to allow for thread safety.

Point-to-Point Communication
Sending and receiving of messages by processes is the basic MPI communication mechanism. The basic point-to-point communication operations are send and receive. Their use is illustrated in the example below.

#include "mpi.h"
int main( int argc, char **argv )
char message[20];
int myrank;
MPI_Status status;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
if (myrank == 0) /* code for process zero */
strcpy(message,"Hello, there");
MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
else if (myrank == 1) /* code for process one */
MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
printf("received :%s:\n", message);

In this example, process zero (myrank = 0) sends a message to process one using the send operation MPI_SEND. The operation speci es a send bu er in the sender memory from which the message data is taken. In the example above, the send bu er consists of the storage containing the variable message in the memory of process zero. The location, size and type of the send bu er are speci ed by the rst three parameters of the send operation. The message sent will contain the 13 characters of this variable. In addition, the send operation associates an envelope with the message. This envelope speci es the message destination and contains distinguishing information that can be used by the receive operation to select a particular message. The last three parameters of the send operation, along with the rank of the sender, specify the envelope for the message sent. Process one(myrank = 1) receives this message with the receive operation MPI_RECV. The message to be received is selected according to the value of its envelope, and the message data is stored into the receive bu er. In the example above, the receive bu er consists of the storage containing the string message in the memory of process one. The rst three parameters of the receive operation specify the location, size and type of the receive bu er. The next three parameters are used for selecting the incoming message. The last parameter is used to return information on the message just received.
The next sections describe the blocking send and receive operations. We discuss send, receive, blocking communication semantics, type matching requirements, type conversion in heterogeneous environments, and more general communication modes. Nonblocking communication is addressed next, followed by channel-like constructs and send-receive operations, Nonblocking communication is addressed next, followed by channel-like constructs and send-receive operations, ending with a description of the \dummy" process, MPI_PROC_NULL.

Up to here, all point to point communication have involved only bu ers containing a sequence of identical basic datatypes. This is too constraining on two accounts. One often wants to pass messages that contain values with di erent datatypes (e.g., an integer count, followed by a sequence of real numbers); and one often wants to send noncontiguous data(e.g., a sub-block of a matrix). One solution is to pack noncontiguous data into a contiguous bu er at the sender site and unpack it at the receiver site. This has the disadvantage of requiring additional memory-to-memory copy operations at both sites, even when the communication subsystem has scatter-gather capabilities. Instead, MPI provides mechanisms to specify more general, mixed, and noncontiguous communication bu ers. It is up to the implementation to decide whether data should be rst packed in a contiguous bu er before being transmitted, or whether it can be collected directly from where it resides.
The general mechanisms provided here allow one to transfer directly, without copying, objects of various shape and size. It is not assumed that the MPI library is cognizant of the objects declared in the host language. Thus, if one wants to transfer a structure, or an array section, it will be necessary to provide in MPI a de nition of a communication bu er that mimics the de nition of the structure or array section in question. These facilities can be used by library designers to de ne communication functions that can transfer objects defi ned in the host language | by decoding their de nitions as available in a symbol table or a dope vector. Such higher-level communication functions are not part of MPI.
More general communication bu ers are speci ed by replacing the basic datatypes that
have been used so far with derived datatypes that are constructed from basic datatypes using the constructors described in this section. These methods of constructing derived datatypes can be applied recursively.
Here is an example in C that passes an array of ints and all the processors want to send their arrays to the root with MPI_Gather:

int array[100];
int root, total_p, *receive_array;

MPI_Comm_size(comm, &total_p);
receive_array=(int *) malloc(total_p*100*sizeof(int));
MPI_Gather(array, 100, MPI_INT, receive_array, 100, MPI_INT, root, comm);

However, you may instead wish to send data as one block as opposed to 100 ints. To do this define a "contiguous block" derived data type.

MPI_Datatype newtype;
MPI_Type_contiguous(100, MPI_INT, &newtype);
MPI_Gather(array, 1, newtype, receive_array, 1, newtype, root, comm);

Passing a class or a data structure cannot use a predefined data type. MPI_Type_create_struct creates an MPI derived data type from MPI_predefined data types, as follows:

int MPI_Type_create_struct(int count, int blocklen[], MPI_Aint disp[], MPI_Datatype type[], MPI_Datatype *newtype)

where count is a number of blocks, also number of entries in blocklen[], disp[], and type[]:
  • blocklen[] — number of elements in each block (array of integer)
  • disp[] — byte displacement of each block (array of integer)
  • type[] — type of elements in each block (array of handles to datatype objects).

The disp[] array is needed because processors require the variables to be aligned a specific way on the memory. For example, Char is one byte and can go anywhere on the memory. Short is 2 bytes, so it goes to even memory addresses. Long is 4 bytes, it goes on locations divisible by 4 and so on. The compiler tries to accommodate this architecture in a class or data structure by padding the variables. The safest way to find the distance between different variables in a data structure is by obtaining their addresses with MPI_Get_address. This function calculates the displacement of all the structure's elements from the beginning of the data structure.
Given the following data structures:

typedef struct{
int f;
short p;
} A;

typedef struct{
A a;
int pp,vp;
} B;

Here's the C code for building an MPI-derived data type:

void define_MPI_datatype(){

int blocklen[6]={1,1,1,1,1,1}; //The first and last elements mark the beg and end of data structure
MPI_Aint disp[6];
MPI_Datatype newtype;
B findsize[2]; //You need an array to establish the upper bound of the data structure
MPI_Aint findsize_addr, a_addr, f_addr, p_addr, pp_addr, vp_addr, UB_addr;
int error;

MPI_Get_address(&findsize[0], &findsize_addr);
MPI_Get_address(&(findsize[0]).a, &a_addr);
MPI_Get_address(&((findsize[0]).a).f, &f_addr);
MPI_Get_address(&((findsize[0]).a).p, &p_addr);
MPI_Get_address(&(findsize[0]).pp, &pp_addr);
MPI_Get_address(&(findsize[0]).vp, &vp_addr);


error=MPI_Type_create_struct(6, blocklen, disp, type, &newtype);

One-Sided Communications
Remote Memory Access (RMA) extends the communication mechanisms of MPI by allowing one process to specify all communication parameters, both for the sending side and for the receiving side. This mode of communication facilitates the coding of some applications with dynamically changing data access patterns where the data distribution is fixed or slowly changing. In such a case, each process can compute what data it needs to access or update at other processes. However, processes may not know which data in their own memory need to be accessed or updated by remote processes, and may not even know the identity of these processes. Thus, the transfer parameters are all available only on one side. Regular send/receive communication requires matching operations by sender and receiver. In order to issue the matching operations, an application needs to distribute the transfer parameters. This may require all processes to participate in a time consuming global computation, or to periodically poll for potential communication requests to receive and act upon. The use of RMA communication mechanisms avoids the need for global computations or explicit polling. A generic example of this nature is the execution of an assignment of the form A = B(map), where map is a permutation vector, and A, B and map are distributed in the same manner.
Message-passing communication achieves two e ects: communication of data from sender to receiver; and synchronization of sender with receiver. The RMA design separates
these two functions. Three communication calls are provided: MPI_PUT (remote write), MPI_GET (remote read) and MPI_ACCUMULATE (remote update). A larger number of synchronization calls are provided that support di erent synchronization styles. The design is similar to that of weakly coherent memory systems: correct ordering of memory accesses has to be imposed by the user, using synchronization calls; the implementation can delay communication operations until the synchronization calls occur, for efficiency. The design of the RMA functions allows implementors to take advantage, in many cases, of fast communication mechanisms provided by various platforms, such as coherent or noncoherent shared memory, DMA engines, hardware-supported put/get operations, communication coprocessors, etc. The most frequently used RMA communication mechanisms can be layered on top of message-passing. However, support for asynchronous communication agents (handlers, threads, etc.) is needed, for certain RMA functions, in a distributed memory environment.
We shall denote by origin the process that performs the call, and by target the process in which the memory is accessed. Thus, in a put operation, source=origin and destination=target; in a get operation, source=target and destination=origin.

Process Creation and Management
MPI is primarily concerned with communication rather than process or resource management. However, it is necessary to address these issues to some degree in order to defi ne a useful framework for communication. This chapter presents a set of MPI interfaces that allow for a variety of approaches to process management while placing minimal restrictions on the execution environment.
The MPI model for process creation allows both the creation of an intial set of processes related by their membership in a common MPI_COMM_WORLD and the creation and management of processes after an MPI application has been started. A major impetus for the later form of process creation comes from the PVM [23] research e ort. This work has provided a wealth of experience with process management and resource control that illustrates their bene ts and potential pitfalls.
The MPI Forum decided not to address resource control because it was not able to design a portable interface that would be appropriate for the broad spectrum of existing and potential resource and process controllers. Resource control can encompass a wide range of abilities, including adding and deleting nodes from a virtual parallel machine, reserving and scheduling resources, managing compute partitions of an MPP, and returning information about available resources. assumes that resource control is provided externally-probably by computer vendors, in the case of tightly coupled systems, or by a third party software package when the environment is a cluster of workstations.
The reasons for including process management in MPI are both technical and practical. Important classes of message-passing applications require process control. These include task farms, serial applications with parallel modules, and problems that require a run-time assessment of the number and type of processes that should be started. On the practical side, users of workstation clusters who are migrating from PVM to MPI may be accustomed to using PVM's capabilities for process and resource management. The lack of these features would be a practical stumbling block to migration.
The following goals are central to the design of MPI process management:
  • The MPI process model must apply to the vast majority of current parallel environments. These include everything from tightly integrated MPPs to heterogeneous networks of workstations.
  • MPI must not take over operating system responsibilities. It should instead provide a clean interface between an application and system software.
  • MPI must guarantee communication determinism in the presense of dynamic processes,
    i.e., dynamic process management must not introduce unavoidable race conditions.
  • MPI must not contain features that compromise performance.

The process management model addresses these issues in two ways. First, MPI remains primarily a communication library. It does not manage the parallel environment in which a parallel program executes, though it provides a minimal interface between an application and external resource and process managers.
Second, MPI maintains a consistent concept of a communicator, regardless of how its members came into existence. A communicator is never changed once created, and it is always created using deterministic collective operations.

POSIX provides a model of a widely portable le system, but the portability and optimization needed for parallel I/O cannot be achieved with the POSIX interface.
The signifi cant optimizations required for efficiency (e.g., grouping , collective
buff ering , and disk-directed I/O ) can only be implemented if the parallel I/O system provides a high-level interface supporting partitioning of le data among processes and a collective interface supporting complete transfers of global data structures between process memories and les. In addition, further e ciencies can be gained via support for asynchronous I/O, strided accesses, and control over physical le layout on storage devices (disks). The I/O environment described in this chapter provides these facilities.
Instead of de fining I/O access modes to express the common patterns for accessing a shared file (broadcast, reduction, scatter, gather), we chose another approach in which data partitioning is expressed using derived datatypes. Compared to a limited set of predefi ned access patterns, this approach has the advantage of added exibility and expressiveness.

Pro ling Interface
To meet the MPI pro ling interface, an implementation of the MPI functions must :
  1. provide a mechanism through which all of the MPI de ned functions except those allowed as macros (See Section 2.6.5). This requires, in C and Fortran, an alternate entry point name, with the pre x PMPI_ for each MPI function. The pro ling interface in C++ is described in Section 16.1.10. For routines implemented as macros, it is still required that the PMPI_ version be supplied and work as expected, but it is not possible to replace at link time the MPI_ version with a user-de ned version.
  2. ensure that those MPI functions that are not replaced may still be linked into an executable image without causing name clashes.
  3. document the implementation of di erent language bindings of the MPI interface if they are layered on top of each other, so that the pro ler developer knows whether she must implement the pro le interface for each binding, or can economise by implementing it only for the lowest level routines.
  4. where the implementation of di erent language bindings is done through a layered approach (e.g. the Fortran binding is a set of \wrapper" functions that call the C implementation), ensure that these wrapper functions are separable from the rest of the library.

This separability is necessary to allow a separate pro ling library to be correctly implemented, since (at least with Unix linker semantics) the pro ling library must contain these wrapper functions if it is to perform as expected. This requirement allows the person who builds the pro ling library to extract these functions from the original MPI library and add them into the pro ling library without bringing along any other unnecessary code.

'Classical' cluster and supercomputer implementations
The MPI implementation language is not constrained to match the language or languages it seeks to support at runtime. Most implementations combine C, C++ and assembly language, and target C, C++, and Fortran programmers. Bindings are available for many other languages, including Perl, Python, Ruby, Java, CL.

The initial implementation of the MPI 1.x standard was MPICH, from Argonne National Laboratory (ANL) and Mississippi State University. IBM also was an early implementor, and most early 90s supercomputer companies either commercialized MPICH, or built their own implementation. LAM/MPI from Ohio Supercomputer Center was another early open implementation. ANL has continued developing MPICH for over a decade, and now offers MPICH 2, implementing the MPI-2.1 standard. LAM/MPI and a number of other MPI efforts recently merged to form Open MPI. Many other efforts are derivatives of MPICH, LAM, and other works, including, but not limited to, commercial implementations from HP, Intel, and Microsoft.

MPI Python implementations include: pyMPI, mpi4py, pypar, MYMPI, and the MPI submodule in ScientificPython. PyMPI is notable because it is a variant python interpreter, while pypar, MYMPI, and ScientificPython's module are import modules. They make it the coder's job to decide where the call to MPI_Init belongs. Recently the well known Boost C++ Libraries acquired Boost:MPI which included the MPI Python Bindings. This is of particular help for mixing C++ and Python.
The OCamlMPI Module implements a large subset of MPI functions and is in active use in scientific computing. An eleven thousand line OCaml program was "MPI-ified" using the module, with an additional 500 lines of code and slight restructuring and ran with excellent results on up to 170 nodes in a supercomputer.
Although Java does not have an official MPI binding, several groups attempt to bridge the two, with different degrees of success and compatibility. One of the first attempts was Bryan Carpenter's mpiJava,[15] essentially a collection of JNI wrappers to a local C MPI library, resulting in a hybrid implementation with limited portability, which also has to be compiled against the specific MPI library being used.

However, this original project also defined the mpiJava API[16] (a de-facto MPI API for Java that closely followed the equivalent C++ bindings) which other subsequent Java MPI projects adopted. An alternative, less–used API is MPJ API,[17] designed to be more object-oriented and closer to Sun Microsystems' coding conventions. Beyond the API, Java MPI libraries can be either dependent on a local MPI library, or implement the message passing functions in Java, while some like P2P-MPI also provide peer-to-peer functionality and allow mixed platform operation.

Some of the most challenging parts of Java/MPI arise from Java characteristics such as the lack of explicit pointers and the linear memory address space for its objects, which make transferring multidimensional arrays and complex objects inefficient. Workarounds usually involve transferring one line at a time and/or performing explicit de-serialization and casting at both sending and receiving ends, simulating C or FORTRAN-like arrays by the use of a one-dimensional array, and pointers to primitive types by the use of single-element arrays, thus resulting in programming styles quite far from Java conventions.

The most usable Java message passing system is MPJ Express[citation needed]. Recent versions can be executed in cluster and multicore configurations. In the cluster configuration, it can execute parallel Java applications on clusters and clouds. Here Java sockets or specialized I/O interconnects like Myrinet can support messaging between MPJ Express processes. In the multicore configuration, a parallel Java application is executed on multicore processors. In this mode, MPJ Express processes are represented by Java threads. New users are encouraged to start in multicore mode before moving to clusters or clouds.
Common Language Infrastructure
The two managed CLI (.NET) implementations are Pure Mpi.NET[18] and MPI.NET,[19] a research effort at Indiana University licensed under a BSD-style license. It is compatible with Mono, and can make full use of underlying low-latency MPI network fabrics.
Hardware implementations
MPI hardware research focuses on implementing MPI directly in hardware, for example via Processor-in-memory, building MPI operations into the microcircuitry of the RAM chips in each node. By implication, this approach is independent of the language, OS or CPU, but cannot be readily updated or removed.

Another approach has been to add hardware acceleration to one or more parts of the operation, including hardware processing of MPI queues and using RDMA to directly transfer data between memory and the network interface without CPU or OS kernel intervention.

Example program
Here is a "Hello World" program in MPI written in C. In this example, we send a "hello" message to each processor, manipulate it trivially, return the results to the main process, and print the messages.

"Hello World" MPI Test Program

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
char idstr[32];
char buff[BUFSIZE];
int numprocs;
int myid;
int i;
MPI_Status stat;

MPI_Init(&argc,&argv); /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
MPI_Comm_size(MPI_COMM_WORLD,&numprocs); /* find out how big the SPMD world is */
MPI_Comm_rank(MPI_COMM_WORLD,&myid); /* and this processes' rank is */

/* At this point, all programs are running equivalently, the rank distinguishes
the roles of the programs in the SPMD model, with rank 0 often used specially... */
if(myid == 0)
printf("%d: We have %d processors\n", myid, numprocs);
for(i=1;i numprocs;i++)
sprintf(buff, "Hello %d! ", i);
printf("%d: %s\n", myid, buff);
/* receive from rank 0: */
sprintf(idstr, "Processor %d ", myid);
strncat(buff, idstr, BUFSIZE-1);
strncat(buff, "reporting for duty\n", BUFSIZE-1);
/* send to rank 0: */

MPI_Finalize(); /* MPI Programs end with MPI Finalize; this is a weak synchronization point */
return 0;

Note that the runtime environment for the MPI implementation used (often called mpirun or mpiexec) spawns multiple copies of the program, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, which is an opaque descriptor for communication between the set of processes. A Single-Program-Multiple-Data (SPMD) programming model is thereby facilitated, but not required; many MPI implementations allow multiple, different, executables to be started in the same MPI job. Each process has its own rank, the total number of processes in the world, and the ability to communicate between them either with point-to-point (send/receive) communication, or by collective communication among the group. It is enough for MPI to provide an SPMD-style program with MPI_COMM_WORLD, its own rank, and the size of the world to allow algorithms to decide what to do. In more realistic situations, I/O is more carefully managed than in this example. MPI does not guarantee how POSIX I/O would actually work on a given system, but it commonly does work, at least from rank 0.

MPI uses the notion of process rather than processor. Program copies are mapped to processors by the MPI runtime. In that sense, the parallel machine can map to 1 physical processor, or N where N is the total number of processors available, or something in between. For maximum parallel speedup, more physical processors are used. This example adjusts its behavior to the size of the world N, so it also seeks to scale to the runtime configuration without compilation for each size variation, although runtime decisions might vary depending on that absolute amount of concurrency available.

LAM 7.1.4 Downloads
MPI.NET Source Code (Windows)

0 komentar:

Posting Komentar