ARC
INTRODUCTION TO MPI
GEORGI YANAKIEV (YANAKIEV@ARC.UNM.EDU)
MARK ENLOW (MENLOW@ARC.UNM.EDU)
CHUANYI DING (DING@ARC.UNM.EDU)
JIM WARSA (WARSA@BOLTZMANN.UNM.EDU)
© Copyright 1995, The Maui Project/University of New Mexico
_________________________________________________________________
WHAT IS MPI?
* Message passing is the communication model used on massively
parallel machines with distributed-memory architectures (can be
used on shared memory machines also)
* The goal is to establish a widely used, language- and
  platform-independent standard for writing message-passing programs
* The interface should be practical, portable, efficient, and
  flexible, without departing too far from current practice
MPI STANDARDIZATION EFFORT
* About 60 people from 40 organizations in the United States and
Europe participated
* Influenced by:
+ Venus (IBM)
+ NX/2 (INTEL)
+ Express (Parasoft)
+ Vertex (nCUBE)
+ P4 (ANL)
+ PARMACS (ANL)
* Other Contributions:
+ Zipcode (MSU)
+ Chimp (Edinburgh University)
+ PVM (ORNL, UTK, Emory U.)
+ Chameleon (ANL)
+ PICL (ANL)
MILESTONES OF THE MPI FORUM
* April 1992: Williamsburg, VA, Workshop on Message Passing
Standards
* November 1992: MPI draft proposal (MPI1) from ORNL
* November 1992: Minneapolis working group meeting and e-mail
discussion
* November 1993: Draft MPI standard at Supercomputing '93
* May 5, 1994: Version 1.0 of the MPI standard (the current version)
* Meetings and e-mail discussion groups constitute the MPI Forum
_________________________________________________________________
FURTHER INFORMATION
* "MPI: A Message-Passing Interface Standard" document available at
ARC in /usr/local/ftp/pub/mpi-report.ps (comments@cs.utk.edu)
* Usenet newsgroup comp.parallel.mpi
* WWW Sites
+ UTK/ORNL http://www.netlib.org/mpi/index.html
+ ANL http://www.mcs.anl.gov/mpi
+ MSU http://www.erc.msstate.edu/mpi
+ LAM Project http://www.osc.edu/lam.html
_________________________________________________________________
TARGET IMPLEMENTATION
* MIMD parallel computation models, including SPMD
* Distributed-memory clusters and multi-processors
* Shared-memory platform support
* Support for virtual process topologies
* No dynamic task spawning (fixed number of available processes)
* Initial process allocation, binding of processes to physical
  processors, and interprocessor hardware communication are left to
  vendor implementations
* Explicit shared-memory operations, I/O functions, and task
management are not specified in the standard
LANGUAGE BINDING
* Fortran 77 and ANSI C
* F90 is assumed to follow the Fortran 77 binding and C++ the ANSI C
binding
* Standards for message passing between languages are not specified
* MPI_ is prefixed to all MPI names (functions, subroutines, and
constants)
_________________________________________________________________
POINT-TO-POINT COMMUNICATION
* Message sending example
       CALL MPI_SEND (mesg, len(mesg), MPI_CHARACTER, 1, 99,
                      MPI_COMM_WORLD, ierror)
* Message receiving example
       CALL MPI_RECV (mesg, 20, MPI_CHARACTER, 0, 99,
                      MPI_COMM_WORLD, status, ierror)
* Message envelope: source, destination, tag, communicator
* Blocking and non-blocking modes available
* Standard, buffered, ready, and synchronous modes available
* Data type matching at all three stages of message-passing
communication
* Data type and representation conversion to support message-passing
in a heterogeneous environment
* Communication semantics (order, progress, fairness, and resource
limitations) are specified by the standard
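* Putting the pieces above together, a minimal two-process exchange in
  the Fortran binding might look like the sketch below (the program
  name, variable names, and message text are illustrative, not part of
  the standard):

      program ptp
c     Hedged sketch: process 0 sends a character string to process 1,
c     which receives and prints it.  Run with at least two processes.
      include 'mpif.h'
      integer ierror, rank
      integer status(MPI_STATUS_SIZE)
      character*20 mesg
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      if (rank .eq. 0) then
         mesg = 'Hello from rank 0'
         call MPI_SEND(mesg, 20, MPI_CHARACTER, 1, 99,
     &                 MPI_COMM_WORLD, ierror)
      else if (rank .eq. 1) then
         call MPI_RECV(mesg, 20, MPI_CHARACTER, 0, 99,
     &                 MPI_COMM_WORLD, status, ierror)
         print *, 'Process 1 received: ', mesg
      end if
      call MPI_FINALIZE(ierror)
      end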
NON-BLOCKING POINT-TO-POINT COMMUNICATION
* Performance improvement by overlapping communication and
computation
* Communication initiation (the prefix I is for immediate)
MPI_ISEND (buf,count,datatype,dest,tag,comm,request,ierror)
MPI_IRECV (buf,count,datatype,source,tag,comm,request,ierror)
* Communication completion
MPI_WAIT (request, status, ierror)
MPI_TEST (request, flag, status, ierror)
MPI_REQUEST_FREE (request, ierror)
* Multiple completion wait/test also available
* Incoming messages can be checked without actually receiving them:
flag returns true if a match occurs and status returns the
message envelope
MPI_IPROBE (source, tag, comm, flag, status, ierror)
* Pending communication requests can be cancelled to gracefully free
resources
MPI_CANCEL (request, ierror)
* A blocking probe (MPI_PROBE) is also available
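* As a sketch of the overlap idea (an illustrative program, not taken
  from the standard), each of two processes below posts a receive,
  starts a send, can do local work, and only then waits for completion:

      program nblock
c     Hedged sketch of non-blocking point-to-point communication.
c     Assumes exactly two processes; each exchanges an array with the
c     other while local computation could proceed before MPI_WAIT.
      include 'mpif.h'
      integer ierror, rank, other, sreq, rreq, i
      integer status(MPI_STATUS_SIZE)
      real sbuf(1000), rbuf(1000)
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      other = 1 - rank
      do 10 i = 1, 1000
         sbuf(i) = real(rank)
   10 continue
      call MPI_IRECV(rbuf, 1000, MPI_REAL, other, 10,
     &               MPI_COMM_WORLD, rreq, ierror)
      call MPI_ISEND(sbuf, 1000, MPI_REAL, other, 10,
     &               MPI_COMM_WORLD, sreq, ierror)
c     ... local computation overlapping the communication goes here ...
      call MPI_WAIT(rreq, status, ierror)
      call MPI_WAIT(sreq, status, ierror)
      call MPI_FINALIZE(ierror)
      end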
_________________________________________________________________
COLLECTIVE COMMUNICATION
* The standard specifies collective communication operations over
groups of processes
* Barrier synchronization across all group members
MPI_BARRIER (comm, ierror)
* Broadcast from one member of the group to all other members
MPI_BCAST (buffer, count, datatype, root, comm, ierror)
* Gather data from all group members to one process
MPI_GATHER (sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm, ierror)
* Scatter data from one group member to all other members
MPI_SCATTER (sendbuf, sendcount, sendtype, recvbuf,
recvcount, recvtype, root, comm, ierror)
* Global reduction operations (MPI_REDUCE, MPI_ALLREDUCE) with
  predefined operations such as max, min, sum, and product are also
  available
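* As an illustrative sketch (program and variable names are not from
  the standard), the root below broadcasts a problem size to every
  process in MPI_COMM_WORLD and then gathers one value back from each
  group member:

      program coll
c     Hedged sketch of collective communication over MPI_COMM_WORLD.
c     Rank 0 broadcasts an integer; every rank contributes one real
c     value, which rank 0 gathers and prints.
      include 'mpif.h'
      integer maxp
      parameter (maxp = 128)
      integer ierror, rank, nprocs, n, i
      real myval, vals(maxp)
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      if (rank .eq. 0) n = 1000
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierror)
      myval = real(rank) * real(n)
      call MPI_GATHER(myval, 1, MPI_REAL, vals, 1, MPI_REAL,
     &                0, MPI_COMM_WORLD, ierror)
      if (rank .eq. 0) print *, (vals(i), i = 1, nprocs)
      call MPI_FINALIZE(ierror)
      end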
_________________________________________________________________
MPI ENVIRONMENTAL AND PROCESS MANAGEMENT
* Startup and shutdown
MPI_INIT (ierror)
MPI_FINALIZE (ierror)
MPI_INITIALIZED (flag, ierror)
MPI_ABORT (comm, errorcode, ierror)
* Process management
MPI_COMM_RANK (comm, rank, ierror)
MPI_COMM_SIZE (comm, size, ierror)
* To use all available processes in a single process group without
further management, use MPI_COMM_WORLD as the communicator
* Process group communicators, virtual process topologies, and
communication context functions are provided to manage
communication "universes" between and among process groups
* Environmental inquiry functions, program execution timers, and
error handling routines are also specified in the standard
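* The usual skeleton built from these routines might look like the
  sketch below (illustrative program; MPI_ABORT is shown only as the
  way to terminate all processes on a fatal error):

      program skel
c     Hedged sketch of MPI startup and shutdown.  Every MPI program
c     brackets its work with MPI_INIT and MPI_FINALIZE and usually
c     queries its rank and the group size right away.
      include 'mpif.h'
      integer ierror, rank, nprocs
      logical flag
      call MPI_INIT(ierror)
      call MPI_INITIALIZED(flag, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      if (nprocs .lt. 2) then
         if (rank .eq. 0) print *, 'This example needs 2 processes'
         call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
      end if
      print *, 'Process ', rank, ' of ', nprocs, ' is alive'
      call MPI_FINALIZE(ierror)
      end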
_________________________________________________________________
Running MPI Programs at ARC
MPI IMPLEMENTATIONS AT ARC
1. MPICH (MPI/Chameleon) - Argonne National Laboratory and Mississippi
   State University
2. LAM (Local Area Multicomputer) - Ohio Supercomputer Center
3. CHIMP (Common High-level Interface to Message Passing) - Edinburgh
Parallel Computing Centre
_________________________________________________________________
EXAMPLES
Initial Example
The initial examples use mpich, the Argonne MPI implementation.
* Set up your .rhosts file so that rsh works between the machines: it
  should list the machines you will use and your username (this is
  simplest if your login is the same on all machines)
* Create working directory
> mkdir MPI
> cd MPI
* Obtain the example files
> cp /usr/local/mpi/workshop/examples/* .
* Your working directory should now contain the files:
Makefile
test1.F
test2.F
test3.F
* In order to run each program you have to create, in the working
directory, a file called filename.pg (e.g., test1.pg). This
processor group file should contain:
local 0
machineA 1 /u/username/MPI/test1
machineB 1 /u/username/MPI/test1
Create similar files, test2.pg and test3.pg, for the other two programs.
* This processor group file starts execution on the local machine and
  two remote machines. Running test1 on the local machine starts three
  processes: the "master" process on the local machine and one process
  on each of machineA and machineB.
* A sample processor group file (e.g., test1.pg), for running on
ARC, is the following:
local 0
taos 1 /u/username/MPI/test1
zia 1 /u/username/MPI/test1
* Therefore, if you are logged onto acoma, the "master" process will
  run on acoma (the local machine) and two more processes will run on
  taos and zia.
* In order to compile a program (e.g., test1.F), type:
> make test1
* In order to run the above program, type:
> test1
* By modifying the corresponding processor group file you may run
the program on 1, 2, or 3 processors.
* When running the programs, messages on the screen refer to
  'Process 0' (or 1, or 2). The rank of each process corresponds to
  the order in which the machines appear in the processor group file.
  For the file test1.pg shown earlier:
acoma is Process 0
taos is Process 1
zia is Process 2
File: test1.F
A simple exchange of messages between two processes. One
process sends two messages (two arrays), which are received
in the reverse order by the other process.
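A hedged sketch of the same idea (not the actual test1.F file; the
names and array sizes below are illustrative):

      program swap
c     Sketch of the test1.F idea: rank 0 sends two small arrays with
c     different tags; rank 1 receives them in the reverse order,
c     relying on tag matching (and on the system buffering the first,
c     still-unmatched message).
      include 'mpif.h'
      integer ierror, rank, i
      integer status(MPI_STATUS_SIZE)
      real a(10), b(10)
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      if (rank .eq. 0) then
         do 10 i = 1, 10
            a(i) = real(i)
            b(i) = real(10*i)
   10    continue
         call MPI_SEND(a, 10, MPI_REAL, 1, 1, MPI_COMM_WORLD, ierror)
         call MPI_SEND(b, 10, MPI_REAL, 1, 2, MPI_COMM_WORLD, ierror)
      else if (rank .eq. 1) then
         call MPI_RECV(b, 10, MPI_REAL, 0, 2, MPI_COMM_WORLD,
     &                 status, ierror)
         call MPI_RECV(a, 10, MPI_REAL, 0, 1, MPI_COMM_WORLD,
     &                 status, ierror)
      end if
      call MPI_FINALIZE(ierror)
      end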
Files: test2.F and test3.F
Calculation of the definite integral of a function using
the trapezoid rule. The range of integration is divided
into n sub-intervals. Each process calculates the result
of integration over a specific number of sub-intervals.
Each process sends its result to a root process, where
the final result is stored. Program test2.F uses point-to-point
communication and program test3.F uses collective communication.
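A hedged sketch of the collective (test3.F-style) version, with an
illustrative integrand f(x) = x**2 on [0,1]; the program and variable
names are not taken from the actual files:

      program trap
c     Each process applies the trapezoid rule to every nprocs-th
c     sub-interval of [a,b]; MPI_REDUCE sums the partial results on
c     the root process, where the final answer is printed.
      include 'mpif.h'
      integer ierror, rank, nprocs, n, i
      real a, b, h, x, local, total
      call MPI_INIT(ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
      a = 0.0
      b = 1.0
      n = 1000
      h = (b - a) / real(n)
      local = 0.0
      do 10 i = rank, n - 1, nprocs
         x = a + real(i) * h
         local = local + 0.5 * h * (x*x + (x + h)*(x + h))
   10 continue
      call MPI_REDUCE(local, total, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierror)
      if (rank .eq. 0) print *, 'Integral is approximately ', total
      call MPI_FINALIZE(ierror)
      end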
_________________________________________________________________
More Advanced Examples (not currently available, due to hardware problems)
These examples demonstrate the heterogeneous makefile system that
quickly switches between the various MPI implementations and
automatically performs the make for each of the Unix machines
(RS/6000, SGI, HP, and SUN).
   Package Name   Developer                    Location of Package
   mpich          Argonne and MSU              /usr/local/mpich
   chimp          Edinburgh                    /usr/local/chimp
   lam            Ohio Supercomputer Center    /usr/local/lam52
Need for a Makefile System
Switching between 3 MPI implementations
The ability to switch between three different MPI
implementations allows the developer to check
portability/compatibility, helps with debugging, and allows the
selection of the optimum implementation for an individual
application and environment.
Heterogeneous cluster of workstations
A single make invocation compiles code for all Unix
architectures, greatly simplifying the use of a heterogeneous
cluster of workstations.
HETEROGENEOUS MAKEFILE
# **** Architecture list ****
ALL: aix_build sgi_build

# **** Default architecture (needed to prevent an include error) ****
ARCH = rs6000

# ---- Switches between MPI implementations: uncomment one pair ----
# Settings for mpich using p4
#MPI  = mpich
#COMM = ch_p4
# Settings for chimp
#MPI  = chimp
#COMM =
# Settings for lam
MPI  = lam
COMM =
# ---- End settings ----

# **** Working directory (set to the source code directory) ****
WORK_DIR = mpi

RM = rm

# **** All architecture and MPI implementation differences ****
# **** are handled with the following include ****
include ../$(ARCH).$(MPI).$(COMM).make

programs: $(ARCH)/test1 $(ARCH)/first

# **** Recursive make for each architecture ****
aix_build:
	rsh acoma "cd $(WORK_DIR); make -e ARCH=rs6000 programs"
sgi_build:
	rsh cochiti "cd $(WORK_DIR); make -e ARCH=sgi programs"

# *** Instructions to create objects and executables ***
$(ARCH)/first: $(ARCH)/first.o Makefile
	cd $(ARCH); $(CC) -o first first.o $(LOPT)
$(ARCH)/first.o: first.c Makefile
	cd $(ARCH); $(CC) -c $(COPT) ../first.c
$(ARCH)/test1: $(ARCH)/test1.o Makefile
	cd $(ARCH); $(FC) -o test1 test1.o $(FLOPT)
$(ARCH)/test1.o: test1.F Makefile
	cd $(ARCH); $(FC) -c $(FCOPT) ../test1.F

# *** Housekeeping -- please use ***
clean:
	$(RM) -f rs6000/*
	$(RM) -f sgi/*
MAKEFILE SYSTEM DEMONSTRATION
1. Create a working directory called mpi under your home directory
2. Copy all files from /usr/local/mpi to this directory
3. Correct .cshrc
a. run /usr/local/bin/reset_environment
OR
b. compare with /usr/local/environment/.cshrc and correct manually
4. Edit Makefile and uncomment the two lines for the desired
implementation
5. Type make to create executables
6. Edit machine configuration files
a. mpich --> *first.pg
b. chimp --> *first.chp
c. lam --> *first.bhost
lamboot.conf
* user dependent paths!
7. Run executables
a. mpich
rs6000/first -p4pg ../first.pg
b. chimp
chimp first.chp
c. lam
lam52_mpi first.bhost
ADDITIONAL EXAMPLES
/usr/local/mpich/examples
/usr/local/chimp/examples
/usr/local/lam52/examples
_________________________________________________________________
© Copyright, The Maui Project/University of New Mexico
Last revised on March 1, 1995
by Georgi Yanakiev, yanakiev@arc.unm.edu