Test using local GAT Adaptors
export LD_LIBRARY_PATH=$GAT_LOCATION/lib:$LD_LIBRARY_PATH
export GAT_ADAPTOR_PATH=$GAT_LOCATION/lib/GAT/adaptor-list

To test whether checkpointing using the GAT works, create a file gatrc with the following content:

[resourcebroker_adaptor]
Adaptor=resourcebroker_adaptor
CheckPointAfterIterations=90

CheckPointAfterIterations defines after how many iterations GAT should checkpoint your simulation. Point GAT_CONFIG_FILE to this file:

export GAT_CONFIG_FILE=<location of your gatrc>

Then set the following parameters in your Cactus parameter file:

IOHDF5::checkpoint = "yes"
IO::checkpoint_file = "myFirstGATCheckpoint"
CGAT::replica_home_directory = "/home/gatuser"
CGAT::announce_checkpointfiles = "yes"
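The setup steps above can be collected into a small shell sketch. The gatrc content and the variable names are those given above; the directory used for gatrc (GATRC_DIR, defaulting to a temporary directory) and the /opt/gat fallback for GAT_LOCATION are assumptions made only for illustration.

```shell
#!/bin/sh
# Sketch: write the gatrc described above and export the GAT environment.
# GATRC_DIR and the /opt/gat fallback are hypothetical; adjust to your installation.
GAT_LOCATION="${GAT_LOCATION:-/opt/gat}"
GATRC_DIR="${GATRC_DIR:-$(mktemp -d)}"

# Prepend the GAT libraries without leaving a trailing ':' when the variable was empty.
export LD_LIBRARY_PATH="$GAT_LOCATION/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export GAT_ADAPTOR_PATH="$GAT_LOCATION/lib/GAT/adaptor-list"

# Checkpoint after 90 iterations, as in the example above.
cat > "$GATRC_DIR/gatrc" <<'EOF'
[resourcebroker_adaptor]
Adaptor=resourcebroker_adaptor
CheckPointAfterIterations=90
EOF

export GAT_CONFIG_FILE="$GATRC_DIR/gatrc"
echo "GAT_CONFIG_FILE=$GAT_CONFIG_FILE"
```

Source the script rather than executing it if the exported variables should remain set in your current shell.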
If everything works, Cactus output similar to the following is expected (details may differ in your case):

INFO (CGAT): Invoked checkpoint call-back, checkpoint will be triggered at next iteration
INFO (CGAT): Updating replica at iteration 90.
INFO (CGAT): annihilate logical file
INFO (CGAT): <-- gsiftp://Ikarus/home/robert/TestSuite/./myFirstGATCheckpoint.it_90.h5
INFO (CGAT): --> /home/gatuser/GAT_JOBID:8e5a25e0-1c56-11d9-b031-000d60371fb6/myFirstGATCheckpoint
90 | 1.552 | 0.03823166 | 0.94869581 |
INFO (IOHDF5): ---------------------------------------------------------
INFO (IOHDF5): Dumping termination checkpoint at iteration 90
INFO (IOHDF5): ---------------------------------------------------------
INFO (CGAT): Shutting down the GAT engine
--------------------------------------------------------------------------------
Done.

You should also find a checkpoint file called myFirstGATCheckpoint.it_90.h5 in your working directory.
You should further check that the local (logical) file

/tmp/GAT/gat_logicalfilestore/home/gatuser/GAT_JOBID\:8e5a25e0-1c56-11d9-b031-000d60371fb6/myFirstGATCheckpoint

exists and that its content points to the correct (physical) checkpoint file:

gsiftp://Ikarus/home/robert/TestSuite/./myFirstGATCheckpoint.it_90.h5

The lengthy string (GAT_JOBID\:8e5a25e0-1c56-11d9-b031-000d60371fb6) is called the GAT_JOBID and uniquely identifies every GAT run. You can, however, override the creation of a random string by setting the environment variable $GAT_JOBID to something more convenient.
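The check of the logical file can be scripted, as in the following sketch. The helper name check_logical_file is hypothetical, and the sketch assumes that the logical file is plain text containing the physical URL, which the description above suggests but does not guarantee.

```shell
#!/bin/sh
# Sketch: verify that a logical file points to the expected physical file.
# check_logical_file is a hypothetical helper; it assumes the logical file
# is plain text that contains the physical URL.
check_logical_file() {
    logical=$1      # path below /tmp/GAT/gat_logicalfilestore
    physical=$2     # expected gsiftp:// URL of the checkpoint file
    if [ ! -f "$logical" ]; then
        echo "missing logical file: $logical" >&2
        return 1
    fi
    # -F matches the URL literally instead of as a regular expression.
    grep -qF -- "$physical" "$logical"
}
```

For the run shown above, the first argument would be the logical file path and the second the gsiftp:// URL of myFirstGATCheckpoint.it_90.h5. Inside single quotes the colon in the GAT_JOBID directory name needs no backslash.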
Test using remote GAT adaptors
Many remote adaptors are currently being developed. Check the GridLab web pages for available remote adaptors designed to access GridLab services. The following remote adaptors are currently available on the GridLab Testbed:

gridlab_util_gsoap_adaptor (needed to support adaptors using gsoap)
gridlab_file_adaptor (remote file operations)
gridlab_logicalfile_adaptor (access to the GridLab replica service)
gridlab_advertservice_adaptor (access to the GridLab advertise service)
gridlab_monitoring_adaptor (access to the GridLab resource monitoring service)
gridlab_resource_adaptor (access to the GridLab resource management service)
gridlab_tracing_adaptor (access to the GridLab logging service)

If you want to build remote adaptors on your own, check the GridLab adaptor release page for further details.
List all local adaptors you intend to use (full path and name) in $GAT_ADAPTOR_PATH.
List all remote adaptors you intend to use (full path and name) in $GAT_ADAPTOR_PATH.

Our purpose here is to checkpoint the Cactus application using GRMS (libgridlab_resource_adaptor), even if the application is running on internal cluster nodes (libgridlab_monitoring_adaptor). For this, the following adaptor list (GAT_ADAPTOR_LIST) would be reasonable:

# absolute path names to local adaptors
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libfileops_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libfilestream_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libadvertservice_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libendpoint_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libresourcebroker_adaptor.la
# absolute path names to remote adaptors
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_util_gsoap_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_logicalfile_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_advertservice_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_monitoring_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_resource_adaptor.la
/mnt/shared/people/glab010/gat_installed/lib/GAT/adaptors/libgridlab_tracing_adaptor.la

The checkpoint files and output would be registered using the GridLab replica service (libgridlab_logicalfile_adaptor).
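A mistyped path in the adaptor list is easy to overlook, so it can help to check the list before a run. The following sketch assumes only the format shown above: one absolute path per line, with comment lines starting with #. The function name check_adaptor_list is hypothetical and not part of GAT.

```shell
#!/bin/sh
# Sketch: report every adaptor named in an adaptor list file that does not exist.
# check_adaptor_list is a hypothetical helper, not part of GAT.
check_adaptor_list() {
    list=$1
    missing=0
    # The second test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        if [ ! -f "$line" ]; then
            echo "missing adaptor: $line" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    # Succeed only if every listed adaptor was found.
    [ "$missing" -eq 0 ]
}
```

A typical call would be check_adaptor_list "$GAT_ADAPTOR_PATH", which returns a non-zero status and names the offending entries if any listed file is absent.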
If you have access to the GridLab Testbed, you can take advantage of a pre-installed GAT installation, including remote adaptors, on most resources. Check the state of the testbed to see which resources are properly configured with GAT.

source /etc/gridlab.conf to set all necessary environment variables.
Integrate CGAT into your Cactus application.
Start up your simulation and initiate a checkpoint using either the GRMS command line client, which is part of the GridLab Resource Management System, or the Cactus Portal. To checkpoint your Cactus simulation using the Cactus Portal, you will need a valid portal account.
Output when initializing the GAT engine and accessing the GridLab Replica Service to register the output directory:

INFO (CGAT): Initializing the GAT engine
INFO (CGAT): Registering checkpoint capabilities with the GAT
INFO (CGAT): Announcing output directory to replica service
GSI plugin for gSOAP v2.4: Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
GSI plugin for gSOAP v2.4: Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
GSI plugin for gSOAP v2.4: Established security context with: /O=Grid/O=GridLab/CN=litchi.zib.de
INFO (CGAT): Announcing output directory ...
INFO (CGAT): <-- gsiftp://rage1.man.poznan.pl/mnt/shared/people/glab040/output/
INFO (CGAT): --> /home/glab040/cactus/output

Remote execution using GRMS
On the server side, the GridLab Resource Management Service (GRMS) must be installed on a server within your grid environment.
On the client side, you will need the command line client from the GRMS distribution on every host you intend to use for job submission. Both are available for download: get the source code together with the examples and follow the instructions. For further reading, please have a look at the GRMS User's and Administrator's Guide.
Job submission to the resource management system is handled using a job description in XML format. The command line client supports the following commands:

ws_client.sh submit <jobDescription.xml>
ws_client.sh migrate <jobId> [<jobDescription.xml>]
ws_client.sh cancel <jobId>
ws_client.sh info <jobId>
The following example demonstrates how to submit a job to the resource management system without using a queuing system like PBS on the target host. Simply create a file non-parallel.xml with the following content:

<grmsjob appid="appid" persistent="true">
  <simplejob>
    <resource>
      <hostname>skirit.ics.muni.cz</hostname>
    </resource>
    <executable type="single" count="1">
      <file name="cactus_wavetoy_serial.sh" type="in">
        <url>file:////${HOME}/demo/scripts/cactus_wavetoy_serial.sh</url>
      </file>
      <stdout>
        <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.out</url>
      </stdout>
      <stderr>
        <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.err</url>
      </stderr>
    </executable>
  </simplejob>
</grmsjob>

In this case, we want to execute the shell script cactus_wavetoy_serial.sh on a remote host called skirit.ics.muni.cz. The stdout and stderr are supposed to be copied to peyote.aei.mpg.de after successful job execution. The job is submitted by calling the command line client:
[robert@Ikarus bin]$ ./ws_client.sh submit non-parallel.xml
- Your DN: /C=US/O=National Center for Supercomputing Applications/CN=Robert Engel
- Service URL: httpg://rage1.man.poznan.pl:8543/axis/services/grms
- Job submitted successfully, jobId=1097768271008_appid_5556

In this case the GridLab Resource Management Service is running on a server named rage1.man.poznan.pl, which is part of the GridLab Testbed. If the job description in non-parallel.xml does not contain errors, the job ID (JOBID) is returned. You can use this JOBID to query the status of your job or to request a migration or cancellation.
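For scripting, the job ID can be extracted from the client output. The sketch below assumes the output format shown above, that is, a line containing jobId=<id>; the function name extract_job_id is hypothetical.

```shell
#!/bin/sh
# Sketch: read ws_client.sh output on stdin and print the first job ID found.
# Assumes a line of the form "... Job submitted successfully, jobId=<id>".
extract_job_id() {
    sed -n 's/.*jobId=\([^[:space:]]*\).*/\1/p' | head -n 1
}
```

Assuming the client writes these messages to standard output, the result of ./ws_client.sh submit non-parallel.xml could be piped into extract_job_id and the captured value passed to ws_client.sh info, migrate, or cancel. An empty result indicates that no jobId line was found.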
Now we want to submit a job to a queuing system on the target host using the Resource Management System. This requires only minor changes to the job description XML file:

<grmsjob appid="appid" persistent="true">
  <simplejob>
    <resource>
      <hostname>skirit.ics.muni.cz</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="mpi" count="4">
      <file name="cactus_physics_parallel.sh" type="in">
        <url>file:////${HOME}/demo/scripts/cactus_physics_parallel.sh</url>
      </file>
      <stdout>
        <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.out</url>
      </stdout>
      <stderr>
        <url>gsiftp://peyote.aei.mpg.de/${HOME}/run.err</url>
      </stderr>
    </executable>
  </simplejob>
</grmsjob>

Note: Apart from the script name, the only differences are the executable type, which is now "mpi", and the number of nodes to run on, which is "4". In addition, "pbs" is specified as the local resource manager to be used for queuing.

GRMS will not handle environment settings by default (.bash_profile, /etc/gridlab.conf, etc.). If you need certain environment settings for your executable to run (LD_LIBRARY_PATH), you might want to use a script (cactus_wavetoy_serial.sh) to set your environment correctly and to start the executable.
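Such a wrapper could look like the following sketch. The contents of the real cactus_wavetoy_serial.sh are not shown in this document, so everything here is an assumption: the function name run_with_gat_env, the /opt/gat fallback for GAT_LOCATION, and the choice to source /etc/gridlab.conf only when it is readable.

```shell
#!/bin/sh
# Sketch of an environment wrapper; hypothetical, not the actual cactus_wavetoy_serial.sh.
run_with_gat_env() {
    (
        # Pick up the testbed settings if this host provides them.
        if [ -r /etc/gridlab.conf ]; then
            . /etc/gridlab.conf
        fi
        GAT_LOCATION="${GAT_LOCATION:-/opt/gat}"   # assumed fallback location
        # Prepend the GAT libraries so the executable can resolve them at start-up.
        export LD_LIBRARY_PATH="$GAT_LOCATION/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        # Replace the subshell with the command given as arguments.
        exec "$@"
    )
}
```

The subshell keeps the changed environment from leaking into the calling shell. In a standalone wrapper script submitted through GRMS, the body of the function would be the script itself, with the Cactus executable and its parameter file in place of "$@".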
The option persistent="true" keeps all your job output within a working directory on the target host. By default, the working directory used for output is deleted after the job has finished (successfully or not). If you intend to use the GridLab Resource Management System on the GridLab Testbed, first check that GRMS is up and running on the target host.

Troubleshooting
If you have access to the GridLab Testbed, you can take advantage of a system-wide Cactus installation. Check the current state of the testbed first, then set up your environment:

source /etc/gridlab.conf

In $CACTUS_LOCATION ($CACTUS_DEV_LOCATION) you will find several executables properly linked against the system-wide GAT installation:

cactus_wavetoy_serial (non-MPI version of a wavetoy demo)
cactus_wavetoy_parallel (parallel version of a wavetoy demo)
cactus_physics_parallel (parallel version of a black hole head-on collision)

Try to execute $CACTUS_LOCATION/cactus_wavetoy_serial.sh to check that:

the system-wide GAT installation in $GAT_LOCATION is working properly
Cactus with CGAT could be built and linked against the GAT
GridLab resources like the replica service can be contacted
Cactus - Steve White, Thomas Radke
Cactus Portal - Michael Russel
CGAT - Robert Engel, Thomas Radke
GAT - Hartmut Kaiser, Kelly Davis
GRMS - Tomasz Piontek, Juliusz Pukacki
Gridlab Replica Service - Thorsten Schuett
Testbed - testbed mailing list