This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Releases: PanDAWMS/pilot

68.1

09 Aug 15:06

Summary

Nightlies

  • Now supporting nightlies GIT releases (Atlas-N.N.N-GIT)
  • Requested by Tulay Cuhadar Donszelmann

Benchmarks

  • Adding benchmark dictionary to machine section when available
  • Avoid unwanted and duplicated info in benchmark dictionary, adding new info
  • Found a new problem with the benchmark command: it uses the Python argparse module, which is part of the standard library only from Python 2.7 onward. Developers have been notified (a solution might take time since the primary developer is leaving)
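
A guard against the argparse problem could be sketched as follows. This is only an illustration: has_argparse() is a hypothetical helper that is not part of the pilot, and it assumes the benchmark command runs under the same Python interpreter as the check.

```python
def has_argparse():
    """Return True if the argparse module can be imported.

    argparse joined the standard library in Python 2.7, so on older
    interpreters it is only present if it was installed separately.
    """
    try:
        import argparse  # noqa: F401 (imported only to test availability)
    except ImportError:
        return False
    return True


# Hypothetical use: skip the benchmark instead of letting it fail on import.
if not has_argparse():
    print("argparse not available, skipping benchmark")
```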

Event Service

  • Fixed a problem with special athena jobOptions in release 21.0.15. The problem was in the handling of the jobOptions python string: the jps part was combined with EVNTtoHITS in the same block
  • Fixed issue with some athena 20 releases that have _000 at the end of the filename of the produced files
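
The release notes do not describe how the _000 check works. One possible approach, shown purely as an assumption, is to look for the expected file name with "_000" appended and rename it back. The function name fix_000_suffix() is hypothetical.

```python
import os


def fix_000_suffix(expected_path):
    """Make sure a produced file exists under its expected name.

    If expected_path is missing but expected_path + '_000' exists, the
    file is renamed. Returns the usable path, or None if neither exists.
    """
    if os.path.exists(expected_path):
        return expected_path
    candidate = expected_path + "_000"
    if os.path.exists(candidate):
        os.rename(candidate, expected_path)
        return expected_path
    return None
```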

Contributors: W. Guan, P. Nilsson

Version info

General changes:

  • Removed OtherSiteMover from distribution. Not needed.
  • Removed getUtilityCommandOld() (ATLASExperiment)
  • Cleaned up deprecated new site mover function and related code (Mover)

ESS:

  • Lots of internal changes related to ESS (RunJobEvent)

Benchmarks:

  • Removed --freetext option from benchmark command in getBenchmarkCommand() (ATLASSiteInformation)
  • Removed key 'cpuname' from the benchmark dictionary in getBenchmarkDictionary() (JobLog)
  • Sending sitename and computingElement to getBenchmarkDictionary() from postJobTask() (JobLog)
  • Added sitename and queuename arguments to getBenchmarkDictionary() (JobLog)
  • Adding sitename and queuename to benchmark dictionary (JobLog)
  • Added argument section to addToJobReport() (FileHandling)
  • Now adding key+value to section in addToJobReport() (FileHandling)
  • Added sitename argument to getBenchmarkSubprocess() (RunJob)
  • Sending sitename to getBenchmarkSubprocess() (RunJob, RunJobEvent)
  • Avoiding running benchmark suite on ANALY sites, since the user jobs will not produce the jobReport, in getBenchmarkSubprocess() (RunJob)
  • Sending subsection to addToJobReport() (JobLog)
  • Added argument subsection to addToJobReport() (FileHandling)
  • Added possibility to set jobReport subsection in addToJobReport() (FileHandling)
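
The section and subsection arguments of addToJobReport() can be pictured with the sketch below. It is not the pilot's implementation: the function name, its signature and the "resource"/"machine" layout of the report are assumptions made for illustration, and the site and queue names are placeholders.

```python
import json


def add_to_job_report(report, key, value, section=None, subsection=None):
    """Store key/value in a job report dictionary.

    Without a section the pair is stored at the top level. With a section
    (and optionally a subsection) the nested dictionaries are created on
    demand and the pair is stored in the innermost one.
    """
    target = report
    for name in (section, subsection):
        if name:
            target = target.setdefault(name, {})
    target[key] = value
    return report


# Example: add a benchmark dictionary to an assumed resource/machine section.
report = {"resource": {"machine": {"hostname": "worker01"}}}
benchmark = {"sitename": "SITE_A", "queuename": "QUEUE_A"}
add_to_job_report(report, "benchmark", benchmark,
                  section="resource", subsection="machine")
print(json.dumps(report, indent=2, sort_keys=True))
```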

Nightlies:

  • Added -GIT to getCacheInfo() to fix a case where nightlies were not setup correctly. Requested by Johannes Elmsheuser (ATLASExperiment)
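
How getCacheInfo() recognises nightlies is not shown in these notes. A minimal sketch, assuming a simple marker test on the release string, covers the three forms mentioned here ('21.0.X-VAL', '21.0.X' and Atlas-N.N.N-GIT). The names NIGHTLY_MARKERS and looks_like_nightly() are hypothetical.

```python
# Markers taken from the release notes; matching them by substring is an
# assumption, not the pilot's confirmed logic.
NIGHTLY_MARKERS = ("-VAL", ".X", "-GIT")


def looks_like_nightly(release):
    """Return True if the release string carries one of the nightly markers."""
    return any(marker in release for marker in NIGHTLY_MARKERS)


for name in ("21.0.X-VAL", "21.0.X", "Atlas-21.0.15-GIT", "Atlas-21.0.15"):
    print(name, looks_like_nightly(name))
```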

Updates from Wen Guan:

#122

In athena 21.0.15, if we combine the jps part with the EVNTtoHITS part in one block, athena will not be able to start. So we need to use a separate block, as below, to preExec the jps part.
--preExec 'from AthenaMP.AthenaMPFlags import jobproperties as jps;jps.AthenaMPFlags.EventRangeChannel="EventService_EventRanges-9419a6ca-4de9-4fcf-9b7d-5e1045e998b2"' 'EVNTtoHITS:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)' 'EVNTtoHITS:simFlags.TRTRangeCut=30.0;simFlags.TightMuonStepping=True'
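
The idea of keeping the jps part in its own block can be sketched like this. The helper build_preexec() is hypothetical and only reproduces the command line quoted above; the actual change lives in RunJobEvent.py and may look different. shlex.quote() needs Python 3.3 or later.

```python
import shlex


def build_preexec(channel_name, substep_blocks):
    """Build a --preExec argument with the jps setting in a separate block.

    channel_name: value for jps.AthenaMPFlags.EventRangeChannel
    substep_blocks: existing blocks, e.g. 'EVNTtoHITS:...'
    """
    jps_block = ('from AthenaMP.AthenaMPFlags import jobproperties as jps;'
                 'jps.AthenaMPFlags.EventRangeChannel="%s"' % channel_name)
    # The jps block comes first and is never merged into an EVNTtoHITS block.
    blocks = [jps_block] + list(substep_blocks)
    return "--preExec " + " ".join(shlex.quote(block) for block in blocks)


cmd = build_preexec(
    "EventService_EventRanges-9419a6ca-4de9-4fcf-9b7d-5e1045e998b2",
    ["EVNTtoHITS:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)",
     "EVNTtoHITS:simFlags.TRTRangeCut=30.0;simFlags.TightMuonStepping=True"])
print(cmd)
```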

Commit Summary

fix problem when pilot adding es special part to athena joboptions

File Changes

M RunJobEvent.py (6)

#123

Commit Summary

fix objectstore sitemover

File Changes

M S3ObjectstoreSiteMover.py (2)

#124

Commit Summary

fix merge problem
patch to check _000 and auto fix it
fix type error from merge

File Changes

M RunJobEvent.py (2)
M SiteInformation.py (2)
M pUtil.py (13)

68.0

09 Aug 15:06

Summary

Memory Monitoring update

  • Now using release 21.0.18 (I/O values are stored as absolute values and I/O rates are reported as integers)
  • Requested by Johannes Elmsheuser

Benchmarking

  • 1/100 jobs run cern-benchmark in offline mode
  • Chained benchmarks: Whetstone and fastBmk
  • Using freetext = “Whetstone+fastBmk” for ES use
  • Executed during stage-in to minimize wasting time
  • Waiting for benchmark to finish before executing payload
  • Output dictionary is added to jobReport.json
  • Note: After the ADC Weekly meeting we added IP and hostname to the output dictionary as well. The hostname is however available elsewhere in the jobReport (machine section). We might want to avoid duplication of info.
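
Running the benchmark during stage-in and waiting for it before the payload starts can be sketched as below. All function names and the stand-in commands are hypothetical: a short Python one-liner takes the place of the real benchmark and a sleep takes the place of stage-in.

```python
import subprocess
import sys
import time


def start_benchmark():
    """Launch the (stand-in) benchmark as a background subprocess."""
    # Placeholder command; a real pilot would launch the benchmark suite here.
    cmd = [sys.executable, "-c",
           "import time; time.sleep(0.2); print('benchmark done')"]
    return subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=True)


def stage_in():
    """Stand-in for stage-in, during which the benchmark keeps running."""
    time.sleep(0.1)
    return ["input.file"]


def run_job():
    benchmark = start_benchmark()        # runs concurrently with stage-in
    inputs = stage_in()
    output, _ = benchmark.communicate()  # wait until the benchmark finishes
    # Only at this point would the payload be started.
    return inputs, benchmark.returncode, output.strip()
```

Starting the subprocess before stage-in means that, when stage-in takes longer than the benchmark, the wait before the payload costs no extra time.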

Fix for missing traces when using remote i/o

  • Caused data to be deleted even though it had recently been accessed

Support for (mainly) new HC options

  • --useTestASetup
    Sets special env variable (ALRB_asetupVersion) that will activate test version of asetup
  • --useTestXRootD
    Use xrootdsetup-dev.sh instead of xrootdsetup.sh for the LRS
  • Improved handling of HC option --overwriteQueuedata
    Supports new and more powerful ways to overwrite schedconfig parameters, incl. sending dictionaries (--overwriteQueuedata might become --overwriteAGISdata)
  • Requested by Asoka de Silva (JIRA ticket not yet updated)

Event service updates

  • Periodic tarring of event outputs
  • Disabled long monitor sleep on ND because of failed heartbeat
  • Using normal SE for AES: not yet tested, needs AGIS updates (e.g. currently not possible to add normal SE to ddmendpoints)
  • Optimized S3 objectstore site mover to have fewer HEAD operations
  • Updated Rucio site mover to support OS upload
  • Fixed traces for pre-merged files (file size and event type)
  • Protection against corruption while downloading/updating event ranges

Some DQ2 API cleanup (especially around deprecated code)

  • Old site movers still rely on DQ2/ToA functionality. This dependency goes away once all sites have migrated to the new site movers; all DQ2 references can be removed at that point

Ddmendpoint fix for storm site mover

Avoiding all process kills on BOINC

No call to dispatcher for ND pilots

Note: Earlier today Johannes Elmsheuser alerted me to a problem with using nightlies releases. The fix was trivial so it made it into this release as well. Previously only VAL releases would have worked (e.g. '21.0.X-VAL'); now plain nightlies such as '21.0.X' should work as well.

Contributors: A. Anisenkov, W. Guan, M. Lassnig, D. Cameron, D. Drizhuk, P. Nilsson

Version info

General changes:

  • Updated dump to pretty print written json dictionary, in writeJSON() (FileHandling)
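
Pretty-printing a JSON dictionary with the standard json module might look like the sketch below. The function write_json() is a hypothetical stand-in for writeJSON(); the indent width and the key sorting are assumptions.

```python
import json


def write_json(file_name, dictionary):
    """Write dictionary to file_name as indented, key-sorted JSON.

    Returns True on success and False if the file could not be written
    or the dictionary could not be serialized.
    """
    try:
        with open(file_name, "w") as fp:
            json.dump(dictionary, fp, sort_keys=True, indent=4,
                      separators=(",", ": "))
    except (IOError, TypeError, ValueError) as exc:
        print("Failed to write %s: %s" % (file_name, exc))
        return False
    return True
```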

Removal of remaining dq2 API usage:

  • Deprecated and removed large parts of getTURLs() related to lcg-getturls, which also called getRSEType() and getRSE(), which in turn used the dq2 API (Mover)
  • Removed isDPMSite(), not used any longer (Mover)
  • Removed getRSEType(), not used any longer (Mover)
  • Removed getRucioPath(), not used any longer (Mover, SiteMover)
  • Removed getRucioFileList(), not used any longer (Mover)
  • Cleaned up getPoolFileCatalog() a bit (Mover)

NOTE: getRSE() needs to be rewritten, or the logic for setting the RSE changed, since it is used by all old site movers

Nightlies:

  • Added .X to getCacheInfo() to fix a case where nightlies were not setup correctly. Requested by Johannes Elmsheuser (ATLASExperiment)

Benchmarking:

  • Only executing benchmark tool once out of a hundred starts in shouldExecuteBenchmark() (ATLASSiteInformation)
  • Created getJobReportFileName(), addToJobReport() (FileHandling)
  • Created getBenchmarkFileName() (SiteInformation, ATLASSiteInformation)
  • Created getBenchmarkDictionary() (JobLog)
  • Created getBenchmarkSubprocess() (RunJob)
  • Added benchmark process, running during stage-in (RunJob, RunJobEvent)
  • Deprecated and removed unnecessary executeBenchmark() (SiteInformation, ATLASSiteInformation, Node)
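
A "once out of a hundred starts" decision could be made with a random draw, as in the sketch below. Whether shouldExecuteBenchmark() uses a random draw or some other scheme is not stated in these notes, so treat the approach, the function name and its arguments as assumptions.

```python
import random


def should_execute_benchmark(rng=None, one_in=100):
    """Return True for roughly one call in `one_in`.

    rng may be a random.Random instance (useful for reproducible tests);
    by default the module-level generator is used.
    """
    if rng is None:
        rng = random
    return rng.randint(1, one_in) == 1


# With many starts, about 1% of them select the benchmark.
rng = random.Random(0)
selected = sum(should_execute_benchmark(rng) for _ in range(100000))
print("benchmark selected in %d of 100000 starts" % selected)
```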

Memory monitoring:

  • Updated the memory monitor version to 21.0.18 in getUtilityCommand() (ATLASExperiment)

Copytools:

  • Corrected logical bug when updating file state (direct_access -> remote_io) in stagein_real() (mover)
  • Corrected logical bug when checking file states after transfer (direct_access -> remote_io) in get_data_new() (Mover)

Event Service:

  • PREFETCHER IS ENABLED for late releases
  • Added __yamplChannelNamePrefetcher, renamed __yamplChannelName to __yamplChannelNamePayload (RunJobEvent)
  • Setting __yamplChannelNamePrefetcher in init() (RunJobEvent)
  • Added prefetcher field to Job class (Job)
  • Updated schemes for prefetcher and adding turl to fileState file in stagein_real() (mover)
  • Created isPrefetcherReady(), setPrefetcherIsReady() (RunJobEvent)

Updates from Wen Guan:

#112

1) tar event outputs periodically
2) disable long monitor sleep on ND cloud because of failed 'send' (failed heartbeat)
3) fix to use different yampl channel name in yoda
4) payload may include the same input file more than once; fix to stage it in only once
5) change updateEventRanges to support new version, which supports tarring events periodically
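
Staging in a repeated input file only once comes down to removing duplicates while keeping the original order. A minimal sketch, with a hypothetical helper name:

```python
def unique_in_order(file_names):
    """Return file_names without duplicates, preserving first-seen order."""
    seen = set()
    unique = []
    for name in file_names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


inputs = ["EVNT.1.pool.root", "EVNT.2.pool.root", "EVNT.1.pool.root"]
print(unique_in_order(inputs))
```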

Commit Summary

fix files starting with zip and duplicated files
disable long monitor sleeping on ND cloud
using different yampl channel name for different athenamp
not show duplicated files when preparing inputs
fix to get correct jobid when naming curl config
fix Yoda to use different yampl channel name
make updateEventRanges to support different versions
RunJobEvent to support periodically tar and upload
fix python problem to pop events from list
to configure time gap between tar/zip functions

File Changes

M EventRanges.py (4)
M HPC/EventServer/EventServerJobManager.py (10)
M Job.py (6)
M Monitor.py (4)
M RunJobEvent.py (315)
M RunJobHpcEvent.py (15)
M RunJobUtilities.py (2)
M pUtil.py (5)

#114

  1. normalized objectstore as a normal rse
  2. optimized s3objectstore sitemover to have fewer HEAD operations: Dan reported that we had 4-5 times as many 'HEAD' operations as 'GET' and 'PUT' operations, which caused load problems on the objectstore.
  3. updated rucio sitemover to support os upload
  4. set objectstore keypair in environment, rucio site mover will use it.

Commit Summary

fix to remove tar/zip es files in the log
optimize s3objectstore sitemover to have less HEAD operation
update rucio sitemover to support os upload
normalize Objectstore as a normal RSE
set objectstore keypair in environment which rucio mover will use
normalize os as a normal rse

File Changes

M ATLASExperiment.py (4)
M Mover.py (13)
M RunJobEvent.py (64)
M S3ObjectstoreSiteMover.py (82)
M movers/mover.py (87)
M movers/rucio_sitemover.py (14)

#116

fix tracer report:
when an es merge job stages in premerge files, the filesize is not filled.
when an es merge job stages in premerge files, the eventtype is 'get_sm'; it should be 'get_es'.

Commit Summary

fix filesize in S3ObectStoreSiteMover
fix trace report

File Changes

M Mover.py (2)
M S3ObjectstoreSiteMover.py (17)
M movers/mover.py (3)
M movers/rucio_sitemover.py (2)

#117

Commit Summary

to protect corruption in downloading/updating eventranges

File Changes

M EventRanges.py (135)

Updates from Daniel Drizhuk:

#113

Introduced the new way of presenting HammerCloud parameter --overwriteQueuedata, added two more parameters: --useTestASetup and --useTestXRootD.

The proposed way for the new --overwriteQueuedata is to use common shell syntax.
In the new syntax --overwriteQueuedata is a multi-argument parameter that is followed by a set of parameters in key=value form. The set ends at the next parameter that starts with -, or at the parameter --, which is stripped.
The value in a parameter may be either a string or valid JSON.
The escape sequences are POSIX shell compatible, so JSON should probably be wrapped in single quotes.
The argument string is parsed by shlex.

Examples of --overwriteQueuedata (assuming TRF is echo, lines do not include TRF):

Stripped end-of-parameters
The line this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -- test
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}
and command echo this is test

End-of-parameters is a dash-prefix of the next parameter
The line this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -test
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}
and command echo this is -test

Second occurrence and EOL as an end-of-parameters
The line this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -test --overwriteQueuedata key4
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}, key4=True
and command echo this is -test
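
The parsing rules and the three examples above can be reproduced with the sketch below. It is an illustration written from the description only, not the code in SiteInformation.py; the function name and the return format are assumptions.

```python
import json
import shlex


def parse_overwrite_queuedata(line):
    """Split a parameter line into queuedata overrides and remaining arguments."""
    overrides = {}
    remaining = []
    tokens = shlex.split(line)  # POSIX shell style quoting and escapes
    i = 0
    while i < len(tokens):
        if tokens[i] != "--overwriteQueuedata":
            remaining.append(tokens[i])
            i += 1
            continue
        i += 1  # skip the option itself
        # Consume key / key=value items until a token starting with '-'.
        while i < len(tokens) and not tokens[i].startswith("-"):
            key, sep, raw = tokens[i].partition("=")
            if not sep:
                overrides[key] = True              # bare key acts as a flag
            else:
                try:
                    overrides[key] = json.loads(raw)  # null, numbers, objects
                except ValueError:
                    overrides[key] = raw              # not JSON: plain string
            i += 1
        if i < len(tokens) and tokens[i] == "--":
            i += 1  # explicit end-of-parameters marker is stripped
    return overrides, remaining


line = """this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -- test"""
print(parse_overwrite_queuedata(line))
```

A token that merely starts with a dash (such as -test) ends the set but stays in the remaining arguments, while a bare -- is dropped, matching the first two examples.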

Commit Summary

Merge pull request #2 from PanDAWMS/main-dev
Merge pull request #3 from PanDAWMS/main-dev
Testing parameters for HammerCloud
Merge remote-tracking branch 'origin/main-dev' into main-dev

File Changes

M ATLASExperiment.py (3)
M ATLASSiteInformation.py (5)
M SiteInformation.py (113)

#119

Commit Summary

Fixed issue with queuedata parameters logging when fixing.

File Changes

M SiteInformation.py (2)

Updates from Mario Lassnig:

#115

Commit Summary

fix ddmendpoint handling for storm sitemover

File Changes

M movers/storm_sitemover.py (17)

Updates from David Cameron:

#118

Commit Summary

do not kill anything on BOINC

File Changes

M processes.py (6)

https://github.com/PanDAWMS/pilot/pul...


67.6

09 Aug 15:07

Summary

Exceptional release due to a bug found (by Johannes Elmsheuser) in the memory monitor. The new version is using release 21.0.17 instead of 21.0.12.

67.5

09 Aug 15:07

Summary

Checking job status after job download

  • After job has been downloaded, check with the PanDA server that the job is not already in a running state. This can happen due to a bug on the batch system side on Nordugrid resources
  • Requested by Andrej Filipcic, David Cameron
  • Pre-version running since Christmas on ND resources
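
The check after job download can be pictured as follows. The status string "running" and the callable interface are assumptions; the notes only say that pUtil.getJobStatus() is called from getNewJob(), not what it returns.

```python
def accept_downloaded_job(job_id, get_job_status):
    """Return True if the freshly downloaded job may be started.

    get_job_status is any callable that asks the server for the current
    state of the job and returns it as a string (hypothetical interface).
    """
    status = get_job_status(job_id)
    if status == "running":
        # The job is already running elsewhere; do not start a second copy.
        return False
    return True


# Example with a fake status lookup instead of a real server call.
fake_states = {1001: "running", 1002: "starting"}
for job_id in sorted(fake_states):
    print(job_id, accept_downloaded_job(job_id, fake_states.get))
```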

Removed hardcoded PanDA URLs

  • Requested by Peter Love

Memory Monitoring updates

  • Now using release 21.0.12 (next pilot version will use version from payload release area)
  • Extracting and reporting new output: totRCHAR, totWCHAR, totRBYTES, totWBYTES, rateRCHAR, rateWCHAR, rateRBYTES, rateWBYTES
  • Requested by Johannes Elmsheuser

Avoiding killing BOINC client process at the end of the pilot

  • Requested by David Cameron

Time-out around pstack command and fix for bad usage of os.killpg() (removed negation of pid)

  • Requested by Rod Walker

Mover updates

  • Now sending file size with trace report
    • Requested by David Cameron
  • Support for direct access in all new movers
  • Upgrade to stage-in workflow
    • Resolve input replicas on demand, only for movers that require it (mv, storm and rucio are excluded)

Contributions from M. Lassnig, A. Anisenkov, W. Guan, D. Cameron, P. Nilsson.

Version info

General changes:

  • Added call to pUtil.getJobStatus() in getNewJob(). After job has been downloaded, check with the PanDA server that the job is not already in a running state. This can happen due to a bug on the batch system side on Nordugrid resources. Requested by Andrej Filipcic, David Cameron
  • Removed negation of id number sent to os.killpg() in killProcesses(). Requested by Rod Walker (processes)
  • Added time-out to pstack command in dumpStackTrace() (processes)
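
The two process fixes above can be illustrated together: a command run with a time-out, and a process group killed with os.killpg(), which takes the group id itself as a positive number (a negated id is the convention for os.kill(), not for os.killpg()). The sketch is POSIX only, needs Python 3.3 or later, and uses a hypothetical helper name; it is not the code in processes.py.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd in its own process group and kill the group on time-out.

    Returns (exit_code, timed_out).
    """
    # start_new_session=True makes the child the leader of a new process group.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=timeout)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # Pass the process group id as is; do not negate it.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the killed child
        return proc.returncode, True
```

Killing the whole group also removes any children the command may have spawned, which a plain kill of the single pid would leave behind.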

Benchmarks:

  • Added pdict argument to executeBenchmark() (SiteInformation, ATLASSiteInformation)
  • Added cloud argument to getBenchmarkDictionary(), sent from runMain() (node, pilot)
  • Created executeBenchmarks(), new getBenchmarkDictionary(). Added benchmarks private data member (Node)
  • Created getBenchmarkDictionary(), updated executeBenchmarks(), added self.__benchmarks (SiteInformation, ATLASSiteInformation)

Hardcoded pandaserver urls:

  • Removed hardcoded pandaserver url from various places (note: cannot currently remove hardcoded url for S3 secret key downloads since keys are only known to pandaserver and not any dev server)
  • Added new argument url to downloadEventRanges(), updateEventRanges() (EventRanges)
  • Setting PanDA server url in RunJob* argument list in getSubprocessArguments() (Experiment)
  • Added -W server url argument in argumentParser() (RunJob)
  • Added __pandaserver variable (RunJob)
  • Using __pandaserver with downloadEventRanges() in executePayload(), main() (RunJob)
  • Using __pandaserver with downloadEventRanges() in main() (RunJobEvent)
  • Using __pandaserver with downloadEventRanges() in getJobEventRanges() (RunJobHpcEvent)
  • Using __pandaserver with updateEventRanges() in stageOutZipFiles_new(), stageOutZipFiles() (RunJobEvent)
  • Using __pandaserver with updateEventRanges() in updateEventRanges() (RunJobHpcEvent)
  • Removed import of httpConnect in RunJobEvent, RunJobHpcEvent

Memory Monitoring:

  • Now using release 21.0.12 to setup the memory monitoring (ATLASExperiment)
  • Added calculation and handling of variables related to new outputs RCHAR, WCHAR, RBYTES, WBYTES in getMemoryValues() (ATLASExperiment)

Mover fixes:

  • Now sending filesize with trace report, from stageout(), stagein_real() (mover)

Updates from Wen Guan:

#104

  • Removed deprecated modules EventStager.py and MVEventStager.py
  • Updated RunJobHpcEvent.py to remove EventStager usage

Commit Summary

deprecate eventstager
remove eventStager

File Changes

D EventStager.py (559)
D MVEventStager.py (298)
M RunJobHpcEvent.py (95)

#105

Commit Summary

Fixed trace report which overwrote localsite

File Changes

M Mover.py (3)

Updates from Mario Lassnig:

#106

Commit Summary

ruciomover: tiny updates
storm: fix for timestamp corner case
Merge pull request #91 from mlassnig/storm-update
Merge pull request #90 from mlassnig/rucio-mover-updates
Merge commit 'e68a141debeafff49cf89d28047d0faf599e5c87' into main-dev

File Changes

M movers/rucio_sitemover.py (26)

#107

Commit Summary

Explicit import of OS

File Changes

M movers/rucio_sitemover.py (4)

Updates from Alexey Anisenkov:

#108

Commit Summary

movers: update direct access workflow (force to check root protocol in case of PQ supports direct access and jobspec allows it as well)
cosmetic fix
movers: rewrite resolve_replica to iterate over accepted schemes first
movers bugfix: prevent stage-in error (Argument list too long) while printing details about input files(1k+).
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
sitemovers: stage-in workflow upgrade: resolve input replicas by demand only for required movers (mv, storm, rucio are excluded)
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
bugfix: ATLASExperiment.getMemoryValues() fix local var declaration ('rchar' referenced before assignment issue)

movers bugfix: prevent stage-in error (Argument list too long) while printing details about input files(1k+).
sitemovers fix
directaccess fixes
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
ATLASExperiment.py: revert back line endings style to win
ATLASExperiment.py: fix line endings style
sitemovers: exclude DISABLED ddms from inputddms
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
ATLASExperiment.py fix endline style
Update ATLASExperiment.py

File Changes

M ATLASExperiment.py (7089)
M Job.py (30)
M movers/base.py (63)
M movers/mover.py (128)
M movers/mv_sitemover.py (3)
M movers/rucio_sitemover.py (2)
M movers/storm_sitemover.py (2)

#109

Commit Summary

sitemovers: protect sitemover.resolve_replica()

File Changes

M movers/mover.py (2)

Updates from David Cameron:

#110

Commit Summary

do not kill boinc_client process on exit

File Changes

M processes.py (2)

#111

Commit Summary

rucioSiteMover updates: use default resolve_replica() implementation; fix upload cmd to consider --guid value for .root files

File Changes

M movers/rucio_sitemover.py (14)

67.4

09 Aug 15:08

Summary

  • Removal of timestamp from filename in storm mover. Fix already announced but code was missed in release due to a github mixup. Requested by Rod Walker et al.
  • Removed duplicated call to event range download function used at NERSC. Requested by Taylor Childers.

Version info

General changes:

  • Removed duplicated call to downloadEventRanges() with numRanges set to 2 (RunJobHpcEvent)
  • Restored correct version of storm mover in pilot tarball