Releases: PanDAWMS/pilot
68.1
Summary
Nightlies
- Now supporting nightlies GIT releases (Atlas-N.N.N-GIT)
- Requested by Tulay Cuhadar Donszelmann
Benchmarks
- Adding benchmark dictionary to machine section when available
- Avoid unwanted and duplicated info in benchmark dictionary, adding new info
- Found a new problem with the benchmark command: it uses the Python argparse module, which is normally only available from Python 2.7 onwards. The developers have been notified (a solution might take time since the primary developer is leaving)
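The argparse dependency noted above can be detected before the benchmark command is launched. The following is a minimal sketch, not the pilot's code; the helper name has_argparse() is hypothetical. It relies only on the fact that argparse entered the standard library with Python 2.7, so the import fails on older interpreters unless the module was installed separately.

```python
import sys


def has_argparse():
    """Return True if this interpreter can import the argparse module."""
    try:
        import argparse  # noqa: F401  (standard library since Python 2.7)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Report the interpreter version together with the check result.
    print("python %d.%d, argparse available: %s"
          % (sys.version_info[0], sys.version_info[1], has_argparse()))
```

A caller could skip the benchmark when the check returns False instead of letting the benchmark command fail at start-up.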
Event Service
- Fixed problem with special athena jobOptions in release 21.0.15. The problem was in handling the jobOptions python string; jps part was combined with EVNTtoHITS in same block
- Fixed issue with some athena 20 releases that have _000 at the end of the filename of the produced files
Contributors: W. Guan, P. Nilsson
Version info
General changes:
- Removed OtherSiteMover from distribution. Not needed.
- Removed getUtilityCommandOld() (ATLASExperiment)
- Cleaned up deprecated new site mover function and related code (Mover)
ESS:
- Lots of internal changes related to ESS (RunJobEvent)
Benchmarks:
- Removed --freetext option from benchmark command in getBenchmarkCommand() (ATLASSiteInformation)
- Removed key 'cpuname' from the benchmark dictionary in getBenchmarkDictionary() (JobLog)
- Sending sitename and computingElement to getBenchmarkDictionary() from postJobTask() (JobLog)
- Added sitename and queuename arguments to getBenchmarkDictionary() (JobLog)
- Adding sitename and queuename to benchmark dictionary (JobLog)
- Added argument section to addToJobReport() (FileHandling)
- Now adding key+value to section in addToJobReport() (FileHandling)
- Added sitename argument to getBenchmarkSubprocess() (RunJob)
- Sending sitename to getBenchmarkSubprocess() (RunJob, RunJobEvent)
- Avoiding running benchmark suite on ANALY sites, since the user jobs will not produce the jobReport, in getBenchmarkSubprocess() (RunJob)
- Sending subsection to addToJobReport() (JobLog)
- Added argument subsection to addToJobReport() (FileHandling)
- Added possibility to set jobReport subsection in addToJobReport() (FileHandling)
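The section and subsection arguments described above can be pictured as two levels of nesting in the job report dictionary. The sketch below is an assumption-laden illustration: add_to_report() is a hypothetical stand-in for addToJobReport(), whose real signature is not shown in these notes, and the "resource"/"machine" layout and the site and queue names are example values only.

```python
import json


def add_to_report(report, key, value, section, subsection=None):
    """Store key=value under report[section], or under
    report[section][subsection] when a subsection is given."""
    target = report.setdefault(section, {})
    if subsection is not None:
        target = target.setdefault(subsection, {})
    target[key] = value
    return report


# Example report; the layout is illustrative, not the real jobReport schema.
report = {"resource": {"machine": {"hostname": "node01"}}}
benchmark = {"sitename": "SITE_A", "queuename": "QUEUE_A"}
add_to_report(report, "benchmark", benchmark,
              section="resource", subsection="machine")
print(json.dumps(report, indent=2, sort_keys=True))
```

Using setdefault() keeps whatever is already stored in the section, so existing keys such as the hostname are left untouched.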
Nightlies:
- Added -GIT to getCacheInfo() to fix a case where nightlies were not setup correctly. Requested by Johannes Elmsheuser (ATLASExperiment)
Updates from Wen Guan:
In athena 21.0.15, if we combine jps part with EVENtoHITS part in one block, athena will not be able to start. So we need to use a separate block like below to preExec jps part.
--preExec 'from AthenaMP.AthenaMPFlags import jobproperties as jps;jps.AthenaMPFlags.EventRangeChannel="EventService_EventRanges-9419a6ca-4de9-4fcf-9b7d-5e1045e998b2"' 'EVNTtoHITS
:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)' 'EVNTtoHITS:simFlags.TRTRangeCut=30.0;simFlags.TightMuonStepping=True'
Commit Summary
fix problem when pilot adding es special part to athena joboptions
File Changes
M RunJobEvent.py (6)
Commit Summary
fix objectstore sitemover
File Changes
M S3ObjectstoreSiteMover.py (2)
Commit Summary
fix merge problem
patch to check _000 and auto fix it
fix type error from merge
File Changes
M RunJobEvent.py (2)
M SiteInformation.py (2)
M pUtil.py (13)
68.0
Summary
Memory Monitoring update
- Now using release 21.0.18 (I/O values are stored in absolute values and I/O rates are reported as integer)
- Requested by Johannes Elmsheuser
Benchmarking
- 1/100 jobs run cern-benchmark in offline mode
- Chained benchmarks: Whetstone and fastBmk
- Using freetext = “Whetstone+fastBmk” for ES use
- Executed during stage-in to minimize wasting time
- Waiting for benchmark to finish before executing payload
- Output dictionary is added to jobReport.json
- Note: After the ADC Weekly meeting we added IP and hostname to the output dictionary as well. The hostname is however available elsewhere in the jobReport (machine section). We might want to avoid duplication of info.
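The ordering described above (benchmark started, stage-in performed in the meantime, payload held back until the benchmark is done) can be sketched as follows. This is a simplified illustration written for Python 3, not the pilot's implementation: run_stage_in_with_benchmark() is a hypothetical name, the stage-in is a placeholder callable, and a short sleep command stands in for the real benchmark invocation, which is not shown in these notes.

```python
import subprocess
import sys


def run_stage_in_with_benchmark(benchmark_cmd, stage_in):
    """Start the benchmark, run stage-in meanwhile, then wait for the benchmark."""
    bench = subprocess.Popen(benchmark_cmd)  # runs concurrently with stage-in
    staged = stage_in()                      # placeholder for the real transfers
    exit_code = bench.wait()                 # block until the benchmark is done
    return staged, exit_code


# Stand-in for the benchmark command (assumption: any external process).
fake_benchmark = [sys.executable, "-c", "import time; time.sleep(0.1)"]
staged, exit_code = run_stage_in_with_benchmark(
    fake_benchmark, lambda: ["input.root"])
# Only after this point would the payload be executed.
```

Because the benchmark overlaps with the transfers, the wait before the payload only costs time when the benchmark takes longer than stage-in.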
Fix for missing traces when using remote i/o
- Caused data to be deleted even though it had recently been accessed
Support for (mainly) new HC options
- --useTestASetup: sets special env variable (ALRB_asetupVersion) that will activate test version of asetup
- --useTestXRootD: use xrootdsetup-dev.sh instead of xrootdsetup.sh for the LRS
- Improved handling of HC option --overwriteQueuedata: supports new and more powerful ways to overwrite schedconfig parameters, incl. sending dictionaries (--overwriteQueuedata might become --overwriteAGISdata)
- Requested by Asoka de Silva (JIRA ticket not yet updated)
Event service updates
- Periodical tarring of event outputs
- Disabled long monitor sleep on ND because of failed heartbeat
- Using normal SE for AES: not yet tested, needs AGIS updates (e.g. currently not possible to add normal SE to ddmendpoints)
- Optimized S3 objectstore site mover to have less HEAD operations
- Updated Rucio site mover to support OS upload
- Fixed traces for pre-merged files (file size and event type)
- Protection against corruption while downloading/updating event ranges
Some DQ2 API cleanup (especially around deprecated code)
- Old site movers still rely on DQ2/ToA functionality; this can be avoided once all sites have migrated to the new site movers, at which point all DQ2 references can be removed
Ddmendpoint fix for storm site mover
Avoiding all process kills on BOINC
No call to dispatcher for ND pilots
Note: Earlier today Johannes Elmsheuser alerted me to a problem with using nightlies releases. The fix was trivial so it made it for the release as well. Previously only VAL releases would have worked (i.e. if release was e.g. '21.0.X-VAL' but now also '21.0.X' should work).
Contributors: A. Anisenkov, W. Guan, M. Lassnig, D. Cameron, D. Drizhuk, P. Nilsson
Version info
General changes:
- Updated dump to pretty print written json dictionary, in writeJSON() (FileHandling)
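Pretty printing a dictionary as JSON is a matter of passing formatting options to the standard json module. The sketch below is an illustration under assumptions: write_json() is a hypothetical stand-in for writeJSON(), and the indent width and key sorting are example choices, not necessarily the ones used by the pilot.

```python
import json
import os
import tempfile


def write_json(path, dictionary):
    """Write dictionary to path as indented, key-sorted JSON."""
    with open(path, "w") as f:
        json.dump(dictionary, f, indent=4, sort_keys=True)


# Write an example dictionary to a scratch file and read the text back.
path = os.path.join(tempfile.mkdtemp(), "example.json")
write_json(path, {"b": 2, "a": {"c": 1}})
with open(path) as f:
    text = f.read()
print(text)
```

The indented form is larger on disk but much easier to read when inspecting a job's work directory by hand.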
Removal of remaining dq2 API usage:
- Deprecated and removed large parts of getTURLs() related to lcg-getturls, which also called getRSEType() and getRSE(), which in turn used the dq2 API (Mover)
- Removed isDPMSite(), not used any longer (Mover)
- Removed getRSEType(), not used any longer (Mover)
- Removed getRucioPath(), not used any longer (Mover, SiteMover)
- Removed getRucioFileList(), not used any longer (Mover)
- Cleaned up getPoolFileCatalog() a bit (Mover)
NOTE: getRSE() needs to be rewritten, or the logic for setting the RSE changed, since it is used by all old site movers
Nightlies:
- Added .X to getCacheInfo() to fix a case where nightlies were not setup correctly. Requested by Johannes Elmsheuser (ATLASExperiment)
Benchmarking:
- Only executing benchmark tool once out of a hundred starts in shouldExecuteBenchmark() (ATLASSiteInformation)
- Created getJobReportFileName(), addToJobReport() (FileHandling)
- Created getBenchmarkFileName() (SiteInformation, ATLASSiteInformation)
- Created getBenchmarkDictionary() (JobLog)
- Created getBenchmarkSubprocess() (RunJob)
- Added benchmark process, running during stage-in (RunJob, RunJobEvent)
- Deprecated and removed unnecessary executeBenchmark() (SiteInformation, ATLASSiteInformation, Node)
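Running the benchmark for only one start in a hundred can be done with a random draw. The sketch below shows one possible way; it assumes random sampling, whereas these notes do not say whether shouldExecuteBenchmark() uses a random draw or some other mechanism. The function name and its parameters are hypothetical.

```python
import random


def should_execute_benchmark(one_in=100, rng=random):
    """Return True for roughly one call out of `one_in`."""
    return rng.randint(1, one_in) == 1


if __name__ == "__main__":
    # Rough check of the rate with a seeded generator.
    rng = random.Random(0)
    hits = sum(should_execute_benchmark(100, rng) for _ in range(100000))
    print("benchmark selected in %d of 100000 starts" % hits)
```

A random draw needs no shared state between pilots, which is why it is used here; the rate is only approximate for small numbers of starts.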
Memory monitoring:
- Updated the memory monitor version to 21.0.18 in getUtilityCommand() (ATLASExperiment)
Copytools:
- Corrected logical bug when updating file state (direct_access -> remote_io) in stagein_real() (mover)
- Corrected logging bug when checking file states after transfer (direct_access -> remote_io) in get_data_new() (Mover)
Event Service:
- PREFETCHER IS ENABLED for late releases
- Added __yamplChannelNamePrefetcher, renamed __yamplChannelName to __yamplChannelNamePayload (RunJobEvent)
- Setting __yamplChannelNamePrefetcher in init() (RunJobEvent)
- Added prefetcher field to Job class (Job)
- Updated schemes for prefetcher and adding turl to fileState file in stagein_real() (mover)
- Created isPrefetcherReady(), setPrefetcherIsReady() (RunJobEvent)
Updates from Wen Guan:
1) tar event outputs periodically
2) disable long monitor sleep on ND cloud because of failed 'send' (failed heartbeat)
3) fix to use different yampl channel name in yoda
4) the payload may include the same input file more than once; fixed to stage it in only once
5) change updateEventRanges to support the new version, which supports tarring events periodically
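Periodic tarring with a configurable time gap can be sketched as a small collector that only writes a tar file once enough time has passed since the previous one. This is an illustration, not the RunJobEvent code: the class PeriodicTarrer, its method names and the 600 second gap are all hypothetical, and a fake clock is used so the example runs without waiting.

```python
import os
import tarfile
import tempfile
import time


class PeriodicTarrer(object):
    """Collect output files and tar them at most once per `min_gap_s` seconds."""

    def __init__(self, min_gap_s, clock=time.time):
        self.min_gap_s = min_gap_s
        self.clock = clock
        self.pending = []
        self.last_tar = clock()

    def add(self, path):
        self.pending.append(path)

    def maybe_tar(self, tar_path):
        """Tar pending files if the time gap has passed; return tar_path or None."""
        now = self.clock()
        if not self.pending or now - self.last_tar < self.min_gap_s:
            return None
        with tarfile.open(tar_path, "w:gz") as tar:
            for path in self.pending:
                tar.add(path, arcname=os.path.basename(path))
        self.pending = []
        self.last_tar = now
        return tar_path


# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
workdir = tempfile.mkdtemp()
tarrer = PeriodicTarrer(min_gap_s=600, clock=lambda: fake_now[0])
event_file = os.path.join(workdir, "event_0001.out")
with open(event_file, "w") as f:
    f.write("payload output")
tarrer.add(event_file)
tar_name = os.path.join(workdir, "events_1.tar.gz")
too_early = tarrer.maybe_tar(tar_name)  # gap not reached yet, nothing written
fake_now[0] = 601.0
created = tarrer.maybe_tar(tar_name)    # gap reached, tar file written
```

Batching outputs this way trades a delay in uploading individual event outputs for fewer, larger transfers.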
Commit Summary
fix files starting with zip and duplicated files
disable long monitor sleeping on ND cloud
using different yampl channel name for different athenamp
not show duplicated files when preparing inputs
fix to get correct jobid when naming curl config
fix Yoda to use different yampl channel name
make updateEventRanges to support different versions
RunJobEvent to support periodically tar and upload
fix python problem to pop events from list
to configure time gap between tar/zip functions
File Changes
M EventRanges.py (4)
M HPC/EventServer/EventServerJobManager.py (10)
M Job.py (6)
M Monitor.py (4)
M RunJobEvent.py (315)
M RunJobHpcEvent.py (15)
M RunJobUtilities.py (2)
M pUtil.py (5)
- normalized objectstore as a normal rse
- optimized s3objectstore sitemover to have less HEAD operations: Dan reported that we had 4~5 times as many 'HEAD' operations as 'GET' and 'PUT' operations. It caused load problems on objectstore.
- updated rucio sitemover to support os upload
- set objectstore keypair in environment, rucio site mover will use it.
Commit Summary
fix to remove tar/zip es files in the log
optimize s3objectstore sitemover to have less HEAD operation
update rucio sitemover to support os upload
normalize Objectstore as a normal RSE
set objectstore keypair in environment which rucio mover will use
normalize os as a normal rse
File Changes
M ATLASExperiment.py (4)
M Mover.py (13)
M RunJobEvent.py (64)
M S3ObjectstoreSiteMover.py (82)
M movers/mover.py (87)
M movers/rucio_sitemover.py (14)
fix trace report:
when an ES merge job stages in pre-merge files, the file size is not filled in.
when an ES merge job stages in pre-merge files, the event type is 'get_sm'; it should be 'get_es'
Commit Summary
fix filesize in S3ObectStoreSiteMover
fix trace report
File Changes
M Mover.py (2)
M S3ObjectstoreSiteMover.py (17)
M movers/mover.py (3)
M movers/rucio_sitemover.py (2)
Commit Summary
to protect corruption in downloading/updating eventranges
File Changes
M EventRanges.py (135)
Updates from Daniel Drizhuk:
Introduced the new way of presenting HammerCloud parameter --overwriteQueuedata, added two more parameters: --useTestASetup and --useTestXRootD.
The proposed way for the new --overwriteQueuedata is to use common shell syntax.
In the new syntax --overwriteQueuedata is a multi-argument parameter, followed by a set of parameters in key=value form. The set ends when the next parameter starts with -, or at the parameter --, which is stripped.
The value in the parameter may be either a string or a valid JSON.
The escape sequences are POSIX shell compatible, so JSON should probably be wrapped in single quotes.
The argument string is parsed by shlex.
Examples of --overwriteQueuedata (assuming TRF is echo, lines do not include TRF):
Stripped end-of-parameters
The line this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -- test
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}
and command echo this is test
End-of-parameters is a dash-prefix of the next parameter
The line this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -test
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}
and command echo this is -test
Second occurrence and EOL as an end-of-parameters
this is --overwriteQueuedata key1 key2=null key3='{"a":1,"b":2}' -test --overwriteQueuedata key4
will result in queuedata modification key1=True, key2=None, key3={a:1,b:2}, key4=True
and command echo this is -test
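The parsing rules and the three examples above can be reproduced with a short shlex-based routine. The sketch below is a reading of the described behaviour, not the code in SiteInformation.py; the function names extract_overwrites() and parse_value() are hypothetical. A bare key becomes True, a value that is valid JSON is decoded (so null becomes None), and any other value is kept as a string.

```python
import json
import shlex

FLAG = "--overwriteQueuedata"


def parse_value(text):
    """Decode text as JSON when it is valid JSON, otherwise keep the string."""
    try:
        return json.loads(text)
    except ValueError:
        return text


def extract_overwrites(line):
    """Return (overrides, remaining_args) for a job parameter string."""
    overrides, remaining = {}, []
    collecting = False
    for token in shlex.split(line):
        if token == FLAG:
            collecting = True              # start (or restart) a key=value set
        elif collecting and token == "--":
            collecting = False             # explicit end marker, stripped
        elif collecting and token.startswith("-"):
            collecting = False             # next option ends the set, kept
            remaining.append(token)
        elif collecting:
            key, sep, value = token.partition("=")
            overrides[key] = parse_value(value) if sep else True
        else:
            remaining.append(token)
    return overrides, remaining


line = ("this is --overwriteQueuedata key1 key2=null "
        "key3='{\"a\":1,\"b\":2}' -test --overwriteQueuedata key4")
overrides, remaining = extract_overwrites(line)
print(overrides)
print(" ".join(remaining))
```

Note that with this approach a value such as 1 or true is also decoded as JSON (to an integer or boolean); whether the pilot treats such values the same way is not stated in these notes.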
Commit Summary
Merge pull request #2 from PanDAWMS/main-dev
Merge pull request #3 from PanDAWMS/main-dev
Testing parameters for HammerCloud
Merge remote-tracking branch 'origin/main-dev' into main-dev
File Changes
M ATLASExperiment.py (3)
M ATLASSiteInformation.py (5)
M SiteInformation.py (113)
Commit Summary
Fixed issue with queuedata parameters logging when fixing.
File Changes
M SiteInformation.py (2)
Updates from Mario Lassnig:
#115
Commit Summary
fix ddmendpoint handling for storm sitemover
File Changes
M movers/storm_sitemover.py (17)
Updates from David Cameron:
Commit Summary
do not kill anything on BOINC
File Changes
M processes.py (6)
67.6
Summary
Exceptional release due to a bug found (by Johannes Elmsheuser) in the memory monitor. The new version is using release 21.0.17 instead of 21.0.12.
67.5
Summary
Checking job status after job download
- After job has been downloaded, check with the PanDA server that the job is not already in a running state. This can happen due to a bug on the batch system side on Nordugrid resources
- Requested by Andrej Filipcic, David Cameron
- Pre-version running since Christmas on ND resources
Removed hardcoded PanDA URLs
- Requested by Peter Love
Memory Monitoring updates
- Now using release 21.0.12 (next pilot version will use version from payload release area)
- Extracting and reporting new output; totRCHAR, totWCHAR, totRBYTES, totWBYTES, rateRCHAR, rateWCHAR, rateRBYTES, rateWBYTES
- Requested by Johannes Elmsheuser
Avoiding killing BOINC client process at the end of the pilot
- Requested by David Cameron
Time-out around pstack command and fix for bad usage of os.killpg() (removed negation of pid)
- Requested by Rod Walker
Mover updates
- Now sending file size with trace report
- Requested by David Cameron
- Support for direct access in all new movers
- Upgrade to stage-in workflow
- Execute resolve input replicas by demand only for required movers (mv, storm and rucio are excluded)
Contributions from M. Lassnig, A. Anisenkov, W. Guan, D. Cameron, P. Nilsson.
Version info
General changes:
- Added call to pUtil.getJobStatus() in getNewJob(). After job has been downloaded, check with the PanDA server that the job is not already in a running state. This can happen due to a bug on the batch system side on Nordugrid resources. Requested by Andrej Filipcic, David Cameron
- Removed negation of id number sent to os.killpg() in killProcesses(). Requested by Rod Walker (processes)
- Added time-out to pstack command in dumpStackTrace() (processes)
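The two process-handling changes above (a time-out around an external command, and passing a non-negated id to os.killpg()) can be illustrated together. The sketch below is written for Python 3 on a POSIX system and is not the code in processes.py; run_with_timeout() is a hypothetical helper, and short Python one-liners stand in for the pstack command.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group; on time-out kill the whole group.

    Returns (exit_code, output); exit_code is None if the command timed out.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, start_new_session=True)
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, output.decode(errors="replace")
    except subprocess.TimeoutExpired:
        # os.killpg() expects the positive process group id itself; the
        # negated-id convention belongs to os.kill(), not os.killpg().
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        output, _ = proc.communicate()
        return None, output.decode(errors="replace")


# Stand-ins for the real command (assumption: any external process).
quick = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=10)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"],
                        timeout_s=0.5)
```

Starting the command in its own session gives it a separate process group, so killing the group on time-out does not affect the calling process.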
Benchmarks:
- Added pdict argument to executeBenchmark() (SiteInformation, ATLASSiteInformation)
- Added cloud argument to getBenchmarkDictionary(), sent from runMain() (node, pilot)
- Created executeBenchmarks(), new getBenchmarkDictionary(). Added benchmarks private data member (Node)
- Created getBenchmarkDictionary(), updated executeBenchmarks(), added self.__benchmarks (SiteInformation, ATLASSiteInformation)
Hardcoded pandaserver urls:
- Removed hardcoded pandaserver url from various places (note: cannot currently remove hardcoded url for S3 secret key downloads since keys are only known to pandaserver and not any dev server)
- Added new argument url to downloadEventRanges(), updateEventRanges() (EventRanges)
- Setting PanDA server url in RunJob* argument list in getSubprocessArguments() (Experiment)
- Added -W server url argument in argumentParser() (RunJob)
- Added __pandaserver variable (RunJob)
- Using __pandaserver with downloadEventRanges() in executePayload(), main() (RunJob)
- Using __pandaserver with downloadEventRanges() in main() (RunJobEvent)
- Using __pandaserver with downloadEventRanges() in getJobEventRanges() (RunJobHpcEvent)
- Using __pandaserver with updateEventRanges() in stageOutZipFiles_new(), stageOutZipFiles() (RunJobEvent)
- Using __pandaserver with updateEventRanges() in updateEventRanges() (RunJobHpcEvent)
- Removed import of httpConnect in RunJobEvent, RunJobHpcEvent
Memory Monitoring:
- Now using release 21.0.12 to setup the memory monitoring (ATLASExperiment)
- Added calculation and handling of variables related to new outputs RCHAR, WCHAR, RBYTES, WBYTES in getMemoryValues() (ATLASExperiment)
Mover fixes:
- Now sending filesize with trace report, from stageout(), stagein_real() (mover)
Updates from Wen Guan:
- Removed deprecated modules EventStager.py and MVEventStager.py
- Updated RunJobHpcEvent.py to remove EventStager usage
Commit Summary
deprecate eventstager
remove eventStager
File Changes
D EventStager.py (559)
D MVEventStager.py (298)
M RunJobHpcEvent.py (95)
Commit Summary
Fixed trace report which overwrote localsite
File Changes
M Mover.py (3)
Updates from Mario Lassnig:
Commit Summary
ruciomover: tiny updates
storm: fix for timestamp corner case
Merge pull request #91 from mlassnig/storm-update
Merge pull request #90 from mlassnig/rucio-mover-updates
Merge commit 'e68a141debeafff49cf89d28047d0faf599e5c87' into main-dev
File Changes
M movers/rucio_sitemover.py (26)
Commit Summary
Explicit import of OS
File Changes
M movers/rucio_sitemover.py (4)
Updates from Alexey Anisenkov:
Commit Summary
movers: update direct access workflow (force to check root protocol in case of PQ supports direct access and jobspec allows it as well)
cosmetic fix
movers: rewrite resolve_replica to iterate over accepted schemes first
movers bugfix: prevent stage-in error (Argument list too long) while printing details about input files(1k+).
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
sitemovers: stage-in workflow upgrade: resolve input replicas by demand only for required movers (mv, storm, rucio are excluded)
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
bugfix: ATLASExperiment.getMemoryValues() fix local var declaration ('rchar' referenced before assignment issue)
- movers bugfix: prevent stage-in error (Argument list too long) while printing details about input files(1k+).
sitemovers fix
directaccess fixes
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
ATLASExperiment.py: revert back line endings style to win
ATLASExperiment.py: fix line endings style
sitemovers: exclude DISABLED ddms from inputddms
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
Merge branch 'main-dev' of https://github.com/PanDAWMS/pilot into main-dev
ATLASExperiment.py fix endline style
Update ATLASExperiment.py
File Changes
M ATLASExperiment.py (7089)
M Job.py (30)
M movers/base.py (63)
M movers/mover.py (128)
M movers/mv_sitemover.py (3)
M movers/rucio_sitemover.py (2)
M movers/storm_sitemover.py (2)
Commit Summary
sitemovers: protect sitemover.resolve_replica()
File Changes
M movers/mover.py (2)
Updates from David Cameron:
Commit Summary
do not kill boinc_client process on exit
File Changes
M processes.py (2)
Commit Summary
rucioSiteMover updates: use default resolve_replica() implementation; fix upload cmd to consider --guid value for .root files
File Changes
M movers/rucio_sitemover.py (14)
67.4
Summary
- Removal of timestamp from filename in storm mover. Fix already announced but code was missed in release due to a github mixup. Requested by Rod Walker et al.
- Removed duplicated call to event range download function used at NERSC. Requested by Taylor Childers.
Version info
General changes:
- Removed duplicated call to downloadEventRanges() with numRanges set to 2 (RunJobHpcEvent)
- Restored correct version of storm mover in pilot tarball