Releases: PanDAWMS/pilot
73.7
Rucio API
- Rucio API is again used for file transfers
- Now verifying that the replica is present at the destination after upload
- Transfer time-outs are based on file size
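A size-based time-out could look roughly like the sketch below. The function name, the fixed floor, the assumed worst-case transfer rate and the ceiling are all illustrative assumptions; these notes do not state the values the pilot actually uses.

```python
def transfer_timeout(file_size_bytes,
                     min_timeout=600,                 # assumed floor, in seconds
                     assumed_rate=0.5 * 1024 ** 2,    # assumed worst-case rate, bytes/s
                     max_timeout=6 * 3600):           # assumed ceiling, in seconds
    """Return a transfer time-out in seconds that grows with the file size.

    All default values are hypothetical and only illustrate the idea.
    """
    # time the transfer would need at the assumed worst-case rate
    estimate = file_size_bytes / float(assumed_rate)
    # add the fixed floor for connection overhead, but never exceed the ceiling
    return int(min(max_timeout, min_timeout + estimate))
```

With these assumed defaults an empty file gets the floor value, while very large files are capped at the ceiling.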
Support for file based input file lists in direct access jobs
- The pilot now updates LFNs to full TURLs in an input file list stored in a file (@filename in job params)
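The file-based update can be sketched as follows. The helper names, the one-LFN-per-line file layout and the way @filename is located in the job parameters are assumptions made for illustration, not the pilot's actual code.

```python
import re


def find_file_list(job_params):
    """Return the name given as @filename in the job parameters, or None."""
    match = re.search(r"@(\S+)", job_params)
    return match.group(1) if match else None


def update_file_list(path, lfn_to_turl):
    """Rewrite the list at `path`, replacing each known LFN by its full TURL.

    Assumes one LFN per line (hypothetical layout).
    """
    with open(path) as f:
        lfns = [line.strip() for line in f if line.strip()]
    # LFNs without a known TURL are left untouched
    turls = [lfn_to_turl.get(lfn, lfn) for lfn in lfns]
    with open(path, "w") as f:
        f.write("\n".join(turls) + "\n")
    return turls
```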
Added davs to list of allowed schemas for direct access mode
- Previously, test jobs benchmarking the Dynafed at CERN failed because the pilot did not allow davs for direct access
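A schema check of this kind might look like the sketch below. The tuple and function names are hypothetical; davs and https are mentioned in these notes, while root is an assumed entry added only for illustration.

```python
from urllib.parse import urlparse

# davs and https appear in these release notes; root is an assumed example entry
ALLOWED_DIRECT_ACCESS_SCHEMAS = ("root", "davs", "https")


def allows_direct_access(turl):
    """Return True if the TURL uses a schema approved for direct access."""
    return urlparse(turl).scheme in ALLOWED_DIRECT_ACCESS_SCHEMAS
```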
Event service updates
- Now using different DDM protocol in case of stage-in retry
- Fixed minor issues caused by failed event range update
Contributions from P. Nilsson, F. Berghaus, W. Guan, T. Javurek, A. Anisenkov.
73.6
New error code
- ERR_BADXML = 1247, "Badly formed XML" - set if the PoolFileCatalog.xml contains illegal characters, as happened in a recent user job
- Otherwise this leads to an obscure failure, although infrequently (roughly once a year)
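One way to detect badly formed XML is to attempt a parse and map a parser error to the new code. This is a minimal sketch using the standard library; the function name is hypothetical and the pilot's own check may differ.

```python
import xml.etree.ElementTree as ET

ERR_BADXML = 1247  # error code introduced in this release


def check_xml(xml_text):
    """Return 0 if xml_text is well formed, else ERR_BADXML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # raised for illegal characters, unescaped '&', unbalanced tags, etc.
        return ERR_BADXML
    return 0
```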
list_replicas() usage
- Now always setting geoip in list_replicas() call as requested by Mario Lassnig/DDM team
- Otherwise remote replicas are not handled correctly (the server will always reply with the external door)
- Previously only set in direct access mode
- Feature tested at OU and IN2P3, thanks to Horst Severini and Emmanouil Vamvakopoulos
Log file creation updates, requested by R. Walker
- Added HAHM_* to tarball exceptions list to avoid large tarballs
- Added --one-file-system to the tar command to avoid following soft links into /cvmfs, which can also result in huge tarballs
- Discussed in JIRA tickets https://its.cern.ch/jira/browse/ATLASPANDA-464 and https://its.cern.ch/jira/browse/ATLMCPROD-6175
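The resulting tar invocation can be sketched as below. The function name and argument layout are assumptions; only --one-file-system and the HAHM_* exclusion come from these notes. The command is built as a list so the exclusion pattern is passed to tar unexpanded.

```python
def build_tar_command(tarball, workdir, excludes=("HAHM_*",)):
    """Build a GNU tar command that stays on one file system and skips excluded names."""
    cmd = ["tar", "--one-file-system", "-czf", tarball]
    for pattern in excludes:
        # each pattern is handed to tar as-is, without shell glob expansion
        cmd.append("--exclude=%s" % pattern)
    # archive the contents of workdir; the tarball itself should live elsewhere
    cmd += ["-C", workdir, "."]
    return cmd
```

The list can then be passed to subprocess.call() or a similar function.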
Scaling correction for max memory usage on UCORE, requested by R. Walker
- Now taking schedconfig.corecount into account (job.corecount/PQ.corecount * PQ.maxrss)
- UCORE queues have maxrss specified for MCORE, but value should be corrected for SCORE jobs
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-463
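The scaling formula above amounts to the following. The function and argument names are illustrative, and the guard against a non-positive PQ core count is an added assumption.

```python
def scaled_maxrss(job_corecount, pq_corecount, pq_maxrss):
    """Return job.corecount / PQ.corecount * PQ.maxrss, in the unit of PQ.maxrss."""
    if pq_corecount <= 0:
        # assumed fallback: leave the limit unscaled if the PQ core count is unusable
        return pq_maxrss
    return int(float(job_corecount) / pq_corecount * pq_maxrss)
```

For example, a single-core job on a queue configured with corecount 8 and maxrss 16000 gets a limit of 2000.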
Rucio mover update
- The first RC versions used the API again, but testing left some questions open, so this change was postponed
- Now sending traces with new rucio options (this change will not be necessary after the API is used again)
Support for new prodSourceLabel, requested by A. de Silva et al
- Pilot option -i ALRB will trigger download of job corresponding to prodSourceLabel=rc_alrb
- For testing new ALRB asetup releases
Code updates by P. Nilsson, T. Javurek, P. Svirin.
73.5
- Added safer handling of the stdout_tail variable to prevent the observed lost heartbeats, as suggested in GGUS ticket https://ggus.eu/?mode=ticket_info&ticket_id=137637
- Requested by Ivan Glushkov
- Stopped setting PQ as default local/remoteSite in Rucio traces (now using default ddmendpoint from the job definition)
- Requested by DDM team
- Bug fix: Added missing string to integer conversion in os.killpg() call for orphaned processes
- Problem introduced in previous pilot version, affected killing orphaned processes in looping jobs
- Requested by Eygene Ryabinkin
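The os.killpg() fix comes down to converting the group id before the call, since os.killpg() requires an integer. A minimal sketch with a hypothetical helper name, assuming the id arrives as a string (for example parsed from command output):

```python
import os
import signal


def kill_process_group(pgid, sig=signal.SIGKILL):
    """Send sig to a process group; pgid may be an int or a numeric string."""
    try:
        # the explicit int() conversion is the essential step
        os.killpg(int(pgid), sig)
    except (OSError, ValueError):
        # group no longer exists, permission denied, or pgid was not numeric
        return False
    return True
```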
73.4
- Event service
ES merge fix
Objectstore stage-in verification
- Replica priorities update (also done in Pilot 2)
Optimization of priority sorting for list_replicas() output
- Lingering orphaned processes (also done in Pilot 2)
Some processes were found hanging even after the pilot ended in looping jobs
Process killer algorithm updated (use group kill for orphaned processes)
Requested by P. McGuigan (UTA)
Code contributions from Alexey Anisenkov, Wen Guan, Paul Nilsson.
73.3
- Improvement of the list_replicas() update in version 73.2
Fixed logic to properly consider priority of replica protocols for stage-in. The applied fix can be reverted once the Rucio server-side patch is delivered (which will return an already sorted list to the pilot)
- Minor update to rucio mover stage-in client
Replacing download() function with download_pfns() / download_dids()
Change discussed in Rucio ticket rucio/rucio#1378
- Too large log files
It was reported in JIRA ticket https://its.cern.ch/jira/browse/ADCSUPPORT-5078 that once again log files are too big, this time due to a new sub directory _joproxy15. Pilot now removes it as well
Code contributions from Alexey Anisenkov, Tomas Javurek, Paul Nilsson.
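The protocol-priority handling for stage-in in this release can be illustrated by a client-side sort of the replica list. This is a sketch under assumptions: the function name, the flat list of PFNs and the priority list are hypothetical, and the real list_replicas() output is richer than shown here.

```python
from urllib.parse import urlparse


def sort_replicas_by_protocol(pfns, protocol_priority):
    """Order PFNs so that preferred protocols come first."""
    rank = {scheme: i for i, scheme in enumerate(protocol_priority)}
    # unknown schemes sort last; sorted() is stable, so the incoming
    # order is preserved among replicas that share a protocol
    return sorted(pfns, key=lambda pfn: rank.get(urlparse(pfn).scheme, len(rank)))
```

Once the server returns an already sorted list, a step like this would no longer be needed.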
73.2
list_replicas()
- Both the pilot and the rucio server were found to call list_replicas(). This is unnecessary and has increased the load on the rucio servers, since with the ongoing migration to rucio as sitemover, rucio calls list_replicas() for each input file download. A quick fix is to use the --pfn option with rucio download, which prevents rucio from also calling list_replicas(). This, however, bypasses useful features including fallbacks, so a better solution is being implemented on the rucio side (it will add a locally cached metadata file) and will require another pilot update during the next couple of weeks.
Wrong error message
- It was discovered that the error message "Payload exceeded max allowed memory" was overwritten by the error message for a kill signal which thus ended up on the monitor page for the failed job. This should now be fixed. Reported by R. Walker
73.1
A new pilot version has been released with a minor update:
In debug mode, the pilot now scans for the latest updated payload log file and sends its tail with each heartbeat (every five minutes). Requested by R. Walker.
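Finding the most recently updated log and extracting its tail can be sketched as follows. The function name, the *.log pattern and the number of lines are assumptions; the notes do not say which files the pilot considers or how much it sends.

```python
import glob
import os
from collections import deque


def latest_log_tail(workdir, pattern="*.log", nlines=20):
    """Return the last nlines lines of the most recently modified matching file."""
    candidates = glob.glob(os.path.join(workdir, pattern))
    if not candidates:
        return ""
    # pick the file with the newest modification time
    latest = max(candidates, key=os.path.getmtime)
    with open(latest, errors="replace") as f:
        # a bounded deque keeps only the final nlines lines in memory
        return "".join(deque(f, maxlen=nlines))
```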
73.0
Containers
- No formal container development in Pilot 1 - all container testing is done with Pilot 2
- The new pilot instruction arriving with the job parameters (--containerimage ) is removed from the job parameters if present (i.e. it is only acted on in Pilot 2)
Pilot timing
- Added on-the-fly measurement of CPU consumption time
- Pilot now reports this timing in job updates
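These notes do not say how the pilot measures CPU consumption on the fly. One common approach on Linux, shown here purely as an assumed sketch with hypothetical function names, is to read the user and system tick counters of a running process from /proc/<pid>/stat.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return user+system CPU seconds from the contents of /proc/<pid>/stat."""
    # the command name (field 2) may contain spaces, so split after its ')'
    fields = stat_line[stat_line.rindex(")") + 2:].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return (utime + stime) / float(ticks_per_second)


def cpu_time_of(pid):
    """Return CPU seconds consumed so far by a running process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return cpu_seconds_from_stat(f.read(), os.sysconf("SC_CLK_TCK"))
```

Summing this over the payload's process tree at each job update would give a running total without waiting for the payload to finish.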
Tracing
- Removed any escape characters present in stateReason
- Now reporting localSite properly in traces
Google updates
- Added https:// as approved protocol for direct access
- Added escape character for &, needed for turls
LAPP debugging
- Added detailed rucio output to log
Event service
- Now using killpg instead of kill, to include child processes in time-outs
- Now allowing ES merge jobs to select closest inputs
- On-the-fly CPU consumption time also reported for ES jobs
Contributions from W. Guan, M. Lassnig, N. Magini, P. Nilsson.
72.11
Rucio copytool update (from M. Lassnig):
- Removed troublesome API fallback
- Added -v option for more verbose output, requested by Stephane Jezequel
Google testing (from M. Lassnig):
- Added https protocol to schemas used in replica resolution algorithm
Bug fix:
- Changed logger.warning -> pUtil.tolog in detect_client_location(), reported by Javier Sanchez Martinez
72.10
The pilot has been updated for an issue seen (at least) at QMUL with metadata containing garbage data. Requested by R. Walker.