Tweaks for remote components #173

Open
wants to merge 14 commits into
base: main

Asthelen
Collaborator

@Asthelen Asthelen commented Mar 19, 2024

This PR has a few tweaks to remote codes that I found necessary for running a new model:

  • Writing an n2 after a server analysis is now optional and off by default, because a large model crashed on this step (presumably due to large computed variable sizes embedded in the n2).
  • Fixed a server-side bug affecting design variables with "." in their names (the "." was previously being replaced with var_naming_dot_replacement).
  • RemoteComp's stop_server() command now checks whether server_manager is None, so that it doesn't crash when called in parallel (where the server manager may only exist on one rank).
  • get_remote=True was added to several get_val calls on the server side. This was necessary for a model that had inputs/design variables not defined on all ranks (within a parallel group).
  • The additional_remote_constants option was added for an edge case where a distributed input on the server side only existed on the first rank. These constant inputs operate similarly to additional_remote_inputs, but their values are retrieved without get_remote and no totals are computed for them (i.e., their variable names are left out of compute_totals's wrt).
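
As a rough illustration of two of the items above (the None check in stop_server() and leaving constants out of wrt), here is a minimal sketch. The classes ToyServerManager and ToyRemoteComp, the stop_job method, and the helper build_wrt are hypothetical stand-ins written for this example and are not the project's actual code.

```python
class ToyServerManager:
    """Hypothetical stand-in for a server manager (not the project's class)."""

    def __init__(self):
        self.stopped = False

    def stop_job(self):
        self.stopped = True


class ToyRemoteComp:
    """Hypothetical stand-in for a remote component."""

    def __init__(self, server_manager=None):
        # On ranks that do not own the server, this attribute stays None.
        self.server_manager = server_manager

    def stop_server(self):
        # Guard: only a rank that actually holds a manager has a job to stop.
        if self.server_manager is not None:
            self.server_manager.stop_job()


def build_wrt(design_vars, additional_inputs, additional_constants):
    """Collect the names to differentiate with respect to.

    Assumption for this sketch: constants are simply filtered out, so no
    totals are requested for them.
    """
    constants = set(additional_constants)
    candidates = list(design_vars) + list(additional_inputs)
    return [name for name in candidates if name not in constants]
```

With this guard, calling stop_server() on a rank whose server_manager is None is a no-op rather than an AttributeError.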

In addition, a function job_has_expired has been added to check whether the job is still running before attempting a remote analysis. Previously, a job could expire during downtime (i.e., between function/gradient calls) without this being detected. For example, if two remote components are run in parallel and one is much slower, the fast one's job could expire while the slow one is running (or stuck in the queue), because the fast one only predicts how long its own analysis might take, not how much downtime there will be before the next analysis request.
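
The check-before-run pattern described above might look roughly like the sketch below. It assumes expiry can be judged by comparing elapsed wall time against the job's time limit. The PR does not say how job_has_expired decides this (it may query the scheduler instead), so the signature, the evaluate helper, and the launch_job and run_analysis callables are all hypothetical.

```python
import time


def job_has_expired(job_start_time, job_time_limit, now=None):
    """Return True if the job's time allocation has run out.

    Hypothetical sketch: all times are in seconds, and expiry is inferred
    from elapsed wall time rather than from the job scheduler.
    """
    if now is None:
        now = time.time()
    return (now - job_start_time) >= job_time_limit


def evaluate(run_analysis, launch_job, job_start_time, job_time_limit, now=None):
    """Relaunch the job if it expired during downtime, then run the analysis.

    launch_job is assumed to return the start time of the new job.
    Returns the analysis result and the (possibly updated) start time.
    """
    if job_has_expired(job_start_time, job_time_limit, now=now):
        job_start_time = launch_job()
    return run_analysis(), job_start_time
```

The point of checking immediately before each request is that the time spent idle between requests is not known in advance, so a prediction made at the previous request cannot cover it.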

Also, the placeholder ssh port forwarding process that held a specific port while in queue was replaced by a dummy zeromq socket. The prior approach worked but caused a lot of system log errors, since it was essentially setting up port forwarding with itself.
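
To illustrate the idea of reserving a port with a dummy socket, here is a minimal sketch. Note that it swaps in Python's standard socket module instead of zeromq so that it has no dependencies, and the helpers hold_port and port_is_free are hypothetical names for this example. With pyzmq, the analogous step would presumably be binding a socket created from zmq.Context() to a tcp:// address on the chosen port.

```python
import socket


def hold_port(port=0, host="127.0.0.1"):
    """Bind a placeholder socket so that no one else can take the port.

    port=0 lets the operating system pick a free port. Returns the open
    socket (close it to release the port) and the port number in use.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    return sock, sock.getsockname()[1]


def port_is_free(port, host="127.0.0.1"):
    """Return True if a fresh socket can bind to the given port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

The placeholder never listens or accepts connections. It only occupies the port until the real connection is ready to take it over, at which point the placeholder is closed.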

@Asthelen Asthelen marked this pull request as draft December 11, 2024 17:29
@Asthelen Asthelen marked this pull request as ready for review December 11, 2024 20:13
…them, remote DVs that don't exist on all ranks (e.g. from parallel group)
@kejacobson kejacobson self-requested a review December 11, 2024 20:34