Merge pull request #164 from Asthelen/network-new
Client Server OM problem for large
kejacobson authored Nov 27, 2023
2 parents 2a483ef + 294dbe8 commit c547005
Showing 13 changed files with 1,485 additions and 12 deletions.
96 changes: 96 additions & 0 deletions docs/basics/remote_components.rst
@@ -0,0 +1,96 @@
.. _remote_components:

*****************
Remote Components
*****************

The purpose of remote components is to provide a means of adding a remote physics analysis to a local OpenMDAO problem.
One situation in which this may be desirable is when the time to carry out a full optimization exceeds an HPC job time limit.
Without remote components, such a situation would normally require manual restarts of the optimization and would thus limit one to optimizers capable of restarting.
Using remote components, one can keep a serial OpenMDAO optimization running continuously on a login node (e.g., using the nohup or screen Linux commands) while the parallel physics analyses are evaluated across several HPC jobs.
Another situation where these components may be advantageous is when the OpenMDAO problem contains components not streamlined for massively parallel environments.

In general, remote components use nested OpenMDAO problems in a server-client arrangement.
The outer, client-side OpenMDAO model serves as the overarching analysis/optimization problem while the inner, server-side model serves as the isolated high-fidelity analysis.
The server inside the HPC job remains open to evaluate function or gradient calls.
Wall times for function and gradient calls are saved, and when the maximum previous time multiplied by a scale factor exceeds the remaining job time, the server will be relaunched.
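Conceptually, this relaunch check behaves like the sketch below; the function and variable names (including the default scale factor) are illustrative only, not the actual :code:`RemoteComp` internals.

.. code-block:: python

    def server_needs_relaunch(previous_wall_times, job_time_remaining, scale_factor=2.0):
        """Relaunch when the slowest previous call, scaled up, no longer fits in the remaining job time."""
        if not previous_wall_times:
            return False
        estimated_next_call = max(previous_wall_times) * scale_factor
        return estimated_next_call > job_time_remaining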

Three general base classes are used to achieve this.

* :class:`~mphys.network.remote_component.RemoteComp`: Explicit component that wraps communication with the server, replicating inputs/outputs to/from the server-side group and requesting a new server when the estimated analysis time exceeds the remaining job time.
* :class:`~mphys.network.server_manager.ServerManager`: Used by ``RemoteComp`` to control and communicate with the server.
* :class:`~mphys.network.server.Server`: Loads the inner OpenMDAO problem and evaluates function or gradient calls as requested by the ``ServerManager``.

Currently, there is one derived class for each, using pbs4py for HPC job control and ZeroMQ for network communication.

* :class:`~mphys.network.zmq_pbs.RemoteZeroMQComp`: Through the use of ``MPhysZeroMQServerManager``, uses encoded JSON dictionaries to send and receive necessary information to and from the server.
* :class:`~mphys.network.zmq_pbs.MPhysZeroMQServerManager`: Uses a ZeroMQ socket and ssh port forwarding from the login node to a compute node to communicate with the server, and pbs4py to start, stop, and check the status of HPC jobs.
* :class:`~mphys.network.zmq_pbs.MPhysZeroMQServer`: Uses a ZeroMQ socket to send and receive encoded JSON dictionaries.

RemoteZeroMQComp Options
========================
.. embed-options::
    mphys.network.zmq_pbs
    RemoteZeroMQComp
    options

Usage
=====
When adding a :code:`RemoteZeroMQComp` component, the two required options are :code:`run_server_filename`, the server script to be launched in an HPC job, and :code:`pbs`, the pbs4py Launcher object.
The server file should accept the port number as an argument to facilitate communication with the client.
Within this file, the :code:`MPhysZeroMQServer` class's :code:`get_om_group_function_pointer` option is the pointer to the OpenMDAO Group or Multipoint class to be evaluated.
By default, any design variables, objectives, and constraints defined in the group will be added on the client side.
Any other desired inputs or outputs must be added in the :code:`additional_remote_inputs` or :code:`additional_remote_outputs` options.
On the client side, any "." characters in these input and output names will be replaced by :code:`var_naming_dot_replacement`.
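A minimal client-side sketch is shown below; the subsystem and variable names, as well as the pbs4py launcher construction, are illustrative assumptions rather than a prescribed setup.

.. code-block:: python

    import openmdao.api as om
    from pbs4py import PBS  # assumed import; use whichever pbs4py Launcher suits your HPC environment
    from mphys.network.zmq_pbs import RemoteZeroMQComp

    pbs = PBS.k4(time=4)  # hypothetical launcher; adjust the queue and time limit for your machine

    prob = om.Problem()
    remote = prob.model.add_subsystem(
        'remote_analysis',
        RemoteZeroMQComp(
            run_server_filename='mphys_server.py',              # server script launched inside the HPC job
            pbs=pbs,
            additional_remote_inputs=['mach', 'qdyn'],           # illustrative extra inputs
            additional_remote_outputs=['aerostructural1.C_L']))  # "." becomes var_naming_dot_replacement on the client

    prob.setup()
    prob.run_model()
    remote.stop_server()  # stop the HPC job and ssh port forwarding when finished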

The screen output from a particular remote component's Nth server will be sent to :code:`mphys_<component name>_serverN.out`, where :code:`component name` is the subsystem name of the :code:`RemoteZeroMQComp` instance.
Searching for the keyword "SERVER" will display what the server is currently doing; the keyword "CLIENT" will do the same on the client-side.
The HPC job for the component's server is named :code:`MPhys<port number>`; the pbs4py-generated job submission script has the same name followed by ".pbs".
Note that running the remote component in parallel is not supported; attempting to do so will trigger a SystemError.

Example
=======
Two examples are provided for the `supersonic panel aerostructural case <https://github.com/OpenMDAO/mphys/tree/main/examples/aerostructural/supersonic_panel>`_: :code:`as_opt_remote_serial.py` and :code:`as_opt_remote_parallel.py`.
Both run the optimization problem defined in :code:`as_opt_parallel.py`, which contains a :code:`MultipointParallel` class and thus evaluates two aerostructural scenarios in parallel.
The serial remote example runs this group on one server.
The parallel remote example, on the other hand, contains an OpenMDAO parallel group which runs two servers in parallel.
Both examples use the same server file, :code:`mphys_server.py`, but point to either :code:`as_opt_parallel.py` or :code:`run.py` by sending the model's filename through the :code:`RemoteZeroMQComp`'s :code:`additional_server_args` option.
As demonstrated in this server file, additional configuration options may be sent to the server-side OpenMDAO group through the use of a functor (called :code:`GetModel` in this case) in combination with :code:`additional_server_args`.
In this particular case, scenario name(s) are sent as :code:`additional_server_args` from the client side; on the server side, the :code:`GetModel` functor allows the scenario name(s) to be sent as OpenMDAO options to the server-side group.
Using the scenario :code:`run_directory` option, the scenarios can then be evaluated in different directories.
In both examples, the remote component(s) use a :code:`K4` pbs4py Launcher object, which will launch, monitor, and stop jobs using the K4 queue of the NASA K-cluster.
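A sketch of what such a server file might look like is shown below; the argument parsing, the :code:`GetModel` functor, and the :code:`MPhysZeroMQServer` construction are assumptions based on the description above, not a verbatim copy of :code:`mphys_server.py`.

.. code-block:: python

    import argparse
    from mphys.network.zmq_pbs import MPhysZeroMQServer

    class GetModel:
        """Functor that builds the server-side OpenMDAO group for the requested scenarios."""
        def __init__(self, scenario_names):
            self.scenario_names = scenario_names

        def __call__(self):
            from as_opt_parallel import Model  # model module chosen via additional_server_args
            return Model(scenario_names=self.scenario_names)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--port', type=int)  # port number passed in by the client
        parser.add_argument('--scenario_name', nargs='+',
                            default=['aerostructural1', 'aerostructural2'])
        args = parser.parse_args()

        # hypothetical constructor and serve call; see the example's mphys_server.py for the real interface
        server = MPhysZeroMQServer(args.port,
                                   get_om_group_function_pointer=GetModel(args.scenario_name))
        server.run()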

Troubleshooting
===============
The :code:`dump_json` option for :code:`RemoteZeroMQComp` will make the component write input and output JSON files, which contain all data sent to and received from the server.
An exception is the :code:`wall_time` entry (given in seconds) in the output JSON file, which is added on the client-side after the server has completed the design evaluation.
Another entry that is only provided for informational purposes is :code:`design_counter`, which keeps track of how many different designs have been evaluated on the current server.
If :code:`dump_separate_json` is set to True, then separate files will be written for each design evaluation.
On the server side, an n2 file titled :code:`n2_inner_analysis_<component name>.html` will be written after each evaluation.
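The JSON dumps described above can be enabled when the component is instantiated, for example (option names as documented here; the surrounding names follow the earlier client-side sketch):

.. code-block:: python

    remote = RemoteZeroMQComp(run_server_filename='mphys_server.py',
                              pbs=pbs,                  # pbs4py Launcher, as in the earlier sketch
                              dump_json=True,           # write input/output JSON for each server call
                              dump_separate_json=True)  # one set of files per design evaluation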

Current Limitations
===================
* A pbs4py Launcher must be implemented for your HPC environment.
* On the client side, :code:`RemoteZeroMQComp.stop_server()` should be added after your analysis/optimization to stop the HPC job and ssh port forwarding, which the server manager starts as a background process.
* If :code:`stop_server` is not called or the server stops unexpectedly, stopping the port forwarding manually is difficult, as it involves finding the ssh process associated with the remote server's port number. This must be done on the same login node that the server was launched from.
* Stopping the HPC job is somewhat easier as the job name will be :code:`MPhys` followed by the port number; however, if runs are launched from multiple login nodes then one may have multiple jobs with the same name.
* Currently, the :code:`of` option (as well as :code:`wrt`) for :code:`check_totals` or :code:`compute_totals` is not used by the remote component; on the server side, :code:`compute_totals` will be evaluated for all responses (objectives, constraints, and :code:`additional_remote_outputs`). Depending on how many :code:`of` responses are desired, this may be more costly than not using remote components.
* The HPC environment must allow ssh port forwarding from the login node to a compute node.

.. autoclass:: mphys.network.remote_component.RemoteComp
    :members:

.. autoclass:: mphys.network.server_manager.ServerManager
    :members:

.. autoclass:: mphys.network.server.Server
    :members:

.. autoclass:: mphys.network.zmq_pbs.RemoteZeroMQComp
    :members:

.. autoclass:: mphys.network.zmq_pbs.MPhysZeroMQServerManager
    :members:

.. autoclass:: mphys.network.zmq_pbs.MPhysZeroMQServer
    :members:
1 change: 1 addition & 0 deletions docs/index.rst
@@ -23,6 +23,7 @@ These are descriptions of how MPhys works and how it interfaces with solvers and
basics/tagged_promotion.rst
basics/builders.rst
basics/naming_conventions.rst
basics/remote_components.rst

.. _scenario_library:

208 changes: 208 additions & 0 deletions examples/aerostructural/supersonic_panel/as_opt_parallel.py
@@ -0,0 +1,208 @@
import numpy as np
import openmdao.api as om
import os

from mphys import Multipoint, MultipointParallel
from mphys.scenario_aerostructural import ScenarioAeroStructural

from structures_mphys import StructBuilder
from aerodynamics_mphys import AeroBuilder
from xfer_mphys import XferBuilder
from geometry_morph import GeometryBuilder

check_totals = False # True=check objective/constraint derivatives, False=run optimization

# panel geometry
panel_chord = 0.3
panel_width = 0.01

# panel discretization
N_el_struct = 20
N_el_aero = 7

# Mphys parallel multipoint scenarios
class AerostructParallel(MultipointParallel):

    def initialize(self):
        self.options.declare('aero_builder')
        self.options.declare('struct_builder')
        self.options.declare('xfer_builder')
        self.options.declare('geometry_builder')
        self.options.declare('scenario_names')

    def setup(self):
        for i in range(len(self.options['scenario_names'])):

            # create the run directory
            if self.comm.rank==0:
                if not os.path.isdir(self.options['scenario_names'][i]):
                    os.mkdir(self.options['scenario_names'][i])
            self.comm.Barrier()

            nonlinear_solver = om.NonlinearBlockGS(maxiter=100, iprint=2, use_aitken=True, aitken_initial_factor=0.5)
            linear_solver = om.LinearBlockGS(maxiter=40, iprint=2, use_aitken=True, aitken_initial_factor=0.5)
            self.mphys_add_scenario(self.options['scenario_names'][i],
                                    ScenarioAeroStructural(
                                        aero_builder=self.options['aero_builder'],
                                        struct_builder=self.options['struct_builder'],
                                        ldxfer_builder=self.options['xfer_builder'],
                                        geometry_builder=self.options['geometry_builder'],
                                        in_MultipointParallel=True,
                                        run_directory=self.options['scenario_names'][i]),
                                    coupling_nonlinear_solver=nonlinear_solver,
                                    coupling_linear_solver=linear_solver)

# OM group
class Model(om.Group):
    def initialize(self):
        self.options.declare('scenario_names', default=['aerostructural1','aerostructural2'])

    def setup(self):
        self.scenario_names = self.options['scenario_names']

        # ivc
        self.add_subsystem('ivc', om.IndepVarComp(), promotes=['*'])
        self.ivc.add_output('modulus', val=70E9)
        self.ivc.add_output('yield_stress', val=270E6)
        self.ivc.add_output('density', val=2800.)
        self.ivc.add_output('mach', val=[5.,3.])
        self.ivc.add_output('qdyn', val=[3E4,1E4])
        #self.ivc.add_output('aoa', val=[3.,2.], units='deg') # derivatives are wrong when using vector aoa and coloring; see OpenMDAO issue 2919
        self.ivc.add_output('aoa1', val=3., units='deg')
        self.ivc.add_output('aoa2', val=2., units='deg')
        self.ivc.add_output('geometry_morph_param', val=1.)

        # create dv_struct, which is the thickness of each structural element
        thickness = 0.001*np.ones(N_el_struct)
        self.ivc.add_output('dv_struct', thickness)

        # structure setup and builder
        structure_setup = {'panel_chord' : panel_chord,
                           'panel_width' : panel_width,
                           'N_el' : N_el_struct}
        struct_builder = StructBuilder(structure_setup)

        # aero builder
        aero_setup = {'panel_chord' : panel_chord,
                      'panel_width' : panel_width,
                      'N_el' : N_el_aero}
        aero_builder = AeroBuilder(aero_setup)

        # xfer builder
        xfer_builder = XferBuilder(
            aero_builder=aero_builder,
            struct_builder=struct_builder
        )

        # geometry
        builders = {'struct': struct_builder, 'aero': aero_builder}
        geometry_builder = GeometryBuilder(builders)

        # add parallel multipoint group
        self.add_subsystem('multipoint',AerostructParallel(
                                            aero_builder=aero_builder,
                                            struct_builder=struct_builder,
                                            xfer_builder=xfer_builder,
                                            geometry_builder=geometry_builder,
                                            scenario_names=self.scenario_names))

        for i in range(len(self.scenario_names)):

            # connect scalar inputs to the scenario
            for var in ['modulus', 'yield_stress', 'density', 'dv_struct']:
                self.connect(var, 'multipoint.'+self.scenario_names[i]+'.'+var)

            # connect vector inputs
            for var in ['mach', 'qdyn']: #, 'aoa']:
                self.connect(var, 'multipoint.'+self.scenario_names[i]+'.'+var, src_indices=[i])

            self.connect(f'aoa{i+1}', 'multipoint.'+self.scenario_names[i]+'.aoa')

            # connect top-level geom parameter
            self.connect('geometry_morph_param', 'multipoint.'+self.scenario_names[i]+'.geometry.geometry_morph_param')

        # add design vars
        self.add_design_var('geometry_morph_param', lower=0.1, upper=10.0)
        self.add_design_var('dv_struct', lower=1.e-4, upper=1.e-2, ref=1.e-3)
        #self.add_design_var('aoa', lower=-10., upper=10.)
        self.add_design_var('aoa1', lower=-20., upper=20.)
        self.add_design_var('aoa2', lower=-20., upper=20.)

        # add objective/constraints
        self.add_objective(f'multipoint.{self.scenario_names[0]}.mass', ref=0.01)
        self.add_constraint(f'multipoint.{self.scenario_names[0]}.func_struct', upper=1.0, parallel_deriv_color='struct_cons') # run func_struct derivatives in parallel
        self.add_constraint(f'multipoint.{self.scenario_names[1]}.func_struct', upper=1.0, parallel_deriv_color='struct_cons')
        self.add_constraint(f'multipoint.{self.scenario_names[0]}.C_L', lower=0.15, ref=0.1, parallel_deriv_color='lift_cons') # run C_L derivatives in parallel
        self.add_constraint(f'multipoint.{self.scenario_names[1]}.C_L', lower=0.45, ref=0.1, parallel_deriv_color='lift_cons')

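# helper for constructing the model with the desired scenario names (e.g., when this module is loaded by the remote examples' server script)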
def get_model(scenario_names):
    return Model(scenario_names=scenario_names)

# run model and check derivatives
if __name__ == "__main__":

    prob = om.Problem()
    prob.model = Model()

    if check_totals:
        prob.setup(mode='rev')
        om.n2(prob, show_browser=False, outfile='n2.html')
        prob.run_model()
        prob.check_totals(step_calc='rel_avg',
                          compact_print=True,
                          directional=False,
                          show_progress=True)

    else:

        # setup optimization driver
        prob.driver = om.ScipyOptimizeDriver(debug_print=['nl_cons','objs','desvars','totals'])
        prob.driver.options['optimizer'] = 'SLSQP'
        prob.driver.options['tol'] = 1e-5
        prob.driver.options['disp'] = True
        prob.driver.options['maxiter'] = 300

        # add optimization recorder
        prob.driver.recording_options['record_objectives'] = True
        prob.driver.recording_options['record_constraints'] = True
        prob.driver.recording_options['record_desvars'] = True
        prob.driver.recording_options['record_derivatives'] = True

        recorder = om.SqliteRecorder("optimization_history.sql")
        prob.driver.add_recorder(recorder)

        # run the optimization
        prob.setup(mode='rev')
        prob.run_driver()
        prob.cleanup()

        # write out data
        cr = om.CaseReader("optimization_history.sql")
        driver_cases = cr.list_cases('driver')

        case = cr.get_case(0)
        cons = case.get_constraints()
        dvs = case.get_design_vars()
        objs = case.get_objectives()

        f = open("optimization_history.dat","w+")

        for i, k in enumerate(objs.keys()):
            f.write('objective: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + str(cr.get_case(case_id).get_objectives(scaled=False)[k][0]) + '\n')
            f.write(' ' + '\n')

        for i, k in enumerate(cons.keys()):
            f.write('constraint: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + ' '.join(map(str,cr.get_case(case_id).get_constraints(scaled=False)[k])) + '\n')
            f.write(' ' + '\n')

        for i, k in enumerate(dvs.keys()):
            f.write('DV: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + ' '.join(map(str,cr.get_case(case_id).get_design_vars(scaled=False)[k])) + '\n')
            f.write(' ' + '\n')

        f.close()