Merge pull request #164 from Asthelen/network-new
Client Server OM problem for large
kejacobson authored Nov 27, 2023
2 parents 2a483ef + 294dbe8 commit c547005
Showing 13 changed files with 1,485 additions and 12 deletions.
96 changes: 96 additions & 0 deletions docs/basics/remote_components.rst
@@ -0,0 +1,96 @@
.. _remote_components:

*****************
Remote Components
*****************

The purpose of remote components is to provide a means of adding a remote physics analysis to a local OpenMDAO problem.
One situation in which this may be desirable is when the time to carry out a full optimization exceeds an HPC job time limit.
Without remote components, such a situation would normally require manual restarts of the optimization and would thus limit one to optimizers capable of restarting.
Using remote components, one can keep a serial OpenMDAO optimization running continuously on a login node (e.g., using the nohup or screen Linux commands) while the parallel physics analyses are evaluated across several HPC jobs.
Another situation where these components may be advantageous is when the OpenMDAO problem contains components not streamlined for massively parallel environments.

In general, remote components use nested OpenMDAO problems in a server-client arrangement.
The outer, client-side OpenMDAO model serves as the overarching analysis/optimization problem while the inner, server-side model serves as the isolated high-fidelity analysis.
The server inside the HPC job remains open to evaluate function or gradient calls.
Wall times for function and gradient calls are saved, and when the maximum previous time multiplied by a scale factor exceeds the remaining job time, the server will be relaunched.
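Conceptually, this relaunch check behaves like the sketch below; the function and variable names (including the default scale factor) are illustrative only, not the actual :code:`RemoteComp` internals.

.. code-block:: python

    def server_needs_relaunch(previous_wall_times, job_time_remaining, scale_factor=2.0):
        """Relaunch when the slowest previous call, scaled up, no longer fits in the remaining job time."""
        if not previous_wall_times:
            return False
        estimated_next_call = max(previous_wall_times) * scale_factor
        return estimated_next_call > job_time_remaining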

Three general base classes are used to achieve this.

* :class:`~mphys.network.remote_component.RemoteComp`: Explicit component that wraps communication with the server, replicating inputs/outputs to/from the server-side group and requesting a new server when the estimated analysis time exceeds the remaining job time.
* :class:`~mphys.network.server_manager.ServerManager`: Used by ``RemoteComp`` to control and communicate with the server.
* :class:`~mphys.network.server.Server`: Loads the inner OpenMDAO problem and evaluates function or gradient calls as requested by the ``ServerManager``.

Currently, there is one derived class for each, using pbs4py for HPC job control and ZeroMQ for network communication.

* :class:`~mphys.network.zmq_pbs.RemoteZeroMQComp`: Through the use of ``MPhysZeroMQServerManager``, uses encoded JSON dictionaries to send and receive necessary information to and from the server.
* :class:`~mphys.network.zmq_pbs.MPhysZeroMQServerManager`: Uses a ZeroMQ socket and ssh port forwarding from the login node to a compute node to communicate with the server, and pbs4py to start, stop, and check the status of HPC jobs.
* :class:`~mphys.network.zmq_pbs.MPhysZeroMQServer`: Uses a ZeroMQ socket to send and receive encoded JSON dictionaries.

RemoteZeroMQComp Options
========================
.. embed-options::
    mphys.network.zmq_pbs
    RemoteZeroMQComp
    options

Usage
=====
When adding a :code:`RemoteZeroMQComp` component, the two required options are :code:`run_server_filename`, the server script to be launched in an HPC job, and :code:`pbs`, the pbs4py Launcher object.
The server file should accept the port number as an argument to facilitate communication with the client.
Within this file, the :code:`MPhysZeroMQServer` class's :code:`get_om_group_function_pointer` option is the pointer to the OpenMDAO Group or Multipoint class to be evaluated.
By default, any design variables, objectives, and constraints defined in the group will be added on the client side.
Any other desired inputs or outputs must be added in the :code:`additional_remote_inputs` or :code:`additional_remote_outputs` options.
On the client side, any "." characters in these input and output names will be replaced by :code:`var_naming_dot_replacement`.
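A minimal client-side sketch is shown below; the subsystem and variable names, as well as the pbs4py launcher construction, are illustrative assumptions rather than a prescribed setup.

.. code-block:: python

    import openmdao.api as om
    from pbs4py import PBS  # assumed import; use whichever pbs4py Launcher suits your HPC environment
    from mphys.network.zmq_pbs import RemoteZeroMQComp

    pbs = PBS.k4(time=4)  # hypothetical launcher; adjust the queue and time limit for your machine

    prob = om.Problem()
    remote = prob.model.add_subsystem(
        'remote_analysis',
        RemoteZeroMQComp(
            run_server_filename='mphys_server.py',              # server script launched inside the HPC job
            pbs=pbs,
            additional_remote_inputs=['mach', 'qdyn'],           # illustrative extra inputs
            additional_remote_outputs=['aerostructural1.C_L']))  # "." becomes var_naming_dot_replacement on the client

    prob.setup()
    prob.run_model()
    remote.stop_server()  # stop the HPC job and ssh port forwarding when finished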

The screen output from a particular remote component's Nth server will be sent to :code:`mphys_<component name>_serverN.out`, where :code:`component name` is the subsystem name of the :code:`RemoteZeroMQComp` instance.
Searching for the keyword "SERVER" will display what the server is currently doing; the keyword "CLIENT" will do the same on the client-side.
The HPC job for the component's server is named :code:`MPhys<port number>`; the pbs4py-generated job submission script has the same name followed by ".pbs".
Note that running the remote component in parallel is not supported; attempting to do so will trigger a SystemError.

Example
=======
Two examples are provided for the `supersonic panel aerostructural case <https://github.com/OpenMDAO/mphys/tree/main/examples/aerostructural/supersonic_panel>`_: :code:`as_opt_remote_serial.py` and :code:`as_opt_remote_parallel.py`.
Both run the optimization problem defined in :code:`as_opt_parallel.py`, which contains a :code:`MultipointParallel` class and thus evaluates two aerostructural scenarios in parallel.
The serial remote example runs this group on one server.
The parallel remote example, on the other hand, contains an OpenMDAO parallel group which runs two servers in parallel.
Both examples use the same server file, :code:`mphys_server.py`, but point to either :code:`as_opt_parallel.py` or :code:`run.py` by sending the model's filename through the :code:`RemoteZeroMQComp`'s :code:`additional_server_args` option.
As demonstrated in this server file, additional configuration options may be sent to the server-side OpenMDAO group through the use of a functor (called :code:`GetModel` in this case) in combination with :code:`additional_server_args`.
In this particular case, scenario name(s) are sent as :code:`additional_server_args` from the client side; on the server side, the :code:`GetModel` functor allows the scenario name(s) to be sent as OpenMDAO options to the server-side group.
Using the scenario :code:`run_directory` option, the scenarios can then be evaluated in different directories.
In both examples, the remote component(s) use a :code:`K4` pbs4py Launcher object, which will launch, monitor, and stop jobs using the K4 queue of the NASA K-cluster.
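A sketch of what such a server file might look like is shown below; the argument parsing, the :code:`GetModel` functor, and the :code:`MPhysZeroMQServer` construction are assumptions based on the description above, not a verbatim copy of :code:`mphys_server.py`.

.. code-block:: python

    import argparse
    from mphys.network.zmq_pbs import MPhysZeroMQServer

    class GetModel:
        """Functor that builds the server-side OpenMDAO group for the requested scenarios."""
        def __init__(self, scenario_names):
            self.scenario_names = scenario_names

        def __call__(self):
            from as_opt_parallel import Model  # model module chosen via additional_server_args
            return Model(scenario_names=self.scenario_names)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--port', type=int)  # port number passed in by the client
        parser.add_argument('--scenario_name', nargs='+',
                            default=['aerostructural1', 'aerostructural2'])
        args = parser.parse_args()

        # hypothetical constructor and serve call; see the example's mphys_server.py for the real interface
        server = MPhysZeroMQServer(args.port,
                                   get_om_group_function_pointer=GetModel(args.scenario_name))
        server.run()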

Troubleshooting
===============
The :code:`dump_json` option for :code:`RemoteZeroMQComp` will make the component write input and output JSON files, which contain all data sent to and received from the server.
An exception is the :code:`wall_time` entry (given in seconds) in the output JSON file, which is added on the client-side after the server has completed the design evaluation.
Another entry that is only provided for informational purposes is :code:`design_counter`, which keeps track of how many different designs have been evaluated on the current server.
If :code:`dump_separate_json` is set to True, then separate files will be written for each design evaluation.
On the server side, an n2 file titled :code:`n2_inner_analysis_<component name>.html` will be written after each evaluation.
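The JSON dumps described above can be enabled when the component is instantiated, for example (option names as documented here; the surrounding names follow the earlier client-side sketch):

.. code-block:: python

    remote = RemoteZeroMQComp(run_server_filename='mphys_server.py',
                              pbs=pbs,                  # pbs4py Launcher, as in the earlier sketch
                              dump_json=True,           # write input/output JSON for each server call
                              dump_separate_json=True)  # one set of files per design evaluation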

Current Limitations
===================
* A pbs4py Launcher must be implemented for your HPC environment.
* On the client side, :code:`RemoteZeroMQComp.stop_server()` should be added after your analysis/optimization to stop the HPC job and ssh port forwarding, which the server manager starts as a background process.
* If :code:`stop_server` is not called or the server stops unexpectedly, stopping the port forwarding manually is difficult, as it involves finding the ssh process associated with the remote server's port number. This must be done on the same login node that the server was launched from.
* Stopping the HPC job is somewhat easier as the job name will be :code:`MPhys` followed by the port number; however, if runs are launched from multiple login nodes then one may have multiple jobs with the same name.
* Currently, the :code:`of` option (as well as :code:`wrt`) for :code:`check_totals` or :code:`compute_totals` is not used by the remote component; on the server side, :code:`compute_totals` will be evaluated for all responses (objectives, constraints, and :code:`additional_remote_outputs`). Depending on how many :code:`of` responses are desired, this may be more costly than not using remote components.
* The HPC environment must allow ssh port forwarding from the login node to a compute node.

.. autoclass:: mphys.network.remote_component.RemoteComp
    :members:

.. autoclass:: mphys.network.server_manager.ServerManager
    :members:

.. autoclass:: mphys.network.server.Server
    :members:

.. autoclass:: mphys.network.zmq_pbs.RemoteZeroMQComp
    :members:

.. autoclass:: mphys.network.zmq_pbs.MPhysZeroMQServerManager
    :members:

.. autoclass:: mphys.network.zmq_pbs.MPhysZeroMQServer
    :members:
1 change: 1 addition & 0 deletions docs/index.rst
@@ -23,6 +23,7 @@ These are descriptions of how MPhys works and how it interfaces with solvers and
basics/tagged_promotion.rst
basics/builders.rst
basics/naming_conventions.rst
basics/remote_components.rst

.. _scenario_library:

208 changes: 208 additions & 0 deletions examples/aerostructural/supersonic_panel/as_opt_parallel.py
@@ -0,0 +1,208 @@
import numpy as np
import openmdao.api as om
import os

from mphys import Multipoint, MultipointParallel
from mphys.scenario_aerostructural import ScenarioAeroStructural

from structures_mphys import StructBuilder
from aerodynamics_mphys import AeroBuilder
from xfer_mphys import XferBuilder
from geometry_morph import GeometryBuilder

check_totals = False # True=check objective/constraint derivatives, False=run optimization

# panel geometry
panel_chord = 0.3
panel_width = 0.01

# panel discretization
N_el_struct = 20
N_el_aero = 7

# Mphys parallel multipoint scenarios
class AerostructParallel(MultipointParallel):

    def initialize(self):
        self.options.declare('aero_builder')
        self.options.declare('struct_builder')
        self.options.declare('xfer_builder')
        self.options.declare('geometry_builder')
        self.options.declare('scenario_names')

    def setup(self):
        for i in range(len(self.options['scenario_names'])):

            # create the run directory
            if self.comm.rank==0:
                if not os.path.isdir(self.options['scenario_names'][i]):
                    os.mkdir(self.options['scenario_names'][i])
            self.comm.Barrier()

            nonlinear_solver = om.NonlinearBlockGS(maxiter=100, iprint=2, use_aitken=True, aitken_initial_factor=0.5)
            linear_solver = om.LinearBlockGS(maxiter=40, iprint=2, use_aitken=True, aitken_initial_factor=0.5)
            self.mphys_add_scenario(self.options['scenario_names'][i],
                                    ScenarioAeroStructural(
                                        aero_builder=self.options['aero_builder'],
                                        struct_builder=self.options['struct_builder'],
                                        ldxfer_builder=self.options['xfer_builder'],
                                        geometry_builder=self.options['geometry_builder'],
                                        in_MultipointParallel=True,
                                        run_directory=self.options['scenario_names'][i]),
                                    coupling_nonlinear_solver=nonlinear_solver,
                                    coupling_linear_solver=linear_solver)

# OM group
class Model(om.Group):
    def initialize(self):
        self.options.declare('scenario_names', default=['aerostructural1','aerostructural2'])

    def setup(self):
        self.scenario_names = self.options['scenario_names']

        # ivc
        self.add_subsystem('ivc', om.IndepVarComp(), promotes=['*'])
        self.ivc.add_output('modulus', val=70E9)
        self.ivc.add_output('yield_stress', val=270E6)
        self.ivc.add_output('density', val=2800.)
        self.ivc.add_output('mach', val=[5.,3.])
        self.ivc.add_output('qdyn', val=[3E4,1E4])
        #self.ivc.add_output('aoa', val=[3.,2.], units='deg') # derivatives are wrong when using vector aoa and coloring; see OpenMDAO issue 2919
        self.ivc.add_output('aoa1', val=3., units='deg')
        self.ivc.add_output('aoa2', val=2., units='deg')
        self.ivc.add_output('geometry_morph_param', val=1.)

        # create dv_struct, which is the thickness of each structural element
        thickness = 0.001*np.ones(N_el_struct)
        self.ivc.add_output('dv_struct', thickness)

        # structure setup and builder
        structure_setup = {'panel_chord' : panel_chord,
                           'panel_width' : panel_width,
                           'N_el' : N_el_struct}
        struct_builder = StructBuilder(structure_setup)

        # aero builder
        aero_setup = {'panel_chord' : panel_chord,
                      'panel_width' : panel_width,
                      'N_el' : N_el_aero}
        aero_builder = AeroBuilder(aero_setup)

        # xfer builder
        xfer_builder = XferBuilder(
            aero_builder=aero_builder,
            struct_builder=struct_builder
        )

        # geometry
        builders = {'struct': struct_builder, 'aero': aero_builder}
        geometry_builder = GeometryBuilder(builders)

        # add parallel multipoint group
        self.add_subsystem('multipoint',AerostructParallel(
                                            aero_builder=aero_builder,
                                            struct_builder=struct_builder,
                                            xfer_builder=xfer_builder,
                                            geometry_builder=geometry_builder,
                                            scenario_names=self.scenario_names))

        for i in range(len(self.scenario_names)):

            # connect scalar inputs to the scenario
            for var in ['modulus', 'yield_stress', 'density', 'dv_struct']:
                self.connect(var, 'multipoint.'+self.scenario_names[i]+'.'+var)

            # connect vector inputs
            for var in ['mach', 'qdyn']: #, 'aoa']:
                self.connect(var, 'multipoint.'+self.scenario_names[i]+'.'+var, src_indices=[i])

            self.connect(f'aoa{i+1}', 'multipoint.'+self.scenario_names[i]+'.aoa')

            # connect top-level geom parameter
            self.connect('geometry_morph_param', 'multipoint.'+self.scenario_names[i]+'.geometry.geometry_morph_param')

        # add design vars
        self.add_design_var('geometry_morph_param', lower=0.1, upper=10.0)
        self.add_design_var('dv_struct', lower=1.e-4, upper=1.e-2, ref=1.e-3)
        #self.add_design_var('aoa', lower=-10., upper=10.)
        self.add_design_var('aoa1', lower=-20., upper=20.)
        self.add_design_var('aoa2', lower=-20., upper=20.)

        # add objective/constraints
        self.add_objective(f'multipoint.{self.scenario_names[0]}.mass', ref=0.01)
        self.add_constraint(f'multipoint.{self.scenario_names[0]}.func_struct', upper=1.0, parallel_deriv_color='struct_cons') # run func_struct derivatives in parallel
        self.add_constraint(f'multipoint.{self.scenario_names[1]}.func_struct', upper=1.0, parallel_deriv_color='struct_cons')
        self.add_constraint(f'multipoint.{self.scenario_names[0]}.C_L', lower=0.15, ref=0.1, parallel_deriv_color='lift_cons') # run C_L derivatives in parallel
        self.add_constraint(f'multipoint.{self.scenario_names[1]}.C_L', lower=0.45, ref=0.1, parallel_deriv_color='lift_cons')

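# helper for constructing the model with the desired scenario names (e.g., when this module is loaded by the remote examples' server script)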
def get_model(scenario_names):
    return Model(scenario_names=scenario_names)

# run model and check derivatives
if __name__ == "__main__":

    prob = om.Problem()
    prob.model = Model()

    if check_totals:
        prob.setup(mode='rev')
        om.n2(prob, show_browser=False, outfile='n2.html')
        prob.run_model()
        prob.check_totals(step_calc='rel_avg',
                          compact_print=True,
                          directional=False,
                          show_progress=True)

    else:

        # setup optimization driver
        prob.driver = om.ScipyOptimizeDriver(debug_print=['nl_cons','objs','desvars','totals'])
        prob.driver.options['optimizer'] = 'SLSQP'
        prob.driver.options['tol'] = 1e-5
        prob.driver.options['disp'] = True
        prob.driver.options['maxiter'] = 300

        # add optimization recorder
        prob.driver.recording_options['record_objectives'] = True
        prob.driver.recording_options['record_constraints'] = True
        prob.driver.recording_options['record_desvars'] = True
        prob.driver.recording_options['record_derivatives'] = True

        recorder = om.SqliteRecorder("optimization_history.sql")
        prob.driver.add_recorder(recorder)

        # run the optimization
        prob.setup(mode='rev')
        prob.run_driver()
        prob.cleanup()

        # write out data
        cr = om.CaseReader("optimization_history.sql")
        driver_cases = cr.list_cases('driver')

        case = cr.get_case(0)
        cons = case.get_constraints()
        dvs = case.get_design_vars()
        objs = case.get_objectives()

        f = open("optimization_history.dat","w+")

        for i, k in enumerate(objs.keys()):
            f.write('objective: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + str(cr.get_case(case_id).get_objectives(scaled=False)[k][0]) + '\n')
            f.write(' ' + '\n')

        for i, k in enumerate(cons.keys()):
            f.write('constraint: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + ' '.join(map(str,cr.get_case(case_id).get_constraints(scaled=False)[k])) + '\n')
            f.write(' ' + '\n')

        for i, k in enumerate(dvs.keys()):
            f.write('DV: ' + k + '\n')
            for j, case_id in enumerate(driver_cases):
                f.write(str(j) + ' ' + ' '.join(map(str,cr.get_case(case_id).get_design_vars(scaled=False)[k])) + '\n')
            f.write(' ' + '\n')

        f.close()