Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior with Max Snapshots Per File with Restarts #6795

Open
bogensch opened this issue Dec 4, 2024 · 9 comments
Open

Inconsistent behavior with Max Snapshots Per File with Restarts #6795

bogensch opened this issue Dec 4, 2024 · 9 comments
Assignees
Labels
bug EAMxx PRs focused on capabilities for EAMxx

Comments

@bogensch
Copy link
Contributor

bogensch commented Dec 4, 2024

In EAMxx the YAML directive “Max Snapshot Per File” seems to be tripping up upon restarts and not behaving as expected.

For example, a DPxx run that is 6 hours in duration I have an output stream with Max Snapshots Per File: 1 with hourly averaged output. I have observed the following:

  • CNTL: If I set the simulation to run through without restarts: it will produce 6 files as expected.
  • EXP01: If I direct the simulation to run with one restart halfway through: it will produce 5 output files. The third file in the series will contain two time slices.
  • EXP02: If I direct the simulation to run with restarts each hour (hence, five restarts): it will produce 1 output file. The output file will contain all 6 time slices.

I have observed similar behavior in multiple DPxx simulations for both pm-cpu and pm-gpu. In a large production run I’m doing I have it doing daily restarts. I have an output stream with hourly output with Max Snapshots Per File: 24. In this case it is putting all my data into ONE file.

I have performed one global ne30 test and noticed similar behavior with restarts and unexpected behavior with Max Snapshots Per File treatment. Thus, this does not appear to be a DPxx specific problem but a general problem.

For a quick reproducer of EXP01, run the DYCOMS-RF01 case:
https://github.com/E3SM-Project/scmlib/blob/master/DPxx_SCREAM_SCRIPTS/run_dpxx_scream_DYCOMSrf01.csh
Set to run for 3 hours (search for stop_n) and set for one restart.

And direct to the following YAML:

%YAML 1.1
---
Averaging Type: Average
Max Snapshots Per File: 1
filename_prefix: ${CASE}.scream.hourly.avg
Fields:
  Physics PG2:
    Field Names:
    - LiqWaterPath
output_control:
  Frequency: 1
  frequency_units: nhours
@bogensch bogensch added bug EAMxx PRs focused on capabilities for EAMxx labels Dec 4, 2024
@mahf708
Copy link
Contributor

mahf708 commented Dec 4, 2024

I don't believe we've seen this odd behavior when we have Max Snapshots Per File equaling a day worth of data (so 8 snaps for 3-hourly stream). I do wonder if this is specific to Max Snapshots Per File: 1 ...

@mahf708
Copy link
Contributor

mahf708 commented Dec 4, 2024

we discussed making this default 👀 👀

Restart:
  force_new_file: true

@bogensch
Copy link
Contributor Author

bogensch commented Dec 4, 2024

@mahf708 I see this when I have Max Snapshots Per File: 24 . You probably don't have this issue because I see your stream sets:

Restart:
  force_new_file: true

which overrides this behavior (consistent with my experience too).

@mahf708
Copy link
Contributor

mahf708 commented Dec 4, 2024

Yep! I was gonna say the same exact thing!

I have an output stream with hourly output with Max Snapshots Per File: 24. In this case it is putting all my data into ONE file.

Also, sorry I just saw this in your op :/

@bartgol
Copy link
Contributor

bartgol commented Dec 4, 2024

Even without forcing a new file, the output infrastructure should see that the last output file is full, so it should NOT try to add a new slice to that one. Definitely a bug.

@mahf708
Copy link
Contributor

mahf708 commented Dec 4, 2024

Even without forcing a new file, the output infrastructure should see that the last output file is full, so it should NOT try to add a new slice to that one. Definitely a bug.

Guess something's gone wrong here?

      // Check if the prev run wrote any output file (it may have not, if the restart was written
      // before the 1st output step). If there is a file, check if there's still room in it.
      const auto& last_output_filename = get_attribute<std::string>(rhist_file,"GLOBAL","last_output_filename");
      m_resume_output_file = last_output_filename!="" and not restart_pl.get("force_new_file",false);
      if (m_resume_output_file) {
        m_output_file_specs.storage.num_snapshots_in_file = scorpio::get_attribute<int>(rhist_file,"GLOBAL","last_output_file_num_snaps");

        if (m_output_file_specs.storage.snapshot_fits(m_output_control.next_write_ts)) {
          // The setup_file call will not register any new variable (the file is in Append mode,
          // so all dims/vars must already be in the file). However, it will register decompositions,
          // since those are a property of the run, not of the file.
          m_output_file_specs.filename = last_output_filename;
          m_output_file_specs.is_open = true;
          setup_file(m_output_file_specs,m_output_control);
        } else {
          m_output_file_specs.close();
        }
      }

@bartgol
Copy link
Contributor

bartgol commented Dec 4, 2024

I guess. But nothing stands out... I would have to reproduce manually, then inundate the src code with print statements and see...

I'll get to it hopefully this week. Unless someone else feels like taking this on

@AaronDonahue
Copy link
Contributor

quick look at Peter B's case shows that in the metadata last_output_file_num_snaps is always 0 for those rhist files. So likely the issue is wherever that metadata is being set.

@AaronDonahue
Copy link
Contributor

https://github.com/E3SM-Project/E3SM/blob/master/components/eamxx/src/share/io/scream_output_manager.cpp#L537

So probably would just need print statements here. I have a bunch of meetings tomorrow, but if I get a good break I will take a look.

AaronDonahue added a commit that referenced this issue Dec 11, 2024
This commit fixes an issue during restarts that occurs with
averaged type output.  The restart history file (rhist) metadata
was incorrectly setup which could lead EAMxx to reopen files that
already had the max number of snaps in them and continue to fill
them at the restart step.

Fixes #6795
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug EAMxx PRs focused on capabilities for EAMxx
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants