Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker: Collect JVM heap dump in case server run into error #2546

Merged
merged 1 commit into from
Jan 1, 2025

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Jan 1, 2025

User description

Thanks for contributing to the Docker-Selenium project!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines, applied for this repository.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

Support #2528

  • ENV var SE_JAVA_OPTS_DEFAULT=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/selenium/logs append to command start the server. Heap dump is created in /opt/selenium/logs directory in case the server run into OOM.
  • If the user wants to collect heap dump on demand, set env var SE_JAVA_HEAP_DUMP=true. Once Pod/container is terminated or restarted, the heap dump will be created in /opt/selenium/logs. In deployment.

Noted: this needs to set volumes/PVC mount /opt/selenium/logs in container to persist the data file.

This pull request introduces several improvements to the Selenium Grid setup, including enhancements to the Dockerfile, new environment variables, and heap dump handling across various components. The changes also include updates to Helm charts for better configuration management.

Dockerfile and Environment Variables:

  • Changed from openjdk-${JRE_VERSION}-jre-headless to openjdk-${JRE_VERSION}-jdk-headless to include JDK instead of JRE (Base/Dockerfile).
  • Added new directories and permissions for Selenium logs and heap dumps (Base/Dockerfile) [1] [2].
  • Introduced new environment variables SE_JAVA_OPTS_DEFAULT and SE_JAVA_HEAP_DUMP to manage Java options and heap dump settings (Base/Dockerfile).

Heap Dump Handling:

  • Added handle_heap_dump.sh script to manage heap dumps (Base/handle_heap_dump.sh).
  • Updated start scripts for various components to handle heap dumps using the new script and environment variables (Distributor/start-selenium-grid-distributor.sh, EventBus/start-selenium-grid-eventbus.sh, Hub/start-selenium-grid-hub.sh, NodeBase/start-selenium-node.sh, NodeDocker/start-selenium-grid-docker.sh, Router/start-selenium-grid-router.sh, SessionQueue/start-selenium-grid-session-queue.sh, Sessions/start-selenium-grid-sessions.sh, Standalone/start-selenium-standalone.sh, StandaloneDocker/start-selenium-grid-docker.sh) [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17].

Helm Charts:

  • Added configuration options for extra volume mounts and volumes in the Helm chart (charts/selenium-grid/CONFIGURATION.md, charts/selenium-grid/values.yaml) [1] [2] [3] [4].
  • Updated test deployment values to include heap dump settings and volume configurations (tests/charts/ci/DeploymentAutoscaling-values.yaml) [1] [2] [3].

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Enhancement


Description

  • Added JVM heap dump collection functionality

  • Created heap dump handler script

  • Added process management with PID tracking

  • Added volume mounts for logs in Helm charts


Changes walkthrough 📝

Relevant files
Enhancement
11 files
handle_heap_dump.sh
New script to handle JVM heap dump collection                       
+19/-0   
start-selenium-grid-distributor.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-eventbus.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-hub.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-node.sh
Add heap dump handling and process management                       
+21/-1   
start-selenium-grid-docker.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-router.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-session-queue.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-sessions.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-standalone.sh
Add heap dump handling and process management                       
+18/-1   
start-selenium-grid-docker.sh
Add heap dump handling and process management                       
+18/-1   
Configuration changes
1 files
Dockerfile
Add JDK and heap dump configuration                                           
+6/-3     
Additional files
7 files
CONFIGURATION.md +4/-2     
values.yaml +17/-9   
DeploymentAutoscaling-values.yaml +14/-0   
JobAutoscaling-values.yaml +2/-0     
NoAutoscaling-values.yaml +14/-0   
base-auth-ingress-values.yaml +12/-0   
chart_test.sh +3/-5     

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

Copy link

qodo-merge-pro bot commented Jan 1, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Error Handling

The heap dump script does not handle errors that may occur during jmap execution or file operations. Consider adding error handling and logging.

  echo "Server process is still running. Create heap dump by using jmap"
  jmap -dump:live,format=b,file="${filename}" "${SELENIUM_SERVER_PID}"
else
Resource Usage

Installing full JDK instead of JRE may increase container size unnecessarily. Consider using JRE if debugging tools like jmap are available separately.

openjdk-${JRE_VERSION}-jdk-headless \

Copy link

qodo-merge-pro bot commented Jan 1, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Score
Possible issue
Add error handling for critical diagnostic operations to ensure reliability

Add error handling for the jmap command to ensure the heap dump is properly created
and handle potential failures gracefully.

Base/handle_heap_dump.sh [10]

-jmap -dump:live,format=b,file="${filename}" "${SELENIUM_SERVER_PID}"
+if ! jmap -dump:live,format=b,file="${filename}" "${SELENIUM_SERVER_PID}"; then
+  echo "Failed to create heap dump using jmap"
+  exit 1
+fi
  • Apply this suggestion
Suggestion importance[1-10]: 8

Why: Adding error handling for the jmap command is critical as it ensures heap dumps are properly created and failures are detected, which is essential for debugging production issues.

8
Validate directory permissions before performing file operations

Check if the target directory exists and is writable before attempting to create or
move heap dump files.

Base/handle_heap_dump.sh [7]

+if [ ! -d "$LOG_DIR" ] || [ ! -w "$LOG_DIR" ]; then
+  echo "Error: $LOG_DIR does not exist or is not writable"
+  exit 1
+fi
 filename="$LOG_DIR/dump_pid${SELENIUM_SERVER_PID}_${TIMESTAMP}.hprof"
  • Apply this suggestion
Suggestion importance[1-10]: 7

Why: Checking directory permissions before attempting file operations prevents silent failures and provides clear error messages, improving diagnostic capabilities.

7
Enable immediate error detection at script start to prevent silent failures

The set -e command should be moved to the start of the script to ensure all errors
are caught immediately.

NodeBase/start-selenium-node.sh [15-16]

+#!/usr/bin/env bash
+set -e
+
 rm -f /tmp/.X*lock
 
-# set -e: exit asap if a command exits with a non-zero status
-set -e
-
  • Apply this suggestion
Suggestion importance[1-10]: 6

Why: Moving 'set -e' to the script start ensures all errors are caught immediately, preventing potential issues from undetected failures early in the script execution.

6

Copy link

qodo-merge-pro bot commented Jan 1, 2025

CI Failure Feedback 🧐

(Checks updated until commit 54943fb)

Action: Rerun workflow when failure

Failed stage: Authenticate GitHub CLI for PR [❌]

Failure summary:

The action failed because the GitHub authentication token lacks the required 'read:org' permission
scope. The error occurred during the gh auth login command, which requires proper authorization
scopes to interact with GitHub's API.

Relevant error logs:
1:  ##[group]Operating System
2:  Ubuntu
...

28:  SecurityEvents: write
29:  Statuses: write
30:  ##[endgroup]
31:  Secret source: Actions
32:  Prepare workflow directory
33:  Prepare all required actions
34:  Getting action download info
35:  Download action repository 'actions/checkout@main' (SHA:cbb722410c2e876e24abbe8de2cc27693e501dcb)
36:  Complete job name: Rerun workflow when failure
...

48:  show-progress: true
49:  lfs: false
50:  submodules: false
51:  set-safe-directory: true
52:  env:
53:  GH_CLI_TOKEN: ***
54:  GH_CLI_TOKEN_PR: ***
55:  RUN_ID: 12568267750
56:  RERUN_FAILED_ONLY: true
...

119:  ##[group]Run sudo apt update
120:  �[36;1msudo apt update�[0m
121:  �[36;1msudo apt install gh�[0m
122:  shell: /usr/bin/bash -e {0}
123:  env:
124:  GH_CLI_TOKEN: ***
125:  GH_CLI_TOKEN_PR: ***
126:  RUN_ID: 12568267750
127:  RERUN_FAILED_ONLY: true
...

164:  0 upgraded, 0 newly installed, 0 to remove and 48 not upgraded.
165:  ##[group]Run echo "$GH_CLI_TOKEN_PR" | gh auth login --with-token
166:  �[36;1mecho "$GH_CLI_TOKEN_PR" | gh auth login --with-token�[0m
167:  shell: /usr/bin/bash -e {0}
168:  env:
169:  GH_CLI_TOKEN: ***
170:  GH_CLI_TOKEN_PR: ***
171:  RUN_ID: 12568267750
172:  RERUN_FAILED_ONLY: true
173:  RUN_ATTEMPT: 1
174:  ##[endgroup]
175:  error validating token: missing required scope 'read:org'
176:  ##[error]Process completed with exit code 1.

✨ CI feedback usage guide:

The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
The tool analyzes the failed checks and provides several feedbacks:

  • Failed stage
  • Failed test name
  • Failure summary
  • Relevant error logs

In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

/checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"

where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

Configuration options

  • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
  • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
  • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
  • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
  • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

See more information about the checks tool in the docs.

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@VietND96 VietND96 merged commit 3902799 into trunk Jan 1, 2025
30 of 36 checks passed
@VietND96 VietND96 deleted the jvm-logs branch January 1, 2025 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant