Spike - testing CSV exports in production #3201

Closed
3 tasks
jtimpe opened this issue Sep 25, 2024 · 4 comments
Labels: spike · P3 Needed – Routine · Refined


jtimpe commented Sep 25, 2024

Description:
#3162 addressed memory issues related to the CSV export, noted in #3137 and #3138. While testing the solution, we noticed an out-of-memory error is still possible, but substantially more records can now be reliably exported before the export breaks. The limitations are noted in @ADPennington's comment on #3162.

Since we are no longer caching the queryset, and file I/O is done as efficiently as possible using Python's io and gzip modules, the working theory is that the Celery worker is leaking memory. Indeed, the following warning appears wherever DJANGO_DEBUG=True is set (which is the case for all deployed dev environments):

/celery/fixups/django.py:203: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!

In dev environments, we can set DJANGO_DEBUG to False to test this assumption. In production, DJANGO_DEBUG is already False, so we'd like to observe the behavior of the CSV export in production with the following questions in mind:
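
For context, a minimal sketch of the kind of streaming export described above (not necessarily the exact #3162 implementation; the queryset, field list, and output path are illustrative):

import csv
import gzip

def export_csv_gz(queryset, fields, path):
    """Stream rows straight from the database cursor into a gzipped CSV."""
    # queryset.iterator() avoids Django's queryset result cache, so rows are
    # not retained in memory after they have been written out.
    with gzip.open(path, "wt", newline="") as gz:
        writer = csv.writer(gz)
        writer.writerow(fields)
        for obj in queryset.iterator(chunk_size=2000):
            writer.writerow([getattr(obj, f) for f in fields])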

Open Questions:
Please include any questions, possible solutions or decisions that should be explored during work

  • Are reliable CSV exports of 600k+ records possible in production (or with DJANGO_DEBUG=False)? 900k-1m+?
  • How do production's system resources behave during large csv exports?
  • Do simultaneous operations (large queries, memory-heavy processes) on the backend create memory pressure for celery tasks?

Deliverable(s):
Create a list of recommendations or proofs of concept to be achieved to complete this issue

  • Turn DJANGO_DEBUG to False in a non-prod environment (see the settings sketch after this list). Run exports of 600k, 900k, and 1m+ rows and observe memory.
  • Repeat, this time performing memory-heavy operations while the exports run, and observe.
  • Observe memory in the production environment during exports.
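
For reference, a sketch of how this kind of env toggle is typically wired up in Django settings (the parsing in our actual settings module may differ):

import os

# DJANGO_DEBUG is supplied by the environment (e.g. "Yes"/"No" in our deploys);
# anything other than an affirmative value falls back to False.
DEBUG = os.getenv("DJANGO_DEBUG", "No").strip().lower() in ("yes", "true", "1")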

Supporting Documentation:
Please include any relevant log snippets/files/screen shots

In QASP, this was tested by running the following script while a CSV export ran simultaneously:

# Append cf's app-level CPU/memory stats to memory.txt once per second.
while true
do
    echo "Watching memory..."
    cf app tdp-backend-qasp >> memory.txt
    sleep 1
done
  • Results: memory.txt
    • QASP where DJANGO_DEBUG=True, export of ~950k records
    • Notice that memory started around 760MB and rose quickly to 1.7GB once the export started. It stayed at 1.7GB until the export completed, then returned to ~800MB.
    • This export succeeded, but it doesn't always. See this comment for an example of the error.
@jtimpe jtimpe added the spike label Sep 25, 2024
jtimpe commented Sep 25, 2024

This research may best be done after the introduction of #3046

@vlasse86 vlasse86 added the P3 Needed – Routine label Oct 1, 2024
@andrew-jameson andrew-jameson added the Refined Ticket has been refined at the backlog refinement label Oct 7, 2024
jtimpe commented Oct 8, 2024

Celery Flower can also be used to monitor resources in place of PLG (#3046).

jtimpe commented Oct 29, 2024

Summary of testing so far

  • I have been unable to cause an export to fail with the signal 9 (SIGKILL) described in the linked comment; all exports have been successful.
  • Memory is reserved by the Celery process and not released when the task completes, so it certainly makes sense that a large enough export would cause an out-of-memory error.

Test Runs

Test script:

# Append the backend app's "type:" and "#0" status lines to memory.txt each second.
while true
do
    echo "Watching memory..."
    # cf app tdp-backend-raft | grep "#0" >> memory.txt
    cf app tdp-backend-raft | grep -e "type:" -e "#0" >> memory.txt
    sleep 1
done

This produces a four-line set like the following every second:

type:           web
#0   running   2024-10-23T16:11:38Z   50.6%   1.1G of 2G   768.7M of 2G   
type:           worker
#0   running   2024-10-23T16:11:24Z   0.8%   45.9M of 128M   836.1M of 2G   
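
A quick way to pull the web-instance memory samples back out of memory.txt (a sketch that assumes the exact four-line layout above; the column meanings are described in the notes below):

# Collect (since, memory) pairs for the "web" instance from memory.txt.
samples = []
with open("memory.txt") as f:
    lines = [line.split() for line in f if line.strip()]
for i, parts in enumerate(lines):
    if parts[:2] == ["type:", "web"] and i + 1 < len(lines):
        stats = lines[i + 1]  # e.g. ['#0', 'running', '2024-10-23T16:11:38Z', '50.6%', '1.1G', 'of', '2G', ...]
        samples.append((stats[2], stats[4]))
print(samples[:5])
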
  • The stats listed are (left to right): state, since, CPU, memory, disk.
  • The line after type: web refers to everything we deploy as part of our application manifest, including Django, Celery, and Redis. This is the most relevant line, and it is what I'm referring to when talking about CPU and memory below.
  • The line after type: worker refers to our PLG processes ingesting logs and system stats.
  • Export start is marked when web CPU starts to climb (to about 50% utilization); end is when it returns to baseline (about 6%).
  • Memory is "reset" when the value of DJANGO_DEBUG is changed, because the application has to be restaged.
  • DJANGO_DEBUG=Yes can cause a memory leak in Celery. This is the case in all non-production environments.
  • Celery's max-tasks-per-child specifies how many tasks a worker process can complete before it is replaced with a new process. We set this to 1 to spawn a new worker after every task, in an attempt to limit the effect of the memory leak (see the config sketch below).
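
For reference, a sketch of setting this via Celery config rather than the CLI flag (the app name below is illustrative, not our actual module):

from celery import Celery

app = Celery("tdp")  # illustrative app name

# Recycle each worker process after a single task so any memory it held is
# released; this mirrors the --max-tasks-per-child=1 worker CLI flag.
app.conf.worker_max_tasks_per_child = 1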

Run 1: memory-raft-1-debug-on.txt

  • 707,782 T3s
  • DJANGO_DEBUG=Yes
  • celery max-tasks-per-child=1
  • Memory starts at baseline, increases as the export task gets underway, then decreases back to baseline once the worker is recycled. Effectively "no memory leak" (because of max-tasks-per-child=1).

Run 2: memory-raft-2-debug-off.txt

  • 707,782 T3s
  • DJANGO_DEBUG=No
  • celery max-tasks-per-child=1
  • Memory starts at baseline, increases as the export task gets underway, then decreases back to baseline once the worker is recycled (max-tasks-per-child=1).

Run 3: memory-raft-3-debug-on.txt

  • 707,782 T3s
  • DJANGO_DEBUG=Yes
  • celery max-tasks-per-child=inf
  • Memory stuck at 1.1GB after the file export ends.

Run 4: memory-raft-4-debug-on.txt

  • 707,782 T3s
  • DJANGO_DEBUG=Yes
  • celery max-tasks-per-child=inf
  • Second test without changing anything, hoping to see the memory leak grow.
  • Memory usage didn't change (1.1GB) through the whole run. Maybe exporting a larger set (than the previous one) would make it climb more.
  • Checked the memory again after some cooldown time: memory-raft-4-after.txt

Run 5: memory-raft-5-debug-off.txt

  • 707,782 T3s
  • DJANGO_DEBUG=No
  • celery max-tasks-per-child=inf
  • Memory reset after the restage; checking whether DJANGO_DEBUG=No affects the memory hold.
  • Memory still climbs and holds at 1.1GB.

Run 6: memory-raft-6-debug-off.txt

  • 396,334 T1s
  • DJANGO_DEBUG=No
  • celery max-tasks-per-child=inf
  • 3 days later (Run 5 was Friday, Run 6 was Monday).
  • Still holding 1.1GB of memory after the weekend with no export. Trying a different export set to see whether the cache is additive.
  • Memory did not increase past 1.1GB. It seems memory is "reserved" and not released, but if the process doesn't need more it won't take it.

Next steps

  • Better monitoring - it's possible that polling cf app for memory stats is simply missing something. Regardless, it only gives a simple picture of memory usage for the entire backend deployment. It would be more useful to see memory requests and limits per service (Django, Redis, Celery, gunicorn, etc.).
  • Debug system swap memory and garbage collection. This means much lower-level debugging of the Python and system runtimes to better diagnose the issue. We may be able to tune Python's garbage collection to better handle our memory needs (see the GC sketch below).
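
As a starting point for the GC angle, a sketch of the kind of inspection/tuning that could be tried (the threshold values are arbitrary examples, not a recommendation):

import gc

# Inspect the current generational thresholds and per-generation stats.
print(gc.get_threshold())  # defaults to (700, 10, 10)
print(gc.get_stats())

# Make gen-0 collections less frequent (values here are arbitrary examples).
gc.set_threshold(5000, 20, 20)

# Or force a full collection at the end of a memory-heavy task and report how
# many unreachable objects were found.
reclaimed = gc.collect()
print(f"gc.collect() found {reclaimed} unreachable objects")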

Possible solutions

  • Decrease gunicorn workers to 1 - avoids additive memory pressure from concurrent users.
  • Increase memory for tdp-backend. Production already has 4GB of memory compared to 2GB in the lower environments, so it's likely that prod can already handle much larger exports and concurrent requests.
  • Separate Django, Redis, and Celery so they don't run on the same application; that way they don't compete for system resources and can be individually tuned (2592/separate celery #2773).
  • Schedule memory-heavy Celery tasks to run when memory pressure from other parts of the system is expected to be low (see the beat schedule sketch below).
  • Utilize AWS Lambda functions to independently scale individual tasks.
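
On the scheduling idea, a sketch using Celery beat's crontab schedule (the app name, task path, and time below are placeholders, not existing code):

from celery import Celery
from celery.schedules import crontab

app = Celery("tdp")  # illustrative app name

# Run a hypothetical nightly export task at 02:00, when memory pressure from
# parsing and user activity is expected to be low.
app.conf.beat_schedule = {
    "nightly-csv-export": {
        "task": "exports.tasks.export_csv",  # placeholder task path
        "schedule": crontab(hour=2, minute=0),
    },
}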

jtimpe commented Oct 30, 2024
