100% CPU usage #80

Open
bgondell opened this issue Jan 15, 2025 · 5 comments

Comments

@bgondell

bgondell commented Jan 15, 2025

TT uses 100% of CPU after a few seconds.

When I first deployed TT there were no problems; it just worked fine. Usage is pretty intensive: 700-1000 metrics on every Telegraf 10s flush.
After several days it started to use 100% of CPU, though it still worked, and restarting the service kept CPU usage under control.
But now it goes almost directly to 100% CPU.

I compiled the latest version with the same results.

I cleaned the data folder (Dec 2024 and the first days of 2025, almost 1.6 GB) and it appears to work fine, but I want to figure out what is happening.

ls -ltr ticktock-bkp2/2024/12
total 56
drwxr--r-- 32 root root 4096 Dec  2 14:06 1733097600.1733184000
drwxr--r-- 32 root root 4096 Dec  3 13:35 1733184000.1733270400
drwxr--r-- 32 root root 4096 Dec  4 16:58 1733270400.1733356800
drwxr--r-- 32 root root 4096 Dec  5 00:00 1733356800.1733443200
drwxr--r-- 32 root root 4096 Dec  9 15:13 1733702400.1733788800
drwxr--r-- 32 root root 4096 Dec 10 11:45 1733788800.1733875200
drwxr--r-- 32 root root 4096 Dec 11 11:49 1733875200.1733961600
drwxr--r-- 32 root root 4096 Dec 12 12:54 1733961600.1734048000
drwxr--r-- 32 root root 4096 Dec 13 13:04 1734048000.1734134400
drwxr--r-- 32 root root 4096 Dec 16 11:53 1734307200.1734393600
drwxr--r-- 32 root root 4096 Dec 17 12:13 1734393600.1734480000
drwxr--r-- 32 root root 4096 Dec 18 12:16 1734480000.1734566400
drwxr--r-- 32 root root 4096 Dec 19 14:02 1734566400.1734652800
drwxr--r--  2 root root 4096 Jan  7 15:36 rollup

du -sk  ticktock-bkp2/2024
852976  ticktock-bkp2/2024

ls -ltr ticktock-bkp2/2025/01
total 32
drwxr--r--  2 root root 4096 Jan  7 15:36 rollup
drwxr--r-- 32 root root 4096 Jan  7 15:36 1736208000.1736294400
drwxr--r-- 32 root root 4096 Jan  8 00:01 1736294400.1736380800
drwxr--r-- 32 root root 4096 Jan  9 00:00 1736380800.1736467200
drwxr--r-- 32 root root 4096 Jan 10 00:00 1736467200.1736553600
drwxr--r-- 32 root root 4096 Jan 13 11:14 1736726400.1736812800
drwxr--r-- 32 root root 4096 Jan 14 10:28 1736812800.1736899200
drwxr--r-- 32 root root 4096 Jan 15 11:12 1736899200.1736985600

du -sk  ticktock-bkp2/2025
793928  ticktock-bkp2/2025

Here's tt.conf:

# Remove the leading semicolon to enable the config.

# TickTock home directory.
# If specified, data will be stored under <ticktock.home>/data;
# logs will be stored under <ticktock.home>/log;
; ticktock.home = /etc/ticktock
tsdb.data.dir = /mnt/dietpi_userdata/ticktock
append.log.dir = /var/log/ticktock

# The HTTP server port number;
; http.server.port = 6182

# The TCP server port number;
# The first one accepts data in OpenTSDB's telnet format;
# The second one accepts data in InfluxDB's line protocol format;
# If any one of these are not used, omit it like this:
# tcp.server.port = ,6180  // only use InfluxDB's format;
; tcp.server.port = 6181,6180

# This size needs to be big enough to hold the largest HTTP request.
; tcp.buffer.size = 512kb

# How often should we flush data to disk. Default is 5 minutes.
# Which means you will lose the last 5 minutes of data if the
# server is terminated abnormally. Increasing this frequency
# will have a negative impact on performance, severely if more
# than once a minute.
; tsdb.flush.frequency = 5min

# Resolution of timestamps on data points;
# Either millisecond or second;
# This config can't be changed on existing databases;
tsdb.timestamp.resolution = millisecond

# Supported log levels: TRACE, DEBUG, TCP, HTTP, INFO, WARN, ERROR, FATAL
log.level = WARN

# How often to flush append logs?
# Note that data that came after last flush may be lost forever.
append.log.flush.frequency = 10s

Logs are empty:

ls -ltr /var/log/ticktock
total 0

Just in case, here's Telegraf's config for TT:

[agent]
  interval = "200ms"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "5s"
  precision = ""
  hostname = ""
  omit_hostname = false
[[outputs.opentsdb]]
    prefix = "telegraf."
    host = "tcp://127.0.0.1"
    port = 6181

How can I debug this behavior?

Thanks!!

@ylin30
Collaborator

ylin30 commented Jan 15, 2025

Hi, I can't tell the cause immediately yet. But the load (7-10k metrics) and disk size (1.6 GB) should not be a problem for TT. I have been testing v0.20.* these days and it can handle an incoming rate of up to 3M time series at a 10s interval on a Raspberry Pi 4 for days. BTW, note the difference between metrics and time series: one metric may have many time series.

append.log.dir is for Write-Ahead Logging. Your log path should still be the default, <TT binary>/log. Please find the logs there and see if you can spot anything interesting in them.
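If it helps, something like the following should show them (the *.log pattern and the exact directory are assumptions; adjust <TT binary> to where your TickTock binary actually lives):

ls -ltr <TT binary>/log
grep -iE "warn|error" <TT binary>/log/*.log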

In my experience, 100% CPU is often caused by IO saturation, especially with sudden spikes. Please check:

  1. whether your disk space is exhausted (using du or df);
  2. whether IO util is at 100%. You can collect the OS metrics on your server with collectors.

Sometimes it may also be caused by memory saturation, but that is easy to figure out by just calling "free".
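For reference, a few commands that cover these checks (the mount point is taken from your tt.conf; iostat comes from the sysstat package):

du -sk /mnt/dietpi_userdata/ticktock
df -h /mnt/dietpi_userdata
iostat -x 1 5
free -m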

Another possibility is the rollup process, which may consume IO and CPU. But we have been testing it on our side and its overhead is completely acceptable in our tests.

I have some questions to help with debugging:

  1. What particular version of TT is in use?
  2. What hardware and OS?

Thanks,
Yi

@ylin30
Collaborator

ylin30 commented Jan 15, 2025

Hi Bruno, installing collectors to measure IO util might be too annoying for you. You can just run top. If wa is high, that means the CPU is busy waiting on IO.
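To capture that non-interactively, something like this prints the CPU summary line where the wa figure appears:

top -b -n 1 | head -n 5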


@ytyou
Owner

ytyou commented Jan 16, 2025

Hi, it seems to me that the TickTockDB config of "append.log.flush.frequency = 10s" is too aggressive. Please try to increase it to at least 1 minute (default is 5 minutes). Let us know if that helps. Thank you.
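For reference, the corresponding edit in tt.conf would look something like this (1min follows the same duration syntax as the other entries in the file; per the comment above, the default is 5 minutes):

# flush append logs less aggressively than every 10s
append.log.flush.frequency = 1min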

@ytyou
Owner

ytyou commented Jan 16, 2025

Hi, can you also send us the last line of the ts file under TickTockDB's data directory? Thank you.
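In case it's useful, a rough way to locate that file and print its last line (I'm guessing the file name contains "ts"; adjust the pattern if it's actually named differently):

find /mnt/dietpi_userdata/ticktock -maxdepth 2 -type f -name "*ts*"
tail -n 1 <path printed by find>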

@ylin30
Collaborator

ylin30 commented Jan 19, 2025

Hi @bgondell, any updates on this issue? Don't hesitate to shoot us any questions you might have.
