108-cpu-ja.tex

%\chapter{CPU and Scheduler Hogs}
\chapter{CPUとスケジューラの大量消費}
\label{chap:cpu-hogs}

% While memory leaks tend to absolutely kill your system, CPU exhaustion tends to act like a bottleneck and limits the maximal work you can get out of a node. Erlang developers will have a tendency to scale horizontally when they face such issues. It is often an easy enough job to scale out the more basic pieces of code out there. Only centralized global state (process registries, ETS tables, and so on) usually need to be modified.\footnote{Usually this takes the form of sharding or finding a state-replication scheme that's suitable, and little more. It's still a decent piece of work, but nothing compared to finding out most of your program's semantics aren't applicable to distributed systems given Erlang usually forces your hand there in the first place.} Still, if you want to optimize locally before scaling out at first, you need to be able to find your CPU and scheduler hogs.
メモリリークはシステムを完全に停止させる傾向にありますが、CPUの枯渇はボトルネックのように振る舞う傾向があり、ノードの最大性能を制限します。
Erlang開発者は、そういった問題に直面したとき水平スケールしようとするでしょう。
基本的なコードが多ければ多いほど、スケールアウトが簡単なことが多いからです。
中央集権的なグローバルの状態（プロセスレジストリ、ETSテーブルなど）だけは、システムの分散化に通常はコードの修正が必要です。
\footnote{通常、これはシャーディングをするか、状態をレプリケーションする適切な方法を見つける形をとります。依然として大変な問題ではありますが、あなたのプログラムのセマンティクスの大部分を理解することに比べれば、分散システムへの適用は大した問題ではありません。何故なら、最初にErlangは分散システムを強要するからです。}
それでも、スケールアウトする前にローカルで最適化したい場合は、CPUとスケジューラの大量消費を見つけ出す必要があります。

% It is generally difficult to properly analyze the CPU usage of an Erlang node to pin problems to a specific piece of code. With everything concurrent and in a virtual machine, there is no guarantee you will find out if a specific process, driver, your own Erlang code, NIFs you may have installed, or some third-party library is eating up all your processing power.
ErlangノードのCPU使用率を適切に分析し、問題となる特定のコードを見つけ出すのは難しいとされています。
あらゆるものが並行して仮想マシンの中にあるため、
処理能力を食い尽くしているのが、ある特定のプロセス、ドライバー、自前のErlangコード、組み込まれたNIF、あるいはサードパーティのライブラリーなのかを
見つけ出せる保証はありません。

% The existing approaches are often limited to profiling and reduction-counting if it's in your code, and to monitoring the scheduler's work if it might be anywhere else (but also your code).
既存のアプローチは大体限られており、コード内にある場合はプロファイリングとリダクションカウント\footnote{訳注:プロセスごとにリダクションというカウンタがあり、このカウンターは通常一つの各関数呼び出しでインクリメントされます。このリダクション回数が最大値に達した時、コンテキストスイッチが発生します。\href{http://erlang.org/doc/man/erlang.html\#bump\_reductions-1}{http://erlang.org/doc/man/erlang.html\#bump\_reductions-1}}を確認する二つのアプローチ、そして、どこか他の場所（コード内も含む）にある場合はスケジューラの動作をモニタリングするアプローチがあります。

%\section{Profiling and Reduction Counts}
\section{プロファイリングとリダクションカウント}
\label{sec:cpu-profiling}

% To pin issues to specific pieces of Erlang code, as mentioned earlier, there are two main approaches. One will be to do the old standard profiling routine, likely using one of the following applications:\footnote{All of these profilers work using Erlang tracing functionality with almost no restraint. They will have an impact on the run-time performance of the application, and shouldn't be used in production.}
前述したとおり、Erlangコード内の問題箇所を特定するアプローチは主に二つあります。一つは、次に示すアプリケーションを利用するような、古い標準的なプロファイリングの方法です。\footnote{これらのプロファイラは殆ど制限なくErlangのトレース機能を使って動作します。そのため、ランタイム上のアプリケーションのパフォーマンスに影響します。本番環境では使わないほうが良いでしょう。}

\begin{itemize*}
%	\item \otpapp{eprof},\footnote{\href{http://www.erlang.org/doc/man/eprof.html}{http://www.erlang.org/doc/man/eprof.html}} the oldest Erlang profiler around. It will give general percentage values and will mostly report in terms of time taken.
 \item \otpapp{eprof}、\footnote{\href{http://www.erlang.org/doc/man/eprof.html}{http://www.erlang.org/doc/man/eprof.html}}一番古いErlangのプロファイラです。おおよそのパーセンテージ値とその所要時間を結果として出します。
%	\item \otpapp{fprof},\footnote{\href{http://www.erlang.org/doc/man/fprof.html}{http://www.erlang.org/doc/man/fprof.html}} a more powerful replacement of eprof. It will support full concurrency and generate in-depth reports. In fact, the reports are so deep that they are usually considered opaque and hard to read.
	\item \otpapp{fprof}、\footnote{\href{http://www.erlang.org/doc/man/fprof.html}{http://www.erlang.org/doc/man/fprof.html}}
%a more powerful replacement of eprof. It will support full concurrency and generate in-depth reports. In fact, the reports are so deep that they are usually considered opaque and hard to read.
eprofのよりも強力なプロファイラです。完全な並行性をサポートしており、詳細なレポートを出してくれます。レポート内容が深すぎるため、一般的に不透明で読みにくいとされています。

%	\item \otpapp{eflame},\footnote{\href{https://github.com/proger/eflame}{https://github.com/proger/eflame}} the newest kid on the block. It generates flame graphs to show deep call sequences and hot-spots in usage on a given piece of code. It allows one to quickly find issues with a single look at the final result.
  \item \otpapp{eflame}、\footnote{\href{https://github.com/proger/eflame}{https://github.com/proger/eflame}}
新しいプロファイラの一つです。対象のコードの深い呼び出しシーケンスやホットスポットを可視化するflame graphsを生成します。
最終結果を見るだけで問題をすぐに見つけることが出来ます。

\end{itemize*}

%It will be left to the reader to thoroughly read each of these application's documentation. The other approach will be to run \function{recon:proc\_window/3} as introduced in Subsection \ref{subsec:digging-procs}:
各アプリケーションのドキュメントの熟読は読者にお任せします。
もう一つのアプローチは、\ref{subsec:digging-procs}で紹介した\function{recon:proc\_window/3}を実行する方法です。

\begin{VerbatimEshell}
1> recon:proc_window(reductions, 3, 500).
[{<0.46.0>,51728,
  [{current_function,{queue,in,2}},
   {initial_call,{erlang,apply,2}}]},
 {<0.49.0>,5728,
  [{current_function,{dict,new,0}},
   {initial_call,{erlang,apply,2}}]},
 {<0.43.0>,650,
  [{current_function,{timer,sleep,1}},
   {initial_call,{erlang,apply,2}}]}]
\end{VerbatimEshell}

%The reduction count has a direct link to function calls in Erlang, and a high count is usually the synonym of a high amount of CPU usage.
リダクションカウントはErlangの関数呼び出しに直結していて、カウントが高いことはCPU使用量が高いこととほぼ同義です。

%What's interesting with this function is to try it while a system is already rather busy,\footnote{See Subsection \ref{subsec:global-cpu}} with a relatively short interval. Repeat it many times, and you should hopefully see a pattern emerge where the same processes (or the same \emph{kind} of processes) tend to always come up on top.
この関数の興味深いところは、既にシステムがビジー状態のとき、\footnote{\ref{subsec:global-cpu}を参照してください}比較的短い間隔で試してみることです。繰り返し何度も実行して、上手くいけば、同じプロセス（あるいは同じ\emph{種類}のプロセス）が常に上位に現れるパターンを確認できるはずです。

%Using the code locations\footnote{Call \expression{recon:info(PidTerm, location)} or \expression{process\_info(Pid, current\_stacktrace)} to get this information.} and current functions being run, you should be able to identify what kind of code hogs all your schedulers.
実行中のコードの位置\footnote{コードの位置は\expression{recon:info(PidTerm, location)}または\expression{process\_info(Pid, current\_stacktrace)}で取得してください。}や現在の関数を使って、どの種類のコードが全スケジューラのリソースを独占しているか特定できるはずです。

%\section{System Monitors}
\section{システムモニター}
\label{sec:cpu-system-monitors}

%If nothing seems to stand out through either profiling or checking reduction counts, it's possible some of your work ends up being done by NIFs, garbage collections, and so on. These kinds of work may not always increment their reductions count correctly, so they won't show up with the previous methods, only through long run times.
もしプロファイリングやリダクションカウントの調査から何も目立つものが見つからない場合、NIFやガベージコレクションなどに行き着いている可能性があります。それらは常にリダクションカウントを正確に増加させるとは限らないため、これまでのやり方では表れず、実行時間の長さとしてのみ表れます。

%To find about such cases, the best way around is to use \function{erlang:system\_monitor/2}, and look for \term{long\_gc} and \term{long\_schedule}. The former will show whenever garbage collection ends up doing a lot of work (it takes time!), and the latter will likely catch issues with busy processes, either through NIFs or some other means, that end up making them hard to de-schedule.\footnote{Long garbage collections count towards scheduling time. It is very possible that a lot of your long schedules will be tied to garbage collections depending on your system.}
そういったケースを見つけるために、一番の方法は \function{erlang:system\_monitor/2} を使い、 \term{long\_gc} と \term{long\_schedule} を探すことです。前者はガベージコレクションが沢山の作業（時間がかかります！）を終えるたびに表示され、そして後者はNIFや他の理由のせいでスケジューラをなかなか手放さないビジープロセスの可能性にあるものを捕捉しやすくします。\footnote{長いガベージコレクションはスケジュール時間に反映されます。システムによってはガベージコレクションが沢山の長いスケジュールに結びついている可能性は非常に高いです。}

%We've seen how to set such a system monitor In Garbage Collection in \ref{subsubsec:leak-gc}, but here's a different pattern\footnote{If you're on 17.0 or newer versions, the shell functions can be made recursive far more simply by using their named form, but to have the widest compatibility possible with older versions of Erlang, I've let them as is.} I've used before to catch long-running items:
\ref{subsubsec:leak-gc} のガベージコレクションではシステムモニターをどのように設置するかを見てきましたが、私が以前長時間稼動しているアイテムを捉えるのに使った別のパターン\footnote{もし17.0以降の新しいバージョンなら、この関数は名前つきフォームで再帰することで簡潔にできますが、それより古いバージョンのErlangでも互換性を保てるようそのままにしておきます}があります

\begin{VerbatimEshell}
1> F = fun(F) ->
    receive
        {monitor, Pid, long_schedule, Info} ->
            io:format("monitor=long_schedule pid=~p info=~p~n", [Pid, Info]);
        {monitor, Pid, long_gc, Info} -> 
            io:format("monitor=long_gc pid=~p info=~p~n", [Pid, Info])
    end,
    F(F)
end.
2> Setup = fun(Delay) -> fun() -> 
     register(temp_sys_monitor, self()),
     erlang:system_monitor(self(), [{long_schedule, Delay}, {long_gc, Delay}]),
     F(F)
end end.
3> spawn_link(Setup(1000)).
<0.1293.0>
monitor=long_schedule pid=<0.54.0> info=[{timeout,1102},
                                         {in,{some_module,some_function,3}},
                                         {out,{some_module,some_function,3}}]
\end{VerbatimEshell}

%Be sure to set the \term{long\_schedule} and \term{long\_gc} values to large-ish values that might be reasonable to you. In this example, they're set to 1000 milliseconds. You can either kill the monitor by calling \expression{exit(whereis(temp\_sys\_monitor), kill)} (which will in turn kill the shell because it's linked), or just disconnect from the node (which will kill the process because it's linked to the shell.)
\term{long\_schedule}と\term{long\_gc}をそこそこ大きめな適切な値にしてください。この例では、1000ミリ秒にします。あなたは\expression{exit(whereis(temp\_sys\_monitor), kill)}を呼ぶ（リンクされているため次にシェルを殺すことになります）か、単にノードから切断する（リンクされているためプロセスを殺すことになります）だけでモニターを殺すことができます。

%This kind of code and monitoring can be moved to its own module where it reports to a long-term logging storage, and can be used as a canary for performance degradation or overload detection.
こういったコードとモニタリングは、長期間のログ保管に向いたストレージへレポートを送るモニタリング用のモジュールへと移せ、パフォーマンス劣化や過負荷の発見のためのカナリアとして使うことができます。

%\subsection{Suspended Ports}
\subsection{一時停止しているポート}
\label{subsec:port-system-monitors}

%An interesting part of system monitors that didn't fit anywhere but may have to do with scheduling is regarding ports. When a process sends too many message to a port and the port's internal queue gets full, the Erlang schedulers will forcibly de-schedule the sender until space is freed. This may end up surprising a few users who didn't expect that implicit back-pressure from the VM.
これまでのどの文にもあてはまらなかったけれどスケジューリングに関係のある、システムモニターのおもしろいところはポートに関するところです。あるプロセスが沢山のメッセージをポートに送りポートの内部キューが一杯になったとき、スペースに空きがでるまでErlangのスケジューラーは強制的に送り側のスケジュールを解除します。これはVMから暗黙のバックプレッシャーを予想していなかったいくばくかのユーザーを驚かせることになるかもしれません。

%This kind of event can be monitored by passing in the atom \term{busy\_port} to the system monitor. Specifically for clustered nodes, the atom \term{busy\_dist\_port} can be used to find when a local process gets de-scheduled when contacting a process on a remote node whose inter-node communication was handled by a busy port.
こういったイベントは \term{busy\_port} アトムをシステムモニターへ送ることでモニターできます。特にクラスタ化されたノードで、ノード間の通信がビジーポートによって処理されていた場合、リモートノード上のプロセスと通信しているローカルプロセスがスケジュール解除されるため、それを発見するのに \term{busy\_dist\_port} アトムが使われます。

%If you find out you're having problems with these, try replacing your sending functions where in critical paths with \function{erlang:port\_command(Port, Data, [nosuspend])} for ports, and \function{erlang:send(Pid, Msg, [nosuspend])} for messages to distributed processes. They will then tell you when the message could not be sent and you would therefore have been descheduled.
もしそういった問題を抱えていることを発見した場合、クリティカルパスにある送信の関数をポートに送る場合は \function{erlang:port\_command(Port, Data, [nosuspend])}、分散したプロセスの場合は \function{erlang:send(Pid, Msg, [nosuspend])} へと置き換えてみてください。メッセージを送れないことでスケジュール解除される場合すぐに知らせてくれます。

%\section{Exercises}
\section{演習}

%\subsection*{Review Questions}
\subsection*{復習問題}

\begin{enumerate}
	%\item What are the two main approaches to pin issues about CPU usages?
	\item CPU 利用率に関する問題を特定するための主なアプローチ２つとは何ですか？
	%\item Name some of the profiling tools available. What approaches are preferable for production use? Why?
	\item プロファイル用のツールの名前をいくつか挙げてください。本番環境で使用する場合、どの方法が好ましいですか？ またそれはなぜですか？
	%\item Why can long scheduling monitors be useful to find CPU or scheduler over-consumption?
  \item ロングスケジュールモニター\footnote{システムモニター(erlang:system\_monitor/2)で監視できる項目に long\_schedule があります。これは NIF や driver で bump reductions をせずにCPUガメるひとを見つけるための監視項目です。\href{http://erlang.org/doc/man/erlang.html\#system\_monitor-2}{http://erlang.org/doc/man/erlang.html\#system\_monitor-2}も参考になるかと思います}が、CPU やスケジューラの使いすぎを見つけるのに便利なのはなぜですか？

\end{enumerate}

%\subsection*{Open-ended Questions}
\subsection*{自由回答}

\begin{enumerate}
	%\item If you find that a process doing very little work with reductions ends up being scheduled for long periods of time, what can you guess about it or the code it runs?
  \item ほとんど仕事をしない(リダクションカウンタを増やさない)プロセスが長期間スケジューリングされているのを見つけた場合、そのプロセスもしくは実行しているコードについて何が考えられますか？

	%\item Can you set up a system monitor and then trigger it with regular Erlang code? Can you use it to find out for how long processes seem to be scheduled on average? You may need to manually start random processes from the shell that are more aggressive in their work than those provided by the existing system.
	\item あなたはシステムモニターを設定して、通常の Erlang コードでそれを起動できますか？ プロセスが平均しておよそどれぐらいの長さスケジューリングされているかを見つけるために、システムモニターを利用できますか？ 既存のシステムにすでにあるものよりもより適切にそれを行えるようなプロセスを、シェルから手動で手当たり次第に起動する必要があるかもしれません。

\end{enumerate}