
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="author" content="SternGerlach" />
  <title>FPGA Acceleration of Point Cloud Processing</title>
  <style>
    code{white-space: pre-wrap;}
    span.smallcaps{font-variant: small-caps;}
    div.columns{display: flex; gap: min(4vw, 1.5em);}
    div.column{flex: auto; overflow-x: auto;}
    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
    ul.task-list{list-style: none;}
    ul.task-list li input[type="checkbox"] {
      width: 0.8em;
      margin: 0 0.8em 0.2em -1.6em;
      vertical-align: middle;
    }
    pre > code.sourceCode { white-space: pre; position: relative; }
    pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
    pre > code.sourceCode > span:empty { height: 1.2em; }
    .sourceCode { overflow: visible; }
    code.sourceCode > span { color: inherit; text-decoration: inherit; }
    div.sourceCode { margin: 1em 0; }
    pre.sourceCode { margin: 0; }
    @media screen {
    div.sourceCode { overflow: auto; }
    }
    @media print {
    pre > code.sourceCode { white-space: pre-wrap; }
    pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
    }
    pre.numberSource code
      { counter-reset: source-line 0; }
    pre.numberSource code > span
      { position: relative; left: -4em; counter-increment: source-line; }
    pre.numberSource code > span > a:first-child::before
      { content: counter(source-line);
        position: relative; left: -1em; text-align: right; vertical-align: baseline;
        border: none; display: inline-block;
        -webkit-touch-callout: none; -webkit-user-select: none;
        -khtml-user-select: none; -moz-user-select: none;
        -ms-user-select: none; user-select: none;
        padding: 0 4px; width: 4em;
        color: #aaaaaa;
      }
    pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa;  padding-left: 4px; }
    div.sourceCode
      {   }
    @media screen {
    pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
    }
    code span.al { color: #ff0000; font-weight: bold; } /* Alert */
    code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
    code span.at { color: #7d9029; } /* Attribute */
    code span.bn { color: #40a070; } /* BaseN */
    code span.bu { color: #008000; } /* BuiltIn */
    code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
    code span.ch { color: #4070a0; } /* Char */
    code span.cn { color: #880000; } /* Constant */
    code span.co { color: #60a0b0; font-style: italic; } /* Comment */
    code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
    code span.do { color: #ba2121; font-style: italic; } /* Documentation */
    code span.dt { color: #902000; } /* DataType */
    code span.dv { color: #40a070; } /* DecVal */
    code span.er { color: #ff0000; font-weight: bold; } /* Error */
    code span.ex { } /* Extension */
    code span.fl { color: #40a070; } /* Float */
    code span.fu { color: #06287e; } /* Function */
    code span.im { color: #008000; font-weight: bold; } /* Import */
    code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
    code span.kw { color: #007020; font-weight: bold; } /* Keyword */
    code span.op { color: #666666; } /* Operator */
    code span.ot { color: #007020; } /* Other */
    code span.pp { color: #bc7a00; } /* Preprocessor */
    code span.sc { color: #4070a0; } /* SpecialChar */
    code span.ss { color: #bb6688; } /* SpecialString */
    code span.st { color: #4070a0; } /* String */
    code span.va { color: #19177c; } /* Variable */
    code span.vs { color: #4070a0; } /* VerbatimString */
    code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
  </style>
  <link rel="stylesheet" href="style.css" />
  <script
  src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js"
  type="text/javascript"></script>
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
</head>
<body>
<header id="title-block-header">
<h1 class="title">FPGA Acceleration of Point Cloud Processing</h1>
<p class="author">SternGerlach</p>
</header>
<!--
 pandoc -s -f markdown -t html5 --mathjax --css style.css point-cloud-classification.md -o point-cloud-classification.html
-->
<p><a href="./index.html">Back to home</a></p>
<h1 id="このページについて">About this page</h1>
<p>This page is the Day 22 entry of the <a
href="https://adventar.org/calendars/7773">Keio Science and Technology Advent Calendar 2022 (慶應理工アドベントカレンダー2022)</a>.
Last year's articles are <a
href="./scan-matching-branch-and-bound.html">here</a> and <a
href="./scan-matching-branch-and-bound-impl.html">here</a>.</p>
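<p>A quick digression to start: December 22, 1983 was the final day of the last domestic tour by Yellow Magic Orchestra (YMO),
and the venue was the Nippon Budokan.
Today marks exactly 39 years since that farewell (散開) tour.
Recordings from it are available on 「アフター・サーヴィス」, released on February 22, 1984, and on 「コンプリート・サーヴィス」, released on November 21, 1992, so please give them a listen.
Another digression: I usually collect CDs (at the expense of my research).
I like artists from the 1970s and 1980s.
Lately I have been listening almost exclusively to オフコース.
My collection of their early-pressing CDs is <a
href="./off-course-ca35-series.html">here</a>.
The rest of my collection is summarized <a href="./cds.html">here</a> and <a
href="./toshiba-emi.html">here</a>.
Have a look when you have some spare time.</p>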
<p>One more digression: the best albums I listened to this year.</p>
<ol type="1">
<li>チューリップ「Halo」(1983 / VICL-62399 / 2007 pressing)
<ul>
<li>Standout tracks:
🥇「丘に吹く風」🥈「愛を抱きしめて」🥉「輝く星」「想い出のランドスケープ」「コスモスの咲く郷」「星空の伝言」「セルリアン・ブルー」</li>
</ul></li>
<li>オフコース「この道をゆけば オフ・コース・ラウンド2」(1974 /
CA35-1033 / 1983 pressing)
<ul>
<li>Standout tracks:
🥇「はたちの頃」🥈「別れの情景(1)」🥉「首輪のない犬」「あの角をまがれば」「日曜日のたいくつ」</li>
</ul></li>
<li>オフコース「I Love You」(1982 / CA35-1002 / 1982 pressing)
<ul>
<li>Standout tracks:
🥇「哀しき街」🥈「決して彼等のようではなく」🥉「Yes-Yes-Yes」「愛のゆくえ」</li>
</ul></li>
<li>オフコース「ワインの匂い」(1975 / CA35-1032 / 1983 pressing)
<ul>
<li>Standout tracks:
🥇「幻想」🥈「老人のつぶやき」🥉「憂き世に」「雨よ激しく」「倖せなんて」「ワインの匂い」「眠れぬ夜」</li>
</ul></li>
<li>オフコース「Song Is Love」(1976 / CA35-1041 / 1983 pressing)
<ul>
<li>Standout tracks:
🥇「冬が来るまえに」🥈「青空と人生と」🥉「歌を捧げて」「青春」「ひとりで生きてゆければ」</li>
</ul></li>
<li>チューリップ「New Tune」(1985 / 35FD-1005 / 1985 pressing)
<ul>
<li>Standout tracks:
🥇「もっと幸せに素直になれたら」🥈「ロベリア」🥉「Our
Song」「ふたつめのクリスマス」「そんな男になれたら」</li>
</ul></li>
<li>大滝詠一「Each Time」(1984 / 35DH 78 / 1984 pressing)
<ul>
<li>Standout tracks: 🥇「Bachelor
Girl」🥈「ペパーミント・ブルー」🥉「魔法の瞳」「恋のナックルボール」</li>
</ul></li>
<li>麗美「“R”」(1984 / 35C31-7250 / 1984 pressing)
<ul>
<li>Standout tracks:
🥇「星のクライマー」🥈「風は明日へ」🥉「空が一面海に見えた日」「恋の一時間は孤独の千年」「青春のリグレット」「ポニーテイル」</li>
</ul></li>
<li>ハイ・ファイ・セット「Sweet Locomotion」(1986 / 32DH 393 /
1986 pressing)
<ul>
<li>Standout tracks:
🥇「ひときれの恋」🥈「たった一枚のフォトグラフ」🥉「Elevator Town」「Do
You Remember Me?」</li>
</ul></li>
<li>和久井映見「Flora」(1990 / PSCR-1006 / 1990 pressing)
<ul>
<li>Standout tracks:
🥇「マイ・ロンリィ・グッバイ・クラブ」🥈「偶然の旅人」🥉「夢で会いましょう」「神様がいない土曜日」</li>
</ul></li>
<li>鈴木康博「Sincerely」(1983 / CA35-1043 / 1983 pressing)
<ul>
<li>Standout tracks: 🥇「瑠璃色の夜明け」🥈「僕と海へ」🥉「ラララ
～愛の世界へ～」「入り江」「君の誕生日」</li>
</ul></li>
<li>岡田有希子「ヴィーナス誕生」(1986 / D32A0164 / 1986 pressing)
<ul>
<li>Standout tracks:
🥇「ヴィーナス誕生」🥈「銀河のバカンス」🥉「眠れぬ夜のAquarius」「Wonder
Trip Lover」「Spring Accident」</li>
</ul></li>
<li>尾崎亜美「Kids」(1986 / D32A0235 / 1986 pressing)
<ul>
<li>Standout tracks:
🥇「流れ星が好き」🥈「シャイネスボーイ」🥉「St.Valentine’s Day
Rhapsody」「Com’on Mamy」</li>
</ul></li>
<li>久保田早紀「夜の底は柔らかな幻」(1984 / DYCL-17 / 2005 pressing)
<ul>
<li>Standout tracks:
🥇「ピアニッシモで…」🥈「寒い絵葉書」🥉「月の浜辺ボタンがひとつ」「メランコリーのテーブルクロス」</li>
</ul></li>
<li>薬師丸ひろ子「花図鑑」(1986 / CA32-1260 / 1986 pressing)
<ul>
<li>Standout tracks:
🥇「紅い花、青い花」🥈「寒椿、咲いた」🥉「ローズ・ティーはいかが?」「哀しみの種」「透明なチューリップ」「麦わら帽子のアン」</li>
</ul></li>
</ol>
<p>Songs with great intros (a bonus).</p>
<ol type="1">
<li>チューリップ「Shooting Star」(1981)</li>
<li>井上鑑「Karsavina ～ニジンスキーの翼」(1983)</li>
<li>井上鑑「Running Fence -Ode A Christo」(1982)</li>
</ol>
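<p>This year, I try accelerating a neural network for point cloud processing (the point cloud classification task)
on an FPGA.
Accelerating image-processing networks such as LeNet or ResNet on an FPGA is also interesting, but many excellent articles already cover that, so I decided against it.
I also decided against writing about music, since nobody would follow it and it would not go over well.
I recommend reading this page on a computer.</p>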
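<h1 id="ニューラルネットの準備">Preparing the neural network</h1>
<p>PointNet, presented at CVPR in 2017, is a representative model that has been applied to a variety of point cloud tasks such as classification, segmentation, and registration.
PointNet is a simple yet powerful model built from MLPs and a max pooling layer.
The structure of PointNet for the classification task is shown below.</p>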
<p><a
href="point-cloud-classification-images/pointnet-layers.svg"><img src="point-cloud-classification-images/pointnet-layers.svg" width="100%" /></a></p>
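<p>The model consists of two parts: feature extraction from the point cloud, and classification based on the extracted feature
(Feature extraction and Classification in the figure).</p>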
<p>As shown at the left edge of the figure, the input is a 3D point cloud <span
class="math inline">\(\mathcal{P} = \left\{ \boldsymbol{p}_1, \ldots,
\boldsymbol{p}_N \right\} \in \mathbb{R}^{N \times
3}\)</span> containing <span
class="math inline">\(N\)</span> points. An MLP computes a 1024-dimensional local feature <span
class="math inline">\(\boldsymbol{\psi}_i \in
\mathbb{R}^{1024}\)</span> for each point <span
class="math inline">\(\boldsymbol{p}_i \in
\mathbb{R}^3\)</span>.
Once the local features <span
class="math inline">\(\boldsymbol{\Psi} = \left\{ \boldsymbol{\psi}_1,
\ldots, \boldsymbol{\psi}_N \right\} \in \mathbb{R}^{N \times
1024}\)</span> have been computed for all points, a max pooling layer aggregates them into a global feature <span
class="math inline">\(\boldsymbol{\phi} \in
\mathbb{R}^{1024}\)</span> that represents the whole point cloud (<span
class="math inline">\(\boldsymbol{\phi} \gets \max(\boldsymbol{\psi}_1,
\ldots, \boldsymbol{\psi}_N)\)</span>, where the maximum is taken element-wise).</p>
<p>The classification network takes this feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> as input and outputs a logit
(score) for each object class. If the number of object classes is <span
class="math inline">\(K\)</span>, the output is a <span
class="math inline">\(K\)</span>-dimensional vector.</p>
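<p>Input Transform and Feature
Transform in the figure are networks that apply an affine transformation to the point features in order to obtain features that are invariant to rigid transformations, but they are tedious to implement, so I remove them (<strong>Optimization 1:
simplifying the model</strong>).
The PointNet implemented on the FPGA in this article therefore looks as follows.</p>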
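<p>Unlike models for image recognition, there are no convolution layers in the usual sense: the PyTorch code below uses <code>Conv1d</code> with a kernel size of 1, which is equivalent to applying the same fully connected layer to each point independently.
In this article, an MLP refers to a fully connected layer, a batch normalization layer, and a ReLU activation layer combined, applied in that order.</p>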
<p><a
href="point-cloud-classification-images/pointnet-layers2.svg"><img src="point-cloud-classification-images/pointnet-layers2.svg" width="80%" /></a></p>
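<p>As a rough illustration of one such MLP block for a single point, the following is a minimal pure-Python sketch (fully connected layer, then inference-time batch normalization with fixed running statistics, then ReLU). The helper names and the toy 3-input, 2-output parameters are hypothetical and are not taken from the repository; the real first layer maps 3 inputs to 64 outputs.</p>

```python
import math


def linear(x, weight, bias):
    """Fully connected layer: y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(x, gamma, beta, mean, var)]


def relu(x):
    """Clamp negative values to zero."""
    return [max(v, 0.0) for v in x]


def mlp_block(x, params):
    """One block: fully connected, then batch normalization, then ReLU."""
    y = linear(x, params["weight"], params["bias"])
    y = batch_norm(y, params["gamma"], params["beta"],
                   params["mean"], params["var"])
    return relu(y)


# Hypothetical toy parameters (3 inputs, 2 outputs), for illustration only.
toy_params = {
    "weight": [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
    "bias": [0.0, -1.0],
    "gamma": [1.0, 1.0],
    "beta": [0.0, 0.0],
    "mean": [0.0, 0.0],
    "var": [1.0, 1.0],
}
point = [1.0, 2.0, 3.0]
features = mlp_block(point, toy_params)
print(features)
```

<p>The first output is negative before the ReLU and is clamped to zero, while the second stays close to 2. Because every block ends with a ReLU, the features it produces are never negative.</p>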
<p>The model is defined in PyTorch as follows
(<code>net/model.py</code>). The full source code is available in <a
href="https://github.com/sterngerlach/advent_2022_point_cloud_classification">this repository</a>, so please refer to it as needed.</p>
<div class="sourceCode" id="cb1"><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PointNetFeat(torch.nn.Module):</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>):</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>()</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv1 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">3</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv2 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv3 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv4 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">128</span>, <span class="dv">1</span>)</span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv5 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">128</span>, <span class="dv">1024</span>, <span class="dv">1</span>)</span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn1 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn2 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn3 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn4 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">128</span>)</span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn5 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">1024</span>)</span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> forward(<span class="va">self</span>, x: torch.Tensor):</span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, N, 3]</span></span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a>        N <span class="op">=</span> x.shape[<span class="dv">1</span>]</span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 3, N]</span></span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> x.transpose(<span class="dv">1</span>, <span class="dv">2</span>)</span>
<span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-22"><a href="#cb1-22" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 1024, N]</span></span>
<span id="cb1-23"><a href="#cb1-23" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn1(<span class="va">self</span>.conv1(x)))</span>
<span id="cb1-24"><a href="#cb1-24" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn2(<span class="va">self</span>.conv2(x)))</span>
<span id="cb1-25"><a href="#cb1-25" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn3(<span class="va">self</span>.conv3(x)))</span>
<span id="cb1-26"><a href="#cb1-26" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn4(<span class="va">self</span>.conv4(x)))</span>
<span id="cb1-27"><a href="#cb1-27" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn5(<span class="va">self</span>.conv5(x)))</span>
<span id="cb1-28"><a href="#cb1-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-29"><a href="#cb1-29" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 1024]</span></span>
<span id="cb1-30"><a href="#cb1-30" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> torch.<span class="bu">max</span>(x, dim<span class="op">=</span><span class="dv">2</span>)[<span class="dv">0</span>]</span>
<span id="cb1-31"><a href="#cb1-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-32"><a href="#cb1-32" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> x</span>
<span id="cb1-33"><a href="#cb1-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-34"><a href="#cb1-34" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PointNetCls(torch.nn.Module):</span>
<span id="cb1-35"><a href="#cb1-35" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, num_classes: <span class="bu">int</span>):</span>
<span id="cb1-36"><a href="#cb1-36" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>()</span>
<span id="cb1-37"><a href="#cb1-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-38"><a href="#cb1-38" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Feature extraction</span></span>
<span id="cb1-39"><a href="#cb1-39" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.feat <span class="op">=</span> PointNetFeat()</span>
<span id="cb1-40"><a href="#cb1-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-41"><a href="#cb1-41" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Classification network</span></span>
<span id="cb1-42"><a href="#cb1-42" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc1 <span class="op">=</span> torch.nn.Linear(<span class="dv">1024</span>, <span class="dv">512</span>)</span>
<span id="cb1-43"><a href="#cb1-43" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc2 <span class="op">=</span> torch.nn.Linear(<span class="dv">512</span>, <span class="dv">256</span>)</span>
<span id="cb1-44"><a href="#cb1-44" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc3 <span class="op">=</span> torch.nn.Linear(<span class="dv">256</span>, num_classes)</span>
<span id="cb1-45"><a href="#cb1-45" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn1 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">512</span>)</span>
<span id="cb1-46"><a href="#cb1-46" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn2 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">256</span>)</span>
<span id="cb1-47"><a href="#cb1-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-48"><a href="#cb1-48" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> forward(<span class="va">self</span>, x):</span>
<span id="cb1-49"><a href="#cb1-49" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, N, 3]</span></span>
<span id="cb1-50"><a href="#cb1-50" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 1024]</span></span>
<span id="cb1-51"><a href="#cb1-51" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> <span class="va">self</span>.feat(x)</span>
<span id="cb1-52"><a href="#cb1-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-53"><a href="#cb1-53" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, `num_classes`]</span></span>
<span id="cb1-54"><a href="#cb1-54" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn1(<span class="va">self</span>.fc1(x)))</span>
<span id="cb1-55"><a href="#cb1-55" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn2(<span class="va">self</span>.fc2(x)))</span>
<span id="cb1-56"><a href="#cb1-56" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> <span class="va">self</span>.fc3(x)</span>
<span id="cb1-57"><a href="#cb1-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-58"><a href="#cb1-58" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> x</span></code></pre></div>
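<p>To get a feel for the size of this model, the learnable parameters can be counted directly from the layer dimensions in the listing above. This is a back-of-the-envelope sketch in pure Python, assuming 40 output classes (ModelNet40, as in the C++ constants later in this article) and counting only weights, biases, and the batch normalization scale and shift; the helper names are hypothetical.</p>

```python
def conv1d_params(in_channels, out_channels, kernel_size=1):
    """Weights plus biases of a 1D convolution layer."""
    return out_channels * in_channels * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return out_features * in_features + out_features


def batch_norm_params(num_features):
    """Learnable scale and shift only; running statistics are not counted."""
    return 2 * num_features


FEAT_DIMS = [3, 64, 64, 64, 128, 1024]
CLS_DIMS = [1024, 512, 256, 40]  # 40 classes is an assumption (ModelNet40)

# Feature extraction: every convolution is followed by batch normalization.
feat_total = sum(conv1d_params(i, o) + batch_norm_params(o)
                 for i, o in zip(FEAT_DIMS, FEAT_DIMS[1:]))

# Classification: fc1 and fc2 are followed by batch normalization, fc3 is not.
cls_total = sum(linear_params(i, o) for i, o in zip(CLS_DIMS, CLS_DIMS[1:]))
cls_total += sum(batch_norm_params(o) for o in CLS_DIMS[1:-1])

total = feat_total + cls_total
print(feat_total, cls_total, total)  # 151680 667944 819624
```

<p>Under these assumptions the model has roughly 0.82 million learnable parameters, or about 3.1 MiB as 32-bit floats, and more than half of them belong to <code>fc1</code>.</p>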
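<p>Implementing this model as is raises the following problem.
Consider the feature extraction part (Feature extraction in the figure).
As the gray boxes in the figure show, the intermediate results for all <span
class="math inline">\(N\)</span> points, as well as the local features <span
class="math inline">\(\boldsymbol{\Psi}\)</span>, have to be held somewhere.
This is not a problem on a GPU with a large memory, but the on-chip memory
(BlockRAM)
inside an FPGA is very small, so trying to hold the intermediate results for all points would exhaust it quickly.
In other words, the number of points <span
class="math inline">\(N\)</span> would be limited by the capacity of the on-chip memory.
I want to avoid this.
The data could instead be placed in the larger DRAM, but accessing it takes longer.
Placing the intermediate results of every layer in DRAM increases the data transfer overhead and hurts performance.
The intermediate results of the layers should stay in on-chip buffers.</p>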
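<p>So, instead of computing the local features <span
class="math inline">\(\boldsymbol{\Psi}\)</span> for all points <span
class="math inline">\(\mathcal{P}\)</span> at once, let us compute the local feature <span
class="math inline">\(\boldsymbol{\psi}\)</span> for one point <span
class="math inline">\(\boldsymbol{p}\)</span> at a time.
This is less computationally efficient than processing all points at once, but only the intermediate results and the local feature for a single point need to be held, which greatly reduces on-chip buffer usage.</p>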
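<p>The difference in buffer size can be estimated with simple arithmetic. The sketch below assumes 32-bit floats, a hypothetical point cloud of 1024 points, and the pessimistic case in which the output of every feature extraction layer is kept at the same time; a real implementation may reuse buffers, so treat the numbers as an upper bound rather than a measurement.</p>

```python
# Output width of each feature extraction layer, per point.
FEAT_DIMS = [64, 64, 64, 128, 1024]
BYTES_PER_FLOAT = 4  # assumption: 32-bit floating point


def all_points_buffer_bytes(num_points):
    """Bytes needed when every layer output is kept for all points at once."""
    return num_points * sum(FEAT_DIMS) * BYTES_PER_FLOAT


def single_point_buffer_bytes():
    """Bytes needed when layer outputs are kept for one point only."""
    return sum(FEAT_DIMS) * BYTES_PER_FLOAT


num_points = 1024  # hypothetical example size
all_points = all_points_buffer_bytes(num_points)
single_point = single_point_buffer_bytes()
print(all_points, single_point)  # 5505024 5376
```

<p>In this setting the all-points buffers take about 5.25 MiB and grow linearly with the number of points, whereas the single-point buffers take about 5.25 KiB regardless of how many points the cloud contains.</p>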
<p>Originally
(that is, when using a framework such as PyTorch), feature extraction proceeds as follows.</p>
<ol type="1">
<li>Compute the local features <span
class="math inline">\(\boldsymbol{\Psi}\)</span> for all points <span
class="math inline">\(\mathcal{P}\)</span> at once
(this requires buffers of size <span class="math inline">\((N, 64)\)</span> and <span
class="math inline">\((N, 1024)\)</span>).</li>
<li>Aggregate the local features <span
class="math inline">\(\boldsymbol{\Psi}\)</span> with the max pooling layer to obtain the global feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> (<span
class="math inline">\(\boldsymbol{\phi} \gets \max(\boldsymbol{\psi}_1,
\ldots, \boldsymbol{\psi}_N)\)</span>).</li>
<li>Feed the global feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> into the MLP to obtain the logits for each class (a <span
class="math inline">\(K\)</span>-dimensional vector).</li>
</ol>
<p>I change this as follows (<strong>Optimization 2:
reordering the computation</strong>).</p>
<ol type="1">
<li>Initialize the global feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> to <span
class="math inline">\(\boldsymbol{0}\)</span>.</li>
<li>For each point <span class="math inline">\(\boldsymbol{p}_i \ (i = 1, \ldots,
N)\)</span>, do the following.
<ol type="1">
<li>Run the forward pass of the MLP to obtain the local feature <span
class="math inline">\(\boldsymbol{\psi}_i\)</span> (buffers of size <span
class="math inline">\((1, 64)\)</span> and <span class="math inline">\((1,
1024)\)</span> are sufficient).</li>
<li>Update <span class="math inline">\(\boldsymbol{\phi}\)</span> by taking the element-wise <span
class="math inline">\(\max\)</span> of <span
class="math inline">\(\boldsymbol{\phi}\)</span> and <span
class="math inline">\(\boldsymbol{\psi}_i\)</span> (<span
class="math inline">\(\boldsymbol{\phi} \gets \max(\boldsymbol{\phi},
\boldsymbol{\psi}_i)\)</span>).</li>
</ol></li>
<li>Feed the global feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> into the MLP to obtain the logits for each class (a <span
class="math inline">\(K\)</span>-dimensional vector).</li>
</ol>
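<p>Instead of aggregating the local features <span
class="math inline">\(\boldsymbol{\Psi}\)</span> of all points at once, the global feature <span
class="math inline">\(\boldsymbol{\phi}\)</span> is updated incrementally using the local feature <span
class="math inline">\(\boldsymbol{\psi}_i\)</span> of each point <span
class="math inline">\(\boldsymbol{p}_i\)</span>.
This is not an approximation, so the result is exactly the same. Initializing <span class="math inline">\(\boldsymbol{\phi}\)</span> to <span class="math inline">\(\boldsymbol{0}\)</span> is safe because the last feature extraction layer ends with a ReLU, so no element of <span class="math inline">\(\boldsymbol{\psi}_i\)</span> is negative.</p>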
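<p>The equivalence of the two computation orders can be checked with a small pure-Python sketch. The function names are hypothetical, and random non-negative vectors stand in for the MLP outputs (the clamp at zero mimics the final ReLU, which is what makes the zero initialization valid).</p>

```python
import random


def batch_max_pool(local_features):
    """Element-wise max over all N local features at once.

    Requires the complete N x D list of features to be available.
    """
    return [max(column) for column in zip(*local_features)]


def streaming_max_pool(local_feature_iter, dims):
    """Element-wise running max, consuming one local feature at a time.

    Only a single D-element buffer is kept. Starting from zero is valid
    only because the features are assumed to be non-negative.
    """
    phi = [0.0] * dims
    for psi in local_feature_iter:
        phi = [max(a, b) for a, b in zip(phi, psi)]
    return phi


random.seed(0)
num_points, dims = 100, 8
# Stand-in for the per-point MLP outputs, clamped at zero like a ReLU.
local_features = [[max(random.uniform(-1.0, 1.0), 0.0) for _ in range(dims)]
                  for _ in range(num_points)]

phi_batch = batch_max_pool(local_features)
phi_stream = streaming_max_pool(iter(local_features), dims)
print(phi_batch == phi_stream)  # True
```

<p>Because taking a maximum involves no rounding, the two results match exactly rather than approximately. If the features could be negative, the running maximum would have to start from negative infinity (or from the first feature) instead of zero.</p>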
<p>In the end, the PointNet implemented on the FPGA in this article looks as follows.</p>
<p><a
href="point-cloud-classification-images/pointnet-layers3.svg"><img src="point-cloud-classification-images/pointnet-layers3.svg" width="80%" /></a></p>
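<h1 id="高位合成による実装">Implementation with high-level synthesis</h1>
<p>In this article, I use high-level synthesis (HLS: High-Level
Synthesis) to describe a dedicated circuit
(<strong>IP core</strong>) for the PointNet shown above.
An alternative way to run neural network inference would be to implement a large, general-purpose compute circuit for matrix or convolution operations on the FPGA and feed data to it repeatedly.</p>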
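<p>HLS is a technology that converts a behavior-level
circuit description written in C/C++ into a register transfer level (RTL:
Register Transfer Level) circuit description in Verilog HDL or SystemVerilog.
Compared with writing Verilog HDL directly, it is far easier, less stressful, and more productive.
However, even though the description is written in C/C++, it looks nothing like ordinary software development.
<code>malloc()</code> and <code>new</code> are of course unavailable, and so are convenient data types that depend on them, such as <code>std::vector</code>, so they have to be replaced with fixed-length arrays.
Neural networks have a fixed size and generally behave in a fixed way, which makes them easy to implement on an FPGA.</p>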
<p>As the HLS tool, I use Xilinx Vitis HLS 2022.1.
The target FPGA is the Xilinx ZCU104 Evaluation Board
(XCZU7EV-2FFVC1156). Besides the FPGA, the Xilinx
ZCU104 carries a quad-core ARM Cortex-A53 CPU
(1.2GHz) and 2GB of DRAM, and it runs Linux.</p>
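<p>Without further ado, here is the PointNet IP core
(please refer to the GitHub repository as needed).
The backend of the HLS tool is GCC
6.2, so C++14 and some C++17 features are available.
However, they might trigger bugs in the tool, so I avoid using overly elaborate features.</p>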
<div class="sourceCode" id="cb2"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Size of the PointNet classification network</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Refer to net/model.py for details</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="co">// Size of the feature extraction network</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims0 <span class="op">=</span> <span class="dv">3</span><span class="op">;</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims1 <span class="op">=</span> <span class="dv">64</span><span class="op">;</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims2 <span class="op">=</span> <span class="dv">64</span><span class="op">;</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims3 <span class="op">=</span> <span class="dv">64</span><span class="op">;</span></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims4 <span class="op">=</span> <span class="dv">128</span><span class="op">;</span></span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kFeatDims5 <span class="op">=</span> <span class="dv">1024</span><span class="op">;</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a><span class="co">// Size of the classification network</span></span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a><span class="co">// ModelNet40 has 40 object classes</span></span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kClsDims0 <span class="op">=</span> kFeatDims5<span class="op">;</span></span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kClsDims1 <span class="op">=</span> <span class="dv">512</span><span class="op">;</span></span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kClsDims2 <span class="op">=</span> <span class="dv">256</span><span class="op">;</span></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a><span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> kClsDims3 <span class="op">=</span> <span class="dv">40</span><span class="op">;</span></span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a><span class="co">// Top function</span></span>
<span id="cb2-20"><a href="#cb2-20" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb2-21"><a href="#cb2-21" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb2-22"><a href="#cb2-22" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb2-23"><a href="#cb2-23" aria-hidden="true" tabindex="-1"></a>                    <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb2-24"><a href="#cb2-24" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params1<span class="op">,</span></span>
<span id="cb2-25"><a href="#cb2-25" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params2<span class="op">,</span></span>
<span id="cb2-26"><a href="#cb2-26" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params3<span class="op">,</span></span>
<span id="cb2-27"><a href="#cb2-27" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params4<span class="op">,</span></span>
<span id="cb2-28"><a href="#cb2-28" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params5<span class="op">,</span></span>
<span id="cb2-29"><a href="#cb2-29" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params1<span class="op">,</span></span>
<span id="cb2-30"><a href="#cb2-30" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params2<span class="op">,</span></span>
<span id="cb2-31"><a href="#cb2-31" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params3<span class="op">)</span></span>
<span id="cb2-32"><a href="#cb2-32" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb2-33"><a href="#cb2-33" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0</span></span>
<span id="cb2-34"><a href="#cb2-34" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0</span></span>
<span id="cb2-35"><a href="#cb2-35" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0</span></span>
<span id="cb2-36"><a href="#cb2-36" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params2 offset=slave bundle=gmem0</span></span>
<span id="cb2-37"><a href="#cb2-37" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params3 offset=slave bundle=gmem0</span></span>
<span id="cb2-38"><a href="#cb2-38" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params4 offset=slave bundle=gmem0</span></span>
<span id="cb2-39"><a href="#cb2-39" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params5 offset=slave bundle=gmem0</span></span>
<span id="cb2-40"><a href="#cb2-40" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params1 offset=slave bundle=gmem0</span></span>
<span id="cb2-41"><a href="#cb2-41" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params2 offset=slave bundle=gmem0</span></span>
<span id="cb2-42"><a href="#cb2-42" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0</span></span>
<span id="cb2-43"><a href="#cb2-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-44"><a href="#cb2-44" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=op_mode bundle=control</span></span>
<span id="cb2-45"><a href="#cb2-45" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control</span></span>
<span id="cb2-46"><a href="#cb2-46" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=num_points bundle=control</span></span>
<span id="cb2-47"><a href="#cb2-47" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=out_logits bundle=control</span></span>
<span id="cb2-48"><a href="#cb2-48" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control</span></span>
<span id="cb2-49"><a href="#cb2-49" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params2 bundle=control</span></span>
<span id="cb2-50"><a href="#cb2-50" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params3 bundle=control</span></span>
<span id="cb2-51"><a href="#cb2-51" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params4 bundle=control</span></span>
<span id="cb2-52"><a href="#cb2-52" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params5 bundle=control</span></span>
<span id="cb2-53"><a href="#cb2-53" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params1 bundle=control</span></span>
<span id="cb2-54"><a href="#cb2-54" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params2 bundle=control</span></span>
<span id="cb2-55"><a href="#cb2-55" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control</span></span>
<span id="cb2-56"><a href="#cb2-56" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=return bundle=control</span></span>
<span id="cb2-57"><a href="#cb2-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-58"><a href="#cb2-58" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Parameters for feature extraction</span></span>
<span id="cb2-59"><a href="#cb2-59" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;</span> feat_conv1<span class="op">;</span></span>
<span id="cb2-60"><a href="#cb2-60" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;</span> feat_conv2<span class="op">;</span></span>
<span id="cb2-61"><a href="#cb2-61" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;</span> feat_conv3<span class="op">;</span></span>
<span id="cb2-62"><a href="#cb2-62" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;</span> feat_conv4<span class="op">;</span></span>
<span id="cb2-63"><a href="#cb2-63" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;</span> feat_conv5<span class="op">;</span></span>
<span id="cb2-64"><a href="#cb2-64" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims1<span class="op">&gt;</span> feat_bn1<span class="op">;</span></span>
<span id="cb2-65"><a href="#cb2-65" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims2<span class="op">&gt;</span> feat_bn2<span class="op">;</span></span>
<span id="cb2-66"><a href="#cb2-66" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims3<span class="op">&gt;</span> feat_bn3<span class="op">;</span></span>
<span id="cb2-67"><a href="#cb2-67" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims4<span class="op">&gt;</span> feat_bn4<span class="op">;</span></span>
<span id="cb2-68"><a href="#cb2-68" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims5<span class="op">&gt;</span> feat_bn5<span class="op">;</span></span>
<span id="cb2-69"><a href="#cb2-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-70"><a href="#cb2-70" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Parameters for classification network</span></span>
<span id="cb2-71"><a href="#cb2-71" aria-hidden="true" tabindex="-1"></a>  <span class="co">// LinearParams&lt;param_t, kClsDims0, kClsDims1&gt; cls_fc1;</span></span>
<span id="cb2-72"><a href="#cb2-72" aria-hidden="true" tabindex="-1"></a>  <span class="co">// LinearParams&lt;param_t, kClsDims1, kClsDims2&gt; cls_fc2;</span></span>
<span id="cb2-73"><a href="#cb2-73" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;</span> cls_fc3<span class="op">;</span></span>
<span id="cb2-74"><a href="#cb2-74" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims1<span class="op">&gt;</span> cls_bn1<span class="op">;</span></span>
<span id="cb2-75"><a href="#cb2-75" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims2<span class="op">&gt;</span> cls_bn2<span class="op">;</span></span>
<span id="cb2-76"><a href="#cb2-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-77"><a href="#cb2-77" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Extracted feature</span></span>
<span id="cb2-78"><a href="#cb2-78" aria-hidden="true" tabindex="-1"></a>  <span class="dt">value_t</span> feature<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb2-79"><a href="#cb2-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-80"><a href="#cb2-80" aria-hidden="true" tabindex="-1"></a>  <span class="cf">if</span> <span class="op">(</span>op_mode <span class="op">==</span> kModeInitWeights<span class="op">)</span> <span class="op">{</span></span>
<span id="cb2-81"><a href="#cb2-81" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Initialize the PointNet feature extraction network</span></span>
<span id="cb2-82"><a href="#cb2-82" aria-hidden="true" tabindex="-1"></a>    InitializeFeatNaive<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb2-83"><a href="#cb2-83" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_conv1<span class="op">,</span> <span class="op">&amp;</span>feat_conv2<span class="op">,</span> <span class="op">&amp;</span>feat_conv3<span class="op">,</span> <span class="op">&amp;</span>feat_conv4<span class="op">,</span> <span class="op">&amp;</span>feat_conv5<span class="op">,</span></span>
<span id="cb2-84"><a href="#cb2-84" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_bn1<span class="op">,</span> <span class="op">&amp;</span>feat_bn2<span class="op">,</span> <span class="op">&amp;</span>feat_bn3<span class="op">,</span> <span class="op">&amp;</span>feat_bn4<span class="op">,</span> <span class="op">&amp;</span>feat_bn5<span class="op">,</span></span>
<span id="cb2-85"><a href="#cb2-85" aria-hidden="true" tabindex="-1"></a>      feat_params1<span class="op">,</span> feat_params2<span class="op">,</span> feat_params3<span class="op">,</span> feat_params4<span class="op">,</span> feat_params5<span class="op">);</span></span>
<span id="cb2-86"><a href="#cb2-86" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Initialize the classification network</span></span>
<span id="cb2-87"><a href="#cb2-87" aria-hidden="true" tabindex="-1"></a>    InitializeClsNaive<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb2-88"><a href="#cb2-88" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>cls_fc3<span class="op">,</span> <span class="op">&amp;</span>cls_bn1<span class="op">,</span> <span class="op">&amp;</span>cls_bn2<span class="op">,</span></span>
<span id="cb2-89"><a href="#cb2-89" aria-hidden="true" tabindex="-1"></a>      cls_params1<span class="op">,</span> cls_params2<span class="op">,</span> cls_params3<span class="op">);</span></span>
<span id="cb2-90"><a href="#cb2-90" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>op_mode <span class="op">==</span> kModeInference<span class="op">)</span> <span class="op">{</span></span>
<span id="cb2-91"><a href="#cb2-91" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Run the PointNet feature extraction</span></span>
<span id="cb2-92"><a href="#cb2-92" aria-hidden="true" tabindex="-1"></a>    InferenceFeatNaive<span class="op">&lt;</span><span class="dt">value_t</span><span class="op">,</span> <span class="dt">param_t</span><span class="op">,</span> <span class="dv">1024</span><span class="op">&gt;(</span></span>
<span id="cb2-93"><a href="#cb2-93" aria-hidden="true" tabindex="-1"></a>      point_cloud<span class="op">,</span> num_points<span class="op">,</span> feature<span class="op">,</span></span>
<span id="cb2-94"><a href="#cb2-94" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_conv1<span class="op">,</span> <span class="op">&amp;</span>feat_conv2<span class="op">,</span> <span class="op">&amp;</span>feat_conv3<span class="op">,</span> <span class="op">&amp;</span>feat_conv4<span class="op">,</span> <span class="op">&amp;</span>feat_conv5<span class="op">,</span></span>
<span id="cb2-95"><a href="#cb2-95" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_bn1<span class="op">,</span> <span class="op">&amp;</span>feat_bn2<span class="op">,</span> <span class="op">&amp;</span>feat_bn3<span class="op">,</span> <span class="op">&amp;</span>feat_bn4<span class="op">,</span> <span class="op">&amp;</span>feat_bn5<span class="op">);</span></span>
<span id="cb2-96"><a href="#cb2-96" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-97"><a href="#cb2-97" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Run the classification</span></span>
<span id="cb2-98"><a href="#cb2-98" aria-hidden="true" tabindex="-1"></a>    InferenceClsNaive<span class="op">&lt;</span><span class="dt">value_t</span><span class="op">,</span> <span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb2-99"><a href="#cb2-99" aria-hidden="true" tabindex="-1"></a>      feature<span class="op">,</span> out_logits<span class="op">,</span></span>
<span id="cb2-100"><a href="#cb2-100" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>cls_fc3<span class="op">,</span> <span class="op">&amp;</span>cls_bn1<span class="op">,</span> <span class="op">&amp;</span>cls_bn2<span class="op">,</span></span>
<span id="cb2-101"><a href="#cb2-101" aria-hidden="true" tabindex="-1"></a>      cls_params1<span class="op">,</span> cls_params2<span class="op">,</span> cls_params3<span class="op">);</span></span>
<span id="cb2-102"><a href="#cb2-102" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb2-103"><a href="#cb2-103" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>High-level synthesis of the code above produces the following IP core.</p>
<p><a
href="point-cloud-classification-images/pointnet-ip-core.svg"><img src="point-cloud-classification-images/pointnet-ip-core.svg" width="50%" /></a></p>
<p>Combining this IP core with other IP cores
(described later) yields the following block design.</p>
<p><a
href="point-cloud-classification-images/board-design.svg"><img src="point-cloud-classification-images/board-design.svg" width="100%" /></a></p>
<p>Running logic synthesis and place-and-route on this block design generates a bitstream,
which describes the circuit.
Loading the bitstream onto the FPGA makes the dedicated PointNet circuit available.</p>
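<h2 id="入出力ポート">Input and output ports</h2>
<p><code>PointNetClsTop</code> is the top-level function that represents the IP core;
it is called the top function.
The function arguments become the input and output ports of the IP core and are connected to other IP cores
(see the block design above). In HLS, a function itself becomes a circuit
(a module in Verilog HDL).
Recursive function calls are not supported.</p>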
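<p>The feature extraction network contains five MLPs, and the classification network contains three.
Their parameters are placed in DRAM buffers so that they can be manipulated from software.
The point cloud <span
class="math inline">\(\mathcal{P}\)</span> and the model output (logits) are likewise placed in DRAM buffers.</p>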
<p>The eight ports <code>feat_params1</code> through <code>feat_params5</code> and <code>cls_params1</code> through <code>cls_params3</code> let the IP core read the parameters stored in the DRAM buffers.
<code>point_cloud</code> is used to read the point cloud, and <code>out_logits</code> is used to write the logits.
<code>op_mode</code> and <code>num_points</code> are control registers that set the operation mode of the circuit and the number of points <span
class="math inline">\(N\)</span>, respectively.</p>
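<p>Lines beginning with <code>#pragma HLS</code> give the high-level synthesis tool hints for converting C/C++ into RTL
(the tool does not necessarily honor them).
Pipelining, dataflow optimization, and similar transformations cannot be expressed in C/C++ itself, but placing such <strong>HLS pragmas</strong> in the right places lets the tool apply these optimizations automatically.</p>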
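<p>Specifying <code>#pragma HLS INLINE off</code> prevents a function from being inlined
(it is always built as a separate module).
Large functions are not inlined automatically, but the pragma is added here just to be safe.
In a situation like the one below, it is probably better not to inline function <code>B</code>.
If <code>B</code> is inlined, as the <code>#pragma HLS INLINE</code> in this example requests, three copies of <code>B</code> are created inside function <code>A</code> even though they are never used at the same time, which wastes resources.
It is better to suppress the inlining of <code>B</code> so that a single instance of <code>B</code> is built and reused.</p>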
<div class="sourceCode" id="cb3"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> B<span class="op">(</span><span class="at">const</span> <span class="dt">float</span> x_in<span class="op">[</span><span class="dv">10</span><span class="op">],</span> <span class="dt">float</span> y_out<span class="op">[</span><span class="dv">10</span><span class="op">])</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Some processing</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> A<span class="op">(</span><span class="at">const</span> <span class="dt">float</span> x_in<span class="op">[</span><span class="dv">10</span><span class="op">],</span> <span class="dt">float</span> y_out<span class="op">[</span><span class="dv">10</span><span class="op">])</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a>  <span class="dt">float</span> x0<span class="op">[</span><span class="dv">10</span><span class="op">];</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a>  <span class="dt">float</span> x1<span class="op">[</span><span class="dv">10</span><span class="op">];</span></span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a>  B<span class="op">(</span>x_in<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a>  B<span class="op">(</span>x0<span class="op">,</span> x1<span class="op">);</span></span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a>  B<span class="op">(</span>x1<span class="op">,</span> y_out<span class="op">);</span></span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
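<p>For comparison, here is a minimal sketch of the same structure with inlining suppressed. The body of <code>B</code> (doubling each element) is a placeholder chosen only so that the example runs; it is not taken from the actual design. An ordinary C++ compiler ignores the HLS pragma, so the behavior can be checked in software before synthesis.</p>

```cpp
// Placeholder body (an assumption for illustration): doubles each element.
void B(const float x_in[10], float y_out[10])
{
// Ask the HLS tool to keep B as one separate module instead of inlining it
#pragma HLS INLINE off

  for (int i = 0; i < 10; ++i)
    y_out[i] = x_in[i] * 2.0f;
}

// A calls B three times in sequence, so a single instance of B can be reused.
void A(const float x_in[10], float y_out[10])
{
  float x0[10];
  float x1[10];
  B(x_in, x0);
  B(x0, x1);
  B(x1, y_out);
}
```

<p>With this placeholder body, <code>A</code> multiplies every input element by 8, regardless of whether the tool inlines <code>B</code>; the pragma only affects how the circuit is structured.</p>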
<p>The many <code>#pragma HLS INTERFACE m_axi</code> and <code>#pragma HLS INTERFACE s_axilite</code> lines stand out. Specifying both pragmas for an input or output port
(for example, <code>feat_params1</code>)
allows the IP core to read from and write to a DRAM buffer.
These accesses use a protocol called AXI, which is selected with <code>#pragma HLS INTERFACE m_axi</code>
(the IP core acts as the master).</p>
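<p>On the software side, each port is assigned the physical address of a buffer, which binds the port to that buffer.
Each port therefore needs a control register that holds the physical address, and <code>#pragma HLS INTERFACE s_axilite</code> creates it
(from the IP core's point of view, this interface is a slave).
Registers are also created for <code>op_mode</code> and <code>num_points</code>.
The line with <code>port=return</code> creates the control registers of the IP core itself, which the CPU needs in order to start the IP core and to read its status
(idle or running).
Software reads and writes these registers through memory-mapped I/O using the AXI-Lite protocol.</p>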
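<p>The register accesses described above can be sketched in software roughly as follows. This is only an illustration: the helper names are made up, and the argument offsets (<code>0x10</code>, <code>0x18</code>, <code>0x24</code>) are hypothetical, since the real offsets are determined by the HLS tool and listed in its generated files. The control bits (start, done, idle) follow the usual layout of the block-level control register, which should also be confirmed against the generated files. A plain array stands in for the memory-mapped register region.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Bits of the control register created by port=return (usual layout).
constexpr std::uint32_t kBitStart = 1u << 0;
constexpr std::uint32_t kBitDone = 1u << 1;
constexpr std::uint32_t kBitIdle = 1u << 2;

// Byte offsets of the registers. The argument offsets are HYPOTHETICAL;
// the real ones are listed in the files generated by the HLS tool.
constexpr std::size_t kRegControl = 0x00;
constexpr std::size_t kRegOpMode = 0x10;
constexpr std::size_t kRegPointCloud = 0x18;
constexpr std::size_t kRegNumPoints = 0x24;

inline void WriteReg(volatile std::uint32_t* base, std::size_t offset,
                     std::uint32_t value)
{
  base[offset / sizeof(std::uint32_t)] = value;
}

inline std::uint32_t ReadReg(const volatile std::uint32_t* base,
                             std::size_t offset)
{
  return base[offset / sizeof(std::uint32_t)];
}

// Set the scalar arguments and the buffer address, then raise the start bit.
void StartIpCore(volatile std::uint32_t* base, std::uint32_t op_mode,
                 std::uint32_t point_cloud_addr, std::uint32_t num_points)
{
  WriteReg(base, kRegOpMode, op_mode);
  WriteReg(base, kRegPointCloud, point_cloud_addr);
  WriteReg(base, kRegNumPoints, num_points);
  WriteReg(base, kRegControl, kBitStart);
}

// True once the IP core reports completion.
bool IsDone(const volatile std::uint32_t* base)
{
  return (ReadReg(base, kRegControl) & kBitDone) != 0;
}

// True while the IP core is not running.
bool IsIdle(const volatile std::uint32_t* base)
{
  return (ReadReg(base, kRegControl) & kBitIdle) != 0;
}
```

<p>On real hardware, <code>base</code> would point to the memory-mapped register region of the IP core, and the remaining ports (<code>out_logits</code>, <code>feat_params1</code>, and so on) would be given their buffer addresses in the same way.</p>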
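<p>The parameters of each layer defined in the PyTorch model are read through the following ports
(all parameters read through a port are concatenated into a single one-dimensional array).</p>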
<ul>
<li><code>feat_params1</code>: parameters of <code>PointNetFeat::conv1</code> +
<code>PointNetFeat::bn1</code></li>
<li><code>feat_params2</code>: parameters of <code>PointNetFeat::conv2</code> +
<code>PointNetFeat::bn2</code></li>
<li><code>feat_params3</code>: parameters of <code>PointNetFeat::conv3</code> +
<code>PointNetFeat::bn3</code></li>
<li><code>feat_params4</code>: parameters of <code>PointNetFeat::conv4</code> +
<code>PointNetFeat::bn4</code></li>
<li><code>feat_params5</code>: parameters of <code>PointNetFeat::conv5</code> +
<code>PointNetFeat::bn5</code></li>
<li><code>cls_params1</code>: parameters of <code>PointNetCls::fc1</code> +
<code>PointNetCls::bn1</code></li>
<li><code>cls_params2</code>: parameters of <code>PointNetCls::fc2</code> +
<code>PointNetCls::bn2</code></li>
<li><code>cls_params3</code>:
parameters of <code>PointNetCls::fc3</code></li>
</ul>
<div class="sourceCode" id="cb4"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>                    <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params1<span class="op">,</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params2<span class="op">,</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params3<span class="op">,</span></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params4<span class="op">,</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params5<span class="op">,</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params1<span class="op">,</span></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params2<span class="op">,</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params3<span class="op">)</span></span>
<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ...</span></span>
<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h2 id="各層のパラメータと処理">Parameters and computation of each layer</h2>
<p>The parameters of <code>torch.nn.Conv1d</code> and <code>torch.nn.Linear</code> are the weights and the biases.
Although the layers are named <code>Conv1d</code>, their kernel size is 1, so they behave the same as <code>Linear</code>.
From here on, <code>Conv1d</code> and <code>Linear</code> are treated as the same.
Let the input and output dimensions be <span
class="math inline">\(\mathrm{InDims}\)</span> and <span
class="math inline">\(\mathrm{OutDims}\)</span>; the weight and the bias then have shapes <span
class="math inline">\((\mathrm{OutDims},
\mathrm{InDims})\)</span> and <span
class="math inline">\((\mathrm{OutDims})\)</span>. Given an input <span
class="math inline">\(\boldsymbol{x} \in
\mathbb{R}^{\mathrm{InDims}}\)</span>, a weight <span
class="math inline">\(\boldsymbol{W} \in \mathbb{R}^{\mathrm{OutDims}
\times \mathrm{InDims}}\)</span>, and a bias <span
class="math inline">\(\boldsymbol{b} \in
\mathbb{R}^{\mathrm{OutDims}}\)</span>, the output <span
class="math inline">\(\boldsymbol{y} \in
\mathbb{R}^{\mathrm{OutDims}}\)</span> is computed as follows. <span
class="math display">\[
  \boldsymbol{y} = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}
\]</span></p>
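<p>As a reference for the formula above, a fully-connected layer can be written as a plain double loop. This is a minimal sketch in ordinary <code>float</code> arithmetic without any HLS pragmas; the function name <code>LinearForward</code> is chosen for illustration and is not the implementation used in the IP core.</p>

```cpp
// y = W x + b for a fully-connected layer.
// weight has shape (OutDims, InDims) and bias has shape (OutDims).
template <int InDims, int OutDims>
void LinearForward(const float x[InDims],
                   float y[OutDims],
                   const float weight[OutDims][InDims],
                   const float bias[OutDims])
{
  for (int o = 0; o < OutDims; ++o) {
    // Start from the bias and accumulate one row of W times x
    float acc = bias[o];
    for (int i = 0; i < InDims; ++i)
      acc += weight[o][i] * x[i];
    y[o] = acc;
  }
}
```

<p>The template arguments have to be given explicitly, for example <code>LinearForward&lt;3, 2&gt;(x, y, w, b)</code>, because array parameters decay to pointers.</p>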
<p>The parameters of <code>torch.nn.BatchNorm1d</code> are the mean, standard deviation, weight, and bias.
Let the input and output dimension be <span
class="math inline">\(\mathrm{Dims}\)</span>; each of these four parameters has shape <span
class="math inline">\((\mathrm{Dims})\)</span>.
Given the mean, standard deviation, weight, and bias <span
class="math inline">\(\boldsymbol{\mu}, \boldsymbol{\sigma},
\boldsymbol{w}, \boldsymbol{b} \in
\mathbb{R}^{\mathrm{Dims}}\)</span>, the output <span
class="math inline">\(\boldsymbol{y} \in
\mathbb{R}^{\mathrm{Dims}}\)</span> for an input <span
class="math inline">\(\boldsymbol{x} \in
\mathbb{R}^{\mathrm{Dims}}\)</span> is computed as follows. <span
class="math display">\[
  y_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \varepsilon}} \cdot w_i +
b_i \quad (i = 1, \ldots, \mathrm{Dims})
\]</span> Here <span
class="math inline">\(\varepsilon\)</span> is a small positive value that prevents division by zero, and
<span class="math inline">\(x_i\)</span> is the <span
class="math inline">\(i\)</span>-th element of <span
class="math inline">\(\boldsymbol{x}\)</span> (likewise for the other vectors).
The formula shows that the factor <span class="math inline">\(w_i / \sqrt{\sigma_i^2 +
\varepsilon}\)</span> can be computed in advance. Compared with using both <span
class="math inline">\(\boldsymbol{w}\)</span> and <span
class="math inline">\(\boldsymbol{\sigma}\)</span> on the device, this removes the division and the square root from the circuit.
It also reduces on-chip buffer usage.
This may look like a minor detail, but it matters when implementing on an FPGA with tight resource constraints.
Batch normalization is therefore computed as follows. <span class="math display">\[
  y_i = \left( x_i - \mu_i \right) \cdot s_i + b_i \quad (i = 1, \ldots,
\mathrm{Dims})
\]</span> The factor <span
class="math inline">\(s_i = w_i / \sqrt{\sigma_i^2 + \varepsilon}\)</span> is called the <strong>scale</strong> here.
The parameters are thus reduced to three: the mean <span
class="math inline">\(\boldsymbol{\mu}\)</span>, the bias <span
class="math inline">\(\boldsymbol{b}\)</span>, and the scale <span
class="math inline">\(\boldsymbol{s} \in
\mathbb{R}^{\mathrm{Dims}}\)</span>. <span
class="math inline">\(\boldsymbol{s}\)</span> is computed in software when the model is initialized.</p>
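<p>The precomputation of the scale can be sketched as follows. This is an illustrative host-side helper written in C++ (the function name <code>ComputeBatchNormScale</code> is an assumption, and the actual software may well do this step in Python); it takes the variance <span class="math inline">\(\sigma_i^2\)</span> directly, since that is the quantity appearing under the square root.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fold the batch normalization weight and variance into one scale per channel:
//   s_i = w_i / sqrt(var_i + eps)
std::vector<float> ComputeBatchNormScale(const std::vector<float>& weight,
                                         const std::vector<float>& variance,
                                         float eps)
{
  assert(weight.size() == variance.size());

  std::vector<float> scale(weight.size());
  for (std::size_t i = 0; i < weight.size(); ++i)
    scale[i] = weight[i] / std::sqrt(variance[i] + eps);
  return scale;
}
```

<p>Only the resulting scale, together with the mean and the bias, then needs to be transferred to the device, so the circuit itself contains neither a divider nor a square root unit for this layer.</p>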
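<p>ReLU activation is computed after batch normalization.
Implementing the two layers together is more efficient than implementing each separately, so batch normalization and ReLU activation are merged as follows
(<strong>Optimization 3: simplifying the computation</strong>). <span
class="math display">\[
  y_i = \max \left( 0, \left( x_i - \mu_i \right) \cdot s_i + b_i
\right) \quad (i = 1, \ldots, \mathrm{Dims})
\]</span></p>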
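<p>The merged batch normalization and ReLU step can be sketched as a single loop. The function name <code>BatchNormReLU</code> and the use of plain <code>float</code> are assumptions made for illustration; the sketch only mirrors the formula above.</p>

```cpp
// Fused batch normalization and ReLU:
//   y_i = max(0, (x_i - mean_i) * scale_i + bias_i)
template <int Dims>
void BatchNormReLU(const float x[Dims],
                   float y[Dims],
                   const float scale[Dims],
                   const float bias[Dims],
                   const float mean[Dims])
{
  for (int i = 0; i < Dims; ++i) {
    const float v = (x[i] - mean[i]) * scale[i] + bias[i];
    // ReLU: clamp negative values to zero
    y[i] = v > 0.0f ? v : 0.0f;
  }
}
```

<p>Fusing the two steps avoids storing the intermediate batch normalization result before applying the activation.</p>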
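<p>Finally, as described earlier, the max pooling layer has been replaced with an element-wise <span
class="math inline">\(\max\)</span> between the local feature <span
class="math inline">\(\boldsymbol{\psi}_i \in
\mathbb{R}^{1024}\)</span> of each point and the current global feature <span
class="math inline">\(\boldsymbol{\phi} \in
\mathbb{R}^{1024}\)</span>.
The max pooling layer is therefore computed as follows, where <span
class="math inline">\(\psi_{i,j}\)</span> denotes the <span
class="math inline">\(j\)</span>-th element of <span
class="math inline">\(\boldsymbol{\psi}_i\)</span>. <span
class="math display">\[
  \phi_j \leftarrow \max \left( \phi_j, \psi_{i,j} \right) \quad (j = 1, \ldots,
1024)
\]</span></p>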
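<p>The running maximum can be sketched as follows. The function name <code>MaxPoolUpdate</code> is an assumption for illustration; the function is meant to be called once per point, after that point's local feature has been computed, so the local features of all points never have to be stored at once.</p>

```cpp
// Update the global feature with the local feature of one point:
//   global[j] = max(global[j], local[j])
template <int Dims>
void MaxPoolUpdate(const float local_feature[Dims],
                   float global_feature[Dims])
{
  for (int j = 0; j < Dims; ++j) {
    if (local_feature[j] > global_feature[j])
      global_feature[j] = local_feature[j];
  }
}
```

<p>Before the first point is processed, the global feature has to be initialized to a value that no local feature falls below (for example, the lowest representable value of the data type); which value the actual design uses is not shown here.</p>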
<p>In the source code, the <code>LinearParams&lt;T, InDims_, OutDims_&gt;</code> and <code>BatchNorm1dParams&lt;T, Dims_&gt;</code> structures bundle the parameters of the fully-connected layers
(<code>Conv1d</code> and <code>Linear</code>) and of the batch normalization layers
(<code>BatchNorm1d</code>), respectively.</p>
<div class="sourceCode" id="cb5"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parameters for fully-connected layers</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> InDims_<span class="op">,</span> <span class="dt">int</span> OutDims_<span class="op">&gt;</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> LinearParams</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">enum</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>  <span class="op">{</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>    InDims <span class="op">=</span> InDims_<span class="op">,</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>    OutDims <span class="op">=</span> OutDims_<span class="op">,</span></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>  <span class="op">};</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>  T weight<span class="op">[</span>OutDims<span class="op">][</span>InDims<span class="op">];</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a>  T bias<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a><span class="co">// Parameters for 1D batch normalization layers</span></span>
<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> Dims_<span class="op">&gt;</span></span>
<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> BatchNorm1dParams</span>
<span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-19"><a href="#cb5-19" aria-hidden="true" tabindex="-1"></a>  <span class="kw">enum</span></span>
<span id="cb5-20"><a href="#cb5-20" aria-hidden="true" tabindex="-1"></a>  <span class="op">{</span></span>
<span id="cb5-21"><a href="#cb5-21" aria-hidden="true" tabindex="-1"></a>    Dims <span class="op">=</span> Dims_<span class="op">,</span></span>
<span id="cb5-22"><a href="#cb5-22" aria-hidden="true" tabindex="-1"></a>  <span class="op">};</span></span>
<span id="cb5-23"><a href="#cb5-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-24"><a href="#cb5-24" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `scale` is the weight multiplied by the reciprocal of the square root</span></span>
<span id="cb5-25"><a href="#cb5-25" aria-hidden="true" tabindex="-1"></a>  <span class="co">// of (variance + epsilon), precomputed to reduce the computational cost</span></span>
<span id="cb5-26"><a href="#cb5-26" aria-hidden="true" tabindex="-1"></a>  T scale<span class="op">[</span>Dims<span class="op">];</span></span>
<span id="cb5-27"><a href="#cb5-27" aria-hidden="true" tabindex="-1"></a>  T bias<span class="op">[</span>Dims<span class="op">];</span></span>
<span id="cb5-28"><a href="#cb5-28" aria-hidden="true" tabindex="-1"></a>  T mean<span class="op">[</span>Dims<span class="op">];</span></span>
<span id="cb5-29"><a href="#cb5-29" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span></code></pre></div>
<p>Inside <code>PointNetClsTop</code>, the following parameters are declared, one for each layer of the model defined in PyTorch.</p>
<ul>
<li><code>feat_conv1</code>:
weight and bias of <code>PointNetFeat::conv1</code></li>
<li><code>feat_conv2</code>:
weight and bias of <code>PointNetFeat::conv2</code></li>
<li><code>feat_conv3</code>:
weight and bias of <code>PointNetFeat::conv3</code></li>
<li><code>feat_conv4</code>:
weight and bias of <code>PointNetFeat::conv4</code></li>
<li><code>feat_conv5</code>:
weight and bias of <code>PointNetFeat::conv5</code></li>
<li><code>feat_bn1</code>:
mean, bias, and scale of <code>PointNetFeat::bn1</code></li>
<li><code>feat_bn2</code>:
mean, bias, and scale of <code>PointNetFeat::bn2</code></li>
<li><code>feat_bn3</code>:
mean, bias, and scale of <code>PointNetFeat::bn3</code></li>
<li><code>feat_bn4</code>:
mean, bias, and scale of <code>PointNetFeat::bn4</code></li>
<li><code>feat_bn5</code>:
mean, bias, and scale of <code>PointNetFeat::bn5</code></li>
<li><code>cls_fc3</code>:
weight and bias of <code>PointNetCls::fc3</code></li>
<li><code>cls_bn1</code>:
mean, bias, and scale of <code>PointNetCls::bn1</code></li>
<li><code>cls_bn2</code>:
mean, bias, and scale of <code>PointNetCls::bn2</code></li>
</ul>
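<p>All parameters of the feature extraction network are placed in on-chip memory before inference starts.
In contrast, the parameters of the two fully-connected layers of the classification network
(<code>PointNetCls::fc1</code> and <code>PointNetCls::fc2</code>)
are not kept in on-chip memory,
because they are large and the on-chip memory would run out.
For these layers, the parameters are read from the DRAM buffers during inference.
In other words, the circuit repeatedly fetches part of the parameters from the DRAM buffer and computes part of the output.
Only a small on-chip buffer is then needed to hold the partial parameters.</p>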
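<p>The idea of fetching part of the parameters and computing part of the output can be sketched as follows. This is only an illustration under stated assumptions: the function name <code>LinearForwardTiled</code>, the tile size <code>TileRows</code>, and the row-major layout of the weights in the DRAM buffer are all hypothetical, and plain arrays stand in for the DRAM buffers.</p>

```cpp
// Fully-connected layer whose parameters stay in DRAM.
// Only TileRows rows of the weight matrix are held on-chip at a time.
// Assumed layout: dram_weight is row-major with shape (OutDims, InDims).
template <int InDims, int OutDims, int TileRows>
void LinearForwardTiled(const float x[InDims],
                        float y[OutDims],
                        const float* dram_weight,
                        const float* dram_bias)
{
  static_assert(OutDims % TileRows == 0,
                "OutDims must be a multiple of TileRows");

  // Small on-chip buffers for one tile of parameters
  float weight_buf[TileRows][InDims];
  float bias_buf[TileRows];

  for (int o0 = 0; o0 < OutDims; o0 += TileRows) {
    // Fetch the parameters for TileRows outputs from DRAM
    for (int r = 0; r < TileRows; ++r) {
      for (int i = 0; i < InDims; ++i)
        weight_buf[r][i] = dram_weight[(o0 + r) * InDims + i];
      bias_buf[r] = dram_bias[o0 + r];
    }

    // Compute the corresponding part of the output
    for (int r = 0; r < TileRows; ++r) {
      float acc = bias_buf[r];
      for (int i = 0; i < InDims; ++i)
        acc += weight_buf[r][i] * x[i];
      y[o0 + r] = acc;
    }
  }
}
```

<p>The on-chip buffer size is <code>TileRows * InDims</code> weights instead of <code>OutDims * InDims</code>, at the cost of reading the parameters from DRAM on every inference.</p>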
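<p>In the feature extraction network, features are extracted for all <span
class="math inline">\(N\)</span> points, so the forward pass runs <span
class="math inline">\(N\)</span> times.
Because this accounts for a large share of the inference time, shortening a single forward pass substantially shortens the overall inference time
(<strong>Amdahl's law</strong>).
The classification network, on the other hand, runs its forward pass only once and matters much less for the inference time.
Reading the parameters from the DRAM buffers during inference makes these layers slower than storing the parameters in on-chip memory beforehand, but the effect on the overall inference time is small.</p>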
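<p>As a rough illustration of Amdahl's law with made-up numbers (these are not measurements from this design): if a fraction <span
class="math inline">\(p\)</span> of the inference time is spent in the part being accelerated, and that part becomes <span
class="math inline">\(k\)</span> times faster, the overall speedup is <span
class="math display">\[
  \frac{1}{(1 - p) + p / k}
\]</span> With <span
class="math inline">\(p = 0.9\)</span> and <span
class="math inline">\(k = 10\)</span>, the overall speedup is <span
class="math inline">\(1 / (0.1 + 0.09) \approx 5.3\)</span>. Accelerating the remaining 10% by the same factor instead gives only <span
class="math inline">\(1 / (0.9 + 0.01) \approx 1.1\)</span>, which is why the effort is better spent on the part that dominates the run time.</p>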
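<h2 id="データ型">Data types</h2>
<p>Vitis
HLS provides <code>ap_fixed</code>, an arbitrary-precision <strong>fixed</strong>-point type.
The single-precision floating-point type <code>float</code> and the half-precision floating-point type <code>half</code> are also available.
Here, fixed-point numbers are used to keep resource consumption low.</p>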
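<p>In the default overflow mode (<code>ap_o_mode::AP_WRAP</code>),
a value wraps around when it overflows.
This is risky because a value can jump abruptly from the maximum to the minimum, so the mode is changed to saturation
(<code>ap_o_mode::AP_SAT</code>), in which the value stays at the maximum or the minimum.
The fixed-point type that uses saturation is defined as <code>ap_fixed_sat</code>.</p>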
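<p>The difference between the two overflow modes can be illustrated without the Vitis HLS headers. The sketch below mimics wrap-around and saturation on an 8-bit signed integer range (-128 to 127); the function names are made up for illustration, and the code only imitates the behavior rather than using <code>ap_fixed</code> itself.</p>

```cpp
// Wrap-around: keep only the low 8 bits and reinterpret them as signed.
int WrapTo8Bit(int v)
{
  v &= 0xFF;
  return v >= 128 ? v - 256 : v;
}

// Saturation: clamp to the representable range [-128, 127].
int SaturateTo8Bit(int v)
{
  if (v > 127) return 127;
  if (v < -128) return -128;
  return v;
}
```

<p>For example, 127 + 1 becomes -128 with wrap-around but stays at 127 with saturation. For values that fit in the range, both functions return the input unchanged.</p>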
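<p>Separate types are provided for layer inputs and outputs and for parameters
(<code>value_t</code> and <code>param_t</code>), so that their bit widths can differ.
The bit width of the parameters may be reducible to match their value range.
Bit-width reduction, quantization, and number formats are substantial research areas in their own right.</p>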
<div class="sourceCode" id="cb6"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Value types</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="dt">int</span> _AP_W<span class="op">,</span> <span class="dt">int</span> _AP_I<span class="op">&gt;</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="kw">using</span> ap_fixed_sat <span class="op">=</span> ap_fixed<span class="op">&lt;</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>  _AP_W<span class="op">,</span> _AP_I<span class="op">,</span> ap_q_mode<span class="op">::</span>AP_TRN<span class="op">,</span> ap_o_mode<span class="op">::</span>AP_SAT<span class="op">,</span> <span class="dv">0</span><span class="op">&gt;;</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a><span class="co">// Data type for values (layer inputs, outputs, and intermediate results)</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a><span class="kw">using</span> <span class="dt">value_t</span> <span class="op">=</span> ap_fixed_sat<span class="op">&lt;</span>kValueBitWidth<span class="op">,</span> kValueIntWidth<span class="op">&gt;;</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="co">// Data type for network parameters</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="kw">using</span> <span class="dt">param_t</span> <span class="op">=</span> ap_fixed_sat<span class="op">&lt;</span>kParamBitWidth<span class="op">,</span> kParamIntWidth<span class="op">&gt;;</span></span></code></pre></div>
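<p>To get a feel for how the bit widths affect the represented values, the following plain C++ sketch emulates converting a real number to a signed fixed-point value with <code>W</code> total bits and <code>I</code> integer bits (sign bit included), using truncation toward minus infinity and saturation. It is meant to mirror the behavior of the <code>AP_TRN</code> and <code>AP_SAT</code> modes used above, but it is only an emulation under that assumption; the name <code>QuantizeSketch</code> is hypothetical.</p>

```cpp
#include <cmath>

// Hypothetical emulation of a signed fixed-point format with W total bits and
// I integer bits (including the sign bit). Bits below the LSB are dropped by
// rounding toward minus infinity, and out-of-range values are saturated.
// Returns the real value that the fixed-point number would represent.
template <int W, int I>
double QuantizeSketch(double x)
{
  const double scale = std::ldexp(1.0, W - I);          // 2^(W - I)
  const double max_raw = std::ldexp(1.0, W - 1) - 1.0;  // largest raw integer
  const double min_raw = -std::ldexp(1.0, W - 1);       // smallest raw integer

  double raw = std::floor(x * scale);  // truncate toward minus infinity
  if (raw > max_raw) raw = max_raw;    // saturate on positive overflow
  if (raw < min_raw) raw = min_raw;    // saturate on negative overflow
  return raw / scale;
}
```

<p>For instance, with <code>W = 8</code> and <code>I = 4</code> the step size is 1/16, so <code>QuantizeSketch&lt;8, 4&gt;(1.3)</code> returns 1.25 and <code>QuantizeSketch&lt;8, 4&gt;(100.0)</code> saturates to 7.9375. Trying a few widths against the actual parameter range is one simple way to choose <code>kParamBitWidth</code> and <code>kParamIntWidth</code>.</p>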
<h2 id="動作モード">Operation modes</h2>
<p>The IP core presented here has two <strong>operation modes</strong>.</p>
<ul>
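<li>Weight initialization mode (<code>kModeInitWeights</code>):
reads the weights from the DRAM buffer and stores them in on-chip buffers.</li>
<li>Inference mode (<code>kModeInference</code>):
computes the logits for each class from an input point cloud.</li>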
</ul>
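<p>A top-level function could switch between the two modes roughly as sketched below. Only the constant names <code>kModeInitWeights</code> and <code>kModeInference</code> come from this article; their values, the <code>CoreStateSketch</code> struct, the <code>RunSketch</code> function, and the check that weights are loaded before inference are all assumptions made for illustration.</p>

```cpp
// Hypothetical values; the actual constants are defined in the source code
constexpr int kModeInitWeights = 0;
constexpr int kModeInference = 1;

// Hypothetical stand-in for the state kept inside the IP core
struct CoreStateSketch {
  bool weights_ready = false;  // set once the on-chip buffers are filled
  int num_inferences = 0;      // number of completed inference runs
};

// Dispatch on the operation mode; returns false if the request is rejected
bool RunSketch(CoreStateSketch& state, int op_mode)
{
  if (op_mode == kModeInitWeights) {
    // The real core would copy parameters from DRAM to on-chip buffers here
    state.weights_ready = true;
    return true;
  }
  if (op_mode == kModeInference) {
    // Assumed check: inference requires the weights to be loaded first
    if (!state.weights_ready) return false;
    ++state.num_inferences;
    return true;
  }
  return false;  // unknown mode
}
```

<p>The host would therefore run the core once in weight initialization mode, and then in inference mode as many times as needed.</p>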
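<p>We describe them in turn.</p>
<h3 id="重み初期化モード">Weight initialization mode</h3>
<p>This mode reads all parameters of the feature extraction network, and part of the parameters of the classification network, from the DRAM buffer and stores them in on-chip buffers.
It uses <code>InitializeFeatNaive</code> and <code>InitializeClsNaive</code>, shown below,
which handle the feature extraction network and the classification network, respectively.</p>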
<div class="sourceCode" id="cb7"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the parameter initialization</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InitializeFeatNaive<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> conv1<span class="op">,</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>                         LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> conv2<span class="op">,</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>                         LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> conv3<span class="op">,</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a>                         LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> conv4<span class="op">,</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a>                         LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> conv5<span class="op">,</span></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> bn3<span class="op">,</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> bn4<span class="op">,</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> bn5<span class="op">,</span></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params1<span class="op">,</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params2<span class="op">,</span></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params3<span class="op">,</span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params4<span class="op">,</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params5<span class="op">)</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;(</span>conv1<span class="op">,</span> bn1<span class="op">,</span> params1<span class="op">);</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;(</span>conv2<span class="op">,</span> bn2<span class="op">,</span> params2<span class="op">);</span></span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;(</span>conv3<span class="op">,</span> bn3<span class="op">,</span> params3<span class="op">);</span></span>
<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;(</span>conv4<span class="op">,</span> bn4<span class="op">,</span> params4<span class="op">);</span></span>
<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>conv5<span class="op">,</span> bn5<span class="op">,</span> params5<span class="op">);</span></span>
<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the parameter initialization</span></span>
<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InitializeClsNaive<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;*</span> fc3<span class="op">,</span></span>
<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb7-35"><a href="#cb7-35" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params1<span class="op">,</span></span>
<span id="cb7-36"><a href="#cb7-36" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params2<span class="op">,</span></span>
<span id="cb7-37"><a href="#cb7-37" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params3<span class="op">)</span></span>
<span id="cb7-38"><a href="#cb7-38" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb7-39"><a href="#cb7-39" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb7-40"><a href="#cb7-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-41"><a href="#cb7-41" aria-hidden="true" tabindex="-1"></a>  ReadBatchNorm1dParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kClsDims1<span class="op">&gt;(</span></span>
<span id="cb7-42"><a href="#cb7-42" aria-hidden="true" tabindex="-1"></a>    bn1<span class="op">,</span> params1<span class="op">,</span> kClsDims0 <span class="op">*</span> kClsDims1 <span class="op">+</span> kClsDims1<span class="op">);</span></span>
<span id="cb7-43"><a href="#cb7-43" aria-hidden="true" tabindex="-1"></a>  ReadBatchNorm1dParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">&gt;(</span></span>
<span id="cb7-44"><a href="#cb7-44" aria-hidden="true" tabindex="-1"></a>    bn2<span class="op">,</span> params2<span class="op">,</span> kClsDims1 <span class="op">*</span> kClsDims2 <span class="op">+</span> kClsDims2<span class="op">);</span></span>
<span id="cb7-45"><a href="#cb7-45" aria-hidden="true" tabindex="-1"></a>  ReadLinearParamsNaive<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;(</span></span>
<span id="cb7-46"><a href="#cb7-46" aria-hidden="true" tabindex="-1"></a>    fc3<span class="op">,</span> params3<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb7-47"><a href="#cb7-47" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>These functions call three helpers: <code>ReadBlockParamsNaive</code>, <code>ReadLinearParamsNaive</code>, and <code>ReadBatchNorm1dParamsNaive</code>.
Each helper works as follows (see the source code for details).
The parameters are stored as <code>float</code> in the DRAM buffer, so the helpers also convert them to the fixed-point type.</p>
<ul>
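<li><code>ReadLinearParamsNaive&lt;T, InDims, OutDims&gt;</code>:
Reads the weight and bias of a fully connected layer
(<code>Conv1d</code> or <code>Linear</code>)
from the DRAM buffer.
The weight has size <code>(OutDims, InDims)</code> and the bias has size <code>(OutDims)</code>.
The two parameters are assumed to be concatenated into a one-dimensional array
(of size <code>OutDims * InDims + OutDims</code>).</li>
<li><code>ReadBatchNorm1dParamsNaive&lt;T, Dims&gt;</code>:
Reads the scale, bias, and mean of a batch normalization layer (<code>BatchNorm1d</code>)
from the DRAM buffer.
Each parameter has size <code>(Dims)</code>.
The three parameters are assumed to be concatenated into a one-dimensional array
(of size <code>3 * Dims</code>).</li>
<li><code>ReadBlockParamsNaive&lt;T, InDims, OutDims&gt;</code>:
Reads the five parameters of a fully connected layer and the batch normalization layer that follows it from the DRAM buffer.
The five parameters are assumed to be concatenated into a one-dimensional array
(of size <code>OutDims * InDims + 4 * OutDims</code>).</li>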
</ul>
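<p>The flat array layout described in the list above can be sketched in plain C++ as follows. This is an illustration, not the actual helpers: the struct and function names ending in <code>Sketch</code> are hypothetical, <code>float</code> is used instead of the fixed-point type, and the row-major weight layout and the scale, bias, mean ordering inside the array are assumptions.</p>

```cpp
// Hypothetical plain-float stand-ins for the parameter containers
template <int InDims, int OutDims>
struct LinearParamsSketch {
  float weight[OutDims][InDims];
  float bias[OutDims];
};

template <int Dims>
struct BatchNormParamsSketch {
  float scale[Dims];
  float bias[Dims];
  float mean[Dims];
};

// Read the weight (OutDims x InDims) followed by the bias (OutDims),
// starting at `offset` in the flat array
template <int InDims, int OutDims>
void ReadLinearSketch(LinearParamsSketch<InDims, OutDims>* dst,
                      const float* params, int offset)
{
  for (int o = 0; o < OutDims; ++o)
    for (int i = 0; i < InDims; ++i)
      dst->weight[o][i] = params[offset + o * InDims + i];
  for (int o = 0; o < OutDims; ++o)
    dst->bias[o] = params[offset + OutDims * InDims + o];
}

// Read the scale, bias, and mean (Dims elements each, in this assumed order),
// starting at `offset` in the flat array
template <int Dims>
void ReadBatchNormSketch(BatchNormParamsSketch<Dims>* dst,
                         const float* params, int offset)
{
  for (int d = 0; d < Dims; ++d) {
    dst->scale[d] = params[offset + d];
    dst->bias[d] = params[offset + Dims + d];
    dst->mean[d] = params[offset + 2 * Dims + d];
  }
}
```

<p>Reading one block then amounts to calling <code>ReadLinearSketch</code> with offset 0 and <code>ReadBatchNormSketch</code> with offset <code>OutDims * InDims + OutDims</code>, which matches the offsets passed to <code>ReadBatchNorm1dParamsNaive</code> in <code>InitializeClsNaive</code>.</p>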
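<h3 id="推論モード">Inference mode</h3>
<p>This mode computes the logits for each class from an input point cloud.
It uses <code>InferenceFeatNaive</code> and <code>InferenceClsNaive</code>, shown below,
which implement the feature extraction network and the classification network, respectively.</p>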
<div class="sourceCode" id="cb8"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the PointNet feature extraction</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for layer input, output, and intermediate results</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co">// `U` is the type for parameters</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `N` is the expected number of input points (e.g., 1024)</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">,</span> <span class="dt">int</span> N<span class="op">&gt;</span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceFeatNaive<span class="op">(</span><span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>                        T feature<span class="op">[</span>kFeatDims5<span class="op">],</span></span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> conv1<span class="op">,</span></span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> conv2<span class="op">,</span></span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> conv3<span class="op">,</span></span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> conv4<span class="op">,</span></span>
<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> conv5<span class="op">,</span></span>
<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb8-16"><a href="#cb8-16" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> bn3<span class="op">,</span></span>
<span id="cb8-17"><a href="#cb8-17" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> bn4<span class="op">,</span></span>
<span id="cb8-18"><a href="#cb8-18" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> bn5<span class="op">)</span></span>
<span id="cb8-19"><a href="#cb8-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb8-20"><a href="#cb8-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb8-21"><a href="#cb8-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-22"><a href="#cb8-22" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Zero-initialize the output feature</span></span>
<span id="cb8-23"><a href="#cb8-23" aria-hidden="true" tabindex="-1"></a>  VectorNdSetZero<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>feature<span class="op">);</span></span>
<span id="cb8-24"><a href="#cb8-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-25"><a href="#cb8-25" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute the feature</span></span>
<span id="cb8-26"><a href="#cb8-26" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> num_points<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb8-27"><a href="#cb8-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N</span></span>
<span id="cb8-28"><a href="#cb8-28" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_FLATTEN off</span></span>
<span id="cb8-29"><a href="#cb8-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-30"><a href="#cb8-30" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Input, output, and intermediate results</span></span>
<span id="cb8-31"><a href="#cb8-31" aria-hidden="true" tabindex="-1"></a>    T x0<span class="op">[</span>kFeatDims0<span class="op">];</span></span>
<span id="cb8-32"><a href="#cb8-32" aria-hidden="true" tabindex="-1"></a>    T x1<span class="op">[</span>kFeatDims1<span class="op">];</span></span>
<span id="cb8-33"><a href="#cb8-33" aria-hidden="true" tabindex="-1"></a>    T x2<span class="op">[</span>kFeatDims1<span class="op">];</span></span>
<span id="cb8-34"><a href="#cb8-34" aria-hidden="true" tabindex="-1"></a>    T x3<span class="op">[</span>kFeatDims2<span class="op">];</span></span>
<span id="cb8-35"><a href="#cb8-35" aria-hidden="true" tabindex="-1"></a>    T x4<span class="op">[</span>kFeatDims2<span class="op">];</span></span>
<span id="cb8-36"><a href="#cb8-36" aria-hidden="true" tabindex="-1"></a>    T x5<span class="op">[</span>kFeatDims3<span class="op">];</span></span>
<span id="cb8-37"><a href="#cb8-37" aria-hidden="true" tabindex="-1"></a>    T x6<span class="op">[</span>kFeatDims3<span class="op">];</span></span>
<span id="cb8-38"><a href="#cb8-38" aria-hidden="true" tabindex="-1"></a>    T x7<span class="op">[</span>kFeatDims4<span class="op">];</span></span>
<span id="cb8-39"><a href="#cb8-39" aria-hidden="true" tabindex="-1"></a>    T x8<span class="op">[</span>kFeatDims4<span class="op">];</span></span>
<span id="cb8-40"><a href="#cb8-40" aria-hidden="true" tabindex="-1"></a>    T x9<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb8-41"><a href="#cb8-41" aria-hidden="true" tabindex="-1"></a>    T x10<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb8-42"><a href="#cb8-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-43"><a href="#cb8-43" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Read a point from a DDR memory</span></span>
<span id="cb8-44"><a href="#cb8-44" aria-hidden="true" tabindex="-1"></a>    ReadPointNaive<span class="op">&lt;</span>T<span class="op">&gt;(</span>point_cloud<span class="op">,</span> i<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb8-45"><a href="#cb8-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-46"><a href="#cb8-46" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Compute a point feature</span></span>
<span id="cb8-47"><a href="#cb8-47" aria-hidden="true" tabindex="-1"></a>    LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-48"><a href="#cb8-48" aria-hidden="true" tabindex="-1"></a>      x0<span class="op">,</span> x1<span class="op">,</span> conv1<span class="op">-&gt;</span>weight<span class="op">,</span> conv1<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-49"><a href="#cb8-49" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">&gt;(</span></span>
<span id="cb8-50"><a href="#cb8-50" aria-hidden="true" tabindex="-1"></a>      x1<span class="op">,</span> x2<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-51"><a href="#cb8-51" aria-hidden="true" tabindex="-1"></a>    LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-52"><a href="#cb8-52" aria-hidden="true" tabindex="-1"></a>      x2<span class="op">,</span> x3<span class="op">,</span> conv2<span class="op">-&gt;</span>weight<span class="op">,</span> conv2<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-53"><a href="#cb8-53" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">&gt;(</span></span>
<span id="cb8-54"><a href="#cb8-54" aria-hidden="true" tabindex="-1"></a>      x3<span class="op">,</span> x4<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-55"><a href="#cb8-55" aria-hidden="true" tabindex="-1"></a>    LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-56"><a href="#cb8-56" aria-hidden="true" tabindex="-1"></a>      x4<span class="op">,</span> x5<span class="op">,</span> conv3<span class="op">-&gt;</span>weight<span class="op">,</span> conv3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-57"><a href="#cb8-57" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">&gt;(</span></span>
<span id="cb8-58"><a href="#cb8-58" aria-hidden="true" tabindex="-1"></a>      x5<span class="op">,</span> x6<span class="op">,</span> bn3<span class="op">-&gt;</span>scale<span class="op">,</span> bn3<span class="op">-&gt;</span>bias<span class="op">,</span> bn3<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-59"><a href="#cb8-59" aria-hidden="true" tabindex="-1"></a>    LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-60"><a href="#cb8-60" aria-hidden="true" tabindex="-1"></a>      x6<span class="op">,</span> x7<span class="op">,</span> conv4<span class="op">-&gt;</span>weight<span class="op">,</span> conv4<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-61"><a href="#cb8-61" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">&gt;(</span></span>
<span id="cb8-62"><a href="#cb8-62" aria-hidden="true" tabindex="-1"></a>      x7<span class="op">,</span> x8<span class="op">,</span> bn4<span class="op">-&gt;</span>scale<span class="op">,</span> bn4<span class="op">-&gt;</span>bias<span class="op">,</span> bn4<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-63"><a href="#cb8-63" aria-hidden="true" tabindex="-1"></a>    LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-64"><a href="#cb8-64" aria-hidden="true" tabindex="-1"></a>      x8<span class="op">,</span> x9<span class="op">,</span> conv5<span class="op">-&gt;</span>weight<span class="op">,</span> conv5<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-65"><a href="#cb8-65" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span></span>
<span id="cb8-66"><a href="#cb8-66" aria-hidden="true" tabindex="-1"></a>      x9<span class="op">,</span> x10<span class="op">,</span> bn5<span class="op">-&gt;</span>scale<span class="op">,</span> bn5<span class="op">-&gt;</span>bias<span class="op">,</span> bn5<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-67"><a href="#cb8-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-68"><a href="#cb8-68" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Update the output feature</span></span>
<span id="cb8-69"><a href="#cb8-69" aria-hidden="true" tabindex="-1"></a>    MaxPool1dNaive<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>x10<span class="op">,</span> feature<span class="op">);</span></span>
<span id="cb8-70"><a href="#cb8-70" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb8-71"><a href="#cb8-71" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb8-72"><a href="#cb8-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-73"><a href="#cb8-73" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the classification network</span></span>
<span id="cb8-74"><a href="#cb8-74" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for layer input, output, and intermediate results</span></span>
<span id="cb8-75"><a href="#cb8-75" aria-hidden="true" tabindex="-1"></a><span class="co">// `U` is the type for parameters</span></span>
<span id="cb8-76"><a href="#cb8-76" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">&gt;</span></span>
<span id="cb8-77"><a href="#cb8-77" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceClsNaive<span class="op">(</span><span class="at">const</span> T feature<span class="op">[</span>kFeatDims5<span class="op">],</span></span>
<span id="cb8-78"><a href="#cb8-78" aria-hidden="true" tabindex="-1"></a>                       <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb8-79"><a href="#cb8-79" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;*</span> fc3<span class="op">,</span></span>
<span id="cb8-80"><a href="#cb8-80" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb8-81"><a href="#cb8-81" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb8-82"><a href="#cb8-82" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params1<span class="op">,</span></span>
<span id="cb8-83"><a href="#cb8-83" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params2<span class="op">,</span></span>
<span id="cb8-84"><a href="#cb8-84" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params3<span class="op">)</span></span>
<span id="cb8-85"><a href="#cb8-85" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb8-86"><a href="#cb8-86" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb8-87"><a href="#cb8-87" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-88"><a href="#cb8-88" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>kFeatDims5 <span class="op">==</span> kClsDims0<span class="op">,</span></span>
<span id="cb8-89"><a href="#cb8-89" aria-hidden="true" tabindex="-1"></a>                <span class="st">&quot;Feature dimension should be equal to the input dimension&quot;</span><span class="op">);</span></span>
<span id="cb8-90"><a href="#cb8-90" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-91"><a href="#cb8-91" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Input, output, and intermediate results</span></span>
<span id="cb8-92"><a href="#cb8-92" aria-hidden="true" tabindex="-1"></a>  T x0<span class="op">[</span>kClsDims1<span class="op">];</span></span>
<span id="cb8-93"><a href="#cb8-93" aria-hidden="true" tabindex="-1"></a>  T x1<span class="op">[</span>kClsDims1<span class="op">];</span></span>
<span id="cb8-94"><a href="#cb8-94" aria-hidden="true" tabindex="-1"></a>  T x2<span class="op">[</span>kClsDims2<span class="op">];</span></span>
<span id="cb8-95"><a href="#cb8-95" aria-hidden="true" tabindex="-1"></a>  T x3<span class="op">[</span>kClsDims2<span class="op">];</span></span>
<span id="cb8-96"><a href="#cb8-96" aria-hidden="true" tabindex="-1"></a>  T x4<span class="op">[</span>kClsDims3<span class="op">];</span></span>
<span id="cb8-97"><a href="#cb8-97" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-98"><a href="#cb8-98" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute logits</span></span>
<span id="cb8-99"><a href="#cb8-99" aria-hidden="true" tabindex="-1"></a>  LinearNaiveDDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims0<span class="op">,</span> kClsDims1<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-100"><a href="#cb8-100" aria-hidden="true" tabindex="-1"></a>    feature<span class="op">,</span> x0<span class="op">,</span> params1<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb8-101"><a href="#cb8-101" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">&gt;(</span></span>
<span id="cb8-102"><a href="#cb8-102" aria-hidden="true" tabindex="-1"></a>    x0<span class="op">,</span> x1<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-103"><a href="#cb8-103" aria-hidden="true" tabindex="-1"></a>  LinearNaiveDDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">,</span> kClsDims2<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-104"><a href="#cb8-104" aria-hidden="true" tabindex="-1"></a>    x1<span class="op">,</span> x2<span class="op">,</span> params2<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb8-105"><a href="#cb8-105" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">&gt;(</span></span>
<span id="cb8-106"><a href="#cb8-106" aria-hidden="true" tabindex="-1"></a>    x2<span class="op">,</span> x3<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb8-107"><a href="#cb8-107" aria-hidden="true" tabindex="-1"></a>  LinearNaive<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">,</span> <span class="kw">false</span><span class="op">&gt;(</span></span>
<span id="cb8-108"><a href="#cb8-108" aria-hidden="true" tabindex="-1"></a>    x3<span class="op">,</span> x4<span class="op">,</span> fc3<span class="op">-&gt;</span>weight<span class="op">,</span> fc3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb8-109"><a href="#cb8-109" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-110"><a href="#cb8-110" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Write the result</span></span>
<span id="cb8-111"><a href="#cb8-111" aria-hidden="true" tabindex="-1"></a>  WriteTensor1dNaive<span class="op">&lt;</span>T<span class="op">,</span> kClsDims3<span class="op">&gt;(</span>out_logits<span class="op">,</span> x4<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb8-112"><a href="#cb8-112" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code>InferenceFeatNaive</code> reads the points one at a time from the point cloud
(<code>point_cloud</code>) stored in DRAM. For each point
(<code>x0</code>), it computes a local feature (<code>x10</code>)
and updates the current global feature (<code>feature</code>).
This is repeated once per point (<code>num_points</code> times).
<code>InferenceClsNaive</code> takes the global feature
(<code>feature</code>), which represents the whole point cloud, computes the logits
(<code>x4</code>) for each class, and writes them back to the DRAM buffer (<code>out_logits</code>).</p>
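<p>The per-point update of the global feature can be sketched in plain C++ as follows. This is a software-only illustration, not the HLS code: the function name <code>GlobalFeatureStreaming</code> and the use of <code>std::vector</code> are hypothetical, and initializing the feature to the lowest representable value is an assumption made here so that the first point always overwrites it.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Software-only sketch (hypothetical helper, not the HLS implementation).
// `local_features[p]` plays the role of `x10` for the p-th point.
// The global feature is a running element-wise maximum over all points.
std::vector<float> GlobalFeatureStreaming(
    const std::vector<std::vector<float>>& local_features, std::size_t dims)
{
  // Assumption: start from the lowest representable value
  std::vector<float> feature(dims, std::numeric_limits<float>::lowest());

  for (const auto& x10 : local_features) {
    assert(x10.size() == dims);
    // Same update as `MaxPool1dNaive`: keep the larger of the two values
    for (std::size_t i = 0; i < dims; ++i)
      feature[i] = std::max(feature[i], x10[i]);
  }

  return feature;
}
```

<p>Because only the running maximum is kept, the local features of earlier points do not need to be stored, which is why the points can be processed one by one.</p>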
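<p><code>ReadPointNaive</code> reads the <span
class="math inline">\(i\)</span>-th point <span
class="math inline">\(\boldsymbol{p}_i\)</span> from the DRAM buffer.
As their names suggest, <code>LinearNaive</code>, <code>BatchNorm1dReLUNaive</code>, and <code>MaxPool1dNaive</code> correspond to the fully-connected layer
(<code>Conv1d</code>), batch normalization followed by ReLU activation, and the max-pooling layer, respectively
(see the formulas given earlier).
They read the parameters from on-chip buffers and compute the layer output.
<code>LinearNaiveDDR</code> is also a fully-connected layer, but it computes the output while fetching the parameters from a DRAM buffer a little at a time.
These functions are shown below.
Apart from the HLS pragmas, they are almost identical to a software implementation.
The listing is long, but the processing itself is simple.</p>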
<div class="sourceCode" id="cb9"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the fully-connected layer</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co">// `TParam` is the type for weight and bias</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a><span class="co">// `ApplyReLU` is the flag to apply ReLU activation</span></span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span></span>
<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a>          <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">,</span> <span class="dt">bool</span> ApplyReLU<span class="op">&gt;</span></span>
<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> LinearNaive<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>InDims<span class="op">],</span></span>
<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a>                 T y<span class="op">[</span>OutDims<span class="op">],</span></span>
<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a>                 <span class="at">const</span> TParam weight<span class="op">[</span>OutDims<span class="op">][</span>InDims<span class="op">],</span></span>
<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a>                 <span class="at">const</span> TParam bias<span class="op">[</span>OutDims<span class="op">])</span></span>
<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> OutDims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE off</span></span>
<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a>    T val <span class="op">=</span> bias<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb9-19"><a href="#cb9-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-20"><a href="#cb9-20" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-21"><a href="#cb9-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb9-22"><a href="#cb9-22" aria-hidden="true" tabindex="-1"></a>      val <span class="op">+=</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i<span class="op">][</span>j<span class="op">];</span></span>
<span id="cb9-23"><a href="#cb9-23" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb9-24"><a href="#cb9-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-25"><a href="#cb9-25" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb9-26"><a href="#cb9-26" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> val <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb9-27"><a href="#cb9-27" aria-hidden="true" tabindex="-1"></a>    <span class="cf">else</span></span>
<span id="cb9-28"><a href="#cb9-28" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val<span class="op">;</span></span>
<span id="cb9-29"><a href="#cb9-29" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-30"><a href="#cb9-30" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-31"><a href="#cb9-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-32"><a href="#cb9-32" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the fully-connected layer</span></span>
<span id="cb9-33"><a href="#cb9-33" aria-hidden="true" tabindex="-1"></a><span class="co">// Weight and bias parameters are stored on the DDR memory</span></span>
<span id="cb9-34"><a href="#cb9-34" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span></span>
<span id="cb9-35"><a href="#cb9-35" aria-hidden="true" tabindex="-1"></a>          <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">,</span> <span class="dt">bool</span> ApplyReLU<span class="op">&gt;</span></span>
<span id="cb9-36"><a href="#cb9-36" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> LinearNaiveDDR<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>InDims<span class="op">],</span></span>
<span id="cb9-37"><a href="#cb9-37" aria-hidden="true" tabindex="-1"></a>                    T y<span class="op">[</span>OutDims<span class="op">],</span></span>
<span id="cb9-38"><a href="#cb9-38" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params<span class="op">,</span></span>
<span id="cb9-39"><a href="#cb9-39" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb9-40"><a href="#cb9-40" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-41"><a href="#cb9-41" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `params` contains weight parameters of size (`OutDims`, `InDims`) and</span></span>
<span id="cb9-42"><a href="#cb9-42" aria-hidden="true" tabindex="-1"></a>  <span class="co">// bias parameters of size (`OutDims`) in a contiguous buffer</span></span>
<span id="cb9-43"><a href="#cb9-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-44"><a href="#cb9-44" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb9-45"><a href="#cb9-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-46"><a href="#cb9-46" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> OffsetToBias <span class="op">=</span> OutDims <span class="op">*</span> InDims<span class="op">;</span></span>
<span id="cb9-47"><a href="#cb9-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-48"><a href="#cb9-48" aria-hidden="true" tabindex="-1"></a>  TParam bias<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb9-49"><a href="#cb9-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-50"><a href="#cb9-50" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Copy the bias parameters in advance</span></span>
<span id="cb9-51"><a href="#cb9-51" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> OutDims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-52"><a href="#cb9-52" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb9-53"><a href="#cb9-53" aria-hidden="true" tabindex="-1"></a>    bias<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>params<span class="op">[</span>offset <span class="op">+</span> OffsetToBias <span class="op">+</span> i<span class="op">]);</span></span>
<span id="cb9-54"><a href="#cb9-54" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-55"><a href="#cb9-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-56"><a href="#cb9-56" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> OutDims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-57"><a href="#cb9-57" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE off</span></span>
<span id="cb9-58"><a href="#cb9-58" aria-hidden="true" tabindex="-1"></a>    T val <span class="op">=</span> bias<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb9-59"><a href="#cb9-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-60"><a href="#cb9-60" aria-hidden="true" tabindex="-1"></a>    TParam weight<span class="op">[</span>InDims<span class="op">];</span></span>
<span id="cb9-61"><a href="#cb9-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-62"><a href="#cb9-62" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-63"><a href="#cb9-63" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb9-64"><a href="#cb9-64" aria-hidden="true" tabindex="-1"></a>      weight<span class="op">[</span>j<span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>params<span class="op">[</span>offset <span class="op">+</span> i <span class="op">*</span> InDims <span class="op">+</span> j<span class="op">]);</span></span>
<span id="cb9-65"><a href="#cb9-65" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb9-66"><a href="#cb9-66" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-67"><a href="#cb9-67" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-68"><a href="#cb9-68" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb9-69"><a href="#cb9-69" aria-hidden="true" tabindex="-1"></a>      val <span class="op">+=</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>j<span class="op">];</span></span>
<span id="cb9-70"><a href="#cb9-70" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb9-71"><a href="#cb9-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-72"><a href="#cb9-72" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb9-73"><a href="#cb9-73" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> val <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb9-74"><a href="#cb9-74" aria-hidden="true" tabindex="-1"></a>    <span class="cf">else</span></span>
<span id="cb9-75"><a href="#cb9-75" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val<span class="op">;</span></span>
<span id="cb9-76"><a href="#cb9-76" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-77"><a href="#cb9-77" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-78"><a href="#cb9-78" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-79"><a href="#cb9-79" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the 1D batch normalization and ReLU activation</span></span>
<span id="cb9-80"><a href="#cb9-80" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb9-81"><a href="#cb9-81" aria-hidden="true" tabindex="-1"></a><span class="co">// `TParam` is the type for parameters</span></span>
<span id="cb9-82"><a href="#cb9-82" aria-hidden="true" tabindex="-1"></a><span class="co">// `Dims` is the number of input and output dimensions</span></span>
<span id="cb9-83"><a href="#cb9-83" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span> <span class="dt">int</span> Dims<span class="op">&gt;</span></span>
<span id="cb9-84"><a href="#cb9-84" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> BatchNorm1dReLUNaive<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb9-85"><a href="#cb9-85" aria-hidden="true" tabindex="-1"></a>                          T y<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb9-86"><a href="#cb9-86" aria-hidden="true" tabindex="-1"></a>                          <span class="at">const</span> TParam scale<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb9-87"><a href="#cb9-87" aria-hidden="true" tabindex="-1"></a>                          <span class="at">const</span> TParam bias<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb9-88"><a href="#cb9-88" aria-hidden="true" tabindex="-1"></a>                          <span class="at">const</span> TParam mean<span class="op">[</span>Dims<span class="op">])</span></span>
<span id="cb9-89"><a href="#cb9-89" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-90"><a href="#cb9-90" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb9-91"><a href="#cb9-91" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-92"><a href="#cb9-92" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> Dims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-93"><a href="#cb9-93" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb9-94"><a href="#cb9-94" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Batch normalization with the learned parameters</span></span>
<span id="cb9-95"><a href="#cb9-95" aria-hidden="true" tabindex="-1"></a>    T val <span class="op">=</span> <span class="op">(</span>x<span class="op">[</span>i<span class="op">]</span> <span class="op">-</span> mean<span class="op">[</span>i<span class="op">])</span> <span class="op">*</span> scale<span class="op">[</span>i<span class="op">]</span> <span class="op">+</span> bias<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb9-96"><a href="#cb9-96" aria-hidden="true" tabindex="-1"></a>    <span class="co">// ReLU activation</span></span>
<span id="cb9-97"><a href="#cb9-97" aria-hidden="true" tabindex="-1"></a>    y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> val <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb9-98"><a href="#cb9-98" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-99"><a href="#cb9-99" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-100"><a href="#cb9-100" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-101"><a href="#cb9-101" aria-hidden="true" tabindex="-1"></a><span class="co">// Naive implementation of the 1D max-pooling layer</span></span>
<span id="cb9-102"><a href="#cb9-102" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb9-103"><a href="#cb9-103" aria-hidden="true" tabindex="-1"></a><span class="co">// `Dims` is the number of input and output dimensions</span></span>
<span id="cb9-104"><a href="#cb9-104" aria-hidden="true" tabindex="-1"></a><span class="co">// `y` must be properly initialized</span></span>
<span id="cb9-105"><a href="#cb9-105" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> Dims<span class="op">&gt;</span></span>
<span id="cb9-106"><a href="#cb9-106" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> MaxPool1dNaive<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>Dims<span class="op">],</span> T y<span class="op">[</span>Dims<span class="op">])</span></span>
<span id="cb9-107"><a href="#cb9-107" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-108"><a href="#cb9-108" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `x` is of size (1, `Dims`)</span></span>
<span id="cb9-109"><a href="#cb9-109" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `y` is of size (1, `Dims`)</span></span>
<span id="cb9-110"><a href="#cb9-110" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-111"><a href="#cb9-111" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb9-112"><a href="#cb9-112" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-113"><a href="#cb9-113" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> Dims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-114"><a href="#cb9-114" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb9-115"><a href="#cb9-115" aria-hidden="true" tabindex="-1"></a>    y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> x<span class="op">[</span>i<span class="op">]</span> <span class="op">&gt;</span> y<span class="op">[</span>i<span class="op">]</span> <span class="op">?</span> x<span class="op">[</span>i<span class="op">]</span> <span class="op">:</span> y<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb9-116"><a href="#cb9-116" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-117"><a href="#cb9-117" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code>LinearNaiveDDR</code> keeps only the bias
<code>bias</code> of the fully-connected layer and the weights
<code>weight</code> needed to compute a single output element in on-chip memory, rather than the whole weight matrix.
With input and output dimensions <span class="math inline">\(\mathrm{InDims},
\mathrm{OutDims}\)</span>, the size of <code>bias</code> is <span
class="math inline">\(\mathrm{OutDims}\)</span> and the size of <code>weight</code> is <span
class="math inline">\(\mathrm{InDims}\)</span>.</p>
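<p>The buffer layout that <code>LinearNaiveDDR</code> expects (weights of size <code>OutDims</code> × <code>InDims</code> followed by <code>OutDims</code> biases) can be illustrated with a small software sketch. The helpers <code>PackLinearParams</code> and <code>LinearFromFlatParams</code> are hypothetical names introduced here for illustration. They only mirror the indexing and the two small buffers of the HLS function and are not part of the IP core.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper: pack a weight matrix of size
// (out_dims, in_dims) followed by a bias vector of size (out_dims)
// into one contiguous buffer.
std::vector<float> PackLinearParams(
    const std::vector<std::vector<float>>& weight,
    const std::vector<float>& bias)
{
  const std::size_t out_dims = weight.size();
  assert(bias.size() == out_dims);
  const std::size_t in_dims = out_dims > 0 ? weight[0].size() : 0;

  std::vector<float> params;
  params.reserve(out_dims * in_dims + out_dims);
  for (const auto& row : weight) {
    assert(row.size() == in_dims);
    params.insert(params.end(), row.begin(), row.end());
  }
  params.insert(params.end(), bias.begin(), bias.end());
  return params;
}

// Hypothetical software reference that mirrors `LinearNaiveDDR`:
// weight (i, j) is at `offset + i * in_dims + j`, and
// bias (i) is at `offset + out_dims * in_dims + i`.
std::vector<float> LinearFromFlatParams(
    const std::vector<float>& x, const std::vector<float>& params,
    std::size_t out_dims, std::size_t offset)
{
  const std::size_t in_dims = x.size();
  const std::size_t offset_to_bias = out_dims * in_dims;
  assert(params.size() >= offset + offset_to_bias + out_dims);

  // Buffer of size `out_dims`, filled once in advance
  std::vector<float> bias(out_dims);
  for (std::size_t i = 0; i < out_dims; ++i)
    bias[i] = params[offset + offset_to_bias + i];

  // Buffer of size `in_dims`, refilled for every output element
  std::vector<float> weight(in_dims);
  std::vector<float> y(out_dims);

  for (std::size_t i = 0; i < out_dims; ++i) {
    for (std::size_t j = 0; j < in_dims; ++j)
      weight[j] = params[offset + i * in_dims + j];

    float val = bias[i];
    for (std::size_t j = 0; j < in_dims; ++j)
      val += x[j] * weight[j];
    y[i] = val;
  }

  return y;
}
```

<p>In this sketch the working storage is <code>out_dims + in_dims</code> values, whereas holding the full weight matrix would take <code>out_dims * in_dims</code> values.</p>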
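<p>The loops in the functions above are annotated with <code>#pragma HLS PIPELINE</code>, so the loop body is pipelined automatically
(<strong>Optimization 4: loop pipelining</strong>).
Writing <code>#pragma HLS PIPELINE off</code> suppresses this pipelining.
The figure below illustrates the effect of pipelining.</p>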
<p><a
href="point-cloud-classification-images/pipelined-execution.svg"><img src="point-cloud-classification-images/pipelined-execution.svg" width="70%" /></a></p>
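<p>Without pipelining, the iterations of the loop are executed one after another
(top of the figure). With pipelining, the loop body is split into several stages
(four in the figure), and the stages of different iterations overlap in time
(bottom of the figure).
Because several iterations are in flight at the same time, the loop finishes sooner.
The loop execution time is determined by the slowest stage (stage 3 in the figure).
Pipelining becomes more effective when the work of one iteration is split as evenly as possible.
Applying pipelining to the innermost loop, as in the source code above, greatly reduces the processing time.
If pipelining is applied to the outer loop of a doubly nested loop, the inner loop is fully unrolled and the nest is rewritten as a single loop, so resource usage increases significantly.
I think it is better not to pipeline the outer loop.</p>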
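<p>The effect described above can be estimated with a simple timing model. The sketch below is an idealized model, not a Vitis HLS report: it assumes each stage starts as soon as its input is ready and the stage itself is free. The function names and the stage times used in the note that follows are made up for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Idealized timing model (an assumption for illustration).
// `stage_time[k]` is the time taken by the k-th stage of one iteration.

// Without pipelining, every iteration runs all of its stages back to back.
long SequentialTime(const std::vector<long>& stage_time, long num_iters)
{
  const long iter_time =
    std::accumulate(stage_time.begin(), stage_time.end(), 0L);
  return iter_time * num_iters;
}

// With pipelining, the first iteration takes the full iteration time, and
// each following iteration completes one "slowest stage" later.
long PipelinedTime(const std::vector<long>& stage_time, long num_iters)
{
  assert(!stage_time.empty() && num_iters >= 1);
  const long iter_time =
    std::accumulate(stage_time.begin(), stage_time.end(), 0L);
  const long slowest =
    *std::max_element(stage_time.begin(), stage_time.end());
  return iter_time + (num_iters - 1) * slowest;
}
```

<p>For example, with four stages taking 1, 1, 3, and 1 time units and four iterations, this model gives 24 units without pipelining and 15 units with pipelining. The slowest stage (3 units) dominates, which is why evenly split stages help.</p>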
<p>The IP core described above is in <code>hls/src/top_naive.cpp</code>.</p>
<h2 id="並列化-データ並列性の活用">Parallelization (exploiting data parallelism)</h2>
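<p>This IP core works correctly, but it is clearly a naive implementation
(a plain one with no optimization at all). Let us exploit data parallelism
to parallelize the computation of each layer (<strong>Optimization 5:
data parallelism</strong>).</p>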
<p>Let us look at the computation of the fully-connected layer once more. <span class="math display">\[
  \boldsymbol{y} = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}
\]</span> Each element <span
class="math inline">\(y_i\)</span> of the output <span
class="math inline">\(\boldsymbol{y}\)</span> is computed as follows. <span
class="math display">\[
  y_i = \sum_j W_{i, j} x_j + b_i
\]</span> The <span class="math inline">\(B\)</span> output elements <span
class="math inline">\(y_i, y_{i + 1}, \ldots, y_{i + B -
1}\)</span> do not depend on one another
(each of them can be computed independently), so let us compute them in parallel.
<span class="math display">\[
  \begin{eqnarray}
    y_i &amp;=&amp; \sum_j W_{i, j} x_j + b_i \\
    y_{i + 1} &amp;=&amp; \sum_j W_{i + 1, j} x_j + b_{i + 1} \\
    &amp;\vdots&amp; \\
    y_{i + B - 1} &amp;=&amp; \sum_j W_{i + B - 1, j} x_j + b_{i + B -
1}
  \end{eqnarray}
\]</span> That is, the <span class="math inline">\(B\)</span> products <span class="math inline">\(W_{i, j} x_j, W_{i + 1, j} x_j,
\ldots, W_{i + B - 1, j} x_j\)</span> are computed in parallel.
In other words, the loop over <span class="math inline">\(j\)</span> (input dimension)
is left as it is, and the loop over <span class="math inline">\(i\)</span>
(output dimension) is parallelized. Since <span
class="math inline">\(B\)</span> outputs are computed in parallel, a speedup of <span
class="math inline">\(B\)</span> times can be expected
(resource usage also grows by a factor of <span class="math inline">\(B\)</span>).</p>
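<p>The blocking of the output dimension can be checked in software before writing any HLS code. The following is a minimal sketch with no pragmas: the name <code>LinearBlocked</code> and the <code>std::vector</code> interface are hypothetical. It only shows how the loop over <span class="math inline">\(i\)</span> is split into blocks of <code>B</code> outputs while the loop over <span class="math inline">\(j\)</span> is kept.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software-only sketch of the blocking idea (hypothetical helper).
// `B` is the block size for the output dimension.
template <int B>
std::vector<float> LinearBlocked(
    const std::vector<std::vector<float>>& weight,
    const std::vector<float>& bias,
    const std::vector<float>& x)
{
  const std::size_t out_dims = weight.size();
  const std::size_t in_dims = x.size();
  // The number of outputs must be a multiple of `B`
  assert(out_dims % static_cast<std::size_t>(B) == 0);
  assert(bias.size() == out_dims);

  std::vector<float> y(out_dims);

  for (std::size_t i0 = 0; i0 < out_dims; i0 += B) {
    // Accumulators for the `B` outputs of this block
    float vals[B];
    for (int b = 0; b < B; ++b)
      vals[b] = bias[i0 + b];

    // The loop over the input dimension is kept as it is
    for (std::size_t j = 0; j < in_dims; ++j) {
      // These `B` multiply-accumulates are independent of each other;
      // they are the ones that would run in parallel in hardware
      for (int b = 0; b < B; ++b)
        vals[b] += weight[i0 + b][j] * x[j];
    }

    for (int b = 0; b < B; ++b)
      y[i0 + b] = vals[b];
  }

  return y;
}
```

<p>Each output still accumulates its terms in the same order as the unblocked loop, so only the schedule changes, not the result.</p>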
<p>Batch normalization and ReLU activation are handled in the same way: multiple output elements <span
class="math inline">\(y_i, y_{i + 1}, \ldots, y_{i + B -
1}\)</span> are computed in parallel. <span class="math display">\[
  \begin{eqnarray}
    y_i &amp;=&amp; \max \left( 0, \left( x_i - \mu_i \right) \cdot s_i
+ b_i \right) \\
    y_{i + 1} &amp;=&amp; \max \left( 0, \left( x_{i + 1} - \mu_{i + 1}
\right) \cdot s_{i + 1} + b_{i + 1} \right) \\
    &amp;\vdots&amp; \\
    y_{i + B - 1} &amp;=&amp; \max \left( 0, \left( x_{i + B - 1} -
\mu_{i + B - 1} \right) \cdot s_{i + B - 1} + b_{i + B - 1} \right)
  \end{eqnarray}
\]</span></p>
<p>Max pooling is exactly the same: multiple output elements <span
class="math inline">\(\phi_i, \phi_{i + 1}, \ldots, \phi_{i + B -
1}\)</span> are computed in parallel. <span class="math display">\[
  \begin{eqnarray}
    \phi_i &amp;=&amp; \max \left( \phi_i, \psi_i \right) \\
    \phi_{i + 1} &amp;=&amp; \max \left( \phi_{i + 1}, \psi_{i + 1}
\right) \\
    &amp;\vdots&amp; \\
    \phi_{i + B - 1} &amp;=&amp; \max \left( \phi_{i + B - 1}, \psi_{i +
B - 1} \right)
  \end{eqnarray}
\]</span></p>
<p><code>LinearNaive</code>, <code>LinearNaiveDDR</code>, <code>BatchNorm1dReLUNaive</code>, and <code>MaxPool1dNaive</code> were the naive implementations of the layers.
We replace them with the parallelized versions
<code>LinearOpt1</code>, <code>LinearOpt1DDR</code>, <code>BatchNorm1dReLUOpt1</code>, and <code>MaxPool1dOpt1</code>
(the suffix changes from <code>Naive</code> to <code>Opt1</code>).
A template parameter <code>B</code> is added
(<code>B</code>-way parallelism).</p>
<div class="sourceCode" id="cb10"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the fully-connected layer</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Matrix-vector multiplication is parallelized along the output dimension</span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `TParam` is the type for weight and bias</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="co">// `ApplyReLU` is the flag to apply ReLU activation</span></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="co">// `B` is the block size for the output dimension</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span></span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a>          <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">,</span> <span class="dt">bool</span> ApplyReLU<span class="op">,</span> <span class="dt">int</span> B<span class="op">&gt;</span></span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> LinearOpt1<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>InDims<span class="op">],</span></span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a>                T y<span class="op">[</span>OutDims<span class="op">],</span></span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a>                <span class="at">const</span> TParam weight<span class="op">[</span>OutDims<span class="op">][</span>InDims<span class="op">],</span></span>
<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a>                <span class="at">const</span> TParam bias<span class="op">[</span>OutDims<span class="op">])</span></span>
<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `OutDims` must be a multiple of `B`</span></span>
<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> B <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of `B`&quot;</span><span class="op">);</span></span>
<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i0 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i0 <span class="op">&lt;</span> OutDims<span class="op">;</span> i0 <span class="op">+=</span> B<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE off</span></span>
<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a>    T vals<span class="op">[</span>B<span class="op">];</span></span>
<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1</span></span>
<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a>        <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a>        T last <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i<span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb10-32"><a href="#cb10-32" aria-hidden="true" tabindex="-1"></a>        vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">=</span> last <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i<span class="op">][</span>j<span class="op">];</span></span>
<span id="cb10-33"><a href="#cb10-33" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb10-34"><a href="#cb10-34" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-35"><a href="#cb10-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-36"><a href="#cb10-36" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-37"><a href="#cb10-37" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-38"><a href="#cb10-38" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-39"><a href="#cb10-39" aria-hidden="true" tabindex="-1"></a>      <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb10-40"><a href="#cb10-40" aria-hidden="true" tabindex="-1"></a>        y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb10-41"><a href="#cb10-41" aria-hidden="true" tabindex="-1"></a>      <span class="cf">else</span></span>
<span id="cb10-42"><a href="#cb10-42" aria-hidden="true" tabindex="-1"></a>        y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb10-43"><a href="#cb10-43" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-44"><a href="#cb10-44" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb10-45"><a href="#cb10-45" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-46"><a href="#cb10-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-47"><a href="#cb10-47" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the fully-connected layer</span></span>
<span id="cb10-48"><a href="#cb10-48" aria-hidden="true" tabindex="-1"></a><span class="co">// Weight and bias parameters are stored on the DDR memory</span></span>
<span id="cb10-49"><a href="#cb10-49" aria-hidden="true" tabindex="-1"></a><span class="co">// Matrix-vector multiplication is parallelized along the output dimension</span></span>
<span id="cb10-50"><a href="#cb10-50" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span></span>
<span id="cb10-51"><a href="#cb10-51" aria-hidden="true" tabindex="-1"></a>          <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">,</span> <span class="dt">bool</span> ApplyReLU<span class="op">,</span> <span class="dt">int</span> B<span class="op">&gt;</span></span>
<span id="cb10-52"><a href="#cb10-52" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> LinearOpt1DDR<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>InDims<span class="op">],</span></span>
<span id="cb10-53"><a href="#cb10-53" aria-hidden="true" tabindex="-1"></a>                   T y<span class="op">[</span>OutDims<span class="op">],</span></span>
<span id="cb10-54"><a href="#cb10-54" aria-hidden="true" tabindex="-1"></a>                   <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params<span class="op">,</span></span>
<span id="cb10-55"><a href="#cb10-55" aria-hidden="true" tabindex="-1"></a>                   <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb10-56"><a href="#cb10-56" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-57"><a href="#cb10-57" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `params` contains weight parameters of size (`OutDims`, `InDims`) and</span></span>
<span id="cb10-58"><a href="#cb10-58" aria-hidden="true" tabindex="-1"></a>  <span class="co">// bias parameters of size (`OutDims`) in a contiguous buffer</span></span>
<span id="cb10-59"><a href="#cb10-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-60"><a href="#cb10-60" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb10-61"><a href="#cb10-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-62"><a href="#cb10-62" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `OutDims` must be a multiple of `B`</span></span>
<span id="cb10-63"><a href="#cb10-63" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> B <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of `B`&quot;</span><span class="op">);</span></span>
<span id="cb10-64"><a href="#cb10-64" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `B` must be larger than 1</span></span>
<span id="cb10-65"><a href="#cb10-65" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>B <span class="op">&gt;</span> <span class="dv">1</span><span class="op">,</span> <span class="st">&quot;`B` must be larger than 1&quot;</span><span class="op">);</span></span>
<span id="cb10-66"><a href="#cb10-66" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-67"><a href="#cb10-67" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> BHalf <span class="op">=</span> B <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb10-68"><a href="#cb10-68" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> OffsetToBias <span class="op">=</span> OutDims <span class="op">*</span> InDims<span class="op">;</span></span>
<span id="cb10-69"><a href="#cb10-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-70"><a href="#cb10-70" aria-hidden="true" tabindex="-1"></a>  TParam bias<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb10-71"><a href="#cb10-71" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1</span></span>
<span id="cb10-72"><a href="#cb10-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-73"><a href="#cb10-73" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Copy the bias parameters in advance</span></span>
<span id="cb10-74"><a href="#cb10-74" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> OutDims<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-75"><a href="#cb10-75" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb10-76"><a href="#cb10-76" aria-hidden="true" tabindex="-1"></a>    bias<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>params<span class="op">[</span>offset <span class="op">+</span> OffsetToBias <span class="op">+</span> i<span class="op">]);</span></span>
<span id="cb10-77"><a href="#cb10-77" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb10-78"><a href="#cb10-78" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-79"><a href="#cb10-79" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i0 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i0 <span class="op">&lt;</span> OutDims<span class="op">;</span> i0 <span class="op">+=</span> B<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-80"><a href="#cb10-80" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE off</span></span>
<span id="cb10-81"><a href="#cb10-81" aria-hidden="true" tabindex="-1"></a>    T vals<span class="op">[</span>B<span class="op">];</span></span>
<span id="cb10-82"><a href="#cb10-82" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1</span></span>
<span id="cb10-83"><a href="#cb10-83" aria-hidden="true" tabindex="-1"></a>    TParam weight<span class="op">[</span>B<span class="op">][</span>InDims<span class="op">];</span></span>
<span id="cb10-84"><a href="#cb10-84" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1</span></span>
<span id="cb10-85"><a href="#cb10-85" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-86"><a href="#cb10-86" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Copy the weight parameters for `B` outputs</span></span>
<span id="cb10-87"><a href="#cb10-87" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> <span class="dt">int</span> offset0 <span class="op">=</span> offset <span class="op">+</span> i0 <span class="op">*</span> InDims<span class="op">;</span></span>
<span id="cb10-88"><a href="#cb10-88" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-89"><a href="#cb10-89" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-90"><a href="#cb10-90" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb10-91"><a href="#cb10-91" aria-hidden="true" tabindex="-1"></a>        weight<span class="op">[</span>i1<span class="op">][</span>j<span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>params<span class="op">[</span>offset0 <span class="op">+</span> i1 <span class="op">*</span> InDims <span class="op">+</span> j<span class="op">]);</span></span>
<span id="cb10-92"><a href="#cb10-92" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb10-93"><a href="#cb10-93" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-94"><a href="#cb10-94" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-95"><a href="#cb10-95" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-96"><a href="#cb10-96" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb10-97"><a href="#cb10-97" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-98"><a href="#cb10-98" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-99"><a href="#cb10-99" aria-hidden="true" tabindex="-1"></a>        <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-100"><a href="#cb10-100" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="op">(</span>i <span class="op">&lt;</span> OutDims<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-101"><a href="#cb10-101" aria-hidden="true" tabindex="-1"></a>          T last <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i<span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb10-102"><a href="#cb10-102" aria-hidden="true" tabindex="-1"></a>          vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">=</span> last <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i1<span class="op">][</span>j<span class="op">];</span></span>
<span id="cb10-103"><a href="#cb10-103" aria-hidden="true" tabindex="-1"></a>        <span class="op">}</span></span>
<span id="cb10-104"><a href="#cb10-104" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb10-105"><a href="#cb10-105" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-106"><a href="#cb10-106" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-107"><a href="#cb10-107" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-108"><a href="#cb10-108" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-109"><a href="#cb10-109" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-110"><a href="#cb10-110" aria-hidden="true" tabindex="-1"></a>      <span class="cf">if</span> <span class="op">(</span>i <span class="op">&lt;</span> OutDims<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-111"><a href="#cb10-111" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb10-112"><a href="#cb10-112" aria-hidden="true" tabindex="-1"></a>          y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb10-113"><a href="#cb10-113" aria-hidden="true" tabindex="-1"></a>        <span class="cf">else</span></span>
<span id="cb10-114"><a href="#cb10-114" aria-hidden="true" tabindex="-1"></a>          y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb10-115"><a href="#cb10-115" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb10-116"><a href="#cb10-116" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-117"><a href="#cb10-117" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb10-118"><a href="#cb10-118" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-119"><a href="#cb10-119" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-120"><a href="#cb10-120" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the 1D batch normalization and ReLU activation</span></span>
<span id="cb10-121"><a href="#cb10-121" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb10-122"><a href="#cb10-122" aria-hidden="true" tabindex="-1"></a><span class="co">// `TParam` is the type for parameters</span></span>
<span id="cb10-123"><a href="#cb10-123" aria-hidden="true" tabindex="-1"></a><span class="co">// `Dims` is the number of input and output dimensions</span></span>
<span id="cb10-124"><a href="#cb10-124" aria-hidden="true" tabindex="-1"></a><span class="co">// `B` is the block size for the output dimension</span></span>
<span id="cb10-125"><a href="#cb10-125" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span> <span class="dt">int</span> Dims<span class="op">,</span> <span class="dt">int</span> B<span class="op">&gt;</span></span>
<span id="cb10-126"><a href="#cb10-126" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> BatchNorm1dReLUOpt1<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb10-127"><a href="#cb10-127" aria-hidden="true" tabindex="-1"></a>                         T y<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb10-128"><a href="#cb10-128" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> TParam scale<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb10-129"><a href="#cb10-129" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> TParam bias<span class="op">[</span>Dims<span class="op">],</span></span>
<span id="cb10-130"><a href="#cb10-130" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> TParam mean<span class="op">[</span>Dims<span class="op">])</span></span>
<span id="cb10-131"><a href="#cb10-131" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-132"><a href="#cb10-132" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `scale` is the multiplication of the weight and reciprocal of the</span></span>
<span id="cb10-133"><a href="#cb10-133" aria-hidden="true" tabindex="-1"></a>  <span class="co">// standard deviation (to reduce the on-chip memory consumption)</span></span>
<span id="cb10-134"><a href="#cb10-134" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-135"><a href="#cb10-135" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb10-136"><a href="#cb10-136" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-137"><a href="#cb10-137" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>Dims <span class="op">%</span> B <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`Dims` must be a multiple of `B`&quot;</span><span class="op">);</span></span>
<span id="cb10-138"><a href="#cb10-138" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-139"><a href="#cb10-139" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i0 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i0 <span class="op">&lt;</span> Dims<span class="op">;</span> i0 <span class="op">+=</span> B<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-140"><a href="#cb10-140" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb10-141"><a href="#cb10-141" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-142"><a href="#cb10-142" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-143"><a href="#cb10-143" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-144"><a href="#cb10-144" aria-hidden="true" tabindex="-1"></a>      <span class="co">// Batch normalization with the learned parameters</span></span>
<span id="cb10-145"><a href="#cb10-145" aria-hidden="true" tabindex="-1"></a>      T val <span class="op">=</span> <span class="op">(</span>x<span class="op">[</span>i<span class="op">]</span> <span class="op">-</span> mean<span class="op">[</span>i<span class="op">])</span> <span class="op">*</span> scale<span class="op">[</span>i<span class="op">]</span> <span class="op">+</span> bias<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb10-146"><a href="#cb10-146" aria-hidden="true" tabindex="-1"></a>      <span class="co">// ReLU activation</span></span>
<span id="cb10-147"><a href="#cb10-147" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> val <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> val <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb10-148"><a href="#cb10-148" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-149"><a href="#cb10-149" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb10-150"><a href="#cb10-150" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-151"><a href="#cb10-151" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-152"><a href="#cb10-152" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the 1D max-pooling layer</span></span>
<span id="cb10-153"><a href="#cb10-153" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb10-154"><a href="#cb10-154" aria-hidden="true" tabindex="-1"></a><span class="co">// `Dims` is the number of input and output dimensions</span></span>
<span id="cb10-155"><a href="#cb10-155" aria-hidden="true" tabindex="-1"></a><span class="co">// `B` is the block size for the output dimension</span></span>
<span id="cb10-156"><a href="#cb10-156" aria-hidden="true" tabindex="-1"></a><span class="co">// `y` must be properly initialized</span></span>
<span id="cb10-157"><a href="#cb10-157" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> Dims<span class="op">,</span> <span class="dt">int</span> B<span class="op">&gt;</span></span>
<span id="cb10-158"><a href="#cb10-158" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> MaxPool1dOpt1<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>Dims<span class="op">],</span> T y<span class="op">[</span>Dims<span class="op">])</span></span>
<span id="cb10-159"><a href="#cb10-159" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-160"><a href="#cb10-160" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb10-161"><a href="#cb10-161" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-162"><a href="#cb10-162" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>Dims <span class="op">%</span> B <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`Dims` must be a multiple of `B`&quot;</span><span class="op">);</span></span>
<span id="cb10-163"><a href="#cb10-163" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-164"><a href="#cb10-164" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i0 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i0 <span class="op">&lt;</span> Dims<span class="op">;</span> i0 <span class="op">+=</span> B<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-165"><a href="#cb10-165" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb10-166"><a href="#cb10-166" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb10-167"><a href="#cb10-167" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb10-168"><a href="#cb10-168" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb10-169"><a href="#cb10-169" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> x<span class="op">[</span>i<span class="op">]</span> <span class="op">&gt;</span> y<span class="op">[</span>i<span class="op">]</span> <span class="op">?</span> x<span class="op">[</span>i<span class="op">]</span> <span class="op">:</span> y<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb10-170"><a href="#cb10-170" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb10-171"><a href="#cb10-171" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb10-172"><a href="#cb10-172" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
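<p>Comparing <code>LinearOpt1</code> with <code>LinearNaive</code>, the loop over <code>j</code>
(the input dimension) is unchanged, while the loop over <code>i</code> (the output dimension)
has been split into two loops, <code>i0</code> and <code>i1</code>.
<code>i0</code> advances in steps of <code>B</code>, and <code>i1</code> runs from <code>0</code> to <code>B - 1</code> in steps of one, so that <code>i = i0 + i1</code> covers <code>i0</code> through <code>i0 + B - 1</code>.
The <code>i1</code> loop is unrolled
(<code>#pragma HLS UNROLL</code>),
so its body is fully expanded.
The <code>i1</code> loop itself disappears, and the work for <code>i0</code> through <code>i0 + B - 1</code> is executed in parallel.
Let's look at the first loop.</p>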
<div class="sourceCode" id="cb11"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a>        <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a>        T last <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i<span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a>        vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">=</span> last <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i<span class="op">][</span>j<span class="op">];</span></span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span></code></pre></div>
<p>Once the <code>i1</code> loop is unrolled, this loop is equivalent to the following.</p>
<div class="sourceCode" id="cb12"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>      T last0 <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">0</span><span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span><span class="dv">0</span><span class="op">];</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a>      T last1 <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">1</span><span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span><span class="dv">1</span><span class="op">];</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a>      <span class="co">// ...</span></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a>      T lastB1 <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i0 <span class="op">+</span> B <span class="op">-</span> <span class="dv">1</span><span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span>B <span class="op">-</span> <span class="dv">1</span><span class="op">];</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a>      vals<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> last0 <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">0</span><span class="op">][</span>j<span class="op">];</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a>      vals<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">=</span> last1 <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">1</span><span class="op">][</span>j<span class="op">];</span></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a>      <span class="co">// ...</span></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a>      vals<span class="op">[</span>B <span class="op">-</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> lastB1 <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i0 <span class="op">+</span> B <span class="op">-</span> <span class="dv">1</span><span class="op">][</span>j<span class="op">];</span></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span></code></pre></div>
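<p>To enable this parallelism, a temporary array <code>vals</code> of size <code>B</code> is introduced.
It holds the results for the outputs <code>y[i0]</code> through <code>y[i0 + B - 1]</code>.
Each element of <code>vals</code> is initialized with the corresponding bias term, <code>bias[i0]</code> through <code>bias[i0 + B - 1]</code>.
The <code>j</code> loop then adds <code>x[j] * weight[i0][j]</code> through <code>x[j] * weight[i0 + B - 1][j]</code> to the respective elements of <code>vals</code>, one value of <code>j</code> at a time.
This matches the formula given earlier.</p>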
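<p>The blocked accumulation can be checked in plain C++ without an HLS toolchain. The following is a minimal sketch, not the tutorial's implementation: the sizes, the sample data, and the names <code>LinearReference</code> and <code>LinearBlocked</code> are assumptions made for illustration, integers are used instead of the fixed-point type <code>T</code>, and the ReLU step is omitted. It shows that accumulating <code>B</code> outputs together in <code>vals</code> gives the same result as computing one output at a time.</p>

```cpp
#include <array>
#include <cassert>

// Hypothetical sizes and data, chosen only for this sketch
constexpr int kInDims = 3;
constexpr int kOutDims = 4;
constexpr int kB = 2;  // block size (`B` in the text)
static_assert(kOutDims % kB == 0, "kOutDims must be a multiple of kB");

constexpr int kX[kInDims] = {1, 2, 3};
constexpr int kWeight[kOutDims][kInDims] = {
    {1, 0, 0}, {0, 1, 0}, {0, 0, 1}, {1, 1, 1}};
constexpr int kBias[kOutDims] = {10, 20, 30, 40};

using Vec = std::array<int, kOutDims>;

// Reference: compute one output element at a time
Vec LinearReference(const int x[kInDims],
                    const int weight[kOutDims][kInDims],
                    const int bias[kOutDims]) {
  Vec y{};
  for (int i = 0; i < kOutDims; ++i) {
    int val = bias[i];
    for (int j = 0; j < kInDims; ++j) val += x[j] * weight[i][j];
    y[i] = val;
  }
  return y;
}

// Blocked: kB output elements are accumulated together in `vals`
Vec LinearBlocked(const int x[kInDims],
                  const int weight[kOutDims][kInDims],
                  const int bias[kOutDims]) {
  Vec y{};
  for (int i0 = 0; i0 < kOutDims; i0 += kB) {
    int vals[kB] = {};
    for (int j = 0; j < kInDims; ++j) {
      // In the HLS code this inner loop is unrolled
      for (int i1 = 0; i1 < kB; ++i1) {
        const int i = i0 + i1;
        const int last = (j == 0) ? bias[i] : vals[i1];
        vals[i1] = last + x[j] * weight[i][j];
      }
    }
    // Write the kB finished results back to the output
    for (int i1 = 0; i1 < kB; ++i1) y[i0 + i1] = vals[i1];
  }
  return y;
}
```

<p>On a CPU the inner <code>i1</code> loop still runs sequentially; the sketch only confirms that the reordering of the loops leaves the result unchanged.</p>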
<p>With the loop unrolled, every element from <code>vals[0]</code> to <code>vals[B - 1]</code>, the <code>B</code> elements <code>bias[i0]</code> through <code>bias[i0 + B - 1]</code>, and the <code>B</code> elements <code>weight[i0][j]</code> through <code>weight[i0 + B - 1][j]</code> must all be accessed in a single cycle.
For that to work, the arrays <code>bias</code>, <code>vals</code>, and <code>weight</code> each need at least <code>B</code> ports.</p>
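<p>For <code>vals</code>, <code>#pragma HLS ARRAY_PARTITION type=complete</code> splits the array completely into its individual elements.
Without partitioning there are only two ports, so at most two elements can be read at once
(or one element read while another is written).
With complete partitioning, all elements of the array can be read and written simultaneously.
Note that a completely partitioned array is implemented with flip-flops (FF)
rather than on-chip memory (BlockRAM).</p>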
<p>Completely partitioning the array <code>vals</code>, which has <code>B</code> elements, gives the following layout.</p>
<p><a
href="point-cloud-classification-images/complete-partition.svg"><img src="point-cloud-classification-images/complete-partition.svg" width="50%" /></a></p>
<p>Although it does not appear inside <code>LinearOpt1</code>, <code>weight</code> and <code>bias</code> also need an HLS array-partitioning pragma, specified elsewhere, just as <code>vals</code> does.
To read <code>B</code> <strong>consecutive</strong> elements per cycle from <code>weight</code> and <code>bias</code>
(<code>bias[i0]</code> through <code>bias[i0 + B - 1]</code>, and <code>weight[i0][j]</code> through <code>weight[i0 + B - 1][j]</code>),
they are <strong>cyclically partitioned</strong> as shown below.
<code>weight</code> is a two-dimensional array, and since we want to partition it along its first dimension, we specify <code>dim=1</code>.
Each on-chip memory (BlockRAM)
has two ports, which allows two reads per cycle
(or one write and one read).
To read <code>B</code> elements in one cycle, it therefore suffices to partition the array into <code>BHalf = B / 2</code> parts.</p>
<div class="sourceCode" id="cb13"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> BHalf <span class="op">=</span> B <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>  TParam weight<span class="op">[</span>OutDims<span class="op">][</span>InDims<span class="op">];</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a>  TParam bias<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1</span></span></code></pre></div>
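<p>As a simple example, cyclically partitioning a two-dimensional array <code>w[8][4]</code> into four along its first dimension
(<code>factor=4 dim=1</code>) gives the layout shown in the figure below.
Partitioning into four raises the number of ports to eight, so eight consecutive elements
(for example, <code>w[0][j]</code> through <code>w[7][j]</code>)
can be read together.</p>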
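<p>In cyclic partitioning, elements are dealt out to the partitioned arrays in turn, starting from the first element
(<code>w[0][0]</code>, <code>w[1][0]</code>, <code>w[2][0]</code>, and so on).
Once every array has received an element, we return to the first array and continue filling in order.
Repeating this yields the arrangement in the figure. Consecutive elements
(such as <code>w[0][0]</code>, <code>w[1][0]</code>, <code>w[2][0]</code>, and <code>w[3][0]</code>)
end up in separate arrays, so they can be fetched at once.
Combining loop unrolling with cyclic array partitioning makes it easy to process consecutive array elements in parallel.
For this reason, <code>#pragma HLS UNROLL</code> and <code>#pragma HLS ARRAY_PARTITION</code> are often used as a pair.
The unrolling factor and the partitioning factor must be matched.
If you unroll by a factor of <code>B</code>, the array has to be cyclically partitioned into <code>B / 2</code> parts
(<code>B</code> parts also works);
otherwise you do not get <code>B</code>-way parallelism.
Likewise, if a loop is unrolled but the arrays are not partitioned at all, the accesses are serialized on the two available ports, so the unrolled iterations cannot all run in parallel.</p>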
<p><a
href="point-cloud-classification-images/cyclic-partition.svg"><img src="point-cloud-classification-images/cyclic-partition.svg" width="60%" /></a></p>
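<p>The dealing-out rule described above can be written as a small index calculation. The following is a minimal sketch under the assumption that cyclic partitioning places index <code>i</code> of the partitioned dimension into sub-array <code>i % factor</code> at position <code>i / factor</code>; the names <code>BankSlot</code> and <code>CyclicSlot</code> are hypothetical and are not part of any HLS library.</p>

```cpp
#include <cassert>

// Hypothetical helper type: where one index lands after cyclic partitioning
struct BankSlot {
  int bank;    // which of the `factor` sub-arrays holds the element
  int offset;  // position of the element inside that sub-array
};

// Indices are dealt out round-robin: 0 -> bank 0, 1 -> bank 1, ...,
// and index `factor` wraps around to bank 0 again.
constexpr BankSlot CyclicSlot(int index, int factor) {
  return BankSlot{index % factor, index / factor};
}

// w[8][4] with factor=4 dim=1: the row index selects the sub-array
static_assert(CyclicSlot(0, 4).bank == 0 && CyclicSlot(0, 4).offset == 0,
              "w[0][j] is the first row of sub-array 0");
static_assert(CyclicSlot(3, 4).bank == 3 && CyclicSlot(3, 4).offset == 0,
              "w[3][j] is the first row of sub-array 3");
static_assert(CyclicSlot(4, 4).bank == 0 && CyclicSlot(4, 4).offset == 1,
              "w[4][j] wraps around to sub-array 0");
static_assert(CyclicSlot(7, 4).bank == 3 && CyclicSlot(7, 4).offset == 1,
              "w[7][j] is the second row of sub-array 3");
```

<p>With <code>factor=4</code>, rows 0 to 3 fall into four different sub-arrays and rows 4 to 7 reuse them, so each sub-array serves two of the eight rows, which fits its two ports.</p>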
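<p>Cyclically partitioning into two along the first dimension (<code>factor=2 dim=1</code>)
gives the following layout.
Partitioning into two raises the number of ports to four, so four consecutive elements
(for example, <code>w[0][j]</code> through <code>w[3][j]</code>, or <code>w[4][j]</code> through <code>w[7][j]</code>)
can be read together.</p>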
<p><a
href="point-cloud-classification-images/cyclic-partition3.svg"><img src="point-cloud-classification-images/cyclic-partition3.svg" width="60%" /></a></p>
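<p>Cyclically partitioning into two along the second dimension (<code>factor=2 dim=2</code>)
gives the following layout.
This time, four consecutive elements along the second dimension
(for example, <code>w[i][0]</code> through <code>w[i][3]</code>)
can be accessed in one cycle.</p>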
<p><a
href="point-cloud-classification-images/cyclic-partition2.svg"><img src="point-cloud-classification-images/cyclic-partition2.svg" width="50%" /></a></p>
<p>With these examples in mind, it is clear that the pragmas shown above are the right ones for <code>weight</code> and <code>bias</code>.</p>
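<p>The port argument can also be checked by counting accesses per sub-array. The following is a simplified model rather than a description of what the HLS scheduler actually does: it assumes two ports per BlockRAM, as stated in the text, and the round-robin placement of cyclic partitioning. The names <code>MaxAccessesPerBank</code> and <code>FitsInOneCycle</code> are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption taken from the text: each BlockRAM has two ports
constexpr int kPortsPerBank = 2;

// Largest number of accesses that hit a single sub-array when the B
// consecutive indices i0, ..., i0 + B - 1 are accessed in the same cycle
// and the array is cyclically partitioned by `factor`.
int MaxAccessesPerBank(int i0, int B, int factor) {
  std::vector<int> hits(factor, 0);
  int worst = 0;
  for (int i1 = 0; i1 < B; ++i1) {
    const int bank = (i0 + i1) % factor;
    worst = std::max(worst, ++hits[bank]);
  }
  return worst;
}

// True if no sub-array is asked for more elements than it has ports
bool FitsInOneCycle(int i0, int B, int factor) {
  return MaxAccessesPerBank(i0, B, factor) <= kPortsPerBank;
}
```

<p>For example, <code>FitsInOneCycle(0, 8, 4)</code> holds because each of the four sub-arrays is accessed twice, whereas <code>FitsInOneCycle(0, 8, 2)</code> does not, because each of the two sub-arrays would have to serve four elements.</p>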
<p>Now let's turn to the second loop.
It writes the <code>B</code> elements computed in the first loop to the output <code>y</code>.</p>
<div class="sourceCode" id="cb14"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a>      <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a>        y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a>      <span class="cf">else</span></span>
<span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a>        y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb14-8"><a href="#cb14-8" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span></code></pre></div>
<p>This loop is unrolled as well, becoming the following.</p>
<div class="sourceCode" id="cb15"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span> <span class="op">{</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a>      <span class="co">// ...</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> B <span class="op">-</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>B <span class="op">-</span> <span class="dv">1</span><span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span>B <span class="op">-</span> <span class="dv">1</span><span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span><span class="dv">0</span><span class="op">];</span></span>
<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span><span class="dv">1</span><span class="op">];</span></span>
<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a>      <span class="co">// ...</span></span>
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a>      y<span class="op">[</span>i0 <span class="op">+</span> B <span class="op">-</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>B <span class="op">-</span> <span class="dv">1</span><span class="op">];</span></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span></code></pre></div>
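<p>The <code>B</code> consecutive elements <code>y[i0]</code> through <code>y[i0 + B - 1]</code> of the output must be accessed in one cycle.
Although it does not appear inside <code>LinearOpt1</code>, the array <code>y</code> should also be cyclically partitioned, as follows.</p>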
<div class="sourceCode" id="cb16"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> BHalf <span class="op">=</span> B <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a>  T y<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=y type=cyclic factor=BHalf dim=1</span></span></code></pre></div>
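<p>The input <code>x</code>, on the other hand, does not need to be partitioned, because each loop iteration accesses only one of its elements.
To run the fully connected layer with <code>B</code>-way parallelism using <code>LinearOpt1</code>, its arguments (the weight <code>weight</code>, the bias <code>bias</code>, and the output <code>y</code>) must be partitioned into <code>B / 2</code> parts along the output dimension
(no partitioning is needed when <code>B</code> is 2).</p>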
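<p>These are the main changes in <code>LinearOpt1</code>.
<code>LinearOpt1DDR</code> has been changed in the same way so that <code>B</code> outputs are computed in parallel.
It transfers the bias term <code>bias</code> of the fully connected layer, together with the part of the weight <code>weight</code> needed to compute <code>B</code> output elements, from the DRAM buffer to on-chip buffers.
Unlike in <code>LinearNaiveDDR</code>, the buffer <code>weight</code> that holds the weights is a two-dimensional array.
So that the <code>B</code> required elements can be fetched, <code>bias</code> and <code>weight</code> are partitioned into <code>BHalf = B / 2</code> parts.</p>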
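<p>In <code>BatchNorm1dReLUOpt1</code> and <code>MaxPool1dOpt1</code> as well, the loop over <code>i</code>
(the output dimension)
is split into two loops, <code>i0</code> and <code>i1</code>.
The <code>i1</code> loop is unrolled, and <code>B</code> outputs are computed in parallel.
To run batch normalization and ReLU activation with <code>B</code>-way parallelism using <code>BatchNorm1dReLUOpt1</code>, the function's input <code>x</code>, its output <code>y</code>, and the batch normalization parameters
(the scale <code>scale</code>, the bias <code>bias</code>, and the mean <code>mean</code>)
are partitioned into <code>B / 2</code> parts.
The same holds for <code>MaxPool1dOpt1</code>: to perform max pooling with <code>B</code>-way parallelism, the function's input <code>x</code> and output <code>y</code> are partitioned into <code>B / 2</code> parts
(<code>x</code> is the local feature of each point, and <code>y</code> is the global feature that represents the whole point cloud).</p>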
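<p>The array-partitioning rules for running each layer with <code>B</code>-way parallelism are summarized below.
As the rules show, no partitioning is needed for two-way parallelism.</p>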
<ul>
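<li><code>LinearOpt1</code>:
partition the weight <code>weight</code>, the bias <code>bias</code>, and the output <code>y</code> into <code>B / 2</code> parts along the output dimension
(the input <code>x</code> does not need to be partitioned)</li>
<li><code>LinearOpt1DDR</code>:
partition the output <code>y</code> into <code>B / 2</code> parts
(the input <code>x</code> does not need to be partitioned)</li>
<li><code>BatchNorm1dReLUOpt1</code>:
partition the input <code>x</code>, the output <code>y</code>, and the parameters
(the scale <code>scale</code>, the bias <code>bias</code>, and the mean <code>mean</code>)
into <code>B / 2</code> parts</li>
<li><code>MaxPool1dOpt1</code>:
partition the input <code>x</code> and the output <code>y</code> into <code>B / 2</code> parts</li>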
</ul>
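<p>The <code>B / 2</code> rule in this list can be captured in a compile-time helper. The following is a minimal sketch, assuming two ports per BlockRAM as stated earlier; <code>PartitionFactor</code>, <code>NeedsPartition</code>, and <code>kPortsPerBlockRam</code> are hypothetical names that do not appear in the original code.</p>

```cpp
#include <cassert>

// Assumption taken from the text: each BlockRAM has two ports
constexpr int kPortsPerBlockRam = 2;

// Smallest cyclic partition factor that lets B consecutive elements
// be accessed in one cycle (B / 2 for even B)
constexpr int PartitionFactor(int B) {
  return (B + kPortsPerBlockRam - 1) / kPortsPerBlockRam;
}

// A factor of 1 means the array can be left unpartitioned
constexpr bool NeedsPartition(int B) {
  return PartitionFactor(B) > 1;
}

static_assert(PartitionFactor(8) == 4, "B = 8 corresponds to factor=4");
static_assert(!NeedsPartition(2), "two-way parallelism needs no partitioning");
```

<p>For instance, <code>B = 8</code> yields <code>factor=4</code>, which agrees with the factor used in <code>InferenceFeatOpt1</code> for the buffers written by <code>LinearOpt1</code> with <code>B = 8</code>.</p>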
<p>Using these parallelized versions, the inference code for the feature extraction network and the classification network is rewritten as follows.
<code>InferenceFeatNaive</code> and <code>InferenceClsNaive</code> become <code>InferenceFeatOpt1</code> and <code>InferenceClsOpt1</code>, respectively.
The function arguments are unchanged.
The weight initialization functions, <code>InitializeFeatNaive</code> and <code>InitializeClsNaive</code>,
are reused as they are
(only their names were changed, to <code>InitializeFeatOpt1</code> and <code>InitializeClsOpt1</code>).</p>
<div class="sourceCode" id="cb17"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the PointNet feature extraction</span></span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for layer input, output, and intermediate results</span></span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a><span class="co">// `U` is the type for parameters</span></span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `N` is the expected number of input points (e.g., 1024)</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">,</span> <span class="dt">int</span> N<span class="op">&gt;</span></span>
<span id="cb17-6"><a href="#cb17-6" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceFeatOpt1<span class="op">(</span><span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb17-7"><a href="#cb17-7" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb17-8"><a href="#cb17-8" aria-hidden="true" tabindex="-1"></a>                       T feature<span class="op">[</span>kFeatDims5<span class="op">],</span></span>
<span id="cb17-9"><a href="#cb17-9" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> conv1<span class="op">,</span></span>
<span id="cb17-10"><a href="#cb17-10" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> conv2<span class="op">,</span></span>
<span id="cb17-11"><a href="#cb17-11" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> conv3<span class="op">,</span></span>
<span id="cb17-12"><a href="#cb17-12" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> conv4<span class="op">,</span></span>
<span id="cb17-13"><a href="#cb17-13" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> conv5<span class="op">,</span></span>
<span id="cb17-14"><a href="#cb17-14" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb17-15"><a href="#cb17-15" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb17-16"><a href="#cb17-16" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> bn3<span class="op">,</span></span>
<span id="cb17-17"><a href="#cb17-17" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> bn4<span class="op">,</span></span>
<span id="cb17-18"><a href="#cb17-18" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> bn5<span class="op">)</span></span>
<span id="cb17-19"><a href="#cb17-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb17-20"><a href="#cb17-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb17-21"><a href="#cb17-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-22"><a href="#cb17-22" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Zero-initialize the output feature</span></span>
<span id="cb17-23"><a href="#cb17-23" aria-hidden="true" tabindex="-1"></a>  VectorNdSetZero<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>feature<span class="op">);</span></span>
<span id="cb17-24"><a href="#cb17-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-25"><a href="#cb17-25" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute the feature</span></span>
<span id="cb17-26"><a href="#cb17-26" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> num_points<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb17-27"><a href="#cb17-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N</span></span>
<span id="cb17-28"><a href="#cb17-28" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_FLATTEN off</span></span>
<span id="cb17-29"><a href="#cb17-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-30"><a href="#cb17-30" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Input, output, and intermediate results</span></span>
<span id="cb17-31"><a href="#cb17-31" aria-hidden="true" tabindex="-1"></a>    T x0<span class="op">[</span>kFeatDims0<span class="op">];</span></span>
<span id="cb17-32"><a href="#cb17-32" aria-hidden="true" tabindex="-1"></a>    T x1<span class="op">[</span>kFeatDims1<span class="op">];</span></span>
<span id="cb17-33"><a href="#cb17-33" aria-hidden="true" tabindex="-1"></a>    T x2<span class="op">[</span>kFeatDims1<span class="op">];</span></span>
<span id="cb17-34"><a href="#cb17-34" aria-hidden="true" tabindex="-1"></a>    T x3<span class="op">[</span>kFeatDims2<span class="op">];</span></span>
<span id="cb17-35"><a href="#cb17-35" aria-hidden="true" tabindex="-1"></a>    T x4<span class="op">[</span>kFeatDims2<span class="op">];</span></span>
<span id="cb17-36"><a href="#cb17-36" aria-hidden="true" tabindex="-1"></a>    T x5<span class="op">[</span>kFeatDims3<span class="op">];</span></span>
<span id="cb17-37"><a href="#cb17-37" aria-hidden="true" tabindex="-1"></a>    T x6<span class="op">[</span>kFeatDims3<span class="op">];</span></span>
<span id="cb17-38"><a href="#cb17-38" aria-hidden="true" tabindex="-1"></a>    T x7<span class="op">[</span>kFeatDims4<span class="op">];</span></span>
<span id="cb17-39"><a href="#cb17-39" aria-hidden="true" tabindex="-1"></a>    T x8<span class="op">[</span>kFeatDims4<span class="op">];</span></span>
<span id="cb17-40"><a href="#cb17-40" aria-hidden="true" tabindex="-1"></a>    T x9<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb17-41"><a href="#cb17-41" aria-hidden="true" tabindex="-1"></a>    T x10<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb17-42"><a href="#cb17-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-43"><a href="#cb17-43" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x3 type=cyclic factor=4 dim=1</span></span>
<span id="cb17-44"><a href="#cb17-44" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x5 type=cyclic factor=4 dim=1</span></span>
<span id="cb17-45"><a href="#cb17-45" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x7 type=cyclic factor=8 dim=1</span></span>
<span id="cb17-46"><a href="#cb17-46" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x9 type=cyclic factor=64 dim=1</span></span>
<span id="cb17-47"><a href="#cb17-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-48"><a href="#cb17-48" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Read a point from a DDR memory</span></span>
<span id="cb17-49"><a href="#cb17-49" aria-hidden="true" tabindex="-1"></a>    ReadPointNaive<span class="op">&lt;</span>T<span class="op">&gt;(</span>point_cloud<span class="op">,</span> i<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb17-50"><a href="#cb17-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-51"><a href="#cb17-51" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Compute a point feature</span></span>
<span id="cb17-52"><a href="#cb17-52" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-53"><a href="#cb17-53" aria-hidden="true" tabindex="-1"></a>      x0<span class="op">,</span> x1<span class="op">,</span> conv1<span class="op">-&gt;</span>weight<span class="op">,</span> conv1<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-54"><a href="#cb17-54" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-55"><a href="#cb17-55" aria-hidden="true" tabindex="-1"></a>      x1<span class="op">,</span> x2<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-56"><a href="#cb17-56" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb17-57"><a href="#cb17-57" aria-hidden="true" tabindex="-1"></a>      x2<span class="op">,</span> x3<span class="op">,</span> conv2<span class="op">-&gt;</span>weight<span class="op">,</span> conv2<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-58"><a href="#cb17-58" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-59"><a href="#cb17-59" aria-hidden="true" tabindex="-1"></a>      x3<span class="op">,</span> x4<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-60"><a href="#cb17-60" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb17-61"><a href="#cb17-61" aria-hidden="true" tabindex="-1"></a>      x4<span class="op">,</span> x5<span class="op">,</span> conv3<span class="op">-&gt;</span>weight<span class="op">,</span> conv3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-62"><a href="#cb17-62" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-63"><a href="#cb17-63" aria-hidden="true" tabindex="-1"></a>      x5<span class="op">,</span> x6<span class="op">,</span> bn3<span class="op">-&gt;</span>scale<span class="op">,</span> bn3<span class="op">-&gt;</span>bias<span class="op">,</span> bn3<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-64"><a href="#cb17-64" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">16</span><span class="op">&gt;(</span></span>
<span id="cb17-65"><a href="#cb17-65" aria-hidden="true" tabindex="-1"></a>      x6<span class="op">,</span> x7<span class="op">,</span> conv4<span class="op">-&gt;</span>weight<span class="op">,</span> conv4<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-66"><a href="#cb17-66" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-67"><a href="#cb17-67" aria-hidden="true" tabindex="-1"></a>      x7<span class="op">,</span> x8<span class="op">,</span> bn4<span class="op">-&gt;</span>scale<span class="op">,</span> bn4<span class="op">-&gt;</span>bias<span class="op">,</span> bn4<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-68"><a href="#cb17-68" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">128</span><span class="op">&gt;(</span></span>
<span id="cb17-69"><a href="#cb17-69" aria-hidden="true" tabindex="-1"></a>      x8<span class="op">,</span> x9<span class="op">,</span> conv5<span class="op">-&gt;</span>weight<span class="op">,</span> conv5<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-70"><a href="#cb17-70" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-71"><a href="#cb17-71" aria-hidden="true" tabindex="-1"></a>      x9<span class="op">,</span> x10<span class="op">,</span> bn5<span class="op">-&gt;</span>scale<span class="op">,</span> bn5<span class="op">-&gt;</span>bias<span class="op">,</span> bn5<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-72"><a href="#cb17-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-73"><a href="#cb17-73" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Update the output feature</span></span>
<span id="cb17-74"><a href="#cb17-74" aria-hidden="true" tabindex="-1"></a>    MaxPool1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span>x10<span class="op">,</span> feature<span class="op">);</span></span>
<span id="cb17-75"><a href="#cb17-75" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb17-76"><a href="#cb17-76" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb17-77"><a href="#cb17-77" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-78"><a href="#cb17-78" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the classification network</span></span>
<span id="cb17-79"><a href="#cb17-79" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for layer input, output, and intermediate results</span></span>
<span id="cb17-80"><a href="#cb17-80" aria-hidden="true" tabindex="-1"></a><span class="co">// `U` is the type for parameters</span></span>
<span id="cb17-81"><a href="#cb17-81" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">&gt;</span></span>
<span id="cb17-82"><a href="#cb17-82" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceClsOpt1<span class="op">(</span><span class="at">const</span> T feature<span class="op">[</span>kFeatDims5<span class="op">],</span></span>
<span id="cb17-83"><a href="#cb17-83" aria-hidden="true" tabindex="-1"></a>                      <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb17-84"><a href="#cb17-84" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> LinearParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;*</span> fc3<span class="op">,</span></span>
<span id="cb17-85"><a href="#cb17-85" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb17-86"><a href="#cb17-86" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> BatchNorm1dParams<span class="op">&lt;</span>U<span class="op">,</span> kClsDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb17-87"><a href="#cb17-87" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params1<span class="op">,</span></span>
<span id="cb17-88"><a href="#cb17-88" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params2<span class="op">,</span></span>
<span id="cb17-89"><a href="#cb17-89" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> params3<span class="op">)</span></span>
<span id="cb17-90"><a href="#cb17-90" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb17-91"><a href="#cb17-91" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb17-92"><a href="#cb17-92" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-93"><a href="#cb17-93" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>kFeatDims5 <span class="op">==</span> kClsDims0<span class="op">,</span></span>
<span id="cb17-94"><a href="#cb17-94" aria-hidden="true" tabindex="-1"></a>                <span class="st">&quot;Feature dimension should be equal to the input dimension&quot;</span><span class="op">);</span></span>
<span id="cb17-95"><a href="#cb17-95" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-96"><a href="#cb17-96" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Input, output, and intermediate results</span></span>
<span id="cb17-97"><a href="#cb17-97" aria-hidden="true" tabindex="-1"></a>  T x0<span class="op">[</span>kClsDims1<span class="op">];</span></span>
<span id="cb17-98"><a href="#cb17-98" aria-hidden="true" tabindex="-1"></a>  T x1<span class="op">[</span>kClsDims1<span class="op">];</span></span>
<span id="cb17-99"><a href="#cb17-99" aria-hidden="true" tabindex="-1"></a>  T x2<span class="op">[</span>kClsDims2<span class="op">];</span></span>
<span id="cb17-100"><a href="#cb17-100" aria-hidden="true" tabindex="-1"></a>  T x3<span class="op">[</span>kClsDims2<span class="op">];</span></span>
<span id="cb17-101"><a href="#cb17-101" aria-hidden="true" tabindex="-1"></a>  T x4<span class="op">[</span>kClsDims3<span class="op">];</span></span>
<span id="cb17-102"><a href="#cb17-102" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-103"><a href="#cb17-103" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x0 type=cyclic factor=8 dim=1</span></span>
<span id="cb17-104"><a href="#cb17-104" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=x2 type=cyclic factor=4 dim=1</span></span>
<span id="cb17-105"><a href="#cb17-105" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-106"><a href="#cb17-106" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute logits</span></span>
<span id="cb17-107"><a href="#cb17-107" aria-hidden="true" tabindex="-1"></a>  LinearOpt1DDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims0<span class="op">,</span> kClsDims1<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">16</span><span class="op">&gt;(</span></span>
<span id="cb17-108"><a href="#cb17-108" aria-hidden="true" tabindex="-1"></a>    feature<span class="op">,</span> x0<span class="op">,</span> params1<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb17-109"><a href="#cb17-109" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-110"><a href="#cb17-110" aria-hidden="true" tabindex="-1"></a>    x0<span class="op">,</span> x1<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-111"><a href="#cb17-111" aria-hidden="true" tabindex="-1"></a>  LinearOpt1DDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">,</span> kClsDims2<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb17-112"><a href="#cb17-112" aria-hidden="true" tabindex="-1"></a>    x1<span class="op">,</span> x2<span class="op">,</span> params2<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb17-113"><a href="#cb17-113" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-114"><a href="#cb17-114" aria-hidden="true" tabindex="-1"></a>    x2<span class="op">,</span> x3<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb17-115"><a href="#cb17-115" aria-hidden="true" tabindex="-1"></a>  LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb17-116"><a href="#cb17-116" aria-hidden="true" tabindex="-1"></a>    x3<span class="op">,</span> x4<span class="op">,</span> fc3<span class="op">-&gt;</span>weight<span class="op">,</span> fc3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb17-117"><a href="#cb17-117" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-118"><a href="#cb17-118" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Write the result</span></span>
<span id="cb17-119"><a href="#cb17-119" aria-hidden="true" tabindex="-1"></a>  WriteTensor1dNaive<span class="op">&lt;</span>T<span class="op">,</span> kClsDims3<span class="op">&gt;(</span>out_logits<span class="op">,</span> x4<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb17-120"><a href="#cb17-120" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
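<p>Each layer function takes its degree of parallelism as a template argument.
For example, the fourth fully connected layer of the feature extraction network
(<code>PointNetFeat::conv4</code> in the PyTorch model)
runs 16-way parallel, and the last fully connected layer (<code>PointNetFeat::conv5</code>)
runs 128-way parallel.
The batch normalization layers and max pooling, by contrast, run 2-way parallel.
How the degree of parallelism was chosen for each layer is described later.</p>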
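<p>To make the role of the parallelism template argument concrete, the following is a minimal sketch of a fully connected layer whose last template parameter <code>B</code> sets how many output channels are computed at once. <code>LinearSketch</code> is a hypothetical, simplified stand-in and not the <code>LinearOpt1</code> used in this article; it assumes a single element type and a two-dimensional weight array, and the pragma placement is only one plausible arrangement. An ordinary C++ compiler ignores the HLS pragmas, so the function can also be checked in software.</p>

```cpp
// Hypothetical sketch (not the article's LinearOpt1): y = W x + b,
// computing B output channels per iteration of the outer loop.
template <typename T, int InDims, int OutDims, int B>
void LinearSketch(const T x[InDims],
                  T y[OutDims],
                  const T weight[OutDims][InDims],
                  const T bias[OutDims])
{
  static_assert(B > 0 && OutDims % B == 0,
                "OutDims must be a multiple of the parallelism B");

  for (int i0 = 0; i0 < OutDims; i0 += B) {
    // One accumulator per output channel handled in this block
    T acc[B];
#pragma HLS ARRAY_PARTITION variable=acc type=complete dim=1

    // Start each accumulator from the bias of its output channel
    for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
      acc[i1] = bias[i0 + i1];
    }

    // Walk over the inputs; the inner loop is unrolled B times so that
    // B multiply-accumulates can be issued together
    for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
      for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
        acc[i1] += x[j] * weight[i0 + i1][j];
      }
    }

    // Write the B finished outputs back
    for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
      y[i0 + i1] = acc[i1];
    }
  }
}
```

<p>Calling <code>LinearSketch&lt;float, 64, 128, 16&gt;(x, y, weight, bias)</code> would request 16 multiply-accumulate units for that layer; whether the tool can actually keep them busy depends on how the arrays involved are partitioned.</p>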
<p>Next, the top-level function of the IP core, <code>PointNetClsTop</code>, is shown below.</p>
<div class="sourceCode" id="cb18"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb18-4"><a href="#cb18-4" aria-hidden="true" tabindex="-1"></a>                    <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params1<span class="op">,</span></span>
<span id="cb18-6"><a href="#cb18-6" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params2<span class="op">,</span></span>
<span id="cb18-7"><a href="#cb18-7" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params3<span class="op">,</span></span>
<span id="cb18-8"><a href="#cb18-8" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params4<span class="op">,</span></span>
<span id="cb18-9"><a href="#cb18-9" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params5<span class="op">,</span></span>
<span id="cb18-10"><a href="#cb18-10" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params1<span class="op">,</span></span>
<span id="cb18-11"><a href="#cb18-11" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params2<span class="op">,</span></span>
<span id="cb18-12"><a href="#cb18-12" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params3<span class="op">)</span></span>
<span id="cb18-13"><a href="#cb18-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-14"><a href="#cb18-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0</span></span>
<span id="cb18-15"><a href="#cb18-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0</span></span>
<span id="cb18-16"><a href="#cb18-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0</span></span>
<span id="cb18-17"><a href="#cb18-17" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params2 offset=slave bundle=gmem0</span></span>
<span id="cb18-18"><a href="#cb18-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params3 offset=slave bundle=gmem0</span></span>
<span id="cb18-19"><a href="#cb18-19" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params4 offset=slave bundle=gmem0</span></span>
<span id="cb18-20"><a href="#cb18-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params5 offset=slave bundle=gmem0</span></span>
<span id="cb18-21"><a href="#cb18-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params1 offset=slave bundle=gmem0</span></span>
<span id="cb18-22"><a href="#cb18-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params2 offset=slave bundle=gmem0</span></span>
<span id="cb18-23"><a href="#cb18-23" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0</span></span>
<span id="cb18-24"><a href="#cb18-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-25"><a href="#cb18-25" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=op_mode bundle=control</span></span>
<span id="cb18-26"><a href="#cb18-26" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control</span></span>
<span id="cb18-27"><a href="#cb18-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=num_points bundle=control</span></span>
<span id="cb18-28"><a href="#cb18-28" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=out_logits bundle=control</span></span>
<span id="cb18-29"><a href="#cb18-29" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control</span></span>
<span id="cb18-30"><a href="#cb18-30" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params2 bundle=control</span></span>
<span id="cb18-31"><a href="#cb18-31" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params3 bundle=control</span></span>
<span id="cb18-32"><a href="#cb18-32" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params4 bundle=control</span></span>
<span id="cb18-33"><a href="#cb18-33" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params5 bundle=control</span></span>
<span id="cb18-34"><a href="#cb18-34" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params1 bundle=control</span></span>
<span id="cb18-35"><a href="#cb18-35" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params2 bundle=control</span></span>
<span id="cb18-36"><a href="#cb18-36" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control</span></span>
<span id="cb18-37"><a href="#cb18-37" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=return bundle=control</span></span>
<span id="cb18-38"><a href="#cb18-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-39"><a href="#cb18-39" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Parameters for feature extraction</span></span>
<span id="cb18-40"><a href="#cb18-40" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;</span> feat_conv1<span class="op">;</span></span>
<span id="cb18-41"><a href="#cb18-41" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;</span> feat_conv2<span class="op">;</span></span>
<span id="cb18-42"><a href="#cb18-42" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;</span> feat_conv3<span class="op">;</span></span>
<span id="cb18-43"><a href="#cb18-43" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;</span> feat_conv4<span class="op">;</span></span>
<span id="cb18-44"><a href="#cb18-44" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;</span> feat_conv5<span class="op">;</span></span>
<span id="cb18-45"><a href="#cb18-45" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims1<span class="op">&gt;</span> feat_bn1<span class="op">;</span></span>
<span id="cb18-46"><a href="#cb18-46" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims2<span class="op">&gt;</span> feat_bn2<span class="op">;</span></span>
<span id="cb18-47"><a href="#cb18-47" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims3<span class="op">&gt;</span> feat_bn3<span class="op">;</span></span>
<span id="cb18-48"><a href="#cb18-48" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims4<span class="op">&gt;</span> feat_bn4<span class="op">;</span></span>
<span id="cb18-49"><a href="#cb18-49" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kFeatDims5<span class="op">&gt;</span> feat_bn5<span class="op">;</span></span>
<span id="cb18-50"><a href="#cb18-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-51"><a href="#cb18-51" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv2.weight type=cyclic factor=4 dim=1</span></span>
<span id="cb18-52"><a href="#cb18-52" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv2.bias type=cyclic factor=4 dim=1</span></span>
<span id="cb18-53"><a href="#cb18-53" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv3.weight type=cyclic factor=4 dim=1</span></span>
<span id="cb18-54"><a href="#cb18-54" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv3.bias type=cyclic factor=4 dim=1</span></span>
<span id="cb18-55"><a href="#cb18-55" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv4.weight type=cyclic factor=8 dim=1</span></span>
<span id="cb18-56"><a href="#cb18-56" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv4.bias type=cyclic factor=8 dim=1</span></span>
<span id="cb18-57"><a href="#cb18-57" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv5.weight type=cyclic factor=64 dim=1</span></span>
<span id="cb18-58"><a href="#cb18-58" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=feat_conv5.bias type=cyclic factor=64 dim=1</span></span>
<span id="cb18-59"><a href="#cb18-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-60"><a href="#cb18-60" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Parameters for classification network</span></span>
<span id="cb18-61"><a href="#cb18-61" aria-hidden="true" tabindex="-1"></a>  <span class="co">// LinearParams&lt;param_t, kClsDims0, kClsDims1&gt; cls_fc1;</span></span>
<span id="cb18-62"><a href="#cb18-62" aria-hidden="true" tabindex="-1"></a>  <span class="co">// LinearParams&lt;param_t, kClsDims1, kClsDims2&gt; cls_fc2;</span></span>
<span id="cb18-63"><a href="#cb18-63" aria-hidden="true" tabindex="-1"></a>  LinearParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;</span> cls_fc3<span class="op">;</span></span>
<span id="cb18-64"><a href="#cb18-64" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims1<span class="op">&gt;</span> cls_bn1<span class="op">;</span></span>
<span id="cb18-65"><a href="#cb18-65" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dParams<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">,</span> kClsDims2<span class="op">&gt;</span> cls_bn2<span class="op">;</span></span>
<span id="cb18-66"><a href="#cb18-66" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-67"><a href="#cb18-67" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Extracted feature</span></span>
<span id="cb18-68"><a href="#cb18-68" aria-hidden="true" tabindex="-1"></a>  <span class="dt">value_t</span> feature<span class="op">[</span>kFeatDims5<span class="op">];</span></span>
<span id="cb18-69"><a href="#cb18-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-70"><a href="#cb18-70" aria-hidden="true" tabindex="-1"></a>  <span class="cf">if</span> <span class="op">(</span>op_mode <span class="op">==</span> kModeInitWeights<span class="op">)</span> <span class="op">{</span></span>
<span id="cb18-71"><a href="#cb18-71" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Initialize the PointNet feature extraction network</span></span>
<span id="cb18-72"><a href="#cb18-72" aria-hidden="true" tabindex="-1"></a>    InitializeFeatOpt1<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb18-73"><a href="#cb18-73" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_conv1<span class="op">,</span> <span class="op">&amp;</span>feat_conv2<span class="op">,</span> <span class="op">&amp;</span>feat_conv3<span class="op">,</span> <span class="op">&amp;</span>feat_conv4<span class="op">,</span> <span class="op">&amp;</span>feat_conv5<span class="op">,</span></span>
<span id="cb18-74"><a href="#cb18-74" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_bn1<span class="op">,</span> <span class="op">&amp;</span>feat_bn2<span class="op">,</span> <span class="op">&amp;</span>feat_bn3<span class="op">,</span> <span class="op">&amp;</span>feat_bn4<span class="op">,</span> <span class="op">&amp;</span>feat_bn5<span class="op">,</span></span>
<span id="cb18-75"><a href="#cb18-75" aria-hidden="true" tabindex="-1"></a>      feat_params1<span class="op">,</span> feat_params2<span class="op">,</span> feat_params3<span class="op">,</span> feat_params4<span class="op">,</span> feat_params5<span class="op">);</span></span>
<span id="cb18-76"><a href="#cb18-76" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Initialize the classification network</span></span>
<span id="cb18-77"><a href="#cb18-77" aria-hidden="true" tabindex="-1"></a>    InitializeClsOpt1<span class="op">&lt;</span><span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb18-78"><a href="#cb18-78" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>cls_fc3<span class="op">,</span> <span class="op">&amp;</span>cls_bn1<span class="op">,</span> <span class="op">&amp;</span>cls_bn2<span class="op">,</span></span>
<span id="cb18-79"><a href="#cb18-79" aria-hidden="true" tabindex="-1"></a>      cls_params1<span class="op">,</span> cls_params2<span class="op">,</span> cls_params3<span class="op">);</span></span>
<span id="cb18-80"><a href="#cb18-80" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span> <span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>op_mode <span class="op">==</span> kModeInference<span class="op">)</span> <span class="op">{</span></span>
<span id="cb18-81"><a href="#cb18-81" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Run the PointNet feature extraction</span></span>
<span id="cb18-82"><a href="#cb18-82" aria-hidden="true" tabindex="-1"></a>    InferenceFeatOpt1<span class="op">&lt;</span><span class="dt">value_t</span><span class="op">,</span> <span class="dt">param_t</span><span class="op">,</span> <span class="dv">1024</span><span class="op">&gt;(</span></span>
<span id="cb18-83"><a href="#cb18-83" aria-hidden="true" tabindex="-1"></a>      point_cloud<span class="op">,</span> num_points<span class="op">,</span> feature<span class="op">,</span></span>
<span id="cb18-84"><a href="#cb18-84" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_conv1<span class="op">,</span> <span class="op">&amp;</span>feat_conv2<span class="op">,</span> <span class="op">&amp;</span>feat_conv3<span class="op">,</span> <span class="op">&amp;</span>feat_conv4<span class="op">,</span> <span class="op">&amp;</span>feat_conv5<span class="op">,</span></span>
<span id="cb18-85"><a href="#cb18-85" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>feat_bn1<span class="op">,</span> <span class="op">&amp;</span>feat_bn2<span class="op">,</span> <span class="op">&amp;</span>feat_bn3<span class="op">,</span> <span class="op">&amp;</span>feat_bn4<span class="op">,</span> <span class="op">&amp;</span>feat_bn5<span class="op">);</span></span>
<span id="cb18-86"><a href="#cb18-86" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-87"><a href="#cb18-87" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Run the classification</span></span>
<span id="cb18-88"><a href="#cb18-88" aria-hidden="true" tabindex="-1"></a>    InferenceClsOpt1<span class="op">&lt;</span><span class="dt">value_t</span><span class="op">,</span> <span class="dt">param_t</span><span class="op">&gt;(</span></span>
<span id="cb18-89"><a href="#cb18-89" aria-hidden="true" tabindex="-1"></a>      feature<span class="op">,</span> out_logits<span class="op">,</span></span>
<span id="cb18-90"><a href="#cb18-90" aria-hidden="true" tabindex="-1"></a>      <span class="op">&amp;</span>cls_fc3<span class="op">,</span> <span class="op">&amp;</span>cls_bn1<span class="op">,</span> <span class="op">&amp;</span>cls_bn2<span class="op">,</span></span>
<span id="cb18-91"><a href="#cb18-91" aria-hidden="true" tabindex="-1"></a>      cls_params1<span class="op">,</span> cls_params2<span class="op">,</span> cls_params3<span class="op">);</span></span>
<span id="cb18-92"><a href="#cb18-92" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb18-93"><a href="#cb18-93" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
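<p>The function's input and output ports are exactly the same as before.
Compared with the previous version, <code>#pragma HLS ARRAY_PARTITION</code> has been added to partition the buffers that hold layer inputs, outputs, and parameters
(<code>feat_conv5.weight</code>, <code>feat_conv5.bias</code>, <code>x3</code>, <code>x5</code>, and so on).
The partition count (<code>factor</code>)
follows the rule described above.
For example, looking at <code>InferenceFeatOpt1</code> and <code>PointNetClsTop</code>, the last fully-connected layer of the feature extraction network should run 128-way parallel, so its output buffer <code>x9</code> and its two parameters, <code>feat_conv5.weight</code> and <code>feat_conv5.bias</code>, are each split into 64 partitions
(the drawback is that these pragmas end up scattered across several places).
Likewise, looking at <code>InferenceClsOpt1</code> and <code>PointNetClsTop</code>, the first fully-connected layer of the classification network runs 16-way parallel, so its output buffer <code>x0</code> is split into 8 partitions.
The batch normalization layers and max pooling run 2-way parallel, so their arrays do not need to be partitioned.</p>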
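<p>The relationship between parallelism and partition count in the examples above (128 to 64, 16 to 8, 2 to 1) can be sketched in plain C++. This is a minimal sketch under the assumption that the rule is "half the parallelism, rounded up", which would follow if each on-chip memory bank offers two ports. The helper names <code>PartitionFactor</code> and <code>CheckPartitionExamples</code> are hypothetical and are not part of the author's code.</p>

```cpp
#include <cassert>

// Hypothetical helper (not from the original source).
// Assumption: each memory bank has two ports, so a loop unrolled
// `parallelism` times needs ceil(parallelism / 2) banks.
constexpr int PartitionFactor(int parallelism)
{
  return (parallelism + 1) / 2;
}

// Check the values quoted in the text
inline void CheckPartitionExamples()
{
  assert(PartitionFactor(128) == 64);  // last FC layer of the feature network
  assert(PartitionFactor(16) == 8);    // first FC layer of the classifier
  assert(PartitionFactor(2) == 1);     // batch normalization and max pooling
}
```

<p>A factor of 1 means that the array is left unpartitioned, which matches the statement that the 2-way parallel layers need no partitioning.</p>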
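<p>As noted earlier, partitioning an array increases the number of ports so that more elements can be read at once, but it also consumes more of the scarce on-chip memory.
The goal is to raise parallelism as far as possible while keeping on-chip memory usage in check.
Parallelism is raised where it shortens inference time the most
(for example, the last fully-connected layer of the feature extraction network)
and lowered where it has little effect (for example, the batch normalization layers).</p>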
<p>Let us now compare the execution cycle counts of each layer (the clock frequency is 150MHz).
The results for the feature extraction network are as follows.</p>
<table>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Layer</th>
<th style="text-align: left;"><code>InferenceFeatNaive</code></th>
<th style="text-align: left;"><code>InferenceFeatOpt1</code></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 1
(<code>PointNetFeat::conv1</code>)</td>
<td style="text-align: left;">577 (3.843us)</td>
<td style="text-align: left;">321 (2.138us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetFeat::bn1</code>)</td>
<td style="text-align: left;">68 (0.453us)</td>
<td style="text-align: left;">36 (0.240us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 2
(<code>PointNetFeat::conv2</code>)</td>
<td style="text-align: left;">4,481 (29.84us)</td>
<td style="text-align: left;">569 (3.790us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetFeat::bn2</code>)</td>
<td style="text-align: left;">68 (0.453us)</td>
<td style="text-align: left;">36 (0.240us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 3
(<code>PointNetFeat::conv3</code>)</td>
<td style="text-align: left;">4,481 (29.84us)</td>
<td style="text-align: left;">569 (3.790us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetFeat::bn3</code>)</td>
<td style="text-align: left;">68 (0.453us)</td>
<td style="text-align: left;">36 (0.240us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 4
(<code>PointNetFeat::conv4</code>)</td>
<td style="text-align: left;">8,961 (59.68us)</td>
<td style="text-align: left;">569 (3.790us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetFeat::bn4</code>)</td>
<td style="text-align: left;">132 (0.879us)</td>
<td style="text-align: left;">68 (0.453us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 5
(<code>PointNetFeat::conv5</code>)</td>
<td style="text-align: left;">137,217 (914.0us)</td>
<td style="text-align: left;">1,081 (7.199us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetFeat::bn5</code>)</td>
<td style="text-align: left;">1,028 (6.846us)</td>
<td style="text-align: left;">516 (3.437us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Max pooling layer</td>
<td style="text-align: left;">1,026 (6.833us)</td>
<td style="text-align: left;">514 (3.423us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Total (1 iteration)</td>
<td style="text-align: left;">158,149 (1.053ms)</td>
<td style="text-align: left;">4,357 (29.02us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Total (1,024 iterations)</td>
<td style="text-align: left;">161,945,604 (1.079s)</td>
<td style="text-align: left;">4,462,596 (29.72ms)</td>
</tr>
</tbody>
</table>
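<p>The cycle counts and times in the table above are related through the clock period, and the effect of an optimization on one layer is the ratio of the two entries in its row. The sketch below shows both calculations. The function names are hypothetical, and the clock period of about 6.66 ns (roughly 150MHz) is an assumption inferred from the table entries rather than a value stated in the text.</p>

```cpp
#include <cassert>

// Convert a cycle count to microseconds for a given clock period.
// Assumption: the caller supplies the period; about 6.66 ns reproduces
// most entries in the table, but that value is inferred, not given.
inline double CyclesToMicroseconds(long cycles, double period_ns)
{
  assert(cycles >= 0 && period_ns > 0.0);
  return static_cast<double>(cycles) * period_ns / 1000.0;
}

// Ratio of the cycle count before an optimization to the count after it
inline double Speedup(long cycles_before, long cycles_after)
{
  assert(cycles_before >= 0 && cycles_after > 0);
  return static_cast<double>(cycles_before) / static_cast<double>(cycles_after);
}
```

<p>For instance, <code>Speedup(137217, 1081)</code> evaluates the ratio for the fifth fully-connected layer, and <code>CyclesToMicroseconds(577, 6.66)</code> converts the first entry of the table.</p>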
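<p>In the feature extraction network, the last fully-connected layer is, as expected, the bottleneck.
Running it 128-way parallel cuts its execution time by a factor of 126.9
(from 137,217 cycles to 1,081 cycles).
The fourth fully-connected layer, run 16-way parallel, likewise becomes 15.75 times faster
(from 8,961 cycles to 569 cycles).
Exploiting the data parallelism in the fully-connected, batch normalization, and max pooling layers shortened the inference time.
The results for the classification network are as follows.</p>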
<table>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Layer</th>
<th style="text-align: left;"><code>InferenceClsNaive</code></th>
<th style="text-align: left;"><code>InferenceClsOpt1</code></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 1
(<code>PointNetCls::fc1</code>)</td>
<td style="text-align: left;">1,056,279 (7.035ms)</td>
<td style="text-align: left;">558,071 (3.717ms)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetCls::bn1</code>)</td>
<td style="text-align: left;">516 (3.437us)</td>
<td style="text-align: left;">260 (1.732us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 2
(<code>PointNetCls::fc2</code>)</td>
<td style="text-align: left;">266,007 (1.772ms)</td>
<td style="text-align: left;">148,183 (987.0us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Batch normalization + ReLU
(<code>PointNetCls::bn2</code>)</td>
<td style="text-align: left;">260 (1.732us)</td>
<td style="text-align: left;">132 (0.879us)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Fully-connected layer 3
(<code>PointNetCls::fc3</code>)</td>
<td style="text-align: left;">10,481 (69.80us)</td>
<td style="text-align: left;">5,261 (35.04us)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Total</td>
<td style="text-align: left;">1,333,605 (8.882ms)</td>
<td style="text-align: left;">711,969 (4.742ms)</td>
</tr>
</tbody>
</table>
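<p>The first fully-connected layer now runs 16-way parallel, yet its execution time shrank by a factor of only 1.89
(from 1,056,279 cycles to 558,071 cycles).
As described earlier, the first two fully-connected layers of the classification network do not keep their parameters in on-chip buffers; instead, they transfer only the required portion from a DRAM buffer.
The matrix multiplications and additions run 16-way parallel, but the data transfer takes just as long as before, which explains this result.
The same holds for the second fully-connected layer: 8-way parallelism was specified, but the execution time dropped by a factor of only 1.80
(from 266,007 cycles to 148,183 cycles).</p>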
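<p>One way to read these numbers is with a simple two-term model, in which the cycle count of a layer is a fixed transfer term plus a compute term that shrinks in proportion to the parallelism. The sketch below fits such a model to two measurements. The model itself is an assumption made for illustration (it ignores loop overheads and any overlap between transfer and compute, and it treats the naive version as parallelism 1), and the names <code>TwoTermModel</code>, <code>FitTwoTermModel</code>, and <code>PredictCycles</code> are hypothetical.</p>

```cpp
#include <cassert>

// Hypothetical two-term model (an assumption, not from the original source):
//   cycles(P) = transfer + compute / P
// where P is the parallelism of the arithmetic and `transfer` does not scale.
struct TwoTermModel
{
  double transfer;  // cycles that parallelism does not reduce
  double compute;   // arithmetic cycles at parallelism 1
};

// Fit the model to two measurements: one at parallelism 1, one at parallelism P
inline TwoTermModel FitTwoTermModel(
  double cycles_serial, double cycles_parallel, int parallelism)
{
  assert(parallelism > 1);
  const double p = static_cast<double>(parallelism);
  // cycles_serial - cycles_parallel = compute * (1 - 1 / P)
  const double compute = (cycles_serial - cycles_parallel) * p / (p - 1.0);
  return TwoTermModel{cycles_serial - compute, compute};
}

// Predicted cycle count at a given parallelism
inline double PredictCycles(const TwoTermModel& model, int parallelism)
{
  assert(parallelism >= 1);
  return model.transfer + model.compute / static_cast<double>(parallelism);
}
```

<p>Fitting the first fully-connected layer (1,056,279 and 558,071 cycles, 16-way parallel) attributes roughly 525,000 cycles to the transfer term. Under this model, no amount of extra arithmetic parallelism could bring the layer below that figure, which is consistent with the explanation above.</p>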
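<p>In the current implementation, the input and output ports are 32 bits wide and transfer one <code>float</code> per cycle.
Widening the ports so that several values are transferred per cycle shortens the data transfer time.
Later, the port width is widened from 32 to 64 bits so that two <code>float</code> values are transferred per cycle.</p>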
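<p>The idea of carrying two <code>float</code> values in one 64-bit word can be sketched in plain C++ as follows. This is only an illustration: the function names are hypothetical, the choice of placing the first value in the lower 32 bits is an assumption, and HLS code would typically use an arbitrary-precision integer type rather than <code>std::uint64_t</code>. The author's actual implementation may differ.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack two 32-bit floats into one 64-bit word.
// Assumption: the first value occupies the lower 32 bits.
inline std::uint64_t PackTwoFloats(float lo, float hi)
{
  static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
  std::uint32_t lo_bits;
  std::uint32_t hi_bits;
  std::memcpy(&lo_bits, &lo, sizeof(lo_bits));
  std::memcpy(&hi_bits, &hi, sizeof(hi_bits));
  return (static_cast<std::uint64_t>(hi_bits) << 32) | lo_bits;
}

// Recover the two floats from a 64-bit word
inline void UnpackTwoFloats(std::uint64_t word, float& lo, float& hi)
{
  const std::uint32_t lo_bits = static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
  const std::uint32_t hi_bits = static_cast<std::uint32_t>(word >> 32);
  std::memcpy(&lo, &lo_bits, sizeof(lo));
  std::memcpy(&hi, &hi_bits, sizeof(hi));
}

// Number of 64-bit words needed to carry `num_floats` values
inline int NumWords64(int num_floats)
{
  assert(num_floats >= 0);
  return (num_floats + 1) / 2;
}
```

<p>With one word per cycle, the number of transfer cycles for a buffer of <code>n</code> values drops from about <code>n</code> to about <code>NumWords64(n)</code>.</p>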
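<p>The IP core has two operation modes; the weight initialization mode has been left completely untouched.
Weight initialization runs only once, before the IP core is put to use, and has no bearing on the network's inference time.</p>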
<p>This completes the parallelization of inference.
See <code>hls/src/top_opt1.cpp</code> for details.</p>
<h2 id="並列化その2-タスク並列性の活用">Parallelization, part 2
(exploiting task parallelism)</h2>
<p>The computation inside each layer is now parallelized, but the feature extraction network still leaves room for further speedup.
Let us take another look at its inference code.</p>
<div class="sourceCode" id="cb19"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute the feature</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> num_points<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N</span></span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_FLATTEN off</span></span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a>    <span class="co">// ...</span></span>
<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Read a point from a DDR memory</span></span>
<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a>    ReadPointNaive<span class="op">&lt;</span>T<span class="op">&gt;(</span>point_cloud<span class="op">,</span> i<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb19-10"><a href="#cb19-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-11"><a href="#cb19-11" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Compute a point feature</span></span>
<span id="cb19-12"><a href="#cb19-12" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-13"><a href="#cb19-13" aria-hidden="true" tabindex="-1"></a>      x0<span class="op">,</span> x1<span class="op">,</span> conv1<span class="op">-&gt;</span>weight<span class="op">,</span> conv1<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb19-14"><a href="#cb19-14" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-15"><a href="#cb19-15" aria-hidden="true" tabindex="-1"></a>      x1<span class="op">,</span> x2<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb19-16"><a href="#cb19-16" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb19-17"><a href="#cb19-17" aria-hidden="true" tabindex="-1"></a>      x2<span class="op">,</span> x3<span class="op">,</span> conv2<span class="op">-&gt;</span>weight<span class="op">,</span> conv2<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb19-18"><a href="#cb19-18" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-19"><a href="#cb19-19" aria-hidden="true" tabindex="-1"></a>      x3<span class="op">,</span> x4<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb19-20"><a href="#cb19-20" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb19-21"><a href="#cb19-21" aria-hidden="true" tabindex="-1"></a>      x4<span class="op">,</span> x5<span class="op">,</span> conv3<span class="op">-&gt;</span>weight<span class="op">,</span> conv3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb19-22"><a href="#cb19-22" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-23"><a href="#cb19-23" aria-hidden="true" tabindex="-1"></a>      x5<span class="op">,</span> x6<span class="op">,</span> bn3<span class="op">-&gt;</span>scale<span class="op">,</span> bn3<span class="op">-&gt;</span>bias<span class="op">,</span> bn3<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb19-24"><a href="#cb19-24" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">16</span><span class="op">&gt;(</span></span>
<span id="cb19-25"><a href="#cb19-25" aria-hidden="true" tabindex="-1"></a>      x6<span class="op">,</span> x7<span class="op">,</span> conv4<span class="op">-&gt;</span>weight<span class="op">,</span> conv4<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb19-26"><a href="#cb19-26" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-27"><a href="#cb19-27" aria-hidden="true" tabindex="-1"></a>      x7<span class="op">,</span> x8<span class="op">,</span> bn4<span class="op">-&gt;</span>scale<span class="op">,</span> bn4<span class="op">-&gt;</span>bias<span class="op">,</span> bn4<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb19-28"><a href="#cb19-28" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">128</span><span class="op">&gt;(</span></span>
<span id="cb19-29"><a href="#cb19-29" aria-hidden="true" tabindex="-1"></a>      x8<span class="op">,</span> x9<span class="op">,</span> conv5<span class="op">-&gt;</span>weight<span class="op">,</span> conv5<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb19-30"><a href="#cb19-30" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb19-31"><a href="#cb19-31" aria-hidden="true" tabindex="-1"></a>      x9<span class="op">,</span> x10<span class="op">,</span> bn5<span class="op">-&gt;</span>scale<span class="op">,</span> bn5<span class="op">-&gt;</span>bias<span class="op">,</span> bn5<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb19-32"><a href="#cb19-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-33"><a href="#cb19-33" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Update the output feature</span></span>
<span id="cb19-34"><a href="#cb19-34" aria-hidden="true" tabindex="-1"></a>    MaxPool1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span>x10<span class="op">,</span> feature<span class="op">);</span></span>
<span id="cb19-35"><a href="#cb19-35" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span></code></pre></div>
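<p>Inside the loop, the <code>i</code>-th point is first fetched from the point cloud <code>point_cloud</code> in DRAM and stored in the on-chip buffer <code>x0</code>.
This <code>x0</code> is then passed along through a series of functions, like a bucket brigade.
For example, the first fully-connected layer computes <code>x1</code> from <code>x0</code>, the batch normalization layer computes <code>x2</code> from <code>x1</code>, and the next fully-connected layer computes <code>x3</code> from <code>x2</code>.
Each layer function (for example, <code>LinearOpt1(x4, x5)</code>)
takes the output of the preceding function (<code>x4</code>) as its input and hands its own output
(<code>x5</code>) to the next function.
All the functions are thus chained together through their inputs and outputs.
The flow of function execution looks like this:</p>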
<p><a
href="point-cloud-classification-images/dataflow-optimization-before.svg"><img src="point-cloud-classification-images/dataflow-optimization-before.svg" width="80%" /></a></p>
<p>As with the loop pipelining described earlier, the processing of multiple points can be overlapped.</p>
<p><a
href="point-cloud-classification-images/dataflow-optimization-after.svg"><img src="point-cloud-classification-images/dataflow-optimization-after.svg" width="90%" /></a></p>
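<p>For example, while the last fully-connected layer is being computed for the first point, the batch normalization layer just before it is computed for the second point, so that the processing of multiple points overlaps in time.
Earlier, the body of a loop was pipelined so that multiple iterations ran in parallel,
and each pipeline stage was mostly a multiplication or an addition.
Here, each stage corresponds to an entire function (task),
so this is a coarser-grained form of pipelining.
In Vitis HLS, this kind of task-level pipelining is called
<strong>dataflow optimization</strong>
(<strong>Optimization 6: dataflow optimization</strong>).
Dataflow optimization can be applied only when a number of conditions are met, and this code satisfies them.</p>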
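<p>As mentioned before, a pipeline works best when the cycle counts of its stages are as even as possible.
In other words, the computation time of each layer should be as uniform as possible.
These times are summarized in the table above.
Before data parallelism was applied, the cycle counts (especially those of the fully-connected layers)
varied widely.
Taking just the five fully-connected layers, the counts were 577, 4,481, 4,481, 8,961, and 137,217.
Running these layers 2-, 8-, 8-, 16-, and 128-way parallel, respectively
(see <code>InferenceFeatOpt1</code>), reduces them to 321, 569, 569, 569, and 1,081 cycles, which also greatly reduces the variation.
Running the last fully-connected layer 256-way parallel would even things out further, but the circuit would become too complex, so this was not done.</p>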
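<p>The performance of a pipeline is limited by its slowest stage.
Here, the last fully-connected layer (1,081 cycles) determines performance.
As long as the other stages take no more than 1,081 cycles, their exact cycle counts do not affect performance.
To save resources, the parallelism of the other stages was therefore lowered as far as possible without exceeding 1,081 cycles.</p>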
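<p>For the feature extraction network, the parallelism of each layer was thus chosen with dataflow optimization in mind from the start.
The parallelism of the classification network was chosen more or less arbitrarily.</p>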
<p>The implementation with dataflow optimization applied is shown below.
The function is named <code>InferenceFeatOpt2</code> to distinguish it from <code>InferenceFeatOpt1</code>.</p>
<div class="sourceCode" id="cb20"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the PointNet feature extraction</span></span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for layer input, output, and intermediate results</span></span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="co">// `U` is the type for parameters</span></span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `N` is the expected number of input points (e.g., 1024)</span></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">,</span> <span class="dt">int</span> N<span class="op">&gt;</span></span>
<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceFeatOpt2<span class="op">(...)</span></span>
<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Zero-initialize the output feature</span></span>
<span id="cb20-11"><a href="#cb20-11" aria-hidden="true" tabindex="-1"></a>  VectorNdSetZero<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>feature<span class="op">);</span></span>
<span id="cb20-12"><a href="#cb20-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-13"><a href="#cb20-13" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute the feature</span></span>
<span id="cb20-14"><a href="#cb20-14" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> num_points<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb20-15"><a href="#cb20-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N</span></span>
<span id="cb20-16"><a href="#cb20-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS LOOP_FLATTEN off</span></span>
<span id="cb20-17"><a href="#cb20-17" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS DATAFLOW</span></span>
<span id="cb20-18"><a href="#cb20-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-19"><a href="#cb20-19" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=point_cloud</span></span>
<span id="cb20-20"><a href="#cb20-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=num_points</span></span>
<span id="cb20-21"><a href="#cb20-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=feature</span></span>
<span id="cb20-22"><a href="#cb20-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=conv1</span></span>
<span id="cb20-23"><a href="#cb20-23" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=conv2</span></span>
<span id="cb20-24"><a href="#cb20-24" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=conv3</span></span>
<span id="cb20-25"><a href="#cb20-25" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=conv4</span></span>
<span id="cb20-26"><a href="#cb20-26" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=conv5</span></span>
<span id="cb20-27"><a href="#cb20-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=bn1</span></span>
<span id="cb20-28"><a href="#cb20-28" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=bn2</span></span>
<span id="cb20-29"><a href="#cb20-29" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=bn3</span></span>
<span id="cb20-30"><a href="#cb20-30" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=bn4</span></span>
<span id="cb20-31"><a href="#cb20-31" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS STABLE variable=bn5</span></span>
<span id="cb20-32"><a href="#cb20-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-33"><a href="#cb20-33" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Input, output, and intermediate results</span></span>
<span id="cb20-34"><a href="#cb20-34" aria-hidden="true" tabindex="-1"></a>    <span class="co">// ...</span></span>
<span id="cb20-35"><a href="#cb20-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-36"><a href="#cb20-36" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Read a point from a DDR memory</span></span>
<span id="cb20-37"><a href="#cb20-37" aria-hidden="true" tabindex="-1"></a>    ReadPointNaive<span class="op">&lt;</span>T<span class="op">&gt;(</span>point_cloud<span class="op">,</span> i<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb20-38"><a href="#cb20-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-39"><a href="#cb20-39" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Compute a point feature</span></span>
<span id="cb20-40"><a href="#cb20-40" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-41"><a href="#cb20-41" aria-hidden="true" tabindex="-1"></a>      x0<span class="op">,</span> x1<span class="op">,</span> conv1<span class="op">-&gt;</span>weight<span class="op">,</span> conv1<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb20-42"><a href="#cb20-42" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-43"><a href="#cb20-43" aria-hidden="true" tabindex="-1"></a>      x1<span class="op">,</span> x2<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb20-44"><a href="#cb20-44" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb20-45"><a href="#cb20-45" aria-hidden="true" tabindex="-1"></a>      x2<span class="op">,</span> x3<span class="op">,</span> conv2<span class="op">-&gt;</span>weight<span class="op">,</span> conv2<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb20-46"><a href="#cb20-46" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-47"><a href="#cb20-47" aria-hidden="true" tabindex="-1"></a>      x3<span class="op">,</span> x4<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb20-48"><a href="#cb20-48" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb20-49"><a href="#cb20-49" aria-hidden="true" tabindex="-1"></a>      x4<span class="op">,</span> x5<span class="op">,</span> conv3<span class="op">-&gt;</span>weight<span class="op">,</span> conv3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb20-50"><a href="#cb20-50" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-51"><a href="#cb20-51" aria-hidden="true" tabindex="-1"></a>      x5<span class="op">,</span> x6<span class="op">,</span> bn3<span class="op">-&gt;</span>scale<span class="op">,</span> bn3<span class="op">-&gt;</span>bias<span class="op">,</span> bn3<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb20-52"><a href="#cb20-52" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">16</span><span class="op">&gt;(</span></span>
<span id="cb20-53"><a href="#cb20-53" aria-hidden="true" tabindex="-1"></a>      x6<span class="op">,</span> x7<span class="op">,</span> conv4<span class="op">-&gt;</span>weight<span class="op">,</span> conv4<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb20-54"><a href="#cb20-54" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-55"><a href="#cb20-55" aria-hidden="true" tabindex="-1"></a>      x7<span class="op">,</span> x8<span class="op">,</span> bn4<span class="op">-&gt;</span>scale<span class="op">,</span> bn4<span class="op">-&gt;</span>bias<span class="op">,</span> bn4<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb20-56"><a href="#cb20-56" aria-hidden="true" tabindex="-1"></a>    LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">128</span><span class="op">&gt;(</span></span>
<span id="cb20-57"><a href="#cb20-57" aria-hidden="true" tabindex="-1"></a>      x8<span class="op">,</span> x9<span class="op">,</span> conv5<span class="op">-&gt;</span>weight<span class="op">,</span> conv5<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb20-58"><a href="#cb20-58" aria-hidden="true" tabindex="-1"></a>    BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb20-59"><a href="#cb20-59" aria-hidden="true" tabindex="-1"></a>      x9<span class="op">,</span> x10<span class="op">,</span> bn5<span class="op">-&gt;</span>scale<span class="op">,</span> bn5<span class="op">-&gt;</span>bias<span class="op">,</span> bn5<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb20-60"><a href="#cb20-60" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-61"><a href="#cb20-61" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Update the output feature</span></span>
<span id="cb20-62"><a href="#cb20-62" aria-hidden="true" tabindex="-1"></a>    MaxPool1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span>x10<span class="op">,</span> feature<span class="op">);</span></span>
<span id="cb20-63"><a href="#cb20-63" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb20-64"><a href="#cb20-64" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
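<p>The only difference from <code>InferenceFeatOpt1</code> is the HLS pragmas.
<code>#pragma HLS DATAFLOW</code> at the top of the loop instructs the tool to apply dataflow optimization to the loop body.
<code>#pragma HLS STABLE</code> indicates that no synchronization is needed on the given variable when each loop iteration starts.
It is attached to variables that do not change while the loop runs, such as the layer parameters and the point cloud.
Without these pragmas, dataflow optimization does not work well.</p>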
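<p>Inserting just these two kinds of HLS pragmas is all it takes to achieve dataflow optimization.
High-level synthesis tools are impressive in this respect. <code>PointNetClsTop</code>
(the top function) and the classification network inference (<code>InferenceClsOpt1</code>)
are exactly the same as before and are omitted here.</p>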
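<p>Let's look at the effect of the dataflow optimization.
With <code>InferenceFeatOpt1</code>, forward propagation for a single point took 4,357 cycles
(29.02us).
With <code>InferenceFeatOpt2</code> it takes 4,344 cycles
(28.93us), which is almost the same.
The processing time for 1,024 points, on the other hand, was 4,462,596 cycles
(29.72ms) with <code>InferenceFeatOpt1</code>, but drops to 1,112,259 cycles
(7.408ms) with <code>InferenceFeatOpt2</code>, roughly a 4x reduction.
Pipelining does not change the computation time for each input (latency),
but it does improve the number of inputs that can be processed per unit time (throughput),
and the overall performance improves accordingly.</p>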
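<p>The latency and throughput figures above can be related with a simple cycle model. The following is a minimal sketch, not part of the original implementation: the function names are hypothetical, the clock period of 6.66ns is inferred from the reported cycle counts and times, and the pipeline interval of 1,083 cycles is back-calculated from the reported total rather than taken from a synthesis report.</p>

```cpp
#include <cassert>
#include <cstdint>

// Clock period in nanoseconds.
// Assumption: inferred from the reported figures (4,357 cycles ~ 29.02us);
// it is not stated explicitly in the text.
constexpr double kClockPeriodNs = 6.66;

// Convert a cycle count to milliseconds.
constexpr double CyclesToMs(std::uint64_t cycles) {
  return static_cast<double>(cycles) * kClockPeriodNs * 1e-6;
}

// Sequential execution: each item starts only after the previous one
// has finished, so the total is latency * num_items.
constexpr std::uint64_t SequentialCycles(std::uint64_t latency,
                                         std::uint64_t num_items) {
  return latency * num_items;
}

// Pipelined (dataflow) execution: a new item enters every `interval`
// cycles, so only the first item pays the full latency.
constexpr std::uint64_t PipelinedCycles(std::uint64_t latency,
                                        std::uint64_t interval,
                                        std::uint64_t num_items) {
  return num_items == 0 ? 0 : latency + (num_items - 1) * interval;
}
```

<p>Under these assumptions, <code>SequentialCycles(4357, 1024)</code> gives 4,461,568 cycles and <code>PipelinedCycles(4344, 1083, 1024)</code> gives 1,112,253 cycles, both close to the reported totals. The remaining gap is presumably loop overhead, which this model ignores.</p>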
<p>This concludes the dataflow optimization.
See <code>hls/src/top_opt2.cpp</code> for details.</p>
<h2 id="入出力ポート幅の拡張">Widening the Input/Output Ports</h2>
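<p>In the fully-connected layers of the classification network, parallelizing the multiply-accumulate operations did not shorten the overall processing time very much.
This is because the number of cycles spent transferring parameters from DRAM to the on-chip buffers did not change.
As the final optimization, let's widen the input/output ports from 32 bits to 64 bits and modify the implementation so that two <code>float</code> values can be transferred per cycle
(<strong>Optimization 7: data transfer</strong>).</p>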
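<p>To illustrate what carrying two <code>float</code> values in one 64-bit word means, here is a minimal host-side sketch. It uses <code>std::uint64_t</code> in place of <code>ap_uint&lt;64&gt;</code> so that it compiles without the Vitis HLS headers. The helper names are hypothetical, and placing the first value in the lower 32 bits is an assumption, not something the text specifies.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack two floats into one 64-bit word.
// Assumption: the first value goes into the lower 32 bits and the
// second into the upper 32 bits.
inline std::uint64_t PackFloats(float lo, float hi) {
  std::uint32_t lo_bits = 0;
  std::uint32_t hi_bits = 0;
  // memcpy copies the bit pattern without violating aliasing rules
  std::memcpy(&lo_bits, &lo, sizeof(lo_bits));
  std::memcpy(&hi_bits, &hi, sizeof(hi_bits));
  return (static_cast<std::uint64_t>(hi_bits) << 32) | lo_bits;
}

// Split a 64-bit word back into its two floats.
inline void UnpackFloats(std::uint64_t word, float* lo, float* hi) {
  const std::uint32_t lo_bits = static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
  const std::uint32_t hi_bits = static_cast<std::uint32_t>(word >> 32);
  std::memcpy(lo, &lo_bits, sizeof(lo_bits));
  std::memcpy(hi, &hi_bits, sizeof(hi_bits));
}

// Number of 64-bit words needed to carry `num_floats` values
// (rounded up when the count is odd).
constexpr int NumWords(int num_floats) {
  return (num_floats + 1) / 2;
}
```

<p>With this layout, 1,024 values fit in <code>NumWords(1024) == 512</code> words, so a port that accepts one word per cycle needs about half as many transfer cycles as a 32-bit port.</p>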
<p>We start with <code>PointNetClsTop</code>, the top-level function of the IP core.
Before the change, it looked like this.</p>
<div class="sourceCode" id="cb21"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a>                    <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb21-5"><a href="#cb21-5" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params1<span class="op">,</span></span>
<span id="cb21-6"><a href="#cb21-6" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params2<span class="op">,</span></span>
<span id="cb21-7"><a href="#cb21-7" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params3<span class="op">,</span></span>
<span id="cb21-8"><a href="#cb21-8" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params4<span class="op">,</span></span>
<span id="cb21-9"><a href="#cb21-9" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params5<span class="op">,</span></span>
<span id="cb21-10"><a href="#cb21-10" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params1<span class="op">,</span></span>
<span id="cb21-11"><a href="#cb21-11" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params2<span class="op">,</span></span>
<span id="cb21-12"><a href="#cb21-12" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params3<span class="op">)</span></span>
<span id="cb21-13"><a href="#cb21-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb21-14"><a href="#cb21-14" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ...</span></span>
<span id="cb21-15"><a href="#cb21-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>We change the ports to a 64-bit width as follows.</p>
<div class="sourceCode" id="cb22"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> point_cloud<span class="op">,</span></span>
<span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb22-4"><a href="#cb22-4" aria-hidden="true" tabindex="-1"></a>                    ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> out_logits<span class="op">,</span></span>
<span id="cb22-5"><a href="#cb22-5" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> feat_params1<span class="op">,</span></span>
<span id="cb22-6"><a href="#cb22-6" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> feat_params2<span class="op">,</span></span>
<span id="cb22-7"><a href="#cb22-7" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> feat_params3<span class="op">,</span></span>
<span id="cb22-8"><a href="#cb22-8" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> feat_params4<span class="op">,</span></span>
<span id="cb22-9"><a href="#cb22-9" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> feat_params5<span class="op">,</span></span>
<span id="cb22-10"><a href="#cb22-10" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> cls_params1<span class="op">,</span></span>
<span id="cb22-11"><a href="#cb22-11" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> cls_params2<span class="op">,</span></span>
<span id="cb22-12"><a href="#cb22-12" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> cls_params3<span class="op">)</span></span>
<span id="cb22-13"><a href="#cb22-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb22-14"><a href="#cb22-14" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ...</span></span>
<span id="cb22-15"><a href="#cb22-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
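<p><code>ap_uint</code> is the arbitrary-bit-width unsigned integer type provided by Vitis HLS.
Here we use a width of 64 bits.
Since two values must now be read per cycle, every part involved in data transfer has to be modified.
The weight initialization functions <code>InitializeFeatOpt1</code> and <code>InitializeClsOpt1</code>, which fetch the parameters from DRAM and store them in the on-chip buffers, are also revised as follows and renamed <code>InitializeFeatOpt3</code> and <code>InitializeClsOpt3</code>.
The function arguments are simply changed from <code>float*</code> to <code>ap_uint&lt;64&gt;*</code>.</p>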
<div class="sourceCode" id="cb23"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InitializeFeatOpt3<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> conv1<span class="op">,</span></span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a>                        LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> conv2<span class="op">,</span></span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a>                        LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> conv3<span class="op">,</span></span>
<span id="cb23-7"><a href="#cb23-7" aria-hidden="true" tabindex="-1"></a>                        LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> conv4<span class="op">,</span></span>
<span id="cb23-8"><a href="#cb23-8" aria-hidden="true" tabindex="-1"></a>                        LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> conv5<span class="op">,</span></span>
<span id="cb23-9"><a href="#cb23-9" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb23-10"><a href="#cb23-10" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb23-11"><a href="#cb23-11" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">&gt;*</span> bn3<span class="op">,</span></span>
<span id="cb23-12"><a href="#cb23-12" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">&gt;*</span> bn4<span class="op">,</span></span>
<span id="cb23-13"><a href="#cb23-13" aria-hidden="true" tabindex="-1"></a>                        BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;*</span> bn5<span class="op">,</span></span>
<span id="cb23-14"><a href="#cb23-14" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params1<span class="op">,</span></span>
<span id="cb23-15"><a href="#cb23-15" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params2<span class="op">,</span></span>
<span id="cb23-16"><a href="#cb23-16" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params3<span class="op">,</span></span>
<span id="cb23-17"><a href="#cb23-17" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params4<span class="op">,</span></span>
<span id="cb23-18"><a href="#cb23-18" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params5<span class="op">)</span></span>
<span id="cb23-19"><a href="#cb23-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb23-20"><a href="#cb23-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb23-21"><a href="#cb23-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-22"><a href="#cb23-22" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsOpt2<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims0<span class="op">,</span> kFeatDims1<span class="op">&gt;(</span>conv1<span class="op">,</span> bn1<span class="op">,</span> params1<span class="op">);</span></span>
<span id="cb23-23"><a href="#cb23-23" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims1<span class="op">,</span> kFeatDims2<span class="op">&gt;(</span>conv2<span class="op">,</span> bn2<span class="op">,</span> params2<span class="op">);</span></span>
<span id="cb23-24"><a href="#cb23-24" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims2<span class="op">,</span> kFeatDims3<span class="op">&gt;(</span>conv3<span class="op">,</span> bn3<span class="op">,</span> params3<span class="op">);</span></span>
<span id="cb23-25"><a href="#cb23-25" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims3<span class="op">,</span> kFeatDims4<span class="op">&gt;(</span>conv4<span class="op">,</span> bn4<span class="op">,</span> params4<span class="op">);</span></span>
<span id="cb23-26"><a href="#cb23-26" aria-hidden="true" tabindex="-1"></a>  ReadBlockParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims4<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>conv5<span class="op">,</span> bn5<span class="op">,</span> params5<span class="op">);</span></span>
<span id="cb23-27"><a href="#cb23-27" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb23-28"><a href="#cb23-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-29"><a href="#cb23-29" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb23-30"><a href="#cb23-30" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb23-31"><a href="#cb23-31" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb23-32"><a href="#cb23-32" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InitializeClsOpt3<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;*</span> fc3<span class="op">,</span></span>
<span id="cb23-33"><a href="#cb23-33" aria-hidden="true" tabindex="-1"></a>                       BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims1<span class="op">&gt;*</span> bn1<span class="op">,</span></span>
<span id="cb23-34"><a href="#cb23-34" aria-hidden="true" tabindex="-1"></a>                       BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">&gt;*</span> bn2<span class="op">,</span></span>
<span id="cb23-35"><a href="#cb23-35" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params1<span class="op">,</span></span>
<span id="cb23-36"><a href="#cb23-36" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params2<span class="op">,</span></span>
<span id="cb23-37"><a href="#cb23-37" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params3<span class="op">)</span></span>
<span id="cb23-38"><a href="#cb23-38" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb23-39"><a href="#cb23-39" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb23-40"><a href="#cb23-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-41"><a href="#cb23-41" aria-hidden="true" tabindex="-1"></a>  ReadBatchNorm1dParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kClsDims1<span class="op">&gt;(</span></span>
<span id="cb23-42"><a href="#cb23-42" aria-hidden="true" tabindex="-1"></a>    bn1<span class="op">,</span> params1<span class="op">,</span> kClsDims0 <span class="op">*</span> kClsDims1 <span class="op">+</span> kClsDims1<span class="op">);</span></span>
<span id="cb23-43"><a href="#cb23-43" aria-hidden="true" tabindex="-1"></a>  ReadBatchNorm1dParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">&gt;(</span></span>
<span id="cb23-44"><a href="#cb23-44" aria-hidden="true" tabindex="-1"></a>    bn2<span class="op">,</span> params2<span class="op">,</span> kClsDims1 <span class="op">*</span> kClsDims2 <span class="op">+</span> kClsDims2<span class="op">);</span></span>
<span id="cb23-45"><a href="#cb23-45" aria-hidden="true" tabindex="-1"></a>  ReadLinearParamsOpt1<span class="op">&lt;</span>T<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">&gt;(</span></span>
<span id="cb23-46"><a href="#cb23-46" aria-hidden="true" tabindex="-1"></a>    fc3<span class="op">,</span> params3<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb23-47"><a href="#cb23-47" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
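<p>The offsets passed above, such as <code>kClsDims0 * kClsDims1 + kClsDims1</code>, appear to be counted in elements, while a 64-bit buffer is addressed in words of two elements each. The sketch below shows one way this mapping could work and why an even offset and an even element count are convenient. It is a hypothetical illustration on <code>std::uint64_t</code> with made-up helper names, assuming the lower 32 bits hold the earlier element. It is not the actual read function used in this design.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: convert between a float and its 32-bit pattern.
inline std::uint32_t FloatToBits(float value) {
  std::uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

inline float BitsToFloat(std::uint32_t bits) {
  float value = 0.0f;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Hypothetical sketch: read `Dims` floats starting at element `offset`
// from a buffer of 64-bit words.
// Assumption: each word holds two floats, the earlier one in the lower half.
template <int Dims>
void ReadFloats(float* dst, const std::uint64_t* params, int offset) {
  // An even element count means the data ends on a word boundary
  static_assert(Dims % 2 == 0, "`Dims` must be a multiple of 2");
  // An even offset means the data starts on a word boundary
  assert(offset % 2 == 0);

  const int word_offset = offset / 2;  // element index -> word index
  for (int i = 0; i < Dims / 2; ++i) {
    const std::uint64_t word = params[word_offset + i];
    dst[i * 2] = BitsToFloat(static_cast<std::uint32_t>(word & 0xFFFFFFFFu));
    dst[i * 2 + 1] = BitsToFloat(static_cast<std::uint32_t>(word >> 32));
  }
}
```

<p>If the offset or the element count were odd, the requested range would begin or end in the middle of a word, and the loop would need extra logic to handle the half-used word.</p>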
<p>The first implementation used <code>ReadLinearParamsNaive</code>, <code>ReadBatchNorm1dParamsNaive</code>, and <code>ReadBlockParamsNaive</code>. Here we use four new functions instead: <code>ReadLinearParamsOpt1</code>, <code>ReadBatchNorm1dParamsOpt1</code>, <code>ReadBlockParamsOpt1</code>, and <code>ReadBlockParamsOpt2</code>.
Let's look at them in detail.</p>
<div class="sourceCode" id="cb24"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb24-2"><a href="#cb24-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Read the parameters for a linear layer from a DDR memory and</span></span>
<span id="cb24-3"><a href="#cb24-3" aria-hidden="true" tabindex="-1"></a><span class="co">// store them to BRAM buffers</span></span>
<span id="cb24-4"><a href="#cb24-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb24-5"><a href="#cb24-5" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb24-6"><a href="#cb24-6" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb24-7"><a href="#cb24-7" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">&gt;</span></span>
<span id="cb24-8"><a href="#cb24-8" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadLinearParamsOpt1<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> InDims<span class="op">,</span> OutDims<span class="op">&gt;*</span> linear<span class="op">,</span></span>
<span id="cb24-9"><a href="#cb24-9" aria-hidden="true" tabindex="-1"></a>                          <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params<span class="op">,</span></span>
<span id="cb24-10"><a href="#cb24-10" aria-hidden="true" tabindex="-1"></a>                          <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb24-11"><a href="#cb24-11" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb24-12"><a href="#cb24-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE</span></span>
<span id="cb24-13"><a href="#cb24-13" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `params` contains weight parameters of size (`OutDims`, `InDims`) and</span></span>
<span id="cb24-14"><a href="#cb24-14" aria-hidden="true" tabindex="-1"></a>  <span class="co">// bias parameters of size (`OutDims`) in a contiguous buffer</span></span>
<span id="cb24-15"><a href="#cb24-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-16"><a href="#cb24-16" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>InDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`InDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-17"><a href="#cb24-17" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-18"><a href="#cb24-18" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb24-19"><a href="#cb24-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-20"><a href="#cb24-20" aria-hidden="true" tabindex="-1"></a>  ReadTensor2dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">,</span> InDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>weight<span class="op">,</span> params<span class="op">,</span> offset<span class="op">);</span></span>
<span id="cb24-21"><a href="#cb24-21" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-22"><a href="#cb24-22" aria-hidden="true" tabindex="-1"></a>                               offset <span class="op">+</span> InDims <span class="op">*</span> OutDims<span class="op">);</span></span>
<span id="cb24-23"><a href="#cb24-23" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb24-24"><a href="#cb24-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-25"><a href="#cb24-25" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb24-26"><a href="#cb24-26" aria-hidden="true" tabindex="-1"></a><span class="co">// Read the parameters for a 1D batch normalization layer from a DDR memory and</span></span>
<span id="cb24-27"><a href="#cb24-27" aria-hidden="true" tabindex="-1"></a><span class="co">// store them to BRAM buffers</span></span>
<span id="cb24-28"><a href="#cb24-28" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb24-29"><a href="#cb24-29" aria-hidden="true" tabindex="-1"></a><span class="co">// `Dims` is the number of input and output dimensions</span></span>
<span id="cb24-30"><a href="#cb24-30" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> Dims<span class="op">&gt;</span></span>
<span id="cb24-31"><a href="#cb24-31" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadBatchNorm1dParamsOpt1<span class="op">(</span>BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> Dims<span class="op">&gt;*</span> bn<span class="op">,</span></span>
<span id="cb24-32"><a href="#cb24-32" aria-hidden="true" tabindex="-1"></a>                               <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params<span class="op">,</span></span>
<span id="cb24-33"><a href="#cb24-33" aria-hidden="true" tabindex="-1"></a>                               <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb24-34"><a href="#cb24-34" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb24-35"><a href="#cb24-35" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE</span></span>
<span id="cb24-36"><a href="#cb24-36" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `params` contains scale parameters of size (`Dims`),</span></span>
<span id="cb24-37"><a href="#cb24-37" aria-hidden="true" tabindex="-1"></a>  <span class="co">// bias of size (`Dims`), and mean of size (`Dims`) in a contiguous buffer</span></span>
<span id="cb24-38"><a href="#cb24-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-39"><a href="#cb24-39" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>Dims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`Dims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-40"><a href="#cb24-40" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb24-41"><a href="#cb24-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-42"><a href="#cb24-42" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> Dims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>scale<span class="op">,</span> params<span class="op">,</span> offset<span class="op">);</span></span>
<span id="cb24-43"><a href="#cb24-43" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> Dims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span> offset <span class="op">+</span> Dims<span class="op">);</span></span>
<span id="cb24-44"><a href="#cb24-44" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> Dims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>mean<span class="op">,</span> params<span class="op">,</span> offset <span class="op">+</span> Dims <span class="op">*</span> <span class="dv">2</span><span class="op">);</span></span>
<span id="cb24-45"><a href="#cb24-45" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb24-46"><a href="#cb24-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-47"><a href="#cb24-47" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb24-48"><a href="#cb24-48" aria-hidden="true" tabindex="-1"></a><span class="co">// Read the parameters for a linear and 1D batch normalization layer</span></span>
<span id="cb24-49"><a href="#cb24-49" aria-hidden="true" tabindex="-1"></a><span class="co">// from a DDR memory and store them to BRAM buffers</span></span>
<span id="cb24-50"><a href="#cb24-50" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb24-51"><a href="#cb24-51" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb24-52"><a href="#cb24-52" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb24-53"><a href="#cb24-53" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">&gt;</span></span>
<span id="cb24-54"><a href="#cb24-54" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadBlockParamsOpt1<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> InDims<span class="op">,</span> OutDims<span class="op">&gt;*</span> linear<span class="op">,</span></span>
<span id="cb24-55"><a href="#cb24-55" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;*</span> bn<span class="op">,</span></span>
<span id="cb24-56"><a href="#cb24-56" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params<span class="op">)</span></span>
<span id="cb24-57"><a href="#cb24-57" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb24-58"><a href="#cb24-58" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE</span></span>
<span id="cb24-59"><a href="#cb24-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-60"><a href="#cb24-60" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>InDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`InDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-61"><a href="#cb24-61" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-62"><a href="#cb24-62" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-63"><a href="#cb24-63" aria-hidden="true" tabindex="-1"></a>  ReadTensor2dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">,</span> InDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>weight<span class="op">,</span> params<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb24-64"><a href="#cb24-64" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span> InDims <span class="op">*</span> OutDims<span class="op">);</span></span>
<span id="cb24-65"><a href="#cb24-65" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>scale<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-66"><a href="#cb24-66" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims<span class="op">);</span></span>
<span id="cb24-67"><a href="#cb24-67" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-68"><a href="#cb24-68" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims <span class="op">*</span> <span class="dv">2</span><span class="op">);</span></span>
<span id="cb24-69"><a href="#cb24-69" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>mean<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-70"><a href="#cb24-70" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims <span class="op">*</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb24-71"><a href="#cb24-71" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb24-72"><a href="#cb24-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-73"><a href="#cb24-73" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the parameter initialization</span></span>
<span id="cb24-74"><a href="#cb24-74" aria-hidden="true" tabindex="-1"></a><span class="co">// Read the parameters for a linear and 1D batch normalization layer</span></span>
<span id="cb24-75"><a href="#cb24-75" aria-hidden="true" tabindex="-1"></a><span class="co">// from a DDR memory and store them to BRAM buffers</span></span>
<span id="cb24-76"><a href="#cb24-76" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for parameters</span></span>
<span id="cb24-77"><a href="#cb24-77" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb24-78"><a href="#cb24-78" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb24-79"><a href="#cb24-79" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">&gt;</span></span>
<span id="cb24-80"><a href="#cb24-80" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadBlockParamsOpt2<span class="op">(</span>LinearParams<span class="op">&lt;</span>T<span class="op">,</span> InDims<span class="op">,</span> OutDims<span class="op">&gt;*</span> linear<span class="op">,</span></span>
<span id="cb24-81"><a href="#cb24-81" aria-hidden="true" tabindex="-1"></a>                         BatchNorm1dParams<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;*</span> bn<span class="op">,</span></span>
<span id="cb24-82"><a href="#cb24-82" aria-hidden="true" tabindex="-1"></a>                         <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params<span class="op">)</span></span>
<span id="cb24-83"><a href="#cb24-83" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb24-84"><a href="#cb24-84" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE</span></span>
<span id="cb24-85"><a href="#cb24-85" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-86"><a href="#cb24-86" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>InDims <span class="op">==</span> <span class="dv">3</span><span class="op">,</span> <span class="st">&quot;`InDims` must be 3&quot;</span><span class="op">);</span></span>
<span id="cb24-87"><a href="#cb24-87" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb24-88"><a href="#cb24-88" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-89"><a href="#cb24-89" aria-hidden="true" tabindex="-1"></a>  ReadTensor2dOpt2<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">,</span> InDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>weight<span class="op">,</span> params<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb24-90"><a href="#cb24-90" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>linear<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span> InDims <span class="op">*</span> OutDims<span class="op">);</span></span>
<span id="cb24-91"><a href="#cb24-91" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>scale<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-92"><a href="#cb24-92" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims<span class="op">);</span></span>
<span id="cb24-93"><a href="#cb24-93" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>bias<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-94"><a href="#cb24-94" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims <span class="op">*</span> <span class="dv">2</span><span class="op">);</span></span>
<span id="cb24-95"><a href="#cb24-95" aria-hidden="true" tabindex="-1"></a>  ReadTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> OutDims<span class="op">&gt;(</span>bn<span class="op">-&gt;</span>mean<span class="op">,</span> params<span class="op">,</span></span>
<span id="cb24-96"><a href="#cb24-96" aria-hidden="true" tabindex="-1"></a>                               InDims <span class="op">*</span> OutDims <span class="op">+</span> OutDims <span class="op">*</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb24-97"><a href="#cb24-97" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
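<p>The offsets passed to the readers above imply a particular layout of <code>params</code>: the weight comes first, followed by the linear bias, then the batch normalization scale, bias, and mean. Each 64-bit word holds two 32-bit floats, with the even-indexed element in bits 31..0 and the odd-indexed element in bits 63..32. The following is a minimal host-side sketch of that layout, assuming plain C++ without the HLS headers. The names <code>FloatBits</code>, <code>BitsToFloat</code>, <code>PackWords</code>, <code>PackBlockParams</code>, and <code>ReadFloatAt</code> are hypothetical and do not appear in the original code.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side helpers (illustration only, not the original code).

// Reinterpret the bits of a float as a 32-bit unsigned integer.
inline std::uint32_t FloatBits(const float f)
{
  std::uint32_t u32;
  std::memcpy(&u32, &f, sizeof(u32));
  return u32;
}

// Reinterpret a 32-bit unsigned integer as a float.
inline float BitsToFloat(const std::uint32_t u32)
{
  float f;
  std::memcpy(&f, &u32, sizeof(f));
  return f;
}

// Pack a flat float array into 64-bit words: element 2k goes to
// bits 31..0 and element 2k+1 goes to bits 63..32 of word k.
inline std::vector<std::uint64_t> PackWords(const std::vector<float>& flat)
{
  assert(flat.size() % 2 == 0);  // the readers require even sizes
  std::vector<std::uint64_t> words(flat.size() / 2);
  for (std::size_t k = 0; k < words.size(); ++k) {
    const std::uint64_t lo = FloatBits(flat[2 * k + 0]);
    const std::uint64_t hi = FloatBits(flat[2 * k + 1]);
    words[k] = (hi << 32) | lo;
  }
  return words;
}

// Concatenate one block's parameters in the order the readers expect:
// weight (OutDims x InDims, row-major), linear bias,
// batch norm scale, batch norm bias, batch norm mean.
inline std::vector<std::uint64_t> PackBlockParams(
    const std::vector<float>& weight, const std::vector<float>& bias,
    const std::vector<float>& bn_scale, const std::vector<float>& bn_bias,
    const std::vector<float>& bn_mean)
{
  std::vector<float> flat;
  for (const auto* v : {&weight, &bias, &bn_scale, &bn_bias, &bn_mean}) {
    flat.insert(flat.end(), v->begin(), v->end());
  }
  return PackWords(flat);
}

// Read back the float stored at float offset `idx`.
inline float ReadFloatAt(const std::vector<std::uint64_t>& words,
                         const std::size_t idx)
{
  const std::uint64_t w = words[idx / 2];
  const std::uint64_t half = (idx % 2 == 0) ? w : (w >> 32);
  return BitsToFloat(static_cast<std::uint32_t>(half));
}
```

<p>With <code>InDims = 3</code> and <code>OutDims = 2</code>, for instance, the 14 floats of one block pack into 7 words, and <code>ReadFloatAt(words, InDims * OutDims)</code> returns the first element of the linear bias, which is the same offset that is passed to <code>ReadTensor1dOpt1</code> for <code>linear-&gt;bias</code>.</p>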
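<p>These functions are essentially the same as the original naive implementation, except that the type of the <code>params</code> argument changes from <code>float*</code> to <code>ap_uint&lt;64&gt;*</code>.
The function bodies are also simple: they repeatedly read a parameter of a given size starting at a given offset.
For example, the batch normalization parameters are read in the order scale, bias, mean.
The parameters must be placed in the DRAM buffer beforehand, in this order and at the correct positions.
The functions used inside, <code>ReadTensor1dOpt1</code>, <code>ReadTensor2dOpt1</code>, and <code>ReadTensor2dOpt2</code>, are as follows.</p>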
<div class="sourceCode" id="cb25"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="kw">union</span> <span class="dt">conv32_t</span></span>
<span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a>  <span class="bu">std::</span>uint32_t<span class="op"> </span>u32<span class="op">;</span></span>
<span id="cb25-4"><a href="#cb25-4" aria-hidden="true" tabindex="-1"></a>  <span class="dt">int</span> i32<span class="op">;</span></span>
<span id="cb25-5"><a href="#cb25-5" aria-hidden="true" tabindex="-1"></a>  <span class="dt">float</span> f<span class="op">;</span></span>
<span id="cb25-6"><a href="#cb25-6" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb25-7"><a href="#cb25-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-8"><a href="#cb25-8" aria-hidden="true" tabindex="-1"></a><span class="co">// Interpret float as std::uint32_t</span></span>
<span id="cb25-9"><a href="#cb25-9" aria-hidden="true" tabindex="-1"></a><span class="kw">inline</span> <span class="bu">std::</span>uint32_t<span class="op"> </span>FloatToU32<span class="op">(</span><span class="at">const</span> <span class="dt">float</span> f<span class="op">)</span></span>
<span id="cb25-10"><a href="#cb25-10" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-11"><a href="#cb25-11" aria-hidden="true" tabindex="-1"></a>  <span class="dt">conv32_t</span> conv<span class="op">;</span></span>
<span id="cb25-12"><a href="#cb25-12" aria-hidden="true" tabindex="-1"></a>  conv<span class="op">.</span>f <span class="op">=</span> f<span class="op">;</span></span>
<span id="cb25-13"><a href="#cb25-13" aria-hidden="true" tabindex="-1"></a>  <span class="cf">return</span> conv<span class="op">.</span>u32<span class="op">;</span></span>
<span id="cb25-14"><a href="#cb25-14" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-15"><a href="#cb25-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-16"><a href="#cb25-16" aria-hidden="true" tabindex="-1"></a><span class="co">// Interpret std::uint32_t as float</span></span>
<span id="cb25-17"><a href="#cb25-17" aria-hidden="true" tabindex="-1"></a><span class="kw">inline</span> <span class="dt">float</span> U32ToFloat<span class="op">(</span><span class="at">const</span> <span class="bu">std::</span>uint32_t<span class="op"> </span>u32<span class="op">)</span></span>
<span id="cb25-18"><a href="#cb25-18" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-19"><a href="#cb25-19" aria-hidden="true" tabindex="-1"></a>  <span class="dt">conv32_t</span> conv<span class="op">;</span></span>
<span id="cb25-20"><a href="#cb25-20" aria-hidden="true" tabindex="-1"></a>  conv<span class="op">.</span>u32 <span class="op">=</span> u32<span class="op">;</span></span>
<span id="cb25-21"><a href="#cb25-21" aria-hidden="true" tabindex="-1"></a>  <span class="cf">return</span> conv<span class="op">.</span>f<span class="op">;</span></span>
<span id="cb25-22"><a href="#cb25-22" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-23"><a href="#cb25-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-24"><a href="#cb25-24" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a 1D tensor from a DDR memory</span></span>
<span id="cb25-25"><a href="#cb25-25" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">&gt;</span></span>
<span id="cb25-26"><a href="#cb25-26" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadTensor1dNaive<span class="op">(</span>T tensor<span class="op">[</span>D0<span class="op">],</span></span>
<span id="cb25-27"><a href="#cb25-27" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> src<span class="op">,</span></span>
<span id="cb25-28"><a href="#cb25-28" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb25-29"><a href="#cb25-29" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-30"><a href="#cb25-30" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb25-31"><a href="#cb25-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-32"><a href="#cb25-32" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-33"><a href="#cb25-33" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb25-34"><a href="#cb25-34" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>src<span class="op">[</span>offset <span class="op">+</span> i<span class="op">]);</span></span>
<span id="cb25-35"><a href="#cb25-35" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb25-36"><a href="#cb25-36" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-37"><a href="#cb25-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-38"><a href="#cb25-38" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a 1D tensor from DDR memory, two 32-bit floats per 64-bit word</span></span>
<span id="cb25-39"><a href="#cb25-39" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">&gt;</span></span>
<span id="cb25-40"><a href="#cb25-40" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadTensor1dOpt1<span class="op">(</span>T tensor<span class="op">[</span>D0<span class="op">],</span></span>
<span id="cb25-41"><a href="#cb25-41" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> src<span class="op">,</span></span>
<span id="cb25-42"><a href="#cb25-42" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb25-43"><a href="#cb25-43" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-44"><a href="#cb25-44" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb25-45"><a href="#cb25-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-46"><a href="#cb25-46" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>D0 <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`D0` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb25-47"><a href="#cb25-47" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb25-48"><a href="#cb25-48" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-49"><a href="#cb25-49" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> D0Over2 <span class="op">=</span> D0 <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-50"><a href="#cb25-50" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> <span class="dt">int</span> offset2 <span class="op">=</span> offset <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-51"><a href="#cb25-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-52"><a href="#cb25-52" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0Over2<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-53"><a href="#cb25-53" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb25-54"><a href="#cb25-54" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data <span class="op">=</span> src<span class="op">[</span>offset2 <span class="op">+</span> i<span class="op">];</span></span>
<span id="cb25-55"><a href="#cb25-55" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb25-56"><a href="#cb25-56" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb25-57"><a href="#cb25-57" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb25-58"><a href="#cb25-58" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-59"><a href="#cb25-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-60"><a href="#cb25-60" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a 2D tensor from a DDR memory</span></span>
<span id="cb25-61"><a href="#cb25-61" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">,</span> <span class="dt">int</span> D1<span class="op">&gt;</span></span>
<span id="cb25-62"><a href="#cb25-62" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadTensor2dNaive<span class="op">(</span>T tensor<span class="op">[</span>D0<span class="op">][</span>D1<span class="op">],</span></span>
<span id="cb25-63"><a href="#cb25-63" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> src<span class="op">,</span></span>
<span id="cb25-64"><a href="#cb25-64" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb25-65"><a href="#cb25-65" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-66"><a href="#cb25-66" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb25-67"><a href="#cb25-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-68"><a href="#cb25-68" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-69"><a href="#cb25-69" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> D1<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-70"><a href="#cb25-70" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb25-71"><a href="#cb25-71" aria-hidden="true" tabindex="-1"></a>      <span class="at">const</span> <span class="dt">int</span> idx <span class="op">=</span> i <span class="op">*</span> D1 <span class="op">+</span> j<span class="op">;</span></span>
<span id="cb25-72"><a href="#cb25-72" aria-hidden="true" tabindex="-1"></a>      tensor<span class="op">[</span>i<span class="op">][</span>j<span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>src<span class="op">[</span>offset <span class="op">+</span> idx<span class="op">]);</span></span>
<span id="cb25-73"><a href="#cb25-73" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb25-74"><a href="#cb25-74" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb25-75"><a href="#cb25-75" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-76"><a href="#cb25-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-77"><a href="#cb25-77" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a 2D tensor from DDR memory, two 32-bit floats per 64-bit word</span></span>
<span id="cb25-78"><a href="#cb25-78" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">,</span> <span class="dt">int</span> D1<span class="op">&gt;</span></span>
<span id="cb25-79"><a href="#cb25-79" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadTensor2dOpt1<span class="op">(</span>T tensor<span class="op">[</span>D0<span class="op">][</span>D1<span class="op">],</span></span>
<span id="cb25-80"><a href="#cb25-80" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> src<span class="op">,</span></span>
<span id="cb25-81"><a href="#cb25-81" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb25-82"><a href="#cb25-82" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-83"><a href="#cb25-83" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb25-84"><a href="#cb25-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-85"><a href="#cb25-85" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>D1 <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`D1` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb25-86"><a href="#cb25-86" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb25-87"><a href="#cb25-87" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-88"><a href="#cb25-88" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> D1Over2 <span class="op">=</span> D1 <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-89"><a href="#cb25-89" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> <span class="dt">int</span> offset2 <span class="op">=</span> offset <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-90"><a href="#cb25-90" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-91"><a href="#cb25-91" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-92"><a href="#cb25-92" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> D1Over2<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-93"><a href="#cb25-93" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb25-94"><a href="#cb25-94" aria-hidden="true" tabindex="-1"></a>      <span class="at">const</span> <span class="dt">int</span> idx <span class="op">=</span> i <span class="op">*</span> D1Over2 <span class="op">+</span> j<span class="op">;</span></span>
<span id="cb25-95"><a href="#cb25-95" aria-hidden="true" tabindex="-1"></a>      <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data <span class="op">=</span> src<span class="op">[</span>offset2 <span class="op">+</span> idx<span class="op">];</span></span>
<span id="cb25-96"><a href="#cb25-96" aria-hidden="true" tabindex="-1"></a>      tensor<span class="op">[</span>i<span class="op">][</span>j <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb25-97"><a href="#cb25-97" aria-hidden="true" tabindex="-1"></a>      tensor<span class="op">[</span>i<span class="op">][</span>j <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb25-98"><a href="#cb25-98" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb25-99"><a href="#cb25-99" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb25-100"><a href="#cb25-100" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb25-101"><a href="#cb25-101" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-102"><a href="#cb25-102" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a 2D tensor of size (`D0`, 3) from DDR memory; each iteration reads three 64-bit words, which fill two rows</span></span>
<span id="cb25-103"><a href="#cb25-103" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">,</span> <span class="dt">int</span> D1<span class="op">&gt;</span></span>
<span id="cb25-104"><a href="#cb25-104" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadTensor2dOpt2<span class="op">(</span>T tensor<span class="op">[</span>D0<span class="op">][</span>D1<span class="op">],</span></span>
<span id="cb25-105"><a href="#cb25-105" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> src<span class="op">,</span></span>
<span id="cb25-106"><a href="#cb25-106" aria-hidden="true" tabindex="-1"></a>                      <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb25-107"><a href="#cb25-107" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb25-108"><a href="#cb25-108" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb25-109"><a href="#cb25-109" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-110"><a href="#cb25-110" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>D0 <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`D0` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb25-111"><a href="#cb25-111" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>D1 <span class="op">==</span> <span class="dv">3</span><span class="op">,</span> <span class="st">&quot;`D1` must be 3&quot;</span><span class="op">);</span></span>
<span id="cb25-112"><a href="#cb25-112" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb25-113"><a href="#cb25-113" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-114"><a href="#cb25-114" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> Iter <span class="op">=</span> D0 <span class="op">*</span> D1 <span class="op">/</span> <span class="op">(</span><span class="dv">2</span> <span class="op">*</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb25-115"><a href="#cb25-115" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> <span class="dt">int</span> offset2 <span class="op">=</span> offset <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-116"><a href="#cb25-116" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-117"><a href="#cb25-117" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> Iter<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb25-118"><a href="#cb25-118" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb25-119"><a href="#cb25-119" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> <span class="dt">int</span> src_idx <span class="op">=</span> i <span class="op">*</span> <span class="dv">3</span><span class="op">;</span></span>
<span id="cb25-120"><a href="#cb25-120" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> <span class="dt">int</span> dst_idx <span class="op">=</span> i <span class="op">*</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb25-121"><a href="#cb25-121" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data0 <span class="op">=</span> src<span class="op">[</span>offset2 <span class="op">+</span> src_idx <span class="op">+</span> <span class="dv">0</span><span class="op">];</span></span>
<span id="cb25-122"><a href="#cb25-122" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data1 <span class="op">=</span> src<span class="op">[</span>offset2 <span class="op">+</span> src_idx <span class="op">+</span> <span class="dv">1</span><span class="op">];</span></span>
<span id="cb25-123"><a href="#cb25-123" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data2 <span class="op">=</span> src<span class="op">[</span>offset2 <span class="op">+</span> src_idx <span class="op">+</span> <span class="dv">2</span><span class="op">];</span></span>
<span id="cb25-124"><a href="#cb25-124" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">0</span><span class="op">][</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data0<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb25-125"><a href="#cb25-125" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">0</span><span class="op">][</span><span class="dv">1</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data0<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb25-126"><a href="#cb25-126" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">0</span><span class="op">][</span><span class="dv">2</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data1<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb25-127"><a href="#cb25-127" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">1</span><span class="op">][</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data1<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb25-128"><a href="#cb25-128" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">1</span><span class="op">][</span><span class="dv">1</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data2<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb25-129"><a href="#cb25-129" aria-hidden="true" tabindex="-1"></a>    tensor<span class="op">[</span>dst_idx <span class="op">+</span> <span class="dv">1</span><span class="op">][</span><span class="dv">2</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>tensor_data2<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb25-130"><a href="#cb25-130" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb25-131"><a href="#cb25-131" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
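<p>For comparison, the original naive implementations, which read one element at a time, are listed as well.
The behavior of each function is summarized below.</p>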
<ul>
<li><code>ReadTensor1dOpt1&lt;T, D0&gt;(tensor, src, offset)</code>:
Reads <code>D0</code> <code>float</code> values, two at a time, from the DRAM buffer <code>src</code>, starting <code>offset</code> <code>float</code> elements into the buffer
(that is, from the address of <code>src</code> plus <code>4 * offset</code> bytes).
Each value is cast from <code>float</code> to <code>T</code> and stored, two at a time, in the one-dimensional on-chip buffer <code>tensor</code>
(of size <code>(D0)</code>).
Because two values are read per cycle, <code>D0</code> is assumed to be even.</li>
<li><code>ReadTensor2dOpt1&lt;T, D0, D1&gt;(tensor, src, offset)</code>:
Reads data two elements at a time from the DRAM buffer <code>src</code> and stores them in the two-dimensional on-chip buffer <code>tensor</code>
(of size <code>(D0, D1)</code>).
Because two values are read per cycle, <code>D1</code> is assumed to be even.</li>
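<li><code>ReadTensor2dOpt2&lt;T, D0, D1&gt;(tensor, src, offset)</code>:
A dedicated implementation for the case where <code>D1</code> is 3.
It takes three cycles to read six elements from the DRAM buffer <code>src</code>, and then stores them in the on-chip buffer <code>tensor</code>.
To keep the implementation simple, it assumes that <code>D1</code> is 3 and that <code>D0</code> is even
(so the total number of elements is even, and in fact a multiple of six).</li>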
</ul>
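<p><code>ReadTensor2dOpt2</code> and <code>ReadBlockParamsOpt2</code> are used to transfer the weights of the first fully-connected layer in the feature extraction network
(see <code>InitializeFeatOpt3</code>).
The first fully-connected layer maps the three-dimensional coordinates of a point to a 64-dimensional feature, so its weight has size <code>(64, 3)</code>.
We want to read two elements at a time, but the second dimension is odd, which is awkward to handle, so a dedicated function is provided.
<code>ReadTensor2dOpt2</code> deals with this by reading the weights six at a time.
Another option would be to widen the weight buffer from <code>(64, 3)</code> to <code>(64, 4)</code>
(and simply leave the fourth element of each row unused).</p>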
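<p>The six-at-a-time access pattern can be checked in plain C++ on the host. The following is a minimal sketch rather than the HLS code itself: it assumes <code>std::uint64_t</code> as a stand-in for <code>ap_uint&lt;64&gt;</code>, and the names <code>PackFloats</code>, <code>ReadRowsOfThree</code>, <code>FloatToBits</code> and <code>BitsToFloat</code> are hypothetical helpers introduced only for this illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers: reinterpret bits without numeric conversion.
inline std::uint32_t FloatToBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

inline float BitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Hypothetical host-side packing: two floats per 64-bit word,
// the first float in bits 31..0 and the second in bits 63..32.
std::vector<std::uint64_t> PackFloats(const std::vector<float>& v) {
  assert(v.size() % 2 == 0);
  std::vector<std::uint64_t> words(v.size() / 2);
  for (std::size_t i = 0; i < words.size(); ++i) {
    const std::uint64_t lo = FloatToBits(v[2 * i + 0]);
    const std::uint64_t hi = FloatToBits(v[2 * i + 1]);
    words[i] = lo | (hi << 32);
  }
  return words;
}

// Software model of the (D0, 3) read: each iteration consumes three
// 64-bit words (six floats) and fills two rows of three elements.
template <int D0>
void ReadRowsOfThree(float tensor[D0][3], const std::uint64_t* src) {
  static_assert(D0 % 2 == 0, "`D0` must be a multiple of 2");
  for (int i = 0; i < D0 / 2; ++i) {
    const std::uint64_t w0 = src[i * 3 + 0];
    const std::uint64_t w1 = src[i * 3 + 1];
    const std::uint64_t w2 = src[i * 3 + 2];
    tensor[i * 2 + 0][0] = BitsToFloat(static_cast<std::uint32_t>(w0));
    tensor[i * 2 + 0][1] = BitsToFloat(static_cast<std::uint32_t>(w0 >> 32));
    tensor[i * 2 + 0][2] = BitsToFloat(static_cast<std::uint32_t>(w1));
    tensor[i * 2 + 1][0] = BitsToFloat(static_cast<std::uint32_t>(w1 >> 32));
    tensor[i * 2 + 1][1] = BitsToFloat(static_cast<std::uint32_t>(w2));
    tensor[i * 2 + 1][2] = BitsToFloat(static_cast<std::uint32_t>(w2 >> 32));
  }
}
```

<p>The second 64-bit word of each group straddles a row boundary: its lower half ends one row and its upper half starts the next. This is why rows are handled in pairs.</p>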
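<p>The only difference between <code>ReadBlockParamsOpt1</code> and <code>ReadBlockParamsOpt2</code> is whether they call <code>ReadTensor2dOpt1</code> or <code>ReadTensor2dOpt2</code>.
The two functions could probably be merged into one with the <code>if constexpr</code> statement introduced in C++17, but they are kept separate here because this implementation only uses features up to C++14.</p>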
<p>The <code>ap_uint</code> type provides a convenient <code>range()</code> method that extracts an arbitrary range of bits.
Here, <code>range(31, 0)</code> extracts the lower 32 bits and <code>range(63, 32)</code> extracts the upper 32 bits.</p>
<p><code>U32ToFloat()</code> and <code>FloatToU32()</code> reinterpret a value as another type while preserving its bit representation
(between <code>float</code> and an unsigned 32-bit integer).
<code>tensor_data.range(31, 0)</code> is an unsigned 32-bit integer
(<code>unsigned int</code> or <code>ap_uint&lt;32&gt;</code>),
but it actually holds the bits of a <code>float</code>, so <code>U32ToFloat()</code> is used to reinterpret it as a <code>float</code>.
Both functions are implemented with a union.
In C++20, <code>std::bit_cast</code> provides the same functionality.</p>
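<p>For readers who want to try the bit reinterpretation outside of HLS, here is a small sketch in standard C++. It is not the implementation used in this article (which relies on a union). It uses <code>std::memcpy</code>, which is one portable way to express the same idea in C++14. The names <code>FloatToBits</code>, <code>BitsToFloat</code>, <code>Low32</code> and <code>High32</code> are hypothetical, and <code>std::uint64_t</code> stands in for <code>ap_uint&lt;64&gt;</code>.</p>

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float is assumed to be 32 bits wide");
static_assert(std::numeric_limits<float>::is_iec559,
              "IEEE 754 single precision is assumed");

// Copy the bits of a float into an unsigned 32-bit integer
// (no numeric conversion takes place).
inline std::uint32_t FloatToBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

// Copy the bits of an unsigned 32-bit integer into a float.
inline float BitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Plain-C++ counterparts of range(31, 0) and range(63, 32)
// for reading from a 64-bit word.
inline std::uint32_t Low32(std::uint64_t w) {
  return static_cast<std::uint32_t>(w & 0xFFFFFFFFu);
}

inline std::uint32_t High32(std::uint64_t w) {
  return static_cast<std::uint32_t>(w >> 32);
}
```

<p>Note that <code>static_cast&lt;std::uint32_t&gt;(1.0f)</code> would yield the integer 1, whereas <code>FloatToBits(1.0f)</code> yields the bit pattern <code>0x3F800000</code>. The latter is what has to travel through the 64-bit port.</p>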
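<p>Next, consider the inference of the feature extraction network
(see <code>InferenceFeatOpt2</code>).
<code>ReadPointNaive</code>, which reads the <code>idx</code>-th point from the DRAM buffer, is also rewritten for the 64-bit data width.
The modified version is named <code>ReadPointOpt1</code>.</p>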
<div class="sourceCode" id="cb26"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a point from a DDR memory</span></span>
<span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb26-3"><a href="#cb26-3" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadPointNaive<span class="op">(</span><span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb26-4"><a href="#cb26-4" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> idx<span class="op">,</span></span>
<span id="cb26-5"><a href="#cb26-5" aria-hidden="true" tabindex="-1"></a>                    T x<span class="op">[</span><span class="dv">3</span><span class="op">])</span></span>
<span id="cb26-6"><a href="#cb26-6" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb26-7"><a href="#cb26-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb26-8"><a href="#cb26-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-9"><a href="#cb26-9" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> <span class="dv">3</span><span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb26-10"><a href="#cb26-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb26-11"><a href="#cb26-11" aria-hidden="true" tabindex="-1"></a>    x<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>point_cloud<span class="op">[</span>idx <span class="op">*</span> <span class="dv">3</span> <span class="op">+</span> i<span class="op">]);</span></span>
<span id="cb26-12"><a href="#cb26-12" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb26-13"><a href="#cb26-13" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb26-14"><a href="#cb26-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-15"><a href="#cb26-15" aria-hidden="true" tabindex="-1"></a><span class="co">// Read a point from a DDR memory through a 64-bit port (two words per point)</span></span>
<span id="cb26-16"><a href="#cb26-16" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">&gt;</span></span>
<span id="cb26-17"><a href="#cb26-17" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> ReadPointOpt1<span class="op">(</span><span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> point_cloud<span class="op">,</span></span>
<span id="cb26-18"><a href="#cb26-18" aria-hidden="true" tabindex="-1"></a>                   <span class="at">const</span> <span class="dt">int</span> idx<span class="op">,</span></span>
<span id="cb26-19"><a href="#cb26-19" aria-hidden="true" tabindex="-1"></a>                   T x<span class="op">[</span><span class="dv">3</span><span class="op">])</span></span>
<span id="cb26-20"><a href="#cb26-20" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb26-21"><a href="#cb26-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb26-22"><a href="#cb26-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-23"><a href="#cb26-23" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> point_data0 <span class="op">=</span> point_cloud<span class="op">[</span>idx <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">];</span></span>
<span id="cb26-24"><a href="#cb26-24" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> point_data1 <span class="op">=</span> point_cloud<span class="op">[</span>idx <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">];</span></span>
<span id="cb26-25"><a href="#cb26-25" aria-hidden="true" tabindex="-1"></a>  x<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>point_data0<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb26-26"><a href="#cb26-26" aria-hidden="true" tabindex="-1"></a>  x<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>point_data0<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb26-27"><a href="#cb26-27" aria-hidden="true" tabindex="-1"></a>  x<span class="op">[</span><span class="dv">2</span><span class="op">]</span> <span class="op">=</span> T<span class="op">(</span>U32ToFloat<span class="op">(</span>point_data1<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb26-28"><a href="#cb26-28" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p><code>ReadPointNaive</code> assumes that the DRAM buffer <code>point_cloud</code> has size <span
class="math inline">\((N, 3)\)</span>.
<code>ReadPointOpt1</code>, on the other hand, assumes a buffer of size <span
class="math inline">\((N, 4)\)</span> to keep the implementation simple
(the fourth element of each point is not used).
To read the <code>idx</code>-th point, it accesses the 64-bit words at indices <code>idx * 2 + 0</code> and <code>idx * 2 + 1</code>.</p>
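<p>The padded layout implies that the software side has to rearrange an <span class="math inline">\((N, 3)\)</span> point cloud before handing it to the IP core. The sketch below illustrates one way this could look in plain C++. It assumes zero padding for the unused fourth element and uses <code>std::uint64_t</code> in place of <code>ap_uint&lt;64&gt;</code>. <code>PackPaddedPoints</code> and <code>ReadPointSketch</code> are hypothetical names, not functions from the actual code base.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers: reinterpret bits without numeric conversion.
inline std::uint32_t FloatToBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

inline float BitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Convert a flat (N, 3) array into the padded (N, 4) layout,
// stored as two 64-bit words per point.
std::vector<std::uint64_t> PackPaddedPoints(const std::vector<float>& xyz) {
  assert(xyz.size() % 3 == 0);
  const std::size_t num_points = xyz.size() / 3;
  std::vector<std::uint64_t> words(num_points * 2);
  for (std::size_t p = 0; p < num_points; ++p) {
    const std::uint64_t x = FloatToBits(xyz[p * 3 + 0]);
    const std::uint64_t y = FloatToBits(xyz[p * 3 + 1]);
    const std::uint64_t z = FloatToBits(xyz[p * 3 + 2]);
    words[p * 2 + 0] = x | (y << 32);  // x in bits 31..0, y in bits 63..32
    words[p * 2 + 1] = z;              // z in bits 31..0, upper half is padding
  }
  return words;
}

// Software model of the access pattern: point `idx` always starts
// at word `idx * 2`, so no point straddles an odd word boundary.
void ReadPointSketch(const std::uint64_t* point_cloud, int idx, float x[3]) {
  const std::uint64_t w0 = point_cloud[idx * 2 + 0];
  const std::uint64_t w1 = point_cloud[idx * 2 + 1];
  x[0] = BitsToFloat(static_cast<std::uint32_t>(w0 & 0xFFFFFFFFu));
  x[1] = BitsToFloat(static_cast<std::uint32_t>(w0 >> 32));
  x[2] = BitsToFloat(static_cast<std::uint32_t>(w1 & 0xFFFFFFFFu));
}
```

<p>The padding costs one extra <code>float</code> per point in DRAM, but in exchange every point is aligned to a 64-bit word and the index computation stays trivial.</p>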
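<p>Finally, the inference of the classification network is updated
(see <code>InferenceClsOpt1</code>).
It computes the logits for each object class from the point cloud feature and writes them to the DRAM buffer with <code>WriteTensor1dNaive</code>.
<code>WriteTensor1dNaive</code> is rewritten for the 64-bit data width.
The modified version is named <code>WriteTensor1dOpt1</code>.</p>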
<div class="sourceCode" id="cb27"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Write a 1D tensor to a DDR memory</span></span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">&gt;</span></span>
<span id="cb27-3"><a href="#cb27-3" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> WriteTensor1dNaive<span class="op">(</span><span class="dt">float</span><span class="op">*</span> dst<span class="op">,</span></span>
<span id="cb27-4"><a href="#cb27-4" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> T tensor<span class="op">[</span>D0<span class="op">],</span></span>
<span id="cb27-5"><a href="#cb27-5" aria-hidden="true" tabindex="-1"></a>                        <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb27-6"><a href="#cb27-6" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb27-7"><a href="#cb27-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb27-8"><a href="#cb27-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-9"><a href="#cb27-9" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb27-10"><a href="#cb27-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb27-11"><a href="#cb27-11" aria-hidden="true" tabindex="-1"></a>    dst<span class="op">[</span>offset <span class="op">+</span> i<span class="op">]</span> <span class="op">=</span> <span class="kw">static_cast</span><span class="op">&lt;</span><span class="dt">float</span><span class="op">&gt;(</span>tensor<span class="op">[</span>i<span class="op">]);</span></span>
<span id="cb27-12"><a href="#cb27-12" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb27-13"><a href="#cb27-13" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb27-14"><a href="#cb27-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-15"><a href="#cb27-15" aria-hidden="true" tabindex="-1"></a><span class="co">// Write a 1D tensor to a DDR memory through a 64-bit port (two floats per word)</span></span>
<span id="cb27-16"><a href="#cb27-16" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="dt">int</span> D0<span class="op">&gt;</span></span>
<span id="cb27-17"><a href="#cb27-17" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> WriteTensor1dOpt1<span class="op">(</span>ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> dst<span class="op">,</span></span>
<span id="cb27-18"><a href="#cb27-18" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> T tensor<span class="op">[</span>D0<span class="op">],</span></span>
<span id="cb27-19"><a href="#cb27-19" aria-hidden="true" tabindex="-1"></a>                       <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb27-20"><a href="#cb27-20" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb27-21"><a href="#cb27-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb27-22"><a href="#cb27-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-23"><a href="#cb27-23" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>D0 <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`D0` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb27-24"><a href="#cb27-24" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb27-25"><a href="#cb27-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-26"><a href="#cb27-26" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> D0Over2 <span class="op">=</span> D0 <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb27-27"><a href="#cb27-27" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> <span class="dt">int</span> offset2 <span class="op">=</span> offset <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb27-28"><a href="#cb27-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-29"><a href="#cb27-29" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> D0Over2<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb27-30"><a href="#cb27-30" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb27-31"><a href="#cb27-31" aria-hidden="true" tabindex="-1"></a>    ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> tensor_data<span class="op">;</span></span>
<span id="cb27-32"><a href="#cb27-32" aria-hidden="true" tabindex="-1"></a>    tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)</span> <span class="op">=</span> FloatToU32<span class="op">(</span></span>
<span id="cb27-33"><a href="#cb27-33" aria-hidden="true" tabindex="-1"></a>      <span class="kw">static_cast</span><span class="op">&lt;</span><span class="dt">float</span><span class="op">&gt;(</span>tensor<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">]));</span></span>
<span id="cb27-34"><a href="#cb27-34" aria-hidden="true" tabindex="-1"></a>    tensor_data<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)</span> <span class="op">=</span> FloatToU32<span class="op">(</span></span>
<span id="cb27-35"><a href="#cb27-35" aria-hidden="true" tabindex="-1"></a>      <span class="kw">static_cast</span><span class="op">&lt;</span><span class="dt">float</span><span class="op">&gt;(</span>tensor<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">]));</span></span>
<span id="cb27-36"><a href="#cb27-36" aria-hidden="true" tabindex="-1"></a>    dst<span class="op">[</span>offset2 <span class="op">+</span> i<span class="op">]</span> <span class="op">=</span> tensor_data<span class="op">;</span></span>
<span id="cb27-37"><a href="#cb27-37" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb27-38"><a href="#cb27-38" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
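<p>The data of size <code>(D0)</code> held in the on-chip buffer <code>tensor</code> is written back to DRAM two elements per cycle.
To keep the implementation simple, <code>D0</code> is assumed to be even.
The two elements are of type <code>T</code>, but they are first converted to <code>float</code> so that the software side can use them easily, and then reinterpreted as unsigned 32-bit integers with <code>FloatToU32</code>, which preserves the bit representation.
The first value is packed into the lower 32 bits and the second into the upper 32 bits of an <code>ap_uint&lt;64&gt;</code>, which is then written to the DRAM buffer.</p>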
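<p>The packing step can be modeled in plain C++ as follows. This is only a sketch under a few assumptions: <code>T</code> is fixed to <code>float</code>, <code>std::uint64_t</code> replaces <code>ap_uint&lt;64&gt;</code>, the HLS pragmas are omitted, and <code>WriteTensor1dSketch</code> and <code>FloatToBits</code> are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the bits of a float as a 32-bit integer.
inline std::uint32_t FloatToBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

// Software model of the 64-bit write-back: `offset` is counted in floats,
// so it is halved to obtain an index into the 64-bit buffer.
template <int D0>
void WriteTensor1dSketch(std::uint64_t* dst, const float tensor[D0],
                         const int offset) {
  static_assert(D0 % 2 == 0, "`D0` must be a multiple of 2");
  assert(offset % 2 == 0);

  const int offset2 = offset / 2;
  for (int i = 0; i < D0 / 2; ++i) {
    // Even-indexed element goes to bits 31..0, odd-indexed to bits 63..32.
    const std::uint64_t lo = FloatToBits(tensor[i * 2 + 0]);
    const std::uint64_t hi = FloatToBits(tensor[i * 2 + 1]);
    dst[offset2 + i] = lo | (hi << 32);
  }
}
```

<p>On a little-endian host, a buffer written this way can be read back directly as an array of <code>float</code>, because the lower 32 bits of each word sit at the lower address.</p>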
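<p>The first two fully-connected layers (<code>LinearOpt1DDR</code>)
are also updated, giving a new function <code>LinearOpt2DDR</code>.
Only the transfer of the weights and biases is changed.
Since the number of cycles spent on the transfer is roughly halved, the inference time of the classification network is expected to decrease.
To keep the implementation simple, both the input and output dimensions are assumed to be even.</p>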
<div class="sourceCode" id="cb28"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Parallel implementation of the fully-connected layer</span></span>
<span id="cb28-2"><a href="#cb28-2" aria-hidden="true" tabindex="-1"></a><span class="co">// Weight and bias parameters are stored on the DDR memory</span></span>
<span id="cb28-3"><a href="#cb28-3" aria-hidden="true" tabindex="-1"></a><span class="co">// Matrix-vector multiplication is parallelized along the output dimension</span></span>
<span id="cb28-4"><a href="#cb28-4" aria-hidden="true" tabindex="-1"></a><span class="co">// `T` is the type for values</span></span>
<span id="cb28-5"><a href="#cb28-5" aria-hidden="true" tabindex="-1"></a><span class="co">// `TParam` is the type for weight and bias</span></span>
<span id="cb28-6"><a href="#cb28-6" aria-hidden="true" tabindex="-1"></a><span class="co">// `InDims` is the number of input dimensions</span></span>
<span id="cb28-7"><a href="#cb28-7" aria-hidden="true" tabindex="-1"></a><span class="co">// `OutDims` is the number of output dimensions</span></span>
<span id="cb28-8"><a href="#cb28-8" aria-hidden="true" tabindex="-1"></a><span class="co">// `ApplyReLU` is the flag to apply ReLU activation</span></span>
<span id="cb28-9"><a href="#cb28-9" aria-hidden="true" tabindex="-1"></a><span class="co">// `B` is the block size for the output dimension</span></span>
<span id="cb28-10"><a href="#cb28-10" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> TParam<span class="op">,</span></span>
<span id="cb28-11"><a href="#cb28-11" aria-hidden="true" tabindex="-1"></a>          <span class="dt">int</span> InDims<span class="op">,</span> <span class="dt">int</span> OutDims<span class="op">,</span> <span class="dt">bool</span> ApplyReLU<span class="op">,</span> <span class="dt">int</span> B<span class="op">&gt;</span></span>
<span id="cb28-12"><a href="#cb28-12" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> LinearOpt2DDR<span class="op">(</span><span class="at">const</span> T x<span class="op">[</span>InDims<span class="op">],</span></span>
<span id="cb28-13"><a href="#cb28-13" aria-hidden="true" tabindex="-1"></a>                   T y<span class="op">[</span>OutDims<span class="op">],</span></span>
<span id="cb28-14"><a href="#cb28-14" aria-hidden="true" tabindex="-1"></a>                   <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;*</span> params<span class="op">,</span></span>
<span id="cb28-15"><a href="#cb28-15" aria-hidden="true" tabindex="-1"></a>                   <span class="at">const</span> <span class="dt">int</span> offset<span class="op">)</span></span>
<span id="cb28-16"><a href="#cb28-16" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb28-17"><a href="#cb28-17" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `x` is of size (1, `InDims`)</span></span>
<span id="cb28-18"><a href="#cb28-18" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `y` is of size (1, `OutDims`)</span></span>
<span id="cb28-19"><a href="#cb28-19" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `params` contains weight parameters of size (`OutDims`, `InDims`) and</span></span>
<span id="cb28-20"><a href="#cb28-20" aria-hidden="true" tabindex="-1"></a>  <span class="co">// bias parameters of size (`OutDims`) in a contiguous buffer</span></span>
<span id="cb28-21"><a href="#cb28-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-22"><a href="#cb28-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb28-23"><a href="#cb28-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-24"><a href="#cb28-24" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `OutDims` must be a multiple of `B`</span></span>
<span id="cb28-25"><a href="#cb28-25" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> B <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of `B`&quot;</span><span class="op">);</span></span>
<span id="cb28-26"><a href="#cb28-26" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `B` must be larger than 1</span></span>
<span id="cb28-27"><a href="#cb28-27" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>B <span class="op">&gt;</span> <span class="dv">1</span><span class="op">,</span> <span class="st">&quot;`B` must be larger than 1&quot;</span><span class="op">);</span></span>
<span id="cb28-28"><a href="#cb28-28" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `InDims` must be a multiple of 2</span></span>
<span id="cb28-29"><a href="#cb28-29" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>InDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`InDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb28-30"><a href="#cb28-30" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `OutDims` must be a multiple of 2</span></span>
<span id="cb28-31"><a href="#cb28-31" aria-hidden="true" tabindex="-1"></a>  <span class="kw">static_assert</span><span class="op">(</span>OutDims <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">,</span> <span class="st">&quot;`OutDims` must be a multiple of 2&quot;</span><span class="op">);</span></span>
<span id="cb28-32"><a href="#cb28-32" aria-hidden="true" tabindex="-1"></a>  <span class="co">// `offset` must be a multiple of 2</span></span>
<span id="cb28-33"><a href="#cb28-33" aria-hidden="true" tabindex="-1"></a>  <span class="ot">assert</span><span class="op">(</span>offset <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb28-34"><a href="#cb28-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-35"><a href="#cb28-35" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> BHalf <span class="op">=</span> B <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb28-36"><a href="#cb28-36" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> OffsetToBias <span class="op">=</span> OutDims <span class="op">*</span> InDims <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb28-37"><a href="#cb28-37" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> InDims2 <span class="op">=</span> InDims <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb28-38"><a href="#cb28-38" aria-hidden="true" tabindex="-1"></a>  <span class="kw">constexpr</span> <span class="at">const</span> <span class="dt">int</span> OutDims2 <span class="op">=</span> OutDims <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb28-39"><a href="#cb28-39" aria-hidden="true" tabindex="-1"></a>  <span class="at">const</span> <span class="dt">int</span> offset2 <span class="op">=</span> offset <span class="op">/</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb28-40"><a href="#cb28-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-41"><a href="#cb28-41" aria-hidden="true" tabindex="-1"></a>  TParam bias<span class="op">[</span>OutDims<span class="op">];</span></span>
<span id="cb28-42"><a href="#cb28-42" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1</span></span>
<span id="cb28-43"><a href="#cb28-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-44"><a href="#cb28-44" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Copy the bias parameters in advance</span></span>
<span id="cb28-45"><a href="#cb28-45" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> OutDims2<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-46"><a href="#cb28-46" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE II=1</span></span>
<span id="cb28-47"><a href="#cb28-47" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> bias_data <span class="op">=</span> params<span class="op">[</span>offset2 <span class="op">+</span> OffsetToBias <span class="op">+</span> i<span class="op">];</span></span>
<span id="cb28-48"><a href="#cb28-48" aria-hidden="true" tabindex="-1"></a>    bias<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>U32ToFloat<span class="op">(</span>bias_data<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb28-49"><a href="#cb28-49" aria-hidden="true" tabindex="-1"></a>    bias<span class="op">[</span>i <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span>U32ToFloat<span class="op">(</span>bias_data<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb28-50"><a href="#cb28-50" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb28-51"><a href="#cb28-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-52"><a href="#cb28-52" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i0 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i0 <span class="op">&lt;</span> OutDims<span class="op">;</span> i0 <span class="op">+=</span> B<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-53"><a href="#cb28-53" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE off</span></span>
<span id="cb28-54"><a href="#cb28-54" aria-hidden="true" tabindex="-1"></a>    T vals<span class="op">[</span>B<span class="op">];</span></span>
<span id="cb28-55"><a href="#cb28-55" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1</span></span>
<span id="cb28-56"><a href="#cb28-56" aria-hidden="true" tabindex="-1"></a>    TParam weight<span class="op">[</span>B<span class="op">][</span>InDims<span class="op">];</span></span>
<span id="cb28-57"><a href="#cb28-57" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1</span></span>
<span id="cb28-58"><a href="#cb28-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-59"><a href="#cb28-59" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Copy the weight parameters for `B` outputs</span></span>
<span id="cb28-60"><a href="#cb28-60" aria-hidden="true" tabindex="-1"></a>    <span class="at">const</span> <span class="dt">int</span> offset0 <span class="op">=</span> offset2 <span class="op">+</span> i0 <span class="op">*</span> InDims2<span class="op">;</span></span>
<span id="cb28-61"><a href="#cb28-61" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-62"><a href="#cb28-62" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims2<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-63"><a href="#cb28-63" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb28-64"><a href="#cb28-64" aria-hidden="true" tabindex="-1"></a>        <span class="at">const</span> ap_uint<span class="op">&lt;</span><span class="dv">64</span><span class="op">&gt;</span> weight_data <span class="op">=</span> params<span class="op">[</span>offset0 <span class="op">+</span> i1 <span class="op">*</span> InDims2 <span class="op">+</span> j<span class="op">];</span></span>
<span id="cb28-65"><a href="#cb28-65" aria-hidden="true" tabindex="-1"></a>        weight<span class="op">[</span>i1<span class="op">][</span>j <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">0</span><span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span></span>
<span id="cb28-66"><a href="#cb28-66" aria-hidden="true" tabindex="-1"></a>          U32ToFloat<span class="op">(</span>weight_data<span class="op">.</span>range<span class="op">(</span><span class="dv">31</span><span class="op">,</span> <span class="dv">0</span><span class="op">)));</span></span>
<span id="cb28-67"><a href="#cb28-67" aria-hidden="true" tabindex="-1"></a>        weight<span class="op">[</span>i1<span class="op">][</span>j <span class="op">*</span> <span class="dv">2</span> <span class="op">+</span> <span class="dv">1</span><span class="op">]</span> <span class="op">=</span> TParam<span class="op">(</span></span>
<span id="cb28-68"><a href="#cb28-68" aria-hidden="true" tabindex="-1"></a>          U32ToFloat<span class="op">(</span>weight_data<span class="op">.</span>range<span class="op">(</span><span class="dv">63</span><span class="op">,</span> <span class="dv">32</span><span class="op">)));</span></span>
<span id="cb28-69"><a href="#cb28-69" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb28-70"><a href="#cb28-70" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb28-71"><a href="#cb28-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-72"><a href="#cb28-72" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> InDims<span class="op">;</span> <span class="op">++</span>j<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-73"><a href="#cb28-73" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS PIPELINE</span></span>
<span id="cb28-74"><a href="#cb28-74" aria-hidden="true" tabindex="-1"></a>      <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-75"><a href="#cb28-75" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb28-76"><a href="#cb28-76" aria-hidden="true" tabindex="-1"></a>        <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb28-77"><a href="#cb28-77" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="op">(</span>i <span class="op">&lt;</span> OutDims<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-78"><a href="#cb28-78" aria-hidden="true" tabindex="-1"></a>          T last <span class="op">=</span> <span class="op">(</span>j <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> T<span class="op">(</span>bias<span class="op">[</span>i<span class="op">])</span> <span class="op">:</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb28-79"><a href="#cb28-79" aria-hidden="true" tabindex="-1"></a>          vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">=</span> last <span class="op">+</span> x<span class="op">[</span>j<span class="op">]</span> <span class="op">*</span> weight<span class="op">[</span>i1<span class="op">][</span>j<span class="op">];</span></span>
<span id="cb28-80"><a href="#cb28-80" aria-hidden="true" tabindex="-1"></a>        <span class="op">}</span></span>
<span id="cb28-81"><a href="#cb28-81" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb28-82"><a href="#cb28-82" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb28-83"><a href="#cb28-83" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-84"><a href="#cb28-84" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i1 <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i1 <span class="op">&lt;</span> B<span class="op">;</span> <span class="op">++</span>i1<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-85"><a href="#cb28-85" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS UNROLL</span></span>
<span id="cb28-86"><a href="#cb28-86" aria-hidden="true" tabindex="-1"></a>      <span class="dt">int</span> i <span class="op">=</span> i0 <span class="op">+</span> i1<span class="op">;</span></span>
<span id="cb28-87"><a href="#cb28-87" aria-hidden="true" tabindex="-1"></a>      <span class="cf">if</span> <span class="op">(</span>i <span class="op">&lt;</span> OutDims<span class="op">)</span> <span class="op">{</span></span>
<span id="cb28-88"><a href="#cb28-88" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="op">(</span>ApplyReLU<span class="op">)</span></span>
<span id="cb28-89"><a href="#cb28-89" aria-hidden="true" tabindex="-1"></a>          y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">&gt;</span> T<span class="op">(</span><span class="dv">0</span><span class="op">)</span> <span class="op">?</span> vals<span class="op">[</span>i1<span class="op">]</span> <span class="op">:</span> T<span class="op">(</span><span class="dv">0</span><span class="op">);</span></span>
<span id="cb28-90"><a href="#cb28-90" aria-hidden="true" tabindex="-1"></a>        <span class="cf">else</span></span>
<span id="cb28-91"><a href="#cb28-91" aria-hidden="true" tabindex="-1"></a>          y<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> vals<span class="op">[</span>i1<span class="op">];</span></span>
<span id="cb28-92"><a href="#cb28-92" aria-hidden="true" tabindex="-1"></a>      <span class="op">}</span></span>
<span id="cb28-93"><a href="#cb28-93" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb28-94"><a href="#cb28-94" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb28-95"><a href="#cb28-95" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>For both networks, we modified the parts related to data input and output.
Starting from <code>InferenceFeatOpt2</code> and <code>InferenceClsOpt1</code>, we call the modified versions <code>InferenceFeatOpt3</code> and <code>InferenceClsOpt3</code>, respectively.
<code>InferenceFeatOpt3</code> reads the point cloud data with <code>ReadPointOpt1</code> instead of <code>ReadPointNaive</code>
(everything else is unchanged).
<code>InferenceClsOpt3</code> writes the logits with <code>WriteTensor1dOpt1</code> instead of <code>WriteTensor1dNaive</code>, and uses <code>LinearOpt2DDR</code> instead of <code>LinearOpt1DDR</code> for the first two fully-connected layers.</p>
<div class="sourceCode" id="cb29"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">,</span> <span class="dt">int</span> N<span class="op">&gt;</span></span>
<span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceFeatOpt3<span class="op">(...)</span></span>
<span id="cb29-3"><a href="#cb29-3" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb29-4"><a href="#cb29-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb29-5"><a href="#cb29-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-6"><a href="#cb29-6" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Zero-initialize the output feature</span></span>
<span id="cb29-7"><a href="#cb29-7" aria-hidden="true" tabindex="-1"></a>  VectorNdSetZero<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">&gt;(</span>feature<span class="op">);</span></span>
<span id="cb29-8"><a href="#cb29-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-9"><a href="#cb29-9" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute the feature</span></span>
<span id="cb29-10"><a href="#cb29-10" aria-hidden="true" tabindex="-1"></a>  <span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> num_points<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span> <span class="op">{</span></span>
<span id="cb29-11"><a href="#cb29-11" aria-hidden="true" tabindex="-1"></a>    <span class="co">// ...</span></span>
<span id="cb29-12"><a href="#cb29-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-13"><a href="#cb29-13" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Read a point from a DDR memory</span></span>
<span id="cb29-14"><a href="#cb29-14" aria-hidden="true" tabindex="-1"></a>    ReadPointOpt1<span class="op">&lt;</span>T<span class="op">&gt;(</span>point_cloud<span class="op">,</span> i<span class="op">,</span> x0<span class="op">);</span></span>
<span id="cb29-15"><a href="#cb29-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-16"><a href="#cb29-16" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Compute a point feature</span></span>
<span id="cb29-17"><a href="#cb29-17" aria-hidden="true" tabindex="-1"></a>    <span class="co">// ...</span></span>
<span id="cb29-18"><a href="#cb29-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-19"><a href="#cb29-19" aria-hidden="true" tabindex="-1"></a>    <span class="co">// Update the output feature</span></span>
<span id="cb29-20"><a href="#cb29-20" aria-hidden="true" tabindex="-1"></a>    MaxPool1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> kFeatDims5<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span>x10<span class="op">,</span> feature<span class="op">);</span></span>
<span id="cb29-21"><a href="#cb29-21" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb29-22"><a href="#cb29-22" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb29-23"><a href="#cb29-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-24"><a href="#cb29-24" aria-hidden="true" tabindex="-1"></a><span class="kw">template</span> <span class="op">&lt;</span><span class="kw">typename</span> T<span class="op">,</span> <span class="kw">typename</span> U<span class="op">&gt;</span></span>
<span id="cb29-25"><a href="#cb29-25" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> InferenceClsOpt3<span class="op">(...)</span></span>
<span id="cb29-26"><a href="#cb29-26" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb29-27"><a href="#cb29-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INLINE off</span></span>
<span id="cb29-28"><a href="#cb29-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-29"><a href="#cb29-29" aria-hidden="true" tabindex="-1"></a>  <span class="co">// ...</span></span>
<span id="cb29-30"><a href="#cb29-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-31"><a href="#cb29-31" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Compute logits</span></span>
<span id="cb29-32"><a href="#cb29-32" aria-hidden="true" tabindex="-1"></a>  LinearOpt2DDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims0<span class="op">,</span> kClsDims1<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">16</span><span class="op">&gt;(</span></span>
<span id="cb29-33"><a href="#cb29-33" aria-hidden="true" tabindex="-1"></a>    feature<span class="op">,</span> x0<span class="op">,</span> params1<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb29-34"><a href="#cb29-34" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb29-35"><a href="#cb29-35" aria-hidden="true" tabindex="-1"></a>    x0<span class="op">,</span> x1<span class="op">,</span> bn1<span class="op">-&gt;</span>scale<span class="op">,</span> bn1<span class="op">-&gt;</span>bias<span class="op">,</span> bn1<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb29-36"><a href="#cb29-36" aria-hidden="true" tabindex="-1"></a>  LinearOpt2DDR<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims1<span class="op">,</span> kClsDims2<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">8</span><span class="op">&gt;(</span></span>
<span id="cb29-37"><a href="#cb29-37" aria-hidden="true" tabindex="-1"></a>    x1<span class="op">,</span> x2<span class="op">,</span> params2<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb29-38"><a href="#cb29-38" aria-hidden="true" tabindex="-1"></a>  BatchNorm1dReLUOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb29-39"><a href="#cb29-39" aria-hidden="true" tabindex="-1"></a>    x2<span class="op">,</span> x3<span class="op">,</span> bn2<span class="op">-&gt;</span>scale<span class="op">,</span> bn2<span class="op">-&gt;</span>bias<span class="op">,</span> bn2<span class="op">-&gt;</span>mean<span class="op">);</span></span>
<span id="cb29-40"><a href="#cb29-40" aria-hidden="true" tabindex="-1"></a>  LinearOpt1<span class="op">&lt;</span>T<span class="op">,</span> U<span class="op">,</span> kClsDims2<span class="op">,</span> kClsDims3<span class="op">,</span> <span class="kw">false</span><span class="op">,</span> <span class="dv">2</span><span class="op">&gt;(</span></span>
<span id="cb29-41"><a href="#cb29-41" aria-hidden="true" tabindex="-1"></a>    x3<span class="op">,</span> x4<span class="op">,</span> fc3<span class="op">-&gt;</span>weight<span class="op">,</span> fc3<span class="op">-&gt;</span>bias<span class="op">);</span></span>
<span id="cb29-42"><a href="#cb29-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb29-43"><a href="#cb29-43" aria-hidden="true" tabindex="-1"></a>  <span class="co">// Write the result</span></span>
<span id="cb29-44"><a href="#cb29-44" aria-hidden="true" tabindex="-1"></a>  WriteTensor1dOpt1<span class="op">&lt;</span>T<span class="op">,</span> kClsDims3<span class="op">&gt;(</span>out_logits<span class="op">,</span> x4<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb29-45"><a href="#cb29-45" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
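<p>How much execution time did widening the input/output ports save?
The feature extraction network <code>InferenceFeatOpt2</code> takes 1,112,259 cycles
(7.408ms), and the new <code>InferenceFeatOpt3</code> takes 1,112,254 cycles
(7.408ms). They are practically identical.
For the classification network, <code>InferenceClsOpt1</code> with a 32-bit port takes 711,969 cycles
(4.742ms),
whereas <code>InferenceClsOpt3</code> with a 64-bit port takes only 383,885 cycles
(2.557ms).
Doubling the port width thus made inference in the classification network about 1.85 times faster.</p>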
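<p>How much did the number of execution cycles change between the original naive implementation (<code>InferenceFeatNaive</code> +
<code>InferenceClsNaive</code>) and the implementation presented here
(<code>InferenceFeatOpt3</code> + <code>InferenceClsOpt3</code>)?
The first screenshot below shows the result for the naive implementation, and the second shows the optimized one.
The naive implementation needs 163,279,213 cycles (1.087s) for inference,
but the optimizations bring this down to 1,496,143 cycles (9.964ms),
a difference of roughly 109 times.</p>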
<p><a
href="point-cloud-classification-images/pointnet-naive-clock-cycles.png"><img src="point-cloud-classification-images/pointnet-naive-clock-cycles.png" width="80%" /></a></p>
<p><a
href="point-cloud-classification-images/pointnet-opt3-clock-cycles.png"><img src="point-cloud-classification-images/pointnet-opt3-clock-cycles.png" width="80%" /></a></p>
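<p>The times and ratios quoted above follow directly from the cycle counts. The short Python sketch below redoes the arithmetic; the clock period of 6.66ns (roughly 150MHz) is an assumption inferred from the reported figures rather than a value stated in this section, and the helper names are purely illustrative.</p>
<pre><code># Assumed clock period in nanoseconds (inferred from the reported times, not stated here)
CLOCK_PERIOD_NS = 6.66

def cycles_to_ms(cycles):
    """Convert a cycle count to milliseconds under the assumed clock period."""
    return cycles * CLOCK_PERIOD_NS * 1e-6

def speedup(cycles_before, cycles_after):
    """Ratio of the cycle counts before and after an optimization."""
    return cycles_before / cycles_after

# Cycle counts reported in the text
cls_opt1, cls_opt3 = 711_969, 383_885
naive_total, opt3_total = 163_279_213, 1_496_143

if __name__ == "__main__":
    print(f"InferenceClsOpt1: {cycles_to_ms(cls_opt1):.3f} ms")
    print(f"InferenceClsOpt3: {cycles_to_ms(cls_opt3):.3f} ms")
    print(f"Classification speedup: {speedup(cls_opt1, cls_opt3):.2f}x")
    print(f"Overall speedup: {speedup(naive_total, opt3_total):.0f}x")
</code></pre>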
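<p>This completes the high-level synthesis implementation.
See <code>hls/src/top_opt3.cpp</code> for the full code.</p>
<h2 id="ビットストリームの準備">Preparing the bitstream</h2>
<p>Now that the high-level synthesis implementation is done, we compile it with Vitis
HLS to create an IP core.
We worked in the following environment
(I doubt anyone will actually try this, but I am listing it anyway).</p>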
<ul>
<li>Ubuntu 20.04.5 LTS</li>
<li>Intel(R) Xeon(R) E-2186G CPU @ 3.80GHz</li>
<li>64GB DRAM</li>
<li>Vivado ML Edition 2022.1
(installed under <code>/tools/Xilinx</code>)</li>
<li>CMake 3.16.3</li>
</ul>
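<p>The target FPGA board is the Xilinx ZCU104 Evaluation Board
(XCZU7EV-2FFVC1156).</p>
<p>With the GitHub repository prepared for this article, running <code>make</code> as shown below is all it takes to create the IP cores automatically.
This is done by combining Tcl scripts with CMake.
Vitis
HLS provides a GUI, as in the screenshots above, but Tcl scripts allow batch processing from the command line.
After cloning the repository to a location of your choice, move to the <code>hls</code> directory and create the working directories.
Then configure the CMake project and create the desired IP core with <code>make</code>.</p>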
<pre><code># Source the settings script beforehand so that Vivado and Vitis HLS are available
&gt; source /tools/Xilinx/Vivado/2022.1/settings64.sh

# Clone the GitHub repository
&gt; git clone git@github.com:sterngerlach/advent_2022_point_cloud_classification.git
&gt; cd advent_2022_point_cloud_classification

# Create the working directories
&gt; cd hls
&gt; mkdir build
&gt; mkdir work

&gt; cd build

# Configure the CMake project
# settings64.sh overrides the default cmake, so call the system CMake explicitly
&gt; /usr/bin/cmake ..

# Create the IP core from the naive implementation
# It is created inside the work directory
&gt; make pointnet_naive_150_csynth_export

# Create the IP core that exploits data parallelism (loop unrolling and array partitioning applied)
&gt; make pointnet_opt1_csynth_export

# Create the IP core with the dataflow optimization applied
&gt; make pointnet_opt2_csynth_export

# Create the IP core with the I/O port width widened to 64 bits
&gt; make pointnet_opt3_csynth_export</code></pre>
<p>Once the IP cores have been created, launch the GUI and look at the synthesis results
(a window like the screenshots above opens).</p>
<pre><code>&gt; cd hls/work

# Open the Vitis HLS project for the naive implementation in the GUI
&gt; vitis_hls -p pointnet_naive_150

# The others work the same way
&gt; vitis_hls -p pointnet_opt1
&gt; vitis_hls -p pointnet_opt2
&gt; vitis_hls -p pointnet_opt3</code></pre>
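<p>That is all for Vitis
HLS; from here on, we work with Vivado.
Next, this IP core is combined with other IP cores to prepare a board design.
The details of how the board design is built are omitted here.
First, move to the <code>vivado</code> directory and create the working directories.
Then configure the CMake project and create the desired board design with <code>make</code>.</p>
<pre><code># Create the working directories
&gt; cd vivado
&gt; mkdir build
&gt; mkdir work
&gt; mkdir bitstream

&gt; cd build

# Configure the CMake project
# settings64.sh overrides the default cmake, so call the system CMake explicitly
# This fails with an error unless the IP cores have already been synthesized with Vitis HLS
&gt; /usr/bin/cmake ..

# Create the board design from the IP core of the naive implementation
&gt; make pointnet_naive_150_create

# Create the board designs from the optimized IP cores
&gt; make pointnet_opt1_create
&gt; make pointnet_opt2_create
&gt; make pointnet_opt3_create</code></pre>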
<p>Once the board designs have been created, launch the GUI and look at the block diagram.</p>
<pre><code>&gt; cd vivado/work
&gt; vivado -project pointnet_naive_150/pointnet_naive_150.xpr
&gt; vivado -project pointnet_opt1/pointnet_opt1.xpr
&gt; vivado -project pointnet_opt2/pointnet_opt2.xpr
&gt; vivado -project pointnet_opt3/pointnet_opt3.xpr</code></pre>
<p><a
href="point-cloud-classification-images/pointnet-opt3-vivado.png"><img src="point-cloud-classification-images/pointnet-opt3-vivado.png" width="80%" /></a></p>
<p>Selecting "Open Block
Design" in the Flow Navigator on the left displays the block diagram.</p>
<p><a
href="point-cloud-classification-images/pointnet-opt3-vivado2.png"><img src="point-cloud-classification-images/pointnet-opt3-vivado2.png" width="80%" /></a></p>
<p>An enlarged view of the block diagram is shown below.</p>
<p><a
href="point-cloud-classification-images/board-design.svg"><img src="point-cloud-classification-images/board-design.svg" width="100%" /></a></p>
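<p>Let us run logic synthesis and place-and-route on the board design to create a bitstream,
which packages the circuit information.
It depends on the machine, but in our environment, logic synthesis and place-and-route took more than 30 minutes per board design
(using 8 cores).
The GitHub repository already includes the bitstreams, so this step is not required
(though you are welcome to try it).</p>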
<pre><code>&gt; cd vivado/build
&gt; make pointnet_naive_150_impl &amp;&amp; make pointnet_naive_150_copy_bitstream
&gt; make pointnet_opt1_impl &amp;&amp; make pointnet_opt1_copy_bitstream
&gt; make pointnet_opt2_impl &amp;&amp; make pointnet_opt2_copy_bitstream
&gt; make pointnet_opt3_impl &amp;&amp; make pointnet_opt3_copy_bitstream</code></pre>
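<p>The make targets used so far follow a regular naming pattern, so the whole flow for one design can be summarized in a few lines of Python. This is only a sketch: the helpers <code>build_steps</code> and <code>run_all</code> are hypothetical and not part of the repository, the directory paths assume the script is run from the repository root, and both <code>build</code> directories are assumed to have been configured with CMake already.</p>
<pre><code>import subprocess

# Design names taken from the make targets shown above
DESIGNS = ["pointnet_naive_150", "pointnet_opt1", "pointnet_opt2", "pointnet_opt3"]

def build_steps(design):
    """List (build directory, make target) pairs for one design, in execution order."""
    return [
        ("hls/build", f"{design}_csynth_export"),      # Vitis HLS: create the IP core
        ("vivado/build", f"{design}_create"),          # Vivado: create the board design
        ("vivado/build", f"{design}_impl"),            # Vivado: synthesis and place-and-route
        ("vivado/build", f"{design}_copy_bitstream"),  # copy the results to vivado/bitstream
    ]

def run_all(design):
    """Run every step with make, stopping at the first failure."""
    for directory, target in build_steps(design):
        subprocess.run(["make", target], cwd=directory, check=True)

if __name__ == "__main__":
    # Only print the commands; call run_all() to actually execute them
    for directory, target in build_steps("pointnet_opt3"):
        print(f"(cd {directory}; make {target})")
</code></pre>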
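<p>Launch the GUI once more and look at the implemented circuit. Select "Open Implemented Design" in the Flow
Navigator on the left.
Personally, I find it beautiful; it looks like Manhattan in New York.
In the GUI, you can check the resource utilization (Utilization), the estimated power consumption
(Power), the timing (Timing), and so on.</p>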
<p><a
href="point-cloud-classification-images/pointnet-opt3-vivado3.png"><img src="point-cloud-classification-images/pointnet-opt3-vivado3.png" width="80%" /></a></p>
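<p>The generated bitstreams are copied into the <code>vivado/bitstream</code> directory.
Besides each bitstream (extension <code>.bit</code>), there is also a Hardware
Handoff file (extension <code>.hwh</code>).
The Handoff file contains metadata about the circuit.
Both files are needed together to load a bitstream onto the FPGA board.
A major advantage of FPGAs over ASICs is that the running circuit can be switched as many times as you like simply by loading a different bitstream.
Once these files are transferred to the FPGA board, for example with <code>scp</code>, the circuit is ready to run.</p>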
<pre><code>&gt; cd vivado/bitstream
&gt; ls
-rw-rw-r-- 1 x x  19M Dec 14 23:34 pointnet_naive_150.bit
-rw-rw-r-- 1 x x 363K Dec 14 23:34 pointnet_naive_150.hwh
-rw-rw-r-- 1 x x  19M Dec 15 00:01 pointnet_opt1.bit
-rw-rw-r-- 1 x x 363K Dec 15 00:01 pointnet_opt1.hwh
-rw-rw-r-- 1 x x  19M Dec 14 23:20 pointnet_opt2.bit
-rw-rw-r-- 1 x x 363K Dec 14 23:20 pointnet_opt2.hwh
-rw-rw-r-- 1 x x  19M Dec 15 18:07 pointnet_opt3.bit
-rw-rw-r-- 1 x x 363K Dec 15 18:07 pointnet_opt3.hwh</code></pre>
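<p>Because each <code>.bit</code> file is only usable together with its <code>.hwh</code> file, it can be worth checking that every design has both before copying them to the board. The following is a minimal sketch of such a check; the function name is made up for illustration and is not part of the repository.</p>
<pre><code>from pathlib import Path

def find_incomplete_designs(bitstream_dir):
    """Return the names of designs that have only one of the .bit/.hwh files."""
    directory = Path(bitstream_dir)
    bit_names = {path.stem for path in directory.glob("*.bit")}
    hwh_names = {path.stem for path in directory.glob("*.hwh")}
    # Symmetric difference: names that appear in exactly one of the two sets
    return sorted(bit_names ^ hwh_names)
</code></pre>
<p>When run from the repository root, <code>find_incomplete_designs("vivado/bitstream")</code> returns an empty list if every bitstream has its Handoff file, and the names of the incomplete designs otherwise.</p>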
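<h2 id="回路を動かす">Running the circuit</h2>
<p>With the bitstreams ready, it is finally time to run the circuit.
The FPGA board used here, the Xilinx ZCU104 Evaluation Kit, is built around what is called an SoC
(System-on-Chip). Alongside the FPGA, a quad-core ARM
Cortex-A53 CPU
(1.2GHz), 2GB of DRAM, and various peripherals are integrated, and Linux runs on it.
As the OS, we use Pynq Linux 2.7, which is based on Ubuntu 20.04.
Pynq
Linux comes with a Python library called <code>pynq</code>, which makes it easy to handle FPGA-related tasks from Python.</p>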
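<p>To try the following, libraries such as PyTorch 1.11.0, TorchVision
0.12.0, NumPy, SciPy, H5py, and Tqdm must be installed on Pynq Linux beforehand; the procedure is omitted here because it would make the explanation too long.
Most of them can be installed with the <code>pip</code> command. Wheel files of PyTorch 1.11.0 and TorchVision
0.12.0 built for the Xilinx
ZCU104 and Pynq Linux 2.7 are available in <a
href="https://github.com/sterngerlach/pytorch-pynq-builds">this repository</a>.
After going through this much trouble, I sometimes ask myself why I am trying to run machine learning models on an FPGA at all.</p>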
<p>From here on, we write Python code rather than C/C++.</p>
<p>First, here again is the definition of the PyTorch model
(<code>net/model.py</code>). It is simple, with no twists.</p>
<div class="sourceCode" id="cb36"><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PointNetFeat(torch.nn.Module):</span>
<span id="cb36-2"><a href="#cb36-2" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>):</span>
<span id="cb36-3"><a href="#cb36-3" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>()</span>
<span id="cb36-4"><a href="#cb36-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-5"><a href="#cb36-5" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv1 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">3</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb36-6"><a href="#cb36-6" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv2 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb36-7"><a href="#cb36-7" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv3 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">64</span>, <span class="dv">1</span>)</span>
<span id="cb36-8"><a href="#cb36-8" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv4 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">64</span>, <span class="dv">128</span>, <span class="dv">1</span>)</span>
<span id="cb36-9"><a href="#cb36-9" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.conv5 <span class="op">=</span> torch.nn.Conv1d(<span class="dv">128</span>, <span class="dv">1024</span>, <span class="dv">1</span>)</span>
<span id="cb36-10"><a href="#cb36-10" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn1 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb36-11"><a href="#cb36-11" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn2 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb36-12"><a href="#cb36-12" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn3 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">64</span>)</span>
<span id="cb36-13"><a href="#cb36-13" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn4 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">128</span>)</span>
<span id="cb36-14"><a href="#cb36-14" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn5 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">1024</span>)</span>
<span id="cb36-15"><a href="#cb36-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-16"><a href="#cb36-16" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> forward(<span class="va">self</span>, x: torch.Tensor):</span>
<span id="cb36-17"><a href="#cb36-17" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, N, 3]</span></span>
<span id="cb36-18"><a href="#cb36-18" aria-hidden="true" tabindex="-1"></a>        N <span class="op">=</span> x.shape[<span class="dv">1</span>]</span>
<span id="cb36-19"><a href="#cb36-19" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 3, N]</span></span>
<span id="cb36-20"><a href="#cb36-20" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> x.transpose(<span class="dv">1</span>, <span class="dv">2</span>)</span>
<span id="cb36-21"><a href="#cb36-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-22"><a href="#cb36-22" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 1024, N]</span></span>
<span id="cb36-23"><a href="#cb36-23" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn1(<span class="va">self</span>.conv1(x)))</span>
<span id="cb36-24"><a href="#cb36-24" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn2(<span class="va">self</span>.conv2(x)))</span>
<span id="cb36-25"><a href="#cb36-25" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn3(<span class="va">self</span>.conv3(x)))</span>
<span id="cb36-26"><a href="#cb36-26" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn4(<span class="va">self</span>.conv4(x)))</span>
<span id="cb36-27"><a href="#cb36-27" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn5(<span class="va">self</span>.conv5(x)))</span>
<span id="cb36-28"><a href="#cb36-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-29"><a href="#cb36-29" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, 1024]</span></span>
<span id="cb36-30"><a href="#cb36-30" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> torch.<span class="bu">max</span>(x, dim<span class="op">=</span><span class="dv">2</span>)[<span class="dv">0</span>]</span>
<span id="cb36-31"><a href="#cb36-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-32"><a href="#cb36-32" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> x</span>
<span id="cb36-33"><a href="#cb36-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-34"><a href="#cb36-34" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PointNetCls(torch.nn.Module):</span>
<span id="cb36-35"><a href="#cb36-35" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, num_classes: <span class="bu">int</span>):</span>
<span id="cb36-36"><a href="#cb36-36" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>()</span>
<span id="cb36-37"><a href="#cb36-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-38"><a href="#cb36-38" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Feature extraction</span></span>
<span id="cb36-39"><a href="#cb36-39" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.feat <span class="op">=</span> PointNetFeat()</span>
<span id="cb36-40"><a href="#cb36-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-41"><a href="#cb36-41" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Classification network</span></span>
<span id="cb36-42"><a href="#cb36-42" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc1 <span class="op">=</span> torch.nn.Linear(<span class="dv">1024</span>, <span class="dv">512</span>)</span>
<span id="cb36-43"><a href="#cb36-43" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc2 <span class="op">=</span> torch.nn.Linear(<span class="dv">512</span>, <span class="dv">256</span>)</span>
<span id="cb36-44"><a href="#cb36-44" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.fc3 <span class="op">=</span> torch.nn.Linear(<span class="dv">256</span>, num_classes)</span>
<span id="cb36-45"><a href="#cb36-45" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn1 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">512</span>)</span>
<span id="cb36-46"><a href="#cb36-46" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.bn2 <span class="op">=</span> torch.nn.BatchNorm1d(<span class="dv">256</span>)</span>
<span id="cb36-47"><a href="#cb36-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-48"><a href="#cb36-48" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> forward(<span class="va">self</span>, x):</span>
<span id="cb36-49"><a href="#cb36-49" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Input `x` is of size [B, N, 3]</span></span>
<span id="cb36-50"><a href="#cb36-50" aria-hidden="true" tabindex="-1"></a>        <span class="co"># After feature extraction, `x` is of size [B, 1024]</span></span>
<span id="cb36-51"><a href="#cb36-51" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> <span class="va">self</span>.feat(x)</span>
<span id="cb36-52"><a href="#cb36-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-53"><a href="#cb36-53" aria-hidden="true" tabindex="-1"></a>        <span class="co"># After the classification layers, `x` is of size [B, `num_classes`]</span></span>
<span id="cb36-54"><a href="#cb36-54" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn1(<span class="va">self</span>.fc1(x)))</span>
<span id="cb36-55"><a href="#cb36-55" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> F.relu(<span class="va">self</span>.bn2(<span class="va">self</span>.fc2(x)))</span>
<span id="cb36-56"><a href="#cb36-56" aria-hidden="true" tabindex="-1"></a>        x <span class="op">=</span> <span class="va">self</span>.fc3(x)</span>
<span id="cb36-57"><a href="#cb36-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-58"><a href="#cb36-58" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> x</span></code></pre></div>
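<p>To make the shape comments in the listing above concrete, here is a minimal sketch that traces tensor shapes through <code>PointNetCls</code> in plain Python, without PyTorch. The names <code>trace_shapes</code>, <code>FEAT_CHANNELS</code>, and <code>CLS_HIDDEN</code> are hypothetical. The convolution channel widths are assumed from the buffer sizes that the FPGA model allocates later in this document, and the initial transpose to [B, 3, N] is also an assumption.</p>

```python
from typing import List, Tuple

# Assumed channel widths of the five kernel-size-1 convolutions:
# 3 -> 64 -> 64 -> 64 -> 128 -> 1024
FEAT_CHANNELS = [3, 64, 64, 64, 128, 1024]
# Widths of the two hidden fully-connected layers (fc1, fc2)
CLS_HIDDEN = [512, 256]


def trace_shapes(batch: int, num_points: int,
                 num_classes: int) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (stage, shape) pairs for one forward pass (hypothetical helper)."""
    shapes = [("input", (batch, num_points, 3))]
    # Assumption: the feature extractor transposes to [B, 3, N] before conv1
    shapes.append(("transpose", (batch, 3, num_points)))
    for i, channels in enumerate(FEAT_CHANNELS[1:], start=1):
        # A kernel-size-1 Conv1d changes only the channel dimension
        shapes.append((f"conv{i}", (batch, channels, num_points)))
    # torch.max(x, dim=2)[0] reduces over the points and removes N
    shapes.append(("max", (batch, FEAT_CHANNELS[-1])))
    for i, width in enumerate(CLS_HIDDEN, start=1):
        shapes.append((f"fc{i}", (batch, width)))
    shapes.append(("fc3", (batch, num_classes)))
    return shapes


if __name__ == "__main__":
    for stage, shape in trace_shapes(batch=2, num_points=1024, num_classes=40):
        print(f"{stage:10s} {shape}")
```

<p>The trace shows that the number of points N disappears at the max-pooling step, which is why the classification layers see a fixed-size [B, 1024] input regardless of the point cloud size.</p>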
<p>Next, we show the FPGA-accelerated model
(<code>host/model_zcu104.py</code>).
The model is named <code>PointNetClsZCU104</code>. It is designed to be used in the same way as the CPU model above
(<code>PointNetCls</code>).</p>
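<p>As a preview of the listing that follows, here is a minimal pure-Python sketch of three ideas it relies on: packing each layer's parameters into one contiguous buffer, folding the batch-normalization weight and variance into a single <code>scale</code>, and splitting a 64-bit physical address into two 32-bit register values. It uses neither <code>pynq</code> nor <code>torch</code>. The names <code>linear_buffer_size</code>, <code>block_buffer_size</code>, <code>fold_scale</code>, and <code>batchnorm_eval</code> are hypothetical, and <code>batchnorm_eval</code> is the standard inference-time formula, not necessarily the exact arithmetic of the IP core.</p>

```python
import math
from typing import List, Tuple


def split_address(addr: int) -> Tuple[int, int]:
    """Return (lower 32 bits, upper 32 bits) of a 64-bit address.

    Equivalent to masking with 2**32 - 1 and shifting right by 32.
    """
    high, low = divmod(addr, 2 ** 32)
    return low, high


def linear_buffer_size(in_dims: int, out_dims: int) -> int:
    """Floats needed for a Conv1d (kernel size 1) or Linear layer:
    the weights followed by the biases."""
    return in_dims * out_dims + out_dims


def block_buffer_size(in_dims: int, out_dims: int) -> int:
    """Layer parameters followed by batch-norm scale, bias, and mean."""
    return linear_buffer_size(in_dims, out_dims) + 3 * out_dims


def fold_scale(weight: List[float], running_var: List[float],
               eps: float) -> List[float]:
    """scale = weight / sqrt(running_var + eps), one value per channel."""
    return [w / math.sqrt(v + eps) for w, v in zip(weight, running_var)]


def batchnorm_eval(x: List[float], scale: List[float],
                   bias: List[float], mean: List[float]) -> List[float]:
    """Inference-time batch norm using the folded scale."""
    return [(xi - m) * s + b for xi, s, b, m in zip(x, scale, bias, mean)]


if __name__ == "__main__":
    print(block_buffer_size(3, 64))      # 448 floats for the first block
    print(linear_buffer_size(256, 40))   # 10280 floats for the last layer
    low, high = split_address(0x123456789)
    print(hex(low), hex(high))           # 0x23456789 0x1
```

<p>Folding the weight and the variance into <code>scale</code> means only three values per channel (scale, bias, mean) need to be stored instead of four, which matches the <code>out_dims * 3</code> term in the block buffer size.</p>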
<div class="sourceCode" id="cb37"><pre
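class="sourceCode python"><code class="sourceCode python"><span id="cb37-0a"><a href="#cb37-0a" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> typing <span class="im">import</span> Tuple</span>
<span id="cb37-0b"><a href="#cb37-0b" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
<span id="cb37-0c"><a href="#cb37-0c" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> pynq</span>
<span id="cb37-0d"><a href="#cb37-0d" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> torch</span>
<span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> net.model <span class="im">import</span> PointNetCls</span>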
<span id="cb37-2"><a href="#cb37-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-3"><a href="#cb37-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Split the 64-bit address into its lower and upper 32 bits</span></span>
<span id="cb37-4"><a href="#cb37-4" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> split_address(addr: <span class="bu">int</span>) <span class="op">-&gt;</span> Tuple[<span class="bu">int</span>, <span class="bu">int</span>]:</span>
<span id="cb37-5"><a href="#cb37-5" aria-hidden="true" tabindex="-1"></a>    mask <span class="op">=</span> (<span class="dv">1</span> <span class="op">&lt;&lt;</span> <span class="dv">32</span>) <span class="op">-</span> <span class="dv">1</span></span>
<span id="cb37-6"><a href="#cb37-6" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> addr <span class="op">&amp;</span> mask, addr <span class="op">&gt;&gt;</span> <span class="dv">32</span></span>
<span id="cb37-7"><a href="#cb37-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-8"><a href="#cb37-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Allocate a contiguous buffer for torch.nn.Conv1d (torch.nn.Linear)</span></span>
<span id="cb37-9"><a href="#cb37-9" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> allocate_linear_buffer(in_dims: <span class="bu">int</span>, out_dims: <span class="bu">int</span>) <span class="op">\</span></span>
<span id="cb37-10"><a href="#cb37-10" aria-hidden="true" tabindex="-1"></a>    <span class="op">-&gt;</span> pynq.<span class="bu">buffer</span>.PynqBuffer:</span>
<span id="cb37-11"><a href="#cb37-11" aria-hidden="true" tabindex="-1"></a>    buf_size <span class="op">=</span> in_dims <span class="op">*</span> out_dims <span class="op">+</span> out_dims</span>
<span id="cb37-12"><a href="#cb37-12" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> pynq.allocate(shape<span class="op">=</span>(buf_size,), dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-13"><a href="#cb37-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-14"><a href="#cb37-14" aria-hidden="true" tabindex="-1"></a><span class="co"># Allocate a contiguous buffer for a block with torch.nn.Conv1d</span></span>
<span id="cb37-15"><a href="#cb37-15" aria-hidden="true" tabindex="-1"></a><span class="co"># (torch.nn.Linear) and torch.nn.BatchNorm1d</span></span>
<span id="cb37-16"><a href="#cb37-16" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> allocate_block_buffer(in_dims: <span class="bu">int</span>, out_dims: <span class="bu">int</span>) <span class="op">\</span></span>
<span id="cb37-17"><a href="#cb37-17" aria-hidden="true" tabindex="-1"></a>    <span class="op">-&gt;</span> pynq.<span class="bu">buffer</span>.PynqBuffer:</span>
<span id="cb37-18"><a href="#cb37-18" aria-hidden="true" tabindex="-1"></a>    buf_size <span class="op">=</span> <span class="dv">0</span></span>
<span id="cb37-19"><a href="#cb37-19" aria-hidden="true" tabindex="-1"></a>    buf_size <span class="op">+=</span> in_dims <span class="op">*</span> out_dims <span class="op">+</span> out_dims</span>
<span id="cb37-20"><a href="#cb37-20" aria-hidden="true" tabindex="-1"></a>    buf_size <span class="op">+=</span> out_dims <span class="op">*</span> <span class="dv">3</span></span>
<span id="cb37-21"><a href="#cb37-21" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> pynq.allocate(shape<span class="op">=</span>(buf_size,), dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-22"><a href="#cb37-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-23"><a href="#cb37-23" aria-hidden="true" tabindex="-1"></a><span class="co"># Write the torch.nn.Conv1d parameters to the contiguous buffer</span></span>
<span id="cb37-24"><a href="#cb37-24" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_conv1d_params(buf: pynq.<span class="bu">buffer</span>.PynqBuffer,</span>
<span id="cb37-25"><a href="#cb37-25" aria-hidden="true" tabindex="-1"></a>                        layer: torch.nn.Conv1d,</span>
<span id="cb37-26"><a href="#cb37-26" aria-hidden="true" tabindex="-1"></a>                        offset: <span class="bu">int</span> <span class="op">=</span> <span class="dv">0</span>) <span class="op">-&gt;</span> <span class="bu">int</span>:</span>
<span id="cb37-27"><a href="#cb37-27" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> layer.kernel_size <span class="op">!=</span> (<span class="dv">1</span>,):</span>
<span id="cb37-28"><a href="#cb37-28" aria-hidden="true" tabindex="-1"></a>        <span class="cf">raise</span> <span class="pp">RuntimeError</span>(<span class="st">&quot;Kernel size should be 1&quot;</span>)</span>
<span id="cb37-29"><a href="#cb37-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-30"><a href="#cb37-30" aria-hidden="true" tabindex="-1"></a>    weight_size <span class="op">=</span> layer.out_channels <span class="op">*</span> layer.in_channels</span>
<span id="cb37-31"><a href="#cb37-31" aria-hidden="true" tabindex="-1"></a>    bias_size <span class="op">=</span> layer.out_channels</span>
<span id="cb37-32"><a href="#cb37-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-33"><a href="#cb37-33" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>weight_size] <span class="op">=</span> layer.weight.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-34"><a href="#cb37-34" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> weight_size</span>
<span id="cb37-35"><a href="#cb37-35" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>bias_size] <span class="op">=</span> layer.bias.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-36"><a href="#cb37-36" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> bias_size</span>
<span id="cb37-37"><a href="#cb37-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-38"><a href="#cb37-38" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> offset</span>
<span id="cb37-39"><a href="#cb37-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-40"><a href="#cb37-40" aria-hidden="true" tabindex="-1"></a><span class="co"># Write the torch.nn.Linear parameters to the contiguous buffer</span></span>
<span id="cb37-41"><a href="#cb37-41" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_linear_params(buf: pynq.<span class="bu">buffer</span>.PynqBuffer,</span>
<span id="cb37-42"><a href="#cb37-42" aria-hidden="true" tabindex="-1"></a>                        layer: torch.nn.Linear,</span>
<span id="cb37-43"><a href="#cb37-43" aria-hidden="true" tabindex="-1"></a>                        offset: <span class="bu">int</span> <span class="op">=</span> <span class="dv">0</span>) <span class="op">-&gt;</span> <span class="bu">int</span>:</span>
<span id="cb37-44"><a href="#cb37-44" aria-hidden="true" tabindex="-1"></a>    weight_size <span class="op">=</span> layer.out_features <span class="op">*</span> layer.in_features</span>
<span id="cb37-45"><a href="#cb37-45" aria-hidden="true" tabindex="-1"></a>    bias_size <span class="op">=</span> layer.out_features</span>
<span id="cb37-46"><a href="#cb37-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-47"><a href="#cb37-47" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>weight_size] <span class="op">=</span> layer.weight.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-48"><a href="#cb37-48" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> weight_size</span>
<span id="cb37-49"><a href="#cb37-49" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>bias_size] <span class="op">=</span> layer.bias.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-50"><a href="#cb37-50" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> bias_size</span>
<span id="cb37-51"><a href="#cb37-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-52"><a href="#cb37-52" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> offset</span>
<span id="cb37-53"><a href="#cb37-53" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-54"><a href="#cb37-54" aria-hidden="true" tabindex="-1"></a><span class="co"># Write the torch.nn.BatchNorm1d parameters to the contiguous buffer</span></span>
<span id="cb37-55"><a href="#cb37-55" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_batchnorm1d_params(buf: pynq.<span class="bu">buffer</span>.PynqBuffer,</span>
<span id="cb37-56"><a href="#cb37-56" aria-hidden="true" tabindex="-1"></a>                             layer: torch.nn.BatchNorm1d,</span>
<span id="cb37-57"><a href="#cb37-57" aria-hidden="true" tabindex="-1"></a>                             offset: <span class="bu">int</span> <span class="op">=</span> <span class="dv">0</span>) <span class="op">-&gt;</span> <span class="bu">int</span>:</span>
<span id="cb37-58"><a href="#cb37-58" aria-hidden="true" tabindex="-1"></a>    dims <span class="op">=</span> layer.num_features</span>
<span id="cb37-59"><a href="#cb37-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-60"><a href="#cb37-60" aria-hidden="true" tabindex="-1"></a>    <span class="co"># `scale` is the product of the weight and the reciprocal of the</span></span>
<span id="cb37-61"><a href="#cb37-61" aria-hidden="true" tabindex="-1"></a>    <span class="co"># standard deviation (precomputed to reduce on-chip memory consumption)</span></span>
<span id="cb37-62"><a href="#cb37-62" aria-hidden="true" tabindex="-1"></a>    std_inv <span class="op">=</span> torch.sqrt(layer.running_var.data <span class="op">+</span> layer.eps)</span>
<span id="cb37-63"><a href="#cb37-63" aria-hidden="true" tabindex="-1"></a>    std_inv <span class="op">=</span> torch.reciprocal(std_inv)</span>
<span id="cb37-64"><a href="#cb37-64" aria-hidden="true" tabindex="-1"></a>    scale <span class="op">=</span> std_inv <span class="op">*</span> layer.weight.data</span>
<span id="cb37-65"><a href="#cb37-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-66"><a href="#cb37-66" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>dims] <span class="op">=</span> scale.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-67"><a href="#cb37-67" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> dims</span>
<span id="cb37-68"><a href="#cb37-68" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>dims] <span class="op">=</span> layer.bias.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-69"><a href="#cb37-69" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> dims</span>
<span id="cb37-70"><a href="#cb37-70" aria-hidden="true" tabindex="-1"></a>    buf[offset:offset<span class="op">+</span>dims] <span class="op">=</span> layer.running_mean.data.view(<span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb37-71"><a href="#cb37-71" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">+=</span> dims</span>
<span id="cb37-72"><a href="#cb37-72" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-73"><a href="#cb37-73" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> offset</span>
<span id="cb37-74"><a href="#cb37-74" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-75"><a href="#cb37-75" aria-hidden="true" tabindex="-1"></a><span class="co"># Write the block (torch.nn.Conv1d and torch.nn.BatchNorm1d) parameters</span></span>
<span id="cb37-76"><a href="#cb37-76" aria-hidden="true" tabindex="-1"></a><span class="co"># to the contiguous buffer</span></span>
<span id="cb37-77"><a href="#cb37-77" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_conv_batchnorm1d_params(buf: pynq.<span class="bu">buffer</span>.PynqBuffer,</span>
<span id="cb37-78"><a href="#cb37-78" aria-hidden="true" tabindex="-1"></a>                                  conv: torch.nn.Conv1d,</span>
<span id="cb37-79"><a href="#cb37-79" aria-hidden="true" tabindex="-1"></a>                                  bn: torch.nn.BatchNorm1d):</span>
<span id="cb37-80"><a href="#cb37-80" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> <span class="dv">0</span></span>
<span id="cb37-81"><a href="#cb37-81" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> write_conv1d_params(buf, conv, offset)</span>
<span id="cb37-82"><a href="#cb37-82" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> write_batchnorm1d_params(buf, bn, offset)</span>
<span id="cb37-83"><a href="#cb37-83" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-84"><a href="#cb37-84" aria-hidden="true" tabindex="-1"></a><span class="co"># Write the block (torch.nn.Linear and torch.nn.BatchNorm1d) parameters</span></span>
<span id="cb37-85"><a href="#cb37-85" aria-hidden="true" tabindex="-1"></a><span class="co"># to the contiguous buffer</span></span>
<span id="cb37-86"><a href="#cb37-86" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_linear_batchnorm1d_params(buf: pynq.<span class="bu">buffer</span>.PynqBuffer,</span>
<span id="cb37-87"><a href="#cb37-87" aria-hidden="true" tabindex="-1"></a>                                    linear: torch.nn.Linear,</span>
<span id="cb37-88"><a href="#cb37-88" aria-hidden="true" tabindex="-1"></a>                                    bn: torch.nn.BatchNorm1d):</span>
<span id="cb37-89"><a href="#cb37-89" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> <span class="dv">0</span></span>
<span id="cb37-90"><a href="#cb37-90" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> write_linear_params(buf, linear, offset)</span>
<span id="cb37-91"><a href="#cb37-91" aria-hidden="true" tabindex="-1"></a>    offset <span class="op">=</span> write_batchnorm1d_params(buf, bn, offset)</span>
<span id="cb37-92"><a href="#cb37-92" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-93"><a href="#cb37-93" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PointNetClsZCU104(torch.nn.Module):</span>
<span id="cb37-94"><a href="#cb37-94" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Operation modes (refer to hls/src/op_modes.hpp)</span></span>
<span id="cb37-95"><a href="#cb37-95" aria-hidden="true" tabindex="-1"></a>    MODE_INIT_WEIGHTS <span class="op">=</span> <span class="dv">100</span></span>
<span id="cb37-96"><a href="#cb37-96" aria-hidden="true" tabindex="-1"></a>    MODE_INFERENCE <span class="op">=</span> <span class="dv">101</span></span>
<span id="cb37-97"><a href="#cb37-97" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-98"><a href="#cb37-98" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, model_cpu: PointNetCls,</span>
<span id="cb37-99"><a href="#cb37-99" aria-hidden="true" tabindex="-1"></a>                 overlay_path: <span class="bu">str</span>, num_points: <span class="bu">int</span>):</span>
<span id="cb37-100"><a href="#cb37-100" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>()</span>
<span id="cb37-101"><a href="#cb37-101" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-102"><a href="#cb37-102" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Load an overlay</span></span>
<span id="cb37-103"><a href="#cb37-103" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.overlay <span class="op">=</span> <span class="va">self</span>.load_overlay(overlay_path)</span>
<span id="cb37-104"><a href="#cb37-104" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Get the IP core module</span></span>
<span id="cb37-105"><a href="#cb37-105" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.net_ip: pynq.DefaultIP <span class="op">=</span> <span class="va">self</span>.overlay.PointNetClsTop</span>
<span id="cb37-106"><a href="#cb37-106" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Get the control registers of the IP core</span></span>
<span id="cb37-107"><a href="#cb37-107" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers <span class="op">=</span> <span class="va">self</span>.net_ip.register_map</span>
<span id="cb37-108"><a href="#cb37-108" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-109"><a href="#cb37-109" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Check the address and data widths of the AXI master interface</span></span>
<span id="cb37-110"><a href="#cb37-110" aria-hidden="true" tabindex="-1"></a>        net_ip_params <span class="op">=</span> <span class="va">self</span>.overlay.ip_dict[<span class="st">&quot;PointNetClsTop&quot;</span>][<span class="st">&quot;parameters&quot;</span>]</span>
<span id="cb37-111"><a href="#cb37-111" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.axi_m_addr_width <span class="op">=</span> <span class="bu">int</span>(net_ip_params[<span class="st">&quot;C_M_AXI_GMEM0_ADDR_WIDTH&quot;</span>])</span>
<span id="cb37-112"><a href="#cb37-112" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.axi_m_data_width <span class="op">=</span> <span class="bu">int</span>(net_ip_params[<span class="st">&quot;C_M_AXI_GMEM0_DATA_WIDTH&quot;</span>])</span>
<span id="cb37-113"><a href="#cb37-113" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-114"><a href="#cb37-114" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Allocate buffers for PointNet feature extraction network</span></span>
<span id="cb37-115"><a href="#cb37-115" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params1 <span class="op">=</span> allocate_block_buffer(<span class="dv">3</span>, <span class="dv">64</span>)</span>
<span id="cb37-116"><a href="#cb37-116" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params2 <span class="op">=</span> allocate_block_buffer(<span class="dv">64</span>, <span class="dv">64</span>)</span>
<span id="cb37-117"><a href="#cb37-117" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params3 <span class="op">=</span> allocate_block_buffer(<span class="dv">64</span>, <span class="dv">64</span>)</span>
<span id="cb37-118"><a href="#cb37-118" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params4 <span class="op">=</span> allocate_block_buffer(<span class="dv">64</span>, <span class="dv">128</span>)</span>
<span id="cb37-119"><a href="#cb37-119" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params5 <span class="op">=</span> allocate_block_buffer(<span class="dv">128</span>, <span class="dv">1024</span>)</span>
<span id="cb37-120"><a href="#cb37-120" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-121"><a href="#cb37-121" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Allocate buffers for classification network</span></span>
<span id="cb37-122"><a href="#cb37-122" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params1 <span class="op">=</span> allocate_block_buffer(<span class="dv">1024</span>, <span class="dv">512</span>)</span>
<span id="cb37-123"><a href="#cb37-123" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params2 <span class="op">=</span> allocate_block_buffer(<span class="dv">512</span>, <span class="dv">256</span>)</span>
<span id="cb37-124"><a href="#cb37-124" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params3 <span class="op">=</span> allocate_linear_buffer(<span class="dv">256</span>, <span class="dv">40</span>)</span>
<span id="cb37-125"><a href="#cb37-125" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-126"><a href="#cb37-126" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Allocate a buffer for point cloud</span></span>
<span id="cb37-127"><a href="#cb37-127" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.num_points <span class="op">=</span> num_points</span>
<span id="cb37-128"><a href="#cb37-128" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="va">self</span>.axi_m_data_width <span class="op">==</span> <span class="dv">32</span>:</span>
<span id="cb37-129"><a href="#cb37-129" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_point_cloud: pynq.<span class="bu">buffer</span>.PynqBuffer <span class="op">=</span> pynq.allocate(</span>
<span id="cb37-130"><a href="#cb37-130" aria-hidden="true" tabindex="-1"></a>                shape<span class="op">=</span>(<span class="va">self</span>.num_points, <span class="dv">3</span>), dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-131"><a href="#cb37-131" aria-hidden="true" tabindex="-1"></a>        <span class="cf">elif</span> <span class="va">self</span>.axi_m_data_width <span class="op">==</span> <span class="dv">64</span>:</span>
<span id="cb37-132"><a href="#cb37-132" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_point_cloud: pynq.<span class="bu">buffer</span>.PynqBuffer <span class="op">=</span> pynq.allocate(</span>
<span id="cb37-133"><a href="#cb37-133" aria-hidden="true" tabindex="-1"></a>                shape<span class="op">=</span>(<span class="va">self</span>.num_points, <span class="dv">4</span>), dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-134"><a href="#cb37-134" aria-hidden="true" tabindex="-1"></a>        <span class="cf">else</span>:</span>
<span id="cb37-135"><a href="#cb37-135" aria-hidden="true" tabindex="-1"></a>            <span class="cf">raise</span> <span class="pp">RuntimeError</span>(<span class="st">&quot;Unexpected data width for AXI master&quot;</span>)</span>
<span id="cb37-136"><a href="#cb37-136" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-137"><a href="#cb37-137" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Allocate a buffer for output logits</span></span>
<span id="cb37-138"><a href="#cb37-138" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_out_logits: pynq.<span class="bu">buffer</span>.PynqBuffer <span class="op">=</span> pynq.allocate(</span>
<span id="cb37-139"><a href="#cb37-139" aria-hidden="true" tabindex="-1"></a>            shape<span class="op">=</span>(<span class="dv">40</span>,), dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-140"><a href="#cb37-140" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-141"><a href="#cb37-141" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Copy parameters for PointNet feature extraction network</span></span>
<span id="cb37-142"><a href="#cb37-142" aria-hidden="true" tabindex="-1"></a>        write_conv_batchnorm1d_params(<span class="va">self</span>.buf_feat_params1,</span>
<span id="cb37-143"><a href="#cb37-143" aria-hidden="true" tabindex="-1"></a>            model_cpu.feat.conv1, model_cpu.feat.bn1)</span>
<span id="cb37-144"><a href="#cb37-144" aria-hidden="true" tabindex="-1"></a>        write_conv_batchnorm1d_params(<span class="va">self</span>.buf_feat_params2,</span>
<span id="cb37-145"><a href="#cb37-145" aria-hidden="true" tabindex="-1"></a>            model_cpu.feat.conv2, model_cpu.feat.bn2)</span>
<span id="cb37-146"><a href="#cb37-146" aria-hidden="true" tabindex="-1"></a>        write_conv_batchnorm1d_params(<span class="va">self</span>.buf_feat_params3,</span>
<span id="cb37-147"><a href="#cb37-147" aria-hidden="true" tabindex="-1"></a>            model_cpu.feat.conv3, model_cpu.feat.bn3)</span>
<span id="cb37-148"><a href="#cb37-148" aria-hidden="true" tabindex="-1"></a>        write_conv_batchnorm1d_params(<span class="va">self</span>.buf_feat_params4,</span>
<span id="cb37-149"><a href="#cb37-149" aria-hidden="true" tabindex="-1"></a>            model_cpu.feat.conv4, model_cpu.feat.bn4)</span>
<span id="cb37-150"><a href="#cb37-150" aria-hidden="true" tabindex="-1"></a>        write_conv_batchnorm1d_params(<span class="va">self</span>.buf_feat_params5,</span>
<span id="cb37-151"><a href="#cb37-151" aria-hidden="true" tabindex="-1"></a>            model_cpu.feat.conv5, model_cpu.feat.bn5)</span>
<span id="cb37-152"><a href="#cb37-152" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-153"><a href="#cb37-153" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Copy parameters for classification network</span></span>
<span id="cb37-154"><a href="#cb37-154" aria-hidden="true" tabindex="-1"></a>        write_linear_batchnorm1d_params(<span class="va">self</span>.buf_cls_params1,</span>
<span id="cb37-155"><a href="#cb37-155" aria-hidden="true" tabindex="-1"></a>            model_cpu.fc1, model_cpu.bn1)</span>
<span id="cb37-156"><a href="#cb37-156" aria-hidden="true" tabindex="-1"></a>        write_linear_batchnorm1d_params(<span class="va">self</span>.buf_cls_params2,</span>
<span id="cb37-157"><a href="#cb37-157" aria-hidden="true" tabindex="-1"></a>            model_cpu.fc2, model_cpu.bn2)</span>
<span id="cb37-158"><a href="#cb37-158" aria-hidden="true" tabindex="-1"></a>        write_linear_params(<span class="va">self</span>.buf_cls_params3, model_cpu.fc3)</span>
<span id="cb37-159"><a href="#cb37-159" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-160"><a href="#cb37-160" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Set the physical addresses of the buffers</span></span>
<span id="cb37-161"><a href="#cb37-161" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.point_cloud_1, <span class="va">self</span>.registers.point_cloud_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-162"><a href="#cb37-162" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_point_cloud.device_address)</span>
<span id="cb37-163"><a href="#cb37-163" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.out_logits_1, <span class="va">self</span>.registers.out_logits_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-164"><a href="#cb37-164" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_out_logits.device_address)</span>
<span id="cb37-165"><a href="#cb37-165" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.feat_params1_1, <span class="va">self</span>.registers.feat_params1_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-166"><a href="#cb37-166" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_feat_params1.device_address)</span>
<span id="cb37-167"><a href="#cb37-167" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.feat_params2_1, <span class="va">self</span>.registers.feat_params2_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-168"><a href="#cb37-168" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_feat_params2.device_address)</span>
<span id="cb37-169"><a href="#cb37-169" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.feat_params3_1, <span class="va">self</span>.registers.feat_params3_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-170"><a href="#cb37-170" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_feat_params3.device_address)</span>
<span id="cb37-171"><a href="#cb37-171" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.feat_params4_1, <span class="va">self</span>.registers.feat_params4_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-172"><a href="#cb37-172" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_feat_params4.device_address)</span>
<span id="cb37-173"><a href="#cb37-173" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.feat_params5_1, <span class="va">self</span>.registers.feat_params5_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-174"><a href="#cb37-174" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_feat_params5.device_address)</span>
<span id="cb37-175"><a href="#cb37-175" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.cls_params1_1, <span class="va">self</span>.registers.cls_params1_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-176"><a href="#cb37-176" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_cls_params1.device_address)</span>
<span id="cb37-177"><a href="#cb37-177" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.cls_params2_1, <span class="va">self</span>.registers.cls_params2_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-178"><a href="#cb37-178" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_cls_params2.device_address)</span>
<span id="cb37-179"><a href="#cb37-179" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.cls_params3_1, <span class="va">self</span>.registers.cls_params3_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-180"><a href="#cb37-180" aria-hidden="true" tabindex="-1"></a>            split_address(<span class="va">self</span>.buf_cls_params3.device_address)</span>
<span id="cb37-181"><a href="#cb37-181" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-182"><a href="#cb37-182" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Synchronize the buffers</span></span>
<span id="cb37-183"><a href="#cb37-183" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params1.sync_to_device()</span>
<span id="cb37-184"><a href="#cb37-184" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params2.sync_to_device()</span>
<span id="cb37-185"><a href="#cb37-185" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params3.sync_to_device()</span>
<span id="cb37-186"><a href="#cb37-186" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params4.sync_to_device()</span>
<span id="cb37-187"><a href="#cb37-187" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_feat_params5.sync_to_device()</span>
<span id="cb37-188"><a href="#cb37-188" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params1.sync_to_device()</span>
<span id="cb37-189"><a href="#cb37-189" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params2.sync_to_device()</span>
<span id="cb37-190"><a href="#cb37-190" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.buf_cls_params3.sync_to_device()</span>
<span id="cb37-191"><a href="#cb37-191" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-192"><a href="#cb37-192" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Initialize the weights (transfer the weights to the on-chip buffers)</span></span>
<span id="cb37-193"><a href="#cb37-193" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.op_mode <span class="op">=</span> PointNetClsZCU104.MODE_INIT_WEIGHTS</span>
<span id="cb37-194"><a href="#cb37-194" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.CTRL.AP_START <span class="op">=</span> <span class="dv">1</span></span>
<span id="cb37-195"><a href="#cb37-195" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.wait_for_ip()</span>
<span id="cb37-196"><a href="#cb37-196" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-197"><a href="#cb37-197" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> load_overlay(<span class="va">self</span>, overlay_path):</span>
<span id="cb37-198"><a href="#cb37-198" aria-hidden="true" tabindex="-1"></a>        overlay <span class="op">=</span> pynq.Overlay(overlay_path)</span>
<span id="cb37-199"><a href="#cb37-199" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-200"><a href="#cb37-200" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="kw">not</span> overlay.is_loaded():</span>
<span id="cb37-201"><a href="#cb37-201" aria-hidden="true" tabindex="-1"></a>            <span class="cf">raise</span> <span class="pp">RuntimeError</span>(<span class="ss">f&quot;Unable to load overlay: </span><span class="sc">{</span>overlay_path<span class="sc">}</span><span class="ss">&quot;</span>)</span>
<span id="cb37-202"><a href="#cb37-202" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-203"><a href="#cb37-203" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> overlay</span>
<span id="cb37-204"><a href="#cb37-204" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-205"><a href="#cb37-205" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> wait_for_ip(<span class="va">self</span>):</span>
<span id="cb37-206"><a href="#cb37-206" aria-hidden="true" tabindex="-1"></a>        <span class="cf">while</span> <span class="va">self</span>.registers.CTRL.AP_DONE <span class="op">==</span> <span class="dv">0</span>:</span>
<span id="cb37-207"><a href="#cb37-207" aria-hidden="true" tabindex="-1"></a>            <span class="cf">pass</span></span>
<span id="cb37-208"><a href="#cb37-208" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-209"><a href="#cb37-209" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> forward(<span class="va">self</span>, x: torch.Tensor):</span>
<span id="cb37-210"><a href="#cb37-210" aria-hidden="true" tabindex="-1"></a>        <span class="co"># `x` is of size [B, N, 3]</span></span>
<span id="cb37-211"><a href="#cb37-211" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> x.ndim <span class="op">!=</span> <span class="dv">3</span> <span class="kw">or</span> x.shape[<span class="dv">2</span>] <span class="op">!=</span> <span class="dv">3</span>:</span>
<span id="cb37-212"><a href="#cb37-212" aria-hidden="true" tabindex="-1"></a>            <span class="cf">raise</span> <span class="pp">RuntimeError</span>(<span class="ss">f&quot;Unexpected shape of the input: </span><span class="sc">{</span>x<span class="sc">.</span>shape<span class="sc">}</span><span class="ss">&quot;</span>)</span>
<span id="cb37-213"><a href="#cb37-213" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-214"><a href="#cb37-214" aria-hidden="true" tabindex="-1"></a>        batch_size <span class="op">=</span> x.shape[<span class="dv">0</span>]</span>
<span id="cb37-215"><a href="#cb37-215" aria-hidden="true" tabindex="-1"></a>        num_points <span class="op">=</span> x.shape[<span class="dv">1</span>]</span>
<span id="cb37-216"><a href="#cb37-216" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-217"><a href="#cb37-217" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Reallocate the buffer for point cloud if necessary</span></span>
<span id="cb37-218"><a href="#cb37-218" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> num_points <span class="op">&gt;</span> <span class="va">self</span>.num_points:</span>
<span id="cb37-219"><a href="#cb37-219" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.num_points <span class="op">=</span> num_points</span>
<span id="cb37-220"><a href="#cb37-220" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_point_cloud.freebuffer()</span>
<span id="cb37-221"><a href="#cb37-221" aria-hidden="true" tabindex="-1"></a>            <span class="cf">if</span> <span class="va">self</span>.axi_m_data_width <span class="op">==</span> <span class="dv">32</span>:</span>
<span id="cb37-222"><a href="#cb37-222" aria-hidden="true" tabindex="-1"></a>                <span class="va">self</span>.buf_point_cloud: pynq.<span class="bu">buffer</span>.PynqBuffer <span class="op">=</span> pynq.allocate(</span>
<span id="cb37-223"><a href="#cb37-223" aria-hidden="true" tabindex="-1"></a>                    shape<span class="op">=</span>(<span class="va">self</span>.num_points, <span class="dv">3</span>),</span>
<span id="cb37-224"><a href="#cb37-224" aria-hidden="true" tabindex="-1"></a>                    dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-225"><a href="#cb37-225" aria-hidden="true" tabindex="-1"></a>            <span class="cf">elif</span> <span class="va">self</span>.axi_m_data_width <span class="op">==</span> <span class="dv">64</span>:</span>
<span id="cb37-226"><a href="#cb37-226" aria-hidden="true" tabindex="-1"></a>                <span class="va">self</span>.buf_point_cloud: pynq.<span class="bu">buffer</span>.PynqBuffer <span class="op">=</span> pynq.allocate(</span>
<span id="cb37-227"><a href="#cb37-227" aria-hidden="true" tabindex="-1"></a>                    shape<span class="op">=</span>(<span class="va">self</span>.num_points, <span class="dv">4</span>),</span>
<span id="cb37-228"><a href="#cb37-228" aria-hidden="true" tabindex="-1"></a>                    dtype<span class="op">=</span>np.float32, cacheable<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb37-229"><a href="#cb37-229" aria-hidden="true" tabindex="-1"></a>            <span class="cf">else</span>:</span>
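<span id="cb37-230"><a href="#cb37-230" aria-hidden="true" tabindex="-1"></a>                <span class="cf">raise</span> <span class="pp">RuntimeError</span>(<span class="ss">f&quot;Unexpected data width for AXI master: </span><span class="sc">{</span><span class="va">self</span><span class="sc">.</span>axi_m_data_width<span class="sc">}</span><span class="ss">&quot;</span>)</span>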
<span id="cb37-231"><a href="#cb37-231" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.registers.point_cloud_1, <span class="va">self</span>.registers.point_cloud_2 <span class="op">=</span> <span class="op">\</span></span>
<span id="cb37-232"><a href="#cb37-232" aria-hidden="true" tabindex="-1"></a>                split_address(<span class="va">self</span>.buf_point_cloud.device_address)</span>
<span id="cb37-233"><a href="#cb37-233" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-234"><a href="#cb37-234" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Allocate the Tensor for output</span></span>
<span id="cb37-235"><a href="#cb37-235" aria-hidden="true" tabindex="-1"></a>        out <span class="op">=</span> torch.empty(size<span class="op">=</span>(batch_size, <span class="dv">40</span>),</span>
<span id="cb37-236"><a href="#cb37-236" aria-hidden="true" tabindex="-1"></a>                          dtype<span class="op">=</span>x.dtype, device<span class="op">=</span>x.device)</span>
<span id="cb37-237"><a href="#cb37-237" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-238"><a href="#cb37-238" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Run the inference</span></span>
<span id="cb37-239"><a href="#cb37-239" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.op_mode <span class="op">=</span> PointNetClsZCU104.MODE_INFERENCE</span>
<span id="cb37-240"><a href="#cb37-240" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.registers.num_points <span class="op">=</span> num_points</span>
<span id="cb37-241"><a href="#cb37-241" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-242"><a href="#cb37-242" aria-hidden="true" tabindex="-1"></a>        <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(batch_size):</span>
<span id="cb37-243"><a href="#cb37-243" aria-hidden="true" tabindex="-1"></a>            <span class="co"># Copy the input point cloud</span></span>
<span id="cb37-244"><a href="#cb37-244" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_point_cloud[:num_points, :<span class="dv">3</span>] <span class="op">=</span> x[i].view(<span class="op">-</span><span class="dv">1</span>, <span class="dv">3</span>)</span>
<span id="cb37-245"><a href="#cb37-245" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_point_cloud.sync_to_device()</span>
<span id="cb37-246"><a href="#cb37-246" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-247"><a href="#cb37-247" aria-hidden="true" tabindex="-1"></a>            <span class="co"># Run the inference</span></span>
<span id="cb37-248"><a href="#cb37-248" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.registers.CTRL.AP_START <span class="op">=</span> <span class="dv">1</span></span>
<span id="cb37-249"><a href="#cb37-249" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.wait_for_ip()</span>
<span id="cb37-250"><a href="#cb37-250" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-251"><a href="#cb37-251" aria-hidden="true" tabindex="-1"></a>            <span class="co"># Copy the output logits</span></span>
<span id="cb37-252"><a href="#cb37-252" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.buf_out_logits.sync_from_device()</span>
<span id="cb37-253"><a href="#cb37-253" aria-hidden="true" tabindex="-1"></a>            out[i, :] <span class="op">=</span> torch.from_numpy(<span class="va">self</span>.buf_out_logits)</span>
<span id="cb37-254"><a href="#cb37-254" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-255"><a href="#cb37-255" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> out</span></code></pre></div>
<h3 id="ipコアの初期化">Initializing the IP core</h3>
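<p>The constructor of the <code>PointNetClsZCU104</code> class initializes the IP core and makes it ready for use through the steps listed below.
The steps need not be carried out in exactly this order, as long as the prerequisites of each step are satisfied. Each step is explained in turn.
For details, see the <a
href="https://pynq.readthedocs.io/en/latest/">official Pynq documentation</a>.</p>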
<ol type="1">
<li>Load the bitstream (<code>load_overlay</code>)</li>
<li>Allocate the DRAM buffers
(<code>allocate_block_buffer</code>, <code>pynq.allocate</code>)</li>
<li>Copy the parameters to the DRAM buffers
(<code>write_conv_batchnorm1d_params</code>, <code>write_linear_batchnorm1d_params</code>, <code>write_linear_params</code>)</li>
<li>Set the physical addresses of the DRAM buffers in the registers of the corresponding ports</li>
<li>Synchronize the contents of the DRAM buffers (<code>sync_to_device</code>)</li>
<li>Run the IP core in weight initialization mode to copy the parameters from the DRAM buffers to the on-chip buffers</li>
<li>Wait for the IP core to finish (<code>wait_for_ip</code>)</li>
</ol>
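<p>The class for handling bitstreams is <code>pynq.Overlay</code>; given a file path, it loads the specified bitstream.
In addition to the bitstream with the <code>.bit</code> extension, a hardware handoff file with the <code>.hwh</code> extension is required.
If the bitstream is <code>path/to/X.bit</code>, loading fails with an error unless the corresponding handoff file exists at <code>path/to/X.hwh</code>.
All subsequent operations on the FPGA start from <code>self.overlay</code>, an instance of the <code>pynq.Overlay</code> class.</p>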
<p>After loading the overlay (bitstream),
we retrieve our custom IP core <code>PointNetClsTop</code> and store it in <code>self.net_ip</code>.
The property names of the IP cores correspond to the names of the IPs in the board design
(see <a
href="point-cloud-classification-images/board-design.svg">this image</a>).
For example, the interrupt controller (AXI Interrupt Controller)
can be accessed through the <code>axi_intc_0</code> property.
By default, an IP core is handled through the <code>pynq.DefaultIP</code> class.
You can also subclass it and add methods that make your own IP core more convenient to use.
Next, we retrieve <code>register_map</code>, the interface for accessing the control registers of the IP core
(a subclass of <code>pynq.registers.RegisterMap</code>),
and store it in <code>self.registers</code>.</p>
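<p>The next three lines look up the address width and data width of the IP core's input/output ports and store them in <code>self.axi_m_addr_width</code> and <code>self.axi_m_data_width</code>.
The former is 64, and the latter is either 32 or 64
(64 if the port type is <code>ap_uint&lt;64&gt;*</code>, and 32 if it is left as <code>float*</code>).
As mentioned earlier, a point cloud buffer of size <span
class="math inline">\((N,
3)\)</span> suffices for a 32-bit port, whereas a 64-bit port reads two values at a time, so the buffer size must be <span
class="math inline">\((N, 4)\)</span>.
The point cloud buffer size can therefore be determined from <code>self.axi_m_data_width</code>.</p>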
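<p>As a rough illustration of the buffer-shape rule above, the following sketch uses a plain NumPy array in place of a <code>pynq.buffer.PynqBuffer</code> (an assumption made so that it runs without an FPGA board). The helper name <code>point_cloud_buffer_shape</code> is hypothetical and does not appear in the original driver.</p>

```python
import numpy as np


def point_cloud_buffer_shape(num_points, axi_m_data_width):
    """Return the point-cloud buffer shape for a given AXI master data width."""
    if axi_m_data_width == 32:
        # Values are read one float at a time: (x, y, z) can be stored back to back
        return (num_points, 3)
    if axi_m_data_width == 64:
        # Values are read two floats at a time: pad each point to four floats
        return (num_points, 4)
    raise RuntimeError("Unexpected data width for AXI master")


# NumPy array standing in for the DRAM buffer (hypothetical, no FPGA needed)
points = np.random.rand(1024, 3).astype(np.float32)
buf = np.zeros(point_cloud_buffer_shape(1024, 64), dtype=np.float32)
# Only the first three columns carry coordinates; the fourth is padding
buf[:, :3] = points
```

<p>Writing only to the first three columns is what lets the same copy statement serve both port widths.</p>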
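<p>Next, we allocate the DRAM buffers that hold the parameters and the input/output data.
These buffers are somewhat special: they are allocated by a Linux kernel feature called CMA (Contiguous Memory
Allocator).
A buffer allocated with the usual <code>malloc()</code> or <code>new</code> is known only by its virtual address.
The FPGA, however, accesses buffers by physical address, so the physical address must be known in advance in addition to the virtual address.</p>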
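<p>As its name suggests, the <code>allocate_linear_buffer</code> function allocates a buffer for the parameters of a fully connected layer
(input dimension <code>in_dims</code>, output dimension <code>out_dims</code>).
It first determines the buffer size by adding the number of elements in the weight
(<code>in_dims * out_dims</code>) and the bias (<code>out_dims</code>).
It then calls <code>pynq.allocate</code> to allocate a one-dimensional buffer of that size with the data type <code>np.float32</code>
(<code>float</code>).
The buffer is placed in a special region of DRAM and is guaranteed to be contiguous in memory.
The <code>allocate_block_buffer</code> function allocates a buffer that holds the parameters of a fully connected layer and a batch normalization layer.
It sums the number of elements of all parameters to determine the size and allocates a one-dimensional buffer with <code>pynq.allocate</code>.
These buffers are instances of the <code>pynq.buffer.PynqBuffer</code> class, but they can be used just like NumPy arrays
(<code>np.ndarray</code>).
For example, they can be converted to PyTorch tensors with <code>torch.from_numpy</code>.</p>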
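<p>The size calculation described above can be sketched as follows. This is only an illustration: <code>np.zeros</code> stands in for <code>pynq.allocate</code>, the function names <code>linear_buffer_size</code> and <code>block_buffer_size</code> are hypothetical, the layer sizes are arbitrary, and the block size assumes that each of the three batch normalization vectors (scale, bias, mean) has <code>out_dims</code> elements.</p>

```python
import numpy as np


def linear_buffer_size(in_dims, out_dims):
    # Weight (in_dims * out_dims) followed by bias (out_dims)
    return in_dims * out_dims + out_dims


def block_buffer_size(in_dims, out_dims):
    # Fully connected part plus batch-norm scale, bias, and mean
    return linear_buffer_size(in_dims, out_dims) + 3 * out_dims


# np.zeros stands in for pynq.allocate(shape=(size,), dtype=np.float32)
buf_linear = np.zeros((linear_buffer_size(256, 40),), dtype=np.float32)
buf_block = np.zeros((block_buffer_size(3, 64),), dtype=np.float32)
print(buf_linear.size, buf_block.size)
```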
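<p>We allocate the parameter buffers for the feature extraction network
(<code>buf_feat_params1</code> through <code>buf_feat_params5</code>)
and the classification network
(<code>buf_cls_params1</code> through <code>buf_cls_params3</code>).
We then allocate the buffers for the input (point cloud) and the output
(logits).
As described above, the input buffer has shape <code>(self.num_points, 4)</code> when the port is 64 bits wide and <code>(self.num_points, 3)</code> when it is 32 bits wide.</p>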
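<p>Once the DRAM buffers are allocated, the model parameters are copied into them.
The model is an instance of the <code>PointNetCls</code> class and is passed as the constructor argument <code>model_cpu</code>.
<code>write_conv1d_params</code> and <code>write_linear_params</code> copy the parameters of <code>torch.nn.Conv1d</code> and <code>torch.nn.Linear</code>, respectively.
<code>write_conv1d_params</code> assumes that the kernel size is 1
(so that the layer behaves the same as the fully connected layer <code>torch.nn.Linear</code>).
The weight and then the bias are laid out, in that order, in the given one-dimensional DRAM buffer.
Great care is needed so that the data is arranged exactly as the IP core expects.
These two functions are written to match <code>ReadLinearParamsNaive</code> and <code>ReadLinearParamsOpt1</code> in the high-level synthesis implementation.</p>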
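<p><code>write_batchnorm1d_params</code> copies the parameters of <code>torch.nn.BatchNorm1d</code> to the given DRAM buffer.
As shown in <code>ReadBatchNorm1dParamsNaive</code> and <code>ReadBatchNorm1dParamsOpt1</code>, the IP core expects the parameters to be arranged in the order scale, bias, mean.
The scale is computed from the variance and the weight of the batch normalization layer
(the formula was given earlier).</p>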
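<p>A minimal sketch of this packing is shown below, using NumPy arrays instead of PyTorch tensors and a DRAM buffer. It assumes the usual folding <code>scale = weight / sqrt(var + eps)</code> with the PyTorch default <code>eps = 1e-5</code>; the function name <code>pack_batchnorm1d_params</code> is hypothetical and is not the original <code>write_batchnorm1d_params</code>.</p>

```python
import numpy as np


def pack_batchnorm1d_params(weight, bias, running_mean, running_var, eps=1e-5):
    """Pack batch-norm parameters as [scale, bias, mean] in a 1-D array."""
    # Fold the variance and weight into a single per-channel scale
    scale = weight / np.sqrt(running_var + eps)
    return np.concatenate([scale, bias, running_mean]).astype(np.float32)


dims = 4
rng = np.random.default_rng(0)
weight = rng.random(dims)
bias = rng.random(dims)
mean = rng.random(dims)
var = rng.random(dims) + 0.5
packed = pack_batchnorm1d_params(weight, bias, mean, var)

# The consumer can then evaluate y = scale * (x - mean) + bias
x = rng.random(dims)
scale, b, m = packed[:dims], packed[dims:2 * dims], packed[2 * dims:]
y = scale * (x - m) + b
```

<p>Precomputing the scale on the host means the consumer of the buffer needs only a subtraction, a multiplication, and an addition per channel, with no square root or division.</p>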
<p><code>write_conv_batchnorm1d_params</code> and <code>write_linear_batchnorm1d_params</code> copy the parameters of a fully connected layer
(<code>torch.nn.Conv1d</code> or <code>torch.nn.Linear</code>)
and a batch normalization layer (<code>torch.nn.BatchNorm1d</code>)
to the given DRAM buffer.
The weight and bias of the fully connected layer, followed by the scale, bias, and mean of the batch normalization layer, must be arranged in this order.
This layout corresponds to <code>ReadBlockParamsNaive</code>, <code>ReadBlockParamsOpt1</code>, and <code>ReadBlockParamsOpt2</code> on the IP core side.
The model parameters are PyTorch tensors, but they can be assigned directly to a DRAM buffer
(<code>pynq.buffer.PynqBuffer</code>).</p>
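<p>With the parameters copied, we now set the physical addresses of the DRAM buffers.
The top function of the IP core, <code>PointNetClsTop</code>, was declared as follows
(a variant using <code>ap_uint&lt;64&gt;*</code> instead of <code>float*</code> also exists).</p>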
<div class="sourceCode" id="cb38"><pre
class="sourceCode c++"><code class="sourceCode cpp"><span id="cb38-1"><a href="#cb38-1" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> PointNetClsTop<span class="op">(</span><span class="at">const</span> <span class="dt">int</span> op_mode<span class="op">,</span></span>
<span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> point_cloud<span class="op">,</span></span>
<span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">int</span> num_points<span class="op">,</span></span>
<span id="cb38-4"><a href="#cb38-4" aria-hidden="true" tabindex="-1"></a>                    <span class="dt">float</span><span class="op">*</span> out_logits<span class="op">,</span></span>
<span id="cb38-5"><a href="#cb38-5" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params1<span class="op">,</span></span>
<span id="cb38-6"><a href="#cb38-6" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params2<span class="op">,</span></span>
<span id="cb38-7"><a href="#cb38-7" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params3<span class="op">,</span></span>
<span id="cb38-8"><a href="#cb38-8" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params4<span class="op">,</span></span>
<span id="cb38-9"><a href="#cb38-9" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> feat_params5<span class="op">,</span></span>
<span id="cb38-10"><a href="#cb38-10" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params1<span class="op">,</span></span>
<span id="cb38-11"><a href="#cb38-11" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params2<span class="op">,</span></span>
<span id="cb38-12"><a href="#cb38-12" aria-hidden="true" tabindex="-1"></a>                    <span class="at">const</span> <span class="dt">float</span><span class="op">*</span> cls_params3<span class="op">)</span></span>
<span id="cb38-13"><a href="#cb38-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb38-14"><a href="#cb38-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0</span></span>
<span id="cb38-15"><a href="#cb38-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0</span></span>
<span id="cb38-16"><a href="#cb38-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0</span></span>
<span id="cb38-17"><a href="#cb38-17" aria-hidden="true" tabindex="-1"></a><span class="co">// ...</span></span>
<span id="cb38-18"><a href="#cb38-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0</span></span>
<span id="cb38-19"><a href="#cb38-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb38-20"><a href="#cb38-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=op_mode bundle=control</span></span>
<span id="cb38-21"><a href="#cb38-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control</span></span>
<span id="cb38-22"><a href="#cb38-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=num_points bundle=control</span></span>
<span id="cb38-23"><a href="#cb38-23" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=out_logits bundle=control</span></span>
<span id="cb38-24"><a href="#cb38-24" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control</span></span>
<span id="cb38-25"><a href="#cb38-25" aria-hidden="true" tabindex="-1"></a><span class="co">// ...</span></span>
<span id="cb38-26"><a href="#cb38-26" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control</span></span>
<span id="cb38-27"><a href="#cb38-27" aria-hidden="true" tabindex="-1"></a><span class="pp">#pragma HLS INTERFACE s_axilite port=return bundle=control</span></span>
<span id="cb38-28"><a href="#cb38-28" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
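<p>All input/output ports for DRAM buffers, that is, every port except <code>op_mode</code> and <code>num_points</code>, carry both <code>#pragma HLS INTERFACE m_axi</code> and <code>#pragma HLS INTERFACE s_axilite</code>.
With these two HLS pragmas, a control register for specifying the physical address of the DRAM buffer is created for each port.
Addresses are 64 bits wide, but the control registers have a 32-bit data width, so two control registers are provided, one for the upper 32 bits and one for the lower 32 bits.
For example, the <code>point_cloud</code> port has two registers: <code>point_cloud_1</code>
(lower 32 bits) and <code>point_cloud_2</code> (upper 32 bits).
Setting the physical address of a DRAM buffer binds the port to that buffer, so that the FPGA can access it.
With the Pynq library this looks like an ordinary assignment, but it is actually implemented with memory-mapped I/O.
In other words, each control register is assigned its own address, and reads and writes go to that address.
The control registers are accessed through <code>self.registers</code>, introduced above.</p>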
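<p>Splitting a 64-bit address into the two 32-bit register values can be sketched as below. The driver calls a helper named <code>split_address</code>; its definition is not reproduced here, so <code>split_address_sketch</code> is a hypothetical stand-in that returns the halves in the order (lower, upper) to match <code>point_cloud_1</code> and <code>point_cloud_2</code>. The address value is arbitrary.</p>

```python
def split_address_sketch(addr):
    """Split a 64-bit address into (lower 32 bits, upper 32 bits)."""
    # divmod by 2**32 yields the upper half as quotient, lower half as remainder
    upper, lower = divmod(addr, 2 ** 32)
    return lower, upper


# Arbitrary example address (not taken from a real board)
lower, upper = split_address_sketch(0x0000_0004_7FF0_1000)
print(hex(lower), hex(upper))
```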
<p><code>op_mode</code> and <code>num_points</code> also have <code>#pragma HLS INTERFACE s_axilite</code>, so control registers are created for setting these two values
(the operation mode and the number of points).</p>
<p>At this point, we call the <code>sync_to_device</code> method to synchronize the contents of the DRAM buffers so that the FPGA reads the correct data.</p>
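<p>Finally, we set the operation mode <code>op_mode</code> to <strong>weight initialization</strong> and write 1 to <code>CTRL.AP_START</code> in the control registers to start the IP core.
In weight initialization mode, the IP core reads the parameters from the DRAM buffers and stores them in the on-chip buffers.
The <code>CTRL</code> register, which lets software control the IP core, exists thanks to the line <code>#pragma HLS INTERFACE s_axilite port=return bundle=control</code>.
After starting the IP core, we call the <code>wait_for_ip</code> method to wait for it to finish
(that is, for the parameter transfer to complete).
Inside <code>wait_for_ip</code>, we busy-wait until <code>AP_DONE</code> in the <code>CTRL</code> register becomes 1.
This completes the initialization.</p>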
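<p>The start-and-poll handshake can be tried without hardware using a mock register. In the sketch below, <code>FakeCtrl</code> is a hypothetical stand-in for the <code>CTRL</code> register, and the timeout is an addition that the original <code>wait_for_ip</code> method does not have; it is included only to show one way of avoiding an endless loop if the IP core never finishes.</p>

```python
import time


class FakeCtrl:
    """Hypothetical stand-in for the CTRL register of the IP core."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
        self.AP_START = 0

    @property
    def AP_DONE(self):
        # Report completion after a fixed number of polls
        if self._remaining > 0:
            self._remaining -= 1
            return 0
        return 1


def wait_for_ip(ctrl, timeout_sec=1.0):
    """Poll AP_DONE until it becomes 1 or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while ctrl.AP_DONE == 0:
        if time.monotonic() > deadline:
            raise TimeoutError("IP core did not finish in time")
    return True


ctrl = FakeCtrl(polls_until_done=3)
ctrl.AP_START = 1
finished = wait_for_ip(ctrl)
```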
<h3 id="推論">Inference</h3>
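<p>Initialization involves many steps and is rather tedious, but inference is comparatively simple.
As with an ordinary PyTorch module, the inference logic is written in the <code>forward</code> method.
The input point cloud <code>x</code> is assumed to be a batch of size <span class="math inline">\((B, N,
3)\)</span> (<span
class="math inline">\(B\)</span> is the batch size and <span
class="math inline">\(N\)</span> is the number of points).
The IP core is not designed to handle batched data, so the samples in a batch are processed one at a time.
The output <code>out</code> has size <span
class="math inline">\((B, K)\)</span>, where <span
class="math inline">\(K\)</span> is the number of object classes.
We use the ModelNet40 dataset, so the number of classes is <span
class="math inline">\(K = 40\)</span>.</p>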
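<p>First, if the number of points <span
class="math inline">\(N\)</span> exceeds the capacity of the DRAM buffer currently allocated for the point cloud, the buffer is reallocated.
Then, inference is run on each sample in the batch to compute the logits
(scores) for each object class.
The point cloud data is copied to the DRAM buffer <code>buf_point_cloud</code>, and the buffer is synchronized so that the FPGA can read it correctly.
On the software side, the width of the input/output ports (32 or 64 bits)
rarely needs to be taken into account. The two control registers
(the operation mode <code>op_mode</code> and the number of points <code>num_points</code>)
are set beforehand.</p>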
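<p>Writing 1 to <code>AP_START</code> in the <code>CTRL</code> register starts the IP core in <strong>inference</strong> mode.
The <code>wait_for_ip</code> method then waits for it to finish.
The logits produced by the model have been written by the IP core to the DRAM buffer <code>buf_out_logits</code>, so we convert them to a PyTorch tensor and write them into the output tensor <code>out</code>.
That concludes the description of the inference procedure.</p>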
<p>As we have seen, implementing the IP core is not enough; a driver for actually using it is also required, which takes extra effort.
Here, the Pynq library made the FPGA-related processing relatively easy to write.
In addition, the driver was written as a PyTorch module
(<code>torch.nn.Module</code>) so that it can be used in the same way as the CPU and GPU versions of the model.
It is of course also possible to use C++ instead of Python.
In that case, loading the bitstream (<a
href="https://github.com/sterngerlach/my-lidar-graph-slam-v2/blob/b271f4f13050f2f7aced3feb3c37253f287ee006/src/my_lidar_graph_slam/hw/bitstream_loader.cpp">for example, this file</a>), setting up memory-mapped I/O
(<a
href="https://github.com/sterngerlach/my-lidar-graph-slam-v2/blob/b271f4f13050f2f7aced3feb3c37253f287ee006/src/my_lidar_graph_slam/hw/mmio.cpp">for example, this file</a>), allocating the DRAM buffers
(<a
href="https://github.com/sterngerlach/my-lidar-graph-slam-v2/blob/b271f4f13050f2f7aced3feb3c37253f287ee006/src/my_lidar_graph_slam/hw/cma_memory.cpp">for example, this file</a>), and so on must be written in C++
(as I recall, I ported the Pynq library more or less directly).</p>
<h1 id="評価">Evaluation</h1>
<h2 id="推論時間の比較">Comparison of inference times</h2>
<p>We can finally move on to evaluating the IP core.
First, let us compare the inference times, using the following source code
(<code>host/time_zcu104.py</code>).</p>
<div class="sourceCode" id="cb39"><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> main():</span>
<span id="cb39-2"><a href="#cb39-2" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Parse the command-line arguments</span></span>
<span id="cb39-3"><a href="#cb39-3" aria-hidden="true" tabindex="-1"></a>    args <span class="op">=</span> parse_command_line()</span>
<span id="cb39-4"><a href="#cb39-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-5"><a href="#cb39-5" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Create a PointNet classification model</span></span>
<span id="cb39-6"><a href="#cb39-6" aria-hidden="true" tabindex="-1"></a>    model <span class="op">=</span> PointNetCls(num_classes<span class="op">=</span><span class="dv">40</span>)</span>
<span id="cb39-7"><a href="#cb39-7" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Create an FPGA model</span></span>
<span id="cb39-8"><a href="#cb39-8" aria-hidden="true" tabindex="-1"></a>    model_zcu104 <span class="op">=</span> PointNetClsZCU104(model, args.bitstream, args.num_points)</span>
<span id="cb39-9"><a href="#cb39-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-10"><a href="#cb39-10" aria-hidden="true" tabindex="-1"></a>    model.<span class="bu">eval</span>()</span>
<span id="cb39-11"><a href="#cb39-11" aria-hidden="true" tabindex="-1"></a>    model_zcu104.<span class="bu">eval</span>()</span>
<span id="cb39-12"><a href="#cb39-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-13"><a href="#cb39-13" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Test the output</span></span>
<span id="cb39-14"><a href="#cb39-14" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Create a random input point cloud</span></span>
<span id="cb39-15"><a href="#cb39-15" aria-hidden="true" tabindex="-1"></a>    point_cloud <span class="op">=</span> torch.rand(size<span class="op">=</span>(<span class="dv">1</span>, args.num_points, <span class="dv">3</span>))</span>
<span id="cb39-16"><a href="#cb39-16" aria-hidden="true" tabindex="-1"></a>    out_cpu <span class="op">=</span> model(point_cloud)</span>
<span id="cb39-17"><a href="#cb39-17" aria-hidden="true" tabindex="-1"></a>    out_zcu104 <span class="op">=</span> model_zcu104(point_cloud)</span>
<span id="cb39-18"><a href="#cb39-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-19"><a href="#cb39-19" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Output (CPU):</span><span class="ch">\n</span><span class="sc">{</span>out_cpu<span class="sc">}</span><span class="ss">&quot;</span>)</span>
<span id="cb39-20"><a href="#cb39-20" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Output (FPGA):</span><span class="ch">\n</span><span class="sc">{</span>out_zcu104<span class="sc">}</span><span class="ss">&quot;</span>)</span>
<span id="cb39-21"><a href="#cb39-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-22"><a href="#cb39-22" aria-hidden="true" tabindex="-1"></a>    <span class="co"># Measure the inference times</span></span>
<span id="cb39-23"><a href="#cb39-23" aria-hidden="true" tabindex="-1"></a>    times_cpu <span class="op">=</span> []</span>
<span id="cb39-24"><a href="#cb39-24" aria-hidden="true" tabindex="-1"></a>    times_zcu104 <span class="op">=</span> []</span>
<span id="cb39-25"><a href="#cb39-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-26"><a href="#cb39-26" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> _ <span class="kw">in</span> <span class="bu">range</span>(args.runs):</span>
<span id="cb39-27"><a href="#cb39-27" aria-hidden="true" tabindex="-1"></a>        <span class="co"># Create a random input point cloud</span></span>
<span id="cb39-28"><a href="#cb39-28" aria-hidden="true" tabindex="-1"></a>        point_cloud <span class="op">=</span> torch.rand(size<span class="op">=</span>(<span class="dv">1</span>, args.num_points, <span class="dv">3</span>))</span>
<span id="cb39-29"><a href="#cb39-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-30"><a href="#cb39-30" aria-hidden="true" tabindex="-1"></a>        t0 <span class="op">=</span> time.monotonic()</span>
<span id="cb39-31"><a href="#cb39-31" aria-hidden="true" tabindex="-1"></a>        model(point_cloud)</span>
<span id="cb39-32"><a href="#cb39-32" aria-hidden="true" tabindex="-1"></a>        elapsed_cpu <span class="op">=</span> (time.monotonic() <span class="op">-</span> t0) <span class="op">*</span> <span class="fl">1e3</span></span>
<span id="cb39-33"><a href="#cb39-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-34"><a href="#cb39-34" aria-hidden="true" tabindex="-1"></a>        t0 <span class="op">=</span> time.monotonic()</span>
<span id="cb39-35"><a href="#cb39-35" aria-hidden="true" tabindex="-1"></a>        model_zcu104(point_cloud)</span>
<span id="cb39-36"><a href="#cb39-36" aria-hidden="true" tabindex="-1"></a>        elapsed_zcu104 <span class="op">=</span> (time.monotonic() <span class="op">-</span> t0) <span class="op">*</span> <span class="fl">1e3</span></span>
<span id="cb39-37"><a href="#cb39-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-38"><a href="#cb39-38" aria-hidden="true" tabindex="-1"></a>        times_cpu.append(elapsed_cpu)</span>
<span id="cb39-39"><a href="#cb39-39" aria-hidden="true" tabindex="-1"></a>        times_zcu104.append(elapsed_zcu104)</span>
<span id="cb39-40"><a href="#cb39-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-41"><a href="#cb39-41" aria-hidden="true" tabindex="-1"></a>    time_avg_cpu <span class="op">=</span> np.mean(times_cpu)</span>
<span id="cb39-42"><a href="#cb39-42" aria-hidden="true" tabindex="-1"></a>    time_std_cpu <span class="op">=</span> np.std(times_cpu)</span>
<span id="cb39-43"><a href="#cb39-43" aria-hidden="true" tabindex="-1"></a>    time_avg_zcu104 <span class="op">=</span> np.mean(times_zcu104)</span>
<span id="cb39-44"><a href="#cb39-44" aria-hidden="true" tabindex="-1"></a>    time_std_zcu104 <span class="op">=</span> np.std(times_zcu104)</span>
<span id="cb39-45"><a href="#cb39-45" aria-hidden="true" tabindex="-1"></a>    speedup_factor <span class="op">=</span> time_avg_cpu <span class="op">/</span> time_avg_zcu104</span>
<span id="cb39-46"><a href="#cb39-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-47"><a href="#cb39-47" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Inference time (CPU): &quot;</span> \</span>
<span id="cb39-48"><a href="#cb39-48" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;mean: </span><span class="sc">{</span>time_avg_cpu<span class="sc">:.3f}</span><span class="ss">ms, &quot;</span> \</span>
<span id="cb39-49"><a href="#cb39-49" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;std: </span><span class="sc">{</span>time_std_cpu<span class="sc">:.3f}</span><span class="ss">ms&quot;</span>)</span>
<span id="cb39-50"><a href="#cb39-50" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Inference time (FPGA): &quot;</span> \</span>
<span id="cb39-51"><a href="#cb39-51" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;mean: </span><span class="sc">{</span>time_avg_zcu104<span class="sc">:.3f}</span><span class="ss">ms, &quot;</span> \</span>
<span id="cb39-52"><a href="#cb39-52" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;std: </span><span class="sc">{</span>time_std_zcu104<span class="sc">:.3f}</span><span class="ss">ms&quot;</span>)</span>
<span id="cb39-53"><a href="#cb39-53" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Speedup: </span><span class="sc">{</span>speedup_factor<span class="sc">:.3f}</span><span class="ss">x&quot;</span>)</span></code></pre></div>
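<p>Accuracy is not the concern here, so the code that loads a trained model is omitted.
However, the CPU model <code>PointNetCls</code> and the FPGA model <code>PointNetClsZCU104</code> must share the same parameters.
The CPU model must also run in <code>eval</code> mode.
Otherwise, the batch normalization layers behave as in training and raise an error when the batch size is 1.
In training mode, the mean and standard deviation are also computed from the input batch rather than taken from the trained parameters, so the output no longer matches that of the FPGA model.
Inference time is measured <code>args.runs</code> times, and the mean, the standard deviation, and the speedup are computed from the measurements.
Before timing, the script also checks that the two models produce matching
(roughly equal) outputs
(strictly speaking, this is also tested when the IP core is created).</p>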
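<p>The difference between the two batch normalization modes can be sketched in plain Python. This is a simplified single-channel stand-in written for illustration, not PyTorch's implementation: the function name <code>batch_norm_1ch</code> and the numbers are hypothetical, and the learnable scale and shift are left out.</p>

```python
import math


def batch_norm_1ch(batch, running_mean, running_var, training, eps=1e-5):
    """Normalize one channel of a batch, roughly like a batch-norm layer.

    Simplified stand-in for illustration only (no learnable scale/shift,
    no running-statistics update).
    """
    if training:
        # Training mode: statistics come from the current batch.
        if len(batch) == 1:
            # PyTorch similarly refuses a single value per channel here.
            raise ValueError("training mode needs more than 1 value per channel")
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
    else:
        # Eval mode: statistics are the fixed, trained running estimates.
        mean, var = running_mean, running_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


# Eval mode works for a single sample and depends only on trained statistics.
y_eval = batch_norm_1ch([2.0], running_mean=1.0, running_var=4.0, training=False)

# Training mode with a single sample fails.
try:
    batch_norm_1ch([2.0], running_mean=1.0, running_var=4.0, training=True)
    failed = False
except ValueError:
    failed = True

# Training mode with a larger batch ignores the trained statistics entirely.
y_train = batch_norm_1ch([1.0, 3.0], running_mean=1.0, running_var=4.0, training=True)
print(y_eval, failed, y_train)
```

<p>Since the FPGA model always uses the trained statistics, only the eval-mode path can agree with it.</p>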
<p>Run the following commands on the FPGA board.</p>
<pre><code>&gt; cd advent_2022_point_cloud_classification/host

# Naive implementation (150MHz)
&gt; sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_naive_150.bit

# Implementation exploiting data parallelism (loop unrolling and array partitioning applied) (150MHz)
&gt; sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit

# Implementation with the dataflow optimization applied (150MHz)
&gt; sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt2.bit

# Implementation with the input/output ports widened to 64 bits (150MHz)
&gt; sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt3.bit</code></pre>
<p>An example of the output for the naive implementation is shown below.</p>
<pre><code>&gt; sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_naive_150.bit
Output (CPU):
tensor([[-0.0594, -0.0272,  0.0115, -0.0481, -0.0529,  0.0449, -0.0634, -0.0328,
          0.0348, -0.0071, -0.0228,  0.0412,  0.0128, -0.0175, -0.0086, -0.0023,
         -0.0192, -0.0101, -0.0072,  0.0520, -0.0106, -0.0110,  0.0113,  0.0499,
         -0.0563, -0.0523, -0.0711, -0.0104, -0.0048, -0.0404,  0.0375,  0.0089,
          0.0326, -0.0408, -0.0302, -0.0041,  0.0534, -0.0349,  0.0380, -0.0020]],
       grad_fn=&lt;AddmmBackward0&gt;)
Output (FPGA):
tensor([[-0.0592, -0.0274,  0.0114, -0.0491, -0.0527,  0.0446, -0.0632, -0.0335,
          0.0337, -0.0071, -0.0258,  0.0399,  0.0119, -0.0170, -0.0091, -0.0030,
         -0.0216, -0.0112, -0.0106,  0.0522, -0.0111, -0.0130,  0.0114,  0.0487,
         -0.0571, -0.0523, -0.0714, -0.0103, -0.0058, -0.0389,  0.0383,  0.0068,
          0.0306, -0.0421, -0.0314, -0.0052,  0.0539, -0.0360,  0.0399, -0.0031]])
Inference time (CPU): mean: 369.048ms, std: 1.086ms
Inference time (FPGA): mean: 1071.358ms, std: 0.023ms
Speedup: 0.344x</code></pre>
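<p>The CPU model uses <code>float</code>, whereas the FPGA model uses fixed-point numbers
(<code>ap_fixed</code>),
so the outputs differ slightly even for identical model parameters and inputs
(here the fixed-point type is 32 bits wide, with a 16-bit integer part and a 16-bit fractional part).
Still, the CPU and FPGA models produce close outputs
(they agree to about the second decimal place).
For a classification problem, this should be good enough.
As for inference time, the naive implementation is about 3 times slower than the CPU model.</p>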
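<p>The "roughly equal" check can be made concrete with a small sketch. It uses the first eight values of each output tensor printed above; the tolerance and the helper names <code>max_abs_diff</code> and <code>argmax</code> are assumptions made for this illustration and do not come from the original script.</p>

```python
# First eight logits copied from the example output above.
out_cpu = [-0.0594, -0.0272, 0.0115, -0.0481, -0.0529, 0.0449, -0.0634, -0.0328]
out_fpga = [-0.0592, -0.0274, 0.0114, -0.0491, -0.0527, 0.0446, -0.0632, -0.0335]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values):
    """Index of the largest element (the predicted class for logits)."""
    return max(range(len(values)), key=lambda i: values[i])


TOLERANCE = 5e-3  # hypothetical threshold, chosen for this illustration

diff = max_abs_diff(out_cpu, out_fpga)
outputs_close = TOLERANCE >= diff
same_prediction = argmax(out_cpu) == argmax(out_fpga)
print(f"max |diff| = {diff:.4f}, close: {outputs_close}, same class: {same_prediction}")
```

<p>For classification, only the position of the largest logit matters, which is why small fixed-point errors like these are usually harmless. Note that the argmax here covers only the eight-value excerpt, not all 40 classes.</p>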
<p>The inference times of each implementation are summarized below.</p>
<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Implementation</th>
<th style="text-align: left;">Mean inference time (ms)</th>
<th style="text-align: left;">Standard deviation (ms)</th>
<th style="text-align: left;">Speedup (vs. software)</th>
<th style="text-align: left;">Speedup (vs. naive implementation)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">CPU</td>
<td style="text-align: left;">369.0</td>
<td style="text-align: left;">1.086</td>
<td style="text-align: left;"><strong>1.0x</strong></td>
<td style="text-align: left;">2.904x</td>
</tr>
<tr class="even">
<td style="text-align: left;">Naive (100MHz)</td>
<td style="text-align: left;">1606.4</td>
<td style="text-align: left;">0.041</td>
<td style="text-align: left;">0.230x</td>
<td style="text-align: left;">0.667x</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Naive (150MHz)</td>
<td style="text-align: left;">1071.4</td>
<td style="text-align: left;">0.023</td>
<td style="text-align: left;">0.344x</td>
<td style="text-align: left;"><strong>1.0x</strong></td>
</tr>
<tr class="even">
<td style="text-align: left;">Naive (200MHz)</td>
<td style="text-align: left;">872.05</td>
<td style="text-align: left;">0.077</td>
<td style="text-align: left;">0.423x</td>
<td style="text-align: left;">1.229x</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Naive (250MHz)</td>
<td style="text-align: left;">665.33</td>
<td style="text-align: left;">0.073</td>
<td style="text-align: left;">0.555x</td>
<td style="text-align: left;">1.610x</td>
</tr>
<tr class="even">
<td style="text-align: left;">Data parallelism (150MHz)</td>
<td style="text-align: left;">34.60</td>
<td style="text-align: left;">0.027</td>
<td style="text-align: left;">10.66x</td>
<td style="text-align: left;">30.97x</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Dataflow optimization (150MHz)</td>
<td style="text-align: left;">12.93</td>
<td style="text-align: left;">0.016</td>
<td style="text-align: left;">28.54x</td>
<td style="text-align: left;">82.86x</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz)</td>
<td style="text-align: left;"><strong>10.80</strong></td>
<td style="text-align: left;"><strong>0.012</strong></td>
<td style="text-align: left;"><strong>34.17x</strong></td>
<td style="text-align: left;"><strong>99.20x</strong></td>
</tr>
</tbody>
</table>
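<p>The naive implementation (150MHz) reaches only 0.344 times the performance of the CPU.
Left as is, it remains slower than the CPU even when the clock frequency is raised to 250MHz.
Exploiting data parallelism cuts the inference time by a factor of 30.97 and makes the design 10.66 times faster than the CPU.
According to the clock cycle counts reported by Vitis HLS, the naive implementation
(150MHz) takes 161,945,604 cycles (1.079s) and the parallelized implementation takes 4,462,596 cycles
(29.72ms).
The measured times are 1.071s and 34.60ms respectively, so the estimates are roughly right.
Applying the dataflow optimization to the feature extraction network shortens the inference time by a further factor of 2.68, for a speedup of 28.54x over the CPU and 82.86x over the original naive implementation.
Widening the ports from 32 to 64 bits then speeds up mainly the classification network.
The inference time drops by another factor of 1.20, giving a speedup of 34.17x over the CPU and 99.20x over the original naive implementation.
Each optimization thus yields a steady gain.
What is more, most of the work consists of inserting HLS pragmas, which makes it very easy.</p>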
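<p>The conversion from a cycle count to an estimated latency is simple arithmetic, sketched below. The helper <code>cycles_to_ms</code> is a hypothetical name, and the sketch assumes an exact 150MHz clock, so its results can differ in the last digits from the figures quoted in the text (the tool may use a rounded clock period).</p>

```python
def cycles_to_ms(cycles: int, freq_mhz: float) -> float:
    """Convert a clock cycle count to milliseconds at the given frequency."""
    return cycles / (freq_mhz * 1e3)  # MHz * 1e3 = cycles per millisecond


# (cycle count reported by Vitis HLS, measured mean time in ms) from the text.
designs = {
    "naive": (161_945_604, 1071.4),
    "data-parallel": (4_462_596, 34.60),
}

for name, (cycles, measured_ms) in designs.items():
    estimated_ms = cycles_to_ms(cycles, freq_mhz=150.0)
    ratio = measured_ms / estimated_ms
    print(f"{name}: estimated {estimated_ms:.2f} ms, "
          f"measured {measured_ms:.2f} ms, measured/estimated = {ratio:.2f}")
```

<p>The measured time of the parallelized design exceeds the estimate by a wider margin than that of the naive design. A plausible reason, though not one verified here, is that fixed overheads such as data transfer weigh more once the computation itself becomes short.</p>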
<h2 id="精度">Accuracy</h2>
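<p>Next, let us look at the classification accuracy of the model.
We use the test split of the ModelNet40 dataset.
The dataset can be downloaded <a
href="https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip">here</a>.
Each sample is a point cloud of 2048 points obtained from a CAD model of a single object, such as an airplane, a car, a laptop, or a person.
We use the following source code (<code>host/test_zcu104.py</code>).
See the GitHub repository for dataset processing and model training.</p>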
<div class="sourceCode" id="cb42"><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> test(args: argparse.Namespace,</span>
<span id="cb42-2"><a href="#cb42-2" aria-hidden="true" tabindex="-1"></a>         model: torch.nn.Module,</span>
<span id="cb42-3"><a href="#cb42-3" aria-hidden="true" tabindex="-1"></a>         model_zcu104: torch.nn.Module,</span>
<span id="cb42-4"><a href="#cb42-4" aria-hidden="true" tabindex="-1"></a>         test_loader: torch.utils.data.DataLoader):</span>
<span id="cb42-5"><a href="#cb42-5" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Testing PointNet ...&quot;</span>)</span>
<span id="cb42-6"><a href="#cb42-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-7"><a href="#cb42-7" aria-hidden="true" tabindex="-1"></a>    <span class="co"># model.eval()</span></span>
<span id="cb42-8"><a href="#cb42-8" aria-hidden="true" tabindex="-1"></a>    model_zcu104.<span class="bu">eval</span>()</span>
<span id="cb42-9"><a href="#cb42-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-10"><a href="#cb42-10" aria-hidden="true" tabindex="-1"></a>    <span class="co"># test_loss_total = 0.0</span></span>
<span id="cb42-11"><a href="#cb42-11" aria-hidden="true" tabindex="-1"></a>    <span class="co"># correct = 0</span></span>
<span id="cb42-12"><a href="#cb42-12" aria-hidden="true" tabindex="-1"></a>    test_loss_total_zcu104 <span class="op">=</span> <span class="fl">0.0</span></span>
<span id="cb42-13"><a href="#cb42-13" aria-hidden="true" tabindex="-1"></a>    correct_zcu104 <span class="op">=</span> <span class="dv">0</span></span>
<span id="cb42-14"><a href="#cb42-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-15"><a href="#cb42-15" aria-hidden="true" tabindex="-1"></a>    <span class="cf">with</span> torch.no_grad():</span>
<span id="cb42-16"><a href="#cb42-16" aria-hidden="true" tabindex="-1"></a>        <span class="cf">for</span> i, batch <span class="kw">in</span> <span class="bu">enumerate</span>(test_loader):</span>
<span id="cb42-17"><a href="#cb42-17" aria-hidden="true" tabindex="-1"></a>            <span class="cf">if</span> i <span class="op">%</span> <span class="dv">5</span> <span class="op">==</span> <span class="dv">0</span>:</span>
<span id="cb42-18"><a href="#cb42-18" aria-hidden="true" tabindex="-1"></a>                <span class="bu">print</span>(<span class="ss">f&quot;Testing batch </span><span class="sc">{</span>i<span class="sc">}</span><span class="ss"> ...&quot;</span>)</span>
<span id="cb42-19"><a href="#cb42-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-20"><a href="#cb42-20" aria-hidden="true" tabindex="-1"></a>            data, target <span class="op">=</span> batch[<span class="st">&quot;points&quot;</span>], batch[<span class="st">&quot;label&quot;</span>]</span>
<span id="cb42-21"><a href="#cb42-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-22"><a href="#cb42-22" aria-hidden="true" tabindex="-1"></a>            <span class="co"># out = model(data)</span></span>
<span id="cb42-23"><a href="#cb42-23" aria-hidden="true" tabindex="-1"></a>            <span class="co"># pred = out.argmax(dim=1, keepdim=True)</span></span>
<span id="cb42-24"><a href="#cb42-24" aria-hidden="true" tabindex="-1"></a>            <span class="co"># loss = F.cross_entropy(out, target)</span></span>
<span id="cb42-25"><a href="#cb42-25" aria-hidden="true" tabindex="-1"></a>            <span class="co"># correct += pred.eq(target.view_as(pred)).sum().item()</span></span>
<span id="cb42-26"><a href="#cb42-26" aria-hidden="true" tabindex="-1"></a>            <span class="co"># test_loss_total += loss.item() * len(data)</span></span>
<span id="cb42-27"><a href="#cb42-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-28"><a href="#cb42-28" aria-hidden="true" tabindex="-1"></a>            out_zcu104 <span class="op">=</span> model_zcu104(data)</span>
<span id="cb42-29"><a href="#cb42-29" aria-hidden="true" tabindex="-1"></a>            pred_zcu104 <span class="op">=</span> out_zcu104.argmax(dim<span class="op">=</span><span class="dv">1</span>, keepdim<span class="op">=</span><span class="va">True</span>)</span>
<span id="cb42-30"><a href="#cb42-30" aria-hidden="true" tabindex="-1"></a>            loss_zcu104 <span class="op">=</span> F.cross_entropy(out_zcu104, target)</span>
<span id="cb42-31"><a href="#cb42-31" aria-hidden="true" tabindex="-1"></a>            correct_zcu104 <span class="op">+=</span> pred_zcu104.eq(</span>
<span id="cb42-32"><a href="#cb42-32" aria-hidden="true" tabindex="-1"></a>                target.view_as(pred_zcu104)).<span class="bu">sum</span>().item()</span>
<span id="cb42-33"><a href="#cb42-33" aria-hidden="true" tabindex="-1"></a>            test_loss_total_zcu104 <span class="op">+=</span> loss_zcu104.item() <span class="op">*</span> <span class="bu">len</span>(data)</span>
<span id="cb42-34"><a href="#cb42-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-35"><a href="#cb42-35" aria-hidden="true" tabindex="-1"></a>    <span class="co"># test_loss_avg = test_loss_total / len(test_loader.dataset)</span></span>
<span id="cb42-36"><a href="#cb42-36" aria-hidden="true" tabindex="-1"></a>    <span class="co"># test_acc = correct * 1e2 / len(test_loader.dataset)</span></span>
<span id="cb42-37"><a href="#cb42-37" aria-hidden="true" tabindex="-1"></a>    test_loss_avg_zcu104 <span class="op">=</span> test_loss_total_zcu104 <span class="op">/</span> <span class="bu">len</span>(test_loader.dataset)</span>
<span id="cb42-38"><a href="#cb42-38" aria-hidden="true" tabindex="-1"></a>    test_acc_zcu104 <span class="op">=</span> correct_zcu104 <span class="op">*</span> <span class="fl">1e2</span> <span class="op">/</span> <span class="bu">len</span>(test_loader.dataset)</span>
<span id="cb42-39"><a href="#cb42-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-40"><a href="#cb42-40" aria-hidden="true" tabindex="-1"></a>    <span class="co"># print(f&quot;Test result (CPU): &quot; \</span></span>
<span id="cb42-41"><a href="#cb42-41" aria-hidden="true" tabindex="-1"></a>    <span class="co">#       f&quot;loss: {test_loss_avg:.6f}, &quot; \</span></span>
<span id="cb42-42"><a href="#cb42-42" aria-hidden="true" tabindex="-1"></a>    <span class="co">#       f&quot;accuracy: {test_acc:.3f}%, &quot; \</span></span>
<span id="cb42-43"><a href="#cb42-43" aria-hidden="true" tabindex="-1"></a>    <span class="co">#       f&quot;correct: {correct}&quot;)</span></span>
<span id="cb42-44"><a href="#cb42-44" aria-hidden="true" tabindex="-1"></a>    <span class="bu">print</span>(<span class="ss">f&quot;Test result (FPGA): &quot;</span> \</span>
<span id="cb42-45"><a href="#cb42-45" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;loss: </span><span class="sc">{</span>test_loss_avg_zcu104<span class="sc">:.6f}</span><span class="ss">, &quot;</span> \</span>
<span id="cb42-46"><a href="#cb42-46" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;accuracy: </span><span class="sc">{</span>test_acc_zcu104<span class="sc">:.3f}</span><span class="ss">%, &quot;</span> \</span>
<span id="cb42-47"><a href="#cb42-47" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;correct: </span><span class="sc">{</span>correct_zcu104<span class="sc">}</span><span class="ss">, &quot;</span> \</span>
<span id="cb42-48"><a href="#cb42-48" aria-hidden="true" tabindex="-1"></a>          <span class="ss">f&quot;total: </span><span class="sc">{</span><span class="bu">len</span>(test_loader.dataset)<span class="sc">}</span><span class="ss">&quot;</span>)</span></code></pre></div>
<p>Run the following commands on the FPGA board.</p>
<pre><code>&gt; cd advent_2022_point_cloud_classification/host

# Implementation exploiting data parallelism (loop unrolling and array partitioning applied) (150MHz)
&gt; sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit

# Implementation with the dataflow optimization applied (150MHz)
&gt; sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt2.bit

# Implementation with the input/output ports widened to 64 bits (150MHz)
&gt; sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt3.bit</code></pre>
<p>An example of the output is shown below.</p>
<pre><code>&gt; sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit
Testing batch 0 ...
Testing batch 5 ...
...
Testing batch 2445 ...
Testing batch 2450 ...
Testing batch 2455 ...
Testing batch 2460 ...
Testing batch 2465 ...
Test result (FPGA): loss: 0.375841, accuracy: 89.506%, correct: 2209, total: 2468</code></pre>
<p>The accuracy of each implementation is summarized below.
The test set contains 2,468 samples in total.
The naive implementation is omitted because it takes too long.</p>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Implementation</th>
<th style="text-align: left;">Correct predictions</th>
<th style="text-align: left;">Accuracy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">CPU</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Data parallelism (150MHz)</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Dataflow optimization (150MHz)</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz)</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
</tbody>
</table>
<p>Every IP core achieves exactly the same accuracy as the model running on the CPU.
Although fixed-point numbers (<code>ap_fixed</code>) are used instead of <code>float</code>, no loss of accuracy is observed so far.</p>
<h2 id="リソース消費">Resource usage</h2>
<p>Let us examine the resource usage of each IP core. Resources fall into five categories: LUT
(look-up table), FF (flip-flop), BRAM (BlockRAM), URAM
(UltraRAM), and DSP (digital signal processing slice).</p>
<p><a
href="point-cloud-classification-images/pointnet-opt3-vivado4.png"><img src="point-cloud-classification-images/pointnet-opt3-vivado4.png" width="80%" /></a></p>
<p>The resource usage is summarized in the table below.</p>
<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Implementation</th>
<th style="text-align: left;">LUT</th>
<th style="text-align: left;">FF</th>
<th style="text-align: left;">BRAM (36Kb)</th>
<th style="text-align: left;">URAM</th>
<th style="text-align: left;">DSP</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Total available</td>
<td style="text-align: left;">230,400</td>
<td style="text-align: left;">460,800</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">96</td>
<td style="text-align: left;">1,728</td>
</tr>
<tr class="even">
<td style="text-align: left;">Naive (100MHz)</td>
<td style="text-align: left;">22,378 (9.71%)</td>
<td style="text-align: left;">11,045 (2.40%)</td>
<td style="text-align: left;">149.5 (47.92%)</td>
<td style="text-align: left;">2 (2.08%)</td>
<td style="text-align: left;">48 (2.78%)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Naive (150MHz)</td>
<td style="text-align: left;">22,140 (9.61%)</td>
<td style="text-align: left;">12,428 (2.70%)</td>
<td style="text-align: left;">161.5 (51.76%)</td>
<td style="text-align: left;">2 (2.08%)</td>
<td style="text-align: left;">48 (2.78%)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Naive (200MHz)</td>
<td style="text-align: left;">21,344 (9.26%)</td>
<td style="text-align: left;">13,616 (2.95%)</td>
<td style="text-align: left;">149.5 (47.92%)</td>
<td style="text-align: left;">2 (2.08%)</td>
<td style="text-align: left;">48 (2.78%)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Naive (250MHz)</td>
<td style="text-align: left;">20,663 (8.97%)</td>
<td style="text-align: left;">14,713 (3.19%)</td>
<td style="text-align: left;">149.5 (47.92%)</td>
<td style="text-align: left;">2 (2.08%)</td>
<td style="text-align: left;">20 (1.16%)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Data parallelism (150MHz)</td>
<td style="text-align: left;">58,223 (25.27%)</td>
<td style="text-align: left;">42,755 (9.28%)</td>
<td style="text-align: left;">287.5 (92.15%)</td>
<td style="text-align: left;">0 (0.00%)</td>
<td style="text-align: left;">768 (44.44%)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Dataflow optimization (150MHz)</td>
<td style="text-align: left;">136,408 (59.20%)</td>
<td style="text-align: left;">48,940 (10.62%)</td>
<td style="text-align: left;">310.5 (99.52%)</td>
<td style="text-align: left;">0 (0.00%)</td>
<td style="text-align: left;">808 (46.76%)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz)</td>
<td style="text-align: left;">84,263 (36.57%)</td>
<td style="text-align: left;">49,660 (10.78%)</td>
<td style="text-align: left;">263.5 (84.46%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">808 (46.76%)</td>
</tr>
</tbody>
</table>
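<p>Exploiting data parallelism requires many multiply-accumulate operations to run in parallel, so DSP usage rises sharply.
By contrast, the dataflow optimization adds little to the resource usage
(LUT usage does rise, however, because BRAM runs short and LUTs are used as LUTRAM instead).
The dataflow optimization thus improves circuit performance while limiting the growth in resource usage.
Widening the ports changes resource usage little apart from URAM
(BRAM ran out and caused an error, so part of the on-chip buffers is implemented in URAM).</p>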
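<p>The percentages in the table are simply usage divided by the available amount. The sketch below recomputes them for three of the designs and reports the tightest resource of each; the dictionary keys and the function names <code>utilization</code> and <code>bottleneck</code> are made up for this illustration.</p>

```python
# Resources available on the device (the "total" row of the table above).
AVAILABLE = {"LUT": 230_400, "FF": 460_800, "BRAM": 312, "URAM": 96, "DSP": 1_728}

# Usage of three designs at 150MHz, copied from the table above.
USAGE = {
    "data-parallel": {"LUT": 58_223, "FF": 42_755, "BRAM": 287.5, "URAM": 0, "DSP": 768},
    "dataflow": {"LUT": 136_408, "FF": 48_940, "BRAM": 310.5, "URAM": 0, "DSP": 808},
    "wider-ports": {"LUT": 84_263, "FF": 49_660, "BRAM": 263.5, "URAM": 64, "DSP": 808},
}


def utilization(usage):
    """Return the percentage of each resource that a design consumes."""
    return {name: 100.0 * used / AVAILABLE[name] for name, used in usage.items()}


def bottleneck(usage):
    """Return the resource with the highest utilization and its percentage."""
    percent = utilization(usage)
    name = max(percent, key=percent.get)
    return name, percent[name]


for design, usage in USAGE.items():
    name, percent = bottleneck(usage)
    print(f"{design}: tightest resource is {name} at {percent:.2f}%")
```

<p>In all three designs the tightest resource is BRAM, which is consistent with part of the buffers having to move to LUTRAM or URAM.</p>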
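<p>This work used the Xilinx ZCU104 Evaluation
Kit, an FPGA board that costs around 200,000 yen. Its FPGA chip (XCZU7EV-2FFVC1156)
provides URAM in addition to BRAM, so relatively large on-chip buffers
(a few MB) can be built. URAM (UltraRAM) blocks are fewer than BRAM (BlockRAM)
blocks
(96 versus 312), but each one holds more data, so URAM is the coarser-grained memory.
Low-cost FPGA boards do not provide URAM, so BRAM has to be used sparingly.
In my experience, BRAM is usually the first resource to run out
(I am still a beginner with FPGAs and cannot yet implement things well).</p>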
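<p>The "a few MB" figure can be checked with a quick calculation. The block counts come from the text and the 36Kb BRAM size from the table header; the 288Kb size of one URAM block is not stated in this article and is an assumption based on my understanding of UltraScale+ devices.</p>

```python
BRAM_BLOCKS, BRAM_KBIT = 312, 36    # block count from the text; 36 Kb per block
URAM_BLOCKS, URAM_KBIT = 96, 288    # block count from the text; 288 Kb per block (assumed)


def capacity_mib(blocks: int, kbit_per_block: int) -> float:
    """Total capacity in MiB for `blocks` memories of `kbit_per_block` Kb each."""
    bits = blocks * kbit_per_block * 1024
    return bits / 8 / 1024 / 1024


bram_mib = capacity_mib(BRAM_BLOCKS, BRAM_KBIT)
uram_mib = capacity_mib(URAM_BLOCKS, URAM_KBIT)
print(f"BRAM: {bram_mib:.2f} MiB, URAM: {uram_mib:.2f} MiB, "
      f"total: {bram_mib + uram_mib:.2f} MiB")
```

<p>Under that assumption, the 96 URAM blocks hold more than twice as much as the 312 BRAM blocks combined, which is what makes URAM the coarser-grained of the two.</p>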
<h2 id="値のビット幅削減">Reducing the bit width of values</h2>
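<p>So far, layer inputs/outputs and model parameters have been represented as 32-bit fixed-point numbers
(16 bits each for the integer and fractional parts).
Can the bit width, and with it the resource usage,
be reduced while keeping the accuracy at a reasonable level? Here we build IP cores
(150MHz) with the following combinations of bit widths.
The IP core is the version that exploits data parallelism, applies the dataflow optimization, and widens the ports.
The bit widths were chosen somewhat arbitrarily.
Model parameters span a narrower range of values than layer inputs/outputs, so their bit width can perhaps be reduced further.</p>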
<table>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Name</th>
<th style="text-align: left;">Layer inputs/outputs (<code>value_t</code>)</th>
<th style="text-align: left;">Model parameters
(<code>param_t</code>)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">28-28</td>
<td style="text-align: left;">28 bits (14 integer + 14 fractional)</td>
<td style="text-align: left;">28 bits (10 integer + 18 fractional)</td>
</tr>
<tr class="even">
<td style="text-align: left;">28-24</td>
<td style="text-align: left;">28 bits (14 integer + 14 fractional)</td>
<td style="text-align: left;">24 bits (8 integer + 16 fractional)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">24-24</td>
<td style="text-align: left;">24 bits (12 integer + 12 fractional)</td>
<td style="text-align: left;">24 bits (8 integer + 16 fractional)</td>
</tr>
<tr class="even">
<td style="text-align: left;">24-20</td>
<td style="text-align: left;">24 bits (12 integer + 12 fractional)</td>
<td style="text-align: left;">20 bits (6 integer + 14 fractional)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">24-16</td>
<td style="text-align: left;">24 bits (12 integer + 12 fractional)</td>
<td style="text-align: left;">16 bits (4 integer + 12 fractional)</td>
</tr>
<tr class="even">
<td style="text-align: left;">20-20</td>
<td style="text-align: left;">20 bits (10 integer + 10 fractional)</td>
<td style="text-align: left;">20 bits (6 integer + 14 fractional)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">20-16</td>
<td style="text-align: left;">20 bits (10 integer + 10 fractional)</td>
<td style="text-align: left;">16 bits (4 integer + 12 fractional)</td>
</tr>
</tbody>
</table>
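<p>To get a feel for what these formats can represent, the sketch below emulates fixed-point rounding in Python. It is not the HLS implementation. It assumes that the integer part includes the sign bit, that excess fractional bits are truncated toward minus infinity, and that out-of-range values saturate. The saturation is a simplification of mine (as far as I know, <code>ap_fixed</code> wraps around by default), and the function name <code>quantize</code> and the sample weight are hypothetical.</p>

```python
import math


def quantize(x: float, total_bits: int, int_bits: int) -> float:
    """Round x to a signed fixed-point grid with the given bit widths.

    Assumptions for this sketch: the integer part includes the sign bit,
    excess fractional bits are truncated toward minus infinity, and
    out-of-range values saturate.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # smallest representable increment
    lo = -(2.0 ** (int_bits - 1))            # most negative representable value
    hi = 2.0 ** (int_bits - 1) - step        # most positive representable value
    q = math.floor(x / step) * step          # truncate toward minus infinity
    return min(max(q, lo), hi)               # saturate on overflow


# (total bits, integer bits) of param_t for each width in the table.
PARAM_T = {"28": (28, 10), "24": (24, 8), "20": (20, 6), "16": (16, 4)}

w = 0.1  # hypothetical weight value
for name, (total, integer) in PARAM_T.items():
    q = quantize(w, total, integer)
    print(f"param_t {name}-bit: {q!r} (error {abs(w - q):.2e})")
```

<p>Each narrower parameter format both shrinks the representable range and coarsens the grid, so both overflow and rounding error become more likely as bits are removed.</p>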
<p>The accuracy of each implementation is summarized below.</p>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Implementation</th>
<th style="text-align: left;">Correct predictions</th>
<th style="text-align: left;">Accuracy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">CPU</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz)</td>
<td style="text-align: left;">2209</td>
<td style="text-align: left;">89.506%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 28-28)</td>
<td style="text-align: left;">2206</td>
<td style="text-align: left;">89.384%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz, 28-24)</td>
<td style="text-align: left;">2206</td>
<td style="text-align: left;">89.384%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 24-24)</td>
<td style="text-align: left;">2200</td>
<td style="text-align: left;">89.141%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz, 24-20)</td>
<td style="text-align: left;">550</td>
<td style="text-align: left;">22.285%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 24-16)</td>
<td style="text-align: left;">121</td>
<td style="text-align: left;">4.903%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz, 20-20)</td>
<td style="text-align: left;">448</td>
<td style="text-align: left;">18.152%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 20-16)</td>
<td style="text-align: left;">122</td>
<td style="text-align: left;">4.903%</td>
</tr>
</tbody>
</table>
<p>The resource usage is summarized below.</p>
<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Implementation</th>
<th style="text-align: left;">LUT</th>
<th style="text-align: left;">FF</th>
<th style="text-align: left;">BRAM (36Kb)</th>
<th style="text-align: left;">URAM</th>
<th style="text-align: left;">DSP</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Total available</td>
<td style="text-align: left;">230,400</td>
<td style="text-align: left;">460,800</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">96</td>
<td style="text-align: left;">1,728</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz)</td>
<td style="text-align: left;">84,263 (36.57%)</td>
<td style="text-align: left;">49,660 (10.78%)</td>
<td style="text-align: left;">263.5 (84.46%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">808 (46.76%)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 28-28)</td>
<td style="text-align: left;">74,342 (32.27%)</td>
<td style="text-align: left;">47,267 (10.26%)</td>
<td style="text-align: left;">261.5 (83.81%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">808 (46.76%)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz, 28-24)</td>
<td style="text-align: left;">63,749 (27.67%)</td>
<td style="text-align: left;">39,139 (8.49%)</td>
<td style="text-align: left;">257 (82.37%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">404 (23.38%)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Wider ports (150MHz, 24-24)</td>
<td style="text-align: left;">59,970 (26.03%)</td>
<td style="text-align: left;">36,240 (7.86%)</td>
<td style="text-align: left;">257 (82.37%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">404 (23.38%)</td>
</tr>
<tr class="even">
<td style="text-align: left;">Wider ports (150MHz, 24-20)</td>
<td style="text-align: left;">75,997 (32.98%)</td>
<td style="text-align: left;">40,762 (8.85%)</td>
<td style="text-align: left;">259 (83.01%)</td>
<td style="text-align: left;">64 (66.67%)</td>
<td style="text-align: left;">202 (11.69%)</td>
</tr>
</tbody>
</table>
<p>Reducing the bit width did not change the inference time.
The implementation probably needs some adjustment to take advantage of the narrower types.</p>
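<p>The results above show that the classification accuracy collapses as soon as the weights go from 24 to 20 bits
(such a sharp drop came as a surprise).
The IP core that uses 24 bits for both layer inputs/outputs and model parameters is the most resource-efficient.
As for resource usage, reducing the bit width gradually simplifies the circuit, and LUT and FF usage decreases along with it.
Going from 28 to 24 bits for the parameters halves the number of DSP blocks needed for multiply-accumulate operations.
Going from 24 to 20 bits halves DSP usage again
(LUT and FF usage grows in return).
BRAM and URAM usage does not seem to fall unless the bit width is reduced somewhat further
(the shortage of on-chip memory is a constant headache).</p>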
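<p>The choice of the most resource-efficient configuration can be written down as a small selection rule: among the configurations whose accuracy stays within a tolerance of the CPU result, take the one with the fewest LUTs. The numbers are copied from the tables above. The tolerance, the label <code>32-32</code> for the original 32-bit design, and the function names are assumptions made for this sketch.</p>

```python
# (correct predictions, LUT, FF, DSP) per configuration, from the tables above.
TOTAL_SAMPLES = 2468
CPU_CORRECT = 2209
CONFIGS = {
    "32-32": (2209, 84_263, 49_660, 808),
    "28-28": (2206, 74_342, 47_267, 808),
    "28-24": (2206, 63_749, 39_139, 404),
    "24-24": (2200, 59_970, 36_240, 404),
    "24-20": (550, 75_997, 40_762, 202),
}

MAX_DROP = 0.5  # hypothetical tolerance, in percentage points


def accuracy(correct: int) -> float:
    """Accuracy in percent on the test set."""
    return 100.0 * correct / TOTAL_SAMPLES


def pick_config(max_drop: float) -> str:
    """Return the configuration with the fewest LUTs within the accuracy budget."""
    baseline = accuracy(CPU_CORRECT)
    acceptable = {
        name: cfg for name, cfg in CONFIGS.items()
        if max_drop >= baseline - accuracy(cfg[0])
    }
    return min(acceptable, key=lambda name: acceptable[name][1])


print(pick_config(MAX_DROP))
```

<p>With a tolerance of 0.5 percentage points this picks the 24-24 configuration. A stricter tolerance would shift the choice toward the wider formats.</p>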
<h1 id="おわりに">Conclusion</h1>
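<p>In this article, we accelerated a point cloud classification task on an FPGA.
For the classification task we used PointNet, a lightweight and simple model.
To keep FPGA resource usage down, we simplified the model and reordered its computation.
We then built a custom IP core for PointNet with Vitis HLS
2022.1, Xilinx's high-level synthesis tool.
The implementation of the IP core was improved step by step through pipelining, parallelization of the layer computations
(loop unrolling and array partitioning), and the dataflow optimization.</p>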
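<p>We connected the IP core to other IP cores to build a board design, and ran synthesis and place-and-route with Xilinx Vivado
2022.1 to produce a bitstream that can be loaded onto the FPGA.
A driver that loads the bitstream and runs fast inference was written with the Pynq library.
Using the ModelNet40 dataset on the Xilinx ZCU104 Evaluation
Kit, we evaluated the design in terms of inference time, resource usage, and classification accuracy.
We also compared several board designs to see the effect of each optimization.
Finally, we tried reducing the bit width to improve resource efficiency.</p>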
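<p>With a high-level synthesis tool, an efficient IP core could be written in C/C++ alone, without any Verilog
HDL.
Even so, the amount of source code was several times larger than with a deep learning library such as PyTorch.
You cannot build an IP core that accelerates a computation without closely following the internal flow of that computation and understanding all of it.
There is also a lot to think about, such as resource constraints and data transfers.
The many steps make for hard work, but the joy is all the greater when your own IP core works correctly
(produces outputs similar to the software implementation)
or when you manage to speed it up.
Thank you for reading.</p>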
<p>I keep finding myself thinking how convenient GPUs are.</p>
</body>
</html>