ãã®ããŒãžã¯ãæ
¶æçå·¥ã¢ããã³ãã«ã¬ã³ããŒ2022ã®22æ¥ç®ã®èšäºã§ãã å»å¹Žã®èšäºã¯ãã¡ããšãã¡ãã§ãã
æ©éäœè«ã§ããã1983幎12æ22æ¥ã¯ãYellow Magic Orchestra (YMO) ãè¡ã£ãæåŸã®åœå
ãã¢ãŒã®æçµæ¥ã§ãéå¬å Žæã¯æ¥æ¬æŠé通ã§ããã ä»æ¥ã¯ããã®æ£éãã¢ãŒãããã¡ããã©39åšå¹Žãè¿ããèšå¿µãã¹ãæ¥ã§ãã 1984幎2æ22æ¥çºå£²ã®ãã¢ãã¿ãŒã»ãµãŒãŽã£ã¹ããã1992幎11æ21æ¥çºå£²ã®ãã³ã³ããªãŒãã»ãµãŒãŽã£ã¹ãã«é³æºãåé²ãããŠããã®ã§ãã¿ãªããæ¯éèŽããŠã¿ãŠãã ããã ãŸãäœè«ã§ãããæ®æ®µã¯ (ç 究ãã£ã¡ã®ãã§) CDãéããŠããŸãã 70幎代ãã80幎代ã«ãããŠã®ã¢ãŒãã£ã¹ãã奜ãã§ãã æè¿ã¯ãå°ããªãã³ãŒã¹ãèŽããŠããŸãã ãªãã³ãŒã¹ã®æ§èŠæ Œç€ã®ã³ã¬ã¯ã·ã§ã³ã¯ãã¡ãã«ãããŸãã ãŸããã³ã¬ã¯ã·ã§ã³å
šäœã¯ãã¡ããšãã¡ãã«ãŸãšããŠãããŸãã æãªãšãã«ã芧ãã ããã
ããäžã€äœè«ã§ãã ä»å¹ŽèŽãããªãã§æãè¯ãã£ãã¢ã«ãã ã¯ã以äžã®éãã§ãã
- ãã¥ãŒãªãããHaloã(1983幎 / VICL-62399 / 2007幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãäžã«å¹ã颚ãð¥ãæãæ±ããããŠãð¥ãèŒãæããæ³ãåºã®ã©ã³ãã¹ã±ãŒãããã³ã¹ã¢ã¹ã®å²ãé·ããæ空ã®äŒèšããã»ã«ãªã¢ã³ã»ãã«ãŒã
- ãªãã³ãŒã¹ããã®éãããã° ãªãã»ã³ãŒã¹ã»ã©ãŠã³ã2ã(1974幎 / CA35-1033 / 1983幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãã¯ãã¡ã®é ãð¥ãå¥ãã®æ æ¯(1)ãð¥ãéŠèŒªã®ãªãç¬ãããã®è§ããŸããã°ããæ¥ææ¥ã®ãããã€ã
- ãªãã³ãŒã¹ãI Love Youã(1982幎 / CA35-1002 / 1982幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãåããè¡ãð¥ã決ããŠåœŒçã®ããã§ã¯ãªããð¥ãYes-Yes-Yesããæã®ãããã
- ãªãã³ãŒã¹ãã¯ã€ã³ã®åãã(1975幎 / CA35-1032 / 1983幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãå¹»æ³ãð¥ãè人ã®ã€ã¶ãããð¥ãæãäžã«ããéšãæ¿ããããåããªããŠããã¯ã€ã³ã®åãããç ãã¬å€ã
- ãªãã³ãŒã¹ãSong Is Loveã(1976幎 / CA35-1041 / 1983幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãå¬ãæ¥ããŸãã«ãð¥ãé空ãšäººçãšãð¥ãæãæ§ããŠããéæ¥ããã²ãšãã§çããŠãããã°ã
- ãã¥ãŒãªãããNew Tuneã(1985幎 / 35FD-1005 / 1985幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ããã£ãšå¹žãã«çŽ çŽã«ãªããããð¥ããããªã¢ãð¥ãOur Songãããµãã€ãã®ã¯ãªã¹ãã¹ãããããªç·ã«ãªãããã
- 倧æ»è© äžãEach Timeã(1984幎 / 35DH 78 / 1984幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãBachelor Girlãð¥ããããŒãã³ãã»ãã«ãŒãð¥ãéæ³ã®ç³ããæã®ããã¯ã«ããŒã«ã
- éºçŸã"R"ã(1984幎 / 35C31-7250 / 1984幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãæã®ã¯ã©ã€ããŒãð¥ã颚ã¯ææ¥ãžãð¥ã空ãäžé¢æµ·ã«èŠããæ¥ããæã®äžæéã¯å€ç¬ã®å幎ããéæ¥ã®ãªã°ã¬ãããããããŒãã€ã«ã
- ãã€ã»ãã¡ã€ã»ã»ãããSweet Locomotionã(1986幎 / 32DH 393 / 1986幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãã²ãšããã®æãð¥ããã£ãäžæã®ãã©ãã°ã©ããð¥ãElevator TownããDo You Remember Me?ã
- åä¹
äºæ èŠãFloraã(1990幎 / PSCR-1006 / 1990幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ããã€ã»ãã³ãªã£ã»ã°ããã€ã»ã¯ã©ããð¥ãå¶ç¶ã®æ 人ãð¥ã倢ã§äŒããŸãããããç¥æ§ãããªãåææ¥ã
- éŽæšåº·åãSincerelyã(1983幎 / CA35-1043 / 1983幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãç çè²ã®å€æããð¥ãåãšæµ·ãžãð¥ãã©ã©ã© ïœæã®äžçãžïœããå
¥ãæ±ããåã®èªçæ¥ã
- 岡ç°æåžåããŽã£ãŒãã¹èªçã(1986幎 / D32A0164 / 1986幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ããŽã£ãŒãã¹èªçãð¥ãéæ²³ã®ãã«ã³ã¹ãð¥ãç ãã¬å€ã®AquariusããWonder Trip LoverããSpring Accidentã
- å°ŸåŽäºçŸãKidsã(1986幎 / D32A0235 / 1986幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãæµãæã奜ããð¥ãã·ã£ã€ãã¹ããŒã€ãð¥ãSt.Valentine's Day RhapsodyããCom'on Mamyã
- ä¹
ä¿ç°æ©çŽãå€ã®åºã¯æãããªå¹»ã(1984幎 / DYCL-17 / 2005幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ããã¢ããã·ã¢ã§...ãð¥ãå¯ãçµµèæžãð¥ãæã®æµèŸºãã¿ã³ãã²ãšã€ããã¡ã©ã³ã³ãªãŒã®ããŒãã«ã¯ãã¹ã
- è¬åž«äžžã²ãåãè±å³éã(1986幎 / CA32-1260 / 1986幎ç€)
- ç¹ã«ããã£ãæ²: ð¥ãçŽ ãè±ãéãè±ãð¥ãå¯æ€¿ãå²ãããð¥ãããŒãºã»ãã£ãŒã¯ããã?ããåãã¿ã®çš®ããéæãªãã¥ãŒãªãããã麊ããåžœåã®ã¢ã³ã
ã€ã³ãããè¯ãæ² (ããŸã)ã
- ãã¥ãŒãªãããShooting Starã(1981幎)
- äºäžéãKarsavina ïœããžã³ã¹ããŒã®ç¿Œã(1983幎)
- äºäžéãRunning Fence -Ode A Christoã(1982幎)
ä»å¹Žã¯ãç¹çŸ€åŠç (ç¹çŸ€åé¡ã¿ã¹ã¯) åããã¥ãŒã©ã«ãããã®FPGAé«éåãè©ŠããŠã¿ãŸãã LeNetãResNetãªã©ãç»ååŠçåããã¥ãŒã©ã«ãããã®FPGAé«éåãé¢çœãã®ã§ãããæ¢ã«ããããã®çŽ æŽãããèšäºãåºãŠããã®ã§ãããŸããã é³æ¥œã®è©±ãã誰ã«ãéããªããããŠã±ãªããšæã£ãã®ã§ãããŸããã ã³ã³ãã¥ãŒã¿ã§é²èŠ§ãããããšããå§ãããŸãã
ç¹çŸ€ã®åé¡ãã»ã°ã¡ã³ããŒã·ã§ã³ãã¬ãžã¹ãã¬ãŒã·ã§ã³ãªã©ãæ§ã ãªã¿ã¹ã¯ã«å¯Ÿå¿ãã代衚çãªã¢ãã«ãšããŠã2017幎ã«CVPRã§çºè¡šãããPointNetãæããããŸãã PointNetã¯ãMLPãšMaxããŒãªã³ã°å±€ãããªããã·ã³ãã«ãã€åŒ·åãªã¢ãã«ã§ãã åé¡ã¿ã¹ã¯åãã®PointNetã®æ§é ãã以äžã«ç€ºããŸãã
ã¢ãã«ã¯ãç¹çŸ€ããã®ç¹åŸŽæœåºãšãç¹åŸŽã«åºã¥ãåé¡ã®ã2ã€ã®éšåã«åããããŸã (å³ã®Feature extractionãšClassification)ã
å³ã®å·Šç«¯ã«ç€ºãããã«ã$N$åã®ç¹ãå«ã3次å
ã®ç¹çŸ€$\mathcal{P} = \left\{ \boldsymbol{p}_1, \ldots, \boldsymbol{p}_N \right\} \in \mathbb{R}^{N \times 3}$ãå
¥åã§ãã MLPãçšããŠãåç¹$\boldsymbol{p}_i \in \mathbb{R}^3$ã«å¯ŸããŠã1024次å
ã®ããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\psi}_i \in \mathbb{R}^{1024}$ãèšç®ããŸãã å
šãŠã®ç¹ã«å¯ŸããŠããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi} = \left\{ \boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_N \right\} \in \mathbb{R}^{N \times 1024}$ãèšç®ããããããããMaxããŒãªã³ã°å±€ã«ããéçŽããŠãç¹çŸ€å
šäœãè¡šãã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi} \in \mathbb{R}^{1024}$ãåŸãŸã ($\boldsymbol{\phi} \gets \max(\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_N)$)ã
åé¡çšã®ãããã¯ãŒã¯ã¯ããã®ç¹åŸŽé$\boldsymbol{\phi}$ãå
¥åãšããŠãåç©äœã®ã¯ã©ã¹ã«å¯Ÿããããžãã (ã¹ã³ã¢) ãåºåããŸãã ç©äœã®ã¯ã©ã¹æ°ã$K$ãšããã°ãåºåã¯$K$次å
ã®ãã¯ãã«ãšãªããŸãã
å³ã®Input Transformããã³Feature Transformã¯ãç¹çŸ€ã®ç¹åŸŽã«å¯ŸããŠã¢ãã£ã³å€æãæœããåäœå€æã«å¯ŸããŠäžå€ãªç¹åŸŽéãåŸãããã®ãããã¯ãŒã¯ã§ãããå®è£
ãé¢åãªã®ã§åãé€ããŸã (æé©åãã®1: ã¢ãã«ã®ç°¡ç¥å)ã åŸã£ãŠãä»åFPGAäžã«å®è£
ããPointNetã¯ã以äžã®ããã«ãªããŸãã
ç»åèªèåãã®ã¢ãã«ãšã¯ç°ãªããç³ã¿èŸŒã¿å±€ã¯ãããŸããã ãŸããMLPã¯ãå
šçµåå±€ããããæ£èŠåå±€ãReLU掻æ§åå±€ããŸãšãããã®ãšããŸãã
PyTorchã«ããã¢ãã«ã®å®çŸ©ã¯ã次ã®ããã«ãªããŸã (`net/model.py`)ã ãœãŒã¹ã³ãŒãå
šäœã¯ãã¡ãã®ãªããžããªã«çœ®ãããŠããã®ã§ãé©å®ãåç
§ãã ããã
import torch
import torch.nn.functional as F

class PointNetFeat(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv1d(3, 64, 1)
        self.conv2 = torch.nn.Conv1d(64, 64, 1)
        self.conv3 = torch.nn.Conv1d(64, 64, 1)
        self.conv4 = torch.nn.Conv1d(64, 128, 1)
        self.conv5 = torch.nn.Conv1d(128, 1024, 1)
        self.bn1 = torch.nn.BatchNorm1d(64)
        self.bn2 = torch.nn.BatchNorm1d(64)
        self.bn3 = torch.nn.BatchNorm1d(64)
        self.bn4 = torch.nn.BatchNorm1d(128)
        self.bn5 = torch.nn.BatchNorm1d(1024)

    def forward(self, x: torch.Tensor):
        # `x` is of size [B, N, 3]
        N = x.shape[1]
        x = x.transpose(1, 2)
        # `x` is of size [B, 3, N]
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = F.relu(self.bn5(self.conv5(x)))
        # `x` is of size [B, 1024, N]
        x = torch.max(x, dim=2)[0]
        # `x` is of size [B, 1024]
        return x

class PointNetCls(torch.nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Feature extraction
        self.feat = PointNetFeat()
        # Classification network
        self.fc1 = torch.nn.Linear(1024, 512)
        self.fc2 = torch.nn.Linear(512, 256)
        self.fc3 = torch.nn.Linear(256, num_classes)
        self.bn1 = torch.nn.BatchNorm1d(512)
        self.bn2 = torch.nn.BatchNorm1d(256)

    def forward(self, x):
        # `x` is of size [B, N, 3]
        x = self.feat(x)
        # `x` is of size [B, 1024]
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.fc3(x)
        # `x` is of size [B, `num_classes`]
        return x
ããŠããã®ã¢ãã«ããã®ãŸãŸå®è£
ããå Žåã次ã®ãããªåé¡ããããŸãã ç¹åŸŽæœåºéšå (å³ã®Feature extraction) ã«æ³šç®ããŸãã å³äžã®ç°è²ã®åè§ã«ç€ºãããã«ã$N$åå
šãŠã®ç¹ã«å¯ŸããäžéçµæããããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi}$ããã©ããã«ä¿æããŠããå¿
èŠããããŸãã 倧容éã®ã¡ã¢ãªãæèŒããGPUã§ããã°ããã§åé¡ãããŸããããFPGAå
éšã®ãªã³ãããã¡ã¢ãª (BlockRAM) ã¯éåžžã«å®¹éãå°ãªãã®ã§ãå
šãŠã®ç¹ã«å¯Ÿããäžéçµæãä¿æããããšãããšããªã³ãããã¡ã¢ãªããã£ãšããéã«æ¯æžããã§ããã (äŸãã°$N = 1024$ã®å Žåã32ãããã®æµ®åå°æ°ç¹æ°ã§ä¿æãããšã$\boldsymbol{\Psi}$ã ãã§$1024 \times 1024 \times 4\,\mathrm{B} = 4\,\mathrm{MiB}$ãå¿
èŠã§ã)ã èšãæãããšãæèŒãããŠãããªã³ãããã¡ã¢ãªã®å®¹éã«ãã£ãŠãç¹ã®åæ°$N$ãå¶éãããŠããŸããŸãã ããã¯é¿ããããã®ã§ãã ãªã³ãããã¡ã¢ãªã®ä»£ããã«ã容éã®å€§ããªDRAMäžã«çœ®ãããšãã§ããŸãããããŒã¿ãžã®ã¢ã¯ã»ã¹æéã¯é·ããªããŸãã å
šãŠã®å±€ã®äžéçµæãDRAMã«çœ®ããšãããŒã¿è»¢éã®ãªãŒããŒããããå¢å ããŠãæ§èœã«æªåœ±é¿ãåãŒããŸãã å±€ã®äžéçµæã¯ããªã³ããããããã¡ã«çœ®ããããã®ã§ãã
ããã§ãå
šãŠã®ç¹$\mathcal{P}$ã«å¯ŸããŠãããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi}$ãäžæ°ã«èšç®ããã®ã§ã¯ãªãã1ã€ãã€ã®ç¹$\boldsymbol{p}$ã«å¯ŸããŠé ã«ããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\psi}$ãèšç®ããŸãããã äžæ°ã«èšç®ããã®ãšæ¯ã¹ãŠèšç®å¹çã¯èœã¡ãŸããã1ã€ã®ç¹ã«å¯ŸããäžéçµæãšããŒã«ã«ãªç¹åŸŽéã ããä¿æããã°ããã®ã§ããªã³ããããããã¡ã®æ¶è²»ã倧ããåæžã§ããŸãã
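ãã®é次æŽæ°ã®æµããC++颚ã®ã³ãŒãã§è¡šããšã以äžã®ããã«ãªããŸã (`ExtractFeature`ã¯1ç¹ããããŒã«ã«ãªç¹åŸŽéãèšç®ããä»®ã®é¢æ°ã§ãäžèº«ã¯ãããŒã§ãã説æçšã®ã¹ã±ããã§ã)ã ReLU掻æ§åã®åºåã¯éè² ãªã®ã§ãã°ããŒãã«ãªç¹åŸŽéã¯$\boldsymbol{0}$ã§åæåã§ããŸãã

#include <algorithm>
#include <array>

constexpr int kNumPoints = 1024;  // ç¹ã®åæ° N
constexpr int kFeatDims = 1024;   // ããŒã«ã«ãªç¹åŸŽéã®æ¬¡å
æ°

// ä»®ã®ç¹åŸŽæœåºé¢æ° (å®éã«ã¯MLPã®é äŒæãå
¥ã)
std::array<float, kFeatDims> ExtractFeature(const std::array<float, 3>& p)
{
  std::array<float, kFeatDims> psi {};
  for (int i = 0; i < kFeatDims; ++i)
    psi[i] = std::max(0.0f, p[i % 3] * 0.5f);  // ãããŒã®èšç® (ReLUçžåœ)
  return psi;
}

// ç¹ããšã«ã°ããŒãã«ãªç¹åŸŽéãé次æŽæ°ãã: (N, 1024) ã®ãããã¡ã¯äžèŠ
std::array<float, kFeatDims> GlobalFeature(
    const std::array<std::array<float, 3>, kNumPoints>& points)
{
  std::array<float, kFeatDims> phi;
  phi.fill(0.0f);  // ReLUã®åºåã¯éè² ãªã®ã§0ã§åæåããŠãã
  for (const auto& p : points) {
    // ããŒã«ã«ãªç¹åŸŽéã¯1ç¹åã ãä¿æããã°ãã
    const std::array<float, kFeatDims> psi = ExtractFeature(p);
    for (int i = 0; i < kFeatDims; ++i)
      phi[i] = std::max(phi[i], psi[i]);  // èŠçŽ ããšã®maxã§æŽæ°
  }
  return phi;
}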
以å㯠(PyTorchãªã©ã®ãã¬ãŒã ã¯ãŒã¯ã䜿ãå Žåã¯)ãç¹åŸŽæœåºã¯æ¬¡ã®ããã«è¡ãããŠããŸããã

- å
šãŠã®ç¹$\mathcal{P}$ã«å¯ŸããŠãããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi}$ããŸãšããŠèšç®ãã ($(N, 64)$ã$(N, 1024)$ã®ãããã¡ãå¿
èŠ)ã
- MaxããŒãªã³ã°å±€ã«ãããããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi}$ãéçŽããŠãã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi}$ãåŸã ($\boldsymbol{\phi} \gets \max(\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_N)$)ã
- ã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi}$ãMLPã«å
¥åããåã¯ã©ã¹ã«å¯Ÿããããžãã ($K$次å
ã®ãã¯ãã«) ãåŸãã
ãããã次ã®ããã«å€æŽããŸã(æé©åãã®2: èšç®é åºã®å€æŽ)ã
- ã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi}$ãã$\boldsymbol{0}$ã§åæåããã
- åç¹$\boldsymbol{p}_i \ (i = 1, \ldots, N)$ã«å¯ŸããŠã以äžã®åŠçãè¡ãã
  - MLPã®é äŒæã«ãããããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\psi}_i$ãåŸã ($(1, 64)$ã$(1, 1024)$ã®ãããã¡ãããã°ãã)ã
  - $\boldsymbol{\phi}$ãš$\boldsymbol{\psi}_i$ãšã®ãèŠçŽ ããšã®$\max$ããšãããšã§ã$\boldsymbol{\phi}$ãæŽæ°ãã ($\boldsymbol{\phi} \gets \max(\boldsymbol{\phi}, \boldsymbol{\psi}_i)$)ã
- ã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi}$ãMLPã«å
¥åããåã¯ã©ã¹ã«å¯Ÿããããžãã ($K$次å
ã®ãã¯ãã«) ãåŸãã
å
šãŠã®ç¹ã«å¯ŸããããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\Psi}$ãéçŽããã®ã§ã¯ãªããåç¹$\boldsymbol{p}_i$ã«å¯ŸããããŒã«ã«ãªç¹åŸŽé$\boldsymbol{\psi}_i$ã䜿ã£ãŠãã°ããŒãã«ãªç¹åŸŽé$\boldsymbol{\phi}$ãé次çã«æŽæ°ããŠãããŸãã ããã¯è¿äŒŒã§ã¯ãªãã®ã§ãå
šãåãçµæãšãªããŸãã
æçµçã«ãä»åFPGAäžã«å®è£
ããPointNetã¯ã以äžã®ããã«ãªããŸãã
ä»åã¯ãé«äœåæ (HLS: High-Level Synthesis) ãçšããŠãäžèšã«ç€ºãPointNetã®å°çšåè·¯ (IPã³ã¢) ãèšè¿°ããŸãã ãã¥ãŒã©ã«ãããã®æšè«ãå®çŸããå¥ã®æ段ãšããŠã¯ãè¡åæŒç®ãç³ã¿èŸŒã¿æŒç®çšã®ã巚倧ãã€æ±çšçãªæŒç®åè·¯ãFPGAäžã«å®è£
ããããã«ç¹°ãè¿ãããŒã¿ãäžããããšãèããããŸãã
é«äœåæã¯ãC/C++ã«ããåäœã¬ãã« (Behavior Level) ã®åè·¯èšè¿°ããVerilog HDLãSystemVerilogã«ããã¬ãžã¹ã¿è»¢éã¬ãã« (RTL: Register Transfer Level) ã®åè·¯èšè¿°ã«å€æããããã®æè¡ã§ãã
Verilog HDLãçŽæ¥èšè¿°ããã®ã«æ¯ã¹ãŠãé¥ãã«æ¥œã§ãã¹ãã¬ã¹ãå°ãªããçç£æ§ãåäžããŸãã
äœããC/C++ã§èšè¿°ãããšã¯ãã£ãŠããéåžžã®ãœãããŠã§ã¢éçºãšã¯å
šãæ§çžãç°ãªããŸãã
`malloc()`ã`new`ã¯ãã¡ããã®ããšããããã«äŸåãã`std::vector`ãªã©ã®äŸ¿å©ãªããŒã¿åã䜿ããªãã®ã§ãåºå®é·ã®é
åã«çœ®ãæããŠã©ãã«ãããŸãã ãã¥ãŒã©ã«ãããã¯ãµã€ãºãåºå®ã§ãäžè¬ã«ã¯æ±ºãŸã£ãåäœãããã®ã§ãFPGAäžã«å®è£
ããããã§ãã
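äŸãã°ããœãããŠã§ã¢çã§`std::vector`ãåãæž¡ããŠããåŠçã¯ã次ã®ããã«ãã³ãã¬ãŒãåŒæ°ã§èŠçŽ æ°ãåºå®ããé
åã«çœ®ãæããããšã«ãªããŸã (`Scale`ã¯èª¬æçšã«äœã£ãä»®ã®é¢æ°ã§ã)ã

// ãœãããŠã§ã¢çã®èšè¿° (ãããã¡ãåçã«ç¢ºä¿ããã®ã§é«äœåæã§ããªã):
//   std::vector<float> Scale(const std::vector<float>& x, float a);

// HLSåãã®èšè¿°: èŠçŽ æ°ããã³ãã¬ãŒãåŒæ°ã§åºå®ããåºå®é·é
åãåãæž¡ã
template <int N>
void Scale(const float x[N], float y[N], const float a)
{
  for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
    y[i] = x[i] * a;
  }
}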
é«äœåæçšã®ããŒã«ãšããŠãXilinx瀟ã®Vitis HLS 2022.1ãå©çšããŸãã ãŸãå®è£
察象ã®FPGAãšããŠãXilinx ZCU104 Evaluation Board (XCZU7EV-2FFVC1156)ã䜿ããŸãã Xilinx ZCU104ã«ã¯ãFPGAã®ã»ãã«ã¯ã¢ããã³ã¢ã®ARM Cortex-A53 CPU (1.2GHz)ãš2GBã®DRAMãæèŒãããŠãããLinuxãåäœããŸãã
æ©éãPointNetã®IPã³ã¢ã瀺ããŸã (é©å®GitHubã®ãªããžããªãã芧ãã ãã)ã é«äœåæããŒã«ã®ããã¯ãšã³ãã¯GCC 6.2ãªã®ã§ãC++14ãC++17ã®äžéšæ©èœãå©çšã§ããŸãã äœããããŒã«ã®ãã°ãèžããããããªãã®ã§ãããŸãå°ã£ãæ©èœã¯äœ¿ããªãããã«ããŠããŸãã
// Size of the PointNet classification network
// Refer to net/model.py for details
// Size of the feature extraction network
constexpr const int kFeatDims0 = 3;
constexpr const int kFeatDims1 = 64;
constexpr const int kFeatDims2 = 64;
constexpr const int kFeatDims3 = 64;
constexpr const int kFeatDims4 = 128;
constexpr const int kFeatDims5 = 1024;
// Size of the classification network
// ModelNet40 has 40 object classes
constexpr const int kClsDims0 = kFeatDims5;
constexpr const int kClsDims1 = 512;
constexpr const int kClsDims2 = 256;
constexpr const int kClsDims3 = 40;
// Top function
void PointNetClsTop(const int op_mode,
const float* point_cloud,
const int num_points,
float* out_logits,
const float* feat_params1,
const float* feat_params2,
const float* feat_params3,
const float* feat_params4,
const float* feat_params5,
const float* cls_params1,
const float* cls_params2,
const float* cls_params3)
{
#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params2 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params3 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params4 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params5 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params2 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=op_mode bundle=control
#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control
#pragma HLS INTERFACE s_axilite port=num_points bundle=control
#pragma HLS INTERFACE s_axilite port=out_logits bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params2 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params3 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params4 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params5 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params1 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params2 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
// Parameters for feature extraction
LinearParams<param_t, kFeatDims0, kFeatDims1> feat_conv1;
LinearParams<param_t, kFeatDims1, kFeatDims2> feat_conv2;
LinearParams<param_t, kFeatDims2, kFeatDims3> feat_conv3;
LinearParams<param_t, kFeatDims3, kFeatDims4> feat_conv4;
LinearParams<param_t, kFeatDims4, kFeatDims5> feat_conv5;
BatchNorm1dParams<param_t, kFeatDims1> feat_bn1;
BatchNorm1dParams<param_t, kFeatDims2> feat_bn2;
BatchNorm1dParams<param_t, kFeatDims3> feat_bn3;
BatchNorm1dParams<param_t, kFeatDims4> feat_bn4;
BatchNorm1dParams<param_t, kFeatDims5> feat_bn5;
// Parameters for classification network
// LinearParams<param_t, kClsDims0, kClsDims1> cls_fc1;
// LinearParams<param_t, kClsDims1, kClsDims2> cls_fc2;
LinearParams<param_t, kClsDims2, kClsDims3> cls_fc3;
BatchNorm1dParams<param_t, kClsDims1> cls_bn1;
BatchNorm1dParams<param_t, kClsDims2> cls_bn2;
// Extracted feature
value_t feature[kFeatDims5];
if (op_mode == kModeInitWeights) {
// Initialize the PointNet feature extraction network
InitializeFeatNaive<param_t>(
&feat_conv1, &feat_conv2, &feat_conv3, &feat_conv4, &feat_conv5,
&feat_bn1, &feat_bn2, &feat_bn3, &feat_bn4, &feat_bn5,
feat_params1, feat_params2, feat_params3, feat_params4, feat_params5);
// Initialize the classification network
InitializeClsNaive<param_t>(
&cls_fc3, &cls_bn1, &cls_bn2,
cls_params1, cls_params2, cls_params3);
} else if (op_mode == kModeInference) {
// Run the PointNet feature extraction
InferenceFeatNaive<value_t, param_t, 1024>(
point_cloud, num_points, feature,
&feat_conv1, &feat_conv2, &feat_conv3, &feat_conv4, &feat_conv5,
&feat_bn1, &feat_bn2, &feat_bn3, &feat_bn4, &feat_bn5);
// Run the classification
InferenceClsNaive<value_t, param_t>(
feature, out_logits,
&cls_fc3, &cls_bn1, &cls_bn2,
cls_params1, cls_params2, cls_params3);
}
}
äžèšãé«äœåæãããšã次ã®ãããªIPã³ã¢ãäœãããŸãã
ãã®IPã³ã¢ãå¥ã®IPã³ã¢ãšçµã¿åãããããšã§ (åŸè¿°)ã次ã®ãããªãããã¯ãã¶ã€ã³ãã§ããŸãã
ãã®ãããã¯ãã¶ã€ã³ã«å¯ŸããŠãè«çåæããã³é
眮é
ç·ããããšã§ãåè·¯æ
å ±ãè¡šããããã¹ããªãŒã (Bitstream) ãçæããŸãã ãããã¹ããªãŒã ãFPGAã«ããŒãããããšã§ãPointNetã®å°çšåè·¯ã䜿ããããã«ãªããŸãã
`PointNetClsTop`ããIPã³ã¢ãè¡šãæäžäœã®é¢æ°ã§ãããããé¢æ° (Top function) ãšåŒã³ãŸãã é¢æ°ã®åŒæ°ã¯IPã³ã¢ã®å
¥åºåããŒããšãªããå¥ã®IPã³ã¢ã«æ¥ç¶ãããŸã (äžã®ãããã¯ãã¶ã€ã³ãã芧ãã ãã)ã HLSã§ã¯ãé¢æ°ãã®ãã®ãåè·¯ (Verilog HDLã«ãããã¢ãžã¥ãŒã«) ã«ãªããŸãã é¢æ°ã®ååž°åŒã³åºãã¯ã§ããŸããã
ç¹åŸŽæœåºçšã®ãããã¯ãŒã¯ã«ã¯5ã€ããŸããã¯ã©ã¹åé¡çšã®ãããã¯ãŒã¯ã«ã¯3ã€ã®å
šçµåå±€ (MLP) ãå«ãŸããŸãã ãããã®ãã©ã¡ãŒã¿ã¯ããœãããŠã§ã¢åŽããæäœã§ããããã«ãDRAMäžã®ãããã¡ã«çœ®ãããŸãã ãŸããç¹çŸ€$\mathcal{P}$ããã¢ãã«ã®åºå (ããžãã) ãåæ§ã«ãDRAMãããã¡ã«çœ®ãããŸãã
`feat_params1`ãã`feat_params5`ãŸã§ãšã`cls_params1`ãã`cls_params3`ãŸã§ã®8ã€ã®ããŒãã¯ãDRAMãããã¡äžã®ãã©ã¡ãŒã¿ãIPã³ã¢åŽããèªã¿åãããã«äœ¿ããŸãã `point_cloud`ã¯ç¹çŸ€ã®èªã¿åºãã«ã`out_logits`ã¯ããžããã®æžã蟌ã¿ã«äœ¿ããŸãã `op_mode`ã¯åè·¯ã®åäœã¢ãŒããã`num_points`ã¯ç¹ã®åæ°$N$ããèšå®ããããã®å¶åŸ¡ã¬ãžã¹ã¿ã§ãã
`#pragma HLS`ããå§ãŸãè¡ã¯ãé«äœåæããŒã«ã«å¯ŸããŠãC/C++ããRTLã«å€æããéã®ãã³ããäžããŸã (å¿
ãããå®ãããããšã¯éããŸãã)ã ãã€ãã©ã€ã³åãããŒã¿ãããŒæé©åãªã©ã¯C/C++ã§ã¯èšè¿°ã§ããŸãããããã®ãããªHLSãã©ã°ããé©åãªå Žæã«çœ®ãããšã§ãé«äœåæããŒã«ãèªåçã«ãããã®æé©åãæœããŠãããŸãã
`#pragma HLS INLINE off`ãšãããšããã®é¢æ°ã¯ã€ã³ã©ã€ã³å±éãããªããªããŸã (å¿
ã1ã€ã®ã¢ãžã¥ãŒã«ãšããŠäœãããŸã)ã 倧ããªé¢æ°ã§ããã°èªåçã«ã€ã³ã©ã€ã³å±éãããããšã¯ãããŸãããã念ã®ããä»äžããŠããŸãã
äžæ¹ã以äžã®ãããªç¶æ³ã§ã¯ãé¢æ°`B`ãã€ã³ã©ã€ã³å±éããªãæ¹ããããšæããŸãã åæã«ã¯äœ¿ãããªãã«ãé¢ãããã€ã³ã©ã€ã³å±éã«ãã£ãŠé¢æ°`A`ã®å
éšã«`B`ã®ã³ããŒã3ã€äœãããŠããªãœãŒã¹ã®ç¡é§é£ããšãªããŸãã é¢æ°`B`ã®ã€ã³ã©ã€ã³åãæå¶ããŠã`B`ã1ã€ã ãäœããããã䜿ãåããæ¹ãããã§ãããã
void B(const float x_in[10], float y_out[10])
{
#pragma HLS INLINE off
// äœããã®åŠç
}
void A(const float x_in[10], float y_out[10])
{
float x0[10];
float x1[10];
B(x_in, x0);
B(x0, x1);
B(x1, y_out);
}
`#pragma HLS INTERFACE m_axi`ãšã`#pragma HLS INTERFACE s_axilite`ã®èšè¿°ãç®ç«ã¡ãŸãã å
¥åºåããŒã (äŸãã°`feat_params1`) ã«å¯ŸããŠãã®2ã€ã®HLSãã©ã°ããèšè¿°ãããšãIPã³ã¢åŽããDRAMãããã¡ãèªã¿æžãã§ããããã«ãªããŸãã èªã¿æžãã®éã«ã¯AXIãšåŒã°ãããããã³ã«ã䜿çšããŸããã`#pragma HLS INTERFACE m_axi`ã«ãã£ãŠãããæå®ã§ããŸã (IPã³ã¢åŽããã¹ã¿ãŒã«ãªããŸã)ã ãœãããŠã§ã¢åŽããã¯ãåããŒãã«å¯ŸããŠãããã¡ã®ç©çã¢ãã¬ã¹ãå²ãåœãŠãŠãããŒããšãããã¡ãçŽã¥ããŸãã åããŒãã«ã¯ç©çã¢ãã¬ã¹ãèšå®ããããã®å¶åŸ¡ã¬ãžã¹ã¿ãäœæããå¿
èŠããããŸããã`#pragma HLS INTERFACE s_axilite`ã«ãã£ãŠãããå®çŸã§ããŸã (IPã³ã¢åŽããã¿ããšã¹ã¬ãŒãã§ã)ã `op_mode`ã`num_points`ã«å¯ŸããŠãã¬ãžã¹ã¿ãäœæããŸãã `port=return`ãšããŠããè¡ã¯ãIPã³ã¢çšã®å¶åŸ¡ã¬ãžã¹ã¿ãäœæããCPUåŽããIPã³ã¢ã®åäœãéå§ããããç¶æ
(ã¢ã€ãã«ç¶æ
ãªã®ãåäœäžã) ãèªã¿åã£ããããããã«å¿
èŠã§ãã
ãããã®ã¬ãžã¹ã¿ã¯ããœãããŠã§ã¢åŽãããã¡ã¢ãªããããI/Oããã³AXI-Liteãããã³ã«ã«ãã£ãŠèªã¿æžããããŸãã
åå
¥åºåããŒãããã¯ãPyTorchã®ã¢ãã«ã§å®çŸ©ãããåå±€ã®ãã©ã¡ãŒã¿ãèªã¿åºãããŸã (1次å
ã®é
åãšããŠãå
šãŠã®ãã©ã¡ãŒã¿ãé£çµãããŸã)ã

- `feat_params1`: `PointNetFeat::conv1` + `PointNetFeat::bn1` ã®ãã©ã¡ãŒã¿
- `feat_params2`: `PointNetFeat::conv2` + `PointNetFeat::bn2` ã®ãã©ã¡ãŒã¿
- `feat_params3`: `PointNetFeat::conv3` + `PointNetFeat::bn3` ã®ãã©ã¡ãŒã¿
- `feat_params4`: `PointNetFeat::conv4` + `PointNetFeat::bn4` ã®ãã©ã¡ãŒã¿
- `feat_params5`: `PointNetFeat::conv5` + `PointNetFeat::bn5` ã®ãã©ã¡ãŒã¿
- `cls_params1`: `PointNetCls::fc1` + `PointNetCls::bn1` ã®ãã©ã¡ãŒã¿
- `cls_params2`: `PointNetCls::fc2` + `PointNetCls::bn2` ã®ãã©ã¡ãŒã¿
- `cls_params3`: `PointNetCls::fc3` ã®ãã©ã¡ãŒã¿
void PointNetClsTop(const int op_mode,
const float* point_cloud,
const int num_points,
float* out_logits,
const float* feat_params1,
const float* feat_params2,
const float* feat_params3,
const float* feat_params4,
const float* feat_params5,
const float* cls_params1,
const float* cls_params2,
const float* cls_params3)
{
// ...
}
`torch.nn.Conv1d`ããã³`torch.nn.Linear`ã®ãã©ã¡ãŒã¿ãšããŠã¯ãéã¿ãšãã€ã¢ã¹ãæããããŸãã `Conv1d`ãšã¯ãã£ãŠãã«ãŒãã«ãµã€ãºã¯1ãªã®ã§ã`Linear`ãšåäœã¯åãã«ãªããŸãã 以åŸã`Conv1d`ãš`Linear`ã¯åäžèŠããŸãã å
¥åãšåºåã®æ¬¡å
æ°ã$\mathrm{InDims}$ã$\mathrm{OutDims}$ãšãããšãéã¿ãšãã€ã¢ã¹ã®ãµã€ãºã¯ãããã$(\mathrm{OutDims}, \mathrm{InDims})$ã$(\mathrm{OutDims})$ãšãªããŸãã å
¥å$\boldsymbol{x} \in \mathbb{R}^{\mathrm{InDims}}$ãéã¿$\boldsymbol{W} \in \mathbb{R}^{\mathrm{OutDims} \times \mathrm{InDims}}$ããã€ã¢ã¹$\boldsymbol{b} \in \mathbb{R}^{\mathrm{OutDims}}$ããããšããåºå$\boldsymbol{y} \in \mathbb{R}^{\mathrm{OutDims}}$ã¯æ¬¡ã®ããã«èšç®ãããŸãã
$$
\boldsymbol{y} = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}
$$
`torch.nn.BatchNorm1d`ã®ãã©ã¡ãŒã¿ãšããŠã¯ãå¹³åãæšæºåå·®ãéã¿ããã€ã¢ã¹ã®4ã€ãæããããŸãã å
¥åºåã®æ¬¡å
ã$\mathrm{Dims}$ãšãããšãããã4ã€ã®ãã©ã¡ãŒã¿ã®ãµã€ãºã¯ãããã$(\mathrm{Dims})$ã§ãã å¹³åãæšæºåå·®ãéã¿ããã€ã¢ã¹$\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{w}, \boldsymbol{b} \in \mathbb{R}^{\mathrm{Dims}}$ããããšããå
¥å$\boldsymbol{x} \in \mathbb{R}^{\mathrm{Dims}}$ã«å¯ŸããŠåºå$\boldsymbol{y} \in \mathbb{R}^{\mathrm{Dims}}$ã¯æ¬¡ã®ããã«èšç®ãããŸãã
$$
y_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \varepsilon}} \cdot w_i + b_i \quad (i = 1, \ldots, \mathrm{Dims})
$$
ãããæ£èŠåã®åŸã«ã¯ReLU掻æ§åãèšç®ãããŸãã åå±€ãå¥ã
ã«å®è£
ãããããŸãšããŠããŸã£ãæ¹ãå¹çãããã®ã§ããããæ£èŠåãšReLU掻æ§åã次ã®ããã«ãŸãšããŸã (æé©åãã®3: èšç®ã®ç°¡ç¥å)ã ããã§$s_i = w_i / \sqrt{\sigma_i^2 + \varepsilon}$ã¯ãäºãèšç®ããŠãããã¹ã±ãŒã«ã§ãã

$$
y_i = \max \left( 0, \left( x_i - \mu_i \right) \cdot s_i + b_i \right) \quad (i = 1, \ldots, \mathrm{Dims})
$$
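ã¹ã±ãŒã«$s_i$ã¯ãåŠç¿æžã¿ã®ãã©ã¡ãŒã¿ãããäŸãã°æ¬¡ã®ããã«äºãèšç®ã§ããŸã (PyTorchã®`BatchNorm1d`ã¯åæ£`running_var`ãä¿æããã®ã§ããããçšããŸãã説æçšã®ã¹ã±ããã§ã)ã

#include <cmath>

// ãããæ£èŠåã®ãã©ã¡ãŒã¿ãããã¹ã±ãŒã« s_i = w_i / sqrt(sigma_i^2 + eps) ã
// äºãèšç®ããŠãã (æšè«æã¯èŠçŽ ããšã«ä¹ç®1åãšå ç®1åã§æžã)
void FoldBatchNormScale(const float weight[], const float running_var[],
                        float scale[], const int dims,
                        const float eps = 1e-5f)
{
  for (int i = 0; i < dims; ++i)
    scale[i] = weight[i] / std::sqrt(running_var[i] + eps);
}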
æåŸã«MaxããŒãªã³ã°å±€ã§ãããå
è¿°ã®éããåç¹ã«å¯ŸããããŒã«ã«ç¹åŸŽé$\boldsymbol{\psi}_i \in \mathbb{R}^{1024}$ãšãçŸåšã®ã°ããŒãã«ç¹åŸŽé$\boldsymbol{\phi} \in \mathbb{R}^{1024}$ãšã®ãèŠçŽ ããšã®$\max$ã«çœ®ãæããŸããã MaxããŒãªã³ã°å±€ã®èšç®ã¯æ¬¡ã®ããã«ãªããŸãã $$ \phi_i = \max \left( \phi_i, \psi_i \right) \quad (i = 1, \ldots, 1024) $$
ããŠããœãŒã¹ã³ãŒãã®`LinearParams<T, InDims_, OutDims_>`æ§é äœãšã`BatchNorm1dParams<T, Dims_>`æ§é äœã¯ãå
šçµåå±€ (`Conv1d`ããã³`Linear`) ãšããããæ£èŠåå±€ (`BatchNorm1d`) ã®ãã©ã¡ãŒã¿ããããããŸãšãããã®ã§ãã
// Parameters for fully-connected layers
template <typename T, int InDims_, int OutDims_>
struct LinearParams
{
enum
{
InDims = InDims_,
OutDims = OutDims_,
};
T weight[OutDims][InDims];
T bias[OutDims];
};
// Parameters for 1D batch normalization layers
template <typename T, int Dims_>
struct BatchNorm1dParams
{
enum
{
Dims = Dims_,
};
// `scale` is obtained by multiplying weights and reciprocal of the
// standard deviation, i.e., sqrt(variance + eps) (to reduce the computational cost)
T scale[Dims];
T bias[Dims];
T mean[Dims];
};
`PointNetClsTop`å
ã§ã¯ãPyTorchã§å®çŸ©ãããã¢ãã«ã®åå±€ã«å¯Ÿå¿ããŠã以äžã®ãããªãã©ã¡ãŒã¿ã宣èšãããŸãã

- `feat_conv1`: `PointNetFeat::conv1` ã®éã¿ããã€ã¢ã¹
- `feat_conv2`: `PointNetFeat::conv2` ã®éã¿ããã€ã¢ã¹
- `feat_conv3`: `PointNetFeat::conv3` ã®éã¿ããã€ã¢ã¹
- `feat_conv4`: `PointNetFeat::conv4` ã®éã¿ããã€ã¢ã¹
- `feat_conv5`: `PointNetFeat::conv5` ã®éã¿ããã€ã¢ã¹
- `feat_bn1`: `PointNetFeat::bn1` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `feat_bn2`: `PointNetFeat::bn2` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `feat_bn3`: `PointNetFeat::bn3` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `feat_bn4`: `PointNetFeat::bn4` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `feat_bn5`: `PointNetFeat::bn5` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `cls_fc3`: `PointNetCls::fc3` ã®éã¿ããã€ã¢ã¹
- `cls_bn1`: `PointNetCls::bn1` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
- `cls_bn2`: `PointNetCls::bn2` ã®å¹³åããã€ã¢ã¹ãã¹ã±ãŒã«
ç¹åŸŽæœåºãããã¯ãŒã¯ã®å
šãŠã®å±€ã®ãã©ã¡ãŒã¿ã¯ãæšè«ãéå§ããåã«äºãããªã³ãããã¡ã¢ãªäžã«çœ®ããŠãããŸãã äžæ¹ãåé¡ãããã¯ãŒã¯ã®å
šçµåå±€2〠(`PointNetCls::fc1`ã`PointNetCls::fc2`) ã®ãã©ã¡ãŒã¿ã¯ããªã³ãããã¡ã¢ãªäžã«ã¯çœ®ããªãããã«ããŸãã ãã©ã¡ãŒã¿ãµã€ãºã倧ããããªã³ãããã¡ã¢ãªãäžè¶³ããããã§ãã ãããã®å±€ã«ã€ããŠã¯ãæšè«æã«DRAMãããã¡ããèªã¿åºããŸãã èšãæãããšããã©ã¡ãŒã¿ã®äžéšãDRAMãããã¡ããåãåºããŠãåºåã®äžéšãèšç®ããããšãç¹°ãè¿ããŸãã äžéšã®ãã©ã¡ãŒã¿ãä¿æããããã®ãå°ããªãªã³ããããããã¡ãçšæããã°ããããšã«ãªããŸãã
ç¹åŸŽæœåºãããã¯ãŒã¯ã«ã€ããŠã¯ã$N$åå
šãŠã®ç¹ã«å¯ŸããŠç¹åŸŽæœåºãè¡ãããã«ã$N$åã®é äŒæãèµ·ãããŸãã æšè«æéã®ãªãã§å ããå²åã倧ããã®ã§ã1åã®é äŒæã«èŠããèšç®æéãããŸãççž®ã§ããã°ãå
šäœã®æšè«æéã®å€§å¹
ãªççž®ã«ã€ãªãããŸã (ã¢ã ããŒã«ã®æ³å)ã äžæ¹ãåé¡ãããã¯ãŒã¯ã®é äŒæã¯1床ã ãã§ãæšè«æéã®ãªãã§ã¯ããã»ã©éèŠã§ã¯ãããŸããã ãã©ã¡ãŒã¿ããªã³ãããã¡ã¢ãªã«äºåã«æ ŒçŽããã®ãšæ¯ã¹ãŠãæšè«æã«DRAMãããã¡ããèªã¿åºããšå±€ã®èšç®æéã¯äŒžã³ãŠããŸããŸãããæšè«æéå
šäœã«äžãã圱é¿ã¯ããã»ã©å€§ãããããŸããã
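å
ã»ã©è§Šããã¢ã ããŒã«ã®æ³åãåŒã§è¡šããšã次ã®ããã«ãªããŸãã ç¹åŸŽæœåºãæšè«æéå
šäœã«å ããå²åã$p$ããã®éšåã®é«éåçã$s$ãšãããšãå
šäœã®é«éåç$S$ã¯

$$
S = \frac{1}{(1 - p) + p / s}
$$

ã§ãã äŸãã°$p = 0.99$ã®å Žåã$s = 100$ãšããŠã$S \approx 50$ã«ãšã©ãŸããŸãã éã«ã$p$ãå°ããéšåãããã é«éåããŠããå
šäœãžã®å¹æã¯éãããŸãã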
Vitis HLSã§ã¯ãä»»æ粟床ã®åºå®å°æ°ç¹æ°å`ap_fixed`ãçšæãããŠããŸãã å粟床浮åå°æ°ç¹æ°`float`ããå粟床浮åå°æ°ç¹æ°`half`ãå©çšã§ããŸãã
ããã§ã¯ãªãœãŒã¹æ¶è²»ãæããããã«ãåºå®å°æ°ç¹æ°ã䜿ããŸãã
ããã©ã«ãã®ãªãŒããŒãããŒã»ã¢ãŒã (`ap_o_mode::AP_WRAP`) ã§ã¯ãå€ããªãŒããŒãããŒãããšãã«æãè¿ããŸãã ããã ãšãæ倧å€ããæ¥ã«æå°å€ã«ãªã£ããããŠå±ãªã£ãããã®ã§ãæ倧å€ãããã¯æå°å€ã«çãŸãç¶ããããã«ã飜åã¢ãŒã (`ap_o_mode::AP_SAT`) ã«å€æŽããŠããŸãã 飜åã¢ãŒãã䜿ãåºå®å°æ°ç¹æ°åãã`ap_fixed_sat`ãšããŠå®çŸ©ããŸããã
ãã¥ãŒã©ã«ãããã®å
¥åºåãšãã©ã¡ãŒã¿ãšã§ãããå¹
ãå€ããããããã«ãå
¥åºåçšãšãã©ã¡ãŒã¿çšã«å¥ã
ã®åãçšæããŸãã (`param_t`ããã³`value_t`)ã ãã©ã¡ãŒã¿ã®å€åã«åãããŠããããå¹
ãåæžã§ãããããããŸããã ãããå¹
ã®åæžãéååãå°æ°ç¹åã®ãã©ãŒããããªã©ã¯ããããèªäœãç«æŽŸãªç 究åéãšãªã£ãŠããŸãã
// Value types
template <int _AP_W, int _AP_I>
using ap_fixed_sat = ap_fixed<
_AP_W, _AP_I, ap_q_mode::AP_TRN, ap_o_mode::AP_SAT, 0>;
// Data type for values (layer inputs, outputs, and intermediate results)
using value_t = ap_fixed_sat<kValueBitWidth, kValueIntWidth>;
// Data type for network parameters
using param_t = ap_fixed_sat<kParamBitWidth, kParamIntWidth>;
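ãªãŒããŒãããŒã»ã¢ãŒãã®éãã¯ãäŸãã°æ¬¡ã®ãããªå°ããªã³ãŒãã§ç¢ºãããããŸã (Cã·ãã¥ã¬ãŒã·ã§ã³çšã®ã¹ã±ããã§ã)ã

#include <ap_fixed.h>
#include <cstdio>

int main()
{
  // ç¬Šå·ä»ã8ããã (ãã¡æŽæ°éš4ããã): è¡šçŸç¯å²ã¯ [-8.0, 7.9375]
  ap_fixed<8, 4, ap_q_mode::AP_TRN, ap_o_mode::AP_WRAP> wrap = 7.5;
  ap_fixed<8, 4, ap_q_mode::AP_TRN, ap_o_mode::AP_SAT> sat = 7.5;
  wrap += 1.0;  // ãªãŒããŒãããŒããŠè² ã®å€ (-7.5) ã«æãè¿ã
  sat += 1.0;   // æ倧å€ (7.9375) ã«é£œåãã
  std::printf("wrap: %f, sat: %f\n", wrap.to_double(), sat.to_double());
  return 0;
}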
ããŠãããã§ç€ºãIPã³ã¢ã«ã¯ã2ã€ã®åäœã¢ãŒã (Operation mode) ãçšæãããŠããŸãã
- éã¿åæåã¢ãŒã (`kModeInitWeights`): éã¿ãDRAMãããã¡ããèªã¿åã£ãŠããªã³ããããããã¡ã«æ ŒçŽããã
- æšè«ã¢ãŒã (`kModeInference`): å
¥åç¹çŸ€ãããåã¯ã©ã¹ã®ããžãããèšç®ããã
ããããé ã«èª¬æããŸãã
éã¿åæåã¢ãŒãã§ã¯ãç¹åŸŽæœåºãããã¯ãŒã¯ã®å
šãã©ã¡ãŒã¿ãšãåé¡ãããã¯ãŒã¯ã®ãã©ã¡ãŒã¿ã®äžéšããDRAMãããã¡ããèªã¿åã£ãŠããªã³ããããããã¡ã«æ ŒçŽããŸãã ããã«ã¯ã以äžã«ç€ºã`InitializeFeatNaive`ããã³`InitializeClsNaive`ãå©çšããŸãã ãããããç¹åŸŽæœåºãããã¯ãŒã¯ãšãåé¡ãããã¯ãŒã¯ã®ããã®åæåé¢æ°ã§ãã
// Naive implementation of the parameter initialization
// `T` is the type for parameters
template <typename T>
void InitializeFeatNaive(LinearParams<T, kFeatDims0, kFeatDims1>* conv1,
LinearParams<T, kFeatDims1, kFeatDims2>* conv2,
LinearParams<T, kFeatDims2, kFeatDims3>* conv3,
LinearParams<T, kFeatDims3, kFeatDims4>* conv4,
LinearParams<T, kFeatDims4, kFeatDims5>* conv5,
BatchNorm1dParams<T, kFeatDims1>* bn1,
BatchNorm1dParams<T, kFeatDims2>* bn2,
BatchNorm1dParams<T, kFeatDims3>* bn3,
BatchNorm1dParams<T, kFeatDims4>* bn4,
BatchNorm1dParams<T, kFeatDims5>* bn5,
const float* params1,
const float* params2,
const float* params3,
const float* params4,
const float* params5)
{
#pragma HLS INLINE off
ReadBlockParamsNaive<T, kFeatDims0, kFeatDims1>(conv1, bn1, params1);
ReadBlockParamsNaive<T, kFeatDims1, kFeatDims2>(conv2, bn2, params2);
ReadBlockParamsNaive<T, kFeatDims2, kFeatDims3>(conv3, bn3, params3);
ReadBlockParamsNaive<T, kFeatDims3, kFeatDims4>(conv4, bn4, params4);
ReadBlockParamsNaive<T, kFeatDims4, kFeatDims5>(conv5, bn5, params5);
}
// Naive implementation of the parameter initialization
// `T` is the type for parameters
template <typename T>
void InitializeClsNaive(LinearParams<T, kClsDims2, kClsDims3>* fc3,
BatchNorm1dParams<T, kClsDims1>* bn1,
BatchNorm1dParams<T, kClsDims2>* bn2,
const float* params1,
const float* params2,
const float* params3)
{
#pragma HLS INLINE off
ReadBatchNorm1dParamsNaive<T, kClsDims1>(
bn1, params1, kClsDims0 * kClsDims1 + kClsDims1);
ReadBatchNorm1dParamsNaive<T, kClsDims2>(
bn2, params2, kClsDims1 * kClsDims2 + kClsDims2);
ReadLinearParamsNaive<T, kClsDims2, kClsDims3>(
fc3, params3, 0);
}
ãããã®é¢æ°ã®ãªãã§ã¯ã`ReadBlockParamsNaive`ã`ReadLinearParamsNaive`ããããŠ`ReadBatchNorm1dParamsNaive`ã®3ã€ã®é¢æ°ãåŒã³åºããŠããŸãã åé¢æ°ã®åäœã¯æ¬¡ã®éãã§ã (詳现ã¯ãœãŒã¹ã³ãŒããåç
§ããŠãã ããããªã¹ãã®åŸã«ã€ã¡ãŒãžã瀺ããŸã)ã ãã©ã¡ãŒã¿ã¯DRAMãããã¡äžã«`float`åã§çœ®ãããŠããããããããåºå®å°æ°ç¹æ°åã«çŽãåŠçãå«ãŸããŸãã
- `ReadLinearParamsNaive<T, InDims, OutDims>`: DRAMãããã¡ãããå
šçµåå±€ (`Conv1d`ããã³`Linear`) ã®éã¿ãšãã€ã¢ã¹ãèªã¿åãã éã¿ã®ãµã€ãºã¯`(OutDims, InDims)`ããã€ã¢ã¹ã®ãµã€ãºã¯`(OutDims)`ã§ããã 2ã€ã®ãã©ã¡ãŒã¿ã¯ã1次å
ã®é
åãšããŠé£çµãããŠãããšãã (é
åã®ãµã€ãºã¯`OutDims * InDims + OutDims`)ã
- `ReadBatchNorm1dParamsNaive<T, Dims>`: DRAMãããã¡ããããããæ£èŠåå±€ (`BatchNorm1d`) ã®ã¹ã±ãŒã«ããã€ã¢ã¹ãå¹³åãèªã¿åãã åãã©ã¡ãŒã¿ã®ãµã€ãºã¯`(Dims)`ã§ããã 3ã€ã®ãã©ã¡ãŒã¿ã¯ã1次å
ã®é
åãšããŠé£çµãããŠãããšãã (é
åã®ãµã€ãºã¯`3 * Dims`)ã
- `ReadBlockParamsNaive<T, InDims, OutDims>`: DRAMãããã¡ãããå
šçµåå±€ããã³ãããæ£èŠåå±€ã®ãã©ã¡ãŒã¿5ã€ãèªã¿åãã 5ã€ã®ãã©ã¡ãŒã¿ã¯ã1次å
ã®é
åãšããŠé£çµãããŠãããšãã (é
åã®ãµã€ãºã¯`OutDims * InDims + 4 * OutDims`)ã
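äŸãã°`ReadLinearParamsNaive`ã¯ããããã次ã®ãããªå®è£
ã«ãªããŸã (å®éã®ã³ãŒãã¯ãªããžããªãåç
§ããŠãã ãããããã§ã¯ååãäŒããããã®ã¹ã±ããã§ã)ã

// DRAMãããã¡ãããå
šçµåå±€ã®éã¿ãšãã€ã¢ã¹ãèªã¿åºãã€ã¡ãŒãž
// `params` ã«ã¯éã¿ (OutDims, InDims) ãšãã€ã¢ã¹ (OutDims) ããã®é ã«é£çµãããŠãã
template <typename T, int InDims, int OutDims>
void ReadLinearParamsNaive(LinearParams<T, InDims, OutDims>* linear,
                           const float* params,
                           const int offset)
{
#pragma HLS INLINE off
  for (int i = 0; i < OutDims; ++i) {
    for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE II=1
      // floatããåºå®å°æ°ç¹æ°ã«å€æããªããèªã¿åºã
      linear->weight[i][j] = T(params[offset + i * InDims + j]);
    }
  }
  for (int i = 0; i < OutDims; ++i) {
#pragma HLS PIPELINE II=1
    linear->bias[i] = T(params[offset + OutDims * InDims + i]);
  }
}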
æšè«ã¢ãŒãã§ã¯ãå
¥åç¹çŸ€ãããåã¯ã©ã¹ã®ããžãããèšç®ããŸãã ããã«ã¯ã以äžã«ç€ºã`InferenceFeatNaive`ããã³`InferenceClsNaive`ãå©çšããŸãã ãããããç¹åŸŽæœåºãããã¯ãŒã¯ãšãåé¡ãããã¯ãŒã¯ã®åŠçã§ãã
// Naive implementation of the PointNet feature extraction
// `T` is the type for layer input, output, and intermediate results
// `U` is the type for parameters
// `N` is the expected number of input points (e.g., 1024)
template <typename T, typename U, int N>
void InferenceFeatNaive(const float* point_cloud,
const int num_points,
T feature[kFeatDims5],
const LinearParams<U, kFeatDims0, kFeatDims1>* conv1,
const LinearParams<U, kFeatDims1, kFeatDims2>* conv2,
const LinearParams<U, kFeatDims2, kFeatDims3>* conv3,
const LinearParams<U, kFeatDims3, kFeatDims4>* conv4,
const LinearParams<U, kFeatDims4, kFeatDims5>* conv5,
const BatchNorm1dParams<U, kFeatDims1>* bn1,
const BatchNorm1dParams<U, kFeatDims2>* bn2,
const BatchNorm1dParams<U, kFeatDims3>* bn3,
const BatchNorm1dParams<U, kFeatDims4>* bn4,
const BatchNorm1dParams<U, kFeatDims5>* bn5)
{
#pragma HLS INLINE off
// Zero-initialize the output feature
VectorNdSetZero<T, kFeatDims5>(feature);
// Compute the feature
for (int i = 0; i < num_points; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N
#pragma HLS LOOP_FLATTEN off
// Input, output, and intermediate results
T x0[kFeatDims0];
T x1[kFeatDims1];
T x2[kFeatDims1];
T x3[kFeatDims2];
T x4[kFeatDims2];
T x5[kFeatDims3];
T x6[kFeatDims3];
T x7[kFeatDims4];
T x8[kFeatDims4];
T x9[kFeatDims5];
T x10[kFeatDims5];
// Read a point from a DDR memory
ReadPointNaive<T>(point_cloud, i, x0);
// Compute a point feature
LinearNaive<T, U, kFeatDims0, kFeatDims1, false>(
x0, x1, conv1->weight, conv1->bias);
BatchNorm1dReLUNaive<T, U, kFeatDims1>(
x1, x2, bn1->scale, bn1->bias, bn1->mean);
LinearNaive<T, U, kFeatDims1, kFeatDims2, false>(
x2, x3, conv2->weight, conv2->bias);
BatchNorm1dReLUNaive<T, U, kFeatDims2>(
x3, x4, bn2->scale, bn2->bias, bn2->mean);
LinearNaive<T, U, kFeatDims2, kFeatDims3, false>(
x4, x5, conv3->weight, conv3->bias);
BatchNorm1dReLUNaive<T, U, kFeatDims3>(
x5, x6, bn3->scale, bn3->bias, bn3->mean);
LinearNaive<T, U, kFeatDims3, kFeatDims4, false>(
x6, x7, conv4->weight, conv4->bias);
BatchNorm1dReLUNaive<T, U, kFeatDims4>(
x7, x8, bn4->scale, bn4->bias, bn4->mean);
LinearNaive<T, U, kFeatDims4, kFeatDims5, false>(
x8, x9, conv5->weight, conv5->bias);
BatchNorm1dReLUNaive<T, U, kFeatDims5>(
x9, x10, bn5->scale, bn5->bias, bn5->mean);
// Update the output feature
MaxPool1dNaive<T, kFeatDims5>(x10, feature);
}
}
// Naive implementation of the classification network
// `T` is the type for layer input, output, and intermediate results
// `U` is the type for parameters
template <typename T, typename U>
void InferenceClsNaive(const T feature[kFeatDims5],
float* out_logits,
const LinearParams<U, kClsDims2, kClsDims3>* fc3,
const BatchNorm1dParams<U, kClsDims1>* bn1,
const BatchNorm1dParams<U, kClsDims2>* bn2,
const float* params1,
const float* params2,
const float* params3)
{
#pragma HLS INLINE off
static_assert(kFeatDims5 == kClsDims0,
"Feature dimension should be equal to the input dimension");
// Input, output, and intermediate results
T x0[kClsDims1];
T x1[kClsDims1];
T x2[kClsDims2];
T x3[kClsDims2];
T x4[kClsDims3];
// Compute logits
LinearNaiveDDR<T, U, kClsDims0, kClsDims1, false>(
feature, x0, params1, 0);
BatchNorm1dReLUNaive<T, U, kClsDims1>(
x0, x1, bn1->scale, bn1->bias, bn1->mean);
LinearNaiveDDR<T, U, kClsDims1, kClsDims2, false>(
x1, x2, params2, 0);
BatchNorm1dReLUNaive<T, U, kClsDims2>(
x2, x3, bn2->scale, bn2->bias, bn2->mean);
LinearNaive<T, U, kClsDims2, kClsDims3, false>(
x3, x4, fc3->weight, fc3->bias);
// Write the result
WriteTensor1dNaive<T, kClsDims3>(out_logits, x4, 0);
}
`InferenceFeatNaive`ã§ã¯ãDRAMã«çœ®ãããç¹çŸ€ããŒã¿ (`point_cloud`) ããã1ã€ãã€ç¹ãèªã¿åããŸãã åç¹ (`x0`) ã«å¯ŸããŠããŒã«ã«ãªç¹åŸŽé (`x10`) ãèšç®ããçŸåšã®ã°ããŒãã«ç¹åŸŽé (`feature`) ãæŽæ°ããåŠçããç¹ã®åæ° (`num_points`) ã ãç¹°ãè¿ããŸãã `InferenceClsNaive`ã¯ãç¹çŸ€å
šäœãè¡šãã°ããŒãã«ç¹åŸŽé (`feature`) ãåãåã£ãŠãåã¯ã©ã¹ã«å¯Ÿããããžãã (`x4`) ãèšç®ãããããDRAMãããã¡ (`out_logits`) ã«æžãæ»ããŸãã
`ReadPointNaive`ã¯ã$i$çªç®ã®ç¹$\boldsymbol{p}_i$ãDRAMãããã¡ããèªã¿åããã®ã§ãã `LinearNaive`ã`BatchNorm1dReLUNaive`ã`MaxPool1dNaive`ã¯ãååã®éããå
šçµåå±€ (`Conv1d`)ããããæ£èŠåå±€ãšReLU掻æ§åãMaxããŒãªã³ã°å±€ã«å¯Ÿå¿ããŸã (å
çšã®èšç®åŒãåç
§)ã ãããã¯ãªã³ããããããã¡ãããã©ã¡ãŒã¿ãèªã¿åºããŠãå±€ã®åºåãèšç®ããŸãã `LinearNaiveDDR`ãå
šçµåå±€ã®é¢æ°ã§ãããDRAMãããã¡ãããã©ã¡ãŒã¿ãå°ããã€åãåºãã€ã€ãåºåãèšç®ããŸãã ãããã®é¢æ°ã以äžã«ç€ºããŸãã HLSãã©ã°ããé€ãã°ããœãããŠã§ã¢å®è£
ãšå€§äœåãã§ããããšãåãããŸãã è¡æ°ã¯å€ãã§ãããåŠçå
容ã¯åçŽã§ãã
// Naive implementation of the fully-connected layer
// `T` is the type for values
// `TParam` is the type for weight and bias
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
// `ApplyReLU` is the flag to apply ReLU activation
template <typename T, typename TParam,
int InDims, int OutDims, bool ApplyReLU>
void LinearNaive(const T x[InDims],
T y[OutDims],
const TParam weight[OutDims][InDims],
const TParam bias[OutDims])
{
#pragma HLS INLINE off
for (int i = 0; i < OutDims; ++i) {
#pragma HLS PIPELINE off
T val = bias[i];
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
val += x[j] * weight[i][j];
}
if (ApplyReLU)
y[i] = val > T(0) ? val : T(0);
else
y[i] = val;
}
}
// Naive implementation of the fully-connected layer
// Weight and bias parameters are stored on the DDR memory
template <typename T, typename TParam,
int InDims, int OutDims, bool ApplyReLU>
void LinearNaiveDDR(const T x[InDims],
T y[OutDims],
const float* params,
const int offset)
{
// `params` contains weight parameters of size (`OutDims`, `InDims`) and
// bias parameters of size (`OutDims`) in a contiguous buffer
#pragma HLS INLINE off
constexpr const int OffsetToBias = OutDims * InDims;
TParam bias[OutDims];
// Copy the bias parameters in advance
for (int i = 0; i < OutDims; ++i) {
#pragma HLS PIPELINE II=1
bias[i] = TParam(params[offset + OffsetToBias + i]);
}
for (int i = 0; i < OutDims; ++i) {
#pragma HLS PIPELINE off
T val = bias[i];
TParam weight[InDims];
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE II=1
weight[j] = TParam(params[offset + i * InDims + j]);
}
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
val += x[j] * weight[j];
}
if (ApplyReLU)
y[i] = val > T(0) ? val : T(0);
else
y[i] = val;
}
}
// Naive implementation of the 1D batch normalization and ReLU activation
// `T` is the type for values
// `TParam` is the type for parameters
// `Dims` is the number of input and output dimensions
template <typename T, typename TParam, int Dims>
void BatchNorm1dReLUNaive(const T x[Dims],
T y[Dims],
const TParam scale[Dims],
const TParam bias[Dims],
const TParam mean[Dims])
{
#pragma HLS INLINE off
for (int i = 0; i < Dims; ++i) {
#pragma HLS PIPELINE
// Batch normalization with the learned parameters
T val = (x[i] - mean[i]) * scale[i] + bias[i];
// ReLU activation
y[i] = val > T(0) ? val : T(0);
}
}
// Naive implementation of the 1D max-pooling layer
// `T` is the type for values
// `Dims` is the number of input and output dimensions
// `y` must be properly initialized
template <typename T, int Dims>
void MaxPool1dNaive(const T x[Dims], T y[Dims])
{
// `x` is of size (1, `Dims`)
// `y` is of size (1, `Dims`)
#pragma HLS INLINE off
for (int i = 0; i < Dims; ++i) {
#pragma HLS PIPELINE
y[i] = x[i] > y[i] ? x[i] : y[i];
}
}
`LinearNaiveDDR`ã§ã¯ãå
šçµåå±€ã®ãã€ã¢ã¹é
å`bias`ãšãåºå1èŠçŽ åã®èšç®ã«å¿
èŠãªéã¿`weight`ã ãããªã³ãããã¡ã¢ãªäžã«ä¿æããŸãã å
¥åºåã®æ¬¡å
ã$\mathrm{InDims}, \mathrm{OutDims}$ãšããã°ã`bias`ã®ãµã€ãºã¯$\mathrm{OutDims}$ã`weight`ã®ãµã€ãºã¯$\mathrm{InDims}$ãšãªããŸãã
äžèšã®é¢æ°ã®ã«ãŒãã«ã¯`#pragma HLS PIPELINE`ãä»å ãããŠãããã«ãŒãå
éšã®åŠçãèªåçã«ãã€ãã©ã€ã³åãããŸã (æé©åãã®4: ã«ãŒãã®ãã€ãã©ã€ã³å)ã `#pragma HLS PIPELINE off`ãšãããšããã®ãã€ãã©ã€ã³åãæå¶ãããŸãã
ãã€ãã©ã€ã³åã«ããå¹æãã以äžã®å³ã«ç€ºããŸãã
ã«ãŒãããã€ãã©ã€ã³åããªãå Žåã¯ãã«ãŒãã®åã€ãã¬ãŒã·ã§ã³ãé ã«å®è¡ããŸã (å³ã®äžéš)ã äžæ¹ããã€ãã©ã€ã³åã§ã¯ãã«ãŒãå
éšã®åŠçãåå² (å³ã®å Žåã¯4åå²) ããããããã®åŠçãæéçã«ãªãŒããŒã©ãããããŸã (å³ã®äžéš)ã è€æ°ã®ã€ãã¬ãŒã·ã§ã³ãåæã«å®è¡ããã®ã§ãã«ãŒãã®å®è¡æéãççž®ã§ããŸãã ã«ãŒãã®å®è¡æéã¯ãæãæéã®æããåŠç (å³ã®å Žåã¯åŠç3) ã«ãã£ãŠæ±ºãŸããŸãã ã€ãã¬ãŒã·ã§ã³ã®åŠçãããªãã¹ãåçã«åå²ããããšã§ããã€ãã©ã€ã³åã®å¹æãå¢ããŸãã äžèšã®ãœãŒã¹ã³ãŒãã®ããã«ãæå
ã«ãŒãã«ãã€ãã©ã€ã³åãé©çšãããšãåŠçæéã倧ããåæžã§ããŸãã 2éã«ãŒãã®ãã¡å€åŽã®ã«ãŒãã«ãã€ãã©ã€ã³åãé©çšãããšãå
åŽã®ã«ãŒãã¯å
šãŠå±éãããŠ1éã«ãŒãã«çŽãããã®ã§ããªãœãŒã¹æ¶è²»ãå€§å¹
ã«å¢ããŠããŸããŸãã å€åŽã®ã«ãŒãã«ã¯ããã€ãã©ã€ã³åãé©çšããªãæ¹ããããšæããŸãã
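ãã€ãã©ã€ã³åã®å¹æã¯æ¬¡ã®ããã«èŠç©ããããŸãã ã€ãã¬ãŒã·ã§ã³æ°ã$N$ã1ã€ã®ã€ãã¬ãŒã·ã§ã³ã«ããããµã€ã¯ã«æ° (深ã) ã$D$ãæ°ããã€ãã¬ãŒã·ã§ã³ãéå§ã§ããééãInitiation Interval ($II$) ãšãããšããã€ãã©ã€ã³åããªãå Žåã®å®è¡æéã¯ãã$N \cdot D$ãµã€ã¯ã«ããã€ãã©ã€ã³åããå Žåã¯ãã$(N - 1) \cdot II + D$ãµã€ã¯ã«ã§ãã $II = 1$ãéæã§ããã°ãå®è¡æéã¯ã»ãŒ$N$ãµã€ã¯ã«ãŸã§ççž®ãããŸãã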
äžèšã®IPã³ã¢ã¯ã`hls/src/top_naive.cpp`ã«ãããŸãã
ãã®IPã³ã¢ã¯æ£ããåäœããŸãããæããã«ãã€ãŒã㪠(å
šã工倫ããŠããªãçŽ æŽãª) å®è£
ã§ãã ããŒã¿äžŠåæ§ (Data parallelism) ã掻ãããŠãåå±€ã®èšç®ã䞊ååããŠã¿ãŸããã (æé©åãã®5: ããŒã¿äžŠåæ§)ã
å
šçµåå±€ã®èšç®ãããäžåºŠã¿ãŠã¿ãŸãã
$$
\boldsymbol{y} = \boldsymbol{W} \boldsymbol{x} + \boldsymbol{b}
$$
åºå$\boldsymbol{y}$ã®åèŠçŽ $y_i$ã¯æ¬¡ã®ããã«èšç®ãããŸãã
$$
y_i = \sum_j W_{i, j} x_j + b_i
$$
åºåã®åèŠçŽ ã¯äºãã«ç¬ç«ã«èšç®ã§ããããã$B$åã®åºåèŠçŽ $y_i, y_{i + 1}, \ldots, y_{i + B - 1}$ã䞊åã«èšç®ã§ããŸãã ãããæ£èŠåãšReLU掻æ§åã«ã€ããŠãåæ§ã«ãè€æ°ã®åºåèŠçŽ ã䞊åã«èšç®ããŸãã

$$
\begin{eqnarray}
y_i &=& \max \left( 0, \left( x_i - \mu_i \right) \cdot s_i + b_i \right) \\
y_{i + 1} &=& \max \left( 0, \left( x_{i + 1} - \mu_{i + 1} \right) \cdot s_{i + 1} + b_{i + 1} \right) \\
&\vdots& \\
y_{i + B - 1} &=& \max \left( 0, \left( x_{i + B - 1} - \mu_{i + B - 1} \right) \cdot s_{i + B - 1} + b_{i + B - 1} \right)
\end{eqnarray}
$$
MaxããŒãªã³ã°ã«ã€ããŠãå
šãåãã§ãè€æ°ã®åºåèŠçŽ $\phi_i, \phi_{i + 1}, \ldots, \phi_{i + B - 1}$ã䞊åã«èšç®ããŸãã

$$
\begin{eqnarray}
\phi_i &=& \max \left( \phi_i, \psi_i \right) \\
\phi_{i + 1} &=& \max \left( \phi_{i + 1}, \psi_{i + 1} \right) \\
&\vdots& \\
\phi_{i + B - 1} &=& \max \left( \phi_{i + B - 1}, \psi_{i + B - 1} \right)
\end{eqnarray}
$$
`LinearNaive`ã`LinearNaiveDDR`ã`BatchNorm1dReLUNaive`ã`MaxPool1dNaive`ããåå±€ã®ãã€ãŒããªå®è£
ã§ããã 䞊ååããããŒãžã§ã³`LinearOpt1`ã`LinearOpt1DDR`ã`BatchNorm1dReLUOpt1`ã`MaxPool1dOpt1`ã«çœ®ãæããŸã (ååã`Naive`ãã`Opt1`ã«ããŸã)ã ãã³ãã¬ãŒãåŒæ°ãšããŠ`B`ãè¿œå ãããŠããŸã (`B`䞊å)ã
// Parallel implementation of the fully-connected layer
// Matrix-vector multiplication is parallelized along the output dimension
// `T` is the type for values
// `TParam` is the type for weight and bias
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
// `ApplyReLU` is the flag to apply ReLU activation
// `B` is the block size for the output dimension
template <typename T, typename TParam,
int InDims, int OutDims, bool ApplyReLU, int B>
void LinearOpt1(const T x[InDims],
T y[OutDims],
const TParam weight[OutDims][InDims],
const TParam bias[OutDims])
{
#pragma HLS INLINE off
// `OutDims` must be a multiple of `B`
static_assert(OutDims % B == 0, "`OutDims` must be a multiple of `B`");
for (int i0 = 0; i0 < OutDims; i0 += B) {
#pragma HLS PIPELINE off
T vals[B];
#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
T last = (j == 0) ? T(bias[i]) : vals[i1];
vals[i1] = last + x[j] * weight[i][j];
}
}
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (ApplyReLU)
y[i] = vals[i1] > T(0) ? vals[i1] : T(0);
else
y[i] = vals[i1];
}
}
}
// Parallel implementation of the fully-connected layer
// Weight and bias parameters are stored on the DDR memory
// Matrix-vector multiplication is parallelized along the output dimension
template <typename T, typename TParam,
int InDims, int OutDims, bool ApplyReLU, int B>
void LinearOpt1DDR(const T x[InDims],
T y[OutDims],
const float* params,
const int offset)
{
// `params` contains weight parameters of size (`OutDims`, `InDims`) and
// bias parameters of size (`OutDims`) in a contiguous buffer
#pragma HLS INLINE off
// `OutDims` must be a multiple of `B`
static_assert(OutDims % B == 0, "`OutDims` must be a multiple of `B`");
// `B` must be larger than 1
static_assert(B > 1, "`B` must be larger than 1");
constexpr const int BHalf = B / 2;
constexpr const int OffsetToBias = OutDims * InDims;
TParam bias[OutDims];
#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1
// Copy the bias parameters in advance
for (int i = 0; i < OutDims; ++i) {
#pragma HLS PIPELINE II=1
bias[i] = TParam(params[offset + OffsetToBias + i]);
}
for (int i0 = 0; i0 < OutDims; i0 += B) {
#pragma HLS PIPELINE off
T vals[B];
#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1
TParam weight[B][InDims];
#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1
// Copy the weight parameters for `B` outputs
const int offset0 = offset + i0 * InDims;
for (int i1 = 0; i1 < B; ++i1) {
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE II=1
weight[i1][j] = TParam(params[offset0 + i1 * InDims + j]);
}
}
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (i < OutDims) {
T last = (j == 0) ? T(bias[i]) : vals[i1];
vals[i1] = last + x[j] * weight[i1][j];
}
}
}
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (i < OutDims) {
if (ApplyReLU)
y[i] = vals[i1] > T(0) ? vals[i1] : T(0);
else
y[i] = vals[i1];
}
}
}
}
// Parallel implementation of the 1D batch normalization and ReLU activation
// `T` is the type for values
// `TParam` is the type for parameters
// `Dims` is the number of input and output dimensions
// `B` is the block size for the output dimension
template <typename T, typename TParam, int Dims, int B>
void BatchNorm1dReLUOpt1(const T x[Dims],
T y[Dims],
const TParam scale[Dims],
const TParam bias[Dims],
const TParam mean[Dims])
{
// `scale` is the multiplication of the weight and reciprocal of the
// standard deviation (to reduce the on-chip memory consumption)
#pragma HLS INLINE off
static_assert(Dims % B == 0, "`Dims` must be a multiple of `B`");
for (int i0 = 0; i0 < Dims; i0 += B) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
// Batch normalization with the learned parameters
T val = (x[i] - mean[i]) * scale[i] + bias[i];
// ReLU activation
y[i] = val > T(0) ? val : T(0);
}
}
}
// Parallel implementation of the 1D max-pooling layer
// `T` is the type for values
// `Dims` is the number of input and output dimensions
// `B` is the block size for the output dimension
// `y` must be properly initialized
template <typename T, int Dims, int B>
void MaxPool1dOpt1(const T x[Dims], T y[Dims])
{
#pragma HLS INLINE off
static_assert(Dims % B == 0, "`Dims` must be a multiple of `B`");
for (int i0 = 0; i0 < Dims; i0 += B) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
y[i] = x[i] > y[i] ? x[i] : y[i];
}
}
}
`LinearOpt1`ãš`LinearNaive`ãæ¯ã¹ãŠã¿ããšã`j` (å
¥å次å
) ã®ã«ãŒãã¯ãã®ãŸãŸã§ã`i` (åºå次å
) ã«é¢ããã«ãŒããã`i0`ãš`i1`ã®2ã€ã«åå²ãããŠããŸãã `i0`ã¯`B`å»ã¿ã`i1`ã¯`i0`ãã`i0 + B - 1`ãŸã§1ãã€å¢ããŠãããŸãã `i1`ã«é¢ããã«ãŒãã¯ã¢ã³ããŒãªã³ã° (`#pragma HLS UNROLL`) ãããŠããã®ã§ãã«ãŒãã®äžèº«ãå®å
šã«å±éãããŸãã ã€ãŸãã`i1`ã®ã«ãŒãèªäœã¯ç¡ããªãã`i0`ãã`i0 + B - 1`ãŸã§ã®åŠçã䞊åã«å®è¡ãããŸãã
æåã®ã«ãŒãã«æ³šç®ããŠã¿ãŸãããã
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
T last = (j == 0) ? T(bias[i]) : vals[i1];
vals[i1] = last + x[j] * weight[i][j];
}
}
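ãã®ã«ãŒããã¢ã³ããŒãªã³ã°ãããšãæŠå¿µçã«ã¯æ¬¡ã®ãããªåŠçã«ãªããŸãã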
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
T last0 = (j == 0) ? T(bias[i0 + 0]) : vals[0];
T last1 = (j == 0) ? T(bias[i0 + 1]) : vals[1];
// ...
T lastB1 = (j == 0) ? T(bias[i0 + B - 1]) : vals[B - 1];
vals[0] = last0 + x[j] * weight[i0 + 0][j];
vals[1] = last1 + x[j] * weight[i0 + 1][j];
// ...
vals[B - 1] = lastB1 + x[j] * weight[i0 + B - 1][j];
}
䞊ååŠçã®ããã«ã`vals`ãšããããµã€ãº`B`ã®äžæé
åãæ°ãã«çšæããŠããŸãã ãã®é
åã«ã¯ãåºå`y[i0]`ãã`y[i0 + B - 1]`ãŸã§ã®èšç®çµæãä¿æããŸãã `vals`ã®åèŠçŽ ã¯ããã€ã¢ã¹é
`bias[i0]`ãã`bias[i0 + B - 1]`ã§åæåãããŸãã ãã®åŸã`j`ã®ã«ãŒãã«ãã£ãŠã`x[j] * weight[i0][j]`ãã`x[j] * weight[i0 + B - 1][j]`ãã`vals`ã®åèŠçŽ ã«é ã«å ç®ãããŸãã äžèšã®èšç®åŒãšå¯Ÿå¿ããŠããããšãåãããŸãã

ã«ãŒããå±éãããšã`vals[0]`ãã`vals[B - 1]`ãŸã§ã®å
šèŠçŽ ãšããããã`bias[i0]`ãã`bias[i0 + B - 1]`ãŸã§ããããŠ`weight[i0][j]`ãã`weight[i0 + B - 1][j]`ãŸã§ã®`B`åã®èŠçŽ ã«ã1ãµã€ã¯ã«ã§ã¢ã¯ã»ã¹ããå¿
èŠããããŸãã ãããå®çŸããã«ã¯ãé
å`bias`ã`vals`ã`weight`ã®ããŒãæ°ã`B`以äžã«ããå¿
èŠããããŸãã `vals`ã«ã€ããŠã¯ã`#pragma HLS ARRAY_PARTITION type=complete`ã䜿ã£ãŠãé
åãåã
ã®èŠçŽ ã«å®å
šã«å解ããŠããŸãã

åå²ããªãå Žåã¯ããŒãã2ã€ãããªãã®ã§ãåæã«ã¯2ã€ã®èŠçŽ ãèªã¿åºã (ãããã¯1èŠçŽ ãèªã¿åºããŠãå¥ã®1èŠçŽ ãžæžã蟌ã) ããšããã§ããŸããã å®å
šã«åå²ãããšãé
åã®å
šãŠã®èŠçŽ ãåæã«èªã¿æžãã§ããããã«ãªããŸãã ãªããå®å
šã«åå²ãããšããªã³ãããã¡ã¢ãª (BlockRAM) ã§ã¯ãªããããªãããããã (FF) ã䜿ã£ãŠé
åãå®è£
ãããŸãã `B`åã®èŠçŽ ããã€é
å`vals`ããå®å
šã«åå²ãããšã次ã®ããã«ãªããŸãã
`LinearOpt1`å
ã«ã¯èšè¿°ãããŠããŸãããã`weight`ãš`bias`ã«ã€ããŠã¯ãå¥ã®å Žæã§`vals`ãšåæ§ã®HLSãã©ã°ããæå®ããå¿
èŠããããŸãã `weight`ãš`bias`ããã1ãµã€ã¯ã«ã§`B`åã®é£ç¶ããèŠçŽ (`bias[i0]`ãã`bias[i0 + B - 1]`ãŸã§ããããŠ`weight[i0][j]`ãã`weight[i0 + B - 1][j]`ãŸã§) ãèªã¿åºãã«ã¯ã次ã®ããã«ãµã€ã¯ãªãã¯åå²ããŸãã `weight`ã¯2次å
é
åã§ãããæåã®æ¬¡å
ã«å¯ŸããŠåå²ãããã®ã§ã`dim=1`ãæå®ããŸãã ãªã³ãããã¡ã¢ãª (BlockRAM) ã«ã¯1ã€ã«ã€ãããŒãã2ã€ä»ããŠããã1ãµã€ã¯ã«ã§2èŠçŽ ã®èªã¿åºã (ãããã¯1ã€ã®æžã蟌ã¿ãš1ã€ã®èªã¿åºã) ãã§ããŸãã ãã£ãŠã`B`åã®èŠçŽ ã1ãµã€ã¯ã«ã§èªã¿åºãã«ã¯ãé
åã`BHalf = B / 2`åã«åå²ããã°ããããšã«ãªããŸãã
constexpr const int BHalf = B / 2;
TParam weight[OutDims][InDims];
#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1
TParam bias[OutDims];
#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1
ç°¡åãªäŸãšããŠã2次å
é
å`w[8][4]`ããæåã®æ¬¡å
ã§4ã€ã«ãµã€ã¯ãªãã¯åå² (`factor=4 dim=1`) ããå Žåãèããã¿ãŸãã 4åå²ãããšããŒãæ°ã8ã€ã«å¢ããã®ã§ã8ã€ã®é£ç¶ããèŠçŽ (äŸãã°`w[0][j]`ãã`w[7][j]`ãŸã§) ããŸãšããŠèªã¿åºããããã«ãªããŸãã ãµã€ã¯ãªãã¯åå²ã§ã¯ãåå²ãããããããã®é
åã«å¯ŸããŠé ã«ãå
é ­ã®èŠçŽ ãã (`w[0][0]`ã`w[1][0]`ã`w[2][0]`ã®é ã«) è©°ããŠãããŸãã å
šãŠã®é
åã«èŠçŽ ãå
¥ã£ããããŸãæåã®é
åã«æ»ã£ãŠãèŠçŽ ãé ã«è©°ããŠãããŸãã ãããç¹°ãè¿ããšãå³ã®ãããªé
眮ã«ãªããŸãã

é£ç¶ããèŠçŽ (`w[0][0]`ã`w[1][0]`ã`w[2][0]`ã`w[3][0]`ãªã©) ãå¥ã
ã®é
åã«æ ŒçŽãããã®ã§ããããããäžåºŠã«åãåºãããšãã§ããŸãã ã«ãŒãã¢ã³ããŒãªã³ã°ãšé
åã®ãµã€ã¯ãªãã¯åå²ãçµã¿åãããããšã§ãé
åã®é£ç¶ããèŠçŽ ã«å¯Ÿãã䞊ååŠçã容æã«å®çŸã§ããŸãã ãã®ããšããã`#pragma HLS UNROLL`ãš`#pragma HLS ARRAY_PARTITION`ã¯ãã»ããã§äœ¿ãå Žé¢ãå€ããšæããŸãã
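äŸãã°ã`w[8][4]`ã`factor=4 dim=1`ã§ãµã€ã¯ãªãã¯åå²ãããšããè¡$i$ã¯$i \bmod 4$çªç®ã®ãã³ã¯ã«æ ŒçŽãããŸãã 宣èšéšåã ããæç²ãããšæ¬¡ã®ãããªã€ã¡ãŒãžã§ã (ã³ã¡ã³ãã¯è£è¶³ã§ã)ã

// w[8][4] ã factor=4 dim=1 ã§ãµã€ã¯ãªãã¯åå²ãããšãã®ãã³ã¯å²ãåœãŠ:
//   ãã³ã¯0: w[0][*], w[4][*]
//   ãã³ã¯1: w[1][*], w[5][*]
//   ãã³ã¯2: w[2][*], w[6][*]
//   ãã³ã¯3: w[3][*], w[7][*]
// åãã³ã¯ã¯ãã¥ã¢ã«ããŒããªã®ã§ãw[0][j]ããw[7][j]ã®8èŠçŽ ã
// 1ãµã€ã¯ã«ã§èªã¿åºãã (4ãã³ã¯ x 2ããŒã = 8èªã¿åºã/ãµã€ã¯ã«)
float w[8][4];
#pragma HLS ARRAY_PARTITION variable=w type=cyclic factor=4 dim=1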
ã¢ã³ããŒãªã³ã°ä¿æ°ãšãé
åã®åå²æ°ã¯æãããå¿
èŠããããŸãã ä¿æ°`B`ã§ã¢ã³ããŒãªã³ã°ããããé
åã`B / 2`å (ãããã¯`B`å) ã«ãµã€ã¯ãªãã¯åå²ããªããšã`B`䞊åã«ã¯ãªããŸããã ãŸããã«ãŒããã¢ã³ããŒãªã³ã°ããŠããé
åãååã«åå²ããªããã°ã䞊ååŠçã«ã¯ãªããŸããã
æåã®æ¬¡å
ã§2ã€ã«ãµã€ã¯ãªãã¯åå² (`factor=2 dim=1`) ããã°ã次ã®ããã«ãªããŸãã 2åå²ã§ã¯ããŒãæ°ã4ã€ã«å¢ããã®ã§ã4ã€ã®é£ç¶ããèŠçŽ (äŸãã°`w[0][j]`ãã`w[3][j]`ããããã¯`w[4][j]`ãã`w[7][j]`ãŸã§) ããŸãšããŠèªã¿åºããŸãã
2çªç®ã®æ¬¡å
ã§2ã€ã«ãµã€ã¯ãªãã¯åå² (`factor=2 dim=2`) ããã°ã次ã®ããã«ãªããŸãã ä»åºŠã¯2çªç®ã®æ¬¡å
ã«ã€ããŠã4ã€ã®é£ç¶ããèŠçŽ (äŸãã°`w[i][0]`ãã`w[i][3]`ãŸã§) ã«1ãµã€ã¯ã«ã§ã¢ã¯ã»ã¹ã§ããŸãã ããããèãããšã`weight`ãš`bias`ã«ã€ããŠã¯äžèšã®ãã©ã°ãã䜿ãã°ãããšåãããŸãã
ããŠã2ã€ç®ã®ã«ãŒãã«æ³šç®ããŠã¿ãŸãããã 1ã€ç®ã®ã«ãŒãã§èšç®ããã`B`åã®èŠçŽ ããåºå`y`ã«æžã蟌ãéšåã§ãã
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (ApplyReLU)
y[i] = vals[i1] > T(0) ? vals[i1] : T(0);
else
y[i] = vals[i1];
}
ãã®ã«ãŒããã¢ã³ããŒãªã³ã°ãããšã次ã®ããã«ãªããŸãã
if (ApplyReLU) {
y[i0 + 0] = vals[0] > T(0) ? vals[0] : T(0);
y[i0 + 1] = vals[1] > T(0) ? vals[1] : T(0);
// ...
y[i0 + B - 1] = vals[B - 1] > T(0) ? vals[B - 1] : T(0);
} else {
y[i0 + 0] = vals[0];
y[i0 + 1] = vals[1];
// ...
y[i0 + B - 1] = vals[B - 1];
}
åºå`y[i0]`ãã`y[i0 + B - 1]`ãŸã§ã®ãé£ç¶ãã`B`åã®èŠçŽ ã«1ãµã€ã¯ã«ã§ã¢ã¯ã»ã¹ããå¿
èŠããããŸãã `LinearOpt1`å
ã«ã¯èšèŒãããŠããŸããããé
å`y`ãã次ã®ããã«ãµã€ã¯ãªãã¯åå²ããã°ããã§ãã
constexpr const int BHalf = B / 2;
T y[OutDims];
#pragma HLS ARRAY_PARTITION variable=y type=cyclic factor=BHalf dim=1
ãªããå
¥å`x`ã«ã€ããŠã¯ãã«ãŒãã®åã€ãã¬ãŒã·ã§ã³ã§1ã€ã®èŠçŽ ã«ããã¢ã¯ã»ã¹ããªããããåå²ããå¿
èŠã¯ãããŸããã
`LinearOpt1`ã䜿ã£ãŠå
šçµåå±€ã®åŠçã`B`䞊åã§å®è¡ããã«ã¯ãåŒæ°ã§ããéã¿`weight`ããã€ã¢ã¹`bias`ãåºå`y`ããåºåã®æ¬¡å
ã§`B / 2`åã«åå²ããªããã°ãªããŸãã (`B`ã2ã§ããã°åå²ã®å¿
èŠã¯ãããŸãã)ã 以äžãã`LinearOpt1`ã®äž»ãªå€æŽç¹ã§ãã `LinearOpt1DDR`ã«ã€ããŠãã`B`åã®åºåã䞊åã«èšç®ããããã«ãåæ§ã®å€æŽããªãããŠããŸãã
å
šçµåå±€ã®ãã€ã¢ã¹é
å`bias`ãšãåºåã®`B`èŠçŽ åãèšç®ããã®ã«å¿
èŠãªéã¿`weight`ããDRAMãããã¡ãããªã³ããããããã¡äžã«è»¢éããŠããŸãã `LinearNaiveDDR`ãšã¯ç°ãªããéã¿ãä¿æãããããã¡`weight`ã¯2次å
é
åãšãªã£ãŠããŸãã `B`åã®å¿
èŠãªèŠçŽ ãåæã«åãåºããããã«ã`bias`ãš`weight`ã¯`BHalf = B / 2`åã«åå²ãããŠããŸãã
`BatchNorm1dReLUOpt1`ãš`MaxPool1dOpt1`ã«ã€ããŠãã`i` (åºå次å
) ã«é¢ããã«ãŒããã`i0`ãš`i1`ã®2ã€ã«åå²ãããŠããŸãã `i1`ã®ã«ãŒãã¯ã¢ã³ããŒãªã³ã°ããã`B`åã®åºåã䞊åã«èšç®ãããŸãã
`BatchNorm1dReLUOpt1`ã䜿ã£ãŠããããæ£èŠåãšReLU掻æ§åã`B`䞊åã§å®è¡ããã«ã¯ãé¢æ°ã®å
¥å`x`ãåºå`y`ãšããããæ£èŠåå±€ã®ãã©ã¡ãŒã¿ (ã¹ã±ãŒã«`scale`ããã€ã¢ã¹`bias`ãå¹³å`mean`) ã`B / 2`åã«åå²ããŸãã `MaxPool1dOpt1`ã«ã€ããŠãåæ§ã§ã`B`䞊åã§MaxããŒãªã³ã°ãè¡ãããã«ãé¢æ°ã®å
¥å`x`ãšåºå`y`ã`B / 2`åã«åå²ããŸã (`x`ã¯åç¹ã«å¯ŸããããŒã«ã«ç¹åŸŽéã`y`ã¯ç¹çŸ€å
šäœãè¡šãã°ããŒãã«ãªç¹åŸŽé)ã
åå±€ã`B`䞊åã§åäœãããããã®ãé
ååå²ã®ã«ãŒã«ã次ã«ãŸãšããŸãã 2䞊åã®å Žåã¯ãåå²ã®å¿
èŠããªãããšãåãããŸãã
- `LinearOpt1`: éã¿`weight`ããã€ã¢ã¹`bias`ãåºå`y`ããåºåã®æ¬¡å
ã§`B / 2`åã«åå² (å
¥å`x`ã¯åå²ã®å¿
èŠãªã)
- `LinearOpt1DDR`: åºå`y`ã`B / 2`åã«åå² (å
¥å`x`ã¯åå²ã®å¿
èŠãªã)
- `BatchNorm1dReLUOpt1`: å
¥å`x`ãšåºå`y`ããã©ã¡ãŒã¿ (ã¹ã±ãŒã«`scale`ããã€ã¢ã¹`bias`ãå¹³å`mean`) ãã`B / 2`åã«åå²
- `MaxPool1dOpt1`: å
¥å`x`ãšåºå`y`ãã`B / 2`åã«åå²
ãããã®äžŠååãããããŒãžã§ã³ã䜿ã£ãŠãç¹åŸŽæœåºãããã¯ãŒã¯ãšãåé¡ãããã¯ãŒã¯ã®æšè«åŠçã次ã®ããã«æžãæããŸãã `InferenceFeatNaive`ãš`InferenceClsNaive`ã¯ããããããã`InferenceFeatOpt1`ãš`InferenceClsOpt1`ã«ãªããŸãã é¢æ°ã®åŒæ°ã¯å€æŽããŸããã ãªãã`InitializeFeatNaive`ãš`InitializeClsNaive` (éã¿ã®åæåé¢æ°) ã¯ãã®ãŸãŸäœ¿ãããšã«ããŸã (é¢æ°åã ããã`InitializeFeatOpt1`ã`InitializeClsOpt1`ãšããŸãã)ã
// Parallel implementation of the PointNet feature extraction
// `T` is the type for layer input, output, and intermediate results
// `U` is the type for parameters
// `N` is the expected number of input points (e.g., 1024)
template <typename T, typename U, int N>
void InferenceFeatOpt1(const float* point_cloud,
const int num_points,
T feature[kFeatDims5],
const LinearParams<U, kFeatDims0, kFeatDims1>* conv1,
const LinearParams<U, kFeatDims1, kFeatDims2>* conv2,
const LinearParams<U, kFeatDims2, kFeatDims3>* conv3,
const LinearParams<U, kFeatDims3, kFeatDims4>* conv4,
const LinearParams<U, kFeatDims4, kFeatDims5>* conv5,
const BatchNorm1dParams<U, kFeatDims1>* bn1,
const BatchNorm1dParams<U, kFeatDims2>* bn2,
const BatchNorm1dParams<U, kFeatDims3>* bn3,
const BatchNorm1dParams<U, kFeatDims4>* bn4,
const BatchNorm1dParams<U, kFeatDims5>* bn5)
{
#pragma HLS INLINE off
// Zero-initialize the output feature
VectorNdSetZero<T, kFeatDims5>(feature);
// Compute the feature
for (int i = 0; i < num_points; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N
#pragma HLS LOOP_FLATTEN off
// Input, output, and intermediate results
T x0[kFeatDims0];
T x1[kFeatDims1];
T x2[kFeatDims1];
T x3[kFeatDims2];
T x4[kFeatDims2];
T x5[kFeatDims3];
T x6[kFeatDims3];
T x7[kFeatDims4];
T x8[kFeatDims4];
T x9[kFeatDims5];
T x10[kFeatDims5];
#pragma HLS ARRAY_PARTITION variable=x3 type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=x5 type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=x7 type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=x9 type=cyclic factor=64 dim=1
// Read a point from a DDR memory
ReadPointNaive<T>(point_cloud, i, x0);
// Compute a point feature
LinearOpt1<T, U, kFeatDims0, kFeatDims1, false, 2>(
x0, x1, conv1->weight, conv1->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims1, 2>(
x1, x2, bn1->scale, bn1->bias, bn1->mean);
LinearOpt1<T, U, kFeatDims1, kFeatDims2, false, 8>(
x2, x3, conv2->weight, conv2->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims2, 2>(
x3, x4, bn2->scale, bn2->bias, bn2->mean);
LinearOpt1<T, U, kFeatDims2, kFeatDims3, false, 8>(
x4, x5, conv3->weight, conv3->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims3, 2>(
x5, x6, bn3->scale, bn3->bias, bn3->mean);
LinearOpt1<T, U, kFeatDims3, kFeatDims4, false, 16>(
x6, x7, conv4->weight, conv4->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims4, 2>(
x7, x8, bn4->scale, bn4->bias, bn4->mean);
LinearOpt1<T, U, kFeatDims4, kFeatDims5, false, 128>(
x8, x9, conv5->weight, conv5->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims5, 2>(
x9, x10, bn5->scale, bn5->bias, bn5->mean);
// Update the output feature
MaxPool1dOpt1<T, kFeatDims5, 2>(x10, feature);
}
}
// Parallel implementation of the classification network
// `T` is the type for layer input, output, and intermediate results
// `U` is the type for parameters
template <typename T, typename U>
void InferenceClsOpt1(const T feature[kFeatDims5],
float* out_logits,
const LinearParams<U, kClsDims2, kClsDims3>* fc3,
const BatchNorm1dParams<U, kClsDims1>* bn1,
const BatchNorm1dParams<U, kClsDims2>* bn2,
const float* params1,
const float* params2,
const float* params3)
{
#pragma HLS INLINE off
static_assert(kFeatDims5 == kClsDims0,
"Feature dimension should be equal to the input dimension");
// Input, output, and intermediate results
T x0[kClsDims1];
T x1[kClsDims1];
T x2[kClsDims2];
T x3[kClsDims2];
T x4[kClsDims3];
#pragma HLS ARRAY_PARTITION variable=x0 type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=x2 type=cyclic factor=4 dim=1
// Compute logits
LinearOpt1DDR<T, U, kClsDims0, kClsDims1, false, 16>(
feature, x0, params1, 0);
BatchNorm1dReLUOpt1<T, U, kClsDims1, 2>(
x0, x1, bn1->scale, bn1->bias, bn1->mean);
LinearOpt1DDR<T, U, kClsDims1, kClsDims2, false, 8>(
x1, x2, params2, 0);
BatchNorm1dReLUOpt1<T, U, kClsDims2, 2>(
x2, x3, bn2->scale, bn2->bias, bn2->mean);
LinearOpt1<T, U, kClsDims2, kClsDims3, false, 2>(
x3, x4, fc3->weight, fc3->bias);
// Write the result
WriteTensor1dNaive<T, kClsDims3>(out_logits, x4, 0);
}
åå±€ã®é¢æ°ãåŒã³åºãéã«ã¯ããã³ãã¬ãŒãåŒæ°ã§äžŠåå床ãæå®ããŠããŸãã äŸãã°ãç¹åŸŽæœåºãããã¯ãŒã¯ã®4çªç®ã®å
šçµåå±€ (PyTorchã®ã¢ãã«ã«ããã`PointNetFeat::conv4`) ã¯16䞊åãæåŸã®å
šçµåå±€ (`PointNetFeat::conv5`) ã¯128䞊åã§å®è¡ãããŸãã äžæ¹ããããæ£èŠåå±€ãšMaxããŒãªã³ã°å±€ã¯ã2䞊åã§å®è¡ãããŸãã åå±€ã®äžŠå床ãã©ã®ããã«æ±ºããã®ãã«ã€ããŠã¯ãåŸè¿°ããŸãã

ç¶ããŠãIPã³ã¢ã®æäžäœé¢æ°`PointNetClsTop`ã以äžã«ç€ºããŸãã
void PointNetClsTop(const int op_mode,
const float* point_cloud,
const int num_points,
float* out_logits,
const float* feat_params1,
const float* feat_params2,
const float* feat_params3,
const float* feat_params4,
const float* feat_params5,
const float* cls_params1,
const float* cls_params2,
const float* cls_params3)
{
#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params2 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params3 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params4 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params5 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params2 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=op_mode bundle=control
#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control
#pragma HLS INTERFACE s_axilite port=num_points bundle=control
#pragma HLS INTERFACE s_axilite port=out_logits bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params2 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params3 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params4 bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params5 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params1 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params2 bundle=control
#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
// Parameters for feature extraction
LinearParams<param_t, kFeatDims0, kFeatDims1> feat_conv1;
LinearParams<param_t, kFeatDims1, kFeatDims2> feat_conv2;
LinearParams<param_t, kFeatDims2, kFeatDims3> feat_conv3;
LinearParams<param_t, kFeatDims3, kFeatDims4> feat_conv4;
LinearParams<param_t, kFeatDims4, kFeatDims5> feat_conv5;
BatchNorm1dParams<param_t, kFeatDims1> feat_bn1;
BatchNorm1dParams<param_t, kFeatDims2> feat_bn2;
BatchNorm1dParams<param_t, kFeatDims3> feat_bn3;
BatchNorm1dParams<param_t, kFeatDims4> feat_bn4;
BatchNorm1dParams<param_t, kFeatDims5> feat_bn5;
#pragma HLS ARRAY_PARTITION variable=feat_conv2.weight type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv2.bias type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv3.weight type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv3.bias type=cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv4.weight type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv4.bias type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv5.weight type=cyclic factor=64 dim=1
#pragma HLS ARRAY_PARTITION variable=feat_conv5.bias type=cyclic factor=64 dim=1
// Parameters for classification network
// LinearParams<param_t, kClsDims0, kClsDims1> cls_fc1;
// LinearParams<param_t, kClsDims1, kClsDims2> cls_fc2;
LinearParams<param_t, kClsDims2, kClsDims3> cls_fc3;
BatchNorm1dParams<param_t, kClsDims1> cls_bn1;
BatchNorm1dParams<param_t, kClsDims2> cls_bn2;
// Extracted feature
value_t feature[kFeatDims5];
if (op_mode == kModeInitWeights) {
// Initialize the PointNet feature extraction network
InitializeFeatOpt1<param_t>(
&feat_conv1, &feat_conv2, &feat_conv3, &feat_conv4, &feat_conv5,
&feat_bn1, &feat_bn2, &feat_bn3, &feat_bn4, &feat_bn5,
feat_params1, feat_params2, feat_params3, feat_params4, feat_params5);
// Initialize the classification network
InitializeClsOpt1<param_t>(
&cls_fc3, &cls_bn1, &cls_bn2,
cls_params1, cls_params2, cls_params3);
} else if (op_mode == kModeInference) {
// Run the PointNet feature extraction
InferenceFeatOpt1<value_t, param_t, 1024>(
point_cloud, num_points, feature,
&feat_conv1, &feat_conv2, &feat_conv3, &feat_conv4, &feat_conv5,
&feat_bn1, &feat_bn2, &feat_bn3, &feat_bn4, &feat_bn5);
// Run the classification
InferenceClsOpt1<value_t, param_t>(
feature, out_logits,
&cls_fc3, &cls_bn1, &cls_bn2,
cls_params1, cls_params2, cls_params3);
}
}
é¢æ°ã®å
¥åºåããŒãã«ã€ããŠã¯ãå
šãåäžã§ãã 以åã®ããŒãžã§ã³ãšæ¯èŒãããšãå±€ã®å
¥åºåããã©ã¡ãŒã¿ãä¿æãããããã¡ (`feat_conv5.weight`ã`feat_conv5.bias`ã`x3`ã`x5`ãªã©) ãåå²ããããã«ã`#pragma HLS ARRAY_PARTITION`ãè¿œå ãããŠããããšãåãããŸãã é
åã®åå²æ° (`factor`) ã¯ãäžè¿°ã®ã«ãŒã«ã«åŸã£ãŠããŸãã

äŸãã°ã`InferenceFeatOpt1`ãš`PointNetClsTop`ãã¿ããšãç¹åŸŽæœåºãããã¯ãŒã¯ã®æåŸã®å
šçµåå±€ã¯128䞊åã§å®è¡ãããã®ã§ãåºåçšã®ãããã¡`x9`ãšãå
šçµåå±€ã®2ã€ã®ãã©ã¡ãŒã¿`feat_conv5.weight`ã`feat_conv5.bias`ãã64åå²ããŠããŸã (èšè¿°ãã¹ãå Žæãæ£ãã°ã£ãŠããã®ãé£ç¹ã§ã)ã åæ§ã«ã`InferenceClsOpt1`ãš`PointNetClsTop`ãã¿ããšãåé¡ãããã¯ãŒã¯ã®æåã®å
šçµåå±€ã¯16䞊åã§å®è¡ãããã®ã§ãåºåçšã®ãããã¡`x0`ã8åå²ããŠããŸãã ãããæ£èŠåå±€ãšMaxããŒãªã³ã°å±€ã¯2䞊åãªã®ã§ãé
åãåå²ããå¿
èŠã¯ãããŸããã
å
è¿°ã®ããã«ãé
åãåå²ãããšããŒãæ°ãå¢ããŠãäžåºŠã«å€ãã®èŠçŽ ãèªã¿åºããããã«ãªããŸããã貎éãªãªã³ãããã¡ã¢ãªã®æ¶è²»ãå¢ããŸãã ãªã³ãããã¡ã¢ãªã®æ¶è²»ãæãã€ã€ããªãã¹ã䞊å床ãäžããå¿
èŠããããŸãã æšè«æéã®ççž®ã«æãå¹æãããéšå (äŸãã°ç¹åŸŽæœåºãããã¯ãŒã¯ã®æåŸã®å
šçµåå±€) ã®äžŠå床ãäžããŠãå¹æãããŸããªãéšå (äŸãã°ãããæ£èŠåå±€) ã®äžŠå床ã¯äžããŠããŸãã
ããã§ãåå±€ã®å®è¡ãµã€ã¯ã«æ°ãæ¯èŒããŠã¿ãŸã (åäœåšæ³¢æ°ã¯150MHz)ã ç¹åŸŽæœåºãããã¯ãŒã¯ã«ã€ããŠã¯æ¬¡ã®ããã«ãªããŸããã
| Layer | `InferenceFeatNaive` | `InferenceFeatOpt1` |
|---|---|---|
| Fully-connected layer 1 (`PointNetFeat::conv1`) | 577 (3.843us) | 321 (2.138us) |
| Batch normalization + ReLU (`PointNetFeat::bn1`) | 68 (0.453us) | 36 (0.240us) |
| Fully-connected layer 2 (`PointNetFeat::conv2`) | 4,481 (29.84us) | 569 (3.790us) |
| Batch normalization + ReLU (`PointNetFeat::bn2`) | 68 (0.453us) | 36 (0.240us) |
| Fully-connected layer 3 (`PointNetFeat::conv3`) | 4,481 (29.84us) | 569 (3.790us) |
| Batch normalization + ReLU (`PointNetFeat::bn3`) | 68 (0.453us) | 36 (0.240us) |
| Fully-connected layer 4 (`PointNetFeat::conv4`) | 8,961 (59.68us) | 569 (3.790us) |
| Batch normalization + ReLU (`PointNetFeat::bn4`) | 132 (0.879us) | 68 (0.453us) |
| Fully-connected layer 5 (`PointNetFeat::conv5`) | 137,217 (914.0us) | 1,081 (7.199us) |
| Batch normalization + ReLU (`PointNetFeat::bn5`) | 1,028 (6.846us) | 516 (3.437us) |
| Max pooling layer | 1,026 (6.833us) | 514 (3.423us) |
| Total (1 point) | 158,149 (1.053ms) | 4,357 (29.02us) |
| Total (1024 points) | 161,945,604 (1.079s) | 4,462,596 (29.72ms) |
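As a rough cross-check (my own back-of-the-envelope arithmetic, not taken from the synthesis report), the `conv5` numbers are consistent with the layer's dimensions: the layer multiplies a 128-dimensional input by a $1024 \times 128$ weight matrix, so

$$
\underbrace{128 \times 1024}_{\text{MACs}} = 131{,}072 \approx 137{,}217 \text{ cycles (naive, one MAC per cycle)}, \qquad \frac{131{,}072}{128} = 1{,}024 \approx 1{,}081 \text{ cycles (128-way parallel)},
$$

with the small remainder attributable to pipeline fill and loop overhead.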
For the feature extraction network, the last fully-connected layer is clearly the bottleneck. By running it with 128-way parallelism, its execution time is reduced by a factor of 126.9 (from 137,217 cycles to 1,081 cycles). The fourth fully-connected layer likewise becomes 15.75x faster with 16-way parallelism (from 8,961 cycles to 569 cycles). By exploiting the data parallelism found in the fully-connected, batch normalization, and max pooling layers, the inference time is successfully reduced. For the classification network, the results are as follows.
| Layer | `InferenceClsNaive` | `InferenceClsOpt1` |
|---|---|---|
| Fully-connected layer 1 (`PointNetCls::fc1`) | 1,056,279 (7.035ms) | 558,071 (3.717ms) |
| Batch normalization + ReLU (`PointNetCls::bn1`) | 516 (3.437us) | 260 (1.732us) |
| Fully-connected layer 2 (`PointNetCls::fc2`) | 266,007 (1.772ms) | 148,183 (987.0us) |
| Batch normalization + ReLU (`PointNetCls::bn2`) | 260 (1.732us) | 132 (0.879us) |
| Fully-connected layer 3 (`PointNetCls::fc3`) | 10,481 (69.80us) | 5,261 (35.04us) |
| Total | 1,333,605 (8.882ms) | 711,969 (4.742ms) |
The first fully-connected layer was made to run with 16-way parallelism, but its execution time only dropped by a factor of 1.89 (from 1,056,279 to 558,071 cycles). As described earlier, for the first two fully-connected layers of the classification network the parameters are not placed in on-chip buffers; instead, only the needed portion is transferred from the DRAM buffers. The multiply-accumulate operations run with 16-way parallelism, but the execution time of the data-transfer part is not reduced, hence this result. Likewise, 8-way parallelism was specified for the second fully-connected layer, but the execution time only shrank by a factor of 1.80 (from 266,007 to 148,183 cycles).
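A back-of-the-envelope estimate (my own arithmetic, assuming one 32-bit `float` transferred per cycle) shows that the transfer indeed dominates: `fc1` has

$$
1024 \times 512 + 512 = 524{,}800 \text{ parameters} \quad\Rightarrow\quad 524{,}800 + \frac{1024 \times 512}{16} = 557{,}568 \text{ cycles},
$$

within about 500 cycles of the measured 558,071. Parallelizing the arithmetic cannot shrink the 524,800-cycle transfer term.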
In the current implementation, the input/output ports are 32 bits wide, so one `float` is transferred per cycle. If we widen the ports and transfer multiple data elements per cycle, the data-transfer time can be reduced. Later, we will widen the port width from 32 to 64 bits so that two `float` values are transferred per cycle.
The IP core has two operation modes, and no changes were made to the weight initialization mode. Weight initialization runs only once before the IP core is used, so it is completely unrelated to the network's inference time.
This completes the parallelization of the inference. For details, see hls/src/top_opt1.cpp.
Each layer's computation is now parallelized, but there is still room to speed up the feature extraction network. Let's take another look at its inference routine.
// Compute the feature
for (int i = 0; i < num_points; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N
#pragma HLS LOOP_FLATTEN off
// ...
// Read a point from a DDR memory
ReadPointNaive<T>(point_cloud, i, x0);
// Compute a point feature
LinearOpt1<T, U, kFeatDims0, kFeatDims1, false, 2>(
x0, x1, conv1->weight, conv1->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims1, 2>(
x1, x2, bn1->scale, bn1->bias, bn1->mean);
LinearOpt1<T, U, kFeatDims1, kFeatDims2, false, 8>(
x2, x3, conv2->weight, conv2->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims2, 2>(
x3, x4, bn2->scale, bn2->bias, bn2->mean);
LinearOpt1<T, U, kFeatDims2, kFeatDims3, false, 8>(
x4, x5, conv3->weight, conv3->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims3, 2>(
x5, x6, bn3->scale, bn3->bias, bn3->mean);
LinearOpt1<T, U, kFeatDims3, kFeatDims4, false, 16>(
x6, x7, conv4->weight, conv4->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims4, 2>(
x7, x8, bn4->scale, bn4->bias, bn4->mean);
LinearOpt1<T, U, kFeatDims4, kFeatDims5, false, 128>(
x8, x9, conv5->weight, conv5->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims5, 2>(
x9, x10, bn5->scale, bn5->bias, bn5->mean);
// Update the output feature
MaxPool1dOpt1<T, kFeatDims5, 2>(x10, feature);
}
Looking inside the loop, we first fetch the `i`-th point from the point cloud `point_cloud` in DRAM and store it in the on-chip buffer `x0`. This `x0` is then passed along through a series of functions like a bucket brigade: the first fully-connected layer computes `x1` from `x0`, the batch normalization layer computes `x2` from `x1`, the next fully-connected layer computes `x3` from `x2`, and so on. Each layer function (e.g., `LinearOpt1(x4, x5)`) takes the output of the previous function (`x4`) as its input and hands its own output (`x5`) to the next function. All the functions are chained together in a row through their inputs and outputs. Drawing the flow of function execution as a diagram gives the following.
Just like the loop pipelining we saw earlier, the processing of multiple points can be parallelized. For example, while the last fully-connected layer is being computed for the first point, the batch normalization layer just before it can be computed for the second point; in this way the processing of multiple points overlaps in time. Previously, we pipelined the processing inside a loop so that multiple loop iterations run in parallel, and each pipeline stage was mainly a multiplication or an addition. Here, each stage corresponds to a whole function (a task), so this is a coarse-grained form of pipelining. In Vitis HLS, this kind of task-level pipelining is called dataflow optimization (Optimization 6: dataflow optimization). There are various conditions for dataflow optimization to be applicable, but they are satisfied in our case.
As mentioned before, making the execution cycle counts of the pipeline stages as uniform as possible increases the benefit of pipelining. In other words, the computation time of each layer should be as equal as possible. The computation times are summarized in the tables above. Before exploiting data parallelism, the cycle counts (especially of the fully-connected layers) varied considerably: picking out just the five fully-connected layers, they were 577, 4,481, 4,481, 8,961, and 137,217 cycles. By running these layers with 2-, 8-, 8-, 16-, and 128-way parallelism respectively (see `InferenceFeatOpt1`), they are reduced to 321, 569, 569, 569, and 1,081 cycles, and the variation largely disappears. Running the last fully-connected layer with 256-way parallelism would make them even more uniform, but I did not go that far because the circuit would become too complex.
A pipeline's performance is limited by its slowest stage. In our case, the performance is determined by the last fully-connected layer (1,081 cycles). As long as the other stages stay at or below 1,081 cycles, their exact cycle counts do not affect performance. To save resources, the parallelism of the other stages was therefore lowered as much as possible without exceeding 1,081 cycles.
ç¹åŸŽæœåºãããã¯ãŒã¯ã«é¢ããŠã¯ãã®ããã«ãããŒã¿ãããŒæé©åãäºãèæ ®ããããã§ãåå±€ã®äžŠå床ãæå®ããŸããã åé¡ãããã¯ãŒã¯ã®äžŠå床ã¯ãäœãšãªã決ããŠããŸãã
The implementation with dataflow optimization applied is shown next. `InferenceFeatOpt1` is renamed to `InferenceFeatOpt2`.
// Parallel implementation of the PointNet feature extraction
// `T` is the type for layer input, output, and intermediate results
// `U` is the type for parameters
// `N` is the expected number of input points (e.g., 1024)
template <typename T, typename U, int N>
void InferenceFeatOpt2(...)
{
#pragma HLS INLINE off
// Zero-initialize the output feature
VectorNdSetZero<T, kFeatDims5>(feature);
// Compute the feature
for (int i = 0; i < num_points; ++i) {
#pragma HLS LOOP_TRIPCOUNT min=N max=N avg=N
#pragma HLS LOOP_FLATTEN off
#pragma HLS DATAFLOW
#pragma HLS STABLE variable=point_cloud
#pragma HLS STABLE variable=num_points
#pragma HLS STABLE variable=feature
#pragma HLS STABLE variable=conv1
#pragma HLS STABLE variable=conv2
#pragma HLS STABLE variable=conv3
#pragma HLS STABLE variable=conv4
#pragma HLS STABLE variable=conv5
#pragma HLS STABLE variable=bn1
#pragma HLS STABLE variable=bn2
#pragma HLS STABLE variable=bn3
#pragma HLS STABLE variable=bn4
#pragma HLS STABLE variable=bn5
// Input, output, and intermediate results
// ...
// Read a point from a DDR memory
ReadPointNaive<T>(point_cloud, i, x0);
// Compute a point feature
LinearOpt1<T, U, kFeatDims0, kFeatDims1, false, 2>(
x0, x1, conv1->weight, conv1->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims1, 2>(
x1, x2, bn1->scale, bn1->bias, bn1->mean);
LinearOpt1<T, U, kFeatDims1, kFeatDims2, false, 8>(
x2, x3, conv2->weight, conv2->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims2, 2>(
x3, x4, bn2->scale, bn2->bias, bn2->mean);
LinearOpt1<T, U, kFeatDims2, kFeatDims3, false, 8>(
x4, x5, conv3->weight, conv3->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims3, 2>(
x5, x6, bn3->scale, bn3->bias, bn3->mean);
LinearOpt1<T, U, kFeatDims3, kFeatDims4, false, 16>(
x6, x7, conv4->weight, conv4->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims4, 2>(
x7, x8, bn4->scale, bn4->bias, bn4->mean);
LinearOpt1<T, U, kFeatDims4, kFeatDims5, false, 128>(
x8, x9, conv5->weight, conv5->bias);
BatchNorm1dReLUOpt1<T, U, kFeatDims5, 2>(
x9, x10, bn5->scale, bn5->bias, bn5->mean);
// Update the output feature
MaxPool1dOpt1<T, kFeatDims5, 2>(x10, feature);
}
}
The only difference from `InferenceFeatOpt1` is the HLS pragmas. `#pragma HLS DATAFLOW` at the top of the loop instructs the tool to apply dataflow optimization to the loop body. The `#pragma HLS STABLE` directives indicate that no synchronization is needed for those variables at the start of each loop iteration; they are attached to variables that do not change while the loop is running, such as the layer parameters and the point cloud. Without them, the dataflow optimization does not work well. Simply inserting these two kinds of HLS pragmas is all it takes to apply dataflow optimization. High-level synthesis tools are remarkable.
The top function `PointNetClsTop` and the classification network inference (`InferenceClsOpt1`) are exactly the same as before, so they are omitted here.
Let's look at the effect of dataflow optimization. With `InferenceFeatOpt1`, the forward pass for a single point took 4,357 cycles (29.02us); with `InferenceFeatOpt2` it takes 4,344 cycles (28.93us), almost unchanged. On the other hand, looking at the processing time for 1,024 points, `InferenceFeatOpt1` took 4,462,596 cycles (29.72ms), whereas `InferenceFeatOpt2` reduces this to 1,112,259 cycles (7.408ms). Pipelining does not change the computation time for each individual input (the latency), but it improves the number of inputs processed per unit time (the throughput), and the overall performance rises accordingly.
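These numbers fit the usual pipeline model well (again, my own back-of-the-envelope check): with $N$ points, a per-point latency $L$, and an initiation interval bounded by the slowest stage $II_{\max}$,

$$
\text{total} \approx L + (N - 1) \cdot II_{\max} = 4{,}344 + 1{,}023 \times 1{,}081 = 1{,}110{,}207,
$$

which is within 0.2% of the reported 1,112,259 cycles.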
That's all for the dataflow optimization. For details, see hls/src/top_opt2.cpp.
In the fully-connected layers of the classification network, the multiply-accumulate operations were parallelized, yet the overall processing time barely shrank. This is because the cycle count of the parameter transfer from DRAM to the on-chip buffers did not change at all. As the final optimization, I therefore modified the implementation to widen the input/output ports from 32 to 64 bits so that two `float` values can be transferred per cycle (Optimization 7: data transfer).
First, we modify the IP core's top function `PointNetClsTop`. Before the modification it looked like this:
void PointNetClsTop(const int op_mode,
const float* point_cloud,
const int num_points,
float* out_logits,
const float* feat_params1,
const float* feat_params2,
const float* feat_params3,
const float* feat_params4,
const float* feat_params5,
const float* cls_params1,
const float* cls_params2,
const float* cls_params3)
{
// ...
}
We widen it to 64 bits as follows.
void PointNetClsTop(const int op_mode,
const ap_uint<64>* point_cloud,
const int num_points,
ap_uint<64>* out_logits,
const ap_uint<64>* feat_params1,
const ap_uint<64>* feat_params2,
const ap_uint<64>* feat_params3,
const ap_uint<64>* feat_params4,
const ap_uint<64>* feat_params5,
const ap_uint<64>* cls_params1,
const ap_uint<64>* cls_params2,
const ap_uint<64>* cls_params3)
{
// ...
}
`ap_uint` is an arbitrary-precision unsigned integer type provided by Vitis HLS; here we use 64 bits. Since two data elements must now be read per cycle, all the data-transfer code needs to be revised. The weight initialization functions `InitializeFeatOpt1` and `InitializeClsOpt1`, which fetch parameters from DRAM and store them in on-chip buffers, are rewritten as follows and renamed `InitializeFeatOpt3` and `InitializeClsOpt3`. The only change is that the function arguments go from `float*` to `ap_uint<64>*`.
// Parallel implementation of the parameter initialization
// `T` is the type for parameters
template <typename T>
void InitializeFeatOpt3(LinearParams<T, kFeatDims0, kFeatDims1>* conv1,
LinearParams<T, kFeatDims1, kFeatDims2>* conv2,
LinearParams<T, kFeatDims2, kFeatDims3>* conv3,
LinearParams<T, kFeatDims3, kFeatDims4>* conv4,
LinearParams<T, kFeatDims4, kFeatDims5>* conv5,
BatchNorm1dParams<T, kFeatDims1>* bn1,
BatchNorm1dParams<T, kFeatDims2>* bn2,
BatchNorm1dParams<T, kFeatDims3>* bn3,
BatchNorm1dParams<T, kFeatDims4>* bn4,
BatchNorm1dParams<T, kFeatDims5>* bn5,
const ap_uint<64>* params1,
const ap_uint<64>* params2,
const ap_uint<64>* params3,
const ap_uint<64>* params4,
const ap_uint<64>* params5)
{
#pragma HLS INLINE off
ReadBlockParamsOpt2<T, kFeatDims0, kFeatDims1>(conv1, bn1, params1);
ReadBlockParamsOpt1<T, kFeatDims1, kFeatDims2>(conv2, bn2, params2);
ReadBlockParamsOpt1<T, kFeatDims2, kFeatDims3>(conv3, bn3, params3);
ReadBlockParamsOpt1<T, kFeatDims3, kFeatDims4>(conv4, bn4, params4);
ReadBlockParamsOpt1<T, kFeatDims4, kFeatDims5>(conv5, bn5, params5);
}
// Parallel implementation of the parameter initialization
// `T` is the type for parameters
template <typename T>
void InitializeClsOpt3(LinearParams<T, kClsDims2, kClsDims3>* fc3,
BatchNorm1dParams<T, kClsDims1>* bn1,
BatchNorm1dParams<T, kClsDims2>* bn2,
const ap_uint<64>* params1,
const ap_uint<64>* params2,
const ap_uint<64>* params3)
{
#pragma HLS INLINE off
ReadBatchNorm1dParamsOpt1<T, kClsDims1>(
bn1, params1, kClsDims0 * kClsDims1 + kClsDims1);
ReadBatchNorm1dParamsOpt1<T, kClsDims2>(
bn2, params2, kClsDims1 * kClsDims2 + kClsDims2);
ReadLinearParamsOpt1<T, kClsDims2, kClsDims3>(
fc3, params3, 0);
}
The first implementation used `ReadLinearParamsNaive`, `ReadBatchNorm1dParamsNaive`, and `ReadBlockParamsNaive`; here we instead use four new functions: `ReadLinearParamsOpt1`, `ReadBatchNorm1dParamsOpt1`, `ReadBlockParamsOpt1`, and `ReadBlockParamsOpt2`. Let's look at them in detail.
// Parallel implementation of the parameter initialization
// Read the parameters for a linear layer from a DDR memory and
// store them to BRAM buffers
// `T` is the type for parameters
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
template <typename T, int InDims, int OutDims>
void ReadLinearParamsOpt1(LinearParams<T, InDims, OutDims>* linear,
const ap_uint<64>* params,
const int offset)
{
#pragma HLS INLINE
// `params` contains weight parameters of size (`OutDims`, `InDims`) and
// bias parameters of size (`OutDims`) in a contiguous buffer
static_assert(InDims % 2 == 0, "`InDims` must be a multiple of 2");
static_assert(OutDims % 2 == 0, "`OutDims` must be a multiple of 2");
assert(offset % 2 == 0);
ReadTensor2dOpt1<T, OutDims, InDims>(linear->weight, params, offset);
ReadTensor1dOpt1<T, OutDims>(linear->bias, params,
offset + InDims * OutDims);
}
// Parallel implementation of the parameter initialization
// Read the parameters for a 1D batch normalization layer from a DDR memory and
// store them to BRAM buffers
// `T` is the type for parameters
// `Dims` is the number of input and output dimensions
template <typename T, int Dims>
void ReadBatchNorm1dParamsOpt1(BatchNorm1dParams<T, Dims>* bn,
const ap_uint<64>* params,
const int offset)
{
#pragma HLS INLINE
// `params` contains scale parameters of size (`Dims`),
// bias of size (`Dims`), and mean of size (`Dims`) in a contiguous buffer
static_assert(Dims % 2 == 0, "`Dims` must be a multiple of 2");
assert(offset % 2 == 0);
ReadTensor1dOpt1<T, Dims>(bn->scale, params, offset);
ReadTensor1dOpt1<T, Dims>(bn->bias, params, offset + Dims);
ReadTensor1dOpt1<T, Dims>(bn->mean, params, offset + Dims * 2);
}
// Parallel implementation of the parameter initialization
// Read the parameters for a linear and 1D batch normalization layer
// from a DDR memory and store them to BRAM buffers
// `T` is the type for parameters
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
template <typename T, int InDims, int OutDims>
void ReadBlockParamsOpt1(LinearParams<T, InDims, OutDims>* linear,
BatchNorm1dParams<T, OutDims>* bn,
const ap_uint<64>* params)
{
#pragma HLS INLINE
static_assert(InDims % 2 == 0, "`InDims` must be a multiple of 2");
static_assert(OutDims % 2 == 0, "`OutDims` must be a multiple of 2");
ReadTensor2dOpt1<T, OutDims, InDims>(linear->weight, params, 0);
ReadTensor1dOpt1<T, OutDims>(linear->bias, params, InDims * OutDims);
ReadTensor1dOpt1<T, OutDims>(bn->scale, params,
InDims * OutDims + OutDims);
ReadTensor1dOpt1<T, OutDims>(bn->bias, params,
InDims * OutDims + OutDims * 2);
ReadTensor1dOpt1<T, OutDims>(bn->mean, params,
InDims * OutDims + OutDims * 3);
}
// Parallel implementation of the parameter initialization
// Read the parameters for a linear and 1D batch normalization layer
// from a DDR memory and store them to BRAM buffers
// `T` is the type for parameters
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
template <typename T, int InDims, int OutDims>
void ReadBlockParamsOpt2(LinearParams<T, InDims, OutDims>* linear,
BatchNorm1dParams<T, OutDims>* bn,
const ap_uint<64>* params)
{
#pragma HLS INLINE
static_assert(InDims == 3, "`InDims` must be 3");
static_assert(OutDims % 2 == 0, "`OutDims` must be a multiple of 2");
ReadTensor2dOpt2<T, OutDims, InDims>(linear->weight, params, 0);
ReadTensor1dOpt1<T, OutDims>(linear->bias, params, InDims * OutDims);
ReadTensor1dOpt1<T, OutDims>(bn->scale, params,
InDims * OutDims + OutDims);
ReadTensor1dOpt1<T, OutDims>(bn->bias, params,
InDims * OutDims + OutDims * 2);
ReadTensor1dOpt1<T, OutDims>(bn->mean, params,
InDims * OutDims + OutDims * 3);
}
They are basically the same as the original naive implementations, except that the argument type changes from `float*` to `ap_uint<64>*`. The function bodies are simple: they just repeatedly read parameters of the specified size from the specified offset. For example, when reading the parameters of a batch normalization layer, they are read in the order scale, bias, mean, so the data must be laid out in the DRAM buffer in exactly that order in advance. The functions `ReadTensor1dOpt1`, `ReadTensor2dOpt1`, and `ReadTensor2dOpt2` used above are as follows.
union conv32_t
{
std::uint32_t u32;
int i32;
float f;
};
// Interpret float as std::uint32_t
inline std::uint32_t FloatToU32(const float f)
{
conv32_t conv;
conv.f = f;
return conv.u32;
}
// Interpret std::uint32_t as float
inline float U32ToFloat(const std::uint32_t u32)
{
conv32_t conv;
conv.u32 = u32;
return conv.f;
}
// Read a 1D tensor from a DDR memory
template <typename T, int D0>
void ReadTensor1dNaive(T tensor[D0],
const float* src,
const int offset)
{
#pragma HLS INLINE off
for (int i = 0; i < D0; ++i) {
#pragma HLS PIPELINE II=1
tensor[i] = T(src[offset + i]);
}
}
// Read a 1D tensor from a DDR memory
template <typename T, int D0>
void ReadTensor1dOpt1(T tensor[D0],
const ap_uint<64>* src,
const int offset)
{
#pragma HLS INLINE off
static_assert(D0 % 2 == 0, "`D0` must be a multiple of 2");
assert(offset % 2 == 0);
constexpr const int D0Over2 = D0 / 2;
const int offset2 = offset / 2;
for (int i = 0; i < D0Over2; ++i) {
#pragma HLS PIPELINE II=1
const ap_uint<64> tensor_data = src[offset2 + i];
tensor[i * 2 + 0] = T(U32ToFloat(tensor_data.range(31, 0)));
tensor[i * 2 + 1] = T(U32ToFloat(tensor_data.range(63, 32)));
}
}
// Read a 2D tensor from a DDR memory
template <typename T, int D0, int D1>
void ReadTensor2dNaive(T tensor[D0][D1],
const float* src,
const int offset)
{
#pragma HLS INLINE off
for (int i = 0; i < D0; ++i) {
for (int j = 0; j < D1; ++j) {
#pragma HLS PIPELINE II=1
const int idx = i * D1 + j;
tensor[i][j] = T(src[offset + idx]);
}
}
}
// Read a 2D tensor from a DDR memory
template <typename T, int D0, int D1>
void ReadTensor2dOpt1(T tensor[D0][D1],
const ap_uint<64>* src,
const int offset)
{
#pragma HLS INLINE off
static_assert(D1 % 2 == 0, "`D1` must be a multiple of 2");
assert(offset % 2 == 0);
constexpr const int D1Over2 = D1 / 2;
const int offset2 = offset / 2;
for (int i = 0; i < D0; ++i) {
for (int j = 0; j < D1Over2; ++j) {
#pragma HLS PIPELINE II=1
const int idx = i * D1Over2 + j;
const ap_uint<64> tensor_data = src[offset2 + idx];
tensor[i][j * 2 + 0] = T(U32ToFloat(tensor_data.range(31, 0)));
tensor[i][j * 2 + 1] = T(U32ToFloat(tensor_data.range(63, 32)));
}
}
}
// Read a 2D tensor of size (`D0`, 3) from a DDR memory
template <typename T, int D0, int D1>
void ReadTensor2dOpt2(T tensor[D0][D1],
const ap_uint<64>* src,
const int offset)
{
#pragma HLS INLINE off
static_assert(D0 % 2 == 0, "`D0` must be a multiple of 2");
static_assert(D1 == 3, "`D1` must be 3");
assert(offset % 2 == 0);
constexpr const int Iter = D0 * D1 / (2 * 3);
const int offset2 = offset / 2;
for (int i = 0; i < Iter; ++i) {
#pragma HLS PIPELINE
const int src_idx = i * 3;
const int dst_idx = i * 2;
const ap_uint<64> tensor_data0 = src[offset2 + src_idx + 0];
const ap_uint<64> tensor_data1 = src[offset2 + src_idx + 1];
const ap_uint<64> tensor_data2 = src[offset2 + src_idx + 2];
tensor[dst_idx + 0][0] = T(U32ToFloat(tensor_data0.range(31, 0)));
tensor[dst_idx + 0][1] = T(U32ToFloat(tensor_data0.range(63, 32)));
tensor[dst_idx + 0][2] = T(U32ToFloat(tensor_data1.range(31, 0)));
tensor[dst_idx + 1][0] = T(U32ToFloat(tensor_data1.range(63, 32)));
tensor[dst_idx + 1][1] = T(U32ToFloat(tensor_data2.range(31, 0)));
tensor[dst_idx + 1][2] = T(U32ToFloat(tensor_data2.range(63, 32)));
}
}
For comparison, the original naive implementations, which read the data one element at a time, are also shown. The behavior of each function is summarized below.
- `ReadTensor1dOpt1<T, D0>(tensor, src, offset)`: Reads `D0` `float` values, two at a time, starting from the position `offset` `float` elements into the given DRAM buffer `src` (i.e., from the address `src` plus `4 * offset` bytes). The values read are cast from `float` to type `T` and stored, two at a time, into the given 1D on-chip buffer `tensor` (of size `(D0)`). Since two values are read per cycle, the size `D0` is assumed to be even.
- `ReadTensor2dOpt1<T, D0, D1>(tensor, src, offset)`: Reads data two values at a time from the given DRAM buffer `src` and stores them into the 2D on-chip buffer `tensor` (of size `(D0, D1)`). Since two values are read per cycle, the size `D1` is assumed to be even.
- `ReadTensor2dOpt2<T, D0, D1>(tensor, src, offset)`: A dedicated implementation for the case where `D1` is 3. It spends three cycles reading six values from the given DRAM buffer `src` and then stores them into the on-chip buffer `tensor`. To simplify the implementation, `D1` is assumed to be 3 and `D0` to be even (so the total number of elements is even).
`ReadTensor2dOpt2` and `ReadBlockParamsOpt2` are used to transfer the weights of the first fully-connected layer of the feature extraction network (see `InitializeFeatOpt3`). That layer converts 3D point coordinates into 64-dimensional features, so its weight has size `(64, 3)`. Even though the data is read two values at a time, the second dimension is odd, which is awkward for the implementation, hence the dedicated function. `ReadTensor2dOpt2` copes with this by reading the weights six at a time. An alternative would be to widen the weight buffer from `(64, 3)` to `(64, 4)` (leaving the fourth dimension simply unused).
The only difference between `ReadBlockParamsOpt1` and `ReadBlockParamsOpt2` is which of `ReadTensor2dOpt1` and `ReadTensor2dOpt2` they use. The two functions could be merged into one using the `if constexpr` statement introduced in C++17, but since this project sticks to C++14 features, they are kept separate.
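For reference, a minimal sketch of that merged version (my own illustration under the C++17 assumption, not code from the repository): `if constexpr` selects the weight reader at compile time, and the discarded branch is never instantiated, so the `static_assert` inside the unused reader does not fire.

// A minimal sketch (not in the repository): with C++17, the two variants
// could be merged into one function using `if constexpr`.
template <typename T, int InDims, int OutDims>
void ReadBlockParams(LinearParams<T, InDims, OutDims>* linear,
                     BatchNorm1dParams<T, OutDims>* bn,
                     const ap_uint<64>* params)
{
#pragma HLS INLINE
  // Choose the weight reader at compile time
  if constexpr (InDims == 3)
    ReadTensor2dOpt2<T, OutDims, InDims>(linear->weight, params, 0);
  else
    ReadTensor2dOpt1<T, OutDims, InDims>(linear->weight, params, 0);
  ReadTensor1dOpt1<T, OutDims>(linear->bias, params, InDims * OutDims);
  ReadTensor1dOpt1<T, OutDims>(bn->scale, params,
                               InDims * OutDims + OutDims);
  ReadTensor1dOpt1<T, OutDims>(bn->bias, params,
                               InDims * OutDims + OutDims * 2);
  ReadTensor1dOpt1<T, OutDims>(bn->mean, params,
                               InDims * OutDims + OutDims * 3);
}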
The `ap_uint` type provides a handy `range()` method that extracts an arbitrary bit slice: `range(31, 0)` extracts the lower 32 bits and `range(63, 32)` the upper 32 bits. `U32ToFloat()` and `FloatToU32()` reinterpret a value as another type while preserving its bit representation (`float` and 32-bit unsigned integer). `tensor_data.range(31, 0)` is a 32-bit unsigned integer type (`unsigned int` or `ap_uint<32>`), but it actually holds `float` data, so `U32ToFloat()` is used to reinterpret it as a `float`. The two functions are implemented with a union. With C++20, `std::bit_cast` would achieve the same.
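A minimal sketch of that C++20 alternative (my own illustration, assuming a C++20 toolchain): `std::bit_cast` performs the same bit-preserving reinterpretation as the union-based helpers above, and is well-defined where type punning through a union is murky.

// Equivalent of FloatToU32 / U32ToFloat using C++20 std::bit_cast
#include <bit>
#include <cstdint>

inline std::uint32_t FloatToU32Cxx20(const float f)
{
  return std::bit_cast<std::uint32_t>(f);
}

inline float U32ToFloatCxx20(const std::uint32_t u32)
{
  return std::bit_cast<float>(u32);
}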
Now let's turn to the feature extraction inference (see `InferenceFeatOpt2`). `ReadPointNaive`, which reads the `i`-th point from the DRAM buffer, is rewritten for the 64-bit port width. The modified version is named `ReadPointOpt1`.
// Read a point from a DDR memory
template <typename T>
void ReadPointNaive(const float* point_cloud,
const int idx,
T x[3])
{
#pragma HLS INLINE off
for (int i = 0; i < 3; ++i) {
#pragma HLS PIPELINE II=1
x[i] = T(point_cloud[idx * 3 + i]);
}
}
// Read a point from a DDR memory
template <typename T>
void ReadPointOpt1(const ap_uint<64>* point_cloud,
const int idx,
T x[3])
{
#pragma HLS INLINE off
const ap_uint<64> point_data0 = point_cloud[idx * 2 + 0];
const ap_uint<64> point_data1 = point_cloud[idx * 2 + 1];
x[0] = T(U32ToFloat(point_data0.range(31, 0)));
x[1] = T(U32ToFloat(point_data0.range(63, 32)));
x[2] = T(U32ToFloat(point_data1.range(31, 0)));
}
`ReadPointNaive` assumed that the DRAM buffer `point_cloud` has size $(N, 3)$. `ReadPointOpt1`, on the other hand, assumes a buffer of size $(N, 4)$ to simplify the implementation (the fourth dimension is unused). To read the `i`-th point, it simply accesses elements `idx * 2 + 0` and `idx * 2 + 1` of the buffer.
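Restating the address arithmetic in the code above: with the $(N, 4)$ layout, point $i$ starts at `float` index $4i$, i.e., 64-bit word index $2i$, so

$$
x_i = \text{word}[2i]_{[31:0]}, \qquad y_i = \text{word}[2i]_{[63:32]}, \qquad z_i = \text{word}[2i+1]_{[31:0]},
$$

and the fourth `float` (the upper half of word $2i+1$) is simply ignored.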
Finally, we revise the classification network inference (see `InferenceClsOpt1`). It computes the logits for each object class from the point cloud feature and writes them to a DRAM buffer with `WriteTensor1dNaive`. `WriteTensor1dNaive` is rewritten for the 64-bit port width; the modified version is named `WriteTensor1dOpt1`.
// Write a 1D tensor to a DDR memory
template <typename T, int D0>
void WriteTensor1dNaive(float* dst,
const T tensor[D0],
const int offset)
{
#pragma HLS INLINE off
for (int i = 0; i < D0; ++i) {
#pragma HLS PIPELINE II=1
dst[offset + i] = static_cast<float>(tensor[i]);
}
}
// Write a 1D tensor to a DDR memory
template <typename T, int D0>
void WriteTensor1dOpt1(ap_uint<64>* dst,
const T tensor[D0],
const int offset)
{
#pragma HLS INLINE off
static_assert(D0 % 2 == 0, "`D0` must be a multiple of 2");
assert(offset % 2 == 0);
constexpr const int D0Over2 = D0 / 2;
const int offset2 = offset / 2;
for (int i = 0; i < D0Over2; ++i) {
#pragma HLS PIPELINE II=1
ap_uint<64> tensor_data;
tensor_data.range(31, 0) = FloatToU32(
static_cast<float>(tensor[i * 2 + 0]));
tensor_data.range(63, 32) = FloatToU32(
static_cast<float>(tensor[i * 2 + 1]));
dst[offset2 + i] = tensor_data;
}
}
The data of size `(D0)` held in the on-chip buffer `tensor` is written back to DRAM two elements per cycle. To simplify the implementation, `D0` is assumed to be even. The two values are of type `T`; they are first converted back to `float` so that the software side can use them, then reinterpreted, bit pattern intact, as 32-bit unsigned integers using `FloatToU32`. These two values are packed into the lower and upper 32 bits of an `ap_uint<64>` and written back to the DRAM buffer.
The first two fully-connected layers (`LinearOpt1DDR`) are revised into a new `LinearOpt2DDR`, changing how the weights and biases are transferred. Since the transfer now takes roughly half as many cycles, we can expect a reduction in the classification network's inference time. To simplify the implementation, both the input and output dimensions are assumed to be even.
// Parallel implementation of the fully-connected layer
// Weight and bias parameters are stored on the DDR memory
// Matrix-vector multiplication is parallelized along the output dimension
// `T` is the type for values
// `TParam` is the type for weight and bias
// `InDims` is the number of input dimensions
// `OutDims` is the number of output dimensions
// `ApplyReLU` is the flag to apply ReLU activation
// `B` is the block size for the output dimension
template <typename T, typename TParam,
int InDims, int OutDims, bool ApplyReLU, int B>
void LinearOpt2DDR(const T x[InDims],
T y[OutDims],
const ap_uint<64>* params,
const int offset)
{
// `x` is of size (1, `InDims`)
// `y` is of size (1, `OutDims`)
// `params` contains weight parameters of size (`OutDims`, `InDims`) and
// bias parameters of size (`OutDims`) in a contiguous buffer
#pragma HLS INLINE off
// `OutDims` must be a multiple of `B`
static_assert(OutDims % B == 0, "`OutDims` must be a multiple of `B`");
// `B` must be larger than 1
static_assert(B > 1, "`B` must be larger than 1");
// `InDims` must be a multiple of 2
static_assert(InDims % 2 == 0, "`InDims` must be a multiple of 2");
// `OutDims` must be a multiple of 2
static_assert(OutDims % 2 == 0, "`OutDims` must be a multiple of 2");
// `offset` must be a multiple of 2
assert(offset % 2 == 0);
constexpr const int BHalf = B / 2;
constexpr const int OffsetToBias = OutDims * InDims / 2;
constexpr const int InDims2 = InDims / 2;
constexpr const int OutDims2 = OutDims / 2;
const int offset2 = offset / 2;
TParam bias[OutDims];
#pragma HLS ARRAY_PARTITION variable=bias type=cyclic factor=BHalf dim=1
// Copy the bias parameters in advance
for (int i = 0; i < OutDims2; ++i) {
#pragma HLS PIPELINE II=1
const ap_uint<64> bias_data = params[offset2 + OffsetToBias + i];
bias[i * 2 + 0] = TParam(U32ToFloat(bias_data.range(31, 0)));
bias[i * 2 + 1] = TParam(U32ToFloat(bias_data.range(63, 32)));
}
for (int i0 = 0; i0 < OutDims; i0 += B) {
#pragma HLS PIPELINE off
T vals[B];
#pragma HLS ARRAY_PARTITION variable=vals type=complete dim=1
TParam weight[B][InDims];
#pragma HLS ARRAY_PARTITION variable=weight type=cyclic factor=BHalf dim=1
// Copy the weight parameters for `B` outputs
const int offset0 = offset2 + i0 * InDims2;
for (int i1 = 0; i1 < B; ++i1) {
for (int j = 0; j < InDims2; ++j) {
#pragma HLS PIPELINE
const ap_uint<64> weight_data = params[offset0 + i1 * InDims2 + j];
weight[i1][j * 2 + 0] = TParam(
U32ToFloat(weight_data.range(31, 0)));
weight[i1][j * 2 + 1] = TParam(
U32ToFloat(weight_data.range(63, 32)));
}
}
for (int j = 0; j < InDims; ++j) {
#pragma HLS PIPELINE
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (i < OutDims) {
T last = (j == 0) ? T(bias[i]) : vals[i1];
vals[i1] = last + x[j] * weight[i1][j];
}
}
}
for (int i1 = 0; i1 < B; ++i1) {
#pragma HLS UNROLL
int i = i0 + i1;
if (i < OutDims) {
if (ApplyReLU)
y[i] = vals[i1] > T(0) ? vals[i1] : T(0);
else
y[i] = vals[i1];
}
}
}
}
For both networks, the parts related to data input/output were revised. The versions of `InferenceFeatOpt2` and `InferenceClsOpt1` with these modifications are named `InferenceFeatOpt3` and `InferenceClsOpt3`. In `InferenceFeatOpt3`, the point cloud is read with `ReadPointOpt1` instead of `ReadPointNaive` (everything else is unchanged). In `InferenceClsOpt3`, the logits are written with `WriteTensor1dOpt1` instead of `WriteTensor1dNaive`, and the first two fully-connected layers use `LinearOpt2DDR` instead of `LinearOpt1DDR`.
template <typename T, typename U, int N>
void InferenceFeatOpt3(...)
{
#pragma HLS INLINE off
// Zero-initialize the output feature
VectorNdSetZero<T, kFeatDims5>(feature);
// Compute the feature
for (int i = 0; i < num_points; ++i) {
// ...
// Read a point from a DDR memory
ReadPointOpt1<T>(point_cloud, i, x0);
// Compute a point feature
// ...
// Update the output feature
MaxPool1dOpt1<T, kFeatDims5, 2>(x10, feature);
}
}
template <typename T, typename U>
void InferenceClsOpt3(...)
{
#pragma HLS INLINE off
// ...
// Compute logits
LinearOpt2DDR<T, U, kClsDims0, kClsDims1, false, 16>(
feature, x0, params1, 0);
BatchNorm1dReLUOpt1<T, U, kClsDims1, 2>(
x0, x1, bn1->scale, bn1->bias, bn1->mean);
LinearOpt2DDR<T, U, kClsDims1, kClsDims2, false, 8>(
x1, x2, params2, 0);
BatchNorm1dReLUOpt1<T, U, kClsDims2, 2>(
x2, x3, bn2->scale, bn2->bias, bn2->mean);
LinearOpt1<T, U, kClsDims2, kClsDims3, false, 2>(
x3, x4, fc3->weight, fc3->bias);
// Write the result
WriteTensor1dOpt1<T, kClsDims3>(out_logits, x4, 0);
}
How much does the wider port reduce the execution time? The feature extraction network `InferenceFeatOpt2` took 1,112,259 cycles (7.408ms), while the new `InferenceFeatOpt3` takes 1,112,254 cycles (7.408ms): essentially the same. For the classification network, `InferenceClsOpt1` with the 32-bit ports took 711,969 cycles (4.742ms), whereas `InferenceClsOpt3` with the 64-bit ports takes 383,885 cycles (2.557ms). Doubling the port width thus cut the classification network's inference time by a factor of 1.85.
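This, too, matches a rough estimate (my own arithmetic): the parameters of the first two fully-connected layers comprise

$$
(1024 \times 512 + 512) + (512 \times 256 + 256) = 656{,}128 \text{ elements},
$$

so halving the transfer cycles should save about $656{,}128 / 2 = 328{,}064$ cycles, and $711{,}969 - 328{,}064 = 383{,}905$, within 20 cycles of the measured 383,885.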
How do the execution cycle counts compare between the initial naive implementation (`InferenceFeatNaive` + `InferenceClsNaive`) and the implementation shown here (`InferenceFeatOpt3` + `InferenceClsOpt3`)? The first result below is for the naive implementation, the second for the optimized one. The naive implementation needed 163,279,213 cycles (1.087s) for inference, while the optimizations bring this down to 1,496,143 cycles (9.964ms): a 109x difference.
With that, the high-level synthesis implementation is complete. See hls/src/top_opt3.cpp.
Now that the HLS implementation is done, we compile it with Vitis HLS and create the IP core. I worked in the following environment (I doubt anyone will try to reproduce this, but for the record):
- Ubuntu 20.04.5 LTS
- Intel(R) Xeon(R) E-2186G CPU @ 3.80GHz
- 64GB DRAM
- Vivado ML Edition 2022.1 (installed under `/tools/Xilinx`)
- CMake 3.16.3
The target FPGA board is the Xilinx ZCU104 Evaluation Board (XCZU7EV-2FFVC1156). In the GitHub repository prepared for this article, the IP cores can be created automatically simply by running `make` as shown below; this is implemented by combining Tcl scripts and CMake. As the screenshot above shows, Vitis HLS comes with a GUI, but Tcl scripts allow batch processing on the command line.
Clone the repository somewhere convenient, move into the hls directory, and prepare the working directories. Then configure the CMake project and build the desired IP cores with `make`.
# Source the settings so Vivado and Vitis HLS can be used
> source /tools/Xilinx/Vivado/2022.1/settings64.sh
# Clone the GitHub repository
> git clone git@github.com:sterngerlach/advent_2022_point_cloud_classification.git
> cd advent_2022_point_cloud_classification
# Prepare the working directories
> cd hls
> mkdir build
> mkdir work
> cd build
# Configure the CMake project
# settings64.sh overrides the CMake binary, so use the system CMake
> /usr/bin/cmake ..
# Create the IP core for the naive implementation
# (created inside the work directory)
> make pointnet_naive_150_csynth_export
# Create the IP core that exploits data parallelism
# (with loop unrolling and array partitioning)
> make pointnet_opt1_csynth_export
# Create the IP core with dataflow optimization
> make pointnet_opt2_csynth_export
# Create the IP core with the I/O port width widened to 64 bits
> make pointnet_opt3_csynth_export
Once the IP cores are created, let's launch the GUI and inspect the synthesis results (a window like the screenshot above will open).
> cd hls/work
# Open the Vitis HLS project for the naive implementation in the GUI
> vitis_hls -p pointnet_naive_150
# Likewise for the others
> vitis_hls -p pointnet_opt1
> vitis_hls -p pointnet_opt2
> vitis_hls -p pointnet_opt3
This is the last time we use Vitis HLS; from here on we switch to working in Vivado. Next, we combine this IP core with other IP cores to prepare a board design. The creation of the board design itself is omitted in this article.
First, move into the vivado directory and prepare the working directories. Then configure the CMake project and build the desired board designs with `make`.
# Prepare the working directories
> cd vivado
> mkdir build
> mkdir work
> mkdir bitstream
> cd build
# Configure the CMake project
# settings64.sh overrides the CMake binary, so use the system CMake
# This fails with an error unless the Vitis HLS synthesis has finished
> /usr/bin/cmake ..
# Create the board design from the naive IP core
> make pointnet_naive_150_create
# Create the board designs from the optimized IP cores
> make pointnet_opt1_create
> make pointnet_opt2_create
> make pointnet_opt3_create
Once the board designs are created, launch the GUI and look at the block diagrams.
> cd vivado/work
> vivado -project pointnet_naive_150/pointnet_naive_150.xpr
> vivado -project pointnet_opt1/pointnet_opt1.xpr
> vivado -project pointnet_opt2/pointnet_opt2.xpr
> vivado -project pointnet_opt3/pointnet_opt3.xpr
Selecting "Open Block Design" in the Flow Navigator on the left displays the block diagram. An enlarged view of the block diagram is shown below.
We then run logic synthesis and place-and-route on the board design and create the bitstream that encodes the circuit information. This depends on the machine's specs, but in my environment, logic synthesis and place-and-route for a single board design take 30 minutes or more (using 8 cores). The GitHub repository for this article already contains the bitstreams, so this step is not required (though it's fine to try it).
> cd vivado/build
> make pointnet_naive_150_impl && make pointnet_naive_150_copy_bitstream
> make pointnet_opt1_impl && make pointnet_opt1_copy_bitstream
> make pointnet_opt2_impl && make pointnet_opt2_copy_bitstream
> make pointnet_opt3_impl && make pointnet_opt3_copy_bitstream
Let's launch the GUI once more and look at the synthesized circuit. Select "Open Implemented Design" in the Flow Navigator on the left. Personally, I find it beautiful, like a night view of Manhattan. In the GUI you can check the resource usage (Utilization), the estimated power consumption (Power), the timing (Timing), and so on.
The generated bitstreams are copied under the vivado/bitstream directory. Besides the bitstream (extension `.bit`), there is a Hardware Handoff file (extension `.hwh`), which contains the circuit's metadata. Both files are needed as a set to load a bitstream onto the FPGA board. Being able to reload a bitstream and switch to a different circuit any number of times is a major advantage of FPGAs over ASICs. Now, transferring these files to the FPGA board with `scp` or the like completes the preparation for running the circuit.
> cd vivado/bitstream
> ls
-rw-rw-r-- 1 x x 19M Dec 14 23:34 pointnet_naive_150.bit
-rw-rw-r-- 1 x x 363K Dec 14 23:34 pointnet_naive_150.hwh
-rw-rw-r-- 1 x x 19M Dec 15 00:01 pointnet_opt1.bit
-rw-rw-r-- 1 x x 363K Dec 15 00:01 pointnet_opt1.hwh
-rw-rw-r-- 1 x x 19M Dec 14 23:20 pointnet_opt2.bit
-rw-rw-r-- 1 x x 363K Dec 14 23:20 pointnet_opt2.hwh
-rw-rw-r-- 1 x x 19M Dec 15 18:07 pointnet_opt3.bit
-rw-rw-r-- 1 x x 363K Dec 15 18:07 pointnet_opt3.hwh
Now that the bitstreams are ready, let's actually run the circuits. The FPGA board used here, the Xilinx ZCU104 Evaluation Kit, is a so-called SoC (System-on-Chip): besides the FPGA, it integrates a quad-core ARM Cortex-A53 CPU (1.2GHz), 2GB of DRAM, and various peripheral circuits, and it runs Linux. Here we use Pynq Linux 2.7, which is based on Ubuntu 20.04, as the OS. Pynq Linux ships with a Python library called `pynq` that makes FPGA-related processing easy from Python. To try the steps below, libraries such as PyTorch 1.11.0, TorchVision 0.12.0, NumPy, SciPy, H5py, and Tqdm need to be installed on Pynq Linux in advance; the explanation would get long, so I omit it here. They can basically be installed with the `pip` command. Wheel files of PyTorch 1.11.0 and TorchVision 0.12.0 built for the Xilinx ZCU104 and Pynq Linux 2.7 are available in this repository.
ãã以éã¯C/C++ã§ã¯ãªããPythonã®ã³ãŒããæžããŠãããŸãã
First, the PyTorch model definition is shown again (net/model.py). It is perfectly straightforward and simple.
class PointNetFeat(torch.nn.Module):
def __init__(self):
super().__init__()
self.conv1 = torch.nn.Conv1d(3, 64, 1)
self.conv2 = torch.nn.Conv1d(64, 64, 1)
self.conv3 = torch.nn.Conv1d(64, 64, 1)
self.conv4 = torch.nn.Conv1d(64, 128, 1)
self.conv5 = torch.nn.Conv1d(128, 1024, 1)
self.bn1 = torch.nn.BatchNorm1d(64)
self.bn2 = torch.nn.BatchNorm1d(64)
self.bn3 = torch.nn.BatchNorm1d(64)
self.bn4 = torch.nn.BatchNorm1d(128)
self.bn5 = torch.nn.BatchNorm1d(1024)
def forward(self, x: torch.Tensor):
# `x` is of size [B, N, 3]
N = x.shape[1]
# `x` is of size [B, 3, N]
x = x.transpose(1, 2)
# `x` is of size [B, 1024, N]
x = F.relu(self.bn1(self.conv1(x)))
x = F.relu(self.bn2(self.conv2(x)))
x = F.relu(self.bn3(self.conv3(x)))
x = F.relu(self.bn4(self.conv4(x)))
x = F.relu(self.bn5(self.conv5(x)))
# `x` is of size [B, 1024]
x = torch.max(x, dim=2)[0]
return x
class PointNetCls(torch.nn.Module):
def __init__(self, num_classes: int):
super().__init__()
# Feature extraction
self.feat = PointNetFeat()
# Classification network
self.fc1 = torch.nn.Linear(1024, 512)
self.fc2 = torch.nn.Linear(512, 256)
self.fc3 = torch.nn.Linear(256, num_classes)
self.bn1 = torch.nn.BatchNorm1d(512)
self.bn2 = torch.nn.BatchNorm1d(256)
def forward(self, x):
# `x` is of size [B, N, 3]
# `x` is of size [B, 1024]
x = self.feat(x)
# `x` is of size [B, `num_classes`]
x = F.relu(self.bn1(self.fc1(x)))
x = F.relu(self.bn2(self.fc2(x)))
x = self.fc3(x)
return x
Next, the model accelerated on the FPGA is shown (host/model_zcu104.py). The model is named `PointNetClsZCU104`. It is designed to be used in the same way as the CPU model above (`PointNetCls`).
from net.model import PointNetCls
# Split the 64-bit address
def split_address(addr: int) -> Tuple[int, int]:
mask = (1 << 32) - 1
return addr & mask, addr >> 32
# Allocate a contiguous buffer for torch.nn.Conv1d (torch.nn.Linear)
def allocate_linear_buffer(in_dims: int, out_dims: int) \
-> pynq.buffer.PynqBuffer:
buf_size = in_dims * out_dims + out_dims
return pynq.allocate(shape=(buf_size,), dtype=np.float32, cacheable=False)
# Allocate a contiguous buffer for a block with torch.nn.Conv1d
# (torch.nn.Linear) and torch.nn.BatchNorm1d
def allocate_block_buffer(in_dims: int, out_dims: int) \
-> pynq.buffer.PynqBuffer:
buf_size = 0
buf_size += in_dims * out_dims + out_dims
buf_size += out_dims * 3
return pynq.allocate(shape=(buf_size,), dtype=np.float32, cacheable=False)
# Write the torch.nn.Conv1d parameters to the contiguous buffer
def write_conv1d_params(buf: pynq.buffer.PynqBuffer,
layer: torch.nn.Conv1d,
offset: int = 0) -> int:
if layer.kernel_size != (1,):
raise RuntimeError(f"Kernel size should be 1")
weight_size = layer.out_channels * layer.in_channels
bias_size = layer.out_channels
buf[offset:offset+weight_size] = layer.weight.data.view(-1)
offset += weight_size
buf[offset:offset+bias_size] = layer.bias.data.view(-1)
offset += bias_size
return offset
# Write the torch.nn.Linear parameters to the contiguous buffer
def write_linear_params(buf: pynq.buffer.PynqBuffer,
layer: torch.nn.Linear,
offset: int = 0) -> int:
weight_size = layer.out_features * layer.in_features
bias_size = layer.out_features
buf[offset:offset+weight_size] = layer.weight.data.view(-1)
offset += weight_size
buf[offset:offset+bias_size] = layer.bias.data.view(-1)
offset += bias_size
return offset
# Write the torch.nn.BatchNorm1d parameters to the contiguous buffer
def write_batchnorm1d_params(buf: pynq.buffer.PynqBuffer,
layer: torch.nn.BatchNorm1d,
offset: int = 0) -> int:
dims = layer.num_features
# `scale` is the multiplication of the weight and reciprocal of the
# standard deviation (to reduce the on-chip memory consumption)
std_inv = torch.sqrt(layer.running_var.data + layer.eps)
std_inv = torch.reciprocal(std_inv)
scale = std_inv * layer.weight.data
buf[offset:offset+dims] = scale.data.view(-1)
offset += dims
buf[offset:offset+dims] = layer.bias.data.view(-1)
offset += dims
buf[offset:offset+dims] = layer.running_mean.data.view(-1)
offset += dims
return offset
# Write the block (torch.nn.Conv1d and torch.nn.BatchNorm1d) parameters
# to the contiguous buffer
def write_conv_batchnorm1d_params(buf: pynq.buffer.PynqBuffer,
conv: torch.nn.Conv1d,
bn: torch.nn.BatchNorm1d):
offset = 0
offset = write_conv1d_params(buf, conv, offset)
offset = write_batchnorm1d_params(buf, bn, offset)
# Write the block (torch.nn.Linear and torch.nn.BatchNorm1d) parameters
# to the contiguous buffer
def write_linear_batchnorm1d_params(buf: pynq.buffer.PynqBuffer,
linear: torch.nn.Linear,
bn: torch.nn.BatchNorm1d):
offset = 0
offset = write_linear_params(buf, linear, offset)
offset = write_batchnorm1d_params(buf, bn, offset)
class PointNetClsZCU104(torch.nn.Module):
# Operation modes (refer to hls/src/op_modes.hpp)
MODE_INIT_WEIGHTS = 100
MODE_INFERENCE = 101
def __init__(self, model_cpu: PointNetCls,
overlay_path: str, num_points: int):
super().__init__()
# Load an overlay
self.overlay = self.load_overlay(overlay_path)
# Get the IP core module
self.net_ip: pynq.DefaultIP = self.overlay.PointNetClsTop
# Get the control registers of the IP core
self.registers = self.net_ip.register_map
# Check the data width of the AXI master interface
net_ip_params = self.overlay.ip_dict["PointNetClsTop"]["parameters"]
self.axi_m_addr_width = int(net_ip_params["C_M_AXI_GMEM0_ADDR_WIDTH"])
self.axi_m_data_width = int(net_ip_params["C_M_AXI_GMEM0_DATA_WIDTH"])
# Allocate buffers for PointNet feature extraction network
self.buf_feat_params1 = allocate_block_buffer(3, 64)
self.buf_feat_params2 = allocate_block_buffer(64, 64)
self.buf_feat_params3 = allocate_block_buffer(64, 64)
self.buf_feat_params4 = allocate_block_buffer(64, 128)
self.buf_feat_params5 = allocate_block_buffer(128, 1024)
# Allocate buffers for classification network
self.buf_cls_params1 = allocate_block_buffer(1024, 512)
self.buf_cls_params2 = allocate_block_buffer(512, 256)
self.buf_cls_params3 = allocate_linear_buffer(256, 40)
# Allocate a buffer for point cloud
self.num_points = num_points
if self.axi_m_data_width == 32:
self.buf_point_cloud: pynq.buffer.PynqBuffer = pynq.allocate(
shape=(self.num_points, 3), dtype=np.float32, cacheable=False)
elif self.axi_m_data_width == 64:
self.buf_point_cloud: pynq.buffer.PynqBuffer = pynq.allocate(
shape=(self.num_points, 4), dtype=np.float32, cacheable=False)
else:
raise RuntimeError(f"Unexpected data width for AXI master")
# Allocate a buffer for output logits
self.buf_out_logits: pynq.buffer.PynqBuffer = pynq.allocate(
shape=(40,), dtype=np.float32, cacheable=False)
# Copy parameters for PointNet feature extraction network
write_conv_batchnorm1d_params(self.buf_feat_params1,
model_cpu.feat.conv1, model_cpu.feat.bn1)
write_conv_batchnorm1d_params(self.buf_feat_params2,
model_cpu.feat.conv2, model_cpu.feat.bn2)
write_conv_batchnorm1d_params(self.buf_feat_params3,
model_cpu.feat.conv3, model_cpu.feat.bn3)
write_conv_batchnorm1d_params(self.buf_feat_params4,
model_cpu.feat.conv4, model_cpu.feat.bn4)
write_conv_batchnorm1d_params(self.buf_feat_params5,
model_cpu.feat.conv5, model_cpu.feat.bn5)
# Copy parameters for classification network
write_linear_batchnorm1d_params(self.buf_cls_params1,
model_cpu.fc1, model_cpu.bn1)
write_linear_batchnorm1d_params(self.buf_cls_params2,
model_cpu.fc2, model_cpu.bn2)
write_linear_params(self.buf_cls_params3, model_cpu.fc3)
# Set the physical addresses of the buffers
self.registers.point_cloud_1, self.registers.point_cloud_2 = \
split_address(self.buf_point_cloud.device_address)
self.registers.out_logits_1, self.registers.out_logits_2 = \
split_address(self.buf_out_logits.device_address)
self.registers.feat_params1_1, self.registers.feat_params1_2 = \
split_address(self.buf_feat_params1.device_address)
self.registers.feat_params2_1, self.registers.feat_params2_2 = \
split_address(self.buf_feat_params2.device_address)
self.registers.feat_params3_1, self.registers.feat_params3_2 = \
split_address(self.buf_feat_params3.device_address)
self.registers.feat_params4_1, self.registers.feat_params4_2 = \
split_address(self.buf_feat_params4.device_address)
self.registers.feat_params5_1, self.registers.feat_params5_2 = \
split_address(self.buf_feat_params5.device_address)
self.registers.cls_params1_1, self.registers.cls_params1_2 = \
split_address(self.buf_cls_params1.device_address)
self.registers.cls_params2_1, self.registers.cls_params2_2 = \
split_address(self.buf_cls_params2.device_address)
self.registers.cls_params3_1, self.registers.cls_params3_2 = \
split_address(self.buf_cls_params3.device_address)
# Synchronize the buffers
self.buf_feat_params1.sync_to_device()
self.buf_feat_params2.sync_to_device()
self.buf_feat_params3.sync_to_device()
self.buf_feat_params4.sync_to_device()
self.buf_feat_params5.sync_to_device()
self.buf_cls_params1.sync_to_device()
self.buf_cls_params2.sync_to_device()
self.buf_cls_params3.sync_to_device()
# Initialize the weights (transfer the weights to the on-chip buffers)
self.registers.op_mode = PointNetClsZCU104.MODE_INIT_WEIGHTS
self.registers.CTRL.AP_START = 1
self.wait_for_ip()
def load_overlay(self, overlay_path):
overlay = pynq.Overlay(overlay_path)
if not overlay.is_loaded():
raise RuntimeError(f"Unable to load overlay: {overlay_path}")
return overlay
def wait_for_ip(self):
while self.registers.CTRL.AP_DONE == 0:
pass
def forward(self, x: torch.Tensor):
# `x` is of size [B, N, 3]
if x.ndim != 3 or x.shape[2] != 3:
raise RuntimeError(f"Unexpected shape of the input: {x.shape}")
batch_size = x.shape[0]
num_points = x.shape[1]
# Reallocate the buffer for point cloud if necessary
if num_points > self.num_points:
self.num_points = num_points
self.buf_point_cloud.freebuffer()
if self.axi_m_data_width == 32:
self.buf_point_cloud: pynq.buffer.PynqBuffer = pynq.allocate(
shape=(self.num_points, 3),
dtype=np.float32, cacheable=False)
elif self.axi_m_data_width == 64:
self.buf_point_cloud: pynq.buffer.PynqBuffer = pynq.allocate(
shape=(self.num_points, 4),
dtype=np.float32, cacheable=False)
else:
raise RuntimeError(f"Unexpected data width for AXI master")
self.registers.point_cloud_1, self.registers.point_cloud_2 = \
split_address(self.buf_point_cloud.device_address)
# Allocate the Tensor for output
out = torch.empty(size=(batch_size, 40),
dtype=x.dtype, device=x.device)
# Run the inference
self.registers.op_mode = PointNetClsZCU104.MODE_INFERENCE
self.registers.num_points = num_points
for i in range(batch_size):
# Copy the input point cloud
self.buf_point_cloud[:num_points, :3] = x[i].view(-1, 3)
self.buf_point_cloud.sync_to_device()
# Run the inference
self.registers.CTRL.AP_START = 1
self.wait_for_ip()
# Copy the output logits
self.buf_out_logits.sync_from_device()
out[i, :] = torch.from_numpy(self.buf_out_logits)
return out
The constructor of the `PointNetClsZCU104` class initializes the IP core and makes it ready for use through the following steps (they do not have to be performed in this exact order). Each step is explained in turn below; for details, see the official Pynq documentation.

- Load the bitstream (`load_overlay`)
- Allocate the DRAM buffers (`allocate_block_buffer`, `pynq.allocate`)
- Copy the parameters into the DRAM buffers (`write_conv_batchnorm1d_params`, `write_linear_batchnorm1d_params`, `write_linear_params`)
- Set the physical addresses of the DRAM buffers in the port registers
- Synchronize the contents of the DRAM buffers (`sync_to_device`)
- Run the IP core in the weight initialization mode, copying the parameters placed in the DRAM buffers into the on-chip buffers
- Wait for the IP core to finish (`wait_for_ip`)
The class for handling bitstreams is `pynq.Overlay`; given a file path, it loads the specified bitstream. Besides the `.bit` bitstream, the `.hwh` Handoff file is required: if the bitstream is `path/to/X.bit`, the corresponding Handoff file must be `path/to/X.hwh`, or an error results. Starting from the `pynq.Overlay` instance `self.overlay`, we perform various operations on the FPGA. After loading the overlay (bitstream), we pull out our own IP core `PointNetClsTop` and store it in `self.net_ip`. The IP cores' property names correspond to the names of the IPs in the board design (see the image referenced earlier). For example, the interrupt controller (AXI Interrupt Controller) can be accessed through the `axi_intc_0` property. The class for operating an IP core is `pynq.DefaultIP` by default; you can also inherit from this class and add methods to make your own IP core more convenient to use. In addition, we pull out `register_map` (a subclass of `pynq.registers.RegisterMap`), the interface for accessing the IP core's control registers, and store it in `self.registers`.
The next three lines look up the address and data widths of the IP core's I/O ports and store them in `self.axi_m_addr_width` and `self.axi_m_data_width`. The former is 64; the latter is 32 or 64 (64 if the I/O port type is `ap_uint<64>*`, 32 if it remains `float*`). As described above, if the port width is 32 bits, the point cloud buffer has size $(N, 3)$; if it is 64 bits, the buffer size must be $(N, 4)$ because the data is read two values at a time. Consulting `self.axi_m_data_width` lets us decide the point cloud buffer size.
Next, we allocate the DRAM buffers that hold the parameters and the input/output. These buffers are a bit special: they are allocated through a Linux kernel feature called the CMA (Contiguous Memory Allocator). When a buffer is allocated with the usual `malloc()` or `new`, we only know its virtual address. The FPGA side, however, accesses buffers using physical addresses, so we need to know the physical address in advance, not just the virtual one.

The `allocate_linear_buffer` function, true to its name, allocates a buffer for the parameters of a fully-connected layer (input dimension `in_dims`, output dimension `out_dims`). It first determines the buffer size by adding the element counts of the layer's weight (`in_dims * out_dims`) and bias (`out_dims`), then calls the `pynq.allocate` function to allocate a 1D buffer of that size with data type `np.float32` (`float`). This buffer is placed in a special region of DRAM and is guaranteed to be contiguous in memory. The `allocate_block_buffer` function allocates a buffer holding the parameters of a fully-connected layer plus a batch normalization layer: it sums up the element counts of all the parameters and allocates a 1D buffer with `pynq.allocate`. These buffers are instances of the `pynq.buffer.PynqBuffer` class and can be used just like NumPy arrays (`np.ndarray`); for example, they can be converted to PyTorch tensors with the `torch.from_numpy` function. We allocate the parameter buffers for the feature extraction network (`buf_feat_params1` to `buf_feat_params5`) and the classification network (`buf_cls_params1` to `buf_cls_params3`). After that, we allocate the buffers for the input (point cloud) and the output (logits). For the input, as described above, the buffer size is `(self.num_points, 4)` if the port bit width is 64 and `(self.num_points, 3)` if it is 32.
With the DRAM buffers allocated, the next step is to copy the model parameters into them. The model is an instance of the `PointNetCls` class, passed as the constructor argument `model_cpu`. `write_conv1d_params` and `write_linear_params` copy the parameters of `torch.nn.Conv1d` and `torch.nn.Linear`, respectively. `write_conv1d_params` assumes the kernel size is 1 (i.e., the layer behaves exactly like a fully-connected `torch.nn.Linear`). The weights and biases are laid out, in that order, in the given 1D DRAM buffer. Meticulous care is needed so that the data is arranged exactly as the IP core expects. These two functions are written to match `ReadLinearParamsNaive` and `ReadLinearParamsOpt1` in the HLS implementation.
`write_batchnorm1d_params` copies the parameters of `torch.nn.BatchNorm1d` into the given DRAM buffer. The IP core side, as shown in `ReadBatchNorm1dParamsNaive` and `ReadBatchNorm1dParamsOpt1`, expects the parameters in the order scale, bias, mean. The scale is computed from the batch normalization layer's variance and weight (the formula was described earlier). `write_conv_batchnorm1d_params` and `write_linear_batchnorm1d_params` copy the parameters of a fully-connected layer (`torch.nn.Conv1d` or `torch.nn.Linear`) and a batch normalization layer (`torch.nn.BatchNorm1d`) into the given DRAM buffer. The fully-connected layer's weight and bias must be followed by the batch normalization layer's scale, bias, and mean, in that order. You can see that this corresponds to `ReadBlockParamsNaive`, `ReadBlockParamsOpt1`, and `ReadBlockParamsOpt2` on the IP core side.
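For reference, the precomputation done in `write_batchnorm1d_params` above corresponds to the standard batch normalization inference formula: with weight $\gamma$, bias $\beta$, running mean $\mu$, running variance $\sigma^2$, and epsilon $\epsilon$,

$$
s = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = s \, (x - \mu) + \beta,
$$

so the IP core only needs to store the three vectors $s$, $\beta$, and $\mu$ per layer instead of four, which saves on-chip memory.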
The model parameters are PyTorch tensors, and they can be assigned to the DRAM buffers (`pynq.buffer.PynqBuffer`) as they are. With the parameters safely copied, we set the physical addresses of the DRAM buffers. The IP core's top function `PointNetClsTop` is declared as follows (with `ap_uint<64>*` in place of `float*`):
void PointNetClsTop(const int op_mode,
const float* point_cloud,
const int num_points,
float* out_logits,
const float* feat_params1,
const float* feat_params2,
const float* feat_params3,
const float* feat_params4,
const float* feat_params5,
const float* cls_params1,
const float* cls_params2,
const float* cls_params3)
{
#pragma HLS INTERFACE m_axi port=point_cloud offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out_logits offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=feat_params1 offset=slave bundle=gmem0
// ...
#pragma HLS INTERFACE m_axi port=cls_params3 offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=op_mode bundle=control
#pragma HLS INTERFACE s_axilite port=point_cloud bundle=control
#pragma HLS INTERFACE s_axilite port=num_points bundle=control
#pragma HLS INTERFACE s_axilite port=out_logits bundle=control
#pragma HLS INTERFACE s_axilite port=feat_params1 bundle=control
// ...
#pragma HLS INTERFACE s_axilite port=cls_params3 bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
}
For the I/O ports backed by DRAM buffers, i.e., everything except `op_mode` and `num_points`, we see `#pragma HLS INTERFACE m_axi` and `#pragma HLS INTERFACE s_axilite` directives. When these two HLS pragmas are attached, control registers for specifying the DRAM buffer's physical address are created for each port. Addresses are 64 bits wide, while the control registers' data width is 32 bits, so two control registers are prepared per port, one for the lower 32 bits and one for the upper 32 bits. For the `point_cloud` port, for example, they are `point_cloud_1` (lower 32 bits) and `point_cloud_2` (upper 32 bits). Setting the physical address of a DRAM buffer ties the port to that buffer, allowing the FPGA side to access it. With the Pynq library this looks like an ordinary value assignment, but it is actually realized with memory-mapped I/O. In other words, each control register is assigned a dedicated address, and we read from and write to that address.
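To make that concrete, here is a minimal C++ sketch (my own illustration, not code from the repository) of what happens under the hood: the control register space is mapped into the process with `mmap()` and accessed as volatile words. The base address `0xA0000000` and the `0x10` offset of `op_mode` are hypothetical; the real values come from the Vivado address editor and the register map generated by Vitis HLS.

// A minimal sketch of memory-mapped control register access (assumptions
// noted above; the 0x00 CTRL register with AP_START at bit 0 and AP_DONE
// at bit 1 follows the usual Vitis HLS convention)
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
  const off_t kBaseAddr = 0xA0000000;  // Hypothetical IP core base address
  const int fd = open("/dev/mem", O_RDWR | O_SYNC);
  if (fd < 0)
    return 1;
  // Map one page of the control register space
  void* regs = mmap(nullptr, 0x1000, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, kBaseAddr);
  if (regs == MAP_FAILED)
    return 1;
  volatile std::uint32_t* r = static_cast<volatile std::uint32_t*>(regs);
  r[0x10 / 4] = 100;                 // Hypothetical offset of `op_mode`
  r[0x00 / 4] = 0x1;                 // CTRL: set AP_START (bit 0)
  while (!(r[0x00 / 4] & 0x2))       // CTRL: busy-wait for AP_DONE (bit 1)
    ;
  munmap(regs, 0x1000);
  close(fd);
  return 0;
}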
The control registers are accessed through `self.registers` introduced earlier. `op_mode` and `num_points` also have `#pragma HLS INTERFACE s_axilite` directives, so control registers are likewise created for setting these two values (the operation mode and the number of points).
Once everything has been written, the `sync_to_device` method synchronizes the contents of the DRAM buffers so that they are read correctly from the FPGA side. Finally, we set the operation mode `op_mode` to weight initialization and set `CTRL.AP_START` to 1 among the control registers, which starts the IP core. In the weight initialization mode, the parameters are read from the DRAM buffers and stored in the on-chip buffers. Thanks to the `#pragma HLS INTERFACE s_axilite port=return bundle=control` directive, a `CTRL` register is created for controlling the IP core from the software side. After starting the IP core, the `wait_for_ip` method is called to wait for it to finish (i.e., for the parameter transfer to complete). Inside `wait_for_ip`, we busy-wait until `AP_DONE` of the `CTRL` register becomes 1.
That concludes the initialization. The initialization involves many steps and is tedious, but the inference is comparatively simple. As with a normal PyTorch module, the inference is written in the `forward` method. The input point cloud `x` is assumed to be a batch of size $(B, N, 3)$ (with $K$ object classes, `out` has size $(B, K)$; since we use the dataset called ModelNet40 here, the number of classes is $K = 40$). First, if the point cloud size $N$ is larger than the DRAM buffer currently allocated for the point cloud, the DRAM buffer is reallocated.
Next, inference runs on each sample in the batch to compute the logits (scores) for each object class. The point cloud data is copied into the point cloud DRAM buffer `buf_point_cloud`, and the buffer is synchronized so that it is read correctly from the FPGA side. On the software side, we hardly need to be aware of the I/O port width (whether it is 32 or 64 bits). The two control registers (the operation mode `op_mode` and the number of points `num_points`) are set in advance. Setting `AP_START` of the `CTRL` register to 1 starts the IP core in inference mode, and the `wait_for_ip` method waits for the operation to finish. The logits, the model's output, have been written by the IP core to the DRAM buffer `buf_out_logits`, so they are converted to a PyTorch tensor and copied into the output tensor `out`. That concludes the description of the inference.
As you can see, it took some effort because we had to prepare not only the IP core implementation but also a driver to actually use it. Since we used the Pynq library here, the FPGA-related processing could be written relatively easily. Also, the driver was written as a PyTorch module (`torch.nn.Module`) so that it can be used just like the CPU/GPU model. Using C++ instead of Python is of course also possible. In that case, you would write the bitstream loading (e.g., here), the memory-mapped I/O setup (e.g., here), the DRAM buffer allocation (e.g., here), and so on in C++ (it amounts to porting the Pynq library as-is).
æåã«ãæšè«æéãæ¯èŒããŠã¿ãŸãããã
以äžã®ãœãŒã¹ã³ãŒããå©çšããŸã (host/time_zcu104.py
)ã
def main():
# Parse the command-line arguments
args = parse_command_line()
# Create a PointNet classification model
model = PointNetCls(num_classes=40)
# Create an FPGA model
model_zcu104 = PointNetClsZCU104(model, args.bitstream, args.num_points)
model.eval()
model_zcu104.eval()
# Test the output
# Create a random input point cloud
point_cloud = torch.rand(size=(1, args.num_points, 3))
out_cpu = model(point_cloud)
out_zcu104 = model_zcu104(point_cloud)
print(f"Output (CPU):\n{out_cpu}")
print(f"Output (FPGA):\n{out_zcu104}")
# Measure the inference times
times_cpu = []
times_zcu104 = []
for _ in range(args.runs):
# Create a random input point cloud
point_cloud = torch.rand(size=(1, args.num_points, 3))
t0 = time.monotonic()
model(point_cloud)
elapsed_cpu = (time.monotonic() - t0) * 1e3
t0 = time.monotonic()
model_zcu104(point_cloud)
elapsed_zcu104 = (time.monotonic() - t0) * 1e3
times_cpu.append(elapsed_cpu)
times_zcu104.append(elapsed_zcu104)
time_avg_cpu = np.mean(times_cpu)
time_std_cpu = np.std(times_cpu)
time_avg_zcu104 = np.mean(times_zcu104)
time_std_zcu104 = np.std(times_zcu104)
speedup_factor = time_avg_cpu / time_avg_zcu104
print(f"Inference time (CPU): " \
f"mean: {time_avg_cpu:.3f}ms, " \
f"std: {time_std_cpu:.3f}ms")
print(f"Inference time (FPGA): " \
f"mean: {time_avg_zcu104:.3f}ms, " \
f"std: {time_std_zcu104:.3f}ms")
print(f"Speedup: {speedup_factor:.3f}x")
Accuracy is not a concern here, so loading a trained model is omitted. However, the CPU model `PointNetCls` and the FPGA model `PointNetClsZCU104` must share the same parameters. Also, the CPU model must run in `eval` mode: in training mode, the batch normalization layers raise an error when the batch size is 1, and they would compute the mean and standard deviation from the input batch rather than using the trained parameters, so the outputs would no longer match the FPGA model. We measure the inference time for the specified number of runs `args.runs` and compute the mean, the standard deviation, and the speedup factor. At the very beginning, we also check whether the outputs of the two models roughly agree (ideally this should be tested when the IP core is created).
Run the following commands on the FPGA board.

```shell
> cd advent_2022_point_cloud_classification/host
# Naive implementation (150MHz clock)
> sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_naive_150.bit
# Implementation exploiting data parallelism (loop unrolling and array partitioning) (150MHz clock)
> sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit
# Implementation with dataflow optimization (150MHz clock)
> sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt2.bit
# Implementation with the input/output port width widened to 64 bits (150MHz clock)
> sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_opt3.bit
```
An example of the output when testing the naive implementation is shown below.
```
$ sudo XILINX_XRT=/usr ./time_zcu104.sh ../vivado/bitstream/pointnet_naive_150.bit
Output (CPU):
tensor([[-0.0594, -0.0272,  0.0115, -0.0481, -0.0529,  0.0449, -0.0634, -0.0328,
          0.0348, -0.0071, -0.0228,  0.0412,  0.0128, -0.0175, -0.0086, -0.0023,
         -0.0192, -0.0101, -0.0072,  0.0520, -0.0106, -0.0110,  0.0113,  0.0499,
         -0.0563, -0.0523, -0.0711, -0.0104, -0.0048, -0.0404,  0.0375,  0.0089,
          0.0326, -0.0408, -0.0302, -0.0041,  0.0534, -0.0349,  0.0380, -0.0020]],
       grad_fn=<AddmmBackward0>)
Output (FPGA):
tensor([[-0.0592, -0.0274,  0.0114, -0.0491, -0.0527,  0.0446, -0.0632, -0.0335,
          0.0337, -0.0071, -0.0258,  0.0399,  0.0119, -0.0170, -0.0091, -0.0030,
         -0.0216, -0.0112, -0.0106,  0.0522, -0.0111, -0.0130,  0.0114,  0.0487,
         -0.0571, -0.0523, -0.0714, -0.0103, -0.0058, -0.0389,  0.0383,  0.0068,
          0.0306, -0.0421, -0.0314, -0.0052,  0.0539, -0.0360,  0.0399, -0.0031]])
Inference time (CPU): mean: 369.048ms, std: 1.086ms
Inference time (FPGA): mean: 1071.358ms, std: 0.023ms
Speedup: 0.344x
```
The CPU model uses float, while the FPGA model uses fixed-point numbers (ap_fixed), so even with identical model parameters and input, the outputs deviate slightly (here, the fixed-point width is 32 bits, with 16 integer bits and 16 fractional bits).
Even so, the CPU and FPGA models produce broadly similar outputs (they agree down to about the second decimal place).
For a classification task, this should not be a problem.
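For a rough feel of the error magnitude, the following sketch (my own illustration, not the IP core's actual rounding logic) simulates a round-trip through a grid with 16 fractional bits:

```python
import torch

def quantize(x: torch.Tensor, frac_bits: int = 16) -> torch.Tensor:
    # Round to the nearest multiple of 2**-frac_bits,
    # mimicking a fixed-point representation
    scale = float(1 << frac_bits)
    return torch.round(x * scale) / scale

x = torch.rand(size=(1, 1024, 3))
print((x - quantize(x)).abs().max())  # at most 2**-17, i.e. ~7.6e-6 per value
```

Per-value errors of this size accumulate over the many multiply-accumulate operations in the network, which is consistent with the logits agreeing only to about the second decimal place.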
Looking at the inference time, the naive implementation turns out to be roughly 3x slower than the CPU model.
The inference times of all the implementations are summarized below.
| Implementation | Mean inference time (ms) | Std dev (ms) | Speedup (vs. software) | Speedup (vs. naive) |
|---|---|---|---|---|
| CPU | 369.0 | 1.086 | 1.0x | 2.904x |
| Naive (100MHz) | 1606.4 | 0.041 | 0.230x | 0.667x |
| Naive (150MHz) | 1071.4 | 0.023 | 0.344x | 1.0x |
| Naive (200MHz) | 872.05 | 0.077 | 0.423x | 1.223x |
| Naive (250MHz) | 665.33 | 0.073 | 0.555x | 1.610x |
| Data parallelism (150MHz) | 34.60 | 0.027 | 10.66x | 30.97x |
| Dataflow optimization (150MHz) | 12.93 | 0.016 | 28.54x | 82.86x |
| Wider ports (150MHz) | 10.80 | 0.012 | 34.17x | 99.20x |
The naive implementation (150MHz) achieves only 0.344x the performance of the CPU.
Left as-is, the naive implementation remains slower than the CPU even with the clock raised to 250MHz.
Exploiting data parallelism cut the inference time by a factor of 30.97, making it 10.66x faster than the CPU.
Looking at the clock cycle counts reported by Vitis HLS, the naive implementation (150MHz) takes 161,945,604 cycles (1.079s) and the parallelized implementation 4,462,596 cycles (29.72ms).
The measured times are 1.071s and 34.60ms respectively, so the estimates are roughly on target.
Applying dataflow optimization to the feature extraction network cut the inference time by another factor of 2.68, making it 28.54x faster than the CPU and 82.86x faster than the original naive implementation.
Widening the port width from 32 to 64 bits mainly sped up the classification network.
The inference time shrank by a factor of 1.20, for a 34.17x speedup over the CPU and a 99.20x speedup over the original naive implementation.
Each optimization thus delivered a steady speedup, and since they basically amount to inserting the appropriate HLS pragmas, the effort involved is quite small.
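Converting the Vitis HLS cycle estimates quoted above into wall-clock time is simply a matter of dividing the cycle count by the clock frequency:

```python
# Cycle estimates from Vitis HLS, converted at the 150MHz target clock
freq_hz = 150e6
for name, cycles in [("naive", 161_945_604), ("parallel", 4_462_596)]:
    print(f"{name}: {cycles / freq_hz * 1e3:.1f} ms")
# naive: 1079.6 ms  (measured: 1071 ms)
# parallel: 29.8 ms (measured: 34.6 ms)
```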
Finally, let us look at the classification accuracy of the model.
Here, the test split of the ModelNet40 dataset is used.
The dataset can be downloaded here.
Each sample is a point cloud of 2048 points taken from a CAD model of a single object such as an airplane, a car, a laptop, or a person.
We use the following source code (host/test_zcu104.py).
For the dataset handling and model training, please refer to the GitHub repository.
```python
import argparse

import torch
import torch.nn.functional as F

def test(args: argparse.Namespace,
         model: torch.nn.Module,
         model_zcu104: torch.nn.Module,
         test_loader: torch.utils.data.DataLoader):
    print(f"Testing PointNet ...")

    # model.eval()
    model_zcu104.eval()

    # test_loss_total = 0.0
    # correct = 0
    test_loss_total_zcu104 = 0.0
    correct_zcu104 = 0

    with torch.no_grad():
        for i, batch in enumerate(test_loader):
            if i % 5 == 0:
                print(f"Testing batch {i} ...")

            data, target = batch["points"], batch["label"]

            # out = model(data)
            # pred = out.argmax(dim=1, keepdim=True)
            # loss = F.cross_entropy(out, target)
            # correct += pred.eq(target.view_as(pred)).sum().item()
            # test_loss_total += loss.item() * len(data)

            out_zcu104 = model_zcu104(data)
            pred_zcu104 = out_zcu104.argmax(dim=1, keepdim=True)
            loss_zcu104 = F.cross_entropy(out_zcu104, target)
            correct_zcu104 += pred_zcu104.eq(
                target.view_as(pred_zcu104)).sum().item()
            test_loss_total_zcu104 += loss_zcu104.item() * len(data)

    # test_loss_avg = test_loss_total / len(test_loader.dataset)
    # test_acc = correct * 1e2 / len(test_loader.dataset)
    test_loss_avg_zcu104 = test_loss_total_zcu104 / len(test_loader.dataset)
    test_acc_zcu104 = correct_zcu104 * 1e2 / len(test_loader.dataset)

    # print(f"Test result (CPU): " \
    #       f"loss: {test_loss_avg:.6f}, " \
    #       f"accuracy: {test_acc:.3f}%, " \
    #       f"correct: {correct}")
    print(f"Test result (FPGA): " \
          f"loss: {test_loss_avg_zcu104:.6f}, " \
          f"accuracy: {test_acc_zcu104:.3f}%, " \
          f"correct: {correct_zcu104}, " \
          f"total: {len(test_loader.dataset)}")
```
Run the following commands on the FPGA board.

```shell
> cd advent_2022_point_cloud_classification/host
# Implementation exploiting data parallelism (loop unrolling and array partitioning) (150MHz clock)
> sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit
# Implementation with dataflow optimization (150MHz clock)
> sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt2.bit
# Implementation with the input/output port width widened to 64 bits (150MHz clock)
> sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt3.bit
```
An example of the output is shown below.
```
> sudo XILINX_XRT=/usr ./test_zcu104.sh ../vivado/bitstream/pointnet_opt1.bit
Testing batch 0 ...
Testing batch 5 ...
...
Testing batch 2445 ...
Testing batch 2450 ...
Testing batch 2455 ...
Testing batch 2460 ...
Testing batch 2465 ...
Test result (FPGA): loss: 0.375841, accuracy: 89.506%, correct: 2209, total: 2468
```
The accuracy of each implementation is summarized below.
There are 2,468 test samples in total.
The naive implementations are omitted, since they would take too long.
| Implementation | Correct | Accuracy |
|---|---|---|
| CPU | 2209 | 89.506% |
| Data parallelism (150MHz) | 2209 | 89.506% |
| Dataflow optimization (150MHz) | 2209 | 89.506% |
| Wider ports (150MHz) | 2209 | 89.506% |
These IP cores achieve exactly the same accuracy as running the model on the CPU.
Even though fixed-point numbers (ap_fixed) are used instead of float, no loss of accuracy is observed so far.
Next, let us examine the resource consumption of the various IP cores.
Resources fall into five categories: LUT (lookup table), FF (flip-flop), BRAM (Block RAM), URAM (UltraRAM), and DSP (digital signal processor).
The resource consumption is summarized in the table below.
| Implementation | LUT | FF | BRAM (36Kb) | URAM | DSP |
|---|---|---|---|---|---|
| Total available | 230,400 | 460,800 | 312 | 96 | 1,728 |
| Naive (100MHz) | 22,378 (9.71%) | 11,045 (2.40%) | 149.5 (47.92%) | 2 (2.08%) | 48 (2.78%) |
| Naive (150MHz) | 22,140 (9.61%) | 12,428 (2.70%) | 161.5 (51.76%) | 2 (2.08%) | 48 (2.78%) |
| Naive (200MHz) | 21,344 (9.26%) | 13,616 (2.95%) | 149.5 (47.92%) | 2 (2.08%) | 48 (2.78%) |
| Naive (250MHz) | 20,663 (8.97%) | 14,713 (3.19%) | 149.5 (47.92%) | 2 (2.08%) | 20 (1.16%) |
| Data parallelism (150MHz) | 58,223 (25.27%) | 42,755 (9.28%) | 287.5 (92.15%) | 0 (0.00%) | 768 (44.44%) |
| Dataflow optimization (150MHz) | 136,408 (59.20%) | 48,940 (10.62%) | 310.5 (99.52%) | 0 (0.00%) | 808 (46.76%) |
| Wider ports (150MHz) | 84,263 (36.57%) | 49,660 (10.78%) | 263.5 (84.46%) | 64 (66.67%) | 808 (46.76%) |
Exploiting data parallelism requires many multiply-accumulate operations running in parallel, so DSP consumption increases sharply.
Dataflow optimization, on the other hand, adds little resource consumption (although BRAM runs short and LUTs get used as LUTRAM, so LUT usage does grow).
In other words, dataflow optimization improves circuit performance while keeping the growth in resource usage in check.
Widening the ports changes little apart from URAM (BRAM ran short and caused errors, so part of the on-chip buffers is implemented in URAM instead).
This time, we used the Xilinx ZCU104 Evaluation Kit, an FPGA board costing around 200,000 yen.
The FPGA chip on this board (XCZU7EV-2FFVC1156) provides URAM in addition to BRAM, so relatively large on-chip buffers (a few MB) can be created.
There are fewer URAM blocks than BRAM blocks (96 URAM versus 312 BRAM), but each block holds more, so URAM is coarser-grained.
Low-cost FPGA boards do not offer URAM, so BRAM has to be used sparingly.
Personally, BRAM is usually the first resource I run out of (I am still an FPGA beginner and cannot write very good implementations yet).
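For reference, the total on-chip memory implied by the block counts above (using the standard block sizes of 36 Kbit per BRAM and 288 Kbit per URAM) works out as follows:

```python
# Total on-chip memory of the XCZU7EV, from the block counts above
bram_mb = 312 * 36 * 1024 / 8 / 2**20   # 36 Kbit per BRAM block
uram_mb = 96 * 288 * 1024 / 8 / 2**20   # 288 Kbit per URAM block
print(f"BRAM: {bram_mb:.2f} MB")  # ~1.37 MB in 312 fine-grained blocks
print(f"URAM: {uram_mb:.2f} MB")  # ~3.38 MB in 96 coarse blocks
```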
Up to this point, 32-bit fixed-point numbers (16 integer and 16 fractional bits) have been used to represent the layer inputs/outputs and the model parameters.
We would like to cut the bit widths (and hence the resource consumption) while keeping the accuracy reasonably intact.
Here, IP cores (150MHz clock) were built with the following combinations of bit widths.
These IP cores are the version that exploits data parallelism, applies dataflow optimization, and additionally widens the ports.
The bit widths themselves were chosen somewhat arbitrarily.
Since the model parameters vary over a smaller range than the layer inputs/outputs, their bit width can presumably be cut further (see the sketch after the table below for the range and resolution each format provides).
| Name | Layer inputs/outputs (value_t) | Model parameters (param_t) |
|---|---|---|
| 28-28 | 28 bits (14 integer + 14 fractional) | 28 bits (10 integer + 18 fractional) |
| 28-24 | 28 bits (14 integer + 14 fractional) | 24 bits (8 integer + 16 fractional) |
| 24-24 | 24 bits (12 integer + 12 fractional) | 24 bits (8 integer + 16 fractional) |
| 24-20 | 24 bits (12 integer + 12 fractional) | 20 bits (6 integer + 14 fractional) |
| 24-16 | 24 bits (12 integer + 12 fractional) | 16 bits (4 integer + 12 fractional) |
| 20-20 | 20 bits (10 integer + 10 fractional) | 20 bits (6 integer + 14 fractional) |
| 20-16 | 20 bits (10 integer + 10 fractional) | 16 bits (4 integer + 12 fractional) |
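As a back-of-the-envelope check (my own, not from the article), the range and resolution each parameter format provides can be computed directly:

```python
# Range and resolution of each signed param_t format in the table:
# ap_fixed<W, I> covers [-2**(I-1), 2**(I-1)) in steps of 2**-(W-I)
for width, int_bits in [(28, 10), (24, 8), (20, 6), (16, 4)]:
    frac_bits = width - int_bits
    print(f"{width}-bit (int {int_bits} + frac {frac_bits}): "
          f"range +/-{2 ** (int_bits - 1)}, "
          f"resolution {2.0 ** -frac_bits:.1e}")
```

Note that moving from the 24-bit to the 20-bit parameter format drops the fractional part from 16 to 14 bits; one plausible reading of the results below is that the trained weights here need roughly 16 fractional bits of precision, while 6 integer bits of range are still plenty.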
The accuracy of each implementation is summarized below.

| Implementation | Correct | Accuracy |
|---|---|---|
| CPU | 2209 | 89.506% |
| Wider ports (150MHz) | 2209 | 89.506% |
| Wider ports (150MHz, 28-28) | 2206 | 89.384% |
| Wider ports (150MHz, 28-24) | 2206 | 89.384% |
| Wider ports (150MHz, 24-24) | 2200 | 89.141% |
| Wider ports (150MHz, 24-20) | 550 | 22.285% |
| Wider ports (150MHz, 24-16) | 121 | 4.903% |
| Wider ports (150MHz, 20-20) | 448 | 18.152% |
| Wider ports (150MHz, 20-16) | 122 | 4.903% |
The resource consumption is likewise summarized below.

| Implementation | LUT | FF | BRAM (36Kb) | URAM | DSP |
|---|---|---|---|---|---|
| Total available | 230,400 | 460,800 | 312 | 96 | 1,728 |
| Wider ports (150MHz) | 84,263 (36.57%) | 49,660 (10.78%) | 263.5 (84.46%) | 64 (66.67%) | 808 (46.76%) |
| Wider ports (150MHz, 28-28) | 74,342 (32.27%) | 47,267 (10.26%) | 261.5 (83.81%) | 64 (66.67%) | 808 (46.76%) |
| Wider ports (150MHz, 28-24) | 63,749 (27.67%) | 39,139 (8.49%) | 257 (82.37%) | 64 (66.67%) | 404 (23.38%) |
| Wider ports (150MHz, 24-24) | 59,970 (26.03%) | 36,240 (7.86%) | 257 (82.37%) | 64 (66.67%) | 404 (23.38%) |
| Wider ports (150MHz, 24-20) | 75,997 (32.98%) | 40,762 (8.85%) | 259 (83.01%) | 64 (66.67%) | 202 (11.69%) |
Reducing the bit widths did not change the inference time; the implementation apparently needs some reworking to benefit from the narrower types.
Looking at the results above, the classification accuracy collapses the moment the weight bit width is cut from 24 to 20 bits (such a steep drop is surprising).
The IP core with both the layer inputs/outputs and the model parameters set to 24 bits seems to offer the best resource efficiency.
On the resource side, narrower bit widths gradually lower the circuit complexity, and LUT and FF usage shrinks accordingly.
Dropping from 28 to 24 bits reduces the number of DSP blocks needed for the multiply-accumulate operations.
Dropping from 24 to 20 bits cuts DSP usage further (with LUT and FF usage increasing in exchange).
As for BRAM and URAM, the bit width apparently has to shrink a little more before their consumption goes down (running out of on-chip memory is a constant headache).
In this article, we accelerated a point cloud classification task with an FPGA.
For the classification task, we used PointNet, a lightweight and simple model.
To keep the FPGA resource consumption down, the model was simplified and its computation order rearranged.
Next, a custom IP core for PointNet was created with Xilinx's high-level synthesis tool, Vitis HLS 2022.1.
The IP core implementation was improved step by step with pipelining, parallelized layer computations (loop unrolling and array partitioning), dataflow optimization, and so on.
The IP core was then connected with other IP cores into a block design, and logic synthesis and place-and-route were run in Xilinx Vivado 2022.1 to produce a bitstream that can be written to the FPGA.
A driver that loads the bitstream and runs inference at speed was written with the Pynq library.
Using the ModelNet40 dataset, the design was evaluated on the Xilinx ZCU104 Evaluation Kit from three angles: inference time, resource consumption, and classification accuracy.
Comparing the performance of multiple block designs showed the effect of each optimization.
We also tried reducing the bit widths to improve resource efficiency.
With a high-level synthesis tool, an efficient IP core could be built from C/C++ alone, without writing Verilog HDL.
Even so, the amount of source code was several times larger than when using a deep learning library such as PyTorch.
The internal processing flow had to be examined and fully understood before an accelerated IP core could be built.
There are also many things to think about, such as resource constraints and data transfers.
It is a lot of work, but that makes the joy all the greater when your own IP core behaves correctly (produces outputs matching the software implementation) and when the implementation gets faster.
Thank you for reading.
Though there are moments when I catch myself thinking how convenient GPUs are.