\section{Materials and Methods}
We evaluate five classifiers implemented in either \streamdmcpp~\cite{StreamDM-CPP}
or OrpailleCC~\cite{OrpailleCC}. \streamdmcpp is a C++ implementation of
StreamDM~\cite{StreamDM}, a library to mine big data streams using
\href{https://spark.apache.org/streaming/}{Apache Spark Streaming}. \streamdmcpp
is usually faster than StreamDM in single-core environments because it avoids the
overhead induced by Spark.
OrpailleCC is a collection of data stream algorithms developed for embedded
devices. Key functions, such as random number generation or memory
allocation, are parametrizable through class templates and can thus be
customized for a given execution platform. OrpailleCC is not limited to
classification algorithms: it also implements other data stream algorithms such as
the Cuckoo filter~\cite{cuckoo} or a multi-dimensional extension of the
Lightweight Temporal Compression~\cite{multi-ltc}. We extended it with a few
classifiers for the purpose of this benchmark.
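To illustrate this template-based customization, the following minimal sketch (with hypothetical class and member names, not the actual OrpailleCC interface) shows how a classifier can receive its random number generator and memory pool as template parameters:
\begin{verbatim}
// Minimal sketch, not the OrpailleCC API: the classifier receives its
// random number generator and its memory pool as template parameters,
// so both can be swapped for platform-specific implementations.
template <typename RandomGenerator, typename MemoryPool>
class StreamClassifier {
  RandomGenerator rng;   // e.g., a hardware RNG on an embedded board
  MemoryPool pool;       // e.g., a statically allocated buffer
public:
  void train(const double* features, int label) {
    // allocate nodes/clusters from 'pool', draw randomness from 'rng'
  }
  int predict(const double* features) const { return 0; }
};
\end{verbatim}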
This benchmark includes five popular classification algorithms. The
Mondrian forest~(\mondrianforest)~\cite{mondrian2014} builds decision trees without immediate need
for labels, which is useful in situations where labels are
delayed~\cite{stream_learning_review}. The Micro-Cluster Nearest
Neighbors~\cite{mc-nn} is a compressed version of the k-nearest neighbors~(\knn)
algorithm, which was shown to be among the most accurate classifiers for \har from wearable
sensors~\cite{Janidarmian_2017}. The \naivebayes~\cite{naive_bayes} classifier
builds a table of attribute occurrences to estimate class likelihoods. The
\hoeffdingtree~\cite{VFDT} builds a decision tree using the Hoeffding Bound to
estimate when the best split is found. Finally, Neural Network classifiers have
become popular by reaching or even exceeding human performance in many fields
such as image recognition or game playing. We use a Feedforward Neural
Network~(\FNN) with one hidden layer,
as described in~\cite{omid_2019} for the recognition of fitness activities on a
low-power platform.
The remainder of this section details the datasets, classifiers, evaluation
metrics and parameters used in our benchmark.
\subsection{Datasets}
\label{sec:method-dataset}
\revision{To conduct our benchmark, we selected the two main
datasets commonly used to evaluate \har from wearable sensors. In addition, we used the
popular MOA stream simulator to generate three synthetic datasets with different
properties.}
\subsubsection{\banosdataset}
%50 Hz sampling.
%117 data per sample.
%33 activities.
%Sensors cover the body.
%Type of data (ideal or self)
%17 subject.
The \banosdataset dataset~\cite{Banos_2014} is a
human activity dataset with 17 participants
and 9 sensors per
participant\footnote{\banosdataset dataset
available
\href{https://archive.ics.uci.edu/ml/datasets/REALDISP+Activity+Recognition+Dataset\#:\~:text=The\%20REALDISP\%20(REAListic\%20sensor\%20DISPlacement,\%2Dplacement\%20and\%20induced\%2Ddisplacement.}{here}.}. Each sensor samples a 3D
acceleration, a 3D angular velocity (gyroscope), and a 3D magnetic field, as
well as the orientation in quaternion format,
producing a total of 13 values per sensor. Sensors are
sampled at 50~Hz, and each sample is associated
with one of 33 activities. In addition to the 33
activities, an extra activity labeled 0 indicates
no specific activity.
We pre-process the \banosdataset dataset using
non-overlapping windows of one second (50
samples), and using only the 6 axes (acceleration
and gyroscope)
of the right forearm sensor. We compute the average and the standard deviation over the
window as features for each axis. We assign the most
frequent label to the window. The resulting data
points are shuffled uniformly.
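As an illustration of this pre-processing step, the sketch below (our own illustrative code, not the exact script used in the repository) computes the mean and standard deviation of each of the 6 axes over a 50-sample window:
\begin{verbatim}
// Illustrative sketch of the windowing step (not the repository script):
// for each non-overlapping window of 50 samples and each of the 6 axes,
// compute the mean and the standard deviation, giving 12 features.
#include <cmath>
#include <vector>

const int WINDOW = 50, AXES = 6;

std::vector<double> window_features(const double window[WINDOW][AXES]) {
  std::vector<double> features;
  for (int axis = 0; axis < AXES; ++axis) {
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < WINDOW; ++i) {
      sum += window[i][axis];
      sum_sq += window[i][axis] * window[i][axis];
    }
    double mean = sum / WINDOW;
    double variance = sum_sq / WINDOW - mean * mean;
    features.push_back(mean);
    features.push_back(std::sqrt(variance > 0 ? variance : 0));
  }
  return features;  // 12 features: mean and std for each axis
}
\end{verbatim}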
In addition, we construct another dataset from \banosdataset, in which we
simulate a concept drift by shifting the activity labels in the
second half of the data stream. This simulated drift allows us to
observe any behavioral change induced by the
concept drift, such as an increase in power
consumption.
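The exact label remapping is defined in the repository; as a purely hypothetical illustration, a cyclic shift applied to the second half of the stream could look like this:
\begin{verbatim}
// Hypothetical illustration of the simulated drift (the actual remapping
// is defined in the repository): a cyclic label shift applied only to
// the second half of the stream.
int drifted_label(int label, long index, long stream_size, int n_classes) {
  if (index < stream_size / 2)
    return label;                     // first half: unchanged
  return (label + 1) % n_classes;     // second half: shifted labels
}
\end{verbatim}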
\subsubsection{\recofitdataset}
The \recofitdataset dataset~\cite{recofit} is a
human activity dataset containing 94
participants\footnote{\recofitdataset dataset
available
\href{https://msropendata.com/datasets/799c1167-2c8f-44c4-929c-227bf04e2b9a}{here}.}. Similarly to the \banosdataset
dataset, the activity labeled 0 indicates no
specific activity.
Since many of these activities were similar, we
merged some of them together based on the table
in~\cite{behzad2019}.
We pre-processed the dataset similarly to the
\banosdataset one, using non-overlapping windows of
one second, and only using 6 axes (acceleration
and gyroscope) from one sensor.
From these 6 axes, we used the average and the standard deviation
over the window as features. We assigned the most
frequent label to the window.
\subsubsection{MOA datasets}
Massive Online Analysis~\cite{moa} (MOA) is a Java framework to compare
data stream classifiers. In addition to classification algorithms, MOA provides many
tools to read and generate
datasets.
We generate three synthetic datasets\footnote{MOA commands available
\href{https://github.com/big-data-lab-team/benchmark-har-data-stream/blob/f314bf4a258e96e418e249228897d269c59cd522/Makefile\#L104}{here}.}:
a hyperplane, a RandomRBF, and a RandomTree
dataset. We generate 200,000 data points
for each of these synthetic datasets.
The hyperplane and the RandomRBF datasets both have three features and two classes; however, the RandomRBF has a slight imbalance toward one class.
The RandomTree dataset is the hardest of the three, with six attributes and
ten classes. Since the data points are generated with a tree structure, we
expect the decision tree classifiers to perform better than the other
classifiers.
\subsection{Algorithms and Implementation}
In this section, we describe the algorithms used in the benchmark, their
hyperparameters, and relevant implementation details.
\revision{We selected two of the main algorithms in the popular StreamDM
library: \naivebayes and \hoeffdingtree. These algorithms are commonly found in
data stream classification studies. In addition, we selected representative
algorithms from the main classification approaches: the \mondrianforest for
tree-based learning, Micro Cluster Nearest Neighbor for cluster-based learning,
and a Feedforward Neural Network for neural networks.}
\subsubsection{Mondrian forest~(\mondrianforest)~\cite{mondrian2014}}
Each tree in a \mondrianforest recursively splits the feature space, similar to
a regular decision tree. However, the feature used in the split and the value
of the split are picked randomly. The probability to select a feature is
proportional to its normalized range, and the value for the split is uniformly
selected in the range of the feature. During prediction, a node combines its
observed label count with its parent's prediction. Since a Mondrian tree is able
to reshape its internal structure, we expect the Mondrian forest to
recover from concept drifts, although most likely more slowly than \mcnn.
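The split-selection mechanism can be sketched as follows (illustrative code of the mechanism described above, not the OrpailleCC implementation; the bounding box of the node is given by the \texttt{lower} and \texttt{upper} vectors):
\begin{verbatim}
// Illustrative sketch of the Mondrian split selection (not the
// OrpailleCC implementation): the split feature is drawn with
// probability proportional to its range in the node, and the split
// value is drawn uniformly within that range.
#include <random>
#include <vector>

struct Split { int feature; double value; };

Split pick_split(const std::vector<double>& lower,
                 const std::vector<double>& upper,
                 std::mt19937& rng) {
  std::vector<double> ranges(lower.size());
  for (size_t f = 0; f < lower.size(); ++f)
    ranges[f] = upper[f] - lower[f];
  // feature drawn proportionally to its range
  std::discrete_distribution<int> feature_dist(ranges.begin(), ranges.end());
  int feature = feature_dist(rng);
  // split value drawn uniformly within the range of that feature
  std::uniform_real_distribution<double> value_dist(lower[feature], upper[feature]);
  return {feature, value_dist(rng)};
}
\end{verbatim}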
%Bounded memory discussion
In OrpailleCC, the amount of memory allocated to the forest is predefined,
and it is shared by all the trees in the forest, leading to a constant
memory footprint for the classifier. This implementation is memory-bounded,
meaning that the classifier can adjust to memory limitations, for instance
by stopping tree growth or replacing existing nodes with new ones. This is
different from an implementation with a constant space complexity, where
the classifier would use the same amount of memory regardless of the
amount of available memory. For instance, in our study, the \mondrianforest
classifier is memory-bounded while \naivebayes classifier has a constant
space complexity.
Mondrian trees can be tuned using three parameters: the base count, the discount
factor, and the budget. The base count is used to initialize the prediction for
the root. The discount factor controls how much a node relies on
its parent's prediction: a discount factor closer to one makes the prediction of
a node closer to the prediction of its parent. Finally, the budget controls the
tree depth.
\revision{Hyperparameters used for \mondrianforest are shown in
Table~\ref{tab:mondrian-parameters}}. Additionally, the \mondrianforest is allocated with 600~KB of memory
unless specified otherwise. On the \banosdataset and \recofitdataset datasets,
we also explore the \mondrianforest with 3~MB of memory in order to observe the
effect of available memory on performance (classification, runtime, and power).
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
Number of trees & Base count & Discount & Budget\\
\hline
1 & 0.0 & 1.0 & 1.0 \\
5 & 0.0 & 1.0 & 0.4 \\
10 & 0.0 & 1.0 & 0.4 \\
50 & 0.0 & 1.0 & 0.2 \\
\hline
\end{tabular}
\end{center}
\caption{\revision{Hyperparameters used for the \mondrianforest.}}
\label{tab:mondrian-parameters}
\end{table}
\subsubsection{Micro Cluster Nearest Neighbor~\cite{mc-nn}}
The Micro Cluster Nearest Neighbor~(\mcnn) is a variant of k-nearest neighbors
where data points are aggregated into clusters to reduce storage requirements.
During training, the algorithm merges a new data point to the closest cluster
that shares the same label. If the closest cluster does not share the same label
as the data point, this closest cluster and the closest cluster with the same
label are assigned an error. When a cluster receives too many errors, it is
split. During classification, \mcnn returns the label of the closest cluster.
The algorithm also periodically assigns a participation score to each cluster and,
when this score drops below a threshold, the cluster is removed. Given that the
maximum number of clusters is fixed, this mechanism makes space for new
clusters, and possibly helps adjust to concept drifts. The space and time
complexities of \mcnn are constant since the maximum number of clusters is
fixed. The reaction to concept drift is influenced by the participation
threshold and the error threshold: a higher participation threshold and a lower
error threshold increase the reaction speed to concept drift. Since the error
thresholds used in this study are small, we expect \mcnn to react quickly and
efficiently to concept drifts.
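The training update can be sketched as follows (our own simplified code: the cluster statistics are reduced to a running sum and the \texttt{split} routine is only declared; this is not the OrpailleCC implementation):
\begin{verbatim}
// Illustrative sketch of the MC-NN training update, not the OrpailleCC
// implementation: new points are absorbed by the nearest same-label
// cluster, and a cluster that accumulates too many errors is split.
#include <limits>
#include <vector>

struct Cluster {
  std::vector<double> sum;          // per-feature running sum (simplified)
  int label = 0, count = 0, errors = 0;

  void absorb(const std::vector<double>& p) {
    for (size_t i = 0; i < p.size(); ++i) sum[i] += p[i];
    ++count;
  }
  double distance(const std::vector<double>& p) const {
    double d = 0;
    for (size_t i = 0; i < p.size(); ++i) {
      double diff = p[i] - sum[i] / count;  // distance to the centroid
      d += diff * diff;
    }
    return d;
  }
};

void split(std::vector<Cluster>& clusters, Cluster& c);  // omitted here

void train(std::vector<Cluster>& clusters, const std::vector<double>& p,
           int label, int error_threshold) {
  Cluster *nearest = nullptr, *nearest_same = nullptr;
  double best = std::numeric_limits<double>::max(), best_same = best;
  for (Cluster& c : clusters) {
    double d = c.distance(p);
    if (d < best) { best = d; nearest = &c; }
    if (c.label == label && d < best_same) { best_same = d; nearest_same = &c; }
  }
  if (!nearest || !nearest_same) return;  // e.g., create a new cluster instead
  if (nearest == nearest_same) {
    nearest_same->absorb(p);           // nearest cluster has the right label
  } else {
    nearest->errors++;                 // both clusters receive an error
    nearest_same->errors++;
    nearest_same->absorb(p);
    if (nearest_same->errors > error_threshold)
      split(clusters, *nearest_same);  // split the cluster with too many errors
  }
}
\end{verbatim}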
We implemented two versions of \mcnn in OrpailleCC, differing in the way they
remove clusters during training. The first version (\mcnn Origin) is similar to
the mechanism described in~\cite{mc-nn}, based on participation scores. The
second version (\mcnn OrpailleCC) removes the cluster with the lowest
participation only when space is needed. A cluster slot is needed when an
existing cluster is split and there is no more slot available because the number
of active clusters already reached the maximum defined by the user.
\mcnn OrpailleCC has only one parameter, the error threshold after which a
cluster is split. \mcnn Origin has two parameters: the error threshold and the
participation threshold. The participation threshold is the limit below which a
cluster is removed. \revision{Hyperparameters used for \mcnn are shown in
Table~\ref{tab:mcnn-parameters}}. Additionally, both implementations have the cluster count as their
last parameter.
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
Number of clusters & Error threshold & Participation threshold \\
\hline
10 & 2 & 10 \\
20 & 10 & 10 \\
33 & 16 & 10 \\
40 & 8 & 10 \\
50 & 2 & 10 \\
\hline
\end{tabular}
\end{center}
\caption{\revision{Hyperparameters used for the \mcnn.}}
\label{tab:mcnn-parameters}
\end{table}
\subsubsection{Naïve Bayes~(\naivebayes)~\cite{naive_bayes}}
The \naivebayes algorithm keeps a table of counters for each feature value and
each label. During prediction, the algorithm assigns a score for each label
depending on how the data point to predict compares to the values observed
during the training phase. Since the counters are not biased toward the more
recent data points, we expect \naivebayes to adapt slowly, if at all, to
concept drift.
Two implementations of \naivebayes were used in this benchmark: the OrpailleCC
one and the \streamdmcpp one; the \streamdmcpp implementation uses a Gaussian
fit for numerical attributes. Using two implementations provides a comparison
reference between the two libraries.
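As a minimal sketch of this approach (our own illustrative code, assuming a Gaussian fit per feature and per class with smoothed priors; not the \streamdmcpp or OrpailleCC code):
\begin{verbatim}
// Minimal sketch of a streaming Gaussian naive Bayes (illustrative only):
// each (class, feature) pair keeps a count, a mean, and an M2 term
// updated with Welford's method; prediction sums the per-feature
// Gaussian log-likelihoods.
#include <cmath>
#include <vector>

struct Stat { double n = 0, mean = 0, m2 = 0; };

void update(Stat& s, double x) {                 // Welford's online update
  s.n += 1;
  double delta = x - s.mean;
  s.mean += delta / s.n;
  s.m2 += delta * (x - s.mean);
}

double log_likelihood(const Stat& s, double x) {
  const double TWO_PI = 6.283185307179586;
  double var = s.n > 1 ? s.m2 / (s.n - 1) : 1.0; // fallback for few samples
  double diff = x - s.mean;
  return -0.5 * (std::log(TWO_PI * var) + diff * diff / var);
}

// stats[c][f]: Gaussian statistics of feature f for class c.
int predict(const std::vector<std::vector<Stat>>& stats,
            const std::vector<double>& class_count,
            const std::vector<double>& x) {
  double total = 0;
  for (double c : class_count) total += c;
  int best = 0;
  double best_score = -1e300;
  for (size_t c = 0; c < stats.size(); ++c) {
    double score = std::log((class_count[c] + 1) / (total + stats.size()));
    for (size_t f = 0; f < x.size(); ++f)
      score += log_likelihood(stats[c][f], x[f]);
    if (score > best_score) { best_score = score; best = static_cast<int>(c); }
  }
  return best;
}
\end{verbatim}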
\subsubsection{Hoeffding Tree~(\hoeffdingtree)~\cite{VFDT}}
Similar to a decision tree, the \hoeffdingtree recursively splits the feature
space to maximize a metric, often the information gain or the Gini index.
However, to estimate when a leaf should be split, the \hoeffdingtree relies on
the Hoeffding bound, a measure of the score deviation of the splits. This
measure allows the leaf to decide when the best split is clearly better than the
second best split based on the data observed from the stream. This mechanism
prevents the tree from waiting until the end of the stream to ensure a split is
the best. During classification, a data point is sorted to a leaf, and a label
is predicted by aggregating the labels of the training data points in that leaf,
usually through majority voting or \naivebayes classification. We used this
classifier as implemented in \streamdmcpp. The \hoeffdingtree is common in data
stream classification; however, its internal nodes are static and cannot be
reconsidered. Therefore, any concept drift adaptation relies on the new leaves
created by future splits.
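For reference, the Hoeffding bound used in~\cite{VFDT} guarantees that, after observing $n$ values of a random variable with range $R$, the true mean does not deviate from the observed mean by more than
\[
  \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
\]
with probability $1-\delta$; a leaf is split once the difference between the metric of the best split and that of the second-best split exceeds $\epsilon$.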
The \hoeffdingtree has three parameters: the confidence level, the grace period,
and the leaf learner. The confidence level is the probability that the Hoeffding
bound makes a wrong estimation of the deviation. The grace period is the number
of processed data points before a leaf is evaluated for a split. The leaf
learner is the method used in the leaf to predict the label. In this study, we
used a confidence level of $0.01$ with a grace period of 10 data points and the
\naivebayes classifier as leaf learner.
\subsubsection{Feedforward Neural Network~(\FNN)}
\label{sec:method-fnn}
%We need a citation there
A neural network is a combination of artificial neurons, also known as
perceptrons, that all have input weights and an activation function. To predict
a class label, the perceptron applies the activation function to the weighted
sum of its input values. The output value of the perceptron is the result of
this activation function. This prediction phase is also called feed-forward. To
train the neural network, feed-forward is applied first, then the error between
the prediction and the expected result is used in the backpropagation process to
adjust the weights of the input values. A neural network combines multiple
perceptrons by connecting perceptron outputs to inputs of other perceptrons. In
this benchmark, we used a fully-connected \FNN, that is, a network where
perceptrons are organized in layers and all output values from perceptrons of
layer $n-1$ serve as input values for perceptrons of layer $n$. We used a
3-layer network with 120 inputs, 30 perceptrons in the hidden layer, and 33
output perceptrons. Because a \FNN takes many epochs to update and converge,
it barely adapts to concept drifts even though it trains with each new data
point.
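The feed-forward pass of such a fully-connected network can be sketched as follows (illustrative code only; the sigmoid activation is an assumption, not necessarily the one used in the benchmark):
\begin{verbatim}
// Minimal sketch of the feed-forward pass of a fully-connected layer
// (illustrative only, assuming a sigmoid activation): each perceptron
// outputs the activation of the weighted sum of the previous layer.
#include <cmath>
#include <vector>

std::vector<double> feed_forward(const std::vector<double>& input,
                                 const std::vector<std::vector<double>>& weights,
                                 const std::vector<double>& bias) {
  std::vector<double> output(weights.size());
  for (size_t j = 0; j < weights.size(); ++j) {       // one perceptron per row
    double sum = bias[j];
    for (size_t i = 0; i < input.size(); ++i)
      sum += weights[j][i] * input[i];
    output[j] = 1.0 / (1.0 + std::exp(-sum));         // sigmoid activation
  }
  return output;
}

// Prediction for the 120-30-33 network: two successive layers, then argmax.
int predict(const std::vector<double>& features,        // 120 input features
            const std::vector<std::vector<double>>& w1, // 30 x 120
            const std::vector<double>& b1,               // 30
            const std::vector<std::vector<double>>& w2, // 33 x 30
            const std::vector<double>& b2) {             // 33
  std::vector<double> hidden = feed_forward(features, w1, b1);
  std::vector<double> out = feed_forward(hidden, w2, b2);
  int best = 0;
  for (size_t c = 1; c < out.size(); ++c)
    if (out[c] > out[best]) best = static_cast<int>(c);
  return best;
}
\end{verbatim}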
In this study, we used the histogram features from~\cite{omid_2019} instead of the
ones presented in Section~\ref{sec:method-dataset} because the network performed
poorly with the latter. The histogram features produce 20 bins per axis.
This neural network can be tuned by changing the number of layers and the size
of each layer. Additionally, the activation function and the learning rate can
be changed. The learning rate controls by how much the weights change
during backpropagation.
\subsubsection{Hyperparameters Tuning}
For each classifier, we tuned hyperparameters using the first subject from the
\banosdataset dataset. The data from this subject was pre-processed like the rest
of the \banosdataset dataset (window size of one second, average and standard
deviation on the six axes of the right forearm sensor, $\ldots$). We did a grid
search to test multiple values for the parameters.
The classifiers start the prequential phase with no knowledge from the first
subject. We made an exception for the \FNN because we noticed that it performed
poorly with random weights and needed many epochs to achieve better
performance than a random classifier. Therefore, we decided to pre-train the
\FNN and re-use the weights as a starting point for the prequential phase.
For other classifiers, only the hyperparameters were taken from the tuning
phase. We selected the hyperparameters that maximized the F1 score on the first
subject.
\subsubsection{Offline Comparison}
We compared the data stream algorithms with an offline \knn. The value of $k$ was
selected using a grid search.
\subsection{Evaluation}
We computed four metrics: the F1 score, the memory
footprint, the runtime, and the power usage.
The F1 score and the memory
footprint were computed periodically during the
execution of a classifier. The
power consumption and the runtime were collected
at the end of each execution.
%Time to process one element.
%Explain the concept drift recovery time.
To compute the F1 score, we monitor the true positives, false positives, true
negatives, and false negatives using the prequential evaluation, meaning that
with each new data point the model is first tested and then trained. From these
counts, we compute the F1 score every 50 elements. We do not apply any fading
factor to attenuate errors throughout the stream. We compute the F1 score in a
one-versus-all fashion for each class, averaged across all classes
(macro-average, code available
\href{https://github.com/big-data-lab-team/benchmark-har-data-stream/blob/f314bf4a258e96e418e249228897d269c59cd522/src/main.cpp\#L81}{here}).
When a class has not been encountered yet, its F1 score is ignored. We use the
F1 score rather than the accuracy because the real datasets are imbalanced.
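The evaluation loop can be sketched as follows (a simplified sketch; the \texttt{Stream} and \texttt{Classifier} interfaces and the \texttt{report} function are placeholders, and the actual code is linked above):
\begin{verbatim}
// Simplified sketch of the prequential (test-then-train) loop with a
// macro-averaged F1 score reported every 50 data points.
#include <vector>

void report(long processed, double macro_f1);  // reporting, defined elsewhere

template <typename Classifier, typename Stream>
void prequential(Classifier& clf, Stream& stream, int n_classes) {
  std::vector<long> tp(n_classes), fp(n_classes), fn(n_classes);
  std::vector<bool> seen(n_classes, false);
  long count = 0;
  while (stream.has_next()) {
    auto [features, label] = stream.next();
    int predicted = clf.predict(features);       // test first...
    clf.train(features, label);                  // ...then train
    seen[label] = true;
    if (predicted == label) tp[label]++;
    else { fp[predicted]++; fn[label]++; }
    if (++count % 50 == 0) {
      double f1_sum = 0; int classes_seen = 0;
      for (int c = 0; c < n_classes; ++c) {
        if (!seen[c]) continue;                  // ignore unseen classes
        double precision = tp[c] + fp[c] ? double(tp[c]) / (tp[c] + fp[c]) : 0;
        double recall = tp[c] + fn[c] ? double(tp[c]) / (tp[c] + fn[c]) : 0;
        double f1 = precision + recall
                      ? 2 * precision * recall / (precision + recall) : 0;
        f1_sum += f1;
        classes_seen++;
      }
      report(count, f1_sum / classes_seen);      // macro-averaged F1
    }
  }
}
\end{verbatim}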
We measure the memory footprint by reading the file
\texttt{/proc/self/statm} every 50 data points.
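For illustration, the resident set size can be obtained from this file as follows (a minimal sketch; the first two fields of \texttt{statm} are the total program size and the resident set size, both in pages):
\begin{verbatim}
// Minimal sketch: read the memory footprint of the current process from
// /proc/self/statm. The first two fields are the total program size and
// the resident set size, both expressed in pages.
#include <fstream>
#include <unistd.h>

long resident_memory_bytes() {
  std::ifstream statm("/proc/self/statm");
  long size_pages = 0, resident_pages = 0;
  statm >> size_pages >> resident_pages;
  return resident_pages * sysconf(_SC_PAGESIZE);  // convert pages to bytes
}
\end{verbatim}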
The runtime of a classifier is the time needed for
the classifier to process the dataset. We
collect the runtime reported by the \textit{perf}
command\footnote{\textit{perf}
\href{https://perf.wiki.kernel.org/index.php/Main_Page}{website}}, which
includes loading the binary into memory, setting up data structures, and
opening the dataset file. To remove these overheads from our measurements,
we use the runtime of an empty classifier that always predicts class 0 as a baseline.
We measure the average power consumed by classification
algorithms with the \textit{perf} command. The power measurement is done
multiple times in a minimal environment. We use the empty classifier as a
baseline.
\subsection{Results Reproducibility}
\revision{
The code and datasets used in this study are available on GitHub at
\url{https://github.com/big-data-lab-team/benchmark-har-data-stream.git}. We
used release 1.0, available on Zenodo with DOI \href{https://doi.org/10.5281/zenodo.4148947}{10.5281/zenodo.4148947} for long-term archival.
The repository contains the source code and data we used to run the experiment, including the Makefile to compile the benchmark, the Python script
to organize the execution and plot the results, and the
datasets. This experiment requires the following software to run properly.
\begin{itemize}
\item \texttt{Git} to download the repository and its submodules.
\item \texttt{gcc} or any C++ compiler in combination with \texttt{make} to compile the benchmark binaries.
\item \texttt{log4cpp} as a dependency of \streamdmcpp.
\item \texttt{Python}, \texttt{seaborn}, \texttt{matplotlib}, and
\texttt{pandas}, to run the experiment and plot the results.
\end{itemize}
}
\revision{
The plotting script (\texttt{makefile.py}) extracts data from three CSV files:
\texttt{models.csv}, \texttt{output\_runs}, and \texttt{output}.
\texttt{models.csv} lists the combinations of classifier, dataset, and parameters
that were run; each such combination is called a model. The \texttt{output\_runs}
file stores information about each repetition of a model, such as the runtime or the
energy. Finally, the \texttt{output} file contains the accuracy, the F1 score,
and the memory footprint every fifty elements. Each line is identified with
three ids: the model id, the repetition count, and the data point count in the
dataset.
}
\revision{
In the README file, we provide the commands to download the repository, compile the
benchmark, run the experiment, and plot the results.
}
\revision{
During the benchmark execution, the datasets and output files were stored in
memory through an in-memory (memfs) filesystem mounted on \texttt{/tmp}, to reduce the impact
of I/O time. We averaged metrics across repetitions (same classifier, same
parameters, and same dataset).
The experiment was done on a single core of a cluster node with two Intel(R)
Xeon(R) Gold 6130 CPUs and 250~GB of main memory. The node was running CentOS
Linux 8 with Linux kernel 4.18.
}
% vim: tw=80 ts=2