diff --git a/sheet11/sheet11solutions.ipynb b/sheet11/sheet11solutions.ipynb
index 30687fe..cb9e520 100644
--- a/sheet11/sheet11solutions.ipynb
+++ b/sheet11/sheet11solutions.ipynb
@@ -43,7 +43,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Concept learning aims at acquiring knowledge that allows to distinguish exemplars from non exemplars of a given category (concept). It can be formalized as learning a unary pradicate $p_c$ on the domain $X$ or equivalently an inidcator function $c:X\\to\\{0,1\\}$.\n",
+ "Concept learning aims at acquiring knowledge that allows one to distinguish exemplars from non-exemplars of a given category (concept). It can be formalized as learning a unary predicate $p_c$ on the domain $X$ or, equivalently, an indicator function $c:X\\to\\{0,1\\}$.\n",
 "\n",
 "Concept learning is usually supervised: the teacher tells the learner if an example falls under the concept or not.\n",
 "\n",
@@ -59,19 +59,19 @@
 ]
 },
 {
- "cell_type": "raw",
+ "cell_type": "markdown",
 "metadata": {},
 "source": [
- "1. Initialize $h$ to the most specific hypothesis in H.\n",
- "2. For each positive training instance x do\n",
- " For each attribute constraint $a_i$ in h do\n",
- " If ($a_$i is not satisfied by x) then\n",
- " Replace $a_i$ in h by the next more general constraint\n",
- " that is satisfied by x.\n",
- " End if\n",
- " End for\n",
- " End for\n",
- "3. Output h.\n",
+ " 1. Initialize $h$ to the most specific hypothesis in H.\n",
+ " 2. For each positive training instance x do\n",
+ " For each attribute constraint $a_i$ in h do\n",
+ " If ($a_i$ is not satisfied by x) then\n",
+ " Replace $a_i$ in h by the next more general constraint\n",
+ " that is satisfied by x.\n",
+ " End if\n",
+ " End for\n",
+ " End for\n",
+ " 3. Output h.\n",
 "\n",
 "Inductive Bias: The target concept can be described in its hypothesis space (in our case: it is a conjunction of features). All instances are negative instances unless demonstrated otherwise.\n",
 "\n",
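The Find-S loop above can be written out in a few lines of Python. This is only a minimal sketch: representing the most specific constraint by `None` and the fully general constraint by `'?'` is an illustrative encoding, not something prescribed by the sheet.

```python
def find_s(positive_examples):
    """Find-S: generalize the most specific hypothesis just enough
    to cover every positive training instance."""
    examples = list(positive_examples)
    # Most specific hypothesis: no attribute value is allowed yet.
    h = [None] * len(examples[0])
    for x in examples:
        for i, value in enumerate(x):
            if h[i] is None:        # first positive example fixes the attribute
                h[i] = value
            elif h[i] != value:     # constraint violated -> next more general constraint
                h[i] = '?'
    return h

# Toy usage: two positive instances of a made-up concept.
print(find_s([('sunny', 'warm', 'high'), ('sunny', 'cold', 'high')]))
# -> ['sunny', '?', 'high']
```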
@@ -113,7 +113,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Overfitting means an overly specific adaptation of the learner to the training data. Not only the general structure of the training data has been learnt, but also its specific noise, i.e. artifacts, are learnt and hence the learner looses the capability to generalize and work on other data.\n",
+ "Overfitting means an overly specific adaptation of the learner to the training data. Not only the general structure of the training data is learned, but also its specific noise, i.e. artifacts, and hence the learner loses the capability to generalize and work on other data.\n",
 "\n",
 "Overfitting can be detected by using a separate test data set. If the error on the test data increases during training, this indicates overfitting."
 ]
 },
 {
@@ -151,13 +151,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Entropy measures the inhomogeneity of a data set (as minimal number of bits needed to encode elements from the set) \n",
+ "Entropy measures the inhomogeneity of a data set (the minimal number of bits needed to encode elements from the set) \n",
 "$$E(S) = -p_{+}\\log_2 p_{+} - p_{-}\\log_2p_{-}$$\n",
- "where $p_{+}$ denotes the fraction of positive and $p_{-}$ that of negative examples in the data set. A set $S$ with only postive (or only negative) examples would have no entropy (i.e. $E(S)=0$), while a set with the same number of positive and negative examples has maximal entropy ($E(S)=1$).\n",
+ "where $p_{+}$ denotes the fraction of positive and $p_{-}$ that of negative examples in the data set. A set $S$ with only positive (or only negative) examples would have no entropy (i.e. $E(S)=0$), while a set with the same number of positive and negative examples has maximal entropy ($E(S)=1$).\n",
 "\n",
- "Information gain is the expected reduction in entropy due to splitting the data set $S$ based on one attribute $A$: denote for every value $v\\in\\operatorname{Values}(A)$ the subset of elments from $S$ where $A=v$ by $S_v$. Then the information gain is given by\n",
+ "Information gain is the expected reduction in entropy due to splitting the data set $S$ based on one attribute $A$: denote for every value $v\\in\\operatorname{Values}(A)$ the subset of elements from $S$ where $A=v$ by $S_v$. Then the information gain is given by\n",
 "$$\\operatorname{Gain}(S,A) = E(S) - \\sum_{v\\in\\operatorname{Values}(A)}E(S_v)\\cdot\\frac{|S_v|}{|S|}$$\n",
- "that is, from the entropy of $S$ the entropy values for $S_v$ are substracted, weighted by their respective sizes. If the subsets $S_v$ are all homogeneous ($E(S_v)=0$), then the information gain is maximal, namely $E(S)$, i.e. the data set can be fully explained by the single attribute $A$. On the other hand, if all $S_v$ have maximal entropy, no information is gained by splitting based on $A$. In practice, something between these extremes will be the case.\n",
+ "that is, the entropy values of the subsets $S_v$, weighted by their relative sizes, are subtracted from the entropy of $S$. If the subsets $S_v$ are all homogeneous ($E(S_v)=0$), then the information gain is maximal, namely $E(S)$, i.e. the data set can be fully explained by the single attribute $A$. On the other hand, if all $S_v$ have maximal entropy, no information is gained by splitting based on $A$. In practice, something between these extremes will be the case.\n",
 "\n",
 "ID3 places the node with highest information gain at the root of the decision tree."
 ]
 },
 {
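The entropy and information-gain formulas can be checked numerically; the sketch below uses a made-up four-example data set purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy E(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(row[attribute] for row in rows):
        subset = [label for row, label in zip(rows, labels) if row[attribute] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
rows   = [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]
labels = ['+', '+', '-', '-']
print(information_gain(rows, labels, 0))  # 1.0 -> maximal gain, equals E(S)
print(information_gain(rows, labels, 1))  # 0.0 -> no gain
```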
@@ -200,7 +200,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "An outlier is a value that seemingly does not belong to the rest of the data. It is probably caused by some measurement error (but it may also reflect some real phenomenon).\n",
 "\n",
- "A simple method to detect outliers is to consider their distance from the mean (or median) of the full data set. If this is too large (e.g. greater than 3 standard deviations), the data point is considered to be an outlier (z-test). The Rosner test iteratively removes those outliers until the dataset does not contain anymore of them."
+ "A simple method to detect outliers is to consider each point's distance from the mean (or median) of the full data set. If this distance is too large (e.g. greater than 3 standard deviations), the data point is considered to be an outlier (z-test). The Rosner test iteratively removes such outliers until the data set no longer contains any."
 ]
 },
 {
@@ -239,7 +239,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Single-linkage clustering is based on the *minimum distance* that defines the distance between two clusters from the distance of their closests points. Single-linkage clustering tends to chaining.\n",
+ "Single-linkage clustering is based on the *minimum distance* that defines the distance of two clusters to be the distance of their two closest points. Single-linkage clustering tends towards chaining.\n",
 "\n",
 "Complete-linkage clustering is based on the *maximum distance* that defines the distance of two clusters to be the maximal distance of two of their points. Complete linkage clustering prefers compact clusters."
 ]
 },
 {
@@ -259,7 +259,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "* Hamming distance: the number of positions where two strings of equal length differ\n",
 "* Chebyshev distance (also: maximum distance): maximal absolute difference in a single coordinate.\n",
- "* p-norm: family of norms, defined by the formula $\\sqrt[p](\\sum_{i=1}^{L}|x_i-y_i|^p)$. important special cases: city block (aka Manhattan, p=1), euclidean distance (p=2)\n",
+ "* p-norm: family of norms, defined by the formula $\\sqrt[p]{\\sum_{i=1}^{L}|x_i-y_i|^p}$. Important special cases: city block (aka Manhattan, $p=1$), Euclidean distance ($p=2$)\n",
 "* Jaccard distance: for binary attributes\n",
 "\n",
 "Metric axioms for Chebyshev distance $d(\\mathbf{x},\\mathbf{y}) := \\max_{i=1,\\ldots,L}|x_i-y_i|$:\n",
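For concreteness, the listed distance measures can be evaluated directly with NumPy; the vectors below are arbitrary examples.

```python
import numpy as np

x = np.array([1.0, 4.0, -2.0])
y = np.array([3.0, 1.0, -2.0])

hamming   = np.sum(x != y)                    # number of positions where the vectors differ
chebyshev = np.max(np.abs(x - y))             # maximal absolute coordinate difference
manhattan = np.sum(np.abs(x - y))             # p = 1 (city block)
euclidean = np.sqrt(np.sum((x - y) ** 2))     # p = 2
p = 3
p_norm    = np.sum(np.abs(x - y) ** p) ** (1 / p)

print(hamming, chebyshev, manhattan, euclidean, p_norm)
```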
@@ -346,7 +346,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "The covariance matrix contains the covariance values for all pairs of coordinates. A positive covariance value means that high values for the first coordinate correspond to highe values for the second coordinate. A negative covariance value expresses a correspondence of high values in the first coordinate with low values in the second coordinate. A value of $0$ means, that the values of the two coordinates do not correspond to each other.\n",
+ "The covariance matrix contains the covariance values for all pairs of coordinates. A positive covariance value means that high values for the first coordinate correspond to high values for the second coordinate. A negative covariance value expresses a correspondence of high values in the first coordinate with low values in the second coordinate. A value of $0$ means that the values of the two coordinates do not correspond to each other.\n",
 "\n",
 "Given a set of $n$ data points in a $d$-dimensional data space as an $n\\times d$-matrix $D$, the covariance matrix is computed as $$C=(D-\\mu)^T\\cdot(D-\\mu)$$ where $\\mu$ denotes the mean vector of the data set.\n",
 "\n",
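The matrix formula above can be reproduced numerically. Note that the sample covariance is usually normalized by $n$ or $n-1$, which the formula in the cell omits; the sketch below uses $n-1$ and made-up data.

```python
import numpy as np

# Toy data set: n = 4 points in d = 2 dimensions.
D = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [6.0, 5.0]])

mu = D.mean(axis=0)                          # mean vector of the data set
C  = (D - mu).T @ (D - mu) / (len(D) - 1)    # sample covariance matrix

print(C)
print(np.cov(D, rowvar=False))               # NumPy's built-in gives the same result
```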
@@ -373,11 +373,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "* The *multilayer perceptron* (MLP) consists of multiple layers of nodes through which activation is fet forward to compute an output vector to a given input pattern. It usually uses some non-linear activation function in each node and can be trained by a form of error gradient descent called back propagation.\n",
+ "* The *multilayer perceptron* (MLP) consists of multiple layers of nodes through which activation is fed forward to compute an output vector for a given input pattern. It usually uses some non-linear activation function in each node and can be trained by a form of error gradient descent called backpropagation.\n",
 "\n",
- "* A *radial basis function network* (RBF) can be considered as a threee layer network: a given input pattern activates the hidden layer using a radial activation function. The output value is then determined as a linear combination of these values. In contrast to the MLP, the RBF can be considered as a local classifier.\n",
+ "* A *radial basis function network* (RBFN) can be considered as a three-layer network: a given input pattern activates the hidden layer using a radial activation function. The output value is then determined as a linear combination of these values. In contrast to the MLP, the RBFN can be considered as a local classifier.\n",
 "\n",
- "* A *self-organizing map* (SOM) is a two layer architecture, in which a high-dimensional input space is connected to a low-dimension grid. The SOM learns a discretized, low dimension representation of the input data. In contrast to MLP and RBF, the SOM is an unsupervised approach."
+ "* A *self-organizing map* (SOM) is a two-layer architecture in which a high-dimensional input space is connected to a low-dimensional grid. The SOM learns a discretized, low-dimensional representation of the input data. In contrast to MLP and RBFN, the SOM is an unsupervised approach."
 ]
 },
 {
@@ -442,7 +442,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "A model is termed local, if the adaptation of model parameters only has local effects, i.e. it will only affect a subset of input values, located close to each other in the input space. In contrast, changing a parameter of a global model may effect all input values. Hence, local methods are considered to be more robust during training, as single (faulty) traning examples only affect a part of the system. Furthermore, such methods may be better to manage, as the effect of a single parameter is easier to understand."
+ "A model is termed local if the adaptation of model parameters only has local effects, i.e. it will only affect a subset of input values, located close to each other in the input space. In contrast, changing a parameter of a global model may affect all input values. Hence, local methods are considered to be more robust during training, as single (faulty) training examples only affect a part of the system. Furthermore, such methods may be easier to manage, as the effect of a single parameter is easier to understand."
 ]
 },
 {
@@ -458,7 +458,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "RBFN are a local method as each hidden neuron has a local area of responsibility. In contrast, in a MLP is global, as changing a single weight may change the input-output mapping for all input patterns."
+ "RBFNs are a local method, as each hidden neuron has a local area of responsibility. In contrast, an MLP is global, as changing a single weight may change the input-output mapping for all input patterns."
 ]
 },
 {
@@ -497,7 +497,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "A classifier assigns a class to an entity based on its attributes (attributes might be color, height, weight, shape, ..., classes might be car, house, person, banana, yes, ...). Formaly, a classifier is a function $c:X\\to C$ that assigns a class $c(x)\\in C$ to every object $x\\in X$. Hence, a concept is a special classifier with only two classes $C=\\{\\operatorname{true},\\operatorname{false}\\}$."
+ "A classifier assigns a class to an entity based on its attributes (attributes might be color, height, weight, shape, ..., classes might be car, house, person, banana, yes, ...). Formally, a classifier is a function $c:X\\to C$ that assigns a class $c(x)\\in C$ to every object $x\\in X$. Hence, a concept is a special classifier with only two classes $C=\\{\\operatorname{true},\\operatorname{false}\\}$."
 ]
 },
 {
@@ -518,13 +518,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "| Classifier | Biases and Assumptions | Separatrices | Sensitivity | Locality | Parameters | Speed |\n",
 "|----------------------|------------------------|--------------|-------------|----------|------------|-------|\n",
 "| Euclidean classifier | ? | linear | sensitive to far outliers | global | none | very fast |\n",
- "| Linear discriminant analysis | central limit theorem | linear | sensitive to far outliers | global | none | very fast |\n",
- "| Quadratic classifier | ? | conic: e.g. hyperbola, parabola, ellipsis, line | sensitive to outliers | global | none | fast |\n",
+ "| Linear discriminant analysis | normally distributed data with equal covariances | linear | sensitive to far outliers | global | none | very fast |\n",
+ "| Quadratic classifier (e.g. QDA) | ? | conic: e.g. hyperbola, parabola, ellipse, line | sensitive to outliers | global | none | fast |\n",
 "| Polynom classifier | ? | almost arbitrary | overfitting for high degrees | global | polynomial degree | fast |\n",
- "| Nearest neighbor classifier | ? | implicit: neighbors | distance function | local | number of neighbors $k$ | $\\mathcal{O}(N^2)$ |\n",
- "| SVM | mercer's condition | high dimensional hyperplane, nonlinear in data space | ? | global | input mapping, kernel function | efficient |\n",
- "| MLP (not binary) | smooth interpolation | almost arbitrary | noise sensitive | global | activation functions, learning rate | slow |\n",
- "| RBFN (not binary) | locality in data/clusters | ellipses/circular | robust to noise | local | regions of responsibility, activation functions (output layer), learning rate | comparably slow |"
+ "| Nearest neighbor classifier | ? | implicit: neighbors (Voronoi cells around training data) | distance function | local | number of neighbors $k$ | $\\mathcal{O}(N)$ (instant training, linear classification) |\n",
+ "| Bayesian classifier | ? | discriminant functions (probability distributions) | overlapping classifications (only probabilities), noise is modeled | global | none | varies (underlying data and method for the discriminant functions, see ML-09 Slides 5f) |\n",
+ "| MLP (not necessarily binary) | smooth interpolation | almost arbitrary | noise sensitive | global | activation functions, learning rate | slow |\n",
+ "| RBFN (not necessarily binary) | locality in data/clusters | ellipses/circular | robust to noise | local | regions of responsibility, learning rate | comparably slow |\n",
+ "| SVM | Mercer's condition, input mapping, kernel function | high-dimensional hyperplane, nonlinear in data space | handles noise with slack variables | global | none | efficient |"
 ]
 },
 {
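As an illustration of the nearest-neighbor row of the table (training merely stores the labelled data, classification is one linear pass over the $N$ stored points), a minimal 1-NN sketch on made-up data:

```python
import numpy as np

# "Training" just stores the labelled data points.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(['A', 'A', 'B', 'B'])

def predict_1nn(x):
    """Classify x by the label of its nearest training point (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # one pass over all N stored points
    return y_train[np.argmin(distances)]

print(predict_1nn(np.array([0.2, 0.1])))  # 'A'
print(predict_1nn(np.array([0.8, 0.9])))  # 'B'
```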
@@ -591,7 +592,7 @@
 "metadata": {},
 "source": [
 "### c) \n",
- "What does the Q-function express in reinforcement learning and how is it used?"
+ "What does the $Q$-function express in reinforcement learning and how is it used?"
 ]
 },
 {
@@ -648,7 +649,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Naive Bayes is a probabilistic classifier, i.e. a classifier that instead of a class assignment $x\\mapsto c(x)\\in C$ provides a probability value $P(C=c\\mid X=x)$. The naive Bayes classifier applies Bayes theorem to compute the posterior (diagnostical) probability $P(C\\mid X)$ from likelihood values $P(X\\mid C)$ (and prior $P(X)$) that have been learned from training data. Naivity refers to the simplifying assumption that the different features $X_1,\\ldots,X_n$ are conditional independent, given the class $C$, an assumption that may not be true in general but proofs to work well in many applications."
+ "Naive Bayes is a probabilistic classifier, i.e. a classifier that instead of a class assignment $x\\mapsto c(x)\\in C$ provides a probability value $P(C=c\\mid X=x)$. The naive Bayes classifier applies Bayes' theorem to compute the posterior (diagnostic) probability $P(C\\mid X)$ from likelihood values $P(X\\mid C)$ (and the prior $P(C)$) that have been learned from training data. Naivety refers to the simplifying assumption that the different features $X_1,\\ldots,X_n$ are conditionally independent given the class $C$, an assumption that may not be true in general but proves to work well in many applications."
 ]
 },
 {
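A minimal sketch of the naive Bayes computation on categorical features (toy counts, no smoothing), showing how the conditional-independence assumption turns the class-conditional likelihood into a product over features:

```python
from collections import Counter, defaultdict

# Toy training data: each row is (feature_1, feature_2), plus a class label.
X = [('sunny', 'hot'), ('sunny', 'cold'), ('rainy', 'cold'), ('rainy', 'hot')]
y = ['yes', 'no', 'no', 'no']

priors = Counter(y)                         # counts for P(C)
likelihood = defaultdict(Counter)           # counts for P(X_i = v | C)
for features, c in zip(X, y):
    for i, v in enumerate(features):
        likelihood[(c, i)][v] += 1

def posterior_scores(features):
    """Unnormalized P(C | X) = P(C) * prod_i P(X_i | C) under the naive assumption."""
    scores = {}
    for c, count_c in priors.items():
        p = count_c / len(y)
        for i, v in enumerate(features):
            p *= likelihood[(c, i)][v] / count_c
        scores[c] = p
    return scores

print(posterior_scores(('sunny', 'hot')))   # class 'yes' gets the higher score
```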