autoscale: true
#[fit] End of Days
- Machine Learning
- Deep Learning
- Really, just thinking: the process
- learned how to create and regularize models
- learned how to optimize objective functions, such as loss functions, using Stochastic Gradient Descent
- learned how to create (simple) architectures to solve problems
- learned how to transfer from one model to the next
#[fit] Concepts running through:
- Fitting parameters vs hyperparameters
- Regularization
- Validation testing
- Stochastic Gradient Descent
- Learning representations
#[fit] Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
lr = LinearRegression().fit(Xtrain, ytrain)
r2_test = r2_score(ytest, lr.predict(Xtest))
r2_train = r2_score(ytrain, lr.predict(Xtrain))
from keras.layers import Input, Dense
from keras.models import Model

inputs_placeholder = Input(shape=(1,))
outputs_placeholder = Dense(1, activation='linear')(inputs_placeholder)
m = Model(inputs=inputs_placeholder, outputs=outputs_placeholder)
m.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mae'])
m.summary()
"data is a sample from an existing population"
- data is stochastic, variable; parameters fixed
- fit a parameter
- samples (or the bootstrap) induce a sampling distribution on any estimator (see the sketch after this list)
- an example of a very useful estimator: the MLE
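As a minimal sketch of that idea (with simulated data standing in for a real sample), bootstrap resampling recomputes an estimator, here the sample mean, which is the MLE for a Gaussian mean, many times to reveal its sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=200)   # stand-in for a real sample

def bootstrap_means(sample, n_boot=1000):
    # resample with replacement and recompute the estimator each time
    return np.array([rng.choice(sample, size=len(sample), replace=True).mean()
                     for _ in range(n_boot)])

boot = bootstrap_means(y)
print("estimate:", y.mean(), "bootstrap standard error:", boot.std())
```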
#[fit] Learning
# Statement of the Learning Problem
The sample must be representative of the population!
A: The empirical (in-sample) risk estimates the out-of-sample risk well. B: The empirical risk is small, thus the out-of-sample risk is also small.
# UNDERFIT (Bias) vs OVERFIT (Variance)
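A minimal sketch of the tradeoff, assuming 1-D numpy arrays `x` and `y` of data: fit polynomials of increasing degree and compare train and validation R². Low degrees underfit (high bias), very high degrees overfit (high variance):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

xtr, xval, ytr, yval = train_test_split(x.reshape(-1, 1), y, test_size=0.2)
for degree in (1, 3, 20):
    poly = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly.fit(xtr, ytr)
    # a big gap between train and validation score signals overfitting
    print(degree, poly.score(xtr, ytr), poly.score(xval, yval))
```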
#[fit] Classification
The sigmoid function $h(z) = 1/(1+e^{-z})$ is plotted below:

import numpy as np
import matplotlib.pyplot as plt

h = lambda z: 1./(1 + np.exp(-z))
zs = np.arange(-5, 5, 0.1)
plt.plot(zs, h(zs), alpha=0.5);
Identify: $$\renewcommand{\v}[1]{\mathbf #1} p_i = h(\v{w}\cdot\v{x_i})$$

Then, the conditional probabilities of $y_i = 1$ or $y_i = 0$ given a particular sample's features $\v{x_i}$ are:

$$P(y_i = 1 | \v{x_i}, \v{w}) = h(\v{w}\cdot\v{x_i})$$
$$P(y_i = 0 | \v{x_i}, \v{w}) = 1 - h(\v{w}\cdot\v{x_i})$$

These two can be written together as

$$P(y_i | \v{x_i}, \v{w}) = h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i})\right)^{(1-y_i)}$$

BERNOULLI!!

Multiplying over the samples we get:
$$\renewcommand{\v}[1]{\mathbf #1} P(y|\v{x},\v{w}) = \prod_{y_i \in \cal{D}} P(y_i|\v{x_i}, \v{w}) = \prod_{y_i \in \cal{D}} h(\v{w}\cdot\v{x_i})^{y_i} \left(1 - h(\v{w}\cdot\v{x_i}) \right)^{(1-y_i)}$$
Indeed, it's important to realize that a particular sample can be thought of as a draw from some "true" probability distribution.

Maximum likelihood estimation maximizes the likelihood of the sample $y$, or alternatively the log-likelihood. Thus:

$$\ell = \log P(y|\v{x},\v{w}) = \sum_{y_i \in \cal{D}} \left( y_i \log h(\v{w}\cdot\v{x_i}) + (1 - y_i) \log\left(1 - h(\v{w}\cdot\v{x_i})\right) \right)$$
[.autoscale: true]
The negative of this log-likelihood (NLL) is also called the cross-entropy:

$$NLL = - \sum_{y_i \in \cal{D}} \left( y_i \log h(\v{w}\cdot\v{x_i}) + (1 - y_i) \log\left(1 - h(\v{w}\cdot\v{x_i})\right) \right)$$
Gradient: $$\nabla_{\v{w}} NLL = \sum_{i} \v{x_i} \left( h(\v{w}\cdot\v{x_i}) - y_i \right) = \v{X}^T \left( \v{h} - \v{y} \right)$$

Hessian: $$H = \v{X}^T \, \mathrm{diag}\left( h_i (1 - h_i) \right) \v{X}$$, which is positive semi-definite, so the NLL is convex.
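As a minimal numpy sketch (assuming a feature matrix `X` that already includes a bias column, and labels `y` in {0, 1}), gradient descent on this NLL looks like:

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def nll(w, X, y):
    # cross-entropy / negative log-likelihood of the Bernoulli model
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def nll_grad(w, X, y):
    # gradient: X^T (h - y)
    return X.T @ (sigmoid(X @ w) - y)

def fit_logistic(X, y, lr=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * nll_grad(w, X, y) / len(y)
    return w
```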
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.utils import to_categorical
from wandb.keras import WandbCallback

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0
img_width, img_height = X_train.shape[1], X_train.shape[2]
num_classes, labels = 10, [str(i) for i in range(10)]

# one-hot encode outputs
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# create model
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

# fit the model
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test),
          callbacks=[WandbCallback(data_type="image", labels=labels, save_model=False)])
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(TfidfVectorizer(),
                         LogisticRegression())
grid = GridSearchCV(pipeline,
                    param_grid={'logisticregression__C': [.1, 1, 10, 100]}, cv=5)
grid.fit(text_train, y_train)
print("Score", grid.score(text_test, y_test))
#[fit] Metrics and
#[fit] Decision Theory
#[fit] Trees
- first do bagging
- then randomly choose features at each split
- the random forest squeezes out variance
- the big idea is ensembling
- boosting: use weak learners
- fit the residuals
- gradient descent in "data" space
- the inspiration for ResNets (see the sklearn sketch below)
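A minimal scikit-learn sketch of both ideas, assuming a hypothetical train/test split `Xtrain, ytrain, Xtest, ytest` from some classification dataset:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# bagging + random feature choice at each split: squeezes out variance
rf = RandomForestClassifier(n_estimators=200).fit(Xtrain, ytrain)

# boosting: shallow weak learners, each one fit to the residuals of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(Xtrain, ytrain)

print("RF test accuracy:", rf.score(Xtest, ytest))
print("GB test accuracy:", gb.score(Xtest, ytest))
```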
#[fit] Neural
#[fit] Networks
ONE POINT AT A TIME
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
Mini-batch: do a few points at a time (sketched below).
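A minimal sketch of the mini-batch variant, reusing the pseudocode names above (`data`, `params`, `evaluate_gradient`) plus a hypothetical `batch_size`:

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # the gradient is averaged over the mini-batch instead of a single example
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```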
# create model
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
model.add(Dense(config.hidden_nodes, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=config.optimizer,
              metrics=['accuracy'])
model.summary()

# fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=config.epochs,
          callbacks=[WandbCallback(data_type="image", labels=labels)])
- MLPs are crude
- we must help the computer
- thus we use CNNs and RNNs, which direct the representations
- directed representation learning is the basis for transfer learning
#[fit] Practicalities
- Exploding gradients: gradient clipping (see the sketch after this list)
- Vanishing (imploding) gradients: skip connections, ResNets, LSTMs, attention
- Curvature: momentum, gradient accumulation
- Heterogeneity across parameters: adaptive learning rates
- Initialization: uniform or normal, scaled to get unit variance
- Non-saturating bias initialization for ReLUs
- Feature normalization
- Batch norm
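A minimal Keras sketch of two of these knobs, gradient clipping via the optimizer's `clipnorm` and a `BatchNormalization` layer; the layer sizes and input shape are placeholders:

```python
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(BatchNormalization())            # normalize activations batch by batch
model.add(Dense(10, activation='softmax'))

# clip the gradient norm at 1.0 to guard against explosions
model.compile(optimizer=SGD(momentum=0.9, clipnorm=1.0),
              loss='categorical_crossentropy', metrics=['accuracy'])
```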
#[fit] First Overfit
model = Sequential()
model.add(Flatten(input_shape=(img_width, img_height)))
model.add(Dropout(config.dropout))
model.add(Dense(config.hidden_nodes, activation='relu'))
model.add(Dropout(config.dropout))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=config.optimizer,
              metrics=['accuracy'])
- Bagging
- Adversarial Examples
- Sparse Activations (on the activations, not the weights; see the sketch below)
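A minimal Keras sketch of sparse activations: an L1 activity regularizer penalizes a layer's outputs rather than its weights (the size and penalty strength are placeholders):

```python
from keras.layers import Dense
from keras import regularizers

# the L1 penalty on the layer's *output* pushes many activations to zero
sparse_layer = Dense(128, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5))
```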
#[fit] Convolutional
#[fit] Neural
#[fit] Networks
- 3,073 parameters for a single linear unit on CIFAR-10 (32×32×3 + 1)
- 150,529 for ImageNet (224×224×3 + 1)
- nearby pixels are related
- and there are symmetries
- striding
- padding
- pooling
- upsampling
- 1x1 convolutions
- global average pooling
- reshaping, then Dense layers (see the sketch after this list)
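A minimal Keras sketch tying several of these pieces together (strided and padded convolutions, pooling, a 1x1 convolution, and global average pooling) on MNIST-sized inputs; the filter counts are placeholders:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), strides=1, padding='same', activation='relu',
                 input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))                    # pooling halves the spatial size
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(32, (1, 1), activation='relu'))   # 1x1 convolution mixes channels
model.add(GlobalAveragePooling2D())                # one number per channel
model.add(Dense(10, activation='softmax'))
model.summary()
```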
#[fit] Language
#[fit] Modeling
#[fit] + Embeddings
from keras.preprocessing import text
from sklearn.linear_model import LogisticRegression
import numpy as np

tokenizer = text.Tokenizer(num_words=config.vocab_size)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_matrix(X_train)
X_test = tokenizer.texts_to_matrix(X_test)

bow_model = LogisticRegression()
bow_model.fit(X_train, y_train)
pred_train = bow_model.predict(X_train)
acc = np.sum(pred_train == y_train) / len(pred_train)
pred_test = bow_model.predict(X_test)
val_acc = np.sum(pred_test == y_test) / len(pred_test)
model = Sequential()
model.add(Embedding(config.vocab_size,
                    config.embedding_dims,
                    input_length=config.maxlen))
model.add(Conv1D(config.filters,
                 config.kernel_size,
                 padding='valid',
                 activation='relu'))
model.add(Flatten())
model.add(Dense(config.hidden_dims, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# OR: pretrained, frozen embeddings feeding an LSTM
model.add(Embedding(config.vocab_size, 100,
                    input_length=config.maxlen, weights=[embedding_matrix], trainable=False))
model.add(LSTM(config.hidden_dims, activation="sigmoid"))
- CNN on embeddings
- CNN-LSTM
- stacked deep LSTMs
- bidirectional LSTM (sketched below)
- CNN feeding into part of an LSTM (captioning)
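A minimal sketch of one of these, a bidirectional LSTM over the same embeddings, reusing the `config` names from the models above:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense

model = Sequential()
model.add(Embedding(config.vocab_size, config.embedding_dims,
                    input_length=config.maxlen))
# read the sequence in both directions and concatenate the two summaries
model.add(Bidirectional(LSTM(config.hidden_dims)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```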
### What other architectures could we use?
#[fit] Generative #[fit] Modeling
Not solvable analytically!

In Supervised Learning, Latent Variables are observed.

In other words, we can write the full-data likelihood:

$$p(x, z) = p(z) \, p(x | z)$$

In Unsupervised Learning, Latent Variables are hidden.

We can only write the observed-data likelihood:

$$p(x) = \sum_z p(x, z) = \sum_z p(z) \, p(x | z)$$

COMBINE: semi-supervised learning.
![inline](images/Screenshot 2019-06-23 10.53.49.png)
![inline](images/Screenshot 2019-06-23 10.55.20.png)
![inline](images/Screenshot 2019-06-23 10.54.46.png)
![inline](images/Screenshot 2019-06-23 10.55.01.png)
![inline](images/Screenshot 2019-06-23 10.56.15.png)
![inline](images/Screenshot 2019-06-23 10.56.24.png)
![inline](images/Screenshot 2019-06-23 10.56.43.png)
- simply not possible to do exact inference in large models
- inference in neural networks: understanding robustness, etc.
- hierarchical neural networks
- mixture density networks: mixture parameters are fitted using ANNs
- extension to generative semi-supervised learning
- variational autoencoders
- There is no substitute for coding.
- We are there to help. Please log onto https://discourse.univ.ai, so we can have conversations (use your github id).
- In this next week, choose one of the hacks from the course and work (more) on it.
- We will have a hack every 2 weeks (so 2 a month) until the next Basics course (over the next 3 months); we'll discuss details on Discourse.
- suggest topics if interested!
- We are looking for TAs for the next Basics course (hybrid on-site/online, October/November) and the Advanced course (online, November/December). Ping us if you want to do this!
- TAing is the best way to take your learning to a new level; nothing forces you to understand something like having to explain it.
- you will have the opportunity to develop material, and this will help your understanding
- TA training will be provided
- The first session of fast.ai
- Pattern Recognition and Machine Learning, Bishop
- Deep Learning, Eugene Charniak
- Deep Learning, Andrew Glassner
- Andrew Ng's course is an oldie but a goodie; see also his Machine Learning Yearning PDF.
- The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman
- more applications: super-resolution, image segmentation, etc.; U-Nets
- deep unsupervised learning: GANs, more on autoencoders, autoregressive models, flow models
- transfer learning in NLP with BERT, ELMo, and ULMFiT
- model deployment and experimentation
- seq2seq, attention
- Bayesian statistics