http://www.kaggle.com/c/yoy-mimics-2022
On this page we will briefly tell you how to:
- Create a notebook and attach the dataset
- Build a dataset pipeline
- Build your model
- Evaluate your results
- Submit your predictions
Please take a peek at the following notebooks for working examples of the code:
The notebooks and the help on this page mostly use TensorFlow.
If you are new to Kaggle you can get started by:
A) Going to the competition page (linked above)
B) and then clicking < > New Notebook
Attach the Dataset of Butterfly Mimics:
C) Tap "Add data" in the notebook editor
D) Type "Butterfly Mimics" in the search window
E) Click "Add" on the 2022 Dataset of Butterfly Mimics banner
Although the dataset used in this competition is small, it is good practice to use tools that are optimized to run CPU and GPU tasks with minimal bottlenecking. In TensorFlow that means tf.data.Dataset. The following tips may help:
- Set AUTOTUNE = tf.data.experimental.AUTOTUNE and use it where it is allowed, for example num_parallel_calls=AUTOTUNE.
- Make your Dataset with from_tensor_slices() and then use map() to load the feature matrix. The function that does the mapping should call tf.numpy_function(); wrapping the Python loader this way is the secret to boosting performance during training. It will look something like this:
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Build the Dataset from the dataframe, mapping the loader in parallel.
images_ds = tf.data.Dataset.from_tensor_slices(images_df)
images_ds = images_ds.map(get_feature_and_label, num_parallel_calls=AUTOTUNE)
⋮
# Inside get_feature_and_label(): hand the filename and label to the
# plain-Python loader via tf.numpy_function().
features_labels = tf.numpy_function(load_jpg, [x, y], [tf.float32, tf.float32, tf.string])
⋮
Doing this will allow TensorFlow to prefetch batches and make optimal use of the CPU.
Since in the "2022 Butterfly Mimics" dataset all the butterfly photos for training are stored in one folder, /images, and all the testing photos are stored in another, /image_holdouts, we load the CSV files first to get the photo filenames and then load the photos.
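As a rough sketch of how those pieces fit together, the pipeline might look like the code below. The CSV path, column names, image size, batch size, and the body of load_jpg are assumptions for illustration, not the baseline notebook's exact code:

import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image

AUTOTUNE = tf.data.experimental.AUTOTUNE
IMAGE_DIR = "/kaggle/input/butterfly-mimics/images/"  # hypothetical path
IMAGE_HEIGHT = IMAGE_WIDTH = 224                      # hypothetical size

# Hypothetical CSV layout: an "image" id column and a "name" class column.
images_df = pd.read_csv("/kaggle/input/butterfly-mimics/images.csv")
labels = pd.get_dummies(images_df["name"]).values.astype(np.float32)  # one-hot

def load_jpg(image_id, label):
    # Plain Python/NumPy -- exactly the kind of code tf.numpy_function() wraps.
    image = Image.open(IMAGE_DIR + image_id.decode() + ".jpg")
    return np.asarray(image, dtype=np.float32) / 255.0, label, image_id

def get_feature_and_label(x, y):
    feature, label, _ = tf.numpy_function(
        load_jpg, [x, y], [tf.float32, tf.float32, tf.string])
    # numpy_function loses shape information, so restore it for Keras.
    feature.set_shape([IMAGE_HEIGHT, IMAGE_WIDTH, 3])
    label.set_shape([labels.shape[1]])
    return feature, label

images_ds = tf.data.Dataset.from_tensor_slices(
    (images_df["image"].tolist(), labels))
images_ds = images_ds.map(get_feature_and_label, num_parallel_calls=AUTOTUNE)
train_ds = images_ds.batch(32).prefetch(AUTOTUNE)  # prefetch keeps the GPU fed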
The "baseline" notebook makes the following calls.
import tensorflow as tf

# Assemble the graph defined by inputs/outputs into a trainable model.
butterfly_model = tf.keras.Model(inputs, outputs, name=MODEL_NAME)
⋮
butterfly_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss=tf.keras.losses.categorical_crossentropy,
    metrics=['accuracy']
)
⋮
# Note: shuffle is ignored when train_ds is a tf.data.Dataset;
# shuffle the Dataset itself if you need it.
fit_history = butterfly_model.fit(
    train_ds,
    shuffle=True,
    epochs=EPOCHS,
    callbacks=[stop_early],
    validation_data=validate_ds,
    verbose=1
)
Tweaking the hyperparameters LEARNING_RATE and EPOCHS is done through the calls shown here.
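For reference, those names might be defined along the following lines; the values and the EarlyStopping settings are illustrative assumptions, not the baseline's:

# Hypothetical starting values -- tune LEARNING_RATE and EPOCHS between runs.
LEARNING_RATE = 1e-3
EPOCHS = 30
MODEL_NAME = "butterfly_mimics"
# Stop training early once validation loss stops improving.
stop_early = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)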
This simple framework is deceptive in that it hides the layers and all the work that is being done between the inputs and outputs in the call to tf.keras.Model().
Let's look at the inputs and outputs:
from tensorflow.keras.layers import Dense

inputs = tf.keras.Input(shape=(IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_DEPTH))
x = inputs
⋮
# The final layer outputs one probability per butterfly class.
outputs = Dense(class_count, activation='softmax')(x)
butterfly_model = tf.keras.Model(inputs, outputs, name=MODEL_NAME)
There are many, many different ways to build the layers of the model, as long as they start with a shape matching the image and finish with a vector of six elements, one for each class of butterfly.
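For example, the elided middle could be a small convolutional stack like the sketch below; this is an illustrative architecture with arbitrary layer sizes, not the baseline's actual layers:

from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D

x = inputs
x = Conv2D(32, 3, activation='relu')(x)  # low-level edge and color features
x = MaxPooling2D()(x)
x = Conv2D(64, 3, activation='relu')(x)  # higher-level wing-pattern features
x = MaxPooling2D()(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)                      # guard against overfitting a small dataset
outputs = Dense(class_count, activation='softmax')(x)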
After your model has finished training, there are a number of things we can do to see how it did.
Using your validation data, so that you have y_truth, run predict() to get y_pred.
from sklearn.metrics import classification_report, confusion_matrix, fbeta_score

# predict() returns one softmax probability vector per image, so take the
# argmax to compare class indices against the one-hot y_truth.
y_pred = butterfly_model.predict(validate_ds)
fbeta_score(y_truth.argmax(axis=1), y_pred.argmax(axis=1), beta=1, average='micro')
> F1 score: 0.
classification_report(y_truth.argmax(axis=1), y_pred.argmax(axis=1))
The Confusion Matrix can be fancy or plain:
confusion_matrix(y_truth.argmax(axis=1), y_pred.argmax(axis=1))
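For the fancy version, a heatmap works well. This sketch assumes matplotlib and seaborn are available, and that class_names (a hypothetical name) lists the six classes:

import matplotlib.pyplot as plt
import seaborn as sns

# Render the confusion matrix as an annotated heatmap.
cm = confusion_matrix(y_truth.argmax(axis=1), y_pred.argmax(axis=1))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()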
This is the final step. Run predict() on the test dataset.
y_predictions = butterfly_model.predict(test_ds)
Zip the y_predictions with the X (the image ids) in a pandas DataFrame, and you can use it to create the CSV with:
submit_df.to_csv("submission.csv", header=True, index=False)
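Putting those pieces together might look like the sketch below. Here test_image_ids and CLASS_NAMES are hypothetical names for the holdout filenames (in test_ds order) and the six class labels used during training, and the sketch assumes the submission format expects one probability column per class; check the competition's Evaluation page for the exact format:

import pandas as pd

# One probability column per class, with the image id in the first column.
submit_df = pd.DataFrame(y_predictions, columns=CLASS_NAMES)
submit_df.insert(0, "image", test_image_ids)
submit_df.to_csv("submission.csv", header=True, index=False)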
Copyright © 2022 Keith Pinson