A series of scripts to use machine learning to find and extract covers from comic books.
Most comic book files have the cover as the first page, but many contain multiple covers. Sometimes these are at the start, sometimes at the end, and sometimes they are spread throughout the book. The goal of this project is to run it against a directory full of comic book files and have it extract all of the covers, so that they can be used to generate a cool collage (see the examples section below).
Collages were built using John's Background Switcher.
Some manually selected non-cover pages were added as well to give the collages a bit more variety.
MLE_1_Feature_Engineering.py is the first main file. Given a folder, it recursively searches it for comic files (cbr/cbz) and builds a feature set for each page/image in each file.
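The recursive search step could look roughly like the sketch below. This is not the code from MLE_1_Feature_Engineering.py; the function name `find_comic_files` and the constant `COMIC_EXTENSIONS` are made up for illustration, and it only locates the files without opening them.

```python
from pathlib import Path

# Hypothetical constant: the two comic archive extensions named above.
COMIC_EXTENSIONS = {".cbr", ".cbz"}


def find_comic_files(root):
    """Return every .cbr/.cbz file under root, sorted for stable output."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare the lowercased suffix so "Issue.CBZ" is also picked up.
        if path.is_file() and path.suffix.lower() in COMIC_EXTENSIONS
    )
```

Calling `find_comic_files("comics")` returns a list of `Path` objects that can then be fed one at a time into the feature-building step.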
The features we are using are as follows:
- File Name
- Whether the file name contains "Variant"
- Image Height
- Image Width
- Number of continuous horizontal black lines in the image
- Number of continuous horizontal white lines in the image
- Number of white pixels in the image
- Number of black pixels in the image
- OCR word count for the image
- Whether the OCR found the word "Variant"
- Whether the OCR found the word "Marvel"
- Whether OpenCV thinks it saw the Marvel logo
- OpenCV confidence score for seeing the Marvel logo
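As a rough illustration of the pixel and line features in the list above, here is a minimal sketch that works on a plain 2-D list of grayscale values so it runs without any imaging library. It makes several assumptions that are not taken from the scripts: a "line" is a row in which every pixel is black (or white), the thresholds `black_max` and `white_min` are arbitrary, and the names `pixel_features` and `filename_has_variant` are hypothetical.

```python
def pixel_features(gray_rows, black_max=10, white_min=245):
    """Count black/white pixels and fully black/white rows.

    gray_rows is a list of rows, each a list of 0-255 intensity values.
    The thresholds are illustrative assumptions, not the project's values.
    """
    black_pixels = white_pixels = 0
    black_lines = white_lines = 0
    for row in gray_rows:
        blacks = sum(1 for value in row if value <= black_max)
        whites = sum(1 for value in row if value >= white_min)
        black_pixels += blacks
        white_pixels += whites
        # Assumption: a row counts as a "line" only if every pixel matches.
        if row and blacks == len(row):
            black_lines += 1
        if row and whites == len(row):
            white_lines += 1
    return {
        "height": len(gray_rows),
        "width": len(gray_rows[0]) if gray_rows else 0,
        "black_lines": black_lines,
        "white_lines": white_lines,
        "black_pixels": black_pixels,
        "white_pixels": white_pixels,
    }


def filename_has_variant(file_name):
    """Case-insensitive check for the word "Variant" in a file name."""
    return "variant" in file_name.lower()
```

The OCR and logo-detection features are left out here because they depend on external tools whose exact usage in the scripts is not described.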
Output csv looks like this:
MLE_2_Classifier_Testing_And_Comparison.py is the second main file. Given a training data set, it splits it 80:20 into training and test sets, then runs several different classifiers on those two sets and measures their performance.
Key metrics we are measuring are Accuracy, Precision, Recall, F1 and Logistic Loss.
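For reference, the five metrics can be computed by hand as in the sketch below. It assumes a binary problem where 1 means "cover" and 0 means "not a cover", and that the classifier outputs a probability for the cover class; the function name `binary_metrics` and the 0.5 threshold are illustrative choices, not taken from the script.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1 and log loss for binary labels.

    y_true holds 0/1 labels (assumption: 1 = cover); y_prob holds the
    predicted probability of class 1 for each sample.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities so log(0) never occurs.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }
```

Precision penalizes non-cover pages that get flagged as covers, recall penalizes covers that are missed, and log loss also takes into account how confident each prediction was.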
The classifiers tested are:
- RandomForestClassifier
- GradientBoostingClassifier
- LinearDiscriminantAnalysis
- AdaBoostClassifier
- SVC
- GaussianNB
- DecisionTreeClassifier
- KNeighborsClassifier
The results of the tests looked like this:
Overall, GradientBoostingClassifier was found to be the best option for this use case.
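A comparison along these lines can be sketched with scikit-learn as follows. This is not the code from MLE_2; the function name `compare_classifiers`, the default hyperparameters, the fixed seed, and the stratified split are all assumptions, and `SVC` is given `probability=True` only so that log loss can be computed for it.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, seed=0):
    """Fit each classifier on an 80:20 split and return its test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Default hyperparameters throughout (an assumption for this sketch).
    classifiers = {
        "RandomForestClassifier": RandomForestClassifier(random_state=seed),
        "GradientBoostingClassifier": GradientBoostingClassifier(random_state=seed),
        "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis(),
        "AdaBoostClassifier": AdaBoostClassifier(random_state=seed),
        "SVC": SVC(probability=True, random_state=seed),
        "GaussianNB": GaussianNB(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
        "KNeighborsClassifier": KNeighborsClassifier(),
    }
    results = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        y_prob = clf.predict_proba(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
            "log_loss": log_loss(y_test, y_prob, labels=[0, 1]),
        }
    return results
```

Here `X` would be the numeric feature columns from the CSV and `y` a 0/1 cover label. Scores from a sketch like this will depend on the data and settings, so they should not be expected to match the results reported above.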
MLE_3_Extract_Classify_Move.py is the third main file. It works as follows:
- Given an input folder, recursively search through it, find and extract all comic files to separate directories, flatten them (renaming files to avoid conflicts)
- Build feature set for each image
- Load the trained classifier, load the feature set into a pandas DataFrame, then iterate over its rows and apply the classifier
- Move covers to output folder and clean up temp directories.
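The extract-and-flatten step above might be sketched as follows for cbz files, which are ordinary zip archives. This is an illustration rather than the code in MLE_3: the function name `extract_flat`, the list of image extensions, and the renaming scheme (archive name plus entry index) are assumptions. cbr files are RAR archives and would need a third-party tool, so they are not handled here.

```python
import zipfile
from pathlib import Path

# Hypothetical list of page image types to keep.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def extract_flat(cbz_path, dest_dir):
    """Extract all images from a .cbz into dest_dir with no subfolders.

    Each file is renamed to "<archive>_<index>_<original name>" so that
    pages with the same name in different folders do not overwrite
    each other.
    """
    cbz_path = Path(cbz_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(cbz_path) as archive:
        for index, info in enumerate(archive.infolist()):
            if info.is_dir():
                continue
            original = Path(info.filename)
            if original.suffix.lower() not in IMAGE_EXTENSIONS:
                continue
            target = dest_dir / f"{cbz_path.stem}_{index:04d}_{original.name}"
            # Read the bytes directly instead of using extract(), which
            # would recreate the folder structure stored in the archive.
            target.write_bytes(archive.read(info))
            written.append(target)
    return written
```

The returned paths can then go through feature building and classification, after which the temporary directory can be removed with `shutil.rmtree`.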
Additionally, there are standalone files for some of the individual features from MLE 1, left over from early testing/troubleshooting, as well as some additional benchmarking scripts. They might be of some use to someone.
Comparison of different classifiers against earlier training sets
Examining computation/time cost of different feature types