appSHNE: The Application of Representation Learning for Semantic-Associated Heterogeneous Networks in Creating Android App Embedding Layers
This is the DSC 180 capstone project created by Alexander Friend, Braden Riggs, and Raya Kavosh. In this project we explore the use of both structured and unstructured app features in the creation of representative embedding layers. This multi-faceted approach, at scale, offers researchers greater insight into the inner workings of an app, with applications ranging from malware detection to similarity search and link prediction. This work is based on the heterogeneous network learning methods for written text documents put forth in the paper SHNE: Representation Learning for Semantic-Associated Heterogeneous Networks, written by Chuxu Zhang, Ananthram Swami, and Nitesh V. Chawla.
Included in this repo is the source code for our project. More information is available in our paper, as well as a video presenting our project and results.
This project requires Android app APKs to be decompiled to Smali code. This can be done using ApkTool. For more information on Android apps and Smali code, read through these slides.
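As a rough sketch of the decompilation step, the snippet below wraps ApkTool's `d` (decode) subcommand from Python. The helper names `build_apktool_command` and `decompile_apk` and the example paths are hypothetical and are not part of this repo. It assumes `apktool` is installed and available on the PATH.

```python
import subprocess
from pathlib import Path


def build_apktool_command(apk_path, out_dir):
    """Build the ApkTool command that decodes one APK into Smali code."""
    # `apktool d <apk> -o <dir>` decodes the APK into <dir>;
    # `-f` overwrites <dir> if it already exists.
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def decompile_apk(apk_path, out_dir):
    """Run ApkTool on one APK (assumes `apktool` is on the PATH)."""
    # Make sure the parent of the output directory exists before decoding.
    Path(out_dir).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_apktool_command(apk_path, out_dir), check=True)


# Hypothetical usage, with placeholder paths:
#   decompile_apk("apps/example.apk", "decompiled/example")
```

The resulting directory of Smali files is what the path settings in config/params.json would then point to.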
A Docker image to run this project on the UCSD DSMLP servers is available here and can be pulled with the command:
docker pull apfriend/dsc180-shne-env:latest
This code can be run from the command line by running:
python run.py [OPTIONS]
Valid options are:
-test, -Test, -t Run on test set instead of full set of Android apps
-silent Hide outputs from command line
--save-out, -log Keep a log of command line outputs in file saved in configured
-time Keep track of how long to run and output time to complete when finished running
-eda Run EDA section only
--force-single Force to run on single process
--force-multi Force to run using multi-processing
--show-params Print out parameters passed in command line
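The flags above could be parsed along the following lines. This is only a sketch using Python's standard `argparse` module and is not necessarily how `run.py` parses its arguments. The function name `build_parser` and the `dest` attribute names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Sketch of a parser for the documented flags (not the repo's actual code)."""
    # allow_abbrev=False so only the exact spellings listed above are accepted.
    parser = argparse.ArgumentParser(prog="run.py", allow_abbrev=False)
    parser.add_argument("-test", "-Test", "-t", dest="test", action="store_true",
                        help="run on test set instead of full set of Android apps")
    parser.add_argument("-silent", dest="silent", action="store_true",
                        help="hide outputs from command line")
    parser.add_argument("--save-out", "-log", dest="save_out", action="store_true",
                        help="keep a log of command line outputs in a file")
    parser.add_argument("-time", dest="time", action="store_true",
                        help="report how long the run took when finished")
    parser.add_argument("-eda", dest="eda", action="store_true",
                        help="run EDA section only")
    parser.add_argument("--force-single", dest="force_single", action="store_true",
                        help="force to run on a single process")
    parser.add_argument("--force-multi", dest="force_multi", action="store_true",
                        help="force to run using multi-processing")
    parser.add_argument("--show-params", dest="show_params", action="store_true",
                        help="print out parameters passed in command line")
    return parser


# Example: equivalent to `python run.py -eda -time`
args = build_parser().parse_args(["-eda", "-time"])
```

Every flag is a simple on/off switch here, so each one maps to a boolean attribute on `args`.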
The paths to the decompiled APKs of the malicious and benign test apps should be set in mal_fp_test and ben_fp_test, respectively, in the configuration file config/params.json. The paths to the decompiled APKs of the malicious and benign training apps should be set in mal_fp and ben_fp, respectively, in the configuration file config/params.json.
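To illustrate how these four keys might be consumed, here is a minimal sketch that loads the configuration file and picks the test or training paths. The helper names `load_params` and `select_app_paths` are hypothetical, and the example paths are placeholders. Only the key names and the file location come from this README.

```python
import json


def load_params(config_path="config/params.json"):
    """Read the JSON configuration file into a dictionary."""
    with open(config_path) as f:
        return json.load(f)


def select_app_paths(params, test=False):
    """Return (malicious_path, benign_path) for the test or the training set."""
    if test:
        return params["mal_fp_test"], params["ben_fp_test"]
    return params["mal_fp"], params["ben_fp"]


# Placeholder values, for illustration only.
example_params = {
    "mal_fp": "data/malicious",
    "ben_fp": "data/benign",
    "mal_fp_test": "data/test/malicious",
    "ben_fp_test": "data/test/benign",
}
mal_path, ben_path = select_app_paths(example_params, test=True)
```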
- Updated Dockerfile as it was changed during merge
- Updated README.md file
- wrote EDA notebook that is callable from command line
- Run EDA with the following command line parameter: -eda
  - EDA can be run with the following parameters: time and limit
  - python run.py -eda -time will run the EDA and print the time to run it on completion
- Cleaned old code and added documentation
- To do:
  - Clean up parameters in config/params.json and delete unused parameters
  - Remove unused methods
  - Update Dockerfile with nbconvert and pandoc to run EDA.ipynb from command line
  - Run EDA on 1000 apps
- Added argument -log for the <redirect_std_out> (save console output to log file) parameter
- Moved SHNE_code to src directory
-t, -test, -Test: Run on test set
-node2vec, -n2v: Run with node2vec instead of word2vec
--skip-embeddings: Skip the word embeddings stage
--skip-shne: Skip SHNE model creation final step
-p, -parse: Only create node dictionaries dict_A.json, dict_B.json, dict_P.json, dict_I.json, api_calls.json, and naming_key.json
-o, -overwrite: Overwrite previous node dictionaries created when parsing
--save-out: Save console output to file
-time: Time how long it takes to run main.py
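The stage-related flags above suggest a pipeline of parsing, embedding, and SHNE model creation. The sketch below shows one way those flags could map to the stages that run. The function `plan_stages`, the stage labels, and the assumption that parsing always runs first are illustrative guesses, not the repo's actual control flow.

```python
def plan_stages(flags):
    """Return the pipeline stages implied by a collection of command-line flags."""
    flags = set(flags)
    # Assumption: node dictionaries are always built before anything else.
    stages = ["parse"]
    # -p / -parse: only create the node dictionaries, then stop.
    if flags & {"-p", "-parse"}:
        return stages
    # --skip-embeddings drops the embedding stage entirely.
    if "--skip-embeddings" not in flags:
        use_node2vec = bool(flags & {"-node2vec", "-n2v"})
        stages.append("node2vec" if use_node2vec else "word2vec")
    # --skip-shne drops the final SHNE model creation step.
    if "--skip-shne" not in flags:
        stages.append("shne")
    return stages


default_plan = plan_stages([])
n2v_plan = plan_stages(["-n2v", "--skip-shne"])
```

With no flags the sketch yields all three stages with word2vec embeddings, while `-n2v --skip-shne` stops after node2vec embeddings.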
- All outputs will be saved under the values for <out_path> and <test_out_path>
  - Subdirectories for saved outputs are configured in the respective dictionary.
    - For instance, word2vec embeddings will be saved under the path <save_dir> in the <word2vec-params> dictionary in config/params.json
- All filenames are parameterizable
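As a small illustration of the output layout described above, the following sketch joins the base output path with a stage's save_dir. The helper `resolve_output_dir` is hypothetical, the example values are placeholders, and it assumes save_dir is a subdirectory relative to out_path (or test_out_path for test runs).

```python
import os


def resolve_output_dir(params, section, test=False):
    """Join the base output path with the save_dir of one parameter dictionary."""
    # Test runs write under test_out_path, full runs under out_path.
    base = params["test_out_path"] if test else params["out_path"]
    return os.path.join(base, params[section]["save_dir"])


# Placeholder configuration values, for illustration only.
example_params = {
    "out_path": "out",
    "test_out_path": "test_out",
    "word2vec-params": {"save_dir": "word2vec"},
}
w2v_dir = resolve_output_dir(example_params, "word2vec-params")
w2v_test_dir = resolve_output_dir(example_params, "word2vec-params", test=True)
```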