This repository contains the solution for the hackathon, using this approach I was able to achieve 2nd rank in this hackathon.
Here is the link to my interview with Analytics India Magazine.
Can you construct a next-gen model, capable of detecting code that is present in a body of text? Be part of a mission to enhance the security and resilience of web applications.
Protecting our software landscapes is not an easy task. Malicious actors are frequently trying to enter systems and get access to resources, whether operational or data. The ability for an actor to compromise systems, elevate their privileges, and move laterally within infrastructure typically hinges on executing hidden code. One common method they employ is embedding this code in seemingly harmless media—whether it's images, videos, or even simple text files.
Given a body of text, find the source code hidden in the text. There might not be any source control or multiple sections of source code concealed within the text.
- ID: A unique identifier, either an integer or string, used to distinguish each row in the dataset.
- Text: A string containing the textual content associated with each record.
- ContainsCode: A boolean indicating whether the "Text" field contains any code snippets.
- CodeList: A list or array storing code snippets extracted from the "Text" field, if applicable.
- I have finetuned RoBERTa base distilled and DeBERTa v3 large, each for 30 epochs.
- Both of these models were trained on SQUAD V2.
- For Inference, I use the output of the model which has a higher logit score.
- Optimizer Adamw with Learning rate = 2e-5 and Cosine scheduler with warmup.
- Dropout of 0.2
- The evaluation metric used in this competition is accuracy.
- First, MultiLabel Binarizer is applied on the predicted span of code.
- Then, accuracy is computed. I got 0.90227 accuracy on the private leaderboard.