adds Dalvik VM/DEX support #976

Draft
wants to merge 1 commit into
base: master
from

Conversation

XVilka
Contributor

@XVilka XVilka commented Aug 23, 2019

Very rough draft of the future Dalvik VM/DEX support in BAP.
I followed the examples, but the Core Theory/Knowledge API is new to me, so don't judge me if I did something very wrong. Also, it can't even be compiled yet and is basically bricks glued together with duct tape. But please check whether my train of thought is heading in the right direction, and give feedback on what should be changed and what should be thrown into the fire before anybody sees it.

The biggest questions here are array access (I am still trying to understand how to implement it) and invocation, since it requires indexed access to data loaded from DEX. I will update the PR according to your feedback.

@ivg ivg assigned gitoleg and ivg and unassigned gitoleg Aug 23, 2019
@ivg
Member

ivg commented Aug 23, 2019

Ok, let's start with a general discussion of how we should address such issues (or at least how it was designed). Speaking of the design, we have an implementation of the final tagless style. So far we have Symantics of only one, but rather big, Theory, which we arbitrarily called the Core Theory. This theory represents a common von Neumann computer, with memories, registers, an ALU, etc. In general, we are not tied to this particular theory and can implement other theories, but we should keep in mind that each particular analysis is implemented for a specific theory. So far, all our analyses understand some subset of the Core Theory, so it is a good idea to be able to express the semantics of the Dalvik bytecode in the Core Theory.

We don't want, however, to lose the high-level information, even if right now there are no analyses that can benefit from it. Obviously, an analysis has to be aware of the Java Theory to be able to benefit from the Java high-level constructs.
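For readers new to the final tagless style, here is a minimal sketch in plain OCaml (standard library only). The `ARITH` signature and both interpreters are invented for illustration and are not part of the BAP Core Theory API. The idea it shows: a theory is a signature, a term is written once against that signature, and each analysis is one implementation of it.

```ocaml
(* A toy "theory": a signature listing the operations a term may use.
   Hypothetical, and far smaller than BAP's Core Theory. *)
module type ARITH = sig
  type t
  val int : int -> t
  val add : t -> t -> t
  val mul : t -> t -> t
end

(* A term is written once, against the signature only. *)
module Term (T : ARITH) = struct
  let example = T.(add (int 2) (mul (int 3) (int 4)))
end

(* Interpretation 1: evaluate the term to a number. *)
module Eval = struct
  type t = int
  let int x = x
  let add = ( + )
  let mul = ( * )
end

(* Interpretation 2: render the term as text. *)
module Show = struct
  type t = string
  let int = string_of_int
  let add x y = Printf.sprintf "(%s + %s)" x y
  let mul x y = Printf.sprintf "(%s * %s)" x y
end

module Evaluated = Term (Eval)
module Shown = Term (Show)

let () = Printf.printf "%s = %d\n" Shown.example Evaluated.example
```

An interpreter that only knows `ARITH` cannot see anything outside that signature, which is why an analysis written for the Core Theory would need the Dalvik semantics expressed in Core Theory terms.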

The general approach for implementing High Level Languages, Virtual Machines, and other systems that are higher (more abstract) than the Core Theory Machine is to gradually lower the theory until we reach a suitable theory (e.g., Core or Basic). For example, speaking of Java, we can declare a theory of the Java language (which would be a signature in terms of OCaml) and implement it in terms of a JVM, which we can, in turn, implement in terms of the Core Theory. E.g.,

module Java = functor (Machine : JVM) -> Java
module Jvm = functor (Machine : Core) -> Jvm

so that we can express Java code in terms of the low-level Core Theory, as

module Interpreter = Java(Jvm(Theory.Core))
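The layering above could be sketched in runnable form roughly as follows. Everything here is a toy assumption made for illustration: the `CORE`, `JVM`, and `JAVA` signatures, the `Asm` back end that records a textual listing, the `frame_base` constant, and the fixed variable-to-slot table. None of it is BAP's `Theory.Core`.

```ocaml
(* Lowest level: a machine with one addressable memory. *)
module type CORE = sig
  type eff                        (* an effect on the machine state *)
  val skip : eff
  val store : int -> int -> eff   (* store address value *)
  val seq : eff -> eff -> eff
end

(* Middle level: a JVM-like machine with numbered local slots. *)
module type JVM = sig
  type eff
  val skip : eff
  val istore : int -> int -> eff  (* istore slot constant *)
  val seq : eff -> eff -> eff
end

(* Top level: a Java-like language with assignments and blocks. *)
module type JAVA = sig
  type stmt
  val assign : string -> int -> stmt
  val block : stmt list -> stmt
end

(* Lowering JVM to CORE fixes a design choice: where the locals live. *)
module Jvm (Machine : CORE) : JVM with type eff = Machine.eff = struct
  type eff = Machine.eff
  let frame_base = 0x100
  let skip = Machine.skip
  let istore slot v = Machine.store (frame_base + slot) v
  let seq = Machine.seq
end

(* Lowering JAVA to JVM fixes another: how names map to slots. *)
module Java (Machine : JVM) : JAVA with type stmt = Machine.eff = struct
  type stmt = Machine.eff
  let slot_of = function
    | "x" -> 0
    | "y" -> 1
    | v -> failwith ("unbound variable " ^ v)
  let assign v n = Machine.istore (slot_of v) n
  let block stmts = List.fold_left Machine.seq Machine.skip stmts
end

(* One possible CORE implementation: record a textual listing. *)
module Asm = struct
  type eff = string list
  let skip = []
  let store a v = [Printf.sprintf "mem[0x%x] := %d" a v]
  let seq = ( @ )
end

module Interpreter = Java (Jvm (Asm))

let listing = Interpreter.(block [assign "x" 1; assign "y" 2])
let () = List.iter print_endline listing
```

Swapping `Asm` for another `CORE` implementation, or `Jvm` for a functor that makes different layout choices, changes the result without touching the Java-level program.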

On each level of this transformation we have to make implementation choices, like how arrays are represented, how types are stored, etc. Basically, we are implementing compilation, so we are moving from the high level to the assembly level (and even lower) and we have to fix some decisions. Although the JVM and the DVM are much better specified than, for example, C, which leaves a lot of room for design decisions, we still have to make a lot of decisions, which will in fact be different from the decisions made by the implementers of the real virtual machines, like HotSpot or the DVM. This is not a big deal (unless we really want to mimic the behavior of a particular machine), since we are more concerned with correct semantics than with performance and memory footprint. What I'm trying to say here is that there can be multiple different ways to express the JVM or the DVM in terms of the Core Theory, therefore we should define a DVM abstraction and keep the translation as a functor.

Now, let's dive into the DVM and the problem of reference objects. Unlike the C Abstract Machine, the DVM (and the JVM too) provides facilities for memory management. So we have to decide how we will represent the heap. My suggestion is to implement a very simple, ever-growing model of the heap. We need two variables:

  let heap_type = Theory.Mem.define value value
  let brk = Theory.Var.define value "brk"
  let heap = Theory.Var.define heap_type "mem"

where brk is a pointer to the next free space, and heap is the value->value memory. (Probably, it should be value->byte, we'll figure it out later).

Therefore, an object allocation could be reified in the Core Theory as

  let allocate_object dst len =
    seq
      (set_reg dst (var brk))
      (set brk (add (var brk) (int value len)))

I.e., it just stores the current value of brk in the destination register and then increments brk by the allocated size (thus allocating the memory).
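To make the effect of allocate_object concrete, here is a small stand-alone model in plain OCaml. The `state` record, the association-list frame, and the starting address `0x1000` are assumptions made for this sketch; it mimics the two assignments of allocate_object on ordinary values instead of building Core Theory terms.

```ocaml
(* Hypothetical concrete state: [frame] maps a register index to its
   32-bit value, [brk] is the next free heap address. *)
type state = { brk : int; frame : (int * int) list }

(* Keep values within 32 bits, like the [value] sort. *)
let mask32 x = x land 0xFFFF_FFFF

(* Mirrors allocate_object: dst := brk; brk := brk + len. *)
let allocate_object dst len st =
  { brk = mask32 (st.brk + len);
    frame = (dst, st.brk) :: List.remove_assoc dst st.frame }

let get_reg r st = List.assoc r st.frame

(* Two allocations in a row never overlap, since brk only grows. *)
let final =
  { brk = 0x1000; frame = [] }
  |> allocate_object 2 16
  |> allocate_object 3 8

let () =
  Printf.printf "v2=0x%x v3=0x%x brk=0x%x\n"
    (get_reg 2 final) (get_reg 3 final) final.brk
```

With these inputs the first object lands at 0x1000, the second at 0x1010, and brk ends at 0x1018. Nothing is ever freed, which is the "ever growing" part of the model.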

So the semantics of the new_array instruction would be, roughly,

  let new_array dst len _typ : unit Theory.Effect.t KB.t =
    unlabeled >>= fun lbl ->
    blk lbl (allocate_object (int reg_name dst) len) skip

roughly, because in order to find the size we need to look into the array type, find the element type, and so on... again, the devil is in the details, but we will figure it out.

Now, for the sake of experiment, let's implement (again roughly, as a PoC) the semantics of the filled-new-array operation. It will be something like this:
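One way the size computation might look, as a sketch only: the helper names `element_size` and `array_size` are hypothetical, and the per-element widths are a design choice of this model, not something the thread or a real VM fixes. In particular, the 4-byte reference width is an assumption that matches the 32-bit value sort used here, and no array header (such as a length field) is accounted for.

```ocaml
(* Width in bytes of one array element, keyed on the first character of
   the element's Dalvik type descriptor. The widths are assumptions of
   this sketch; references are taken to be 4 bytes. *)
let element_size descriptor =
  match descriptor.[0] with
  | 'Z' | 'B' -> 1   (* boolean, byte *)
  | 'S' | 'C' -> 2   (* short, char *)
  | 'I' | 'F' -> 4   (* int, float *)
  | 'J' | 'D' -> 8   (* long, double *)
  | 'L' | '[' -> 4   (* object or nested array reference *)
  | c -> invalid_arg (Printf.sprintf "unknown descriptor %c" c)

(* Bytes to allocate for [len] elements of an array type such as "[I". *)
let array_size typ len =
  if String.length typ < 2 || typ.[0] <> '[' then
    invalid_arg "not an array type";
  let elt = String.sub typ 1 (String.length typ - 1) in
  len * element_size elt

let () =
  Printf.printf "[I x 42 needs %d bytes\n" (array_size "[I" 42)
```

A new_array lifter could then pass the result of such a function to allocate_object instead of the raw element count.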

  let filled_new_array dst len data =
    Theory.Var.fresh value >>= fun i ->
    let dst = int reg_name dst in
    let data = int value data in
    block [
      set i (int value Bitvec.zero);
      allocate_object dst len;
      repeat (ult (var i) (int value len)) @@
      data_block [
        set_slot dst (var i) data;
        set i (add (var i) (int value Bitvec.one))
      ]
    ]

where set_slot sets the i'th value slot of an object to data,

  let set_slot dst pos data =
    let stride = M64.int (Theory.Bitv.size value / 8) in
    let off = add (get_reg dst) (mul pos (int value stride)) in
    set heap (store (var heap) off data)

And now filled-new-array 2 42 t 0 will be reified to the following BIL code:

     {
        #1 := 0
        frame := frame with [2, be]:u32 <- brk
        brk := brk + 0x2A
        while (#1 < 0x2A) {
          mem := mem with [frame[2, be]:u32 + #1 * 4, be]:u32 <- 0
          #1 := #1 + 1
        }
     }
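The listing can be traced by hand with a small stand-alone simulation. This is a sketch under assumptions: the hash tables standing in for frame and mem, the function name `run_filled_new_array`, and the initial brk of `0x1000` are all invented for illustration, and each heap entry records one 32-bit store by its starting address.

```ocaml
(* Hypothetical model of the BIL listing:
   frame : register index -> 32-bit value
   mem   : byte address   -> 32-bit value stored at that address *)
let run_filled_new_array ~brk ~dst ~len ~data =
  let frame = Hashtbl.create 16 and mem = Hashtbl.create 64 in
  let i = ref 0 in                        (* #1 := 0 *)
  Hashtbl.replace frame dst brk;          (* frame[dst] <- brk *)
  let brk' = brk + len in                 (* brk := brk + len *)
  while !i < len do
    let addr = Hashtbl.find frame dst + !i * 4 in
    Hashtbl.replace mem addr data;        (* mem[addr] <- data *)
    incr i
  done;
  (brk', Hashtbl.find frame dst, mem)

let brk', base, mem =
  run_filled_new_array ~brk:0x1000 ~dst:2 ~len:42 ~data:0

(* Highest address at which a slot was written. *)
let last_written = Hashtbl.fold (fun a _ acc -> max a acc) mem 0

let () =
  Printf.printf "base=0x%x new brk=0x%x slots=%d last slot at 0x%x\n"
    base brk' (Hashtbl.length mem) last_written
```

In this model the 42 slots span addresses 0x1000 to 0x10a4, while brk has only advanced to 0x102a. That gap is one concrete reason the allocation size has to be derived from the element type and length rather than from the length alone.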

The good thing about this solution is that we can more or less preserve the separation between the frames of different objects (we will soon add assume/assert statements to our theory, which will make it even more explicit).
Below is the full example, with all helper functions (by the way, do not hesitate to create such helpers; in the end they will form our DVM theory).

open Bap_core_theory
open Base
open KB.Syntax

let package = "dalvik"

(* let's borrow a little bit from redexer *)

type opcode =
  | OP_NOP
  | OP_MOVE
  | OP_NEW_INSTANCE
  | OP_NEW_ARRAY
  | OP_FILLED_NEW_ARRAY

type operand =
  | OPR_REGISTER of int
  | OPR_CONST    of int64  (** constant *)

type insn = opcode * operand list

module Java = struct
  type reg_name
  type value
  type byte



  (* registers are 4-bit? indices in a stack frame *)
  let reg_name : reg_name Theory.Bitv.t Theory.Value.sort =
    Theory.Bitv.define 4

  (* we can define our own type hierarchy for Java,
     but let's start with just 32 bit integers for all
     primitive and reference types; this will bite us once
     we start dealing with doubles and longs.
  *)
  let value : value Theory.Bitv.t Theory.Value.sort =
    Theory.Bitv.define 32

  let byte : byte Theory.Bitv.t Theory.Value.sort =
    Theory.Bitv.define 8

  (* but frame is still a mapping from 4 bit offsets to 32 bit values.  *)
  let frame = Theory.Mem.define reg_name value
  let heap_type = Theory.Mem.define value value

  let current_frame = Theory.Var.define frame "frame"
  let brk = Theory.Var.define value "brk"
  let heap = Theory.Var.define heap_type "mem"
end

(* modular arithmetic for 4 bit values *)
module M4 = Bitvec.Make(struct let modulus = Bitvec.modulus 4 end)
module M32 = Bitvec.M32
module M64 = Bitvec.M64

module Dalvik(Core : Theory.Core) = struct
  open Core
  open Java

  let pass = perform Theory.Effect.Sort.bot
  let skip = perform Theory.Effect.Sort.bot

  let frame = var current_frame
  let unlabeled = KB.Symbol.intern ~package:"core-theory" "nil"
      Theory.Program.cls

  let set_reg x v =
    set current_frame (store frame x v)

  let get_reg x = load frame x

  let mov x y = set_reg x (get_reg y)

  let mov_rr x y =
    let x = int reg_name x
    and y = int reg_name y in
    mov x y

  let move eff =
    KB.Object.create Theory.Program.cls >>= fun lbl ->
    blk lbl eff skip


  let allocate_object dst len =
    seq
      (set_reg dst (var brk))
      (set brk (add (var brk) (int value len)))

  let nop =
    KB.return @@
    Theory.Effect.empty Theory.Effect.Sort.top

  let set_slot dst pos data =
    let stride = M64.int (Theory.Bitv.size value / 8) in
    let off = add (get_reg dst) (mul pos (int value stride)) in
    set heap (store (var heap) off data)

  (* probably the size is a function of len and typ *)
  let new_array dst len _typ : unit Theory.Effect.t KB.t =
    unlabeled >>= fun lbl ->
    blk lbl (allocate_object (int reg_name dst) len) skip

  let data_block = function
    | [] -> pass
    | xs -> List.reduce_exn xs ~f:seq

  let block xs =
    unlabeled >>= fun lbl ->
    blk lbl (data_block xs) skip

  let filled_new_array dst len data =
    Theory.Var.fresh value >>= fun i ->
    let dst = int reg_name dst in
    let data = int value data in
    block [
      set i (int value Bitvec.zero);
      allocate_object dst len;
      repeat (ult (var i) (int value len)) @@
      data_block [
        set_slot dst (var i) data;
        set i (add (var i) (int value Bitvec.one))
      ]
    ]

  let run
    : insn -> unit Theory.Effect.t KB.t =
    function
    | (OP_NOP,[]) -> nop
    | (OP_MOVE, [OPR_REGISTER x; OPR_REGISTER y]) ->
      move (mov_rr (M4.int x) (M4.int y))
    | (OP_NEW_ARRAY, [OPR_REGISTER dst; OPR_CONST len; _]) ->
      new_array (M4.int dst) (M32.int64 len) ()
    | (OP_FILLED_NEW_ARRAY, [OPR_REGISTER dst; OPR_CONST len; _; OPR_CONST data]) ->
      filled_new_array (M4.int dst) (M32.int64 len) (M32.int64 data)
    | _ -> failwith "not ready"

end



module Lifter = Dalvik(Theory.Manager)

let lift opcode =
  KB.Object.create Theory.Program.cls >>= fun insn ->
  Lifter.run opcode >>= fun sema ->
  KB.provide Theory.Program.Semantics.slot insn sema >>| fun () ->
  insn

let test opcode =
  match KB.run Theory.Program.cls (lift opcode) KB.empty with
  | Error _ -> failwith "Oops, we've got a conflict!"
  | Ok (code,_) ->
    Caml.Format.printf "%a@\n" KB.Value.pp code

@XVilka
Contributor Author

XVilka commented Aug 28, 2019

I have a question now: what is the boilerplate that I need to add in plugins/dalvik/dalvik_dex.ml to use it for loading the binary instead of the raw bin or ELF code?

And one more question: the imported code uses DynArray for resizable arrays. Is there something in BAP that can be reused for this? Or should I change all the code that uses it? Or just copy-paste the DynArray implementation along with the rest?

@ivg ivg self-requested a review September 4, 2019 16:32
@ivg
Member

ivg commented Apr 15, 2020

And one more question: the imported code uses DynArray for resizable arrays. Is there something in BAP that can be reused for this? Or should I change all the code that uses it? Or just copy-paste the DynArray implementation along with the rest?

We have Bap.Std.Vector for that.

@ivg ivg changed the title [WIP] Dirty, very dirty initial draft - closes #974 adds Dalvki VM/DEX support Jun 12, 2020
@ivg ivg marked this pull request as draft June 12, 2020 20:24
@ivg ivg added the dex-lifter label Jun 12, 2020
@ivg ivg changed the title adds Dalvki VM/DEX support adds Dalvik VM/DEX support Jun 25, 2020

3 participants