-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds Dalvik VM/DEX support #976
base: master
Are you sure you want to change the base?
Conversation
Ok, let's start with the general discussion and how in general we should address such issues (and least how it was designed. Speaking of the design we have an implementation of the final taggles style. So far we have Symantics of only one, but rather big Theory, which we arbitrary called the Core Theory. This theory represents a theory of a common Von Neumann computer, with memories, registers, ALU, etc. In general, we are not sticked to this particular theory and we can implement other theories, but we should keep in mind that each particular analysis is implemented for a specific theory. So far, all our analyses are understanding some subset of the Core Theory, so it is a good idea to be able to express semantics of the Dalvik bytecode in the Core Theory. We don't want, however, to loose the high-level information, in general, even if right now there is no analyses that can benefit from it. Obviously, an analysis should be aware of the Java Theory to be able to benefit from the Java high-level constructions. The general approach for implementing High Level Languages, Virtual Machines, and other systems that are higher (more abstract) than the Core Theory Machine is by gradually lowering the theory until we reach the suitable theory (e.g., Core or Basic). For example, speaking of Java we can declare a theory of the Java language (which would be a signature in terms of OCaml) and implement it in terms of a JVM, which we can, in order, implement in terms of the Core theory. E.g.,
so that we can express Java code in terms of the low-level Core Theory, as
On each level of this transformation we have to make implementation choices, like how arrays are represented, how types are stored, etc. Basically, we are implementing compilation, so we are moving to from the high-level to assembly level (and even lower) and we have to fix some decisions. Although JVM and DVM are much better specified then for example C, which leaves a lot of space for design decisions, we still have to make a lot of decisions, which in fact will be different from the decisions that were made by the implementers of the real virtual machines, like hotspot or DVM. This is not a big deal, (unless we really want to mimic a behavior of a particular machine), since we are more concerned with the correct semantics, rather than performance and memory footprint. What I'm trying to say here, is that we can have multiple different ways to express JVM or DVM in the terms of the Core Theory, therefore we should defined a DVM abstraction and keep the translation as a functor. Now, let's dive down to DVM and a problem of reference objects. Unlike a C Abstract Machine, DVM (and JVM too) provides facilities for memory management. So we have to make some decisions on how we will represent the heap. My suggestion is to implement a very simple, ever growing model of heap. We need two variables: let heap_type = Theory.Mem.define value value
let brk = Theory.Var.define value "brk"
let heap = Theory.Var.define heap_type "mem" where Therefore, an object allocation could be reified in the Core Theory as let allocate_object dst len =
seq
(set_reg dst (var brk))
(set brk (add (var brk) (int value len))) I.e., it just increments the pointer (and thus allocates the memory), and increments So the semantics of the let new_array dst len _typ : unit Theory.Effect.t KB.t =
unlabeled >>= fun lbl ->
blk lbl (allocate_object (int reg_name dst) len) skip roughly, because in order to find the size we need to look into the array type and find the element type and so on... again devil in details, but we will figure it out. Now, for the sake of the experiment let's implement (again roughly as a PoC) the semantics of the let filled_new_array dst len data =
Theory.Var.fresh value >>= fun i ->
let dst = int reg_name dst in
let data = int value data in
block [
set i (int value Bitvec.zero);
allocate_object dst len;
repeat (ult (var i) (int value len)) @@
data_block [
set_slot dst (var i) data;
set i (add (var i) (int value Bitvec.one))
]
] where let set_slot dst pos data =
let stride = M64.int (Theory.Bitv.size value / 8) in
let off = add (get_reg dst) (mul pos (int value stride)) in
set heap (store (var heap) off data) And now we can reify {
#1 := 0
frame := frame with [2, be]:u32 <- brk
brk := brk + 0x2A
while (#1 < 0x2A) {
mem := mem with [frame[2, be]:u32 + #1 * 4, be]:u32 <- 0
#1 := #1 + 1
} The good thing about this solution is that we can more or less preserve the separation of frames of different objects (we will soon add assume/assert statements to our theory, which will make it even more explicit). open Bap_core_theory
open Base
open KB.Syntax
let package = "dalvik"
(* let's pull a little bit redexer *)
type opcode =
| OP_NOP
| OP_MOVE
| OP_NEW_INSTANCE
| OP_NEW_ARRAY
| OP_FILLED_NEW_ARRAY
type operand =
| OPR_REGISTER of int
| OPR_CONST of int64 (** constant *)
type insn = opcode * operand list
module Java = struct
type reg_name
type value
type byte
(* registers are 4-bit? indices in a stack frame *)
let reg_name : reg_name Theory.Bitv.t Theory.Value.sort =
Theory.Bitv.define 4
(* we can define our own type hierarchy for Java,
but let's start with just 32 bit integers for all
primitive and reference types, it will hit us when
we will start dealing with doubles and longs.
*)
let value : value Theory.Bitv.t Theory.Value.sort =
Theory.Bitv.define 32
let byte : byte Theory.Bitv.t Theory.Value.sort =
Theory.Bitv.define 8
(* but frame is still a mapping from 4 bit offsets to 32 bit values. *)
let frame = Theory.Mem.define reg_name value
let heap_type = Theory.Mem.define value value
let current_frame = Theory.Var.define frame "frame"
let brk = Theory.Var.define value "brk"
let heap = Theory.Var.define heap_type "mem"
end
(* modular arithmetics for 4 bit values *)
module M4 = Bitvec.Make(struct let modulus = Bitvec.modulus 4 end)
module M32 = Bitvec.M32
module M64 = Bitvec.M64
module Dalvik(Core : Theory.Core) = struct
open Core
open Java
let pass = perform Theory.Effect.Sort.bot
let skip = perform Theory.Effect.Sort.bot
let frame = var current_frame
let unlabeled = KB.Symbol.intern ~package:"core-theory" "nil"
Theory.Program.cls
let set_reg x v =
set current_frame (store frame x v)
let get_reg x = load frame x
let mov x y = set_reg x (get_reg y)
let mov_rr x y =
let x = int reg_name x
and y = int reg_name y in
mov x y
let move eff =
KB.Object.create Theory.Program.cls >>= fun lbl ->
blk lbl eff skip
let allocate_object dst len =
seq
(set_reg dst (var brk))
(set brk (add (var brk) (int value len)))
let nop =
KB.return @@
Theory.Effect.empty Theory.Effect.Sort.top
let set_slot dst pos data =
let stride = M64.int (Theory.Bitv.size value / 8) in
let off = add (get_reg dst) (mul pos (int value stride)) in
set heap (store (var heap) off data)
(* probably the size is a function of len and typ *)
let new_array dst len _typ : unit Theory.Effect.t KB.t =
unlabeled >>= fun lbl ->
blk lbl (allocate_object (int reg_name dst) len) skip
let data_block = function
| [] -> pass
| xs -> List.reduce_exn xs ~f:seq
let block xs =
unlabeled >>= fun lbl ->
blk lbl (data_block xs) skip
let filled_new_array dst len data =
Theory.Var.fresh value >>= fun i ->
let dst = int reg_name dst in
let data = int value data in
block [
set i (int value Bitvec.zero);
allocate_object dst len;
repeat (ult (var i) (int value len)) @@
data_block [
set_slot dst (var i) data;
set i (add (var i) (int value Bitvec.one))
]
]
let run
: insn -> unit Theory.Effect.t KB.t =
function
| (OP_NOP,[]) -> nop
| (OP_MOVE, [OPR_REGISTER x; OPR_REGISTER y]) ->
move (mov_rr (M4.int x) (M4.int y))
| (OP_NEW_ARRAY, [OPR_REGISTER dst; OPR_CONST len; _]) ->
new_array (M4.int dst) (M32.int64 len) ()
| (OP_FILLED_NEW_ARRAY, [OPR_REGISTER dst; OPR_CONST len; _; OPR_CONST data]) ->
filled_new_array (M4.int dst) (M32.int64 len) (M32.int64 data)
| _ -> failwith "not ready"
end
module Lifter = Dalvik(Theory.Manager)
let lift opcode =
KB.Object.create Theory.Program.cls >>= fun insn ->
Lifter.run opcode >>= fun sema ->
KB.provide Theory.Program.Semantics.slot insn sema >>| fun () ->
insn
let test opcode =
match KB.run Theory.Program.cls (lift opcode) KB.empty with
| Error _ -> failwith "Oops, we've got a conflict!"
| Ok (code,_) ->
Caml.Format.printf "%a@\n" KB.Value.pp code |
I have a question now - what is the boilerplate that I need to add in And one more question - the imported coded uses |
We have |
Very rough draft of the future Dalvik VM/DEX support in BAP.
I followed the examples, but the Core Theory/Knowledge API is new for me, so don't judge me if I did something very wrong. Also it is can't be even compiled and basically a bricks glued together with duck tape. But please check if my train of thought follows the right direction and provide a feedback what should be changed and what put to a fire before somebody saw it.
The biggest questions here is the array access (I am still trying to understand how to implement this) and invocation - since it requires the indices access for data loaded from DEX. I will update the PR according to your feedback.