> Is there any new approach in the works? Maybe something ML-based for optimization?
I'm doing a PhD on this.
My goal is to detect known functions from obfuscated binaries.
The biggest challenge by far is building a good dataset. Unlike computer vision (millions of pictures with the label "dog"), the number of training examples for a typical function is one. For now I'm focusing on C standard libraries, since there are a handful of real-world implementations plus some FOSS or student samples available for things like strlen and atoi.
If anyone wants to collaborate, feel free to message me.
I'm not sure I follow - wouldn't many statically linked programs have much of some version of libc within them? So you could take any program, change it to be statically linked and use that for training?
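That's roughly how I harvest labels. A minimal sketch of the extraction step, assuming the GNU binutils `nm -S --defined-only` output format (the addresses and symbol names below are made-up examples, and a real pipeline would also carve out the bytes at each address):

```python
# Parse `nm -S --defined-only` output into {function_name: (address, size)}.
# Expected line format: "<addr> <size> <type> <name>", e.g.
#   0000000000401b20 0000000000000045 T strlen
def parse_nm_symbols(nm_output: str) -> dict[str, tuple[int, int]]:
    functions = {}
    for line in nm_output.splitlines():
        parts = line.split()
        # Keep only text-section (code) symbols; skip data like "D stdout".
        if len(parts) == 4 and parts[2] in ("T", "t"):
            addr, size, _, name = parts
            functions[name] = (int(addr, 16), int(size, 16))
    return functions
```

Running `nm -S --defined-only` over a statically linked binary and feeding the output through this gives one labeled (name, byte range) pair per libc function that survived linking.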
Not sure exactly what you mean by "best guess + fuzzing", but I have compiled code that was first decompiled by Ghidra. The problem is there are lots of invalid identifiers in the decompiled output.
The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.
The Ghidra community basically says you should not expect the exported decompiled code to be valid [1,2]. Which is fine, since round-trip compile-decompile-compile is not exactly Ghidra's purpose.
Maybe there's a setting to make Ghidra export inline asm literals when it can't produce valid decompiled C, but I am pretty new to Ghidra so it could just be my own ignorance.
> The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.
Split it into a new variable. Sounds like Ghidra has trouble telling whether it is a reused storage location or actually the same variable.
Best guess = something that looks approximately fitting for the relevant assembly.
Fuzzing = tweaking the source code to get what it compiles to closer to the actual assembly.
As in: generate a function, see how closely its compilation resembles the assembly, tweak until you find a match.
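The scoring step of that loop could be as simple as sequence-matching the instruction mnemonics of the candidate's disassembly against the target's. This is just a toy stand-in for a real distance metric, and the disassembly text format is assumed, not tied to any particular tool:

```python
from difflib import SequenceMatcher

def mnemonics(disasm: str) -> list[str]:
    """Extract the mnemonic (first token) from each line of disassembly text."""
    return [line.split()[0] for line in disasm.strip().splitlines() if line.split()]

def similarity(candidate_asm: str, target_asm: str) -> float:
    """Score in [0, 1]: how closely the candidate's mnemonic sequence matches the target's.

    Operands are deliberately ignored, so register allocation and constants
    don't penalize an otherwise structurally identical candidate.
    """
    return SequenceMatcher(None, mnemonics(candidate_asm), mnemonics(target_asm)).ratio()
```

The outer loop would then be: generate a candidate function, compile and disassemble it, score it with `similarity`, and keep tweaking the source while the score improves.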
Yeah, I ended up creating new variables to get the compile to succeed.
As for generating functions, I'll have to think about what that loss function would look like. I've been looking at asm2vec[1] and structure2vec[2] for inspiration. I'm currently looking at different kinds of graph embeddings, because even answering the basic question of "are these N bytes of assembly semantically similar to these other N bytes" is a challenge.
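For calibration I like having a deliberately crude, position-independent baseline to beat, e.g. cosine similarity over opcode histograms. To be clear, this is an illustrative strawman of my own, not what asm2vec or structure2vec do:

```python
from collections import Counter
from math import sqrt

def opcode_histogram(asm: str) -> Counter:
    """Count instruction mnemonics (first token per line), ignoring operands."""
    return Counter(line.split()[0] for line in asm.strip().splitlines() if line.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse histograms; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

It throws away all ordering and control-flow structure, which is exactly the information a graph embedding is supposed to capture, so the gap between this baseline and a learned embedding is a useful sanity check.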
That's not a bad idea. My first crack at this has been with a linux x64 target, but I have the infrastructure in place for mips, armv7, thumb, etc. I haven't tried compiling to very old/simple targets but I was considering using the MOVfuscator as one of the compilers.
Or maybe I can figure out how to tell LLVM to do some extreme strength reduction and target an ultra reduced subset of some ISA. Great food for thought, thanks!