> Is there any new approach in the works? Maybe something ML-based for optimizat...

anitil · on July 14, 2022

I'm not sure I follow - wouldn't many statically linked programs have much of some version of libc within them? So you could take any program, change it to be statically linked and use that for training?

That said I assume I'm missing something here.

Akronymus · on July 13, 2022

Could a best guess + fuzzing + compiling the decompiled code work towarda a heuristic?

hoosieree · on July 13, 2022

Not sure exactly what you mean by "best guess + fuzzing", but I have compiled code that was first decompiled by Ghidra. The problem is there are lots of invalid identifiers in the decompiled output.

The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

The Ghidra community basically says you should not expect the exported decompiled code to be valid [1,2]. Which is fine, since rount-trip compile-decompile-compile is not exactly Ghidra's purpose.

Maybe there's a setting to make Ghidra export asm literals when it can't figure out a valid disassembly, but I am pretty new to Ghidra so it could just be my own ignorance.

[1]: https://github.com/NationalSecurityAgency/ghidra/issues/236

[2]: https://github.com/NationalSecurityAgency/ghidra/issues/3553

Akronymus · on July 13, 2022

> The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

Split into new variable. Sounds like ghidra has trouble telling whether it is a reused stoarge location or actually the same variable.

Best guess = something that looks approxinately fitting for the relevant assembly

fuzzing = tweaking the source code to get what it compiles to closer to the actual assembly.

as in, generate a function, see how close its compilation resembles the assembly, tweak until you find a match

hoosieree · on July 13, 2022

Yeah, I ended up creating new variables to get the compile to succeed.

As for generating functions, I'll have to think about what that loss function would look like. I've been looking at asm2vec[1] and structure2vec[2] for inspiration. I'm currently looking at different kinds of graph embeddings, because even answering the basic question of "are these N bytes of assembly semantically similar to these other N bytes" is a challenge.

[1]: https://ieeexplore.ieee.org/document/8835340

[2]: https://arxiv.org/abs/1603.05629

Akronymus · on July 13, 2022

Maybe start with a simple fixed size instriction set? To get some methology down to later be refined. something like early 8 bit micros

hoosieree · on July 14, 2022

That's not a bad idea. My first crack at this has been with a linux x64 target, but I have the infrastructure in place for mips, armv7, thumb, etc. I haven't tried compiling to very old/simple targets but I was considering using the MOVfuscator as one of the compilers.

Or maybe I can figure out how to tell LLVM to do some extreme strength reduction and target an ultra reduced subset of some ISA. Great food for thought, thanks!