Automatic identification of CPU instruction sets from binaries

Reverse engineering is a crucial component of the security evaluation of proprietary products. Most of it is performed by disassembling (and decompiling) binary code to analyse its behaviour. While common computing platforms use structured executable formats such as PE or ELF, which specify the required CPU ISA, embedded platforms sometimes use proprietary binary formats or even flat binary images with no easy way to identify the architecture.

Current techniques for identifying the architecture are usually manual and time-consuming, for example trying every supported processor in IDA and inspecting the resulting code by hand. Automating this step would therefore save substantial time.

The goal of the study is to determine an efficient way to automatically identify the CPU ISA based on binary code only.

For example: given a raw, unencrypted, uncompressed firmware image for an embedded microcontroller, possibly containing both code and data, the result should be a list of probable CPU ISAs.
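To make the expected input and output concrete, here is a minimal sketch of one naive baseline. It is not the method the study is meant to arrive at. It counts occurrences of a few well-known instruction encodings (function returns and a common prologue) in a raw image and ranks the candidate ISAs by hit count. The signature table, the ISA labels and the function name rank_isas are assumptions made for this illustration. Such short patterns can also occur by chance in data, so the scores are only a rough hint.

```python
# Illustrative signature table (an assumption of this sketch, far from complete):
# each entry maps an ISA label to the byte encoding of a very common instruction.
SIGNATURES = {
    "x86 (32-bit)": bytes.fromhex("5589e5"),            # push ebp; mov ebp, esp
    "ARM (little-endian)": bytes.fromhex("1eff2fe1"),   # bx lr
    "MIPS (big-endian)": bytes.fromhex("03e00008"),     # jr $ra
    "MIPS (little-endian)": bytes.fromhex("0800e003"),  # jr $ra
    "PowerPC (big-endian)": bytes.fromhex("4e800020"),  # blr
}


def rank_isas(firmware: bytes) -> list[tuple[str, int]]:
    """Return (ISA label, hit count) pairs, most hits first.

    ISAs whose signature never occurs are left out of the result.
    """
    scores = []
    for name, pattern in SIGNATURES.items():
        hits = firmware.count(pattern)  # non-overlapping occurrences
        if hits > 0:
            scores.append((name, hits))
    # Sort by descending hit count, then by label so ties are deterministic.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores


if __name__ == "__main__":
    # Synthetic input: padding, three ARM returns, more padding, one x86 prologue.
    sample = (
        b"\x00" * 16
        + bytes.fromhex("1eff2fe1") * 3
        + b"\xff" * 8
        + bytes.fromhex("5589e5")
    )
    for isa, hits in rank_isas(sample):
        print(f"{isa}: {hits}")
```

On a real image the input would be read with open(path, "rb").read(). A more robust approach would likely need many more signatures per ISA, alignment checks, or statistical features, which is precisely what the study has to evaluate.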

Requirements

At least one person with knowledge of CPU architectures and instruction sets (ISAs).