
Creating a Dalvik parser in Rust (Part 1)

As you might imagine, a good Rust Android analyzer should be written entirely in Rust. SUPER's analyzer itself is 100% Rust, but it still relies on some external dependencies to translate the .apk file into the Java code we analyze. That will change in the future, starting with a really efficient Dalvik parser. Even though the Dalvik virtual machine itself is discontinued, the .dex files inside .apk files still use the Dalvik executable format.

In this first document we will understand the dex format and try to create the first basic data structures in Rust. Let’s first look at the main file structure:

The first section of a .dex file, the header, contains information about the data inside the file. It must be 112 bytes long, and it must be the first section. The fields in this structure, in order, are the following:

To parse this header, we can create a simple Header struct that will hold this information:

pub struct Header {
    magic: [u8; 8],
    checksum: u32,
    signature: [u8; 20],
    file_size: usize,
    header_size: usize,
    endian_tag: u32,
    link_size: Option<usize>,
    link_offset: Option<usize>,
    map_offset: usize,
    string_ids_size: usize,
    string_ids_offset: Option<usize>,
    type_ids_size: usize,
    type_ids_offset: Option<usize>,
    prototype_ids_size: usize,
    prototype_ids_offset: Option<usize>,
    field_ids_size: usize,
    field_ids_offset: Option<usize>,
    method_ids_size: usize,
    method_ids_offset: Option<usize>,
    class_defs_size: usize,
    class_defs_offset: Option<usize>,
    data_size: usize,
    data_offset: usize,
}
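One plausible reading of the Option fields above is that the dex format documentation uses an offset of 0 to mean that a section is absent (for example when its size is 0). Under that assumption, the conversion from the raw 32-bit value could look like the sketch below; `offset_from_raw` is a hypothetical helper name, not something taken from the SUPER code.

```rust
/// Maps a raw 32-bit offset to `Option<usize>`, treating 0 as "section absent".
/// Hypothetical helper: the 0-means-absent convention is the one described in
/// the dex format documentation, the name is made up for this example.
fn offset_from_raw(raw: u32) -> Option<usize> {
    if raw == 0 {
        None
    } else {
        Some(raw as usize)
    }
}
```

With this, a header whose link section is missing would end up with `link_offset: None` instead of a meaningless offset of 0.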

I changed offsets and sizes to usize since they will be easier to use with Rust types. The full header can be seen in the Git repository. Parsing each field is as easy as this, using the byteorder crate:

let link_size = try!(if endian_tag == ENDIAN_CONSTANT {
    reader.read_u32::<LittleEndian>()
} else {
    reader.read_u32::<BigEndian>()
}) as usize;
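To see what such a read does under the hood, here is a minimal sketch of the same idea using only the standard library instead of byteorder. The function name `read_u32_with` and the `big_endian` flag are assumptions made for this illustration, and the `?` operator is the modern equivalent of `try!`.

```rust
use std::io::{self, Read};

/// Reads 4 bytes from `reader` and decodes them as a u32 in the requested byte order.
/// `read_u32_with` is a hypothetical helper, not part of byteorder or SUPER.
pub fn read_u32_with<R: Read>(reader: &mut R, big_endian: bool) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    // Fails with an error if fewer than 4 bytes are available.
    reader.read_exact(&mut buf)?;
    Ok(if big_endian {
        u32::from_be_bytes(buf)
    } else {
        u32::from_le_bytes(buf)
    })
}
```

The result can then be cast with `as usize`, as in the byteorder version.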

The comparison of endian_tag against ENDIAN_CONSTANT tells us whether the file is in little or big endian. Since we want the file to be read as fast as possible, we use a Read object:

impl Header {
    pub fn from_reader<R: Read>(mut reader: R) -> Result<Header> {
        // Magic number
        let mut magic = [0u8; 8];
        try!(reader.read_exact(&mut magic));
        // Checksum
        let mut checksum = try!(reader.read_u32::<LittleEndian>());
        // Signature
        let mut signature = [0u8; 20];
        try!(reader.read_exact(&mut signature));
        // File size
        let mut file_size = try!(reader.read_u32::<LittleEndian>());
        // Header size
        let mut header_size = try!(reader.read_u32::<LittleEndian>());
        // Endian tag
        let endian_tag = try!(reader.read_u32::<LittleEndian>());
        // Check endianness
        if endian_tag == REVERSE_ENDIAN_CONSTANT {
            // The file is in big endian instead of little endian.
            checksum = checksum.swap_bytes();
            file_size = file_size.swap_bytes();
            header_size = header_size.swap_bytes();
        }

        // .... //
    }
}

Here, we read the file sequentially, which means that the checksum, the file_size and the header_size are read before we know the endianness of the file. Since files are usually in little endian, we read them as little endian, and if the tag tells us the file is actually big endian, we swap the bytes in memory. The swap_bytes() method used for this is part of the standard library.

Reading the file sequentially makes the read really fast, but it means that we need information about what we are reading, and if we don’t have it, we will need to go back once we have that information. The code in the repo has some more checks that are implemented for easier error reporting.
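The read-then-swap approach can be tried out on a small scale with the standard library alone. In the sketch below the two constants have the values given in the dex format documentation, while the function name `read_size_before_tag` and the reduced layout (one field followed by the tag) are assumptions made only for this example.

```rust
use std::io::{self, Read};

// Values defined by the dex format documentation.
const ENDIAN_CONSTANT: u32 = 0x1234_5678;
const REVERSE_ENDIAN_CONSTANT: u32 = 0x7856_3412;

/// Reads one u32 field followed by the endian tag, mimicking the header order
/// on a tiny scale. Hypothetical helper and layout, for illustration only.
pub fn read_size_before_tag<R: Read>(mut reader: R) -> io::Result<u32> {
    let mut buf = [0u8; 4];

    // The field comes first, so we optimistically decode it as little endian.
    reader.read_exact(&mut buf)?;
    let mut size = u32::from_le_bytes(buf);

    // Only now do we learn the real byte order.
    reader.read_exact(&mut buf)?;
    let endian_tag = u32::from_le_bytes(buf);

    if endian_tag == REVERSE_ENDIAN_CONSTANT {
        // Big-endian file: fix the value we already read, in memory.
        size = size.swap_bytes();
    } else if endian_tag != ENDIAN_CONSTANT {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid endian tag"));
    }
    Ok(size)
}
```

Both a little-endian and a big-endian encoding of the same value decode to the same number, and no seek back in the input is needed.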

Some interesting checks can be done with the magic number:

fn is_magic_valid(magic: &[u8; 8]) -> bool {
    &magic[0..4] == &[0x64, 0x65, 0x78, 0x0a] && magic[7] == 0x00 &&
    magic[4] >= 0x30 && magic[5] >= 0x30 && magic[6] >= 0x30 && magic[4] <= 0x39 &&
    magic[5] <= 0x39 && magic[6] <= 0x39
}

This checks if the magic number is valid. It first checks that the first 4 characters are dex\n and that the last character is a \0. Then it checks if the other 3 characters are digits (they must be, they will represent the version of the dex file). The version number can also be parsed efficiently:
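As a possible alternative, the same rule can be written with slice and byte methods from the standard library, which some readers may find easier to scan. `is_magic_valid_alt` is a hypothetical name; the behaviour is intended to match the function above.

```rust
/// Same rule as `is_magic_valid`, written with `starts_with` and `is_ascii_digit`.
/// Hypothetical alternative, not the code used in SUPER.
fn is_magic_valid_alt(magic: &[u8; 8]) -> bool {
    magic.starts_with(b"dex\n")                          // "dex\n" prefix
        && magic[4..7].iter().all(u8::is_ascii_digit)    // three version digits
        && magic[7] == 0x00                              // trailing NUL
}
```

For instance, `is_magic_valid_alt(b"dex\n035\0")` is accepted, while a magic with a letter in the version field is rejected.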

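pub fn get_dex_version(&self) -> u16 {
    // Widen each digit to u16 before multiplying: a hundreds digit of 3 or more
    // would overflow a u8.
    (self.magic[4] - 0x30) as u16 * 100 + (self.magic[5] - 0x30) as u16 * 10 +
    (self.magic[6] - 0x30) as u16
}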

It’s as simple as subtracting 0x30 from each digit (in ASCII, 0x30 is 0 and the digits run in order from 0x30 to 0x39). Then, each digit is multiplied by its place value in the 3-digit decimal number (100, 10 and 1).
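A worked example may make the arithmetic concrete: for the magic dex\n035\0 the digit bytes are 0x30, 0x33 and 0x35, so the version is 0 * 100 + 3 * 10 + 5 = 35. The sketch below does the same computation as a free function; `dex_version` is a hypothetical name, and it returns a u16 as a precaution so that any three-digit version fits.

```rust
/// Hypothetical free-function variant of the version parser.
/// Assumes the magic has already been validated, so bytes 4..7 are ASCII digits.
fn dex_version(magic: &[u8; 8]) -> u16 {
    // Convert an ASCII digit to its numeric value, widening before multiplying.
    let digit = |b: u8| u16::from(b - b'0');
    digit(magic[4]) * 100 + digit(magic[5]) * 10 + digit(magic[6])
}
```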

IDs lists

After the header, a dex file contains IDs sections. These are arrays of offsets or indexes to other lists, so they don’t actually contain data. These sections are the following:

Conclusion and next steps

Creating a dex file parser is not a one-day job, but as we have seen here, there is plenty of documentation available, even though, as we will see in the next post, it won’t be enough. We will need to learn about unmapped data and the actual map section. We will need to parse the map, and we will need to read the file efficiently, storing the offsets in a special data structure so that we can quickly make sense of the file while reading it sequentially.

See you in the next post!
