
Reverse engineering hints

From REWiki

Here are some hints on how to reverse-engineer games and file formats.

Disassembling

(Deep knowledge of x86 assembly is required; The Art of Assembly Language by Randall Hyde might help there.)

Use your favorite disassembler to disassemble the game's main executable file. Some of the available disassemblers are:

  • W32Dasm
  • IDA Pro. Version 4.3 freeware can be downloaded from DataRescue or other sites (google for freeida43.exe). The freeware version can disassemble DOS and Win32 programs, and it runs under Linux/x86 using Wine.

Then, simply load the executable into the disassembler and start exploring it. Sometimes it is necessary to unpack the executable before it can be disassembled. A tool which does that is available, for example, here (download unp410.arj).

First you need to find out what the different functions do. DOS system calls (often called software interrupts) are documented here, a VGA reference can be found here. To find the relevant functions which operate on your data file, first find out where the filename is stored in the EXE file. Then, find cross-references to this address and check what functions are called directly after loading this address.

DOSBox

(Knowledge of C/C++ and assembly required)

Download DOSBox and compile it with debug support enabled. Then, add some debug instrumentation calls to the relevant DOS system calls in DOS_21Handler(void), which is in dos/dos.cpp. Interesting calls are CREATE, OPEN, READ and LSEEK. Add something like

LOG(LOG_FILES,LOG_NORMAL)("DOS: File %s opened, result:al=%d",DOSNAMEBUF,reg_al); 

to the case-statements in which you're interested. Then run your game and make sure you capture the debug-log to a file. Using this method you can find out where in a file your game tries to read, and how much it reads every time. This helps you greatly in reverse-engineering a game datafile using the manual method described next.

Wine

On Unix systems, you can install Wine and run Windows games with it. At the very least, it can verbosely log all API calls that the game makes. For example, you can log all file operations by setting the WINEDEBUG environment variable to "trace+file" and then running `wine | tee file.log`. This logs file open/read/write operations and may help you get a basic idea of the structure of the game's data files. More options for WINEDEBUG are listed on the Wine wiki. There are other debugging capabilities as well; see man winedbg.

Manual method

Windows

You only need a hex editor for this one. I suggest frhed, simply because it's free and good. Note: don't use it to open huge files, as it tries to load everything into RAM first! I also use the viewer which is built into Total Commander for quick viewing and searching in data files.

Unix

There are many hex editors for *nix.

  • Curses-based editors: hexedit, Midnight Commander's builtin editor, tweak (supports setting custom width and offset for examining the data), hexcurse (uses whole width of a terminal, so useful for examining data with different lengths).
  • GUI editors: Ghex (GTK)

You can also use a combination of non-interactive utilities:

  • hexdump -C file | less (canonical hex+ASCII dump)
  • xxd file | less (xxd from the vim package supports different column counts (-c N) and byte group sizes (-g N))
  • od -tx1 -C file | less (has many options for output formatting, grouping etc.)

For identifying standard file types (sound, graphics, etc.) with non-standard names the file command is helpful. It uses known file magics to identify the file format.
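The same idea is easy to apply in your own tools. The following is a minimal sketch in C, not a replacement for file: the table holds only a handful of widely documented signatures, and the names sniff_magic and known_magics are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* One known signature: the bytes expected at offset 0 and a label for them. */
struct magic {
    const char *bytes;
    size_t      len;
    const char *label;
};

/* A deliberately tiny table; file(1) knows far more of these. */
static const struct magic known_magics[] = {
    { "GIF87a",            6, "GIF image" },
    { "GIF89a",            6, "GIF image" },
    { "\x89PNG\r\n\x1a\n", 8, "PNG image" },
    { "BM",                2, "BMP image" },
    { "RIFF",              4, "RIFF container (WAV, AVI, ...)" },
    { "II*\0",             4, "TIFF image (little endian)" },
};

/* Return a label for the data at buf, or NULL if no signature matches. */
const char *sniff_magic(const unsigned char *buf, size_t len)
{
    size_t count = sizeof(known_magics) / sizeof(known_magics[0]);

    for (size_t i = 0; i < count; i++) {
        const struct magic *m = &known_magics[i];
        if (len >= m->len && memcmp(buf, m->bytes, m->len) == 0)
            return m->label;
    }
    return NULL;
}
```

Short magics such as "BM" match a lot of unrelated data, so treat a hit as a hint to look closer rather than as proof.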

You can also dump all strings (or everything that looks like a string) from an arbitrary file with the strings command.

Many other common Unix utilities are useful here in many different ways, too.

Collection files

Most data files are simply a collection of smaller individual files. They might have a directory structure somewhere (either at the beginning or at the end), which I'll call the global header, or they might simply be concatenated files with small local headers.

To find the global header simply look at the beginning of your file. If there seem to be filenames in there, then you probably have a file with a directory at the beginning. If you don't find a header at the beginning, then look at the end.

After you've found a directory structure, you need to look for (offset, size) pairs near the filenames. They are normally 32-bit values which describe the offset and length of the individual files in your data file. The offset part normally increases from file to file, because the files in the directory are sorted. If you can only find one number for each filename, then it's most probably the offset. The length is then calculated by taking the difference to the offset of the next file (or, for the last file in the directory, to the end of the file).
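The offset-only case can be sketched in C as follows. This assumes the offsets have already been read from the directory into an array; the function name and its parameters are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill sizes[i] for a directory that stores only ascending offsets.
 * Returns 0 on success, -1 if the offsets are not ascending or point
 * past the end of the archive (so the guess about the format is wrong).
 */
int sizes_from_offsets(const uint32_t *offsets, size_t count,
                       uint32_t archive_size, uint32_t *sizes)
{
    for (size_t i = 0; i < count; i++) {
        /* The last entry runs to the end of the archive. */
        uint32_t next = (i + 1 < count) ? offsets[i + 1] : archive_size;

        if (offsets[i] > next || next > archive_size)
            return -1;
        sizes[i] = next - offsets[i];
    }
    return 0;
}
```

A return value of -1 is useful in itself: it tells you that the numbers you picked are probably not offsets, or that the directory is not sorted.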

If you can't find a directory structure, chances are that you're looking at a file which only has local headers. Try to find a filename in the first few hundred bytes of the datafile. This might be the name of the first file, followed somewhere nearby by the length of this file. See if you can find a (not too big) 32-bit number near the filename and skip that many bytes in your datafile. Look around there to see if there's another filename. Then do that again. If it works for the first 3 files, you have probably found out enough already to write a simple C program to extract these files.
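Such a walk over local headers might look like the sketch below. The header layout used here (a 12-byte NUL-padded name followed by a 32-bit little-endian length, then the data) is purely an assumption for illustration; adjust NAME_LEN and HEADER_LEN to whatever you actually see in the hex editor.

```c
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN   12u               /* assumed: NUL-padded file name  */
#define HEADER_LEN (NAME_LEN + 4u)   /* assumed: name + 32-bit LE size */

/* Read a 32-bit little-endian value byte by byte (works on any host). */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Walk the chain of local headers in buf and return the number of
 * entries, or -1 if the chain does not end exactly at the end of buf.
 */
long count_entries(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    long count = 0;

    while (len - pos >= HEADER_LEN) {
        uint32_t size = read_le32(buf + pos + NAME_LEN);

        pos += HEADER_LEN;
        if (size > len - pos)
            return -1;               /* entry claims more data than exists */
        pos += size;                 /* skip the data, land on next header */
        count++;
    }
    return (pos == len) ? count : -1;
}
```

If the walk ends exactly at the end of the file, the guessed layout is very likely right; extracting the files is then a matter of writing out size bytes at each step.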

If you can't find any filenames at all, it's a bit harder to reverse engineer the file. Maybe files are not accessed by name but by index. Look at the beginning and try to find out if what you see could be a table of some sort, again with (offset, length) pairs. For each entry in the table, Offset + Length should be equal (or almost equal) to the next Offset value. Interpret some of the 32-bit values you find as offsets and look at these addresses to see if a file starts there. Look for known file magics ("GIF87a", "II", "BM", ...) or for strings that are the same at each offset (sometimes all files in a game archive are of the same type and start with the same magic bytes).
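The Offset + Length test can be automated. The sketch below assumes the candidate table consists of 8-byte entries (32-bit little-endian offset, then 32-bit little-endian length); the function name and the slack parameter, which allows for a few padding bytes between files, are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value byte by byte. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Treat table as count candidate (offset, length) pairs and return how
 * many entries end where the next one begins, allowing up to slack
 * bytes of padding in between.
 */
size_t count_contiguous(const unsigned char *table, size_t count,
                        uint32_t slack)
{
    size_t hits = 0;

    for (size_t i = 0; i + 1 < count; i++) {
        /* 64-bit sum so that garbage values cannot overflow. */
        uint64_t end  = (uint64_t)read_le32(table + 8 * i)
                      + read_le32(table + 8 * i + 4);
        uint64_t next = read_le32(table + 8 * (i + 1));

        if (next >= end && next - end <= slack)
            hits++;
    }
    return hits;
}
```

If nearly all of the count - 1 neighbouring pairs match, you have most likely found the table; if almost none do, try another entry size or field order.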

To quickly find out if a data file contains unencrypted/uncompressed game content, you can try using this little tool to rip (some) known files from the datafile.

Examining bits of content

Though sometimes file structure is obvious, and/or the game uses well known file formats, it's highly probable that you will need to determine the structure and purpose of seemingly random data. This applies to both separate files (which may either come with the games as-is or be extracted from collection file as described above) and parts of a large file of unknown structure.

To determine what the data is, first look at its structure on a byte level. Is it regular? Are there repeated bytes or sequences of bytes? Are some byte values more frequent than others? Are there byte patterns known to you? Are the values of adjacent bytes close to each other?

Structured data

Are there many 0x00's or 0xFF's? The content is likely a bunch of integer values. Note how the zeroes (or F's) of the high-order bytes stand out; they are stored at the higher offsets in the file, because x86 is a little-endian architecture. The following example is a dword, dword, word, word and signed dword.

00000000  0c 00 00 00 45 23 01 00  01 00 23 01 cc ed ff ff  |....E#....#.....|
          \_________/ \_________/  \___/ \___/ \_________/
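To check such a guess, decode the bytes and see whether the values make sense. A minimal sketch in C, with a made-up struct record and field names a to e standing in for the five guessed fields:

```c
#include <stdint.h>

/* Read little-endian values byte by byte, independent of host byte order. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* The guessed layout: dword, dword, word, word, signed dword. */
struct record {
    uint32_t a;
    uint32_t b;
    uint16_t c;
    uint16_t d;
    int32_t  e;
};

struct record decode_record(const unsigned char *p)
{
    struct record r;

    r.a = read_le32(p);
    r.b = read_le32(p + 4);
    r.c = read_le16(p + 8);
    r.d = read_le16(p + 10);
    /* Reinterpret as two's complement; mainstream compilers wrap here. */
    r.e = (int32_t)read_le32(p + 12);
    return r;
}
```

For the 16 bytes of the dump, this yields 12, 0x12345, 1, 0x123 and -0x1234 (that is, -4660), all of which look like plausible small integers.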

Or are the zeroes in the low-order bits? Are there periodic bit patterns? Then these are likely floating point values. The periodic bit patterns come from the fact that not all values can be represented as a finite binary floating point number (e.g. 1/5 = 0.001100110011... in binary).

00000000  00 00 00 3f 00 00 80 3f  ab aa aa 3f 55 55 d5 3f  |...?...?...?UU.?|
          \___1/2___/ \___1.0___/  \___4/3___/ \___5/3___/
00000010  55 55 55 55 55 55 d5 3f  9a 99 99 99 99 99 c9 3f  |UUUUUU.?.......?|
          \_________1/3_________/  \_________1/5_________/
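A quick way to test the floating point hypothesis is to reinterpret the bytes and print the result. The sketch below assumes that the host's float and double are IEEE 754 single and double precision values (true on all common PCs); the function names are invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Decode 4 little-endian bytes as an IEEE 754 single precision value. */
float read_le_float(const unsigned char *p)
{
    uint32_t bits = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    float f;

    memcpy(&f, &bits, sizeof f);   /* well-defined way to reinterpret bits */
    return f;
}

/* Decode 8 little-endian bytes as an IEEE 754 double precision value. */
double read_le_double(const unsigned char *p)
{
    uint64_t bits = 0;
    double d;

    for (int i = 7; i >= 0; i--)   /* most significant byte is stored last */
        bits = bits << 8 | p[i];
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Values that come out as round numbers or as something in a sensible range (coordinates, angles, scale factors) support the guess; huge exponents or NaNs suggest the data is something else or is misaligned.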

If you encounter files with data of this kind, look for a period in their structure.

  • If there is one, it's a table (and it is likely preceded by its size (number of elements); look for it to find where the table starts).
  • If there is no period, this may be a table with entries of variable size, in which case each entry should have its length, or the length of its varying field(s).
  • There's also a chance that this is not a table but a collection of different elements. In this case, there should be cross references (likely offsets in the file) or at least length+type fields for each object.
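Looking for a period can be partly automated. One simple heuristic, sketched below under the assumption that records of a table tend to have the same bytes (typically zero high-order bytes) at the same positions, is to count for each candidate period how often a byte equals the byte one period later. The function name and parameters are made up for the example.

```c
#include <stddef.h>

/*
 * Return the candidate period in [min_p, max_p] for which the largest
 * share of bytes equals the byte one period later. Returns 0 if no
 * candidate could be tested.
 */
size_t guess_period(const unsigned char *data, size_t len,
                    size_t min_p, size_t max_p)
{
    size_t best_p = 0;
    double best_ratio = -1.0;

    if (min_p == 0)
        min_p = 1;                  /* a period of 0 matches trivially */
    for (size_t p = min_p; p <= max_p && p < len; p++) {
        size_t same = 0;

        for (size_t i = 0; i + p < len; i++)
            same += (data[i] == data[i + p]);

        double ratio = (double)same / (double)(len - p);
        if (ratio > best_ratio) {   /* strict: ties keep the smaller period */
            best_ratio = ratio;
            best_p = p;
        }
    }
    return best_p;
}
```

Multiples of the real record size score about as well as the record size itself, so check the smaller candidates first and always confirm the result by eye in a hex editor set to that width.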

When you get an idea of this structure, look at all possible values for each field. Examine the number of distinct values, the range and the signedness, and check for flag fields (those with a low number of bits set and separate bits toggled). Compare that to what you know of the game itself and of other files to determine the meaning of all fields. For example, if you know that there's a map MxM squares in size, a field with values in the range [0..M-1] may be a position on the map.

Graphical data

Distinctive triplets? Likely a true color image, or a palette (look at the size).

00000070  cf 3e cc cd 3e c5 c5 44  bd be 4a bc bd 4b c2 c3  |.>..>..D..J..K..|
00000080  46 cb cc 3f ce cf 3c c8  c9 40 bf c0 48 b5 b6 51  |F..?..<..@..H..Q|

Similar features on a large scale? Likely an image. This is a 35-pixel-wide indexed color image:

000001d0  04 0e 22 36 4c 51 47 2c  15 05 00 00 00 00 00 00  |.."6LQG,........|
000001e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001f0  00 00 00 01 0b 20 35 48  4a 3e 2f 1e 0a 01 00 00  |..... 5HJ>/.....|
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000210  00 00 00 00 00 00 00 07  19 30 47 48 36 25 18 0a  |.........0GH6%..|
00000220  03 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000230  00 00 00 00 00 00 00 00  00 00 02 0c 24 3d 49 36  |............$=I6|
00000240  22 10 07 01 00 00 00 00  00 00 00 00 00 00 00 00  |"...............|
00000250  00 00 00 00 00 00 00 00  00 00 00 00 00 00 06 16  |................|
00000260  2f 48 41 29 13 04 00 00  00 00 00 00 00 00 00 00  |/HA)............|
00000270  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000280  00 00 07 1a 38 4e 43 28  0f 02 00 00 00 00 00 00  |....8NC(........|

Also look for runs of adjacent bytes whose values lie in a tight range or change smoothly (e.g. an area of the image filled with a single tone).
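The width of such an image can often be guessed automatically, because neighbouring rows of a picture resemble each other. The following sketch (function name and parameters invented here, one byte per pixel assumed) tries each candidate width and picks the one where pixels differ least from the pixel directly below them.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Return the width in [min_w, max_w] with the smallest mean absolute
 * difference between each pixel and the pixel one row below it.
 * Returns 0 if no width could be tested.
 */
size_t guess_width(const unsigned char *pix, size_t len,
                   size_t min_w, size_t max_w)
{
    size_t best_w = 0;
    double best_score = 0.0;

    if (min_w == 0)
        min_w = 1;
    for (size_t w = min_w; w <= max_w && w < len; w++) {
        unsigned long long total = 0;

        for (size_t i = 0; i + w < len; i++)
            total += (unsigned long long)abs((int)pix[i] - (int)pix[i + w]);

        double score = (double)total / (double)(len - w);
        if (best_w == 0 || score < best_score) {
            best_score = score;
            best_w = w;
        }
    }
    return best_w;
}
```

Smooth images also score well for very small widths, since horizontal neighbours are similar too, so start min_w at something like 8 and verify the winner by dumping the data as an image with that width.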

Entropy

For any (large enough) chunk of data you can count the number of bytes having each of the 256 possible values. From these counts, a quantity called entropy can be calculated. Simply speaking, if all byte values occur with similar frequencies, the entropy is high, which may be a sign of encryption or compression. If some values appear noticeably more frequently than others, the entropy is low and there should be some meaningful format.
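For reference, the usual definition is the Shannon entropy of the byte histogram. The symbols below are introduced only for this formula: n_i is the number of bytes with value i, and N is the total number of bytes in the chunk.

```latex
p_i = \frac{n_i}{N}, \qquad
H = -\sum_{i=0}^{255} p_i \log_2 p_i
\quad \text{(terms with } p_i = 0 \text{ count as } 0\text{)}
```

H is measured in bits per byte. It is 0 when only a single byte value occurs and reaches its maximum of 8 when all 256 values are equally frequent (256 terms of 1/256 * 8). Compressed or encrypted data typically comes close to 8, while text and simple tables are usually well below that.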

Data size

Sometimes the mere size of a file tells you enough about its contents.

  • Is it 2^x, (2^x)*3, x^2 or (x^2)*3? Likely an uncompressed image.
  • Is it 512 or 768? Maybe a high color or true color palette (256 entries of 2 or 3 bytes each).
  • Does size vary for files that should be of the same size (i.e. sprites)? If so, files are likely compressed.
  • Files with different sizes, but each a multiple of a single constant? These should be tables/packs of some kind, and you now also know the element size. The sizes may also be multiples of one constant plus another constant (header/number of elements).
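The "multiple of a single constant" check amounts to taking the greatest common divisor of the file sizes, after subtracting a header size if you suspect one. A small sketch, with invented function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd(0, x) is x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Largest record size that divides every (size - header) evenly.
 * Returns 0 if there are no files or a file is smaller than the header.
 */
uint64_t common_record_size(const uint64_t *sizes, size_t count,
                            uint64_t header)
{
    uint64_t g = 0;

    for (size_t i = 0; i < count; i++) {
        if (sizes[i] < header)
            return 0;
        g = gcd_u64(g, sizes[i] - header);
    }
    return g;
}
```

The real element size is this value or one of its divisors. A result of 1 or 2 for several files usually means the guessed header size is wrong or the files are not plain tables.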

Look for specific stuff

If you know what a specific file or chunk of data is, look around it: there should be files of the same or a related type. If there's an indexed color image, there should be a palette. If there's a map, there should be data on which objects are located on that map.

Remember which data the game uses and think about how it may be grouped and how it may be stored. Maps: raster or vector? If raster, how is each square stored? Is it 16, 8 or maybe 4 bits per square (how many types of surface are there, and does it also need to store height/passability data)? If it's vector, is it stored as floating point values or as integers (fixed point)?

Is there some distinctive chunk of data that you know should be present in one place, but not in another? For example, water implemented in a way that requires a separate depth map may be present on one map, but not on another. Or a boss which (in contrast to basic enemies) may need extra data, such as its predefined movement path, special moves, spawn points if it spawns other enemies etc. Compare stuff.

Use the game itself

You can use the game itself to get data. Swap two level files and run the game. Does level order simply change? If yes, everything related to a level is stored in a single file. Else, look in other places.

Change a single value and see what it affects.

Damage data and check for error messages (those like "cannot decrypt/decompress level heightmap" are pretty informative).

Take screenshots to get decompressed images and compare them to compressed ones.

Take an image of the whole memory used by a program (on Unix, you can just kill the running program and take the .core file from it; this works when killing DOSBox running a program and should work with Wine running a program as well). Then compare stuff in files and in memory.

Writing an unpacker

When you have all the needed info in hand, you'll need to write an unpacker for the data format. It should do three things:

  • Obviously, extract game data
  • Be a description of data format and a way of working with it
  • Check all assumptions you have on game data

Common rules:

  • The code should be short, clear and have comments where something non-obvious is done.
  • If you work with a file of structured format, use C structs to work with it; it's much easier to understand the data structure that way.
  • Use fields of fixed size (int16_t, uint32_t). Today both 32-bit and 64-bit machines are widely used, so types like int and long may have different sizes on different platforms.
  • There is also the endianness problem: most file formats use little endian (as our x86/x64 PCs do), but truly cross-platform software needs to take care of byte order. However, I guess it's OK to leave that to those writing an engine reimplementation.
  • Be careful with C structure alignment. That is, this structure:
struct foo {
    int16_t one;
    int32_t two;
    int16_t three;
};

takes 12 bytes on typical platforms, not 8 as some people would expect. First, there is a 2-byte gap after the field one, as two is aligned to its size. Next, there is a 2-byte gap after three, as the structure is aligned to the largest of its members' alignments (so that if you have struct foo array[2], the second struct also has its two field 4-byte aligned, and sizeof(array)/sizeof(struct foo) is correct as well). Because of this, the in-memory representation of a structure may differ from its on-disk representation, and reading such a structure from disk as-is will give wrong results. Use __attribute__((__packed__)) (a GCC/Clang extension) to suppress alignment of structure fields where it is required.
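An alternative that avoids compiler extensions, and takes care of endianness at the same time, is to read the on-disk bytes into a buffer and fill the struct field by field. This is only a sketch: it assumes the on-disk layout is the 8 bytes described above in little-endian order, and FOO_DISK_SIZE and parse_foo are names made up for the example.

```c
#include <stdint.h>

struct foo {
    int16_t one;
    int32_t two;
    int16_t three;
};

#define FOO_DISK_SIZE 8   /* 2 + 4 + 2 bytes on disk, no padding */

static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Fill *out from FOO_DISK_SIZE bytes of little-endian on-disk data. */
void parse_foo(const unsigned char *buf, struct foo *out)
{
    /* The casts reinterpret the values as two's complement. */
    out->one   = (int16_t)read_le16(buf);
    out->two   = (int32_t)read_le32(buf + 2);
    out->three = (int16_t)read_le16(buf + 6);
}
```

With this approach the in-memory struct can keep its natural alignment, and the code documents the on-disk offsets of every field explicitly.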

  • Separate isolated routines (decompression, format conversion, decryption) into functions taking no more arguments than needed. Ideally, (unsigned char *input, size_t in_size, unsigned char *output, size_t out_size).
  • Split code working with different parts of file into separate routines. Split code working with different levels of nested structures into separate routines. Try to not use global data (pass needed data by pointer/reference).
  • Extract data into formats which conventional utilities can understand. That is, graphics should be extracted as well-known BMP or PNG (useful when you need alpha), audio as WAV, maps as either images or plaintext, structured data as plaintext as well.
  • Check all assumptions you have. Use assert() or just something like if (val < 0xff) die_with_error("range assumptions for val failed"); in your code. Especially check the things your code relies on, to avoid buffer overflows, unexpected behaviour, allocating -1 bytes etc.

This way your unpacker will not only work nicely, but will also be:

  • safe as in not eating all your swap or filling your HDD with garbage or even executing arbitrary code
  • a sufficient description of game data format (well you can have bad wording when describing the format in a human language, but the code is much more straightforward and unambiguous)
  • helpful in detecting your misassumptions on data format. That is, even if extractor works well with a single game, it may fail with e.g. addon, mod, expansion pack, separate game on the same engine, or separate game which uses the same data format. If the unpacker tells where the problem is instead of just segfaulting, another person may easily extend it to support required features.
  • ready to be embedded into engine reimplementation with minimal changes, or to be converted into a separate library which works with the file format