 08 February, 2019

Python for Reverse Engineering

Building your own disassembly tooling for -- that's right -- fun and profit

 While solving complex reversing challenges, we often use established
 tools like radare2 or IDA for disassembling and debugging. But there
 are times when you need to dig in a little deeper and understand how
 things work under the hood.

 Rolling your own disassembly scripts can be immensely helpful when it
 comes to automating certain processes, and eventually building your own
 homebrew reversing toolchain of sorts. At least, that's what I'm
 attempting anyway.

Setup

 As the title suggests, you're going to need a Python 3 interpreter
 before anything else. Once you've confirmed beyond reasonable doubt
 that you do, in fact, have a Python 3 interpreter installed on your
 system, run
$ pip install capstone pyelftools

 where capstone is the disassembly engine we'll be scripting with and
 pyelftools helps us parse ELF files.

 With that out of the way, let's start with an example of a basic
 reversing challenge.
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *pw = malloc(10);  /* 9 characters plus the terminating '\0' */
    pw[0] = 'a';
    for(int i = 1; i <= 8; i++){
        pw[i] = pw[i - 1] + 1;
    }
    pw[9] = '\0';
    char *in = malloc(10);
    printf("password: ");
    fgets(in, 10, stdin);   // 'abcdefghi'
    if(strcmp(in, pw) == 0) {
        printf("haha yes!\n");
    }
    else {
        printf("nah dude\n");
    }
}

 Compile it with GCC/Clang:
$ gcc chall.c -o chall.elf

Scripting

 For starters, let's look at the different sections present in the
 binary.
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    # Print each section's load address next to its name.
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)

 This script iterates through all the sections and shows us the address
 each one is loaded at. This will be pretty useful later. Running it
 gives us
> python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab

 Most of these aren't relevant to us, but a few sections are worth
 noting. The .text section contains the instructions (opcodes) that
 we're after. The .rodata section holds read-only data such as string
 literals, while .data holds writable globals initialized at compile
 time. Finally, there's the .plt, the Procedure Linkage Table, and the
 .got, the Global Offset Table. If you're unsure about what these mean,
 read up on the ELF format and its internals.
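
 Since we'll keep running into raw addresses, a tiny lookup helper comes
 in handy. This is only a sketch: the table is copied by hand from the
 output above, section_for is a name I made up, and because section
 sizes are ignored, an address that falls in a gap between two sections
 is attributed to the earlier one.

```python
from bisect import bisect_right

# (load address, name) pairs copied from the sections.py output;
# sections at address 0x0 are never loaded, so they are left out.
SECTIONS = [
    (0x238, ".interp"), (0x254, ".note.ABI-tag"), (0x274, ".note.gnu.build-id"),
    (0x298, ".gnu.hash"), (0x2c0, ".dynsym"), (0x3e0, ".dynstr"),
    (0x484, ".gnu.version"), (0x4a0, ".gnu.version_r"), (0x4c0, ".rela.dyn"),
    (0x598, ".rela.plt"), (0x610, ".init"), (0x630, ".plt"), (0x690, ".plt.got"),
    (0x6a0, ".text"), (0x8f4, ".fini"), (0x900, ".rodata"),
    (0x924, ".eh_frame_hdr"), (0x960, ".eh_frame"), (0x200d98, ".init_array"),
    (0x200da0, ".fini_array"), (0x200da8, ".dynamic"), (0x200f98, ".got"),
    (0x201000, ".data"), (0x201010, ".bss"),
]
STARTS = [addr for addr, _ in SECTIONS]


def section_for(addr):
    """Return the name of the last section starting at or below addr.

    Assumption: sizes are ignored, so this is a best guess, not a proof.
    """
    idx = bisect_right(STARTS, addr) - 1
    return SECTIONS[idx][1] if idx >= 0 else None


print(section_for(0x700))     # .text
print(section_for(0x201004))  # .data
```

 A more careful version would also read each section's sh_size and
 reject addresses that fall outside every section.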

 Since we know that the .text section has the opcodes, let's disassemble
 the binary starting at that address.
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()             # raw bytes of the section
    addr = code['sh_addr']        # address of the first instruction
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')

 The code is fairly straightforward (I think). Running it should show us
 this:
> python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
**0x6c4: call qword ptr [rip + 0x200916]**
0x6ca: hlt
... snip ...

 The line in bold is fairly interesting to us. rip points at the next
 instruction while the current one executes, so [rip + 0x200916] is
 equivalent to [0x6ca + 0x200916], which in turn evaluates to 0x200fe0,
 an address inside the .got. The very first call goes through a function
 pointer stored at 0x200fe0? What could this function be?
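
 The arithmetic is easy to get wrong by hand, so here is a minimal
 sketch of it. rip_relative_target is a hypothetical helper of mine, not
 part of capstone; the instruction size is taken from the listing (the
 call sits at 0x6c4 and the next instruction at 0x6ca).

```python
def rip_relative_target(insn_addr, insn_size, disp):
    """Resolve [rip + disp]; rip holds the address of the next instruction."""
    return insn_addr + insn_size + disp


# call qword ptr [rip + 0x200916] is at 0x6c4 and hlt follows at 0x6ca,
# so the call instruction is 0x6ca - 0x6c4 = 6 bytes long.
slot = rip_relative_target(0x6c4, 0x6ca - 0x6c4, 0x200916)
print(hex(slot))  # 0x200fe0
```

 Inside the disas1.py loop, the same two inputs should be available as
 i.address and i.size on each instruction.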

 For this, we will have to look at relocations. Quoting [1]linuxbase.org

    Relocation is the process of connecting symbolic references with
    symbolic definitions. For example, when a program calls a function,
    the associated call instruction must transfer control to the proper
    destination address at execution. Relocatable files must have
    "relocation entries" which are necessary because they contain
    information that describes how to modify their section contents,
    thus allowing executable and shared object files to hold the right
    information for a process's program image.

 To try and find these relocation entries, we write a third script.
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table this section uses
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')

 Let's run through this code real quick. We first loop through the
 sections and check whether each one is of the type RelocationSection.
 For every relocation in such a section, we then look up its symbol in
 the symbol table that the section links to, and print the symbol's name
 next to the address being relocated. Finally, running this gives us
> python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
**__libc_start_main 0x200fe0**
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
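
 For the curious, this is roughly what pyelftools decodes for us. A
 64-bit relocation entry with an addend (Elf64_Rela) is three 8-byte
 fields, and r_info packs the symbol index into its upper 32 bits and
 the relocation type into its lower 32 bits. The sketch below uses only
 the standard library; parse_rela64 is a name I made up, and the sample
 entry (symbol index 3, type R_X86_64_GLOB_DAT) is invented for
 illustration, not read from chall.elf.

```python
import struct

R_X86_64_GLOB_DAT = 6  # relocation type number from the x86-64 psABI


def parse_rela64(raw):
    """Decode one little-endian Elf64_Rela entry (24 bytes)."""
    r_offset, r_info, r_addend = struct.unpack("<QQq", raw)
    return {
        "r_offset": r_offset,                # address being patched
        "r_info_sym": r_info >> 32,          # index into the symbol table
        "r_info_type": r_info & 0xFFFFFFFF,  # relocation type
        "r_addend": r_addend,
    }


# Hypothetical entry: the symbol index and type are assumptions.
raw = struct.pack("<QQq", 0x200fe0, (3 << 32) | R_X86_64_GLOB_DAT, 0)
entry = parse_rela64(raw)
print(hex(entry["r_offset"]), entry["r_info_sym"], entry["r_info_type"])
```

 An entry whose symbol index is 0 refers to the empty first symbol,
 which would explain the three nameless lines at the top of .rela.dyn.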

 Remember the function call at 0x200fe0 from earlier? Yep, so that was a
 call to the well known __libc_start_main. Again, according to
 [2]linuxbase.org

    The __libc_start_main() function shall perform any necessary
    initialization of the execution environment, call the main function
    with appropriate arguments, and handle the return from main(). If
    the main() function returns, the return value shall be passed to the
    exit() function.

 And its prototype looks like so
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc, char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void (* stack_end));

 Looking back at our disassembly
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
**0x6bd: lea rdi, [rip + 0xe6]**
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...

 but this time, at the lea or Load Effective Address instruction, which
 loads the address [rip + 0xe6] into the rdi register. With rip at the
 next instruction (0x6c4), [rip + 0xe6] evaluates to 0x7aa, which
 happens to be the address of our main() function! How do I know that?
 Because rdi carries the first argument, and __libc_start_main(), after
 doing whatever it does, eventually calls the function it was handed
 there, which is the main() function.
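
 To make the register-to-argument mapping concrete, here is a small
 sketch that resolves the three lea instructions from the listing and
 names each one after the parameter it fills. It assumes the System V
 AMD64 calling convention (integer arguments in rdi, rsi, rdx, rcx, r8,
 r9, in that order) and takes the parameter names from the prototype
 quoted above.

```python
# Argument registers in order, and the matching parameter names of
# __libc_start_main (assumption: System V AMD64 calling convention).
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
PARAMS = ["main", "argc", "ubp_av", "init", "fini", "rtld_fini"]

# (address of the next instruction, destination register, displacement)
# for each lea in the listing.
LEAS = [
    (0x6b6, "r8", 0x23a),
    (0x6bd, "rcx", 0x1c3),
    (0x6c4, "rdi", 0xe6),
]

resolved = {}
for next_addr, reg, disp in LEAS:
    param = PARAMS[ARG_REGS.index(reg)]
    resolved[param] = next_addr + disp  # rip is the next instruction

for name in ("main", "init", "fini"):
    print(f"{name}: {hex(resolved[name])}")
# main: 0x7aa
# init: 0x880
# fini: 0x8f0
```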

 To see the disassembly of main, seek to 0x7aa in the output of the
 script we'd written earlier (disas1.py).

 From what we discovered earlier, each call instruction points to some
 function which we can see from the relocation entries. The calls in
 main land on stubs in the .plt, and each stub jumps through the slot
 listed for it in .rela.plt. So following each call into its relocation
 gives us this
puts 0x640
printf 0x650
fgets 0x660
strcmp 0x670
malloc 0x680
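
 These stub addresses don't have to be traced by hand. The sketch below
 derives them from the .plt address and the order of the .rela.plt
 entries. It leans on an assumption about the layout: a classic x86-64
 PLT with 16-byte entries whose first entry is the resolver stub. That
 matches this binary, but other toolchain settings can produce a
 different layout.

```python
PLT_BASE = 0x630     # .plt address from the sections.py output
PLT_ENTRY_SIZE = 16  # assumption: classic x86-64 PLT entry size

# Symbols in the order relocations.py printed them under .rela.plt.
RELA_PLT = ["puts", "printf", "fgets", "strcmp", "malloc"]

# Entry 0 of .plt is the resolver stub, so relocation n gets entry n + 1.
plt_stubs = {
    name: PLT_BASE + PLT_ENTRY_SIZE * (index + 1)
    for index, name in enumerate(RELA_PLT)
}

for name, addr in plt_stubs.items():
    print(name, hex(addr))
# puts 0x640
# printf 0x650
# fgets 0x660
# strcmp 0x670
# malloc 0x680
```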

 Putting all this together, things start falling into place. Let me
 highlight the key sections of the disassembly here. It's pretty
 self-explanatory.
0x7b2: mov edi, 0xa ; 10
0x7b7: call 0x680 ; malloc

 The loop to populate the *pw string
0x7d0: mov eax, dword ptr [rbp - 0x14]
0x7d3: cdqe
0x7d5: lea rdx, [rax - 1]
0x7d9: mov rax, qword ptr [rbp - 0x10]
0x7dd: add rax, rdx
0x7e0: movzx eax, byte ptr [rax]
0x7e3: lea ecx, [rax + 1]
0x7e6: mov eax, dword ptr [rbp - 0x14]
0x7e9: movsxd rdx, eax
0x7ec: mov rax, qword ptr [rbp - 0x10]
0x7f0: add rax, rdx
0x7f3: mov edx, ecx
0x7f5: mov byte ptr [rax], dl
0x7f7: add dword ptr [rbp - 0x14], 1
0x7fb: cmp dword ptr [rbp - 0x14], 8
0x7ff: jle 0x7d0
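
 One way to convince yourself of what the loop computes is to replay it
 in Python. This is a hand translation, not an emulator: the loop
 counter at [rbp - 0x14] becomes i and the buffer at [rbp - 0x10]
 becomes pw. The starting values (pw[0] = 'a', i = 1) are not visible in
 the excerpt above, so I've taken them from chall.c.

```python
pw = bytearray(10)   # the 10-byte buffer from malloc (mov edi, 0xa)
pw[0] = ord("a")     # initial value, assumed from chall.c

i = 1                # dword ptr [rbp - 0x14], assumed to start at 1
while True:
    # movzx eax, byte ptr [rax] ; lea ecx, [rax + 1] ; mov byte ptr [rax], dl
    pw[i] = (pw[i - 1] + 1) & 0xFF
    i += 1           # add dword ptr [rbp - 0x14], 1
    if not i <= 8:   # cmp dword ptr [rbp - 0x14], 8 ; jle 0x7d0
        break

# Everything up to the first NUL byte is the password.
password = pw.split(b"\0", 1)[0].decode("ascii")
print(password)  # abcdefghi
```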

 And this looks like our strcmp()
0x843: mov rdx, qword ptr [rbp - 0x10] ; *pw
0x847: mov rax, qword ptr [rbp - 8] ; *in
0x84b: mov rsi, rdx
0x84e: mov rdi, rax
0x851: call 0x670 ; strcmp
0x856: test eax, eax ; is = 0?
0x858: jne 0x868 ; no? jump to 0x868
0x85a: lea rdi, [rip + 0xae] ; "haha yes!"
0x861: call 0x640 ; puts
0x866: jmp 0x874
0x868: lea rdi, [rip + 0xaa] ; "nah dude"
0x86f: call 0x640 ; puts

 I'm not sure why it uses puts here? I might be missing something;
 perhaps printf calls puts. I could be wrong. I also confirmed with
 radare2 that those locations are actually the strings "haha yes!" and
 "nah dude".

 Update: It's because of compiler optimization. The format strings here
 contain no conversion specifiers and end in a newline, so a printf() is
 overkill, and GCC simplifies each call to a puts() of the same string
 without the trailing newline.
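
 The radare2 check on those two string addresses can also be sketched in
 Python. Note that the .rodata bytes below are an assumption: I rebuilt
 them from the strings in chall.c and the addresses in the listing (four
 leading bytes, then the three strings back to back), rather than
 dumping them from the binary. With pyelftools, the real bytes would
 come from the .rodata section's data().

```python
RODATA_ADDR = 0x900  # .rodata address from the sections.py output

# Assumed contents of .rodata, reconstructed by hand (not a real dump).
RODATA = b"\x01\x00\x02\x00" + b"password: \0" + b"haha yes!\0" + b"nah dude\0"


def cstring_at(addr):
    """Return the NUL-terminated string stored at a virtual address."""
    start = addr - RODATA_ADDR
    end = RODATA.index(b"\0", start)
    return RODATA[start:end].decode("ascii")


# lea rdi, [rip + 0xae] with rip = 0x861, and [rip + 0xaa] with rip = 0x86f
print(cstring_at(0x861 + 0xae))  # haha yes!
print(cstring_at(0x86f + 0xaa))  # nah dude
```

 Neither string carries a trailing newline in this layout, which fits
 the puts() explanation, since puts() appends the newline itself.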

Conclusion

 Wew, that took quite some time. But we're done. If you're a beginner,
 you might find this extremely confusing, or you probably didn't even
 understand what was going on. And that's okay. Building an intuition
 for reading and grokking disassembly comes with practice. I'm no good
 at it either.

 All the code used in this post is here:
 [3]https://github.com/icyphox/asdf/tree/master/reversing-elf

 Ciao for now, and I'll see ya in #2 of this series -- PE binaries.
 Whenever that is.

References

 1. http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html
 2. http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib%20--%20libc-start-main-.html
 3. https://github.com/icyphox/asdf/tree/master/reversing-elf