---
date: '2019-02-08'
subtitle: 'Building your own disassembly tooling for --- that''s
    right --- fun and profit'
template: text.html
title: Python for Reverse Engineering
url: 'python-for-re-1'
---

While solving complex reversing challenges, we often use established
tools like radare2 or IDA for disassembling and debugging. But there are
times when you need to dig a little deeper and understand how things
work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it
comes to automating certain processes, and eventually to building your
own homebrew reversing toolchain of sorts. At least, that's what I'm
attempting anyway.

Setup
-----

As the title suggests, you're going to need a Python 3 interpreter
before anything else. Once you've confirmed beyond reasonable doubt that
you do, in fact, have a Python 3 interpreter installed on your system,
run

``` {.console}
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we'll be scripting with and
`pyelftools` helps us parse ELF files.

With that out of the way, let's start with an example of a basic
reversing challenge.

``` {.c}
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *pw = malloc(10);  /* 9 characters plus the terminating NUL */
    pw[0] = 'a';
    for(int i = 1; i <= 8; i++){
        pw[i] = pw[i - 1] + 1;
    }
    pw[9] = '\0';
    char *in = malloc(10);
    printf("password: ");
    fgets(in, 10, stdin);   // 'abcdefghi'
    if(strcmp(in, pw) == 0) {
        printf("haha yes!\n");
    }
    else {
        printf("nah dude\n");
    }
}
```

Compile it with GCC/Clang:

``` {.console}
$ gcc chall.c -o chall.elf
```

Scripting
---------

For starters, let's look at the different sections present in the
binary.

``` {.python}
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and prints each one's
name along with the address it's loaded at. This will be pretty useful
later. Running it gives us

``` {.console}
$ python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

Most of these aren't relevant to us, but a few sections are worth
noting. The `.text` section contains the instructions (opcodes) that
we're after. The `.rodata` section holds read-only data such as string
literals, while `.data` holds writable globals initialized at compile
time. Finally, there's the `.plt`, which is the Procedure Linkage Table,
and the `.got`, the Global Offset Table. If you're unsure about what
these mean, read up on the ELF format and its internals.
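
Since the load addresses come up again and again later on, here's a small
sketch for answering "which section does this address live in?". The
table is copied by hand from the `sections.py` output, and the helper
name `section_for` is made up for this example. It only knows where each
section starts, so it assumes an address belongs to the closest section
beginning at or before it; a sturdier version would also check
`sh_size`.

```python
from bisect import bisect_right

# Load addresses copied from the sections.py output; sections with
# address 0x0 (.comment, .symtab, ...) are not loaded and are left out.
SECTIONS = [
    (0x238, '.interp'), (0x254, '.note.ABI-tag'), (0x274, '.note.gnu.build-id'),
    (0x298, '.gnu.hash'), (0x2c0, '.dynsym'), (0x3e0, '.dynstr'),
    (0x484, '.gnu.version'), (0x4a0, '.gnu.version_r'), (0x4c0, '.rela.dyn'),
    (0x598, '.rela.plt'), (0x610, '.init'), (0x630, '.plt'), (0x690, '.plt.got'),
    (0x6a0, '.text'), (0x8f4, '.fini'), (0x900, '.rodata'),
    (0x924, '.eh_frame_hdr'), (0x960, '.eh_frame'), (0x200d98, '.init_array'),
    (0x200da0, '.fini_array'), (0x200da8, '.dynamic'), (0x200f98, '.got'),
    (0x201000, '.data'), (0x201010, '.bss'),
]


def section_for(addr):
    """Return the name of the last section starting at or before addr.

    Assumption: section sizes are not known here, so an address that falls
    in a gap between two sections is attributed to the earlier one.
    """
    starts = [start for start, _ in SECTIONS]
    idx = bisect_right(starts, addr) - 1
    return SECTIONS[idx][1] if idx >= 0 else None


print(section_for(0x6c4))  # .text
print(section_for(0x910))  # .rodata
```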

Since we know that the `.text` section has the opcodes, let's
disassemble the binary starting at that address.

``` {.python}
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()          # raw bytes of the section
    addr = code['sh_addr']     # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think): grab the raw bytes of
`.text` and hand them to capstone along with the section's load address.
Running it, we should see

``` {.console}
$ python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

The `call` at `0x6c4` is fairly interesting to us. `rip` holds the
address of the next instruction, so `[rip + 0x200916]` is equivalent to
`[0x6ca + 0x200916]`, which in turn evaluates to `[0x200fe0]`. So the
first `call` goes through a pointer stored at `0x200fe0`? What function
could that pointer lead to?
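
The arithmetic above is easy to get wrong by hand, so here's a tiny
sketch of it. The helper name `rip_relative_target` is made up for this
example. The instruction length is taken from the listing (the next
instruction starts at `0x6ca`); when scripting with capstone, each
disassembled instruction exposes the same numbers as `.address` and
`.size`.

```python
def rip_relative_target(insn_addr, insn_size, disp):
    """Resolve [rip + disp]; rip is the address of the *next* instruction."""
    return insn_addr + insn_size + disp


# call qword ptr [rip + 0x200916] sits at 0x6c4 and the next instruction
# (hlt) is at 0x6ca, so the call is 6 bytes long.
slot = rip_relative_target(0x6c4, 0x6ca - 0x6c4, 0x200916)
print(hex(slot))  # 0x200fe0
```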

For this, we will have to look at **relocations**. Quoting
[linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):

> Relocation is the process of connecting symbolic references with
> symbolic definitions. For example, when a program calls a function, the
> associated call instruction must transfer control to the proper
> destination address at execution. Relocatable files must have
> "relocation entries" which are necessary because they contain
> information that describes how to modify their section contents, thus
> allowing executable and shared object files to hold the right
> information for a process's program image.

To try and find these relocation entries, we write a third script.

``` {.python}
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table this section uses
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let's run through this code real quick. We first loop through the
sections and check whether each one is of the type `RelocationSection`.
For every such section, we fetch the symbol table it links to (via
`sh_link`), then iterate over its relocations, resolving each one's
symbol index to a name and printing it next to the relocation's offset.
Entries that don't reference a named symbol show up with a blank name.
Finally, running this gives us

``` {.console}
$ python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
__libc_start_main 0x200fe0
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```

Remember the call through `0x200fe0` from earlier? Yep, so that was
a call to the well known `__libc_start_main`. Again, according to
[linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):

> The `__libc_start_main()` function shall perform any necessary
> initialization of the execution environment, call the *main* function
> with appropriate arguments, and handle the return from `main()`. If the
> `main()` function returns, the return value shall be passed to the
> `exit()` function.

And its prototype looks like so

``` {.c}
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc, char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void *stack_end);
```

Let's look back at our disassembly, this time at the `lea` (Load
Effective Address) instruction at `0x6bd`.

    0x6a0: xor ebp, ebp
    0x6a2: mov r9, rdx
    0x6a5: pop rsi
    0x6a6: mov rdx, rsp
    0x6a9: and rsp, 0xfffffffffffffff0
    0x6ad: push rax
    0x6ae: push rsp
    0x6af: lea r8, [rip + 0x23a]
    0x6b6: lea rcx, [rip + 0x1c3]
    0x6bd: lea rdi, [rip + 0xe6]
    0x6c4: call qword ptr [rip + 0x200916]
    0x6ca: hlt
    ... snip ...

It loads the address `[rip + 0xe6]` into the `rdi` register. With `rip`
pointing at the next instruction (`0x6c4`), `[rip + 0xe6]` evaluates to
`0x7aa`, which happens to be the address of our `main()` function! How
do I know that? Because `rdi` carries the first argument to
`__libc_start_main()`, which is the pointer to `main()`. After doing
whatever it does, `__libc_start_main()` eventually calls the function it
was handed in `rdi`. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)
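
To make the register business a bit more concrete, here's a rough sketch
that lines up the parameters of `__libc_start_main()` with the integer
argument registers of the System V AMD64 calling convention (the one
Linux uses on x86-64). The helper name `assign_arguments` is made up for
this example, and the parameter names are simply lifted from the
prototype quoted earlier.

```python
# Integer/pointer argument registers of the System V AMD64 calling
# convention, in order.
ARG_REGISTERS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']

# Parameter names taken from the __libc_start_main prototype.
PARAMS = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini', 'stack_end']


def assign_arguments(params):
    """Map each parameter to a register, or to the stack once registers run out."""
    return {
        name: ARG_REGISTERS[i] if i < len(ARG_REGISTERS) else 'stack'
        for i, name in enumerate(params)
    }


for name, where in assign_arguments(PARAMS).items():
    print(f'{name:10} -> {where}')
```

This seems to match the listing: `lea rdi` sets up `main`, `pop rsi` and
`mov rdx, rsp` set up `argc` and the argument vector, `lea rcx` and
`lea r8` carry `init` and `fini`, `mov r9, rdx` carries `rtld_fini`, and
the seventh argument goes onto the stack with `push rsp`.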

To see the disassembly of `main`, seek to `0x7aa` in the output of the
script we'd written earlier (`disas1.py`).

From what we discovered earlier, each `call` instruction points to some
function, which we can identify from the relocation entries. Inside
`main`, the calls land on stubs in the `.plt`, and following each stub
to its relocation gives us this

    puts 0x640
    printf 0x650
    fgets 0x660
    strcmp 0x670
    malloc 0x680
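
These addresses don't have to be chased down one by one. Here's a rough
sketch that derives them, under two assumptions that happen to hold for
this binary but aren't guaranteed in general: every `.plt` entry is 16
bytes, and the stubs appear in the same order as the `.rela.plt`
entries, right after a first entry reserved for the resolver. The helper
name `plt_stub_addresses` is made up for this example.

```python
PLT_START = 0x630       # address of .plt from the sections.py output
PLT_ENTRY_SIZE = 0x10   # assumed: 16 bytes per entry, the usual x86-64 layout

# Order of the symbols in .rela.plt, as printed by relocations.py
PLT_SYMBOLS = ['puts', 'printf', 'fgets', 'strcmp', 'malloc']


def plt_stub_addresses(start, symbols, entry_size=PLT_ENTRY_SIZE):
    """Entry 0 is assumed to be the resolver stub, so the first symbol gets entry 1."""
    return {name: start + entry_size * (i + 1) for i, name in enumerate(symbols)}


for name, addr in plt_stub_addresses(PLT_START, PLT_SYMBOLS).items():
    print(f'{name:8}{hex(addr)}')
```

The last stub ends at `0x690`, which is exactly where `.plt.got` begins
in the section listing, so the layout assumption looks reasonable here.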

Putting all this together, things start falling into place. Let me
highlight the key sections of the disassembly here. It's pretty
self-explanatory.

    0x7b2: mov edi, 0xa ; 10
    0x7b7: call 0x680 ; malloc

The loop to populate the `*pw` string

    0x7d0: mov eax, dword ptr [rbp - 0x14]
    0x7d3: cdqe
    0x7d5: lea rdx, [rax - 1]
    0x7d9: mov rax, qword ptr [rbp - 0x10]
    0x7dd: add rax, rdx
    0x7e0: movzx eax, byte ptr [rax]
    0x7e3: lea ecx, [rax + 1]
    0x7e6: mov eax, dword ptr [rbp - 0x14]
    0x7e9: movsxd rdx, eax
    0x7ec: mov rax, qword ptr [rbp - 0x10]
    0x7f0: add rax, rdx
    0x7f3: mov edx, ecx
    0x7f5: mov byte ptr [rax], dl
    0x7f7: add dword ptr [rbp - 0x14], 1
    0x7fb: cmp dword ptr [rbp - 0x14], 8
    0x7ff: jle 0x7d0
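
If the loop is hard to follow in assembly, here's roughly the same logic
in Python, with the instructions it mirrors noted in the comments. It's
a hand translation meant as a reading aid, not something generated from
the binary, and the function name `build_password` is made up for this
example.

```python
def build_password():
    """Mirror of the loop: pw[0] = 'a', then pw[i] = pw[i - 1] + 1 for i = 1..8."""
    pw = bytearray(9)
    pw[0] = ord('a')
    i = 1                                # the counter at [rbp - 0x14]
    while i <= 8:                        # cmp dword ptr [rbp - 0x14], 8 / jle 0x7d0
        prev = pw[i - 1]                 # movzx eax, byte ptr [rax]
        pw[i] = (prev + 1) & 0xff        # lea ecx, [rax + 1] / mov byte ptr [rax], dl
        i += 1                           # add dword ptr [rbp - 0x14], 1
    return pw.decode('ascii')


print(build_password())  # abcdefghi
```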

And this looks like our `strcmp()`

    0x843: mov rdx, qword ptr [rbp - 0x10] ; *pw
    0x847: mov rax, qword ptr [rbp - 8] ; *in
    0x84b: mov rsi, rdx ; second argument: pw
    0x84e: mov rdi, rax ; first argument: in
    0x851: call 0x670 ; strcmp
    0x856: test eax, eax ; is = 0?
    0x858: jne 0x868 ; no? jump to 0x868
    0x85a: lea rdi, [rip + 0xae] ; "haha yes!"
    0x861: call 0x640 ; puts
    0x866: jmp 0x874
    0x868: lea rdi, [rip + 0xaa] ; "nah dude"
    0x86f: call 0x640 ; puts
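
The `; strcmp` and `; puts` comments in these listings were added by
hand. Here's a rough sketch of how that step could be automated over the
text that `disas1.py` prints. The stub table is the one worked out for
this binary, and the helper name `annotate` is made up for this example;
it only handles direct calls to a literal address and leaves everything
else alone.

```python
import re

# PLT stub addresses recovered for this binary
PLT_NAMES = {0x640: 'puts', 0x650: 'printf', 0x660: 'fgets',
             0x670: 'strcmp', 0x680: 'malloc'}

# Matches a direct call whose operand is a literal hex address
CALL_RE = re.compile(r'\bcall\s+(0x[0-9a-f]+)\s*$')


def annotate(line, names=PLT_NAMES):
    """Append '; <name>' to a direct call whose target is a known stub."""
    match = CALL_RE.search(line)
    if match:
        name = names.get(int(match.group(1), 16))
        if name:
            return f'{line} ; {name}'
    return line


listing = [
    '0x84e: mov rdi, rax',
    '0x851: call 0x670',
    '0x856: test eax, eax',
    '0x861: call 0x640',
]
for line in listing:
    print(annotate(line))
```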

I'm not sure why it uses `puts` here? I might be missing something;
perhaps `printf` calls `puts`. I could be wrong. I also confirmed with
radare2 that those locations are actually the strings "haha yes!" and
"nah dude".

**Update**: It's because of compiler optimization. Since the format
string here has no format specifiers and ends in a newline, a `printf()`
is a bit overkill, and hence gets simplified to a `puts()`.

Conclusion
----------

Wew, that took quite some time. But we're done. If you're a beginner,
you might have found this extremely confusing, or maybe didn't
understand what was going on at all. And that's okay. Building an
intuition for reading and grokking disassembly comes with practice. I'm
no good at it either.

All the code used in this post is here:
<https://github.com/icyphox/asdf/tree/master/reversing-elf>

Ciao for now, and I'll see ya in \#2 of this series: PE binaries.
Whenever that is.