---
template: text.html
title: Python for Reverse Engineering #1: ELF Binaries
subtitle: Building your own disassembly tooling for, that’s right, fun and profit
date: 8 Feb, 2019
---

# Python for Reverse Engineering 1: ELF Binaries

## Building your own disassembly tooling for, that’s right, fun and profit

While solving complex reversing challenges, we often use established tools like radare2 or IDA for disassembling and debugging. But there are times when you need to dig a little deeper and understand how things work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it comes to automating certain processes, and eventually building your own homebrew reversing toolchain of sorts. At least, that’s what I’m attempting anyway.

### Setup

As the title suggests, you’re going to need a Python 3 interpreter before
anything else. Once you’ve confirmed beyond reasonable doubt that you do,
in fact, have a Python 3 interpreter installed on your system, run

```console
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we’ll be scripting with and `pyelftools` is the library that helps us parse ELF files.

With that out of the way, let’s start with an example of a basic reversing
challenge.

```c
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
   char *pw = malloc(10);       /* 9 characters plus the terminating '\0' */
   pw[0] = 'a';
   for(int i = 1; i <= 8; i++){
       pw[i] = pw[i - 1] + 1;   /* 'b', 'c', ... 'i' */
   }
   pw[9] = '\0';
   char *in = malloc(10);
   printf("password: ");
   fgets(in, 10, stdin);        // 'abcdefghi'
   if(strcmp(in, pw) == 0) {
       printf("haha yes!\n");
   }
   else {
       printf("nah dude\n");
   }
}
```

Compile it with GCC/Clang:

```console
$ gcc chall.c -o chall.elf
```

### Scripting

For starters, let’s look at the different sections present in the binary.

```python
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    # Print each section's load address next to its name.
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and shows us the address each one is loaded at. This will be pretty useful later. Running it gives us

```console
$ python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

Most of these aren’t relevant to us, but a few sections are worth noting. The `.text` section contains the instructions (opcodes) that we’re after. The `.data` section holds initialized, writable global data, while read-only constants such as string literals end up in `.rodata`. Finally, there’s the `.plt`, the Procedure Linkage Table, and the `.got`, the Global Offset Table. If you’re unsure about what these mean, read up on the ELF format and its internals.
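If you want a feel for what `pyelftools` is reading before it ever gets to the sections, here is a rough sketch that unpacks the 64-byte ELF64 file header with nothing but `struct`. The function name `parse_elf64_header` and the demo header values are made up for illustration; a real header would come from `f.read(64)` on the binary.

```python
import struct

# ELF64 file header, little-endian: 16-byte e_ident, then the fixed-width fields.
ELF64_HEADER = struct.Struct('<16sHHIQQQIHHHHHH')

def parse_elf64_header(blob):
    """Return a few header fields from the first 64 bytes of an ELF64 file."""
    if blob[:4] != b'\x7fELF':
        raise ValueError('missing ELF magic')
    if blob[4] != 2 or blob[5] != 1:
        raise ValueError('this sketch only handles 64-bit little-endian files')
    fields = ELF64_HEADER.unpack_from(blob)
    return {
        'type': fields[1],       # e_type: 2 = executable, 3 = shared object / PIE
        'machine': fields[2],    # e_machine: 62 = x86-64
        'entry': fields[4],      # e_entry: address of the first instruction run
        'shoff': fields[6],      # e_shoff: file offset of the section header table
        'shnum': fields[12],     # e_shnum: number of section headers
        'shstrndx': fields[13],  # e_shstrndx: index of the section name table
    }

# Invented header for demonstration only, not read from chall.elf.
ident = b'\x7fELF' + bytes([2, 1, 1, 0]) + bytes(8)
demo = ELF64_HEADER.pack(ident, 3, 62, 1, 0x6a0, 64, 0x1a00, 0, 64, 56, 9, 64, 29, 28)
print(parse_elf64_header(demo))
```

The section header table that `e_shoff` points at is where names like `.text` and their `sh_addr` values ultimately come from.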

Since we know that the `.text` section has the opcodes, let’s disassemble the binary starting at that address.

```python
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()            # raw bytes of the section
    addr = code['sh_addr']       # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think). Running it should give us

```console
$ python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
**0x6c4: call qword ptr [rip + 0x200916]**
0x6ca: hlt
... snip ...
```

The line in bold is fairly interesting to us. `rip` points at the next instruction while the current one executes, so `[rip + 0x200916]` is equivalent to `[0x6ca + 0x200916]`, which in turn evaluates to `[0x200fe0]`. So the first `call` goes to whatever function has its address stored at `0x200fe0`. What could this function be?
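The arithmetic is easy to get wrong by hand, so here is a tiny sketch of it. The helper name `rip_relative_target` is my own; the instruction address, length and displacement are read straight off the disassembly above.

```python
def rip_relative_target(insn_addr, insn_size, disp):
    """Address referenced by [rip + disp] for an instruction at insn_addr."""
    # rip already points at the *next* instruction when the operand is evaluated.
    return insn_addr + insn_size + disp

# call qword ptr [rip + 0x200916] sits at 0x6c4 and is 6 bytes long (next is 0x6ca).
slot = rip_relative_target(0x6c4, 6, 0x200916)
print(hex(slot))  # 0x200fe0
```

Inside the `disas1.py` loop, the same next-instruction address is available as `i.address + i.size`.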

For this, we will have to look at **relocations**. Quoting [linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):
> Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. Relocatable files must have “relocation entries” which are necessary because they contain information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process’s program image.

To try and find these relocation entries, we write a third script.

```python
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table this section refers to.
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let’s run through this code real quick. We first loop through the sections and check whether each one is of the type `RelocationSection`. For every relocation in such a section, we then look up the symbol it refers to in the section’s linked symbol table, and print the symbol name along with the address the relocation applies to. Finally, running this gives us

```console
$ python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
**__libc_start_main 0x200fe0**
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```
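For the curious, `r_offset` and `r_info_sym` come out of a fixed 24-byte record. Below is a hedged sketch of decoding one such `Elf64_Rela` entry by hand; `decode_rela` is a name I made up, the type numbers are the x86-64 psABI ones, and the sample entry (symbol index 3 in particular) is invented for the demo rather than read from `chall.elf`.

```python
import struct

# Elf64_Rela: r_offset (u64), r_info (u64), r_addend (i64), little-endian.
ELF64_RELA = struct.Struct('<QQq')

# A few x86-64 relocation type numbers from the psABI.
RELOC_TYPES = {
    6: 'R_X86_64_GLOB_DAT',
    7: 'R_X86_64_JUMP_SLOT',
    8: 'R_X86_64_RELATIVE',
}

def decode_rela(entry):
    """Split a raw 24-byte relocation entry into its parts."""
    r_offset, r_info, r_addend = ELF64_RELA.unpack(entry)
    return {
        'offset': r_offset,                                     # where to patch
        'sym': r_info >> 32,                                    # symbol table index
        'type': RELOC_TYPES.get(r_info & 0xffffffff, 'other'),  # how to patch
        'addend': r_addend,
    }

# Hypothetical entry: the symbol index is invented for this demo.
raw = ELF64_RELA.pack(0x200fe0, (3 << 32) | 6, 0)
print(decode_rela(raw))
```

The three nameless lines at the top of `.rela.dyn` are most likely `R_X86_64_RELATIVE` entries, which carry symbol index 0 and so print an empty name.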

Remember the call through `0x200fe0` from earlier? Yep, so that was a call to the well known `__libc_start_main`. Again, according to [linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):
> The `__libc_start_main()` function shall perform any necessary initialization of the execution environment, call the *main* function with appropriate arguments, and handle the return from `main()`. If the `main()` function returns, the return value shall be passed to the `exit()` function.

And its definition is like so

```c
int __libc_start_main(int (*main) (int, char * *, char * *),
int argc, char * * ubp_av,
void (*init) (void),
void (*fini) (void),
void (*rtld_fini) (void),
void (* stack_end));
```

Looking back at our disassembly

```
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
**0x6bd: lea rdi, [rip + 0xe6]**
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

but this time, at the `lea` or Load Effective Address instruction, which loads the address `[rip + 0xe6]` into the `rdi` register. `[rip + 0xe6]` evaluates to `0x6c4 + 0xe6 = 0x7aa`, which happens to be the address of our `main()` function! How do I know that? Because on x86-64 Linux `rdi` carries the first argument of a call, and the first argument of `__libc_start_main()` is the `main` pointer, which it eventually calls after doing whatever it does. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)
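To make the register-to-parameter mapping concrete, here is a small sketch that lines up the System V AMD64 argument registers with the prototype above. The helper `describe_call` is hypothetical; the three addresses are computed from the `lea` instructions in the disassembly (address of the following instruction plus the displacement), and everything else is only known at runtime.

```python
# Integer/pointer argument registers, in System V AMD64 calling-convention order.
ARG_REGS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']
# First six parameters of __libc_start_main, in declaration order.
PARAMS = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini']

# Values we can compute statically from the three lea instructions.
known = {
    'rdi': 0x6c4 + 0xe6,    # lea rdi, [rip + 0xe6]
    'rcx': 0x6bd + 0x1c3,   # lea rcx, [rip + 0x1c3]
    'r8': 0x6b6 + 0x23a,    # lea r8, [rip + 0x23a]
}

def describe_call(regs, params, values):
    """Pair each parameter with its register and statically known value, if any."""
    rows = []
    for reg, param in zip(regs, params):
        value = values.get(reg)
        rows.append((param, reg, hex(value) if value is not None else 'set at runtime'))
    return rows

for param, reg, value in describe_call(ARG_REGS, PARAMS, known):
    print(f'{param:<10} {reg:<4} {value}')
```

The seventh parameter, `stack_end`, doesn’t fit in a register, which is what the `push rsp` right before the `lea` instructions is for.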

To see the disassembly of `main`, seek to `0x7aa` in the output of the script we’d written earlier (`disas1.py`).

The `call` instructions inside `main` don’t go through the GOT directly like the one above. Each one targets a small stub in the `.plt` section, and each stub in turn jumps through one of the `.rela.plt` slots we saw in the relocation entries. So following each `call` through its stub and relocation gives us this

```
puts   0x640
printf 0x650
fgets  0x660
strcmp 0x670
malloc 0x680
```
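These stub addresses can also be derived rather than traced by hand. A minimal sketch, assuming the classic x86-64 lazy-binding PLT layout (16-byte slots, with slot 0 reserved for the resolver trampoline) and assuming the stubs appear in the same order as the `.rela.plt` entries; `plt_stubs` is a made-up helper name.

```python
PLT_BASE = 0x630        # address of .plt, from the sections.py output
PLT_ENTRY_SIZE = 16     # assumption: classic x86-64 lazy-binding PLT layout

# Order of the entries in .rela.plt, as printed by relocations.py.
RELA_PLT_ORDER = ['puts', 'printf', 'fgets', 'strcmp', 'malloc']

def plt_stubs(base, names, entry_size=PLT_ENTRY_SIZE):
    """Map each imported function name to the address of its PLT stub."""
    # Slot 0 is the resolver trampoline, so the first named stub is slot 1.
    return {name: base + entry_size * (index + 1) for index, name in enumerate(names)}

for name, addr in plt_stubs(PLT_BASE, RELA_PLT_ORDER).items():
    print(f'{name:<7} {addr:#x}')
```

As a sanity check, six 16-byte slots starting at `0x630` end at `0x690`, which is exactly where `.plt.got` begins in the section listing.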

Putting all this together, things start falling into place. Let me highlight the key sections of the disassembly here. It’s pretty self-explanatory.

```
0x7b2: mov edi, 0xa  ; 10
0x7b7: call 0x680    ; malloc
```

The loop that populates the `pw` string

```
0x7d0:  mov     eax, dword ptr [rbp - 0x14] ; i
0x7d3:  cdqe    
0x7d5:  lea     rdx, [rax - 1]              ; i - 1
0x7d9:  mov     rax, qword ptr [rbp - 0x10] ; pw
0x7dd:  add     rax, rdx
0x7e0:  movzx   eax, byte ptr [rax]         ; pw[i - 1]
0x7e3:  lea     ecx, [rax + 1]              ; pw[i - 1] + 1
0x7e6:  mov     eax, dword ptr [rbp - 0x14]
0x7e9:  movsxd  rdx, eax
0x7ec:  mov     rax, qword ptr [rbp - 0x10]
0x7f0:  add     rax, rdx
0x7f3:  mov     edx, ecx
0x7f5:  mov     byte ptr [rax], dl          ; pw[i] = pw[i - 1] + 1
0x7f7:  add     dword ptr [rbp - 0x14], 1   ; i++
0x7fb:  cmp     dword ptr [rbp - 0x14], 8
0x7ff:  jle     0x7d0                       ; loop while i <= 8
```
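Once the loop is understood, recovering the password is a matter of replaying it. Here is a short Python rendition of the same logic; `build_password` is just a name I picked, and the 10-byte buffer mirrors the `malloc` call seen above.

```python
def build_password(length=9, first='a'):
    """Replay the loop from main(): each byte is the previous byte plus one."""
    pw = bytearray(length + 1)    # 10-byte buffer; the last byte stays NUL
    pw[0] = ord(first)
    for i in range(1, length):    # i = 1..8, matching `cmp ..., 8` / `jle`
        pw[i] = pw[i - 1] + 1     # movzx, lea ecx, [rax + 1], mov byte ptr [rax], dl
    return pw[:length].decode('ascii')

print(build_password())  # abcdefghi
```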

And this looks like our `strcmp()`

```
0x843:  mov     rdx, qword ptr [rbp - 0x10] ; pw
0x847:  mov     rax, qword ptr [rbp - 8]    ; in
0x84b:  mov     rsi, rdx                    ; second argument: pw
0x84e:  mov     rdi, rax                    ; first argument: in
0x851:  call    0x670                       ; strcmp
0x856:  test    eax, eax                    ; is = 0?
0x858:  jne     0x868                       ; no? jump to 0x868
0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!"
0x861:  call    0x640                       ; puts
0x866:  jmp     0x874
0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
0x86f:  call    0x640                       ; puts
```

I’m not sure why it uses `puts` here. I might be missing something; perhaps `printf` calls `puts`. I could be wrong. I also confirmed with radare2 that those locations are actually the strings “haha yes!” and “nah dude”.

**Update**: It’s because of compiler optimization. A `printf()` call whose format string has no format specifiers and ends in a newline is a bit overkill, so GCC simplifies it to a `puts()` of the same string minus the newline.
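The radare2 check on those string addresses can be roughly reproduced with the section list we printed at the start. This sketch assumes each section runs until the next one begins (the script never printed sizes, so treat that as an approximation), and `section_for` is a hypothetical helper.

```python
from bisect import bisect_right

# (start address, name) pairs copied from the sections.py output, loaded sections only.
SECTIONS = [
    (0x238, '.interp'), (0x254, '.note.ABI-tag'), (0x274, '.note.gnu.build-id'),
    (0x298, '.gnu.hash'), (0x2c0, '.dynsym'), (0x3e0, '.dynstr'),
    (0x484, '.gnu.version'), (0x4a0, '.gnu.version_r'), (0x4c0, '.rela.dyn'),
    (0x598, '.rela.plt'), (0x610, '.init'), (0x630, '.plt'), (0x690, '.plt.got'),
    (0x6a0, '.text'), (0x8f4, '.fini'), (0x900, '.rodata'), (0x924, '.eh_frame_hdr'),
    (0x960, '.eh_frame'), (0x200d98, '.init_array'), (0x200da0, '.fini_array'),
    (0x200da8, '.dynamic'), (0x200f98, '.got'), (0x201000, '.data'), (0x201010, '.bss'),
]

def section_for(addr, sections=SECTIONS):
    """Name of the last section starting at or before addr (sizes are not checked)."""
    starts = [start for start, _ in sections]
    index = bisect_right(starts, addr) - 1
    if index < 0:
        raise ValueError(f'{addr:#x} is below the first listed section')
    return sections[index][1]

print(section_for(0x861 + 0xae))   # operand of the first lea before puts
print(section_for(0x86f + 0xaa))   # operand of the second lea before puts
print(section_for(0x200fe0))       # the slot used by the very first call
```

Both string operands land in `.rodata`, and the slot behind the first `call` lands in `.got`, which fits with what we worked out by hand.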

### Conclusion

Wew, that took quite some time. But we’re done. If you’re a beginner, you might find this extremely confusing, or might not have understood what was going on at all. And that’s okay. Building an intuition for reading and grokking disassembly comes with practice. I’m no good at it either.

All the code used in this post is here: [https://github.com/icyphox/asdf/tree/master/reversing-elf](https://github.com/icyphox/asdf/tree/master/reversing-elf)

Ciao for now, and I’ll see ya in #2 of this series, PE binaries. Whenever that is.