---
template: text.html
title: Python for Reverse Engineering #1: ELF Binaries
subtitle: Building your own disassembly tooling for, that’s right, fun and profit
date: 2019-02-08
---

While solving complex reversing challenges, we often use established tools like radare2 or IDA for disassembling and debugging. But there are times when you need to dig in a little deeper and understand how things work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it comes to automating certain processes, and can eventually grow into your own homebrew reversing toolchain of sorts. At least, that’s what I’m attempting anyway.

### Setup

As the title suggests, you’re going to need a Python 3 interpreter before
anything else. Once you’ve confirmed beyond reasonable doubt that you do,
in fact, have a Python 3 interpreter installed on your system, run

```console
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we’ll be scripting with, and `pyelftools` helps us parse ELF files.

With that out of the way, let’s start with an example of a basic reversing
challenge.

```c
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
   char *pw = malloc(10);       // 9 characters plus the terminating '\0'
   pw[0] = 'a';
   for(int i = 1; i <= 8; i++){
       pw[i] = pw[i - 1] + 1;   // 'b', 'c', ... 'i'
   }
   pw[9] = '\0';
   char *in = malloc(10);
   printf("password: ");
   fgets(in, 10, stdin);        // 'abcdefghi'
   if(strcmp(in, pw) == 0) {
       printf("haha yes!\n");
   }
   else {
       printf("nah dude\n");
   }
}
```

Compile it with GCC/Clang:

```console
$ gcc chall.c -o chall.elf
```

### Scripting

For starters, let’s look at the different sections present in the binary.

```python
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and prints the address each one is loaded at. This will be pretty useful later. Running it gives us

```console
› python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

Most of these aren’t relevant to us, but a few sections are worth noting. The `.text` section contains the instructions (opcodes) that we’re after. The `.data` section holds initialized global and static variables, while read-only constants such as string literals live in `.rodata`. Finally, there’s the `.plt`, the Procedure Linkage Table, and the `.got`, the Global Offset Table. If you’re unsure about what these mean, read up on the ELF format and its internals.
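
One way to tell these kinds of sections apart is the `sh_flags` field of each section header. Here is a minimal sketch that decodes that bitmask with the flag bits from the ELF specification. The helper name `describe_flags` is made up for this post, and the example values are the ones such sections typically carry, assumed here rather than read from `chall.elf`.

```python
# Section header flag bits, as defined by the ELF specification.
SHF_WRITE = 0x1      # writable at run time
SHF_ALLOC = 0x2      # occupies memory in the process image
SHF_EXECINSTR = 0x4  # contains executable instructions


def describe_flags(sh_flags):
    """Return a readelf-style flag string such as 'AX' or 'WA'."""
    out = ''
    if sh_flags & SHF_WRITE:
        out += 'W'
    if sh_flags & SHF_ALLOC:
        out += 'A'
    if sh_flags & SHF_EXECINSTR:
        out += 'X'
    return out


# Typical values (assumed, not dumped from chall.elf).
typical = {'.text': 0x6, '.rodata': 0x2, '.data': 0x3, '.comment': 0x30}
for name, flags in typical.items():
    print(name, describe_flags(flags) or '-')
```

With pyelftools, the real value for each section should be available as `section['sh_flags']`. Sections without the `A` flag are never loaded, which would explain the `0x0` addresses at the end of the listing.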

Since we know that the `.text` section has the opcodes, let’s disassemble the binary starting at that address.

```python
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()                   # raw bytes of the section
    addr = code['sh_addr']              # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)    # x86-64 disassembler
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think). Running it, we should see

```console
› python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

The `call` at `0x6c4` is fairly interesting to us. In 64-bit mode, `rip` holds the address of the next instruction, so `[rip + 0x200916]` is equivalent to `[0x6ca + 0x200916]`, which in turn evaluates to `[0x200fe0]`. So the first `call` goes to whatever function has its address stored at `0x200fe0`. What could this function be?
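
The address arithmetic is easy to get wrong by hand, so here is a small sketch of it. The function name `rip_relative_target` is just something I made up for illustration; the numbers come straight from the disassembly above.

```python
def rip_relative_target(insn_addr, insn_size, disp):
    """Address referenced by a [rip + disp] operand.

    rip points at the *next* instruction, so the instruction's own
    size is added before the displacement.
    """
    return insn_addr + insn_size + disp


# call qword ptr [rip + 0x200916] sits at 0x6c4 and the next
# instruction is at 0x6ca, so the call is 6 bytes long.
target = rip_relative_target(0x6c4, 0x6ca - 0x6c4, 0x200916)
print(hex(target))  # 0x200fe0
```

Inside the capstone loop, `i.address` and `i.size` should supply the first two arguments, so the displacement is the only thing left to pull out of the operand.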

For this, we will have to look at **relocations**. Quoting [linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):

> Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. Relocatable files must have “relocation entries” which are necessary because they contain information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process’s program image.

To try and find these relocation entries, we write a third script.

```python
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link is the index of the symbol table these relocations use
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let’s run through this code real quick. We first loop through the sections and check whether each one is of the type `RelocationSection`. For every relocation in such a section, we look up its symbol in the symbol table that the section links to (`sh_link`) and print the symbol’s name along with the relocation’s offset. Finally, running this gives us

```console
› python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
__libc_start_main 0x200fe0
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```
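
A quick aside on what pyelftools is decoding here. Each `Elf64_Rela` entry packs a symbol index and a relocation type into a single `r_info` field, which pyelftools exposes as `r_info_sym` and `r_info_type`. Below is a rough sketch of that split using only the standard library. The entry bytes are made up for illustration (the symbol index 3 and the zero addend are assumptions, not values from `chall.elf`), and `decode_rela` is a hypothetical helper. The three nameless entries at the top of `.rela.dyn` are probably `R_X86_64_RELATIVE` relocations, which use symbol index 0, though that is a guess.

```python
import struct

# x86-64 relocation types from the psABI (subset).
R_X86_64_GLOB_DAT = 6
R_X86_64_JUMP_SLOT = 7
R_X86_64_RELATIVE = 8


def decode_rela(entry):
    """Split one 24-byte little-endian Elf64_Rela entry into its fields."""
    r_offset, r_info, r_addend = struct.unpack('<QQq', entry)
    return {
        'r_offset': r_offset,
        'r_info_sym': r_info >> 32,          # upper 32 bits: symbol index
        'r_info_type': r_info & 0xffffffff,  # lower 32 bits: relocation type
        'r_addend': r_addend,
    }


# Made-up entry: GLOB_DAT relocation at 0x200fe0 against symbol index 3.
raw = struct.pack('<QQq', 0x200fe0, (3 << 32) | R_X86_64_GLOB_DAT, 0)
rel = decode_rela(raw)
print(hex(rel['r_offset']), rel['r_info_sym'], rel['r_info_type'])
```

Printing `relocation['r_info_type']` next to each name in `relocations.py` would be one way to check the guess about the nameless entries.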

Remember the function call at `0x200fe0` from earlier? Yep, so that was a call to the well-known `__libc_start_main`. Again, according to [linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):

> The `__libc_start_main()` function shall perform any necessary initialization of the execution environment, call the *main* function with appropriate arguments, and handle the return from `main()`. If the `main()` function returns, the return value shall be passed to the `exit()` function.

And its prototype looks like this

```c
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc, char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void (* stack_end));
```

Looking back at our disassembly

```
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

but this time, look at the `lea` (Load Effective Address) instruction at `0x6bd`, which loads the address `[rip + 0xe6]` into the `rdi` register. With `rip` pointing at the next instruction (`0x6c4`), `[rip + 0xe6]` evaluates to `0x7aa`, which happens to be the address of our `main()` function! How do I know that? Because `rdi` carries the first argument of a call on x86-64 Linux, and the first argument of `__libc_start_main()` is `main`. After doing whatever it does, `__libc_start_main()` eventually calls that function. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)
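
To make the register business a bit more concrete, here is a small sketch of how the System V AMD64 calling convention hands out integer and pointer arguments. It assumes every parameter is an integer or pointer, which holds for `__libc_start_main`; the helper name `assign_args` is hypothetical.

```python
# Integer/pointer argument registers in the System V AMD64 calling
# convention, in order. Further arguments go on the stack.
SYSV_ARG_REGS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']


def assign_args(params):
    """Map parameter names to the register (or stack) that carries them."""
    placement = {}
    for index, name in enumerate(params):
        if index < len(SYSV_ARG_REGS):
            placement[name] = SYSV_ARG_REGS[index]
        else:
            placement[name] = 'stack'
    return placement


# Parameter names taken from the __libc_start_main prototype above.
params = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini', 'stack_end']
for name, where in assign_args(params).items():
    print(f'{name:<10} {where}')
```

This seems to line up with the disassembly: `lea rdi` for `main`, `pop rsi` for `argc`, `mov rdx, rsp` for `ubp_av`, `lea rcx` for `init`, `lea r8` for `fini`, `mov r9, rdx` for `rtld_fini`, and `push rsp` for `stack_end`.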

To see the disassembly of `main`, seek to `0x7aa` in the output of the script we’d written earlier (`disas1.py`).

From what we discovered earlier, each `call` instruction points to some function that we can identify from the relocation entries. Here the calls target small stubs in the `.plt` section, and each stub jumps through the `.got` slot listed in `.rela.plt`. So following each `call` into its relocation gives us this

```
puts   0x640
printf 0x650
fgets  0x660
strcmp 0x670
malloc 0x680
```
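
These addresses follow a pattern. Assuming the usual x86-64 layout, where `.plt` starts with one 16-byte resolver entry followed by one 16-byte stub per `.rela.plt` relocation, in order (an assumption about this binary, not something checked here), the stub addresses can be guessed from the `.plt` base of `0x630` in the section listing. `plt_stubs` is a hypothetical helper.

```python
PLT_ENTRY_SIZE = 16  # assumed size of each x86-64 PLT entry


def plt_stubs(plt_base, names):
    """Guess each PLT stub address from the order of .rela.plt entries."""
    stubs = {}
    for index, name in enumerate(names):
        # Entry 0 is the shared resolver stub, so real stubs start at 1.
        stubs[name] = plt_base + PLT_ENTRY_SIZE * (index + 1)
    return stubs


# .plt base from sections.py, symbol order from relocations.py.
stubs = plt_stubs(0x630, ['puts', 'printf', 'fgets', 'strcmp', 'malloc'])
for name, addr in stubs.items():
    print(f'{name:<6} {hex(addr)}')
```

The guess also puts `puts` at `0x640`, which is the address called right after the `strcmp` check further down.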

Putting all this together, things start falling into place. Let me highlight the key sections of the disassembly here. It’s pretty self-explanatory.

```
0x7b2: mov edi, 0xa  ; 10
0x7b7: call 0x680    ; malloc
```

The loop to populate the `*pw` string

```
0x7d0:  mov     eax, dword ptr [rbp - 0x14]    ; i
0x7d3:  cdqe
0x7d5:  lea     rdx, [rax - 1]                 ; i - 1
0x7d9:  mov     rax, qword ptr [rbp - 0x10]    ; pw
0x7dd:  add     rax, rdx                       ; &pw[i - 1]
0x7e0:  movzx   eax, byte ptr [rax]            ; pw[i - 1]
0x7e3:  lea     ecx, [rax + 1]                 ; pw[i - 1] + 1
0x7e6:  mov     eax, dword ptr [rbp - 0x14]    ; i
0x7e9:  movsxd  rdx, eax
0x7ec:  mov     rax, qword ptr [rbp - 0x10]    ; pw
0x7f0:  add     rax, rdx                       ; &pw[i]
0x7f3:  mov     edx, ecx
0x7f5:  mov     byte ptr [rax], dl             ; pw[i] = pw[i - 1] + 1
0x7f7:  add     dword ptr [rbp - 0x14], 1      ; i++
0x7fb:  cmp     dword ptr [rbp - 0x14], 8
0x7ff:  jle     0x7d0                          ; loop while i <= 8
```
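
Read as Python, the loop above does roughly the following. This is a sketch of the logic, not an instruction-by-instruction translation, and `build_password` is a name made up for this example.

```python
def build_password():
    """Mirror the loop: pw[0] = 'a', then pw[i] = pw[i - 1] + 1 for i in 1..8."""
    pw = bytearray(10)          # stands in for the 10-byte malloc seen at 0x7b2
    pw[0] = ord('a')
    for i in range(1, 9):       # i <= 8, matching `cmp ..., 8` / `jle`
        pw[i] = pw[i - 1] + 1
    pw[9] = 0                   # terminating NUL
    return pw


pw = build_password()
print(pw[:9].decode())  # abcdefghi
```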

And this looks like our `strcmp()`

```
0x843:  mov     rdx, qword ptr [rbp - 0x10] ; *pw
0x847:  mov     rax, qword ptr [rbp - 8]    ; *in
0x84b:  mov     rsi, rdx                    ; 2nd argument: pw
0x84e:  mov     rdi, rax                    ; 1st argument: in
0x851:  call    0x670                       ; strcmp
0x856:  test    eax, eax                    ; is = 0?
0x858:  jne     0x868                       ; no? jump to 0x868
0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!"
0x861:  call    0x640                       ; puts
0x866:  jmp     0x874
0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
0x86f:  call    0x640                       ; puts
```

I’m not sure why it uses `puts` here? I might be missing something; perhaps `printf` calls `puts`. I could be wrong. I also confirmed with radare2 that those locations are actually the strings “haha yes!” and “nah dude”.

**Update**: It’s because of compiler optimization. When the format string has no format specifiers and ends in a newline, as it does here, a `printf()` is a bit overkill, and GCC simplifies it to a `puts()` of the same string minus the newline.
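
The radare2 check on those two strings can also be approximated in Python. In the sketch below, the `.rodata` contents are an assumed reconstruction that happens to be consistent with the addresses in this post; they are NOT dumped from `chall.elf`. `read_cstring` is a hypothetical helper.

```python
def read_cstring(blob, base, addr):
    """Read a NUL-terminated string at virtual address addr from a section."""
    start = addr - base
    end = blob.index(b'\x00', start)
    return blob[start:end].decode()


# Assumed reconstruction of .rodata (loaded at 0x900), NOT dumped from chall.elf.
RODATA_BASE = 0x900
RODATA = b'\x01\x00\x02\x00password: \x00haha yes!\x00nah dude\x00'

# lea rdi, [rip + 0xae] at 0x85a; the next instruction is at 0x861.
print(read_cstring(RODATA, RODATA_BASE, 0x861 + 0xae))  # haha yes!
# lea rdi, [rip + 0xaa] at 0x868; the next instruction is at 0x86f.
print(read_cstring(RODATA, RODATA_BASE, 0x86f + 0xaa))  # nah dude
```

For the real thing, the bytes and base address should come from `elf.get_section_by_name('.rodata')`, using `.data()` and `['sh_addr']` just like `disas1.py` does for `.text`.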

### Conclusion

Wew, that took quite some time. But we’re done. If you’re a beginner, you might find this extremely confusing, or might not have understood what was going on at all. And that’s okay. Building an intuition for reading and grokking disassembly comes with practice. I’m no good at it either.

All the code used in this post is here: [https://github.com/icyphox/asdf/tree/master/reversing-elf](https://github.com/icyphox/asdf/tree/master/reversing-elf)

Ciao for now, and I’ll see ya in #2 of this series: PE binaries. Whenever that is.