---
template: text.html
title: Python for Reverse Engineering #1: ELF Binaries
subtitle: Building your own disassembly tooling for, that’s right, fun and profit
date: 2019-02-08
url: python-for-re-1
---

While solving complex reversing challenges, we often use established tools like radare2 or IDA for disassembling and debugging. But there are times when you need to dig in a little deeper and understand how things work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it comes to automating certain processes, and for eventually building your own homebrew reversing toolchain of sorts. At least, that’s what I’m attempting anyway.

## Setup

As the title suggests, you’re going to need a Python 3 interpreter before
anything else. Once you’ve confirmed beyond reasonable doubt that you do,
in fact, have a Python 3 interpreter installed on your system, run

```console
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we’ll be scripting with and `pyelftools` helps us parse ELF files.

With that out of the way, let’s start with an example of a basic reversing
challenge.

```c
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *pw = malloc(10);       // 9 characters + terminating '\0'
    pw[0] = 'a';
    for(int i = 1; i <= 8; i++){
        pw[i] = pw[i - 1] + 1;
    }
    pw[9] = '\0';
    char *in = malloc(10);
    printf("password: ");
    fgets(in, 10, stdin);        // 'abcdefghi'
    if(strcmp(in, pw) == 0) {
        printf("haha yes!\n");
    }
    else {
        printf("nah dude\n");
    }
}
```

Compile it with GCC/Clang:

```console
$ gcc chall.c -o chall.elf
```

## Scripting

For starters, let’s look at the different sections present in the binary.

```python
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and also shows us the address each one is loaded at. This will be pretty useful later. Running it gives us

```console
› python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

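For a peek under the hood, here is a rough sketch of roughly the first thing a parser like `pyelftools` has to do before it can list sections: read the ELF header to find out where the section header table lives. It uses only the standard library’s `struct` module and assumes a 64-bit little-endian ELF. `parse_elf64_header` is a name made up for this sketch, not part of `pyelftools`.

```python
import struct

# Layout of the 64-byte ELF64 header (little-endian), per the ELF spec:
# e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
# e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct('<16sHHIQQQIHHHHHH')

FIELDS = ('e_ident', 'e_type', 'e_machine', 'e_version', 'e_entry',
          'e_phoff', 'e_shoff', 'e_flags', 'e_ehsize', 'e_phentsize',
          'e_phnum', 'e_shentsize', 'e_shnum', 'e_shstrndx')


def parse_elf64_header(data):
    """Hypothetical helper: decode the ELF64 header at the start of `data`."""
    if data[:4] != b'\x7fELF':
        raise ValueError('not an ELF file')
    if data[4] != 2 or data[5] != 1:
        raise ValueError('this sketch only handles 64-bit little-endian ELF')
    return dict(zip(FIELDS, ELF64_HEADER.unpack_from(data)))


# Usage, assuming chall.elf sits in the current directory:
# with open('./chall.elf', 'rb') as f:
#     header = parse_elf64_header(f.read(64))
#     print(hex(header['e_shoff']), header['e_shnum'])
```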
Most of these aren’t relevant to us, but a few sections here are worth noting. The `.text` section contains the instructions (opcodes) that we’re after. The `.data` section holds global variables initialized at compile time, while string literals and other constants live in `.rodata`. Finally, there’s the `.plt`, the Procedure Linkage Table, and the `.got`, the Global Offset Table. If you’re unsure about what these mean, read up on the ELF format and its internals.

Since we know that the `.text` section has the opcodes, let’s disassemble the binary starting at that address.

```python
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import *

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()            # raw bytes of the .text section
    addr = code['sh_addr']       # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think). Running it, we should see this

```console
› python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
**0x6c4: call qword ptr [rip + 0x200916]**
0x6ca: hlt
... snip ...
```

The line in bold is fairly interesting to us. `rip` always points at the next instruction, so the address at `[rip + 0x200916]` is equivalent to `[0x6ca + 0x200916]`, which in turn evaluates to `0x200fe0`. So the first `call` goes through a function pointer stored at `0x200fe0`. What could this function be?

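The arithmetic above is easy to get wrong by hand, so here is a small sketch of it. On x86-64, `rip` holds the address of the *next* instruction, so the target is the instruction’s own address plus its length plus the displacement. `rip_relative_target` is a hypothetical helper; when scripting with capstone, the length would come from `i.size`.

```python
def rip_relative_target(insn_addr, insn_size, displacement):
    """Hypothetical helper: resolve a [rip + displacement] operand."""
    next_insn = insn_addr + insn_size   # rip points here during execution
    return next_insn + displacement


# call qword ptr [rip + 0x200916] sits at 0x6c4 and the next
# instruction (hlt) is at 0x6ca, so the call is 6 bytes long.
slot = rip_relative_target(0x6c4, 6, 0x200916)
print(hex(slot))   # 0x200fe0
```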
For this, we will have to look at **relocations**. Quoting [linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):
> Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. Relocatable files must have “relocation entries” which are necessary because they contain information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process’s program image.

To try and find these relocation entries, we write a third script.

```python
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link holds the index of the associated symbol table
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let’s run through this code real quick. We first loop through the sections and check whether each one is of the type `RelocationSection`. For every such section, we fetch the symbol table it links to (via `sh_link`), then iterate through its relocations and look up the symbol each one refers to. Entries without a symbol show up with an empty name. Finally, running this gives us

```console
› python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
**__libc_start_main 0x200fe0**
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```

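One way to put this output to use is as a lookup table from address to symbol name. The sketch below parses a few lines of the text printed above (pasted in as a string so it runs on its own) rather than calling `pyelftools` again. `parse_relocations` is a hypothetical helper, not a library function.

```python
# A few lines copied from the output of relocations.py above.
OUTPUT = """\
.rela.dyn:
 0x200d98
_ITM_deregisterTMCloneTable 0x200fd8
__libc_start_main 0x200fe0
__gmon_start__ 0x200fe8
.rela.plt:
puts 0x200fb0
printf 0x200fb8
"""


def parse_relocations(text):
    """Hypothetical helper: map relocation address -> symbol name."""
    table = {}
    for line in text.splitlines():
        if line.endswith(':'):           # section header such as ".rela.plt:"
            continue
        name, _, addr = line.rpartition(' ')
        table[int(addr, 16)] = name      # name is '' when there is no symbol
    return table


relocs = parse_relocations(OUTPUT)
print(relocs[0x200fe0])   # __libc_start_main
```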
Remember the `call` through `0x200fe0` from earlier? Yep, that slot ends up holding the address of the well-known `__libc_start_main`. Again, according to [linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):
> The `__libc_start_main()` function shall perform any necessary initialization of the execution environment, call the *main* function with appropriate arguments, and handle the return from `main()`. If the `main()` function returns, the return value shall be passed to the `exit()` function.

And its definition is like so

```c
int __libc_start_main(int *(main) (int, char * *, char * *),
    int argc, char * * ubp_av,
    void (*init) (void),
    void (*fini) (void),
    void (*rtld_fini) (void),
    void (* stack_end));
```

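To make the register bookkeeping concrete, here is a sketch that pairs the System V AMD64 integer argument registers with the parameters from the `__libc_start_main` definition just above. The register order is the standard calling convention on x86-64 Linux; the dictionary itself is only an illustration.

```python
# Integer/pointer arguments are passed in these registers, in order,
# under the System V AMD64 calling convention used on x86-64 Linux.
ARG_REGISTERS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']

# Parameter names from the __libc_start_main definition quoted above.
# The seventh (stack_end) does not fit in a register and goes on the stack.
PARAMS = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini', 'stack_end']

in_registers = dict(zip(ARG_REGISTERS, PARAMS))
on_stack = PARAMS[len(ARG_REGISTERS):]

for reg, param in in_registers.items():
    print(f'{reg:>3} -> {param}')
print('stack ->', ', '.join(on_stack))
```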
Looking back at our disassembly

```
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
**0x6bd: lea rdi, [rip + 0xe6]**
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

but this time, look at the `lea` (Load Effective Address) instruction, which loads the address `rip + 0xe6` into the `rdi` register. With `rip` pointing at the next instruction (`0x6c4`), that evaluates to `0x7aa`, which happens to be the address of our `main()` function! How do I know that? Because `rdi` carries the first argument to `__libc_start_main()`, which, after doing whatever it does, eventually calls that function pointer, our `main()`. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)

To see the disassembly of `main`, seek to `0x7aa` in the output of the script we’d written earlier (`disas1.py`).

From what we discovered earlier, each `call` to a library function can be traced back to a relocation entry. The `call`s in `main` land on small stubs in the `.plt`, and each stub jumps through the slot listed for it in `.rela.plt`. Following each `call` to its relocation gives us this

```
puts   0x640
printf 0x650
fgets  0x660
strcmp 0x670
malloc 0x680
```

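Those stub addresses are not arbitrary. As a sketch, assuming the classic x86-64 PLT layout where every entry is 16 bytes and the first entry is reserved for the dynamic resolver, the Nth `.rela.plt` relocation belongs to entry N + 1. The `.plt` base and the symbol order are taken from the script outputs earlier in the post; binaries built with other PLT flavours may be laid out differently.

```python
PLT_BASE = 0x630        # address of .plt from sections.py
PLT_ENTRY_SIZE = 16     # assumption: classic x86-64 PLT entry size

# Symbols in the order relocations.py printed them for .rela.plt.
RELA_PLT = ['puts', 'printf', 'fgets', 'strcmp', 'malloc']

# Entry 0 is the resolver stub, so the first symbol uses entry 1.
stubs = {name: PLT_BASE + PLT_ENTRY_SIZE * (index + 1)
         for index, name in enumerate(RELA_PLT)}

for name, addr in stubs.items():
    print(f'{name:<6} {addr:#x}')
```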
Putting all this together, things start falling into place. Let me highlight the key sections of the disassembly here. It’s pretty self-explanatory.

```
0x7b2: mov edi, 0xa  ; 10
0x7b7: call 0x680    ; malloc
```

The loop to populate the `*pw` string

```
0x7d0:  mov     eax, dword ptr [rbp - 0x14]
0x7d3:  cdqe
0x7d5:  lea     rdx, [rax - 1]
0x7d9:  mov     rax, qword ptr [rbp - 0x10]
0x7dd:  add     rax, rdx
0x7e0:  movzx   eax, byte ptr [rax]
0x7e3:  lea     ecx, [rax + 1]
0x7e6:  mov     eax, dword ptr [rbp - 0x14]
0x7e9:  movsxd  rdx, eax
0x7ec:  mov     rax, qword ptr [rbp - 0x10]
0x7f0:  add     rax, rdx
0x7f3:  mov     edx, ecx
0x7f5:  mov     byte ptr [rax], dl
0x7f7:  add     dword ptr [rbp - 0x14], 1
0x7fb:  cmp     dword ptr [rbp - 0x14], 8
0x7ff:  jle     0x7d0
```

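If the listing is hard to follow, it can help to replay it in Python. This is a hand-written sketch, not generated output: `i` stands in for the counter at `[rbp - 0x14]` and `pw` for the buffer pointed to by `[rbp - 0x10]`. It assumes the counter was set to 1 and `pw[0]` to `'a'` before the loop, as in `chall.c`; the listing only shows the loop body and the check at its end.

```python
pw = bytearray(10)      # buffer at [rbp - 0x10]; zero-filled here for simplicity
pw[0] = ord('a')
i = 1                   # counter at [rbp - 0x14]

while True:
    prev = pw[i - 1]            # 0x7d0-0x7e0: load pw[i - 1]
    pw[i] = (prev + 1) & 0xff   # 0x7e3-0x7f5: store the low byte of prev + 1
    i += 1                      # 0x7f7: add dword ptr [rbp - 0x14], 1
    if not i <= 8:              # 0x7fb-0x7ff: cmp ..., 8 / jle 0x7d0
        break

password = pw[:9].decode()
print(password)   # abcdefghi
```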
And this looks like our `strcmp()`

```
0x843:  mov     rdx, qword ptr [rbp - 0x10] ; pw
0x847:  mov     rax, qword ptr [rbp - 8]    ; in
0x84b:  mov     rsi, rdx                    ; 2nd argument: pw
0x84e:  mov     rdi, rax                    ; 1st argument: in
0x851:  call    0x670                       ; strcmp
0x856:  test    eax, eax                    ; is it 0?
0x858:  jne     0x868                       ; no? jump to 0x868
0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!"
0x861:  call    0x640                       ; puts
0x866:  jmp     0x874
0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
0x86f:  call    0x640                       ; puts
```

I’m not sure why it uses `puts` here? I might be missing something; perhaps `printf` calls `puts`. I could be wrong. I also confirmed with radare2 that those locations are actually the strings “haha yes!” and “nah dude”.

**Update**: It’s a compiler optimization. These two `printf()` calls have no format specifiers and their strings end in a newline, so the compiler swaps them for the cheaper `puts()`, which appends the newline itself. The `printf("password: ")` call has no trailing newline, so it stays a `printf()`.

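To tie the pieces together, here is a sketch of the whole check in Python. `fake_fgets` is a hypothetical stand-in that mimics how `fgets(in, 10, stdin)` reads at most 9 characters and stops after a newline. That limit is why typing `abcdefghi` followed by Enter still matches: the newline never makes it into the buffer.

```python
def fake_fgets(line, size):
    """Hypothetical stand-in for fgets: keep at most size - 1 characters,
    stopping after the first newline (which is kept, as fgets does)."""
    out = ''
    for ch in line:
        if len(out) == size - 1:
            break
        out += ch
        if ch == '\n':
            break
    return out


def check(typed):
    pw = ''.join(chr(ord('a') + i) for i in range(9))   # 'abcdefghi'
    entered = fake_fgets(typed, 10)
    return 'haha yes!' if entered == pw else 'nah dude'


print(check('abcdefghi\n'))   # haha yes!
print(check('abc\n'))         # nah dude
```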
## Conclusion

Wew, that took quite some time. But we’re done. If you’re a beginner, you might find this extremely confusing, or might not have understood what was going on at all. And that’s okay. Building an intuition for reading and grokking disassembly comes with practice. I’m no good at it either.

All the code used in this post is here: [https://github.com/icyphox/asdf/tree/master/reversing-elf](https://github.com/icyphox/asdf/tree/master/reversing-elf)

Ciao for now, and I’ll see ya in #2 of this series -- PE binaries. Whenever that is.