---
template: text.html
title: Python for Reverse Engineering #1: ELF Binaries
---

# Python for Reverse Engineering 1: ELF Binaries

## Building your own disassembly tooling for, that’s right, fun and profit

While solving complex reversing challenges, we often use established tools like radare2 or IDA for disassembling and debugging. But there are times when you need to dig in a little deeper and understand how things work under the hood.

Rolling your own disassembly scripts can be immensely helpful when it comes to automating certain processes, and eventually building your own homebrew reversing toolchain of sorts. At least, that’s what I’m attempting anyway.

### Setup

As the title suggests, you’re going to need a Python 3 interpreter before
anything else. Once you’ve confirmed beyond reasonable doubt that you do,
in fact, have a Python 3 interpreter installed on your system, run

```console
$ pip install capstone pyelftools
```

where `capstone` is the disassembly engine we’ll be scripting with and `pyelftools` helps us parse ELF files.

With that out of the way, let’s start with an example of a basic reversing
challenge.

```c
/* chall.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
   char *pw = malloc(10);       // 9 characters + terminating '\0'
   pw[0] = 'a';
   for(int i = 1; i <= 8; i++){
       pw[i] = pw[i - 1] + 1;
   }
   pw[9] = '\0';
   char *in = malloc(10);
   printf("password: ");
   fgets(in, 10, stdin);        // 'abcdefghi'
   if(strcmp(in, pw) == 0) {
       printf("haha yes!\n");
   }
   else {
       printf("nah dude\n");
   }
}
```

Compile it with GCC/Clang:

```console
$ gcc chall.c -o chall.elf
```

### Scripting

For starters, let’s look at the different sections present in the binary.

```python
# sections.py

from elftools.elf.elffile import ELFFile

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        print(hex(section['sh_addr']), section.name)
```

This script iterates through all the sections and prints the address at which each one is loaded. This will be pretty useful later. Running it gives us

```console
$ python sections.py
0x238 .interp
0x254 .note.ABI-tag
0x274 .note.gnu.build-id
0x298 .gnu.hash
0x2c0 .dynsym
0x3e0 .dynstr
0x484 .gnu.version
0x4a0 .gnu.version_r
0x4c0 .rela.dyn
0x598 .rela.plt
0x610 .init
0x630 .plt
0x690 .plt.got
0x6a0 .text
0x8f4 .fini
0x900 .rodata
0x924 .eh_frame_hdr
0x960 .eh_frame
0x200d98 .init_array
0x200da0 .fini_array
0x200da8 .dynamic
0x200f98 .got
0x201000 .data
0x201010 .bss
0x0 .comment
0x0 .symtab
0x0 .strtab
0x0 .shstrtab
```

Most of these aren’t relevant to us, but a few sections are worth noting. The `.text` section contains the instructions (opcodes) that we’re after. The `.rodata` section holds read-only data such as string literals and constants, while `.data` holds writable globals that were initialized at compile time. Finally, there are the `.plt`, the Procedure Linkage Table, and the `.got`, the Global Offset Table, which together route calls to shared-library functions. If you’re unsure about what these mean, read up on the ELF format and its internals.
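One way to tell these sections apart programmatically is the flags bitmask in each section header, which pyelftools exposes as `section['sh_flags']`. The sketch below decodes the three standard flag bits into a short string. `describe_flags` is a name made up for this example, and the sample values are what a typical GCC build uses for these sections, not values read from `chall.elf`.

```python
# Flag bits from the ELF specification (sh_flags field of a section header).
SHF_WRITE = 0x1      # writable at run time
SHF_ALLOC = 0x2      # occupies memory in the process image
SHF_EXECINSTR = 0x4  # contains executable instructions


def describe_flags(sh_flags: int) -> str:
    """Render an sh_flags bitmask as a combination of W, A and X."""
    out = ''
    if sh_flags & SHF_WRITE:
        out += 'W'
    if sh_flags & SHF_ALLOC:
        out += 'A'
    if sh_flags & SHF_EXECINSTR:
        out += 'X'
    return out


# Typical values (assumed, not read from chall.elf):
for name, flags in [('.text', 0x6), ('.rodata', 0x2), ('.data', 0x3)]:
    print(name, describe_flags(flags))
```

Calling `describe_flags(section['sh_flags'])` inside the loop of `sections.py` would show at a glance which sections hold code (`X`) and which are loaded at all (`A`). Sections without the `A` flag are never mapped, which is why `.comment` and `.symtab` report an address of `0x0`.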

Since we know that the `.text` section has the opcodes, let’s disassemble the binary starting at that address.

```python
# disas1.py

from elftools.elf.elffile import ELFFile
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

with open('./chall.elf', 'rb') as f:
    elf = ELFFile(f)
    code = elf.get_section_by_name('.text')
    ops = code.data()           # raw bytes of the section
    addr = code['sh_addr']      # address the section is loaded at
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for i in md.disasm(ops, addr):
        print(f'0x{i.address:x}:\t{i.mnemonic}\t{i.op_str}')
```

The code is fairly straightforward (I think). Running it, we should see

```console
$ python disas1.py | less
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

The `call` at `0x6c4` is fairly interesting to us. In 64-bit mode, `rip` holds the address of the *next* instruction, so `[rip + 0x200916]` is equivalent to `[0x6ca + 0x200916]`, which in turn evaluates to `[0x200fe0]`. So the very first `call` goes through a pointer stored at `0x200fe0`? What function could that pointer refer to?
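The address arithmetic is easy to get wrong by hand, so here is a minimal sketch of it. `rip_relative_target` and `resolve_rip_operands` are names invented for this example. The second function assumes capstone's detail mode and its x86 operand fields (`op.mem.base`, `op.mem.disp`), and imports capstone locally so the plain arithmetic helper works without it.

```python
def rip_relative_target(insn_addr: int, insn_size: int, disp: int) -> int:
    """Address referenced by a [rip + disp] operand.

    rip points at the *next* instruction, i.e. insn_addr + insn_size.
    """
    return insn_addr + insn_size + disp


def resolve_rip_operands(code: bytes, base: int):
    """Yield (address, mnemonic, target) for every rip-relative memory operand."""
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64
    from capstone.x86 import X86_OP_MEM, X86_REG_RIP

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True  # operand details are only decoded in detail mode
    for insn in md.disasm(code, base):
        for op in insn.operands:
            if op.type == X86_OP_MEM and op.mem.base == X86_REG_RIP:
                yield insn.address, insn.mnemonic, rip_relative_target(
                    insn.address, insn.size, op.mem.disp)


# The call at 0x6c4 is 6 bytes long (the next instruction sits at 0x6ca).
print(hex(rip_relative_target(0x6c4, 6, 0x200916)))  # 0x200fe0
```

With the `ops` and `addr` variables from `disas1.py`, looping over `resolve_rip_operands(ops, addr)` would list every such reference in `.text`.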

For this, we will have to look at **relocations**. Quoting [linuxbase.org](http://refspecs.linuxbase.org/elf/gabi4+/ch4.reloc.html):
> Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. Relocatable files must have “relocation entries” which are necessary because they contain information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process’s program image.

To try and find these relocation entries, we write a third script.

```python
# relocations.py

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationSection

with open('./chall.elf', 'rb') as f:
    e = ELFFile(f)
    for section in e.iter_sections():
        if isinstance(section, RelocationSection):
            print(f'{section.name}:')
            # sh_link holds the index of the associated symbol table
            symbol_table = e.get_section(section['sh_link'])
            for relocation in section.iter_relocations():
                # r_info_sym is the symbol's index in that table
                symbol = symbol_table.get_symbol(relocation['r_info_sym'])
                addr = hex(relocation['r_offset'])
                print(f'{symbol.name} {addr}')
```

Let’s run through this code real quick. We first loop through the sections and check whether each one is of the type `RelocationSection`. For every such section, we fetch the symbol table it is linked to (via `sh_link`), then iterate through its relocations, looking up each relocation’s symbol by index and printing the symbol’s name next to the address the relocation applies to. Running this gives us

```console
$ python relocations.py
.rela.dyn:
 0x200d98
 0x200da0
 0x201008
_ITM_deregisterTMCloneTable 0x200fd8
__libc_start_main 0x200fe0
__gmon_start__ 0x200fe8
_ITM_registerTMCloneTable 0x200ff0
__cxa_finalize 0x200ff8
stdin 0x201010
.rela.plt:
puts 0x200fb0
printf 0x200fb8
fgets 0x200fc0
strcmp 0x200fc8
malloc 0x200fd0
```

Remember the call through `0x200fe0` from earlier? Yep, that slot is filled in with the address of the well-known `__libc_start_main`. Again, according to [linuxbase.org](http://refspecs.linuxbase.org/LSB_3.1.0/LSB-generic/LSB-generic/baselib---libc-start-main-.html):
> The `__libc_start_main()` function shall perform any necessary initialization of the execution environment, call the *main* function with appropriate arguments, and handle the return from `main()`. If the `main()` function returns, the return value shall be passed to the `exit()` function.
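To put the relocation table to work, its entries can be folded into a lookup keyed by address, so that an indirect call through a relocated slot gets a name automatically. This is only a sketch: the pairs are copied by hand from the `relocations.py` output, and `SLOT_TO_SYMBOL` and `name_indirect_call` are names made up here.

```python
# (symbol, address) pairs copied from the relocations.py output;
# the three unnamed .rela.dyn entries are left out.
RELOCATIONS = [
    ('_ITM_deregisterTMCloneTable', 0x200fd8),
    ('__libc_start_main', 0x200fe0),
    ('__gmon_start__', 0x200fe8),
    ('_ITM_registerTMCloneTable', 0x200ff0),
    ('__cxa_finalize', 0x200ff8),
    ('stdin', 0x201010),
    ('puts', 0x200fb0),
    ('printf', 0x200fb8),
    ('fgets', 0x200fc0),
    ('strcmp', 0x200fc8),
    ('malloc', 0x200fd0),
]

# Map each relocated slot to the symbol whose address ends up in it.
SLOT_TO_SYMBOL = {addr: name for name, addr in RELOCATIONS}


def name_indirect_call(insn_addr: int, insn_size: int, disp: int) -> str:
    """Name the target of `call qword ptr [rip + disp]`, if it is relocated."""
    slot = insn_addr + insn_size + disp  # rip is the next instruction
    return SLOT_TO_SYMBOL.get(slot, f'unknown (slot {slot:#x})')


print(name_indirect_call(0x6c4, 6, 0x200916))  # __libc_start_main
```

In a real tool the table would be built inside the loop of `relocations.py` instead of being typed in.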

And the definition of `__libc_start_main()` looks like so

```c
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc, char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void (* stack_end));
```

Let’s look back at our disassembly, this time at the `lea` (Load Effective Address) instruction at `0x6bd`

```
0x6a0: xor ebp, ebp
0x6a2: mov r9, rdx
0x6a5: pop rsi
0x6a6: mov rdx, rsp
0x6a9: and rsp, 0xfffffffffffffff0
0x6ad: push rax
0x6ae: push rsp
0x6af: lea r8, [rip + 0x23a]
0x6b6: lea rcx, [rip + 0x1c3]
0x6bd: lea rdi, [rip + 0xe6]
0x6c4: call qword ptr [rip + 0x200916]
0x6ca: hlt
... snip ...
```

It loads the address `[rip + 0xe6]` into the `rdi` register. With `rip` pointing at the next instruction (`0x6c4`), that evaluates to `0x7aa`, which happens to be the address of our `main()` function! How do I know that? Because `rdi` carries the first argument under the x86-64 System V calling convention, and the first parameter of `__libc_start_main()` is `main`. After doing whatever it does, `__libc_start_main()` eventually calls that function. It looks something like this

![](https://cdn-images-1.medium.com/max/800/0*oQA2MwHjhzosF8ZH.png)
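The same reasoning covers the other two `lea` instructions before the call. Here is a small sketch that pairs the System V argument registers with the parameters of `__libc_start_main()` and resolves each `rip`-relative address. The instruction sizes are inferred from the gaps between consecutive addresses in the listing, and the variable names are invented for this example.

```python
# Integer argument registers of the x86-64 System V calling convention,
# in order, paired with the first six parameters of __libc_start_main.
ARG_REGS = ['rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9']
PARAMS = ['main', 'argc', 'ubp_av', 'init', 'fini', 'rtld_fini']
REG_TO_PARAM = dict(zip(ARG_REGS, PARAMS))

# (address, size, destination register, displacement) of the three
# rip-relative lea instructions in the listing.
LEAS = [
    (0x6af, 7, 'r8', 0x23a),
    (0x6b6, 7, 'rcx', 0x1c3),
    (0x6bd, 7, 'rdi', 0xe6),
]

# Resolve each lea: rip is the address of the next instruction.
resolved = {
    REG_TO_PARAM[reg]: addr + size + disp
    for addr, size, reg, disp in LEAS
}

for param, target in resolved.items():
    print(f'{param:5} -> {target:#x}')
```

All three targets fall between the start of `.text` at `0x6a0` and the start of `.fini` at `0x8f4`, which is what you would expect of pointers to code.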

To see the disassembly of `main`, seek to `0x7aa` in the output of the script we’d written earlier (`disas1.py`).

From what we discovered earlier, each `call` in `main` points at a small stub in the `.plt`, which in turn jumps through the function’s relocation entry in the `.got`. So following each `call` through its stub to the matching relocation gives us this

```
puts   0x640
printf 0x650
fgets  0x660
strcmp 0x670
malloc 0x680
```
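These stub addresses can also be derived instead of traced by hand. The sketch below assumes the classic lazy-binding PLT layout: 16-byte entries, with the first one reserved for the resolver, and the remaining ones in the same order as `.rela.plt`. Toolchains that emit `.plt.sec` or link with full RELRO may lay things out differently, so treat this as something that happens to match this binary. `plt_stubs` is a hypothetical helper name.

```python
PLT_ENTRY_SIZE = 0x10  # assumed: classic x86-64 PLT layout


def plt_stubs(plt_addr: int, names: list) -> dict:
    """Map PLT stub addresses to names, assuming the i-th .rela.plt entry
    owns slot i + 1 (slot 0 is the resolver trampoline)."""
    return {
        plt_addr + PLT_ENTRY_SIZE * (i + 1): name
        for i, name in enumerate(names)
    }


# .plt address from sections.py, names in .rela.plt order from relocations.py
stubs = plt_stubs(0x630, ['puts', 'printf', 'fgets', 'strcmp', 'malloc'])
for addr, name in stubs.items():
    print(f'{name:6} {addr:#x}')
```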

Putting all this together, things start falling into place. Let me highlight the key sections of the disassembly here. It’s pretty self-explanatory.

```
0x7b2: mov edi, 0xa  ; 10
0x7b7: call 0x680    ; malloc
```

The loop to populate the `*pw` string

```
0x7d0:  mov     eax, dword ptr [rbp - 0x14]  ; i
0x7d3:  cdqe    
0x7d5:  lea     rdx, [rax - 1]               ; i - 1
0x7d9:  mov     rax, qword ptr [rbp - 0x10]  ; pw
0x7dd:  add     rax, rdx
0x7e0:  movzx   eax, byte ptr [rax]          ; pw[i - 1]
0x7e3:  lea     ecx, [rax + 1]               ; pw[i - 1] + 1
0x7e6:  mov     eax, dword ptr [rbp - 0x14]
0x7e9:  movsxd  rdx, eax
0x7ec:  mov     rax, qword ptr [rbp - 0x10]
0x7f0:  add     rax, rdx                     ; &pw[i]
0x7f3:  mov     edx, ecx
0x7f5:  mov     byte ptr [rax], dl           ; pw[i] = pw[i - 1] + 1
0x7f7:  add     dword ptr [rbp - 0x14], 1    ; i++
0x7fb:  cmp     dword ptr [rbp - 0x14], 8
0x7ff:  jle     0x7d0                        ; loop while i <= 8
```
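To convince ourselves that this loop really builds the password, it can be replayed in Python. This is a hand translation, not something generated from the binary: the initial `pw[0] = 'a'` and `i = 1` come from `chall.c`, since the excerpt only shows the loop body, and `recover_password` is a made-up name.

```python
def recover_password() -> str:
    """Replay the loop at 0x7d0..0x7ff on a Python bytearray."""
    pw = bytearray(10)       # malloc(10), from `mov edi, 0xa`
    pw[0] = ord('a')         # set before the loop, per chall.c
    i = 1
    while True:
        pw[i] = pw[i - 1] + 1  # lea ecx, [rax + 1] ; mov byte ptr [rax], dl
        i += 1                 # add dword ptr [rbp - 0x14], 1
        if not i <= 8:         # cmp ..., 8 ; jle 0x7d0
            break
    return pw[:9].decode('ascii')  # pw[9] stays 0, the terminator


print(recover_password())  # abcdefghi
```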

And this looks like our `strcmp()`

```
0x843:  mov     rdx, qword ptr [rbp - 0x10] ; *pw
0x847:  mov     rax, qword ptr [rbp - 8]    ; *in
0x84b:  mov     rsi, rdx             
0x84e:  mov     rdi, rax
0x851:  call    0x670                       ; strcmp  
0x856:  test    eax, eax                    ; is = 0? 
0x858:  jne     0x868                       ; no? jump to 0x868
0x85a:  lea     rdi, [rip + 0xae]           ; "haha yes!" 
0x861:  call    0x640                       ; puts
0x866:  jmp     0x874
0x868:  lea     rdi, [rip + 0xaa]           ; "nah dude"
0x86f:  call    0x640                       ; puts  
```

Why `puts` here when the source calls `printf`? GCC rewrites a `printf` call whose format string has no format specifiers and ends in a newline into the cheaper `puts`, dropping the trailing `\n` from the string since `puts` appends one itself. I also confirmed with radare2 that those locations are actually the strings “haha yes!” and “nah dude”.
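The string check does not strictly need radare2. Given a section's bytes and load address, a few lines of Python can read the NUL-terminated string at any virtual address. In the sketch below `cstring_at` is a hypothetical helper, and `rodata` is a hand-built stand-in for the real `.rodata` contents, laid out so the strings sit at the addresses computed from the disassembly.

```python
def cstring_at(data: bytes, base: int, vaddr: int) -> str:
    """Read the NUL-terminated string at virtual address `vaddr` from a
    section whose bytes are `data` and whose load address is `base`."""
    start = vaddr - base
    if not 0 <= start < len(data):
        raise ValueError(f'{vaddr:#x} is outside the section')
    end = data.index(b'\x00', start)
    return data[start:end].decode('ascii')


# Stand-in for the real .rodata contents (an assumption, not a dump).
RODATA_ADDR = 0x900
rodata = (b'\x01\x00\x02\x00' + b'password: \x00'
          + b'haha yes!\x00' + b'nah dude\x00')

# lea rdi, [rip + 0xae] at 0x85a and lea rdi, [rip + 0xaa] at 0x868;
# rip is the address of the following instruction in each case.
print(cstring_at(rodata, RODATA_ADDR, 0x861 + 0xae))  # haha yes!
print(cstring_at(rodata, RODATA_ADDR, 0x86f + 0xaa))  # nah dude
```

With the real binary, `data` and `base` would come from `elf.get_section_by_name('.rodata').data()` and that section's `sh_addr`.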

### Conclusion

Wew, that took quite some time. But we’re done. If you’re a beginner, you might find this extremely confusing, or might not have understood what was going on at all. And that’s okay. Building an intuition for reading and grokking disassembly comes with practice. I’m no good at it either.

All the code used in this post is here: [https://github.com/icyphox/asdf/tree/master/reversing-elf](https://github.com/icyphox/asdf/tree/master/reversing-elf)

Ciao for now, and I’ll see ya in #2 of this series: PE binaries. Whenever that is.