Question

So, I know that Linux uses four default segments for an x86 processor (kernel code, kernel data, user code, user data), but they all have the same base and limit (0x00000000 and 0xfffff), meaning each segment maps to the same set of linear addresses.

Given this, why even have user/kernel segments? I understand why there should be separate segments for code and data (just due to how the x86 processor deals with the cs and ds registers), but why not have a single code segment and a single data segment? Memory protection is done through paging, and the user and kernel segments map to the same linear addresses anyway.

Solution

The x86 architecture associates a type and a privilege level with each segment descriptor. The type of a descriptor allows segments to be made read only, read/write, executable, etc., but the main reason for different segments having the same base and limit is to allow a different descriptor privilege level (DPL) to be used.

The DPL is a two-bit field, so it can encode the values 0 through 3. Privilege level 0, known as ring 0, is the most privileged. The segment descriptors for the Linux kernel are ring 0, whereas the segment descriptors for user space are ring 3 (least privileged). This is true for most segmented operating systems: the core of the operating system is ring 0 and the rest is ring 3.
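To make the layout concrete, here is a small illustrative C sketch (not kernel code) that decodes the type, S, DPL, and P fields from a raw 8-byte descriptor, assuming the standard x86 layout in which those fields occupy bits 40-47:

#include <stdio.h>
#include <stdint.h>

/* Decode the access-related fields of a raw x86 segment descriptor.
 * In the standard layout, bits 40-43 hold the type, bit 44 the S flag,
 * bits 45-46 the DPL, and bit 47 the present bit. */
static void decode_descriptor(uint64_t d)
{
    unsigned type = (unsigned)((d >> 40) & 0xF); /* 10 = code (exec/read), 2 = data (read/write) */
    unsigned s    = (unsigned)((d >> 44) & 0x1); /* 1 = code/data segment, 0 = system segment */
    unsigned dpl  = (unsigned)((d >> 45) & 0x3); /* 0 = kernel (ring 0), 3 = user (ring 3) */
    unsigned p    = (unsigned)((d >> 47) & 0x1); /* present bit */

    printf("type=%u S=%u DPL=%u P=%u\n", type, s, dpl, p);
}

int main(void)
{
    /* 0x00cf9a000000ffff is the classic flat 4GB ring-0 code descriptor
     * (base=0, limit=0xfffff with 4K granularity, type=10, DPL=0)... */
    decode_descriptor(0x00cf9a000000ffffULL);
    /* ...and 0x00cffa000000ffff is the same flat segment at DPL 3 (user code). */
    decode_descriptor(0x00cffa000000ffffULL);
    return 0;
}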

The Linux kernel sets up, as you mentioned, four segments:

  • __KERNEL_CS (Kernel code segment, base=0, limit=4GB, type=10, DPL=0)
  • __KERNEL_DS (Kernel data segment, base=0, limit=4GB, type=2, DPL=0)
  • __USER_CS (User code segment, base=0, limit=4GB, type=10, DPL=3)
  • __USER_DS (User data segment, base=0, limit=4GB, type=2, DPL=3)

The base and limit of all four are the same, but the kernel segments are DPL 0, the user segments are DPL 3, the code segments are executable and readable (not writable), and the data segments are readable and writable (not executable).
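As an illustration of how those type and DPL values combine (a minimal sketch, not the kernel's actual GDT setup code), the descriptor's "access byte" is built from the present bit, DPL, S flag, and type, and for the four segments above it comes out to the familiar values 0x9A, 0x92, 0xFA, and 0xF2:

#include <stdio.h>

/* Combine the present bit, DPL, S flag, and type into a descriptor's
 * access byte: P in bit 7, DPL in bits 5-6, S in bit 4, type in bits 0-3. */
static unsigned char access_byte(unsigned dpl, unsigned type)
{
    return (unsigned char)((1u << 7) | ((dpl & 3u) << 5) | (1u << 4) | (type & 0xFu));
}

int main(void)
{
    printf("__KERNEL_CS access byte: 0x%02X\n", access_byte(0, 10)); /* 0x9A */
    printf("__KERNEL_DS access byte: 0x%02X\n", access_byte(0, 2));  /* 0x92 */
    printf("__USER_CS   access byte: 0x%02X\n", access_byte(3, 10)); /* 0xFA */
    printf("__USER_DS   access byte: 0x%02X\n", access_byte(3, 2));  /* 0xF2 */
    return 0;
}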

OTHER TIPS

The x86 memory management architecture uses both segmentation and paging. Very roughly speaking, a segment is a partition of a process's address space that has its own protection policy. So, in the x86 architecture, it is possible to split the range of memory addresses that a process sees into multiple contiguous segments, and assign different protection modes to each. Paging is a technique for mapping small (usually 4KB) regions of a process's address space to chunks of real, physical memory. Paging thus controls how regions inside a segment are mapped onto physical RAM.
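To see how the two mechanisms stack, here is a small illustrative sketch (assuming Linux's flat segments with base 0 and a 4KB page size): segmentation turns a segment-relative offset into a linear address, and paging then splits that linear address into a page number (mapped to a physical frame by the page tables) and an offset within the page:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Stage 1: segmentation. With a flat segment the base is 0,
 * so the linear address equals the offset. */
static uint32_t to_linear(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;
}

int main(void)
{
    uint32_t linear = to_linear(0x00000000, 0x080484d0);

    /* Stage 2: paging. The page tables map the page number to a
     * physical frame; the offset within the 4KB page is kept as-is. */
    uint32_t page_number = linear / PAGE_SIZE;
    uint32_t page_offset = linear % PAGE_SIZE;

    printf("linear=0x%08x page_number=0x%05x page_offset=0x%03x\n",
           linear, page_number, page_offset);
    return 0;
}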

All processes have two segments:

  1. one segment (addresses 0x00000000 through 0xBFFFFFFF) for user-level, process-specific data such as the program's code, static data, heap, and stack. Every process has its own, independent user segment.

  2. one segment (addresses 0xC0000000 through 0xFFFFFFFF), which contains kernel-specific data such as the kernel instructions, data, some stacks on which kernel code can execute, and more interestingly, a region in this segment is directly mapped to physical memory, so that the kernel can directly access physical memory locations without having to worry about address translation. The same kernel segment is mapped into every process, but processes can access it only when executing in protected kernel mode.

So, in user mode, the process may only access addresses below 0xC0000000; any access to an address at or above this boundary results in a fault. However, when a user-mode process begins executing in the kernel (for instance, after having made a system call), the CPU's privilege level is switched to supervisor mode (and some segment registers are changed), meaning that the process is then able to access addresses above 0xC0000000.
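The kernel's real check for this sits behind helpers such as access_ok(); the following is only a simplified, hypothetical sketch of that kind of range check for a 32-bit 3GB/1GB split, with the helper name user_range_ok() invented for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* On a 32-bit 3GB/1GB split, user space is [0, 0xC0000000) and the
 * kernel lives above that boundary. This hypothetical helper mimics
 * the kind of range check done before dereferencing a user pointer. */
#define KERNEL_BOUNDARY 0xC0000000UL

static bool user_range_ok(uintptr_t addr, size_t len)
{
    /* Reject ranges that would wrap around or cross into the kernel part. */
    if (len > KERNEL_BOUNDARY)
        return false;
    return addr <= KERNEL_BOUNDARY - len;
}

int main(void)
{
    /* A typical user-space buffer passes the check... */
    bool ok_user   = user_range_ok(0x08048000, 4096);
    /* ...but anything at or above 0xC0000000 is rejected. */
    bool ok_kernel = user_range_ok(0xC0100000UL, 16);

    printf("user buffer ok: %d, kernel address ok: %d\n", ok_user, ok_kernel);
    return 0;
}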

Referred from: HERE

In x86 Linux, segment registers are also used for a buffer-overflow check. See the code snippet below, which defines some char arrays on the stack:

/* assuming a typedef and an output helper like these exist elsewhere */
typedef unsigned int uint;
void my_putc(char c);

static void
printint(int xx, int base, int sgn)
{
    char digits[] = "0123456789ABCDEF";
    char buf[16];
    int i, neg;
    uint x;

    neg = 0;
    if(sgn && xx < 0){
        neg = 1;
        x = -xx;
    } else {
        x = xx;
    }

    /* convert to digits in reverse order */
    i = 0;
    do{
        buf[i++] = digits[x % base];
    }while((x /= base) != 0);
    if(neg)
        buf[i++] = '-';

    /* print the buffer back-to-front */
    while(--i >= 0)
        my_putc(buf[i]);
}

Now let's look at the disassembly of the gcc-generated code.

Dump of assembler code for function printint:

   0x00000000004005a6 <+0>:    push   %rbp
   0x00000000004005a7 <+1>:    mov    %rsp,%rbp
   0x00000000004005aa <+4>:    sub    $0x50,%rsp
   0x00000000004005ae <+8>:    mov    %edi,-0x44(%rbp)
   0x00000000004005b1 <+11>:   mov    %esi,-0x48(%rbp)
   0x00000000004005b4 <+14>:   mov    %edx,-0x4c(%rbp)
   0x00000000004005b7 <+17>:   mov    %fs:0x28,%rax          ------> loading the 8-byte guard value from a fixed offset off the fs segment register (i.e. from the descriptor base in the corresponding GDT entry)
   0x00000000004005c0 <+26>:   mov    %rax,-0x8(%rbp)        ------> storing it as the first local variable on the stack
   0x00000000004005c4 <+30>:   xor    %eax,%eax
   0x00000000004005c6 <+32>:   movl   $0x33323130,-0x20(%rbp)
   0x00000000004005cd <+39>:   movl   $0x37363534,-0x1c(%rbp)
   0x00000000004005d4 <+46>:   movl   $0x42413938,-0x18(%rbp)
   0x00000000004005db <+53>:   movl   $0x46454443,-0x14(%rbp)

...
...
  // function end

   0x0000000000400686 <+224>:  jns    0x40066a <printint+196>
   0x0000000000400688 <+226>:  mov    -0x8(%rbp),%rax        ------> verifying that the stack was not smashed
   0x000000000040068c <+230>:  xor    %fs:0x28,%rax          ------> checking that the value on the stack still matches the original fs-based canary
   0x0000000000400695 <+239>:  je     0x40069c <printint+246>
   0x0000000000400697 <+241>:  callq  0x400460 <__stack_chk_fail@plt>
   0x000000000040069c <+246>:  leaveq
   0x000000000040069d <+247>:  retq

Now if we remove the stack-based char arrays from this function, gcc won't generate this guard check.

I have seen the same code generated by gcc even for kernel modules. Basically, I was seeing a crash while bootstrapping some kernel code; it was faulting at virtual address 0x28. Later I figured out that although I had initialized the stack pointer correctly and loaded the program correctly, I did not have the right entries in the GDT, which would translate the fs-based offset into a valid virtual address.

However, in the case of kernel code, it was simply ignoring the error instead of jumping to something like __stack_chk_fail@plt.

The relevant gcc option that adds this guard is -fstack-protector. I think this is enabled by default when compiling a user application.
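If you want to see this yourself, here is a minimal sketch (the file and function names are arbitrary): the local char array makes gcc emit the %fs:0x28 canary load and the __stack_chk_fail check when built with -fstack-protector.

#include <string.h>

/* The local char array below makes gcc (with -fstack-protector) read a
 * canary from %fs:0x28 at function entry and verify it before returning. */
void copy_name(const char *src)
{
    char name[16];

    strncpy(name, src, sizeof(name) - 1);
    name[sizeof(name) - 1] = '\0';
}

int main(void)
{
    copy_name("hello");
    return 0;
}

Compiling with something like gcc -O0 -fstack-protector -S guard.c should show the %fs:0x28 load and a call to __stack_chk_fail in copy_name's assembly, while building with -fno-stack-protector (or removing the char array) makes both disappear.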

For the kernel, we can enable this gcc flag via the CC_STACKPROTECTOR config option.

config CC_STACKPROTECTOR
        bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
        depends on SUPERH32
        help
          This option turns on the -fstack-protector GCC feature. This
          feature puts, at the beginning of functions, a canary value on
          the stack just before the return address, and validates
          the value just before actually returning.  Stack based buffer
          overflows (that need to overwrite this return address) now also
          overwrite the canary, which gets detected and the attack is then
          neutralized via a kernel panic.

          This feature requires gcc version 4.2 or above.
The relevant kernel file where this gs/fs based guard is set up is linux/arch/x86/include/asm/stackprotector.h.

Kernel memory should not be readable from programs running in user space.

Program data is often not executable, thanks to DEP (a processor feature that helps guard against executing an overflowed buffer and other malicious attacks).

It's all about access control - different segments have different rights. That's why accessing the wrong segment will give you a "segmentation fault".

Licensed under: CC-BY-SA with attribution