Question

I apologize in advance if this question is deemed too trivial, but I did spend a large amount of time trying to find a straightforward answer online and could not.

In an intro programming class, I recently read about segmentation and how programs are typically divided into "segments" that are individually loaded into memory (or partially loaded, via paging... I think) when needed. Our book mentioned that programs are often divided into segments along logical lines, e.g. a segment for the stack, one for the heap, one for global constants, etc.

I am wondering what exactly determines how this segmentation works. Is it done by the compiler at compile time? Or does the OS somehow handle it? Does every subroutine typically get its own segment, no matter how small?

I know that segmentation-related information, such as segment descriptors, is handled at the architecture level, with registers specifically allocated for dealing with it. But I'm having a lot of trouble envisioning where and how the actual segmentation of the program happens in the first place, and how this information makes its way down into those registers. How are addresses translated into segment IDs and offsets? Can anyone enlighten me? Thank you very much for any help you can give, and sorry if I butchered any concepts here.

Solution

This is a good question, and I can only provide a little information that might steer you in the right direction. I believe program segmentation is defined by the executable file format, so if you want specific information, find the specification for your native format (any of the various ELF variants, for instance). It can be interesting to read about older formats such as a.out or the old "MZ" DOS binaries, if only for perspective and to see simpler specifications. [EDITED: for clarity]

As you seem to have guessed, segmentation is handled cooperatively by the toolchain (predominantly the linker, although the compiler has some impact: for instance, global C variables go into a different segment than local variables, which go on the stack; a small example follows) and the OS. For an example of OS involvement, good operating systems make use of the memory protection features of the hardware platform to enforce correct use of a program's segments.
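You can actually watch these divisions emerge by printing addresses from a C program. This is a minimal sketch; the segment names in the comments are the conventional ELF ones, and exact placement is up to your particular compiler, linker, and OS:

```c
#include <stdio.h>
#include <stdlib.h>

int global_init = 42;        /* initialized globals: .data segment  */
int global_uninit;           /* uninitialized globals: .bss segment */
const int global_const = 7;  /* constants: .rodata (read-only data) */

int main(void)
{
    int local = 0;                      /* locals live on the stack */
    int *heap = malloc(sizeof *heap);   /* dynamic memory: the heap */

    /* The addresses cluster into distinct regions, one per segment. */
    printf("code    (main):          %p\n", (void *)main);
    printf(".rodata (global_const):  %p\n", (void *)&global_const);
    printf(".data   (global_init):   %p\n", (void *)&global_init);
    printf(".bss    (global_uninit): %p\n", (void *)&global_uninit);
    printf("heap    (heap):          %p\n", (void *)heap);
    printf("stack   (local):         %p\n", (void *)&local);

    free(heap);
    return 0;
}
```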

Hopefully this will give you some material for further research.

OTHER TIPS

This refers to the executable file format; as others have noted, the linker puts this together.

On my OS X system, running file /bin/ls reports: /bin/ls: Mach-O universal binary with 2 architectures

You next want to look for details on that format, and for tools to read it. Actually looking at the segments will, I think, give you a great picture of what goes in them and how it's structured.

From a description of that format:

Each Mach-O file is made up of one Mach-O header, followed by a series of load commands, followed by one or more segments, each of which contains between 0 and 255 sections. Mach-O uses the REL relocation format to handle references to symbols. When looking up symbols Mach-O uses a two-level namespace that encodes each symbol into an 'object/symbol name' pair that is then linearly searched for by first the object and then the symbol name.

The basic structure—a list of variable-length "load commands" that reference pages of data elsewhere in the file—was also used in the executable file format for Accent. The Accent file format was in turn, based on an idea from Spice Lisp.
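If you want to poke at that structure directly, here is a rough sketch (macOS only, since it relies on <mach-o/loader.h>; it assumes a thin 64-bit Mach-O file, whereas a universal binary such as /bin/ls starts with a fat header that would have to be skipped first) that walks the load commands the quote describes:

```c
#include <stdio.h>
#include <stdint.h>
#include <mach-o/loader.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mach-o file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    struct mach_header_64 hdr;
    if (fread(&hdr, sizeof hdr, 1, f) != 1 || hdr.magic != MH_MAGIC_64) {
        fprintf(stderr, "not a thin 64-bit Mach-O file\n");
        fclose(f);
        return 1;
    }

    /* The header is followed immediately by ncmds load commands,
       each of which begins with a cmd type and a total size. */
    for (uint32_t i = 0; i < hdr.ncmds; i++) {
        struct load_command lc;
        if (fread(&lc, sizeof lc, 1, f) != 1) break;
        printf("load command %2u: cmd=0x%-8x size=%u bytes%s\n",
               i, lc.cmd, lc.cmdsize,
               lc.cmd == LC_SEGMENT_64 ? "  (a segment)" : "");
        fseek(f, (long)lc.cmdsize - (long)sizeof lc, SEEK_CUR);
    }

    fclose(f);
    return 0;
}
```

The otool -l command does the same job from the command line, with nicer decoding.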

Just for completeness, the analogous tools for other OSes: readelf and objdump for ELF binaries on Linux, and dumpbin for PE binaries on Windows.
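As a sketch of what such a tool does under the hood on Linux (this relies on the glibc <elf.h> header and assumes a 64-bit ELF file), here is a minimal reader that reports how many segments and sections an executable declares:

```c
#include <stdio.h>
#include <elf.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <elf file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr ehdr;
    if (fread(&ehdr, sizeof ehdr, 1, f) != 1 ||
        ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
        ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
        ehdr.e_ident[EI_CLASS] != ELFCLASS64) {
        fprintf(stderr, "not a 64-bit ELF file\n");
        fclose(f);
        return 1;
    }

    printf("%u program headers (run-time segments)\n", (unsigned)ehdr.e_phnum);
    printf("%u section headers (link-time sections)\n", (unsigned)ehdr.e_shnum);
    fclose(f);
    return 0;
}
```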

There are a few basic ideas here.

We want to ensure that the program's code and constant data aren't modified at run time, whether by bugs in the program or by malicious input exploiting those bugs. If such an attempt is detected, the OS should terminate the program. Benefits: catching bugs, improving security. Typical implementation mechanism: page-level memory protection.

We also typically don't want any data area of the program to be executable, because malicious input can exploit program bugs and lead to arbitrary (attacker-controlled) code execution in those areas. The implementation mechanism is the same; a sketch of it follows.
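Here is a user-space sketch of that page-level protection mechanism, using the POSIX mmap and mprotect calls (works on Linux and macOS; the OS applies the same permissions to your program's segments automatically):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    /* Get one page that is readable and writable, but not executable,
       just as the OS maps a program's data segment. */
    char *page = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(page, "hello");   /* fine: the page is writable */

    /* Revoke write permission, as the OS does for code and constants. */
    if (mprotect(page, (size_t)pagesize, PROT_READ) != 0) {
        perror("mprotect");
        return 1;
    }

    printf("still readable: %s\n", page);
    /* page[0] = 'H';  <- uncommenting this line now raises SIGSEGV:
       the MMU rejects the write and the OS delivers the fault. */

    munmap(page, (size_t)pagesize);
    return 0;
}
```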

Having gaps in memory (often called guard pages) between different kinds of program parts (code/constants, data, stack) that are inaccessible for reading/writing/execution can catch some buffer overflow bugs, which can, again, have a security impact. Sometimes such gaps are placed before and after every data object. In production code they are too costly (because of the wasted memory and the extra code needed to manage them), but they can be of great help when debugging.
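A guard page can be built by hand with the same POSIX calls; this sketch places an inaccessible page right after a buffer so that an overflow faults immediately instead of silently corrupting whatever lies beyond it:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    /* Reserve two adjacent pages: one for data, one as the guard. */
    char *base = mmap(NULL, 2 * (size_t)pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Make the second page completely inaccessible. */
    if (mprotect(base + pagesize, (size_t)pagesize, PROT_NONE) != 0) {
        perror("mprotect");
        return 1;
    }

    base[pagesize - 1] = 'x';   /* last valid byte of the buffer: OK */
    /* base[pagesize] = 'x';       one byte past the buffer:
                                   uncommenting this raises SIGSEGV  */

    printf("guard page armed after a %ld-byte buffer\n", pagesize);
    munmap(base, 2 * (size_t)pagesize);
    return 0;
}
```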

Yet another reason for the logical separation of code and data is shared libraries (e.g. DLLs). Your OS can share (again, via page translation) just the library's code between different processes, thus saving memory, while maintaining individual data areas in each process.

You'll see how all of this is made possible once you read about and understand page translation (with all those page tables and virtual-to-physical address translation).
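As a taste of what page translation involves, here is a worked example assuming the classic two-level, non-PAE 32-bit x86 scheme with 4 KiB pages (other configurations split the address differently): a virtual address is just three bit fields glued together.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t vaddr = 0x08048123;   /* an arbitrary example address */

    uint32_t pde_index = (vaddr >> 22) & 0x3FF; /* bits 31..22: page directory entry      */
    uint32_t pte_index = (vaddr >> 12) & 0x3FF; /* bits 21..12: page table entry          */
    uint32_t offset    =  vaddr        & 0xFFF; /* bits 11..0: byte within the 4 KiB page */

    printf("vaddr 0x%08X -> PDE %u, PTE %u, offset 0x%03X\n",
           vaddr, pde_index, pte_index, offset);

    /* The MMU walks the tables with the two indices to find a physical
       frame, then appends the offset to form the physical address. */
    return 0;
}
```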

Lastly, there can be hardware limitations such as a segmented address space. This is the case for the 16-bit modes of x86 CPUs. In those modes, even though you can access up to about 1 MB of memory (in real address mode and in virtual 8086 mode) or up to 16 MB (in 16-bit protected mode), the CPU forces you to use addresses broken into two 16-bit parts: the segment selector and the offset. Within each such segment you can only access up to 65536 bytes. If you need more, you have to use multiple segments, and to switch between them you must reload the segment registers to point at the segments of interest. This limitation led many MS-DOS assemblers and compilers to produce object (=partially compiled) and executable (=fully compiled) code with clear boundaries between the various program parts, most notably code and data, each not exceeding 65536 bytes in size.
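The real-mode arithmetic is simple enough to demonstrate in a few lines: the physical address is segment × 16 + offset, so different segment:offset pairs can name the same byte.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t segment = 0x1234;
    uint16_t offset  = 0x5678;

    /* Real-mode x86: each 16-bit segment register selects a 64 KiB
       window inside the ~1 MB address space. */
    uint32_t physical = ((uint32_t)segment << 4) + offset;

    printf("%04X:%04X -> physical 0x%05X\n", segment, offset, physical);
    /* Prints 1234:5678 -> physical 0x179B8. Note the overlap:
       0x1779:0x0228 names the same physical byte. */
    return 0;
}
```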

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow