What prevents a race condition when checking a semaphore value?

Question 1

Locking and relasing semaphores and/or mutexes happen as atomic operations, this means the CPU cannot be withdrawn from the current process. This ensures, that as soon as a mutex-lock is started (it consists of either a single or a few CPU-instruction (microcode)), the process keeps the CPU until the locking/releasing is done.
There are also different ways to implement threading, which can either be a direct support by CPU (kernel-space) or through a library (such as pthreads) in user-space.

From OSDev.org

An atomic operation is an operation that will always be executed without any other process being able to read or change state that is read or changed during the operation. It is effectively executed as a single step, and is an important quality in a number of algorithms that deal with multiple indepent processes, both in synchronization and algorithms that update shared data without requiring synchronization.

Here is a nice article on atomicity, too (although in Delphi).

Question 2

The most common (although definitely not the only) way to implement most locking primitives are compare-and-set instructions. An normal move instruction would just set the value of a memory location to whatever value you ask it to while a compare-and-set instruction does "atomically set this memory location to value X only if the value of the memory location is Y, then set some flag if the operation succeeded or not". The keyword "atomic" is that the CPU can in hardware make sure that nothing else can interfere with that operation.

Using a compare-and-swap instruction your example P could be implemented as:

int oldlock;
retry:
oldlock = lock;
if (oldlock > 0) {
    if (compare_and_swap(&lock, oldlock, oldlock - 1))
        goto retry;
    return 0;
}
return 1;

Of course reality is much more complex than that, but compare-and-set is easy to understand and explain and has the nice property that it can implement (almost?) all other locking primitives.

Here's a wikipedia article.

Question 3

The difference between a semaphore (or a mutex) and a "normal" variable isn't that big. Those libraries which offer you this functionality just make sure that the semaphore is only accessed through atomic operations. There are multiple ways to achieve that:

Special assembly instructions which guarantee atomic access, e.g.: TSL or XCHG.
Turning off the scheduler's interrupts before the variable gets accessed and afterwards turning them back on again. So the scheduler can't remove your process from the CPU. But you have to be aware that this only works on single CPU systems.
Using language specific features like Java's synchronise keyword.

An example of how to use the TSL instruction:

enter_region:        ; A "jump to" tag; function entry point.

  tsl reg, flag      ; Test and Set Lock; flag is the
                     ; shared variable; it is copied
                     ; into the register reg and flag
                     ; then atomically set to 1.

  cmp reg, #0        ; Was flag zero on entry_region?

  jnz enter_region   ; Jump to enter_region if
                     ; reg is non-zero; i.e.,
                     ; flag was non-zero on entry.

  ret                ; Exit; i.e., flag was zero on
                     ; entry. If we get here, tsl
                     ; will have set it non-zero; thus,
                     ; we have claimed the resource
                     ; associated with flag.

leave_region:
  move flag, #0      ; store 0 in flag
  ret                ; return to caller

By the way, as you already and correctly pointed out, a mutex is just a special kind of a semaphore, only allowing FALSE (represented by 0 in C) and TRUE (represented by 1 or any other value != 0) for it's internal int variable. Thus making it a so called binary semaphore.