What is the minimum X86 assembly needed for a spinlock

Question 1

Shortest would probably be:

acquire:
    lock bts [eax],0
    jc acquire

release:
    mov [eax],0

For performance, it's best to use a "test, test and set" approach, and use pause, like this:

acquire:
    lock bts [eax],0    ;Optimistic first attempt
    jnc l2              ;Success if acquired
l1:
    pause
    test [eax],1        
    jne l1              ;Don't attempt again unless there's a chance

    lock bts [eax],0    ;Attempt to acquire
    jc l1               ;Wait again if failed

l2:

release:
    mov [eax],0

For debugging, you can add extra data to make it easier to detect problems, like this:

acquire:
    lock bts [eax],31         ;Optimistic first attempt
    jnc l2                    ;Success if acquired

    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]
    cmp [eax],ebx             ;Is the lock acquired by this CPU?
    je .bad                   ; yes, deadlock
    lock inc dword [eax+4]    ;Increase "lock contention counter"
l1:
    pause
    test [eax],0x80000000        
    jne l1                    ;Don't attempt again unless there's a chance

    lock bts [eax],31         ;Attempt to acquire
    jc l1                     ;Wait again if failed

l2: mov [eax],ebx             ;Store CPU number

release:
    mov ebx,[CPUnumber]
    lea ebx,[ebx+0x80000000]
    cmp [eax],ebx             ;Is lock acquired, and is CPU same?
    jne .bad                  ; no, either not acquired or wrong CPU
    mov [eax],0

Question 2

Your code is fine, but if you're looking for high performance I'd suggest this instead:

  xor ecx, ecx
.loop:
  lock xchg [eax], ecx
  test ecx, ecx
  jz .loop

Reasons:

xor ecx, ecx is smaller and doesn't require a literal, and modern CPUs have this hardwired to fast register zero.
test ecx, ecx can be marginally smaller and faster than cmp ecx, 0, because you don't need to load a literal and test is just a bitwise AND operation rather than a subtraction.

P.S. I always put the lock prefix in there regardless of whether it is implied, for readability reasons - it makes it obvious that I'm doing a locked operation.

Question 3

Your code is fine and you can always try to make it shorter if you have space problems.

Other answers mention performance and that displays a basic ignorance of how locks work.

When an attempt to lock is initiated the core in question raises a signal on one of its pins (LOCK) which tells all other cores, their caches, all memory and all bus-mastering devices (because they can update RAM independently of the cores) to complete any outstanding memory operations. When they have done this they collectively raise another signal - lock acknowledge (LOCKA) - which is returned to the original core and the memory exchange takes place. After this the LOCK signal is switched off.

Once you have arrived here you are able to look at the value you fetched using xchg. If it turns out that another task/thread owns the lock you will need to perform the lock sequence all over again.

Assume the slowest bus-mastering device anywhere on your computer is a 33MHz PCI card. If it is doing something it will need any number of PCI bus clock cycles to complete. Every cycle there means one hundred wait cycles on a 3.3GHz CPU. Put this in the perspective of saving a cycle or two in the lock sequence. There are several buses in a CPU, the chipset and the motherboard and some, all or none of them may be active at any one time - such as when the LOCK is initiated. The active bus that takes longest to reply with LOCKA will determine how fast the lock completes.

Try it yourself: measure how long it takes to do ten million spinlocks (grab and release).

I wrote some more on spinlocks here, here and here.

The performance trick with bus-locks (spinlocks, critical sections in Windows) is to use them as seldom as possible which means organizing your data to make this possible. A bus-lock will likely complete faster on a non-server computer. This is because the bus-mastering devices on a server are operating more or less constantly. So if your application is server-based economizing on bus-locks can be critical to maintain performance.

EDIT

To Peter Cordes,

You state that

... it's not related to bus-mastering, at least not on CPUs since at least Nehalem

From the latest intel System Programming Guide:

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

In the second paragraph it says

... the processor may not assert the LOCK# signal on the bus.

Now, I don't know about you but to me at least "may not" sounds suspiciously unlike "will not" or "won't".

The numbers I have produced may or may not be correct - even I make mistakes now and then - but I challenge you to stop quoting this or that "authority" and instead get your hands dirty by doing the work to either disprove my numbers, find the error(s) or discrepancies. I included the relevant source code in another thread (where I also griped about your lofty, theoretical, armchair-based comments) so it won't take you forever to get started.

Start, for example, by proving that I am

over-stating the cost of locking in the no-contention case ...

I look forward to your analysis.