英特尔x86汇编优化技术中的示例问题

https://stackoverflow.com/questions/1414911

06-07-2019
|

题

我正在学习汇编程序很长一段时间，我正在尝试重写一些简单的过程\函数来查看性能优势（如果有的话）。我的主要开发工具是Delphi 2007，第一个例子将使用该语言，但它们也可以很容易地翻译成其他语言。

问题表示为：

我们给出了一个无符号字节值，其中八位中的每一位代表屏幕一行中的一个像素。每个单个像素可以是实心（1）或透明（0）。换句话说，我们在一个字节值中包含8个像素。我想将这些像素解压缩成一个8字节的数组，就像最年轻的像素（位）将落在数组的最低索引之下一样，依此类推。这是一个例子：

One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7

下面我介绍解决问题的五种方法。接下来，我将展示他们的时间比较以及我如何衡量这些时间。

我的问题由两部分组成：

1

我要求您提供有关方法 DecodePixels4a 和 DecodePixels4b 的详细答案。为什么方法 4b 比 4a 慢一些？

例如，如果它的速度较慢，因为我的代码没有正确对齐，那么告诉我给定方法中的哪些指令可以更好地对齐，以及如何做到这一点不破坏方法。

我希望看到这个理论背后的真实例子。请记住，我正在学习汇编，我想从你的答案中获取知识，这使我将来能够编写更好的优化代码。

2

你能编写比 DecodePixels4a 更快的例行程序吗？如果是，请提供并描述您已采取的优化步骤。通过更快的例程我的意思是在测试环境中在最短的时间段内运行的所有例程中的例程。

允许使用所有英特尔系列处理器以及与之兼容的处理器。

下面你会找到我写的例程：

procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.
;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;

以下是我如何测试它们：

Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.

以下是我的机器（Win32 XP上的英特尔®奔腾®E2180）的结果：

Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM

结果非常稳定 - 每次测试之间的时间差别只有几个百分点。这总是如此： Time1＆gt;时间3＆gt;时间2> Time4b＆gt; Time4a

所以我认为Time4a和Time4b之间的区别取决于方法 DecodePixels4b 中的指令切换。有时它是4％有时它高达10％但 4b 总是慢于 4a 。

我在考虑使用MMX指令同时将内存写入8个字节的另一种方法，但我无法想出将字节解压缩到64位寄存器的快速方法。

感谢您的时间。

谢谢你们的宝贵意见。 Whish我可以同时回答所有人，不幸的是，与现代CPU相比，我只有一个“管道”。并且只能执行一个指令“回复”。当时 ;-）所以，我会在这里总结一些事情并在你的答案下写下额外的评论。

首先，我想说在发布我的问题之前，我想出了Wouter van Nifterick提出的解决方案，实际上慢了然后我的汇编代码。所以我决定不在这里发布那个例程，但你可能会看到我在循环Delphi版本的例程中也采用了相同的方法。它在那里被评论，因为它给了我更糟糕的结果。

这对我来说是一个谜。我已经用Wouter和PhilS的例程再次运行我的代码，结果如下：

Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive

看看Time5的结果，很奇怪不是吗？我想我有不同的Delphi版本，因为我生成的汇编代码与Wouter提供的汇编代码不同。

<强>山高

解决方案

你的asm代码相对较慢，因为使用堆栈结束写入内存8次。检查一下......

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;

可能比使用查找表的代码更快！

改进版本：

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;

第3版：

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;

第4版：

const Uint32DecPix : array [0..15] of cardinal = (
  <*>, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and <*>F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;

其他提示

一般情况下，我个人不会试图通过在汇编级别上使用技巧来优化代码，除非你真的需要额外2％或3％的速度，并且你愿意支付难以阅读，维护和移植的代码的价格。

要挤出最后1％，你甚至可能需要维护每个处理器优化的几个版本，如果出现更新的处理器和改进的pascal编译器，你就不会从中受益。

这个Delphi代码比最快的汇编代码更快：

procedure DecodePixels5(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := (EncPixels shr 0) and $01;
  DecPixels[1] := (EncPixels shr 1) and $01;
  DecPixels[2] := (EncPixels shr 2) and $01;
  DecPixels[3] := (EncPixels shr 3) and $01;
  DecPixels[4] := (EncPixels shr 4) and $01;
  DecPixels[5] := (EncPixels shr 5) and $01;
  DecPixels[6] := (EncPixels shr 6) and $01;
  DecPixels[7] := (EncPixels shr 7) and $01;
end;


Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.

速度很快，因为操作只能通过寄存器完成，而不需要存储和获取内存。现代处理器部分并行地执行此操作（可以在前一个完成之前开始新操作），因为连续指令的结果彼此独立。

机器代码如下所示：

  push ebx;
  // DecPixels[0] := (EncPixels shr 0) and 1;
  movzx ecx,al
  mov ebx,ecx
  //  shr ebx,var
  PixelLookup:Array[byte] of TDecodedPixels;

// You could precalculate, but the performance gain would hardly be worth it because you call this once only.
for I := 0 to 255 do
  DecodePixels5b(I, PixelLookup[I]);


procedure DecodePixels7(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := PixelLookup[EncPixels];
end;

Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.
Time7 : 0,251533475182096 ms.    <- simple table lookup

  and bl,$01
  mov [edx],bl
  // DecPixels[1] := (EncPixels shr 1) and 1;
  mov ebx,ecx
  shr ebx,1
  and bl,$01
  mov [edx+$01],bl
  // DecPixels[2] := (EncPixels shr 2) and 1;
  mov ebx,ecx
  shr ebx,$02
  and bl,$01
  mov [edx+$02],bl
  // DecPixels[3] := (EncPixels shr 3) and 1;
  mov ebx,ecx
  shr ebx,$03
  and bl,$01
  mov [edx+$03],bl
  // DecPixels[4] := (EncPixels shr 4) and 1;
  mov ebx,ecx
  shr ebx,$04
  and bl,$01
  mov [edx+$04],bl
  // DecPixels[5] := (EncPixels shr 5) and 1;
  mov ebx,ecx
  shr ebx,$05
  and bl,$01
  mov [edx+$05],bl
  // DecPixels[6] := (EncPixels shr 6) and 1;
  mov ebx,ecx
  shr ebx,$06
  and bl,$01
  mov [edx+$06],bl
  // DecPixels[7] := (EncPixels shr 7) and 1;
  shr ecx,$07
  and cl,$01
  mov [edx+$07],cl
  pop ebx;

编辑：根据建议，表查找确实更快。

<*>

Expanding on Nick D's answer, I tried the following table-lookup based versions, all of which are faster than the implementations you give (and faster than Wouter van Nifterick's code).

Given the following packed array:


      const Uint64DecPix : PACKED ARRAY [0..255] OF UINT64 =
  ( $0000000000000000, $0000000000000001, $0000000000000100, $0000000000000101, $0000000000010000, $0000000000010001, $0000000000010100, $0000000000010101, $0000000001000000, $0000000001000001, $0000000001000100, $0000000001000101, $0000000001010000, $0000000001010001, $0000000001010100, $0000000001010101,
    $0000000100000000, $0000000100000001, $0000000100000100, $0000000100000101, $0000000100010000, $0000000100010001, $0000000100010100, $0000000100010101, $0000000101000000, $0000000101000001, $0000000101000100, $0000000101000101, $0000000101010000, $0000000101010001, $0000000101010100, $0000000101010101,
    $0000010000000000, $0000010000000001, $0000010000000100, $0000010000000101, $0000010000010000, $0000010000010001, $0000010000010100, $0000010000010101, $0000010001000000, $0000010001000001, $0000010001000100, $0000010001000101, $0000010001010000, $0000010001010001, $0000010001010100, $0000010001010101,
    $0000010100000000, $0000010100000001, $0000010100000100, $0000010100000101, $0000010100010000, $0000010100010001, $0000010100010100, $0000010100010101, $0000010101000000, $0000010101000001, $0000010101000100, $0000010101000101, $0000010101010000, $0000010101010001, $0000010101010100, $0000010101010101,
    $0001000000000000, $0001000000000001, $0001000000000100, $0001000000000101, $0001000000010000, $0001000000010001, $0001000000010100, $0001000000010101, $0001000001000000, $0001000001000001, $0001000001000100, $0001000001000101, $0001000001010000, $0001000001010001, $0001000001010100, $0001000001010101,
    $0001000100000000, $0001000100000001, $0001000100000100, $0001000100000101, $0001000100010000, $0001000100010001, $0001000100010100, $0001000100010101, $0001000101000000, $0001000101000001, $0001000101000100, $0001000101000101, $0001000101010000, $0001000101010001, $0001000101010100, $0001000101010101,
    $0001010000000000, $0001010000000001, $0001010000000100, $0001010000000101, $0001010000010000, $0001010000010001, $0001010000010100, $0001010000010101, $0001010001000000, $0001010001000001, $0001010001000100, $0001010001000101, $0001010001010000, $0001010001010001, $0001010001010100, $0001010001010101,
    $0001010100000000, $0001010100000001, $0001010100000100, $0001010100000101, $0001010100010000, $0001010100010001, $0001010100010100, $0001010100010101, $0001010101000000, $0001010101000001, $0001010101000100, $0001010101000101, $0001010101010000, $0001010101010001, $0001010101010100, $0001010101010101,
    $0100000000000000, $0100000000000001, $0100000000000100, $0100000000000101, $0100000000010000, $0100000000010001, $0100000000010100, $0100000000010101, $0100000001000000, $0100000001000001, $0100000001000100, $0100000001000101, $0100000001010000, $0100000001010001, $0100000001010100, $0100000001010101,
    $0100000100000000, $0100000100000001, $0100000100000100, $0100000100000101, $0100000100010000, $0100000100010001, $0100000100010100, $0100000100010101, $0100000101000000, $0100000101000001, $0100000101000100, $0100000101000101, $0100000101010000, $0100000101010001, $0100000101010100, $0100000101010101,
    $0100010000000000, $0100010000000001, $0100010000000100, $0100010000000101, $0100010000010000, $0100010000010001, $0100010000010100, $0100010000010101, $0100010001000000, $0100010001000001, $0100010001000100, $0100010001000101, $0100010001010000, $0100010001010001, $0100010001010100, $0100010001010101,
    $0100010100000000, $0100010100000001, $0100010100000100, $0100010100000101, $0100010100010000, $0100010100010001, $0100010100010100, $0100010100010101, $0100010101000000, $0100010101000001, $0100010101000100, $0100010101000101, $0100010101010000, $0100010101010001, $0100010101010100, $0100010101010101,
    $0101000000000000, $0101000000000001, $0101000000000100, $0101000000000101, $0101000000010000, $0101000000010001, $0101000000010100, $0101000000010101, $0101000001000000, $0101000001000001, $0101000001000100, $0101000001000101, $0101000001010000, $0101000001010001, $0101000001010100, $0101000001010101,
    $0101000100000000, $0101000100000001, $0101000100000100, $0101000100000101, $0101000100010000, $0101000100010001, $0101000100010100, $0101000100010101, $0101000101000000, $0101000101000001, $0101000101000100, $0101000101000101, $0101000101010000, $0101000101010001, $0101000101010100, $0101000101010101,
    $0101010000000000, $0101010000000001, $0101010000000100, $0101010000000101, $0101010000010000, $0101010000010001, $0101010000010100, $0101010000010101, $0101010001000000, $0101010001000001, $0101010001000100, $0101010001000101, $0101010001010000, $0101010001010001, $0101010001010100, $0101010001010101,
    $0101010100000000, $0101010100000001, $0101010100000100, $0101010100000101, $0101010100010000, $0101010100010001, $0101010100010100, $0101010100010101, $0101010101000000, $0101010101000001, $0101010101000100, $0101010101000101, $0101010101010000, $0101010101010001, $0101010101010100, $0101010101010101);
PUint64DecPix : pointer = @Uint64DecPix;

you can write the following:


procedure DecodePixelsPS1Pas (EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;

procedure DecodePixelsPS1PasInline (EncPixels: Byte; var DecPixels: TDecodedPixels);
inline;
begin
  DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;

procedure DecodePixelsPS1Asm (EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  lea ecx, Uint64DecPix //[<-Added in EDIT 3] 
  //mov ecx, dword ptr PUint64DecPix - alternative to the above line (slower for me)
  movzx eax, al
  movq xmm0, [8*eax+ecx]  //Using XMM rather than MMX so we don't have to issue emms at the end
  movq [edx], xmm0        //use MOVQ because it doesn't need mem alignment
end;

The standard PAS and ASM implementations are fairly similar speed-wise, but the PAS implementation marked with "INLINE" is the fastest because it gets rid of all the call/ret involved in calling the routine.

--EDIT--: I forgot to say: since you are implicitly assuming something about the memory layout of your TDecodedPixels structure, it would be better if you declare it as


PACKED ARRAY [0..7] of byte

--EDIT2--: Here are my results for comparison:


Time1 : 2.51638266874701 ms.    <- Delphi loop.
Time2 : 2.11277620479698 ms.    <- Delphi unrolled loop.
Time3 : 2.21972066282167 ms.    <- BASM loop.
Time4a : 1.34093090043567 ms.    <- BASM unrolled loop.
Time4b : 1.52222070123437 ms.    <- BASM unrolled loop instruction switch.
Time5 : 1.17106364076999 ms.    <- Wouter van Nifterick
TimePS1 : 0.633099318488802 ms.    <- PS.Pas
TimePS2 : 0.551617593856202 ms.    <- PS.Pas Inline
TimePS3 : 0.70921094720139 ms.    <- PS.Asm (speed for version before 3rd EDIT)

Compilers do very good job at optimizing small routines.

I would optimize your code by using a lookup table.
Since you decode a single byte - 256 different states - you can precalculate 256 arrays with the unpacked values.

Edit: Note that Pentium processors can execute specific instructions in parallel (Superscalar architecture), it is called pairing.

Pure software solution

Using the beautiful technique from this question, which is again inspired by this question we'll have a great solution like this with only one line of code (excluding declarations)

type TPackedDecodedPixels = record
case integer of
  0: (a: TDecodedPixels);
  1: (v: Int64);
end;

procedure DecodePixels(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
const
  magic = $8040201008040201;
  mask  = $8080808080808080;
begin
  TPackedDecodedPixels(DecPixels).v := SwapEndian(((EncPixels*magic) and mask) shr 7);
end;

Of course you need to make sure that DecPixels is properly 8-byte aligned or you may suffer from some slow down (or even segfaults on other architectures). You can also easily vectorize the function to make it faster

Explanation:

Assume we have the following bit pattern as abcdefgh. We'll want the output array to contain

0000000a 0000000b 0000000c 0000000d 0000000e 0000000f 0000000g 0000000h (1)

Reading that in little endian as a 64-bit integer we'll get %0000000h0000000g0000000f0000000e0000000d0000000c0000000b0000000a. We have to find a magic number that shifts the original bits to the positions that we can extract the necessary bits

Let's multiply the value with the magic number

  |  b7  ||  b6  ||  b4  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                          abcdefgh (1-byte value)
x 1000000001000000001000000001000000001000000001000000001000000001
__________________________________________________________________
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh

At this point all the pixels' bits have been moved to the most significant bits of the corresponding bytes. As they already lied in the right place, we just need to strip out the remaining bits with and

  |  b7  ||  b6  ||  b4  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
__________________________________________________________________    
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000 (8-byte array)

Now the pixels' bits are in the most significant bits of the corresponding bytes, we need to do a logical right shift by 7 to move them to the least significant position. Because the OP wants the value in reversed order, we need SwapEndian() to convert the bytes to big endian. If you just want little endian you can stop at this step

So the magic number is %1000000001000000001000000001000000001000000001000000001000000001 = $8040201008040201 and the mask is %1000000010000000100000001000000010000000100000001000000010000000 = $8080808080808080. Of course in reality to solve the problem and get those values we need to do backwards from the final result → multiplied result → magic number

But why did I put the bytes in little endian at (1) and then have to convert back to big endian? Why don't just arrange the bytes in big endian order and find the magic number for that? In case you're wondering about that then it's because that way it'll only work for at most 7 bits at a time. I did that way in my old answer and have to split a bit off then combine it back later

                                                          0abcdefg
x 0000000000000010000001000000100000010000001000000100000010000001
__________________________________________________________________
= 00000000abcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefg
& 0000000000000001000000010000000100000001000000010000000100000001
__________________________________________________________________    
= 000000000000000a0000000b0000000c0000000d0000000e0000000f0000000g

Hardware support

This is actually a special case of bit expand with a constant mask. In AVX2 Intel introduced the pdep instruction in BMI2 instruction set for that purpose, so you just need a single instruction to get the result. In other languages you can use this with the intrinsic function _pext_u64. Unfortunately AFAIK Free Pascal doesn't support it and you have to use assembly directly. However the expression will look like

TPackedDecodedPixels(DecPixels).v := _pext_u64(EncPixels, $0101010101010101);

Correctness check:

I tried comparing the OP's version with both my versions and didn't find any problem until now. The compiler output is like this

mov al, dil
mov rbx, rsi
movzx edi, al
movabs rax, 0x8040201008040201
imul rdi, rax
movabs rax, 0x8080808080808080
and rdi, rax
shr rdi, 0x7
call 4016a0 <SYSTEM_$$_SWAPENDIAN$INT64$$INT64>
mov QWORD PTR [rbx], rax

The FPC output is still sub-optimal because the compiler doesn't know to replace the call to SwapEndian with BSWAP, and it copies data unnecessarily. Why mov al, dil; movzx edi, al instead of just movzx edi, dil?. As you can see, outputs from C and C++ compilers are a lot better

See Creating a byte out of 8 bool values (and vice versa)?

I was about to give the same algorithm as Wouter van Nifterick.

In addition, I would explain the better performance in terms of dependency chains. In each of the versions that you proposed, when you unrolled your basic loop, you kept a dependency between two successive iterations: each of your shr al, $01; requires the previous value of al to have been computed. If you organize your unrolled iterations such that they can be executed in parallel, they will actually be on a modern processor. Don't be fooled by false dependencies that can be suppressed by register renaming.

Someone pointed out that the Pentium can execute two instructions at once. That's true, but modern processors (since the Pentium Pro, PII,...,Core, Core 2) are executing much more than two instructions at the same time, when they have the chance -- that is, when there is no dependency between the instructions being executed. Note how in Wouter van Nifterick's version each line can be executed independently from the others.

http://www.agner.org/optimize/ has all the information you could ever need to understand the architecture of modern processors and how to take advantage of them.

if you only support 80386 and above you can use BTcc and SETcc set of instructions in this manner:

BT ax,1
SETC [dx]
inc dx

BT ax,2
SETC [dx]
inc dx

etc

How about something like:

/* input byte in eax, address to store result in edx */
and eax, 0xff    /* may not be needed */
mov ebx, eax
shl ebx, 7
or  eax, ebx
mov ebx, eax
shl ebx, 14
or  eax, ebx
mov ebx, eax
and eax, 0x01010101
mov [edx], eax
shr ebx, 4
and ebx, 0x01010101
mov [edx+4], ebx

The likely reason that 4b is faster than 4a is that it parallelizes better. From 4a:

mov bl, al;
and bl, $01;          // data dep (bl)
mov  [edx], bl;       // data dep (bl)
shr al, $01;
mov bl, al;           // data dep (al)
and bl, $01;          // data dep (bl)
mov [edx + $01], bl;  // data dep (bl)

Instructions marked "data dep" cannot begin executing until the previous instruction has finished, and I've written the registers that cause this data dependency. Modern CPUs are capable of starting an instruction before the last one has completed, if there is no dependency. But the way you've ordered these operations prevents this.

In 4b, you have fewer data dependencies:

mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx], bl;
mov bl, al;
and bl, $01;          // data dep (bl)
shr al, $01;
mov [edx + $01], bl;

With this instruction ordering, fewer of the instructions depend on the previous instruction, so there is more opportunity for parallelism.

I can't guarantee that this is the reason for the speed difference, but it is a likely candidate. Unfortunately it is hard to come across answers as absolute as the ones you are looking for; modern processors have branch predictors, multi-level caches, hardware pre-fetchers, and all sorts of other complexities that can make it difficult to isolate the reasons for performance differences. The best you can do is read a lot, perform experiments, and get familiar with the tools for taking good measurements.

I guess it's that writing to memory (actually, cache memory) is slower than working with registers.

So,

mov [edx+...], bl
shr al, $01;
mov bl, al;

gives the processor some time to write bl to memory before the bl register is needed again, while

shr al, $01;
mov [edx], bl;
mov bl, al;

needs bl immediately so the processor has to stop and wait for the memory write to complete.

This is surprising to me. Modern Intel processors do crazy pipelining and register renaming so in my opinion, if anything, DecodePixels4b should be faster, since the dependencies of each instruction are further back. The above is all the explanation I can offer, apart from this:

x86 is a terrible instruction set, and Intel does amazing and very advanced hocus-pocus to make it efficient. If I were you, I would look into something else. There's very little demand for megaMcOptimised software for PCs today. My friendly suggestion is to look into processors for mobile devices (mainly ARM), because in mobile devices, processor speed, power consumption and battery life concerns mean that micro-optimised software is more important. And ARM has a superior instruction set to x86.

SIMD

If you extend the algorithm to processing arrays, then SIMD becomes an optimisation option. Here's a SIMD version that's 1/3 the time of an optimised C equivalent:

int main ()
{
  const int
    size = 0x100000;

  unsigned char
    *source = new unsigned char [size],
    *dest,
    *dest1 = new unsigned char [size * 32],
    *dest2 = new unsigned char [size * 32];

  for (int i = 0 ; i < size ; ++i)
  {
    source [i] = rand () & 0xff;
  }

  LARGE_INTEGER
    start,
    middle,
    end;

  QueryPerformanceCounter (&start);
  dest = dest1;
  for (int i = 0 ; i < size ; ++i)
  {
    unsigned char
      v = source [i];

    for (int b = 0 ; b < 8 ; ++b)
    {
      *(dest++) = (v >> b) & 1;
    }
  }
  unsigned char
    bits [] = {1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128},
    zero [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
    ones [] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};

  QueryPerformanceCounter (&middle);
  __asm
  {
    movdqu xmm1,bits
    movdqu xmm2,zero
    movdqu xmm3,ones
    mov ecx,0x100000/4
    mov esi,source
    mov edi,dest2
l1:
    lodsd
    movd xmm0,eax
    movd xmm4,eax
    punpcklbw xmm0,xmm0
    punpcklbw xmm4,xmm4
    punpcklwd xmm0,xmm0
    punpcklwd xmm4,xmm4
    punpckldq xmm0,xmm0
    punpckhdq xmm4,xmm4
    pand xmm0,xmm1
    pand xmm4,xmm1
    pcmpeqb xmm0,xmm2
    pcmpeqb xmm4,xmm2
    paddb xmm0,xmm3
    paddb xmm4,xmm3
    movdqu [edi],xmm0
    movdqu [edi+16],xmm4
    add edi,32
    dec ecx
    jnz l1
  }
  QueryPerformanceCounter (&end);

  cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
  cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
  cout << "memcmp = " << memcmp (dest1, dest2, size * 32) << endl;

  return 0;
}

Incredible smart solution Chris, what would you do with the inverse problem: make a byte from an array of 8 bytes?

Non optimized solution for the inverse problem:

BtBld PROC Array:DWORD, Pixels:DWORD
  mov  eax, [Array]
  add  eax, 7
  mov  edx, [Pixels]

  mov  bx, 0

  mov  ecx, 8
rpt:  or  bx, [eax]
  dec  eax
  shl  bx, 1
  loop rpt
  shr  bx, 1
  mov  [edx], bl
  ret
BtBld ENDP

As you notice, the difference of speed in 4a and 4b implementation is because of CPU optimization (by execute multiple instructions in parallel / pipelining instruction). But the factor is not in the operands, but because of the nature of operator itself.

4a Instruction Sequence:
AND - MOV - SHR

4b Instruction Sequence:
AND - SHR - MOV

Both AND and SHR use Flags register, so these two instructions has wait state in their pipeline.

Read them as follow:

4a: AND (piped) MOV (piped) SHR
4b: AND (WAIT) SHR (piped) MOV

Conclusion: 4b has 7 more wait-state in it's pipeline than 4a, thus it's slower.

Josh mentioned that there's data dependencies, i.e.:

mov bl, al;
and bl, $01;          // data dep (bl)

but it's not entirely true since those two instruction can partially be executed in paralel in CPU level:

mov bl, al -> (A:) read al (B:) write bl  => (2 clocks in i386)
and bl, 01 -> (C:) read 01 (D:) write bl  => idem

Sequentially they take 4 clocks, but pipelined they take only 3 "clocks" (actually the term "clock" is not adequate in pipeline perspective but I used it in context of simplicity)

[--A--][--B--]
 [--C--]<wait>[---D--]

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow