LockBits performance-critical code

https://stackoverflow.com/questions/740555

09-09-2019
Question

I have a method that needs to be as fast as possible. It uses unsafe memory pointers, and this is my first foray into this type of coding, so I know it can probably be made faster.

    /// <summary>
    /// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata
    /// </summary>
    /// <param name="sourcebtmpdata">The source bitmap must be smaller than the destination bitmap</param>
    /// <param name="destbtmpdata"></param>
    /// <param name="point">The point on the destination bitmap to draw at</param>
    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        // calculate total number of rows to draw.
        var totalRow = Math.Min(
            destbtmpdata.Height - point.Y,
            sourcebtmpdata.Height);


        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRow; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
            //get the pointer to the start of the current pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            int pointX = point.X;
            //number of pixels to copy in this row, clipped to the destination width
            int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);
            //for each pixel in the row, copy its three bytes
            for (int j = 0; j < rowSize; j++)
            {
                int firstBlueByte = ((pointX + j)*3);

                int srcByte = j *3;
                destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
                destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
                destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
            }


        }
    }

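Is there anything that can be done to make this faster? Ignore the TODO for now; I'll fix that later once I have some baseline performance measurements.

UPDATE: Sorry, I should have mentioned that the reason I'm using this instead of Graphics.DrawImage is that I'm implementing multi-threading, and because of that I can't use DrawImage.

UPDATE 2: I am still not satisfied with the performance, and I'm sure there are a few more ms to be had.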

Solution

There was something fundamentally wrong with the code that I can't believe I didn't notice until now.

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);

This gets a pointer to the destination row, but not to the column being copied to; in the old code that offset is computed inside the rowSize loop. It now looks like this:

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;

So now we have the correct pointer for the destination data, and we can get rid of that for loop. Using suggestions from Vilx- and Rob, the code now looks like this:

    private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        //calculate total number of rows to copy.
        //using ternary operator instead of Math.Min, few ms faster
        int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;
        //calculate the width of the image to draw, this cuts off the image
        //if it goes past the width of the destination image
        int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;

        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRows; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" and column on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;

            //get the pointer to the start of the current pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            //RtlMoveMemory function
            CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);

        }
    }
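
The listing above calls CopyMemory without showing how it is declared. A minimal sketch of one possible declaration follows, assuming the usual P/Invoke binding to RtlMoveMemory in kernel32.dll (Windows only). The class name RowCopy and the helper CopyRow are hypothetical and not part of the original answer. Buffer.MemoryCopy (.NET Framework 4.6 and later) is shown as a managed alternative that needs no P/Invoke, and the demo runs on plain byte arrays so that no bitmap is required. It has to be compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

static class RowCopy
{
    // Assumed declaration: the post does not show how CopyMemory is bound.
    // RtlMoveMemory is exported by kernel32.dll, so this call is Windows only.
    // The uint count matches the (uint) cast in the listing above; the native
    // parameter is SIZE_T, for which UIntPtr would be the stricter mapping.
    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
    public static extern void CopyMemory(IntPtr dest, IntPtr src, uint count);

    // Hypothetical helper: copies one row with Buffer.MemoryCopy instead.
    public static unsafe void CopyRow(byte* destRowPtr, byte* srcRowPtr, int rowBytes)
    {
        // Arguments: source, destination, space available at the destination, bytes to copy.
        Buffer.MemoryCopy(srcRowPtr, destRowPtr, rowBytes, rowBytes);
    }

    public static unsafe void Main()
    {
        byte[] src = { 1, 2, 3, 4, 5, 6 };   // one row of two 24bpp pixels
        byte[] dest = new byte[12];          // one row of four 24bpp pixels
        fixed (byte* s = src, d = dest)
        {
            CopyRow(d + 1 * 3, s, src.Length); // paste at pixel column 1
        }
        Console.WriteLine(string.Join(",", dest)); // prints 0,0,0,1,2,3,4,5,6,0,0,0
    }
}
```

Either call moves a whole row at once, which is the point of the change above; which of the two is faster is something to measure rather than assume.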

Copying a 500x500 image onto a 5000x5000 image in a grid 50 times took 00:00:07.9948993 secs. With the changes above it now takes 00:00:01.8714263 secs. Much better.

Other tips

Well... I'm not sure whether the .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...

But one of the first few Win32 APIs I learned was BitBlt:

BOOL BitBlt(
  HDC hdcDest, 
  int nXDest, 
  int nYDest, 
  int nWidth, 
  int nHeight, 
  HDC hdcSrc, 
  int nXSrc, 
  int nYSrc, 
  DWORD dwRop
);

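And it was the fastest way to copy data around, if I remember correctly.

Here is the BitBlt PInvoke signature for use in C#, along with related usage information. It is a great read for anyone working with high-performance graphics in C#:

http://www.pinvoke.net/default.aspx/gdi32/bitblt.html

Definitely worth a look.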

The inner loop is where you want to concentrate most of your effort (but take measurements to make sure):

for  (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}

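Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with a pointer that you increment.
Ditto for the +1 and +2: increment a pointer instead.
Your compiler probably won't keep recomputing point.X (check), but make a local variable just in case. It won't recompute it within a single iteration, but it might on every iteration.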
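
The three suggestions above can be sketched as follows. This is only an illustration under stated assumptions: the class PointerLoop and the helper CopyPixels are hypothetical names, the pixel format is assumed to be 24bpp, and plain byte arrays stand in for the locked bitmap rows so that the sketch runs without a bitmap. It has to be compiled with unsafe code enabled.

```csharp
using System;

static class PointerLoop
{
    // Hypothetical helper: copies `pixels` 24bpp pixels from src to dest
    // using incrementing pointers instead of computed indexes.
    public static unsafe void CopyPixels(byte* dest, byte* src, int pixels)
    {
        byte* end = src + pixels * 3;   // computed once, outside the loop
        while (src < end)
        {
            *dest++ = *src++;  // blue
            *dest++ = *src++;  // green
            *dest++ = *src++;  // red
        }
    }

    public static unsafe void Main()
    {
        byte[] srcRow = { 10, 20, 30, 40, 50, 60 };  // two pixels
        byte[] destRow = new byte[15];               // five pixels
        int pointX = 2;                              // local copy of point.X
        fixed (byte* s = srcRow, d = destRow)
        {
            // The column offset is applied once, before the loop.
            CopyPixels(d + pointX * 3, s, srcRow.Length / 3);
        }
        Console.WriteLine(string.Join(",", destRow));
        // prints 0,0,0,0,0,0,10,20,30,40,50,60,0,0,0
    }
}
```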

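You may want to look at Eigen.

It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Fast. (See benchmark.)
Expression templates allow it to intelligently remove temporaries and enable lazy evaluation where appropriate; Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.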

You could implement your function in C++ and then call it from C#.

You don't always need pointers to get good speed. This should be within a couple of ms of the original:

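    private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        // copy both locked buffers into managed arrays
        // (assumes 24bpp and a positive Stride; Stride includes any row padding)
        byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Stride];
        byte[] dest = new byte[destbtmpdata.Height * destbtmpdata.Stride];
        Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
        Marshal.Copy(destbtmpdata.Scan0, dest, 0, dest.Length);
        int pointX = point.X * 3;
        // clip the copy to the destination width and height
        int copyLength = Math.Min(destbtmpdata.Width * 3 - pointX, sourcebtmpdata.Width * 3);
        int totalRows = Math.Min(destbtmpdata.Height - point.Y, sourcebtmpdata.Height);
        int srcIndex = 0;
        int destIndex = pointX + point.Y * destbtmpdata.Stride;
        for (int i = 0; i < totalRows; i++)
        {
            Array.Copy(src, srcIndex, dest, destIndex, copyLength);
            srcIndex += sourcebtmpdata.Stride;
            destIndex += destbtmpdata.Stride;
        }
        Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
    }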

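Unfortunately I don't have time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move whole rows rather than going byte by byte. That should be a lot faster.

I think the stride size and the row count limit can be calculated in advance.

And I precalculated all the multiplications, resulting in the following code: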

private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;

    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);

    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);

    // starting point of copy operation
    // (point.X is in pixels, so the column offset is point.X * pixelSize bytes)
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;

    // loop through each row
    for (int i = 0; i < totalRow; i++) {

        // draw the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[j] = srcPtr[j];

        // advance each pointer by 1 row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }

}

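I haven't tested it thoroughly, but you should be able to get it to work.

I have removed the multiplications from the loop (they are precalculated instead) and removed most of the branching, so it should be somewhat faster.

Let me know if this helps :-)

I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.

If you must rely on the CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible bytewise access. This means C++ and embedded assembly language. In pure C you'll be 30% slower on most machines.

Keep in mind that modern GPUs are much faster than the CPU at this sort of operation.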

I am not sure whether this will give extra performance, but I see the pattern a lot in Reflector.

So:

int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];

Becomes:

*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;

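It probably needs more braces.

If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)

Update

You could also try using a bigger type, e.g. Int32 or Int64, for better performance.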
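
As a rough sketch of the bigger-type idea, the loop below moves eight bytes per iteration through ulong pointers and finishes the remainder one byte at a time. The names WideCopy and CopyBytes are hypothetical, and the sketch assumes a platform that tolerates unaligned 64-bit loads and stores (x86 and x64 do), since a row that starts at an arbitrary pixel column is generally not 8-byte aligned. Whether it actually beats a block copy such as RtlMoveMemory would need measuring. It has to be compiled with unsafe code enabled.

```csharp
using System;

static class WideCopy
{
    // Hypothetical helper: copies `count` bytes, eight at a time where possible.
    public static unsafe void CopyBytes(byte* dest, byte* src, int count)
    {
        int blocks = count / sizeof(ulong);
        ulong* d64 = (ulong*)dest;
        ulong* s64 = (ulong*)src;
        for (int i = 0; i < blocks; i++)
            *d64++ = *s64++;            // 8 bytes per iteration

        // Copy the 0..7 bytes that do not fill a whole ulong.
        byte* d8 = (byte*)d64;
        byte* s8 = (byte*)s64;
        for (int i = blocks * sizeof(ulong); i < count; i++)
            *d8++ = *s8++;
    }

    public static unsafe void Main()
    {
        byte[] src = new byte[21];      // seven 24bpp pixels: 2 blocks + 5 tail bytes
        for (int i = 0; i < src.Length; i++) src[i] = (byte)(i + 1);
        byte[] dest = new byte[21];
        fixed (byte* s = src, d = dest)
        {
            CopyBytes(d, s, src.Length);
        }
        Console.WriteLine(string.Join(",", dest));
    }
}
```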

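Alright, this is getting fairly close to the limit of how many ms you can squeeze out of the algorithm, but get rid of the call to Math.Min and replace it with a ternary operator instead.

Generally, making a library call takes longer than doing the same thing yourself, and I wrote a simple test driver to confirm this for Math.Min.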

using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Report whether the high resolution timer is available
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }
            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            // Test ternary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}

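These are the results of running the above on my computer, an Intel T2400 @1.83GHz. Note that there is some variation in the results, but in general the ternary operator is faster by about 0.01 ms. That's not much, but over a big enough data set it adds up.

Using high resolution timer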
0.0539
0.0402
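
A difference of about 0.01 ms over 10000 iterations is small enough to be affected by JIT compilation of the first call and by the compiler discarding a result that is never used. The following is a hedged sketch of a slightly more defensive variant of the test driver, not a reproduction of the measurement above: the class name, method names, and iteration count are assumptions, and no timings are claimed here. It warms up both paths first and accumulates the results so the work cannot be optimized away.

```csharp
using System;
using System.Diagnostics;

static class MinBenchmark
{
    const int Iterations = 50000000; // assumed count, large enough to dwarf timer noise

    public static long SumWithMathMin(int n)
    {
        long sum = 0;
        for (int ndx = 0; ndx < n; ndx++)
            sum += Math.Min(ndx, 5000);
        return sum;
    }

    public static long SumWithTernary(int n)
    {
        long sum = 0;
        for (int ndx = 0; ndx < n; ndx++)
            sum += (ndx < 5000) ? ndx : 5000;
        return sum;
    }

    public static void Main()
    {
        // Warm-up calls so the first JIT compilation is not inside the timed region.
        SumWithMathMin(1000);
        SumWithTernary(1000);

        Stopwatch sw = Stopwatch.StartNew();
        long a = SumWithMathMin(Iterations);
        sw.Stop();
        Console.WriteLine("Math.Min: {0:0.0000} ms (sum {1})", sw.Elapsed.TotalMilliseconds, a);

        sw = Stopwatch.StartNew();
        long b = SumWithTernary(Iterations);
        sw.Stop();
        Console.WriteLine("Ternary:  {0:0.0000} ms (sum {1})", sw.Elapsed.TotalMilliseconds, b);
    }
}
```

Printing the sums forces both loops to do real work and also confirms that the two variants compute the same values.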

License: CC-BY-SA with attribution
