Blitting?

pcmattman · **Posted:** Fri May 11, 2007 6:07 pm

Does anyone know of a really fast way to blit stuff to the screen?

I've given up on virtual 8086 mode, and have since found a way to get into VESA modes from PMode. (I'll come back to virtual mode later).

Now I've got a problem: I can put stuff on the screen but it's slow to do such operations as 'clear screen'... Obviously, this is not the way to do it:

Code:

for( int x = 0; x < vga.GetXRes(); x++ )
{
   for( int y = 0; y < vga.GetYRes(); y++ )
      vga.SetPixel( x, y, COLOR_WHITE );
}

But I don't know any faster way.

Alboin · **Posted:** Fri May 11, 2007 6:45 pm

Maybe this?

pcmattman · **Posted:** Fri May 11, 2007 8:36 pm

Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.

Brynet-Inc · **Posted:** Fri May 11, 2007 9:05 pm

pcmattman wrote:

Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.

I think that you're supposed to check what modes "each" card supports..

Alboin · **Posted:** Fri May 11, 2007 9:32 pm

Brynet-Inc wrote:

pcmattman wrote:

Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.

I think that you're supposed to check what modes "each" card supports..

Maybe with something like this in your boot loader?

Tyler · **Posted:** Sat May 12, 2007 4:26 am

Is there no fast generic bit block method?

For example in the Windows Graphics Driver Model (I have yet to look at X11's)... it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt) and the operation is very fast. Given that it works on any system that at least has a standard frame buffer one would assume they do not use VESA but a generic operation.

What is it most likely they do, copy pixel by pixel or use a processor block transfer? (is this even possible over Memory Mapped I/O?)

Brendan · **Posted:** Sat May 12, 2007 11:02 am

Hi,

Tyler wrote:

For example in the Windows Graphics Driver Model (I have yet to look at X11's)... it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt) and the operation is very fast. Given that it works on any system that at least has a standard frame buffer one would assume they do not use VESA but a generic operation.

Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.

Pcmattman's problem is that he's recalculating an offset into display memory for every single pixel (307200 calculations for 640*480 mode), writing one pixel at a time, and drawing vertical lines (which sucks for caching) instead of horizontal lines. Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.

For example, something like:

Code:

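    mov eax,(WHITE << 24) | (WHITE << 16) | (WHITE << 8) | WHITE   ; four 8-bit pixels per dword
    mov edi, DISPLAY_MEMORY_ADDRESS
    mov ebx, VERTICAL_RESOLUTION / 4      ; loop body is unrolled 4 times
    mov ebp, BYTES_BETWEEN_LINES          ; padding from the end of one line to the start of the next
    mov edx, DWORDS_PER_LINE              ; visible bytes per line / 4
    cld                                   ; make STOSD count upwards

.l1:
    mov ecx,edx                           ; line 1 of 4
    rep stosd
    add edi,ebp
    mov ecx,edx                           ; line 2 of 4
    rep stosd
    add edi,ebp
    mov ecx,edx                           ; line 3 of 4
    rep stosd
    add edi,ebp
    mov ecx,edx                           ; line 4 of 4
    rep stosd
    add edi,ebp
    sub ebx,1
    jne .l1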

Notice how there's no "address = videoMem + (Y * bytesPerLine + X) * bytesPerPixel" in there?

Of course this is for 8-bit per pixel (but could be easily modified for other pixel formats, except 24-bit per pixel) and doesn't support bank switching (but no sane person uses bank switching anymore).
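The same idea can be sketched in C: one running line pointer, one add per line, and four 8-bit pixels per store. This is a minimal sketch under stated assumptions, not a drop-in driver routine. The name `fill_screen8` and its parameters are hypothetical, `pitch` stands for the full distance in bytes between line starts, and the frame buffer is assumed to be linear and 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill an 8-bit-per-pixel linear frame buffer one line at a time.
 * fb:     first byte of the frame buffer (assumed 4-byte aligned)
 * pitch:  bytes from the start of one line to the start of the next
 * width:  visible pixels per line (assumed to be a multiple of 4)
 * All names here are hypothetical, chosen for illustration. */
void fill_screen8(void *fb, size_t pitch, size_t width, size_t height,
                  uint8_t color)
{
    assert(width % 4 == 0 && pitch % 4 == 0);

    /* Replicate the colour into all four bytes of a dword. */
    uint32_t packed = color * 0x01010101u;
    uint8_t *line = fb;

    for (size_t y = 0; y < height; y++) {
        /* volatile keeps the compiler from dropping stores to device memory. */
        volatile uint32_t *p = (volatile uint32_t *)line;
        for (size_t i = 0; i < width / 4; i++)
            p[i] = packed;      /* four pixels per store */
        line += pitch;          /* one add per line, no multiply */
    }
}
```

A call such as `fill_screen8(lfb, 640, 640, 480, WHITE)` would cover a 640*480 mode whose lines have no padding; `lfb` here is whatever linear frame buffer address the VBE mode information reported.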

BTW Bochs is slow, which is a good thing because it makes it easy to find code that needs to be improved. Qemu is faster, which makes it harder to see code that needs to be improved, which is a bad thing.

Cheers,

Brendan

carbonBased · **Posted:** Sat May 12, 2007 11:15 am

Brendan wrote:

Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.

And this is pretty much the fastest truly generic approach. On some processors you can use 64-bit registers for the copy (e.g. the MMX registers).

Brendan's point about recalculating offsets has always been one of my biggest concerns with graphics code. If you think hard enough, pretty much any graphics routine (lines, circles, ovals, etc.) can be drawn with only one initial calculation (i.e. one multiply), and the rest of the increments can be done through additions.
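As an illustration of the "one multiply, then additions" idea, here is a minimal C sketch of a Bresenham-style line for an 8-bit linear frame buffer. The function name `draw_line8` and its parameters are hypothetical, and the sketch assumes both endpoints are already inside the buffer (no clipping).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Draw a line in an 8-bit-per-pixel buffer using pointer increments only.
 * pitch is the number of bytes between line starts. No clipping is done,
 * so both endpoints must lie inside the buffer. Names are hypothetical. */
void draw_line8(uint8_t *fb, ptrdiff_t pitch,
                int x0, int y0, int x1, int y1, uint8_t color)
{
    assert(pitch > 0);

    int dx = abs(x1 - x0);
    int dy = abs(y1 - y0);
    ptrdiff_t step_x = (x1 >= x0) ? 1 : -1;          /* one pixel left/right */
    ptrdiff_t step_y = (y1 >= y0) ? pitch : -pitch;  /* one line up/down */
    int err = dx - dy;
    int remaining = (dx > dy ? dx : dy) + 1;         /* pixels to plot */

    /* The only multiply in the routine. */
    uint8_t *p = fb + (ptrdiff_t)y0 * pitch + x0;

    for (;;) {
        *p = color;
        if (--remaining == 0)
            break;
        int e2 = 2 * err;
        if (e2 > -dy) { err -= dy; p += step_x; }
        if (e2 < dx)  { err += dx; p += step_y; }
    }
}
```

The error term decides when to step along each axis, so the inner loop needs only compares and adds.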

As an interesting aside, I've actually heard of people using DMA to transfer to the video card in the background. This wouldn't be fast; I just found it interesting.

--Jeff

Candy · **Posted:** Sat May 12, 2007 2:08 pm

Brendan wrote:

Pcmattman's problem is that he's recalculating an offset into display memory for every single pixel (307200 calculations for 640*480 mode), writing one pixel at a time, and drawing vertical lines (which sucks for caching) instead of horizontal lines. Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.

Compilers solve that problem with loop-invariant detection, strength reduction and code motion: they work out that the value isn't changed in the loop, that the derived variables are replaceable by additions, and that the initial calculation can go before the loop. Add to that that your compiler can/should know about faster assignments, and you're all set with this code, without the rep stosd code. Stosd is particularly slow on newer processors, so you might even be better off with a partially unrolled loop, using Duff's device to handle the initial remainder and turning the rest into large aligned moves.

Let's say, a proper compiler could outperform your rep stosd.
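To make the transformations concrete, here is a hand-written before/after pair in C. It is only a sketch: `clear_naive` and `clear_reduced` are hypothetical names, the second function shows what loop interchange, code motion and strength reduction would produce if done by hand, and whether a given compiler actually performs all of this on the original loop (which goes through `vga.SetPixel`) is not something the thread establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-pixel address calculation, column by column, as in the original loop. */
void clear_naive(uint8_t *fb, size_t pitch, size_t w, size_t h, uint8_t c)
{
    for (size_t x = 0; x < w; x++)
        for (size_t y = 0; y < h; y++)
            fb[y * pitch + x] = c;      /* one multiply per pixel */
}

/* The same result after interchanging the loops, hoisting the line
 * address out of the inner loop and replacing the multiply with adds. */
void clear_reduced(uint8_t *fb, size_t pitch, size_t w, size_t h, uint8_t c)
{
    uint8_t *line = fb;
    for (size_t y = 0; y < h; y++) {
        uint8_t *end = line + w;
        for (uint8_t *p = line; p != end; p++)
            *p = c;                     /* sequential writes along a line */
        line += pitch;                  /* strength-reduced y * pitch */
    }
}
```

Both functions write exactly the same bytes; only the order of the writes and the amount of address arithmetic differ.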

Brendan · **Posted:** Sat May 12, 2007 4:30 pm

Hi,

Candy wrote:

Let's say, a proper compiler could outperform your rep stosd.

Find a compiler and compile Pcmattman's original method (with all optimizations enabled that don't prevent the code from running on any 32-bit 80x86 CPU), then find an assembler and assemble my method. Then benchmark the results and get back to me...

For all sane video cards, the first byte of each line is aligned on a 32-bit (or greater) boundary, so misalignment problems won't happen.

Modern CPUs internally optimise string instructions where possible (if the string operation is "long enough", if everything is aligned and if the areas don't overlap). If these conditions are met REP MOVSD is limited by RAM bandwidth and nothing else (for Intel's latest CPUs, REP MOVSD will saturate the L1 cache bandwidth if data is in the L1 cache). Aligned REP STOSD to video display memory would be limited by the PCI bus bandwidth and nothing else.

If the conditions aren't met (i.e. lack of alignment, overlapping source & dest or small strings) then string instructions can be improved on, but I doubt this is the case for the code I posted, although to be honest I haven't found a good estimate of how many bytes is "long enough" (but each 640-byte line of a 640 * 480 mode with 8-bit per pixel works out to 10 cache lines, assuming 64-byte cache lines, which should be enough).

The other problem with string instructions is they pollute the caches - a "non-temporal store" would prevent this, but that's only possible for SSE. Of course cache pollution can't happen for REP STOSD to areas that are "write-combining" or "uncacheable", like video memory, as the writes don't go to cache anyway.
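For ordinary cacheable memory (an off-screen buffer, say), a non-temporal fill might look like the C sketch below. It uses the SSE2 intrinsic `_mm_stream_si128` from `<emmintrin.h>`, followed by `_mm_sfence`. The function name `fill_nt` is hypothetical, the destination is assumed to be 16-byte aligned with a length that is a multiple of 16, and a plain loop is used as a fallback when the compiler does not target SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `bytes` bytes at dst with a repeated 32-bit value, bypassing the
 * caches where SSE2 is available. dst must be 16-byte aligned and bytes
 * a multiple of 16. The function name is hypothetical. */
void fill_nt(void *dst, uint32_t value, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0 && bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32((int)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);   /* non-temporal 16-byte store */
    _mm_sfence();                     /* order the streamed stores */
#else
    /* Fallback without SSE2: ordinary cached stores. */
    uint32_t *p = (uint32_t *)dst;
    for (size_t i = 0; i < bytes / 4; i++)
        p[i] = value;
#endif
}
```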

Cheers,

Brendan

Tyler · **Posted:** Sun May 13, 2007 3:24 am

Brendan wrote:

Hi,

Tyler wrote:

For example in the Windows Graphics Driver Model (I have yet to look at X11's)... it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt) and the operation is very fast. Given that it works on any system that at least has a standard frame buffer one would assume they do not use VESA but a generic operation.

Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.

The very point of EngBitBlt is that it does not use those video drivers. It is a fallback for them if they do not have a built-in capability for BitBlt... yet it always runs very fast. I assume they simply REP the buffer though... and will continue to until I can be bothered running through the code.

Brendan · **Posted:** Sun May 13, 2007 4:23 am

Hi,

Tyler wrote:

Brendan wrote:

Tyler wrote:

For example in the Windows Graphics Driver Model (I have yet to look at X11's)... it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt) and the operation is very fast. Given that it works on any system that at least has a standard frame buffer one would assume they do not use VESA but a generic operation.

Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.

The very point of EngBitBlt is that it does not use those video drivers. It is a fallback for them if they do not have a built-in capability for BitBlt... yet it always runs very fast. I assume they simply REP the buffer though... and will continue to until I can be bothered running through the code.

Doh - I should learn to read before I reply... :oops:

It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.

Cheers,

Brendan

Tyler · **Posted:** Sun May 13, 2007 7:33 am

Brendan wrote:

Doh - I should learn to read before I reply... :oops:

It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.

Cheers,

Brendan

I have noticed that mentioned a few times... What is it about 24-bit that makes operations that would work on 8-, 16- and 32-bit not work? Is it simply a data size and alignment issue? Also, is it not feasible to copy three 24-bit pixels in two 32-bit ops?

Brendan · **Posted:** Sun May 13, 2007 2:21 pm

Hi,

Tyler wrote:

Quote:

It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.

I have noticed that mentioned a few times... What is it about 24-bit that makes operations that would work on 8-, 16- and 32-bit not work? Is it simply a data size and alignment issue? Also, is it not feasible to copy three 24-bit pixels in two 32-bit ops?

In my experience, the 24-bit pixel format is a huge pain in the neck (if you want it to be fast) because a dword contains 1.333 pixels and it's hard to do aligned writes.

If you're filling a horizontal line with a colour, you end up doing something like:

Code:

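    ; The constants are schematic: the letters show the byte sequence of four
    ; packed pixels. Real immediates depend on the mode's byte order and on
    ; x86 storing dwords little-endian.
    mov eax,0xRRGGBBRR
    mov ebx,0xGGBBRRGG
    mov edx,0xBBRRGGBB
    mov ecx,LENGTH/4        ; LENGTH in pixels, 4 pixels (12 bytes) per iteration
.l1:
    mov [edi],eax           ; three aligned dword writes cover 4 pixels
    mov [edi+4],ebx
    mov [edi+8],edx
    add edi,12
    loop .l1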

Of course you have to make sure the first pixel is on a 12 byte (4 pixel) boundary to keep your writes aligned, and for arbitrary lines there's a lot of messing about with the first and last pixels.
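The "messing about with the first and last pixels" can be sketched in C as a head/body/tail split. This is a minimal sketch under stated assumptions: `fill_span24` is a hypothetical name, the line is assumed to start on a 4-byte boundary, and the bytes of each pixel are assumed to be stored as blue, green, red in memory, which should be checked against the actual mode information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `len` pixels starting at pixel `x` of one 24-bit-per-pixel line.
 * row: start of the line (assumed 4-byte aligned, so every fourth pixel
 *      starts on a dword boundary).
 * Memory byte order is assumed to be blue, green, red. Names are hypothetical. */
void fill_span24(void *row, size_t x, size_t len,
                 uint8_t r, uint8_t g, uint8_t b)
{
    assert(((uintptr_t)row & 3) == 0);

    uint8_t *p = (uint8_t *)row + x * 3;

    /* Build the 12-byte pattern for four pixels, then view it as 3 dwords. */
    uint8_t pattern[12];
    uint32_t d[3];
    for (int i = 0; i < 4; i++) {
        pattern[i * 3 + 0] = b;
        pattern[i * 3 + 1] = g;
        pattern[i * 3 + 2] = r;
    }
    memcpy(d, pattern, sizeof d);

    /* Head: single pixels until x reaches a 4-pixel (12-byte) boundary. */
    while (len > 0 && (x & 3) != 0) {
        p[0] = b; p[1] = g; p[2] = r;
        p += 3; x++; len--;
    }
    /* Body: four pixels per iteration as three aligned dword stores. */
    while (len >= 4) {
        uint32_t *q = (uint32_t *)p;
        q[0] = d[0]; q[1] = d[1]; q[2] = d[2];
        p += 12; x += 4; len -= 4;
    }
    /* Tail: the remaining 0 to 3 pixels. */
    while (len > 0) {
        p[0] = b; p[1] = g; p[2] = r;
        p += 3; len--;
    }
}
```

Building the dwords from a byte pattern with `memcpy` sidesteps the question of how to write the immediates by hand.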

For MMX or SSE it's worse. Because you're doing 8 byte or 16 byte writes you end up needing to start on a 24 byte or 48 byte boundary and spending more time trying to get the start and end of the line right. For filling an arbitrary rectangle it's the same problem.

For blitting an arbitrary area it can be much worse - it's a mess unless the left and right edges of the source data and the destination are both aligned on 12 byte, 24 byte or 48 byte boundaries.

The worst case is if the source data and the destination are a few pixels out (e.g. if "sourceLeftEdge % 4 != destLeftEdge % 4"). You'd need to do misaligned reads or writes, or rotate data in the inner loop, e.g.:

Code:

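.l1:
    mov eax,[esi]          ; aligned reads of 12 source bytes (4 pixels)
    mov ebx,[esi+4]
    mov edx,[esi+8]
    shld eex,eax,cl        ; eex, efx, egx and ehx are made-up registers
    shld efx,ebx,cl
    shld egx,edx,cl
    mov [edi],eex          ; aligned writes of 12 destination bytes
    mov [edi+4],efx
    mov [edi+8],egx
    rol cx,8               ; swap CL and CH (presumably the complementary shift count)
    shrd eex,eax,cl
    shrd egx,ebx,cl
    shrd ehx,edx,cl
    rol cx,8               ; swap CL and CH back
    add esi,12
    add edi,12
    sub dword [count],1
    jne .l1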

Of course this is simplified - there aren't enough general registers so I made up some new ones! A better approach would be to have 4 different routines (no rotation, 1 byte rotation, 2 byte rotation and 3 byte rotation).

Lastly, if you're unlucky enough to be using bank switching you might find that a single pixel is split across different banks (e.g. for 64 KB banks there are 21845.33333 pixels per bank). I'd hate to attempt something like "blitPixels(void *srcData, int srcTop, int srcLeft, int destTop, int destLeft, int width, int height);" in this case...
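The bank arithmetic is easy to check: 65536 / 3 = 21845.33, so pixel 21845 occupies bytes 65535 to 65537 and its first byte falls in bank 0 while the other two fall in bank 1. A small C sketch of that test follows. The function name is hypothetical, and it assumes pixels are packed from the start of display memory with no padding between lines, which a real mode may not guarantee.

```c
#include <assert.h>
#include <stdint.h>

#define BANK_SIZE 65536u   /* 64 KB banks, as in the example above */

/* Returns 1 if the three bytes of 24-bit pixel n are split across two
 * banks, 0 otherwise. Pixels are assumed to be packed from offset 0 with
 * no per-line padding. The function name is hypothetical. */
int pixel_straddles_bank(uint32_t n)
{
    uint32_t first = n * 3u;      /* offset of the pixel's first byte */
    uint32_t last = first + 2u;   /* offset of the pixel's last byte */
    return first / BANK_SIZE != last / BANK_SIZE;
}
```

Only every third bank boundary (byte offset 196608, pixel 65536) happens to fall between two pixels.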

Cheers,

Brendan

OSDev.org
