OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




 Post subject: Blitting?
Posted: Fri May 11, 2007 6:07 pm

Joined: Sun Jan 14, 2007 9:15 pm
Posts: 2566
Location: Sydney, Australia (I come from a land down under!)
Does anyone know of a really fast way to blit stuff to the screen?

I've given up on virtual 8086 mode, and have since found a way to get into VESA modes from PMode. (I'll come back to virtual mode later).

Now I've got a problem: I can put stuff on the screen but it's slow to do such operations as 'clear screen'... Obviously, this is not the way to do it:
Code:
for( int x = 0; x < vga.GetXRes(); x++ )
{
   for( int y = 0; y < vga.GetYRes(); y++ )
      vga.SetPixel( x, y, COLOR_WHITE );
}


But I don't know any faster way :( .

_________________
Pedigree | GitHub | Twitter | LinkedIn


Posted: Fri May 11, 2007 6:45 pm

Joined: Thu Jan 04, 2007 3:29 pm
Posts: 1466
Location: Noricum and Pannonia
Maybe this?

_________________
C8H10N4O2 | #446691 | Trust the nodes.


Posted: Fri May 11, 2007 8:36 pm

Joined: Sun Jan 14, 2007 9:15 pm
Posts: 2566
Location: Sydney, Australia (I come from a land down under!)
Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.



Posted: Fri May 11, 2007 9:05 pm

Joined: Tue Oct 17, 2006 9:29 pm
Posts: 2426
Location: Canada
pcmattman wrote:
Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.

I think that you're supposed to check what modes "each" card supports..

_________________
Twitter: @canadianbryan. Award by smcerm, I stole it. Original was larger.


Posted: Fri May 11, 2007 9:32 pm

Joined: Thu Jan 04, 2007 3:29 pm
Posts: 1466
Location: Noricum and Pannonia
Brynet-Inc wrote:
pcmattman wrote:
Thanks, that helped.

I just ran my OS in Qemu and almost fainted... It's insanely fast! Bochs is like a snail in comparison. Though I have one problem, my VBE entry code doesn't work on a real PC. But then, I was trying to get into 640x480 and not many cards support that anymore.

I think that you're supposed to check what modes "each" card supports..

Maybe with something like this in your boot loader?



Posted: Sat May 12, 2007 4:26 am

Joined: Tue Nov 07, 2006 7:37 am
Posts: 514
Location: York, England
Is there no fast generic bit block method?

For example, in the Windows Graphics Driver Model (I have yet to look at X11's), it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt), and the operation is very fast. Given that it works on any system that has at least a standard frame buffer, one would assume they do not use VESA but a generic operation.

What do they most likely do: copy pixel by pixel, or use a processor block transfer? (Is that even possible over memory-mapped I/O?)


Posted: Sat May 12, 2007 11:02 am

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Tyler wrote:
For example, in the Windows Graphics Driver Model (I have yet to look at X11's), it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt), and the operation is very fast. Given that it works on any system that has at least a standard frame buffer, one would assume they do not use VESA but a generic operation.


Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.

Pcmattman's problem is that he's recalculating an offset into display memory for every single pixel (307200 calculations for 640*480 mode), writing one pixel at a time, and drawing vertical lines (which sucks for caching) instead of horizontal lines . Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.

For example, something like:

Code:
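    mov eax,(WHITE << 24) | (WHITE << 16) | (WHITE << 8) | WHITE   ; four 8-bit pixels per dword
    mov edi, DISPLAY_MEMORY_ADDRESS     ; start of the linear frame buffer
    mov ebx, VERTICAL_RESOLUTION / 4    ; the loop below fills 4 lines per pass
    mov ebp, BYTES_BETWEEN_LINES        ; gap from the end of one line to the start of the next
    mov edx, DWORDS_PER_LINE            ; visible bytes per line, divided by 4
    cld                                 ; make string instructions count upwards

.l1:
    mov ecx,edx
    rep stosd                           ; fill line 1 of 4
    add edi,ebp                         ; skip to the start of the next line
    mov ecx,edx
    rep stosd                           ; fill line 2 of 4
    add edi,ebp
    mov ecx,edx
    rep stosd                           ; fill line 3 of 4
    add edi,ebp
    mov ecx,edx
    rep stosd                           ; fill line 4 of 4
    add edi,ebp
    sub ebx,1
    jne .l1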


Notice how there's no "address = videoMem + Y * bytesPerLine + X * bytesPerPixel" in there?

Of course this is for 8-bit per pixel (but could be easily modified for other pixel formats, except 24-bit per pixel) and doesn't support bank switching (but no sane person uses bank switching anymore).
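For anyone who would rather stay in C, here is a rough counterpart of the assembly above. It is only a minimal sketch: the names fill_screen_8bpp, fb and pitch are made up for illustration, and pitch is assumed to be the full distance in bytes between line starts (unlike BYTES_BETWEEN_LINES, which is only the gap after the visible part). In a freestanding kernel you would have to supply your own memset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a width x height area of an 8-bit-per-pixel linear frame buffer.
 * fb, pitch and the function name are assumptions for illustration. */
static void fill_screen_8bpp(uint8_t *fb, size_t width, size_t height,
                             size_t pitch, uint8_t colour)
{
    uint8_t *row = fb;                /* the only address computation */

    for (size_t y = 0; y < height; y++) {
        memset(row, colour, width);   /* one horizontal run per line */
        row += pitch;                 /* next line by addition, no multiply */
    }
}
```

The point is the same as in the assembly: lines are filled horizontally, and the address is stepped with an addition instead of being recalculated for every pixel.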

BTW Bochs is slow, which is a good thing because it makes it easy to find code that needs to be improved. Qemu is faster, which makes it harder to see code that needs to be improved, which is a bad thing.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


Posted: Sat May 12, 2007 11:15 am

Joined: Sat Nov 20, 2004 12:00 am
Posts: 382
Location: Wellesley, Ontario, Canada
Brendan wrote:
Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.


And this is pretty much the fastest truly generic approach. On some processors you can use 64-bit registers for the copy (e.g. the MMX registers).

Brendan's point about recalculating offsets has always been one of my biggest concerns with graphics code. If you think hard enough, pretty much any graphics routine (lines, circles, ovals, etc.) can be drawn with only one initial calculation (i.e. one multiply), and the rest of the increments can be done through additions.
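To illustrate the "one multiply, then only additions" idea, here is a hedged C sketch of a Bresenham-style line for an 8-bit-per-pixel linear frame buffer. The function name and parameters are invented for this example, and it does no clipping, so both endpoints must be on screen.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Draw a line from (x0, y0) to (x1, y1). pitch is the number of bytes
 * between the starts of two consecutive lines (an assumed parameter). */
static void draw_line_8bpp(uint8_t *fb, ptrdiff_t pitch,
                           int x0, int y0, int x1, int y1, uint8_t colour)
{
    int dx = abs(x1 - x0);
    int dy = abs(y1 - y0);
    ptrdiff_t xstep = (x1 >= x0) ? 1 : -1;          /* one pixel left/right */
    ptrdiff_t ystep = (y1 >= y0) ? pitch : -pitch;  /* one line up/down */
    uint8_t *p = fb + (ptrdiff_t)y0 * pitch + x0;   /* the single multiply */
    int err = dx - dy;
    int steps = (dx > dy) ? dx : dy;                /* pixels after the first */

    for (;;) {
        *p = colour;
        if (steps-- == 0)
            break;
        int e2 = 2 * err;
        if (e2 > -dy) { err -= dy; p += xstep; }    /* step in x by addition */
        if (e2 < dx)  { err += dx; p += ystep; }    /* step in y by addition */
    }
}
```

Inside the loop the pixel address is never recomputed from x and y; the pointer just moves by 1 or by pitch.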

As an interesting aside; I've actually heard of people using DMA to transfer to the video card in the background. This wouldn't be fast, I just found it interesting.

--Jeff


Posted: Sat May 12, 2007 2:08 pm

Joined: Tue Oct 17, 2006 11:33 pm
Posts: 3882
Location: Eindhoven
Brendan wrote:
Pcmattman's problem is that he's recalculating an offset into display memory for every single pixel (307200 calculations for 640*480 mode), writing one pixel at a time, and drawing vertical lines (which sucks for caching) instead of horizontal lines . Filling horizontal lines with "REP STOSD" would make his code about 100 times faster.


Compilers solve that problem with loop-invariant detection, strength reduction and code motion: they work out that the value isn't changed inside the loop, that the other derived variables can be replaced by additions, and that the initial calculation can be moved in front of the loop. Add to that that your compiler can (and should) know about faster ways to do the assignments, and you're all set with this code, without the rep stosd code. Stosd is particularly slow on newer processors, so you might even be better off with a partially unrolled loop, using Duff's device to handle the initial remainder and turning the rest into large aligned moves.

Let's say, a proper compiler could outperform your rep stosd.
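For readers who haven't met Duff's device, here is a minimal C sketch of a fill loop unrolled four times, with the switch jumping into the middle of the loop to deal with the remainder first. The function name is made up, and whether this actually beats REP STOSD depends on the CPU and the compiler; it only shows the shape of the technique.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill count 32-bit words at dst with value, 4 stores per loop pass. */
static void fill32_duff(uint32_t *dst, uint32_t value, size_t count)
{
    if (count == 0)
        return;

    size_t passes = (count + 3) / 4;   /* includes the partial first pass */

    switch (count % 4) {               /* jump into the unrolled body */
    case 0: do { *dst++ = value;
    case 3:      *dst++ = value;
    case 2:      *dst++ = value;
    case 1:      *dst++ = value;
            } while (--passes > 0);
    }
}
```

For example, a count of 5 enters at "case 1", does one store, and then runs one full pass of four stores.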


Posted: Sat May 12, 2007 4:30 pm

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Candy wrote:
Let's say, a proper compiler could outperform your rep stosd.


Find a compiler and compile Pcmattman's original method (with all optimizations enabled that don't prevent the code from running on any 32-bit 80x86 CPU), then find an assembler and assemble my method. Then benchmark the results and get back to me... ;)

For all sane video cards, the first byte of each line is aligned on a 32-bit (or greater) boundary, so misalignment problems won't happen.

Modern CPUs internally optimise string instructions where possible (if the string operation is "long enough", if everything is aligned and if the areas don't overlap). If these conditions are met, REP MOVSD is limited by RAM bandwidth and nothing else (for Intel's latest CPUs, REP MOVSD will saturate the L1 cache bandwidth if the data is in the L1 cache). Aligned REP STOSD to video display memory would be limited by the PCI bus bandwidth and nothing else.

If the conditions aren't met (i.e. lack of alignment, overlapping source and destination, or small strings) then string instructions can be improved on, but I doubt this is the case for the code I posted. To be honest, I haven't found a good estimate of how many bytes is "long enough", but one 640-pixel line at 8 bits per pixel is 640 bytes, or 10 cache lines of 64 bytes, which should be enough.

The other problem with string instructions is they pollute the caches - a "non-temporal store" would prevent this, but that's only possible for SSE. Of course cache pollution can't happen for REP STOSD to areas that are "write-combining" or "uncacheable", like video memory, as the writes don't go to cache anyway.
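As a hedged illustration of a non-temporal store, here is a small C sketch that uses the SSE2 intrinsics _mm_stream_si128 and _mm_sfence when the compiler targets SSE2, and falls back to ordinary stores otherwise. The function name and its preconditions (16-byte aligned destination, word count a multiple of 4) are assumptions made for the example, not part of any existing code in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#endif

/* Fill count 32-bit words at dst with value.
 * Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
static void fill32_stream(uint32_t *dst, uint32_t value, size_t count)
{
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32((int)value);          /* 4 copies of value */
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);   /* non-temporal store */
    _mm_sfence();      /* order the streamed stores before later stores */
#else
    for (size_t i = 0; i < count; i++)               /* plain fallback */
        dst[i] = value;
#endif
}
```

This only matters for ordinary cacheable RAM, such as a back buffer; for write-combining or uncacheable video memory the stores bypass the cache anyway, as noted above.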


Cheers,

Brendan



Posted: Sun May 13, 2007 3:24 am

Joined: Tue Nov 07, 2006 7:37 am
Posts: 514
Location: York, England
Brendan wrote:
Hi,

Tyler wrote:
For example, in the Windows Graphics Driver Model (I have yet to look at X11's), it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt), and the operation is very fast. Given that it works on any system that has at least a standard frame buffer, one would assume they do not use VESA but a generic operation.


Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.


The very point of EngBitBlt is that it does not use those video drivers. It is a fallback for drivers that do not have a built-in BitBlt capability, yet it always runs very fast. I assume they simply REP the buffer, though, and will continue to assume so until I can be bothered to read through the code.


Posted: Sun May 13, 2007 4:23 am

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Tyler wrote:
Brendan wrote:
Tyler wrote:
For example, in the Windows Graphics Driver Model (I have yet to look at X11's), it is possible for a driver to ask Windows to handle the bitblt on its behalf (DrvBitBlt calls EngBitBlt), and the operation is very fast. Given that it works on any system that has at least a standard frame buffer, one would assume they do not use VESA but a generic operation.


Given that they've got video drivers that support full 2D and 3D hardware acceleration it's not surprising that it's fast. It's not a generic operation though.


The very point of EngBitBlt is that it does not use those video drivers. It is a fallback for drivers that do not have a built-in BitBlt capability, yet it always runs very fast. I assume they simply REP the buffer, though, and will continue to assume so until I can be bothered to read through the code.


Doh - I should learn to read before I reply... :oops:

It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.


Cheers,

Brendan



Posted: Sun May 13, 2007 7:33 am

Joined: Tue Nov 07, 2006 7:37 am
Posts: 514
Location: York, England
Brendan wrote:

Doh - I should learn to read before I reply... :oops:

It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.


Cheers,

Brendan


I have noticed that mentioned a few times... What is it about 24-bit that makes operations that work on 8-, 16- and 32-bit not work? Is it simply a data size and alignment issue? Also, is it not feasible to copy three 24-bit pixels in two 32-bit ops?


Posted: Sun May 13, 2007 2:21 pm

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Tyler wrote:
Quote:
It does make me wonder how many video drivers actually do use EngBitBlt, and how much it's designed to handle - I could imagine a generic EngBitBlt that uses one of many different methods depending on whether the source data is aligned, how much is being transferred, whether MMX and/or SSE is supported, whether it's an awkward video mode (e.g. 24-bit colour), etc.


I have noticed that mentioned a few times... What is it about 24-bit that makes operations that work on 8-, 16- and 32-bit not work? Is it simply a data size and alignment issue? Also, is it not feasible to copy three 24-bit pixels in two 32-bit ops?


In my experience, the 24-bit pixel format is a huge pain in the neck (if you want it to be fast) because a dword contains 1.333 pixels and it's hard to do aligned writes.

If you're filling a horizontal line with a colour, you end up doing something like:

Code:
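    ; Assumes blue is stored at the lowest address (bytes B,G,R per pixel).
    ; x86 is little-endian, so the low byte of each dword is written first.
    mov eax,0xBBRRGGBB      ; bytes B0 G0 R0 B1
    mov ebx,0xGGBBRRGG      ; bytes G1 R1 B2 G2
    mov edx,0xRRGGBBRR      ; bytes R2 B3 G3 R3
    mov ecx,LENGTH/4        ; 4 pixels (12 bytes) per pass
.l1:
    mov [edi],eax
    mov [edi+4],ebx
    mov [edi+8],edx
    add edi,12
    loop .l1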


Of course you have to make sure the first pixel is on a 12 byte (4 pixel) boundary to keep your writes aligned, and for arbitrary lines there's a lot of messing about with the first and last pixels.
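To make the "messing about with the first and last pixels" concrete, here is a hedged C sketch of a 24-bit line fill with a head, an aligned middle and a tail. The function name is invented, and it assumes blue is the lowest-addressed byte of each pixel (byte order B, G, R); it is a sketch of the idea rather than tuned code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill npixels 24-bpp pixels starting at dst with the colour 0xRRGGBB.
 * Assumption: byte order in memory is B, G, R. */
static void fill_line_24bpp(uint8_t *dst, size_t npixels, uint32_t rgb)
{
    const uint8_t b = (uint8_t)rgb;
    const uint8_t g = (uint8_t)(rgb >> 8);
    const uint8_t r = (uint8_t)(rgb >> 16);

    /* Head: single pixels until dst is dword aligned (at most 3 pixels). */
    while (npixels > 0 && ((uintptr_t)dst & 3u) != 0) {
        dst[0] = b; dst[1] = g; dst[2] = r;
        dst += 3;
        npixels--;
    }

    /* Middle: 4 pixels = 12 bytes = 3 aligned dwords per pass. */
    const uint8_t pattern[12] = { b, g, r, b, g, r, b, g, r, b, g, r };
    uint32_t w[3];
    memcpy(w, pattern, sizeof w);      /* build the three dword patterns */
    while (npixels >= 4) {
        uint32_t *d = (uint32_t *)(void *)dst;
        d[0] = w[0]; d[1] = w[1]; d[2] = w[2];
        dst += 12;
        npixels -= 4;
    }

    /* Tail: the remaining 0 to 3 pixels. */
    while (npixels > 0) {
        dst[0] = b; dst[1] = g; dst[2] = r;
        dst += 3;
        npixels--;
    }
}
```

Because each pixel is 3 bytes, the address moves back by one position modulo 4 with every pixel, so the head loop never needs more than three iterations before the dword writes are aligned.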

For MMX or SSE it's worse. Because you're doing 8 byte or 16 byte writes you end up needing to start on a 24 byte or 48 byte boundary and spending more time trying to get the start and end of the line right. For filling an arbitrary rectangle it's the same problem.

For blitting an arbitrary area it can be much worse - it's a mess unless the left and right edges of the source data and the destination are both aligned on 12 byte, 24 byte or 48 byte boundaries.

Worst case is if the source data and the destination are a few pixels out (e.g. if "sourceLeftEdge % 4 != destLeftEdge % 4"). You'd need to do misaligned reads or writes, or rotate data in the inner loop, for e.g.:

Code:
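.l1:
    mov eax,[esi]           ; load 3 source dwords (4 pixels)
    mov ebx,[esi+4]
    mov edx,[esi+8]
    shld eex,eax,cl         ; eex/efx/egx/ehx are made-up registers
    shld efx,ebx,cl
    shld egx,edx,cl
    mov [edi],eex           ; store 3 destination dwords
    mov [edi+4],efx
    mov [edi+8],egx
    rol cx,8                ; swap CL and CH (second shift count)
    shrd eex,eax,cl
    shrd egx,ebx,cl
    shrd ehx,edx,cl
    rol cx,8                ; swap CL and CH back
    add esi,12              ; advance source and destination by 4 pixels
    add edi,12
    sub dword [count],1
    jne .l1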


Of course this is simplified - there aren't enough general registers so I made up some new ones! A better approach would be to have 4 different routines (no rotation, 1 byte rotation, 2 byte rotation and 3 byte rotation).
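One way to pick between the four routines is to work out the byte rotation up front. The helper below is a hypothetical sketch in C (the name and the assumption that both line starts are dword aligned are mine): it returns 0 when the source and destination edges line up, which is exactly the "sourceLeftEdge % 4 == destLeftEdge % 4" case.

```c
#include <stdint.h>

/* Byte rotation (0..3) between source and destination dwords for a
 * 24-bpp blit. Assumption: both line starts are dword aligned. */
static unsigned blit24_rotation(unsigned src_left, unsigned dst_left)
{
    unsigned src_off = (src_left * 3u) & 3u;  /* byte offset within a dword */
    unsigned dst_off = (dst_left * 3u) & 3u;

    return (dst_off - src_off) & 3u;          /* wraps around modulo 4 */
}
```

A blit routine could use the result as an index into a table of four inner loops, so the rotation is decided once per blit instead of inside the loop.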

Lastly, if you're unlucky enough to be using bank switching, you might find that a single pixel is split across different banks (e.g. with 64 KB banks there are 21845.33333 pixels per bank). I'd hate to attempt something like "blitPixels(void *srcData, int srcTop, int srcLeft, int destTop, int destLeft, int width, int height);" in this case...
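The split-pixel case is easy to detect, at least. Here is a small hedged C sketch, assuming 64 KB banks and a hypothetical helper name, that reports whether a 3-byte pixel crosses a bank boundary:

```c
#include <stdbool.h>
#include <stdint.h>

#define BANK_SIZE 0x10000u   /* 64 KB banks, as in the example above */

/* True if the 24-bpp pixel at (x, y) starts in one bank and ends in the
 * next. pitch is the number of bytes per line (an assumed parameter). */
static bool pixel24_straddles_bank(uint32_t x, uint32_t y, uint32_t pitch)
{
    uint32_t offset = y * pitch + x * 3u;     /* byte offset of the pixel */

    return (offset % BANK_SIZE) > BANK_SIZE - 3u;
}
```

With a 1920-byte pitch (640 pixels at 24 bits per pixel), the pixel at x = 85, y = 34 starts at offset 65535, the last byte of the first bank, so its other two bytes land in the second bank.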


Cheers,

Brendan


