OSDev.org https://forum.osdev.org/ |
|
Assembly mem_copy ( ) https://forum.osdev.org/viewtopic.php?f=1&t=13992 |
Page 1 of 1 |
Author: | Lprogster [ Thu May 17, 2007 12:15 pm ] |
Post subject: | Assembly mem_copy ( ) |
I've been trying to improve the performance of my mem_copy ( ) function. ...See later posts... Any hints, ideas or solutions (especially) would be great. Thanks, Lster |
Author: | Korona [ Thu May 17, 2007 1:39 pm ] |
Post subject: | |
Code: mov edx, [esp] mov eax, 8 div ecx This piece of code loads the return address into edx and then divides edx:eax trough ecx. Try: Code: mov eax, [esp + 4] xor edx, edx mov ecx, 8 div ecx mov eax, ecx Much faster would be: Code: mov ecx, [esp + 4] shr ecx, 8 Code: mov ebx, [esp + 4]
mov edx, [esp + 8] ebx and edx are loaded with the wrong values, the first argument is esp + 4, the second is esp + 8 and the third is esp + 12. It would be faster to load more than one qword in one iteration (pipelining): movq mm0, [edx] movq mm1, [edx + 8] .... |
Author: | Lprogster [ Thu May 17, 2007 1:55 pm ] |
Post subject: | |
Thank you very much for your reply. I still get errors, though (page fault...). Here is my updated code: ... Thanks again, Lster |
Author: | B.E [ Thu May 17, 2007 3:00 pm ] |
Post subject: | |
I would try somthing like this (I can't remember if movs* decreases ecx or rep does it.), I don't know if this is faster, but it's supported by all >286 processors Code: mem_copy_a:
mov ecx, [esp + 4] mov eax, ecx and eax,3 shr ecx, 3 mov edi, [esp + 8] mov esi, [esp + 12] shr eax,1 jnc TwoByteWrite movsb TwoByteWrite: shr eax,1 jnc Dword_Write movsw Dword_Write: rep movsd ret Also you page fault may be due to the fact that your writing to much data. (i.e what happens if the code want's to copy 9,17,25 or 33 bytes of data. |
Author: | B.E [ Thu May 17, 2007 3:04 pm ] |
Post subject: | |
Also it may be better to use a 128 bit move with. and use offsets rather than adding to edx, ebx Code: movdqa xmm1,[esi+ebx];
movdqa [edi+ebx],xmm1; |
Author: | XCHG [ Fri May 18, 2007 12:31 am ] |
Post subject: | |
Maybe I am missing on something but how in the world can you assume that the programmer is trying to move >8 bytes? What if we are moving 2 bytes only? You are moving 8 bytes with the MOVQ MMX instruction. Another point to notice is that when you issue the DEC instruction with any of the General Purpose Registers, the Zero Flag in the EFLAGS register will be set to 1. So this code that you have written: Code: dec ecx cmp ecx, 0 jne .again Is not optimized. You can remove the CMP instruction like this: Code: DEC ECX JNZ .DoSomethingIfECXIsNotZeroYet Another thing that you might already know is that you "must" ask the programmer how many bytes he would like to copy. Then you will be able to pull off a lot of tricks for copying from the source to the destination. For example, first find the quotient of the "count" by 8 and see how many QWORDs (Using MMX instruction) you can move to the destination. After that, find the quotient of the "count / 8" divided by 4 and find out how many DWORDs you can move to the destination, then use that for WORD and finally BYTE. For example, if the user asks you to move 19 bytes from the source to the destination, then 19 bytes of transfer can be done like this: Two QWORDs (= 2 * 8 = 16 bytes) Now subtract 16 (the number of bytes we have already moved) from 19 to get 3. 3 will not yield a positive quotient when divided by 4 so divide it by 2 to get 1 as the quotient. The remainder will be 1. So you have to move 1 WORD and 1 Bytes. As a result, to transfer 19 bytes, Transfer 2 QWORDs (QWORD = 8 bytes, 2*8= 16 bytes) Transfer 1 WORD (2 bytes) Transfer 1 Byte I have written on __MMXZeroMemory function that I think will give you the idea. However, I have not used DWORD/WORD movements in this code. I have written comments for the code but haven't had time to write documentation. Post back here if you have any questions. Code: __MMXZeroMemory: ; void __MMXZeroMemory (void* Destination, DWORD Length) PUSH EAX ; Push the accumulator onto the stack PUSH ECX ; Push the count register onto the stack PUSH EDX ; Push the data register onto the stack PUSH EDI ; Push the destination index onto the stack PUSH EBP ; Push the base pointer onto the stack MOV EBP , ESP ; Move the stack pointer to the base pointer MOV ECX , DWORD PTR [EBP + 0x1C]; *ECX = The [Length] parameter TEST ECX , ECX ; See if the requested length is zero JZ .EP ; Jump to the end of the procedure if yes MOV EDI , DWORD PTR [EBP + 0x18]; *EDI = The [Destination] parameter MOV EAX , ECX ; *EAX = The [Length] parameter XOR EDX , EDX ; Clear the buffer for Byte-moves SHR ECX , 0x00000003 ; *ECX = Number of QWORDs that we have to move to the Destination AND EAX , 0x00000007 ; *EAX = Number of parity bytes that we have to move TEST ECX , ECX ; See if the number of QWORDs to move is zero JZ .MoveBytes ; Start moving bytes (Up to 7 bytes) if yes EMMS ; Empty MMX Technology State PXOR MM0 , MM0 ; The 8-Byte MMX byte is zero .MoveQWORDs: ; Start zeroing memory 8 bytes at a time MOVQ QWORD PTR [EDI] , MM0 ; Zero the current QWORD ADD EDI , 0x00000008 ; Move to the next QWORD in the destination DEC ECX ; Decrement the number of QWORDs to move JNZ .MoveQWORDs ; Keep moving QWORDs to the destination TEST EAX , EAX ; See if the number of parity bytes to move is zero JZ .EP ; Jump to the end of the procedure if yes .MoveBytes: ; We are left to move up to 7 bytes now MOV BYTE PTR [EDI] , DL ; Write one byte to the current destination's location INC EDI ; Move to the next byte of the destination DEC EAX ; Decrement the number of bytes to move JNZ .MoveBytes ; Keep moving parity bytes while EAX>0 .EP: ; End of the procedure routine POP EBP ; Restore the base pointer POP EDI ; Restore the destination index POP EDX ; Restore the data register POP ECX ; Restore the count register POP EAX ; Restore the accumulator RET 0x08 ; Return to the calling procedure ; And sweep 2 parameters off the stack ; And sweep 1 parameter off the stack Alright, I coded one [__MemCpy] right now and you are free to use it too. But there are certain things you have to bear in mind about the code and these can be used as general programming skills: When you want to divide a number by n where n is a power of 2 (For example, 1, 2, 4, 8, 16, 32, etc), just find x while 2^x = n For example, to divide a number by 16, find x for 2^x = 16 This can be evaluated as Log.2(16) = x Log.2(2^4) = x 4*Log.2(2) = x 4=x or x=4 Therefore, to be able to divide a number by 16, you can shift it to the right 4 times. This allows us to find the quotient of a number divided by 16. We are going to need this value to be able to find out how many QWORDs/DWORDs/etc we have to move to a buffer. For example, if the user wants to move 34 bytes to a buffer, we shift 34 to the right 2 times to be able to find out how many DWORDs we have to move to the buffer. When you shift a number to the right n times, that number will be divided by 2^n. So shifting a number to the right 2 times will divide the number by 2^2 = 4 (DWORD). Now that we know the quotient of our division, we have to find the remainder also. Finding the remainder of a division of a number by a power of 2 is really easy, also. If you want to find this remainder, just AND the number of N-1 when N is the quotient. For example, to find the remainder of the division (14/8), just subtract 1 from 8 (8-1 = 7) then AND 14 with 7 to get 6. So 6 is the remainder of the division of (14/8). 1*8 = 8, 8+6 = 14. To wrap it up, let's say we need the remainder and the quotient of the division (NUMBER / 32). Quotient = NUMBER SHR x 2^x = 32 => x = 5 Quotient = NUMBER SHR 5 Remainder = NUMBER AND (32-1) = NUMBER AND 31 That's that. Now to be able to understand the code below, you need to know a few optimization tricks. 1) If the required length specified by the programmer is zero, then exit the function immediately. 2) If the source and the destination are equal, then quit the function immediately. 3) Use the MOVSD instruction to move ECX number of DWORDs at a time from DS:ESI to DS:EDI. 4) Use the MOVSB instruction to move ECX number of BYTES at a time from DS:ESI to DS:EDI. 5) Do not assume that the processor your kernel is running in is equipped with MMX/SSE instructions. 6) Do not trash your general purpose registers; PUSH and POP them to save their current state. 7) Clear the direction flag when using string instructions (although not necessarily) with the CLD instruction. 8) I use the StdCall calling convention in my kernel. You should push the parameters from right to left in this calling convention. I didn't have enough time to write documentation for this code either. If you need further help/explanation, write back here. Code: ; ——————————————————————————————————————————————————
__MemCpy: ; void __MemCpy (void* Destination, const void* Source, DWORD NumberofBytesToCopy); StdCall; PUSHFD ; Push the EFLAGS register onto the stack PUSH EAX ; Push the accumulator onto the stack PUSH ECX ; Push the count register onto the stack PUSH ESI ; Push the source index onto the stack PUSH EDI ; Push the destination index onto the satck PUSH EBP ; Push the base pointer onto the stack MOV EBP , ESP ; Move the stack pointer to the base pointer CLD ; Clear the direction flag, Move forward with string instructions MOV ECX , DWORD PTR [EBP + 0x24]; *ECX = The [NumberofBYtesToCopy] parameter TEST ECX , ECX ; See if the number of bytes we have to copy is zero JZ .EP ; Jump to the end of the procedure if yes MOV ESI , DWORD PTR [EBP + 0x20]; *ESI = The [Source] parameter MOV EDI , DWORD PTR [EBP + 0x1C]; *EDI = The [Destination] parameter CMP ESI , EDI ; Are the source and the destination the same memory location? JE .EP ; If yes, then jump to the end of the procedure MOV EAX , ECX ; EAX holds the [NumberofBytesToCopy] parameter too AND EAX , 0x00000003 ; *EAX = Remainder of ([NumberofBytesToCopy] / 4) SHR ECX , 0x00000002 ; *ECX = Quotient of ([NumberofBytesToCopy] / 4) REP MOVSD ; Store as many as ECX DWORDs from source to destination MOV ECX , EAX ; Put the count in the count register REP MOVSB ; Now start moving bytes from Source to Destination .EP: ; End of the procedure POP EBP ; Restore the base pointer POP EDI ; Restore the destination index POP ESI ; Restore the source index POP ECX ; Restore the count register POP EAX ; Restore the base index POPFD ; Restore the EFLAGS register RET 0x0C ; Return to the calling procedure ; And sweep 3 parameters off the stack Hope that helps! |
Author: | Brendan [ Fri May 18, 2007 2:17 am ] |
Post subject: | Re: Assembly mem_copy ( ) |
Hi, Lprogster wrote: I've been trying to improve the performance of my mem_copy ( ) function. So, this is my attempt to use MMX to copy faster (in Nasm BTW):
Don't. A high performance generic "mem_copy()" needs to take into account various values for count and alignment issues, and also needs to worry about setup overhead. For example, for a small transfer (e.g. 16 bytes) where source and destination are aligned to an 8 byte boundary, do you need to check if MMX is supported, then use FXSAVE, EMMS and FXLOAD (in case the caller is using the FPU), and could it cause an device not available exception? In this case "rep movsb" would be faster because there's much less setup overhead. For a high level language this would be optimised by the compiler itself, so that if the compiler knows the source and destination are aligned it inlines code that doesn't bother checking alignment, and if the compiler knows the count is a multiple of 4 or 8 it inlines code that doesn't bother transferring bytes, words, dwords, etc. If you really want high performance, then you need to use the same tricks. Basically you need either an optimised macro for each specific case or custom code for each specific case. A generic "mem_copy()" routine is guaranteed to give worse performance, simply because it's generic. @XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy. For example: Code: cmp foo, bar ;An instruction that modifies flags
add foo,bar ;Can complete without knowing flags left by the last instruction (NO STALL) adc foo,bar ;Can't complete without knowing flags left by the last instruction (STALL) dec foo ;Can't complete without knowing flags left by the last instruction (STALL) inc foo ;Can't complete without knowing flags left by the last instruction (STALL) sub foo,1 ;Can complete without knowing flags left by the last instruction (NO STALL) Cheers, Brendan |
Author: | crackers [ Fri May 18, 2007 2:57 am ] |
Post subject: | |
Brendan wrote: @XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy.
I'm not sure if I understand. Does it mean that it's better to use add ax, 1 than inc ax ? If yes than by "stall" you mean what is happening in out-of-order buffer? If I've just written something stupid, please don't laugh at me. |
Author: | XCHG [ Fri May 18, 2007 4:13 am ] |
Post subject: | |
Brendan, What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example, Code: LEA ESI , Source
LEA EDI , Destination MOV ECX , 100 @Transfer: MOV EAX , DWORD PTR [ESI] MOV DWORD PTR [EDI] , EAX ADD ESI , 4 ADD EDI , 4 DEC ECX JNZ @Transfer According to your post, the above code should take longer to execute compared to when the DEC isntruction is replaced by SUB 1. The above code runs at the speed of 313 clock cycles under the below circumstances: Code Segment aligned on a DWORD boundary. Stack Segment aligned on a DWORD boundary. Both Source and Destination variables are arrays of 100 DWORDs, aligned on a DWORD boundary. CPU: PIII 800MHZ RAM: 512MB SD Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame. |
Author: | Brendan [ Fri May 18, 2007 4:54 am ] |
Post subject: | |
Hi, XCHG wrote: What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example, IIRC the U and V pipes only apply to Pentium CPUs. On Pentium III backward branches are predicted as taken, and the CPU will speculatively start executing the next iteration of the loop before it knows if the branch is taken or not. In this case the choice of sub or dec should only really effect the speed that the CPU exits the loop (where the branch would be mispredicted until the flags are known), nothing else. XCHG wrote: Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame. How many times did you try it (are the figures you've stated an average of many tests), were the caches in the same state (e.g. all code in the cache beforehand, so we know the add wasn't slower because it caused an extra cache line fetch), and did you do a serialising instruction before the first RDTSC? More importantly, did you try it on a Pentium 4 or any other CPUs? From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy): Quote: The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependance on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Complier Coding Rule 42. (medium impact, high generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependancies on earlier instructions that set the flags. Cheers, Brendan |
Author: | Candy [ Sat May 19, 2007 7:52 am ] |
Post subject: | |
Brendan wrote: IIRC the U and V pipes only apply to Pentium CPUs. Yep. U/V pipelining is total nonsense when applied to other cpu's. I have seen an internal implementation of memcpy for VS2005 which was hand-optimized for the Pentium I though, so you're not the first to do that wrong. In general, the things that worked on a P1 will continue to work, but more could work. Quote: XCHG wrote: Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame. More importantly, did you try it on a Pentium 4 or any other CPUs? From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy): Yep. That applies only for the P4 NetBurst architecture and probably the newer Core architectures as well. |
Page 1 of 1 | All times are UTC - 6 hours |
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group http://www.phpbb.com/ |