Streaming SHA calculation using VIA's Padlock Hashing Engine?

Question 1

Depends on the CPU. On VIA Nano and later, you can perform partial hashes by setting EAX to 0xFFFFFFFF before executing the REP XSHA1 / REP XSHA256 instruction. The CPU then skips the final padding, so you can simply feed the chunks into the hash, just as you usually do with hashing functions. Older models (up to the C7) have no such mode: EAX has to be set to zero before the hash instruction, and a full hash (i.e. including the final padding) is always performed.
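If the engine skips the final padding, the caller has to produce the last padded block(s) before the digest is valid. Here is a minimal sketch of the standard SHA-1/SHA-256 padding (a 0x80 byte, zeros, then the 64-bit big-endian bit length). The function name sha_pad_tail is my own and is not taken from the Padlock SDK, and I'm assuming the padded output is then fed to the engine like any other whole block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA_BLOCK 64  /* block size of SHA-1 and SHA-256, in bytes */

/*
 * Hypothetical helper: build the final padded block(s) for a message of
 * total_len bytes. tail points at the last (total_len % 64) bytes that have
 * not been hashed yet. out must hold 128 bytes.
 * Returns the number of bytes written to out (64 or 128).
 */
static size_t sha_pad_tail(const uint8_t *tail, uint64_t total_len,
                           uint8_t out[2 * SHA_BLOCK])
{
    size_t rem = (size_t)(total_len % SHA_BLOCK);
    /* The 0x80 marker and the 8-byte length must fit; else use two blocks. */
    size_t padded = (rem + 1 + 8 <= SHA_BLOCK) ? SHA_BLOCK : 2 * SHA_BLOCK;
    uint64_t bits = total_len * 8;

    assert(tail != NULL || rem == 0);

    memset(out, 0, padded);
    if (rem)
        memcpy(out, tail, rem);
    out[rem] = 0x80;                       /* mandatory "1" bit after the data */
    for (int i = 0; i < 8; i++)            /* message length in bits, big-endian */
        out[padded - 1 - i] = (uint8_t)(bits >> (8 * i));
    return padded;
}
```

The result is always a whole number of 64-byte blocks, so it can go through the same partial-hash path as the rest of the data.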

I successfully implemented the hack mentioned above (on Windows) and it worked [tested on a VIA Nano with EAX=0 though, as I don't have access to an old CPU]. But yes, there's a performance penalty, so you don't want to feed tiny chunks into the code. I suggest collecting small chunks in a bigger buffer, say a few kilobytes, and only then performing the "interrupted hash". If you finish with less data than that, it may be better to fall back to ordinary x86 code.
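To illustrate the buffering idea, here is a rough sketch of a staging buffer that hands data to the hash backend only in large pieces. Everything in it is my own naming (chunk_buffer, cb_update, cb_flush, the 4096-byte size), and the backend is a hypothetical callback; count_backend is just a stand-in that records what it was given, where real code would call the Padlock routine or a software SHA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK      64    /* SHA-1 / SHA-256 block size in bytes */
#define STAGE_SIZE 4096  /* "a few kilobytes"; must be a multiple of BLOCK */

/* Hypothetical backend: is only ever given a multiple of BLOCK bytes. */
typedef void (*hash_blocks_fn)(void *state, const uint8_t *data, size_t len);

typedef struct {
    uint8_t        stage[STAGE_SIZE];
    size_t         used;    /* bytes currently waiting in stage */
    uint64_t       total;   /* total bytes seen (needed for the final padding) */
    hash_blocks_fn engine;
    void          *state;
} chunk_buffer;

/* Stand-in backend for demonstration: records the calls it receives. */
struct call_log { size_t calls; size_t bytes; };
static void count_backend(void *state, const uint8_t *data, size_t len)
{
    struct call_log *log = state;
    (void)data;
    assert(len % BLOCK == 0);
    log->calls++;
    log->bytes += len;
}

static void cb_init(chunk_buffer *cb, hash_blocks_fn engine, void *state)
{
    cb->used = 0;
    cb->total = 0;
    cb->engine = engine;
    cb->state = state;
}

/* Accept a chunk of any size; call the engine only when the stage is full. */
static void cb_update(chunk_buffer *cb, const uint8_t *data, size_t len)
{
    cb->total += len;
    while (len > 0) {
        size_t room = STAGE_SIZE - cb->used;
        size_t n = len < room ? len : room;
        memcpy(cb->stage + cb->used, data, n);
        cb->used += n;
        data += n;
        len -= n;
        if (cb->used == STAGE_SIZE) {
            cb->engine(cb->state, cb->stage, STAGE_SIZE);
            cb->used = 0;
        }
    }
}

/* Push out the remaining whole blocks; returns the leftover byte count
 * (< BLOCK), which is moved to the start of stage for the final padding. */
static size_t cb_flush(chunk_buffer *cb)
{
    size_t whole = cb->used - cb->used % BLOCK;
    if (whole)
        cb->engine(cb->state, cb->stage, whole);
    memmove(cb->stage, cb->stage + whole, cb->used - whole);
    cb->used -= whole;
    return cb->used;
}
```

With 100-byte chunks, the backend is called once per 4096 bytes instead of once per chunk. For a short final remainder, the callback could point at plain x86 code rather than the engine.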

Since I can't comment/reply on other posts, here's a reply to the comment below:

I'm afraid I can't share my code, but I suggest searching for "PadlockSDK_3.1_Release_20090121.zip". That's the official VIA source containing the relevant functions (look e.g. inside PadlockSDK_3.1_Release_20090121\PadlockSDK_3.1_build20081128\sdk\src, where you'll find the assembly implementation of the asm_partial_sha1_op3() function).

Question 2

Well, I found this: http://www.logix.cz/michal/devel/padlock/phe_sum.xp

PHE saves its current state into memory on every process switch and as well on any page fault that occurs during the run. This state includes number of bytes hashed and an intermediate result that could be used as an initial value for subsequent rounds. So far so good. The only remaining question is how to trigger a context switch or a page fault at the place we need. Solution: mmap(2) two or more pages, mprotect(2) the last one to deny all access (PROT_NONE). This creates an inaccessible piece of memory exactly at the place we need. Now we put all our input data just before this barrier and engage PHE. However we'll tell it to hash slightly more data than we put into the buffer. With these instructions PHE will crunch all our input and attempt to hash some more. At that point it hits the protected area, triggers an exception, saves current intermediate status into the memory and calls the exception handler (well, not exactly and not exactly in this order, never mind ;-). Anyway the exception handler skips over the PHE instruction (hacky hack, EIP+=4 ;-) and returns.
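The memory layout described in that quote can be sketched like this. It covers only the mmap(2)/mprotect(2) part: the PHE instruction and the exception handler are left out, and the names guarded_buf, guarded_alloc and guarded_place are made up for illustration. It assumes a Linux-style MAP_ANONYMOUS mapping.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    uint8_t *base;     /* start of the whole mapping */
    size_t   map_len;  /* mapping length, including the guard page */
    uint8_t *barrier;  /* first byte of the inaccessible page */
} guarded_buf;

/* Map enough pages for `capacity` bytes plus one trailing PROT_NONE page.
 * Returns 0 on success, -1 on failure. */
static int guarded_alloc(guarded_buf *g, size_t capacity)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (capacity + page - 1) / page;
    if (data_pages == 0)
        data_pages = 1;

    g->map_len = (data_pages + 1) * page;
    g->base = mmap(NULL, g->map_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (g->base == MAP_FAILED)
        return -1;

    g->barrier = g->base + data_pages * page;
    if (mprotect(g->barrier, page, PROT_NONE) != 0) {
        munmap(g->base, g->map_len);
        return -1;
    }
    return 0;
}

/* Copy the input so that its last byte sits directly before the barrier.
 * Returns the address the hash engine would be pointed at. */
static uint8_t *guarded_place(guarded_buf *g, const void *data, size_t len)
{
    assert(len <= (size_t)(g->barrier - g->base));
    uint8_t *start = g->barrier - len;
    memcpy(start, data, len);
    return start;
}

static void guarded_free(guarded_buf *g)
{
    munmap(g->base, g->map_len);
}
```

The engine would then be started at the returned pointer with a byte count slightly larger than len, so that it faults on the first byte of the barrier page.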

Clever hack, but I don't know about the performance penalty of doing this.

In my testing, it never completes if the file is larger than the input buffer, i.e. the hack doesn't appear to work for me. So it seems rather fragile, though the theory sounds okay.

So from what I've found, there's no particularly ideal way to feed xsha1 in chunks. (It seems a little pointless to have hardware-accelerated hashing support without being able to feed it large amounts of data nicely.)