My technical take on ProgPow's weakest link

greerso · March 26, 2019, 5:25pm

nvrtc is distributable according to https://docs.nvidia.com/cuda/eula/#attachment-a

jcyr · March 26, 2019, 5:41pm

@greerso Good news… But the license terms also include:

The distributable portions of the SDK shall only be accessed by your application.

and

Except as expressly provided in this Agreement, you may not copy, sell, rent, sublicense, transfer, distribute, modify, or create derivative works of any portion of the SDK.

IANAL but it might be ok to bundle the nvrtc DLLs and SOs in some cases.

gcolvin · March 26, 2019, 10:14pm

Reading the code at https://github.com/ifdefelse/ProgPOW/blob/master/libprogpow/ProgPow.cpp
it seems that, instead of generating and compiling code, a similar effect could be achieved by having code similar to what getKern() generates call randomly into a (very large) set of pre-compiled math() and merge() functions.
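
A minimal sketch of that idea, in Python for brevity rather than as GPU kernel code: a fixed table of already-built functions, with a seeded generator choosing which ones to call. The names MATH_OPS, MERGE_OPS, build_program and run_program are hypothetical, the operations are only loosely modeled on ProgPoW's, and the selection logic is not the real getKern().

```python
import random

MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n (mod 32) bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

# Stand-ins for a fixed set of precompiled math() variants.
MATH_OPS = [
    lambda a, b: (a + b) & MASK,
    lambda a, b: (a * b) & MASK,
    lambda a, b: a ^ b,
    lambda a, b: a & b,
    lambda a, b: rotl32(a, b),
]

# Stand-ins for a fixed set of precompiled merge() variants.
MERGE_OPS = [
    lambda a, b: (a * 33 + b) & MASK,
    lambda a, b: ((a ^ b) * 33) & MASK,
    lambda a, b: rotl32(a, 7) ^ b,
]

def build_program(period_seed, length=8, regs=4):
    """Pick a sequence of (math, merge, src1, src2, dst) indices from the fixed tables."""
    rng = random.Random(period_seed)  # illustrative PRNG, not KISS99
    return [
        (rng.randrange(len(MATH_OPS)), rng.randrange(len(MERGE_OPS)),
         rng.randrange(regs), rng.randrange(regs), rng.randrange(regs))
        for _ in range(length)
    ]

def run_program(program, mix):
    """Apply the selected operations to a copy of the register file."""
    mix = list(mix)
    for math_i, merge_i, s1, s2, dst in program:
        value = MATH_OPS[math_i](mix[s1], mix[s2])
        mix[dst] = MERGE_OPS[merge_i](mix[dst], value)
    return mix
```

Nothing is compiled per period here; only the list of table indices changes. The number of distinct programs is bounded by the table sizes and the program length, and every step pays for an indirect call.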

Arachnid · March 27, 2019, 8:02pm

The worst-case scenario is not cards going offline, it’s a fork, with all AMD cards on one side and NVIDIA cards on the other. The fork would persist even after the problematic epoch has passed, as each side would see the other side’s blocks as invalid.

I believe the point of ProgPoW is that such a set of functions would be too large to practically implement and push to the GPU; if it weren’t, you could build an ASIC that does this more efficiently.

shemnon · March 27, 2019, 8:29pm

I don’t think so. None of the clients do POW validation on GPUs, it’s all software for them. So instead of a fork you will get AMD GPUs generating bad blocks that no clients will propagate.

shemnon · March 27, 2019, 8:31pm

ASICs can do it more efficiently. The question is how much more efficiently. The EIP preamble calls this out and provides an estimate of the maximum gain an ASIC could achieve. The aim of ProgPow was never to make ASICs impossible but to reduce the efficiency that they can gain relative to existing GPU architectures.

Arachnid · March 27, 2019, 8:34pm

I’m aware of that, but you’re missing my point. For the conditions in the preamble to be true, the set of possible pipelines has to be too large to precompile and include in a binary as Greg is suggesting.

Arachnid · March 27, 2019, 8:34pm

Good point, I stand corrected.

shemnon · March 27, 2019, 8:58pm

Actually I would expect a good ASIC to have a software driver that would communicate the lane configurations. Loading and configuring the calculations is not the long part of the process, once it’s set up it runs for 2.5 minutes.

The “unpredictability” of the calculations isn’t the aim of progpow. It’s the variety of calculations over the life of the chain.

If unpredictability across the chain is desired, all it would take is to add in the block hash from some number of blocks before the start of the period to change the kiss99 seed. But I’m not sure how much “efficiency” that would cost, because it is similarly calculable in a driver that can send out the config to an entire fleet of ASICs.
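
A sketch of what that could look like. The generator step follows Marsaglia's published KISS99; the seed derivation (hashing the period number together with an earlier block hash using SHA-256) is purely an assumption for illustration and is not how the ProgPoW spec seeds the generator. The names Kiss99 and seed_kiss99 are hypothetical.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF

class Kiss99:
    """Marsaglia's KISS99 generator with explicit 32-bit wrap-around."""

    def __init__(self, z, w, jsr, jcong):
        self.z, self.w, self.jsr, self.jcong = z, w, jsr, jcong

    def next(self):
        # Two multiply-with-carry generators combined into one 32-bit word.
        self.z = (36969 * (self.z & 0xFFFF) + (self.z >> 16)) & MASK
        self.w = (18000 * (self.w & 0xFFFF) + (self.w >> 16)) & MASK
        mwc = ((self.z << 16) + self.w) & MASK
        # 3-shift xorshift register.
        self.jsr ^= (self.jsr << 17) & MASK
        self.jsr ^= self.jsr >> 13
        self.jsr ^= (self.jsr << 5) & MASK
        # Linear congruential generator.
        self.jcong = (69069 * self.jcong + 1234567) & MASK
        return ((mwc ^ self.jcong) + self.jsr) & MASK

def seed_kiss99(period, prior_block_hash):
    """Hypothetical seeding: derive the four state words from period + block hash."""
    digest = hashlib.sha256(struct.pack("<Q", period) + prior_block_hash).digest()
    z, w, jsr, jcong = struct.unpack("<4I", digest[:16])
    return Kiss99(z, w, jsr or 1, jcong)  # the xorshift state must be non-zero
```

Anyone who knows the earlier block hash can run the same derivation, which is the point made above: a driver could compute the configuration once and push it to a whole fleet.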

Arachnid · March 27, 2019, 9:02pm

Again, not my point, but I seem to be unable to communicate my point clearly enough, so I give up.

gcolvin · March 28, 2019, 4:34am

I think you are right, though I wasn’t clear that a new executable would still be created each time. No new code would be compiled, just different combinations of the same precompiled code linked together. It would take some redesign of the kernel algorithm. And the period of the “random program generator” would be a lot less, but I don’t know how much that would matter to ASIC resistance.

jcyr · March 28, 2019, 5:45am

A reasonable analogy explaining an ASIC pipeline can be found here: http://chipverification.blogspot.com/2008/06/what-is-pipelining.html

Each stage in an ASIC pipeline requires its own copy of the state. Even with a simple SHA-256 hash, with a relatively small state, a fully unrolled pipeline will have a couple of hundred stages. The number of stages times the state size dictates the number of on-chip storage bits required to accommodate the pipeline. ProgPow has a much larger state and a more complex algorithm, which precludes a fully unrolled pipeline (too many expensive on-chip storage bits), forcing the ASIC designer to choose a more sequential, less unrolled, slower, and less power-efficient design. I don’t think you’d need a very large number of pre-generated random sequences. I’m guessing that 32 such sequences would be enough to encumber an ASIC and would still be GPU loadable. Certainly not as complex as the completely pseudo-random merge, but difficult enough!

You’d need to mock up the design and run simulations to get a better idea of the lower bound for the number of sequences.
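
As a back-of-the-envelope illustration of the "stages times state size" rule above. All figures are assumptions chosen for the arithmetic, not measurements: 2 x 64 rounds for a double SHA-256, 256 + 512 bits of per-stage state, and a ProgPoW mix state of 16 lanes x 32 registers x 32 bits.

```python
def pipeline_storage_bits(stages, state_bits_per_stage):
    """Storage needed when every pipeline stage latches its own copy of the state."""
    return stages * state_bits_per_stage

# Illustrative assumptions, not measured figures.
sha256d_stages = 2 * 64            # two SHA-256 passes, 64 rounds each
sha256_state_bits = 256 + 512      # working variables + message schedule window
progpow_state_bits = 16 * 32 * 32  # assumed 16 lanes x 32 registers x 32 bits

sha_bits = pipeline_storage_bits(sha256d_stages, sha256_state_bits)
state_ratio = progpow_state_bits / sha256_state_bits

print(f"SHA-256d fully unrolled: {sha_bits} bits")
print(f"ProgPoW per-stage state is about {state_ratio:.1f}x larger")
```

Under these assumptions each ProgPoW stage would carry roughly twenty times the state of a SHA-256 stage, before counting how many more stages a full unroll would need.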

Anlan · March 28, 2019, 9:30am

@gcolvin Technically speaking, compiling a new search kernel from pseudo-random source on every period is not mandatory. After all, the CPU validation, whether implemented as a backup verification in the miner itself or directly in the node’s code, does not recompile anything.
For parallel processing (thus GPU mining), though, removing as many if, switch, select … braces as possible from the kernel has a huge impact on processing speed. A very small compile time (spent asynchronously while the previous search kernel is on duty) is well worth the gain in processing speed on the GPU side.

Delivering a miner which implements a statically compiled pattern would be meaningless: nobody would use it, and all GPU miners would use the versions which implement async compile, as they’d give a higher hashrate.

Given the above, it is worth mentioning that on-the-fly compilation matches the GPU architecture and can be optimized based on various values detected from the GPU (e.g. best work multiplier, compute capabilities, etc.).
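
A minimal sketch of the asynchronous compile pattern described above, assuming a worker thread builds the next period's kernel while the current one is busy. compile_kernel and mine are hypothetical stand-ins; a real miner would call the driver's runtime compiler and launch GPU work instead.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_kernel(period):
    """Placeholder for the real (slow) source generation + driver compile."""
    def kernel(nonce):
        return (nonce * 2654435761 + period) & 0xFFFFFFFF
    return kernel

def mine(first_period, periods, nonces_per_period):
    """Run each period's kernel while the next one is built in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        current = compile_kernel(first_period)
        for period in range(first_period, first_period + periods):
            # Start building the next kernel while the current one is on duty.
            pending = pool.submit(compile_kernel, period + 1)
            for nonce in range(nonces_per_period):
                results.append((period, current(nonce)))
            current = pending.result()  # swap in at the period boundary
    return results
```

The compile cost only shows up as a stall if the build takes longer than a whole period of mining.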

Anlan · March 28, 2019, 10:05am

@xazax310 the issue of the bogus period found by @jcyr occurred in a very limited environment (compared to the variety of GPU types and driver versions out there). IIRC it was found on Ellesmere AMD family cards with driver 18.x, with the OpenCL 1.2 implementation embedded into ethminer. Neither Jean nor I had the full collection of possible cards available for testing. I personally had no opportunity to test whether or not earlier driver versions affected the very same GPU architecture in the same way. Because of this I can’t honestly affirm that this was exclusively a “generic” OpenCL issue on all AMD cards.
I can report what we witnessed, which is this: with the AMDGPU-PRO driver the issue was there. With the AMD ROCm driver the issue did not appear. But the code generated by ethminer was the very same in both cases, so we had to assume it was something related to the driver’s compiler. We’re still waiting for a more detailed report from AMD on what happened, though we’re informed they’re working on it.

To be very specific, the “error” was caused by a particular sequence of math operations. Say we had 3 operations named A, B and C. When the sequence was A B C, everything was ok; when it was B A C, everything was still ok; when the sequence was B C A … something got messed up and the kernel produced results which, by its own internal logic, were correct but failed the CPU verification.

On the ROCm matter … ROCm is both a driver (btw with high(er) hardware requirements on the motherboard) and a platform. To be very honest, the finding that ROCm did not suffer from the issue was a lucky shot (I recall it was late night - for me - while testing and gittering with Jean and Dobromir).
It is worth reiterating that the ROCm driver was compiling the very same OpenCL 1.2 code.
Having a full ROCm implementation is another matter. While, for instance, OpenCL forces code to a C99 syntax, with ROCm you have a C++-like syntax and you can use inline asm (like Nvidia/CUDA). This opens up new optimization possibilities.

To conclude: I personally think that implementing, say, 32 pre-generated sequences wouldn’t be worth the effort. GPU miners will always prefer the most performant solution available and, atm, the async precompile is the best. It’s up to nodes to verify the nonce is valid.

I honestly don’t share the worries about “catastrophic events”. As said, we (developers) can easily, if supported, carry out extensive tests over a reasonably wide time span of future epochs/periods using the current state of the art (current driver releases, known architectures, etc.) while keeping miner software up to date over time. Plus, to my knowledge, GPU miners who unadvisedly install new software/drivers without a proper functional/regression test (thus exposing their business plan - if any - to the risk of being voided) are very few. Even on Windows, the most commonly given advice in all forums, channels, whatever is to disable Windows automatic updates.

The possibility of reaching a situation so critical that all GPU types, with any driver version, could all of a sudden produce only invalid results (thus halting the chain) is, imho, == NULL.

jcyr · March 28, 2019, 12:26pm

To be more factual, the error occurred with all versions of AMDGPU and AMDGPU-PRO for Windows or Linux, the drivers used by the majority of AMD users. It is really unknowable how and when this type of error will occur.

I have no clue what you mean when you say that dynamically compiled code would be faster?

Based on the often cited Bitcoin experience, many miners have this irrational fear of ASICs due to a belief in some kind of magical superpower. ASICs achieve their performance and power efficiency in two ways. By using optimized dedicated circuitry for logic and math operations, and by maximizing the use of all of on-chip circuitry via pipelining. The 1st gen. Ethash ASICs were not at all impressive, demonstrating that the type of algorithm matters. For a simple hash algorithm like SHA256d, with a small state and a shorter hash operator sequence, ASICs have a huge advantage. With an algorithm like Ethash where the mix portion of the algorithm imposes external memory access and stalls the entire pipe with each access, many of an ASIC’s advantages disappear. Adding to that a large state and multiple computational paths further shrinks their advantage.

Correct me if I’m wrong, but the underlying strategy with ProgPow is to force a dedicated ASIC to operate more like a GPU ASIC, thus evening things out. ProgPow as it stands hits that objective, but you’ll have a hard time convincing me that added run-time complexity and increased reliance on external components doesn’t lower reliability.

Just wanted to make that clear. I understand that changing course at this stage, with the EIP rewrites etc. that it would entail, is also a major consideration.

Anlan · March 28, 2019, 1:09pm

I have no clue what you mean when you say that dynamically compiled code would be faster?

It seems pretty obvious to me in the scope of ProgPoW. The purpose of the code generator is to provide the GPU with a kernel freed from all the conditional braces which drive the “period” adjustment. I think we can all agree that

if (cond) {
    doThis();
} else {
    doThat();
}

will have an execution time higher than

doThis();

when the “program” has to be executed gazillions of times.
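
To make the contrast concrete, here is a small sketch of the two approaches: one generic kernel that selects the operation at run time, and one kernel generated for a single period with the choice already baked in. Python's exec stands in for the runtime compiler, and OPS, generic_source, specialized_source and build are hypothetical names. The sketch only shows the structural difference; it says nothing about actual GPU timings.

```python
OPS = {0: "a + b", 1: "a * b", 2: "a ^ b"}

def generic_source():
    """One kernel for all periods: the operation is chosen at run time."""
    lines = ["def kernel(a, b, op):"]
    for i, (code, expr) in enumerate(OPS.items()):
        keyword = "if" if i == 0 else "elif"
        lines.append(f"    {keyword} op == {code}:")
        lines.append(f"        return ({expr}) & 0xFFFFFFFF")
    lines.append("    raise ValueError(op)")
    return "\n".join(lines)

def specialized_source(op):
    """Kernel for one period: the operation is known at generation time,
    so no branch is emitted at all."""
    return f"def kernel(a, b):\n    return ({OPS[op]}) & 0xFFFFFFFF"

def build(source):
    """Stand-in for the runtime compiler: turn source text into a callable."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]
```

Both kernels compute the same result for a given operation; the specialized one simply contains no conditional for the compiler or hardware to deal with.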

Based on the often cited Bitcoin experience, [cut]

The matter here is less about the technical efficiency of ASICs than about the ecosystem they imply. To simplify to the extreme: if I saw ASIC machines sold on store shelves alongside GPUs, I’d have no problem with them.

Correct me if I’m wrong, but the underlying strategy with ProgPow is to force a dedicated ASIC to operate more like a GPU ASIC

You’re not wrong. That’s what ProgPoW proponents have stated from day one.

but you’ll have a hard time convincing me that added run-time complexity and increased reliance on external components doesn’t lower reliability.

I’m pretty sure I can’t convince you about anything. Nevertheless, the whole digital world relies on third-party components. It’s everywhere. A flaw in OpenSSL could do even worse damage.

jcyr · March 28, 2019, 3:21pm

You don’t really need to convince me… It’s just an idiom.

It is a generally accepted engineering principle that increased complexity lowers reliability. Granted, it is one factor of many in evaluating ProgPow. I meant the thread’s title quite literally, and it is hard to quantify (assign weights to) these factors. If this issue is indeed insignificant and is the ‘weakest’ part of the implementation, then I’d say ProgPow is in pretty good shape. I doubt very much that an audit of the actual algorithm will yield any important surprises.

jcyr · March 28, 2019, 3:34pm

Anlan:

It seems to me pretty obvious in ProgPoW scope. The purpose of code generator is to provide the GPU a kernel freed from all conditional braces which drive the “period” adjustment. I think we can all agree that
If(cond) {
    doThis();
} else {
    doThat();
}
will have an execution time higher than
doThis();

It’s not like the GPU miner could choose to do one or the other. As long as all miners have to run the same algorithm, speed doesn’t matter.

I only pointed out a technical issue.

OpenSSL is statically linked and runs on the host, not on the GPU. Need a better example.

jcyr · March 28, 2019, 3:52pm

The tangible (actionable) takeaways I get from this thread are:

Miner devs will need to test some number of ProgPow periods into the future.
Implement stricter control over driver and tooling versions.
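
The first takeaway could be sketched as a small harness that walks future periods and compares the device result against the CPU reference. This is only an outline: find_bad_periods is a hypothetical name, and the two toy functions stand in for the miner's GPU kernel and its CPU verification path.

```python
def find_bad_periods(start_period, count, run_on_device, run_on_cpu, samples=4):
    """Return the periods where the device result disagrees with the CPU reference."""
    bad = []
    for period in range(start_period, start_period + count):
        for nonce in range(samples):
            if run_on_device(period, nonce) != run_on_cpu(period, nonce):
                bad.append(period)
                break  # one mismatch is enough to flag the period
    return bad

# Toy stand-ins so the sketch runs on its own.
def cpu_reference(period, nonce):
    return (period * 1000003 + nonce * 7919) & 0xFFFFFFFF

def flaky_device(period, nonce):
    result = cpu_reference(period, nonce)
    return result ^ 1 if period == 13 else result  # simulate one miscompiled period
```

Run against each driver and tooling version a miner intends to support, a check like this would flag a problematic period before the chain reaches it.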

Anlan · March 28, 2019, 4:09pm

Yes, they can choose. If I know that for a certain period the condition will always select doThis(), I prefer to compile a kernel without the “if”, let it run for as long as necessary, and replace it with another kernel when the condition changes. Even at a period height of 5 this has advantages.
If, on the other hand, you want to apply conditional logic in the kernel, then you have to base that logic on something more volatile like, for example, the work package header.

It’s only a more “generic” example of how everything in the stack can have dependencies on third-party components. But I am sure you got the point.