dysfun@treehouse.systems ("gaytabase") wrote:
what am i doing? oh, remember that silly idea to generate code that does an optimal node lookup for each node in a data structure? well, as it turns out, one of my friends has already done that and it's not completely impractical.
anyway, i was looking at how we could cut down the number of versions of the code that would need to be generated for reading a HOT node. maybe we could get the entropy small enough that we could reasonably just generate all of them in advance?
well, if you compress the entropy enough, we can do it with our current setup and not need the code generation at all! i've now managed to do that in full for avx-512, even with arbitrary string keys: i only need 4 versions for 4 node types! this is of course only possible because AMD had fixed their slow pext/pdep by zen 3, before they supported avx-512 at all. otherwise i would have a real mess on my hands.
avx2 is harder owing to the lack of predication and the smaller register size, but if i manage to figure it out, it's only 5 versions. SSE is a fairly abysmal 9 versions: incredibly tedious to write, but maybe workable. i say maybe because this doesn't account for the mess induced by the vast number of ways you can implement pext/pdep. even if i only go back as far as zen 2 (which i'm currently using, so i will go back at least that far...), it still has the slow pext/pdep, so i'll have to provide twice as many implementations.
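for anyone who hasn't met pext/pdep: here's a rough reference sketch of what the two operations compute, written as plain bit loops in python. this is not the code from the project, just one of the many possible portable implementations, and the function names `pext` and `pdep` are only chosen here to mirror the instruction names.

```python
def pext(value: int, mask: int) -> int:
    """Parallel bit extract: gather the bits of `value` selected by `mask`
    and pack them contiguously into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1           # clear that bit and move on
    return result


def pdep(value: int, mask: int) -> int:
    """Parallel bit deposit: scatter the low bits of `value` into the
    positions selected by `mask` (the inverse direction of pext)."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask      # next destination position
        if (value >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result
```

each loop runs once per set bit in the mask, which gives a feel for why a software (or microcoded) version is so much slower than the single fast instruction, and why the slow case probably wants its own separate code path.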
next year will be the 10-year anniversary of zen, maybe i can talk myself into believing that this is old enough that no one would give a shit about older hardware. i wouldn't be talking about no support, it would just fall back to a portable algorithm and probably be quite slow. it would certainly save implementing SSE versions...