Generally speaking, addition will always involve some sort of looping construct. This isn't really a problem though, as these loops tend to be bounded (it's known ahead of time how many iterations they'll have at most), which means you can unroll them (e.g. build several units when doing subframe). My point is that what really matters for a case like TPT's is how "small" (in terms of particle count) you can make your iterations, not how many of them there are.
What many people don't know about this algorithm is that it's guaranteed to terminate after as many iterations as there are bits in your numbers. Proof for A' = (A & B) << 1 and B' = A ^ B, where both A and B are n-bit numbers and the termination condition is A* = 0: A initially has n bits that may or may not be populated. Bitwise-anding it with B either doesn't touch this set of bits or removes bits from it, but never adds any. Shifting left by 1 definitely removes a bit from this set, decreasing the set's size. This means that after n iterations, its size must be 0, therefore there can't be any bits left in this set, so A = 0.