Use the Lightning paradigm for gradient accumulation instead of the one currently implemented
Our current method for gradient accumulation keeps the configured batch-size and asks for a number that evenly divides it, which is then used to break the batch into smaller chunks. This approach has some downsides, the first being that it requires a divisibility check and some complex decision making around it. The advantage is that the batch-size can be "compared" across setups, since it always refers to the number of samples per update.
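To make the downside concrete, here is a minimal sketch of the kind of check the chunk-based approach needs. The function name `chunk_size` and its parameters are hypothetical and are not taken from our code base; the sketch only illustrates the divisibility constraint.

```python
def chunk_size(batch_size: int, num_chunks: int) -> int:
    """Size of each chunk when one batch is split into num_chunks pieces.

    Hypothetical helper: the batch-size stays fixed and is divided, so
    num_chunks has to divide batch_size evenly.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if batch_size % num_chunks != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by num_chunks={num_chunks}"
        )
    return batch_size // num_chunks


print(chunk_size(32, 4))  # 8 samples per chunk, 32 samples per update
```

A call such as `chunk_size(32, 5)` raises `ValueError`, and that is the case the program currently has to detect and handle.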
Lightning proposes a different approach, which I find much simpler: the batch-size is independent of the number of batches over which gradients are accumulated before an update. If you choose acc-batches = 1, no gradient accumulation happens. If you choose acc-batches = 2, every 2 batches correspond to one update. There is no need to check for divisibility here, because acc-batches is a multiplier rather than a divisor. The "downside" is that the effective batch-size becomes "batch-size * acc-batches". However, that is something the user can handle easily, and it would make the logic in our program simpler.
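A minimal sketch of the multiplier-based bookkeeping, in plain Python. The names `effective_batch_size` and `is_update_step` are hypothetical helpers written for this issue, not Lightning functions; in Lightning itself, as far as I know, this is configured through the `accumulate_grad_batches` argument of `Trainer`. The sketch also ignores what happens to a leftover partial group at the end of an epoch.

```python
def effective_batch_size(batch_size: int, acc_batches: int) -> int:
    """Number of samples that contribute to a single optimizer update."""
    if acc_batches < 1:
        raise ValueError("acc_batches must be at least 1")
    return batch_size * acc_batches


def is_update_step(batch_idx: int, acc_batches: int) -> bool:
    """True when the optimizer should step after the batch at batch_idx (0-based)."""
    return (batch_idx + 1) % acc_batches == 0


# With acc-batches = 2, every second batch triggers an update.
update_steps = [i for i in range(6) if is_update_step(i, 2)]
print(update_steps)                 # [1, 3, 5]
print(effective_batch_size(16, 2))  # 32
```

Any positive integer is a valid `acc_batches`, so the only validation left is `acc_batches >= 1`.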