Training a byte-level discrete diffusion language model to generate palindromes and comparing it with autoregressive approaches.
Understanding per-sample gradients and their applications in data attribution and influence functions.
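
As a minimal sketch of what "per-sample gradients" means, the example below computes one gradient vector per training example instead of a single batch-averaged gradient. It assumes a toy linear model with squared-error loss so the gradient can be written by hand; the function names and the data are hypothetical and chosen only for illustration. The attribution score shown is a plain gradient dot product, which is a first-order simplification and not a full influence function (those also involve an inverse-Hessian term).

```python
# Toy setting (an assumption, not taken from the text above):
# a linear model y_hat = w . x with loss_i = (w . x_i - y_i)^2.
# All function names here are hypothetical.

def per_sample_grads(w, xs, ys):
    """Return one gradient vector per example.

    For the squared-error loss, d(loss_i)/dw = 2 * (w . x_i - y_i) * x_i.
    """
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        grads.append([2.0 * residual * xj for xj in x])
    return grads


def mean_grad(grads):
    """Average the per-sample gradients; this equals the gradient of the mean loss."""
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def dot_product_scores(train_grads, test_grad):
    """Score each training example by the dot product of its gradient with a test gradient.

    A positive score means that, to first order, a gradient step on that
    training example would also lower the test loss.
    """
    return [sum(a * b for a, b in zip(g, test_grad)) for g in train_grads]


w = [1.0, -1.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, 0.0, 0.0]

train_grads = per_sample_grads(w, xs, ys)
batch_grad = mean_grad(train_grads)

# Hypothetical test point that resembles the first training example.
test_grad = per_sample_grads(w, [[1.0, 0.0]], [2.0])[0]
scores = dot_product_scores(train_grads, test_grad)
print(scores)
```

In this toy data the first training example receives the largest score because its gradient points the same way as the test gradient, while the third example is already fit exactly and contributes a zero gradient. For neural networks the per-example gradients would normally come from an autodiff library rather than a hand-written formula.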