Forcing Flash Attention onto a TPU and Learning the Hard Way (archerzhang.me)

FL33TW00D 1 day ago

Why ruin good work by letting Claude write it all? Full of em dashes, riddled with Claudisms.

gdiamos 1 day ago

I personally don't mind letting Claude write about work.

You could spend 80% doing the work and 20% writing about it, or 99% doing the work and 1% copy-pasting Claude's writeup about it into a blog.

There is nothing wrong with writing if you are into it, and yes, you can probably do better than Claude, but I can relate to engineers who just want to build.

spzb 1 day ago

If you can’t be bothered to write it, why should I bother to read it?

cannonpr 1 day ago

Because it contains information of value to you? I mean if it doesn’t, just don’t read it.

Orygin 16 hours ago

To quote another HN comment recently made:

> Using AI to write content is seen so harshly because it violates the previously held social contract that it takes more effort to write messages than to read messages. If a person goes through the trouble of thinking out and writing an argument or message, then reading is a sufficient donation of time.

> However, with the recent chat based AI models, this agreement has been turned around. It is now easier to get a written message than to read it. Reading it now takes more effort. If a person is not going to take the time to express messages based on their own thoughts, then they do not have sufficient respect for the reader, and their comments can be dismissed for that reason.

cannonpr 12 hours ago

To a large extent I appreciate that argument, but I feel it applies more to throwaway comments or sales outreach, writing with low information density. In this case a lot of work went into it, and it would otherwise be lost or inaccessible to me. I am genuinely grateful someone stuck their work in an LLM, said "tidy this up to post", and hit enter.

selfhoster11 1 day ago

I could spend 100% doing the work with my own Claude, and 0% reading yours. That's a negative-sum outcome. I do think that the 80%/20% split is better (though anything that is mostly human voice is fine for me).

Groxx 1 day ago

Because the failures are so frequent, and so often load-bearing, that it becomes negative-sum to even attempt to read anything that appears generated.

jafioti 1 day ago

I did a quick scroll and was happy to see a long-ish article on XLA and TPUs. Then I realized it was literally just "using vmap for parallel loops is better than fori", but in massively wordy Claudisms.
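For readers who haven't met that distinction: a rough, hypothetical JAX sketch of the vmap-vs-fori_loop point the comment paraphrases (the `score` function and the shapes here are invented for illustration, not taken from the article):

```python
import jax
import jax.numpy as jnp


def score(q, k):
    # One independent unit of work: dot-product scores of a single
    # query row against every key row.
    return q @ k.T


keys = jnp.ones((128, 64))
queries = jnp.ones((32, 64))

# Sequential version: fori_loop threads a carry through each iteration,
# so XLA must lower it as a loop and cannot parallelize across rows.
def loop_body(i, out):
    return out.at[i].set(score(queries[i], keys))

seq = jax.lax.fori_loop(0, queries.shape[0], loop_body,
                        jnp.zeros((32, 128)))

# Batched version: vmap lifts the per-row function into one batched
# operation, which XLA can fuse and run in parallel on the accelerator.
par = jax.vmap(lambda q: score(q, keys))(queries)

assert jnp.allclose(seq, par)
```

Both produce the same (32, 128) result; the difference is only in how XLA is allowed to schedule the work.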

skybrian 1 day ago

Why let an obsession with writing style prevent you from learning from a reasonably decent writeup?

JSR_FDED 1 day ago

He’s doing the author a favor.

gdiamos 1 day ago

One of my lessons from using different accelerators, whether different NVIDIA generations or a move from GPU to TPU, is that someone needs to do the work of indexing, partitioning, mapping, scheduling, and benchmarking. That work is labor intensive.

In this case, Google has already done it, and that will generally be true when a well-resourced accelerator company like Google is working with the most popular operations, like attention.

As long as you use those operations, you are okay. But if you do something different, you need to be prepared to do all of this yourself.

ColonelPhantom 1 day ago

Interesting read! One remark though: I'm not too familiar with the architecture of a Google TPU, but comparing the TPU's VMEM with Nvidia's shared memory feels wrong to me.

Looking at its size and its shared nature, it feels far more natural to compare it with the L2 cache, which is also shared across the entire GPU and is of the same order of size (40 MB on the listed A100).

refulgentis 1 day ago

It broke my heart to have a visceral "I'm being slop'd" reaction reading this: it's such good work, and AI is barely used AFAICT, but there are enough odd transitions and copy-pasta'd markdown that you get the subconscious "this is AI" reaction regardless.

Many sentences are 3x as long as they normally would be, in subtle ways (to wit: "My flash attention is 35x slower than the fused standard at n=4096. Not a little worse. Catastrophically worse."), and it really wears on attention (pun intended). It brings literary voice to a technical blog post, and a very difficult, process-oriented technical blog post at that. I have to reallocate my unfortunately limited brain cells from "maintaining state of where we are in the process" to "is this cutesy fluff or important?", and I've never had to do that in 37 years with technical blog posts.

The Markdown gets bad. Bolding is used for important phrases (like a human would), then, all of a sudden, after the "Inside a TPU chip" header, it's being used every other sentence, on anything that is a proper noun or would have a Wikipedia article. It got so weird that at some point I thought "a human definitely didn't let this through... they must be links?" and tried clicking them.

It's doubly bad at that point, because markdown tables start coming in hot and heavy too. So you're left with "it's pretty apparent the LLM did it from here, and I can't keep the state of the process in my head while also figuring out whether the bolding is important", and a reflexive close-tab.

jacquesm 1 day ago

You got a lot further than I did.