Atheb a day ago

You've got to give it to the PyTorch team: they're really great at bringing complex optimization schemes (mixed precision, torch.compile, etc.) down to a simple-to-use API. I'm glad I moved from TF/Keras to PyTorch around 2018-2019 and never looked back. I'm eager to try this as well.

  • ansk a day ago

    I've seen and ignored a lot of "pytorch good, tensorflow bad" takes in my time, but this is so egregiously wrong I can't help but chime in. Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception. The XLA compiler was designed in close collaboration with the tensorflow team and was available in the tensorflow API as far back as 2017. It's not an exaggeration to say that pytorch is 5+ years behind on this front. Before anyone invokes the words "pythonic" or "ergonomic", I'd like to note that the tensorflow 2 API for compilation is nearly identical to torch.compile.
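
    For concreteness, the two opt-in compilation APIs look roughly like this (a minimal sketch; the toy function and shapes are made up):

        import tensorflow as tf
        import torch

        # TF2: trace the Python function into a graph and let XLA compile it
        @tf.function(jit_compile=True)
        def tf_step(x):
            return tf.nn.relu(x @ tf.transpose(x))

        # PyTorch 2: capture the same function with TorchDynamo and compile it
        @torch.compile
        def torch_step(x):
            return torch.relu(x @ x.T)

        print(tf_step(tf.random.normal((4, 4))))
        print(torch_step(torch.randn(4, 4)))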

    • brrrrrm a day ago

      It's not about the API. It's about the documentation + ecosystem.

      TF's documentation doesn't seem very good. I just tried to figure out how to learn a linear mapping with TF and went through this:

      1. googled "linear layer in tensorflow" and got to the page about linear.

      2. spent 5 minutes trying to understand why monotonicity would be a central tenet of the documentation

      3. realizing that's not the right "linear", I couldn't think of what the appropriate name would be

      4. I know MLPs have them, google "tensorflow mlp example"

      5. click the apr '24 page: https://www.tensorflow.org/guide/core/mlp_core

      6. read through 10[!] code blocks that are basically just boilerplate setup of data and visualizations, entirely unrelated to MLPs

      7. realize they call it "dense" in tensorflow world (see the sketch after this list)

      8. see that "dense" needs to be implemented manually

      9. think that's strange, google "tensorflow dense layer"

      10. find a keras API (https://www.tensorflow.org/api_docs/python/tf/keras/layers/D...)
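
      For anyone landing here from the same search, the mapping is roughly this (a minimal sketch; layer sizes are arbitrary):

          import tensorflow as tf
          import torch

          # What TF/Keras calls a "Dense" layer...
          keras_linear = tf.keras.layers.Dense(units=8, use_bias=True)

          # ...is what PyTorch calls a "Linear" layer.
          torch_linear = torch.nn.Linear(in_features=16, out_features=8, bias=True)

          y_keras = keras_linear(tf.random.normal((2, 16)))  # Keras infers in_features on first call
          y_torch = torch_linear(torch.randn(2, 16))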

      • mochomocha a day ago

        11. notice that there's a unicode rendering error ("&#039;" for apostrophe) in the kernel_initializer and bias_initializer default arguments in the documentation, and wonder why on earth, for such a high-level API, one would want to expose lora_rank as a first-class construct. Also, 3 out of the 5 links in the "Used in the guide" section point to TF1-to-TF2 migration articles; TF2 was released 5 years ago.

      • n_u a day ago

        To add onto this, I feel like one of the hard things about TF is that there are at least 3 ways to do everything, because they have supported multiple APIs and migrated to eager execution. So if you find an example or an open-source project, it might not be for the flavor of TensorFlow that your codebase is in.

        • __rito__ a day ago

          Moreover, the way you find might not be the best or the most efficient way.

      • __rito__ a day ago

        Re 6: the TF/Keras team encourages random people to write long tutorials to be featured on the official site and have their tutorial included in the official guides. I have seen a lot of subpar devs/AI people write subpar tutorials and brag on Twitter about how their tutorials are included on the official Keras site.

        I have seen some good ones, too, of course.

      • shmel 21 hours ago

        Oh god, you just gave me a flashback =) The last time I properly used TF was in early 2019; I am so happy that I don't have to deal with this anymore.

      • mft_ a day ago

        Honestly, this example holds true for roughly half of the Python ecosystem; and you can square the level of frustration if it's also anything coming from Google.

        (This pattern is relatively easy to understand: smart people creating something get their gratification from the creation process, not writing tedious documentation; and this is systemically embedded for people at Google, who are probably directly incentivised in a similar way.)

      • exe34 a day ago

        I feel like that with every single Google API doc. If there's a variable called x, the documentation will be "variable to store x". And you need to create/supply 5 different resources before you can create an x, but each of these will require 5 further things to be figured out before you can create one of them.

        • pjmlp 18 hours ago

          One of the reasons I am happy to no longer do Android: GitHub samples as "documentation".

    • marcinzm 16 hours ago

      Tensorflow works really well in theory. In practice a lot less so. I saw someone spend months fighting Tensorflow to convert a production model from CPU to GPU inference with any sort of efficiency. Tons of issues due to bugs across versions, deprecations of features across versions, the graph optimizer shuffling data back to the CPU for no decent reason, etc. The person had no idea what was happening or why most of the time due to how black box Tensorflow was. This was a very senior ML engineer with a lot of Tensorflow experience.

    • dekhn 12 hours ago

      Does tensorflow have a future? I doubt it. I don't think Google is really investing many resources into it (beyond the necessary maintenance to support whatever production models still depend on it). The cost of migrating from old TF to new TF was really large; half the projects that depend on TF that I try to use just break out of the box (only 1/4 of torch projects I try fail that way).

      From what I can tell Google is moving in a direction that doesn't require tensorflow, and I don't see it gaining significant adoption outside Google, so it seems most likely we will simply see it deprecated in about 10 years. It's best to see it as a transitional technology that Jeff Dean created to spur ML development internally, which was mistakenly open-sourced, and now Jeff's reports typically use Jax or other systems.

    • lgessler 16 hours ago

      GP wrote "simple to use API". You can attribute many qualities to TensorFlow, but this is not one of them.

    • zozbot234 14 hours ago

      > Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception.

      Agreed of course but it's not like they came up with this approach from scratch. They seem to have just picked it up from Theano (now Aesara/PyTensor).

    • catgary a day ago

      I think tensorflow-datasets and tensorflow-serving are great, but for model development I think most people use JAX and then export it to a tensorflow SavedModel with Orbax.

      • ithkuil a day ago

        But IIUC Jax also leverages XLA, and for the purpose of this discussion the frontend matters only inasmuch as people feel productive using it, whether that's TF or Jax.

    • whymauri a day ago

      I'm so sorry but Tensorflow is simply one of the worst parts of my job.

    • uoaei 15 hours ago

      Praising XLA by defending Tensorflow of all things has to be one of the strangest takes I've ever come across.

      JAX is right there. No need to beat a dead horse when there's a stallion in the stables.

      • ansk 13 hours ago

        Tensorflow is a lot like IBM -- it deserves praise not because it's great in its current state, but for its contributions towards advancing the broader technological front to where it is today. Tensorflow walked so JAX could run, so to speak. Frankly, I don't really draw much of a distinction between the two frameworks since I really just use them as lightweight XLA wrappers.

        • uoaei 11 hours ago

          Tensorflow started out as anything but lightweight. In my opinion it takes the cake for the kludgiest framework I've ever worked with. So verbose, so little effort put into ergonomics. Even eager mode is not really valuable unless you're working on a legacy project.

    • YetAnotherNick 11 hours ago

      +1. As someone who has tried to migrate multiple tf.function calls to torch.compile, tensorflow's edge here is not small. torch.compile is still highly experimental. Don't believe me? Just go and look at the GitHub issues where torch maintainers try to figure out why torch.compile makes code very suboptimal in a lot of cases, or why it results in incomprehensible errors.

  • yablak 13 hours ago

    Best way to use tensorflow is by writing models in Jax.

formalsystem a day ago

Hi! I'm Mark from the PyTorch team at Meta, and I work on torchao. If you have any questions about the library, or really anything at all about performance, don't hesitate to ask!

  • necovek a day ago

    Great stuff!

    A minor nitpick on the copy (and even then, it might just be me): I find "97% speedup" and "50% speedup" really hard to parse, whereas a "30x speedup" or a "97% reduction in time taken" immediately tells me what is being achieved!

    Great results once I get my head around them, though!

    • IanCal a day ago

      Fwiw I'm pretty sure 97% speedup is 197% of the speed of the baseline, so roughly double.
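
      If that reading is right, the arithmetic is roughly this (made-up baseline time, just to relate the two phrasings):

          baseline_time = 10.0                 # seconds for some fixed workload (made-up number)
          speedup = 1.97                       # "97% speedup" read as +97% throughput
          new_time = baseline_time / speedup   # ~5.08 s
          time_saved = 1 - new_time / baseline_time  # ~0.49, i.e. ~49% less wall-clock time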

      • necovek 20 hours ago

        That's why it's confusing: "2x speedup" would clearly indicate 200% of the current speed, whereas "97% speedup" could be a multiple (surely not, since 0.97x would be a slowdown), a reduction in time (which was my assumption), or an increase in speed (something per unit of time).

        I guess you are right and it's probably the latter, but obviously better language would have avoided any doubt.

        • elcomet 18 hours ago

          I understand it as "the speed increases by 97%".

          • formalsystem 14 hours ago

            Yeah, indeed the choice of language might not be ideal; it seems like the "2x" phrasing is clearest to folks? I can make some quick edits to the article.

  • DhawalModi a day ago

    Hi Mark, the library looks cool, excited to try it out. Coincidentally, I am starting work on a project that is investigating a lot of post-training quantization methods. I read the blog and I am curious to understand what kind of overheads are involved in quantizing a layer?

    • formalsystem a day ago

      There's a bunch of overhead associated with PTQ, but the TL;DR is that much of that overhead goes away when you're using `torch.compile()` and `torchao.autoquant()`.

      Essentially the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small because, by quantizing your weights for example, you reduce memory bandwidth pressure; but for small layers the overhead of potentially looking up a table, reading scaling factors, quantizing/dequantizing, and finally handling zero points might not be worth it.
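
      As a rough illustration of where that per-tensor work lives, a plain-PyTorch affine quant/dequant round trip looks something like this (not our actual kernels, just the arithmetic involved):

          import torch

          def quantize_uint8(x: torch.Tensor):
              # per-tensor affine quantization: compute a scale and zero point...
              scale = (x.max() - x.min()) / 255.0
              zero_point = (-x.min() / scale).round().clamp(0, 255)
              q = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)
              return q, scale, zero_point

          def dequantize_uint8(q, scale, zero_point):
              # ...and undo it before (or fused into) the matmul
              return (q.float() - zero_point) * scale

          w = torch.randn(4096, 4096)
          q, s, zp = quantize_uint8(w)
          w_hat = dequantize_uint8(q, s, zp)  # extra reads of s/zp plus extra math = the overhead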

      However, even if such overhead exists you can still quantize your model and get it to be smaller; the problem is that it might not be faster. We solve the speed problem in 2 ways: `torch.compile()` will fuse operations like a dequant and a matmul into a single kernel, and `torchao.autoquant()` will do kernel-level profiling to see whether a layer is actually made faster by quantizing, and if not it skips quantizing that layer.
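
      Putting the two together looks roughly like this (a minimal sketch; `MyModel` and `example_input` are placeholders, and exact API details can shift between releases):

          import torch
          import torchao

          model = MyModel().cuda().eval()      # placeholder for your own nn.Module

          # compile first so dequant + matmul can be fused into one kernel,
          # then let autoquant profile each layer and skip ones that don't get faster
          model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

          with torch.no_grad():
              out = model(example_input)       # first call triggers profiling + compilation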

      • DhawalModi a day ago

        I see, thank you for the explanation!

  • dark__paladin a day ago

    First off, well done, this looks exciting. I haven't had a chance to interact with the library yet. Should torchao be seen as a dev-friendly quantization interface? I.e., if my team was working on new quantization techniques, does the API provide easy tooling for implementing and benchmarking new quantization algorithms? Or is this closer to a "toolbox of (generally) finished products"?

    • formalsystem a day ago

      It's both! For this blog we decided to discuss our best end-user-facing numbers to keep things simple. We briefly hint at our contributor guide here, https://github.com/pytorch/ao/issues/391, which gives a tour of the APIs we provide to developers implementing new algorithms.

      But we have had quantization algorithm developers, such as HQQ or Autoround, merge their code in to get composability and serialization for free. We view quantization algorithms as the top layer; going down, you have quantized tensors, quant primitives like dequant/quant, and finally basic dtypes like uint1-7 and float3-8. Personally, the reason I spent so much time on AO was that I was hoping we could make it easier for people to express their quantization algorithms in easy-to-read PyTorch code, and if they must use custom kernels we also have some tutorials on how to integrate custom CUDA and Triton ops.
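
      At the top layer, applying a specific technique looks roughly like this (a sketch; `MyModel` is a placeholder and exact import paths can differ between releases):

          import torch
          from torchao.quantization import quantize_, int4_weight_only

          model = MyModel().cuda()             # placeholder module

          # top layer: a quantization "algorithm" applied in place to eligible layers,
          # built on top of the quantized tensor / quant primitive layers below it
          quantize_(model, int4_weight_only())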

      Most of those discussions have been happening in #torchao on discord.gg/gpumode, so if you need to chat back and forth feel free to reach out to the team there; otherwise GitHub also works.

  • soulofmischief a day ago

    Thanks for the hard work, any idea what the roadmap is for MPS support?

    • formalsystem a day ago

      Most of our performance relies on leveraging torch.compile, which generates Triton kernels that run fast on CPU and GPU but not MPS, since Triton does not support generating Metal kernels. So you lose the nice story of writing low-bit code in pure PyTorch while also getting it to run fast.

      In these cases the only path forward we have is writing custom Metal kernels and plugging those in. That work is still ongoing and we'll hopefully have more to share soon.

      • underanalyzer a day ago

        This might not be the right place for this question but, as someone who has made a couple of very modest MPS backend contributions, I'm curious: why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate backend?

        • formalsystem a day ago

          It mostly comes down to what's fastest to develop: it's faster to write a few custom kernels than it is to develop a new compiler backend.

          Granted, after the upfront effort, compilers are just such a significant UX boost that you are indeed making me question why I don't spend more time working on this myself lol

  • darkninja 17 hours ago

    Hi Mark, I wanted to know if float4 training is possible with torchao, as we are trying to fit a large model on a single GPU for training.

  • OutOfHere a day ago

    Why don't they merge this into Pytorch? Why so many packages?

    • formalsystem a day ago

      There are different tradeoffs: spinning up a separate repo is what we call "out of core", vs. having everything in PyTorch, "in core".

      Basically, PyTorch is a large library where CI takes a long time to run, which means merging code is hard, adding new dependencies is challenging, and there are stringent constraints on BC-breaking changes.

      Instead, what torchao did, and what many other repos like torchtune, torchchat, and torchtitan did, was move out of core. It helps keep the core PyTorch library leaner with a smaller binary size, and it really lets the out-of-core team focus on optimizing for their needs.

      Unfortunately, which option is better changes over time. For example, torch.compile initially lived in a new repo called torchdynamo that was built out of core to move fast, but it was eventually merged back because everyone wanted to use it. torch.compile dev velocity is still quite fast, so now we have to tell people to use nightlies instead of official stable releases, and some people have asked me why we don't move torch.compile out of core.

      My 2c is that the ecosystem will be much stronger and teams can move faster if they develop out of core, so that's the tradeoff we picked for torchao. For example, we managed to merge a few custom C++ kernels, like fp6 or Marlin, that would have been challenging to motivate in core since those are still quite experimental and need to stand the test of time.

tomrod a day ago

This is a cool project! Understanding lower bits is still on my to-do list; perhaps I'll spin this up for a go.

majke 18 hours ago

Pardon my ignorance, but how do matrix operations on quantized data work? Is hardware support needed?

AFAIU int4 matrix multiplication is supported by CUDA, but I'm not sure about other operations. The blog post mentioned fp6, and I don't think this is supported by CUDA. Or maybe the data are upcast to something common like fp16 before doing the math?

  • formalsystem 14 hours ago

    It's a great question! Int4 is an easy one to understand. PyTorch supports int8 but not int4, so what you can do is "pack" 2 int4 values into a single int8 value. You still get speedups even without hardware support because you're sending less data to the GPU, and workloads like small-batch-size LLM inference are memory-bandwidth bound rather than compute bound. So indeed your intuition is correct: you pack the values, and before doing a matmul you "unpack" them back into an int8 and then upcast to fp16 to do the matmul.
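
    A bare-bones version of the pack/unpack trick looks something like this (illustrative only; the real kernels fuse the unpack into the matmul):

        import torch

        def pack_int4(x: torch.Tensor) -> torch.Tensor:
            # x holds values in [0, 15]; squeeze pairs of them into one uint8
            assert x.numel() % 2 == 0
            x = x.to(torch.uint8).reshape(-1, 2)
            return (x[:, 0] << 4) | x[:, 1]

        def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
            # reverse: pull the high and low nibbles back out
            high = (packed >> 4) & 0xF
            low = packed & 0xF
            return torch.stack([high, low], dim=1).reshape(-1)

        w = torch.randint(0, 16, (8,), dtype=torch.uint8)
        packed = pack_int4(w)                              # half the bytes to move around
        w_fp16 = unpack_int4(packed).to(torch.float16)     # upcast right before the matmul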

Evidlo 21 hours ago

> We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low bit dtypes

Will this let me use uint8 arrays as indexing arrays? A problem I have is that pytorch forces me to use uint64 for fancy indexing.

CalChris a day ago

Is this what Mojo is supposed to be?