Torchtune: Easily fine-tune LLMs using PyTorch
PyTorch Blog
by Facebook
1w ago
We’re pleased to announce the alpha release of torchtune, a PyTorch-native library for easily fine-tuning large language models. Staying true to PyTorch’s design principles, torchtune provides composable and modular building blocks along with easy-to-extend training recipes for fine-tuning popular LLMs on a variety of consumer-grade and professional GPUs. torchtune supports the full fine-tuning workflow from start to finish, including downloading and preparing datasets and model checkpoints, and customizing the training with composable building blocks that support different model architectures, para…
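As a rough illustration of what such a recipe automates, here is a minimal sketch of a supervised fine-tuning step in plain PyTorch; it is not torchtune's actual API, and the model and dataset are hypothetical stand-ins.

```python
# Minimal sketch of the training step a fine-tuning recipe wraps.
# Plain PyTorch for illustration only; not torchtune's API.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def finetune(model, dataset, epochs=1, lr=2e-5, device="cuda"):
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for input_ids, labels in DataLoader(dataset, batch_size=4, shuffle=True):
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)                    # (batch, seq, vocab)
            # next-token prediction: shift logits against labels by one
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100,                       # skip padded positions
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
```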
Accelerating MoE model inference with Locality-Aware Kernel Design
PyTorch Blog
by Adnan Hoque, Less Wright, Antoni Virós Martin, Chih-Chieh Yang
3w ago
Summary: We show that by implementing column-major scheduling to improve data locality, we can accelerate the core Triton GEMM (General Matrix-Matrix Multiply) kernel for MoEs (Mixture of Experts) by up to 4x on A100 and up to 4.4x on H100 NVIDIA GPUs. This post demonstrates several different work decomposition and scheduling algorithms for MoE GEMMs and shows, at the hardware level, why column-major scheduling produces the highest speedup. Repo and code available at: https://github.com/pytorch-labs/applied-ai/tree/main/kernels/triton/inference/col_major_moe_gemm. Figure 1A: Optimized Fused…
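The core idea is a remapping of Triton program IDs onto output tiles. A minimal sketch of that remapping, assuming a standard tiled GEMM with simplified arguments (this is not the repo's actual kernel):

```python
# Sketch of row-major vs. column-major tile scheduling for a tiled GEMM.
# Simplified jitted helper for illustration, not the repo's kernel.
import triton
import triton.language as tl

@triton.jit
def tile_coords(M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid = tl.program_id(axis=0)
    grid_m = tl.cdiv(M, BLOCK_M)
    grid_n = tl.cdiv(N, BLOCK_N)
    # Row-major: consecutive programs sweep across a row of output tiles,
    # touching a new tile of B (the expert weights) at every step:
    #   pid_m, pid_n = pid // grid_n, pid % grid_n
    # Column-major: consecutive programs walk down a column of output
    # tiles, so the same B tile is reused while it is still hot in cache:
    pid_m = pid % grid_m
    pid_n = pid // grid_m
    return pid_m, pid_n
```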
Maximizing training throughput using PyTorch FSDP
PyTorch Blog
by Team PyTorch at IBM and Team PyTorch at Meta
1M ago
In this blog, we demonstrate the scalability of FSDP with a pre-training exemplar, a 7B model trained for 2T tokens, and share various techniques we used to achieve a rapid training speed of 3,700 tokens/sec/GPU, or 40B tokens/day, on 128 A100 GPUs. This translates to a model FLOPS utilization (MFU) and hardware FLOPS utilization (HFU) of 57%. Additionally, we have observed near-linear scaling of FSDP to 512 GPUs, implying that training a 7B model on 512 GPUs to 2T tokens using this method would take just under two weeks. IBM researchers trained a Meta Llama 2 7B architecture to 2T tokens, which…
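For orientation, here is a sketch of how a transformer might be wrapped in FSDP with a transformer auto-wrap policy; the block class and model here are tiny placeholders, not the 7B training code from these experiments.

```python
# Sketch of FSDP-wrapping a transformer for pre-training; the block and
# model are tiny placeholders, not the actual 7B setup.
import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):                     # placeholder transformer block
    def __init__(self, d=256):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return torch.relu(self.ff(x))

dist.init_process_group("nccl")             # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(
    nn.Sequential(*[Block() for _ in range(8)]),
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={Block}
    ),
    use_orig_params=True,                   # needed for torch.compile interop
    device_id=torch.cuda.current_device(),
)
```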
PyTorch 2 paper and tutorial @ ASPLOS 2024
PyTorch Blog
by Facebook
2M ago
The PyTorch team is excited to share that our paper on PyTorch 2 has been accepted for presentation at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), scheduled to take place from April 27 to May 1, 2024, in San Diego, CA, USA. The paper delves into the implementation of torch.compile and highlights the key technologies driving it, including TorchDynamo (graph capture), TorchInductor (backend compiler), and Dynamic Shape support. During the ASPLOS conference, we’ll be conducting a tutorial on Saturday, April 27, focusing on the…
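For readers new to the stack, a single torch.compile call exercises all three pieces the paper covers: TorchDynamo captures the graph on the first call, TorchInductor generates the backend code, and dynamic shapes let one compiled artifact serve varying input sizes. A minimal example:

```python
# Minimal torch.compile example: Dynamo captures the graph on first call,
# Inductor compiles it; dynamic=True allows varying input shapes.
import torch

def mlp(x, w1, w2):
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled = torch.compile(mlp, dynamic=True)
w1, w2 = torch.randn(128, 512), torch.randn(512, 128)
out_a = compiled(torch.randn(64, 128), w1, w2)   # triggers capture + codegen
out_b = compiled(torch.randn(96, 128), w1, w2)   # reuses the compiled code
```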
What’s New in PyTorch Documentation
PyTorch Blog
by Facebook
2M ago
Greetings to the PyTorch community! Here is a quick update on PyTorch docs. In November 2023, we successfully conducted a PyTorch Docathon, a community event where PyTorch community members gathered to improve PyTorch documentation and tutorials. This event saw global participation from contributors who dedicated their time and effort to enhancing our docs. We extend our sincere gratitude to everyone involved. A key accomplishment of the Docathon was the comprehensive work carried out on docstrings. Our community contributors meticulously reviewed and improved the docstrings based on the…
PyTorch 2.2: FlashAttention-v2 integration, AOTInductor
PyTorch Blog
by Facebook
3M ago
We are excited to announce the release of PyTorch® 2.2 (release note)! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments. This release also includes improved torch.compile support for Optimizers, a number of new Inductor optimizations, and a new logging mechanism called TORCH_LOGS. Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64. Along with…
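The speedup arrives through the existing scaled_dot_product_attention API, so eligible calls pick up FlashAttention-v2 without code changes. A sketch that forces the flash backend (whether it actually runs depends on GPU, dtype, and shapes):

```python
# Sketch: scaled_dot_product_attention dispatching to the flash backend.
# Whether FlashAttention-v2 is used depends on GPU, dtype, and shapes.
import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```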
New Library Updates in PyTorch 2.2
PyTorch Blog
by Facebook
3M ago
Summary: We are bringing a number of improvements to the current PyTorch libraries, alongside the PyTorch 2.2 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch. Latest stable library versions (full list)*: TorchArrow 0.1.0, TorchRec 0.6.0, TorchVision 0.17, TorchAudio 2.2.0, TorchServe 0.9.0, TorchX 0.7.0, TorchData 0.7.1, TorchText 0.17.0, PyTorch on XLA Devices 2.1. *To see prior versions or (unstable) nightlies, click on versions in the top-left menu above ‘Search…
Accelerating Generative AI with PyTorch IV: Seamless M4T, fast
PyTorch Blog
by Yejin Lee, Carole-Jean Wu, Christian Puhrsch, Joel Schlosser, Driss Guessous, Jeffrey Wan, Joe Isaacson, Can Balioglu, Juan Pino
3M ago
This post is the fourth part of a multi-part blog series focused on how to accelerate generative AI models with pure, native PyTorch. To skip to the code, check out our GitHub repos (seamless_communication, fairseq2). We are excited to share a breadth of newly released PyTorch performance features, alongside practical examples, to see how far we can push PyTorch-native performance. In part one, we showed how to accelerate Segment Anything by over 8x using only pure, native PyTorch. In part two, we showed how to accelerate Llama-7B by almost 10x using only native PyTorch optimizations. In part three, we showed…
Accelerate PyTorch Models Using Quantization Techniques with Intel Extension for PyTorch
PyTorch Blog
by Intel
3M ago
Overview: PyTorch is a Python-based framework for developing deep learning models. It is one of the most popular industry-standard AI frameworks and is used for a wide variety of computer vision and natural language processing applications. PyTorch was developed by Meta and is now part of The Linux Foundation. Intel works with the open source PyTorch project to optimize the PyTorch framework for Intel® hardware. The newest optimizations and features are first released in Intel® Extension for PyTorch before being upstreamed into PyTorch. The Intel extension provides quantization features to deliver…
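As a generic illustration of post-training quantization in stock PyTorch (shown as a stand-in for the Intel extension's own quantization flow, whose calls are not reproduced here), dynamic quantization converts Linear weights to int8 ahead of time and quantizes activations on the fly:

```python
# Generic PyTorch post-training dynamic quantization, a stand-in
# illustration rather than the Intel extension's own API.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)
).eval()
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8    # int8 weights for Linear layers
)
out = qmodel(torch.randn(1, 512))            # activations quantized on the fly
```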
Accelerating Triton Dequantization Kernels for GPTQ
PyTorch Blog
by Less Wright, Adnan Hoque (IBM)
3M ago
TL;DR: Leveraging a first-principles approach, we showcase a step-by-step process undertaken to accelerate the current Triton GPTQ kernels by 3x (core GPTQ) and 6x (AutoGPTQ). Example: from 275 µs to 47 µs on a typical Llama-style inference input. The goal is to provide a helpful template for accelerating any given Triton kernel. We provide background on Triton and the GPTQ quantization and dequantization process, showcase the impact of coalesced memory access on shared and global memory throughput, highlight changes made to reduce warp stalling and improve total throughput, and give an overview on in…
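To make the coalescing point concrete, here is a simplified Triton dequantization sketch in which consecutive threads load consecutive packed int32 words; the group size, memory layout, and zero-point handling are assumptions for illustration, not the post's actual kernel:

```python
# Simplified int4 dequantization in Triton: contiguous offsets give
# coalesced global loads. Layout and zero-point are assumptions.
import triton
import triton.language as tl

@triton.jit
def dequant_int4(qweight_ptr, scales_ptr, out_ptr, n_packed,
                 GROUP: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)      # contiguous -> coalesced
    mask = offs < n_packed
    packed = tl.load(qweight_ptr + offs, mask=mask)         # 8 int4s per int32
    scale = tl.load(scales_ptr + offs // GROUP, mask=mask)  # per-group scale
    for i in tl.static_range(8):                  # unpack all eight nibbles
        q = (packed >> (i * 4)) & 0xF
        val = (q.to(tl.float32) - 8.0) * scale    # assumed zero-point of 8
        tl.store(out_ptr + offs * 8 + i, val, mask=mask)
```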
