The Return of the Frame Pointers
Brendan Gregg's Blog
3M ago
Sometimes debuggers and profilers are obviously broken, sometimes it's subtle and hard to spot. From my flame graphs page: [Figure: CPU flame graph (partly broken)] This is pretty common and usually goes unnoticed as the flame graph looks ok at first glance. But 15% of samples on the left, above "[unknown]", are in the wrong place and missing frames. The problem is that this system has a default libc that has been compiled without frame pointers, so any stack walking stops at the libc layer, producing a partial stack that's missing the application frames. These ...
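To make the failure mode above concrete, here is a toy sketch of frame-pointer stack walking. It is not how any real profiler is implemented: the `Frame` class, the `walk_stack` helper, the addresses, and the function names are all hypothetical, and the assumption is simply that a frame built without a frame pointer leaves an unusable value where the walker expects the caller's frame address.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Frame:
    function: str            # hypothetical symbol for this frame
    saved_fp: Optional[int]  # address of the caller's frame, None at the stack base


def walk_stack(frames: Dict[int, Frame], fp: Optional[int], max_depth: int = 64) -> List[str]:
    """Follow saved frame pointers from the innermost frame outward."""
    stack: List[str] = []
    while fp is not None and fp in frames and len(stack) < max_depth:
        frame = frames[fp]
        stack.append(frame.function)
        fp = frame.saved_fp  # a bogus value here ends the walk early
    return stack


# Every frame keeps a valid frame pointer: the walk reaches main.
with_fp = {
    0x100: Frame("main", None),
    0x080: Frame("app_handler", 0x100),
    0x060: Frame("read", 0x080),
}

# Assumption: the libc frame was built without frame pointers, so the slot
# holds an arbitrary value (0xDEAD here) that is not a valid frame address.
without_fp = {
    0x100: Frame("main", None),
    0x080: Frame("app_handler", 0x100),
    0x060: Frame("read", 0xDEAD),
}

full_stack = walk_stack(with_fp, 0x060)
partial_stack = walk_stack(without_fp, 0x060)
```

In this model `full_stack` contains the libc frame plus both application frames, while `partial_stack` stops after the libc frame, which is the kind of truncated stack that ends up misplaced in a flame graph.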
EBPF Documentary
eBPF is a crazy technology – like putting JavaScript into the Linux kernel – and getting it accepted had so far been an untold story of strategy and ingenuity. The eBPF documentary, published late last year, tells this story by interviewing key players from 2014 including myself, and touches on new developments including Windows. (If you are new to eBPF, it is the name of a kernel execution engine that runs a variety of new programs in a performant and safe sandbox in the kernel, like how JavaScript can run programs safely in a browser sandbox; it is also no longer an acronym.) The documentary ...
USENIX SREcon APAC 2022: Computing Performance: What's on the Horizon
At USENIX SREcon22 APAC I gave the opening keynote on the future of computer performance, rounding up the latest developments and making predictions of where I see things heading. This talk originated from my updates to Systems Performance 2nd Edition, and this was the first time I've given this talk in person! The video is now on YouTube, and the slides are online and as a PDF. In Q&A I was asked about CXL (compute express link), which was fortunate as I had planned to cover it and then forgot, so the question let me talk about it (although Q&A is missing from the ...
USENIX SREcon APAC 2023: CFP
USENIX's SREcon conference is the best venue for learning the latest in systems engineering (not just site reliability engineering) and if you have useful production stories and takeaways to share -- especially if you are in the Asia/Pacific region -- please consider submitting a talk proposal to SREcon APAC 2023. The call for participation ends on March 2nd, only two weeks away. It is held this year in Singapore, June 14-16, and I'm excited to be program co-chair with fellow Aussie Jamie Wilkinson. To quote from our CFP: You build computer platforms, debug them, and support them, and you have ...
Brendan@Intel.com
I'm thrilled to be joining Intel to work on the performance of everything, apps to metal, with a focus on cloud computing. It's an exciting time to be joining: The geeks are back with Pat Gelsinger and Greg Lavender as the CEO and CTO; new products are launching including the Sapphire Rapids processor; there are more competitors, which will drive innovation and move the whole industry forward more quickly; and Intel are building new fabs on US soil. It's a critical time to join, and an honour to do so as an Intel fellow, based in Australia. My dream is to turn computer performance analysis int ...
Netflix End of Series 1
A large and unexpected opportunity has come my way outside of Netflix that I've decided to try. Netflix has been the best job of my career so far, and I'll miss my colleagues and the culture. [Images: offer letter logo (2014); flame graphs (2014); eBPF tools (2014-2019); PMC analysis (2017); my pandemic-abandoned desk (2020); office wall] I joined Netflix in 2014, a company at the forefront of cloud computing with an attractive work culture. It was the most challenging job among those I interviewed for. On the Netflix Java/Linux/EC2 stack there were no working mixed-mode flame graphs, no production sa ...
TensorFlow Library Performance
A while ago I helped a colleague, Vadim, debug a performance issue with TensorFlow in an unexpected location. I thought this was a bit interesting so I've been meaning to share it; here's a rough post of the details. 1. The Expert's Eye Vadim had spotted something unusual in this CPU flame graph (redacted); do you see it? I'm impressed he found it so quickly, but then if you look at enough flame graphs the smaller unusual patterns start to jump out. In this case there's an orange tower (kernel code) that's unusual. The cause I've highlighted here: 10% of total CPU time in page faults. At Netf ...
Why Don't You Use ...
Working for a famous tech company, I get asked a lot "Why don't you use technology X?" X may be an application, programming language, operating system, hypervisor, processor, or tool. It may be because: It performs poorly. It is too expensive. It is not open source. It lacks features. It lacks a community. It lacks debug tools. It has serious bugs. It is poorly documented. It lacks timely security fixes. It lacks subject matter expertise. It's developed for the wrong audience. Our custom internal solution is good enough. Its longevity is uncertain: Its startup may be dead or sold soon. We kno ...
The Speed of Time
How long does it take to read the time? How would you time time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective. A Cassandra databa ...
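As a rough illustration of "timing time", the sketch below measures the average cost of reading the wall clock from Python. It is an assumption about what such an experiment could look like, not the method used in the post (which the excerpt does not show); the helper name `average_clock_read_ns` and the default call count are hypothetical.

```python
import time


def average_clock_read_ns(calls: int = 1_000_000) -> float:
    """Estimate the average cost in nanoseconds of one time.time() call.

    The result includes Python loop overhead, so treat it as an upper
    bound on the cost of the underlying clock read.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.time()  # the clock read being measured
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print(f"{average_clock_read_ns():.1f} ns per time.time() call")
```

Running this on two systems, or before and after a change to the OS image, gives a quick experimental comparison; a large jump in the per-call cost would be a reason to look further with observability tools.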
ZFS Is Mysteriously Eating My CPU
A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at Kernel Recipes in 2017; it is an old story that's worth resharing here. 1. Problem Statement The microservice was for metrics ingestion and had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can c ...