eBPF Observability Tools Are Not Security Tools
Brendan Gregg's Blog
1M ago
eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float. Observability tools are designed to have the lowest overhead possible so that they are safe to run in production while analyzing an active performance issue. Keeping overhead low can require tradeoffs in other areas: tcpdump(8), for example, will drop packets if the system is overloaded, resulting in incomplete visibility. This creates an obvious security risk for tcpdump(8)-based security ..read more
USENIX SREcon APAC 2022: Computing Performance: What's on the Horizon
3M ago
At USENIX SREcon22 APAC I gave the opening keynote on the future of computer performance, rounding up the latest developments and making predictions of where I see things heading. This talk originated from my updates to [Systems Performance 2nd Edition], and this was the first time I've given this talk in person! The video is now on [YouTube]. The slides are [online] and available as a [PDF]. In Q&A I was asked about CXL (Compute Express Link), which was fortunate as I had planned to cover it and then forgot, so the question let me talk about it (although Q&A is missing f ..read more
USENIX SREcon APAC 2023: CFP
3M ago
USENIX's SREcon conference is the best venue for learning the latest in systems engineering (not just site reliability engineering) and if you have useful production stories and takeaways to share -- especially if you are in the Asia/Pacific region -- please consider submitting a talk proposal to [SREcon APAC 2023]. The [call for participation] ends on March 2nd, only two weeks away. It is held this year in Singapore, June 14-16, and I'm excited to be program co-chair with fellow Aussie [Jamie Wilkinson]. To quote from our CFP: You build computer platforms, debug them, and support them, and yo ..read more
Brendan@Intel.com
3M ago
I'm thrilled to be joining Intel to work on the performance of everything, apps to metal, with a focus on cloud computing. It's an exciting time to be joining: The geeks are back with [Pat Gelsinger] and [Greg Lavender] as the CEO and CTO; new products are launching including the Sapphire Rapids processor; there are more competitors, which will drive innovation and move the whole industry forward more quickly; and Intel are building new fabs on US soil. It's a critical time to join, and an honour to do so as an Intel fellow, based in Australia. My dream is to turn computer performance analysis ..read more
Netflix End of Series 1
3M ago
A large and unexpected opportunity has come my way outside of Netflix that I've decided to try. Netflix has been the best job of my career so far, and I'll miss my colleagues and the culture.

[Images: offer letter logo (2014); flame graphs (2014); eBPF tools (2014-2019); PMC analysis (2017); my pandemic-abandoned desk (2020); office wall]

I joined Netflix in 2014, a company at the forefront of cloud computing with an attractive [work culture]. It was the most challenging job among those I interviewed for. On the Netflix Java/Linux/EC2 stack there were no working mixed-mode flame graphs, no production ..read more
TensorFlow Library Performance
3M ago
A while ago I helped a colleague, Vadim, debug a performance issue with TensorFlow in an unexpected location. I thought this was a bit interesting so I've been meaning to share it; here's a rough post of the details.

## 1. The Expert's Eye

Vadim had spotted something unusual in this CPU flamegraph (redacted); do you see it?

I'm impressed he found it so quickly, but then if you look at enough flame graphs the smaller unusual patterns start to jump out. In this case there's an orange tower (kernel code) that's unusual. The cause I've highlighted here. 10% of total CPU time in page faults. At N ..read more
Why Don't You Use ...
3M ago
Working for a famous tech company, I get asked a lot "Why don't you use technology X?" X may be an application, programming language, operating system, hypervisor, processor, or tool. It may be because:

- It performs poorly.
- It is too expensive.
- It is not open source.
- It lacks features.
- It lacks a community.
- It lacks debug tools.
- It has serious bugs.
- It is poorly documented.
- It lacks timely security fixes.
- It lacks subject matter expertise.
- It's developed for the wrong audience.
- Our custom internal solution is good enough.
- Its longevity is uncertain: Its startup may be ..read more
The Speed of Time
3M ago
How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective. A Cassandra data ..read more
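
As a rough illustration of what "timing time" can mean, here is a minimal Python sketch (not from the original post) that estimates the average cost of reading the wall clock. The function name `average_clock_read_ns` and the call count are assumptions chosen for illustration; the figure it reports includes loop and interpreter overhead and will vary with the system and its clock source.

```python
import time


def average_clock_read_ns(calls: int = 1_000_000) -> float:
    """Estimate the average cost, in nanoseconds, of one time.time() call.

    Hypothetical helper for illustration: the result includes the overhead
    of the Python loop itself, so treat it as an upper bound.
    """
    read_clock = time.time  # bind locally to avoid attribute lookups in the loop
    start = time.perf_counter_ns()
    for _ in range(calls):
        read_clock()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print(f"{average_clock_read_ns():.1f} ns per call (includes loop overhead)")
```

Running the same measurement on two systems, or before and after a configuration change, is one simple way to see whether clock reads have become unexpectedly expensive.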
ZFS Is Mysteriously Eating My CPU
3M ago
A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

## 1. Problem Statement

The microservice was for metrics ingestion and had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it ..read more