Glossary Term

Site reliability engineer

Site reliability engineers, also known as SREs, work on IT infrastructure issues such as latency and capacity planning.

By IT Brew Staff

Definition:

Site reliability engineers, also known as SREs, work to ensure that deployed software and infrastructure run reliably and serve users effectively. They are becoming more important to IT as AI starts to power automation across the tech sector.

SREs work on issues like latency and capacity planning, and they automate tasks by integrating new code into existing systems to improve uptime and observability and to reduce repetitive effort, or “toil.”
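To make two of those terms concrete, here is a minimal Python sketch of how uptime (as request availability) and latency (as a percentile) might be computed. The function names, the nearest-rank percentile method, and the sample numbers are all illustrative assumptions, not drawn from any particular SRE team or tool.

```python
import math


def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests served successfully (a common uptime proxy)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: rank is 1-based, clamped so pct=0 maps to the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample data, in milliseconds, for illustration only.
latencies_ms = [120, 95, 110, 480, 101, 99, 130, 105, 98, 102]

print(availability(10_000, 25))       # share of successful requests
print(percentile(latencies_ms, 50))   # median latency
print(percentile(latencies_ms, 90))   # tail latency
```

Percentiles are used here instead of an average because a single slow request (480 ms in the sample) can hide behind a healthy-looking mean.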

Clicking through

More AI adoption means more work for SREs. Site reliability has always had a role in the digital landscape; with AI making things faster and more efficient, SREs are increasingly important for system management.

With more AI can come more toil. Laura de Vesine, a senior staff engineer at Datadog, commented in a Catchpoint analysis of rising AI adoption that having SREs manually supervise AI systems is the kind of thing that can increase their workloads.

“AI systems are themselves a new source of operations we as an industry have yet to master: Maintaining and updating models and running massive GPU clusters are both new problems for most teams,” de Vesine said. “For teams not running those AI systems, AI proponents are keen to tell us that its rollout will reduce toil, but the evidence may suggest that AI is actually a source of increased toil.”

Big time AI

In this boom moment for the AI industry, SREs will be asked to integrate the technology into tasks like testing and automation, a practice often called “AIOps.”

Some AI advocates, like researcher Abdul Samad Mohammed, believe that AI and machine learning can help to relieve the SRE’s burden. Whether it hurts or helps, AI is changing the site reliability engineering landscape, and resiliency will remain critical moving forward.