Which of the following BEST describes the relationship between Service Level Objectives and Service Level Indicators?
Service level indicators are the measurements for the service level objectives
Service level indicators are the performance targets for service level objectives
Service level objectives are the measurements for the service level indicators
Service level objectives are the performance metrics for service level indicators
Comprehensive and Detailed Explanation From Exact Extract:
The SRE Book provides a precise definition: “SLIs are the carefully defined quantitative measures of some aspect of the level of service provided. SLOs are the target values or ranges for these indicators.” (SRE Book – Chapter: Service Level Objectives). This establishes a clear hierarchical relationship: SLIs are the measurements, while SLOs define the acceptable target levels for those measurements.
Therefore, option A is correct: SLIs measure things like latency, availability, throughput, and error rate.
SLOs then define the goal, such as “99.9% availability over 30 days.”
Option B reverses the relationship.
Option C incorrectly says SLOs measure SLIs, which is backwards.
Option D confuses metrics and targets.
Thus, A is the only choice that aligns with Google’s official SRE definitions.
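The distinction can be sketched in a few lines of Python. This is a minimal illustration, not an official definition: the function names, the request counts, and the 99.9% target are all assumptions chosen for the example. The SLI is the measured value; the SLO is the target it is compared against.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption for this sketch: a window with no traffic counts as available.
        return 1.0
    return successful_requests / total_requests


# SLO: the target value for the SLI (hypothetical 99.9% over the window).
SLO_TARGET = 0.999


def meets_slo(sli: float, target: float = SLO_TARGET) -> bool:
    """Compare the measurement (SLI) against the target (SLO)."""
    return sli >= target


# Hypothetical request counts for one measurement window.
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets SLO: {meets_slo(sli)}")
```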
When applied to service levels, the principle of decreasing marginal productivity can be represented in three stages. Which of the following is NOT one of these stages?
Negative returns
Increasing returns
Diminishing returns
Possible returns
Comprehensive and Detailed Explanation From Exact Extract:
SRE applies economic principles, including diminishing marginal returns, to reliability engineering. As per the SRE Book: “Improving reliability becomes more expensive as the target approaches 100%, moving from increasing returns, to diminishing returns, and eventually negative returns.” (SRE Book – SLO Economics). This framework helps explain why striving for 100% availability is impractical and cost-ineffective.
The three recognized stages are:
Increasing returns – early improvements are inexpensive and highly impactful
Diminishing returns – costs rise while benefits shrink
Negative returns – achieving additional “nines” may reduce value due to slowed innovation
“Possible returns” is not part of this model.
Thus, D is the correct answer.
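One way to see why each additional “nine” yields less is to compute the downtime each target permits. The sketch below is plain arithmetic under an assumed 30-day window; it does not model cost, and the function name is hypothetical. Each extra nine removes ten times less downtime than the previous one.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window = 43,200 minutes


def allowed_downtime_minutes(availability_target: float,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of downtime a given availability target permits per window."""
    return (1.0 - availability_target) * window_minutes


# Each step adds one "nine" to the target.
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime_minutes(target):.2f} minutes allowed")
```

Going from 99% to 99.9% removes roughly 389 minutes of permitted downtime per window, while going from 99.99% to 99.999% removes fewer than 4, which is the shrinking benefit that the diminishing-returns stage describes.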
What is the MOST widely tracked Service Level Objective (SLO)?
Performance
Observability
Securability
Availability
Comprehensive and Detailed Explanation From Exact Extract:
Availability is the most widely tracked and commonly understood SLO across nearly all digital services. It measures whether users are able to successfully access and use the system. Because unavailability directly impacts user experience, revenue, trust, and reliability, it is the primary SLO used across industries.
The Site Reliability Engineering Book, Chapter “Service Level Objectives,” states:
“Availability is one of the most common and important SLOs since it reflects the basic ability of the service to function for users.”
The SRE Workbook also notes:
“Availability targets (e.g., 99.9%, 99.99%) are the most widely used form of SLOs and form the foundation of error budget policies.”
While performance SLOs are also common, availability SLOs are almost universal and foundational.
Thus, D. Availability is the correct answer.
Which of the following is NOT an SRE principle?
Operations is a software problem
Automate what is currently done manually
Toil is not important work
Reduce the cost of failure
Comprehensive and Detailed Explanation From Exact Extract:
The statement “Toil is not important work” is NOT an SRE principle; it contradicts the official Google SRE documentation. In the Site Reliability Engineering Book, toil is treated as a critical concept, because identifying and reducing toil directly enables reliability improvements and more engineering-focused work. The SRE Book emphasizes that toil must be taken seriously and systematically reduced, never dismissed.
From the SRE Book, Chapter “Eliminating Toil”:
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, with no enduring value, and that scales linearly as a service grows.”
The SRE Book further emphasizes:
“SRE teams should measure toil, track it, and make constant efforts to reduce it.”
This demonstrates that toil is significant and should not be ignored. Therefore, any suggestion that “toil is not important work” contradicts the documentation.
The other answer choices are actual SRE principles:
Operations is a software problem: from the SRE Book Introduction, “SRE’s approach starts with the belief that operations is fundamentally a software engineering problem.”
Automate what is currently done manually: automation is a central SRE philosophy to reduce toil.
Reduce the cost of failure: error budgets and controlled risk-taking are core SRE concepts designed to reduce the cost of failure.
Thus, the only option that is NOT an SRE principle is C.
Reliability is a key pillar of digital experience monitoring and incident management.
Which of the following describes the BEST type of reliability monitoring strategy in SRE?
A strategy that uses traditional and familiar monitoring tools rather than advanced artificial intelligence
A strategy that instruments observability and provides monitoring insights across all components and layers
A strategy that focuses on monitoring and discovering useful patterns in the performance of all active networks
A strategy that harnesses advanced technologies to measure, analyze, and maintain the fitness of applications
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines effective monitoring as comprehensive observability across all layers of a system, including latency, traffic, errors, saturation, dependencies, and infrastructure. The SRE Book states: “Monitoring must offer insight across all system components, enabling teams to rapidly detect and diagnose issues.” (SRE Book – Monitoring Distributed Systems). Observability instrumentation (logs, metrics, traces) provides the necessary depth for reliable digital experience monitoring.
Option B captures this exactly: broad observability across all components and layers.
Option A rejects modern observability practices, contradicting SRE guidance.
Option C is too narrow (network-only).
Option D focuses only on advanced technologies, not comprehensive coverage.
Thus, B is the best answer.
Which scenario BEST illustrates the swarming concept used during incident management?
An incident analyst rote escalates by assessing a consolidated list of next-level support teams and their area of expertise
A high-level specialist support team constantly reviews their incoming incident queue to respond instantly to escalations
A mid-level support team continually monitors escalated incidents to assigned teams to ensure they are making progress
A group of specialist teams meet and review a queue of escalated incidents to determine who should work on which one
Comprehensive and Detailed Explanation From Exact Extract:
Swarming is described in modern SRE incident management as a collaborative, multi-expert response model. Instead of linear escalation, SRE uses: “a rapid collaboration of the right experts at the same time to resolve incidents quickly.” (SRE Workbook – Effective Incident Response). Swarming pulls specialists together immediately, allowing them to jointly triage and work on issues, improving time-to-resolution and reducing handoff delays.
Option D captures this: multiple specialist teams coming together simultaneously to determine ownership and action.
Option A describes traditional tiered escalation, which SRE avoids.
Option B represents a reactive queue model, not swarming.
Option C focuses on monitoring progress, not active collaborative response.
Thus, D is correct.
Which of the following BEST describes how to contribute to achieving higher levels of availability?
1. Measuring the critical aspects
2. Maintaining a close relationship with development teams
3. Measuring staff performance
4. Maintaining a close interval between detection and correction
1 and 2
2 and 3
3 and 4
1 and 4
Comprehensive and Detailed Explanation From Exact Extract:
Achieving high availability in SRE practice is driven by accurate measurement of what matters and fast detection and correction of issues. According to Google’s Site Reliability Engineering Book, measurement of critical user-facing signals is foundational: “SLIs must capture the aspects of the service that are most critical to users and must be measured with high accuracy.” (SRE Book – Chapter: Service Level Objectives). Without measuring the critical aspects of a service (latency, errors, availability, and quality), teams cannot make informed decisions or detect degradation effectively.
The SRE Book also emphasizes reducing MTTR (Mean Time to Repair) and tightening the feedback loop between detection and correction. Google states: “Reducing the time between detection and mitigation is one of the most effective levers for improving availability.” (SRE Book – Chapters on Incident Response & Monitoring). Rapid identification and resolution directly improve a system’s availability and resilience.
Option D (1 and 4) is the only choice that correctly reflects SRE principles.
Measuring critical aspects → essential for correct SLO/SLI alignment
Maintaining a short interval between detection and correction → drives higher availability
Options including staff performance measurement or generic development relationships are not mentioned as availability-driving factors in the SRE literature.
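The effect of a shorter detection-to-correction interval can be illustrated with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below assumes MTTR covers the whole span from failure to correction; the function name and the hour values are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), the standard steady-state formula."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure frequency (assumed one failure per 720 hours), different repair times.
slow = steady_state_availability(mtbf_hours=720, mttr_hours=4)
fast = steady_state_availability(mtbf_hours=720, mttr_hours=1)
print(f"4h to correct: {slow:.4%}")
print(f"1h to correct: {fast:.4%}")
```

With failures held constant, cutting the time to correct is enough on its own to raise availability, which is the point of statement 4.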
Which scenario BEST illustrates how stability and agility can be achieved with simplicity?
An SRE team is adopting easy-to-understand change procedures to streamline the process
An SRE team is releasing a major update by automating continuous and small deployments
An SRE team is creating procedures, practices, and tools that render software more reliable
An SRE team is protecting reliability by using processes and procedures to control updates
Comprehensive and Detailed Explanation From Exact Extract:
Simplicity is a core SRE design principle. Google states: “Small, frequent, automated changes reduce risk and improve system stability.” (SRE Book – Release Engineering). Automating continuous, small deployments creates a simple and repeatable pipeline that increases agility while maintaining reliability. This approach aligns with both DevOps and SRE practices: reducing deployment complexity, lowering blast radius, and supporting rapid iteration.
Option B best reflects this philosophy: automated, continuous small deployments provide simplicity, stability, and agility simultaneously.
Option A improves process clarity but does not directly affect agility.
Option C is beneficial but broader and not specific to simplicity.
Option D focuses on control rather than agility.
Thus, B is correct.
Which of the following BEST describes an advantage of a container-based structure?
The portability created by containers enables software to run independently of the host operating system
The lightweight nature of containers requires fewer developers to actually create the software code
Software runs much more efficiently in containers because of the ability to run on virtual machines
The security of applications in containers is simplified because they share the security of the host system
Comprehensive and Detailed Explanation From Exact Extract:
Containers provide a major advantage that aligns with SRE: portability and environment consistency. The SRE Workbook describes containers as: “lightweight, portable units that encapsulate applications and dependencies, ensuring consistent behavior across environments.” This independence from the host OS environment enables predictable deployments and simplifies automation, scaling, and orchestration, especially when used with Kubernetes.
Option A captures this exact benefit: portability and independence from the host OS.
Option B is incorrect: containers do not reduce the number of developers required.
Option C incorrectly claims that efficiency comes from virtual machines; containers are typically more efficient because they avoid VM overhead, not leverage it.
Option D is incorrect: containers do not “inherit” security automatically; in fact, they require additional security controls.
Thus, A is the correct answer.
The new SRE team is advocating against a fixed Error Budget.
Why are fixed Error Budgets better?
They create more toil
They encourage working in smaller batches, which reduces risk
Fixed Error Budgets are never exceeded
They help predict outages
Comprehensive and Detailed Explanation From Exact Extract:
Fixed error budgets are preferred in SRE because they encourage smaller, safer, and more predictable releases, which inherently reduces risk. A fixed budget forces the team to consistently evaluate how much reliability they can afford to trade for delivery speed each month or quarter.
From the Site Reliability Engineering Book, Chapter “Service Level Objectives”:
“Error budgets allow teams to make controlled decisions about the risk they take on. A fixed budget naturally encourages teams to release in smaller batches, which reduces the overall risk and impact of a failure.”
Similarly, the SRE Workbook states:
“When teams work within a fixed error budget, they tend to push changes in smaller increments to avoid burning the budget too quickly.”
Why the other options are incorrect:
A. Fixed budgets reduce toil by reducing firefighting; they do not increase it.
C. Fixed budgets can be exceeded; this is not a reason they are beneficial.
D. Error budgets do not predict outages; they measure tolerated unreliability.
Thus, the correct and SRE-supported answer is B.
Why is observability potentially better than traditional monitoring?
Observability is less expensive than traditional monitoring
Traditional monitoring does not adapt well to the cloud since it focuses on discrete components and applications
Traditional monitoring can struggle to scale when service growth is rapid
Traditional monitoring cannot support containers
Comprehensive and Detailed Explanation From Exact Extract:
Traditional monitoring works well when systems are static and predictable. However, cloud-native, distributed, and microservice-based architectures create highly dynamic environments. In these cases, observability becomes more effective because it provides visibility across entire systems, rather than focusing on individual components.
From Google’s Observability guidance:
“Traditional monitoring relies on predefined dashboards and known failure modes. In modern cloud systems, component-level monitoring becomes insufficient because failures occur in ways that cannot always be predicted.”
Further, in the SRE Workbook:
“Monitoring individual components does not provide adequate visibility into complex distributed systems. Observability enables teams to understand system-wide behavior and user impact.”
Why options are incorrect:
A. Observability is not inherently cheaper.
C. While true, it is not the best reason; observability's benefit is broader than scale alone.
D. Traditional monitoring can support containers but often becomes noisy and ineffective.
Thus, the best answer is B.
Which of the following BEST describes the engineering side of SRE?
Applying network and infrastructure development best practices for stable operations and good reliability
Applying network design and deployment best practices to achieve operational performance targets
Applying infrastructure engineering principles to build and maintain the stable delivery of operational services
Applying software development best practices to solving operational problems and automating solutions
Comprehensive and Detailed Explanation From Exact Extract:
The foundational definition of SRE, as stated in Google’s SRE Book, is that SRE uses software engineering as its primary tool to solve operational problems: “SRE is fundamentally doing operations work using software engineering approaches.” (SRE Book – What Is SRE?). This includes building automation, writing tools, creating pipelines, and eliminating manual work. The “engineering side” focuses specifically on applying coding practices, testing, CI/CD, version control, and automation frameworks to operational domains such as deployment, monitoring, incident response, and capacity planning.
Option D captures this precisely: using software engineering best practices to solve operational issues and drive automation.
Options A, B, and C focus too narrowly on network or infrastructure engineering. While these can be components of SRE, they do not describe its engineering foundation as Google defines it.
Thus, D is the correct answer.
Service Level Indicator data helps to understand how much Error Budget is left.
TRUE or FALSE?
True
False
Comprehensive and Detailed Explanation From Exact Extract:
Service Level Indicators (SLIs) provide the quantitative measurements needed to determine how well a service is performing against its Service Level Objective (SLO). Since the error budget is defined as the allowable amount of unreliability, SLI data is the source of truth for calculating how much of that budget remains.
From the Site Reliability Engineering Book, Chapter “Service Level Objectives”:
“SLIs provide the measurements used to determine compliance with SLOs. Error budgets are computed directly from the SLI measurements over the defined time window.”
The SRE Workbook further explains:
“Error budgets quantify the inverse of SLO performance. SLIs provide the raw data that allow teams to calculate how much of the budget has been consumed and how much remains.”
Thus, SLI data is the only mechanism that determines the remaining error budget.
Therefore, the statement is True.
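The calculation can be sketched as follows. This is a minimal illustration assuming an event-based SLI (good events out of total events); the function name, the 99.9% target, and the counts are hypothetical.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # budget, in events
    actual_bad = total_events - good_events          # taken from SLI data
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, good_events=999_600, total_events=1_000_000)
print(f"Error budget remaining: {remaining:.1%}")
```

The SLO fixes the size of the budget (here about 1,000 failed requests), and the SLI measurements show how much of it has been spent.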
Where should an organization store versioned and signed artifacts that are used to deploy system components?
In the Configuration Management System (CMS)
In a Subversion source code repository
In a Definitive Media Library (DML)
In a secure artifact repository
Comprehensive and Detailed Explanation From Exact Extract:
SRE and modern DevOps best practices require that build artifacts (such as binaries, container images, and deployment packages) be stored in a secure, versioned artifact repository. These repositories ensure integrity, traceability, immutability, and security of deployment packages.
While the SRE Book does not use the ITIL term DML, it emphasizes:
“All production binaries should be stored in a secure, versioned repository to ensure consistent, repeatable, and trustworthy deployments.”
(Site Reliability Engineering Book, section on Release Engineering)
The SRE Workbook expands on this principle by emphasizing signed and verified artifacts:
“To ensure safe rollout, artifacts must be built once, stored securely, signed, versioned, and deployed from a controlled artifact repository.”
Why the other options are incorrect:
A. A CMS manages configuration, not deployment artifacts.
B. Subversion is a source code repository, not an artifact repository.
C. A DML is an ITIL concept, but SRE practice does not rely on it; instead, SRE uses modern artifact repositories (e.g., GCR, ACR, Artifactory).
Thus, the correct answer is D.
Which of the following is the LEAST useful metric when working to improve antifragility?
Mean Time To Detect
Service Level Objective
Deployment frequency
Recovery Point Objective
Comprehensive and Detailed Explanation From Exact Extract:
Antifragility focuses on an organization’s ability to respond, adapt, learn, and recover from incidents. The most useful metrics relate to incident detection, response, reliability, and recovery. Deployment frequency, while important in DevOps and DORA metrics, does not directly measure antifragility.
From the SRE Workbook, Incident Response section:
“Improving antifragility requires better detection, better recovery mechanisms, and clear reliability goals.”
Key metrics relevant to antifragility:
MTTD (Mean Time To Detect): quicker detection improves resilience
MTTR/RPO: recoverability measures
SLOs: define acceptable reliability thresholds and guide learning
Deployment frequency primarily measures delivery velocity, not resilience.
The Site Reliability Engineering Book emphasizes:
“Antifragility is improved by learning from incidents and strengthening recovery mechanisms rather than by increasing release cadence.”
Why the other options are useful for antifragility:
A. Mean Time To Detect: critical for detecting failures quickly
B. SLOs: define boundaries for reliability and failure tolerance
D. Recovery Point Objective: measures potential data loss during failures
Thus, C is the least useful metric for improving antifragility.
If an organization wants to promote changes automatically and reduce dependency errors, what steps should they take?
Ensure that the artifacts used to deploy system components are tested and visible externally
Ensure that they use only verified and signed artifacts to deploy system components
Ensure that Error Budgets are agreed with oversight and policies
Establish Service Level Objectives that define how artifacts are signed and verified
Comprehensive and Detailed Explanation From Exact Extract:
Using verified and signed artifacts is essential for safe automation, ensuring that deployments are consistent and free of dependency or supply chain errors. This is a fundamental principle in Google’s release engineering and SRE practices.
The Site Reliability Engineering Book, chapter “Release Engineering,” states:
“Releases should be built once, tested, signed, and stored in a secure repository. Only signed and verified artifacts should be promoted to production to prevent configuration drift and dependency inconsistencies.”
The SRE Workbook echoes this:
“Automated promotions depend on the integrity and immutability of artifacts. Signed artifacts ensure consistency and prevent errors related to mismatched dependencies.”
Why the other options are incorrect:
A. External visibility is irrelevant and may create security risks.
C. Error budgets relate to reliability, not artifact promotion.
D. SLOs do not define artifact signing; this is handled by release engineering processes.
Thus, the correct answer is B.
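A promotion gate that accepts only verified artifacts might look like the sketch below. It is a simplification under stated assumptions: it uses a shared-secret HMAC from the Python standard library, whereas real release pipelines typically use asymmetric signatures, and the function names, key, and artifact bytes are hypothetical.

```python
import hashlib
import hmac


def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Produce an HMAC-SHA256 signature for the artifact bytes."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_artifact(artifact: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_artifact(artifact, key)
    return hmac.compare_digest(expected, signature)


def promote(artifact: bytes, key: bytes, signature: str) -> str:
    """Hypothetical gate: refuse to promote anything that fails verification."""
    if not verify_artifact(artifact, key, signature):
        raise ValueError("artifact signature mismatch; refusing to promote")
    return "promoted"


# Illustrative values only; a real key would come from a secrets manager.
key = b"example-signing-key"
artifact = b"service-binary-v1.2.3"
signature = sign_artifact(artifact, key)
print(promote(artifact, key, signature))
```

Because the signature is computed over the exact bytes that were built and tested, any change to the artifact after signing causes the gate to reject it.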
Which of the following is the MOST likely outcome when the workforce puts the “parts” before the “whole”?
Increased employee motivation and morale
Increased introversion and decreased efficiency
A voluntary sharing of resources and information
A focus on common interests and lesser conflicts
Comprehensive and Detailed Explanation From Exact Extract:
SRE emphasizes organizational alignment and collaboration, warning against siloed thinking. The SRE Book highlights: “Local optimizations at the expense of the broader system lead to inefficiency, misalignment, and reduced reliability.” When individuals or teams focus only on their own “parts” instead of shared goals (“the whole”), it results in decreased cross-team communication, isolation, operational friction, and reduced efficiency.
Option B captures this SRE-documented outcome: increased introversion (siloing) and decreased efficiency.
Options A and D describe positive outcomes, which siloed thinking undermines rather than produces.
Option C implies healthy sharing, which does not result from silo-first behavior.
Thus, B is correct.
Which of the following is the MOST accurate description of Kubernetes?
A proprietary system developed to automate the integration, building, testing, and deployment of application containers
An independent platform that enables organizations to implement continuous integration and delivery practices
A platform used to manage containers in a cloud environment and also includes automated scaling and failover
An open-source operating system on which containerized applications can be run, monitored, and managed efficiently
Comprehensive and Detailed Explanation From Exact Extract:
Kubernetes is described in SRE-aligned literature as an open-source container orchestration platform that automates deployment, scaling, failover, and lifecycle management of containerized applications. The Site Reliability Workbook references Kubernetes as: “a container management system that automatically handles service discovery, scaling, rollout management, and self-healing.” (SRE Workbook – Production Environment chapters). Kubernetes does not replace an OS, nor is it a CI/CD platform; it sits on top of an OS and orchestrates containers across clusters.
Option C is the most accurate: it captures container management, cloud deployment context, automated scaling, and failover, all key capabilities of Kubernetes.
Options A and B incorrectly describe CI/CD platforms.
Option D incorrectly labels Kubernetes as an “operating system.”
Thus, C is correct.
When outages are repetitive and similar, they become a form of toil.
Which of the following describes the MOST compelling reason to adopt advanced technologies and artificial intelligence (AI)?
To increase reliability by reducing MTTR and MTRS
To increase the mean time to repair services (MTTR)
To increase the mean time to restore services (MTRS)
To increase reliability and achieve perfect MTRS
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines toil as “manual, repetitive, automatable, tactical work tied to running a service” (SRE Book – Eliminating Toil). Repetitive outages are specifically noted as a form of operational toil. The SRE Book and SRE Workbook emphasize adopting automation, intelligent tooling, and machine-learning-assisted systems to reduce toil and decrease Mean Time to Repair (MTTR) and Mean Time to Restore Service (MTRS). The books state: “Reducing MTTR directly increases system reliability more effectively than attempting to eliminate all failures.” (SRE Book – Chapter: Managing Incidents).
AI and advanced automation help detect issues faster, classify patterns, trigger automated remediation, and reduce human intervention, delivering reliability gains through faster repair rather than perfect uptime.
Option A is the only option aligned with SRE’s reliability philosophy.
Options B and C incorrectly suggest increasing MTTR/MTRS.
Option D refers to “perfect MTRS,” which is impossible and contradicts SRE’s acceptance of failure.
Thus, A is correct.
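Tracking whether automation is actually reducing restore time requires computing the metric from incident records. The following is a minimal sketch; the record format (a detected and a restored timestamp per incident), the function name, and the sample data are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: (detected, restored) pairs.
incidents = [
    (datetime(2026, 1, 1, 10, 0), datetime(2026, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2026, 1, 2, 14, 0), datetime(2026, 1, 2, 15, 30)),  # 90 minutes
]
print(mean_time_to_restore(incidents))
```

Comparing this value before and after introducing automated remediation gives a direct measure of the reliability gain described above.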
Which of the following BEST describes capacity planning?
Monitoring the percentage of capacity of resources being used over a time period
Activities performed to manage provider resources and provide multiple services
Activities used to create a plan that manages resources to meet service demand
Determining the maximum amount that any resource can accommodate or deliver
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines capacity planning as the discipline of ensuring that a system has enough resources to meet expected demand, both now and in the future. The SRE Book states: “Capacity planning ensures that services have sufficient resources available to meet reliability and performance targets, accounting for growth, trends, and forecasted usage.” (SRE Book – Chapter: Capacity Planning). This involves forecasting workloads, analyzing trends, and creating plans to scale infrastructure so that service-level objectives can continue to be met.
Option C correctly describes capacity planning as creating a resource management plan to meet demand.
Option A refers to capacity monitoring, not planning.
Option B reflects generic resource management or cloud provider operations, not SRE capacity planning.
Option D refers to determining maximum capacity, which is a measurement activity, not full planning.
Thus, C is the correct SRE-aligned answer.
Why is it important to have the future growth envelope outlined?
To ensure only signed artifacts are deployed
To ensure that the service can meet current and future scale estimates
To review Service Level Objectives and Service Level Indicators
To review or revise Error Budgets
Comprehensive and Detailed Explanation From Exact Extract:
The future growth envelope refers to the anticipated growth trajectory of a service, including expected load, user demand, data volume, and performance requirements. Planning for this growth is essential to ensuring that a service can scale reliably without violating SLOs.
The Site Reliability Engineering Book, in discussions on capacity planning, states:
“A key element of capacity planning is estimating future demand so that systems can scale to meet user needs without sacrificing reliability.”
The SRE Workbook reinforces this concept:
“Understanding expected growth enables teams to design systems that scale and remain reliable as usage increases.”
Having the growth envelope defined enables:
Proper capacity planning
Avoiding resource exhaustion
Ensuring scalability before it becomes a reliability problem
Designing architectures that can handle future load
Why the other options are incorrect:
A. Signed artifacts relate to supply chain security, not scaling.
C. SLO/SLI reviews do not require growth envelope analysis.
D. Error budgets relate to reliability thresholds, not capacity forecasting.
Thus, B is the correct answer.
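A growth envelope can be turned into a concrete forecast with simple compound-growth arithmetic. The sketch below assumes a constant monthly growth rate and a fixed headroom reserve, both simplifications; the function names and all numbers are hypothetical.

```python
import math
from typing import Optional


def projected_demand(current_peak_qps: float, monthly_growth_rate: float, months: int) -> float:
    """Peak demand after compounding growth for the given number of months."""
    return current_peak_qps * (1 + monthly_growth_rate) ** months


def months_until_exhaustion(current_peak_qps: float, monthly_growth_rate: float,
                            capacity_qps: float, headroom: float = 0.2) -> Optional[int]:
    """Whole months until demand reaches usable capacity (capacity minus headroom).

    Returns 0 if demand already meets or exceeds usable capacity, and None if
    demand is not growing.
    """
    usable = capacity_qps * (1 - headroom)
    if current_peak_qps >= usable:
        return 0
    if monthly_growth_rate <= 0:
        return None
    ratio = usable / current_peak_qps
    return math.ceil(math.log(ratio) / math.log(1 + monthly_growth_rate))


# Hypothetical service: 1,000 QPS peak today, 10% monthly growth, 2,000 QPS capacity.
print(months_until_exhaustion(1000, 0.10, 2000))
print(round(projected_demand(1000, 0.10, 5), 2))
```

Knowing how many months remain before demand reaches usable capacity is what lets a team add resources before growth becomes a reliability problem.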
TESTED 06 Jan 2026