AI Core App Architecture in 2026: Why Trusted Execution Environments Are Becoming a Default, Not a Premium

June 13, 2026

TL;DR

Standard AI app stacks expose prompt payloads at multiple points during inference, including inside RAM during the forward pass, where neither TLS nor encryption at rest applies.
Trusted Execution Environments (TEEs) enforce hardware-level memory isolation, creating an encrypted execution boundary that the host OS, hypervisor, and infrastructure operators cannot read.
Intel TDX is the correct TEE for production LLM hosting; Intel SGX's EPC memory limit (~512MB–1GB in most deployments) makes it impractical for models beyond a few billion parameters.
GPU-level attestation via NVIDIA Confidential Compute is the missing layer in most "confidential AI" architectures; protecting CPU memory while leaving GPU VRAM unverified produces an incomplete isolation model.
Gartner predicts that by 2029, more than 75% of operations processed on untrusted infrastructure will be secured in-use by confidential computing, and enterprise procurement teams are already treating attestation evidence as a pre-condition, not a differentiator.
ORGN's confidential AI gateway combines Intel TDX isolation, NVIDIA GPU attestation, and zero-retention enforcement across the inference path, with cryptographic attestation records available.

When the Inference Moment Becomes the Liability: Understanding the Architecture Gap

The default AI app stack was assembled for general compute. At the time most of its components were designed, the data flowing through them was largely opaque: session tokens, hashed identifiers, structured API payloads. LLM inference changed this. The same infrastructure now handles clinical notes, legal discovery material, transaction records, and proprietary source code, in plaintext, in memory, for the duration of the forward pass.

The security model underneath hasn't caught up, and the gap between what enterprises assume their infrastructure protects and what it actually protects during inference is where most of the risk now lives.

What the Default AI App Stack Looks Like at Runtime

A production LLM-integrated application typically moves a prompt through six or more distinct layers before a token is generated. The client sends a request through an API gateway, which passes it to a reverse proxy, which routes it to an SDK wrapper, which calls the inference provider. Observability agents sit alongside each hop, capturing request metadata, sometimes full payloads, for debugging and monitoring pipelines.

At each of these layers, the prompt exists as cleartext in transit or buffered in memory. TLS protects the network segment between client and gateway, and between gateway and inference endpoint. But inside each process, the payload is decrypted, readable, and in some cases logged. The inference provider decrypts the prompt in RAM to run the forward pass. At that moment, anything with sufficient privilege on the host (the hypervisor, the host OS) can read the contents directly, and a noisy neighbor with a cache-timing side-channel may be able to infer parts of them.

There is no TEE in this architecture, and no reason there would be: the default stack was built for stateless APIs whose payloads weren't sensitive. The absence of hardware isolation isn't a misconfiguration. It's a structural default that needs to be deliberately overridden for AI workloads.

Why Encryption at Rest and TLS Are Insufficient for LLM Inference

Encryption has three domains: data at rest (disks, databases), data in transit (TLS over the network), and data in use (memory during computation). Most enterprise security postures cover the first two. AI inference falls almost entirely in the third.

During a transformer forward pass, the prompt is decrypted into GPU or CPU RAM. Intermediate tensors accumulate: attention scores, key-value cache entries, partial token probability distributions. The generated response materializes token by token in memory before it's serialized back to the wire. None of this is protected by TLS. None of it is encrypted at rest. It exists as plaintext in the execution environment for the duration of the request.

For traditional APIs, this gap was largely theoretical. An opaque session token being readable in RAM presented minimal risk. But a prompt containing a patient's diagnostic history, or a developer's proprietary codebase being processed by an AI coding assistant, represents a completely different threat profile. The memory exposure moment is the highest-risk state in the entire AI inference lifecycle, and it's the one that standard encryption models leave completely unaddressed.

The Hidden Retention Problem: How Observability Infrastructure Accumulates Sensitive Payloads

Teams deploying AI apps rarely intend to log prompt content. But the observability infrastructure they inherit often does exactly that by default. OpenTelemetry instrumentation captures spans with request attributes. Datadog agents buffer HTTP request bodies for trace reconstruction. Jaeger exporters include payload fields in distributed traces. Reverse proxies write access logs that include request bodies when debug logging is enabled.

Over months of operation, these telemetry streams accumulate what functions as a hidden database: a recoverable corpus of every AI interaction that flowed through the system. The corpus isn't stored intentionally. It accumulates as a side effect of general-purpose observability tooling that wasn't designed with LLM payload sensitivity in mind.

The breach risk from this corpus is often larger than the risk from the model provider itself. An attacker who gains access to a SIEM export, a log aggregator, or a distributed tracing backend can reconstruct months of sensitive AI interactions, without ever touching the inference infrastructure.

The architectural differences become clearer when comparing visibility boundaries directly.

Layer                              | Standard AI Runtime | Confidential Runtime
Host OS memory visibility          | Yes                 | No
Hypervisor inspection              | Possible            | Blocked
GPU VRAM encryption                | Usually absent      | Enabled
Runtime attestation                | None                | Cryptographic
Prompt visibility during inference | Plaintext in RAM    | Encrypted enclave memory

What Trusted Execution Environments Do Inside an AI Inference Stack

A TEE isn't a virtual machine with stricter permissions, and it isn't a software sandbox. Understanding the mechanism precisely matters here, because vague claims about "hardware-level security" proliferate in this space and most of them stop well short of what engineers need to know before making architectural decisions.

Memory Isolation, Hardware-Managed Encryption Keys, and Enclave Boundaries

A TEE creates an isolated execution context within the CPU where memory pages are encrypted using hardware-managed keys that never leave the CPU package. The hypervisor cannot read enclave memory even with root privilege on the host. Direct Memory Access from peripherals into the protected region is blocked at the hardware level. The host operating system sees only encrypted bytes when it attempts to inspect enclave memory.

The distinction from a standard VM is fundamental. A standard VM's memory is visible to the hypervisor. Hypervisor-level compromise, or a sufficiently privileged cloud provider operator, can read a VM's memory. An enclave's memory is encrypted with keys held inside the CPU's security processor, which are never exposed to system software. The hypervisor cannot read it. A privileged container runtime cannot read it. Even a malicious co-tenant on the same physical hardware cannot read it.

For an AI inference workload, the enclave boundary defines what enters and what exits. The prompt enters the enclave as ciphertext, is decrypted inside the protected region, processed during inference, and the response exits as ciphertext. Intermediate tensors, KV cache entries, and partial token distributions never leave the encrypted boundary. The host OS, the cloud provider, and the infrastructure administrator see none of it during execution.

Measured Launch and Cryptographic Attestation: How Verification Works

At TEE boot time, the hardware measures the workload. The measurement is a cryptographic hash of the enclave's code, initial configuration, and security parameters. The CPU's security processor signs this measurement using a hardware-rooted key chain, producing an attestation report.

The attestation report contains: the enclave measurement (hash of what's running), the platform identity (which CPU, which hardware generation), the security version number of the firmware, and a signature from the hardware root of trust. A verifier (the client application, an audit system, or a compliance tool) checks the report against Intel's or AMD's certificate authority chain.
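To make the measurement idea concrete, here is a minimal Python sketch. It is not the algorithm the hardware uses to compute a TDX measurement; it only illustrates the property that matters for verification: when code and configuration are hashed together, a change to either one yields a different value. The function name `measure_workload` and the sample inputs are assumptions made up for illustration.

```python
import hashlib


def measure_workload(code: bytes, config: bytes) -> str:
    """Return a SHA-384 digest over length-prefixed code and config bytes."""
    h = hashlib.sha384()
    for part in (code, config):
        # The length prefix keeps (b"ab", b"c") distinct from (b"a", b"bc").
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


# Hypothetical workload: same code, one configuration flag flipped.
baseline = measure_workload(b"inference-server-v1", b"logging=off")
tampered = measure_workload(b"inference-server-v1", b"logging=on")

print(baseline == tampered)  # False: the config change alters the measurement
```

A verifier never needs the workload itself, only the expected digest to compare against the value in the signed report.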

The attestation flow becomes easier to reason about when treated like any other signed infrastructure artifact.

    curl https://api.ORGN.com/v1/attestation \
      -H "Authorization: Bearer $TOKEN" \
      -o report.json

    {
      "tee_type": "intel_tdx",
      "mrtd": "8f7a2f9d5a7b0c...",
      "firmware_version": "1.5.02",
      "gpu_attestation": true,
      "timestamp": "2026-05-09T18:42:11Z",
      "signature": "MEQCID..."
    }

The MRTD measurement identifies the enclave state that executed the inference request. Teams can validate the firmware version, verify GPU confidential compute mode, and archive signed reports for audit evidence.
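The field checks described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ORGN's verification logic: the field names mirror the sample report above, while `check_report`, the minimum firmware version, the freshness window, and the placeholder `EXPECTED_MRTD` are hypothetical. The sketch deliberately does not verify the signature; a real verifier must first validate it against the vendor certificate chain, since unsigned fields prove nothing.

```python
from datetime import datetime, timedelta, timezone

# Placeholder only. A real deployment pins the published measurement of its workload.
EXPECTED_MRTD = "ab" * 48


def parse_version(version: str) -> tuple:
    """Turn '1.5.02' into (1, 5, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_report(report: dict, expected_mrtd: str, min_firmware: str,
                 now: datetime, max_age: timedelta) -> list:
    """Return a list of problems; an empty list means every field check passed."""
    problems = []
    if report.get("tee_type") != "intel_tdx":
        problems.append("unexpected tee_type")
    if report.get("mrtd") != expected_mrtd:
        problems.append("mrtd mismatch")
    if parse_version(report.get("firmware_version", "0")) < parse_version(min_firmware):
        problems.append("firmware too old")
    if report.get("gpu_attestation") is not True:
        problems.append("gpu not attested")
    issued = datetime.fromisoformat(report["timestamp"].replace("Z", "+00:00"))
    if now - issued > max_age:
        problems.append("stale report")
    return problems


report = {
    "tee_type": "intel_tdx",
    "mrtd": EXPECTED_MRTD,
    "firmware_version": "1.5.02",
    "gpu_attestation": True,
    "timestamp": "2026-05-09T18:42:11Z",
}
now = datetime(2026, 5, 9, 18, 45, 0, tzinfo=timezone.utc)
print(check_report(report, EXPECTED_MRTD, "1.5.0", now, timedelta(minutes=10)))
```

Returning a list of problems rather than a boolean keeps the reason for a rejection available for the audit record.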


The practical difference between a policy claim and a cryptographic attestation is verifiability after the fact. A vendor's terms of service stating "we isolate your data" cannot be audited post-hoc. An attestation report signed by the CPU's security processor can be. An enterprise can store attestation evidence and present it during a SOC 2 audit or a HIPAA review as proof that specific inference requests ran inside a verified, isolated enclave. ORGN generates these cryptographic attestation records for qualifying inference requests, making them retrievable from the ORGN console for exactly this purpose.

Verification also requires explicit failure handling when enclave integrity changes.

    Request received
    ↓
    Attestation mismatch detected
    ↓
    Inference aborted
    ↓
    No model execution

Common failure conditions include stale firmware measurements, revoked certificate chains, enclave drift, or GPU confidential compute mode becoming unavailable.
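One way to express this fail-closed behavior in application code is a guard that refuses to call the model when verification reports a failure. The Python sketch below is hypothetical: the enum values simply restate the failure conditions listed above, and `guarded_inference`, `verify`, and `run_model` are illustrative names, not part of any ORGN SDK.

```python
from enum import Enum
from typing import Callable, Optional


class AttestationFailure(Enum):
    STALE_FIRMWARE = "stale firmware measurement"
    REVOKED_CERT = "revoked certificate chain"
    ENCLAVE_DRIFT = "enclave measurement drift"
    GPU_CC_UNAVAILABLE = "GPU confidential compute mode unavailable"


class AttestationError(RuntimeError):
    def __init__(self, failure: AttestationFailure):
        super().__init__(failure.value)
        self.failure = failure


def guarded_inference(prompt: str,
                      verify: Callable[[], Optional[AttestationFailure]],
                      run_model: Callable[[str], str]) -> str:
    """Fail closed: run_model is never invoked when verify reports a failure."""
    failure = verify()
    if failure is not None:
        raise AttestationError(failure)
    return run_model(prompt)


calls = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return "response"


try:
    guarded_inference("sensitive prompt",
                      lambda: AttestationFailure.ENCLAVE_DRIFT, fake_model)
except AttestationError as err:
    print("aborted:", err)

print(calls)  # [] because the model was never called
```

Raising an exception instead of returning an empty response makes it harder for a caller to mistake an aborted request for a successful one.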

Intel TDX vs. Intel SGX vs. AMD SEV: Scoping the Right TEE for LLM Workloads

Intel SGX was the first widely deployed CPU TEE. Its protection model is solid, but its Enclave Page Cache (EPC), the protected memory region, is limited to approximately 512MB to 1GB in most production deployments. Loading a 7B-parameter model in FP16 requires roughly 14GB. Loading a 70B model requires around 140GB. SGX is simply not the right TEE for LLM inference. It works for small, security-sensitive processes such as key management services and attestation agents, but not for model serving.
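The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The short Python sketch below reproduces them. It counts weights only (no KV cache, activations, or runtime overhead) and uses decimal gigabytes; the function name is an illustrative choice.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes).

    One billion parameters at one byte each is 1 GB, so the result is
    simply billions of parameters times bytes per parameter.
    """
    return float(params_billion) * bytes_per_param


FP16_BYTES = 2

for name, params in [("7B", 7), ("70B", 70)]:
    print(name, weight_memory_gb(params, FP16_BYTES), "GB in FP16")

# Compare against the upper EPC figure quoted in the text (about 1 GB).
print(weight_memory_gb(7, FP16_BYTES) <= 1.0)  # False
```

Even before accounting for the KV cache, the smallest model in this comparison exceeds the quoted EPC range by more than an order of magnitude.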

Intel Trust Domain Extensions (TDX) addresses the memory ceiling by operating at the VM level. A TDX Trust Domain is a full encrypted virtual machine. Guest memory is encrypted using CPU-managed keys, the hypervisor cannot read it, and there's no EPC ceiling to work around. The entire model, runtime, and inference stack run inside the encrypted trust domain. This is the architecture that makes production LLM hosting inside a TEE practical.

AMD SEV-SNP (Secure Encrypted Virtualization, Secure Nested Paging) is AMD's functional equivalent. It provides encrypted VM memory, hardware attestation, and protection against hypervisor-level introspection. The security model is comparable to TDX; the hardware ecosystem and tooling differ. For teams evaluating multi-vendor confidential compute, SEV-SNP is a credible option on AMD-based infrastructure, with attestation reports signed by the AMD Secure Processor and verified against certificates obtained from AMD's Key Distribution Service.

For production LLM hosting specifically, TDX is currently the more mature choice on Intel infrastructure, with broader integration support and documented benchmarks. A 2025 ETH Zurich study (arXiv:2509.18886) running full Llama2 inference pipelines (7B, 13B, and 70B parameters) inside Intel TDX found CPU TEE overhead of under 10% throughput reduction and under 20% latency increase, numbers that fall within acceptable ranges for production inference services when security requirements mandate it.

GPU Attestation: The Missing Layer That CPU-Only TEEs Leave Exposed

LLM inference doesn't run primarily on the CPU. The forward pass, attention computation, and token generation run on GPU. A confidential compute architecture that protects CPU memory while leaving GPU VRAM unverified produces an incomplete isolation model: the most compute-intensive part of inference, where the model weights interact with the decrypted prompt tensors, remains outside the verified boundary.

NVIDIA Confidential Computing on the H100 Hopper architecture addresses this. In Confidential Compute (CC) mode, VRAM is encrypted and inaccessible to the host driver. The CPU-GPU data path over PCIe is encrypted, preventing plaintext data from being observed in transit between the CPU trust domain and the GPU. NVIDIA GPU attestation generates a signed report covering the GPU's firmware state, driver configuration, and hardware identity, allowing a verifier to confirm that the GPU participating in inference is operating in a verified, untampered state.

The ETH Zurich study cited above also measured NVIDIA H100 Confidential Compute GPU overhead, finding throughput penalties of 4–8% that diminish as batch and input sizes grow. For large models under sustained load, the penalty approaches the low end of this range.

ORGN's architecture combines Intel TDX CPU isolation with NVIDIA GPU attestation, covering both execution layers. The calling application specifies the model. ORGN forwards the request to the selected confidential compute environment and exposes attestation artifacts for both the CPU trust domain and the GPU execution context.

Where TEE Adoption Is Being Driven by Regulatory Pressure, Not Vendor Preference

The shift from "TEE as premium option" to "TEE as architectural baseline" isn't happening because engineers suddenly discovered the elegance of hardware-enforced memory isolation. It's happening because compliance frameworks are tightening in ways that make architectural assurances mandatory, and because Gartner now lists confidential computing as a top strategic technology trend with a concrete adoption forecast attached. According to Gartner's October 2025 analysis, by 2029, more than 75% of operations processed on untrusted infrastructure will be secured in-use by confidential computing. The path from that forecast to procurement-level pressure is short.

HIPAA, SOC 2 Type II, and the EU AI Act: What Each Framework Actually Requires at the Infrastructure Layer

HIPAA's Security Rule requires covered entities to implement technical safeguards that protect electronic PHI (ePHI) against unauthorized access. The relevant control category is 45 CFR §164.312(a)(1), which mandates access controls for systems that process ePHI. When an AI copilot processes clinical notes during inference, the prompt containing ePHI exists in RAM. The HIPAA "minimum necessary" standard requires that access to ePHI be limited to what's needed. A hypervisor-visible execution environment is hard to reconcile with that standard at the architectural level, even if no human operator actively reads the memory.

SOC 2 Type II auditors evaluate controls against the Trust Services Criteria. CC6.x covers logical and physical access controls. CC6.1 requires controls that prevent unauthorized access to data during processing. As AI moves into enterprise workflows, SOC 2 auditors are increasingly examining inference-path data handling, asking not just whether data is encrypted at rest and in transit, but whether the runtime processing environment provides verifiable isolation. Attestation reports from TEE-backed inference directly answer this question in a form that auditors can retain as evidence.

The EU AI Act's requirements for high-risk AI systems include data governance controls under Article 10 and traceability requirements under Article 12. Systems classified as high-risk, a category that includes AI used in employment, credit, healthcare, and law enforcement contexts, must demonstrate controls over data used in operation, not just training. Runtime isolation via TEE and cryptographic attestation provides the kind of verifiable, auditable evidence the Act's traceability requirements point toward.

Why "We Don't Train on Your Data" Is Not a Sufficient Compliance Statement Anymore

Enterprise procurement and compliance teams have internalized a distinction that vendor marketing often elides: the difference between a contractual assurance and an architectural guarantee. "We don't train on your data" is a data-use policy. It addresses what happens after inference. It says nothing about what's accessible during inference, what gets logged by observability infrastructure, or whether the execution environment is visible to the infrastructure operator.

SOC 2 auditors and CISO teams now routinely ask for: evidence of runtime isolation controls, configuration documentation for logging and retention settings, and cryptographic proof of execution environment integrity for sensitive workloads. A DPA clause or a vendor's privacy policy doesn't produce evidence of any of these. An attestation report signed by the CPU's security processor does.

The architectural shift matters here. A policy claim can be changed by the vendor unilaterally. A cryptographic attestation is produced by the hardware. Enterprises that want verifiable guarantees rather than contractual ones need TEE-backed infrastructure, and auditors increasingly expect to see the evidence.

Sovereign AI Infrastructure and National-Level Confidentiality Requirements

ORGN's partnership with AILO AI Infra and MBK Holding to establish Qatar's first Confidential AI Factories, announced at the World Summit AI in Doha in December 2025, illustrates where sovereign AI requirements are heading. National-level AI deployments increasingly bundle data residency with hardware-level execution isolation as inseparable requirements. Keeping data within national borders is a necessary condition. It isn't sufficient.

A sovereign AI deployment that processes government data in-country but on hypervisor-visible infrastructure still exposes that data to infrastructure operators. Confidential computing closes this gap. The Gartner forecast that more than 75% of European and Middle Eastern enterprises will shift virtual workloads to localized solutions by 2030 reflects the same pressure: geopolitical risk reduction requires not just geographic control, but verified execution control.

For procurement teams evaluating AI infrastructure, sovereign AI factories with TEE-backed execution are becoming a distinct product category, not a customization of standard cloud hosting.

The Practical Architecture of a Confidential AI Gateway: Request Path, Attestation Flow, and Retention Controls

Knowing that TEEs exist and that they provide memory isolation is the starting point. Knowing how the pieces connect in a production deployment (request lifecycle, TLS session boundaries, attestation handoff, retention enforcement) is where the architectural decision-making actually happens.

How Request Forwarding, Model Selection, and Enclave Handoff Work in a Confidential Gateway

The application layer holds model selection. The client sends a request specifying the target model. ORGN doesn't dynamically reroute or select models based on internal policy; the calling application controls which confidential compute environment receives the request. ORGN validates authentication, forwards the request to the specified TDX-backed environment over an encrypted channel, and the payload enters the trust domain.

    Application Request
    │
    ▼
    ORGN Gateway
    │
    ▼
    TDX Trust Domain
    ├── Prompt decrypted inside enclave
    ├── GPU inference in CC mode
    └── Response encrypted before exit

Inside the trust domain, the TLS session terminates. The prompt is decrypted within the encrypted VM's memory. The model processes the request, and the response is encrypted before leaving the trust domain boundary. No plaintext exits the enclave across an unprotected channel. The response returns to the client alongside attestation artifacts: the signed report from the CPU's security processor covering the enclave measurement and platform identity, plus GPU attestation evidence from the NVIDIA H100's CC mode.

The TLS session lifecycle follows the trust domain boundary: client to gateway is one TLS session, gateway to enclave is a separate session that terminates inside the protected region. The gateway doesn't hold a session that spans both segments in plaintext.

Implementing Zero-Retention at the Infrastructure Layer: Configuration, Enforcement, and Audit Evidence

Application-level logging suppression is fragile. A configuration change, a dependency update, or a new observability integration can silently re-enable payload capture. Infrastructure-enforced zero-retention is a different model: the components responsible for processing the request are configured at the infrastructure layer to exclude content fields from telemetry, and those configurations are policy-governed rather than application-level.

    logging:
      request_body_capture: false
    otel:
      export_payloads: false
    tracing:
      metadata_only: true

Concretely, enforcing zero-retention across the inference path requires: disabling payload capture in API gateway request logging; stripping request bodies from reverse proxy access logs before they reach log aggregators; configuring OpenTelemetry collectors to exclude body attributes from exported spans; setting tracing exporters to metadata-only mode. What remains after these configurations are applied is billing-relevant metadata (token counts, model version, request timestamps, latency percentiles), none of which constitutes sensitive payload content.
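The metadata-only idea can be illustrated with an allowlist filter applied to span attributes before export. The Python sketch below is an assumption-laden example, not the configuration of any particular collector: the attribute names in `ALLOWED_ATTRIBUTES` and the sample span are invented for illustration.

```python
# Hypothetical attribute names; adjust to whatever the telemetry schema actually uses.
ALLOWED_ATTRIBUTES = {
    "model_version",
    "prompt_tokens",
    "completion_tokens",
    "request_timestamp",
    "latency_ms",
}


def to_metadata_only(attributes: dict) -> dict:
    """Keep allowlisted metadata keys and drop everything else, bodies included."""
    return {key: value for key, value in attributes.items()
            if key in ALLOWED_ATTRIBUTES}


span = {
    "model_version": "example-model-1",
    "prompt_tokens": 412,
    "latency_ms": 830,
    "http.request.body": "patient presents with ...",
}

print(to_metadata_only(span))
```

An allowlist is used rather than a denylist so that an attribute introduced by a future dependency update is dropped by default instead of silently exported.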

ORGN's architecture applies a privacy-first retention posture: customer content is not retained for model training, and retention controls can be enforced across the inference path. What remains is operational metadata suitable for billing and reliability monitoring, not a recoverable corpus of inference interactions.

Attestation Verification Patterns: Pre-Request, Post-Request, and Continuous Verification Models

Three verification patterns apply to different operational contexts, each with distinct latency and assurance tradeoffs.


Pre-request verification: the client fetches the attestation report from ORGN's console, validates the enclave measurement and certificate chain before sending sensitive payloads, and only proceeds if the report matches expected values. This model adds a round-trip of latency per session initialization, but provides the highest assurance: the client confirms the execution environment's integrity before any sensitive data is transmitted. Suitable for batch jobs and non-latency-sensitive workflows where attestation mismatch should abort the request entirely.

Post-request verification: the client sends the request, receives the response alongside the attestation artifact, and validates the report before acting on the output. This eliminates the pre-flight latency penalty while still providing cryptographic verification of the execution environment. Practical for interactive workloads where first-token latency matters and the risk model accepts that the request executes before attestation is validated.

Continuous background verification: periodic re-attestation confirms that the enclave remains in the expected state across a long-running inference service session. Suitable for services that maintain warm enclaves across many requests, where re-verifying before each request is impractical. The attestation report has a validity window; staleness means the enclave state hasn't been re-verified recently, not that it's been compromised, but a stale report combined with enclave restart creates a gap in the verified chain.
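The continuous model can be sketched as a small cache that re-attests once a validity window has elapsed. The Python below is a hypothetical illustration: `AttestationMonitor`, the 60-second TTL, and the injected `fetch_measurement` and `clock` callables are assumptions chosen so the example runs without a real enclave.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedAttestation:
    measurement: str
    verified_at: float  # seconds, as reported by the injected clock


class AttestationMonitor:
    def __init__(self, fetch_measurement: Callable[[], str], expected: str,
                 ttl_seconds: float, clock: Callable[[], float]):
        self._fetch = fetch_measurement
        self._expected = expected
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[CachedAttestation] = None

    def ensure_fresh(self) -> bool:
        """Re-attest when the cache is empty or expired; return True if it did."""
        now = self._clock()
        if self._cached is not None and now - self._cached.verified_at < self._ttl:
            return False  # still inside the validity window
        measurement = self._fetch()
        if measurement != self._expected:
            self._cached = None  # drop trust until a good report is seen
            raise RuntimeError("enclave measurement changed; re-verification failed")
        self._cached = CachedAttestation(measurement, now)
        return True


clock_now = [0.0]
fetches = []


def fetch() -> str:
    fetches.append(clock_now[0])
    return "expected-measurement"


monitor = AttestationMonitor(fetch, "expected-measurement", ttl_seconds=60,
                             clock=lambda: clock_now[0])
monitor.ensure_fresh()   # first call attests
clock_now[0] = 30.0
monitor.ensure_fresh()   # within the TTL, served from cache
clock_now[0] = 61.0
monitor.ensure_fresh()   # TTL elapsed, attests again
print(len(fetches))      # 2
```

Injecting the clock keeps the freshness logic testable and makes the staleness rule explicit: an expired entry is a prompt to re-verify and does not by itself signal compromise.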

Architectural Tradeoffs and Failure Modes When Running LLMs Inside TEEs

Most articles covering TEEs stop at the security properties. The actual engineering work involves understanding where TEEs fall short, what the performance costs look like in production, and what breaks when the operational model doesn't account for enclave lifecycle management. These are the questions that decide whether a confidential compute deployment actually works.

Performance Overhead: Memory Encryption Penalty, Boot Time, and GPU PCIe Encryption Costs

The ETH Zurich study on confidential LLM inference (arXiv:2509.18886) provides the most rigorous published benchmarks on this question. For CPU TEEs running full Llama2 inference pipelines (7B, 13B, 70B), TDX overhead measured between 5.51% and 10.68% throughput reduction relative to baseline VM execution. With Intel Advanced Matrix Extensions (AMX) acceleration, overhead falls further. For GPU-based inference on NVIDIA H100 Confidential Compute, throughput penalties measured 4–8%, diminishing as batch and input sizes grow.

Independent benchmarking from OpenMetal on Intel TDX bare-metal deployments found inference serving overhead of 5–15% depending on request frequency and model size. For I/O-intensive workloads, the overhead skews higher because TDX memory transitions add latency when workloads saturate available CPU headroom.
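For capacity planning, these percentages translate into replica counts. The Python sketch below shows the arithmetic under invented numbers: the 1,000 tokens/s per-replica baseline and the 10,000 tokens/s target are hypothetical, and the overhead values are sample points from the ranges quoted above, not measurements.

```python
import math


def effective_throughput(baseline_tps: float, overhead_pct: float) -> float:
    """Throughput left after a percentage reduction from TEE overhead."""
    return baseline_tps * (1 - overhead_pct / 100)


def replicas_needed(target_tps: float, baseline_tps: float,
                    overhead_pct: float) -> int:
    """Smallest whole number of replicas that meets the target throughput."""
    return math.ceil(target_tps / effective_throughput(baseline_tps, overhead_pct))


# Hypothetical fleet: 1,000 tokens/s per replica, 10,000 tokens/s target.
for overhead in (0, 5, 15):
    print(overhead, "% overhead ->", replicas_needed(10_000, 1_000, overhead), "replicas")
```

Because replicas come in whole units, a single-digit percentage overhead can still mean one additional replica, which is worth pricing in before migration.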

Cold-start latency is a separate concern. TEE initialization and model loading inside the trust domain add to time-to-first-token for inference services that don't maintain warm enclaves. Pre-warming enclave pools and pre-loading model weights inside the trust domain address this. Model quantization (INT8 rather than FP16) reduces memory footprint and improves throughput within the encrypted region. What doesn't work is selectively bypassing encryption for "non-sensitive" tensors: partial exposure breaks the isolation model entirely, since intermediate state from a protected forward pass can be observed at the unprotected tensor boundary.

Side-Channel Risks That TEEs Don't Eliminate and How to Mitigate Them

TEEs prevent direct memory inspection from privileged software. They do not eliminate microarchitectural side-channel attacks. Cache-timing attacks allow an attacker with co-tenant access on shared hardware to infer information about enclave execution by observing shared cache state. Branch predictor state leakage and speculative execution side-channels (Spectre variants) can expose information across privilege boundaries that TEE memory encryption doesn't protect.

Mitigations exist at multiple layers: retpoline-style compiler mitigations address branch target injection (Spectre variant 2); defenses against Flush+Reload-style attacks reduce the cache-timing attack surface; disabling hyperthreading on multi-tenant hardware eliminates the most dangerous co-tenant side-channel vectors. The critical constraint here is that side-channel risk is materially lower on dedicated hardware than on shared multi-tenant infrastructure. A single-tenant confidential AI factory where the physical hardware isn't shared with other customers reduces the attack surface substantially compared to a shared cloud instance running inside a confidential VM.

This is part of why the sovereign AI factory model matters architecturally, beyond data residency: dedicated hardware reduces the side-channel attack surface to near-zero for co-tenant vectors, while TEE memory encryption addresses the host-privilege vectors.

Key Management, Attestation Report Freshness, and What Happens When Enclave State Changes

TEE-internal encryption keys are managed by the CPU's security processor, not by the application. When an enclave is terminated (due to a crash, a rolling update, or a planned restart), the hardware-managed keys for that trust domain are destroyed. A new enclave initialization produces a new key set and a new measurement.

Attestation reports have a validity period. A stale report doesn't indicate compromise; it indicates that the enclave's state hasn't been re-verified within the TTL. But for compliance workflows that require fresh attestation evidence per request period, stale reports require re-validation against the live enclave. Dependent systems that cached the previous attestation report need to re-fetch and re-verify after any enclave restart.

Live migration of TDX trust domains is not yet something most production deployments can rely on. In practice, a trust domain is rebuilt rather than migrated, which means rolling updates require spinning up a new trust domain with the updated workload, verifying its attestation report, routing traffic to it, and tearing down the old domain. For stateful inference services that maintain context across requests, session handoff across trust domain restarts requires careful design: either stateless per-request inference or explicit context re-hydration with the new enclave.
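The rebuild-and-switch sequence can be sketched as an orchestration function. The Python below is a hypothetical outline: `replace_trust_domain` and its four injected callables stand in for whatever provisioning, attestation, and routing tooling a team actually uses.

```python
from typing import Callable, List


def replace_trust_domain(launch: Callable[[], str],
                         verify: Callable[[str], bool],
                         route_to: Callable[[str], None],
                         teardown: Callable[[str], None],
                         old_domain: str) -> str:
    """Launch a new domain, verify it, switch traffic, then retire the old one.

    If verification fails, the new domain is torn down and traffic is untouched.
    """
    new_domain = launch()
    if not verify(new_domain):
        teardown(new_domain)
        raise RuntimeError("new trust domain failed attestation; keeping old domain")
    route_to(new_domain)
    teardown(old_domain)
    return new_domain


events: List[str] = []

new_domain = replace_trust_domain(
    launch=lambda: "td-new",
    verify=lambda domain: True,
    route_to=lambda domain: events.append(f"route:{domain}"),
    teardown=lambda domain: events.append(f"teardown:{domain}"),
    old_domain="td-old",
)
print(events)  # ['route:td-new', 'teardown:td-old']
```

The ordering is the point of the sketch: traffic moves only after the new domain's attestation has been checked, and the old domain is retired last so a failed rollout leaves the service in its previous verified state.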

End-to-End Confidential Inference Workflow

The sequence below shows how a production request moves through ORGN-backed confidential infrastructure from request entry to attestation verification.

1. Client submits inference request

2. ORGN validates authentication

3. Request forwarded into TDX trust domain

4. Prompt decrypted inside encrypted memory

5. GPU executes forward pass in CC mode

6. Response encrypted before leaving enclave

7. Attestation report generated

8. Client validates enclave measurement

9. Response accepted

    curl https://api.gateway.orgn.com/v1/chat \
      -H "Authorization: Bearer $TOKEN"

Conclusion

The architecture gap this article covers isn't new, but it's become load-bearing now that LLM inference handles regulated data at production scale. Standard AI stacks expose prompt content in RAM during the forward pass, accumulate sensitive payloads in observability infrastructure by default, and provide no verifiable runtime isolation. TEEs address the execution exposure. Zero-retention controls address the persistence exposure. Cryptographic attestation converts both into something auditable rather than assumed.

For platform engineers in 2026, the TEE decision is less about whether to adopt confidential compute and more about which workloads to migrate first. Workloads processing PHI, legal material, financial records, or proprietary code have a clear case for immediate migration. Workloads handling non-sensitive content can stay on standard infrastructure, so the performance overhead is paid only where data sensitivity justifies it.

The migration path from standard API-based AI integration to hardware-attested confidential compute runs through a gateway layer. ORGN's confidential AI gateway provides that path: Intel TDX for CPU memory isolation, NVIDIA GPU attestation for the accelerator execution context, zero-retention enforcement across the inference path, and cryptographic attestation records available for audit workflows. Applications specify their target model explicitly. The gateway handles secure forwarding and exposes attestation artifacts. The origin point of trust shifts from contractual assurance to silicon-enforced, cryptographically verifiable execution.

FAQs

Q1: What is the difference between a Trusted Execution Environment and a standard virtual machine for AI inference security?

A standard VM's memory is visible to the hypervisor, meaning a privileged host-level process or a compromised infrastructure operator can read it. A TEE uses hardware-managed encryption keys held inside the CPU's security processor to encrypt enclave memory, so the hypervisor sees only ciphertext. TEE isolation is hardware-enforced, not permission-based. For AI inference, this means the prompt and all intermediate computation remain protected from host-level inspection throughout the request, a guarantee no standard VM can provide.

Q2: How does Intel TDX attestation work when an LLM inference request is submitted through a gateway like ORGN?

At trust domain initialization, the CPU measures the workload image and configuration, producing a cryptographic hash. The CPU's security processor signs a report containing the measurement, platform identity, and security version. ORGN exposes these attestation artifacts for client verification. Engineers retrieve the report from the ORGN console and validate the signature against Intel's certificate chain. The report confirms that the specific code running in the enclave matches the expected measurement, providing cryptographic proof of execution environment integrity rather than a policy claim.

Q3: Does running LLM inference inside a TEE eliminate the need for zero data retention policies, or are both controls required?

Both are required. TEE memory isolation protects data during execution. It doesn't address what happens before or after the request: prompt payloads can still be captured by observability infrastructure, API gateways, or logging pipelines outside the enclave boundary. Zero-retention controls enforce that no prompt or response content is durably stored in these layers.

Q4: What are the GPU-side confidentiality requirements for production LLM workloads, and does CPU-only TEE isolation cover them?

CPU-only TEE isolation doesn't cover GPU execution. LLM inference runs primarily on GPU, where model weights interact with decrypted prompt tensors during the forward pass. Without GPU-level attestation, the VRAM and CPU-GPU PCIe data path remain outside the verified boundary. NVIDIA Confidential Compute on H100 addresses this: VRAM is encrypted in CC mode, the PCIe path carries encrypted data, and NVIDIA GPU attestation generates a signed report covering GPU firmware state and driver configuration.