Data classification and residency
Data classification
Every data type in the Union.ai platform is classified by its residency and access pattern. This classification determines where data is stored and how it is accessed.
| Classification | Data types | At rest | In transit | Enters control plane memory? |
|---|---|---|---|---|
| Bulk Customer Data | Files, directories, DataFrames, code bundles, container images, reports | Customer infrastructure (S3 SSE / GCS / Azure SSE) | HTTPS via presigned URL | No: never enters control plane |
| Inline Customer Data | Structured task inputs/outputs, secret values (during creation), execution log streams | Customer infrastructure (S3 SSE / GCS / Azure SSE; cloud secret managers) | TLS (client→CP) + TLS+mTLS+tunnel (CP→DP) | Yes: plaintext in memory, not persisted/cached/logged |
| Orchestration Metadata | Task definitions (including env vars, default values, SQL, pod specs), run/action state, error messages, trigger specs | Control plane databases (AES-256/KMS) | TLS (API) + TLS (gRPC events) | Yes: read from DB into memory for API responses |
| Platform Metadata | User identity/RBAC records, cluster records | Control plane databases (AES-256/KMS) | TLS (API) | Yes: read from DB into memory for API responses |
Bulk customer data (files, directories, DataFrames, code bundles, container images, and reports) is stored exclusively in the customer’s infrastructure and never enters the control plane. These objects are accessed via presigned URLs.
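The presigned-URL mechanism can be illustrated with a minimal SigV4 query-string signing sketch in pure Python. This is not the platform's implementation (real systems use an SDK such as boto3, and the bucket, key, and credentials here are hypothetical); it shows why the control plane can grant time-limited access to an object without the object's bytes ever passing through it:

```python
import datetime
import hashlib
import hmac
import urllib.parse


def presign_s3_get(bucket, key, region, access_key, secret_key, expires=3600, now=None):
    """Illustrative AWS SigV4 presigned GET URL (sketch, not production code)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    canonical_query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    canonical_request = "\n".join(
        ["GET", f"/{key}", canonical_query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode()).hexdigest()]
    )

    def sign(k, msg):
        return hmac.new(k, msg.encode(), hashlib.sha256).digest()

    # Derive the signing key from the secret via the standard HMAC chain.
    k = sign(sign(sign(sign(b"AWS4" + secret_key.encode(), datestamp), region), "s3"),
             "aws4_request")
    signature = hmac.new(k, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}/{key}?{canonical_query}&X-Amz-Signature={signature}"
```

The signature is computed from the secret key and the request parameters alone, so issuing the URL requires no read of the object itself, and the URL expires after `X-Amz-Expires` seconds.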
Inline customer data (structured task inputs and outputs, secret values during creation/update, and execution log streams) is stored at rest in the customer’s infrastructure but transits control plane memory during request processing. This data is encrypted in transit (TLS + Cloudflare Tunnel), exists as plaintext in control plane memory only for the duration of each request, and is not persisted, cached, or logged in the control plane.
Orchestration metadata is stored in the control plane databases (encrypted at rest). This includes task definitions, which contain structural information (container image references, typed interfaces) and fields that may be customer-sensitive: environment variables, default input literal values, SQL query statements, Kubernetes pod specs, plugin configuration, and config key-value pairs. Error messages from task executions (which may contain data from Python tracebacks) are also stored. A full task definition (TaskSpec) is stored on every run submission.
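As a rough illustration of this split, the sketch below models a stored task definition as a plain dictionary. The field names are simplified placeholders, not the actual TaskTemplate/TaskSpec protobuf schema, and all values are hypothetical:

```python
# Simplified sketch of what a stored task definition may contain.
# Field names and values are illustrative, not the real protobuf schema.
task_definition = {
    # Structural fields
    "image": "<registry>/app:v3",  # container image reference only, never the image itself
    "interface": {"inputs": {"x": "int"}, "outputs": {"y": "float"}},
    "resources": {"cpu": "2", "memory": "4Gi"},
    # Fields that may be customer-sensitive
    "env": {"FEATURE_FLAG": "on"},                   # environment variables
    "default_inputs": {"x": 7},                      # default input literal values
    "sql": "SELECT * FROM orders",                   # SQL query statements
    "pod_spec": {"nodeSelector": {"gpu": "true"}},   # Kubernetes pod spec
    "config": {"plugin.timeout": "30s"},             # plugin/config key-value pairs
}

# The classification above treats these keys as potentially customer-sensitive.
sensitive = {"env", "default_inputs", "sql", "pod_spec", "config"}
```

The point of the sketch is the boundary: both groups of fields live in the control plane databases, so anything a customer places in environment variables, defaults, SQL, or pod specs is stored there on every run submission.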
Data residency
All customer data resides in the customer’s own cloud account and region. The customer chooses the region for their data plane deployment, and all data plane resources (object storage, container registry, secrets backend, log aggregator, and compute) are provisioned within that region.
The control plane is available in the following regions: US West (us-west-2), US East (us-east-2), EU West (eu-west-1, Ireland), EU West (eu-west-2, London), and EU Central (eu-central-1). No bulk customer data is replicated to or cached in Union.ai infrastructure. Inline data (structured task I/O, secret values during creation, and log streams) transits control plane memory during request processing but is never persisted there. Because this transit passes through the control plane region, customers should select a control plane region consistent with their data residency requirements. For EU-deployed data planes using an EU control plane region, all data, both at rest and in transit, stays within the EU, supporting GDPR data residency requirements.
For details on the architectural separation that enforces these residency guarantees, see Two-plane separation.
Verification
Data classification
Reviewer focus: Confirm that each data type resides where the classification table claims. Verify that bulk data is in the customer’s infrastructure, and that task definitions in the control plane contain only expected fields.
How to verify:
The task definition schema is derived from the open-source Flyte protobuf definitions in the flyte-sdk repository. Review the TaskTemplate and RunSpec protobuf schemas and compare them against the field enumeration in the classification table above to confirm that the stored fields match the documented classifications.
Then run a workflow with recognizable data (e.g., a known string or file), and verify the location of each data type:
- Inputs/outputs: confirm they are in the customer’s object store:
  `aws s3 ls s3://<customer-bucket>/org/project/domain/run-name/action-name/`
- Code bundle: confirm it is in the customer’s object store:
  `aws s3 ls s3://<customer-bucket>/org/project/domain/code-bundles/`
- Container image: confirm it is in the customer’s container registry:
  `aws ecr describe-images --repository-name <repo> --region <region>`
- Logs: confirm they are in the customer’s log aggregator:
  `aws logs get-log-events --log-group-name <group> --log-stream-name <stream>`
- Secrets: confirm they are in the customer’s secrets backend:
  `aws secretsmanager list-secrets --region <region>`
- Task definition: confirm it contains the expected fields, stored in the control plane:
  `uctl get task <task-name> -o json`
  The response will contain resource requirements, typed interfaces, container image references, and potentially sensitive fields (environment variables, default values, etc.) as documented in Control plane. Bulk data content should not appear inline.
- Run metadata: confirm it contains metadata and URI references, stored in the control plane:
  `uctl get execution <execution-id> -o json`
  The response should contain phase, timestamps, URIs, error messages, and task definition fields. Bulk data content should not appear inline.
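The last two checks can be partially automated. Given the JSON returned by `uctl`, a reviewer can scan it for string values that look like inline payloads rather than URI references. The sketch below is a heuristic (the length threshold and URI schemes are assumptions, not part of the platform):

```python
def find_inline_payloads(obj, path="", max_len=512):
    """Recursively flag string values that look like inline data rather than
    URI references or short metadata. Heuristic, illustrative only."""
    hits = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            hits += find_inline_payloads(v, f"{path}.{k}", max_len)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            hits += find_inline_payloads(v, f"{path}[{i}]", max_len)
    elif isinstance(obj, str):
        # URI references to customer storage are expected; long non-URI
        # strings may indicate bulk data leaking inline.
        is_uri = obj.startswith(("s3://", "gs://", "abfs://", "https://"))
        if not is_uri and len(obj) > max_len:
            hits.append(path)
    return hits
```

Running this over `json.loads(...)` of the `uctl` output should return an empty list: bulk data should appear only as URI references into the customer's object store.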
Data residency
Reviewer focus: Confirm that all data plane resources reside in the customer’s chosen region and that no customer data is stored outside that region.
How to verify:
- Confirm the object store region:
  `aws s3api get-bucket-location --bucket <customer-bucket>`
  The output should match the customer’s chosen deployment region.
- Verify all data plane resources in the cloud console. Compute, storage, registry, secrets, and log aggregator should all be in the same region.
- Confirm the cluster region via the Union.ai API:
  `uctl get cluster`
  The cluster region should match the customer’s chosen deployment region.
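The region checks above can be consolidated into a small helper that compares each resource's reported region against the chosen deployment region. The resource names and region values below are hypothetical; in practice they would be gathered from the CLI commands shown:

```python
def check_residency(expected_region, resource_regions):
    """Return the resources whose region differs from the expected deployment region."""
    return {name: region for name, region in resource_regions.items()
            if region != expected_region}


# Hypothetical inventory; populate from the aws/uctl commands above.
violations = check_residency("eu-west-1", {
    "object_store": "eu-west-1",
    "container_registry": "eu-west-1",
    "secrets_backend": "eu-west-1",
    "log_aggregator": "eu-west-1",
    "cluster": "eu-west-1",
})
assert not violations  # every data plane resource is in the chosen region
```

An empty result confirms the residency claim for the inventoried resources; any entry in `violations` names a resource to investigate.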