Introduction
The documentation site for the light-portal application.
Architecture
Design
Light Portal is an application that connects providers to consumers, and it contains many components or applications. Each component has API endpoints and a user interface in the portal view single-page application.
To help users understand the design of each component in detail, we have collected all the design documents in this section.
Portal View
Multiple Environments
This document outlines the necessary changes to configure portal view to work dynamically across different environments (sdx, dev, non-prod, prod) using environment-specific configuration.
1. Environment Variables Setup
Create .env File
Create environment-specific .env files in the project root:
# Environment variables
# VITE_BASE_PATH is the sub-path where the app is deployed (used as the Vite base and the React Router basename).
VITE_BASE_PATH=/bff/admin/
# VITE_PORTAL_URL is the absolute base URL used for API calls.
VITE_PORTAL_URL=https://sdx.lightapi.net/bff
Required Environment Variables
- VITE_BASE_PATH: Defines the sub-path where your application is deployed.
- VITE_PORTAL_URL: The API endpoint base URL.
Benefits of .env Configuration
- Switch environments without code changes
- Maintain a single codebase for all environments
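One way to catch a misconfigured .env file early is to validate the two variables before they are used. The sketch below is illustrative only: validatePortalEnv is a hypothetical helper, and the leading and trailing slash rule simply follows the /bff/admin/ example value rather than anything Vite enforces.

```typescript
// Hypothetical helper: checks the two variables described above.
// The slash convention mirrors the example value "/bff/admin/".
type PortalEnv = { VITE_BASE_PATH?: string; VITE_PORTAL_URL?: string };

function validatePortalEnv(env: PortalEnv): string[] {
  const problems: string[] = [];
  const basePath = env.VITE_BASE_PATH ?? "";
  if (!basePath.startsWith("/") || !basePath.endsWith("/")) {
    problems.push("VITE_BASE_PATH should start and end with '/'");
  }
  try {
    // The URL constructor throws on anything that is not an absolute URL.
    new URL(env.VITE_PORTAL_URL ?? "");
  } catch {
    problems.push("VITE_PORTAL_URL should be an absolute URL");
  }
  return problems;
}
```

A build script could call this with the loaded environment and stop the build when the returned list is not empty.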
2. Vite Configuration Changes
File: vite.config.js
Location: Project root
Required Change:
import { defineConfig, loadEnv } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig(({ mode }) => {
  // Load every variable from the .env files for the current mode
  const env = loadEnv(mode, process.cwd(), '');
  return {
    plugins: [react()],
    base: env.VITE_BASE_PATH || "/",
    // ... other configurations
  };
});
Why Is This Change Necessary?
The Problem Without base Configuration
When your application is deployed to a sub-path rather than the domain root, all asset references break.
| Deployment Scenario | Required base Value |
|---|---|
| https://example.com/ | "/" (default) |
| https://example.com/portal/ | "/portal/" |
| https://example.com/app/v2/ | "/app/v2/" |
What base Affects
The base configuration controls how Vite prefixes:
- Static asset URLs (JavaScript, CSS, images, fonts)
- The value exposed as import.meta.env.BASE_URL, which client-side code can read
- Public folder references
Example: Without vs With base
Without base Configuration:
- App hosted at: https://example.com/portal/
- Vite generates: <script src="/assets/index.js">
- Browser requests: https://example.com/assets/index.js
- Result: 404 Not Found ❌
With base: "/portal/":
- App hosted at: https://example.com/portal/
- Vite generates: <script src="/portal/assets/index.js">
- Browser requests: https://example.com/portal/assets/index.js
- Result: Success ✅
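The difference can be reproduced with the standard URL constructor, which resolves a reference against a page URL the same way a browser does. This is a small illustration of the two cases above, not part of the portal code.

```typescript
// Resolve script URLs the way a browser would, relative to the page URL.
const pageUrl = "https://example.com/portal/";

// Root-relative path (no base): resolves against the origin and skips /portal/.
const withoutBase = new URL("/assets/index.js", pageUrl).href;

// Path prefixed with base "/portal/": stays under the deployment sub-path.
const withBase = new URL("/portal/assets/index.js", pageUrl).href;

console.log(withoutBase); // https://example.com/assets/index.js
console.log(withBase);    // https://example.com/portal/assets/index.js
```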
3. React Router Configuration Changes
File: App.tsx
Location: src/App.tsx
Required Change:
import { BrowserRouter } from 'react-router-dom';

function App() {
  const basename = import.meta.env.VITE_BASE_PATH || "/";
  return (
    <BrowserRouter basename={basename}>
      {/* Your app routes and components */}
    </BrowserRouter>
  );
}

export default App;
What basename Does
The basename prop tells React Router the base URL prefix for all routes in your application.
Routing Behavior Comparison
| Scenario | Without basename | With basename="/portal" |
|---|---|---|
| <Link to="/dashboard"> | Navigates to /dashboard | Navigates to /portal/dashboard |
| path="/settings" matches | /settings | /portal/settings |
| navigate("/login") (from useNavigate) | Goes to /login | Goes to /portal/login |
Why It’s Required
When your app is hosted at a sub-path (e.g., https://example.com/portal/), React Router needs to know that /portal is the deployment prefix, not part of your route definitions.
Without basename:
- You define <Route path="/dashboard" />
- User visits /portal/dashboard
- React Router sees /portal/dashboard → no match → Route Not Found ❌
With basename="/portal":
- React Router strips /portal from the URL
- Sees /dashboard → matches your route → Success ✅
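The stripping step can be pictured with a few lines of code. This is a simplified sketch of the idea, not React Router's actual implementation; the stripBasename helper is written here only for illustration.

```typescript
// Simplified illustration of basename handling (not React Router's own code).
// Returns the route path the app should match, or null when the URL is
// outside the deployment prefix.
function stripBasename(pathname: string, basename: string): string | null {
  const base = basename.replace(/\/+$/, ""); // "/portal/" -> "/portal", "/" -> ""
  if (base === "") return pathname;          // app is served from the domain root
  if (pathname === base) return "/";         // exactly the prefix: root route
  if (pathname.startsWith(base + "/")) return pathname.slice(base.length);
  return null;                               // e.g. "/portalx/..." is not under "/portal"
}
```

With this helper, /portal/dashboard and basename /portal give /dashboard, while a URL such as /other/dashboard gives null, which corresponds to the "no match" case.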
4. API Call Configuration
Current Behavior Issue
Without a configured base URL, the browser constructs API request URLs relative to the current page origin.
Example:
- App running at: https://example.com/portal/dashboard
- API call: fetch('/api/users')
- Browser sends request to: https://example.com/api/users
This may work in some cases but breaks when:
- API is hosted on a different domain/subdomain
- API has a different base path
- Cross-environment consistency is needed
Solution: Custom Fetch Wrapper
File: src/utils/fetchClient.js
// VITE_PORTAL_URL is the API base URL defined in the .env file
const BASE_URL = import.meta.env.VITE_PORTAL_URL || "";
/**
 * Custom fetch wrapper with automatic base URL prefixing
 * @param {string} endpoint - API endpoint path (e.g., '/api/users')
 * @param {Object} options - Fetch options (method, headers, body, etc.)
 * @returns {Promise} - Response JSON
 */
async function fetchClient(endpoint, options = {}) {
  const url = `${BASE_URL}${endpoint}`;
  const defaultHeaders = {
    "Content-Type": "application/json",
  };
  // Caller-supplied headers override the defaults
  const config = {
    ...options,
    headers: {
      ...defaultHeaders,
      ...options.headers,
    },
  };
  const response = await fetch(url, config);
  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }
  return response.json();
}

export default fetchClient;
Usage Example
import fetchClient from './utils/fetchClient';

// GET request
const users = await fetchClient('/api/users');

// POST request
const newUser = await fetchClient('/api/users', {
  method: 'POST',
  body: JSON.stringify({ name: 'John Doe', email: 'john.doe@example.com' }),
});

// With custom headers
const data = await fetchClient('/api/protected', {
  headers: {
    'Authorization': `Bearer ${token}`,
  },
});
Benefits
- Consistency: All API calls use the same base URL
- Environment flexibility: Different API endpoints per environment
- Maintainability: Single place to update API configuration
- Error handling: Centralized response validation
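Plain string concatenation of the base URL and the endpoint produces a double slash when the base ends with / and the endpoint starts with one. A small join helper avoids that. The joinUrl function below is a hypothetical addition, not part of the wrapper shown above.

```typescript
// Hypothetical helper: join a base URL and an endpoint with exactly one "/".
function joinUrl(baseUrl: string, endpoint: string): string {
  if (baseUrl === "") return endpoint;      // no base configured: keep the relative URL
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes from the base
  const path = endpoint.replace(/^\/+/, ""); // drop leading slashes from the endpoint
  return `${base}/${path}`;
}
```

For example, joining https://sdx.lightapi.net/bff with /api/users gives https://sdx.lightapi.net/bff/api/users, and the result is the same when the base is written with a trailing slash.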
5. Build and Deployment Steps
Step 1: Set the .env variables for the target environment (sdx, dev, prod, etc.)
# Base path
VITE_BASE_PATH=/
# Endpoint URL
VITE_PORTAL_URL=https://example.com
Step 2: Build
npm run build
MCP Registry Design for AI Gateway
This document outlines the registration and management strategy for Model Context Protocol (MCP) tools within the AI Gateway.
Registration Strategy: The Hybrid Model
For a robust and scalable AI Gateway, the recommended approach is a Hybrid Model: Register the MCP Server as the primary entity, but manage and expose the Tools individually.
This approach balances the technical requirements of connectivity with the operational requirements of governance, security, and performance.
1. Register the Server (The “Connection” Layer)
The MCP Server should be treated as the source of truth and the primary unit of connectivity.
- Centralized Configuration: Authentication (API keys, OAuth), base URLs, transport protocols (SSE or Stdio), and environment variables are defined at the server level.
- Connectivity Management: A single server acts as a wrapper around related APIs. Registering tools individually would create significant overhead and redundant connections.
- Lifecycle & Health Monitoring: If an MCP server goes down, all its tools become unavailable. It is more efficient to monitor health and availability at the server level.
- Dynamic Discovery: The MCP protocol includes a tools/list capability. By registering the server, the gateway can automatically sync and discover new tools when the server is updated, eliminating the need for manual registration of every new function.
2. Expose Tools Individually (The “Governance” Layer)
While the gateway connects to the server, it should expose and manage tools as individual objects. This is crucial for:
- Granular Permissions (RBAC): Access control can be applied at the tool level. For example, a “Finance” team might be granted access to a get-invoice tool but restricted from a modify-ledger tool, even if both reside on the same ERP server.
- Context Window Optimization: Large Language Models (LLMs) have limited context windows. Sending all 50 tools from a large server to an LLM wastes tokens and increases the “lost in the middle” effect. The Gateway should allow for the activation of specific subsets of tools for a given AI session or agent.
- Rate Limiting & Cost Control: High-compute or high-cost tools (e.g., generate-video) can be rate-limited or billed differently compared to lightweight tools (e.g., get-weather).
- Safety & Compliance: Metadata can be attached to individual tools to flag them as Read-Only, Destructive, or Sensitive, enabling specific security flows (like “Human-in-the-loop” approvals) for risky operations.
Recommended Architecture: The “Catalog” Pattern
The implementation should follow a “Catalog” or “App Store” pattern:
- Provider/Server Registration: An admin registers a server (e.g., “The GitHub MCP Server”) with its credentials.
- Automated Discovery: The Gateway calls the server’s tools/list method and populates a tool catalog.
- Governance & Activation: Admins “enable” specific tools for specific model configurations or user groups.
- Routing Layer: When a model requests a tool, the Gateway resolves the request to the owning Server and handles the underlying communication.
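The four stages above can be sketched with an in-memory catalog. This is a minimal illustration under stated assumptions: the ToolCatalog class, its method names, and the use of user groups as the activation key are all hypothetical, and a real gateway would persist this data in the registry tables.

```typescript
type ToolEntry = { serverId: string; toolName: string; enabledFor: Set<string> };

class ToolCatalog {
  private tools = new Map<string, ToolEntry>(); // key: "serverId/toolName"

  // Discovery: record the tool names reported by a server's tools/list response.
  sync(serverId: string, toolNames: string[]): void {
    for (const toolName of toolNames) {
      const key = `${serverId}/${toolName}`;
      if (!this.tools.has(key)) {
        this.tools.set(key, { serverId, toolName, enabledFor: new Set<string>() });
      }
    }
  }

  // Activation: an admin enables one tool for one user group.
  enable(serverId: string, toolName: string, group: string): void {
    this.tools.get(`${serverId}/${toolName}`)?.enabledFor.add(group);
  }

  // The subset of tools the gateway would expose to a model acting for this group.
  visibleTools(group: string): string[] {
    return Array.from(this.tools.values())
      .filter((t) => t.enabledFor.has(group))
      .map((t) => t.toolName);
  }

  // Routing: resolve a tool call to its owning server, or null if not enabled.
  route(toolName: string, group: string): string | null {
    for (const t of Array.from(this.tools.values())) {
      if (t.toolName === toolName && t.enabledFor.has(group)) return t.serverId;
    }
    return null;
  }
}
```

Newly synced tools start with no groups enabled, so nothing is exposed to a model until an admin activates it.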
Comparison of Approaches
| Feature | Individual Registration | Group (Server) Registration | Recommended: Hybrid |
|---|---|---|---|
| Management | Extremely difficult (manual entry for every tool) | Easy (single connection) | Optimal (Auto-sync tools from server) |
| Security | Granular (Tool-level RBAC) | Coarse (All-or-nothing access) | Granular (Policy per tool) |
| LLM Context | Precise | Potential for bloating | Precise (Selectable subsets) |
| Maintenance | High (Breaks if tool name changes) | Low | Low (Unified lifecycle) |
| Connectivity | Redundant connections | Efficient | Efficient (One connection, many tools) |
Data Model & Schema Design
The AI Gateway leverages the existing API registry schema used by light-gateway, with specific enhancements to accommodate the unique requirements of the MCP protocol.
Conceptual Mapping
| MCP Concept | light-gateway Table | Mapping Strategy |
|---|---|---|
| MCP Server | api_t | Represents the top-level service (e.g., “Postgres MCP Server”). |
| Server Instance | api_version_t | Manages the connectivity parameters and the overall tool manifest. |
| MCP Tool | api_endpoint_t | Each tool is registered as an individual endpoint belonging to an MCP version. |
| Tool Permissions | api_endpoint_scope_t | Handles RBAC and scope-based access to specific tools. |
Core Tables & Enhancements
To support MCP, the following schema adjustments are implemented:
1. API Version (Server Connection)
The api_version_t table is enhanced to store transport-level configurations for stdio or SSE connections.
ALTER TABLE api_version_t ADD COLUMN transport_config TEXT;
-- JSON Example for transport_config:
-- {"transport": "stdio", "command": "npx", "args": ["-y", "@mcp/server-google"]}
2. API Endpoint (Tool Definition)
The api_endpoint_t table acts as the tool registry. We relax the traditional HTTP method constraints and add fields for MCP tool metadata.
-- Allow 'call' as a valid operation for MCP tools
ALTER TABLE api_endpoint_t DROP CONSTRAINT api_endpoint_t_http_method_check;
ALTER TABLE api_endpoint_t ADD CHECK ( http_method IN ( 'delete', 'get', 'patch', 'post', 'put', 'call' ) );
-- Store the Tool Schema (for LLM validation) and Metadata (for safety flags)
ALTER TABLE api_endpoint_t ADD COLUMN tool_schema TEXT; -- JSON Schema of the tool inputs
ALTER TABLE api_endpoint_t ADD COLUMN tool_metadata TEXT; -- e.g., {"destructive": true, "read_only": false}
Full Registry Schema Reference
-- API Definition (The MCP Server)
CREATE TABLE api_t (
host_id UUID NOT NULL,
api_id VARCHAR(16) NOT NULL,
api_name VARCHAR(128) NOT NULL,
api_desc VARCHAR(1024),
api_status VARCHAR(32) NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY (host_id, api_id)
);
-- API Version (The Connection/Transport)
CREATE TABLE api_version_t (
host_id UUID NOT NULL,
api_version_id UUID NOT NULL,
api_id VARCHAR(16) NOT NULL,
api_version VARCHAR(16) NOT NULL,
api_type VARCHAR(7) NOT NULL, -- 'mcp', 'openapi', etc.
transport_config TEXT, -- MCP-specific connection data
spec TEXT, -- Full tool manifest (optional)
active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY(host_id, api_version_id),
FOREIGN KEY(host_id, api_id) REFERENCES api_t(host_id, api_id) ON DELETE CASCADE
);
-- API Endpoint (The Individual Tool)
CREATE TABLE api_endpoint_t (
host_id UUID NOT NULL,
endpoint_id UUID NOT NULL,
api_version_id UUID NOT NULL,
endpoint VARCHAR(1024) NOT NULL, -- Tool Name
http_method VARCHAR(10), -- 'call' for MCP
endpoint_name VARCHAR(128) NOT NULL,
endpoint_desc VARCHAR(1024),
tool_schema TEXT, -- Input parameter validation
tool_metadata TEXT, -- Safety and cost metadata
active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY(host_id, endpoint_id),
FOREIGN KEY(host_id, api_version_id) REFERENCES api_version_t(host_id, api_version_id) ON DELETE CASCADE
);
Tool Metadata & Synchronization
Populating the api_endpoint_t table involves coordinating data from the MCP Server with operational policies defined within the AI Gateway.
Sources of Metadata
The metadata for each tool is synthesized from three primary sources:
1. Standard MCP Server Response (Automated)
When the Gateway performs a tools/list call, the MCP server provides the baseline technical definition for each tool.
- Source Fields: name, description, inputSchema.
- Mapping: These are mapped directly to endpoint, endpoint_desc, and tool_schema respectively.
2. Gateway Operational Enrichment (Manual/Policy)
Since the standard MCP protocol does not include operational flags (like safety or cost), the AI Gateway manages these in the tool_metadata JSON column.
- Administrative Enrichment: Platform admins use the Gateway UI to tag specific tools. Common tags include:
  - destructive: true: Triggers a warning or confirmation flow.
  - human_approval_required: true: Places the request in a queue for manual sign-off.
  - cost_tier: "high": Used for rate-limiting or internal billing.
- Heuristic Auto-Tagging: The Gateway can automatically infer metadata based on patterns. For example, any tool starting with get_ or list_ is auto-flagged as read_only: true.
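A prefix heuristic like the one described can be sketched as follows. The function names are hypothetical, and the rule that admin-supplied tags override inferred ones is an assumption made for this example.

```typescript
type ToolMetadata = { read_only?: boolean; destructive?: boolean };

// Prefixes taken from the get_/list_ example above.
const READ_ONLY_PREFIXES = ["get_", "list_"];

// Infer baseline metadata from the tool name alone.
function inferMetadata(toolName: string): ToolMetadata {
  const readOnly = READ_ONLY_PREFIXES.some((p) => toolName.startsWith(p));
  return readOnly ? { read_only: true } : {};
}

// Assumption: tags set by an administrator win over inferred ones.
function mergeMetadata(inferred: ToolMetadata, manual: ToolMetadata): ToolMetadata {
  return { ...inferred, ...manual };
}
```

Because a name is only a hint, a tool such as get_and_delete would be flagged read-only by the heuristic, which is why the manual override matters.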
3. Protocol Extensions (Custom)
The MCP specification allows for additional properties in the tool object. If a custom MCP server includes an extra metadata or annotations block, the Gateway’s synchronization logic can be configured to capture and store these directly.
Synchronization Workflow
The following lifecycle ensures the Gateway’s registry remains accurate:
- Connection: The Gateway establishes a connection to the server using the transport_config.
- Discovery (Sync): The Gateway calls tools/list and performs an “upsert” for all tools found.
  - Existing tools have their tool_schema and endpoint_desc updated.
  - New tools are created with a default active status and baseline tool_metadata.
- Review: An administrator reviews the newly discovered tools in the Gateway dashboard.
- Governance Policy: The administrator “enables” the tool for specific roles and configures any required safety metadata (e.g., flagging the drop_table tool as destructive).
- LLM Execution: When a model calls the tool, the Gateway uses the stored tool_schema for pre-flight validation and the tool_metadata to enforce security policies.
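The discovery upsert can be sketched against an in-memory map standing in for api_endpoint_t. The type and function names are hypothetical, the row shape is trimmed to the fields discussed here, and the default active value of true is an assumption based on the DEFAULT TRUE column definition.

```typescript
type McpTool = { name: string; description?: string; inputSchema: unknown };

type EndpointRow = {
  endpoint: string;     // tool name
  endpointDesc: string;
  toolSchema: string;   // JSON text, like the TEXT column
  toolMetadata: string; // JSON text managed by the gateway
  active: boolean;
};

function upsertTools(
  rows: Map<string, EndpointRow>,
  discovered: McpTool[],
  defaultActive = true, // assumption: mirrors DEFAULT TRUE in the schema
): void {
  for (const tool of discovered) {
    const toolSchema = JSON.stringify(tool.inputSchema);
    const endpointDesc = tool.description ?? "";
    const existing = rows.get(tool.name);
    if (existing) {
      // Refresh server-owned fields; keep admin-managed metadata and status.
      rows.set(tool.name, { ...existing, endpointDesc, toolSchema });
    } else {
      rows.set(tool.name, {
        endpoint: tool.name,
        endpointDesc,
        toolSchema,
        toolMetadata: "{}", // baseline metadata for a new tool
        active: defaultActive,
      });
    }
  }
}
```

The key design point is that a re-sync never overwrites tool_metadata, so safety flags set during review survive server updates.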
Too Many Pages/Forms
The portal has accumulated many pages, generated forms, custom admin screens, and feature-specific entry points. The sidebar can expose these pages, but it does not help a user understand which pages are required to finish a real business task. The MCP Gateway quick start wizard is a useful experiment, but it also shows the limitation of a rigid linear wizard: real tasks have optional steps, pre-existing data, and multiple valid starting points.
This document proposes a task-oriented navigation layer for portal-view.
Problem
Users currently need to know the portal information architecture before they can complete a task. For example, onboarding an API to MCP Gateway may require some combination of:
- create or select an API
- create or select an API version
- link the API version to a gateway or sidecar instance
- select MCP tools
- configure access control
- revisit instance, API, or role administration later
The same pattern exists across other areas. A task is not a single route; it is a sequence of related pages and forms. The current navigation model makes users pick pages first, then infer the task process themselves.
Current MCP Wizard Observation
The MCP Gateway wizard already has useful building blocks:
- flowConfig.tsx keeps step metadata in one place.
- McpServerForm.tsx renders a generic wizard shell.
- useMcpPrefill.ts can resume from URL context such as apiId, apiVersionId, and instanceApiId.
- Several steps are marked skippable.
However, the wizard is still too rigid:
- Step order is linear even when the task is naturally conditional.
- Initial step selection relies on hard-coded step numbers.
- Optional work is represented as skip buttons instead of task state.
- The wizard duplicates or wraps existing forms instead of treating existing pages/forms as first-class task steps.
- The solution is specific to MCP Gateway and does not help users navigate the rest of the portal.
Design Goals
- Let users start from a task, not a page name.
- Keep existing pages and generated forms as the source of truth.
- Support multiple entry points into the same task.
- Detect what has already been completed and show only relevant next actions.
- Support optional, required, blocked, complete, and skipped steps.
- Preserve role-based visibility and host-specific context.
- Allow users to leave a task, return later, and continue from context.
- Make the approach reusable for MCP, API publishing, access control, deployment, config promotion, migration, and admin workflows.
Non-Goals
- Do not replace every admin page with a wizard.
- Do not create a separate custom form for each task if an existing generated form already works.
- Do not use the sidebar as the only navigation surface.
- Do not force a strict step sequence when the data model allows safe jumping.
Proposed Solution
Add a task-oriented navigation layer above the current pages/forms.
The main pieces are:
- Task Center
- Task Registry
- Task Progress Resolver
- Task Navigation Shell
- Global Search and Command Palette
- Contextual Next Actions
Task Center
The Task Center is a page where users choose what they want to accomplish. It should group work by intent, not by implementation table.
Example task groups:
- API Marketplace
  - Register a new API
  - Add an API version
  - Publish an API
  - Review API details
- MCP Gateway
  - Onboard an existing API to MCP Gateway
  - Register a standalone MCP server
  - Configure MCP tools
  - Configure MCP access control
- Access Control
  - Create role
  - Assign permissions
  - Configure endpoint access
- Platform Operations
  - Register controller/gateway instance
  - Link API version to instance
  - Promote configuration
- Portal Administration
  - Manage host users
  - Export/import portal data
  - Convert migration snapshot
Each task card should show:
- title
- short description
- required role
- common starting object, such as API, instance, host, or client
- progress status when the current context is known
- primary action such as Start, Continue, Review, or Fix Missing Step
Task Registry
Introduce a registry that describes tasks and steps declaratively. This is the
generalized version of the current MCP flowConfig.tsx, but it should route to
existing pages/forms instead of rendering every step inside one wizard.
Example TypeScript shape:
export type TaskDefinition = {
  id: string;
  title: string;
  description: string;
  category: string;
  roles?: string[];
  keywords: string[];
  entryPoints: TaskEntryPoint[];
  steps: TaskStep[];
};

export type TaskStep = {
  id: string;
  title: string;
  description?: string;
  required: boolean;
  dependsOn?: string[];
  route: (ctx: TaskContext) => string;
  formId?: string;
  completeWhen?: TaskCompletionCheck;
  visibleWhen?: TaskVisibilityCheck;
  blockedWhen?: TaskBlockedCheck;
};
The task registry should live close to portal navigation code, for example:
src/tasks/taskRegistry.ts
src/tasks/taskTypes.ts
src/tasks/resolvers/
src/pages/tasks/TaskCenter.tsx
src/pages/tasks/TaskDetail.tsx
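A registry entry and a basic integrity check might look like the sketch below. It uses a trimmed-down step shape so the example is self-contained. The TaskContext fields, the SketchStep type, and the createApiVersion route are assumptions for illustration, not existing portal definitions.

```typescript
// Assumed context shape; the document does not define TaskContext.
type TaskContext = { apiId?: string; apiVersionId?: string; instanceApiId?: string };

// Trimmed-down step shape, just enough to show registry entries.
type SketchStep = {
  id: string;
  title: string;
  required: boolean;
  dependsOn?: string[];
  route: (ctx: TaskContext) => string;
};

// Example entries; both routes are placeholders for illustration.
const mcpOnboardSteps: SketchStep[] = [
  {
    id: "api",
    title: "Select or create API",
    required: true,
    route: () => "/app/form/createService",
  },
  {
    id: "version",
    title: "Select or create API version",
    required: true,
    dependsOn: ["api"],
    route: (ctx) => `/app/form/createApiVersion?apiId=${encodeURIComponent(ctx.apiId ?? "")}`,
  },
];

// Registry sanity check: every dependsOn id must name a step in the same task.
function findUnknownDependencies(steps: SketchStep[]): string[] {
  const ids = new Set(steps.map((s) => s.id));
  const unknown: string[] = [];
  for (const step of steps) {
    for (const dep of step.dependsOn ?? []) {
      if (!ids.has(dep)) unknown.push(dep);
    }
  }
  return unknown;
}
```

A check like this could run in a unit test so that a renamed step id does not silently break a task definition.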
Page And Form Metadata
To make search and tasks work well, pages and generated forms need metadata.
For generated forms, the metadata can come from Forms.json plus a small
registry override when the form title is not enough.
For custom pages, add a route/page registry:
export type PageDefinition = {
  route: string;
  title: string;
  description?: string;
  category: string;
  roles?: string[];
  keywords: string[];
  entities?: string[];
};
This registry can feed:
- sidebar sections
- Task Center
- command palette
- page breadcrumbs
- contextual next actions
The important rule is that page/form metadata should be reused, not copied into each wizard.
Task Progress Resolver
A task should not blindly ask users to complete steps that are already done. Each task can have a resolver that checks the current host and entity context.
For MCP Gateway, the resolver can check:
- API exists
- API version exists
- instance API link exists
- MCP tool configuration exists
- access control exists
The UI then marks each step:
- Complete
- Required
- Optional
- Blocked
- Skipped
- Needs review
The resolver should use existing query endpoints where possible. The first implementation can query on page load. Later, it can cache per task/session.
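The status computation itself can stay small once the completed step ids are known. In the sketch below, the completed set is assumed to come from the existing query endpoints, and the types and function name are hypothetical. Skipped and Needs review are left out because they depend on user choices rather than on backend records.

```typescript
type StepStatus = "complete" | "required" | "optional" | "blocked";
type StepDef = { id: string; required: boolean; dependsOn?: string[] };

// Derive a status for every step from the set of steps already completed.
function resolveStatuses(steps: StepDef[], completed: Set<string>): Map<string, StepStatus> {
  const result = new Map<string, StepStatus>();
  for (const step of steps) {
    const depsDone = (step.dependsOn ?? []).every((d) => completed.has(d));
    let stepStatus: StepStatus;
    if (completed.has(step.id)) {
      stepStatus = "complete";
    } else if (!depsDone) {
      stepStatus = "blocked"; // a prerequisite step is still open
    } else {
      stepStatus = step.required ? "required" : "optional";
    }
    result.set(step.id, stepStatus);
  }
  return result;
}
```

Since the input is derived from portal records on each load, the checklist cannot drift from the real data the way a stored status flag could.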
Task Navigation Shell
Instead of a full-screen wizard that owns all steps, use a task shell that can wrap or accompany existing pages.
Recommended behavior:
- A task detail page shows the checklist and current state.
- Selecting a step navigates to the existing page/form with task context in the URL or router state.
- The target page shows a compact “Task” panel or return link.
- After save, the user can return to the checklist or continue to the next recommended step.
Example URL:
/app/form/createService?task=mcp-onboard-api&returnTo=/app/tasks/mcp-onboard-api
This keeps existing page behavior intact while adding guided navigation.
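Building and reading such URLs is straightforward with the standard URLSearchParams API. The helper names below are hypothetical, and the rule that only /app/ paths are accepted as return targets is an assumption added here to avoid redirecting to arbitrary URLs.

```typescript
// Build a link to an existing page that carries the task context.
function buildTaskUrl(route: string, taskId: string, returnTo: string): string {
  const params = new URLSearchParams({ task: taskId, returnTo });
  return `${route}?${params.toString()}`;
}

// Assumption: only in-app paths are valid return targets.
function safeReturnTo(value: string | null, fallback = "/app/tasks"): string {
  return value !== null && value.startsWith("/app/") ? value : fallback;
}
```

Note that URLSearchParams percent-encodes the slashes in returnTo, so the generated query string looks different from the hand-written example while decoding to the same value.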
Global Search And Command Palette
The portal should have a global launcher. It should search tasks, pages, forms, and entities.
Examples:
- “onboard mcp”
- “create api”
- “auth client”
- “relation type”
- “instance api”
- “export snapshot”
Search results should be role-aware and host-aware.
Result types:
- Task
- Page
- Form
- Entity
- Recent item
This is the fastest way to help expert users without forcing them through a wizard.
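A first version of the launcher could be a simple keyword match over registry entries, filtered by role. The sketch below is one possible shape. The SearchEntry type and the matching rule (every query term must appear in the title or keywords) are assumptions, and host filtering and entity search are omitted.

```typescript
type SearchEntry = {
  kind: "task" | "page" | "form";
  title: string;
  keywords: string[];
  roles?: string[]; // undefined means visible to every signed-in user
};

function searchEntries(entries: SearchEntry[], query: string, userRoles: string[]): SearchEntry[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return entries.filter((entry) => {
    // Role-aware: hide entries the user could not open anyway.
    const allowed = !entry.roles || entry.roles.some((r) => userRoles.includes(r));
    if (!allowed) return false;
    const haystack = [entry.title, ...entry.keywords].join(" ").toLowerCase();
    return terms.every((t) => haystack.includes(t));
  });
}
```

An empty query returns every entry the user is allowed to see, which is a reasonable default list for the palette.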
Contextual Next Actions
Detail pages should expose next actions based on the current entity.
Examples:
- API detail
  - Add version
  - Link version to gateway
  - Configure MCP tools
  - Configure access control
- Instance detail
  - Link API version
  - Configure MCP tools
  - View gateway servers
- Auth client detail
  - Assign owner
  - Review sessions
  - Review audit
- Snapshot export
  - Convert snapshot
  - Import snapshot
These actions should come from the same task registry, not from one-off buttons hard-coded on every page.
MCP Gateway Example
The MCP Gateway quick start can be rebuilt as a task:
Task: Onboard API to MCP Gateway
Steps:
1. Select or create API
2. Select or create API version
3. Choose deployment mode
4. Link API version to gateway or sidecar instance
5. Select MCP tools
6. Configure access control
Step behavior:
- API selection is required unless apiId is already provided.
- API version is required unless apiVersionId is already provided.
- Spec upload is optional and only shown when creating a new API/version.
- Deployment mode is required when the version is not linked.
- Gateway selection is required only for centralized deployment.
- Tool selection is optional if users only want to register the server first.
- Access control is optional but should be shown as a recommended final step.
This task can support several entry points:
/app/tasks/mcp-onboard-api
/app/tasks/mcp-onboard-api?apiId=...
/app/tasks/mcp-onboard-api?apiId=...&apiVersionId=...
/app/tasks/mcp-onboard-api?instanceApiId=...
The UI should not rely on fixed step numbers. It should compute visible steps from the task context and completion state.
Task State
Start with client-side state:
- URL query parameters for entity context
- sessionStorage for in-progress task context
- existing backend records for real completion state
Later, add persisted task state if needed:
- user id
- host id
- task id
- context JSON
- skipped step ids
- last active step
- updated timestamp
Persisting task state should not become the source of truth for business data. It should only remember navigation state and user choices. Completion should be derived from actual portal records.
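The client-side part can be sketched as a thin wrapper over a storage object. The code below accepts anything with getItem and setItem, so window.sessionStorage can be passed in the browser. The state shape and the key format are hypothetical.

```typescript
// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
type KeyValueStore = {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
};

// Navigation state only; completion is still derived from portal records.
type TaskNavState = {
  context: Record<string, string>;
  skippedStepIds: string[];
  lastActiveStep?: string;
};

const storageKey = (taskId: string): string => `task-state:${taskId}`; // hypothetical key format

function saveTaskState(store: KeyValueStore, taskId: string, state: TaskNavState): void {
  store.setItem(storageKey(taskId), JSON.stringify(state));
}

function loadTaskState(store: KeyValueStore, taskId: string): TaskNavState | null {
  const raw = store.getItem(storageKey(taskId));
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as TaskNavState;
  } catch {
    return null; // corrupted entry: treat as no saved state
  }
}
```

If persisted task state is added later, the same two functions could be backed by a service call without changing their callers.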
Sidebar Role
The sidebar should become smaller and more stable. It should expose major areas, not every page/form.
Recommended sidebar sections:
- Home
- Tasks
- Marketplace
- MCP Gateway
- Operations
- Administration
Deep links should still exist, but they should be discoverable through search, contextual actions, and task detail pages.
Implementation Plan
Phase 1: Inventory And Metadata
- Create page/form metadata registry.
- Add task registry types.
- Register the most-used pages and forms.
- Add global search over registered tasks/pages/forms.
Phase 2: Task Center
- Add /app/tasks.
- Add task category cards.
- Add task detail checklist page.
- Implement client-side task context with URL parameters and session storage.
Phase 3: MCP Gateway Task
- Convert the current MCP wizard flow into an mcp-onboard-api task definition.
- Reuse existing MCP components for the pages that still need custom UI.
- Replace hard-coded step numbers with resolver-driven visible steps.
- Add return-to-task behavior after saves.
Phase 4: Contextual Actions
- Add task actions to API detail and instance detail pages.
- Add task actions to access control and config pages where appropriate.
- Use the task registry to drive action visibility.
Phase 5: Broader Rollout
- Add tasks for API publishing, config promotion, host/user management, and snapshot export/import.
- Reduce sidebar clutter once task/search usage is available.
- Add persisted task state only if session storage is not enough.
Risks And Mitigations
| Risk | Mitigation |
|---|---|
| Task registry duplicates sidebar and route definitions | Reuse page/form metadata as the source for labels, roles, and keywords |
| Task state becomes stale | Derive completion from backend records, not saved task status |
| Users lose flexibility | Allow direct page navigation and command-palette search |
| Implementation grows into another wizard framework | Route to existing pages/forms wherever possible |
| Role filtering becomes inconsistent | Centralize role checks in the page/task registry |
Recommendation
Keep the MCP Gateway wizard as a prototype, but do not build more isolated wizards in the same style. The long-term solution should be:
- a task registry
- a Task Center
- resolver-driven progress
- global search
- contextual next actions
- reuse of existing pages and generated forms
This gives new users guided paths while still letting experienced users jump directly to the page or form they already know.
User Filter
As more portal users manage their own APIs, clients, instances, schedules, and
configuration records, giving every operator a broad admin role becomes too
coarse. A broad admin can see and modify records created by other admins on the
same host. This document proposes an incremental owner-scoped filtering model
for portal-view.
The first step is a UI-side filter based on the user recorded on each row, such
as update_user. This is not a complete security boundary. The same rule must
eventually be enforced in the query and command services with fine-grained
authorization from the rule engine. The UI implementation is still useful
because it improves day-to-day user experience and gives us a concrete policy
shape to move into the service layer.
Problem
Portal admin pages were originally designed for a small set of trusted operators. Many tables expose all host-scoped records once the user can access the admin page.
That model creates problems as adoption grows:
- application owners need to manage their own APIs, clients, and instances
- broad admin roles expose unrelated records from other teams
- users can accidentally edit or delete records owned by another user or team
- creating one role per page, such as api-admin or instance-admin, still does not solve row ownership
- service-layer fine-grained authorization is not available everywhere yet
The immediate need is to let users use admin-like pages while limiting the rows they see and act on.
Current Experiment
Schedule.tsx is the first experimental page. The idea is:
- users can access the schedule admin surface
- normal users only see schedules where updateUser matches their user id
- global admins or schedule admins can still see all schedules
- the updateUser column can be hidden for normal users
- create/update/delete actions are available only on the visible set
One implementation detail matters: ownership filters must be added before the
request payload serializes the filters array.
const apiFilters = [];
// Add the ownership filter before the array is serialized
if (ownedOnly && userId) {
  apiFilters.push({ id: "updateUser", value: userId });
}
const cmdData = {
  filters: JSON.stringify(apiFilters),
};
Adding the filter after cmdData.filters is built will not send it to the
backend.
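To keep the ordering rule in one place, the filter construction could move into a shared helper that always returns the serialized value. This is a sketch: buildFiltersPayload is a hypothetical name, and replacing a user-supplied updateUser filter is an extra precaution assumed here, not something the experiment is stated to do.

```typescript
type ColumnFilter = { id: string; value: string };

// Build the serialized filters string with the owner filter already included.
function buildFiltersPayload(
  tableFilters: ColumnFilter[],
  ownedOnly: boolean,
  userId?: string,
): string {
  const apiFilters =
    ownedOnly && userId
      ? [
          // Drop any updateUser filter typed into the table so it cannot widen the scope.
          ...tableFilters.filter((f) => f.id !== "updateUser"),
          { id: "updateUser", value: userId },
        ]
      : [...tableFilters];
  return JSON.stringify(apiFilters);
}
```

As in the snippet above, no owner filter is sent when ownedOnly is set but the user id is missing, so a page may prefer to skip the request in that case.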
Design Goals
- Allow regular users to manage records they created or updated.
- Avoid giving every self-service user broad all-record admin visibility.
- Keep the admin table implementation familiar and incremental.
- Centralize the owner filter logic instead of duplicating it page by page.
- Make the UI rule match the future service-layer rule as closely as possible.
- Preserve host scoping and existing role-based page visibility.
- Avoid presenting UI-side filtering as a security boundary.
Non-Goals
- Do not claim UI filtering is sufficient authorization.
- Do not replace service-layer rule-engine enforcement.
- Do not solve full team ownership in the first UI-only pass.
- Do not migrate every admin page in one large change.
- Do not overload update_user as the permanent ownership model if a better owner field exists or can be added.
Ownership Model
There are several possible ownership signals. They should be treated in this order of preference.
| Field | Meaning | Recommendation |
|---|---|---|
| owner_user_id | explicit individual owner | best long-term user ownership field |
| owner_position_id | explicit position or org-unit owner | best long-term team/hierarchy ownership field |
| create_user | original creator | good fallback if available |
| update_user | last updater | useful interim fallback, but not true ownership |
| domain-specific owner, such as operation_owner | business owner | useful when the field is reliable and normalized |
update_user is acceptable for the first UI experiment because many tables
already have it. However, it has an important semantic problem: ownership moves
to whoever last updated the row. If Alice creates an API and Bob updates it,
Bob becomes the owner under an update_user rule.
The long-term model should add explicit owner fields where needed:
owner_user_id
owner_position_id
owner_group_id is intentionally deferred. Groups are still useful for flat
team membership, but position ownership fits the portal authorization model
better when access should follow the organization hierarchy. owner_org_id is
also deferred because normal portal records are already scoped by host_id, and
host_t links back to org_t through the host domain. Add organization-level
ownership only if a future cross-host/global ownership use case requires it.
Do not add created_by and updated_by as authorization fields in Phase 4.
The existing update_user and update_ts columns remain the last-updater audit
trail. If creator audit becomes important, add create_user and create_ts as
audit fields later, not as substitutes for stable ownership.
Until explicit owner columns exist, each page should declare which field is used for interim UI owner filtering.
Role Model
Use one page per entity type, but separate page visibility from row scope.
| Role | Meaning | Page access | Row scope |
|---|---|---|---|
| user | baseline signed-in portal user | only approved self-service admin pages | owned records only |
| admin | global portal administrator, effectively super admin | all admin pages | all records |
| <entity>-admin | administrator for one entity type, such as schedule-admin | that entity’s admin page | all records for that entity |
| platform-admin | deployment platform administrator if this role is kept | platform/deployment platform pages only | not a global all-record role |
Do not give every user account access to every admin page. Only pages that are
safe for self-service ownership should be exposed to user, and each of those
pages must apply the owner filter and action guards.
The admin role can be repurposed as the global all-record role once the
sidebar stops using it as a broad menu marker. Role checks must use exact role
tokens. A role such as schedule-admin must not match admin through substring
checks.
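The exact-token rule can be sketched as follows. This is a minimal illustration, assuming roles arrive as a single space-separated string (as in the role: "user schedule-admin" menu marker); the function names parseRoleTokens and hasExactRole are hypothetical and not taken from the portal-view source.

```typescript
// Hypothetical helpers; the space-separated role format is an assumption.
function parseRoleTokens(roles: string | null | undefined): Set<string> {
  // Split on whitespace and drop empty fragments so each role is one token.
  return new Set((roles ?? "").split(/\s+/).filter((token) => token.length > 0));
}

function hasExactRole(roles: string | null | undefined, role: string): boolean {
  // Set membership compares whole tokens, unlike String.prototype.includes.
  return parseRoleTokens(roles).has(role);
}
```

With this shape, "user schedule-admin".includes("admin") is true, while hasExactRole("user schedule-admin", "admin") is false, which is the behavior the role model requires.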
Access Modes
The UI should support three access modes.
Owner-Scoped Admin
This is the default self-service mode. The user can open admin pages, but rows are filtered to records they own.
Example:
roles: user
scope: owned
filter: updateUser = current user id
All-Scope Admin
This is for operators who can see and manage every record on the current host.
Example roles:
admin
schedule-admin
The default all-scope role is admin. Page-specific roles such as
schedule-admin can opt a user into all-record visibility for one area. Do not
use platform-admin as a global all-scope role because the portal already has a
Platform Admin page for deployment platform management.
Read-Only or Support View
Some users may need to see records without modifying them. This can be added later with separate flags:
canReadAll = true
canWriteOwned = true
canWriteAll = false
Proposed UI Architecture
Add a small ownership-scope helper used by admin pages.
Example shape:
type OwnershipScopeOptions = {
roles?: string | null;
userId?: string | null;
ownerField: string;
allScopeRoles?: string[];
};
type OwnershipScope = {
ownedOnly: boolean;
ownerFilter: { id: string; value: string } | null;
canWriteAll: boolean;
};
Example usage:
import {
applyOwnershipFilter,
defaultAllScopeRoles,
ownershipScope,
} from "../utils/ownershipScope";
const ownership = ownershipScope({
roles,
userId,
ownerField: "updateUser",
allScopeRoles: [...defaultAllScopeRoles, "schedule-admin"],
});
const apiFilters = applyOwnershipFilter(columnFiltersWithoutActive, ownership);
This helper should live near other portal navigation/task utilities or in a small access utility module, for example:
src/utils/ownershipScope.ts
or:
src/tasks/accessScope.ts
The helper should not call the backend. It only computes the UI filter and UI capabilities from the current user state.
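One possible implementation of the helper is sketched below. The type shapes and the exported names come from the examples above; the function bodies, the ColumnFilter type, and the default role list containing only admin are assumptions made for illustration, not the actual portal-view code.

```typescript
// Sketch only: the bodies below are assumptions, not the portal-view source.
type ColumnFilter = { id: string; value: unknown };

type OwnershipScopeOptions = {
  roles?: string | null;
  userId?: string | null;
  ownerField: string;
  allScopeRoles?: string[];
};

type OwnershipScope = {
  ownedOnly: boolean;
  ownerFilter: { id: string; value: string } | null;
  canWriteAll: boolean;
};

// Assumed default: only the exact "admin" token grants all-record scope.
const defaultAllScopeRoles: string[] = ["admin"];

function ownershipScope(options: OwnershipScopeOptions): OwnershipScope {
  // Split roles into exact tokens so "schedule-admin" never matches "admin".
  const tokens = new Set((options.roles ?? "").split(/\s+/).filter(Boolean));
  const allScopeRoles = options.allScopeRoles ?? defaultAllScopeRoles;
  if (allScopeRoles.some((role) => tokens.has(role))) {
    return { ownedOnly: false, ownerFilter: null, canWriteAll: true };
  }
  // Owner-scoped mode: without a user id there is no safe filter to build.
  const ownerFilter = options.userId
    ? { id: options.ownerField, value: options.userId }
    : null;
  return { ownedOnly: true, ownerFilter, canWriteAll: false };
}

function applyOwnershipFilter(
  filters: ColumnFilter[],
  scope: OwnershipScope
): ColumnFilter[] {
  if (!scope.ownedOnly) return filters;
  const ownerFilter = scope.ownerFilter;
  if (ownerFilter === null) {
    // Refuse to build a query rather than fall back to all-record visibility.
    throw new Error("User context is required for an owner-scoped query");
  }
  // Replace any caller-supplied filter on the owner field with the enforced one.
  return [...filters.filter((f) => f.id !== ownerFilter.id), ownerFilter];
}
```

Throwing when the user id is missing is one design option; a page could equally check ownerFilter before querying and show the "user context is required" message instead.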
Sidebar Behavior
The sidebar should not use admin as a marker on every admin menu link. That
made the whole Administration group disappear for normal users and prevented
owner-scoped self-service pages from being reachable.
Recommended behavior:
- admin users see every Administration link.
- non-admin users see only Administration links explicitly marked with user or a matching entity role, such as role: "user schedule-admin".
- only add user to a link after that page has owner-scoped filtering and action guards.
- remove role: "admin" from individual menu links.
- use exact role-token matching instead of string includes, so schedule-admin does not accidentally grant admin.
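The recommended sidebar behavior could look roughly like this. The MenuLink type and the visibleAdminLinks function are hypothetical names for illustration; the sketch assumes that a link without a role marker is visible only to exact admin users.

```typescript
// Hypothetical menu shape: role is a space-separated list of tokens, or absent.
type MenuLink = { label: string; role?: string };

function visibleAdminLinks(links: MenuLink[], userRoles: string): MenuLink[] {
  const userTokens = new Set(userRoles.split(/\s+/).filter(Boolean));
  // The exact "admin" token sees every Administration link.
  if (userTokens.has("admin")) return links;
  // Everyone else sees only links whose markers share an exact token with them.
  return links.filter((link) => {
    const linkTokens = (link.role ?? "").split(/\s+/).filter(Boolean);
    return linkTokens.some((token) => userTokens.has(token));
  });
}
```

For example, a user holding only schedule-admin would see a link marked role: "user schedule-admin" but not an unmarked configuration link.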
At the Phase 3 rollout point, the following Administration links are safe to
expose to user because the pages apply the shared owner-scope helper and
action guards:
- API Admin
- API Detail
- OAuth Auth Client and Client Token
- App Admin
- Instance Admin, Runtime Instance, and instance relationship pages
- Schedule Admin
- Workflow Definition
Configuration, platform admin, user/role admin, workflow process/task/audit pages, and lower-volume metadata pages should remain admin-only until they have the same owner-scope treatment or a separate support/read-only policy.
Admin Page Behavior
For an owner-scoped user:
- add the owner filter before the query payload is serialized
- hide the owner column if it does not add useful information
- show a small scope label such as “My records”
- keep create actions available
- allow update/delete only for rows matching the ownership rule
- preserve normal table sorting, pagination, and global filter behavior
For an all-scope admin:
- do not add the owner filter
- show a scope label such as “All host records”
- show the owner/update columns
- allow existing admin actions
For a user without enough context:
- if userId is missing, do not run an owner-scoped query
- show a clear message that user context is required
- avoid falling back to all-record visibility
Action-Level Guard
List filtering is not enough for a good UI. Row actions should also check the same scope.
Example:
const canUpdateRow =
ownership.canWriteAll ||
row.original.updateUser === userId;
For rows the user cannot modify:
- hide destructive actions, or
- disable them with a tooltip explaining the scope
Even after service-layer authorization is implemented, the UI should keep these guards so users understand why an action is unavailable.
Phase 4 Ownership Columns
For high-value entity tables, add canonical owner columns directly on the entity row:
owner_user_id UUID NULL
owner_position_id VARCHAR(128) NULL
Recommended constraints where the table has host_id:
FOREIGN KEY (host_id, owner_user_id)
REFERENCES user_host_t(host_id, user_id)
FOREIGN KEY (host_id, owner_position_id)
REFERENCES position_t(host_id, position_id)
Both owner columns should be nullable during migration. New records should get
owner_user_id from the authenticated user on the service side by default. Do
not trust a browser-submitted owner user id unless the caller has permission to
assign ownership.
owner_position_id should be optional on create. The UI can show a host
position dropdown populated from the user’s allowed positions. If the user has
exactly one effective position and the page is configured for position
ownership, the UI can default to that position. If the user has multiple
positions, require an explicit choice when position ownership is desired.
For portal forms, the optional position owner field should be exposed as
ownerPositionId and backed by the existing position label dynaselect query.
The form action uses the position/getPositionLabel endpoint, which is backed
by the queryPositionLabel persistence method and returns the id/label pairs
needed by the select control.
Do not expose ownerUserId as a normal create/update form field. The command
path must derive owner_user_id from the authenticated user in the event
context. If an owner-transfer use case is needed later, implement it as a
separate command with explicit authorization and audit behavior.
Normal update forms may update owner_position_id when the page allows the
caller to choose or clear the owning position. update_user changes on every
update and remains audit metadata. owner_user_id should not change on normal
update; it changes only through an explicit owner-transfer action restricted to
the current owner, admin, or the relevant entity-admin role.
Existing rows should be migrated conservatively:
- if update_user can be resolved to a user in the host, it can be used as an initial owner_user_id
- leave owner_position_id null unless there is a reliable source for the owning position
- rows with no owner columns populated should be treated as unassigned legacy rows, visible only to all-scope admins until an owner is assigned
Service-Layer Target
The UI filter is an interim step. The durable solution belongs in the query and command services.
The service layer should eventually:
- derive user id, roles, host id, and scopes from JWT claims
- ignore client-supplied owner filters as an authorization source
- inject owner predicates into query handlers based on the authenticated user
- reject update/delete commands when the user does not own the row and lacks all-scope permission
- use rule-engine policies for exceptions and domain-specific ownership
Once service-side owner enforcement is implemented, the UI should no longer be the source of authorization predicates. The service should inject the ownership predicate from authenticated user context and rule-engine decisions.
The UI should still keep owner-aware behavior for usability:
- show “My records” or “Admin View” scope labels
- hide or show owner columns based on the user’s scope
- disable update/delete actions that the current user cannot take
- optionally send a simple view hint such as scope=owned or scope=all
The service must treat any UI-supplied scope or owner filter as a hint only. It must ignore, override, or reject filters that would expand the caller’s authorized scope.
For owner-scoped users, the service-side predicate should be an OR condition:
owner_user_id = current_user_id
OR owner_position_id IN current_user_effective_positions
For all-scope admins, such as admin or the relevant entity-admin role, the
service should omit this owner predicate and return all rows within the normal
host scope.
The UI and backend should share the same policy concepts:
host scope
entity type
owner field
owned-only permission
all-record permission
read vs write capability
Position hierarchy must be resolved by the service layer or rule engine. A JWT
claim such as pos=ai-engineer only grants exact-position access unless the
service expands it to effective positions from position_t and
user_position_t. If hierarchy is enabled, the effective position set should
include inherited positions according to the existing position inheritance
rules.
Rows with owner_position_id IS NULL are not position-owned. A user can still
see the row if owner_user_id matches their user id. Rows where both
owner_user_id and owner_position_id are null are unassigned legacy rows and
should not be visible to normal owner-scoped users by default.
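The visibility rules above, including the null cases, can be expressed as a single predicate. The sketch below evaluates it in memory for clarity; in the services this would be a query predicate rather than a per-row check. The type and function names are hypothetical, and effectivePositions is assumed to be already expanded by the service layer.

```typescript
// Hypothetical shapes; field names mirror owner_user_id and owner_position_id.
type OwnedRow = { ownerUserId: string | null; ownerPositionId: string | null };
type CallerScope = {
  userId: string;
  effectivePositions: string[]; // assumed to include inherited positions
  allScope: boolean; // admin or the relevant entity-admin role
};

function canSeeRow(row: OwnedRow, caller: CallerScope): boolean {
  // All-scope callers skip the owner predicate and see every row on the host.
  if (caller.allScope) return true;
  // Direct user ownership.
  if (row.ownerUserId !== null && row.ownerUserId === caller.userId) return true;
  // Position ownership; a null owner_position_id never matches.
  return (
    row.ownerPositionId !== null &&
    caller.effectivePositions.includes(row.ownerPositionId)
  );
}
```

A row with both owner fields null returns false for every owner-scoped caller, which matches the unassigned legacy row rule.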
Rule Engine Direction
The rule engine can express policies such as:
user can read API when api.owner_user_id == user.user_id
user can update API when api.owner_user_id == user.user_id
admin can read all APIs on host
admin can update all APIs on host
api-admin can read all APIs on host
api-admin can update all APIs on host
support can read all APIs but cannot update
For tables that do not yet have explicit ownership fields, the policy can
temporarily map ownership to update_user.
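To make the read and update distinction concrete, the example policies can be modeled as data plus a small evaluator. This is not the rule engine's configuration syntax; it is a hypothetical TypeScript model of the same decisions, with all type and function names invented for illustration.

```typescript
// Hypothetical model of the example policies; not rule-engine syntax.
type Action = "read" | "update";
type Api = { ownerUserId: string | null };
type Subject = { userId: string; roles: string[] };

// Roles that act on all APIs on the host, and the actions each one allows.
const allRecordActions = new Map<string, Action[]>([
  ["admin", ["read", "update"]],
  ["api-admin", ["read", "update"]],
  ["support", ["read"]],
]);

function isAllowed(subject: Subject, action: Action, api: Api): boolean {
  // All-record roles are checked first, using exact role names.
  const byRole = subject.roles.some((role) =>
    (allRecordActions.get(role) ?? []).includes(action)
  );
  if (byRole) return true;
  // Baseline users may read and update only the APIs they own.
  return subject.roles.includes("user") && api.ownerUserId === subject.userId;
}
```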
Rollout Plan
Phase 1: Fix Schedule Experiment
- Fix filter ordering so updateUser is included in the request.
- Use roles plus user id to decide owner-scoped vs all-scope mode.
- Add action-level guards for update/delete.
- Keep the current route behavior unchanged.
Phase 2: Add Reusable UI Helper
- Create a shared ownership-scope helper.
- Add unit-level coverage if the repo has a practical test pattern.
- Document default all-scope roles.
- Keep owner field configurable per page.
Phase 3: Apply To High-Value Admin Pages
Start with pages where users commonly manage their own records:
- API admin
- API detail/version admin
- OAuth clients
- client apps
- instances
- instance API links
- schedules
- workflow definitions
Then expand to lower-volume metadata pages.
Current implementation status:
- src/utils/ownershipScope.ts centralizes exact role matching, owner-scope calculation, owner filter injection, and owner-column hiding.
- Sidebar access now exposes only scoped links to user or matching entity-admin roles, while exact admin continues to see all Administration links.
- API pages use admin and api-admin for all-record scope, with user limited by updateUser.
- OAuth client pages use admin and oauth-client-admin for all-record scope, with user limited by updateUser.
- Client app pages use admin and app-admin for all-record scope, with user limited by updateUser.
- Instance pages use admin and instance-admin for all-record scope, with user limited by updateUser.
- Schedule pages use admin and schedule-admin for all-record scope, with user limited by updateUser.
- Workflow Definition uses admin and workflow-admin for all-record scope, with user limited by updateUser.
- Task/page search registries use exact role-token checks so schedule-admin or another entity-admin role does not accidentally match global admin, while exact admin still has global visibility.
Deferred from this phase:
- Workflow Process, Task, Worklist, Work, Audit, and Trace remain admin-only until their ownership rules are defined and implemented.
- Configuration and platform pages remain admin-only because their ownership model is not yet defined.
- User and role administration remain admin-only because exposing them to self-service users would require a separate delegated-administration model.
Phase 4: Add Explicit Ownership Fields
Where update_user is too weak, add proper owner fields through the database
and services.
Candidate fields:
owner_user_id
owner_position_id
Apply these first to the high-value tables that already have owner-scoped admin
pages. Keep the fields nullable during migration, default owner_user_id from
the authenticated user on create, and make owner transfer explicit.
Current implementation status:
- portal-db adds nullable owner_user_id and owner_position_id columns to the high-value portal tables used by the owner-scoped admin pages.
- The migration backfills owner_user_id from update_user only when update_user is already a UUID. Non-UUID audit values remain unassigned instead of blocking the migration.
- A database insert trigger defaults owner_user_id from update_user for new rows when the command path writes the authenticated user id into update_user.
- Query projections for the scoped UI pages now return ownerUserId and ownerPositionId, and UUID filtering recognizes ownerUserId.
- portal-view now uses ownerUserId for ownership checks on action controls. The UI no longer sends an owner filter for service-enforced pages because service-side scope must include both direct user ownership and position ownership.
- Owner-aware create/update forms expose optional ownerPositionId with a host-scoped position dynaselect backed by queryPositionLabel.
- Command schemas allow optional ownerPositionId for the owner-aware create and update commands. They do not accept ownerUserId; owner_user_id comes from the authenticated event user.
- light-portal persistence writes owner_user_id from the event user on create and writes owner_position_id from ownerPositionId on create/update.
- Schedule query is the first service-enforced owner-scope path. Non all-scope users are filtered by owner_user_id = current_user_id OR owner_position_id IN effective positions based on authenticated audit context.
Remaining rollout work:
- Add explicit owner-transfer commands instead of changing ownership through normal update forms.
Phase 5: Enforce In Services
- Add query-side owner predicates.
- Add command-side ownership checks.
- Move policy decisions into rule-engine configuration.
- Keep the UI filters as usability hints, not authorization.
Current implementation status:
- Query-side owner predicates are implemented for Schedule, API, API Version, App, OAuth Client, Client Token, Instance, Instance API, Instance API Path Prefix, Instance App, Instance App API, Runtime Instance, and Workflow Definition.
- Query handlers derive scope from the authenticated audit attachment. Users with the global admin role or the entity-specific all-scope role bypass the owner predicate; other users are scoped by user id or effective positions.
- The UI keeps owner-aware action guards, but it does not send the owner filter as a request filter for service-enforced pages. That keeps position-owned rows visible when the service grants access by owner_position_id.
- The db-provider keeps backward-compatible query methods and adds owner-aware overloads so query services can roll forward independently.
Remaining service rollout work:
- Add command-side ownership checks before update/delete actions.
- Add explicit owner-transfer commands and audit events.
- Move the all-scope role and position hierarchy decisions from Java guards into rule-engine policy once the service-side rule context is ready.
Future Improvement: Entity Access Grants
Do not introduce a generic ownership table in Phase 4. It adds query joins, pagination complexity, and weaker referential integrity before we have a clear sharing use case.
A generic table can be added later for secondary grants, sharing, and delegated administration. It should supplement the canonical owner columns rather than replace them.
Possible future shape:
entity_access_t
host_id
entity_type
entity_id
principal_type -- user, position, group, role
principal_id
access_level -- owner, maintainer, viewer
Use this only when we need use cases such as:
- share one API with another position or group
- give support read-only access to a selected set of records
- delegate maintenance without transferring the canonical owner
- manage record-specific exceptions from an Access Admin page
Risks And Mitigations
| Risk | Mitigation |
|---|---|
| UI filter is bypassed | Treat it as interim only; enforce in services next |
| update_user changes ownership unexpectedly | Prefer explicit owner fields; use update_user only as fallback |
| users lose access to records updated by operators | support owner transfer or explicit owner fields |
| inconsistent page behavior | centralize the scope helper and roll out page by page |
| broad admins still need all records | define all-scope roles separately from self-service admin |
| query filters can be removed by browser tools | backend must inject authorization predicates from JWT claims |
Recommendation
Use owner-scoped filtering as the first UI step, but centralize it immediately.
Do not copy the Schedule.tsx logic into every page by hand.
The recommended path is:
- fix the schedule filter ordering
- introduce a reusable ownership-scope helper
- apply it to the most common self-service admin pages
- add explicit owner fields where update_user is not good enough
- enforce the same rules in query and command services through the rule engine
This gives users a safer admin experience now while creating a clear migration path to real fine-grained authorization.
Contextual Help Links
portal-view has many pages, generated forms, task flows, and admin tables.
Even with the task-oriented navigation work, users still need page-specific and
form-specific help when they are making a decision or filling a field. This
document proposes a contextual help-link model for pages and forms.
Problem
Users often need help at the exact point where they are working:
- what this page is for
- when to use this form
- what required fields mean
- which optional fields matter
- what permissions or ownership rules apply
- what happens after submit
- how this page fits into a larger task
Today, help is usually outside the UI context. Users must know where to look, which document applies, and which page or form name maps to the screen in front of them.
Design Goals
- Add a clear help entry point to every major page and generated form.
- Keep help content close to the product documentation source of truth.
- Avoid bloating the portal-view application bundle with documentation.
- Allow documentation-only updates without rebuilding portal-view.
- Make help links declarative so page, form, and task metadata can drive them.
- Keep link identifiers stable even if routes or component names change.
- Support future documentation search, related topics, and task-specific help.
- Preserve the ability to run the app locally with a configurable docs base URL.
Non-Goals
- Do not build a full documentation authoring system inside portal-view.
- Do not duplicate long user guides in component source files.
- Do not block a page or form rollout because full documentation is missing.
- Do not use contextual help as a replacement for better labels, validation, or field-level error messages.
Documentation Location Decision
The help content should live in light-portal-doc. portal-view should store
only metadata that points to the relevant help page.
Recommended split:
light-portal-doc
src/help/portal-view/
pages/
forms/
tasks/
concepts/
portal-view
page registry, task registry, and form metadata with help ids or help paths
Why light-portal-doc
Pros:
- Keeps user-facing documentation in the documentation repo.
- Allows documentation changes without rebuilding or redeploying portal-view.
- Avoids increasing the app bundle with markdown content.
- Supports documentation search, navigation, publishing, and review workflows.
- Allows the same help content to be linked from support tickets, onboarding, release notes, and external docs.
- Fits the existing pattern where portal-view design docs already live in
light-portal-doc.
Cons:
- Requires stable published URLs.
- Requires a configurable docs base URL for local and deployed environments.
- Can drift from UI behavior unless we add link validation and ownership rules.
Why Not portal-view/docs
Pros:
- Easy to review UI and docs in one PR.
- Help content can be tightly coupled to the component version.
- Local development does not need a separate docs deployment.
Cons:
- Documentation-only changes require app rebuilds and deployments.
- Large markdown content can bloat the frontend repo and build context.
- It is harder to provide a proper documentation navigation/search experience.
- It encourages implementation notes and user help to mix in the same repo.
Recommendation: use light-portal-doc for content and keep portal-view
limited to stable link metadata.
Help Content Structure
Create a user-facing help tree separate from design docs:
src/help/portal-view/
pages/
api-admin.md
api-detail.md
instance-admin.md
schedule-admin.md
forms/
create-api.md
update-api.md
create-client.md
update-instance.md
tasks/
mcp-onboard-api.md
register-standalone-mcp-server.md
concepts/
ownership-and-positions.md
hosts-and-user-hosts.md
api-versioning.md
Use page-level help for screen orientation and form-level help for submission semantics. Use concept help for reusable explanations that should not be copied into many page/form documents.
URL Strategy
Help URLs should be stable and human-readable.
Recommended public URL shape:
/help/portal-view/pages/api-admin
/help/portal-view/forms/create-api
/help/portal-view/tasks/mcp-onboard-api
/help/portal-view/concepts/ownership-and-positions
Do not make the public URL depend on React route internals or component names.
If a route changes from /app/api to another route later, the help URL should
not need to change.
portal-view should build the absolute link from a runtime config value:
PORTAL_DOC_BASE_URL=https://doc.lightapi.net
or for Vite:
VITE_PORTAL_DOC_BASE_URL=https://doc.lightapi.net
Local development can point to a local docs server:
VITE_PORTAL_DOC_BASE_URL=http://localhost:3000
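A minimal sketch of building the absolute link from the configured base and a relative help path follows. The function name buildHelpUrl is hypothetical. In a Vite build the base would typically be read from import.meta.env.VITE_PORTAL_DOC_BASE_URL; the sketch takes it as a parameter so it can run outside the bundler.

```typescript
// Hypothetical helper: joins the docs base URL and a version-neutral help path.
function buildHelpUrl(docBaseUrl: string, helpPath: string): string {
  // Strip trailing slashes from the base and make sure the path has one leading slash.
  const base = docBaseUrl.replace(/\/+$/, "");
  const path = helpPath.startsWith("/") ? helpPath : `/${helpPath}`;
  // Plain concatenation keeps any path segment in the base, such as "/v2.0".
  // new URL(path, base) would drop that segment because helpPath is path-absolute.
  return `${base}${path}`;
}
```

Concatenation is a deliberate choice here: it lets a deployment point at a base URL that carries its own path prefix without changing any helpPath value.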
Metadata Contract
Use a stable help id or help path in the app metadata. A help path is more direct and easier to validate.
Page registry example:
{
id: "api-admin",
title: "API Admin",
route: "/app/apis",
helpPath: "/help/portal-view/pages/api-admin",
}
Task registry example:
{
id: "mcp-onboard-api",
title: "Onboard API to MCP Gateway",
helpPath: "/help/portal-view/tasks/mcp-onboard-api",
}
Form metadata example:
{
"formId": "createApi",
"helpPath": "/help/portal-view/forms/create-api",
"actions": []
}
If we need indirection later, we can change to helpId and resolve it through a
small registry:
{
helpId: "forms.create-api"
}
Start with helpPath because it is simple, transparent, and works well with
static documentation.
Portal UI Behavior
Each page and generated form should have a small help action in a predictable location.
Recommended behavior:
- open help in a new browser tab
- use an external-link icon or help icon with an accessible label
- keep the help action near the page title or form title
- if a form is opened inside a task shell, prefer form-specific help first and show task help as a secondary link
- if no specific help exists yet, fall back to the nearest page or concept help
Example resolution order for a form opened from a task:
- form helpPath
- current task helpPath
- current page helpPath
- generic portal help landing page
Do not render a broken link. If a help path is missing, hide the action or show the fallback help link.
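The resolution order can be sketched as a small pure function. The HelpContext type and resolveHelpPath name are hypothetical, and the sketch assumes the generic landing page lives at /help/portal-view/index.

```typescript
// Hypothetical context: each level may or may not declare a helpPath.
type HelpContext = {
  formHelpPath?: string;
  taskHelpPath?: string;
  pageHelpPath?: string;
};

// Assumed location of the generic portal-view help landing page.
const GENERIC_HELP_PATH = "/help/portal-view/index";

function resolveHelpPath(context: HelpContext): string {
  // Most specific first: form, then task, then page.
  const candidates = [
    context.formHelpPath,
    context.taskHelpPath,
    context.pageHelpPath,
  ];
  const found = candidates.find(
    (path) => typeof path === "string" && path.trim().length > 0
  );
  // Fall back to the landing page so the link is never broken.
  return found ?? GENERIC_HELP_PATH;
}
```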
Generated Forms
Generated forms should support a top-level helpPath field in Forms.json.
The renderer can read it and show a help action in the form header.
For example:
{
"formId": "createSchedule",
"helpPath": "/help/portal-view/forms/create-schedule",
"schema": {},
"form": []
}
Field-level help can be added later, but it should not be the first step. Many field descriptions can stay in the JSON schema title/description. Use field-level help only for fields where a short description is not enough, such as security, ownership, deployment, or advanced configuration fields.
Possible future field shape:
{
"key": "ownerPositionId",
"helpPath": "/help/portal-view/concepts/ownership-and-positions"
}
Task-Aware Help
The task-oriented navigation layer should support task help separately from page or form help. A user working on the same form may need different context depending on the task.
Example:
- createApi opened from “Register a new API” links to create API form help.
- createApi opened from “Onboard API to MCP Gateway” can also link to MCP onboarding task help.
The UI should pass task context through existing task URL parameters and layout state, then render both links when useful:
Help: Create API
Related: Onboard API to MCP Gateway
Authoring Guidelines
Each page help document should include:
- what the page is used for
- who can access it
- what records are visible
- common actions
- links to related forms and tasks
Each form help document should include:
- when to use the form
- what happens after submit
- required fields
- important optional fields
- ownership and permission behavior
- validation or troubleshooting notes
Keep help content user-facing. Do not put implementation details, class names, or database internals in the main help body unless they are truly needed for an operator.
Validation
To prevent link drift, add a lightweight validation step once the first help docs exist.
Validation should check:
- every helpPath in portal-view points to a markdown source in light-portal-doc
- every high-value page has page help
- every high-value form has form help
- no help path uses a route-specific or component-specific unstable name
This can start as a script in light-portal-doc or a shared CI check that
accepts both repo paths.
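The core of such a check could be two pure functions, sketched below. The names are hypothetical. The mapping from a help path to a source file (prefix src, suffix .md) and the lowercase kebab-case naming rule are assumptions based on the directory layout above; a real script would collect the help paths and the file list from both repositories before calling them.

```typescript
// Assumed mapping:
// /help/portal-view/pages/api-admin -> src/help/portal-view/pages/api-admin.md
function helpPathToSource(helpPath: string): string {
  return `src${helpPath}.md`;
}

// Help paths that have no matching markdown source in light-portal-doc.
function findBrokenHelpPaths(
  helpPaths: string[],
  docSources: Set<string>
): string[] {
  return helpPaths.filter((path) => !docSources.has(helpPathToSource(path)));
}

// Assumed stability rule: one of the four known sections, then a kebab-case id.
const STABLE_HELP_PATH =
  /^\/help\/portal-view\/(pages|forms|tasks|concepts)\/[a-z0-9]+(-[a-z0-9]+)*$/;

// Help paths that look route-specific or component-specific.
function findUnstableHelpPaths(helpPaths: string[]): string[] {
  return helpPaths.filter((path) => !STABLE_HELP_PATH.test(path));
}
```

A path such as /help/portal-view/forms/CreateApiForm would be reported as unstable under this rule because it mirrors a component name. A generic landing page outside the four sections would need to be allowed separately.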
Rollout Plan
Phase 1: Documentation Structure
- Create src/help/portal-view/pages.
- Create src/help/portal-view/forms.
- Create src/help/portal-view/tasks.
- Create src/help/portal-view/concepts.
- Add placeholder help pages for the high-value admin pages and forms.
Phase 2: App Metadata
- Add optional helpPath to pageRegistry.ts.
- Add optional helpPath to taskRegistry.ts.
- Add optional top-level helpPath to generated form metadata.
- Add a docs base URL runtime config.
Phase 3: UI Components
- Add a reusable help-link component.
- Render page help near page titles.
- Render form help in the generated form header.
- Render task help in the task navigation shell.
- Add fallback behavior when a specific help link is missing.
Phase 4: Coverage And Validation
- Add help paths for all self-service owner-scoped admin pages.
- Add help paths for all high-value create/update forms.
- Add a validation script for help path coverage and broken links.
- Add missing docs over time as pages move into the task-oriented model.
Initial Scope
Start with the pages and forms most likely to be used by self-service users:
- API Admin and API Detail
- create/update API
- create/update API Version
- App Admin
- create/update App
- OAuth Client and Client Token
- create/update Client
- create Client Token
- Instance Admin and relationship pages
- create/update Instance
- create Instance API
- create/update Instance API Path Prefix
- create Instance App
- create Instance App API
- Schedule Admin
- create/update Schedule
- Workflow Definition
- create/update Workflow Definition
Then expand to admin-only pages after their ownership and access model is clear.
MVP Decisions
Use these decisions for the first implementation.
Missing Help Links
Do not hide the help action when a specific page, form, or task help path is missing. Fall back to the generic portal-view help landing page:
/help/portal-view/index
This keeps the UI consistent. A missing specific help page should degrade to general help instead of making the help affordance disappear.
Help Presentation
Open help in a new browser tab for the MVP. Do not build an embedded markdown viewer, side drawer, or iframe-based documentation panel in the first version.
This keeps portal-view small and avoids adding documentation rendering,
iframe, routing, and panel-state complexity to the app. A side panel can be
revisited later if users need in-page help while editing long forms.
JSON Schema Descriptions
Do not auto-generate full form help pages from JSON schema descriptions. Schema titles and descriptions are best used for inline labels, helper text, or field-level tooltips.
Form-level help should explain why the form exists, when to use it, what happens after submit, and how the form fits into a larger workflow. It should not simply repeat field types and required flags.
Documentation Versioning
Use latest documentation URLs for the MVP. Do not introduce release-versioned help URLs in the first implementation.
The portal will likely support both cloud SaaS deployments and enterprise on-premise deployments. SaaS users normally interact with the latest deployed portal, but enterprise customers may run older portal versions for a longer period. Versioned docs are therefore a good future requirement, but they should not block the first help-link rollout.
Keep helpPath values relative and version-neutral:
/help/portal-view/forms/create-api
Then versioning can be introduced later by changing only the configured docs base URL:
PORTAL_DOC_BASE_URL=https://doc.lightapi.net/v2.0
This keeps the app metadata stable while allowing SaaS to use latest docs and on-premise builds to point at version-specific documentation.
Future Enhancements
In-Page Help Drawer
Add an optional in-page help drawer after the helpPath metadata is stable and
the first new-tab implementation has proven useful.
The drawer should be opt-in, not the default for every form. Long or complex configuration forms can declare:
{
"helpPath": "/help/portal-view/forms/update-instance",
"inPageHelp": true
}
When enabled, the UI can render a right-side drawer that displays the help document through an iframe or a lightweight markdown renderer. This avoids constant tab switching for complex forms while keeping the MVP simple.
Field-Level Help Paths
Add field-level help paths sparingly for complex fields and architectural concepts. Standard fields should continue to use JSON schema titles, descriptions, helper text, or tooltips.
Example future field metadata:
{
"key": "ownerPositionId",
"helpPath": "/help/portal-view/concepts/ownership-and-positions"
}
The UI can render a small help icon next to the field label when a field-level
helpPath exists. Good candidates include ownership, security, OAuth token
exchange, deployment target, transport configuration, and workflow definition
fields.
Versioned Documentation
Add release-versioned documentation when multiple portal versions must be supported at the same time, especially for on-premise enterprise deployments.
The relative helpPath values should remain unchanged. The deployment or build
configuration should select the versioned docs base URL:
SaaS/latest:
PORTAL_DOC_BASE_URL=https://doc.lightapi.net
On-premise v2.0:
PORTAL_DOC_BASE_URL=https://doc.lightapi.net/v2.0
This gives cloud deployments a simple latest-docs experience and gives
enterprise deployments a path to version-matched help without changing
portal-view metadata.
Recommendation
Store user-facing help content in light-portal-doc and add declarative
helpPath metadata in portal-view. This keeps documentation maintainable and
publishable while allowing every page, form, and task to provide context-aware
help from the UI.
Event Processing Notifications
Portal commands are event driven. After a command is submitted, one or more
CloudEvents are written to event_store_t and outbox_message_t. The
hybrid-query event consumer later processes the outbox rows and updates the
projection tables used by portal-view.
The notification page in the user profile is intended to show the user the
recent processing result for those events. Today the table and read path exist,
but notification_t is not populated consistently, so the page cannot provide
meaningful status.
Current State
The command path already writes events through the common command handler:
- The command handler validates and enriches the request.
- It builds one or more CloudEvents.
- It inserts those events into event_store_t and outbox_message_t.
- The command returns before the query-side projection has necessarily run.
The query side can run through either event-processing pipeline, selected by configuration:
- Pg-notify pipeline: DbEventConsumerStartupHook polls outbox_message_t, uses the table’s gapless c_offset, groups rows by transaction_id, and writes failed transactions to the database dead_letter_queue.
- Kafka pipeline: a connector publishes rows from outbox_message_t to Kafka. PortalEventConsumerStartupHook consumes those records, groups records by the command-side transaction_id, and produces failed transactions to the Kafka DLQ topic when DLQ is enabled.
Both pipelines eventually call PortalDbProvider.handleEvent(conn, event).
handleEvent dispatches the event to the projection method for that event type.
Because both pipelines process the same outbox-backed events, they should share
the same user-facing notification status model.
The notification pieces are partially present:
- notification_t exists in portal-db.
- NotificationDataPersistenceImpl can query notification_t.
- NotificationServiceImpl can insert a notification row.
- user-query exposes getNotification.
- portal-view has a notification table page.
The current gap is that notification rows are not created at the central event processing boundary.
There is also a separate UI error in MailMenu: it calls getPrivateMessage,
whose handler currently returns an empty response. That explains the browser
error Unexpected end of JSON input, but it is separate from the notification
status design.
Goals
- Show the current user the latest event processing results in the profile notification page.
- Record both successful and failed projection processing.
- Preserve event processing correctness even if notification insertion fails.
- Keep notification creation centralized instead of adding calls to every projection method.
- Support commands that emit multiple events.
- Make the read API filter by host and user by default.
- Keep enough diagnostic data to debug failed projections.
- Keep notification writes idempotent so event replay is safe.
Non-Goals
- Do not replace event_store_t, outbox_message_t, or dead_letter_queue.
- Do not use notifications as the source of truth for projection state.
- Do not build a real-time push channel in the first phase.
- Do not add notification logic manually to every projection method.
- Do not expose other users’ processing history to non-admin users.
Recommended Design
Use notification_t as an operational projection-status table. The command
side creates PENDING rows at the central event publication boundary, and the
hybrid-query event consumer updates those rows with the processing result.
The primary processing-result write point should be the centralized outbox
consumer path, around the call to PortalDbProvider.handleEvent(conn, event).
Recommended lifecycle:
command handler
  -> event_store_t
  -> outbox_message_t
  -> notification_t PENDING row
  -> response to caller
hybrid-query consumer
  -> read outbox_message_t
  -> handleEvent(conn, event)
  -> projection table write
  -> notification_t status row
For command-side publication, insert or update one notification row for each
CloudEvent with status PENDING in the same transaction that writes
event_store_t and outbox_message_t. Leave event_partition and
event_offset null for this first insert, because the consumer has not observed
the event position yet.
For successful projection processing, update the notification row for the
CloudEvent to status SUCCEEDED and populate event_partition and
event_offset from the active processor’s outbox position.
For failed projection processing, insert or update one notification row for each
failed CloudEvent with status FAILED or DLQ, and store the exception
message. Populate event_partition and event_offset when the processor has
that information.
Status Model
Use one explicit status field. Do not keep is_processed; this feature is
being implemented for the first time, and a boolean cannot distinguish pending,
success, retry, DLQ, and skipped outcomes.
Recommended statuses:
| Status | Meaning |
|---|---|
| PENDING | Event accepted into event_store_t and outbox_message_t, but the active event consumer has not recorded a processing result yet. |
| SUCCEEDED | Event was applied to projection tables and the projection transaction committed. |
| FAILED | Event processing failed before the failed transaction was durably written to the configured DLQ, or the DLQ write itself failed. |
| DLQ | Event transaction failed in fallback mode and was durably written to the configured DLQ. |
| SKIPPED | Event was read by the active event consumer but intentionally ignored, such as an unhandled event type. |
The UI should show the status labels, not the underlying event pipeline. The same status meanings apply to both pg-notify and Kafka processing.
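On the portal-view side, the five statuses could be modeled as a closed union so that unknown values from the API are caught early. This is a sketch only; the helper names parseStatus, hasProcessingResult, and isFailure are hypothetical and not part of the existing code.

```typescript
// The five statuses from the table above, as a closed set.
const NOTIFICATION_STATUSES = ["PENDING", "SUCCEEDED", "FAILED", "DLQ", "SKIPPED"] as const;
type NotificationStatus = (typeof NOTIFICATION_STATUSES)[number];

// Returns the typed status, or undefined when the value is not a known status.
function parseStatus(value: string): NotificationStatus | undefined {
  return NOTIFICATION_STATUSES.find((s) => s === value);
}

// PENDING means the consumer has not recorded a processing result yet.
function hasProcessingResult(s: NotificationStatus): boolean {
  return s !== "PENDING";
}

// FAILED and DLQ are the two failure outcomes.
function isFailure(s: NotificationStatus): boolean {
  return s === "FAILED" || s === "DLQ";
}
```

Nothing in these helpers refers to pg-notify or Kafka, matching the rule that the status meanings are pipeline-neutral.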
Schema
The existing table is close, but it is too small for operational status and has
nonce as INTEGER while event tables use BIGINT.
Recommended table shape:
CREATE TABLE notification_t (
id UUID NOT NULL,
host_id UUID NOT NULL,
user_id UUID NOT NULL,
nonce BIGINT NOT NULL,
event_class VARCHAR(255) NOT NULL,
event_json TEXT NOT NULL,
event_ts TIMESTAMP WITH TIME ZONE NULL,
process_ts TIMESTAMP WITH TIME ZONE NOT NULL,
status VARCHAR(16) NOT NULL,
error VARCHAR(2048) NULL,
aggregate_id VARCHAR(255) NULL,
aggregate_type VARCHAR(255) NULL,
aggregate_version BIGINT NULL,
event_partition INTEGER NULL,
event_offset BIGINT NULL,
transaction_id UUID NULL,
read_ts TIMESTAMP WITH TIME ZONE NULL,
PRIMARY KEY (host_id, id),
FOREIGN KEY (host_id) REFERENCES host_t(host_id) ON DELETE CASCADE
);
user_id is intentionally not a foreign key to user_t. PENDING rows are
inserted on the command side before projection tables are updated, so enforcing
that projection FK would break commands such as user creation before the
projection catches up.
Recommended indexes:
CREATE INDEX idx_notification_user_process_ts
ON notification_t (host_id, user_id, process_ts DESC);
CREATE INDEX idx_notification_status_process_ts
ON notification_t (host_id, status, process_ts DESC);
CREATE INDEX idx_notification_transaction
ON notification_t (host_id, transaction_id);
CREATE INDEX idx_notification_event_position
ON notification_t (host_id, event_partition, event_offset);
CREATE INDEX idx_notification_unread_failure
ON notification_t (host_id, user_id, process_ts DESC)
WHERE read_ts IS NULL AND status IN ('FAILED', 'DLQ');
event_partition and event_offset are intentionally generic processing
position fields. They are useful for operator diagnostics, but the UI should not
label them as pg-notify or Kafka details. In the pg-notify processor,
event_partition is the configured logical consumer partition and
event_offset is outbox_message_t.c_offset. In the Kafka processor,
event_partition and event_offset are the consumed Kafka record partition and
offset.
Both columns are nullable. PENDING rows should leave them empty at initial
insert time. They are filled later by the pg-notify or Kafka processor when the
processing result changes the row to SUCCEEDED, FAILED, DLQ, or SKIPPED.
transaction_id remains a UUID because it is generated by the command side and
used by both event processors.
Do not store pipeline name, source topic/channel name, or DLQ destination in
notification_t. Those are implementation details of the configured event
pipeline. Operators can use service configuration and logs when they need
pipeline-specific diagnostics.
For existing installations, ship this as a patch:
ALTER TABLE notification_t ALTER COLUMN nonce TYPE BIGINT;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS status VARCHAR(16);
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS event_ts TIMESTAMP WITH TIME ZONE;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS aggregate_id VARCHAR(255);
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS aggregate_type VARCHAR(255);
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS aggregate_version BIGINT;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS event_partition INTEGER;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS event_offset BIGINT;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS transaction_id UUID;
ALTER TABLE notification_t ADD COLUMN IF NOT EXISTS read_ts TIMESTAMP WITH TIME ZONE;
ALTER TABLE notification_t DROP CONSTRAINT IF EXISTS notification_t_user_id_fkey;
ALTER TABLE notification_t DROP COLUMN IF EXISTS is_processed;
ALTER TABLE notification_t ALTER COLUMN status SET NOT NULL;
CREATE INDEX IF NOT EXISTS idx_notification_unread_failure
ON notification_t (host_id, user_id, process_ts DESC)
WHERE read_ts IS NULL AND status IN ('FAILED', 'DLQ');
Write Path
Notification writes need an explicit transaction policy. A single rule cannot cover every status:
- Failure and DLQ notifications must be durable even when projection writes are rolled back.
- Success notifications must not claim success until the projection write has committed.
- Notification write failures should not break projection processing.
Use a REQUIRES_NEW style helper for notification writes that must survive a
projection rollback. In plain JDBC, this means opening a separate connection
with its own commit/rollback boundary.
For success rows, there are two safe options:
- Commit the projection transaction first, then write SUCCEEDED in a separate notification transaction.
- Write SUCCEEDED inside the projection transaction, but wrap it in a savepoint and treat notification insert failure as non-fatal.
The first option is the recommended default because notification failures cannot
roll back projection updates. The tradeoff is a small window where the projection
has committed but the success notification has not been recorded yet. That is
acceptable because event_store_t remains the source of truth and success
notifications are user feedback that does not affect projection correctness.
Recommended service methods:
void recordPending(Map<String, Object> event, UUID transactionId);
void recordSuccess(Map<String, Object> event, EventMetadata metadata);
void recordFailure(Map<String, Object> event, EventMetadata metadata, String error, String status);
recordPending should participate in the command-side transaction that writes
event_store_t and outbox_message_t. It may store transaction_id, because
that value is generated by the command side, but it must leave
event_partition and event_offset null. recordSuccess and recordFailure
should use the event-processing transaction policy described below.
EventMetadata should carry only pipeline-neutral data that is not inside the
CloudEvent map:
- eventPartition: the active processor’s partition value. For pg-notify this is the configured logical consumer partition; for Kafka this is the consumed Kafka record partition.
- eventOffset: the active processor’s offset value. For pg-notify this is outbox_message_t.c_offset; for Kafka this is the consumed Kafka record offset.
- transactionId: the command-side transaction UUID used by both processors.
Both consumers should build this metadata before calling handleEvent, so the
failure path still has offset and transaction context after the projection
transaction is rolled back.
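The shape of this metadata can be illustrated with a short sketch. The real consumers are backend services, so this TypeScript version only illustrates the field mapping; OutboxRow, KafkaRecordPosition, and both builder functions are hypothetical names.

```typescript
// Pipeline-neutral metadata, mirroring the three fields listed above.
interface EventMetadata {
  eventPartition: number;
  eventOffset: number; // BIGINT in the schema; a real implementation would use a 64-bit type
  transactionId: string;
}

// Hypothetical shapes for what each pipeline has in hand before handleEvent.
interface OutboxRow {
  cOffset: number;
  transactionId: string;
}
interface KafkaRecordPosition {
  partition: number;
  offset: number;
}

// Pg-notify: partition is the configured logical consumer partition.
function metadataFromOutboxRow(row: OutboxRow, configuredPartition: number): EventMetadata {
  return {
    eventPartition: configuredPartition,
    eventOffset: row.cOffset,
    transactionId: row.transactionId,
  };
}

// Kafka: partition and offset come from the consumed record.
function metadataFromKafkaRecord(rec: KafkaRecordPosition, transactionId: string): EventMetadata {
  return { eventPartition: rec.partition, eventOffset: rec.offset, transactionId };
}
```

Both builders return the same type, so the notification write path does not need to know which pipeline produced the position.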
Use an idempotent upsert:
INSERT INTO notification_t (...)
VALUES (...)
ON CONFLICT (host_id, id) DO UPDATE SET
process_ts = EXCLUDED.process_ts,
status = EXCLUDED.status,
error = EXCLUDED.error,
event_partition = EXCLUDED.event_partition,
event_offset = EXCLUDED.event_offset,
transaction_id = EXCLUDED.transaction_id;
This makes replay and fallback processing safe.
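The effect of the upsert can be modeled in memory to show why replay is safe. This is an illustrative model of the ON CONFLICT behavior, not the persistence code; InMemoryNotificationTable and the camelCase field names are assumptions.

```typescript
interface NotificationRow {
  hostId: string;
  id: string;
  eventJson: string;
  status: string;
  processTs: string;
  error: string | null;
  eventPartition: number | null;
  eventOffset: number | null;
  transactionId: string | null;
}

class InMemoryNotificationTable {
  private readonly rows = new Map<string, NotificationRow>();

  // Models INSERT ... ON CONFLICT (host_id, id) DO UPDATE.
  upsert(row: NotificationRow): void {
    const key = `${row.hostId}/${row.id}`;
    const existing = this.rows.get(key);
    if (existing === undefined) {
      this.rows.set(key, { ...row });
      return;
    }
    // On conflict only the columns named in the SET clause change.
    this.rows.set(key, {
      ...existing,
      processTs: row.processTs,
      status: row.status,
      error: row.error,
      eventPartition: row.eventPartition,
      eventOffset: row.eventOffset,
      transactionId: row.transactionId,
    });
  }

  get(hostId: string, id: string): NotificationRow | undefined {
    return this.rows.get(`${hostId}/${id}`);
  }

  size(): number {
    return this.rows.size;
  }
}
```

Applying the same result twice leaves one row per (host_id, id), which is the property replay and fallback processing rely on.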
Success Handling
In the normal batch path:
begin projection transaction
for each event from the active pipeline:
  parse CloudEvent
  handleEvent(conn, event)
commit projection transaction
for each successfully committed event:
  recordSuccess(event, metadata) in separate notification transaction
Do not write SUCCEEDED before the projection transaction commits unless it is
part of the same transaction. If it is written in the same transaction, a
projection rollback must roll back the success row too.
The implementation can keep an in-memory list of successfully applied events while processing the batch. After commit, loop through that list and upsert the success notifications. If a success notification write fails, log it and continue; do not retry the projection.
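The control flow described above can be sketched as follows. The real consumer runs on the backend against JDBC; this TypeScript version only illustrates the ordering, and ProjectionTx, processBatch, and the callback parameters are hypothetical stand-ins. Returning false to signal that the caller should switch to fallback processing is an assumption of the sketch.

```typescript
// Minimal event shape for the sketch; real events are CloudEvent maps.
interface PortalEvent {
  id: string;
}

// Stand-in for a connection plus PortalDbProvider.handleEvent(conn, event).
interface ProjectionTx {
  handleEvent(event: PortalEvent): void;
  commit(): void;
  rollback(): void;
}

function processBatch(
  events: PortalEvent[],
  tx: ProjectionTx,
  recordSuccess: (event: PortalEvent) => void,
  log: (message: string) => void,
): boolean {
  const applied: PortalEvent[] = [];
  try {
    for (const event of events) {
      tx.handleEvent(event);
      applied.push(event); // remember it, but do not notify yet
    }
    tx.commit();
  } catch (err) {
    tx.rollback();
    log(`batch failed, falling back: ${String(err)}`);
    return false;
  }
  // Only after commit: write success notifications, each one non-fatal.
  for (const event of applied) {
    try {
      recordSuccess(event);
    } catch (err) {
      log(`success notification failed for ${event.id}: ${String(err)}`);
    }
  }
  return true;
}
```

A failed recordSuccess call is logged and skipped, so the committed projection is never retried because of a notification problem.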
Failure Handling
Failure rows must be written outside the failed projection transaction.
In fallback mode, processing is retried per transaction. For a failed transaction:
begin projection transaction
savepoint projection_attempt
process transaction events
on exception:
  rollback to projection_attempt or rollback projection transaction
  write failed events to database DLQ or Kafka DLQ topic
  recordFailure(event, metadata, error, "DLQ") in separate notification transaction
For pg-notify, the DLQ destination is the database dead_letter_queue table.
For Kafka, the DLQ destination is the configured Kafka DLQ topic. The failure
notification can be committed with the database DLQ transaction for pg-notify,
or in a separate notification transaction immediately after the Kafka DLQ
produce request is accepted. The key requirement is that it must not be part of
the projection work that is being rolled back.
If the database connection enters an unrecoverable error state, close it and open a fresh connection for the DLQ and failure notification writes.
If the payload cannot be parsed as a CloudEvent, the consumer may not know the CloudEvent id or event type. In that case, the DLQ remains the primary failure record. If the consumer metadata still has host, user, partition, offset, and transaction id, the consumer can create a diagnostic notification with a generated id, but this should be treated as a best-effort operational row.
Pending Handling
PENDING is part of phase one. Add pending rows at the central command-side
publication boundary that writes event_store_t and outbox_message_t.
The pending notification should be written in the same command-side transaction as the event-store and outbox rows. If the command rolls back, the pending notification must roll back too. Do not add pending writes to individual command handlers.
At this stage, the notification row should contain command-known fields only:
CloudEvent id, host id, user id, nonce, event class, event JSON, event timestamp,
and transaction id. The processor-owned event_partition and event_offset
fields remain null until event processing updates the row.
Read API
Keep getNotification as the main query endpoint, but tighten its contract.
Recommended request fields:
{
"hostId": "uuid",
"userId": "uuid",
"offset": 0,
"limit": 25,
"status": "SUCCEEDED",
"eventClass": "ClientCreatedEvent",
"nonce": "123",
"fromTs": "2026-05-08T00:00:00Z",
"toTs": "2026-05-08T23:59:59Z",
"error": "duplicate key"
}
Recommended response:
{
"total": 1,
"notifications": [
{
"id": "uuid",
"hostId": "uuid",
"userId": "uuid",
"nonce": 123,
"eventClass": "ClientCreatedEvent",
"status": "SUCCEEDED",
"processTs": "2026-05-08T16:12:00Z",
"aggregateId": "host|client",
"aggregateType": "Client",
"aggregateVersion": 2,
"transactionId": "uuid",
"eventPartition": 0,
"eventOffset": 1001,
"error": null,
"eventJson": "{...}"
}
]
}
eventPartition and eventOffset are intentionally displayed as generic
position fields regardless of which event pipeline is configured. The main list
can hide them by default and show them in the detail view.
userId is a filter on getNotification, not a separate endpoint contract. The
profile notification page should always send the logged-in user’s userId. The
admin notification page can omit userId to request host-wide results, or pass a
specific userId to narrow the host-wide view to one user.
Authorization rules:
- Normal users can query only their own token user_id within the selected hostId. If the request omits userId, the backend should apply the token user_id; if the request supplies another userId, the backend should reject it or override it with the token user_id.
- Admin users can query all users for the host by omitting userId, or filter to a specific user by providing userId.
- The backend should enforce this using token claims, not only UI filters.
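The authorization rules above reduce to a small decision function. The sketch below is written in TypeScript for illustration, although the enforcement belongs in the backend handler. TokenClaims, its isAdmin flag, and resolveUserFilter are hypothetical, and the sketch picks the "reject" option where the rules allow either rejecting or overriding.

```typescript
// Hypothetical view of the claims the backend extracts from the token.
interface TokenClaims {
  userId: string;
  isAdmin: boolean;
}

type UserFilter =
  | { kind: "user"; userId: string } // restrict results to one user
  | { kind: "host-wide" } // admin query across all users of the host
  | { kind: "rejected"; reason: string };

function resolveUserFilter(claims: TokenClaims, requestedUserId?: string): UserFilter {
  if (claims.isAdmin) {
    // Admins may omit userId for host-wide results or narrow to one user.
    return requestedUserId ? { kind: "user", userId: requestedUserId } : { kind: "host-wide" };
  }
  if (requestedUserId && requestedUserId !== claims.userId) {
    // Sketch choice: reject instead of silently overriding.
    return { kind: "rejected", reason: "userId does not match token user_id" };
  }
  // Omitted or matching userId: always apply the token user.
  return { kind: "user", userId: claims.userId };
}
```

Because the result is derived from the token claims, a client that tampers with the userId filter cannot widen its own view.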
Portal View
The profile notification page should become a processing-status view.
Recommended columns:
- Time
- Status
- Event
- Aggregate
- Nonce
- Error
- Details
Recommended default filters:
- hostId from the selected host.
- userId from the logged-in user.
- No default status filter.
- Most recent first.
- Last 25 rows.
The UI should display concise summaries and keep full eventJson behind an
expandable detail row or dialog.
Show all events associated with the user, including successful, failed, and derived events. Derived events should be visible as their own rows instead of being collapsed under the original command transaction.
The current processFlag filter should be replaced by status. No
is_processed compatibility mapping is needed because this feature has not yet
started populating notification_t.
The header MailMenu should not call getPrivateMessage unless that handler is
restored. For notification status, add a small notification badge endpoint or
reuse getNotification with limit = 5.
The header badge should count only unread failure notifications, such as
FAILED and DLQ, and display the count in red when the count is greater than
zero.
In the list, FAILED and DLQ status badges should also use red styling.
Phase 2 adds two narrow user-query RPCs:
- getUnreadNotificationCount: returns unread FAILED and DLQ notifications for the current hostId and userId.
- markFailureNotificationsRead: sets read_ts on unread FAILED and DLQ notifications for the current hostId and userId.
The header uses the count endpoint for its badge and marks failures read when the user opens the notification menu. The notification page also marks failures read when it is opened.
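The badge rule can be expressed as a pure function, which is handy for unit tests even though the count itself is expected to come from the server. The types and function names below are hypothetical; the predicate mirrors the partial index on unread FAILED and DLQ rows.

```typescript
// Hypothetical subset of a notification row as seen by the UI.
interface NotificationSummary {
  status: string;
  readTs: string | null; // null means the row has not been read
}

// Counts only unread failure notifications (FAILED or DLQ).
function countUnreadFailures(rows: NotificationSummary[]): number {
  return rows.filter(
    (r) => r.readTs === null && (r.status === "FAILED" || r.status === "DLQ"),
  ).length;
}

interface BadgeProps {
  visible: boolean;
  color: "red";
  count: number;
}

// The badge is shown, in red, only when the count is greater than zero.
function failureBadge(count: number): BadgeProps {
  return { visible: count > 0, color: "red", count };
}
```

Successful, skipped, and pending rows never contribute to the badge, so it stays quiet during normal operation.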
Admin Notification Page
Phase 3 should add a separate admin notification page instead of overloading the profile notification page. The recommended location is:
- Route: /app/event/notifications
- Menu: Administration -> Event Admin -> Notifications
This page should reuse the same notification table and getNotification read
API, but with admin defaults:
- hostId from the selected host.
- No default userId filter, so admins see host-wide results.
- Default status filter for FAILED and DLQ, with an option to show all statuses.
- Filters for userId, eventClass, status, transactionId, aggregateId, processing position, time range, and error text.
- No unread badge behavior and no call to markFailureNotificationsRead.
The page should clearly identify itself as an admin view, such as “Admin View: Host Notifications”. Host-wide access must still be enforced by the backend using token roles.
Operational Cleanup
Notifications are operational history. They should not grow forever.
Recommended retention:
- Keep successful notifications for 30 to 90 days.
- Keep failed and DLQ notifications longer, such as 180 days.
- Allow host-level configuration later if needed.
Cleanup should be implemented as a generic operational cleanup process, not as
notification-specific UI or command-handler logic. The first cleanup target is
notification_t, but the same framework should also support other operational
tables such as message_t for private messages.
Recommended implementation:
- Add an OperationalCleanupStartupHook on the query side.
- Run cleanup on a fixed interval, such as daily, with config-driven enablement, interval, batch size, and per-target retention days.
- Use a single cleanup coordinator that owns multiple cleanup targets. Each target defines its table, timestamp column, status/type conditions if needed, retention duration, and batch delete SQL.
- Use a database lock, such as a PostgreSQL advisory lock or a dedicated cleanup lock row, so only one service instance performs cleanup at a time.
- Delete in bounded batches to avoid long table locks and large transactions.
- Use a separate database connection and transaction for cleanup work.
- Log cleanup failures and continue service startup; cleanup failure must not block query APIs or event processing.
Do not use schedule_t directly for this cleanup. That scheduler is business
workflow infrastructure that emits events into event_store_t and
outbox_message_t. Operational cleanup is local maintenance and should stay out
of the event-processing path.
Example notification cleanup:
WITH doomed AS (
SELECT host_id, id
FROM notification_t
WHERE (status IN ('SUCCEEDED', 'SKIPPED') AND process_ts < ?)
OR (status IN ('FAILED', 'DLQ') AND process_ts < ?)
ORDER BY process_ts
LIMIT ?
)
DELETE FROM notification_t n
USING doomed d
WHERE n.host_id = d.host_id
AND n.id = d.id;
Private-message cleanup can be another target using message_t.send_time:
WITH doomed AS (
SELECT host_id, from_id, nonce
FROM message_t
WHERE send_time < ?
ORDER BY send_time
LIMIT ?
)
DELETE FROM message_t m
USING doomed d
WHERE m.host_id = d.host_id
AND m.from_id = d.from_id
AND m.nonce = d.nonce;
Recommended default cleanup targets:
| Target | Table | Retention |
|---|---|---|
| Successful notification history | notification_t where status IN ('SUCCEEDED', 'SKIPPED') | 90 days |
| Failed notification history | notification_t where status IN ('FAILED', 'DLQ') | 180 days |
| Private messages | message_t | 180 days |
Do not delete recent PENDING notifications. Old PENDING rows should be
treated as an operational signal first because they may indicate that the event
consumer is stopped or lagging. If a hard cap is needed later, make it a
separate, longer retention policy.
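The retention rules can be summarized as a per-status eligibility check. This sketch uses the default retention values from the table above and is only an illustration of the rule; the actual cleanup is expected to run as batched SQL on the query side, and the names RETENTION_DAYS and isEligibleForCleanup are hypothetical.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Retention in days per status, following the default cleanup targets.
// PENDING is deliberately absent so pending rows are never eligible.
const RETENTION_DAYS: Readonly<Record<string, number | undefined>> = {
  SUCCEEDED: 90,
  SKIPPED: 90,
  FAILED: 180,
  DLQ: 180,
};

// True when a row with this status and process_ts is older than its retention.
function isEligibleForCleanup(status: string, processTs: Date, now: Date): boolean {
  const days = RETENTION_DAYS[status];
  if (days === undefined) {
    return false; // unknown or PENDING status: keep the row
  }
  const cutoff = now.getTime() - days * DAY_MS;
  return processTs.getTime() < cutoff;
}
```

The two cutoffs correspond to the two `?` timestamp parameters in the notification cleanup query.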
Snapshot and Promotion
notification_t should be treated as an operational table, not a promoted
projection table.
It should be excluded from global snapshot export and conversion alongside
event_store_t, outbox_message_t, dead_letter_queue, log_counter, and
consumer_offsets.
Rollout Plan
Phase 1: Make Notifications Useful
- Add status and diagnostic columns to notification_t.
- Add pipeline-neutral event_partition, event_offset, and transaction_id metadata.
- Change NotificationService to support separate notification transactions.
- Insert PENDING rows at the central command-side outbox publication boundary.
- Insert SUCCEEDED rows after successful handleEvent.
- Insert DLQ rows in fallback failure handling.
- Update getNotification to support status and correct timestamp fields.
- Update portal-view to use status, default to the current user, and show all user-associated events including derived events.
Phase 2: Improve User Feedback
- Add an unread marker with read_ts.
- Add a small header badge query for unread FAILED and DLQ notifications and render the badge in red.
- Mark unread failure notifications as read when the user opens the header menu or the notification page.
Phase 3: Operations
- Add a generic operational cleanup startup hook with retention targets for notification_t and message_t.
- Make cleanup configurable by enablement, interval, batch size, and per-target retention days.
- Add a database lock so only one service instance runs cleanup at a time.
- Add an admin notification page under Event Admin that uses getNotification without a userId filter for host-wide failures.
- Add dashboards or alerts for repeated DLQ statuses.
Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Notification write failure breaks event processing | Write notifications in a separate transaction after projection commit, or use savepoints for same-transaction success rows. |
| Failure notifications are rolled back with projection failures | Write FAILED and DLQ rows outside the failed projection transaction. |
| False success rows after projection rollback | Write SUCCEEDED only after projection commit, or keep same-transaction success rows rollback-safe. |
| Duplicate rows on replay | Use ON CONFLICT (host_id, id) DO UPDATE. |
| Users see other users’ events | Enforce token-based authorization in getNotification. |
| Operational tables grow without bound | Add generic operational cleanup targets and supporting indexes. |
| Cleanup runs concurrently on multiple instances | Use a database lock so only one instance runs cleanup at a time. |
| Cleanup failure blocks query service startup | Log cleanup failures and continue startup; cleanup is maintenance, not correctness-critical. |
| Status meaning stays ambiguous | Use status as the only outcome field for both pg-notify and Kafka processing. |
API Marketplace Catalog
Context
The portal already has a Marketplace navigation group and an api-marketplace
page registry entry. The current API administration page is table-oriented and
is useful for owners, but it is not a consumer catalog. A Marketplace API
catalog should let users discover APIs by business category, capability,
protocol, lifecycle status, and governance metadata.
API create and update forms already use the standardized taxonomy fields:
- categoryIds for selected category identifiers.
- tagIds for selected tag identifiers.
- getCategoryLabelByType with entityType = "api" for category options.
- getTagLabelByType with entityType = "api" for tag options.
The service query layer also returns categoryIds, categories, tagIds, and
tags for API rows. The catalog should use those fields for display and
filtering instead of reintroducing the legacy apiTags string field.
Goals
- Add a Marketplace menu entry for an API catalog.
- Use database-backed categories and tags, not hard-coded UI lists.
- Keep categories and tags reusable across future catalog pages.
- Keep API create/update forms as the source of truth for taxonomy assignment.
- Give consumers a browse-first experience instead of an admin table.
- Support deep links from a catalog listing to API detail, versions, endpoints, runtime bindings, and owner actions.
- Preserve host scope and ownership rules already used by API administration.
Non-Goals
- Do not replace API administration pages with the catalog.
- Do not store display names in API rows when they can be resolved from category_t, tag_t, entity_category_t, and entity_tag_t.
- Do not use the old apiTags field for catalog filtering.
- Do not make taxonomy values static frontend constants.
- Do not expose private tenant APIs through a public catalog without an explicit visibility and authorization decision.
Current Building Blocks
| Area | Current shape | Catalog use |
|---|---|---|
| Portal navigation | Marketplace group already exists in the sidebar | Add an API Catalog child item under Marketplace |
| Page registry | api-marketplace points to /app/marketplace, while the app route still needs a real catalog page | Keep a registry entry for search, task links, and help links |
| API admin page | Service.tsx calls service/getApi and displays categories and tags | Reuse its query contract but present catalog cards/list views |
| API detail page | ApiDetail.tsx shows API versions and action links | Catalog detail can deep-link to this page |
| Forms | createApi and updateApi submit categoryIds and tagIds | Catalog reads the same assignments |
| Category labels | category/getCategoryLabelByType returns id and label | Use for category tabs, filters, and chips |
| Tag labels | tag/getTagLabelByType returns id, label, value, group code, group label, group sort order, and tag sort order | Use for grouped tag filters and grouped multi-select controls |
| Database | category_t, tag_t, entity_category_t, and entity_tag_t are entity-type scoped | Use entity_type = 'api' for API catalog taxonomy |
User Experience
The first screen under Marketplace should be the usable catalog, not a landing page. The recommended route is:
/app/marketplace/api
The sidebar can keep the existing Marketplace group, but its children should move from API-type-only links to intent-based entries:
- API Catalog
- API Clients
- JSON Schema
- YAML Rule
- Schema Form
The API Catalog page should provide:
- Search across API id, name, description, business group, line of business, capability, platform, git repository, categories, and tags.
- Category tabs or a category rail based on getCategoryLabelByType.
- Grouped tag filters based on getTagLabelByType.
- Filter chips for active category and tag selections.
- A compact card or list row per API with name, description, status, categories, tags, owner, business group, and latest version summary.
- Actions to view details, review versions, create a new version, update the API metadata, and open related runtime or access-control pages when the user has permission.
The catalog should support an Uncategorized bucket for active APIs without
category assignments. This avoids hiding incomplete data and gives admins an
easy cleanup target.
Categories And Tags
Categories should be stable browse buckets. Tags should be flexible facets.
Both are stored with entityType = "api" so the same tag names can be reused
for other entity types without forcing cross-catalog semantics.
Recommended initial API categories:
| Category value | Label | Purpose |
|---|---|---|
| public-api | Public API | External developer-facing APIs |
| partner-api | Partner API | APIs shared with business partners |
| internal-api | Internal API | Organization-internal service APIs |
| platform-service | Platform Service | Shared platform or infrastructure APIs |
| data-api | Data API | Data access, analytics, reporting, and query APIs |
| ai-automation-api | AI / Automation API | Agent, workflow, automation, or AI-facing APIs |
| security-compliance-api | Security / Compliance API | Identity, audit, policy, compliance, and control APIs |
| developer-tooling-api | Developer Tooling API | Build, test, deployment, and developer-experience APIs |
| legacy-modernization-api | Legacy / Modernization API | Legacy integration and modernization APIs |
The stored category_name must stay lower-case and URL-friendly. The display
labels above are UI labels derived from those values.
Recommended initial API tag groups:
| Group code | Group label | Example tag values |
|---|---|---|
| protocol | Protocol | openapi, graphql, hybrid, mcp, rest, event-driven |
| lifecycle | Lifecycle | draft, review, implemented, deprecated, beta, ga |
| security | Security | oauth2, jwt, mtls, pii, hipaa, pci, read-only |
| runtime | Runtime | gateway, sidecar, kubernetes, serverless, multi-region |
| domain | Domain | customer, order, payment, inventory, tax, billing |
| consumer | Consumer | public, partner, internal, agent-facing, mobile, web |
| operations | Operations | high-traffic, low-latency, batch, streaming, critical |
| integration | Integration | database, kafka, s3, third-party, mainframe, saas |
Stored tag names must stay lower-case and URL-friendly. If a display label needs capitalization, the UI should format it or the label endpoint should provide a separate display field later.
Tags without tag_group_code or tag_group_label should be shown under a
General filter group in the catalog UI. Configured groups should appear first
by group_sort_order; the General group should appear after configured
groups, matching the current label query behavior where null group sort values
sort last.
Data Flow
Catalog filter option loading:
portal-view
-> category/getCategoryLabelByType(hostId, entityType = "api")
-> tag/getTagLabelByType(hostId, entityType = "api")
Catalog result loading:
portal-view
-> service/getApi(hostId, offset, limit, active, filters, globalFilter, sorting)
-> api rows with categoryIds, categories, tagIds, tags
The catalog should prefer server-side pagination and filtering. Client-side filtering is acceptable only for a small first pass because it breaks as soon as the API count exceeds one fetched page.
Query Contract
The existing getApi contract already supports filters, globalFilter,
sorting, offset, limit, hostId, and active. To make the catalog work
well at scale, add first-class filter support for taxonomy fields:
{
"hostId": "01964b05-552a-7c4b-9184-6857e7f3dc5f",
"offset": 0,
"limit": 20,
"active": true,
"categoryIds": ["..."],
"tagIds": ["..."],
"tagMatch": "all",
"globalFilter": "payment"
}
Recommended semantics:
- categoryIds uses OR semantics by default. An API in any selected category is returned.
- tagIds should support tagMatch = "all" and tagMatch = "any".
- Category and tag filters should use EXISTS against entity_category_t and entity_tag_t with entity_type = 'api' and active = TRUE.
- Display arrays should continue to be returned as categories and tags.
- Form update payloads should continue to submit identifiers only through categoryIds and tagIds.
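The semantics above can be sketched as an in-memory predicate, which is also a convenient reference for the category-only, tag-any, and tag-all test cases. This is a minimal sketch, not the backend implementation: the `CatalogApiRow` and `TaxonomyFilter` types and the `matchesTaxonomy` name are hypothetical, combining the category and tag filters with AND is an assumption, and so is defaulting `tagMatch` to `"all"`.

```typescript
// Hypothetical row shape: only categoryIds and tagIds mirror the getApi response.
type CatalogApiRow = { apiId: string; categoryIds: string[]; tagIds: string[] };

// Hypothetical filter shape mirroring the proposed request fields.
type TaxonomyFilter = {
  categoryIds?: string[];
  tagIds?: string[];
  tagMatch?: "all" | "any";
};

// Returns true when the row satisfies the recommended semantics:
// categories are OR-ed, tags follow tagMatch (assumed default: "all"),
// and the two filters are combined with AND (an assumption).
function matchesTaxonomy(row: CatalogApiRow, filter: TaxonomyFilter): boolean {
  const categoryIds = filter.categoryIds ?? [];
  const tagIds = filter.tagIds ?? [];

  // An empty filter list means "no restriction" for that dimension.
  const categoryOk =
    categoryIds.length === 0 ||
    categoryIds.some((id) => row.categoryIds.includes(id));

  const tagOk =
    tagIds.length === 0 ||
    (filter.tagMatch === "any"
      ? tagIds.some((id) => row.tagIds.includes(id))
      : tagIds.every((id) => row.tagIds.includes(id)));

  return categoryOk && tagOk;
}
```

The same predicate can back a small first-pass client-side filter, with the caveat already noted that client-side filtering only sees one fetched page.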
Page Design
The API Catalog page can be implemented as a dedicated page rather than trying to stretch the current API admin table.
Proposed files:
src/pages/marketplace/ApiCatalog.tsx
src/pages/marketplace/components/ApiCatalogFilters.tsx
src/pages/marketplace/components/ApiCatalogCard.tsx
src/pages/marketplace/hooks/useApiCatalog.ts
Page state:
- search text
- selected category ids
- selected tag ids
- tag match mode
- active status
- pagination
- sorting
- view mode, either compact list or card grid
Catalog state should be URL-driven from Phase 1. Search text, selected categories, selected tags, tag match mode, active status, sorting, and pagination should be encoded in the query string so users can refresh the page, use browser navigation, and share filtered catalog URLs. Example:
/app/marketplace/api?q=payment&category=public-api&tag=oauth2&tag=mtls&tagMatch=all&page=1
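A minimal sketch of how the page state could round-trip through the query string, using the standard `URLSearchParams` API. The `CatalogUrlState` type and both function names are hypothetical, only the parameters shown in the example URL are handled (active status and sorting are left out for brevity), and falling back to `"all"` and page 1 on invalid input is an assumption.

```typescript
// Hypothetical subset of the catalog page state that is encoded in the URL.
type CatalogUrlState = {
  q: string;
  categories: string[];
  tags: string[];
  tagMatch: "all" | "any";
  page: number;
};

// Serializes state using repeated keys for multi-value filters (tag=a&tag=b).
function toQueryString(state: CatalogUrlState): string {
  const params = new URLSearchParams();
  if (state.q) params.set("q", state.q);
  state.categories.forEach((c) => params.append("category", c));
  state.tags.forEach((t) => params.append("tag", t));
  params.set("tagMatch", state.tagMatch);
  params.set("page", String(state.page));
  return params.toString();
}

// Parses a query string back into state, falling back to safe defaults.
function fromQueryString(query: string): CatalogUrlState {
  const params = new URLSearchParams(query);
  const page = Number.parseInt(params.get("page") ?? "1", 10);
  return {
    q: params.get("q") ?? "",
    categories: params.getAll("category"),
    tags: params.getAll("tag"),
    tagMatch: params.get("tagMatch") === "any" ? "any" : "all",
    page: Number.isNaN(page) || page < 1 ? 1 : page,
  };
}
```

Keeping parsing tolerant of missing or malformed values means a hand-edited or shared URL still renders a usable catalog page.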
The page should still reuse existing infrastructure:
- fetchClient for portal query calls.
- useUserState for host and user context.
- buildTaskAwareRoute for deep links.
- ownership utilities for update/delete action visibility.
- TaskActionPanel for publisher/admin next actions.
- pageRegistry and contextual help metadata.
Routing And Navigation
Add or update these portal-view entries:
| Location | Change |
|---|---|
| Sidebar.tsx | Add API Catalog under Marketplace with route /app/marketplace/api |
| App.tsx | Route /app/marketplace/api to ApiCatalog |
| pageRegistry.ts | Add or update API Catalog metadata, keywords, and help path |
| taskRegistry.ts | Update publish/review steps to point to /app/marketplace/api |
| Help docs | Add a user-facing help page after the UI settles |
The existing /app/marketplace route can redirect to /app/marketplace/api or
remain a broader Marketplace landing page later. For the first API catalog
implementation, redirecting keeps the behavior simple.
Backend Changes
The backend already persists API category and tag relationships. The main backend change is query filtering:
- Extend service-query spec for optional categoryIds, tagIds, and tagMatch.
- Update GetApi to pass those optional fields to the DB provider.
- Update PortalDbProvider#getApi and ApiServicePersistenceImpl#getApi.
- Add SQL predicates over entity_category_t and entity_tag_t.
- Verify or add compound indexes for taxonomy filtering.
- Add tests for category-only, tag-any, tag-all, combined taxonomy filters, and APIs with no taxonomy assignments.
The existing join-table indexes are useful for entity lookups and label resolution, but catalog filtering also needs indexes that start with filter fields. Before implementing Phase 2, verify the query plan and add indexes if needed:
CREATE INDEX idx_entity_tag_filter
ON entity_tag_t (entity_type, tag_id, entity_id)
WHERE active = TRUE;
CREATE INDEX idx_entity_category_filter
ON entity_category_t (entity_type, category_id, entity_id)
WHERE active = TRUE;
For tagMatch = "all", prefer a single grouped subquery over generating one
EXISTS predicate per selected tag when the selected tag set can grow. A common
shape is to filter entity_tag_t by selected tag ids, group by entity_id, and
require COUNT(DISTINCT tag_id) = selectedTagCount.
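The grouped-subquery shape can be sketched as a small builder that emits a parameterized predicate. This is illustrative only: the backend is Java, so the TypeScript here just shows the SQL shape. The function name, the Postgres-style `$n` placeholders, and the outer column passed as `apiIdColumn` are assumptions, and any tenant scoping the join table needs is not shown.

```typescript
// Builds a predicate for tagMatch = "all" as one grouped subquery.
// Assumptions: Postgres-style $n placeholders, and apiIdColumn is a trusted
// constant (never user input) that corresponds to entity_tag_t.entity_id.
function buildTagAllPredicate(
  tagIds: string[],
  firstParamIndex: number,
  apiIdColumn: string
): { sql: string; params: (string | number)[] } {
  // De-duplicate so COUNT(DISTINCT tag_id) can actually reach the target.
  const unique = Array.from(new Set(tagIds));
  if (unique.length === 0) return { sql: "TRUE", params: [] };

  const placeholders = unique
    .map((_, i) => `$${firstParamIndex + i}`)
    .join(", ");
  const countPlaceholder = `$${firstParamIndex + unique.length}`;

  const sql =
    `${apiIdColumn} IN (` +
    `SELECT et.entity_id FROM entity_tag_t et ` +
    `WHERE et.entity_type = 'api' AND et.active = TRUE ` +
    `AND et.tag_id IN (${placeholders}) ` +
    `GROUP BY et.entity_id ` +
    `HAVING COUNT(DISTINCT et.tag_id) = ${countPlaceholder})`;

  // The last parameter is selectedTagCount.
  return { sql, params: [...unique, unique.length] };
}
```

The statement size stays constant in structure as the selected tag set grows, which is the reason to prefer it over one EXISTS per tag.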
The query response should continue to include both identifiers and labels:
{
"apiId": "0001",
"apiName": "Petstore",
"categoryIds": ["..."],
"categories": ["public-api"],
"tagIds": ["..."],
"tags": ["openapi", "oauth2"]
}
Implementation Phases
Phase 1: Catalog Page
- Add the API Catalog route and Marketplace menu entry.
- Load category and tag options from the existing label endpoints.
- Load APIs with service/getApi.
- Render search, category filter, grouped tag filter, and API list/card results.
- Store catalog filters, search text, sorting, and pagination in the URL query string.
- Use current query response labels for display.
- Deep-link to existing API detail and update forms.
Phase 2: Server-Side Taxonomy Filters
- Add categoryIds, tagIds, and tagMatch to service-query.
- Implement SQL filtering in ApiServicePersistenceImpl.
- Keep current table filtering support for admin use.
- Add DB provider and handler tests.
Phase 3: Catalog Polish
- Add API detail summary panels with versions, endpoint count, runtime exposure, and access-control hints.
- Add help docs and task links.
- Add optional counts per category and tag if the catalog needs faceted counts.
Open Questions
- Should Marketplace API Catalog show only active APIs by default? The recommendation is yes, with an admin-visible inactive filter.
- Should unauthenticated users ever see catalog data? The recommendation is no until a separate public visibility model is designed.
- Should category selection be single-select or multi-select? The recommendation is multi-select OR semantics for flexibility.
- Should tags use all-match or any-match by default? The recommendation is all for precision, with a visible toggle if users need broader searches.
- Should OpenAPI tags imported from specs automatically create API catalog tags? The recommendation is no for the first pass. Spec tags are often endpoint-level groupings and should not automatically become curated catalog taxonomy.
Agent Skill And API Endpoint Discovery
Problem
The GenAI chat flow has three separate concepts that are easy to confuse:
- The light-gateway MCP endpoint is the runtime server that lists and executes tools. An agent should call the gateway for tools/list and tools/call. A listed tool may be backed by a downstream MCP server or by a gateway-routed HTTP/OpenAPI endpoint.
- Portal-query is the catalog service for skills, tools, and agent assignments. The agent should read this catalog through the genai-query API, cache it locally, and search it during chat.
- The controller registry remains a runtime control-plane service for registration, discovery, and cache-management commands. It should not own the portal skill/tool catalog and should not execute downstream MCP or REST calls.
During chat, light-agent should use its local catalog cache to find relevant
skills, then call tools/list on the gateway to verify executable tools. Tool
execution still goes through the gateway. If the catalog cache is empty or
stale, the agent should refresh it from portal-query. If portal-query is
temporarily unavailable, the agent should still be able to use the gateway tool
list directly.
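The last step of that flow, combining matched skills with the gateway tool list and falling back to the gateway list alone, can be sketched as a pure function. The types and the `selectChatTools` name are hypothetical, and the cache refresh from portal-query is assumed to have happened (or failed) before this function is called.

```typescript
// Hypothetical shapes: a tool as listed by the gateway, and a cached skill.
type GatewayTool = { name: string; description: string };
type CachedSkill = { name: string; contentMarkdown: string; toolNames: string[] };

// Decides which skill instructions and tools to expose for one chat turn.
function selectChatTools(
  matchedSkills: CachedSkill[],
  gatewayTools: GatewayTool[]
): { instructions: string[]; tools: GatewayTool[] } {
  if (matchedSkills.length === 0) {
    // No catalog guidance (empty cache, no match, or portal-query unavailable):
    // the gateway tool list is still usable on its own.
    return { instructions: [], tools: gatewayTools };
  }

  const wanted = new Set<string>();
  matchedSkills.forEach((s) => s.toolNames.forEach((n) => wanted.add(n)));

  return {
    instructions: matchedSkills.map((s) => s.contentMarkdown),
    // Only tools the gateway currently lists are executable.
    tools: gatewayTools.filter((t) => wanted.has(t.name)),
  };
}
```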
The missing piece is a portal-managed catalog that explains which API endpoints exist, which endpoint projections are invokable by agents, which skills they belong to, and which agents are allowed or expected to use those skills. Without that catalog, the agent can list executable gateway tools, but it has no domain guidance beyond each tool description.
Goals
- Keep the gateway as the runtime source of truth for MCP tool execution.
- Keep direct gateway tools/list and tools/call working even when no skills have been authored.
- Treat API endpoints as the generic capability unit. MCP tools, OpenAPI operations, JSON-RPC methods, and future protocol operations should all become endpoint-level capabilities before they are exposed to agents.
- Populate a portal endpoint and tool catalog from API version parsing, LightAPI descriptions, gateway-discovered MCP tools, manually pasted MCP tools/list payloads, and gateway-routed REST tools.
- Let portal users create skills that contain instructions and curated tool selections.
- Let portal users assign skills to agent definitions.
- Use the genai-query API and spec as the portal-query access surface for skills, tools, and agent assignments.
- Let the agent cache the effective catalog locally and reload it when controller cache-management invalidation is triggered.
- Make skills useful for progressive disclosure without requiring every MCP tool to be wrapped before it can be called.
- Store semantic routing metadata for endpoint capabilities so the agent or portal-query can perform macro-filtering, keyword search, vector ranking, context viability checks, and safety filtering.
Non-Goals
- Do not move MCP request routing or downstream REST calls into the controller.
- Do not implement skill/search in controller-rs. Controller-rs can invalidate the agent cache, but portal-query owns catalog reads.
- Do not use config-server as the first delivery path for the skills/tools catalog. The agent can fetch from portal-query and cache locally.
- Do not require every gateway tool to have a skill before it is executable.
- Do not replace the existing MCP Gateway registry design. This design extends it with agent-facing skill curation.
- Do not implement embeddings in the first phase. Keyword search is enough for the initial local catalog search.
- Do not limit the catalog to MCP tools. The UI may use “tool” when referring to LLM tool-calling, but the persistent capability model should be endpoint first.
- Do not use skill assignments as the only authorization control. Gateway policy and downstream authorization still apply at execution time.
Concepts
| Concept | Responsibility | Example |
|---|---|---|
| API Endpoint | Canonical endpoint-level capability stored by API version. It may come from OpenAPI, MCP tools/list, LightAPI, JSON-RPC, or another protocol. | /v1/accounts@get, getRandomNumber@call |
| Tool | Agent-facing projection of an endpoint as an executable LLM function. The runtime call is made by name through the gateway. | getAccounts calling GET /v1/accounts |
| Skill | Domain guidance plus a curated set of tools. It helps an agent decide what to expose and how to reason. | “Account Management” using account read and create tools |
| Agent | Runtime worker that receives a user prompt, discovers skills and tools, calls the LLM, then executes requested tools through the gateway. | account-agent |
| Gateway | MCP server and router. It owns runtime tools/list and tools/call behavior. | light-gateway /mcp |
| Portal Query | Catalog API service for reading skills, tools, tool params, skill-tool mappings, and agent-skill assignments. | genai-query API |
| Controller Registry | Runtime control-plane service for service metadata, discovery, and cache invalidation. | cache-management MCP tool |
| Portal | Authoring UI and persistence layer for tools, skills, and agent assignments. | Tool Catalog, Skill Editor, Agent Skill Assignment |
Target Architecture
The target flow keeps runtime execution and control-plane metadata separate.
Portal UI
-> writes api_endpoint_t, tool_t, tool_param_t, skill_t, skill_tool_t, skill_workflow_t, agent_skill_t
light-gateway /mcp
-> lists executable tools from mcp-router.tools and downstream MCP servers
-> executes tools/call against downstream MCP or REST services
light-workflow
-> owns deterministic multi-step workflow execution, task state, and audit events
portal-query genai-query API
-> serves skill/tool/agent-skill catalog reads from portal data
controller-rs portal registry
-> registers agents and sends cache-management invalidation commands
light-agent
-> loads assigned skills and mapped tools from portal-query
-> caches the effective catalog locally
-> searches cached skills during chat
-> lists executable tools from light-gateway
-> calls selected tools through light-gateway
For the account-agent example:
- The gateway exposes account tools such as getAccounts and getAccountByNo.
- Portal stores the canonical endpoint rows in api_endpoint_t.
- Portal publishes selected endpoint rows into tool_t as agent-invokable capabilities.
- An operator creates an “Account Management” skill in skill_t.
- Portal links that skill to the account tools through skill_tool_t.
- Portal assigns the skill to the account agent through agent_skill_t.
- At startup or cache reload, the agent reads the assigned catalog through genai-query and caches it locally.
- At chat time, the agent searches its local catalog cache.
- The agent combines matched skill instructions with the gateway tool definitions.
- Any tool call still goes to light-gateway tools/call.
Source Of Truth
The gateway is the runtime source of truth for executable tools. If a tool is not available from the gateway, the agent should not be able to execute it just because it exists in the portal database.
api_endpoint_t is the canonical portal endpoint catalog. It stores the
endpoint identity, protocol method, path, logical tool schema, endpoint
description, and raw tool metadata for one API version.
tool_t is the agent-facing projection of an endpoint. It stores the tool name,
agent description, implementation type, optional endpoint reference, response
schema, active flag, semantic routing fields, and semantic embedding. The full
metadata object should still be preserved in api_endpoint_t.tool_metadata for
import/export and agent cache payloads.
The portal database is the control-plane catalog. It stores:
- operator-friendly descriptions,
- skill instructions,
- agent assignments,
- governance metadata,
- cached or imported tool schemas.
Tool sync should be idempotent. The recommended unique identity is:
host_id + api_version_id + endpoint
Gateway exposure is a separate deployment selection. The catalog should sync all endpoint rows for an API version, then let the user choose which endpoint/tool projections are deployed to a specific gateway instance.
For runtime-executable projections, the gateway identity is:
hostId + serviceId + envTag
The access token used for portal catalog or gateway deployment APIs should carry
matching host, sid, and env claims. Portal-query must verify those claims
against the requested hostId, serviceId, and envTag before returning or
changing catalog data.
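A minimal sketch of that claim check, assuming the token signature and expiry have already been validated by the existing security layer. The type and function names are hypothetical; only the claim names host, sid, and env and the request fields hostId, serviceId, and envTag come from the text above.

```typescript
// Claims as they would appear after the token has been verified elsewhere.
type TokenClaims = { host?: string; sid?: string; env?: string };

// The gateway identity requested by the caller.
type GatewayIdentity = { hostId: string; serviceId: string; envTag: string };

// Returns true only when every claim is present and equals the requested value.
// A missing claim compares as undefined and therefore fails the check.
function claimsMatchIdentity(
  claims: TokenClaims,
  requested: GatewayIdentity
): boolean {
  return (
    claims.host === requested.hostId &&
    claims.sid === requested.serviceId &&
    claims.env === requested.envTag
  );
}
```

Failing closed on a missing claim keeps a token issued for one gateway instance from reading or changing another instance's catalog data.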
Runtime verification means checking whether an endpoint projection is actually
listed by a deployed gateway through tools/list. This should be done against
the selected gateway instance when an operator is preparing or reviewing a
gateway deployment. A later host-wide diagnostics view can aggregate all
registered gateways, but phase 2 does not need host-wide verification as the
default.
Runtime verification is not part of the persistence projection. The persistence
layer should store catalog state, endpoint/tool metadata, and inactive drift
state, but it should not call a live gateway. The portal UI, deployment review
flow, or a diagnostics endpoint should call the selected gateway’s tools/list
with the operator or service credential, compare the returned tool names and
schemas with the catalog, and surface the result as deployment drift.
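The comparison step can be sketched as a function over two tool lists, one from the catalog and one from the selected gateway's tools/list response. The type names and the three drift buckets are hypothetical, and comparing schemas with `JSON.stringify` is a deliberately naive stand-in because it is sensitive to key order.

```typescript
// Hypothetical minimal tool shape shared by catalog rows and gateway results.
type ToolDefinition = { name: string; inputSchema: unknown };

type DriftReport = {
  missingFromGateway: string[]; // in the catalog, not listed by the gateway
  unknownToCatalog: string[];   // listed by the gateway, not in the catalog
  schemaChanged: string[];      // present in both with different schemas
};

function compareWithGateway(
  catalog: ToolDefinition[],
  gateway: ToolDefinition[]
): DriftReport {
  const gatewayByName = new Map<string, ToolDefinition>();
  gateway.forEach((t) => gatewayByName.set(t.name, t));
  const catalogNames = new Set(catalog.map((t) => t.name));

  const report: DriftReport = {
    missingFromGateway: [],
    unknownToCatalog: [],
    schemaChanged: [],
  };

  catalog.forEach((t) => {
    const live = gatewayByName.get(t.name);
    if (!live) {
      report.missingFromGateway.push(t.name);
    } else if (JSON.stringify(live.inputSchema) !== JSON.stringify(t.inputSchema)) {
      // Naive comparison: a canonical JSON form would be more robust.
      report.schemaChanged.push(t.name);
    }
  });

  gateway.forEach((t) => {
    if (!catalogNames.has(t.name)) report.unknownToCatalog.push(t.name);
  });

  return report;
}
```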
If a previously imported endpoint or tool disappears from the gateway, the sync process should mark the catalog projection inactive instead of deleting it immediately. This preserves skill mappings and gives operators a clear drift signal.
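An idempotent sync keyed on host_id + api_version_id + endpoint, with soft deactivation instead of deletion, might look like the following in-memory sketch. The row shape is reduced to a few hypothetical fields, and the real persistence path is event-sourced and is not shown.

```typescript
// Reduced, hypothetical row shape; the real table has more columns.
type EndpointRow = {
  hostId: string;
  apiVersionId: string;
  endpoint: string;
  description: string;
  active: boolean;
};
type ImportedEndpoint = { endpoint: string; description: string };

// Identity used for upserts: host_id + api_version_id + endpoint.
const endpointKey = (hostId: string, apiVersionId: string, endpoint: string): string =>
  [hostId, apiVersionId, endpoint].join("|");

// Upserts imported endpoints and marks rows that disappeared as inactive.
// Running it twice with the same input yields the same result.
function syncEndpoints(
  existing: EndpointRow[],
  hostId: string,
  apiVersionId: string,
  imported: ImportedEndpoint[]
): EndpointRow[] {
  const incoming = new Map<string, ImportedEndpoint>();
  imported.forEach((e) => incoming.set(endpointKey(hostId, apiVersionId, e.endpoint), e));

  const seen = new Set<string>();
  const result: EndpointRow[] = existing.map((row) => {
    // Rows for other hosts or API versions are left untouched.
    if (row.hostId !== hostId || row.apiVersionId !== apiVersionId) return row;
    const key = endpointKey(row.hostId, row.apiVersionId, row.endpoint);
    seen.add(key);
    const match = incoming.get(key);
    return match
      ? { ...row, description: match.description, active: true }
      : { ...row, active: false }; // keep the row so skill mappings survive
  });

  incoming.forEach((e, key) => {
    if (!seen.has(key)) {
      result.push({ hostId, apiVersionId, endpoint: e.endpoint, description: e.description, active: true });
    }
  });
  return result;
}
```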
Current Data Model
The database already has the main tables needed for this design:
- skill_t: skill name, description, content_markdown, embedding placeholder, version, and active flag.
- tool_t: agent-facing tool catalog with name, description, implementation metadata, endpoint reference, and response schema.
- tool_param_t: parameter-level metadata and validation schema.
- agent_skill_t: maps agent definitions to skills.
- skill_tool_t: maps skills to tools for progressive disclosure.
- api_endpoint_t: MCP or REST endpoint metadata, including tool_schema and tool_metadata.
- wf_definition_t: stores workflow definitions as YAML for the light-workflow runtime.
Phase 3.5 should add a skill-to-workflow mapping table rather than storing
workflow YAML inside skill_t. The recommended table is:
| Column | Purpose |
|---|---|
| host_id | Tenant and ownership boundary. |
| skill_id | Skill that can use or expose the workflow. |
| wf_def_id | Workflow definition stored in wf_definition_t. |
| workflow_role | Relationship type such as primary, validation, remediation, or test. |
| start_mode | How the workflow can be started, such as manual, agent, scheduled, or portal. |
| config | JSONB overrides for workflow input defaults, disclosure settings, or skill-specific runtime hints. |
| aggregate_version | Event-sourced concurrency/version field. |
| active | Soft delete and publication flag. |
The current phase 2 persistence path can preserve semantic metadata in
api_endpoint_t.tool_metadata before dedicated routing columns exist. That is
acceptable for import/export compatibility and for small catalogs searched from
the agent’s local cache. It should not be treated as the final indexed search
shape. Before portal-query performs database-side macro-filtering over large
catalogs or before vector ranking becomes a production dependency, promote the
high-use routing fields to first-class columns or indexed relationships and
backfill them from tool_metadata.
The existing MCP Registry design already maps MCP tools into api_endpoint_t.
OpenAPI parsing also creates endpoint rows. This design uses tool_t as the
agent-facing catalog row and links it back to api_endpoint_t when the tool
originates from an API endpoint.
Recommended mapping for gateway-imported tools:
| Gateway tool field | Portal storage |
|---|---|
| name | tool_t.name and api_endpoint_t.endpoint_name |
| description | tool_t.description and api_endpoint_t.endpoint_desc |
| inputSchema | api_endpoint_t.tool_schema and generated tool_param_t rows |
| Gateway route metadata | gateway exposure metadata keyed by hostId, serviceId, and envTag |
| Downstream REST path | tool_t.api_endpoint and api_endpoint_t.endpoint_path |
| Downstream method | tool_t.api_method and api_endpoint_t.http_method |
| Safety flags | indexed tool metadata plus api_endpoint_t.tool_metadata.safety |
tool_t.implementation_type should be a standardized enum aligned with the
LightAPI Description execution model. Endpoint-backed tools should use a
LightAPI endpoint implementation type rather than preserving every downstream
transport as a different tool implementation. The downstream protocol remains
in the endpoint and LightAPI request metadata.
Recommended first enum values:
| Implementation type | Use |
|---|---|
| lightapi_endpoint | Any agent-invokable API endpoint described by api_endpoint_t and LightAPI metadata. |
| java | In-process Java implementation. |
| python | Script-backed Python implementation. |
| javascript | Script-backed JavaScript implementation. |
For lightapi_endpoint, execution still goes through gateway tools/call when
the endpoint is exposed to a gateway. The source protocol, such as MCP,
OpenAPI, JSON-RPC, OpenRPC, or gRPC, belongs in api_endpoint_t,
tool_metadata, and the LightAPI request description.
Endpoint-First Capability Model
Agents and skills should operate over endpoint capabilities, not only over MCP tools. MCP remains the runtime protocol for tool-calling through the gateway, but the catalog should support any endpoint that can be represented as an agent-invokable capability.
Recommended capability layers:
- api_endpoint_t: canonical endpoint row for the API version.
- tool_t: agent-facing executable projection of the endpoint.
- tool_param_t: normalized top-level input parameters derived from the endpoint’s JSON Schema.
- skill_tool_t: curated relationship between a skill and a tool projection, including per-skill overrides such as priority, examples, or approval notes.
- agent_skill_t: assignment of skills to agent definitions.
This model supports these source types:
| Source | Endpoint identity | Tool projection |
|---|---|---|
| MCP tools/list | <toolName>@call | Tool name is the MCP tool name; method is call. |
| OpenAPI | <path>@<method> | Tool name comes from operation id or generated endpoint name. |
| LightAPI Description | operation.endpointId or <operationId>@<method> | Tool name comes from operation id or curated agent metadata. |
| JSON-RPC/OpenRPC | <method>@call | Tool name is the method or curated operation name. |
| gRPC | <service>/<method>@call | Tool name is the curated operation name. |
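The endpoint identity column of this table can be sketched as one function over a tagged union of sources. The `EndpointSource` type is hypothetical, and lower-casing the OpenAPI method is an assumption based on examples such as /v1/accounts@get. LightAPI Description is left out because it supplies its own operation.endpointId.

```typescript
// Hypothetical tagged union covering the sources whose identity is derived.
type EndpointSource =
  | { kind: "mcp"; toolName: string }
  | { kind: "openapi"; path: string; method: string }
  | { kind: "jsonrpc"; method: string }
  | { kind: "grpc"; service: string; method: string };

// Derives the endpoint identity string for a source, following the table.
function endpointIdentity(source: EndpointSource): string {
  switch (source.kind) {
    case "mcp":
      return `${source.toolName}@call`;
    case "openapi":
      // Assumption: HTTP methods are stored lower-case, as in /v1/accounts@get.
      return `${source.path}@${source.method.toLowerCase()}`;
    case "jsonrpc":
      return `${source.method}@call`;
    case "grpc":
      return `${source.service}/${source.method}@call`;
  }
}
```

Because the identity is derived deterministically from the source, repeated imports of the same API version map onto the same catalog rows.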
tool_param_t should be generated from the logical input schema, not from wire
transport details alone. For OpenAPI, the logical input schema should merge
path parameters, query parameters, and request body into one object. For MCP,
the logical input schema is the MCP inputSchema. For JSON-RPC, it is the
logical params schema.
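For the OpenAPI case, the merge into one logical input object could look like the sketch below. The types are simplified stand-ins rather than a full OpenAPI model, and placing the request body under a single `body` property instead of flattening its fields is an assumption made here to avoid name collisions with path and query parameters.

```typescript
// Simplified stand-ins for JSON Schema and OpenAPI structures.
type JsonSchema = {
  type?: string;
  properties?: Record<string, JsonSchema>;
  required?: string[];
};
type OpenApiParameter = {
  name: string;
  in: "path" | "query";
  required?: boolean;
  schema: JsonSchema;
};
type OpenApiBody = { required?: boolean; schema: JsonSchema };

// Merges path parameters, query parameters, and the request body into one
// logical object schema from which tool_param_t rows could be generated.
function buildLogicalInputSchema(
  parameters: OpenApiParameter[],
  body?: OpenApiBody
): JsonSchema {
  const properties: Record<string, JsonSchema> = {};
  const required: string[] = [];

  parameters.forEach((p) => {
    properties[p.name] = p.schema;
    // Path parameters are always required in OpenAPI.
    if (p.in === "path" || p.required === true) required.push(p.name);
  });

  if (body) {
    // Assumption: the body is nested under "body" rather than flattened.
    properties["body"] = body.schema;
    if (body.required === true) required.push("body");
  }

  return { type: "object", properties, required };
}
```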
Semantic Routing Metadata
The customer-required semantic routing fields should be first-class indexed catalog data, not only JSON metadata. They are used for macro-filtering before expensive keyword, vector, or LLM ranking, so the common filter fields must be queryable through normal portal-query indexes.
Recommended indexed fields or relationships:
- domain and semantic namespace,
- sensitivity tier,
- semantic weight,
- target personas,
- active state,
- source protocol and implementation type,
- portal category and tag relationships.
Recommended phase 2 column names for endpoint and tool projections:
| Field | Suggested column or relationship | Source fallback |
|---|---|---|
| Domain | routing_domain | tool_metadata.routing.domain, LightAPI capability group, OpenAPI tag. |
| Semantic namespace | semantic_namespace | tool_metadata.routing.semanticNamespace, LightAPI info.namespace. |
| Sensitivity tier | sensitivity_tier | tool_metadata.routing.sensitivityTier, LightAPI visibility or safety metadata. |
| Semantic weight | semantic_weight | tool_metadata.routing.semanticWeight, default 1.0. |
| Source protocol | source_protocol | LightAPI operation protocol, OpenAPI, MCP, JSON-RPC, gRPC. |
| Target personas | join table or indexed array | tool_metadata.routing.targetPersonas, LightAPI agent metadata. |
The full structured payload should still be preserved in
api_endpoint_t.tool_metadata so LightAPI import/export, gateway config
generation, and agent cache payloads have one portable metadata object.
Recommended api_endpoint_t.tool_metadata shape:
{
"routing": {
"domain": "finance.accounts",
"category": "account-management",
"semanticNamespace": "prod.accounts.core",
"targetPersonas": ["account-agent", "customer-support-agent"],
"semanticDescription": "Retrieves account profile and status information when a user asks about an existing account.",
"semanticKeywords": ["account lookup", "customer account", "balance", "status"],
"contextRequirements": {
"requiredInputs": ["accountNo"],
"requiredContext": ["host_id"]
},
"dependencies": [
{
"endpoint": "/v1/accounts/{accountNo}@get",
"relation": "frequently_chained_after"
}
],
"semanticWeight": 0.75,
"sensitivityTier": "Internal-Only",
"fallbackEndpoint": "/v1/accounts@get",
"embedding": {
"model": "tool-description-embedding",
"source": "semanticDescription"
}
},
"safety": {
"read_only": true,
"destructive": false,
"humanApprovalRequired": false
}
}
Recommended ownership:
| Metadata | Primary storage | Notes |
|---|---|---|
| Domain and namespace | Indexed endpoint/tool columns plus tool_metadata.routing | Used for macro-filtering before vector ranking. |
| Categories and tags | Existing portal tag/category tables plus tool_metadata.routing | Reuse the portal taxonomy instead of creating a separate endpoint taxonomy. |
| Target personas | Indexed mapping or array plus tool_metadata.routing.targetPersonas | Used to filter the effective catalog for the current agent. |
| Rich capability description | tool_t.description plus tool_metadata.routing.semanticDescription | tool_t.description should be the concise LLM-facing description. |
| Synonyms and keywords | tool_metadata.routing.semanticKeywords | Used by keyword search and embedding source text. |
| Embedding vector | tool_t.description_embedding | The embedding provider must produce the configured vector dimension, currently 384, or the column must be migrated. |
| Required state/context locks | tool_metadata.routing.contextRequirements | The router should exclude non-viable tools before LLM tool injection. |
| Dependency mappings | tool_metadata.routing.dependencies | Used for chain suggestions, prefetch, or warm-up. |
| Priority score | Indexed column plus tool_metadata.routing.semanticWeight | Numeric multiplier for ranking ties. |
| Sensitivity tier | Indexed column plus tool_metadata.routing.sensitivityTier | Used before disclosure and before execution. |
| Fallback target | tool_metadata.routing.fallbackEndpoint | Runtime fallback should still respect gateway policy. |
| Destructive/read-only flags | tool_metadata.safety and existing gateway toolMetadata | Runtime enforcement belongs in gateway or policy, not only in prompts. |
The first semantic search implementation can work from the agent’s local cache:
- Filter by host, active flag, assigned skill, domain, namespace, target persona, and sensitivity tier.
- Exclude endpoints whose required context is not available in the current workflow or chat state.
- Rank by keyword matches over skill text, endpoint name, tool name, description, semantic keywords, and LightAPI capability text.
- When embeddings are populated, combine vector similarity with the keyword score and multiply by semanticWeight.
- Call gateway tools/list and intersect the ranked set with currently executable tools before exposing schemas to the LLM.
Embedding Recommendation
Keep the first production embedding dimension at 384 because the current
Postgres vector column is already VECTOR(384) and the first catalog use case
is routing over short endpoint descriptions, not long document retrieval.
Recommended model strategy:
- Use a provider abstraction with configured embedding_model, embedding_dimension, and embedding_source.
- For OpenAI-hosted embeddings, use text-embedding-3-small with the dimensions parameter set to 384.
- For on-prem or firewall-restricted deployments, use a local embedding service that is configured to emit 384-dimensional vectors.
- Store enough metadata to know how a vector was created: model, dimension, source text hash, source field, and generated timestamp.
- Re-embed when the semantic description, keywords, domain, or model config changes.
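The provenance metadata and the re-embed decision can be sketched without calling any embedding provider. The record shape and function names are hypothetical, and the FNV-1a hash is only a dependency-free stand-in for a real digest such as SHA-256.

```typescript
// Hypothetical provenance stored next to each vector.
type EmbeddingRecord = {
  model: string;
  dimension: number;
  sourceField: string;
  sourceTextHash: string;
  generatedAt: string;
};

// Mirrors the configured embedding_model, embedding_dimension, embedding_source.
type EmbeddingConfig = {
  embedding_model: string;
  embedding_dimension: number;
  embedding_source: string;
};

// 32-bit FNV-1a over UTF-16 code units; a stand-in for a cryptographic digest.
function hashText(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// True when there is no vector yet, or when the model, dimension, source
// field, or source text differs from what produced the stored vector.
function needsReembedding(
  record: EmbeddingRecord | undefined,
  config: EmbeddingConfig,
  sourceText: string
): boolean {
  if (!record) return true;
  return (
    record.model !== config.embedding_model ||
    record.dimension !== config.embedding_dimension ||
    record.sourceField !== config.embedding_source ||
    record.sourceTextHash !== hashText(sourceText)
  );
}
```

Building the source text from the semantic description, keywords, and domain together means a change to any of them changes the hash and triggers a re-embed.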
The portal catalog write path should remain in the portal service layer that
owns api_endpoint_t and tool_t persistence. Because the current portal
command/query services are Java, the Java side should own transactions,
versioning, and persistence of embedding results. A Rust service or worker can
still generate embeddings behind an internal API or queue consumer, especially
if local model performance is better there. In that model, Java requests or
consumes the vector and writes it through the normal portal persistence path.
LightAPI Description Enrichment
LightAPI Description should be the preferred enrichment source for endpoint
capabilities. OpenAPI and MCP tools/list are good at initial extraction, but
LightAPI adds the agent-oriented context needed for high-accuracy routing:
- endpoint identity and stable endpointId
- domain, tags, lifecycle, visibility, and capability group
- logical input schema and request mapping
- result schema and result cases
- examples and behavior notes
- progressive disclosure metadata
- agent-facing descriptions, personas, keywords, context requirements, and guardrails
Recommended merge priority for endpoint metadata:
- Portal operator overrides.
- Endpoint-level LightAPI Description.
- API-level inherited LightAPI Description context.
- OpenAPI/OpenRPC/protobuf/MCP source extraction.
- Gateway runtime tools/list discovery.
This keeps runtime discovery useful while letting curated LightAPI descriptions provide richer semantic routing without hand-authoring every endpoint as an independent skill.
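The merge priority can be sketched as a field-by-field fold in which a higher-priority layer overrides a lower one only for the fields it actually defines. The three-field `EndpointMetadata` shape is a hypothetical reduction of the real metadata, and field-level (rather than whole-object) override is an assumption.

```typescript
// Hypothetical reduced metadata shape; every field is optional per layer.
type EndpointMetadata = {
  description?: string;
  domain?: string;
  semanticKeywords?: string[];
};

// Layers are passed from highest priority (portal operator overrides) to
// lowest (gateway runtime discovery). Folding from the right applies the
// lowest priority first so higher-priority layers overwrite it.
function mergeEndpointMetadata(layersHighToLow: EndpointMetadata[]): EndpointMetadata {
  return layersHighToLow.reduceRight<EndpointMetadata>((merged, layer) => {
    const next: EndpointMetadata = { ...merged };
    if (layer.description !== undefined) next.description = layer.description;
    if (layer.domain !== undefined) next.domain = layer.domain;
    if (layer.semanticKeywords !== undefined) next.semanticKeywords = layer.semanticKeywords;
    return next;
  }, {});
}
```

With this shape, a gateway-discovered description still appears when no curated description exists, which matches the intent of keeping runtime discovery useful.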
Phase 2 persistence should be treated as the receiver for this metadata, not as the extractor. The openapi-parser, a LightAPI Description parser, or a dedicated ingestion worker must emit the enriched endpoint payload on the API version event. At minimum, the event payload for each endpoint should include:
- endpointId, endpoint identity, protocol, method, path, name, and description,
- logical toolSchema generated from the LightAPI operation input contract,
- toolMetadata.routing with namespace, domain, capability group, personas, keywords, context requirements, sensitivity tier, and semantic weight where present,
- toolMetadata.safety from LightAPI safety, visibility, idempotency, and destructive-operation hints,
- response schema or result metadata when it is available for the tool projection.
If the parser only emits the base OpenAPI or MCP fields, the catalog remains valid but only has low-enrichment metadata. The phase 2 implementation should record that as an ingestion gap, not as a persistence defect.
Portal Catalog Contract
The agent should read skills and tools through the genai-query API in
portal-query. The source spec is:
genai-query/src/main/resources/spec.yaml
The current spec already includes catalog endpoints for the main entities:
- getAgentSkill and getFreshAgentSkill
- getSkill and getFreshSkill
- getSkillTool and getFreshSkillTool
- getSkillDependency and getFreshSkillDependency
- getTool and getFreshTool
- getToolParam and getFreshToolParam
Phase 2 should add a dedicated effective catalog endpoint instead of forcing the
agent to compose many generic query endpoints. The endpoint should still live in
genai-query, not controller-rs.
Recommended endpoint behavior:
- verify the caller’s token claims before reading catalog rows,
- require request host_id, service_id, and env_tag,
- match token host, sid, and env claims to those request values,
- return only endpoint/tool projections valid for that host, service, and environment,
- include active endpoint metadata, tool schemas, safety metadata, routing metadata, and skill mappings relevant to the agent,
- support a freshness or version field so the agent can cache the result.
The agent should cache the returned structure locally:
{
"host_id": "00000000-0000-0000-0000-000000000000",
"agent_def_id": "00000000-0000-0000-0000-000000000000",
"catalog_version": 42,
"skills": [
{
"skill_id": "00000000-0000-0000-0000-000000000000",
"name": "Account Management",
"description": "Use account tools to inspect and manage customer accounts.",
"content_markdown": "Prefer read-only tools before create or update tools.",
"tools": [
{
"tool_id": "00000000-0000-0000-0000-000000000000",
"endpoint_id": "00000000-0000-0000-0000-000000000000",
"name": "getAccounts",
"endpoint": "/v1/accounts@get",
"api_type": "openapi",
"description": "List account summaries.",
"input_schema": {
"type": "object",
"properties": {}
},
"routing_metadata": {
"domain": "finance.accounts",
"semanticNamespace": "prod.accounts",
"semanticKeywords": ["account list", "customer accounts"],
"sensitivityTier": "Internal-Only"
},
"safety": {
"read_only": true,
"destructive": false
}
}
]
}
]
}
For phase 2, the agent definition identity is the agent API version identity.
agent_definition_t.agent_def_id stores the same UUID as
api_version_t.api_version_id; the table is an agent-specific profile extension
for model and runtime settings, not a second standalone agent registry. The
agent display name comes from api_t.api_name, so agent_definition_t does not
duplicate the API name. API Admin continues to own the API/API-version
lifecycle, Instance Admin continues to own deployed instances, and the Agent
Definition page edits the profile for that API version.
The previous registry skill/search response shape was:
{
  "skills": [
    {
      "skill_id": "00000000-0000-0000-0000-000000000000",
      "name": "Account Management",
      "description": "Use account tools to inspect and manage customer accounts.",
      "tool_name": "getAccounts",
      "input_schema": {
        "type": "object",
        "properties": {}
      }
    }
  ]
}
That flattened shape can remain as an internal compatibility DTO while the agent is migrated, but it should not be the long-term external contract. The target cache shape should support a skill with multiple tools:
{
  "skills": [
    {
      "skill_id": "00000000-0000-0000-0000-000000000000",
      "name": "Account Management",
      "description": "Use account tools to inspect and manage customer accounts.",
      "content_markdown": "Prefer read-only tools before create or update tools.",
      "tools": [
        {
          "name": "getAccounts",
          "description": "List account summaries.",
          "input_schema": {
            "type": "object",
            "properties": {}
          }
        }
      ]
    }
  ]
}
Migration rule:
- Remove the controller-rs skill/search placeholder.
- The agent can temporarily accept both the flattened shape and the nested tools shape while its portal-query client is being migrated.
- After migration, the nested effective catalog shape becomes the preferred internal cache contract.
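During the migration window, a client could normalize both shapes into the nested form before caching. The sketch below is one possible approach, not the agent's actual code. The function name normalizeSkills is hypothetical, and merging flattened rows that share a skill_id into one skill is an assumption about how the flattened DTO repeats skills.

```javascript
// Hypothetical normalizer: accepts the flattened or the nested response shape
// and always returns the nested { skills: [{ ..., tools: [...] }] } form.
function normalizeSkills(response) {
  const bySkill = new Map();
  for (const entry of response.skills ?? []) {
    let skill = bySkill.get(entry.skill_id);
    if (!skill) {
      skill = {
        skill_id: entry.skill_id,
        name: entry.name,
        description: entry.description,
        content_markdown: entry.content_markdown ?? "",
        tools: [],
      };
      bySkill.set(entry.skill_id, skill);
    }
    if (Array.isArray(entry.tools)) {
      // Nested shape: tools are already grouped under the skill.
      skill.tools.push(...entry.tools);
    } else if (entry.tool_name) {
      // Flattened shape: one tool per row, with no separate tool description.
      skill.tools.push({
        name: entry.tool_name,
        input_schema: entry.input_schema ?? { type: "object", properties: {} },
      });
    }
  }
  return { skills: [...bySkill.values()] };
}

const emptySchema = { type: "object", properties: {} };
const flattened = {
  skills: [
    { skill_id: "s1", name: "Account Management", description: "Manage accounts.", tool_name: "getAccounts", input_schema: emptySchema },
    { skill_id: "s1", name: "Account Management", description: "Manage accounts.", tool_name: "getAccount", input_schema: emptySchema },
  ],
};
const nested = normalizeSkills(flattened);
console.log(JSON.stringify(nested, null, 2));
```

Once every caller receives the nested shape from the server, the flattened branch can be deleted without touching the cache or search code.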
Agent identity can come from token claims, configured agent definition, or request fields. If inference is not enough, pass explicit fields to the portal-query catalog call:
{
  "agent_def_id": "00000000-0000-0000-0000-000000000000",
  "host_id": "00000000-0000-0000-0000-000000000000",
  "service_id": "com.networknt.account-agent-1.0.0",
  "env_tag": "dev"
}
Runtime Behavior
The agent should treat the portal catalog as helpful guidance, not as a hard dependency for basic tool use.
Recommended behavior:
- At startup, call the genai-query API to load the effective agent catalog.
- Cache the catalog locally under host_id, agent identity, and catalog version.
- During chat, search the local catalog with the user prompt.
- If matched skills are returned, add skill instructions to the prompt context.
- If matched skills include tool mappings, prefer those tools for the LLM tool list.
- Call gateway tools/list to verify executable tools and obtain the current runtime schemas.
- Intersect skill-selected tool names with gateway-listed tools.
- If no skills match, or the local catalog is unavailable, fall back to gateway tools/list.
- Execute all LLM tool calls through gateway tools/call.
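The intersection and fallback steps above can be illustrated with a short sketch. It assumes the gateway tools/list result has already been fetched into an array of objects with a name field. The function name selectToolsForLlm is hypothetical, and the real agent is written in Rust, so this only shows the selection logic.

```javascript
// Hypothetical selection step: matchedSkills comes from the local catalog search,
// gatewayTools is the array returned by gateway tools/list.
function selectToolsForLlm(matchedSkills, gatewayTools) {
  const wanted = new Set();
  for (const skill of matchedSkills ?? []) {
    for (const tool of skill.tools ?? []) wanted.add(tool.name);
  }
  // No matched skills or no usable catalog: fall back to the full gateway list.
  if (wanted.size === 0) return gatewayTools;
  // Keep the gateway entries so the current runtime schemas are used,
  // and drop catalog tools the gateway does not list.
  return gatewayTools.filter((tool) => wanted.has(tool.name));
}

const gatewayTools = [
  { name: "getAccounts", inputSchema: { type: "object", properties: {} } },
  { name: "getWeather", inputSchema: { type: "object", properties: {} } },
];
const matchedSkills = [
  { name: "Account Management", tools: [{ name: "getAccounts" }, { name: "deleteAccount" }] },
];
const selected = selectToolsForLlm(matchedSkills, gatewayTools);
const fallback = selectToolsForLlm(null, gatewayTools);
console.log(selected.map((t) => t.name), fallback.map((t) => t.name));
```

Filtering the gateway list, rather than the cached list, means a stale catalog can only narrow the tool set and can never add a tool the gateway no longer exposes.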
When portal data changes, controller cache management can invalidate the agent’s local catalog cache. Reload behavior should match the agent’s initial loading strategy:
- if the agent loads the catalog during startup, invalidation should trigger an eager reload so the next chat request sees current metadata;
- if the agent loads the catalog on the first request, invalidation can clear the cache and let the next request reload lazily.
This keeps the account-agent usable before the portal skill catalog is fully populated and avoids making controller-rs part of the catalog query or execution path.
Portal UI
Endpoint Catalog And Tool Projection
The catalog UI should be endpoint-first but still show the tool projection that agents will see. It should let operators:
- browse api_endpoint_t rows by API, API version, endpoint, method, source, and active state,
- import or resync endpoint capabilities from OpenAPI, MCP tools/list, manually pasted MCP tools payloads, LightAPI descriptions, and selected gateway runtime surfaces,
- publish selected endpoint rows into tool_t as agent-invokable tools,
- generate or refresh tool_param_t rows from the logical input schema,
- see tool name, description, input schema, downstream endpoint, API type, semantic namespace, domain, personas, sensitivity tier, and runtime executable state,
- compare catalog metadata against source specs and current gateway tools/list,
- mark missing endpoint projections inactive,
- override operator-facing descriptions without changing gateway config,
- review and edit semantic routing metadata such as keywords, context requirements, fallback endpoint, priority weight, read-only, destructive, sensitive, or human-approval-required.
The first implementation should not depend only on live gateway access. It can
import from the endpoint rows produced by API version parsing, including manual
MCP tools/list JSON pasted into the API version spec field. Gateway
tools/list should then be used to verify which imported projections are
currently executable by a deployed gateway.
Skill Editor
The Skill Editor should let operators:
- create and update skill_t rows,
- write content_markdown instructions,
- link tools through skill_tool_t,
- set tool access level and per-skill config,
- preview which tools the skill would expose for a sample prompt,
- optionally link the skill to one or more workflow definitions,
- activate or deactivate skills.
Skill content should be short and operational. It should describe when to use the skill, how to interpret the tools, and any sequencing rules. It should not contain secrets.
Workflow-backed Skills
Some skills are only guidance plus a curated tool set. Other skills need a
repeatable process that calls several tools, branches on results, waits for
human input, runs assertions, or leaves an audit trail. Those skills should use
light-workflow as the orchestration layer.
The boundary is:
| Layer | Responsibility |
|---|---|
| Skill | Discovery metadata, instructions, taxonomy, allowed tools, and agent guidance. |
| Workflow | Ordered execution, branching, retries, assertions, human tasks, durable state, and audit events. |
| Gateway | Runtime tool execution through tools/list and tools/call. |
Workflow-backed skills should be optional. Use a workflow when the skill represents a durable or regulated process, such as API onboarding, approval, validation, remediation, scheduled live testing, or a multi-step operation with clear checkpoints. Do not require workflow backing for simple skills that only guide an agent toward one tool call or open-ended exploration.
The workflow definition remains canonical in wf_definition_t.definition as
YAML. The skill workspace should link to the definition through
skill_workflow_t and should reuse the generic workflow editor described in
Workflow Editor. The skill workspace can constrain the
editor with skill context, but it should not implement its own workflow
runtime.
For workflow-backed skills, skill_tool_t becomes the allowed tool set. A
save-time validator should reject workflow steps that reference a gateway tool
not linked to the skill, unless the step is explicitly marked as a future or
external dependency. This keeps progressive disclosure, operator review, and
workflow execution aligned.
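A save-time validator along these lines could look like the sketch below. The step shape is an assumption made for illustration: each parsed step is taken to have a name, an optional tool field naming a gateway tool, and an optional dependency marker set to future or external. None of those field names are confirmed by the workflow schema.

```javascript
// Hypothetical validator: steps is the parsed workflow step list,
// linkedToolNames are the tool names linked to the skill through skill_tool_t.
function validateWorkflowTools(steps, linkedToolNames) {
  const allowed = new Set(linkedToolNames);
  const errors = [];
  for (const step of steps) {
    if (!step.tool) continue; // not a tool-call step
    // Assumed marker for steps that intentionally reference an unlinked tool.
    const exempt = step.dependency === "future" || step.dependency === "external";
    if (!allowed.has(step.tool) && !exempt) {
      errors.push(`step ${step.name} references tool ${step.tool}, which is not linked to the skill`);
    }
  }
  return errors;
}

const steps = [
  { name: "list", tool: "getAccounts" },
  { name: "close", tool: "closeAccount" },
  { name: "notify", tool: "sendEmail", dependency: "external" },
  { name: "wait-for-approval" },
];
const errors = validateWorkflowTools(steps, ["getAccounts"]);
console.log(errors);
```

Returning a list of messages instead of throwing on the first problem lets the editor show every unlinked reference in one pass.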
Recommended Skill Workspace tabs:
| Tab | Purpose |
|---|---|
| Overview | Edit name, description, Markdown instructions, active state, tags, and categories. |
| Tools | Link tools, configure skill_tool_t.config, inspect schemas, sensitivity, and gateway availability. |
| Workflow | Select or create workflow definitions, edit YAML, inspect the step outline, and link workflows through skill_workflow_t. |
| Preview | Show the effective prompt, allowed tool set, linked workflow graph, and disclosure payload. |
| Test | Start a workflow with JSON input, watch instance events, complete waiting tasks, and inspect assertions or failures. |
Agent Skill Assignment
The Agent Skill Assignment UI should let operators:
- select an agent definition,
- assign one or more active skills through agent_skill_t,
- set priority and sequence,
- preview the final skill list for that agent,
- verify that each assigned skill still has at least one executable gateway tool.
Portal-query And Agent Cache Implementation
Catalog lookup should be implemented through the genai-query API. The agent
should fetch the assigned active catalog, cache it locally, and run progressive
disclosure search against the cache.
Phase 5 implements this for the Rust light-agent only. Other agent runtimes
can adopt the same genai-query contract later, but they are not part of the
Phase 5 implementation scope.
Initial algorithm:
- Resolve host_id from the agent runtime configuration and the catalog request token. Resolve agent_def_id from LIGHT_AGENT_AGENT_DEF_ID or LIGHT_AGENT_API_VERSION_ID. Resolve service_id and env_tag from the registered Rust agent service config.
- Call genai-query getEffectiveAgentCatalog.
- The endpoint loads active agent_skill_t rows and linked active skill_t, skill_tool_t, tool_t, tool_param_t, and skill_workflow_t rows.
- Build a nested effective catalog grouped by skill, with each skill carrying its mapped tools, schemas, endpoint identity, safety flags, and routing metadata.
- Cache the effective catalog locally with catalogVersion and catalogHash.
- During chat, macro-filter cached entries by agent persona, domain, namespace, sensitivity tier, active state, and available workflow context.
- Rank cached entries by simple text matching over skill_t.name, skill_t.description, skill_t.content_markdown, tool_t.name, tool_t.description, endpoint name, endpoint description, and semantic keywords.
- Intersect the final candidate list with gateway tools/list before exposing tool schemas to the LLM.
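The ranking step can start very simply. The sketch below scores each cached skill by how many distinct prompt words also appear in its text fields. It is a minimal illustration, not the Rust agent's implementation: the function names are hypothetical, the catalog is assumed to follow the nested cache shape shown earlier, and the scoring rule is just one plausible reading of simple text matching.

```javascript
// Lowercase a string and split it into alphanumeric words.
function tokenize(text) {
  return (text ?? "").toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Hypothetical ranker: score = number of distinct prompt words found in the skill text.
function rankSkills(catalog, prompt) {
  const terms = new Set(tokenize(prompt));
  const scored = [];
  for (const skill of catalog.skills ?? []) {
    const parts = [skill.name, skill.description, skill.content_markdown];
    for (const tool of skill.tools ?? []) {
      const keywords = tool.routing_metadata?.semanticKeywords ?? [];
      parts.push(tool.name, tool.description, ...keywords);
    }
    const words = new Set(tokenize(parts.join(" ")));
    let score = 0;
    for (const term of terms) {
      if (words.has(term)) score += 1;
    }
    if (score > 0) scored.push({ skill, score });
  }
  // Highest score first; skills with no overlap are dropped.
  return scored.sort((a, b) => b.score - a.score);
}

const catalog = {
  skills: [
    {
      name: "Account Management",
      description: "Use account tools to inspect and manage customer accounts.",
      tools: [
        {
          name: "getAccounts",
          description: "List account summaries.",
          routing_metadata: { semanticKeywords: ["account list", "customer accounts"] },
        },
      ],
    },
    { name: "Weather", description: "Look up forecasts.", tools: [] },
  ],
};
const ranked = rankSkills(catalog, "list customer accounts");
console.log(ranked.map((r) => [r.skill.name, r.score]));
```

A scorer this plain also counts common words such as "to" or "and", so a stop-word list or the later vector ranking would be needed before relying on the scores for anything beyond a first cut.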
Controller cache management should invalidate this local cache when portal catalog data changes. After invalidation, the agent reloads from portal-query.
Later algorithm:
- Add vector search over skill_t.description_embedding and tool_t.description_embedding.
- Add vector search over endpoint semantic descriptions and LightAPI capability text.
- Include skill dependency expansion from skill_dependency_t.
- Use dependency mappings and fallback endpoints for chain planning, prefetch, and failure repair.
- Include inactive or missing-tool diagnostics for portal admin views, not for normal agent search.
Gateway Implementation
The gateway should keep the MCP data-plane contract stable:
- tools/list returns the executable tool set for the caller.
- tools/call routes by tool name to downstream MCP servers or REST services.
- Gateway policy remains authoritative at execution time.
- Gateway does not depend on skill_t or agent_skill_t to execute tools.
The gateway can expose an administrative sync endpoint later, but the first
portal sync can call the existing MCP tools/list endpoint with an operator or
service credential.
mcp-router.tools in values.yml should stay a runtime execution projection,
not the full semantic registry. It should include the fields the gateway needs
to list and call tools, plus safety metadata that must be enforced at runtime.
Richer semantic routing metadata should stay in portal-query and the agent
cache unless the gateway needs it for a concrete runtime policy decision.
Security Rules
- Skill assignment narrows what the agent should offer to the LLM, but it does not grant runtime authorization by itself.
- Gateway access control, endpoint scopes, OAuth token claims, and downstream service authorization still decide whether a tool call is allowed.
- Tool schemas and descriptions are not trusted input. They should be validated before storing and escaped when rendered.
- Skill content must not contain secrets, tokens, private keys, or passwords.
- A stale catalog row must not make a removed gateway tool executable.
- A stale local agent cache must be intersected with gateway tools/list before exposing tools to the LLM.
- Controller cache invalidation only forces reload; it does not grant access to catalog rows or executable tools.
- Sensitive or destructive tool metadata should be enforced by the gateway or a policy layer, not only by prompt instructions.
- Sensitivity tier must be checked before catalog disclosure. An agent without clearance for Restricted-PII should not receive the endpoint description or schema even if a skill references it.
- Context requirements are not only prompt hints. If required context is missing, the endpoint should be excluded or routed to an ask/workflow step that obtains the missing value.
Failure Handling
| Failure | Expected behavior |
|---|---|
| Portal-query catalog load fails at startup | Start with an empty catalog cache and fall back to gateway tools/list. |
| Portal-query catalog reload fails after invalidation | Keep the previous cache if available, mark it stale, retry with backoff, and still verify tools through gateway tools/list. |
| Gateway tools/list fails | Continue chat without tools or return a clear tool-unavailable response. |
| Skill references missing tool | Omit the missing tool from the runtime tool list and surface drift in portal admin UI. |
| Gateway rejects tools/call | Return the tool error to the LLM loop and log the gateway response. |
| Catalog sync sees changed schema | Update catalog schema, mark the tool as changed, and preserve operator metadata. |
| LightAPI enrichment conflicts with source spec | Preserve the source invocation contract, mark the semantic metadata conflict for review, and do not overwrite operator overrides. |
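The two catalog-load rows in the table can be modelled as one state transition. The sketch below is an assumption-heavy illustration: the state fields, the function name applyReloadResult, the doubling backoff, and the 1 second and 60 second bounds are all hypothetical, and applying backoff to the startup case as well is a choice made here, not something the table requires.

```javascript
// Hypothetical cache state: { catalog, stale, failures, retryInMs }.
// A null catalog means the agent has nothing cached and uses gateway tools/list only.
function applyReloadResult(state, result, baseDelayMs = 1000, maxDelayMs = 60000) {
  if (result.ok) {
    return { catalog: result.catalog, stale: false, failures: 0, retryInMs: null };
  }
  const previous = state.catalog ?? null;
  const failures = state.failures + 1;
  return {
    catalog: previous,          // keep the previous cache if there is one
    stale: previous !== null,   // only an existing cache can be stale
    failures,
    // Exponential backoff, capped at maxDelayMs.
    retryInMs: Math.min(maxDelayMs, baseDelayMs * 2 ** (failures - 1)),
  };
}

const startup = applyReloadResult({ catalog: null, stale: false, failures: 0, retryInMs: null }, { ok: false });
const afterInvalidation = applyReloadResult(
  { catalog: { catalog_version: 42, skills: [] }, stale: false, failures: 2, retryInMs: null },
  { ok: false },
);
const recovered = applyReloadResult(afterInvalidation, { ok: true, catalog: { catalog_version: 43, skills: [] } });
console.log(startup, afterInvalidation, recovered);
```

In every state the tool list exposed to the LLM is still verified against gateway tools/list, so a stale flag affects guidance quality, not what can execute.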
Phased Implementation
Phase 1: Preserve Direct MCP Baseline
- Keep agent tool execution through gateway tools/call.
- Remove the controller-rs skill/search placeholder before it becomes a dependency.
- Ensure the agent falls back to gateway tools/list when no catalog cache is available.
- Keep direct gateway tools/list and tools/call working without portal skills.
Phase 2: API Endpoint Catalog Sync
- Add portal UI for endpoint-first import and resync.
- Use existing API version parsing to populate api_endpoint_t for OpenAPI and MCP tools, including manual MCP tools/list payloads accepted in the API version spec field.
- Sync all endpoint rows for the API version into the endpoint catalog. Do not limit the catalog to the endpoints currently selected for one gateway instance.
- Import or refresh LightAPI Description metadata for endpoint enrichment.
- Publish selected endpoint rows into tool_t as agent-facing tool projections.
- Generate tool_param_t from each endpoint’s logical input schema.
- Link every API-origin tool projection back to api_endpoint_t.endpoint_id.
- Store semantic routing metadata in indexed endpoint/tool fields and preserve the full metadata payload in api_endpoint_t.tool_metadata.
- If the first code slice only writes tool_metadata, keep that as a compatibility step and add the indexed routing-column migration before database-side macro-filtering or production vector ranking is enabled.
- Let users select which endpoint projections should be exposed to a specific gateway instance. This deployment selection is separate from endpoint catalog sync.
- Verify runtime executability outside persistence with gateway tools/list for the selected gateway instance when a gateway is reachable.
- Mark disappeared or non-executable projections inactive instead of deleting them.
- Add drift indicators for schema, description, safety metadata, and semantic routing metadata changes.
Phase 3: Skill Authoring
- Keep the existing skill_t CRUD page as the phase 3 authoring surface.
- Add skill-scoped category and tag assignment to the create/update skill forms. The UI should use dropdowns populated from the existing portal taxonomy where entity_type = 'skill'.
- Persist skill categories through entity_category_t and skill tags through entity_tag_t; do not add tags or categories columns to skill_t.
- Implement skill save as a composite command: one event updates the skill row and one taxonomy event replaces the selected category/tag associations for the same skill.
- Keep content_markdown as the instruction body. YAML or JSON skill files are import/export envelopes; if full structured skill authoring is introduced later, add a nullable JSONB skill-spec column beside content_markdown instead of replacing it.
- Keep embeddings optional.
Phase 3.5: Skill Workspace And Structured Authoring
- Add a richer Skill Workspace with Overview, Tools, Workflow, Preview, and Test tabs.
- Add tool linking workflows for skill_tool_t and formalize skill_tool_t.config for per-skill tool overrides.
- Add workflow-backed skill support through skill_workflow_t, with wf_definition_t.definition kept as the canonical workflow YAML.
- Reuse the generic Workflow Editor in the Workflow tab for YAML editing, step preview, validation, and test runs.
- Add validation that workflow tool-call steps reference tools linked to the skill through skill_tool_t.
- Add “create skill from LightAPI/tool” flows that can generate a draft skill, link relevant tools, and optionally create a starter workflow definition.
- Add YAML/JSON import/export for structured skill documents. Normalize YAML to JSON for storage when a persisted structured payload is needed, while keeping Markdown instructions in content_markdown.
Phase 4: Agent Assignment
- Add portal UI for agent_skill_t.
- Let operators assign active skills to agent definitions.
- Add an Agent Definition assignment entry point in addition to the existing agent_skill_t table page, so operators can manage assigned skills from the agent context.
- Add a batch assignment composite command that emits one AgentSkillCreatedEvent per selected skill.
- Add validation that assigned skills have at least one active direct skill_tool_t link. A workflow-backed skill does not satisfy this by having only skill_workflow_t; the workflow must use the skill’s linked tools.
- Enforce assignment validation in command handlers and mirror the same checks as UI preflight feedback.
- Treat sequence_id as the deterministic effective prompt/display order and priority as a ranking weight for later catalog/search behavior.
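The assignment check and the ordering rule above can be sketched together. The row shapes are assumptions for illustration: skill and link rows are taken to carry a boolean active flag, and both function names are hypothetical. The same check would run in the command handlers and again as UI preflight feedback.

```javascript
// Hypothetical check: a skill is assignable only if it is active and has
// at least one active direct skill_tool_t link.
function validateAssignment(skill, skillToolLinks) {
  const errors = [];
  if (!skill.active) errors.push("skill is not active");
  const hasActiveTool = skillToolLinks.some(
    (link) => link.skill_id === skill.skill_id && link.active,
  );
  if (!hasActiveTool) errors.push("skill has no active direct skill_tool_t link");
  return errors;
}

// sequence_id decides prompt/display order; priority is left for later ranking.
function orderAssignedSkills(assignments) {
  return [...assignments].sort((a, b) => a.sequence_id - b.sequence_id);
}

const links = [
  { skill_id: "s1", tool_id: "t1", active: true },
  { skill_id: "s2", tool_id: "t2", active: false },
];
const okErrors = validateAssignment({ skill_id: "s1", active: true }, links);
const badErrors = validateAssignment({ skill_id: "s2", active: false }, links);
const ordered = orderAssignedSkills([
  { skill_id: "s2", sequence_id: 20, priority: 5 },
  { skill_id: "s1", sequence_id: 10, priority: 1 },
]);
console.log(okErrors, badErrors, ordered.map((a) => a.skill_id));
```

A batch assignment command could run validateAssignment for each selected skill and reject the whole request if any skill fails, though whether the batch is all-or-nothing is not stated in this design.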
Phase 5: Real Skill Search
- Add the dedicated genai-query getEffectiveAgentCatalog endpoint with token verification against host, sid, and env claims.
- The endpoint returns the active nested catalog for one hostId + agentDefId + serviceId + envTag: agent metadata, assigned skills, tags, categories, skill config, mapped tools, tool params, routing/safety fields, workflow references, catalogVersion, and catalogHash.
- Implement the Rust light-agent portal-query client using that endpoint.
- Build and cache the nested effective catalog inside the Rust agent.
- Start with local macro-filtering and keyword matching over cached skills, endpoint metadata, and tool projections.
- Intersect selected catalog tool names with gateway tools/list; execute only through gateway tools/call.
- Wire controller cache-management invalidation to clear the Rust agent catalog cache. The next chat request lazily reloads from portal-query.
- If portal-query is unavailable or no agent definition ID is configured, the Rust agent falls back to direct gateway tools/list without portal catalog filtering.
- Add vector ranking after 384-dimensional embeddings are populated and combine it with semanticWeight.
Phase 6: Semantic Routing And Governance
- Support the Rust light-agent only. Other agent runtimes can adopt the same catalog and diagnostics contracts later.
- Use the normalized sensitivity tiers public, internal, confidential, and restricted. Treat missing or unknown tool tiers as internal.
- Enforce sensitivity-tier disclosure before portal-query returns the effective catalog to the agent. Tools blocked by policy are omitted from the returned tools list and surfaced as diagnostics for admin review.
- Block destructive or approval-required tools unless the skill/tool policy names an approval workflow. Until workflow-owned approval state exists, the current active row plus aggregate version remains the catalog versioning authority.
- Keep gateway tools/list and tools/call as the runtime source of truth. The Rust agent must still intersect catalog-selected tools with live gateway tools/list.
- Add Rust-agent diagnostics that compare the effective catalog against gateway tools/list at /diagnostics/tools, showing catalog tools missing from the gateway, gateway tools outside the catalog, and policy-blocked catalog tools.
- Enforce the same destructive, approval-required, and sensitivity metadata at the gateway before tools/call execution. A blocked call should include auditInfo fields and gateway debug/warn logs with the tool name, endpoint, tier, policy reason, and approval state.
- Do not write catalog-disclosure audit records into audit_log_t; it is reserved for workflow. Phase 6 uses auditInfo so the existing audit log file path captures blocked gateway decisions. A generic audit table can be added in a later governance phase if file logging is not enough.
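The tier normalization, disclosure filter, and catalog-versus-gateway comparison above can be sketched as follows. Several parts are assumptions: treating the four tiers as an ordered ladder from public to restricted, expressing agent clearance as one of those tier names, defaulting an unknown clearance to public, and all function and field names. Only the tier names and the missing-or-unknown rule for tools come from the text.

```javascript
// Assumed ordering of the normalized tiers, lowest to highest.
const TIER_RANK = new Map([
  ["public", 0],
  ["internal", 1],
  ["confidential", 2],
  ["restricted", 3],
]);

// Missing or unknown tool tiers are treated as internal.
function normalizeTier(tier) {
  const key = typeof tier === "string" ? tier.toLowerCase() : "";
  return TIER_RANK.has(key) ? key : "internal";
}

// Hypothetical disclosure filter: blocked tools are omitted and reported as diagnostics.
function discloseTools(tools, agentClearance) {
  // Assumption: an unknown clearance gets the lowest level.
  const limit = TIER_RANK.get(agentClearance) ?? 0;
  const disclosed = [];
  const diagnostics = [];
  for (const tool of tools) {
    const tier = normalizeTier(tool.sensitivityTier);
    if (TIER_RANK.get(tier) <= limit) {
      disclosed.push(tool);
    } else {
      diagnostics.push({ name: tool.name, tier, reason: "sensitivity tier above agent clearance" });
    }
  }
  return { tools: disclosed, diagnostics };
}

// Hypothetical diagnostics comparison between catalog tool names and gateway tools/list names.
function compareWithGateway(catalogToolNames, gatewayToolNames) {
  const catalog = new Set(catalogToolNames);
  const gateway = new Set(gatewayToolNames);
  return {
    missingFromGateway: [...catalog].filter((name) => !gateway.has(name)),
    outsideCatalog: [...gateway].filter((name) => !catalog.has(name)),
  };
}

const disclosure = discloseTools(
  [
    { name: "getAccounts", sensitivityTier: "internal" },
    { name: "getTaxId", sensitivityTier: "restricted" },
    { name: "getRates" },
  ],
  "internal",
);
const drift = compareWithGateway(["getAccounts", "getRates"], ["getAccounts", "ping"]);
console.log(disclosure, drift);
```

Because the filter runs before the catalog leaves portal-query, a blocked tool's description and schema never reach the agent, and the gateway still enforces the same metadata at tools/call time.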
Resolved Phase 2 Decisions
- Phase 2 endpoint catalog sync covers all endpoint rows for an API version. Gateway exposure is a separate step where users select which endpoint/tool projections to deploy to a specific gateway instance.
- Runtime verification means checking the selected gateway instance’s tools/list response to confirm that a deployed endpoint projection is executable there. It is not the same as endpoint catalog sync and should be implemented in the portal UI, deployment review flow, or diagnostics layer, not inside the persistence projection.
- Gateway exposure identity is hostId + serviceId + envTag. The token used for portal APIs must carry matching host, sid, and env claims.
- tool_t.implementation_type should be standardized and aligned with the LightAPI Description execution model. Endpoint-backed tools should use the standardized endpoint implementation type, with downstream protocol stored in endpoint and LightAPI metadata.
- High-use semantic routing fields should be indexed columns or indexed relationships, with the full structured payload preserved in api_endpoint_t.tool_metadata. JSON-only persistence is only an interim import/export-compatible shape for small catalogs or local-cache search.
- LightAPI Description enrichment requires an upstream parser or ingestion worker to emit enriched endpoint payloads. The persistence layer can store tool_schema, tool_metadata.routing, and tool_metadata.safety, but it does not derive those fields from the raw LightAPI document by itself.
- Endpoint category and tag classification should reuse the existing portal tag and category system.
- Embeddings should start at 384 dimensions to match the current VECTOR(384) schema. Use a provider abstraction so hosted OpenAI embeddings or local embedding services can be swapped without changing the catalog schema.
- genai-query should expose a dedicated effective catalog endpoint. Its token verification must match request host_id, service_id, and env_tag against token host, sid, and env claims.
- Cache reload behavior depends on the loading strategy. Startup-loaded catalogs should eagerly reload after invalidation. First-request-loaded catalogs can reload lazily on the next request.
- Phase 2 focuses on tool and endpoint metadata. Skill-specific metadata and per-skill tool config should be designed later with the skill authoring phase.
- Phase 3 uses the existing taxonomy join tables for skill tags and categories. Skill files may be YAML or JSON, but the database should keep content_markdown for the instruction body; a structured JSONB skill-spec column belongs in a later full authoring/import phase if it becomes needed.
Resolved Phase 3.5 Decisions
- Use light-workflow for workflow-backed skills that need durable multi-tool orchestration, approvals, assertions, retries, scheduled tests, or audit history.
- Do not force every skill into a workflow. Skills remain the discovery and guidance layer, and simple skills can stay instruction-and-tool based.
- Keep light-gateway as the runtime tool execution path. Workflow tasks that call tools should still use gateway-visible tool identities and should not bypass gateway policy.
- Keep workflow definitions in wf_definition_t.definition as YAML. Link skills to workflow definitions through skill_workflow_t instead of embedding workflow definitions in skill_t.
- Treat skill_tool_t as the allowed tool set for workflow-backed skills. Save-time validation should flag workflow tool calls that are not linked to the skill.
- Build the workflow authoring UI as a generic reusable editor first, then embed it inside the Skill Workspace with skill-aware reference filtering and validation.
Resolved Phase 4 Decisions
- An assignable skill must be active and must have at least one active direct tool link through skill_tool_t. Active skill_workflow_t rows are useful orchestration metadata, but they do not replace the direct allowed-tool set.
- Workflow-backed skill assignment should also rely on the Phase 3.5 validator: workflow tool-call references must resolve to tools linked through skill_tool_t.
- Validation must be enforced server-side by createAgentSkill, updateAgentSkill, and the batch assignment composite command. The Portal UI should run the same checks as preflight feedback, but UI checks are not authoritative.
- Keep the existing AgentSkill table page and add an Agent Definition assignment context so operators can assign and inspect skills from the agent they are configuring.
- Batch assignment should be a composite command that creates multiple AgentSkillCreatedEvent events from one request.
- sequence_id controls deterministic ordering when building the agent’s effective skill prompt/catalog. priority is reserved as a ranking weight for later effective-catalog and search behavior.
- Live gateway runtime executability checks are not part of Phase 4 persistence validation. Keep them as a diagnostics or governance item that compares cataloged/assigned tools with the selected gateway instance’s tools/list response before deployment or runtime enablement.
Recommendation
Implement this as a progressive control-plane enhancement. The gateway remains
the execution path, and portal-authored skills become the agent guidance layer
served by portal-query. The agent should cache the effective catalog locally and
reload it after controller cache-management invalidation. This lets MCP tools
work immediately through tools/list and tools/call, while still giving
portal operators a clean path to organize tools into skills, assign those
skills to agents, and improve retrieval over time.
Workflow Editor
Purpose
The Workflow Editor is the generic Portal authoring surface for
light-workflow definitions. It should replace the raw textarea-only workflow
definition experience with a structured editor that still preserves YAML as the
canonical workflow definition stored in wf_definition_t.definition.
The editor is reusable. It can be opened from the Workflow Definition page, embedded in the Skill Workspace, or used by future task-specific authoring flows such as API onboarding, scheduled live tests, and remediation playbooks.
Design Boundary
light-workflow owns workflow execution, task state, retries, waiting human
tasks, and audit events. The Portal editor authors definitions and starts test
runs, but it must not implement its own workflow runtime.
The gateway remains the runtime tool execution path. Workflow steps that invoke tools should reference gateway-visible tools or endpoint descriptions and then execute through the same runtime path used by agents.
The editor should not duplicate endpoint contracts. API, MCP, JSON-RPC, gRPC, and other endpoint details belong in LightAPI descriptions, OpenAPI/OpenRPC documents, protobuf metadata, or the portal endpoint catalog. Workflow tasks reference those descriptions and provide step-level wiring, guards, exports, and error handling.
Current State
The current Portal implementation already has the persistence and generic CRUD surface needed for a first editor:
- wf_definition_t stores namespace, name, version, and definition.
- workflow-command exposes create, update, delete, and start workflow commands.
- workflow-query exposes workflow definition reads.
- portal-view has a Workflow Definition table and generic create/update forms whose definition field is a YAML textarea.
The first Workflow Editor can therefore be an incremental UI improvement over the existing definition CRUD and start workflow command.
Goals
- Keep workflow YAML as the canonical persisted artifact.
- Provide a readable step outline or graph next to the YAML editor.
- Validate definitions before save and before test runs.
- Let users discover and reference endpoint descriptions, gateway tools, skills, rules, and human task types from a side panel.
- Support workflow definition create, update, import, export, and start-test flows.
- Make the editor embeddable so skill authoring can use the same workflow authoring component with skill-specific constraints.
- Preserve owner scoping and existing Portal command/query conventions.
Non-Goals
- Do not execute workflow logic in Portal View.
- Do not make skills the workflow runtime.
- Do not store workflow YAML in skill_t.
- Do not require a visual drag-and-drop graph before the editor is useful.
- Do not copy full API contracts into workflow steps when endpoint descriptions can be referenced.
- Do not fork or embed the Apache KIE Serverless Logic Web Tools as the first implementation path. They are useful reference material for CNCF Serverless Workflow concepts, but they are tightly coupled to the strict upstream spec and would be expensive to adapt for Light-Fabric agentic extensions.
Authoring Model
The editor should maintain two synchronized representations:
| Representation | Purpose |
|---|---|
| YAML source | Canonical text saved to wf_definition_t.definition. |
| Parsed view model | UI-only representation used for step outline, validation, references, and property panels. |
All saves should serialize from the YAML source or from a parsed model that round-trips to the same specification format. If the visual editor changes a step, it should update the YAML and keep the YAML visible.
The editor should support progressive enhancement:
- YAML editor plus parsed step outline.
- Step palette and property panel that edit YAML safely.
- Read-only graph preview.
- Drag-and-drop graph editing once round-trip behavior is reliable.
Implementation Architecture
The recommended implementation is a custom React editor built from focused building blocks:
| Component | Recommended library | Responsibility |
|---|---|---|
| Source editor | CodeMirror 6 with JSON/YAML extensions | Edit YAML/JSON, validate against the Light-Fabric workflow schema, provide autocomplete, lint markers, folding, and hover help. |
| Visual graph | React Flow / xyflow | Render workflow states as nodes and transitions as edges, with custom node components for agentic task types. |
| Property panels | Schema-backed React forms, optionally JSONForms | Edit selected node/task properties without forcing users to hand-edit every YAML field. |
| State manager | Existing portal state pattern or Zustand if a local editor store is needed | Hold the canonical workflow document, parsed model, diagnostics, selected node, dirty state, and test run state. |
The workflow YAML or JSON document remains the source of truth. CodeMirror edits parse into the editor store. The parsed workflow model is then projected into React Flow nodes and edges. React Flow edits update the same model and then serialize back to the YAML document.
This avoids adding a second large browser editor runtime to portal-view,
which already uses CodeMirror for Markdown and OpenAPI JSON/YAML editing. It
also avoids fighting a visualizer that only understands the strict CNCF
Serverless Workflow schema, while still letting Portal define first-class
visual treatments for Light-Fabric task types such as agent, mcp, ask,
assert, rule, switch, and future LLM or approval-oriented steps.
CodeMirror should use a custom JSON Schema derived from the CNCF Serverless
Workflow schema plus Light-Fabric agentic extensions. For JSON definitions,
use a CodeMirror 6 JSON Schema integration such as codemirror-json-schema to
provide linting, autocomplete, and hover details. For YAML definitions, reuse
the existing portal-view CodeMirror YAML setup where possible and add schema
validation through a YAML language-server bridge or equivalent worker-backed
integration. The goal is Monaco-like schema assistance without Monaco’s bundle
cost.
React Flow should not own the persisted shape. It owns layout, selection, edge creation, and node interaction. The persisted workflow definition should remain independent of the canvas library so a future editor or CLI can read the same definitions.
Recommended sync behavior:
- Parse CodeMirror content into a typed workflow model when the YAML is valid.
- Preserve text edits and show problems when YAML is invalid; do not destroy the user’s in-progress text.
- Project valid workflow models to React Flow nodes and edges.
- Let graph edge changes update transition targets in the model.
- Let property-panel changes update the model through schema-aware controls.
- Serialize model changes back into the YAML document using stable formatting.
- Keep conflict handling explicit when source edits and graph edits race.
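The sync rules above can be sketched as two small store updates. This is only an illustration under stated assumptions: `JSON.parse` stands in for the YAML parser, and the state fields (`text`, `model`, `diagnostics`) and function names are hypothetical, not the portal-view store API.

```javascript
// Sketch of an editor store update. JSON.parse is a stand-in for the YAML
// parser, and the state field names are hypothetical.
function applySourceEdit(state, text) {
  try {
    const model = JSON.parse(text);
    return { ...state, text, model, diagnostics: [] };
  } catch (err) {
    // Invalid source: keep the user's text and the last valid model,
    // and surface the parse problem instead of discarding the edit.
    return {
      ...state,
      text,
      diagnostics: [{ severity: "error", message: err.message }],
    };
  }
}

// Graph and property-panel edits change the model, then re-serialize it with
// fixed indentation so repeated saves produce stable text.
function applyModelEdit(state, update) {
  const model = update(state.model);
  return { ...state, model, text: JSON.stringify(model, null, 2), diagnostics: [] };
}

const initial = { text: "{}", model: {}, diagnostics: [] };
const valid = applySourceEdit(initial, '{"steps":[{"id":"approve"}]}');
const broken = applySourceEdit(valid, '{"steps":[');
const renamed = applyModelEdit(valid, (m) => ({ ...m, name: "review" }));
```

After the invalid edit, `broken.text` holds the in-progress text while `broken.model` is still the last valid model, so the outline and graph can keep rendering.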
Mermaid can be used for documentation or a lightweight read-only preview, but it is not the long-term authoring surface. JSONForms can be useful inside property panels, but it should not replace the graph/source editor combination.
Layout
Recommended first layout:
| Region | Contents |
|---|---|
| Header | Namespace, name, version, owner, active state, save, validate, import, export, and test actions. |
| Left panel | Step outline, problems, references, and search. |
| Main panel | YAML editor with syntax highlighting and parse markers. |
| Right panel | Selected step properties, input/output/export preview, and endpoint/tool metadata. |
| Bottom panel | Test input, validation results, workflow events, waiting tasks, and output. |
The generic Workflow Definition page can use the full layout. The Skill Workspace can embed the same editor with a narrower reference scope and a skill-aware validation profile.
Step Palette
The editor should understand the task types defined by the Light-Fabric agentic workflow design:
| Step type | Use |
|---|---|
| ask | Pause for human input, approval, or missing values. |
| assert | Validate context, API results, or business rules. |
| http / openapi | Invoke HTTP endpoints directly or through cataloged descriptions. |
| jsonrpc / openrpc | Invoke JSON-RPC methods directly or through OpenRPC descriptions. |
| grpc | Invoke cataloged gRPC methods. |
| mcp | Invoke gateway-visible MCP tools, resources, or prompts. |
| rule | Delegate complex checks to Light-Rule. |
| agent | Delegate a bounded task to an agent worker. |
| switch / condition | Branch based on workflow context or task output. |
| set / export | Move task results into workflow context. |
| wait | Represent a durable wait, timeout, or externally completed task. |
The palette should create minimal valid YAML fragments. Users can then edit the full YAML when advanced options are needed.
Reference Panel
The editor should help authors reference existing catalog objects instead of typing fragile identifiers by hand:
- workflow definitions and versions,
- LightAPI endpoint descriptions,
- API endpoints and tool projections,
- gateway-visible MCP tools,
- rule definitions,
- agent definitions,
- skills and skill-linked tools when the editor is embedded in the Skill Workspace.
For generic workflow authoring, the reference panel can show all objects the current user is allowed to read. For skill authoring, it should filter tools to the skill’s linked tools and flag references outside that set.
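The skill-scoped check described above could look roughly like the following. The step shape (`id`, `tool`) and the function name are assumptions made for illustration. In the portal, the linked tool set would come from skill_tool_t.

```javascript
// Hypothetical shapes: each step may carry a `tool` name, and `linkedTools`
// is the list of tool names linked to the skill.
function checkToolReferences(steps, linkedTools) {
  const allowed = new Set(linkedTools);
  return steps
    .filter((step) => step.tool !== undefined && !allowed.has(step.tool))
    .map((step) => ({
      stepId: step.id,
      tool: step.tool,
      message: `Tool "${step.tool}" is not linked to this skill`,
    }));
}

const sampleSteps = [
  { id: "lookup", tool: "listPets" },
  { id: "confirm" },                    // no tool reference, ignored
  { id: "cleanup", tool: "deletePet" }, // outside the linked set
];
const flagged = checkToolReferences(sampleSteps, ["listPets", "getPet"]);
```

Here `flagged` contains a single entry for the `cleanup` step, which the editor can show as a warning or rejection.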
Validation
Validation should run in layers:
| Layer | Checks |
|---|---|
| Syntax | YAML parses, document shape is valid, and duplicate keys are rejected when possible. |
| Specification | Required workflow fields, step IDs, task type structure, branch targets, exports, and inputs are valid. |
| Catalog references | Referenced endpoint descriptions, tools, rules, agents, and child workflows exist and are active. |
| Security | Sensitive or destructive steps have required approval, visibility, and ownership metadata. |
| Skill embedding | Workflow tool calls are linked through skill_tool_t when editing a workflow-backed skill. |
| Runtime diagnostics | Optional gateway tools/list checks compare cataloged tool names with deployed gateway availability. |
Runtime diagnostics should be separate from persistence validation. A workflow definition can be saved before a gateway is reachable, but the editor should make missing runtime executability visible before test or deployment.
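One way to keep persistence validation and runtime diagnostics apart is to mark each layer as blocking or non-blocking for save. The sketch below assumes hypothetical layer objects with a `check` function that returns problem strings. The three sample layers are simplified stand-ins for the layers in the table.

```javascript
// Runs layers in order. A failing save-blocking layer stops the run because
// later layers assume the earlier ones passed.
function runValidation(definition, layers) {
  const results = [];
  for (const layer of layers) {
    const problems = layer.check(definition);
    results.push({ layer: layer.name, problems });
    if (layer.blocksSave && problems.length > 0) {
      return { results, canSave: false, canRun: false };
    }
  }
  // Non-blocking layers (runtime diagnostics) never prevent a save,
  // but any reported problem prevents a test run.
  const canRun = results.every((r) => r.problems.length === 0);
  return { results, canSave: true, canRun };
}

const deployedTools = new Set(["listPets"]); // stand-in for gateway tools/list
const layers = [
  { name: "syntax", blocksSave: true,
    check: (d) => (d !== null && typeof d === "object" ? [] : ["document is not an object"]) },
  { name: "specification", blocksSave: true,
    check: (d) => (Array.isArray(d.steps) ? [] : ["steps must be a list"]) },
  { name: "runtime", blocksSave: false,
    check: (d) => d.steps
      .filter((s) => s.tool && !deployedTools.has(s.tool))
      .map((s) => `tool ${s.tool} is not listed by the gateway`) },
];

const saved = runValidation({ steps: [{ id: "a", tool: "deletePet" }] }, layers);
const rejected = runValidation(null, layers);
```

In this sketch `saved` can be persisted but not run, while `rejected` stops at the syntax layer.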
Test Runner
The editor should support a test panel that starts a workflow instance through the existing workflow start command and then reads instance events and task state through the workflow query APIs.
The test panel should support:
- JSON workflow input,
- start run,
- event stream or polling view,
- current context and output preview,
- waiting task completion for ask or approval steps,
- assertion and rule failure display,
- gateway or endpoint call failure display,
- rerun with the same input.
The test runner is a client of light-workflow; it does not execute workflow
steps in the browser.
Skill Workspace Integration
Phase 3.5 skill authoring should embed the Workflow Editor rather than create a second skill-specific workflow UI.
Recommended integration:
- The Skill Workspace has a Workflow tab.
- The tab lets the user choose none or workflow-backed.
- In workflow-backed mode, the user can select an existing workflow definition or create a draft definition.
- The link is stored in skill_workflow_t.
- The editor reference panel filters tool references to the tools linked by skill_tool_t.
- Validation rejects or warns on workflow tool calls not present in the skill’s allowed tool set.
- The Test tab starts the linked workflow with sample JSON input and displays the same workflow events used by the generic editor.
This keeps the skill as a discovery and guidance artifact while light-workflow
owns deterministic orchestration.
Data And API Changes
The first generic editor can reuse existing workflow definition APIs. Later phases should add editor-friendly endpoints only when they remove real UI complexity.
Phase B adds the validation endpoint and keeps the reference catalog composed from existing read models. A single combined catalog endpoint remains optional if the multiple list queries become noisy or slow.
| API or table | Purpose |
|---|---|
| validateWfDefinition | Server-side validation using the workflow query service parser and, later, the same schema as light-workflow. |
| formatWfDefinition | Optional canonical formatting if the workflow parser supports round-trip formatting. |
| Existing catalog queries | Fetch endpoint, tool, rule, agent, and workflow labels for the reference panel. |
| getWorkflowReferenceCatalog | Optional future consolidation into one reference-panel query. |
| startWorkflow | Start an editor test run for the saved workflow definition with sample JSON input. |
| Workflow runtime read models | Refresh process, task, task assignment, worklist, and audit-log projections for the current workflow instance. |
| completeTask | Complete a waiting ask or human task from the editor test panel by emitting a TaskInfoUpdatedEvent. |
| skill_workflow_t | Link skills to workflow definitions without embedding workflow YAML in skills. |
| saveSkillWorkspace | Composite command that saves skill metadata, taxonomy, tool links, workflow links, and optional draft workflow updates from one workspace action. |
Server-side validation should be authoritative. Client-side validation is useful for responsiveness but should not be the only guard before saving or testing a workflow definition.
Phased Implementation
Phase A: Structured YAML Editor
- Add a generic Workflow Editor component and route.
- Replace create/update workflow definition textarea navigation with the editor where practical.
- Keep YAML visible and canonical.
- Reuse the existing portal-view CodeMirror editor stack with the Light-Fabric workflow schema for YAML/JSON validation, autocomplete, hover help, folding, and parse markers.
- Parse YAML client-side to render a step outline and problems panel.
- Add import/export and basic validation before save.
Phase B: Catalog-Aware Authoring
- Add a reference panel for endpoint descriptions, tools, rules, agents, and workflow definitions.
- Add a step palette that inserts valid YAML snippets.
- Add schema-backed property panels for selected steps. Use dropdowns for catalog references and constrained enums instead of free-text fields where Portal already has authoritative labels.
- Add server-side validation through validateWfDefinition.
- Add runtime diagnostics that compare MCP tool references with gateway tools/list or the Rust agent /diagnostics/tools endpoint when a gateway target is selected.
Phase C: Test And Worklist Integration
- Add a test runner panel backed by light-workflow start and query APIs.
- Show workflow events, current task state, waiting human tasks, assertions, and final output.
- Let users complete ask tasks from the test panel.
- Link failed test runs to remediation tasks or worklist entries.
Phase C uses the existing Portal workflow command/query boundary. The editor
starts a test run through workflow/startWorkflow, then refreshes
getProcessInfo, getTaskInfo, getTaskAsst, getWorklist, and
getAuditLog for the returned wfInstanceId. The test panel completes a
waiting human task through workflow/completeTask, which preserves the
structured response in the event data and materializes the task as completed
through the existing TaskInfoUpdatedEvent projection.
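The start-then-refresh sequence can be sketched as below. The endpoint names come from the paragraph above, but everything else is an assumption: `call(endpoint, payload)` is a hypothetical transport supplied by the portal client, and the payload fields (`wfDefId`, `input`) are placeholders rather than the real command schema.

```javascript
// Starts a test run, then refreshes the read models for the returned instance.
// `call` is an injected, hypothetical transport: (endpoint, payload) => Promise.
async function startAndRefresh(call, definitionRef, input) {
  const started = await call("workflow/startWorkflow", { ...definitionRef, input });
  const wfInstanceId = started.wfInstanceId;
  const queries = ["getProcessInfo", "getTaskInfo", "getTaskAsst", "getWorklist", "getAuditLog"];
  const responses = await Promise.all(queries.map((q) => call(q, { wfInstanceId })));
  // Key each response by its query name so panels can read them directly.
  const readModels = Object.fromEntries(queries.map((q, i) => [q, responses[i]]));
  return { wfInstanceId, readModels };
}

// Fake transport for illustration only; it performs no network calls.
const fakeCall = async (endpoint, payload) =>
  endpoint === "workflow/startWorkflow"
    ? { wfInstanceId: "wf-1" }
    : { endpoint, wfInstanceId: payload.wfInstanceId, rows: [] };

const pendingRun = startAndRefresh(fakeCall, { wfDefId: "def-1" }, { petId: 1 });
```

Injecting the transport keeps the browser as a client of light-workflow only, and makes the panel logic testable without a running service.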
The panel should expose remediation links instead of silently creating production work. Failed process or task rows can open a prefilled remediation task form, and task assignments can jump to the workflow worklist with the current workflow instance context.
Phase D: Visual Graph Editing
- Add a React Flow graph preview after the outline is stable.
- Represent Light-Fabric task types with custom React Flow nodes and explicit transition edges.
- Add drag-and-drop graph editing only after YAML/model round-trip behavior is reliable.
- Keep YAML as the source of truth even when visual editing is enabled.
Phase D adds the graph as a projection of the parsed YAML model, not a separate
persisted representation. The graph reads steps, tasks, states, or do
containers and renders one custom React Flow node per detected step. Node
styling reflects the Light-Fabric task type, and the graph can overlay runtime
task status from the Phase C test-run read models when the workflow task id
matches a graph step id.
Explicit transition fields such as next, then, to, and transition
become solid graph edges. Ordered fallback edges are shown as dashed edges so
authors can distinguish model transitions from inferred sequence. Creating an
edge in React Flow updates the source step’s transition in YAML, and deleting an
explicit edge removes that transition target from YAML. Dragging nodes changes
only the authoring layout in the browser session; it does not mutate the saved
workflow definition.
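A minimal projection along these lines might look as follows. It assumes the parsed model exposes steps as an ordered array with `id` and `type`, and that a transition is a single string target. Real definitions may nest steps differently. The output uses the plain node and edge object shape that React Flow accepts (`id`, `type`, `position`, `data` for nodes, and `id`, `source`, `target`, `style` for edges).

```javascript
const TRANSITION_FIELDS = ["next", "then", "to", "transition"];

function projectToGraph(steps) {
  const nodes = steps.map((step, index) => ({
    id: step.id,
    type: step.type,                    // selects a custom node per task type
    position: { x: 0, y: index * 120 }, // session-only layout, never persisted
    data: { label: step.id },
  }));
  const edges = [];
  steps.forEach((step, index) => {
    const field = TRANSITION_FIELDS.find((f) => typeof step[f] === "string");
    if (field) {
      // Explicit transition from the model: solid edge.
      edges.push({ id: `${step.id}->${step[field]}`, source: step.id,
        target: step[field], data: { kind: "explicit" } });
    } else if (index + 1 < steps.length) {
      // Inferred sequence: dashed fallback edge to the next step.
      const target = steps[index + 1].id;
      edges.push({ id: `${step.id}->${target}`, source: step.id, target,
        style: { strokeDasharray: "4 4" }, data: { kind: "fallback" } });
    }
  });
  return { nodes, edges };
}

const graph = projectToGraph([
  { id: "collect", type: "ask" },
  { id: "check", type: "assert", then: "done" },
  { id: "done", type: "set" },
]);
```

Because node positions are computed here and not read from the definition, dragging a node only changes browser state.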
The graph must continue to tolerate partial or invalid authoring states. If the YAML cannot be parsed into a known workflow container, the editor keeps the source editor and validation panels usable and shows an empty graph state rather than blocking authoring.
Recommendation
Build the generic Workflow Editor before the Skill Workspace embeds workflow authoring. The skill UI should provide context and constraints, while the workflow editor provides YAML editing, step preview, validation, and test runs for every workflow authoring use case in Portal.
Portal Catalog Scope
Problem
Light Portal supports multiple tenants through host_id and can also host
multiple runtime environments in one portal instance. A common deployment shape
is:
| Portal instance | Runtime environments |
|---|---|
| Instance A | dev, sit |
| Instance B | stg, prd |
Within an organization or a cloud deployment, operators need a catalog for APIs, API endpoints, tools, skills, schemas, rules, workflows, categories, and tags. Some catalog entries are reusable platform knowledge. Other entries are tenant-owned, environment-bound, or tied to a concrete gateway deployment.
The main design question is whether Light Portal should clone catalog rows into every host/tenant, or maintain one shared catalog per portal instance and expose it through a separate single page application and virtual host.
The recommended answer is neither full cloning nor a UI-only split. The portal should model catalog scope explicitly:
- shared catalog definitions use global scope,
- tenant-specific definitions and overrides use host scope,
- environment-specific runtime bindings use host plus environment scope,
- a separate SPA may expose the same backend catalog, but it should not become the catalog authority.
Goals
- Avoid duplicating the full catalog for every tenant.
- Prevent catalog drift between tenants and between portal instances.
- Preserve tenant isolation for private APIs, private skills, secrets, access control, and runtime bindings.
- Let dev and sit share one portal instance while still keeping their runtime endpoint targets separate.
- Let stg and prd share another portal instance while keeping production controls stricter.
- Support an effective catalog query that combines global definitions with host-specific rows and environment-specific bindings.
- Reuse existing portal-query APIs and the genai-query catalog direction for agent-facing skills and tools.
- Keep light-gateway as the runtime MCP execution path for tools/list and tools/call.
- Support promotion or import/export between portal instances instead of relying on ad hoc row copies.
Non-Goals
- Do not clone every global catalog row into every tenant by default.
- Do not make a separate SPA the source of truth for catalog data.
- Do not bypass host-scoped authorization just because a catalog item is global.
- Do not put secrets, client credentials, runtime tokens, or deployment state in global catalog rows.
- Do not move MCP tool execution from light-gateway into portal-query, controller-rs, or the catalog UI.
- Do not require every MCP or API endpoint to be wrapped in a skill before the gateway can expose it as a runtime tool.
Current Model
The database already contains both global-capable and host-scoped patterns.
category_t and tag_t have nullable host_id. A null host_id means the
category or tag is global. A non-null host_id means the row belongs to one
host. Their unique indexes already separate global uniqueness from host-specific
uniqueness.
The query behavior for category and tag labels returns both host-specific rows and global rows for a host. This is the right shape for taxonomy and catalog organization metadata.
Other catalog entities are currently host-scoped:
- api_t
- api_version_t
- api_endpoint_t
- agent_definition_t
- skill_t
- tool_t
- tool_param_t
- agent_skill_t
- skill_tool_t
- skill_dependency_t
Those tables use host_id NOT NULL and most query paths filter by
host_id = ?. This is correct for private tenant data and runtime-bound data,
but it is too narrow for reusable platform catalog definitions if the only
sharing mechanism is row replication.
Design Decision
Use a scoped catalog inside Light Portal.
The portal backend remains the source of truth. The catalog UI can be part of the existing portal SPA or exposed through another SPA/virtual host, but both UI surfaces must read and write through the same portal-query and command APIs.
The durable model is:
global catalog definition
-> host enablement or host override
-> environment runtime binding
This model allows one shared definition for reusable knowledge and separate tenant or environment controls where isolation matters.
Scope Types
| Scope | Storage meaning | Typical data |
|---|---|---|
| Global | host_id IS NULL or a dedicated global definition row | Shared categories, tags, reusable schemas, rule templates, workflow templates, public tool definitions, shared skill templates |
| Host | host_id = ? | Tenant-owned APIs, private schemas, tenant skills, tenant tools, host-level enablement, access rules |
| Environment | host_id = ? plus env_tag, service id, target host, instance, or deployment binding | dev/sit/stg/prd endpoint targets, gateway exposure, runtime service bindings, deployment state |
| Instance | Separate portal database or portal deployment | Promotion boundary between dev/sit instance and stg/prd instance |
Global rows hold reusable definitions. Host rows establish ownership and isolation. Environment rows select the runtime target.
Catalog Entity Guidance
| Entity | Recommended scope | Reason |
|---|---|---|
| Category | Global by default, host-specific when private taxonomy is needed | Existing schema already supports nullable host_id |
| Tag | Global by default, host-specific when private taxonomy is needed | Existing schema already supports nullable host_id |
| API | Host-scoped, with optional shared template support later | API ownership, lifecycle, and visibility are usually tenant-specific |
| API version | Host-scoped | Carries env_tag, target_host, service id, spec, and runtime-facing version metadata |
| API endpoint | Host-scoped for concrete API versions; may be generated from shared templates | Endpoint availability depends on the owning API version and runtime |
| Tool | Shared definition when generic; host-scoped projection when executable for a tenant | Runtime execution still depends on gateway, endpoint, policy, and service binding |
| Skill | Shared template when reusable; host-scoped copy or override when edited by a tenant | Skills contain prompt guidance that tenants may customize |
| Schema | Global when it is a reusable contract; host-scoped when it contains tenant-private fields or lifecycle | Avoid cloning standard contracts but protect tenant-specific schemas |
| Rule | Global template or host-specific rule | A reusable rule definition is different from enabling that rule for a host |
| Workflow | Global template or host-specific workflow | Templates can be shared, but execution bindings should be host or environment scoped |
Effective Catalog
Consumers should not need to manually merge global and host rows. Portal-query should expose an effective catalog read model for each host and runtime context.
The effective catalog request should include:
- hostId
- serviceId when the catalog is for a gateway, agent, or runtime service
- envTag when the result is environment-specific
- optional agentDefId when the result is for an agent
- optional filters for entity type, category, tag, protocol, routing domain, or capability
The effective catalog response should include:
- global definitions visible to the caller,
- host-specific definitions visible to the caller,
- host overrides that shadow global defaults,
- environment bindings for the requested envTag,
- active state and catalog version or freshness metadata,
- category and tag labels from both global and host-specific taxonomy rows,
- enough provenance to show whether a row came from global scope, host scope, or an environment binding.
Recommended precedence:
environment binding > host override > global definition
This keeps shared definitions stable while allowing host and environment customization.
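The precedence rule can be illustrated with a small merge over scoped rows. The row shape (`scope`, `hostId`, `envTag`, `key`, `values`) is hypothetical and does not mirror the actual tables. The sketch only shows field-level shadowing with provenance tracking.

```javascript
const PRECEDENCE = { global: 0, host: 1, environment: 2 };

function resolveEffective(rows, hostId, envTag) {
  // Keep global rows, this host's rows, and this host's bindings for envTag.
  const visible = rows.filter((row) =>
    row.scope === "global" ||
    (row.hostId === hostId &&
      (row.scope === "host" || (row.scope === "environment" && row.envTag === envTag))));
  // Apply lowest precedence first so higher scopes overwrite shared fields.
  visible.sort((a, b) => PRECEDENCE[a.scope] - PRECEDENCE[b.scope]);
  const byKey = new Map();
  for (const row of visible) {
    const merged = byKey.get(row.key) || { key: row.key, values: {}, provenance: [] };
    Object.assign(merged.values, row.values);
    merged.provenance.push(row.scope);
    byKey.set(row.key, merged);
  }
  return [...byKey.values()];
}

const catalogRows = [
  { scope: "global", key: "tool.listPets", values: { description: "List pets", timeoutMs: 1000 } },
  { scope: "host", hostId: "h1", key: "tool.listPets", values: { timeoutMs: 2000 } },
  { scope: "environment", hostId: "h1", envTag: "dev", key: "tool.listPets", values: { targetHost: "dev.example" } },
  { scope: "environment", hostId: "h1", envTag: "sit", key: "tool.listPets", values: { targetHost: "sit.example" } },
  { scope: "host", hostId: "h2", key: "tool.listPets", values: { timeoutMs: 5000 } },
];
const effective = resolveEffective(catalogRows, "h1", "dev");
```

For host `h1` in `dev`, the result keeps the global description, the host timeout override, and the dev target, with `provenance` recording which scopes contributed.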
Data Model Direction
For tables that already support nullable host_id, keep the current pattern:
host_id IS NULL -> global/shared row
host_id = ? -> host-specific row
For strictly host-scoped catalog tables, do not simply make every host_id
nullable without checking foreign keys and runtime assumptions. Some tables are
correctly host-scoped because they point to tenant-owned APIs, credentials,
gateway endpoints, or agent assignments.
Use one of these patterns per entity:
- Nullable host_id on the definition table when the entity can safely be global and all references can resolve global plus host rows.
- Separate template and binding tables when the definition is global but enablement is tenant-specific.
- Keep the current host-scoped table when the entity is inherently tenant or runtime bound.
For reusable skills and tools, the safest long-term shape is template plus binding:
catalog_skill_template_t
-> host_skill_t or skill_t host override
-> agent_skill_t assignment
catalog_tool_template_t
-> host tool projection
-> skill_tool_t mapping
-> gateway runtime tools/list verification
If the implementation starts smaller, it can add nullable global scope to selected catalog definition tables first, but the query contract must still return the effective catalog and indicate scope provenance.
Separate SPA Or Virtual Host
A separate SPA deployed with LightAPI and sign-in as another BFF virtual host is useful as a catalog presentation surface. It can provide a marketplace-style view for shared APIs, tools, skills, schemas, rules, and workflows.
It should not own separate catalog state.
Recommended use:
- browse global catalog definitions,
- request enablement for a host,
- compare host overrides with global definitions,
- review environment bindings,
- publish or promote catalog versions between portal instances.
Avoid using the separate SPA to bypass tenant-aware portal APIs. The BFF should still pass authenticated requests to portal-query or command APIs, and those APIs must enforce host, service, environment, and role checks.
Environment Handling
Within one portal instance, environments should be runtime bindings, not cloned catalog universes.
For a dev/sit instance:
- one shared catalog can describe a capability,
- dev and sit get separate env_tag bindings,
- runtime endpoints can differ through target_host, service_id, instance, deployment, or gateway registration,
- a tool can be visible in both environments but executable only where the gateway lists it.
For a stg/prd instance:
- stg and prd can share approved global definitions,
- production enablement should require stricter workflow or authorization,
- secrets, tokens, OAuth clients, runtime instances, and deployment state remain environment-specific,
- catalog promotion into prd should preserve stable IDs and versions.
Promotion Between Portal Instances
The boundary between dev/sit and stg/prd is an instance boundary. Treat it as a promotion boundary, not as live replication between tenants.
Recommended promotion flow:
- Author or import catalog definitions in the lower portal instance.
- Review and approve the global or host-scoped definitions.
- Export selected catalog rows with their versions and dependencies.
- Import into the target portal instance.
- Resolve environment bindings for stg or prd.
- Verify runtime exposure through the selected light-gateway tools/list.
- Activate the target bindings.
Promotion should be idempotent. A repeated import of the same catalog version should update or confirm the same target definition instead of creating duplicates.
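Idempotent import can be sketched as an upsert keyed by a stable identity plus version. The `externalId` field is an assumption: the document leaves the choice of stable cross-instance identity open. An in-memory `Map` stands in for the target portal database.

```javascript
// Imports exported rows into `target` (a Map standing in for the database).
// `externalId` is a hypothetical stable identity that survives across instances.
function importCatalog(target, exported) {
  const summary = { created: 0, updated: 0, unchanged: 0 };
  for (const row of exported) {
    const key = `${row.externalId}@${row.version}`;
    const payload = JSON.stringify(row.definition);
    const existing = target.get(key);
    if (existing === undefined) {
      target.set(key, payload);
      summary.created += 1;
    } else if (existing !== payload) {
      target.set(key, payload);
      summary.updated += 1;
    } else {
      summary.unchanged += 1; // repeated import confirms the same definition
    }
  }
  return summary;
}

const targetCatalog = new Map();
const exportedRows = [
  { externalId: "tool.listPets", version: "1.0.0", definition: { protocol: "mcp" } },
  { externalId: "skill.petCare", version: "2.1.0", definition: { tools: ["tool.listPets"] } },
];
const firstImport = importCatalog(targetCatalog, exportedRows);
const secondImport = importCatalog(targetCatalog, exportedRows);
```

The second import reports every row as unchanged and the target still holds two entries, so no duplicates are created.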
Security And Authorization
Global catalog visibility does not mean global execution permission.
Authorization must be checked at these layers:
- portal UI and BFF authentication,
- portal-query read authorization,
- command API write authorization,
- host and environment claim matching,
- category/tag visibility when private taxonomy is used,
- gateway tools/list availability,
- gateway tools/call policy,
- downstream service authorization.
For runtime catalog reads used by gateways and agents, the token should include
host, sid, and, when environment-specific data is requested, env. The
query handler should compare those claims with the requested hostId,
serviceId, and envTag.
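The claim comparison could be sketched like this. The claim names (`host`, `sid`, `env`) and request fields follow the paragraph above. The function name and the return shape are illustrative assumptions, not the portal-query handler API.

```javascript
// Compares verified token claims with the requested catalog scope.
// Assumes the token signature was already validated upstream.
function authorizeCatalogRead(claims, request) {
  if (claims.host !== request.hostId) {
    return { allowed: false, reason: "host mismatch" };
  }
  if (request.serviceId !== undefined && claims.sid !== request.serviceId) {
    return { allowed: false, reason: "service mismatch" };
  }
  // env is only required when environment-specific data is requested.
  if (request.envTag !== undefined && claims.env !== request.envTag) {
    return { allowed: false, reason: "environment mismatch" };
  }
  return { allowed: true };
}

const gatewayClaims = { host: "h1", sid: "gateway-1", env: "dev" };
const okRead = authorizeCatalogRead(gatewayClaims, { hostId: "h1", serviceId: "gateway-1", envTag: "dev" });
const wrongEnv = authorizeCatalogRead(gatewayClaims, { hostId: "h1", serviceId: "gateway-1", envTag: "sit" });
```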
UI Guidance
The portal UI should show catalog scope explicitly:
- Global
- Host
- Environment
For list pages, include filters for scope, environment, category, tag, active state, and source protocol. For detail pages, show whether a host row inherits from a global definition, overrides it, or is private to the host.
For destructive changes, make the target scope clear. Updating a global catalog definition can affect many hosts, while updating a host override should affect only that host.
Migration Approach
- Keep the existing category and tag nullable host_id behavior.
- Add effective catalog read APIs before broad schema changes so callers have a stable contract.
- Identify which catalog entities need global definitions versus host-only rows.
- Add template or nullable-scope tables for reusable definitions.
- Add host enablement or override tables for tenant-specific activation.
- Add environment binding views or APIs for dev, sit, stg, and prd.
- Add import/export or snapshot support for promotion between portal instances.
- Update portal-view to expose scope and provenance.
- Keep existing host-scoped APIs working during the migration.
Open Questions
- Should global reusable skills and tools use nullable host_id in the existing tables, or separate template tables with host bindings?
- Which catalog entities require approval workflow before production activation?
- Should category and tag assignment tables store additional scope metadata, or is scope fully inherited from the referenced category or tag?
- What stable external identity should be used during cross-instance catalog promotion when UUIDs differ between portal databases?
- Should portal-query expose one broad effective catalog endpoint or multiple entity-specific effective endpoints?
OAuth Kafka
Token Exchange
This document outlines the design decisions and implementation details for supporting multiple token exchange flows in the oauth-kafka module.
Comparison of Detection Methods
When implementing token exchange (RFC 8693), the server must determine which identity provider (IdP) issued the subject_token to verify it correctly and map claims.
| Method | Explanation | Pros | Cons | Recommended For |
|---|---|---|---|---|
| JWT Peek (iss) | Server decodes the token header/body without verification to read the iss claim. | Zero client configuration; Uses standard parameters. | Token is parsed twice; Sensitive to malformed tokens. | Public OIDC providers (Azure, Okta, Google). |
| Custom URNs | Client sends a specific requested_token_type (e.g. urn:networknt:msal). | Explicit and unambiguous; Follows standard extensibility. | Clients must know the specific URNs for each flow. | Mixed heterogeneous token types (SAML vs JWT). |
| subject_issuer | Client passes an extra subject_issuer parameter in the request. | Clean API; Works with “opaque” (non-JWT) tokens. | Non-standard parameter; Redundant for self-describing JWTs. | Opaque tokens or overlapping issuers. |
| Client Context | Server maps the client_id of the caller to a specific flow. | Highly secure; Enforces strict per-client policy. | High management overhead; Inflexible for multi-source clients. | Rigid, security-conscious B2B integrations. |
Implementation Strategy
Our implementation in ProviderIdTokenPostHandler uses Option 4: Client Context as the primary strategy:
- Database-Driven Configuration: A new column token_ex_type has been added to the auth_client_t table to specify the supported exchange type for each client.
ALTER TABLE auth_client_t ADD COLUMN token_ex_type VARCHAR(64);
- Supported Exchange Types:
  - msal: Microsoft Authentication Library based exchange.
  - ccac: Client Credentials to Authorization Code exchange.
- Flow Determination: Instead of relying on client-supplied parameters like requested_token_type, the server retrieves the token_ex_type from the client context in the database to decide which handler to use. This ensures that only authorized exchange types are performed for each specific client.
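The flow determination can be illustrated with a language-neutral dispatch sketch, written in JavaScript to match the rest of this document's examples. It is not the ProviderIdTokenPostHandler code. The handler bodies and the client row shape are hypothetical, and only the `token_ex_type` column and its `msal` and `ccac` values come from the text above.

```javascript
// Hypothetical handlers keyed by the token_ex_type stored for the client.
const exchangeHandlers = new Map([
  ["msal", (request) => ({ flow: "msal", subjectToken: request.subjectToken })],
  ["ccac", (request) => ({ flow: "ccac", subjectToken: request.subjectToken })],
]);

// Picks the handler from the client row loaded from auth_client_t.
// Client-supplied parameters such as requested_token_type are ignored here.
function selectExchangeHandler(clientRow, handlers) {
  const type = clientRow.token_ex_type;
  if (!type) {
    throw new Error("token exchange is not enabled for this client");
  }
  const handler = handlers.get(type);
  if (!handler) {
    throw new Error(`unsupported token_ex_type: ${type}`);
  }
  return handler;
}

const msalClient = { client_id: "c1", token_ex_type: "msal" };
const exchangeResult = selectExchangeHandler(msalClient, exchangeHandlers)({ subjectToken: "opaque-value" });
```

A client row without `token_ex_type` fails before any handler runs, which matches the rule that unconfigured clients cannot perform token exchange.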
Recommendation
For the light-portal ecosystem:
- Option 4: Client Context is the selected method. It provides the highest level of security by ensuring that token exchange flows are explicitly configured and restricted on a per-client basis in the database.
- token_ex_type should be populated for any client that requires token exchange functionality. Clients without this configuration will not be allowed to perform token exchange.
Future Considerations
- Implement automated issuer discovery if the number of external providers grows.
- Support “opaque” token exchange by integrating with introspection endpoints of external IdPs.
- Extend the auth_client_t configuration to support multiple allowed exchange types per client if needed.
OAuth Audit
The OAuth services keep authorization codes and refresh tokens as operational state. These rows are short lived and are now written directly to auth_code_t and auth_refresh_token_t instead of being created through the general event store. This avoids high-volume login and refresh-token churn in event_store_t and outbox_message_t.
Audit and login history are recorded separately in append-oriented OAuth audit tables.
Goals
- Show administrators who is currently online.
- Show a user the last login time and session history.
- Track refresh-token rotation and rejected refresh attempts.
- Preserve enough history for support and security review without storing raw secrets.
- Keep the hot login and token-refresh path simple and transactional.
Tables
auth_session_t stores one row per login session. It is the current and historical session summary.
- session_id identifies the browser/device session.
- login_ts, last_refresh_ts, logout_ts, and expires_ts describe the session lifetime.
- status is ACTIVE, LOGGED_OUT, EXPIRED, or REVOKED.
- refresh_count is incremented on each successful refresh-token rotation.
- ip_address, user_agent, and device_id are optional request context fields.
auth_session_audit_t stores append-only auth audit entries.
- LOGIN_SUCCEEDED
- LOGIN_FAILED
- AUTH_CODE_ISSUED
- AUTH_CODE_CONSUMED
- REFRESH_TOKEN_ISSUED
- REFRESH_TOKEN_ROTATED
- REFRESH_TOKEN_REJECTED
- LOGOUT
- SESSION_EXPIRED
- SESSION_REVOKED
auth_refresh_token_t.session_id links the currently valid refresh token to the session that owns it. This removes ambiguity when the same user is logged in from multiple browsers or devices.
Audit rows keep session_id as data, but do not use a hard foreign key to auth_session_t. Audit history must remain groupable by session even if operational session rows are later archived or removed.
Login Flow
When /oauth2/{providerId}/code authenticates the user:
- Insert the authorization code into auth_code_t.
- Insert an ACTIVE session into auth_session_t.
- Insert LOGIN_SUCCEEDED and AUTH_CODE_ISSUED audit rows.
- Include the session_id in the auth code row so the token exchange can attach the refresh token to the same session.
Failed logins write LOGIN_FAILED with the available host, provider, client, request metadata, and failure reason.
Authorization Code Exchange
When grant_type=authorization_code succeeds:
- Delete the consumed auth code from auth_code_t.
- Insert the refresh token into auth_refresh_token_t with the auth code’s session_id.
- Insert AUTH_CODE_CONSUMED and REFRESH_TOKEN_ISSUED audit rows.
Refresh Token Rotation
When grant_type=refresh_token succeeds, the service performs one transaction:
- Insert the replacement refresh token.
- Delete the previous refresh token with its expected aggregate version.
- Update auth_session_t.last_refresh_ts and increment refresh_count.
- Insert REFRESH_TOKEN_ROTATED with the old and new token ids.
If a refresh token is missing, invalid, or belongs to the wrong client, the service writes REFRESH_TOKEN_REJECTED when enough context is available. Raw refresh-token values must not be stored in audit metadata.
Admin Revocation
Administrators can kick out a user by revoking the user’s current refresh token. Operationally, deleting the refresh token is enough to stop the session from renewing once the current access token expires. The audit/session model adds explicit session state to that behavior.
The revocation operation must run as one transaction:
- Find the refresh token row and its session_id.
- Delete the refresh token from auth_refresh_token_t.
- Update auth_session_t:
  - status = 'REVOKED'
  - logout_ts = CURRENT_TIMESTAMP
  - end_reason = 'ADMIN_REVOKED'
- Insert SESSION_REVOKED into auth_session_audit_t.
The database patch provides revoke_auth_session_by_refresh_token(host_id, refresh_token, admin_user, reason) for this workflow. Admin screens should call the revoke operation instead of issuing a plain refresh-token delete when the intent is to kick out a logged-in user.
If the refresh token has no session_id, the operation still deletes the token and returns NULL. This preserves backward compatibility with refresh-token rows created before session tracking.
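The revocation workflow can be illustrated with a small Python function that mirrors the parameters of revoke_auth_session_by_refresh_token. This is a sketch against an in-memory SQLite schema, not the database patch itself: the actor and reason audit columns are hypothetical, and returning None for an unknown token is an assumption.

```python
import sqlite3

# Simplified, hypothetical table shapes for the demonstration.
SCHEMA = """
CREATE TABLE auth_session_t (
    host_id TEXT, session_id TEXT, status TEXT, logout_ts TEXT, end_reason TEXT
);
CREATE TABLE auth_refresh_token_t (
    host_id TEXT, refresh_token TEXT, session_id TEXT
);
CREATE TABLE auth_session_audit_t (
    host_id TEXT, session_id TEXT, event_type TEXT,
    actor TEXT, reason TEXT,             -- assumed audit columns
    event_ts TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def revoke_session_by_refresh_token(conn, host_id, refresh_token, admin_user, reason):
    """Revoke the session behind a refresh token; return its session_id or None."""
    with conn:  # all steps commit or roll back together
        row = conn.execute(
            "SELECT session_id FROM auth_refresh_token_t "
            "WHERE host_id = ? AND refresh_token = ?",
            (host_id, refresh_token),
        ).fetchone()
        if row is None:
            return None  # assumption: an unknown token is treated as a no-op
        session_id = row[0]
        conn.execute(
            "DELETE FROM auth_refresh_token_t WHERE host_id = ? AND refresh_token = ?",
            (host_id, refresh_token),
        )
        if session_id is None:
            return None  # legacy row created before session tracking
        conn.execute(
            "UPDATE auth_session_t SET status = 'REVOKED', "
            "logout_ts = CURRENT_TIMESTAMP, end_reason = 'ADMIN_REVOKED' "
            "WHERE host_id = ? AND session_id = ?",
            (host_id, session_id),
        )
        conn.execute(
            "INSERT INTO auth_session_audit_t "
            "(host_id, session_id, event_type, actor, reason) "
            "VALUES (?, ?, 'SESSION_REVOKED', ?, ?)",
            (host_id, session_id, admin_user, reason),
        )
    return session_id
```

The legacy branch still deletes the token before returning, which matches the backward-compatible behavior described above.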
Admin Queries
Current online users:
SELECT *
FROM auth_session_t
WHERE status = 'ACTIVE'
AND (expires_ts IS NULL OR expires_ts > CURRENT_TIMESTAMP);
User login history:
SELECT *
FROM auth_session_t
WHERE host_id = $1
AND user_id = $2
ORDER BY login_ts DESC;
Session duration:
SELECT
login_ts,
COALESCE(logout_ts, last_refresh_ts, CURRENT_TIMESTAMP) - login_ts AS duration
FROM auth_session_t
WHERE host_id = $1
AND session_id = $2;
Retention
auth_session_t can be retained longer than operational token tables. auth_session_audit_t should use a retention policy appropriate for the deployment, for example 90 days or one year. Retention jobs should delete audit rows by event_ts and optionally archive them before deletion.
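A retention job along these lines could look like the following sketch. It assumes event_ts is stored as an ISO-8601 UTC string with second precision and that a hypothetical auth_session_audit_archive_t table with the same columns exists; neither detail comes from the actual schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_audit_rows(conn, now, retention_days, archive=True):
    """Delete audit rows older than the retention window; return the deleted count."""
    cutoff = (now - timedelta(days=retention_days)).isoformat(timespec="seconds")
    with conn:  # archive and delete in the same transaction
        if archive:
            # Copy expiring rows to the (hypothetical) archive table first.
            conn.execute(
                "INSERT INTO auth_session_audit_archive_t "
                "SELECT * FROM auth_session_audit_t WHERE event_ts < ?",
                (cutoff,),
            )
        cur = conn.execute(
            "DELETE FROM auth_session_audit_t WHERE event_ts < ?", (cutoff,)
        )
    return cur.rowcount
```

Running the archive copy and the delete in one transaction avoids losing rows if the job fails midway. For example, purge_audit_rows(conn, datetime.now(timezone.utc), 90) applies a 90-day policy.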
Multi-Tenant
Database Schema
Adding a host_id to every table is one approach, but it does lead to composite primary keys and can impact performance. Using UUIDs as primary keys, even in a multi-tenant environment, is another viable option with its own set of trade-offs. Let’s examine both strategies:
- Host ID on Every Table (Composite Primary Keys)
Schema: Each table would have a host_id column, and the primary key would be a combination of host_id and another unique identifier (e.g., user_id, endpoint_id).
CREATE TABLE user_t (
host_id UUID NOT NULL, -- References hosts table
user_id INT NOT NULL,
-- ... other columns
PRIMARY KEY (host_id, user_id),
FOREIGN KEY (host_id) REFERENCES hosts_t(host_id)
);
Pros:
- Data Isolation: Clear separation of data at the database level. Easy to query data for a specific tenant.
- Backup/Restore: Simplified backup and restore procedures for individual tenants.
Cons:
- Composite Primary Keys: Can lead to more complex queries, especially joins, as you always need to include the host_id. Can affect query optimizer performance.
- Storage Overhead: host_id is repeated in every row of every table, adding storage overhead.
- Index Impact: Composite indexes can sometimes be less efficient than single-column indexes.
- UUIDs as Primary Keys (Shared Tables)
Schema: Tables use UUIDs as primary keys. A separate table (tenant_resources_t) maps UUIDs to tenants.
CREATE TABLE user_t (
user_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
-- ... other columns
);
CREATE TABLE tenant_resources_t (
host_id UUID NOT NULL,
resource_type varchar(255) NOT NULL, --e.g., 'user', 'api_endpoint'
resource_id UUID NOT NULL,
PRIMARY KEY(host_id, resource_type, resource_id),
FOREIGN KEY (host_id) REFERENCES hosts_t(host_id)
);
Pros:
- Simplified Primary Keys: Easier to manage single-column UUID primary keys. Simpler joins.
- Reduced Storage Overhead: No need to repeat host_id in every table.
- Application Logic: Multi-tenancy is handled mostly in the application logic by querying tenant_resources_t to ensure a user belongs to the correct tenant, adding a layer of flexibility. (This is also a con if not carefully implemented.)
Cons:
- Data Isolation (slightly reduced): Data is logically separated but resides in shared tables. Robust application logic is essential to prevent data leakage between tenants.
- Backup/Restore (more complex): Backing up or restoring a single tenant requires filtering based on the tenant_resources_t table.
- Query Performance (potential issue): Queries might require joining with the tenant_resources_t table, which adds some overhead. Proper indexing and query optimization become crucial.
- Hybrid Approach (Recommended in many cases)
- Combine the strengths of both approaches.
- Use UUIDs as primary keys for most tables for simplicity.
- Add host_id to tables where data isolation is paramount (e.g., tables containing sensitive financial or personal data) or where frequent tenant-specific queries are performed. This provides a more granular level of control and allows for optimized queries when needed.
- Use the tenant_resources_t table to maintain an overview of which resources belong to which tenant, supporting the application logic’s multi-tenancy enforcement.
Choosing the Right Strategy:
- Small Number of Tenants, Low Data Volume: UUIDs might be sufficient.
- Strict Data Isolation Requirements, Frequent Tenant-Specific Queries, High Data Volume: host_id on every relevant table or a hybrid approach is often preferable.
- Flexibility in Multi-Tenancy Implementation: Favors the UUID approach or the hybrid approach, with logic predominantly handled in the application layer.
Additional Considerations:
- Database Support: Check whether the database (PostgreSQL) has specific features for multi-tenancy, such as row-level security.
- Indexing: Carefully plan indexes for optimal performance in any multi-tenant scenario.
- Schema Design: Consider creating views for tenant-specific data access to simplify queries in the application layer.
- Security: Implement robust security measures to prevent data leakage between tenants, regardless of the chosen approach.
The hybrid approach usually provides the best balance between performance, data isolation, and development complexity in real-world multi-tenant applications. You have more direct control where strict isolation is required and can maintain the simplicity of UUIDs where it’s beneficial. Using tenant_resources_t for application logic enforcement offers flexibility and a central point of management for tenant resource association. This approach also prepares the application for potential future expansion and different multi-tenancy demands as the application grows.
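To make the application-level enforcement concrete, the sketch below joins a shared user_t table against tenant_resources_t so that a row is only returned when the mapping ties it to the caller's tenant. The email column and the 'user' resource type value are assumptions for illustration, and SQLite stands in for PostgreSQL.

```python
import sqlite3


def fetch_user_for_tenant(conn, host_id, user_id):
    """Return the user row only if tenant_resources_t maps it to host_id."""
    return conn.execute(
        """
        SELECT u.user_id, u.email
        FROM user_t u
        JOIN tenant_resources_t tr
          ON tr.resource_id = u.user_id AND tr.resource_type = 'user'
        WHERE tr.host_id = ? AND u.user_id = ?
        """,
        (host_id, user_id),
    ).fetchone()
```

A lookup with the wrong host_id returns None even though the user row exists, which is the behavior that prevents cross-tenant leakage when every query goes through a helper like this.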
Should host_id Be Part of the PK?
This is a classic multi-tenancy design question. Both approaches have implications. Let’s analyze them:
Option 1: Current Approach - PK (host_id, instance_id)
- Pros:
  - Excellent for Tenant-Specific Queries: This is the biggest advantage. Queries like SELECT ... FROM instance_t WHERE host_id = ? AND ... or SELECT ... FROM instance_t WHERE host_id = ? ORDER BY instance_id can directly and efficiently use the primary key index. The index is naturally ordered by tenant first, then by instance within that tenant. This improves data locality for a specific tenant’s data.
  - Clear Logical Grouping: The primary key explicitly represents the concept that an instance belongs to a specific host (tenant).
  - Enforces Uniqueness Per Tenant: Guarantees that instance_id is unique within a given host_id (although UUIDv7 makes global collisions highly unlikely anyway).
- Cons:
  - Wider Primary Key: The PK is 32 bytes (16+16).
  - Wider Foreign Keys: Any table referencing instance_t would need both host_id and instance_id as its foreign key columns.
  - Slightly Larger Secondary Indexes: Other indexes on instance_t will implicitly include both PK columns, making them slightly larger than if the PK was just 16 bytes.
Option 2: Alternative - PK (instance_id)
- Pre-requisite: This only works if your application guarantees that instance_id is globally unique across all hosts/tenants. Given you’re using UUIDv7, this is a safe assumption in practice, but the schema wouldn’t enforce uniqueness per host explicitly via the PK itself.
- Pros:
  - Narrower Primary Key: The PK is only 16 bytes.
  - Simpler Foreign Keys: Tables referencing instance_t only need a single instance_id column for the foreign key.
  - Slightly Smaller Secondary Indexes: Other indexes on the table will be marginally smaller.
- Cons:
  - Requires Separate Index for Tenant Queries: You would absolutely need a separate index on (host_id, instance_id) (or at least (host_id)) for efficient tenant-specific queries (WHERE host_id = ?). Without it, querying for a specific tenant’s data would require less efficient scans. This index would likely be a UNIQUE index anyway to enforce the logical relationship: CREATE UNIQUE INDEX instance_t_host_instance_idx ON instance_t (host_id, instance_id);
  - Potential Reduced Locality: While the separate index helps, the primary key index itself (based only on instance_id) might interleave data from different tenants physically, potentially slightly reducing cache efficiency for queries scanning many instances for a single tenant compared to the composite PK approach.
Recommendation:
Stick with the composite primary key: PRIMARY KEY(host_id, instance_id).
Reasoning:
- Performance for Core Use Case: In multi-tenant systems, filtering by the tenant identifier (host_id) is almost always the primary access pattern. Having host_id as the leading column in the PK index directly optimizes this critical path.
- Index Necessity: Even if you chose instance_id as the sole PK, you would still need to create an index on (host_id, instance_id) for performance. Making this essential index the primary key index is often the most straightforward and efficient approach.
- Clarity: The composite key clearly reflects the logical relationship and ownership.
- Cost: The “cost” of a 32-byte PK vs. a 16-byte PK is often negligible compared to the performance gains achieved by aligning the PK index with the dominant query patterns in a multi-tenant architecture. The impact on FKs and secondary indexes is real but usually acceptable.
Using just instance_id as the PK prioritizes global uniqueness and FK simplicity over optimizing tenant-specific queries directly via the PK index. In most multi-tenant scenarios, optimizing tenant queries is more important.
Citus PostgreSQL Extension
Citus, an open-source extension that turns PostgreSQL into a distributed database, can be very helpful in scaling a multi-tenant application, especially if you anticipate significant data growth and high query loads. Here is how Citus can fit into this use case and the factors to consider:
How Citus Helps:
- Horizontal Scalability: Citus allows you to distribute the data across multiple PostgreSQL nodes (servers), enabling horizontal scaling. This is crucial for handling increasing data volumes and query loads in a multi-tenant environment.
- Improved Query Performance: By distributing data and queries, Citus can significantly improve the performance of many types of queries, especially analytical queries that operate on large datasets. This is particularly beneficial if we have tenants with substantially different data volumes or query patterns.
- Shard Placement by Tenant: One of the most effective ways to use Citus for multi-tenancy is to shard the data by host_id (or a tenant ID). This means that all data for a given tenant resides on the same shard (a subset of the distributed database). This allows for efficient tenant isolation and simplifies queries for tenant-specific data.
- Simplified Multi-Tenant Queries: When sharding by tenant, queries that filter by host_id become very efficient because Citus can route them directly to the appropriate shard. This eliminates the need for expensive scans across the entire database.
- Flexibility: Citus supports various sharding strategies, allowing you to choose the best approach for the data and query patterns. You can even use a hybrid approach, distributing some tables while keeping others replicated across all nodes for faster access to shared data.
Example (Sharding by Tenant):
Create a distributed table: When creating tables (e.g., user_t, api_endpoint_t, etc.), we would declare them as distributed tables in Citus, using the host_id as the distribution column:
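CREATE TABLE user_t (
host_id UUID NOT NULL,
user_id UUID NOT NULL DEFAULT uuid_generate_v4(),
-- ... other columns
PRIMARY KEY (host_id, user_id) -- the distribution column must be part of the primary key
);
-- Citus distributes an existing table through a function call, hash-partitioned on host_id.
SELECT create_distributed_table('user_t', 'host_id');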
Querying: When querying data for a specific tenant, include the host_id in the WHERE clause:
SELECT * FROM user_t WHERE host_id = 'your-tenant-id';
Citus will automatically route this query to the shard containing the data for that tenant, resulting in much faster query execution.
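The routing idea can be illustrated with a toy hash function. This is a conceptual sketch only: Citus uses its own hash function and assigns hash ranges to shards, so the SHA-256 hashing, the modulo step, and the shard count of 32 below are all assumptions chosen for the example.

```python
import hashlib


def shard_for(host_id: str, shard_count: int = 32) -> int:
    """Map a tenant id to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into the shard range.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Because the mapping depends only on host_id, every row and every query for one tenant lands on the same shard, which is why a WHERE host_id = ... filter lets the coordinator contact a single node.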
Citus Cost:
- Citus Open Source: The Citus extension is open source and free to use. It is installed separately from core PostgreSQL, and we can self-host and manage it.
- Azure Cosmos DB for PostgreSQL (Managed Citus): Microsoft offers a fully managed cloud service called Azure Cosmos DB for PostgreSQL, which is built on Citus. This service has usage-based pricing, and the cost depends on factors like the number of nodes, storage, and compute resources used. This managed option reduces the operational overhead of managing Citus yourself.
Recommendation:
Don’t automatically add host_id to every table just because we are using Citus. Carefully analyze the data model, query patterns, and multi-tenancy requirements.
- Distribute tables by host_id (tenant ID) when data locality and isolation are paramount, and we want to optimize tenant-specific queries.
- Consider replicating smaller, frequently joined tables (Citus reference tables) to avoid cross-shard joins and host_id overhead.
- Use a central mapping table (tenant_resources_t) to manage tenant-resource associations and enforce multi-tenancy rules in the application logic where appropriate.
This more nuanced approach provides a balance between the benefits of distributed data with Citus and avoiding unnecessary complexity or performance overhead from overusing host_id. Choose the Citus deployment model (self-hosted open source or managed cloud service) that best suits our needs and budget.
Primary Key Considerations in a Distributed Citus Environment
When a table includes host_id (due to sharding requirements), it is important to include host_id as part of the primary key. This ensures proper functioning and optimization within the Citus distributed database.
- Distribution Column Requirement: In Citus, the distribution column (e.g., host_id) must be part of the primary key. This is essential for routing queries and distributing data correctly across shards.
- Uniqueness Enforcement:
  - The primary key enforces uniqueness across the entire distributed database.
  - For example, if user_id is unique only within a tenant (host), then (host_id, user_id) is required as the primary key to ensure uniqueness across all shards.
- Data Locality and Co-location: Including host_id in the primary key ensures that all rows for the same tenant (identified by the same host_id) are stored together on a single shard. This provides:
  - Efficient Joins: Joins between tables related to the same tenant can be performed locally on a single shard, avoiding expensive cross-shard data transfers.
  - Optimized Queries: Queries filtering by host_id are efficiently routed to the appropriate shard.
- Referential Integrity: If other tables reference the user_t table and are also distributed by host_id, including host_id in the primary key of user_t is essential to maintain referential integrity across shards.
Multi-Host User Session Management
In a multi-host environment where multiple hosts reside on the same server, users must associate with one host at a time. The session management is handled as follows:
- Host Association on Login:
  - Once a user logs in, a host cookie is returned, derived from the JWT token.
  - The user’s session defaults to the associated host in the cookie.
- Switching Hosts:
  - If a user wishes to switch to another host, they can:
    - Access the User Menu to select a different host.
    - Log out of the current session.
    - During the next login, the session will be tied to the newly selected host.
- Host in API Requests:
  - For all API requests sent to the server, the host is typically included as part of the request payload.
  - For logged-in users, the host is in the JWT token as a custom claim.
  - For guest users, the default host is used until the user is signed in.
  - This ensures proper routing and handling of requests in a multi-host environment.
By associating users to a specific host for each session, this approach ensures clear separation of data and responsibilities across hosts, while providing users the flexibility to switch hosts as needed.
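The host selection rule for API requests can be sketched in a few lines. The function name and the default_host parameter are hypothetical; the host claim name follows the custom claim used elsewhere in this document.

```python
def resolve_host(jwt_claims, default_host):
    """Pick the host for a request: the JWT custom claim if present, else the default."""
    if jwt_claims and jwt_claims.get("host"):
        return jwt_claims["host"]  # logged-in user: host comes from the token
    return default_host            # guest user: fall back to the default host
```

Passing None or an empty claim set models a guest user who has not signed in yet.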
Event Header
Because the portal is based on event sourcing, events are responsible for populating the database, so they need to be separated by host_id as well. The event header has a unique id that is generated when the event is created. It also carries the host_id and user_id in the EventId, which is included in every event.
Reference and Shared Tables
In an application, some data is shared by all tenants, for example the dropdown options on the UI and business validation. We call this reference data and have defined several tables to manage it centrally. For each reference data type, there is a logical table defined in ref_table_t and marked as common or not. Common means the table can be shared with other tenants; otherwise, it is private to the owner tenant.
Some other entities are very similar but do not fit into the reference tables. For example, the category_t table contains all the category definitions for different entities. These tables are designed with an optional host_id. Here is an example.
CREATE TABLE category_t (
category_id VARCHAR(22) NOT NULL, -- unique id to identify the category
host_id VARCHAR(22), -- null means global category
entity_type VARCHAR(50) NOT NULL, -- the type of entity the category applies to
category_name VARCHAR(126) NOT NULL, -- category name, must be URL friendly
category_desc VARCHAR(1024) NOT NULL, -- description
parent_category_id VARCHAR(22) REFERENCES category_t(category_id) ON DELETE SET NULL, -- parent category id, null if there is no parent.
sort_order INT DEFAULT 0, -- sort order on the UI
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (category_id)
);
-- 1. Unique index for GLOBAL categories (where host_id IS NULL)
-- Ensures uniqueness of (entity_type, category_name, parent_category_id) ONLY when host_id is NULL
CREATE UNIQUE INDEX idx_category_unique_global
ON category_t (entity_type, category_name, parent_category_id)
NULLS NOT DISTINCT -- Handles NULLs in parent_category_id correctly
WHERE host_id IS NULL;
-- 2. Unique index for TENANT-SPECIFIC categories (where host_id IS NOT NULL)
-- Ensures uniqueness of (host_id, entity_type, category_name, parent_category_id)
-- for rows that belong to a specific host.
CREATE UNIQUE INDEX idx_category_unique_tenant
ON category_t (host_id, entity_type, category_name, parent_category_id)
NULLS NOT DISTINCT -- Handles NULLs in parent_category_id correctly
WHERE host_id IS NOT NULL;
CREATE INDEX idx_category_entity_type ON category_t (entity_type);
CREATE INDEX idx_category_parent ON category_t (parent_category_id);
CREATE INDEX idx_category_name ON category_t (category_name);
CREATE INDEX idx_category_host_id ON category_t (host_id);
On the UI, the host_id is auto-populated from the user’s associated host and shown in read-only mode. The form has an “Is Global Category” checkbox. If it is checked, the backend service applies an FGA rule to ensure that the user is an admin, and the host_id is removed from the event. This works for both create and update.
When viewing categories, the super admin might see all categories by default, possibly with a column or indicator showing the host_id (or “Global”). Filters should allow viewing global only, or a specific tenant’s categories.
Tenant Admin / Host Owner:
When a tenant admin accesses the category management UI, their context is fixed to their own host_id.
They should only be able to create/edit categories associated with their specific host_id.
The UI should not offer them the option to create/edit global categories or categories for other hosts. The host_id is implicitly set or displayed as read-only based on their logged-in context.
When viewing categories, they should see their own tenant-specific categories plus all applicable global categories. The UI should clearly differentiate between these (e.g., using grouping, labels, icons).
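The visibility rules for both roles can be sketched as a simple filter over category rows. The dictionary shape, the is_super_admin flag, and the helper names are assumptions for illustration; a real implementation would more likely push the filter into the SQL query.

```python
def visible_categories(categories, host_id, is_super_admin=False):
    """Return the categories a user may see: own-tenant rows plus global rows."""
    if is_super_admin:
        return list(categories)  # super admin sees everything by default
    return [c for c in categories if c["host_id"] is None or c["host_id"] == host_id]


def scope_label(category):
    """Label used by the UI to tell global and tenant-specific rows apart."""
    return "Global" if category["host_id"] is None else category["host_id"]
```

A tenant admin for host h1 would see rows with host_id h1 and rows with a null host_id, but never rows that belong to another host.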
System Integration
System integrations must preserve the same identity and tenant boundaries as interactive portal workflows. The integration token is not only an access credential; it is also the source of audit metadata, event metadata, row filtering, and host scoping.
Command Side
The command side uses event sourcing. Every accepted command writes one or more domain events, and those events are later projected into query-side tables. Because events become the durable system of record, command calls need a stable user identity and host identity.
For command APIs, use an authorization code token whenever possible. The token
must contain the real portal user id so command handlers can derive the correct
userId, host, nonce, and CloudEvent metadata. This is the preferred path for
browser flows, operator tools, and integrations that can act on behalf of a
known user.
If the integration has no user session in the request context, do not submit anonymous command events. First onboard a real user in the system for the integration actor or service account. That user becomes the durable audit principal for the commands emitted by the integration.
After the user is onboarded, create an auth client for the integration and set custom claims that carry the command identity:
{
"host": "<host-id>",
"elm": "<integration-user-email>",
"uid": "<integration-user-id>",
"uty": "<user-type>"
}
The uid claim must reference the onboarded user. The host claim must match
the tenant boundary where commands are allowed to run. The elm and uty
claims should match the onboarded user’s email and user type so downstream
authorization, audit, and support workflows can identify the actor without
guessing.
For an integration auth client whose type is trusted, the client application can
call Light OAuth with the client_credentials grant type when the auth client
has these custom claims configured. Light OAuth issues a token that carries the
custom claims, allowing the token to act as an id-token-like access token for
command APIs, similar to the user-bearing token produced by the authorization
code grant. This path is only acceptable for trusted client types because the
client, not an interactive browser session, is asserting the user and host
identity through the auth client configuration.
Command-side integration rules:
- Prefer an authorization code token tied to the real interactive user.
- Use a dedicated onboarded integration user only when no user session exists.
- For non-session integrations, use a trusted auth client with custom claims and request the token from Light OAuth with grant_type=client_credentials.
- Do not use a token that lacks a usable userId/uid for event-sourced commands.
- Do not allow non-trusted clients to mint user-bearing command tokens from client credentials.
- Keep host ownership explicit; never infer host scope from the client id alone.
- Treat the auth client and custom claims as deployment configuration, not as a substitute for user onboarding.
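Several of these rules can be expressed as a small pre-check that runs before a command is accepted. This is a sketch under assumptions: the function name, the grant_type and client_type parameters, and the violation messages are hypothetical, and it assumes the handler already knows how the token was issued.

```python
def command_token_violations(claims, grant_type, client_type, target_host):
    """Return a list of rule violations; an empty list means the token is usable."""
    violations = []
    if not claims.get("uid"):
        violations.append("missing uid")  # no durable audit principal
    if not claims.get("host"):
        violations.append("missing host")  # host scope must be explicit
    elif claims["host"] != target_host:
        violations.append("host mismatch")
    if grant_type == "client_credentials" and client_type != "trusted":
        violations.append("client_credentials requires a trusted client")
    return violations
```

Returning every violation rather than failing on the first one makes rejected commands easier to diagnose in support workflows.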
Query Side
The query side serves read models built from command-side events and operational tables. Query APIs do not create domain events, do not allocate command nonces, and should not mutate event-sourced state.
Query integrations still need authorization and tenant scoping. The request token must provide enough identity to determine the host and the effective user or service account. For user-scoped reads, use the same authorization code token or integration-user token described for the command side so row and column filters can apply consistently.
If authorization code flow is not available for a query integration, the
client_credentials flow is acceptable only for auth clients whose type is
trusted. The token must carry host, sid, and, when environment-specific data
is requested, env. Here sid is the service id for the gateway, agent, or
other Light-Fabric runtime calling portal-query. Query handlers must compare
these claims with the requested hostId, serviceId, and optional envTag
before returning service-scoped data.
Light-Fabric ecosystem components such as gateways and agents may use a
long-lived token for query-side access when the token was issued through this
trusted client_credentials path. That access is not general portal read
access. It is limited to query endpoints built for those runtime components,
such as gateway, agent, discovery, or catalog endpoints, and those endpoints
must enforce the claim match before returning data.
For host-scoped or service-level reads, a client token can be used only when the auth client type is trusted and the token carries the required host, service, and environment claims. The query service should apply the same host boundary as the command side and return only data visible to that actor. A missing user session may reduce the allowed result set, but it must not broaden access.
Query-side integration rules:
- Read from projected/query tables; do not write command events from query handlers.
- Resolve host scope from the validated token claims and request parameters.
- When authorization code flow is unavailable, accept client_credentials only from auth clients whose type is trusted.
- Require host and sid token claims; require env when the endpoint or request is environment-scoped.
- Match token host, sid, and optional env to requested hostId, serviceId, and optional envTag.
- Allow long-lived Light-Fabric runtime tokens only on endpoints designed for gateways, agents, and similar ecosystem components.
- Do not use long-lived runtime tokens for broad user-facing query access.
- Apply user, role, position, group, attribute, and fine-grained filters when the endpoint requires them.
- Use the onboarded integration user for auditability when a human user is not present.
- Keep query tokens least-privileged; read-only integrations should not receive command scopes.
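The claim match that query handlers perform can be sketched as a single predicate. The function name is hypothetical; the claim names (host, sid, env) and request parameters (hostId, serviceId, envTag) follow the text above.

```python
def query_claims_match(claims, host_id, service_id, env_tag=None):
    """Check token claims against the requested hostId, serviceId, and optional envTag."""
    host, sid = claims.get("host"), claims.get("sid")
    if not host or not sid:
        return False  # host and sid are always required
    if host != host_id or sid != service_id:
        return False
    if env_tag is not None:
        # Environment-scoped request: the env claim must be present and equal.
        return claims.get("env") == env_tag
    return True
```

A token without an env claim passes only when the request is not environment-scoped, so a missing claim can narrow access but never broaden it.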
Portal Event
Light Portal uses event sourcing and CQRS. Any update to the system generates an event, and there are hundreds of event types.
All events are in Avro format and are pushed to a Kafka cluster for stream processing. Each event has an EventId that contains the common information for events; its schema resides in the light-kafka repo.
Here is one of the events in light-portal.
{
"type": "record",
"name": "ApiRuleCreatedEvent",
"namespace": "net.lightapi.portal.market",
"fields": [
{
"name": "EventId",
"type": {
"type": "record",
"name": "EventId",
"namespace": "com.networknt.kafka.common",
"fields": [
{
"name": "id",
"type": "string",
"doc": "a unique identifier"
},
{
"name": "nonce",
"type": "long",
"doc": "the number of the transactions for the user"
},
{
"name": "timestamp",
"type": "long",
"default": 0,
"doc": "time the event is recorded"
},
{
"name": "derived",
"type": "boolean",
"default": false,
"doc": "indicate if the event is derived from event processor"
}
]
}
},
{
"name": "hostId",
"type": "string",
"doc": "host id"
},
{
"name": "apiId",
"type": "string",
"doc": "api id"
},
{
"name": "ruleIds",
"type": {
"type": "array",
"items": "string"
},
"doc": "one or many rule ids that link to the apiId"
}
]
}
Kafka Key
When pushing events into a Kafka topic, the record key is used to distribute records across Kafka partitions. Here is the key selection for the system.
- Multi-tenant: the key is the hostId.
- Single-tenant: the key is the userId.
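The key selection can be written as a one-line rule. The function name and the multi_tenant flag are hypothetical; how the deployment mode is configured is not specified here.

```python
def kafka_record_key(host_id: str, user_id: str, multi_tenant: bool) -> bytes:
    """Choose the Kafka record key: hostId when multi-tenant, userId otherwise."""
    key = host_id if multi_tenant else user_id
    return key.encode("utf-8")  # Kafka keys are sent as bytes
```

Since Kafka assigns records with the same key to the same partition, this keeps all events for one host (or one user) in order on a single partition.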
Promotion or Replay
Promotion approaches
- When promoting from dev to sit, we can export all events from dev, update the event JSON file, and then replay it to sit.
- We can import the original event JSON from dev to sit and then update some of the events on the sit host.
Promotable Event Type
There are two types of events: configurable events and transactional events. We should only promote configurable events from dev to sit, not transactional ones such as the deployment logs. We need a table to define the promotable event types.
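A promotion step built on such a table might filter exported events and rewrite the host before replay. This is only a sketch: the event dictionary shape with a type key, the function name, and the choice to rewrite hostId are assumptions, and promotable_types stands in for the contents of the proposed table.

```python
def prepare_promotion(events, promotable_types, target_host_id):
    """Keep only promotable events and retarget them to the destination host."""
    promoted = []
    for event in events:
        if event["type"] not in promotable_types:
            continue  # skip transactional events such as deployment logs
        copy = dict(event)          # leave the exported event untouched
        copy["hostId"] = target_host_id
        promoted.append(copy)
    return promoted
```

Working on copies keeps the original dev export intact, so the same file can be replayed to another environment later.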
Reference Table
When building a web application, there would be a lot of dropdown selects in forms. The form itself only cares about the id and label list to render the form and only the id will be submitted to the backend API for single select and several ids for multiple select.
To save the effort of creating many similar tables, we can create a set of tables for all dropdowns. For some reference tables, the dropdown should be the same across all hosts, and we can set the common flag to ‘Y’ so that they are shared by all hosts. If the dropdown values might differ between hosts, we can create a reference table per host and link the reference table to the host in a separate table that supports sharding.
Reference Schema
CREATE TABLE ref_table_t (
table_id VARCHAR(22) NOT NULL, -- UUID generated by Util
table_name VARCHAR(80) NOT NULL, -- Name of the ref table for lookup.
table_desc VARCHAR(1024) NULL,
active CHAR(1) NOT NULL DEFAULT 'Y', -- Only active table returns values
editable CHAR(1) NOT NULL DEFAULT 'Y', -- Table value and locale can be updated via ref admin
common CHAR(1) NOT NULL DEFAULT 'Y', -- The drop down shared across hosts
PRIMARY KEY(table_id)
);
-- Created after ref_table_t because it references ref_table_t (table_id).
CREATE TABLE ref_host_t (
table_id VARCHAR(22) NOT NULL,
host_id VARCHAR(22) NOT NULL,
PRIMARY KEY (table_id, host_id),
FOREIGN KEY (table_id) REFERENCES ref_table_t (table_id) ON DELETE CASCADE,
FOREIGN KEY (host_id) REFERENCES host (host_id) ON DELETE CASCADE
);
CREATE TABLE ref_value_t (
value_id VARCHAR(22) NOT NULL,
table_id VARCHAR(22) NOT NULL,
value_code VARCHAR(80) NOT NULL, -- The dropdown value
start_time TIMESTAMP NULL,
end_time TIMESTAMP NULL,
display_order INT, -- for editor and dropdown list.
active VARCHAR(1) NOT NULL DEFAULT 'Y',
PRIMARY KEY(value_id),
FOREIGN KEY (table_id) REFERENCES ref_table_t (table_id) ON DELETE CASCADE
);
CREATE TABLE value_locale_t (
value_id VARCHAR(22) NOT NULL,
language VARCHAR(2) NOT NULL,
value_desc VARCHAR(256) NULL, -- The drop label in language.
PRIMARY KEY(value_id,language),
FOREIGN KEY (value_id) REFERENCES ref_value_t (value_id) ON DELETE CASCADE
);
CREATE TABLE relation_type_t (
relation_id VARCHAR(22) NOT NULL,
relation_name VARCHAR(32) NOT NULL, -- The lookup keyword for the relation.
relation_desc VARCHAR(1024) NOT NULL,
PRIMARY KEY(relation_id)
);
CREATE TABLE relation_t (
relation_id VARCHAR(22) NOT NULL,
value_id_from VARCHAR(22) NOT NULL,
value_id_to VARCHAR(22) NOT NULL,
active VARCHAR(1) NOT NULL DEFAULT 'Y',
PRIMARY KEY(relation_id, value_id_from, value_id_to),
FOREIGN KEY (relation_id) REFERENCES relation_type_t (relation_id) ON DELETE CASCADE,
FOREIGN KEY (value_id_from) REFERENCES ref_value_t (value_id) ON DELETE CASCADE,
FOREIGN KEY (value_id_to) REFERENCES ref_value_t (value_id) ON DELETE CASCADE
);
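To show how a form could load its options from these tables, the sketch below joins ref_table_t, ref_value_t, and value_locale_t for one table name and language. It uses Python's sqlite3 module for illustration; treating value_code as the submitted id and the function name itself are assumptions, not the portal's actual API.

```python
import sqlite3


def dropdown_options(conn, table_name, language):
    """Return [{'id': ..., 'label': ...}] for an active reference table."""
    rows = conn.execute(
        """
        SELECT v.value_code, l.value_desc
        FROM ref_table_t t
        JOIN ref_value_t v ON v.table_id = t.table_id
        JOIN value_locale_t l ON l.value_id = v.value_id
        WHERE t.table_name = ?
          AND l.language = ?
          AND t.active = 'Y'
          AND v.active = 'Y'
        ORDER BY v.display_order
        """,
        (table_name, language),
    ).fetchall()
    # The form only needs an id to submit and a label to display.
    return [{"id": code, "label": label} for code, label in rows]
```

An unknown table name or a language without translations simply yields an empty list, which a form can render as an empty dropdown.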
Authentication & Authorization
Light-Portal is a single-page application (SPA) that utilizes both the OAuth 2.0 Authorization Code and Client Credentials flows.
The following pattern illustrates the end-to-end process recommended by the Light Platform for an SPA interacting with downstream APIs.
Sequence Diagram
sequenceDiagram
participant PortalView as Portal View
participant LoginView as Login View
participant Gateway as Light Gateway
participant OAuthKafka as OAuth-Kafka
participant AuthService as Auth Service
participant ProxySidecar as Proxy Sidecar
participant BackendAPI as Backend API
PortalView ->> LoginView: 1. Signin redirect
LoginView ->> OAuthKafka: 2. Authenticate user
OAuthKafka ->> AuthService: 3. Authenticate User<br/>(Active Directory<br/>for Employees)<br/>(CIF System<br/>for Customers)
AuthService ->> OAuthKafka: 4. Authenticated
OAuthKafka ->> OAuthKafka: 5. Generate auth code
OAuthKafka ->> PortalView: 6. Redirect with code
PortalView ->> Gateway: 7. Authorization URL<br/>with code param
Gateway ->> OAuthKafka: 8. Create JWT access<br/>token with code
OAuthKafka ->> OAuthKafka: 9. Generate JWT<br/>access token<br/>with user claims
OAuthKafka ->> Gateway: 10. Token returns<br/>to Gateway
Gateway ->> PortalView: 11. Token returns<br/>to Portal View<br/>in Secure Cookie
PortalView ->> Gateway: 12. Call Backend API
Gateway ->> Gateway: 13. Verify the token
Gateway ->> OAuthKafka: 14. Create Client<br/>Credentials token
OAuthKafka ->> OAuthKafka: 15. Generate Token<br/>with Scopes
OAuthKafka ->> Gateway: 16. Return the<br/>scope token
Gateway ->> Gateway: 17. Add scope<br/>token to<br/>X-Scope-Token<br/>Header
Gateway ->> ProxySidecar: 18. Invoke API
ProxySidecar ->> ProxySidecar: 19. Verify<br/>Authorization<br/>token
ProxySidecar ->> ProxySidecar: 20. Verify<br/>X-Scope-Token
ProxySidecar ->> ProxySidecar: 21. Fine-Grained<br/>Authorization
ProxySidecar ->> BackendAPI: 22. Invoke<br/>business API
BackendAPI ->> ProxySidecar: 23. Business API<br/>response
ProxySidecar ->> ProxySidecar: 24. Fine-Grained<br/>response filter
ProxySidecar ->> Gateway: 25. Return response
Gateway ->> PortalView: 26. Return response
1. When a user visits the website to access the single-page application (SPA), the Light Gateway serves the SPA to the user’s browser. Each single-page application has a dedicated Light Gateway instance that acts as a backend for frontend (BFF). By default, the user is not logged in and can only access limited site features. To unlock additional features, the user can click the User button in the header and select the Sign In menu. This action redirects the browser from the Portal View to the Login View, both served by the same Light Gateway instance.
2. On the Login View page, the user can either enter a username and password or choose Google/Facebook for authentication. When the login form is submitted, the request is sent to the Light Gateway with the user’s credentials. The Gateway forwards this request to the OAuth Kafka service.
3. OAuth Kafka supports multiple authenticator implementations to verify user credentials. Examples include authenticating via the Light Portal user database, Active Directory for employees, or a CIF service for customers.
4. Once authentication completes successfully, the authenticator responds to OAuth Kafka with the authentication result.
5. Upon successful authentication, OAuth Kafka generates an authorization code (a UUID associated with the user’s profile).
6. OAuth Kafka redirects the browser back to the Portal View, via the Gateway, with the authorization code.
7. Since the Portal View SPA lacks a dedicated redirect route for the authorization code, the browser sends the code as a query parameter in a request to the Gateway.
8. The StatelessAuthHandler in the Gateway processes this request, initiating a token request to OAuth Kafka to obtain a JWT access token.
9. OAuth Kafka generates an access token containing user claims in its custom JWT claims. The authorization code is then invalidated, as it is single-use.
10. The access token is returned to the Gateway.
11. The StatelessAuthHandler in the Gateway stores the access token in a secure cookie and sends it back to the Portal View.
12. When the Portal View SPA makes requests to backend APIs, it includes the secure cookie in the API request sent to the Gateway.
13. The StatelessAuthHandler in the Gateway validates the token in the secure cookie and places it in the Authorization header of the outgoing request.
14. If the token is successfully validated, the TokenHandler in the Gateway makes a request to OAuth Kafka for a client credentials token, using the path prefix of the API endpoint.
15. OAuth Kafka generates a client credentials token with the appropriate scope for accessing the downstream service.
16. The client credentials token is returned to the Gateway.
17. The TokenHandler in the Gateway inserts this token into the X-Scope-Token header of the original request.
18. The Gateway routes the original request, now containing both tokens, to the downstream proxy sidecar of the backend API.
19. The proxy sidecar validates the Authorization token, verifying its signature, expiration, and other attributes.
20. The proxy sidecar also validates the X-Scope-Token, ensuring its signature, expiration, and scope are correct.
21. Once both tokens are successfully validated, the proxy sidecar enforces fine-grained authorization rules based on the user’s custom security profile contained in the Authorization token.
22. If the fine-grained authorization checks pass, the proxy sidecar forwards the request to the backend API.
23. The backend API processes the request and sends the full response back to the proxy sidecar.
24. The proxy sidecar applies fine-grained filters to the response, reducing the number of rows and/or columns based on the user’s security profile or other policies.
25. The proxy sidecar returns the filtered response to the Gateway.
26. The Gateway forwards the response to the Portal View, allowing the SPA to render the page.
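Steps 12 through 17 amount to a header transformation inside the Gateway. The sketch below is illustrative only and is not light-gateway code: the function names, the cookie name accessToken, the "Bearer" prefix on both headers, and the rule that the path prefix is the first path segment are all assumptions, and token verification is left out.

```javascript
// Hypothetical sketch of steps 12-17. Names and formats are assumptions.
function parseCookies(cookieHeader) {
  const cookies = {};
  for (const part of (cookieHeader || "").split(";")) {
    const index = part.indexOf("=");
    if (index > 0) {
      cookies[part.slice(0, index).trim()] = part.slice(index + 1).trim();
    }
  }
  return cookies;
}

function buildDownstreamHeaders(cookieHeader, requestPath, fetchScopeToken) {
  // Step 12: the access token arrives in a cookie (assumed name: accessToken).
  const accessToken = parseCookies(cookieHeader).accessToken;
  if (!accessToken) {
    throw new Error("No access token cookie; the user must sign in first");
  }
  // Step 13 would verify the token signature and expiry here (omitted).
  // Step 14: request a client credentials token for the API's path prefix.
  const firstSegment = requestPath.split("/").filter(Boolean)[0] || "";
  const scopeToken = fetchScopeToken("/" + firstSegment);
  // Steps 13 and 17: both tokens travel to the proxy sidecar as headers.
  return {
    Authorization: "Bearer " + accessToken,
    "X-Scope-Token": "Bearer " + scopeToken,
  };
}
```

The fetchScopeToken callback stands in for the call to OAuth Kafka, which keeps the sketch free of network code.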
Fine-Grained Authorization
What is Fine-Grained Authorization?
Fine-grained authorization (FGA) refers to a detailed and precise control mechanism that governs access to resources based on specific attributes, roles, or rules. It’s also known as fine-grained access control (FGAC). Unlike coarse-grained authorization, which applies broader access policies (e.g., “Admins can access everything”), fine-grained authorization allows for more specific policies (e.g., “Admins can access user data only if they belong to the same department and the access request is during business hours”).
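The two example policies quoted above can be written as predicates to make the contrast concrete. This is a minimal sketch: the object shapes and the 09:00 to 17:00 business hours are assumed values chosen for illustration.

```javascript
// Coarse-grained: the role alone decides.
function coarseGrainedAllow(user) {
  return user.role === "admin";
}

// Fine-grained: role, department match, and request time all matter.
// Business hours of 09:00-17:00 are an assumed example value.
function fineGrainedAllow(user, resource, requestHour) {
  const duringBusinessHours = requestHour >= 9 && requestHour < 17;
  return (
    user.role === "admin" &&
    user.department === resource.department &&
    duringBusinessHours
  );
}
```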
Key Features
- Granular Control: Policies are defined at a detailed level, considering attributes like user role, resource type, action, time, location, etc.
- Context-Aware: Takes into account dynamic conditions such as the time of request, user’s location, or other contextual factors.
- Flexible Policies: Allows the creation of complex, conditional rules tailored to the organization’s needs.
Why Do We Need Fine-Grained Authorization?
1. Enhanced Security
By limiting access based on detailed criteria, fine-grained authorization minimizes the risk of unauthorized access or data breaches.
2. Regulatory Compliance
It helps organizations comply with legal and industry-specific regulations (e.g., GDPR, HIPAA) by ensuring sensitive data is only accessible under strict conditions.
3. Minimized Attack Surface
By restricting access to only the required resources and operations, fine-grained authorization reduces the potential impact of insider threats or compromised accounts.
4. Improved User Experience
Enables personalized access based on roles and permissions, ensuring users see only what they need, which reduces confusion and improves productivity.
5. Auditing and Accountability
Detailed access logs and policy enforcement make it easier to track and audit who accessed what, when, and why, fostering better accountability.
Examples of Use Cases
- Healthcare: A doctor can only view records of patients they are treating.
- Government: A government employee can access data and documents based on security clearance levels and job roles.
- Finance: A teller can only access transactions related to their assigned branch.
- Enterprise Software: Employees can edit documents only if they own them or have been granted editing permissions.
Fine-Grained Authorization in API Access Control
In API access control, fine-grained authorization governs how users or systems interact with specific API endpoints, actions, and data. This approach ensures that access permissions are precisely tailored to attributes, roles, and contextual factors, enabling a secure and customized API experience. As the Light Portal is a platform centered on APIs, the remainder of the design will focus on the API access control context.
Early Approaches to Fine-Grained Authorization
Early approaches to fine-grained authorization primarily involved Access Control Lists (ACLs) and Role-Based Access Control (RBAC). These methods laid the foundation for the more sophisticated access control mechanisms that followed. Here’s an overview of both:
Access Control Lists (ACLs):
- ACLs were one of the earliest forms of fine-grained authorization, allowing administrators to specify access permissions on individual resources for each user or group of users.
- In ACLs, permissions are directly assigned to users or groups, granting or denying access to specific resources based on their identities.
- While effective for small-scale environments with limited resources and users, ACLs became cumbersome as organizations grew. Maintenance issues arose, such as the time required to manage access to an increasing number of resources for numerous users.
Role-Based Access Control (RBAC):
- RBAC emerged as a solution to the scalability and maintenance challenges posed by ACLs. It introduced the concept of roles, which represent sets of permissions associated with particular job functions or responsibilities.
- Users are assigned one or more roles, and their access permissions are determined by the roles they possess rather than their individual identities.
- RBAC can be implemented with varying degrees of granularity. Roles can be coarse-grained, providing broad access privileges, or fine-grained, offering more specific and nuanced permissions based on organizational needs.
- Initially, RBAC appeared to address the limitations of ACLs by providing a more scalable and manageable approach to access control.
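The structural difference between the two models can be sketched as two lookups. The data and names below are made up for illustration; the point is only where the permissions are attached.

```javascript
// ACL: permissions are attached to each resource, per user.
const acl = {
  "report-42": { alice: ["read", "write"], bob: ["read"] },
};

function aclAllows(user, resource, action) {
  const entry = acl[resource] || {};
  return (entry[user] || []).includes(action);
}

// RBAC: permissions are attached to roles, and users are assigned roles.
const rolePermissions = {
  editor: ["report:read", "report:write"],
  viewer: ["report:read"],
};
const userRoles = { alice: ["editor"], bob: ["viewer"] };

function rbacAllows(user, permission) {
  return (userRoles[user] || []).some((role) =>
    (rolePermissions[role] || []).includes(permission)
  );
}
```

With the ACL, every new resource needs its own per-user entries. With RBAC, a new user only needs a role assignment.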
Both ACLs and RBAC have their shortcomings:
- Maintenance Challenges: While RBAC offered improved scalability compared to ACLs, it still faced challenges with role management as organizations expanded. The proliferation of roles, especially fine-grained ones, led to a phenomenon known as role explosion, where the number of roles grew rapidly, making them difficult to manage effectively.
- Security Risks: RBAC’s flexibility also posed security risks. Over time, users might accumulate permissions beyond what they need for their current roles, leading to a phenomenon known as permission creep. This weakened overall security controls and increased the risk of unauthorized access or privilege misuse.
Different applications have different authorization needs, so whether to use fine-grained or coarse-grained controls depends on the specific project.
Controlling access also becomes harder when resources are distributed and components require different levels of detail.
Standard Models for Implementing FGA
There are several standard models for implementing FGA:
- Attribute-Based Access Control (ABAC): In ABAC, access control decisions are made by evaluating attributes such as user roles, resource attributes (e.g., type, size, status), requested action, current date and time, and any other relevant contextual information. ABAC allows for very granular control over access based on a wide range of attributes.
- Policy-Based Access Control (PBAC): PBAC is similar to ABAC but focuses more on defining policies than directly evaluating attributes. Policies in PBAC typically consist of rules or logic that dictate access control decisions based on various contextual factors. While ABAC relies heavily on data (attributes), PBAC emphasizes using logic to determine access.
- Relationship-Based Access Control (ReBAC): ReBAC emphasizes the relationships between users and resources, as well as relationships between different resources. By considering these relationships, ReBAC provides a powerful and expressive model for describing complex authorization contexts. This can involve the attributes of users and resources and their interactions and dependencies.
Each of these models offers different strengths and may be more suitable for different scenarios. FGA allows for fine-grained control over access, enabling organizations to enforce highly specific access policies tailored to their requirements.
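To contrast a data-driven model with a relationship-driven one, the sketch below shows a tiny ABAC check next to a tiny ReBAC check. The attribute names (clearance, sensitivity, status) and the relation tuples are hypothetical and exist only for this example.

```javascript
// ABAC: the decision is computed from attributes of the user, resource, and action.
function abacAllows(user, resource, action) {
  return (
    action === "read" &&
    user.clearance >= resource.sensitivity &&
    resource.status === "published"
  );
}

// ReBAC: the decision follows relationships stored as [subject, relation, object].
const relations = [
  ["alice", "owner", "folder:reports"],
  ["folder:reports", "parent", "file:q1.pdf"],
];

function rebacAllows(user, object) {
  const owns = (target) =>
    relations.some(([s, r, o]) => s === user && r === "owner" && o === target);
  // Allowed on direct ownership, or on ownership of a direct parent.
  if (owns(object)) return true;
  return relations.some(([s, r, o]) => r === "parent" && o === object && owns(s));
}
```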
Streamlining FGA by Implementing Rule-Based Access Control:
ABAC (Attribute-Based Access Control) focuses on data attributes, PBAC (Policy-Based Access Control) centers on logic, and ReBAC (Relationship-Based Access Control) emphasizes relationships between users and resources. But what if we combined all three to leverage the strengths of each? This is the idea behind Rule-Based Access Control (RuBAC).
By embedding a lightweight rule engine, we can integrate multiple rules and actions to achieve the following:
- Optimize ABAC: Reduce the number of required attributes since not all rules depend on them. For example, a standard rule like “Customer data can only be accessed during working hours” can be shared across policies.
- Flexible Policy Enforcement: Using a rule engine makes access policies more dynamic and simpler to manage.
- Infer Relationships: Automatically deduce relationships between entities. For instance, the rule engine could grant a user access to a file if they already have permission for the containing folder.
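The idea of sharing rules across policies can be sketched with a very small rule engine. This is not the light-4j rule engine: the rule names, policy names, and context shape are invented for the example, and the working hours are an assumed 09:00 to 17:00.

```javascript
// Named rules are plain predicates over a request context.
const rules = {
  "working-hours": (ctx) => ctx.hour >= 9 && ctx.hour < 17,
  "has-role": (ctx, args) => args.roles.some((r) => ctx.user.roles.includes(r)),
  // Access to a folder implies access to the files it contains.
  "folder-access": (ctx) => ctx.user.folders.includes(ctx.resource.folder),
};

// A policy lists the rules that must all pass, so one rule can serve many policies.
const policies = {
  "customer-data": [
    { rule: "working-hours" },
    { rule: "has-role", args: { roles: ["teller", "manager"] } },
  ],
  "file-read": [{ rule: "working-hours" }, { rule: "folder-access" }],
};

function evaluate(policyName, ctx) {
  return policies[policyName].every(({ rule, args }) => rules[rule](ctx, args));
}
```

Here the working-hours rule is defined once and reused by both policies, which is the kind of sharing described in the first bullet.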
Principle of Least Privilege
The principle of least privilege (PoLP), often shortened to least privilege, is the security concept in which users are granted only the minimum level of access or permissions to an application, data, or system that they need to perform their job functions.
To ensure PoLP is effectively enforced, we’ve compiled a list of best practices:
- Conduct a thorough privilege audit: Visibility is critical in an access environment, so regular access audits of all privileged accounts help your team gain complete visibility. The audit includes reviewing privileged accounts and credentials held by employees, contractors, and third-party vendors, whether on-premises, accessible remotely, or in the cloud. Your team must also check default and hard-coded credentials, which IT teams often overlook.
- Establish least privilege as the default: Start by granting new accounts the minimum privileges required for their tasks and eliminate or reconfigure default permissions on new systems or applications. Use role-based access control to help your team determine the necessary privileges for a new account by providing general guidelines based on roles and responsibilities. Your team also needs to update access permissions when a user’s role changes; this helps prevent privilege creep.
- Enforce separation of privileges: Your team can prevent over-provisioning by limiting administrator privileges. First, segregate administrative accounts from standard accounts, even if they belong to the same user, and isolate privileged user sessions. Then, grant administrative privileges (such as read, write, and execute permissions) only to the extent necessary for the user to perform their specific administrative tasks. This prevents granting users unnecessary or excessive control over critical systems, which could lead to security vulnerabilities or misconfigurations.
- Provide just-in-time, limited access: To maintain least-privilege access without hindering employee workflows, combine role-based access control with time-limited privileges. Replace hard-coded credentials with dynamic secrets or one-time/temporary credentials. This lets your team grant temporary elevated permissions when users need them, for instance, to complete specific tasks or short-term projects.
- Keep track of and evaluate privileged access: Continuously monitor authentications and authorizations across your API platform and ensure that all individual actions are traceable. Additionally, record all authentication and authorization sessions comprehensively, and use automated tools to swiftly identify any unusual activity or potential issues.

These best practices are designed to enhance the security of your privileged accounts, data, and assets while supporting compliance and improving operational security without disrupting user workflows.
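Just-in-time access, one of the practices listed above, can be sketched as a grant that carries its own expiry. The function names and the grant shape are hypothetical, and a real system would also persist and audit each grant.

```javascript
// A grant carries an expiry, so elevated access lapses without manual cleanup.
function grantTemporaryAccess(user, permission, minutes, now = Date.now()) {
  return { user, permission, expiresAt: now + minutes * 60 * 1000 };
}

// A grant is honored only for the same user and permission, and only before it expires.
function isGrantActive(grant, user, permission, now = Date.now()) {
  return (
    grant.user === user &&
    grant.permission === permission &&
    now < grant.expiresAt
  );
}
```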
OpenAPI Specification Extensions
OpenAPI uses the term security scheme for authentication and authorization schemes. OpenAPI 3.0 lets you describe APIs protected by HTTP authentication, API keys, OAuth 2.0, and OpenID Connect Discovery. Fine-grained authorization is another layer of security, so it is natural to define it in the same specification. This can be done with OpenAPI specification extensions.
Extensions (also referred to as specification extensions or vendor extensions) are custom properties that start with x-, such as x-logo. They can be used to describe extra functionality that is not covered by the standard OpenAPI Specification. Many API-related products that support OpenAPI make use of extensions to document their own attributes, such as Amazon API Gateway, ReDoc, APIMatic and others.
Because the OpenAPI specification (openapi.yaml) is loaded during light-4j startup, the extensions are available in the runtime cache for each endpoint, just like the scope definitions. The API owner can define the following two extensions for each endpoint:
- x-request-access: This section allows the designer to specify one or more rules, as well as one or more security attributes used as input to the rules (for example, roles or location). The rule result decides whether the user has access to the endpoint, based on the security attributes from the JWT token in the request chain.
- x-response-filter: This section is similar to the above; however, it works on the response chain. The rule result decides which rows or columns of the response JSON are returned to the user, based on the security profile from the JWT token.
Example of OpenAPI specification with fine-grained authorization.
paths:
  /accounts:
    get:
      summary: "List all accounts"
      operationId: "listAccounts"
      x-request-access:
        rule: "account-cc-group-role-auth"
        roles: "manager teller customer"
      x-response-filter:
        - rule: "account-row-filter"
          teller:
            status: open
          customer:
            status: open
            owner: "@user_id"
        - rule: "account-col-filter"
          teller: ["num","owner","type","firstName","lastName","status"]
          customer: ["num","owner","type","firstName","lastName"]
      security:
        - account_auth:
            - "account.r"
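Once the specification is parsed, the x-request-access extension is ordinary data attached to the operation. The sketch below is not light-4j code: it assumes the roles value is a space-separated string, as in the example above, and the function names are hypothetical.

```javascript
// Hypothetical in-memory form of the operation after the YAML has been parsed.
const operation = {
  operationId: "listAccounts",
  "x-request-access": {
    rule: "account-cc-group-role-auth",
    roles: "manager teller customer",
  },
};

// Returns the rule to run and the allowed roles as an array.
function readRequestAccess(op) {
  const access = op["x-request-access"];
  if (!access) return null; // the endpoint has no fine-grained request rule
  return { rule: access.rule, roles: access.roles.split(/\s+/).filter(Boolean) };
}

// Simple role match: the caller needs at least one of the allowed roles.
function hasAllowedRole(op, callerRoles) {
  const access = readRequestAccess(op);
  return access === null || access.roles.some((r) => callerRoles.includes(r));
}
```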
FGA Rules for AccessControlHandler
With the above specification in place, the rules it references are also loaded for the service during server startup. In the Rule Registry on the light-portal, we have a set of built-in rules that can be picked as fine-grained policies for each API. Here is an example rule for the x-request-access section of the above specification.
account-cc-group-role-auth:
  ruleId: account-cc-group-role-auth
  host: lightapi.net
  description: Role-based authorization rule for account service and allow cc token and transform group to role.
  conditions:
    - conditionId: allow-cc
      variableName: auditInfo
      propertyPath: subject_claims.ClaimsMap.user_id
      operatorCode: NIL
      joinCode: OR
      index: 1
    - conditionId: manager
      variableName: auditInfo
      propertyPath: subject_claims.ClaimsMap.groups
      operatorCode: CS
      joinCode: OR
      index: 2
      conditionValues:
        - conditionValueId: manager
          conditionValue: admin
    - conditionId: teller
      variableName: auditInfo
      propertyPath: subject_claims.ClaimsMap.groups
      operatorCode: CS
      joinCode: OR
      index: 3
      conditionValues:
        - conditionValueId: teller
          conditionValue: frontOffice
    - conditionId: allow-role-jwt
      variableName: auditInfo
      propertyPath: subject_claims.ClaimsMap.roles
      operatorCode: NNIL
      joinCode: OR
      index: 4
  actions:
    - actionId: match-role
      actionClassName: com.networknt.rule.FineGrainedAuthAction
      actionValues:
        - actionValueId: roles
          value: $roles
All rules are managed by the light-portal and shared by all the services. In addition, developers can create their customized rules for their own services.
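To show how the conditions in the example rule could be evaluated, here is a small sketch. It is not the light-4j rule engine. It assumes that NIL means the property is missing, NNIL means it is present, and CS means the property contains the condition value, and it handles only the OR join used in the example.

```javascript
// Resolve a dotted property path such as "subject_claims.ClaimsMap.groups".
function resolvePath(root, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), root);
}

// Assumed operator meanings: NIL = missing, NNIL = present, CS = contains.
function testCondition(condition, facts) {
  const actual = resolvePath(facts[condition.variableName], condition.propertyPath);
  switch (condition.operatorCode) {
    case "NIL":
      return actual == null;
    case "NNIL":
      return actual != null;
    case "CS":
      return (condition.conditionValues || []).some((v) =>
        String(actual == null ? "" : actual).includes(v.conditionValue)
      );
    default:
      throw new Error("Unsupported operator: " + condition.operatorCode);
  }
}

// Every condition in the example is joined with OR, so any match fires the rule.
function anyConditionMatches(conditions, facts) {
  return conditions.some((c) => testCondition(c, facts));
}
```

Under these assumptions, a client credentials token without a user_id claim matches the allow-cc condition, and a user whose groups claim contains admin matches the manager condition.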
Response Filter
There are two types of filters: row and column.
Row
A row filter checks conditions defined on some of the row’s properties to decide whether the row is returned. In the database, each endpoint has a colName, an operator, and a colValue defined for each condition.
The operator supports the following enum: ["=", "!=", "<", ">", "<=", ">=", "in", "not in", "range"]
The colValue supports variables from the JWT token, prefixed with @. For example, @eid will be replaced with the eid claim from the JWT token.
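A row filter built from colName, operator, and colValue can be sketched as follows. This is an illustration rather than the actual handler: the value shapes for "in", "not in" (arrays), and "range" (a [min, max] pair) are assumptions, as are the function names.

```javascript
// Replace "@name" with the matching JWT claim; other values pass through.
function resolveValue(colValue, claims) {
  return typeof colValue === "string" && colValue.startsWith("@")
    ? claims[colValue.slice(1)]
    : colValue;
}

// Assumed value shapes: arrays for "in" / "not in", [min, max] for "range".
const operators = {
  "=": (a, b) => a === b,
  "!=": (a, b) => a !== b,
  "<": (a, b) => a < b,
  ">": (a, b) => a > b,
  "<=": (a, b) => a <= b,
  ">=": (a, b) => a >= b,
  in: (a, b) => b.includes(a),
  "not in": (a, b) => !b.includes(a),
  range: (a, b) => a >= b[0] && a <= b[1],
};

// Keep only the rows that satisfy every condition.
function filterRows(rows, conditions, claims) {
  return rows.filter((row) =>
    conditions.every(({ colName, operator, colValue }) =>
      operators[operator](row[colName], resolveValue(colValue, claims))
    )
  );
}
```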
Column
A column filter specifies, in JSON format, either a list of columns to include or a list of columns to exclude. An exclusion list is prefixed with !.
["accountNo","firstName","lastName"]
or
!["status"]
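The include and exclude forms can both be handled by one small function. The sketch assumes the filter arrives as text in exactly the two forms shown above; the function names are hypothetical.

```javascript
// Parse the filter text: a JSON array includes columns, a leading "!" excludes them.
function parseColumnFilter(text) {
  const trimmed = text.trim();
  const exclude = trimmed.startsWith("!");
  const columns = JSON.parse(exclude ? trimmed.slice(1) : trimmed);
  return { exclude, columns };
}

// Return a copy of the row that contains only the permitted columns.
function filterColumns(row, filterText) {
  const { exclude, columns } = parseColumnFilter(filterText);
  return Object.fromEntries(
    Object.entries(row).filter(([key]) => columns.includes(key) !== exclude)
  );
}
```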
JSON Schema Registry
JSON Schema is a declarative language that provides a standardized way to describe and validate JSON data.
What it does
JSON Schema defines the structure, content, data types, and constraints of JSON documents. It is published as a series of IETF Internet-Drafts and helps ensure the consistency and integrity of JSON data across applications.
How it works
JSON Schema uses keywords to define data properties. A JSON Schema validator checks if JSON documents conform to the schema.
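As a rough illustration of keywords and validation, the sketch below defines a schema for a hypothetical account object and checks data against it. The validator is deliberately tiny: it handles only the type, required, and properties keywords on flat objects, whereas real validators implement the full specification.

```javascript
// A schema for a hypothetical account object, using three common keywords.
const accountSchema = {
  type: "object",
  required: ["num", "status"],
  properties: {
    num: { type: "string" },
    status: { type: "string" },
    balance: { type: "number" },
  },
};

// Toy validator for flat object schemas; returns a list of error messages.
function validate(schema, data) {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["expected an object"];
  }
  const errors = [];
  for (const name of schema.required || []) {
    if (!(name in data)) errors.push("missing required property: " + name);
  }
  for (const [name, rule] of Object.entries(schema.properties || {})) {
    if (name in data && typeof data[name] !== rule.type) {
      errors.push(name + " should be of type " + rule.type);
    }
  }
  return errors;
}
```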
What it’s useful for
- Describing existing data formats
- Validating data as part of automated testing
- Validating client-submitted data
- Defining how a record should be organized
What is a JSON Schema Registry
The JSON Schema Registry provides a centralized service for your JSON schemas with RESTful endpoints for storing and retrieving JSON schemas.
When using data in a distributed application with many RESTful APIs, it is important to ensure that it is well-formed and structured. If data is sent without prior validation, errors may occur on the services. A schema registry provides a way to ensure that the data is validated before it is sent and validated after it is received.
A schema registry is a service used to define and confirm the structure of data that is sent between consumers and providers. In a schema registry, developers can define what the data should look like and how it should be validated. The schemas can be utilized in the OpenAPI specifications to ensure that schemas can be externalized.
Schema records can also help ensure forward and backward compatibility when changes are made to the data structure. When a schema record is used, the data is transferred together with schema information that applications reading the data can use to interpret it.
Given the API consumers and providers can belong to different groups or organizations, it is necessary to have a centralized service to manage the schemas so that they can be shared between them. This is why we have implemented this service as part of the light-portal.
Schema Specification Version
The registry is heterogeneous: it can store schemas of different JSON Schema draft versions. By default, the registry is configured to store schemas as Draft 2020-12. When a schema is added, it is saved under the specification version that is currently configured.
The following list contains all supported specification versions.
- Draft 4
- Draft 6
- Draft 7
- 2019-09
- 2020-12
Schema Version
Once a schema is registered in the registry, it is assigned version 1. Each time it is updated, the version number increases by 1. When a schema is retrieved, the version number can be part of the URL to request that exact version. If the URL contains no version number, the latest version is retrieved.
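The versioning behaviour can be modelled with an in-memory map. This is a sketch of the semantics only; the class and method names are hypothetical and do not describe the registry's actual endpoints or storage.

```javascript
class SchemaRegistry {
  constructor() {
    // schema id -> array of versions (index 0 holds version 1)
    this.schemas = new Map();
  }

  // Saving a new id creates version 1; saving again appends the next version.
  save(id, schema) {
    const versions = this.schemas.get(id) || [];
    versions.push(schema);
    this.schemas.set(id, versions);
    return versions.length; // the version number just assigned
  }

  // Without a version argument, the latest version is returned.
  get(id, version) {
    const versions = this.schemas.get(id);
    if (!versions) return undefined;
    return versions[(version || versions.length) - 1];
  }
}
```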
Access Endpoint
Table Structure
Light Controller
YAML Rule Registry
React Schema Form
React Schema Form is a form generator based on JSON Schema and form definitions from Light Portal. It renders UI forms to manipulate database entities, and form submissions are automatically hooked into an API endpoint.
Debugging a Component
Encountering a bug in a react-schema-form component can be challenging since the source code may not be directly visible. To debug:
- Set up the Light Portal server if dropdowns are loaded from the server.
- Use the example app in the same project to debug.
Use a Local Alias with Vite
Vite allows creating an alias to point to your library’s src folder. Update the vite.config.ts in your example app:
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import path from 'path';
export default defineConfig({
  plugins: [react()],
  resolve: {
    alias: {
      // Adjust the path to point to the library's `src` folder
      'react-schema-form': path.resolve(__dirname, '../src'),
    },
  },
});
Use a Link Script in package.json
Update the example app’s package.json file. In the dependencies section, replace the library’s version with a local path:
{
  "dependencies": {
    "react-schema-form": "file:../src"
  }
}
Library Entry Point
Vite requires an entry point file, typically named index.js or index.ts, in your library’s src folder. Ensure that your library’s src folder includes a properly configured index.js file, like this:
export { default as SchemaForm } from './SchemaForm'
export { default as ComposedComponent } from './ComposedComponent'
export { default as utils } from './utils'
export { default as Array } from './Array'
Without a correctly named and configured entry file, components like SchemaForm may not be imported properly.
Update index.html
If you change the entry point file from main.js to index.js, ensure you update the reference in the index.html file located in the root folder. For example:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/vite.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Vite + React</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/index.js"></script>
</body>
</html>
Sync devDependencies from peerDependencies
When the source code in src is used directly by the example app, the peerDependencies in the example app won’t work for react-schema-form components. To address this, copy the peerDependencies into the devDependencies section of react-schema-form’s package.json. For example:
"devDependencies": {
"@babel/runtime": "^7.26.0",
"@codemirror/autocomplete": "^6.18.2",
"@codemirror/language": "^6.10.6",
"@codemirror/lint": "^6.8.2",
"@codemirror/search": "^6.5.7",
"@codemirror/state": "^6.4.1",
"@codemirror/theme-one-dark": "^6.1.2",
"@codemirror/view": "^6.34.2",
"@emotion/react": "^11.13.5",
"@emotion/styled": "^11.13.5",
"@eslint/js": "^9.13.0",
"@lezer/common": "^1.2.3",
"@mui/icons-material": "^6.1.6",
"@mui/material": "^6.1.6",
"@mui/styles": "^6.1.6",
"@types/react": "^18.3.1",
"@uiw/react-markdown-editor": "^6.1.2",
"@vitejs/plugin-react": "^4.3.3",
"codemirror": "^6.0.1",
"eslint": "^9.13.0",
"eslint-plugin-react": "^7.37.2",
"eslint-plugin-react-hooks": "^5.0.0",
"eslint-plugin-react-refresh": "^0.4.14",
"gh-pages": "^6.2.0",
"globals": "^15.11.0",
"react": "^18.3.1",
"react-dom": "^18.3.1",
"vite": "^6.0.3"
},
"peerDependencies": {
"@babel/runtime": "^7.26.0",
"@codemirror/autocomplete": "^6.18.2",
"@codemirror/language": "^6.10.6",
"@codemirror/lint": "^6.8.2",
"@codemirror/search": "^6.5.7",
"@codemirror/state": "^6.4.1",
"@codemirror/theme-one-dark": "^6.1.2",
"@codemirror/view": "^6.34.2",
"@emotion/react": "^11.13.5",
"@emotion/styled": "^11.13.5",
"@lezer/common": "^1.2.3",
"@mui/icons-material": "^6.1.6",
"@mui/material": "^6.1.6",
"@mui/styles": "^6.1.6",
"@types/react": "^18.3.1",
"@uiw/react-markdown-editor": "^6.1.2",
"codemirror": "^6.0.1",
"react": "^18.3.1",
"react-dom": "^18.3.1"
},
Additionally, ensure the peerDependencies are also synced with the dependencies section of the example app’s package.json. This step allows react-schema-form components to load independently and work seamlessly during development.
Update Source Code
After completing all the updates, perform a clean install for both react-schema-form and the example app. Then, start the server from the example folder using the following command:
yarn dev
Whenever you modify a react-schema-form component, simply refresh the browser to reload the example application and see the updated component in action.
Debug with Visual Studio Code
You can debug the component using Visual Studio Code. There are many tutorials available online that explain how to debug React applications built with Vite, which can help you set up breakpoints, inspect components, and track down issues effectively.
Component dynaselect
dynaselect is a component that renders a dropdown select, either from static options or from options loaded dynamically from a server via an API endpoint. It wraps the Material UI Autocomplete component. Below is an example form from the example app that demonstrates how to use this component.
{
"schema": {
"type": "object",
"title": "React Component Autocomplete Demo Static Single",
"properties": {
"name": {
"title": "Name",
"type": "string",
"default": "Steve"
},
"host": {
"title": "Host",
"type": "string"
},
"environment": {
"type": "string",
"title": "Environment",
"default": "LOCAL",
"enum": [
"LOCAL",
"SIT1",
"SIT2",
"SIT3",
"UAT1",
"UAT2"
]
},
"stringarraysingle": {
"type": "array",
"title": "Single String Array",
"items": {
"type": "string"
}
},
"stringcat": {
"type": "string",
"title": "Joined Strings"
},
"stringarraymultiple": {
"type": "array",
"title": "Multiple String Array",
"items": {
"type": "string"
}
}
},
"required": [
"name",
"environment"
]
},
"form": [
"name",
{
"key": "host",
"type": "dynaselect",
"multiple": false,
"action": {
"url": "https://localhost/portal/query?cmd=%7B%22host%22%3A%22lightapi.net%22%2C%22service%22%3A%22user%22%2C%22action%22%3A%22listHost%22%2C%22version%22%3A%220.1.0%22%7D"
}
},
{
"key": "environment",
"type": "dynaselect",
"multiple": false,
"options": [
{
"id": "LOCAL",
"label": "Local"
},
{
"id": "SIT1",
"label": "SIT1"
},
{
"id": "SIT2",
"label": "SIT2"
},
{
"id": "SIT3",
"label": "SIT3"
},
{
"id": "UAT1",
"label": "UAT1"
},
{
"id": "UAT2",
"label": "UAT2"
}
]
},
{
"key": "stringarraysingle",
"type": "dynaselect",
"multiple": false,
"options": [
{
"id": "id1",
"label": "label1"
},
{
"id": "id2",
"label": "label2"
},
{
"id": "id3",
"label": "label3"
},
{
"id": "id4",
"label": "label4"
},
{
"id": "id5",
"label": "label5"
},
{
"id": "id6",
"label": "label6"
}
]
},
{
"key": "stringcat",
"type": "dynaselect",
"multiple": true,
"options": [
{
"id": "id1",
"label": "label1"
},
{
"id": "id2",
"label": "label2"
},
{
"id": "id3",
"label": "label3"
},
{
"id": "id4",
"label": "label4"
},
{
"id": "id5",
"label": "label5"
},
{
"id": "id6",
"label": "label6"
}
]
},
{
"key": "stringarraymultiple",
"type": "dynaselect",
"multiple": true,
"options": [
{
"id": "id1",
"label": "label1"
},
{
"id": "id2",
"label": "label2"
},
{
"id": "id3",
"label": "label3"
},
{
"id": "id4",
"label": "label4"
},
{
"id": "id5",
"label": "label5"
},
{
"id": "id6",
"label": "label6"
}
]
}
]
}
Dynamic Options from APIs
The host is a string type field rendered as a dynaselect with multiple set to false. The options for the select are loaded via an API endpoint, with the action URL provided. Note that the cmd query parameter value is URL-encoded because it contains characters such as curly brackets {} and double quotes.
Any URL encoder/decoder can convert between the two forms of the query parameter value:
Encoded:
%7B%22host%22%3A%22lightapi.net%22%2C%22service%22%3A%22user%22%2C%22action%22%3A%22listHost%22%2C%22version%22%3A%220.1.0%22%7D
Decoded:
{"host":"lightapi.net","service":"user","action":"listHost","version":"0.1.0"}
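The same conversion can be done in code with the standard encodeURIComponent and decodeURIComponent functions. The snippet rebuilds the action URL from the example form; the variable names are only for illustration.

```javascript
const command = {
  host: "lightapi.net",
  service: "user",
  action: "listHost",
  version: "0.1.0",
};

// encodeURIComponent escapes the braces, quotes, colons, and commas in the JSON text.
const encoded = encodeURIComponent(JSON.stringify(command));
const url = "https://localhost/portal/query?cmd=" + encoded;

// decodeURIComponent reverses the encoding.
const decoded = JSON.parse(decodeURIComponent(encoded));
```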
When using the example app to test the react-schema-form with APIs, you need to configure CORS on the light-gateway. Ensure that CORS is enabled only on the light-gateway and not on the backend API, such as hybrid-query.
Here is the example in values.yml for the light-gateway.
# cors.yml
cors.enabled: true
cors.allowedOrigins:
- https://devsignin.lightapi.net
- https://dev.lightapi.net
- https://localhost:3000
- http://localhost:5173
cors.allowedMethods:
- GET
- POST
- PUT
- DELETE
Single string type
For the environment field, the schema defines the type as string, and the form definition specifies multiple: false to indicate it is a single select.
The select result in the model looks like the following:
{
  "environment": "SIT1"
}
Single string array type
For the stringarraysingle field, the schema defines the type as a string array, and the form definition specifies multiple: false to indicate it is a single select.
The select result in the model looks like the following:
{
"stringarraysingle": [
"id3"
]
}
Multiple string type
For the stringcat field, the schema defines the type as a string, and the form definition specifies multiple: true to indicate it is a multiple select.
The select result in the model looks like the following:
{
"stringcat": "id2,id4"
}
Multiple string array type
For the stringarraymultiple field, the schema defines the type as a string array, and the form definition specifies multiple: true to indicate it is a multiple select.
The select result in the model looks like the following:
{
"stringarraymultiple": [
"id2",
"id5",
"id3"
]
}
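The four cases above can be summarized in a few lines of code. This is a minimal sketch and not the react-schema-form implementation; `toModelValue` and its parameters are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: map the selected option ids to the value stored in
// the model, following the four schema type / multiple combinations above.
function toModelValue(schemaType, multiple, selectedIds) {
  if (!multiple && selectedIds.length > 1) {
    throw new Error("A single select cannot hold more than one id");
  }
  if (schemaType === "array") {
    // Array-typed fields always store an array, even for a single select.
    return [...selectedIds];
  }
  // String-typed fields store one id, or several ids joined by commas.
  return selectedIds.join(",");
}

console.log(toModelValue("string", false, ["SIT1"]));      // SIT1
console.log(toModelValue("string", true, ["id2", "id4"])); // id2,id4
console.log(toModelValue("array", false, ["id3"]));
console.log(toModelValue("array", true, ["id2", "id5", "id3"]));
```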
User Management
User Type
The user_type field is a critical part of the user security profile in the JWT token and can be leveraged for fine-grained authorization. In a multi-tenant environment, user_type is presented as a dropdown populated from the reference table configured for the organization. It can be dynamically selected based on the host chosen during the user registration process.
Supported Standard Dropdown Models
- Employee and Customer
  - Dropdown values: E (Employee), C (Customer)
  - Default model for the lightapi.net host.
  - Suitable for most organizations.
- Employee, Personal, and Business
  - Dropdown values: E (Employee), P (Personal), B (Business)
  - Commonly used for banks where personal and business banking are separated.
Database Configuration
- The user_type field is nullable in the user_t table by default.
- However, you can enforce this field as mandatory in your application via the schema and UI configuration.
On-Prem Deployment
In on-premise environments, the user_type can determine the authentication method:
- Employees: Authenticated via Active Directory.
- Customers: Authenticated via a customer database.
This flexibility allows organizations to tailor the authentication process based on their specific needs and user classifications.
Handling Users with Multi-Host Access
There are two primary ways to handle users who belong to multiple hosts:
- User-Host Mapping Table:
user_t: This table would not have a host_id and would store core user information that is host-independent. The user_id would be unique across all hosts.
user_host_t (or user_tenant_t): This would be a mapping table to represent the many-to-many relationship between users and hosts.
-- user_t (no host_id, globally unique user_id)
CREATE TABLE user_t (
user_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(), -- UUID is recommended
-- ... other user attributes (e.g., name, email)
);
-- user_host_t (mapping table)
CREATE TABLE user_host_t (
user_id UUID NOT NULL,
host_id UUID NOT NULL,
-- ... other relationship-specific attributes (e.g., roles within the host)
PRIMARY KEY (user_id, host_id),
FOREIGN KEY (user_id) REFERENCES user_t (user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id) REFERENCES host_t (host_id) ON DELETE CASCADE -- Assuming you have a host_t table
);
- Duplicating User Records (Less Recommended):
user_t: You would keep host_id in this table, and the primary key would be (host_id, user_id).
User Duplication: If a user needs access to multiple hosts, you would duplicate their user record in user_t for each host they belong to, each with a different host_id.
Why User-Host Mapping is Generally Preferred:
- Data Integrity: Avoids data duplication and the potential for inconsistencies that come with it. If a user’s core information (e.g., name, email) changes, you only need to update it in one place in user_t.
- Flexibility: Easier to add or remove a user’s access to hosts without affecting their core user data.
- Querying: While you’ll need joins to get a user’s hosts or a host’s users, these joins are straightforward using the mapping table.
- Scalability: Better scalability as your user base and the number of hosts they can access grow.
Distributing Tables in a Multi-Host User Scenario:
With the user-host mapping approach:
- user_t: This table would likely be a reference table in Citus (replicated to all nodes) since it does not have a host_id for distribution.
- user_host_t: This table would be distributed by host_id.
- Other tables (e.g., employee_t, api_endpoint_t, etc.): These would be distributed by host_id as before.
When querying, you would typically:
- Start with the user_host_t table to find the hosts a user has access to.
- Join with other tables (distributed by host_id) based on the host_id to retrieve tenant-specific data.
Choosing the Right user_id Primary Key:
Here’s a comparison of the options for the user_id primary key in user_t:
1. UUID (user_id)
- Pros:
- Globally Unique: Avoids collisions across hosts or when scaling beyond the current setup.
- Security: Difficult to guess or enumerate.
- Scalability: Well-suited for distributed environments like Citus.
- Cons:
- Storage: Slightly larger storage size compared to integers.
- Readability: Not human-readable, which can be inconvenient for debugging.
- Recommendation:
This is generally the best option for a user_id in a multi-tenant, distributed environment.
2. Email (email)
- Pros:
- Human-Readable: Easy to identify and manage.
- Login Identifier: Often used as a natural login credential.
- Cons:
- Uniqueness Challenges: Enforcing global uniqueness across all hosts may require complex constraints or application logic.
- Changeability: If emails change, cascading updates can complicate the database.
- Security: Using emails as primary keys can expose sensitive user data if not handled securely.
- Performance: String comparisons are slower than those for integers or UUIDs.
- Recommendation:
Not recommended as a primary key, especially in a multi-tenant or distributed setup.
3. User-Chosen Unique ID (e.g., username)
- Pros:
- Human-Readable: Intuitive and user-friendly.
- Cons:
- Uniqueness Challenges: Enforcing global uniqueness is challenging and may require complex constraints.
- Changeability: Users may request username changes, causing cascading update issues.
- Security: Usernames are easier to guess or enumerate compared to UUIDs.
- Recommendation:
Not recommended as a primary key in a multi-tenant, distributed environment.
In Conclusion:
- Use a User-Host Mapping Table: This is the best approach to handle users who belong to multiple hosts in a multi-tenant Citus environment.
- Use UUID for user_id: UUIDs are the most suitable option for the user_id primary key in user_t due to their global uniqueness, security, and scalability.
- Distribute by host_id: Distribute tables that need sharding by host_id, and ensure that foreign keys to distributed tables include host_id.
- Use Reference Tables: For tables like user_t that don’t have a host_id, designate them as reference tables in Citus.
This approach provides a flexible and scalable foundation for managing users with multi-host access in your Citus-based multi-tenant application.
User Tables
A single user_t table with a user_type discriminator manages both employees and customers in a unified way, and an optional referral relationship links a customer to the customer who referred them. The suggested PostgreSQL table schema follows, along with explanations and some considerations:
user_t (User Table): This table will store basic information common to both employees and customers.
CREATE TABLE user_t (
user_id VARCHAR(24) NOT NULL,
email VARCHAR(255) NOT NULL,
password VARCHAR(1024) NOT NULL,
language CHAR(2) NOT NULL,
first_name VARCHAR(32) NULL,
last_name VARCHAR(32) NULL,
user_type CHAR(1) NULL, -- E employee C customer or E employee P personal B business
phone_number VARCHAR(20) NULL,
gender CHAR(1) NULL,
birthday DATE NULL,
country VARCHAR(3) NULL,
province VARCHAR(32) NULL,
city VARCHAR(32) NULL,
address VARCHAR(128) NULL,
post_code VARCHAR(16) NULL,
verified BOOLEAN NOT NULL DEFAULT false,
token VARCHAR(64) NULL,
locked BOOLEAN NOT NULL DEFAULT false,
nonce BIGINT NOT NULL DEFAULT 0,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
ALTER TABLE user_t ADD CONSTRAINT user_pk PRIMARY KEY ( user_id );
ALTER TABLE user_t ADD CONSTRAINT user_email_uk UNIQUE ( email );
user_host_t (User to host relationship or mapping):
CREATE TABLE user_host_t (
host_id VARCHAR(24) NOT NULL,
user_id VARCHAR(24) NOT NULL,
-- other relationship-specific attributes (e.g., roles within the host)
PRIMARY KEY (host_id, user_id),
FOREIGN KEY (user_id) REFERENCES user_t (user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id) REFERENCES host_t (host_id) ON DELETE CASCADE
);
employee_t (Employee Table): This table will store employee-specific attributes.
CREATE TABLE employee_t (
host_id VARCHAR(22) NOT NULL,
employee_id VARCHAR(50) NOT NULL, -- Employee ID or number or ACF2 ID. Unique within the host.
user_id VARCHAR(22) NOT NULL,
title VARCHAR(255) NOT NULL,
manager_id VARCHAR(50), -- manager's employee_id if there is one.
hire_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, employee_id),
FOREIGN KEY (host_id, user_id) REFERENCES user_host_t(host_id, user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, manager_id) REFERENCES employee_t(host_id, employee_id) ON DELETE CASCADE
);
customer_t (Customer Table): This table will store customer-specific attributes.
CREATE TABLE customer_t (
host_id VARCHAR(24) NOT NULL,
customer_id VARCHAR(50) NOT NULL,
user_id VARCHAR(24) NOT NULL,
-- Other customer-specific attributes
referral_id VARCHAR(22), -- the customer_id who refers this customer.
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, customer_id),
FOREIGN KEY (host_id, user_id) REFERENCES user_host_t(host_id, user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, referral_id) REFERENCES customer_t(host_id, customer_id) ON DELETE CASCADE
);
position_t (Position Table): Defines different positions within the organization for employees.
CREATE TABLE position_t (
host_id VARCHAR(22) NOT NULL,
position_id VARCHAR(22) NOT NULL,
position_name VARCHAR(255) UNIQUE NOT NULL,
description TEXT,
inherit_to_ancestor BOOLEAN DEFAULT FALSE,
inherit_to_sibling BOOLEAN DEFAULT FALSE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, position_id)
);
employee_position_t (Employee Position Table): Links employees to their positions with effective dates.
CREATE TABLE employee_position_t (
host_id VARCHAR(22) NOT NULL,
employee_id VARCHAR(50) NOT NULL,
position_id VARCHAR(22) NOT NULL,
position_type CHAR(1) NOT NULL, -- P position of own, D inherited from a descendant, S inherited from a sibling.
start_date DATE NOT NULL,
end_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, employee_id, position_id),
FOREIGN KEY (host_id, position_id) REFERENCES position_t(host_id, position_id) ON DELETE CASCADE
);
Authorization Strategies
To link users to API endpoints for authorization, we will adopt the following approaches. A rule engine enforces the policies in the API sidecar through the access-control middleware handler.
A. Role-Based Access Control (RBAC)
This is a common and relatively simple approach. You define roles (e.g., “admin,” “editor,” “viewer”) and assign permissions to those roles. Users are then assigned to one or more roles.
Role Table:
CREATE TABLE role_t (
host_id VARCHAR(22) NOT NULL,
role_id VARCHAR(22) NOT NULL,
role_name VARCHAR(255) UNIQUE NOT NULL,
description TEXT,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, role_id)
);
Role-Endpoint Permission Table:
CREATE TABLE role_permission_t (
host_id VARCHAR(32) NOT NULL,
role_id VARCHAR(32) NOT NULL,
endpoint_id VARCHAR(64) NOT NULL,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, role_id, endpoint_id),
FOREIGN KEY (host_id, role_id) REFERENCES role_t(host_id, role_id) ON DELETE CASCADE,
FOREIGN KEY (endpoint_id) REFERENCES api_endpoint_t(endpoint_id) ON DELETE CASCADE
);
Role-User Assignment Table:
CREATE TABLE role_user_t (
host_id VARCHAR(22) NOT NULL,
role_id VARCHAR(22) NOT NULL,
user_id VARCHAR(22) NOT NULL,
start_date DATE NOT NULL DEFAULT CURRENT_DATE,
end_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, role_id, user_id, start_date),
FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, role_id) REFERENCES role_t(host_id, role_id) ON DELETE CASCADE
);
B. User-Based Access Control (UBAC)
This approach assigns permissions directly to users, allowing for very fine-grained control. It’s more flexible but can become complex to manage if you have a lot of users and endpoints. It should only be used for temporary access.
User-Endpoint Permissions Table:
CREATE TABLE user_permission_t (
user_id VARCHAR(22) NOT NULL,
host_id VARCHAR(22) NOT NULL,
endpoint_id VARCHAR(22) NOT NULL,
start_date DATE NOT NULL DEFAULT CURRENT_DATE,
end_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (user_id, host_id, endpoint_id),
FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE,
FOREIGN KEY (endpoint_id) REFERENCES api_endpoint_t(endpoint_id) ON DELETE CASCADE
);
C. Group-Based Access Control (GBAC)
You can group users into teams or departments and assign permissions to those groups. This is useful when you want to manage permissions for sets of users with similar access needs.
Groups Table:
CREATE TABLE group_t (
host_id VARCHAR(32) NOT NULL,
group_id VARCHAR(32) NOT NULL,
group_name VARCHAR(255) UNIQUE NOT NULL,
description TEXT,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, group_id)
);
Group-Endpoint Permission Table:
CREATE TABLE group_permission_t (
host_id VARCHAR(32) NOT NULL,
group_id VARCHAR(32) NOT NULL,
endpoint_id VARCHAR(32) NOT NULL,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, group_id, endpoint_id),
FOREIGN KEY (host_id, group_id) REFERENCES group_t(host_id, group_id) ON DELETE CASCADE,
FOREIGN KEY (endpoint_id) REFERENCES api_endpoint_t(endpoint_id) ON DELETE CASCADE
);
Group-User Membership Table:
CREATE TABLE group_user_t (
host_id VARCHAR(22) NOT NULL,
group_id VARCHAR(22) NOT NULL,
user_id VARCHAR(22) NOT NULL,
start_date DATE NOT NULL DEFAULT CURRENT_DATE,
end_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, group_id, user_id, start_date),
FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, group_id) REFERENCES group_t(host_id, group_id) ON DELETE CASCADE
);
D. Attribute-Based Access Control (ABAC)
Attribute Table:
CREATE TABLE attribute_t (
host_id VARCHAR(22) NOT NULL,
attribute_id VARCHAR(22) NOT NULL,
attribute_name VARCHAR(255) UNIQUE NOT NULL, -- The name of the attribute (e.g., "department," "job_title," "project," "clearance_level," "location").
attribute_type VARCHAR(50) CHECK (attribute_type IN ('string', 'integer', 'boolean', 'date', 'float', 'list')), -- Define allowed data types
description TEXT,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, attribute_id)
);
Attribute User Table:
CREATE TABLE attribute_user_t (
host_id VARCHAR(22) NOT NULL,
attribute_id VARCHAR(22) NOT NULL,
user_id VARCHAR(22) NOT NULL, -- References user_t
attribute_value TEXT, -- Store values as strings; you can cast later
start_date DATE NOT NULL DEFAULT CURRENT_DATE,
end_date DATE,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, attribute_id, user_id, start_date),
FOREIGN KEY (user_id) REFERENCES user_t(user_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, attribute_id) REFERENCES attribute_t(host_id, attribute_id) ON DELETE CASCADE
);
Attribute Permission Table:
CREATE TABLE attribute_permission_t (
host_id VARCHAR(32) NOT NULL,
attribute_id VARCHAR(32) NOT NULL,
endpoint_id VARCHAR(32) NOT NULL, -- References api_endpoint_t
attribute_value TEXT,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (host_id, attribute_id, endpoint_id),
FOREIGN KEY (endpoint_id) REFERENCES api_endpoint_t(endpoint_id) ON DELETE CASCADE,
FOREIGN KEY (host_id, attribute_id) REFERENCES attribute_t(host_id, attribute_id) ON DELETE CASCADE
);
How it Works:
- Define Attributes: Define all relevant attributes in attribute_t. Think about all the properties of your users, resources, and environment that might be used in access control decisions.
- Assign Attributes to Users: Populate attribute_user_t to associate attribute values with users.
- Assign Attributes to Endpoints: Populate attribute_permission_t to associate attribute values with API endpoints.
- Write Policies: Create policy rules in the rule engine. These rules should use the attribute names defined in attribute_t.
- Policy Evaluation (at runtime):
  - The policy engine receives the subject (user), resource (API endpoint), and action (HTTP method) of the request.
  - The engine retrieves the relevant attributes from the attribute_user_t and attribute_permission_t tables.
  - The engine evaluates the rules from the relevant policies against the attributes.
  - Based on the policy evaluation result, access is either granted or denied.
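The runtime evaluation step can be pictured with a small example. This is a minimal sketch under a strong assumption: the policy simply requires every attribute attached to the endpoint to match the user's value. Real rules in the rule engine can be arbitrary, and the object shapes and the `isAllowed` name are hypothetical.

```javascript
// Hypothetical in-memory stand-ins for rows from attribute_user_t and
// attribute_permission_t, keyed by attribute name for brevity.
const userAttributes = {
  country: "CAN",
  security_clearance_level: "2",
};

const endpointAttributes = {
  country: "CAN",
};

// Assumed policy for illustration only: grant access when every attribute
// attached to the endpoint is present on the user with the same value.
function isAllowed(userAttrs, endpointAttrs) {
  return Object.entries(endpointAttrs).every(
    ([name, value]) => userAttrs[name] === value
  );
}

console.log(isAllowed(userAttributes, endpointAttributes)); // true
console.log(isAllowed(userAttributes, { country: "USA" })); // false
```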
Key Advantages of ABAC:
- Fine-Grained Control: Express very specific access rules.
- Centralized Policy Management: Policies are stored centrally and can be easily updated.
- Flexibility and Scalability: Adapts easily to changing requirements.
- Auditing and Compliance: Easier to audit and demonstrate compliance.
Format of attributes in JWT token:
Unlike roles, groups, and positions, which can be concatenated into a string, an attribute is a key/value pair. We need to format multiple attributes into a single string and put it into the token.
Challenges
- Spaces: The primary issue is that simple key-value pairs like key1:value1 key2:value2 will not work when values contain spaces.
- Escaping: We need a way to escape characters that may confuse the parser, for example if the value also contains a :.
- Readability: The format should be reasonably readable for debugging and human consumption.
- Parsing: The format should be easy to parse on the application side.
Options
- Comma-Separated Key-Value Pairs with Escaping:
  - Format: key1=value1,key2=value2_with_spaces,key3=value3,with,commas
  - Escaping: Use a backslash \ to escape commas and backslashes within the values. You can also escape spaces (a backslash followed by a space) to make them explicit.
  - Pros: Simple to implement, relatively easy to parse by splitting on commas and then on =.
  - Cons: Can become hard to read with complex values, requires proper escaping, and becomes unreadable if \ itself needs to be escaped.
- Custom Delimiter and Escaping:
  - Format: key1^=^value1~key2^=^value2 with spaces~key3^=^value3~
  - Delimiter: Use ^=^ between a key and its value, and ~ between attributes.
  - Pros: Avoids many escaping issues and keeps spaces intact; easier to read than comma-separated values.
  - Cons: The delimiters must be chosen carefully so that they never occur in keys or values.
- URL-Encoded Key-Value Pairs:
  - Format: key1=value1&key2=value+with+spaces&key3=value3%2Cwith%2Ccommas
  - Pros: Well-established standard, handles spaces and special characters well.
  - Cons: Requires URL encoding and decoding, adds slightly more overhead, and can be less readable.
Recommended Approach: Custom Delimiter with Simple Escaping
We recommend the Custom Delimiter with Simple Escaping approach for your use case. It’s a good balance between simplicity, readability, and the ability to handle spaces within values. It avoids the need to rely on complex URL encoding and also avoids the unreadability issue of using comma with backslash escaping.
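As an illustration of the recommended format, the sketch below joins attributes with the ^=^ and ~ delimiters. It is a hedged example rather than the portal's implementation: `formatAttributes` is a hypothetical name, and instead of escaping it simply rejects keys or values that contain a delimiter. It also omits the trailing ~, matching the att claim in the example token later in this document.

```javascript
const PAIR_DELIMITER = "^=^";
const ATTRIBUTE_DELIMITER = "~";

// Hypothetical helper: join attributes into a single claim string.
// It rejects keys or values that contain a delimiter instead of escaping them.
function formatAttributes(attributes) {
  return Object.entries(attributes)
    .map(([key, value]) => {
      const text = String(value);
      for (const part of [key, text]) {
        if (part.includes(PAIR_DELIMITER) || part.includes(ATTRIBUTE_DELIMITER)) {
          throw new Error(`Delimiter found in attribute "${key}"`);
        }
      }
      return `${key}${PAIR_DELIMITER}${text}`;
    })
    .join(ATTRIBUTE_DELIMITER);
}

console.log(formatAttributes({ country: "CAN", security_clearance_level: 2 }));
// country^=^CAN~security_clearance_level^=^2
```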
JWT Security Claims
Using the tables defined above, follow these steps to create an authorization code token with user security claims:
- uid: The entity_id (e.g., employee_id for employees and customer_id for customers) should be assigned to the uid claim in the JWT. This uid will be used by the response transformer to filter the response for the user and must represent a business identifier.
  Examples:
  - Employee: Use the ACF2 ID as the uid.
  - Customer: Use the CIF ID as the uid (e.g., in a banking context).
- role: Include a list of roles associated with the user.
- grp: Add a list of groups the user belongs to.
- att: Include a list of key-value pairs representing user attributes.
- pos: Include a list of positions for the user.
- host: The host of the user.
Example Token
eyJraWQiOiJUal9sX3RJQlRnaW5PdFFiTDBQdjV3IiwiYWxnIjoiUlMyNTYifQ.eyJpc3MiOiJ1cm46Y29tOm5ldHdvcmtudDpvYXV0aDI6djEiLCJhdWQiOiJ1cm46Y29tLm5ldHdvcmtudCIsImV4cCI6MTczNDA2NDU5NSwianRpIjoicEs4WEtDZkU1aVFSdWdlQThJWXBwZyIsImlhdCI6MTczNDA2Mzk5NSwibmJmIjoxNzM0MDYzODc1LCJ2ZXIiOiIxLjAiLCJ1aWQiOiJzaDM1IiwidXR5IjoiRSIsImNpZCI6ImY3ZDQyMzQ4LWM2NDctNGVmYi1hNTJkLTRjNTc4NzQyMWU3MiIsImNzcmYiOiItTUN4OGhZRlF1bVZ3NFZkRDVHbEd3Iiwic2NwIjpbInBvcnRhbC5yIiwicG9ydGFsLnciLCJyZWYuciIsInJlZi53Il0sInJvbGUiOiJhZG1pbiB1c2VyIiwiYzEiOiIzNjEiLCJjMiI6IjY3IiwiZ3JwIjoiZGVsZXRlIGluc2VydCBzZWxlY3QgdXBkYXRlIiwiYXR0IjoiY291bnRyeV49XkNBTn5wZXJhbmVudCBlbXBsb3llZV49XnRydWV-c2VjdXJpdHlfY2xlYXJhbmNlX2xldmVsXj1eMiIsInBvcyI6IkFQSVBsYXRmb3JtRGVsaXZlcnkiLCJob3N0IjoiTjJDTXcwSEdRWGVMdkMxd0JmbG4yQSJ9.Gky_rR9hreP04GZm-0H_HBBAeDIPhQ9tsNuZclUzTdkMrYay40kcNk4jWkPdMcxfIfIbGj2eqSQgNhkBuym2yc6HsRF0nukZhYSGklVNXFe3R-0DdKwxxWyqvXyWDvrQtme0ttT2tYGTRRCZXnHDRMUFeDSz7kVjjIj3WymjFyxWBnWnBOjYqDL34652Fb8c7hWME0nSxbWO0ZvPRDhRM-l0nDGNm2ojq-3sjaU_pRywYahXP-wtnNSLwvctFgONPWSM9Ie6FqwRmYBFVo8OE0VdTRvUfnO4mL1O2UbTfxzbNJFv4HP1mSZG_SSB5j3t_RuZLfUMIajFi105ze2PUg
And the payload:
{
"iss": "urn:com:networknt:oauth2:v1",
"aud": "urn:com.networknt",
"exp": 1734064595,
"jti": "pK8XKCfE5iQRugeA8IYppg",
"iat": 1734063995,
"nbf": 1734063875,
"ver": "1.0",
"uid": "sh35",
"uty": "E",
"cid": "f7d42348-c647-4efb-a52d-4c5787421e72",
"csrf": "-MCx8hYFQumVw4VdD5GlGw",
"scp": [
"portal.r",
"portal.w",
"ref.r",
"ref.w"
],
"role": "admin user",
"c1": "361",
"c2": "67",
"grp": "delete insert select update",
"att": "country^=^CAN~peranent employee^=^true~security_clearance_level^=^2",
"pos": "APIPlatformDelivery",
"host": "N2CMw0HGQXeLvC1wBfln2A"
}
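To inspect such a payload in code, the middle segment of the token can be base64url-decoded. The sketch below assumes Node.js, where `Buffer` supports the `base64url` encoding; `decodeJwtPayload` and `claimToList` are hypothetical helper names. It does not verify the signature, so it is suitable for debugging only and not for authorization decisions.

```javascript
// Decode the payload segment of a JWT without verifying the signature.
// Assumes Node.js, where Buffer supports the "base64url" encoding.
function decodeJwtPayload(token) {
  const segments = token.split(".");
  if (segments.length !== 3) {
    throw new Error("Expected a JWT with three dot-separated segments");
  }
  return JSON.parse(Buffer.from(segments[1], "base64url").toString("utf8"));
}

// Split a space-separated claim such as role, grp, or pos into a list.
function claimToList(value) {
  return value ? value.split(" ").filter(Boolean) : [];
}

// Build a small unsigned sample token for demonstration purposes.
const samplePayload = { uid: "sh35", role: "admin user", pos: "APIPlatformDelivery" };
const sampleToken = [
  "e30", // base64url of "{}" as a placeholder header
  Buffer.from(JSON.stringify(samplePayload)).toString("base64url"),
  "signature-placeholder",
].join(".");

const claims = decodeJwtPayload(sampleToken);
console.log(claims.uid);               // sh35
console.log(claimToList(claims.role)); // admin and user as separate entries
```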
Group and Position Management
Define Groups Related to User Category
You can create groups that align with teams, departments, or other organizational units. These groups are relatively static and reflect the overall organizational structure. Use a separate table, group_t, as described earlier, to store these groups. Groups can be applied to all users regardless of their user type.
Use the Employee Reporting Structure to Manage Positions
Positions are similar to groups in managing user permissions, but they leverage the organizational reporting structure to propagate permissions between team members and their direct manager.
- Position Flags
  Each position in the position_t table has two flags:
  - inherit_to_ancestor: Determines if the position is inherited by the direct manager (the ancestor in the reporting structure).
  - inherit_to_sibling: Determines if the position is inherited by team members (siblings) under the same manager.
- Responsibilities
  The application is responsible for propagating positions:
  - Between Siblings: Assigning inherited positions to team members under the same manager.
  - To the Manager: Assigning inherited positions to the direct manager.
- User Interface for Position Management
  A user interface (UI) can be implemented to simplify position management:
  - Feature: List all potential inherited positions for selection when adding a new user or changing a manager.
  - Functionality: Allow administrators to choose specific positions to inherit for users and managers dynamically.
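The propagation responsibility described above might look like the following. This is a sketch under assumptions: it treats inherit_to_ancestor as propagation to the direct manager (type D rows) and inherit_to_sibling as propagation to peers under the same manager (type S rows), it propagates one level only, and the in-memory row shapes and the `inheritedAssignments` name are hypothetical.

```javascript
// Hypothetical in-memory rows standing in for employee_t, position_t, and
// the own (type P) rows of employee_position_t.
const employees = [
  { employeeId: "m1", managerId: null },
  { employeeId: "e1", managerId: "m1" },
  { employeeId: "e2", managerId: "m1" },
];
const positions = {
  Developer: { inheritToAncestor: true, inheritToSibling: false },
  OnCall: { inheritToAncestor: false, inheritToSibling: true },
};
const ownAssignments = [
  { employeeId: "e1", positionId: "Developer" },
  { employeeId: "e1", positionId: "OnCall" },
];

// Derive inherited rows: type D goes to the direct manager, type S goes to
// siblings under the same manager. Only one level is propagated here.
function inheritedAssignments(employees, positions, ownAssignments) {
  const managerOf = new Map(employees.map((e) => [e.employeeId, e.managerId]));
  const inherited = [];
  for (const { employeeId, positionId } of ownAssignments) {
    const managerId = managerOf.get(employeeId);
    const flags = positions[positionId];
    if (!managerId || !flags) continue; // nothing to propagate
    if (flags.inheritToAncestor) {
      inherited.push({ employeeId: managerId, positionId, positionType: "D" });
    }
    if (flags.inheritToSibling) {
      for (const peer of employees) {
        if (peer.managerId === managerId && peer.employeeId !== employeeId) {
          inherited.push({ employeeId: peer.employeeId, positionId, positionType: "S" });
        }
      }
    }
  }
  return inherited;
}

console.log(inheritedAssignments(employees, positions, ownAssignments));
```

With the sample data, the manager m1 receives Developer as a type D row and the sibling e2 receives OnCall as a type S row.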
Use Both Groups and Positions
You can choose to use both groups and positions for your organization. However, you need to ensure that groups and positions categorize users across different dimensions. In general, groups should be used for customers, while positions should be used for employees.
User Login Query
Here is the query to run against the database tables upon a user login request:
SELECT
u.user_id,
u.user_type,
CASE
WHEN u.user_type = 'E' THEN e.employee_id
WHEN u.user_type = 'C' THEN c.customer_id
ELSE NULL
END AS entity_id,
CASE WHEN u.user_type = 'E' THEN string_agg(DISTINCT p.position_name, ' ' ORDER BY p.position_name) ELSE NULL END AS positions,
string_agg(DISTINCT r.role_name, ' ' ORDER BY r.role_name) AS roles,
string_agg(DISTINCT g.group_name, ' ' ORDER BY g.group_name) AS groups,
CASE
WHEN COUNT(DISTINCT at.attribute_name || '^=^' || aut.attribute_value) > 0 THEN string_agg(DISTINCT at.attribute_name || '^=^' || aut.attribute_value, '~' ORDER BY at.attribute_name || '^=^' || aut.attribute_value)
ELSE NULL
END AS attributes
FROM
user_t AS u
LEFT JOIN
user_host_t AS uh ON u.user_id = uh.user_id
LEFT JOIN
role_user_t AS ru ON u.user_id = ru.user_id
LEFT JOIN
role_t AS r ON ru.host_id = r.host_id AND ru.role_id = r.role_id
LEFT JOIN
attribute_user_t AS aut ON u.user_id = aut.user_id
LEFT JOIN
attribute_t AS at ON aut.host_id = at.host_id AND aut.attribute_id = at.attribute_id
LEFT JOIN
group_user_t AS gu ON u.user_id = gu.user_id
LEFT JOIN
group_t AS g ON gu.host_id = g.host_id AND gu.group_id = g.group_id
LEFT JOIN
employee_t AS e ON uh.host_id = e.host_id AND u.user_id = e.user_id
LEFT JOIN
customer_t AS c ON uh.host_id = c.host_id AND u.user_id = c.user_id
LEFT JOIN
employee_position_t AS ep ON e.host_id = ep.host_id AND e.employee_id = ep.employee_id
LEFT JOIN
position_t AS p ON ep.host_id = p.host_id AND ep.position_id = p.position_id
WHERE
u.email = '[email protected]'
GROUP BY
u.user_id, u.user_type, e.employee_id, c.customer_id;
And here is an example result from the test database:
utgdG50vRVOX3mL1Kf83aA E sh35 APIPlatformDelivery admin user delete insert select update country^=^CAN~peranent employee^=^true~security_clearance_level^=^2
Parse Attribute String
The query above returns attributes in a customized format. These attributes can be parsed using the Util.parseAttributes method available in the light-4j utility module.
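For clients that are not written in Java, the same string can be parsed with a few lines of code. The following is a hypothetical JavaScript counterpart and not the light-4j Util.parseAttributes code; it assumes ~ separates attributes and ^=^ separates each key from its value, as in the example result above.

```javascript
// Hypothetical JavaScript counterpart (not the light-4j implementation):
// parse "key^=^value~key^=^value" into a plain object of string values.
function parseAttributes(text) {
  const result = {};
  if (!text) return result;
  for (const pair of text.split("~")) {
    if (pair === "") continue; // tolerate a trailing "~"
    const index = pair.indexOf("^=^");
    if (index < 0) continue; // skip malformed entries
    result[pair.slice(0, index)] = pair.slice(index + 3);
  }
  return result;
}

const parsed = parseAttributes(
  "country^=^CAN~peranent employee^=^true~security_clearance_level^=^2"
);
console.log(parsed.country);                  // CAN
console.log(parsed.security_clearance_level); // 2
```

All values come back as strings, so callers would cast them according to the attribute_type defined in attribute_t.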
Portal View and Default Role
Given the flexibility of fine-grained authorization approaches, users can choose one or more methods to suit their business requirements. However, in scenarios where RBAC (Role-Based Access Control) is not utilized, the role claim may not exist in the custom claims of the JWT token.
Handling Missing role in JWT
For the portal-view application, at least one role is required to filter menu items. To address cases where no roles are present in the JWT:
- Default Role Assignment:
  If the role claim is absent in the JWT, the system will:
  - Assign a default role, "user", to ensure compatibility.
  - Include this role in a roles field in the browser cookie.
- Cookie Roles Field:
  - The roles field in the cookie will contain a single role: "user".
  - This ensures the portal-view can still function as expected by displaying the appropriate menu items for users.
Example Workflow
- A user authenticates, and their JWT is generated without a role claim.
- During authentication handling:
  - The StatelessAuthHandler checks for the presence of the role claim.
  - If no roles are found, the "user" role is added to the roles field in the cookie.
- The portal-view reads the roles field from the cookie to filter menu items appropriately.
This approach provides a seamless experience while maintaining compatibility with applications requiring roles for authorization or UI customization.
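The fallback logic can be expressed in a few lines. This is a minimal sketch and not the StatelessAuthHandler code; `rolesForCookie` is a hypothetical name, and the sketch assumes the role claim is a space-separated string as in the example token.

```javascript
const DEFAULT_ROLE = "user";

// Hypothetical helper, not the StatelessAuthHandler code: choose the value
// for the roles field in the cookie from the JWT claims.
function rolesForCookie(claims) {
  const role = typeof claims.role === "string" ? claims.role.trim() : "";
  return role === "" ? DEFAULT_ROLE : role;
}

console.log(rolesForCookie({ uid: "sh35", role: "admin user" })); // admin user
console.log(rolesForCookie({ uid: "sh35" }));                     // user
```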
Private Messages
Problem
Portal users need a way to exchange private messages from the user profile without exposing email addresses to each other. The sender should only need a recipient user id or a display-safe user label. The backend can resolve email internally when it needs to send an external notification, but email must not be part of the user-facing message contract.
Current State
The current codebase already has a partial private-message skeleton:
- user-command exposes lightapi.net/user/sendMessage/0.1.0.
- The sendMessage request contains userId, subject, and content.
- light-portal defines PrivateMessageSentEvent.
- portal-db defines message_t.
- portal-view has a mail menu, a private messages page, and a privateMessage form.
- user-query exposes lightapi.net/user/getPrivateMessage/0.1.0.
The current implementation is not complete enough to support production use:
- GetPrivateMessage has its real implementation commented out and currently returns null.
- SendMessage resolves the recipient through queryUserById, then stores the whole response as toEmail. That lookup currently returns too much user data, including email and sensitive fields that should not be exposed through a peer messaging flow.
- SendMessage does not put fromId into event data, but the projection code reads fromId from event data.
- The message_t table now has host_id NOT NULL, but the projection insert does not write host_id.
- The table is inbox-style storage, keyed by sender and nonce, and does not model conversations, read state, participant visibility, or per-user delete.
- The UI mostly relies on the mail menu response and navigation state. The messages page should load its own data from the query API.
- The existing private-message tests are disabled stubs.
Goals
- Let one logged-in user send a message to another portal user without knowing or seeing the recipient email.
- Keep the message model host-scoped so tenant boundaries are explicit.
- Derive sender identity from the authorization token, not from form input.
- Store user ids in message records and events. Do not store recipient email in the message projection unless a short migration bridge requires it.
- Support an inbox page, unread badge, conversation view, reply, read state, and per-user hide/delete.
- Keep email notification as an optional side effect that resolves the recipient email internally.
- Provide a path from the existing message_t skeleton to a conversation-based model without breaking existing UI routes immediately.
Non-Goals
- Do not build group chat in the first phase.
- Do not expose email addresses in message APIs, events, UI state, or task context.
- Do not use private messages as an audit or support-ticket system.
- Do not implement WebSocket or SSE push in the first phase. Polling is enough until the read/write model is stable.
- Do not make public user lookup broader as part of this feature.
Privacy Rules
Private messages should be user-id based at every external boundary.
The UI may show:
- Display name.
- Avatar or initials.
- User id when no better label exists.
- Message subject, preview, content, and timestamps.
The UI must not show:
- Sender email.
- Recipient email.
- Password, token, nonce, or other profile internals from user_t.
The backend may resolve recipient email only inside trusted server code for
external email notification. That internal lookup should return the minimum
fields required, ideally user_id, email, current host membership, and a
display label.
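One way to enforce these rules at the API boundary is an explicit allow-list of fields. The sketch below is illustrative only; the camelCase property names and the `toPublicUser` helper are assumptions and do not describe an existing contract.

```javascript
// Fields that may reach the UI under the privacy rules above. The property
// names are assumptions, not an existing API contract.
const PUBLIC_FIELDS = ["userId", "displayName", "avatarUrl"];

// Copy only allow-listed fields so that email, password, token, and nonce
// can never leak by accident when the user record gains new columns.
function toPublicUser(user) {
  const result = {};
  for (const field of PUBLIC_FIELDS) {
    if (user[field] !== undefined) {
      result[field] = user[field];
    }
  }
  return result;
}

const internalUser = {
  userId: "utgdG50vRVOX3mL1Kf83aA",
  displayName: "Sample User",
  email: "hidden@example.com",
  nonce: 7,
};
console.log(toPublicUser(internalUser));
```

An allow-list is preferable to deleting known sensitive fields, because anything not named is dropped by default.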
Recommended Data Model
For a chat-like experience, introduce conversation identity instead of treating each message as an isolated inbox row.
CREATE TABLE private_conversation_t (
host_id UUID NOT NULL,
conversation_id UUID NOT NULL,
participant_low_id UUID NOT NULL,
participant_high_id UUID NOT NULL,
created_ts TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_message_id UUID NULL,
last_message_ts TIMESTAMP WITH TIME ZONE NULL,
PRIMARY KEY (host_id, conversation_id),
UNIQUE (host_id, participant_low_id, participant_high_id),
FOREIGN KEY (host_id) REFERENCES host_t(host_id) ON DELETE CASCADE
);
participant_low_id and participant_high_id are the two sorted user ids. This
gives each pair of users one stable conversation per host without relying on
email.
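As an illustration of the sorted-pair rule, the following Python sketch derives the (participant_low_id, participant_high_id) values for two user ids. The function name is hypothetical, and how self-messages are handled is left to the product decision discussed later.

```python
import uuid
from typing import Tuple


def conversation_pair(user_a: uuid.UUID, user_b: uuid.UUID) -> Tuple[uuid.UUID, uuid.UUID]:
    """Return (participant_low_id, participant_high_id) for two user ids.

    The same pair is returned regardless of who sends first, so the
    UNIQUE (host_id, participant_low_id, participant_high_id) constraint
    maps each pair of users to one conversation per host.
    """
    low, high = sorted((user_a, user_b))
    return low, high
```

Calling the function with the arguments swapped yields the same tuple, which is what lets the command side look up an existing conversation before creating a new one.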
CREATE TABLE private_message_t (
host_id UUID NOT NULL,
message_id UUID NOT NULL,
conversation_id UUID NOT NULL,
from_user_id UUID NOT NULL,
to_user_id UUID NOT NULL,
subject VARCHAR(256) NULL,
content TEXT NOT NULL,
send_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY (host_id, message_id),
FOREIGN KEY (host_id, conversation_id)
REFERENCES private_conversation_t(host_id, conversation_id)
ON DELETE CASCADE
);
CREATE TABLE private_message_state_t (
host_id UUID NOT NULL,
message_id UUID NOT NULL,
user_id UUID NOT NULL,
read_ts TIMESTAMP WITH TIME ZONE NULL,
deleted_ts TIMESTAMP WITH TIME ZONE NULL,
PRIMARY KEY (host_id, message_id, user_id),
FOREIGN KEY (host_id, message_id)
REFERENCES private_message_t(host_id, message_id)
ON DELETE CASCADE
);
The state table keeps read and delete behavior per participant. A user deleting a message should hide it from that user only. It should not erase the other participant’s copy.
Recommended indexes:
CREATE INDEX idx_private_conversation_last_message
ON private_conversation_t (host_id, participant_low_id, participant_high_id, last_message_ts DESC);
CREATE INDEX idx_private_message_conversation_ts
ON private_message_t (host_id, conversation_id, send_ts DESC);
CREATE INDEX idx_private_message_to_user_ts
ON private_message_t (host_id, to_user_id, send_ts DESC);
CREATE INDEX idx_private_message_state_unread
ON private_message_state_t (host_id, user_id)
WHERE read_ts IS NULL AND deleted_ts IS NULL;
If the first implementation needs to reuse message_t, treat it as a migration
bridge only. Add from_user_id, to_user_id, message_id, read_ts, and
per-user delete columns, then migrate to the conversation tables once the API
contract is stable.
Event Model
Keep the event-driven command/query pattern. A message send should create a CloudEvent and the query-side projection should update the private-message tables.
Recommended event data:
{
"hostId": "019...",
"conversationId": "019...",
"messageId": "019...",
"fromUserId": "019...",
"toUserId": "019...",
"subject": "Question about the API",
"content": "Can you take a look at this?"
}
fromUserId and hostId are derived from the token. toUserId, subject, and
content come from validated request data. conversationId can be generated by
the command side after looking up or creating the pair conversation, or it can
be derived during projection from the participant pair.
Do not put toEmail into PrivateMessageSentEvent. Email notification should
be a separate trusted server-side action.
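A minimal Python sketch of how the event data could be assembled is shown below. It only illustrates which fields come from the token and which come from the request. The claim names "uid" and "host" and the function name are assumptions for this sketch, not the portal's actual token layout.

```python
from typing import Any, Dict


def build_message_sent_data(
    claims: Dict[str, Any],
    request: Dict[str, Any],
    conversation_id: str,
    message_id: str,
) -> Dict[str, Any]:
    """Build the PrivateMessageSentEvent data from trusted and validated inputs."""
    # Sender and host come from the verified token, never from the request body.
    # Only whitelisted request fields are copied, so a stray toEmail or
    # fromUserId in the request cannot reach the event.
    return {
        "hostId": claims["host"],
        "conversationId": conversation_id,
        "messageId": message_id,
        "fromUserId": claims["uid"],
        "toUserId": request["toUserId"],
        "subject": request.get("subject"),
        "content": request["content"],
    }
```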
API Contracts
Send Message
Keep the existing sendMessage action name for compatibility, but change the
contract to be user-id based.
{
"toUserId": "019...",
"conversationId": "019...",
"subject": "Question about the API",
"content": "Can you take a look at this?"
}
conversationId is optional. If absent, the backend resolves or creates the
conversation for the current user and toUserId.
Server responsibilities:
- Require an authorization-code token.
- Derive fromUserId from the token.
- Derive hostId from the active user host.
- Validate that toUserId belongs to the same host.
- Reject empty content and enforce size limits.
- Optionally reject self-messages unless a product decision allows notes to self.
- Write the event through the existing command event-store path.
- Send optional external email notification after the command is accepted.
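The validation items in the list above can be sketched as a single check that runs before the event is written. This is an illustrative Python sketch: the size limit, the error strings, and the host_member_ids lookup are assumptions, and a real handler would query host membership rather than receive a set.

```python
from typing import Any, Dict, Optional, Set

MAX_CONTENT_CHARS = 4000  # assumed limit; the design does not fix a number


def validate_send_request(
    sender_id: str,
    request: Dict[str, Any],
    host_member_ids: Set[str],
    allow_self: bool = False,
) -> Optional[str]:
    """Return an error message, or None when the request is acceptable."""
    to_user_id = request.get("toUserId")
    if not to_user_id or to_user_id not in host_member_ids:
        return "recipient is not a member of the current host"
    if to_user_id == sender_id and not allow_self:
        return "self-messages are not allowed"
    content = (request.get("content") or "").strip()
    if not content:
        return "content is empty"
    if len(content) > MAX_CONTENT_CHARS:
        return "content is too long"
    return None
```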
Conversation List
Add or evolve a query endpoint for the inbox list.
{
"offset": 0,
"limit": 25
}
The backend derives hostId and userId from the token. The response should
include only conversations involving the current user.
{
"total": 1,
"conversations": [
{
"conversationId": "019...",
"otherUserId": "019...",
"otherUserLabel": "Jane Smith",
"lastMessageTs": "2026-05-08T13:30:00Z",
"lastMessagePreview": "Can you take a look at this?",
"unreadCount": 2
}
]
}
Conversation Messages
{
"conversationId": "019...",
"offset": 0,
"limit": 50
}
The backend validates that the current user is one of the participants.
{
"conversationId": "019...",
"messages": [
{
"messageId": "019...",
"fromUserId": "019...",
"fromUserLabel": "Jane Smith",
"subject": "Question about the API",
"content": "Can you take a look at this?",
"sendTs": "2026-05-08T13:30:00Z",
"read": false
}
]
}
Unread Count
The mail badge should call a count endpoint instead of loading all messages.
{
"count": 3
}
Mark Read and Delete
markPrivateConversationRead should mark unread rows in
private_message_state_t for the current user and conversation.
deletePrivateMessage or hidePrivateConversation should set deleted_ts for
the current user only.
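The per-user semantics of unread count, mark-read, and hide can be sketched over in-memory state rows. In this Python illustration each row stands for a private_message_state_t row already joined to its message, so it also carries conversation_id. The function names and the row shape are assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Keys: message_id, conversation_id, user_id, read_ts, deleted_ts
StateRow = Dict[str, Any]


def unread_count(rows: List[StateRow], user_id: str) -> int:
    """Count rows the user has neither read nor hidden."""
    return sum(
        1 for r in rows
        if r["user_id"] == user_id and r["read_ts"] is None and r["deleted_ts"] is None
    )


def mark_conversation_read(rows: List[StateRow], user_id: str,
                           conversation_id: str, now: datetime) -> int:
    """Set read_ts on the current user's unread rows in one conversation."""
    changed = 0
    for r in rows:
        if (r["user_id"] == user_id and r["conversation_id"] == conversation_id
                and r["read_ts"] is None):
            r["read_ts"] = now
            changed += 1
    return changed


def hide_message(rows: List[StateRow], user_id: str,
                 message_id: str, now: datetime) -> int:
    """Set deleted_ts for the current user only; other participants keep their copy."""
    changed = 0
    for r in rows:
        if (r["user_id"] == user_id and r["message_id"] == message_id
                and r["deleted_ts"] is None):
            r["deleted_ts"] = now
            changed += 1
    return changed
```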
Operational Cleanup
Private messages are user content, not operational status rows. They should not be hard-deleted only because they are old while either participant can still see them.
The operational cleanup job may purge active private-message rows only when all
participant state rows for the message have deleted_ts set and the latest
deleted_ts is older than privateMessageRetentionDays.
Cleanup responsibilities:
- Select purge candidates from private_message_t joined to private_message_state_t.
- Require every participant state row for the message to have deleted_ts set.
- Use MAX(deleted_ts) as the retention clock so the grace period starts after the last participant deletes the message.
- Delete private_message_state_t rows first, then delete the private_message_t row in the same transaction.
- Leave private_conversation_t rows in place so the participant pair keeps a stable conversation identity if a new message is sent later.
- Skip private-message cleanup when privateMessageRetentionDays is less than or equal to zero.
The cleanup job should not purge visible messages, partially deleted messages, or recently deleted-by-all messages. A separate maximum retention policy for undeleted private messages would need an explicit product/security decision.
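The purge rule can be expressed as a small predicate over the deleted_ts values of one message's state rows. The Python sketch below is illustrative only. The function name is hypothetical, and a real job would evaluate the same condition in SQL.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def is_purgeable(deleted_ts_list: List[Optional[datetime]],
                 now: datetime, retention_days: int) -> bool:
    """Decide whether a message may be hard-deleted by the cleanup job.

    deleted_ts_list holds deleted_ts for every participant state row
    of the message (None means the participant can still see it).
    """
    if retention_days <= 0:
        return False  # cleanup is disabled
    if not deleted_ts_list or any(ts is None for ts in deleted_ts_list):
        return False  # visible or only partially deleted
    # The retention clock starts when the last participant deleted the message.
    return max(deleted_ts_list) < now - timedelta(days=retention_days)
```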
Authorization
The command and query handlers must not trust user ids supplied by the client for the current user. The current user is always the token subject.
Rules:
- A sender can send only as themself.
- A user can read only conversations where they are a participant.
- A user can mark read or delete only their own state rows.
- Admin visibility should be a separate explicit support/admin endpoint if it is needed later.
- Cross-host messaging should be rejected in the first phase. If cross-host messaging is later needed, the contract must model the recipient host explicitly and pass a product/security review.
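The participant rule can be sketched as one lookup that both enforces access and yields the otherUserId used in list responses. This Python sketch assumes a conversation row shaped like private_conversation_t. The function name is hypothetical.

```python
from typing import Any, Dict, Optional


def other_participant(conversation: Dict[str, Any],
                      host_id: str, user_id: str) -> Optional[str]:
    """Return the other participant id, or None when the caller has no access.

    host_id and user_id must come from the token, not from the request.
    """
    if conversation["host_id"] != host_id:
        return None  # cross-host access is rejected in the first phase
    low = conversation["participant_low_id"]
    high = conversation["participant_high_id"]
    if user_id == low:
        return high
    if user_id == high:
        return low
    return None  # not a participant
```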
Portal View
Use the current profile surfaces but make them data-driven:
- MailMenu should poll unread count and show a small list of recent conversations only after the menu opens.
- /app/messages should fetch conversation data directly. It should not depend on location.state from MailMenu.
- The privateMessage form should use toUserId, not userId, to avoid confusing recipient identity with the current user.
- Reply should prefill toUserId and optionally conversationId.
- User-facing labels should come from a display-safe user label endpoint.
- Empty inbox, loading, and error states should be explicit.
The first UI can be an inbox plus conversation thread. Real-time typing, presence, attachments, and rich-text editing are later enhancements.
Migration Plan
Phase 0: Stop the Broken Behavior
- Make GetPrivateMessage return valid JSON even before the new model is complete.
- Fix the existing projection insert to include host_id if message_t remains in use.
- Ensure SendMessage stores sender identity from the token.
- Stop using broad queryUserById output as a recipient email value.
Phase 1: User-ID Based Backend
- Add the conversation/message/state tables.
- Update PrivateMessageSentEvent to use fromUserId and toUserId.
- Add a trusted recipient resolver that returns only internal fields needed for validation and optional email notification.
- Implement conversation list, conversation messages, unread count, mark-read, and hide/delete APIs.
Phase 2: Portal View
- Update the mail badge to use unread count.
- Update /app/messages to load data directly.
- Update the privateMessage form and reply paths to use toUserId.
- Remove email assumptions from task context and UI state.
Phase 3: Cleanup
- Remove to_email from the active private-message path.
- Remove disabled private-message tests and replace them with focused coverage.
- Ensure operational cleanup targets the active private-message tables and purges only messages deleted by all participants after the retention window.
- Add optional push delivery later if polling becomes insufficient.
Testing
Backend tests should cover:
- Sender is derived from token and cannot be spoofed.
- Recipient must belong to the current host.
- Message event contains user ids, not emails.
- Projection writes host-scoped conversation and message rows.
- Inbox query returns only conversations for the current user.
- Conversation query rejects non-participants.
- Unread count increments for the recipient and clears after mark-read.
- Delete/hide affects only the current user’s state.
- Operational cleanup purges only messages deleted by all participants after retention and keeps visible, partially deleted, and recently deleted messages.
Frontend tests should cover:
- Mail menu shows unread count without loading full inbox.
- Messages page fetches its own data.
- Reply pre-populates recipient context without email.
- Empty and error states do not produce JSON parse failures.
Open Questions
- Should users be able to send messages to themselves as private notes?
- Should profile pages expose a “Message” action only for users in the same host, or should some cross-host flows be allowed?
- Should email notification include the sender display label, or only say that a portal message was received?
- Should any maximum retention policy apply to undeleted private messages?
- Should administrators have a separate support/audit view, and under what permission?
Config Server
Default Config Properties
For each config class in light-4j modules, we use annotations to generate schemas for the config files with default values, comments and validation rules.
As a one-time step, we also generate events to load all the properties into the light-portal. These events create a baseline of the config properties with default values. Events in this initial population do not have a version.
For each version release, we will create and attach an event.json file with the changes to the properties. Most likely, each release adds some properties with default values. All events in this file have a version associated. Once replayed on the portal, the updates for that version are populated.
On the portal UI, we load all properties and default values from the database as a union of the baseline properties and all versions up to and including the current version.
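The union of baseline and versioned defaults can be sketched as follows. This Python illustration assumes dotted numeric version strings and assumes that, when the same property appears more than once, the entry from the latest applicable version wins. Neither assumption is stated by the design, and the property names in the usage example are made up.

```python
from typing import Dict, List, Optional, Tuple

# (version or None for baseline, property name, default value)
PropertyRow = Tuple[Optional[str], str, str]


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.1.5' into (2, 1, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def effective_defaults(rows: List[PropertyRow], current_version: str) -> Dict[str, str]:
    """Union of baseline rows and rows for versions <= current_version."""
    current = parse_version(current_version)
    applicable = [
        r for r in rows if r[0] is None or parse_version(r[0]) <= current
    ]
    # Baseline first, then ascending versions, so later releases overwrite.
    applicable.sort(key=lambda r: (r[0] is not None, parse_version(r[0]) if r[0] else ()))
    result: Dict[str, str] = {}
    for _, name, value in applicable:
        result[name] = value
    return result
```

For example, with a baseline property, one property added in 2.1.0, and one added in 2.2.0, a product at version 2.1.5 sees only the first two.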
Instance Config Snapshot
Once a logical instance is created on the light-portal, we need to provide the product_version_id, which maps to a specific product version. We also need to provide runtime configuration and deployment configuration for the instance to start the server and deploy it to a target environment. Configuration updates can be a process of discovery and may take several revisits to complete. If a user makes a mistake, they might want to roll back the previous changes to a snapshot version and start over. During deployment, we also need to save and tag the snapshot version so that we can roll back to the previous deployment configuration snapshot if the deployment fails.
The above requirements force us to create a table that records all the commits for config updates at the instance level. A commit is like a GitHub commit that groups several updates together. The user needs to explicitly click the commit button on the UI so that the server runs the query that populates the snapshot table and creates a new snapshot id.
During the deployment, the deployment service will invoke the config server to force a commit and also link that commit to a deployment id, just like a tag in GitHub.
To meet the requirements above, we need to design tables that store immutable snapshots associated with a commitId/snapshotId to provide reliable rollback points.
Snapshot tables
CREATE TABLE config_snapshot_t (
snapshot_id UUID NOT NULL, -- Primary Key, maybe UUIDv7 for time ordering
snapshot_ts TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
snapshot_type VARCHAR(32) NOT NULL, -- e.g., 'DEPLOYMENT', 'USER_SAVE', 'SCHEDULED_BACKUP'
description TEXT, -- User-provided description or system-generated info
user_id UUID, -- User who triggered it (if applicable)
deployment_id UUID, -- FK to deployment_t if snapshot_type is 'DEPLOYMENT'
-- Scope columns define WHAT this snapshot represents:
scope_host_id UUID NOT NULL, -- Host context (always needed)
scope_config_phase CHAR(1) NOT NULL, -- config phase context(required)
scope_environment VARCHAR(16), -- Environment context (if snapshot is env-specific)
scope_product_id VARCHAR(8), -- Product id context
scope_product_version VARCHAR(12), -- Product version context
scope_service_id VARCHAR(512), -- Service id context
scope_api_id VARCHAR(16), -- Api id context
scope_api_version VARCHAR(16), -- Api version context
PRIMARY KEY(snapshot_id),
FOREIGN KEY(deployment_id) REFERENCES deployment_t(deployment_id) ON DELETE SET NULL,
FOREIGN KEY(user_id) REFERENCES user_t(user_id) ON DELETE SET NULL,
FOREIGN KEY(scope_host_id) REFERENCES host_t(host_id) ON DELETE CASCADE
);
-- Index for finding snapshots by type or scope
CREATE INDEX idx_config_snapshot_scope ON config_snapshot_t (scope_host_id, scope_config_phase, scope_environment,
scope_product_id, scope_product_version, scope_service_id, scope_api_id, scope_api_version, snapshot_type, snapshot_ts);
CREATE INDEX idx_config_snapshot_deployment ON config_snapshot_t (deployment_id);
CREATE TABLE config_snapshot_property_t (
snapshot_property_id UUID NOT NULL, -- Surrogate primary key for easier referencing/updates if needed
snapshot_id UUID NOT NULL, -- FK to config_snapshot_t
config_id UUID NOT NULL, -- The config id
property_id UUID NOT NULL, -- The final property id
property_name VARCHAR(64) NOT NULL, -- The final property name
property_type VARCHAR(32) NOT NULL, -- The property type
property_value TEXT, -- The effective property value at snapshot time
value_type VARCHAR(32), -- Optional: Store the type (string, int, bool...) for easier parsing later
source_level VARCHAR(32), -- e.g., 'instance', 'product_version', 'environment', 'default'
PRIMARY KEY(snapshot_property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
-- Unique constraint to ensure one value per key within a snapshot
ALTER TABLE config_snapshot_property_t
ADD CONSTRAINT config_snapshot_property_uk UNIQUE (snapshot_id, config_id, property_id);
-- Index for quickly retrieving all properties for a snapshot
CREATE INDEX idx_config_snapshot_property_snapid ON config_snapshot_property_t (snapshot_id);
-- Snapshot of Instance API Overrides
CREATE TABLE snapshot_instance_api_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
instance_api_id UUID NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, instance_api_id, property_id), -- Composite PK matches original structure + snapshot_id
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_iapi_prop ON snapshot_instance_api_property_t (snapshot_id);
-- Snapshot of Instance App Overrides
CREATE TABLE snapshot_instance_app_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
instance_app_id UUID NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, instance_app_id, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_iapp_prop ON snapshot_instance_app_property_t (snapshot_id);
-- Snapshot of Instance App API Overrides
CREATE TABLE snapshot_instance_app_api_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
instance_app_id UUID NOT NULL,
instance_api_id UUID NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, instance_app_id, instance_api_id, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_iaappi_prop ON snapshot_instance_app_api_property_t (snapshot_id);
-- Snapshot of Instance Overrides
CREATE TABLE snapshot_instance_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
instance_id UUID NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, instance_id, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_inst_prop ON snapshot_instance_property_t (snapshot_id);
-- Snapshot of Environment Overrides (If needed for rollback)
CREATE TABLE snapshot_environment_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
environment VARCHAR(16) NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, environment, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_env_prop ON snapshot_environment_property_t (snapshot_id);
CREATE TABLE snapshot_product_property_t (
snapshot_id UUID NOT NULL,
product_id VARCHAR(8) NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, product_id, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_prd_prop ON snapshot_product_property_t (snapshot_id);
CREATE TABLE snapshot_product_version_property_t (
snapshot_id UUID NOT NULL,
host_id UUID NOT NULL,
product_version_id UUID NOT NULL,
property_id UUID NOT NULL,
property_value TEXT,
update_user VARCHAR (255) NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE NOT NULL,
PRIMARY KEY(snapshot_id, host_id, product_version_id, property_id),
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
CREATE INDEX idx_snap_pv_prop ON snapshot_product_version_property_t (snapshot_id);
How to generate rollback events
There are two options to generate rollback events, also known as compensating events.
Option 1. With historical events.
- Identify Target State: You have a snapshot_id representing the desired historical state.
- Find Snapshot Timestamp: Get the snapshot_ts from config_snapshot_t for the target snapshot_id.
- Query Events: Find all configuration events in your event store that:
  - Occurred after the snapshot_ts.
  - Relate to the specific scope (host, instance, environment, etc.) being rolled back.
- Generate Compensating Events: For each event found in step 3, create its logical inverse (a “compensating event”). For example:
  - InstancePropertyUpdated { propertyId: X, newValue: B, oldValue: A } -> InstancePropertyUpdated { propertyId: X, newValue: A, oldValue: B } (Requires storing oldValue in the original event).
  - InstancePropertyCreated { propertyId: X, value: A } -> InstancePropertyDeleted { propertyId: X, value: A } (Requires storing the value in the delete event for potential future rollback).
  - InstancePropertyDeleted { propertyId: X, value: A } -> InstancePropertyCreated { propertyId: X, value: A } (Requires storing the value in the delete event).
- Order Compensating Events: Sort the generated compensating events in the reverse chronological order of the original events they are compensating for.
- Replay Compensating Events: Apply these ordered compensating events through your event handling system.
Conceptually, this is a valid approach often used in event sourcing patterns (related to compensating transactions). However, it comes with significant challenges and complexities:
Challenges & Considerations:
- Generating Perfect Inverse Events: This is the hardest part.
  - Requires Rich Events: Your original events must contain enough information to construct their inverse. For updates, you need the oldValue. For creations, the delete needs the key. For deletions, the create needs the deleted value. If your current events don’t store this, you cannot reliably generate compensating events this way.
  - Complexity: For multi-step or complex operations, determining the exact inverse sequence can be non-trivial.
- Order of Operations: Compensating events MUST be applied in strict reverse order. Getting this wrong can lead to incorrect states.
- State Dependencies: Event handlers sometimes make assumptions about the state before the event is applied. Replaying compensating events might encounter unexpected states if other unrelated changes have occurred or if the reverse logic isn’t perfect, potentially causing handler errors.
- Performance: Querying potentially thousands of events, generating inverses, and replaying them might be slow, especially if the time gap between the snapshot and the present is large.
- Snapshot Data Not Used: This approach doesn’t directly leverage the known good state stored in config_snapshot_property_t. It relies solely on the ability to perfectly reverse subsequent events.
- Idempotency: Compensating event handlers should ideally be idempotent (applying them multiple times has the same effect as applying them once), although this is hard to guarantee for inverse operations.
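The compensating-event steps of Option 1 can be sketched in a few lines of Python. The sketch assumes each event is a dict with a "type" field and a sortable "ts" field, which are illustrative names rather than the portal's event schema, and it only covers the three event types used in the examples above.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def compensate(event: Event) -> Event:
    """Build the logical inverse of one configuration event."""
    kind = event["type"]
    if kind == "InstancePropertyUpdated":
        # Swap old and new values; requires oldValue in the original event.
        return {"type": kind, "propertyId": event["propertyId"],
                "newValue": event["oldValue"], "oldValue": event["newValue"]}
    if kind == "InstancePropertyCreated":
        return {"type": "InstancePropertyDeleted",
                "propertyId": event["propertyId"], "value": event["value"]}
    if kind == "InstancePropertyDeleted":
        return {"type": "InstancePropertyCreated",
                "propertyId": event["propertyId"], "value": event["value"]}
    raise ValueError("no inverse defined for event type " + kind)


def rollback_events(events_after_snapshot: List[Event]) -> List[Event]:
    """Invert events and order them newest-first for replay."""
    newest_first = sorted(events_after_snapshot, key=lambda e: e["ts"], reverse=True)
    return [compensate(e) for e in newest_first]
```

The ValueError branch shows the main weakness listed above: any event type without a well-defined inverse blocks the whole rollback.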
Option 2: Diff-based event generation.
- Get Target State: Fetch key-values from config_snapshot_property_t for snapshot_id. (TargetState)
- Get Current State: Run aggregation query for the current configuration. (CurrentState)
- Calculate Diff: Find differences between TargetState and CurrentState.
- Generate Corrective Events: Create events to transform CurrentState into TargetState.
  - If key is in TargetState but different/missing in CurrentState -> Generate Upsert[Level]Property event with the value from TargetState (applied at the highest relevant override level for the scope).
  - If key is in CurrentState but missing in TargetState -> Generate Delete[Level]Property event for the override that’s currently providing the value (likely the highest relevant override level).
- Apply Events: Apply these corrective events.
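The diff step of Option 2 can be sketched as a comparison of two key-value maps. In this Python illustration the event type names UpsertProperty and DeleteProperty stand in for the level-specific Upsert[Level]Property and Delete[Level]Property events. Choosing the override level is left out, and the property keys in the usage example are made up.

```python
from typing import Any, Dict, List


def corrective_events(target: Dict[str, str], current: Dict[str, str]) -> List[Dict[str, Any]]:
    """Generate events that transform CurrentState into TargetState."""
    events: List[Dict[str, Any]] = []
    # Keys in the snapshot that are missing or different now: upsert.
    for key in sorted(target):
        if key not in current or current[key] != target[key]:
            events.append({"type": "UpsertProperty", "key": key, "value": target[key]})
    # Keys that exist now but were absent in the snapshot: delete.
    for key in sorted(current):
        if key not in target:
            events.append({"type": "DeleteProperty", "key": key})
    return events
```

When the current state already equals the snapshot, the function returns an empty list, so applying the rollback twice generates no further events.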
Why the Diff-Based Approach is Often Preferred for Snapshot Rollback:
- Uses Known Good State: It directly uses the guaranteed state from the snapshot table.
- Less Reliant on Event Reversibility: It doesn’t matter if the original events are perfectly reversible or store old values. It focuses on achieving the target state from the current state.
- Potentially Fewer Events: Might generate fewer events than reversing a long history, focusing only on the net changes needed.
- More Direct: The generated events directly aim to establish the target state, which can feel less fragile than relying on reversing history.
Conclusion:
While reversing the events recorded since the snapshot is a recognized event sourcing pattern, it is often more complex and potentially more fragile than diff-based corrective event generation for the specific task of rolling back to a known snapshot state.
The diff-based method leverages the snapshot data directly and focuses on achieving the target state, making it generally more robust and often easier to implement correctly, as it doesn’t require perfectly reversible events.
How to create the snapshot
Let’s clarify how the scope_* columns in config_snapshot_t relate to the query that generates the snapshot and the override tables (*_property_t).
The Purpose of scope_* Columns:
The scope_* columns in config_snapshot_t serve one primary purpose: To record the specific context for which the snapshot was generated. They define what set of effective configuration values are stored in the associated config_snapshot_property_t rows.
Think of them as the input parameters that were used to run the aggregation query when the snapshot was created.
How They Are Used in the Snapshot Generation Query:
You do not need one scope_* column for every *_property_t table. Instead, the values you store in the scope_* columns are the parameters you pass into your aggregation query’s WHERE clauses to filter the rows from the relevant override tables according to the desired context.
Let’s refine the query strategy using the scope_* concept and aim for a more efficient query than repeated NOT EXISTS clauses (using ROW_NUMBER() or DISTINCT ON).
Example Scenario: Snapshotting for a specific Instance
Let’s say you want to create a snapshot for a specific instance_id on a specific host_id.
- Input Parameters:
  - p_host_id (UUID)
  - p_instance_id (UUID)
- Derive Related IDs (inside your snapshot creation logic/service):
  - You’ll need to query instance_t to get the associated product_version_id, environment, etc., for this instance.
  - Query product_version_t to get product_id.
  - Let’s call these derived values v_product_version_id, v_environment, v_product_id.
- config_snapshot_t Record:
  - Generate a snapshot_id (e.g., UUIDv7).
  - snapshot_ts: CURRENT_TIMESTAMP
  - snapshot_type: e.g., 'DEPLOYMENT'
  - scope_host_id: p_host_id
  - scope_instance_id: p_instance_id
  - scope_environment: v_environment (Store the derived environment for clarity, even though it came from the instance)
  - scope_product_version_id: v_product_version_id (Store for clarity)
  - scope_product_id: v_product_id (Store for clarity)
  - (Other scope_* columns like scope_instance_api_id would be NULL for this instance-level snapshot)
- Aggregation Query (Using ROW_NUMBER()): This query uses the input parameters (p_host_id, p_instance_id) and the derived values (v_product_version_id, v_environment, v_product_id) to find the highest priority value for each property_id.
WITH
-- Parameters derived before running this query:
-- p_host_id UUID
-- p_instance_id UUID
-- v_product_version_id UUID (derived from p_instance_id)
-- v_environment VARCHAR(16) (derived from p_instance_id)
-- v_product_id VARCHAR(8) (derived from v_product_version_id)

-- Find relevant instance_api_ids and instance_app_ids for the target instance
RelevantInstanceApis AS (
    SELECT instance_api_id
    FROM instance_api_t
    WHERE host_id = ? -- p_host_id
      AND instance_id = ? -- p_instance_id
),
RelevantInstanceApps AS (
    SELECT instance_app_id
    FROM instance_app_t
    WHERE host_id = ? -- p_host_id
      AND instance_id = ? -- p_instance_id
),

-- Pre-process Instance App API properties with merging logic
Merged_Instance_App_Api_Properties AS (
    SELECT
        iaap.property_id,
        CASE cp.value_type
            WHEN 'map' THEN COALESCE(jsonb_merge_agg(iaap.property_value::jsonb), '{}'::jsonb)::text
            WHEN 'list' THEN COALESCE((SELECT jsonb_agg(elem ORDER BY iaa.update_ts) -- Order elements based on when they were added via the link table? Or property update_ts? Assuming property update_ts. Check data model if linking time matters more.
                                       FROM jsonb_array_elements(sub.property_value::jsonb) elem
                                       WHERE jsonb_typeof(sub.property_value::jsonb) = 'array'
                                      ), '[]'::jsonb)::text -- Requires subquery if ordering elements
            -- Subquery approach for ordering list elements by property timestamp:
            /*
            COALESCE(
                (SELECT jsonb_agg(elem ORDER BY prop.update_ts)
                 FROM instance_app_api_property_t prop, jsonb_array_elements(prop.property_value::jsonb) elem
                 WHERE prop.host_id = iaap.host_id
                   AND prop.instance_app_id = iaap.instance_app_id
                   AND prop.instance_api_id = iaap.instance_api_id
                   AND prop.property_id = iaap.property_id
                   AND jsonb_typeof(prop.property_value::jsonb) = 'array'
                ), '[]'::jsonb
            )::text
            */
            ELSE MAX(iaap.property_value) -- For simple types, MAX can work if only one entry expected, otherwise need timestamp logic
            -- More robust for simple types: Pick latest based on timestamp
            /*
            (SELECT property_value
             FROM instance_app_api_property_t latest
             WHERE latest.host_id = iaap.host_id
               AND latest.instance_app_id = iaap.instance_app_id
               AND latest.instance_api_id = iaap.instance_api_id
               AND latest.property_id = iaap.property_id
             ORDER BY latest.update_ts DESC LIMIT 1)
            */
        END AS effective_value
    FROM instance_app_api_property_t iaap
    JOIN config_property_t cp ON iaap.property_id = cp.property_id
    JOIN instance_app_api_t iaa ON iaa.host_id = iaap.host_id
        AND iaa.instance_app_id = iaap.instance_app_id
        AND iaa.instance_api_id = iaap.instance_api_id -- Join to potentially use its timestamp for ordering lists
    WHERE iaap.host_id = ? -- p_host_id
      AND iaap.instance_app_id IN (SELECT instance_app_id FROM RelevantInstanceApps)
      AND iaap.instance_api_id IN (SELECT instance_api_id FROM RelevantInstanceApis)
    GROUP BY iaap.host_id, iaap.instance_app_id, iaap.instance_api_id, iaap.property_id, cp.value_type -- Group to aggregate/merge
),

-- Pre-process Instance API properties
Merged_Instance_Api_Properties AS (
    SELECT
        iap.property_id,
        CASE cp.value_type
            WHEN 'map' THEN COALESCE(jsonb_merge_agg(iap.property_value::jsonb), '{}'::jsonb)::text
            WHEN 'list' THEN COALESCE((SELECT jsonb_agg(elem ORDER BY prop.update_ts)
                                       FROM instance_api_property_t prop, jsonb_array_elements(prop.property_value::jsonb) elem
                                       WHERE prop.host_id = iap.host_id
                                         AND prop.instance_api_id = iap.instance_api_id
                                         AND prop.property_id = iap.property_id
                                         AND jsonb_typeof(prop.property_value::jsonb) = 'array'), '[]'::jsonb)::text
            ELSE (SELECT property_value
                  FROM instance_api_property_t latest
                  WHERE latest.host_id = iap.host_id
                    AND latest.instance_api_id = iap.instance_api_id
                    AND latest.property_id = iap.property_id
                  ORDER BY latest.update_ts DESC LIMIT 1)
        END AS effective_value
    FROM instance_api_property_t iap
    JOIN config_property_t cp ON iap.property_id = cp.property_id
    WHERE iap.host_id = ? -- p_host_id
      AND iap.instance_api_id IN (SELECT instance_api_id FROM RelevantInstanceApis)
    GROUP BY iap.host_id, iap.instance_api_id, iap.property_id, cp.value_type
),

-- Pre-process Instance App properties
Merged_Instance_App_Properties AS (
    SELECT
        iapp.property_id,
        CASE cp.value_type
            WHEN 'map' THEN COALESCE(jsonb_merge_agg(iapp.property_value::jsonb), '{}'::jsonb)::text
            WHEN 'list' THEN COALESCE((SELECT jsonb_agg(elem ORDER BY prop.update_ts)
                                       FROM instance_app_property_t prop, jsonb_array_elements(prop.property_value::jsonb) elem
                                       WHERE prop.host_id = iapp.host_id
                                         AND prop.instance_app_id = iapp.instance_app_id
                                         AND prop.property_id = iapp.property_id
                                         AND jsonb_typeof(prop.property_value::jsonb) = 'array'), '[]'::jsonb)::text
            ELSE (SELECT property_value
                  FROM instance_app_property_t latest
                  WHERE latest.host_id = iapp.host_id
                    AND latest.instance_app_id = iapp.instance_app_id
                    AND latest.property_id = iapp.property_id
                  ORDER BY latest.update_ts DESC LIMIT 1)
        END AS effective_value
    FROM instance_app_property_t iapp
    JOIN config_property_t cp ON iapp.property_id = cp.property_id
    WHERE iapp.host_id = ? -- p_host_id
      AND iapp.instance_app_id IN (SELECT instance_app_id FROM RelevantInstanceApps)
    GROUP BY iapp.host_id, iapp.instance_app_id, iapp.property_id, cp.value_type
),

-- Combine all levels with priority
AllOverrides AS (
-- Priority 10: Instance App API (highest) - Requires aggregating the merged results if multiple app/api combos apply to the instance
SELECT
m_iaap.property_id,
-- Need final merge/latest logic here if multiple app/api combos apply to the SAME instance_id and define the SAME property_id
-- Assuming for now we take the first one found or need more complex logic if merge is needed again at this stage
-- For simplicity, let's assume we just take MAX effective value if multiple rows exist per property_id for the instance
MAX(m_iaap.effective_value) as property_value, -- This MAX might not be right for JSON, need specific logic if merging across app/api combos is needed here
10 AS priority_level
FROM Merged_Instance_App_Api_Properties m_iaap
-- No additional instance filter needed if CTEs were already filtered by RelevantInstanceApps/Apis linked to p_instance_id
GROUP BY m_iaap.property_id -- Group to handle multiple app/api links potentially setting the same property for the instance
UNION ALL
-- Priority 20: Instance API
SELECT
m_iap.property_id,
MAX(m_iap.effective_value) as property_value, -- Similar merge concern as above
20 AS priority_level
FROM Merged_Instance_Api_Properties m_iap
GROUP BY m_iap.property_id
UNION ALL
-- Priority 30: Instance App
SELECT
m_iapp.property_id,
MAX(m_iapp.effective_value) as property_value, -- Similar merge concern
30 AS priority_level
FROM Merged_Instance_App_Properties m_iapp
GROUP BY m_iapp.property_id
UNION ALL
-- Priority 40: Instance
SELECT
ip.property_id,
ip.property_value,
40 AS priority_level
FROM instance_property_t ip
WHERE ip.host_id = ? -- p_host_id
AND ip.instance_id = ? -- p_instance_id
UNION ALL
-- Priority 50: Product Version
SELECT
pvp.property_id,
pvp.property_value,
50 AS priority_level
FROM product_version_property_t pvp
WHERE pvp.host_id = ? -- p_host_id
AND pvp.product_version_id = ? -- v_product_version_id
UNION ALL
-- Priority 60: Environment
SELECT
ep.property_id,
ep.property_value,
60 AS priority_level
FROM environment_property_t ep
WHERE ep.host_id = ? -- p_host_id
AND ep.environment = ? -- v_environment
UNION ALL
-- Priority 70: Product (Host independent)
SELECT
pp.property_id,
pp.property_value,
70 AS priority_level
FROM product_property_t pp
WHERE pp.product_id = ? -- v_product_id
UNION ALL
-- Priority 100: Default values
SELECT
cp.property_id,
cp.property_value, -- Default value
100 AS priority_level
FROM config_property_t cp
-- Optimization: Filter defaults to only those applicable to the product version?
-- JOIN product_version_config_property_t pvcp ON cp.property_id = pvcp.property_id
-- WHERE pvcp.host_id = ? AND pvcp.product_version_id = ?
),
RankedOverrides AS (
SELECT
ao.property_id,
ao.property_value,
ao.priority_level,
ROW_NUMBER() OVER (PARTITION BY ao.property_id ORDER BY ao.priority_level ASC) as rn
FROM AllOverrides ao
WHERE ao.property_value IS NOT NULL -- Exclude levels where the value was NULL (unless NULL is a valid override)
)
-- Final Selection for Snapshot Table
SELECT
-- snapshot_id needs to be added here or during INSERT
cfg.config_name || '.' || cp.property_name AS property_key,
ro.property_value,
cp.property_type,
cp.value_type
-- Include ro.priority_level AS source_priority if storing provenance
FROM RankedOverrides ro
JOIN config_property_t cp ON ro.property_id = cp.property_id
JOIN config_t cfg ON cp.config_id = cfg.config_id
WHERE ro.rn = 1;
5. **Populate `config_snapshot_property_t`:** Insert the results of this query into `config_snapshot_property_t`, using the `snapshot_id` generated in step 3.
**Key Takeaways:**
* The `scope_*` columns define the *context* of the snapshot.
* The values for these `scope_*` columns are used as *parameters* within the `WHERE` clauses of the aggregation query that *generates* the snapshot data.
* You don't need a `scope_*` column per override table. You need columns representing the different *dimensions* or *levels* by which you might want to define a snapshot's context (host, instance, environment, product version, etc.).
* The aggregation query uses these parameters to filter the relevant rows from each override table and then determines the highest priority value using `UNION ALL` and a ranking mechanism (`ROW_NUMBER()` or `DISTINCT ON`).
This approach keeps the `config_snapshot_t` table focused on metadata and context, while the query handles the complex logic of applying that context to the various override tables to produce the effective configuration for `config_snapshot_property_t`.
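The ranking step can also be pictured outside SQL. The following is a minimal in-memory sketch, not the portal's implementation: the `OverrideResolver` class and its `Override` record are hypothetical names introduced for illustration. It assumes, as the query does, that a lower `priority_level` wins and that `NULL` values are skipped.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mirrors the RankedOverrides CTE in memory. */
final class OverrideResolver {

    /** One row of the AllOverrides union (hypothetical shape). */
    record Override(String propertyId, String value, int priorityLevel) {}

    /**
     * Keeps, per propertyId, the non-null value with the lowest priorityLevel.
     * This corresponds to ROW_NUMBER() OVER (PARTITION BY property_id
     * ORDER BY priority_level ASC) followed by the rn = 1 filter.
     */
    static Map<String, Override> resolve(List<Override> allOverrides) {
        Map<String, Override> winners = new TreeMap<>();
        for (Override candidate : allOverrides) {
            if (candidate.value() == null) {
                continue; // mirrors WHERE ao.property_value IS NOT NULL
            }
            winners.merge(candidate.propertyId(), candidate,
                (current, next) -> next.priorityLevel() < current.priorityLevel() ? next : current);
        }
        return winners;
    }
}
```

With a default (level 100), an environment override (level 60), and a `NULL` instance override (level 40) for the same property, the sketch returns the environment value, because the `NULL` row never takes part in the ranking.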
### Config Phase
The `config_t` table has a `config_phase` column that separates the stages of the API/app life cycle: for example, configuration for code generation (`'G'`), for deployment (`'D'`), and for runtime (`'R'`).
Given your two main use cases:
1. **Service Startup:** Needs the *runtime* (`'R'`) configuration.
2. **Deployment Rollback:** Needs to potentially restore the state required for *deployment* (`'D'`) and the resulting *runtime* (`'R'`) configuration from that point in time. (Generator `'G'` configs are usually less relevant for deployment/runtime rollbacks).
Here are the options and the recommended approach:
**Option 1: Phase-Specific Snapshots (Separate Records)**
* **How:** Add `scope_config_phase CHAR(1)` to `config_snapshot_t`.
* **Snapshot Creation:** When a snapshot event occurs (e.g., pre-deployment):
* Generate a `snapshot_id_D` (e.g., using UUIDv7).
* Run the aggregation query with `config_phase = 'D'`.
* Store results in `config_snapshot_property_t` linked to `snapshot_id_D`.
* Create metadata in `config_snapshot_t` for `snapshot_id_D` with `scope_config_phase = 'D'`.
* Generate *another* `snapshot_id_R`.
* Run the aggregation query with `config_phase = 'R'`.
* Store results in `config_snapshot_property_t` linked to `snapshot_id_R`.
* Create metadata in `config_snapshot_t` for `snapshot_id_R` with `scope_config_phase = 'R'`.
* You'd need a way to link `snapshot_id_D` and `snapshot_id_R` to the same logical event (e.g., same `related_deployment_id`).
* **Pros:** Very explicit separation. Querying for a specific phase's snapshot is straightforward.
* **Cons:** Requires multiple runs of the aggregation query. Doubles the metadata rows in `config_snapshot_t`. Complicates linking phases related to the same event. Less efficient.
**Option 2: Single Snapshot, Phase Included in Properties (Recommended)**
* **How:** Do **not** add `scope_config_phase` to `config_snapshot_t`. Instead, add `config_phase CHAR(1)` to `config_snapshot_property_t`.
* **Snapshot Creation:**
* Generate a single `snapshot_id`.
* Create one metadata row in `config_snapshot_t` representing the overall scope and time (without phase).
* **Modify the Aggregation Query:**
* **Remove** the `WHERE c.config_phase = ?` filter entirely.
* **SELECT** the `c.config_phase` value in the final `SELECT` statement.
* Run this modified query *once*. It will calculate the effective properties across *all* phases applicable to the scope.
* Store the results in `config_snapshot_property_t`, populating the new `config_phase` column for each property based on the phase of the `config_t` record from which it originated.
* **`config_snapshot_property_t` Structure:**
```sql
CREATE TABLE config_snapshot_property_t (
-- ... other columns ...
config_phase CHAR(1) NOT NULL, -- Phase this property belongs to
property_key TEXT NOT NULL,
property_value TEXT,
property_type VARCHAR(32),
value_type VARCHAR(32),
-- ...
PRIMARY KEY(snapshot_property_id), -- Or PK(snapshot_id, config_phase, property_key)? Needs thought.
FOREIGN KEY(snapshot_id) REFERENCES config_snapshot_t(snapshot_id) ON DELETE CASCADE
);
-- Ensure uniqueness within a snapshot for a given key *and phase*
ALTER TABLE config_snapshot_property_t
ADD CONSTRAINT config_snapshot_property_uk UNIQUE (snapshot_id, config_phase, property_key);
-- Index for lookup by snapshot and phase
CREATE INDEX idx_config_snapshot_property_snap_phase ON config_snapshot_property_t (snapshot_id, config_phase);
```
* **Pros:**

### commitConfigInstance
Let's outline the structure of your `commitConfigInstance` service method and the necessary SQL INSERT statements using JDBC.
This involves several steps within a single database transaction:
1. **Generate Snapshot ID:** Create a new UUID for the snapshot.
2. **Derive Scope IDs:** Query live tables (`instance_t`, `product_version_t`, etc.) based on the input `hostId` and `instanceId` to get other relevant scope identifiers (`environment`, `productId`, `productVersionId`, `serviceId`, etc.).
3. **Insert Metadata:** Insert a record into `config_snapshot_t`.
4. **Aggregate Effective Config:** Run the complex aggregation query (using `ROW_NUMBER()` or similar) to get the final effective properties.
5. **Insert Effective Config:** Insert the results from step 4 into `config_snapshot_property_t`.
6. **Snapshot Override Tables:** For each relevant live override table (`instance_property_t`, `instance_api_property_t`, etc.), select its current state (filtered by scope) and insert it into the corresponding `snapshot_*_property_t` table.
7. **Commit/Rollback:** Commit the transaction if all steps succeed, otherwise roll back.
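For step 1, a time-ordered identifier such as UUIDv7 lets snapshot IDs sort roughly by creation time. The sketch below shows the UUIDv7 bit layout from RFC 9562 using only the JDK. It is an illustration, not the `uuid-creator` implementation, and the `UuidV7Sketch` class and its method names are hypothetical.

```java
import java.security.SecureRandom;
import java.util.UUID;

/** Illustrative UUIDv7 layout: 48-bit Unix-millisecond timestamp, version, random bits. */
final class UuidV7Sketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Builds a version-7 UUID for the given Unix timestamp in milliseconds. */
    static UUID create(long unixMillis) {
        long randA = RANDOM.nextInt(1 << 12);                 // 12 random bits
        long msb = ((unixMillis & 0xFFFFFFFFFFFFL) << 16)     // 48-bit timestamp in the top bits
                | 0x7000L                                     // version nibble = 7
                | randA;
        long lsb = (RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL)  // 62 random bits
                | 0x8000000000000000L;                        // variant bits = 10
        return new UUID(msb, lsb);
    }

    /** Recovers the embedded Unix-millisecond timestamp. */
    static long timestampMillis(UUID id) {
        return id.getMostSignificantBits() >>> 16;
    }
}
```

Calling `UuidV7Sketch.create(System.currentTimeMillis())` yields an ID whose leading 48 bits are the creation time, so two snapshots taken in different milliseconds differ in their most significant bits in timestamp order.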
**Java Service Method Structure (Conceptual)**
```java
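import com.github.f4b6a3.uuid.UuidCreator; // For UUIDv7 generation
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.sql.DataSource; // Assuming you have a DataSource injected
import java.sql.*;
import java.time.OffsetDateTime;
import java.util.*;
// Result, Success, Failure, Status and the status-code constants used below are
// assumed to come from the host framework (e.g. light-4j); they are not defined here.
public class ConfigSnapshotService {
private static final Logger logger = LoggerFactory.getLogger(ConfigSnapshotService.class);
private final DataSource ds;
// Inject DataSource via constructor
public ConfigSnapshotService(DataSource ds) {
this.ds = ds;
}
// Pre-compile your complex aggregation query (modify based on previous examples)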
private static final String AGGREGATE_EFFECTIVE_CONFIG_SQL = """
WITH AllOverrides AS (
-- Priority 10: Instance App API (merged) ...
-- Priority 20: Instance API (merged) ...
-- Priority 30: Instance App (merged) ...
-- Priority 40: Instance ...
-- Priority 50: Product Version ...
-- Priority 60: Environment ...
-- Priority 70: Product ...
-- Priority 100: Default ...
),
RankedOverrides AS (
SELECT ..., ROW_NUMBER() OVER (PARTITION BY ao.property_id ORDER BY ao.priority_level ASC) as rn
FROM AllOverrides ao WHERE ao.property_value IS NOT NULL
)
SELECT
cfg.config_phase, -- Phase from config_t (aliased cfg in the JOIN below)
cfg.config_id, -- Added config_id
cp.property_id, -- Added property_id
cp.property_name, -- Added property_name
cp.property_type,
cp.value_type,
cfg.config_name || '.' || cp.property_name AS property_key, -- Keep for logging/debug? Not needed in snapshot table itself
ro.property_value,
ro.priority_level -- To determine source_level
FROM RankedOverrides ro
JOIN config_property_t cp ON ro.property_id = cp.property_id
JOIN config_t cfg ON cp.config_id = cfg.config_id
WHERE ro.rn = 1;
"""; // NOTE: Add parameters (?) for host_id, instance_id, derived IDs etc.
public Result<String> commitConfigInstance(Map<String, Object> event) {
// 1. Extract Input Parameters
UUID hostId = (UUID) event.get("hostId");
UUID instanceId = (UUID) event.get("instanceId");
String snapshotType = (String) event.getOrDefault("snapshotType", "USER_SAVE"); // Default type
String description = (String) event.get("description");
UUID userId = (UUID) event.get("userId"); // May be null
UUID deploymentId = (UUID) event.get("deploymentId"); // May be null
if (hostId == null || instanceId == null) {
return Failure.of(new Status(INVALID_PARAMETER, "hostId and instanceId are required."));
}
UUID snapshotId = UuidCreator.getTimeOrderedEpoch(); // Generate Snapshot ID (e.g., V7)
Connection connection = null;
try {
connection = ds.getConnection();
connection.setAutoCommit(false); // Start Transaction
// 2. Derive Scope IDs
// Query instance_t and potentially product_version_t based on hostId, instanceId
DerivedScope scope = deriveScopeInfo(connection, hostId, instanceId);
if (scope == null) {
connection.rollback(); // Rollback if instance not found
return Failure.of(new Status(OBJECT_NOT_FOUND, "Instance not found for hostId/instanceId."));
}
// 3. Insert Snapshot Metadata
insertSnapshotMetadata(connection, snapshotId, snapshotType, description, userId, deploymentId, hostId, scope);
// 4 & 5. Aggregate and Insert Effective Config
insertEffectiveConfigSnapshot(connection, snapshotId, hostId, instanceId, scope);
// 6. Snapshot Individual Override Tables
// Use INSERT ... SELECT ... for efficiency
snapshotInstanceProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceApiProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceAppProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceAppApiProperties(connection, snapshotId, hostId, instanceId); // Requires finding relevant App/API IDs first
snapshotEnvironmentProperties(connection, snapshotId, hostId, scope.environment());
snapshotProductVersionProperties(connection, snapshotId, hostId, scope.productVersionId());
snapshotProductProperties(connection, snapshotId, scope.productId());
// Add others as needed
// 7. Commit Transaction
connection.commit();
logger.info("Successfully created config snapshot: {}", snapshotId);
return Success.of(snapshotId.toString());
} catch (SQLException e) {
logger.error("SQLException during snapshot creation for instance {}: {}", instanceId, e.getMessage(), e);
if (connection != null) {
try {
connection.rollback();
} catch (SQLException ex) {
logger.error("Error rolling back transaction:", ex);
}
}
return Failure.of(new Status(SQL_EXCEPTION, "Database error during snapshot creation."));
} catch (Exception e) { // Catch other potential errors (e.g., during scope derivation)
logger.error("Exception during snapshot creation for instance {}: {}", instanceId, e.getMessage(), e);
if (connection != null) {
try { connection.rollback(); } catch (SQLException ex) { logger.error("Error rolling back transaction:", ex); }
}
return Failure.of(new Status(GENERIC_EXCEPTION, "Unexpected error during snapshot creation."));
} finally {
if (connection != null) {
try {
connection.setAutoCommit(true); // Restore default behavior
connection.close();
} catch (SQLException e) {
logger.error("Error closing connection:", e);
}
}
}
}
// --- Helper Methods ---
// Placeholder for derived scope data structure
private record DerivedScope(String environment, String productId, String productVersion, UUID productVersionId, String serviceId /*, add API details if needed */) {}
private DerivedScope deriveScopeInfo(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
// Query instance_t LEFT JOIN product_version_t ... WHERE i.host_id = ? AND i.instance_id = ?
// Extract environment, service_id from instance_t
// Extract product_id, product_version from product_version_t (via product_version_id in instance_t)
// Return new DerivedScope(...) or null if not found
String sql = """
SELECT i.environment, i.service_id, pv.product_id, pv.product_version, i.product_version_id
FROM instance_t i
LEFT JOIN product_version_t pv ON i.host_id = pv.host_id AND i.product_version_id = pv.product_version_id
WHERE i.host_id = ? AND i.instance_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
ps.setObject(2, instanceId);
try (ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
return new DerivedScope(
rs.getString("environment"),
rs.getString("product_id"),
rs.getString("product_version"),
rs.getObject("product_version_id", UUID.class),
rs.getString("service_id")
);
} else {
return null; // Instance not found
}
}
}
}
private void insertSnapshotMetadata(Connection conn, UUID snapshotId, String snapshotType, String description,
UUID userId, UUID deploymentId, UUID hostId, DerivedScope scope) throws SQLException {
String sql = """
INSERT INTO config_snapshot_t
(snapshot_id, snapshot_ts, snapshot_type, description, user_id, deployment_id,
scope_host_id, scope_environment, scope_product_id, scope_product_version_id, -- Changed col name
scope_service_id /*, scope_api_id, scope_api_version - Add if applicable */)
VALUES (?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?, ?, ?, ? /*, ?, ? */)
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setString(2, snapshotType);
ps.setString(3, description);
ps.setObject(4, userId); // setObject handles null correctly
ps.setObject(5, deploymentId); // setObject handles null correctly
ps.setObject(6, hostId);
ps.setString(7, scope.environment());
ps.setString(8, scope.productId());
ps.setObject(9, scope.productVersionId()); // Store the ID
ps.setString(10, scope.serviceId());
// Set API scope if needed ps.setObject(11, ...); ps.setString(12, ...);
ps.executeUpdate();
}
}
private void insertEffectiveConfigSnapshot(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId, DerivedScope scope) throws SQLException {
String insertSql = """
INSERT INTO config_snapshot_property_t
(snapshot_property_id, snapshot_id, config_phase, config_id, property_id, property_name,
property_type, property_value, value_type, source_level)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""";
// Prepare the aggregation query
try (PreparedStatement selectStmt = conn.prepareStatement(AGGREGATE_EFFECTIVE_CONFIG_SQL);
PreparedStatement insertStmt = conn.prepareStatement(insertSql)) {
// Set ALL parameters for the AGGREGATE_EFFECTIVE_CONFIG_SQL query
int paramIndex = 1;
// Example: set parameters based on how AGGREGATE_EFFECTIVE_CONFIG_SQL is structured
// selectStmt.setObject(paramIndex++, hostId);
// selectStmt.setObject(paramIndex++, instanceId);
// ... set derived scope IDs (productVersionId, environment, productId) ...
// ... set parameters for all UNION branches and potential subqueries ...
try (ResultSet rs = selectStmt.executeQuery()) {
int batchCount = 0;
while (rs.next()) {
insertStmt.setObject(1, UuidCreator.getTimeOrderedEpoch()); // snapshot_property_id
insertStmt.setObject(2, snapshotId);
insertStmt.setString(3, rs.getString("config_phase"));
insertStmt.setObject(4, rs.getObject("config_id", UUID.class));
insertStmt.setObject(5, rs.getObject("property_id", UUID.class));
insertStmt.setString(6, rs.getString("property_name"));
insertStmt.setString(7, rs.getString("property_type"));
insertStmt.setString(8, rs.getString("property_value"));
insertStmt.setString(9, rs.getString("value_type"));
insertStmt.setString(10, mapPriorityToSourceLevel(rs.getInt("priority_level"))); // Map numeric priority back to level name
insertStmt.addBatch();
batchCount++;
if (batchCount % 100 == 0) { // Execute batch periodically
insertStmt.executeBatch();
}
}
if (batchCount % 100 != 0) { // Execute remaining batch
insertStmt.executeBatch();
}
}
}
}
// Helper to map priority back to source level name
private String mapPriorityToSourceLevel(int priority) {
return switch (priority) {
case 10 -> "instance_app_api"; // Adjust priorities as used in your query
case 20 -> "instance_api";
case 30 -> "instance_app";
case 40 -> "instance";
case 50 -> "product_version";
case 60 -> "environment";
case 70 -> "product";
case 100 -> "default";
default -> "unknown";
};
}
// --- Methods for Snapshotting Individual Override Tables ---
private void snapshotInstanceProperties(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId) throws SQLException {
String sql = """
INSERT INTO snapshot_instance_property_t
(snapshot_id, host_id, instance_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, instance_id, property_id, property_value, update_user, update_ts
FROM instance_property_t
WHERE host_id = ? AND instance_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setObject(3, instanceId);
ps.executeUpdate();
}
}
private void snapshotInstanceApiProperties(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId) throws SQLException {
// Find relevant instance_api_ids first
List<UUID> apiIds = findRelevantInstanceApiIds(conn, hostId, instanceId);
if (apiIds.isEmpty()) return; // No API overrides for this instance
String sql = """
INSERT INTO snapshot_instance_api_property_t
(snapshot_id, host_id, instance_api_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, instance_api_id, property_id, property_value, update_user, update_ts
FROM instance_api_property_t
WHERE host_id = ? AND instance_api_id = ANY(?) -- Use ANY with array for multiple IDs
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
// Create a SQL Array from the List of UUIDs
Array sqlArray = conn.createArrayOf("uuid", apiIds.toArray()); // "uuid" is the PostgreSQL type name
ps.setArray(3, sqlArray);
ps.executeUpdate();
sqlArray.free(); // Release array resources
}
}
// Similar methods for snapshotInstanceAppProperties, snapshotInstanceAppApiProperties...
// These will need helper methods like findRelevantInstanceApiIds/findRelevantInstanceAppIds
private void snapshotEnvironmentProperties(Connection conn, UUID snapshotId, UUID hostId, String environment) throws SQLException {
if (environment == null || environment.isEmpty()) return; // No environment scope
String sql = """
INSERT INTO snapshot_environment_property_t
(snapshot_id, host_id, environment, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, environment, property_id, property_value, update_user, update_ts
FROM environment_property_t
WHERE host_id = ? AND environment = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setString(3, environment);
ps.executeUpdate();
}
}
private void snapshotProductVersionProperties(Connection conn, UUID snapshotId, UUID hostId, UUID productVersionId) throws SQLException {
if (productVersionId == null) return;
String sql = """
INSERT INTO snapshot_product_version_property_t
(snapshot_id, host_id, product_version_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, product_version_id, property_id, property_value, update_user, update_ts
FROM product_version_property_t
WHERE host_id = ? AND product_version_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setObject(3, productVersionId);
ps.executeUpdate();
}
}
private void snapshotProductProperties(Connection conn, UUID snapshotId, String productId) throws SQLException {
if (productId == null || productId.isEmpty()) return;
String sql = """
INSERT INTO snapshot_product_property_t
(snapshot_id, product_id, property_id, property_value, update_user, update_ts)
SELECT ?, product_id, property_id, property_value, update_user, update_ts
FROM product_property_t
WHERE product_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setString(2, productId);
ps.executeUpdate();
}
}
// --- Helper method to find associated instance_api_ids ---
private List<UUID> findRelevantInstanceApiIds(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
List<UUID> ids = new ArrayList<>();
String sql = "SELECT instance_api_id FROM instance_api_t WHERE host_id = ? AND instance_id = ?";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
ps.setObject(2, instanceId);
try (ResultSet rs = ps.executeQuery()) {
while(rs.next()) {
ids.add(rs.getObject("instance_api_id", UUID.class));
}
}
}
return ids;
}
// --- Add similar helper for findRelevantInstanceAppIds ---
// --- Add similar helper for findRelevantInstanceAppApiIds (if needed) ---
}
```
**SQL INSERT Statements:**

* `config_snapshot_t`:

      INSERT INTO config_snapshot_t
      (snapshot_id, snapshot_ts, snapshot_type, description, user_id, deployment_id,
      scope_host_id, scope_environment, scope_product_id, scope_product_version_id,
      scope_service_id /*, ... other scope cols */)
      VALUES (?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?, ?, ?, ? /*, ... */)

  (Parameters: snapshotId, snapshotType, description, userId, deploymentId, hostId, environment, productId, productVersionId, serviceId, ...)

* `config_snapshot_property_t` (executed in a loop/batch):

      INSERT INTO config_snapshot_property_t
      (snapshot_property_id, snapshot_id, config_phase, config_id, property_id, property_name,
      property_type, property_value, value_type, source_level)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)

  (Parameters: new UUID, snapshotId, phase, configId, propertyId, propName, propType, propValue, valType, sourceLevelString)

* `snapshot_instance_property_t`:

      INSERT INTO snapshot_instance_property_t
      (snapshot_id, host_id, instance_id, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, instance_id, property_id, property_value, update_user, update_ts
      FROM instance_property_t
      WHERE host_id = ? AND instance_id = ?

  (Parameters: snapshotId, hostId, instanceId)

* `snapshot_instance_api_property_t`:

      INSERT INTO snapshot_instance_api_property_t
      (snapshot_id, host_id, instance_api_id, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, instance_api_id, property_id, property_value, update_user, update_ts
      FROM instance_api_property_t
      WHERE host_id = ? AND instance_api_id = ANY(?) -- Parameter is a SQL Array of relevant instance_api_ids

  (Parameters: snapshotId, hostId, SQL Array of instance_api_ids)

* `snapshot_instance_app_property_t`:

      INSERT INTO snapshot_instance_app_property_t
      (snapshot_id, host_id, instance_app_id, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, instance_app_id, property_id, property_value, update_user, update_ts
      FROM instance_app_property_t
      WHERE host_id = ? AND instance_app_id = ANY(?) -- Parameter is a SQL Array of relevant instance_app_ids

  (Parameters: snapshotId, hostId, SQL Array of instance_app_ids)

* `snapshot_instance_app_api_property_t`:

      INSERT INTO snapshot_instance_app_api_property_t
      (snapshot_id, host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts
      FROM instance_app_api_property_t
      WHERE host_id = ?
      AND instance_app_id = ANY(?) -- SQL Array of relevant instance_app_ids
      AND instance_api_id = ANY(?) -- SQL Array of relevant instance_api_ids

  (Parameters: snapshotId, hostId, SQL Array of instance_app_ids, SQL Array of instance_api_ids)

* `snapshot_environment_property_t`:

      INSERT INTO snapshot_environment_property_t
      (snapshot_id, host_id, environment, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, environment, property_id, property_value, update_user, update_ts
      FROM environment_property_t
      WHERE host_id = ? AND environment = ?

  (Parameters: snapshotId, hostId, environment)

* `snapshot_product_version_property_t`:

      INSERT INTO snapshot_product_version_property_t
      (snapshot_id, host_id, product_version_id, property_id, property_value, update_user, update_ts)
      SELECT ?, host_id, product_version_id, property_id, property_value, update_user, update_ts
      FROM product_version_property_t
      WHERE host_id = ? AND product_version_id = ?

  (Parameters: snapshotId, hostId, productVersionId)

* `snapshot_product_property_t`:

      INSERT INTO snapshot_product_property_t
      (snapshot_id, product_id, property_id, property_value, update_user, update_ts)
      SELECT ?, product_id, property_id, property_value, update_user, update_ts
      FROM product_property_t
      WHERE product_id = ?

  (Parameters: snapshotId, productId)
**Important Considerations:**

* **Transaction Management:** Ensure all these inserts happen within a single atomic transaction.
* **Error Handling:** Robust error handling and rollback are critical.
* **Performance:** The aggregation query can be heavy. Consider its performance impact. The `INSERT ... SELECT` statements should be reasonably efficient if the source tables are properly indexed on their scope columns.
* **Parameterization:** Use `PreparedStatement` for all queries to prevent SQL injection and improve performance.
* **Scope Derivation Logic:** Ensure the logic in `deriveScopeInfo` and the helpers like `findRelevantInstanceApiIds` is correct and handles cases where relationships might not exist (e.g., an instance with no associated APIs).
* **UUID Generation:** Use a suitable UUID generator (like UUIDv7 via `uuid-creator`).
* **Logging:** Add clear logging for snapshot creation success/failure.
* **Refinement:** You'll need to adapt the `AGGREGATE_EFFECTIVE_CONFIG_SQL` query based on the final version using `ROW_NUMBER` and correctly incorporate parameters.

* **Pros:**
    * Only one aggregation query run per snapshot event.
    * Captures the complete multi-phase state at that point under one `snapshot_id`.
    * Simpler `config_snapshot_t` metadata.
    * Facilitates rollbacks that might need to restore both 'D' and 'R' phase configurations.
* **Cons:**
    * Requires modifying the aggregation query.
    * Consuming services (like startup) need to add `AND config_phase = 'R'` when querying `config_snapshot_property_t`.
**Conclusion:**

Option 2 is the better approach. Create a single snapshot representing the point-in-time state for the defined scope, but modify your aggregation query to:

* Remove the `config_phase = ?` filter.
* Select `cfg.config_phase` in the final output.
* Store this `config_phase` along with the effective `property_key` and `property_value` in the `config_snapshot_property_t` table.

This is more efficient for snapshot creation and provides a complete view for rollbacks. Consumers simply need to add a phase filter when reading the specific configuration they need (e.g., 'R' for runtime).
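On the consumer side, the phase filter is a single extra predicate. The following is a minimal JDBC sketch under stated assumptions: the `SnapshotPhaseReader` class is hypothetical, and the column names `property_key`, `property_value`, and `config_phase` follow the `config_snapshot_property_t` structure shown for Option 2.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative reader: loads one phase of a snapshot as key/value pairs. */
final class SnapshotPhaseReader {

    // Assumed column names; adjust to the final table definition.
    static final String SELECT_BY_PHASE = """
        SELECT property_key, property_value
        FROM config_snapshot_property_t
        WHERE snapshot_id = ? AND config_phase = ?
        ORDER BY property_key
        """;

    /** Returns the properties of the given snapshot for a single phase, e.g. "R" for runtime. */
    static Map<String, String> load(Connection conn, UUID snapshotId, String phase) throws SQLException {
        Map<String, String> properties = new LinkedHashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(SELECT_BY_PHASE)) {
            ps.setObject(1, snapshotId);
            ps.setString(2, phase);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    properties.put(rs.getString("property_key"), rs.getString("property_value"));
                }
            }
        }
        return properties;
    }
}
```

A service starting up would call `SnapshotPhaseReader.load(conn, snapshotId, "R")`. A rollback could call it once with `"D"` and once with `"R"` against the same `snapshot_id`.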
Modified Aggregation Query Snippet (Final SELECT):
-- (Inside the final SELECT statement of the modified aggregation query)
SELECT
-- snapshot_id needs to be added here or during INSERT
cfg.config_phase, -- <-- Select the phase
cfg.config_name || '.' || cp.property_name AS property_key,
ro.property_value,
cp.property_type,
cp.value_type
-- Include ro.priority_level AS source_priority if storing provenance
FROM RankedOverrides ro
JOIN config_property_t cp ON ro.property_id = cp.property_id
JOIN config_t cfg ON cp.config_id = cfg.config_id
-- The join to config_t (aliased cfg) supplies config_phase;
-- it is already required to get config_name.
WHERE ro.rn = 1;
commitConfigInstance
Let's outline the structure of your `commitConfigInstance` service method and the necessary SQL INSERT statements using JDBC.
This involves several steps within a single database transaction:
1. **Generate Snapshot ID:** Create a new UUID for the snapshot.
2. **Derive Scope IDs:** Query live tables (`instance_t`, `product_version_t`, etc.) based on the input `hostId` and `instanceId` to get other relevant scope identifiers (`environment`, `productId`, `productVersionId`, `serviceId`, etc.).
3. **Insert Metadata:** Insert a record into `config_snapshot_t`.
4. **Aggregate Effective Config:** Run the complex aggregation query (using `ROW_NUMBER()` or similar) to get the final effective properties.
5. **Insert Effective Config:** Insert the results from step 4 into `config_snapshot_property_t`.
6. **Snapshot Override Tables:** For each relevant live override table (`instance_property_t`, `instance_api_property_t`, etc.), select its current state (filtered by scope) and insert it into the corresponding `snapshot_*_property_t` table.
7. **Commit/Rollback:** Commit the transaction if all steps succeed, otherwise roll back.
Java Service Method Structure (Conceptual)
import com.github.f4b6a3.uuid.UuidCreator; // For UUIDv7 generation
import javax.sql.DataSource; // Assuming you have a DataSource injected
import java.sql.*;
import java.time.OffsetDateTime;
import java.util.*;
public class ConfigSnapshotService {
private final DataSource ds;
// Inject DataSource via constructor
// Pre-compile your complex aggregation query (modify based on previous examples)
private static final String AGGREGATE_EFFECTIVE_CONFIG_SQL = """
WITH AllOverrides AS (
-- Priority 10: Instance App API (merged) ...
-- Priority 20: Instance API (merged) ...
-- Priority 30: Instance App (merged) ...
-- Priority 40: Instance ...
-- Priority 50: Product Version ...
-- Priority 60: Environment ...
-- Priority 70: Product ...
-- Priority 100: Default ...
),
RankedOverrides AS (
SELECT ..., ROW_NUMBER() OVER (PARTITION BY ao.property_id ORDER BY ao.priority_level ASC) as rn
FROM AllOverrides ao WHERE ao.property_value IS NOT NULL
)
SELECT
cfg.config_phase, -- Phase from config_t (aliased cfg in the JOIN below)
cfg.config_id, -- Added config_id
cp.property_id, -- Added property_id
cp.property_name, -- Added property_name
cp.property_type,
cp.value_type,
cfg.config_name || '.' || cp.property_name AS property_key, -- Keep for logging/debug? Not needed in snapshot table itself
ro.property_value,
ro.priority_level -- To determine source_level
FROM RankedOverrides ro
JOIN config_property_t cp ON ro.property_id = cp.property_id
JOIN config_t cfg ON cp.config_id = cfg.config_id
WHERE ro.rn = 1;
"""; // NOTE: Add parameters (?) for host_id, instance_id, derived IDs etc.
public Result<String> commitConfigInstance(Map<String, Object> event) {
// 1. Extract Input Parameters
UUID hostId = (UUID) event.get("hostId");
UUID instanceId = (UUID) event.get("instanceId");
String snapshotType = (String) event.getOrDefault("snapshotType", "USER_SAVE"); // Default type
String description = (String) event.get("description");
UUID userId = (UUID) event.get("userId"); // May be null
UUID deploymentId = (UUID) event.get("deploymentId"); // May be null
if (hostId == null || instanceId == null) {
return Failure.of(new Status(INVALID_PARAMETER, "hostId and instanceId are required."));
}
UUID snapshotId = UuidCreator.getTimeOrderedEpoch(); // Generate Snapshot ID (e.g., V7)
Connection connection = null;
try {
connection = ds.getConnection();
connection.setAutoCommit(false); // Start Transaction
// 2. Derive Scope IDs
// Query instance_t and potentially product_version_t based on hostId, instanceId
DerivedScope scope = deriveScopeInfo(connection, hostId, instanceId);
if (scope == null) {
connection.rollback(); // Rollback if instance not found
return Failure.of(new Status(OBJECT_NOT_FOUND, "Instance not found for hostId/instanceId."));
}
// 3. Insert Snapshot Metadata
insertSnapshotMetadata(connection, snapshotId, snapshotType, description, userId, deploymentId, hostId, scope);
// 4 & 5. Aggregate and Insert Effective Config
insertEffectiveConfigSnapshot(connection, snapshotId, hostId, instanceId, scope);
// 6. Snapshot Individual Override Tables
// Use INSERT ... SELECT ... for efficiency
snapshotInstanceProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceApiProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceAppProperties(connection, snapshotId, hostId, instanceId);
snapshotInstanceAppApiProperties(connection, snapshotId, hostId, instanceId); // Requires finding relevant App/API IDs first
snapshotEnvironmentProperties(connection, snapshotId, hostId, scope.environment());
snapshotProductVersionProperties(connection, snapshotId, hostId, scope.productVersionId());
snapshotProductProperties(connection, snapshotId, scope.productId());
// Add others as needed
// 7. Commit Transaction
connection.commit();
logger.info("Successfully created config snapshot: {}", snapshotId);
return Success.of(snapshotId.toString());
} catch (SQLException e) {
logger.error("SQLException during snapshot creation for instance {}: {}", instanceId, e.getMessage(), e);
if (connection != null) {
try {
connection.rollback();
} catch (SQLException ex) {
logger.error("Error rolling back transaction:", ex);
}
}
return Failure.of(new Status(SQL_EXCEPTION, "Database error during snapshot creation."));
} catch (Exception e) { // Catch other potential errors (e.g., during scope derivation)
logger.error("Exception during snapshot creation for instance {}: {}", instanceId, e.getMessage(), e);
if (connection != null) {
try { connection.rollback(); } catch (SQLException ex) { logger.error("Error rolling back transaction:", ex); }
}
return Failure.of(new Status(GENERIC_EXCEPTION, "Unexpected error during snapshot creation."));
} finally {
if (connection != null) {
try {
connection.setAutoCommit(true); // Restore default behavior
connection.close();
} catch (SQLException e) {
logger.error("Error closing connection:", e);
}
}
}
}
// --- Helper Methods ---
// Placeholder for derived scope data structure
private record DerivedScope(String environment, String productId, String productVersion, UUID productVersionId, String serviceId /*, add API details if needed */) {}
private DerivedScope deriveScopeInfo(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
// Query instance_t LEFT JOIN product_version_t ... WHERE i.host_id = ? AND i.instance_id = ?
// Extract environment, service_id from instance_t
// Extract product_id, product_version from product_version_t (via product_version_id in instance_t)
// Return new DerivedScope(...) or null if not found
String sql = """
SELECT i.environment, i.service_id, pv.product_id, pv.product_version, i.product_version_id
FROM instance_t i
LEFT JOIN product_version_t pv ON i.host_id = pv.host_id AND i.product_version_id = pv.product_version_id
WHERE i.host_id = ? AND i.instance_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
ps.setObject(2, instanceId);
try (ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
return new DerivedScope(
rs.getString("environment"),
rs.getString("product_id"),
rs.getString("product_version"),
rs.getObject("product_version_id", UUID.class),
rs.getString("service_id")
);
} else {
return null; // Instance not found
}
}
}
}
private void insertSnapshotMetadata(Connection conn, UUID snapshotId, String snapshotType, String description,
UUID userId, UUID deploymentId, UUID hostId, DerivedScope scope) throws SQLException {
String sql = """
INSERT INTO config_snapshot_t
(snapshot_id, snapshot_ts, snapshot_type, description, user_id, deployment_id,
scope_host_id, scope_environment, scope_product_id, scope_product_version_id, -- Changed col name
scope_service_id /*, scope_api_id, scope_api_version - Add if applicable */)
VALUES (?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?, ?, ?, ? /*, ?, ? */)
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setString(2, snapshotType);
ps.setString(3, description);
ps.setObject(4, userId); // setObject handles null correctly
ps.setObject(5, deploymentId); // setObject handles null correctly
ps.setObject(6, hostId);
ps.setString(7, scope.environment());
ps.setString(8, scope.productId());
ps.setObject(9, scope.productVersionId()); // Store the ID
ps.setString(10, scope.serviceId());
// Set API scope if needed ps.setObject(11, ...); ps.setString(12, ...);
ps.executeUpdate();
}
}
private void insertEffectiveConfigSnapshot(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId, DerivedScope scope) throws SQLException {
String insertSql = """
INSERT INTO config_snapshot_property_t
(snapshot_property_id, snapshot_id, config_phase, config_id, property_id, property_name,
property_type, property_value, value_type, source_level)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""";
// Prepare the aggregation query
try (PreparedStatement selectStmt = conn.prepareStatement(AGGREGATE_EFFECTIVE_CONFIG_SQL);
PreparedStatement insertStmt = conn.prepareStatement(insertSql)) {
// Set ALL parameters for the AGGREGATE_EFFECTIVE_CONFIG_SQL query
int paramIndex = 1;
// Example: set parameters based on how AGGREGATE_EFFECTIVE_CONFIG_SQL is structured
// selectStmt.setObject(paramIndex++, hostId);
// selectStmt.setObject(paramIndex++, instanceId);
// ... set derived scope IDs (productVersionId, environment, productId) ...
// ... set parameters for all UNION branches and potential subqueries ...
try (ResultSet rs = selectStmt.executeQuery()) {
int batchCount = 0;
while (rs.next()) {
insertStmt.setObject(1, UuidCreator.getTimeOrderedEpoch()); // snapshot_property_id
insertStmt.setObject(2, snapshotId);
insertStmt.setString(3, rs.getString("config_phase"));
insertStmt.setObject(4, rs.getObject("config_id", UUID.class));
insertStmt.setObject(5, rs.getObject("property_id", UUID.class));
insertStmt.setString(6, rs.getString("property_name"));
insertStmt.setString(7, rs.getString("property_type"));
insertStmt.setString(8, rs.getString("property_value"));
insertStmt.setString(9, rs.getString("value_type"));
insertStmt.setString(10, mapPriorityToSourceLevel(rs.getInt("priority_level"))); // Map numeric priority back to level name
insertStmt.addBatch();
batchCount++;
if (batchCount % 100 == 0) { // Execute batch periodically
insertStmt.executeBatch();
}
}
if (batchCount % 100 != 0) { // Execute remaining batch
insertStmt.executeBatch();
}
}
}
}
// Helper to map priority back to source level name
private String mapPriorityToSourceLevel(int priority) {
return switch (priority) {
case 10 -> "instance_app_api"; // Adjust priorities as used in your query
case 20 -> "instance_api";
case 30 -> "instance_app";
case 40 -> "instance";
case 50 -> "product_version";
case 60 -> "environment";
case 70 -> "product";
case 100 -> "default";
default -> "unknown";
};
}
// --- Methods for Snapshotting Individual Override Tables ---
private void snapshotInstanceProperties(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId) throws SQLException {
String sql = """
INSERT INTO snapshot_instance_property_t
(snapshot_id, host_id, instance_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, instance_id, property_id, property_value, update_user, update_ts
FROM instance_property_t
WHERE host_id = ? AND instance_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setObject(3, instanceId);
ps.executeUpdate();
}
}
private void snapshotInstanceApiProperties(Connection conn, UUID snapshotId, UUID hostId, UUID instanceId) throws SQLException {
// Find relevant instance_api_ids first
List<UUID> apiIds = findRelevantInstanceApiIds(conn, hostId, instanceId);
if (apiIds.isEmpty()) return; // No API overrides for this instance
String sql = """
INSERT INTO snapshot_instance_api_property_t
(snapshot_id, host_id, instance_api_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, instance_api_id, property_id, property_value, update_user, update_ts
FROM instance_api_property_t
WHERE host_id = ? AND instance_api_id = ANY(?) -- Use ANY with array for multiple IDs
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
// Create a SQL Array from the List of UUIDs
Array sqlArray = conn.createArrayOf("UUID", apiIds.toArray());
ps.setArray(3, sqlArray);
ps.executeUpdate();
sqlArray.free(); // Release array resources
}
}
// Similar methods for snapshotInstanceAppProperties, snapshotInstanceAppApiProperties...
// These will need helper methods like findRelevantInstanceApiIds/findRelevantInstanceAppIds
private void snapshotEnvironmentProperties(Connection conn, UUID snapshotId, UUID hostId, String environment) throws SQLException {
if (environment == null || environment.isEmpty()) return; // No environment scope
String sql = """
INSERT INTO snapshot_environment_property_t
(snapshot_id, host_id, environment, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, environment, property_id, property_value, update_user, update_ts
FROM environment_property_t
WHERE host_id = ? AND environment = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setString(3, environment);
ps.executeUpdate();
}
}
private void snapshotProductVersionProperties(Connection conn, UUID snapshotId, UUID hostId, UUID productVersionId) throws SQLException {
if (productVersionId == null) return;
String sql = """
INSERT INTO snapshot_product_version_property_t
(snapshot_id, host_id, product_version_id, property_id, property_value, update_user, update_ts)
SELECT ?, host_id, product_version_id, property_id, property_value, update_user, update_ts
FROM product_version_property_t
WHERE host_id = ? AND product_version_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setObject(3, productVersionId);
ps.executeUpdate();
}
}
private void snapshotProductProperties(Connection conn, UUID snapshotId, String productId) throws SQLException {
if (productId == null || productId.isEmpty()) return;
String sql = """
INSERT INTO snapshot_product_property_t
(snapshot_id, product_id, property_id, property_value, update_user, update_ts)
SELECT ?, product_id, property_id, property_value, update_user, update_ts
FROM product_property_t
WHERE product_id = ?
""";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setString(2, productId);
ps.executeUpdate();
}
}
// --- Helper method to find associated instance_api_ids ---
private List<UUID> findRelevantInstanceApiIds(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
List<UUID> ids = new ArrayList<>();
String sql = "SELECT instance_api_id FROM instance_api_t WHERE host_id = ? AND instance_id = ?";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
ps.setObject(2, instanceId);
try (ResultSet rs = ps.executeQuery()) {
while(rs.next()) {
ids.add(rs.getObject("instance_api_id", UUID.class));
}
}
}
return ids;
}
// --- Add similar helper for findRelevantInstanceAppIds ---
// --- Add similar helper for findRelevantInstanceAppApiIds (if needed) ---
}
SQL INSERT Statements:
- config_snapshot_t:
  INSERT INTO config_snapshot_t (snapshot_id, snapshot_ts, snapshot_type, description, user_id, deployment_id, scope_host_id, scope_environment, scope_product_id, scope_product_version_id, scope_service_id /*, ... other scope cols */)
  VALUES (?, CURRENT_TIMESTAMP, ?, ?, ?, ?, ?, ?, ?, ?, ? /*, ... */)
  (Parameters: snapshotId, snapshotType, description, userId, deploymentId, hostId, environment, productId, productVersionId, serviceId, …)
- config_snapshot_property_t: (Executed in a loop/batch)
  INSERT INTO config_snapshot_property_t (snapshot_property_id, snapshot_id, config_phase, config_id, property_id, property_name, property_type, property_value, value_type, source_level)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
  (Parameters: new UUID, snapshotId, phase, configId, propertyId, propName, propType, propValue, valType, sourceLevelString)
- snapshot_instance_property_t:
  INSERT INTO snapshot_instance_property_t (snapshot_id, host_id, instance_id, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, instance_id, property_id, property_value, update_user, update_ts
  FROM instance_property_t
  WHERE host_id = ? AND instance_id = ?
  (Parameters: snapshotId, hostId, instanceId)
- snapshot_instance_api_property_t:
  INSERT INTO snapshot_instance_api_property_t (snapshot_id, host_id, instance_api_id, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, instance_api_id, property_id, property_value, update_user, update_ts
  FROM instance_api_property_t
  WHERE host_id = ? AND instance_api_id = ANY(?) -- Parameter is a SQL Array of relevant instance_api_ids
  (Parameters: snapshotId, hostId, SQL Array of instance_api_ids)
- snapshot_instance_app_property_t:
  INSERT INTO snapshot_instance_app_property_t (snapshot_id, host_id, instance_app_id, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, instance_app_id, property_id, property_value, update_user, update_ts
  FROM instance_app_property_t
  WHERE host_id = ? AND instance_app_id = ANY(?) -- Parameter is a SQL Array of relevant instance_app_ids
  (Parameters: snapshotId, hostId, SQL Array of instance_app_ids)
- snapshot_instance_app_api_property_t:
  INSERT INTO snapshot_instance_app_api_property_t (snapshot_id, host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts
  FROM instance_app_api_property_t
  WHERE host_id = ?
  AND instance_app_id = ANY(?) -- SQL Array of relevant instance_app_ids
  AND instance_api_id = ANY(?) -- SQL Array of relevant instance_api_ids
  (Parameters: snapshotId, hostId, SQL Array of instance_app_ids, SQL Array of instance_api_ids)
- snapshot_environment_property_t:
  INSERT INTO snapshot_environment_property_t (snapshot_id, host_id, environment, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, environment, property_id, property_value, update_user, update_ts
  FROM environment_property_t
  WHERE host_id = ? AND environment = ?
  (Parameters: snapshotId, hostId, environment)
- snapshot_product_version_property_t:
  INSERT INTO snapshot_product_version_property_t (snapshot_id, host_id, product_version_id, property_id, property_value, update_user, update_ts)
  SELECT ?, host_id, product_version_id, property_id, property_value, update_user, update_ts
  FROM product_version_property_t
  WHERE host_id = ? AND product_version_id = ?
  (Parameters: snapshotId, hostId, productVersionId)
- snapshot_product_property_t:
  INSERT INTO snapshot_product_property_t (snapshot_id, product_id, property_id, property_value, update_user, update_ts)
  SELECT ?, product_id, property_id, property_value, update_user, update_ts
  FROM product_property_t
  WHERE product_id = ?
  (Parameters: snapshotId, productId)
Important Considerations:
- Transaction Management: Ensure all these inserts happen within a single atomic transaction.
- Error Handling: Robust error handling and rollback are critical.
- Performance: The aggregation query can be heavy. Consider its performance impact. The INSERT ... SELECT statements should be reasonably efficient if the source tables are properly indexed on their scope columns.
- Parameterization: Use PreparedStatement for all queries to prevent SQL injection and improve performance.
- Scope Derivation Logic: Ensure the logic in deriveScopeInfo and the helpers like findRelevantInstanceApiIds is correct and handles cases where relationships might not exist (e.g., an instance with no associated APIs).
- UUID Generation: Use a suitable UUID generator (like UUIDv7 via uuid-creator).
- Logging: Add clear logging for snapshot creation success/failure.
- Refinement: You’ll need to adapt the AGGREGATE_EFFECTIVE_CONFIG_SQL query based on the final version using ROW_NUMBER and correctly incorporate parameters.
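The ROW_NUMBER ranking in the aggregation query keeps, for each property_id, the non-null override with the lowest priority_level. As a rough in-memory illustration of that rule (not the query itself), the sketch below resolves a list of overrides in Java. The OverrideRow record and the resolve method are hypothetical names introduced here; the priority numbers follow the ones used in mapPriorityToSourceLevel, and the property names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EffectiveConfigSketch {
    // Hypothetical in-memory stand-in for one row of the AllOverrides CTE.
    record OverrideRow(String propertyId, String propertyValue, int priorityLevel) {}

    // Keep, per property, the non-null override with the lowest priority level.
    // This is what "WHERE rn = 1" selects after ordering by priority_level ASC.
    static Map<String, OverrideRow> resolve(List<OverrideRow> all) {
        Map<String, OverrideRow> winners = new LinkedHashMap<>();
        for (OverrideRow row : all) {
            if (row.propertyValue() == null) {
                continue; // mirrors "property_value IS NOT NULL"
            }
            winners.merge(row.propertyId(), row,
                (current, candidate) ->
                    candidate.priorityLevel() < current.priorityLevel() ? candidate : current);
        }
        return winners;
    }

    public static void main(String[] args) {
        List<OverrideRow> all = new ArrayList<>();
        all.add(new OverrideRow("server.port", "8080", 100));  // default level
        all.add(new OverrideRow("server.port", "8443", 40));   // instance level
        all.add(new OverrideRow("server.host", null, 40));     // skipped: null value
        all.add(new OverrideRow("server.host", "0.0.0.0", 70)); // product level
        resolve(all).values().forEach(row ->
            System.out.println(row.propertyId() + "=" + row.propertyValue()
                + " (priority " + row.priorityLevel() + ")"));
    }
}
```

When two overrides share the same priority, this sketch keeps the first one it sees; the SQL version leaves that choice to the database unless the ORDER BY is given a tie-breaker.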
rollbackConfigInstance
The rollbackConfigInstance method below implements the DELETE/INSERT strategy: it restores the instance-level and related sub-level overrides from a snapshot.
Assumptions:
- “Rolling back an instance” means restoring the overrides defined specifically for that instance and its associated APIs, Apps, and App-API combinations. It does not modify higher-level overrides (Environment, Product Version, Product).
- The snapshot_*_property_t tables accurately store the state of the corresponding live tables at the time the snapshot was taken.
- The necessary helper methods like findRelevantInstanceApiIds and findRelevantInstanceAppIds exist (examples provided).
import com.github.f4b6a3.uuid.UuidCreator; // If needed for audit logging ID
import javax.sql.DataSource;
import java.sql.*;
import java.util.*;
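public class ConfigRollbackService {
// Assumes SLF4J for logging; fully qualified so the import list above stays unchanged
private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ConfigRollbackService.class);
private final DataSource ds;
// Inject DataSource via constructor
public ConfigRollbackService(DataSource ds) {
this.ds = ds;
}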
// --- SQL Templates ---
// DELETE Statements (Targeting LIVE tables)
private static final String DELETE_INSTANCE_PROPS_SQL = "DELETE FROM instance_property_t WHERE host_id = ? AND instance_id = ?";
private static final String DELETE_INSTANCE_API_PROPS_SQL = "DELETE FROM instance_api_property_t WHERE host_id = ? AND instance_api_id = ANY(?)";
private static final String DELETE_INSTANCE_APP_PROPS_SQL = "DELETE FROM instance_app_property_t WHERE host_id = ? AND instance_app_id = ANY(?)";
private static final String DELETE_INSTANCE_APP_API_PROPS_SQL = "DELETE FROM instance_app_api_property_t WHERE host_id = ? AND instance_app_id = ANY(?) AND instance_api_id = ANY(?)";
// INSERT ... SELECT Statements (From SNAPSHOT tables to LIVE tables)
private static final String INSERT_INSTANCE_PROPS_SQL = """
INSERT INTO instance_property_t
(host_id, instance_id, property_id, property_value, update_user, update_ts)
SELECT host_id, instance_id, property_id, property_value, update_user, update_ts
FROM snapshot_instance_property_t
WHERE snapshot_id = ? AND host_id = ? AND instance_id = ?
""";
private static final String INSERT_INSTANCE_API_PROPS_SQL = """
INSERT INTO instance_api_property_t
(host_id, instance_api_id, property_id, property_value, update_user, update_ts)
SELECT host_id, instance_api_id, property_id, property_value, update_user, update_ts
FROM snapshot_instance_api_property_t
WHERE snapshot_id = ? AND host_id = ? AND instance_api_id = ANY(?)
""";
private static final String INSERT_INSTANCE_APP_PROPS_SQL = """
INSERT INTO instance_app_property_t
(host_id, instance_app_id, property_id, property_value, update_user, update_ts)
SELECT host_id, instance_app_id, property_id, property_value, update_user, update_ts
FROM snapshot_instance_app_property_t
WHERE snapshot_id = ? AND host_id = ? AND instance_app_id = ANY(?)
""";
private static final String INSERT_INSTANCE_APP_API_PROPS_SQL = """
INSERT INTO instance_app_api_property_t
(host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts)
SELECT host_id, instance_app_id, instance_api_id, property_id, property_value, update_user, update_ts
FROM snapshot_instance_app_api_property_t
WHERE snapshot_id = ? AND host_id = ? AND instance_app_id = ANY(?) AND instance_api_id = ANY(?)
""";
public Result<String> rollbackConfigInstance(Map<String, Object> event) {
// 1. Extract Input Parameters
UUID snapshotId = (UUID) event.get("snapshotId");
UUID hostId = (UUID) event.get("hostId");
UUID instanceId = (UUID) event.get("instanceId");
UUID userId = (UUID) event.get("userId"); // For potential auditing
String description = (String) event.get("rollbackDescription"); // Optional reason
if (snapshotId == null || hostId == null || instanceId == null) {
return Failure.of(new Status(INVALID_PARAMETER, "snapshotId, hostId, and instanceId are required."));
}
Connection connection = null;
List<UUID> currentApiIds = null;
List<UUID> currentAppIds = null;
try {
connection = ds.getConnection();
connection.setAutoCommit(false); // Start Transaction
// --- Pre-computation: Find CURRENT associated IDs for DELETE scope ---
// It's generally safer to delete based on current relationships and then
// insert based on snapshot relationships if they could have diverged.
currentApiIds = findRelevantInstanceApiIds(connection, hostId, instanceId);
currentAppIds = findRelevantInstanceAppIds(connection, hostId, instanceId);
// Note: InstanceAppApi requires both lists.
logger.info("Starting rollback for instance {} (host {}) to snapshot {}", instanceId, hostId, snapshotId);
// --- Execute Deletes from LIVE tables ---
executeDelete(connection, DELETE_INSTANCE_PROPS_SQL, hostId, instanceId);
if (!currentApiIds.isEmpty()) {
executeDeleteWithArray(connection, DELETE_INSTANCE_API_PROPS_SQL, hostId, currentApiIds);
// Also delete AppApi props related to these APIs if apps also exist
if (!currentAppIds.isEmpty()) {
executeDeleteWithTwoArrays(connection, DELETE_INSTANCE_APP_API_PROPS_SQL, hostId, currentAppIds, currentApiIds);
}
}
if (!currentAppIds.isEmpty()) {
executeDeleteWithArray(connection, DELETE_INSTANCE_APP_PROPS_SQL, hostId, currentAppIds);
// App-API rows reference both an app and an API, so the two-array delete above
// is expected to cover them. When the instance has no APIs, there should be no
// App-API rows left to delete here.
}
// --- Execute Inserts from SNAPSHOT tables ---
executeInsertSelect(connection, INSERT_INSTANCE_PROPS_SQL, snapshotId, hostId, instanceId);
// The array-based inserts are filtered by snapshot_id AND by the CURRENT API/App IDs.
// Snapshot rows for relationships that still exist are restored; rows whose
// instance_api_id or instance_app_id is no longer linked to this instance are skipped.
// If those rows must be restored as well, derive the ID lists from the snapshot tables instead.
if (!currentApiIds.isEmpty()) { // Use currentApiIds to decide IF we run the insert query
executeInsertSelectWithArray(connection, INSERT_INSTANCE_API_PROPS_SQL, snapshotId, hostId, currentApiIds);
if (!currentAppIds.isEmpty()) {
executeInsertSelectWithTwoArrays(connection, INSERT_INSTANCE_APP_API_PROPS_SQL, snapshotId, hostId, currentAppIds, currentApiIds);
}
}
if (!currentAppIds.isEmpty()) { // Use currentAppIds to decide IF we run the insert query
executeInsertSelectWithArray(connection, INSERT_INSTANCE_APP_PROPS_SQL, snapshotId, hostId, currentAppIds);
// App-API rows are restored above, inside the API branch, because that insert needs both ID lists.
}
// --- Optional: Audit Logging ---
// logRollbackActivity(connection, snapshotId, hostId, instanceId, userId, description);
// --- Commit Transaction ---
connection.commit();
logger.info("Successfully rolled back instance {} (host {}) to snapshot {}", instanceId, hostId, snapshotId);
return Success.of("Rollback successful to snapshot " + snapshotId);
} catch (SQLException e) {
logger.error("SQLException during rollback for instance {} to snapshot {}: {}", instanceId, snapshotId, e.getMessage(), e);
if (connection != null) {
try {
connection.rollback();
logger.warn("Transaction rolled back for instance {} snapshot {}", instanceId, snapshotId);
} catch (SQLException ex) {
logger.error("Error rolling back transaction:", ex);
}
}
return Failure.of(new Status(SQL_EXCEPTION, "Database error during rollback operation."));
} catch (Exception e) { // Catch other potential errors
logger.error("Exception during rollback for instance {} to snapshot {}: {}", instanceId, snapshotId, e.getMessage(), e);
if (connection != null) {
try { connection.rollback(); } catch (SQLException ex) { logger.error("Error rolling back transaction:", ex); }
}
return Failure.of(new Status(GENERIC_EXCEPTION, "Unexpected error during rollback operation."));
} finally {
if (connection != null) {
try {
connection.setAutoCommit(true); // Restore default behavior
connection.close();
} catch (SQLException e) {
logger.error("Error closing connection:", e);
}
}
}
}
// --- Helper Methods for Execution ---
private void executeDelete(Connection conn, String sql, UUID hostId, UUID instanceId) throws SQLException {
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
ps.setObject(2, instanceId);
int rowsAffected = ps.executeUpdate();
logger.debug("Deleted {} rows from {} for instance {}", rowsAffected, getTableNameFromDeleteSql(sql), instanceId);
}
}
private void executeDeleteWithArray(Connection conn, String sql, UUID hostId, List<UUID> idList) throws SQLException {
if (idList == null || idList.isEmpty()) return; // Nothing to delete if list is empty
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
Array sqlArray = conn.createArrayOf("UUID", idList.toArray());
ps.setArray(2, sqlArray);
int rowsAffected = ps.executeUpdate();
logger.debug("Deleted {} rows from {} for {} IDs", rowsAffected, getTableNameFromDeleteSql(sql), idList.size());
sqlArray.free();
}
}
private void executeDeleteWithTwoArrays(Connection conn, String sql, UUID hostId, List<UUID> idList1, List<UUID> idList2) throws SQLException {
if (idList1 == null || idList1.isEmpty() || idList2 == null || idList2.isEmpty()) return;
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, hostId);
Array sqlArray1 = conn.createArrayOf("UUID", idList1.toArray());
Array sqlArray2 = conn.createArrayOf("UUID", idList2.toArray());
ps.setArray(2, sqlArray1);
ps.setArray(3, sqlArray2);
int rowsAffected = ps.executeUpdate();
logger.debug("Deleted {} rows from {} for {}x{} IDs", rowsAffected, getTableNameFromDeleteSql(sql), idList1.size(), idList2.size());
sqlArray1.free();
sqlArray2.free();
}
}
private void executeInsertSelect(Connection conn, String sql, UUID snapshotId, UUID hostId, UUID instanceId) throws SQLException {
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
ps.setObject(3, instanceId);
int rowsAffected = ps.executeUpdate();
logger.debug("Inserted {} rows into {} from snapshot {}", rowsAffected, getTableNameFromInsertSql(sql), snapshotId);
}
}
private void executeInsertSelectWithArray(Connection conn, String sql, UUID snapshotId, UUID hostId, List<UUID> idList) throws SQLException {
if (idList == null || idList.isEmpty()) return; // No scope to insert for
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
Array sqlArray = conn.createArrayOf("UUID", idList.toArray());
ps.setArray(3, sqlArray);
int rowsAffected = ps.executeUpdate();
logger.debug("Inserted {} rows into {} from snapshot {} for {} IDs", rowsAffected, getTableNameFromInsertSql(sql), snapshotId, idList.size());
sqlArray.free();
}
}
private void executeInsertSelectWithTwoArrays(Connection conn, String sql, UUID snapshotId, UUID hostId, List<UUID> idList1, List<UUID> idList2) throws SQLException {
if (idList1 == null || idList1.isEmpty() || idList2 == null || idList2.isEmpty()) return;
try (PreparedStatement ps = conn.prepareStatement(sql)) {
ps.setObject(1, snapshotId);
ps.setObject(2, hostId);
Array sqlArray1 = conn.createArrayOf("UUID", idList1.toArray());
Array sqlArray2 = conn.createArrayOf("UUID", idList2.toArray());
ps.setArray(3, sqlArray1);
ps.setArray(4, sqlArray2);
int rowsAffected = ps.executeUpdate();
logger.debug("Inserted {} rows into {} from snapshot {} for {}x{} IDs", rowsAffected, getTableNameFromInsertSql(sql), snapshotId, idList1.size(), idList2.size());
sqlArray1.free();
sqlArray2.free();
}
}
// --- Helper methods to find associated IDs (same as before) ---
private List<UUID> findRelevantInstanceApiIds(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
// ... implementation ...
}
private List<UUID> findRelevantInstanceAppIds(Connection conn, UUID hostId, UUID instanceId) throws SQLException {
// ... implementation ...
}
// --- Optional: Helper to get table name from SQL for logging ---
private String getTableNameFromDeleteSql(String sql) {
// Simple parsing, might need adjustment
try { return sql.split("FROM ")[1].split(" ")[0]; } catch (Exception e) { return "[unknown table]"; }
}
private String getTableNameFromInsertSql(String sql) {
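// Split on any whitespace: in the text-block templates the table name is followed by a newline, not a space
try { return sql.split("INTO ")[1].split("\\s+")[0]; } catch (Exception e) { return "[unknown table]"; }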
}
// --- Optional: Audit Logging Method ---
// private void logRollbackActivity(Connection conn, UUID snapshotId, ...) throws SQLException { ... }
}
Explanation:
- Parameter Extraction: Gets snapshotId, hostId, instanceId.
- Transaction Control: Sets setAutoCommit(false) at the start and uses commit() or rollback().
- Find Current Related IDs: Queries instance_api_t and instance_app_t to find the currently associated APIs and Apps for the target instance. This is needed to define the scope for the DELETE operations.
- Execute Deletes: Uses helper methods to execute DELETE statements against the live override tables (instance_property_t, instance_api_property_t, etc.), filtering by hostId, instanceId, and the arrays of currentApiIds/currentAppIds where applicable. This clears the current overrides within the specific scope.
- Execute Inserts: Uses helper methods to execute INSERT INTO [live_table] SELECT ... FROM [snapshot_table] statements.
  - These statements select data from the snapshot_*_property_t tables.
  - They are filtered by the target snapshotId.
  - Crucially, they are also filtered by hostId, instanceId, and the relevant instance_api_id/instance_app_id arrays. This ensures only the overrides relevant to this instance within that snapshot are inserted.
- SQL Arrays: Uses connection.createArrayOf("UUID", ...) to pass lists of IDs efficiently to PostgreSQL’s ANY(?) operator. Remember to free() the array resources.
- Helpers: Includes helper methods for executing the different types of DELETE/INSERT statements and for finding the related API/App IDs.
- Error Handling: Uses try-catch-finally to manage the transaction and connection state.
- Logging: Added basic logging for start, success, errors, and rows affected.
- Scope Focus: This implementation focuses the rollback on the instance and its direct sub-levels (API, App, App-API). It assumes rolling back an instance doesn’t automatically mean rolling back shared Environment or Product Version overrides.
This DELETE/INSERT strategy restores the override state of one instance from a snapshot inside a single transaction. Test it thoroughly, especially for instances whose API or App associations have changed since the snapshot was taken.
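To make the scoping of the DELETE and INSERT ... SELECT steps concrete, here is a small in-memory sketch of the same logic for the API-level overrides. It is only an illustration under simplifying assumptions: the ApiOverride record, the rollback method, and the string IDs are hypothetical and stand in for rows of instance_api_property_t and snapshot_instance_api_property_t.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RollbackScopeSketch {
    // Hypothetical in-memory row of instance_api_property_t or its snapshot copy.
    record ApiOverride(String instanceApiId, String propertyId, String value) {}

    // Step 1: drop live rows whose API is currently linked to the instance (the DELETE).
    // Step 2: add snapshot rows filtered by the same current IDs (the INSERT ... SELECT).
    static List<ApiOverride> rollback(List<ApiOverride> live,
                                      List<ApiOverride> snapshot,
                                      Set<String> currentApiIds) {
        List<ApiOverride> result = new ArrayList<>();
        for (ApiOverride row : live) {
            if (!currentApiIds.contains(row.instanceApiId())) {
                result.add(row); // outside the delete scope, left untouched
            }
        }
        for (ApiOverride row : snapshot) {
            if (currentApiIds.contains(row.instanceApiId())) {
                result.add(row); // restored from the snapshot
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ApiOverride> live = List.of(new ApiOverride("api-1", "timeout", "5000"));
        List<ApiOverride> snapshot = List.of(
            new ApiOverride("api-1", "timeout", "3000"),
            new ApiOverride("api-2", "timeout", "1000")); // api-2 was unlinked after the snapshot
        System.out.println(rollback(live, snapshot, Set.of("api-1")));
    }
}
```

In this example the api-1 override returns to its snapshot value, while the api-2 row is not restored because api-2 is no longer in the current ID list. Whether that is the desired behavior depends on how relationship changes should be treated during a rollback.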
Instance File Config Phase
Overview
instance_file_t stores instance-specific files that are not modeled as standard config_property_t rows. Examples include API specifications such as openapi.yaml and custom certificates or supporting files.
The config snapshot model currently separates two kinds of file data:
- Standard files are flattened into config_snapshot_property_t.
- Non-standard instance files are copied into snapshot_instance_file_t.
The /config-server/files endpoint must return both sets. It already filters standard files by config_phase through config_snapshot_property_t.config_phase, but instance_file_t and snapshot_instance_file_t do not currently carry config_phase. That makes it impossible to union the two sources while preserving runtime, deployment, and generator phase semantics.
Problem
When a service starts through DefaultConfigLoader, it calls /config-server/files with host, serviceId, and envTag. The endpoint resolves the current snapshot and returns the files that should be written into /config.
For the sidecar case, openapi.yaml exists in both instance_file_t and snapshot_instance_file_t, but it does not exist in config_snapshot_property_t. Since the current /files query reads only config_snapshot_property_t, the response does not include openapi.yaml, and the sidecar cannot write it to /config.
The correct endpoint behavior is:
- Read standard files from config_snapshot_property_t.
- Read non-standard files from snapshot_instance_file_t.
- Filter both sources by the requested config phase.
- Return one filename-to-base64-content map.
Decision
Add config_phase to both runtime and snapshot instance file tables:
- instance_file_t.config_phase
- snapshot_instance_file_t.config_phase
The allowed values should match config_t.config_phase:
- G: generator
- D: deployment
- R: runtime
The default value for existing and new rows should be R, because current instance files are consumed by runtime startup unless explicitly marked otherwise.
Schema Changes
Runtime Table
ALTER TABLE instance_file_t
ADD COLUMN config_phase CHAR(1) NOT NULL DEFAULT 'R';
ALTER TABLE instance_file_t
ADD CHECK (config_phase IN ('G', 'D', 'R'));
ALTER TABLE instance_file_t
DROP CONSTRAINT IF EXISTS instance_file_uk;
ALTER TABLE instance_file_t
ADD CONSTRAINT instance_file_uk
UNIQUE (host_id, instance_id, config_phase, v_file_name);
The unique constraint must include config_phase so the same filename can exist separately for runtime and deployment if needed.
Snapshot Table
ALTER TABLE snapshot_instance_file_t
ADD COLUMN config_phase CHAR(1) NOT NULL DEFAULT 'R';
ALTER TABLE snapshot_instance_file_t
ADD CHECK (config_phase IN ('G', 'D', 'R'));
CREATE INDEX idx_snap_inst_file_phase
ON snapshot_instance_file_t (snapshot_id, config_phase, file_type, active);
The primary key can remain (snapshot_id, host_id, instance_file_id) because instance_file_id identifies the copied runtime row. The phase-aware index supports config-server lookups.
Migration
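Existing rows should be backfilled to runtime. Because the column is added as NOT NULL DEFAULT 'R', PostgreSQL already fills existing rows with R when the ALTER TABLE runs; the explicit updates below are only needed if the column is first added as nullable and the NOT NULL constraint is applied in a later step: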
UPDATE instance_file_t
SET config_phase = 'R'
WHERE config_phase IS NULL;
UPDATE snapshot_instance_file_t
SET config_phase = 'R'
WHERE config_phase IS NULL;
If a historical custom file was actually intended for deployment or generator use, it must be corrected explicitly after migration. There is no reliable way to infer that from the current schema.
Snapshot Creation
create_snapshot must copy config_phase from instance_file_t into snapshot_instance_file_t.
Current copy shape:
INSERT INTO snapshot_instance_file_t (
snapshot_id, host_id, instance_file_id, instance_id, file_type,
file_name, file_value, file_desc, expiration_ts,
aggregate_version, active, update_user, update_ts
)
SELECT
p_snapshot_id, t.host_id, t.instance_file_id, t.instance_id, t.file_type,
t.file_name, t.file_value, t.file_desc, t.expiration_ts,
t.aggregate_version, t.active, t.update_user, t.update_ts
FROM instance_file_t t
WHERE t.host_id = p_host_id
AND t.instance_id = p_instance_id
AND t.active = TRUE;
Target copy shape:
INSERT INTO snapshot_instance_file_t (
snapshot_id, host_id, instance_file_id, instance_id, config_phase,
file_type, file_name, file_value, file_desc, expiration_ts,
aggregate_version, active, update_user, update_ts
)
SELECT
p_snapshot_id, t.host_id, t.instance_file_id, t.instance_id, t.config_phase,
t.file_type, t.file_name, t.file_value, t.file_desc, t.expiration_ts,
t.aggregate_version, t.active, t.update_user, t.update_ts
FROM instance_file_t t
WHERE t.host_id = p_host_id
AND t.instance_id = p_instance_id
AND t.active = TRUE;
Snapshot creation should continue copying all active instance files for the instance. Consumers filter by phase when reading.
Config Server Query
The /files endpoint should union standard files and non-standard instance files for the current snapshot.
Standard files:
SELECT
p.source_level AS source,
c.config_name,
p.property_name,
p.value_type,
p.property_value,
10 AS source_rank
FROM config_snapshot_property_t p
JOIN config_snapshot_t cs ON cs.snapshot_id = p.snapshot_id
JOIN config_t c ON c.config_id = p.config_id
JOIN host_t h ON cs.host_id = h.host_id
WHERE h.sub_domain || '.' || h.domain = ?
AND cs.current = TRUE
AND p.config_phase = ?
AND p.property_type = 'File'
AND cs.service_id = ?
AND cs.environment = ?
Non-standard instance files:
SELECT
'instance_file' AS source,
'files' AS config_name,
f.file_name AS property_name,
'string' AS value_type,
f.file_value AS property_value,
100 AS source_rank
FROM snapshot_instance_file_t f
JOIN config_snapshot_t cs
ON cs.snapshot_id = f.snapshot_id
AND cs.host_id = f.host_id
AND cs.instance_id = f.instance_id
JOIN host_t h ON h.host_id = cs.host_id
WHERE h.sub_domain || '.' || h.domain = ?
AND cs.current = TRUE
AND f.config_phase = ?
AND f.file_type = 'File'
AND f.active = TRUE
AND cs.service_id = ?
AND cs.environment = ?
The implementation can combine these with UNION ALL. If the same filename appears in both sources, the instance file should win because it is the instance-specific override. Java can enforce this by inserting standard rows first and custom rows second into the response map. SQL can enforce it with source_rank and DISTINCT ON (property_name) if the response is assembled directly from a result set.
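The Java-side precedence rule described above can be sketched as follows. This is a minimal illustration, not the config server's actual code: the FileResponseAssembler class and the FileRow holder are hypothetical names, and the only things taken from the queries are the property_name, property_value, and source_rank aliases.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: merges file rows so that instance files override standard files. */
final class FileResponseAssembler {

    /** One row of the combined result; fields mirror the query aliases. */
    static final class FileRow {
        final String propertyName;
        final String propertyValue;
        final int sourceRank;

        FileRow(String propertyName, String propertyValue, int sourceRank) {
            this.propertyName = propertyName;
            this.propertyValue = propertyValue;
            this.sourceRank = sourceRank;
        }
    }

    /**
     * Builds the filename-to-content map. Rows are applied in ascending
     * source_rank order, so a rank-100 instance file replaces a rank-10
     * standard file that has the same filename.
     */
    static Map<String, String> assemble(List<FileRow> rows) {
        Map<String, String> files = new LinkedHashMap<>();
        rows.stream()
            .sorted(Comparator.comparingInt((FileRow r) -> r.sourceRank))
            .forEach(r -> files.put(r.propertyName, r.propertyValue));
        return files;
    }
}
```

Sorting by rank before inserting means the result does not depend on the order in which the database returns the UNION ALL rows.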
The same model should be applied to /certs with property_type = 'Cert' and file_type = 'Cert', because instance_file_t.file_type already supports certificates.
API and Event Changes
All create, update, query, and replay paths for instance files should include configPhase.
Required behavior:
- New create/update requests accept configPhase.
- Missing configPhase defaults to R for backward compatibility.
- Created and updated events include configPhase.
- Replay of historical events defaults missing configPhase to R.
- Query responses expose configPhase.
- UI forms and grids allow the operator to choose or filter by phase.
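The defaulting rule for create, update, and replay paths could look like the sketch below. The ConfigPhaseResolver class and the idea that the payload arrives as a Map are assumptions made for illustration; only the configPhase key, the G/D/R values, and the R default come from the design above.

```java
import java.util.Map;

/** Hypothetical helper that resolves configPhase from a request or event payload. */
final class ConfigPhaseResolver {

    /** Default phase for payloads (including historical events) without configPhase. */
    static final String DEFAULT_PHASE = "R";

    /**
     * Returns the phase to persist: R when the key is absent, otherwise the
     * supplied value if it is one of G, D, or R. Any other value is rejected
     * so it never reaches the CHECK constraint on the table.
     */
    static String resolve(Map<String, Object> payload) {
        Object raw = payload.get("configPhase");
        if (raw == null) {
            return DEFAULT_PHASE;
        }
        String phase = raw.toString();
        switch (phase) {
            case "G":
            case "D":
            case "R":
                return phase;
            default:
                throw new IllegalArgumentException("Unsupported configPhase: " + phase);
        }
    }
}
```

Using the same resolver for live requests and for event replay keeps old events, which carry no configPhase, consistent with the database default.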
Code Impact
Expected implementation surfaces:
- portal-db/postgres/ddl.sql
- portal-db/postgres/ddl-dbvis.sql
- New portal-db/postgres/patch_*.sql
- portal-db/postgres/sp_tr_fn.sql
- light-portal/db-provider persistence for create, update, query, snapshot, clone, and replay flows
- light-config-server snapshot /files and /certs query behavior through ConfigServerQueryPersistenceImpl
- portal-service/crates/portal-core snapshot file and cert queries
- portal-service/apps/config-server response assembly if duplicate precedence is handled outside SQL
- portal-view schemas/forms/pages for instance files
Validation
Minimum checks:
- Create or migrate an instance file named openapi.yaml with config_phase = 'R'.
- Create a snapshot for the instance.
- Verify snapshot_instance_file_t has the same config_phase.
- Call /config-server/files?host=dev.lightapi.net&serviceId=...&envTag=dev.
- Confirm the response contains both standard files such as logback.xml and non-standard files such as openapi.yaml.
- Start a sidecar with DefaultConfigLoader and confirm /config/openapi.yaml is written.
Regression tests should cover:
- Existing instance files default to runtime.
- Same filename can exist in different phases.
- /files filters out non-matching phases.
- Custom instance files override standard files with the same filename.
- Java and Rust config-server implementations return the same file keys.
Out of Scope
This change does not move non-standard files into config_snapshot_property_t. Keeping them in snapshot_instance_file_t preserves the distinction between modeled config properties and instance-specific file artifacts.
Deployment
The deployment service allows users to deploy and manage their configured light products. It is used by application and API developers and by operations teams.
The deployment service contains pipeline management, platform management and deployment management. It also integrates with product management and instance management services.
Timestamp
This section describes the recommended way to persist Java’s OffsetDateTime in PostgreSQL.
1. Best Database Column Type: TIMESTAMP WITH TIME ZONE (or TIMESTAMPTZ)
This is unequivocally the best choice in PostgreSQL for storing OffsetDateTime objects. Here’s why:
- Preserves the Instant: OffsetDateTime represents a specific instant in time with an offset from UTC. TIMESTAMPTZ is designed precisely for this.
- UTC Normalization: When you insert a value into a TIMESTAMPTZ column, PostgreSQL uses the provided offset to normalize the timestamp and stores it internally as UTC. This is crucial for correctly representing the absolute point in time, regardless of the original offset.
- Automatic Conversion on Retrieval: When you select data from a TIMESTAMPTZ column, PostgreSQL automatically converts the stored UTC value back to the current session’s timezone setting (TimeZone parameter). Your JDBC driver then maps this appropriately.
- Avoids Ambiguity: Using TIMESTAMPTZ prevents the ambiguity that can arise with TIMESTAMP WITHOUT TIME ZONE, where the lack of offset/timezone information can lead to incorrect interpretations depending on server and client settings.
Why NOT TIMESTAMP WITHOUT TIME ZONE (or TIMESTAMP)?
- This type stores the date and time literally as provided, discarding any timezone or offset information.
- If you store an OffsetDateTime’s local date/time part into this column, you lose the offset, making it impossible to know the exact instant it represents globally. This is generally incorrect for OffsetDateTime.
2. How to Convert (JDBC)
Modern JDBC drivers (PostgreSQL JDBC driver versions supporting JDBC 4.2+, which is most versions used today) handle the conversion automatically and correctly when you use the appropriate methods.
Persisting (Saving):
- Use PreparedStatement.setObject(int parameterIndex, Object x) and pass the OffsetDateTime as the value.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
// Assume 'conn' is your established JDBC Connection
public void saveEventTime(Connection conn, int eventId, OffsetDateTime eventTime) throws SQLException {
// Use TIMESTAMPTZ in your table definition
String sql = "UPDATE events SET event_timestamp = ? WHERE id = ?";
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
// Use setObject for OffsetDateTime - the driver handles conversion
pstmt.setObject(1, eventTime); // Pass the OffsetDateTime directly
pstmt.setInt(2, eventId);
pstmt.executeUpdate();
}
}
// Example Usage:
// OffsetDateTime nowWithOffset = OffsetDateTime.now(); // Uses system default offset
// OffsetDateTime specificTime = OffsetDateTime.of(2023, 10, 27, 10, 30, 0, 0, ZoneOffset.ofHours(-4));
// saveEventTime(connection, 1, specificTime);
The JDBC driver sends the OffsetDateTime (including its offset) to PostgreSQL. PostgreSQL’s TIMESTAMPTZ type normalizes this to UTC for storage.
Retrieving (Loading):
- Use ResultSet.getObject(int columnIndex, OffsetDateTime.class) or ResultSet.getObject(String columnLabel, OffsetDateTime.class)
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.OffsetDateTime;
public OffsetDateTime loadEventTime(Connection conn, int eventId) throws SQLException {
String sql = "SELECT event_timestamp FROM events WHERE id = ?";
OffsetDateTime eventTime = null;
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
pstmt.setInt(1, eventId);
try (ResultSet rs = pstmt.executeQuery()) {
if (rs.next()) {
// Use getObject with the target class - the driver handles conversion
eventTime = rs.getObject("event_timestamp", OffsetDateTime.class);
}
}
}
return eventTime;
}
// Example Usage:
// OffsetDateTime retrievedTime = loadEventTime(connection, 1);
// if (retrievedTime != null) {
// System.out.println("Retrieved: " + retrievedTime);
// // Note: The offset might be different from the original, because the
// // PostgreSQL JDBC driver returns OffsetDateTime values in UTC,
// // but it represents the SAME instant in time.
// }
When retrieving, PostgreSQL sends the stored UTC timestamp, and the JDBC driver converts it into an OffsetDateTime representing the same instant. The PostgreSQL JDBC driver documents that OffsetDateTime values are returned in UTC (offset 0), so convert to another offset or zone in application code if the original offset matters for display.
Summary:
- Database Column: Use TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ).
- Persisting (Java -> DB): Use PreparedStatement.setObject(index, yourOffsetDateTime).
- Retrieving (DB -> Java): Use ResultSet.getObject(column, OffsetDateTime.class).
- JDBC Driver: Ensure you are using a modern PostgreSQL JDBC driver that supports JDBC 4.2 / Java 8 Time API.
- Session Timezone: Be aware that the OffsetDateTime retrieved may carry a different offset from the one originally saved (the PostgreSQL JDBC driver returns UTC), but it will represent the same exact instant as the one stored (because it was normalized to UTC).
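The "same instant, different offset" point can be checked without a database. The sketch below uses only java.time; the OffsetDateTimeInstants class and its method names are made up for illustration, and toUtc only mimics the normalization that TIMESTAMPTZ applies on storage.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

/** Illustrative helpers for comparing OffsetDateTime values by instant. */
final class OffsetDateTimeInstants {

    /** Re-expresses the value at offset 0 without changing the instant it denotes. */
    static OffsetDateTime toUtc(OffsetDateTime value) {
        return value.withOffsetSameInstant(ZoneOffset.UTC);
    }

    /**
     * Compares two values by instant only. OffsetDateTime.equals() also
     * compares the local date-time and the offset, so it returns false for
     * a saved value and its UTC-normalized round trip.
     */
    static boolean sameInstant(OffsetDateTime a, OffsetDateTime b) {
        return a.isEqual(b);
    }
}
```

For example, 2023-10-27 10:30 at offset -04:00 becomes 14:30 at UTC. sameInstant reports the two as equal while equals() does not, so tests that compare a saved value with a loaded one should compare by instant.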
Tag
This section describes the design of a tagging system for light-portal entities. Tags are typically non-hierarchical keywords or labels that you can assign to entities for flexible organization and discovery, complementing categories.
1. Database Design (PostgreSQL)
For a flexible and efficient tagging system, we’ll use two main tables: a central tags_t table and a join table entity_tags_t to create a many-to-many relationship between entities and tags.
a) tags_t Table:
Stores the definitions of the tags themselves.
CREATE TABLE tags_t (
tag_id VARCHAR(22) NOT NULL, -- Unique ID for the tag
host_id VARCHAR(22), -- null means global tag
tag_name VARCHAR(100) UNIQUE NOT NULL, -- Tag name (e.g., "featured", "urgent", "api", "documentation") - Enforce uniqueness
tag_desc VARCHAR(1024), -- Optional description of the tag
update_user VARCHAR(255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (tag_id)
);
-- Index for efficient lookup by tag_name (common search/filter)
CREATE INDEX idx_tags_tag_name ON tags_t (tag_name);
- tag_id: Unique identifier for each tag.
- tag_name: The actual tag value (e.g., “featured”). The UNIQUE NOT NULL constraint ensures tag names are unique across the system (global tags in this design).
- tag_desc: Optional description for the tag.
- update_user, update_ts: Standard audit columns.
- UNIQUE (tag_name): Important constraint to ensure tag names are unique. This makes tag management simpler and consistent.
b) entity_tags_t Join Table (Many-to-Many Relationship):
Links entities to tags.
CREATE TABLE entity_tags_t (
entity_id VARCHAR(22) NOT NULL, -- ID of the entity (schema, product, document, etc.)
entity_type VARCHAR(50) NOT NULL, -- Type of the entity ('schema', 'product', 'document', etc.)
tag_id VARCHAR(22) NOT NULL REFERENCES tags_t(tag_id) ON DELETE CASCADE, -- Foreign key to tags_t
PRIMARY KEY (entity_id, entity_type, tag_id) -- Composite primary key to prevent duplicate tag assignments to the same entity
);
-- Indexes for efficient queries
CREATE INDEX idx_entity_tags_tag_id ON entity_tags_t (tag_id); -- Find entities by tag
CREATE INDEX idx_entity_tags_entity ON entity_tags_t (entity_id, entity_type); -- Find tags for an entity
- entity_id: ID of the entity being tagged.
- entity_type: Type of the entity (must match the types you use for categories and other entity-related tables).
- tag_id: Foreign key referencing the tags_t table.
- Composite Primary Key (entity_id, entity_type, tag_id): Ensures that an entity of a specific type cannot be associated with the same tag multiple times.
- ON DELETE CASCADE: If a tag is deleted from tags_t, all associations in entity_tags_t are automatically removed. Consider ON DELETE RESTRICT if you want to prevent tag deletion if it’s still in use.
2. Service Endpoints
You’ll need service endpoints to manage tags themselves and to manage the associations between tags and entities.
a) Tag Management Endpoints (Likely in a TagService or Admin-Specific Service):
- POST /tags - Create a new tag
  - Request Body (JSON):
    {
      "tagId": "uniqueTagId123", // Optional - let backend generate if not provided
      "tagName": "featured", // Required - unique tag name
      "tagDesc": "Items that are highlighted or promoted" // Optional
    }
  - Response: 201 Created, with Location header (URL of the new tag) and response body (created tag JSON).
- GET /tags - List all tags (with pagination, filtering, sorting - similar to getCategory endpoint)
  - Query Parameters: offset, limit, tagName, tagDesc, etc.
  - Response: 200 OK, JSON array of tag objects (with total count).
- GET /tags/{tagId} - Get a specific tag by ID
  - Path Parameter: tagId
  - Response: 200 OK, tag object in JSON. 404 Not Found if not exists.
- PUT /tags/{tagId} - Update an existing tag
  - Path Parameter: tagId
  - Request Body (JSON): (Same structure as POST, but tagId in the path is used for identification)
  - Response: 200 OK, updated tag object in JSON. 404 Not Found if tag not found.
- DELETE /tags/{tagId} - Delete a tag
  - Path Parameter: tagId
  - Response: 204 No Content. 404 Not Found if tag not found.
b) Entity Tag Association Endpoints (Likely within Entity-Specific Services like SchemaService, ProductService):
- (Within POST /schemas, PUT /schemas/{schemaId}, etc. entity creation/update endpoints):
  - Request Body for creating or updating an entity should include a field (e.g., tagIds: ["tagId1", "tagId2"]) to specify the tags to associate with the entity.
  - Service logic (like in the updated createSchema and updateSchema methods) will handle updating the entity_tags_t table (deleting old links and inserting new ones) within the same transaction as the entity creation/update.
- GET /schemas/{schemaId}/tags (or /products/{productId}/tags, etc.) - Get tags associated with a specific entity
  - Path Parameter: schemaId (or productId, etc.)
  - Response: 200 OK, JSON array of tag objects associated with the entity.
- PUT /schemas/{schemaId}/tags (or similar) - Replace tags associated with an entity (Less common, often handled within the entity update endpoint directly)
  - Path Parameter: schemaId
  - Request Body (JSON): { "tagIds": ["tagIdA", "tagIdB"] } - list of tag IDs to associate.
  - Response: 200 OK, updated entity object (or just 204 No Content).
c) Entity Filtering/Search Endpoints:
- GET /schemas (or /products, /documents, etc.) - List entities, now with tag filtering:
  - Query Parameter: tagNames (or tagIds, or tags - choose one and be consistent), e.g., tagNames=featured,api&tagNames=urgent (multiple tags to filter by).
  - Backend logic: Modify the getSchema (or getProduct, getDocument, etc.) service methods to:
    - Parse the tagNames parameter (could be comma-separated, multiple parameters, etc.).
    - Modify the SQL query to include a JOIN with entity_tags_t and tags_t and add a WHERE clause to filter by the provided tag names. You might need to use EXISTS or IN subqueries for efficient filtering by multiple tags.
Example Query for Filtering Schemas by Tags (using PostgreSQL EXISTS):
SELECT schema_t.*, ... -- Select schema columns
FROM schema_t
WHERE EXISTS (
SELECT 1
FROM entity_tags_t et
INNER JOIN tags_t t ON et.tag_id = t.tag_id
WHERE et.entity_id = schema_t.schema_id
AND et.entity_type = 'schema'
AND t.tag_name IN (?, ?, ?) -- Parameterized tag names list
);
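The backend steps above (parse tagNames, then build the parameterized filter) might be sketched like this. TagFilterSql and its methods are hypothetical names, and accepting both comma-separated and repeated parameters is one possible choice rather than a settled rule. Note that IN matches entities carrying any of the listed tags, not all of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for the tag filter on the schema list query. */
final class TagFilterSql {

    /** Flattens repeated and comma-separated tagNames values, dropping blanks and duplicates. */
    static List<String> parseTagNames(List<String> rawValues) {
        List<String> names = new ArrayList<>();
        for (String raw : rawValues) {
            for (String part : raw.split(",")) {
                String name = part.trim();
                if (!name.isEmpty() && !names.contains(name)) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    /** Builds the EXISTS predicate with one '?' per tag name; values are bound separately. */
    static String existsClause(int tagCount) {
        if (tagCount < 1) {
            throw new IllegalArgumentException("at least one tag name is required");
        }
        String placeholders = String.join(", ", Collections.nCopies(tagCount, "?"));
        return "EXISTS (SELECT 1 FROM entity_tags_t et"
            + " INNER JOIN tags_t t ON et.tag_id = t.tag_id"
            + " WHERE et.entity_id = schema_t.schema_id"
            + " AND et.entity_type = 'schema'"
            + " AND t.tag_name IN (" + placeholders + "))";
    }
}
```

The tag names are then bound in order with PreparedStatement.setString, so no user input is concatenated into the SQL text.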
UI Considerations:
- Tag Management UI: Similar to category management, likely an admin section to create, edit, delete tags.
- Tag Assignment UI:
- Entity creation/edit forms should include a tag selection component (e.g., tag input with autocomplete, checkboxes, tag pills).
- Allow users to search/browse existing tags and assign them.
- Tag Filtering/Browsing UI:
- Display tags prominently (tag cloud, list, filters).
- Clicking/selecting a tag should filter the entity lists to show only entities associated with that tag.
Benefits of this Tagging System:
- Flexible Organization: Tags are free-form and non-hierarchical, allowing for more flexible and ad-hoc categorization than categories alone.
- Discoverability: Improves search and filtering capabilities, making it easier for users to find relevant entities.
- Metadata Enrichment: Tags add valuable metadata to entities.
- Scalability: The database design is efficient for querying and managing tags and associations even with a large number of entities and tags.
This design provides a solid foundation for a tagging system. You can further refine it based on your specific requirements, such as adding tag groups, permissions for tag management, or more advanced search capabilities.
UUID
In the light-portal database, we use UUIDs for most keys in order to support event replay between multiple environments. To balance database performance with the need for URL-friendly identifiers, the keys are stored in the PostgreSQL native UUID type.
CREATE TABLE your_table (
id UUID PRIMARY KEY,
-- other columns
);
PostgreSQL’s built-in gen_random_uuid() generates random UUIDv4 values, which cause an index locality problem. So we use Java to generate UUIDv7, a time-ordered UUID. These embed a timestamp, making them roughly sequential and significantly improving index locality and insert performance. You’ll need a library for this.
import com.github.f4b6a3.uuid.UuidCreator;
import java.util.UUID;
// In your entity or service
UUID primaryKey = UuidCreator.getTimeOrderedEpoch(); // UUIDv7
// Store this 'primaryKey' directly.
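To make the "embedded timestamp" concrete, the sketch below reads the 48-bit Unix millisecond timestamp that the UUIDv7 layout (RFC 9562) places in the most significant bits. It uses only the JDK. UuidV7Timestamp and sampleV7 are illustrative names, and sampleV7 builds a v7-shaped value with fixed filler bits purely for demonstration; real keys should still come from a generator such as UuidCreator.

```java
import java.time.Instant;
import java.util.UUID;

/** Illustrative helper for inspecting the timestamp inside a UUIDv7. */
final class UuidV7Timestamp {

    /** Returns the creation time encoded in the top 48 bits of a version 7 UUID. */
    static Instant timestampOf(UUID uuid) {
        if (uuid.version() != 7) {
            throw new IllegalArgumentException("not a UUIDv7: " + uuid);
        }
        long unixMillis = uuid.getMostSignificantBits() >>> 16;
        return Instant.ofEpochMilli(unixMillis);
    }

    /** Demo only: builds a v7-shaped UUID from a timestamp with fixed (non-random) filler bits. */
    static UUID sampleV7(long unixMillis) {
        long msb = (unixMillis << 16) | 0x7000L | 0x0abcL; // timestamp, version nibble 7, filler
        long lsb = 0x8000000000000000L | 0x1234L;          // variant bits '10', filler
        return new UUID(msb, lsb);
    }
}
```

Because the timestamp occupies the leading bits, values generated later sort after earlier ones, which is what keeps B-tree inserts close to the right-hand edge of the index.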
The light-4j utility module provides a UuidUtil class that generates UUIDv7 values and encodes/decodes them to URL-safe Base64 strings.
Here is the class.
package com.networknt.utility;
import com.github.f4b6a3.uuid.UuidCreator;
import java.util.Base64;
import java.util.UUID;
import java.nio.ByteBuffer;
public class UuidUtil {
// Use Java 8's built-in Base64 encoder/decoder
private static final Base64.Encoder URL_SAFE_ENCODER = Base64.getUrlEncoder().withoutPadding();
private static final Base64.Decoder URL_SAFE_DECODER = Base64.getUrlDecoder();
public static UUID getUUID() {
return UuidCreator.getTimeOrderedEpoch(); // UUIDv7
}
/**
* Encode a UUID to a URL-safe Base64 string.
*
* @param uuid The UUID to encode.
* @return A URL-safe Base64 encoded UUID string.
*/
public static String uuidToBase64(UUID uuid) {
ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
bb.putLong(uuid.getMostSignificantBits());
bb.putLong(uuid.getLeastSignificantBits());
return URL_SAFE_ENCODER.encodeToString(bb.array());
}
/**
* Decode a URL-safe Base64 string back to a UUID.
*
* @param base64 A URL-safe Base64 encoded UUID string.
* @return The decoded UUID.
*/
public static UUID base64ToUuid(String base64) {
byte[] bytes = URL_SAFE_DECODER.decode(base64);
ByteBuffer bb = ByteBuffer.wrap(bytes);
long high = bb.getLong();
long low = bb.getLong();
return new UUID(high, low);
}
}
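A quick way to see what the encoding produces is a JDK-only round trip. The UuidBase64RoundTrip class below is a stand-alone restatement written for illustration so it runs without the uuid-creator dependency; it is not part of light-4j.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

/** Stand-alone illustration of the URL-safe Base64 form of a UUID. */
final class UuidBase64RoundTrip {

    /** Encodes the 16 UUID bytes as unpadded URL-safe Base64 (always 22 characters). */
    static String encode(UUID uuid) {
        ByteBuffer bb = ByteBuffer.allocate(16);
        bb.putLong(uuid.getMostSignificantBits());
        bb.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bb.array());
    }

    /** Decodes the 22-character form back into a UUID, rejecting input of the wrong size. */
    static UUID decode(String base64) {
        byte[] bytes = Base64.getUrlDecoder().decode(base64);
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer bb = ByteBuffer.wrap(bytes);
        return new UUID(bb.getLong(), bb.getLong());
    }
}
```

Sixteen bytes always encode to 22 characters without padding, and the URL-safe alphabet avoids '+', '/', and '=', so the value can be used in a path or query string without escaping.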
Composite key vs Surrogate UUID key
Composite key with 5 or more columns
Use the following three tables as examples. We have a composite key with 5 columns, some of them VARCHAR types, in the product_version_property_t table. Is it a good idea to create UUID keys for config_property_t and product_version_t?
-- each config file will have a config_id reference and this table contains all the properties including default.
CREATE TABLE config_property_t (
config_id UUID NOT NULL,
property_name VARCHAR(64) NOT NULL,
property_type VARCHAR(32) DEFAULT 'Config' NOT NULL,
light4j_version VARCHAR(12), -- only newly introduced property has a version.
display_order INTEGER,
required BOOLEAN DEFAULT false NOT NULL,
property_desc VARCHAR(4096),
property_value TEXT,
value_type VARCHAR(32),
property_file TEXT,
resource_type VARCHAR(30) DEFAULT 'none',
update_user VARCHAR(255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL
);
ALTER TABLE config_property_t
ADD CHECK ( property_type IN ( 'Cert', 'Config', 'File') );
COMMENT ON COLUMN config_property_t.property_value IS
'Property Default Value';
COMMENT ON COLUMN config_property_t.value_type IS
'One of string, boolean, integer, float, map, list';
COMMENT ON COLUMN config_property_t.resource_type IS
'One of none, api, app, app_api, api|app_api, app|app_api, all';
ALTER TABLE config_property_t ADD CONSTRAINT config_property_pk PRIMARY KEY ( config_id, property_name );
CREATE TABLE product_version_t (
host_id UUID NOT NULL,
product_id VARCHAR(8) NOT NULL,
product_version VARCHAR(12) NOT NULL, -- internal product version
light4j_version VARCHAR(12) NOT NULL, -- open source release version
break_code BOOLEAN DEFAULT false, -- breaking code change to upgrade to this version.
break_config BOOLEAN DEFAULT false, -- config server need this to decide if clone is allowed for this version.
release_note TEXT,
version_desc VARCHAR(1024),
release_type VARCHAR(24) NOT NULL, -- Alpha Version, Beta Version, Release Candidate, General Availability, Production Release
current BOOLEAN DEFAULT false,
version_status VARCHAR(16) NOT NULL,
update_user VARCHAR (255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY(host_id, product_id, product_version)
);
-- customized property for product version within the host.
CREATE TABLE product_version_property_t (
host_id UUID NOT NULL,
product_id VARCHAR(8) NOT NULL,
product_version VARCHAR(12) NOT NULL,
config_id UUID NOT NULL,
property_name VARCHAR(64) NOT NULL,
property_value TEXT,
property_file TEXT,
update_user VARCHAR (126) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL
);
ALTER TABLE product_version_property_t
ADD CONSTRAINT product_version_property_pk PRIMARY KEY ( host_id,
product_id,
product_version,
config_id,
property_name);
Replacing the wide, multi-column composite primary key in product_version_property_t with one based on surrogate UUID keys from the parent tables is generally a very good idea for several reasons, primarily related to performance and maintainability.
Let’s break down the pros and cons:
Current Situation (Composite Keys):
- config_property_t PK: (config_id UUID, property_name VARCHAR(64)) - Size: 16 + (up to 64 + overhead) bytes. Reasonably sized.
- product_version_t PK: (host_id UUID, product_id VARCHAR(8), product_version VARCHAR(12)) - Size: 16 + (8 + overhead) + (12 + overhead) bytes. Getting wider.
- product_version_property_t PK: (host_id UUID, product_id VARCHAR(8), product_version VARCHAR(12), config_id UUID, property_name VARCHAR(64)) - Size: 16 + (8+oh) + (12+oh) + 16 + (64+oh) bytes. This is quite wide.
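The sizes listed above can be turned into rough numbers. The sketch below is an upper-bound estimate under stated assumptions: single-byte characters, a 1-byte length header for each short VARCHAR value, and no alignment padding or index tuple overhead. KeyWidthEstimate is a hypothetical helper, not part of the portal code.

```java
/** Rough upper-bound estimate of key widths in bytes (ignores padding and tuple overhead). */
final class KeyWidthEstimate {

    /** A PostgreSQL UUID value is stored in 16 bytes. */
    static final int UUID_BYTES = 16;

    /** VARCHAR(n) at full length, assuming single-byte characters plus a 1-byte header. */
    static int varchar(int n) {
        return n + 1;
    }

    /** Adds up the widths of the columns that make up a key. */
    static int sum(int... widths) {
        int total = 0;
        for (int w : widths) {
            total += w;
        }
        return total;
    }
}
```

Under these assumptions, sum(UUID_BYTES, varchar(64)) gives 81 bytes for config_property_t, sum(UUID_BYTES, varchar(8), varchar(12)) gives 38 bytes for product_version_t, and the five-column key of product_version_property_t comes to 119 bytes, compared with 32 bytes for a pair of UUIDs.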
Problems with the Wide Composite Key in product_version_property_t:
- Large Primary Key Index: The B-tree index for this 5-column PK will be large. Larger indexes consume more disk space, more memory (reducing cache efficiency), and take longer to scan during queries, updates, and deletes.
- Large Foreign Keys (Implicit): If any other table were to reference product_version_property_t (unlikely given its nature, but possible), the foreign key would also need all 5 columns.
- Secondary Indexes: PostgreSQL secondary indexes point to heap tuple IDs rather than embedding the primary key (unlike engines with clustered indexes such as MySQL InnoDB), so a wide PK does not by itself bloat other indexes on the table. The width cost is concentrated in the PK index and in any referencing foreign key columns and their indexes.
- Update Overhead: Updating any of the PK columns (though generally discouraged) is very expensive. More importantly, index maintenance during INSERTs and DELETEs is heavier with wider keys.
- Join Complexity: Joining on 5 columns is syntactically more verbose.
Proposed Solution (Surrogate UUID Keys):
- Add Surrogate Key to config_property_t:
  - Add a new column: cp_id UUID PRIMARY KEY DEFAULT gen_random_uuid() (or use UUIDv7 generator).
  - Add a UNIQUE constraint on the original natural key: ALTER TABLE config_property_t ADD CONSTRAINT config_property_nat_key UNIQUE (config_id, property_name); (This is CRITICAL to maintain data integrity).
  - Size of new PK: 16 bytes.
- Add Surrogate Key to product_version_t:
  - Add a new column: pv_id UUID PRIMARY KEY DEFAULT gen_random_uuid() (or UUIDv7).
  - Add a UNIQUE constraint on the original natural key: ALTER TABLE product_version_t ADD CONSTRAINT product_version_nat_key UNIQUE (host_id, product_id, product_version); (CRITICAL).
  - Size of new PK: 16 bytes.
- Modify product_version_property_t:
  - Remove the existing 5-column PK.
  - Replace the 5 individual columns used for the PK/FK relationship with two new columns: pv_id UUID and cp_id UUID.
  - Create the new Primary Key: ALTER TABLE product_version_property_t ADD CONSTRAINT product_version_property_pk PRIMARY KEY (pv_id, cp_id);
  - Add Foreign Key constraints:
    ALTER TABLE product_version_property_t
      ADD CONSTRAINT product_version_property_fk_pv
      FOREIGN KEY (pv_id) REFERENCES product_version_t(pv_id) ON DELETE CASCADE; -- Or appropriate action
    ALTER TABLE product_version_property_t
      ADD CONSTRAINT product_version_property_fk_cp
      FOREIGN KEY (cp_id) REFERENCES config_property_t(cp_id) ON DELETE CASCADE; -- Or appropriate action
  - Size of new PK: 16 + 16 = 32 bytes.
Advantages of the Proposed Solution:
- Significantly Smaller PK Index: The PK index on product_version_property_t shrinks from potentially >100 bytes wide to just 32 bytes. This is the biggest win.
- Narrower Rows: The five key columns in product_version_property_t are replaced by two 16-byte UUID columns, so each row and any index built on those key columns is generally narrower.
- Improved Query Performance: Smaller indexes mean faster scans, better cache utilization, and quicker lookups, updates, and deletes involving the PK or indexes on product_version_property_t.
- Simpler Joins: Joining product_version_property_t to its parent tables now only requires joining on the single pv_id or cp_id columns.
- Simpler Foreign Keys: The FK definitions are simpler.
Disadvantages/Considerations:
- Need for UNIQUE Constraints: You absolutely must add the UNIQUE constraints on the original natural keys in the parent tables (config_property_t, product_version_t) to prevent duplicate entries based on business logic. This adds two new indexes.
- Storage Overhead (Parent Tables): You add 16 bytes per row to config_property_t and product_version_t for the new UUID PKs.
- Migration Effort: You need to:
  - Add the new UUID columns to the parent tables.
  - Populate them with unique UUIDs for all existing rows.
  - Add the new FK columns (pv_id, cp_id) to product_version_property_t.
  - Update product_version_property_t to populate these new columns by joining back to the parent tables based on the original composite keys.
  - Drop the old PK and FK constraints (or the old columns if they are no longer needed besides the value overrides).
  - Add the new PK and FK constraints.
  - Add the UNIQUE constraints on the parent tables.
- Lookups by Natural Key: Queries that previously used the composite PK directly on the parent tables (e.g., SELECT ... FROM product_version_t WHERE host_id = ? AND product_id = ? AND product_version = ?) will now use the new UNIQUE index instead of the PK index. Performance should still be very good, but it’s technically hitting a different index.
Conclusion:
Yes, it is a good idea. The performance benefits gained from dramatically reducing the size of the primary key index on the frequently accessed product_version_property_t table likely outweigh the disadvantages (extra storage on parent tables, migration effort, extra unique indexes). The simplification of joins and foreign keys is also a significant maintainability improvement.
Recommendation:
Proceed with the plan to introduce surrogate UUID keys (preferably UUIDv7 for generation if using a library) on config_property_t and product_version_t, and change product_version_property_t to use these two UUIDs as its composite primary key. Do not forget to add the UNIQUE constraints on the original natural keys in the parent tables.
Composite key with 3 or 4 columns
For a composite key with 3 or 4 columns, the answer is less definitive than in the 5-column case. It moves into “it depends” territory, where the pros and cons must be weighed more carefully against the specifics.
Here’s a breakdown of factors to consider for 3 or 4 column composite primary keys:
Arguments for Sticking with the Composite Natural Key (CNK):
- Simplicity (Potentially): No need for an extra surrogate key column and an extra UNIQUE index on the natural key columns. The schema might feel slightly less cluttered if the natural key is intuitive and stable.
- Reduced Storage (Parent Table): Avoids adding 16 bytes per row for the UUID PK in the table itself.
- Meaningful Key: The PK components have inherent business meaning, which can sometimes be useful for direct queries or understanding relationships without extra joins (though the UNIQUE index on the SUK approach provides this lookup too).
- Migration Cost: Avoids the effort of adding columns, backfilling data, and changing referencing tables.
Arguments for Refactoring to a Surrogate UUID Key (SUK):
- Index Size (Still Relevant): This is the biggest factor.
  - Calculate the Width: Add up the maximum potential size of the 3 or 4 columns in the CNK.
    - UUID: 16 bytes
    - INT: 4 bytes
    - BIGINT: 8 bytes
    - VARCHAR(N): N bytes + 1 or 4 bytes overhead (depending on length)
    - TIMESTAMP: 8 bytes
    - BOOLEAN: 1 byte
  - Compare: Compare the calculated width to the typical width of a surrogate key reference (16 bytes for one UUID, or 32 bytes if the child table needs two UUIDs like in your product_version_property_t example).
  - Threshold: If the CNK width starts exceeding ~32-40 bytes, the performance benefits of a narrower SUK (especially for secondary indexes and joins) become increasingly attractive. Even a 3-column key like (UUID, VARCHAR(8), VARCHAR(12)) is already 16 + (8+1) + (12+1) = 38 bytes. A 4-column key is almost certainly wider.
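- Secondary Index Bloat: In engines that cluster the table on the primary key (e.g., MySQL InnoDB), every secondary index implicitly includes the PK columns, so a wide CNK makes every index larger, impacting cache efficiency and scan speed across the board. PostgreSQL secondary indexes point to the heap row instead, so there the cost is concentrated in the PK index itself and in every child-table index that carries the key columns. Either way, the effect is magnified if you have many indexes that contain the key.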
- Join Performance/Simplicity: Joining on a single UUID column is generally faster and syntactically simpler than joining on 3 or 4 columns, especially if some are strings.
- Foreign Key Simplicity: Tables referencing this table only need to store a single UUID column as the foreign key, rather than 3 or 4 columns. This significantly reduces storage and complexity in child tables.
- Immutability/Stability: Surrogate keys are inherently stable. If there’s any chance the values in the natural key columns might need to change (which is generally bad practice for PKs but sometimes unavoidable), using a SUK provides crucial insulation.
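The width calculation above is simple arithmetic, and it can help to script it when comparing several candidate keys. The following is a minimal sketch, not a sizing tool: it uses the per-type byte counts listed above, assumes the 1-byte VARCHAR overhead from the worked example, and ignores alignment padding and index tuple headers. The column names are hypothetical.

```java
import java.util.List;

/** Rough width estimate for a key; ignores alignment padding and index tuple headers. */
final class KeyWidthSketch {

    /** A key column with its approximate storage size in bytes. */
    record Column(String name, int bytes) {}

    static Column uuid(String name) {
        return new Column(name, 16);
    }

    /** VARCHAR(n) at maximum length plus the 1-byte overhead used in the worked example. */
    static Column varchar(String name, int maxLength) {
        return new Column(name, maxLength + 1);
    }

    /** Sums the approximate widths of all columns in the key. */
    static int width(List<Column> key) {
        return key.stream().mapToInt(Column::bytes).sum();
    }

    public static void main(String[] args) {
        // Hypothetical column names; the shape matches (UUID, VARCHAR(8), VARCHAR(12)).
        List<Column> natural = List.of(
                uuid("host_id"), varchar("product_id", 8), varchar("product_version", 12));
        List<Column> surrogate = List.of(uuid("pv_id"));
        System.out.println("natural key:   " + width(natural) + " bytes");   // 38
        System.out.println("surrogate key: " + width(surrogate) + " bytes"); // 16
    }
}
```

Comparing the two numbers against the ~32-40 byte threshold gives a first indication of whether the refactoring is worth investigating further.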
Decision Rubric (Heuristics):
- Scenario 1: Key is mostly small, fixed-width types (e.g., 3 INTs, 2 INTs + UUID). Total width < ~32 bytes.
- Lean towards: Keeping the CNK might be acceptable, especially if the table isn’t heavily referenced or doesn’t have many secondary indexes. The benefits of SUK are less pronounced.
- Scenario 2: Key includes even one moderately sized VARCHAR (e.g., VARCHAR(20+)) or multiple smaller VARCHARs. Total width > ~32-40 bytes.
- Lean towards: Refactoring to SUK. The performance benefits from reduced index sizes (primary and secondary) and simpler joins/FKs likely start to outweigh the costs.
- Scenario 3: Key includes wide VARCHARs (VARCHAR(50+)).
- Lean towards: Strongly consider refactoring to SUK. Benefits are significant.
- Scenario 4: The table is frequently referenced by other tables via FK.
- Lean towards: Refactoring to SUK to simplify the child tables.
- Scenario 5: The table has many secondary indexes.
- Lean towards: Refactoring to SUK to keep those indexes smaller.
- Scenario 6: Any component of the natural key is potentially volatile.
- Lean towards: Refactoring to SUK for stability.
Conclusion:
For 3 or 4 columns, it’s less of a slam dunk than for 5 columns, but refactoring to a surrogate UUID key is still often beneficial, especially if VARCHARs are involved or the total key width exceeds roughly 32-40 bytes. The key drivers are the reduction in primary and secondary index size and the simplification of foreign keys and joins.
Carefully calculate the width of your specific 3 or 4 column composite keys and consider the usage patterns of the table before deciding.
Atomic Events vs Composite Events
Event Design
This is a fundamental design decision in any Event Sourcing system, and the choice has significant long-term consequences for your system’s flexibility, maintainability, and clarity.
The overwhelming consensus and best practice is to design atomic events.
Let’s break down why, and then look at the trap of composite events.
1. Atomic Events (The Recommended Approach)
An atomic event represents a single, granular, immutable fact that has occurred in your system. It is the smallest possible unit of change that has meaning to the business domain.
Example Scenario: A user updates their profile by changing their name and their shipping address on the same form.
With atomic events, this single user action (the Command) would result in two separate events being written to the stream:
CustomerNameChanged { customerId: "123", newName: "Jane Doe" }
CustomerAddressChanged { customerId: "123", newAddress: "..." }
Why Atomic Events are Superior:
a) Maximum Flexibility and Reusability:
- Targeted Consumers: You can have different parts of your system (projections, process managers, other microservices) subscribe to only the events they care about. The shipping department only needs to know about CustomerAddressChanged, while the marketing department might only care about CustomerNameChanged. With a composite event, both would have to subscribe and parse the larger event to see if the part they care about was updated.
- Future-Proofing: Six months from now, you might need to build a new feature that triggers a welcome kit to be sent when a customer provides an address for the first time. It’s trivial to add a new consumer for the CustomerAddressChanged event.
b) Clear and Unambiguous Intent:
- Each event has a single, well-defined purpose. The name of the event, like ItemAddedToCart, tells you exactly what happened.
- A composite event like ProfileUpdated is ambiguous. Did the name change? The address? The phone number? The password? You have to inspect the payload to understand the actual change, which makes the event stream harder to read and debug.
c) Simplified Event Handlers (Projections):
- Your event handlers become small and simple. The logic for handling CustomerAddressChanged is focused purely on updating the address.
- With a composite event, your handler becomes a complex series of if statements: if (event.newName != null) { updateName(); } if (event.newAddress != null) { updateAddress(); } .... This is brittle and hard to maintain.
d) Better for Versioning:
- Events are immutable and stored forever. Your system will evolve. It’s much easier to version a small, focused event (e.g., add an optional field to CustomerAddressChanged_v2) than it is to version a large, complex composite event where any one of its dozen fields might change.
2. Composite Events (The Anti-Pattern)
A composite event attempts to bundle multiple state changes into a single, large event. It’s often a result of developers mapping the event directly to the UI form or the incoming command.
Using the same scenario (updating name and address):
A composite event might look like this:
CustomerProfileUpdated { customerId: "123", name: "Jane Doe", address: "...", phoneNumber: null }
The Problems with this Approach:
- Loss of Granularity: As explained above, you lose the ability to react to specific changes.
- Ambiguous Intent: The event name doesn’t tell the whole story.
- Sparse Payloads: What if the user only changed their name? The address and phoneNumber fields in the event payload would be null or unchanged. Your consumers have to handle these sparse, optional fields, leading to more complex logic.
- False Cohesion: It groups things together that only changed at the same time but are not necessarily part of the same business fact.
3. The Key Insight: The Role of the Aggregate
You might be thinking, “But the name and address change must be atomic! What if the system crashes after writing the first event but before the second?”
This is where the Aggregate from Domain-Driven Design (DDD) comes in. The Aggregate is the consistency boundary.
Here is the correct flow:
- Command: A single UpdateCustomerProfileCommand is sent to the system. It contains both the new name and the new address.
- Aggregate: The Customer Aggregate receives the command. It validates the business rules (e.g., the name isn’t empty, the address is valid).
- Event Generation: If the rules pass, the Aggregate’s method produces a list of atomic events: [CustomerNameChanged, CustomerAddressChanged].
- Atomic Persistence: The Event Store then takes this list of events and persists them to the event stream in a single, atomic transaction. It’s all or nothing. You will never have a CustomerNameChanged in the stream without the corresponding CustomerAddressChanged from the same command.
Key Takeaway: The Command can be composite (representing a single user gesture), but it should be resolved into a set of atomic Events. The atomicity of writing these events is guaranteed by your Aggregate and Event Store, not by bundling them into a single event type.
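The command-to-events flow can be sketched in a few lines. This is an illustrative sketch only, not the portal's actual aggregate: the type names, the null-means-unchanged convention, and the validation rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Marker for domain events; each implementation is one atomic fact. */
interface CustomerEvent {
    String customerId();
}

record CustomerNameChanged(String customerId, String newName) implements CustomerEvent {}

record CustomerAddressChanged(String customerId, String newAddress) implements CustomerEvent {}

/** The composite command: one user gesture. A null field means "not changed" (an assumption). */
record UpdateCustomerProfile(String customerId, String newName, String newAddress) {}

final class CustomerAggregateSketch {
    private final String id;
    private final String name;
    private final String address;

    CustomerAggregateSketch(String id, String name, String address) {
        this.id = id;
        this.name = name;
        this.address = address;
    }

    /** Validates the command and returns the atomic events it produces, in order. */
    List<CustomerEvent> handle(UpdateCustomerProfile cmd) {
        if (!id.equals(cmd.customerId())) {
            throw new IllegalArgumentException("command targets a different aggregate");
        }
        if (cmd.newName() != null && cmd.newName().isBlank()) {
            throw new IllegalArgumentException("name must not be blank");
        }
        List<CustomerEvent> events = new ArrayList<>();
        // Emit one event per fact that actually changed.
        if (cmd.newName() != null && !cmd.newName().equals(name)) {
            events.add(new CustomerNameChanged(id, cmd.newName()));
        }
        if (cmd.newAddress() != null && !cmd.newAddress().equals(address)) {
            events.add(new CustomerAddressChanged(id, cmd.newAddress()));
        }
        return events;
    }
}
```

The returned list is what the Event Store would persist in one transaction. A command that changes nothing yields an empty list, so no event is written at all.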
How to Design Good Atomic Events (A Checklist)
- Name it in the Past Tense: UserRegistered, OrderShipped, PasswordReset. It’s a fact that has already happened.
- Capture Business Intent: Don’t just record a CRUD-like change. ProductPriceUpdated is okay, but PriceAdjustedForSale is better because it captures the why.
- Ensure it’s a Complete Fact: Include all necessary data for a consumer to understand the event without having to look up previous state. For example, ItemAddedToCart should include productId, quantity, and priceAtTimeOfAdding, not just productId.
- Include Causation and Correlation IDs: Add metadata to your events. Who triggered this change (userId)? What command caused it (causationId)? What overall business process is this part of (correlationId)?
- Think “What happened?” not “What changed?”: An event is a story. OrderSubmitted is a great event. A composite event like OrderStateChanged { oldState: "Pending", newState: "Submitted" } is far less expressive.
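As an illustration of the "complete fact" and metadata items in the checklist, an event can carry its metadata as a separate value. This is a sketch under assumptions: the field names, the use of UUIDs for the IDs, and the price stored in cents are choices made for the example, not something the portal prescribes.

```java
import java.time.Instant;
import java.util.UUID;

/** Metadata carried by every event; field names are illustrative. */
record EventMetadata(UUID eventId, String userId, UUID causationId, UUID correlationId, Instant occurredAt) {}

/** A self-contained fact: a consumer needs no lookup of earlier state to interpret it. */
record ItemAddedToCart(String cartId, String productId, int quantity,
                       long priceAtTimeOfAddingCents, EventMetadata metadata) {
    ItemAddedToCart {
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        if (priceAtTimeOfAddingCents < 0) {
            throw new IllegalArgumentException("price must not be negative");
        }
    }
}

final class EventMetadataSketch {
    /** Builds metadata for an event caused directly by the given command. */
    static EventMetadata causedBy(UUID commandId, UUID correlationId, String userId) {
        return new EventMetadata(UUID.randomUUID(), userId, commandId, correlationId, Instant.now());
    }
}
```

Keeping the metadata in its own record means every event type gets the same tracing fields without repeating them in each payload definition.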
Final Verdict
Always design atomic events.
Use the Aggregate pattern to ensure that multiple atomic events resulting from a single command are persisted transactionally. This approach gives you a system that is far more flexible, maintainable, readable, and ready for future evolution.
Transaction
When using Kafka to handle the events, we need to handle transactions on both the producer side and the consumer side, along with the database transaction, to ensure multiple events can be handled together. The choice you make here directly impacts the atomicity, ordering, and resilience of your entire system.
Let’s break down the options and the correct way to handle transactions on both the producer and consumer side.
The Correct Approach: Multiple Events, One Transaction
The best practice, by far, is to push multiple, individual event messages per command, wrapped in a single Kafka producer transaction.
This approach maintains the benefits of atomic events (flexibility, clarity) while guaranteeing the all-or-nothing atomicity you need.
How to Implement This (Producer Side)
Your service that processes the command and talks to the Aggregate will do the following:
- Generate Events: The Customer Aggregate generates the list of atomic events: [CustomerNameChanged, CustomerAddressChanged].
- Use Kafka Producer Transactions: This is the key feature that makes it all work.
Here is a conceptual code flow (using Java-like syntax):
// IMPORTANT: Configure your producer for transactions and idempotence
// props.put("transactional.id", "my-unique-transactional-id");
// props.put("enable.idempotence", "true");
KafkaProducer<String, Event> producer = new KafkaProducer<>(props);
// The list of events from your Aggregate
List<Event> events = customerAggregate.handle(updateProfileCommand);
// 1. Initialize transactions (call once per producer instance, before the first transaction)
producer.initTransactions();
try {
    // 2. Begin the transaction
    producer.beginTransaction();
    // The Aggregate ID (e.g., "customer-123") is the Kafka Key
    String aggregateId = customerAggregate.getId();
    for (Event event : events) {
        // 3. Send EACH event as a SEPARATE message.
        // CRUCIAL: All events for this transaction MUST have the same key.
        // This ensures they all go to the same partition and are consumed in order.
        producer.send(new ProducerRecord<>("customer-events-topic", aggregateId, event));
    }
    // 4. Commit the transaction.
    // This makes all messages in the transaction visible to consumers atomically.
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // These are fatal errors, we should close the producer
    producer.close();
} catch (KafkaException e) {
    // 5. If anything goes wrong, abort. None of the messages will be visible.
    producer.abortTransaction();
}
producer.close();
Why this is the best way:
- Atomicity Guaranteed: Kafka guarantees that consumers will either see ALL the messages from commitTransaction or NONE of them (if you abortTransaction), provided the consumers read with isolation.level=read_committed. With the default read_uncommitted, consumers also see messages from open or aborted transactions.
- Ordering Guaranteed: By using the same key (aggregateId) for all events in the transaction, you ensure they are written to the same partition in the exact order you sent them. Your consumer will read them in that same order.
- Consumer Flexibility: Your stream processors can now consume individual, meaningful events. A shipping-related processor can filter for and process only CustomerAddressChanged events, completely ignoring CustomerNameChanged.
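The all-or-nothing visibility of a producer transaction depends on how the consumer is configured. A minimal sketch of the relevant consumer settings follows. The helper name and its parameters are hypothetical, and the key and value deserializers still have to be added for a real consumer.

```java
import java.util.Properties;

final class ConsumerConfigSketch {

    /** Builds the consumer settings that matter for transactional reads and manual commits. */
    static Properties transactionalReaderConfig(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Only deliver messages from committed transactions; the Kafka default is read_uncommitted.
        props.setProperty("isolation.level", "read_committed");
        // Offsets are committed manually, after the database commit succeeds.
        props.setProperty("enable.auto.commit", "false");
        return props;
    }
}
```

The resulting Properties object can be passed to the KafkaConsumer constructor once the deserializer settings are filled in.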
How to Process Events Transactionally (Consumer Side)
Now, how does your streams processor populate the database tables while maintaining consistency? This is often called the “Transactional Outbox” pattern, but in reverse—a “Transactional Inbox”.
The goal is to atomically update the database AND commit the Kafka offset. You never want to commit an offset for a message whose database update failed.
Here is the standard, robust pattern for a custom consumer/streams processor:
- Disable Kafka Auto-Commit: This is the most important step. Your application must take manual control of committing offsets. In your consumer configuration, set enable.auto.commit=false.
- Consume and Process in Batches:
// This is a conceptual loop for your consumer
while (true) {
    // 1. Poll for a batch of records. Kafka gives you a batch.
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofMillis(1000));
    if (records.isEmpty()) {
        continue;
    }
    // Get your database connection
    Connection dbConnection = database.getConnection();
    dbConnection.setAutoCommit(false); // Start manual DB transaction management
    try {
        // 2. Process each record in the polled batch
        for (ConsumerRecord<String, Event> record : records) {
            Event event = record.value();
            // Apply the change to the database based on the event type
            processEvent(event, dbConnection);
        }
        // 3. If all events in the batch were processed successfully, commit the database transaction
        dbConnection.commit();
        // 4. IMPORTANT: Only after the DB commit succeeds, commit the Kafka offset.
        // This tells Kafka "I have successfully and durably processed all messages up to this point."
        consumer.commitSync();
    } catch (SQLException e) {
        // 5. If the DB update fails, rollback the DB transaction...
        dbConnection.rollback();
        // ...and DO NOT commit the Kafka offset.
        // This is why your processing logic MUST be idempotent.
        System.err.println("Database update failed. Rolling back. Will retry batch.");
        // poll() has already advanced the consumer's in-memory position, so rewind each
        // partition to the first record of this batch; the next poll() then returns it again.
        for (TopicPartition partition : records.partitions()) {
            consumer.seek(partition, records.records(partition).get(0).offset());
        }
    } finally {
        dbConnection.close();
    }
}
It is possible to handle transactions in a Kafka Streams processor, but it requires using the low-level Processor API and is significantly more complex than the standard consumer approach. You cannot achieve this with the high-level DSL (.map(), .filter(), etc.) alone.
If your processor’s only job is to read from Kafka and write to a database: Use the Plain Kafka Consumer. It is simpler, more direct, less error-prone, and purpose-built for this task. You are essentially building a custom, lightweight Kafka Connect sink.
The Critical Need for Idempotency
Because a failure can occur after the DB commit but before the Kafka offset commit, your application might restart and re-process the same batch of events.
Your database update logic must be idempotent. This means running the same update multiple times produces the same result as running it once.
Examples of Idempotent Operations:
- INSERT with a primary key: INSERT INTO customers (...) VALUES (...) ON DUPLICATE KEY UPDATE ... (MySQL) or INSERT ... ON CONFLICT ... DO UPDATE ... (PostgreSQL).
- UPDATE statements: UPDATE customers SET name = 'Jane Doe' WHERE customer_id = '123'. Running this 5 times is the same as running it once.
- Using Versioning: Store a version or last_processed_event_id in your database table.
UPDATE customers SET name = 'Jane Doe', version = 2 WHERE customer_id = '123' AND version = 1;
If the update tries to run again, the WHERE clause will not match, and no rows will be affected.
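To make the version guard concrete, here is a small in-memory sketch that mimics the versioned UPDATE above. It is not database code: a HashMap stands in for the customers table, and the class and method names are made up for the illustration. The return value plays the role of the affected-row count a JDBC executeUpdate call would report.

```java
import java.util.HashMap;
import java.util.Map;

final class VersionedProjectionSketch {

    /** One "row" of the hypothetical customers table. */
    record CustomerRow(String name, long version) {}

    private final Map<String, CustomerRow> customers = new HashMap<>();

    /** Inserts a new row at version 1. */
    void insert(String customerId, String name) {
        customers.put(customerId, new CustomerRow(name, 1));
    }

    /**
     * Mirrors: UPDATE customers SET name = ?, version = version + 1
     *          WHERE customer_id = ? AND version = ?
     * Returns the number of rows affected (0 or 1).
     */
    int renameIfVersionMatches(String customerId, String newName, long expectedVersion) {
        CustomerRow row = customers.get(customerId);
        if (row == null || row.version() != expectedVersion) {
            return 0; // already applied, or the row is missing: nothing changes
        }
        customers.put(customerId, new CustomerRow(newName, expectedVersion + 1));
        return 1;
    }

    CustomerRow get(String customerId) {
        return customers.get(customerId);
    }
}
```

Applying the same event twice changes the row once: the first call returns 1 and bumps the version, the replay returns 0 and leaves the row untouched.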
Why Not Put a List of Events in One Message?
This is an anti-pattern that solves one problem (producer atomicity) by creating many more downstream.
- Loss of Meaning: The fundamental unit is the event, not a list of events. A Kafka message should represent one fact.
- Consumer Complexity: Every single consumer now has to be written to expect a list. It has to deserialize the list and loop through it.
- No Filtering: A consumer who only cares about CustomerAddressChanged still has to receive and parse the entire message containing the CustomerNameChanged event, only to discard it. This is inefficient and tightly couples your consumers to the producer’s batching behavior.
- Versioning Hell: Versioning a list of events is much harder than versioning a single event.
Summary
| Action | Recommended Approach |
|---|---|
| Event Design | Atomic Events: CustomerNameChanged, CustomerAddressChanged. |
| Producing to Kafka | Multiple Messages, One Kafka Transaction: Use producer.beginTransaction() and producer.commitTransaction(). |
| Kafka Message Key | Aggregate ID: Use the same key (e.g., customer-123) for all events from the same command to ensure ordering. |
| Consuming from Kafka | Manual Offset Commits: Disable auto-commit. |
| Database Updates | Transactional Batch Processing: [Start DB Tx] -> [Process Batch] -> [Commit DB Tx] -> [Commit Kafka Offset]. |
| Database Logic | Idempotent: Your UPDATE/INSERT logic must handle being re-run on the same event without causing errors or incorrect data. |
Mixed Aggregates vs Single Aggregate
In the simple batch-processing consumer example I provided, the Kafka message key is not being used to segregate processing. The example processes a batch of records polled from Kafka, and that batch can indeed contain events for many different user_ids or host_ids, all mixed together in a single database transaction.
Let’s break down why this happens, the implications, and how to design a consumer that does respect aggregate boundaries for processing.
Why the Simple Batch Consumer Mixes Aggregates
- Kafka’s Partitioning: You use the user_id/host_id as the key. Kafka’s producer hashes this key to determine which partition the message goes to. This is excellent because it guarantees that all events for a single user (a single aggregate) will always go to the same partition and will be consumed in the order they were produced.
- The Consumer’s Polling: A Kafka consumer is assigned one or more partitions to read from. When it calls consumer.poll(), it fetches a batch of records that have arrived on all of its assigned partitions since the last poll.
  - If your consumer is assigned Partition 0, and events for User A, User B, and User C have all landed on Partition 0, your polled batch will contain [EventA1, EventB1, EventC1, EventA2, ...].
  - They are mixed together, but the ordering per key is preserved (Event A1 will always come before Event A2).
- The Simple Transaction Loop: The example loop I showed takes this entire mixed batch (records) and processes it within one DB transaction.
// This loop combines multiple aggregates into one DB transaction
dbConnection.beginTransaction();
for (ConsumerRecord record : records) { // 'records' contains events for User A, B, C...
    updateDatabase(record.value());
}
dbConnection.commit();
Is This a Problem? (The Trade-offs)
For many use cases, processing mixed aggregates in a single batch is perfectly fine and often more performant.
- Pro: High Throughput. Batching database commits is much more efficient than committing after every single event. Committing a transaction that updates 100 rows for 50 different users is faster than running 100 separate transactions.
- Con: “Noisy Neighbor” Problem. If processing an event for User C throws an unrecoverable SQLException, the entire batch transaction will be rolled back. This means the valid updates for User A and User B will also be rolled back and retried. The failure of one aggregate’s event processing blocks the progress of others in the same batch.
- Con: Loss of Concurrency. You are processing everything serially within a single consumer thread. You aren’t taking advantage of the fact that User A’s events are independent of User B’s events.
The Better Approach: Processing per Aggregate
If you want to isolate failures and potentially parallelize work, you need to change your consumer logic to process events grouped by their key (user_id/host_id).
This pattern is more complex but far more robust for multi-tenant systems.
Conceptual Code for Aggregate-based Processing
This approach reorganizes the polled batch by key before processing.
// Still disable auto-commit: enable.auto.commit=false
while (true) {
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofMillis(1000));
    if (records.isEmpty()) continue;
    // 1. Group the polled records by their key (the aggregate ID).
    //    A LinkedHashMap keeps aggregates in the order they first appear in the batch.
    Map<String, List<ConsumerRecord<String, Event>>> recordsByAggregate = new LinkedHashMap<>();
    for (ConsumerRecord<String, Event> record : records) {
        recordsByAggregate
            .computeIfAbsent(record.key(), k -> new ArrayList<>())
            .add(record);
    }
    // Set when any aggregate fails; no offsets are committed for this poll in that case
    boolean batchFailed = false;
    // 2. Process the events for EACH aggregate in its OWN transaction
    for (Map.Entry<String, List<ConsumerRecord<String, Event>>> entry : recordsByAggregate.entrySet()) {
        String aggregateId = entry.getKey();
        List<ConsumerRecord<String, Event>> aggregateEvents = entry.getValue();
        // Start a DB transaction FOR THIS AGGREGATE ONLY
        Connection dbConnection = database.getConnection();
        dbConnection.setAutoCommit(false);
        try {
            for (ConsumerRecord<String, Event> record : aggregateEvents) {
                // Your idempotent database logic
                updateDatabaseForAggregate(record.value(), dbConnection);
            }
            // Commit the DB transaction for this one aggregate
            dbConnection.commit();
        } catch (Exception e) {
            // FAILURE for a single aggregate!
            System.err.println("Failed to process batch for aggregate: " + aggregateId + ". Rolling back.");
            dbConnection.rollback();
            // Do NOT commit offsets for this poll: committing a later offset on the same
            // partition would silently skip the failed events.
            // The other aggregates still get their own DB transactions in the following iterations.
            // In reality, you need a robust dead-letter queue strategy here.
            batchFailed = true;
        } finally {
            dbConnection.close();
        }
    }
    if (batchFailed) {
        // poll() already advanced the position, so rewind to retry the batch.
        // Aggregates that committed above are re-applied, which is safe because the logic is idempotent.
        for (TopicPartition partition : records.partitions()) {
            consumer.seek(partition, records.records(partition).get(0).offset());
        }
    } else {
        // 3. Every aggregate in the batch committed, so acknowledge the whole poll.
        consumer.commitSync();
    }
}
Key Differences and Improvements in this Pattern:
- Isolation: A failure in processing for user-123 no longer affects user-456. The transaction for user-456 can still be committed.
- Correctness: The database state for each aggregate is updated transactionally, respecting its own event sequence.
- Concurrency (Potential): This design is a stepping stone to true parallel processing. You could submit each List<ConsumerRecord> (the events for one aggregate) to a thread pool for processing, allowing you to update multiple users’ data concurrently. This requires careful management of the offsets to commit.
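The thread-pool idea mentioned under Concurrency can be sketched without any Kafka dependency. In the sketch below, Rec is a stand-in for ConsumerRecord, and the handler is whatever applies one aggregate's events inside its own database transaction; both are assumptions made for illustration. Events for one key stay in a single task, so per-aggregate ordering is preserved while different aggregates run concurrently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelAggregateSketch {

    /** Stand-in for ConsumerRecord: only the fields this sketch needs. */
    record Rec(String key, long offset, String payload) {}

    /** Groups records by key, keeping first-seen key order and per-key record order. */
    static Map<String, List<Rec>> groupByKey(List<Rec> batch) {
        Map<String, List<Rec>> grouped = new LinkedHashMap<>();
        for (Rec r : batch) {
            grouped.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }
        return grouped;
    }

    /**
     * Runs one task per aggregate on the pool and waits for all of them.
     * Returns the keys whose task threw; the caller decides whether offsets may be committed.
     */
    static List<String> processInParallel(List<Rec> batch, ExecutorService pool,
                                          Consumer<List<Rec>> handler) {
        Map<String, Future<?>> futures = new LinkedHashMap<>();
        groupByKey(batch).forEach((key, recs) ->
                futures.put(key, pool.submit(() -> handler.accept(recs))));
        List<String> failedKeys = new ArrayList<>();
        for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
            try {
                entry.getValue().get(); // wait for this aggregate's task to finish
            } catch (ExecutionException ex) {
                failedKeys.add(entry.getKey());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for " + entry.getKey(), ex);
            }
        }
        return failedKeys;
    }
}
```

Because the method only returns after every task has finished, the consumer thread can decide afterwards whether to commit offsets, which keeps offset handling on the polling thread.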
Summary and Recommendation
Your intuition was exactly right. The simple batch consumer ignores the key for processing logic, which can be a problem.
| Approach | When to Use | Key Benefit | Key Drawback |
|---|---|---|---|
| Simple Batch Processing | - High-throughput, simple ETL jobs. - When a single failure blocking a whole batch is acceptable. - Low-contention systems. | Simplicity & Performance. Easiest to write and often the fastest for bulk loading. | Noisy Neighbor Problem. One bad event can halt processing for all others in the batch. |
| Per-Aggregate Processing | - Multi-tenant systems (your case). - Systems requiring high reliability and fault isolation. - When you need to guarantee progress for healthy aggregates even if others are failing. | Robustness & Isolation. Failures are contained to a single aggregate. | Increased Complexity. Requires more code to group records and manage transactions and offsets carefully. |
For your system, especially a multi-tenant one, the “Per-Aggregate Processing” model is strongly recommended. The added complexity is a worthwhile investment for the significant gains in robustness and fault isolation. It correctly uses the Kafka key not just for partitioning, but also as the logical boundary for your downstream processing and transactions.
DLQ and Process Flow
This is the final, crucial piece of the puzzle for building a truly resilient system. You’ve identified the exact scenario that separates a robust consumer from a brittle one: handling a “poison pill” message.
A poison pill is a message that your consumer cannot process successfully, no matter how many times it retries. A database constraint violation (UNIQUE constraint failed, FOREIGN KEY constraint failed) is a perfect example. If you simply roll back and retry, you will be stuck in an infinite loop, processing the same failing message forever and making no progress.
Yes, you absolutely must handle this by routing the problematic events to a Dead Letter Queue (DLQ), which is typically another Kafka topic.
Here is the complete, robust workflow for handling these failures.
The Dead Letter Queue (DLQ) Pattern
The goal is to isolate the “bad” event(s) so that the consumer can move on and continue processing subsequent “good” events for that aggregate and others.
1. The DLQ Topic
First, you create another Kafka topic specifically for these failed events. For example, if your main topic is customer-events, your DLQ topic might be customer-events-dlq.
The messages in the DLQ topic should be enriched with metadata about the failure:
- The original event payload.
- The original topic, partition, and offset.
- The consumer-group that failed to process it.
- A timestamp of the failure.
- The error message or stack trace (e.g., “UNIQUE constraint failed on customers.email”).
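One possible shape for such an enriched DLQ message is sketched below. The record name matches the DeadLetterEvent type used in the consumer code that follows, but its fields and the factory method are assumptions for illustration; the payload is kept as a String so the sketch stays independent of any serializer.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;

/** Envelope written to the DLQ topic; field names are illustrative. */
record DeadLetterEvent(
        String originalPayload,
        String originalTopic,
        int originalPartition,
        long originalOffset,
        String consumerGroup,
        Instant failedAt,
        String errorMessage,
        String stackTrace) {

    /** Captures the failure context for one record that could not be processed. */
    static DeadLetterEvent of(String payload, String topic, int partition, long offset,
                              String consumerGroup, Throwable error) {
        StringWriter trace = new StringWriter();
        error.printStackTrace(new PrintWriter(trace, true));
        return new DeadLetterEvent(payload, topic, partition, offset, consumerGroup,
                Instant.now(), String.valueOf(error.getMessage()), trace.toString());
    }
}
```

Storing the original topic, partition, and offset makes it possible to trace a DLQ entry back to the exact source record when investigating.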
2. Modified Consumer Logic with DLQ
Let’s refine the “Per-Aggregate Processing” logic to include the DLQ step.
// Assumes you have a separate KafkaProducer instance for the DLQ
KafkaProducer<String, DeadLetterEvent> dlqProducer = ...;
while (true) {
    ConsumerRecords<String, Event> records = consumer.poll(...);
    if (records.isEmpty()) continue;
    // Group records by aggregate key
    Map<String, List<ConsumerRecord<String, Event>>> recordsByAggregate = groupRecordsByKey(records);
    Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = new HashMap<>();
    // Set when a transient error means the whole poll must be retried
    boolean retryBatch = false;
    for (Map.Entry<String, List<ConsumerRecord<String, Event>>> entry : recordsByAggregate.entrySet()) {
        String aggregateId = entry.getKey();
        List<ConsumerRecord<String, Event>> aggregateEvents = entry.getValue();
        Connection dbConnection = database.getConnection();
        dbConnection.setAutoCommit(false);
        try {
            for (ConsumerRecord<String, Event> record : aggregateEvents) {
                // Your idempotent database update logic
                updateDatabaseForAggregate(record.value(), dbConnection);
            }
            // If all events for this aggregate succeed, commit the DB transaction
            dbConnection.commit();
            // And mark the final offset for this aggregate as ready to commit
            markOffsetsAsProcessed(aggregateEvents, offsetsToCommit);
        } catch (SQLException e) {
            // A "poison pill" or unrecoverable error was detected!
            // (Transient SQL errors should be separated out first; see Differentiating Error Types.)
            dbConnection.rollback(); // Roll back any partial DB changes for this aggregate
            System.err.println("Unrecoverable error processing aggregate " + aggregateId + ". Sending to DLQ. Error: " + e.getMessage());
            // **THE DLQ LOGIC**
            // Send the entire batch of events for this failing aggregate to the DLQ.
            // It's crucial to send the whole batch to preserve their relative order.
            for (ConsumerRecord<String, Event> failedRecord : aggregateEvents) {
                DeadLetterEvent dlqEvent = createDlqEvent(failedRecord, e);
                // Use the same key to keep ordering in the DLQ
                dlqProducer.send(new ProducerRecord<>("customer-events-dlq", aggregateId, dlqEvent));
            }
            // send() is asynchronous; flush() blocks until the buffered DLQ records have completed
            dlqProducer.flush();
            // **CRITICAL STEP:** We have now "handled" these poison pill events by sending them to the DLQ.
            // We must treat them as successfully processed from the perspective of the main topic
            // so we can commit their offsets and move on.
            markOffsetsAsProcessed(aggregateEvents, offsetsToCommit);
        } catch (Exception otherException) {
            // Handle transient errors (e.g., network timeout) differently.
            // You might want to retry these without going to the DLQ immediately.
            dbConnection.rollback();
            System.err.println("Transient error occurred. Will retry batch.");
            // Break the loop and retry the whole poll
            retryBatch = true;
            break;
        } finally {
            dbConnection.close();
        }
    }
    // After processing all aggregates in the poll...
    if (retryBatch) {
        // Commit nothing: offsetsToCommit holds per-partition maxima, which could skip records
        // of aggregates that were not processed. Rewind so the next poll returns this batch again.
        for (TopicPartition partition : records.partitions()) {
            consumer.seek(partition, records.records(partition).get(0).offset());
        }
    } else if (!offsetsToCommit.isEmpty()) {
        // Commit the offsets for both successfully processed and DLQ'd aggregates.
        // This advances the consumer past the poison pill.
        consumer.commitSync(offsetsToCommit);
    }
}
// Helper method to create a DLQ event
private DeadLetterEvent createDlqEvent(ConsumerRecord<String, Event> record, Exception e) {
    // Populate with original event, error message, topic, offset, etc.
    return new DeadLetterEvent(...);
}
// Helper method to update the offsets map
private void markOffsetsAsProcessed(List<ConsumerRecord<String, Event>> records, Map<TopicPartition, OffsetAndMetadata> offsets) {
    records.forEach(rec -> {
        TopicPartition partition = new TopicPartition(rec.topic(), rec.partition());
        OffsetAndMetadata offset = new OffsetAndMetadata(rec.offset() + 1);
        offsets.merge(partition, offset, (oldVal, newVal) -> newVal.offset() > oldVal.offset() ? newVal : oldVal);
    });
}
What to Do with the DLQ Topic?
The DLQ is not a garbage can. It’s a hospital for sick messages. You need a strategy for managing it.
- Monitoring and Alerting: Set up alerts on the DLQ topic. A message landing here is an exceptional event that indicates a bug, bad data, or a system inconsistency. A human needs to be notified.
- Manual Intervention: An operator or developer should inspect the DLQ message.
  - Is it a bug in the consumer? If so, deploy a fix to the consumer code.
  - Is it bad data from the producer? For example, a UserRegistered event was sent with an email that already exists. The upstream service needs to be fixed.
  - Is it a state inconsistency? Maybe an event arrived out of order due to a misconfiguration, and the state it expects in the database doesn’t exist yet.
- Reprocessing (The “Re-drive” Pattern): Once the underlying issue is fixed (e.g., the consumer bug is patched, or the inconsistent DB state is manually corrected), you need a way to re-introduce the events from the DLQ back into the main processing flow. This is typically done with a separate utility or “re-driver” application that reads from the DLQ and publishes the original event back to the original topic.
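The re-drive step can be sketched without any Kafka dependency. The following is a minimal illustration, not a definitive implementation: DeadLetter, EventPublisher, and DlqRedriver are hypothetical names, and EventPublisher merely stands in for a real Kafka producer. The sketch republishes each entry with its original key and, once one entry for a key fails, holds back later entries for that key so per-aggregate ordering is kept.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical shape of a DLQ entry: the original event plus failure context.
record DeadLetter(String originalTopic, String key, String originalPayload, String error) {}

// Hypothetical stand-in for a Kafka producer send.
interface EventPublisher {
    void publish(String topic, String key, String payload) throws Exception;
}

final class DlqRedriver {
    private final EventPublisher publisher;

    DlqRedriver(EventPublisher publisher) {
        this.publisher = publisher;
    }

    /**
     * Republishes each dead letter to its original topic with its original key.
     * Returns the entries that were not republished; they stay in the DLQ.
     */
    List<DeadLetter> redrive(List<DeadLetter> deadLetters) {
        List<DeadLetter> remaining = new ArrayList<>();
        Set<String> blockedKeys = new HashSet<>();
        for (DeadLetter letter : deadLetters) {
            if (blockedKeys.contains(letter.key())) {
                // An earlier event for this aggregate failed; keep the order intact.
                remaining.add(letter);
                continue;
            }
            try {
                publisher.publish(letter.originalTopic(), letter.key(), letter.originalPayload());
            } catch (Exception e) {
                blockedKeys.add(letter.key());
                remaining.add(letter);
            }
        }
        return remaining;
    }
}
```

In a real re-driver the returned list would correspond to DLQ offsets that are left uncommitted, so the same entries are picked up on the next run.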
Differentiating Error Types
It’s crucial to distinguish between:
- Transient Errors: Network issues, temporary database unavailability, lock timeouts. These are retryable. The correct response is to roll back and not commit the offset, forcing a retry on the next poll.
- Permanent Errors: DB constraint violations, deserialization errors, unrecoverable business logic failures (NullPointerException). These are not retryable. The correct response is to route to the DLQ and commit the offset to move on.
Your catch blocks should be structured to differentiate these.
try {
// ... processing logic
} catch (SQLIntegrityConstraintViolationException | DeserializationException e) {
// PERMANENT: Rollback, send to DLQ, commit offset
} catch (SQLTransientConnectionException | LockTimeoutException e) {
// TRANSIENT: Rollback, DO NOT commit offset, let it retry
} catch (Exception e) {
// Generic catch-all, probably treat as permanent to be safe
// Rollback, send to DLQ, commit offset
}
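One way to keep the catch blocks short is to centralise the decision in a small classifier. The sketch below is an assumption-laden illustration: ErrorKind and ErrorClassifier are hypothetical names, and only the standard java.sql exception hierarchy is inspected. Library-specific types such as a deserialization exception or a lock timeout exception would need their own checks.

```java
import java.sql.SQLNonTransientException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

// Hypothetical result type: tells the consumer which branch of the pattern to take.
enum ErrorKind { TRANSIENT, PERMANENT }

final class ErrorClassifier {
    private ErrorClassifier() {}

    static ErrorKind classify(Throwable error) {
        // Walk the cause chain so wrapped JDBC exceptions are still recognised.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLTransientException || t instanceof SQLRecoverableException) {
                // Roll back, do NOT commit the offset, let the next poll retry.
                return ErrorKind.TRANSIENT;
            }
            if (t instanceof SQLNonTransientException) {
                // Roll back, send to the DLQ, commit the offset.
                return ErrorKind.PERMANENT;
            }
        }
        // Unknown failures are treated as permanent, matching the generic catch-all.
        return ErrorKind.PERMANENT;
    }
}
```

SQLTransientConnectionException and SQLTimeoutException are subclasses of SQLTransientException, and SQLIntegrityConstraintViolationException is a subclass of SQLNonTransientException, so the two instanceof checks cover the cases named in the catch blocks.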
By implementing this complete pattern, you create a system that is not only transactional and correct but also self-healing. It can automatically isolate failures, alert you to the problem, and continue operating for all healthy aggregates, preventing a single bad event from bringing your entire system to a halt.
Notification and Event Store
1. Is notification_t replacing the DLQ?
Short Answer: No, not effectively. They serve different primary purposes, though they can complement each other.
Let’s clarify the roles:
- Dead Letter Queue (DLQ - Kafka Topic):
  - Primary Purpose: Operational recovery. It’s a queue of unprocessable messages that allows your consumer to move on and continue processing subsequent messages. It’s designed for reprocessing the original event once the underlying issue (code bug, bad data, external system outage) is resolved.
  - Nature: A temporary holding area for raw events that need to be re-driven into the main processing flow. It’s part of your automated error handling and retry mechanism.
  - Mechanism: It preserves the original message payload (and its context) in a format easily consumable by other Kafka applications (like a re-driver).
- notification_t (Database Table):
  - Primary Purpose: Audit, visibility, and user-facing reporting. It’s a record of processing outcomes (success/failure) and associated metadata (error messages). It’s a read model or a projection for displaying status.
  - Nature: A durable log or materialized view of processing activity. It’s primarily for human intervention and analysis.
  - Mechanism: Stores a summary or specific details about what happened during processing, typically in a structured way that can be queried and displayed.
Why notification_t doesn’t replace a DLQ:
- Reprocessing:
  - If an event fails and you only log it to notification_t, your Kafka consumer must either stay stuck on that message or commit its offset and skip it. Once the offset is committed, the message is not redelivered to that consumer group, and it eventually disappears from the topic under the retention policy. You’d then have to reconstruct the original message from notification_t and manually re-publish it to Kafka, which is cumbersome.
  - A DLQ (Kafka topic) already holds the raw message and allows for a more automated re-driving process.
- Operational Flow:
  - A DLQ is part of an automated pipeline: consumer fails -> sends to DLQ -> consumer moves on. Alerts are triggered.
  - With just notification_t, you need an external mechanism (a human reading the UI, another scheduled job) to query the table, identify failures, and trigger manual re-publishing. This is less reactive and less scalable.
- Mixing Concerns:
  - Your notification_t table correctly stores processing results. This is a projection of the events.
  - The raw events themselves are what need to be re-driven.
  - A DLQ focuses solely on holding the raw, unprocessable events.
How they can complement each other:
- When an event is sent to the DLQ, you also log an entry in notification_t indicating the failure, which event was sent to the DLQ, and why. This provides the user-facing visibility you want while maintaining the operational robustness of the DLQ.
- Your re-driver for the DLQ could also update the notification_t entry when an event is successfully re-processed.
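The two mechanisms working together might look like the sketch below. All names here (NotificationRow, DlqSender, NotificationWriter, PermanentFailureHandler) are hypothetical, and the NotificationRow fields are an assumed subset of notification_t, whose real columns are not shown in this section.

```java
import java.time.Instant;

// Hypothetical row shape for notification_t; the real columns may differ.
record NotificationRow(String eventId, boolean success, String error, Instant processedAt) {}

// Hypothetical stand-ins for the DLQ producer and the notification_t writer.
interface DlqSender {
    void send(String key, String rawEvent, String error);
}

interface NotificationWriter {
    void write(NotificationRow row);
}

final class PermanentFailureHandler {
    private final DlqSender dlq;
    private final NotificationWriter notifications;

    PermanentFailureHandler(DlqSender dlq, NotificationWriter notifications) {
        this.dlq = dlq;
        this.notifications = notifications;
    }

    /** Routes the raw event to the DLQ for re-driving and records the outcome for audit/UI. */
    void handle(String eventId, String key, String rawEvent, Exception cause) {
        String error = cause.getClass().getSimpleName() + ": " + cause.getMessage();
        // Operational recovery: the raw event is preserved for a later re-drive.
        dlq.send(key, rawEvent, error);
        // Visibility: the failure is recorded as a processing outcome.
        notifications.write(new NotificationRow(eventId, false, error, Instant.now()));
    }
}
```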
Conclusion on DLQ vs. notification_t: Your notification_t is a valuable audit and reporting tool, but it should not be your sole mechanism for handling unprocessable Kafka messages. The DLQ pattern with a dedicated Kafka topic is the industry standard for robust, scalable error handling and reprocessing in a streaming architecture.
2. Using notification_t as the Event Store for replay?
Short Answer: This is generally a poor idea due to mixed concerns and potential data loss, unless your notification_t is specifically designed as a pure Event Store.
Let’s define “Event Store” in Event Sourcing:
- The Event Store: This is the single, authoritative source of truth for your system’s state. It stores all historical domain events (atomic, immutable facts) in the exact order they occurred, for all time (or at least for a very long retention period). It’s used to:
- Rebuild the current state of an aggregate.
- Replay all events to build new read models (projections).
- Perform historical analysis.
Evaluating notification_t as an Event Store:
- “Save all the events”: This is the fundamental requirement. If it indeed stores the full, raw, original event payload for every event that enters your system, then this part is met.
- “Success or failure of the processing with error message”: This is where it breaks the Event Store principle. An Event Store should only contain facts that happened. Whether an event was processed successfully or failed is a derived state (a projection or audit log entry), not the event itself.
  - Problem 1: Mixing Concerns: Mixing raw events with processing results violates the purity of an Event Store. It makes the Event Store harder to reason about and potentially less efficient for replay.
  - Problem 2: Data Integrity/Purity for Replay: If you replay events from this table, do you replay the “success/failure” status? No, you only care about the event itself. This metadata is irrelevant for rebuilding aggregate state or building new projections.
- “Kafka topic might not contain all the events”: This is a critical point.
  - If your Kafka topics have short retention (e.g., 7 days), then yes, you absolutely need an external, durable Event Store that retains events indefinitely.
  - A relational database is a perfectly valid choice for an Event Store. Many Event Sourcing implementations use a relational DB table (events or event_stream) where each row is an event, uniquely identified, with the aggregate ID, sequence number, event type, and event payload.
Recommendation for your Event Store:
- Dedicate a separate table as your Event Store: If you want to use a database for event storage (which is fine!), create a table specifically for events or event_store.
  - This table should only store the raw, immutable event data (ID, aggregate ID, sequence number, timestamp, type, payload).
  - It should not include processing status or error messages. These are projection data.
- Maintain notification_t as a separate read model: Your notification_t table should remain exactly as it is: a valuable projection that tracks the processing success/failure for auditing and user visibility. It is derived from the events, not the source of events.
- Choose your Event Store carefully:
  - Kafka: If you configure Kafka topics for infinite retention (or very long, e.g., years), Kafka can serve as your primary Event Store. This is powerful for stream processing.
  - Relational Database: If Kafka retention is short, or if you prefer a traditional database for your Event Store, design a dedicated events table that conforms to the Event Store principles.
  - Specialized Event Stores: There are also purpose-built Event Stores (e.g., EventStoreDB) that excel at this task.
Combined Ideal Architecture:
- Producers send atomic events to Kafka topics (e.g., customer-events) using Kafka transactions for atomicity.
- Kafka topics are configured with long retention if they are your Event Store, or short retention if you’re using a separate database as Event Store.
- Consumers/Stream Processors read from Kafka.
  - They process events (e.g., update a materialized view, send to another service).
  - They also persist a copy of the event to your dedicated, pure Event Store table (if Kafka isn’t your long-term store). This must happen transactionally with their other updates or be a separate, guaranteed write.
  - They also update your notification_t table with the processing status (success/failure) for audit/UI.
  - If an unrecoverable error occurs, they send the original event to a DLQ topic (e.g., customer-events-dlq) and log the failure to notification_t.
  - They then commit their offsets on the main topic.
- DLQ management: Separate processes monitor the DLQ, allow manual inspection, and facilitate re-driving events back to the main topic after a fix.
- Replaying: When needed, new services or new read models can be built by replaying all events from your chosen Event Store (long-retention Kafka topics or your dedicated events database table).
This separation of concerns leads to a much more robust, understandable, and maintainable system.
Event Store
When to populate the event store is a crucial design decision for ensuring the integrity and completeness of your Event Store. There are two main patterns to consider, each with its own trade-offs.
Option 1: Populating the Event Store (DB Table) Before Publishing to Kafka (Transactional Outbox Pattern)
This is generally the most robust and recommended approach for ensuring at-least-once (often effectively once) persistence of your events. It guarantees that an event is durably stored in your Event Store before it is ever considered for publishing to Kafka.
How it works:
- Command Processing:
  - Your Aggregate receives a command and generates a list of atomic events.
  - These events are persisted to your dedicated Event Store table (e.g., events_store_t) within the same local database transaction as any state changes to your aggregate’s materialized view (if applicable). This is the key: a single local transaction.
  - Alongside storing the event in events_store_t, the event is also stored in an “Outbox” table (e.g., outbox_messages) in the same database transaction. The outbox_messages table serves as a temporary holding area for events that need to be published to Kafka.
- Outbox Relayer/Publisher:
  - A separate, dedicated process (the “Outbox Relayer” or “Change Data Capture (CDC) Publisher”) continuously monitors the outbox_messages table for new entries.
  - When it finds new events in the outbox_messages table, it reads them and publishes them to Kafka.
  - After successfully publishing to Kafka, it marks the event as “published” in the outbox_messages table or deletes it.
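A polling relayer can be modelled in a few lines. This is a sketch under stated assumptions rather than the actual relayer: OutboxRow, OutboxPublisher, and OutboxRelayer are hypothetical names, an in-memory list stands in for the outbox_messages table, and the published flag is an assumed marker for the "marks the event as published" step.

```java
import java.util.List;

// Hypothetical in-memory stand-in for one row of the outbox_messages table.
final class OutboxRow {
    final String id;
    final String aggregateId;
    final String payload;
    boolean published;   // assumed marker column, used instead of deleting the row

    OutboxRow(String id, String aggregateId, String payload) {
        this.id = id;
        this.aggregateId = aggregateId;
        this.payload = payload;
    }
}

// Hypothetical stand-in for a Kafka producer send that returns once acknowledged.
interface OutboxPublisher {
    void publish(String key, String payload) throws Exception;
}

final class OutboxRelayer {
    private final List<OutboxRow> outbox;   // rows in insertion order
    private final OutboxPublisher publisher;

    OutboxRelayer(List<OutboxRow> outbox, OutboxPublisher publisher) {
        this.outbox = outbox;
        this.publisher = publisher;
    }

    /** One polling pass. Returns the number of rows published in this pass. */
    int relayOnce() {
        int publishedCount = 0;
        for (OutboxRow row : outbox) {
            if (row.published) {
                continue;
            }
            try {
                publisher.publish(row.aggregateId, row.payload);
            } catch (Exception e) {
                // Stop at the first failure so later rows are not published out of order.
                break;
            }
            row.published = true;   // mark only after the send succeeded
            publishedCount++;
        }
        return publishedCount;
    }
}
```

If the process crashes after a send succeeds but before the row is marked, the next pass sends the same row again. The relayer is therefore at-least-once, and consumers should tolerate duplicates.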
Why this is best:
- Atomicity Guaranteed (Local): The critical guarantee is that the event is either stored in your Event Store AND in the Outbox table, or neither. If the application crashes after generating events but before publishing to Kafka, the events are durably stored in the Outbox and will be published later by the relayer.
- No Data Loss: Events are never lost between generation and publication to Kafka.
- Decoupling: The service generating events doesn’t need to know about Kafka’s availability. It only needs to commit to its local database. The Outbox Relayer handles the Kafka dependency.
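- Effectively Once: The relayer itself delivers at least once, because a crash after publishing but before marking the row causes a re-publish. Combined with Kafka’s idempotent producer and consumers that deduplicate on the event ID, this yields effectively-once processing.
- Source of Truth: The events_store_t database table is the source of truth, and it can be queried directly.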
Where the events_store_t is populated:
- In the same local DB transaction where the events are generated and recorded in the Outbox table.
Option 2: Populating the Event Store (DB Table) After Consuming from Kafka
This approach involves two stages of atomicity: first, the producer guarantees delivery to Kafka, and then the consumer guarantees persistence from Kafka to your Event Store.
How it works:
- Command Processing & Kafka Publishing:
  - Your Aggregate generates events.
  - These events are immediately published to Kafka using Kafka producer transactions (as we discussed previously, to guarantee all events from a command are published atomically).
- Consumer Processing:
  - Your Kafka consumer (the one responsible for populating your Event Store) reads events from Kafka.
  - For each event (or batch of events from the same aggregate), it persists the event to your dedicated events_store_t table within a local database transaction.
  - Crucially: It commits the Kafka offset only after the database transaction to events_store_t is successful.
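The ordering "persist first, commit the offset second" implies that a record can be delivered twice if the consumer crashes between the two steps. A minimal in-memory model of that behaviour, with no Kafka or JDBC involved, is sketched below. EventStoreConsumer is a hypothetical name, the map stands in for events_store_t keyed by event ID, and the idempotent insert is an assumption about how duplicates would be absorbed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of the Option 2 consumer.
final class EventStoreConsumer {
    // Stands in for events_store_t, keyed by event ID (the primary key).
    private final Map<String, String> eventStore = new LinkedHashMap<>();
    // Stands in for the committed consumer offset: the next offset to read.
    private long committedOffset = 0L;

    /**
     * Handles one record. The insert is idempotent on the event ID, so a record
     * redelivered after a crash between the two steps does not create a duplicate.
     */
    void handle(long offset, String eventId, String payload) {
        eventStore.putIfAbsent(eventId, payload);   // step 1: persist (the DB transaction)
        committedOffset = offset + 1;               // step 2: only then advance the offset
    }

    int storedCount() {
        return eventStore.size();
    }

    long committedOffset() {
        return committedOffset;
    }
}
```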
Why this is generally less ideal for the primary Event Store:
- Producer Responsibility: The service that generates the events also has the responsibility of publishing to Kafka. If Kafka is down or slow, the producer service might be blocked or need to implement complex retry logic.
- Data Durability Gap: There’s a theoretical, albeit small, window where events are generated but might not yet be durably committed to your authoritative events_store_t database if the consumer or Kafka has issues. (Kafka itself provides durability, but your application’s Event Store is separate.)
- Complexity for Replay: If your consumer fails and you need to replay events, where do you replay from? Kafka? What if Kafka’s retention is short? This pattern requires Kafka to be the true long-term Event Store, or it introduces a reliance on the consumer correctly populating the DB.
- Source of Truth: The Kafka topic is written first, so it effectively becomes the event store. Kafka does not support querying events directly.
Where the events_store_t is populated:
- In the consumer process, within a local DB transaction, after polling from Kafka.
Answering your Specific Questions:
- “Where is the best place to populate this table?” The best place is in the same database transaction where the event is generated and stored in an Outbox table (Option 1). This ensures that your authoritative Event Store (your events_store_t table) is always the first and most reliable source of truth.
- “In the second case, should we populate the failed events (events went to DLQ) to the event store?” This question is about what constitutes “truth” in your Event Store.
  Yes. You should populate all original events to the events_store_t table regardless of whether they later cause a processing error or end up in a DLQ.
  Reasoning:
  - The events_store_t is a record of what happened in the domain. An event like OrderPlaced is a fact that occurred, regardless of whether a downstream system successfully processed it or failed due to a unique constraint violation.
  - The events_store_t should be pure. It tells the story of your system’s state changes.
  - The fact that an event failed to be processed by a consumer is a processing audit detail that belongs in your notification_t table or system logs, not in the fundamental Event Store.
  - If you don’t put the failed event in events_store_t, you are losing part of your system’s history. When you rebuild state by replaying from events_store_t, you would miss this event, leading to an incorrect state.
  In summary:
  - events_store_t: Stores all events that happened, always.
  - notification_t: Stores the status of processing each event (success/failure, error message), as a projection.
  - DLQ: Stores unconsumable events for reprocessing.
Conclusion
I strongly recommend implementing the Transactional Outbox pattern (Option 1) for populating your events_store_t table. This pattern has become an industry best practice for achieving reliable event publishing from a database-backed service. It is more complex initially but provides superior durability and resilience compared to directly publishing to Kafka from your domain service.
And regardless of the publishing mechanism, your events_store_t should be a complete, immutable log of all domain events, untainted by processing outcomes.
Change Data Capture
Using Change Data Capture (CDC) (like Debezium) for the Transactional Outbox is the gold standard for reliably publishing events from a database-backed service to Kafka.
Here’s a detailed design and a conceptual Java implementation for the producer side, along with the Debezium configuration.
Overall Architecture
- Producer Service (Your Java Application):
  - Receives commands (e.g., UpdateCustomerProfileCommand).
  - Interacts with the CustomerAggregate.
  - Generates a list of atomic domain events (e.g., CustomerNameChanged, CustomerAddressChanged).
  - Crucially: Persists these events to two database tables within a single local database transaction:
    - events_store_t: Your immutable, authoritative Event Store (long-term historical log).
    - outbox_messages: A temporary table used by CDC to pick up events for Kafka.
- Transactional Outbox Table (outbox_messages):
  - A simple database table that acts as a queue for events to be published.
  - Rows are inserted into this table in the same transaction as any other domain state changes.
- CDC Tool (Debezium):
  - Monitors the outbox_messages table (and potentially events_store_t if you want a separate stream for the full event store, though typically you’d monitor the outbox).
  - Detects new rows (inserts).
  - Captures the after image of the inserted row.
  - Transforms this data into a Kafka message.
  - Publishes the Kafka message to the configured topic.
- Kafka Topic(s):
  - Events are published here. You can configure Debezium to route events to different topics based on the aggregate_type or event_type from your outbox_messages table.
- Kafka Consumers:
  - Your downstream services (stream processors, materialized view builders, notification services) consume from these Kafka topics.
  - They process the events, update their read models, and commit their offsets.
Design of the Database Tables
1. events_store_t (Your Primary Event Store)
This table holds the immutable, ordered sequence of all domain events.
CREATE TABLE events_store_t (
id UUID PRIMARY KEY, -- Unique ID for the event itself
aggregate_id VARCHAR(255) NOT NULL, -- The ID of the aggregate (e.g., customer-123)
aggregate_type VARCHAR(255) NOT NULL, -- The type of aggregate (e.g., 'Customer')
event_type VARCHAR(255) NOT NULL, -- The specific type of event (e.g., 'CustomerNameChanged')
sequence_number BIGINT NOT NULL, -- Monotonically increasing sequence number per aggregate
timestamp TIMESTAMP WITH TIME ZONE NOT NULL, -- When the event occurred
payload JSONB NOT NULL, -- The full event payload (JSON)
metadata JSONB, -- Optional: correlation IDs, causation IDs, user ID, etc.
-- Constraints for event order and uniqueness per aggregate
UNIQUE (aggregate_id, sequence_number)
);
-- Index for efficient lookup by aggregate
CREATE INDEX idx_events_store_aggregate ON events_store_t (aggregate_id);
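The UNIQUE (aggregate_id, sequence_number) constraint is what lets the table act as an optimistic concurrency check: two writers that both read version N and both try to append N+1 cannot both succeed. The in-memory model below illustrates the idea. InMemoryEventStore and StoredEvent are hypothetical names, and the model is stricter than the SQL constraint because it also rejects gaps, which the database constraint alone would not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of events_store_t for illustration only.
final class InMemoryEventStore {
    record StoredEvent(String aggregateId, long sequenceNumber, String eventType, String payload) {}

    private final Map<String, List<StoredEvent>> streams = new HashMap<>();

    /**
     * Appends an event. Rejects it unless its sequence number is exactly one past
     * the aggregate's last stored event, which is how a writer working from a
     * stale version is detected.
     */
    void append(StoredEvent event) {
        List<StoredEvent> stream = streams.computeIfAbsent(event.aggregateId(), id -> new ArrayList<>());
        long expected = stream.size() + 1L;   // sequence numbers start at 1
        if (event.sequenceNumber() != expected) {
            throw new IllegalStateException("Expected sequence " + expected
                    + " for " + event.aggregateId() + " but got " + event.sequenceNumber());
        }
        stream.add(event);
    }

    /** Returns an immutable copy of the aggregate's events in sequence order. */
    List<StoredEvent> load(String aggregateId) {
        return List.copyOf(streams.getOrDefault(aggregateId, List.of()));
    }
}
```

Against the real table, the losing writer would see a unique constraint violation on insert and would typically reload the aggregate and retry the command.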
2. outbox_messages (For CDC Publishing)
This table serves as the bridge to Kafka.
CREATE TABLE outbox_messages (
id UUID PRIMARY KEY, -- Unique ID for this outbox message
aggregate_id VARCHAR(255) NOT NULL, -- The ID of the aggregate (for Kafka key)
aggregate_type VARCHAR(255) NOT NULL, -- The type of aggregate (for Kafka topic routing)
event_type VARCHAR(255) NOT NULL, -- The specific type of event
timestamp TIMESTAMP WITH TIME ZONE NOT NULL, -- When the event was created
payload JSONB NOT NULL, -- The full event payload (JSON)
metadata JSONB -- Optional: correlation IDs, causation IDs, user ID, etc.
-- Note: No sequence_number here, as the Event Store manages that.
-- Debezium reads these rows from the database transaction log, in commit order.
);
-- An index on timestamp can be useful for manual cleanup or if not using CDC
-- CREATE INDEX idx_outbox_timestamp ON outbox_messages (timestamp);
Java Implementation (Producer Service)
We’ll use Spring Boot for simplicity, Spring Data JPA for database interaction, and Jackson for JSON serialization.
Dependencies (build.gradle):
dependencies {
implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
implementation 'org.springframework.boot:spring-boot-starter-web'
implementation 'org.postgresql:postgresql' // Or your chosen DB driver
runtimeOnly 'com.h2database:h2' // For in-memory testing convenience
compileOnly 'org.projectlombok:lombok'
annotationProcessor 'org.projectlombok:lombok'
implementation 'com.fasterxml.jackson.core:jackson-databind' // For JSON
implementation 'com.fasterxml.jackson.datatype:jackson-datatype-jsr310' // For Java 8 Date/Time
}
1. Domain Events
// domain/events/DomainEvent.java
package com.example.eventoutbox.domain.events;
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import java.time.Instant;
import java.util.UUID;
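// Use JsonTypeInfo for polymorphic deserialization (if you need to deserialize events later).
// EXISTING_PROPERTY reuses the value of getEventType() as the type ID, so "eventType" is written only once.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.EXISTING_PROPERTY, property = "eventType")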
@JsonSubTypes({
@JsonSubTypes.Type(value = CustomerNameChanged.class, name = "CustomerNameChanged"),
@JsonSubTypes.Type(value = CustomerAddressChanged.class, name = "CustomerAddressChanged")
})
public abstract class DomainEvent {
private final UUID eventId;
private final Instant timestamp;
private final String aggregateId;
private final String aggregateType;
private final long sequenceNumber; // Important for Event Sourcing
public DomainEvent(UUID eventId, Instant timestamp, String aggregateId, String aggregateType, long sequenceNumber) {
this.eventId = eventId;
this.timestamp = timestamp;
this.aggregateId = aggregateId;
this.aggregateType = aggregateType;
this.sequenceNumber = sequenceNumber;
}
public UUID getEventId() { return eventId; }
public Instant getTimestamp() { return timestamp; }
public String getAggregateId() { return aggregateId; }
public String getAggregateType() { return aggregateType; }
public long getSequenceNumber() { return sequenceNumber; }
public abstract String getEventType();
}
// domain/events/CustomerNameChanged.java
package com.example.eventoutbox.domain.events;
import java.time.Instant;
import java.util.UUID;
public class CustomerNameChanged extends DomainEvent {
private final String newName;
public CustomerNameChanged(UUID eventId, Instant timestamp, String customerId, long sequenceNumber, String newName) {
super(eventId, timestamp, customerId, "Customer", sequenceNumber);
this.newName = newName;
}
public String getNewName() { return newName; }
@Override
public String getEventType() { return "CustomerNameChanged"; }
}
// domain/events/CustomerAddressChanged.java
package com.example.eventoutbox.domain.events;
import java.time.Instant;
import java.util.UUID;
public class CustomerAddressChanged extends DomainEvent {
private final String newAddress; // Simple string for address example
public CustomerAddressChanged(UUID eventId, Instant timestamp, String customerId, long sequenceNumber, String newAddress) {
super(eventId, timestamp, customerId, "Customer", sequenceNumber);
this.newAddress = newAddress;
}
public String getNewAddress() { return newAddress; }
@Override
public String getEventType() { return "CustomerAddressChanged"; }
}
2. Aggregate
// domain/Customer.java
package com.example.eventoutbox.domain;
import com.example.eventoutbox.domain.events.CustomerAddressChanged;
import com.example.eventoutbox.domain.events.CustomerNameChanged;
import com.example.eventoutbox.domain.events.DomainEvent;
import lombok.Getter;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
// This is a simplified Aggregate. In a real ES system, you'd load state from events.
// For this example, we're just focusing on event generation.
@Getter
public class Customer {
private final String customerId;
private String name;
private String address;
private long currentSequenceNumber; // Sequence number of the most recent event (0 for a new aggregate)
private final List<DomainEvent> uncommittedEvents = new ArrayList<>();
public Customer(String customerId, long currentSequenceNumber) {
this.customerId = customerId;
this.currentSequenceNumber = currentSequenceNumber;
}
public static Customer create(String customerId) {
return new Customer(customerId, 0L); // Start with seq 0 for a new aggregate
}
public void changeName(String newName) {
if (!newName.equals(this.name)) { // Only emit event if something actually changed
this.name = newName;
this.currentSequenceNumber++;
uncommittedEvents.add(new CustomerNameChanged(UUID.randomUUID(), Instant.now(), customerId, currentSequenceNumber, newName));
}
}
public void changeAddress(String newAddress) {
if (!newAddress.equals(this.address)) {
this.address = newAddress;
this.currentSequenceNumber++;
uncommittedEvents.add(new CustomerAddressChanged(UUID.randomUUID(), Instant.now(), customerId, currentSequenceNumber, newAddress));
}
}
// After events are stored, clear them
public void markEventsCommitted() {
this.uncommittedEvents.clear();
}
}
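The comment at the top of the Customer class notes that a real Event Sourcing system would load state from events. A minimal sketch of that rehydration step follows. CustomerEvent, NameChanged, AddressChanged, and CustomerState are hypothetical, simplified stand-ins for the DomainEvent hierarchy and the Customer aggregate, reduced to the fields needed to show the fold.

```java
import java.util.List;

// Hypothetical, simplified event types; the real DomainEvent classes carry more fields.
interface CustomerEvent {
    long sequenceNumber();
}

record NameChanged(long sequenceNumber, String newName) implements CustomerEvent {}

record AddressChanged(long sequenceNumber, String newAddress) implements CustomerEvent {}

final class CustomerState {
    private String name;
    private String address;
    private long version;   // sequence number of the last applied event

    /** Rebuilds state by applying the stored events in sequence order. */
    static CustomerState fromHistory(List<CustomerEvent> history) {
        CustomerState state = new CustomerState();
        for (CustomerEvent event : history) {
            state.apply(event);
        }
        return state;
    }

    private void apply(CustomerEvent event) {
        if (event.sequenceNumber() != version + 1) {
            throw new IllegalStateException("Gap or reordering at sequence " + event.sequenceNumber());
        }
        if (event instanceof NameChanged e) {
            name = e.newName();
        } else if (event instanceof AddressChanged e) {
            address = e.newAddress();
        }
        version = event.sequenceNumber();
    }

    String name() { return name; }
    String address() { return address; }
    long version() { return version; }
}
```

After rehydration, the version field gives the starting point for the sequence numbers of any new events the aggregate emits.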
3. Persistence Layer (Entities and Repositories)
// infrastructure/persistence/outbox/OutboxMessage.java
package com.example.eventoutbox.infrastructure.persistence.outbox;
import jakarta.persistence.*;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;
import java.time.Instant;
import java.util.UUID;
@Entity
@Table(name = "outbox_messages")
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class OutboxMessage {
@Id
private UUID id; // Event ID
private String aggregateId;
private String aggregateType;
private String eventType;
private Instant timestamp;
@JdbcTypeCode(SqlTypes.JSON) // For PostgreSQL JSONB type
@Column(columnDefinition = "jsonb")
private String payload; // Store payload as JSON string
@JdbcTypeCode(SqlTypes.JSON)
@Column(columnDefinition = "jsonb")
private String metadata; // Optional metadata as JSON string
}
// infrastructure/persistence/outbox/OutboxMessageRepository.java
package com.example.eventoutbox.infrastructure.persistence.outbox;
import org.springframework.data.jpa.repository.JpaRepository;
import java.util.UUID;
public interface OutboxMessageRepository extends JpaRepository<OutboxMessage, UUID> {}
// infrastructure/persistence/eventstore/EventStoreEvent.java
package com.example.eventoutbox.infrastructure.persistence.eventstore;
import jakarta.persistence.*;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;
import java.time.Instant;
import java.util.UUID;
@Entity
@Table(name = "events_store_t")
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class EventStoreEvent {
@Id
private UUID id; // Event ID
private String aggregateId;
private String aggregateType;
private String eventType;
private Instant timestamp;
private long sequenceNumber;
@JdbcTypeCode(SqlTypes.JSON)
@Column(columnDefinition = "jsonb")
private String payload; // Store payload as JSON string
@JdbcTypeCode(SqlTypes.JSON)
@Column(columnDefinition = "jsonb")
private String metadata; // Optional metadata as JSON string
}
// infrastructure/persistence/eventstore/EventStoreEventRepository.java
package com.example.eventoutbox.infrastructure.persistence.eventstore;
import org.springframework.data.jpa.repository.JpaRepository;
import java.util.UUID;
public interface EventStoreEventRepository extends JpaRepository<EventStoreEvent, UUID> {}
4. Application Service (Handles Commands and Persistence)
This is where the magic of the single transaction happens.
// application/CustomerApplicationService.java
package com.example.eventoutbox.application;
import com.example.eventoutbox.domain.Customer;
import com.example.eventoutbox.domain.events.DomainEvent;
import com.example.eventoutbox.infrastructure.persistence.eventstore.EventStoreEvent;
import com.example.eventoutbox.infrastructure.persistence.eventstore.EventStoreEventRepository;
import com.example.eventoutbox.infrastructure.persistence.outbox.OutboxMessage;
import com.example.eventoutbox.infrastructure.persistence.outbox.OutboxMessageRepository;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.io.IOException;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
@Service
@RequiredArgsConstructor
public class CustomerApplicationService {
private final OutboxMessageRepository outboxMessageRepository;
private final EventStoreEventRepository eventStoreEventRepository;
private final ObjectMapper objectMapper; // For JSON serialization
// Represents an incoming command from e.g., a REST endpoint
public record UpdateCustomerProfileCommand(String customerId, String newName, String newAddress) {}
// @Transactional ensures that all database operations within this method
// (saving to outbox_messages and events_store_t) are part of a single DB transaction.
@Transactional
public void updateCustomerProfile(UpdateCustomerProfileCommand command) {
// --- 1. Load/Create Aggregate (Simplified for this example) ---
// In a real Event Sourcing system, you would load the Customer's state
// by replaying events from eventStoreEventRepository for command.customerId.
// For simplicity, we'll assume a new customer or just focus on event generation.
Customer customer = Customer.create(command.customerId());
// customer.loadFromEvents(eventStoreEventRepository.findByAggregateIdOrderBySequenceNumberAsc(command.customerId()));
// --- 2. Apply Business Logic & Generate Events ---
if (command.newName() != null) {
customer.changeName(command.newName());
}
if (command.newAddress() != null) {
customer.changeAddress(command.newAddress());
}
// --- 3. Persist Events to Event Store & Outbox (Atomically) ---
List<DomainEvent> eventsToStore = customer.getUncommittedEvents();
if (eventsToStore.isEmpty()) {
return; // No changes, no events to publish
}
List<EventStoreEvent> eventStoreEntities = eventsToStore.stream()
.map(this::mapToEventStoreEvent)
.collect(Collectors.toList());
eventStoreEventRepository.saveAll(eventStoreEntities); // Save to the authoritative Event Store
List<OutboxMessage> outboxMessages = eventsToStore.stream()
.map(this::mapToOutboxMessage)
.collect(Collectors.toList());
outboxMessageRepository.saveAll(outboxMessages); // Save to the Outbox for CDC
customer.markEventsCommitted(); // Clear uncommitted events after successful persistence
}
private OutboxMessage mapToOutboxMessage(DomainEvent event) {
try {
return OutboxMessage.builder()
.id(event.getEventId())
.aggregateId(event.getAggregateId())
.aggregateType(event.getAggregateType())
.eventType(event.getEventType())
.timestamp(event.getTimestamp())
.payload(objectMapper.writeValueAsString(event)) // Serialize event to JSON
.metadata(null) // Add actual metadata if needed
.build();
} catch (IOException e) {
throw new RuntimeException("Failed to serialize event to JSON: " + event.getEventId(), e);
}
}
private EventStoreEvent mapToEventStoreEvent(DomainEvent event) {
try {
return EventStoreEvent.builder()
.id(event.getEventId())
.aggregateId(event.getAggregateId())
.aggregateType(event.getAggregateType())
.eventType(event.getEventType())
.timestamp(event.getTimestamp())
.sequenceNumber(event.getSequenceNumber())
.payload(objectMapper.writeValueAsString(event)) // Serialize event to JSON
.metadata(null) // Add actual metadata if needed
.build();
} catch (IOException e) {
throw new RuntimeException("Failed to serialize event to JSON: " + event.getEventId(), e);
}
}
}
5. REST Controller (Entry Point)
// application/CustomerController.java
package com.example.eventoutbox.application;
import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
@RestController
@RequestMapping("/customers")
@RequiredArgsConstructor
public class CustomerController {
private final CustomerApplicationService customerApplicationService;
@PostMapping("/profile")
public ResponseEntity<String> updateCustomerProfile(@RequestBody CustomerApplicationService.UpdateCustomerProfileCommand command) {
customerApplicationService.updateCustomerProfile(command);
return ResponseEntity.ok("Customer profile update command received and processed.");
}
}
6. Spring Boot Application (and application.properties)
// EventOutboxApplication.java
package com.example.eventoutbox;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class EventOutboxApplication {
public static void main(String[] args) {
SpringApplication.run(EventOutboxApplication.class, args);
}
}
# application.properties (for H2 in-memory for testing)
spring.datasource.url=jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE
spring.datasource.driverClassName=org.h2.Driver
spring.datasource.username=sa
spring.datasource.password=
spring.jpa.database-platform=org.hibernate.dialect.H2Dialect
# Use 'update' for schema management in dev
spring.jpa.hibernate.ddl-auto=update
# Good practice for Instant
spring.jackson.serialization.write-dates-as-timestamps=false
# If using PostgreSQL:
# spring.datasource.url=jdbc:postgresql://localhost:5432/yourdb
# spring.datasource.username=youruser
# spring.datasource.password=yourpassword
# spring.jpa.database-platform=org.hibernate.dialect.PostgreSQLDialect
Debezium Configuration (Conceptual)
You’ll deploy Debezium as a Kafka Connect connector. Here’s a sample configuration (e.g., postgresql-outbox-connector.json) for PostgreSQL.
{
"name": "outbox-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "secret",
"database.dbname": "configserver",
"database.server.name": "postgres",
"topic.prefix": "portal-event",
"schema.include.list": "public",
"table.include.list": "public.outbox_message_t",
"message.key.columns": "public.outbox_message_t:host_id",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"slot.name": "dbz_replication_slot",
"slot.drop.on.stop": "false",
"signal.when.disconnected": "true",
"tombstones.on.delete": "true",
"max.retries": 5,
"retry.delay.ms": 10000,
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"transforms": "unwrap,addTransactionIdHeader,timestamp_converter,outbox,extractPayload,extractKey,final_route",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "none",
"transforms.addTransactionIdHeader.type": "org.apache.kafka.connect.transforms.HeaderFrom$Value",
"transforms.addTransactionIdHeader.fields": "transaction_id",
"transforms.addTransactionIdHeader.headers": "transaction_id",
"transforms.addTransactionIdHeader.operation": "copy",
"transforms.timestamp_converter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.timestamp_converter.field": "event_ts",
"transforms.timestamp_converter.target.type": "unix",
"transforms.timestamp_converter.format": "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.table.field.event.id": "id",
"transforms.outbox.table.field.event.key": "host_id",
"transforms.outbox.table.field.event.type": "event_type",
"transforms.outbox.table.field.event.timestamp": "event_ts",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.metadata": "metadata",
"transforms.outbox.table.field.aggregate.type": "aggregate_type",
"transforms.outbox.table.field.aggregate.id": "aggregate_id",
"transforms.extractPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
"transforms.extractPayload.field": "payload",
"transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field": "host_id",
"transforms.final_route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.final_route.regex": "portal-event\\.public\\.outbox_message_t",
"transforms.final_route.replacement": "portal-event"
}
}
And here is the curl command to create the connector locally.
curl --location --request POST 'http://localhost:8083/connectors' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "outbox-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "secret",
"database.dbname": "configserver",
"database.server.name": "postgres",
"topic.prefix": "portal-event",
"schema.include.list": "public",
"table.include.list": "public.outbox_message_t",
"message.key.columns": "public.outbox_message_t:host_id",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"slot.name": "dbz_replication_slot",
"slot.drop.on.stop": "false",
"signal.when.disconnected": "true",
"tombstones.on.delete": "true",
"max.retries": 5,
"retry.delay.ms": 10000,
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"transforms": "unwrap,addTransactionIdHeader,timestamp_converter,outbox,extractPayload,extractKey,final_route",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "none",
"transforms.addTransactionIdHeader.type": "org.apache.kafka.connect.transforms.HeaderFrom$Value",
"transforms.addTransactionIdHeader.fields": "transaction_id",
"transforms.addTransactionIdHeader.headers": "transaction_id",
"transforms.addTransactionIdHeader.operation": "copy",
"transforms.timestamp_converter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.timestamp_converter.field": "event_ts",
"transforms.timestamp_converter.target.type": "unix",
"transforms.timestamp_converter.format": "yyyy-MM-dd'\''T'\''HH:mm:ss.SSSSSS'\''Z'\''",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.table.field.event.id": "id",
"transforms.outbox.table.field.event.key": "host_id",
"transforms.outbox.table.field.event.type": "event_type",
"transforms.outbox.table.field.event.timestamp": "event_ts",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.table.field.event.metadata": "metadata",
"transforms.outbox.table.field.aggregate.type": "aggregate_type",
"transforms.outbox.table.field.aggregate.id": "aggregate_id",
"transforms.extractPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
"transforms.extractPayload.field": "payload",
"transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field": "host_id",
"transforms.final_route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.final_route.regex": "portal-event\\.public\\.outbox_message_t",
"transforms.final_route.replacement": "portal-event"
}
}
'
The following are the commands to check the connector status and config:
# Check connector status
curl http://localhost:8083/connectors/outbox-connector/status
# Check connector config
curl http://localhost:8083/connectors/outbox-connector/config
Important Notes on Debezium Transforms:
- EventRouter Transform: This is a specialized Debezium SMT (Single Message Transform) designed specifically for the Transactional Outbox pattern.
  - It expects id, aggregate_id, aggregate_type, event_type, timestamp, payload, and metadata fields in your outbox_messages table.
  - It automatically wraps the payload into the Kafka message value and sets the Kafka key based on aggregate_id by default (the configuration above maps the key to host_id instead).
  - It can route to specific topics (e.g., outbox.Customer, outbox.Order) based on aggregate_type.
  - It filters out DELETE operations on the outbox_messages table (which is what your clean-up process would do, if you had one).
- CDC (Debezium) only processes INSERTs: When you insert a row into outbox_messages, Debezium picks it up. After it’s published, you can (optionally) have a separate, idempotent cleanup job or a Debezium signal that deletes the record from outbox_messages. Debezium will then capture this DELETE event, but the EventRouter transform will typically filter it out, preventing re-publishing.
How to test the Java Producer Service
- Run your Spring Boot application.
- Use a tool like curl or Postman to send a POST request:
  curl -X POST http://localhost:8080/customers/profile \
    -H "Content-Type: application/json" \
    -d '{ "customerId": "customer-abc-123", "newName": "Alice Smith", "newAddress": "123 Main St, Anytown" }'
- Check your database events_store_t and outbox_messages tables. You should see entries for CustomerNameChanged and CustomerAddressChanged in both, all committed atomically.
Key Benefits of this Setup
- Guaranteed Event Persistence: Events are first stored in your durable events_store_t and outbox_messages tables within a single, local, ACID transaction. This means if your application crashes before the event is published to Kafka, it’s still safe in your database and will be picked up by Debezium later.
- Decoupling: Your core business logic (in CustomerApplicationService) doesn’t directly interact with Kafka. It only interacts with the database. This makes your service more resilient to Kafka outages.
- Simplified Retries: Debezium and Kafka Connect handle the complexities of retrying Kafka publication.
- Single Source of Truth: Your events_store_t remains the authoritative event log for replay and aggregate reconstruction.
- Scalability: You can scale your application service and Debezium independently.
This pattern is a fundamental building block for highly reliable, event-driven microservices.
Multiple Topics
This is a classic scenario in event-driven architectures: an event needs to trigger processing in multiple downstream systems. The key is maintaining atomicity and understanding transaction boundaries.
Given your setup where:
- ScheduleCreatedEvent originates from your service’s outbox.
- Debezium pushes it to portal-event.
- Your PortalEventConsumer reads from portal-event and performs database updates (like notification_t).
- The same event also needs to be processed by the Schedule Kafka Streams application.
- All operations related to processing this event should ideally be atomic.
Understanding the Transactional Challenge
Your PortalEventConsumer has a well-defined transactional boundary:
[Start DB Tx] -> [DB Updates (e.g., notification_t)] -> [DB Commit] -> [Kafka Consumer Offset Commit]
You want to add “push to light-schedule” into this atomic unit.
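The boundary above commits the database before the Kafka offset, which gives at-least-once processing: a crash between the two commits causes the same event to be redelivered. The following is a minimal in-memory sketch of that behaviour, not the actual PortalEventConsumer. The class name, the counters, and the set of processed event ids are all hypothetical stand-ins for the real tables and consumer.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory stand-in for a consumer that commits the DB before the Kafka offset. */
final class CommitOrderSketch {
    // Hypothetical stand-in for a dedupe table keyed by event id.
    final Set<String> processedEventIds = new HashSet<>();
    // Hypothetical stand-in for the number of rows in notification_t.
    int notificationRows = 0;
    // Hypothetical stand-in for the consumer group's committed offset.
    int committedOffset = 0;

    /** Processes events starting at committedOffset; can simulate a crash after the DB commit. */
    void poll(List<String> eventIds, boolean crashBeforeOffsetCommit) {
        for (int i = committedOffset; i < eventIds.size(); i++) {
            String eventId = eventIds.get(i);
            // [Start DB Tx] -> [DB Updates] -> [DB Commit]
            // The add() check makes the DB update idempotent for redelivered events.
            if (processedEventIds.add(eventId)) {
                notificationRows++;
            }
            if (crashBeforeOffsetCommit) {
                return; // offset was never committed, so this event will be read again
            }
            // [Kafka Consumer Offset Commit]
            committedOffset = i + 1;
        }
    }
}
```

Without the idempotency check, the redelivered event would be written to the table twice, which is why any extra side effect added to this unit needs the same protection.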
Options for Pushing to light-schedule
Let’s evaluate the best places:
1. NOT Recommended: Direct Kafka Producer Send within PortalEventConsumer’s DB Transaction.
- Approach: Inside the PortalEventConsumer loop, after processing ScheduleCreatedEvent and before conn.commit(), instantiate a Kafka Producer and producer.send() the event to light-schedule.
- Problem: This is incredibly difficult to make truly atomic across all three resources (source Kafka topic portal-event offset, your database transaction, AND the target Kafka topic light-schedule).
  - If producer.send() to light-schedule fails after conn.commit() but before consumer.commitSync(), you have an inconsistent state: notification_t is updated, but light-schedule didn’t get the event. The consumer will re-process, leading to duplicates in notification_t (which requires idempotency) and potential duplicates to light-schedule.
  - Managing Kafka Producer transactions nested within a JDBC transaction is not standard and adds immense complexity.
2. Recommended for Robustness (but more infrastructure): A Secondary Outbox Table.
- Approach:
  - When PortalEventConsumer processes ScheduleCreatedEvent from portal-event, it updates notification_t (and any other DB projections) in its current DB transaction.
  - Within the same DB transaction, it also inserts a record (representing the ScheduleCreatedEvent for light-schedule) into a new, dedicated outbox table (e.g., schedule_events_outbox_t).
  - A second Debezium connector (or a polling publisher) then monitors schedule_events_outbox_t and pushes events to the light-schedule topic.
- Benefits:
  - True Atomicity: The event lands in notification_t AND is queued for light-schedule publishing, all within the PortalEventConsumer’s single DB transaction. This is guaranteed.
  - High Reliability: Leverages the proven Transactional Outbox pattern again.
- Drawbacks:
  - Adds another outbox table to manage.
  - Requires another Debezium connector instance.
  - More operational overhead.
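The secondary outbox approach can be sketched with plain JDBC as follows. This is a sketch under stated assumptions, not the portal’s actual code: the class name, the method name, and the column names of notification_t and schedule_events_outbox_t are hypothetical, and the Connection is assumed to be supplied by the consumer.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Sketch: write the projection row and the secondary outbox row in one JDBC transaction. */
final class SecondaryOutboxSketch {
    // Column names below are hypothetical; adapt them to the real schema.
    static final String INSERT_NOTIFICATION =
        "INSERT INTO notification_t (event_id, payload) VALUES (?, ?)";
    static final String INSERT_SCHEDULE_OUTBOX =
        "INSERT INTO schedule_events_outbox_t (event_id, event_type, payload) VALUES (?, ?, ?)";

    static void handleScheduleCreated(Connection conn, String eventId, String payload)
            throws SQLException {
        conn.setAutoCommit(false); // both inserts share one transaction
        try (PreparedStatement notification = conn.prepareStatement(INSERT_NOTIFICATION);
             PreparedStatement outbox = conn.prepareStatement(INSERT_SCHEDULE_OUTBOX)) {
            notification.setString(1, eventId);
            notification.setString(2, payload);
            notification.executeUpdate();

            outbox.setString(1, eventId);
            outbox.setString(2, "ScheduleCreatedEvent");
            outbox.setString(3, payload);
            outbox.executeUpdate();

            conn.commit(); // both rows become visible together
        } catch (SQLException e) {
            conn.rollback(); // neither row is written
            throw e;
        }
    }
}
```

The Kafka offset commit would still happen after this method returns, so the inserts should be idempotent (for example, by making event_id unique) in case the event is redelivered.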
3. Most Recommended for Simplicity & Kafka Streams Integration: A Separate Kafka Streams Application.
- Approach:
  - Your PortalEventConsumer continues to subscribe to portal-event and performs its database updates to notification_t (and other projections) as it currently does. It remains the sink for all events from portal-event into your relational database.
  - Create a separate, dedicated Kafka Streams application whose sole purpose is to process scheduling events.
  - This Kafka Streams application subscribes directly to the portal-event topic.
  - It uses the Kafka Streams DSL to filter for ScheduleCreatedEvents.
- Benefits:
  - Clean Separation of Concerns: Your PortalEventConsumer is a database sink. Your Kafka Streams app is a stream processor.
  - Kafka Streams EOS (Exactly-Once Semantics): Kafka Streams natively handles the transactional guarantees (atomically consuming from portal-event and processing the schedule events).
  - Simpler Code: No complex producer/consumer/DB transaction coordination in one app.
  - Scalability: Each application can scale independently.
- Drawbacks:
  - Adds another logical application to deploy and manage.
Best Place to Push to light-schedule:
For your setup, the Separate Kafka Streams Application (Option 3) is generally the best approach.
- Your PortalEventConsumer’s role: It acts as a generic projection builder into your relational database, consuming all events from portal-event and updating notification_t (and any other necessary read models). This ensures a full audit and visibility for all processed events in your DB.
- The new Kafka Streams app’s role: It acts as a specialized router and processor for ScheduleCreatedEvents specifically, forwarding them to the appropriate Kafka Streams pipeline (light-schedule).
This maintains a clean, decoupled architecture where each component has a clear responsibility and leverages Kafka’s native stream processing capabilities for atomic Kafka-to-Kafka operations.
Database Concurrency
Multiple users updating the same aggregate is a classic concurrency problem in multi-user applications, often referred to as the “lost update” problem. In an Event Sourcing system, preventing this overwrite is crucial because the sequence of events defines the state.
The standard and most effective way to prevent concurrent updates from overwriting each other in an Event Sourcing system is through Optimistic Concurrency Control (OCC), specifically using version numbers (or sequence numbers) at the aggregate level.
How Optimistic Concurrency Control (OCC) Works in Event Sourcing
1. Version Tracking (Sequence Number):
  - Every Aggregate (e.g., a Customer, an Order, a Product) has a version, which is typically its current sequence number in the event stream. This sequence number represents the number of events that have been applied to build its current state.
  - Your events_store_t table already has sequence_number for this purpose:
    CREATE TABLE events_store_t (
      id UUID PRIMARY KEY,
      aggregate_id VARCHAR(255) NOT NULL,
      -- ... other fields ...
      sequence_number BIGINT NOT NULL, -- This is the key!
      UNIQUE (aggregate_id, sequence_number) -- CRITICAL constraint!
    );
    The UNIQUE (aggregate_id, sequence_number) constraint is the fundamental database-level guarantee against concurrent writes for the same aggregate at the same version.
2. Load the Aggregate’s Current Version:
  - When your application service wants to modify an aggregate, it first loads the aggregate’s current state by replaying all events for that aggregate_id from the events_store_t.
  - During this replay, it tracks the currentSequenceNumber (the sequence number of the last event applied).
3. Pass Expected Version with Command:
  - The user interface (UI) or the client application that initiated the change should also hold the currentSequenceNumber it observed when it last fetched the aggregate’s state.
  - This expectedVersion (or expectedSequenceNumber) is then sent along with the command (e.g., UpdateCustomerProfileCommand(customerId, newName, newAddress, expectedSequenceNumber)).
4. Conditional Event Appending:
  - When your CustomerApplicationService receives the command:
    - It loads the Customer aggregate from the events_store_t, determining its actual currentSequenceNumber.
    - It compares the command.expectedSequenceNumber with the customer.actualCurrentSequenceNumber (derived from the Event Store).
    - If command.expectedSequenceNumber does NOT match customer.actualCurrentSequenceNumber: This means another concurrent transaction has already written new events for this aggregate since the client loaded its state. A ConcurrencyException (or similar domain-specific exception) is thrown.
    - If they DO match: The aggregate’s business logic is applied, generating new events. These new events will have customer.actualCurrentSequenceNumber + 1, customer.actualCurrentSequenceNumber + 2, etc.
5. Atomic Persistence (The DB Constraint):
  - The new events are then attempted to be saved to events_store_t (and outbox_messages) within a single database transaction.
  - If a concurrency conflict was not detected at step 4 (meaning two commands arrived almost simultaneously and passed the initial check), the UNIQUE (aggregate_id, sequence_number) constraint in the events_store_t table will prevent the “lost update.” Only the first transaction to successfully insert events with the “next” sequence numbers will succeed. The second will fail with a DataIntegrityViolationException (or similar).
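The role of the UNIQUE (aggregate_id, sequence_number) constraint can be illustrated with a small in-memory stand-in for events_store_t. This is only a sketch: the class, the Key record, and the method names are hypothetical, and a ConcurrentHashMap plays the part of the database constraint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for events_store_t with UNIQUE (aggregate_id, sequence_number). */
final class InMemoryEventStore {
    /** Hypothetical composite key mirroring the unique constraint. */
    record Key(String aggregateId, long sequenceNumber) {}

    private final Map<Key, String> rows = new ConcurrentHashMap<>();

    /** Returns true if the event was stored, false if that sequence slot is already taken. */
    boolean append(String aggregateId, long sequenceNumber, String payload) {
        // putIfAbsent is atomic, so only one of two racing writers can claim the slot.
        return rows.putIfAbsent(new Key(aggregateId, sequenceNumber), payload) == null;
    }

    /** Highest sequence number stored for the aggregate, or 0 if it has no events. */
    long currentVersion(String aggregateId) {
        return rows.keySet().stream()
            .filter(key -> key.aggregateId().equals(aggregateId))
            .mapToLong(Key::sequenceNumber)
            .max()
            .orElse(0L);
    }
}
```

In the real system the failed insert surfaces as an exception and rolls back the whole transaction, including the outbox row, rather than returning false.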
Example Flow:
- User A fetches Customer-123. The current state (replayed from events_store_t) shows sequenceNumber = 5.
- User B also fetches Customer-123. It also sees sequenceNumber = 5.
- User A sends UpdateCustomerProfileCommand(customerId="123", newName="Alice", expectedSequenceNumber=5).
  - App Service loads Customer-123, actual sequenceNumber = 5. Matches expectedSequenceNumber.
  - Generates CustomerNameChanged event with sequenceNumber = 6.
  - Attempts to save event(s) to events_store_t (and outbox_messages). Succeeds.
- User B sends UpdateCustomerProfileCommand(customerId="123", newAddress="456 Oak", expectedSequenceNumber=5).
  - App Service loads Customer-123. It now replays events up to sequenceNumber = 6. So, actualSequenceNumber = 6.
  - It compares command.expectedSequenceNumber=5 with customer.actualSequenceNumber=6. They do NOT match!
  - The CustomerApplicationService throws a ConcurrencyException.
  - The transaction is rolled back, and no events are written from User B’s command.
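The flow above can be replayed with a small in-memory sketch. It is deliberately simpler than the service shown later in this section: the class name and the handle method are hypothetical, and the nested ConcurrencyException only mirrors the name used in this document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the expected-version check; not the service class from this document. */
final class OccFlowSketch {
    static final class ConcurrencyException extends RuntimeException {
        ConcurrencyException(String message) {
            super(message);
        }
    }

    // aggregateId -> ordered event types; the list size doubles as the current sequence number
    private final Map<String, List<String>> eventsByAggregate = new HashMap<>();

    synchronized long currentVersion(String aggregateId) {
        return eventsByAggregate.getOrDefault(aggregateId, List.of()).size();
    }

    /** Appends one event if expectedVersion is current; returns the new sequence number. */
    synchronized long handle(String aggregateId, String eventType, long expectedVersion) {
        long actualVersion = currentVersion(aggregateId);
        if (actualVersion != expectedVersion) {
            throw new ConcurrencyException(
                "Expected version " + expectedVersion + " but found " + actualVersion);
        }
        eventsByAggregate.computeIfAbsent(aggregateId, id -> new ArrayList<>()).add(eventType);
        return actualVersion + 1;
    }

    public static void main(String[] args) {
        OccFlowSketch store = new OccFlowSketch();
        for (long version = 0; version < 5; version++) {
            store.handle("123", "Seed", version); // bring Customer-123 to sequence 5
        }
        System.out.println("User A wrote sequence " + store.handle("123", "CustomerNameChanged", 5));
        try {
            store.handle("123", "CustomerAddressChanged", 5); // User B still expects 5
        } catch (ConcurrencyException e) {
            System.out.println("User B rejected: " + e.getMessage());
        }
    }
}
```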
Java Implementation Changes
Let’s modify the previous CustomerApplicationService and add a way to load the aggregate from events.
1. Customer Aggregate (Revised)
// domain/Customer.java (Revised)
package com.example.eventoutbox.domain;
import com.example.eventoutbox.domain.events.CustomerAddressChanged;
import com.example.eventoutbox.domain.events.CustomerNameChanged;
import com.example.eventoutbox.domain.events.DomainEvent;
import lombok.Getter;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
@Getter
public class Customer {
private final String customerId;
private String name;
private String address;
private long version; // This is the 'sequenceNumber' of the LAST applied event
private final List<DomainEvent> uncommittedEvents = new ArrayList<>();
// Constructor for creating a new aggregate
public Customer(String customerId) {
this.customerId = customerId;
this.version = 0; // New aggregates start at version 0
}
// Static factory method to load an aggregate from its events
public static Customer loadFromEvents(String customerId, List<DomainEvent> history) {
Customer customer = new Customer(customerId);
history.forEach(customer::applyEvent); // Apply each historical event
return customer;
}
// Method to apply an event to the aggregate's state
private void applyEvent(DomainEvent event) {
// This is where you would update the aggregate's internal state
// based on the specific event type.
if (event instanceof CustomerNameChanged nameChanged) {
this.name = nameChanged.getNewName();
} else if (event instanceof CustomerAddressChanged addressChanged) {
this.address = addressChanged.getNewAddress();
}
this.version = event.getSequenceNumber(); // Update version to the sequence number of the applied event
}
// Domain behavior methods that generate new events
public void changeName(String newName) {
if (!newName.equals(this.name)) {
// New events get the *next* sequence number
long nextSequence = this.version + 1;
CustomerNameChanged event = new CustomerNameChanged(UUID.randomUUID(), Instant.now(), customerId, nextSequence, newName);
uncommittedEvents.add(event);
applyEvent(event); // Apply immediately to current state for consistency
}
}
public void changeAddress(String newAddress) {
if (!newAddress.equals(this.address)) {
long nextSequence = this.version + 1;
CustomerAddressChanged event = new CustomerAddressChanged(UUID.randomUUID(), Instant.now(), customerId, nextSequence, newAddress);
uncommittedEvents.add(event);
applyEvent(event);
}
}
public void markEventsCommitted() {
this.uncommittedEvents.clear();
}
}
2. ConcurrencyException
// domain/ConcurrencyException.java
package com.example.eventoutbox.domain;
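public class ConcurrencyException extends RuntimeException {
public ConcurrencyException(String message) {
super(message);
}
// Used when wrapping a lower-level failure such as a unique-constraint violation
public ConcurrencyException(String message, Throwable cause) {
super(message, cause);
}
}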
3. CustomerApplicationService (Revised)
// application/CustomerApplicationService.java (Revised)
package com.example.eventoutbox.application;
import com.example.eventoutbox.domain.ConcurrencyException;
import com.example.eventoutbox.domain.Customer;
import com.example.eventoutbox.domain.events.DomainEvent;
import com.example.eventoutbox.infrastructure.persistence.eventstore.EventStoreEvent;
import com.example.eventoutbox.infrastructure.persistence.eventstore.EventStoreEventRepository;
import com.example.eventoutbox.infrastructure.persistence.outbox.OutboxMessage;
import com.example.eventoutbox.infrastructure.persistence.outbox.OutboxMessageRepository;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import org.springframework.dao.DataIntegrityViolationException;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.io.IOException;
import java.util.List;
import java.util.Optional;
import java.util.UUID;
import java.util.stream.Collectors;
@Service
@RequiredArgsConstructor
public class CustomerApplicationService {
private final OutboxMessageRepository outboxMessageRepository;
private final EventStoreEventRepository eventStoreEventRepository;
private final ObjectMapper objectMapper;
// Command now includes expectedVersion
public record UpdateCustomerProfileCommand(String customerId, String newName, String newAddress, long expectedVersion) {}
@Transactional
public void updateCustomerProfile(UpdateCustomerProfileCommand command) {
// --- 1. Load Aggregate State ---
List<EventStoreEvent> historicalEvents = eventStoreEventRepository.findByAggregateIdOrderBySequenceNumberAsc(command.customerId());
Customer customer;
if (historicalEvents.isEmpty()) {
customer = new Customer(command.customerId());
// If it's a new aggregate, expectedVersion must be 0
if (command.expectedVersion() != 0) {
throw new ConcurrencyException("Customer with ID " + command.customerId() + " does not exist or expected version is incorrect.");
}
} else {
// Deserialize historical events to DomainEvent objects
List<DomainEvent> domainEventsHistory = historicalEvents.stream()
.map(this::deserializeEventStoreEvent)
.collect(Collectors.toList());
customer = Customer.loadFromEvents(command.customerId(), domainEventsHistory);
// --- 2. OPTIMISTIC CONCURRENCY CHECK ---
if (customer.getVersion() != command.expectedVersion()) {
throw new ConcurrencyException(
"Customer with ID " + command.customerId() + " has been updated by another user. " +
"Expected version " + command.expectedVersion() + " but found " + customer.getVersion() + "."
);
}
}
// --- 3. Apply Business Logic & Generate Events ---
if (command.newName() != null) {
customer.changeName(command.newName());
}
if (command.newAddress() != null) {
customer.changeAddress(command.newAddress());
}
// --- 4. Persist Events to Event Store & Outbox (Atomically) ---
List<DomainEvent> eventsToStore = customer.getUncommittedEvents();
if (eventsToStore.isEmpty()) {
return; // No changes, no events to publish
}
try {
List<EventStoreEvent> eventStoreEntities = eventsToStore.stream()
.map(this::mapToEventStoreEvent)
.collect(Collectors.toList());
eventStoreEventRepository.saveAll(eventStoreEntities);
List<OutboxMessage> outboxMessages = eventsToStore.stream()
.map(this::mapToOutboxMessage)
.collect(Collectors.toList());
outboxMessageRepository.saveAll(outboxMessages);
customer.markEventsCommitted();
} catch (DataIntegrityViolationException e) {
// This catches the UNIQUE constraint violation on (aggregate_id, sequence_number)
// This means another transaction has just written to this aggregate.
// Note: JPA may defer the INSERT until flush/commit; use saveAllAndFlush(...) if the
// violation must surface inside this try block.
throw new ConcurrencyException(
"Another concurrent update detected for customer " + command.customerId() + ". " +
"Please refresh and try again.", e
);
}
}
// Helper methods for mapping/deserializing (similar to before)
private OutboxMessage mapToOutboxMessage(DomainEvent event) {
try {
return OutboxMessage.builder()
.id(event.getEventId())
.aggregateId(event.getAggregateId())
.aggregateType(event.getAggregateType())
.eventType(event.getEventType())
.timestamp(event.getTimestamp())
.payload(objectMapper.writeValueAsString(event))
.metadata(null)
.build();
} catch (JsonProcessingException e) {
throw new RuntimeException("Failed to serialize event to JSON: " + event.getEventId(), e);
}
}
private EventStoreEvent mapToEventStoreEvent(DomainEvent event) {
try {
return EventStoreEvent.builder()
.id(event.getEventId())
.aggregateId(event.getAggregateId())
.aggregateType(event.getAggregateType())
.eventType(event.getEventType())
.timestamp(event.getTimestamp())
.sequenceNumber(event.getSequenceNumber())
.payload(objectMapper.writeValueAsString(event))
.metadata(null)
.build();
} catch (JsonProcessingException e) {
throw new RuntimeException("Failed to serialize event to JSON: " + event.getEventId(), e);
}
}
private DomainEvent deserializeEventStoreEvent(EventStoreEvent eventStoreEvent) {
try {
// Assuming your event JSON includes the 'eventType' field for polymorphic deserialization
return objectMapper.readValue(eventStoreEvent.getPayload(), DomainEvent.class);
} catch (JsonProcessingException e) {
throw new RuntimeException("Failed to deserialize event: " + eventStoreEvent.getId(), e);
}
}
}
4. EventStoreEventRepository (Add find method)
// infrastructure/persistence/eventstore/EventStoreEventRepository.java (Revised)
package com.example.eventoutbox.infrastructure.persistence.eventstore;
import org.springframework.data.jpa.repository.JpaRepository;
import java.util.List;
import java.util.UUID;
public interface EventStoreEventRepository extends JpaRepository<EventStoreEvent, UUID> {
List<EventStoreEvent> findByAggregateIdOrderBySequenceNumberAsc(String aggregateId);
}
5. CustomerController (Handle Exception)
// application/CustomerController.java (Revised)
package com.example.eventoutbox.application;
import com.example.eventoutbox.domain.ConcurrencyException;
import lombok.RequiredArgsConstructor;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
@RestController
@RequestMapping("/customers")
@RequiredArgsConstructor
public class CustomerController {
private final CustomerApplicationService customerApplicationService;
public record UpdateCustomerProfileRequest(String customerId, String newName, String newAddress, long expectedVersion) {}
@PostMapping("/profile")
public ResponseEntity<String> updateCustomerProfile(@RequestBody UpdateCustomerProfileRequest request) {
CustomerApplicationService.UpdateCustomerProfileCommand command =
new CustomerApplicationService.UpdateCustomerProfileCommand(
request.customerId(), request.newName(), request.newAddress(), request.expectedVersion()
);
customerApplicationService.updateCustomerProfile(command);
return ResponseEntity.ok("Customer profile update command received and processed.");
}
@ExceptionHandler(ConcurrencyException.class)
public ResponseEntity<String> handleConcurrencyException(ConcurrencyException ex) {
return ResponseEntity.status(HttpStatus.CONFLICT).body(ex.getMessage());
}
}
How to Handle Concurrency Conflicts on the Client/UI Side:
When a ConcurrencyException is thrown (the controller above returns it to the client as HTTP 409 Conflict):
- Inform the User: Display a message like “This item has been updated by another user. Please refresh the page to see the latest changes and try your update again.”
- Retry (less common for user-facing, but possible for background jobs): For non-interactive or automated processes, you might implement a retry mechanism. This retry would need to:
  - Fetch the latest state of the aggregate from a read model.
  - Re-create the command based on the original intent and the newly fetched expected version.
  - Re-send the command.
  - This is typically only done if the change is “safe” to re-apply (e.g., adding an item, not changing a specific value).
By combining the version check in your application service with the UNIQUE constraint in your database, you create a robust optimistic concurrency control mechanism that prevents lost updates effectively.
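The retry steps for non-interactive callers can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: ConflictRetry, latestVersion, and sendCommand are hypothetical names, and ConcurrencyException here is a local stand-in for the domain exception shown above.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Local stand-in for the ConcurrencyException thrown by the write model.
class ConcurrencyException extends RuntimeException {
    ConcurrencyException(String message) {
        super(message);
    }
}

final class ConflictRetry {
    private ConflictRetry() {
    }

    /**
     * Sends a command, re-reading the expected version after each conflict.
     *
     * @param latestVersion hypothetical read-model lookup of the current aggregate version
     * @param sendCommand   hypothetical sender; rebuilds the command for the given expected
     *                      version and throws ConcurrencyException on a version conflict
     * @param maxAttempts   upper bound on attempts, must be at least 1
     * @return the number of attempts that were needed
     */
    static int sendWithRetry(LongSupplier latestVersion, LongConsumer sendCommand, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConcurrencyException lastConflict = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Step 1: fetch the latest version from the read model.
            long expectedVersion = latestVersion.getAsLong();
            try {
                // Steps 2 and 3: rebuild the command for that version and re-send it.
                sendCommand.accept(expectedVersion);
                return attempt;
            } catch (ConcurrencyException e) {
                // Another writer won; loop and try again with a fresh version.
                lastConflict = e;
            }
        }
        throw lastConflict;
    }
}
```

A bounded attempt count matters here: if the read model is stale rather than merely behind, every attempt conflicts, and the caller should surface the last exception instead of looping forever.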
What if an event consumer fails to apply an event to its read model
In this case, the read model becomes stale, and subsequent attempts to update based on that stale data will lead to conflicts.
Let’s break down the scenario and the robust solution.
The Problem Scenario (as you described)
- UI: Queries entity_t table (read model), gets Entity (aggregate_version = 5).
- User: Makes changes.
- UI: Sends UpdateCommand (..., expectedVersion = 5) to the write model.
- Write Model (Command Handler):
  - Loads aggregate from event_store_t. Let’s say its actualVersion is 5.
  - OCC Check: actualVersion (5) == expectedVersion (5). Success.
  - Generates Event (..., sequence_number = 6).
  - Persists Event (..., sequence_number = 6) to event_store_t and outbox_message_t in an ACID transaction. This commits version 6 to the event_store_t.
  - Debezium publishes this event to Kafka.
- Kafka Consumer (PortalEventConsumer):
  - Reads Event (..., sequence_number = 6, expectedVersion = 5).
  - Tries to update entity_t (your read model): UPDATE entity_t SET ..., aggregate_version = 6 WHERE entity_id = ? AND aggregate_version = 5.
  - FAILURE: An exception occurs in the database update (e.g., a network error, a constraint violation unrelated to aggregate_version, or the consumer’s JVM crashes).
  - Result: The entity_t table is NOT updated and remains at aggregate_version = 5. The event_store_t is at aggregate_version = 6. The read model is now stale.
- Next UI interaction:
  - UI queries entity_t again. It still gets Entity (aggregate_version = 5) because the read model is stale.
  - UI sends UpdateCommand (..., expectedVersion = 5).
- Write Model (Command Handler) - Second Attempt:
  - Loads aggregate from event_store_t. Its actualVersion is 6.
  - OCC Check: actualVersion (6) != expectedVersion (5). Conflict detected!
  - Result: The command handler throws a ConcurrencyException. It does NOT try to insert a new event into event_store_t with sequence_number=6 (because that would be a duplicate and would indeed fail on the unique constraint). It correctly rejects the command.
The specific symptom you mentioned (“new event insert into the event_store_t and it will fail because the aggregate version is used before”) should ideally not happen if the write model correctly detects OCC. The ConcurrencyException should prevent the duplicate event generation.
The core problem, then, is stale read models due to consumer processing failures, which then lead to ConcurrencyException at the write model.
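The scenario can be reproduced with a small in-memory model. This is only a sketch: StaleReadModelDemo and its methods are hypothetical, the event store and read model are reduced to two version counters, and the consumer failure is simulated with a flag.

```java
final class StaleReadModelDemo {
    /** Hypothetical stand-in for the write model's ConcurrencyException. */
    static final class VersionConflict extends RuntimeException {
        VersionConflict(long actual, long expected) {
            super("actualVersion " + actual + " != expectedVersion " + expected);
        }
    }

    private long eventStoreVersion; // latest sequence_number in event_store_t
    private long readModelVersion;  // aggregate_version in entity_t

    StaleReadModelDemo(long initialVersion) {
        this.eventStoreVersion = initialVersion;
        this.readModelVersion = initialVersion;
    }

    long readModelVersion() {
        return readModelVersion;
    }

    long eventStoreVersion() {
        return eventStoreVersion;
    }

    /** Write model: OCC check, then append. Returns the new sequence number. */
    long handleCommand(long expectedVersion) {
        if (eventStoreVersion != expectedVersion) {
            throw new VersionConflict(eventStoreVersion, expectedVersion);
        }
        eventStoreVersion++;
        return eventStoreVersion;
    }

    /**
     * Consumer: mirrors UPDATE ... WHERE aggregate_version = sequenceNumber - 1.
     * Returns true when the row was updated.
     */
    boolean applyToReadModel(long sequenceNumber, boolean simulateFailure) {
        if (simulateFailure) {
            return false; // e.g. network error or crash before the update commits
        }
        if (readModelVersion != sequenceNumber - 1) {
            return false; // zero rows affected
        }
        readModelVersion = sequenceNumber;
        return true;
    }
}
```

Starting at version 5, a command with expectedVersion 5 produces sequence number 6. If applying event 6 fails, the read model still reports 5, and the next command built from it is rejected with a conflict until event 6 is finally applied.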
The Solution: Robust Kafka Consumer Processing (Retry & DLQ)
The solution lies entirely within your Kafka Consumer’s (PortalEventConsumerStartupHook) error handling strategy.
Your most recent incremental code includes the processSingleEventWithRetries method with retry and DLQ logic. This is precisely the mechanism designed to handle this situation.
Here’s how it’s supposed to work and what you need to ensure is functioning correctly:
- Idempotency of Read Model Updates:
  - All your dbProvider.createXxx, updateXxx, deleteXxx methods (e.g., updateRole, deleteRole, createRole) must be idempotent in their database effects.
  - For UPDATE and DELETE, WHERE aggregate_version = expectedVersion makes them idempotent. If the update was already applied (or a newer version is present), 0 rows affected means no harm done (though it might still trigger a ConcurrencyException within the consumer’s dbProvider methods if you implement the record-not-found-vs-conflict check).
  - For INSERT, use INSERT ... ON CONFLICT (primary_key) DO UPDATE SET aggregate_version = excluded.aggregate_version, ... (UPSERT) if the “create” event might be re-delivered and you expect it to update an existing record (e.g., in a snapshot table). Otherwise, if it’s strictly a “create-only” and a duplicate PK is a bug, the SQLException for unique constraint violation is correct.
- Consumer’s Retry/DLQ Logic (The core fix): The processSingleEventWithRetries method is crucial.
  - Transient Errors:
    - If dbProvider.updateXxx (or any other part of processSingleEventWithRetries) throws a transient SQLException (e.g., connection timeout, deadlock), the currentRetry is incremented, and Thread.sleep occurs.
    - If maxRetries is not exhausted, processSingleEventWithRetries will return false.
    - The onCompletion loop will then break; (meaning it won’t commitSync() any offsets for this batch).
    - On the next readRecords call, the entire batch (including the transiently failed record) will be re-polled and re-processed. This relies on idempotency.
  - Permanent Errors:
    - If dbProvider.updateXxx throws a DbProvider.ConcurrencyException (meaning the read model’s version was stale, so the WHERE aggregate_version = expectedVersion update in the consumer failed with 0 rows, but the record did exist at a higher version) or an IllegalArgumentException (bad data) or a permanent SQLException (e.g., unique constraint violation on an INSERT where it shouldn’t happen, or foreign key constraint violation):
      - processSingleEventWithRetries will catch it and call handlePermanentFailure.
      - handlePermanentFailure sends the original Kafka record to the DLQ.
      - processSingleEventWithRetries then returns true (because the event has been “handled” by being DLQ’d).
      - onCompletion then does include this record’s offset in offsetsToCommit and proceeds to commitSync() for the batch.
    - Result: The consumer makes progress past this “poison pill.” The stale event in entity_t is not updated by this specific event, but the consumer doesn’t get stuck.
- How to Handle the Stale UI Problem: Once the consumer’s retry/DLQ is robust, the stale UI becomes a UX problem rather than a system consistency problem.
  - Producer’s ConcurrencyException is Key: When the UI sends UpdateCommand(..., expectedVersion = 5) and the event_store_t is already at version 6, the write model will throw ConcurrencyException. This is the correct behavior.
  - UI Response to ConcurrencyException:
    - The UI must catch this ConcurrencyException from the API.
    - It should display a clear message to the user: “This record has been updated by another user. Please refresh the page to see the latest changes and re-apply your modifications if desired.”
    - It should then trigger a refresh of the data from the read model, which will then fetch Entity (aggregate_version = 6) (assuming the consumer eventually caught up or the record was DLQ’d and a manual fix was applied).
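The retry-or-DLQ decision described above can be sketched in isolation. The real processSingleEventWithRetries body is not shown in this document, so everything below is an assumption: the class name, the TransientFailure marker, the per-event retry counter, and the in-memory dead-letter list are illustrative, and the Thread.sleep backoff is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetryOrDlqProcessor {
    /** Hypothetical marker for errors that may succeed on a later attempt. */
    static final class TransientFailure extends RuntimeException {
        TransientFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical read-model update; throws on failure. */
    interface ReadModelUpdater {
        void apply(String eventKey);
    }

    private final int maxRetries;
    private final Map<String, Integer> retryCounts = new HashMap<>();
    private final List<String> deadLetters = new ArrayList<>();

    RetryOrDlqProcessor(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Returns true when the offset may be committed (applied or dead-lettered),
     * false when the batch should be re-polled without committing.
     */
    boolean processSingleEventWithRetries(String eventKey, ReadModelUpdater updater) {
        try {
            updater.apply(eventKey);
            retryCounts.remove(eventKey);
            return true;
        } catch (TransientFailure e) {
            int attempts = retryCounts.merge(eventKey, 1, Integer::sum);
            if (attempts <= maxRetries) {
                return false; // retries remain: do not commit, re-poll the batch
            }
            retryCounts.remove(eventKey);
            deadLetters.add(eventKey + ":RetriesExhausted");
            return true;
        } catch (RuntimeException e) {
            // Permanent failure (conflict, bad data, constraint violation): DLQ and move on.
            deadLetters.add(eventKey + ":" + e.getClass().getSimpleName());
            return true;
        }
    }

    List<String> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

The return value is the only signal the batch loop needs: true means the record's offset is safe to commit, false means stop and let the next poll re-deliver the batch.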
Summary and What to Focus On:
- Ensure processSingleEventWithRetries is fully correct and deployed. This is your primary defense against the consumer getting stuck.
- Carefully verify the isTransientSqlError logic. It must accurately identify only recoverable database errors. Misclassifying permanent errors as transient will exhaust retries and then DLQ; misclassifying transient as permanent sends to DLQ too soon.
- Confirm handlePermanentFailure correctly sends to DLQ and logs.
- Confirm your dbProvider methods are truly idempotent for re-delivered messages. UPDATE ... WHERE aggregate_version = expectedVersion is idempotent. INSERTs are trickier.
- Educate your UI/API layer: They must handle the ConcurrencyException from the write model by forcing a refresh and user notification.
Your recent changes to PortalEventConsumerStartupHook already implement this retry/DLQ pattern. The next step is to ensure that code is compilable, deployed correctly, and that the dbProvider methods are fully integrated with the aggregate_version for idempotency and OCC checking as discussed in the previous detailed answer.
What if DLQ messages are not handled in a timely fashion
This is a critical operational risk with the Dead Letter Queue (DLQ) pattern: DLQ messages are “handled exceptions,” not “disappearing problems.” If they aren’t processed, they represent real, unapplied business facts that can lead to data inconsistencies and broken business processes over time.
If a message sits in the DLQ for 30 days, your read models will be stale, your UIs will report incorrect data, and downstream systems relying on that information will also be out of sync. This can severely damage data integrity and user trust.
The DLQ is a “Hospital” or “Quarantine Zone,” Not a “Graveyard”
It’s a place for messages that need human intervention or a specific, non-automated re-driving process. It’s not a place for messages to just die.
Strategies to Prevent DLQ Message Stagnation
To ensure DLQ messages are handled in a timely fashion, you need a robust DLQ management strategy that goes beyond just pushing messages to the topic.
1. Robust Monitoring & Alerting (Immediate Action)
- Metric: Count of messages in DLQ topics (kafka_topic_partition_current_offset, kafka_consumer_group_lag, or custom JMX metrics).
- Alerting Thresholds:
- Urgent: Alert immediately (PagerDuty, Slack, SMS) if the number of messages in any DLQ topic goes above 0 or a very small threshold (e.g., 5-10 messages). A DLQ is an exceptional queue.
- Warning: Alert if messages persist for a certain duration (e.g., 1 hour, 4 hours).
- Dashboards: Create a dashboard that prominently displays the number of messages in each DLQ topic and their age.
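The two alerting thresholds can be expressed as a small policy function. This is a sketch under assumptions: DlqAlertPolicy is a hypothetical class, and the message count and oldest-message age are presumed to come from whichever metrics source you already have.

```java
import java.time.Duration;

final class DlqAlertPolicy {
    enum Level { NONE, WARNING, URGENT }

    private final long urgentCount;    // alert urgently above this many messages
    private final Duration warningAge; // warn once the oldest message is at least this old

    DlqAlertPolicy(long urgentCount, Duration warningAge) {
        this.urgentCount = urgentCount;
        this.warningAge = warningAge;
    }

    Level evaluate(long messageCount, Duration oldestMessageAge) {
        if (messageCount > urgentCount) {
            return Level.URGENT;
        }
        if (messageCount > 0 && oldestMessageAge.compareTo(warningAge) >= 0) {
            return Level.WARNING;
        }
        return Level.NONE;
    }
}
```

Setting urgentCount to 0 gives the strictest variant, where any message in the DLQ pages someone immediately.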
2. Clear Ownership & Standard Operating Procedures (SOPs)
- Who owns the DLQ? Assign clear responsibility to a specific team (e.g., SRE, Development team for that microservice).
- What’s the process? Define a clear SOP for handling DLQ alerts:
- Acknowledge alert.
- Inspect the DLQ message content (payload, error message, original topic/offset).
- Identify the root cause (code bug, malformed data, transient external system outage, business process error).
- Decide on action:
- Fix Code/Data: If it’s a bug, deploy a fix. If it’s bad data, decide if it needs manual correction in the database or if upstream data entry needs fixing.
- Re-drive: After fixing the root cause, re-drive the message(s) back to the original topic.
- Discard (Rare & Documented): Only if the message is truly unrecoverable garbage or a test message that accidentally ended up there, and its impact is negligible. This decision must be audited and requires strong justification.
3. Automated DLQ Re-driving with Human Trigger (Operational Playbook)
- You’ll need a “re-driver” tool/application.
- Purpose: This tool reads messages from the DLQ, and publishes them back to their original topic for re-processing.
- Features:
- Preview: Show content of DLQ messages before re-driving.
- Selectivity: Allow re-driving specific messages, or ranges of messages.
- Filtering: Filter by error type, timestamp, etc.
- Audit: Log who re-drove what message.
- Integration:
- Could be a simple command-line tool.
- Could be integrated into your internal developer portal or ops dashboard.
- Could be a scheduled job that runs periodically but requires explicit human approval before actually publishing.
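The core of such a re-driver (preview, filtering, and audit) fits in a few lines. This is a minimal sketch assuming an in-memory list in place of the DLQ topic; DlqMessage, DlqRedriver, and the publisher callback are hypothetical names, and a real tool would read from and publish to Kafka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical view of a dead-lettered record and its failure metadata.
record DlqMessage(String originalTopic, long offset, String errorType, String payload) {
}

final class DlqRedriver {
    private final List<String> auditLog = new ArrayList<>();

    /** Preview: returns the messages a re-drive with this filter would publish. */
    List<DlqMessage> preview(List<DlqMessage> dlq, Predicate<DlqMessage> filter) {
        return dlq.stream().filter(filter).toList();
    }

    /**
     * Publishes the selected messages back to their original topic and records
     * who re-drove what. The publisher receives (topic, payload).
     */
    int redrive(String operator, List<DlqMessage> dlq, Predicate<DlqMessage> filter,
                BiConsumer<String, String> publisher) {
        List<DlqMessage> selected = preview(dlq, filter);
        for (DlqMessage message : selected) {
            publisher.accept(message.originalTopic(), message.payload());
            auditLog.add(operator + " re-drove offset " + message.offset()
                    + " to " + message.originalTopic());
        }
        return selected.size();
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

Because preview and redrive share the same filter, an operator can inspect exactly the set that will be published before approving it.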
4. Automated Retries (Beyond Initial Consumer)
For certain classes of “permanent-but-maybe-not-really” errors (e.g., external API rate limits, very long-running external process), you could have a separate, simpler consumer that specifically subscribes to the DLQ.
- Purpose: This DLQ consumer would only handle a very specific, narrow class of DLQ messages.
- Logic: It would apply its own retry logic (e.g., exponential backoff for a longer period, up to 24 hours).
- Re-DLQ: If this DLQ consumer also fails after its retries, it would push the message back to the same DLQ topic (or a different, truly “unresolvable” DLQ) to re-trigger human intervention.
- Caution: This adds complexity and should only be done for errors you’ve explicitly identified as potentially auto-recoverable over a very long time.
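The longer retry schedule for such a DLQ consumer can be computed with a capped exponential backoff. The sketch below is illustrative only: DlqBackoff and its parameters are hypothetical, and the 24-hour figure from the text is passed in as the retry window.

```java
import java.time.Duration;
import java.time.Instant;

final class DlqBackoff {
    private DlqBackoff() {
    }

    /** Delay before the given attempt (1-based): base, 2x base, 4x base, ... capped at max. */
    static Duration delayForAttempt(int attempt, Duration base, Duration max) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be at least 1");
        }
        Duration delay = base;
        // Stop doubling as soon as the cap is reached to avoid overflow.
        for (int i = 1; i < attempt && delay.compareTo(max) < 0; i++) {
            delay = delay.multipliedBy(2);
        }
        return delay.compareTo(max) > 0 ? max : delay;
    }

    /** True while the message is still inside its overall retry window (e.g. 24 hours). */
    static boolean withinRetryWindow(Instant firstFailure, Instant now, Duration window) {
        return Duration.between(firstFailure, now).compareTo(window) < 0;
    }
}
```

Once withinRetryWindow returns false, the message goes back to a DLQ for human attention instead of being retried again.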
5. Data Reconciliation (Long-Term Strategy)
- For critical business data, even with DLQs, sometimes manual reconciliation or a dedicated reconciliation service might be needed.
- This involves periodically comparing your read model state with the authoritative Event Store (or an authoritative external system) to detect and resolve discrepancies that might have slipped through.
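A reconciliation pass of this kind reduces to comparing versions per aggregate. The following sketch assumes both sides have already been loaded into maps of aggregate ID to version; ReadModelReconciler and Discrepancy are hypothetical names, and a missing read-model row is treated as version 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReadModelReconciler {
    record Discrepancy(String aggregateId, long eventStoreVersion, long readModelVersion) {
    }

    private ReadModelReconciler() {
    }

    /**
     * Reports every aggregate whose read-model version differs from the latest
     * sequence number in the event store, ordered by aggregate ID.
     */
    static List<Discrepancy> findDiscrepancies(Map<String, Long> eventStoreVersions,
                                               Map<String, Long> readModelVersions) {
        List<Discrepancy> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : new TreeMap<>(eventStoreVersions).entrySet()) {
            // Assumption: an aggregate missing from the read model counts as version 0.
            long readVersion = readModelVersions.getOrDefault(entry.getKey(), 0L);
            if (readVersion != entry.getValue()) {
                result.add(new Discrepancy(entry.getKey(), entry.getValue(), readVersion));
            }
        }
        return result;
    }
}
```

The resulting list is what a batch job would report or feed into a targeted replay of the affected aggregates.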
Addressing “Most errors cannot be recovered anyway”
If “most errors cannot be recovered anyway” and always lead to DLQ, it points to a need for:
- Better isTransientSqlError logic: Are you correctly identifying all transient SQL errors that could be retried by the main consumer?
- More robust validation upstream: Can you prevent more malformed/invalid data from even generating events that will fail downstream? (E.g., more validation in the command handler).
- Refined Domain Model/Business Rules: Sometimes, a “permanent” error in the consumer points to a business rule that isn’t fully handled by events, or a valid state transition that wasn’t expected.
DLQs are not a substitute for preventing errors, but they are essential for graceful failure and ensuring eventual consistency. Their effectiveness hinges entirely on the operational discipline and tooling around them.
How to handle a bug that takes weeks or even months to fix and deploy
This question reflects a major reality of enterprise software: deployment cycles can be slow, and manual intervention for DLQs doesn’t scale for a continuous stream of errors. This is precisely why a purely “human-in-the-loop” DLQ management strategy breaks down in a corporate context with persistent bugs.
To handle a bug that takes weeks/months to deploy, you need a strategy that focuses on containment, bypass, and eventually, reconciliation.
Strategy for Long-Deployment-Cycle Bugs (Weeks/Months)
The goal is to stop the errors from piling up indefinitely in the DLQ, and to allow the rest of the system to continue processing, even if some data is temporarily inconsistent.
1. Immediate Containment: Filtering or Skipping “Poison Pill” Messages
- Implement a “Hot Fix” Filter (Code-based or SMT-based):
  - In your Kafka Consumer (PortalEventConsumerStartupHook): If you identify a bug where a specific type of event (or event with specific data) consistently causes failures:
    - Add a temporary code filter. For instance, if ScheduleCreatedEvent with null userId is causing NullPointerException, add:
      if (eventType.equals(PortalConstants.SCHEDULE_CREATED_EVENT) && eventMap.get("userId") == null) {
          logger.warn("Skipping known bug event type {} for record {} due to null userId. Not processing.", eventType, record.offset());
          handlePermanentFailure(record, "Known bug: null userId for " + eventType, "KnownBugSkip");
          return true; // Mark as handled (DLQ'd), commit offset, move on.
      }
    - If the bug is in a specific dbProvider method: You can wrap that call in a try-catch for PermanentProcessingException specifically for that event type, and if it’s the known bug, send it to DLQ and commit.
  - Using Kafka Connect SMT (if source is Kafka Connect): You could implement a custom Filter SMT that drops/routes specific problematic messages before they even hit your consumer app. This requires deploying a new SMT, but it can be faster than an app deployment.
- Why: This immediately stops the DLQ from growing uncontrollably with known bad messages. It sacrifices processing that specific message but ensures the consumer stays healthy.
2. Automated (Limited) Re-driving for Transient/Known Issues (Or Triage)
- “Error Triage” Consumer: Instead of just sending to a single DLQ, consider a dedicated consumer that subscribes to your main DLQ topic.
- This consumer acts as an automated triage.
- It checks the errorType (from handlePermanentFailure’s metadata).
- If errorType is “TransientSqlError” or “RetriesExhausted” (but could eventually succeed): It re-publishes the original message back to the portal-event topic with an exponential backoff. It might implement its own max retries (e.g., 50 retries over 24 hours). If it still fails, then it pushes to a “Final DLQ” that truly requires manual intervention.
- If errorType is “ConcurrencyConflict”, “DataValidationError”, “UnhandledEventType”, or “KnownBugSkip”: It pushes to a separate “Permanent DLQ” topic. This queue is smaller and truly requires human eyes.
- Why: This handles messages that might eventually self-resolve or that you know can’t be fixed by immediate retries but aren’t necessarily “dead forever.” It reduces the volume of messages requiring immediate human attention.
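The triage rule itself is a small routing function over errorType. In this sketch the error type strings are the ones listed above, while DlqTriage, the Route enum, and the choice to send unknown error types to the Permanent DLQ are assumptions made for illustration.

```java
import java.util.Set;

final class DlqTriage {
    enum Route { RETRY_ORIGINAL_TOPIC, FINAL_DLQ, PERMANENT_DLQ }

    // Error types that might succeed if re-published later.
    private static final Set<String> RETRYABLE = Set.of("TransientSqlError", "RetriesExhausted");

    private DlqTriage() {
    }

    /**
     * Decides where a dead-lettered message goes next.
     *
     * @param errorType    value recorded by handlePermanentFailure
     * @param retriesSoFar how many times the triage consumer has re-published it
     * @param maxRetries   triage-level retry budget (e.g. 50)
     */
    static Route route(String errorType, int retriesSoFar, int maxRetries) {
        if (errorType != null && RETRYABLE.contains(errorType)) {
            return retriesSoFar < maxRetries ? Route.RETRY_ORIGINAL_TOPIC : Route.FINAL_DLQ;
        }
        // Conflicts, validation errors, known bugs, and anything unrecognised need human eyes.
        return Route.PERMANENT_DLQ;
    }
}
```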
3. Manual Intervention for “Permanent DLQ” / Complex Bugs (When Devs Get Involved)
- The “Permanent DLQ” is where true bugs/bad data sit.
- The same monitoring and alerting from before applies, but now it’s for a much smaller, higher-priority queue.
- Developers must actively:
- Analyze: What exactly caused this? Why did it bypass automated retries/filters?
- Fix: Develop and deploy the bug fix.
- Reconcile/Re-drive:
- If the bug fix resolves the issue, use a re-driver tool to re-submit messages from the Permanent DLQ to the portal-event topic.
- If the bug resulted in data inconsistencies that can’t be fixed by re-driving (e.g., a critical business state was violated), you might need to perform a manual database correction on the affected aggregate(s) (this is the most dangerous and should be avoided if possible).
4. Long-Term Data Reconciliation / Auditing
- Offline Reconciliation: For critical data, implement daily/weekly batch jobs that compare the state of your read model tables with the authoritative Event Store.
- If discrepancies are found, they are reported, and a reconciliation process is triggered (either manual or automated). This ensures that even if events were missed or misapplied, data consistency is eventually achieved.
- Event Replay (When all else fails): If a significant bug causes widespread data corruption or loss of consistency, the ultimate fallback is to:
- Deploy the bug fix.
- Stop the affected read model consumer.
- Clear the affected read model tables.
- Replay all historical events from the event_store_t (or long-retention Kafka topics) through the fixed consumer logic. This rebuilds the read model from scratch, reflecting the correct business logic. This is why Event Sourcing is so powerful.
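The replay step can be sketched as a fold over the stored events in sequence order. This is an illustration under assumptions, not the portal's code: ReadModelReplay, StoredEvent, and Row are hypothetical, each event is reduced to a name change, and the read model is an in-memory map that starts empty (the "clear the tables" step).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReadModelReplay {
    record StoredEvent(String aggregateId, long sequenceNumber, String name) {
    }

    record Row(String name, long aggregateVersion) {
    }

    private ReadModelReplay() {
    }

    /** Rebuilds the read model from scratch by applying events per aggregate in sequence order. */
    static Map<String, Row> rebuild(List<StoredEvent> history) {
        List<StoredEvent> ordered = new ArrayList<>(history);
        ordered.sort(Comparator.comparing(StoredEvent::aggregateId)
                .thenComparingLong(StoredEvent::sequenceNumber));
        Map<String, Row> readModel = new HashMap<>(); // cleared read model
        for (StoredEvent event : ordered) {
            // The latest event per aggregate wins and sets the row's version.
            readModel.put(event.aggregateId(), new Row(event.name(), event.sequenceNumber()));
        }
        return readModel;
    }
}
```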
Example Workflow with a Long-Deployment-Cycle Bug
- Bug Identified: ScheduleCreatedEvent creates a schedule, but due to a bug in the consumer’s dbProvider.createSchedule method, it tries to insert a duplicate primary key if scheduleId (aggregate ID) exists, and this causes a permanent error in the consumer.
- Immediate Containment (Filter/Bypass):
  - A hotfix is applied to the PortalEventConsumerStartupHook (or a dedicated filter SMT) to recognize ScheduleCreatedEvent where scheduleId already exists.
  - For such events, it calls handlePermanentFailure() to send the message to a portal-event-dlq-permanent topic (or a KnownBugDLQ). This prevents the main consumer from getting stuck.
- DLQ Accumulation & Monitoring: Messages related to this bug pile up in portal-event-dlq-permanent. Alerts are firing.
- Development Fix: The development team works on a fix for dbProvider.createSchedule (e.g., changing it to an UPSERT if a “create” event implies “idempotent create/update” or better handling of duplicate primary keys if it truly is an error). This takes weeks.
- Deployment: The fix is deployed.
- Re-driving: Operations team uses the re-driver tool to:
  - Read messages from portal-event-dlq-permanent.
  - Publish them back to the portal-event topic.
  - The now-fixed consumer processes them correctly.
This approach balances immediate operational stability with eventual consistency, acknowledging the realities of enterprise deployment cycles.
MSAL Light OAuth Integration
Token Exchange
We have a React Single Page Application (SPA) running within a corporate network using Azure AD for SSO. However, we want another OAuth 2.0 provider, Light-OAuth, to issue JWT tokens with additional custom claims for fine-grained authorization. What is the best way to do the integration?
This is a common and powerful architectural pattern used to separate authentication from authorization, allowing you to use a central identity provider (like Azure AD) for who the user is and a specialized service for what the user can do.
The goal is to use the trusted identity established by Microsoft to “mint” a new, enriched token from your second provider.
The best and most secure way to achieve this is by using the OAuth 2.0 Token Exchange Flow (RFC 8693). This process happens on your backend, ensuring that no sensitive client secrets or complex token logic is exposed in the React SPA.
Let’s break down the architecture and the steps involved.
The Recommended Architecture: Backend-Mediated Token Exchange
In this model, your React SPA’s responsibility remains simple: authenticate the user with Microsoft and get a token. It then sends this token to your own backend API. Your backend is the one that communicates with the second OAuth provider to exchange the token.
Here is a diagram of the flow:
+-----------+     +------------------+     +-----------------+
|           | (1) |                  | (2) |                 |
| React SPA |---->| Microsoft        |---->| React SPA       |
|           |     | Entra ID (MSAL)  |     | (has MS Token)  |
+-----------+     +------------------+     +-----------------+
                                                    |
                                                    | (3) Sends MS Token to Backend
                                                    v
+-----------+     +------------------+     +-----------------+
|           | (4) |                  | (5) |                 |
| Your      |---->| Second OAuth     |---->| Your            |
| Backend   |     | Provider         |     | Backend         |
| API       |     | (Token Exchange) |     | (has New Token) |
+-----------+     +------------------+     +-----------------+
                                                    |
                                                    | (6) Establishes session / returns new token
                                                    v
                                           +-----------------+
                                           | React SPA       |
                                           | (is now         |
                                           | authorized)     |
                                           +-----------------+
Why this is the best approach:
- Security: Your second provider’s client_id and client_secret are kept securely on your backend, never exposed to the browser.
- Separation of Concerns: The React app only worries about authenticating with Microsoft. Your backend handles the complex authorization logic.
- Flexibility: You can swap out the second authorization provider without changing your entire frontend authentication flow.
- Standard-Compliant: The Token Exchange grant type is an official standard designed for this exact “delegation” or “impersonation” scenario.
Step-by-Step Integration Guide
Here’s how to implement this pattern.
Step 1: Configure Your Second OAuth 2.0 Provider
This is the most critical setup step. You need to configure your authorization provider (e.g., Auth0, Okta, Duende IdentityServer, or a custom one) to:
- Trust Microsoft Entra ID (Azure AD) as an Identity Provider: It must be able to validate the signature of the token it receives from your backend. This usually involves pointing it to Microsoft’s OIDC discovery endpoint (https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration) to get the public signing keys.
- Enable the Token Exchange Grant Type: You’ll need to create a new “client” or “application” within this provider for your backend API. This client must be configured to use the urn:ietf:params:oauth:grant-type:token-exchange grant type.
- Define a User Mapping Strategy: The provider needs to know how to link the incoming Microsoft token to a user in its own database. A common practice is to map the oid (Object ID) or sub (Subject) claim from the Microsoft token to a user profile in the second provider. This is how it knows which fine-grained permissions (claims) to add.
- Define the Custom Claims: Configure the rules that add the additional claims to the new token when the exchange is successful. For example: “If the incoming user has oid ‘123-abc’, add the claims permissions: ['create:document', 'read:report'].”
Step 2: Update Your React SPA Logic
Your React app’s interaction with MSAL will remain largely the same, with one key difference in what you do after a successful login.
- Authenticate and Acquire a Token: Use MSAL as you normally would to log the user in and get an access token for your own backend API.
// msalConfig.js - Make sure you have a scope for your own backend API
export const msalConfig = {
  auth: { /* ... */ },
  cache: { /* ... */ },
};
export const loginRequest = {
  scopes: ["User.Read", "api://<your-backend-client-id>/access_as_user"]
};
- Call Your Backend: After getting the token, instead of using it to call various protected resources, you make a single call to a dedicated endpoint on your backend (e.g., /auth/ms/exchange) to initiate the session.
import { useMsal } from "@azure/msal-react";
import { loginRequest } from "./msalConfig";
function MyComponent() {
  const { instance, accounts } = useMsal();
  const handleLoginAndExchange = async () => {
    try {
      // 1. Get the MSAL token for our backend
      const response = await instance.acquireTokenSilent({
        ...loginRequest,
        account: accounts[0],
      });
      const microsoftAccessToken = response.accessToken;
      // 2. Send it to our backend for exchange
      const backendResponse = await fetch('/auth/ms/exchange', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${microsoftAccessToken}`,
          'Content-Type': 'application/json',
        },
      });
      if (!backendResponse.ok) {
        throw new Error('Token exchange failed');
      }
      // The backend will likely set a secure HttpOnly cookie,
      // so there might be nothing else to do here.
      // Or, it might return the new token to be stored in memory.
      const { newAccessToken } = await backendResponse.json();
      console.log("Received new, enriched token from our backend!");
      // Now use this newAccessToken for subsequent API calls
    } catch (error) {
      // Handle token acquisition or exchange errors
      console.error(error);
      if (error.name === "InteractionRequiredAuthError") {
        instance.acquireTokenPopup(loginRequest);
      }
    }
  };
  // ...
}
Step 3: Implement the Backend Token Exchange Endpoint
This is where the core logic resides. You’ll create an endpoint that receives the Microsoft token and exchanges it.
- Protect the Endpoint: Configure your backend to validate the Bearer token from Microsoft that it receives from your React app. This ensures only authenticated users from your SPA can trigger an exchange.
- Implement the Exchange Logic:
if (exchange.getRelativePath().equals(config.getExchangePath())) {
    // token exchange request handling.
    if(logger.isTraceEnabled()) logger.trace("MsalTokenExchangeHandler exchange is called.");
    String authHeader = exchange.getRequestHeaders().getFirst(Headers.AUTHORIZATION);
    if (authHeader == null || !authHeader.startsWith("Bearer ")) {
        setExchangeStatus(exchange, JWT_BEARER_TOKEN_MISSING);
        return;
    }
    String microsoftToken = authHeader.substring(7);
    // --- Validate the incoming Microsoft Token ---
    if(msalJwtVerifier == null) {
        // handle case where config failed to load
        throw new Exception("MsalJwtVerifier is not initialized.");
    }
    try {
        // We only need to verify it, we don't need the claims for much.
        // The second provider will do its own validation and claim mapping.
        // Set skipAudienceVerification to true if the 'aud' doesn't match this BFF's client ID.
        String reqPath = exchange.getRequestPath();
        msalJwtVerifier.verifyJwt(microsoftToken, msalSecurityConfig.isIgnoreJwtExpiry(), true, null, reqPath, null);
    } catch (InvalidJwtException e) {
        logger.error("Microsoft token validation failed.", e);
        setExchangeStatus(exchange, INVALID_AUTH_TOKEN, e.getMessage());
        return;
    }
    // --- Perform Token Exchange ---
    String csrf = UuidUtil.uuidToBase64(UuidUtil.getUUID());
    TokenExchangeRequest request = new TokenExchangeRequest();
    request.setSubjectToken(microsoftToken);
    request.setSubjectTokenType("urn:ietf:params:oauth:token-type:jwt");
    request.setCsrf(csrf); // The CSRF for the *new* token we are getting
    Result<TokenResponse> result = OauthHelper.getTokenResult(request);
    if (result.isFailure()) {
        logger.error("Token exchange failed with status: {}", result.getError());
        setExchangeStatus(exchange, TOKEN_EXCHANGE_FAILED, result.getError().getDescription());
        return;
    }
    // --- The setCookies logic is identical ---
    List<String> scopes = setCookies(exchange, result.getResult(), csrf);
    if(logger.isTraceEnabled()) logger.trace("scopes = {}", scopes);
    exchange.setStatusCode(StatusCodes.OK);
    exchange.getResponseHeaders().put(Headers.CONTENT_TYPE, "application/json");
    // Return the scopes in the response body
    Map<String, Object> rs = new HashMap<>();
    rs.put(SCOPES, scopes);
    exchange.getResponseSender().send(JsonMapper.toJson(rs));
} else if (exchange.getRelativePath().equals(config.getLogoutPath())) {
    // logout request handling, this is the same as StatelessAuthHandler to remove the cookies.
    if(logger.isTraceEnabled()) logger.trace("MsalTokenExchangeHandler logout is called.");
    removeCookies(exchange);
    exchange.endExchange();
} else {
    // This is the subsequent request handling after the token exchange. Here we verify the JWT in the cookies.
    if(logger.isTraceEnabled()) logger.trace("MsalTokenExchangeHandler is called for subsequent request.");
    String jwt = null;
    Cookie cookie = exchange.getRequestCookie(ACCESS_TOKEN);
    if(cookie != null) {
        jwt = cookie.getValue();
        // verify the jwt with the internal verifier, the token is from the light-oauth token exchange.
        JwtClaims claims = internalJwtVerifier.verifyJwt(jwt, securityConfig.isIgnoreJwtExpiry(), true);
        String jwtCsrf = claims.getStringClaimValue(Constants.CSRF);
        // get csrf token from the header. Return error if it doesn't exist.
        String headerCsrf = exchange.getRequestHeaders().getFirst(HttpStringConstants.CSRF_TOKEN);
        if(headerCsrf == null || headerCsrf.trim().length() == 0) {
            setExchangeStatus(exchange, CSRF_HEADER_MISSING);
            return;
        }
        // verify csrf from jwt token in httpOnly cookie
        if(jwtCsrf == null || jwtCsrf.trim().length() == 0) {
            setExchangeStatus(exchange, CSRF_TOKEN_MISSING_IN_JWT);
            return;
        }
        if(logger.isDebugEnabled()) logger.debug("headerCsrf = " + headerCsrf + " jwtCsrf = " + jwtCsrf);
        if(!headerCsrf.equals(jwtCsrf)) {
            setExchangeStatus(exchange, HEADER_CSRF_JWT_CSRF_NOT_MATCH, headerCsrf, jwtCsrf);
            return;
        }
        // renew the token 1.5 minutes before it is expired to keep the session if the user is still using it
        // regardless the refreshToken is long term remember me or not. The private message API access repeatedly
        // per minute will make the session continue until the browser tab is closed.
        if(claims.getExpirationTime().getValueInMillis() - System.currentTimeMillis() < 90000) {
            jwt = renewToken(exchange, exchange.getRequestCookie(REFRESH_TOKEN));
        }
    } else {
        // renew the token and set the cookies
        jwt = renewToken(exchange, exchange.getRequestCookie(REFRESH_TOKEN));
    }
    if(logger.isTraceEnabled()) logger.trace("jwt = " + jwt);
    if(jwt != null) exchange.getRequestHeaders().put(Headers.AUTHORIZATION, "Bearer " + jwt);
    // if there is no jwt and refresh token available in the cookies, the user not logged in or
    // the session is expired. Or the endpoint that is trying to access doesn't need a token
    // for example, in the light-portal command side, createUser doesn't need a token. let it go
    // to the service and an error will be back if the service does require a token.
    // don't call the next handler if the exchange is completed in renewToken when error occurs.
    if(!exchange.isComplete()) Handler.next(exchange, next);
}
What to Avoid: The Anti-Pattern
Do not try to perform two separate, chained OAuth flows in the frontend. This would involve:
- User logs in with MSAL.
- Your React app gets the MSAL token.
- Your React app then initiates a second redirect or popup flow with the other provider, trying to pass the MSAL token as a parameter.
This is a bad idea because:
- Terrible User Experience: It can lead to multiple redirects, popups, and a confusing login process.
- Security Risk: It increases the surface area for token handling in the browser and might require you to use less secure flows (like Implicit flow) on the second provider.
- Complexity: Managing the state of two independent authentication libraries and their tokens in a SPA is extremely difficult and error-prone.
Client Secret
The token exchange specification (RFC 8693) doesn’t require client_id and client_secret to be sent to the second OAuth 2.0 provider to exchange the token. However, it is highly recommended to pass the client_id and client_secret from the BFF to the second OAuth 2.0 provider. The subject token alone is not sufficient.
This is a critical security aspect of the Token Exchange flow. Let’s break down why.
The “Two Questions” Security Model
When your BFF makes the token exchange request, the second OAuth provider needs to answer two fundamental security questions:
- WHO IS THE USER? (Authentication of the Subject)
  - This question is answered by the subject_token (the Microsoft token).
  - The provider validates the token’s signature, issuer (iss), expiration (exp), and audience (aud) to confirm that it’s a legitimate token for a valid user from a trusted identity provider (Microsoft).
- WHO IS ASKING FOR THIS TOKEN? (Authentication of the Client)
  - This question is answered by the client_id and client_secret.
  - This is crucial. The provider needs to know which application is requesting to act on the user’s behalf. It’s not enough that the user is valid; the application making the request must also be a known, trusted, and authorized client.
Why the Subject Token Alone is a Security Risk
Imagine if only the subject_token were required. Any malicious actor or compromised service that managed to get a user’s Microsoft access token could then send it to your second OAuth provider and exchange it for a new token containing your fine-grained authorization claims. This would allow them to impersonate the user within your system completely.
By requiring the client_id and client_secret, you ensure that only your specific, trusted BFF application is allowed to perform this exchange. The client_secret is the proof that the request is coming from your backend and not some other application.
The Token Exchange Request Body
So, the POST request your MsalTokenExchangeHandler (the BFF) sends to your second provider’s token endpoint will be application/x-www-form-urlencoded and must look like this:
grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&client_id=YOUR_BFFS_CLIENT_ID_FOR_THE_SECOND_PROVIDER
&client_secret=YOUR_BFFS_CLIENT_SECRET
&subject_token=THE_MICROSOFT_ACCESS_TOKEN_FROM_THE_SPA
&subject_token_type=urn:ietf:params:oauth:token-type:access_token
&scope=permissions_for_the_new_token
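As a rough illustration of how that form body is assembled, the sketch below builds the same six parameters in TypeScript. The real exchange is performed by the Java MsalTokenExchangeHandler; the function name, parameter object, and sample values here are hypothetical and only show the encoding.

```typescript
// Hypothetical helper: the names and parameter shape are illustrative, not the BFF's actual API.
interface TokenExchangeParams {
  clientId: string;
  clientSecret: string;
  subjectToken: string;
  scope: string;
}

const GRANT_TYPE = 'urn:ietf:params:oauth:grant-type:token-exchange';
const ACCESS_TOKEN_TYPE = 'urn:ietf:params:oauth:token-type:access_token';

// Build an application/x-www-form-urlencoded body in the same parameter order as the example request.
function buildTokenExchangeBody(p: TokenExchangeParams): string {
  const pairs: Array<[string, string]> = [
    ['grant_type', GRANT_TYPE],
    ['client_id', p.clientId],
    ['client_secret', p.clientSecret],
    ['subject_token', p.subjectToken],
    ['subject_token_type', ACCESS_TOKEN_TYPE],
    ['scope', p.scope],
  ];
  // Percent-encode each key and value so reserved characters survive transport.
  return pairs
    .map(([key, value]) => encodeURIComponent(key) + '=' + encodeURIComponent(value))
    .join('&');
}

const exchangeBody = buildTokenExchangeBody({
  clientId: 'bff-client',
  clientSecret: 's3cret',
  subjectToken: 'header.payload.signature',
  scope: 'portal.r portal.w',
});
console.log(exchangeBody);
```

The colons in the URNs are percent-encoded on the wire; the provider decodes them back to the values shown in the request above.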
Configuration Checklist for your Second OAuth Provider
This means that on your second OAuth 2.0 provider, you must:
- Create a Client Registration: Create a new “Application” or “Client” specifically for your light-gateway BFF.
- Set Client Type: Configure this client as a Confidential Client (as opposed to a Public Client like a SPA), because it is capable of securely storing a secret.
- Generate Credentials: Generate a client_id and a client_secret for this BFF client.
- Enable Grant Type: Explicitly enable the urn:ietf:params:oauth:grant-type:token-exchange grant type for this specific client. Your provider’s security policy should only allow trusted, confidential clients to use this powerful grant type.
- Store Credentials Securely: Store the generated client_id and client_secret securely in your BFF’s configuration (e.g., in secret.yml or environment variables), where they are not exposed to the outside world.
In short:
- The subject_token proves who the user is.
- The client_id and client_secret prove who your BFF is.
Both are required for a secure delegation and token exchange process.
Token Verification
It is necessary to verify the jwt token from Azure AD on both BFF and light-oauth based on the “Zero Trust” principle. They perform the validation for different, but equally important, reasons.
Skipping the validation on the BFF, while technically possible, is a significant security anti-pattern. Let’s break down the distinct roles of each validation step.
1. The BFF’s Responsibility: “Am I Talking to a Legitimate Client?”
The validation performed by your MsalTokenExchangeHandler in the BFF serves as a gatekeeper for your own system. Its purpose is to protect the BFF itself and the downstream services it communicates with.
When the BFF validates the Microsoft token, it’s asking these questions:
- Is this token even real? (Signature validation).
- Is it from an identity provider I trust? (Checking the iss or “issuer” claim is from login.microsoftonline.com/...).
- Is this token actually meant for me? (This is CRITICAL). The BFF must check the aud or “audience” claim. The aud should be the Client ID of your BFF application. This prevents a token that was issued for another API (like the Microsoft Graph API) from being replayed against your BFF to trick it. This is a defense against the “confused deputy” problem.
- Has it expired? (Checking the exp or “expiration” claim).
Why this is crucial for the BFF:
- Fail Fast: You immediately reject invalid, expired, or improperly targeted tokens. This is a better user experience and saves system resources.
- Denial-of-Service (DoS) Protection: If you don’t validate, your BFF becomes a dumb proxy that forwards every piece of junk it receives to your second OAuth provider. An attacker could flood your BFF with garbage tokens, causing it to swamp your authorization server with useless validation and exchange requests, potentially taking it down.
- Security Boundary: The BFF is the first line of defense. It should never blindly trust any input it receives from the public internet, even from your own SPA.
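The claim checks described above can be sketched as a small pure function. This is a minimal illustration, not the BFF's implementation: it assumes the token signature has already been verified by a JWT library, and the type names, function name, and sample issuer and audience values are hypothetical.

```typescript
// Claims the BFF inspects after the signature has been verified elsewhere (assumption).
interface TokenClaims {
  iss: string;
  aud: string | string[];
  exp: number; // seconds since the Unix epoch
}

interface ClaimCheckResult {
  ok: boolean;
  reason?: string;
}

// Reject tokens from the wrong issuer, for the wrong audience, or past expiry.
function checkClaims(
  claims: TokenClaims,
  expectedIssuer: string,
  expectedAudience: string,
  nowSeconds: number,
): ClaimCheckResult {
  if (claims.iss !== expectedIssuer) {
    return { ok: false, reason: 'untrusted issuer' };
  }
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (audiences.indexOf(expectedAudience) === -1) {
    return { ok: false, reason: 'wrong audience' };
  }
  if (claims.exp <= nowSeconds) {
    return { ok: false, reason: 'expired' };
  }
  return { ok: true };
}

const nowSeconds = 1700000000;
const issuer = 'https://login.microsoftonline.com/example-tenant/v2.0'; // hypothetical tenant
const validClaims: TokenClaims = { iss: issuer, aud: 'bff-client-id', exp: nowSeconds + 60 };

console.log(checkClaims(validClaims, issuer, 'bff-client-id', nowSeconds));
console.log(checkClaims({ iss: issuer, aud: 'graph-api', exp: nowSeconds + 60 }, issuer, 'bff-client-id', nowSeconds));
```

The second call models a token issued for another API being replayed against the BFF; the audience check is what stops it.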
2. The Second OAuth Provider’s Responsibility: “Can I Issue a New Token for this Subject?”
The validation performed by the second OAuth provider is the authoritative act of delegation. It’s the ultimate source of truth for the new, enriched token. It cannot and must not trust that the BFF has already performed a valid check.
When the second OAuth provider receives the subject_token, it asks all the same questions as the BFF, but for its own security policy:
- Is this token real and from a trusted issuer? (Signature and iss validation). It must have Microsoft configured as a trusted external identity provider.
- Is this token meant for a client that is allowed to exchange it? (It might check the aud claim).
- Is the client making the request (the BFF) authorized to perform a token exchange? (This is validated via the client_id and client_secret you send in the request).
- How do I map this external user to an internal user? (This is the most important unique step). It will inspect the oid, sub, email, or another claim from the Microsoft token to find the corresponding user in its own database.
- What new claims should I issue for this user? Based on the mapped internal user, it will apply its authorization rules to mint the new token with fine-grained permissions.
Analogy: A High-Security Building
Think of it like this:
- The React SPA is you, the visitor.
- The Microsoft Token is your government-issued driver’s license.
- The BFF is the receptionist at the front desk of the building. They look at your driver’s license (BFF validation) to make sure it’s not expired and that your name is on the visitor list for that day (aud check). They protect the building from random people just walking in.
- The Second OAuth Provider is the guard in front of the secure vault on the 10th floor. When you get to the 10th floor, the guard doesn’t just say, “Oh, the receptionist let you in, so you must be fine.” No, they perform their own, more thorough check of your driver’s license (OAuth provider validation), check their specific access logs (user mapping), and then issue you a special keycard (new enriched token) that only opens the specific safety deposit box you’re allowed to access (fine-grained claims).
You wouldn’t want a security system where the vault guard blindly trusts the front desk. Each layer must perform its own validation.
Conclusion
Do not skip the validation on the BFF.
- Verify on the BFF to protect your own application, fail fast, and prevent it from becoming a DoS vector.
- The Second OAuth Provider MUST verify as its core function to securely map the user and issue an authoritative, enriched token.
The verifications are not redundant; they are a fundamental part of a layered, defense-in-depth security strategy.
Single Page Application
There are two endpoints that the SPA should access: one for token exchange and one for logout.
Login
After the single sign-on with Azure AD, send the ID token to the backend API endpoint “/auth/ms/exchange” with a GET request to establish the session. The token goes in the standard Authorization header as “Bearer IdToken”. You will receive a JSON response with a list of scopes that represent the access permissions. You can display them to the user for consent or simply ignore them. Along with the response body, cookies are set in the browser to establish the session. Once the login is done, the backend automatically renews the access token with the refresh token as long as the user keeps sending requests to the server.
Logout
To log out, sign out from Azure AD and then send a GET request to the backend API endpoint “/auth/ms/logout” to remove the session cookies.
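The two calls can be sketched as plain request descriptions that the SPA hands to fetch. This is a minimal sketch under assumptions: the RequestSpec type and both function names are hypothetical, and credentials: 'include' is assumed so that the session cookies travel with the request when the BFF is on a different origin.

```typescript
// Hypothetical description of a request; spread the fields into fetch(url, init) in the SPA.
interface RequestSpec {
  url: string;
  method: 'GET';
  headers: { [name: string]: string };
  credentials: 'include'; // assumption: lets the browser store and send the session cookies
}

// Token exchange: send the Azure AD ID token as a Bearer token.
function buildExchangeRequest(idToken: string): RequestSpec {
  return {
    url: '/auth/ms/exchange',
    method: 'GET',
    headers: { Authorization: 'Bearer ' + idToken },
    credentials: 'include',
  };
}

// Logout: no token is needed, the backend only removes the session cookies.
function buildLogoutRequest(): RequestSpec {
  return {
    url: '/auth/ms/logout',
    method: 'GET',
    headers: {},
    credentials: 'include',
  };
}

// In the SPA (not executed here):
//   const { url, ...init } = buildExchangeRequest(idToken);
//   const scopes = await (await fetch(url, init)).json();
const exchangeRequest = buildExchangeRequest('id-token-from-msal');
const logoutRequest = buildLogoutRequest();
console.log(exchangeRequest.url, logoutRequest.url);
```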
User Registration and Onboarding
In light-portal, user management is the foundation of the authentication and authorization to access different services.
There are two different approaches to create a user entry in the system: Internet and Corporation.
Internet User
This is for Internet users to register and verify via email to a cloud Light Portal instance.
The entry point is the createUser command handler in the user-command service.
Corporation User
This is for corporation users to onboard to a dedicated Intranet Light Portal instance.
The entry point is the onboardUser command handler in the user-command service.
User Password
In the user_t table, the password is nullable, and onboardUser doesn’t take a password because the authentication is done through Azure AD, ECIF, etc.
Optimistic vs Pessimistic UI
When you create, update, or delete an entity on the UI and refresh the list immediately, chances are the list doesn’t show the change yet. This is a classic challenge when working with systems that use Event Sourcing and CQRS (Command Query Responsibility Segregation).
- Command: Your deleteHost request is a Command. It’s sent to the write-model to change the state of the system and publish an event (e.g., HostDeletedEvent).
- Query: Your fetchData request is a Query. It reads from a separate read-model (the hosts database view/table).
- Eventual Consistency: There is a delay (usually milliseconds, but it can vary) between the command succeeding and the event consumer updating the read-model.
Your UI is so fast that it’s sending the Query before the read-model has been updated, leading to the stale data problem.
Should we wait a few seconds?
No, please do not use a setTimeout to wait. This is the most important takeaway. It’s an unreliable “magic number” that will cause problems:
- Bad UX: It forces the user to wait for an arbitrary amount of time, even if the system is fast.
- Unreliable: If the system is under heavy load, the delay might be longer than your timeout, and the bug will reappear.
- It’s a “code smell”: It indicates that the UI isn’t correctly handling the nature of the backend architecture.
The Professional Solutions
There are two primary, robust patterns for handling this on the UI. The best choice depends on the desired user experience.
Option 1: Optimistic UI (Recommended for Best UX)
This is the most common and user-friendly approach in modern web applications. You assume the command will succeed and update the UI immediately.
How it works:
- User clicks “Delete”.
- You immediately remove the item from your local React state. The user sees the item disappear instantly.
- You send the deleteHost command to the server in the background.
- Crucially: If the command fails for some reason (e.g., validation error, server down), you revert the UI change (add the item back) and show an error message.
This provides the best possible user experience because the UI feels instantaneous.
Here is how you would implement this in your handleDelete function:
// Delete handler - OPTIMISTIC UI APPROACH
const handleDelete = useCallback(async (row: MRT_Row<HostType>) => {
  if (!window.confirm(`Are you sure you want to delete host: ${row.original.subDomain}?`)) {
    return;
  }
  // Keep a copy of the current data in case we need to roll back
  const originalData = [...data];
  // 1. Optimistically update the UI
  setData(prevData => prevData.filter(host => host.hostId !== row.original.hostId));
  setRowCount(prev => prev - 1); // Also optimistically update the total count
  // 2. Send the command to the server
  const cmd = {
    host: 'lightapi.net',
    service: 'host',
    action: 'deleteHost',
    version: '0.1.0',
    data: { hostId: row.original.hostId, aggregateVersion: row.original.aggregateVersion },
  };
  try {
    const result = await apiPost({ url: '/portal/command', headers: {}, body: cmd });
    if (result.error) {
      // 3a. On failure, revert the UI and show an error
      console.error('API Error on delete:', result.error);
      alert('Failed to delete host. Please try again.'); // Or use a snackbar
      setData(originalData);
      // Undo the optimistic decrement; originalData.length may cover only the current page, not the total
      setRowCount(prev => prev + 1);
    }
    // 3b. On success, do nothing! The UI is already correct.
    // You could trigger a silent background refetch here if you want to be 100% in sync, but it's often not necessary.
  } catch (e) {
    // Also handle network errors
    console.error('Network Error on delete:', e);
    alert('Failed to delete host due to a network error.');
    setData(originalData);
    setRowCount(prev => prev + 1); // Undo the optimistic decrement
  }
}, [data]); // The main dependency is the 'data' for rollback.
Option 2: Pessimistic UI with State Locking (Simpler, Good UX)
This approach is more straightforward. You “lock” the UI in a loading state until you are certain the operation is complete.
How it works:
- User clicks “Delete”.
- You show a loading spinner on that specific row or disable the whole table.
- Send the deleteHost command.
- When the command API call returns a success, you then call fetchData() to get the fresh data. Because the command has completed, it’s much more likely the read model is now consistent. This is essentially what you were trying to do before.
The problem, as you noted, is that even after the command returns, the read model might still not be updated. The optimistic approach neatly sidesteps this entire timing issue, which is why it remains the superior and often easier pattern to implement correctly, even if you would otherwise prefer a pessimistic UI.
To fix your current pessimistic implementation, you’d have to implement polling, which is complex:
- Send delete command.
- On success, start a setInterval to call fetchData every 2 seconds.
- In each fetchData response, check if the deleted item is gone.
- If it is, clearInterval and stop.
- Add a timeout to stop polling after ~10-15 seconds to prevent infinite loops.
As you can see, this is much more complicated than the optimistic update.
Recommendation
Adopt the Optimistic UI pattern (Option 1). It provides the best user experience, is resilient to timing issues caused by eventual consistency, and the implementation is clean and modern. The code provided for the optimistic handleDelete is a drop-in replacement that will solve your problem robustly.
Should you poll periodically to sync other users’ changes?
The short answer is: No, you should still avoid client-side polling for this specific use case. It’s generally the wrong tool for this problem and creates more issues than it solves. The Optimistic UI approach is still preferable, but it needs to be combined with a robust backend and potentially other real-time technologies for a complete solution.
Let’s break down why and explore the professional-grade solutions.
Why Polling is a Bad Fit Here
Your concern is valid: polling does add significant pressure, and it’s inefficient.
- High Network Traffic: Every active user would be sending a getHost query every few seconds. If you have 50 users on that page polling every 2 to 5 seconds, that’s 10-25 queries per second just from this one component, most of which will return no new data.
- Database and Service Layer Load: This traffic directly translates to load on your service and database. Your SELECT query, while indexed, still consumes resources. At scale, this can become a significant performance bottleneck.
- Delayed UX: The user experience is still poor. A user makes a change and might have to wait up to X seconds (your polling interval) to see it reflected, which feels sluggish.
- Complexity: As we discussed, managing polling logic (starting, stopping, timeouts) on the client adds complexity and potential bugs.
So, while polling can eventually get you the latest data, it’s a brute-force approach with major drawbacks.
The Professional-Grade Solutions for Multi-User Environments
The key is to shift from a “pull” model (client polling) to a “push” model (server notifies the client). This is where real-time technologies shine.
Solution 1: Optimistic UI + Server-Sent Events (SSE) or WebSockets (Best for Real-Time)
This is the gold standard for collaborative applications.
How it Works:
- Frontend (Your Optimistic UI):
  - User A deletes a host. Their UI updates instantly (optimistic update). The deleteHost command is sent to the server.
  - User B is looking at the same list. Their screen is unchanged for now.
- Backend (The Magic):
  - The command handler processes the deleteHost command and publishes a HostDeletedEvent.
  - An Event Notifier Service listens for this event.
  - Upon receiving the event, this service pushes a notification to all connected clients who are interested in host updates. This is done via Server-Sent Events (SSE) or WebSockets. SSE is often simpler for server-to-client-only communication.
- Frontend (Receiving the Push):
  - User B’s browser receives the HostDeletedEvent push notification.
  - The React component’s event listener fires. It can do one of two things:
    - A) Smart Update (Ideal): The event payload contains the hostId that was deleted. The client simply finds that ID in its local data state and removes it. This is hyper-efficient.
    - B) Refetch (Simpler): Upon receiving any host-related event, the client triggers a fetchData() call to get the latest list. This is less efficient than a smart update but still vastly better than polling.
  - User A’s browser also receives the event. It can simply ignore it, as its UI is already up-to-date.
Why this is the best solution:
- Real-Time: Updates are pushed instantly to all users.
- Hyper-Efficient: No unnecessary network requests. The server and client only communicate when there’s an actual state change.
- Scalable: A single event from the backend can update thousands of connected clients simultaneously.
- Excellent UX: The application feels alive and collaborative.
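The "smart update" on the receiving side can be sketched as a pure function over the local list. This is an illustrative sketch only: the HostRow and HostDeletedEvent shapes are assumptions about what the pushed payload might contain, and the SSE or WebSocket wiring is left out.

```typescript
// Assumed shapes: the real event payload is defined by the backend, not by this sketch.
interface HostRow {
  hostId: string;
  subDomain: string;
}

interface HostDeletedEvent {
  type: 'HostDeletedEvent';
  hostId: string;
}

// Remove the deleted host from the local list. Applying the same event twice is harmless,
// which is why User A (who already removed the row optimistically) can process it too.
function applyHostDeleted(hosts: HostRow[], evt: HostDeletedEvent): HostRow[] {
  return hosts.filter(h => h.hostId !== evt.hostId);
}

const hostsBefore: HostRow[] = [
  { hostId: 'h1', subDomain: 'dev' },
  { hostId: 'h2', subDomain: 'test' },
];
const deleted: HostDeletedEvent = { type: 'HostDeletedEvent', hostId: 'h2' };

const hostsForUserB = applyHostDeleted(hostsBefore, deleted);    // row h2 disappears
const hostsForUserA = applyHostDeleted(hostsForUserB, deleted);  // already gone, no change
console.log(hostsForUserB.length, hostsForUserA.length);
```

In a React component this would typically be called from the push listener as setData(prev => applyHostDeleted(prev, evt)).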
Solution 2: Optimistic UI + Stale-While-Revalidate (SWR) / react-query with Refetch-on-Focus
This is a powerful and very easy-to-implement pattern that offers a great “80% solution” without needing a full real-time backend setup. Libraries like react-query (now TanStack Query) or Vercel’s swr are built for this.
How it Works:
- You replace your manual useState/useEffect/fetchData logic with the useQuery hook from react-query.
- User A deletes a host. You perform an optimistic update using the library’s built-in tools.
- User B is looking at the list. Nothing happens yet.
- Now, User B switches from another browser tab back to your application tab.
- react-query automatically detects this “window focus” event and triggers a background refetch of the data.
- The UI is seamlessly updated with the latest data (showing User A’s deletion).
Why this is a great solution:
- Extremely Simple to Implement: You get this behavior for free just by using the library.
- “Good Enough” Real-Time: Data is refreshed exactly when the user is most likely to need it (when they re-engage with the app).
- Efficient: Avoids constant polling. It only refetches on specific, user-driven events (window focus, network reconnect, etc.).
- Handles Caching, Loading States, etc.: These libraries solve many data-fetching headaches for you.
Recommendation & Path Forward
- Immediate Step: Stick with the Optimistic UI approach (Option 1 above). It correctly handles the single-user eventual consistency problem, which is your most pressing issue. It’s the foundation for everything else.
- Next Step (Highly Recommended): Introduce a data-fetching library like TanStack Query (react-query). This will simplify your code and give you the “refetch-on-focus” behavior out of the box, largely solving the multi-user problem with minimal effort.
- Long-Term Goal (For True Real-Time): If your application’s core value is real-time collaboration (like a Google Doc or Figma), then plan to add a Server-Sent Events (SSE) or WebSocket layer to your backend to push updates to clients.
In summary: Avoid client-side polling. Implement the optimistic UI pattern now, and for multi-user synchronization, use a purpose-built library like react-query or a real-time backend push technology like SSE.
Soft Delete vs Hard Delete
Here is a classic problem in Event Sourcing, often related to the concept of “soft deletes” or “state transitions” versus “hard deletes” and re-insertions. The core issue is that aggregate_version must be unique and strictly increasing for a given aggregate. If you try to re-insert an aggregate at an old version, it fundamentally violates Event Sourcing principles.
Let’s break down the scenario and the best ways to handle it.
The Problem Scenario: Version Conflict on Re-add
Your scenario:
- UserHostCreatedEvent (userId=U, hostId=H, aggregate_version=1) -> event_store_t has version 1. user_host_t (projection) has version 1.
- UserHostDeletedEvent (userId=U, hostId=H, aggregate_version=2) -> event_store_t has version 2. user_host_t either deletes or marks as inactive.
- UserHostCreatedEvent (userId=U, hostId=H, aggregate_version=1) -> CONFLICT! This event says the aggregate (U,H) is at version 1 again, but event_store_t already has version 2 for (U,H).
Root Cause: You cannot “re-add” an aggregate at an old version. An aggregate’s version always strictly increases. The action of “adding back” is not a “first time add” in the event history; it’s a new state transition.
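The conflict can be reproduced with a tiny in-memory stand-in for event_store_t. This is only a sketch: the real table and its constraints are not shown in this document, so the rule used here (each appended event must carry the next version in sequence) is an assumption that models "versions always strictly increase", and the class and field names are hypothetical.

```typescript
// Illustrative stand-in for event_store_t; not the actual schema.
interface StoredEvent {
  aggregateId: string;
  aggregateVersion: number;
  type: string;
}

class InMemoryEventStore {
  private readonly events: StoredEvent[] = [];

  // Highest version stored for the aggregate, or 0 when it has no events yet.
  latestVersion(aggregateId: string): number {
    return this.events
      .filter(e => e.aggregateId === aggregateId)
      .reduce((max, e) => Math.max(max, e.aggregateVersion), 0);
  }

  // Accept an event only if it carries the next version in sequence.
  append(event: StoredEvent): void {
    const expected = this.latestVersion(event.aggregateId) + 1;
    if (event.aggregateVersion !== expected) {
      throw new Error(
        `Version conflict for ${event.aggregateId}: expected ${expected}, got ${event.aggregateVersion}`,
      );
    }
    this.events.push(event);
  }
}

const eventStore = new InMemoryEventStore();
eventStore.append({ aggregateId: 'U-H', aggregateVersion: 1, type: 'UserHostCreatedEvent' });
eventStore.append({ aggregateId: 'U-H', aggregateVersion: 2, type: 'UserHostDeletedEvent' });

let conflictMessage = '';
try {
  // "Re-adding" at version 1 is rejected because the stream is already at version 2.
  eventStore.append({ aggregateId: 'U-H', aggregateVersion: 1, type: 'UserHostCreatedEvent' });
} catch (e) {
  conflictMessage = (e as Error).message;
}
console.log(conflictMessage); // Version conflict for U-H: expected 3, got 1
```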
Best Ways to Handle This Kind of Scenario
The solution involves redefining what “add back” means in an Event Sourcing context and how your aggregates and projections handle it.
Option 1: State Transitions (Recommended for your scenario)
This is the most common and robust approach. Instead of thinking of “add” and “remove” as discrete CRUD operations on a single record, think of them as state changes of an aggregate instance that always exists.
Aggregate Design (Conceptual UserHostMapping Aggregate):
- An aggregate representing the state of a (User, Host) relationship (e.g., UserHostMappingAggregate(userId, hostId)).
- It has a state, e.g., ACTIVE, INACTIVE.
- The aggregate_id for this aggregate would be a composite ID (e.g., userId + "-" + hostId or a UUID that represents this specific mapping).
- It has a version (sequence number).
Event Types:
- UserHostActivatedEvent (userId, hostId, sequence_number)
- UserHostDeactivatedEvent (userId, hostId, sequence_number)
Scenario with State Transitions:
- Add Host to User Mapping (First Time):
  - Command: ActivateUserHostMapping(userId=U, hostId=H, expectedVersion=0) (Expected version 0 because it doesn’t exist yet).
  - Aggregate (U,H): Generates UserHostActivatedEvent (userId=U, hostId=H, sequence_number=1).
  - event_store_t: Saves version 1.
  - user_host_t (projection): INSERTS record (U, H, status=ACTIVE, aggregate_version=1).
- Remove Host to User Mapping:
  - Command: DeactivateUserHostMapping(userId=U, hostId=H, expectedVersion=1).
  - Aggregate (U,H): Generates UserHostDeactivatedEvent (userId=U, hostId=H, sequence_number=2).
  - event_store_t: Saves version 2.
  - user_host_t (projection): UPDATES record (U, H) to status=INACTIVE, aggregate_version=2. (Doesn’t delete the row).
- Add Back the Same Host to User Mapping:
  - Command: ReactivateUserHostMapping(userId=U, hostId=H, expectedVersion=2). (Expected version 2 because it’s currently INACTIVE at version 2).
  - Aggregate (U,H): Generates UserHostActivatedEvent (userId=U, hostId=H, sequence_number=3).
  - event_store_t: Saves version 3.
  - user_host_t (projection): UPDATES record (U, H) to status=ACTIVE, aggregate_version=3.
Benefits of State Transitions:
- Strictly Monotonic Versions: The sequence_number for the UserHostMapping aggregate (U,H) always increases (0 -> 1 -> 2 -> 3). No version conflicts.
- Complete History: The Event Store clearly shows the activation/deactivation cycle.
- Simpler Projection: The projection (user_host_t) never deletes rows; it only updates their status and version. This makes updates simple (UPDATE ... WHERE aggregate_id = ? AND aggregate_version = ?) and avoids INSERT conflicts on “re-add.”
- Idempotent Read Model Updates: The consumer logic is straightforward.
Option 2: Unique ID for Each Relationship Instance (Less common for simple toggles)
- Approach: Instead of (U,H) being one aggregate that changes status, you treat each “active period” of (U,H) as a new, distinct aggregate.
- aggregate_id: A brand new UUID for each activation of (U,H).
- Event Types:
  - UserHostCreatedEvent (mappingId=M1, userId=U, hostId=H, sequence_number=1)
  - UserHostDeletedEvent (mappingId=M1, userId=U, hostId=H, sequence_number=2)
  - UserHostCreatedEvent (mappingId=M2, userId=U, hostId=H, sequence_number=1) (for the second time)
- Projection: The user_host_t table would track these mappingIds, possibly with start_ts and end_ts. When a mapping is terminated, you update its end_ts. When “added back,” you insert a new row with a new mappingId.
- Complexity: Managing which mappingId is current for (U,H) can be tricky. It’s usually overkill for simple active/inactive toggles.
Option 3: History Table for User Host Mapping
- Approach: Create a user_host_history_t to keep a history of UserHostMapping.
- Projection: The user_host_t and user_host_history_t join together for the query with both snapshot and historical views.
- Complexity: Managing both original and historical tables is overkill in this use case unless you need historical query very frequently.
Recommended Approach for your user_host_t scenario
Go with Option 1: State Transitions for a (User, Host) Aggregate.
Detailed Changes:
- Database Schema for user_host_t:
  - Add a status column (e.g., VARCHAR(10) NOT NULL DEFAULT 'ACTIVE').
  - Ensure aggregate_version column exists.
  - Primary key/unique constraint likely remains (host_id, user_id).

ALTER TABLE user_host_t
ADD COLUMN status VARCHAR(10) NOT NULL DEFAULT 'ACTIVE',
ADD COLUMN aggregate_version BIGINT NOT NULL DEFAULT 0;
-- Add a unique constraint if not already present on (host_id, user_id)
-- ALTER TABLE user_host_t ADD CONSTRAINT pk_user_host PRIMARY KEY (host_id, user_id);

- Define specific Event Types:
  - UserHostActivatedEvent
  - UserHostDeactivatedEvent
- Command Handling Logic (Write Model):
  - When the “add host to user” command comes in:
    - Load the UserHostMapping aggregate (identified by (host_id, user_id)).
    - If not found (expectedVersion 0), generate UserHostActivatedEvent.
    - If found and status=INACTIVE (expectedVersion > 0), generate UserHostActivatedEvent.
    - If found and status=ACTIVE (expectedVersion > 0), reject (already active, idempotent no-op).
  - When the “remove host from user” command comes in:
    - Load the UserHostMapping aggregate.
    - If not found or status=INACTIVE, reject (already inactive/not found).
    - If status=ACTIVE, generate UserHostDeactivatedEvent.
- PortalEventConsumer Logic (Read Model Update):
  - For UserHostActivatedEvent:
    - This event means the mapping is now active.
    - Try to UPDATE user_host_t SET status='ACTIVE', aggregate_version=? WHERE host_id=? AND user_id=? AND aggregate_version=?.
    - If 0 rows updated:
      - Check if the record exists (SELECT COUNT(*) ...).
      - If it exists (and version didn’t match), it’s a ConcurrencyException.
      - If it doesn’t exist, it’s the very first time this mapping became active, so INSERT INTO user_host_t (...) VALUES (...).
    - This will handle both initial creation and reactivation as idempotent updates/inserts based on state.
  - For UserHostDeactivatedEvent:
    - This event means the mapping is now inactive.
    - UPDATE user_host_t SET status='INACTIVE', aggregate_version=? WHERE host_id=? AND user_id=? AND aggregate_version=?.
    - If 0 rows updated, it’s either ConcurrencyException or “not found” (already inactive).
This approach treats the user_host_t relationship as a single logical entity (an aggregate instance) that transitions through states (ACTIVE/INACTIVE), ensuring the aggregate_version always progresses monotonically and avoiding the conflict you described.
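The read-model rules above (update when the stored version matches, insert on first activation, otherwise report a concurrency problem) can be sketched against an in-memory map standing in for user_host_t. This is a sketch under assumptions, not the PortalEventConsumer code: the class and type names are hypothetical, and it assumes each event advances the version by exactly one, so the expected stored version is the event's version minus one.

```typescript
type MappingStatus = 'ACTIVE' | 'INACTIVE';

interface MappingRow {
  status: MappingStatus;
  aggregateVersion: number;
}

interface MappingEvent {
  type: 'UserHostActivatedEvent' | 'UserHostDeactivatedEvent';
  hostId: string;
  userId: string;
  aggregateVersion: number; // version of the aggregate after this event
}

// In-memory stand-in for the user_host_t projection, keyed by (host_id, user_id).
class UserHostProjection {
  private readonly rows: { [key: string]: MappingRow } = {};

  get(hostId: string, userId: string): MappingRow | undefined {
    return this.rows[hostId + '|' + userId];
  }

  apply(e: MappingEvent): void {
    const key = e.hostId + '|' + e.userId;
    const row = this.rows[key];
    const newStatus: MappingStatus = e.type === 'UserHostActivatedEvent' ? 'ACTIVE' : 'INACTIVE';
    const expectedOldVersion = e.aggregateVersion - 1; // assumption: versions advance by one

    if (row && row.aggregateVersion === expectedOldVersion) {
      // Equivalent of UPDATE ... WHERE aggregate_version = ? matching one row.
      this.rows[key] = { status: newStatus, aggregateVersion: e.aggregateVersion };
      return;
    }
    if (row) {
      // Row exists but the version did not match.
      throw new Error('ConcurrencyException: stored version ' + row.aggregateVersion);
    }
    if (e.type === 'UserHostActivatedEvent') {
      // 0 rows updated and no record exists: first activation, so INSERT.
      this.rows[key] = { status: newStatus, aggregateVersion: e.aggregateVersion };
      return;
    }
    throw new Error('Not found: cannot deactivate a mapping that was never projected');
  }
}

const projection = new UserHostProjection();
projection.apply({ type: 'UserHostActivatedEvent', hostId: 'H', userId: 'U', aggregateVersion: 1 });
projection.apply({ type: 'UserHostDeactivatedEvent', hostId: 'H', userId: 'U', aggregateVersion: 2 });
projection.apply({ type: 'UserHostActivatedEvent', hostId: 'H', userId: 'U', aggregateVersion: 3 });
console.log(projection.get('H', 'U')); // status ACTIVE at aggregateVersion 3
```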
Command Handler Logic
It is crucial to distinguish between the read model (what the UI sees) and the command model (what the command handler uses to decide). The command handler cannot rely solely on the UI’s expectedVersion in this scenario. It needs to query its own source of truth (the Event Store) to decide if it’s an “initial activation” or a “reactivation.”
Let’s refine the command handling logic for the UserHostMapping aggregate.
Key: The Command Handler Owns the Decision, Using the Event Store
The command handler’s job is to:
- Load the aggregate’s current state (by replaying events from event_store_t).
- Determine its current status and current version based on that replay.
- Compare the expectedVersion from the command with the aggregate’s currentVersion.
- Apply business rules to decide what event(s) to generate.
Event Types & Aggregate ID (as per previous recommendation)
- Aggregate ID: A composite of hostId and userId (e.g., hostId + "_" + userId).
- Events:
  - UserHostActivatedEvent: Represents the relationship becoming active.
  - UserHostDeactivatedEvent: Represents the relationship becoming inactive.
Step-by-Step Command Handling Logic
Let’s assume your command handler is UserHostMappingCommandHandler and it interacts with a UserHostMappingAggregate.
1. UserHostMappingAggregate (Internal Logic):
This aggregate needs to rebuild its state (currentStatus, currentVersion) from its event stream.
public class UserHostMappingAggregate {
private final String hostId;
private final String userId;
private UserHostMappingStatus currentStatus; // Enum: ACTIVE, INACTIVE, NON_EXISTENT
private long currentVersion; // Sequence number of the last applied event
private List<DomainEvent> uncommittedEvents = new ArrayList<>();
public UserHostMappingAggregate(String hostId, String userId) {
this.hostId = hostId;
this.userId = userId;
this.currentStatus = UserHostMappingStatus.NON_EXISTENT; // Initial state
this.currentVersion = 0;
}
public static UserHostMappingAggregate loadFromEvents(String hostId, String userId, List<DomainEvent> history) {
UserHostMappingAggregate aggregate = new UserHostMappingAggregate(hostId, userId);
if (history != null && !history.isEmpty()) {
history.forEach(aggregate::applyEvent);
}
return aggregate;
}
private void applyEvent(DomainEvent event) {
if (event instanceof UserHostActivatedEvent) {
this.currentStatus = UserHostMappingStatus.ACTIVE;
} else if (event instanceof UserHostDeactivatedEvent) {
this.currentStatus = UserHostMappingStatus.INACTIVE;
}
this.currentVersion = event.getSequenceNumber(); // Update version based on event
}
// --- Command Handling Methods ---
public void activateMapping(long expectedVersion) {
// OCC Check (optional here, but good practice if not relying solely on DB constraint)
if (this.currentVersion != expectedVersion) {
throw new ConcurrencyException("Concurrency conflict. Expected version " + expectedVersion + ", actual " + this.currentVersion);
}
// Business Logic: What state must it be in to activate?
if (this.currentStatus == UserHostMappingStatus.ACTIVE) {
// Already active, idempotent no-op or reject as invalid transition
logger.info("Mapping for user {} host {} is already active. No new event generated.", userId, hostId);
return;
}
// Generate new event
long nextVersion = this.currentVersion + 1;
UserHostActivatedEvent event = new UserHostActivatedEvent(
UUID.randomUUID(), Instant.now(), getAggregateId(), "UserHostMapping", nextVersion, hostId, userId
);
uncommittedEvents.add(event);
applyEvent(event); // Apply to internal state immediately for consistency
}
public void deactivateMapping(long expectedVersion) {
// OCC Check
if (this.currentVersion != expectedVersion) {
throw new ConcurrencyException("Concurrency conflict. Expected version " + expectedVersion + ", actual " + this.currentVersion);
}
// Business Logic
if (this.currentStatus != UserHostMappingStatus.ACTIVE) {
logger.info("Mapping for user {} host {} is not active. Cannot deactivate.", userId, hostId);
throw new IllegalStateException("Mapping is not active and cannot be deactivated.");
}
// Generate new event
long nextVersion = this.currentVersion + 1;
UserHostDeactivatedEvent event = new UserHostDeactivatedEvent(
UUID.randomUUID(), Instant.now(), getAggregateId(), "UserHostMapping", nextVersion, hostId, userId
);
uncommittedEvents.add(event);
applyEvent(event);
}
// Helper to get the composite aggregate ID
public String getAggregateId() {
return hostId + "_" + userId; // Consistent composite ID
}
// Getters for external access
public UserHostMappingStatus getCurrentStatus() { return currentStatus; }
public long getCurrentVersion() { return currentVersion; }
public List<DomainEvent> getUncommittedEvents() { return uncommittedEvents; }
public void markEventsCommitted() { uncommittedEvents.clear(); }
public enum UserHostMappingStatus {
ACTIVE, INACTIVE, NON_EXISTENT
}
}
2. UserHostMappingCommandHandler (Application Service):
This is where the command logic happens. The key is that the command from the UI is now generic (e.g., SetUserHostMappingStatus).
public class UserHostMappingCommandHandler { // This is your application service
private final EventStoreEventRepository eventStoreRepository; // To load events
private final OutboxMessageRepository outboxRepository; // To save new events
// Constructor injection
// ...
public void handleSetUserHostMappingStatus(String hostId, String userId, boolean activate, long expectedVersionFromUI) {
String aggregateId = hostId + "_" + userId;
// 1. Load aggregate state from Event Store
List<DomainEvent> history = eventStoreRepository.findByAggregateIdOrderBySequenceNumberAsc(aggregateId)
.stream()
.map(this::deserializeEventStoreEvent) // Deserialize from DB format
.collect(Collectors.toList());
UserHostMappingAggregate aggregate = UserHostMappingAggregate.loadFromEvents(hostId, userId, history);
// 2. Perform business logic based on intent (activate) and current state
if (activate) {
aggregate.activateMapping(expectedVersionFromUI); // Will generate UserHostActivatedEvent
} else {
aggregate.deactivateMapping(expectedVersionFromUI); // Will generate UserHostDeactivatedEvent
}
// 3. Persist new events
List<DomainEvent> newEvents = aggregate.getUncommittedEvents();
if (!newEvents.isEmpty()) {
// Your transactional outbox logic (save to Event Store and Outbox)
eventStoreRepository.saveAll(newEvents.stream().map(this::mapToEventStoreEvent).collect(Collectors.toList()));
outboxRepository.saveAll(newEvents.stream().map(this::mapToOutboxMessage).collect(Collectors.toList()));
aggregate.markEventsCommitted();
}
}
// Helper methods for serialization/deserialization as shown in previous examples
// ...
}
3. PortalEventConsumer Logic (Read Model Update):
The consumer updates user_host_t based on the events.
- For UserHostActivatedEvent:
// In your PortalEventConsumer (inside processSingleEventWithRetries for this event type)
Map<String, Object> eventData = extractEventData(eventMap);
String hostId = (String) eventMap.get(Constants.HOST); // Assuming hostId is a CE extension
String userId = (String) eventMap.get(Constants.USER); // Assuming userId is a CE extension
String aggregateId = (String) eventMap.get(CloudEventV1.SUBJECT); // Or extract from eventData if set as such
long newVersion = getEventSequenceNumber(eventMap);
// SQL: UPSERT is ideal here. If record exists, update status/version. If not, insert.
// This handles both initial activation (INSERT) and reactivation (UPDATE) idempotently.
final String upsertSql = "INSERT INTO user_host_t (host_id, user_id, status, aggregate_version, update_user, update_ts) " +
"VALUES (?, ?, ?, ?, ?, ?) " +
"ON CONFLICT (host_id, user_id) DO UPDATE SET " +
"status = EXCLUDED.status, " +
"aggregate_version = EXCLUDED.aggregate_version, " +
"update_user = EXCLUDED.update_user, " +
"update_ts = EXCLUDED.update_ts " +
"WHERE user_host_t.aggregate_version < EXCLUDED.aggregate_version"; // Only update if incoming event is newer
try (PreparedStatement statement = conn.prepareStatement(upsertSql)) {
statement.setObject(1, UUID.fromString(hostId));
statement.setObject(2, UUID.fromString(userId));
statement.setString(3, UserHostMappingAggregate.UserHostMappingStatus.ACTIVE.name());
statement.setLong(4, newVersion);
statement.setString(5, (String)eventMap.get(Constants.USER)); // From CE extension
statement.setObject(6, OffsetDateTime.parse((String)eventMap.get(CloudEventV1.TIME)));
statement.executeUpdate();
}
  - Crucial ON CONFLICT ... WHERE user_host_t.aggregate_version < EXCLUDED.aggregate_version: This makes the projection update idempotent and handles out-of-order delivery. If the database already has a newer version than the incoming event, it simply does nothing (0 rows affected), preventing a stale event from overwriting a more recent state.
- For UserHostDeactivatedEvent:
// In your PortalEventConsumer (inside processSingleEventWithRetries for this event type)
Map<String, Object> eventData = extractEventData(eventMap);
String hostId = (String) eventMap.get(Constants.HOST);
String userId = (String) eventMap.get(Constants.USER);
long newVersion = getEventSequenceNumber(eventMap);
final String updateSql = "UPDATE user_host_t SET status='INACTIVE', aggregate_version=?, update_user=?, update_ts=? " +
"WHERE host_id = ? AND user_id = ? AND aggregate_version < ?"; // Only update if incoming event is newer
try (PreparedStatement statement = conn.prepareStatement(updateSql)) {
statement.setLong(1, newVersion);
statement.setString(2, (String)eventMap.get(Constants.USER));
statement.setObject(3, OffsetDateTime.parse((String)eventMap.get(CloudEventV1.TIME)));
statement.setObject(4, UUID.fromString(hostId));
statement.setObject(5, UUID.fromString(userId));
statement.setLong(6, newVersion); // Only update if current DB version < newVersion (from event)
statement.executeUpdate();
}
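The version guard used by both SQL statements can be traced without a database. The following is a minimal in-memory sketch, assuming a hypothetical ProjectionGuardSketch class whose map stands in for user_host_t. It is not the portal's consumer code; it only illustrates why a redelivered or out-of-order event leaves the row untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for user_host_t, keyed by "hostId_userId".
class ProjectionGuardSketch {
    static final class Row {
        final String status;
        final long aggregateVersion;
        Row(String status, long aggregateVersion) {
            this.status = status;
            this.aggregateVersion = aggregateVersion;
        }
    }

    private final Map<String, Row> table = new HashMap<>();

    // Mirrors INSERT ... ON CONFLICT DO UPDATE ... WHERE stored version < incoming version.
    // Returns true when a row was written, false when the event was ignored.
    boolean upsertActive(String key, long eventVersion) {
        Row existing = table.get(key);
        if (existing != null && existing.aggregateVersion >= eventVersion) {
            return false; // stale or duplicate event: nothing is written
        }
        table.put(key, new Row("ACTIVE", eventVersion));
        return true;
    }

    // Mirrors UPDATE ... WHERE host_id = ? AND user_id = ? AND aggregate_version < ?.
    boolean markInactive(String key, long eventVersion) {
        Row existing = table.get(key);
        if (existing == null || existing.aggregateVersion >= eventVersion) {
            return false; // no row, or the row is already at this version or newer
        }
        table.put(key, new Row("INACTIVE", eventVersion));
        return true;
    }

    Row get(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        ProjectionGuardSketch projection = new ProjectionGuardSketch();
        projection.upsertActive("H1_U1", 1);  // initial activation
        projection.markInactive("H1_U1", 2);  // deactivation
        projection.upsertActive("H1_U1", 1);  // redelivered event 1 is ignored
        Row row = projection.get("H1_U1");
        System.out.println(row.status + " at version " + row.aggregateVersion);
    }
}
```

Because each write compares the stored version with the incoming one, replaying the same event stream any number of times converges on the same row.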
How to Figure it Out in the Command Handler (from UI perspective)
The UI will initially query the user_host_t read model.
- Scenario A: UI queries, no record for (U,H) found.
  - UI infers state is “Non-Existent” or “Inactive”.
  - UI provides expectedVersion = 0 to the command (because the read model had no entry).
  - Command handler: aggregate.currentStatus == NON_EXISTENT. Generates UserHostActivatedEvent (sequence_number=1).
- Scenario B: UI queries, record (U,H, status=ACTIVE, aggregate_version=1) found.
  - UI provides expectedVersion = 1 to the command.
  - User wants to “remove.”
  - Command handler: aggregate.currentStatus == ACTIVE. Generates UserHostDeactivatedEvent (sequence_number=2).
- Scenario C: UI queries, record (U,H, status=INACTIVE, aggregate_version=2) found. (This assumes your UI could list inactive items, or an admin UI can see it.)
  - UI provides expectedVersion = 2 to the command.
  - User wants to “add back” / “reactivate.”
  - Command handler: aggregate.currentStatus == INACTIVE. Generates UserHostActivatedEvent (sequence_number=3).
- Crucial UI Aspect: If the UI doesn’t display inactive items (which is typical for a “list active” view), and the user tries to “add” an item that used to exist but is now inactive, the UI would initially send expectedVersion = 0.
  - Command handler receives expectedVersion = 0, but aggregate is actually INACTIVE at version=2.
  - OCC Conflict! aggregate.currentVersion (2) != expectedVersion (0). Command is rejected.
  - User Experience: “Cannot add. This mapping exists in an inactive state. Please activate it instead.” This forces a clearer UI workflow.
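The scenarios above can be traced with a trimmed-down model of the aggregate. This is a sketch under stated assumptions: ScenarioWalkthrough is a hypothetical class that keeps only the status and version, and it throws IllegalStateException where the real aggregate would throw its own ConcurrencyException.

```java
// Hypothetical, trimmed-down model of the aggregate, used only to trace the scenarios.
class ScenarioWalkthrough {
    enum Status { NON_EXISTENT, ACTIVE, INACTIVE }

    Status status = Status.NON_EXISTENT;
    long version = 0;

    void activate(long expectedVersion) {
        checkVersion(expectedVersion);
        if (status == Status.ACTIVE) {
            return; // already active: idempotent no-op, no new event
        }
        status = Status.ACTIVE;
        version++; // one new event appended
    }

    void deactivate(long expectedVersion) {
        checkVersion(expectedVersion);
        if (status != Status.ACTIVE) {
            throw new IllegalStateException("Mapping is not active and cannot be deactivated.");
        }
        status = Status.INACTIVE;
        version++;
    }

    // OCC check: the version observed by the UI must match the replayed version.
    private void checkVersion(long expectedVersion) {
        if (version != expectedVersion) {
            throw new IllegalStateException(
                "Concurrency conflict. Expected version " + expectedVersion + ", actual " + version);
        }
    }

    public static void main(String[] args) {
        ScenarioWalkthrough mapping = new ScenarioWalkthrough();
        mapping.activate(0);   // Scenario A: no read-model record, expectedVersion = 0
        mapping.deactivate(1); // Scenario B: active at version 1, user removes it
        try {
            mapping.activate(0); // inactive row hidden from the UI, so it sends 0
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        mapping.activate(2);   // Scenario C: UI saw the inactive row at version 2
        System.out.println(mapping.status + " at version " + mapping.version);
    }
}
```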
By leveraging state transitions within your aggregate and using ON CONFLICT / WHERE aggregate_version < EXCLUDED.aggregate_version in your projection updates, you ensure strict versioning, idempotent read models, and a consistent business logic flow.
Tenant Migration and Environment Promotion
This document addresses the reality of tenant migration and environment promotion in an Event Sourcing context. You must preserve the event sequence (aggregate_version) while making necessary adjustments (hostId, new userId UUIDs) to fit the target environment.
Design Strategy: The Event Mutator
The best design is to introduce a specific, configurable pipeline stage—an Event Mutator—that runs after deserialization but before the final DB insert.
We’ll define the replacement and enrichment parameters as JSON/YAML structures and create a separate utility to apply the mutations.
1. Mutation Configuration Format
We’ll define the parameters to be a JSON string representing a list of mutation rules.
- replacement (-r): Find a field with an old value and replace it with a new value.
  - Example: [{"field": "hostId", "from": "UUID_A", "to": "UUID_B"}, {"field": "user_id", "from": "ID_X", "to": "ID_Y"}]
- enrichment (-e): Find a field and generate a new, unique value for it.
  - Example: [{"field": "id", "action": "generateUUID"}, {"field": "userId", "action": "mapAndGenerate", "sourceField": "originalUserId"}]
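To make the two rule types concrete, here is a minimal sketch that applies them to a plain map standing in for an event's data payload. MutationRuleSketch and its method names are hypothetical, and java.util.UUID is used instead of the platform's UUID utility so that the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical illustration of the replacement and mapAndGenerate rule semantics.
class MutationRuleSketch {
    // Remembers generated IDs so the same original value always maps to the same new value.
    private final Map<String, String> generatedIds = new HashMap<>();

    // Replacement: overwrite the field only when its current value equals "from".
    static boolean applyReplacement(Map<String, Object> payload, Map<String, String> rule) {
        String field = rule.get("field");
        Object current = payload.get(field);
        if (current != null && current.toString().equals(rule.get("from"))) {
            payload.put(field, rule.get("to"));
            return true;
        }
        return false;
    }

    // Enrichment (mapAndGenerate): derive a stable new ID from the value in sourceField.
    String mapAndGenerate(Map<String, Object> payload, String field, String sourceField) {
        Object original = payload.get(sourceField);
        if (original == null) {
            return null; // nothing to map from
        }
        String newId = generatedIds.computeIfAbsent(
            field + ":" + original, key -> UUID.randomUUID().toString());
        payload.put(field, newId);
        return newId;
    }

    public static void main(String[] args) {
        Map<String, Object> payload = new HashMap<>();
        payload.put("hostId", "UUID_A");
        payload.put("originalUserId", "ID_X");

        Map<String, String> rule = new HashMap<>();
        rule.put("field", "hostId");
        rule.put("from", "UUID_A");
        rule.put("to", "UUID_B");
        applyReplacement(payload, rule);

        MutationRuleSketch sketch = new MutationRuleSketch();
        sketch.mapAndGenerate(payload, "userId", "originalUserId");
        System.out.println(payload);
    }
}
```

Keeping the generated IDs in one map for the whole import run is what lets every event that refers to the same original user end up with the same new user ID.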
2. The EventMutator Class
This class will handle parsing the configuration and applying the changes to the CloudEvent attributes and the data payload.
Refactored Cli.java and New EventMutator Logic
Here is the updated Cli.java and a conceptual EventMutator structure.
A. New Class: EventMutator.java
This class handles the core logic. Since CloudEvents are immutable, any change requires rebuilding the event (CloudEventBuilder.v1(cloudEvent)).
package net.lightapi.importer;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.config.Config;
import com.networknt.utility.UuidUtil;
import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;
import net.lightapi.portal.PortalConstants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
public class EventMutator {
private static final Logger logger = LoggerFactory.getLogger(EventMutator.class);
private final ObjectMapper mapper = Config.getInstance().getMapper();
// Parsed list of rules
private final List<Map<String, String>> replacementRules;
private final List<Map<String, String>> enrichmentRules;
// Map to track generated UUIDs for consistent replacement across events (e.g., old user ID -> new user ID)
private final Map<String, String> generatedIdMap = new HashMap<>();
public EventMutator(String replacementJson, String enrichmentJson) {
this.replacementRules = parseRules(replacementJson);
this.enrichmentRules = parseRules(enrichmentJson);
}
private List<Map<String, String>> parseRules(String json) {
if (json == null || json.isEmpty()) return Collections.emptyList();
try {
return mapper.readValue(json, new TypeReference<List<Map<String, String>>>() {});
} catch (IOException e) {
logger.error("Failed to parse mutation rules JSON: {}", json, e);
throw new IllegalArgumentException("Invalid JSON format for mutation rules.", e);
}
}
/**
* Applies all replacement and enrichment rules to a single CloudEvent.
* @param originalEvent The original CloudEvent object.
* @return The mutated CloudEvent.
*/
public CloudEvent mutate(CloudEvent originalEvent) {
CloudEventBuilder builder = CloudEventBuilder.v1(originalEvent);
Map<String, Object> dataMap = null;
// Deserialize data payload once (if present)
if (originalEvent.getData() != null && originalEvent.getData().toBytes().length > 0) {
try {
dataMap = mapper.readValue(originalEvent.getData().toBytes(), new TypeReference<HashMap<String, Object>>() {});
} catch (IOException e) {
logger.error("Failed to deserialize CloudEvent data for mutation. Skipping data mutation.", e);
// Continue with just extension mutation
}
}
// 1. Apply Replacements
applyReplacements(builder, dataMap);
// 2. Apply Enrichments
applyEnrichments(builder, dataMap);
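// Rebuild CloudEvent with mutated data if it was changed.
// The internal flag is removed first so it is not serialized into the payload.
if (dataMap != null && dataMap.remove("__MUTATED_DATA__") != null) {
String contentType = originalEvent.getDataContentType() != null ? originalEvent.getDataContentType() : "application/json";
try {
builder.withData(contentType, mapper.writeValueAsBytes(dataMap));
} catch (IOException e) {
logger.error("Failed to serialize mutated CloudEvent data. Keeping the original data.", e);
}
}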
return builder.build();
}
// --- Private Mutation Helpers ---
private void applyReplacements(CloudEventBuilder builder, Map<String, Object> dataMap) {
for (Map<String, String> rule : replacementRules) {
String field = rule.get("field");
String from = rule.get("from");
String to = rule.get("to");
if (field == null || from == null || to == null) continue;
// Check CloudEvent Extensions (including known attributes like host, user)
Object extensionValue = builder.getExtension(field);
if (extensionValue != null && extensionValue.toString().equals(from)) {
builder.withExtension(field, to);
logger.debug("Replaced extension {} from {} to {}", field, from, to);
}
// Check CloudEvent Data Payload
if (dataMap != null && dataMap.containsKey(field) && dataMap.get(field) != null && dataMap.get(field).toString().equals(from)) {
dataMap.put(field, to);
dataMap.put("__MUTATED_DATA__", dataMap); // Flag that data was mutated
logger.debug("Replaced data field {} from {} to {}", field, from, to);
}
}
}
private void applyEnrichments(CloudEventBuilder builder, Map<String, Object> dataMap) {
for (Map<String, String> rule : enrichmentRules) {
String field = rule.get("field");
String action = rule.get("action");
if (field == null || action == null) continue;
String generatedId = null;
if ("generateUUID".equalsIgnoreCase(action)) {
// Generate and cache a new UUID for the whole import run if needed, or always generate new.
// For simplicity, we assume we generate a new UUID for the field.
generatedId = UuidUtil.getUUID().toString();
} else if ("mapAndGenerate".equalsIgnoreCase(action)) {
String sourceField = rule.get("sourceField");
String originalId = null;
// Get the original ID from a source field in the data payload (e.g., from an 'oldUserId' field)
if (dataMap != null && sourceField != null && dataMap.containsKey(sourceField)) {
originalId = dataMap.get(sourceField).toString();
}
// Or get from a specific CloudEvent extension/subject
else if ("subject".equalsIgnoreCase(sourceField) && builder.getSubject() != null) {
originalId = builder.getSubject();
}
if (originalId != null) {
// Check cache for consistency (e.g., ensure old_user_ID_A always maps to new_user_ID_X)
generatedId = generatedIdMap.computeIfAbsent(field + ":" + originalId, k -> UuidUtil.getUUID().toString());
logger.debug("Mapped original ID {} to new ID {}", originalId, generatedId);
} else {
// Cannot map, fall back to simple UUID generation if allowed
generatedId = UuidUtil.getUUID().toString();
}
} else if ("aggregateIdMap".equalsIgnoreCase(action) && field.equals("subject")) {
// This complex logic is for when a related aggregate ID needs to be updated.
// E.g., when importing a User, the UserCreatedEvent ID is the new Subject/AggregateId.
// The actual logic for this is too complex for a generic SMT and relies on a separate lookup service.
// Skip for this simple mutator.
continue;
}
if (generatedId != null) {
// Mutate CloudEvent Extensions (Subject, ID, etc.)
if ("id".equalsIgnoreCase(field)) {
builder.withId(generatedId);
} else if ("subject".equalsIgnoreCase(field)) {
builder.withSubject(generatedId);
} else if (builder.getExtension(field) != null) { // Custom extension
builder.withExtension(field, generatedId);
}
// Mutate Data Payload
if (dataMap != null && dataMap.containsKey(field)) {
dataMap.put(field, generatedId);
dataMap.put("__MUTATED_DATA__", dataMap); // Flag that data was mutated
}
logger.debug("Enriched field {} with new ID {}", field, generatedId);
}
}
}
}
B. Updated Cli.java to Integrate EventMutator
package net.lightapi.importer;
// ... (Existing imports) ...
import com.networknt.config.JsonMapper;
import com.networknt.db.provider.SqlDbStartupHook;
import com.networknt.monad.Result;
import com.networknt.service.SingletonServiceFactory;
import com.networknt.status.Status;
import com.networknt.utility.Constants;
import com.networknt.utility.UuidUtil; // Used in mutator
import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;
import io.cloudevents.core.format.EventFormat;
import io.cloudevents.core.provider.EventFormatProvider;
import io.cloudevents.jackson.JsonFormat;
import net.lightapi.portal.EventTypeUtil;
import net.lightapi.portal.PortalConstants;
import net.lightapi.portal.db.PortalDbProvider;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID; // Used in mutator
public class Cli {
private static final Logger logger = LoggerFactory.getLogger(Cli.class); // Added logger
public static PortalDbProvider dbProvider;
public static SqlDbStartupHook sqlDbStartupHook;
@Parameter(names={"--filename", "-f"}, required = false,
description = "The filename to be imported.")
String filename;
@Parameter(names={"--batchSize", "-b"}, required = false,
description = "Number of events to import per database transaction batch. Default is 1000.")
int batchSize = 1000;
@Parameter(names={"--replacement", "-r"}, required = false,
description = "JSON array string of replacement rules: [{'field': 'oldHostId', 'from': 'UUID_A', 'to': 'UUID_B'}].")
String replacement;
@Parameter(names={"--enrichment", "-e"}, required = false,
description = "JSON array string of enrichment rules: [{'field': 'userId', 'action': 'mapAndGenerate', 'sourceField': 'oldUserId'}].")
String enrichment;
@Parameter(names={"--help", "-h"}, help = true)
private boolean help;
public static void main(String ... argv) throws Exception {
// Declared outside the try block so that the catch clause can print the usage text
Cli cli = new Cli();
JCommander jCommander = JCommander.newBuilder().addObject(cli).build();
try {
// ... (Startup initialization remains the same) ...
jCommander.parse(argv);
// Assuming SingletonServiceFactory and SqlDbStartupHook setup is correct
dbProvider = (PortalDbProvider) SingletonServiceFactory.getBean(DbProvider.class);
cli.run(jCommander);
} catch (ParameterException e) {
System.err.println("Command line parameter error: " + e.getLocalizedMessage());
jCommander.usage();
} catch (Exception e) {
System.err.println("An unexpected error occurred during startup or import: " + e.getLocalizedMessage());
e.printStackTrace();
}
}
public void run(JCommander jCommander) throws Exception {
if (help) {
jCommander.usage();
return;
}
logger.info("Starting event import with batch size: {}", batchSize);
if (replacement != null) logger.info("Replacement rules: {}", replacement);
if (enrichment != null) logger.info("Enrichment rules: {}", enrichment);
EventFormat cloudEventFormat = EventFormatProvider.getInstance().resolveFormat(JsonFormat.CONTENT_TYPE);
if (cloudEventFormat == null) {
logger.error("No CloudEvent JSON format provider found.");
throw new IllegalStateException("CloudEvent JSON format not found.");
}
// --- Instantiate EventMutator ---
EventMutator mutator = new EventMutator(replacement, enrichment);
List<CloudEvent> currentBatch = new ArrayList<>(batchSize);
long importedCount = 0;
long lineNumber = 0;
try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
String line;
while((line = reader.readLine()) != null) {
lineNumber++;
if(line.startsWith("#") || line.trim().isEmpty()) continue;
try {
// Assuming format: "key value" (where key is user_id, value is the full database row JSON)
int firstSpace = line.indexOf(" ");
if (firstSpace == -1) {
logger.warn("Skipping malformed line {} (no space separator): {}", lineNumber, line);
continue;
}
String dbRowJson = line.substring(firstSpace + 1); // <<< Full DB row JSON
// 1. Deserialize the nested CloudEvent (The Fix from prior step)
Map<String, Object> dbRowMap = Config.getInstance().getMapper().readValue(dbRowJson, new TypeReference<HashMap<String, Object>>() {});
String cloudEventJsonFromPayload = (String) dbRowMap.get("payload");
CloudEvent cloudEvent = cloudEventFormat.deserialize(cloudEventJsonFromPayload.getBytes(StandardCharsets.UTF_8));
// 2. Perform Mutation/Enrichment
CloudEvent mutatedEvent = mutator.mutate(cloudEvent);
// 3. Finalization/Validation (Transfer critical top-level DB fields to Extensions)
// Transferring nonce and aggregateVersion from the exported DB row into the CloudEvent's extensions.
Object dbNonceObj = dbRowMap.get("nonce");
if (dbNonceObj instanceof Number) {
mutatedEvent = CloudEventBuilder.v1(mutatedEvent)
.withExtension(PortalConstants.NONCE, ((Number)dbNonceObj).longValue())
.build();
}
Object dbAggVersionObj = dbRowMap.get("aggregateVersion");
if (dbAggVersionObj instanceof Number) {
mutatedEvent = CloudEventBuilder.v1(mutatedEvent)
.withExtension(PortalConstants.EVENT_AGGREGATE_VERSION, ((Number)dbAggVersionObj).longValue())
.build();
}
// 4. Add to current batch.
currentBatch.add(mutatedEvent);
// If batch is full, process it
if (currentBatch.size() >= batchSize) {
processBatch(currentBatch);
importedCount += currentBatch.size();
currentBatch.clear();
}
} catch (Exception e) {
logger.error("Error processing line {}: {}", lineNumber, e.getMessage(), e);
// Log and continue to process the rest of the file.
}
} // end while loop
// Process any remaining events in the last batch
if (!currentBatch.isEmpty()) {
processBatch(currentBatch);
importedCount += currentBatch.size();
}
} catch (IOException e) {
logger.error("Error reading file {}: {}", filename, e.getMessage(), e);
throw e;
} finally {
logger.info("Import process finished. Total events successfully imported in batches: {}", importedCount);
}
logger.info("All Portal Events have been imported successfully from {}. Have fun!!!", filename);
}
/**
* Processes a batch of CloudEvents by inserting them into the database in a single transaction.
* @param batch The list of CloudEvents to insert.
*/
private void processBatch(List<CloudEvent> batch) {
// --- Transaction Management ---
// The transaction logic is ideally handled inside dbProvider.insertEventStore
// or by a wrapper method if insertEventStore doesn't handle transactions internally.
Result<String> eventStoreResult = dbProvider.insertEventStore(batch.toArray(new CloudEvent[0]));
if(eventStoreResult.isFailure()) {
logger.error("Failed to insert batch of {} events. Rollback occurred. Error: {}", batch.size(), eventStoreResult.getError());
// In a CLI, failing the batch often means stopping the entire import process
// to ensure data integrity, as a full rollback on the entire batch has occurred.
// If you want to continue, you would need complex tracking of failed batches.
// For now, logging the error is sufficient, and the method returns.
} else {
logger.info("Imported batch of {} records successfully.", batch.size());
}
}
}
Key Usage Examples for the CLI
When calling the CLI, you pass the mutation rules as a single JSON string (often enclosed in single quotes '...' in the shell):
1. Replace Host ID (Tenant Migration)
You moved from old_host_uuid to new_host_uuid.
java -jar importer.jar -f events.log -r '[{"field": "hostId", "from": "OLD_HOST_UUID", "to": "NEW_HOST_UUID"}]'
2. Replace Host ID and Generate New Aggregate IDs (Full Isolation)
You want to map the old userId to a new userId and generate new eventIds and subject (aggregate ID).
java -jar importer.jar -f events.log \
-r '[{"field": "hostId", "from": "OLD_HOST_UUID", "to": "NEW_HOST_UUID"}]' \
-e '[
{"field": "id", "action": "generateUUID"},
{"field": "subject", "action": "generateUUID"},
{"field": "originalUserId", "action": "mapAndGenerate", "sourceField": "userId"}
]'
(Note: For the user mapping, you would need a custom solution that first reads a mapping table or performs a one-time query to get the originalUserId from a previous step, and then uses the mapping to generate the new ID consistently.)
Product Version Config
When using light-portal to manage the configurations for APIs or apps, the configuration can be overridden at different levels. On top of the platform default, the product level and the product version level are used very often.
There are two options:
- Extract the config files from the product jar and create the events for mapping. This includes all config and config properties in the jar file per product and product version.
  Pros:
  - Can be done automatically with a process.
  - Standardized and unlikely to introduce mistakes.
  Cons:
  - It cannot be customized per organization.
- Manually create events for mappings per product and per product version for the properties that are potentially changeable.
  Pros:
  - Flexible and customizable per organization.
  - Can be refined into a process over time.
  Cons:
  - May take some time to create and maintain the event file for every release.
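The override levels described at the start of this section can be pictured as a layered lookup. The sketch below is only an illustration, assuming that the most specific level wins (product version, then product, then platform default). ConfigOverrideSketch, the key layout, and the property and product names are all hypothetical and are not taken from light-portal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical layered lookup; assumes the most specific level takes precedence.
class ConfigOverrideSketch {
    final Map<String, String> platformDefault = new HashMap<>();     // key: property
    final Map<String, String> productLevel = new HashMap<>();        // key: productId|property
    final Map<String, String> productVersionLevel = new HashMap<>(); // key: productId|version|property

    Optional<String> resolve(String productId, String version, String property) {
        String value = productVersionLevel.get(productId + "|" + version + "|" + property);
        if (value == null) {
            value = productLevel.get(productId + "|" + property);
        }
        if (value == null) {
            value = platformDefault.get(property);
        }
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        ConfigOverrideSketch config = new ConfigOverrideSketch();
        config.platformDefault.put("example.timeout", "5000");
        config.productLevel.put("product-a|example.timeout", "3000");
        config.productVersionLevel.put("product-a|2.0.0|example.timeout", "1000");

        System.out.println(config.resolve("product-a", "2.0.0", "example.timeout").orElse("unset"));
        System.out.println(config.resolve("product-a", "1.0.0", "example.timeout").orElse("unset"));
        System.out.println(config.resolve("product-b", "1.0.0", "example.timeout").orElse("unset"));
    }
}
```

Either option above ends up populating the product and product version maps; they differ only in whether the entries come from the jar automatically or are curated by hand.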
Optimistic Concurrency Control (OCC)
In the previous document, optimistic-pessimistic-ui, we decided to use OCC to prevent multiple users from updating the same aggregate at the same time from different browser sessions.
With OCC, there is a single point of necessary trust: the read model must be consistent enough to support the OCC check.
The concern here is the core trade-off of CQRS: Eventual Consistency.
The Problem: When Eventual Consistency Breaks OCC
Your system’s flow is:
- Read (UI): Reads ReadModel (V=5) from Projection DB.
- Write (Command Handler):
  - Command arrives with expectedVersion=5.
  - Handler verifies against Event Store (Source of Truth): EventStore.currentVersion must be 5.
- The Stale Read Model Gap (The Problem):
  - Event E6 is processed by the Command Handler and committed to EventStore (V=6).
  - Before the Consumer applies E6 to the Projection DB, the UI reads.
  - UI still reads ReadModel (V=5) (STALE).
  - User submits Command2 (expectedVersion=5).
  - The Conflict: The Command Handler checks EventStore.currentVersion which is now 6. It sees 6 != 5 and throws a ConcurrencyException.
Result: The user is incorrectly told there was a conflict and must refresh, even though their original read was perfectly valid and their change was submitted before any other user’s command. The issue is that the read model was too slow to reflect the change that already happened in the source of truth.
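The gap described above can be reproduced with two counters. This is a minimal simulation, assuming a hypothetical StaleReadSketch class in which eventStoreVersion plays the source of truth, readModelVersion plays the projection, and IllegalStateException stands in for the ConcurrencyException.

```java
// Hypothetical simulation of a read model lagging behind the event store.
class StaleReadSketch {
    long eventStoreVersion = 5; // source of truth
    long readModelVersion = 5;  // projection, updated asynchronously by the consumer

    // Command handler: OCC check against the event store, then append one event.
    void handle(long expectedVersion) {
        if (eventStoreVersion != expectedVersion) {
            throw new IllegalStateException(
                "Concurrency conflict. Expected version " + expectedVersion
                    + ", actual " + eventStoreVersion);
        }
        eventStoreVersion++;
    }

    // Consumer catches up and applies the pending events to the projection.
    void project() {
        readModelVersion = eventStoreVersion;
    }

    public static void main(String[] args) {
        StaleReadSketch system = new StaleReadSketch();
        system.handle(system.readModelVersion);      // first command commits E6
        long staleVersion = system.readModelVersion; // UI reads before the consumer runs
        try {
            system.handle(staleVersion);             // second command carries the stale version
        } catch (IllegalStateException e) {
            System.out.println("False conflict: " + e.getMessage());
        }
        system.project();                            // consumer applies E6
        system.handle(system.readModelVersion);      // the same command now succeeds
        System.out.println("Event store version: " + system.eventStoreVersion);
    }
}
```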
The Solution: Shift the OCC Check to the Event Store’s Version
The best way to handle this and eliminate the dependency on the read model’s consistency is to ensure the UI’s OCC is based on the authoritative version from the Event Store itself.
Here are three practical options for injecting the authoritative version.
The “best” option balances data consistency (critical) against performance and complexity (practical). Given the context of a high-performance CQRS/ES application, here is the evaluation and recommendation.
Evaluation of Options for OCC Version Retrieval
| Option | Where Version is Fetched | Consistency Status | Performance Impact | Complexity | Evaluation |
|---|---|---|---|---|---|
| 1. Join with event_store_t (Pagination Query) | Read Model + Event Store | Authoritative (Best) | High (Slows down every page load, large joins are expensive). | High (Complex SQL, need to avoid full table scans). | POOR (Breaks Read Performance/Scalability). |
| 2. Button Click/Form Load | Dedicated Version Service (Event Store) | Authoritative (Best) | Low/Moderate (1 extra, quick, targeted query per form load). | Low/Moderate (Easy to implement service). | GOOD (Decouples Read/Write, best UX). |
| 3. Command Submission | Dedicated Version Service (Event Store) | Authoritative (Best) | Low (1 extra query per command). | Low/Moderate (Easy to implement service). | GOOD but FLAWED UX (Causes more false failures). |
The Recommended Option: Option 2 (Button Click / Form Load)
Fetch the authoritative version when the user initiates the edit (button click / form load).
Why Option 2 is the Best Balance:
- Highest Consistency & UX: It provides the highest level of consistency without sacrificing the performance of the common “list entities” query. When the user loads the edit form, they are guaranteed to see the latest version. If another user commits a change before the form loads, the user will see the newest data and version, preventing the immediate “false conflict.”
- Performance Preservation: The most frequently executed query (queryAllEntitiesWithPagination) remains fast, hitting only the optimized Projection DB. The extra query (VersionLookup) only runs when a user takes the action to edit, which is a rare event compared to listing.
- Simplicity: It requires a simple, dedicated, fast endpoint in your backend (e.g., /api/version/role/{id}) that executes the SELECT MAX(sequence_number) ... query against your event_store_t.
Why the Other Options Fail:
- Option 1 (Join with Pagination Query): Fails Scalability. Joining a wide, paginated projection table with a potentially massive, ever-growing event_store_t table (even with indexes) is a performance killer. It makes every single query slow. You use CQRS to avoid this kind of cross-cutting query.
- Option 3 (Command Submission): Fails User Experience.
  - User loads data (Version 5).
  - User spends 5 minutes making changes.
  - During those 5 minutes, another user commits V6 and V7.
  - User submits Command (expectedVersion=5).
  - Handler fetches latest version (V7). Conflict: 7 != 5.
  - User is rejected and loses 5 minutes of work.
  - By contrast, Option 2 would have made the user refresh immediately upon clicking ‘Edit’ (because the version check would have failed then), saving the user from losing their work.
Implementation Flow for Option 2 (The Correct Flow)
- UI/List View: Populated from Projection.queryEntities(offset, limit, filters). This query is fast and returns the version from the Read Model (which might be stale).
- User Action: User clicks the “Edit” button for role_id=R1.
- Backend Call 1 (Version Check): UI calls a dedicated endpoint: /api/write/version/{aggregate_id} (R1). The backend executes SELECT MAX(aggregate_version) FROM event_store_t WHERE aggregate_id = 'R1' and returns currentVersion = V.
- Version Comparison 1: Compare V with the aggregate_version of the UI form data derived from the list view. If they are the same, no further action is needed.
- Backend Call 2: If the form data version is less than the V from event_store_t, UI calls /api/read/role/{id} to get fresh form data from the Read Model.
- Version Comparison 2: Compare V with the aggregate_version reloaded from the Read Model. In most cases they are the same. However, the Read Model might not be updated yet if there is consumer lag. In this case, an error message is shown on the UI asking the user to wait several minutes and refresh. If the problem persists, the user needs to report it to the support team to get the issue resolved.
- UI Form: Data is populated. A hidden field is set to aggregateVersion = V.
- User Submission: UI sends UpdateCommand(..., expectedVersion=V) to the command endpoint.
- Command Handler: Executes OCC check against the Event Store. This check is now authoritative and highly likely to succeed.
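The two version comparisons in the flow above can be sketched as a small decision function. This is only an illustration: the class, enum, and method names are hypothetical and not part of light-portal, and the versions are assumed to be plain long values.

```java
// Hypothetical helper: names are illustrative only, not light-portal APIs.
final class EditVersionCheck {

    enum Decision {
        OPEN_FORM,          // versions match, populate the form and set aggregateVersion
        RELOAD_READ_MODEL,  // form data is stale, call /api/read/role/{id}
        READ_MODEL_LAGGING  // still stale after reload, show the "try again later" message
    }

    /**
     * @param formVersion          aggregate_version held by the UI (list view or reloaded data)
     * @param authoritativeVersion V returned by the version-check endpoint
     * @param alreadyReloaded      true once the form data was re-read from the Read Model
     */
    static Decision decide(long formVersion, long authoritativeVersion, boolean alreadyReloaded) {
        if (formVersion == authoritativeVersion) {
            return Decision.OPEN_FORM;
        }
        if (!alreadyReloaded) {
            return Decision.RELOAD_READ_MODEL;   // Version Comparison 1 failed
        }
        return Decision.READ_MODEL_LAGGING;      // Version Comparison 2 failed (consumer lag)
    }
}
```

In this sketch the caller invokes decide once with the list view version and, if asked to reload, a second time with the reloaded version and alreadyReloaded set to true.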
Aggregate Version in Projection
Adding aggregate_version in all tables in read models is the most common, reliable, and scalable pattern to implement Optimistic Concurrency Control (OCC) in a CQRS/Event Sourcing system that uses a relational database for its read models.
Confirmation of the OCC Pattern
| Component | Responsibility for OCC | Details |
|---|---|---|
| Projection Tables (Read Model) | Store the Version | Required: Must have an aggregate_version column (e.g., BIGINT) on every entity row that represents an Aggregate Root. |
| Pagination/List Query (UI Read) | Retrieve the Version | Required: The API endpoint for listing entities must include the aggregate_version column in its SELECT statement and return it to the UI. |
| UI Form (Client) | Hold the Version | Required: The UI must store this retrieved aggregate_version (often in a hidden field) and rename it to expectedVersion for the next command. |
| Command Handler (Write Model) | Perform the Check | Required: When the command arrives, check: EventStore.actualVersion MUST EQUAL command.expectedVersion. |
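As a rough sketch of the last row of the table, the check in the command handler could look like the following. The class and exception names are hypothetical, and the rule that the next event carries actualVersion + 1 is an assumption made for illustration.

```java
// Hypothetical sketch of the OCC check; not the actual light-portal handler.
final class OccCheck {

    /** Thrown when the command was built on a version that is no longer current. */
    static final class ConcurrencyConflictException extends RuntimeException {
        ConcurrencyConflictException(long expected, long actual) {
            super("expectedVersion=" + expected + " but actualVersion=" + actual);
        }
    }

    /**
     * Compares the version stored in the Event Store with the version the UI observed.
     * Returns the version for the new event (assumed here to be actualVersion + 1).
     */
    static long checkAndNextVersion(long actualVersion, long expectedVersion) {
        if (actualVersion != expectedVersion) {
            throw new ConcurrencyConflictException(expectedVersion, actualVersion);
        }
        return actualVersion + 1;
    }
}
```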
Summary of Why This is Necessary
- Atomicity of the Check: The aggregate_version in the read model serves as the handle for the OCC check. The UI has to pass some authoritative marker of the state it observed.
- Decoupling: By having the version in the read model, you avoid performing costly SELECT MAX(sequence_number) queries against the event_store_t for every single row in the pagination result. Instead, you only perform the authoritative version lookup (or the OCC check itself) on the one specific record the user is attempting to modify.
- Read/Write Split: This solution maintains the separation of concerns:
- Read Side: Fast, optimized for retrieval.
- Write Side: Slow, transactionally consistent, responsible for the final state check.
Final Recommendation:
Yes, we must include aggregate_version in all projected tables that are used as the basis for user updates, and it must be part of the data retrieved by the UI’s list queries.
This is the non-negotiable step to ensuring your access control system prevents the dangerous “Last-Write-Wins” scenario.
Refresh Data for Edit
We need to get the latest data after the user clicks the ‘Edit’ button. There are two ways to get the latest data: Read Model or Replay. Let’s clarify exactly what data consistency level is needed for the “Edit” form.
The answer is: You should read the data from the Read Model (Projection) and retrieve the latest aggregate_version from the Event Store.
You should NOT replay the Event Store to populate the UI form.
Analysis of the Two Read Operations
| Operation | Source | Purpose | Consistency Level | Performance |
|---|---|---|---|---|
| Data Retrieval | Read Model (role_t Projection) | To populate the UI form fields (name, description, etc.). | Eventual (It’s the data the user sees). | Fast (Single row lookup by PK). |
| Version Retrieval | Event Store (event_store_t) | To provide the authoritative expectedVersion for OCC. | Strictly Authoritative (Source of Truth). | Fast (Single SELECT MAX(sequence_number) WHERE aggregate_id=? query). |
| Replay Operation | Event Store (event_store_t) | To reconstruct the current state by re-running all events. | Source of Truth (Highest fidelity). | Slow (Involves reading many rows, deserialization, and business logic execution). |
Why Combining Read Model + Version Lookup is Best
The flow for the /api/read/role/{id} endpoint should be:
1. Retrieve Authoritative Version:
- Execute: SELECT MAX(sequence_number) AS authoritative_version FROM event_store_t WHERE aggregate_id = ?
- (This is fast).
2. Retrieve Data (The actual form fields):
- Execute: SELECT * FROM role_t WHERE role_id = ?
- (This is also fast).
3. Combine and Return:
- Return the data from the Read Model and replace the aggregate_version in the final JSON with the authoritative_version retrieved in Step 1.
// Final API Response
{
  "roleId": "R1",
  "roleDesc": "...", // Data from Read Model
  "updateUser": "...", // Data from Read Model
  "aggregateVersion": [Authoritative_Version_from_ES] // Replaced version from Event Store
}
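The combine step can be sketched in a few lines of Java. This is an assumption-laden illustration: the helper name is hypothetical, and the Read Model row is assumed to be available as a Map keyed by the JSON field names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: overlays the Event Store version on the Read Model row.
final class EditFormResponse {

    static Map<String, Object> withAuthoritativeVersion(Map<String, Object> readModelRow,
                                                        long authoritativeVersion) {
        // Copy first so the caller's row is left untouched.
        Map<String, Object> response = new LinkedHashMap<>(readModelRow);
        // Replace the possibly stale version with the one from the Event Store.
        response.put("aggregateVersion", authoritativeVersion);
        return response;
    }
}
```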
Reasons for this Approach:
- Speed (Performance): Replaying the Event Store is computationally expensive and slow. You would never do this for a simple read-to-populate-a-form scenario unless the Read Model was completely missing the data. Using the existing projected data for the form fields is orders of magnitude faster.
- Decoupling (Read/Write Split): This maintains the CQRS separation. The Read Model is still the source for what is displayed to the user. The Event Store is only queried for the transactional anchor (MAX(sequence_number)).
- Consistency (High Enough): By replacing the read model’s version (V_stale) with the authoritative version (V_authoritative), you achieve transactional consistency for the critical OCC field, and high eventual consistency for the form data (since the event consumer should be very fast at updating the read model).
What if the Read Model is Out of Sync?
The only time this approach is problematic is if the read model is severely stale (e.g., the consumer has been down for hours). In this case, the form data the user sees might not reflect the most recent events, even though the aggregateVersion is correct.
- Example: Event E6 added a field is_admin = true. The consumer is down. The Read Model doesn’t have is_admin = true. The UI loads the form, sets aggregateVersion=V6, but the form field for is_admin is missing.
- Mitigation: The expectation in a healthy CQRS system is that the read model lag is measured in milliseconds, not minutes or hours. If lag is severe, the solution is to fix the consumer and the Eventual Consistency pipeline, not to slow down every read operation by resorting to full Event Replay.
Conclusion: The solution is to mix and match: Read data from the projection, but read the version from the source of truth (Event Store).
Eventual Consistency Trade Off
The availability/consistency trade-off is the most nuanced and important philosophical point in CQRS/Event Sourcing.
It is correct that Option 2 (Version Lookup on Button Click) temporarily breaks pure eventual consistency for the purpose of transactional integrity.
Here is the detailed elaboration on why this is necessary and how it redefines the consistency boundary, rather than fundamentally destroying the trade-off.
1. The Principle of Eventual Consistency (EC)
- Definition: The system state (Read Model) will eventually equal the Source of Truth (Write Model/Event Store) after a small delay ($\Delta t$).
- Trade-Off: You trade strong/immediate consistency for high availability and high performance (speed of writes and reads).
- The Acceptable Lie: The Read Model is allowed to lie for $\Delta t$ seconds.
2. The Unacceptable Lie: Breaking Transactional Integrity
The moment a user wants to perform a write operation, the system must enforce Strong Consistency for that single transaction, regardless of the CQRS pattern.
- Goal of the Transaction: To guarantee that the command (write) is based on a known, singular, correct state of the Aggregate.
- The Problem: If we use the stale version from the Projection DB (V_stale), and the Write Model is at V_authoritative, one of two things happens:
- If V_authoritative > V_stale (Stale Read): The command is rejected (correctly, by the Command Handler’s OCC check). The user is told to refresh.
- If we tried to bypass OCC: A new event is generated based on stale data, potentially creating an invalid state (e.g., inventory going negative). This is a data integrity failure.
Conclusion: For the Write Path, you must have Strong Consistency. The Write Path does not participate in the EC trade-off.
3. Why Option 2 is the Best Synthesis (The Redefined Trade-Off)
Option 2 (query the Projection AND the Event Store) is a controlled and highly localized violation of pure EC that elevates transactional integrity.
| Operation | Consistency Mechanism | Status |
|---|---|---|
| A. List View | Eventual Consistency (EC) | Hits Projection DB only. Fast. Can be stale. $\checkmark$ |
| B. Button Click/Form Load | Read-Your-Own-Writes Consistency (RYOW) / Strong Consistency | Hits Event Store for Version ONLY. Checks V_stale against V_authoritative. If V_stale < V_authoritative, the read is aborted, forcing a fresh, consistent read for the form. $\checkmark$ |
| C. Command Submission | Optimistic Concurrency Control (OCC) / Strong Consistency | Hits Event Store for the final check. Guarantees data integrity. $\checkmark$ |
The “Controlled Violation” Justification
- Only for the Edit Form: The slow, authoritative query is only done for one Aggregate when a user chooses to edit it. This minimizes performance impact.
- Only for the Version: The authoritative query (SELECT MAX(sequence_number)) is the lightest possible query against the Event Store, ensuring minimum latency for this strong consistency check. It doesn’t incur the cost of full Event Replay or full Projection DB reads.
- Redefining the UX Contract: You are establishing a clear contract: “The List View is for quick display (EC). The Edit Form guarantees that the state you see is the last known state, and we checked the Source of Truth just for you (Strong Consistency).”
The Availability Trade-off is NOT Broken:
- Availability: The Write Model (Command Handler) is still available. The Read Model (Projection) is still available. The system has not halted.
- Performance: The List View remains fast (available). The Write Path remains fast (available).
- The Cost: The cost is one extra, fast, authoritative query right before the edit begins, which is a small price to pay for preventing a major data integrity or user experience failure.
Summary
The pattern you are implementing with Option 2 (Projection Data + Event Store Version) is the standard, correct way to implement OCC across the Read-Write boundary in a performant CQRS/ES system. It is a necessary local “tightening” of the eventual consistency model to ensure transactional correctness where it absolutely matters: at the point of data modification.
Cascade Soft Delete
With the recent refactor, relying on ON DELETE CASCADE is no longer suitable after implementing soft deletes, because soft delete is an UPDATE operation (SET active = FALSE) and not a true DELETE from the database.
The pattern we should follow in an Event Sourcing / Event-Driven Architecture with soft deletes is:
1. The Principle of Causality (or Domain Consistency)
When a parent entity (e.g., role_t) is soft-deleted, all its dependent children entities (e.g., role_user_t, role_permission_t, etc.) must also be soft-deleted to maintain domain consistency. This cascade logic must be implemented explicitly, either in the application layer (the command handler or the projection service) or in the database.
2. Implementation in the Command/Event Handler/Database
Strategy A: Event Amplification
The command handler that received the initial command/event (e.g., DeleteRoleCommand -> RoleDeletedEvent) should not directly perform the cascading database updates. Instead, it should be responsible for emitting new cascading events for each child entity.
- Incoming Command: The handler generates a RoleDeletedEvent (for a specific role_id).
- Emitting Child Events: It then emits an event for each dependent child, such as RoleUserRemovedEvent(role_id, user_id) and RolePermissionRemovedEvent(role_id, permission_id).
- Event Store: Push an array of events to the event_store_t and outbox_message_t tables in a single transaction.
- Event Processor: All events are processed in the same transaction to update the parent table and child tables together.
Pro: Decoupled and explicit, with an audit trail for every change.
Con: More complex event processing and increased event volume. All delete command handlers need to be refactored to emit more events, which is a significant code change and long-term maintenance work.
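To make Strategy A more concrete, here is a minimal sketch of how a delete command handler might expand one command into the parent event plus one event per child. The Event class and the amplify method are hypothetical, and looking up the child IDs is assumed to happen before this call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of event amplification for a role delete.
final class RoleDeleteAmplifier {

    /** Minimal event shape for illustration only. */
    static final class Event {
        final String type;
        final String roleId;
        final String childId; // null for the parent event

        Event(String type, String roleId, String childId) {
            this.type = type;
            this.roleId = roleId;
            this.childId = childId;
        }
    }

    /** Builds the array of events to be stored together in one transaction. */
    static List<Event> amplify(String roleId, List<String> userIds, List<String> permissionIds) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("RoleDeletedEvent", roleId, null));
        for (String userId : userIds) {
            events.add(new Event("RoleUserRemovedEvent", roleId, userId));
        }
        for (String permissionId : permissionIds) {
            events.add(new Event("RolePermissionRemovedEvent", roleId, permissionId));
        }
        return events;
    }
}
```

The resulting list corresponds to the array of events pushed to event_store_t and outbox_message_t in the same transaction.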
Strategy B: Direct Application-Level Cascade
In a service that primarily acts as a projection (CQRS read model) and is tightly coupled with its projection logic, the simplest approach is to bundle the cascading logic directly into the parent handler’s processing.
- Incoming Event: RoleDeletedEvent.
- Event Processor: The deleteRole(conn, event) method would execute the parent soft delete (UPDATE role_t SET active=FALSE).
- Cascading Updates: Immediately after, within the same transaction, it would execute multiple cascading UPDATE statements on the child tables. Make sure that only the active flag is updated based on the primary key for child tables.
// Inside deleteRole(Connection conn, Map<String, Object> event)
// 1. Soft delete the parent
// UPDATE role_t SET active = FALSE WHERE ...
// 2. Soft delete the children in the same transaction
// UPDATE role_user_t SET active = FALSE, update_user = ?, update_ts = ? WHERE host_id = ? AND role_id = ?
// UPDATE role_permission_t SET active = FALSE, update_user = ?, update_ts = ? WHERE host_id = ? AND role_id = ?
Pro: Simple, fast, maintains transactional integrity easily.
Con: Tightly couples the projection logic; no explicit events for child deletion in the event store; many db provider updates and long-term maintenance work.
Strategy C: Direct Database-Level Cascade
Create a trigger in the database to manage the cascade soft delete for child tables. This can be an individual trigger on each table or a centralized trigger function applied to all tables.
Pro: Simple, fast, maintains transactional integrity easily. Minimal code change in the app logic, and easy to implement and maintain.
Con: The project team must be aware of this logic to avoid confusion.
Create a cascade_relationships_v view based on the foreign keys.
-- create a view to simplify the foreign key relationship.
DROP VIEW IF EXISTS cascade_relationships_v;
CREATE VIEW cascade_relationships_v AS
WITH fk_details AS (
SELECT
pn.nspname::text AS parent_schema,
pc.relname::text AS parent_table,
cn.nspname::text AS child_schema,
cc.relname::text AS child_table,
c.conname::text AS constraint_name,
c.oid AS constraint_id,
cc.oid AS child_table_oid,
pc.oid AS parent_table_oid,
unnest.parent_col,
unnest.child_col,
unnest.ord
FROM pg_constraint c
JOIN pg_class pc ON c.confrelid = pc.oid
JOIN pg_namespace pn ON pc.relnamespace = pn.oid
JOIN pg_class cc ON c.conrelid = cc.oid
JOIN pg_namespace cn ON cc.relnamespace = cn.oid
CROSS JOIN LATERAL (
SELECT
unnest(c.confkey) AS parent_col,
unnest(c.conkey) AS child_col,
generate_series(1, array_length(c.conkey, 1)) AS ord
) unnest
WHERE c.contype = 'f'
)
SELECT
fd.parent_schema,
fd.parent_table,
fd.child_schema,
fd.child_table,
fd.constraint_name,
-- Human readable mapping
string_agg(
format('%I → %I',
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.parent_table_oid
AND attnum = fd.parent_col),
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.child_table_oid
AND attnum = fd.child_col)
),
', ' ORDER BY fd.ord
) AS foreign_key_mapping,
-- Structured data for trigger
jsonb_object_agg(
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.parent_table_oid
AND attnum = fd.parent_col),
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.child_table_oid
AND attnum = fd.child_col)
) AS foreign_key_json,
-- Arrays for easier processing
array_agg(
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.parent_table_oid
AND attnum = fd.parent_col)
ORDER BY fd.ord
) AS parent_columns,
array_agg(
(SELECT attname FROM pg_attribute
WHERE attrelid = fd.child_table_oid
AND attnum = fd.child_col)
ORDER BY fd.ord
) AS child_columns,
COUNT(*) AS column_count,
fd.child_table_oid,
fd.parent_table_oid,
-- Check for required columns
EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.parent_table_oid
AND a.attname = 'delete_ts'
AND NOT a.attisdropped
) AS parent_has_delete_ts,
EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.child_table_oid
AND a.attname = 'delete_ts'
AND NOT a.attisdropped
) AS child_has_delete_ts,
EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.parent_table_oid
AND a.attname = 'delete_user'
AND NOT a.attisdropped
) AS parent_has_delete_user,
EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.child_table_oid
AND a.attname = 'delete_user'
AND NOT a.attisdropped
) AS child_has_delete_user
FROM fk_details fd
-- Only include relationships where both tables have deletion tracking
WHERE EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.parent_table_oid
AND a.attname = 'delete_ts'
AND NOT a.attisdropped
) AND EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = fd.child_table_oid
AND a.attname = 'delete_ts'
AND NOT a.attisdropped
)
GROUP BY
fd.parent_schema, fd.parent_table,
fd.child_schema, fd.child_table,
fd.constraint_name, fd.constraint_id,
fd.child_table_oid, fd.parent_table_oid
ORDER BY fd.parent_schema, fd.parent_table, fd.child_schema, fd.child_table;
To test the view above.
SELECT * FROM cascade_relationships_v
WHERE parent_table = 'api_t' AND child_table = 'api_version_t';
And the result.
parent_schema parent_table child_schema child_table constraint_name foreign_key_mapping foreign_key_json parent_columns child_columns column_count child_table_oid parent_table_oid parent_has_delete_ts child_has_delete_ts parent_has_delete_user child_has_delete_user
------------- ------------ ------------ ------------- --------------------------------- ---------------------------------- ------------------------------------------ -------------------- -------------------- ------------ --------------- ---------------- -------------------- ------------------- ---------------------- ---------------------
public api_t public api_version_t api_version_t_host_id_api_id_fkey host_id → host_id, api_id → api_id {"api_id": "api_id", "host_id": "host_id"} ["host_id","api_id"] ["host_id","api_id"] 2 360279 360268 true true true true
Create a trigger function that cascades both setting active to false (soft delete) and setting it back to true (restore).
CREATE OR REPLACE FUNCTION smart_cascade_soft_delete()
RETURNS TRIGGER AS $$
DECLARE
fk_record RECORD;
where_clause TEXT;
query_text TEXT;
column_index INT;
current_user_name TEXT;
deletion_context TEXT;
deletion_context_pattern TEXT;
delete_timestamp TIMESTAMP;
BEGIN
-- Get current user
current_user_name := current_user;
-- Handle SOFT DELETE (active = false)
IF NEW.active = FALSE AND OLD.active = TRUE THEN
-- Generate deletion timestamp
delete_timestamp := CURRENT_TIMESTAMP;
-- Set deletion context
deletion_context := format('PARENT_CASCADE_%s_%s',
TG_TABLE_NAME,
to_char(delete_timestamp, 'YYYYMMDD_HH24MISSMS')
);
-- Update parent with deletion context if columns exist
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = TG_TABLE_SCHEMA
AND table_name = TG_TABLE_NAME
AND column_name = 'delete_user'
) THEN
NEW.delete_user := deletion_context;
END IF;
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = TG_TABLE_SCHEMA
AND table_name = TG_TABLE_NAME
AND column_name = 'delete_ts'
) THEN
NEW.delete_ts := delete_timestamp;
END IF;
-- Update parent's update columns
NEW.update_ts := delete_timestamp;
NEW.update_user := current_user_name;
FOR fk_record IN
SELECT *
FROM cascade_relationships_v
WHERE parent_schema = TG_TABLE_SCHEMA
AND parent_table = TG_TABLE_NAME
LOOP
-- Build WHERE clause
where_clause := '';
FOR column_index IN 1..fk_record.column_count LOOP
IF column_index > 1 THEN
where_clause := where_clause || ' AND ';
END IF;
where_clause := where_clause || format(
'%I = $1.%I',
fk_record.child_columns[column_index],
fk_record.parent_columns[column_index]
);
END LOOP;
-- Add condition to only update currently active records
where_clause := where_clause || ' AND active = TRUE';
-- Cascade the soft delete with context
query_text := format(
'UPDATE %I.%I
SET active = FALSE,
delete_ts = $2,
delete_user = $3,
update_ts = $2,
update_user = $4
WHERE %s',
fk_record.child_schema,
fk_record.child_table,
where_clause
);
EXECUTE query_text USING OLD, delete_timestamp, deletion_context, current_user_name;
END LOOP;
-- Handle RESTORE (active = true)
ELSIF NEW.active = TRUE AND OLD.active = FALSE THEN
-- Only restore children that were deleted by parent cascade
FOR fk_record IN
SELECT *
FROM cascade_relationships_v
WHERE parent_schema = TG_TABLE_SCHEMA
AND parent_table = TG_TABLE_NAME
LOOP
-- Pattern to match cascade deletions
deletion_context_pattern := format('PARENT_CASCADE_%s_%%', TG_TABLE_NAME);
-- Build WHERE clause
where_clause := '';
FOR column_index IN 1..fk_record.column_count LOOP
IF column_index > 1 THEN
where_clause := where_clause || ' AND ';
END IF;
where_clause := where_clause || format(
'%I = $1.%I',
fk_record.child_columns[column_index],
fk_record.parent_columns[column_index]
);
END LOOP;
-- Only restore cascade-deleted records
where_clause := where_clause ||
' AND delete_user LIKE $2 AND active = FALSE';
-- Restore the records
query_text := format(
'UPDATE %I.%I
SET active = TRUE,
delete_ts = NULL,
delete_user = NULL,
update_ts = CURRENT_TIMESTAMP,
update_user = $3
WHERE %s',
fk_record.child_schema,
fk_record.child_table,
where_clause
);
EXECUTE query_text USING OLD, deletion_context_pattern, current_user_name;
END LOOP;
-- Clear parent's deletion context
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = TG_TABLE_SCHEMA
AND table_name = TG_TABLE_NAME
AND column_name = 'delete_user'
) THEN
NEW.delete_user := NULL;
END IF;
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = TG_TABLE_SCHEMA
AND table_name = TG_TABLE_NAME
AND column_name = 'delete_ts'
) THEN
NEW.delete_ts := NULL;
END IF;
-- Update parent's update columns
NEW.update_ts := CURRENT_TIMESTAMP;
NEW.update_user := current_user_name;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Install the trigger.
-- Apply cascade triggers only to tables that have BOTH active AND delete_ts columns
DO $$
DECLARE
table_record RECORD;
has_active_column BOOLEAN;
has_delete_ts_column BOOLEAN;
BEGIN
FOR table_record IN
SELECT
n.nspname AS schema_name,
c.relname AS table_name,
c.oid AS table_oid
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind = 'r' -- Regular tables only
AND n.nspname NOT IN ('pg_catalog', 'information_schema')
AND EXISTS (
SELECT 1 FROM pg_constraint con
JOIN pg_class ref ON con.confrelid = ref.oid
WHERE con.contype = 'f'
AND ref.oid = c.oid
)
LOOP
-- Check if table has required columns
SELECT EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = table_record.table_oid
AND a.attname = 'active'
AND NOT a.attisdropped
) INTO has_active_column;
SELECT EXISTS (
SELECT 1 FROM pg_attribute a
WHERE a.attrelid = table_record.table_oid
AND a.attname = 'delete_ts'
AND NOT a.attisdropped
) INTO has_delete_ts_column;
IF NOT (has_active_column AND has_delete_ts_column) THEN
RAISE NOTICE 'Skipping %.% - missing required columns (active: %, delete_ts: %)',
table_record.schema_name, table_record.table_name,
has_active_column, has_delete_ts_column;
CONTINUE;
END IF;
-- Drop existing trigger if it exists
EXECUTE format(
'DROP TRIGGER IF EXISTS trg_cascade_soft_ops ON %I.%I',
table_record.schema_name, table_record.table_name
);
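-- Create new trigger. It fires BEFORE the update because the function
-- assigns to NEW.*; PostgreSQL ignores changes to NEW in AFTER row triggers.
EXECUTE format(
'CREATE TRIGGER trg_cascade_soft_ops
BEFORE UPDATE OF active ON %I.%I
FOR EACH ROW
EXECUTE FUNCTION smart_cascade_soft_delete()',
table_record.schema_name, table_record.table_name
);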
RAISE NOTICE 'Created cascade trigger on %.%',
table_record.schema_name, table_record.table_name;
END LOOP;
END $$;
The above approach has the following benefits.
- Clean separation: delete_ts/delete_user are dedicated to soft delete tracking
- Clear semantics: Easy to understand and query
- No interference: Doesn’t conflict with update_ts/update_user for normal updates
- Intelligent restoration: Can restore only cascade-deleted records
- Audit trail: Complete history of who deleted what and when
This approach ensures you only restore child entities that were cascade-deleted, maintaining data integrity while providing a clear audit trail.
3. Special Handler for deletion of Host and Org
Because a large number of tables need to be updated when deleting a host or an org, we rely on the database's cascade delete. Deletion of a host or an org is therefore implemented as a hard delete, and the UI should warn users about this.
4. Add delete_ts column to reverse cascade soft delete
After a cascade soft delete for role_t, all child entities are marked as active = false. When the same role is added back, we need to mark all the cascade-deleted child entities as active = true again. However, we must avoid updating rows that were soft deleted individually. By adding a delete_ts column, we can identify the related child entities that were cascade deleted.
5. Update queries to add active = true condition
We need to update some queries in the db provider to add conditions for each joining table with active = true so that only active rows will be returned.
Conclusion:
Based on our team discussion, we are going to:
- Adopt the third option, which uses a database trigger to cascade soft deletes in the same way as the hard cascade delete.
- Change the org and host delete to hard delete.
- Update some queries to add a condition that checks active = true.
Query Active Rows
Since we use soft deletes for most tables in the read model, we need to apply an active = true filter to our queries.
For single-table queries, this is straightforward—we can simply add AND active = true to the query. However, for join queries involving multiple tables, the active = true condition must be applied consistently across all participating tables, ideally in an automatic manner.
There are two approaches we can take on top of the current database provider implementation:
Active in filters
@Override
public Result<String> queryRolePermission(int offset, int limit, String filtersJson, String globalFilter, String sortingJson, String hostId) {
    // ... filters is the List<Map<String, Object>> parsed from filtersJson (parsing omitted)
    boolean isActive = true; // Default to true (active records only)
    // Iterate safely to find and remove the 'active' filter to handle it manually
    if (filters != null) {
        Iterator<Map<String, Object>> it = filters.iterator();
        while (it.hasNext()) {
            Map<String, Object> filter = it.next();
            if ("active".equals(filter.get("id"))) {
                Object val = filter.get("value");
                if (val != null) {
                    isActive = Boolean.parseBoolean(val.toString());
                }
                it.remove(); // Remove from list so dynamicFilter doesn't add it again
                break;
            }
        }
    }
    StringBuilder activeSql = new StringBuilder();
    if (isActive) {
        // Strict consistency: A record is only "active" if all related entities are active
        activeSql.append(" AND rp.active = true");
        activeSql.append(" AND r.active = true");
        activeSql.append(" AND ae.active = true");
        activeSql.append(" AND av.active = true");
    } else {
        // Soft-deleted view: Usually we only care that the specific record itself is inactive
        activeSql.append(" AND rp.active = false");
    }
    // ... append activeSql to the WHERE clause, run the query and return the Result (omitted)
}
Pros
- No need to change the signature, UI and service layer.
Cons
- Need to iterate all filters to find the active flag per call.
Active as a separate parameter
@Override
public Result<String> queryRolePermission(int offset, int limit, String filtersJson, String globalFilter, String sortingJson, boolean active, String hostId) {
    StringBuilder activeSql = new StringBuilder();
    if (active) {
        // Strict consistency: A record is only "active" if all related entities are active
        activeSql.append(" AND rp.active = true");
        activeSql.append(" AND r.active = true");
        activeSql.append(" AND ae.active = true");
        activeSql.append(" AND av.active = true");
    } else {
        // Soft-deleted view: Usually we only care that the specific record itself is inactive
        activeSql.append(" AND rp.active = false");
    }
    // ... append activeSql to the WHERE clause, run the query and return the Result (omitted)
}
Pros
- Logic is simple in the query.
Cons
- Need to change the service layer and UI to add an additional parameter.
Conclusion
We recommend proceeding with Option 2. While it requires an initial refactor of the Service and UI layers, it provides strict type safety and cleaner code.
Reasoning:
- Code Reuse: Option 1 requires repeating the filter iteration logic inside every DAO method. Option 2 keeps DAO methods clean.
- Semantics: The active status affects multiple table joins (Data Integrity), distinguishing it from standard column filters. It should be an explicit argument.
- Maintainability: Option 2 decouples the Database layer from the UI’s JSON structure. If the UI changes how it sends the active status, we only change the extraction logic in the Controller, not every SQL query method.
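With an explicit boolean parameter, the repeated append calls could be factored into one shared helper. The sketch below is only a suggestion: the class and method names are hypothetical, and the table aliases are assumed to be constants written in the code, never user input.

```java
// Hypothetical helper for building the active-row condition of a join query.
final class ActiveFilterSql {

    /**
     * @param active        true for the normal view, false for the soft-deleted view
     * @param primaryAlias  alias of the table the query is mainly about (e.g. "rp")
     * @param joinedAliases aliases of the other joined tables (e.g. "r", "ae", "av")
     */
    static String build(boolean active, String primaryAlias, String... joinedAliases) {
        StringBuilder sql = new StringBuilder();
        sql.append(" AND ").append(primaryAlias).append(".active = ").append(active);
        if (active) {
            // Strict consistency: every joined table must be active as well.
            for (String alias : joinedAliases) {
                sql.append(" AND ").append(alias).append(".active = true");
            }
        }
        return sql.toString();
    }
}
```

For the role permission query, build(active, "rp", "r", "ae", "av") produces the same conditions as the Option 2 snippet.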
Distributed Scheduler Design
Introduction
The Distributed Scheduler is a robust, highly available component of the light-portal architecture that manages the periodic execution of tasks across a cluster of application instances. It ensures that scheduled tasks are executed exactly as defined, even in a distributed environment, by using a database-backed leader election and locking mechanism.
Architecture
The scheduler follows a Leader-Follower pattern to prevent redundant executions and ensure consistency.
- Leader Election: All scheduler instances compete for a global lock in the scheduler_lock_t table.
- Lock Heartbeat: The leader periodically updates its heartbeat to maintain ownership. If the leader fails, another instance will eventually claim the lock after a timeout.
- Polling Loop: Only the leader performs the polling of the schedule_t table for due tasks.
- Task Execution: When a task is due, the scheduler generates the corresponding event into the event_store_t and outbox_message_t tables and updates the next_run_ts for the next occurrence.
Database Schema
schedule_t
Stores the definitions and state of all scheduled tasks.
CREATE TABLE schedule_t (
schedule_id UUID NOT NULL,
host_id UUID NOT NULL,
schedule_name VARCHAR(126) NOT NULL,
frequency_unit VARCHAR(16) NOT NULL, -- e.g., 'MINUTES', 'HOURS', 'DAYS'
frequency_time INTEGER NOT NULL,
start_ts TIMESTAMP WITH TIME ZONE NOT NULL,
next_run_ts TIMESTAMP WITH TIME ZONE NOT NULL,
event_topic VARCHAR(126) NOT NULL,
event_type VARCHAR(126) NOT NULL,
event_data TEXT NOT NULL,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY(schedule_id)
);
CREATE INDEX idx_schedule_active_next_run ON schedule_t (active, next_run_ts);
scheduler_lock_t
Facilitates distributed locking and leader election.
CREATE TABLE scheduler_lock_t (
lock_id INT PRIMARY KEY, -- Static ID for the global scheduler lock
instance_id VARCHAR(255) NOT NULL, -- ID of the holding instance
last_heartbeat TIMESTAMP WITH TIME ZONE NOT NULL
);
Implementation Details
Leader Election and Heartbeat
Instances attempt to acquire the lock by updating the last_heartbeat if the existing heartbeat has expired (e.g., more than 60 seconds ago).
UPDATE scheduler_lock_t
SET instance_id = ?, last_heartbeat = CURRENT_TIMESTAMP
WHERE lock_id = 1 AND (instance_id = ? OR last_heartbeat < ?);
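A minimal JDBC sketch of running this statement is shown below. It assumes that the row with lock_id = 1 already exists, that the expiry threshold is computed from the application clock (the SQL itself uses the database clock for last_heartbeat), and that the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: acquire or renew the global scheduler lock.
final class SchedulerLock {

    /** Heartbeats older than this instant are treated as expired. */
    static Instant expiryThreshold(Instant now, Duration timeout) {
        return now.minus(timeout);
    }

    /** Returns true when this instance holds the lock after the call. */
    static boolean tryAcquire(Connection conn, String instanceId, Duration timeout)
            throws SQLException {
        String sql = "UPDATE scheduler_lock_t "
                + "SET instance_id = ?, last_heartbeat = CURRENT_TIMESTAMP "
                + "WHERE lock_id = 1 AND (instance_id = ? OR last_heartbeat < ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, instanceId);
            ps.setString(2, instanceId);
            ps.setTimestamp(3, Timestamp.from(expiryThreshold(Instant.now(), timeout)));
            // One updated row means we renewed our own lock or took over an expired one.
            return ps.executeUpdate() == 1;
        }
    }
}
```

Calling tryAcquire on a fixed interval shorter than the timeout would serve as both the election attempt and the heartbeat in this sketch.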
Polling Mechanism
The leader queries for tasks where next_run_ts <= CURRENT_TIMESTAMP and active = true.
SELECT * FROM schedule_t
WHERE active = true AND next_run_ts <= CURRENT_TIMESTAMP
ORDER BY next_run_ts ASC
LIMIT ?;
Next Run Timestamp Calculation
After a task is executed, the next_run_ts is incremented based on the frequency_unit and frequency_time.
- Interval-based: Adds the specified amount of time to the next_run_ts.
- Drift Correction: To prevent cumulative drift, the calculation is based on the original start_ts or the previous next_run_ts rather than the actual execution time.
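The calculation can be sketched as follows. This is an illustration under assumptions: frequency_unit values are assumed to match java.time.temporal.ChronoUnit names such as MINUTES, HOURS and DAYS, and skipping occurrences that are already in the past is an assumed policy, not something stated in this design.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Locale;

// Hypothetical sketch of the drift-free next_run_ts calculation.
final class NextRunCalculator {

    static Instant nextRun(Instant previousNextRun, String frequencyUnit,
                           int frequencyTime, Instant now) {
        if (frequencyTime <= 0) {
            throw new IllegalArgumentException("frequencyTime must be positive");
        }
        ChronoUnit unit = ChronoUnit.valueOf(frequencyUnit.toUpperCase(Locale.ROOT));
        // Step from the previous schedule point, not from the actual execution time,
        // so late executions do not accumulate drift.
        Instant next = previousNextRun.plus(frequencyTime, unit);
        // Assumed policy: skip occurrences that are already due or in the past.
        while (!next.isAfter(now)) {
            next = next.plus(frequencyTime, unit);
        }
        return next;
    }
}
```

For example, with a previous next_run_ts of 00:00:00, a 5 minute frequency and an execution at 00:00:07, the next run stays on the 00:05:00 grid instead of moving to 00:05:07.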
Execution Flow
- Leader polls for due tasks.
- For each task:
- Starts a database transaction.
- Inserts the specified event into the event store and outbox message tables.
- Updates next_run_ts in schedule_t.
- Commits the transaction.
- The event is then picked up and processed by the Event Consumer (Kafka or Postgres).
Conclusion
The Distributed Scheduler provides a reliable and scalable way to handle periodic activities within the light-portal, ensuring that tasks are executed predictably and exclusively by a single active leader at any given time.
PostgreSQL Pub/Sub Design
Introduction
The PostgreSQL Pub/Sub mechanism provides an alternative to Kafka for event distribution within the light-portal architecture. It is designed for smaller deployments or environments where Kafka is not available, offering a reliable, low-latency, and strictly ordered event delivery system using native PostgreSQL features.
Architecture
The system utilizes a hybrid Polling + LISTEN/NOTIFY approach to achieve both high reliability and low latency.
1. Logical Partitioning
To support horizontal scalability and ensure ordered processing for multi-tenant environments, the system uses logical partitioning based on the host_id.
- Events are distributed across a fixed number of partitions (e.g., 8 or 16).
- Partition index = abs(hashtext(host_id::text)) % total_partitions.
- Each partition has its own progress tracker in consumer_offsets.
2. Contiguous Offset Claiming
Within each partition, the consumer claims a batch of events using gapless logical offsets (c_offset).
3. Real-time Wake-up
To minimize latency without high-frequency polling, the system uses the PostgreSQL LISTEN/NOTIFY mechanism.
- A database trigger on the outbox_message_t table issues a NOTIFY event_channel whenever new messages are inserted.
- Consumers use LISTEN event_channel to subscribe to these real-time signals.
- The consumer loop calls pgConn.getNotifications(timeout) to wait for signals. This allows the consumer thread to sleep efficiently and wake up immediately when work is available, while still falling back to a poll-based check if no notification is received within the waitPeriodMs.
Database Schema
log_counter
Manages the global version/offset for the outbox.
CREATE TABLE log_counter (
id INT PRIMARY KEY,
next_offset BIGINT NOT NULL DEFAULT 1
);
INSERT INTO log_counter (id, next_offset) VALUES (1, 1);
consumer_offsets
Tracks the progress of each consumer partition.
CREATE TABLE consumer_offsets (
group_id VARCHAR(255),
topic_id INT, -- 1 for global outbox
partition_id INT, -- Logical partition index
next_offset BIGINT NOT NULL DEFAULT 1,
PRIMARY KEY (group_id, topic_id, partition_id)
);
outbox_message_t (Modified)
Stores the events to be published.
ALTER TABLE outbox_message_t ADD COLUMN c_offset BIGINT UNIQUE;
CREATE INDEX idx_outbox_offset ON outbox_message_t (c_offset);
Triggers and Functions
Enables the NOTIFY mechanism.
CREATE OR REPLACE FUNCTION notify_event() RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify('event_channel', 'new_event');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER event_trigger
AFTER INSERT ON outbox_message_t
FOR EACH STATEMENT EXECUTE FUNCTION notify_event();
Implementation Details
Offset Reservation
When inserting events, the system locks the log_counter row to reserve a range of offsets:
UPDATE log_counter SET next_offset = next_offset + ? WHERE id = 1 RETURNING next_offset - ?;
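The arithmetic of the reservation can be illustrated without a database. This sketch replaces the locked log_counter row with an `AtomicLong`; the class name `LogCounter` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

class LogCounter {
    // Same starting value as the default of log_counter.next_offset.
    private final AtomicLong nextOffset = new AtomicLong(1);

    // Reserves `count` consecutive offsets and returns the first one, like
    // UPDATE ... SET next_offset = next_offset + ? ... RETURNING next_offset - ?
    long reserve(int count) {
        return nextOffset.getAndAdd(count);
    }
}
```

A writer inserting three events receives start offset 1 and assigns c_offset values 1, 2 and 3; the next writer starts at 4.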
Competing Consumer Pattern
To support multiple instances within the same consumer group, logical offsets are “claimed” in batches using an atomic UPDATE ... RETURNING statement. This ensures that each event is processed exactly once by one member of the group.
WITH counter_tip AS (
SELECT (next_offset - 1) AS highest_committed_offset FROM log_counter WHERE id = 1
),
to_claim AS (
SELECT group_id, next_offset,
LEAST(?, GREATEST(0, (SELECT highest_committed_offset FROM counter_tip) - next_offset + 1)) AS delta -- first parameter: batch size
FROM consumer_offsets
WHERE group_id = ? AND topic_id = 1
FOR UPDATE
),
upd AS (
UPDATE consumer_offsets c SET next_offset = c.next_offset + t.delta
FROM to_claim t
WHERE c.group_id = t.group_id AND c.topic_id = 1
RETURNING t.next_offset AS start_offset, (c.next_offset - 1) AS end_offset
)
SELECT start_offset, end_offset FROM upd;
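The claim query computes how many offsets may be taken without running past the committed tip. The same arithmetic as a plain Java sketch (the `OffsetClaim` class is hypothetical and only mirrors the `delta`, `start_offset` and `end_offset` expressions of the query):

```java
class OffsetClaim {
    final long startOffset;
    final long endOffset; // inclusive; endOffset < startOffset means nothing was claimed

    OffsetClaim(long startOffset, long endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // consumerNextOffset: consumer_offsets.next_offset before the update
    // counterNextOffset:  log_counter.next_offset (one past the highest committed offset)
    static OffsetClaim claim(long consumerNextOffset, long counterNextOffset, int batchSize) {
        long highestCommitted = counterNextOffset - 1;
        long available = Math.max(0, highestCommitted - consumerNextOffset + 1);
        long delta = Math.min(batchSize, available);
        return new OffsetClaim(consumerNextOffset, consumerNextOffset + delta - 1);
    }
}
```

With a batch size of 100, a consumer at offset 101 and a tip of 250 claims 101..200; with a tip of 120 it claims 101..120; with a tip of 100 the claim is empty.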
Transactional User-Based Batching
To ensure that events generated from the same user are handled atomically and in order, the consumer employs a grouping strategy within its processing cycle:
- Fetch Batch: Read raw payloads from outbox_message_t for the assigned partition range.
- Filter and Group:
  - Filter messages by the partition hash: abs(hashtext(host_id::text)) % ? = ?.
  - Group the filtered messages by host_id and user_id.
- Process by User:
  - For each (host_id, user_id) group, execute all events in a single database transaction.
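The grouping step above can be sketched as follows. The sketch assumes the partition filter has already been applied by the SQL query, and the `OutboxEvent` and `UserBatcher` types are hypothetical stand-ins for the consumer's own classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, minimal view of a row read from outbox_message_t.
class OutboxEvent {
    final long offset;
    final String hostId;
    final String userId;

    OutboxEvent(long offset, String hostId, String userId) {
        this.offset = offset;
        this.hostId = hostId;
        this.userId = userId;
    }
}

class UserBatcher {
    // Groups events by "hostId:userId". LinkedHashMap keeps groups in order of first
    // appearance, and each list keeps the original offset order within the group.
    static Map<String, List<OutboxEvent>> groupByHostAndUser(List<OutboxEvent> events) {
        Map<String, List<OutboxEvent>> groups = new LinkedHashMap<>();
        for (OutboxEvent e : events) {
            groups.computeIfAbsent(e.hostId + ":" + e.userId, k -> new ArrayList<>()).add(e);
        }
        return groups;
    }
}
```

Each resulting list is then a candidate for one database transaction.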
Handling Large Atomic Transactions (Batch Extension)
If a business activity (e.g., “instance clone”) generates more events than the configured batchSize, these events should still be processed in a single transaction to maintain system consistency.
The consumer handles this via Atomic Batch Extension:
- After fetching the initial batch (e.g., 100 events), the consumer peeks at the next available event in the outbox.
- If the next event belongs to the same user_id as the last event in the batch, the consumer continues fetching consecutive events for that user until the transaction boundary is found.
- The consumer_offsets are then atomically updated to reflect the true end of the extended batch.
- This ensures that even if 120 events were generated, all 120 are processed in a single transaction, regardless of the batchSize limit.
This approach ensures that even if events are processed in parallel across different partitions, events belonging to the same user are always handled in the same transaction, maintaining consistency across subsystems.
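A minimal sketch of the extension rule, using an in-memory list in place of the outbox and the user_id boundary described in the steps above. `PendingEvent` and `BatchExtender` are hypothetical names, and the real consumer fetches the extra rows from the database rather than from a list.

```java
import java.util.List;

// Hypothetical view of an unprocessed outbox row, ordered by offset.
class PendingEvent {
    final long offset;
    final String userId;

    PendingEvent(long offset, String userId) {
        this.offset = offset;
        this.userId = userId;
    }
}

class BatchExtender {
    // Returns how many events to process: batchSize, extended while the events
    // that follow belong to the same user as the last event of the initial batch.
    static int extendedBatchLength(List<PendingEvent> pending, int batchSize) {
        if (batchSize <= 0) {
            return 0;
        }
        if (pending.size() <= batchSize) {
            return pending.size();
        }
        int end = batchSize;
        String lastUser = pending.get(end - 1).userId;
        while (end < pending.size() && pending.get(end).userId.equals(lastUser)) {
            end++; // peek succeeded: the next event continues the same user's run
        }
        return end;
    }
}
```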
Transaction ID and Dead Letter Queue
Transaction ID
To provide precise boundaries for atomic transactions, the system uses a transaction_id column in the outbox_message_t table:
ALTER TABLE outbox_message_t ADD COLUMN transaction_id UUID;
When events are persisted to the outbox, all events generated within a single business transaction are assigned the same transaction_id (a UUID generated once per batch in EventPersistenceImpl.insertEventStore()).
This eliminates ambiguity when grouping events:
- Without transaction_id: Events are grouped by host_id:user_id, which may incorrectly group unrelated transactions from the same user.
- With transaction_id: Events are grouped by their exact transaction boundary, ensuring atomic processing of related events only.
Dead Letter Queue (DLQ)
When event processing fails, the system implements a granular fallback mechanism to prevent the entire batch from being blocked:
Schema
CREATE TABLE IF NOT EXISTS dead_letter_queue (
group_id VARCHAR(255),
host_id UUID,
user_id UUID,
c_offset BIGINT,
transaction_id UUID,
payload JSONB,
exception TEXT,
created_dt TIMESTAMP DEFAULT NOW()
);
Processing Flow
- Normal Processing: The consumer attempts to process all events in a claimed batch within a single database transaction.
- Batch Failure Detection: If any event in the batch fails (e.g., constraint violation, business logic error), the entire transaction is rolled back.
- Fallback Mode: The consumer switches to processBatchWithFallback():
  - Re-claims the same offset range.
  - Groups events by transaction_id.
  - For each transaction group:
    - Creates a JDBC Savepoint.
    - Attempts to process all events in that transaction.
    - On success: Continues to the next transaction.
    - On failure:
      - Rolls back to the Savepoint.
      - Inserts all events from the failed transaction into dead_letter_queue.
      - Logs the error with the transaction_id for debugging.
- Commit: After processing all transactions (successful or moved to DLQ), the consumer commits the transaction, advancing the offset.
Benefits
- Isolation: Only the failing transaction is moved to DLQ; other transactions in the batch proceed normally.
- Atomicity: All events belonging to a single business transaction are either processed together or moved to DLQ together.
- No Blocking: The consumer never gets stuck on a single bad event.
- Debuggability: The DLQ table preserves the full context (payload, exception, transaction_id) for manual investigation and replay.
Configuration
The consumer is configured via db-event-consumer.yml and runs in a Java 21 Virtual Thread. This ensures that the frequent Thread.sleep (during retries) and the blocking pgConn.getNotifications() (waiting for wake-ups) do not tie up native system threads, making the consumer extremely lightweight.
# Postgres pub/sub event processor configuration
# Consumer group id. Defaults to user-query-group. Change it only if you
# know exactly what you are doing.
groupId: ${db-event-consumer.groupId:user-query-group}
# The batch size when polling the database for events. It is not a hard limit: the batch
# is extended if more than batchSize events belong to the same transaction.
batchSize: ${db-event-consumer.batchSize:100}
# The total number of partitions. It should equal the number of portal-query instances.
totalPartitions: ${db-event-consumer.totalPartitions:1}
# Partition id starting from 0 to totalPartitions - 1 to assign each portal query instance.
partitionId: ${db-event-consumer.partitionId:0}
# The poll interval from the Postgres database to process the events from outbox_message_t.
waitPeriodMs: ${db-event-consumer.waitPeriodMs:1000}
Clean Shutdown
To ensure resources are released cleanly when the application stops, a ShutdownHookProvider is implemented:
- DbEventConsumerShutdownHook: Sets the done flag to stop the consumer loop and shuts down the ExecutorService. This ensures that the application doesn’t hang on exit and that the database connections are properly returned to the pool.
Conclusion
This native PostgreSQL implementation provides a robust alternative to Kafka, leveraging standard relational database features to maintain strict event ordering and delivery guarantees with minimal infrastructure overhead.
Comparison: Leader Election vs. Competing Consumer (Claiming)
The light-portal architecture employs two different distributed coordination strategies: Leader Election for the Scheduler and Competing Consumers (Offset Claiming) for the PostgreSQL Pub/Sub. Each approach is optimized for its specific use case.
Summary Table
| Feature | Leader Election (scheduler_lock_t) | Host Partitioning (consumer_offsets) |
|---|---|---|
| Primary Goal | Exclusive Control (Safety) | Horizontal Scalability (Throughput) |
| Mechanism | Centralized “lock” with heartbeat. | Logical partitioning via host_id hash. |
| Parallelism | None (Single active instance). | High (N partitions, N consumers). |
| Database Load | Very Low (Heartbeat only). | Moderate (Per-partition updates). |
| Failover | Detection delay (Timeout-based). | Instant (One processor per partition). |
| Complexity | Simple. | Moderate (Hashing + Batching). |
1. Leader Election (Used in Distributed Scheduler)
Why it’s used for the Scheduler:
The “work” done by the scheduler is extremely lightweight: it simply checks if a task is due, inserts a one-line event into the outbox, and updates the next run time. However, the cost of double execution (starting the same job twice) is high.
- Efficiency: Having one leader prevents multiple instances from redundant polling of the schedule_t table, which reduces database contention.
- Safety: It provides a simple guarantee that only one controller is making decisions about what triggers and when.
- Scaling: Since the scheduler doesn’t do the actual “heavy lifting” (the work is done by event consumers), the leader bottleneck is rarely an issue.
2. Host-Based Partitioning (Used in Postgres Pub/Sub)
Why it’s used for Event Processing:
Event processing is the “Data Plane” of the system. By partitioning based on host_id, we emulate Kafka’s partitioning behavior within PostgreSQL.
- Ordered Processing: Ensures all events for a specific host (or user) are processed by the same partition sequence, avoiding race conditions on multi-tenant data.
- Throughput: Multiple consumers can process different partitions in parallel. 8 partitions = 8 instances working concurrently.
- Implicit Load Balancing: Distributes thousands of hosts across a fixed number of partitions.
- Resiliency: Each partition’s progress is independent. A failure in one host/partition doesn’t block others.
Conclusion: Which is “Better”?
Neither is universally better; they are complementary:
- Leader Election is better for orchestration and control: Where you need a single “brain” to make consistent decisions and volume is manageable.
- Competing Consumers is better for workload distribution: Where you need to process a high volume of independent tasks as quickly as possible.
In light-portal, we use the Scheduler (Leader) to reliably “kick off” tasks by emitting events, and the Pub/Sub (Competing Consumers) to process those events at scale.
Kafka Event Processor
Overview
The Kafka Event Processor (PortalEventConsumerStartupHook) consumes events from Kafka topics that are populated by Debezium CDC from the outbox_message_t table. It provides robust event processing with transaction-level granularity and Dead Letter Queue (DLQ) support.
Architecture
The processor uses a two-phase processing strategy with automatic fallback to ensure both performance and reliability:
- Optimistic Batch Processing: Attempts to process all transactions in a single database transaction for maximum throughput
- Granular Fallback: On failure, switches to individual transaction processing with JDBC Savepoints to isolate failures
Transaction ID Header
Events published to Kafka include a transaction_id header added by Debezium’s HeaderFrom transform. This UUID groups all events that were generated within a single business transaction, enabling:
- Precise transaction boundaries: Events are grouped by their actual transaction, not just by user/host
- Atomic DLQ handling: Failed transactions are moved to DLQ as a complete unit
- Backward compatibility: Falls back to Kafka key-based grouping for events without the header
Debezium Configuration
The transaction_id header is added via the Debezium connector configuration:
{
"transforms": "unwrap,addTransactionIdHeader,timestamp_converter,...",
"transforms.addTransactionIdHeader.type": "org.apache.kafka.connect.transforms.HeaderFrom$Value",
"transforms.addTransactionIdHeader.fields": "transaction_id",
"transforms.addTransactionIdHeader.headers": "transaction_id",
"transforms.addTransactionIdHeader.operation": "copy"
}
Processing Flow
Phase 1: Optimistic Batch Processing
// 1. Group events by transaction_id from headers
Map<String, List<ConsumerRecord>> transactionBatches = groupByTransactionId(records);
// 2. Process all transactions in one DB transaction
Connection conn = ds.getConnection();
conn.setAutoCommit(false);
for (Map.Entry<String, List<ConsumerRecord>> entry : transactionBatches.entrySet()) {
for (ConsumerRecord record : entry.getValue()) {
updateDatabaseWithEvent(conn, record.getValue());
}
}
conn.commit();
commitOffset(records);
Benefits:
- High throughput with single database transaction
- Minimal overhead for the common success case
Phase 2: Fallback with Savepoints
If the batch processing fails, the processor switches to granular mode:
Connection conn = ds.getConnection();
conn.setAutoCommit(false);
for (Map.Entry<String, List<ConsumerRecord>> entry : transactionBatches.entrySet()) {
String transactionId = entry.getKey();
List<ConsumerRecord> txRecords = entry.getValue();
Savepoint sp = conn.setSavepoint("TX_" + transactionId.hashCode());
try {
for (ConsumerRecord record : txRecords) {
updateDatabaseWithEvent(conn, record.getValue());
}
// Success - continue to next transaction
} catch (Exception e) {
// Rollback only this transaction
conn.rollback(sp);
// Send to DLQ
produceDLQ(txRecords, e);
}
}
// Commit all successful transactions
conn.commit();
commitOffset(allRecords);
Benefits:
- Isolation: Only failing transactions are moved to DLQ
- Atomicity: All events in a transaction are processed together or fail together
- No Blocking: Consumer continues processing subsequent transactions
- Progress Guarantee: Offsets are committed for all records (successful + DLQ’d)
Dead Letter Queue (DLQ)
DLQ Topic
Failed transactions are sent to a DLQ topic: {original-topic}-dlq
Each DLQ message includes:
- Key: Original Kafka key (user_id)
- Value: Original event payload
- TraceabilityId: Exception stack trace for debugging
DLQ Producer Configuration
The DLQ producer is configured via DeadLetterProducerStartupHook and must be enabled in the consumer config:
# kafka-consumer.yml
deadLetterEnabled: true
deadLetterTopicExt: -dlq
Monitoring and Recovery
- Alerting: Set up monitoring on the DLQ topic for new messages
- Investigation: Inspect DLQ messages to identify root cause (bad data, code bug, constraint violation)
- Fix: Deploy code fix or correct data inconsistency
- Replay: Use a re-driver application to republish events from DLQ back to the original topic
Transaction Grouping Logic
The processor extracts transaction_id from Kafka record headers:
private String extractTransactionId(ConsumerRecord<Object, Object> record) {
Map<String, String> headers = record.getHeaders();
if (headers != null) {
return headers.get("transaction_id");
}
return null;
}
Fallback for Legacy Events:
If no transaction_id header is present (old events before the header was added), the processor falls back to using the Kafka key for grouping:
String transactionId = extractTransactionId(record);
if (transactionId == null) {
transactionId = (String) record.getKey(); // Backward compatibility
}
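Putting the header lookup and the legacy fallback together, the grouping helper referenced in Phase 1 might look like the sketch below. `EventRecord` is a hypothetical stand-in for the consumer record type (only the key, headers and value are modeled), and this is an assumption about the shape of the helper, not the portal's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the consumer record type used in the snippets above.
class EventRecord {
    final String key;
    final Map<String, String> headers;
    final String value;

    EventRecord(String key, Map<String, String> headers, String value) {
        this.key = key;
        this.headers = headers;
        this.value = value;
    }
}

class TransactionGrouper {
    // Groups records by the transaction_id header, falling back to the record key
    // for legacy events. Insertion order of groups and of records is preserved.
    static Map<String, List<EventRecord>> groupByTransactionId(List<EventRecord> records) {
        Map<String, List<EventRecord>> batches = new LinkedHashMap<>();
        for (EventRecord r : records) {
            String txId = (r.headers == null) ? null : r.headers.get("transaction_id");
            if (txId == null) {
                txId = r.key; // backward compatibility for events without the header
            }
            batches.computeIfAbsent(txId, k -> new ArrayList<>()).add(r);
        }
        return batches;
    }
}
```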
Error Handling Strategy
Permanent vs Transient Errors
The processor treats all exceptions during fallback processing as permanent errors that warrant DLQ routing. This includes:
- Database constraint violations (unique, foreign key, not null)
- Deserialization errors (malformed JSON, schema mismatch)
- Business logic errors (validation failures, state inconsistencies)
Rationale: If an event fails during fallback (after the initial batch attempt failed), it’s unlikely to succeed on retry without intervention.
Health Monitoring
The processor sets healthy = false on critical failures, which triggers Kubernetes health probes to restart the pod:
- Consumer instance not found
- Framework exceptions during polling
- Fatal errors in fallback processing (after DLQ attempt)
Configuration
Consumer configuration in kafka-consumer.yml:
# Kafka consumer properties
topic: portal-event
groupId: user-query-group
keyFormat: string
valueFormat: string
# DLQ configuration
deadLetterEnabled: true
deadLetterTopicExt: -dlq
# Polling configuration
waitPeriod: 1000 # ms to wait between polls when no records
Comparison with DB Event Consumer
| Feature | Kafka Consumer | DB Consumer |
|---|---|---|
| Event Source | Kafka topic (via Debezium CDC) | Direct PostgreSQL polling |
| Transaction ID | From Kafka headers | From outbox_message_t.transaction_id column |
| Grouping | Map<String, List<ConsumerRecord>> | Map<String, List<EventData>> |
| DLQ Target | Kafka DLQ topic | PostgreSQL dead_letter_queue table |
| Offset Management | Kafka consumer offsets | PostgreSQL consumer_offsets table |
| Fallback Mechanism | JDBC Savepoints | JDBC Savepoints |
Both implementations share the same core DLQ philosophy: isolate failures at the transaction level to prevent blocking the entire consumer.
Best Practices
- Idempotent Processing: Ensure updateDatabaseWithEvent() logic is idempotent to handle potential reprocessing
- Monitor DLQ: Set up alerts for DLQ topic activity
- Version Events: Use schema versioning to handle event evolution gracefully
- Test Failure Scenarios: Regularly test DLQ routing with intentional failures
- DLQ Retention: Configure appropriate retention for DLQ topics to allow investigation and replay
Configuration Snapshot Design
This document describes the design and implementation of the configuration snapshot feature in the light-portal.
Overview
A configuration snapshot captures the state of an instance’s configuration at a specific point in time. It includes all properties, files, and relationships defined for that instance, merging overrides from various levels (Product, Environment, Product Version) into a “burned-in” effective configuration.
Snapshots are created in two scenarios:
- Deployment Trigger: Automatically created when a deployment occurs (to capture the state being deployed).
- User Trigger: Manually created by a user via the UI (e.g., to save a milestone).
Data Model
Snapshot Header (config_snapshot_t)
Captures metadata about the snapshot.
- snapshot_id: UUID
- snapshot_type: Type of snapshot (e.g., DEPLOYMENT, USER_SAVE)
- instance_id: Target instance
- host_id: Tenant identifier
- deployment_id: Link to deployment (if applicable)
- product_version: Locked product version at time of snapshot
- service_id: Locked service ID
Snapshot Content
Snapshot data is normalized into shadow tables that mirror the runtime configuration tables. These tables differ from the runtime tables by including a snapshot_id and lacking some runtime-specific fields.
Key tables include:
- snapshot_instance_property_t
- snapshot_instance_file_t
- snapshot_deployment_instance_property_t
- snapshot_product_version_property_t
- snapshot_environment_property_t
- … (others for APIs, Apps, etc.)
Effective Configuration (config_snapshot_property_t)
A flattened, merged view of all properties for the snapshot. This table represents the “final” configuration values used by the instance.
- Calculated by merging properties from all levels (Deployment > Instance > Product Version > Environment > Product) based on priority.
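For scalar properties, the priority merge amounts to applying the levels from lowest to highest priority so that later levels overwrite earlier ones. The sketch below illustrates only that scalar case (list/map aggregation is not covered); `EffectiveConfig` is a hypothetical name, and the actual merge is performed by the stored procedure.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EffectiveConfig {
    // levelsLowToHigh is ordered from lowest to highest priority, e.g.
    // Product, Environment, Product Version, Instance, Deployment.
    static Map<String, String> merge(List<Map<String, String>> levelsLowToHigh) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> level : levelsLowToHigh) {
            effective.putAll(level); // higher-priority levels overwrite earlier values
        }
        return effective;
    }
}
```

A property defined only at the Product level survives unchanged, while a property defined at both Product and Instance level takes the Instance value.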
Backend Implementation
Stored Procedure (create_snapshot)
Located in portal-db/postgres/sp_tr_fn.sql.
This procedure performs the heavy lifting:
- Validates the instance and retrieves scope data (product, environment, etc.).
- Creates the snapshot header record.
- Copies raw data from active runtime tables to snapshot tables (e.g., instance_property_t -> snapshot_instance_property_t).
- Merges properties from all levels into config_snapshot_property_t.
  - Handles list/map merging (aggregation).
  - Handles scalar overriding (last update wins/priority tiers).
Persistence Layer (ConfigPersistenceImpl.java)
Provides the Java interface that calls the stored procedure:
- createConfigSnapshot: Calls CALL create_snapshot(...).
- getConfigSnapshot: Retrieves snapshot headers with filtering/sorting.
- updateConfigSnapshot: Updates metadata (description).
- deleteConfigSnapshot: Deletes a snapshot and its cascaded data (if cascade delete is set up in DB, otherwise manual cleanup might be needed).
Front End Implementation
Config Snapshot Page (ConfigSnapshot.tsx)
- Displays a list of snapshots for a selected instance.
- Supports filtering by current, ID, date, etc.
- Actions:
  - Create: Navigates to /app/form/createConfigSnapshot.
  - Update: Fetches fresh data and navigates to update form.
  - Delete: Calls deleteSnapshot command.
Gap Analysis & Missing Components
The following components are currently MISSING or incomplete:
- Command Handlers:
  - CreateConfigSnapshot handler (for User Trigger) is missing in config-command.
  - DeleteConfigSnapshot handler is missing in config-command.
  - GetFreshConfigSnapshot handler is missing (required for the “Update” action in UI).
- Deployment Integration:
  - CreateDeployment.java (in deployment-command) does NOT call createConfigSnapshot.
  - The automatic snapshot creation on deployment is currently not implemented.
- API Definition:
  - The createConfigSnapshot and deleteConfigSnapshot endpoints need to be defined in the schema/routing if they are not already.
Action Plan
- Implement Command Handlers:
  - Create CreateConfigSnapshot handler in config-command that invokes ConfigPersistence.createConfigSnapshot.
  - Create DeleteConfigSnapshot handler in config-command.
  - Create GetFreshConfigSnapshot handler in config-query.
- Integrate with Deployment:
  - Modify CreateDeployment.java (or the platform handler it invokes) to call ConfigPersistence.createConfigSnapshot immediately after a successful deployment job is submitted or completed.
- Review Idempotency:
  - Ensure create_snapshot handles re-runs gracefully (Idempotency is partially handled by UUID generation, but business logic should prevent duplicate snapshots for the exact same state if needed).
Config Clone
OAuth 2.0 State Parameter Design
This document outlines the design, generation, and flow of the state parameter within the LightAPI OAuth 2.0 architecture.
Overview
The state parameter is an opaque value used by the client to maintain state between the request and callback. In the OAuth 2.0 Authorization Code Flow, its primary and critical function is to prevent Cross-Site Request Forgery (CSRF) attacks.
Workflow
The flow involves three parties:
- Client: The application requesting access (e.g., Light Portal).
- Authorization Server UI: The front-end login interface (e.g., Login View).
- Authorization Service: The backend service validating credentials and issuing codes.
Step-by-Step Flow
- Generation (Client Side)
  - The User initiates a login action on the Client.
  - The Client generates a cryptographically strong random string (the state).
  - The Client stores this state locally (e.g., in a secure, HTTP-only cookie or Session Storage) bound to the user’s current session.
  - The Client redirects the browser to the Authorization Server UI (login-view), appending the state as a query parameter.
  GET https://login.lightapi.net/?client_id=...&response_type=code&state=xyz123...
- Preservation (Authorization Server UI)
  - The Authorization Server UI (login-view) loads and parses the query parameters.
  - It must not modify or validate the state. Its sole responsibility is preservation.
  - When the user submits credentials (username/password) or selects a social provider, the UI passes the state exactly as received to the backend Authorization Service.
- Authorization (Authorization Service)
  - The backend service authenticates the user.
  - Upon success, it generates an Authorization Code.
  - It constructs the redirect URL back to the Client.
  - It must append the exact same state value received from the UI to this redirect URL.
  HTTP/1.1 302 Found
  Location: https://portal.lightapi.net/authorization?code=auth_code_abc&state=xyz123...
- Verification (Client Side)
  - The Client receives the callback request.
  - It extracts the state from the URL parameters.
  - It retrieves the stored state from its local session.
  - It compares the two values:
    - Match: The request is valid. Proceed to exchange the code for a token.
    - Mismatch: The request is potentially malicious (CSRF likely). Reject the request and show an error.
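The generation and verification steps on the client side can be sketched with the JDK alone. This is an illustrative sketch, not the portal's client code: the `OAuthState` class is hypothetical, and the choice of 32 random bytes encoded as URL-safe Base64 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

class OAuthState {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates 32 random bytes (256 bits) encoded as URL-safe Base64 without
    // padding, so the value can be placed in a query string unchanged.
    static String generate() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Compares the state stored in the session with the one returned in the
    // callback. MessageDigest.isEqual avoids an early exit on the first mismatch.
    static boolean matches(String stored, String returned) {
        if (stored == null || returned == null) {
            return false;
        }
        return MessageDigest.isEqual(
                stored.getBytes(StandardCharsets.UTF_8),
                returned.getBytes(StandardCharsets.UTF_8));
    }
}
```

The client would call `generate()` before the redirect, store the result in the session, and call `matches(...)` when the callback arrives, rejecting the request on `false`.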
Security Requirements
- Uniqueness: The state must be unique per authentication request.
- Entropy: It must be a cryptographically random string (high entropy) to be unguessable.
- Binding: It must be bound to the user’s specific browser session on the client side.
Responsibility Matrix
| Component | Responsibility | Action |
|---|---|---|
| Portal (Client) | Owner | Generate, Store, Verify. |
| Login View (UI) | Carrier | Receive, Preserve, Forward. |
| Auth Service | Echo | Receive, Echo back in Redirect. |
Event Promotion Design: State-Based Reconciliation with Composite Keys
Overview
Traditional event sourcing replication involves copying raw events from one environment to another. However, this fails when the target environment has diverged (e.g., hotfixes), causing aggregateVersion conflicts. Additionally, strict global UUID constraints can prevent reusing the same ID across environments (Tenants). Finally, partial promotions can fail if parent dependencies (referential integrity) are missing in the target.
To resolve this, we adopt a State-Based Reconciliation approach (Semantic Replay) combined with Composite Keys for identity and Recursive Dependency Resolution for integrity.
Core Strategy: State-Based Reconciliation
Workflow
- Export (Lower Environment):
  - Query the current state (Snapshot) of the entity from the Lower Environment (LE).
  - Produce a “Canonical State Snapshot” (JSON).
- Import & Diff (Higher Environment):
  - Read the LE Snapshot.
  - Query the current state of the corresponding entity in the Higher Environment (HE).
  - Compare:
    - New? -> Generate XxxCreatedEvent.
    - Changed? -> Calculate Delta -> Generate XxxUpdatedEvent.
    - Same? -> No-op.
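The compare step above can be sketched over flat field maps. This is a simplified illustration: `Reconciler` is a hypothetical name, snapshots are modeled as `Map<String, Object>` rather than JSON, and fields that exist only in HE are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

class Reconciler {
    // Returns "CREATE" when the entity is absent in HE, "UPDATE" when at least one
    // field differs, and "NOOP" when the LE snapshot matches the HE state.
    static String decide(Map<String, Object> leSnapshot, Map<String, Object> heState) {
        if (heState == null) {
            return "CREATE";
        }
        return delta(leSnapshot, heState).isEmpty() ? "NOOP" : "UPDATE";
    }

    // Collects the LE fields whose values differ from HE; this is the payload
    // an XxxUpdatedEvent would carry in this simplified model.
    static Map<String, Object> delta(Map<String, Object> le, Map<String, Object> he) {
        Map<String, Object> changed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : le.entrySet()) {
            if (!Objects.equals(e.getValue(), he.get(e.getKey()))) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }
}
```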
Advantages
- Conflict Immunity: No aggregateVersion conflicts; we always append new events.
- Self-Healing: Automatically synchronizes diverged states.
Identity Strategy: Composite Keys
The Problem: Global UUID Uniqueness
In a multi-tenant system sharing a single database, a standard Primary Key UUID (e.g., user_id) is globally unique. This prevents us from having “User Steve” with UUID 123 in both the “Dev Tenant” and “Prod Tenant” if the DB enforces strict uniqueness on that column.
The Solution: Composite Keys (host_id + aggregate_id)
We scope all identity by the Tenant ID (host_id).
- Schema Change:
  - Primary Keys: Change from PK(id) to PK(host_id, id).
  - Uniqueness: Change unique constraints (e.g., email) from UK(email) to UK(host_id, email).
  - Event Store: Change unique constraint from UK(aggregate_id, version) to UK(host_id, aggregate_id, version).
- Promotion Benefit:
  - Dev Tenant: host_id=DEV, user_id=123
  - Prod Tenant: host_id=PROD, user_id=123
  - Matching entities is trivial (compare id directly).
Data Integrity: Recursive Dependency Resolution
The Problem: Missing Dependencies
Promoting a child entity (e.g., API Configuration) fails if its parent (e.g., API Instance) does not exist in the target environment (Higher Env).
The Solution: Deep Promotion (Recursive Bundling)
The exporter must be “Topology Aware”.
- Dependency Metadata: Every Entity Type must declare its dependencies.
  - ApiConfig depends on ApiInstance.
  - ApiInstance depends on GatewayInstance.
  - GatewayInstance depends on Host.
- Export Workflow (Recursive): When a user selects ApiConfig-123 for promotion:
  - System checks ApiConfig-123 -> Parent ApiInstance-456.
  - System checks ApiInstance-456 -> Parent GatewayInstance-789.
  - Export Package: Includes [GatewayInstance-789, ApiInstance-456, ApiConfig-123] (Ordered by dependency).
- Import Workflow (Ordered): The Importer processes the list in order:
  - GatewayInstance: Exists in Prod? Yes. (Skip).
  - ApiInstance: Exists in Prod? No. Action: Create ApiInstance.
  - ApiConfig: Exists in Prod? No. Action: Create ApiConfig.
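The export ordering above can be sketched for the single-parent case by walking the parent chain and reversing it. `ExportPlanner` and the `parentOf` map are hypothetical; a real exporter would read the dependency metadata per entity type and may need a full topological sort if an entity has several parents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ExportPlanner {
    // parentOf maps an entity id to the id of its parent; an entity without an
    // entry is treated as the top of the chain. The result lists parents first.
    static List<String> exportOrder(String selected, Map<String, String> parentOf) {
        List<String> chain = new ArrayList<>();
        String current = selected;
        // chain.contains(...) stops the walk if the metadata contains a cycle
        while (current != null && !chain.contains(current)) {
            chain.add(current);
            current = parentOf.get(current);
        }
        Collections.reverse(chain);
        return chain;
    }
}
```

The importer can then process the returned list front to back, so every parent is created or confirmed before its child.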
Dry Run Technical Implementation
Purpose
To guarantee the promotion will succeed without actually modifying the Higher Environment (Production).
Option 1: Application-Layer Simulation (Fast, Recommended for Planning)
- Logic: The Importer queries the DB (read-only) to fetch the current state of all entities in the package.
- Result: It calculates the “Diff Plan” purely in memory.
- Output: “Plan: Create API Instance (New), Update API Config (Diff)”.
- Pros: Very fast, zero DB locks.
- Cons: Does not verify deep database constraints (e.g., complex triggers or check constraints) that only trigger on write.
Option 2: Transaction Rollback (Robust, Recommended for Validation)
- Logic:
  - Start a Database Transaction: connection.setAutoCommit(false);
  - Simulate Execution: Perform the actual SQL Inserts and Updates generated by the Plan.
    - Insert ApiInstance…
    - Insert ApiConfig…
  - Check for Errors: If any SQL Exception occurs (e.g., FK violation, unique constraint violation), catch it.
  - Rollback: Regardless of success or failure, always call connection.rollback().
- Output: “Validation Successful: The detailed plan is valid and safe to execute.” OR “Validation Failed: FK Violation on Table X”.
- Pros: 100% certainty that the data is valid according to the database schema.
- Cons: Slightly heavier key locks, but acceptable for admin operations.
Recommendation
Use Option 1 (App Simulation) for the UI preview to show the user “what will happen”. Use Option 2 (Transaction Rollback) immediately when the user clicks “Promote” (as a pre-flight check) or as an explicit “Verify” button to ensure deep integrity.
Sibling Deletion: Handling Orphaned Items
The Use Case
When promoting a collection of items (e.g., “10 Config Properties” in HE vs “8 in LE”), simply creating or updating the 8 matching items from LE is insufficient. We must identify the 2 extra items in HE that likely need to be deleted to match the LE state.
Design Pattern: Scoped Reconciliation
To handle this, the import logic must be aware of the “Parent Scope” of the entities being promoted.
- Export (Snapshot with Siblings):
  - When promoting ApiConfig-123, we fetch ALL associated properties for that config in LE.
  - LE Snapshot: Properties = {P1, P2, ... P8} (Total 8).
- Import (Set Difference Logic):
  - Query ALL associated properties for ApiConfig-123 in HE.
  - HE State: Properties = {P1, P2, ... P8, P9, P10} (Total 10).
  - Logic: HE_Only = HE_Set - LE_Set => {P9, P10}.
- User Decision (Interactive Mode):
  - The Dry Run Plan reports:
    - Updates: 8 items synced (P1..P8).
    - Deletions (Potential): 2 items exist in Prod but not Dev (P9, P10).
  - Default Action: Do nothing (Safe Mode).
  - Option: “Sync Deletes” -> Checkbox to delete the extras.
  - Strict Mode: Mirror exact state (Automatically schedule ConfigPropertyDeletedEvent for P9, P10).
Implementation Checklist
- Exporter must include the full list of children IDs when exporting a parent container.
- Importer must realize that for “One-to-Many” relationships, it has to fetch the full target set to detect orphans.
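The set-difference step of scoped reconciliation can be sketched in a few lines. The OrphanDetector class and its method names are hypothetical; the syncDeletes flag stands in for the “Sync Deletes” checkbox, with Safe Mode (no deletions) as the default.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class OrphanDetector {
    /** Returns the IDs present in the higher environment but absent from the lower one. */
    static Set<String> findOrphans(Set<String> heIds, Set<String> leIds) {
        Set<String> heOnly = new LinkedHashSet<>(heIds); // copy so the caller's set is untouched
        heOnly.removeAll(leIds);                         // HE_Set - LE_Set
        return heOnly;
    }

    /** Safe Mode keeps orphans; deletions are planned only when the user opts in. */
    static Set<String> plannedDeletes(Set<String> heIds, Set<String> leIds, boolean syncDeletes) {
        return syncDeletes ? findOrphans(heIds, leIds) : Set.of();
    }
}
```

With HE holding P1 through P10 and LE holding P1 through P8, findOrphans yields P9 and P10, matching the example above.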
UI and Service Design
Entity Dependency Graph
The exporter must be “Topology Aware”. When exporting an entity, all parent and child dependencies are included. Starting with instance_t as the primary promotable entity:
host_t
└── instance_t
├── instance_property_t
├── instance_file_t
├── instance_api_t
│ ├── instance_api_property_t
│ └── instance_api_path_prefix_t
├── instance_app_t
│ ├── instance_app_property_t
│ └── instance_app_api_t
│ └── instance_app_api_property_t
└── deployment_instance_t
└── deployment_instance_property_t
Promotion Modes
Two promotion modes are supported:
- Cross-Instance (JSON): Export entity snapshots as JSON files, then import them into a different environment/database instance. Used when source and target are in separate databases.
- Same-Instance (Data Table): Use promotion_t and promotion_item_t tables for tracking promotions between hosts within the same database. Source and target hosts share the same database.
Database: Promotion Tracking Tables
These tables are for same-instance promotions to track promotion jobs and their items.
CREATE TABLE promotion_t (
promotion_id UUID NOT NULL,
source_host_id UUID NOT NULL,
target_host_id UUID NOT NULL,
entity_type VARCHAR(64) NOT NULL, -- 'instance', 'rule', 'api', etc.
promotion_status VARCHAR(16) NOT NULL, -- 'Planned', 'DryRun', 'Executed', 'Failed', 'RolledBack'
plan_summary JSONB, -- The diff plan generated by dry run
created_by UUID NOT NULL,
aggregate_version BIGINT DEFAULT 1 NOT NULL,
active BOOLEAN NOT NULL DEFAULT TRUE,
delete_user VARCHAR(255),
delete_ts TIMESTAMP WITH TIME ZONE,
update_user VARCHAR(255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY(promotion_id)
);
CREATE TABLE promotion_item_t (
promotion_id UUID NOT NULL,
item_id UUID NOT NULL,
entity_type VARCHAR(64) NOT NULL, -- 'instance', 'instance_property', etc.
entity_id VARCHAR(255) NOT NULL, -- The ID of the entity being promoted
action VARCHAR(16) NOT NULL, -- 'CREATE', 'UPDATE', 'DELETE', 'NOOP'
source_snapshot JSONB, -- State in source (LE)
target_snapshot JSONB, -- State in target (HE) for diff
diff_summary JSONB, -- Field-level diff
execution_status VARCHAR(16) DEFAULT 'Pending', -- 'Pending', 'Success', 'Failed'
error_message TEXT,
update_user VARCHAR(255) DEFAULT SESSION_USER NOT NULL,
update_ts TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY(promotion_id, item_id),
FOREIGN KEY(promotion_id) REFERENCES promotion_t(promotion_id) ON DELETE CASCADE
);
Service API Contracts
All promotion services are implemented in the user-command module (net.lightapi.portal.user.command.handler) alongside the existing ExportPortalEvent and ImportPortalEvent handlers.
Export Snapshot (Query)
Exports the current state of selected entities and all their children as a canonical JSON snapshot.
- Service: user
- Action: exportSnapshot
- Request Data:
  - sourceHostId (UUID) – The host to export from.
  - entityType (String) – e.g., "instance".
  - entityIds (Array<String>) – IDs of entities to export.
  - includeChildren (Boolean) – Recursively include child entities.
  - includeSiblings (Boolean) – Include full sibling sets for orphan detection.
- Response: Canonical State Snapshot JSON containing all entities ordered by dependency depth, with nested children. The nested format is preferred over flat-with-references because the tree depth is bounded (max 4 levels for instance_t), making it self-contained and easy to process depth-first during import.
{
"exportVersion": "1.0.0",
"sourceHostId": "...",
"exportTs": "2026-03-09T20:00:00Z",
"entities": [
{
"entityType": "instance",
"entityId": "...",
"data": { },
"children": {
"instance_property": [ ],
"instance_file": [ ],
"instance_api": [
{
"data": { },
"children": {
"instance_api_property": [ ],
"instance_api_path_prefix": [ ]
}
}
],
"instance_app": [
{
"data": { },
"children": {
"instance_app_property": [ ],
"instance_app_api": [
{
"data": { },
"children": {
"instance_app_api_property": [ ]
}
}
]
}
}
],
"deployment_instance": [
{
"data": { },
"children": {
"deployment_instance_property": [ ]
}
}
]
}
}
]
}
Import Dry Run (Command)
Performs an application-layer simulation (Option 1) to calculate the diff plan without modifying the database.
- Service: user
- Action: importDryRun
- Request Data:
  - targetHostId (UUID) – The host to import into.
  - snapshot (Object) – The exported canonical snapshot JSON.
- Response: Diff plan with summary counts and per-item actions.
{
"promotionId": "...",
"summary": { "create": 5, "update": 3, "noop": 2, "orphan": 1 },
"items": [
{
"entityType": "instance",
"entityId": "...",
"action": "UPDATE",
"diff": { "instance_name": { "from": "old-name", "to": "new-name" } }
},
{
"entityType": "instance_property",
"entityId": "...",
"action": "CREATE",
"diff": null
}
]
}
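The per-item action and field-level diff in this response can be derived by comparing the source snapshot with the current target row. The sketch below is illustrative only: DiffPlanner is a hypothetical name, rows are modeled as plain maps, and it assumes the caller has already removed audit columns such as update_ts that would otherwise always differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class DiffPlanner {
    /** Returns CREATE, UPDATE, or NOOP; target is null when the entity is absent in HE. */
    static String action(Map<String, Object> source, Map<String, Object> target) {
        if (target == null) {
            return "CREATE";
        }
        return fieldDiff(source, target).isEmpty() ? "NOOP" : "UPDATE";
    }

    /**
     * Field-level diff keyed by field name, each entry holding the current target
     * value ("from") and the incoming source value ("to"). Only fields present in
     * the source are compared; target-only fields are ignored.
     */
    static Map<String, Map<String, Object>> fieldDiff(Map<String, Object> source,
                                                      Map<String, Object> target) {
        Map<String, Map<String, Object>> diff = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : source.entrySet()) {
            Object current = target.get(field.getKey());
            if (!Objects.equals(current, field.getValue())) {
                Map<String, Object> change = new LinkedHashMap<>(); // tolerates null values
                change.put("from", current);
                change.put("to", field.getValue());
                diff.put(field.getKey(), change);
            }
        }
        return diff;
    }
}
```

Orphan detection is a separate pass over the target's sibling set and is not part of this per-entity comparison.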
Import Execute (Command)
Executes the promotion plan, applying all changes to the target host.
- Service: user
- Action: importExecute
- Request Data:
  - targetHostId (UUID) – The host to apply changes to.
  - promotionId (UUID, optional) – From the dry run (for same-instance tracking).
  - snapshot (Object) – The canonical snapshot.
  - orphanAction (String) – "keep" | "delete" | "sync".
- Response: Execution result with per-item status (Success/Failed) and error messages.
UI Pages
All pages are located under portal-view/src/pages/promotion/ and accessible via a top-level “Promotion” sidebar menu with children: Export, Import, History.
PromotionExport.tsx (/app/promotion/export)
A 3-step wizard guiding the user through the export process:
- Select Source & Type: User picks a source host from a dropdown and selects the entity type (starting with “Instance”).
- Select Entities: A MaterialReactTable loads entities for the selected host with checkbox selection. Supports filtering, sorting, and pagination.
- Preview & Export: Two options:
  - Download JSON – Downloads the canonical snapshot as a .json file for cross-instance promotion.
  - Promote to Host – Select a target host and navigate to the Import page with the snapshot pre-loaded for dry run.
PromotionImport.tsx (/app/promotion/import)
Handles the import and execution workflow:
- Select Import Source: Upload a JSON file, or receive a snapshot from the Export page via navigation state.
- Dry Run Preview: After selecting a target host and clicking “Run Dry Run,” displays the diff plan:
  - New items (green) – Will be created.
  - Changed items (yellow) – Will be updated, with expandable field-level diffs.
  - Same items (gray) – No action needed.
  - Orphaned items (red) – Exist in target but not in source.
- Execute: For orphaned items, user selects Keep/Delete/Sync via radio buttons. Clicking “Execute Promotion” applies all changes.
PromotionHistory.tsx (/app/promotion/history)
A standard MaterialReactTable listing past promotions with columns: Source Host, Target Host, Entity Type, Status (color-coded chip), Created By, Timestamp, Promotion ID. Row action: View Details (navigates to diff view).
PromotionDiffView.tsx (/app/promotion/diff)
Displays detailed promotion metadata (source/target hosts, status, timestamps) and a table of all promotion items with expandable field-level diffs showing source vs. target values and per-item execution status.
Implementation Phases
- Phase 1 – UI Foundation: Create promotion pages, sidebar menu entry, route registration. (Completed)
- Phase 2 – Backend Services: Implement exportSnapshot, importDryRun, importExecute services, and promotion_t/promotion_item_t DDL.
- Phase 3 – Same-Instance Promotion: Integrate promotion tracking tables, add “Promote to Host” flow, orphan detection.
- Phase 4 – Additional Entity Types: Add support for rule_t, schema_t, api_t, config_t and extend the dependency resolver.
- Phase 5 – Global Migration Export: Implement dynamic table discovery for full-database migration (see below).
Global Migration Export
Motivation
The entity-level promotion (ExportSnapshot) is designed for selective promotion — the user picks specific entities (e.g., 3 instances) and promotes them from a lower environment to a higher one. For that use case, the export produces a rich nested JSON with children and dependencies, which requires hand-crafted exportXxxSnapshot() methods per entity type.
However, a full database migration has fundamentally different requirements:
- Scope: ALL entities across ALL entity types — not a user-selected subset.
- Maintainability: When new tables are added to the system, the migration should work automatically without code changes.
- Simplicity: A flat per-table export is sufficient since all data is exported together (no missing dependency risk).
Design: Dynamic Table Discovery
Instead of maintaining a manual list of entity types and per-type export methods, the Global Migration Export uses JDBC DatabaseMetaData against the PostgreSQL database to automatically discover and export all projection tables.
How It Works
- Discover all tables ending in _t in the public schema via DatabaseMetaData.getTables().
- Skip infrastructure tables that should never be exported:
  - event_store_t: immutable event log (events will be regenerated on import)
  - outbox_message_t: transient consumer outbox
  - consumer_offsets: operational state
  - consumer_lock: operational lock
  - promotion_t, promotion_item_t: promotion tracking (environment-specific)
- For each discovered table:
  - Inspect column metadata to detect if the table has host_id and active columns.
  - If active column exists: SELECT * FROM table_t WHERE active = TRUE [AND host_id = ?].
  - If no active column: SELECT * FROM table_t [WHERE host_id = ?].
  - Convert each row to Map<String, Object> with camelCase key names.
- Record a consistency marker: SELECT MAX(id) FROM event_store_t at the start of the export transaction to stamp the snapshot with the lastEventId.
- Use REPEATABLE READ transaction isolation for consistency across all tables (PostgreSQL MVCC ensures a frozen-in-time view even if events are being processed concurrently).
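The per-table decisions above (skip list, query shape, key naming) are small enough to sketch without a database. TableExportSql and its method names are hypothetical, and the sketch assumes lowercase snake_case column names, which is how PostgreSQL reports unquoted identifiers.

```java
import java.util.Set;

final class TableExportSql {
    /** Infrastructure tables from the skip list above. */
    static final Set<String> SKIPPED = Set.of(
            "event_store_t", "outbox_message_t", "consumer_offsets",
            "consumer_lock", "promotion_t", "promotion_item_t");

    /** A table is exported when it follows the _t naming rule and is not skipped. */
    static boolean isExportable(String table) {
        return table.endsWith("_t") && !SKIPPED.contains(table);
    }

    /** Builds the SELECT for one table; the name comes from metadata, not user input. */
    static String buildSelect(String table, boolean hasActive, boolean hasHostId) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        String glue = " WHERE ";
        if (hasActive) {
            sql.append(glue).append("active = TRUE");
            glue = " AND ";
        }
        if (hasHostId) {
            sql.append(glue).append("host_id = ?"); // bound as a parameter by the caller
        }
        return sql.toString();
    }

    /** Converts a snake_case column name such as config_id to configId. */
    static String toCamelCase(String column) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : column.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The hasActive and hasHostId flags would come from the column metadata inspection described in the steps above.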
Data Consistency Strategy
Querying projection tables directly is safe because:
- PostgreSQL MVCC: REPEATABLE READ gives every query in the transaction the same snapshot. PostgreSQL takes that snapshot at the first query in the transaction, so concurrent event processing does not affect the exported data.
- Atomic event application: Each event is applied via handleEvent() within its own transaction, so partial aggregate states are never visible.
- lastEventId marker: The export records the maximum event ID at the start of the export transaction, providing an auditable consistency boundary without the cost of event replay.
Why not replay events from event_store_t?
- The projection tables are the replayed event result, so re-replaying is redundant.
- handleEvent() has 120+ event type cases, so duplicating that logic in an in-memory replayer is impractical.
- Event replay would not unlock any consistency benefit beyond what MVCC already provides.
Output Format
{
"exportVersion": "1.0",
"sourceHostId": "N2CMw0HGQXeLvC1wBfln2A",
"lastEventId": "abc123...",
"exportTs": "2026-04-09T20:00:00Z",
"tables": {
"config_t": {
"count": 5,
"rows": [
{ "configId": "...", "configName": "...", "configPhase": "...", ... },
...
]
},
"user_t": {
"count": 12,
"rows": [
{ "userId": "...", "email": "...", "firstName": "...", ... },
...
]
},
"role_t": { ... },
"instance_t": { ... },
...
}
}
Key differences from the per-entity promotion export:
| Aspect | Per-Entity Promotion (ExportSnapshot) | Global Migration (ExportGlobalSnapshot) |
|---|---|---|
| Scope | User-selected entities | All active entities |
| Structure | Nested (parent/children/dependencies) | Flat per-table |
| New table support | Requires code changes | Automatic via DatabaseMetaData |
| Use case | Lower env → Higher env | Full database migration |
| Output | Entity-centric JSON | Table-centric JSON |
| Import mechanism | Same-instance via promotion_t or Cross-instance via JSON | Cross-instance via JSON only |
Import: Event-Based Migration (Refined in Phase 2.5)
To ensure maximum compatibility and maintain the integrity of the event-sourced system, the global import process follows a 3-step pipeline:
Source DB → Export (Flat JSON) → Convert to Events (Ordered JSON) → Import (Target DB)
1. Snapshot-to-Events Conversion
An intermediate step (ConvertSnapshotToEvents) transforms the flat table-centric snapshot into an ordered JSON array of CloudEvents. This format is 100% compatible with the existing event-importer CLI tool (matching the 00-bootstrap.json structure).
2. Topological Sequencing (Dependency Awareness)
Since a full migration often involves complex relationships, the converter is “Relationship Aware.” It uses DatabaseMetaData.getImportedKeys() to dynamically discover parent→child dependencies.
- Topological Sort: It implements Kahn’s algorithm to order events such that parent entities (e.g., Org, Host, User, Role) are processed before their children (e.g., UserHost, RoleUser, AuthProviderClient).
- Dynamic: This approach handles new tables and FK constraints automatically without requiring code changes to a “hard-coded” dependency list.
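A compact version of Kahn’s algorithm over a foreign-key map looks like the sketch below. TableSorter is a hypothetical name, and the input map stands in for what the converter would collect from DatabaseMetaData.getImportedKeys(): each table mapped to the set of tables its foreign keys reference.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TableSorter {
    /** Orders tables so that every parent precedes the children that reference it. */
    static List<String> sort(Map<String, Set<String>> parentsOf) {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        Map<String, List<String>> childrenOf = new LinkedHashMap<>();
        for (String table : parentsOf.keySet()) {
            inDegree.put(table, 0);
            childrenOf.put(table, new ArrayList<>());
        }
        for (Map.Entry<String, Set<String>> entry : parentsOf.entrySet()) {
            for (String parent : entry.getValue()) {
                // Ignore self-references and parents outside the exported set.
                if (parent.equals(entry.getKey()) || !inDegree.containsKey(parent)) {
                    continue;
                }
                childrenOf.get(parent).add(entry.getKey());
                inDegree.merge(entry.getKey(), 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((table, degree) -> {
            if (degree == 0) {
                ready.add(table); // no unprocessed parents
            }
        });
        List<String> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            String table = ready.remove();
            ordered.add(table);
            for (String child : childrenOf.get(table)) {
                if (inDegree.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (ordered.size() != inDegree.size()) {
            throw new IllegalStateException("Foreign-key cycle detected");
        }
        return ordered;
    }
}
```

A cycle between tables leaves some nodes with a non-zero in-degree, which is why the final size check is needed; how the real converter handles cyclic foreign keys is not specified here.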
3. Batch Replay & Reconciliation
The import handler performs a batch insertion of these generated events into event_store_t and outbox_message_t within a single transaction.
- Nonce Re-calculation: Nonces are re-calculated on the target system during import to ensure uniqueness.
- Automatic Projections: Inserting into the outbox triggers the DbEventConsumerStartupHook to rebuild all materialized projection tables on the target system.
Service API Contract
- Export:
  - Handler: GlobalSnapshotExport (user-query)
  - Service ID: lightapi.net/user/exportGlobalSnapshot/0.1.0
  - Request: { "sourceHostId": "...", "entityTypes": [...] }
  - Response: Canonical snapshot JSON (flat tables)
- Convert (New):
  - Handler: ConvertSnapshotToEvents (user-query)
  - Service ID: lightapi.net/user/convertSnapshotToEvents/0.1.0
  - Request: { "snapshot": "...", "targetHostId": "...", "adminUserId": "..." }
  - Response: JSON array of ordered CloudEvents (event-importer compatible)
- Import:
  - Handler: GlobalSnapshotImport (user-command)
  - Service ID: lightapi.net/user/importGlobalSnapshot/0.1.0
  - Request: { "targetHostId": "...", "snapshot": "...", "entityTypes": [...] }
  - Response: { "imported": 42, "total": 42 }
Implementation Phases (Updated)
- Phase 1 – UI Foundation: Create promotion pages, sidebar menu entry. (Completed)
- Phase 2 – Global Export: Implement dynamic table discovery via JDBC metadata. (Completed)
- Phase 2.5 – Global Migration Step: Implement Topological Sorting and Snapshot-to-Events conversion for CLI compatibility. (Completed)
- Phase 3 – Entity Promotion (Selective): Implement recursive bundling for user-selected entities (e.g., Instance export).
- Phase 4 – Same-Instance Tracking: Integrate promotion_t tracking for in-DB moves.
Deployment Workflow
Light Portal manages product, API, application, instance, runtime configuration, and deployment metadata for multiple tenants. The deployment workflow extends that model so a user can deploy a configured instance to a Kubernetes cluster from the Instance Admin page.
The goal is to provide a production-like deployment path for small businesses and enterprise tenants without requiring Light Portal to have direct network access to every customer cluster.
Problem
Each API or application repository can contain a k8s/ folder with Kubernetes
deployment templates. The templates contain variables in the following format:
${key:defaultValue}
For each configured portal instance, Light Portal can generate a values.yml
document that contains deployment-time values such as image URL, namespace,
replica count, service ports, config references, resource limits, ingress host,
and rollout options.
When a user clicks the Deployment button for an instance, the system should:
- Resolve the target instance and deployment environment.
- Generate or fetch the instance deployment values.yml.
- Send a deployment command to a deployer that can access the target Kubernetes cluster.
- Render the final Kubernetes manifests from the repository templates.
- Validate and apply the manifests.
- Track rollout status and return deployment results to Light Portal.
Recommended Architecture
The recommended default is to run a small Rust deployer inside each target Kubernetes cluster.
Light Portal
|
| deployment request / status query
v
Light Controller
|
| outbound WebSocket session / MCP tool call
v
In-cluster Rust Deployer Pod
|
| Kubernetes API via in-cluster ServiceAccount
v
Customer Kubernetes Cluster
This is similar to the agent model used by GitOps and cloud management systems: the cluster-local agent connects outbound to the control plane and performs cluster operations using tightly scoped Kubernetes RBAC.
Why In-Cluster Deployer
Running the deployer inside the cluster should be the default for production.
Kubernetes Authentication
An in-cluster deployer can use Kubernetes in-cluster configuration. The Rust
service can use kube-rs and let the client infer its configuration from the pod
environment, for example with Client::try_default(). Kubernetes mounts a
ServiceAccount token into the pod, so no external kubeconfig file needs to be
copied, stored, rotated, or exposed.
Least-Privilege RBAC
The deployer should run as a dedicated ServiceAccount with only the permissions needed for the namespaces and resources it manages. If a deployer is compromised, the blast radius is limited by Kubernetes RBAC.
For a small-business deployment, the first version can bind the deployer to a dedicated namespace. For managed enterprise environments, the portal can create one deployer per cluster or per tenant namespace.
Firewall Traversal
Many customer clusters are behind firewalls or corporate networks. An in-cluster deployer can open an outbound WebSocket connection to Light Controller. This avoids inbound firewall rules and allows Light Portal to manage deployments without direct access to the Kubernetes API server.
Operational Simplicity
Customers do not need to run a separate VM or keep a standalone deployment process alive. They install the deployer with one Kubernetes YAML file or Helm chart, and Kubernetes restarts it if it fails.
Deployment Transports
The deployment system should support two transports.
Controller-Mediated WebSocket
This is the preferred transport for private customer environments.
- The deployer pod starts inside the customer cluster.
- It registers with Light Controller over an outbound WebSocket.
- The controller authenticates the deployer and records its tenant, cluster, environment, capabilities, and current status.
- Light Portal sends deployment commands to the controller.
- The controller forwards the command to the deployer using MCP-style tool calls over the existing session.
- The deployer streams status back through the controller.
This mode works when Light Portal cannot reach the customer environment.
Direct Deployer URL
This is useful for local MicroK8s, managed clusters, and environments where Light Portal can reach the deployer directly.
The deployer URL can be stored in deployment configuration or config server metadata. Light Portal or the workflow engine can call the deployer’s API/MCP endpoint directly.
Direct mode should be treated as an optimization, not the primary model for customer-managed private networks.
Deployer Responsibilities
The deployer is intentionally narrow. It should not own tenant configuration or business workflow decisions. It executes deployment instructions and reports results.
The deployer should support these actions:
- render: Fetch templates and values, render manifests, and return a manifest summary.
- dryRun: Render manifests and validate them against the Kubernetes API without applying changes.
- deploy: Apply manifests and wait for rollout status.
- redeploy: Re-apply manifests and trigger rollout if needed.
- undeploy: Delete resources created by the deployment.
- status: Return current Kubernetes resource and rollout status.
- logs: Return recent pod logs for the deployed instance.
- rollback: Redeploy a previous Light Portal deployment snapshot.
The first implementation should include render, dryRun, deploy, undeploy, and
status.
Rollback should be implemented through Light Portal deployment history, not native Kubernetes rollout undo. Native Kubernetes rollback only reverts the Deployment pod template and does not reliably revert associated ConfigMaps, Secrets, or deployment values. A Light Portal rollback should redeploy a previous immutable deployment snapshot so pods, config, environment variables, and related resources return to the same known state.
Deployment Request
A deployment request should be explicit and auditable.
requestId: 01964b05-0000-7000-8000-000000000001
hostId: 01964b05-552a-7c4b-9184-6857e7f3dc5f
instanceId: petstore-dev
environment: dev
clusterId: microk8s-local
namespace: petstore-dev
action: deploy
valuesRef:
  source: config-server
  path: /deployments/petstore-dev/values.yml
template:
  repoUrl: https://github.com/lightapi/petstore-api.git
  ref: main
  path: k8s
options:
  dryRun: false
  waitForRollout: true
  timeoutSeconds: 300
The request should be created by Light Portal and persisted as deployment history before it is sent to the deployer.
Values File
The values.yml is instance-specific. It should contain all values needed to
render Kubernetes templates for one deployment target.
image:
repository: ghcr.io/lightapi/petstore-api
tag: 1.0.0
deployment:
replicas: 2
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
service:
port: 8080
ingress:
enabled: true
host: petstore-dev.example.com
config:
snapshotId: petstore-dev-20260427
configServerUrl: https://config.lightapi.net
template:
repoUrl: https://github.com/lightapi/petstore-api.git
ref: main
path: k8s
The deployer can receive the values inline or fetch them from config server
using the valuesRef in the deployment request.
Config Server should be the authoritative source of truth for deployment
values. At deployment time, Light Portal should create an immutable snapshot of
both the deployment values.yml and the runtime configuration values.yml.
That snapshot is the deployment evidence. If a deployment fails or must be
audited later, the team must be able to reconstruct exactly which values were
used even if the current config has changed.
Light Portal should persist the snapshot reference and hash in deployment history. It should not rely only on a mutable config path.
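One way to produce the hash stored alongside the snapshot reference is a content digest of the values document. The sketch below assumes SHA-256 over the UTF-8 bytes of the file; the document does not name a hash algorithm, and SnapshotHasher is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SnapshotHasher {
    /** Returns the lowercase hex SHA-256 digest of the snapshot content. */
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the digest covers the exact bytes, any later edit to the values at the same config path yields a different hash, which is what makes the stored value usable as deployment evidence.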
Template Rendering
The initial template format can use simple placeholders:
image: ${image.repository}:${image.tag}
replicas: ${deployment.replicas:1}
The renderer should support nested keys and defaults. If a key is missing and no default is provided, rendering should fail.
The deployer should render manifests in memory and avoid writing generated YAML to disk unless debug mode is explicitly enabled.
Longer term, the deployer can support additional renderers:
- Built-in ${key:default} renderer for simple service templates.
- Kustomize for standard Kubernetes overlays.
- Helm for teams that already maintain charts.
The built-in renderer should be deterministic and small. It should not evaluate arbitrary code.
Do not use raw string replacement or regex replacement against raw YAML text. YAML is indentation sensitive, and multi-line values, certificates, JSON strings, and embedded config blocks can break when substituted as plain text.
The preferred first renderer is a constrained internal AST renderer:
- Parse each template document with serde_yaml into serde_yaml::Value.
- Recursively traverse the YAML value tree.
- Resolve placeholders only inside string scalar values.
- Replace ${key:default} with values from the structured deployment values.
- Serialize the YAML value back to YAML or convert it directly to Kubernetes dynamic objects.
This avoids most quoting, escaping, and indentation bugs because YAML parsing and serialization remain responsible for formatting. It also keeps the renderer small and prevents arbitrary code execution.
The implementation must include tests for ConfigMap multi-line blocks, JSON strings, certificate-shaped values, and Secret references before production use.
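The placeholder rule itself (nested keys, optional default, failure when neither resolves) can be illustrated independently of the YAML traversal. The sketch below is written in Java to match the other examples in this document, whereas the deployer is planned in Rust; PlaceholderResolver is a hypothetical name. It operates on one already-parsed string scalar, not on raw YAML text, so the warning above about regex replacement against raw YAML does not apply to it.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderResolver {
    // Matches ${key} or ${key:default}; the key may not contain ':' or '}'.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    /** Resolves every placeholder in one string scalar against nested values. */
    static String resolve(String scalar, Map<String, Object> values) {
        Matcher matcher = PLACEHOLDER.matcher(scalar);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            Object value = lookup(values, matcher.group(1));
            if (value == null) {
                value = matcher.group(2); // fall back to the default, if one was given
            }
            if (value == null) {
                throw new IllegalArgumentException("Missing value for key: " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value.toString()));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Walks a dotted key such as image.tag through nested maps; null when absent. */
    private static Object lookup(Map<String, Object> values, String dottedKey) {
        Object current = values;
        for (String part : dottedKey.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(part);
        }
        return current;
    }
}
```

In the AST approach described above, a function like this would be called once per string scalar while walking the parsed value tree, and serialization would then handle quoting and indentation.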
Kubernetes Execution
The Rust deployer should prefer kube-rs and the Kubernetes API over shelling
out to kubectl.
Benefits:
- no kubectl binary dependency
- structured errors
- easier dry-run and rollout status handling
- better control over authentication and namespaces
- safer request construction
kubectl can remain a diagnostic or fallback mode, but it should not be the
default production implementation.
The deployer should use Kubernetes server-side dry run for validation:
dryRun=All
For apply, use server-side apply when possible so the deployer has a clear field manager identity.
The field manager must be explicit, for example:
fieldManager=light-deployer
Using a stable field manager is important for coexistence with other Kubernetes controllers. For example, a Horizontal Pod Autoscaler may own Deployment replica changes. Server-side apply helps the deployer avoid accidentally overwriting fields owned by other managers.
For rollout status, the deployer should use the Kubernetes watch API rather than only polling logs. The portal user experience should show resource status transitions such as:
Pending -> ContainerCreating -> Running -> Ready
Streaming watch events through the deployer gives Light Portal a precise deployment timeline similar to a CI/CD job log while still preserving structured Kubernetes state.
Security Model
Security is the central design constraint because this component can mutate a customer cluster.
Authentication
The deployer must authenticate to Light Controller or Light Portal before it can receive commands. Recommended options:
- mTLS for deployer-to-controller registration
- signed JWT enrollment token for first registration
- short-lived command tokens issued by Light Portal
The deployer should have a stable deployerId and should report cluster,
namespace, version, and capability metadata during registration.
Authorization
Light Portal must verify that the requesting user can deploy the target instance, environment, and tenant. The deployer must also enforce local constraints:
- allowed namespaces
- allowed repository hosts and repository names
- allowed image registries
- allowed Kubernetes resource kinds
- allowed actions
The deployer should reject commands outside its configured policy even if the portal sends them.
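A local policy check of this kind can be sketched as a small allow-list evaluation. The example is in Java for consistency with the other examples in this document (the deployer is planned in Rust), DeployerPolicy is a hypothetical name, and only three of the five constraints are shown; repository and image registry checks would follow the same pattern.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

final class DeployerPolicy {
    private final Set<String> allowedNamespaces;
    private final Set<String> allowedActions;
    private final Set<String> allowedKinds;

    DeployerPolicy(Set<String> allowedNamespaces, Set<String> allowedActions, Set<String> allowedKinds) {
        // Defensive copies; HashSet also tolerates null lookups, which then fail the check.
        this.allowedNamespaces = new HashSet<>(allowedNamespaces);
        this.allowedActions = new HashSet<>(allowedActions);
        this.allowedKinds = new HashSet<>(allowedKinds);
    }

    /** Returns a rejection reason, or empty when the command passes every local check. */
    Optional<String> reject(String namespace, String action, Set<String> manifestKinds) {
        if (!allowedNamespaces.contains(namespace)) {
            return Optional.of("namespace not allowed: " + namespace);
        }
        if (!allowedActions.contains(action)) {
            return Optional.of("action not allowed: " + action);
        }
        for (String kind : manifestKinds) {
            if (!allowedKinds.contains(kind)) {
                return Optional.of("resource kind not allowed: " + kind);
            }
        }
        return Optional.empty();
    }
}
```

The check is deny-by-default: anything not listed is rejected, regardless of what the portal has already authorized.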
RBAC
For namespace-scoped deployments, prefer Role and RoleBinding over
ClusterRole and ClusterRoleBinding.
Version 1 should allow only application-level resource kinds:
- Deployment
- Service
- Ingress
- ConfigMap
- Secret
Version 1 should explicitly block cluster-scoped and control-plane resources, including:
- Namespace
- ClusterRole
- ClusterRoleBinding
- CustomResourceDefinition
- admission webhooks
This keeps the default deployer RBAC narrow and supports least-privilege customer installations.
Example namespace-scoped installation:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: light-portal-deployer
  namespace: petstore-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: light-portal-deployer
  namespace: petstore-dev
rules:
  - apiGroups: ["", "apps", "networking.k8s.io"]
    resources: ["deployments", "services", "ingresses", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: light-portal-deployer
  namespace: petstore-dev
subjects:
  - kind: ServiceAccount
    name: light-portal-deployer
    namespace: petstore-dev
roleRef:
  kind: Role
  name: light-portal-deployer
  apiGroup: rbac.authorization.k8s.io
Secrets should be handled carefully. Avoid logging rendered manifests that contain secret values. Prefer references to existing Kubernetes Secrets, External Secrets, Sealed Secrets, or config-server secret references resolved inside the deployer.
The Rust implementation must also avoid logging raw Kubernetes apply payloads.
When using tracing or log, never log full kube-rs request objects,
patches, or serialized manifests for Secret resources. Kubernetes Secret
values are base64 encoded, not encrypted, and will leak credentials if written
to pod stdout.
Deployment Pod
The deployer can be installed as a Kubernetes Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: light-portal-deployer
  namespace: petstore-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: light-portal-deployer
  template:
    metadata:
      labels:
        app: light-portal-deployer
    spec:
      serviceAccountName: light-portal-deployer
      containers:
        - name: deployer
          image: ghcr.io/lightapi/light-portal-deployer:0.1.0
          env:
            - name: LIGHT_CONTROLLER_WS_URL
              value: wss://controller.lightapi.net/deployer/ws
            - name: DEPLOYER_ID
              value: petstore-dev-microk8s
            - name: DEPLOYER_TOKEN
              valueFrom:
                secretKeyRef:
                  name: light-portal-deployer-credentials
                  key: token
            - name: ALLOWED_NAMESPACES
              value: petstore-dev
Portal Workflow
The Instance Admin Deployment button should not synchronously run deployment logic in the browser request. It should create a deployment request and trigger an asynchronous workflow.
Recommended flow:
- User clicks Deployment for an instance.
- Portal validates authorization.
- Portal resolves instance, environment, product version, image, config snapshot, and template repository.
- Portal creates a deployment request row/event.
- Portal snapshots deployment values and runtime values.
- Portal or workflow engine runs dryRun.
- If the target environment requires approval, workflow waits for human approval.
- Workflow calls deploy.
- Deployer streams events: render complete, dry-run complete, apply started, pod phase changes, rollout progressing, rollout complete or failed.
- Portal updates deployment history and status.
- User can inspect rendered manifest summary, rollout status, pod status, and logs.
This fits the agentic workflow model. The workflow can ask the user to approve the rendered changes before applying them.
Approval should be configurable at the environment level. Development and test environments can allow automatic deployment. Production environments should normally require manual approval through Light Portal or an agentic workflow ask task.
Status And Audit
Light Portal should persist deployment history.
Suggested fields:
- deploymentId
- hostId
- instanceId
- environment
- clusterId
- namespace
- action
- status
- requestUser
- deployerId
- templateRepoUrl
- templateRef
- templatePath
- valuesHash
- valuesSnapshotId
- runtimeValuesHash
- runtimeValuesSnapshotId
- manifestHash
- templateCommitSha
- resourceSummary
- imageRepository
- imageTag
- startedTs
- completedTs
- errorMessage
The deployer should return enough detail to reproduce the deployment intent without storing secrets.
Light Portal should store only the rendered manifest hash, Git commit SHA, and a redacted resource summary. It should not store full rendered YAML in the database because rendered manifests can contain environment variables, connection strings, or credentials.
Example resource summary:
[
{"kind": "Deployment", "namespace": "petstore-dev", "name": "petstore"},
{"kind": "Service", "namespace": "petstore-dev", "name": "petstore"}
]
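A redacted summary like the one above could be derived from already-parsed manifest objects by copying only identifying fields. This is a minimal sketch, not the deployer's implementation: summarizeResources is a hypothetical name, and falling back to the "default" namespace when a manifest omits one is an assumption.

```javascript
// Build a redacted resource summary from parsed Kubernetes manifests.
// Only kind, namespace, and name are copied; spec, data, and env are dropped,
// so secrets embedded in the manifest never reach the summary.
function summarizeResources(manifests) {
  return manifests.map((manifest) => ({
    kind: manifest.kind,
    // Assumption: a manifest without a namespace targets "default".
    namespace: manifest.metadata?.namespace ?? "default",
    name: manifest.metadata?.name,
  }));
}
```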
Multi-Tenant Considerations
In the small-business cloud service, multiple tenants may share Light Portal but deploy to separate clusters or namespaces.
Rules:
- Tenant identity must be present in every deployment request.
- A deployer must be bound to one tenant boundary. In most installations, that means one tenant namespace or a tightly controlled set of namespaces owned by that tenant.
- Do not share one deployer across unrelated tenants.
- Namespace policy must be enforced both by portal authorization and deployer local policy.
- Deployment history must be filtered by hostId.
- A compromised deployer must not be able to receive commands for another tenant.
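The deployer-side half of these rules can be sketched as a local policy check that runs before any command is executed. The deployer and command shapes, the allowedNamespaces field, and the authorizeCommand function are all hypothetical names chosen for illustration.

```javascript
// Deployer local policy: reject any command outside the deployer's tenant boundary.
function authorizeCommand(deployer, command) {
  // Tenant identity must be present in every deployment request.
  if (!command.hostId) {
    return { allowed: false, reason: "missing hostId" };
  }
  // A deployer is bound to exactly one tenant.
  if (command.hostId !== deployer.hostId) {
    return { allowed: false, reason: "tenant mismatch" };
  }
  // Namespace policy is enforced locally, independent of portal authorization.
  if (!deployer.allowedNamespaces.includes(command.namespace)) {
    return { allowed: false, reason: "namespace not allowed" };
  }
  return { allowed: true };
}
```

Because the check runs inside the deployer, a mistake in portal authorization does not by itself let a command cross the tenant boundary.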
Failure Handling
The deployer should classify failures:
- template repository fetch failure
- values file fetch failure
- render failure
- manifest validation failure
- Kubernetes API authorization failure
- apply failure
- rollout timeout
- health check failure
- controller WebSocket disconnected
- deployer registration rejected
Each failure should include a safe message and diagnostic metadata. Secret values must be redacted.
For controller-mediated deployments, the deployer must have a resilient WebSocket lifecycle. If Light Controller restarts or the network drops, the deployer should not crash. It should reconnect with exponential backoff and jitter, re-register after reconnecting, and resume accepting commands only after the controller confirms the deployer session.
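The reconnect delay can be computed with capped exponential backoff and full jitter. The sketch below is written in JavaScript for illustration only, since the deployer itself is planned in Rust. The base delay, the cap, and the function name are assumptions, and the random source is injectable so the behavior can be checked deterministically.

```javascript
// Capped exponential backoff with full jitter.
// attempt is 0 for the first reconnect, 1 for the second, and so on.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 30000, random = Math.random } = {}) {
  // The ceiling doubles on every attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a delay uniformly between 0 and the ceiling so that
  // many deployers do not reconnect to the controller at the same instant.
  return Math.floor(random() * ceiling);
}
```

After a successful reconnect the attempt counter would be reset to zero, and the deployer would re-register before accepting commands.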
First Implementation
The first implementation should target local MicroK8s and direct feedback in Light Portal.
Phase 1:
- Create Rust deployer service.
- Run it inside MicroK8s.
- Support direct API mode for local testing.
- Implement render, dryRun, deploy, undeploy, and status.
- Use kube-rs and in-cluster ServiceAccount authentication.
- Support built-in ${key:default} rendering.
- Add deployment request and deployment history tables/events.
- Add Instance Admin deployment request flow.
Phase 2:
- Add controller-mediated WebSocket registration.
- Expose deployer operations as MCP tools through the controller.
- Stream deployment progress and Kubernetes watch events to Light Portal.
- Implement exponential backoff reconnect and re-registration.
- Add approval step through agentic workflow.
Phase 3:
- Add Helm/Kustomize renderer support if needed.
- Add rollback support.
- Add multi-cluster inventory and deployer health view.
- Add deployment policy and quota enforcement.
Resolved Design Decisions
- Config Server is the authoritative source of truth for values. Each deployment stores immutable deployment and runtime values snapshot references plus hashes.
- Light Portal stores rendered manifest hash, template Git commit SHA, and redacted resource summary. It does not store full rendered manifests by default.
- Regulated environments can add an opt-in enterprise artifact mode that stores the full rendered manifest in encrypted object storage with strict retention. Full manifests should stay out of the relational database.
- Deployment approval is configured at the environment level. Production should require approval by default.
- Deployers are installed per tenant boundary and should not be shared across unrelated tenants.
- Version 1 allows only application-level resources: Deployment, Service, Ingress, ConfigMap, and Secret.
- The first renderer should be a constrained internal AST renderer based on serde_yaml, not raw text replacement.
- The direct deployer URL mode should expose MCP immediately, using the same internal tool implementation that controller-mediated WebSocket mode will use later.
- Rollback is a redeploy of a previous Light Portal deployment snapshot, not a native Kubernetes rollout undo.
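To illustrate the difference between AST rendering and raw text replacement for ${key:default} placeholders, the sketch below walks an already-parsed document tree and substitutes only inside string scalars. It is written in JavaScript for illustration, whereas the planned renderer is Rust with serde_yaml. The renderNode name, the allowed key characters, and the rule that a missing key without a default is an error are assumptions.

```javascript
// Matches ${key} and ${key:default}. The key character set is an assumption.
const PLACEHOLDER = /\$\{([A-Za-z0-9_.-]+)(?::([^}]*))?\}/g;

// Recursively render a parsed manifest tree. Map keys and the tree shape are
// never changed, so a value cannot inject new YAML structure.
function renderNode(node, values) {
  if (typeof node === "string") {
    return node.replace(PLACEHOLDER, (match, key, fallback) => {
      if (Object.prototype.hasOwnProperty.call(values, key)) {
        return String(values[key]);
      }
      if (fallback !== undefined) {
        return fallback;
      }
      // Assumption: a placeholder with no value and no default fails the render.
      throw new Error(`missing value for ${key}`);
    });
  }
  if (Array.isArray(node)) {
    return node.map((item) => renderNode(item, values));
  }
  if (node !== null && typeof node === "object") {
    return Object.fromEntries(
      Object.entries(node).map(([k, v]) => [k, renderNode(v, values)])
    );
  }
  // Numbers, booleans, and null pass through unchanged.
  return node;
}
```

In this sketch substituted values always remain strings; a real renderer would need an explicit rule for typed scalars such as replica counts.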
Open Questions
- Which object storage providers should enterprise artifact mode support first?
- What retention policies should be available for encrypted rendered manifest artifacts?
- Should direct MCP use streamable HTTP only, or should it also expose SSE for long-running deployment progress events?
- Should rollback require the same environment-level approval policy as deploy?
Recommendation
Use an in-cluster Rust deployer as the default production model. The deployer should connect outbound to Light Controller and execute deployment commands via MCP-style tools. Direct deployer URL mode is useful for MicroK8s and managed environments but should be secondary. The MCP tool implementation should be shared by both transports from the beginning.
Use kube-rs instead of shelling out to kubectl for the production execution
path. Keep the deployer small, policy-bound, and auditable. Let Light Portal own
deployment intent and history, while the deployer owns safe cluster-local
execution.
Portal View Help
This is the fallback help page for portal-view.
Use this section when a page, form, or task does not yet have a more specific
help page. The contextual help design expects the portal UI to link here when a
specific helpPath is missing.
Common Starting Points
- Pages explain what a screen is used for and which actions are available.
- Forms explain when to submit a command and what happens after submission.
- Tasks explain multi-step workflows that span pages and forms.
- Concepts explain reusable portal ideas such as ownership, hosts, and API versioning.
Portal View Page Help
This section contains page-level help for portal-view.
Page help should explain what the screen is for, who can access it, which records are visible, and which common actions are available.
API Catalog
Use API Catalog to browse APIs that are ready for consumer discovery.
The catalog is backed by the same API records, categories, and tags used by API Admin and the API create/update forms. It is intended for browse and discovery, not bulk administration.
Common filters:
- search text for API name, id, and description
- categories for stable browse buckets
- grouped tags for capability, protocol, lifecycle, security, runtime, domain, consumer, operations, and integration facets
- active or inactive status
- sort and card/list view options
Catalog cards show a compact operational summary:
- active API version count and latest version
- endpoint count across active versions
- runtime bindings through instance APIs
- access-control coverage from active endpoint rules
Common actions:
- open API details and versions
- open endpoints for the latest active version
- create a new API version
- update API metadata when you own the API or have API administrator access
- continue related publish, MCP onboarding, or access-control tasks
API Admin
Use API Admin to create, review, update, and retire APIs owned by your team or visible to your administrator role.
This page is owner-aware. Regular users should see only APIs they own or can access through their position. API administrators can see all APIs for the host.
Common actions:
- create a new API
- update API metadata
- open API versions
- link the API into onboarding or marketplace tasks
API Detail
Use API Detail to review API versions and version-specific integration details.
This page helps users move from a business API record to the concrete API versions that can be linked to instances, MCP tools, marketplace listings, or access control rules.
Common actions:
- create an API version
- update version metadata
- review endpoint and scope details
- start related task flows from the selected API version
App Admin
Use App Admin to manage client applications that own OAuth clients and instance application links.
This page is owner-aware. Regular users should see only apps they own or can access through their position. App administrators can see all apps for the host.
Common actions:
- create a client app
- update app metadata
- open OAuth clients for the app
- link the app to an instance
OAuth Client
Use OAuth Client to create and manage OAuth clients for applications, APIs, or instances.
This page is owner-aware. Regular users should see only OAuth clients they own or can access through their position. OAuth client administrators can see all OAuth clients for the host.
Common actions:
- create an OAuth client
- update client metadata
- review scopes and token-exchange settings
- open client tokens
OAuth Client Token
Use OAuth Client Token to create and review long-lived client tokens.
Tokens are sensitive. Users should create tokens only for clients they own or are authorized to manage. Administrators can review all client tokens for the host when their role allows it.
Common actions:
- create a client token
- review token metadata
- delete or rotate tokens according to operational policy
Instance Admin
Use Instance Admin to manage service instances for the current host.
This page is owner-aware. Regular users should see only instances they own or can access through their position. Instance administrators can see all instances for the host.
Common actions:
- create an instance
- update instance metadata
- review linked APIs and apps
- open runtime endpoints and configuration links
Runtime Instance
Use Runtime Instance to review runtime endpoints for services.
Runtime instances describe where a service is running and how the portal can reach it for deployment, gateway, or operational workflows.
Common actions:
- create a runtime endpoint
- update endpoint status and connection details
- review active runtime records for an instance or service
Instance API
Use Instance API to link API versions to service instances.
This relationship tells the portal which API version is served by which instance and is used by gateway, MCP, configuration, and access-control tasks.
Common actions:
- link an API version to an instance
- review existing instance API links
- open path prefixes or MCP tool mappings
Instance API Path Prefix
Use Instance API Path Prefix to manage route prefixes for an API version linked to an instance.
Path prefixes help gateways and tools route traffic to the correct API surface.
Common actions:
- add a path prefix
- update a path prefix ownership position
- review prefixes for an instance API link
Instance App
Use Instance App to link client apps to service instances.
This relationship is used when an application needs to interact with a deployed instance and related APIs.
Common actions:
- link an app to an instance
- review app links for an instance
- open app API relationship records
Instance App API
Use Instance App API to link an instance app relationship to an instance API relationship.
This page connects which app can use which API on a specific service instance.
Common actions:
- create an instance app API link
- review existing links
- open configuration for the relationship
Schedule Admin
Use Schedule Admin to create and manage scheduled portal events.
This page is owner-aware. Regular users should see only schedules they own or can access through their position. Schedule administrators can see all schedules for the host.
Common actions:
- create a schedule
- update schedule timing or event data
- delete schedules no longer needed
Workflow Definition
Use Workflow Definition to create and manage workflow definitions.
Workflow definitions describe repeatable processes that can be started manually or triggered by other portal events.
Common actions:
- create a workflow definition
- update workflow YAML
- start or review related workflow execution records
Portal View Form Help
This section contains form-level help for generated and custom portal-view
forms.
Form help should explain when to use the form, what happens after submit, important required fields, important optional fields, ownership behavior, and common validation problems.
Create API
Use this form to register a new API record for the current host.
After submission, the API becomes available for version creation, marketplace publishing, MCP onboarding, instance links, and access-control tasks.
Important fields:
- apiId: stable API identifier for the host
- apiName: user-facing API name
- apiStatus: lifecycle status
- ownerPositionId: optional position owner for team access
Update API
Use this form to update API metadata.
Updating an API changes descriptive and ownership metadata for the API record. It does not replace the API version specification.
Important fields:
- apiName: user-facing API name
- apiStatus: lifecycle status
- ownerPositionId: optional position owner for team access
Create API Version
Use this form to add a version to an existing API.
After submission, the API version can be linked to instances, gateway flows, MCP tools, marketplace publishing, and access-control rules.
Important fields:
- apiId: parent API
- apiVersion: version label
- apiType: API style such as OpenAPI, GraphQL, Hybrid, or MCP
- serviceId: backing service identifier
- spec: API specification text, or MCP tools/list JSON output for MCP API versions
- transportConfig: MCP transport and URL when apiType is MCP
- ownerPositionId: optional position owner for team access
MCP Tool Discovery
For MCP API versions, there are two ways to populate tools:
- If the portal service can reach the MCP server, select MCP as the API Type and fill transportConfig, for example {"transport":"streamable http","url":"http://localhost:5000/mcp"}.
- If the portal service cannot reach the MCP server because of firewall or security boundaries, call the MCP server yourself and paste the response into spec.
Example manual discovery call:
curl --location --request POST 'http://localhost:5000/mcp' \
--header 'Content-Type: application/json' \
--data-raw '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
Paste the response into Spec / MCP Tools JSON. The form accepts any of these
payload shapes.
Full JSON-RPC response:
{
"jsonrpc": "2.0",
"result": {
"tools": [
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
},
"id": 1
}
Object with a top-level tools array:
{
"tools": [
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
}
Raw tools array:
[
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
Keep transportConfig populated with the real MCP transport and URL when the runtime still needs it for invocation.
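The three accepted payload shapes can be reduced to one tools array with a small normalizer. This is a sketch of the idea only; how the form actually parses the field is not described here, and extractTools is a hypothetical name.

```javascript
// Normalize the pasted Spec / MCP Tools JSON into a plain array of tools.
function extractTools(specText) {
  const payload = JSON.parse(specText);
  // Shape 3: raw tools array.
  if (Array.isArray(payload)) {
    return payload;
  }
  // Shape 2: object with a top-level tools array.
  if (Array.isArray(payload?.tools)) {
    return payload.tools;
  }
  // Shape 1: full JSON-RPC response with result.tools.
  if (Array.isArray(payload?.result?.tools)) {
    return payload.result.tools;
  }
  throw new Error("no tools array found in pasted JSON");
}
```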
Update API Version
Use this form to update API version metadata and integration details.
Updating a version can affect downstream instance links, gateway behavior, and task flows that reference the API version.
Important fields:
- apiVersion: version label
- apiType: API style
- serviceId: backing service identifier
- spec: API specification text, or MCP tools/list JSON output for MCP API versions
- transportConfig: MCP transport and URL when apiType is MCP
- protocol, envTag, and targetHost: runtime routing details
- ownerPositionId: optional position owner for team access
MCP Tool Discovery
For MCP API versions, there are two ways to refresh tools:
- If the portal service can reach the MCP server, select MCP as the API Type and fill transportConfig, for example {"transport":"streamable http","url":"http://localhost:5000/mcp"}.
- If the portal service cannot reach the MCP server because of firewall or security boundaries, call the MCP server yourself and paste the response into spec.
Example manual discovery call:
curl --location --request POST 'http://localhost:5000/mcp' \
--header 'Content-Type: application/json' \
--data-raw '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
Paste the response into Spec / MCP Tools JSON. The form accepts any of these
payload shapes.
Full JSON-RPC response:
{
"jsonrpc": "2.0",
"result": {
"tools": [
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
},
"id": 1
}
Object with a top-level tools array:
{
"tools": [
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
}
Raw tools array:
[
{
"name": "echo",
"description": "Echoes back the input",
"inputSchema": {
"type": "object",
"properties": {
"message": {
"type": "string"
}
},
"required": [
"message"
]
}
}
]
Keep transportConfig populated with the real MCP transport and URL when the runtime still needs it for invocation.
Create App
Use this form to register a client application.
After submission, the app can own OAuth clients and can be linked to service instances.
Important fields:
- appId: stable app identifier for the host
- appName: user-facing app name
- isKafkaApp: whether this app uses Kafka-specific behavior
- ownerPositionId: optional position owner for team access
Update App
Use this form to update client application metadata.
Updating an app does not automatically change OAuth clients or instance app links that reference the app.
Important fields:
- appName: user-facing app name
- operationOwner and deliveryOwner: business ownership metadata
- ownerPositionId: optional position owner for team access
Create Client
Use this form to create an OAuth client.
An OAuth client can be associated with an app, API version, or instance, depending on the selected ownership context.
Important fields:
- clientName: user-facing client name
- clientType: client type
- clientProfile: OAuth profile
- providerId: OAuth provider
- ownerPositionId: optional position owner for team access
Update Client
Use this form to update OAuth client metadata.
Changing client settings can affect token issuance, access scope, and downstream integrations that use the client.
Important fields:
- clientName: user-facing client name
- clientScope: requested scopes
- tokenExType: token exchange type
- ownerPositionId: optional position owner for team access
Create Client Token
Use this form to create a long-lived token for an OAuth client.
Client tokens are sensitive. Create them only for clients you own or are authorized to manage.
Important fields:
- clientId: OAuth client that will receive the token
- clientSecret: client credential used for token creation
- ownerPositionId: optional position owner for team access
Create Instance
Use this form to create a service instance.
After submission, the instance can be linked to API versions, client apps, runtime endpoints, and configuration records.
Important fields:
- instanceName: user-facing instance name
- productVersionId: product version for the instance
- serviceId: service identifier
- environment, region, and lob: deployment metadata
- ownerPositionId: optional position owner for team access
Update Instance
Use this form to update service instance metadata.
Updating an instance can affect task context, instance links, and owner-scoped visibility.
Important fields:
- instanceName: user-facing instance name
- serviceId: service identifier
- current: whether this instance is the current instance for the service
- ownerPositionId: optional position owner for team access
Create Instance API
Use this form to link an API version to an instance.
After submission, the relationship can be used for route prefixes, MCP tools, configuration, and access-control workflows.
Important fields:
- instanceId: target instance
- apiVersionId: API version to link
- ownerPositionId: optional position owner for team access
Create Instance API Path Prefix
Use this form to add a path prefix to an instance API link.
Path prefixes help map incoming gateway paths to the correct API surface.
Important fields:
- instanceApiId: instance API relationship
- pathPrefix: route prefix
- ownerPositionId: optional position owner for team access
Update Instance API Path Prefix
Use this form to update ownership metadata for an instance API path prefix.
The path prefix itself is part of the relationship key and should be treated as stable for the existing record.
Important fields:
- instanceApiId: instance API relationship
- pathPrefix: route prefix
- ownerPositionId: optional position owner for team access
Create Instance App
Use this form to link a client app to an instance.
After submission, the app can be connected to APIs exposed by the same instance.
Important fields:
- instanceId: target instance
- appId: client app
- appVersion: app version
- ownerPositionId: optional position owner for team access
Create Instance App API
Use this form to connect an instance app relationship to an instance API relationship.
This link tells the portal which app can use which API on a specific instance.
Important fields:
- instanceAppId: instance app relationship
- instanceApiId: instance API relationship
- ownerPositionId: optional position owner for team access
Create Runtime Instance
Use this form to create a runtime endpoint for a service.
Runtime instances describe where a service is reachable and support operational workflows.
Important fields:
- serviceId: service identifier
- protocol: runtime protocol
- ipAddress and portNumber: endpoint location
- instanceStatus: runtime status
- ownerPositionId: optional position owner for team access
Update Runtime Instance
Use this form to update a runtime endpoint.
Updating runtime details can affect operational workflows that depend on the service endpoint.
Important fields:
- runtimeInstanceId: runtime endpoint record
- serviceId: service identifier
- ipAddress and portNumber: endpoint location
- instanceStatus: runtime status
- ownerPositionId: optional position owner for team access
Create Schedule
Use this form to create a scheduled portal event.
After submission, the scheduler can emit the configured event according to the selected frequency and start time.
Important fields:
- scheduleName: user-facing schedule name
- frequencyUnit and frequencyTime: schedule cadence
- startTs: first scheduled time
- eventTopic, eventType, and eventData: event payload
- ownerPositionId: optional position owner for team access
Update Schedule
Use this form to update a scheduled portal event.
Changing schedule timing or event data affects future executions.
Important fields:
- scheduleName: user-facing schedule name
- frequencyUnit and frequencyTime: schedule cadence
- eventTopic, eventType, and eventData: event payload
- ownerPositionId: optional position owner for team access
Create Workflow Definition
Use this form to create a workflow definition.
After submission, the workflow definition can be started manually or referenced by task flows and automation.
Important fields:
- namespace: workflow namespace
- name: workflow name
- version: workflow version
- definition: workflow YAML
- ownerPositionId: optional position owner for team access
Update Workflow Definition
Use this form to update workflow definition metadata and YAML.
Updating the definition affects future workflow starts. Existing process instances may continue according to their already captured definition state.
Important fields:
- wfDefId: workflow definition record
- namespace, name, and version: workflow identity
- definition: workflow YAML
- ownerPositionId: optional position owner for team access
Portal View Task Help
This section contains task-level help for workflows that span multiple pages and forms.
Task help should explain the goal, prerequisites, required steps, optional steps, and common next actions.
Onboard API to MCP Gateway
Use this task to expose an existing API through MCP Gateway.
Typical steps:
- select or create an API
- select or create an API version
- choose a deployment mode
- link the API version to a gateway or sidecar instance
- select MCP tools
- configure access control when required
Register Standalone MCP Server
Use this task to register an MCP server that is not derived from an existing API version.
Typical steps:
- register the MCP server
- add a server version
- link the server to a gateway
- review MCP tools
Publish API
Use this task to prepare an API for publication and review.
Typical steps:
- create or select the API
- create or select an API version
- review the marketplace listing
Manage Instance
Use this task to create, review, and connect service instances.
Typical steps:
- create or review the instance
- create or review runtime endpoints
- link APIs to the instance
- link apps to the instance
- manage app API links and path prefixes
Manage Client App
Use this task to manage a client app and its OAuth clients.
Typical steps:
- create or review the client app
- create or review OAuth clients
- link the app to an instance
- create or review client tokens
Manage Workflow
Use this task to create and operate workflow definitions.
Typical steps:
- create or review workflow definitions
- start a workflow
- review process instances, tasks, assignments, worklists, and audit logs
Portal View Concepts
This section contains reusable explanations for portal concepts referenced by many pages, forms, and tasks.
Concept help should be linked from page or field-level help when a short label or tooltip is not enough.
Ownership And Positions
Portal records can have an individual owner and a position owner.
owner_user_id is derived from the authenticated user when a record is created.
It should not be submitted from normal browser forms.
owner_position_id is optional and can be selected on owner-aware forms. It
allows users with the matching effective position to see or manage the record
when service-side authorization grants that scope.
Rows with no owner user and no owner position are legacy or unassigned records. They should normally be visible only to all-scope administrators until ownership is assigned.
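The visibility rules above can be summarized as a single predicate. This is a simplified sketch: the real check is performed by service-side authorization, and the user object shape (userId, positionIds, allScopeAdmin) and the canSee function are hypothetical.

```javascript
// Decide whether a user can see an owner-aware record.
function canSee(record, user) {
  // All-scope administrators see every record for the host.
  if (user.allScopeAdmin) {
    return true;
  }
  // Legacy or unassigned rows are hidden from everyone else.
  if (record.owner_user_id == null && record.owner_position_id == null) {
    return false;
  }
  // Individual owner.
  if (record.owner_user_id === user.userId) {
    return true;
  }
  // Position owner: the user holds the matching effective position.
  return (
    record.owner_position_id != null &&
    user.positionIds.includes(record.owner_position_id)
  );
}
```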
Hosts And User Hosts
A host is the tenant boundary for most portal records.
User-host membership determines which host a user can work in. Most admin pages and generated forms operate against the currently selected host.
When a user cannot see expected records, first confirm that the correct host is selected and that the user has membership for that host.
API Versioning
An API is the stable business record. An API version is the concrete version that can be linked to instances, MCP tools, marketplace listings, and access control rules.
Create the API first, then create one or more API versions under it. Operational relationships should usually reference the API version instead of only the API.
OAuth Client Ownership
OAuth clients can be owned by apps, API versions, or instances depending on the selected creation context.
Ownership affects which users can see or modify client records. Regular users should manage only clients they own or can access through their position. Administrators can manage all clients for the host when their role allows it.
Implementation
Sign In
Portal Dashboard
The Portal Dashboard is served by the portal-view single-page application.
- Guest User Access: Upon landing on the dashboard, a guest user can:
  - View certain menus.
  - Perform limited actions within the application.
- Accessing Privileged Features: To access additional features:
  - Click the User button.
  - Select the Sign In menu item.
Login View
- Redirection to Login View: When the Sign In menu item is clicked, the browser is redirected to the Login View single-page application. This application is served by the same instance of light-gateway and handles user authentication against the OAuth 2.0 server (OAuth Kafka) to initiate the Authorization Code grant flow.
- OAuth 2.0 Client ID: The client_id is included in the redirect URL as a query parameter. This ensures that the client_id is sent to the OAuth 2.0 server to obtain the authorization code. In this context, the client_id is associated with the portal-view application.
- Login View Responsibilities: The Login View is a shared single-page application used by all other SPAs across various hosts. It is responsible for:
  - Authenticating users.
  - Ensuring that user credentials are not passed to any other single-page applications or business APIs.
- SaaS Deployment in the Cloud: In a SaaS environment, all users are authenticated by the OAuth 2.0 server using the light-portal user database. As a result, the user type does not need to be passed from the Login View.
- On-Premise Deployment: For on-premise deployments, a customized Login View should include a radio button for selecting the user type. Typical options for most organizations are:
  - Employee (E)
  - Customer (C)
- Customized Authentication: Based on the selected user type:
  - Employees are authenticated via Active Directory.
  - Customers are authenticated using the customer database.
  A customized authenticator implementation should handle this logic, ensuring the correct authentication method is invoked for each user type.
Login Form Submission
- Form Submission Endpoint: /oauth2/N2CMw0HGQXeLvC1wBfln2A/code
- Request Details:
  - Headers: Content-Type: application/x-www-form-urlencoded
  - Method: POST
  - Body Parameters:
    - j_username: The user’s username.
    - j_password: The user’s password.
    - remember: Indicates whether the session should persist.
    - client_id: The OAuth 2.0 client identifier.
    - state: A hardcoded value (requires additional work for dynamic handling).
    - user_type: (Optional) Specifies the type of user (e.g., employee or customer).
    - redirect_uri: (Optional) The URI to redirect after authentication.
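The form body described above can be assembled with the standard URLSearchParams class, which produces application/x-www-form-urlencoded output. This is a sketch rather than the Login View source: buildLoginBody and its argument names are hypothetical, and the value format of remember is not specified here, so it is passed through as given.

```javascript
// Build the x-www-form-urlencoded body for the login form submission.
function buildLoginBody({ username, password, remember, clientId, state, userType, redirectUri }) {
  const body = new URLSearchParams();
  body.set("j_username", username);
  body.set("j_password", password);
  // Assumption: the caller supplies remember in whatever format the server expects.
  body.set("remember", String(remember));
  body.set("client_id", clientId);
  body.set("state", state);
  // Optional parameters are sent only when provided.
  if (userType) {
    body.set("user_type", userType);
  }
  if (redirectUri) {
    body.set("redirect_uri", redirectUri);
  }
  return body.toString();
}
```

The resulting string would be sent as the body of a POST to the form submission endpoint with the Content-Type header set to application/x-www-form-urlencoded.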
Light Gateway
The light-gateway instance acts as a BFF (backend for frontend). It has a routing rule that routes any request with the /oauth2 prefix to the kafka-oauth server.
OAuth Kafka
- LightPortalAuthenticator
A request to hybrid-query:
{"host":"lightapi.net","service":"user","action":"loginUser","version":"0.1.0","data":{"email":"%s","password":"%s"}}
User Query
- LoginUser
This handler calls loginUserByEmail method from PortalDbProviderImpl.
PortalDbProviderImpl
The input for this method is the user’s email. Upon successful execution, the method returns a JSON string containing all user properties retrieved from the login query.
LightPortalAuthenticator
The authenticator will utilize the user data returned from the above query to validate the password. Upon successful password verification, it will return an Account object with the following attributes:
- Principal: The user’s identifier, which is the email.
- Roles: A collection containing a single element, the user’s JSON.
After the Account object is created and returned, control is passed to the HostIdCodePostHandler.
HostIdCodePostHandler
It gets the client_id from the submitted form and calls dbProvider.queryClientByClientId to get the client information. On success, it gets the Account object created by the authenticator above from the security context.
It then creates a UUID authorization code and a map associated with the code. The map contains the properties needed to create the authorization code token: some properties from the client and the entire user JSON.
Finally, it calls ClientUtil.createAuthCode with the codeMap to create the authorization code and redirects back to the redirect URI with the code.
ClientUtil.createAuthCode
ClientUtil gets a client credentials token and calls the CreateAuthCode handler in hybrid-command to publish the code to the Kafka cluster, which notifies other parties about the code. The codeMap is passed to the handler as data.
CreateAuthCode Handler
The handler creates a MarketCodeCreatedEvent and passes the entire input map to the event as the value field.
MarketQueryStreams
It processes the MarketCodeCreatedEvent and calls dbProvider.createMarketCode with the event.
createMarketCode
This method in dbProvider puts the event value into the cacheManager cache named “auth_code”. The code is now ready to be queried from market-query.
Portal View
The HostIdCodePostHandler redirects the code to the Portal View with /authorization?code=??? and this request will be sent to the light-gateway StatelessAuthHandler.
StatelessAuthHandler
If the request path matches the configured authPath, the handler retrieves the code from the query parameter. It then creates a CSRF UUID token and an AuthorizationCodeRequest to obtain a token via OauthHelper. The request carries the auth code, the CSRF token, and other properties from the configuration, and it is sent to the HostIdTokenPostHandler to create the authorization code token.
HostIdTokenPostHandler
It calls dbProvider.queryClientByClientId and then verifies that the clientId and clientSecret match.
It invokes ClientUtil.getAuthCodeDetail to fetch the code detail from the market-query service and then calls ClientUtil.deleteAuthCode to remove the auth code, because it is a one-time code.
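The one-time nature of the code can be illustrated with a small in-memory store. This is only a sketch: the `AuthCodeStore` class and its method names are hypothetical, and the portal keeps the codes in the cacheManager “auth_code” cache rather than in a local map.

```typescript
// In-memory stand-in for the "auth_code" cache.
class AuthCodeStore<T> {
  private readonly codes = new Map<string, T>();

  // Plays the role of createMarketCode: store the event value under the code.
  put(code: string, detail: T): void {
    this.codes.set(code, detail);
  }

  // Plays the role of getAuthCodeDetail followed by deleteAuthCode:
  // read the detail once, then remove it so the code cannot be replayed.
  consume(code: string): T | undefined {
    const detail = this.codes.get(code);
    this.codes.delete(code);
    return detail;
  }
}
```

A second `consume` call with the same code returns `undefined`, which is the behavior a token endpoint needs in order to reject a replayed authorization code.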
Login View
The login-view is a Single Page Application (SPA) built with React and Vite. It serves as the user interface for the OAuth 2.0 Authorization Code flow within the LightAPI ecosystem.
Overview
This application acts as the front-end for the Authorization Server. When a user attempts to access a protected resource on a client application (the “Portal”), they are redirected to this application to authenticate and grant consent.
It handles:
- User Authentication (Username/Password).
- Social Login (Google, Facebook, GitHub).
- OAuth 2.0 Consent Granting.
- Password Management (Forgot Password, Reset Password).
Technology Stack
- Framework: React 18
- Build Tool: Vite
- UI Library: Material UI (MUI) v6
- Routing: React Router DOM v6
- Social Login:
  - Google: @react-oauth/google
  - Facebook: @greatsumini/react-facebook-login
  - GitHub: Manual OAuth 2.0 flow with react-social-login-buttons
Key Flows
1. OAuth 2.0 Authorization
The application expects to be opened with standard OAuth 2.0 query parameters:
- client_id: The ID of the client application requesting access.
- response_type: Typically code.
- redirect_uri: Where to redirect after success.
- state: A random string generated by the client to prevent CSRF.
- scope: Requested permissions.
Process:
- The Login component extracts these parameters from the URL.
- User submits credentials or uses social login.
- On success, the application receives an authorization code from the backend.
- To grant consent (if configured), the user is shown the Consent screen.
- Finally, the browser is redirected to the redirect_uri with the code and state.
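The first and last steps of this process can be sketched as two pure functions. The names `AuthorizeParams`, `parseAuthorizeParams`, and `buildCallbackUrl` are hypothetical, and treating client_id and redirect_uri as mandatory is an assumption; the actual Login component may handle missing parameters differently.

```typescript
interface AuthorizeParams {
  clientId: string;
  responseType: string;
  redirectUri: string;
  state: string | null;
  scope: string | null;
}

// Read the standard OAuth 2.0 parameters from a query string such as location.search.
function parseAuthorizeParams(search: string): AuthorizeParams | null {
  const q = new URLSearchParams(search);
  const clientId = q.get("client_id");
  const redirectUri = q.get("redirect_uri");
  if (!clientId || !redirectUri) return null; // assumption: both are required
  return {
    clientId,
    responseType: q.get("response_type") ?? "code",
    redirectUri,
    state: q.get("state"),
    scope: q.get("scope"),
  };
}

// Build the final redirect back to the client with the code and the unchanged state.
function buildCallbackUrl(params: AuthorizeParams, code: string): string {
  const url = new URL(params.redirectUri);
  url.searchParams.set("code", code);
  if (params.state !== null) url.searchParams.set("state", params.state);
  return url.toString();
}
```

The state value is echoed back untouched so that the client can compare it with the value it generated before the redirect.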
2. Social Login Configuration
The application supports multiple identity providers.
- Google: Uses the modern Google Identity Services. Configured in src/main.jsx via GoogleOAuthProvider.
- Facebook: Uses the Facebook SDK wrapper. Configured in src/components/FbLogin.jsx.
- GitHub: Uses a manual popup flow. The client ID is configured in src/components/GithubLogin.jsx. The redirect URI /github/callback handles the code extraction.
3. Backend Integration
The application proxies API requests to the backend (Light Gateway/OAuth Provider) using vite.config.js proxy settings during development.
- /oauth2/*: For token and code endpoints.
- /portal/*: For user management commands (login query).
- /google, /facebook, /github: Endpoints to exchange social tokens/codes for LightAPI authorization codes.
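A proxy table for these prefixes might look like the sketch below. The backend origin `https://localhost:8443` is a placeholder, and the option values are assumptions; only the shape (path prefix mapped to `target`, `changeOrigin`, and `secure`) follows Vite's `server.proxy` option.

```typescript
// Hypothetical backend origin for local development.
const BACKEND = "https://localhost:8443";

// Shape accepted by Vite's `server.proxy` option: path prefix -> proxy options.
const proxy: Record<string, { target: string; changeOrigin: boolean; secure: boolean }> = {};

for (const prefix of ["/oauth2", "/portal", "/google", "/facebook", "/github"]) {
  proxy[prefix] = {
    target: BACKEND,
    changeOrigin: true,
    secure: false, // accept a self-signed certificate on the local backend
  };
}

// In vite.config.js this object would be passed as `server: { proxy }`.
```

Vite treats plain string keys as path prefixes, so a `/github` entry would also match `/github/callback`. If that route must stay inside the SPA, a key beginning with `^` (which Vite interprets as a regular expression, for example `^/github$`) is one way to narrow the match.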
Development
Setup
yarn install
Run Locally
yarn dev
Runs on https://localhost:5173 by default.
Build
yarn build
Generates production assets in the dist folder.
Project Structure
- src/components/: Reusable UI components (Login forms, Social buttons).
- src/theme.js: MUI theme configuration.
- src/main.jsx: Application entry point and providers.
- vite.config.js: Vite configuration including proxy rules.
Portal Services
This section provides an overview of the services utilized by Light Portal. Each service is implemented as a separate repository and is initialized during the hybrid-query or hybrid-command startup process. These services are designed to handle specific functionalities within the portal and may interact with one another to execute complex operations.
Light Portal adopts the Command Query Responsibility Segregation (CQRS) pattern, categorizing services into two types: Query and Command. Query services manage read operations, while Command services handle write operations, ensuring a clear separation of responsibilities.
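The split can be pictured with a minimal in-memory sketch. It is not taken from any portal service: the `AttributeCreated` event, the `eventLog` array (standing in for the event stream between the two sides), and both function names are invented for illustration.

```typescript
// Event produced by the command side; the field names are illustrative.
interface AttributeCreated {
  type: "AttributeCreated";
  attributeId: string;
  attributeName: string;
}

// Stand-in for the event stream that connects the command and query sides.
const eventLog: AttributeCreated[] = [];

// Command side: validates the request and records an event. It never serves reads.
function createAttribute(attributeId: string, attributeName: string): void {
  if (attributeName.trim() === "") throw new Error("attributeName is required");
  eventLog.push({ type: "AttributeCreated", attributeId, attributeName });
}

// Query side: builds a read model from the events and answers reads from it.
function projectAttributes(events: AttributeCreated[]): Map<string, string> {
  const view = new Map<string, string>();
  for (const e of events) view.set(e.attributeId, e.attributeName);
  return view;
}
```

Because reads only touch the projected view, the query side can be scaled or rebuilt from the events without involving the command side.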
Attribute Service
Attribute Query Service
Handles queries related to attributes.
Important Links
Services Used
–
Attribute Command Service
Handles commands related to attributes.
Important Links
Services Used
user-query
Client Service
Client Query Service
Handles queries related to clients.
Important Links
Services Used
–
Client Command Service
Handles commands related to clients.
Important Links
Services Used
user-query
Config Service
Config Query Service
Handles queries related to configurations.
Important Links
Services Used
–
Config Command Service
Handles commands related to configurations.
Important Links
Services Used
- user-query
- config-query
Deployment Service
Deployment Query Service
Handles queries related to deployments.
Important Links
Services Used
–
Deployment Command Service
Handles commands related to deployments.
Important Links
Services Used
user-query
Group Service
Group Query Service
Handles queries related to groups.
Important Links
Services Used
–
Group Command Service
Handles commands related to groups.
Important Links
Services Used
user-query
Host Service
Host Query Service
Handles queries related to hosts.
Important Links
Services Used
–
Host Command Service
Handles commands related to hosts.
Important Links
Services Used
user-query
Instance Service
Instance Query Service
Handles queries related to instances.
Important Links
Services Used
–
Instance Command Service
Handles commands related to instances.
Important Links
Services Used
user-query
OAuth Service
OAuth Query Service
Handles queries related to OAuth.
Important Links
Services Used
–
OAuth Command Service
Handles commands related to OAuth.
Important Links
Services Used
- user-query
- oauth-query
Position Service
Position Query Service
Handles queries related to positions.
Important Links
Services Used
–
Position Command Service
Handles commands related to positions.
Important Links
Services Used
user-query
Product Service
Product Query Service
Handles queries related to products.
Important Links
Services Used
–
Product Command Service
Handles commands related to products.
Important Links
Services Used
user-query
Role Service
Role Query Service
Handles queries related to roles.
Important Links
Services Used
–
Role Command Service
Handles commands related to roles.
Important Links
Services Used
user-query
Rule Service
Rule Query Service
Handles queries related to rules.
Important Links
Services Used
service-query
Rule Command Service
Handles commands related to rules.
Important Links
Services Used
- user-query
- host-query
Service Service
Service Query Service
Handles queries related to services.
Important Links
Services Used
–
Service Command Service
Handles commands related to services.
Important Links
Services Used
user-query
User Service
User Query Service
Handles queries related to users.
Important Links
Services Used
–
User Command Service
Handles commands related to users.
Important Links
Services Used
- user-query
- service-query
Portal View
OAuth 2.0 State Verification
This document describes the implementation of CSRF protection for the OAuth 2.0 authorization code flow in the portal-view application.
Overview
To prevent Cross-Site Request Forgery (CSRF) attacks during the OAuth 2.0 authentication process, we implement a state parameter check. A random state string is generated before the authentication request and verified upon the callback.
Implementation Details
State Generation
Location: src/components/Header/ProfileMenu.tsx
When the user initiates the sign-in process:
- A random alphanumeric string is generated.
- This string is stored in the browser’s localStorage under the key portal_auth_state.
- The string is appended as the state query parameter to the OAuth 2.0 authorization URL.
// Generate a random state for CSRF protection
const state = Math.random().toString(36).substring(7);
localStorage.setItem('portal_auth_state', state);
const defaultUrl =
`https://locsignin.lightapi.net?client_id=...&state=${state}`;
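Math.random is not a cryptographically secure source, and the snippet above yields only a handful of characters. As a possible hardening, the state could be drawn from the Web Crypto API instead. The sketch below is a suggestion rather than the current implementation; the function name `generateState` and the 16-byte default are assumptions.

```typescript
// Generate an unpredictable state value as lowercase hex.
// 16 random bytes produce a 32-character string.
function generateState(byteLength = 16): string {
  const bytes = new Uint8Array(byteLength);
  crypto.getRandomValues(bytes); // Web Crypto: cryptographically strong random values
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}
```

The result can be stored and appended to the authorization URL exactly as before, for example `localStorage.setItem('portal_auth_state', generateState())`.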
Redirect Handling
Location: src/App.tsx
To ensure the state query parameter is preserved during the redirect from the root path (/) to the dashboard, a custom RedirectWithQuery component is used. This component handles both standard query parameters and hash-based redirects (common with certain OAuth providers or router configurations).
- Checks window.location.hash for paths (e.g., /#/app/dashboard?state=...).
- Prioritizes the hash path if present to ensure react-router receives the correct target.
- Appends existing query parameters from useLocation().search.
- Uses useNavigate for the redirection.
const RedirectWithQuery = ({ to }: { to: string }) => {
// ... logic to preserve search params and handle hash paths
if (window.location.pathname === to) return; // Prevent loop
// ...
navigate(target, { replace: true });
};
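Since the component body is elided above, the following is only one plausible way to compute the navigation target from the rules listed earlier. The function name `resolveRedirectTarget` and the way the two query strings are joined are assumptions.

```typescript
// Compute where RedirectWithQuery could navigate.
// `hash` and `search` correspond to window.location.hash and useLocation().search.
function resolveRedirectTarget(to: string, hash: string, search: string): string {
  // Prefer a path carried in the hash, e.g. "#/app/dashboard?state=abc".
  const base = hash.startsWith("#/") ? hash.slice(1) : to;
  const extra = search.startsWith("?") ? search.slice(1) : search;
  if (extra === "") return base;
  // Join with "&" when the base already has a query string.
  return base + (base.includes("?") ? "&" : "?") + extra;
}
```

Keeping this logic in a pure function makes the hash and query handling testable without rendering the router.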
State Verification
Location: src/pages/dashboard/Dashboard.tsx
Upon successful authentication, the provider redirects the user back to the application (defaulting to the Dashboard).
- The application retrieves the state parameter from the URL query string.
- It retrieves the stored state from localStorage (portal_auth_state).
- The two values are compared:
  - Match: The verification succeeds, and the portal_auth_state is removed from localStorage.
  - Mismatch: The verification fails. The user is alerted and immediately logged out via signOut to protect the session.
useEffect(() => {
  const searchParams = new URLSearchParams(location.search);
  const state = searchParams.get('state');
  // Check if we have a state and haven't attempted verification yet in this mount
  if (state && !verificationAttempted.current) {
    verificationAttempted.current = true;
    const storedState = localStorage.getItem('portal_auth_state');
    if (storedState === state) {
      console.log('OAuth state verified successfully.');
      localStorage.removeItem('portal_auth_state');
      // Remove state from URL to prevent re-verification
      const newSearchParams = new URLSearchParams(location.search);
      newSearchParams.delete('state');
      navigate({ search: newSearchParams.toString() }, { replace: true });
    } else {
      console.error('OAuth state mismatch. Potential CSRF attack.');
      alert('OAuth state mismatch. Potential CSRF attack. Logging out...');
      signOut(userDispatch, navigate);
    }
  }
}, [location, navigate, userDispatch]);
Testing State Mismatch (Manual Steps)
To manually verify the security logout mechanism:
- Ensure you are logged in to the application.
- Open your browser’s Developer Tools (F12) and go to the Console tab.
- Set a dummy “valid” state in your local storage:
  localStorage.setItem('portal_auth_state', 'my_secret_state');
- Manually modify the URL to include a different state parameter.
  - Example: https://localhost:3000/app/dashboard?state=attackers_fake_state
  - Note: If using hash routing, ensure it is inside the hash: https://localhost:3000/#/app/dashboard?state=attackers_fake_state
- Press Enter to navigate.
Expected Result:
- An alert appears: “OAuth state mismatch. Potential CSRF attack. Logging out…”
- The user is immediately signed out of the application.
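The same scenario could also be covered by an automated test if the comparison were extracted from the effect into a pure function. The sketch below assumes such a refactoring; `checkOAuthState` and the `StateCheck` type do not exist in the current code.

```typescript
type StateCheck = "skipped" | "verified" | "mismatch";

// Compare the state from the callback URL with the value saved before sign-in.
function checkOAuthState(search: string, storedState: string | null): StateCheck {
  const urlState = new URLSearchParams(search).get("state");
  if (!urlState) return "skipped"; // no state in the URL, nothing to verify
  return urlState === storedState ? "verified" : "mismatch";
}
```

With this in place, the manual steps map directly to a call such as `checkOAuthState('?state=attackers_fake_state', 'my_secret_state')`, which evaluates to `"mismatch"` and corresponds to the sign-out branch.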
Configuration
light-gateway
Client Credentials Token
Every request from the light-gateway to a downstream API should carry at least one token in the Authorization header. If the Authorization header already contains an authorization code token, the TokenHandler adds a client credentials token to the X-Scope-Token header.
Since all light portal services share the same scopes (portal.r and portal.w), one token is enough to access all APIs.
Add the client credentials token configuration to the client.yml section.
# Client Credential
client.tokenCcUri: /oauth2/N2CMw0HGQXeLvC1wBfln2A/token
client.tokenCcClientId: f7d42348-c647-4efb-a52d-4c5787421e72
client.tokenCcClientSecret: f6h1FTI8Q3-7UScPZDzfXA
client.tokenCcScope:
- portal.r
- portal.w
Add TokenHandler to the handler.yml section.
# handler.yml
handler.handlers:
.
.
.
- com.networknt.router.middleware.TokenHandler@token
.
.
.
handler.chains.default:
.
.
.
- prefix
- token
- router
Add the TokenHandler configuration to the token.yml section.
# token.yml
token.enabled: true
token.appliedPathPrefixes:
- /r
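The effect of this configuration can be sketched in TypeScript. The real TokenHandler is Java middleware; the function names below are hypothetical, the `Bearer` prefix is assumed, and the fallback branch (placing the client credentials token in Authorization when the header is empty) is inferred from the requirement that every request carries at least one token there.

```typescript
type HeaderMap = Record<string, string>;

// True when the request path starts with one of token.appliedPathPrefixes.
function tokenApplies(path: string, appliedPathPrefixes: string[]): boolean {
  return appliedPathPrefixes.some((prefix) => path.startsWith(prefix));
}

// Attach the client credentials token without overwriting a caller's token.
function attachScopeToken(headers: HeaderMap, ccToken: string): HeaderMap {
  const out: HeaderMap = { ...headers };
  if (out["Authorization"]) {
    // An authorization code token is already present: add the scope token beside it.
    out["X-Scope-Token"] = `Bearer ${ccToken}`;
  } else {
    // Assumed fallback: the client credentials token becomes the only token.
    out["Authorization"] = `Bearer ${ccToken}`;
  }
  return out;
}
```

With `appliedPathPrefixes` set to `/r`, only requests whose path begins with `/r` would pass through `attachScopeToken`; other paths are left unchanged.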
light-reference
Cors Configuration
Because the light-gateway handles the SPA interaction and CORS, we don’t need to enable CORS on the reference API. However, the CORS handler is still registered in the default handler.yml in case the reference API is used as a standalone service.
In the light-portal configuration, we need to disable CORS.
# cors.yml
cors.enabled: false
Client Configuration
We need to load the JWK from the oauth-kafka service to validate incoming JWT tokens. To set up the JWK, add the following lines to the values.yml file.
# client.yml
client.tokenKeyServerUrl: https://localhost:6881
client.tokenKeyUri: /oauth2/N2CMw0HGQXeLvC1wBfln2A/keys
Test
Automated Integration Testing & AI Agent Strategy for Light-Portal
Document Type: Engineering Strategy / Architecture
System: Light-Portal (Multi-Service Architecture)
1. Executive Summary
As Light-Portal scales into a complex multi-service ecosystem, traditional end-to-end (E2E) tests become too slow, brittle, and difficult to maintain. To enable rapid updates without fear of regression, we must adopt a Shift-Left Layered Integration Approach.
Furthermore, to minimize the manual overhead of test creation and maintenance, this strategy incorporates AI QA Agents capable of autonomously generating, executing, and self-healing test suites based on structured declarative specifications.
2. Core Automated Integration Strategy
To test inter-service communication reliably and rapidly, we will implement the following methodologies:
A. Consumer-Driven Contract (CDC) Testing
Instead of spinning up the entire portal ecosystem to test a single integration, we will use Pact.
- How it works: The “Consumer” service defines the expected API structure (the contract). The “Provider” service checks its responses against this contract during its CI pipeline.
- Benefit: Catches breaking API changes instantaneously without requiring a full staging environment.
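The idea behind a contract check can be shown with a hand-rolled sketch. This is not Pact's API; the `Contract` type and `verifyContract` function are invented here purely to illustrate what the provider-side verification conceptually does.

```typescript
// A contract here is simply field name -> expected typeof result.
type Contract = Record<string, "string" | "number" | "boolean" | "object">;

// Return the list of violations; an empty list means the response honours the contract.
function verifyContract(contract: Contract, response: Record<string, unknown>): string[] {
  const violations: string[] = [];
  for (const [field, expected] of Object.entries(contract)) {
    if (!(field in response)) {
      violations.push(`missing field: ${field}`);
    } else if (typeof response[field] !== expected) {
      violations.push(
        `wrong type for ${field}: expected ${expected}, got ${typeof response[field]}`,
      );
    }
  }
  return violations;
}
```

In a provider's CI pipeline, a non-empty result would fail the build before a breaking change reaches any consumer.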
B. Ephemeral Environments
Tests should never rely on shared, persistent environments which are prone to state pollution.
- Tooling: Testcontainers or dynamic Docker Compose files.
- Execution: During the CI/CD pipeline, isolated instances of necessary services (e.g., databases, message brokers like Kafka, OAuth providers) are spun up, tested against, and destroyed.
C. API-First Testing
Because Light-Portal relies on strict API boundaries, UI-based testing should be minimized for integration validation.
- Tooling: Karate DSL or REST Assured.
- Benefit: Tests the actual data contracts and service boundaries directly, resulting in faster and more resilient tests.
D. Mocking External Dependencies
- Tooling: WireMock or Mountebank.
- Usage: Stub out third-party APIs or external legacy systems to ensure our integration tests are entirely deterministic and not subject to external network failures.
3. AI Agent Automation Capabilities
Autonomous AI agents can significantly reduce the testing bottleneck. Within this architecture, AI agents will be utilized for the following tasks:
- Test Generation: Automatically parse OpenAPI specifications to generate exhaustive test suites covering positive paths, edge cases, and error handling (400, 401, 429, 500).
- Self-Healing Test Pipelines: When an engineer modifies an API schema intentionally, the AI agent will detect the resulting broken test, read the commit diff, and automatically generate a Pull Request to align the test with the new API schema.
- Synthetic Data Generation: Generate realistic, schema-compliant JSON payloads for testing, avoiding hard-coded or outdated mock data.
- State Machine Exploration: Execute multi-step user journeys by exploring the API state (e.g., Authenticate -> Register Service -> Query Gateway -> Validate Routing).
4. AI-Optimized Test Specifications & Plans
AI agents require structured, semantic, and declarative inputs to function reliably. To direct the AI agent, we will provide test plans in the following formats:
A. OpenAPI / AsyncAPI Specifications (The Golden Source)
The most effective way to instruct an AI is to provide the API design spec.
- AI Action: The agent reads openapi.yaml, identifies required headers (e.g., JWT authorizations) and payload schemas, and writes the baseline integration code automatically.
B. Behavior-Driven Development (BDD) / Gherkin Syntax
For complex business logic, engineers and product managers will write Gherkin specs. The AI agent translates this plain English into executable API scripts.
Example Spec:
Feature: Light-Portal Service Registration
  Scenario: Registering a new microservice routing path
    Given the light-oauth2 service provides a valid admin JWT
    When I send a POST request to "/portal/services" with the following payload:
      """
      {
        "serviceId": "demo-service",
        "route": "/api/v1/demo"
      }
      """
    Then the response status should be 201
    And the service should be discoverable via the light-router instance
C. Declarative YAML Test Workflows
Instead of writing imperative code (Java/Node.js), test workflows should be written in YAML. YAML is highly deterministic and minimizes AI syntax hallucinations.
Example Spec:
# AI Agent Workflow Instructions
name: Developer Onboarding Flow
steps:
  - name: Get Token
    api: POST /oauth/token
    extract:
      token: response.body.access_token
  - name: Register Service
    api: POST /portal/services
    headers:
      Authorization: Bearer ${token}
    assert:
      status: 200
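A runner for such a workflow needs two small building blocks: resolving an `extract` path against a response, and substituting `${...}` placeholders in later steps. The sketch below shows one way to write them; the function names and the choice to leave unknown placeholders untouched are assumptions, not part of any existing tool.

```typescript
// Replace ${name} placeholders with values captured by earlier steps.
// Unknown placeholders are left untouched so the problem stays visible.
function interpolate(template: string, vars: Map<string, string>): string {
  return template.replace(/\$\{(\w+)\}/g, (match: string, key: string) => vars.get(key) ?? match);
}

// Resolve a dotted path such as "response.body.access_token" against a context object.
function extractPath(context: unknown, path: string): unknown {
  let current: unknown = context;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}
```

For the workflow above, the first step would call `extractPath` with `response.body.access_token` and save the result as `token`, and the second step would run its `Authorization` header through `interpolate`.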
D. Flow-Based “User Stories” (Agentic Prompting)
For autonomous exploration, the AI can be given high-level flow objectives. The agent is responsible for breaking the flow into actual API requests.
Example Prompt to Agent:
“Simulate a developer onboarding flow for Light-Portal. 1. Request an OAuth token. 2. Register a new mock-service to the portal. 3. Update the rate-limiting configuration for that service to 5 requests per minute. 4. Send 10 concurrent requests to verify the rate limit correctly throws a 429 error.”
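The expected outcome of step 4 can be reasoned about with a small model of the limiter. The sketch assumes a fixed-window algorithm, which may differ from the gateway's actual rate-limiting strategy; the class name and its API are hypothetical.

```typescript
// Fixed-window limiter: at most `limit` requests per `windowMs` milliseconds.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns the HTTP status a gateway using this model would answer with at time `nowMs`.
  handle(nowMs: number): 200 | 429 {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs; // start a new window
      this.count = 0;
    }
    this.count += 1;
    return this.count <= this.limit ? 200 : 429;
  }
}
```

Under this model, ten requests arriving within the same minute against a limit of five produce five 200 responses followed by five 429 responses, which is the assertion an agent would need to derive from the prompt.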
5. Conclusion & Next Steps
By combining Contract Testing (Pact), Ephemeral Environments (Testcontainers), and Declarative AI-driven Automation, Light-Portal can scale its microservices with confidence.
Immediate Action Items:
- Standardize and centralize all openapi.yaml files for Light-Portal services.
- Integrate Testcontainers into the primary CI/CD pipeline.
- Select an AI testing tool/framework (e.g., CodiumAI, Postman Postbot, or a custom LLM script) and seed it with our initial Gherkin business flows.