Future SaaS Functionality Recommendations

This document catalogues recommended future capabilities for the Coaching App platform across security hardening, audit & compliance, accountability, background workers, and general SaaS operational maturity. Each item includes a rationale, suggested implementation approach, and a rough priority label.

Priority legend

Label	Meaning
🔴 Critical	Needed before serious production / enterprise use
🟠 High	Strong user or compliance expectation in mature SaaS
🟡 Medium	Significant value; can be deferred one or two cycles
🟢 Nice-to-have	Differentiating features; tackle when capacity allows

1. Security

1.1 Multi-Factor Authentication (MFA) 🔴

What: Allow users to enroll a TOTP authenticator app (Google Authenticator, Authy) or a hardware security key (WebAuthn / FIDO2) as a second factor. Administrators should be able to make MFA mandatory for their organization.

Why: Credential compromise is a leading cause of SaaS account takeover. MFA removes most of that risk, because a stolen password alone no longer grants access.

Approach:

Better-auth ships a twoFactor plugin — enable it and expose enroll/verify endpoints.
Add a mfaEnforced flag to the organizations table; block access to org-scoped routes until MFA is configured.
Surface the enrollment flow in user profile settings (QR code + backup codes).
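
The enforcement rule above can be sketched as a small pure function. This is a minimal illustration, not the Better-auth API: the `OrgSettings` and `UserMfaState` shapes and the `checkMfaGate` name are hypothetical, and only the `mfaEnforced` flag comes from the recommendation itself.

```typescript
// Hypothetical shapes; only `mfaEnforced` is named in the recommendation above.
interface OrgSettings {
  mfaEnforced: boolean;
}

interface UserMfaState {
  hasVerifiedSecondFactor: boolean;
}

type MfaDecision = "allow" | "enroll_required";

// Decide whether a request to an org-scoped route may proceed.
function checkMfaGate(org: OrgSettings, user: UserMfaState): MfaDecision {
  if (org.mfaEnforced && !user.hasVerifiedSecondFactor) {
    return "enroll_required";
  }
  return "allow";
}

console.log(checkMfaGate({ mfaEnforced: true }, { hasVerifiedSecondFactor: false }));
```

A tenant middleware could call this after resolving the organization and redirect to the enrollment flow whenever the result is `enroll_required`.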

1.2 Single Sign-On (SSO) via SAML / OIDC 🟠

What: Allow enterprise organizations to authenticate their users through their own Identity Provider (Okta, Azure AD, Google Workspace, etc.) using SAML 2.0 or OpenID Connect.

Why: Enterprise buyers require SSO as a baseline. It simplifies employee lifecycle management (auto-provision / deprovision on HR events).

Approach:

Better-auth has a saml plugin in progress; alternatively, integrate passport-saml (now published as @node-saml/passport-saml) behind a Hono middleware.
Store per-org IdP configuration (entity ID, SSO URL, certificate) in a new organization_sso_configs table.
On login, detect the email domain and redirect to the org's IdP if SSO is configured.
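
A minimal sketch of the domain-detection step, assuming the per-org configuration also stores the email domain it applies to. The `emailDomain` field, the `SsoConfig` shape, and the `findSsoRedirect` name are assumptions made for illustration; the text above only lists entity ID, SSO URL, and certificate.

```typescript
// Hypothetical row shape for the organization_sso_configs table.
interface SsoConfig {
  organizationId: string;
  emailDomain: string; // assumed column, not listed in the recommendation
  ssoUrl: string;
}

// Return the IdP URL to redirect to, or null to fall back to password login.
function findSsoRedirect(email: string, configs: SsoConfig[]): string | null {
  const at = email.lastIndexOf("@");
  if (at < 1 || at === email.length - 1) return null; // malformed address
  const domain = email.slice(at + 1).toLowerCase();
  const match = configs.find((c) => c.emailDomain.toLowerCase() === domain);
  return match ? match.ssoUrl : null;
}

const exampleConfigs: SsoConfig[] = [
  { organizationId: "org-1", emailDomain: "example.com", ssoUrl: "https://idp.example.com/sso" },
];
console.log(findSsoRedirect("ana@Example.com", exampleConfigs));
```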

1.3 Secrets Management & Rotation 🔴

What: Move all secrets (database URLs, MinIO credentials, auth/bearer token secrets, API keys) out of .env files and into a dedicated secrets manager such as HashiCorp Vault, AWS Secrets Manager, or Doppler.

Why: Hardcoded or file-based secrets are frequently leaked in git history or container images. Rotation without a secrets manager requires downtime or manual coordination.

Approach:

Wrap backend/src/config/env-requirements.ts in a secrets-loader abstraction; in production, the loader calls the secrets manager API instead of reading process.env.
Schedule automatic rotation jobs (see §4 Workers) that regenerate credentials and push them to the secrets manager; the app re-reads on next secret access.

1.4 IP Allowlisting & Geo-Blocking per Organization 🟡

What: Let organization administrators restrict access to their org's data from specific IP ranges or countries.

Why: Compliance and contractual requirements (for example under GDPR or FERPA) often restrict data access to approved jurisdictions or networks.

Approach:

Store allowedIpRanges and blockedCountries JSON columns on organizations.
Enforce in tenant middleware using the CF-Connecting-IP / CF-IPCountry Cloudflare headers already available behind the Zero Trust layer.
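
The check the tenant middleware would perform can be sketched as follows. This is a simplified illustration that handles IPv4 CIDR ranges only; the `OrgNetworkPolicy` shape and the function names are hypothetical, and treating an empty allowlist as "no restriction" is an assumption.

```typescript
// Hypothetical policy shape mirroring the two JSON columns described above.
interface OrgNetworkPolicy {
  allowedIpRanges: string[]; // IPv4 CIDR strings, e.g. "203.0.113.0/24"
  blockedCountries: string[]; // upper-case ISO country codes
}

// Convert dotted IPv4 to an unsigned 32-bit integer; null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the IPv4 range `cidr`.
function inCidr(ip: string, cidr: string): boolean {
  const [base = "", bitsText] = cidr.split("/");
  const bits =
    bitsText === undefined ? 32 : /^\d{1,2}$/.test(bitsText) ? Number(bitsText) : NaN;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// `ip` and `country` would come from CF-Connecting-IP and CF-IPCountry.
function isAccessAllowed(policy: OrgNetworkPolicy, ip: string, country: string | null): boolean {
  if (country !== null && policy.blockedCountries.includes(country.toUpperCase())) return false;
  if (policy.allowedIpRanges.length === 0) return true; // assumption: no allowlist means open
  return policy.allowedIpRanges.some((range) => inCidr(ip, range));
}
```

A production version would also need IPv6 support, which this sketch deliberately leaves out.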

1.5 Dependency Vulnerability Scanning in CI 🔴

What: Block merges when any npm dependency has a known vulnerability at or above a configurable severity threshold.

Why: Supply-chain attacks via compromised npm packages are increasingly common.

Approach:

Add a pnpm audit --audit-level=high step to the GitHub Actions CI pipeline.
Integrate Dependabot (or Renovate) for automated PR creation when new versions fix vulnerabilities.
Consider Snyk or Socket.dev for deeper analysis including transitive dependencies.

1.6 Penetration Testing & Bug Bounty Program 🟠

What: Commission an annual third-party penetration test; optionally launch a responsible disclosure (bug bounty) program via HackerOne or Bugcrowd.

Why: Automated scans miss business-logic vulnerabilities. External testers bring fresh perspective and adversarial creativity.

Approach:

Define scope document (in-scope endpoints, out-of-scope paths).
Remediate findings within agreed SLA (Critical: 24 h, High: 7 days, Medium: 30 days).
Publish a security.txt at /.well-known/security.txt for responsible disclosure contact.

2. Audit Logs & Compliance

2.1 Immutable Audit Log 🔴

What: Record every state-changing action (user created, role changed, session booked, file uploaded, organization setting updated) as an append-only log entry with actor, timestamp, IP address, and before/after diff.

Why: Audit logs are required for SOC 2 Type II, GDPR accountability obligations, and enterprise procurement questionnaires. They are also indispensable for incident forensics.

Approach:

Create an audit_logs table with columns: id, organization_id, actor_user_id, action, resource_type, resource_id, payload JSONB, ip_address, user_agent, created_at.
Write a Drizzle auditLog(ctx, action, resource) helper that inserts to this table transactionally alongside the main write.
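Never allow UPDATE or DELETE on audit_logs. Enforce this in Postgres by revoking UPDATE and DELETE privileges from all non-superuser roles, or by enabling row-level security (RLS) with policies defined only for SELECT and INSERT. A blanket USING (false) WITH CHECK (false) policy would also block the inserts and reads the log depends on.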
Expose a paginated GET /api/audit-logs endpoint gated to OrgAdmin / PlatformAdmin.
Implement log export to CSV / JSON for compliance download.
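
The before/after diff stored in the payload column could be computed along these lines. This is a sketch under stated assumptions: it compares flat records of primitive values only, and the `diffFields` and `buildAuditEntry` names and the `AuditEntry` shape are hypothetical, loosely mirroring the columns listed above.

```typescript
type Json = string | number | boolean | null;

interface FieldChange {
  before: Json;
  after: Json;
}

// Hypothetical in-memory shape mirroring part of the audit_logs columns.
interface AuditEntry {
  organizationId: string;
  actorUserId: string;
  action: string;
  resourceType: string;
  resourceId: string;
  payload: Record<string, FieldChange>;
}

// Return only the fields whose values differ; missing keys count as null.
function diffFields(
  before: Record<string, Json>,
  after: Record<string, Json>,
): Record<string, FieldChange> {
  const changes: Record<string, FieldChange> = {};
  for (const key of Object.keys({ ...before, ...after })) {
    const b = before[key] ?? null;
    const a = after[key] ?? null;
    if (b !== a) changes[key] = { before: b, after: a };
  }
  return changes;
}

function buildAuditEntry(
  meta: Omit<AuditEntry, "payload">,
  before: Record<string, Json>,
  after: Record<string, Json>,
): AuditEntry {
  return { ...meta, payload: diffFields(before, after) };
}

const entry = buildAuditEntry(
  { organizationId: "org-1", actorUserId: "u-1", action: "ROLE_CHANGED", resourceType: "user", resourceId: "u-2" },
  { role: "Coach", name: "Ana" },
  { role: "Manager", name: "Ana" },
);
console.log(JSON.stringify(entry.payload));
```

Storing only the changed fields keeps the payload small and avoids copying unrelated personal data into the log.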

2.2 GDPR & Data Privacy Tooling 🔴

What: Provide data subject request workflows: right to access (export all data for a user), right to erasure (delete or anonymize), and consent management.

Why: GDPR fines can reach €20 million or 4 % of global annual turnover, whichever is higher. Processing personal data of users in the EU brings the platform within its scope.

Approach:

Data Export: GET /api/me/data-export — ZIP archive of all personal data rows across tables, generated as a background job (see §4) and delivered via email link.
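Erasure: DELETE /api/me/account — anonymize PII fields in-place (replace name/email with a non-reversible placeholder such as a salted hash or random token, since an unsalted hash of an email address can be reversed by guessing); retain aggregate data for analytics.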
Consent records: Store timestamped consent events in a consent_records table (terms version, privacy policy version, marketing consent).
Data retention policy: Add a nightly worker (§4) that deletes records older than the configured retention window per org.
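
A minimal sketch of the in-place erasure step. Note that it swaps in a random placeholder rather than a hash of the original value, so the result cannot be matched back to the user by hashing a guessed email. The `UserRow` shape and the `anonymizeUser` name are hypothetical.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical subset of a users row.
interface UserRow {
  id: string;
  name: string;
  email: string;
  createdAt: string;
}

// Replace PII with placeholders while keeping the row (and its id) so that
// foreign keys and aggregate analytics continue to work.
function anonymizeUser(user: UserRow): UserRow {
  const token = randomUUID();
  return {
    ...user,
    name: "Deleted user",
    // ".invalid" is a reserved TLD, so this address can never receive mail.
    email: `deleted-${token}@example.invalid`,
  };
}

const erased = anonymizeUser({
  id: "u-1",
  name: "Ana Souza",
  email: "ana@example.com",
  createdAt: "2024-01-01T00:00:00.000Z",
});
console.log(erased.name);
```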

2.3 SOC 2 Readiness 🟠

What: Achieve SOC 2 Type I (design) and eventually Type II (operating effectiveness) certification.

Why: Unlocks enterprise contracts that require vendor SOC 2 reports.

Key controls to implement:

Control area	Implementation
Access control	§1.1 MFA, §1.2 SSO, role hierarchy already in place
Audit logging	§2.1 Immutable audit log
Availability	§5.3 uptime monitoring, alerting, SLA tracking
Confidentiality	§1.3 secrets management, encryption at rest (Postgres + MinIO)
Change management	PR-based deploys, CI gate, migration versioning
Incident response	§6 incident runbook, on-call rotation, post-mortem template
Vendor management	Third-party dependency inventory, §1.5 vulnerability scanning

2.4 Session & Access Reports 🟡

What: Generate periodic (weekly/monthly) reports per organization summarizing user activity, login counts, failed authentication attempts, and permission changes.

Why: Lets organization admins spot dormant accounts, unusual login patterns, and unauthorized role escalations without writing custom queries.

Approach:

Schedule a report-generation worker (§4) that aggregates audit_logs over the period and renders an HTML email with summary statistics and a downloadable CSV attachment.

3. Accountability & Governance

3.1 Role Change Approval Workflow 🟡

What: Introduce an optional two-step approval flow for sensitive role assignments (e.g. promoting a Coach to Manager or creating a new OrgAdmin). A second admin must approve before the role change takes effect.

Why: Prevents a single compromised admin account from silently escalating privileges.

Approach:

Add a pending_role_changes table (id, organization_id, requestor_id, target_user_id, requested_role, approved_by, status, created_at).
The role-change endpoint creates a pending record and sends an email to all existing org admins.
A separate POST /api/role-change-requests/:id/approve endpoint (OrgAdmin only) commits the change and records the approver.
Configurable per org: requireRoleChangeApproval: boolean.
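
The approval step can be sketched as a pure state transition. The field names follow the pending_role_changes columns listed above; the `approveRoleChange` name, the status values, and the error messages are assumptions for illustration.

```typescript
type RoleChangeStatus = "pending" | "approved" | "rejected";

interface PendingRoleChange {
  id: string;
  requestorId: string;
  targetUserId: string;
  requestedRole: string;
  approvedBy: string | null;
  status: RoleChangeStatus;
}

// Approve a pending request. The approver must be a different admin from the
// requestor, which is what makes this a two-person control.
function approveRoleChange(change: PendingRoleChange, approverId: string): PendingRoleChange {
  if (change.status !== "pending") {
    throw new Error("Request is no longer pending");
  }
  if (approverId === change.requestorId) {
    throw new Error("Requestor cannot approve their own request");
  }
  return { ...change, status: "approved", approvedBy: approverId };
}

const request: PendingRoleChange = {
  id: "rc-1",
  requestorId: "admin-1",
  targetUserId: "coach-7",
  requestedRole: "Manager",
  approvedBy: null,
  status: "pending",
};
console.log(approveRoleChange(request, "admin-2").status);
```

In the real endpoint, this transition and the actual role update would run in one database transaction so that an approved record always corresponds to an applied change.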

3.2 Organization Spending & Usage Limits 🟠

What: Track and enforce configurable limits per organization: maximum number of users, maximum storage (GB), maximum sessions per month, and maximum video call minutes.

Why: Usage limits protect platform economics, prevent abuse, and underpin tiered pricing plans.

Approach:

Add an organization_quotas table and an organization_usage_metrics table updated by workers.
Enforce quotas in the relevant API routes before creating a resource; return HTTP 402 with a clear error message when exceeded.
Expose GET /api/organizations/:id/usage to administrators.
Alert organization admins at 80 % and 100 % of quota via email and in-app notification.
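
The enforcement and alert thresholds above can be sketched with two small functions. The names `canCreate` and `quotaStatus` and the status labels are hypothetical; integer arithmetic is used for the 80 % comparison to avoid floating-point rounding.

```typescript
type QuotaStatus = "ok" | "warning" | "reached";

// True when creating `amount` more units stays within the limit.
// A route would return HTTP 402 when this is false.
function canCreate(used: number, limit: number, amount: number = 1): boolean {
  return used + amount <= limit;
}

// Classify current usage for alerting: "warning" from 80 %, "reached" at 100 %.
function quotaStatus(used: number, limit: number): QuotaStatus {
  if (used >= limit) return "reached";
  if (used * 100 >= limit * 80) return "warning";
  return "ok";
}

console.log(quotaStatus(45, 50), canCreate(50, 50));
```

A worker that updates usage metrics could compare the status before and after each update and send a notification only when it changes, which avoids repeated alerts.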

3.3 Billing & Subscription Management 🟠

What: Integrate with Stripe (or Paddle) for subscription billing, plan management, invoicing, and dunning.

Why: Monetization is a prerequisite for SaaS sustainability. Manual invoicing does not scale.

Approach:

Add a subscriptions table linked to Stripe Customer / Subscription IDs.
Handle Stripe webhooks (invoice.paid, customer.subscription.deleted, etc.) in a new /api/billing/webhook endpoint.
Expose a billing portal link (POST /api/billing/portal-session) that redirects to Stripe's hosted portal for plan upgrades, payment method changes, and invoice history.
Gate feature access by subscription plan using a featureFlags lookup from the active subscription's product metadata.

3.4 Impersonation Audit Trail 🟡

What: Allow PlatformAdmin users to impersonate any user for support purposes, with a mandatory audit log entry and a visible impersonation banner in the UI.

Why: Support teams need to reproduce bugs in user contexts; without a formal mechanism they resort to password resets, which is both insecure and disruptive.

Approach:

POST /api/admin/impersonate/:userId — creates a short-lived impersonation session token (15-minute expiry, non-renewable) and records an IMPERSONATION_STARTED audit log event.
The frontend detects an impersonation token and displays a persistent yellow banner: "You are viewing as [User Name]. [End Session]".
Ending the session records IMPERSONATION_ENDED and revokes the token.

4. Background Workers & Job Queue

4.1 Persistent Job Queue 🔴

What: Replace any fire-and-forget setImmediate / setTimeout patterns with a durable job queue backed by Redis (BullMQ) or Postgres (pg-boss / Graphile Worker).

Why: In-process async work is lost on server restart. A durable queue retries failed jobs, provides visibility into job status, and decouples heavy processing from the HTTP request cycle.

Recommended library: BullMQ (Redis-backed, TypeScript-native, dashboard available via Bull Board).

Initial job types to migrate:

Job	Trigger
Send email	Any `nodemailer.sendMail` call
Generate data export	User requests GDPR export
Resize / transcode uploaded files	File upload to MinIO
Session reminder notifications	Scheduled N hours before session start time
Usage metrics aggregation	Nightly cron
Audit log archival	Weekly cron — move old entries to cold storage

4.2 Cron / Scheduled Jobs 🟠

What: A structured cron system for recurring maintenance tasks, distinct from user-triggered jobs.

Recommended jobs:

Schedule	Job
Every 5 minutes	Health-check all external integrations; alert on failure
Hourly	Sync calendar events to/from Google Calendar / Outlook
Nightly 02:00 UTC	Data retention cleanup (delete expired records)
Nightly 03:00 UTC	Generate and email usage reports to org admins
Weekly Sunday	Aggregate weekly statistics per organization
Monthly, 1st	Generate and send invoices (if not using Stripe)

Approach: Use BullMQ Queue + Worker with repeatable jobs. From BullMQ 2.0 onwards the separate QueueScheduler class is no longer needed. Define all cron jobs in a single backend/src/workers/cron.ts file for discoverability.

4.3 Worker Observability 🟡

What: Expose a dashboard (Bull Board or Taskforce.sh) showing job queues, retry counts, failed jobs, and throughput. Emit job lifecycle metrics to the monitoring stack.

Why: Without visibility, silent job failures go undetected. A failed "send invoice" job has direct revenue impact.

Approach:

Mount Bull Board at /admin/queues (PlatformAdmin only, protected by role middleware).
Emit OpenTelemetry spans for each job execution.
Alert when the number of permanently failed jobs (the failed set, which serves as the dead-letter queue) exceeds a threshold.

4.4 Webhook Delivery System 🟡

What: Allow organizations to register webhook endpoints that receive real-time event notifications (session booked, user joined, file uploaded, etc.).

Why: Enables integration with third-party systems (Zapier, CRMs, LMS platforms) without polling the API.

Approach:

organization_webhooks table: id, organization_id, url, secret, events JSONB, active.
On relevant events, enqueue a deliverWebhook job that POSTs a signed payload (HMAC-SHA256) to the registered URL.
Retry with exponential back-off (up to 5 attempts); disable the webhook after 10 consecutive failures and notify the org admin.
Expose GET /api/webhooks/:id/deliveries for delivery history and manual replay.
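
The signing and retry rules above can be sketched with Node's built-in crypto module. The function names, the hex encoding of the signature, and the one-second base delay are assumptions; the source specifies only HMAC-SHA256 and exponential back-off with up to 5 attempts.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign the exact request body with the webhook's shared secret.
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// What a receiver would run: recompute the signature and compare in
// constant time to avoid leaking information through timing.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function retryDelayMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}

const body = JSON.stringify({ event: "session.booked", sessionId: "s-1" });
const signature = signPayload("whsec_example", body);
console.log(verifySignature("whsec_example", body, signature));
```

The signature would typically travel in a request header so the receiver can verify the raw body before parsing it.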

5. Observability & Reliability

5.1 Structured Logging 🔴

What: Replace console.log throughout the backend with a structured logger (Pino or Winston) that emits JSON with consistent fields: timestamp, level, service, traceId, userId, orgId, method, path, statusCode, durationMs.

Why: Structured logs are queryable in any log aggregation tool (Loki, Datadog, CloudWatch). Unstructured logs are hard to search, filter, and aggregate at scale.

Approach:

Introduce backend/src/lib/logger.ts exporting a Pino instance.
Replace all console.* calls.
Add a Hono request-logging middleware that logs one line per request with the fields above.
In Docker Compose dev, pipe logs to Loki + Grafana for local observability.
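
The one-line-per-request format can be sketched without any logging library. The field list comes from the recommendation above; the `buildRequestLog` name and the mapping from status code to log level are assumptions, and in practice a Pino instance would handle the timestamp, level, and serialization.

```typescript
interface RequestLog {
  timestamp: string;
  level: "info" | "warn" | "error";
  service: string;
  traceId: string;
  userId: string | null;
  orgId: string | null;
  method: string;
  path: string;
  statusCode: number;
  durationMs: number;
}

// Assumed convention: 5xx logs as error, 4xx as warn, everything else as info.
function levelForStatus(statusCode: number): RequestLog["level"] {
  if (statusCode >= 500) return "error";
  if (statusCode >= 400) return "warn";
  return "info";
}

// Build the JSON line for one completed request.
function buildRequestLog(
  fields: Omit<RequestLog, "timestamp" | "level">,
  now: Date = new Date(),
): string {
  const entry: RequestLog = {
    timestamp: now.toISOString(),
    level: levelForStatus(fields.statusCode),
    ...fields,
  };
  return JSON.stringify(entry);
}

console.log(
  buildRequestLog({
    service: "backend",
    traceId: "trace-123",
    userId: "u-1",
    orgId: "org-1",
    method: "GET",
    path: "/api/health",
    statusCode: 200,
    durationMs: 12,
  }),
);
```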

5.2 Distributed Tracing 🟡

What: Instrument the backend with OpenTelemetry to produce traces that span HTTP requests, database queries, Redis operations, and worker jobs.

Why: Helps diagnose latency issues and understand cross-service dependencies as the system grows.

Approach:

Add @opentelemetry/sdk-node and instrument Hono, Drizzle, and Redis clients.
Export traces to a local Jaeger instance in dev; in production send to Honeycomb or Datadog.

5.3 Uptime Monitoring & SLA Tracking 🔴

What: External uptime checks on /api/health every 60 seconds from multiple regions. Alert the on-call engineer within 2 minutes of downtime.

Why: You cannot meet an SLA you are not measuring.

Approach:

Configure Better Uptime (Better Stack) or Checkly.
Add a GET /api/status page (separate from /api/health) that returns per-service availability for display on a public status page (Statuspage.io or a self-hosted Cachet instance).

5.4 Error Tracking 🔴

What: Capture unhandled exceptions and unexpected errors in both the backend and frontend with full stack traces, user context, and breadcrumbs.

Recommended: Sentry (self-hosted or SaaS).

Approach:

Add @sentry/node to the backend; wrap the Hono error handler.
Add @sentry/react-native to the frontend; wrap the root navigator.
Configure source map upload in CI so production stack traces resolve to TypeScript source.
Set alert rules: email on-call when a new issue fires more than 5 times in 10 minutes.

5.5 Database Performance & Slow-Query Monitoring 🟡

What: Enable pg_stat_statements on the Postgres instance and expose a query for the top 20 slowest queries. Alert when a query exceeds 500 ms average.

Why: Performance regressions often start as slowly degrading queries invisible to users until the system becomes unusable.

Approach:

Add shared_preload_libraries = 'pg_stat_statements' to postgresql.conf, restart Postgres, and run CREATE EXTENSION pg_stat_statements in the application database.
Create a GET /api/admin/db-stats endpoint (PlatformAdmin only).
In production, forward metrics to Grafana via the Postgres Prometheus exporter.

6. Incident Response & Operational Maturity

6.1 Incident Runbook 🟠

What: A documented step-by-step runbook covering the most likely incident types: database down, high error rate, storage full, certificate expiry, and secrets compromise.

Why: Under stress, engineers make mistakes. A runbook transforms a panic into a procedure.

Suggested runbook location: docs/operations/incident-runbooks.md

Minimum content per incident type:

Detection — how to confirm the incident is real.
Immediate mitigation — steps to reduce user impact within 5 minutes.
Root cause investigation — queries, commands, dashboards to consult.
Resolution — steps to fully restore service.
Post-incident — what to update in monitoring/alerts to prevent recurrence.

6.2 On-Call Rotation & Escalation Policy 🟠

What: Define a formal on-call schedule with primary and secondary responders, response-time SLAs (acknowledge within 15 min, resolve within 2 h for Critical), and an escalation path.

Tools: PagerDuty, OpsGenie, or BetterStack Incidents.

6.3 Post-Mortem Culture 🟡

What: After every severity-1 incident, publish a blameless post-mortem within 48 hours covering timeline, root cause, contributing factors, and action items with owners and due dates.

Why: Post-mortems convert costly failures into organizational knowledge and prevent repeat incidents.

Suggested location: docs/operations/post-mortems/YYYY-MM-DD-<slug>.md

6.4 Disaster Recovery & Backup Testing 🔴

What: Automated daily backups of Postgres and MinIO, with documented and regularly tested restore procedures.

Why: Backups that have never been restored are not backups.

Approach:

Daily pg_dump job (§4 cron) uploads an encrypted snapshot to a separate S3 bucket / region from the live data.
Monthly restore drill: spin up a blank Postgres instance, restore the latest backup, run the test suite against the restored data, and record the result.
Document Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets.

7. Multi-Tenancy Hardening

7.1 Row-Level Security (RLS) on All Tenant Tables 🔴

What: Enable Postgres Row-Level Security on every table that contains organization-scoped data so that a query running with the application role can only see rows belonging to the current organization, even if the application-level WHERE orgId = ? clause is accidentally omitted.

Why: Defense in depth against tenant data leakage caused by future application bugs or missing WHERE clauses.

Approach:

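Set app.current_org_id as a Postgres setting at the start of each request, scoped to the request's transaction (SET LOCAL or set_config(..., true)), for example in a per-request transaction wrapper. A session-level SET on a pooled connection can leak into the next request that reuses the connection.
Define RLS policies: USING (organization_id = current_setting('app.current_org_id')::uuid). The cast is needed because current_setting returns text; ::uuid is shown as an example and should match the column's actual type.
PlatformAdmin queries bypass RLS via a dedicated role with the BYPASSRLS attribute rather than a full superuser.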
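
The statements a migration would need per tenant table can be sketched as a small generator. This is illustrative only: it assumes organization_id is a uuid column, and the `rlsStatements` name and the `tenant_isolation` policy name are hypothetical. The SQL itself uses standard Postgres syntax.

```typescript
// Run inside each request's transaction with the org id as $1. The third
// argument `true` makes the setting local to the current transaction, so it
// cannot leak to another request through a pooled connection.
const SET_ORG_SQL = "SELECT set_config('app.current_org_id', $1, true)";

// Quote an identifier so unusual table names cannot break the statement.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build the migration statements that enable tenant isolation on one table.
function rlsStatements(table: string): string[] {
  const t = quoteIdent(table);
  // Assumption: organization_id is a uuid column.
  const predicate = "organization_id = current_setting('app.current_org_id')::uuid";
  return [
    `ALTER TABLE ${t} ENABLE ROW LEVEL SECURITY`,
    // FORCE applies the policies to the table owner as well.
    `ALTER TABLE ${t} FORCE ROW LEVEL SECURITY`,
    `CREATE POLICY tenant_isolation ON ${t} USING (${predicate}) WITH CHECK (${predicate})`,
  ];
}

for (const statement of rlsStatements("sessions")) {
  console.log(`${statement};`);
}
console.log(SET_ORG_SQL);
```

The WITH CHECK clause applies the same predicate to inserted and updated rows, so a request cannot write a row into another organization either.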

7.2 Tenant Data Isolation Tests 🔴

What: An automated integration test suite that verifies Organization A cannot read or write Organization B's data through any API endpoint.

Why: Cross-tenant data leakage is the most catastrophic SaaS vulnerability. Automated tests prevent regressions.

Approach:

Add a tenant-isolation.test.ts integration test file.
For every resource type (users, sessions, files, events, etc.), create two organizations with test data, authenticate as Org A, and assert that every Org B resource returns 403 or 404.

8. Developer Experience & Platform APIs

8.1 Public REST API & API Keys 🟡

What: Issue long-lived API keys to organization administrators for server-to-server integrations. API keys authenticate like session cookies but are not browser-bound.

Approach:

organization_api_keys table: id, organization_id, key_hash, name, scopes, last_used_at, expires_at.
Store sha256(key) only — the raw key is shown once at creation.
Authenticate via Authorization: Bearer <key> header alongside the existing cookie auth.
Scope API keys to a subset of operations (read-only, sessions-only, etc.).
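
Key creation and verification can be sketched with Node's built-in crypto module. The `ca_` prefix, the 32-byte key length, and the function names are assumptions; the source specifies only that sha256(key) is stored and the raw key is shown once.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Hash stored in organization_api_keys.key_hash. A plain SHA-256 is adequate
// here because the key is long and random, unlike a user-chosen password.
function hashKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

// Generate a new key. `rawKey` is shown to the admin once; only `keyHash`
// is persisted.
function createApiKey(): { rawKey: string; keyHash: string } {
  const rawKey = `ca_${randomBytes(32).toString("hex")}`; // "ca_" prefix is hypothetical
  return { rawKey, keyHash: hashKey(rawKey) };
}

// Compare the presented key against the stored hash in constant time.
function verifyApiKey(presentedKey: string, storedHash: string): boolean {
  const presented = Buffer.from(hashKey(presentedKey), "hex");
  const stored = Buffer.from(storedHash, "hex");
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}

const created = createApiKey();
console.log(verifyApiKey(created.rawKey, created.keyHash));
```

Because the lookup is by hash, the authentication middleware can hash the presented Bearer token and query key_hash directly, then check expires_at and scopes on the matching row.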

8.2 GraphQL or tRPC Layer 🟢

What: Expose a typed, self-documenting GraphQL or tRPC API as an alternative to the REST API, enabling richer queries and strong end-to-end type safety for third-party integrations.

Why: REST endpoints multiply as data requirements grow more complex. GraphQL / tRPC reduce over-fetching and under-fetching and keep client code lean.

8.3 Developer Documentation Site 🟡

What: Publish a public-facing API reference (beyond the internal MkDocs site) with:

Interactive API explorer (Swagger UI / Redoc / Scalar).
Code samples in JavaScript, Python, and cURL.
Changelog RSS feed.
Rate-limit and quota documentation.

Approach: Auto-generate an OpenAPI 3.1 spec from Hono route definitions using @hono/zod-openapi or hono-openapi.

9. Accessibility & Internationalisation

9.1 Full WCAG 2.1 AA Compliance 🟠

What: Audit the frontend for accessibility failures and remediate: keyboard navigation, screen-reader labels, colour contrast, focus management in modals.

Why: Legal requirement in many jurisdictions; expands the addressable market.

Approach: Run axe-core in the test suite (via jest-axe) on every modal and screen component. Fix all critical and serious violations before each release.

9.2 Full i18n Support 🟡

What: Extract all remaining hard-coded English strings in the frontend into the i18n translation files, complete the partially done Spanish locale, and add at least one additional locale (e.g. Portuguese).

Why: The platform serves Latin American educational markets where Spanish and Portuguese are primary languages.

10. Feature Flags & Gradual Rollouts

10.1 Feature Flag System 🟡

What: A lightweight feature flag system that lets the platform enable or disable specific features per organization, per user, or globally — without a code deployment.

Why: Enables canary releases, A/B testing, and per-tier feature gating (e.g. disable video calls for the free tier).

Recommended approach:

Simple option: add a featureFlags JSONB column to organizations; the backend reads it and the frontend receives active flags in the /api/me response.
Advanced option: integrate OpenFeature with a flag provider such as Unleash or Flagsmith (both self-hostable).
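
The simple option can be sketched as a merge of flag maps. The precedence order used here (user overrides organization, which overrides global defaults) is an assumption, as are the `resolveFlags` and `isEnabled` names and the example flag names.

```typescript
type Flags = Record<string, boolean>;

// Merge flag sources; later sources win. Assumed precedence:
// global defaults < organization featureFlags column < per-user overrides.
function resolveFlags(globalFlags: Flags, orgFlags: Flags, userFlags: Flags = {}): Flags {
  return { ...globalFlags, ...orgFlags, ...userFlags };
}

// Unknown flags are treated as disabled.
function isEnabled(flags: Flags, name: string): boolean {
  return flags[name] === true;
}

const activeFlags = resolveFlags(
  { videoCalls: false, newDashboard: false }, // global defaults
  { videoCalls: true }, // paid-tier organization
);
console.log(isEnabled(activeFlags, "videoCalls"), isEnabled(activeFlags, "newDashboard"));
```

The resolved map is what the backend would include in the /api/me response so the frontend can hide disabled features.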

Summary Priority Matrix

#	Item	Priority
1.1	Multi-Factor Authentication	🔴 Critical
1.3	Secrets Management & Rotation	🔴 Critical
1.5	Dependency Vulnerability Scanning in CI	🔴 Critical
2.1	Immutable Audit Log	🔴 Critical
2.2	GDPR & Data Privacy Tooling	🔴 Critical
4.1	Persistent Job Queue (BullMQ)	🔴 Critical
5.1	Structured Logging (Pino)	🔴 Critical
5.3	Uptime Monitoring & SLA Tracking	🔴 Critical
5.4	Error Tracking (Sentry)	🔴 Critical
6.4	Disaster Recovery & Backup Testing	🔴 Critical
7.1	Row-Level Security on Tenant Tables	🔴 Critical
7.2	Tenant Data Isolation Tests	🔴 Critical
1.2	SSO via SAML / OIDC	🟠 High
1.6	Penetration Testing & Bug Bounty	🟠 High
2.3	SOC 2 Readiness	🟠 High
3.2	Organization Usage Limits	🟠 High
3.3	Billing & Subscription Management	🟠 High
4.2	Cron / Scheduled Jobs	🟠 High
6.1	Incident Runbook	🟠 High
6.2	On-Call Rotation & Escalation Policy	🟠 High
9.1	WCAG 2.1 AA Compliance	🟠 High
1.4	IP Allowlisting & Geo-Blocking	🟡 Medium
2.4	Session & Access Reports	🟡 Medium
3.1	Role Change Approval Workflow	🟡 Medium
3.4	Impersonation Audit Trail	🟡 Medium
4.3	Worker Observability	🟡 Medium
4.4	Webhook Delivery System	🟡 Medium
5.2	Distributed Tracing (OpenTelemetry)	🟡 Medium
5.5	DB Slow-Query Monitoring	🟡 Medium
6.3	Post-Mortem Culture	🟡 Medium
8.1	Public API Keys	🟡 Medium
8.3	Public Developer Documentation	🟡 Medium
9.2	Full i18n Support	🟡 Medium
10.1	Feature Flag System	🟡 Medium
8.2	GraphQL / tRPC Layer	🟢 Nice-to-have
