Future SaaS Functionality Recommendations
This document catalogues recommended future capabilities for the Coaching App platform across security hardening, audit & compliance, accountability, background workers, and general SaaS operational maturity. Each item includes a rationale, suggested implementation approach, and a rough priority label.
Priority legend
| Label | Meaning |
|---|---|
| 🔴 Critical | Needed before serious production / enterprise use |
| 🟠 High | Strong user or compliance expectation in mature SaaS |
| 🟡 Medium | Significant value; can be deferred one or two cycles |
| 🟢 Nice-to-have | Differentiating features; tackle when capacity allows |
1. Security
1.1 Multi-Factor Authentication (MFA) 🔴
What: Allow users to enroll a TOTP authenticator app (Google Authenticator, Authy) or a hardware security key (WebAuthn / FIDO2) as a second factor. Administrators should be able to make MFA mandatory for their organization.
Why: Credential compromise is the leading cause of SaaS account takeover. MFA eliminates the majority of that risk.
Approach:
- Better-auth ships a `twoFactor` plugin; enable it and expose enroll/verify endpoints.
- Add an `mfaEnforced` flag to the `organizations` table; block access to org-scoped routes until MFA is configured.
- Surface the enrollment flow in user profile settings (QR code + backup codes).
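The enforcement rule above can be sketched as a small pure function that route middleware could call. The `OrgRecord` and `SessionUser` shapes and the `mfaEnrolled` field are assumptions for illustration, not the app's actual types.

```typescript
// Hypothetical shapes; the real organizations row and session object may differ.
interface OrgRecord { id: string; mfaEnforced: boolean }
interface SessionUser { id: string; mfaEnrolled: boolean }

type MfaDecision =
  | { allowed: true }
  | { allowed: false; reason: "MFA_ENROLLMENT_REQUIRED" };

// Decide whether a request to an org-scoped route may proceed.
function checkMfaPolicy(org: OrgRecord, user: SessionUser): MfaDecision {
  if (org.mfaEnforced && !user.mfaEnrolled) {
    // The caller would typically respond with 403 and point the client at enrollment.
    return { allowed: false, reason: "MFA_ENROLLMENT_REQUIRED" };
  }
  return { allowed: true };
}
```

Keeping the decision separate from the HTTP layer makes the rule easy to unit-test before it is wired into the tenant middleware.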
1.2 Single Sign-On (SSO) via SAML / OIDC 🟠
What: Allow enterprise organizations to authenticate their users through their own Identity Provider (Okta, Azure AD, Google Workspace, etc.) using SAML 2.0 or OpenID Connect.
Why: Enterprise buyers require SSO as a baseline. It simplifies employee lifecycle management (auto-provision / deprovision on HR events).
Approach:
- Better-auth has a `saml` plugin in progress; alternatively, integrate `passport-saml` as a Hono middleware.
- Store per-org IdP configuration (entity ID, SSO URL, certificate) in a new `organization_sso_configs` table.
- On login, detect the email domain and redirect to the org's IdP if SSO is configured.
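A minimal sketch of the domain-detection step follows. It assumes each SSO config row also carries an `emailDomain` value, which the column list above does not mention; the function names are hypothetical.

```typescript
// Hypothetical row shape for organization_sso_configs; emailDomain is an assumed column.
interface SsoConfig { organizationId: string; emailDomain: string; ssoUrl: string }

// Extract the lowercased domain from an email address, or null if malformed.
function emailDomain(email: string): string | null {
  const at = email.lastIndexOf("@");
  if (at <= 0 || at === email.length - 1) return null;
  return email.slice(at + 1).toLowerCase();
}

// Return the IdP URL when the domain has SSO configured, otherwise null
// so the caller can fall back to the regular login flow.
function resolveSsoRedirect(email: string, configs: SsoConfig[]): string | null {
  const domain = emailDomain(email);
  if (domain === null) return null;
  const match = configs.find((c) => c.emailDomain.toLowerCase() === domain);
  return match ? match.ssoUrl : null;
}
```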
1.3 Secrets Management & Rotation 🔴
What: Move all secrets (database URLs, MinIO credentials, auth/bearer token secrets, API keys) out of `.env` files and into a dedicated secrets manager such as HashiCorp Vault, AWS Secrets Manager, or Doppler.
Why: Hardcoded or file-based secrets are frequently leaked in git history or container images. Rotation without a secrets manager requires downtime or manual coordination.
Approach:
- Wrap `backend/src/config/env-requirements.ts` in a secrets-loader abstraction; in production, the loader calls the secrets manager API instead of reading `process.env`.
- Schedule automatic rotation jobs (see §4 Workers) that regenerate credentials and push them to the secrets manager; the app re-reads on next secret access.
1.4 IP Allowlisting & Geo-Blocking per Organization 🟡
What: Let organization administrators restrict access to their org's data from specific IP ranges or countries.
Why: Compliance requirements (GDPR, FERPA) often require data to be accessible only from approved jurisdictions or networks.
Approach:
- Store `allowedIpRanges` and `blockedCountries` JSON columns on `organizations`.
- Enforce in tenant middleware using the `CF-Connecting-IP` / `CF-IPCountry` Cloudflare headers already available behind the Zero Trust layer.
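The check the tenant middleware would run can be sketched as below. This is IPv4-only and assumes `allowedIpRanges` holds CIDR strings and `blockedCountries` holds uppercase ISO country codes; both formats, and the rule that an empty allowlist means "no restriction", are assumptions.

```typescript
// Convert dotted-quad IPv4 to an unsigned 32-bit integer; null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the CIDR range, e.g. "10.0.0.0/8".
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText = "32"] = cidr.split("/");
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the range
  return Math.floor(ipInt / size) === Math.floor(baseInt / size);
}

// Blocked countries always win; an empty allowlist means no IP restriction.
function isRequestAllowed(
  ip: string,
  country: string,
  allowedIpRanges: string[],
  blockedCountries: string[],
): boolean {
  if (blockedCountries.includes(country.toUpperCase())) return false;
  if (allowedIpRanges.length === 0) return true;
  return allowedIpRanges.some((range) => inCidr(ip, range));
}
```

In the middleware, `ip` and `country` would come from the `CF-Connecting-IP` and `CF-IPCountry` headers respectively.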
1.5 Dependency Vulnerability Scanning in CI 🔴
What: Block merges when any npm dependency has a known CVE above a configurable severity threshold.
Why: Supply-chain attacks via compromised npm packages are increasingly common.
Approach:
- Add a `pnpm audit --audit-level=high` step to the GitHub Actions CI pipeline.
- Integrate Dependabot (or Renovate) for automated PR creation when new versions fix vulnerabilities.
- Consider Snyk or Socket.dev for deeper analysis including transitive dependencies.
1.6 Penetration Testing & Bug Bounty Program 🟠
What: Commission an annual third-party penetration test; optionally launch a responsible disclosure (bug bounty) program via HackerOne or Bugcrowd.
Why: Automated scans miss business-logic vulnerabilities. External testers bring fresh perspective and adversarial creativity.
Approach:
- Define scope document (in-scope endpoints, out-of-scope paths).
- Remediate findings within agreed SLA (Critical: 24 h, High: 7 days, Medium: 30 days).
- Publish a `security.txt` at `/.well-known/security.txt` for responsible disclosure contact.
2. Audit Logs & Compliance
2.1 Immutable Audit Log 🔴
What: Record every state-changing action (user created, role changed, session booked, file uploaded, organization setting updated) as an append-only log entry with actor, timestamp, IP address, and before/after diff.
Why: Audit logs are required for SOC 2 Type II, GDPR accountability obligations, and enterprise procurement questionnaires. They are also indispensable for incident forensics.
Approach:
- Create an `audit_logs` table with columns: `id, organization_id, actor_user_id, action, resource_type, resource_id, payload JSONB, ip_address, user_agent, created_at`.
- Write a Drizzle `auditLog(ctx, action, resource)` helper that inserts to this table transactionally alongside the main write.
- Never allow UPDATE or DELETE on `audit_logs`; enforce via Postgres row-level security (RLS) with `USING (false) WITH CHECK (false)` deny policies scoped to UPDATE and DELETE for all non-superuser roles, while separate policies keep INSERT and SELECT available (a blanket deny policy would also block the inserts the log depends on).
- Expose a paginated `GET /api/audit-logs` endpoint gated to OrgAdmin / PlatformAdmin.
- Implement log export to CSV / JSON for compliance download.
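One way to produce the before/after diff stored in `payload` is sketched below. The `diffFields` helper and the `{ before, after }` payload layout are assumptions; the source only states that a diff is recorded.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type Snapshot = Record<string, Json>;

// Keep only the fields whose values differ between the two snapshots.
// Missing keys are treated as null so that added and removed fields show up.
function diffFields(before: Snapshot, after: Snapshot): { before: Snapshot; after: Snapshot } {
  const changedBefore: Snapshot = {};
  const changedAfter: Snapshot = {};
  const keys = Array.from(new Set(Object.keys(before).concat(Object.keys(after))));
  for (const key of keys) {
    const b = before[key] ?? null;
    const a = after[key] ?? null;
    // Structural comparison via JSON; adequate for plain row data.
    if (JSON.stringify(b) !== JSON.stringify(a)) {
      changedBefore[key] = b;
      changedAfter[key] = a;
    }
  }
  return { before: changedBefore, after: changedAfter };
}
```

Storing only changed fields keeps the JSONB payload small and avoids copying unrelated PII into the log.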
2.2 GDPR & Data Privacy Tooling 🔴
What: Provide data subject request workflows: right to access (export all data for a user), right to erasure (delete or anonymize), and consent management.
Why: GDPR fines can reach EUR 20 million or 4 % of global annual turnover, whichever is higher. Processing personal data of any user in the EU brings the platform into scope.
Approach:
- Data Export: `GET /api/me/data-export`: ZIP archive of all personal data rows across tables, generated as a background job (see §4) and delivered via email link.
- Erasure: `DELETE /api/me/account`: anonymize PII fields in-place (replace name/email with a hash); retain aggregate data for analytics.
- Consent records: Store timestamped consent events in a `consent_records` table (terms version, privacy policy version, marketing consent).
- Data retention policy: Add a nightly worker (§4) that deletes records older than the configured retention window per org.
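The in-place anonymization for the erasure endpoint might look like the sketch below. The `UserRow` shape, the per-deployment `salt`, the placeholder name, and the `@anonymized.invalid` address format are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical subset of the users row.
interface UserRow { id: string; name: string; email: string; createdAt: string }

// Replace direct identifiers with a salted SHA-256 digest so the row can stay
// in place for aggregate analytics without exposing the original name or email.
function anonymizeUser(user: UserRow, salt: string): UserRow {
  const digest = createHash("sha256")
    .update(salt)
    .update(user.email.toLowerCase())
    .digest("hex");
  return {
    ...user,
    name: "Deleted User",
    // The reserved .invalid TLD guarantees the address can never receive mail.
    email: `deleted-${digest.slice(0, 16)}@anonymized.invalid`,
  };
}
```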
2.3 SOC 2 Readiness 🟠
What: Obtain a SOC 2 Type I (design) report and eventually a Type II (operating effectiveness) report.
Why: Unlocks enterprise contracts that require vendor SOC 2 reports.
Key controls to implement:
| Control area | Implementation |
|---|---|
| Access control | §1.1 MFA, §1.2 SSO, role hierarchy already in place |
| Audit logging | §2.1 Immutable audit log |
| Availability | §4 uptime monitoring, alerting, SLA tracking |
| Confidentiality | §1.3 secrets management, encryption at rest (Postgres + MinIO) |
| Change management | PR-based deploys, CI gate, migration versioning |
| Incident response | §6 incident runbook, on-call rotation, post-mortem template |
| Vendor management | Third-party dependency inventory, §1.5 vulnerability scanning |
2.4 Session & Access Reports 🟡
What: Generate periodic (weekly/monthly) reports per organization summarizing user activity, login counts, failed authentication attempts, and permission changes.
Why: Lets organization admins spot dormant accounts, unusual login patterns, and unauthorized role escalations without writing custom queries.
Approach:
- Schedule a report-generation worker (§4) that aggregates `audit_logs` over the period and renders an HTML email with summary statistics and a downloadable CSV attachment.
3. Accountability & Governance
3.1 Role Change Approval Workflow 🟡
What: Introduce an optional two-step approval flow for sensitive role assignments (e.g. promoting a Coach to Manager or creating a new OrgAdmin). A second admin must approve before the role change takes effect.
Why: Prevents a single compromised admin account from silently escalating privileges.
Approach:
- Add a `pending_role_changes` table (`id, organization_id, requestor_id, target_user_id, requested_role, approved_by, status, created_at`).
- The role-change endpoint creates a pending record and sends an email to all existing org admins.
- A separate `POST /api/role-change-requests/:id/approve` endpoint (OrgAdmin only) commits the change and records the approver.
- Configurable per org: `requireRoleChangeApproval: boolean`.
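The approval step can be sketched as a state transition on the pending record. The status values and the rule that the target user may not approve their own promotion are assumptions; the requirement that the approver differ from the requestor follows from the "second admin" wording above.

```typescript
type RoleChangeStatus = "pending" | "approved" | "rejected";

// In-memory mirror of a pending_role_changes row (camelCased; an assumption).
interface PendingRoleChange {
  id: string;
  requestorId: string;
  targetUserId: string;
  requestedRole: string;
  approvedBy: string | null;
  status: RoleChangeStatus;
}

// Approve a pending request, returning the updated record.
// Throws when the request is already resolved or the approver is not a second party.
function approveRoleChange(request: PendingRoleChange, approverId: string): PendingRoleChange {
  if (request.status !== "pending") {
    throw new Error(`Request ${request.id} is already ${request.status}`);
  }
  if (approverId === request.requestorId || approverId === request.targetUserId) {
    throw new Error("A second admin must approve this change");
  }
  return { ...request, approvedBy: approverId, status: "approved" };
}
```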
3.2 Organization Spending & Usage Limits 🟠
What: Track and enforce configurable limits per organization: maximum number of users, maximum storage (GB), maximum sessions per month, and maximum video call minutes.
Why: Usage limits protect platform economics, prevent abuse, and underpin tiered pricing plans.
Approach:
- Add an `organization_quotas` table and an `organization_usage_metrics` table updated by workers.
- Enforce quotas in the relevant API routes before creating a resource; return HTTP 402 with a clear error message when exceeded.
- Expose `GET /api/organizations/:id/usage` to administrators.
- Alert organization admins at 80 % and 100 % of quota via email and in-app notification.
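A minimal sketch of the quota check and the 80 % / 100 % alert thresholds follows. The function names and return shapes are hypothetical, and integer arithmetic is used for the percentage comparison to avoid floating-point edge cases.

```typescript
type QuotaState = "ok" | "warning" | "exceeded";

// Classify current usage: "warning" from 80 % of the limit, "exceeded" from 100 %.
function quotaState(used: number, limit: number): QuotaState {
  if (used >= limit) return "exceeded";
  if (used * 100 >= limit * 80) return "warning";
  return "ok";
}

interface QuotaDecision { allowed: boolean; httpStatus: 200 | 402; message: string }

// Decide whether creating `requested` more units stays within the limit.
function checkQuota(used: number, requested: number, limit: number): QuotaDecision {
  if (used + requested > limit) {
    return {
      allowed: false,
      httpStatus: 402,
      message: `Quota exceeded: ${used} of ${limit} used, ${requested} requested`,
    };
  }
  return { allowed: true, httpStatus: 200, message: "ok" };
}
```

A route handler would call `checkQuota` before the insert, and the usage worker would call `quotaState` to decide which alert, if any, to send.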
3.3 Billing & Subscription Management 🟠
What: Integrate with Stripe (or Paddle) for subscription billing, plan management, invoicing, and dunning.
Why: Monetization is a prerequisite for SaaS sustainability. Manual invoicing does not scale.
Approach:
- Add a `subscriptions` table linked to Stripe Customer / Subscription IDs.
- Handle Stripe webhooks (`invoice.paid`, `customer.subscription.deleted`, etc.) in a new `/api/billing/webhook` endpoint.
- Expose a billing portal link (`POST /api/billing/portal-session`) that redirects to Stripe's hosted portal for plan upgrades, payment method changes, and invoice history.
- Gate feature access by subscription plan using a `featureFlags` lookup from the active subscription's product metadata.
3.4 Impersonation Audit Trail 🟡
What: Allow PlatformAdmin users to impersonate any user for support purposes, with a mandatory audit log entry and a visible impersonation banner in the UI.
Why: Support teams need to reproduce bugs in user contexts; without a formal mechanism they resort to password resets, which is both insecure and disruptive.
Approach:
- `POST /api/admin/impersonate/:userId`: creates a short-lived impersonation session token (15-minute expiry, non-renewable) and records an `IMPERSONATION_STARTED` audit log event.
- The frontend detects an impersonation token and displays a persistent yellow banner: "You are viewing as [User Name]. [End Session]".
- Ending the session records `IMPERSONATION_ENDED` and revokes the token.
4. Background Workers & Job Queue
4.1 Persistent Job Queue 🔴
What: Replace any fire-and-forget `setImmediate` / `setTimeout` patterns with a durable job queue backed by Redis (BullMQ) or Postgres (pg-boss / Graphile Worker).
Why: In-process async work is lost on server restart. A durable queue retries failed jobs, provides visibility into job status, and decouples heavy processing from the HTTP request cycle.
Recommended library: BullMQ (Redis-backed, TypeScript-native, dashboard available via Bull Board).
Initial job types to migrate:
| Job | Trigger |
|---|---|
| Send email | Any `nodemailer.sendMail` call |
| Generate data export | User requests GDPR export |
| Resize / transcode uploaded files | File upload to MinIO |
| Session reminder notifications | Scheduled N hours before session start time |
| Usage metrics aggregation | Nightly cron |
| Audit log archival | Weekly cron (move old entries to cold storage) |
4.2 Cron / Scheduled Jobs 🟠
What: A structured cron system for recurring maintenance tasks, distinct from user-triggered jobs.
Recommended jobs:
| Schedule | Job |
|---|---|
| Every 5 minutes | Health-check all external integrations; alert on failure |
| Hourly | Sync calendar events to/from Google Calendar / Outlook |
| Nightly 02:00 UTC | Data retention cleanup (delete expired records) |
| Nightly 03:00 UTC | Generate and email usage reports to org admins |
| Weekly Sunday | Aggregate weekly statistics per organization |
| Monthly 1st | Generate and send invoices (if not using Stripe) |
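Approach: Use BullMQ `Queue` + `Worker` with repeatable jobs. A separate `QueueScheduler` is only needed on BullMQ 1.x; from 2.0 onward it is no longer required. Define all cron jobs in a single `backend/src/workers/cron.ts` file for discoverability.
4.3 Worker Observability 🟡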
What: Expose a dashboard (Bull Board or Taskforce.sh) showing job queues, retry counts, failed jobs, and throughput. Emit job lifecycle metrics to the monitoring stack.
Why: Without visibility, silent job failures go undetected. A failed "send invoice" job has direct revenue impact.
Approach:
- Mount Bull Board at `/admin/queues` (PlatformAdmin only, protected by role middleware).
- Emit OpenTelemetry spans for each job execution.
- Alert when the dead-letter queue length exceeds a threshold.
4.4 Webhook Delivery System 🟡
What: Allow organizations to register webhook endpoints that receive real-time event notifications (session booked, user joined, file uploaded, etc.).
Why: Enables integration with third-party systems (Zapier, CRMs, LMS platforms) without polling the API.
Approach:
- `organization_webhooks` table: `id, organization_id, url, secret, events JSONB, active`.
- On relevant events, enqueue a `deliverWebhook` job that POSTs a signed payload (HMAC-SHA256) to the registered URL.
- Retry each delivery with exponential back-off (up to 5 attempts); disable the webhook after 10 consecutive failed deliveries and notify the org admin.
- Expose `GET /api/webhooks/:id/deliveries` for delivery history and manual replay.
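The signing and back-off pieces of the `deliverWebhook` job can be sketched with Node's built-in `crypto` module. Hex encoding of the signature, the 1-second base delay, and the 60-second cap are assumptions, as are the function names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign the raw request body with the webhook's shared secret (HMAC-SHA256, hex).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Constant-time verification, as a receiving system would perform it.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

The signature would typically travel in a request header so the receiver can verify it against the exact bytes of the body before parsing.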
5. Observability & Reliability
5.1 Structured Logging 🔴
What: Replace `console.log` throughout the backend with a structured logger (Pino or Winston) that emits JSON with consistent fields: `timestamp`, `level`, `service`, `traceId`, `userId`, `orgId`, `method`, `path`, `statusCode`, `durationMs`.
Why: Structured logs are queryable in any log aggregation tool (Loki, Datadog, CloudWatch). Unstructured logs are useless at scale.
Approach:
- Introduce `backend/src/lib/logger.ts` exporting a Pino instance.
- Replace all `console.*` calls.
- Add a Hono request-logging middleware that logs one line per request with the fields above.
- In Docker Compose dev, pipe logs to Loki + Grafana for local observability.
5.2 Distributed Tracing 🟡
What: Instrument the backend with OpenTelemetry to produce traces that span HTTP requests, database queries, Redis operations, and worker jobs.
Why: Helps diagnose latency issues and understand cross-service dependencies as the system grows.
Approach:
- Add `@opentelemetry/sdk-node` and instrument Hono, Drizzle, and Redis clients.
- Export traces to a local Jaeger instance in dev; in production send to Honeycomb or Datadog.
5.3 Uptime Monitoring & SLA Tracking 🔴
What: External uptime checks on `/api/health` every 60 seconds from multiple regions. Alert the on-call engineer within 2 minutes of downtime.
Why: You cannot meet an SLA you are not measuring.
Approach:
- Configure Better Uptime (Better Stack) or Checkly.
- Add a `GET /api/status` page (separate from `/api/health`) that returns per-service availability for display on a public status page (Statuspage.io or a self-hosted Cachet instance).
5.4 Error Tracking 🔴
What: Capture unhandled exceptions and unexpected errors in both the backend and frontend with full stack traces, user context, and breadcrumbs.
Recommended: Sentry (open source self-hosted or SaaS).
Approach:
- Add `@sentry/node` to the backend; wrap the Hono error handler.
- Add `@sentry/react-native` to the frontend; wrap the root navigator.
- Configure source map upload in CI so production stack traces resolve to TypeScript source.
- Set alert rules: email on-call when a new issue fires more than 5 times in 10 minutes.
5.5 Database Performance & Slow-Query Monitoring 🟡
What: Enable `pg_stat_statements` on the Postgres instance and expose a query for the top 20 slowest queries. Alert when a query exceeds 500 ms average.
Why: Performance regressions often start as slowly degrading queries invisible to users until the system becomes unusable.
Approach:
- Add `shared_preload_libraries = 'pg_stat_statements'` to `postgresql.conf`.
- Create a `GET /api/admin/db-stats` endpoint (PlatformAdmin only).
- In production, forward metrics to Grafana via the Postgres Prometheus exporter.
6. Incident Response & Operational Maturity
6.1 Incident Runbook 🟠
What: A documented step-by-step runbook covering the most likely incident types: database down, high error rate, storage full, certificate expiry, and secrets compromise.
Why: Under stress, engineers make mistakes. A runbook transforms a panic into a procedure.
Suggested runbook location: `docs/operations/incident-runbooks.md`
Minimum content per incident type:
- Detection: how to confirm the incident is real.
- Immediate mitigation: steps to reduce user impact within 5 minutes.
- Root cause investigation: queries, commands, dashboards to consult.
- Resolution: steps to fully restore service.
- Post-incident: what to update in monitoring/alerts to prevent recurrence.
6.2 On-Call Rotation & Escalation Policy 🟠
What: Define a formal on-call schedule with primary and secondary responders, response-time SLAs (acknowledge within 15 min, resolve within 2 h for Critical), and an escalation path.
Tools: PagerDuty, OpsGenie, or BetterStack Incidents.
6.3 Post-Mortem Culture 🟡
What: After every severity-1 incident, publish a blameless post-mortem within 48 hours covering timeline, root cause, contributing factors, and action items with owners and due dates.
Why: Post-mortems convert costly failures into organizational knowledge and prevent repeat incidents.
Suggested location: `docs/operations/post-mortems/YYYY-MM-DD-<slug>.md`
6.4 Disaster Recovery & Backup Testing 🔴
What: Automated daily backups of Postgres and MinIO, with documented and regularly tested restore procedures.
Why: Backups that have never been restored are not backups.
Approach:
- Daily `pg_dump` job (§4 cron) uploads an encrypted snapshot to a separate S3 bucket / region from the live data.
- Monthly restore drill: spin up a blank Postgres instance, restore the latest backup, run the test suite against the restored data, and record the result.
- Document Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets.
7. Multi-Tenancy Hardening
7.1 Row-Level Security (RLS) on All Tenant Tables 🔴
What: Enable Postgres Row-Level Security on every table that contains organization-scoped data so that a query running with the application role can only see rows belonging to the current organization, even if the application-level `WHERE orgId = ?` clause is accidentally omitted.
Why: Defense in depth against tenant data leakage caused by future application bugs or missing WHERE clauses.
Approach:
- Set `app.current_org_id` as a Postgres session variable at the start of each request (via a Drizzle `beforeQuery` hook or a connection pool `acquire` callback).
- Define RLS policies: `USING (organization_id = current_setting('app.current_org_id'))`.
- PlatformAdmin queries bypass RLS via a dedicated superuser role.
7.2 Tenant Data Isolation Tests 🔴
What: An automated integration test suite that verifies Organization A cannot read or write Organization B's data through any API endpoint.
Why: Cross-tenant data leakage is the most catastrophic SaaS vulnerability. Automated tests prevent regressions.
Approach:
- Add a `tenant-isolation.test.ts` integration test file.
- For every resource type (users, sessions, files, events, etc.), create two organizations with test data, authenticate as Org A, and assert that every Org B resource returns 403 or 404.
8. Developer Experience & Platform APIs
8.1 Public REST API & API Keys 🟡
What: Issue long-lived API keys to organization administrators for server-to-server integrations. API keys authenticate like session cookies but are not browser-bound.
Approach:
- `organization_api_keys` table: `id, organization_id, key_hash, name, scopes, last_used_at, expires_at`.
- Store `sha256(key)` only; the raw key is shown once at creation.
- Authenticate via `Authorization: Bearer <key>` header alongside the existing cookie auth.
- Scope API keys to a subset of operations (read-only, sessions-only, etc.).
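Key issuance and verification could be sketched as follows, using only Node's `crypto` module. The `ck_` prefix, the 32-byte key length, and the helper names are assumptions for illustration.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Hash stored in organization_api_keys.key_hash.
function hashKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

// Create a key: the raw value is returned once; only the hash is persisted.
function generateApiKey(): { rawKey: string; keyHash: string } {
  const rawKey = `ck_${randomBytes(32).toString("base64url")}`; // "ck_" prefix is hypothetical
  return { rawKey, keyHash: hashKey(rawKey) };
}

// Pull the token out of an "Authorization: Bearer <key>" header value.
function extractBearer(header: string | undefined): string | null {
  const match = /^Bearer\s+(\S+)$/i.exec(header ?? "");
  return match ? match[1] : null;
}

// Compare the presented key's hash to the stored hash in constant time.
function verifyApiKey(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashKey(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because the key is high-entropy random data, a plain SHA-256 digest is generally considered sufficient here, unlike for user-chosen passwords.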
8.2 GraphQL or tRPC Layer 🟢
What: Expose a typed, self-documenting GraphQL or tRPC API as an alternative to the REST API, enabling richer queries and strong end-to-end type safety for third-party integrations.
Why: REST endpoints multiply as data requirements grow more complex. GraphQL / tRPC reduce over-fetching and under-fetching and keep client code lean.
8.3 Developer Documentation Site 🟡
What: Publish a public-facing API reference (beyond the internal MkDocs site) with:
- Interactive API explorer (Swagger UI / Redoc / Scalar).
- Code samples in JavaScript, Python, and cURL.
- Changelog RSS feed.
- Rate-limit and quota documentation.
Approach: Auto-generate an OpenAPI 3.1 spec from Hono route definitions using `@hono/zod-openapi` or `hono-openapi`.
9. Accessibility & Internationalisation
9.1 Full WCAG 2.1 AA Compliance 🟠
What: Audit the frontend for accessibility failures and remediate: keyboard navigation, screen-reader labels, colour contrast, focus management in modals.
Why: Legal requirement in many jurisdictions; expands the addressable market.
Approach: Run `axe-core` in the test suite (via `jest-axe`) on every modal and screen component. Fix all critical and serious violations before each release.
9.2 Full i18n Support 🟡
What: Extract all remaining hard-coded English strings in the frontend into the i18n translation files, complete the partially done Spanish locale, and add Portuguese as an additional locale.
Why: The platform serves Latin American educational markets where Spanish and Portuguese are primary languages.
10. Feature Flags & Gradual Rollouts
10.1 Feature Flag System 🟡
What: A lightweight feature flag system that lets the platform enable or disable specific features per organization, per user, or globally, without a code deployment.
Why: Enables canary releases, A/B testing, and per-tier feature gating (e.g. disable video calls for the free tier).
Recommended approach:
- Simple option: add a `featureFlags JSONB` column to `organizations`; the backend reads it and the frontend receives active flags in the `/api/me` response.
- Advanced option: integrate OpenFeature with a flag provider such as Unleash or Flagsmith (both self-hostable).
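For the simple option, flag resolution might be sketched as below. The three-layer precedence (user over organization over global) and the rule that unknown flags default to off are assumptions; the source only specifies the `featureFlags` column on `organizations`.

```typescript
type FlagMap = Record<string, boolean>;

interface FlagLayers {
  global: FlagMap;          // platform-wide defaults
  organization?: FlagMap;   // contents of organizations.featureFlags
  user?: FlagMap;           // optional per-user overrides
}

// Resolve a flag using the most specific layer that defines it.
function isFeatureEnabled(flag: string, layers: FlagLayers): boolean {
  for (const layer of [layers.user, layers.organization, layers.global]) {
    if (layer && Object.prototype.hasOwnProperty.call(layer, flag)) {
      return layer[flag];
    }
  }
  return false; // unknown flags are treated as disabled
}
```

The backend could evaluate this once per request and include the resolved flags in the `/api/me` response.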
Summary Priority Matrix
| # | Item | Priority |
|---|---|---|
| 1.1 | Multi-Factor Authentication | 🔴 Critical |
| 1.3 | Secrets Management & Rotation | 🔴 Critical |
| 1.5 | Dependency Vulnerability Scanning in CI | 🔴 Critical |
| 2.1 | Immutable Audit Log | 🔴 Critical |
| 2.2 | GDPR & Data Privacy Tooling | 🔴 Critical |
| 4.1 | Persistent Job Queue (BullMQ) | 🔴 Critical |
| 5.1 | Structured Logging (Pino) | 🔴 Critical |
| 5.3 | Uptime Monitoring & SLA Tracking | 🔴 Critical |
| 5.4 | Error Tracking (Sentry) | 🔴 Critical |
| 6.4 | Disaster Recovery & Backup Testing | 🔴 Critical |
| 7.1 | Row-Level Security on Tenant Tables | 🔴 Critical |
| 7.2 | Tenant Data Isolation Tests | 🔴 Critical |
| 1.2 | SSO via SAML / OIDC | 🟠 High |
| 1.6 | Penetration Testing & Bug Bounty | 🟠 High |
| 2.3 | SOC 2 Readiness | 🟠 High |
| 3.2 | Organization Usage Limits | 🟠 High |
| 3.3 | Billing & Subscription Management | 🟠 High |
| 4.2 | Cron / Scheduled Jobs | 🟠 High |
| 6.1 | Incident Runbook | 🟠 High |
| 6.2 | On-Call Rotation & Escalation Policy | 🟠 High |
| 9.1 | WCAG 2.1 AA Compliance | 🟠 High |
| 1.4 | IP Allowlisting & Geo-Blocking | 🟡 Medium |
| 2.4 | Session & Access Reports | 🟡 Medium |
| 3.1 | Role Change Approval Workflow | 🟡 Medium |
| 3.4 | Impersonation Audit Trail | 🟡 Medium |
| 4.3 | Worker Observability | 🟡 Medium |
| 4.4 | Webhook Delivery System | 🟡 Medium |
| 5.2 | Distributed Tracing (OpenTelemetry) | 🟡 Medium |
| 5.5 | DB Slow-Query Monitoring | 🟡 Medium |
| 6.3 | Post-Mortem Culture | 🟡 Medium |
| 8.1 | Public API Keys | 🟡 Medium |
| 8.3 | Public Developer Documentation | 🟡 Medium |
| 9.2 | Full i18n Support | 🟡 Medium |
| 10.1 | Feature Flag System | 🟡 Medium |
| 8.2 | GraphQL / tRPC Layer | 🟢 Nice-to-have |