Open Infra Customer Service — Professional Guide
Overview and scope
Open infrastructure customer service refers to the operational, technical and contractual practices used to support production deployments of open-source infrastructure software (OpenStack, Kubernetes distributions, Ceph, Prometheus, etc.). These systems power clusters ranging from a single rack to fleets of 10,000+ physical or virtual nodes; a support program must scale from troubleshooting single-node failures to coordinating multi-region incident response. A practical support organization balances community-driven patching and commercial SLAs so that customers get predictable uptime without forfeiting the benefits of open-source agility.
Key public resources include the Open Infrastructure Foundation (website: https://www.openinfra.dev) and project pages such as https://www.openstack.org. Production operators typically track three domains: incident management (P0–P3), platform maintenance (patching, upgrades, backports) and customer enablement (training, runbooks, professional services). In mature teams these domains are split across NOC/SRE/support engineering but integrated through shared tooling and escalation rules.
Support models, tiers and commercial options
There are three common commercial support models: self-service/community (no commercial SLA), subscription/support contracts, and fully managed services. Subscription contracts typically start in the low five figures per year for small deployments (for example, $12,000–$30,000/year) and scale to six-figure enterprise agreements once 24×7 phone access, on-site support windows and custom engineering are included. Managed services are priced monthly and commonly range from $10,000/month for small clusters to $100,000+/month for multi-site managed platforms; per-node support contracts (common for hardware-plus-software bundles) often run $50–$1,000 per node per year depending on scope.
Choosing a model requires mapping customer risk tolerance to vendor responsiveness (SLA) and engineering involvement (code backporting, custom integrations). In many enterprise deals the vendor commits to backporting critical fixes for a defined lifecycle (commonly 2–5 years), provides security bulletins, and guarantees upgrade windows. Contract clauses should specify change freezes, maintenance windows (e.g., weekly 02:00–06:00 local time), and rollback commitments to reduce customer downtime risk.
Common support tiers and example commitments:
- Bronze / Community: Email/Forum support, expected response 48–72 hours, no SLA.
- Silver / Standard: Ticketed support, business-hours response ≤8 hours, uplift engineering, typical price range $12k–$50k/year.
- Gold / Enterprise: 24×7 phone & pager, P1 response ≤15 minutes, MTTR targets, dedicated TAM, typical price $50k–$500k+/year.
- Managed Service: Full operation, monitoring, upgrades, and incident remediation, priced monthly (example $10k–$100k+/month).
SLAs, SLOs, error budgets and KPIs
Translate business goals into measurable SLOs. Common uptime targets and their maximum monthly downtime (assuming a 30-day month) are: 99% → ~7.2 hours/month, 99.9% → ~43.2 minutes/month, 99.95% → ~21.6 minutes/month, 99.99% → ~4.32 minutes/month. These numbers should be stated explicitly in contracts and reflected in monitoring and alerting thresholds. Equally important are MTTR and response-time objectives: industry best practice for mission-critical P1 incidents is initial acknowledgement within 15 minutes and a target MTTR under 60 minutes; P2 incidents commonly target MTTR under 4 hours and P3 under 24–72 hours.
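The availability-to-downtime conversion above is simple arithmetic; a minimal sketch, assuming a 30-day (43,200-minute) month — the figures shift slightly if an average-length month is used instead:

```python
def allowed_downtime_minutes(slo_percent: float, month_days: int = 30) -> float:
    """Maximum downtime per month (in minutes) permitted by an availability SLO."""
    total_minutes = month_days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)

# Reproduce the contract table: 99% -> 7.2 h, 99.9% -> 43.2 min,
# 99.95% -> 21.6 min, 99.99% -> 4.32 min
for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes/month")
```

Embedding the same calculation in monitoring configuration keeps contractual numbers and alert thresholds from drifting apart.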
Key KPIs to report monthly include: uptime %, number of incidents by priority, average MTTR per priority, change failure rate (% of changes causing an incident), and restore time percentile (P50, P90, P99). Maintain an error budget (for example, 0.05% downtime allowance per month for a 99.95% SLO) and a governance process that ties change velocity to remaining error budget: if the budget is exhausted, enforce a stricter change freeze until stability is restored.
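The error-budget governance rule — tying change velocity to remaining budget — can be expressed as a small policy function. This is a sketch; the 25% warning threshold and the policy labels are illustrative assumptions, not figures from the text:

```python
def error_budget_minutes(slo_percent: float, month_days: int = 30) -> float:
    """Total monthly downtime allowance implied by the SLO (30-day month)."""
    return month_days * 24 * 60 * (1 - slo_percent / 100)

def change_policy(slo_percent: float, downtime_so_far_min: float) -> str:
    """Gate change velocity on remaining error budget.
    The 25% warning threshold below is an illustrative assumption."""
    budget = error_budget_minutes(slo_percent)
    remaining = budget - downtime_so_far_min
    if remaining <= 0:
        return "change freeze"          # budget exhausted: stability work only
    if remaining < 0.25 * budget:
        return "critical changes only"  # budget nearly spent: slow down
    return "normal change velocity"
```

For a 99.95% SLO (21.6-minute budget), 25 minutes of downtime already spent triggers a freeze, while 1 minute leaves normal change velocity in place.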
Incident response, tooling and playbooks
Effective response requires three integrated components: detection (metrics, logs, traces), communication (paging, incident channels), and execution (runbooks and escalation). Commonly used toolsets include Prometheus + Alertmanager for SLO-based alerts, Grafana for dashboards, ELK/OpenSearch for logs, and PagerDuty/SignalFx for on-call routing. ChatOps (Slack/Microsoft Teams with bot integrations) accelerates shared situational awareness and automated remediation actions. Implement automated remediation for the most common faults (service restart, autoscaling, disk cleanup) but require human approval for higher-risk actions.
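The split between unattended remediation and human-gated actions can be sketched as a dispatcher. The action names and risk classes here are hypothetical placeholders, not from any specific tool:

```python
# Hypothetical risk classification: low-risk actions run unattended,
# high-risk actions require explicit human approval first.
LOW_RISK_ACTIONS = {"restart_service", "disk_cleanup", "scale_out"}
HIGH_RISK_ACTIONS = {"failover_region", "rebuild_node"}

def dispatch(action: str, human_approved: bool = False) -> str:
    """Route a remediation action according to its risk class."""
    if action in LOW_RISK_ACTIONS:
        return "executed"                      # safe to automate fully
    if action in HIGH_RISK_ACTIONS:
        return "executed" if human_approved else "pending_approval"
    return "unknown_action"                    # never run unclassified actions
```

Refusing to run unclassified actions is the key design choice: automation should fail closed, not guess.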
Runbooks should be version-controlled and short (1–4 pages per common incident), with clear detection criteria, step-by-step diagnostics with commands and expected outputs, rollback steps and post-incident verification. A typical escalation timeline in mature teams: 0 minutes — automated alert fires; 0–15 minutes — on-call acknowledgement; 15–60 minutes — senior SRE engaged; 60–240 minutes — cross-functional war room with engineering, product and account teams. Conduct tabletop exercises quarterly and full-scale incident drills annually; after every P1 incident, produce a blameless postmortem within 72 hours and a remediation plan with owners and deadlines.
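The escalation timeline can be encoded so paging tooling engages the right people automatically. A minimal sketch, with roles and thresholds taken from the timeline described above:

```python
def required_escalation(minutes_since_alert: int) -> str:
    """Who must be engaged, given elapsed time since the automated alert fired."""
    if minutes_since_alert < 15:
        return "on-call engineer (acknowledgement due)"
    if minutes_since_alert < 60:
        return "senior SRE"
    return "cross-functional war room (engineering, product, account teams)"
```

A real paging policy would also track whether each stage actually responded; this sketch only answers "who should be involved by now?".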
Community integration and vendor collaboration
Open infrastructure customer service differs from closed-source in that many fixes flow through upstream projects. Establish an upstream engagement policy: identify which bugs must be fixed upstream vs backported internally, assign engineers to be active contributors (e.g., at least 10–20% of an SRE’s time), and maintain a tracked backlog of upstream patches and backports. This reduces long-term maintenance cost and ensures security patches are available to all customers. Vendor agreements should clarify responsibilities for upstream PR submission, testing on customer branches, and timelines for backports (a common SLA is 30–90 days for security-critical backports).
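A tracked backport backlog can enforce the contractual window programmatically. A minimal sketch: the 90-day default mirrors the upper bound quoted above, and the bug identifiers are placeholders:

```python
from datetime import date, timedelta

def backport_deadline(reported_on: date, sla_days: int = 90) -> date:
    """Contractual deadline for a security-critical backport."""
    return reported_on + timedelta(days=sla_days)

def overdue_backports(backlog: dict[str, date], today: date,
                      sla_days: int = 90) -> list[str]:
    """Return bug IDs whose backport window has lapsed."""
    return [bug for bug, reported in backlog.items()
            if today > backport_deadline(reported, sla_days)]
```

Running such a check in CI or a weekly report surfaces SLA breaches before the customer does.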
For procurement and vendor selection, verify references and ask for concrete metrics: number of production deployments, average MTTR (provide historical values), compliance certifications (SOC2, ISO27001), and published SLAs. Use the foundation and community resources (openinfra.dev, openstack.org) to validate vendor claims and to find interoperability testing programs and certification matrices.
Practical checklist for implementation
- Define SLOs and calculate downtimes (e.g., 99.95% → 21.6 minutes/month) and publish them to customers.
- Set incident targets: P1 ack ≤15 minutes, MTTR targets P1 <60m / P2 <4h / P3 <24–72h.
- Maintain a small set of version-controlled runbooks (1–4 pages each) and test them monthly via drills.
- Staffing & rotations: on-call rotations of 1–2 weeks, limit on-call hours to ≤40 per engineer per week, and provide 16+ hours of training per engineer per year.
- Monitoring & tooling: Prometheus+Grafana, ELK/OpenSearch, PagerDuty; automate remediations for the top 10 incident types.
- Contracts: specify maintenance windows, backport timelines (30–90 days for critical fixes), and explicit rollback commitments.
- Reporting: monthly KPI package (uptime %, incident counts, MTTR, change failure rate, remaining error budget).
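The monthly KPI package reduces to a few aggregate calculations. A minimal sketch using nearest-rank percentiles; the input values are illustrative:

```python
import math
import statistics

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; input need not be pre-sorted."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

def monthly_kpis(restore_minutes: list[float],
                 changes: int, failed_changes: int) -> dict:
    """Aggregate the restore-time and change-failure KPIs for one month."""
    return {
        "mttr_avg_min": statistics.mean(restore_minutes),
        "restore_p50": percentile(restore_minutes, 50),
        "restore_p90": percentile(restore_minutes, 90),
        "restore_p99": percentile(restore_minutes, 99),
        "change_failure_rate": failed_changes / changes,
    }
```

Percentiles (P90/P99) matter more than averages here: a single slow restoration can hide behind a healthy-looking mean.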
Open infra customer service is a disciplined blend of software engineering, operations and customer-facing processes. By codifying SLAs, investing in automation, maintaining upstream relationships and operating with a metrics-driven culture, organizations can deliver predictable, high-quality support for complex open-source platforms. For community information and governance resources see https://www.openinfra.dev and project sites such as https://www.openstack.org.
How do I open a customer service call?
Examples of How to Start a Call
- “Good morning!”
- “Hi [Customer’s Name], thank you for calling [Company Name].”
- “Hello [Customer’s Name], it’s great to speak with you again.”
- “Good afternoon, [Customer’s Name].”
- “Hi [Customer’s Name], I understand you’re having an issue with [specific product/service].”
What is OpenInfra?
The Open Infrastructure Foundation (OpenInfra) is a non-profit organization that provides a neutral, open environment for organizations, developers, and users to build open source infrastructure software together.