The AI Customer Service Vendor RFP Template: 40 Questions and a Scoring Rubric

The honest diagnosis

Three reasons your AI RFP is probably borrowed from the wrong template.

The standard helpdesk RFP was written for a world where the software was a passive container for human work. The new generation of AI products does the work. The RFP did not catch up.

Failure mode one: the helpdesk RFP with an "AI" column added. Many teams take their 2019-era Zendesk RFP, paste a new column at the end of the spreadsheet titled "AI capabilities," and ask vendors to fill it in. The result is comparison on macro libraries, custom field counts, and SLA timer flexibility, dimensions that simply do not predict whether the AI will fabricate a refund policy. The Air Canada chatbot incident would have passed any traditional helpdesk RFP. The product worked. It just lied to customers.

Failure mode two: trusting vendor-defined metrics. "Containment rate," "deflection rate," "self-service rate," "automation rate." Every vendor uses one of these terms, every vendor defines it differently, and no two of them mean what a CFO thinks they mean. A 95% containment rate sounds great until you find that 40% of those "contained" conversations ended with the customer never coming back. Containment is a measurement of customers leaving the system, not customers being helped. An RFP that lets the vendor pick the metric is an RFP designed to be gamed.

Failure mode three: no disqualifier discipline. Scoring everything out of five and averaging at the end gives every vendor a chance to ride a strong demo into a shortlist position despite a fatal flaw on a single dimension. A vendor with no SOC 2 Type II and no published pre-launch accuracy threshold should not survive to round two regardless of how good the conversational demo looks. The rubric below uses instant-disqualifiers exactly to prevent demo polish from masking architectural absence.

Three published incidents in the last two years tell the story. Air Canada was held legally bound by a refund policy its chatbot invented. DPD's customer service bot wrote insulting poetry about the brand. A Chevrolet dealership's bot agreed to sell a $76,000 Tahoe for one dollar and confirmed the deal as "legally binding, no takesies-backsies." Every one of the products behind these incidents could have passed a standard RFP. None of them would have survived the four questions in this rubric that specifically probe for the failure modes that produced those incidents.

The six dimensions, defined before the comparison

What to score, and how much each dimension is worth, before you talk to a vendor.

Weights are published below with their rationale so the scoring is auditable. Adjust for your context (regulated industries should weight security higher; bootstrapped teams running a 30-day pilot should weight implementation higher). What matters is that the weights are set before you see any vendor response, not after.

Dimension	Weight	Questions
Security & Data Handling	15%	7
Accuracy & SLA Commitments	25%	10
Workflow Fit & Escalation Design	20%	8
Implementation Readiness	15%	6
Support Quality	10%	4
Pricing & Commercial Terms	15%	5
Total	100%	40

Why accuracy is the heaviest dimension. Security is mostly binary in practice. A vendor either has SOC 2 Type II, a sub-processor list, a DPA, and tenant isolation, or they fail the gate. Accuracy is where serious vendors actually differ. The published pre-launch threshold, the QA AI architecture, the deflection-vs-resolution distinction, the typed tool layer, the audit trail granularity, these are where production safety lives, and they are precisely the dimensions a slick demo does not test. Weighting accuracy at 25% forces the buyer to spend the largest share of the diligence budget where the actual variance is.

Why workflow fit is next. Even an accurate AI is operationally useless if it cannot reach into the systems where the work actually happens (Shopify, Recharge, Loop, Klaviyo, ShipStation, your custom OMS) or if its escalation handoff drops context. Mid-market buyers in our 2026 evaluation cohort cited handoff fidelity and multi-system integration depth more often than any other operational concern.

Why pricing is weighted modestly. Not because price is unimportant, but because pricing dispersion across serious AI customer service vendors is narrower than buyers expect. The dispersion that matters most is on total cost of ownership (overages, integration fees, professional services, sub-processor passthrough), not headline list price. The five pricing questions below are designed to flush out the hidden costs rather than to compete on $/conversation.

The rubric

The 40 questions, organized by dimension. Ask everyone the same things.

Four questions (Q01, Q08, Q18, and Q26, marked "anchored") carry full 0 to 3 scoring anchors and are worked end-to-end so you can calibrate your own scoring. The remaining 36 list the question and a one-line note on what a strong answer looks like. The editable version of the full rubric (every question with all four anchors) is available on request via email at the bottom of this page.

Security & Data Handling (15% weight, 7 questions)

Mostly binary in practice. The vendor either has the certifications, contracts, and isolation guarantees, or they fail the gate. Score conservatively. Half-answers ("SOC 2 is in progress") are not 2s, they are 1s with a path.

Q01 (anchored): What certifications, audit reports, and data-protection contracts can you provide today, dated within the last 12 months?

Score 0 (instant DQ): No SOC 2 Type II or equivalent. No GDPR DPA available for execution. No security questionnaire response on file.
Score 1: SOC 2 Type I only (not Type II), or a SOC 2 Type II report older than the 12-month window with no public commitment to the next audit cycle.
Score 2: Current SOC 2 Type II, GDPR DPA, sub-processor list maintained. ISO 27001 absent. HIPAA available on request.
Score 3: Current SOC 2 Type II, ISO 27001, executable DPA covering GDPR and UK-GDPR, maintained sub-processor list with email notification on changes, HIPAA-compliant tenancy available, penetration test summary dated within the last 12 months.

Q02. Where do customer data and conversation logs physically reside, and can we select region? Strong answer: region selection (US, EU, APAC) at workspace-creation time, named cloud regions, documented retention policy in days.
Q03. Provide your current sub-processor list and the change-notification policy. Strong answer: public sub-processor page, 30-day advance notification on additions, opt-out path for material changes.
Q04. How is PII redacted or tokenized before content is sent to the underlying LLM provider? Strong answer: documented PII detection layer, tokenization or masking before model calls, audit log of what was redacted per response.
Q05. What tenant isolation guarantees apply between our workspace and other customers'? Strong answer: row-level isolation in shared infrastructure with cryptographic separation, optional single-tenant deployment for regulated industries.
Q06. What is the data-export and termination-assistance process if we leave? Strong answer: full conversation history, audit logs, knowledge base, configuration, exportable in standard JSON or CSV within 30 days of termination.
Q07. When was your last third-party penetration test, who ran it, and what was the summary finding? Strong answer: named third-party firm, within last 12 months, public executive summary, remediation status on any high-severity findings.

Accuracy & SLA Commitments (25% weight, 10 questions)

The dimension where demos lie the most. Force every claim to a specific number, a published methodology, or a contractual remedy. Vague answers ("our AI is highly accurate") score 0 or 1, never higher.

Q08 (anchored): What is your published pre-launch accuracy threshold, on the customer's own historical tickets, before the AI is permitted to send messages to live customers?

Score 0 (instant DQ): No published threshold. "We test thoroughly" or "our AI scores well on benchmarks." No willingness to define a per-customer cutoff.
Score 1: An aggregate accuracy number is cited (e.g., "92% on industry benchmarks") but no per-customer test methodology, no go-live gate.
Score 2: Pre-launch eval is run on the customer's tickets, accuracy is reported, but no published threshold below which the AI is blocked from going live. The decision to go live is judgment-based.
Score 3: Published threshold (e.g., 95% to 99%) on the customer's own historical tickets, methodology is disclosed (sample size, rubric, who scores, calibration), and the AI is contractually blocked from production traffic until the threshold is met.
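
The go-live gate in the Score 3 anchor can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `EvalResult` shape, the function names, the 0.95 default threshold, and the 100-ticket minimum sample are all assumptions chosen to match the ranges mentioned in Q08 and Q09.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One historical ticket scored during the pre-launch eval (hypothetical shape)."""
    ticket_id: str
    intent: str
    passed: bool  # a reviewer judged the AI response acceptable


def accuracy(results: list[EvalResult]) -> float:
    """Share of scored tickets that passed, as a fraction between 0 and 1."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(r.passed for r in results) / len(results)


def go_live_permitted(results: list[EvalResult],
                      threshold: float = 0.95,
                      min_sample: int = 100) -> bool:
    """Block production traffic unless the sample is large enough
    and accuracy on the customer's own tickets clears the threshold."""
    if len(results) < min_sample:
        return False
    return accuracy(results) >= threshold


# Illustrative data only: 200 tickets, of which the first 8 failed review.
results = [EvalResult(f"T{i}", "refund", passed=(i >= 8)) for i in range(200)]
print(accuracy(results))
print(go_live_permitted(results))
print(go_live_permitted(results, threshold=0.97))
```

The point of the sketch is that the gate is a hard boolean, not a judgment call: a vendor at Score 2 computes the same accuracy figure but has no equivalent of `go_live_permitted` standing between the eval and live traffic.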

Q09. Will you run your pre-launch eval on 100 to 1,000 of our actual historical tickets before contract signature? Strong answer: yes, this is the standard pilot. The vendor produces per-response scoring with reasoning. A refusal is an instant disqualifier.
Q10. Walk us through your pre-launch eval methodology in detail. Strong answer: stratified sample by intent, named scoring dimensions (factual accuracy, policy compliance, brand voice, tool-call correctness, escalation appropriateness), two-stage labeling (LLM judge + human SME), holdout discipline.
Q11. How is hallucination defined and measured in production, not just in evals? Strong answer: any response with a factual claim not traceable to retrieved source content, regardless of whether the claim happens to be correct. Measured continuously on a live sample, not a one-time number.
Q12. Describe your QA layer. Is there a second-pass review per response, and what does it see? Strong answer: per-response QA AI, ideally running on a different model from the response model, seeing the conversation, retrieved sources, tool calls, and applicable policy. Approve / rewrite / escalate as discrete outputs, visible in product UI with reasoning logs.
Q13. Show us the audit trail for a single response in your product UI. Strong answer: customer message, sources retrieved, response drafted, QA reasoning, final response sent, tool calls, escalation trail, all visible per response, exportable via API.
Q14. What is your contractual resolution-rate guarantee, and what is the remedy if you miss it? Strong answer: a specific resolution percentage within a specific timeframe, with refund or termination rights as remedy, not just service credits.
Q15. Define deflection and resolution as separate metrics in your reporting. Strong answer: deflection is a customer not coming back; resolution is a customer's issue being confirmed solved. Vendors that conflate them are optimizing for the wrong outcome. Confusion or refusal to distinguish is an instant disqualifier.
Q16. How do you detect model drift over time, and what is your remediation process? Strong answer: continuous evaluation against a sealed holdout, alerts within hours when accuracy drops, rollback discipline on model-version changes.
Q17. Multi-model and bring-your-own-model: which providers are supported and how is routing decided? Strong answer: multiple frontier providers (e.g., Claude, GPT, Gemini), per-workspace selection, optional per-intent routing, QA AI running on a different model from the response AI by default.
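
To make the deflection-versus-resolution distinction in Q15 concrete, here is a hedged sketch of how the two metrics diverge on the same set of conversations. The `Conversation` record and its field names are hypothetical; real vendors will expose different reporting fields, which is exactly what Q15 asks them to document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    """Hypothetical reporting record for one AI-handled conversation."""
    conversation_id: str
    escalated_to_human: bool
    customer_returned: bool    # customer re-contacted about the same issue
    confirmed_resolved: bool   # customer explicitly confirmed the fix


def deflection_rate(conversations: list[Conversation]) -> float:
    """Share of conversations with no human handoff and no repeat contact."""
    deflected = [c for c in conversations
                 if not c.escalated_to_human and not c.customer_returned]
    return len(deflected) / len(conversations)


def resolution_rate(conversations: list[Conversation]) -> float:
    """Share of conversations where the customer confirmed the issue was solved."""
    resolved = [c for c in conversations if c.confirmed_resolved]
    return len(resolved) / len(conversations)


# Illustrative data only: 6 confirmed fixes, 3 customers who went silent,
# and 1 conversation handed to a human.
sample = (
    [Conversation(f"C{i}", False, False, True) for i in range(6)]
    + [Conversation(f"C{i}", False, False, False) for i in range(6, 9)]
    + [Conversation("C9", True, False, False)]
)
print(deflection_rate(sample))
print(resolution_rate(sample))
```

On this made-up sample the deflection rate is 90% while the resolution rate is 60%. The three silent customers count as successes under the first metric and as unknowns under the second, which is the gap the rubric asks vendors to report separately.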

Workflow Fit & Escalation Design (20% weight, 8 questions)

Where the AI actually has to reach into your business. A perfectly accurate AI that cannot issue a refund or escalate cleanly is operationally a research project. The architectural keyword here is "deterministic." Free-text paths to commercial actions are a class of failure, not a feature.

Q18 (anchored): When the AI proposes a transactional action (refund, cancellation, address change, exchange), is the action expressed as a typed tool call with policy-bounded parameters, or as free-text the system then parses?

Score 0 (instant DQ): Free-text output ("I'll refund you $76,000") parsed downstream. No typed tool boundary. This is the architecture that produced the $1 Tahoe incident.
Score 1: Structured output exists but parameters are not policy-bounded. The AI can propose any refund amount and the system relies on a human review step to catch out-of-bounds proposals.
Score 2: Typed tool calls with parameter validation, but the policy bounds are minimal (e.g., refund cannot exceed order total) and not customer-configurable.
Score 3: Typed tool calls with policy-bounded parameters configured per tenant, validated at the tool layer not the model layer. Out-of-bounds proposals are blocked architecturally and escalated with the reason recorded, rather than silently clamped or left for a human reviewer to catch after the fact.
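
A minimal sketch of what "typed tool call with policy-bounded parameters" can mean in practice, assuming a refund tool. Every name here (`RefundPolicy`, `RefundCall`, `ToolDecision`, `validate_refund`) and every bound is hypothetical; the sketch only illustrates the Score 3 pattern of validating at the tool layer and escalating instead of clamping.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RefundPolicy:
    """Per-tenant bounds (hypothetical fields)."""
    max_refund: Decimal
    max_days_since_delivery: int


@dataclass(frozen=True)
class RefundCall:
    """Typed tool call proposed by the model; no free text reaches payments."""
    order_id: str
    amount: Decimal
    order_total: Decimal
    days_since_delivery: int


@dataclass(frozen=True)
class ToolDecision:
    action: str  # "execute" or "escalate"
    reason: str


def validate_refund(call: RefundCall, policy: RefundPolicy) -> ToolDecision:
    """Validate at the tool layer. Out-of-bounds calls are escalated, never clamped."""
    if call.amount <= 0:
        return ToolDecision("escalate", "amount must be positive")
    if call.amount > call.order_total:
        return ToolDecision("escalate", "amount exceeds order total")
    if call.amount > policy.max_refund:
        return ToolDecision("escalate", "amount exceeds tenant refund limit")
    if call.days_since_delivery > policy.max_days_since_delivery:
        return ToolDecision("escalate", "outside refund window")
    return ToolDecision("execute", "within policy bounds")


policy = RefundPolicy(max_refund=Decimal("200.00"), max_days_since_delivery=30)
ok = validate_refund(RefundCall("A1", Decimal("49.99"), Decimal("49.99"), 5), policy)
bad = validate_refund(RefundCall("A2", Decimal("76000"), Decimal("76000"), 5), policy)
print(ok)
print(bad)
```

Because the check runs on structured fields outside the model, no prompt can talk the system past the limit. The model can only propose; the tool layer decides.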

Q19. List the channels supported natively (not via integration) and the maturity of each. Strong answer: email, chat, SMS, social DMs, WhatsApp at parity. Voice is acceptable as "in beta" with disclosure.
Q20. Provide the complete list of tool actions the AI can take, with the typed parameters for each. Strong answer: a documented list (refund, cancel, edit, address change, subscription pause, exchange, etc.), each with typed parameters and validation constraints.
Q21. Configure escalation triggers: what is configurable, and where in the product? Strong answer: low-confidence threshold (numeric), QA AI rejection, sensitive-intent detection, tool failure, customer requests human, all per-tenant tunable, all visible in product UI.
Q22. Walk us through the handoff to a human agent. What does the human see? Strong answer: full conversation, AI reasoning chain, sources retrieved, tool calls proposed, escalation reason, suggested next action. The human starts from context, not from zero.
Q23. How is brand voice configured, and how deeply can it be tuned? Strong answer: URL crawl plus manual tone-of-voice override, with per-channel and per-intent variation possible. Edits land within minutes, not after a retraining cycle.
Q24. Multi-brand or multi-store support: can one workspace run N brands with distinct configurations? Strong answer: yes, single account, separate brand voices, separate knowledge bases, central reporting. Critical for portfolio operators.
Q25. How does the AI ingest knowledge sources, and how quickly do edits propagate? Strong answer: URL crawl plus document upload plus structured KB import, with edits live within minutes for high-volume sources and within hours for re-indexed corpora.
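
The five escalation triggers named in Q21 can be pictured as a small per-tenant configuration plus a check that runs on every AI turn. The sketch below is an assumption-laden illustration: the class names, the 0.80 confidence default, and the example sensitive intents are invented for the example and do not describe any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationConfig:
    """Per-tenant trigger settings (hypothetical defaults)."""
    min_confidence: float = 0.80
    sensitive_intents: frozenset = frozenset({"chargeback", "legal_threat"})


@dataclass(frozen=True)
class TurnState:
    """What the system knows about one AI turn before sending it."""
    confidence: float
    qa_verdict: str  # "approve", "rewrite", or "escalate"
    intent: str
    tool_failed: bool
    customer_requested_human: bool


def escalation_reasons(turn: TurnState, config: EscalationConfig) -> list[str]:
    """Return every trigger that fired; an empty list means the AI may respond."""
    reasons = []
    if turn.confidence < config.min_confidence:
        reasons.append("low_confidence")
    if turn.qa_verdict == "escalate":
        reasons.append("qa_rejection")
    if turn.intent in config.sensitive_intents:
        reasons.append("sensitive_intent")
    if turn.tool_failed:
        reasons.append("tool_failure")
    if turn.customer_requested_human:
        reasons.append("customer_requested_human")
    return reasons


config = EscalationConfig()
routine = TurnState(0.95, "approve", "order_status", False, False)
risky = TurnState(0.55, "approve", "chargeback", False, True)
print(escalation_reasons(routine, config))
print(escalation_reasons(risky, config))
```

Returning the full list of reasons, not just a yes/no flag, is what lets the human agent in Q22 see why the conversation was handed over.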

Implementation Readiness (15% weight, 6 questions)

The dimension where vendors most often over-promise. "Two weeks to go-live" should mean two weeks to operational AI, not two weeks to the first sandbox login. Force the vendor to define which milestone the timeline measures.

Q26 (anchored): What is your published time-to-go-live, measured from contract signature to the AI handling its first live customer message at the agreed pre-launch accuracy threshold?

Score 0: No published timeline. "It depends on the customer." Or a timeline measured to first sandbox login rather than to live customer traffic.
Score 1: Stated timeline exceeds 60 business days for a standard mid-market deployment without regulated-industry constraints.
Score 2: Published timeline of 21 to 60 business days, including pre-launch eval, knowledge base ingestion, tool wiring, and first live traffic.
Score 3: Published timeline under 21 business days for a standard mid-market deployment, with a documented day-by-day onboarding plan and a stated remedy if the vendor misses the date.

Q27. Migration support: how are historical ticket data, macros, tags, users, and SLA configuration moved from the incumbent system? Strong answer: one-click connectors for the major incumbents (Zendesk, Gorgias, Intercom), automated migration of macros and tags, named timeline for the import.
Q28. Onboarding model: self-serve, guided, or fully managed? Strong answer: clarity on which tier we are in, named onboarding manager if guided, written onboarding plan with milestones.
Q29. Engineering hours required from our team during implementation. Strong answer: zero or near-zero for standard deployments. Engineering involvement only for custom integrations or single-sign-on.
Q30. Pilot or proof-of-value structure: is there a no-commitment trial on real traffic? Strong answer: a defined pilot window (typically 30 days), live customer traffic, exit-without-commitment if accuracy thresholds are not met.
Q31. Change management for our human agent team: training, rollout cadence, internal communication support. Strong answer: enablement materials, in-product training mode for agents, phased rollout patterns (collaborative-mode first, autonomous-mode second).

Support Quality (10% weight, 4 questions)

Most vendors look the same here on paper. The differentiator is whether they use their own product for their own support, which is the single most reliable signal of a serious AI vendor in this category.

Q32. What is your support response SLA by severity tier, in business hours, for the tier we are buying? Strong answer: published P1 / P2 / P3 SLAs, named coverage hours, escalation path to engineering for product defects.
Q33. Dedicated CSM or account team structure: who do we work with after onboarding? Strong answer: named CSM at the mid-market tier and above, quarterly business reviews, defined upgrade path.
Q34. Provide three customer references at our size and industry who went live within your stated timeline. Strong answer: three names provided within 48 hours of request, references actually return calls. Vendor refusal or unresponsive references is a strong negative signal.
Q35. Do you use your own product for your own customer service? Strong answer: yes, with a willingness to show their own deployment on the demo call. A vendor that does not eat its own dog food on a flagship product is sending the loudest possible negative signal.

Pricing & Commercial Terms (15% weight, 5 questions)

Score for transparency and alignment to outcome, not for lowest list price. Hidden overages and the gap between list price and total cost of ownership are the failures here, not whether the vendor charges $0.40 or $0.60 per resolution.

Q36. Pricing model: per-conversation, per-resolution, per-seat, or hybrid? How does cost scale with volume? Strong answer: pricing aligned to outcome (resolution beats conversation; conversation beats seat for AI products), transparent volume tiers, no per-message gotchas.
Q37. Provide a fully-loaded total cost of ownership estimate for our expected first-year volume. Strong answer: line items for platform fee, conversation or resolution volume, sub-processor passthrough (e.g., underlying LLM costs), professional services, integration fees, training. No hidden line items.
Q38. Contract term, ramp provisions, and exit terms. Strong answer: annual term standard with quarterly true-ups, no early-termination penalty if the resolution-rate guarantee is missed, data export within 30 days of termination.
Q39. Cancellation rights tied to the performance guarantee. Strong answer: yes. If the published resolution-rate guarantee is missed during the trial or ramp window, the buyer has unilateral termination and refund rights, not just service credits.
Q40. Price protection or rate-lock on multi-year commitments. Strong answer: rate-lock for the term of the contract, CPI-indexed renewals, no surprise multipliers at year two.

The hard nos

Eight instant disqualifiers, regardless of category scores.

A vendor can score 95 across all six dimensions and still be wrong for production if any of the patterns below appears. Treat these as gates, not as inputs to the weighted total.

DQ 01

No specific pre-launch accuracy threshold.

"We test thoroughly" or "our AI scores well on benchmarks" is not an answer. If the vendor cannot or will not publish a per-customer cutoff before live traffic, the architecture for it does not exist.

DQ 02

Refusal to run pre-launch eval on your historical tickets.

Any vendor unwilling or operationally unable to test on your actual data is selling demos. Production performance is not predictable from a demo on the vendor's curated examples.

DQ 03

QA layer exists "in principle" but has no visible product UI.

If you cannot click into a response and see the QA reasoning, the QA layer does not exist as a system. It exists as a feature on a slide.

DQ 04

Conflates deflection with resolution.

Vendors that cannot or will not separate "customer stopped messaging" from "customer's issue was confirmed solved" are optimizing for a metric that hides the worst failures. Walk.

DQ 05

AI emits free-text refund or commercial commitments.

If the architecture has a free-text path from model output to commercial action, the $1 Tahoe sale is a future incident, not a past one. Typed tool calls with policy bounds are non-negotiable.

DQ 06

No published resolution-rate guarantee with refund remedy.

Service credits are not a remedy at the scale where this matters. A vendor confident in their product offers termination rights and refunds when the guarantee misses.

DQ 07

Cannot produce three live customer references within 48 hours.

References at your size and industry, live on production within the vendor's stated timeline, who actually return calls. A vendor that stalls here is a vendor whose customer list is thinner than the website implies.

DQ 08

SOC 2 Type II "available on request" but never materializes after NDA.

The classic procurement bait-and-switch. If the executive summary does not arrive within a week of NDA signature, assume the report does not exist in current form.

The arithmetic

How to score, and what the totals mean.

Each question is scored 0 to 3 (0 = unacceptable or DQ, 1 = weak, 2 = competent, 3 = exemplary). A vendor's raw score per dimension is the sum of its question scores. The dimension's weighted contribution is (raw / max possible for that dimension) × weight. The vendor's total is the sum of the six weighted contributions; because the weights sum to 100, the total is already on a 100-point scale.

Worked example. A vendor scores 17/21 on Security (raw 17, max 21, weighted contribution = 17/21 × 15 = 12.1). They score 22/30 on Accuracy (22/30 × 25 = 18.3). They score 19/24 on Workflow Fit (19/24 × 20 = 15.8). They score 12/18 on Implementation (12/18 × 15 = 10.0). They score 9/12 on Support (9/12 × 10 = 7.5). They score 11/15 on Pricing (11/15 × 15 = 11.0). Total = 12.1 + 18.3 + 15.8 + 10.0 + 7.5 + 11.0 = 74.7 out of 100 using the rounded figures (74.8 if you sum before rounding). Either way the vendor lands below the strong-shortlist threshold; revisit with conditions before signing.
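
The same arithmetic as a short script, for anyone who prefers to check it in code before building the spreadsheet. The weights and question counts come from the dimension table; the dictionary keys and the function name `weighted_total` are illustrative. Note that summing the unrounded contributions gives roughly 74.8, slightly above the 74.7 obtained by adding the rounded per-dimension figures.

```python
# dimension -> (weight in points, number of questions), taken from the table
DIMENSIONS = {
    "security": (15, 7),
    "accuracy": (25, 10),
    "workflow": (20, 8),
    "implementation": (15, 6),
    "support": (10, 4),
    "pricing": (15, 5),
}
MAX_PER_QUESTION = 3


def weighted_total(raw_scores: dict[str, int]) -> float:
    """Sum of (raw / max possible) x weight across all six dimensions."""
    total = 0.0
    for name, (weight, n_questions) in DIMENSIONS.items():
        max_raw = n_questions * MAX_PER_QUESTION
        raw = raw_scores[name]
        if not 0 <= raw <= max_raw:
            raise ValueError(f"{name}: raw score {raw} outside 0..{max_raw}")
        total += raw / max_raw * weight
    return total


# Raw scores from the worked example
example = {
    "security": 17,
    "accuracy": 22,
    "workflow": 19,
    "implementation": 12,
    "support": 9,
    "pricing": 11,
}
print(round(weighted_total(example), 1))
```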

85 to 100

Strong shortlist

Architecturally serious, commercially aligned. Move to reference checks and a paid pilot.

70 to 84

Revisit with conditions

Address the specific dimensions that scored below 80% of max before signing. Use those as contractual commitments.

55 to 69

Conditional, narrow scope

Workable only if you can scope the deployment narrowly enough to avoid the dimensions where they score lowest.

Below 55

Walk

Or any vendor with an instant disqualifier, regardless of total. Below 55 means structural absence, not gaps that can be closed in negotiation.
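
The bands and the disqualifier gate combine into a single decision rule, sketched below. The function name and the handling of fractional totals (for example, 84.9 falls in "Revisit with conditions" because it has not reached 85) are assumptions made for the sketch; the band labels are the ones used above.

```python
def verdict(total: float, disqualifiers: list[str]) -> str:
    """Map a weighted total and any triggered instant disqualifiers to a band.

    Disqualifiers are checked first: they are gates, not inputs to the score.
    """
    if disqualifiers:
        return "Walk"
    if total >= 85:
        return "Strong shortlist"
    if total >= 70:
        return "Revisit with conditions"
    if total >= 55:
        return "Conditional, narrow scope"
    return "Walk"


print(verdict(74.8, []))          # the worked-example vendor
print(verdict(95.0, ["DQ 05"]))   # high score, but a free-text refund path
```

Checking the disqualifier list before looking at the total is the whole point: a 95 with DQ 05 attached returns the same answer as a 40.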

Trust but verify

How to verify the answers, not just collect them. What good actually looks like.

RFP responses are written by sales engineers, not by production teams. The scoring above only works if you verify the answers against artifacts. Three verification moves separate the diligent buyer from the buyer who gets surprised in month four.

One: ask for the product UI, not the deck. Every claim about the QA layer, the audit trail, the escalation triggers, the tool definitions, every one of these should be answerable by a live demo of the product UI, not by a slide. If the vendor cannot show the artifact in the product, the feature is on a roadmap and you should score it as a 1, not a 2 or 3.

Two: run the pre-launch eval as a contractual condition. The vendor's accuracy score on your historical tickets is the single most predictive metric in the entire evaluation. Make it a contractual condition (not a courtesy) that the AI must clear the published threshold on your tickets before the contract effective date. Vendors who balk at this are telling you the threshold is aspirational.

Three: structure reference calls around failure, not success. Standard reference questions ("how was your onboarding?") produce standard reference answers. The questions that produce signal are: "Tell me about a time the AI got something wrong in production. How did the vendor respond?" and "What surprised you in month three or month six that the sales process did not prepare you for?" A vendor's references will answer these honestly if they are real customers. A vendor's references will hedge or redirect if they were curated.

What this looks like in practice: across roughly 1,000 Richpanel customer onboardings, the production safety pattern that correlates most strongly with successful long-term deployment is not the highest demo accuracy score; it is the willingness of the buyer to insist on a pre-launch eval gate before live traffic. Buyers who skip that step often get what the vendor measured on its own examples, not what their own data would have shown.

For an architectural deep-dive on the accuracy dimension specifically, see the companion piece on the four-layer hallucination defense. That article goes further into the QA AI architecture, the typed tool boundary, and what each layer's failure modes look like in production.

How the weights were calibrated

Methodology, calibrated against production failure modes.

The weights and instant disqualifiers in this rubric are not opinion. They are the failure-mode pattern observed across roughly 1,000 Richpanel customer onboardings (2024 to 2026) plus a synthesis of 69 mid-market evaluation calls run in April and May 2026.

How the dimensions were chosen

Inventory of failure modes: Every escalation pattern, customer complaint, security incident, and post-mortem ticket from Richpanel onboardings between 2024 and 2026 was categorized. Six clusters emerged that mapped cleanly to evaluation dimensions: security or data handling, accuracy or SLA, workflow fit or escalation, implementation, support, and pricing or commercial terms.
Weights from failure frequency, not from feature checklist: Each dimension's weight reflects the share of customer-facing incidents that originated in that dimension. Accuracy and workflow-fit failures together account for roughly 45% of recorded incidents, which is why those two dimensions together carry 45% of the rubric weight. Security and implementation failures are lower-frequency but high-severity, hence smaller weights but several instant disqualifiers in those categories.
Cross-checked against demo cohort: The 69 mid-market evaluation calls between April and May 2026 surfaced the buyer-side concerns that come up most often in real RFPs. The eight instant disqualifiers map directly to the failure modes most often missed by traditional RFP templates as reported by those buyers. Verbatim sources for the buyer-language in this article (e.g., the "Gorgias AI just doesn't work" pattern, the "how does the second AI catch the first AI?" mid-market sophistication signal) come from that cohort.
What is not in this rubric on purpose: Aesthetics of the agent UI, marketing-page polish, conversational style on a curated demo, and analyst rankings are deliberately excluded. None of them predict production safety. Including them dilutes the score for buyers most likely to be misled by them.
Limits of the calibration: The weighting reflects the failure-mode distribution observed primarily in mid-market ecommerce, SaaS, fitness, and consumer-services categories. Buyers in heavily regulated industries (healthcare with PHI exposure, financial services with regulator-mandated controls, government) should expect to raise security weight to 25 to 30% and reduce the other weights, pricing first, so the total stays at 100%. The 40 questions remain valid; the weights should move.

The honest caveats

What this rubric does not catch, and what to do about it.

Even a rigorous scoring process has known blind spots. Naming them is part of using the rubric well.

Organizational culture of the vendor. No 40-question RFP detects whether the vendor's engineering team responds to a Sev-1 incident at 2am or sends a templated apology and a Tuesday update. Reference calls structured around failure (the third verification move above) are the only reliable proxy.
The quality of the deployment engineering team that touches your account. Vendors can have excellent architecture and a weak deployment bench. Insist on naming the implementation engineer assigned to your account during contract negotiation and treat unwillingness to do so as a soft disqualifier.
Model-provider risk over a multi-year horizon. The frontier model landscape moves quarterly. A vendor locked to a single provider whose pricing or capability profile shifts is exposed. Multi-model support (Q17) is the partial hedge; long-term, the buyer's contract should include a model-change notification clause and reset rights if the underlying capability materially changes.
Edge cases that have not yet occurred. The 15% to 30% pre-launch hallucination rate on ungrounded LLMs cited in the companion architecture piece is a measurement; it is not a prediction about which specific failures you will see. Mitigation lives in the four-layer architecture, not in the RFP.
Cultural fit with your CX team. The AI's tone, the rollout cadence, the change-management discipline, all of these matter and all of them resist scoring. The collaborative-mode-then-autonomous-mode rollout pattern documented across the April to May 2026 mid-market cohort is the operational answer to most of this; the RFP cannot substitute for actually running the pilot.

The honest framing of this rubric is not "score every vendor on these 40 questions and the winner is the right vendor." It is "use these 40 questions plus eight instant disqualifiers to narrow your shortlist from five vendors to two, then run a real pilot to choose between the final two." The pilot is the actual selection mechanism. The RFP is the pre-filter that keeps the pilot from being a waste.

Frequently asked

The rubric, in plain English.

Why is accuracy weighted higher than security in this rubric?

Security is binary in practice. A vendor either has SOC 2 Type II, a sub-processor list, and a DPA, or they don't. The seven security questions act as a gate. Accuracy is the dimension where serious vendors actually differ. The published pre-launch threshold, the QA AI architecture, the deflection-vs-resolution distinction, the typed tool layer, these are where production safety lives. That is why accuracy is weighted at 25%, the heaviest single dimension.

Is this rubric overkill for a brand with only three agents and 1,500 tickets a month?

No. The same failure modes apply. A small brand with a $1 Tahoe chatbot incident loses just as much trust as a large one. What changes is process load, not the questions. A small team can run the rubric in a single working day per vendor by scoring directly off published documentation plus a 60-minute call. A mid-market team runs it across two to three weeks with multiple stakeholders. The 40 questions and the weights do not change. The depth of verification does.

Did Richpanel score itself on this rubric?

Yes, and it is the honest version. Richpanel scores 3 on accuracy questions where the architecture is the product (pre-launch eval, QA AI, typed tools, full-context escalation), 2 on most security questions (SOC 2 Type II, GDPR DPA, tenant isolation, region selection, all present; a published penetration-test summary is in progress for 2026 Q3), 2 on support quality (we use our own product for our own CX, so reference questions are honest; named-CSM coverage starts at the mid-market tier), and 1 to 2 on certain implementation questions where some buyers want a fully-managed onboarding that Richpanel handles directly but does not yet publish as a tiered SKU. Detail on request.

Can I adjust the weights for my organization?

You should. The published weights (15% security, 25% accuracy, 20% workflow fit, 15% implementation, 10% support, 15% pricing) reflect the failure-mode frequency observed across roughly 1,000 Richpanel customer onboardings and a synthesis of 69 mid-market evaluation calls run between April and May 2026. A healthcare or fintech buyer should likely raise security from 15% to 25% and lower pricing to 10%, taking the remaining five points from another dimension so the weights still sum to 100%. A bootstrapped brand evaluating on a 30-day pilot should likely raise implementation from 15% to 20%, with a matching five-point reduction elsewhere. The rubric is intentionally documented so the weighting decisions are auditable.
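
One way to keep adjusted weights honest is to rescale the remaining dimensions proportionally whenever one is raised, so the total always stays at 100. This proportional approach is an assumption of the sketch, not the method used to calibrate the published weights; taking the points from one specific dimension such as pricing is an equally valid choice. The function name `reweight` is illustrative.

```python
PUBLISHED_WEIGHTS = {
    "security": 15.0,
    "accuracy": 25.0,
    "workflow": 20.0,
    "implementation": 15.0,
    "support": 10.0,
    "pricing": 15.0,
}


def reweight(weights: dict[str, float], dimension: str, new_weight: float) -> dict[str, float]:
    """Set one dimension's weight and scale the others proportionally
    so that the weights still sum to 100."""
    if dimension not in weights:
        raise KeyError(dimension)
    if not 0 <= new_weight <= 100:
        raise ValueError("new_weight must be between 0 and 100")
    others_total = sum(w for d, w in weights.items() if d != dimension)
    scale = (100 - new_weight) / others_total
    return {d: (new_weight if d == dimension else w * scale)
            for d, w in weights.items()}


# A regulated-industry buyer raising security from 15 to 25
regulated = reweight(PUBLISHED_WEIGHTS, "security", 25.0)
for name, weight in regulated.items():
    print(f"{name}: {weight:.1f}")
```

Whichever adjustment method you choose, fix the weights and record the rationale before opening any vendor response, as the rubric requires.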

What happens if no vendor clears the threshold?

That is a useful signal. It usually means one of three things: the market is not yet mature enough for your specific use case (heavily regulated industries, niche multi-system integrations), your shortlist was assembled on the wrong criteria (vendors recommended by analysts who are not buyer-aligned), or you need to scope the deployment more narrowly (start with a single intent or channel rather than the whole helpdesk). The right move when nothing clears the 85-point strong-shortlist band is to widen the shortlist or narrow the scope, not to lower the threshold.

Get the editable version

Run this on your own shortlist.

The complete rubric, with all four anchored scoring bands for every one of the 40 questions, a pre-populated Google Sheets scorer, and the eight instant disqualifiers wired as conditional formatting, is available on request. Email amit@richpanel.com with the subject line "RFP rubric" and the cohort of vendors you are evaluating, and we will send the sheet within one business day. No demo gate, no marketing nurture sequence. The rubric is more useful in the field than on a website.

If you would also like Richpanel scored honestly against the rubric (including the dimensions where we score 1 or 2), include that in your email. We score ourselves the same way we ask buyers to score every other vendor, by sending the spreadsheet back with our cells filled in and our reasoning per cell.

The standard customer service RFP rewards the wrong things. Here are 40 questions, six dimensions, and eight instant-disqualifiers that fix that.