Natural-Language Interfaces for the Software You Own

Turn plain English into governed actions across your existing APIs and apps. Learn how NL-to-use boosts reliability, cuts integration cost, and scales safely.
  • NL-to-use ≠ new apps: You ask for an outcome in plain English; the AI invokes your existing APIs, tools, and repos under guardrails to deliver a finished, verifiable result.
  • Reliability you can track: Every step is logged and testable; leaders watch ECR (Execution Completion Rate) and TPR (Task Pass Rate) to manage quality and cost.
  • Why now: Common contracts (OpenAPI/JSON Schema), structured tool-calling, execution-based benchmarks, and better observability make this production-ready.
  • Business impact: Faster cycles, fewer bespoke integrations, higher throughput—especially for less-experienced users in dev, ops, data, support, and marketing.
1. Introduction

    Imagine saying, “Extract tables from these PDFs and load them into our dashboard,” or “Clean this repo and generate a styled image,” and having it done by the systems you already use—no new app or brittle glue code. That’s natural-language-to-use (NL-to-use): you state a goal in plain language, and an AI translates it into governed, auditable calls to your existing APIs, apps, and repositories. The output isn’t “some code you still have to wire up”—it’s a finished, verifiable outcome.

    Leaders care because outcomes become consistent and measurable. Two simple, board-friendly metrics make progress visible: ECR (did the run finish?) and TPR (did it pass predefined checks?). When these rise—and cost per successful run falls—you’re compounding ROI, not just shipping a demo.

    2. Background

    2.1 What NL-to-use means in practice

To make the idea concrete, here are the four building blocks that show up in every successful deployment (a short code sketch follows the list):

    • Intent capture: First, disambiguate the request and collect required inputs. Modern model APIs support typed, schema-validated tool calls so arguments are precise and checkable.
    • Grounding: Next, map the request to documented capabilities—APIs, CLIs, apps, repos—using machine-readable contracts like OpenAPI and JSON Schema.
    • Execution: Then orchestrate real calls in real environments, handling credentials, dependencies, and side effects. This goes beyond chat; it’s reliable operation of live systems.
    • Verification & provenance: Finally, confirm success against predefined checks and emit an auditable trace of calls, IO, and decisions (e.g., via GenAI spans).
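To make the tool-contract idea concrete, here is a minimal sketch of a typed tool definition and an argument check. The tool name, its fields, and the use of the jsonschema library are illustrative assumptions, not any specific vendor's API.

    # Illustrative tool contract: the JSON Schema the model must satisfy
    # whenever it proposes this call (name and fields are hypothetical).
    import jsonschema

    EXTRACT_TABLES_TOOL = {
        "name": "extract_tables",
        "description": "Extract tables from PDFs and load them into a dashboard dataset.",
        "parameters": {
            "type": "object",
            "properties": {
                "pdf_urls": {"type": "array", "items": {"type": "string"}},
                "dataset_id": {"type": "string"},
                "dry_run": {"type": "boolean", "default": True},
            },
            "required": ["pdf_urls", "dataset_id"],
            "additionalProperties": False,
        },
    }

    def validate_tool_arguments(arguments: dict) -> dict:
        # Reject malformed arguments before any live system is touched.
        jsonschema.validate(instance=arguments, schema=EXTRACT_TABLES_TOOL["parameters"])
        return arguments

Because the model's proposed arguments are checked against the same schema it was shown, intent capture and verification share one source of truth.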

    2.2 How it differs from adjacent ideas

    Because terms get conflated, it helps to separate NL-to-use from neighboring approaches:

    • Not NL-to-code: Codegen emits new code; NL-to-use primarily invokes existing capabilities and treats any emitted code as temporary scaffolding.
    • Not just chatbots: The focus is executing actions under explicit contracts with measurable outcomes, not conversation alone.
    • Not classic RPA: RPA replays screens and clicks; NL-to-use prefers typed interfaces with verifiable results. GUI control is a last-resort tool—still under guardrails.

    2.3 Why now

    Momentum is real because several foundations have clicked into place:

    • Common contracts: JSON Schema and OpenAPI 3.1 make API surfaces portable and typed; model platforms now support structured function/tool calling to reduce ambiguity.
    • Interop protocols: MCP standardizes how agents connect to tools/data; A2A standardizes how agents discover and collaborate via simple “agent cards.”
    • Execution-based evaluation: Benchmarks like WebArena, OSWorld, and GitTaskBench test whether tasks complete and pass checks—moving from “sounds good” to “works.”
    • Observability & governance: OpenTelemetry GenAI spans, supply-chain attestations, sandboxing, and dependency pinning make execution safer in production.
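As a rough illustration of the observability point, the sketch below records one tool invocation as a trace span. The orchestrator function is hypothetical, and the gen_ai.* attribute names loosely follow the OpenTelemetry GenAI semantic conventions; treat them as placeholders rather than a definitive mapping.

    from opentelemetry import trace

    tracer = trace.get_tracer("nl-to-use-orchestrator")

    def call_tool_with_trace(tool_name: str, arguments: dict, tool_fn):
        # One span per tool call yields an auditable record of what ran and with what inputs.
        with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
            span.set_attribute("gen_ai.operation.name", "execute_tool")  # placeholder attribute names
            span.set_attribute("gen_ai.tool.name", tool_name)
            span.set_attribute("app.tool.argument_count", len(arguments))
            result = tool_fn(**arguments)
            span.set_attribute("app.tool.succeeded", result is not None)
            return result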

    2.4 A recent development: “Agentized” repositories

    One practical pattern turns GitHub repositories into interactive agents that you can call in natural language:

    • Setup: Read docs, plan a structured TODO, install dependencies, fetch models/data, and prepare validation samples.
    • Use: Create a repo-specific agent that executes tasks from plain English, retrying and revising plans on errors.
    • Collaborate: Publish “agent cards” so repo-agents can chain capabilities through an A2A protocol.
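For a feel of what such a card might contain, here is an illustrative sketch as a plain Python dict. The field names are assumptions in the spirit of A2A agent cards; the actual specification's schema may differ.

    # Hypothetical agent card for a repo-specific agent (all fields illustrative).
    REPO_AGENT_CARD = {
        "name": "pdf-table-extractor",
        "description": "Extracts tables from PDFs using the tools in this repository.",
        "url": "https://agents.example.com/pdf-table-extractor",  # hypothetical endpoint
        "version": "0.1.0",
        "skills": [
            {
                "id": "extract_tables",
                "description": "Return structured tables from one or more PDF files.",
                "input_modes": ["application/pdf"],
                "output_modes": ["application/json"],
            }
        ],
    }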

    Why this matters: it showcases the shift from “AI that writes code” to “AI that uses your code and tools”—with validation-first setup and reliability you can measure (ECR/TPR).

    3. Business Applications

    Organizations that layer NL-to-use onto systems they already license report faster cycles and fewer bespoke integrations. Below are illustrative domains to make the value tangible:

• Developer acceleration: In IDEs and platforms, assistants handle authoring, refactoring, reviews, and repo Q&A—shortening completion times and increasing throughput, especially for junior developers.
    • IT/DevOps & service operations: Summaries, grounded answers, and guided remediation reduce handle time and escalations; AIOps flows improve MTTR while keeping actions auditable.
    • Data & document workflows: Natural-language over contracts/emails compresses triage and review; governed connectors keep sensitive data in-platform.
    • Customer operations: Assist agents and deflect chats using CRM/KB context; large field studies show higher issues-resolved-per-hour, with outsized gains for novices.
    • Analytics & governed data access: NL query/notebook assistance within data platforms saves hundreds of hours without punching holes in governance.
    • Creative & marketing production: Asset generation and adaptation shift from days to hours, increasing variants while lowering external spend.

    To run this like an operation—not a demo—track a small set of metrics:

    • ECR: Percent of runs that complete without tool/model errors.
    • TPR: Percent of runs that pass predefined checks.
    • Latency & p95 runtime: Impact on time-to-resolution and SLAs.
    • Cost-of-pass: Spend per successful run—critical for multi-step workflows.
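As a minimal sketch of how these numbers can be computed from run logs, assuming each run record carries completed, passed_checks, latency_s, and cost_usd fields (the field names are illustrative):

    from statistics import quantiles

    def scorecard(runs: list[dict]) -> dict:
        # Summarize a batch of runs into the four metrics above.
        passed = [r for r in runs if r["passed_checks"]]
        latencies = sorted(r["latency_s"] for r in runs)
        return {
            "ECR": sum(r["completed"] for r in runs) / len(runs),   # completion rate
            "TPR": len(passed) / len(runs),                         # pass rate on predefined checks
            "p95_latency_s": quantiles(latencies, n=20)[-1],        # 95th-percentile runtime
            "cost_of_pass_usd": sum(r["cost_usd"] for r in runs) / max(len(passed), 1),
        }

Computing cost-of-pass as total spend divided by passing runs keeps retries and failed attempts visible in the unit economics.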

    4. Future Implications

    The next 12–36 months will favor teams that standardize skills and make success observable:

    • Standard skills & internal marketplaces: Expect shared skill schemas (“agent cards”) and A2A to become vendor-neutral interfaces; enterprises will curate marketplaces with versioning, SLAs, and provenance.
    • Verification as table stakes: Validation datasets, contract tests, and end-to-end auditing will be required, with benchmarks expanding to multi-repo, long-horizon tasks that also track cost/latency.
    • Hardened execution & governance: Micro-VM sandboxing by default, dependency pinning, supply-chain attestations, and least-privilege access will be baseline expectations.
    • Measurable operations: ECR, TPR, p95, and cost/run will drive budgeting and SLOs—managed via SRE-style error budgets.
    • Evolving roles: Developers curate skills and validations (instead of writing glue code); Ops runs the auditable control plane; business owners set outcomes, budgets, and guardrails.

Leaders should also keep a few questions top of mind:

    • Standards & trust: How will A2A, MCP, and platform manifests converge—and what verification tiers/SLAs will your marketplace require?
    • Risk & accountability: When workflows compose multiple skills/agents, who owns the outcome—and how does provenance resolve incidents?
    • Long-horizon reliability & cost: As tasks span more steps/systems, how will you keep TPR high and cost-of-pass low without constant human oversight?

