
Scaling AI agents: lessons from GitHub's 8M weekly calls

Scaling AI agents requires solving critical security and context-overload challenges.

Eugene Vyborov
[Figure: scaling AI agents architecture diagram showing context optimization, server-side orchestration, and governance patterns derived from enterprise-scale deployments]

Scaling AI agents is the practice of moving autonomous AI systems from isolated prototypes to reliable, governed enterprise deployments that handle millions of operations weekly. GitHub's MCP server - processing roughly 8 million tool calls per week - reveals the critical patterns every operations leader needs to master.

When organizations begin scaling AI agents across their operations, they quickly collide with a hard reality - the gap between a neat local experiment and a reliable enterprise deployment is massive. Industry data from the development of GitHub's Model Context Protocol (MCP) server exposes the hidden complexities of moving AI from prototype to production.

For operational leaders caught between the sprawl of ungoverned Shadow AI and the sluggish pace of massive consulting projects, these technical scaling challenges hold vital strategic lessons. By analyzing how one of the world's largest developer platforms optimizes context windows, secures access tokens, and orchestrates complex system operations, businesses can build highly reliable sovereign AI agent systems that drive actual outcomes without introducing unacceptable risks.

Why scaling AI agents fails when you add more tools

There is a common misconception in enterprise AI that giving an agent access to every possible tool and system will make it more capable. The data proves the exact opposite.

When agents are loaded with too many tools, their reasoning degrades. They become forgetful, confused, and prone to hallucination. Early in GitHub's MCP journey, exposing agents to over 100 discrete tools for repositories, pull requests, actions, and projects rapidly blew out context windows and degraded the agent's core performance. This aligns with earlier industry research demonstrating that excessive context degrades LLM reasoning quality rather than enhancing it.

The solution requires strict constraint and curation. By trimming default configurations, focusing tools on general use cases, and clustering CRUD (Create, Read, Update, Delete) operations, context load can be reduced by roughly 49%. Furthermore, stripping unnecessary metadata from tool outputs - such as aggressively tailoring the data returned by a "list pull requests" command - can eliminate over 75% of output token consumption.
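
As a rough illustration of output trimming, the sketch below filters a raw "list pull requests" payload down to the handful of fields an agent actually reasons over. The field names and payload shape here are illustrative assumptions, not GitHub's actual API schema:

```python
# Minimal sketch of output trimming for a "list pull requests" tool.
# Field names and payload shape are illustrative assumptions.

ESSENTIAL_FIELDS = {"number", "title", "state", "author", "updated_at"}

def trim_pull_requests(raw_payload: list[dict]) -> list[dict]:
    """Keep only the fields an agent needs to reason about a PR,
    discarding URLs, node IDs, and other metadata that consume
    output tokens without aiding the decision."""
    return [
        {k: v for k, v in pr.items() if k in ESSENTIAL_FIELDS}
        for pr in raw_payload
    ]

raw = [{
    "number": 42, "title": "Fix login bug", "state": "open",
    "author": "octocat", "updated_at": "2024-05-01T12:00:00Z",
    "node_id": "PR_kwDOA", "avatar_url": "https://example.com/a.png",
    "diff_url": "https://example.com/42.diff", "labels": [],
}]
print(trim_pull_requests(raw))  # 5 fields survive instead of 9
```

Every field dropped here is paid for on every call, so even a small whitelist compounds into large token savings at millions of calls per week.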

For business operations, this validates a solution-first model over open-ended AI experimentation. Instead of overwhelming an AI with access to your entire tech stack, deployments should begin with a tightly scoped starter project. By defining a fixed scope and a specific business outcome, you naturally constrain the agent's toolset, preserving its reasoning capabilities and ensuring immediate, reliable value.

Server-side orchestration: hiding complexity from the LLM

One of the most profound lessons in scaling AI agents at the enterprise level is recognizing what an LLM should not do. Large Language Models are exceptional reasoning engines, but they are relatively fragile workflow orchestrators.

When an agent must navigate complex operational logic - like making five sequential API calls to figure out repo permissions before committing code - failure rates spike. Agents do not inherently know which systems they have write permissions for, leading to inevitable hallucinations and failed actions.

The architectural fix is shifting this complexity away from the agent and onto the server. By encoding the "agent intent" on the server side, a single request from the AI can autonomously trigger a robust, multi-step execution handled by deterministic software. This approach slashes round trips between the agent and the server, preserves the context window, and significantly boosts reliability - ultimately driving tool success rates above 95%.
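
A minimal sketch of the pattern, assuming a hypothetical create_fix_branch intent: the agent sends one declarative request, and deterministic server code performs the permission check and the multi-step sequence that would otherwise cost round trips and context:

```python
# Sketch of server-side intent orchestration. The intent schema and
# helper functions are hypothetical; the point is that the multi-step
# sequence runs in deterministic code, not in the LLM's context window.

def handle_agent_intent(intent: dict, token: str) -> dict:
    """Single entry point: the agent states *what* it wants; the
    server decides *how*, including permission checks the agent
    cannot reliably reason about on its own."""
    if intent["type"] != "create_fix_branch":
        return {"ok": False, "error": "unknown intent"}

    repo = intent["repo"]
    if not has_write_permission(token, repo):   # deterministic check
        return {"ok": False, "error": f"no write access to {repo}"}

    branch = create_branch(repo, intent["branch_name"])
    commit = commit_files(repo, branch, intent["files"])
    return {"ok": True, "branch": branch, "commit": commit}

# Stubbed integrations; a real server would call the platform API here.
def has_write_permission(token, repo): return True
def create_branch(repo, name): return name
def commit_files(repo, branch, files): return "abc123"

print(handle_agent_intent(
    {"type": "create_fix_branch", "repo": "acme/app",
     "branch_name": "fix/login", "files": {"app.py": "..."}},
    token="tok_demo",
))
```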

This maps perfectly to how modern sovereign AI agent systems should be designed. Rather than relying entirely on the LLM to navigate fragile API endpoints, organizations should utilize battle-tested workflow automation tools (n8n, Make, or custom pipelines) for process orchestration. The LLM handles the cognitive reasoning, while the orchestration layer handles the deterministic integration logic. This separation of concerns is exactly what makes visual orchestration more reliable than terminal-based agent workflows.

The Shadow AI security crisis when scaling AI agents

The utility of autonomous agents currently sits in direct conflict with enterprise security. In the wild, AI setups are frequently insecure by default. Users often rely on plaintext access tokens that are stored in easily accessible locations, live far longer than they should, and remain dangerously overprivileged.

Security researchers have repeatedly demonstrated prompt injection and data exfiltration attacks. If an agent has sweeping access to a company's internal code or customer data, a malicious prompt can trick the agent into exposing that private data. The lethal trifecta of agent security - prompt injection, autonomous tool execution, and overprivileged access - makes ungoverned Shadow AI a ticking time bomb for enterprise IT.

Scaling operations teams cannot allow employees to hook random AI chatbots up to their core systems using personal tokens. Securing this infrastructure requires a centralized, governed approach. Solutions include dynamic token scoping, where the tool list is immediately filtered down to the exact permissions of the provided token, and step-up authentication, which interactively asks the user to authorize new permissions mid-workflow rather than silently failing or operating with permanent god-mode access.
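
The sketch below shows the shape of dynamic token scoping; the scope names, tool registry, and step-up prompt are illustrative assumptions rather than any particular platform's API:

```python
# Sketch of dynamic token scoping: the tool list exposed to the agent
# is filtered to what the presented token can actually do. Scope names
# and the tool registry are illustrative assumptions.

TOOL_REGISTRY = {
    "list_pull_requests": {"required_scope": "repo:read"},
    "create_issue":       {"required_scope": "issues:write"},
    "merge_pull_request": {"required_scope": "repo:write"},
    "delete_repository":  {"required_scope": "admin"},
}

def tools_for_token(token_scopes: set[str]) -> list[str]:
    """Expose only the tools the token is entitled to call, so an
    injected prompt cannot even see overprivileged operations."""
    return [
        name for name, meta in TOOL_REGISTRY.items()
        if meta["required_scope"] in token_scopes
    ]

def request_step_up(missing_scope: str) -> bool:
    """Placeholder for step-up auth: interactively ask the user to
    grant a scope mid-workflow instead of failing silently."""
    answer = input(f"Agent needs '{missing_scope}'. Grant it? [y/N] ")
    return answer.strip().lower() == "y"

print(tools_for_token({"repo:read", "issues:write"}))
# -> ['list_pull_requests', 'create_issue']
```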

Deploying sovereign AI means taking back control of these integrations. Whether utilizing robust cloud infrastructure or maintaining strict local data boundaries, organizations must own the infrastructure where their tokens and tools reside. See how mid-market companies are implementing governed AI agent integration without the Shadow AI sprawl.

Need help turning AI strategy into results? Ability.ai builds custom AI automation systems that deliver defined business outcomes — no platform fees, no vendor lock-in.

Read-only modes and human-in-the-loop workflows

Trust is the primary bottleneck for AI adoption in critical business operations. Even when an agent is technically capable of executing a complex task, operations leaders remain hesitant to grant autonomous write-access to core systems like CRMs, HR databases, or production environments.

Industry data shows that roughly 17% of users actively opt for read-only modes when interacting with AI servers. This is a massive segment of the user base signaling that they want the analytical and observational power of AI without the risk of autonomous execution.

Furthermore, introducing human-in-the-loop (HITL) workflows drastically improves adoption for sensitive tasks. For example, allowing a user to review and edit an AI-generated issue or response before it is officially posted bridges the gap between automation and quality control. Users care deeply about how their outputs are received by clients or peers, and providing an interception point builds confidence in the system.

When rolling out AI to sales, marketing, or customer support teams, read-only observability should be the baseline phase of any starter project. As the agent proves its reasoning capabilities, organizations can gradually introduce HITL approval workflows for external system writes - eventually transitioning to fully autonomous actions only when the system has earned unwavering trust. Organizations exploring this approach can start with AI-powered operations automation that includes built-in governance guardrails.
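
One way to encode this phased rollout is an explicit trust gate in front of every write. The phase names and hooks below are illustrative assumptions, not a prescribed implementation:

```python
# Sketch of a phased trust model: read-only first, then human-in-the-
# loop approval for writes, then full autonomy.

from enum import Enum

class TrustPhase(Enum):
    READ_ONLY = "read_only"
    HITL = "hitl"
    AUTONOMOUS = "autonomous"

def submit_action(phase: TrustPhase, action: str, payload: str) -> str:
    if phase is TrustPhase.READ_ONLY:
        return f"BLOCKED: '{action}' is a write; agent is read-only."
    if phase is TrustPhase.HITL:
        # Interception point: a human reviews and may edit the draft
        # before anything reaches an external system.
        edited = input(f"Review draft for '{action}':\n{payload}\n"
                       "Edit, or press Enter to approve: ")
        payload = edited or payload
    return f"EXECUTED: {action} -> {payload!r}"

print(submit_action(TrustPhase.READ_ONLY, "post_issue", "Bug report draft"))
```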

Rethinking how you evaluate AI agent tools

As organizations build out custom tools for their AI agents, they often fall into the trap of micro-optimizing individual tool descriptions. A developer might polish a tool's description so carefully that, tested in isolation, the agent understands exactly when to use it.

However, when that highly optimized tool is placed in a sandbox with ten other tools, unexpected behaviors emerge. The tools begin to "fight" for the agent's attention. A perfectly described tool might sound so universally applicable that the agent starts calling it even when a different, more specific tool would be appropriate.

Evaluating AI agent tools requires holistic testing. You cannot test tools in isolation; you must evaluate them against each other in a pooled environment to ensure they are called at the right times and ignored at the wrong times. This requires rigorous testing frameworks - similar to the agent reliability metrics and governance patterns that mature AI operations teams are now implementing - to measure how the inclusion of a new tool degrades or impacts the performance of existing tools within the agent's arsenal.
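
A pooled evaluation harness can be as simple as the sketch below, where choose_tool stands in for a real agent invocation (an assumption for illustration) and the metric is the fraction of tasks routed to the expected tool:

```python
# Sketch of pooled tool evaluation: run a shared task set against the
# full tool pool and measure whether the agent picks the expected tool.

import random

def choose_tool(task: str, tool_pool: list[str]) -> str:
    """Stand-in for the LLM's tool selection; replace with a real
    agent invocation in practice."""
    return random.choice(tool_pool)

def pooled_eval(tasks: dict[str, str], tool_pool: list[str]) -> float:
    """tasks maps a task prompt to the tool that *should* be called.
    Returns the fraction of tasks where the agent chose correctly."""
    hits = sum(
        choose_tool(task, tool_pool) == expected
        for task, expected in tasks.items()
    )
    return hits / len(tasks)

tasks = {
    "show me open PRs": "list_pull_requests",
    "file a bug about the login page": "create_issue",
}
baseline = pooled_eval(tasks, ["list_pull_requests", "create_issue"])
# Re-run after adding a new tool to detect selection regressions:
expanded = pooled_eval(tasks, ["list_pull_requests", "create_issue",
                               "search_code"])
print(f"baseline={baseline:.0%}, with new tool={expanded:.0%}")
```

Tracking the same selection-rate metric before and after each new tool is added makes "tools fighting for attention" a measurable regression rather than an anecdote.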

A pragmatic roadmap for scaling AI agents in the enterprise

The future of enterprise AI involves highly compositional tool use, autonomous server discovery, and agents navigating thousands of specialized capabilities seamlessly. But today, we are still navigating the crucial experimental phase where governance, context limits, and security are active constraints.

The key takeaway for operations leaders is that scaling AI agents is not about maximizing the number of tools you give an LLM - it is about maximizing the reliability, security, and specific business outcomes of the system.

Stop paying platform fees for generic chatbots that encourage Shadow AI sprawl. Instead, invest in solutions that prioritize server-side orchestration to hide complexity from the LLM, mandate human-in-the-loop safeguards to protect your brand, and operate within a governed sovereign AI architecture. By starting with a tightly defined, fixed-scope project, you can bypass the chaos of context overload and prove immediate operational value.

See what AI automation could do for your business

Get a free AI strategy report with specific automation opportunities, ROI estimates, and a recommended implementation roadmap — tailored to your company.

Frequently asked questions about scaling AI agents

What happens when AI agents are given too many tools?

When agents are loaded with too many tools, their context windows become overloaded, leading to degraded reasoning, hallucinations, and forgotten instructions. Industry data from GitHub's MCP server shows that trimming default tool configurations and clustering operations can reduce context load by roughly 49%, while stripping unnecessary metadata from outputs can eliminate over 75% of output token consumption. The solution is strict constraint and curation - not unlimited access.

How does server-side orchestration improve agent reliability?

Server-side orchestration shifts complex multi-step workflow logic away from the LLM and onto deterministic backend systems. Instead of making the AI navigate five sequential API calls to check permissions before committing code, a single agent request triggers a robust server-side execution pipeline. This approach reduces round trips, preserves the context window, and pushes tool success rates above 95% - far more reliable than relying on the LLM to orchestrate fragile API sequences.

How do you secure AI agents against Shadow AI risks?

Securing AI agents requires a centralized, governed approach that addresses the lethal trifecta of agent security: prompt injection, autonomous tool execution, and overprivileged access. Key measures include dynamic token scoping that filters available tools based on exact permissions, step-up authentication that requests user authorization mid-workflow, and deploying within a sovereign AI architecture where the organization owns all infrastructure, tokens, and data boundaries.

Why do read-only modes and human-in-the-loop workflows matter?

Human-in-the-loop workflows bridge the trust gap that blocks AI adoption in critical business operations. Roughly 17% of users actively opt for read-only modes when interacting with AI servers. Starting with read-only observability as the baseline, organizations can gradually introduce approval workflows for external writes, and eventually transition to full autonomy only when the system has proven reliable - building confidence incrementally rather than demanding blind trust.

How should AI agent tools be evaluated?

AI agent tools must be evaluated holistically in a pooled environment rather than in isolation. A tool that performs perfectly on its own may confuse an agent when placed alongside ten other tools - competing descriptions cause the agent to call the wrong tool at the wrong time. Rigorous testing frameworks are needed to measure how adding a new tool impacts the performance of all existing tools in the agent's arsenal.