Show HN: archgw: open-source, intelligent proxy for AI agents, built on Envoy

github.com

17 points by adilhafeez 7 hours ago

Hi HN! This is Adil, Salman, Co and Shuguang, and we're excited to introduce archgw [1], an open source intelligent proxy for agents built on Envoy [2]. Arch moves the critical but crufty work around safety, observability, and routing of prompts outside your business logic. Arch is a uniquely intelligent infrastructure primitive, engineered with purpose-built fast LLMs [3] for tasks like intent detection over multi-turn conversations, parameter identification and extraction, and triggering single or multiple function calls. It also offers convenience features like auto-dispatching LLM calls for summarization over data returned from your APIs, driven by system prompts configured in archgw.

Today, building a smart, production-ready agent means weaving together a large set of mono-functional, opinionated libraries and adding extra layers like LLM-based preprocessing to determine things like the relevance and safety of the user's prompt (e.g. applying governance and guardrails). Once past that stage, developers must extract relevant information from the user prompt to determine intent, extract parameters as necessary, and package the relevant tool calls for an LLM so it can trigger a backend API that executes a particular domain-specific task, etc. Only after all that is done are developers ready to trigger an LLM call for summarization, and they must manage upstream error handling and retry logic themselves. Not to mention, if they want to experiment with multiple LLMs or move between LLM versions, they have to write crufty, undifferentiated code. This entire experience is slow, error-prone, and cumbersome, and none of it is unique to any one agent.

Prior to building archgw, the team spent time building Envoy [2] at Lyft, API Gateway at AWS, and specialized search and intent models at Microsoft Research, and worked on safety at Meta. archgw was born out of the belief that several rules-based, mono-functional tools should converge into a single multi-functional infrastructure primitive designed for prompts and agents. We built archgw on the highly popular, battle-tested open source proxy Envoy and re-imagined it for prompts and agents. For this we had to build blazing-fast LLMs [3] that can handle the crufty, ahead-in-the-request-path work of processing prompts sent to an agent, so that developers can focus on what matters most: building fast, personalized agents without the unnecessary prompt engineering and systems integration work needed to get there.

Here are some additional details about the open source project. archgw is written in Rust, and the request path has three main parts:

* Listener subsystem, which handles downstream (ingress) and upstream (egress) request processing.

* Prompt handler subsystem. This is where archgw decides whether the incoming request is safe, via its prompt_guard primitive, and identifies where to forward the conversation, via its prompt_target primitive.

* Model serving subsystem, the interface that hosts all the lightweight LLMs engineered in archgw and offers a framework for things like hallucination detection for these models.
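To make that concrete, here is a rough sketch of what the prompt_guard and prompt_target primitives look like in an arch_config.yaml (illustrative only - the api_server endpoint, the get_weather target, and the guard message are made up; see the docs [6] for the exact schema):

  # Reject unsafe prompts before they ever reach your agent.
  prompt_guards:
    input_guards:
      jailbreak:
        on_exception:
          message: "Sorry, I can't help with that request."

  # Route prompts with a matching intent to a backend API.
  prompt_targets:
    - name: get_weather
      description: Get the current weather for a location
      parameters:
        - name: city
          description: The city to look up
          required: true
      endpoint:
        name: api_server
        path: /weather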

We loved building this open source project, and we believe this infra primitive will help developers build faster, safer, and more personalized agents without all the manual prompt engineering and systems integration work needed to get there. We invite other developers to use and improve Arch. Please give it a shot and leave feedback here, or on our Discord channel [4].

Also, here is a quick demo of the project in action [5]. You can check out our public docs at [6]. Our models are also available here [7].

[1] https://github.com/katanemo/archgw

[2] https://www.envoyproxy.io/

[3] https://huggingface.co/collections/katanemo/arch-function-66...

[4] https://discord.com/channels/1292630766827737088/12926307682...

[5] https://www.youtube.com/watch?v=I4Lbhr-NNXk

[6] https://docs.archgw.com/

[7] https://huggingface.co/katanemo

mudassaralam 5 hours ago

Why didn't you build your own gateway from the ground up, especially when the Rust/Wasm runtime in Envoy is not production-ready yet? From envoyproxy:

… This extension is functional but has not had substantial production burn time, use only with this caveat.

This extension has an unknown security posture and should only be used in deployments where both the downstream and upstream are trusted.

  • adilhafeez 5 hours ago

    Envoy has proven itself in the industry, and we didn't want to reinvent the wheel by redoing what Envoy had already done for observability, rate limits, connection management, etc. And the reason for using proxy-wasm was so we don't take a hard dependency on any particular build of Envoy. There are many other benefits too, which are listed here [1].

    Regarding support for the Wasm runtime in Envoy: we believe Wasm support in Envoy is not going anywhere and will continue to become more and more stable over time. Envoy has a healthy community, and in case of any security vulnerability we expect Envoy will ship a fix quite fast, as it has done in the past. See here for details of the security patch rollout [2].

    [1] https://github.com/proxy-wasm/spec/blob/main/docs/WebAssembl...

    [2] https://github.com/envoyproxy/envoy/blob/main/SECURITY.md

naveed174 5 hours ago

For complex agent scenarios where CoT reasoning might be needed, how does archgw work? BTW, nice detailed post, and congratulations on the launch.

  • sparacha 3 hours ago

    If you are building a state machine then you should use tools that enable you to do that (Temporal, LangGraph, etc.) by orchestrating multiple LLM calls - in that case archgw offers intelligent routing to your CoT agent and enables you to transparently add rich observability and metrics for all calls made to LLMs in the CoT/state-machine scenario.
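    Concretely (an illustrative sketch - the agent_server name and /agent path are made up), the CoT agent would sit behind a regular prompt_target, so archgw routes matching conversations to it, and the agent's own LLM calls can go back out through the gateway for metrics and traces:

      prompt_targets:
        - name: reasoning_agent
          description: Multi-step CoT agent (e.g. a LangGraph app)
          endpoint:
            name: agent_server   # your orchestrator service
            path: /agent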

herewhere 3 hours ago

I am interested in knowing how arch runs. Is it a library that I need to add to my code to make it work? Or do I need to deploy some sort of service in my infrastructure?

  • honorable_judge an hour ago

    It's a proxy built on Envoy, so I think it's fairly clear that this is a separate process. As far as I can tell, you create a config file, boot up archgw, and in the config have it point to the endpoints where prompts get forwarded.
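    Presumably the endpoint wiring looks something like this (my guess from skimming the demos - the app_server name and address are illustrative):

      endpoints:
        app_server:
          endpoint: 127.0.0.1:8080   # where prompt targets forward requests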

fahimulhaq 7 hours ago

Hey Adil, thanks for sharing and congratulations on the launch.

Can I just use arch for routing between LLMs? And what LLMs do you support? And what about key management? Do I manage access keys myself?

  • adilhafeez 6 hours ago

    Thanks! Those are all good questions. Let me respond to them one by one,

    > Can I just use arch for routing between LLMs

    Yes, you can use the arch_config.yaml file to select between LLMs. In fact we have a demo on LLM routing [1] that you can try. Here's how you can specify different LLMs in our config:

      llm_providers:
        - name: gpt-4o-mini
          access_key: $OPENAI_API_KEY
          provider: openai
          model: gpt-4o-mini
          default: true
    
        - name: gpt-3.5-turbo-0125
          access_key: $OPENAI_API_KEY
          provider: openai
          model: gpt-3.5-turbo-0125
    
        - name: gpt-4o
          access_key: $OPENAI_API_KEY
          provider: openai
          model: gpt-4o
    
        - name: ministral-3b
          access_key: $MISTRAL_API_KEY
          provider: mistral
          model: ministral-3b-latest
    
    
    
    > And what LLMs do you support

    We currently support Mistral and OpenAI, and for both of them we support a streaming interface. We expose an OpenAI-compliant v1/chat interface, so any chat UI that works with OpenAI should work with us as well. We also ship demos with a sample Gradio application.

    > And what about key management? Do I manage access keys myself?

    None of your clients need to manage access keys. Upon receipt of a request, our filter will select the appropriate LLM from arch_config, pick the relevant access_key, and modify the request with that access_key before sending it to the upstream LLM [2].

    [1] https://github.com/katanemo/archgw/tree/main/demos/llm_routi...

    [2] https://github.com/katanemo/archgw/blob/main/crates/llm_gate...

mikram 7 hours ago

Congrats Adil! Interesting idea with a lot of potential.

Do you have to use envoyproxy to use archgw? Can archgw be used for LLM routing without using envoyproxy?

  • adilhafeez 6 hours ago

    Thanks! My responses inline,

    > do you have to use envoyproxy to use archgw

    Yes, 100%. Our gateway is implemented as a Rust filter which runs inside the Envoy process.

    > Can archgw be used for LLM routing without using envoyproxy?

    Unfortunately, no, since we are built on top of envoyproxy.

Nomi21 6 hours ago

This is honestly quite a detailed and thoughtfully put together post. I do have some questions and would love to hear your thoughts on them. First off, can I use just the model itself? Do you have models hosted somewhere, or do they run locally? If they run locally, what are the system requirements? Can I build RAG-based applications on arch? And how do you do intent detection in multi-turn dialogue? How does parameter gathering work - is the model capable of conversing with the user to gather parameters?

  • sparacha 5 hours ago

    Arch-Function, our fast, open source LLM, does most of the heavy lifting: extracting parameter values from a user prompt, gathering more information from the user, and determining the right set of functions to call downstream. It's designed for smarter RAG scenarios and agentic workflows (like buying an insurance policy through prompts). While you can use the model yourself, archgw offers a framework around its usage to detect hallucinations and re-prompt the LLM if token logprobs indicate low confidence.

    The same model is currently being updated to handle complex multi-turn intent and parameter extraction scenarios, so that the dreaded follow-up/clarifying RAG use case can be handled effortlessly by developers without having to resort to complex LLM pre-processing. In essence, if the user's follow-up question is "remove X", their RAG endpoints get structured information about the prompt with refined parameters, against which developers simply have to retrieve the right chunks for summarization.

  • adilhafeez 5 hours ago

    Thanks :) Those are great questions. Let me respond to them one by one,

    > Can I use just the model itself?

    Yes - our models are on Hugging Face. You can use them directly.

    > Do you have models hosted somewhere or they run locally? If they run locally what are the system requirements?

    The arch gateway does a bunch of processing locally; for example, for intent detection and hallucination detection we use an NLI model. For function calling we use a hosted version of our 1.5B function-calling model [1]. We use vLLM to host our model, but vLLM is not supported on Mac. There are other issues with running the model locally on Mac too; for example, Docker doesn't support giving containers GPU access on Mac. We tried using Ollama in the past to host the model, but Ollama doesn't support exposing logprobs. We do have an open issue on this [2] and we will improve it soon.

    > Can I build RAG based applications on arch?

    Yes, you can. You would need to host the vector DB yourself; arch doesn't host one, as we wanted to keep our infra simple and clean. We do have a default target that you can use to build a RAG application; for example, see the insurance agent demo [3]. We also have an open issue on building a full RAG demo [4] - +1 it to show your support.
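    For reference, a default target is (roughly - field names illustrative, see the demo [3] for the real config) just a prompt_target marked as the fallback, so prompts that match no other intent land on your RAG endpoint:

      prompt_targets:
        - name: default_target
          default: true   # catches prompts that match no other target
          description: Fallback for general questions (retrieval + summarization)
          endpoint:
            name: app_server
            path: /default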

    > How does parameter gathering work, is the model capable of conversing with the user to gather parameters?

    Our model is trained to engage in dialogue if a parameter is missing, because it has seen examples of missing parameters during training. During our evals and tests we found that the model could still hallucinate; e.g., for the question "how is the weather", the model could hallucinate the city as "LA" even though LA was not specified in the query. We handle hallucination detection in arch using an NLI model to establish entailment of parameters from the input query. BTW, we are currently working on improving that part by quite a lot. More on that in the next release.
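    To tie that back to the config, a (hypothetical) target like the one below is what drives that behavior: if the required city parameter is missing, the model asks a follow-up instead of guessing, and if the model fills in a value anyway, the NLI entailment check flags values that aren't supported by the query:

      prompt_targets:
        - name: get_weather
          description: Get the current weather
          parameters:
            - name: city
              required: true   # missing -> follow-up question; unsupported value -> caught by the NLI check
          endpoint:
            name: api_server
            path: /weather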

    [1] https://huggingface.co/katanemo/Arch-Function-1.5B.gguf

    [2] https://github.com/katanemo/archgw/issues/286

    [3] https://github.com/katanemo/archgw/blob/main/demos/insurance...

    [4] https://github.com/katanemo/archgw/issues/287

honorable_judge 2 hours ago

With all the focus on language-specific frameworks, this out-of-process architecture choice is an interesting one. On one hand, it helps you sidestep the "is this functionality available in JS, Java, etc." question; on the other, it means it's not as easy as `import archgw` in Python. Good luck though - feels like an interesting project.