
Guidance for Multi-Provider Generative AI Gateway on AWS

Summary: This implementation guide provides an overview of Guidance for Multi-Provider Generative AI Gateway on AWS, its reference architecture and components, considerations for planning the deployment, and configuration steps for deploying this guidance to Amazon Web Services (AWS). This guide is intended for Solution architects, Business decision makers, DevOps engineers, Data scientists, and Cloud computing professionals who want to implement Guidance for Multi-Provider Generative AI Gateway on AWS in their environment.


Overview

This implementation guide provides an automated AWS Cloud Development Kit (AWS CDK) and HashiCorp Terraform-based deployment of LiteLLM Proxy Server (LLM Gateway) onto the Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS) container orchestration platforms on AWS. It is pre-configured with defaults that allow most users to get started quickly with LiteLLM Proxy Server.

It also provides additional features on top of LiteLLM, such as an Amazon Bedrock interface (instead of the default OpenAI interface), support for Amazon Bedrock Managed Prompts, chat history, and support for Okta OAuth 2.0 JWT token authentication.

Features and benefits

  • This comprehensive gateway serves as a complete enterprise solution for managing and standardizing interactions with various large language models (LLMs), including Amazon Bedrock models. It provides a single API for seamless integration with multiple LLM providers, handling standardized input/output across different models and automatically translating for provider-specific endpoints, including completion, embedding, and image generation, while maintaining consistent response formatting.

  • The system enables centralized management of LLM usage at user, team, and API key levels, with robust usage tracking and monitoring capabilities. Administrators can configure budgets, set rate limits, restrict model access, and establish custom routing rules and policies. The gateway implements intelligent retry and fallback logic across providers, featuring load balancing between deployments, automatic failover mechanisms, and prompt caching for cost optimization.

  • Security and compliance features include integration with Amazon Bedrock Guardrails, which extends to all LLM providers, along with secure API key management and comprehensive access controls. The Guidance comes with a web-based administrative interface where managers can oversee user and team settings, while users can generate API keys and test different LLMs through an interactive chat interface.

  • Cost optimization is achieved through prompt caching mechanisms, usage-based cost allocation, strict budget enforcement, and resource utilization monitoring. The gateway seamlessly integrates with Amazon Bedrock and other AWS services, offering additional optimizations and enhanced compatibility with the AWS ecosystem.

  • All functionality is accessible through both API endpoints and an intuitive UI, making it a robust solution for organizations seeking to implement and manage LLMs at scale while maintaining control over usage, costs, and security.

Use cases

  • Multi-provider LLM integration: Seamlessly switch between different LLM providers such as Amazon Bedrock, OpenAI, Azure OpenAI, Cohere, and more, using a single API interface.

  • Cost management: Track and monitor LLM usage across different models, set related budgets, and implement rate limiting to control spending.

  • Simplified development: Focus on application logic rather than managing complex API calls to various LLM providers by using a consistent input/output format.

  • Load balancing: Distribute requests across multiple LLM instances for better performance and redundancy.

  • Access control and authentication: Manage API keys and restrict access to specific models based on user permissions.

  • Logging and observability: Monitor LLM calls with detailed logging for troubleshooting and analysis.

Architecture overview

This section provides reference implementation architecture diagrams for the components deployed with this Guidance.

Architecture Diagram

Below are reference architecture diagrams for the LiteLLM and API/middleware containerized applications deployed on the Amazon ECS or Amazon EKS container orchestrators, respectively, along with their integrations with other AWS services and component interactions.

Reference Architecture on ECS or EKS AWS Container Orchestration Platforms

Figure 1: Generative AI gateway reference architecture on Amazon ECS or Amazon EKS for container orchestration

Architecture steps

  1. Tenants and client applications access the LiteLLM gateway proxy API through the Amazon Route 53 URL endpoint, which is protected against common web exploits and bots using AWS WAF.
  2. AWS WAF forwards requests to Application Load Balancer (ALB) to automatically distribute incoming application traffic to Amazon Elastic Container Service (Amazon ECS) tasks or Amazon Elastic Kubernetes Service (Amazon EKS) pods running generative AI gateway containers. TLS/SSL encryption secures traffic to the load balancer using a certificate issued by AWS Certificate Manager (ACM).
  3. Container images for API/middleware and LiteLLM applications are built during guidance deployment and pushed to Amazon Elastic Container Registry (Amazon ECR). They are used for deployment to Amazon ECS on AWS Fargate or Amazon EKS clusters that run these applications as containers in ECS tasks or EKS pods, respectively. LiteLLM provides a unified application interface for configuration and interacting with LLM providers. The API/middleware integrates natively with Amazon Bedrock to enable features not supported by the LiteLLM open source project.
  4. Models hosted on Amazon Bedrock provide model access, guardrails, prompt caching, and routing to enhance the AI gateway and additional controls for clients through a unified API. Access to required Amazon Bedrock models must be properly configured.
  5. External model providers (such as OpenAI, Anthropic, or Vertex AI) are configured using the LiteLLM Admin UI to enable additional model access through LiteLLM’s unified application interface. Integrate pre-existing configurations of third-party providers into the gateway using LiteLLM APIs.
  6. LiteLLM integrates with Amazon ElastiCache (Redis OSS), Amazon Relational Database Service (Amazon RDS), and AWS Secrets Manager services. Amazon ElastiCache enables multi-tenant distribution of application settings and prompt caching. Amazon RDS enables persistence of virtual API keys and other configuration settings provided by LiteLLM. Secrets Manager stores external model provider credentials and other sensitive settings securely.
  7. The LiteLLM and API/middleware applications send logs to a dedicated Amazon Simple Storage Service (Amazon S3) bucket for troubleshooting and access analysis.

Serving Architecture

LiteLLM Proxy Serving Architecture

Figure 2: LiteLLM Proxy Serving Architecture

AWS Services in this Guidance

| AWS Service | Role | Description |
|---|---|---|
| Amazon Bedrock | Core service | Manages single-API access to multiple foundation models. |
| Amazon ECS | Core service | Manages application platform and on-demand infrastructure for LiteLLM container orchestration. |
| Amazon EKS | Core service | Manages Kubernetes control plane and compute nodes for LiteLLM container orchestration. |
| Amazon Elastic Compute Cloud (Amazon EC2) | Core service | Provides compute instances for Amazon EKS compute nodes and runs containerized applications. |
| Amazon Virtual Private Cloud (Amazon VPC) | Core service | Creates an isolated network environment with public and private subnets across multiple Availability Zones. |
| AWS WAF | Core service | Protects guidance applications from common web exploits. |
| Amazon Elastic Container Registry (Amazon ECR) | Supporting service | Stores and manages Docker container images for Amazon ECS and Amazon EKS deployments. |
| Application Load Balancer (ALB) | Supporting service | Distributes incoming traffic across multiple targets in the Amazon ECS or Amazon EKS cluster. |
| Amazon Simple Storage Service (Amazon S3) | Supporting service | Provides persistent object storage for application logs and other related data. |
| Amazon Relational Database Service (Amazon RDS) | Supporting service | Enables persistence of virtual API keys and other configuration settings provided by LiteLLM. |
| Amazon ElastiCache (Redis OSS) | Supporting service | Enables multi-tenant distribution of application settings and prompt caching. |
| Amazon Route 53 | Supporting service | Routes users to the Guidance application through DNS records. |
| AWS Identity and Access Management (IAM) | Supporting service | Manages access to AWS services and resources securely, including Amazon ECS or Amazon EKS cluster access. |
| AWS Certificate Manager (ACM) | Security service | Manages SSL/TLS certificates for secure communication within the cluster. |
| Amazon CloudWatch | Monitoring service | Collects and tracks metrics, logs, and events from Amazon EKS and other AWS resources provisioned in the Guidance. |
| AWS Secrets Manager | Management service | Stores external model provider credentials and other sensitive settings securely. |
| AWS Key Management Service (AWS KMS) | Security service | Manages encryption keys for securing data in Amazon EKS and other AWS services. |

For any Guidance deployment, either Amazon ECS or Amazon EKS can be used, but not both.

Plan your deployment

Prerequisites

Cost

Cost Considerations

When implementing this solution on AWS, it’s important to understand the various factors that contribute to the overall cost. This section outlines the primary cost components and key factors that influence pricing.

Cost Components

The total cost of running this solution can be broadly categorized into two main components:

  1. LLM Provider Costs: These are the charges incurred for using services from LLM providers such as Amazon Bedrock, Amazon SageMaker, Anthropic, and others. Each provider has its own pricing model, typically based on factors like the number of tokens processed, model complexity, and usage volume.

  2. AWS Infrastructure Costs: These are the costs associated with running the Generative AI Gateway proxy server on AWS infrastructure. This includes various AWS services and resources used to host and operate the solution.

Key Factors Influencing AWS Infrastructure Costs

While the default configuration provides a starting point, the actual cost of running the LiteLLM-based proxy server on AWS can vary significantly based on your specific implementation and usage patterns. Some of the major factors that can impact scaling and cost include:

  1. Compute instances: The type and number of EC2 instances used to host the LiteLLM container as a proxy. Instance type selection affects both performance and cost.

  2. Amazon EBS storage: The type and size of EBS volumes attached to the EC2 instances can influence both performance and cost.

  3. Autoscaling configuration: The autoscaling policies configured for Amazon EKS and Amazon ECS clusters will affect how the solution scales in response to demand, impacting both performance and cost.

  4. Traffic patterns: The shape and distribution of LLM requests, including factors such as:

    • Request/response payload sizes
    • Tokens per minute (TPM)
    • Requests per minute (RPM)
    • Concurrency levels
    • Model latency (from downstream LLM providers)
    • Network latency between AWS and LLM providers
  5. Caching configuration: Effective caching can reduce the number of requests to LLM providers, potentially lowering costs but requiring additional resources.

  6. Database storage: The amount of storage required for managing virtual keys, organizations, teams, users, budgets, and per-request usage tracking.

  7. High availability and resiliency: Configurations for load balancing, routing, and retries can impact both reliability and cost.

  8. Logging level: The configured logging level affects storage and potentially network egress costs.

  9. Networking costs: This includes data transfer charges and the cost of running NAT gateways for outgoing calls to LLM providers.

It’s important to note that this is not an exhaustive list of cost factors, but rather highlights some of the major contributors to the overall cost of the solution.

Customer Responsibility

While this implementation guide provides default configurations, customers are responsible for:

  1. Configuring the solution to their optimal settings based on their specific use case and requirements.
  2. Monitoring and managing the costs incurred from running the proxy server on AWS infrastructure.
  3. Managing and optimizing the costs associated with their chosen LLM providers.

Customers should regularly review their AWS service usage patterns, adjust configurations as needed, and leverage AWS cost management tools to optimize their spending.

We recommend creating a budget through AWS Cost Explorer to help manage costs. Prices are subject to change and also depend on model provider usage patterns and data volume. For full details, refer to the pricing webpage for each AWS service used in this guidance.
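For example, a monthly cost budget can also be created programmatically. The following is a minimal boto3 sketch; the budget name and limit are illustrative assumptions, not values prescribed by this Guidance:

import boto3

# Minimal sketch: create a monthly cost budget for the gateway deployment.
# The budget name and limit below are assumptions; adjust them to your needs.
sts = boto3.client("sts")
budgets = boto3.client("budgets")

account_id = sts.get_caller_identity()["Account"]

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "genai-gateway-monthly",  # hypothetical budget name
        "BudgetLimit": {"Amount": "400", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
)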

Sample cost table

The following tables provide a sample cost breakdown for deploying this guidance on Amazon ECS and Amazon EKS with the default parameters in the us-east-1 (N. Virginia) Region for one month. These estimates are based on the AWS Pricing Calculator outputs for the full deployments as per guidance and are subject to changes in underlying services configuration.

For ECS container orchestration platform:

| AWS service | Dimensions | Cost, month [USD] |
|---|---|---|
| Amazon ECS | OS: Linux, CPU architecture: ARM, 24 hours, 2 tasks per day, 4 GB memory, 20 GB ephemeral storage | 115.33 |
| Amazon VPC | 1 VPC, 4 subnets, 1 NAT Gateway, 1 public IPv4, 100 GB outbound data per month | 50.00 |
| Amazon ECR | 5 GB image storage/month | 0.50 |
| ALB | 1 ALB, 1 TB/month | 24.62 |
| Amazon S3 | 100 GB/month | 7.37 |
| Amazon RDS | 2 db.t3.micro nodes, 100% utilization, multi-AZ, 2 vCPU, 1 GiB memory | 98.26 |
| Amazon ElastiCache (Redis OSS) | 2 cache.t3.micro nodes, 2 vCPU, 0.5 GiB memory, up to 5 GB network performance, 100% utilization | 24.82 |
| Amazon Route 53 | 1 hosted zone, 1 million standard queries/month | 26.60 |
| Amazon CloudWatch | 25 metrics to preserve | 12.60 |
| AWS Secrets Manager | 5 secrets, 30 days, 1 million API calls per month | 7.00 |
| AWS KMS | 1 key, 1 million symmetric requests | 4.00 |
| AWS WAF | 1 web ACL, 2 rules | 7.00 |
| ACM | 1 certificate | free |
| TOTAL | | $378.10/month |

For detailed cost estimates for the Amazon ECS platform, we recommend creating an estimate with the AWS Pricing Calculator.

For EKS container orchestration platform:

| AWS service | Dimensions | Cost, month [USD] |
|---|---|---|
| Amazon EKS | 1 control plane | 73.00 |
| Amazon EC2 | EKS compute nodes, 2 t4g.medium nodes | 49.06 |
| Amazon VPC | 1 VPC, 4 subnets, 1 NAT Gateway, 1 public IPv4, 100 GB outbound data per month | 50.00 |
| Amazon ECR | 5 GB image storage/month | 0.50 |
| ALB | 1 ALB, 1 TB/month | 24.62 |
| Amazon S3 | 100 GB/month | 7.37 |
| Amazon RDS | 2 db.t3.micro nodes, 100% utilization, multi-AZ, 2 vCPU, 1 GiB memory | 98.26 |
| Amazon ElastiCache (Redis OSS) | 2 cache.t3.micro nodes, 2 vCPU, 0.5 GiB memory, up to 5 GB network performance, 100% utilization | 24.82 |
| Amazon Route 53 | 1 hosted zone, 1 million standard queries/month | 26.60 |
| Amazon CloudWatch | 25 metrics to preserve | 12.60 |
| AWS Secrets Manager | 5 secrets, 30 days, 1 million API calls per month | 7.00 |
| AWS KMS | 1 key, 1 million symmetric requests | 4.00 |
| AWS WAF | 1 web ACL, 2 rules | 7.00 |
| ACM | 1 certificate | free |
| TOTAL | | $384.83/month |

For detailed cost estimates for the Amazon EKS platform, we recommend creating an estimate with the AWS Pricing Calculator.

Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.

This Guidance implements several security best practices and AWS services to enhance the security posture of your Amazon ECS or Amazon EKS clusters. Here are the key security components and considerations:

Identity and Access Management (IAM)

  • IAM roles: The architecture deploys dedicated IAM roles (litellm-stack-developers, litellm-stack-operators) to manage access to Amazon ECS or Amazon EKS cluster resources. This follows the principle of least privilege, ensuring users and services have only the permissions necessary to perform their tasks.
  • Amazon EKS-managed node groups: These node groups use dedicated IAM roles (litellm-stack-eks-nodegroup-role) with the specific permissions required for nodes to join the cluster and for pods to access AWS services.

Network Security

  • Amazon VPC: Amazon ECS or Amazon EKS clusters are deployed within a VPC (newly created or custom specified in guidance deployment configuration) with public and private subnets across multiple Availability Zones, providing network isolation.
  • Security groups: Security groups are typically used to control inbound and outbound traffic to EC2 instances and other resources within the VPC.
  • NAT gateways: Deployed in public subnets to allow outbound internet access for resources in private subnets while preventing inbound access from the internet.

Data Protection

  • Amazon EBS encryption: EBS volumes used by EC2 instances for EKS compute nodes are typically encrypted to protect data at rest.
  • AWS KMS: Used for managing encryption keys for various services, including EBS volume encryption.
  • AWS Secrets Manager: Used to store external model provider credentials and other sensitive settings securely.

Kubernetes-Specific Security

  • Kubernetes RBAC: Role-based access control (RBAC) is implemented within the Amazon EKS cluster to manage fine-grained access to Kubernetes resources.
  • ACM: Integrated to manage SSL/TLS certificates for secure communication within the clusters.
  • AWS Identity and Access Management (IAM): Used for role/policy-based access to AWS services and resources, including Amazon ECS or Amazon EKS cluster resource access.

Monitoring and Logging

  • Amazon CloudWatch: Used for monitoring and logging of AWS resources and applications running on the Amazon EKS cluster.

Container Security

  • Amazon ECR: Stores container images in a secure, encrypted repository. It includes vulnerability scanning to identify security issues in your container images.

Secrets Management

  • AWS Secrets Manager: Secrets Manager stores external model provider credentials and other sensitive settings securely.

Additional Security Considerations

  • Regularly update and patch Amazon ECS or Amazon EKS clusters, compute nodes, and container images.
  • Implement network policies to control pod-to-pod communication within the cluster.
  • Use pod security policies or pod security standards to enforce security best practices for pods.
  • Implement proper logging and auditing mechanisms for both AWS and Kubernetes resources.
  • Regularly review and rotate IAM and Kubernetes RBAC permissions.

Supported AWS Regions

As of March 2025, Guidance for Multi-Provider Generative AI Gateway on AWS is supported in the following AWS Regions:

| Region Name | Region Code |
|---|---|
| US East (Ohio) | us-east-2 |
| US East (N. Virginia) | us-east-1 |
| US West (Northern California) | us-west-1 |
| US West (Oregon) | us-west-2 |
| Canada (Central) | ca-central-1 |
| South America (São Paulo) | sa-east-1 |
| Europe (Frankfurt) | eu-central-1 |
| Europe (Ireland) | eu-west-1 |
| Europe (London) | eu-west-2 |
| Europe (Paris) | eu-west-3 |
| Europe (Stockholm) | eu-north-1 |
| Europe (Milan) | eu-south-1 |
| Europe (Spain) | eu-south-2 |
| Europe (Zurich) | eu-central-2 |

Quotas

Service quotas, also referred to as limits, are the maximum number of service resources or operations for your AWS account.

Quotas for AWS services in this Guidance

Make sure you have sufficient quota for each of the services implemented in this guidance. For more information, see AWS service quotas.

To view the service quotas for all AWS services in the documentation without switching pages, view the information in the Service endpoints and quotas page in the PDF instead.

Deploy the Guidance

Prerequisite CLI tools

  1. Install Docker CLI
  2. Install AWS CLI
  3. Install terraform CLI
  4. Install the yq utility (on macOS, install with brew install yq; on other Linux systems, download the binaries; see details here)
  5. If the DEPLOYMENT_PLATFORM parameter is set to EKS in the .env file (Amazon EKS as the container orchestration platform):

    Install the kubectl CLI utility

Environment tested and confirmed:

  • Docker: version 27.3.1
  • AWS CLI: version 2.19.5
  • yq: version 4.40.5
  • Terraform: v1.5.7
  • kubectl Client Version: v1.32.1
  • kubectl Kustomize Version: v5.5.0

Deployment process overview

Before you launch the Guidance, review the cost, architecture, security, and other considerations discussed in this document. Follow the step-by-step instructions in this section to configure and deploy the guidance into your account.

Time to deploy: Approximately 35-40 minutes

Create a Domain and Certificate:

    • Follow the instructions here to create a certificate with ACM.
    • Follow the instructions here to validate your domain ownership for your certificate.
  1. Verify that a hosted zone has been created in the Route 53 → Hosted zones service area:
Route 53 Hosted Zone record

Figure 3: AWS Route 53 Hosted Zone entry

  2. If you need to create a new certificate, complete the following steps to create and verify a certificate using the domain you just created:

    • Go to Certificate Manager (ACM) -> Request a certificate
    • Select “Request a public certificate”
    • In Fully qualified domain name, enter the domain you registered in Route 53 above (for example, <username>.people.<company>.com)
    • [Optional] Select Add another name to this certificate and add a subdomain (for example, dashboard.<username>.people.<company>.com)
    • Select Request

Otherwise, you may import an existing certificate by using Certificate Manager (ACM) -> Import a certificate option:

Route 53 Import Certificate

Figure 4: AWS Route 53 Import Certificate

  3. Go to your created/imported certificate in AWS Certificate Manager -> Certificates, and select Create records in Route 53:
AWS Certificate in Route 53

Figure 5: Sample AWS Certificate for Route 53 entry

Once created, the DNS records should appear associated with the certificate:

DNS records in Route 53

Figure 6: DNS records in Route 53
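If you prefer to script the certificate request instead of using the console, the following is a minimal boto3 sketch (the domain names are placeholders, and the Region should match your deployment Region); the DNS validation records still need to be created in Route 53 as shown above:

import boto3

# Minimal sketch: request a DNS-validated public certificate for the gateway domain.
# The domain names are placeholders; substitute the domain you registered in Route 53.
acm = boto3.client("acm", region_name="us-east-1")  # use your deployment Region

response = acm.request_certificate(
    DomainName="<username>.people.<company>.com",
    ValidationMethod="DNS",
    SubjectAlternativeNames=["dashboard.<username>.people.<company>.com"],
)
print("Certificate ARN:", response["CertificateArn"])

The returned certificate ARN is the value you will later set as CERTIFICATE_ARN in the .env file.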

Deployment steps

  1. Make sure Docker daemon is running.
  2. Clone the code repository from GitHub: git clone https://github.com/aws-samples/genai-gateway.git
  3. In the terminal prompt, change into the root directory of the cloned repository: cd ./PATH/TO/genai-gateway
  4. Run cp .env.template .env to create the LiteLLM deployment options configuration file from its template.
  5. In the .env config file, set the TERRAFORM_S3_BUCKET_NAME value to a globally unique bucket name. This will store your .tfstate files for your Terraform deployment.
  6. In the .env config file, set the CERTIFICATE_ARN value to the ARN of the certificate you created/imported in the Create a Domain and Certificate section above.
  7. In the .env config file, set the RECORD_NAME value to the subdomain you created in that section above.
  8. In the .env config file, set the HOSTED_ZONE_NAME value to the hosted zone you created in that section above.
  9. In .env config file, fill out values of any API keys you need for any third-party providers. If you only want to use Amazon Bedrock, you can just leave these settings in that file as is:

Below is the .env file sample:

# LITELLM_VERSION eg: v1.56.5
# Get it from https://github.com/BerriAI/litellm/releases
LITELLM_VERSION="v1.61.0"
TERRAFORM_S3_BUCKET_NAME="<unique bucket name>" #Must be globally unique
BUILD_FROM_SOURCE="false"
HOSTED_ZONE_NAME="<userid>.<people>.<company>.com"
RECORD_NAME="dashboard.<userid>.<people>.<company>.com"
CERTIFICATE_ARN="arn:aws:acm:us-east-1:XXXXXXXXXXXX:certificate/9c89bb61-7272-4342-aae4-2018d6a79fe9"
OKTA_ISSUER=""
OKTA_AUDIENCE="api://default"
OPENAI_API_KEY="placeholder"
AZURE_OPENAI_API_KEY="placeholder"
AZURE_API_KEY="placeholder"
ANTHROPIC_API_KEY="placeholder"
GROQ_API_KEY="placeholder"
COHERE_API_KEY="placeholder"
CO_API_KEY="placeholder"
HF_TOKEN="placeholder"
HUGGINGFACE_API_KEY="placeholder"
DATABRICKS_API_KEY="placeholder"
GEMINI_API_KEY="placeholder"
CODESTRAL_API_KEY="placeholder"
MISTRAL_API_KEY="placeholder"
AZURE_AI_API_KEY="placeholder"
NVIDIA_NIM_API_KEY="placeholder"
XAI_API_KEY="placeholder"
PERPLEXITYAI_API_KEY="placeholder"
GITHUB_API_KEY="placeholder"
DEEPSEEK_API_KEY="placeholder"
AI21_API_KEY="placeholder"
LANGSMITH_API_KEY=""
LANGSMITH_PROJECT=""
LANGSMITH_DEFAULT_RUN_NAME=""
DEPLOYMENT_PLATFORM="ECS"
EXISTING_VPC_ID=""
EXISTING_EKS_CLUSTER_NAME=""
DISABLE_OUTBOUND_NETWORK_ACCESS="false"
CREATE_VPC_ENDPOINTS_IN_EXISTING_VPC="false"
INSTALL_ADD_ONS_IN_EXISTING_EKS_CLUSTER="false"
DESIRED_CAPACITY="2" #Number of ECS or EKS instances to run by default (for horizontal scaling)
MIN_CAPACITY="2"
MAX_CAPACITY="4"
ECS_CPU_TARGET_UTILIZATION_PERCENTAGE="50"
ECS_MEMORY_TARGET_UTILIZATION_PERCENTAGE="40"
ECS_VCPUS="2"
EKS_ARM_INSTANCE_TYPE="t4g.medium"
EKS_X86_INSTANCE_TYPE="t3.medium"
EKS_ARM_AMI_TYPE="AL2_ARM_64"
EKS_X86_AMI_TYPE="AL2_x86_64"
PUBLIC_LOAD_BALANCER="true"
RDS_INSTANCE_CLASS="db.t3.micro"
RDS_ALLOCATED_STORAGE_GB="20"
REDIS_NODE_TYPE="cache.t3.micro"
REDIS_NUM_CACHE_CLUSTERS="2" #Number of cache clusters (primary and replicas) the replication group will have
EC2_KEY_PAIR_NAME=""
DISABLE_SWAGGER_PAGE="false"
DISABLE_ADMIN_UI="false"
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_SECRET_KEY=""
LANGFUSE_HOST="" # Optional, defaults to https://cloud.langfuse.com
FAKE_LLM_LOAD_TESTING_ENDPOINT_CERTIFICATE_ARN=""
FAKE_LLM_LOAD_TESTING_ENDPOINT_HOSTED_ZONE_NAME=""
FAKE_LLM_LOAD_TESTING_ENDPOINT_RECORD_NAME=""
  10. By default, this Guidance is deployed with Redis OSS caching enabled and with the most popular model providers enabled. On first run, the deployment automatically uses config/default-config.yaml and copies it to config.yaml. Make sure you enable model access on Amazon Bedrock.
  11. If you want to remove support for certain models, or add more models, you can create and edit your own config/config.yaml file.
  12. Make sure you have valid AWS credentials configured in your CLI environment before running the next step.
    • In the popup, copy the bash/zsh option and paste it into the terminal window where you plan to run the deploy command.
  13. There are different deployment customizations available for the target Amazon ECS or Amazon EKS platform. Review the options below when deciding on your target platform and specify appropriate values in the .env file:

Deployment Configurations

ECS platform - default option

By default, the deployment application platform is set to Amazon ECS with Fargate using the DEPLOYMENT_PLATFORM="ECS" parameter setting in the .env file:

DEPLOYMENT_PLATFORM="ECS"

EKS platform - basic options

If you prefer to deploy the Guidance on Amazon EKS with compute node groups, use the following:

DEPLOYMENT_PLATFORM="EKS"

For a brand new EKS cluster to be provisioned to host the LiteLLM application, leave the EXISTING_EKS_CLUSTER_NAME value empty:

EXISTING_EKS_CLUSTER_NAME=""
  14. After all required settings in the .env file are saved, initiate the Guidance deployment by running the ./deploy.sh script.

For both Amazon ECS and Amazon EKS deployments, the container images are built by the deployment automation. The images are identical for both platforms and are pushed to Amazon ECR repositories named litellm and middleware, respectively, with image version tags based on the following setting in the .env file (also displayed in the output):

LITELLM_VERSION="v1.61.0"

To confirm, verify that container images for the LiteLLM and middleware containers were built and pushed to the ECR repositories, as shown below:

Container images in ECR repo

Figure 7: Built container images for LiteLLM and middleware containers in the ECR registry

  15. After the deployment is complete, you can visit the LiteLLM UI by going to the URL, which is based on the RECORD_NAME you configured earlier.
RECORD_NAME="dashboard.<userid>.<people>.<company>.com"
LiteLLM CloudFormation Stack outputs

Figure 8: LiteLLM CloudFormation Stack outputs

  16. The master API key is stored in Secrets Manager, in the LiteLLMMasterSalt-XXXXXXXX secret under the LITELLM_MASTER_KEY key. This API key can be used to call the LiteLLM API and is also the default password for the LiteLLM UI.

user id: admin, pwd: XXXXXXXXXXXX

LiteLLM Admin login password

Figure 9: LiteLLM Admin login password
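If you need the master key programmatically (for example, in test scripts), the following is a minimal boto3 sketch; it assumes the secret is stored as JSON key/value pairs, and the secret name is the placeholder from the step above:

import json
import boto3

# Minimal sketch: read the LiteLLM master key from Secrets Manager.
# Replace the secret name with the actual name in your account (it starts with LiteLLMMasterSalt).
sm = boto3.client("secretsmanager")

secret_string = sm.get_secret_value(SecretId="LiteLLMMasterSalt-XXXXXXXX")["SecretString"]
master_key = json.loads(secret_string)["LITELLM_MASTER_KEY"]
print(master_key)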

Optional Deployment Platform Configurations

Additional ECS platform deployment options

a. By default, the deployment application platform is set to Amazon ECS using the DEPLOYMENT_PLATFORM="ECS" parameter setting in the .env file:

DEPLOYMENT_PLATFORM="ECS"

b. If you need to deploy the ECS components into an existing VPC in your account, set EXISTING_VPC_ID to your existing VPC ID in the .env file. The VPC is expected to have both private and public subnets. The public subnets should have Auto-assign public IPv4 address set to yes. The VPC should also have at least one NAT Gateway. This Guidance has been tested with the following VPC specification:

  • 2 Availability Zones
  • 1 NAT gateway
  • 1 internet gateway
  • 2 public subnets (auto-assign public IPs enabled, route table pointing to the internet gateway)
  • 2 private subnets (route table pointing to the NAT gateway)
EXISTING_VPC_ID="vpc-XXXXXXXXXXX"

c. If you’d like to run LiteLLM in subnets without outbound internet access, set DISABLE_OUTBOUND_NETWORK_ACCESS="true". Due to lack of internet access, this configuration will only work with Amazon Bedrock or Amazon SageMaker models. Because of this, you must remove all non-Amazon Bedrock/non-SageMaker models from your config/config.yaml file. If you do not do this, LiteLLM will fail to start as it will keep attempting to call third-party models over the internet.

DISABLE_OUTBOUND_NETWORK_ACCESS="true"

d. To manage ECS compute resource target utilization by tasks running containers, use the following settings:

ECS_CPU_TARGET_UTILIZATION_PERCENTAGE="50"
ECS_MEMORY_TARGET_UTILIZATION_PERCENTAGE="40"
ECS_VCPUS="2"

e. ECS compute capacity settings can be configured using the parameters below:

DESIRED_CAPACITY="2"
MIN_CAPACITY="2"
MAX_CAPACITY="4"

Additional Amazon EKS deployment options

f. If you prefer to deploy the Guidance on Amazon EKS, use the following related setting in the .env file:

...
 DEPLOYMENT_PLATFORM="EKS"

For a brand new EKS cluster provisioned to host the LiteLLM application, leave the EXISTING_EKS_CLUSTER_NAME value empty:

...
EXISTING_EKS_CLUSTER_NAME=""

g. If you prefer to deploy a new Amazon EKS cluster into an existing VPC, specify EXISTING_VPC_ID value as shown below:

...
EXISTING_EKS_CLUSTER_NAME=""
EXISTING_VPC_ID="vpc-XXXXXXXXXXXXXX"

In this case, a new Amazon EKS cluster with 2 nodes (t4g.medium for ARM64 or t3.medium for x86, depending on the value of the CPU_ARCHITECTURE parameter or the architecture of the environment where the container images are built) will be created for deployment of the LiteLLM containers.

h. If you want to use an existing Amazon EKS cluster deployed in an existing VPC, set the EXISTING_VPC_ID and EXISTING_EKS_CLUSTER_NAME variables to the VPC ID and Amazon EKS cluster name of the existing cluster:

EXISTING_VPC_ID="vpc-XXXXXXXXXXXXXX"
EXISTING_EKS_CLUSTER_NAME="my-eks-cluster"

If you're deploying into an existing Amazon EKS cluster, it is assumed that the role or user deploying this Guidance has admin-level access to the cluster. In that case, a new node group with 2 nodes (t4g.medium for ARM64 or t3.medium for x86, depending on the value of the CPU_ARCHITECTURE parameter or the architecture of the environment where the container images are built) will be created in your existing Amazon EKS cluster for deployment of the LiteLLM containers.

i. If you set INSTALL_ADD_ONS_IN_EXISTING_EKS_CLUSTER="false", you must have these EKS add-ons pre-installed:

  • Amazon EKS Pod Identity Agent
  • Amazon VPC CNI
  • kube-proxy
  • CoreDNS

Otherwise, set INSTALL_ADD_ONS_IN_EXISTING_EKS_CLUSTER="true" to have them installed.
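If you are unsure which add-ons are already installed on your existing cluster, the following is a quick boto3 sketch (the cluster name is a placeholder) that lists them and compares against the required set:

import boto3

# Minimal sketch: list installed EKS add-ons and report any required ones that are missing.
# Replace the cluster name with your EXISTING_EKS_CLUSTER_NAME value.
eks = boto3.client("eks")

installed = eks.list_addons(clusterName="my-eks-cluster")["addons"]
required = ["eks-pod-identity-agent", "vpc-cni", "kube-proxy", "coredns"]

missing = [addon for addon in required if addon not in installed]
print("Missing add-ons:", missing or "none")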

j. To control the AMI, CPU architecture, and resource settings for EKS compute nodes (in either new or existing clusters), use the following settings:

...
EKS_ARM_INSTANCE_TYPE="t4g.medium"
EKS_X86_INSTANCE_TYPE="t3.medium"
EKS_ARM_AMI_TYPE="AL2_ARM_64"
EKS_X86_AMI_TYPE="AL2_x86_64"

EKS compute node group capacity settings can be set using the parameters below:

DESIRED_CAPACITY="2"
MIN_CAPACITY="2"
MAX_CAPACITY="4"

k. The following setting explicitly sets the target CPU architecture of compute nodes (and container images) for both the Amazon ECS and Amazon EKS orchestrators; it can be set to "x86" or "arm":

CPU_ARCHITECTURE="" #If empty, defaults to the architecture of your deployment machine "x86" or "arm"

l. If you need to use a private load balancer (for example, due to company network security requirements), use the following configuration parameters in the .env file:

...
HOSTED_ZONE_NAME="<userid>.<people>.<company>.com"
RECORD_NAME="dashboard.<userid>.<people>.<company>.com"
...
PUBLIC_LOAD_BALANCER="false"

This configuration results in the creation of a private ALB connected to the LiteLLM containers/ports through the corresponding target groups, with the certificate associated the same way as with a public ALB.

EKS K8s Object Validation

To validate that all expected Kubernetes objects are created in the Amazon EKS cluster, you can run the following kubectl command from the same environment where deploy.sh was run:

kubectl get deploy,po,svc,ingress
---------------------------------------
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/litellm-deployment   1/1     1            1           40m

NAME                                      READY   STATUS    RESTARTS   AGE
pod/litellm-deployment-79487599c5-vh9xx   2/2     Running   0          4m38s

NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/kubernetes        ClusterIP   172.20.0.1   <none>        443/TCP             53m
service/litellm-service   ClusterIP   172.20.1.2   <none>        4000/TCP,3000/TCP   40m

NAME                                        CLASS    HOSTS                                       ADDRESS                                                                  PORTS   AGE
ingress.networking.k8s.io/litellm-ingress   <none>   dashboard.<userid>.<people>.<company>.com   k8s-default-litellmi-a21acabe02-1905940975.us-east-1.elb.amazonaws.com   80      5m31s

Configuration Parameters for RDS and ElastiCache

The following parameters control the size of Amazon RDS database instances, related allocated storage, and Redis Cache node type and number of cache clusters:

...
RDS_INSTANCE_CLASS="db.t3.micro"
RDS_ALLOCATED_STORAGE_GB="20"
REDIS_NODE_TYPE="cache.t3.micro"
REDIS_NUM_CACHE_CLUSTERS="2" #Number of cache clusters (primary and replicas) the replication group will have

Settings that disable LiteLLM API Swagger page and/or Admin UI

To avoid potential security warnings for publicly exposed Swagger and/or Admin UI pages (see details below), the following settings are available. When set to "true", the LiteLLM Swagger and/or Admin UI pages will not be created or rendered:

...
DISABLE_SWAGGER_PAGE="false"
DISABLE_ADMIN_UI="false"

Using Public/Private Load Balancer configurations

When the LiteLLM application platform is deployed into an existing VPC and a private load balancer is required to limit public access, a specific configuration can be used to achieve that.

The default and use-case-specific configurations are as follows:

  • If PUBLIC_LOAD_BALANCER="true", the deployment assumes that your public hosted zone already exists and imports it.
  • If PUBLIC_LOAD_BALANCER="false" and you pass in an EXISTING_VPC_ID (meaning the VPC exists and will be used for the ECS/EKS deployment), the deployment assumes that your private hosted zone already exists and imports it. If that is not the case, set CREATE_PRIVATE_HOSTED_ZONE_IN_EXISTING_VPC="true" and the deployment will create the hosted zone and associate it with your existing VPC.
  • If PUBLIC_LOAD_BALANCER="false" and you don't pass in an EXISTING_VPC_ID, the deployment will create the hosted zone for you and associate it with your new VPC.
  • To access the private load balancer from a bastion EC2 host, you can specify an existing key pair name in the EC2_KEY_PAIR_NAME parameter.

To make it more convenient to access this private load balancer, a script is provided to deploy a Windows EC2 instance in the same VPC, as described below.

Setting up EC2 in your VPC and remote desktop access to it

To access LiteLLM when you set PUBLIC_LOAD_BALANCER="false", follow the steps below to provision and use a dedicated "jump host" Amazon EC2 instance:

Prerequisites:

  • AWS CDK

Deploying the EC2

  1. cd to the litellm-private-load-balancer-ec2 folder
  2. cp .ec2.env.template .ec2.env
  3. In .ec2.env, set KEY_PAIR_NAME to a key pair that you have the private .pem file for. If you don’t have a key pair, create one
  4. cd back to the root directory
  5. Run ./create-ec2-to-access-private-load-balancer.sh

Remote Desktop Setup

  1. Download Microsoft Remote Desktop for Mac.
  2. Open Microsoft Remote Desktop, and select the “+” button in the top menu, then select “Add PC”.
  3. In the “PC name” field under the General tab, enter localhost:13389.

  4. Select “Add” to add your remote desktop configuration.
  5. To get the EC2 instance password, follow these steps:
  • Go to the AWS Console, and locate the EC2 instance named “WindowsBrowserInstance”.
  • Select the instance and select “Connect”.
  • Go to the “RDP client” tab.
  • Select “Get password” at the bottom.
  • Upload your private .pem file from your keypair (the one specified in .ec2.env in the KEY_PAIR_NAME value).
  • Copy the decrypted password. You’ll need this to connect.
  6. Open a terminal and run the following AWS SSM command to create a port forwarding session. Replace YOUR_INSTANCE_ID with the instance ID from your stack outputs:
aws ssm start-session --target YOUR_INSTANCE_ID --document-name AWS-StartPortForwardingSession --parameters "portNumber=3389,localPortNumber=13389"
  7. Return to Microsoft Remote Desktop and double-click your newly added PC. When prompted, enter Username: Administrator and Password: [the password you retrieved in step 5].

The EC2_KEY_PAIR_NAME setting can be used for SSH access to a bastion host.

Integration with LangFuse

If you have an existing deployment of the popular OSS Langfuse metrics platform, there are configuration parameters for sending LiteLLM metrics to that environment, as shown below:

...
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_SECRET_KEY=""
LANGFUSE_HOST="" # Optional, defaults to https://cloud.langfuse.com

Load Testing

Sometimes there is a need to deploy a "fake" LLM server for OpenAI and Amazon Bedrock requests for the purpose of baselining load/latency testing. Parameters for those settings start with FAKE_LLM_ and are similar to the "real" endpoint settings described above:

...
HOSTED_ZONE_NAME="<userid>.<people>.<company>.com"
RECORD_NAME="dashboard.<userid>.<people>.<company>.com"
CERTIFICATE_ARN="arn:aws:acm:us-east-1:XXXXXXXXXXXX:certificate/XXXXXXXXXXXXXX"
....
#for fake LLM Server endpoint
FAKE_LLM_LOAD_TESTING_ENDPOINT_CERTIFICATE_ARN=""
FAKE_LLM_LOAD_TESTING_ENDPOINT_HOSTED_ZONE_NAME=""
FAKE_LLM_LOAD_TESTING_ENDPOINT_RECORD_NAME=""

There is a create-fake-llm-load-testing-server.sh script in the project root directory that uses these settings to deploy a mock LLM server that returns canned responses.

Note that you need to create a new certificate for FAKE_LLM_LOAD_TESTING_ENDPOINT_CERTIFICATE_ARN with the subdomain that you use as FAKE_LLM_LOAD_TESTING_ENDPOINT_RECORD_NAME (for example, fakellmserver.<userid>.<people>.<company>.com). You can use the same hosted zone for FAKE_LLM_LOAD_TESTING_ENDPOINT_HOSTED_ZONE_NAME that you used for HOSTED_ZONE_NAME.

Also, you will need to add the following to your config.yaml in your models list:

...
  - model_name: fake-openai-endpoint
    litellm_params:
      model: aiohttp_openai/any
      api_base: https://<value-you-put-for-FAKE_LLM_LOAD_TESTING_ENDPOINT_RECORD_NAME>/chat/completions
      api_key: "test"

This will deploy a simple HTTP backend that exposes 3 APIs for use through HTTP POST:

  • /model/{model_id}/converse

    Returns a single JSON response according to the Amazon Bedrock response schema. The model_id is ignored.

  • /chat/completions
  • /v1/chat/completions

    The two completion endpoints return either:

    • A normal (non-streaming) completion with a random 1–3 second delay
    • A streaming response with multiple chunks and random 0.2–0.8 second delays
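Once the mock server is deployed and the fake-openai-endpoint entry is added to config.yaml, you can exercise it through the gateway. The following is a minimal Python sketch (the requests library and the endpoint/key placeholders are assumptions):

import requests  # third-party HTTP client, assumed to be installed

# Minimal smoke test of the mock backend routed through the gateway.
# Replace the endpoint and API key placeholders with your own values.
gateway_url = "https://<Your-Proxy-Endpoint>"
api_key = "<Your-LiteLLM-API-Key>"

resp = requests.post(
    f"{gateway_url}/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "fake-openai-endpoint",  # the model_name added to config.yaml above
        "messages": [{"role": "user", "content": "load test ping"}],
    },
    timeout=30,
)
print(resp.status_code, resp.json())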

To clean up the mock LLM backend, run the delete-fake-llm-load-testing-server.sh script.

Config.yaml (all values pre-populated in Config.yaml, what they do, and what the default values are)

Note: To update this file without having to redeploy the whole stack, you can use the update-litellm-config.sh script.

  • model_list: Within this field, many different models are already configured for you. If you would like to add more models or remove models, edit this field. Some model providers (such as Databricks and Azure OpenAI) will need you to add additional configuration to function, so they are commented out by default.

  • model_name: This is the model’s public name. You can set it to whatever you like. When someone is calling your model using the OpenAI client, they will use this value for the model ID. By default, the model_name is set to the model ID from each provider.

If a model ID is used by two different providers, we instead used <provider>/<model_id> as the model_name. For example, the github provider shares a model ID with the groq provider. So, to avoid the conflict, we use github/llama-3.1-8b-instant instead of just llama-3.1-8b-instant.

  • litellm_params: This is the full list of additional parameters sent to the model. For most models, this will only be model which is the model ID used by the provider. Some providers such as azure need additional parameters, which are documented in config/default-config.yaml.

You can also use this to set default parameters for the model such as temperature and top_p.

You can also use this to override the default Region for a model. For example, if you deploy LiteLLM to us-east-1, but want to use an Amazon Bedrock model in us-west-2, you would set aws_region_name to us-west-2. The parameter to adjust the default Region will vary by LLM provider.

You can also use this to set a weight for an LLM to load balance between two different models with the same model_name.

  • litellm_settings: These are additional settings for the litellm proxy.

  • litellm_settings.cache: Whether to enable or disable prompt caching. Caches prompt results to save time and money for repeated prompts. Set to True by default.

  • litellm_settings.cache_params.type: The type of cache to use. Set to Redis by default. Redis is currently the only cache type that will work out of the box with this deployment.

  • litellm_settings.max_budget: (float) sets max budget in dollars across the entire proxy across all API keys. Note that the budget does not apply to the master key. That is the only exception. Set to an extremely large value by default (1000000000.0).

  • litellm_settings.budget_duration: (str) frequency of budget reset for entire proxy. You can set the duration as seconds (“30s”), minutes (“30m”), hours (“30h”), days (“30d”), months (“1mo”). Set to 1mo (1 month) by default.

  • litellm_settings.max_internal_user_budget: (float) sets default budget in dollars for each internal user. (This doesn’t apply to admins, teams, or the master key). Set to an extremely large value by default (1000000000.0).

  • litellm_settings.internal_user_budget_duration: (str) frequency of budget reset for each internal user. You can set duration as seconds (“30s”), minutes (“30m”), hours (“30h”), days (“30d”), months (“1mo”). Set to 1mo (1 month) by default.

  • litellm_settings.success_callback: Defines where success logs are sent to. Defaults to Amazon S3. Amazon S3 is the only destination that works out of the box with this deployment.

  • litellm_settings.failure_callback: Defines where failure logs are sent to. Defaults to Amazon S3. Amazon S3 is the only destination that works out of the box with this deployment.

  • litellm_settings.service_callback: Defines where service logs (such as Amazon RDS and Redis) are sent to. Defaults to Amazon S3. Amazon S3 is the only destination that works out of the box with this deployment.

  • litellm_settings.s3_callback_params.s3_bucket_name: Defines the bucket where the logs will be sent to. This is automatically populated with the name of an S3 bucket that is created during deployment.

  • litellm_settings.s3_callback_params.s3_region_name: Defines the bucket Region where the logs will be sent to. This is automatically populated with the current Region used during deployment.

Routing
A/B testing and load balancing

To do A/B testing with two different models, do the following:

  1. Define two different models with the same model_name.
  2. Point them each to one of the two different models you want to A/B test.
  3. Set the weight for each of them to determine the percentage of traffic you want going to each model.

Example: Say you’re using OpenAI, but you want to migrate to Anthropic Claude on Amazon Bedrock. You want to send 10% of your traffic there to see if you’re getting comparable speed and quality before you fully commit. You can do so like this in your config.yaml:

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      weight: 9
  - model_name: gpt-4o
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
      weight: 1

You can list several models under the same model name, and LiteLLM will automatically distribute traffic between them. If one begins failing, it will fall back to the others.

Example:

model_list:
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620

To see more possible routing options, check the full documentation.

Routing Strategies

By default, this Guidance is configured with usage-based routing. This will always route to the model with the lowest tokens-per-minute (TPM) usage for that minute for a given model_name.

This routing will also respect tpm and rpm (requests per minute) limits and will stop sending traffic to a model if it exceeds that limit.

The following example sets up multiple models to be load balanced between, with different tpm and rpm limits:

model_list:
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 100000
    rpm: 1000
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 200000
    rpm: 2000
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 300000
    rpm: 3000
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620

The routing strategy is configured as follows:

router_settings:
  routing_strategy: usage-based-routing-v2
  enable_pre_call_checks: true

You can explore alternative routing strategies here.

Fallbacks

You can also configure fallbacks for an entire model_name. If all models in a given model_name are failing, you can configure a final fallback.

Let's say you frequently work with Claude 3.5 Sonnet, but occasionally your users perform a query that overwhelms its context window. You can configure a fallback for that scenario. All requests will go to Claude 3.5 Sonnet, but if they are too large, they will go to Gemini, which has a larger context window.

router_settings:
    context_window_fallbacks: [{"anthropic.claude-3-5-sonnet-20240620-v1:0": ["gemini-1.5-pro"]}]

If a model_name fails for any other reason, you can configure a generic fallback.

router_settings:
    fallbacks: [{"gpt-4o": ["anthropic.claude-3-5-sonnet-20240620-v1:0"]}]

Finally, you can set a global fallback for all model_names in case of unexpected failures:

router_settings:
  default_fallbacks: ["anthropic.claude-3-haiku-20240307-v1:0"]

You can find more details here and here.

Guardrails

To set Guardrails for your LLM API calls, follow these steps:

  1. Create a Guardrail in Amazon Bedrock.
  2. Get the Guardrail ID and guardrail version.
  3. Define the Guardrail in your config.yaml, as shown in the example below:
guardrails:
   - guardrail_name: "bedrock-pre-guard"
     litellm_params:
       guardrail: bedrock
       mode: "during_call" # supported values: "pre_call", "post_call", "during_call"
       guardrailIdentifier: ff6ujrregl1q # your guardrail ID on bedrock
       guardrailVersion: "1"         # your guardrail version on bedrock
       default_on: true # enforces the guardrail serverside for all models. Caller does not need to pass in the name of the guardrail for it to be enforced.

If you set default_on to true, the guardrail will be enforced at all times. If you set it to false, enforcement is optional.

If default_on is false, you must specify the guardrail's name in the client call in order to make use of it.

Example:

export GATEWAY_URL=<Your-Proxy-Endpoint>
export GATEWAY_API_KEY=<Your-Master-Key-Or-Admin-Key>

curl -X POST "$GATEWAY_URL/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GATEWAY_API_KEY" \
-d '{
    "model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "messages": [
        {
            "role": "user",
            "content": "prohibited topic"
        }
    ],
    "guardrails": ["bedrock-pre-guard"]
}'
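The same guardrail can also be requested from Python by passing guardrails through the OpenAI client's extra_body parameter; the following is a minimal sketch (the endpoint and key placeholders are assumptions):

import openai  # openai v1.0.0+

# Minimal sketch: request the guardrail by name via extra_body.
client = openai.OpenAI(
    api_key="<Your-Master-Key-Or-Admin-Key>",
    base_url="https://<Your-Proxy-Endpoint>",
)

response = client.chat.completions.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": "prohibited topic"}],
    extra_body={"guardrails": ["bedrock-pre-guard"]},
)
print(response.choices[0].message.content)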

You can find more details on Guardrails here.

Usage Instructions

  1. Using the LiteLLM proxy in your code is practically identical to using OpenAI directly. You just need to set the base_url to your LiteLLM endpoint URL and the API key to one generated from the LiteLLM admin portal, as shown in the sample below:
import openai # openai v1.0.0+
client = openai.OpenAI(api_key="<API KEY>", base_url="https://<Your-Proxy-Endpoint>")  # set the proxy as base_url
response = client.chat.completions.create(model="anthropic.claude-3-5-sonnet-20240620-v1:0", messages=[
    {
        "role": "user",
        "content": "this is a test request, write a short poem about ocean"
    }
])

print(response)

which should return a response such as the following:

liteLLM response ChatCompletion(id='chatcmpl-c496dbe7-610e-4bbf-8a50-b70f85725c13', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Here's a short poem about the ocean:\n\nVast and blue, the ocean's might,\nWaves that dance in day and night.\nSecrets deep beneath the foam,\nWhere countless creatures make their home.\nSalty breeze and sandy shore,\nThe sea's allure forevermore.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1739470642, model='us.anthropic.claude-3-5-sonnet-20240620-v1:0', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=69, prompt_tokens=19, total_tokens=88, completion_tokens_details=None, prompt_tokens_details=None), session_id='b0ca2d23-cabe-4f5e-97c7-0b2729d4386d')
  2. Test the LiteLLM deployment with a curl command:
curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
  -d '{ "user_email": "new_user@example.com", "user_role": "internal_user" }'
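The same request can be made from Python; the following is a minimal sketch using the requests library (assumed to be installed):

import requests

# Minimal sketch: create a new internal user via the LiteLLM API, mirroring the curl call above.
resp = requests.post(
    "https://<Your-Proxy-Endpoint>/user/new",
    headers={"Authorization": "Bearer <Your-Master-Key-Or-Admin-Key>"},
    json={"user_email": "new_user@example.com", "user_role": "internal_user"},
    timeout=30,
)
print(resp.json())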

LiteLLM Admin UI access and usage:

  1. Go to the base URL of your LiteLLM deployment (for example, https://dashboard....com/). You can see the API documentation on that page:
LiteLLM base URL

Figure 10: LiteLLM Base API URL

  2. Log in to the LiteLLM Admin interface using the /ui path appended to the dashboard URL above, such as https://dashboard.<userid>.<people>.<company>.com/ui

login with Username: admin, pwd:

NOTE: For both Amazon ECS and Amazon EKS deployments, the admin user's password is the value of the LITELLM_MASTER_KEY key in the generated Secrets Manager secret whose name starts with LiteLLMSecret.

LiteLLM base URL

Figure 11: LiteLLM admin portal authentication

  3. Navigate to the Test Key tab, choose API Key Source as Current UI Session, and select the desired model. You can chat with the model, as shown below:
LiteLLM Test Key UI

Figure 12: LiteLLM Test Key UI

  4. Go to Virtual Keys and use the Create New Key menu item, as shown below for your own user and another user:
LiteLLM Virtual Key - Same User

Figure 13: LiteLLM Create Virtual Key - Same User

LiteLLM Virtual Key - Different User

Figure 14: LiteLLM Create Virtual Key - Different User

  5. Navigate to Teams → Create New Team to create a new team, as shown below:
LiteLLM - Create new Team

Figure 15: LiteLLM - Create new Team

  6. Add members to this team using Teams → Add member:
LiteLLM - adding members to the Team

Figure 16: LiteLLM - Add members to the Team

  7. Verify that the new team is created and members are added to that team:
LiteLLM - Team with added members

Figure 17: LiteLLM - Team members added

  8. You can then create a virtual key for testing API calls to the models that the team has access to, as shown above for an individual user:
LiteLLM - Create Virtual key for All Team models

Figure 18: LiteLLM Create Virtual Key - Team models

Amazon Bedrock interface

This deployment has a middleware layer that allows you to use the Amazon Bedrock interface (via the boto3 library) instead of the OpenAI interface. This requires overriding the endpoint_url and injecting your API key into the Authorization header of the request. Below is an example script showing how to do this.

Set the required environment variables:

export API_ENDPOINT="your-bedrock-endpoint" #Should be https://<Your-Proxy-Endpoint>/bedrock
export API_KEY="your-api-key" #Should be your litellm api key you normally use
export AWS_REGION="your-region" #Should be your deployment region

Install dependencies:

pip install boto3

Run the script below. It is a basic example of initializing and using the boto3 client with the gateway:

import boto3
import os
from botocore.client import Config
from botocore import UNSIGNED

def create_bedrock_client():
    """
    Creates a Bedrock client with custom endpoint and authorization header.
    Uses environment variables for configuration.

    Required environment variables:
    - API_ENDPOINT: Custom Bedrock endpoint URL
    - API_KEY: Authorization bearer token
    - AWS_REGION: AWS region

    Returns:
        boto3.client: Configured Bedrock client
    """

    # Get configuration from environment variables
    endpoint = os.getenv("API_ENDPOINT")
    api_key = os.getenv("API_KEY")
    region = os.getenv("AWS_REGION")

    if not all([endpoint, api_key, region]):
        raise ValueError(
            "Missing required environment variables: API_ENDPOINT, API_KEY, AWS_REGION"
        )

    # Initialize session and configure client
    session = boto3.Session()
    client_config = Config(
        signature_version=UNSIGNED,  # Disable SigV4 signing
        retries={"max_attempts": 10, "mode": "standard"},
    )

    # Create the Bedrock client
    client = session.client(
        "bedrock-runtime",
        endpoint_url=endpoint,
        config=client_config,
        region_name=region,
    )

    # Define authorization header handler
    def add_authorization_header(request, **kwargs):
        request.headers["Authorization"] = f"Bearer {api_key}"

    # Register the event handler
    client.meta.events.register("request-created.*", add_authorization_header)

    return client

bedrock_client = create_bedrock_client()
messages = [{"role": "user", "content": [{"text": "Create a list of 3 pop songs."}]}]
model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
response = bedrock_client.converse(modelId=model_id, messages=messages)
print(response)

Amazon Bedrock Managed Prompts

The middleware layer also has support for Amazon Bedrock Managed Prompts. It works the same way as documented here.

You can use a managed prompt such as the following:

model_id = "arn:aws:bedrock:us-west-2:XXXXXXXXXXXXXX:prompt/6LE1KDKISG" #Put the arn of your prompt as the model_id
response = client.converse(
    modelId=model_id,
    promptVariables={ #specify any variables you need for your prompt
        "topic": {"text": "fruit"},
    })

The OpenAI Interface also has support for Amazon Bedrock Managed Prompts.

You can use a managed prompt, such as the following:

model = "arn:aws:bedrock:us-west-2:XXXXXXXXXXXXXX:prompt/6LE1KDKISG:2" #Put the arn of your prompt as the model_id

response = client.chat.completions.create(
    model=model,
    messages=[], #Messages is required to be passed in, but it will not be used. Your managed prompt will be used instead
    stream=False,
    extra_body={"promptVariables": {"topic": {"text": "fruit"}}},
)
return response.choices[0].message.content

Chat History

The middleware layer of this Guidance’s architecture also supports chat history, using a session_id.

A session_id is tied to a specific API key, and only that API key can access the chat history associated with the session. Future versions may make an exception for admins, or allow a single user to own a session_id across multiple API keys.

To use this with the OpenAI Interface when not using streaming, do the following:

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=False,
    extra_body={"enable_history": True},
)

session_id = response.model_extra.get("session_id")

response_2 = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt2}],
    stream=False,
    extra_body={"session_id": session_id},
)

The session_id is returned as part of the response.model_extra dictionary. Pass that session_id in the extra_body parameter to continue the same conversation.

To use this with the OpenAI Interface with streaming, do the following:

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
    extra_body={"enable_history": True},
)

session_id = None
first_chunk = True

for chunk in stream:
    # Get session_id from the first chunk
    if first_chunk:
        session_id = getattr(chunk, "session_id", None)
        first_chunk = False

    # Do normal processing on all chunks

stream2 = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt2}],
    stream=True,
    extra_body={"session_id": session_id},
)

The session_id is returned as part of the first chunk of the response stream. Pass that session_id in the extra_body parameter to continue the same conversation.

To use this with the Amazon Bedrock interface, do the following:

response = client.converse(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": message}]}],
    additionalModelRequestFields={"enable_history": True},
)

session_id = response["ResponseMetadata"]["HTTPHeaders"].get("x-session-id")

response2 = client.converse(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": message2}]}],
    additionalModelRequestFields={"session_id": session_id},
)

The session_id is returned as a header in response["ResponseMetadata"]["HTTPHeaders"]. Pass that session_id in the additionalModelRequestFields parameter to continue the same conversation.

The approach with the Amazon Bedrock interface and streaming is identical; it is included here for convenience:

response = client.converse_stream(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": message}]}],
    additionalModelRequestFields={"enable_history": True},
)

session_id = response["ResponseMetadata"]["HTTPHeaders"].get("x-session-id")

response2 = client.converse_stream(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": message2}]}],
    additionalModelRequestFields={"session_id": session_id},
)

As before, the session_id is returned as a header in response["ResponseMetadata"]["HTTPHeaders"]; pass that session_id in the additionalModelRequestFields parameter to continue the same conversation.

You can get the chat history for a given session_id by calling POST /chat-history for history in OpenAI format, or POST /bedrock/chat-history for history in Amazon Bedrock Converse API format, as in the following:

import requests

# Common headers, including authorization
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

# Request body containing session_id
payload = {"session_id": session_id}

# Endpoint to get chat history in OpenAI format
openai_url = f"{base_url}/chat-history"
response_openai = requests.post(openai_url, json=payload, headers=headers)

if response_openai.status_code == 200:
    print("OpenAI Format History:")
    print(response_openai.json())
else:
    print("Failed to retrieve OpenAI format history")
    print("Status code:", response_openai.status_code)
    print("Response:", response_openai.text)

# Endpoint to get chat history in Bedrock format
bedrock_url = f"{base_url}/bedrock/chat-history"
response_bedrock = requests.post(bedrock_url, json=payload, headers=headers)

if response_bedrock.status_code == 200:
    print("\nBedrock Format History:")
    print(response_bedrock.json())
else:
    print("Failed to retrieve Bedrock format history")
    print("Status code:", response_bedrock.status_code)
    print("Response:", response_bedrock.text)

You can get all session IDs for an API key by calling POST /session-ids, as in the following:

endpoint = f"{base_url}/session-ids"
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

response = requests.post(endpoint, headers=headers, json={})
if response.status_code == 200:
    print("Success!")
    print("Session IDs:", response.json().get("session_ids", []))
else:
    print(f"Error: {response.status_code}")
    print(response.text)

Best Practices for Amazon Bedrock Integration

When integrating Amazon Bedrock with Guidance for Multi-Provider Generative AI Gateway on AWS, following these best practices can help optimize performance, cost, and operational efficiency:

  1. Cross-Region Inference Optimization: Implement cross-Region inference profiles to:
  • Manage unexpected traffic bursts effectively
  • Distribute workloads across multiple AWS Regions
  • Mitigate service quota limitations
  • Handle peak usage scenarios seamlessly
  • Achieve higher throughput and better resilience

Configure the cross-Region inference profile ID in LiteLLM’s config.yaml file. For example, use us.anthropic.claude-3-5-sonnet-20240620-v1:0 instead of the foundation model ID, as in the sketch below.
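A minimal, illustrative config.yaml entry is shown below; the model alias and Region value are assumptions, and only the bedrock/<inference-profile-ID> value reflects the cross-Region profile discussed above:

model_list:
  - model_name: claude-3-5-sonnet                                    # alias exposed by the gateway (assumption)
    litellm_params:
      model: bedrock/us.anthropic.claude-3-5-sonnet-20240620-v1:0    # cross-Region inference profile ID
      aws_region_name: us-east-1                                     # Region requests are sent from (assumption)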

Before implementing cross-Region inference profiles, it’s crucial to review and configure your AWS Identity and Access Management (IAM) policies and Service Control Policies (SCPs) appropriately. Your IAM policies must explicitly grant access to the Amazon Bedrock service across all Regions that are included in your cross-Region profile configuration. This includes permissions for model invocation in each specified Region, as well as the necessary IAM actions required for cross-Region operations. Without proper IAM and SCP configurations, your API calls will fail, even if the inference profile is correctly set up in your LiteLLM configuration. Work with your AWS security team to ensure that any organizational SCPs or permission boundaries don’t inadvertently restrict Amazon Bedrock access in any of the target regions.

  2. Application Inference Profiles: Create and manage application inference profiles to track costs and Amazon Bedrock model usage effectively, as illustrated in the sketch below.
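As one possible approach, an application inference profile can be created with the Amazon Bedrock control-plane API. The sketch below assumes a recent boto3 version that includes the CreateInferenceProfile operation; the profile name, tag values, and source model ARN are placeholders:

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

# Create an application inference profile that copies its routing from a foundation
# model (or from a cross-Region inference profile ARN) and tags it for cost tracking.
response = bedrock.create_inference_profile(
    inferenceProfileName="genai-gateway-claude-3-5-sonnet",
    description="Tracks cost and usage for the GenAI gateway workload",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"
    },
    tags=[{"key": "project", "value": "genai-gateway"}],
)
print(response["inferenceProfileArn"])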

  3. Quota Management and Scaling: Actively manage service quotas for Amazon Bedrock:

  • Regularly review current quota utilization for Amazon Bedrock APIs (see the sketch after this list).
  • Monitor default quotas through AWS Service Quotas console.
  • Plan for quota increases based on usage patterns.

Request quota adjustments proactively:

  • Use Service Quotas for adjustable limits.
  • Submit limit increase forms for non-adjustable quotas.
  • Work with your AWS account manager for on-demand model invocation quota increases.
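To support the quota review steps above, the applied Amazon Bedrock quotas can also be listed programmatically through the Service Quotas API. The sketch below is illustrative and assumes the caller has servicequotas:ListServiceQuotas permissions:

import boto3

# List the applied quotas for the Amazon Bedrock service in one Region
quotas = boto3.client("service-quotas", region_name="us-east-1")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]} (adjustable: {quota["Adjustable"]})')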

Uninstall the Guidance

To avoid charges for AWS resource usage, you can uninstall Guidance for Multi-Provider Generative AI Gateway on AWS by running the ./undeploy.sh script located in the source code root folder (the same folder as the ./deploy.sh script). The uninstallation order of AWS services mirrors the installation order implemented by the deployment script and should result in a complete cleanup of all deployed services. You can also delete AWS resources in the AWS Management Console by manually deleting the corresponding AWS CloudFormation stacks (names starting with LiteLLM.. and trackingStack) along with their dependencies, but the automated uninstallation is recommended.

Contributors

  • Daniel Zilberman, WWSO Sr Solutions Architect, AWS Tech Solutions Team
  • Sreedevi Velagala, WWSO Solutions Architect, AWS Tech Solutions Team
  • Michael Rodriguez, SDE, AWS Telco Engineering
  • Dhawal Patel, Technical Lead, AWS Principal Generative AI Architect

Notices

Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.