Build and deploy AI inference workflows with new enhancements to the Amazon SageMaker Python SDK

Amazon SageMaker Inference has been a popular tool for deploying advanced machine learning (ML) and generative AI models at scale. As AI applications become increasingly complex, customers want to deploy multiple models in a coordinated group that collectively process inference requests for an application. In addition, with the evolution of generative AI applications, many use cases now require inference workflows—sequences of interconnected models operating in predefined logical flows. This trend drives a growing need for more sophisticated inference offerings.

To address this need, we are introducing a new capability in the SageMaker Python SDK that revolutionizes how you build and deploy inference workflows on SageMaker. We will take Amazon Search as an example to show case how this feature is used in helping customers building inference workflows. This new Python SDK capability provides a streamlined and simplified experience that abstracts away the underlying complexities of packaging and deploying groups of models and their collective inference logic, allowing you to focus on what matter most—your business logic and model integrations.

In this post, we provide an overview of the user experience, detailing how to set up and deploy these workflows with multiple models using the SageMaker Python SDK. We walk through examples of building complex inference workflows, deploying them to SageMaker endpoints, and invoking them for real-time inference. We also show how customers like Amazon Search plan to use SageMaker Inference workflows to provide more relevant search results to Amazon shoppers.

Whether you are building a simple two-step process or a complex, multimodal AI application, this new feature provides the tools you need to bring your vision to life. This tool aims to make it easy for developers and businesses to create and manage complex AI systems, helping them build more powerful and efficient AI applications.

In the following sections, we dive deeper into details of the SageMaker Python SDK, walk through practical examples, and showcase how this new capability can transform your AI development and deployment process.

Key improvements and user experience

The SageMaker Python SDK now includes new features for creating and managing inference workflows. These additions aim to address common challenges in developing and deploying inference workflows:

Deployment of multiple models – The core of this new experience is the deployment of multiple models as inference components within a single SageMaker endpoint. With this approach, you can create a more unified inference workflow. By consolidating multiple models into one endpoint, you can reduce the number of endpoints that need to be managed. This consolidation can also improve operational tasks, resource utilization, and potentially costs.
Workflow definition with workflow mode – The new workflow mode extends the existing Model Builder capabilities. It allows for the definition of inference workflows using Python code. Users familiar with the ModelBuilder class might find this feature to be an extension of their existing knowledge. This mode enables creating multi-step workflows, connecting models, and specifying the data flow between different models in the workflows. The goal is to reduce the complexity of managing these workflows and enable you to focus more on the logic of the resulting compound AI system.
Development and deployment options – A new deployment option has been introduced for the development phase. This feature is designed to allow for quicker deployment of workflows to development environments. The intention is to enable faster testing and refinement of workflows. This could be particularly relevant when experimenting with different configurations or adjusting models.
Invocation flexibility – The SDK now provides options for invoking individual models or entire workflows. You can choose to call a specific inference component used in a workflow or the entire workflow. This flexibility can be useful in scenarios where access to a specific model is needed, or when only a portion of the workflow needs to be executed.
Dependency management – You can use SageMaker Deep Learning Containers (DLCs) or the SageMaker distribution that comes preconfigured with various model serving libraries and tools. These are intended to serve as a starting point for common use cases.

To get started, use the SageMaker Python SDK to deploy your models as inference components. Then, use the workflow mode to create an inference workflow, represented as Python code using the container of your choice. Deploy the workflow container as another inference component on the same endpoints as the models or a dedicated endpoint. You can run the workflow by invoking the inference component that represents the workflow. The user experience is entirely code-based, using the SageMaker Python SDK. This approach allows you to define, deploy, and manage inference workflows using SDK abstractions offered by this feature and Python programming. The workflow mode provides flexibility to specify complex sequences of model invocations and data transformations, and the option to deploy as components or endpoints caters to various scaling and integration needs.

Solution overview

The following diagram illustrates a reference architecture using the SageMaker Python SDK.

The improved SageMaker Python SDK introduces a more intuitive and flexible approach to building and deploying AI inference workflows. Let’s explore the key components and classes that make up the experience:

ModelBuilder simplifies the process of packaging individual models as inference components. It handles model loading, dependency management, and container configuration automatically.
The CustomOrchestrator class provides a standardized way to define custom inference logic that orchestrates multiple models in the workflow. Users implement the handle() method to specify this logic and can use an orchestration library or none at all (plain Python).
A single deploy() call handles the deployment of the components and workflow orchestrator.
The Python SDK supports invocation against the custom inference workflow or individual inference components.
The Python SDK supports both synchronous and streaming inference.

CustomOrchestrator is an abstract base class that serves as a template for defining custom inference orchestration logic. It standardizes the structure of entry point-based inference scripts, making it straightforward for users to create consistent and reusable code. The handle method in the class is an abstract method that users implement to define their custom orchestration logic.

class CustomOrchestrator (ABC):
“””
Templated class used to standardize the structure of an entry point based inference script.
“””

@abstractmethod
def handle(self, data, context=None):
“””abstract class for defining an entrypoint for the model server”””
return NotImplemented

With this templated class, users can integrate into their custom workflow code, and then point to this code in the model builder using a file path or directly using a class or method name. Using this class and the ModelBuilder class, it enables a more streamlined workflow for AI inference:

Users define their custom workflow by implementing the CustomOrchestrator class.
The custom CustomOrchestrator is passed to ModelBuilder using the ModelBuilder inference_spec parameter.
ModelBuilder packages the CustomOrchestrator along with the model artifacts.
The packaged model is deployed to a SageMaker endpoint (for example, using a TorchServe container).
When invoked, the SageMaker endpoint uses the custom handle() function defined in the CustomOrchestrator to handle the input payload.

In the follow sections, we provide two examples of custom workflow orchestrators implemented with plain Python code. For simplicity, the examples use two inference components.

We explore how to create a simple workflow that deploys two large language models (LLMs) on SageMaker Inference endpoints along with a simple Python orchestrator that calls the two models. We create an IT customer service workflow where one model processes the initial request and another suggests solutions. You can find the example notebook in the GitHub repo.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, we host multiple models on the same SageMaker endpoint, so we use two ml.g5.24xlarge SageMaker hosting instances.

Python inference orchestration

First, let’s define our custom orchestration class that inherits from CustomOrchestrator. The workflow is structured around a custom inference entry point that handles the request data, processes it, and retrieves predictions from the configured model endpoints. See the following code:

class PythonCustomInferenceEntryPoint(CustomOrchestrator):
def __init__(self, region_name, endpoint_name, component_names):
self.region_name = region_name
self.endpoint_name = endpoint_name
self.component_names = component_names

def preprocess(self, data):
payload = {
“inputs”: data.decode(“utf-8”)
}
return json.dumps(payload)

def _invoke_workflow(self, data):
# First model (Llama) inference
payload = self.preprocess(data)

llama_response = self.client.invoke_endpoint(
EndpointName=self.endpoint_name,
Body=payload,
ContentType=”application/json”,
InferenceComponentName=self.component_names[0]
)
llama_generated_text = json.loads(llama_response.get(‘Body’).read())[‘generated_text’]

# Second model (Mistral) inference
parameters = {
“max_new_tokens”: 50
}
payload = {
“inputs”: llama_generated_text,
“parameters”: parameters
}
mistral_response = self.client.invoke_endpoint(
EndpointName=self.endpoint_name,
Body=json.dumps(payload),
ContentType=”application/json”,
InferenceComponentName=self.component_names[1]
)
return {“generated_text”: json.loads(mistral_response.get(‘Body’).read())[‘generated_text’]}

def handle(self, data, context=None):
return self._invoke_workflow(data)

This code performs the following functions:

Defines the orchestration that sequentially calls two models using their inference component names
Processes the response from the first model before passing it to the second model
Returns the final generated response

This plain Python approach provides flexibility and control over the request-response flow, enabling seamless cascading of outputs across multiple model components.

Build and deploy the workflow

To deploy the workflow, we first create our inference components and then build the custom workflow. One inference component will host a Meta Llama 3.1 8B model, and the other will host a Mistral 7B model.

from sagemaker.serve import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

# Create a ModelBuilder instance for Llama 3.1 8B
# Pre-benchmarked ResourceRequirements will be taken from JumpStart, as Llama-3.1-8b is a supported model.
llama_model_builder = ModelBuilder(
model=”meta-textgeneration-llama-3-1-8b”,
schema_builder=SchemaBuilder(sample_input, sample_output),
inference_component_name=llama_ic_name,
instance_type=”ml.g5.24xlarge”
)

# Create a ModelBuilder instance for Mistral 7B model.
mistral_mb = ModelBuilder(
model=”huggingface-llm-mistral-7b”,
instance_type=”ml.g5.24xlarge”,
schema_builder=SchemaBuilder(sample_input, sample_output),
inference_component_name=mistral_ic_name,
resource_requirements=ResourceRequirements(
requests={
“memory”: 49152,
“num_accelerators”: 2,
“copies”: 1
}
),
instance_type=”ml.g5.24xlarge”
)

Now we can tie it all together to create one more ModelBuilder to which we pass the modelbuilder_list, which contains the ModelBuilder objects we just created for each inference component and the custom workflow. Then we call the build() function to prepare the workflow for deployment.

# Create workflow ModelBuilder
orchestrator= ModelBuilder(
inference_spec=PythonCustomInferenceEntryPoint(
region_name=region,
endpoint_name=llama_mistral_endpoint_name,
component_names=[llama_ic_name, mistral_ic_name],
),
dependencies={
“auto”: False,
“custom”: [
“cloudpickle”,
“graphene”,
# Define other dependencies here.
],
},
sagemaker_session=Session(),
role_arn=role,
resource_requirements=ResourceRequirements(
requests={
“memory”: 4096,
“num_accelerators”: 1,
“copies”: 1,
“num_cpus”: 2
}
),
name=custom_workflow_name, # Endpoint name for your custom workflow
schema_builder=SchemaBuilder(sample_input={“inputs”: “test”}, sample_output=”Test”),
modelbuilder_list=[llama_model_builder, mistral_mb] # Inference Component ModelBuilders created in Step 2
)
# call the build function to prepare the workflow for deployment
orchestrator.build()

In the preceding code snippet, you can comment out the section that defines the resource_requirements to have the custom workflow deployed on a separate endpoint instance, which can be a dedicated CPU instance to handle the custom workflow payload.

By calling the deploy() function, we deploy the custom workflow and the inference components to your desired instance type, in this example ml.g5.24.xlarge. If you choose to deploy the custom workflow to a separate instance, by default, it will use the ml.c5.xlarge instance type. You can set inference_workflow_instance_type and inference_workflow_initial_instance_count to configure the instances required to host the custom workflow.

predictors = orchestrator.deploy(
instance_type=”ml.g5.24xlarge”,
initial_instance_count=1,
accept_eula=True, # Required for Llama3
endpoint_name=llama_mistral_endpoint_name
# inference_workflow_instance_type=”ml.t2.medium”, # default
# inference_workflow_initial_instance_count=1 # default
)

Invoke the endpoint

After you deploy the workflow, you can invoke the endpoint using the predictor object:

from sagemaker.serializers import JSONSerializer
predictors[-1].serializer = JSONSerializer()
predictors[-1].predict(“Tell me a story about ducks.”)

You can also invoke each inference component in the deployed endpoint. For example, we can test the Llama inference component with a synchronous invocation, and Mistral with streaming:

from sagemaker.predictor import Predictor
# create predictor for the inference component of Llama model
llama_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=llama_ic_name)
llama_predictor.content_type = “application/json”

llama_predictor.predict(json.dumps(payload))

When handling the streaming response, we need to read each line of the output separately. The following example code demonstrates this streaming handling by checking for newline characters to separate and print each token in real time:

mistral_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=mistral_ic_name)
mistral_predictor.content_type = “application/json”

body = json.dumps({
“inputs”: prompt,
# specify the parameters as needed
“parameters”: parameters
})

for line in mistral_predictor.predict_stream(body):
decoded_line = line.decode(‘utf-8’)
if ‘n’ in decoded_line:
# Split by newline to handle multiple tokens in the same line
tokens = decoded_line.split(‘n’)
for token in tokens[:-1]: # Print all tokens except the last one with a newline
print(token)
# Print the last token without a newline, as it might be followed by more tokens
print(tokens[-1], end=”)
else:
# Print the token without a newline if it doesn’t contain ‘n’
print(decoded_line, end=”)

So far, we have walked through the example code to demonstrate how to build complex inference logic using Python orchestration, deploy them to SageMaker endpoints, and invoke them for real-time inference. The Python SDK automatically handles the following:

Model packaging and container configuration
Dependency management and environment setup
Endpoint creation and component coordination

Whether you’re building a simple workflow of two models or a complex multimodal application, the new SDK provides the building blocks needed to bring your inference workflows to life with minimal boilerplate code.

Customer story: Amazon Search

Amazon Search is a critical component of the Amazon shopping experience, processing an enormous volume of queries across billions of products across diverse categories. At the core of this system are sophisticated matching and ranking workflows, which determine the order and relevance of search results presented to customers. These workflows execute large deep learning models in predefined sequences, often sharing models across different workflows to improve price-performance and accuracy. This approach makes sure that whether a customer is searching for electronics, fashion items, books, or other products, they receive the most pertinent results tailored to their query.

The SageMaker Python SDK enhancement offers valuable capabilities that align well with Amazon Search’s requirements for these ranking workflows. It provides a standard interface for developing and deploying complex inference workflows crucial for effective search result ranking. The enhanced Python SDK enables efficient reuse of shared models across multiple ranking workflows while maintaining the flexibility to customize logic for specific product categories. Importantly, it allows individual models within these workflows to scale independently, providing optimal resource allocation and performance based on varying demand across different parts of the search system.

Amazon Search is exploring the broad adoption of these Python SDK enhancements across their search ranking infrastructure. This initiative aims to further refine and improve search capabilities, enabling the team to build, version, and catalog workflows that power search ranking more effectively across different product categories. The ability to share models across workflows and scale them independently offers new levels of efficiency and adaptability in managing the complex search ecosystem.

Vaclav Petricek, Sr. Manager of Applied Science at Amazon Search, highlighted the potential impact of these SageMaker Python SDK enhancements: “These capabilities represent a significant advancement in our ability to develop and deploy sophisticated inference workflows that power search matching and ranking. The flexibility to build workflows using Python, share models across workflows, and scale them independently is particularly exciting, as it opens up new possibilities for optimizing our search infrastructure and rapidly iterating on our matching and ranking algorithms as well as new AI features. Ultimately, these SageMaker Inference enhancements will allow us to more efficiently create and manage the complex algorithms powering Amazon’s search experience, enabling us to deliver even more relevant results to our customers.”

The following diagram illustrates a sample solution architecture used by Amazon Search.

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if the endpoint is no longer required. You can follow the cleanup section the demo notebook or use following code to delete the model and endpoint created by the demo:

mistral_predictor.delete_predictor()
llama_predictor.delete_predictor()
llama_predictor.delete_endpoint()
workflow_predictor.delete_predictor()

Conclusion

The new SageMaker Python SDK enhancements for inference workflows mark a significant advancement in the development and deployment of complex AI inference workflows. By abstracting the underlying complexities, these enhancements empower inference customers to focus on innovation rather than infrastructure management. This feature bridges sophisticated AI applications with the robust SageMaker infrastructure, enabling developers to use familiar Python-based tools while harnessing the powerful inference capabilities of SageMaker.

Early adopters, including Amazon Search, are already exploring how these capabilities can drive major improvements in AI-powered customer experiences across diverse industries. We invite all SageMaker users to explore this new functionality, whether you’re developing classic ML models, building generative AI applications or multi-model workflows, or tackling multi-step inference scenarios. The enhanced SDK provides the flexibility, ease of use, and scalability needed to bring your ideas to life. As AI continues to evolve, SageMaker Inference evolves with it, providing you with the tools to stay at the forefront of innovation. Start building your next-generation AI inference workflows today with the enhanced SageMaker Python SDK.

About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Osho Gupta is a Senior Software Developer at AWS SageMaker. He is passionate about ML infrastructure space, and is motivated to learn & advance underlying technologies that optimize Gen AI training & inference performance. In his spare time, Osho enjoys paddle boarding, hiking, traveling, and spending time with his friends & family.

Joseph Zhang is a software engineer at AWS. He started his AWS career at EC2 before eventually transitioning to SageMaker, and now works on developing GenAI-related features. Outside of work he enjoys both playing and watching sports (go Warriors!), spending time with family, and making coffee.

Gary Wang is a Software Developer at AWS SageMaker. He is passionate about AI/ML operations and building new things. In his spare time, Gary enjoys running, hiking, trying new food, and spending time with his friends and family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Vaclav Petricek is a Senior Applied Science Manager at Amazon Search, where he led teams that built Amazon Rufus and now leads science and engineering teams that work on the next generation of Natural Language Shopping. He is passionate about shipping AI experiences that make people’s lives better. Vaclav loves off-piste skiing, playing tennis, and backpacking with his wife and three children.

Wei Li is a Senior Software Dev Engineer in Amazon Search. She is passionate about Large Language Model training and inference technologies, and loves integrating these solutions into Search Infrastructure to enhance natural language shopping experiences. During her leisure time, she enjoys gardening, painting, and reading.

Brian Granger is a Senior Principal Technologist at Amazon Web Services and a professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. He works at the intersection of UX design and engineering on tools for scientific computing, data science, machine learning, and data visualization. Brian is a co-founder and leader of Project Jupyter, co-founder of the Altair project for statistical visualization, and creator of the PyZMQ project for ZMQ-based message passing in Python. At AWS he is a technical and open source leader in the AI/ML organization. Brian also represents AWS as a board member of the PyTorch Foundation. He is a winner of the 2017 ACM Software System Award and the 2023 NASA Exceptional Public Achievement Medal for his work on Project Jupyter. He has a Ph.D. in theoretical physics from the University of Colorado.