Introducing the AMD Inference Server

Sep 22, 2022

Overview

The AMD Xilinx Inference Server is a fast new way to deploy your Vitis™ AI XModels for inference. You no longer need to write custom logic with the Vitis AI Runtime libraries for each XModel. Instead, you can use the Vitis AI tools to compile and prepare your XModel (or grab a trained one from the Vitis AI Model Zoo), and then use the Inference Server to make the XModel available to service inference requests. You can make these requests with the included Python API, which provides methods to load your XModel and run an inference without touching any C++. In addition to ease of use, the Inference Server provides a high-performance, scalable way to use all the FPGAs on your machine, or even across your cluster with Kubernetes and KServe. In the future, we plan to support other machine learning frameworks and even GPUs to create an all-in-one solution for heterogeneous machine learning inference.

The AMD Xilinx Inference Server is open source on GitHub and under active development. Clone the repository and try it out! The documentation explains how to set up the environment and walks through some examples.

How to Start

Say you want to run inference on a trained ResNet50 model with your Alveo™ U250 data center accelerator card. You’re in luck: a trained XModel for this platform is already available in the Vitis AI Model Zoo. But before you can use the Inference Server, you need to prepare your host and board. Follow the instructions in the Vitis AI repository to install the Xilinx Runtime (XRT), the AMD Xilinx Resource Manager (XRM), and the target platform on the Alveo card. Once your host and card are set up, you’re ready to use the server. Note that the following example and instructions are adapted from the documentation, which will always have the most up-to-date version.

$ git clone https://github.com/Xilinx/inference-server.git

$ cd inference-server

$ ./proteus dockerize

First, we clone the repository and build the Docker image to run the server. The resulting Docker container contains all the dependencies to build, test and run the Inference Server. By using containers, we can easily run the server and deploy it onto clusters.

$ ./proteus run --dev

Once the container is built, we can start it by using this command. This will start the container, mount our local directory into it for development, pass along any FPGAs on the host, and drop us into a terminal in the container. The rest of these instructions are run inside the container.

$ proteus build --all

In the container, we can build the server executable. Once the executable is built, we’re ready to use it for inference. One easy way to do this is using a Python script, which we break down next.

import proteus

To simplify interacting with the server from Python, we provide a Python library that we can import into our script.

server = proteus.Server()
client = proteus.RestClient("127.0.0.1:8998")
server.start()
client.wait_until_live()

Next, we can create our server and client. We point our client to the address where the server is running (by default, the server will be running on the localhost at port 8998). Then, we can start our server and let our client wait until the server is live.
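If you want to check liveness without the Python library, the same question can be asked over plain HTTP. The sketch below assumes the server exposes the KServe v2 health endpoint at `/v2/health/live`; that path is an assumption and is not stated in this article, so confirm it against the documentation. The helper names `health_url` and `is_live` are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(address: str) -> str:
    """Build the liveness URL for a server at host:port.

    The /v2/health/live path is an assumption based on the KServe v2 protocol.
    """
    return f"http://{address}/v2/health/live"


def is_live(address: str, timeout_s: float = 1.0) -> bool:
    """Return True if the assumed liveness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(address), timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as not live.
        return False
```

Calling `is_live("127.0.0.1:8998")` would then report whether a server at the default address from this example is answering.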

parameters = {"xmodel": path_to_xmodel}
response = client.load("Xmodel", parameters)
worker_name = response.html
while not client.model_ready(worker_name):
    pass

Since we want to run the ResNet50 XModel, we load the XModel worker and pass it the path to the XModel we downloaded from the Vitis AI Model Zoo. The server responds back with an endpoint that we can use for subsequent interactions with this worker. We then wait until the worker is ready.
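The readiness loop shown here spins without pausing and never gives up. One possible alternative is a small polling helper with a sleep interval and a timeout. This is a minimal sketch, not part of the `proteus` library: `wait_until` and its parameters are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.1,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between polls so the loop does not saturate a CPU core.
        time.sleep(interval_s)
```

With the names from this example, the wait could be written as `wait_until(lambda: client.model_ready(worker_name))`, with the caller deciding what to do if it returns `False`.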

images = []
for _ in range(batch_size):
    images.append(path_to_image)
images = preprocess(images)
request = proteus.ImageInferenceRequest(images, True)
response = client.infer(worker_name, request)

Now, we’re ready to make an inference. We can prepare a batch of images to send to the server and preprocess them in Python using custom logic. Finally, we can prepare the request using the preprocessed images and send it to the server for inference. The response can then be parsed, postprocessed, and evaluated.
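The article does not specify the response format, so the postprocessing step is left open. As one illustration, suppose you have already extracted a flat list of per-class scores for a single image from the response (an assumption, as is whether the model emits raw scores or probabilities). A top-k classification step could then be sketched as follows; `softmax` and `top_k` are hypothetical helpers, not part of the `proteus` API.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores: List[float], k: int = 5) -> List[Tuple[int, float]]:
    """Return the k (class_index, probability) pairs with the highest probability."""
    probs = softmax(scores)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The class indices returned by `top_k` would still need to be mapped to labels using whatever label file accompanies the model you downloaded.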

Next Steps

The example above shows the basic method of interacting with the AMD Xilinx Inference Server. Check out the documentation to learn more about automatic batching, the C++ API, deploying on a cluster, user-defined parallelism, and running end-to-end inferences. Stay tuned to the AMD Xilinx Inference Server repository for future updates!

ARTICLE BY:

Bingqing Guo
