Interact with a Model using vLLM

Running a vLLM Server Service

Interactive services in Transformer Lab allow you to launch environments like vLLM Server to interact with models directly from the interface. This guide walks you through running a vLLM Server service.

Prerequisites

Before running a vLLM Server service, ensure you have a Compute Provider set up and active.

  1. Navigate to Team Settings and set up a Compute Provider.

  2. Make sure the provider is active by clicking on the health button.

Steps to Run a vLLM Server Service

  1. Go to the Interact page in Transformer Lab.

  2. Click on the "New" button to create a new interactive service.

  3. Select "vLLM Server" as the type of interactive service to launch.

  4. Configure the service:

    • Enter a name for the service.
    • Select the Compute Provider to use.
    • Specify the resources: CPU, memory, and GPUs.
    • Provide the Hugging Face ID of the model to use.
    • Set the tensor parallel size for the vLLM engine (typically the number of GPUs allocated to the service).
    • Provide any additional inputs if required (e.g., ngrok auth token for tunneling).
  5. Click "Launch" to start the vLLM Server service.

  6. Once launched, a card will appear for the service. Click the "Interactive Setup" button on the card.

  7. Follow the provided URL or steps to access and interact with the model via the vLLM Server. The sketches below show one way to verify the server and send a first request.
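
Before sending any prompts, you can confirm the launched engine is reachable: vLLM's OpenAI-compatible server exposes a /health endpoint that returns HTTP 200 when the server is up. A minimal sketch, assuming the service URL from step 7 (the `<your-service-url>` placeholder is hypothetical):

```python
# Minimal liveness check against the vLLM server (sketch).
# Replace <your-service-url> with the URL from the service card in step 7.
import requests

resp = requests.get("https://<your-service-url>/health", timeout=10)
print("vLLM server is up" if resp.status_code == 200 else f"Unexpected status: {resp.status_code}")
```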
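
From there, any OpenAI-compatible client can talk to the model, since vLLM serves the standard /v1 chat completions API. A minimal sketch using the `openai` Python package, assuming the service URL from step 7 and the Hugging Face model ID entered in step 4 (both placeholders are hypothetical):

```python
# Sketch: send a chat request to the vLLM server's OpenAI-compatible API.
# Substitute the URL from step 7 and the Hugging Face ID from step 4.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-service-url>/v1",
    api_key="EMPTY",  # vLLM accepts any key unless one was configured
)

response = client.chat.completions.create(
    model="<huggingface-model-id>",  # the ID entered in step 4
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

The same endpoint also works with curl or any other OpenAI-compatible SDK, so existing client code can usually point at the service URL without changes.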