Create a Managed Inference Job (web interface)

Create a Managed Inference Job in the CosmicAC web interface, then call your model.

Create a Managed Inference Job in the CosmicAC web interface. You set the basics, select a model, configure the endpoint and hardware, then launch the job. The Job configuration reference describes every field.

Prerequisites

You need the following before you start:

A running CosmicAC deployment. See Installation.
Access to the CosmicAC web interface.

Steps

Open the new job form

On the Models page, click Deploy model.

Enter the basics

In the Basics section, fill in the Job name, Location, and Tags.

Select the job type

In the What kind of job? section, select Managed Inference.

Select a model

In the Model to serve section, select a Model. The configuration fields depend on the model's runtime.

For a vLLM model, set the Serving configuration:

Runtime image (CUDA) — the vLLM serving image.
Data type — the numeric precision the model runs at.
Quantisation — how to compress the model weights.
Tensor parallel — how many GPUs to split the model across.
GPU memory utilization — the fraction of GPU memory to use.
Max model length — the maximum context length.
Max concurrent sequences — the maximum requests handled at once.
Reasoning parser — the parser for reasoning output.
Video & image input — whether the model accepts multimodal input.
Root disk size — the VM root disk size in GB.

These fields come prefilled with the model master's defaults. Apply the Recommended model parameters for supported models.

For a Parakeet model (nvidia/parakeet-tdt-0.6b-v3), set the Parakeet configuration:

Chunk duration — the audio chunk length in seconds.
Chunk overlap — the overlap between chunks in seconds.
Max file size (MB) — the maximum audio upload size.

To find a vLLM model, browse the Hugging Face model hub or the vLLM supported models list.

Set environment variables

If your inference service needs environment variables, add them under Environment & advanced.

A Parakeet job has no environment variables, so skip this step.

Configure the endpoint

Still in the Model to serve section, set the Endpoint name and Replicas under Endpoint. The endpoint name must be unique across your active inference jobs.

A Parakeet job has no replicas, so set only the Endpoint name.

Require an API key

Under API key required, keep Require Authorization header enabled. With it enabled, callers send an API key to reach the endpoint. See Create an API key.

Choose the hardware

In the Hardware section, select a GPU and the GPU count, then set the CUDA / driver.

Review and create the job

In the Review & launch section, confirm the job spec is valid, then click Create job.

Open the endpoint

Wait for the job to go live, then click Open endpoint.

Call your model

Copy the endpoint URL, then send a request with your API key in the Authorization header. Use the example request shown on the endpoint as a starting point.