Built with developer experience in mind, Tensorkube simplifies the process of deploying serverless GPU apps. In this guide, we will walk you through the process of deploying the jina-embeddings-v2-base-code model on your private cloud.

Jina-embeddings-v2-base-code is a multilingual embedding model that supports English and 30 widely used programming languages.

Prerequisites

Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying Jina with Tensorfuse

Each Tensorkube deployment requires two things - your code and your environment (as a Dockerfile). When deploying machine learning models, it is beneficial to make the model a part of your container image, as this reduces cold-start times by a significant margin.

We are using the Hugging Face Text Embeddings Inference (TEI) toolkit so that our model utilises the full GPU capacity. You can try any of the supported models here.

Code files

We will use an nginx server to start our app. We will configure the /readiness endpoint to return a 200 status code. Remember that Tensorfuse uses this endpoint to check the health of your deployment.

The Hugging Face TEI toolkit serves embeddings at the /embed endpoint, so we route all other requests to the TEI server, which runs on port 8000.

nginx.conf
worker_processes  auto;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    keepalive_timeout  65;

    server {
        listen 80;

        client_max_body_size 200M;

        location /readiness {
            return 200 'true';
            add_header Content-Type text/plain;
        }

        location / {
            # You may need to adjust this if your application is not running on localhost:8000
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}

Environment files (Dockerfile)

Next, create a Dockerfile. Given below is a simple Dockerfile that you can use:

Dockerfile
# Use the Hugging Face TEI base image (the 86-1.5 tag targets compute capability 8.6 GPUs such as the A10G)
FROM ghcr.io/huggingface/text-embeddings-inference:86-1.5

# Install Python 3.10, Nginx, and other dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-dev \
    python3-pip \
    nginx \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.10 as the default Python version
RUN ln -s /usr/bin/python3.10 /usr/bin/python

# Upgrade pip and install the requests library
RUN pip3 install --no-cache-dir --upgrade pip && pip3 install --no-cache-dir requests

# Copy the Nginx configuration file
COPY nginx.conf /etc/nginx/nginx.conf

# Expose port 80 for Nginx
EXPOSE 80

# Start both Nginx and the text-embeddings-router
CMD ["sh", "-c", "nginx && text-embeddings-router --json-output --max-batch-tokens 163840 --model-id jinaai/jina-embeddings-v2-base-code --port 8000"]

Deploying the app

Jina is now ready to be deployed on Tensorkube. Navigate to your project root and run the following command:

tensorkube deploy --gpus 1 --gpu-type a10g

The Jina embedding model is now deployed on your AWS account. You can access your app at the URL provided in the output or by using the following command:

tensorkube list deployments

And that’s it! You have successfully deployed the Jina embedding model on serverless GPUs using Tensorkube. 🚀

To test it out, run the following command, replacing the URL with the one provided in the output:


curl -X POST -H "Content-Type: application/json" -d '{"inputs":"&int x = &y"}' <YOUR_URL_HERE>/embed

You can also send multiple inputs in a batch. For example,


curl <YOUR_URL_HERE>/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'
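
The response is a JSON array containing one embedding vector per input. As a quick sanity check, you can inspect the batch size and embedding dimension with jq (assuming jq is installed locally); for the request above this should print 2 followed by 768, the embedding dimension of jina-embeddings-v2-base-code:

curl -s <YOUR_URL_HERE>/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json' \
    | jq 'length, (.[0] | length)'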

You can also use the readiness endpoint to wake up your nodes if you are expecting incoming traffic:

curl <YOUR_APP_URL_HERE>/readiness
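
If you want to block until the deployment has scaled up, for example before a batch job starts sending traffic, a simple polling loop like the sketch below works; the 5-second interval is an arbitrary choice:

until curl -sf <YOUR_APP_URL_HERE>/readiness > /dev/null; do
    echo "Waiting for the deployment to become ready..."
    sleep 5
done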