Calling Nbility from Python: Chat, Streaming, and Error Handling

If you want to add AI features to a script, backend job, bot, or web service, the best starting point is not a complex Agent framework. First, make three basics reliable: one normal chat call, one streaming response, and one debuggable error-handling path.

This tutorial shows how to call Nbility’s OpenAI-compatible Chat Completions API from Python. You can keep using the familiar openai Python SDK; the main change is setting base_url to Nbility and keeping the API key in environment variables. Once this foundation works, it can power Dify, Hermes Agent, Telegram bots, internal automations, and your own backend services.

Understand the Call Flow

Figure: Minimal Python-to-Nbility call flow

The minimal flow has four steps:

Store NBILITY_API_KEY, NBILITY_BASE_URL, and NBILITY_MODEL in .env;
Create a Python client with OpenAI(api_key=..., base_url=...);
Send messages with client.chat.completions.create();
Read choices[0].message.content, while logging usage, latency, and errors.

Nbility’s Chat Completions documentation uses POST /v1/chat/completions, a Bearer token in the Authorization header, a request body with model and messages, and stream: true for SSE streaming. The OpenAI Chat Completions API uses the same core structure, so the Python SDK fits naturally.
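To make the wire format concrete, here is a minimal sketch that builds the raw HTTP request described above using only the standard library. The helper name build_chat_request and the placeholder key are illustrative, not part of any SDK, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(api_key: str, base_url: str, model: str, prompt: str,
                       stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # True asks the server for an SSE stream
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-example" is a placeholder, not a real key.
req = build_chat_request("sk-example", "https://api.nbility.dev/v1", "gpt-4o", "Hello")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req, timeout=60) would send it. The openai SDK builds an equivalent request for you, which is why the rest of this tutorial uses the SDK.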

Prepare the Environment

Use a virtual environment so the project remains isolated:

python -m venv .venv
source .venv/bin/activate
pip install openai python-dotenv

Create .env:

NBILITY_API_KEY=[REDACTED]
NBILITY_BASE_URL=https://api.nbility.dev/v1
NBILITY_MODEL=gpt-4o

Keep the real key in local environment variables, server secrets, or CI secrets. Never commit it to Git or expose it in frontend code, screenshots, or logs.
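If you need to confirm in a log or startup message which key is loaded, one option is to print only a masked form. The helper below is a hypothetical sketch (mask_secret is not part of any library), and the sample key is made up.

```python
def mask_secret(value: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret: only the last few characters stay visible."""
    if len(value) <= visible * 2:
        # Too short to reveal anything safely; hide it completely.
        return "*" * len(value)
    return "*" * 8 + value[-visible:]


print(mask_secret("sk-abcdef123456"))
```

This is enough to tell two keys apart during debugging without putting either of them in the log.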

Minimal Chat Call

Create chat_once.py:

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.getenv("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
)

response = client.chat.completions.create(
    model=os.getenv("NBILITY_MODEL", "gpt-4o"),
    messages=[
        {"role": "system", "content": "You are a concise and reliable technical assistant."},
        {"role": "user", "content": "Explain OpenAI-compatible APIs in three sentences."},
    ],
    temperature=0.3,
    max_tokens=300,
)

print(response.choices[0].message.content)
if response.usage:
    print("usage:", response.usage)

Run it:

python chat_once.py

If this works, you have validated the most important pieces: key, Base URL, model name, network access, and SDK compatibility.

Wrap It as a Reusable Function

In real projects, avoid scattering API calls everywhere. Wrap the client once:

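import os
from openai import OpenAI
from dotenv import load_dotenv

# Load .env first so NBILITY_API_KEY is available when this module is imported on its own.
load_dotenv()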

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.getenv("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
    timeout=60,
    max_retries=2,
)


def ask_ai(prompt: str, system: str = "You are a reliable technical assistant.") -> str:
    response = client.chat.completions.create(
        model=os.getenv("NBILITY_MODEL", "gpt-4o"),
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        max_tokens=800,
    )
    return response.choices[0].message.content or ""

Two settings matter here:

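timeout=60: caps each request at 60 seconds instead of the SDK's default of 10 minutes, so a stalled call fails fast;
max_retries=2: lets the SDK retry transient failures (connection errors, 429, and 5xx responses) with backoff, but only a bounded number of times.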

With Nbility as the unified model gateway, you can switch models via environment variables instead of editing application code. For example, use a cheaper model in development and a stronger model for important production tasks.
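One way to switch models through configuration is a small lookup that prefers a task-specific variable and falls back to NBILITY_MODEL. This is a sketch only: the per-task variable names (for example NBILITY_MODEL_SUMMARY) are an assumption for illustration, not something Nbility defines.

```python
import os


def pick_model(task: str = "default") -> str:
    """Resolve the model for a task from environment variables.

    Looks up the hypothetical NBILITY_MODEL_<TASK> first, then NBILITY_MODEL,
    then falls back to the tutorial's default model name.
    """
    task_specific = os.getenv(f"NBILITY_MODEL_{task.upper()}")
    return task_specific or os.getenv("NBILITY_MODEL", "gpt-4o")
```

Inside ask_ai you could then pass model=pick_model("summary"), so development and production differ only in their environment files.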

Streaming Output

Streaming is useful for chat UIs, command-line assistants, and community bots. The user can see the first tokens quickly instead of waiting for the whole answer.

Create chat_stream.py:

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.getenv("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
)

stream = client.chat.completions.create(
    model=os.getenv("NBILITY_MODEL", "gpt-4o"),
    messages=[{"role": "user", "content": "Write a short Python API integration tip."}],
    stream=True,
    temperature=0.4,
)

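for chunk in stream:
    # Some OpenAI-compatible backends emit chunks with an empty choices list
    # (for example a final usage-only chunk), so guard before indexing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)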

print()

Nbility returns Server-Sent Events when stream: true is enabled. The OpenAI Python SDK parses those events into iterable chunks, so you only need to read delta.content.

For web apps, your backend can forward those chunks as SSE or WebSocket messages. For Telegram, QQ, or Feishu bots, it is usually better to edit a message in segments or send an initial “working on it” message, rather than sending one message per token.
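For the bot case, the idea of editing a message in segments can be sketched as a generator that accumulates deltas and yields the full text only after enough new characters have arrived. The function name batch_deltas and the min_chars threshold are illustrative choices, and the platform-specific edit call is left out.

```python
from typing import Iterable, Iterator


def batch_deltas(deltas: Iterable[str], min_chars: int = 80) -> Iterator[str]:
    """Yield the accumulated text each time at least min_chars new characters arrived."""
    text = ""
    sent = 0  # length of the text that was last yielded
    for delta in deltas:
        text += delta
        if len(text) - sent >= min_chars:
            sent = len(text)
            yield text
    if len(text) > sent:
        # Flush whatever remains once the stream ends.
        yield text
```

Each yielded value is the complete message so far, so the bot can pass it straight to its "edit message" call. A larger min_chars means fewer edits, which helps on platforms that rate-limit message updates.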

Error Handling: Avoid Plain `except Exception`

The hard part in production is not the happy path. It is knowing what failed and what action to take.

Figure: Python API error handling matrix

Start with a practical wrapper:

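import os
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError
from dotenv import load_dotenv

# Load .env first so NBILITY_API_KEY is available when this module is imported on its own.
load_dotenv()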

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.getenv("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
    timeout=60,
    max_retries=2,
)


def safe_chat(prompt: str) -> str:
    try:
        response = client.chat.completions.create(
            model=os.getenv("NBILITY_MODEL", "gpt-4o"),
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=500,
        )
        return response.choices[0].message.content or ""
    except RateLimitError as exc:
        raise RuntimeError("Too many requests or quota pressure. Please retry later.") from exc
    except APIConnectionError as exc:
        raise RuntimeError("Could not connect to the API. Check network, Base URL, or proxy settings.") from exc
    except APIStatusError as exc:
        status = exc.status_code
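        # exc.response is the underlying HTTP response; keep only a bounded slice of its body.
        http_response = getattr(exc, "response", None)
        detail = http_response.text[:1000] if http_response is not None else str(exc)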
        if status in {401, 403}:
            raise RuntimeError("Authentication failed: check API key, Bearer format, permissions, or balance.") from exc
        if status in {400, 404}:
            raise RuntimeError(f"Request parameters may be wrong: check model, messages, and max_tokens. Detail: {detail}") from exc
        if status >= 500:
            raise RuntimeError("Temporary upstream error. Retry later or switch models.") from exc
        raise RuntimeError(f"API request failed: HTTP {status}, {detail}") from exc

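This wrapper does not silently swallow errors. Each RuntimeError carries a user-facing message, and raise ... from exc keeps the original SDK exception attached as __cause__, so your application can show the message to the user while writing the technical details to logs.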
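On the calling side, that split between user-facing text and technical detail might look like the sketch below. The name answer_or_fallback and the logger name are illustrative. The chat function is passed in as a parameter so the example runs without network access; in the tutorial's setup you would pass safe_chat.

```python
import logging

logger = logging.getLogger("ai_client")


def answer_or_fallback(chat_fn, prompt: str) -> str:
    """Call chat_fn and turn a RuntimeError into a reply the user can read."""
    try:
        return chat_fn(prompt)
    except RuntimeError as exc:
        # exc.__cause__ is the original exception preserved by "raise ... from exc".
        logger.error("chat failed: %s (cause: %r)", exc, exc.__cause__)
        return f"Sorry, the request failed. {exc}"
```

The user sees only the short message, while the log line records the underlying exception for whoever debugs it later.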

Logging and Cost Control

An AI API integration is not finished just because it returns text. If multiple users, customer-service bots, or automated Agents will use it, log these fields from day one:

request_id: local request id
user_id: calling user or account
model: actual model name
stream: whether streaming was enabled
latency_ms: request duration
prompt_tokens / completion_tokens / total_tokens: token usage
status: success / failed
type: error type or HTTP status

If the response contains usage, store it in logs or a database. That is how you answer practical questions later: who used the API, which feature used it, and why cost suddenly increased.
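As a rough sketch of how those fields could be collected, the helper below assembles one log record per request. The function names are hypothetical, and usage is assumed to be the response.usage object (or None when the response carries none).

```python
import json
import time
import uuid


def build_log_record(user_id, model, stream, started, usage=None, error_type=None):
    """Assemble one structured log entry using the field names listed above."""
    return {
        "request_id": uuid.uuid4().hex,  # local id generated by this service
        "user_id": user_id,
        "model": model,
        "stream": stream,
        "latency_ms": round((time.perf_counter() - started) * 1000),
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "total_tokens": getattr(usage, "total_tokens", None),
        "status": "failed" if error_type else "success",
        "type": error_type,  # error class name or HTTP status; None on success
    }


def to_log_line(record: dict) -> str:
    """Serialize a record as one JSON line, ready for a log file or collector."""
    return json.dumps(record, ensure_ascii=False)
```

Take started = time.perf_counter() just before the API call, then build the record in both the success and the failure path so that failed requests are also counted.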

Nbility fits well as the unified entry point here. Text models, image models, model switching, and cost observation can be managed centrally, while your code keeps the OpenAI-compatible shape.

Common Issues

1. `401 Unauthorized`

Check these first:

Is the API key correct?
Is the Authorization: Bearer header present? (The SDK adds it for you; check this when calling the HTTP API directly.)
Was .env loaded correctly?
Did server environment variables override local config?
Does the key have balance or permission?
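Several of these checks can be automated before any request is sent. The following is a sketch of a hypothetical preflight helper. It inspects only local settings, so it cannot detect an invalid key, missing permissions, or an empty balance.

```python
def diagnose_settings(env) -> list:
    """Return human-readable problems found in the Nbility settings in env."""
    problems = []
    key = env.get("NBILITY_API_KEY", "")
    if not key.strip():
        problems.append("NBILITY_API_KEY is missing or empty; was .env loaded?")
    else:
        if key != key.strip():
            problems.append("NBILITY_API_KEY has leading or trailing whitespace")
        if key.strip().lower().startswith("bearer "):
            problems.append("NBILITY_API_KEY must not include 'Bearer '; the SDK adds the prefix")
    base_url = env.get("NBILITY_BASE_URL", "")
    if base_url and not base_url.startswith("https://"):
        problems.append("NBILITY_BASE_URL should start with https://")
    return problems
```

Calling diagnose_settings(os.environ) right after load_dotenv() and printing the result gives a quick first pass before you look at the server side.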

2. `404 model not found`

Usually the model name is wrong, or the account does not have access to that model. Confirm the available model name in Nbility, update NBILITY_MODEL, and rerun the minimal chat script.
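If the endpoint also exposes the OpenAI-style model listing (GET /v1/models), you can check the name programmatically. That is an assumption here, since this article only covers Chat Completions, so treat the sketch below as something to verify against Nbility's documentation. The helper names are illustrative.

```python
def list_model_ids(client) -> list:
    """Return the sorted model ids reported by client.models.list()."""
    return sorted(model.id for model in client.models.list())


def model_available(client, wanted: str) -> bool:
    """True if the wanted model name appears in the reported list."""
    return wanted in list_model_ids(client)
```

With the client from the earlier sections, model_available(client, os.getenv("NBILITY_MODEL", "gpt-4o")) tells you whether the configured name is visible to your account, and list_model_ids(client) shows what to use instead.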

3. Streaming produces nothing

Check three things:

Was stream=True actually passed?
Are you reading chunk.choices[0].delta.content?
Is a proxy, gateway, or web framework buffering the response?

Many web frameworks buffer responses by default, which makes streaming look broken. Verify that the backend really sends SSE or chunked responses to the browser.
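For reference, this is roughly what a backend has to emit for the browser to treat the response as a live SSE stream. It is a framework-agnostic sketch: the {"delta": ...} payload shape and the to_sse name are illustrative choices, and the X-Accel-Buffering header only affects nginx, so other proxies need their own setting.

```python
import json
from typing import Iterable, Iterator

# Response headers commonly used for SSE endpoints.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx not to buffer this response
}


def to_sse(deltas: Iterable[str]) -> Iterator[str]:
    """Wrap each text delta in an SSE frame: a 'data:' line plus a blank line."""
    for delta in deltas:
        # JSON-encode the payload so newlines inside a delta cannot break the framing.
        yield f"data: {json.dumps({'delta': delta})}\n\n"
    # End-of-stream sentinel, mirroring the convention of OpenAI-style streams.
    yield "data: [DONE]\n\n"
```

Return this generator through your framework's streaming response type with SSE_HEADERS attached. If the browser still receives everything at once, the buffering is happening in a layer between your backend and the client.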

4. When should I use sync vs streaming?

Use synchronous calls for background jobs, short answers, and structured processing. Use streaming for chat UIs, long answers, and mobile or group-chat bots. Streaming improves perceived latency; it does not necessarily reduce token cost.

Launch Checklist

[ ] API key never appears in code, frontend, logs, or screenshots
[ ] Base URL is https://api.nbility.dev/v1
[ ] Model name lives in environment variables
[ ] Every request has a timeout
[ ] 429 / 5xx use limited retries or queue backoff
[ ] 400 / 401 / 403 / 404 are not blindly retried
[ ] usage, latency, model, and user_id are logged
[ ] Streaming is visible in the target client and not buffered by the gateway
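The two retry items in the checklist can be expressed as a small policy. This is a sketch for retries you manage yourself, for example in a job queue, separate from the SDK's built-in max_retries. The function names and default delays are illustrative.

```python
import random


def is_retryable(status: int) -> bool:
    """Retry rate limits and server errors; never retry client-side mistakes."""
    return status == 429 or status >= 500


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: a random delay up to base * 2**attempt, capped."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A worker would call is_retryable(exc.status_code) after a failed request, sleep for backoff_delay(attempt), and give up after a fixed number of attempts. Errors such as 400, 401, 403, and 404 go straight to the log, because retrying them cannot succeed until the request or configuration is fixed.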