GPT-image / Multimodal Image API Beginner Guide

The first mental model most people have for an image API is simple: send a prompt, get an image. Once you connect it to a product, a publishing workflow, a group bot, or a marketing pipeline, the real questions appear: why is the image returned as base64? Is text-to-image the same as editing a reference image? Which size and quality should you choose? When generation fails, should you retry or ask the user to rewrite the prompt?

This guide explains the basic GPT-image / multimodal image API workflow in a copy-pasteable way. The examples use an OpenAI-compatible style, with Nbility as the unified API entry point: one Base URL, one API key, and a model name you can route, monitor, and replace later.

Three different tasks: vision, text-to-image, and image editing

“Multimodal” can be confusing. For image workflows, split the problem into three categories:

Vision / image understanding: provide images as input and ask a model to describe, OCR, classify, or reason about them. OpenAI’s Images and Vision guide covers this class of use cases across APIs such as Chat Completions and Responses.
Text-to-image generation: provide only a prompt and create a new image. The Image API's images/generations endpoint is the simplest path.
Reference-image editing: upload one or more images and ask the model to preserve, transform, or edit parts of them. Image edit endpoints may also support masks, file constraints, and model-specific parameters.

A common beginner mistake is to say “let the model look at this image and generate another one” without specifying whether the image is a vision input, a reference image, or an edit target.
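One way to avoid that ambiguity is to name the task explicitly before choosing an endpoint. The sketch below is illustrative only: `ImageTask`, `classify_task`, and `endpoint_for` are hypothetical names, and the paths assume an OpenAI-compatible gateway that exposes `chat/completions`, `images/generations`, and `images/edits` under its `/v1` root.

```python
from enum import Enum


class ImageTask(Enum):
    VISION = "vision"      # image in, text out
    GENERATE = "generate"  # prompt in, image out
    EDIT = "edit"          # image(s) plus prompt in, image out


# Assumed OpenAI-compatible paths, relative to the /v1 root.
_ENDPOINTS = {
    ImageTask.VISION: "chat/completions",
    ImageTask.GENERATE: "images/generations",
    ImageTask.EDIT: "images/edits",
}


def classify_task(has_input_image: bool, wants_image_output: bool) -> ImageTask:
    """Decide which of the three categories a request belongs to."""
    if has_input_image and wants_image_output:
        return ImageTask.EDIT
    if has_input_image:
        return ImageTask.VISION
    if wants_image_output:
        return ImageTask.GENERATE
    raise ValueError("not an image task: no image input and no image output")


def endpoint_for(task: ImageTask) -> str:
    """Return the relative API path for a task."""
    return _ENDPOINTS[task]
```

With this in place, "look at this image and generate another one" has to be resolved into either a vision call followed by generation, or a single edit call, before any request is sent.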

Start with the Image API before building a complex Agent

OpenAI’s documentation describes two broad ways to generate images: the Image API and the image generation tool inside the Responses API.

Image API: best for one-shot generation or edits. It is easy to plug into scripts, backends, and automation jobs.
Responses API with image generation tool: useful for conversational and iterative editing, such as generating an image and then asking the model to make it more realistic or change the background.

If you are building article covers, product drafts, social images, or bot-generated illustrations, start with the Image API. Move to Responses-based workflows when you need multi-turn editing history or an Agent that decides whether to generate or edit.

Environment variables

Create a .env file:

NBILITY_API_KEY=[REDACTED]
NBILITY_BASE_URL=https://api.nbility.dev/v1
NBILITY_IMAGE_MODEL=gpt-image-2

If your account does not currently expose gpt-image-2, replace it with an available GPT image model. The important part is consistency:

base_url should point to the OpenAI-compatible API root, usually including /v1.
api_key is sent as a Bearer token.
model must be an image-generation model, not a regular chat model.
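These three checks can be run at startup before any request is made. The following is a minimal sketch: `check_image_config` is a hypothetical helper, and the test for "image" in the model name is a naive heuristic rather than a real capability check.

```python
from urllib.parse import urlparse


def check_image_config(env: dict) -> list:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    key = env.get("NBILITY_API_KEY", "").strip()
    base_url = env.get("NBILITY_BASE_URL", "").strip()
    model = env.get("NBILITY_IMAGE_MODEL", "").strip()

    if not key:
        problems.append("NBILITY_API_KEY is empty")

    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("NBILITY_BASE_URL is not an absolute http(s) URL")
    elif not parsed.path.rstrip("/").endswith("/v1"):
        problems.append("NBILITY_BASE_URL does not end with /v1")

    if not model:
        problems.append("NBILITY_IMAGE_MODEL is empty")
    elif "image" not in model.lower():
        # Heuristic only: some image models may not have "image" in their name.
        problems.append("warning: NBILITY_IMAGE_MODEL does not look like an image model")

    return problems
```

Calling `check_image_config(dict(os.environ))` after `load_dotenv()` and failing fast on a non-empty result tends to produce clearer errors than a 401 or 404 from the first real request.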

Minimal Python example: text-to-image

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install openai python-dotenv

Create generate_image.py:

import base64
import os
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.environ.get("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
)

prompt = """
A clean hero image for a technical blog post about AI image generation APIs:
a developer desk, floating image thumbnails, API request cards, black and orange color palette,
modern 3D illustration, no readable text.
"""

result = client.images.generate(
    model=os.environ.get("NBILITY_IMAGE_MODEL", "gpt-image-2"),
    prompt=prompt,
    size="1536x1024",
    quality="medium",
)

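# GPT image models typically return base64 data rather than a URL.
image_base64 = result.data[0].b64_json
if image_base64 is None:
    raise RuntimeError("response has no b64_json field; check the model and response format")
image_bytes = base64.b64decode(image_base64)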

out = Path("generated-cover.png")
out.write_bytes(image_bytes)
print(f"saved: {out.resolve()}")

Run it:

python generate_image.py

Many GPT image models return base64 image data instead of a permanent URL, so your application should decode and store the file.
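Because the payload is raw bytes once decoded, it can help to check what format actually came back before choosing a file extension. The sketch below assumes only the three formats mentioned in this guide (png, jpg, webp); `sniff_extension` and `save_b64_image` are hypothetical helper names.

```python
import base64
from pathlib import Path

# Magic-byte prefixes for PNG and JPEG.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}


def sniff_extension(data: bytes) -> str:
    """Guess the file extension from the first bytes of the image."""
    for magic, ext in _SIGNATURES.items():
        if data.startswith(magic):
            return ext
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    raise ValueError("unrecognized image format")


def save_b64_image(b64_data: str, out_dir: str, stem: str) -> Path:
    """Decode a base64 image and store it under out_dir with a matching extension."""
    data = base64.b64decode(b64_data, validate=True)
    path = Path(out_dir) / f"{stem}{sniff_extension(data)}"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

In the script above, `save_b64_image(result.data[0].b64_json, "output", "generated-cover")` would replace the hard-coded `generated-cover.png` path.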

cURL example: verify the request path first

When SDK errors are unclear, use curl to test the raw request:

curl -X POST "https://api.nbility.dev/v1/images/generations" \
  -H "Authorization: Bearer $NBILITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-image-2",
    "prompt": "A minimal orange and black illustration of an AI image API pipeline, no text",
    "size": "1024x1024",
    "quality": "medium"
  }' \
  | jq -r '.data[0].b64_json' \
  | base64 --decode > test.png

This verifies the Base URL, API key, model availability, and the b64_json response field.
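If curl or jq is not available, the same raw request can be assembled with the Python standard library. This is a sketch, not a replacement for the SDK: `build_generation_request` is a hypothetical helper, and it only builds the request object so the URL, headers, and body can be inspected before anything is sent.

```python
import json
import urllib.request


def build_generation_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the images/generations endpoint."""
    url = base_url.rstrip("/") + "/images/generations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # key is sent as a Bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen(req, timeout=120)` and read `data[0].b64_json` from the JSON response; the 120-second timeout is an arbitrary starting point.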

Reference-image editing

If the user asks to keep the same cat pose but change the background to a cyberpunk city, that is image editing, not pure text-to-image generation. OpenAI’s image edit reference says GPT image models can accept input files such as png, webp, and jpg, with model-specific limits for file size, number of images, masks, and fidelity options.

Example:

import base64
import os
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.environ.get("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
)

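# Open the source image in a context manager so the file handle is closed after upload.
with open("input.png", "rb") as image_file:
    result = client.images.edit(
        model=os.environ.get("NBILITY_IMAGE_MODEL", "gpt-image-2"),
        image=image_file,
        prompt="Keep the main character, change the background into a cozy orange developer studio, no text.",
        size="1536x1024",
        quality="medium",
    )

# Decode the base64 payload and write the edited image to disk.
Path("edited.png").write_bytes(base64.b64decode(result.data[0].b64_json))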

For production, hard-code a parameter allowlist. Do not expose every model parameter directly to end users.
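A parameter allowlist can be as small as the sketch below. The allowed values are assumptions chosen to match the defaults in this guide, and `sanitize_params` is a hypothetical helper; adjust both to whatever your model and gateway actually accept.

```python
# Assumed allowlists; verify them against the model you actually deploy.
ALLOWED_KEYS = {"prompt", "size", "quality"}
ALLOWED_SIZES = {"1024x1024", "1536x1024", "1024x1536"}
ALLOWED_QUALITIES = {"low", "medium", "high"}


def sanitize_params(user_params: dict) -> dict:
    """Reject unknown keys and values, and fill in safe defaults."""
    unknown = set(user_params) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")

    prompt = str(user_params.get("prompt", "")).strip()
    if not prompt:
        raise ValueError("prompt is required")

    size = user_params.get("size", "1536x1024")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")

    quality = user_params.get("quality", "medium")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")

    # The model name is deliberately not user-controlled.
    return {"prompt": prompt, "size": size, "quality": quality}
```

The returned dictionary can be unpacked into `client.images.generate(model=..., **params)` or `client.images.edit(...)`, with the model name supplied by the server.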

Choosing size, quality, and format

Good beginner defaults:

Blog cover: 1536x1024.
Square social image: 1024x1024.
Vertical poster: 1024x1536.
Quality: start with medium; use high only for final or commercial images.
Format: default to png; consider webp or jpeg for faster web delivery.
Count: default to n=1; use a queue for batch generation.

Some models support custom dimensions, but they may impose divisibility, aspect-ratio, and maximum-resolution constraints. In a product UI, expose only a few verified size buttons.
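The presets above map naturally to a small lookup table, with a separate validator for the rare case where custom dimensions are allowed. In this sketch the constraint values (divisible by 16, maximum side of 2048, maximum aspect ratio of 3:1) are placeholders, not documented limits of any particular model; replace them with the numbers from your provider's reference.

```python
# Verified size buttons for the product UI, taken from the defaults in this guide.
SIZE_PRESETS = {
    "blog_cover": "1536x1024",
    "social_square": "1024x1024",
    "vertical_poster": "1024x1536",
}


def validate_custom_size(width: int, height: int, *, multiple: int = 16,
                         max_side: int = 2048, max_ratio: float = 3.0) -> str:
    """Check a custom size against placeholder constraints and return 'WxH'."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    if width % multiple or height % multiple:
        raise ValueError(f"both sides must be divisible by {multiple}")
    if max(width, height) > max_side:
        raise ValueError(f"longest side must not exceed {max_side}")
    if max(width, height) / min(width, height) > max_ratio:
        raise ValueError(f"aspect ratio must not exceed {max_ratio}:1")
    return f"{width}x{height}"
```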

Do not keep users waiting inside one HTTP request

Image generation is slower than chat. A more robust backend flow is:

User submits a prompt.
Backend creates a job and returns task_id.
A worker calls the image API.
The result is stored in object storage or a static directory.
The frontend polls job status, or a bot sends the resulting image link.
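The flow above can be sketched with an in-memory queue and a single worker thread. This is only a shape to start from: a real deployment would typically use a persistent queue and database, and `submit_job`, `get_status`, and `worker` are hypothetical names. The `generate` argument stands in for a function that calls the image API and returns a stored file path or URL.

```python
import queue
import threading
import uuid

jobs = {}                    # task_id -> {"status", "result", "error"}
jobs_lock = threading.Lock()
pending = queue.Queue()


def submit_job(prompt: str) -> str:
    """Register a job and return its task_id immediately."""
    task_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[task_id] = {"status": "queued", "result": None, "error": None}
    pending.put((task_id, prompt))
    return task_id


def get_status(task_id: str) -> dict:
    """Return a copy of the job record for polling."""
    with jobs_lock:
        return dict(jobs[task_id])


def worker(generate) -> None:
    """Process jobs until a None sentinel is received."""
    while True:
        item = pending.get()
        if item is None:
            pending.task_done()
            break
        task_id, prompt = item
        with jobs_lock:
            jobs[task_id]["status"] = "running"
        try:
            update = {"status": "done", "result": generate(prompt)}
        except Exception as exc:  # record the failure instead of killing the worker
            update = {"status": "failed", "error": str(exc)}
        with jobs_lock:
            jobs[task_id].update(update)
        pending.task_done()
```

An HTTP handler would call `submit_job` and return the `task_id`, while a status endpoint would return `get_status(task_id)`.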

A unified API gateway such as Nbility helps because you can track chat, summarization, vision, and image-generation usage in one place, then attribute cost by user, group, article, or automation job.

Troubleshooting by layer

Common failures:

401 / 403: invalid key, missing permission, or unavailable model. Check Authorization and the model name.
400: incompatible parameter, unsupported size, transparent background not supported, invalid mask, or unsupported file format.
429: rate limit. Queue the request and retry later.
timeout / upstream error: upstream generation is slow or temporarily unavailable. Retry once, not forever.
safety / policy: the prompt violates policy. Ask the user to change the description instead of calling it a network error.

A practical rule: retry network errors, timeouts, 429s, and 5xx responses, with backoff and a small cap on attempts; do not automatically retry 400, 401, 403, or safety-policy errors.
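The retry rule, combined with the 429 guidance in the checklist, can be written as a small decision function plus a capped retry loop. This is a sketch under assumptions: `ApiError` is a hypothetical exception that carries an HTTP status (or `None` for a network failure) and a policy flag, and you would map your SDK's real exceptions onto it.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical error wrapper; status is None for network-level failures."""

    def __init__(self, status=None, is_policy_error=False):
        super().__init__(f"status={status}")
        self.status = status
        self.is_policy_error = is_policy_error


def should_retry(status, *, is_policy_error=False) -> bool:
    """Retry network errors, 429, and 5xx; never retry policy or other 4xx errors."""
    if is_policy_error:
        return False
    if status is None:  # no HTTP response: network error or timeout
        return True
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter, capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


def call_with_retry(fn, *, max_attempts: int = 3, sleep=time.sleep):
    """Call fn, retrying retryable ApiErrors up to max_attempts in total."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not should_retry(exc.status, is_policy_error=exc.is_policy_error):
                raise
            sleep(backoff_delay(attempt))
```

Non-retryable failures propagate immediately, which lets the caller show the user a "please rewrite the prompt" message for policy errors instead of a generic network error.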

Prompt structure: write for the use case

A useful image prompt usually contains five parts:

Subject: the main person, object, or scene
Use case: blog cover, product banner, tutorial image, social poster
Composition: horizontal/vertical, title space, close-up/wide shot
Style: realistic, 3D, flat illustration, anime, brand colors
Constraints: no readable text, no watermark, no fake logos

Example:

A horizontal technical blog cover about multimodal image generation API.
Main subject: a developer dashboard with floating image thumbnails and API request cards.
Composition: leave clean title space on the left, main visual on the right.
Style: modern 3D illustration, black and orange palette, soft lighting.
Constraints: no readable text, no watermark, no real company logos.

The more the prompt reflects the final use case, the more usable the output tends to be.
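If prompts are assembled by code rather than typed by hand, the five parts can be kept as separate fields and joined at the last moment. The `ImagePrompt` class below is a hypothetical sketch; the field labels and default constraints simply mirror the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ImagePrompt:
    subject: str
    use_case: str
    composition: str
    style: str
    constraints: list = field(
        default_factory=lambda: ["no readable text", "no watermark"]
    )

    def render(self) -> str:
        """Join the five parts into a single prompt string, one part per line."""
        parts = [
            f"Use case: {self.use_case}.",
            f"Main subject: {self.subject}.",
            f"Composition: {self.composition}.",
            f"Style: {self.style}.",
        ]
        if self.constraints:
            parts.append("Constraints: " + ", ".join(self.constraints) + ".")
        return "\n".join(parts)
```

Keeping the parts separate also makes it easy to let users edit only the subject while the application controls style and constraints.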

A reusable backend function

Wrap generation into a function:

import base64
import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["NBILITY_API_KEY"],
    base_url=os.environ.get("NBILITY_BASE_URL", "https://api.nbility.dev/v1"),
)

def generate_image(prompt: str, out_path: str, *, size="1536x1024", quality="medium") -> Path:
    if len(prompt) > 4000:
        raise ValueError("prompt too long for this application policy")

    result = client.images.generate(
        model=os.environ.get("NBILITY_IMAGE_MODEL", "gpt-image-2"),
        prompt=prompt,
        size=size,
        quality=quality,
    )

    data = result.data[0].b64_json
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(base64.b64decode(data))
    return path