GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

In this tutorial, we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper

import sys, subprocess

# Install or upgrade the OpenAI SDK quietly; check=False keeps going if pip fails.
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# Each provider exposes GLM-5.2 behind an OpenAI-compatible endpoint.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",    "model": "glm-5.2",         "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",     "model": "z-ai/glm-5.2",    "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",      "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",    "model": "zai/glm-5.2",     "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG   = PROVIDERS[PROVIDER]
MODEL = CFG["model"]

def load_api_key(env_name):
    # Lookup order: Colab secrets, then environment variable, then interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v: return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")

client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# Prices in USD per 1M tokens, used only for the cost estimate at the end.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}

def _track(usage):
    if usage:
        _USAGE["in"]    += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"]   += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1

def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val: return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"): return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None

def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort:   None | "high" | "max"   (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask for a final usage chunk so streamed calls can be tracked too.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)

We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
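To make the wrapper's request shaping concrete, here is a minimal offline sketch of how the GLM-specific fields end up in extra_body. The name build_extra_body is a hypothetical helper introduced only for illustration; it mirrors the logic inside chat and makes no network call.

```python
def build_extra_body(thinking=True, effort=None, tool_stream=False):
    """Hypothetical helper: mirrors how chat() assembles extra_body."""
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    # reasoning_effort is only sent when thinking is on AND an effort is given
    if effort and thinking:
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    return extra

# With thinking off, the effort value is dropped rather than sent.
print(build_extra_body(thinking=False, effort="max"))
print(build_extra_body(thinking=True, effort="high"))
```

One consequence worth noticing: passing effort together with thinking=False is silently ignored, so a fast, low-latency call never carries a reasoning-effort field.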

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2

def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user",   "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print(resp.choices[0].message.content.strip())

def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Same prompt, three settings: compare latency and output-token counts.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high",  dict(thinking=True, effort="high")),
                      ("effort=max",   dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print("  [answer]: " + " ".join((msg.content or '').split())[:350])

def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The usage chunk arrives with an empty choices list.
        if getattr(chunk, "usage", None): usage = chunk.usage
        if not chunk.choices: continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r: print("\n[thinking] ", end="", flush=True); saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a: print("\n\n[answer] ", end="", flush=True); saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)
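As a reference point for judging the three runs in demo_effort, the train problem has a closed-form answer that can be checked without the model. This short sketch is not part of the original notebook; it only works through the arithmetic.

```python
from datetime import datetime, timedelta

distance_km = 420
speed_a, speed_b = 60, 90      # km/h
head_start_h = 0.5             # Train A departs 30 minutes before Train B

# Distance still separating the trains when Train B departs at 9:30
gap_at_930 = distance_km - speed_a * head_start_h
# From 9:30 on, the trains close the gap at their combined speed
hours_after_930 = gap_at_930 / (speed_a + speed_b)

# Round to whole minutes to avoid floating-point noise in the clock time
meet = datetime(2000, 1, 1, 9, 30) + timedelta(minutes=round(hours_after_930 * 60))
print(meet.strftime("%H:%M"))  # 12:06
```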

We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
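The channel separation in demo_streaming can be exercised offline. The sketch below uses SimpleNamespace objects as stand-ins for streamed deltas, which is an assumption about their shape: a reasoning_content attribute for the thinking trace and content for the answer. The name split_channels is a hypothetical helper, not part of the OpenAI SDK.

```python
from types import SimpleNamespace

def split_channels(deltas):
    """Collect reasoning text and answer text from delta-like objects."""
    reasoning, answer = [], []
    for d in deltas:
        r = getattr(d, "reasoning_content", None)
        if r:
            reasoning.append(r)
        c = getattr(d, "content", None)
        if c:
            answer.append(c)
    return "".join(reasoning), "".join(answer)

# Fake stream: two reasoning chunks, then one answer chunk
fake_stream = [
    SimpleNamespace(reasoning_content="Rayleigh scattering ", content=None),
    SimpleNamespace(reasoning_content="favors short wavelengths.", content=None),
    SimpleNamespace(reasoning_content=None, content="Blue light scatters most."),
]
thinking_text, answer_text = split_channels(fake_stream)
print("[thinking]", thinking_text)
print("[answer]", answer_text)
```

Keeping the two channels in separate buffers makes it easy to log the reasoning for debugging while showing only the answer to end users.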

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent

def tool_calculator(expression: str):
    # Whitelist digits, arithmetic operators, parentheses, dot, space, and %.
    # The hyphen is escaped so it is a literal minus, not a character range.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try:
        return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e:
        return {"error": str(e)}

_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}

def tool_city_population(city: str):
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}

TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up the metro population of a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}

def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        if not getattr(m, "tool_calls", None):
            return m.content
        # Echo the assistant turn (with its tool calls) back into the history.
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try:
                args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError:
                args = {}
            if not isinstance(args, dict):
                args = {}
            result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
            print(f"   ↳ {tc.function.name}({args}) -> {result}")
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"

def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))

def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user",   "content": task}])
    print("Final:", " ".join((ans or "").split()))

We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
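The calculator tool relies on eval behind a character whitelist, and a whitelist that admits * also admits **, so an input such as 9**9**9 would still reach eval. As a hedged alternative, a small AST walker can evaluate the same arithmetic without calling eval at all. The name safe_calc is a hypothetical replacement, not the tutorial's implementation, and it deliberately supports only +, -, *, /, % and unary signs.

```python
import ast
import operator

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _eval_node(node):
    # Plain numbers only (bool is excluded even though it subclasses int)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Anything else (names, calls, power, attribute access) is rejected
    raise ValueError("unsupported expression")

def safe_calc(expression):
    """Hypothetical eval-free calculator returning the same dict shape as the tool."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"result": _eval_node(tree.body)}
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return {"error": str(e)}

print(safe_calc("37400068/21800000"))
print(safe_calc("9**9**9"))
print(safe_calc("__import__('os')"))
```

Because it returns the same {"result": ...} or {"error": ...} shape, a function like this could be swapped into the tool registry without changing the tool schema.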

Structured JSON Output and Long-Context Retrieval with GLM-5.2


We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden “needle” and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
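The helpers described above can be sketched offline as follows. Everything here is an assumption for illustration rather than the notebook's own code: extract_json, ask_json, and build_haystack are hypothetical names, the ask argument stands in for any function that sends a prompt to the model and returns text, and the launch code ZETA-7741 is a made-up placeholder needle.

```python
import json

def extract_json(text):
    """Return the first-to-last-brace span parsed as JSON, or None on failure."""
    if not text:
        return None
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

def ask_json(ask, prompt, retries=1):
    """Call ask(prompt) -> str; on invalid JSON, retry with a stricter prompt."""
    for _ in range(retries + 1):
        obj = extract_json(ask(prompt))
        if obj is not None:
            return obj
        prompt += "\nReturn ONLY a valid JSON object, with no extra text."
    return None

def build_haystack(needle, n_paragraphs=200, position=0.5):
    """Filler lines with one needle sentence inserted at a relative position."""
    lines = [f"Section {i}: routine status notes with no notable events."
             for i in range(n_paragraphs)]
    lines.insert(int(n_paragraphs * position), needle)
    return "\n".join(lines)

# Fake model: first reply has no JSON, second wraps JSON in extra prose
replies = iter(["Sure! Here you go.",
                'Here is the result: {"name": "GLM-5.2", "ok": true} Hope that helps.'])
print(ask_json(lambda p: next(replies), "Describe the model as JSON."))

doc = build_haystack("The launch code is ZETA-7741.")
print(len(doc.splitlines()), "ZETA-7741" in doc)
```

In a live run, ask would wrap the chat helper, and the haystack would be sent as context with a question about the launch code; comparing the reply against the known needle gives a simple pass/fail retrieval check.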

Running All Demos with GLM-5.2 Token and Cost Accounting

def cost_summary():
    print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
    cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
    print(f"  calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
    print(f"  estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")

# demo_structured and demo_long_context belong to the structured-output / long-context section.
DEMOS = [demo_basic, demo_effort, demo_streaming, demo_tools,
         demo_agent, demo_structured, demo_long_context]

print(f"Provider={PROVIDER}   model={MODEL}")
for fn in DEMOS:
    try:
        fn()
    except Exception as e:
        # One failing demo is reported and skipped instead of stopping the run.
        print(f"  [skipped {fn.__name__}: {type(e).__name__}: {e}]")
cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")

We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
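The cost formula is simple enough to check by hand. The sketch below restates it as a standalone function using the per-million prices from the setup code; estimate_cost is a hypothetical name, and the token counts are invented for the worked example.

```python
def estimate_cost(in_tokens, out_tokens, price_in_per_m=1.40, price_out_per_m=4.40):
    """Estimated spend in USD from token counts and per-1M-token prices."""
    return in_tokens / 1e6 * price_in_per_m + out_tokens / 1e6 * price_out_per_m

# Hypothetical session: 120,000 input tokens and 30,000 output tokens
# 0.12 * 1.40 = 0.168 for input, 0.03 * 4.40 = 0.132 for output
cost = estimate_cost(120_000, 30_000)
print(f"${cost:0.4f}")  # $0.3000
```

Output tokens are priced at roughly three times the input rate here, so thinking-heavy runs with long reasoning traces tend to dominate the bill even when prompts are short.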

Conclusion

In conclusion, we now have a practical, reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare thinking modes, connect it to tools, validate structured outputs, test long-context inputs, and monitor token usage with an estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. The resulting setup is lightweight enough for Colab yet close to how we would build with GLM-5.2 in a real project.


Sana Hassan

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
