A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence


In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also examine how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping.

import subprocess, sys

def _pip(*a):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *a])

_pip("--upgrade", "pip")
_pip("--upgrade",
     "transformers>=4.48.0", "accelerate>=1.2.0", "bitsandbytes>=0.44.0",
     "pillow", "requests", "sentencepiece",
     "qwen-vl-utils[decord]", "sentence-transformers", "jsonschema")

import torch, os, json, time, re, gc, io, threading, textwrap, warnings
from collections import Counter
from typing import Any, Optional
warnings.filterwarnings("ignore")

assert torch.cuda.is_available(), "GPU required. Switch runtime to A100 / L4."
p = torch.cuda.get_device_properties(0)
VRAM_GB = p.total_memory / 1e9
print(f"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}")

# Pick the cheapest load mode that fits the available VRAM.
if VRAM_GB >= 75:
    LOAD_MODE = "bf16"
elif VRAM_GB >= 40:
    LOAD_MODE = "int8"
else:
    LOAD_MODE = "int4"

try:
    import flash_attn
    ATTN_IMPL = "flash_attention_2"
except Exception:
    ATTN_IMPL = "sdpa"
print(f"-> mode={LOAD_MODE}  attn={ATTN_IMPL}")

from transformers import (
    AutoModelForImageTextToText, AutoProcessor,
    BitsAndBytesConfig, TextIteratorStreamer,
    StoppingCriteria, StoppingCriteriaList,
)

MODEL_ID = "Qwen/Qwen3.6-35B-A3B"
kwargs = dict(device_map="auto", trust_remote_code=True,
              low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,
              torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
    kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)

print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(f"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...")
t0 = time.time()
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs)
model.eval()
print(f"Loaded in {time.time()-t0:.0f}s  |  VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")

SAMPLING = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
    "thinking_coding":  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    "instruct_general": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),
    "instruct_reason":  dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),
}
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN)
        b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    return "", text.strip()

We set up the full environment required to run Qwen 3.6-35B-A3B in Google Colab, installing all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, select the loading mode dynamically based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. After that, we load the processor and model from Hugging Face and define the core sampling presets and the thinking-splitting utility, which lay the foundation for all later interactions.
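To make the thinking-splitting contract concrete, here is a standalone copy of the same tag-splitting logic that can be sanity-checked without loading the model. It assumes Qwen's `<think>...</think>` thinking-mode delimiters; the sample strings are illustrative, not real model output:

```python
# Standalone sketch of the thinking-splitting utility. Assumes Qwen's
# <think>...</think> delimiters around the reasoning trace.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Case 1: full <think>...</think> block followed by the answer.
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN)
        b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # Case 2: the chat template already consumed the opening tag.
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # Case 3: thinking disabled -- everything is the answer.
    return "", text.strip()

print(split_thinking("<think>net gain is 1m/day</think>Day 28."))
print(split_thinking("No tags here at all."))
```

The second branch matters in practice: when the chat template emits the opening tag itself, the decoded completion starts mid-trace and contains only the closing tag.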

class QwenChat:
    def __init__(self, model, processor, system=None, tools=None):
        self.model, self.processor = model, processor
        self.tokenizer = processor.tokenizer
        self.history: list[dict] = []
        if system: self.history.append({"role": "system", "content": system})
        self.tools = tools

    def user(self, content):
        self.history.append({"role": "user", "content": content}); return self

    def assistant(self, content, reasoning=""):
        m = {"role": "assistant", "content": content}
        if reasoning: m["reasoning_content"] = reasoning
        self.history.append(m); return self

    def tool_result(self, name, result):
        self.history.append({"role": "tool", "name": name,
            "content": result if isinstance(result, str) else json.dumps(result)})
        return self

    def _inputs(self, enable_thinking, preserve_thinking):
        return self.processor.apply_chat_template(
            self.history, tools=self.tools, tokenize=True,
            add_generation_prompt=True, return_dict=True, return_tensors="pt",
            enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
        ).to(self.model.device)

    def generate(self, *, enable_thinking=True, preserve_thinking=False,
                 max_new_tokens=2048, preset="thinking_general",
                 stopping_criteria=None, append_to_history=True):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  repetition_penalty=1.0,
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
        with torch.inference_mode():
            out = self.model.generate(**gk)
        raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
        think, ans = split_thinking(raw)
        if append_to_history: self.assistant(ans, reasoning=think)
        return think, ans

    def stream(self, *, enable_thinking=True, preserve_thinking=False,
               max_new_tokens=2048, preset="thinking_general",
               on_thinking=None, on_answer=None):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
        buf, in_think = "", enable_thinking
        think_text, answer_text = "", ""
        for piece in streamer:
            buf += piece
            if in_think:
                if THINK_CLOSE in buf:
                    close_at = buf.index(THINK_CLOSE)
                    resid = buf[:close_at]
                    if on_thinking: on_thinking(resid[len(think_text):])
                    think_text = resid
                    buf = buf[close_at + len(THINK_CLOSE):]
                    in_think = False
                    if buf and on_answer: on_answer(buf)
                    answer_text = buf; buf = ""
                else:
                    if on_thinking: on_thinking(piece)
                    think_text += piece
            else:
                if on_answer: on_answer(piece)
                answer_text += piece
        t.join()
        self.assistant(answer_text.strip(), reasoning=think_text.strip())
        return think_text.strip(), answer_text.strip()

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"history": self.history, "tools": self.tools}, f, indent=2)

    @classmethod
    def load(cls, model, processor, path):
        with open(path) as f:
            data = json.load(f)
        c = cls(model, processor, tools=data.get("tools"))
        c.history = data["history"]
        return c


class ThinkingBudget(StoppingCriteria):
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids  = tokenizer.encode(THINK_OPEN,  add_special_tokens=False)
        self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
        self.start = None

    def _find(self, seq, needle):
        n = len(needle)
        for i in range(len(seq) - n + 1):
            if seq[i:i+n] == needle: return i
        return None

    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None: self.start = idx + len(self.open_ids)
            return False
        if self._find(seq[self.start:], self.close_ids) is not None: return False
        return (len(seq) - self.start) >= self.budget


TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

def run_calculate(expr: str) -> str:
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error": "illegal chars"})
    try:
        return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e:
        return json.dumps({"error": str(e)})

_DOCS = {
    "qwen3.6":  "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
    "deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
    "moe":      "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}

def run_search_docs(q):
    hits = [v for k, v in _DOCS.items() if k in q.lower()]
    return json.dumps({"results": hits or ["no hits"]})

def run_get_time():
    import datetime as dt
    return json.dumps({"iso": dt.datetime.utcnow().isoformat() + "Z"})

TOOL_FNS = {
    "calculate":   lambda a: run_calculate(a["expression"]),
    "search_docs": lambda a: run_search_docs(a["query"]),
    "get_time":    lambda a: run_get_time(),
}

TOOLS_SCHEMA = [
    {"type": "function", "function": {"name": "calculate", "description": "Evaluate arithmetic.",
      "parameters": {"type": "object", "properties": {"expression": {"type": "string"}}, "required": ["expression"]}}},
    {"type": "function", "function": {"name": "search_docs", "description": "Search internal docs.",
      "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "get_time", "description": "Get current UTC time.",
      "parameters": {"type": "object", "properties": {}}}},
]

We build the main QwenChat conversation manager, which handles message history, tool messages, chat template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion to control how much reasoning the model is allowed to produce before continuing or stopping generation. In addition, we create the tool-calling support layer, including arithmetic, lightweight document search, time lookup, and the tool schema that allows the model to interact with external functions in an agent-style loop.
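To see exactly what the tool-calling layer has to parse, here is a minimal, model-free sketch of the extraction step. It assumes Qwen's `<tool_call>...</tool_call>` wrapper format, and the completion string is a hypothetical example, not real model output:

```python
import json, re

# Sketch of the tool-call extraction used by the agent loop. Assumes the
# model wraps each call as <tool_call>{...}</tool_call>; re.S lets the
# payload span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

# Hypothetical completion containing one call.
raw = ('I will compute that.\n'
       '<tool_call>{"name": "calculate", '
       '"arguments": {"expression": "842*0.15"}}</tool_call>')

for payload in TOOL_CALL_RE.findall(raw):
    call = json.loads(payload)
    print(call["name"], call["arguments"])
# prints: calculate {'expression': '842*0.15'}
```

Because `.*?` is lazy but the closing `}` must be immediately followed by `</tool_call>`, nested braces inside `"arguments"` are still captured as one payload.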

def run_agent(user_msg, *, max_steps=5, verbose=True):
    chat = QwenChat(model, processor,
        system="You are a helpful assistant. Call tools when helpful, then answer.",
        tools=TOOLS_SCHEMA)
    chat.user(user_msg)
    for step in range(max_steps):
        think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
                                   preset="thinking_general", max_new_tokens=1024,
                                   append_to_history=False)
        calls = TOOL_CALL_RE.findall(raw)
        if verbose:
            print(f"\n=== step {step+1} ===")
            print("reasoning:", textwrap.shorten(think, 200))
            print("raw     :", textwrap.shorten(raw, 300))
        if not calls:
            chat.assistant(raw, reasoning=think)
            return chat, raw
        chat.assistant(raw, reasoning=think)
        for payload in calls:
            try:
                parsed = json.loads(payload)
            except json.JSONDecodeError:
                chat.tool_result("error", {"error": "bad json"}); continue
            fn = TOOL_FNS.get(parsed.get("name"))
            res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error": "unknown"})
            if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments', {})}) = {res}")
            chat.tool_result(parsed.get("name"), res)
    return chat, "(max_steps reached)"


import jsonschema

MOVIE_SCHEMA = {
    "type": "object",
    "required": ["title", "year", "rating", "genres", "runtime_minutes"],
    "additionalProperties": False,
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer", "minimum": 1900, "maximum": 2030},
        "rating": {"type": "number", "minimum": 0, "maximum": 10},
        "genres": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "runtime_minutes": {"type": "integer", "minimum": 1, "maximum": 500},
    },
}

def extract_json(text):
    text = re.sub(r"^```(?:json)?", "", text.strip())
    text = re.sub(r"```$", "", text.strip())
    s = text.find("{")
    if s < 0: raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):
        if text[i] == "{": d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0: e = i; break
    if e < 0: raise ValueError("unbalanced braces")
    return json.loads(text[s:e+1])

def json_with_retry(prompt, schema, *, max_tries=3):
    sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
             "No markdown fences. No commentary. No <think> blocks.")
    chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
    last = None
    for i in range(max_tries):
        _, raw = chat.generate(enable_thinking=False, preset="instruct_general",
                               max_new_tokens=512, append_to_history=False)
        try:
            obj = extract_json(raw)
            jsonschema.validate(obj, schema)
            return obj, i + 1
        except Exception as e:
            last = str(e)
            chat.assistant(raw)
            chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
    raise RuntimeError(f"gave up after {max_tries}: {last}")


def benchmark(prompt, *, batch_sizes=(1, 2, 4), max_new_tokens=64):
    print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
    print("-" * 40)
    for bs in batch_sizes:
        gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
        msgs = [[{"role": "user", "content": prompt}] for _ in range(bs)]
        texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
                                               enable_thinking=False) for m in msgs]
        processor.tokenizer.padding_side = "left"
        inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
        torch.cuda.synchronize(); t0 = time.time()
        with torch.inference_mode():
            out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
        torch.cuda.synchronize(); dt = time.time() - t0
        new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
        vram = torch.cuda.max_memory_allocated() / 1e9
        print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")


def build_rag():
    from sentence_transformers import SentenceTransformer
    import numpy as np
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    KB = [
        "Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
        "Context length is 262,144 tokens natively, up to ~1M with YaRN.",
        "The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
        "Thinking mode wraps internal reasoning in <think>...</think> blocks.",
        "preserve_thinking=True keeps prior reasoning across turns for agents.",
        "Gated DeltaNet is a linear-attention variant in the hybrid layers.",
        "The model accepts image, video, and text input natively.",
        "Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
    ]
    KB_EMB = embedder.encode(KB, normalize_embeddings=True)
    def retrieve(q, k=3):
        qv = embedder.encode([q], normalize_embeddings=True)[0]
        return [KB[i] for i in np.argsort(-(KB_EMB @ qv))[:k]]
    return retrieve


def rag_answer(query, retrieve, k=3):
    ctx = retrieve(query, k)
    sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
    chat = QwenChat(model, processor, system=sys_m)
    chat.user(user)
    _, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
    return ans, ctx

We define higher-level utility functions that turn the model into a more complete application framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.
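The validate-and-retry pattern is easiest to see with the model stubbed out. The sketch below uses a hypothetical `replies` iterator in place of `chat.generate` and a hand-rolled `check` in place of full jsonschema validation, but the control flow mirrors `json_with_retry`:

```python
import json

# Hypothetical stand-in for the model: the first reply is a truncated
# markdown-fenced object (invalid JSON), the retry is clean JSON.
replies = iter(['```json\n{"title": "Inception"',
                '{"title": "Inception", "year": 2010}'])

def check(obj):
    # Minimal schema check standing in for jsonschema.validate.
    assert isinstance(obj.get("title"), str)
    assert isinstance(obj.get("year"), int)

def json_with_retry_sketch(max_tries=3):
    last = None
    for i in range(max_tries):
        raw = next(replies)
        try:
            obj = json.loads(raw)
            check(obj)
            return obj, i + 1          # parsed object plus attempt count
        except Exception as e:
            last = e                   # the real loop feeds str(e) back
    raise RuntimeError(f"gave up: {last}")

obj, tries = json_with_retry_sketch()
print(obj, tries)
```

The key design choice carried over from the real function is that the error message itself becomes the next user turn, so the model sees why its previous attempt was rejected.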

print("n" + "="*20, "§4 thinking-budget", "="*20) c = QwenChat(model, processor) c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "       "How many days until it escapes? Explain.") budget = ThinkingBudget(processor.tokenizer, budget=150) think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,                         stopping_criteria=StoppingCriteriaList([budget])) print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:n{ans or '(truncated)'}")   print("n" + "="*20, "§5 streaming split", "="*20) c = QwenChat(model, processor) c.user("Explain why transformers scale better than RNNs, in two short paragraphs.") print("[THINKING >>] ", end="", flush=True) first = [True] def _ot(x): print(x, end="", flush=True) def _oa(x):    if first[0]: print("nn[ANSWER >>] ", end="", flush=True); first[0] = False    print(x, end="", flush=True) c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,         on_thinking=_ot, on_answer=_oa); print()   print("n" + "="*20, "§6 vision", "="*20) IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg" c = QwenChat(model, processor) c.history.append({"role":"user","content":[    {"type":"image","image":IMG},    {"type":"text","text":"Describe this figure in one sentence, then state what it's asking."}]}) _, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300) print("Describe:", ans)   GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png" c = QwenChat(model, processor) c.history.append({"role":"user","content":[    {"type":"image","image":GRD},    {"type":"text","text": "Locate every distinct object. Reply ONLY with JSON "     "[{"label":...,"bbox_2d":[x1,y1,x2,y2]}, ...] 
in pixel coords."}]}) _, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800) print("Grounding:", ans[:600])   print("n" + "="*20, "§7 YaRN override", "="*20) YARN = {"text_config": {"rope_parameters": {    "mrope_interleaved": True, "mrope_section": [11,11,10],    "rope_type": "yarn", "rope_theta": 10_000_000,    "partial_rotary_factor": 0.25, "factor": 4.0,    "original_max_position_embeddings": 262_144}}} print(json.dumps(YARN, indent=2))

We begin running the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for model reloading.
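Applying the override at reload time amounts to merging it into the model's existing config. The sketch below shows that merge on illustrative base values (not the model's real config); only the override keys mirror the YARN dict above. With `factor` 4.0 over a 262,144-token base, the effective window approaches ~1M tokens:

```python
import copy

# Illustrative base config fragment -- placeholder values, not the real
# Qwen3.6 config.
base = {"text_config": {"rope_parameters": {"rope_type": "default",
                                            "rope_theta": 10_000_000}}}
# The YaRN override, as printed in the tutorial.
override = {"text_config": {"rope_parameters": {
    "rope_type": "yarn", "factor": 4.0,
    "original_max_position_embeddings": 262_144}}}

def deep_merge(dst, src):
    """Recursively merge src into a copy of dst, overwriting leaves."""
    out = copy.deepcopy(dst)
    for k, v in src.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

merged = deep_merge(base, override)
# rope_type becomes "yarn"; untouched keys like rope_theta survive.
print(merged["text_config"]["rope_parameters"])
```

The merged dict would then be handed to the model loader; the exact mechanism (editing the config object before `from_pretrained`, or passing overrides directly) depends on the transformers version, so treat this as a pattern rather than a guaranteed API.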

print("n" + "="*20, "§8 agent loop", "="*20) chat, final = run_agent(    "What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",    max_steps=4) print("nFINAL:", final)   print("n" + "="*20, "§9 structured JSON", "="*20) obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",                             MOVIE_SCHEMA) print(f"({tries} tries)", json.dumps(obj, indent=2))   print("n" + "="*20, "§10 MoE routing", "="*20) routers = [] for name, m in model.named_modules():    low = name.lower()    if (("gate" in low and ("moe" in low or "expert" in low)) or        low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):        routers.append((name, m)) print(f"found {len(routers)} router-like modules")   TOP_K = 8 counts = [Counter() for _ in routers] handles = [] def _mkhook(i):    def h(_m, _i, out):        lg = out[0] if isinstance(out, tuple) else out        if lg.dim() != 2: return        try:            for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():                counts[i][eid] += 1        except Exception: pass    return h for i,(_,m) in enumerate(routers): handles.append(m.register_forward_hook(_mkhook(i))) try:    c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")    c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40) finally:    for h in handles: h.remove() total = Counter() for c_ in counts: total.update(c_) print(f"distinct experts activated: {len(total)}") for eid, n in total.most_common(10): print(f"  expert #{eid:>3}  {n} fires")   print("n" + "="*20, "§11 benchmark", "="*20) benchmark("In one sentence, what is entropy?", batch_sizes=(1,2,4), max_new_tokens=48)   print("n" + "="*20, "§12 mini-RAG", "="*20) retrieve = build_rag() ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve) print("retrieved:"); [print(" -", c) for c in ctx] print("answer:", ans)   
print("n" + "="*20, "§13 save/resume", "="*20) c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.") _, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1) c.save("https://www.marktechpost.com/content/session.json") del c; gc.collect() r = QwenChat.load(model, processor, "https://www.marktechpost.com/content/session.json") r.user("Reverse the letters of that codeword.") _, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256) print("T2:", a2)   print("n✓ tutorial complete")

We continue with the remaining demonstrations that showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which expert-like router modules activate during inference, and measure tokens-per-second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the stored history.
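The retrieval step inside `build_rag` reduces to a dot product over unit-normalized embeddings: with unit vectors, the dot product is cosine similarity, and an argsort of the negated scores yields a top-k ranking. This toy sketch swaps the sentence-transformer vectors for hand-picked 2-D vectors to show just the ranking mechanics:

```python
import numpy as np

# Toy knowledge base with made-up 2-D "embeddings" standing in for the
# real sentence-transformer vectors.
KB = ["moe facts", "context length", "sampling tips"]
KB_EMB = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
KB_EMB /= np.linalg.norm(KB_EMB, axis=1, keepdims=True)  # unit-normalize rows

def retrieve(qv, k=2):
    qv = qv / np.linalg.norm(qv)                 # unit-normalize the query
    scores = KB_EMB @ qv                         # cosine similarity per entry
    return [KB[i] for i in np.argsort(-scores)[:k]]  # top-k by score

print(retrieve(np.array([1.0, 0.1])))  # ['moe facts', 'sampling tips']
```

This is the same math the `retrieve` closure performs, minus the embedding model; normalizing once at index time keeps each query to a single matrix-vector product.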

In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert routing behavior and measured throughput to understand the model's usability and performance. In the process, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.


Check out the Full Codes with Notebook here.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
