A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models

In this tutorial, we focus on building a transparent and measurable evaluation pipeline for large language model applications using TruLens. Rather than treating LLMs as black boxes, we instrument each stage of an application so that inputs, intermediate steps, and outputs are captured as structured traces. We then attach feedback functions that quantitatively evaluate model behavior along dimensions such as relevance, grounding, and contextual alignment. By running multiple application variants under the same evaluation setup, we show how TruLens enables disciplined experimentation, reproducibility, and data-driven improvement of LLM systems.

!pip -q install trulens trulens-providers-openai chromadb openai

import os, re, getpass
from dataclasses import dataclass
from typing import List, Dict, Any
import numpy as np

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

from openai import OpenAI

from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
from trulens.apps.app import TruApp
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
from trulens.dashboard import run_dashboard

# Prompt for the key only when it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ")

We prepare the Colab environment by installing all required libraries and importing the core dependencies used throughout the tutorial. We read the OpenAI API key through a hidden input prompt, and only when it is not already set in the environment, to avoid hardcoding sensitive credentials. We also import the TruLens components that enable tracing, feedback evaluation, and dashboard visualization.

def normalize_ws(s: str) -> str:
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", s).strip()

RAW_DOCS = [
    {
        "doc_id": "trulens_core",
        "title": "TruLens core idea",
        "text": "TruLens is used to track and evaluate LLM applications. It can log app runs, compute feedback scores, and provide a dashboard to compare versions and investigate traces and results."
    },
    {
        "doc_id": "trulens_feedback",
        "title": "Feedback functions",
        "text": "TruLens feedback functions can score groundedness, context relevance, and answer relevance. They are configured by specifying which parts of an app record should be used as inputs."
    },
    {
        "doc_id": "trulens_rag",
        "title": "RAG workflow",
        "text": "A typical RAG system retrieves relevant chunks from a vector database and then generates an answer using those chunks as context. The quality depends on retrieval, prompt design, and generation behavior."
    },
    {
        "doc_id": "trulens_instrumentation",
        "title": "Instrumentation",
        "text": "Instrumentation adds tracing spans to your app functions (like retrieval and generation). This makes it possible to analyze which contexts were retrieved, latency, token usage, and connect feedback evaluations to specific steps."
    },
    {
        "doc_id": "vectorstores",
        "title": "Vector stores and embeddings",
        "text": "Vector stores index embeddings for text chunks, enabling semantic search. OpenAI embedding models can be used to embed chunks and queries, and Chroma can store them locally in memory for a notebook demo."
    },
    {
        "doc_id": "prompting",
        "title": "Prompting and citations",
        "text": "Prompting can encourage careful, citation-grounded answers. A stronger prompt can enforce: answer only from context, be explicit about uncertainty, and provide short citations that map to retrieved chunks."
    },
]

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    title: str
    text: str
    meta: Dict[str, Any]

def chunk_docs(docs, chunk_size=350, overlap=80) -> List[Chunk]:
    # Split each document into character windows that overlap their neighbors.
    chunks: List[Chunk] = []
    for d in docs:
        text = normalize_ws(d["text"])
        start = 0
        idx = 0
        while start < len(text):
            end = min(len(text), start + chunk_size)
            chunk_text = text[start:end]
            chunk_id = f'{d["doc_id"]}_c{idx}'
            chunks.append(
                Chunk(
                    chunk_id=chunk_id,
                    doc_id=d["doc_id"],
                    title=d["title"],
                    text=chunk_text,
                    meta={"doc_id": d["doc_id"], "title": d["title"], "chunk_index": idx},
                )
            )
            idx += 1
            # Step back by `overlap` so the next chunk repeats the tail of this one.
            start = end - overlap
            if start < 0:
                start = 0
            # Stop once the final window has reached the end of the text.
            if end == len(text):
                break
    return chunks

CHUNKS = chunk_docs(RAW_DOCS)

We define the raw knowledge sources and implement a reusable text-chunking pipeline. We normalize whitespace in each document and split the text into chunks of up to 350 characters with an 80-character overlap, so that content near a chunk boundary appears in both neighboring chunks and stays retrievable. We attach metadata (document ID, title, and chunk index) to each chunk so it can later be traced, evaluated, and cited during RAG execution.
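To make the overlap behavior concrete, here is a minimal standalone sketch of the same sliding-window idea applied to a short string. The function name `sliding_chunks` and the tiny sizes are illustrative assumptions, not part of the tutorial's code.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunks.append(text[start:end])
        if end == len(text):
            break  # the last window reached the end of the text
        start = end - overlap  # step back so neighbors share a tail
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(demo)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the property the overlap is meant to guarantee.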

EMBED_MODEL = "text-embedding-3-small"
embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name=EMBED_MODEL,
)

# In-memory Chroma client; the collection embeds documents with the function above.
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="trulens_demo_kb",
    embedding_function=embedding_function,
)

ids = [c.chunk_id for c in CHUNKS]
docs = [c.text for c in CHUNKS]
metas = [c.meta for c in CHUNKS]
collection.add(ids=ids, documents=docs, metadatas=metas)

oai_client = OpenAI()

def format_context(hits):
    # Render each hit as one "[C<i>] (title, doc_id, chunk=<n>): text" line.
    lines = []
    for i, h in enumerate(hits):
        meta = h["meta"]
        lines.append(
            f"[C{i}] ({meta.get('title','')}, {meta.get('doc_id','')}, chunk={meta.get('chunk_index','?')}): {h['text']}"
        )
    return "\n".join(lines)

We create the vector database using Chroma and OpenAI embeddings to enable semantic search over the chunked knowledge base. We insert all chunks into the collection and prepare the OpenAI client for downstream generation. We also define a context-formatting utility that converts retrieved chunks into a structured prompt-ready format.
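The context-formatting step can be tried without any API calls. The sketch below assumes hits shaped like the dictionaries this tutorial builds from the retrieval results (`id`, `text`, `meta`). The two sample hits are hand-written for illustration, and the second deliberately omits `chunk_index` to show the fallback.

```python
def format_context(hits):
    """Render each hit as one citation-tagged line, joined by newlines."""
    lines = []
    for i, h in enumerate(hits):
        meta = h["meta"]
        lines.append(
            f"[C{i}] ({meta.get('title', '')}, {meta.get('doc_id', '')}, "
            f"chunk={meta.get('chunk_index', '?')}): {h['text']}"
        )
    return "\n".join(lines)


# Hand-written sample hits (illustrative only, not real retrieval output).
sample_hits = [
    {
        "id": "trulens_core_c0",
        "text": "TruLens is used to track and evaluate LLM applications.",
        "meta": {"doc_id": "trulens_core", "title": "TruLens core idea", "chunk_index": 0},
    },
    {
        "id": "prompting_c0",
        "text": "Prompting can encourage careful, citation-grounded answers.",
        "meta": {"doc_id": "prompting", "title": "Prompting and citations"},
    },
]

context = format_context(sample_hits)
print(context)
```

The `[C0]`, `[C1]` tags are what a citation-oriented prompt can ask the model to reference in its answer.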

class RAG:
    def __init__(self, *, gen_model: str, prompt_style: str = "base", k: int = 4):
        self.gen_model = gen_model
        self.prompt_style = prompt_style
        self.k = k

    # Retrieval span: records the query text and the returned contexts.
    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve(self, query: str) -> list:
        res = collection.query(query_texts=[query], n_results=self.k)
        hits = []
        # Chroma returns one result list per query text; we sent a single query.
        for i in range(len(res["ids"][0])):
            hits.append(
                {
                    "id": res["ids"][0][i],
                    "text": res["documents"][0][i],
                    "meta": res["metadatas"][0][i],
                }
            )
        return hits

    # Generation span: wraps the chat completion call.
    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate(self, query: str, hits: list) -> str:
        if not hits:
            return "I don't have enough relevant information in the knowledge base to answer."

        context = format_context(hits)

        if self.prompt_style == "strict_citations":
            system = (
                "You are a careful assistant. Use ONLY the provided context. "
                "If the context is insufficient, say so. "
                "When you make a claim, cite it with [C#] tags matching the context chunks."
            )
            user = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer (with [C#] citations):"
        else:
            system = "You are a helpful assistant."
            user = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer using the context above:"

        resp = oai_client.chat.completions.create(
            model=self.gen_model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        out = resp.choices[0].message.content
        return out if out else "No answer returned."

    # Root span: ties the user query to the final answer for the whole record.
    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def query(self, query: str) -> str:
        hits = self.retrieve(query=query)
        return self.generate(query=query, hits=hits)

We implement the core RAG application with explicit instrumentation on retrieval, generation, and the request root. We capture queries, retrieved contexts, and generated outputs as traceable spans for later evaluation. We also support multiple prompt styles, allowing us to systematically compare different prompting strategies under identical conditions.
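For intuition about what span-style instrumentation captures, here is a toy decorator that logs each call's arguments, return value, and duration to an in-memory list. This is not how TruLens is implemented. The names `traced` and `SPANS` are hypothetical, and the sketch only illustrates the general pattern of wrapping pipeline steps so their inputs and outputs become inspectable.

```python
import functools
import time

SPANS = []  # hypothetical in-memory span log, not a TruLens structure


def traced(span_type):
    """Return a decorator that records one span per call of the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            SPANS.append({
                "span_type": span_type,
                "function": fn.__name__,
                "kwargs": kwargs,
                "return": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("retrieval")
def retrieve(query):
    return [f"chunk about {query}"]  # stand-in for a vector store lookup


@traced("generation")
def generate(query, hits):
    return f"Answer to '{query}' using {len(hits)} chunk(s)"  # stand-in for an LLM call


@traced("record_root")
def ask(query):
    hits = retrieve(query=query)
    return generate(query=query, hits=hits)


answer = ask(query="TruLens")
print([s["span_type"] for s in SPANS])  # ['retrieval', 'generation', 'record_root']
```

Spans are appended when a call finishes, so the inner steps appear before the root span that encloses them.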

session = TruSession()
session.reset_database()

EVAL_MODEL = "gpt-4o-mini"
provider = TruOpenAI(model_engine=EVAL_MODEL)

# Groundedness: is the answer supported by the retrieved contexts as a whole?
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons_consider_answerability,
        name="Groundedness",
    )
    .on_context(collect_list=True)
    .on_output()
    .on_input()
)

# Answer relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

# Context relevance: scored per retrieved context, then averaged.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on_context(collect_list=False)
    .aggregate(np.mean)
)

GEN_MODEL = "gpt-4o-mini"

rag_base = RAG(gen_model=GEN_MODEL, prompt_style="base", k=4)
rag_strict = RAG(gen_model=GEN_MODEL, prompt_style="strict_citations", k=4)

tru_base = TruApp(
    rag_base,
    app_name="TruLens-RAG",
    app_version="v1_base_prompt",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

tru_strict = TruApp(
    rag_strict,
    app_name="TruLens-RAG",
    app_version="v2_strict_citations",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

EVAL_QUERIES = [
    "What is TruLens used for?",
    "What are the three common RAG feedbacks to evaluate?",
    "Why does instrumentation matter in RAG evaluation?",
    "What role do embeddings play in a vector store?",
    "How can prompting encourage grounded answers?",
]

# Run both versions on the same queries so their records are comparable.
with tru_base as recording:
    for q in EVAL_QUERIES:
        rag_base.query(q)

with tru_strict as recording:
    for q in EVAL_QUERIES:
        rag_strict.query(q)

leaderboard = session.get_leaderboard()
print(leaderboard)

run_dashboard(session)

We configure the TruLens evaluation session and define feedback functions for groundedness, answer relevance, and context relevance. We run multiple versions of the RAG system across a shared evaluation set to generate comparable records. We then surface the results through the leaderboard and interactive dashboard to analyze performance differences and reasoning quality.
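The context relevance feedback above scores each retrieved context separately and then aggregates the scores with a mean. The sketch below shows that aggregation arithmetic in isolation. The scores are made-up numbers chosen for easy arithmetic, not real evaluation results, and the second averaging step (across records, per app version) is our assumption about how a per-version summary would be formed.

```python
from statistics import mean

# Made-up per-chunk context relevance scores for two records (not real results).
per_chunk_scores = {
    "record_1": [1.0, 0.5, 0.0, 0.5],
    "record_2": [1.0, 1.0, 0.5, 0.5],
}

# One score per record: the mean over that record's retrieved chunks.
record_scores = {rid: mean(scores) for rid, scores in per_chunk_scores.items()}

# One score per app version (assumed): the mean over that version's records.
version_score = mean(list(record_scores.values()))

print(record_scores)  # {'record_1': 0.5, 'record_2': 0.75}
print(version_score)  # 0.625
```

Because a mean hides spread, a single irrelevant chunk among several relevant ones lowers the record score only moderately, which is one reason to inspect individual traces as well as the summary.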

In conclusion, we established a practical workflow for understanding and evaluating LLM behavior beyond surface-level outputs. We demonstrated how instrumentation turns every model call into an inspectable artifact and how feedback functions convert subjective judgments into consistent metrics. Through versioned runs, leaderboards, and dashboards, we can compare design choices with clarity and confidence. This tutorial lays the groundwork for building reliable, auditable, and continuously improving LLM applications in real-world settings where trust and explainability matter as much as performance.


Michal Sutter

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
