In this tutorial, we explore the FineWeb dataset through an advanced hands-on workflow. We stream a manageable sample of the […]
Category: Data Science
Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken
In this tutorial, we work with NVIDIA’s Nemotron-Pretraining-Code-v3 dataset as a large-scale metadata index for code pretraining research. Instead of […]
A Coding Guide to Implement a pgvector-Powered Semantic, Hybrid, Sparse, and Quantized Vector Search System
In this tutorial, we build a complete pgvector playground inside Google Colab and explore how PostgreSQL can work as a […]
How to Build Knowledge Graph Generation Pipelines From Text With kg-gen, NetworkX Analytics, and Interactive Visualizations
In this tutorial, we will generate knowledge graphs from plain text, conversations, and multiple source documents using kg-gen. We start […]
A Coding Guide Implementing SHAP Explainability Workflows with Explainer Comparisons, Maskers, Interactions, Drift, and Black-Box Models
In this tutorial, we implement SHAP workflows as a practical framework for interpreting machine learning models beyond basic feature-importance plots. […]
A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling
In this tutorial, we explore CuPy as a GPU-accelerated alternative to NumPy for high-performance numerical computing in Python. […]
A Coding Implementation to Portfolio Optimization with skfolio for Building, Testing, Tuning, and Comparing Modern Investment Strategies
In this tutorial, we explore skfolio, a scikit-learn compatible portfolio optimization library that helps us build, compare, and evaluate different […]
How to Build a Technical Analysis and Backtesting Workflow with pandas-ta-classic, Strategy Signals, and Performance Metrics
In this tutorial, we use pandas-ta-classic to build a complete technical analysis and trading strategy workflow. We […]
How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery
In this tutorial, we perform an advanced single-cell RNA-seq analysis workflow using Scanpy on the PBMC-3k benchmark dataset. We start […]
Why Gradient Descent Zigzags and How Momentum Fixes It
Gradient descent has a fundamental limitation: on most real-world loss surfaces, it is inefficient. When the surface has uneven curvature—steep […]
