Research

Infrastructure experiments, performance studies, and operational findings from hands-on AI systems work.

Research Areas

GPU Infrastructure

  • CUDA optimization and kernel profiling
  • GPU utilization patterns under inference loads
  • Thermal and power characteristics
  • Multi-GPU communication and scaling

Model Serving

  • vLLM deployment and tuning
  • Batch size vs. latency trade-offs
  • Quantization impact on throughput
  • KV cache optimization strategies

Storage Systems

  • Model loading performance (NVMe vs. HDD)
  • Dataset I/O patterns
  • Filesystem benchmarking (ext4, XFS, ZFS)
  • Storage tiering strategies

Network Performance

  • Inference API latency profiling
  • Network topology for distributed workloads
  • Bandwidth requirements for model synchronization
  • Multi-node communication overhead

Virtualization

  • GPU passthrough performance
  • Container vs. bare metal inference
  • Resource isolation strategies
  • Overhead measurements

Monitoring & Observability

  • GPU metrics collection (DCGM)
  • Inference latency tracking
  • Resource utilization dashboards
  • Alerting strategies for AI workloads

Active Experiments

GPU Baseline Characterization

Experiment ID: EXP-001 | Started: 2026-02-01 | Status: Active

Establishing baseline performance metrics for GPU compute under sustained LLM inference workloads. Measuring throughput, latency, power consumption, and thermal behavior across different model sizes and batch configurations.

Tags: NVIDIA GPU, vLLM, Llama 3.3 70B, DCGM
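A minimal telemetry sketch along these lines, using the NVML Python bindings (pynvml) rather than DCGM itself; the 1-second sampling interval, single-GPU index, and CSV output path are illustrative assumptions, not the experiment's actual collection setup:

    # Sample utilization, memory, power, and temperature once per second and
    # append rows to a CSV for later baseline analysis.
    import csv
    import time

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only (assumed single-GPU run)

    with open("gpu_baseline_samples.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "util_pct", "mem_used_mib", "power_w", "temp_c"])
        try:
            while True:
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
                temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                writer.writerow([time.time(), util.gpu, mem.used // (1024 * 1024), power_w, temp_c])
                f.flush()
                time.sleep(1.0)  # 1 s sample interval (placeholder)
        except KeyboardInterrupt:
            pass

    pynvml.nvmlShutdown()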

Storage Tier Performance Study

Experiment ID: EXP-002 | Started: 2026-02-05 | Status: Active

Comparing model loading times and I/O patterns across NVMe, SATA SSD, and HDD storage. Evaluating filesystem performance (ext4, XFS, ZFS) for large model weight files and dataset access patterns.

Tags: NVMe, fio, Model Loading, Benchmarking
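As a rough companion to the fio runs, a sketch of timing a cold sequential read of a single weight file; the file path and the 64 MiB chunk size are placeholder assumptions, and the page cache should be dropped between runs for cold-read numbers:

    # Stream a large weight file in fixed-size chunks and report wall-clock
    # bandwidth. Drop the page cache between runs (echo 3 > /proc/sys/vm/drop_caches)
    # so results reflect the storage tier, not cached reads.
    import time

    WEIGHTS_PATH = "/models/llama-3.3-70b/model.safetensors"  # hypothetical path
    CHUNK_BYTES = 64 * 1024 * 1024  # 64 MiB reads (placeholder)

    def time_sequential_read(path: str) -> tuple[float, int]:
        """Return (elapsed_seconds, bytes_read) for one full sequential pass."""
        total = 0
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:  # unbuffered raw I/O
            while True:
                chunk = f.read(CHUNK_BYTES)
                if not chunk:
                    break
                total += len(chunk)
        return time.perf_counter() - start, total

    if __name__ == "__main__":
        elapsed, nbytes = time_sequential_read(WEIGHTS_PATH)
        gib = nbytes / 2**30
        print(f"read {gib:.1f} GiB in {elapsed:.1f} s -> {gib / elapsed:.2f} GiB/s")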

Inference Latency Optimization

Experiment ID: EXP-003 | Started: 2026-02-08 | Status: Planning

Investigating latency reduction techniques including continuous batching, speculative decoding, and KV cache tuning. Measuring p50, p95, and p99 latency under varying load conditions.

Tags: vLLM, Continuous Batching, Latency, Load Testing
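A sketch of how the percentile measurement might be driven, assuming a local vLLM server exposing its OpenAI-compatible completions endpoint; the URL, model id, prompt, request count, and concurrency level are illustrative values, not the final load profile:

    # Fire N concurrent completion requests and compute p50/p95/p99 latency.
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM server
    MODEL = "meta-llama/Llama-3.3-70B-Instruct"        # assumed model id
    CONCURRENCY = 16
    NUM_REQUESTS = 200

    def one_request(_: int) -> float:
        """Send one completion request and return wall-clock latency in seconds."""
        payload = {"model": MODEL,
                   "prompt": "Explain KV caching in one sentence.",
                   "max_tokens": 64}
        start = time.perf_counter()
        requests.post(ENDPOINT, json=payload, timeout=120).raise_for_status()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(one_request, range(NUM_REQUESTS)))

    def pct(p: float) -> float:
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    print(f"p50={pct(0.50):.3f}s  p95={pct(0.95):.3f}s  p99={pct(0.99):.3f}s  "
          f"mean={statistics.mean(latencies):.3f}s")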

Experiment Methodology

Measurement Principles

  • Benchmark on actual hardware, not cloud instances
  • Run multiple iterations to account for variance (see the sketch after this list)
  • Document environmental conditions (temperature, load)
  • Use production-representative workloads
  • Isolate variables to measure specific impacts
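A minimal illustration of the repeated-runs principle: wrap any benchmark callable, run it several times, and report the mean and standard deviation so single-run noise is not mistaken for a real effect. The iteration count and the dummy workload are placeholders.

    import statistics
    import time

    def run_repeated(benchmark, iterations: int = 5) -> tuple[float, float]:
        """Run `benchmark()` `iterations` times; return (mean_s, stdev_s)."""
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            benchmark()
            samples.append(time.perf_counter() - start)
        return statistics.mean(samples), statistics.stdev(samples)

    if __name__ == "__main__":
        # Dummy workload standing in for a real inference or I/O benchmark.
        mean_s, stdev_s = run_repeated(lambda: sum(i * i for i in range(1_000_000)))
        print(f"mean={mean_s:.4f}s  stdev={stdev_s:.4f}s over 5 runs")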

Documentation Standards

  • Record hypothesis, methodology, and results
  • Include hardware specs and software versions
  • Document failures and negative results
  • Share reproducible benchmark scripts
  • Link to raw data and analysis notebooks

Publications & Findings

Experiments in Progress

Research findings and experiment reports will be published here as work progresses. Initial experiments are currently in the baseline measurement phase.

Research Philosophy

Negative Results Matter

Failed experiments are documented with the same rigor as successful ones. Knowing what doesn't work is as valuable as knowing what does.

Measure, Don't Assume

All performance claims are backed by empirical measurements on real hardware. No theoretical estimates, and no vendor benchmarks accepted without independent validation.

Reproducibility First

Experiments include complete methodology, hardware specifications, and benchmark scripts to enable reproduction of results.

Share Findings Publicly

All research is documented and shared via GitHub. The goal is to contribute to the broader infrastructure engineering community.

Follow the Research

Experiment documentation, findings, and methodology are maintained in the GitHub repository. New results are published as experiments progress.