Experience

Data Science Student

RBC Borealis (LSi – Cohort Fall 2025)

Oct 2025 – Dec 2025 · Toronto (Remote)

Built a PySpark pipeline on Databricks for eelgrass sensor data and implemented model monitoring with automated retraining.

What I Did

I was part of one of 8 teams selected across Canada for eelgrass mitigation research. I built a PySpark pipeline on Databricks that ingested 40k sensor frames, extracted signal quality features, and wrote scores to Delta Lake. I also implemented a model monitoring system that compared weekly predictions against held-out labels and triggered retraining jobs when precision or recall dropped below set thresholds.
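The weekly monitoring check described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes binary labels, and the names `weekly_metrics` and `needs_retraining` and the 0.80 default thresholds are hypothetical placeholders for whatever the team configured.

```python
from dataclasses import dataclass


@dataclass
class WeeklyMetrics:
    precision: float
    recall: float


def weekly_metrics(predictions, labels):
    """Compute precision and recall for binary (0/1) predictions against held-out labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    # Fall back to 0.0 when a denominator is empty (no positive predictions or labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return WeeklyMetrics(precision=precision, recall=recall)


def needs_retraining(metrics, min_precision=0.80, min_recall=0.80):
    """Return True when either metric falls below its threshold (thresholds are assumed values)."""
    return metrics.precision < min_precision or metrics.recall < min_recall


# Example week: 8 held-out samples.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0, 1, 1]
metrics = weekly_metrics(preds, truth)
retrain = needs_retraining(metrics)
```

In a Databricks setup, a `True` result would be the signal to launch the retraining job; how that job is triggered is outside the scope of this sketch.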

Impact

The pipeline turned 40k sensor frames into quality scores stored in Delta Lake, making them available for downstream analysis. The monitoring system kept model performance within the thresholds the team had set by automatically triggering retraining on the latest labeled data whenever precision or recall fell below them.

What I Learned

I gained hands-on experience with PySpark for distributed data processing and Databricks for orchestrating pipelines. I learned to design model monitoring systems that track precision and recall over time and trigger automated retraining. Working with Delta Lake taught me about versioned data storage and time travel for reproducibility.

Key Highlights

  • As part of one of 8 teams selected across Canada for eelgrass mitigation research, built a PySpark pipeline on Databricks that ingested 40k sensor frames, extracted signal quality features, and wrote scores to Delta Lake.

  • Compared weekly model predictions against held-out labels and flagged when precision or recall fell below thresholds the team had set, triggering a retraining job with the latest labeled data each time.

Tech Stack

PySpark · Databricks · Delta Lake · Signal Processing · Model Monitoring

Tags

research · ml · data-engineering · time-series
