Data Science Student
RBC Borealis (LSi – Cohort Fall 2025)
Built a PySpark pipeline on Databricks for eelgrass sensor data and implemented model monitoring with automated retraining.
What I Did
I was part of one of 8 teams selected across Canada for eelgrass mitigation research. I built a PySpark pipeline on Databricks that ingested 40k sensor frames, extracted signal quality features, and wrote scores to Delta Lake. I also implemented a model monitoring system that compared weekly predictions against held-out labels and triggered retraining jobs when precision or recall dropped below set thresholds.
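A minimal sketch of the monitoring check described above, assuming binary labels (1 = positive). It is an illustration, not the project code: the names `Thresholds`, `precision_recall`, and `needs_retraining` and the threshold values in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Thresholds:
    """Minimum acceptable metric values (hypothetical, set by the team)."""
    precision: float
    recall: float


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute precision and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a metric is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def needs_retraining(y_true: Sequence[int], y_pred: Sequence[int], thresholds: Thresholds) -> bool:
    """Return True when either metric falls below its threshold."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision < thresholds.precision or recall < thresholds.recall
```

In a weekly job, a call such as `needs_retraining(labels, predictions, Thresholds(precision=0.8, recall=0.6))` returning `True` would be the signal to launch the retraining job on the latest labeled data.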
Impact
The pipeline turned 40k sensor frames into quality scores stored in Delta Lake, making them available for downstream analysis. The monitoring system caught weekly drops in precision or recall below the set thresholds and automatically triggered retraining with the latest labeled data.
What I Learned
I gained hands-on experience with PySpark for distributed data processing and Databricks for orchestrating pipelines. I learned to design model monitoring systems that track precision and recall over time and trigger automated retraining. Working with Delta Lake taught me about versioned data storage and time travel for reproducibility.
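A short sketch of how Delta Lake time travel can pin a read to an earlier table state for reproducibility. It relies on Delta's `versionAsOf` and `timestampAsOf` reader options; the helper names and the table path in the usage note are hypothetical, and `spark` is assumed to be an existing SparkSession with Delta Lake available (as on Databricks).

```python
from typing import Dict, Optional


def time_travel_options(version: Optional[int] = None,
                        timestamp: Optional[str] = None) -> Dict[str, str]:
    """Build Delta reader options for a time-travel read.

    Pass at most one of `version` or `timestamp`; passing neither
    yields no options, which reads the latest table state.
    """
    if version is not None and timestamp is not None:
        raise ValueError("Specify either version or timestamp, not both")
    if version is not None:
        return {"versionAsOf": str(version)}
    if timestamp is not None:
        return {"timestampAsOf": timestamp}
    return {}


def read_scores_snapshot(spark, path: str,
                         version: Optional[int] = None,
                         timestamp: Optional[str] = None):
    """Read a Delta table at `path` as of a given version or timestamp."""
    options = time_travel_options(version, timestamp)
    return spark.read.format("delta").options(**options).load(path)
```

For example, `read_scores_snapshot(spark, "/tmp/quality_scores", version=3)` (placeholder path) would load the scores exactly as they were at table version 3, so an analysis can be rerun against the same data later.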
Key Highlights
Member of one of 8 teams selected across Canada for eelgrass mitigation research; built a PySpark pipeline on Databricks that ingested 40k sensor frames, extracted signal quality features, and wrote scores to Delta Lake.
Compared weekly model predictions against held-out labels and flagged when precision or recall fell below thresholds the team had set, triggering a retraining job with the latest labeled data each time.