Experience

Data Intern

Seaspan ULC

Jan 2025 – Apr 2025 · Vancouver, BC

Automated ETL pipelines and prototyped a RAG system for semantic search over internal documentation.

What I Did

I wrote Python scripts to automate data transformations that were previously done manually. The scripts connected to various data sources, cleaned and transformed data, and loaded it into target databases. I also prototyped a RAG (Retrieval-Augmented Generation) system using LangChain and spaCy for semantic search across internal documentation and metadata.
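A minimal sketch of the extract-clean-load pattern described above, using SQLite so it runs self-contained. The table names, columns, and cleaning rules here are illustrative placeholders, not the actual Seaspan pipeline:

```python
import sqlite3

def run_etl(conn: sqlite3.Connection) -> int:
    """Extract raw records, clean them, and load them idempotently."""
    # Extract: pull raw rows from a staging table (hypothetical schema).
    rows = conn.execute("SELECT id, name, port FROM staging_vessels").fetchall()

    # Transform: trim whitespace, normalize casing, drop rows missing a name.
    cleaned = [
        (rid, name.strip().title(), (port or "").strip().upper())
        for rid, name, port in rows
        if name and name.strip()
    ]

    # Load: INSERT OR REPLACE keyed on the primary key makes the load
    # idempotent -- re-running the pipeline never duplicates rows.
    conn.executemany(
        "INSERT OR REPLACE INTO vessels (id, name, port) VALUES (?, ?, ?)",
        cleaned,
    )
    conn.commit()
    return len(cleaned)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_vessels (id INTEGER, name TEXT, port TEXT)")
    conn.execute("CREATE TABLE vessels (id INTEGER PRIMARY KEY, name TEXT, port TEXT)")
    conn.executemany(
        "INSERT INTO staging_vessels VALUES (?, ?, ?)",
        [(1, "  pacific star ", "yvr"), (2, None, "sea"), (1, "pacific star", "yvr")],
    )
    run_etl(conn)
    run_etl(conn)  # second run is a no-op thanks to the idempotent load
    print(conn.execute("SELECT COUNT(*) FROM vessels").fetchone()[0])  # 1
```

The idempotent load is the key design choice: because the target write is keyed on the primary key, the script can be re-run safely after a partial failure.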

Impact

The ETL automation reduced manual data wrangling work by about 70%. The RAG prototype demonstrated natural language querying over documentation, though it was still experimental when my term ended.

What I Learned

I learned ETL design patterns for handling data-quality issues, schema mismatches, and idempotent transformations. Building the RAG prototype taught me about document chunking strategies, vector embeddings, and retrieval pipelines; I used spaCy for entity extraction and LangChain to orchestrate the retrieval and generation steps.
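One chunking strategy mentioned above, fixed-size windows with overlap, can be sketched in a few lines. The sizes are illustrative defaults, not the values used in the prototype:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.

    The overlap keeps sentences that straddle a chunk boundary fully
    contained in at least one chunk, so they remain retrievable.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

if __name__ == "__main__":
    chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
    print(len(chunks))  # 3 windows: 0-200, 150-350, 300-500
```

In practice a production splitter (e.g. LangChain's text splitters) would also respect sentence and paragraph boundaries; this character-window version just shows why overlap matters.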

Key Highlights

  • Automated bulk data transformations with Python/SQL to optimize ETL pipelines, reducing manual work by 70%.
  • Prototyped a RAG system with LangChain and spaCy for semantic search across semi-structured metadata sites and internal documentation.

Tech Stack

Python · SQL · Databricks · LangChain · spaCy · RAG · ETL

Tags

industry · data-engineering · genai · rag
