Data Intern
Seaspan ULC
Automated ETL pipelines and prototyped a RAG system for semantic search over internal documentation.
What I Did
I wrote Python scripts to automate data transformations that were previously done manually. The scripts connected to various data sources, cleaned and transformed the data, and loaded it into target databases. I also prototyped a Retrieval-Augmented Generation (RAG) system using LangChain and spaCy for semantic search across internal documentation and metadata.
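The real sources and schemas were internal, but the extract-transform-load loop can be sketched roughly as follows. All table and column names here are hypothetical stand-ins; the key idea shown is an idempotent load, so re-running the script does not duplicate rows.

```python
import sqlite3

# Hypothetical raw records, standing in for the internal data sources
# (the real scripts pulled from databases and exported files).
RAW_ROWS = [
    {"id": "001", "vessel": " Ever Given ", "teu": "20124"},
    {"id": "002", "vessel": "MSC Oscar",    "teu": "19224"},
    {"id": "001", "vessel": "Ever Given",   "teu": "20124"},  # duplicate
]

def transform(row):
    """Clean one record: trim whitespace, cast numeric fields."""
    return (row["id"], row["vessel"].strip(), int(row["teu"]))

def load(conn, rows):
    """Idempotent load: keyed on id, so reruns replace instead of duplicating."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vessels (id TEXT PRIMARY KEY, name TEXT, teu INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO vessels VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, [transform(r) for r in RAW_ROWS])
print(conn.execute("SELECT COUNT(*) FROM vessels").fetchone()[0])  # → 2
```

The duplicate source row collapses onto its primary key, which is what makes a nightly rerun safe.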
Impact
The ETL automation reduced manual data wrangling work by about 70%. The RAG prototype demonstrated natural language querying over documentation, though it was still experimental when my term ended.
What I Learned
I learned ETL design patterns: handling data quality issues and schema mismatches, and making transformations idempotent so pipelines can be rerun safely. The RAG prototype taught me about document chunking strategies, vector embeddings, and retrieval pipelines. I used spaCy for entity extraction and LangChain for orchestrating the retrieval and generation steps.
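Leaving aside the LangChain-specific wiring, the retrieval core of the prototype can be illustrated in miniature: split documents into overlapping chunks, embed each chunk, and rank chunks by cosine similarity to the query. The bag-of-words "embedding" below is a deliberately toy stand-in for a learned embedding model, and the sample document is invented for illustration.

```python
import math
from collections import Counter

def chunk(text, size=8, overlap=2):
    """Split a document into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: a term-frequency vector. A real pipeline
    would use a learned embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Rank chunks by similarity to the query; top-k feed the generator."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("Container vessels are tracked in the fleet database. "
       "ETL jobs refresh the fleet database nightly. "
       "Crew rotations are managed in a separate system.")
chunks = chunk(doc, size=8, overlap=2)
best = retrieve("when is the fleet database refreshed", chunks)[0]
print(best)  # the chunk containing the refresh schedule ranks first
```

The overlap between adjacent chunks matters: it keeps a sentence that straddles a chunk boundary retrievable from at least one window.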
Key Highlights
Automated bulk data transformations with Python and SQL to streamline ETL pipelines, reducing manual work by 70%; prototyped a RAG system with LangChain and spaCy for semantic search across semi-structured metadata and internal documentation.