Data Intern (Gen AI)
Seaspan ULC
Built a RAG pipeline for maintenance PDF search with page-level citations and tuned chunking to improve retrieval precision.
What I Did
I built a RAG pipeline that indexed 350 maintenance PDFs with Cohere embeddings stored in pgvector on PostgreSQL. The system let the operations team search manuals and get answers with page-level citations through a FastAPI interface. I swept chunk sizes from 256 to 1024 tokens and tuned the overlap stride to maximize precision at rank 5.
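The chunking step can be sketched in a few lines. This is a minimal illustration, assuming simple whitespace tokenization; the real pipeline would tokenize with the embedding model's own tokenizer, and the function name is hypothetical:

```python
# Sketch of fixed-size chunking with an overlap stride, as used when
# indexing the PDFs. Tokens are illustrative (whitespace split); a real
# pipeline would use the embedding model's tokenizer.
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into chunks of chunk_size with `overlap` shared
    tokens between consecutive chunks."""
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail of the document
    return chunks
```

Sweeping `chunk_size` from 256 to 1024 and varying `overlap` changes how much context each embedded chunk carries, which is what the precision-at-5 tuning measured.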
Impact
The operations team could search across 350 maintenance manuals and get answers with page-level citations in seconds, instead of manually paging through PDFs. Sweeping chunk sizes and tuning the overlap stride improved precision at rank 5, so the most relevant pages surfaced in the top results.
What I Learned
I learned to design RAG pipelines end-to-end, including document chunking, embedding generation with Cohere, vector storage in pgvector, and retrieval ranking. Tuning chunk sizes and overlap taught me how these parameters affect retrieval quality. I gained experience with PostgreSQL's pgvector extension for similarity search.
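The similarity search that pgvector performs can be sketched in plain Python. This is an illustrative stand-in, assuming cosine distance (pgvector's `<=>` operator); the `top_k` helper and the in-memory dict are hypothetical, not the actual PostgreSQL query:

```python
import math

# Toy version of a pgvector cosine-distance nearest-neighbour search.
# In production this runs as SQL (ORDER BY embedding <=> query LIMIT k);
# here the index is just a dict of id -> vector.
def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """Return the k (chunk_id, distance) pairs closest to the query."""
    ranked = sorted(
        ((cid, cosine_distance(query_vec, vec)) for cid, vec in index.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]
```

The retrieved chunk IDs map back to (manual, page) pairs, which is what makes page-level citations possible in the FastAPI responses.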
Key Highlights
Built a RAG pipeline that indexed 350 maintenance PDFs with Cohere embeddings in pgvector on PostgreSQL, letting the operations team search manuals and get answers with page-level citations through FastAPI.
Swept chunk sizes from 256 to 1024 tokens and tuned overlap stride to maximize precision at rank 5.
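The sweep described above can be sketched as a small evaluation loop. This is a hedged illustration: `retrieve` and the labeled query set are stand-ins for the real retrieval call and evaluation data, and the function names are hypothetical:

```python
# Sketch of the chunk-size/overlap sweep scored by precision at rank 5.
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved page IDs that are actually relevant."""
    top = retrieved[:k]
    return sum(1 for page in top if page in relevant) / k

def best_config(configs, labeled_queries, retrieve):
    """Pick the (chunk_size, overlap) pair with the highest mean precision@5.

    labeled_queries: list of (query, relevant_page_ids) pairs.
    retrieve: callable (query, config) -> ranked list of page IDs.
    """
    def mean_p5(config):
        scores = [precision_at_k(retrieve(q, config), relevant)
                  for q, relevant in labeled_queries]
        return sum(scores) / len(scores)
    return max(configs, key=mean_p5)
```

In practice each candidate configuration requires re-chunking and re-embedding the corpus, so the sweep is run offline against a fixed set of labeled queries.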