Experience

Data Intern (Gen AI)

Seaspan ULC

Jan 2025 - Apr 2025 | Vancouver, BC

Built a RAG pipeline for maintenance PDF search with page-level citations and tuned retrieval precision.

What I Did

I built a RAG pipeline that indexed 350 maintenance PDFs with Cohere embeddings stored in pgvector on PostgreSQL. The system let the operations team search manuals and get answers with page-level citations through a FastAPI interface. I swept chunk sizes from 256 to 1024 tokens and tuned the overlap stride to maximize precision at rank 5.
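The chunking step can be sketched as a sliding window over each page's tokens. This is a minimal illustration, not the production code: the names `Chunk`, `chunk_page`, and `chunk_document` are hypothetical, whitespace splitting stands in for the real tokenizer, and keeping chunks inside page boundaries is an assumed way to make each chunk map to exactly one page for citations.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass
class Chunk:
    page: int          # page number used for the citation
    start: int         # token offset of the chunk within the page
    tokens: List[str]  # the chunk's tokens


def chunk_page(page: int, tokens: List[str], chunk_size: int, overlap: int) -> Iterator[Chunk]:
    """Yield overlapping windows of `chunk_size` tokens from a single page."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    for start in range(0, len(tokens), stride):
        yield Chunk(page=page, start=start, tokens=tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the page


def chunk_document(pages: Sequence[Tuple[int, str]], chunk_size: int, overlap: int) -> List[Chunk]:
    """Chunk every (page_number, text) pair; chunks never span two pages."""
    chunks: List[Chunk] = []
    for page_number, text in pages:
        tokens = text.split()  # assumption: whitespace tokens as a stand-in tokenizer
        chunks.extend(chunk_page(page_number, tokens, chunk_size, overlap))
    return chunks
```

A sweep would call `chunk_document` once per candidate setting, for example chunk sizes between 256 and 1024 with several overlap values, then re-embed and re-evaluate each variant.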

Impact

The operations team could query all 350 maintenance manuals from a single interface and get answers with page-level citations, instead of searching through PDFs by hand. Tuning chunk size and overlap improved precision at rank 5, so more of the top five retrieved passages were relevant to the query.

What I Learned

I learned to design RAG pipelines end-to-end, including document chunking, embedding generation with Cohere, vector storage in pgvector, and retrieval ranking. Tuning chunk sizes and overlap taught me how these parameters affect retrieval quality. I gained experience with PostgreSQL's pgvector extension for similarity search.
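Precision at rank 5, the metric used for the tuning, is the fraction of the top five retrieved chunks that are relevant. The sketch below shows one way to score and compare settings; the function names and the `(chunk_size, overlap)` result layout are assumptions for illustration, and the relevance labels would come from a hand-labeled query set.

```python
from typing import Dict, Iterable, List, Set, Tuple


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k  # divides by k even if fewer than k results came back


def mean_precision_at_k(runs: Iterable[Tuple[List[str], Set[str]]], k: int = 5) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def best_config(results: Dict[Tuple[int, int], float]) -> Tuple[int, int]:
    """Return the (chunk_size, overlap) pair with the highest mean precision@k."""
    return max(results, key=results.get)
```

Dividing by `k` rather than by the number of results returned penalizes settings that retrieve fewer than five chunks, which is a design choice rather than the only valid definition.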

Key Highlights

  • Built a RAG pipeline that indexed 350 maintenance PDFs with Cohere embeddings in pgvector on PostgreSQL, letting the operations team search manuals and get answers with page-level citations through FastAPI.

  • Swept chunk sizes from 256 to 1024 tokens and tuned overlap stride to maximize precision at rank 5.

Tech Stack

Python, Cohere, pgvector, PostgreSQL, FastAPI, RAG

Tags

industry, data-engineering, genai, rag
