Cracked AI Engineering

OpenAI Files as Input

When you use ChatGPT, you have the option to attach files to share more context for better responses. This is a super powerful and efficient way to share large amounts of data with the LLM. How exactly does that work under the hood? And how would I build a similar feature for my own AI agent application?

Before we get to the answer, we have to understand the depth of the problem first.

Small Context Windows

Initially the publicly available foundation models had very small context windows (the limit on prompt size), so AI engineers were forced to get creative with the space they had. This brought about the concept of retrieval augmented generation (RAG), which retrieves only the most relevant chunks of data to include in the prompt, maximizing useful context while minimizing overall prompt size. The alternative was fine-tuning models, which was difficult and costly, so in comparison RAG was like magic! As a result everyone started building RAG pipelines. It went like this: embed all data that might be useful as context (docs, email, PDFs, etc.) in a vector database, then, before sending a user prompt to the foundation model, query the vector database and append the relevant context chunks it returns. LLMs magically started responding with domain-specific information as if they had been trained on it.
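To make that pipeline concrete, here is a minimal sketch of the retrieval step using the OpenAI Python SDK. The embedding model, chat model, chunking, and top-k value are illustrative assumptions, and a real system would use a proper vector database instead of an in-memory array.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
EMBED_MODEL = "text-embedding-3-small"  # illustrative choice of embedding model

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts and return an (n, d) array of vectors."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

# Stand-in for the vector database: pre-chunked docs, emails, PDFs, etc.
documents = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
doc_vectors = embed(documents)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed([query])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question: str) -> str:
    """Append the retrieved chunks to the prompt before calling the model."""
    context = "\n\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```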

Large Context Windows

Things have changed significantly over the last few years: context windows have grown enormously. So RAG is dead, right? Not quite. It's less critical now, but many AI applications still use RAG to minimize the data they include in the context window. AI engineers have to decide whether to stuff the context window or invest in techniques like RAG to condense the context.

The Tradeoff

The tradeoff these days is more about engineering effort and cost versus performance.

The easiest option is to stuff the context window, but that has downsides. We've found that too much context introduces noise, which can degrade output quality. It's also cost inefficient: we pay per token, so all the extra, useless data we stuff into the prompt racks up a bill.
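A rough back-of-the-envelope comparison shows how quickly that adds up. The per-token price below is a hypothetical placeholder, not a real OpenAI rate, and the prompt sizes are made up for illustration.

```python
# Hypothetical input price, purely for illustration (not a real rate).
PRICE_PER_1K_INPUT_TOKENS = 0.005  # dollars

def daily_input_cost(prompt_tokens: int, requests_per_day: int = 10_000) -> float:
    """Daily input cost for a given prompt size at the assumed rate."""
    return prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_day

print(daily_input_cost(2_000))    # lean, retrieved-context prompt
print(daily_input_cost(100_000))  # stuffed context window: 50x the cost per request
```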

A prompt that is high in relevant context but low in data volume is what we want: it's cost efficient and returns quality results, but it comes at the cost of engineering time and effort. To get it you need to build RAG pipelines and maintain a vector database of embeddings.

Hence the tradeoff: is the quality degradation and cost increase tolerable if it means avoiding the extra engineering overhead for this application?

Offloading the Tradeoff: Files as Input

There's yet another option available to us that combines some of the techniques above and sidesteps the tradeoff to some degree: files as input. This is OpenAI's solution to the problem for ChatGPT. Users are limited by the size of the prompt (far more than we are when using the API), so they are encouraged to upload files to provide context. Behind the scenes, OpenAI then manages the RAG step to pull the relevant context out of the files.
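Here is a minimal sketch of what that flow can look like from the API side, using the OpenAI Python SDK's file search tool with the Responses API. The vector store name, file, and model are illustrative, and exact method paths may differ across SDK versions, so treat this as a sketch rather than the definitive recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Create a vector store and upload the file; OpenAI chunks and embeds it for us.
vector_store = client.vector_stores.create(name="support-docs")
client.vector_stores.files.upload_and_poll(
    vector_store_id=vector_store.id,
    file=open("product_manual.pdf", "rb"),  # illustrative file
)

# 2. Ask a question with the file_search tool attached; OpenAI runs retrieval
#    behind the scenes and feeds only the relevant chunks to the model.
response = client.responses.create(
    model="gpt-4o",  # illustrative model choice
    input="What does the manual say about resetting the device?",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
)

print(response.output_text)
```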

This option is desirable for us as AI engineers because we can avoid the tradeoff almost completely by offloading the RAG step to OpenAI. Tokens and cost stay in check because files don't count towards input tokens, and engineering cost stays low because we don't own the RAG pipelines and vector databases ourselves.

It doesn't come without its own tradeoffs, though: any time we offload an engineering workload, we lose control of it. That means the RAG step OpenAI performs is a black box to us. Personally, I'll take that bet; I'd guess the quality of their RAG pipeline is better than whatever I could pull together for my own application.

Summary

Here's the deal: we used to build RAG pipelines because context windows were tiny. Now they're huge, but RAG is still useful for keeping costs down and quality up. OpenAI's Files as Input feature gives us the best of both worlds: we get high-quality RAG without the engineering overhead. Sure, we lose some control, but we gain OpenAI's expertise. For most applications, that's a tradeoff worth making.