Retrieval-augmented generation, or RAG, lets a model answer from information it did not memorize during training. The model still generates the response, but the retrieval layer decides what context it gets to see first.
The prompt becomes something machines can search.
People read a question as language. AI systems first turn it into tokens, then often into embeddings: numerical representations that capture meaning well enough to compare one passage against another.
That is the first useful mental shift. Retrieval does not depend on exact keyword matches alone. A good embedding can connect related ideas even when the wording is different.
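As a rough illustration of comparing meaning rather than keywords, the sketch below scores passages with cosine similarity. The three-dimensional vectors are written by hand and are purely hypothetical; a real embedding model would produce vectors with hundreds or thousands of dimensions, and the example sentences are assumptions added for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-written embeddings, not output from any real model.
query = [0.9, 0.1, 0.0]      # "How do I reset my password?"
related = [0.8, 0.2, 0.1]    # "Steps to recover account access"
unrelated = [0.0, 0.1, 0.9]  # "Quarterly revenue summary"

related_score = cosine_similarity(query, related)
unrelated_score = cosine_similarity(query, unrelated)

# The related passage shares no keywords with the query,
# yet its vector points in nearly the same direction.
print(related_score > unrelated_score)
```

The point of the toy vectors is only that closeness in the vector space, not shared wording, drives the match.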
Retrieval is a ranking problem.
Once the question has an embedding, the system compares it against a store of embedded documents, notes, tickets, specs, source files, or other chunks of knowledge, and ranks those chunks by how closely they match.
The goal is not to retrieve everything. The goal is to retrieve context relevant enough to improve the answer. That is why chunking, metadata, deduplication, source quality, ranking, and freshness can matter as much as the model choice.
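One way to picture retrieval as ranking is the minimal sketch below, which drops stale and duplicate chunks and keeps the top matches. The `Chunk` fields, the `retrieve` function, the `max_age_days` cutoff, and the sample data are all hypothetical names and values chosen for illustration. A production system would typically use a vector index rather than a linear scan.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list
    source: str    # metadata: where the chunk came from
    age_days: int  # metadata: how old the chunk is


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_embedding, chunks, k=2, max_age_days=365):
    """Return the k best (score, chunk) pairs after freshness and duplicate filtering."""
    seen_texts = set()
    candidates = []
    for chunk in chunks:
        if chunk.age_days > max_age_days:  # freshness filter
            continue
        if chunk.text in seen_texts:       # exact-duplicate filter
            continue
        seen_texts.add(chunk.text)
        candidates.append((cosine(query_embedding, chunk.embedding), chunk))
    # Sort on the score only, highest first.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:k]


chunks = [
    Chunk("Reset your password from the login page.", [0.9, 0.1, 0.0], "help-center", 30),
    Chunk("Reset your password from the login page.", [0.9, 0.1, 0.0], "wiki-copy", 45),
    Chunk("Old password reset flow (retired).", [0.85, 0.15, 0.0], "archive", 900),
    Chunk("Contact support to unlock an account.", [0.6, 0.4, 0.1], "help-center", 10),
    Chunk("Quarterly revenue summary.", [0.0, 0.1, 0.9], "finance", 5),
]

top = retrieve([0.9, 0.1, 0.0], chunks, k=2)
for score, chunk in top:
    print(round(score, 2), chunk.source, chunk.text)
```

Here the archived chunk scores well on similarity but is excluded for age, and the wiki copy is skipped as a duplicate, which is the sense in which metadata and deduplication shape the result as much as the similarity score does.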
The model answers from a rebuilt prompt.
After retrieval, the system usually rebuilds the prompt from system instructions, the selected context, and the user's question. The model is still generating token by token, but now it has source material in the active context window instead of relying only on what it learned during training.
That does not mean the model permanently learned the retrieved facts. It means the system placed the right information in front of it at the moment of use.
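A minimal sketch of the rebuilt prompt might look like the following. The `build_prompt` function, the section labels, and the sample strings are assumptions made for illustration; real systems vary in layout and often use a chat-message structure instead of one string.

```python
def build_prompt(system_instructions, context_chunks, question):
    """Assemble one prompt string from instructions, retrieved context, and the question."""
    # Number each chunk so the model (and a reader) can refer back to it.
    numbered = [f"[{i}] {text}" for i, text in enumerate(context_chunks, start=1)]
    context_block = "\n".join(numbered)
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Answer using only the context provided.",
    [
        "Reset your password from the login page.",
        "Contact support to unlock an account.",
    ],
    "How do I reset my password?",
)
print(prompt)
```

Nothing here changes the model itself. The retrieved text exists only in this one prompt, which is why the same question asked later without retrieval would not benefit from it.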
The hard part is the system around the model.
RAG fails when retrieval returns stale, noisy, overly broad, or irrelevant context. It also fails when the model is not instructed clearly enough about what to do with that context.
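Both failure modes can be guarded against in a small way, as sketched below: drop weakly matching chunks, and tell the model explicitly what to do with the context, including when there is none. The `min_score` threshold of 0.75, the function names, and the instruction wording are hypothetical and would need tuning against real data.

```python
GROUNDING_RULES = (
    "Answer using only the context below. "
    "If the context does not contain the answer, say you do not know."
)

NO_CONTEXT_RULES = (
    "No relevant context was found. "
    "Say that you do not know rather than guessing."
)


def select_context(scored_chunks, min_score=0.75, max_chunks=3):
    """Keep at most max_chunks of the best-scoring texts that reach min_score."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked[:max_chunks] if score >= min_score]


def instructions_for(context):
    """Pick instructions that match whether any usable context survived filtering."""
    return GROUNDING_RULES if context else NO_CONTEXT_RULES


scored = [
    (0.91, "Reset your password from the login page."),
    (0.40, "Quarterly revenue summary."),
    (0.80, "Contact support to unlock an account."),
]

context = select_context(scored)
print(context)
print(instructions_for(context))
print(instructions_for(select_context([(0.2, "Unrelated note.")])))
```

A fixed threshold is a blunt tool, but it makes the trade-off visible: passing less context is often better than passing noise.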
Strong systems make retrieval, prompting, evaluation, and user experience work together, because a weakness in any one of them shows up in the final answer.
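Evaluation is the piece most easily skipped. One common starting point, sketched here under the assumption that someone has labeled which documents are relevant to each test question, is recall at k: the share of relevant documents that appear in the top k results. The document ids are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear among the first k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking returned by a retriever, best match first.
retrieved = ["d3", "d7", "d1", "d9"]
# Hypothetical human labels for the same question.
relevant = ["d1", "d3"]

print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=3))
```

Tracking a number like this across changes to chunking or ranking gives a way to tell whether the retrieval layer actually improved, separately from how the generated answers read.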
The model matters, but the system around the model is where most of the leverage lives.