jcsteh, as I understand it, with all current LLMs, having a conversation involves feeding the model the entire conversation up to that point. That is, there is no memory: the prompt you feed it just gets longer and longer. So how does that work with something like GPT-4o, which could be processing audio and/or video at a much faster rate? Surely the prompts must get very large very quickly for anything beyond a short interaction? Doesn't that mean the responses take longer and cost more as the conversation goes on?
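
To illustrate the pattern I mean, here's a rough sketch of a stateless chat loop using the OpenAI Python SDK (the model name and details are just illustrative, and this says nothing about how GPT-4o handles audio/video internally): each turn appends to a `messages` list, and the entire list gets re-sent on every call.

```python
# Minimal sketch of a stateless chat loop (OpenAI-style chat API; model name illustrative).
# Every turn re-sends the ENTIRE history, so the prompt grows with each exchange.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_text = input("> ")
    messages.append({"role": "user", "content": user_text})

    # The full transcript goes into the prompt each time; nothing is
    # "remembered" server-side between calls.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(reply)
    print(f"(history now {len(messages)} messages)")
```

If billing is per input token, the tokens sent on turn n include everything from turns 1 through n-1, which is exactly why I'd expect latency and cost to climb as the conversation lengthens.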