jcsteh, as I understand it, with all current LLMs, having a conversation involves feeding the model the entire conversation up to that point. That is, there is no memory: the prompt you feed it just gets longer and longer. So how does that work with something like GPT-4o, which could be processing audio and/or video at a much faster rate? Surely the prompts must get very large very quickly for anything beyond a short interaction? Doesn't that mean the responses take longer and cost more as the conversation goes on?
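
To illustrate the pattern I mean, here's a rough sketch of a stateless chat loop using the OpenAI Python SDK (the model name and details are just illustrative, and this says nothing about how GPT-4o handles audio/video internally): each turn appends to a `messages` list, and the entire list gets re-sent on every call.

```python
# Minimal sketch of a stateless chat loop (OpenAI-style chat API; model name illustrative).
# Every turn re-sends the ENTIRE history, so the prompt grows with each exchange.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_text = input("> ")
    messages.append({"role": "user", "content": user_text})

    # The full transcript goes into the prompt each time; nothing is
    # "remembered" server-side between calls.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(reply)
    print(f"(history now {len(messages)} messages)")
```

If billing is per input token, the tokens sent on turn n include everything from turns 1 through n-1, which is exactly why I'd expect latency and cost to climb as the conversation lengthens.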