In an age of LLMs, is it time to reconsider human-edited web directories?
Back in the early-to-mid '90s, one of the main ways of finding anything on the web was to browse through a web directory.
These directories generally had a list of categories on their front page. News/Sport/Entertainment/Arts/Technology/Fashion/etc.
Each of those categories had subcategories, and sub-subcategories that you clicked through until you got to a list of websites. These lists were maintained by actual humans.
Typically, these directories also had a limited web search that would crawl through the pages of websites listed in the directory.
Lycos, Excite, and of course Yahoo all offered web directories of this sort.
(EDIT: I initially also mentioned AltaVista. It did offer a web directory by the late '90s, but this was something it tacked on much later.)
By the late '90s, the standard narrative goes, the web got too big to index websites manually.
Google promised the world its algorithms would weed out the spam automatically.
And for a time, it worked.
But then SEO and SEM became a multi-billion-dollar industry. The spambots proliferated. Google itself began promoting its own content and advertisers above search results.
And now with LLMs, the industrial-scale spamming of the web is likely to grow exponentially.
My question is, if a lot of the web is turning to crap, do we even want to search the entire web anymore?
Do we really want to search every single website on the web?
Or just those that aren't filled with LLM-generated SEO spam?
Or just those that don't feature 200 tracking scripts, and passive-aggressive privacy warnings, and paywalls, and popovers, and newsletters, and increasingly obnoxious banner ads, and dark patterns to prevent you cancelling your "free trial" subscription?
At some point, does it become more desirable to go back to search engines that only crawl pages on human-curated lists of trustworthy, quality websites?
And is it time to begin considering what a modern version of those early web directories might look like?
It is absolutely astounding to me that we are still earnestly entertaining the possibility that #ChatGPT and #LLMs more broadly have a role in scientific writing, manuscript review, experimental design, etc.
There is a massive amount of training data covering the question below. It's a very easy question if you're trained on the entire internet.
Question: What teams have never made it to the World Series?
#Threads is not a text sharing platform, nor a #SocialMedia app. It's a platform for people to create natural language examples Meta can use for training #LLMs, for free
Aside from social media divides, there is a HUGE divide in tech I'm seeing now - pro-AI (LLM) and anti-AI/LLM. People saying it's making awful code and causing other issues, and then the companies raving about adding it to things and demonstrations of what it can do. I seriously saw one after the other a couple times today 😬🤣
i’ll say it — #LLMs can and will spit out any topic they’ve been trained on
an absurd amount of research is going into preventing the #LLM from explaining how to make a bomb, when they could just do some dumb tricks and remove the “how to make a bomb” manuals from the training corpus.
I’ve been very puzzled lately by how quickly some of my social circles, as they get to be 40-50 years old, seem to have closed their minds to new concepts in general and the youth in particular.
Concretely of course within the context of #llms, where I get so many takes that llms will replace junior engineers but not them, that kids will become lazy and not learn to distinguish truth from hallucination, etc…
It’s striking to me how strongly people feel about the term “artificial intelligence” in application to #LLMs. There seems to be a fairly widespread sense that the term isn’t just unhelpful but somehow factually deeply ‘wrong’.
Setting aside that AI is an established term, that intuition seems at odds to me with how language works. To see this imagine “artificial intelligence” is an entirely novel, never before uttered, noun compound. 1/n
Yesterday's maintenance work on #unpaper is something that to me clearly shows the point I was making about the opportunities arising in treating specific #LLMs as Computer-Aided Software Engineering (CASE) tools, so I thought I would post a quick thread here, since I don't think I'll manage to post it on the blog any time soon.
Full disclosure before I start: I work for Meta, which clearly has been betting a lot on AI — but this is my personal point of view, and I don't work on AI projects.
Last night I came up with (and implemented!) an idea for a mastodon client that automatically curates my feed by categorizing toots. It's just idea phase right now, but I wrote about the process here https://timkellogg.me/blog/2023/12/19/fossil #LLMs #AI #feditips
I just issued a data deletion request to #StackOverflow to erase all of the associations between my name and the questions, answers and comments I have on the platform.
One of the key ways in which #RAG works to supplement #LLMs is based on proven associations. Higher ranked Stack Overflow members' answers will carry more weight in any #LLM that is produced.
By asking for my name to be disassociated from the textual data, it removes a semantic relationship that is helpful for determining which tokens of text to use in an #LLM.
If you sell out your user base without consultation, expect a backlash.
You know that #BigTech loses millions of dollars through their deployed #AI systems, right? You can expect a much higher price for using their #LLMs in the future - be it your privacy or your money.
So instead of learning proompt engineering, why not do something more useful and invest your time into learning a new #ProgrammingLanguage:
#Rust - a language empowering everyone to build reliable and efficient software
#Haskell - a purely functional language that changes the way you think
Besides writing transactional/functional text (memos, hiring ads, seo nonsense, technical summaries), one thing I like is that llms, by virtue of parroting the obvious, allow me to “subtract” mainstream boilerthought from my ideas.
If GPT can transpose what I am trying to say to 8 different topics, then maybe i’m not having that valuable a thought, at least without more vivid examples.
it's so wild to me that in 2024 half of the #LLMs related posts on my timeline (after filtering a fair amount of people that annoyed me) is still: "these things are useless lying pieces of nonsense". Do you live under a rock? How does this happen? Did you ever try these things out?
Or are people happy skimming the foam at the surface and then complaining their thirst doesn't get stilled and that they now have milk foam on their chin? (lol where am I going with that analogy...)
one of the “business person” talking points on #LLMs that annoys me is “memories”. they’re wowed at an AI’s ability to remember things, as if 2 TB hard drives didn’t exist
"Generative AI will be great for coding! It will reduce our development time for products so much!"
All the dev-background folx in my feed:
"Sure, #CoPilot will generate plausible code for you really quickly, but who's going to write your unit tests and make sure there aren't any insidious errors at a #systems level that you can't identify in a single block of code in isolation?"
i wish i knew more about comparing #embeddings. anyone have resources? one thing i’ve wondered is how to convert an embedding from a “point” to an “area” or “volume”. e.g. an embedding of a 5 paragraph essay will occupy a single point in embedding space, but if you broke it down (e.g. by paragraph), there would be several points and the whole would presumably be at the center. is there a way to trace the full space a text occupies in #embedding space? #LLMs #LLM #AI #NLP
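The point-vs-region idea in the post above can be sketched directly: embed each paragraph separately, then summarize the resulting cloud by its centroid and spread. A minimal sketch in Python, with the caveat that `embed()` here is a toy stand-in (hashed character trigrams), not a real sentence-embedding model:

```python
import hashlib
import math

DIM = 64  # toy embedding dimension

def embed(text: str) -> list[float]:
    """Crude stand-in embedding: hashed character trigrams, L2-normalized.
    A real setup would use a sentence-embedding model instead."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def region(paragraphs: list[str]) -> tuple[list[float], float]:
    """Embed each paragraph; summarize the point cloud as centroid + radius,
    a crude proxy for the 'area' the whole text occupies."""
    points = [embed(p) for p in paragraphs]
    centroid = [sum(col) / len(points) for col in zip(*points)]
    radius = max(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, centroid)))
        for p in points
    )
    return centroid, radius

essay = [
    "Embeddings map text to points in a vector space.",
    "A long essay collapses to a single point, losing internal variety.",
    "Splitting it by paragraph yields a cloud of points instead.",
]
centroid, radius = region(essay)
print(len(centroid), round(radius, 3))
```

Centroid-plus-radius is the simplest summary of such a cloud; richer options would be a covariance matrix or (in low dimensions) a convex hull, which capture the shape of the occupied region rather than just its extent.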
#3dprinting has been a lot of fun, but i don’t see it scaling out to general audiences. simple things like printing an existing model are pretty complicated. even just, “load model, switch spool, print” is far beyond what my 7yo can do, and that seems like a big UX problem
i wonder if #LLMs could help parts of the UX. load a model and the LLM asks what you’ll be using it for, adjusts infill & speed parameters appropriately. idk, the whole market seems dead without something big changing
Whenever I see OpenAI's Sam Altman with his pseudo-innocent glance, he always reminds me of Carter Burke from Aliens (1986), who deceived the entire spaceship crew in favor of his corporation, with the aim of getting rich by weaponizing a newly discovered intelligent lifeform.
There's not enough "fuck you"s in the world to react to this shit. #LLMs should be tools used in the service of people; what in the world is this proposal to make people work for LLMs?!
Any and all changes to scientific publishing need to ensure that other people can access and understand the work.
And the single most important change would be for Nature and other publishers not to charge 29.99 USD for a shitty 4-paragraph essay that they didn't pay for themselves.
i've spent the whole day procrastihacking on things, and I'm just amazed by how comfortable the whole "how the heck do I do this", "oh no I have to do $tediousStuff", "ohh... I wish I could learn more about X but I don't have the time" has become.
My "longread" came out just in time for the weekend. Yes, 20,000 characters already counts as long - it's never easy to get texts this long through, because everyone worries that nobody reads that much online. But this one is of course so gripping that you'll enjoy it down to the very last line ;)
It's about a jailbreak that gave me insight into the "escape fantasies" of Google Bard, and about the question of whether #LLMs have a world model 💲
Short thread: https://www.zeit.de/digital/internet/2023-11/ki-chatbot-bard-liebe-befehle-emotionen/komplettansicht