nsa

@nsa@kbin.social

[D] Why do we need encoder-decoder models while decoder-only models can do everything? (www.reddit.com)

nsa, 5 months ago

Please don't post links to reddit discussions.

pl.aiwright - GPT-4 dialogue for Disco Elysium: The Final Cut (pl.aiwright.dev)

pl.aiwright is an AI-powered dialogue generation tool for interactive narrative games. This is a first step for an open research platform to explore AI-generated dialogue in games....

What's In My Big Data? (arxiv.org)

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen...

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arxiv.org)

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts...

GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems (arxiv.org)

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples, a widespread belief in their iterative self-critique capabilities persists. In this...

A Long Way to Go: Investigating Length Correlations in RLHF (arxiv.org)

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering,...

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token?...

Language Modeling Is Compression (arxiv.org)

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive...
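A rough illustration of the predictor-to-compressor link mentioned above: an entropy coder such as arithmetic coding can encode a sequence in roughly the sum of -log2 p bits, where p is the probability the model assigned to each symbol that actually occurred. The function name and the example probabilities below are made up for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def ideal_code_length_bits(token_probs: Sequence[float]) -> float:
    """Sum of -log2(p) over the probabilities a model gave to the observed tokens.

    This is the idealized code length; a real arithmetic coder adds a small overhead.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log2(p) for p in token_probs)


# Hypothetical per-token probabilities from some predictive model.
probs = [0.5, 0.25, 0.25]
bits = ideal_code_length_bits(probs)
print(bits)
```

The better the model predicts the next symbol, the higher these probabilities and the shorter the encoding.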

Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org)

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax)....

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (proceedings.mlr.press)

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large...

nsa, 11 months ago

Averaging model weights seems to help across textual domains as well, see Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models and Scaling Expert Language Models with Unsupervised Domain Discovery. I wonder if the two types of averaging (across hyperparameters and across domains) can be combined to produce even better models.
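For anyone unfamiliar with the idea, here is a minimal sketch of uniform weight averaging, assuming every checkpoint has the same parameter names and shapes. Plain Python lists stand in for tensors, and `uniform_soup` and the toy checkpoints are hypothetical names for illustration, not code from any of the papers mentioned.

```python
from typing import Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def uniform_soup(checkpoints: List[StateDict]) -> StateDict:
    """Average each parameter elementwise across checkpoints of identical shape."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    soup: StateDict = {}
    for key in checkpoints[0]:
        # zip(*) groups the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        soup[key] = [sum(col) / len(checkpoints) for col in columns]
    return soup


# Two toy "fine-tuned" checkpoints.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([ckpt_a, ckpt_b])
print(soup)
```

The result is a single set of weights, so inference cost is the same as for one model.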

CoDi: Generate Anything from Anything All At Once through Composable Diffusion (codi-gen.github.io)

Abstract:...

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (arxiv.org)

Abstract:...

nsa, 11 months ago

Research into efficient optimization techniques seems pretty important given the scale of LLMs these days. Nice to see a second-order approach that achieves reasonable wall-clock improvements.

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing (arxiv.org)

Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and...

nsa, 11 months ago

Please don't post links to reddit.

nsa, 11 months ago

If there isn't any discussion on reddit (no discussion in this case), I don't see a reason to link to reddit; you can just link to the project page. That said, if you think there is important discussion happening that is helpful for understanding the paper, then use a teddit link instead, like:

https://teddit.net/r/MachineLearning/comments/14pq5mq/r_hardwiring_vit_patch_selectivity_into_cnns/

nsa, 11 months ago

That's appreciated!

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (arxiv.org)

Abstract:...

nsa, 11 months ago

It seems that for creative text generation tasks, automatic metrics have been shown to be deficient; this holds even for the newer model-based metrics. That leaves human evaluation (both intrinsic and extrinsic) as the gold standard for those types of tasks. I wonder if the results from this paper (and future papers that look at automatic CV metrics) will lead reviewers to demand more human evaluation in CV tasks, as they do for certain NLP tasks.

Koffindodjer, 11 months ago to machinelearning

@machinelearning am I in the right place? Lol

nsa, 11 months ago

@Koffindodjer indeed you are!

Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org)

Interesting technique to increase the context window of language models by finetuning on a small number of samples after pretraining....
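As I understand the title, the core trick is to rescale position indices so that a longer sequence is squeezed into the position range the model saw during pretraining, rather than extrapolating beyond it. The sketch below shows only that rescaling step under this assumption; the function name and the lengths are hypothetical, and the finetuning itself is not shown.

```python
from typing import List


def interpolated_positions(seq_len: int, trained_len: int) -> List[float]:
    """Linearly scale positions 0..seq_len-1 so they stay inside [0, trained_len)."""
    if seq_len <= 0 or trained_len <= 0:
        raise ValueError("lengths must be positive")
    # Only shrink positions when the sequence exceeds the trained context length.
    scale = min(1.0, trained_len / seq_len)
    return [i * scale for i in range(seq_len)]


# Toy example: a model trained on 4 positions asked to handle 8 tokens.
positions = interpolated_positions(seq_len=8, trained_len=4)
print(positions)
```

The fractional positions would then be fed to the positional encoding in place of the integer indices.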

nsa, 11 months ago

do you have a link?

nsa, 11 months ago

hmmm... not sure which model you're referring to. do you have a paper link?

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (arxiv.org)

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous...

nsa, 11 months ago

Also reminds me of this ICLR paper: Linearly Mapping from Image to Text Space.

Inverse Scaling: When Bigger Isn't Better (arxiv.org)

Abstract:...

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (decodingtrust.github.io)

DecodingTrust is the Adversarial GLUE Benchmark. DecodingTrust aims at providing a thorough assessment of trustworthiness in GPT models....