ogrisel

@ogrisel@sigmoid.social

Machine Learning Engineer at :probabl., scikit-learn core contributor. #Python, #Pydata, #MachineLearning & #DeepLearning.

ogrisel, 2 months ago (edited 2 months ago) to random

HELP WANTED!

Are you a scikit-learn x PyPy user? If so we are looking for help to investigate why our test suite uses so much memory and causes our CI infrastructure to regularly fail as a result.

This kind of investigative work is time consuming and none of the current scikit-learn maintainers have a particular interest in investing time and effort to more efficiently support PyPy at the moment.

+ jochen, jorisvandenbossche

ogrisel, 2 months ago

If you would like to help, here is a concrete example of the kind of investigation you would need to conduct to help us pinpoint PyPy-specific memory problems:

https://github.com/scikit-learn/scikit-learn/issues/27662

and here is an example of the kind of fix that helped reduce this memory problem in the past:

https://github.com/scikit-learn/scikit-learn/pull/27670

We think that similar investigations and fixes are needed to make scikit-learn and PyPy reasonably memory efficient together.

ogrisel, 2 months ago

Also, even if you do not have the time to help maintain PyPy support, we are still interested in learning more about any use cases that combine PyPy and scikit-learn.

ogrisel, 4 months ago to python

I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.

If you have practical experience with https://reproducible-builds.org/ in particular in the #Python / #PyData ecosystem, I would be curious about any feedback to the plan I suggest for scikit-learn in the following issue.

Feel free to reply on mastodon first, if you have questions.

https://github.com/scikit-learn/scikit-learn/issues/28151

+ sethmlarson, leahawasser, underdarkGIS

ogrisel, 6 months ago (edited 6 months ago) to random

I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).

Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and in 2 s (CV included) on my laptop.

1/n

#sklearn #PyData #MachineLearning #TabularData #GradientBoosting #DeepLearning #Python

+ GaelVaroquaux

ogrisel, 6 months ago (edited 6 months ago)

For neural networks, feature preprocessing is a deal breaker.

I was pleasantly surprised to observe that by intuitively composing basic building blocks (OneHotEncoder and SplineTransformer and MLPClassifier) from scikit-learn, it's possible to approach the predictive performance of trees on this dataset.

2/n

+ GaelVaroquaux

ogrisel, 6 months ago

It was interesting to see that the neural network predictive accuracy would be degraded by one or two points if we had used standard scaling of numerical features instead of splines, or if I had used a small number of knots for the splines.

For this particular dataset, it seems important to use a feature preprocessing with an axis-aligned prior for the numerical features.

4/n

+ GaelVaroquaux

ogrisel, 6 months ago

Note that the runtime for the neural net is ~10x slower than the tree-based model on my Apple M1 laptop.

I did not try to use an expensive GPU with PyTorch.

Note however that I did configure conda-forge's numpy to link against Apple Accelerate and use the M1 chip builtin GPU which is typically around 3x faster than OpenBLAS on the M1's CPU.

It's possible that with float32 ops (instead of float64) the difference would be wider though. Unfortunately it's not yet easy to do w/ sklearn.

3/n

+ GaelVaroquaux

ogrisel, 6 months ago

Meanwhile, I also checked the calibration of the tree-based and nn-based models.

The conclusion is that both models are well calibrated by default, as long as you use early stopping.

If you disable early stopping and max_iter is too small (under fit) or too large (over fit) then the models can either be significantly under-confident or over-confident.

6/n

Near diagonal calibration curves.

+ GaelVaroquaux

ogrisel, 6 months ago

This is in line with the numbers in the AD column of Table 6 of this very interesting paper:

On Embeddings for Numerical Features in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Artem Babenko

https://arxiv.org/abs/2203.05556

Note that I did not do extensive parameter tuning but my notebook is not too far from those numbers.

I might try to implement the periodic features as a preprocessor in the future.

5/n

+ GaelVaroquaux

ogrisel, 6 months ago

Here is the link to the rendered notebook:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/sklearn_demos/gbdt_vs_neural_nets_on_tabular_data.ipynb

It also includes a similar study on California Housing which has only numerical features.

For this dataset, spline features degrade performance. I found that quite surprising. But standard scaling makes the neural network competitive with (albeit still slower than) the tree-based model.

7/7.

+ GaelVaroquaux

ogrisel, 6 months ago (edited 6 months ago) to machinelearning

Today with @dholzmueller we explored the possibility of reducing a probabilistic regression problem to a classification problem by binning the target variable and interpolating the conditional CDF estimated by classifier.predict_proba(X_test).cumsum(axis=1) back to the original continuous range.

Here is a notebook with the results of my experiments:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/quantile_regression_as_classification.ipynb

#PyData #MachineLearning

ogrisel, 6 months ago (edited 6 months ago) to ArtificialIntelligence

Interesting developments in subquadratic alternatives to self-attention based transformers for large sequence modeling (32k and more).

Hyena Hierarchy: Towards Larger Convolutional Language Models

https://arxiv.org/abs/2302.10866

They propose to replace the quadratic self-attention layers by an operator built with implicitly parametrized long kernel 1D convolutions.

#DeepLearning #LLMs #PaperThread

1/4

ogrisel, 7 months ago to random

@fabian could you please enable mastodon full-text search indexing in the settings of the JMLR and TMLR accounts on sigmoid.social?

ogrisel, 7 months ago (edited 7 months ago) to python

cloudpickle 3.0.0 is out!

https://github.com/cloudpipe/cloudpickle

cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python process workers.

This is typically necessary for running code in parallel on a distributed computing cluster from an interactive developer environment such as Jupyter or Databricks notebooks.
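A small sketch of the difference with the standard library `pickle`, assuming cloudpickle is installed. The function names are hypothetical, and the "worker" is simulated in the same process.

```python
import pickle

import cloudpickle

def make_scaler(factor):
    # A lambda defined inside a function: it cannot be imported by name.
    return lambda x: factor * x

scale_by_3 = make_scaler(3)

# Plain pickle serializes functions by reference and rejects this one.
try:
    pickle.dumps(scale_by_3)
    plain_pickle_ok = True
except (pickle.PicklingError, AttributeError):
    plain_pickle_ok = False

# cloudpickle serializes the function by value (code and closure included).
payload = cloudpickle.dumps(scale_by_3)

# The receiving side can use pickle.loads, provided cloudpickle is importable there.
restored = pickle.loads(payload)
result = restored(14)
```

In a real setup the `payload` bytes would be sent over the network to a worker process instead of being loaded locally.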

#Python #PyData #HPC #DistributedComputing

+ GaelVaroquaux

ogrisel, 8 months ago to python

scikit-learn 1.3.1 is out!

This release fixes a bunch of annoying bugs. Here is the changelog:

https://scikit-learn.org/stable/whats_new/v1.3.html#version-1-3-1

Thanks very much to all bug reporters, PR authors and reviewers and thanks in particular to @glemaitre, the release manager of 1.3.1.

#PyData #SciPy #sklearn #Python #machinelearning

+ GaelVaroquaux

ogrisel, 9 months ago to random

Yesterday I learned at the #EuroScipy2023 #IbisData tutorial that Ibis now offers an implementation of the across function first introduced in #dplyr to conveniently and concisely apply transformations on a set of columns defined by selectors (e.g. based on column data types or name patterns).

This is especially convenient to implement scalable, in-DB feature engineering for machine learning models.

More examples in this blog post:

https://ibis-project.org/blog/selectors/

ogrisel, 9 months ago to random French

Intriguing paper: Provably Faster Gradient Descent via Long Steps by Benjamin Grimmer

The convergence rate of gradient descent on smooth convex objective functions can be improved by using a periodic learning rate pattern with some very large values:

https://arxiv.org/abs/2307.06324

Figure 1: Least squares problems minimizing $\|Ax - b\|_2^2$ (left) and $\|Ax - b\|_2^2 + \|x\|_2^2$ (right) with i.i.d. normal entries in $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$ for $n = 4000$. Gradient Descent (1.3)'s objective gap is plotted over $T = 2000$ iterations with $h = (1)$ and with each pattern from Table 1. Note this second objective is substantially more strongly convex, so its faster linear convergence is expected. Longer pattern periods with larger average step sizes lead to improved convergence for both problems.
Screenshot of table 1: Optimal step size patterns, for period t of 127, the largest step size is 370.0 and the constant in the convergence rate denominator is close to 5.83.

ogrisel, 10 months ago (edited 10 months ago) to random

Jérémie has just released threadpoolctl 3.2.0:

https://pypi.org/project/threadpoolctl/

This is a small Python library to inspect and change the size of the threadpools used by libraries dynamically linked to a Python program (e.g. OpenBLAS, MKL, OpenMP runtimes...).

It is quite useful to debug oversubscription problems in the #SciPy / #PyData ecosystem.
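As an illustration of the inspect-and-limit workflow, here is a minimal sketch that uses `threadpool_info` and `threadpool_limits`. The matrix size is arbitrary, and the libraries listed depend on how NumPy was built on your machine.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native threadpools currently loaded in this process.
pools = threadpool_info()
for pool in pools:
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])

# Temporarily restrict BLAS libraries to a single thread, e.g. to avoid
# oversubscription when the calling code is already parallelized.
with threadpool_limits(limits=1, user_api="blas"):
    a = np.random.default_rng(0).normal(size=(200, 200))
    product = a @ a
```

The previous thread limits are restored when the `with` block exits.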

This new version makes it possible to register a custom controller for your own native library. See the changelog for details:

https://github.com/joblib/threadpoolctl/blob/master/CHANGES.md

+ ogrisel

ogrisel, 11 months ago (edited 11 months ago) to ArtificialIntelligence French

LongNet: Scaling Transformers to 1,000,000,000 Tokens

https://arxiv.org/abs/2307.02486

#deeplearning #transformers

ogrisel, 11 months ago (edited 11 months ago) to random

joblib 1.3.0 is out in the wild!

joblib is a library that provides a generic way to call into thread-based, process-based and distributed parallelism (via external backends) + a way to cache expensive computation in repeated function calls on disk.

https://joblib.readthedocs.io

This new release provides several major new features, including a return_as="generator" argument to the Parallel class to make it possible to aggregate parallel results when ready (preserving the submission order).

1/4

+ ericholscher, jorisvandenbossche, GaelVaroquaux

ogrisel, 2 months ago to python

The deadline for the CFP of #PyData Paris 2024 is approaching soon!

Submit your talk proposal now:

https://pretalx.com/pydata-paris-2024/cfp

I would advise you not to expect an automatic deadline extension.

#Python #DataScience

ogrisel, 2 months ago to random

joblib 1.4.0 is out!

Among other fixes and improvements this release brings:

numpy 2 compatibility;

support for a new Parallel kwarg: return_as="generator_unordered" to return results out of order in a streaming manner.

https://github.com/joblib/joblib/blob/main/CHANGES.rst#release-140----20240408

ogrisel, 14 days ago (edited 14 days ago) to python

Scikit-learn 1.5 release highlights in video:

https://youtu.be/mOpU-zremz4

Or as a webpage: https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_5_0.html

#Python #MachineLearning #PyData #SciPy

ogrisel, 8 days ago to random

I just created a new official mastodon account for the PyData Paris 2024 conference.

@PyDataParis

I will use this account to relay official announcements in the fediverse.

#PyData #PyDataParis
