@ogrisel@sigmoid.social

ogrisel


Machine Learning Engineer at :probabl., scikit-learn core contributor. #Python, #Pydata, #MachineLearning & #DeepLearning.


ogrisel, (edited) to random

HELP WANTED!

Are you a scikit-learn x PyPy user? If so we are looking for help to investigate why our test suite uses so much memory and causes our CI infrastructure to regularly fail as a result.

This kind of investigative work is time consuming and none of the current scikit-learn maintainers have a particular interest in investing time and effort to more efficiently support PyPy at the moment.

ogrisel,

If you would like to help, here is a concrete example of the kind of investigation you would need to conduct to help us pinpoint PyPy-specific memory problems:

https://github.com/scikit-learn/scikit-learn/issues/27662

and here is an example of the kind of fix that helped reduce this memory problem in the past:

https://github.com/scikit-learn/scikit-learn/pull/27670

We think that similar investigations and fixes are needed to make scikit-learn and PyPy reasonably memory efficient together.

ogrisel,

Also, even if you do not have the time to help maintain PyPy support, we are still interested in learning more about any use cases that combine PyPy and scikit-learn.

ogrisel, to python

I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.

If you have practical experience with https://reproducible-builds.org/, in particular in the #Python / #PyData ecosystem, I would be curious about any feedback on the plan I suggest for scikit-learn in the following issue.

Feel free to reply on mastodon first, if you have questions.

https://github.com/scikit-learn/scikit-learn/issues/28151

ogrisel, (edited) to random

I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).

Let's start with the GBRT model. It's now possible to reproduce the SOTA number for this dataset in a few lines of code and about 2 s (CV included) on my laptop.

1/n

#sklearn #PyData #MachineLearning #TabularData #GradientBoosting #DeepLearning #Python

ogrisel, (edited)

For neural networks, feature preprocessing is a deal breaker.

I was pleasantly surprised to observe that by intuitively composing basic building blocks from scikit-learn (OneHotEncoder, SplineTransformer and MLPClassifier), it's possible to approach the predictive performance of trees on this dataset.

2/n

ogrisel,

It was interesting to see that the neural network's predictive accuracy would be degraded by one or two points if I had used standard scaling of the numerical features instead of splines, or if I had used a small number of knots for the splines.

For this particular dataset, it seems important to use a feature preprocessing with an axis-aligned prior for the numerical features.

4/n

ogrisel,

Note that the runtime for the neural net is ~10x slower than the tree-based model on my Apple M1 laptop.

I did not try to use an expensive GPU with PyTorch.

Note however that I did configure conda-forge's numpy to link against Apple Accelerate and use the M1 chip builtin GPU which is typically around 3x faster than OpenBLAS on the M1's CPU.

It's possible that with float32 ops (instead of float64) the difference would be wider though. Unfortunately it's not yet easy to do w/ sklearn.

3/n

ogrisel,

Meanwhile, I also checked the calibration of the tree-based and nn-based models.

The conclusion is that both models are well calibrated by default, as long as you use early stopping.

If you disable early stopping and max_iter is too small (underfitting) or too large (overfitting), then the models can be significantly under-confident or over-confident, respectively.

6/n

[Image: near-diagonal calibration curves.]

ogrisel,

This is in line with the numbers in the AD column of Table 6 of this very interesting paper:

On Embeddings for Numerical Features in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Artem Babenko

https://arxiv.org/abs/2203.05556

Note that I did not do extensive parameter tuning but my notebook is not too far from those numbers.

I might try to implement the periodic features as a preprocessor in the future.

5/n

ogrisel,

Here is the link to the rendered notebook:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/sklearn_demos/gbdt_vs_neural_nets_on_tabular_data.ipynb

It also includes a similar study on California Housing which has only numerical features.

For this dataset, spline features degrade performance. I found that quite surprising. But standard scaling makes the neural network competitive with (albeit still slower than) the tree-based model.

7/7.

ogrisel, (edited) to machinelearning

Today with @dholzmueller we explored the possibility of reducing a probabilistic regression problem to a classification problem by binning the target variable, then interpolating the conditional CDF estimated by classifier.predict_proba(X_test).cumsum(axis=1) back to the original continuous range.

Here is a notebook with the results of my experiments:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/quantile_regression_as_classification.ipynb
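
To make the interpolation step concrete, here is a minimal NumPy-only sketch. It is not taken from the notebook: the helper name quantiles_from_binned_proba is hypothetical, and it assumes that the classes are ordered like the target bins and that the CDF is piecewise linear between bin edges.

```python
import numpy as np


def quantiles_from_binned_proba(proba, bin_edges, quantile_levels):
    """Interpolate per-sample quantiles from class probabilities over bins.

    proba: (n_samples, n_bins) array, e.g. from classifier.predict_proba,
        with columns ordered like the target bins (assumption).
    bin_edges: (n_bins + 1,) increasing edges used to bin the target.
    quantile_levels: levels in [0, 1] at which to invert the CDF.
    """
    proba = np.asarray(proba, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)
    # The cumulative sum gives the CDF at the right edge of each bin;
    # prepend 0 so that the CDF is also defined at the left-most edge.
    zeros = np.zeros((proba.shape[0], 1))
    cdf = np.concatenate([zeros, proba.cumsum(axis=1)], axis=1)
    cdf[:, -1] = 1.0  # guard against floating point rounding
    # Invert the piecewise linear CDF of each sample by interpolation.
    return np.stack(
        [np.interp(quantile_levels, cdf_i, bin_edges) for cdf_i in cdf]
    )


# Toy check: uniform probabilities over 4 bins covering [0, 4].
proba = np.array([[0.25, 0.25, 0.25, 0.25]])
bin_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
estimates = quantiles_from_binned_proba(proba, bin_edges, [0.1, 0.5, 0.9])
print(estimates)
```

In this toy case the interpolated CDF is the uniform CDF on [0, 4], so the median lands at the middle of the range.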

ogrisel, (edited) to ArtificialIntelligence

Interesting developments in subquadratic alternatives to self-attention based transformers for large sequence modeling (32k and more).

Hyena Hierarchy: Towards Larger Convolutional Language Models

https://arxiv.org/abs/2302.10866

They propose to replace the quadratic self-attention layers by an operator built with implicitly parametrized long kernel 1D convolutions.

#DeepLearning #LLMs #PaperThread

1/4

ogrisel, to random

@fabian could you please enable mastodon full-text search indexing in the settings of the JMLR and TMLR accounts on sigmoid.social?

ogrisel, (edited) to python

cloudpickle 3.0.0 is out!

https://github.com/cloudpipe/cloudpickle

cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python worker processes.

This is typically necessary for running code in parallel on a distributed computing cluster from an interactive developer environment such as a Jupyter or Databricks notebook.

#Python #PyData #HPC #DistributedComputing
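
As a small illustration of what this means in practice, the sketch below serializes a locally defined closure with cloudpickle and reloads it with the standard pickle module. The function names are made up for the example, and a real deployment would send the bytes to another process rather than reload them locally.

```python
import pickle

import cloudpickle


def make_scaler(factor):
    # The returned lambda is a locally defined closure: the standard pickle
    # module only pickles functions by reference and cannot serialize it.
    return lambda x: factor * x


triple = make_scaler(3)

# cloudpickle serializes the function by value (code and closure cells).
payload = cloudpickle.dumps(triple)

# The payload can be loaded with the standard pickle module, as long as
# cloudpickle is importable in the loading process.
restored = pickle.loads(payload)
result = restored(14)
print(result)
```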

ogrisel, to python

scikit-learn 1.3.1 is out!

This release fixes a bunch of annoying bugs. Here is the changelog:

https://scikit-learn.org/stable/whats_new/v1.3.html#version-1-3-1

Thanks very much to all bug reporters, PR authors and reviewers and thanks in particular to @glemaitre, the release manager of 1.3.1.

#PyData #SciPy #sklearn #Python #machinelearning

ogrisel, to random

Yesterday I learned at the #EuroScipy2023 #IbisData tutorial that Ibis now offers an implementation of the across function first introduced in #dplyr to conveniently and concisely apply transformations on a set of columns defined by selectors (e.g. based on column data types or name patterns).

This is especially convenient to implement scalable, in-DB feature engineering for machine learning models.

More examples in this blog post:

https://ibis-project.org/blog/selectors/

ogrisel, (edited) to random

Jérémie has just released threadpoolctl 3.2.0:

https://pypi.org/project/threadpoolctl/

This is a small Python library to inspect and change the size of the threadpools used by libraries dynamically linked to a Python program (e.g. OpenBLAS, MKL, OpenMP runtimes...).

It is quite useful to debug oversubscription problems in the #SciPy / #PyData ecosystem.

This new version makes it possible to register a custom controller for your own native library. See the changelog for details:

https://github.com/joblib/threadpoolctl/blob/master/CHANGES.md
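
For readers who have not used threadpoolctl, here is a minimal sketch of the inspect-and-limit workflow mentioned in this post. The matrix size is arbitrary, and which libraries get listed depends on how NumPy was built on your machine.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Inspect the native threadpools loaded in the current process
# (importing numpy typically loads a BLAS implementation).
info = threadpool_info()
for lib in info:
    print(lib.get("user_api"), lib.get("internal_api"), lib.get("num_threads"))

# Temporarily restrict BLAS libraries to a single thread, e.g. to avoid
# oversubscription when the caller already runs several workers.
with threadpool_limits(limits=1, user_api="blas"):
    a = np.ones((200, 200))
    product = a @ a
```

The previous thread limits are restored when the with block exits.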

ogrisel, (edited) to ArtificialIntelligence French

LongNet: Scaling Transformers to 1,000,000,000 Tokens

https://arxiv.org/abs/2307.02486

#deeplearning #transformers

ogrisel, (edited) to random

joblib 1.3.0 is out in the wild!

joblib is a library that provides a generic way to call into thread-based, process-based and distributed parallelism (via external backends), plus a way to cache the results of expensive, repeated function calls on disk.

https://joblib.readthedocs.io

This new release provides several major new features, including a return_as="generator" argument to the Parallel class that makes it possible to aggregate parallel results as they become ready (preserving the submission order).

1/4

ogrisel, to python

The deadline for the CFP of #PyData Paris 2024 is approaching soon!

Submit your talk proposal now:

https://pretalx.com/pydata-paris-2024/cfp

I would advise you not to expect an automatic deadline extension.

#Python #DataScience

ogrisel, to random

joblib 1.4.0 is out!

Among other fixes and improvements this release brings:

  • numpy 2 compatibility;
  • support for a new Parallel kwarg: return_as="generator_unordered" to return results out of order in a streaming manner.

https://github.com/joblib/joblib/blob/main/CHANGES.rst#release-140----20240408

ogrisel, to random

I just created a new official mastodon account for the PyData Paris 2024 conference.

@PyDataParis

I will use this account to relay official announcements in the fediverse.

#PyData #PyDataParis
