Posts


ogrisel, to random
@ogrisel@sigmoid.social avatar

I just created a new official mastodon account for the PyData Paris 2024 conference.

@PyDataParis

I will use this account to relay official announcements in the fediverse.

#PyData #PyDataParis

ogrisel, (edited ) to python
@ogrisel@sigmoid.social avatar

joblib 1.4.0 is out!

Among other fixes and improvements this release brings:

  • numpy 2 compatibility;
  • support for a new Parallel kwarg: return_as=generator_unordered to return results out of order in a streaming manner.

https://github.com/joblib/joblib/blob/main/CHANGES.rst#release-140----20240408

ogrisel, to python
@ogrisel@sigmoid.social avatar

The deadline for the CFP of #PyData Paris 2024 is approaching soon!

Submit your talk proposal now:

https://pretalx.com/pydata-paris-2024/cfp

I would advise you not to expect an automatic deadline extension.

#Python #DataScience

ogrisel, (edited ) to random
@ogrisel@sigmoid.social avatar

HELP WANTED!

Are you a scikit-learn x PyPy user? If so, we are looking for help to investigate why our test suite uses so much memory and regularly causes our CI infrastructure to fail as a result.

This kind of investigative work is time-consuming, and none of the current scikit-learn maintainers have a particular interest in investing time and effort to support PyPy more efficiently at the moment.

ogrisel,
@ogrisel@sigmoid.social avatar

If you would like to help, here is a concrete example of the kind of investigation you would need to conduct to help us pinpoint PyPy-specific memory problems:

https://github.com/scikit-learn/scikit-learn/issues/27662

and here is an example of the kind of fix that helped reduce this memory problem in the past:

https://github.com/scikit-learn/scikit-learn/pull/27670

We think that similar investigations and fixes are needed to make scikit-learn and PyPy reasonably memory efficient together.

ogrisel,
@ogrisel@sigmoid.social avatar

Also, even if you do not have the time to help maintain PyPy support, we are still interested in learning more about any use cases combining PyPy and scikit-learn.

ogrisel, to random
@ogrisel@sigmoid.social avatar

The Call for Proposals for #PyData Paris 2024 is officially OPEN! 🎉

Share your insights, discoveries, and innovations with the open-source data science and AI/ML community.

Submit your proposal at https://pydata.org/paris2024 and be a part of this incredible event!

ogrisel, to python
@ogrisel@sigmoid.social avatar

I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.

If you have practical experience with https://reproducible-builds.org/, in particular in the #Python / #PyData ecosystem, I would be curious about any feedback on the plan I suggest for scikit-learn in the following issue.

Feel free to reply on mastodon first, if you have questions.

https://github.com/scikit-learn/scikit-learn/issues/28151

ogrisel,
@ogrisel@sigmoid.social avatar

@sethmlarson @vstinner I completely agree with all you said. Do you plan to focus first on helping make official cpython releases themselves automatically reproducible? Or do you plan to focus on improving wheel building tools to make pypi hosted artifacts reproducible?

Interesting work w.r.t. the SBOM of cpython. It would be interesting to have cibuildwheel able to dump an SBOM file, and later rebuild from one while checking the sha256 values of the dependencies.

sethmlarson, (edited )
@sethmlarson@fosstodon.org avatar

@ogrisel @vstinner Right now I'm focusing on CPython and next I'll likely focus on Python packaging ecosystem best practices.

It is on my roadmap to improve Python package tooling and standards to make reproducibility possible. I would like to get to this work in 2024; unfortunately, there's only one of me right now, so I can only say with certainty that I won't be able to start on that in early 2024. But if someone were to pick up this work earlier, I would happily review and assist!

ogrisel, to random
@ogrisel@sigmoid.social avatar

MotherNet: A Foundational Hypernetwork for Tabular Classification

by Andreas Müller, Carlo Curino, Raghu Ramakrishnan
https://arxiv.org/abs/2312.08598

Crazy paper that introduces a meta-trained transformer that can perform in-context learning of the weights of small MLPs from numerical tabular training sets passed in the 'prompt' of the big transformer.

A kind of TabPFN but with very fast inference.

ogrisel, (edited ) to random
@ogrisel@sigmoid.social avatar

I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets with mixed numerical and categorical features (e.g. the Adult Census dataset).

Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and 2 s (CV included) on my laptop.

1/n

#sklearn #PyData #MachineLearning #TabularData #GradientBoosting #DeepLearning #Python

ogrisel, (edited )
@ogrisel@sigmoid.social avatar

For neural networks, feature preprocessing is a deal breaker.

I was pleasantly surprised to observe that by intuitively composing basic building blocks (OneHotEncoder, SplineTransformer and MLPClassifier) from scikit-learn, it's possible to approach the predictive performance of trees on this dataset.

2/n

ogrisel,
@ogrisel@sigmoid.social avatar

Note that the runtime for the neural net is ~10x slower than the tree-based model on my Apple M1 laptop.

I did not try to use an expensive GPU with PyTorch.

Note however that I did configure conda-forge's numpy to link against Apple Accelerate and use the M1 chip builtin GPU which is typically around 3x faster than OpenBLAS on the M1's CPU.

It's possible that with float32 ops (instead of float64) the difference would be wider though. Unfortunately it's not yet easy to do w/ sklearn.

3/n

ogrisel, (edited ) to machinelearning
@ogrisel@sigmoid.social avatar

Today with @dholzmueller we explored the possibility of reducing a probabilistic regression problem to a classification problem by binning the target variable and interpolating the conditional CDF estimated by classifier.predict_proba(X_test).cumsum(axis=1) back to the original continuous range.

Here is a notebook with the results of my experiments:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/quantile_regression_as_classification.ipynb

#PyData #MachineLearning

ogrisel, (edited )
@ogrisel@sigmoid.social avatar

@bsweber the idea is not that new though. If I recall correctly, PixelCNN and WaveNet both learned a conditional distribution on a discretized continuous variable by treating it as the target of a classification problem. This is personally what gave me the idea to craft this meta-estimator a while ago.

I think @dholzmueller had another reference in mind, maybe https://arxiv.org/abs/2211.05641 ? This paper attempts to give a theoretical justification of a common practice among practitioners.

dholzmueller,

@ogrisel @bsweber
Yes, and that paper mentions a few other sources (e.g. in RL) where classification has been used for regression problems.

ogrisel, (edited ) to ArtificialIntelligence
@ogrisel@sigmoid.social avatar

Interesting developments in subquadratic alternatives to self-attention based transformers for large sequence modeling (32k and more).

Hyena Hierarchy: Towards Larger Convolutional Language Models

https://arxiv.org/abs/2302.10866

They propose to replace the quadratic self-attention layers by an operator built with implicitly parametrized long kernel 1D convolutions.

#DeepLearning #LLMs #PaperThread

1/4

ogrisel, (edited )
@ogrisel@sigmoid.social avatar

Unfortunately, the reduced FLOPs of Hyena layers do not necessarily yield competitive walltime performance, because long-kernel FFT convolutions typically have a hard time using hardware accelerators (GPUs, TPUs) efficiently.

In particular, FlashAttentionV2 transformers can stay competitive for relatively long input sequences because of their highly optimized fused kernels.

3/4

ogrisel,
@ogrisel@sigmoid.social avatar

However, the following paper:

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

https://arxiv.org/abs/2311.05908

https://github.com/HazyResearch/flash-fft-conv

shows that it's possible to implement FFTConv efficiently on GPUs, therefore making the Hyena architecture more competitive.

This might be a game changer to tackle long sequence "reasoning" and recall tasks for LLMs, DNA sequence analysis and so on.

4/4

ogrisel, to random
@ogrisel@sigmoid.social avatar

@fabian could you please enable mastodon full-text search indexing in the settings of the JMLR and TMLR accounts on sigmoid.social?

fabian,

@ogrisel done!

ogrisel, (edited ) to python
@ogrisel@sigmoid.social avatar

cloudpickle 3.0.0 is out!

https://github.com/cloudpipe/cloudpickle

cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python worker processes.

This is typically necessary for running code in parallel on a distributed computing cluster from an interactive development environment such as a Jupyter or Databricks notebook.

#Python #PyData #HPC #DistributedComputing
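For example, a lambda that the standard pickle module rejects round-trips fine through cloudpickle:

```python
import pickle

import cloudpickle

# Standard pickle cannot serialize lambdas or interactively defined
# functions; cloudpickle serializes them by value instead of by reference.
square = lambda x: x * x
payload = cloudpickle.dumps(square)

# On a remote worker, the plain pickle module can restore the callable
# because cloudpickle emits a standard pickle stream.
restored = pickle.loads(payload)
print(restored(4))  # 16
```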

ogrisel,
@ogrisel@sigmoid.social avatar

This release drops support for Python 3.6 and 3.7 and adds official support for Python 3.12.

Dropping support for older Python versions made it possible to simplify the code base a lot (more than 500 lines of code deleted).

We also fixed errors when pickling instances of dynamically defined dataclasses.

We also took the opportunity to upgrade our maintenance tools (dropping setup.py in favor of pyproject.toml, using black and ruff in a pre-commit setting, ...).

Thanks to contributors!

ogrisel, to python
@ogrisel@sigmoid.social avatar

scikit-learn 1.3.1 is out!

This release fixes a bunch of annoying bugs. Here is the changelog:

https://scikit-learn.org/stable/whats_new/v1.3.html#version-1-3-1

Thanks very much to all bug reporters, PR authors and reviewers and thanks in particular to @glemaitre, the release manager of 1.3.1.

#PyData #SciPy #sklearn #Python #machinelearning

Scriddie,

@ogrisel @glemaitre
Now the only thing lacking for 2.0 is a high-res logo ;)

ogrisel,
@ogrisel@sigmoid.social avatar

@Scriddie @glemaitre actually we should probably just use the svg logo.

ogrisel, to random
@ogrisel@sigmoid.social avatar

Yesterday I learned at the #EuroScipy2023 #IbisData tutorial that Ibis now offers an implementation of the across function, first introduced in #dplyr, to conveniently and concisely apply transformations to a set of columns defined by selectors (e.g. based on column data types or name patterns).

This is especially convenient to implement scalable, in-DB feature engineering for machine learning models.

More examples in this blog post:

https://ibis-project.org/blog/selectors/

ogrisel,
@ogrisel@sigmoid.social avatar

And here are the notebooks of yesterday's #ibisdata tutorial for those interested:

https://github.com/gforsyth/ibis-tutorial
