Are you a scikit-learn x PyPy user? If so, we are looking for help to investigate why our test suite uses so much memory and causes our CI infrastructure to regularly fail as a result.
This kind of investigative work is time consuming, and none of the current scikit-learn maintainers is interested in investing the time and effort needed to support PyPy more efficiently at the moment.
If you would like to help, here is a concrete example of the kind of investigation you would need to conduct to help us pinpoint PyPy-specific memory problems:
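For instance, a first step could look like the following generic sketch: measure how much the process peak RSS grows while running a suspect test function. The helper names are mine, and a real investigation would target actual scikit-learn tests run under PyPy rather than this stand-in allocation.

```python
# Sketch: compare process peak RSS before and after a suspect test.
# Works on CPython and PyPy on Unix; on Linux ru_maxrss is in KiB.
import resource


def peak_rss_kib():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def suspect_test():
    # Stand-in for a scikit-learn test suspected of allocating a lot.
    data = [float(i) for i in range(1_000_000)]
    return len(data)


before = peak_rss_kib()
suspect_test()
after = peak_rss_kib()
print(f"peak RSS grew by {after - before} KiB")
```

Comparing such numbers between CPython and PyPy runs of the same test is one way to pinpoint which tests behave pathologically under PyPy.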
Also, even if you do not have the time to help maintain PyPy support, we are still interested in learning more about any use cases that combine PyPy and scikit-learn.
I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.
If you have practical experience with https://reproducible-builds.org/, in particular in the #Python / #PyData ecosystem, I would be curious about any feedback on the plan I suggest for scikit-learn in the following issue.
Feel free to reply on mastodon first, if you have questions.
@sethmlarson @vstinner I completely agree with all you said. Do you plan to focus first on helping make official CPython releases themselves automatically reproducible? Or do you plan to focus on improving wheel building tools to make PyPI-hosted artifacts reproducible?
Interesting work w.r.t. the SBOM of CPython. It would be interesting to have cibuildwheel able to dump an SBOM file, and later rebuild from one while checking the sha256 values of the dependencies.
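To make the idea concrete, here is a toy Python sketch of such a check. The file name and contents are made up; a real implementation would read pinned digests out of the SBOM instead of computing them in place.

```python
# Toy sketch: record the sha256 digest of a build input at first-build
# time, then verify it before a rebuild. File names are illustrative.
import hashlib
from pathlib import Path

dep = Path("dep-1.0-py3-none-any.whl")
dep.write_bytes(b"pretend wheel contents")

# The digest recorded here would normally be stored in the SBOM.
recorded = hashlib.sha256(dep.read_bytes()).hexdigest()

# Before rebuilding, recompute the digest and compare.
assert hashlib.sha256(dep.read_bytes()).hexdigest() == recorded
print("build inputs verified")
```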
@ogrisel @vstinner Right now I'm focusing on CPython, and next I'll likely focus on Python packaging ecosystem best practices.
It is on my roadmap to improve Python package tooling and standards to make reproducibility possible. I would like to get to this work in 2024, unfortunately there's only one of me right now so I can only say with certainty that I won't be able to start on that in early 2024. But if someone were to pick up this work earlier I would happily review and assist!
Crazy paper: it introduces a meta-trained transformer that can perform in-context learning of the weights of small MLPs from numerical tabular training sets passed as the 'prompt' of the big transformer.
I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).
Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and ~2 s (CV included) on my laptop.
For neural networks, feature preprocessing can make or break predictive performance.
I was pleasantly surprised to observe that by intuitively composing basic building blocks from scikit-learn (OneHotEncoder, SplineTransformer, and MLPClassifier), it's possible to approach the predictive performance of trees on this dataset.
Note that the runtime for the neural net is ~10x slower than the tree-based model on my Apple M1 laptop.
I did not try to use an expensive GPU with PyTorch.
Note however that I did configure conda-forge's numpy to link against Apple Accelerate, which leverages the M1 chip's built-in matrix acceleration and is typically around 3x faster than OpenBLAS on the M1's CPU.
It's possible that with float32 ops (instead of float64) the difference would be wider though. Unfortunately it's not yet easy to do w/ sklearn.
Today with @dholzmueller we explored the possibility of reducing a probabilistic regression problem to a classification problem by binning the target variable and interpolating the conditional CDF, estimated via classifier.predict_proba(X_test).cumsum(axis=1), back to the original continuous range.
Here is a notebook with the results of my experiments:
@bsweber the idea is not that new though. If I recall correctly, PixelCNN and WaveNet both learned a conditional distribution on a discretized continuous variable by treating it as the target of a classification problem. This is what personally gave me the idea to craft this meta-estimator a while ago.
I think @dholzmueller had another reference in mind, maybe https://arxiv.org/abs/2211.05641 ? This paper attempts to give a theoretical justification of a common practice among practitioners.
Unfortunately, the reduced FLOPS of Hyena layers does not necessarily translate into competitive wall-time performance, because long-kernel FFT convolutions typically have a hard time using hardware accelerators (GPUs, TPUs) efficiently.
In particular, FlashAttentionV2 transformers can stay competitive for relatively long input sequences because of their highly optimized fused kernels.
cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python worker processes.
This is typically necessary for running code in parallel on a distributed computing cluster from an interactive developer environment such as Jupyter or Databricks notebooks.
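A minimal illustration of what cloudpickle adds over the standard library pickle module:

```python
# cloudpickle serializes lambdas and interactively defined functions
# by value, so they can be rebuilt in another process; plain pickle
# would raise PicklingError on this lambda.
import pickle

import cloudpickle

square = lambda x: x**2  # not importable by qualified name

payload = cloudpickle.dumps(square)
rebuilt = pickle.loads(payload)  # plain pickle can load it back
print(rebuilt(4))  # → 16
```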
This release drops support for Python 3.6 and 3.7 and adds official support for Python 3.12.
Dropping support for older Python versions made it possible to simplify the code base a lot (more than 500 lines of code deleted).
We also fixed errors when pickling instances of dynamically defined dataclasses.
We also took the opportunity to upgrade our maintenance tools (dropping setup.py in favor of pyproject.toml, running black and ruff via pre-commit, ...).
Yesterday I learned at the #EuroScipy2023 #IbisData tutorial that Ibis now offers an implementation of the across function, first introduced in #dplyr, to conveniently and concisely apply transformations to a set of columns defined by selectors (e.g. based on column data types or name patterns).
This is especially convenient to implement scalable, in-DB feature engineering for machine learning models.