ogrisel

@ogrisel@sigmoid.social

Machine Learning Engineer at :probabl., scikit-learn core contributor. #Python, #Pydata, #MachineLearning & #DeepLearning.

ogrisel, to random

joblib 1.4.0 is out!

Among other fixes and improvements this release brings:

  • numpy 2 compatibility;
  • support for a new Parallel kwarg, return_as="generator_unordered", to return results out of order, in a streaming manner.

https://github.com/joblib/joblib/blob/main/CHANGES.rst#release-140----20240408

ogrisel, to python

The deadline for the CFP of #PyData Paris 2024 is approaching soon!

Submit your talk proposal now:

https://pretalx.com/pydata-paris-2024/cfp

I would advise you not to expect an automatic deadline extension.

#Python #DataScience

ogrisel, (edited) to random

HELP WANTED!

Are you a scikit-learn x PyPy user? If so, we are looking for help investigating why our test suite uses so much memory and regularly causes our CI infrastructure to fail as a result.

This kind of investigative work is time consuming, and none of the current scikit-learn maintainers have a particular interest in investing the time and effort needed to support PyPy more efficiently at the moment.

ogrisel,

If you would like to help, here is a concrete example of the kind of investigation you would need to conduct to help us pinpoint PyPy-specific memory problems:

https://github.com/scikit-learn/scikit-learn/issues/27662

and here is an example of the kind of fix that helped reduce this memory problem in the past:

https://github.com/scikit-learn/scikit-learn/pull/27670

We think that similar investigations and fixes are needed to make scikit-learn and PyPy reasonably memory efficient together.

ogrisel,

Also even if you do not have the time to help maintain PyPy support, we are still interested in learning more about any use cases of PyPy with scikit-learn together.

ansate, to random

I'm probably going to regret this - but I need to do some smol local python dev. I want to use some kind of virtual environment instead of installing packages globally.

I'm looking at venv - this mostly makes sense: https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/

Poetry let me down on my last computer, and I'm not enjoying Conda on my work computer, so I'd rather not use either of those.

Given that background, do you have any advice for me?

ogrisel,

@ansate I would use miniforge to get a local, minimal install of conda/mamba in your home folder. From there you can create as many conda envs, with any Python versions you want, in parallel. You can even use pip to install everything you want in those conda envs if you don't like the conda-forge packages for whatever reason. Miniforge works on Linux, macOS and Windows.
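A sketch of that workflow, assuming Miniforge is already installed (the env name myproject is made up for the example):

```shell
# Create an isolated environment with the Python version of your choice:
conda create -n myproject python=3.12

# Activate it, then install packages with conda or plain pip:
conda activate myproject
python -m pip install scikit-learn

# Inspect and clean up environments:
conda env list
conda env remove -n myproject
```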

ogrisel, to random

The Call for Proposals for #PyData Paris 2024 is officially OPEN! 🎉

Share your insights, discoveries, and innovations with the open-source data science and AI/ML community.

Submit your proposal at https://pydata.org/paris2024 and be a part of this incredible event!

GregWilson, to datascience

Looks like #duckdb is a vector database now. Pack it in vector DB companies; time to pivot.
#datascience

ogrisel,

@GregWilson to be considered a competitor to vector DBs, it would also need indexing and approximate nearest neighbor search (e.g. with an HNSW implementation or similar).

hynek, to random
ogrisel,

@hynek @henryiii maybe try running uname -a inside the container to check that it's not running linux/amd64 (instead of arm64, a.k.a. aarch64) via qemu for some unexpected reason (e.g. a bad choice of base image).

ogrisel,

@hynek @henryiii I don't think docker relies on qemu to run a linux/arm64 container on a macOS/arm64 (M1) host; the fast macOS hypervisor should be enough (I think).

I think docker only uses qemu if you run a linux/amd64 container image on a macOS/arm64 host.
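A quick way to check from the host, sketched with a hypothetical python:3.12-slim image (any small linux image with coreutils would do):

```shell
# Native arm64 container on an M1 host: no qemu involved.
docker run --rm python:3.12-slim uname -m   # expect: aarch64

# Forcing the amd64 image triggers qemu user-mode emulation (much slower):
docker run --rm --platform linux/amd64 python:3.12-slim uname -m   # expect: x86_64
```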

ogrisel,

@hynek @henryiii Alright, it makes a lot of sense now. Sorry for the misunderstanding.

itamarst, to python

Numba profiler is becoming less of a prototype and more of a reality:

#python

ogrisel,

@itamarst great to hear!

ogrisel, to python

I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.

If you have practical experience with https://reproducible-builds.org/, in particular in the #Python / #PyData ecosystem, I would be curious to hear your feedback on the plan I suggest for scikit-learn in the following issue.

Feel free to reply on mastodon first, if you have questions.

https://github.com/scikit-learn/scikit-learn/issues/28151

ogrisel,

@sethmlarson @vstinner I completely agree with everything you said. Do you plan to focus first on making official CPython releases themselves automatically reproducible, or on improving wheel-building tools to make PyPI-hosted artifacts reproducible?

Nice work w.r.t. the SBOM of CPython. It would be interesting to have cibuildwheel able to dump an SBOM file, and later rebuild from one while checking the sha256 values of the dependencies.

lmcinnes, to random

Introducing DataMapPlot for creating beautiful presentation ready plots of data maps.

With DataMapPlot, all you need is a 2D representation of your data and labelled clusters; it can then produce beautiful plots that you can easily style to your needs.

Documentation is on ReadTheDocs: https://datamapplot.readthedocs.io

Code is on Github: https://github.com/TutteInstitute/datamapplot

$ pip install datamapplot

ogrisel,

@lmcinnes Beautiful work!

In case you have unlabelled clusters but access to a text abstract for the members of those clusters, do you think an LLM could help suggest meaningful cluster labels automatically?

Also, it's not necessarily easy to assess the relative sizes of the clusters from those maps, e.g. to answer questions such as: what fraction of Wikipedia pages is dedicated to the "Food and cooking" cluster?

ansate, to random

I have almost no plans today (move some mulch, take a nap)

I would like to bake something - any suggestions?

ogrisel,

@ansate Moving mulch is already a neat plan. On my side, I put some peanuts in the bird feeder; it was very cold this morning.

ogrisel,

@ansate Not yet but I am pretty sure they will come. Last year peanuts were very much appreciated.

And I am glad you liked the workshop!

ogrisel, to random

MotherNet: A Foundational Hypernetwork for Tabular Classification

by Andreas Müller, Carlo Curino, Raghu Ramakrishnan
https://arxiv.org/abs/2312.08598

Crazy paper that introduces a meta-trained transformer that performs in-context learning of the weights of small MLPs from numerical tabular training sets passed in the 'prompt' of the big transformer.

A kind of TabPFN but with very fast inference.

minrk, to random

Star Trek life achievement unlocked: my job title is officially Chief (research) Engineer.

ogrisel,

@minrk congrats!

ogrisel, (edited) to random

I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).

Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and about 2 s (CV included) on my laptop.

1/n

#sklearn #PyData #MachineLearning #TabularData #GradientBoosting #DeepLearning #Python

ogrisel,

It was interesting to see that the neural network's predictive accuracy would be degraded by one or two points if we had used standard scaling of the numerical features instead of splines, or if I had used a smaller number of knots for the splines.

For this particular dataset, it seems important to use feature preprocessing with an axis-aligned prior for the numerical features.

4/n

ogrisel,

This is in line with the numbers in the AD column of Table 6 of this very interesting paper:

On Embeddings for Numerical Features in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Artem Babenko

https://arxiv.org/abs/2203.05556

Note that I did not do extensive parameter tuning, but my notebook is not too far from those numbers.

I might try to implement the periodic features as a preprocessor in the future.

5/n

ogrisel,

Meanwhile, I also checked the calibration of the tree-based and nn-based models.

The conclusion is that both models are well calibrated by default, as long as you use early stopping.

If you disable early stopping and max_iter is too small (underfitting) or too large (overfitting), then the models can be significantly under-confident or over-confident.

6/n

[Image: near-diagonal calibration curves]

ogrisel,

Here is the link to the rendered notebook:

https://nbviewer.org/github/ogrisel/notebooks/blob/master/sklearn_demos/gbdt_vs_neural_nets_on_tabular_data.ipynb

It also includes a similar study on California Housing which has only numerical features.

For this dataset, spline features degrade performance, which I found quite surprising. Standard scaling, however, makes the neural network competitive with (albeit still slower than) the tree-based model.

7/7.

ogrisel, (edited)

@jjerphan I wouldn't be so sure. I think a PyTorch equivalent, possibly wrapped with skorch, would be more efficient, even on CPU, especially if one used torch.compile to further remove overhead. I might update this notebook when I get the chance to confirm.

lmcinnes, to random

The landscape of the Machine Learning section of ArXiv.

This was the result of a side-project to build tools to automate the generation of such plots, from label placement, to palette and aesthetics.

Dataset was from https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers encoded with sentence-transformers and mapped with UMAP. Clustering by fast_hdbscan.

ogrisel,

@lmcinnes it would be interesting to plot the journey of senior researchers on this map. I suspect that some traveled a lot to explore new topical grounds over the course of their career while others stayed on their island ;)
