I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.
If you have practical experience with https://reproducible-builds.org/, in particular in the #Python / #PyData ecosystem, I would be curious to hear any feedback on the plan I suggest for scikit-learn in the following issue.
Feel free to reply on Mastodon first if you have questions.
I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets with mixed numerical and categorical features (e.g. the Adult Census dataset).
Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code that run in 2 s (CV included) on my laptop.
Today with @dholzmueller we explored the possibility of reducing a probabilistic regression problem to a classification problem by binning the target variable and interpolating the conditional CDF estimated by classifier.predict_proba(X_test).cumsum(axis=1) back to the original continuous range.
Here is a notebook with the results of my experiments:
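To make the binning idea concrete, here is a minimal pure-Python sketch for a single test sample. It is not the code from the notebook: the bin edges, the probabilities and the helper names (`binned_cdf`, `cdf_at`, `quantile`) are hypothetical, and it assumes the probability mass is spread uniformly inside each bin, i.e. the CDF is interpolated linearly between bin edges.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def binned_cdf(bin_probas):
    """Cumulative sum of one row of predict_proba, with a leading 0.

    The result has one value per bin edge: CDF(edge_0) = 0, CDF(edge_last) = 1.
    """
    cdf = [0.0] + list(accumulate(bin_probas))
    total = cdf[-1]
    return [c / total for c in cdf]  # renormalize against rounding drift


def cdf_at(y, bin_edges, cdf):
    """Linearly interpolate the CDF at a continuous target value y."""
    if y <= bin_edges[0]:
        return 0.0
    if y >= bin_edges[-1]:
        return 1.0
    i = bisect_right(bin_edges, y) - 1  # index of the bin containing y
    t = (y - bin_edges[i]) / (bin_edges[i + 1] - bin_edges[i])
    return cdf[i] + t * (cdf[i + 1] - cdf[i])


def quantile(q, bin_edges, cdf):
    """Invert the interpolated CDF to map a level q back to the target range."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    i = bisect_left(cdf, q)  # first edge whose CDF value is >= q
    if i == 0:
        return bin_edges[0]
    t = (q - cdf[i - 1]) / (cdf[i] - cdf[i - 1])
    return bin_edges[i - 1] + t * (bin_edges[i] - bin_edges[i - 1])


# Hypothetical example: 3 bins over the target range [0, 40].
bin_edges = [0.0, 10.0, 20.0, 40.0]
bin_probas = [0.25, 0.5, 0.25]  # stands in for one row of predict_proba
cdf = binned_cdf(bin_probas)

median = quantile(0.5, bin_edges, cdf)
p_below_30 = cdf_at(30.0, bin_edges, cdf)
```

With a real classifier, `bin_probas` would be one row of `classifier.predict_proba(X_test)` and `bin_edges` the edges used to discretize the training target; other interpolation schemes than the linear one assumed here are possible.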
Do the one thing I really need Python for via {reticulate} by just sending it the exact dataframe it needs and sending the results back to R for post-processing
Hadn’t occurred to me until recently, but I am really, REALLY liking it.
cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python worker processes.
This is typically necessary for running code in parallel on a distributed computing cluster from an interactive developer environment such as a Jupyter or Databricks notebook.
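A small sketch of the difference, assuming cloudpickle may or may not be installed: the standard library pickle refuses a locally defined closure, while `cloudpickle.dumps` serializes it by value so that a plain `pickle.loads` can rebuild it. The `make_scaler` helper is a hypothetical example, and the round trip happens in a single process here instead of on a remote worker.

```python
import pickle


def make_scaler(factor):
    # Locally defined closure: it has no importable module-level name.
    def scale(x):
        return factor * x

    return scale


scale_by_3 = make_scaler(3)

# The stdlib pickle stores functions by reference (module + qualified name),
# so it cannot serialize a function defined inside another function.
try:
    pickle.dumps(scale_by_3)
    stdlib_pickle_ok = True
except (pickle.PicklingError, AttributeError):
    stdlib_pickle_ok = False

# cloudpickle is a third-party package; skip the round trip if it is missing.
try:
    import cloudpickle
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    payload = cloudpickle.dumps(scale_by_3)  # bytes that could be sent to a worker
    restored = pickle.loads(payload)         # what a worker would do on receipt
    roundtrip_result = restored(14)
else:
    roundtrip_result = None
```

On a real cluster the `payload` bytes would travel over the network, and the worker process also needs cloudpickle importable to load them.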
Anyone in the UK enthusiastic about dogs, #Rstats, #Python, #Pydata, and looking for a new job? There’s a Data Officer role going on my team if so! Interesting work and a nice bunch of people.
Soon I'll buy my Super-Fan tickets for #PositConf2024 in Seattle (not available quite yet as far as I can find), but first it's time for one more thread to summarize my threads! Each post in this thread will be flagged with a titled "content warning" to make it easier to find your way back to the top, I hope that works out!
@pydataamsterdam So excited to see Thomas Wolf and others from Hugging Face 🤗 giving a promising closing keynote! Just this Monday I was working with some colleagues on a HF + @kedro integration that hopefully will go open source soon.
Impressive results from Hugging Face: proper filtering of web data can match or exceed performance of commercial models trained on highly curated datasets.
@pydataamsterdam Thomas kind of dodged my question on the enforceability of OpenRAIL 😇 Still, I'm happy that these licenses exist; it's a conversation we need to have.