I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).
Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and 2 s (CV included) on my laptop.
Today with @dholzmueller we explored reducing a probabilistic regression problem to a classification problem by binning the target variable and interpolating the conditional CDF, estimated by classifier.predict_proba(X_test).cumsum(axis=1), back to the original continuous range.
Here is a notebook with the results of my experiments:
Soon I'll buy my Super-Fan tickets for #PositConf2024 in Seattle (not available quite yet as far as I can find), but first it's time for one more thread to summarize my threads! Each post in this thread will be flagged with a titled "content warning" to make it easier to find your way back to the top, I hope that works out!
I have been thinking a bit about how to detect supply chain attacks against popular open source projects such as scikit-learn.
If you have practical experience with https://reproducible-builds.org/ in particular in the #Python / #PyData ecosystem, I would be curious about any feedback to the plan I suggest for scikit-learn in the following issue.
Feel free to reply on mastodon first, if you have questions.
Python Data Science at posit::conf(2023). We are excited about all our Python workshops at posit::conf this year.
posit::conf(2023) is our conference for all things open source data science. Join us in Chicago Sept 17-20 for two days of workshops and two days of talks and community. Learn more at pos.it/conf.
🎂It's my birthday!🎂
To celebrate, I'm... Working to build a friendly, diverse #DataScience community at https://r4ds.io, just like I do every day! It'd make my day if you supported our efforts at https://r4ds.io/donate !
cloudpickle is a library used by PySpark, Dask, Ray and joblib / loky (among others) to make it possible to call dynamically or locally defined functions, closures and lambdas on remote Python worker processes.
This is typically necessary for running code in parallel on a distributed computing cluster from an interactive developer environment such as a Jupyter or Databricks notebook.
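A small sketch of the difference, assuming cloudpickle is installed. The functions `make_scaler` and `square` are made-up examples, and the "remote worker" is simulated by loading the bytes back in the same process with the standard `pickle` module.

```python
import pickle

import cloudpickle


def make_scaler(factor):
    # `scale` is a closure: it is defined locally and captures `factor`.
    def scale(x):
        return factor * x

    return scale


triple = make_scaler(3)
square = lambda x: x ** 2  # noqa: E731 (lambda used on purpose)

# The standard library pickles functions by reference (module + name),
# which does not work for locally defined closures.
try:
    pickle.dumps(triple)
    stdlib_failed = False
except (pickle.PicklingError, AttributeError):
    stdlib_failed = True

# cloudpickle serializes the function's code and captured variables by value.
payload = cloudpickle.dumps(triple)

# The receiving side only needs the standard pickle module to load it.
restored_triple = pickle.loads(payload)
restored_square = pickle.loads(cloudpickle.dumps(square))

print(stdlib_failed, restored_triple(4), restored_square(5))
```

In a real cluster setup, `payload` would be sent over the network to a worker process, which is what the libraries listed above do on the user's behalf.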
We can't replace them, but we welcome anyone looking for a friendly, inclusive community to join us at the Data Science Learning Community (@DSLC) https://DSLC.io
Many thanks to @mariatta, @lorenipsum, the rest of the @pycon organizing team and @ThePSF staff, and everyone else who made PyCon US in Pittsburgh possible and awesome. See you again in 2025!
Registration is now open for our April meetup: 🐲 LLMOps & ML for Drilling Performance and Python & Dungeons, this month at the Repsol offices
I've given several internal versions of this workshop at Amazon and I daresay it's been very well received. The power of these new data wrangling libraries is honestly staggering. We use them all the time at work. You should too.
20 bucks gets you in the door. All proceeds to Ukraine aid orgs. #rstats #pydata
While data scientists are often taught about training a machine learning model, building a reliable MLOps strategy to deploy and maintain that model can be daunting.
It doesn’t have to be this way!
Join us with Julia Silge at Posit on Wednesday, April 24th at 11 am ET to learn how Posit Team provides fluent tooling for the whole ML lifecycle.
No registration is required to attend - simply add it to your calendar using this link, https://pos.it/team-demo
Join #PyData #Pittsburgh for a casual gathering of the local, national, and international PyData community on the sidelines of #PyCon US 2024! Meet up with fellow #DataScience, #MachineLearning, and scientific computing enthusiasts when the world's largest Python conference comes to town.
Check out Dr. Albert Rapp's latest YouTube video on mastering the great_tables Python package! From raw data to polished displays, learn about custom fonts, nanoplots, conditional formatting, and the steps to create a lovely-looking data display table with great_tables. https://www.youtube.com/watch?v=ESyWcOFuMQc&ab_channel=AlbertRapp