Always happy to see new MLOps books! DevOps for Data Science is a new book by Alex K Gold. As the name implies, the book focuses on DevOps topics for data scientists. This includes the following:
✅ Command line
✅ Working with Linux systems
✅ Docker
✅ Scaling resources
✅ Network, domains, DNS, SSL, etc.
✅ Authentication
Andrej Karpathy released a tutorial today for reproducing GPT-2 from scratch. OpenAI released GPT-2 in 2019, and its smallest variant is a 124M-parameter model. This four-hour tutorial covers setting up the GPT-2 network and then training and optimizing its parameters.
It looks like a really cool tutorial; I hope to get the bandwidth to watch it in the coming weeks!
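For a sense of where the "124M" comes from, here is a rough back-of-the-envelope sketch. It is not taken from the tutorial; it assumes the commonly published GPT-2 small configuration (50,257-token vocabulary, 1,024-token context, 768-dimensional embeddings, 12 layers) and an output layer that shares weights with the token embedding. The function and argument names are made up for illustration.

```python
def gpt2_param_count(vocab_size=50257, context_len=1024, d_model=768, n_layers=12):
    """Count trainable parameters for a GPT-2 style decoder (assumed layout)."""
    token_emb = vocab_size * d_model       # token embedding table
    pos_emb = context_len * d_model        # learned position embeddings
    layer_norm = 2 * d_model               # scale + bias

    # Attention: fused QKV projection plus output projection (weights + biases)
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model, then project back (weights + biases)
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)

    block = 2 * layer_norm + attn + mlp    # two layer norms per block
    # Final layer norm; the output head reuses token_emb, so it adds nothing
    return token_emb + pos_emb + n_layers * block + layer_norm


print(gpt2_param_count())  # 124439808
```

Under these assumptions the total is about 124.4M, with the token embedding alone accounting for roughly 38.6M of it.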
How can AI (#KI) help museums open up their collections? Sebastian Ruff of the Stadtmuseum Berlin talks about his experiences with tools for automatic keyword generation. His conclusion: working with them costs time at first and the tools have their limits, but they have potential. Important at the outset: complete thesauri with authority data, and an analysis of gaps in the existing data ☝️ #Datenqualität (data quality) https://www.kultur-b-digital.de/digitale-kultur/impulse/ki-im-stadtmuseum-interview-mit-sebastian-ruff/
Learn how to split strings and get the first element in R using base R, stringi, and stringr. Check out my latest post for examples and tips. Give it a try and share your experiences!
🛠️ Compa-tibble functions @grusonh
🏫 R tutorial worksheets with Quarto @nrennie
We're loving the ways we can add modern features to this show. Once you grab a new podcast app from https://newpodcastapps.com, you can see them in their full glory!
Are you a data scientist working for the public sector? The latest best-practices guide from the Ministère de la digitalisation is meant for you!
Feel free to publish the results of your analyses on data.public.lu if you can, or to include data already available as open data in your analyses.
In the development version of {collapse} [v2.0.15, available via install.packages("collapse", repos = "https://fastverse.r-universe.dev")], the pivot() function has received a FUN argument to support aggregation, including a number of hard-coded internal functions that do this "on the fly". Initial benchmarks show that this significantly outperforms other pivot table options in R. More at https://sebkrantz.github.io/collapse/reference/pivot.html (feel free to test and give feedback). #rcollapse #rstats #DataScience
(1/2) I am excited to present at the useR!2024 conference on July 2nd!
I am going to run a virtual workshop about deploying and monitoring data and ML pipelines using free and open-source tools. This includes setting up pipelines using GitHub Actions, Docker 🐳, R, and Quarto 🚀.
posit::conf(2024) virtual tickets are now available!
Join us on August 12-14—from all over the world—to live stream the incredible talks and keynotes that will be taking place in Seattle.
We understand that not everyone will be able to make the trip to Seattle this year, so we’re excited to offer a fully virtual option for everyone.
REGISTER: https://posit.co/conference/
I am excited to present at the Dev AI conference in Paris on June 19!
I am going to run a workshop about the deployment and monitoring of ML pipelines with free and open-source tools. This includes using tools such as GitHub Actions and Pages, Docker, Python, Quarto, etc.
🚀 Announcement: New Version of the crossfire Python Module!
The new version of the crossfire Python module, developed by me and @cuducos, is now available!
✨ What's new:
Bug fixed: now compatible with Google Colab!
Extra feature: a parameter that unpacks nested data to make analysis easier.
Ideal for data journalists and analysts! Sign up for the Fogo Cruzado API and access the data directly in Python.
The TidyDensity package now includes new functions to calculate the Akaike Information Criterion (AIC) for various distributions, streamlining model quality assessment. Use functions like util_negative_binomial_aic() to automate AIC calculations, ensuring precise model evaluation.
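As a reminder of what is being computed, AIC is 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below is a language-agnostic illustration in Python, not TidyDensity's implementation: the helper names are hypothetical, and the negative binomial parameters are fixed by hand here rather than fitted to the data, which a real workflow would do first.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def negative_binomial_log_likelihood(data, size: float, prob: float) -> float:
    """Log-likelihood of counts under a negative binomial (size, prob).

    pmf(x) = Gamma(x + size) / (Gamma(size) * x!) * prob^size * (1 - prob)^x
    """
    return sum(
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(prob) + x * math.log1p(-prob)
        for x in data
    )


# Illustrative counts and hand-picked (not fitted) parameters
counts = [0, 1, 1, 2, 3, 5, 8]
ll = negative_binomial_log_likelihood(counts, size=2.0, prob=0.4)
print(aic(ll, n_params=2))
```

Comparing this value across candidate distributions fitted to the same data is the usual use: the lowest AIC marks the preferred model.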
🐘✨ Great news from Marcela Victoria Soto at R4HR in Buenos Aires! She recently shared updates about their dynamic activities: "Data analysis is crucial for agile decision-making in companies." Join them on June 1, 2024, for the "Data Visualization in HR" event. Perfect for Spanish-speaking R users interested in HR analytics. 📅👥 Read more: https://www.r-consortium.org/blog/2024/05/30/r4hr-in-buenos-aires-leveraging-r-for-dynamic-hr-solutions
Before I head off on a trip to various parts of not-Barcelona, I thought I’d share a somewhat provocative paper by David Hogg and Soledad Villar. In my capacity as journal editor over the past few years I’ve noticed that there has been a phenomenal increase in astrophysics papers discussing applications of various forms of Machine Learning (ML). This paper looks into issues around the use of ML not just in astrophysics but elsewhere in the natural sciences.
The abstract reads:
Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology – in which only the data exist – and a strong epistemology – in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here, we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they introduce strong confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics
arXiv:2405.18095
P.S. The answer to the question posed in the title is probably “yes”.