Best Algorithm for Tabular/Business Data: Sorry, it’s not deep learning
Amid all the hype about deep learning and “AI”, it is not publicized well enough that for the structured/tabular data widely encountered in business applications, a different machine learning algorithm most often achieves the highest accuracy in supervised learning/prediction tasks: the gradient boosting machine, also known as gradient boosted decision trees (GBM/GBDT).
In this talk we’ll provide plenty of evidence for the vast superiority of GBMs on tabular/business data. We will present some of the major open source implementations, such as xgboost, h2o, lightgbm and catboost (all of them available from R and Python), and we will discuss their main performance characteristics: training speed, memory footprint, scaling to multiple CPU cores, GPU implementations, GPU utilization patterns, etc.
While deep learning is certainly the best algorithm available for computer vision (and it has also shown some success in a few other rather specialized domains), in most business applications, where the data is most often tabular, gradient boosted decision trees are vastly superior to deep neural networks and should definitely be the algorithm of choice.
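For readers unfamiliar with the algorithm named above, the core idea of gradient boosting can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not how xgboost, h2o, lightgbm or catboost are implemented: it uses squared-error loss, depth-1 trees (stumps), and a small made-up table; the function names, column meanings and numbers are all hypothetical.

```python
# Toy gradient boosting for regression: squared-error loss, depth-1 trees.
# Illustrative sketch only; real libraries use far more refined algorithms.

def fit_stump(X, residuals):
    """Return the (feature, threshold, left_value, right_value) split that
    minimizes squared error against the current residuals."""
    best = None  # (error, feature, threshold, left_value, right_value)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between two distinct values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:  # every feature is constant: fall back to the mean residual
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]

def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv

def fit_gbm(X, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals
    (the negative gradient of squared error) and add a shrunken copy of it."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict_gbm(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical tabular data: two numeric columns and a numeric target.
X = [[25, 30], [32, 45], [47, 80], [51, 62],
     [38, 52], [29, 38], [60, 90], [44, 70]]
y = [1.0, 1.5, 3.2, 2.8, 2.0, 1.2, 3.9, 2.9]

model = fit_gbm(X, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))      # predict the mean
model_mse = mse(y, [predict_gbm(model, row) for row in X])
print(baseline_mse, model_mse)
```

The printed numbers are training errors on the toy table only and say nothing about real-world accuracy; the production libraries add regularization, deeper trees, histogram-based split finding, native handling of categorical features, and parallel or GPU training.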
Pafka Szilárd, PhD
Chief Scientist, Epoch (USA)
Szilard studied physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then in 2006 he moved to become the Chief Scientist of a tech company in Santa Monica, California, doing everything data (analysis, modeling, data visualization, machine learning, data infrastructure, etc.). He was the founder/organizer of several meetups in the Los Angeles area (R, data science, etc.) and of the data science community website datascience.la for more than a decade, until he relocated to Texas in 2021. He is the author of a well-known machine learning benchmark on GitHub (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R-finance, Crunch, eRum and contributed at useR!, PAW, EARL, H2O World, Data Science Pop-up, Dataworks Summit, etc.), and he has developed and taught graduate data science and machine learning courses as a visiting professor at two universities (UCLA in California and CEU in Europe).