Where to Start for Machine Learning
tl;dr - learn Python, numpy, pandas, scikit-learn, tensorflow. Get a little familiarity with Python, then take Andrew Ng's Machine Learning specialization on Coursera.
If you are a programmer who wants to learn about machine learning, it can be difficult to know where to start. From the outside, it looks like there are many choices of languages and tools, but there is a clear best path. This is what I wish someone had told me when I was trying to get started.
Use Python. If you don't know it already, it is worth the time to learn. It is widely used in production, but it is also the best language for learning machine learning concepts. The overwhelming majority of courses and books on machine learning use Python. The most important tools and libraries use Python. You don't need to be a master of the language, but you need to be able to read it, and you need to be able to work with arrays, dictionaries, imports, and classes.
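As a rough yardstick for "able to read it", here is a small sketch that touches each of those pieces: an import, a class, a list (Python's built-in array-like type), and a dictionary. The Student class and the scores are made up purely for illustration.

```python
from dataclasses import dataclass  # an import from the standard library


@dataclass
class Student:
    """A hypothetical class that bundles a name with a list of scores."""
    name: str
    scores: list

    def average(self) -> float:
        # Plain arithmetic on a list: total divided by count
        return sum(self.scores) / len(self.scores)


# A list of class instances
students = [
    Student("Ada", [90, 85, 100]),
    Student("Grace", [70, 80, 90]),
]

# A dictionary that maps each name to that student's average score
averages = {student.name: student.average() for student in students}
print(averages)
```

If you can follow what this does without looking anything up, you probably know enough Python to start on the libraries below.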
There are two primary libraries for working with data in Python: numpy and pandas. Numpy is for working with arrays, both one-dimensional and multidimensional. Pandas is built on top of numpy. The primary structures in pandas are the Series (a single labeled column) and the DataFrame (a table of columns). You will use numpy arrays and pandas DataFrames all the time. You will want to get some familiarity with them before you dive into any machine learning resources.
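To give a feel for the two libraries, here is a minimal sketch of the operations you will see most often. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# A one-dimensional array and a two-dimensional array (2 rows, 3 columns)
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)         # (rows, columns)
print(vector * 2)           # arithmetic applies element by element
print(matrix.mean(axis=0))  # mean of each column

# A DataFrame is a table with named columns; each column is a Series
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})
heights = df["height_cm"]         # selecting one column gives a Series
tall = df[df["height_cm"] > 155]  # boolean filtering keeps matching rows
as_array = df.to_numpy()          # the table's values as a numpy array
print(tall)
```

Most machine learning code moves back and forth between these two forms: a DataFrame while you explore and clean the data, and arrays when you hand it to a model.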
I use the term "machine learning" as a generic concept: a process where a computer is fed data and the computer determines an appropriate decision-making process based on the data. The programmer supplies the data, dictates what the structure of the process will be, and specifies how the process (or model) will be evaluated. The programmer does not specify in detail how to transform any given input into an output.
There are many ways to subdivide the field of machine learning. The way that is most important for this discussion is between traditional machine learning and deep learning. In traditional machine learning, the programmer preprocesses the data and chooses what sort of algorithm the model should use. Traditional machine learning works best for data that is in a tabular format. It is applicable to many of the problems that traditional businesses are trying to solve with machine learning.
For traditional machine learning, use scikit-learn. This is a single library with a huge collection of modules, and it has tools you will use for every step of the process, from preprocessing data, to creating models, to tweaking the hyperparameters you use in the models. You will supplement this with other tools, but this is the foundation for machine learning in Python.
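Here is a minimal sketch of those three steps (preprocessing, creating a model, tuning a hyperparameter) in scikit-learn. The dataset, the choice of logistic regression, and the candidate values for C are my own assumptions for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small tabular dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1 and 2: preprocessing and the model, chained into one pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step 3: try several values of the regularization hyperparameter C,
# scoring each with 5-fold cross-validation on the training data
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the best pipeline on data it has never seen
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Notice that the code never says how to turn measurements into a prediction. It only supplies the data, the structure of the model, and the way the model is scored.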
Deep learning is all about neural networks. The programmer does less preprocessing on the data, and instead of an algorithm the focus is on an architecture. You specify the number of nodes that the network will have, arrange them in layers, specify how they are connected, and what processing needs to be done between layers. Deep learning is often used on problems involving images, video, audio, text, and other things where there is a large amount of data that is not in a tabular format.
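To make "architecture" concrete, here is a rough sketch of a tiny network written in plain numpy. The layer sizes, the ReLU and softmax functions, and the random weights are all assumptions chosen for illustration. A real deep learning library would also handle training the weights, which this sketch leaves out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Architecture choices: 4 inputs, one hidden layer of 8 nodes, 3 outputs
layer_sizes = [4, 8, 3]

# Fully connected layers: one weight matrix and one bias vector per layer
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def relu(x):
    """Processing between layers: replace negative values with zero."""
    return np.maximum(x, 0.0)


def softmax(x):
    """Turn each row of raw outputs into probabilities that sum to 1."""
    shifted = x - x.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward(batch):
    """Pass a batch of inputs through the hidden layer, then the output layer."""
    hidden = relu(batch @ weights[0] + biases[0])
    return softmax(hidden @ weights[1] + biases[1])


batch = rng.normal(size=(5, 4))  # 5 made-up examples with 4 features each
probabilities = forward(batch)
print(probabilities.shape)
```

Changing the architecture here means editing layer_sizes or swapping the function applied between layers, which is the kind of decision the deep learning frameworks let you express at a much higher level.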
This is the only part of the stack where there is any question about which tools to use. The most widely used frameworks in production systems appear to be Google's TensorFlow and Facebook's PyTorch. AWS has built its tools on Apache's MXNet, so that also seems like a viable option.
The training materials I have used have all used TensorFlow, so that is my soft recommendation. If you find courses that use PyTorch, don't be put off by that at all. You are not giving up anything by choosing one over the other. If you go with TensorFlow, you will also be using Keras, which is an API for working with TensorFlow.
Where Can I Learn
There are lots of Python learning materials out there. Find a book, video, or course that appeals to you. Similarly, for numpy and pandas, I have not found a single best resource. You should get some familiarity so that you can understand these parts of the machine learning instruction. But you probably don't need to go into too much depth right away. When you start trying to work on problems on your own, you will be very motivated to learn specific things.
After you have some familiarity with Python, numpy and pandas, the best introduction to machine learning is Andrew Ng's specialization on Coursera: https://www.coursera.org/specializations/machine-learning-introduction. The clarity and depth of the teaching are astounding. The specialization also introduces deep learning, but it does not go into as much depth.
After you finish the specialization on Coursera, head over to Kaggle to practice what you have learned. Kaggle, https://www.kaggle.com/, is a site that hosts machine learning competitions, and it also has a lot of instruction, datasets, and discussions. When you first sign up (oh, it is free, btw), look at the material that tells you how to get started. The introductory course will show you how to complete one of the Getting Started competitions.
From there, you will want to look at some of the other solutions people have published. This is where you will start to realize how much more you have to learn about pandas. In addition to the Getting Started competitions, there is a Playground competition. Each month there is a new problem that uses tabular data, and you have to use the techniques you learned to work on the problem.
There are also Featured competitions on Kaggle. These award status points and, for the top competitors, even cash prizes. The problems in the Featured section often involve video analysis, or language analysis, or other things that require deep learning or other large, complicated architectures to solve. I have not attempted any of these yet, so I haven't figured out what resources are best to prepare for those. I will post here when I know the answer.