Recently I helped create an upskilling curriculum for data science. It was aimed at people already working in the industry with some technical background, though not in big data: people who didn't have months to spend, but who needed hands-on experience to get started. Given the course's time constraints, our team debated which technologies would be most important for people to learn. Jupyter, Docker, Spark, and TensorFlow each came up as a key technology. We started building examples that tied the tools together into practical workflows, plus a sampler of machine learning, data visualization, and related topics.
At about the same time, I saw a preview of this book. "Hey, that's it!" we said. "That's just the right mix of tools, techniques, and real-world examples."
I met Jerome years ago while guest lecturing for a data engineering fellowship. We were focused on Apache Spark in that course.
Although the other components described here (Jupyter, Docker, Anaconda, deep learning, vector embeddings, and so on) existed at the time, it wasn't yet clear how they would evolve or how important they would become together. Later, Jerome adapted IBM training materials on Spark for use in a training program at O'Reilly, where we were both teaching. It's been a pleasure to watch him grow in this field.
Jerome has a passion for education and developer advocacy that shows throughout these pages. Beyond demonstrating these open source technologies, he enjoys showcasing them in context, giving people the tools they need to succeed in their work.