Who This Book Is For
If you use computers to work with data, this book is for you. You might go by the title of data analyst, database administrator, data engineer, data scientist, or systems programmer. Although your role might be narrower today (perhaps you do only data analysis, or only model building, or only DevOps), you want to stretch your wings a bit: you want to learn how to create data science models as well as how to implement them at scale in production systems.
Google Cloud Platform is designed to make you forget about infrastructure. The marquee data services (Google BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Vertex AI) are all serverless and autoscaling. When you submit a query to BigQuery, it is run on thousands of nodes, and you get your result back; you don’t spin up a cluster or install any software. Similarly, when you submit a data pipeline to Cloud Dataflow or a machine learning job to Vertex AI, you can process data and train models at scale without worrying about cluster management or failure recovery. Cloud Pub/Sub is a global messaging service that autoscales with throughput and with the number of subscribers and publishers, without any work on your part. Even when you’re running open source software like Apache Spark that’s designed to operate on a cluster, Google Cloud Platform makes it easy with job-specific clusters and serverless Spark. Because of this job-specific infrastructure, there’s no need to fear overprovisioning hardware or running out of capacity to run a job when you need it. Plus, data is encrypted, both at rest and in transit, and kept secure. As a data scientist, not having to manage infrastructure is incredibly liberating.
These autoscaled, fully managed services make it easier to implement data science models at scale—which is why data scientists no longer need to hand off their models to data engineers. Instead, they can write a data science workload, submit it to the cloud, and have that workload processed automatically in an autoscaled manner. At the same time, data science packages are becoming simpler and simpler. So, it has become extremely easy for an engineer to slurp in data and use a canned model to get an initial (and often very good) model up and running. With well-designed packages and easy-to-consume APIs, you don’t need to know the esoteric details of data science algorithms—only what each algorithm does and how to link algorithms together to solve realistic problems. This convergence between data science and data engineering is why you can stretch your wings beyond your current role.
Rather than simply reading this book cover to cover, follow along with me by trying out the code. The full source code for the end-to-end pipeline I build in this book is on GitHub. Create a Google Cloud Platform project and, after reading each chapter, try to repeat what I did by referring to the code files in each folder of the GitHub repository and following the instructions in those files. The code snippets in the book are often incomplete; for example, I may omit some arguments to cloud commands for clarity or conciseness.
Note that this is not a reference book—the best reference to Google Cloud is its documentation, and there is very little value to be had by simply reproducing that in a book. Instead, this book shows you how to use a variety of tools together to solve a problem. My goal here is to teach you how to think about a problem in order to solve it using Google Cloud, not to comprehensively cover any particular product.