Practical Web Scraping for Data Science

Get Started with Web Scraping using Python!


For those who are not familiar with programming or the deeper workings of the web, web scraping often looks like a black art: the ability to write a program that sets off on its own to explore the Internet and collect data seems magical and exciting.

In this book, we set out to provide a concise and modern guide to web scraping, using Python as our programming language, without glossing over important details or best practices. In addition, this book is written with a data science audience in mind. We're data scientists ourselves, and have very often found web scraping to be a powerful tool to have in your arsenal. Many data science projects start with obtaining an appropriate data set, so why not utilize the treasure trove of information the web provides?

As such, we’ve strived to offer a guide that:

Table of Contents

Nine chapters are included in this book:

  1. Introduction
    In Chapter 1, we provide a brief overview of web scraping and its real-life use cases, and make sure your Python environment is set up correctly.
  2. The Web Speaks HTTP
    In Chapter 2, you’ll learn the basics of HTTP, the core piece of technology behind the web, and start working with the requests Python library.
  3. Stirring the HTML and CSS Soup
    In Chapter 3, we explore scraping HTML and CSS sites, using the Beautiful Soup library.
  4. Delving Deeper in HTTP
    Chapter 4 returns to HTTP, exploring it in more detail.
  5. Dealing with JavaScript
    Chapter 5 introduces the Selenium library, which you’ll use to scrape JavaScript-heavy websites.
  6. From Web Scraping to Web Crawling
    Chapter 6 explains web crawling in detail.
  7. Managerial and Legal Concerns
    Chapter 7 provides an in-depth discussion of managerial and legal concerns.
  8. Closing Topics
    Chapter 8 recaps best practices and provides pointers to other tools.
  9. Examples
    Chapter 9 includes fifteen fully worked-out web scraping examples that bring together everything you’ve learned and illustrate various interesting data science oriented use cases.

Audience

We have written this book with a data science oriented audience in mind. As such, you'll probably already be familiar with Python or some other programming language or analytical toolkit (be it R, SAS, SPSS, or something else). If you're using Python already, you'll feel right at home. If not, we include a quick Python primer later on in this chapter to help you catch up with the basics, and we provide pointers to other references as well. Even if you're not using Python yet for your daily data science tasks (many will argue that you should), we want to show you that Python is a particularly powerful language for getting data out of the web. We also assume that you have some basic knowledge of how the web works.

To summarize, we have written this book to be useful to the following target groups:

  • Data science practitioners already using Python and wanting to learn how to scrape the web using this language
  • Data science practitioners using another programming language or toolkit, but wanting to adopt Python to perform the web scraping part of their pipeline
  • Lecturers and instructors of web scraping courses
  • Students working on a web scraping project or aiming to increase their Python skill set
  • “Citizen data scientists” with interesting ideas requiring data from the web
  • Data science or business intelligence managers wanting to get an overview of what web scraping is all about, how it can benefit their teams, and which managerial and legal aspects need to be considered

About the Authors

Seppe vanden Broucke is an assistant professor of data and process science at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences. Seppe's teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences. Besides work, Seppe enjoys travelling, reading (Murakami to Bukowski to Asimov), listening to music (Booka Shade to Miles Davis to Claude Debussy), watching movies and series (less so these days due to a lack of time), gaming, and keeping up with the news.

Bart Baesens is a professor of big data and analytics at KU Leuven, Belgium, and a lecturer at the University of Southampton, United Kingdom. He has done extensive research on big data and analytics, credit risk modeling, fraud detection and marketing analytics. Bart has written more than 200 scientific papers and several books. Besides enjoying time with his family, he is also a diehard Club Brugge soccer fan. Bart is a foodie and amateur cook. He loves drinking a good glass of wine (his favorites are white Viognier or red Cabernet Sauvignon) either in his wine cellar or when overlooking the authentic red English phone booth in his garden. Bart loves traveling and is fascinated by World War I and reads many books on the topic.

We hope you enjoy reading through this book as much as we enjoyed writing it. Feel free to contact us in case you have questions, find mistakes, or just want to get in touch! We love hearing from our readers and are open to receiving any thoughts and questions.
Seppe vanden Broucke, seppe.vandenbroucke@kuleuven.be
Bart Baesens, bart.baesens@kuleuven.be