Scraping data from websites with Python can be exciting and challenging at the same time. It's also a great way to practice Python's fundamentals and get comfortable writing simple scripts.

More than 5 years ago, when I first started to explore what Python is capable of, the very first thing I wanted to learn was how to automate browser actions: opening websites, clicking buttons, entering text into input fields, etc.

Back then I was learning how to automate facebook.com.

In the following steps, you will learn how to automate scraping data from amazon.com from A to Z.

I’ve prepared steps covering everything you need to do to fully automate the process and gather data for up to 100 products in less than a minute.

The provided code gives you a good base for automating the Amazon website, and you’ll be able to expand and extend it toward whatever your goals are.

Follow the steps below.

Project directory

Whenever I start to develop a project or write some kind of Python script, I always start a new project in my projects folder.

Unless you already have a specific project you’re working on, I’d recommend creating a new folder just to follow the steps in this tutorial.

Over the years I’ve realized it’s best to keep all your projects in one folder called projects. For example, the path to mine is:

  • /Users/robertsgreibers/projects/

In this case, I’m going to call my project folder pythonic.me, which is the name of this blog.

Create a new project directory

If you don’t have a projects folder, I’d recommend creating one with the following steps:

  • mkdir /Users/robertsgreibers/projects/
  • cd /Users/robertsgreibers/projects/
  • mkdir pythonic.me

Replace pythonic.me with the name you want to use for your project; you could call it something as simple as amazon.

Global environment

As you may already know, you can have two versions of Python on your computer (if you’re sneaky enough, probably more).

The versions I’m talking about are a global Python version and a local Python version.

By global I mean the version you get when you first install Python: open up a terminal window, type python, and you’ll get a version that’s available throughout your whole system.

robertsgreibers@MacBook-Pro ~ % python
Python 3.8.12 (default, Aug 31 2021, 04:08:54)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Local environment

The local Python version is the one you can create and use within a single project folder.

It is an isolated Python environment that lets you use as many Python packages as you want without affecting the global version of Python.

If you’ve played around with Python enough, you’ve probably noticed how installing a lot of different Python packages globally can cause conflicts when you work on multiple Python projects.

I’d recommend always using a local Python environment and barely touching the global Python version at all.

Install pipenv

By this time you should already be in your project folder, ready to set up a new Python environment (for me, the project folder would be /Users/robertsgreibers/projects/).

Once inside your project folder, install pipenv with the following command (unless you already have pipenv installed):

brew install pipenv

If you’re on Windows, check out the pipenv documentation page for details, or just Google for pipenv installation instructions; it shouldn’t be hard to find a way to install pipenv on Windows.

You can make sure pipenv is installed by executing pipenv in the terminal, which should give you a long list of available commands:

robertsgreibers@MacBook-Pro ~ % pipenv
Usage: pipenv [OPTIONS] COMMAND [ARGS]...

Options:
  --where                         Output project home information.
  --venv                          Output virtualenv information.
  --py                            Output Python interpreter information.
  --envs                          Output Environment Variable options.
  --rm                            Remove the virtualenv.
  --bare                          Minimal output.
  --completion                    Output completion (to be executed by the
                                  shell).

  --man                           Display manpage.
  --support                       Output diagnostic information for use in
                                  GitHub issues.

  --site-packages / --no-site-packages
                                  Enable site-packages for the virtualenv.
                                  [env var: PIPENV_SITE_PACKAGES]

  --python TEXT                   Specify which version of Python virtualenv
                                  should use.

  --three / --two                 Use Python 3/2 when creating virtualenv.
  --clear                         Clears caches (pipenv, pip, and pip-tools).
                                  [env var: PIPENV_CLEAR]

  -v, --verbose                   Verbose mode.
  --pypi-mirror TEXT              Specify a PyPI mirror.
  --version                       Show the version and exit.
  -h, --help                      Show this message and exit.


Usage Examples:
   Create a new project using Python 3.7, specifically:
   $ pipenv --python 3.7

   Remove project virtualenv (inferred from current directory):
   $ pipenv --rm

   Install all dependencies for a project (including dev):
   $ pipenv install --dev

   Create a lockfile containing pre-releases:
   $ pipenv lock --pre

   Show a graph of your installed dependencies:
   $ pipenv graph

   Check your installed dependencies for security vulnerabilities:
   $ pipenv check

   Install a local setup.py into your virtual environment/Pipfile:
   $ pipenv install -e .

   Use a lower-level pip command:
   $ pipenv run pip freeze

Commands:
  check      Checks for PyUp Safety security vulnerabilities and against PEP
             508 markers provided in Pipfile.

  clean      Uninstalls all packages not specified in Pipfile.lock.
  graph      Displays currently-installed dependency graph information.
  install    Installs provided packages and adds them to Pipfile, or (if no
             packages are given), installs all packages from Pipfile.

  lock       Generates Pipfile.lock.
  open       View a given module in your editor.
  run        Spawns a command installed into the virtualenv.
  shell      Spawns a shell within the virtualenv.
  sync       Installs all packages specified in Pipfile.lock.
  uninstall  Uninstalls a provided package and removes it from Pipfile.
  update     Runs lock, then sync.

Create a local environment

Inside your project folder, create a new Python virtual environment by executing pipenv install

robertsgreibers@MacBook-Pro pythonic.me % pipenv install
Warning: the environment variable LANG is not set!
We recommend setting this in ~/.profile (or equivalent) for proper expected behavior.
Creating a virtualenv for this project…
Pipfile: /Users/robertsgreibers/projects/pythonic.me/Pipfile
Using /usr/local/opt/python@3.8/bin/python3 (3.8.12) to create virtualenv…
⠧ Creating virtual environment...created virtual environment CPython3.8.12.final.0-64 in 427ms
  creator CPython3Posix(dest=/Users/robertsgreibers/.local/share/virtualenvs/temppp-r-PXQWHt, clear=False, global=False)
  seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/Users/robertsgreibers/Library/Application Support/virtualenv/seed-app-data/v1.0.1)
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

✔ Successfully created virtual environment!
Virtualenv location: /Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-r-PXQWHt
Creating a Pipfile for this project…
Pipfile.lock not found, creating…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (db4242)!
Installing dependencies from Pipfile.lock (db4242)…
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0/0 — 00:00:00
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.

After you execute pipenv install, check your project folder; you should see two new files: Pipfile and Pipfile.lock.

Both files keep track of the Python packages installed in this specific virtual environment.

Pipfile & Pipfile.lock created by pipenv install
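For reference, a freshly generated Pipfile usually looks something like the snippet below. The exact contents depend on your pipenv and Python versions, so treat this as an illustrative sketch, not the exact file you'll get:

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]

[dev-packages]

[requires]
python_version = "3.8"
```

The [packages] section starts empty and fills up as you install packages with pipenv install.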

Install Selenium WebDriver

Activate virtual environment

Once you’ve set up your Python virtual environment, you’re ready to install new Python packages.

But before you actually install anything, make sure you have activated the environment by executing the following command: pipenv shell

(pythonic.me) robertsgreibers@MacBook-Pro pythonic.me % pipenv shell
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead. You can set PIPENV_VERBOSITY=-1 to suppress this warning.
Launching subshell in virtual environment…
 . /Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/bin/activate
robertsgreibers@MacBook-Pro pythonic.me %  . /Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/bin/activate
(pythonic.me) robertsgreibers@MacBook-Pro pythonic.me % 

You can tell the virtual environment is activated by the parentheses on the left side of your username. For me it’s (pythonic.me); for you it could be different, depending on your project name.

Install Selenium WebDriver

The very first step to scraping Amazon products with Python is installing selenium.

(I’m assuming you’ve already installed Python and pipenv; if not, just do a quick Google search and install Python. For pipenv, use the steps described above.)

Depending on your previous experience with Python, you might be familiar with the pip tool.

It’s a good tool, but an old one. Ever since I discovered pipenv I’ve never gone back to pip.

Once you have the virtual environment activated, execute the following command to install Selenium:

(pythonic.me) robertsgreibers@MacBook-Pro pythonic.me % pipenv install selenium  
Installing selenium…
✔ Installation Succeeded 
Pipfile.lock (db4242) out of date, updating to (e89fe3)…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Building requirements...
✔ Success! 
Updated Pipfile.lock (e89fe3)!
Installing dependencies from Pipfile.lock (e89fe3)…
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0/0 — 00:00:00

Create amazon.py file

Once you have your project, pipenv, and selenium ready, the next step is to create a Python file where you can follow along with the steps.

Since you’re about to jump into exploring Amazon product scraping I’d recommend naming your Python file accordingly.

Go ahead and create a new file called amazon.py in your project folder:

Select new Python file creation in PyCharm
Set file name: amazon.py

Try opening browser with Selenium

If you just installed Selenium for the first time, you might run into problems even with the first step.

So, I’d recommend taking a very simple Selenium code example and executing it:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://selenium.dev/')

Geckodriver needs to be in PATH

The first problem you might run into is a missing geckodriver executable; the exact exception message you’re going to see is:

Traceback (most recent call last):
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 71, in start
    self.process = subprocess.Popen(cmd, env=self.env,
  File "/usr/local/Cellar/python@3.8/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/Cellar/python@3.8/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/robertsgreibers/projects/pythonic.me/2_solid_python_selenium_beautifulsoup_examples/1_amazon_products/1_amazon_products.py", line 9, in <module>
    browser = webdriver.Firefox()
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    self.service.start()
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Fix geckodriver error

A quick Google search will give you a good answer to the geckodriver error.

I found a Stack Overflow post that helps here; all you need to do to solve this problem is install geckodriver with brew:

brew install geckodriver
Execute: brew install geckodriver
brew pouring geckodriver

Open browser with Selenium

Once you install geckodriver you should be good to go; try executing amazon.py once again and see what comes up.

Executing the file should open up a Firefox browser and go to the page http://selenium.dev/

Running Selenium for the first time

If you got the browser page from the screenshot above, you’re all set and you can start to focus on Amazon product scraping with the next steps.

Choose product category with Selenium

Obviously, the first thing you need to do is figure out how to open up amazon.com and start looking for the products you want to scrape.

Depending on your goals here, you can go in a lot of different directions.

I’m going to start with the steps that I tried to take and explain why my initial idea did not work.

Open amazon.com

Scratch whatever you had in the amazon.py file and paste in the following code:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()
  • browser.get('https://www.amazon.com/') – Open amazon.com
  • browser.maximize_window() – Maximize Firefox window so that there are no hidden HTML elements

Once you run the code from the block above, Python should open the amazon.com landing page.

The first thing that comes to my mind is how I could change my code to scrape specific product categories instead of blindly scraping whatever is on the first page.

First idea: How to open Amazon product categories menu with Selenium?

Select product category

Right-click on the dropdown menu icon All and select Inspect.

Inspect will open Firefox developer tools that let you inspect the HTML code of the page you’re currently on.

My initial goal was to find the HTML element responsible for the categories dropdown menu; this is the way to do it.

Inspect Amazon products category dropdown menu

Here’s what the HTML code looks like for the category dropdown menu with id="searchDropdownBox":

Inspect amazon category dropdown

Here’s the HTML code extracted from Amazon source code above. When you look at the code it seems like an easy thing to do:

  • Use Selenium to find HTML element with id="searchDropdownBox" and click on it
  • Once the dropdown menu opens, click on the category you want to search in, right?

Well, not really… let me explain why below.

<select aria-describedby="searchDropdownDescription" class="nav-search-dropdown searchSelect nav-progressive-attrubute nav-progressive-search-dropdown" data-nav-digest="Xa0GQ+pPQ/tdsV+GmRWeXB8PUD0=" data-nav-selected="0" id="searchDropdownBox" name="url" style="display: block; top: 2.5px;" tabindex="0" title="Search in">
    <option selected="selected" value="search-alias=aps">All Departments</option>
    <option value="search-alias=arts-crafts-intl-ship">Arts & Crafts</option>
    <option value="search-alias=automotive-intl-ship">Automotive</option>
    <option value="search-alias=baby-products-intl-ship">Baby</option>
    <option value="search-alias=beauty-intl-ship">Beauty & Personal Care</option>
    <option value="search-alias=stripbooks-intl-ship">Books</option>
    <option value="search-alias=computers-intl-ship">Computers</option>
    <option value="search-alias=digital-music">Digital Music</option>
    <option value="search-alias=electronics-intl-ship">Electronics</option>
    <option value="search-alias=digital-text">Kindle Store</option>
    <option value="search-alias=instant-video">Prime Video</option>
    <option value="search-alias=fashion-womens-intl-ship">Women's Fashion</option>
    <option value="search-alias=fashion-mens-intl-ship">Men's Fashion</option>
    <option value="search-alias=fashion-girls-intl-ship">Girls' Fashion</option>
    <option value="search-alias=fashion-boys-intl-ship">Boys' Fashion</option>
    <option value="search-alias=deals-intl-ship">Deals</option>
    <option value="search-alias=hpc-intl-ship">Health & Household</option>
    <option value="search-alias=kitchen-intl-ship">Home & Kitchen</option>
    <option value="search-alias=industrial-intl-ship">Industrial & Scientific</option>
    <option value="search-alias=luggage-intl-ship">Luggage</option>
    <option value="search-alias=movies-tv-intl-ship">Movies & TV</option>
    <option value="search-alias=music-intl-ship">Music, CDs & Vinyl</option>
    <option value="search-alias=pets-intl-ship">Pet Supplies</option>
    <option value="search-alias=software-intl-ship">Software</option>
    <option value="search-alias=sporting-intl-ship">Sports & Outdoors</option>
    <option value="search-alias=tools-intl-ship">Tools & Home Improvement</option>
    <option value="search-alias=toys-and-games-intl-ship">Toys & Games</option>
    <option value="search-alias=videogames-intl-ship">Video Games</option>
</select>

If you translate the above steps into Python code it boils down to the following code:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

# Trying to click on HTML element with id="searchDropdownBox"
from selenium.webdriver.common.by import By

search_dropdown_box = browser.find_element(by=By.ID, value='searchDropdownBox')
search_dropdown_box.click()

Handle exceptions

When you run the code above, it results in an ElementNotInteractableException:

Traceback (most recent call last):
  File "/Users/robertsgreibers/projects/pythonic.me/amazon.py", line 11, in <module>
    search_dropdown_box.click()
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 710, in _execute
    return self._parent.execute(command, params)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element <select id="searchDropdownBox" class="nav-search-dropdown searchSelect nav-progressive-attrubute nav-progressive-search-dropdown" name="url"> could not be scrolled into view
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:181:5
ElementNotInteractableError@chrome://remote/content/shared/webdriver/Errors.jsm:291:5
webdriverClickElement@chrome://remote/content/marionette/interaction.js:156:11
interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:200:24
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:91:31

I did try going around the exception mentioned above in multiple different ways…

…But my best guess is Amazon is trying to hide this element and does not want automated tools to be able to locate the element.

A quick Google search will give you a couple of ideas of what could be wrong.

If you search for ElementNotInteractableException it will suggest that the element might not be visible on the page when you try to click it.

And you have to use Selenium’s WebDriverWait to wait for the element before you execute the click. But even that did not work; see the example below:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

# Trying to click on HTML element with id="searchDropdownBox"
from selenium.webdriver.common.by import By

# search_dropdown_box = browser.find_element(by=By.ID, value='searchDropdownBox')
# search_dropdown_box.click()

# Trying to wait for the element

import datetime
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(
    driver=browser,
    timeout=15, # Maximum wait time
)

print(f'wait (before), time: {datetime.datetime.now()}')

search_dropdown_box = wait.until(
    method=EC.visibility_of_element_located(
        locator=(
            By.ID,
            'searchDropdownBox'
        )
    )
)

print(f'wait (after), time: {datetime.datetime.now()}')

search_dropdown_box.click()

But even the code from above resulted in the following output:

wait (before), time: 2021-12-14 23:01:32.428485
Traceback (most recent call last):
  File "/Users/robertsgreibers/projects/pythonic.me/amazon.py", line 27, in <module>
    search_dropdown_box = wait.until(
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 89, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 

Use search bar instead

Instead of trying to select a specific product category, I’d suggest simply using the search box.

And searching for specific products that fit the category of products you’re looking to scrape from Amazon.

Depending on your situation, it might not even matter how you categorize the search; you could come up with your own list of product categories, outside of Amazon.

Then use your own list of product names and directly search for them in the Amazon search box.

That’s exactly what I decided to do.

Let’s say you’re looking to scrape information about different laptops available on Amazon. The first type of laptop you’re going to be looking for is a Macbook.

Now, all you need to do is figure out how to select the search box HTML element and send the string value macbook to it.

Find search box with Selenium

Remove commented code from the code block above and go back to the start. This time use breakpoint() to stop code execution in the middle of the script.

Using breakpoint() will let you execute possible “guesses” to find HTML elements with Selenium.

And breakpoint() lets you do it without actually stopping the script and running the whole execution process from the start.

Here’s the code example with breakpoint():

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()


from selenium.webdriver.common.by import By

breakpoint()

Once the script from above stops, you’ll be able to type your code into a Python console.

It’s like you’re stopping in the middle of the script execution and can manually execute whatever code you want to execute.

It’s very useful once your Selenium scripts become more complicated over time and have more than 5 steps in them.

At that point, you definitely don’t want to re-run the whole script over and over again just to look for a new HTML element.

Use Python breakpoint()

With a stop at the breakpoint, go to Firefox and inspect Amazon’s search box HTML element. Look for a possible way to select it.

Inspect Amazon’s search box HTML element
Amazon’s search box HTML element

Once you figure out the specifics of the HTML element, you just need to select it with Selenium. Very easy to do with breakpoint().

Use the following code to create a search_box variable and apply the .send_keys() method to type text into the search box.

search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')
Python code execution with breakpoint()
Selenium .send_keys() method writes text in Amazon search box

Click on “search” button with Selenium

In a very similar manner, for the next step you want to find a way to click on the search button with the magnifying-glass icon.

Apply the same process as before: right-click on the magnifying-glass icon (button) and inspect the HTML element. (see the screenshot below)

Inspect search submit button

Again, you’re looking for a way to click on the element, so you need a CSS class or ID of an HTML element that you can use to ALWAYS find the element; it shouldn’t change over time.

If you’re not familiar with CSS selectors, I’d recommend taking a look at CSS Selectors references online.

Here’s a good starting point, but you can also just Google “CSS selectors explained”.

Keep in mind some CSS classes can be dynamic and change over time, IDs usually won’t change.

Find search button HTML

At the time of writing this post, the code for the magnifying-glass icon looks like the HTML code block below.

<input id="nav-search-submit-button" type="submit" class="nav-input nav-progressive-attribute" value="Go" tabindex="0">

It’s pretty obvious you can use the ID of the element to find it with Selenium, and it will work as long as Amazon keeps id="nav-search-submit-button" as the ID for the magnifying-glass icon.

See the code below to find the id="nav-search-submit-button" HTML element with Selenium and click on it.

search_submit_button = browser.find_element(by=By.ID, value='nav-search-submit-button')
search_submit_button.click()

As I’ve mentioned above, I’d suggest you inspect HTML elements with Firefox first and try clicking on them with Selenium while you’re in a “breakpoint”.

This way you won’t have to start all over again if you make a mistake in your code, and you can try multiple combinations to find an HTML element with Selenium.

See what works and put only what works into your amazon.py file. (see the screenshot below)

Click the search button with Selenium

Once you’re able to reach the search results page with all the Macbook laptop results while still in the “breakpoint” state (see the screenshot below)

…You’re ready to move your code to the amazon.py file and use it to develop the next steps.

Search results page for "Macbook" loaded with Selenium

After you implement all the steps from above, your amazon.py file content should be very similar to the Python code given below.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')

search_submit_button = browser.find_element(by=By.ID, value='nav-search-submit-button')
search_submit_button.click()
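If you later want to gather data for several laptop types, not just Macbooks, you can wrap the two search steps in a small helper and loop over your own list of search terms. This is just a sketch building on the script above: the search_amazon helper and the search_terms list are my own, and the 'id' strings are simply the literal values behind Selenium's By.ID constant.

```python
# Hypothetical helper: type a search term into the Amazon search box and submit.
# by='id' is the literal string behind selenium's By.ID constant.
def search_amazon(browser, term):
    search_box = browser.find_element(by='id', value='twotabsearchtextbox')
    search_box.clear()                 # wipe out the previous search term
    search_box.send_keys(term)
    browser.find_element(by='id', value='nav-search-submit-button').click()

# My own example list of laptop types to search for, one after another:
search_terms = ['macbook', 'thinkpad', 'chromebook']
```

Each call to search_amazon(browser, term) loads a fresh results page you can scrape before moving on to the next term.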

Collect product data with Selenium

Alright, now to the most exciting part.

In this section, you’ll find out how to scrape information about each Amazon product on the search results page.

Your goal here is to inspect a single product HTML to find ways to select a product with Selenium.

Inspect a single product

Use Python/Selenium code from previous steps and run it with a breakpoint() at the last line.

Once you’re on a search results page at the breakpoint() stop, go find the first Macbook product in the list.

Once you find it, right-click on it to inspect the element. (It’s the same process as before: inspect the HTML element.)

Inspect the first Macbook in the list of search results

At the time of writing this post, it seems like each search result product is using a CSS class called s-result-item; you can already take note of that.

But as I have mentioned before – don’t blindly follow the steps here.

Websites like amazon.com change their code over time.

You might need to re-check the search results page and see if it’s still the same CSS class they’re using.

HTML for the first item seems to be using the "s-result-item" CSS class

If you mouse over the second <div>...</div> element with s-result-item CSS class, you’ll notice the second Macbook laptop being highlighted in the search results page list.

If that’s still the case for you – you can be sure s-result-item CSS class is still used for each product. (see the screenshot below)

HTML for the second item seems to be also using the "s-result-item" CSS class

Find all products with Selenium

Take advantage of the breakpoint() stop and look for all the products using the Python console before typing code directly into the amazon.py file.

If for some reason s-result-item is not the way to do it, you’ll see a Selenium exception, but you’ll also be able to quickly try a different CSS class and find out what works.

result_items = browser.find_elements(by=By.CSS_SELECTOR, value='.s-result-item')
print(len(result_items))
Find all "s-result-item" HTML elements with Selenium

As you can see from the screenshot above, currently I have 27 possible products found.

Also, notice what I’m using here:

  • browser.find_elements(by=By.CSS_SELECTOR, value='.s-result-item')
    (plural elements instead of singular element)

Instead of:

  • browser.find_element(by=By.CSS_SELECTOR, value='.s-result-item')

Plural because at first you’re going to figure out how to extract information from the first product, but later you’ll apply the same extraction process to other products.

Meaning, you’ll apply a for loop to the result_items list and extract information about each product on the search results page.
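To give you an idea of where this is heading, here's a sketch of such an extraction step. The extract_product helper is my own name, the .s-image child element is covered in the next sections, and by='css selector' is just the literal string behind Selenium's By.CSS_SELECTOR constant:

```python
# Hypothetical helper: pull the image URL and title out of one
# '.s-result-item' element. Returns None when the expected child element
# is missing (some '.s-result-item' blocks are ads or separators).
def extract_product(item):
    try:
        img = item.find_element(by='css selector', value='.s-image')
    except Exception:  # selenium raises NoSuchElementException here
        return None
    return {
        'image_url': img.get_attribute('src'),
        'title': img.get_attribute('alt'),
    }

# The eventual loop boils down to something like:
# products = [p for p in map(extract_product, result_items) if p]
```

Treat the selector as a snapshot in time; re-check it against Amazon's current markup before relying on it.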

Inspect product image

Depending on your Amazon product collection goals you might be interested in figuring out the URL of a product image and saving that for later use.

Before you can scrape a URL of a product image with Selenium for each product, you need to find a way to access a product image URL for a single product.

Let’s start by inspecting the product image for the first product.

Inspect Amazon product image

With a little bit of digging, you’ll find out there’s an <img class="s-image"> element inside each search result product. (see the screenshot and amazon.com source code below)

At least that’s the case at the time of writing this post – as I’ve mentioned above, amazon.com source code can change over time.

And if you can’t find the same type of elements and CSS classes you see here in this post, you need to do your own investigation with Firefox’s Inspect Element tool.

Amazon product <img> element seems to be using “s-image” CSS class
<img class="s-image" src="https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY545_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/71TPda7cwUL._AC_UY654_QL65_.jpg 3x" alt="2020 Apple MacBook Air Laptop: Apple M1 Chip, 13” Retina Display, 8GB RAM, 256GB SSD Storage, Backlit Keyboard, FaceTime H..." data-image-index="3" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">

Find product image with Selenium

Now that you’re familiar with a product image HTML element, all you need is to figure out how you can find such an element with Selenium. Luckily by this time, you already know the process.

Check the first search result

While still in a breakpoint() Python console stop, create a new variable for the first product first_product = result_items[0]

(Let’s take the first item from the search results list)

Use '.s-image-fixed-height .s-image' CSS selector to look for a product image element inside the product element. (see code example below)

first_product = result_items[0]
product_img = first_product.find_element(by=By.CSS_SELECTOR, value='.s-image-fixed-height .s-image')
Trying to find the first search result image element
*** selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .s-image-fixed-height .s-image
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:181:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:393:5
element.find/</<@chrome://remote/content/marionette/element.js:299:16

selenium.common.exceptions.NoSuchElementException is an exception you’ll probably see when applying the code above.

If you think about it – what could actually be wrong with your approach? (Assuming the CSS selector you used was correct)

The only part that could be wrong is the first product HTML.

Before you can make any other assumptions about what could be wrong I’d recommend you check the inner HTML for a product element. (see code example below)

first_product = result_items[0]
first_product.get_attribute('innerHTML')

The HTML below is the first search result’s code.

If you take a closer look, you’ll notice there’s an HTML element with the text Filter by display size, which seems kind of odd.

The first search result shouldn’t contain such text – it should be about a MacBook laptop instead.

\n  \n\n\n\n\n\n\n\n
<div data-uuid="f4a93a6e-6f87-4f5b-afd5-2ded8a814873" class="s-widget-container s-spacing-medium s-widget-container-height-medium">
   \n    <!-- BEGIN CardWidgetLayout widget-id: loom-desktop-top-slot_NWS_ATVPDKIKX0DER_BSP_macbook_0 -->
   <div class="celwidget pd_rd_w-JQAOE pf_rd_p-0b4fe531-c123-435e-ac22-f6a1bb4bce0c pf_rd_r-RJZ6R9KXDC816V1C6D8E pd_rd_r-d5d1c413-eefc-4d39-89d6-931205b4c692 pd_rd_wg-josba c-f" cel_widget_id="text-navigation_loom-desktop-top-slot_4" data-csa-c-content-id="amzn1.sym.0b4fe531-c123-435e-ac22-f6a1bb4bce0c" data-csa-c-slot-id="loom-desktop-top-slot_NWS_ATVPDKIKX0DER_BSP_macbook_0-5" data-csa-c-type="widget" data-csa-c-painter="text-navigation-cards" data-csa-c-id="pej5o0-lqdh3f-88ecru-k8aiz0" data-cel-widget="text-navigation_loom-desktop-top-slot_4">
      <script>if(window.mix_csa){window.mix_csa(\'[cel_widget_id="text-navigation_loom-desktop-top-slot_4"]\')(\'mark\', \'bb\')}</script>\n<script>if(window.uet){window.uet(\'bb\',\'text-navigation_loom-desktop-top-slot_4\',{wb: 1})}</script>\n
      <style>._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavSubtitle__3QVy9{background-color:inherit}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd{margin-left:0!important;width:auto!important}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd:last-child{margin-right:0}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd>a{-webkit-box-pack:center;-ms-flex-pack:center;background-color:#f0f0f0;border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);color:#111!important;display:-webkit-box;display:-ms-flexbox;display:flex;justify-content:center;letter-spacing:0;margin:1px 8px 1px 2px;min-width:44px;padding:9px;text-align:left}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd>a:hover{box-shadow:0 0 0 1px #111;text-decoration:none}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd ._text-navigation_style-desktop_colorPillBox__7H8CV{-webkit-box-pack:center;-ms-flex-pack:center;background-color:#f0f0f0;border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);color:#111!important;display:-webkit-box;display:-ms-flexbox;display:flex;justify-content:center;letter-spacing:0;line-height:21px;margin:1px 8px 1px 2px;min-width:44px}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavPill__nXfcd ._text-navigation_style-desktop_colorPillBox__7H8CV:hover{box-shadow:0 0 0 1px #111;text-decoration:none}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavCarouselButton__8h7r-{background:#fff;border:.15rem solid #ddd;display:-webkit-box;display:-ms-flexbox;display:flex;height:40px;padding-left:5px;padding-top:12px;position:absolute;top:0;width:25px;z-index:1}._text-navigation_style-desktop_textnavCard__1JK2U 
._text-navigation_style-desktop_textnavCarouselButton__8h7r-._text-navigation_style-desktop_textnavCarouselNextButton__2bOj2{right:0}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavCarouselButton__8h7r-._text-navigation_style-desktop_textnavCarouselPrevButton__2svZ4{left:0}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavCarousel__35v2t{margin-left:0;margin-right:0}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavColorpillText__3etgW{display:-webkit-box;display:-ms-flexbox;display:flex;letter-spacing:0;padding:8px}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavColorpillTextBox__2gzAQ{border-radius:8px 0 0 8px;display:-webkit-box;display:-ms-flexbox;display:flex;height:37px;width:32px}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavColorpill__1ceWp{border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);display:-webkit-box;display:-ms-flexbox;display:flex;height:32px;margin:1px 8px 1px 2px;width:44px}._text-navigation_style-desktop_textnavCard__1JK2U ._text-navigation_style-desktop_textnavColorpill__1ceWp:hover{box-shadow:0 0 0 1px #111;text-decoration:none}._text-navigation_style-desktop_textnavCard__1JK2U._text-navigation_style-desktop_sangria__3FNzi ._text-navigation_style-desktop_textnavPill__nXfcd>a{background-color:#f4f4f4;padding:8px}._text-navigation_style-desktop_textnavCard__1JK2U._text-navigation_style-desktop_sangria__3FNzi ._text-navigation_style-desktop_textnavPill__nXfcd ._text-navigation_style-desktop_colorPillBox__7H8CV{background-color:#f4f4f4}\n._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavPill__2aBkm{margin-left:0!important;width:auto!important}._text-navigation_style-mobile_textnavCard__1Y-YL 
._text-navigation_style-mobile_textnavPill__2aBkm:last-child{margin-right:0}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavPill__2aBkm ._text-navigation_style-mobile_textnavLinkPill__23RWd{-webkit-box-pack:center;-ms-flex-pack:center;background-color:#f0f0f0;border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);display:-webkit-box;display:-ms-flexbox;display:flex;justify-content:center;margin:1px 8px 0 1px;min-width:44px;padding:8px}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavPill__2aBkm ._text-navigation_style-mobile_colorPillBox__1YPU-{-webkit-box-pack:center;-ms-flex-pack:center;background-color:#f0f0f0;border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);color:#111!important;display:-webkit-box;display:-ms-flexbox;display:flex;justify-content:center;margin:1px 8px 0 1px;min-width:44px}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavCarousel__2uLYI{background-color:inherit}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavColorpillText__NJi_Z{display:-webkit-box;display:-ms-flexbox;display:flex;font-size:12px;letter-spacing:0;line-height:16px;padding:8px}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavColorpillTextBox__2WDVj{border-radius:8px 0 0 8px;display:-webkit-box;display:-ms-flexbox;display:flex;height:32px;width:32px}._text-navigation_style-mobile_textnavCard__1Y-YL ._text-navigation_style-mobile_textnavColorpill__apbig{border-radius:8px;box-shadow:0 1px 2px 0 rgba(15,17,17,.2);display:-webkit-box;display:-ms-flexbox;display:flex;height:32px;margin:1px 8px 0 1px;width:44px}._text-navigation_style-mobile_textnavCard__1Y-YL._text-navigation_style-mobile_sangria__L7u8_ ._text-navigation_style-mobile_textnavPill__2aBkm 
._text-navigation_style-mobile_colorPillBox__1YPU-,._text-navigation_style-mobile_textnavCard__1Y-YL._text-navigation_style-mobile_sangria__L7u8_ ._text-navigation_style-mobile_textnavPill__2aBkm ._text-navigation_style-mobile_textnavLinkPill__23RWd{background-color:#f4f4f4}</style>
      \n<!--CardsClient-->
      <div class="_text-navigation_style-desktop_sangria__3FNzi _text-navigation_style-desktop_textnavCard__1JK2U" style="width:100%" id="CardInstancet9LOwgvvCsqbU5Lrf_yjvQ" data-card-metrics-id="text-navigation_loom-desktop-top-slot_4">
         <style>._text-navigation_style-desktop_textnavCarousel__35v2t .a-carousel-card {visibility: visible !important; }</style>
         <div class="sg-row">
            <div class="sg-col sg-col-8-of-12 sg-col-12-of-16 sg-col-16-of-20">
               <div class="sg-col-inner">
                  <div class="a-section a-spacing-top-">
                     <div class="a-section a-spacing-small">
                        <div class="a-section a-spacing-none s-text-uppercase"><span class="a-size-medium-plus a-color-base">Filter by display size</span></div>
                        <div class="a-section a-spacing-none"></div>
                     </div>
                     <span data-component-type="s-searchgrid-carousel" class="rush-component" data-component-props="{"name":"loom-desktop-top-slot_NWS_ATVPDKIKX0DER_BSP_macbook_0"}" data-component-id="30">
                        <div data-a-carousel-options="{"circular":false,"maintain_state":false,"show_partial_next":false,"name":"loom-desktop-top-slot_NWS_ATVPDKIKX0DER_BSP_macbook_0","substractGotoPageButtonWidth":"28"}" data-a-display-strategy="searchgridvariablewidth" data-a-transition-strategy="s-carousel-searchgridvariablewidth" data-a-ajax-strategy="none" data-a-class="desktop" class="a-begin a-carousel-container a-carousel-display-searchgridvariablewidth a-carousel-transition-s-carousel-searchgridvariablewidth _text-navigation_style-desktop_textnavCarousel__35v2t a-carousel-initialized">
                           <input type="hidden" autocomplete="on" class="a-carousel-firstvisibleitem"><a aria-label="See previous" class="a-link-normal a-carousel-goto-prevpage aok-float-left aok-hidden _text-navigation_style-desktop_textnavCarouselButton__8h7r- _text-navigation_style-desktop_textnavCarouselPrevButton__2svZ4" href="#" aria-disabled="true"><i class="a-icon a-icon-previous aok-align-center" role="presentation"></i></a>
                           <div class="a-row a-carousel-controls a-carousel-row">
                              <div class="a-carousel-row-inner">
                                 <div class="a-carousel-col a-carousel-center">
                                    <div class="a-carousel-viewport" id="anonCarousel1">
                                       <ol class="a-carousel" role="list">
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="1" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_0?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A7817234011&nav_sdd=aps&pd_rd_i=NWS_ATVPDKIKX0DER_BSP_macbook_0_0&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">17 Inches & Above</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="2" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_1?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A2242801011&nav_sdd=aps&pd_rd_i=NWS_ATVPDKIKX0DER_BSP_macbook_0_1&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">16 to 16.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="3" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_2?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A2423841011&nav_sdd=aps&pd_rd_i=NWS_ATVPDKIKX0DER_BSP_macbook_0_2&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">15 to 15.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="4" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_3?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A2423840011&nav_sdd=aps&pd_rd_i=NWS_ATVPDKIKX0DER_BSP_macbook_0_3&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">14 to 14.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="5" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_4?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A3545275011&nav_sdd=aps&pd_rd_i=NWS_ATVPDKIKX0DER_BSP_macbook_0_4&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">13 to 13.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="6" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_5?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A13580784011&nav_sdd=aps&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">12 to 12.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height:42px" role="listitem" aria-setsize="8" aria-posinset="7" aria-hidden="false">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_6?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A13580785011&nav_sdd=aps&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">11 to 11.9 Inches</a></div>
                                          </li>
                                          <li class="a-carousel-card _text-navigation_style-desktop_textnavPill__nXfcd" style="height: 42px; visibility: hidden;" role="listitem" aria-setsize="8" aria-posinset="8" aria-hidden="true">
                                             <div class="a-section _text-navigation_style-desktop_textnavPill__nXfcd"><a class="a-size-base a-color-base a-link-normal" href="https://www.amazon.com/s/ref=sxts_sxts_ref_scx_alster_7?_encoding=UTF8&k=macbook&rh=n%253A172282%252Cp_n_size_browse-bin%253A13580786011&nav_sdd=aps&qid=1640279285&pd_rd_w=JQAOE&pf_rd_p=0b4fe531-c123-435e-ac22-f6a1bb4bce0c&pf_rd_r=RJZ6R9KXDC816V1C6D8E&pd_rd_r=d5d1c413-eefc-4d39-89d6-931205b4c692&pd_rd_wg=josba">11 Inches & Under</a></div>
                                          </li>
                                       </ol>
                                    </div>
                                 </div>
                              </div>
                           </div>
                           <a aria-label="See more" class="a-link-normal a-carousel-goto-nextpage aok-float-left _text-navigation_style-desktop_textnavCarouselButton__8h7r- _text-navigation_style-desktop_textnavCarouselNextButton__2bOj2" href="#" aria-disabled="false"><i class="a-icon a-icon-next aok-align-center" role="presentation"></i></a><span class="a-end aok-hidden"></span>
                        </div>
                     </span>
                  </div>
               </div>
            </div>
         </div>
      </div>
      <script>if(window.mix_csa){window.mix_csa(\'[cel_widget_id="text-navigation_loom-desktop-top-slot_4"]\')(\'mark\', \'be\')}</script>\n<script>if(window.uet){window.uet(\'be\',\'text-navigation_loom-desktop-top-slot_4\',{wb: 1})}</script>\n<script>if(window.mixTimeout){window.mixTimeout(\'text-navigation\', \'CardInstancet9LOwgvvCsqbU5Lrf_yjvQ\', 90000)};\nP.when(\'mix:@amzn/mix.client-runtime\', \'mix:text-navigation\').execute(function (runtime, cardModule) {runtime.registerCardFactory(\'CardInstancet9LOwgvvCsqbU5Lrf_yjvQ\', cardModule).then(function(){if(window.mix_csa){window.mix_csa(\'[cel_widget_id="text-navigation_loom-desktop-top-slot_4"]\')(\'mark\', \'functional\')}if(window.uex){window.uex(\'ld\',\'text-navigation_loom-desktop-top-slot_4\',{wb: 1})}});});\n</script>\n
   </div>
   <!-- END CardWidgetLayout widget-id: loom-desktop-top-slot_NWS_ATVPDKIKX0DER_BSP_macbook_0 -->\n
</div>
\n\n\n

If you take one more look at the search results page, you’ll notice the Filter by display size text is used inside the filter element – that filter widget is what got picked up as the first “search result”.

Filter by display size – amazon.com search results page

Well, that’s not really the product you’re looking for. You’ll have to somehow skip the first search result in your scraping process. (Will be explained later in the post)

Check the second search result

Okay, if it didn’t work with the first item of the result_items list, let’s try the second one.

Create a new variable for the second product second_product = result_items[1] and apply the same process as with the first result.

Once you see the product_img element is found (no exception thrown), call .get_attribute('outerHTML') on it

And you’ll be able to see the HTML for the product_img element. (use code example & see the screenshot below)

second_product = result_items[1]
product_img = second_product.find_element(by=By.CSS_SELECTOR, value='.s-image-fixed-height .s-image')
product_img.get_attribute('outerHTML')
Printing Selenium element attributes

Notice the output of product_img.get_attribute('outerHTML') – there are two attributes containing information about a product image src & srcset:

  • src="https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY218_.jpg"
  • srcset="https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY500_QL65_.jpg 2.2935x"

Obviously, in most cases, you want to scrape the highest resolution image. In the case above, that would be the URL followed by the 2.2935x density descriptor.

For such a case, you’ll need to use regex. See the next steps to extract the highest quality image using regex.
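Before the regex steps, you can get a feel for the structure by splitting the attribute in a plain Python session – the srcset value below is a shortened, hypothetical sample, not a real Amazon attribute:

```python
# Quick structure check: srcset is a comma-separated list of
# "URL densityDescriptor" pairs. The sample below is a shortened,
# hypothetical stand-in for a real Amazon srcset value.
srcset = (
    'https://m.media-amazon.com/images/I/example._AC_UY218_.jpg 1x, '
    'https://m.media-amazon.com/images/I/example._AC_UY327_QL65_.jpg 1.5x, '
    'https://m.media-amazon.com/images/I/example._AC_UY436_QL65_.jpg 2x, '
    'https://m.media-amazon.com/images/I/example._AC_UY500_QL65_.jpg 2.2935x'
)

candidates = srcset.split(', ')          # one "URL density" pair per entry
highest = candidates[-1].split(' ')[0]   # last pair holds the largest image

print(highest)
# → https://m.media-amazon.com/images/I/example._AC_UY500_QL65_.jpg
```

The last entry always carries the highest density descriptor, which is exactly what the regex in the next section is going to extract.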

<img class="s-image" src="https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/31fD+NPpVqL._AC_UY500_QL65_.jpg 2.2935x" alt="Apple 15.4in MacBook Pro Laptop Computer with Retina Display MGXC2LL/A - Intel Core i7 2.5GHz, 16GB RAM, 256GB SSD (Renewed)" data-image-index="1" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">
Collect product image URLs

You can’t really use just a single product URL as a reference for the development of any kind of code.

Let’s collect all product URLs from the first search results page and find the best way to extract the highest quality product image URL with regex.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')

search_submit_button = browser.find_element(
    by=By.ID,
    value='nav-search-submit-button'
)
search_submit_button.click()

result_items = browser.find_elements(
    by=By.CSS_SELECTOR,
    value='.s-result-item'
)

for product in result_items:
    try:
        product_img = product.find_element(
            by=By.CSS_SELECTOR,
            value='.s-image-fixed-height .s-image'
        )
    except NoSuchElementException:
        continue

    srcset = product_img.get_attribute(name='srcset')

    print(srcset)

Notice small adjustments in the code above.

Since you already discovered that the first '.s-result-item' search result is not really a product, you’ll need to skip such elements.

That’s really the purpose of the try/except block – you try to look for an image element inside the search result, and if there isn’t one, NoSuchElementException will be raised.

If NoSuchElementException is raised, you know you can skip that iteration, which is exactly what the continue keyword in Python does.

Product image URLs (srcset attribute)
Extract product image URL with regex

For simplicity, we’re going to take an example from one of the previous posts I wrote about regex.

It really is a great tool to parse log files and any other type of text content.

I’m going to take a Regex class from my previous post and with small adjustments apply the same process here.

import re

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class Regex:
    _srcset = r'Source:(.+) Destination:' # TODO: Change this

    def extract(self, text: str, regex: str):
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(text)

        try:
            return match.group(1)
        except (AttributeError, IndexError):
            return
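Before swapping the pattern, it’s worth seeing how extract() behaves. Here’s a minimal, standalone check – the log line below is made up purely to match the placeholder Source:/Destination: pattern, and the except clause is written as a tuple so both exception types are actually caught:

```python
import re


class Regex:
    _srcset = r'Source:(.+) Destination:'  # placeholder pattern from the log-parsing post

    def extract(self, text: str, regex: str):
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(text)

        try:
            return match.group(1)
        except (AttributeError, IndexError):
            # AttributeError: no match at all; IndexError: no such group
            return


regex = Regex()

# Made-up log line; group(1) keeps the leading space after "Source:"
print(regex.extract(text='Source: /var/log/app.log Destination: /tmp/out',
                    regex=Regex._srcset))
# →  /var/log/app.log

# No match -> extract() returns None instead of raising
print(regex.extract(text='no source here', regex=Regex._srcset))
# → None
```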

Notice _srcset = r'Source:(.+) Destination:' is the variable you’re going to use to hold the regular expression string for srcset attribute parsing.

Before you can move any further, you need to figure out the regex pattern to extract only the highest quality image URL.

It can be achieved in two ways:

  • First, you can try to tweak the regex pattern based on your previous regex experience until you find a pattern that extracts the high-quality product URL. (not recommended)
  • Second, just use an online tool for regex pattern development and tweak it from there until you see that the pattern you come up with matches the text you want to extract. (recommended – see an example below)
Test string #1

If you take a look at the full string of srcset attribute you can kind of already see the pattern. Every one of them ends with the highest quality image URL (the one you want to extract).

Test string #2

All you need to figure out is the regex pattern to detect it. With a bit of tweaking I came up with the following pattern: .*x, (.+) .*x

  • .*x, (1st part – greedily consumes everything up to and including the last “x, ”, i.e. the second-to-last density descriptor)
  • (.+) (2nd part – captures the highest quality URL; the parentheses make it a group)
  • .*x (3rd part – matches the final density descriptor, which ends the srcset string without a trailing comma)
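Before wiring the pattern into amazon.py, you can verify it in a plain Python session – the srcset string below is a shortened, made-up sample:

```python
import re

# The pattern developed above: capture the URL that sits between the
# second-to-last "x, " and the final density descriptor.
pattern = r'.*x, (.+) .*x'

# Shortened, hypothetical srcset sample
srcset = ('https://img/a.jpg 1x, https://img/b.jpg 1.5x, '
          'https://img/c.jpg 2x, https://img/d.jpg 2.2935x')

match = re.search(pattern, srcset, re.IGNORECASE | re.DOTALL)
print(match.group(1))
# → https://img/d.jpg
```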

You can find a way better and more detailed explanation on regex101.com website, whenever you have a pattern, they’ll explain it on the right side of the page.

https://regex101.com/ explanation

Now you can combine everything together inside amazon.py file and run the script with a breakpoint() at the end as usual.

import re

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class Regex:
    _srcset = r'.*x, (.+) .*x'

    def extract(self, text: str, regex: str):
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(text)

        try:
            return match.group(1)
        except (AttributeError, IndexError):
            return


browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')

search_submit_button = browser.find_element(
    by=By.ID,
    value='nav-search-submit-button'
)
search_submit_button.click()

result_items = browser.find_elements(
    by=By.CSS_SELECTOR,
    value='.s-result-item'
)

regex = Regex()

for product in result_items:
    try:
        product_img = product.find_element(
            by=By.CSS_SELECTOR,
            value='.s-image-fixed-height .s-image'
        )
    except NoSuchElementException:
        continue

    srcset = product_img.get_attribute(name='srcset')

    product_img_url = regex.extract(text=srcset, regex=regex._srcset)

    print('product_img_url: ', product_img_url)

And this is the expected result, you should be able to see a high-quality image URL printed for each product on a search results page.

Print high-quality product image for all search results

Inspect product title

Using regex to extract the highest quality product image was actually the hardest part. Now back to the other details you want to extract from each product.

Obviously, you also want to have access to a product title – so that you can at least understand what it is about when looking at the collected information later.

Apply the same process as before, inspect one of the product’s titles and find a CSS class you can include in your CSS selector string for Selenium.

Inspect Amazon product title

Find product title with Selenium

The easiest way I’ve found to extract a product title is actually by using the parent element with the class s-title-instructions-style (see screenshot above).

You want to find the element with the CSS class s-title-instructions-style and then just check its .text attribute.

You don’t always have to use the exact path in the HTML tree to select an element with a Selenium CSS selector.

This shortcut only works with .text though (it collects the text of all descendant elements), not with attributes like src or href, which live on one specific element – but sometimes you can be a bit lazy.

Again, use breakpoint() at the end of your current amazon.py file…

…so that once everything that you’ve got so far is executed you’re stopped with an option to try new things in the Python console. (see the code example below)

third_product = result_items[2]
product_title = third_product.find_element(by=By.CSS_SELECTOR, value='.s-title-instructions-style')

print(product_title.text)

Notice I’m using the “third” product (List indexes in Python start from 0) from the search results this time.

As you will run the script multiple times you might get a different version of an Amazon search results page. (see the screenshot below)

Using breakpoint() to stop in the middle of a search results page

Modern websites do not really use static pages, especially sites like amazon.com – they are all dynamic.

Meaning, that the content of a page can change over time, the structure of the search results page can change over time, etc.

This time running the script gave me Editorial recommendations before the actual search results I searched for. (see the screenshot below)

That’s the reason for using the “third” search result as an example here.

Amazon ads showing up

Include product title code from above in amazon.py and run the whole script again with a breakpoint() at the end.

Your goal here is to see if you can successfully extract a product title string for multiple Amazon products.

Here’s an example of how the end of your amazon.py script should look.

for product in result_items:
    try:
        product_img = product.find_element(
            by=By.CSS_SELECTOR,
            value='.s-image-fixed-height .s-image'
        )
    except NoSuchElementException:
        continue

    srcset = product_img.get_attribute(name='srcset')

    product_img_url = regex.extract(text=srcset, regex=regex._srcset)

    product_title = product.find_element(
        by=By.CSS_SELECTOR,
        value='.s-title-instructions-style'
    )
    print('product_img_url: ', product_img_url)
    print('product_title.text', product_title.text)

Extracting title of multiple Amazon products

At this point in the tutorial, you should already be very familiar with the steps.

Inspect one of the search results and look for an HTML element containing a product URL. (see the screenshot below)

Inspect Amazon search result

With little to no effort, you’ll be able to find an HTML element with two CSS classes a-link-normal and s-underline-link-text very close to the product title element. (see the screenshot above)

At the time of writing this post, I’m able to find an HTML element containing a product link by using either of those CSS classes: s-underline-link-text or a-link-normal.

But as I’ve mentioned before – this could be different for you and you need to inspect the page yourself to see if this is still the case with a product link in Amazon search results.

Just to have higher chances of this working for most search result products, I’d recommend using a CSS selector with both of those classes.

Separate the CSS classes with a comma to match either one of them: '.s-underline-link-text, .a-link-normal'.

This post will give you a better idea of how CSS selector grouping works – very useful when you have to deal with dynamic websites like amazon.com

In a similar way as before, you want to run your current amazon.py file with a breakpoint() at the end.

Use the code example below to see if you’re still able to find a product link with the code I’ve provided.

third_product = result_items[2]
product_url = third_product.find_element(by=By.CSS_SELECTOR, value='.a-link-normal, .s-underline-link-text')

print(product_url.get_attribute(name='href'))

If not, you’ll likely need to inspect the Amazon search results page and see if something has changed over time.

Extract Amazon product link from a single product

Also, notice I’m using the third item from the search results list ( result_items[2] ) – the first two results were not actually products on the search results page.

You might run into selenium.common.exceptions.NoSuchElementException exception…

…if you’re going to apply the above code for the first two items. (we already had a similar situation in previous steps)

Traceback (most recent call last):
  File "/Users/robertsgreibers/projects/pythonic.me/amazon.py", line 59, in <module>
    product_url_element = product.find_element(
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 735, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 710, in _execute
    return self._parent.execute(command, params)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .s-underline-link-text
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:181:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:393:5
element.find/</<@chrome://remote/content/marionette/element.js:299:16

Also, you might not be able to find a link within the first two search result elements. Make sure you’re looking for an href attribute inside an actual MacBook search result.

You can apply .get_attribute(name='innerHTML') or .get_attribute(name='outerHTML') to a search result item to see its content. (already explained in previous steps)

Include the product URL code from the example below in amazon.py and run the whole script again with a breakpoint() at the end.

Your goal here is the same as before – see if you can successfully extract a product URL for multiple Amazon products.

try:
    product_url_element = product.find_element(
        by=By.CSS_SELECTOR,
        value='.a-link-normal, .s-underline-link-text'
    )
except NoSuchElementException:
    continue

product_url = product_url_element.get_attribute(name='href')

print(' ')
print(' ')
print('product_img_url: ', product_img_url)
print('product_title.text: ', product_title.text)
print('product_url: ', product_url)

See the screenshot below for an example of how the end of your amazon.py script should look.

Print Amazon product URL with Python & Selenium

Keep in mind, the code from the screenshot above goes inside the for loop iteration.

It’s still the same for loop you created a couple of steps back – you just can’t see the whole script in the screenshot.

Switch to the next page in search results

Nice – so you’ve learned to gather data about Amazon products on the first search results page.

But are you going to run the script from the above steps over and over again just to scrape only the first page of the search results?

Probably not, it’s very unlikely that in a real-life situation or test automation project you’ll only deal with such a simple situation. It’s always way more complex.

Let’s take a look at how you could improve already existing code from previous steps to extract data about multiple search results pages.

Inspect “next” button element

Same as before, you start by scrolling down to the bottom of the first search results page and inspecting the Next > button HTML element. (see the screenshot below)

Inspect Amazon’s search results page next button

As you can see from the screenshot below, it’s pretty obvious we can find the Next > button element easily using the s-pagination-next CSS class.

Amazon search results pagination next button

Find “next” button element

Apply similar steps here – you’re already familiar with the process. Use breakpoint() at the end of amazon.py – right after the for loop execution.

You need to get to the point where you have an option to try new things in the Python console. (see the code example below)

next_button = browser.find_element(by=By.CSS_SELECTOR, value='.s-pagination-next, .a-pagination .a-last')

print(next_button)

You shouldn’t have any problems finding the Next > button – it seems to be one of the elements that shouldn’t change unless you reach the last page.

But just in case, I’d recommend using the CSS selector '.s-pagination-next, .a-pagination .a-last' for the Next > button.

After running the amazon.py script a couple of times, I noticed a Next > button without the s-pagination-next CSS class.

In that case, Selenium might throw an exception about not being able to locate the element.

Find Amazon search results page next button with selenium

Scroll to the “next” button element

Okay, you’ve found the Next > button element – now you need to click on it and go to the next page.

You might be able to get away with a simple next_button.click() – but you also might not.

It really depends on how amazon.com reacts for you at the time you’re writing your scraping script.

A better approach would be to scroll to the bottom of the page so that the Next > button is visible. (Selenium .click() event can fail if the element is not in the current viewport)

from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(browser)
actions.move_to_element(next_button).perform()

The above code example is what’s usually suggested by Selenium – the “right approach”. Meaning, you’re using Selenium’s framework to perform the scroll to the element.

Which is fine – I’d generally recommend using built-in tools, because in Python most of the time these tools are really good.

This time, though, there’s an issue if you’re using the Firefox browser together with Selenium.

According to a StackOverflow post, a much better solution would be to use window.scrollTo(x, y); in combination with .execute_script()

Which is also an acceptable part of the Selenium framework. (see an example in the next steps)

See the screenshot below for what happens when you use ActionChains.

Using Selenium Actions to try scrolling to “Next >” button

*** selenium.common.exceptions.MoveTargetOutOfBoundsException: Message: (1019, 3481) is out of bounds of viewport width (1440) and height (734)
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:181:5
MoveTargetOutOfBoundsError@chrome://remote/content/shared/webdriver/Errors.jsm:371:5
dispatchPointerMove/<@chrome://remote/content/marionette/action.js:1383:13
dispatchPointerMove@chrome://remote/content/marionette/action.js:1374:10
toEvents/<@chrome://remote/content/marionette/action.js:1145:16
action.dispatchTickActions@chrome://remote/content/marionette/action.js:1055:35
action.dispatch/chainEvents<@chrome://remote/content/marionette/action.js:1023:20
action.dispatch@chrome://remote/content/marionette/action.js:1029:5
performActions@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:447:18
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:141:31

Scroll to the “next” button element with JavaScript and click it

A good alternative is to define a new string variable called scroll_to_js…

…which will contain the JavaScript mentioned above ( window.scrollTo(x, y); ) – and to apply the .click() event to next_button right after the scroll.

Notice, you can easily access an element’s X and Y coordinates using the .location attribute. (see the code example below)

scroll_to_js = f'window.scrollTo({next_button.location["x"]}, {next_button.location["y"]});'
browser.execute_script(scroll_to_js)

next_button.click()

Execute the code from above while still in the breakpoint() Python console – this time you shouldn’t have any errors. (see the screenshot below)

Scroll to the bottom of Amazon search results page

And now switch back to your browser – it should now be on the second page of the search results.

You can confirm this by scrolling down to the bottom of the search results page and checking the “highlighted” page number. (see the screenshot below)

Check browser and you’ll see second page selected

Refactor existing code for multi-page scraping

Now you have code for each step except STEP 5, which is basically repeated execution of the same code until enough data is collected.

All you need to figure out is how to refactor your existing code so it can be used over and over again without copy-pasting.

At the moment, the repeated usage of the same code is not really possible.

You’d have to copy the same code over and over again which goes totally against the DRY principle.

And if you do decide to simply copy-paste code – you’ll end up with a total mess of a project – spaghetti code.

  1. Open amazon.com
  2. Search for a specific product type
  3. Collect details about each product from the search results page
  4. Switch to the next page
  5. Repeat STEPS 3 & 4 until enough data is collected
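
The loop structure the refactoring aims for can be sketched in isolation – the helper names below ( scrape_page, go_to_next_page ) are hypothetical stand-ins, not from the tutorial, and the real versions come out of the refactoring steps:

```python
def scrape_page() -> list:
    # Hypothetical stand-in for scraping one search results page (STEP 3)
    return [{'product_title': 'MacBook Air'}]


def go_to_next_page() -> None:
    # Hypothetical stand-in for clicking the "Next >" button (STEP 4)
    pass


def scrape_all(page_count: int = 2) -> list:
    all_products = []
    for _ in range(page_count):             # STEP 5: repeat until enough data
        all_products.extend(scrape_page())  # STEP 3: collect product details
        go_to_next_page()                   # STEP 4: switch to the next page
    return all_products


print(len(scrape_all()))  # 2
```

Each helper maps to one step, which is exactly why repeating the steps becomes a simple for loop instead of copy-pasted code.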

The current state of code

Just for reference, I’ll leave the current state of our code here so that you can see the difference after refactoring.

The code below is the initial version that covers the first 4 STEPS.

import re

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class Regex:
    _srcset = r'.*x, (.+) .*x'

    def extract(self, text: str, regex: str):
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(text)

        try:
            return match.group(1)
        except (AttributeError, IndexError):
            return


browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')

search_submit_button = browser.find_element(
    by=By.ID,
    value='nav-search-submit-button'
)
search_submit_button.click()

result_items = browser.find_elements(
    by=By.CSS_SELECTOR,
    value='.s-result-item'
)

regex = Regex()

for product in result_items:
    try:
        product_img = product.find_element(
            by=By.CSS_SELECTOR,
            value='.s-image-fixed-height .s-image'
        )
    except NoSuchElementException:
        continue

    srcset = product_img.get_attribute(name='srcset')

    product_img_url = regex.extract(text=srcset, regex=regex._srcset)

    product_title = product.find_element(
        by=By.CSS_SELECTOR,
        value='.s-title-instructions-style'
    )

    try:
        product_url_element = product.find_element(
            by=By.CSS_SELECTOR,
            value='.a-link-normal, .s-underline-link-text'
        )
    except NoSuchElementException:
        continue

    product_url = product_url_element.get_attribute(name='href')

    print(' ')
    print(' ')
    print('product_img_url: ', product_img_url)
    print('product_title.text: ', product_title.text)
    print('product_url: ', product_url)


next_button = browser.find_element(
    by=By.CSS_SELECTOR,
    value='.s-pagination-next, .a-pagination .a-last'
)
scroll_to_js = f'window.scrollTo({next_button.location["x"]}, {next_button.location["y"]});'
browser.execute_script(scroll_to_js)

next_button.click()

Refactoring steps

Just to keep the post shorter, I decided to create a gallery of screenshots. The screenshots below give you the 15 STEPS you need to take to do the refactoring.

Start from the top left side ( STEP 1 ) and continue until the very last screenshot at the bottom right side ( STEP 15 ).

Also, you can find already refactored code down below. (right after a gallery of screenshots)

Refactoring result

And here’s the result you get after applying all 15 STEPS from the screenshots above. This is a good starting point to finalize product data scraping for multiple search results pages.

import re
import typing

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement


class Regex:
    _srcset = r'.*x, (.+) .*x'

    def extract(self, text: str, regex: str):
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(text)

        try:
            return match.group(1)
        except (AttributeError, IndexError):
            return


def get_search_results_elements(b: WebDriver) -> typing.List[WebElement]:
    return b.find_elements(
        by=By.CSS_SELECTOR,
        value='.s-result-item'
    )


def get_product_image_element(
    product: WebElement
) -> typing.Union[WebElement, None]:
    try:
        return product.find_element(
            by=By.CSS_SELECTOR,
            value='.s-image-fixed-height .s-image'
        )
    except NoSuchElementException:
        return


def get_product_url_element(
    product: WebElement
) -> typing.Union[WebElement, None]:
    try:
        return product.find_element(
            by=By.CSS_SELECTOR,
            value='.a-link-normal, .s-underline-link-text'
        )
    except NoSuchElementException:
        return


def get_product_image_url(product: WebElement) -> typing.Union[str, None]:
    product_img: typing.Union[WebElement, None] = (
        get_product_image_element(
            product=product
        )
    )
    if not product_img:
        return

    srcset = product_img.get_attribute(name='srcset')

    return regex.extract(
        text=srcset,
        regex=regex._srcset
    )


def get_product_url(product: WebElement) -> typing.Union[str, None]:
    product_url_element: typing.Union[WebElement, None] = (
        get_product_url_element(
            product=product
        )
    )
    if not product_url_element:
        return

    return product_url_element.get_attribute(name='href')


def get_product_title(product: WebElement) -> str:
    product_title = product.find_element(
        by=By.CSS_SELECTOR,
        value='.s-title-instructions-style'
    )
    return product_title.text


def scrape_search_results_products(
    b: WebDriver
) -> typing.List[typing.Dict[str, str]]:
    products: typing.List[typing.Dict[str, str]] = []

    for product in get_search_results_elements(b=b):
        product_url: typing.Union[str, None] = (
            get_product_url(
                product=product
            )
        )
        product_img_url: typing.Union[str, None] = (
            get_product_image_url(
                product=product
            )
        )
        if not product_url or not product_img_url:
            continue

        product_title: str = get_product_title(product=product)

        products.append({
            'product_url': product_url,
            'product_img_url': product_img_url,
            'product_title': product_title,
        })

    return products


# Open browser and go to amazon.com
browser = webdriver.Firefox()
browser.get('https://www.amazon.com/')
browser.maximize_window()

# Search for a specific product type (macbook)
search_box = browser.find_element(by=By.ID, value='twotabsearchtextbox')
search_box.send_keys('macbook')
search_submit_button = browser.find_element(
    by=By.ID,
    value='nav-search-submit-button'
)
search_submit_button.click()

# Scrape search results from the first page
regex = Regex()
products: typing.List[typing.Dict[str, str]] = (
    scrape_search_results_products(
        b=browser
    )
)
next_button = browser.find_element(  # Go to the next page
    by=By.CSS_SELECTOR,
    value='.s-pagination-next, .a-pagination .a-last'
)
scroll_to_js: str = (
    f'window.scrollTo('
    f'{next_button.location["x"]}, '
    f'{next_button.location["y"]}'
    f');'
)
browser.execute_script(scroll_to_js)
next_button.click()

# TODO: Repeat scraping on the second page

Use for loop to scrape multiple pages

Start by importing the time package (line 138) – you can use it in Python to add a little bit of delay and wait while the page loads (line 154).

Define how many search results pages you want to collect products from with scrape_page_count: int = 5 (line 140)

Define a list variable called all_products to collect each product into: all_products: typing.List[typing.Dict[str, str]] = [] (line 142)

Add code for multi-page iteration

Once you have the initial variables defined, it’s time to think about looping scrape_page_count number of times.

In Python, this can easily be done with the built-in function range().

Since there’s no real reason to use the counter variable that automatically comes along with a for loop, you can use an underscore to discard it.
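
In isolation, that loop skeleton looks like this (a browser-free sketch – the page count here is just an example):

```python
scrape_page_count: int = 3

pages_scraped = 0
for _ in range(scrape_page_count):  # "_" discards the unused loop counter
    pages_scraped += 1              # one scraping pass would happen here

print(pages_scraped)  # 3
```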

browser.refresh()

Next, with every new iteration, you want to refresh the browser.

If you skip this step, you won’t be able to collect all the products – simply because of the code structure you have so far.

We would have to go a lot deeper into refactoring with Python classes to avoid this step, but this post is already way too long.

If you want to go deeper into learning refactoring contact me personally – send me an email to roberts.greibers@gmail.com

After a clean browser refresh, your browser variable will be updated to the latest version of the page and you’ll be able to scrape new products.

    ready_state = None

    while ready_state != 'complete':
        ready_state = browser.execute_script(
            'return document.readyState;'
        )
        time.sleep(1)

Why a while loop? – you may ask.

Well, you have to somehow wait for the search results page to fully load.

Otherwise, you’ll end up with selenium.common.exceptions.NoSuchElementException – it won’t happen every time, but it might be thrown your way.

Traceback (most recent call last):
  File "/Users/robertsgreibers/projects/pythonic.me/amazon.py", line 155, in <module>
    next_button = browser.find_element(
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 1244, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "/Users/robertsgreibers/.local/share/virtualenvs/pythonic.me-GsTWWFLs/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .s-pagination-next
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:181:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:393:5
element.find/</<@chrome://remote/content/marionette/element.js:299:16

return document.readyState; is just a way to check with JavaScript whether a page is fully loaded. And of course, JavaScript can be executed with Selenium’s .execute_script() function.
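
The polling pattern itself can be sketched without a browser – here a hypothetical fake_execute_script (my own stand-in, not a Selenium API) plays the role of browser.execute_script:

```python
import time

# Fake page states a browser might report over time, oldest first
_states = ['loading', 'interactive', 'complete']


def fake_execute_script(script: str) -> str:
    # Hypothetical stand-in for browser.execute_script('return document.readyState;')
    return _states.pop(0) if len(_states) > 1 else _states[0]


ready_state = None
polls = 0
while ready_state != 'complete':
    ready_state = fake_execute_script('return document.readyState;')
    polls += 1
    time.sleep(0.01)  # the tutorial uses time.sleep(1) against a real page

print(polls, ready_state)  # 3 complete
```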

Here’s the whole code section from the screenshot above.

import time

scrape_page_count: int = 5

all_products: typing.List[typing.Dict[str, str]] = []

for _ in range(scrape_page_count):

    browser.refresh()

    ready_state = None

    while ready_state != 'complete':
        ready_state = browser.execute_script(
            'return document.readyState;'
        )
        time.sleep(1)

And now to the second part of the for loop.

Right after you know the page is fully loaded (after lines 150-154) you need to collect all search results products from the page you’re on.

In the first iteration that would be the first search results page, in the second iteration – the second page, etc.

For product collecting, use the scrape_search_results_products function in combination with Python’s list .extend() method.

Using .extend() instead of .append() lets you collect all the products in one flat list.

If you used .append() here (line 156), you would end up with a list of lists, because scrape_search_results_products returns a list of products.
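
Here’s a quick browser-free illustration of the difference (the page data is made up):

```python
page_1 = [{'product_title': 'MacBook Air'}]
page_2 = [{'product_title': 'MacBook Pro'}, {'product_title': 'MacBook case'}]

flat: list = []
flat.extend(page_1)   # adds each product individually
flat.extend(page_2)
print(len(flat))      # 3 - one flat list of products

nested: list = []
nested.append(page_1)  # adds the whole page list as one item
nested.append(page_2)
print(len(nested))     # 2 - a list of lists
```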

And with the part that’s responsible for scrolling to the Next button (lines 161-171), you’re already familiar from the refactoring steps.

The only thing you need to add here is time.sleep(1) – I’d recommend it because otherwise this section of code executes way too fast.

If the click on the Next > button happens before the button is visible, the whole process might fail with an exception.

Here’s the code from the screenshot above.

    all_products.extend(
        scrape_search_results_products(
            b=browser
        )
    )
    next_button = browser.find_element(
        by=By.CSS_SELECTOR,
        value='.s-pagination-next, .a-pagination .a-last'
    )
    scroll_to_js: str = (
        f'window.scrollTo('
        f'{next_button.location["x"]}, '
        f'{next_button.location["y"]}'
        f');'
    )
    browser.execute_script(scroll_to_js)

    time.sleep(1)

    next_button.click()

In the end, you can close the browser (line 177) and print out the number of products you collected from all pages.

When I configured the scraping script to collect products from 5 pages I ended up with 100 collected products.

Let me know in the comments how many products you collected and from how many pages.

Closing Selenium browser and print the amount of products found

Just for reference, here’s the code from the explanation above. Of course, for this to work you’ll need code from all the previous steps.

import time

scrape_page_count: int = 5

all_products: typing.List[typing.Dict[str, str]] = []

for _ in range(scrape_page_count):

    browser.refresh()

    ready_state = None

    while ready_state != 'complete':
        ready_state = browser.execute_script(
            'return document.readyState;'
        )
        time.sleep(1)

    all_products.extend(
        scrape_search_results_products(
            b=browser
        )
    )
    next_button = browser.find_element(
        by=By.CSS_SELECTOR,
        value='.s-pagination-next, .a-pagination .a-last'
    )
    scroll_to_js: str = (
        f'window.scrollTo('
        f'{next_button.location["x"]}, '
        f'{next_button.location["y"]}'
        f');'
    )
    browser.execute_script(scroll_to_js)

    time.sleep(1)

    next_button.click()

browser.close()

print('all_products: ', len(all_products))

Conclusion & Save collected products

The very last step in a scraping process is usually saving the collected data.

Of course, there are better options than writing to a text file – for example, you could create a small SQL database and save your results there.
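
If you do want the SQL route, a minimal sqlite3 sketch could look like this – the table name and columns are my own choice, and I’m using an in-memory database here (pass a filename like 'collected_products.db' to keep the data on disk):

```python
import sqlite3

# Hypothetical product data, in the same shape all_products uses
all_products = [
    {
        'product_title': 'MacBook Air',
        'product_img_url': 'https://example.com/air.jpg',
        'product_url': 'https://example.com/air',
    },
]

conn = sqlite3.connect(':memory:')  # use 'collected_products.db' to persist
conn.execute(
    'CREATE TABLE IF NOT EXISTS products (title TEXT, img_url TEXT, url TEXT)'
)
# Named placeholders let sqlite3 pull values straight from each dict
conn.executemany(
    'INSERT INTO products VALUES (:product_title, :product_img_url, :product_url)',
    all_products,
)
conn.commit()

row_count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
print(row_count)  # 1
conn.close()
```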

But for the sake of keeping this post shorter, we’re simply going to create a new text file and write the results straight into it.

The code from the screenshot below should be pretty straightforward by this point.

You’re just opening a file with Python’s built-in open() function and a with statement.

Write collected Amazon products data to a text file

Here’s the code from the screenshot above.

with open('collected_products.txt', 'w') as f:
    for product in all_products:
        f.write(
            '\n\n'
            f'Product (Title): {product["product_title"]} \n'
            f'Product (IMG): {product["product_img_url"]} \n'
            f'Product (URL): {product["product_url"]} \n'
        )

And this is the result you should see in your collected_products.txt file after you let the amazon.py script run and scrape the Amazon search results pages.

At this point, you should be able to modify amazon.py to scrape whatever you need from amazon.com

End result with collected Amazon products in a text file

Also, keep in mind – this is just scratching the surface.

We could go a lot deeper and I could give you more examples for scraping websites, but at the end of the day, it’s just simple Python scripts you’re writing here.

When it comes to becoming a Python developer you’ll have to spend a bit more time and learn how to build modular projects…

…with dozens of classes – all working simultaneously – that’s a bit of a challenge in itself.

Comment below if this helped you and let me know if there are any struggles you’re dealing with.

I'll help you become a Python developer!

If you're interested in learning Python and getting a job as a Python developer, send me an email to roberts.greibers@gmail.com and I'll see if I can help you.

Roberts Greibers

I help QA engineers to become backend Python/Django developers so they can increase their income