ALL >> Computers >> View Article
How To Analyse Amazon Reviews For Climbing Shoe Analysis Using Python?
Those who are familiar with the sport recognise the significance of wearing a pair of rock-climbing shoes that fit well. Shoes that don't fit well can be disastrous in a variety of ways. Because it's a fringe sport, finding a pair that fits properly isn't as simple as stepping into a store and trying them on. Climbing shoe stores may be far and few between depending on one's place of residence. As a result, many people opt to buy it online.
On the other hand, buying a pair, is no easy task, and scrolling through hundreds of reviews is bound to irritate even the most methodical among us. As a result, in this blog, we have scraped various shoe products available on Amazon for detailed study.
Web Scraping Amazon Reviews Dataset
Here, we will use Selenium library in Python and divide the web scraping process in various parts:
Part 1: Fetching Product URLs
The first-step will be to initialize the webdriver. For this we will use Google webdriver which can be downloaded using this link. The below code is used for starting chrome web browser.
from selenium import webdriver
driver_path = ...
...
driver = webdriver.Chrome(executable_path=driver_path)
Then we use get() method to get access to Amazon website. For avoiding simulation of entering search queries and keystrokes, simply attach search query to the base URL. The webpage coming from the user search will have the following format: https://www.amazon.com/ < query will be placed here>
Executing the below script will open the Amazon search result page for search query ‘la sportive climbing shoe’
base_url = 'https://www.amazon.com/'
query = 'la sportiva climbing shoe'
query_url = base_url+'s?k='+query
driver.get(query_url)
screenshot
The next stage would be to scrape all of the URLs on the page and save them to a file before moving on to the next page of search results. This process continues until you reach the final results page. For this approach, I write a few utility functions.
A string is written to a file using this function. The “a” option determines whether to create a new file (if one has not yet been created) or to publish to an existing file.
def write_to_file(filename,content):
f = open(filename, "a")
f.write(content+'\n')
f.close()
This method scrapes the entire current page for all URLs. If the words "climbing" and "shoe" appear in the URL, it writes the URLs to a file links.txt (using the write to file utility function). To avoid the Python NoneType issue that occurs when Selenium picks up NoneType objects, an additional condition is introduced to evaluate only strings (type(link) == str).
def get_all_urls():
# identify elements with tagname
links = driver.find_elements_by_tag_name("a")
# traverse list
for link in links:
# get_attribute() to get all href
link = link.get_attribute('href')
if (type(link) == str) and ('climbing' in link) and ('shoe' in link):
write_to_file('links.txt',link)
To put it all together, we begin by calculating the number of results pages. The class “a-normal” is used to represent integers in the DOM Tree. Here, you will get all numbers (in this case 1 to 5) using the find elements by class name method, then set the last number to the variable last page (.text extracts the actual text value which is typecast to an int variable).
screenshot
pages = driver.find_elements_by_class_name("a-normal")
last_page = int(pages[-1].text)
We scrape the URLs by cycling over the results pages after putting the previous two lines into a function. The base URL was combined with both the search query and page number in Amazon's URL structure for each incrementing page number: https://www.amazon.com/s?k=query&page=page number. If the results just had one webpage, this exception handler was created (pages would evaluate to an empty list). Other major climbing shoe brands that went through this process were Ocun, Five Ten, Tenaya, Black Diamond, Scarpa, Mad Rock, Evolv, and Climb X.
def iterate_pages():
pages = driver.find_elements_by_class_name("a-normal")
try:
last_page = int(pages[-1].text)
# base_url would evaluate to ...amazon.com/s?k=
base_url = driver.current_url
for page in range(2,last_page+1):
get_all_urls()
driver.get(base_url+'&page='+str(page))
except IndexError:
get_all_urls()
The initial scraping pass yielded 2655 URLs. Filtering away undesirable URLs, however, required more cleaning. Examine the following list of nefarious URLs:
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg1_1?ie=UTF8&adId=A0187488H7LAC5K25U1L&url=%2FTenaya-Oasi-Climbing-Shoe-Womens%2Fdp%2FB00G2G1U5U%2Fref%3Dsr_1_14_sspa%3Fdchild%3D1%26keywords%3Dtenaya%2Bclimbing%2Bshoe%26qid%3D1618629457%26sr%3D8-14-spons%26psc%3D1&qualifier=1618629457&id=3206048678566111&widgetName=sp_mtf
https://www.amazon.com/s?k=five+ten+climbing+shoe#
https://www.amazon.com/s?k=mad+rock+climbing+shoe#s-skipLinkTargetForFilterOptions
https://www.amazon.com/s?k=mad+rock+climbing+shoe#s-skipLinkTargetForMainSearchResults
https://aax-us-east.amazon-adsystem.com/x/c/QmbTjAA2xKm5MsGgE8COatYAAAF43dZjkQEAAAH2AQqX8gE/https://www.amazon.com/stores/page/5F070E5E-577B-4B70-956A-02731ECF176A?store_ref=SB_A0421652LWRFVMVX8W4I&pd_rd_w=fsNZy&pf_rd_p=eec71719-2cb3-403b-a8de-e6266852cdb6&pd_rd_wg=uFHFp&pf_rd_r=5E5ZWZNV7GCVDNKR9EC7&pd_rd_r=a8ad5a25-3cd0-4a36-a529-3000b5271936&aaxitk=4VreBt7wvxz.JYuyg-2dbQ&hsa_cr_id=7601608740401&lp_asins=B07BJZTWGR,B07BKRBH7B,B07FQ5CYTK&lp_query=mad%20rock%20climbing%20shoe&lp_slot=desktop-hsa-3psl&ref_=sbx_be_s_3psl_mbd
The first URL is a link to a similar product that is essentially an advertisement (these would refer to the products that have the sponsored tag on the results page). The second, third, and fourth URLs all lead to the search result, not a specific product page. The final URL is an advertisement URL as well.
Let's see how this compares to URLs that lead to individual product pages.
https://www.amazon.com/Mad-Rock-Lotus-Climbing-Shoes/dp/B00L8BT7P4/ref=sr_1_15?dchild=1&keywords=mad%2Brock%2Bclimbing%2Bshoe&qid=1618629518&sr=8-15&th=1
https://www.amazon.com/SCARPA-Instinct-SR-Climbing-Shoe/dp/B07PW8ZLHN/ref=sr_1_93?dchild=1&keywords=scarpa+climbing+shoe&qid=1618629505&sr=8-93
https://www.amazon.com/Climb-Apex-Climbing-Shoe-Yellow/dp/B07MQ79LNZ/ref=sr_1_9?dchild=1&keywords=climb+x+climbing+shoe&qid=1618629539&sr=8-9
The first thing to notice is that clean URLs are free of slredirect and picassoRedirect.html. When we split down a system of clean URLs, we can see that the base URL is accompanied by the product name as the first two sections of the URL. In contrary, “dirty” URLs include the query signifier, s?k=, immediately after the basic URL. Another point to note is that product URLs begin with https://www.amazon.com/, as opposed to the dirty URLs' last example, which did not use that string as its base URL. To filter out these bad URLs, some simple data cleaning using regex was developed.
# Remove sponsored product URLs.
re.search('slredirect',line.strip("\n")):
# Remove results page URLs.
re.search('s?k=',line.strip("\n")):
# Remove URLs that do not start with base URL.
line.strip("\n").startswith("https://www.amazon.com"):
# Remove URLs with gp or ap following its base URL.
re.search('https://www.amazon.com/gp/',line.strip("\n")):
re.search('https://www.amazon.com/ap/',line.strip("\n")):
Furthermore, there seem to be a lot of URLs pointing to almost the same product page (even though their URLs were not 100 percent identical). Consider the following scenario:
https://www.amazon.com/Sportiva-Ultra-Raptor-Running-Yellow/dp/B008I6J7YS/ref=sr_1_46?dchild=1&keywords=la+sportiva+climbing+shoe&qid=1618629379&sr=8-46#customerReviews
https://www.amazon.com/Sportiva-Ultra-Raptor-Running-Yellow/dp/B008I6J7YS/ref=sr_1_46?dchild=1&keywords=la+sportiva+climbing+shoe&qid=1618629379&sr=8-46
The URL structure of Amazon is deconstructed in this blog post. In summary, only the “/dp/” and “ASIN” (Amazon Standard Identification Number) sections of URLs were significant. As a result, from the “ref=” section onwards, I erased everything. The final stage was to eliminate any URLs that directed to brands that were not part of this project. The data cleaning process yielded 425 clean, unique, and relevant URLs.
Part 2: Scraping Product Data
The flow of the scraping is as follows:
Go to the product page and scrape product title, product price.
scraping-product-data
Click on the “Sell all reviews” button and this will open a link.
reviews
Scrape all the reviews
reviews
On the other hand, Examining the flow of linkages, revealed potential duplication problems. Take a look at the two URLs below, each with a different ASIN.
https://www.amazon.com/Sportiva-Tarantulace-Climbing-Shoes-Black/dp/B081ZF79RK/
https://www.amazon.com/Sportiva-Tarantulace-Climbing-Shoes-Black/dp/B081XVFMZ5/
These hyperlinks are shown to refer to the same products based on the product name. The links on their individual review sites are identical.
https://www.amazon.com/Sportiva-Mens-TarantuLace-Performance-Climbing/product-reviews/B07BCK9CWR/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
https://www.amazon.com/Sportiva-Mens-TarantuLace-Performance-Climbing/product-reviews/B07BCK9CWR/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
With this in mind, you can decide two or more product links lead to the same review page, only one of the product URLs should be scraped. The following are some code snippets to help you out:
# Scrapes the product name.
prod_name = driver.find_element(By.ID, 'productTitle').text
# Scrapes product price.
prod_price = driver.find_element(By.ID,'priceblock_ourprice').text
# Click translate reviews button.
translate_button = driver.find_element(By.XPATH,'/html/body/div[1]/div[3]/div/div[1]/div/div[1]/div[5]/div[3]/div/div[1]/span/a')
translate_button.click()
# Scrape all reviews on the current page.
reviews = driver.find_elements(By.XPATH,'//span[@data-hook="review-body"]')
for review in reviews:
product_reviews.append(review.text)
The result of the dataset will have the below structure:
code
Data Cleaning and Preprocessing
The initial step was to create two brand and model fields. The brand may easily be deduced from the product title. The extraction of the model, on the other hand, proved to be a difficult undertaking. First, we will sample 15 papers and analyze what a product name is made up of to identify the general components of a product name. To do so, we will use the MongoDB Query Language (MQL).
db.products.aggregate(
[{$match:{"reviews_url":{$ne:null}}},
{ $sample: { size: 15 } },
{$project:{"_id":0,"product_name":1}}]
)
This will deliver the subset as follows:
{ "product_name" : "La Sportiva Otaki, Women's Climbing Shoes" }
{ "product_name" : "La Sportiva TX2 Women's Approach Shoe" }
{ "product_name" : "Five Ten Men's Coyote Climbing Shoe" }
{ "product_name" : "La Sportiva TX4 MID GTX Hiking Shoe - Men's" }
{ "product_name" : "Five Ten Men's Stonelands VCS Climbing Shoe" }
{ "product_name" : "Scarpa Women's Gecko WMN Approach Shoe" }
It was time to sanitize the reviews after constructing the derived variables. A random sample of three reviews is shown below.
"I have narrow feet, but couldn't get my feet in these shoes.
\nI prefer to wear comfortable climbing shoes, not meaning beginner shoes (though I'm a beginner).
\nI asked around and talked with those die-hard climbers, they suggested me wear shoes that I fee comfortable with.
\nI strongly agree.
\nHow can you climb with shoes that hurt your feet? Not to speak of focusing on it.""I just started climbing regularly and these are great. Good grip.""Great shoes! They have good flexibility."
For text cleaning and preprocessing, Python's NLTK module is used. Customization of data cleaning activities to the algorithm is required for conducting sentiment analysis on Amazon reviews using VADER. In a nutshell, VADER assesses public sentiment by:
Punctuation
Capitalization
Degree Modifiers
Conjunctions
Words That Flip The Observations
Below is an example of a review that has had its stop words removed. It may be seen qualitatively that the review's meaning has been preserved.
Great aggressive shoe. I have a wider foot and these fit me great. The laces really allow me to tune the fit. I downsized a half size from my gym shoe size. Initially pretty uncomfortable, but a shower and a few climbing sessions and they fit just right.Great aggressive shoe . I wider foot fit great. The laces really allow tune fit. I downsized half size gym shoe size. Initially pretty uncomfortable, shower climbing sessions fit right.
We also wanted each page to represent a single model for a single brand. The fields name and model could be used as a composite key in a relational database. The code for deleting duplicate documents can be available at the above-mentioned GitHub repository.
We have used VADER to conduct sentiment analysis on the reviews in order to turn them into something quantitative. A short lesson on how to use it may be found here. The following was my code implementation:
def get_sentiment_score(review):
# Create a SentimentIntensityAnalyzer object.
sid_obj = SentimentIntensityAnalyzer()
# polarity_scores method of SentimentIntensityAnalyzer
# oject gives a sentiment dictionary.
# which contains pos, neg, neu, and compound scores.
sentiment_dict = sid_obj.polarity_scores(review)
print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative")
print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral")
print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
print("sentence has a compound score of ", sentiment_dict['compound'])
# decide sentiment as positive, negative and neutral
if sentiment_dict['compound'] >= 0.05 :
print("Sentence Overall Rated As Positive")
elif sentiment_dict['compound']
iWeb scraping is a leading data scraping company! Offer web data scraping, website data scraping, web data extraction, product scraping and data mining in the USA, Spain.
Add Comment
Computers Articles
1. Exploring How Ai In The Cloud Can Transform Your BusinessAuthor: TechDogs
2. The Power Of Cloud And Ai: A New Era Of Collaboration
Author: TechDogs
3. Get Business Insights Using Expedia & Booking. Com Review Data Scraping
Author: DataZivot
4. Top 10 Reasons A Strong Communication Strategy Drives Prm Program Success
Author: Archi
5. Achieve Scalable Web Scraping With Aws Lambda
Author: Devil Brown
6. Overcoming Common Challenges In Iso 27001 Implementation
Author: Jenna Miller
7. Basic Computer Course: Your Gateway To Skill Development | The Institute Of Professional Accountants
Author: Tipa Institute
8. Top 7 Advantages Of React Js
Author: Bella Stone
9. Top 7 App Marketing Tools For Mobile Success
Author: Bella Stone
10. Revolutionizing Education Management With Samphire It Solution Pvt Ltd’s Erp Software
Author: CONTENT EDITOR FOR SAMPHIRE IT SOLUTIONS PVT LTD
11. Top 10 Healthcare Technology Trends
Author: goodcoders
12. "building Tomorrow’s Factories: The Role Of Automation & Robotics In Modern Manufacturing"
Author: andrew smith
13. The Ultimate Guide To The Best Ecommerce Plugin For Wordpress
Author: Rocket Press
14. Xsosys Erp: A Scalable Solution For Businesses In Any Industry
Author: Xsosys Technology(S) Pte. Ltd.
15. Rental Management Software: A Complete Solution For Car, Property, And Coworking Space
Author: RentAAA