123ArticleOnline Logo
Welcome to 123ArticleOnline.com!
ALL >> Computer-Programming >> View Article

How To Scrape Expedia Using Python And Lxml?

Profile Picture
By Author: iWeb Scraping Services
Total Articles: 177
Comment this article
Facebook ShareTwitter ShareGoogle+ ShareTwitter Share

Data Fields that will be extracted:

data-field
Arrival Airport
Arrival Time
Departure Airport
Departure Time
Flight Name
Flight Duration
Ticket Price
No. Of Stops
Airline
Below shown is the screenshot of the data fields that we will be extracting:

the-screenshot
Scraping Code:
1. Create the URL of the search results from Expedia for instance, we will check the available flights listed from New York to Miami:

https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:New%20York,%20NY%20(NYC-All%20Airports),to:Miami,%20Florida,departure:04/01/2017TANYT&passengers=children:0,adults:1,seniors:0,infantinlap:Y&mode=search
2. Using Python Requests, download the HTML of the search result page.

3. Parse the webpage with LXML — LXML uses Xpaths to browse the HTML Tree Structure. The XPaths for the details we require in the code have already been defined.

4. Save the information in a JSON file. You can change this later to write to a database.

Requirements
We'll need several libraries for obtaining ...
... and parsing HTML for this Python 3 web scraping tutorial. The requirements for the package are shown below.

Install Python 3 and Pip

Install Packages
The code is explanatory

You can check the code from the link here.

Executing the Expedia Scraper
Let's say the script's name is expedia.py. In a command prompt or terminal, input the script name followed by a -h.

usage: expedia.py [-h] source destination date
positional arguments:
source Source airport code
destination Destination airport code
date MM/DD/YYYY
optional arguments:
-h, --help show this help message and exit
The input and output arguments are the airline codes for the source and destination airports, respectively. The date parameter must be in the form MM/DD/YYYY

For example, to get flights from New York to Miami, we would use the following arguments:

python3 expedia.py nyc mia 04/01/2017
The nyc-mia-flight-results.json file will be created as a result of this. json, which will be saved in the same directory as the script.

This is what the output file will look like:

{
"arrival": "Miami Intl., Miami",
"timings": [
{
"arrival_airport": "Miami, FL (MIA-Miami Intl.)",
"arrival_time": "12:19a",
"departure_airport": "New York, NY (LGA-LaGuardia)",
"departure_time": "9:00p"
}
],
"airline": "American Airlines",
"flight duration": "1 days 3 hours 19 minutes",
"plane code": "738",
"plane": "Boeing 737-800",
"departure": "LaGuardia, New York",
"stops": "Nonstop",
"ticket price": "1144.21"
},
{
"arrival": "Miami Intl., Miami",
"timings": [
{
"arrival_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
"arrival_time": "11:15a",
"departure_airport": "New York, NY (LGA-LaGuardia)",
"departure_time": "9:11a"
},
{
"arrival_airport": "Miami, FL (MIA-Miami Intl.)",
"arrival_time": "8:44p",
"departure_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
"departure_time": "4:54p"
}
],
"airline": "Republic Airlines As American Eagle",
"flight duration": "0 days 11 hours 33 minutes",
"plane code": "E75",
"plane": "Embraer 175",
"departure": "LaGuardia, New York",
"stops": "1 Stop",
"ticket price": "2028.40"
},
You can download the code at:

import json
import requests
from lxml import html
from collections import OrderedDict
import argparse

def parse(source,destination,date):
for i in range(5):
try:
url = "https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:{0},to:{1},departure:{2}TANYT&passengers=adults:1,children:0,seniors:0,infantinlap:Y&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com".format(source,destination,date)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers, verify=False)
parser = html.fromstring(response.text)
json_data_xpath = parser.xpath("//script[@id='cachedResultsJson']//text()")
raw_json =json.loads(json_data_xpath[0] if json_data_xpath else '')
flight_data = json.loads(raw_json["content"])

flight_info = OrderedDict()
lists=[]

for i in flight_data['legs'].keys():
total_distance = flight_data['legs'][i].get("formattedDistance",'')
exact_price = flight_data['legs'][i].get('price',{}).get('totalPriceAsDecimal','')

departure_location_airport = flight_data['legs'][i].get('departureLocation',{}).get('airportLongName','')
departure_location_city = flight_data['legs'][i].get('departureLocation',{}).get('airportCity','')
departure_location_airport_code = flight_data['legs'][i].get('departureLocation',{}).get('airportCode','')

arrival_location_airport = flight_data['legs'][i].get('arrivalLocation',{}).get('airportLongName','')
arrival_location_airport_code = flight_data['legs'][i].get('arrivalLocation',{}).get('airportCode','')
arrival_location_city = flight_data['legs'][i].get('arrivalLocation',{}).get('airportCity','')
airline_name = flight_data['legs'][i].get('carrierSummary',{}).get('airlineName','')

no_of_stops = flight_data['legs'][i].get("stops","")
flight_duration = flight_data['legs'][i].get('duration',{})
flight_hour = flight_duration.get('hours','')
flight_minutes = flight_duration.get('minutes','')
flight_days = flight_duration.get('numOfDays','')

if no_of_stops==0:
stop = "Nonstop"
else:
stop = str(no_of_stops)+' Stop'

total_flight_duration = "{0} days {1} hours {2} minutes".format(flight_days,flight_hour,flight_minutes)
departure = departure_location_airport+", "+departure_location_city
arrival = arrival_location_airport+", "+arrival_location_city
carrier = flight_data['legs'][i].get('timeline',[])[0].get('carrier',{})
plane = carrier.get('plane','')
plane_code = carrier.get('planeCode','')
formatted_price = "{0:.2f}".format(exact_price)

if not airline_name:
airline_name = carrier.get('operatedBy','')

timings = []
for timeline in flight_data['legs'][i].get('timeline',{}):
if 'departureAirport' in timeline.keys():
departure_airport = timeline['departureAirport'].get('longName','')
departure_time = timeline['departureTime'].get('time','')
arrival_airport = timeline.get('arrivalAirport',{}).get('longName','')
arrival_time = timeline.get('arrivalTime',{}).get('time','')
flight_timing = {
'departure_airport':departure_airport,
'departure_time':departure_time,
'arrival_airport':arrival_airport,
'arrival_time':arrival_time
}
timings.append(flight_timing)

flight_info={'stops':stop,
'ticket price':formatted_price,
'departure':departure,
'arrival':arrival,
'flight duration':total_flight_duration,
'airline':airline_name,
'plane':plane,
'timings':timings,
'plane code':plane_code
}
lists.append(flight_info)
sortedlist = sorted(lists, key=lambda k: k['ticket price'],reverse=False)
return sortedlist

except ValueError:
print ("Rerying...")

return {"error":"failed to process the page",}

if __name__=="__main__":
argparser = argparse.ArgumentParser()
argparser.add_argument('source',help = 'Source airport code')
argparser.add_argument('destination',help = 'Destination airport code')
argparser.add_argument('date',help = 'MM/DD/YYYY')

args = argparser.parse_args()
source = args.source
destination = args.destination
date = args.date
print ("Fetching flight details")
scraped_data = parse(source,destination,date)
print ("Writing data to output file")
with open('%s-%s-flight-results.json'%(source,destination),'w') as fp:
json.dump(scraped_data,fp,indent = 4)


Unless the page structure changes dramatically, this scraper should be able to retrieve most of the flight details present on Expedia. This scraper is probably not going to work for you if you want to scrape the details of thousands of pages at very short intervals.

Contact iWeb Scraping for extracting Expedia using Python and LXML or ask for a free quote!

More About the Author

iWeb scraping is a leading data scraping company! Offer web data scraping, website data scraping, web data extraction, product scraping and data mining in the USA, Spain.

Total Views: 287Word Count: 860See All articles From Author

Add Comment

Computer Programming Articles

1. Best Accounting Software 2025 In Zambia: Tips And Best Practices
Author: Doris oseR

2. Aryabhata And The Birth Of Zero: A Legacy That Powers Modern Ai And Machine Learning
Author: Pydun Technology Private Limited

3. Top 5 Video Conferencing Solutions Of 2025
Author: Ben Gross

4. Best Practices For Building High-performance React Native Apps
Author: William

5. Top 10 Reasons To Pursue Full Stack Java Development In India
Author: Rohan Rajput

6. Transform Your Digital Presence With Expert Drupal Development
Author: manish

7. We Provide It Solutions That Help You Succeed
Author: We provide IT solutions that help you succeed

8. What Makes A Full Stack Developer Stand Out In 2025?
Author: Shrushti Gurav

9. Effortlessly Convert Sale Orders To Purchase Orders In Odoo
Author: CodersFort

10. Best Software Development Comapny In Wayanad, Kerala, India
Author: TRUSTWAVES

11. How To Spot Red Flags In Invoices And Stop Fraud Instantly?
Author: Invoice Temple

12. Top Ai Development Company In Delhi: Leading Artificial Intelligence Services By Doubleklickdesign
Author: Prince

13. What Are The Best Coding Institutes In Bhopal?
Author: Shankar Singh

14. Innovating Blockchain Strategies With Mev Bot Technology
Author: aanaethan

15. How To Choose The Right Coding Institute In Bhopal
Author: Shankar Singh

Login To Account
Login Email:
Password:
Forgot Password?
New User?
Sign Up Newsletter
Email Address: