Dev

Playwright on Cloud: Automating Review Data Extraction

  • The article describes an application that automatically extracts review data from product websites whose review sections are paginated. It is designed to support all pagination types, and a GET API returns the extracted reviews. The system has three main components: API Gateway, a Lambda function, and an EC2 instance.
  • API Gateway exposes the automation process to the network. A REST API with a GET method triggers the Lambda function, which manages the process on the EC2 instance through SSM. The API endpoint takes a query string parameter named ‘page’. Lambda proxy integration is enabled so that the Lambda function's output is returned as the API response.
  • The Lambda function acts as a middleman: the API call triggers it, it runs the automation pipeline on the EC2 instance, and it passes the pipeline's output back as the API response. On the EC2 instance, a Python script runs with the unique_id and url passed as command-line arguments to extract the review data.
  • On the EC2 instance, BeautifulSoup strips unnecessary markup from the page to reduce the token count sent to the LLM. BeautifulSoup is also used to identify the class name of the next-page button, which differs from site to site, and the reviews are then scraped page by page. The review data is returned to the Lambda function through an S3 bucket, which also acts as a cache for reviews that have already been extracted.
  • Choosing an LLM was a challenge because of false positives, the size of the page source code, and performance. Playwright depends on Chromium and is not natively supported by Lambda functions, so EC2 was used as a workaround.
  • The stack includes AWS Lambda, EC2, API Gateway, S3, Python (Beautiful Soup, Playwright), Gemini-1.5-flash, and Next.js, among others. The API endpoint must receive the full URL of the product page in the query parameter to extract its review data.
  • Overall, the application automates review data extraction by combining these components to scrape reviews across pagination types.
  • The link to the original article and API endpoint is provided in the article.
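
The request flow described above (GET request, ‘page’ query parameter, Lambda proxy integration) can be illustrated with a minimal Python sketch of a Lambda handler. It is not the article's code. It assumes the ‘page’ parameter carries the full product URL, and `run_pipeline`, `make_handler`, and `fake_pipeline` are hypothetical names introduced here; `run_pipeline` stands in for the real work of running the script on EC2 through SSM and reading the result back from S3.

```python
import json
from typing import Any, Callable, Dict


def _response(status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    # With Lambda proxy integration, API Gateway expects a dict with a
    # statusCode and a string body; the payload is serialized to JSON here.
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def make_handler(run_pipeline: Callable[[str], Dict[str, Any]]):
    """Build a Lambda handler around a pipeline callable.

    run_pipeline is a hypothetical stand-in (an assumption, not from the
    article) for triggering the EC2 script and fetching its output.
    """

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # queryStringParameters is None when the request has no query string.
        params = event.get("queryStringParameters") or {}
        page_url = params.get("page")
        if not page_url:
            return _response(400, {"error": "missing 'page' query parameter"})
        try:
            result = run_pipeline(page_url)
        except Exception as exc:
            # Surface pipeline failures as a 500 instead of crashing the Lambda.
            return _response(500, {"error": str(exc)})
        return _response(200, result)

    return handler


def fake_pipeline(url: str) -> Dict[str, Any]:
    # Placeholder pipeline used only to exercise the handler locally.
    return {"url": url, "reviews": []}


handler = make_handler(fake_pipeline)
```

Passing the pipeline in as a callable keeps the handler testable without AWS access; in a deployment along the lines the article describes, that callable would wrap the SSM and S3 calls.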
