Towards Data Science

Website Feature Engineering at Scale: PySpark, Python & Snowflake

  • The article addresses how to identify the top candidates for a business proposal by their website quality scores, and aims to automate that ranking with machine learning pipelines.
  • It covers the technical implementation: fetching HTML content in Python from a Snowflake dataset or a CSV file, then assigning a quality score using PySpark for feature extraction and processing.
  • Legal and ethical considerations in web scraping are highlighted, with emphasis on responsible practices, retention policies, and seeking permission from site owners where appropriate.
  • The article provides instructions on getting started with the project, including the folder structure, Snowflake data preparation, and usage of scripts for fetching website content.
  • The article outlines the advantages of a comprehensive fetching script over a basic approach, including asynchronous requests, rotating User-Agents, and efficient batching.
  • Storing raw HTML in databases like Snowflake is recommended for scalability, as large-scale scraping and feature engineering are more reliable when done in a suitable data warehouse.
  • The process of extracting features from HTML content using PySpark via Snowpark is detailed, including creating UDFs, applying feature extraction functions, and generating quality scores.
  • Country-specific configurations define the keywords and patterns that signal a good merchant site, which makes the feature extraction adaptable across regions.
  • The article concludes that the website quality score is a key input for predictive models that rank and recommend partners, with the aim of better business outcomes.
  • A GitHub repository link is provided for the implementation details, and a disclaimer clarifies that the data and scripts are examples and not taken from real business scenarios.
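
The fetching approach summarized above (asynchronous requests, rotating User-Agents, batching) can be sketched roughly as follows. This is not the article's script: the names `USER_AGENTS`, `batched`, `fetch_html`, and `fetch_all`, the User-Agent strings, and the default batch and concurrency sizes are all assumptions made for illustration, and only the Python standard library is used.

```python
import asyncio
import itertools
import urllib.request

# Placeholder User-Agent strings (assumed, not taken from the article).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBot/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBot/1.0",
]


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def _blocking_get(url, user_agent, timeout):
    """Plain blocking GET; run in a worker thread by fetch_html."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


async def fetch_html(url, user_agent, timeout=10.0):
    """Fetch one page without blocking the event loop."""
    return await asyncio.to_thread(_blocking_get, url, user_agent, timeout)


async def fetch_all(urls, batch_size=50, max_concurrency=10, fetch_one=fetch_html):
    """Fetch URLs batch by batch, rotating User-Agents and capping concurrency.

    Returns a dict mapping each URL to its HTML, or to None if the fetch failed.
    """
    agents = itertools.cycle(USER_AGENTS)
    semaphore = asyncio.Semaphore(max_concurrency)
    results = {}

    async def guarded(url, agent):
        async with semaphore:
            try:
                return url, await fetch_one(url, agent)
            except Exception:
                # A failed site should not abort the whole batch.
                return url, None

    for batch in batched(urls, batch_size):
        pairs = await asyncio.gather(*(guarded(u, next(agents)) for u in batch))
        results.update(pairs)
    return results
```

Calling `asyncio.run(fetch_all(list_of_urls))` would return the URL-to-HTML mapping, which could then be written to a table or file. The `fetch_one` parameter exists so that the HTTP layer can be swapped out, for example for a stub in tests or for a dedicated async HTTP client.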

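To make the feature-extraction and scoring step more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the `COUNTRY_CONFIG` keywords and phone patterns, the feature names, and the weights in `quality_score` are illustrative assumptions, not the configuration or formula used in the article.

```python
import re
from html.parser import HTMLParser

# Hypothetical per-country configuration; keywords and patterns are examples only.
COUNTRY_CONFIG = {
    "us": {
        "keywords": ["shop", "cart", "checkout"],
        "phone_pattern": r"\(\d{3}\)\s?\d{3}-\d{4}",
    },
    "de": {
        "keywords": ["warenkorb", "kasse", "impressum"],
        "phone_pattern": r"\+49[\s\d]{6,}",
    },
}


class _TextAndLinks(HTMLParser):
    """Collects text nodes and counts <a> tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(html, country):
    """Return a small feature dict for one page, using the country's config."""
    cfg = COUNTRY_CONFIG[country]
    parser = _TextAndLinks()
    parser.feed(html or "")
    text = " ".join(parser.text_parts).lower()
    return {
        "keyword_hits": sum(1 for kw in cfg["keywords"] if kw in text),
        "has_phone": bool(re.search(cfg["phone_pattern"], text)),
        "link_count": parser.link_count,
        "text_length": len(text),
    }


def quality_score(features, n_keywords):
    """Combine features into a score in [0, 1]; the weights are arbitrary."""
    keyword_part = features["keyword_hits"] / n_keywords if n_keywords else 0.0
    phone_part = 1.0 if features["has_phone"] else 0.0
    link_part = min(features["link_count"], 10) / 10
    return round(0.5 * keyword_part + 0.3 * phone_part + 0.2 * link_part, 3)
```

In a Spark setting, functions of this shape could be wrapped as UDFs and applied to a column of raw HTML; that wiring depends on the environment and is left out of this sketch.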