techminis

A naukri.com initiative

Big Data News

Siliconangle · 2w · 17 reads

Image Credit: Siliconangle

Key BI insights from re:Invent 2024: Data reduction and visibility innovations in focus

  • Siemens Digital Industries Software has partnered with Cribl Inc. for data reduction and operational visibility.
  • Cribl Stream helped Siemens achieve up to a 95% reduction in data volume.
  • Siemens customized data formats with Cribl, improving the efficiency of its security workflows.
  • Cribl enabled Siemens analysts to easily access and analyze data without relying on multiple tools.

Read Full Article · 1 Like

Precisely · 2w · 405 reads

Image Credit: Precisely

Maximizing Your Data’s Potential: Best Practices for Streamlining Data Enrichment

  • Data enrichment refers to the process of enhancing your first-party data by adding supplemental information from external, often third-party, sources (a minimal join sketch follows this list).
  • The process of combining proprietary data with external data sources can be slow and costly.
  • Common data enrichment challenges include data quality, coverage, delivery, cost, and formatting.
  • To streamline data enrichment, evaluate third-party datasets against four key criteria: relevance, consistency, accessibility, and trustworthiness.
  • Trusted third-party data sources that meet these criteria can supplement existing data with additional attributes that meet business needs.
  • Precisely offers thousands of data attributes and hundreds of datasets, organized into six categories.
  • Travelers, an industry-leading insurance company, was able to use the PreciselyID to streamline data enrichment, saving valuable time and money.
  • Enriched data helps businesses personalize marketing efforts, predict consumer behavior, and optimize inventory management.
  • Data enrichment enhances patient records by incorporating social determinants of health, leading to better outcomes and more targeted treatments.
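
In practice, most enrichment pipelines boil down to a keyed join between first-party records and a licensed third-party dataset; the hard parts are the quality, coverage, and formatting concerns listed above. A minimal pandas sketch, with hypothetical file names, columns, and join key (none of these come from the article):

    # Sketch only: enrich first-party customer records with third-party demographics.
    # File names, column names, and the join key are illustrative placeholders.
    import pandas as pd

    customers = pd.read_csv("customers.csv")        # first-party: customer_id, name, postal_code
    demographics = pd.read_csv("demographics.csv")  # third-party: postal_code, median_income, household_size

    # Left join keeps every first-party record and appends the external attributes
    enriched = customers.merge(demographics, on="postal_code", how="left", validate="many_to_one")

    # Spot-check coverage, one of the common challenges called out above
    coverage = enriched["median_income"].notna().mean()
    print(f"Enrichment coverage: {coverage:.1%}")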

Read Full Article · 24 Likes

Amazon · 2w · 375 reads

Image Credit: Amazon

Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

  • AWS Glue 5.0 provides fine-grained access control based on policies defined in AWS Lake Formation for granular control over data lake resources at the table, column, and row levels.
  • Lake Formation, a data lake management service, allows you to define fine-grained access controls through grant and revoke statements and automatically enforce those policies using compatible engines.
  • Using AWS Glue 5.0 with Lake Formation lets you enforce Lake Formation permissions on each Spark job that AWS Glue runs.
  • To enable Lake Formation FGAC for AWS Glue 5.0 jobs, create a standard Data Catalog table, register its location, and grant table permissions using Lake Formation.
  • You can create PySpark jobs in AWS Glue to process input data, configure FGAC on the tables with row- and column-based filters, and limit read access to specific columns using Lake Formation permissions (see the sketch after this list).
  • To enforce FGAC, use Spark SQL and Spark DataFrames, and configure Lake Formation FGAC for AWS Glue notebooks through the console.
  • AWS Glue 5.0 governs access through separate user and system profiles, delegating governed table reads to system executors.
  • With Lake Formation FGAC enabled, AWS Glue jobs should stick to Spark DataFrames and Spark SQL; DynamicFrame operations that cannot be delegated are not compatible.
  • AWS Glue 5.0 unifies handling of FGAC permissions across service integrations, notably Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
  • Through Lake Formation permissions, AWS Glue 5.0 simplifies granular access control to data lake resources at the table, column, and row levels.
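
A hypothetical sketch of the two sides of this setup: an administrator grants column-limited SELECT through Lake Formation, and the Glue 5.0 job then reads the governed table with plain Spark SQL while the engine enforces the policy. The role ARN, database, table, and column names are placeholders, not values from the article:

    # Administrator side (run separately from the job): grant SELECT on a subset of columns.
    import boto3

    lf = boto3.client("lakeformation")
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueJobRole"},
        Resource={
            "TableWithColumns": {
                "DatabaseName": "sales_db",
                "Name": "orders",
                "ColumnNames": ["order_id", "order_date", "amount"],  # sensitive columns omitted
            }
        },
        Permissions=["SELECT"],
    )

    # Job side (AWS Glue 5.0 with Lake Formation FGAC enabled): ordinary Spark SQL;
    # only the granted columns and rows are returned.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT order_id, order_date, amount FROM sales_db.orders").show()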

Read Full Article · 22 Likes

Amazon · 2w · 392 reads

Image Credit: Amazon

Use open table format libraries on AWS Glue 5.0 for Apache Spark

  • Open table formats such as Apache Hudi, Apache Iceberg and Delta Lake provide a standardized framework for data representation, offering flexibility, performance and governance capabilities.
  • AWS Glue 5.0 for Apache Spark adds support for Iceberg 1.6.1, enabling management of the data lifecycle with flexible branching and tagging options and controlled deletion of snapshots (an Iceberg configuration sketch follows this list).
  • Delta Lake 3.2.1 on AWS Glue 5.0 includes optimized writes, deletion vectors to reduce write operations, and UniForm providing universal access to Delta tables through Iceberg and Hudi.
  • Apache Hudi 0.15.0 in AWS Glue 5.0 offers a Record Level Index that speeds up write and read operations, automatic primary key generation, and change data capture (CDC) queries that expose record-level mutations.
  • The adoption of open table formats is an essential component of data-driven organizations for improved data management practices and maximum value extraction.
  • AWS Glue 5.0 upgrades enable users to create new jobs and enhance existing job features for closer integration and management of open table formats.
  • Open table formats are emerging as essential components of successful and competitive data strategies, addressing persistent challenges around data silos, data consistency, query efficiency, and governance.
  • AWS Glue 5.0 adds significant functionality to the popular open table formats Apache Hudi, Apache Iceberg and Delta Lake, to optimize data management practices.
  • Users of the new AWS Glue 5.0 version can take advantage of the enhanced features for better data management and analysis at scale.
  • Open table formats enhance data quality and contribute to flexible management of data, making them indispensable for organizations with complex data requirements and exponential data growth.
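
As a concrete example of working with one of these formats, here is an illustrative PySpark configuration for Apache Iceberg on a Glue 5.0 job. It assumes the job enables the Iceberg data lake format and uses the Glue Data Catalog; the catalog name, database, table, and S3 warehouse path are placeholders:

    # Sketch: configure Spark on AWS Glue 5.0 to use an Iceberg catalog backed by
    # the Glue Data Catalog, then create, write, and read a small table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg-warehouse/")
        .getOrCreate()
    )

    spark.sql("CREATE TABLE IF NOT EXISTS glue_catalog.demo_db.events (id bigint, name string) USING iceberg")
    spark.sql("INSERT INTO glue_catalog.demo_db.events VALUES (1, 'concert')")
    spark.sql("SELECT * FROM glue_catalog.demo_db.events").show()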

Read Full Article · 23 Likes

Amazon · 2w · 135 reads

Image Credit: Amazon

Introducing AWS Glue 5.0 for Apache Spark

  • AWS Glue 5.0, the newest version of AWS Glue, has been launched, upgrading the runtimes to Apache Spark 3.5.2, Python 3.11, and Java 17, with performance and security improvements.
  • It updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1.
  • AWS Lake Formation Fine Grained Access Control is now supported with AWS Glue 5.0 through Spark-native fine-grained access control.
  • Glue 5.0 adds Amazon SageMaker Unified Studio support, with frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17.
  • The upgrade also allows AWS Glue to support S3 Access Grants and to use requirements.txt to manage Python library dependencies.
  • AWS Glue 5.0 introduces data lineage support in Amazon DataZone preview.
  • Glue 5.0 reduces costs by 22% and improves the price-performance of AWS Glue jobs by 32%.
  • Apache Spark 3.5.2 brings many enhancements, including Apache Arrow-optimized Python UDFs.
  • To start using Glue 5.0, you can use AWS Glue Studio or the AWS Glue console, or create jobs with the AWS SDK or AWS CLI (a boto3 sketch follows this list).
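
For the SDK route, a minimal boto3 sketch of creating a Glue 5.0 Spark job is shown below. The job name, IAM role ARN, and script location are placeholders; worker settings and default arguments would be tuned per workload:

    # Sketch: create an AWS Glue 5.0 Spark ETL job with boto3.
    import boto3

    glue = boto3.client("glue")
    glue.create_job(
        Name="demo-glue5-job",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={
            "Name": "glueetl",                                # Spark ETL job type
            "ScriptLocation": "s3://my-bucket/scripts/etl.py",
            "PythonVersion": "3",
        },
        GlueVersion="5.0",                                    # Spark 3.5.2 / Python 3.11 / Java 17 runtimes
        WorkerType="G.1X",
        NumberOfWorkers=2,
        DefaultArguments={
            "--enable-metrics": "true",
            # requirements.txt-based Python dependency management (mentioned above)
            # is also configured via job arguments; see the AWS Glue docs for the keys.
        },
    )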

Read Full Article · 8 Likes

Amazon · 2w · 418 reads

Image Credit: Amazon

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

  • Amazon SageMaker Lakehouse unifies data from Amazon S3 data lakes and Redshift data warehouses, allowing organizations to build powerful analytics and machine learning applications on a single copy of data.
  • AWS Glue Iceberg REST Catalog allows Apache Spark users to access SageMaker Lakehouse capabilities, making it possible to write/read data operations against Amazon S3 tables.
  • The article provides a solution overview for creating an AWS Glue database and using Apache Spark to work with Glue Data Catalog.
  • Lake Formation is used to centrally manage technical metadata for structured and semi-structured datasets for efficient and secure data governance.
  • Lake Formation permissions are enabled for third-party access, which allows the user to register an S3 bucket with Lake Formation.
  • To let the open source Spark role access data, grant resource permissions to spark_role, providing full table access and data location permission.
  • A PySpark script is configured to use the AWS Glue Iceberg REST catalog endpoint (see the sketch after this list).
  • Users can validate read/write operations to the Iceberg table on Amazon S3 and view the data in the Iceberg table in Athena.
  • The architecture can be adapted to suit users' needs, whether they're running Spark on bare metal servers in their data center, in a Kubernetes cluster, or any other environment.
  • The article is written by Raj Ramasubbu, Sr. Analytics Specialist Solutions Architect, Srividya Parthasarathy, Senior Big Data Architect, and Pratik Das, Senior Product Manager, with AWS Lake Formation.
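
A hypothetical sketch of the Spark session configuration such a PySpark script might use: an Iceberg REST catalog pointed at the Glue endpoint with SigV4 request signing. The endpoint URI format, warehouse value, Region, package versions, and the database/table names are assumptions; the article and AWS documentation have the exact values:

    # Sketch: open source Spark using the AWS Glue Iceberg REST catalog endpoint.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
                "org.apache.iceberg:iceberg-aws-bundle:1.6.1")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.glue_rest", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_rest.type", "rest")
        .config("spark.sql.catalog.glue_rest.uri", "https://glue.us-east-1.amazonaws.com/iceberg")
        .config("spark.sql.catalog.glue_rest.warehouse", "123456789012")   # AWS account ID
        .config("spark.sql.catalog.glue_rest.rest.sigv4-enabled", "true")
        .config("spark.sql.catalog.glue_rest.rest.signing-name", "glue")
        .config("spark.sql.catalog.glue_rest.rest.signing-region", "us-east-1")
        .getOrCreate()
    )

    # Validate a write and a read against the Iceberg table on Amazon S3
    spark.sql("INSERT INTO glue_rest.demo_db.orders VALUES (1, 'pending')")
    spark.sql("SELECT * FROM glue_rest.demo_db.orders").show()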

Read Full Article · 25 Likes

Amazon · 2w · 126 reads

Image Credit: Amazon

Author visual ETL flows on Amazon SageMaker Unified Studio (preview)

  • Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker.
  • Unified Studio allows you to build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics.
  • Visual ETL is a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, and load (ETL) data integration flows.
  • You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute.
  • Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts, enabling a quick-start experience for developers who prefer to author using code.
  • This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources.
  • The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts.
  • The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset (a rough PySpark equivalent follows this list).
  • The data is then aggregated to calculate the number of events by venue name.
  • Generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.
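
The generated Spark script for that flow would look roughly like the following sketch. S3 paths are placeholders, and the column lists reflect the standard TICKIT file layouts; Visual ETL would emit comparable Spark-native code from the visual DAG:

    # Sketch: join TICKIT allevents_pipe and venue_pipe, then count events per venue.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    events = (spark.read.option("delimiter", "|")
              .csv("s3://my-bucket/tickit/allevents_pipe.txt")
              .toDF("eventid", "venueid", "catid", "dateid", "eventname", "starttime"))
    venues = (spark.read.option("delimiter", "|")
              .csv("s3://my-bucket/tickit/venue_pipe.txt")
              .toDF("venueid", "venuename", "venuecity", "venuestate", "venueseats"))

    # Merge the two sources, then aggregate the number of events by venue name
    events_by_venue = (events.join(venues, on="venueid", how="inner")
                       .groupBy("venuename")
                       .agg(F.count("eventid").alias("event_count")))
    events_by_venue.show()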

Read Full Article · 7 Likes

Amazon · 2w · 235 reads

Image Credit: Amazon

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

  • AWS Glue has introduced native connectors for 19 applications to enable seamless access and consolidation of data from various sources.
  • Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines and makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources.
  • Zero-ETL provides service-managed replication designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration.
  • Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses.
  • AWS Glue now offers multiple ways to build data integration pipelines depending on your integration needs. Glue ETL offers customer-managed data ingestion that is suitable for complex transformations.
  • The new functionality with AWS Glue enables customers to create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in Amazon SageMaker Lakehouse and Amazon Redshift.
  • One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform Time Travel, which allows you to access and query historical versions of your data effortlessly.
  • With Iceberg Time Travel, you can easily roll back to previous data states, compare data across different points in time, or recover from accidental data changes (example queries follow this list).
  • This blog post explores how zero-ETL capabilities, combined with the new application connectors, are transforming the way businesses integrate and analyze data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP, and others.
  • Using AWS Glue with zero-ETL integration powered by native service connectors, customers can unlock the full potential of their data across multiple platforms faster and stay ahead of the curve.
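
Illustrative Iceberg time travel queries from Spark SQL are shown below. The catalog, database, and table names, the timestamp, and the snapshot ID are placeholders:

    # Sketch: query historical versions of a replicated Iceberg table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # List the snapshots Iceberg has recorded for the table
    spark.sql("SELECT snapshot_id, committed_at FROM glue_catalog.crm_db.accounts.snapshots").show()

    # Query the table as of a point in time
    spark.sql("SELECT * FROM glue_catalog.crm_db.accounts TIMESTAMP AS OF '2024-12-01 00:00:00'").show()

    # Or as of a specific snapshot, e.g. to compare states or recover from an accidental change
    spark.sql("SELECT * FROM glue_catalog.crm_db.accounts VERSION AS OF 1234567890123456789").show()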

Read Full Article · 14 Likes

Amazon · 2w · 314 reads

Image Credit: Amazon

Catalog and govern Amazon Athena federated queries with Amazon SageMaker Lakehouse

  • Amazon SageMaker Unified Studio is a newly announced data and artificial intelligence (AI) integration platform for Amazon S3 data lakes and third-party sources, including Snowflake.
  • Amazon SageMaker Lakehouse breaks down data silos while maintaining governance, security, and compliance as data access expands.
  • Data analysts can securely query external data sources, including Amazon Redshift data warehouses and Amazon DynamoDB databases, through a single, unified experience.
  • Administrators can apply access controls at different levels of granularity to ensure sensitive data remains protected while expanding data access.
  • This allows organizations to accelerate data initiatives while maintaining security and compliance, leading to faster, data-driven decision-making.
  • Amazon SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources, allowing analysts to run SQL queries on federated data catalogs.
  • Setup is further simplified by the easy-to-use SageMaker Unified Studio, which integrates with SageMaker Lakehouse to give end users the flexibility to work with their preferred tools.
  • This blog post demonstrates how to connect to, govern, and run federated queries on data in Redshift, DynamoDB (Preview), and Snowflake (Preview); a minimal query sketch follows this list.
  • The blog presents a solution where a company is using multiple data sources containing customer data. Regulations require personally identifiable information (PII) data to be secured; an administrator sets up fine-grained access controls using Lake Formation.
  • We encourage readers to try fine-grained access controls on federated queries today in SageMaker Unified Studio, and to share feedback. For more on federated queries in Athena and the data sources that support fine-grained access controls, see Register your connection as a Glue Data Catalog in the Athena User Guide.
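
A minimal boto3 sketch of issuing such a federated query is shown below. The catalog, database, table, column names, workgroup, and S3 output location are placeholders; Lake Formation fine-grained access controls determine which rows and columns the caller actually receives:

    # Sketch: run an Athena query against a federated catalog registered in the Data Catalog.
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString="SELECT customer_id, city FROM customers LIMIT 10",
        QueryExecutionContext={"Catalog": "snowflake_catalog", "Database": "sales"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        WorkGroup="primary",
    )
    print("Query execution ID:", resp["QueryExecutionId"])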

Read Full Article · 18 Likes

Amazon · 2w · 26 reads

Image Credit: Amazon

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

  • Amazon SageMaker, an integrated experience for data, analytics, and AI, enables customers to work with their data, whether for analytics or AI, helps them get to AI-ready data faster, and improves the productivity of all data and AI workers.
  • SageMaker brings together AWS ML and analytics capabilities and provides unified tools for model development, generative AI, data processing, and SQL analytics, along with built-in generative AI powered by Amazon Q Developer that guides you along the way of your data and AI journey.
  • SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data.
  • SageMaker Catalog simplifies the discovery, governance, and collaboration for data and AI across your lakehouse, AI models, and applications.
  • Amazon SageMaker Unified Studio (Preview) provides an integrated authoring experience to use all your data and tools for analytics and AI. All your favorite functionality and tools are now available in one place, helping you discover and prepare data with ease, author queries or code, and get to insights faster.
  • Moving forward, we’ll refer to this set of AI/ML capabilities as SageMaker AI, and we’ll continue to innovate and expand on them.
  • SageMaker still includes all the existing ML and AI capabilities for data wrangling, human-in-the-loop data labeling with Amazon SageMaker Ground Truth, experiments, MLOps, Amazon SageMaker HyperPod managed distributed training, and more.
  • SageMaker also comes with built-in generative AI powered by Amazon Q Developer that guides you along the way of your data and AI journey, transforming complex tasks into intuitive conversations.
  • The next generation of SageMaker delivers an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities.
  • Innovate faster with the convergence of data, analytics and AI.

Read Full Article · 1 Like

Amazon · 2w · 65 reads

Image Credit: Amazon

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

  • ANZ Institutional Division has built a federated data platform that allows domain-based teams to build data products to support business outcomes.
  • ANZ Institutional Division’s challenges of operating with clear silos in data practices and centralized teams led it to decentralize data ownership, compliance, and creation of data products.
  • ANZ Institutional Division shifted from viewing data as a byproduct of projects to treating it as a valuable product in its own right, motivated by goals such as improving business insights, agility, and data quality, standardizing tooling, and driving innovation.
  • ANZ embarked on a transformative journey to redefine its data practices and extracted significant business value from data insights for improving customer and employee experiences.
  • Data mesh architecture allowed ANZ to embrace a decentralized approach to data management aligned with modern organizational structures and agile methodologies.
  • ANZ’s federated data strategy comprises four key principles: domain ownership, data as a product, a self-serve data platform, and federated computational governance.
  • The Institutional Division has implemented a self-service data platform, called the Institutional Data & AI platform, that adopts a federated approach to data while simplifying data product development across domains.
  • ANZ Institutional Division uses key metrics to evaluate the success of the delivery model, such as cost transparency and domain adoption, to guide the data mesh governance team in refining the delivery approach.
  • The Institutional Division has decided to adopt an architectural and operational model aligned with the data mesh paradigm.
  • ANZ Institutional Division is a strong proof point of using AWS to democratize data, giving business functions the autonomy to self-serve their data needs with built-in governance.

Read Full Article · 3 Likes

Cloudera · 2w · 240 reads

Image Credit: Cloudera

Cloudera AI Inference Service Enables Easy Integration and Deployment of GenAI Into Your Production Environments

  • The Cloudera AI Inference service is a deployment environment that enables you to integrate and deploy generative AI (GenAI) and predictive models into your production environments, incorporating Cloudera's enterprise-grade security, privacy, and data governance.
  • The Cloudera AI Inference service is a highly scalable, secure, and high-performance deployment environment for serving production AI models and related applications.
  • The service is targeted at the production-serving end of the MLOps/LLMOps pipeline.
  • Cloudera AI Inference is a new, purpose-built service for hosting all production AI models and related applications.
  • The Cloudera AI Inference service provides secure and scalable deployment for pre-trained GenAI models.
  • It also offers strong authentication and authorization capabilities, fast recovery from failures, and easy-to-operate rolling updates.
  • With tooling such as Prometheus and Istio, users can monitor system and model performance.
  • Users can train and fine-tune machine learning models in the AI Workbench, then deploy them to the Cloudera AI Inference service for production use cases.
  • The Cloudera AI Inference service is designed to handle model deployment automatically and can efficiently orchestrate hundreds of models and applications, scaling each deployment to hundreds of replicas dynamically.
  • The service complements the Cloudera AI Workbench, which is mainly used for the exploration, development, and testing phases of the MLOps workflow.

Read Full Article · 14 Likes

Cloudera · 2w · 253 reads

Image Credit: Cloudera

Cloudera announces ‘Interoperability Ecosystem’ with founding members AWS and Snowflake

  • Cloudera and Snowflake offer a single source of truth for data, analytics, and AI workloads
  • AWS customers gain flexibility and data utility while reducing complexity by integrating with Snowflake and Cloudera
  • The collaboration enables seamless data sharing, enhanced AI/ML performance, maximized cloud investments, and support for multi-cloud strategies

Read Full Article · 15 Likes

Cloudera · 2w · 292 reads

Image Credit: Cloudera

Fueling the Future of GenAI with NiFi: Cloudera DataFlow 2.9 Delivers Enhanced Efficiency and Adaptability

  • Cloudera DataFlow 2.9 introduces new features to fuel GenAI initiatives, including new AI processors and ready flows for RAG architectures.
  • DataFlow 2.9 enhances developer productivity with features like parameter groups and ready flows for common tasks.
  • Operational enhancements in DataFlow 2.9 include customizable notifications and improved NiFi metrics for easier pipeline management.
  • Cloudera DataFlow 2.9 aligns with Cloudera's vision of universal data distribution, enabling seamless data movement and processing across environments.

Read Full Article · 17 Likes

Precisely · 2w · 353 reads

Image Credit: Precisely

Women on Wednesday with Meenakshi Khurana

  • Meenakshi Khurana, Talent Development Program Manager, shares her experience working in technology.
  • She chose a career in technology due to her interest in its rapid advancements and innovation.
  • Her greatest professional mentor taught her the importance of attention to detail, empathy, and kindness.
  • Meenakshi took the risk of moving to a smaller company, which allowed her to gain diverse experience and build confidence.

Read Full Article · 21 Likes
