techminis

A naukri.com initiative

Big Data News

Precisely · 1w · 281 reads

Image Credit: Precisely

Mainframe Data Meets AI: Reducing Bias and Enhancing Predictive Power

  • Mainframes store vast amounts of comprehensive and diverse historical data, including transactional, demographic, and behavioral records, that can help reduce bias in AI models.
  • Mainframe data can provide contextual insights, correct imbalances in real-time data, improve the representation of underrepresented groups, and supply rich, diverse historical context.
  • However, legacy mainframe systems come with challenges, including data silos, data compatibility issues, security and compliance concerns, cost, and resource constraints.
  • To overcome these challenges, organizations must invest in modernizing their mainframes, use data virtualization tools for accessing and analyzing mainframe data, and APIs to create connections between mainframes and AI platforms.
  • Mainframe data, with its rich historical context and diverse demographic representation, will play an increasingly important role in overcoming bias in AI models.
  • As AI advances and becomes more embedded in critical decision-making processes, the importance of reducing bias and ensuring fair outcomes will only grow.
  • By combining mainframe data with AI, organizations can build more accurate, equitable, and trustworthy AI systems.
  • In conclusion, legacy mainframe data can play a crucial role in delivering successful AI outcomes, meeting future demands, and reducing bias in AI models.

Read Full Article

16 Likes

Dzone · 1w · 68 reads

Image Credit: Dzone

Data Processing With Python: Choosing Between MPI and Spark

  • Message Passing Interface (MPI) and Apache Spark are two popular frameworks used for parallel and distributed computing.
  • MPI is a standardized and portable message-passing system designed for parallel computing, while Spark is an open-source analytics engine for processing large amounts of data.
  • Spark is more convenient but may compromise on performance, while MPI offers flexibility and maximum performance.
  • Spark is higher-level and interpretive, offers less control over parallelism, and carries JVM startup overhead, while MPI gives the programmer full control but requires manual implementation of parallel code (see the sketch below).
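
To make the contrast concrete, here is a minimal sketch (not from the article) that sums a range of numbers first with mpi4py, where the programmer slices the data and reduces partial results by hand, and then with PySpark, where the engine handles partitioning and scheduling.

```python
# MPI version: run with `mpiexec -n 4 python sum_mpi.py`
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = range(1_000_000)
# Each rank explicitly takes its own slice of the data.
local_sum = sum(data[rank::size])
# Partial sums are combined on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("MPI total:", total)

# Spark version: partitioning, scheduling, and fault tolerance are handled by the engine.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sum-example").getOrCreate()
print("Spark total:", spark.sparkContext.parallelize(range(1_000_000)).sum())
spark.stop()
```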

Read Full Article

4 Likes

Amazon · 1w · 303 reads

Image Credit: Amazon

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

  • Amazon SageMaker Unified Studio is an integrated development environment (IDE) for data, analytics, and AI.
  • Organizations are building data-driven applications, but they require collaboration across teams and the integration of data, tools, and services.
  • With SageMaker Unified Studio, organizations can adopt the best services for their use cases while empowering their data practitioners with a unified development experience.
  • Users only need to learn SageMaker Unified Studio tools once and then they can use them across all services.
  • The tools are integrated with Amazon Q, so users can quickly build, refine, and maintain applications with text-to-code capabilities.
  • SageMaker Unified Studio tools offer a unified view of an application’s building blocks such as data, code, development artifacts, and compute resources across services to approved users.
  • SageMaker Unified Studio automates and simplifies access management for different application blocks.
  • You can ingest data into Amazon S3 and create a new table called venue_event_agg (a rough sketch of this step follows the list).
  • SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, and Scala Spark.
  • With SageMaker Unified Studio, data practitioners can access all the capabilities of AWS purpose-built analytics, AI/ML, and generative AI services from a single unified development experience.
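
As a loose illustration of that ingestion-and-aggregation step, the PySpark sketch below builds a venue_event_agg table; the source table venue_event and its columns venueid and eventid are hypothetical placeholders, not names taken from the post.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("venue-event-agg").getOrCreate()

# Hypothetical source table; column names are placeholders for illustration only.
events = spark.table("venue_event")

# Aggregate events per venue and persist the result as the new table.
agg = events.groupBy("venueid").agg(F.count("eventid").alias("event_count"))
agg.write.mode("overwrite").saveAsTable("venue_event_agg")
```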

Read Full Article

18 Likes

Siliconangle · 2w · 870 reads

Image Credit: Siliconangle

How AWS Q Business aims to unlock data insights for everyone

  • AWS has introduced Q Business, which aims to empower non-technical users while complementing developers in accessing and generating data insights.
  • Q Business integrates business intelligence and workflow automation, allowing users to consolidate data from multiple sources and derive actionable insights through natural language queries.
  • The platform also provides integration with Amazon QuickSight for data storytelling through interactive dashboards.
  • AWS' goal is to simplify data utilization for all users and make advanced analytics and AI accessible to non-technical personnel.

Read Full Article

21 Likes

Amazon · 2w · 236 reads

Image Credit: Amazon

Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless

  • Spark Structured Streaming simplifies streaming data processing by providing a high-level API that supports batch-processing-like jobs. Businesses can scale up and down their computing infrastructure as needed with Amazon EMR Serverless to enable Spark Structured Streaming to handle streaming data.
  • Amazon EMR Serverless provides fine-grained scaling for optimal throughput and cost optimization. Fine-grained scaling matters in real-world scenarios where data volumes are unpredictable and workloads have sudden spikes.
  • Enhanced fan-out support is available in the Amazon Kinesis connector, which is pre-packaged in Amazon EMR Serverless. Enhanced fan-out provides each consumer with a dedicated throughput of 2 MB/s per shard, allowing for faster, more efficient data processing and boosting the overall performance of streaming jobs on EMR Serverless.
  • Amazon EMR Serverless ensures resiliency in streaming jobs by leveraging automatic recovery and fault-tolerant architectures. Automatic event retry is also available with EMR Serverless for tackling transient runtime failures.
  • EMR Serverless provides robust log management and enhanced monitoring for streaming jobs. The platform is integrated with Amazon Managed Service for Prometheus, enabling detailed engine metrics to be monitored, analyzed, and optimized.
  • EMR Serverless supports Kinesis Data Streams, Amazon MSK, and self-managed Apache Kafka clusters as input data sources, supporting diverse data processing pipelines (a minimal job sketch follows this list).
  • Using Spark Structured Streaming on EMR Serverless is an efficient and cost-effective solution for real-time data processing. With the ease of integration with AWS services and automated resiliency features, it provides high availability and reliability, minimizing downtime and data loss.
  • Anubhav Awasthi, Kshitija Dound, and Paul Min are AWS Solutions Architects who have co-authored this article.
  • Organizations may try out Spark Structured Streaming on EMR Serverless and optimize it for their specific needs using the advanced monitoring tools. Comment with questions regarding use cases.
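
For orientation, here is a minimal PySpark Structured Streaming sketch that reads from a Kafka or Amazon MSK topic and writes windowed counts to Amazon S3; the broker address, topic, and paths are placeholders, and the article's own example may use the Kinesis connector instead.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-serverless-streaming").getOrCreate()

# Placeholder broker and topic; on EMR Serverless these would point at Amazon MSK
# or a self-managed Kafka cluster, per the sources listed in the post.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker-1:9092")
         .option("subscribe", "clickstream")
         .load()
)

# One-minute event counts; the watermark bounds state kept for late data.
counts = (
    events.withWatermark("timestamp", "5 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count()
)

# Checkpointing to S3 enables automatic recovery after transient failures.
query = (
    counts.writeStream.outputMode("append")
          .format("parquet")
          .option("path", "s3://my-bucket/streaming-output/")
          .option("checkpointLocation", "s3://my-bucket/checkpoints/")
          .start()
)
query.awaitTermination()
```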

Read Full Article

14 Likes

Amazon · 2w · 399 reads

Image Credit: Amazon

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

  • Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads.
  • AWS provides the Amazon Redshift Query Editor V2, a web-based tool that allows you to explore, analyze, and share data using SQL.
  • The Query Editor V2 offers a user-friendly interface for connecting to your Redshift clusters, executing queries, and visualizing results.
  • Many customers have already implemented identity providers (IdPs) like Microsoft Entra ID (formerly Azure Active Directory) for single sign-on (SSO) access across their applications and services.
  • Through this federated setup, users can connect to the Redshift Query Editor with their existing Microsoft Entra ID credentials, and administrators can control permissions for database objects based on business groups defined in Active Directory.
  • In the following sections, we explore the process of federating into AWS using Microsoft Entra ID and AWS Identity and Access Management (IAM), and how to restrict access to datasets based on permissions linked to AD groups.
  • You use the federation metadata file to configure the IAM IdP in a later step.
  • In IAM, an IdP represents a trusted external authentication service like Microsoft Entra ID that supports SAML 2.0, allowing AWS to recognize user identities authenticated by that service.
  • Next, you create an IAM role for SAML-based federation, which will be used to grant access to the Redshift Query Editor and the Redshift cluster (a rough boto3 sketch of these two IAM steps follows the list).
  • In this post, we demonstrated how to use Microsoft Entra ID to federate into your AWS account and use the Redshift Query Editor V2 to connect to a Redshift cluster and access the schemas based on the AD groups associated with the user.
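
The broad shape of the IdP and role creation steps can be sketched with boto3 as below; the metadata file name, provider name, and role name are illustrative placeholders, and the post's console-based walkthrough remains the authoritative procedure.

```python
import json
import boto3

iam = boto3.client("iam")

# Register Microsoft Entra ID as a SAML identity provider in IAM, using the
# federation metadata file downloaded from Entra ID (placeholder file name).
with open("federation_metadata.xml") as f:
    saml_metadata = f.read()

provider = iam.create_saml_provider(
    SAMLMetadataDocument=saml_metadata,
    Name="EntraID",  # placeholder provider name
)

# Trust policy so that users authenticated by the IdP can assume the role via SAML.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": provider["SAMLProviderArn"]},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
    }],
}

iam.create_role(
    RoleName="RedshiftQueryEditorFederationRole",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```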

Read Full Article

24 Likes

Ubuntu · 2w · 141 reads

Spark or Hadoop: the best choice for big data teams?

  • Apache Spark is an open-source, distributed processing system that allows large amounts of data to be processed efficiently.
  • Spark solved several performance problems related to processing large datasets, making it the number one choice in the industry and a direct competitor of Hadoop.
  • Spark's strength is distributed computing, making it a champion for operations on large datasets.
  • Apache Spark’s architecture consists of three main components: the driver, the executors, and the cluster manager. It utilises a manager/worker configuration, where the manager determines the number of worker nodes needed and how they should function.
  • Generally, Spark's advantage over Hadoop is speed. Spark is able to perform tasks up to 100 times faster than Hadoop, making it a great solution for low-latency processing use cases, such as machine learning.
  • Using Apache Spark on Kubernetes offers numerous advantages over other cluster resource managers, such as Apache YARN, including simplified deployment, management, and authentication.
  • Spark offers four main built-in libraries: Spark SQL, Spark Streaming, MLlib and GraphX, providing a large set of functionalities for different operations, such as data streaming, dataset handling, and machine learning (a minimal example follows this list).
  • Common use cases for Spark include processing large volumes of data, complex operations, scalability requirements, performance improvements for large datasets, and machine learning.
  • It is not always the case that Apache Spark and Hadoop are competing solutions and they can be used together depending on business needs.
  • Canonical’s Charmed Apache Spark on Kubernetes simplifies the deployment and management process, offering greater flexibility, performance, and ease of use, ensuring quick, reliable, and scalable data processing.
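
As a point of reference, a minimal PySpark job using the Spark SQL/DataFrame library looks like the following; the input path and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Illustrative input; the driver plans the work and executors process partitions in parallel.
sales = spark.read.option("header", True).csv("s3://example-bucket/sales.csv")

top_products = (
    sales.withColumn("amount", F.col("amount").cast("double"))
         .groupBy("product")
         .agg(F.sum("amount").alias("total"))
         .orderBy(F.desc("total"))
)
top_products.show(10)
spark.stop()
```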

Read Full Article

8 Likes

Amazon · 2w · 361 reads

Image Credit: Amazon

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

  • Maintaining high-quality data while continuously vetting the reliability of incoming records is a challenge for data-driven organizations. AWS Glue helps manage data quality via its Data Quality Definition Language (DQDL), and Apache Iceberg is a table format for data lake management.
  • The dead-letter queue (DLQ) and Write-Audit-Publish (WAP) strategies are both used for vetting data quality in streaming environments, each with its own advantages. The DLQ approach keeps the main dataset clean by redirecting problematic entries to a separate queue, while the WAP pattern uses Iceberg's branching feature to isolate problematic entries so that only clean data is published to the main branch. The multistep WAP process adds some latency for downstream consumers and, being implementation dependent, requires more sophisticated orchestration than the DLQ approach.
  • The WAP pattern implements a three-stage process: write, audit, and publish. Data is first written to a staging branch, data quality checks are performed on that branch, and validated data is then merged into the main branch for consumption. Iceberg's branching feature is particularly useful here: each branch can be referenced and updated separately, while ACID transactions and schema evolution help handle multiple concurrent writers and varying schemas.
  • An example use case demonstrates Iceberg branching and AWS Glue Data Quality with a home monitoring system that tracks room temperature and humidity. Incoming readings are evaluated for quality before being visualized, so only qualified room data is used for further analysis. Quality checks use AWS Glue Data Quality, with readings evaluated against a rule set that defines a normal temperature range of -10 to 50 degrees. A new audit branch containing only valid room data is created, and the publish phase fast-forwards the validated data to the main branch, after which it is ready for downstream applications (a compressed sketch of these phases follows this list).
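
A compressed sketch of the write, audit, and publish phases, assuming a Spark session already configured with the Iceberg SQL extensions and an Iceberg catalog named glue_catalog; the table name, branch name, and DQDL rule are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap-sketch").getOrCreate()

incoming = spark.createDataFrame(
    [("room_1", 22.5, 40.0), ("room_2", 120.0, 35.0)],  # the second row is out of range
    ["room", "temperature", "humidity"],
)

# WRITE: stage new records on a dedicated branch so the main branch stays untouched.
spark.sql("ALTER TABLE glue_catalog.db.room_readings CREATE BRANCH IF NOT EXISTS stage")
spark.conf.set("spark.wap.branch", "stage")  # route this session's writes to the branch
incoming.writeTo("glue_catalog.db.room_readings").append()

# AUDIT: validate the staged branch, e.g. with an AWS Glue Data Quality DQDL rule such as
#   Rules = [ ColumnValues "temperature" between -10 and 50 ]
# keeping only the rows that pass.

# PUBLISH: fast-forward main to the audited branch so consumers read only validated data.
spark.sql("CALL glue_catalog.system.fast_forward('db.room_readings', 'main', 'stage')")
```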

Read Full Article

21 Likes

Amazon · 2w · 73 reads

Image Credit: Amazon

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

  • Tracking and analyzing changes over time has become essential in today’s data-driven world. Apache Iceberg provides a feature known as the change log view that enables us to track insertions, updates, and deletions, giving us a complete picture of how our data has evolved.
  • Slowly Changing Dimensions (SCD) Type-2 creates new rows for changed data instead of overwriting existing records, allowing for comprehensive tracking of changes over time.
  • With Iceberg, you can create a dedicated SCD Type-2 view on top of the change log view, eliminating the need to implement specific handling for changes to SCD Type-2 tables. This approach combines Iceberg’s efficient data management with the historical tracking capabilities of SCD Type-2 (see the sketch after this list).
  • SCD Type-2 requires additional fields such as effective_start_date, effective_end_date, and current_flag to manage historical records. In traditional implementations, SCD Type-2 requires specific handling in all INSERT, UPDATE, and DELETE operations that affect those additional columns.
  • Using Iceberg’s change log view, you can obtain the history of a given record directly from the Iceberg table’s history, without needing to create a separate table for managing record history. This streamlined method not only makes the implementation of SCD Type-2 more straightforward, but also offers improved performance and scalability for handling large volumes of historical data in CDC scenarios.
  • SCD Type-2 enables point-in-time analysis, provides detailed audit trails, aids in data quality management, and helps meet compliance requirements by preserving historical data. It is particularly relevant to Change Data Capture (CDC) scenarios, where capturing all data changes over time is crucial.
  • This tutorial demonstrates how to implement historical record management and SCD Type-2 using Apache Iceberg, focusing on a typical CDC architecture. It also showcases how a change log view aids historical analysis, improving possibilities for advanced time-based analytics, auditing, and data governance.
  • The change log view does not lose historical record changes during compaction. However, it does lose changes corresponding to snapshots removed by expire_snapshots or by Glue Data Catalog automatic snapshot deletion, and it is not supported on merge-on-read (MoR) tables.
  • Implementing Iceberg’s change log view with SCD Type-2 lets you manage record and table history with little extra effort; the post shows how the approach can be implemented and the efficiency and flexibility it brings to historical data analysis and CDC processes.
  • Apache Iceberg makes the implementation of SCD Type-2 more manageable and offers improved performance and scalability for handling large volumes of historical data in CDC scenarios.
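
A minimal sketch of creating and querying a change log view with Spark, assuming an Iceberg-enabled session and a catalog named glue_catalog; the table and view names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("changelog-sketch").getOrCreate()

# Build a changelog view over the Iceberg table's snapshot history.
spark.sql("""
    CALL glue_catalog.system.create_changelog_view(
        table => 'db.customers',
        changelog_view => 'customers_changes'
    )
""")

# Each row-level change carries metadata columns such as _change_type
# ('INSERT', 'DELETE', 'UPDATE_BEFORE', 'UPDATE_AFTER') and _change_ordinal,
# from which an SCD Type-2 view can derive effective date ranges.
spark.sql("""
    SELECT * FROM customers_changes
    ORDER BY _change_ordinal
""").show()
```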

Read Full Article

4 Likes

Siliconangle · 2w · 292 reads

Image Credit: Siliconangle

AWS leads the charge in sustainable data centers with AI-ready innovations

  • Amazon Web Services Inc. is focused on creating highly efficient and reliable data centers with innovative solutions.
  • AWS is embracing liquid cooling techniques and scalable infrastructure to save more energy.
  • They have seen a 46% improvement in efficiency during peak cooling times by using computational fluid dynamics.
  • AWS aims to decrease costs and carbon output while improving efficiency in training AI models.

Read Full Article

17 Likes

Cloudera · 2w · 133 reads

Image Credit: Cloudera

Introducing Accelerator for Machine Learning (ML) Projects: Summarization with Gemini from Vertex AI

  • Cloudera has released a new Accelerator for Machine Learning (ML) Projects (AMP) called 'Summarization with Gemini from Vertex AI'.
  • The AMP is a pre-built MVP for AI use cases that can be deployed in a single-click from Cloudera AI (CAI).
  • The aim of the AMP is to provide an AI application prototype for document summarization and showcase the ease of building AI applications using Cloudera AI and Google's Vertex AI Model Garden.
  • The Gemini Pro Models used in the AMP offer superior speed and competitive pricing for text summarization applications.

Read Full Article

8 Likes

Cloudera · 2w · 366 reads

Image Credit: Cloudera

Scaling AI Solutions with Cloudera: A Deep Dive into AI Inference and Solution Patterns

  • Cloudera is offering AI Inference, a production-grade environment to deploy artificial intelligence (AI) models at scale.
  • The architecture of AI Inference ensures low-latency, high-availability deployments, ideal for enterprise-grade applications.
  • The service supports a wide range of models, from traditional predictive models to advanced generative AI, such as large language models and embedding models.
  • With support for the Open Inference Protocol and OpenAI API standards, Cloudera AI Inference can deploy models for different AI tasks, such as language generation and predictive analytics (illustrated after this list).
  • Cloudera AI Inference supports canary deployments for smoother rollouts where a new model version can be tested on a subset of traffic before full rollout.
  • Cloudera's Professional Services provide a blueprint of best-practice frameworks for scaling AI by encompassing all aspects of the AI lifecycle from data engineering to real-time inference and monitoring.
  • Cloudera's platform provides a strong foundation for GenAI applications, supporting everything from secure hosting to end-to-end AI workflows.
  • Cloudera DataFlow, powered by NiFi, enables seamless data ingestion from Amazon S3 to Pinecone, creating a robust knowledge base, allowing fast, searchable insights in Retrieval-Augmented Generation applications.
  • Cloudera provides pre-built accelerators (AMPs) and ReadyFlows to speed up AI application deployment.
  • Cloudera's Professional Services team brings expertise in tailored AI deployments, from pilot projects to full-scale production, ensuring AI implementations align with business objectives.
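
Because the endpoints follow the OpenAI API standard, a deployed model can in principle be called with the standard openai Python client; the base URL, model name, and token below are hypothetical placeholders rather than values from the post.

```python
from openai import OpenAI

# Hypothetical endpoint and credentials for an OpenAI-compatible inference endpoint.
client = OpenAI(
    base_url="https://ai-inference.example.cloudera.site/endpoints/llama-chat/v1",
    api_key="<CDP_TOKEN>",
)

response = client.chat.completions.create(
    model="llama-chat",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize last quarter's sales trends."}],
)
print(response.choices[0].message.content)
```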

Read Full Article

22 Likes

Precisely · 2w · 81 reads

Image Credit: Precisely

2025 Planning Insights: Data Governance Adoption Has Risen Dramatically

  • 71% of organizations have a data governance program, compared to 60% in 2023.
  • Improved quality of data analytics and insights, improved data quality, and increased collaboration are the top reported benefits of data governance programs.
  • Data governance is a top data integrity challenge, cited by 54% of organizations.
  • 62% of respondents report data governance as a top data challenge to artificial intelligence (AI) initiatives.
  • The demand for data governance is also driven by data privacy and security, which are in the top three priorities for improving data integrity in 2024 (45%).
  • Data mesh and data fabric moved forward as trends influencing data programs, jumping five percentage points from 13% in 2023 to 18% in 2024.
  • 45% of respondents report that regulatory compliance is a goal of their governance program.
  • Data governance is seeing remarkable growth – propelled by evolving business needs and advancing technical trends.
  • A robust data governance program must be implemented and maintained well to remain effective and sustainable.

Read Full Article

4 Likes

TechBullion · 2w · 212 reads

Image Credit: TechBullion

How to Achieve Personalized Customer Experiences Through Big Data

  • Personalized experiences have become a vital differentiator for businesses, and to deliver them effectively, businesses need to make use of Big Data. Twenty tips collected from founders, CEOs, and a senior data scientist reveal how data-driven insights can create value on both ends of the recruitment equation. By analyzing buying patterns, businesses can find common product combinations that customers might find appealing and create tailor-made shopping experiences, resulting in increased customer satisfaction and boosted sales. Analyzing customer behavior can also help businesses create personalized recommendations that are efficiently managed with an AI-driven recommendation engine.
  • It has been noted that Big Data can transform how customers experience personalized support. Those with unique skin sensitivities or conditions can benefit from the strategic use of data, which has enabled the creation of custom skincare devices. By using predictive analytics to anticipate clients' needs, big data can provide customized loan options that align with each client’s financial goals in the mortgage industry. Big Data strategies have led to impressive results within the fitness industry through targeted promotions at critical moments in customers’ journeys.
  • Using this platform, companies can curate personalized marketing campaigns and provide accurate buying insights to increase conversions, shorten sales cycle times and send timely follow-ups based on real-time data triggers. Another success story is the integration of Big Data into user behavior, which has wowed customers with personalized usage recommendations, provided personalized tutorials, and helped create deeper connections with clients that foster customer satisfaction and trust.
  • Big Data can also facilitate the creation of highly personalized customer experiences in smaller local businesses by providing custom daily spreadsheets of actionable insights like recent market trends, updated customer demographics, local competitor activity, and real-time changes in key industry metrics. This information offers businesses a deeper understanding of how their customers feel about their services throughout the customer journey, incorporating valuable feedback, and sentiment analysis.
  • Big Data allows businesses to anticipate customer needs and address concerns before they escalate, keeping the customer experience positive and enhancing value for money. By continually refining their processes, businesses can remain competitive while building lasting relationships with clients. Big Data helps customers feel understood and supported, leading to stronger relationships and higher lifetime value.
  • It has been noted that the key to delivering a personalized experience at scale lies in pairing Big Data with generative AI to provide tailored solutions for every industry vertical. Automating content and demo generation has enabled hyper-personalized experiences at scale, making generative AI an excellent tool for achieving personalization.
  • The key to unlocking the potential of a personalized user experience lies in Big Data's ability to provide thousands of data points daily. Raw data is processed and transformed into a format suitable for ingestion into predictive algorithms that forecast key metrics such as customer churn and segmentation. A scoring matrix that triggers alerts when thresholds are breached in specific areas or regions empowers businesses to roll out targeted action plans effectively (a loose sketch follows this list).
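
As a loose, invented illustration of that last point (not from the article), a churn score can be produced with an off-the-shelf classifier and compared against a per-region alert threshold; every feature name, value, and threshold here is made up for the sketch.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Invented example data: per-customer behavioural features and a churn label.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "visits_last_30d": [12, 1, 8, 0],
    "avg_order_value": [54.0, 12.0, 33.0, 9.0],
    "churned": [0, 1, 0, 1],
})

features = ["visits_last_30d", "avg_order_value"]
model = GradientBoostingClassifier().fit(df[features], df["churned"])

# Score customers, then alert when a region's mean churn risk crosses a threshold.
df["churn_risk"] = model.predict_proba(df[features])[:, 1]
ALERT_THRESHOLD = 0.5  # invented threshold
for region, risk in df.groupby("region")["churn_risk"].mean().items():
    if risk > ALERT_THRESHOLD:
        print(f"ALERT: region {region} mean churn risk {risk:.2f}")
```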

Read Full Article

12 Likes

Cloudera · 2w · 347 reads

Image Credit: Cloudera

The Struggle Between Data Dark Ages and LLM Accuracy

  • The AI Forecast: Data and AI in the Cloud Era is a podcast that explores the impact of AI on business and industry.
  • LLM precision, especially in areas like supply chain and finance, is crucial for accuracy.
  • Obtaining a higher level of precision in LLMs requires capturing context and metadata.
  • As data availability decreases, companies will rely on data collectives and value chains to share information.

Read Full Article

20 Likes
