AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community, with in-person and virtual content, and the venue where many new features and updates are announced.
This post summarizes the announcements within the Data & Analytics services.
Amazon DataZone
Amazon DataZone is a tool that helps organizations catalog and share data across the company. It allows data producers (such as data engineers and data scientists) to share data securely and with the right context, and data consumers (such as analysts) to find answers to business questions and share them with others in the organization.
DataZone is intended to provide an easy way to organize and discover data across the organization. It allows users to share, search, and discover data at scale across organizational boundaries through a unified data analytics portal that provides a personalized view of all the data while enforcing governance and compliance policies.
The tool creates a usage flywheel: as producers publish well-described data, consumers find and reuse it, which in turn encourages more sharing. This improves operational efficiency, lets business and data teams work with data faster, and supports informed, data-driven decisions. DataZone also aims to remove the burden of governing data while keeping it accessible to everyone in the organization, turning data into an organizational asset and a source of competitive edge.
AWS Clean Rooms
AWS Clean Rooms is a solution that enables companies to collaborate on shared data sets while protecting the underlying raw data. This is particularly useful in industries such as financial services, healthcare, and advertising, where companies need to collaborate with partners without compromising data security. Traditional methods for collaborating on data, such as providing copies of it and relying on contractual agreements, can be at odds with protecting it.
AWS Clean Rooms allows customers to create a secure data clean room in minutes and collaborate with other companies on the AWS Cloud to generate insights about advertising campaigns, investment decisions, and research and development without having to share or reveal raw data.
Some features and benefits of AWS Clean Rooms:
⦿ Create a clean room and start collaborating in a few clicks
⦿ Collaborate with any of the hundreds of thousands of companies on AWS without sharing or revealing underlying data
⦿ Privacy-enhancing controls to protect underlying data
⦿ Easy-to-configure analysis rules to tailor queries to specific business needs
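To give a feel for the analysis rules mentioned above, the sketch below shows what an aggregation-style rule might look like. This is illustrative only: the column names are hypothetical, and the exact field names and structure should be checked against the AWS Clean Rooms documentation.

```json
{
  "aggregateColumns": [
    { "columnNames": ["purchase_amount"], "function": "SUM" }
  ],
  "joinColumns": ["hashed_customer_id"],
  "dimensionColumns": ["campaign", "region"],
  "outputConstraints": [
    { "columnName": "hashed_customer_id", "minimum": 100, "type": "COUNT_DISTINCT" }
  ]
}
```

The intent of such a rule is that collaborators can only run aggregate queries (here, sums of purchase amounts grouped by campaign and region), joined on a hashed identifier, and only see result rows that aggregate over at least a minimum number of distinct customers.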
AWS Clean Rooms will be available in early 2023 in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm).
Amazon OpenSearch Serverless
Amazon OpenSearch Serverless is a new option offered by Amazon OpenSearch Service that simplifies the process of running search and analytics workloads at a large scale without the need to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the necessary resources to deliver fast data ingestion and query responses for even the most demanding workloads, and users only pay for the resources that are consumed.
OpenSearch Serverless decouples compute and storage and separates the indexing (ingestion) components from the search (query) components, using Amazon Simple Storage Service (S3) as the primary data storage for indexes. This allows the search and indexing functions to scale independently of each other and of the indexed data in S3. With OpenSearch Serverless, developers can create new collections, which are logical groupings of indexed data that work together to support a workload.
It also supports the same ingest and query APIs as OpenSearch, making it easy to get started with existing clients and applications, and it can be used to build data visualizations with serverless OpenSearch Dashboards.
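As a minimal sketch of getting started, the snippet below assembles the parameters for creating a collection. The actual boto3 call is left in comments so the example stays self-contained; the parameter names follow the OpenSearch Serverless CreateCollection API, and the collection name is illustrative.

```python
# Sketch: creating an OpenSearch Serverless collection.
# Parameter names follow the opensearchserverless CreateCollection API;
# the boto3 call itself is commented out to keep the snippet self-contained.

def build_collection_request(name: str, collection_type: str = "SEARCH") -> dict:
    """Assemble CreateCollection parameters for a new collection."""
    return {
        "name": name,              # must be lowercase and unique per account/region
        "type": collection_type,   # SEARCH or TIMESERIES
        "description": f"Collection for the {name} workload",
    }

params = build_collection_request("product-catalog")
# import boto3
# client = boto3.client("opensearchserverless")
# response = client.create_collection(**params)
print(params["name"])
```

Once the collection is active, existing OpenSearch clients can ingest and query against its endpoint without any cluster sizing.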
AWS Glue Updates
⦿ AWS Glue 4.0 – Access to the latest Spark and Python releases so builders can develop, run, and scale their data integration workloads and get insights faster.
⦿ AWS Glue Data Quality - Automatic data quality rule recommendations based on your data
⦿ AWS Glue for Ray - Data integration with Ray (ray.io), a popular new open-source compute framework that helps you scale Python workloads
⦿ AWS Glue for Apache Spark - Supports three open-source data lake storage frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake.
⦿ AWS Glue Custom Visual Transforms - Define your own reusable ETL logic and input rules once, then share them on the Transform tab of Glue Studio, so experienced users can improve efficiency for the rest of the team.
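To illustrate the Data Quality feature above: rules are expressed in Glue's Data Quality Definition Language (DQDL), and Glue can also recommend rules like these automatically from a scan of your data. A minimal ruleset might look as follows (column names and thresholds are hypothetical):

```
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "status" in ["NEW", "SHIPPED", "DELIVERED"],
    RowCount > 1000
]
```

Each rule is evaluated against the dataset, and the overall result (pass/fail per rule) can be used to gate downstream pipeline steps.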
Amazon Redshift Updates
⦿ Apache Spark Integration - Author Apache Spark applications using Java, Python, and Scala, with access to rich, curated data in your data warehouse
⦿ Streaming Ingestion Support - Kinesis Data Streams (KDS) and Managed Streaming for Apache Kafka (MSK) without staging in S3
⦿ Dynamic Data Masking - Easily protect sensitive data by managing data masking policies through an SQL interface
⦿ Auto-Copy From Amazon S3 - Simple, low-code data ingestion
⦿ New SQL commands – MERGE, ROLLUP, CUBE, GROUPING SETS
⦿ Support for large JSON objects – up to 16 MB (previously 1 MB)
⦿ Multi-AZ deployment
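Two of the items above can be sketched in SQL. The statements below are illustrative: stream, role, table, and column names are placeholders, and exact syntax should be checked against the Redshift documentation.

```sql
-- Streaming ingestion: map a Kinesis data stream to a materialized view,
-- with no staging in S3.
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

CREATE MATERIALIZED VIEW clickstream_mv AS
SELECT approximate_arrival_timestamp,
       json_parse(kinesis_data) AS payload
FROM kinesis_schema."clickstream";

-- MERGE: upsert staged rows into a target table in a single statement.
MERGE INTO customers
USING customers_staging s
ON customers.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name);
```

Refreshing the materialized view pulls new stream records directly into Redshift, typically within seconds of arrival.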
Amazon QuickSight Updates
A new set of Amazon QuickSight API capabilities allows customers to programmatically manage their QuickSight assets (analyses and dashboards) in their DevOps pipelines. Developers can now version control, back up, and deploy assets programmatically, enabling faster, safer changes. This also accelerates migration from legacy BI tools to the cloud, supported by AWS migration partners.
⦿ Paginated Reports - Create, schedule, and share highly formatted multipage reports
⦿ Q Automated Data Prep - AI-enhanced automated data preparation, making it fast and straightforward to augment existing dashboards for natural language questions
⦿ QuickSight API - Access the underlying data models of Amazon QuickSight dashboards, reports, analyses, and templates via the AWS Software Development Kit (SDK)
⦿ Two new question types for QuickSight Q – “forecast” and “why”. “forecast” creates a dynamic forecast visualization, and “why” identifies the data driver behind a specific change in the data
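The version-control workflow the API enables can be sketched as follows: serialize an asset definition to JSON so it can be committed, diffed, and redeployed. In practice the definition would be fetched and pushed with the QuickSight API via the AWS SDK (e.g. the dashboard definition operations); the stand-in dictionary below keeps the example self-contained, and its field names are illustrative.

```python
import json
from pathlib import Path

# Version-control sketch: persist a dashboard definition as JSON.
# In a real pipeline this dict would come from the QuickSight API via the
# AWS SDK rather than being hard-coded here.
definition = {
    "DashboardId": "sales-overview",   # hypothetical asset id
    "Name": "Sales Overview",
    "Definition": {"DataSetIdentifierDeclarations": []},
}

path = Path("sales-overview.json")
path.write_text(json.dumps(definition, indent=2))

# A deploy step would read the committed file back and push it to the
# target account with the corresponding create/update API call.
restored = json.loads(path.read_text())
print(restored["DashboardId"])
```

The same round trip works for analyses and templates, which is what makes backups and cross-account promotion scriptable.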
Amazon Athena for Apache Spark
Amazon Athena for Apache Spark is a new feature that allows organizations to perform complex data analysis using Apache Spark without the need to configure and manage separate infrastructure. It allows users to build distributed applications using expressive languages like Python, and it offers a simplified notebook experience in the Athena console or through Athena APIs.
Athena is deeply integrated with other AWS services, making it easy to query data from various sources, chain multiple calculations together, and visualize the results of analyses. The feature enables interactive Spark applications to start in under a second and run faster with an optimized Spark runtime, saving users time and allowing them to focus on insights rather than waiting for results. With Amazon Athena for Apache Spark, there are no servers to manage and no minimum fee or setup cost; users only pay for the queries they run.
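Besides notebooks, Spark code can be submitted through the Athena APIs. The sketch below assembles the parameters for starting a session and a calculation; the boto3 calls are commented out so the snippet stays self-contained, parameter names follow the Athena StartSession / StartCalculationExecution API, and the workgroup name and S3 path are illustrative.

```python
# Sketch: submitting a Spark calculation through the Athena API.

code_block = """
df = spark.read.csv("s3://my-bucket/data/", header=True)  # hypothetical path
df.groupBy("category").count().show()
"""

session_params = {
    "WorkGroup": "spark-workgroup",   # a Spark-enabled workgroup (illustrative)
    "EngineConfiguration": {"MaxConcurrentDpus": 4},
}
calc_params = {"CodeBlock": code_block}

# import boto3
# athena = boto3.client("athena")
# session = athena.start_session(**session_params)
# athena.start_calculation_execution(SessionId=session["SessionId"], **calc_params)
print(session_params["WorkGroup"])
```

Because sessions start in under a second, this pattern also suits interactive, ad-hoc exploration, not just scheduled jobs.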
Amazon Aurora zero-ETL Feature
Amazon Aurora now supports zero-ETL integration with Amazon Redshift, allowing users to perform near real-time analytics and machine learning using Redshift on large amounts of transactional data from Aurora. With this integration, data is available in Redshift within seconds of being written into Aurora, eliminating the need to build and maintain complex data pipelines for ETL operations.
The zero-ETL integration also enables users to analyze data from multiple Aurora database clusters in the same Amazon Redshift instance, providing a holistic view of data across multiple applications or partitions. This allows users to leverage Redshift's analytics and capabilities, such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data in near real-time.
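From the Redshift side, consuming the integration can be sketched roughly as below. The identifiers are placeholders and the feature was announced in preview, so the exact syntax may differ from what ships.

```sql
-- Link a Redshift database to an existing zero-ETL integration
-- (integration identifier is a placeholder):
CREATE DATABASE aurora_analytics FROM INTEGRATION '<integration-id>';

-- Aurora writes then appear in Redshift within seconds and can be
-- queried like any other data (table name is hypothetical):
SELECT order_status, COUNT(*)
FROM aurora_analytics.public.orders
GROUP BY order_status;
```

No COPY jobs or ETL pipelines sit between the two services; replication is managed by the integration itself.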
AWS Lake Formation Data Sharing Access Control
AWS Lake Formation is a fully managed service that makes it easy to build, secure, and manage a data lake: a central repository for storing all your structured and unstructured data at any scale, which can then serve a variety of workloads such as analytics, data warehousing, and machine learning.
The new feature makes it easier for customers to designate the right level of access to various users without having to run complex queries or manually identify who has access to specific data shares. It also improves the security of data by enabling administrators to provide granular, row-level, and column-level access to data shares within Lake Formation. This is particularly useful for customers who want to share and work with consistent data across regions and accounts, but want to enforce granular access to different users.
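The row-level and column-level controls described above are expressed in Lake Formation as data cells filters. The sketch below assembles such a filter; the boto3 call is commented out to keep it self-contained, parameter names follow the Lake Formation CreateDataCellsFilter API, and all database, table, and column names are illustrative.

```python
# Sketch: a Lake Formation data cells filter combining a row-level rule
# with a restricted set of visible columns.

filter_request = {
    "TableData": {
        "TableCatalogId": "123456789012",   # AWS account id (placeholder)
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "eu_analysts_filter",
        "RowFilter": {"FilterExpression": "region = 'EU'"},   # row-level rule
        "ColumnNames": ["order_id", "order_date", "total"],   # visible columns
    }
}

# import boto3
# boto3.client("lakeformation").create_data_cells_filter(**filter_request)
print(filter_request["TableData"]["Name"])
```

Granting a principal access through such a filter means they only ever see EU rows and the three listed columns, without the administrator writing custom views per consumer.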
Amazon DocumentDB Elastic Clusters
Amazon DocumentDB (with MongoDB compatibility) Elastic Clusters offer flexible scaling to store petabytes of data and handle millions of read/write requests per second. AWS manages the underlying infrastructure, so there is no need for instance creation or scaling operations, and data is replicated six ways across three Availability Zones (AZs) for high availability and durability.
✍️ About the Author:
Mohamed Fayaz is a Data & AI Consultant and a technical blogger who writes and speaks about topics such as Software Engineering, Big Data Analytics, and Cloud Engineering. Connect with him on LinkedIn or follow him on Twitter for updates.