AWS Glue is the central service of an AWS modern data architecture. It is a serverless data integration service that allows you to discover, prepare, and combine data for analytics and machine learning. AWS Glue offers you a comprehensive range of tools to perform ETL (extract, transform, and load) at the right scale. AWS Glue Python shell jobs are designed for running small-to-medium sized ETL, and triggering SQLs (including long-running queries) on Amazon Redshift, Amazon Athena, Amazon EMR, and more.
Today, we're excited to announce a new release of AWS Glue Python shell that supports Python 3.9 with additional pre-loaded libraries. It also allows you to customize your Python shell environment with pre-loaded libraries and offers PIP support to install other native or custom Python libraries.
The new release of AWS Glue Python shell includes the necessary Python libraries to connect your script to SQL engines and data warehouses like SQLAlchemy, PyMySQL, pyodbc, psycopg2, redshift, and more. It also supports communications with other AWS services such as Amazon OpenSearch Service (opensearch-py, elasticsearch), Amazon Neptune (gremlinpython), or Athena (PyAthena). It integrates the AWS Data Wrangler library (awswrangler) for ETL tasks like loading and unloading data from data lakes, data warehouses, and databases. It also includes library support for data serialization in exchange formats such as avro and et-xmlfile.
In this post, we walk you through how to use AWS Glue Python shell to create an ETL job that imports an Excel file and writes it to a relational database and data warehouse. The job reads the Excel file as a Pandas DataFrame, creates a data profiling report, and exports it into your Amazon Simple Storage Service (Amazon S3) bucket. This routine cleans inaccurate records and imputes missing values based on predefined business rules. It writes the data into a target MySQL database for low-latency data access. Additionally, in parallel, the script exports the DataFrame to the data lake in columnar format to be copied into Amazon Redshift for reporting and visualization.
AWS Glue Python shell new features
The new release of AWS Glue Python shell allows you to use new features of Python 3.9 and add custom libraries to your script using job parameter configurations. This gives you more flexibility to write your Python code and reduces the need to manually maintain and update the Python libraries needed for your code.
Customized pre-loaded library environments
AWS Glue Python shell for Python 3.9 comes with two library environment options:
- analytics (default) – You can run your script in a fully pre-loaded environment for complex analytics workloads. This option loads the full package of libraries.
- none – You can choose an empty environment for simple and fast ETL jobs. This option only loads awscli and botocore as basic libraries.
You can set this option by using the library-set parameter in the job creation, for example:
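The following is a sketch of how this parameter could be passed as a default argument when creating the job from the AWS CLI (a full create-job command appears later in this post):

```bash
# Pass the library-set parameter as a job default argument
--default-arguments '{"library-set": "analytics"}'
```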
For your reference, the following table lists the libraries included in each option.
Python version: Python 3.9

| Library set | analytics (default) | none |
| --- | --- | --- |
| avro | 1.11.0 | . |
| awscli | 1.23.5 | 1.23.5 |
| awswrangler | 2.15.1 | . |
| botocore | 1.23.5 | 1.23.5 |
| boto3 | 1.22.5 | . |
| elasticsearch | 8.2.0 | . |
| numpy | 1.22.3 | . |
| pandas | 1.4.2 | . |
| psycopg2 | 2.9.3 | . |
| pyathena | 2.5.3 | . |
| PyMySQL | 1.0.2 | . |
| pyodbc | 4.0.32 | . |
| pyorc | 0.6.0 | . |
| redshift-connector | 2.0.907 | . |
| requests | 2.27.1 | . |
| scikit-learn | 1.0.2 | . |
| scipy | 1.8.0 | . |
| SQLAlchemy | 1.4.36 | . |
| s3fs | 2022.3.0 | . |
Added support for library compilers
In this release, you can import and install libraries as part of the script, including your own C-based libraries. You have PIP support to install native or customer-supplied Python libraries with the support of the following compilers:
- gcc
- gcc-c++
- gmake
- cmake
- cython
- boost-devel
- conda
- python-dev
If you want to include a new package during your job creation, you can add the job parameter --additional-python-modules followed by the name of the library and the version. For example:
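As a sketch, the parameter takes comma-separated, pip-style pins (the module names and versions here are only examples) and can be passed through --default-arguments on the AWS CLI or as a job parameter in AWS Glue Studio:

```bash
--default-arguments '{"--additional-python-modules": "pandas-profiling==3.2.0,openpyxl==3.0.10"}'
```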
How to use the new features with the AWS Glue Python shell script
Now that we have introduced the new features, let's create a Python 3.9 job with additional libraries using AWS Glue Python shell. You have two options to create and submit a job: you can use the interface of AWS Glue Studio, or the AWS Command Line Interface (AWS CLI) for a programmatic approach.
AWS Glue Studio
To use AWS Glue Studio, complete the following steps:
- On the AWS Glue Studio console, create a new job and select Python Shell script editor.
- Enter a job name and enter your Python script.
- On the Job details tab, enter an optional description.
- For IAM role, choose your job role.
- For Python version, choose Python 3.9.
- Select Load common Python libraries.
- Choose the script and the temporary files locations.
- Include the additional libraries as job parameters (--additional-python-modules).
AWS CLI
With the new release, you can now use the AWS CLI with the new parameters. The following is an example of an AWS CLI statement to create the AWS Glue Python shell script job with Python 3.9:
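The following is a sketch of what such a command could look like; the job name, script location, and module versions are placeholder values, and glue-blog-role is the role created in the prerequisites:

```bash
aws glue create-job \
    --name demo-python-shell-job \
    --role glue-blog-role \
    --command '{"Name": "pythonshell", "PythonVersion": "3.9", "ScriptLocation": "s3://my-blog-bucket/scripts/demo_script.py"}' \
    --default-arguments '{"library-set": "analytics", "--additional-python-modules": "boto3==1.22.13"}'
```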
Let's explore the main differences from the previous AWS Glue Python shell versions:
- Set the option PythonVersion within the --command parameter to 3.9.
- To add new libraries, use --additional-python-modules as a new parameter and then list the library and the required version as follows: boto3==1.22.13.
- Include library-set within --default-arguments and choose one of the values, such as default/analytics/none.
Solution overview
This tutorial demonstrates the new features using a common use case where data flows into your system as spreadsheet file reports. In this case, you want to quickly orchestrate a way to serve this data to the right tools. The script imports the data from Amazon S3 into a Pandas DataFrame. It creates a profiling report that is exported into your S3 bucket as an HTML file. The routine cleans inaccurate records and imputes missing values based on predefined business rules. It writes the data directly from Python shell to an Amazon Relational Database Service (Amazon RDS) for MySQL server for low-latency app response. Additionally, it exports the data into a Parquet file and copies it into Amazon Redshift for visualization and reporting.
In our case, we treat each scenario as an independent task with no dependency between them. You only need to create the infrastructure for the use cases that you want to test. Each section provides guidance and links to the documentation to set up the necessary infrastructure.
Prerequisites
There are a few requirements that are common to all scenarios:
- Create an S3 bucket to store the input and output files, script, and temporary files. Then, we create the AWS Identity and Access Management (IAM) user and role necessary to create and run the job.
- Create an IAM AWS Glue service role called glue-blog-role and attach the AWS managed policy AWSGlueServiceRole for general AWS Glue permissions. If you're also testing an Amazon Redshift or Amazon RDS use case, you need to grant the required permissions to this role. For more information, refer to Using identity-based policies (IAM policies) for Amazon Redshift and Identity-based policy examples for Amazon RDS.
- Create an IAM user with security credentials and configure your AWS CLI in your local terminal. This allows you to create and launch your scripts from your local terminal. It is recommended to create a profile associated with this configuration. The dataset used in this example is an Excel file containing Amazon Video Review data with the following structure. In a later step, we place the Excel file in our S3 bucket to be processed by our ETL script.
- Finally, to work with sample data, we need four Python modules that were made available in AWS Glue Python shell when the parameter library-set is set to analytics:
  - boto3
  - awswrangler
  - PyMySQL
  - Pandas
Note that Amazon customer reviews are not licensed for commercial use. You should replace this data with your own authorized data source when implementing your application.
Load the data
In this section, you start writing the script by loading the data used in all the scenarios.
- Import the libraries that we need.
- Read the Excel spreadsheet into a DataFrame.
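The following is a minimal sketch of both steps; the S3 path of the Excel file is a placeholder, and it assumes an Excel engine such as openpyxl is available (it can be added through --additional-python-modules if needed):

```python
import boto3
import pandas as pd
import awswrangler as wr

# Placeholder S3 location of the input Excel file
input_path = "s3://my-blog-bucket/input/amazon_video_reviews.xlsx"

# Read the Excel spreadsheet into a Pandas DataFrame
df = wr.s3.read_excel(input_path)
```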
Scenario 1: Data profiling and dataset cleaning
To assist with basic data profiling, we use the pandas-profiling module and generate a profile report from our Pandas DataFrame. Pandas profiling supports output files in JSON and HTML format. In this post, we generate an HTML output file and place it in an S3 bucket for quick data analysis.
To use this new library during the job, add the --additional-python-modules parameter from the job details page in AWS Glue Studio or during job creation from the AWS CLI. Remember to include this package in the imports of your script:
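A sketch of the import, report generation, and upload steps, assuming pandas-profiling was added through --additional-python-modules; the report title, bucket, and key are placeholders:

```python
from pandas_profiling import ProfileReport

# Generate the profiling report from the DataFrame loaded earlier
profile = ProfileReport(df, title="Amazon Video Review profiling report")

# Export the report as HTML and place it in the S3 bucket for quick review
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-blog-bucket",
    Key="profiling/amazon_video_review_profiling.html",
    Body=profile.to_html().encode("utf-8"),
)
```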
A common problem that we often see when dealing with a column's data type is a mix of data types being identified as an object in a Pandas DataFrame. Mixed data type columns are flagged by pandas-profiling as Unsupported type and stored in the profile report description. We can access that information and standardize it to our desired data types.
The following lines of code loop over every column in the DataFrame and check if any of the columns are flagged as Unsupported by pandas-profiling. We then cast them to string:
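A sketch of that loop; it assumes profile.get_description() returns the dictionary of per-variable statistics used by the pandas-profiling 3.x releases:

```python
# Cast columns flagged as Unsupported (mixed types) to string
description = profile.get_description()
for column, stats in description["variables"].items():
    if stats["type"] == "Unsupported":
        df[column] = df[column].astype(str)
```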
To further clean or process your data, you can access the variables provided by pandas-profiling. The following example prints out all columns with missing values:
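A sketch that reuses the same description dictionary; the n_missing statistic is assumed to be present, as in the pandas-profiling 3.x per-variable statistics:

```python
# Print every column that the profiling report marks as having missing values
for column, stats in description["variables"].items():
    if stats["n_missing"] > 0:
        print(f"{column}: {stats['n_missing']} missing values")
```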
Scenario 2: Export data in columnar format and copy it to Amazon Redshift
In this scenario, we export our DataFrame into Parquet columnar format, store it in Amazon S3, and copy it to Amazon Redshift. We use Data Wrangler to connect our script to Amazon Redshift. This Python module is already included in the analytics environment. Complete the following steps to set up the necessary infrastructure:
Now we can write raw data to Amazon S3 in Parquet format and to Amazon Redshift.
A common partition strategy is to divide rows by year, month, and day from your date column and apply multi-level partitioning. This approach allows fast and cost-effective retrieval for all rows assigned to a particular year, month, or date. Another strategy is to partition your data by using a specific column directly. For example, using review_date as a partition gives you a single level of directory for every unique date and stores the corresponding data in it.
In this post, we prepare our data for the multi-level date partitioning strategy. We start by extracting year, month, and day from our date column:
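A sketch of the extraction, assuming the date column of the sample dataset is named review_date:

```python
# Derive year, month, and day partition columns from the review date
df["review_date"] = pd.to_datetime(df["review_date"])
df["year"] = df["review_date"].dt.year
df["month"] = df["review_date"].dt.month
df["day"] = df["review_date"].dt.day
```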
With our partition columns ready, we can use the awswrangler module to write to Amazon S3 in Parquet format:
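A sketch of the partitioned write; the target S3 path is a placeholder:

```python
# Write the DataFrame as a partitioned Parquet dataset on S3
wr.s3.to_parquet(
    df=df,
    path="s3://my-blog-bucket/amazon_video_review/",
    dataset=True,
    mode="overwrite",
    partition_cols=["year", "month", "day"],
)
```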
To query your partitioned data in Amazon S3, you can use Athena, our serverless interactive query service. For more information, refer to Partitioning data with Athena.
Next, we write our DataFrame directly to Amazon Redshift internal storage by using Data Wrangler. Writing to Amazon Redshift internal storage is advised when you're going to use this data frequently for complex analytics, large SQL operations, or business intelligence (BI) reporting. In Amazon Redshift, it's advised to define the distribution style and sort key on the table to improve cluster performance. If you're not sure about the right values for these parameters, you can use the Amazon Redshift auto distribution style and sort key and follow Amazon Redshift Advisor recommendations. For more information on Amazon Redshift data distribution, refer to Working with data distribution styles.
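A sketch of the copy through Data Wrangler; the AWS Glue connection name, staging path, schema, and table name are placeholders, and depending on your cluster setup you may also need to pass an IAM role for the COPY operation:

```python
# Connect to Amazon Redshift through a pre-created AWS Glue connection
con = wr.redshift.connect("my-redshift-connection")

# Stage the data on S3 and COPY it into a Redshift table
wr.redshift.copy(
    df=df,
    path="s3://my-blog-bucket/redshift-staging/",  # temporary staging location
    con=con,
    schema="public",
    table="amazon_video_review",
    mode="overwrite",
    diststyle="AUTO",
)
con.close()
```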
Scenario 3: Data ingestion into Amazon RDS
In this scenario, we open a connection from AWS Glue Python shell and ingest the data directly into Amazon RDS for MySQL. The infrastructure you require for this scenario is an RDS for MySQL database in the same Region as the AWS Glue Python shell job. For more information, refer to Creating a MySQL DB instance and connecting to a database on a MySQL DB instance.
With the PyMySQL and boto3 modules, we can now connect to our RDS for MySQL database and write our DataFrame into a table.
Prepare the variables for the connection and generate a database authentication token for the database login:
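A sketch of this step; the host, port, user, database name, and Region are placeholders for your own RDS for MySQL instance:

```python
# Connection settings for the RDS for MySQL instance (placeholders)
rds_host = "my-database.xxxxxxxx.us-east-1.rds.amazonaws.com"
rds_port = 3306
db_user = "admin_user"
db_name = "demo_blog"
region = "us-east-1"

# Generate a short-lived IAM authentication token for the database login
rds_client = boto3.client("rds", region_name=region)
auth_token = rds_client.generate_db_auth_token(
    DBHostname=rds_host,
    Port=rds_port,
    DBUsername=db_user,
    Region=region,
)
```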
For more information about using an SSL connection with your RDS instance, refer to Using SSL/TLS to encrypt a connection to a DB instance.
Connect to your RDS for MySQL database and write a Pandas DataFrame into the table with the following code:
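A sketch of the connection and insert; the CA bundle path, table name, and column names are illustrative and must match your own table and dataset:

```python
import pymysql

# Connect using the IAM auth token; the CA bundle enables SSL verification
connection = pymysql.connect(
    host=rds_host,
    port=rds_port,
    user=db_user,
    password=auth_token,
    database=db_name,
    ssl={"ca": "/tmp/global-bundle.pem"},
)

# Convert each row to plain Python types so PyMySQL can escape them
rows = [
    (str(r.review_id), str(r.product_title), int(r.star_rating), r.review_date.to_pydatetime())
    for r in df.itertuples(index=False)
]

# Insert the DataFrame rows into the target table
insert_sql = (
    "INSERT INTO amazon_video_review "
    "(review_id, product_title, star_rating, review_date) "
    "VALUES (%s, %s, %s, %s)"
)
with connection.cursor() as cursor:
    cursor.executemany(insert_sql, rows)
connection.commit()
```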
You need to create the table in Amazon RDS for MySQL prior to running the insert statement. Use the following DDL to create the demo_blog.amazon_video_review table:
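The following sketch runs an illustrative version of such a DDL over the same PyMySQL connection; the column definitions mirror the insert above and should be adapted to your dataset:

```python
# Illustrative table definition; run this before the insert above
create_sql = """
CREATE TABLE IF NOT EXISTS demo_blog.amazon_video_review (
    review_id     VARCHAR(64) PRIMARY KEY,
    product_title VARCHAR(512),
    star_rating   INT,
    review_date   DATETIME
)
"""
with connection.cursor() as cursor:
    cursor.execute(create_sql)
connection.commit()
```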
When the data is available in the database, you can perform a simple aggregation as follows:
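For example, a hypothetical count of reviews per star rating over the same connection:

```python
# Count reviews per star rating and print the result
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT star_rating, COUNT(*) "
        "FROM demo_blog.amazon_video_review GROUP BY star_rating"
    )
    for star_rating, review_count in cursor.fetchall():
        print(f"{star_rating} stars: {review_count} reviews")
connection.close()
```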
Create and run your job
After you finalize your code, you can run it from AWS Glue Studio or save it in a script .py file and submit a job with the AWS CLI. Remember to add the necessary parameters in your job creation depending on the scenario you're testing. The following job parameters cover all the scenarios:
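A sketch of a default-arguments block that covers the three scenarios; module names and versions are examples, and the same values can be added as job parameters in AWS Glue Studio:

```bash
--default-arguments '{
    "library-set": "analytics",
    "--additional-python-modules": "pandas-profiling==3.2.0,openpyxl==3.0.10"
}'
```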
Review the results
In this section, we review the expected results for each scenario.
In Scenario 1, pandas-profiling generates a data report in HTML format. In this report, you can visualize missing values, duplicated values, size estimations, or correlations between columns, as shown in the following screenshots.
For Scenario 2, you can first review the Parquet file written to Amazon S3 with the year/month/day partitions.
Then you can use the Amazon Redshift query editor to query and visualize the data.
For Scenario 3, you can use a JDBC connection or database IDE to connect to your RDS database and query the data that you just ingested.
Clean up
AWS Glue Python shell is a serverless routine that won't incur any extra charges when it isn't running. However, this demo used several services that will incur additional costs. Clean up after completing this walkthrough with the following steps:
- Remove the contents of your S3 bucket and delete it. If you encounter any errors, refer to Why can't I delete my S3 bucket using the Amazon S3 console or AWS CLI, even with full or root permissions.
- Stop and delete the RDS DB instance. For instructions, see Deleting a DB instance.
- Stop and delete the Amazon Redshift cluster. For instructions, refer to Deleting a cluster.
Conclusion
In this post, we introduced AWS Glue Python shell with Python 3.9 support and additional pre-loaded libraries. We presented the customizable Python shell environment with pre-loaded libraries and PIP support to install other native or custom Python libraries. We covered the new features and how to get started through AWS Glue Studio and the AWS CLI. We also demonstrated a step-by-step tutorial of how you can easily use these new capabilities to accomplish common ETL use cases.
To learn more about AWS Glue Python shell and this new feature, refer to Python shell jobs in AWS Glue.
About the authors
Alunnata Mulyadi is an Analytics Specialist Solutions Architect at AWS. Alun has over a decade of experience in data engineering, helping customers address their business and technical needs. Outside of work, he enjoys photography, cycling, and basketball.
Quim Bellmunt is an Analytics Specialist Solutions Architect at Amazon Web Services. Quim has a PhD in Computer Science and Knowledge Graph, specializing in data modeling and transformation. With over 6 years of hands-on experience in the analytics and AI/ML space, he enjoys helping customers create systems that scale with their business needs and generate value from their data. Outside of work, he enjoys walking with his dog and cycling.
Kush Rustagi is a Software Development Engineer on the AWS Glue team with over 4 years of experience in the industry, having worked on large-scale financial systems in Python and C++, and is now applying his scalable system design experience to cloud development. Before working on Glue Python Shell, Kush worked on anomaly detection challenges in the fintech space. Apart from exploring new technologies, he enjoys EDM, traveling, and learning non-programming languages.