Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. Apache Hive has performed well for a long time. But with advancements in infrastructure such as cloud computing and multicore machines with large RAM, Apache Spark started to gain visibility by performing better than Apache Hive.
Customers now want to migrate their Apache Hive workloads to Apache Spark in the cloud to get the benefits of optimized runtime, cost reduction through transient clusters, better scalability by decoupling storage and compute, and flexibility. However, migration from Apache Hive to Apache Spark requires a significant amount of manual effort to write migration scripts and maintain different Spark job configurations.
In this post, we walk you through a solution that automates the migration from HiveQL to Spark SQL. The solution was used to migrate Hive with Oozie workloads to Spark SQL and run them on Amazon EMR for a large gaming customer. You can also use this approach to develop new jobs with Spark SQL and process them on Amazon EMR. This post assumes that you have a basic understanding of Apache Spark, Hive, and Amazon EMR.
Solution overview
In our example, we use Apache Oozie, which schedules Apache Hive jobs as actions to collect and process data every day.
We migrate these Oozie workflows with Hive actions by extracting the HQL files, along with their dynamic and static parameters, and converting them to be Spark compliant. Manual conversion is both time consuming and error prone. To convert the HQL to Spark SQL, you need to sort through existing HQLs, replace the parameters, and change the syntax for a large number of files.
Instead, we can use automation to speed up the migration and reduce heavy lifting tasks, costs, and risks.
We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. The second component consumes the generated metadata from the first component and prepares the run order of Spark SQL within a Spark session. With this solution, we support basic orchestration and scheduling with the help of AWS services like Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3). We can validate the solution by running queries in Amazon Athena.
In the following sections, we walk through these components and how to use these automations in detail.
Generate Spark SQL metadata
Our batch job consists of Hive steps scheduled to run sequentially. For each step, we run HQL scripts that extract, transform, and aggregate input data into one final Hive table, which stores data in HDFS. We use the following Oozie workflow parser script, which takes the input of an existing Hive job and generates the configuration artifacts needed for running the SQL using PySpark.
Oozie workflow XML parser
We create a Python script to automatically parse the Oozie jobs, including workflow.xml, co-ordinator.xml, job properties, and HQL files. This script can handle many Hive actions in a workflow by organizing the metadata at the step level into separate folders. Each step contains the list of SQLs, SQL paths, and their static parameters, which are input for the solution in the next step.
The process consists of two steps:
- The Python parser script takes as input the existing Oozie Hive job and its configuration files.
- The script generates a metadata JSON file for each step.
The following diagram outlines these steps and shows sample output.
Prerequisites
You need the following prerequisites:
- Python 3.8
- Python packages:
- sqlparse==0.4.2
- jproperties==2.1.1
- defusedxml==0.7.1
Setup
Complete the following steps:
- Install Python 3.8.
- Create a virtual environment.
- Activate the newly created virtual environment.
- Clone the project with Git.
- Install the dependent packages (a combined sketch of these commands follows the list).
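A minimal sketch of these setup commands, assuming Python 3.8 is available as `python3.8`; the environment name and repository URL are placeholders, not values from the original post:

```bash
# Create and activate a virtual environment (names are placeholders)
python3.8 -m venv oozie-parser-env
source oozie-parser-env/bin/activate

# Clone the project and install the Python packages listed in the prerequisites
git clone <repository-url> oozie-to-spark-parser
cd oozie-to-spark-parser
pip install sqlparse==0.4.2 jproperties==2.1.1 defusedxml==0.7.1
```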
Sample command
You can invoke the parser with a single command against the existing Oozie job folder and its configuration files.
The output is a folder for each step, containing the extracted SQL files and a metadata JSON file describing that step.
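The following is a hedged sketch of an invocation and the kind of layout it produces; the script name, flags, and paths are placeholders and are not taken from the original post:

```bash
# Hypothetical invocation: point the parser at an Oozie job folder and an output folder
python oozie_parser.py \
  --job-path ./sample_oozie_job_name \
  --output-path ./output

# Illustrative output layout: one folder per Hive action (step),
# each with the extracted SQL files and a metadata JSON file
#   ./output/sample_oozie_job_name/step1/step1.json
#   ./output/sample_oozie_job_name/step1/1.sql
#   ./output/sample_oozie_job_name/step1/2.sql
```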
Limitations
This method has the following limitations:
- The Python script parses only HiveQL actions from the Oozie workflow.xml.
- The Python script generates one file for each SQL statement and uses the sequence ID for file names. It doesn't name the SQL based on the functionality of the SQL.
Run Spark SQL on Amazon EMR
After we create the SQL metadata files, we use another automation script to run them with Spark SQL on Amazon EMR. This automation script supports custom UDFs by adding JAR files to spark submit. This solution uses DynamoDB for logging the run details of the SQLs for support and maintenance.
Architecture overview
The following diagram illustrates the solution architecture.
Prerequisites
You need the following prerequisites:
- Version:
- Spark 3.X
- Python 3.8
- Amazon EMR 6.1
Setup
Complete the following steps:
- Install the AWS Command Line Interface (AWS CLI) on your workspace by following the instructions in Installing or updating the latest version of the AWS CLI. To configure AWS CLI interaction with AWS, refer to Quick setup.
- Create two tables in DynamoDB: one to store metadata about jobs and steps, and another to log job runs.
- Use the AWS CLI to create the metadata table in DynamoDB.
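A minimal sketch of the create-table call, assuming on-demand capacity; the key schema follows the attribute table below (id as the partition key, step_id as the sort key):

```bash
aws dynamodb create-table \
  --table-name dw-etl-metadata \
  --attribute-definitions AttributeName=id,AttributeType=S AttributeName=step_id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH AttributeName=step_id,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```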
You can check on the DynamoDB console that the table dw-etl-metadata is successfully created.
The metadata table has the following attributes.
| Attributes | Type | Comments |
| --- | --- | --- |
| id | String | partition_key |
| step_id | String | sort_key |
| step_name | String | Step description |
| sql_base_path | String | Base path |
| sql_info | List | List of SQLs in ETL pipeline |
| . | sql_path | SQL file name |
| . | sql_active_flag | y/n |
| . | sql_load_order | Order of SQL |
| . | sql_parameters | Parameters in SQL and values |
| spark_config | Map | Spark configs |
- Use the AWS CLI to create the log table in DynamoDB.
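A similar sketch for the log table, again assuming on-demand capacity, with job_run_id as the partition key and id as the sort key:

```bash
aws dynamodb create-table \
  --table-name dw-etl-pipelinelog \
  --attribute-definitions AttributeName=job_run_id,AttributeType=S AttributeName=id,AttributeType=S \
  --key-schema AttributeName=job_run_id,KeyType=HASH AttributeName=id,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```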
You can check on the DynamoDB console that the table dw-etl-pipelinelog is successfully created.
The log table has the following attributes.
| Attributes | Type | Comments |
| --- | --- | --- |
| job_run_id | String | partition_key |
| id | String | sort_key (UUID) |
| end_time | String | End time |
| error_description | String | Error in case of failure |
| expire | Number | Time to Live |
| sql_seq | Number | SQL sequence number |
| start_time | String | Start time |
| status | String | Status of the job |
| step_id | String | Job ID the SQL belongs to |
The log table can grow quickly if there are too many jobs or if they run frequently. We can archive the records to Amazon S3 if they are no longer used, or use the Time to Live feature of DynamoDB to clean up old records.
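If you choose the Time to Live route, enabling TTL on the expire attribute looks roughly like the following sketch:

```bash
aws dynamodb update-time-to-live \
  --table-name dw-etl-pipelinelog \
  --time-to-live-specification "Enabled=true, AttributeName=expire"
```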
- Run the first command to set the variable if you have an existing bucket that can be reused. If not, create an S3 bucket to store the Spark SQL code, which will be run by Amazon EMR.
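A sketch of both options; the bucket name keeps the placeholder convention used later in this post and the Region is an assumption:

```bash
# Reuse an existing bucket by setting the variable (placeholder name shown) ...
export S3_BUCKET=unique-code-bucket-name-#######

# ... or create a new bucket in your Region
aws s3 mb s3://$S3_BUCKET --region us-east-1
```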
- Enable secure transfer on the bucket.
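One common way to do this is a bucket policy that denies requests not sent over HTTPS; the following is a sketch with a placeholder bucket name, not the exact policy from the original post:

```bash
aws s3api put-bucket-policy --bucket unique-code-bucket-name-####### --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::unique-code-bucket-name-#######",
      "arn:aws:s3:::unique-code-bucket-name-#######/*"
    ],
    "Condition": { "Bool": { "aws:SecureTransport": "false" } }
  }]
}'
```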
- Clone the project to your workspace.
- Create a ZIP file and upload it to the code bucket created earlier.
- Upload the ETL driver code to the S3 bucket.
- Upload sample job SQLs to Amazon S3 (a combined sketch of these clone and upload commands follows this step).
- Add a sample step (./sample_oozie_job_name/step1/step1.json) to DynamoDB (for more information, refer to Write data to a table using the console or AWS CLI).
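A sketch of loading the sample step metadata from the AWS CLI, under the assumption that step1.json is already in DynamoDB JSON format:

```bash
aws dynamodb put-item \
  --table-name dw-etl-metadata \
  --item file://sample_oozie_job_name/step1/step1.json
```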
- In the Athena query editor, create the database base.
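You can run the CREATE DATABASE statement directly in the query editor, or from the AWS CLI as in this sketch; the query results location is a placeholder:

```bash
aws athena start-query-execution \
  --query-string "CREATE DATABASE IF NOT EXISTS base" \
  --result-configuration OutputLocation=s3://$S3_BUCKET/athena-results/
```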
- Copy the sample data files from the repo to Amazon S3 (a sketch of the copy commands follows this list):
  - Copy us_current.csv.
  - Copy states_current.csv.
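A sketch of the copy commands; the source paths in the repo and the target prefixes are assumptions, and the target prefixes must match the locations referenced by the DDL files in the next step:

```bash
aws s3 cp ./sample_data/us_current.csv s3://$S3_BUCKET/data/us_current/
aws s3 cp ./sample_data/states_current.csv s3://$S3_BUCKET/data/states_current/
```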
- To create the source tables in the base database, run the DDLs present in the repo in the Athena query editor:
  - Run the ./sample_data/ddl/states_current.q file after modifying the S3 path to the bucket you created.
  - Run the ./sample_data/ddl/us_current.q file after modifying the S3 path to the bucket you created.
The ETL driver file implements the Spark driver logic. It can be invoked locally or on an EMR instance.
- Launch an EMR cluster.
- Make sure to select Use for Spark table metadata under AWS Glue Data Catalog settings.
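A hedged sketch of launching such a cluster from the AWS CLI, with the AWS Glue Data Catalog enabled as the Spark table metastore; the cluster name, instance type, instance count, and release label are assumptions based on the prerequisites above:

```bash
aws emr create-cluster \
  --name "hive-to-spark-migration" \
  --release-label emr-6.1.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations '[{
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }]'
```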
- Add the following steps to the EMR cluster.
| Step type | Spark Application |
| --- | --- |
| Name | Any Name |
| Deploy mode | Client |
| Spark-submit options | --py-files s3://unique-code-bucket-name-#####/framework/code.zip |
| Application location | s3://unique-code-bucket-name-####/framework/etl_driver.py |
| Arguments | --step_id sample_oozie_job_name#step1 --job_run_id sample_oozie_job_name#step1#2022-01-01-12-00-01 --code_bucket=s3://unique-code-bucket-name-#######/DW --metadata_table=dw-etl-metadata --log_table_name=dw-etl-pipelinelog --sql_parameters DATE=2022-02-02::HOUR=12::code_bucket=s3://unique-code-bucket-name-####### |
| Action on failure | Continue |
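Equivalently, you can add the step from the AWS CLI; the following is a sketch assembled from the values in the preceding table, with a placeholder cluster ID and the same placeholder bucket names:

```bash
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps '[{
  "Type": "Spark",
  "Name": "sample_oozie_job_name#step1",
  "ActionOnFailure": "CONTINUE",
  "Args": [
    "--deploy-mode", "client",
    "--py-files", "s3://unique-code-bucket-name-#####/framework/code.zip",
    "s3://unique-code-bucket-name-####/framework/etl_driver.py",
    "--step_id", "sample_oozie_job_name#step1",
    "--job_run_id", "sample_oozie_job_name#step1#2022-01-01-12-00-01",
    "--code_bucket=s3://unique-code-bucket-name-#######/DW",
    "--metadata_table=dw-etl-metadata",
    "--log_table_name=dw-etl-pipelinelog",
    "--sql_parameters", "DATE=2022-02-02::HOUR=12::code_bucket=s3://unique-code-bucket-name-#######"
  ]
}]'
```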
The following table summarizes the script arguments.
| Script Argument | Argument Description |
| --- | --- |
| deploy-mode | Spark deploy mode. Client/Cluster. |
| name <jobname>#<stepname> | Unique name for the Spark job. This can be used to identify the job on the Spark History UI. |
| py-files <s3 path for code>/code.zip | S3 path for the code. |
| <s3 path for code>/etl_driver.py | S3 path for the driver module. This is the entry point for the solution. |
| step_id <jobname>#<stepname> | Unique name for the step. This refers to the step_id in the metadata entered in DynamoDB. |
| job_run_id <random UUID> | Unique ID to identify the log entries in DynamoDB. |
| log_table_name <DynamoDB Log table name> | DynamoDB table for storing the job run details. |
| code_bucket <s3 bucket> | S3 path for the SQL files that are uploaded in the job setup. |
| metadata_table <DynamoDB Metadata table name> | DynamoDB table for storing the job metadata. |
| sql_parameters DATE=2022-07-04::HOUR=00 | Any additional or dynamic parameters expected by the SQL files. |
Validation
After completion of the EMR step, you should have data in the S3 bucket for the table base.states_daily. We can validate the data by querying the table base.states_daily in Athena.
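A sketch of a validation query run through the AWS CLI; you can also run the same statement directly in the Athena query editor, and the results location is a placeholder:

```bash
aws athena start-query-execution \
  --query-string "SELECT * FROM base.states_daily LIMIT 10" \
  --result-configuration OutputLocation=s3://$S3_BUCKET/athena-results/
```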
Congratulations, you have converted an Oozie Hive job to Spark and run it on Amazon EMR successfully.
Solution highlights
This solution has the following benefits:
- It avoids boilerplate code for any new pipeline and requires less code maintenance
- Onboarding any new pipeline only needs the metadata setup: the DynamoDB entries and the SQL files placed in the S3 path
- Any common modifications or enhancements can be done at the solution level and are reflected across all jobs
- DynamoDB metadata provides insight into all active jobs and their optimized runtime parameters
- For each run, this solution persists the SQL start time, end time, and status in a log table to identify issues and analyze runtimes
- It supports Spark SQL and UDF functionality. Custom UDFs can be provided externally to the spark submit command
Limitations
This method has the following limitations:
- The solution only supports SQL queries on Spark
- Each SQL should be a Spark action like insert, create, drop, and so on
In this post, we explained the scenario of migrating an existing Oozie job. We can use the PySpark solution independently for any new development by creating DynamoDB entries and SQL files.
Clean up
Delete all the resources created as part of this solution to avoid ongoing charges (a combined sketch of the cleanup commands follows the list):
- Delete the DynamoDB tables.
- Delete the S3 bucket.
- Stop the EMR cluster if it wasn't a transient cluster.
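A combined sketch of the cleanup commands; the cluster ID and bucket name are placeholders:

```bash
# Delete the DynamoDB tables
aws dynamodb delete-table --table-name dw-etl-metadata
aws dynamodb delete-table --table-name dw-etl-pipelinelog

# Empty and delete the S3 bucket
aws s3 rb s3://unique-code-bucket-name-####### --force

# Terminate the EMR cluster if it wasn't a transient cluster
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
```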
Conclusion
In this post, we presented two automated solutions: one for parsing Oozie workflows and HiveQL files to generate metadata, and a PySpark solution for running SQLs using the generated metadata. We successfully implemented these solutions to migrate a Hive workload to EMR Spark for a major gaming customer and achieved about 60% effort reduction.
For a Hive with Oozie to Spark migration, these solutions help complete the code conversion quickly so you can focus on performance benchmarking and testing. Developing a new pipeline is also fast: you only need to create the SQL logic, test it using Spark (shell or notebook), add the metadata to DynamoDB, and test via the PySpark SQL solution. Overall, you can use the solution in this post to accelerate Hive to Spark code migration.
About the authors
Vinay Kumar Khambhampati is a Lead Consultant with the AWS ProServe Team, helping customers with cloud adoption. He is passionate about big data and data analytics.
Sandeep Singh is a Lead Consultant at AWS ProServe, focused on analytics, data lake architecture, and implementation. He helps enterprise customers migrate and modernize their data lake and data warehouse using AWS services.
Amol Guldagad is a Data Analytics Consultant based in India. He has worked with customers in different industries like banking and financial services, healthcare, power and utilities, manufacturing, and retail, helping them solve complex challenges with large-scale data platforms. At AWS ProServe, he helps customers accelerate their journey to the cloud and innovate using AWS analytics services.