Amazon's new service, SageMaker Data Wrangler, is meant to make it easier for data scientists to prepare data for machine learning training. The company also launched SageMaker Feature Store, which is available in SageMaker Studio, itself a relatively new service.
With Feature Store, users can name, find, organize, and share machine learning features.
Amazon is also planning to launch SageMaker Pipelines, a new service that integrates with the platform. It will bring CI/CD to machine learning, letting teams create and automate workflows and keep an audit trail for model components like data configurations and training.
Infrastructure won’t be a problem for too long
AWS CEO Andy Jassy said in his keynote at the company’s re:Invent conference that data preparation remains one of the significant problems in the machine learning industry. Typically, users have to write their own queries and code to get the data from the data store.
Then, they have to write more queries to transform the data and combine features to get the desired outcome.
All this work does not have anything to do with building the models but has everything to do with the infrastructure used to create the models. With inefficiencies like this, it becomes harder to get things done on time.
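To make the manual steps described above concrete, here is a minimal Python sketch of that kind of hand-written preparation: query a data store, then transform the rows and combine them into features. It does not use any SageMaker API. The in-memory SQLite database, the `orders` table, and the helper names `fetch_orders` and `build_features` are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3

# Hypothetical data store: an in-memory SQLite table standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (2, 10.0);
    """
)


def fetch_orders(connection):
    """Step 1: write a query to pull raw rows out of the data store."""
    query = "SELECT customer_id, amount FROM orders ORDER BY customer_id"
    return connection.execute(query).fetchall()


def build_features(rows):
    """Steps 2 and 3: aggregate raw rows, then combine them into per-customer features."""
    totals, counts = {}, {}
    for customer_id, amount in rows:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
        counts[customer_id] = counts.get(customer_id, 0) + 1
    return {
        cid: {
            "total_spend": totals[cid],
            "order_count": counts[cid],
            "avg_order": totals[cid] / counts[cid],
        }
        for cid in totals
    }


features = build_features(fetch_orders(conn))
print(features)
```

None of this code touches a model, which is the point the keynote was making: the effort goes into plumbing rather than modeling.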
Making modeling easier
Data Wrangler comes with more than 300 built-in, pre-configured data transformations that users can apply to, for example, convert column types or impute missing data with mean or median values.
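As an illustration of what two such transformations do, the following plain-Python sketch converts a column of raw strings to numbers and fills the gaps with the mean or the median. This is not Data Wrangler's implementation or API. The function names `to_float` and `impute_column` and the sample `ages` column are assumptions made up for this example.

```python
from statistics import mean, median


def to_float(value):
    """Convert a raw cell to float; return None for blanks or unparseable text."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        return None


def impute_column(raw_values, strategy="mean"):
    """Cast a column to float and fill missing entries with the mean or median."""
    typed = [to_float(v) for v in raw_values]
    observed = [v for v in typed if v is not None]
    if not observed:
        raise ValueError("column has no usable values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in typed]


# Hypothetical column with two missing entries (an empty string and None).
ages = ["34", "", "29", None, "61"]
mean_filled = impute_column(ages, "mean")
median_filled = impute_column(ages, "median")
print(mean_filled)
print(median_filled)
```

The median is less sensitive to outliers than the mean, which is one reason a tool might offer both options.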
Built-in visualization tools can help identify potential errors, and other tools check the data for inconsistencies before the model is deployed.
All the workflows can be saved in a notebook or as a script for teams to replicate. With the introduction of SageMaker Pipelines, users can automate the rest of the workflow.