NORDICS20 - Assets

Deep Learning on AWS

Amazon Web Services Resources EMEA

Issue link: https://emea-resources.awscloud.com/i/1242450


Building, testing, and maintaining machine learning frameworks requires work. If upstream changes are breaking, you must update your scripts as well. However, it is important to take advantage of the latest work from the open source AI community and to support the requirements of internal projects.

Performance Optimization

The deep learning stack has many layers. To extract maximum performance from the stack, you must fine-tune every layer of software, including drivers, libraries, and dependencies. Poorly tuned layers can increase model training time and, in turn, training cost. Tuning the deep learning stack requires multiple iterations of testing and specialized skills. Most often, tuning is required for both the training and inference stacks. Different stacks may have different bottlenecks (network, CPU, or storage I/O) that must be resolved through tuning.

Collaborative Development

In most cases, a team of deep learning engineers and deep learning scientists collaborates on a deep learning project. The team must conform to certain standards in order to collaborate and to provide feedback on each other's work. As a project moves from proof of concept to production, it is important to track the model's performance over time for consistency. Consistency is required between the dataset, the hardware versions, and the software configurations of the different stacks used during training by different practitioners. Consistency is also required between the training stack and the inference stack; both the stack and the results it produces should be reproducible.

Infrastructure Management

To prove the value of a model, it should be trained with appropriate hyperparameters and on a large dataset. The search for the optimal hyperparameters requires multiple jobs to run concurrently on a large dataset. This exercise requires working with job schedulers, orchestrators, and monitoring tools, which creates a dependency on IT assets managed by centralized IT teams. Even after the first version of the model is fully developed, a DevOps team must support the infrastructure required to retrain the model on fresh data, and to monitor and support the endpoint used to deploy the model.
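The hyperparameter search described above can be sketched at a small scale. The following is a minimal, single-machine illustration of running several trials concurrently; the `train` function and its parameter grid are hypothetical stand-ins for real training jobs, which in practice would be submitted to a scheduler or orchestrator rather than a local thread pool.

```python
# Minimal sketch: run several hyperparameter trials concurrently.
# train() is a hypothetical stand-in for a real training job.
import itertools
from concurrent.futures import ThreadPoolExecutor


def train(lr, batch_size):
    """Dummy 'training job': returns a mock validation loss for the trial."""
    return {"lr": lr, "batch_size": batch_size,
            "val_loss": abs(lr - 0.01) + 1.0 / batch_size}


def grid_search(learning_rates, batch_sizes, max_workers=4):
    """Run every (lr, batch_size) combination concurrently; keep the best."""
    grid = list(itertools.product(learning_rates, batch_sizes))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map accepts one iterable per train() argument.
        results = list(pool.map(train, *zip(*grid)))
    return min(results, key=lambda r: r["val_loss"])


if __name__ == "__main__":
    best = grid_search([0.1, 0.01, 0.001], [32, 64, 128])
    print(best)
```

In a production setting, each grid point would become an independent job with its own logs and metrics, which is exactly where the schedulers, orchestrators, and monitoring tools mentioned above come in.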
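One common way to work toward the consistency and reproducibility discussed under Collaborative Development is to record a manifest for each training run. The sketch below is illustrative only: the field names and the `dataset_fingerprint` helper are assumptions, not a standard format, but the idea of hashing the dataset and pinning software versions carries over to real pipelines.

```python
# Minimal sketch of a run manifest for reproducibility: record a dataset
# fingerprint, software versions, and the random seed for a training run,
# so the same stack can be checked or reconstructed later (for example,
# when validating the inference stack against the training stack).
# Field names and helpers here are illustrative, not a standard.
import hashlib
import json
import platform
import sys


def dataset_fingerprint(records):
    """Hash the serialized training records so dataset drift is detectable."""
    digest = hashlib.sha256()
    for rec in records:
        digest.update(json.dumps(rec, sort_keys=True).encode())
    return digest.hexdigest()


def build_manifest(records, seed):
    """Collect the facts needed to reproduce (or audit) a training run."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "dataset_sha256": dataset_fingerprint(records),
        "seed": seed,
    }


if __name__ == "__main__":
    data = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
    print(json.dumps(build_manifest(data, seed=42), indent=2))
```

Comparing manifests across practitioners, or between the training and inference environments, makes mismatches in data or software configuration visible before they show up as unexplained changes in model performance.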
