Job Description
Overview

This is a Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training - Performance Optimization role within AWS Utility Computing (UC) and Annapurna Labs. The role focuses on development, enablement and performance tuning of ML model training and inference on the AWS Neuron stack, including Trn1/Inf1 servers, for large-scale model families and cutting-edge cloud AI services. The candidate will work with a team to enable distributed training and inference across PyTorch, TensorFlow, and JAX using XLA and the Neuron compiler/runtime stack, and will implement and optimize using libraries such as FSDP and DeepSpeed.

This role is part of the ML Apps team, which collaborates with chip architects, compiler engineers and runtime engineers to build, tune and optimize distributed training solutions for Neuron-based systems.

Key responsibilities

- Lead efforts to build distributed training and inference support into PyTorch, TensorFlow, and JAX using XLA and the Neuron stack.
- Tune models to achieve the highest performance and efficiency on AWS Trainium and Inferentia silicon and on Trn1/Inf1 servers.
- Collaborate with chip architects, compiler engineers and runtime engineers to create, build and tune distributed training solutions with Trn1.
- Develop and enable support for a wide variety of ML model families (e.g., GPT-2, GPT-3 and beyond, Stable Diffusion, Vision Transformers, and more).
- Train large models with Python and integrate distributed training libraries such as FSDP and DeepSpeed into Neuron-based systems.

Basic Qualifications

- 5+ years of non-internship professional software development experience
- 5+ years of programming experience in at least one software language
- 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems
- 5+ years of experience with the full software development lifecycle, including coding standards, code reviews, source control, build processes, testing, and operations
- Experience as a mentor, tech lead or leader of an engineering team

Preferred Qualifications

- Bachelor's degree in computer science or equivalent
- Machine learning knowledge in frameworks and end-to-end model training

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture supports accommodations for disability during the application and hiring process. For more information, visit the Amazon accommodations page. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Our compensation reflects the cost of labor across several US geographic markets. The base pay range for this position is $151,300/year to $261,500/year; pay is based on location and experience. Amazon is a total compensation company; depending on the role, equity, sign-on payments and other benefits may be provided.

This position will remain posted until filled. Applicants should apply via our internal or external career site.

Posted: May 16, 2025 (Updated about 17 hours ago)

Important FAQs for current Government employees

Before proceeding, please review the following FAQs: https://www.amazon.jobs/en/faqs#faqs-for-us-government-employees