Senior System Software Engineer - Scientific Computing PaaS
Company: NVIDIA Corporation
Location: Santa Clara
Posted on: November 6, 2024
Job Description:
We are seeking a Senior System Software Engineer to help us build a
cloud-based, accelerated scientific computing platform as a service
on NVIDIA DGX Cloud. This platform enables physics-based numerical
simulation solvers and AI-based training, inference, and
visualization workflows for physical science and engineering
problems. Those applications include weather prediction, climate
modeling, industrial design, and digital twin simulation in various
domains, e.g., aerospace, automotive, sports, renewable energy,
biomedical, and many more.
Are you passionate about solving rewarding problems at scale? Do you
enjoy crafting robust, critical services for compute- and
data-intensive workloads? If so, you may be a phenomenal fit for our
team!
What you'll be doing:
- Design, build, deploy, and operate cloud-native microservices
and APIs for scientific computing workloads on DGX Cloud.
- Design services and take ownership of the underlying cloud
infrastructure for physics-informed and data-driven scientific
workflows.
- Design novel algorithms and actively engage with operations to
increase overall system performance across the stack, drawing on a
deep understanding of application code (e.g., DL frameworks,
numerical solvers, microservices, APIs) and heterogeneous
accelerated computing with CPUs and GPUs.
- Design, build, deploy, and operate scalable I/O infrastructure
for checkpointing, data loading, and pre- and post-processing of
data.
- Optimize compute, storage, and network architecture specific to
physics- and simulation-driven applications.
What we need to see:
- BS/MS degree in Computer Science or a related area, or equivalent
experience.
- 10+ years of experience building and operating distributed,
compute- and data-intensive platforms as a service in the cloud.
- Proven skill in a compiled language (Go, Rust, C++, or
similar).
- Strong foundational knowledge of cloud computing, e.g., "The
Datacenter as a Computer" architecture, cloud security
architecture, virtualization (CPU, memory, and I/O), resource
pooling, and elasticity.
- Proven skills in distributed systems and parallel processing,
e.g., system models of distributed computation (topology
abstraction, logical time, synchronization, and deadlock detection),
fault tolerance and failure detection, consensus and agreement
protocols, parallel algorithms, shared-memory and distributed-memory
architectures, message passing (MPI, NCCL), and cluster scalability
and performance.
- Hands-on debugging skills with processes, threads, deadlocks and
synchronization, scheduling, IPC, memory management, file systems,
and I/O structures.
- Strong evidence of algorithmic thinking and system design skills,
e.g., recursion, graphs, trees, stacks, and queues, as well as
large-scale, loosely coupled distributed system design and
operational experience.
- Be self-motivated, have strong interpersonal skills, and be
able to work independently with multiple teams with minimal
direction.
Ways to stand out from the crowd:
- Have built, deployed, and operated AI platforms on HPC clusters.
Have built, deployed, and operated cloud-native systems including
distributed storage, scheduling, and orchestration among compute,
storage, and network.
- Experience configuring and troubleshooting hardware, operating
systems, kernels, and compilers for maximum performance.
- Hands-on debugging skills to optimize the performance of compute,
networking, and I/O frameworks. Extensive experience working with
third-party source code for debugging and customization.
NVIDIA is widely
considered to be one of the technology world's most desirable
employers. We have some of the most forward-thinking and
hardworking people on the planet working for us. If you're creative
and autonomous, we want to hear from you!
The base salary range is
180,000 USD - 339,250 USD. Your base salary will be determined
based on your location, experience, and the pay of employees in
similar positions. You will also be eligible for equity and
benefits.
NVIDIA accepts applications on an ongoing basis.
NVIDIA is
committed to fostering a diverse work environment and proud to be
an equal opportunity employer. As we highly value diversity in our
current and future employees, we do not discriminate (including in
our hiring and promotion practices) on the basis of race, religion,
color, national origin, gender, gender expression, sexual
orientation, age, marital status, veteran status, disability status
or any other characteristic protected by law.