Machine Learning Infrastructure Engineer

Arcee AI Remote

Company

Arcee AI

Location

Remote

Type

Full Time

Job Description

About Us:
Arcee AI is a cutting-edge AI company that empowers enterprises to own their GenAI strategy. We're a team of passionate and innovative engineers, researchers, and industry experts dedicated to pushing the boundaries of AI technology. We're looking for an exceptional Machine Learning Infrastructure Engineer to join our team and help design, develop, and deploy AI-powered solutions that meet the highest standards of quality, reliability, and performance.


Job Summary:

As a Machine Learning Infrastructure Engineer, you will be responsible for designing, developing, and maintaining the infrastructure that powers our machine learning models. You will work closely with data scientists, engineers, and researchers to ensure seamless integration of machine learning models into our production environment. Your expertise will enable us to scale our machine learning capabilities, improve model performance, and reduce time-to-market.


Key Responsibilities:

Design and Implementation:

    • Design and implement scalable, efficient, and reliable machine learning infrastructure (e.g., containerization, orchestration, and cloud services).
    • Develop and maintain infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or Google Cloud Deployment Manager.

Model Serving and Deployment:

    • Design and implement model serving platforms (e.g., TensorFlow Serving, AWS SageMaker, or Azure Machine Learning) for efficient model deployment and management.
    • Develop and maintain automated model deployment pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI.

Data Engineering:

    • Collaborate with data engineers to design and implement data pipelines that feed machine learning models.
    • Ensure data quality, integrity, and security throughout the data lifecycle.

Monitoring and Optimization:

    • Develop and implement monitoring and logging solutions (e.g., Prometheus, Grafana, or ELK Stack) to track model performance, latency, and system health.
    • Optimize infrastructure resources and model performance using techniques like hyperparameter tuning, model pruning, and knowledge distillation.

Collaboration and Communication:

    • Work closely with data scientists, engineers, and researchers to identify infrastructure needs and develop solutions.
    • Communicate technical information effectively to both technical and non-technical stakeholders.

Staying Up-to-Date:

    • Stay current with industry trends, emerging technologies, and best practices in machine learning infrastructure.
    • Participate in conferences, meetups, and online forums to expand knowledge and network with peers.


Ideal Candidate: 

Cloud Computing and Infrastructure:
  - Experience with major cloud platforms (AWS, Azure, GCP)
  - Kubernetes expertise for container orchestration
  - Infrastructure-as-Code (IaC) skills (e.g., Terraform, CloudFormation)

Machine Learning Operations (MLOps):
  - Familiarity with ML model lifecycle management
  - Experience with ML model serving frameworks (e.g., vLLM, TorchServe, SGLang)
  - Knowledge of model versioning and experiment tracking tools

Deep Learning and NLP:
  - Strong understanding of transformer architectures and LLMs
  - Experience with popular deep learning frameworks (PyTorch)
  - Familiarity with NLP concepts and techniques

API Development and Management:
  - RESTful API design and implementation
  - API gateway management and security
  - Experience with OpenAPI/Swagger specifications

Performance Optimization:
  - Proficiency in GPU acceleration techniques
  - Experience with model quantization and pruning
  - Knowledge of distributed inference and parallel computing

Programming Languages:
  - Strong Python skills
  - Familiarity with C++ for potential low-level optimizations
  - Shell scripting for automation



Requirements:

Education:

    • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Experience:

    • 3+ years of experience in machine learning infrastructure, DevOps, or a related field.
    • Experience with cloud providers (e.g., AWS, GCP, or Azure) and containerization (e.g., Docker).

Technical Skills:

    • Proficiency in programming languages like Python, Java, or C++.
    • Experience with machine learning frameworks like TensorFlow, PyTorch, or Scikit-learn.
    • Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
    • Knowledge of container orchestration tools like Kubernetes or Docker Swarm.

Soft Skills:

    • Excellent communication, collaboration, and problem-solving skills.
    • Ability to work in a fast-paced environment and prioritize tasks effectively.


Nice to Have:

Certifications:

    • Cloud provider certifications (e.g., AWS Certified DevOps Engineer or GCP Professional Cloud Developer).
    • Machine learning certifications (e.g., TensorFlow Certified Developer or PyTorch Certified Engineer).

Experience with:

    • Model serving platforms like TensorFlow Serving or AWS SageMaker.
    • Automated model deployment pipelines using tools like Jenkins or GitLab CI/CD.
    • Monitoring and logging solutions like Prometheus or ELK Stack.

Knowledge of:

    • Model explainability and interpretability techniques.
    • Data privacy and security best practices.


What We Offer:

  1. Competitive Salary: A salary commensurate with experience and industry standards.
  2. Stock Options: Equity in Arcee AI to give you a stake in our success.
  3. Comprehensive Benefits: Health, dental, and vision insurance, as well as 401(k).
  4. Professional Development: Opportunities for growth, training, and conference attendance.
  5. Collaborative Environment: A dynamic, diverse team that values innovation and open communication.

Date Posted

01/24/2025
