AWS Hyperpod Development Notes

EKS Cluster and control plane:

  • Hyperpod VPC, subnets, security groups, etc.

By default, Hyperpod installs the EFA device plugin, which registers the EFA resource (vpc.amazonaws.com/efa) with Kubernetes. If the plugin is missing, you can install it manually from the EFA Helm chart:

name: aws-efa-k8s-device-plugin
repo: https://aws.github.io/eks-charts
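
A minimal sketch of the manual install, wrapped in a shell function. The chart name and repo URL come from the notes above; the local repo alias "eks", the release name, and the kube-system namespace are assumptions, so adjust them to your cluster:

```shell
# Sketch: install the EFA device plugin from the eks-charts Helm repo.
# Assumptions (not from the notes): the local repo alias "eks", the release
# name, and the kube-system namespace.
install_efa_plugin() {
  helm repo add eks https://aws.github.io/eks-charts &&
    helm repo update &&
    helm install aws-efa-k8s-device-plugin eks/aws-efa-k8s-device-plugin \
      --namespace kube-system
}
```

Run install_efa_plugin with your kubeconfig pointing at the Hyperpod EKS cluster.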

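Also, make sure to request vpc.amazonaws.com/efa in the pod spec so that the EFA NIC is attached to the container. Because this is an extended resource, it must be set under limits, and requests (if present) must match:

resources:
  limits:
    vpc.amazonaws.com/efa: "1"
  requests:
    vpc.amazonaws.com/efa: "1"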
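
Before scheduling the pod, it can help to confirm that the nodes actually advertise the EFA resource. The following is a sketch using kubectl custom columns; the function name and the column labels are illustrative:

```shell
# Sketch: list each node with the number of EFA devices it advertises.
# The column labels NAME and EFA are arbitrary; dots in the resource name
# are escaped so kubectl treats it as a single key.
list_efa_allocatable() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
}
```

A node with no value in the EFA column most likely does not have the device plugin running, or its instance type has no EFA interface.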

If everything is configured correctly, you should be able to see the EFA NIC in the container using fi_info:

# which fi_info
/opt/amazon/efa/bin/fi_info
# fi_info -p efa
provider: efa
    fabric: efa-direct
    domain: rdmap49s0-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap49s0-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap49s0-dgrm
    version: 201.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
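
For scripted checks, the fi_info output above can be reduced to the list of unique domain names. This is a sketch that assumes the output layout shown above; the function name is hypothetical:

```shell
# Sketch: print the unique libfabric domain names reported by the EFA provider.
# An empty result suggests no EFA device is visible in the container.
list_efa_domains() {
  fi_info -p efa | awk '$1 == "domain:" { print $2 }' | sort -u
}
```

Call list_efa_domains inside the container, for example as a startup sanity check before launching a training job.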

Hyperpod Node Group:

  • On-demand, Training Plan, etc.
  • Node group AZ override

EFA network:

  • Intra-cluster EFA networking
  • FSx with EFA, expensive
  • EFA network benchmarking, troubleshooting