To structure or not to structure IT Cabling for AI Clusters - Part 1

A couple of weeks ago I started on the chapters around data center topologies and architectures for the ADTEK Data Center Handbook, while at the same time I was involved in a project needing DAC and AOC cables, and that triggered a reflection on the future of some of these architectures. For as long as I have been in the industry, the question of whether point-to-point or structured cabling is better has always been around, with proponents on either side. But with the new bandwidth requirements we are starting to see in the AI space (see my previous blog, The 800Gb and beyond connectivity conundrum), it is worth reflecting on some of the ecological, operational and cost-of-ownership differences for AI between point-to-point solutions, such as DAC PCC, DAC ACC and AOC, and structured cabling solutions.

“So I was another 500 words into writing this blog when I realized I would never be able to pack all of this into one post. In this first part I will concentrate on laying out the different considerations and address the first of them. Subsequent blogs will address the other aspects and, combining those with your comments, I will pull together a summary.”

 

First, let us differentiate between GenAI training clusters and GenAI inference clusters. Training clusters generally consist of a substantial number of GPUs and CPUs, require a lot of connectivity and consume copious amounts of training data. Inference GenAI clusters, on the other hand, are smaller, one- to two-rack solutions that use application-relevant data and are not massively different from HPC clusters.

Figure 1: GenAI Training vs Inference

Source: LinkedIn

These massive training clusters hold a lot of compute power and data storage capacity and therefore cost a lot of money. This motivates GenAI service providers to deploy them in a modular way as the business grows, and once a deployment is approved they want it installed and operational as fast as possible to get a return on investment (RoI). Then, while training, the clusters should perform as efficiently as possible, i.e. use connections that deliver high bandwidth with the lowest possible delay and operate uninterrupted.

Attribute             | GenAI Training    | GenAI Inference
Size (GPUs)           | Tens of thousands | Below one hundred
Deployment speed      | Very fast         | Fast
Latency               | Very low          | Low
Operational stability | Very high         | High

This has an impact on several data center parameters: first, the size of the deployment, the number of connections and the requirement to be modular; secondly, the power and cooling density and capacity; and finally the deployment, operations and sustainability.

In this first part I want to address the operational sensitivity related to the cluster architecture and the high bandwidth demand, more specifically the compute-to-compute and backend connectivity. The higher bandwidth demand, 50Gb per lane and above, requires complex modulation and multiple lanes to support speeds of 400Gb and above. For fibre connectivity this implies that we need parallel optics and that not only the insertion loss (IL) of the fibre connectivity matters but also the return loss (RL). Increased reflections (poor RL) degrade the quality of the network signalling, and an impaired connection will slow down or even stop the link[1]. This does not only impact that single connection; because of the cluster architecture, it impacts the entire cluster. For structured cabling this means that every mated connection needs to be of high quality, whether at the equipment-to-patch-cord connection or at the patch-cord-to-structured-cabling connection. This is also one of the reasons why high bandwidth connectivity uses parallel optics with angled physical contact (APC) end-faces, which are not compatible with the UPC or PC connectors that are standard for lower-speed MMF networks. Another factor impacting the RL is dirty end-faces: although the industry has been creating standards and awareness for decades that Inspect, Clean and Connect (ICC) is needed for every mated connection in a fibre link, practice shows that a lot of installers still omit this. As a result, every mated fibre connection in a GenAI cluster is a potential operational risk.

Figure 2: PAM4 eye diagram
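To make the mated-connection argument a bit more tangible, below is a minimal sketch of how connector loss and reflection points add up as a channel moves from point-to-point to structured cabling with a cross-connect. The 0.25dB per mated pair and the 3dB connector-loss budget are illustrative assumptions, not figures taken from a specific standard or product.

```python
# Illustrative sketch: how mated fibre connections accumulate in a channel.
# The 0.25 dB per mated pair and the 3 dB connector-loss budget are assumed,
# illustrative values, not figures taken from a specific standard.

ASSUMED_LOSS_PER_MATED_PAIR_DB = 0.25   # assumption: low-loss MPO/LC mated pair
ASSUMED_CONNECTOR_BUDGET_DB = 3.0       # assumption: total connector-loss budget

def connector_loss(mated_pairs: int) -> float:
    """Insertion loss contributed by mated connections only, in dB."""
    return mated_pairs * ASSUMED_LOSS_PER_MATED_PAIR_DB

scenarios = {
    "Point-to-point DAC/AOC (no mated fibre pairs)": 0,
    "Transceivers with a single patch cord (2 mated pairs)": 2,
    "Structured cabling, panel at each end (4 mated pairs)": 4,
    "Structured cabling with a cross-connect (6 mated pairs)": 6,
}

for name, pairs in scenarios.items():
    loss = connector_loss(pairs)
    headroom = ASSUMED_CONNECTOR_BUDGET_DB - loss
    print(f"{name}: {loss:.2f} dB connector loss, "
          f"{headroom:.2f} dB headroom, {pairs} reflection points")
```

The point is not the exact numbers but the pattern: every extra mated pair eats into the loss budget and adds another end-face that must be inspected, cleaned and kept within its RL specification.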

The second operational aspect is network latency, where the cable type, cable length and the transceiver all play a role. Copper connectivity has lower latency than fibre connectivity: DAC cabling adds about 4.6ns per meter and fibre about 5ns/m, but there is also the transceiver end, which is near 0ns for DAC and about 0.5ns per transceiver end for fibre[2]. This means that a 1m DAC has about 1.4ns (0.5+0.4+0.5) less latency than a fibre connection, be it an AOC or transceivers over structured cabling. That works out to roughly a 24% latency difference for 1m links and about 14% for 3m links, which significantly impacts the speed of the training and the subsequent RoI for the GenAI service provider. The disadvantage of DAC is that it only supports very short distances at high bandwidth: up to 3m for DAC Passive Copper Cable (PCC) and up to 5m for DAC Active Copper Cable (ACC).

Figure 3: Twinax vs MMF vs SMF latency
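As a quick sanity check on the arithmetic above, here is a small Python sketch using only the per-metre and per-transceiver figures quoted in the text; it lands close to the ~24% (1m) and ~14% (3m) differences mentioned above.

```python
# Sanity check of the latency comparison using the figures quoted in the text:
# DAC ~4.6 ns/m, fibre ~5 ns/m, plus ~0.5 ns per optical transceiver end
# (and effectively 0 ns for a DAC end).

DAC_NS_PER_M = 4.6            # passive copper propagation delay
FIBRE_NS_PER_M = 5.0          # fibre propagation delay
TRANSCEIVER_NS_PER_END = 0.5  # optical transceiver latency per end

def dac_latency_ns(length_m: float) -> float:
    return DAC_NS_PER_M * length_m

def fibre_latency_ns(length_m: float) -> float:
    return FIBRE_NS_PER_M * length_m + 2 * TRANSCEIVER_NS_PER_END

for length in (1, 3):
    dac = dac_latency_ns(length)
    fib = fibre_latency_ns(length)
    delta = fib - dac
    print(f"{length} m link: DAC {dac:.1f} ns, fibre {fib:.1f} ns, "
          f"delta {delta:.1f} ns (~{delta / fib:.0%} of the fibre latency)")
```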

In the next blog we will delve into the deployment and installation aspects, together with the power, cooling and sustainability impacts. I look forward to reading your take on the above aspects and hearing your interpretation or vision on this topic.

References:
