
Distributed computing for Machine Learning


Machine learning methods continue to gather momentum, with IDC estimating that spending on AI and cognitive systems is reaching $77.6B USD. Although such figures are approximate at best, they still give an idea of the scale of this market. Even though the theoretical results behind machine learning have been known for a long time, the commercial development of machine learning methods is very recent. Successful use cases will continue to boost the credibility of the technology and, with it, the interest in applying it in new contexts.

The mathematical calculations behind ML methods require extensive computing power, and in most cases running these algorithms on a single PC is impractical because it would take prohibitively long. To solve this problem, two major computing infrastructure solutions are available today: cloud and on-premise. Cloud solutions like AWS or Azure provide specialized computing power tailored to the needs of machine learning, often with proprietary ML platforms like Amazon SageMaker. The on-premise option requires purchasing powerful and expensive graphics processing units (GPUs), which are then run and maintained internally by the organization. In a data infrastructure survey conducted by O’Reilly, 63% of participants confirmed that they were using Amazon for some portion of their data infrastructure. This is an astonishingly high number that gives a clear indication of the company’s pricing power.

If computing resources are a significant cost, then reliance on Amazon and other cloud providers is something that needs to be questioned. There is the immediate consideration of the actual computing bill, but there are also other less visible factors to take into account.

There are data transfer fees, which can give companies nasty surprises and cause unexpected constraints in the future. Likewise, there is the shaky ground of vendor lock-in: if a company invests in the costly development of a tech stack dependent on a specific cloud provider, then it is at the mercy of this vendor in the case of any price increases or policy changes. Without getting into details, it is clear that a company with a dominant market share will always have the means of charging a high price for its services, and relying on those services is a situation fraught with business risks.

To an extent, the importance of infrastructure choice is also related to the type of ML projects that are undertaken. In the case of supervised learning, data needs to be manually annotated and labelled, as this approach requires the algorithm to “learn” from a training set. The costs of assembling such manually labelled data are very high. In a recent survey by Dimensional Research, 72% of respondents reported that at least 100 thousand labelled data items were necessary to achieve production-level model confidence. The costs of outsourcing this or hiring internal teams to do the manual “grunt” work of data annotating, cleaning, and bias removal far outweigh any infrastructure outsourcing costs. In the case of unsupervised or reinforcement learning, the other two major machine learning paradigms, there is no such need for manual labelling, and therefore infrastructure costs figure much more prominently.

On-premise GPU installation has its benefits, as it keeps computing “in-house”, but it may not suit every organization. The company may be too small for this type of hardware purchase, may not have the staff to maintain it, or would simply like a scalable solution without the costs and risks of depending on a single cloud provider. This is where distributed computing provides an interesting solution. Established platforms like Hadoop would allow companies to use their current hardware instead of paying a third-party provider. If there are several hundred PCs in an office, they can be used as a collective workhorse at night, when they are idle. Instead of paying a cloud provider fees to use yet another set of processors, the company would use the hardware in which it has already invested, and the only cost is the incremental increase in electricity use for running the PCs at night. If there are not enough PCs to provide the necessary collective computing power, then blockchain can provide a yet more interesting solution. A blockchain protocol can be designed so that it allows access to a node’s processor and sends it calculation requests. A small number of existing protocols, like Raven, split machine learning tasks into small fragments and distribute them across the blockchain network, compensating the computing nodes with tokens.
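The idea of splitting a machine learning task into fragments and rewarding the nodes that process them can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Raven or any other protocol actually works: the "nodes" are simulated with local threads, the task is one gradient evaluation for a toy model y ~ w * x, and the names `split_into_fragments`, `partial_gradient`, `distributed_gradient` and the one-token-per-fragment reward are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_fragments(samples, n_fragments):
    """Split the samples into at most n_fragments contiguous chunks."""
    size = -(-len(samples) // n_fragments)  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def partial_gradient(fragment, w):
    """Work done by one node: summed squared-error gradient for y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in fragment)


def distributed_gradient(samples, w, node_ids, tokens_per_fragment=1):
    """Send one fragment to each node, combine the results, credit the nodes.

    Threads stand in for remote machines here (an assumption for the sketch);
    a real network would also need to verify results and handle node failures.
    """
    fragments = split_into_fragments(samples, len(node_ids))
    with ThreadPoolExecutor(max_workers=len(node_ids)) as pool:
        partials = list(pool.map(lambda f: partial_gradient(f, w), fragments))
    # Hypothetical reward rule: a flat token amount per processed fragment.
    rewards = {node: tokens_per_fragment for node, _ in zip(node_ids, fragments)}
    return sum(partials) / len(samples), rewards


# Toy data following y = 2x, evaluated at the starting guess w = 1.0.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
gradient, rewards = distributed_gradient(samples, w=1.0, node_ids=["pc-1", "pc-2"])
print(gradient, rewards)
```

Because the gradient of a sum is the sum of the gradients, the combined result is the same as if a single machine had processed all the samples, which is what makes this kind of task easy to fragment.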

There is a strong economic argument for the use of a P2P architecture in this context. Payments go directly from the user to individual nodes, and therefore potentially at much lower prices than those of Amazon, which will use its pricing power in its own favour. Prices in such an ecosystem would be dictated by supply and demand, so users would be able to access idle computing power at low cost, when they need it. There are also strong environmental benefits to distributed computing on the blockchain. Using thousands of existing PCs that sit idle during the day, while their owners are at work, maximizes the return on the environmental cost of their manufacturing. Instead of building yet more datacenters that consume yet more energy, the hardware already in existence is used, thereby reducing waste.
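To make the supply-and-demand argument concrete, here is a minimal sketch of how a request for computing time could be matched against offers from individual nodes. Everything in it is an assumption made for illustration: the function `match_request`, the node names, and the prices are hypothetical and are not taken from any real protocol or market.

```python
def match_request(asks, hours_needed, max_price):
    """Fill a compute request from the cheapest offers first.

    asks: list of (node_id, price_per_hour, hours_available) tuples.
    Returns (allocations, total_cost, unfilled_hours).
    """
    allocations, total_cost = [], 0.0
    remaining = hours_needed
    # Cheapest offers are served first, so idle capacity competes on price.
    for node_id, price, available in sorted(asks, key=lambda ask: ask[1]):
        if remaining <= 0 or price > max_price:
            break
        hours = min(available, remaining)
        allocations.append((node_id, hours, price))
        total_cost += hours * price
        remaining -= hours
    return allocations, total_cost, remaining


# Hypothetical offers: (node, price per compute-hour, hours on offer).
asks = [("node-a", 0.30, 4), ("node-b", 0.10, 5), ("node-c", 0.20, 3)]
allocations, total_cost, unfilled = match_request(asks, hours_needed=7, max_price=0.25)
print(allocations, total_cost, unfilled)
```

In this toy market the request is served entirely by the two cheapest nodes, and the most expensive offer is never used. The more idle machines join, the more low-priced offers are available to the user.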