1. Work stealing is a load-balancing technique used in distributed task scheduling systems, in which multiple schedulers cooperate to make scheduling decisions. Idle schedulers steal tasks from heavily loaded ones; however, this transfer can yield poor data locality for data-intensive applications, whose tasks depend on processing large volumes of data, and can therefore incur data-transfer overhead. Work stealing can be enhanced with shared and dedicated queues [17], organised according to data size and location. The MATRIX system applies this technique in a distributed task scheduler for many-task computing, using a distributed key-value store to scale and organise task metadata, task dependencies, and data-locality information. This data-aware work stealing technique shows good performance.
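To make the queue organisation concrete, the following sketch models data-aware work stealing with dedicated and shared queues. The class names, the data-size threshold, and the most-loaded-victim policy are illustrative assumptions, not details taken from [17]:

```python
class Scheduler:
    """One of several cooperating schedulers (MATRIX-style sketch)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.dedicated = []   # location-sensitive tasks: large local input, never stolen
        self.shared = []      # location-flexible tasks: cheap to move, stealable

    def submit(self, task_id, data_size, threshold=100):
        # Tasks with large input data stay pinned to this node (dedicated queue);
        # small tasks go to the shared queue and may be stolen by idle peers.
        if data_size > threshold:
            self.dedicated.append(task_id)
        else:
            self.shared.append(task_id)

    def next_task(self, peers):
        if self.dedicated:
            return self.dedicated.pop(0)
        if self.shared:
            return self.shared.pop(0)
        # Idle: steal only from peers' *shared* queues, preserving data locality.
        victims = [p for p in peers if p.shared]
        if victims:
            victim = max(victims, key=lambda p: len(p.shared))  # most loaded peer
            return victim.shared.pop()
        return None
```

Because only the shared queues are stealable, load balancing never forces a large data transfer, which is the essence of the data-aware variant.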

Falkon [18] is a centralised task scheduler that supports MTC applications through naïve hierarchical scheduling. Although Falkon scales well, it struggles to reach petascale systems, and its hierarchical implementation degrades load balancing when task execution times are uncertain. For scheduling data-intensive workloads, Falkon adopted a data diffusion approach [19]: storage and compute resources are acquired dynamically, data are replicated on demand, and computations are scheduled close to their data. In contrast to a distributed key-value store [17], Falkon suffers from poor scalability because it stores metadata on a centralised index server.

Charm++ [20] is a machine-independent parallel programming system in which load balancing can be performed in a centralised, hierarchical, or fully distributed style. The centralised approach has poor scalability (up to about 3K cores [20]), whereas the distributed approach relies on neighbour-averaging schemes that confine load balancing to a local space and therefore balance poorly at extreme scales.

Mesos [21] is a platform for sharing resources among many cluster computing frameworks for task scheduling. It lets frameworks achieve data locality by reading data stored on each machine. It employs delay scheduling: a framework waits for a limited time until a node storing its data becomes available. This approach introduces substantial waiting time when tasks must be scheduled promptly, particularly for large data sets.
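The delay-scheduling rule can be sketched as a simple decision function. This is a simplified model of the policy; the wait counter, the parameter names, and the fixed skip limit are illustrative assumptions:

```python
def delay_schedule(offer_node, wait_count, max_wait, preferred_nodes):
    """Decide whether to launch a task on an offered node (delay-scheduling sketch).

    preferred_nodes: the set of nodes holding the task's input data.
    Returns (launch, new_wait_count).
    """
    if offer_node in preferred_nodes:
        return True, 0                  # local offer: launch immediately
    if wait_count >= max_wait:
        return True, 0                  # waited long enough: accept a remote node
    return False, wait_count + 1        # skip this offer and keep waiting
```

The trade-off described above is visible in the last branch: each skipped offer buys a chance at locality at the price of added scheduling latency, which grows with the size of the data set.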

Quincy [22] is a flexible framework for scheduling concurrent distributed jobs with fine-grained resource sharing. It maps the scheduling problem onto a graph structure to find the best trade-off between load balancing and data locality, which requires a significant amount of time.

Dryad [23] is a general-purpose distributed execution engine for coarse-grained data-parallel applications. It supports running applications structured as workflow DAGs. Like the Hadoop scheduler [24], Dryad performs centralised scheduling that maps tasks to the nodes where their data reside, which limits both scalability and flexibility.

The CloudKon [25] architecture resembles that of MATRIX, with the difference that it focuses on the Cloud environment: it relies on cloud services such as SQS [26] to perform distributed load balancing and deploys DynamoDB [27] as the distributed key-value store for task metadata. Cloud services make development easier and simpler but sacrifice control and some performance. Moreover, CloudKon currently lacks support for data-aware scheduling.

SLAW [28] is a highly scalable locality-aware adaptive work-stealing scheduler that supports both the work-first and help-first policies [29] adaptively at runtime on a per-task basis. Although it is designed to address the scalability issues of work stealing, it targets the core and thread level, and its strategy does not support large-scale distributed systems at all.

Another work, proposed in [30], employs data-aware work stealing using both dedicated and shared queues. However, it relies on the X10 global address space programming model [31] to expose data-locality information statically and to distinguish location-flexible from location-sensitive tasks at the start. This task data-locality information remains unchanged after the tasks are defined, so the approach does not adapt to diverse data-intensive workloads.

  1. Generally, a compute-centric architecture is preferred for scientific applications based on MPI processes. The MPI processes run on various nodes and access the data they need through a file system. However, the high performance and efficiency of these MPI-based applications are threatened by the explosive growth of scientific data. The study in [32] presents a solution to this problem: DL-MPI (Data-Locality MPI), in which MPI-based applications obtain data-placement information from compute nodes through a data-locality API together with a scheduling algorithm. The premise behind DL-MPI is that the scheduling algorithm assigns work to each node according to its individual capacity, thereby increasing efficiency and performance, while the data-locality API evaluates the amount of unprocessed local data and helps the scheduling algorithm allocate processing to compute nodes. While the solution is novel and noteworthy, it needs refinement: the algorithm does not scale well when the number of nodes is large, apparently because data-movement overhead obstructs the scaling of the system.
  2. In the literature, a multitude of data distribution methods for NUMA systems have been suggested. One such suggestion is Huang et al.'s [33] proposal of OpenMP extensions that allocate data to an abstract notion of locations. The method advocates block-wise distribution of data; while it requires more programming effort than its counterparts, it allows greater and more accurate control of data distribution.
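The DL-MPI scheduling idea in item 1 above, assigning work to whichever node still has unprocessed local data, can be sketched with a hypothetical greedy rule. The function name, the data layout, and the capacity weighting are illustrative assumptions, not details from [32]:

```python
def assign_next_block(nodes):
    """Choose which node should process its next local block (DL-MPI-style sketch).

    nodes: dict mapping node name -> {'blocks': unprocessed local blocks
                                      (as reported by a locality API),
                                      'capacity': relative compute capacity}.
    The greedy rule favours the node with the most unprocessed local data
    weighted by its capacity, so local data is drained before any is moved.
    """
    ready = [n for n in nodes if nodes[n]["blocks"] > 0]
    if not ready:
        return None                      # all local data processed
    best = max(ready, key=lambda n: nodes[n]["blocks"] * nodes[n]["capacity"])
    nodes[best]["blocks"] -= 1           # that node consumes one local block
    return best
```

A scheme like this keeps processing local while nodes still hold unprocessed data; its weakness, consistent with the scaling issue noted above, is that any remaining imbalance eventually forces data movement.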

In contrast, the Minas framework [34] advocates a data distribution API that uses automated code transformation to find the best distribution for a given program based on that program's profile. Like Huang et al. [33], this method allows precise control over data distribution, but expert programmers are needed to apply it.

Majo and Gross [35] put forward another method that relies on a fine-grained API to distribute data. The API uses execution profiling to obtain the data access patterns of loops, which then direct the distribution of both code and data. Loop iterations are pivotal in this method: data is redistributed between loop iterations so that memory pages are accessed locally in every iteration.

Nikolopoulos et al. [36] proposed yet another method: a runtime tracing technique that uses automatic page migration. Its salient feature is that, unlike the above approaches, it requires minimal programming effort. Page migration is driven by continuous monitoring of the hardware through performance counters: a user-level framework tracks page accesses in the background and moves hot pages closer to the node that accesses them, thus increasing performance.
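The counter-driven migration loop can be modelled schematically. In this sketch, simple access counts stand in for hardware performance-counter samples, and the hotness threshold and class names are illustrative assumptions, not details from [36]:

```python
from collections import Counter

class PageMigrator:
    """User-level sketch of counter-driven hot-page migration."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold   # fraction of accesses one node must dominate
        self.counts = {}             # page -> Counter of accesses per node
        self.placement = {}          # page -> node currently holding the page

    def record_access(self, page, node):
        # Stand-in for a performance-counter sample of a remote/local access.
        self.counts.setdefault(page, Counter())[node] += 1

    def migrate_hot_pages(self):
        """Background pass: move each hot page next to its main consumer."""
        moved = []
        for page, per_node in self.counts.items():
            node, hits = per_node.most_common(1)[0]
            dominant = hits / sum(per_node.values()) >= self.threshold
            if dominant and self.placement.get(page) != node:
                self.placement[page] = node
                moved.append((page, node))
        return moved
```

Running `migrate_hot_pages` periodically in the background mirrors the minimal-effort property described above: the application itself never annotates data placement.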

The data and task affinity techniques discussed in [36] rely heavily on NUMA optimisations in OpenMP runtime systems. The co-scheduling and local-distribution elements of that concept strongly influence this work and are implemented in the runtime systems developed within this paper. Broquedis et al. [37] have presented a similar methodology.

Data distribution and locality-aware scheduling on many-core processors have not been adequately addressed in research. Yoo et al. [38] give a detailed quantitative analysis of locality-aware scheduling for data-parallel programs on many-core processors. They conclude that work-stealing scheduling cannot capture locality because its scheduling decisions are ineffective for data-parallel programs. They therefore propose a systematic locality-aware scheduling method that maximises the probability that the combined memory footprint of a task group fits in the last-level cache. The method, however, cannot perform effectively without prior ordering information and task grouping, which are acquired via profiling of read-write task sets and offline graph analysis.

Vikranth et al. [39] suggest confining stealing to groups of cores based on the topology of the processor.

Tousimojarad and Vanderbauwhede [40] take a smart approach: on the TILEPro64 processor, they reduce access latencies to distributed data by using copies whose access is local to the accessing thread, rather than the originals. Zhou and Demsky [41] approach the problem from another angle and migrate objects across processors to increase local access, developing a NUMA-aware adaptive garbage collector for this purpose. However, programs written in C are more laborious to migrate, which limits the efficiency of this approach.

Lu et al. [42] adapted the data-migration method of Zhou and Demsky [41] to multicore processors, claiming that rearranging affinity for-loops at compile time reduces access latency to data distributed uniformly across the caches of multicore processors. Marongiu and Benini [43], on the other hand, partition arrays to rearrange data and reduce access latency; data is rearranged and then redistributed according to its profile. Their motivation was to enable data distribution on MPSoC platforms that lack memory-management hardware support.

Another approach is that of Li et al. [44], [45], who use compile-time information to guide the data-placement runtime system. Finally, R-NUCA [46], like Tousimojarad and Vanderbauwhede [40], uses an automated system to migrate shared memory pages into cache memory.

  1. The research study in [47] claims that energy waste in high-performance parallel systems can be reduced by leveraging the locality-awareness principle to implement optimised power-efficient techniques. It elaborates that two symbiotic techniques can achieve power efficiency in such systems: (i) intra-process locality-driven power optimisation and (ii) inter-process locality-driven power optimisation. The intra-process technique gives programmers and system designers the control and flexibility to assign processor frequencies and manage sleep states within phases of the same process. The inter-process technique, on the other hand, co-places and co-schedules threads from a diverse set of processes executing simultaneously on an HPC cluster. Co-placement and co-scheduling group threads and processes by the similarity of their affinity patterns and their symbiosis, which allows reductions in energy consumption of as much as 25%. The efficiency of the reduction, however, depends heavily on correctly identifying CPU-bound and memory-bound functions.

These two techniques, when used in unison, prove more fruitful than either applied independently. It is therefore prudent to analyse their results within a systematic framework; the measurement framework itself is a significant contribution for the growing number of researchers working on HPC power optimisation.

2.2 Job Scheduling Strategies

  1. Resource and energy allocation techniques are needed in large-scale parallel computing environments to maintain quality of service (QoS) and to cut system operational cost. Energy consumption must be reduced because it weighs on both the user's and the owner's budget. Resource allocation strategies become complicated when energy efficiency is a focus, since it can impinge on both response and queue times.

In the same large-scale parallel computing setting, another actor, the Resource Provider (RP), offers users access to storage and computational resources from a finite pool and also plays a part in maintaining QoS levels. The job scheduling technique must therefore be chosen carefully to tackle the resource-management problem, since it is central to performance optimisation. Resource-management scheduling is the focus of studies such as [48]–[54], which solve the resource allocation problem under different QoS constraints. For example, Wei et al. [48] suggest a metric-aware scheduling policy in which the scheduler balances competing scheduling objectives, including efficiency, job response time and system utilisation. Khan et al. deploy a self-adaptive weighted-sum method in [49] and [50]. The same studies also discuss game-theoretic methodologies [51] and a goal-programming approach [52], used to optimise system performance for grid resource allocation under different QoS constraints. Braun et al. [53] studied eleven static heuristics for mapping independent tasks onto a heterogeneous distributed computing system, evaluating and executing the mapping policies under common assumptions.
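The weighted-sum idea mentioned above scalarises several competing objectives into one score. The sketch below is a generic illustration in the spirit of [48]-[50]; the metric names and weights are invented for the example, and metrics are assumed to be normalised to [0, 1] with higher meaning better (so response time would be inverted before scoring):

```python
def weighted_sum_score(metrics, weights):
    """Combine competing scheduling objectives into a single scalar score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[m] * metrics[m] for m in weights)

def pick_placement(candidates, weights):
    """Return the candidate placement with the highest scalarised score."""
    return max(candidates, key=lambda c: weighted_sum_score(candidates[c], weights))
```

Shifting the weights shifts the trade-off: a scheduler that weights utilisation heavily will accept worse response times, which is exactly the balancing act the metric-aware policies above negotiate.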

2.2.1 Resource Management

  1. Some applications have resource requirements that call for HPC schedulers designed specifically for them. A monolithic simulation is an example of such an HPC application: it uses a massive set of cores with tightly coupled, MPI-based parallelism, and these MPI applications are highly latency-sensitive. Their resource demands, in number of cores, memory and wall time, do not vary, and they must be scheduled in a particular way, for example with gang scheduling, in order to run on a system. Scheduling is primarily carried out at the application level, also called the job level; the individual processes running on the compute nodes are not visible to the resource manager. Heterogeneous scheduling workloads comprise both short-running and long-running batch-oriented tasks, which challenges traditional centralised, monolithic cluster scheduling systems. To remove this inflexibility and support dynamic applications with heterogeneous tasks, Pilot-Jobs are used. HPC schedulers do not focus on data locality, even though data-intensive workloads can be broken down into loosely coupled parallel tasks. Resource utilisation and fairness can be improved by scheduling at the task level rather than the job level [55].

YARN [56] and Mesos [21] are multi-level schedulers. Their advantage is that they can support data-intensive workloads with loosely coupled tasks, and application-level scheduling makes dynamic resource allocation and usage possible. The low-latency and scalability requirements of interactive workloads are addressed by decentralised schedulers; Google's Omega [57] and Sparrow [55], an application-level scheduler for Spark, are two examples. The evolution of Hadoop schedulers spans both central and multi-level approaches. Hadoop originally deployed a centralised task scheduling approach based on the JobTracker, which is in effect a resource manager. The JobTracker introduced a scalability bottleneck into the MapReduce framework and limited its flexibility. At first this issue was not identified, and Hadoop was used on top of HPC clusters through systems such as Hadoop on Demand [58], SAGA-Hadoop [59], MyHadoop [60], Amazon's Elastic MapReduce [61] and Microsoft's HDInsight [62]. These approaches were limited by insufficient data locality: data had to be moved into the HDFS filesystem before the computation could be executed. Hadoop 1 had this limitation, yet new usage modes for Hadoop clusters kept emerging, along with demand for frameworks and applications of different sizes and varieties; supporting batch, interactive and streaming data processing became very important.

Resource usage and performance were not easy to predict when higher-level frameworks like HBase or Spark were deployed alongside Hadoop daemons. YARN [56], the core of Hadoop 2, was developed to address this problem and to support large, heterogeneous workloads in the Hadoop environment. Mesos [21] is a multi-level scheduler that can also serve Hadoop. The ecosystem that evolved on top of HDFS and YARN is large. Figure 1 shows applications and their runtime systems, which integrate application-level schedulers that coordinate with YARN, including HBase, Spark and MapReduce. Common requirements are embedded into higher-level shared runtime structures, such as Directed Acyclic Graphs (DAGs), which support long-running, interactive or multi-stage applications. Llama [63] provides a long-running application controller for YARN-based applications, designed specifically for the Impala SQL engine. An example of a DAG processing engine is Tez [64], designed to support the Hive SQL engine.

Typical data-intensive workloads consist of short-running, data-parallel tasks. Using task-level rather than job-level schedulers, like YARN, improves overall cluster utilisation, as resources can be shifted between applications; the scheduler can, for example, withdraw resources from an application by waiting for its tasks to complete. To make resource usage this dynamic, YARN requires tighter integration with the application. The application must register an Application Master process that subscribes to callbacks. The container is the unit of scheduling, and containers are requested from the Resource Manager. In contrast to HPC schedulers, the Resource Manager does not guarantee the requested number of resources: the application must use resources elastically as YARN allocates them, and YARN may even request the de-allocation of containers, so the application has to keep track of the resources available to it at any time.
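The elastic negotiation just described can be modelled schematically. This is a toy simulation of the protocol, not the YARN API: the class and method names are invented, and real applications would receive these events through asynchronous Application Master callbacks:

```python
class ResourceManager:
    """Toy stand-in for a YARN Resource Manager: grants what it can,
    and may later ask an application to give containers back."""

    def __init__(self, total):
        self.free = total

    def allocate(self, wanted):
        granted = min(wanted, self.free)   # no guarantee of the full request
        self.free -= granted
        return granted

    def reclaim(self, app, n):
        app.release(n)                     # de-allocation request to the app
        self.free += n

class ApplicationMaster:
    """Application-side process that must track and use containers elastically."""

    def __init__(self, rm):
        self.rm = rm
        self.containers = 0

    def request(self, n):
        self.containers += self.rm.allocate(n)   # accept whatever was granted

    def release(self, n):
        self.containers -= min(n, self.containers)
```

The key contrast with an HPC scheduler is visible in `allocate`: the application may receive fewer containers than requested and must adapt, rather than blocking until a fixed allocation exists.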

2.2.2 A Convergence of High-Performance and Big Data Computing

HPC and Hadoop were designed to support different workload types and requirements. In the Hadoop case, the combination can be seen at various levels: high-end parallel computing in HPC versus commodity data storage and retrieval in Hadoop. Hadoop clusters have come to host more compute-demanding workloads, with data-parallel tasks and workflows also implemented on Hadoop infrastructure. With the introduction of Mesos and YARN, the Hadoop ecosystem developed the ability to support heterogeneous workloads. Tools such as Pilot-Jobs supported loosely coupled, data-intensive workloads on HPC infrastructures. However, Pilot-Job tools and their kin supported large numbers of compute tasks or specific domains, and could not reach the scalability and diversity of the Hadoop ecosystem.

The ABDS ecosystem provides higher-level capabilities for data storage, data processing and data analytics: MapReduce, iterative MapReduce, graph analytics, machine learning and others. They are all built on extensible kernels such as HDFS and YARN. In contrast to HPC schedulers, YARN is designed to support heterogeneous workloads through data-aware scheduling. To achieve this goal, YARN requires a higher degree of coupling between the application (or framework) and the system-level scheduler than HPC schedulers do. A YARN application does not request a static set of resources before it runs, but requests resources dynamically; when applications optimise their resource usage, overall cluster utilisation improves as well.

Even though YARN is not part of HPC resource fabrics, integrating Hadoop/YARN with HPC has been suggested because of the many advantages YARN would bring. Three aspects need to be considered: first, integration with the local resource management system; second, integration with external resources such as parallel and shared file systems, i.e. HPC storage resources; and third, the use of advanced network features such as RDMA and other efficient abstractions, including collective operations. For resource-management integration, the Hadoop-level scheduler must be adapted to run on top of the system-level scheduler. Condor and SLURM are resource managers that provide support for Hadoop, and there are also third-party systems such as SAGA-Hadoop [59], JUMMP [65] and MyHadoop [60]. The main problem with this approach is that data locality is lost, since the system-level scheduler is not aware of it. Further, when HDFS is used, data must be copied from HDFS before being processed, which represents a substantial overhead. Hadoop runs in single-user mode here, and there is no optimal way to use the cluster resources.

For storage integration, Hadoop provides a pluggable filesystem abstraction that interoperates with any POSIX-compliant filesystem. Parallel file systems can thus be used with Hadoop, but in some cases the Hadoop layer is unaware of the data locality maintained by the parallel file system. Examples include Intel's support for Hadoop on top of Lustre [66] and IBM's on GPFS [67]. With a shared file system, the MapReduce shuffle phase can be carried out through it, which is an optimisation concern [68]. The scalability of these filesystems is constrained compared to HDFS, because HDFS processes most data locally, eliminating data movement across the network, and does not rely heavily on fast interconnects.

For network integration, Hadoop fits Ethernet environments and mainly uses Java sockets for communication; it does not exploit high-performance features such as RDMA. To address this, it has been proposed that RDMA support be added to MapReduce, HDFS, HBase and a wide range of other components for 10-Gigabit and InfiniBand networks.

  1. Scientific workflows are Directed Acyclic Graphs, and although they are useful for understanding dependence relationships, they give no information about temporary data files, input or output. This information is essential because it helps eliminate performance issues. The paper in [69] presents a multi-workflow storage-aware scheduler for cluster environments, known as Critical Path File Location (CPFL). In this approach, an extension of traditional list scheduling policies, disk access time is considered more important than the network. The main point is to discover the best location of data files in a hierarchical storage system; the resulting algorithm was tested on an HPC cluster and showed good performance.
  2. A prologue/epilogue mechanism is used as a foundation for the RJMS [70]. It promotes communication between HPC and Big Data systems while reducing the disturbance to HPC workloads. This is achieved by leveraging Big Data frameworks, whose resilience is already present.

HPC offers high-performance hardware, including Parallel File Systems (PFS) and fast interconnects, which can be advantageous [71]. Some works, contrary to what is mentioned above, take traditional HPC technologies and build new Big Data tools with them; MPI is one way to do it. Such tools are chosen for their performance, even though this does not compensate for their lack of resilience [72]. Using traditional Big Data tools on HPC infrastructure is not a waste if their resilience and dynamicity are leveraged. At the resource-management level, there are several techniques for running Big Data and HPC workloads side by side.

The straightforward technique is to operate two clusters, each dedicated to one workload, but data exchange between the clusters becomes a bottleneck, and the separation of concerns means there is no load balancing between them. A simpler approach is to let the user manage the Big Data software inside HPC on their own. This works for simple workflows that do not move much data, but it has drawbacks. First, even if the scripts serve the user well, keeping up with the system becomes hard as the stack grows. Second, the Big Data software stack must be deployed, configured, started and shut down for each job, an overhead that matters for small jobs. Third, if the data is too large to fit in the user's home directory, third-party storage is required for moving data, which can affect execution time [73].

The converse is to run HPC jobs on a Big Data stack. However, most HPC applications are not compatible with Big Data RJMS, as a communication layer is required; HPC applications would have to be rewritten to become approachable this way. In an integrated approach, new abstractions are needed to make the two one system, converting Big Data RJMS resource requirements into HPC RJMS allocations [74]. However, these approaches are tied to particular technologies and evolve with them, which limits them: the solution is restricted to the implemented adaptors, whose cost can be high.

A more suitable approach is to encapsulate HPC in Hadoop, or vice versa, using a Pilot-based abstraction [75]; the authors suggest joining the two systems in software.

2.3 Topology-Aware

  1. The growing complexity of computing platforms has fuelled the need for programmers and developers to understand the hardware organisation and adapt their application layout accordingly. As part of the general optimisation process, there is a substantial need for a platform-modelling tool, and hwloc is the most popular software tool for exposing a static view of the topology of CPUs and memory.

The paper in [76] shows how hwloc is extended to further computing resources by also incorporating the topology of I/O devices and offering ways to manage multiple nodes. I/O locality information has been added to the hwloc tree representing the hardware, along with numerous attributes that let applications identify the resources they use, place tasks close to them, or adapt to their locality. An API was later introduced to manipulate the topologies of remote hosts, with compression for better scalability.

The features highlighted in that paper appear in hwloc v1.9 (released in spring 2014). Ongoing work concentrates on improving topology discovery on emerging ARM designs for high-performance computing, as well as automatic conflict management between concurrent sources of information.

  1. Dedicating suitable hardware resources to a particular application is not new and has been explored in earlier research. It is clearly visible in the context of grid environments [77]–[79], where choosing the best combination of resources (clusters) is an essential factor. The advantage of such an approach is that it reduces the impact of WAN communication in grids, but it does not account for finer topology details, such as the effects of NUMA or the cache hierarchy.

More recently, some works have targeted specific types of applications, such as MapReduce-based applications. TARA [80], for instance, uses a description of the application for resource allocation. However, it is designed for a particular set of applications only and does not address other hardware details.

Mapping a parallel application's processes onto the physical hardware with the help of the network topology can lead to significant performance improvements [81]. The scheduler can take network-topology characteristics into account [82], favouring a set of nodes placed on the same network level, connected under the same switch, or located near each other, so as to avoid long-distance communications. Most open-source and proprietary RJMS take this kind of feature into account by using the underlying physical topology characteristics; however, they fail to consider application behaviour during resource allocation. HTCondor (formerly Condor) [83] takes advantage of a matchmaking method that matches the requirements of applications with the available hardware resources; this method does not consider application behaviour, and HTCondor applies to both clusters and networks of workstations. Slurm [84] provides a feature that minimises the number of network switches used in an allocation, thereby reducing communication costs during the execution of the application; switches located deeper in the tree topology imply lower communication costs than those in the top layers. Similarly, PBS Pro [85] and LSF [86] exploit the same concept of topology-aware placement, and Fujitsu [87] employs a similar technique, but only for its proprietary Tofu interconnect. To our knowledge, Slurm [84] is still the only one that provides best-fit topology-aware selection, while the others offer only first-fit algorithms.
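The difference between best-fit and first-fit selection can be illustrated on a flat switch/leaf-node layout. This is a schematic sketch in the spirit of Slurm's switch-count minimisation, not its actual algorithm; the data layout and tie-breaking are assumptions:

```python
def best_fit_switch(switches, needed):
    """Best-fit topology-aware node selection sketch.

    switches: dict switch name -> list of free nodes under that switch.
    Prefer the single switch with the *smallest sufficient* number of free
    nodes, so the job stays under one switch and larger switches remain
    available for bigger jobs. Fall back to spanning switches (extra hops).
    """
    fitting = [(len(nodes), name) for name, nodes in switches.items()
               if len(nodes) >= needed]
    if fitting:
        _, best = min(fitting)            # tightest switch that still fits
        return switches[best][:needed]
    # No single switch fits: span several switches, first-fit style.
    chosen = []
    for nodes in switches.values():
        chosen.extend(nodes[: needed - len(chosen)])
        if len(chosen) == needed:
            return chosen
    return None                           # not enough free nodes at all
```

A first-fit selector would simply take the first switch with enough nodes; the best-fit rule above instead keeps fragmentation low, which is the advantage attributed to Slurm in the text.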

Some RJMS offer multiple options for task placement that can support a careful arrangement of the application's processes; Torque [88], for instance, provides NUMA-based job and task placement. However, in these existing works only the network topology is considered, and the internal architecture of the nodes is left unaddressed, even though performance gains can be obtained from exploiting the memory hierarchy.

Multiple binding strategies are available and supported in Open MPI. In all of these solutions, the user must first discover the architectural details of the machine before submitting a job or task. Moreover, the placement options put the burden on the user to decide on an approach beforehand, and the communication pattern of the application is not taken into account.
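Two such binding strategies can be sketched as follows. This is a simplified model of the kind of policies Open MPI exposes via options such as `--map-by core` and `--map-by socket`, not Open MPI's implementation: packing ranks onto consecutive cores favours cache sharing between neighbouring ranks, while round-robin across sockets spreads memory-bandwidth demand.

```python
def map_by_core(n_ranks, sockets):
    """Pack ranks onto cores sequentially: fill socket 0, then socket 1, ..."""
    cores = [c for sock in sockets for c in sock]
    return {rank: cores[rank % len(cores)] for rank in range(n_ranks)}


def map_by_socket(n_ranks, sockets):
    """Round-robin ranks across sockets, spreading memory-bandwidth demand."""
    placement, next_core = {}, [0] * len(sockets)
    for rank in range(n_ranks):
        s = rank % len(sockets)
        placement[rank] = sockets[s][next_core[s] % len(sockets[s])]
        next_core[s] += 1
    return placement


sockets = [[0, 1, 2, 3], [4, 5, 6, 7]]  # two sockets, four cores each
print(map_by_core(4, sockets))    # {0: 0, 1: 1, 2: 2, 3: 3}
print(map_by_socket(4, sockets))  # {0: 0, 1: 4, 2: 1, 3: 5}
```

Which policy wins depends on the application's communication pattern, which is exactly the information these static placement options do not capture.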

2.4 Decomposition Techniques

  1. The process of parallelising software applications can be extremely tedious and error-prone, particularly the data decomposition task [89], [90].

When choosing a parallelisation strategy, data decomposition and the subsequent distribution of data or task processing across the available cores is a fundamental point to consider [91], [92].

Several researchers have conducted studies to better understand the needs of parallel programs and the state of practice of parallelisation [93]–[95]. Such studies have contributed to understanding project organisation, software usage, debugging, testing and tuning. Data decomposition, on the other hand, has received only limited attention. This is surprising, given that data decomposition is a significant challenge in parallel programming [90].

To date, there has been insufficient empirical research in the High-Performance Computing field focused on these issues. The task of performing data decomposition and communication while parallelising applications has been examined empirically in [96]. The use of tools to support this task was studied, and the state of practice was also investigated. Furthermore, a set of essential requirements was derived for tools that support data decomposition and communication during application parallelisation.

  1. The idea of decomposing multi-dimensional arrays in the multi-view data model [97] is similar to that of the distributed arrays supported by specific languages and frameworks. High-Performance Fortran (HPF) offers compiler directives that define the distribution of array elements [98]. The easiest way to decompose an array is to assign keywords such as BLOCK or CYCLIC to each dimension. Partitioned Global Address Space (PGAS) provides a locality-aware shared address space among multiple processes and is supported by languages developed for high-performance computing, such as Co-array Fortran, Unified Parallel C [99], UPC++ [100], X10 [31], and Chapel [101]. PGAS languages also provide simple notations for describing data decomposition. However, neither HPF nor these PGAS languages have a concept of View as in [97]. As a result, users have to manually write code that maps tasks to processes by considering data location when the task itself should run in parallel. Hierarchically Tiled Array (HTA) is a distributed array that enhances locality and parallelism to improve performance on multicore systems [102]. An HTA is composed of hierarchical tiles that contain sub-tiles or values, and a unit of tiles performs data assignment to processes and data processing. Although it may be possible to achieve the same effect of data distribution and processing by using HTA, users have to carefully consider the hierarchy and indices of tiles when writing such code. Habanero-Java provides Array-View, in which a one-dimensional array can be viewed as any higher-dimensional array, to improve productivity [103].
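The BLOCK and CYCLIC distributions mentioned above can be sketched as follows. This is a minimal one-dimensional illustration (HPF applies the same keywords per array dimension): each function returns, for every process, the global indices of the array elements it owns, for an array of `n` elements over `p` processes.

```python
def block(n, p):
    """BLOCK: contiguous chunks of ceil(n/p) elements per process."""
    chunk = -(-n // p)  # ceiling division
    return [list(range(r * chunk, min((r + 1) * chunk, n))) for r in range(p)]


def cyclic(n, p):
    """CYCLIC: element i goes to process i mod p (round-robin)."""
    return [list(range(r, n, p)) for r in range(p)]


print(block(8, 3))   # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(cyclic(8, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5]]
```

BLOCK keeps neighbouring elements on the same process (good for stencil-style locality), whereas CYCLIC balances the load when the per-element cost varies across the index range.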

2.5 Programming Models

2.5.1 Parallel Programming Models

  1. Recently, most parallel computing systems have become heterogeneous at the node level. Such nodes may combine general-purpose CPUs with performance accelerators (e.g., GPUs or Intel Xeon Phi) that offer superior energy-efficiency characteristics. However, exploiting the performance available in heterogeneous systems is not easy and can be a tedious and challenging process. There is a wide range of parallel programming models (such as OpenCL, OpenMP and CUDA), and selecting the right one for a target context is not straightforward.

Table 1 below summarises the main features of the parallel programming models considered in the study in [104]:

OpenMP [105], OpenACC [106], OpenCL [107], and CUDA [108]. OpenACC and OpenMP are mainly implemented as C, C++, and Fortran compiler directives, which hide much of the architectural detail from the programmer. CUDA and OpenCL, in contrast, are implemented as software libraries for C and C++ only and expose the developer to lower-level architectural details. As far as parallelism support is concerned, most of the considered models offer broad support for asynchronous task parallelism and data parallelism. While OpenCL and OpenMP provide parallelisation mechanisms for both host multi-core CPUs and many-core accelerator devices, CUDA and OpenACC support parallelisation only for accelerators such as NVIDIA GPUs [109].

Table 1. Major features of OpenACC, OpenMP, OpenCL, and CUDA. Adapted from Memeti et al. [104].

Parallelism
  OpenACC – device only; asynchronous task parallelism; data parallelism
  OpenMP  – host and device; asynchronous task parallelism; data parallelism
  OpenCL  – host and device; asynchronous task parallelism; data parallelism
  CUDA    – device only; asynchronous task parallelism; data parallelism

Architecture abstraction
  OpenACC – memory hierarchy; explicit data mapping; explicit data movement
  OpenMP  – memory hierarchy; data binding; computation binding; explicit data mapping; explicit data movement
  OpenCL  – memory hierarchy; explicit data mapping; explicit data movement
  CUDA    – memory hierarchy; explicit data mapping; explicit data movement

Synchronization
  OpenACC – reduction; join
  OpenMP  – barrier; reduction; join
  OpenCL  – barrier; reduction
  CUDA    – barrier

Framework implementation
  OpenACC – compiler directives for C/C++ and Fortran
  OpenMP  – compiler directives for C/C++ and Fortran
  OpenCL  – C/C++ extension
  CUDA    – C/C++ extension


  1. Concerning distributed computing, MPI-based parallel implementations of various algorithms that are generally scalable to a large number of processors may still suffer efficiency loss on such architectures for some applications: when many processor cores are involved, parallel efficiency drops significantly [110]. OpenMP, on the other hand, is a shared-memory parallel programming system that provides thread-level parallelism. Combining OpenMP and MPI parallelisation in a hybrid program can achieve two levels of parallelism. This approach decreases the communication overhead of MPI while introducing OpenMP overhead due to thread creation and destruction. Hybrid MPI/OpenMP programs have been shown to perform better than pure MPI implementations for several applications [111], [112]. The OpenACC Application Program Interface provides compiler directives, library calls and environment variables that enable programmers to write parallel code for various systems, such as General-Purpose Graphics Processing Units (GPGPUs) [113]. Combining OpenACC and MPI makes it possible to run a parallel code in cluster mode with more than one GPU distributed among the cluster nodes, harnessing the total number of cores involved.

2.5.2 Parallel and Big Data Programming Models: Performance Perspective

  1. It has been widely noted that MPI-based HPC frameworks outperform Spark- or Hadoop-based big data frameworks by an order of magnitude or more across a variety of application domains, e.g., k-nearest neighbours and support vector machines [72], k-means [114], graph analytics [115], [116], and large-scale matrix factorisations [117]. A recent performance analysis of Spark showed that compute load, specifically serialisation and deserialisation time, was the primary bottleneck in some Spark applications [118]. Other work has tried to bridge the HPC-big data gap by using MPI-based communication primitives to improve performance. For example, Lu et al. [119] show how replacing the map-reduce communication layer in Hadoop (Jetty) with an MPI derivative (DataMPI) can lead to better performance; the drawback of this approach is that it is not a drop-in replacement for Hadoop, and existing modules need to be re-coded to use DataMPI. It has been shown that simple MPI-based applications can be made elastic in the number of nodes [120] through periodic data redistribution among the required MPI ranks. Efforts to add fault tolerance to MPI have been ongoing since at least 2000, when Fagg and Dongarra proposed FT-MPI [121]. Although the MPI standard has still not integrated any fault-tolerance mechanism, proposed solutions continue to be put forth, e.g., Fenix [122]. However, there is a tremendous productivity gap between the APIs of Fenix and Spark. Several machine learning libraries, such as H2O.ai [123] and deeplearning4j [124], support interoperating with Spark and include their own communication primitives. However, these are Java-based approaches and do not provide direct integration of existing native-code MPI-based libraries. SWAT [125], which stands for Spark With Accelerated Tasks, generates OpenCL code from JVM code at runtime to improve Spark performance. As opposed to our work, SWAT is limited to single-node optimisations; it does not have access to the communication improvements available through MPI.
