Failure tolerance through a more distributed design #52

Open
slizzered opened this issue Feb 10, 2015 · 0 comments

@slizzered (Contributor)

Our current approach to distributing the workload across a cluster is based on MPI, and even #17 will use MPI for now. The following idea is inspired by distributed systems. I'm not sure it is actually necessary, but the performance impact should be acceptable and the failure tolerance would improve considerably. Let me know what you think!

Problem:

  • There is no failure tolerance.
  • If one node crashes, the application goes down with it.
  • No new nodes can be added; everything has to be static from the start.

Idea:

Use approaches that are common in distributed systems:

  • The HeadNode is the server (there could be mirrored backup HeadNodes!) and the only node with actual persistent state: the results for the sampling points and the set of currently working ComputeNodes.
  • Each ComputeNode is an independent client and holds only soft state.
  • When a ComputeNode starts, it locates the HeadNode (either via a fixed hostname for the HeadNode or via some service-discovery broadcast; see the first sketch after this list).
  • ComputeNodes request work from the HeadNode. When the work is done, they send the result back (as it is done now).
  • During a computation, ComputeNodes send a heartbeat signal to the HeadNode. If the heartbeat stops, the HeadNode assumes the node is dead and re-schedules its sampling point (see the HeadNode sketch below).
  • For very long computations (or after a broken ComputeNode has been repaired), ComputeNodes can be started while the computation is running and will simply request work from the HeadNode.
  • Simple communication over UDP or TCP, with no MPI dependency anymore.
  • It would also be possible (and very easy) to add ComputeNodes that run a CPU code or use different accelerators; all ComputeNodes would simply have to implement the same network interface.
  • In theory, this even allows scaling across multiple clusters or out to home computers 👯
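
To make the service-discovery bullet concrete, here is a minimal sketch of how a ComputeNode could find the HeadNode with a plain UDP broadcast (POSIX sockets). The discovery port 9999 and the `WHERE_IS_HEADNODE` message are invented for illustration; nothing like this exists in the code base yet.

```cpp
// Hypothetical HeadNode discovery via UDP broadcast (POSIX sockets).
// The port number and the query string are illustrative assumptions.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <iostream>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int broadcastEnable = 1;
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST,
               &broadcastEnable, sizeof(broadcastEnable));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);                     // hypothetical discovery port
    addr.sin_addr.s_addr = htonl(INADDR_BROADCAST);

    const char query[] = "WHERE_IS_HEADNODE";
    sendto(sock, query, sizeof(query), 0,
           reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    // Whoever answers is assumed to be the HeadNode; the sender address
    // filled in by recvfrom is all the ComputeNode needs to connect.
    char reply[64];
    sockaddr_in from{};
    socklen_t fromLen = sizeof(from);
    if (recvfrom(sock, reply, sizeof(reply), 0,
                 reinterpret_cast<sockaddr*>(&from), &fromLen) > 0) {
        std::cout << "HeadNode found at " << inet_ntoa(from.sin_addr) << "\n";
    }
    close(sock);
}
```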
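
And a sketch of the HeadNode's bookkeeping, assuming work requests, heartbeats, and results arrive as messages over some transport like the one above. All names (`HeadNode`, `requestWork`, `reschedule`, the 30 s timeout) are invented for illustration, not part of the existing code:

```cpp
#include <chrono>
#include <deque>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// The only persistent state lives here: the finished results, plus the
// bookkeeping needed to re-schedule sampling points of dead nodes.
struct HeadNode {
    std::deque<int> pending;                         // sampling points not yet handed out
    std::map<int, std::pair<std::string, Clock::time_point>>
        inFlight;                                    // point -> (node, last heartbeat)
    std::map<int, double> results;                   // persisted results
    std::chrono::seconds timeout{30};                // assumed heartbeat timeout

    // A ComputeNode asks for work and leases one sampling point.
    std::optional<int> requestWork(const std::string& nodeId) {
        if (pending.empty()) return std::nullopt;
        int point = pending.front();
        pending.pop_front();
        inFlight[point] = {nodeId, Clock::now()};
        return point;
    }

    // A ComputeNode reports it is still alive and working on `point`.
    void heartbeat(int point) {
        auto it = inFlight.find(point);
        if (it != inFlight.end()) it->second.second = Clock::now();
    }

    // A ComputeNode delivers the finished result; the lease is cleared.
    void complete(int point, double value) {
        results[point] = value;
        inFlight.erase(point);
    }

    // Called periodically: a sampling point whose heartbeat went silent
    // for longer than `timeout` is assumed lost and put back in the queue.
    void reschedule() {
        auto now = Clock::now();
        for (auto it = inFlight.begin(); it != inFlight.end();) {
            if (now - it->second.second > timeout) {
                pending.push_back(it->first);
                it = inFlight.erase(it);
            } else {
                ++it;
            }
        }
    }
};

int main() {
    HeadNode head;
    head.pending = {0, 1, 2};
    auto p = head.requestWork("node-a");             // node-a leases point 0
    head.heartbeat(*p);                              // node-a is still alive
    head.complete(*p, 3.14);                         // result arrives, lease cleared
    head.reschedule();                               // nothing stale yet
    std::cout << head.results.size() << " result(s) stored\n";
}
```

One thing to watch: a re-scheduled sampling point may still be finished later by a node that was wrongly declared dead, so `complete()` should be idempotent; storing results in a map keyed by the sampling point gives that for free.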
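
For completeness, the matching ComputeNode loop, with a heartbeat thread running next to the (possibly long) computation. The network calls are reduced to hypothetical stubs here; in a real node they would wrap the UDP/TCP messages to the HeadNode:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <optional>
#include <thread>

// Stubs standing in for the actual network messages and the GPU kernel.
std::optional<int> requestWork() {
    static int remaining = 3;                        // stub: pretend 3 sampling points exist
    return remaining-- > 0 ? std::optional<int>(remaining) : std::nullopt;
}
void sendHeartbeat(int point) {}                     // stub: "still alive, working on point"
void sendResult(int point, double value) {}          // stub: deliver the finished result
double compute(int point) { return 0.0; }            // stub: the actual computation

int main() {
    while (auto point = requestWork()) {
        std::atomic<bool> done{false};

        // Heartbeat thread: keeps the HeadNode's lease on this sampling
        // point alive while the computation runs.
        std::thread heartbeat([&] {
            while (!done) {
                sendHeartbeat(*point);
                std::this_thread::sleep_for(
                    std::chrono::milliseconds(500)); // every few seconds in practice
            }
        });

        double value = compute(*point);
        done = true;
        heartbeat.join();
        sendResult(*point, value);
    }
    std::cout << "no more work, shutting down\n";
}
```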

Problems / Thoughts:

slizzered added this to the 2.0 - the next generation milestone Feb 10, 2015
erikzenker added a commit to erikzenker/haseongpu that referenced this issue Oct 5, 2015