Hardware trends suggest that large-scale CMP architectures, with tens
to hundreds of processing cores on a single piece of silicon, are
imminent within the next decade. While existing CMP machines have
traditionally been handled in the same way as SMPs, this magnitude of
parallelism introduces several fundamental challenges at the
architectural level, which in turn translate into novel challenges
in the design of the software stack for these platforms. This paper
presents the "Many Core Run Time" (McRT), a software prototype of an
integrated language runtime that was designed to explore
configurations of the software stack for enabling performance and
scalability on large-scale CMP platforms. We describe the
architecture of McRT and discuss our experiences with the system,
including an experimental evaluation that led to several interesting,
non-intuitive findings, providing key insights about the structure of
the system stack at this scale. A key contribution of this paper is to
demonstrate how McRT enables near-linear improvements in performance
and scalability for desktop workloads such as the popular XviD encoder
and a set of RMS (recognition, mining, and synthesis)
applications. Another key contribution of this work is its use of McRT
to explore non-traditional system configurations such as a
light-weight executive in which McRT runs on "bare metal" and replaces
the traditional OS. Such configurations are becoming an increasingly
attractive alternative for leveraging heterogeneous computing units, as
seen in today's CPU-GPU configurations.