Auto-tuning a High-Level Language Targeted to GPU Codes

  

Each folder (CUDA, OpenCL, and HMPP) contains a 'compileCodes.sh' script that compiles each code in that section. See the paper 'Auto-tuning a High-Level Language Targeted to GPU Codes' for more about the transformations and the best optimized configurations.

Publication: Scheduling Challenges and Opportunities in Integrated CPU+GPU Processors. In ESTIMedia'16: Proceedings of the 14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia, October 2016, pages 78–83. https://doi.org/10.1145/2993452.2994307
GPU manycore devices have demonstrated their usefulness in accelerating computationally intensive problems. Although arriving at a parallelization of a highly parallel algorithm is an affordable task, optimizing GPU codes is a challenging activity, mainly because of the number of parameters, programming choices, and tuning techniques available. High-level DSLs, like Obsidian, are well suited to take advantage of auto-tuning because of their ability to expose compile-time decisions as ordinary parameters in their meta-language.
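
As a concrete illustration of exposing a compile-time decision as a tunable parameter, here is a minimal C sketch; `TILE_SIZE`, `N`, and `matmul_tiled` are illustrative names, not taken from the paper or the benchmark suite. An auto-tuner can sweep the tile size simply by recompiling with different `-DTILE_SIZE=...` values.

```c
#include <stdio.h>
#include <string.h>

#ifndef TILE_SIZE
#define TILE_SIZE 16   /* the tunable: override with -DTILE_SIZE=... */
#endif
#define N 256          /* illustrative problem size */

/* Tiled matrix multiply; C must be zeroed by the caller. */
static void matmul_tiled(const float *A, const float *B, float *C)
{
    for (int ii = 0; ii < N; ii += TILE_SIZE)
        for (int jj = 0; jj < N; jj += TILE_SIZE)
            for (int k = 0; k < N; k++)
                for (int i = ii; i < ii + TILE_SIZE && i < N; i++)
                    for (int j = jj; j < jj + TILE_SIZE && j < N; j++)
                        C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    static float A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 1.0f; }
    memset(C, 0, sizeof(C));
    matmul_tiled(A, B, C);
    printf("TILE_SIZE=%d, C[0]=%f (expected %d)\n", TILE_SIZE, C[0], N);
    return 0;
}
```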

Heterogeneous processors with architecturally different devices (CPU and GPU) integrated on the same die provide good performance and energy efficiency for a wide range of workloads. However, they also create challenges and opportunities in terms of scheduling workloads on the appropriate device. Current scheduling practices mainly use the characteristics of kernel workloads to decide the CPU/GPU mapping. In this paper we first provide detailed infrared imaging results that show the impact of mapping decisions on the thermal and power profiles of CPU+GPU processors. Furthermore, we observe that runtime conditions such as power and CPU load from traditional workloads also affect the mapping decision. To exploit our observations, we propose techniques to characterize OpenCL kernel workloads at run-time and map them onto the appropriate device under time-varying physical (i.e., chip power limit) and CPU load conditions, in particular the number of CPU cores available to the OpenCL kernel. We implement our dynamic scheduler on a real CPU+GPU processor and evaluate it using various OpenCL benchmarks. Compared to the state-of-the-art kernel-level scheduling method, the proposed scheduler provides up to 31% and 10% improvements in runtime and energy, respectively.

References

  1. C. Augonnet et al. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, 23(2):187--198, Feb. 2011.
  2. P. E. Bailey et al. Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems. In International Conference on Parallel Processing, pages 371--380, Sept. 2014.
  3. S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous Computing. In International Symposium on Workload Characterization, pages 44--54, Oct. 2009.
  4. H. J. Choi et al. An Efficient Scheduling Scheme Using Estimated Execution Time for Heterogeneous Computing Systems. The Journal of Supercomputing, pages 886--902, 2013.
  5. K. Dev et al. Workload-aware Power Gating Design and Run-time Management for Massively Parallel GPGPUs. In IEEE Computer Society Annual Symposium on VLSI, pages 242--247, 2016.
  6. K. Dev, A. Nowroz, and S. Reda. Power Mapping and Modeling of Multi-core Processors. In IEEE International Symposium on Low Power Electronics and Design (ISLPED), pages 39--44, Sept. 2013.
  7. G. F. Diamos and S. Yalamanchili. Harmony: An Execution Model and Runtime for Heterogeneous Many Core Systems. In International Symposium on High Performance Distributed Computing, pages 197--200, 2008.
  8. S. Grauer-Gray et al. Auto-Tuning a High-Level Language Targeted to GPU Codes. In Innovative Parallel Computing (InPar), pages 1--10, May 2012.
  9. C. Gregg, J. S. Brantley, and K. Hazelwood. Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems. In HotPar '10, 2010.
  10. C. Gregg et al. Dynamic Heterogeneous Scheduling Decisions Using Historical Runtime Data. In Proceedings of the 2nd Workshop on Applications for Multi- and Many-core Processors, 2011.
  11. J. Lee et al. SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration. ACM Transactions on Computer Systems, 33:1--27, Aug. 2015.
  12. C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, pages 45--55, 2009.
  13. S. Mittal and J. S. Vetter. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys, 47(4):1--35, July 2015.
  14. P. Pandit and R. Govindarajan. Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices. In Proceedings of the International Symposium on Code Generation and Optimization, pages 273--283, 2014.
  15. J. A. Pienaar, A. Raghunathan, and S. Chakradhar. MDR: Performance Model Driven Runtime for Heterogeneous Parallel Platforms. In Proceedings of the International Conference on Supercomputing, pages 225--234, 2011.
  16. K. Spafford, J. Meredith, and J. Vetter. Maestro: Data Orchestration and Tuning for OpenCL Devices. In Proceedings of the International Euro-Par Conference on Parallel Processing: Part II, pages 275--286, 2010.
  17. C. D. Spradling. SPEC CPU2006 Benchmark Tools. SIGARCH Computer Architecture News, 35(1):130--134, 2007.
  18. J. A. Stratton et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. UIUC Tech. Rep. IMPACT-12-01, 2012.
  19. Y. Wen, Z. Wang, and M. O'Boyle. Smart Multi-Task Scheduling for OpenCL Programs on CPU/GPU Heterogeneous Platforms. In International Conference on High Performance Computing, pages 1--10, 2014.
# PolyBench/ACC

### Copyright (c) 2012-2014 University of Delaware

## Contacts

  • Scott Grauer-Gray (sgrauerg@gmail.com)
  • William Killian (killian@udel.edu)
  • John Cavazos (cavazos@udel.edu)
  • Robert Searles (rsearles@udel.edu)
  • Lifan Xu (xulifan@udel.edu)

## Targets

  • CUDA
  • OpenCL
  • HMPP
  • OpenACC
  • OpenMP

This benchmark suite is partially derived from the PolyBench benchmark suite developed by Louis-Noel Pouchet and available at http://www.cs.ucla.edu/~pouchet/software/polybench/


#### If using this work, please cite the following paper:

Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. Auto-tuning a High-Level Language Targeted to GPU Codes. Proceedings of Innovative Parallel Computing (InPar '12), 2012.

##### Paper download: http://cavazos-lab.github.io/Polybench-ACC/Autotuning.a.High-Level.Language.Targeted.to.GPU.Codes-paper.pdf

## Available Benchmarks

#### datamining

  • correlation
  • covariance

#### linear-algebra/kernels

  • 2mm
  • 3mm
  • atax
  • bicg
  • cholesky [*]
  • doitgen
  • gemm
  • gemver
  • gesummv
  • mvt
  • symm [*]
  • syr2k
  • syrk
  • trisolv [*]
  • trmm [*]

#### linear-algebra/solvers

  • durbin [*]
  • dynprog [*]
  • gramschmidt
  • lu
  • ludcmp [*]

#### stencils

  • adi
  • convolution-2d
  • convolution-3d
  • fdtd-2d
  • jacobi-1d-imper
  • jacobi-2d-imper
  • seidel-2d [*]

[*] - not available for CUDA or OpenCL

## Environment Configuration

### CUDA:

  1. Set up PATH and LD_LIBRARY_PATH environment variables to point to CUDA installation
  2. Run make in target folder(s) with codes to generate executable(s)
  3. Run the generated executable file(s).

### OpenCL:

  1. Set up PATH and LD_LIBRARY_PATH environment variables to point to OpenCL installation
  2. Set location of SDK in common.mk file in utilities folder (in OpenCL directory)
  3. Run make in target folder(s) to generate executable(s)
  4. Run the generated executable file(s).

### HMPP (CAPS Compiler)

  1. Set up PATH and LD_LIBRARY_PATH environment variables to point to the CAPS compiler installation
  2. Set up PATH and LD_LIBRARY_PATH environment variables to point to CUDA/OpenCL installation
  3. Set up HMPP/OpenACC environment variables with source hmpp-env.sh or caps-env.sh
  4. Run make exe in target folder(s) with codes to generate executable(s)
  5. Run the generated executable file(s).

### OpenACC (RoseACC)

  1. Set up PATH and LD_LIBRARY_PATH environment variables for RoseACC (see RoseACC's Getting Started)
  2. Run make exe in target folder(s) with codes to generate executable(s)
  3. Run the generated executable file(s).


## Modifying Codes

Parameters such as the input sizes, data type, and threshold for GPU-CPU output comparison can be modified using constants within the codes and .h files. After modifying, run make clean and then make on the relevant code for the modifications to take effect in the resulting executable.

### Parameter Configuration:

#### Input Size:

By default, the STANDARD_DATASET as defined in the .cuh/.h file is used as the input size. The dataset choice can be adjusted from STANDARD_DATASET to other options (MINI_DATASET, SMALL_DATASET, etc.) in the .cuh/.h file, the dataset size can be adjusted by defining the input size manually in the .cuh/.h file, or the input size can be changed by simply adjusting STANDARD_DATASET so the program has different input dimensions.
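
For illustration, the selection logic in a header typically looks like the following; the concrete sizes here are made up, not the suite's actual values:

```c
/* Illustrative only: the real .cuh/.h files define their own sizes. */
#if defined(MINI_DATASET)
#  define NI 32
#  define NJ 32
#elif defined(SMALL_DATASET)
#  define NI 128
#  define NJ 128
#elif defined(LARGE_DATASET)
#  define NI 2000
#  define NJ 2000
#else                    /* STANDARD_DATASET is the default */
#  define NI 1024
#  define NJ 1024
#endif
```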

#### RUN_ON_CPU (in .cu/.c files):

Declares whether the kernel will be run on both the accelerator and the CPU (with the run-time for each given and the outputs compared) or only on the accelerator. By default, RUN_ON_CPU is defined so the kernel is run on both the accelerator and the CPU, making it easy to compare accelerator/CPU outputs and run-times. Commenting out or removing the #define RUN_ON_CPU statement and re-compiling the code will cause the kernel to be run only on the accelerator.
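
A minimal sketch of this structure, with placeholder stand-ins for the benchmark's real routines:

```c
#include <stdio.h>

/* Placeholder stand-ins for the benchmark's real routines. */
static void init_arrays(void)     { }
static void kernel_gpu(void)      { puts("accelerator run"); }
static void kernel_cpu(void)      { puts("CPU reference run"); }
static void compare_results(void) { puts("comparing outputs"); }

#define RUN_ON_CPU  /* comment out to run only on the accelerator */

int main(void)
{
    init_arrays();
    kernel_gpu();
#ifdef RUN_ON_CPU          /* reference run plus output comparison */
    kernel_cpu();
    compare_results();
#endif
    return 0;
}
```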

#### DATA_TYPE (in .cuh/.h files):

By default, the DATA_TYPE used in these codes is float; it can be changed to double by changing the DATA_TYPE typedef. Note that in OpenCL, the DATA_TYPE needs to be changed in both the .h and .cl files, as the .cl files contain the kernel code and are compiled separately at run-time.
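
A minimal sketch of the pattern; the `dot` routine is illustrative only:

```c
#include <stdio.h>

typedef float DATA_TYPE;   /* change to: typedef double DATA_TYPE;
                            * for OpenCL, mirror the change in the .cl file */

/* Illustrative kernel-style routine parameterized on DATA_TYPE. */
static DATA_TYPE dot(const DATA_TYPE *x, const DATA_TYPE *y, int n)
{
    DATA_TYPE acc = (DATA_TYPE)0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

int main(void)
{
    DATA_TYPE x[4] = {1, 2, 3, 4}, y[4] = {1, 1, 1, 1};
    printf("dot = %f\n", (double)dot(x, y, 4));
    return 0;
}
```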

#### PERCENT_DIFF_ERROR_THRESHOLD (in .cu/.c files):

The PERCENT_DIFF_ERROR_THRESHOLD is the percent difference (0.0-100.0) by which the GPU and CPU results may differ and still be considered 'matching'; this parameter can be adjusted for each code in the input code file.
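
A rough sketch of the comparison this describes (the threshold value and `compare_results` function are illustrative, not the suite's exact implementation):

```c
#include <math.h>
#include <stdio.h>

#define PERCENT_DIFF_ERROR_THRESHOLD 0.05   /* illustrative value, in percent */

/* Count elements whose CPU/GPU percent difference exceeds the threshold. */
static int compare_results(const float *cpu, const float *gpu, int n)
{
    int fails = 0;
    for (int i = 0; i < n; i++) {
        float denom = (cpu[i] != 0.0f) ? fabsf(cpu[i]) : 1.0f;
        float pct   = 100.0f * fabsf(cpu[i] - gpu[i]) / denom;
        if (pct > PERCENT_DIFF_ERROR_THRESHOLD)
            fails++;
    }
    return fails;
}

int main(void)
{
    float cpu[3] = {1.00f, 2.00f, 3.00f};
    float gpu[3] = {1.00f, 2.01f, 3.00f};   /* one element off by 0.5% */
    printf("%d element(s) above %.2f%%\n",
           compare_results(cpu, gpu, 3), PERCENT_DIFF_ERROR_THRESHOLD);
    return 0;
}
```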

#### OPENCL_DEVICE_SELECTION (in .c files for OpenCL):

Declares the type of accelerator to use for running the OpenCL kernel(s); a minimal host-code sketch follows the list below.

  • CL_DEVICE_TYPE_GPU - run the OpenCL kernel on the GPU (default)
  • CL_DEVICE_TYPE_CPU - run the OpenCL kernel on the CPU
  • CL_DEVICE_TYPE_ACCELERATOR - run the OpenCL kernel on another accelerator such as the Intel Xeon Phi processor or IBM Cell Blade
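
For illustration, here is a minimal OpenCL host-code sketch of how such a device-type constant can drive device selection (error handling abbreviated; this is not the suite's exact code):

```c
#include <stdio.h>
#include <CL/cl.h>

#define OPENCL_DEVICE_SELECTION CL_DEVICE_TYPE_GPU  /* or _CPU / _ACCELERATOR */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    /* Pick the first platform, then ask it for a device of the chosen type. */
    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) return 1;

    err = clGetDeviceIDs(platform, OPENCL_DEVICE_SELECTION, 1, &device, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "no device of the requested type found\n");
        return 1;
    }

    char name[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("selected device: %s\n", name);
    return 0;
}
```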

#### Other available options

These are passed as macro definitions at compile time (e.g., -Dname_of_the_option) or can be added with a #define in the code. A small sketch of the stack-versus-heap toggle follows the list.

  • POLYBENCH_STACK_ARRAYS (only applies to allocation on host): use stack allocation instead of malloc [default: off]

  • POLYBENCH_DUMP_ARRAYS: dump all live-out arrays on stderr [default: off]

  • POLYBENCH_CYCLE_ACCURATE_TIMER: Use the Time Stamp Counter to monitor the execution time of the kernel [default: off]

  • MINI_DATASET, SMALL_DATASET, STANDARD_DATASET, LARGE_DATASET, EXTRALARGE_DATASET: set the dataset size to be used [default: STANDARD_DATASET]

  • POLYBENCH_USE_C99_PROTO: Use standard C99 prototype for the functions.

  • POLYBENCH_USE_SCALAR_LB: Use scalar loop bounds instead of parametric ones.
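
As a rough sketch of what the POLYBENCH_STACK_ARRAYS toggle does (illustrative, not the suite's exact implementation):

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* illustrative size */

/* Compile with -DPOLYBENCH_STACK_ARRAYS to switch host allocation
 * from the heap (malloc) to the stack. */
int main(void)
{
#ifdef POLYBENCH_STACK_ARRAYS
    float A[N];                            /* stack allocation */
#else
    float *A = malloc(N * sizeof(float));  /* heap allocation */
    if (A == NULL) return 1;
#endif
    for (int i = 0; i < N; i++)
        A[i] = (float)i;
    printf("A[N-1] = %f\n", A[N - 1]);
#ifndef POLYBENCH_STACK_ARRAYS
    free(A);
#endif
    return 0;
}
```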

## Contributions

The following people contributed to this benchmark suite:

  • Lifan Xu -- Original implementation of CUDA and OpenCL kernels
  • Robert Searles -- Original implementation of HMPP kernels (version 2.x)
  • Scott Grauer-Gray -- Modified implementations of CUDA and OpenCL
  • William Killian -- Modified HMPP kernels (updated to 3.x), OpenACC kernels, OpenMP kernels


## Acknowledgement

This work was funded in part by the U.S. National Science Foundation through NSF CAREER award 0953667 and by the Defense Advanced Research Projects Agency through the DARPA Computer Science Study Group (CSSG).