CHANGES

LOG:

Version 3.1.2 versus 3.1.1
- Added support for CUDA 4.1 and 4.2 (PTX 3.0). 
- Added improved, less brittle support for associating fatbins with sections
  in the output of cuobjdump.
- Added raw simulation time printout back for CUDA.  
- Revived support for PTX file override via environment variables for CUDA 4.0
  onwards.
- Experimental support for libraries; currently not working because of
  undocumented functions in the CUDA runtime API.
- Converted fat binary loading to be done on demand.
- Added support for vote.ballot instruction. 
- Added support for popc instruction. 
- Changed ssy ptxplus instructions to have their own opcode but they still act
  as a nop right now.
- Added support for converting BRX sass instructions to brx ptxplus instructions
  in cuobjdump_to_ptxplus and started work on getting the brx ptxplus
  instruction to work in gpgpusim.
- Modified the cache hierarchy (cache_t -> baseline_cache -> [read_only_cache, data_cache, ...])
- Enabled configurable cache policies (write-back, write-through) and implemented 
  a write-allocate policy
- Added functional execution support for shared memory atomic operations
- Updated the option parser to support strings delimited by characters such as
  '=' and ';'.  This allows options with a long string of sub-options (e.g. DRAM
  timing) to have a friendlier format.
- Changed the DRAM timing option to a more reader-friendly format. 
- Added support for the PTX version 3.0 codes generated by NVIDIA OpenCL
  drivers newer than version 256.40. 
- nvopencl_wrapper now works on machines with both AMD and NVIDIA GPUs.  It will
  automatically compile the OpenCL kernels via the NVIDIA platform.
- Bug Fixes:
    - Fixed a compile error that happens with newer gcc/g++ versions (4.7.1)
    - Fixed a bug with the association between cuobjdump output and cubin
      handles (Bug #7 external)
    - Added support for atomic operations on the generic memory space.
      Previously, atomic operations worked only on the global memory space.
      (Bug #14 external)
    - Fixed a bug in conversion to ptxplus that was causing local memory store
      or load instructions to be ignored.
    - Added support for direct addressing using immediate values for the load
      and store instructions.
    - Fixed a bug that was causing inconsistency in local memory address
      calculations
    - Fixed incorrect counting of predicated instructions.  Now instructions
      are counted into gpu_sim_insn regardless of predication outcome.  (Bug
      #15 external)
    - Added support for cudaReadModeNormalizedFloat (a texture read mode). 
    - Fixed texture fetching for 1D texture with non-normalized coordinates. 
    - Fixed multi-element texel fetching (e.g. texel with RGBA components).
    - Fixed bug where global variables were not being declared in ptxplus.
    - Fixed bug with IADD.CARRY sass instruction conversion to addp ptxplus
      instruction.
    - Fixed bug with .half ptxplus instruction modifier showing up on the wrong
      instruction or not showing up at all. This should correct a bug where the
      instruction addresses in ptxplus differ from those in sass. They need to
      be the same for the brx ptxplus instruction to work.
    - Fixed a bug where the L2 cache was modelling write-back for local writes
      and write-evict for global writes; it should be write-back for all writes.
    - Fixed a bug that was causing nondeterministic kernel end detection inside
      the simulation thread.
    - Fixed clCreateProgramWithSource to accept NULL or 0 as string length parameter
    - Fixed replacement of printf modifiers when compiling OpenCL code (e.g. '%f')
    - Fixed the shared memory bank conflict model for GTX 480 and Tesla C2050.
      Added options to configure the number of shared memory banks per shader
      core and to allow a more flexible broadcast mechanism. 
    - Fixed the timing model for the LDU instruction; previously it was not
      recognized as a memory instruction in the timing model.
    - Removed memory leak in cudaSetupArgument(). 
    - Fixed .loc directive parsing for PC-Histogram (CFLog) in AerialVision. 
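
The vote.ballot and popc entries above can be illustrated with a minimal
functional sketch.  It follows the PTX definitions (ballot packs one predicate
bit per thread, popc counts set bits) and assumes all 32 threads of the warp
are active; the function names are hypothetical and this is not GPGPU-Sim's
implementation.

```python
def vote_ballot(predicates):
    """Pack one predicate per thread into a mask.

    Bit i of the result is set when thread i's predicate is true.
    Assumes every thread in the warp is active.
    """
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def popc(value, width=32):
    """Count the set bits in the low `width` bits of value."""
    return bin(value & ((1 << width) - 1)).count("1")


# Even-numbered threads vote true, odd-numbered threads vote false.
warp_preds = [lane % 2 == 0 for lane in range(32)]
ballot = vote_ballot(warp_preds)
print(hex(ballot), popc(ballot))  # prints "0x55555555 16"
```

Combining the two is a common way to count how many threads in a warp satisfy
a condition.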

Version 3.1.1 versus 3.1.0
- Added checks to the top-level makefile to ensure setup_environment is run,
  and checks to setup_environment to ensure a compatible version of CUDA is
  installed.
- Changed Fermi configuration folder name to GTX480.
- Added TeslaC2050 configuration.
- Added a two level warp scheduler (as presented at ISCA 2012 tutorial).
- Cleaner makedepend that doesn't interfere with Makefiles
- Removed a dependency on an in-house implementation of a string list in
  cuobjdump_to_ptxplus.
- Cleaned up the information printed out to stdout/stderr during the operation
  of cuobjdump_to_ptxplus.
- Changed cuobjdump_to_ptxplus to use std::string instead of char*.
- Print out instructions at branch divergence/reconvergence points
- Bug Fixes:
    - SIMT stack correctly handles recursive functions by allocating a new stack entry on every
      function call
    - Fixed a bug in executing call_imp and callp_imp with the pure functional simulation mode.
    - Fixed a couple of memory errors in cuobjdump_to_ptxplus code.
    - Implemented better support for handling memory operand type modifier
      in cuobjdump_to_ptxplus
    - MVC instructions are now handled correctly in cuobjdump_to_ptxplus; they
      can map to cvt or mov.
    - Corrected calculation of average memory latency.  Before this fix, the
      metric was only reporting the average for memory accesses from the final 
      AerialVision sampling window.  Now the metric reports average memory 
      latency for all memory accesses in the entire application run. 
    - Corrected GDDR5 parameters in Fermi configuration: 
      - Increased burst length to 8 and changed the address mapping to reflect
        16 banks as suggested by Jungrae
      - Decreased DRAM clock to 924MHz from 1848MHz. 
      - Corrected CAS Latency, Write Latency and other timing constraints. 
      - Added a new option 'dram_data_command_freq_ratio' to configure the
        frequency ratio between the DRAM data bus and command bus.  This allows
        GPGPU-Sim to support both GDDR3 (data rate = 2X command rate) and GDDR5
        (data rate = 4X command rate). 
    - Fixed the DRAM performance statistics log to display information properly
      when there are more than 4 banks in each partition (reported by Jungrae).
    - Fixed a bug in the OpenCL prebuilt PTX flow.  If a kernel made
      modifications to the .cl code at runtime (like changing the block size or
      some other constants) and then recompiled itself, the saved_embedded_ptx
      flag would just overwrite the same ptx file.  The same thing happened
      when using a prebuilt ptx file through PTX_SIM_USE_PTX_FILE: it always
      wanted to load the same ptx file even though there should have been more
      than one.
    - Fixed bug that caused $p3 to be used before it was initialized.
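
The effect of 'dram_data_command_freq_ratio' described above can be checked
with a little arithmetic.  The sketch below assumes the 924MHz figure is the
command clock and that a burst occupies the data bus for burst length divided
by the ratio, measured in command-clock cycles; the function names are
hypothetical and not part of GPGPU-Sim.

```python
GDDR3_RATIO = 2  # data rate = 2X command rate (from the entry above)
GDDR5_RATIO = 4  # data rate = 4X command rate (from the entry above)


def dram_data_rate_mhz(command_clock_mhz, data_command_freq_ratio):
    """Data bus transfer rate implied by the command clock and the ratio."""
    return command_clock_mhz * data_command_freq_ratio


def burst_command_cycles(burst_length, data_command_freq_ratio):
    """Command-clock cycles the data bus is busy for one burst (assumption)."""
    return burst_length / data_command_freq_ratio


print(dram_data_rate_mhz(924, GDDR5_RATIO))  # prints 3696
print(burst_command_cycles(8, GDDR5_RATIO))  # prints 2.0
```

Under these assumptions, halving the DRAM clock from 1848MHz to 924MHz while
using a ratio of 4 instead of 2 leaves the data rate unchanged.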

Version 3.1.0 versus 3.0.2
- Support for CUDA 4.0 for both PTX and PTXPlus.
- Support for using cuobjdump to extract PTX/SASS from binaries.
- Complete rework for the SASS to PTXPlus conversion code including support for
  code generated by cuobjdump.
- Support for configurable pipeline widths.
- Support for a configurable number of functional units within an SM.
- Model for GDDR5 bank groups.
  - GDDR5 configuration.
- Extended the result bus model to allow issuing/retiring more than one
  instruction from multiple functional units per cycle.
- Support for configurable instruction latency and initiation interval
  - Added corresponding configuration to Fermi and Quadro config files according
    to benchmark results
- Support for 64-bit atomicAdd, atomicExch, and atomicCAS. 
- Support for 32-bit floating point atomicAdd. 
- New installation instructions for AerialVision, removed deprecated install
  script.
- Rework for some makefiles
- Removed the no longer supported/used decuda_to_ptxplus
- Removed all dependencies on the boost libraries. Now GPGPU-Sim can run
  without boost.
- Added missing support for -gpgpu_perfect_mem. When enabled, this option
  models a memory system with single-cycle latency per memory request for
  accesses that miss in the L1 cache. Bandwidth is limited to one memory
  request issued per SIMT core cycle. Note this means uncoalesced accesses
  will be slower than coalesced accesses.
- Bug Fixes:
    - Added a missed call to operand collector writeback which fixes some
      writebacks that were not considered by the operand collector.
    - Fix for a bug that caused global symbol tables to be re-initialized after
      parsing each ptx file.
    - Fix for a rare race condition that prevented kernel stats from being
      printed
    - Fix for miscalculation of the gpu_total_sim_rate value printed at the end
      of a kernel launch
    - Fix for a bug that caused atomic accesses not to have a payload to the
      Atomic Operation Unit; atomic accesses now have a payload both to and
      from the Atomic Operation Unit
    - Fixed a bug that caused deadlock check to be omitted
    - Updated the Fermi config files such that when an access misses the L1 data 
      cache, it allocates a line immediately before sending a data fetch request 
      out to the memory partition. 
    - Changed the writeback arbitration among multiple clients in the LDST unit 
      to round-robin. 
    - Fixed a rare pathological case that caused LD instructions with divergent 
      memory accesses to be overcounted.  
    - Fixed linear_to_raw_address_translation::partition_address() so that it
      works for a non-power-of-two number of memory partitions.  Previously,
      it just returned the input memory address without any change.
    - Fixed warp_inst_t::memory_coalescing_arch_13() to use the
      gpgpu_shmem_warp_parts option to divide up the warps in the coalescing
      logic.  It was previously hardcoded to 2, which caused coalescing to
      always operate on half-warps.
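
The warp division mentioned in the last entry can be pictured as follows.
This is a minimal sketch assuming a 32-thread warp split into equal,
contiguous groups; split_warp is a hypothetical helper, not the code in
warp_inst_t::memory_coalescing_arch_13().

```python
def split_warp(warp_size, warp_parts):
    """Split thread indices 0..warp_size-1 into warp_parts contiguous groups."""
    if warp_parts <= 0 or warp_size % warp_parts != 0:
        raise ValueError("warp_size must be a positive multiple of warp_parts")
    subwarp_size = warp_size // warp_parts
    return [
        list(range(start, start + subwarp_size))
        for start in range(0, warp_size, subwarp_size)
    ]


# warp_parts = 2 reproduces the old hardcoded half-warp behaviour.
half_warps = split_warp(32, 2)
print([len(group) for group in half_warps])  # prints [16, 16]

# warp_parts = 1 lets the coalescing logic consider the whole warp at once.
print(len(split_warp(32, 1)[0]))  # prints 32
```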

Version 3.0.2 versus 3.0.1
- Added Fermi configuration
- Revived functional simulation mode
- AerialVision: Revived support for PC-Histogram
- AerialVision: Revived collection of source code view metrics (all except
  exposed pipeline latency)
- Added missing function implementations (caused static linking with GPGPU-Sim
  to fail).
- Fixed div_impl with 32-bit and smaller integers (uninitialized upper bits in
  the operands were causing the instruction to behave incorrectly).
- Updated setup_environment script to handle host names with empty domains.
- Forced L2 cache line size to be greater than or equal to L1 cache line size
- L2 caches can now be disabled
- Increased Quadro config's L2 line size to 256B and total L2 size to 256kB
- Fixed generation of memory requests for local.s8 load instructions (fixed a
  bug in local memory address translation)
- Conflicting atomic accesses now generate multiple requests during coalescing
- Updated Doxygen scripts
- Fixed dump pipeline in case of L1 disabled.
- Fixed a bug in the debug printing system by making the enum and string array
  use the same list
- Fixed a bug where the ldst_unit::cycle and ldst_unit::writeback were being
  processed at different rates
- Changes to setup_environment to prevent path explosion and crashing due to
  empty host domain name
- Changed the remote OpenCL compilation to use rsync instead of scp (scp had
  issues with concurrent copying).
- Fixed the data collection code for the following metrics (in 3.0.0 and 3.0.1
  they were only counting memory instructions that generate off-core traffic):
  gpgpu_n_load_insn
  gpgpu_n_store_insn
  gpgpu_n_shmem_insn
  gpgpu_n_tex_insn
  gpgpu_n_const_mem_insn
  gpgpu_n_param_mem_insn
- Fixed a bug where the PTX load and store instructions' input address register 
  dependencies were not being registered with the scoreboard. 
- Updated how pending hits in data cache are reported (3.0.0 and 3.0.1 reported
  the difference between the number of cache misses and pending hits).
- Fixed a bug where a configuration with a non-power-of-two number of memory
  partitions could cause two different linear addresses to be aliased into the
  same intra-partition address at the same memory partition.
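
The aliasing problem in the last entry can be illustrated with a generic
chunk-interleaved mapping that stays one-to-one for any partition count.
This is only a sketch: the 256-byte chunk size and both function names are
assumptions, and it is not the address decoder GPGPU-Sim actually uses.

```python
CHUNK_BYTES = 256  # assumed interleaving granularity


def partition_of(addr, n_partitions, chunk_bytes=CHUNK_BYTES):
    """Memory partition that owns a linear address (round-robin by chunk)."""
    return (addr // chunk_bytes) % n_partitions


def intra_partition_address(addr, n_partitions, chunk_bytes=CHUNK_BYTES):
    """Address within the owning partition, with the partition bits removed."""
    chunk_index = addr // chunk_bytes
    return (chunk_index // n_partitions) * chunk_bytes + addr % chunk_bytes


# With 6 partitions (not a power of two), no two linear addresses may map to
# the same (partition, intra-partition address) pair.
n = 6
addresses = range(n * CHUNK_BYTES * 4)
mapped = {(partition_of(a, n), intra_partition_address(a, n)) for a in addresses}
print(len(mapped) == len(addresses))  # prints True
```

Using integer division and modulo rather than bit masks is what keeps the
mapping collision-free when the partition count is not a power of two.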

Version 3.0.1 versus 3.0.0b
- Updated README and setup_environment for more streamlined setup,
  including install instructions for Ubuntu 10.04.3 LTS
- Changed Quadro FX 5800 config to use PTX by default (not SASS/decuda)
- Added Doxygen generated documentation
- Removed ISPASS 2009 configuration (use Quadro FX 5800 config instead)
- Removed common.mk (run native CUDA/OpenCL applications)
- Removed out-of-date GPGPU-Sim documentation. This will be updated in a
  later release.  Refer to Doxygen documentation and/or source code.
- Added prints for L1 data cache statistics
- SIMD width option removed from shader_core_pipeline_opt (was not used)
- Added read-to-precharge constraint in DRAM
- Disabled Stream Manager's verbose output by default; use debug level 3 to
  enable it
- Addresses returned by memory allocation are now 256-byte aligned
- Ejection from the clock domain interface buffer between interconnection 
  network and L2 cache happens in the L2 clock domain instead of interconnect 
  clock domain.
- Updated OpenCL support to work with AMD OpenCL sample applications
- Bug fixes
   - Fixed the variation in instruction count seen under different cache
     configurations on the same workload
   - Fixed unnecessary flushing of instruction buffer
   - Fixed mislabeling of stall cycles recorded by scheduler
   - Fixed compilation issues when using gcc 4.5.1
   - Cleanup and bugfixes to build scripts
   - Fixed gpu_sim_cycle with sequential kernel launches
   - Fixed memory corruption bug in decuda_to_ptxplus causing parse errors
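
The 256-byte alignment of allocated addresses noted above can be sketched
with a simple bump allocator.  The class, its base address, and the helper
name are hypothetical and chosen only for illustration; GPGPU-Sim's own
allocator may be organised differently.

```python
ALIGNMENT = 256  # alignment stated in the entry above


def align_up(addr, alignment=ALIGNMENT):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


class BumpAllocator:
    """Hypothetical allocator that hands out 256-byte aligned addresses."""

    def __init__(self, base):
        self.next_free = base

    def alloc(self, size):
        addr = align_up(self.next_free)
        self.next_free = addr + size
        return addr


allocator = BumpAllocator(base=0x1001)  # deliberately unaligned start
a = allocator.alloc(1)
b = allocator.alloc(300)
c = allocator.alloc(256)
print(hex(a), hex(b), hex(c))  # prints "0x1100 0x1200 0x1400"
```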

Version 3.0.0b versus 2.1.2b
- massive refactoring of code to C++ (most global variables eliminated)
- support for executing native SASS code via decuda
- added to timing model:
   - in-order scoreboard allowing multiple instructions per warp in
     pipeline at a time
   - operand collector for modeling access to banked register files
   - prefetching texture cache model (Igehy et al., Graphics Hardware 1998)
   - updated data and constant cache models with updated MSHR model
- other changes I don't remember right now

Version 2.1.2b versus 2.1.1b:
- Better 1D texture support 
- bar.sync timing change for compatibility with NVIDIA hardware 
  (release barrier when all warps reach barrier, irrespective of 
  divergence state). Functional simulation mode disabled to 
  implement the expected barrier behavior under branch divergence 
- Refactored memory stage 
- Limited miss status holding registers 
- Experimental support for running on Mac OS X 
- Register bank conflict modeling (see MICRO-42 tutorial) 
- Changed files to C++ 
- Updated makefiles 
- Various improvements to AerialVision
- Simulation speed improvement 
- Various minor fixes 

Version 2.1.1b versus 2.1.0b:
- Added support for OpenCL (requires OpenCL driver from NVIDIA)
- Added performance visualizer tool
- Added manual documenting usage of the simulator (see doc directory)
- Added configuration file for Quadro FX5800 (see configs directory)
- Simultaneous release of benchmarks from ISPASS 2009 paper (separate download)
- Improved shared memory bank conflict modeling
- Improved default mapping of local memory accesses to global address space 
- Added interconnect concentration modeling (to approximate effect of a TPC)
- Added support for GPU-to-GPU memory copies (same GPU), timing not modeled
- SimpleScalar code (for command line parsing and functional simulation of 
  memory) removed.
- PTX support for vector operands in mov instruction
- Bug fixes
   - stack overwrite bug fix (with thanks to Jake Adriaens)
   - interconnect clock domain crossing "bottleneck" bug (was not removing
     traffic quickly enough going from memory to shader cores)
   - immediate post dominator detection (affected eigenvalues SDK benchmark)
   - CTA/core calculation error with PTX containing calls and returns
   - global symbol lookup error when building with multiple .cu files
- Other general improvements and bugfixes

Version 2.1.0b versus 2.0b:
- Added support for parsing and functionally simulating up to CUDA 2.2 
  generated PTX
- Added dynamic warp formation timing model (see MICRO'07, ACM TACO'09 papers)
- Updated gpgpusim.config and mesh in benchmark subdirectories to be similar to
  ISPASS 2009 paper baseline configurations
- Added OpenGL interoperability support
- Added support for parsing embedded PTX files without requiring recompilation 
  (suggested by Gregory Diamos)
- Improved support for texture filtering (linear mode for 2D, closer agreement
  to hardware for 2D point sampling where sample points are close to texel
  boundaries)
- Benchmark examples updated to version from CUDA 2.2 SDK (NOTE: these will not 
  compile with earlier CUDA installations. However, GPGPU-Sim should still work 
  with applications written for older versions of CUDA.)
- Fixed splash screen: was reporting incorrect version
- Fixed bug comparing strings in basic block formation code
- Fixed bug with additional basic block (which led to incorrect postdominator
  analysis on some benchmarks with recent versions of CUDA)
- Fixed "warning: "/*" within comment" when building.
- "Other general improvements"