
18.05.2018

C++ Guns: Choose the right datatype

Filed under: Allgemein — Tags: Cpp — Thomas @ 19:05

Here is a very small excerpt from real-world code, together with the generated assembly. It uses a Vector type based on SIMDvalarray.

auto func(const Vector<float,3>& UBAR) {
    const auto U = UBAR[1]/UBAR[0];
    const auto V = UBAR[2]/UBAR[0];
    // more code which uses U and V
    return U*V;
}

func(Vector<float, 3> const&):
  movss (%rdi), %xmm2
  movss 4(%rdi), %xmm0
  divss %xmm2, %xmm0
  movss 8(%rdi), %xmm1
  divss %xmm2, %xmm1
  mulss %xmm1, %xmm0

We see a few MOVs and, in particular, two DIVs. But the input variable is an SSE type, so why does the assembly contain two DIVs? Simply because the C++ code contains two divisions. To improve this, one has to stop implementing quantities that belong together as separate scalars.

auto func(const Vector<float,3>& UBAR) {
  Vector<float,3> UV = UBAR/UBAR[0]; 
  return UV;
}

func(Vector<float, 3> const&):
  movss (%rdi), %xmm1
  shufps $0, %xmm1, %xmm1
  movaps (%rdi), %xmm0
  divps %xmm1, %xmm0

This time there is only one DIV instruction, operating on packed values. The shuffle instruction broadcasts the scalar UBAR[0] to all 32-bit lanes of the SSE register. However, the code is somewhat confusing: UV is a vector with three elements, but only two of them are used (the first one is just UBAR[0]/UBAR[0]). The semantics are somewhat broken.
The vector UBAR is really meant to represent two variables: velocity and discharge. If the program is modeled accordingly...

struct U_t {    
  Vector<double,2> q;
  double H;
};

auto func(const U_t& UBAR) {
  Vector<double,2> v = UBAR.q/UBAR.H; 
  return v;
}

func(U_t const&):
  movsd 16(%rdi), %xmm1
  unpcklpd %xmm1, %xmm1
  movapd (%rdi), %xmm0
  divpd %xmm1, %xmm0

The assembly stays essentially the same, except that double can now be used instead of float at no extra cost, since the two remaining values fit exactly into one 128-bit SSE register. And the semantics are restored :)
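A self-contained sketch of the final variant, for experimenting without ACPL: the GCC/Clang vector extension stands in for Vector<double,2>, and the names v2d, U_t and velocity are illustrative assumptions. Dividing a vector by a scalar broadcasts the scalar to both lanes, so one would expect a single divpd here, much like the listing above (not verified for every compiler).

```cpp
#include <cassert>

// Two doubles in one 128-bit SSE register (GCC/Clang vector extension).
typedef double v2d __attribute__((vector_size(16)));

// Hypothetical stand-in for the U_t struct above: q has two components, H is a scalar.
struct U_t {
    v2d q;
    double H;
};

// One packed division: the scalar H is broadcast to both lanes of q.
inline v2d velocity(const U_t& ubar) {
    return ubar.q / ubar.H;
}
```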


17.05.2018

C++ Guns: Point Concept

Filed under: Allgemein — Tags: Cpp — Thomas @ 11:05

I quickly got into Concepts and provisionally implemented the standard library concepts. Then I thought about a concept for PointND. It should simply be a container with the functions x() and y() and the operators +, -, *, /.

So this piece of code with lots of templates:

template<typename Array>
inline auto operator+(const PointND<Array>& lhs, const PointND<Array>& rhs) {
    return PointND<Array>{static_cast<Array>(lhs) + static_cast<Array>(rhs)};
}

became this piece:

// Concepts TS syntax ("concept bool"), as supported by GCC with -fconcepts.
// std::Container comes from the provisional stdconcepts.hpp linked below.
template<typename PointND>
concept bool Point =
    requires(PointND p1, PointND p2) {
        typename PointND::Base;
        requires std::Container<PointND>;
        {p1.x()} -> typename PointND::const_reference;
        {p1.y()} -> typename PointND::const_reference;
        {p1+p2} -> PointND;
        {p1-p2} -> PointND;
        {p1*p2} -> PointND;
        {p1/p2} -> PointND;
    };

template<Point PointND>
inline auto operator+(const PointND& lhs, const PointND& rhs) {
    return PointND{static_cast<typename PointND::Base>(lhs) + static_cast<typename PointND::Base>(rhs)};
}

Now that's already better.

https://sourceforge.net/p/acpl/code/ci/master/tree/acpl/acpllib/include/core/stdconcepts.hpp
https://sourceforge.net/p/acpl/code/ci/master/tree/acpl/acpllib/include/core/geometry/point.hpp
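For compilers without Concepts support, a similar (weaker) constraint can be sketched with the C++17 detection idiom (std::void_t and std::enable_if), which is a different technique from the one in this post. is_point and the std::array-based Point2D are hypothetical; the trait only checks for Base, x() and y(), and deliberately omits the operator requirements so that the constrained operator+ does not depend on itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Primary template: by default a type is not a point.
template<typename P, typename = void>
struct is_point : std::false_type {};

// Chosen only if P::Base, p.x() and p.y() are all well-formed.
template<typename P>
struct is_point<P, std::void_t<typename P::Base,
                               decltype(std::declval<const P&>().x()),
                               decltype(std::declval<const P&>().y())>>
    : std::true_type {};

template<typename P>
inline constexpr bool is_point_v = is_point<P>::value;

// Hypothetical point type built on std::array, as in the earlier posts.
struct Point2D : std::array<double, 2> {
    using Base = std::array<double, 2>;
    const double& x() const { return (*this)[0]; }
    const double& y() const { return (*this)[1]; }
};

// Participates in overload resolution only for types that satisfy is_point.
template<typename P, std::enable_if_t<is_point_v<P>, int> = 0>
P operator+(const P& lhs, const P& rhs) {
    P result{};
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = lhs[i] + rhs[i];
    }
    return result;
}
```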


16.05.2018

C++ Guns: SIMD PointND

Filed under: Allgemein — Tags: Cpp — Thomas @ 10:05

Let's get back to my favorite data type: PointND. This time with SSE. In the last posts I introduced the SIMDarray type for processing several values at once, followed by the SIMDvalarray type, which defines the arithmetic operations +, -, *, /. What more does a SIMD PointND need? Actually nothing. A PointND is a fixed-size array of length N. One can additionally provide a few functions for explicit access to X and Y, but that is just syntactic sugar and adds no new functionality. So let's try it:

template<typename Array>
struct PointND : Array {
  static_assert(Array().size() >= 2);
  using value_type = typename Array::value_type;

  const value_type& x() const {
    return (*this)[0];
  }

  const value_type& y() const {
    return (*this)[1];
  }
};

template<typename Array>
inline auto operator+(const PointND<Array>& lhs, const PointND<Array>& rhs) {
    return PointND<Array>{static_cast<Array>(lhs) + static_cast<Array>(rhs)};
}

template<typename Array>
auto func(const PointND<Array>& p1, const PointND<Array>& p2) {
    return (p1+p2).y();
}

auto func2(const PointND<SIMDvalarray<double, 2>>& p1, const PointND<SIMDvalarray<double, 2>>& p2) {
    return func(p1, p2);
}

acpl::func2(acpl::PointND<acpl::SIMDvalarray<double, 2ul> > const&, acpl::PointND<acpl::SIMDvalarray<double, 2ul> > const&):
  movapd (%rsi), %xmm0
  addpd (%rdi), %xmm0
  unpckhpd %xmm0, %xmm0

A few things are still annoying. First, the templates: for function arguments they are too verbose. Essentially all that matters is that the type is a PointND. The number of dimensions, the underlying array or the value_type (int, double) don't really matter; the compiler knows the exact structure of the type and can deduce everything necessary from it. The same applies to Point2D in 2D-only functions. Concepts should be able to help here...

The second annoying thing is the constantly repeated operator+, -, *, / functions. They can't be omitted for PointND: otherwise, for example, the operator+ of SIMDvalarray is called, and the result is a SIMDvalarray again, not a PointND, so accessing Y as in the code above is no longer possible. The compiler is smart enough to recognize that only the type changes here; as the assembly shows, this has no effect on run time.
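A self-contained sketch of that effect, assuming a hypothetical two-lane Vec2 built on the GCC/Clang vector extension as a stand-in for SIMDvalarray<double, 2>: the base type's own operator+ returns a plain Vec2, and only the extra PointND overload keeps the PointND type so that .y() remains available.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for SIMDvalarray<double, 2>: two doubles in one SSE register.
struct Vec2 {
    typedef double v2d __attribute__((vector_size(16)));
    v2d v;
    double operator[](int i) const { return v[i]; }
};

// The base type's own operator+ returns the base type.
inline Vec2 operator+(const Vec2& a, const Vec2& b) { return Vec2{a.v + b.v}; }

template<typename Array>
struct PointND : Array {
    auto x() const { return (*this)[0]; }
    auto y() const { return (*this)[1]; }
};

// Re-wraps the base result; without this overload, p1 + p2 would pick
// operator+(const Vec2&, const Vec2&) and yield a plain Vec2 without y().
template<typename Array>
inline PointND<Array> operator+(const PointND<Array>& lhs, const PointND<Array>& rhs) {
    return PointND<Array>{static_cast<const Array&>(lhs) + static_cast<const Array&>(rhs)};
}
```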


15.05.2018

C++ Guns: use templates and inline; don't use libraries

Filed under: Allgemein — Tags: Cpp — Thomas @ 19:05

The compiler can optimize a great deal, as long as it can see the complete program as source code. As soon as a function from a run-time library is called somewhere, so that its definition is not visible, optimization stops at that call. A small example illustrates this:

void foo(const int &x); 

int bar() {
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(x);
        y += x;
    }
    return y;
}

The function foo is not defined; that would cause a linker error. What matters here is that at compile time it is not known what foo does, so the compiler cannot make any assumptions that would enable further optimizations. The assembly looks as expected:

bar():
  pushq %rbp           // save registers rbp, rbx
  pushq %rbx
  subq $24, %rsp       // allocate memory on stack
  movl $0, 12(%rsp)    // x=0
  movl $10, %ebx       // i=10 
  movl $0, %ebp        // y=0
.L2:                   // for...
  leaq 12(%rsp), %rdi
  call foo(int const&)
  addl 12(%rsp), %ebp  // y += x
  subl $1, %ebx        // --i
  jne .L2
  movl %ebp, %eax     // return y
  addq $24, %rsp      // release memory on stack
  popq %rbx           // restore register
  popq %rbp
  ret

The C++ code was translated into assembly more or less 1:1. Nothing special.
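Why does the loop reload x from the stack (addl 12(%rsp), %ebp) after every call? foo receives a reference to x, and x itself is not declared const, so foo may legally cast away const and modify it. With the following hypothetical definition of foo, bar() returns 1 + 2 + ... + 10 = 55 instead of 0, which is why the compiler cannot assume anything about x without seeing the definition:

```cpp
#include <cassert>

// Hypothetical foo: legal, because the object referenced by x in bar() is not const.
void foo(const int& x) {
    ++const_cast<int&>(x);
}

int bar() {
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(x);   // x becomes 1, 2, ..., 10
        y += x;   // y accumulates 1 + 2 + ... + 10
    }
    return y;
}
```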
Now, if the definition of foo() is visible and it is, for example, simply an empty function, does that change the generated assembly? It certainly does!

foo(int const&):
  ret
bar():
  movl $0, %eax
  ret

The compiler has optimized away practically everything: the call to foo(), the additions to y, and the loop. Saving the registers is no longer necessary, and no stack memory is used. Since foo() does nothing, x stays 0, and the compiler has recognized that the function always returns 0.
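For reference, a version of the example in which the empty foo is visible in the same translation unit. It is declared inline here, following the advice of this post; unlike the out-of-line foo in the listing above, an unused inline function does not have to be emitted at all.

```cpp
#include <cassert>

// foo's (empty) definition is visible, so the compiler can see that x stays 0.
inline void foo(const int&) {}

int bar() {
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(x);
        y += x;   // always adds 0
    }
    return y;     // constant-folds to 0 with optimization enabled
}
```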

So: use templates and inline functions instead of classic library function calls.


C++ Guns: SIMDvalarray; std::valarray for SSE

Filed under: Allgemein — Tags: Cpp — Thomas @ 11:05

SIMDvalarray is a class for representing and manipulating arrays of values, like std::valarray, but it uses SIMD techniques such as SSE, AVX, etc., and it has no slicing subscript operators.

It is simply SIMDarray plus the operators +, -, *, /, +=, -=, *=, /=.

template<typename T, std::size_t N>
struct SIMDvalarray : acpl::SIMDarray<T,N> 
{ };

...

SIMDvalarray arr1 {1.0, 2.0};
SIMDvalarray arr2 {2.0, 4.0};
SIMDvalarray erg1 = arr1+arr2;

This generates packed-value SIMD assembly, as intended. And thanks to C++17 deduction guides, the annoying, redundant template parameters no longer have to be specified.
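How can SIMDvalarray arr1 {1.0, 2.0}; deduce its template arguments? A minimal sketch, not the ACPL implementation: the hypothetical SimdValarray below builds on the GCC/Clang vector extension, requires N to be a power of two (a vector_size restriction), and adds a deduction guide that maps N arguments of type T to SimdValarray<T, N>.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical minimal SIMD valarray (not the ACPL implementation).
// Built on the GCC/Clang vector extension; N must be a power of two.
template<typename T, std::size_t N>
struct SimdValarray {
    typedef T vec_t __attribute__((vector_size(sizeof(T) * N)));
    vec_t v;

    SimdValarray() = default;

    // Accepts exactly N arithmetic values, e.g. SimdValarray<double, 2>{1.0, 2.0}.
    template<typename... U,
             typename = std::enable_if_t<sizeof...(U) == N &&
                                         (std::is_arithmetic_v<U> && ...)>>
    SimdValarray(U... vals) : v{static_cast<T>(vals)...} {}

    T operator[](std::size_t i) const { return v[i]; }

    // Each compound operator works on all lanes at once.
    SimdValarray& operator+=(const SimdValarray& o) { v += o.v; return *this; }
    SimdValarray& operator-=(const SimdValarray& o) { v -= o.v; return *this; }
    SimdValarray& operator*=(const SimdValarray& o) { v *= o.v; return *this; }
    SimdValarray& operator/=(const SimdValarray& o) { v /= o.v; return *this; }
};

// Binary operators reuse the compound ones (left operand taken by value).
template<typename T, std::size_t N>
SimdValarray<T, N> operator+(SimdValarray<T, N> a, const SimdValarray<T, N>& b) { return a += b; }
template<typename T, std::size_t N>
SimdValarray<T, N> operator-(SimdValarray<T, N> a, const SimdValarray<T, N>& b) { return a -= b; }
template<typename T, std::size_t N>
SimdValarray<T, N> operator*(SimdValarray<T, N> a, const SimdValarray<T, N>& b) { return a *= b; }
template<typename T, std::size_t N>
SimdValarray<T, N> operator/(SimdValarray<T, N> a, const SimdValarray<T, N>& b) { return a /= b; }

// C++17 deduction guide: SimdValarray{1.0, 2.0} deduces SimdValarray<double, 2>.
template<typename T, typename... U>
SimdValarray(T, U...) -> SimdValarray<T, 1 + sizeof...(U)>;
```

Taking the left operand by value and reusing the compound assignment is a common idiom that keeps each operation's logic in one place.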

https://sourceforge.net/p/acpl/code/ci/master/tree/acpl/acpllib/include/core/util/SIMDvalarray.hpp


C++ Guns: SIMDarray; std::array for SSE

Filed under: Allgemein — Tags: Cpp — Thomas @ 09:05

SIMD (Single Instruction, Multiple Data), better known on x86 as MMX, SSE, AVX and so on, is great for data processing. Depending on the CPU's capabilities, many floating-point numbers can be processed at once. SSE, for example, offers 128-bit (16-byte) registers, which can hold two double values that are processed simultaneously, e.g. for simple Point2D operations such as addition or division. With AVX and its 256-bit registers, four double values can be processed at once (Point3D). AVX-512, announced in 2013, offers 512-bit (64-byte) registers. Welcome to PointND.

Compilers support this well, but they don't readily generate the assembly I have in mind. I don't want nice SIMD code in loops, but in ordinary PointND operations such as + - * /. Simply because the loops in my algorithms are never trivial, and I see more optimization potential in simple PointND operations.

A simple example:

#include <array>

struct Point2D : public std::array<double,2> {
};

inline Point2D operator+(const Point2D& lhs, const Point2D& rhs) {
    return Point2D{lhs[0]+rhs[0], lhs[1]+rhs[1]};
}

auto func(const Point2D& p1, const Point2D& p2) {
    return p1+p2;
}

The generated assembly contains two addition instructions, addsd. The suffix encodes whether the operation is packed (several numbers at once) or scalar, and the data type: sd stands for scalar double precision, i.e. a single number. I would have expected addpd, packed double precision.

operator+(Point2D const&, Point2D const&):
  movsd xmm0, QWORD PTR [rdi]
  addsd xmm0, QWORD PTR [rsi]
  movsd xmm1, QWORD PTR [rdi+8]
  addsd xmm1, QWORD PTR [rsi+8]

You can give the compiler a push with GCC's "Vector Extensions", though. Declaring a type with the vector_size attribute produces SSE code just the way I want it. Here is the example:

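#include <cstddef>

struct Point2D {
    // GCC vector extension: two doubles in one 16-byte SSE register
    typedef double v2d __attribute__ ((vector_size (16)));
    v2d _data;

    // returned by value: vector elements are not guaranteed to be addressable
    double operator[](std::size_t pos) const {
        return _data[pos];
    }
};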

inline Point2D operator+(const Point2D& lhs, const Point2D& rhs) {
    return Point2D{lhs._data + rhs._data};
}

auto func(const Point2D& p1, const Point2D& p2) {
    return p1+p2;
}

Only one addition instruction is generated now. We have, at least in principle, doubled the speed of the operation!

func(Point2D const&, Point2D const&):
  movapd xmm0, XMMWORD PTR [rdi]
  addpd xmm0, XMMWORD PTR [rsi]

This can serve as a foundation for a SIMDarray analogous to std::array, which makes it possible to generate efficient SIMD code for arbitrary arithmetic types and element counts. No sooner said than done: you can find the class in ACPL core/util/SIMDarray.hpp.
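A rough sketch of that generalization, assuming the vector extension can be parameterized on T and N inside a class template (GCC and recent Clang accept this). SimdArray is an illustrative name; the real ACPL class linked below may look quite different.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical generic SIMD array in the spirit of SIMDarray (not the ACPL code).
// GCC requires the lane count N to be a power of two for vector_size.
template<typename T, std::size_t N>
struct SimdArray {
    typedef T vec_t __attribute__((vector_size(sizeof(T) * N)));
    vec_t data;

    static constexpr std::size_t size() { return N; }
    T operator[](std::size_t pos) const { return data[pos]; }
};

// One packed addition for all N lanes (expected: addpd for double/2, addps for float/4).
template<typename T, std::size_t N>
inline SimdArray<T, N> operator+(const SimdArray<T, N>& lhs, const SimdArray<T, N>& rhs) {
    return SimdArray<T, N>{lhs.data + rhs.data};
}

// Both instantiations fill exactly one 128-bit SSE register.
static_assert(sizeof(SimdArray<double, 2>) == 16);
static_assert(sizeof(SimdArray<float, 4>) == 16);
```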

https://sourceforge.net/p/acpl/code/ci/master/tree/acpl/acpllib/include/core/util/SIMDarray.hpp

https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Vector-Extensions.html#Vector-Extensions


12.05.2018

C++ Guns: print std::array with std::integer_sequence

Filed under: Allgemein — Tags: Cpp — Thomas @ 15:05

Part 1: print std::array with std::integer_sequence
Part 2: convert tuple to parameter pack
Part 3: print std::array with std::apply and fold
Part 4: fold over std::tuple and the generated assembly
Part 5: fold over std::tuple of std::vector of Types ...
Part 6: apply generic lambda to tuple
Part 7: Play with std::tuple and std::apply

High time I tried this myself:

#include <array>
#include <cstddef>
#include <iostream>
#include <utility>

template<std::size_t N, std::size_t... Ints>
void print_impl(const std::array<double,N>& arr, std::index_sequence<Ints...>) {
    // comma fold expression: one std::cout << arr[I] per index I
    ((std::cout << std::get<Ints>(arr)),...);
}

template<std::size_t N>
void print(const std::array<double,N>& arr) {
    print_impl(arr, std::make_index_sequence<N>{});
}

void func() {
    std::array<double,4> arr{1,2,3,4};
    print(arr);   // prints 1234 (no separator)
}
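The same index_sequence technique extends to any element type and can insert separators, because the index is available inside the fold. A hypothetical variant that writes into a string (write_elements and array_to_string are illustrative names):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

template<typename T, std::size_t N, std::size_t... Ints>
void write_elements(std::ostream& os, const std::array<T, N>& arr, std::index_sequence<Ints...>) {
    // Expands to (os << "" << arr[0]), (os << ", " << arr[1]), ...
    ((os << (Ints == 0 ? "" : ", ") << std::get<Ints>(arr)), ...);
}

template<typename T, std::size_t N>
std::string array_to_string(const std::array<T, N>& arr) {
    std::ostringstream os;
    write_elements(os, arr, std::make_index_sequence<N>{});
    return os.str();
}
```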


10.05.2018

C++ Guns: poorly generated assembly from Pascal code

Filed under: Allgemein — Tags: Cpp — Thomas @ 17:05

The same example again in Pascal. To get a halfway reasonable result, I had to select optimization level O2 and enable inlining by hand. Dead code elimination does not work. The essential subroutine has five subtractions, two multiplications, eleven MOVs and one LEA, and the stack is used, including for passing the function parameters. As a reminder: in C++ we need five subtractions, two multiplications and four explicit copy instructions (MOV).
It doesn't get any better than this.

unit output;
interface
implementation

type
  Point2D = record
    x: double;
    y: double;
  end;

type 
  Line2D = record
    pt1: Point2D;
    pt2: Point2D;
  end;

Operator -(p1: Point2D; p2: Point2D) res: Point2D inline ;  
begin  
  res.x := p1.x-p2.x;  
  res.y := p1.y-p2.y;  
end; 

function func(line1: Line2D; line2: Line2D) : double;
var
  denominator: double;
  a: Point2D;
  b: Point2D;
begin
  a := line1.pt2 - line1.pt1;
  b := line2.pt1 - line2.pt2;
  denominator := a.y * b.x - a.x * b.y;
  exit(denominator)
end;
  
end.


func(line2d,line2d):
  pushq %rbp
  movq %rsp,%rbp
  leaq -48(%rsp),%rsp
  movsd 32(%rbp),%xmm0
  subsd 16(%rbp),%xmm0
  movsd %xmm0,-16(%rbp)
  movsd 40(%rbp),%xmm0
  subsd 24(%rbp),%xmm0
  movsd %xmm0,-8(%rbp)
  movsd 48(%rbp),%xmm0
  subsd 64(%rbp),%xmm0
  movsd %xmm0,-32(%rbp)
  movsd 56(%rbp),%xmm0
  subsd 72(%rbp),%xmm0
  movsd %xmm0,-24(%rbp)
  movsd -8(%rbp),%xmm0
  mulsd -32(%rbp),%xmm0
  movsd -16(%rbp),%xmm1
  mulsd -24(%rbp),%xmm1
  subsd %xmm1,%xmm0
  leave
  ret


C++ Guns: poorly generated assembly from Fortran code

Filed under: Allgemein — Tags: Cpp, Fortran — Thomas @ 13:05

Analogous to the last example in C++, here is the beginning of the same function in Fortran. I have left out the getter functions and constructors for now. In Fortran it is not always possible to write uniform code anyway, and it is bad enough as it is.

module geometrymodule
  implicit none 
  type Point2D_t
    real(8) :: xp, yp
    
    contains
    
    procedure :: Point2Dminus
    generic :: operator(-) => Point2Dminus
  end type
  
  type Line2D_t
    type(Point2D_t) :: pt1, pt2
  end type

  contains
  
  pure function Point2Dminus(this, point1) result(point2)
    implicit none
    class(Point2D_t), intent(in) :: this
    type(Point2D_t), intent(in) :: point1
    type(Point2D_t) :: point2

    point2%xp = this%xp - point1%xp
    point2%yp = this%yp - point1%yp
  end function
end module

function func(line1, line2) result(denominator) 
  use geometrymodule
  implicit none
  type(Line2D_t), intent(in) :: line1, line2
  type(Point2D_t) :: a, b
  real(8) :: denominator
  
  a = line1%pt2 - line1%pt1
  b = line2%pt1 - line2%pt2
  denominator = a%yp * b%xp - a%xp * b%yp
end function

And here is the assembly generated from the Fortran code. As a reminder: in C++ we need five subtractions, two multiplications and four explicit copy instructions. In Fortran I count only four subtractions, but in exchange one addition, two multiplications, 15 copy (MOV) instructions, a push/pop pair for the stack, two function calls (four, counting Point2D_t), and four LEA instructions. LEA stands for Load Effective Address. Hell NO. That was with optimization level O1.

Point2d_t:
	movq	(%rdi), %rax
	movq	8(%rdi), %rdx
	movq	%rax, (%rsi)
	movq	%rdx, 8(%rsi)
	ret

point2dminus:
	movq	(%rdi), %rax
	movsd	8(%rax), %xmm1
	movsd	(%rax), %xmm0
	subsd	(%rsi), %xmm0
	subsd	8(%rsi), %xmm1
	ret

func_:
	pushq	%rbx
	subq	$48, %rsp
	movq	%rsi, %rbx
	movq	Point2d_t, 24(%rsp)
	leaq	16(%rdi), %rax
	movq	%rax, 16(%rsp)
	movq	%rdi, %rsi
	leaq	16(%rsp), %rdi
	call	point2dminus
	movsd	%xmm0, (%rsp)
	movsd	%xmm1, 8(%rsp)
	movq	Point2d_t, 40(%rsp)
	movq	%rbx, 32(%rsp)
	leaq	16(%rbx), %rsi
	leaq	32(%rsp), %rdi
	call	point2dminus
	mulsd	8(%rsp), %xmm0
	mulsd	(%rsp), %xmm1
	subsd	%xmm1, %xmm0
	addq	$48, %rsp
	popq	%rbx
	ret

With optimization level O2 it gets better. Suddenly there are seven subtractions, but no addition anymore. There are still two multiplications, and 13 copy (MOV) instructions. The stack usage, the LEAs and the function calls disappear. Hmm, but the functions Point2d_t and point2dminus are still there, so dead code elimination does not seem to work (although, as public module procedures, they may have to be kept for other compilation units anyway). That effectively leaves five subtractions, two multiplications and six copy/MOV instructions in func_. Well, at least.

Point2d_t:
	movq	(%rdi), %rax
	movq	8(%rdi), %rdx
	movq	%rax, (%rsi)
	movq	%rdx, 8(%rsi)
	ret

point2dminus:
	movq	(%rdi), %rax
	movsd	8(%rax), %xmm1
	movsd	(%rax), %xmm0
	subsd	8(%rsi), %xmm1
	subsd	(%rsi), %xmm0
	ret

func_:
	movsd	(%rsi), %xmm0
	movapd	%xmm0, %xmm1
	movsd	24(%rdi), %xmm0
	subsd	16(%rsi), %xmm1
	subsd	8(%rdi), %xmm0
	mulsd	%xmm1, %xmm0
	movsd	8(%rsi), %xmm1
	movapd	%xmm1, %xmm2
	movsd	16(%rdi), %xmm1
	subsd	24(%rsi), %xmm2
	subsd	(%rdi), %xmm1
	mulsd	%xmm2, %xmm1
	subsd	%xmm1, %xmm0
	ret

If, on the other hand, getter functions for Point2D's x and y are implemented, to provide a uniform interface for all geometric data types (which in C++ is always possible without overhead), function calls and LEA instructions reappear in the code. And they don't go away again, no matter which optimization level is used.


C++ Guns: perfectly generated assembly from my geometry library

Filed under: Allgemein — Tags: Cpp — Thomas @ 12:05

Take a look at the following beginning of a Line2D intersect function. It is astonishing how cleanly and perfectly the assembly is generated from my C++ code. There is absolutely no overhead to be seen: no function calls, no unnecessary loading and storing of data, no temporary objects on the stack, in fact no stack usage at all. The code contains five subtractions and two multiplications. This results in assembly with five subtractions, two multiplications and four explicit copy instructions for doubles. It really doesn't get any better. And all of this is compiled with only optimization level O1. Not O2, not fancy O3, no, just O1, with all common versions of GNU GCC. Clang of course can't do it with O1, only with O2, but then the asm code no longer looks so nicely symmetric ;) With Intel it also works with O1; the asm instruction order is a bit different. Microsoft... no.
And the most important lines of the C++ code are quite expressive, too.

#include <array>

struct Point2D : public std::array<double,2> {
    inline const auto& x() const { return at(0); }
    inline const auto& y() const { return at(1); }
};

inline const Point2D operator-(const Point2D& p1, const Point2D& p2) {
    return Point2D{p1.x()-p2.x(), p1.y()-p2.y()};
}

struct Line2D : public std::array<Point2D, 2> {
    inline const Point2D& p1() const { return at(0); }
    inline const Point2D& p2() const { return at(1); }
};

auto func(const Line2D& line1, const Line2D& line2) {
    Point2D a = line1.p2() - line1.p1();
    Point2D b = line2.p1() - line2.p2();
    const auto denominator = a.y() * b.x() - a.x() * b.y();
    return denominator;
}

func(Line2D const& line1, Line2D const& line2):
  movsd (%rsi),   %xmm0 # line2 x1
  subsd 16(%rsi), %xmm0 # line2 x2
  movsd 24(%rdi), %xmm1 # line1 y2
  subsd 8(%rdi),  %xmm1 # line1 y1
  mulsd %xmm1,    %xmm0
  movsd 8(%rsi),  %xmm1 # line2 y1
  subsd 24(%rsi), %xmm1 # line2 y2
  movsd 16(%rdi), %xmm2 # line1 x2
  subsd (%rdi),   %xmm2 # line1 x1
  mulsd %xmm2,    %xmm1
  subsd %xmm1,    %xmm0
  ret

By the way, you get the same assembly with float instead of double. For int the structure also stays the same; only the general-purpose registers are used. Only with long double are a few copy instructions added, because long double is handled by the CPU's x87 floating-point unit instead of SSE. Although SSE registers are 128 bits wide, SSE has no arithmetic instructions for extended-precision or 128-bit floating-point numbers, so specifying a different target CPU would not change this on x86-64.
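The value computed so far is the 2D cross product of the two direction vectors (b points from line2.p2() to line2.p1()), so it is zero exactly when the lines are parallel or collinear, a case the complete intersect function will presumably have to handle. A self-contained check reusing the code above; make_line is a hypothetical helper for building test input:

```cpp
#include <array>
#include <cassert>

struct Point2D : public std::array<double, 2> {
    const double& x() const { return at(0); }
    const double& y() const { return at(1); }
};

inline Point2D operator-(const Point2D& p1, const Point2D& p2) {
    return Point2D{p1.x() - p2.x(), p1.y() - p2.y()};
}

struct Line2D : public std::array<Point2D, 2> {
    const Point2D& p1() const { return at(0); }
    const Point2D& p2() const { return at(1); }
};

// Cross product of the direction vectors: 0 for parallel or collinear lines.
double func(const Line2D& line1, const Line2D& line2) {
    const Point2D a = line1.p2() - line1.p1();
    const Point2D b = line2.p1() - line2.p2();
    return a.y() * b.x() - a.x() * b.y();
}

// Hypothetical helper for building test lines from two endpoints.
Line2D make_line(double x1, double y1, double x2, double y2) {
    Line2D line{};
    line[0] = Point2D{{x1, y1}};
    line[1] = Point2D{{x2, y2}};
    return line;
}
```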

// Addendum
One more on top: SSE with the GNU GCC vector extension. A Point2D with double as its element type can be processed nicely with SSE, so the two subtractions of a point difference run in parallel as one packed instruction (the multiplications in the listing below remain scalar). This allows extremely compact (and super efficient) asm code to be generated.

#include <smmintrin.h>
#include <tuple>

using v2sd = double __attribute__ ((vector_size (16)));
struct Point2D : public std::tuple<v2sd> {
    // ....
};

auto func(const Line2D& line1, const Line2D& line2) {
    const Point2D a = line1.p2() - line1.p1();
    const Point2D b = (line2.p1() - line2.p2());
    const auto denominator = a.y()*b.x() - a.x()*b.y();
    return denominator;
}

Only three subtractions remain. Unfortunately, unpack instructions are added in exchange.

func(Line2D const&, Line2D const&):
  movapd 16(%rdi), %xmm1
  subpd (%rdi), %xmm1
  movapd (%rsi), %xmm2
  subpd 16(%rsi), %xmm2
  movsd %xmm2, %xmm0
  movapd %xmm1, %xmm4
  unpckhpd %xmm4, %xmm4
  mulsd %xmm4, %xmm0
  unpckhpd %xmm2, %xmm2
  mulsd %xmm2, %xmm1
  subsd %xmm1, %xmm0
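Since the class body is elided above, here is a hypothetical reconstruction for experimenting. It stores the point in a plain v2sd member rather than the post's std::tuple<v2sd> base, and it does not need <smmintrin.h>, because the vector extension alone provides the packed subtraction. It has not been checked to produce exactly the listing above.

```cpp
#include <array>
#include <cassert>

// GCC/Clang vector extension: x and y in one 128-bit SSE register.
typedef double v2sd __attribute__((vector_size(16)));

// Hypothetical reconstruction; the post elides the real Point2D body.
struct Point2D {
    v2sd v;
    double x() const { return v[0]; }
    double y() const { return v[1]; }
};

// One packed subtraction (subpd) for both coordinates.
inline Point2D operator-(const Point2D& p1, const Point2D& p2) {
    return Point2D{p1.v - p2.v};
}

struct Line2D : public std::array<Point2D, 2> {
    const Point2D& p1() const { return (*this)[0]; }
    const Point2D& p2() const { return (*this)[1]; }
};

double func(const Line2D& line1, const Line2D& line2) {
    const Point2D a = line1.p2() - line1.p1();
    const Point2D b = line2.p1() - line2.p2();
    return a.y() * b.x() - a.x() * b.y();
}
```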



C++Guns – RoboBlog blogging the bot
