Puppet architecture

I've been working with Puppet for almost three years, and I want to share some ideas that I believe are (or once were) new in the Puppet world.

A Puppet deployment can be divided into four parts:
  • Custom services: applications your company develops.
  • Common services: such as MySQL, MongoDB, Apache, Tomcat...
  • Base: applications you install on every machine, plus OS customizations.
  • Architecture: the design for your network.

This is a simplified Hiera hierarchy:
  • %{::service}/roles/%{::service_role}
  • %{::service}/%{::service_instance}
  • %{::service}/%{::service_version}
  • %{::service}/%{::hostenv}
  • %{::service}/default
  • %{::hostenv}
  • default
Not every entry was necessary in my case.

The main point here is the separation between:
  • How to install something.
  • Where to install something.

It's very easy to see where you are installing something. For instance, we can have four environments: a laptop (for development), an integration machine, a preproduction machine and a production machine. When you remove the environment from the hierarchy, you get the central abstraction: the service.

Now we have something I consider a fundamental truth: every application you install gets installed in the context of a service. If something doesn't belong to a service, you need to create one for it.

Before continuing, have a look at this blog post, as this one expands on top of that design.

Now let's see what a service is: a service is a collection of roles, and a role is a division of a service :). For example, say we have a blogs service. We could divide it into two roles: web and db. For a larger blogs service we could have web, web_static, web_admin, db, db_slave, etc. (it doesn't matter how many machines we have for each role). We could also have two installations of this service, and this is very important: the web role from the first installation is a different role from the web role of the second one.

Now we don't need nodes. I know Puppet makes you define them, but look at it this way. This code:

node /abc/ { }

is equivalent to:

if $::fqdn =~ /abc/ { }

If every machine belongs to a service and has a role, we can do:

node default {
  include "service_${::service}"
  include "service_${::service}_${::role}"
}

This model is copy-on-write: you only need to modify your Puppet code the first time you do something new. This is true for everything you do with configuration management.

Now about data: if you use predictable data, you don't need exceptions in Hiera. You could also use service discovery for IP addresses, but so far I haven't explored that possibility.

When we started using Hiera, we didn't qualify variables. I found this very confusing because it was hard to know where a variable was used. So we wrote a wrapper for Hiera that qualifies variables with the class name. Now variables look like: mysql::server::listen.

And this takes us to a new problem: what if two modules need the same value for a variable? Going back to the blogs example, let's say both the web and admin servers need the address of the MySQL server. We do:

blogsadmin::mysql_address: x.x.x.x
blogsweb::mysql_address: x.x.x.x

And this is very important: when we change one, will we remember to change the others?

Hiera supports sub-lookups, so I created a new shadow hierarchy:
  • vendor
  • service
  • cluster
  • variable
We only used vendor and service btw, but it could be extended to as many levels as you want.

Think in business terms: this service has a database. What is it for? Let's say it's the main database cluster and call it blogsdb. We get:

blogsadmin::mysql_address: "%{hiera('vendor::blogs::blogsdb::ipaddr')}"
blogsweb::mysql_address: "%{hiera('vendor::blogs::blogsdb::ipaddr')}"

Note we usually do this at the service level, but sometimes it needs to be done (or overridden) at the role level. In other words, what we are saying is: "for these machines in this service, the IP address for blogsadmin/blogsweb is the one from the blogsdb cluster". The address will of course be different for each cluster, but you can manage that with Hiera (the shadow hierarchy).

This mechanism can be reused to bind variables between different services. That's what the top level, "vendor::var", is for.


The Hiera hierarchy above has "%{::service_version}". When we deploy a new version of a service, we install it on new machines, and both versions coexist. You can also use this number for something else; just use your imagination. If a service doesn't have a version, we use "0". Note this is a Puppet-internal version: it could refer to a real external version, but it doesn't have to.

What if we only want to change one application, for example the syslog daemon? What I did was add "proxy classes". The base class includes "syslog", and the syslog module uses Hiera to include the right syslog implementation and to uninstall any other.

We can extend this to other things. For example, our .jar files run under supervisord. Instead of calling supervisord directly, I have a processwrapper module that calls it. This processwrapper module exposes a "business" API and translates it to the particular implementation. The processwrapper defines take a parameter for the backend to use and fall back to the one in the main class (found with Hiera) if it's not specified, so we can use a different process wrapper for any application, host, service, environment or any mix of them. In other words: we can migrate machines in batches, safely.

Hiera (data) migrations

Sometimes you want to migrate to a new architecture with a new hierarchy. You can create a "hiera of hieras": you prefix every variable in Hiera with %{::hiera_version}, and Puppet will use a different hierarchy for a different set of hosts.

In the end you create "micro-tears" in your infrastructure and use those conditions to leave technical debt behind. How much you change is up to you: you could have 100% new code for every deployment and still reuse application modules, for example.


jOS operating system.

I've made the source of my hobbyist toy operating system available.

See here: https://github.com/acceso/jOS

I've also written some documentation, but it's in Spanish. I wish I had the knowledge and time to translate it into English.

It's here: jOS.pdf

It is not the best thing, but it can read an ELF binary from disk, map it into memory and run it. It also implements two syscalls: printf() and exit().

I'm not sure if I'll keep working on it, as I want to explore new things.


Automatic "su" after login.

I needed this piece of code today and I thought it could be useful to someone else. It is what you need if you want to automate the "su" command after login.

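use Net::OpenSSH;
use Expect;

my $ssh = Net::OpenSSH->new($host);
$ssh->error and exit;

my($pty, $pid) = $ssh->open2pty("/bin/su", "-");

my $expect = Expect->init($pty);

# Forward terminal resize signals to the remote process.
$SIG{WINCH} = \&winch;
sub winch {
    kill WINCH => $expect->pid if $expect->pid;
    $SIG{WINCH} = \&winch;
}

# Wait for the password prompt; the 5 second timeout is an assumed value.
$expect->expect(5,
        [ qr/Password:/ => sub { shift->send("mypass\n"); } ],
        ) or die "Expect timeout.";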


I couldn't find anything similar, as it seems most people use sudo.

Update: the source is on GitHub. I renamed it to "s2", as I want to redo "s" in Python and make something more general.


jOS syscalls

After ~28 months (with ups and downs) and almost 10k lines of C code, I can say I have my own operating system, as I have the syscalls working!

Let's see what happens when I do:

exec ("/sbin/init");

init being something simple:

        movq $1, %rax;          // write( ...
        movq $1, %rdi;          // write(1, ...
        movq $str, %rsi;        // write(1, str, ...
        movq $14, %rdx;         // write(1, str, 14)
        syscall

(Note this code also works on Linux, as I aim to be binary-compatible :) )

  • Open the path (the inode has to be located, etc...).

  • Read the ELF header from disk (buffers are also stored into the buffer cache).

  • mmap() the PT_LOAD segment, which means one page (I'm using 2 MB pages, so I should be able to run not-so-small binaries) gets mapped into free memory and the segment is copied into the page (it's done right away, as I don't have a page fault handler yet).

  • usermode_jump(elf_entry_point) gets called.

This is usermode_jump (note this is an alpha version, as the stack doesn't get changed for now):

#define usermode_jump(_addr)                                    \
        do {                                                    \
                msr_write (MSR_STAR, ((u64)U_CS)<<48            \
                                   | ((u64)K_CS)<<32);          \
                msr_write (MSR_LSTAR, (u64)syscall_dispatch);   \
                asm volatile ("mov %0, %%rcx\n\t"               \
                        "sysretq\n\t" : : "m"(_addr));          \
        } while (0)

When the usermode code executes the syscall instruction, the instruction pointer (%rip) is set to the address loaded into MSR_LSTAR, which is syscall_dispatch. That function looks like:

.... some checks...
        asm volatile (
                "shl $3, %%rax\n\t"
                "addq %0, %%rax\n\t"
                "call *(%%rax)\n\t"
                : : "c"(syscall_table)
        );

I'd "only" have to implement the actual syscalls (and the ELF data segment, and the usermode stack, *gulp*) to fully run binaries.

When I started, I estimated around 10k lines of code for something functional and 15k for something really functional. Right now I have 9374, so it's pretty close.
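The dispatch above boils down to indexing a table of function pointers with the syscall number: "shl $3" multiplies the number by 8, the size of one pointer. Here is a minimal user-space sketch of the same idea in C. The handler names (sys_demo_write, sys_demo_exit) and the bounds check standing in for "some checks" are made up for illustration; only the numbers 1 (write) and 60 (exit) come from the Linux x86-64 ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*syscall_fn)(uint64_t a, uint64_t b, uint64_t c);

/* Hypothetical handlers, only here to fill the table. */
static int64_t sys_demo_write(uint64_t fd, uint64_t buf, uint64_t len)
{
    (void)fd; (void)buf;
    return (int64_t)len;            /* pretend every byte was written */
}

static int64_t sys_demo_exit(uint64_t code, uint64_t b, uint64_t c)
{
    (void)b; (void)c;
    return (int64_t)code;           /* a real kernel would not return */
}

static const syscall_fn demo_syscall_table[] = {
    [1]  = sys_demo_write,          /* write is 1 on Linux x86-64 */
    [60] = sys_demo_exit,           /* exit is 60 on Linux x86-64 */
};

#define DEMO_NSYSCALLS \
    (sizeof demo_syscall_table / sizeof demo_syscall_table[0])

int64_t demo_syscall_dispatch(uint64_t nr, uint64_t a, uint64_t b, uint64_t c)
{
    /* The assembly's "shl $3" assumes 8-byte table entries. */
    assert(sizeof(syscall_fn) == 8);

    /* Reject numbers outside the table and unimplemented slots. */
    if (nr >= DEMO_NSYSCALLS || demo_syscall_table[nr] == NULL)
        return -38;                 /* -ENOSYS on Linux */

    return demo_syscall_table[nr](a, b, c);
}
```

Calling demo_syscall_dispatch(1, 1, 0, 14) mirrors the write(1, str, 14) from init, minus the real pointer.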


Virtual file system

When I started my hobbyist OS a year and a half ago, I never imagined I would get as far as I have.

Right now I'm working on the VFS and ext2, and it's the biggest part so far! You have filesystems, superblocks, inodes (and an inode cache), dentries, files, descriptors, buffers (and a buffer cache), etc. Besides, you get them twice, because you have generic inodes plus on-disk inodes (ext2 inodes), a generic superblock plus an ext2 superblock, etc. The generic function does something like:

i = inode->ops->lookup (inode, path, len);

and the lookup function for ext2 does:

buffer = ext2_read_block (inode, block);
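To show how the generic layer and the filesystem-specific layer fit together, here is a small self-contained sketch. It is not the jOS code: the struct names, the "toyfs" backend and the inode numbers are invented for the example (inode 2 is the only one borrowed from ext2, where it is the root directory). The generic path walk only ever calls through inode->ops->lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_inode;

/* Per-filesystem operations; the generic code only sees this table. */
struct demo_inode_ops {
    struct demo_inode *(*lookup)(struct demo_inode *dir,
                                 const char *name, size_t len);
};

struct demo_inode {
    unsigned long num;
    const struct demo_inode_ops *ops;
};

static struct demo_inode *toyfs_lookup(struct demo_inode *dir,
                                       const char *name, size_t len);

static const struct demo_inode_ops toyfs_ops = { .lookup = toyfs_lookup };

static struct demo_inode toyfs_root = { 2,  &toyfs_ops };
static struct demo_inode toyfs_sbin = { 12, &toyfs_ops };
static struct demo_inode toyfs_init = { 13, &toyfs_ops };

/* Toy backend: a hard-coded tree instead of reading blocks from disk. */
static struct demo_inode *toyfs_lookup(struct demo_inode *dir,
                                       const char *name, size_t len)
{
    if (dir == &toyfs_root && len == 4 && memcmp(name, "sbin", 4) == 0)
        return &toyfs_sbin;
    if (dir == &toyfs_sbin && len == 4 && memcmp(name, "init", 4) == 0)
        return &toyfs_init;
    return NULL;
}

struct demo_inode *demo_root(void)
{
    return &toyfs_root;
}

/* Generic path walk: one lookup call per path component. */
struct demo_inode *demo_namei(struct demo_inode *root, const char *path)
{
    struct demo_inode *i = root;

    assert(root != NULL && path != NULL);

    while (i != NULL && *path != '\0') {
        size_t len;

        while (*path == '/')
            path++;
        len = strcspn(path, "/");
        if (len == 0)
            break;
        i = i->ops->lookup(i, path, len);
        path += len;
    }
    return i;
}
```

The point of the indirection is that demo_namei never needs to know which filesystem it is walking.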

Sadly, I can't summarize what I've done in one post. The OS has more than 8000 lines of code, and it grows whenever I find a moment to work on it. I hope one day I get to use what I'm learning, as it's a huge amount of work :(.


Good readings for os coding.

These are my reading recommendations if you want to learn how an operating system works, or if you want to write one. They would have been very useful to me when I started almost two years ago. The list is not complete, as I still have a lot to read.
Note I'm not a genius and I need information about every little thing. I wasn't taught even the basics at school, so I had to start almost from scratch.

  • The first thing you should learn is C. I guess anyone thinking about OS coding already knows C, but you need more than a basic knowledge. Learn the differences between the stack and the heap, static and dynamic memory and what a stack frame is.

  • Read about the C library. I recommend reading the libc PDF manual (which is free) or Advanced Programming in the UNIX Environment. Pick just one of the two, as they cover the same things.

  • Learn the CPU. If you are using x86, the AMD manuals are great (and I prefer those), but the ones by Intel are also very good. You will need to read the system programming volumes. Learn about paging, CPU privileges, exceptions and interrupts, APIC, etc. This was probably the hardest part for me :(, and not because it's hard, but because it was unknown.

  • Learn assembly. You can pick Intel or AT&T syntax. As I am developing on Linux, I recommend Professional Assembly Language. Note you can also use Intel syntax with newer GCC versions.

  • Learn the toolchain. If you use Linux, you have the docs for free online: GCC, as, ld, gdb, etc. It's important to know how the linking process works, so read a book from the Wikipedia external links section.

  • Now you can start reading about the mechanisms used by an OS. I recommend: Linux kernel development. It's easy, very well explained and will give you a good idea about some important and basic stuff.

This would be a good start, but it's not enough. More things:

  • This tutorial was really motivational for me. Try to have something that boots, calls C code and prints to the screen. I believe it's a good start, and you can improve it later. My first version had 118 lines of code: Makefile, kmain.c, kstart.S and linker.ld.

  • It is also important to bookmark this site. You will need it often.

  • Switch into protected mode.

  • Make exceptions work.

  • Read more about OS design. There are many types of operating systems, as always, just pick Unix :). In my opinion, the newest books are too hard to grasp before learning some basic stuff, so I recommend The Design of the UNIX Operating System, UNIX Internals: The New Frontiers and The Design and Implementation of the 4.4 BSD Operating System. There are probably more, but these are really good. Note they are probably outdated, but the overall architecture hasn't changed that much. If you are smarter than me, you can just use the ones about the Linux kernel, but I think things have got too complex these days.

  • You will need several data structures in your code: linked lists, trees, etc. Also spinlocks, a kprintf function, itoa, date conversion functions... Some of those can be developed outside the kernel tree. I think this is a good idea, as you will be merging far more stable code.

  • The page allocator is one of the first things you need. Then, a kernel memory allocator built on top of it, and probably a slab allocator. Read Understanding the Linux Virtual Memory Manager. Note it's based on the Linux 2.4 kernel, which is far more complex than an initial implementation will be.

  • Now you can code some basic devices. Every one of them has its own documentation available online; see the osdev.org site. For example: timers (you have the PIT, APIC, RTC...), the keyboard, detecting the CPU speed, or even a basic ATA driver. These devices are useful to get familiar with the hardware.


Memory detection

To detect memory we can use the BIOS, but it's easier to take advantage of Grub, as it leaves that information in RAM for us.

In multiboot.h we have the two structures we need:
typedef struct multiboot_info {
        u32 flags;
        u32 mem_lower;
        u32 mem_upper;
        u32 boot_device;
        u32 cmdline;
        u32 mods_count;
        u32 mods_addr;
        u32 a, b, c, d;
        u32 mmap_length;
        u32 mmap_addr;
        u32 drives_length;
        u32 drives_addr;
        u32 config_table;
        u32 boot_loader_name;
        u32 apm_table;
} multiboot_info_t;

struct multiboot_mmap_entry {
        u32 size;
        u64 addr;
        u64 len;
        u32 type;
} __attribute__((packed));

typedef struct multiboot_mmap_entry multiboot_memory_map_t;

Grub leaves a pointer to the multiboot information in the %ebx register; we save it into memory for later:

        .global mbi32
mbi32:
        .long 0x0

        movl %ebx, mbi32


extern u32 mbi32;
static multiboot_info_t *mbi;
mbi = (multiboot_info_t *)((u64)mbi32);

Now we just walk the struct and save the data we need:

static void
get_memory_ranges (void)
{
        multiboot_memory_map_t *mmap = (multiboot_memory_map_t *)(u64)mbi->mmap_addr;
        u8 i = 0;

        /* grub leaves this flag clear when it can't detect the amount of memory */
        if ((mbi->flags & (1<<0)) == 0x0)
                kpanic ();

        kprintf ("%i lower and %i upper KB detected. Usable ranges:\n", mbi->mem_lower, mbi->mem_upper);

        while ((u64)mmap < mbi->mmap_addr + mbi->mmap_length) {
                /* type 1 means usable RAM */
                if (mmap->type == 1) {
                        kprintf ("base: %p     limit: %p\t(%p B)\t\n",
                                mmap->addr, mmap->addr + mmap->len, mmap->len);

                        usablemem[i].addr = mmap->addr;
                        usablemem[i++].len = mmap->len;
                }

                /* the size field does not count itself */
                mmap = (multiboot_memory_map_t *)((u64)mmap + mmap->size + sizeof (mmap->size));
        }
}


usablemem is defined as:

struct _usablemem {
        u64 addr;
        u64 len;
} usablemem[10];

This has an issue: how many ranges will we have? We don't have dynamic memory yet, so we need a static number. My emulator returns 2 usable ranges so I just reserve 10 and pray for the best :).

This is harder than it looks. We need memory for the structures that keep information about memory! I've been unable to think of an easy way to solve this with the "buddy allocator". As I prefer to move on to more interesting stuff rather than have better algorithms, I'll use a bitmap. If I pick a small bitmap, the operating system won't support a lot of memory, and if it's big, I'll waste some bits. With one 64-bit word and 2 MB pages we can support 134217728 bytes (128 MB), and that should be enough for now.
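A bitmap allocator of that size fits in a few lines. The sketch below is not the jOS allocator; it assumes, as my reading of the numbers above, one 64-bit word with one bit per 2 MB page, and every name in it is made up.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE  (2ULL * 1024 * 1024)    /* 2 MB pages */
#define DEMO_NPAGES     64                      /* one bit per page */

static uint64_t demo_bitmap;                    /* bit set = page in use */

/* Total memory the bitmap can track: 64 * 2 MB. */
uint64_t demo_supported_bytes(void)
{
    return DEMO_NPAGES * DEMO_PAGE_SIZE;
}

/* Returns the physical address of a free page, or UINT64_MAX when full. */
uint64_t demo_page_alloc(void)
{
    for (unsigned i = 0; i < DEMO_NPAGES; i++) {
        if ((demo_bitmap & (1ULL << i)) == 0) {
            demo_bitmap |= 1ULL << i;
            return (uint64_t)i * DEMO_PAGE_SIZE;
        }
    }
    return UINT64_MAX;
}

/* Marks the page starting at addr as free again. */
void demo_page_free(uint64_t addr)
{
    assert(addr % DEMO_PAGE_SIZE == 0);
    assert(addr / DEMO_PAGE_SIZE < DEMO_NPAGES);

    demo_bitmap &= ~(1ULL << (addr / DEMO_PAGE_SIZE));
}
```

The linear scan is fine for 64 pages; a bigger bitmap would want a find-first-zero trick, but that is exactly the kind of better algorithm being postponed here.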


Higher half kernel

Kernels on both x86 and x86-64 map part of the address space for the process and another part for the kernel. When the process changes, the pages mapped for that process are invalidated, but the ones for the kernel stay valid. The easiest approach is to give half of the address space to the process and the other half to the kernel, so the kernel can reach all the physical memory without creating particular mappings.

If we used the lower half for the kernel, 32-bit processes (if we ever support them...) would not run, as their pointers cannot reach above 2^32. And even though we aim for 64-bit processes, every other kernel seems to be mapped in the higher half :).

The linker script loads the kernel code at a low physical address, but we link it as if it were mapped at some other (virtual) address. It would be easier with PIC (position-independent code), but as far as I know the kernel can't be compiled as PIC.

    . = phys;

    .mboot : AT(ADDR(.mboot))
    {
        mboot = .;
        . = ALIGN(4096);
    }

    . += page_offset;

But this is not enough: "call" takes a 32-bit relative offset (the "00 00 00 00" after the opcode), and the jump we need is much bigger than 2^32, so the address doesn't fit there. This is the error:

boot/boot64.S:15: relocation truncated to fit: R_X86_64_PC32 against symbol `kmain' defined in .text section in kernel/main.o

After digging into the CPU manuals, we find "movabsq", which loads a full 8-byte immediate:
        movabsq $kmain, %rax
        callq *%rax

We also need to define a page table. I used this code:

.macro map_pdpe pdpe_ptr, base_and_flags
        movl \pdpe_ptr, %edi
        movl \base_and_flags, %eax
        movl $512, %ecx
1:
        movl %eax, (%edi)
        addl $0x40000000, %eax
        addl $8, %edi
        loop 1b
.endm

        movl $pdpk + 0x7, pml4 + 0x800
        map_pdpe $pdpk, $0x87

From now on, we need these macros to convert between physical and virtual addresses:

#define __pa(_x) ((_x) - K_PAGE_OFFSET)
#define __va(_x) ((_x) + K_PAGE_OFFSET)

Now the VGA addresses are:
#define VGA_BASE ((u16 *)__va(0xb8000L))
#define VGA_END  ((u16 *)__va(0xb8ff0L))
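A quick way to sanity-check these numbers is to compute the PML4 index of the kernel base. The post never gives the value of K_PAGE_OFFSET, so the sketch below assumes 0xFFFF800000000000, which is what the "pml4 + 0x800" store above implies (entry 256, eight bytes per entry). The demo_ names are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the start of the canonical higher half on x86-64. */
#define DEMO_PAGE_OFFSET 0xFFFF800000000000ULL

#define demo_pa(x) ((uint64_t)(x) - DEMO_PAGE_OFFSET)
#define demo_va(x) ((uint64_t)(x) + DEMO_PAGE_OFFSET)

/* Index into the PML4 table: bits 39..47 of the virtual address. */
unsigned demo_pml4_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> 39) & 0x1FF);
}

/* Virtual to physical, refusing addresses below the kernel mapping. */
uint64_t demo_pa_checked(uint64_t vaddr)
{
    assert(vaddr >= DEMO_PAGE_OFFSET);
    return demo_pa(vaddr);
}
```

With that offset, the VGA buffer at physical 0xb8000 shows up at virtual 0xFFFF8000000B8000.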

Cool :)


Xlib odometer

This is an old project that prints the distance travelled by the mouse pointer, just to play with Xlib.

This is the code. Compile with:
gcc -Wall -pedantic -std=gnu9x odo.c -o odo -lX11 -lm

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <X11/Xlib.h>

#ifdef DEBUG
#define eperror(fcn) {                                          \
                fprintf (stderr, "%s %d: ", __FILE__, __LINE__);\
                perror (fcn);                                   \
                exit (EXIT_FAILURE);                            \
        }
#else
#define eperror(fcn) {                  \
                perror (fcn);           \
                exit (EXIT_FAILURE);    \
        }
#endif /* DEBUG */

struct _iinfo {
        unsigned long int distcms;
        unsigned long int nbuttonp1, nbuttonp2, nbuttonp3;

        unsigned long int nkeyp;
};

#define EVENT_MASK (ButtonPressMask | ButtonReleaseMask)

struct _iinfo iinfo;
long long int distance_mm = 0;
double um_per_pxH, um_per_pxV;

static int
event_loop (Display *d)
{
        XEvent ev;

        while (XCheckMaskEvent (d, EVENT_MASK, &ev)) {

                switch (ev.type) {
                }
        }

        return 1;
}

void
check_distance (Display *d)
{
        Window root_ret, child_ret;
        int absH, absV, relatH, relatV;
        unsigned int modkeymask;
        static int pointerH = 0, pointerV = 0;
        static long long int dist;
        int pxH, pxV;
        long long int dHum, dVum;

        XQueryPointer (d, RootWindow (d, DefaultScreen (d)), &root_ret, &child_ret,
                        &absH, &absV, &relatH, &relatV, &modkeymask);

        pxH = absH - pointerH;
        pxV = absV - pointerV;

        /* Nothing to do if the pointer did not move. */
        if (pxH == 0 && pxV == 0)
                return;

        pointerH = absH;
        pointerV = absV;

        if (pxH < 0)
                pxH = -pxH;
        if (pxV < 0)
                pxV = -pxV;

        /* Pixels to micrometers. */
        dHum = pxH * um_per_pxH;
        dVum = pxV * um_per_pxV;

        /* Thanks to Pythagoras :) */
        dist = sqrtl (dHum * dHum + dVum * dVum);
        distance_mm += (dist / 1000);

        fprintf (stdout, "%lli\n", distance_mm);
}

int
main (int argc, char **argv)
{
        static Display *display;
        int scrNum;

        display = XOpenDisplay (NULL);
        if (display == NULL)
                eperror ("Can't open display.");

        scrNum = DefaultScreen (display);

        /* H is horizontal (width), V is vertical (height).
         * 1000.0 keeps the division in floating point. */
        um_per_pxH = 1000.0 * DisplayWidthMM (display, scrNum) / DisplayWidth (display, scrNum);
        um_per_pxV = 1000.0 * DisplayHeightMM (display, scrNum) / DisplayHeight (display, scrNum);

        /* Without this, the output doesn't show up when redirected to a file. */
        setbuf (stdout, NULL);

        do {
                check_distance (display);
                usleep (100000);
        } while (event_loop (display));

        XCloseDisplay (display);

        return 0;
}


Automatic password change

This started when I didn't have a VPN. I even tried to write my own interpreter in C :). But it's easier to reuse an existing programming language.

Pam preexec just runs a command when a user tries to log in. That command changes the password, and it can use any source: time, RSS, a one-time password, etc.

Code and readme: here.



3rd post about my kernel.

The next step is exceptions. We need a table (the IDT) such as:

static struct {
       u16 offset1;
       u16 selector;
       u8 ist;
       u8 flags;
       u16 offset2;
       u32 offset3;
       u32 reserved;
} __attribute__((__packed__, aligned(8))) idtentry[256];

According to the docs, the CPU can raise 32 exceptions, but those above 20 are reserved (except number 30). The table is similar to the GDT. We can load it with:

static void
set_idt_reg (u64 base, u16 limit)
{
       struct {
               u16 limit;
               u64 base;
       } __attribute__((__packed__)) idt_reg;

       idt_reg.base = base;
       idt_reg.limit = limit;

       asm volatile("lidt %0"::"m" (idt_reg));
}

For the entries:

void
idt_set_gate (u8 num, u64 addr, u16 selector, u16 flags)
{
       /* The handler address is split into three pieces. */
       idtentry[num].offset1 = addr & 0xFFFF;
       idtentry[num].offset2 = (addr >> 16) & 0xFFFF;
       idtentry[num].offset3 = (addr >> 32);
       idtentry[num].selector = selector;
       idtentry[num].flags = flags;

       idtentry[num].reserved = 0;
       idtentry[num].ist = 0;
}
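The address split is easy to get wrong, so here is a stand-alone sketch that builds one gate and reassembles the address from its three pieces. It reuses the field layout of the table above with standard integer types; the demo_ names and the sample address are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as the idtentry table, with stdint types. */
struct demo_idt_entry {
    uint16_t offset1;       /* address bits 0..15  */
    uint16_t selector;
    uint8_t  ist;
    uint8_t  flags;
    uint16_t offset2;       /* address bits 16..31 */
    uint32_t offset3;       /* address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

struct demo_idt_entry demo_make_gate(uint64_t addr, uint16_t selector,
                                     uint8_t flags)
{
    struct demo_idt_entry e;

    /* A long mode gate descriptor is exactly 16 bytes. */
    assert(sizeof e == 16);

    e.offset1  = (uint16_t)(addr & 0xFFFF);
    e.offset2  = (uint16_t)((addr >> 16) & 0xFFFF);
    e.offset3  = (uint32_t)(addr >> 32);
    e.selector = selector;
    e.flags    = flags;
    e.ist      = 0;
    e.reserved = 0;
    return e;
}

/* Puts the three pieces back together, to check nothing was lost. */
uint64_t demo_gate_addr(struct demo_idt_entry e)
{
    return (uint64_t)e.offset1
         | ((uint64_t)e.offset2 << 16)
         | ((uint64_t)e.offset3 << 32);
}
```

A round trip through demo_make_gate and demo_gate_addr should give back the original address, whatever the handler address is.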


We have to call this function once for every interrupt:

#define GATE_INT  0x8e
#define GATE_TRAP 0x8f


idt_set_gate (0, (u64)&do_isr0, K_CS, GATE_INT);
idt_set_gate (31, (u64)&do_isr31, K_CS, GATE_INT);

for (n = 32; n <= 254; n++)
         idt_set_gate (n, (u64)&do_isr20, K_CS, GATE_INT);

set_idt_reg((u64) idtentry, sizeof(idtentry) - 1); 

To start with, we can use the same function for every interrupt.

Now the hardest part. The interrupt handlers have some peculiarities:

#define ISR(_n)                                        \
__attribute__ ((regparm (0), aligned(8))) void do_isr ## _n (void)      \
{                                                      \
asm volatile (                                         \
       "cli\n\t"                                       \
       "pushq $" #_n "\n\t"                            \
       pushaq()                                        \
       "call isr_handler\n\t"                          \
       popaq()                                         \
       "addq $8, %rsp\n\t"                             \
       "sti\n\t"                                       \
       "iretq\n");                                     \
}

The function entry point can't touch the stack, which is tricky to do in C. Also, we have to return with "iretq"; the "ret" that the compiler inserts will never be reached. Let's see if it works:

objdump -D jOS | less
0000000000100ba0 <do_isr0>:
 100ba0:       fa                      cli   
 100ba1:       6a 00                   pushq  $0x0

Looks good. We can define pushaq and popaq to push or pop the registers:
#define pushaq()               \
       "pushq %rax\n\t"        \
       "pushq %rcx\n\t"        \
       "pushq %rdx\n\t"        \
       "pushq %rbx\n\t"        \
       "pushq %rbp\n\t"        \
       "pushq %rsi\n\t"        \
       "pushq %rdi\n\t"

#define popaq()                \
       "popq %rdi\n\t"         \
       "popq %rsi\n\t"         \
       "popq %rbp\n\t"         \
       "popq %rbx\n\t"         \
       "popq %rdx\n\t"         \
       "popq %rcx\n\t"         \
       "popq %rax\n\t"

Now the handler can change the CPU registers without clobbering the values they had before the interrupt.

This "isr_handler" function is a normal function with one parameter:

void
isr_handler (struct intr_frame r)
{
       kprintf ("Exception %d!\n", r.intnum);
}

The struct represents the "stack frame" of the function. Because it's passed by value, the function sees this:

previous values
parameter 1
return address
frame pointer
local variable

Parameter 1 includes everything afterwards (lower addresses) and we have access to those parameters from the function.

It's confusing. The struct:
struct intr_frame {
       u64 rdi;
       u64 rsi;
       u64 rbp;
       u64 rbx;
       u64 rdx;
       u64 rcx;
       u64 rax;
       u64 intnum;
       u64 errcode;
       u64 retrip;
       u64 cs;
       u64 rflags;
       u64 retrsp;
       u64 ss;
};

Beware because structs grow upwards and the stack downwards on x86-64.

From C, we can raise an interrupt to see if it works:

asm volatile ("int $0x3\n");
asm volatile ("int $0x5\n");

Note that we would need to execute a "sti" instruction to start getting hardware interrupts. For now we're only working with exceptions and software interrupts, which are not masked.

This is the result:

Bonus: the kprintf function. Thanks to the helpers GCC gives us, it's not hard to write. The only catch is the number conversions. We can define "itoa":

itoa (u64 n, char *s, u8 base)
        calculate the absolute value for the number
        do {
                int rem = pn % base;
                *p++ = (rem < 10) ? rem + '0' : rem + 'a' - 10 ;
        } while (pn /= base);
        reverse the string because it's reversed
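Filling in the two prose steps, a runnable version could look like the sketch below. It is one possible implementation rather than the kernel's: it takes a signed 64-bit value so that the absolute value step means something, it returns the buffer, and the name demo_itoa is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Converts n to text in the given base (2..36) and returns s.
 * s needs room for 66 bytes: 64 binary digits, a sign and the NUL. */
char *demo_itoa(int64_t n, char *s, unsigned base)
{
    /* Absolute value in unsigned arithmetic, so INT64_MIN does not overflow. */
    uint64_t pn = (n < 0) ? 0 - (uint64_t)n : (uint64_t)n;
    char *p = s;

    assert(base >= 2 && base <= 36);

    /* Emit digits, least significant first. */
    do {
        unsigned rem = (unsigned)(pn % base);
        *p++ = (char)((rem < 10) ? rem + '0' : rem + 'a' - 10);
    } while (pn /= base);

    if (n < 0)
        *p++ = '-';
    *p = '\0';

    /* The digits came out backwards: reverse them in place. */
    for (char *a = s, *b = p - 1; a < b; a++, b--) {
        char t = *a;
        *a = *b;
        *b = t;
    }
    return s;
}
```

Generating the digits backwards and reversing at the end avoids having to know the number of digits in advance.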

#define va_start(v,l) __builtin_va_start(v,l)
#define va_arg(v,l)   __builtin_va_arg(v,l)
#define va_end(v)     __builtin_va_end(v)
#define va_copy(d,s)  __builtin_va_copy(d,s)
typedef __builtin_va_list va_list;

kprintf (const char *fmt, ...)
        va_list ap;

        va_start (ap, fmt);

        Print each character from the string except if it contains '%',
        in which case look at the next element and do something such as:
        puts(va_arg (ap, char *)); to get the next element if it's a pointer or 
        vga_writechar ((char) va_arg (ap, int)); to get a character 
        (characters are promoted to integers).
        va_end (ap);

Next stop: PIT and PIC.


Enabling long mode

This continues the previous post about a 64 bit loader.

We now have a Grub 2 image (Grub 1 can't load a 64 bit ELF).

This is the code for the loader (with comments):

        .set MB_MAGIC, 0x1BADB002
        .set MB_FLAGS, \
                1<<0 /* page align */ |\
                1<<1 /* memory info */ |\
                1<<16 /* a.out kludge */
        .section .mboot


        .align 4

        /* Grub magic */
        .long MB_MAGIC
        .long MB_FLAGS
        .long -(MB_MAGIC + MB_FLAGS)

        .global _start
_start:
        /* block interrupts until we have the IDT */
        cli

        /* Disable paging, although grub should not enable it */
        mov %cr0, %eax
        btr $31, %eax
        movl %eax, %cr0

        /* Segmentation */
        lgdt gdt_ptr

        /* Page tables */
        movl $pdp + 0x7, pml4
        movl $pd + 0x7, pdp
        movl $pt + 0x7, pd

        /* Map 512 entries, 4KB each (2 MB).
         * It only maps the addresses to themselves. */
        movl $pt, %edi
        movl $0x7, %eax # flags
        movl $512, %ecx
1:
        movl %eax, (%edi)
        addl $0x1000, %eax
        addl $8, %edi # next entry
        loop 1b

        # Load the page table we just generated
        movl $pml4, %eax
        movl %eax, %cr3

        /* Enables PAE, needed. */
        movl %cr4, %eax
        bts $5, %eax
        movl %eax, %cr4

        /* IA32_EFER.LME = 1. Enables "long mode". */
        mov $0x0c0000080, %ecx
        rdmsr
        bts $8, %eax
        wrmsr

        /* CR0.PG = 1. No more physical addresses! Enable paging. */
        mov %cr0, %eax
        bts $31, %eax
        mov %eax, %cr0

        /* Jump into a 64-bit segment. */
        ljmp $0x8, $start64

        /* This should not be reached: */
1:
        jmp 1b

        .align 8

/* GDT pointer */
gdt_ptr:
        .word gdt_end - gdt - 1
        .long gdt

        .align 16
/* This is the key for segments.
   A null segment, one for 64-bit code, and another for data,
   just for the kernel for now. */
gdt:
        .quad 0x0000000000000000
        .quad 0x00AF98000000FFFF
        .quad 0x008F92000000FFFF
gdt_end:


        .align 4096

        # This is reserved for the page tables.
        .lcomm pml4, 0x1000
        .lcomm pdp, 0x1000
        .lcomm pd, 0x1000
        .lcomm pt, 0x1000

We just follow the cpu manuals to enable long mode, no magic here :).
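The page table loop in the loader is easier to follow in C. The sketch below fills an ordinary array the way the assembly fills pt: 512 entries, each pointing at the next 4 KB frame, with the flags 0x7 (present, writable, user) in the low bits. It runs in user space and touches no real page tables; the demo_ names are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENTRIES 512
#define DEMO_FLAGS   0x7ULL         /* present | writable | user */

/* Stand-in for the 4 KB "pt" area reserved by the loader. */
uint64_t demo_pt[DEMO_ENTRIES];

/* Fills one page table so that virtual page i maps to physical page i. */
void demo_identity_map(uint64_t *pt)
{
    uint64_t entry = DEMO_FLAGS;

    assert(pt != NULL);

    for (int i = 0; i < DEMO_ENTRIES; i++) {
        pt[i] = entry;
        entry += 0x1000;            /* next 4 KB frame */
    }
}

/* Physical address stored in an entry: drop the 12 low flag bits.
 * (The high bits, such as NX, are ignored in this sketch.) */
uint64_t demo_entry_addr(uint64_t entry)
{
    return entry & ~0xFFFULL;
}
```

The last entry maps the frame at 0x1FF000, so the table covers exactly the first 2 MB, which is where the kernel is loaded.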

Now the 64 bit code:



.global start64
start64:
        # This creates a stack for C:
        movq $(stack + STACKSIZE), %rsp

        pushq $0x0;
        pushq $0x0;

        movq %rsp, %rbp

        pushq %rbx

        # Calls C code!
        call kstart

1:
        jmp 1b

.set STACKSIZE, 0x1000

.lcomm stack, STACKSIZE

The VGA driver writes between 0xB8000 and 0xB8FF0. Each character takes two bytes. We can use this macro:

#define VC(_c, _fg, _bg)        ((_c) | (((_fg) | (_bg) << 4) << 8))

_fg and _bg are the foreground and background colors: 0x0 black, 0x1 blue, etc., up to 0xf white.
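To see the macro at work without real hardware, this sketch writes a string into an ordinary array laid out like the text buffer. It assumes the standard 80x25 text mode; demo_screen and demo_vga_puts are invented names, and the VC macro is copied from above.

```c
#include <assert.h>
#include <stdint.h>

#define VC(_c, _fg, _bg)        ((_c) | (((_fg) | (_bg) << 4) << 8))

#define DEMO_COLS 80                /* assumed: 80x25 text mode */
#define DEMO_ROWS 25

/* Stand-in for the memory at 0xB8000. */
uint16_t demo_screen[DEMO_COLS * DEMO_ROWS];

/* Writes a NUL-terminated string starting at (row, col).
 * The caller must make sure the string fits in the buffer. */
void demo_vga_puts(uint16_t *vga, int row, int col, const char *s,
                   int fg, int bg)
{
    uint16_t *cell;

    assert(row >= 0 && row < DEMO_ROWS && col >= 0 && col < DEMO_COLS);

    cell = vga + row * DEMO_COLS + col;
    while (*s != '\0') {
        unsigned char c = (unsigned char)*s++;
        /* Low byte: character. High byte: colors. */
        *cell++ = (uint16_t)VC(c, fg, bg);
    }
}
```

For example, VC('A', 0xf, 0x1) gives 0x1F41: a white "A" on a blue background.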

Now let's see how to link the code.

The loader has to be 32-bit code. The problem is that, as far as I know, 32-bit and 64-bit objects can't be linked together (without conversion), so we have to compile the loader as 32-bit and then convert the .o:

gcc -c -m32 -o boot/boot32.o boot/boot.S
objcopy -O elf64-x86-64 boot/boot32.o boot/boot.o

To link, ld needs -N and -T linker.ld, where linker.ld is a script for ld. This is the file:


phys = 0x100000;
virt = 0x100000;

SECTIONS
{
    .text virt : AT(phys)
    {
        code = .;
        /* Make sure the header goes at the beginning */
        . = ALIGN(4096);
    }

    .data : AT(phys + data - code)
    {
        data = .;
        . = ALIGN(4096);
    }

    .bss : AT(phys + bss - code)
    {
        bss = .;
    }

    /* end of kernel :) */
    eok = .;
}

It only defines the load addresses for the binary. We can check them with objdump -D.