
Transforming legacy C code into EM

CoreMark® has emerged as the premier industry benchmark for measuring CPU performance within embedded systems. Under the stewardship of EEMBC, virtually every MCU vendor has certified and published CoreMark scores for a broad portfolio of their processors. Running the benchmark code also serves as a "typical workload" when characterizing the active power consumption [ μW / MHz ] of a particular MCU.

The workload introduced by CoreMark encompasses four algorithms reflecting the variety of software functions often implemented within embedded application programs:

  list processing – find and remove elements, generalized sorting
  matrix manipulation – add and multiply by a scalar, vector, or matrix
  state machine – scan a string for a variety of numeric formats
  cyclic redundancy check – checksum over a sequence of 16 / 32-bit values

Besides adding to the workload, CoreMark uses the cyclic redundancy check algorithm to validate the final results of running the benchmark program – comparing a checksum over the list elements used in the list processing algorithm against an expected value. CoreMark also checksums the matrix data produced by the matrix manipulation algorithm as well as the state machine transitions encountered by the state machine algorithm.

You'll find the CoreMark sources on GitHub, together with instructions for building / running the benchmark program. To ensure the integrity of the benchmark, you cannot modify any of its (portable) C source files – with the exception of core_portme.[ch], used to adapt CoreMark to a particular hardware platform.

Needless to say, your choice of C compiler along with specific options for controlling program optimization remain on the table. While primarily intended for comparing different MCUs, CoreMark also provides a known codebase useful for "apples-to-apples" comparisons between different compilers [GCC, IAR, Keil, LLVM] targeting the same MCU.

CoreMark – a "typical" C program in more ways than one

We sense that very few software practitioners have actually studied the CoreMark source files themselves. As long as "someone else" can port / build / run the benchmark on the MCU of interest – good enough !!

In our humble opinion, the CoreMark sources would not serve as the best textbook example of well-crafted C code: insufficient separation of concerns, excessive coupling among compilation units, plus other deficiencies.

Said another way, CoreMark typifies the design / implementation of much of the legacy embedded C code we've encountered for decades within industry and academia alike.  But therein lies an opportunity to showcase EM.

CoreMark ⇒ EM•Mark

In reality, none of the official CoreMark sources (written in C) will survive their transformation into EM•Mark – a new codebase (re-)written entirely in EM.  At the same time, applying the same CoreMark algorithms to the same input data must yield the same results in EM.

The input data used by EM•Mark (like CoreMark) ultimately derives from a handful of seed variables, statically-initialized with prescribed values.  Declared volatile in EM as well as C, these seed variables preserve the integrity of the benchmark: since the underlying compiler cannot know their initial values, it cannot perform overly-aggressive code optimizations.

At the same time, the CoreMark sources do make use of C preprocessor #define directives to efficiently propagate constants and small (inline) functions during compilation.  EM•Mark not only achieves the same effect automatically via whole-program optimization, but also leverages the full power of EM meta-programming to initialize internal data structures at build-time – resulting in a far more compact program image at run-time.

If necessary, review the material on program configuration and compilation to fully appreciate the opportunities that EM affords for build-time optimization.

High-level design

The EM•Mark sources (found in the  em.coremark package within the em.bench bundle) consist of ten EM modules and two EM interfaces, organized as follows:


EM•Mark Design Hierarchy

The  ActiveRunnerP and  SleepyRunnerP programs on top of this hierarchy both execute the same core benchmark algorithms, albeit in two very different contexts:

ActiveRunnerP performs multiple  benchmark iterations, much like the legacy CoreMark program

SleepyRunnerP performs a single  benchmark iteration, awakening every second from deep-sleep

The CoreBench module (imported by both of these programs) coordinates both configuration as well as execution of the list processing, matrix manipulation, and state machine algorithms; we'll have more to say about its implementation in a little while.

To capture behavioral commonality between CoreBench and the algorithm modules it uses internally [ ListBench, MatrixBench, StateBench ], our EM•Mark design introduces the abstract em.coremark/BenchAlgI interface:

package em.coremark

import Utils

interface BenchAlgI

    config memSize: uint16

    function dump()
    function kind(): Utils.Kind
    function print()
    function run(arg: uarg_t = 0): Utils.sum_t
    function setup()


Of the handful of functions specified by this interface, two play a central role in the implementation of each benchmark algorithm:

BenchAlgI.setup, which initializes the algorithm's input data using volatile seed variables

BenchAlgI.run, which executes one pass of the benchmark algorithm and returns a CRC value

Taking a quick peek inside CoreBench, you'll notice how this module's implementation of the BenchAlgI interface simply delegates to the other algorithm modules – which in turn implement the same interface:

em.coremark/CoreBench.em [exc]
def em$construct()
    Utils.bindSeedH(1, 0x0)
    Utils.bindSeedH(2, 0x0)
    Utils.bindSeedH(3, 0x66)

def dump()

def kind()
    return Utils.Kind.FINAL

def print()

def run(arg)
    auto crc =
    Utils.setCrc(Utils.Kind.FINAL, Crc.add16(<int16>crc, Utils.getCrc(Utils.Kind.FINAL)))
    crc =<uarg_t>-1)
    Utils.setCrc(Utils.Kind.FINAL, Crc.add16(<int16>crc, Utils.getCrc(Utils.Kind.FINAL)))
    Utils.bindCrc(Utils.Kind.LIST, Utils.getCrc(Utils.Kind.FINAL))
    return Utils.getCrc(Utils.Kind.FINAL)

def setup()

CoreBench also uses public get / set functions provided by the  Utils module to fetch / store designated CRC and seed values.

more code ahead – feel free to scroll down to the Summary

Each of the benchmark algorithms will call the Crc.add16 or Crc.addU32 functions to fold a new data value into a particular checksum. Looking at the implementation of the  Crc module, both of these function definitions ultimately call Crc.update – a private function that effectively mimics the crcu8 routine found in the legacy CoreMark source code:

static ee_u16 crcu8(ee_u8 data, ee_u16 crc)
{
    ee_u8 i = 0, x16 = 0, carry = 0;

    for (i = 0; i < 8; i++)
    {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;

        if (x16 == 1)
        {
            crc ^= 0x4002;
            carry = 1;
        }
        else
            carry = 0;
        crc >>= 1;
        if (carry)
            crc |= 0x8000;
        else
            crc &= 0x7fff;
    }
    return crc;
}

Finally, CoreBench defines a pair of config params [ TOTAL_DATA_SIZE, NUM_ALGS ] used to bind the BenchAlgI.memSize parameter associated with the other algorithms; refer to CoreBench.em$configure for further details.  Initialized to values tracking the legacy CoreMark code, CoreBench assigns ⌊2000/3⌋ ≡ 666 bytes per algorithm.(1)

  1. We'll have more to say about CoreBench.em$configure after we explore the three benchmark algorithms in more detail.

Matrix manipulation

Pivoting to the simplest of the three benchmark algorithms administered by CoreBench, the  MatrixBench module implements each (public) function specified by the BenchAlgI interface; and most of the MatrixBench private functions defined inside the module [ addVal, mulVec, clip, etc ] correspond to legacy C functions / macros found in core_matrix.c .

Internally, MatrixBench operates upon three matrices [ matA, matB, matC ] dimensioned at build-time by the module's em$construct function – which uses the BenchAlgI.memSize parameter (bound previously in CoreBench.em$configure) when calculating a value for dimN:

em.coremark/MatrixBench.em [exc]
module MatrixBench: BenchAlgI


    type matdat_t: int16
    type matres_t: int32

    config dimN: uint8

    var matA: matdat_t[]
    var matB: matdat_t[]
    var matC: matres_t[]
em.coremark/MatrixBench.em [exc]
def em$construct()
    auto i = 0
    auto j = 0
    while j < memSize
        i += 1
        j = i * i * 2 * 4
    dimN = i - 1
    matA.length = matB.length = matC.length = dimN * dimN

The MatrixBench.setup function initializes "input" matrices [ matA, matB ] at run-time, using values derived from two of the volatile seed variables prescribed by legacy CoreMark:

em.coremark/MatrixBench.em [exc]
def setup()
    auto s32 = <uint32>Utils.getSeed(1) | (<uint32>Utils.getSeed(2) << 16)
    auto sd = <matdat_t>s32
    sd = 1 if sd == 0
    auto order = <matdat_t>1
    for auto i = 0; i < dimN; i++
        for auto j = 0; j < dimN; j++
            sd = <int16>((order * sd) % 65536)
            auto val = <matdat_t>(sd + order)
            val = clip(val, false)
            matB[i * dimN + j] = val
            val += order
            val = clip(val, true)
            matA[i * dimN + j] = val
            order += 1
 finally executes the benchmark algorithm itself – calling a sequence of private matrix manipulation functions and then returning a checksum that captures intermediate results of these operations:

em.coremark/MatrixBench.em [exc]
def run(arg)
    auto crc = <Crc.sum_t>0
    auto val = <matdat_t>arg
    auto clipval = enlarge(val)
    addVal(val)                              # mirrors matrix_add_const
    mulVal(val)                              # mirrors matrix_mul_const
    crc = Crc.add16(sumDat(clipval), crc)
    mulVec()                                 # mirrors matrix_mul_vect
    crc = Crc.add16(sumDat(clipval), crc)
    mulMat()                                 # mirrors matrix_mul_matrix
    crc = Crc.add16(sumDat(clipval), crc)
    mulMatBix()                              # mirrors matrix_mul_matrix_bitextract
    crc = Crc.add16(sumDat(clipval), crc)
    addVal(-val)                             # restore matA
    return Crc.add16(<int16>crc, Utils.getCrc(Utils.Kind.FINAL))

Once again, the [EM] implementations of private functions like addVal and mulMat track their [C] counterparts found in the CoreMark core_matrix.c source file.

State machine

The  StateBench module – which also conforms to the BenchAlgI interface – scans an internal array [ memBuf ] for text matching a variety of numeric formats.  Similar to what we've seen in MatrixBench, the build-time em$construct function sizes memBuf as well as initializes some private config parameters used as run-time constants:

em.coremark/StateBench.em [exc]
    config intPat: string[4] = [
        "5012", "1234", "-874", "+122"
    ]
    config fltPat: string[4] = [
        "35.54400", ".1234500", "-110.700", "+0.64400"
    ]
    config sciPat: string[4] = [
        "5.500e+3", "-.123e-2", "-87e+832", "+0.6e-12"
    ]
    config errPat: string[4] = [
        "T0.3e-1F", "-T.T++Tq", "1T3.4e4z", "34.0e-T^"
    ]

    config intPatLen: uint16
    config fltPatLen: uint16
    config sciPatLen: uint16
    config errPatLen: uint16

    var memBuf: char[]
em.coremark/StateBench.em [exc]
def em$construct()
    memBuf.length = memSize
    intPatLen = intPat[0].length
    fltPatLen = fltPat[0].length
    sciPatLen = sciPat[0].length
    errPatLen = errPat[0].length

The StateBench.setup function uses the xxxPat and xxxPatLen config parameters in combination with a local seed variable to initialize the memBuf characters at run-time:

em.coremark/StateBench.em [exc]
def setup()
    auto seed = Utils.getSeed(1)
    auto p = &memBuf[0]
    auto total = 0
    auto pat = ""
    auto plen = 0
    while (total + plen + 1) < (memSize - 1)
        if plen
            for auto i = 0; i < plen; i++
                *p++ = pat[i]
            *p++ = ','
            total += plen + 1
        switch ++seed & 0x7
        case 0
        case 1
        case 2
            pat  = intPat[(seed >> 3) & 0x3]
            plen = intPatLen
        case 3
        case 4
            pat  = fltPat[(seed >> 3) & 0x3]
            plen = fltPatLen
        case 5
        case 6
            pat  = sciPat[(seed >> 3) & 0x3]
            plen = sciPatLen
        case 7
            pat  = errPat[(seed >> 3) & 0x3]
            plen = errPatLen

Details aside, calls a private scan function which in turn drives the algorithm's state machine; run also calls a private scramble function to "corrupt" memBuf contents ahead of the next scanning cycle:

em.coremark/StateBench.em [exc]
def run(arg)
    arg = 0x22 if arg < 0x22
    var finalCnt: uint32[NUM_STATES]
    var transCnt: uint32[NUM_STATES]
    for auto i = 0; i < NUM_STATES; i++
        finalCnt[i] = transCnt[i] = 0
    scan(finalCnt, transCnt)
    scramble(Utils.getSeed(1), arg)
    scan(finalCnt, transCnt)
    scramble(Utils.getSeed(2), arg)
    auto crc = Utils.getCrc(Utils.Kind.FINAL)
    for auto i = 0; i < NUM_STATES; i++
        crc = Crc.addU32(finalCnt[i], crc)
        crc = Crc.addU32(transCnt[i], crc)
    return crc

def scan(finalCnt, transCnt)
    for auto str = &memBuf[0]; *str;
        auto state = nextState(&str, transCnt)
        finalCnt[ord(state)] += 1

def scramble(seed, step)
    for auto str = &memBuf[0]; str < &memBuf[memSize]; str += <uint16>step
        *str ^= <uint8>seed if *str != ','

The crc returned by run effectively summarizes the number of transition and final states encountered when scanning.

even more code ahead – feel free to scroll down to the Summary

List processing

Unlike its peer benchmark algorithms, the  ListBench module introduces some new design elements into the EM•Mark hierarchy depicted earlier:

the ComparatorI abstraction, used by ListBench to generalize its internal implementation of list sorting through a function-valued parameter that compares element values

the ValComparator module, an implementation of ComparatorI which invokes the other  benchmark algorithms (through a proxy) in a data-dependent fashion

The ComparatorI interface names just a single function [ compare ] ; the ListBench module in turn specifies the signature of this function through a public type [ Comparator ] : (1)

  1. a design-pattern similar to a Java @FunctionalInterface annotation or a C# delegate object
package em.coremark

import ListBench

interface ComparatorI

    function compare: ListBench.Comparator

em.coremark/ListBench.em [exc]
module ListBench: BenchAlgI

    type Data: struct
        val: int16
        idx: int16

    type Comparator: function(a: Data&, b: Data&): int32

    config idxCompare: Comparator
    config valCompare: Comparator

CoreBench.em$configure (which we'll examine shortly) performs build-time binding of conformant Comparator functions to the pair of ListBench config parameters declared above. But first, let's look at some private declarations within the ListBench module:

em.coremark/ListBench.em [exc]

    type Elem: struct
        next: Elem&
        data: Data&

    function find(list: Elem&, data: Data&): Elem&
    function pr(list: Elem&, name: string)
    function remove(item: Elem&): Elem&
    function reverse(list: Elem&): Elem&
    function sort(list: Elem&, cmp: Comparator): Elem&
    function unremove(removed: Elem&, modified: Elem&)

    config maxElems: uint16

    var curHead: Elem&


The Elem struct supports the conventional representation of a singly-linked list, with the ListBench private functions manipulating references to objects of this type. The maxElems parameter effectively sizes the pool of Elem objects, while the curHead variable references a particular Elem object that presently anchors the list.

Similar to the other BenchAlgI modules we've seen, ListBench cannot fully initialize its internal data structures until setup fetches a volatile seed at run-time. Nevertheless, we still can perform a sizeable amount of build-time initialization within em$construct:

em.coremark/ListBench.em [exc]
def em$construct()
    auto itemSize = 16 + sizeof<Data>
    maxElems = Math.round(memSize / itemSize) - 3
    curHead = new<Elem>
    curHead.data = new<Data>
    auto p = curHead
    for auto i = 0; i < maxElems - 1; i++
        auto q = p.next = new<Elem>
        q.data = new<Data>
        p = q
    p.next = null

Like all EM config params, maxElems behaves like a var at build-time but like a const at run-time; and the value assigned by em$construct will itself depend on other build-time parameters and variables [ itemSize, memSize ].  In theory, initialization of maxElems could have occurred at run-time – and with EM code that looks virtually identical to what we see here.

But by executing this EM code at build-time , we'll enjoy higher-levels of performance at run-time .

Taking this facet of EM one step further,(1) em$construct "wires up" a singly-linked chain of newly allocated / initialized Elem objects anchored by the curHead variable – a programming idiom you learned in Data Structures 101 .  Notice how each element's data field similarly references a newly-allocated (but uninitialized ) Data object.

  1. namely, that the EM language serves as its own meta-language

Turning now to ListBench.setup, the pseudo-random values assigned to each element's val and idx fields originate with one of the volatile seed variables prescribed by CoreMark.  Before returning, the private sort function (which we'll visit shortly) re-orders the list elements by comparing their idx fields:

em.coremark/ListBench.em [exc]
def setup()
    auto seed = Utils.getSeed(1)
    auto ki = 1
    auto kd = maxElems - 3
    auto e = curHead
    e.data.idx = 0
    e.data.val = <int16>0x8080
    for e = e.next; e.next; e = e.next
        auto pat = <uint16>(seed ^ kd) & 0xf
        auto dat = (pat << 3) | (kd & 0x7)
        e.data.val = <int16>((dat << 8) | dat)
        kd -= 1
        if ki < (maxElems / 5)
            e.data.idx = ki++
        else
            pat = <uint16>(seed ^ ki++)
            e.data.idx = <int16>(0x3fff & (((ki & 0x7) << 8) | pat))
    e.data.idx = 0x7fff
    e.data.val = <int16>0xffff
    curHead = sort(curHead, idxCompare)

Finally, the following implementation of run calls many private functions [ find, remove, reverse, … ] to continually rearrange the list elements; run also uses another volatile seed and calls sort with two different Comparator functions:

em.coremark/ListBench.em [exc]
def run(arg)
    auto list = curHead
    auto finderIdx = <int16>arg
    auto findCnt = Utils.getSeed(3)
    auto found = <uint16>0
    auto missed = <uint16>0
    auto retval = <Crc.sum_t>0
    var data: Data
    data.idx = finderIdx
    for auto i = 0; i < findCnt; i++
        data.val = <int16>(i & 0xff)
        auto elem = find(list, &data)
        list = reverse(list)
        if elem == null
            missed += 1
            retval += <uint16>( >> 8) & 0x1
        else
            found += 1
            if <uint16> & 0x1
                retval += (<uint16>( >> 9)) & 0x1
            if != null
                auto tmp =
       =
       =
       = tmp
        data.idx += 1 if data.idx >= 0
    retval += found * 4 - missed
    list = sort(list, valCompare) if finderIdx > 0
    auto remover = remove(
    auto finder = find(list, &data)
    finder = if !finder
    while finder
        retval = Crc.add16(, retval)
        finder =
    unremove(remover,
    list = sort(list, idxCompare)
    for auto e =; e; e =
        retval = Crc.add16(, retval)
    return retval

Refer to  ListBench for the definitions of the internal functions called by run.

Generalized sorting

As already illustrated, the ListBench.sort function accepts a cmp argument of type Comparator – invoked when merging Data objects from a pair of sorted sub-lists: (1)

  1. The implementation seen here (including the inline comments) mimics the core_list_mergesort function found in the legacy core_list_join.c source file.
em.coremark/ListBench.em [exc]
def sort(list, cmp)
    auto insize = <int32>1
    var q: Elem&
    var e: Elem&
    for ;;
        auto p = list
        auto tail = list = null
        auto nmerges = <int32>0  # count number of merges we do in this pass
        while p
            nmerges++  # there exists a merge to be done
            # step `insize' places along from p
            q = p
            auto psize = 0
            for auto i = 0; i < insize; i++
                psize++
                q =
                break if !q
            # if q hasn't fallen off end, we have two lists to merge
            auto qsize = insize
            # now we have two lists; merge them
            while psize > 0 || (qsize > 0 && q)
                # decide whether next element of merge comes from p or q
                if psize == 0
                    # p is empty; e must come from q
                    e = q
                    q =
                    qsize--
                elif qsize == 0 || !q
                    # q is empty; e must come from p.
                    e = p
                    p =
                    psize--
                elif cmp(, <= 0
                    # First element of p is lower (or same); e must come from p.
                    e = p
                    p =
                    psize--
                else
                    # First element of q is lower; e must come from q.
                    e = q
                    q =
                    qsize--
                # add the next element to the merged list
                if tail
           = e
                else
                    list = e
                tail = e
            # now p has stepped `insize' places along, and q has too
            p = q
   = null
        # If we have done only one merge, we're finished
        break if nmerges <= 1  # allow for nmerges==0, the empty list case
        # Otherwise repeat, merging lists twice the size
        insize *= 2
    return list

Looking first at the IdxComparator module, you couldn't imagine a simpler implementation of its compare function – which returns the signed difference of the idx fields after scrambling the val fields:

em.coremark/IdxComparator.em [exc]
module IdxComparator: ComparatorI


def compare(a, b)
    a.val = <int16>((<uint16>a.val & 0xff00) | (0x00ff & <uint16>(a.val >> 8)))
    b.val = <int16>((<uint16>b.val & 0xff00) | (0x00ff & <uint16>(b.val >> 8)))
    return a.idx - b.idx

Turning now to the ValComparator module, you couldn't imagine a more convoluted  implementation of compare – which returns the signed difference of values computed by the private calc function: (1)

  1. the twin of calc_func found in the legacy core_list_join.c source file
em.coremark/ValComparator.em [exc]
module ValComparator: ComparatorI

    proxy Bench0: BenchAlgI
    proxy Bench1: BenchAlgI


    function calc(pval: int16*): int16


def calc(pval)
    auto val = <uint16>*pval
    auto optype = <uint8>(val >> 7) & 1
    return <int16>(val & 0x007f) if optype
    auto flag = val & 0x7
    auto vtype = (val >> 3) & 0xf
    vtype |= vtype << 4
    var ret: uint16
    switch flag
    case 0
        ret =<uarg_t>vtype)
        Utils.bindCrc(Bench0.kind(), ret)
    case 1
        ret =<uarg_t>vtype)
        Utils.bindCrc(Bench1.kind(), ret)
    default
        ret = val
    auto newcrc = Crc.add16(<int16>ret, Utils.getCrc(Utils.Kind.FINAL))
    Utils.setCrc(Utils.Kind.FINAL, newcrc)
    ret &= 0x007f
    *pval = <int16>((val & 0xff00) | 0x0080 | ret)   ## cache the result
    return <int16>ret

def compare(a, b)
    auto val1 = calc(&a.val)
    auto val2 = calc(&b.val)
    return val1 - val2

Besides scrambling the contents of a val field reference passed as its argument, calc actually runs other benchmark algorithms via a pair of BenchAlgI proxies [ Bench0, Bench1 ] .

Benchmark configuration

Having visited most of the individual modules found in the EM•Mark design hierarchy, let's return to CoreBench and review its build-time configuration functions:

em.coremark/CoreBench.em [exc]
module CoreBench: BenchAlgI

    config TOTAL_DATA_SIZE: uint16 = 2000
    config NUM_ALGS: uint8 = 3


def em$configure()
    memSize = Math.floor(TOTAL_DATA_SIZE / NUM_ALGS)
    ListBench.idxCompare ?=
    ListBench.valCompare ?=
    ListBench.memSize ?= memSize
    MatrixBench.memSize ?= memSize
    StateBench.memSize ?= memSize
    ValComparator.Bench0 ?= StateBench
    ValComparator.Bench1 ?= MatrixBench

def em$construct()
    Utils.bindSeedH(1, 0x0)
    Utils.bindSeedH(2, 0x0)
    Utils.bindSeedH(3, 0x66)

In addition to calculating and assigning the memSize config parameter for each of the benchmarks, CoreBench.em$configure binds a pair of Comparator functions to ListBench as well as binds the StateBench and MatrixBench modules to the ValComparator proxies.

CoreBench.em$construct completes build-time configuration by binding a prescribed set of values to the volatile seed variables accessed at run-time by the individual benchmarks.

Summary and next steps

Whether you've arrived here by studying all of that EM code (or by skipping ahead !!), let's summarize some key takeaways from the exercise of transforming CoreMark into EM•Mark :

The CoreMark source code – written in C with "plenty of room for improvement" – typifies much of the legacy software targeting resource-constrained MCUs.

The high-level design of EM•Mark (depicted here) showcases many aspects of the EM language – separation of concerns, client-supplier decoupling, build-time configuration, etc.

The  ActiveRunnerP and  SleepyRunnerP programs can run on any  MCU for which an em$distro package exists – making EM•Mark ideal for benchmarking MCU performance.

Besides embodying a higher-level of programming, EM•Mark also outperforms  legacy CoreMark.

To prove our claim about programming in EM, let's move on to the EM•Mark results and allow the numbers to speak for themselves.