This is another article in the scala-notes series. This one deals with what I call number madness.

What!?

Numbers are used everywhere in code, and often the type used is something like Int (Scala) or Integer/int (Java). These simple types are familiar to those of us who used languages like C but should not be used without considering their inherent limitations. Let’s imagine an example web application that during its normal operation records interesting events…

Imagine a field used to represent the key of a database table named INTERESTING_EVENT, specifically an automatically incrementing surrogate key, as is typical. (This is not the best example, in part because one would rarely need to perform arithmetic on a key value, but it is simple to explain.) Let’s say that on average, only twenty INTERESTING_EVENT records are populated per second.

Is an Int or Integer safe? Let’s get some perspective…

365 days per year
$= 365$
8760 hours per year
$= 24 \cdot 365$
525,600 minutes per year
$= 60 \cdot 8760$
31,536,000 seconds per year
$= 60 \cdot 525,600$

Given a linear expansion of the INTERESTING_EVENT sequence, by twenty each second, the value would grow by 630,720,000 each year.

630,720,000 events per year
$= 31,536,000 \cdot 20$

In four years that would grow to 2,522,880,000.

2,522,880,000 events in four years
$= 630,720,000 \cdot 4$

The MAX_VALUE for java.lang.Integer is 2,147,483,647.

The example is contrived and unlikely, but also recall that automatically incremented keys do not always grow linearly, and twenty events per second is conservative for a web server… (Insert your own web scale humor here.) But even with those constraints the key grows beyond the capacity of java.lang.Integer in less than four years. Four years is enough time to forget about the capacity constraint built into the example application.
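The arithmetic above can be checked with a short Scala sketch. The object and value names here are hypothetical, chosen only for illustration, and Long is used throughout so that the check itself cannot overflow.

```scala
// A sketch to check the arithmetic in the text; all names are hypothetical.
object KeyGrowth {
  val eventsPerSecond: Long = 20L
  val secondsPerYear: Long = 60L * 60L * 24L * 365L            // 31,536,000
  val eventsPerYear: Long = eventsPerSecond * secondsPerYear   // 630,720,000
  val eventsInFourYears: Long = eventsPerYear * 4L             // 2,522,880,000

  // Years of linear growth before the key passes Int.MaxValue.
  val yearsUntilOverflow: Double =
    Int.MaxValue.toDouble / eventsPerYear.toDouble
}
```

Under these assumptions yearsUntilOverflow comes out a little above 3.4, which agrees with the claim that the key outgrows an Int in less than four years.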

You might forget how these simple number types work and think:

No problem, statistics or other processes are running, constantly computing metrics against Events and some computations use the key value… Surely an exception will be thrown as soon as the Integer overflows.

This is where you would be wrong.

Division will throw an ArithmeticException on division by zero, but a sum, for example, will quietly continue on an overflow. Let me repeat that… The sum will return an incorrect value on overflow, but it will not throw a runtime exception! This is a serious problem. The application will quietly continue to run, but the mathematical operations on the value will be incorrect. This should be a real “What!?” moment.
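Here is a minimal sketch of that difference. The object and method names are hypothetical. It also uses java.lang.Math.addExact, which is not mentioned elsewhere in this article, only to show what a checked sum looks like by comparison.

```scala
import scala.util.Try

// Hypothetical names; a sketch contrasting silent and checked overflow.
object OverflowDemo {
  def divide(a: Int, b: Int): Int = a / b
  def plainSum(a: Int, b: Int): Int = a + b

  // Integer division by zero throws an ArithmeticException.
  val division: Try[Int] = Try(divide(1, 0))

  // A plain Int sum wraps around silently; no exception is thrown.
  val wrapped: Try[Int] = Try(plainSum(Int.MaxValue, 1))

  // Math.addExact throws an ArithmeticException on overflow instead.
  val checked: Try[Int] = Try(java.lang.Math.addExact(Int.MaxValue, 1))
}
```

Only the plain sum succeeds, and the value it returns is wrong.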

Two’s complement

The JVM stores an int as a two’s complement binary (the core of java.lang.Integer and scala.Int), and thus it is often referred to as a two’s complement type.

Specifically, the JVM uses a 32-bit signed two’s complement integer for int. The most significant bit is used to designate the sign of the number. The max value for int is 2,147,483,647, and would thus be represented[1] like 0111 1111 1111 1111 1111 1111 1111 1111.

Simple sum

First let’s simply add 1 to 2,147,483,646.

The result is what we expect, 0111 1111 1111 1111 1111 1111 1111 1111, or 2,147,483,647 in decimal.
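To see these bit patterns for yourself, something like the following sketch can help. The Bits object and its show method are hypothetical helpers. Integer.toBinaryString drops leading zeros, so the sketch pads the result back to 32 bits and groups it by four.

```scala
// Hypothetical helper; renders an Int as 32 bits, most significant bit first.
object Bits {
  def show(n: Int): String = {
    val raw = Integer.toBinaryString(n)            // leading zeros are dropped
    val padded = ("0" * (32 - raw.length)) + raw   // restore the full 32 bits
    padded.grouped(4).mkString(" ")
  }

  // 2,147,483,646 + 1, the simple sum from the text.
  val simpleSum: String = show(2147483646 + 1)
}
```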

Negative values

The JVM int is signed, that is to say both positive and negative values can be stored as an int.

Using the MSB (most significant bit) to designate sign makes it possible to store negative values using the same placeholders and arithmetic. A negative is represented with the MSB set to 1. To convert -2,147,483,647 to a two’s complement binary for int, we must find the complement of each bit then add one to the result.

1. Find the complement of each bit of 0111 1111 1111 1111 1111 1111 1111 1111.
1000 0000 0000 0000 0000 0000 0000 0000

2. Add one to the result.
1000 0000 0000 0000 0000 0000 0000 0001

The minimum value would be the value with the greatest magnitude and the opposite sign. The 32-bit, negative, two’s complement number with the greatest magnitude would be where the sign bit is set to 1 and all the other bits are set to 0, i.e., 1000 0000 0000 0000 0000 0000 0000 0000. This means the min value[2] is -2,147,483,648 in decimal.
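The complement-then-add-one rule can be sketched in a few lines of Scala. The object and method names are hypothetical. The ~ operator flips every bit of an Int.

```scala
// Hypothetical names; a sketch of two's complement negation.
object TwosComplement {
  // Negate by complementing every bit and then adding one.
  def negate(n: Int): Int = ~n + 1

  // Render an Int as 32 bits in groups of four, most significant bit first.
  def bits(n: Int): String = {
    val raw = Integer.toBinaryString(n)
    (("0" * (32 - raw.length)) + raw).grouped(4).mkString(" ")
  }

  val complementOfMax: String = bits(~Int.MaxValue)       // sign bit only
  val negativeMax: String = bits(negate(Int.MaxValue))    // -2,147,483,647
}
```

Note that the complement of the max value, before adding one, is already the bit pattern of the min value.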

Let’s see what happens when we overflow by adding one to 2,147,483,647.

Wait! We have seen this one before, it is -2,147,483,648! So in this case the result is mathematically incorrect, but a valid int!

You can try out Scala code snips in the online REPL here, and Java in the online REPL here.
Both of the following expressions return -2147483648.

Scala
Int.MaxValue + 1
Java
Integer.MAX_VALUE + 1

BigInteger (java.math.BigInteger) to the rescue?

Clearly when arithmetic is needed, or when it is possible to overflow a field, one should use something safe like BigInteger over its primitive counterpart. Since one can never know for certain how a field will be used, selecting a safe type should be the default.

If one can be sure that arithmetic will never be performed on a field, and benchmarking shows that using a primitive is needed, then maybe it is worth the risk.

Primitive Type

Pros
Simple quick arithmetic.
Fixed memory allocation size.
Symbol operators. ( + - * /)
Cons
Arithmetic can fail without error!

BigInteger (java.math.BigInteger)

Pros
Correct arithmetic
Cons
Cannot use symbol operators; must use add, subtract, multiply, and divide methods.
Arithmetic could be slower.
Higher memory allocation requirements because more information is stored and capacity is dynamic.

BigInt (scala.math.BigInt)

If you live in the Scala world you have better options, such as scala.math.BigInt. It can be thought of as Scala’s BigInteger. In Scala, characters such as + - * / can be used as method names, so this makes using scala.math.BigInt a pleasant experience.
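A small sketch of the difference, using only the standard library. The object and value names are hypothetical. The same sum is written once with BigInt operators and once with the BigInteger method calls.

```scala
// Hypothetical names; BigInt operators next to BigInteger method calls.
object BigIntDemo {
  // Scala: ordinary operators, and no wrap-around past Int.MaxValue.
  val next: BigInt = BigInt(Int.MaxValue) + 1
  val fourYears: BigInt = BigInt(630720000) * 4

  // Java style: the same sum spelled out with method names.
  val viaJava: java.math.BigInteger =
    java.math.BigInteger.valueOf(Int.MaxValue.toLong).add(java.math.BigInteger.ONE)
}
```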

Spire

Finally, when talking about handling numbers in Scala, we need to talk about spire. Spire makes working with numbers a beautiful experience.

Look at how clear and simple it is to create a BigInt using spire string interpolation. Using spaces makes it easier to identify the place values for each digit and results in source that is simpler to read. Spire supports additional number types, type classes and more. Let’s take SafeLong for example. It works as a nice replacement for BigInt. While it provides accurate arithmetic, independent of value, it performs better than BigInt for values inside the java.lang.Long range.[3]
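To give a feel for why a type like SafeLong can be quicker inside the Long range, here is a rough standard-library sketch of the general idea. This is not Spire’s implementation and does not use Spire’s API. The SafeAdd object is hypothetical, and it is an assumption of this sketch that trying the primitive operation first and falling back to BigInt on overflow captures the spirit of the approach.

```scala
// Hypothetical sketch of the idea only; not Spire code.
object SafeAdd {
  // Right(long) when the sum fits in a Long, Left(bigInt) when it does not.
  def add(a: Long, b: Long): Either[BigInt, Long] =
    try Right(java.lang.Math.addExact(a, b))
    catch { case _: ArithmeticException => Left(BigInt(a) + BigInt(b)) }
}
```

For example, SafeAdd.add(1L, 2L) stays a primitive Long, while SafeAdd.add(Long.MaxValue, 1L) falls back to a BigInt and still gives the mathematically correct answer.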

Conclusion

A two’s complement type like int should be considered an optimization choice for Scala and Java software development, and not a default. Its performance comes at a cost and that cost is silent failure.
The default choice for a whole number field should be something like BigInt or SafeLong because they ensure correct arithmetic without regard to the size of the operands. This is made far more pleasant in Scala, especially when using a library like spire.

1. I show the leftmost bit as the most significant bit because it makes it simple to see the arithmetic.

2. Yes, the minimum and maximum values are asymmetrical, the minimum is -2,147,483,648 and the maximum is 2,147,483,647.

3. $-2^{63} \text{ through } 2^{63} - 1$.

Scala Notes: Getting Started

This is the first in a series of very short articles I am calling scala-notes where I will share my notes and/or my favorite links about Scala related topics. Hopefully this will prove to be a motivator for me to mine the giant stack of paper notes I have written about Scala… In any case I hope this exercise proves useful for somebody.

Getting Started with Scala

When you get started with a language you need a “Hello, World”. (This example works if you install Scala, rather than using it from SBT.) So here is one:
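The listing itself is not reproduced here, so what follows is only a minimal sketch that is consistent with the file name, the commands and the output shown in this section. The exact original source is an assumption.

```scala
// HelloWorld.scala: a minimal sketch, assumed rather than the original listing.
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}
```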

$ vi HelloWorld.scala
$ scalac HelloWorld.scala
$ scala HelloWorld
Hello, world!

What is Scala?

The official answer is here, but to me it is simply an object-functional programming language on the JVM. The more interesting part is the functional part, but because it also supports object oriented design on the JVM, it makes for a simpler transition from Java than, say, moving directly to Haskell.

If you are a Java programmer maybe you should avoid Scala? Don’t say I didn’t warn you. ;-)

Functional

Here are some functional programming links (mostly videos). Some relate to Scala more than others.

Videos

Books

General Scala Books

If you have some favorite links for getting started with Scala, then please share them in a comment.

Why Java Programmers Should Avoid Scala

I have been writing Java professionally since 1997 and while the JVM (Java Virtual Machine) continues to dominate enterprise development, that is not all that has kept me using it. I have always liked Java.

A while ago I was chatting about how I found some aspects of Haskell and Go really interesting, when BeepDog mentioned Scala. Well, at the time I felt I had no need of Scala because if I wanted to use the JVM I would use Java, and if I did not want to use the JVM I could use Go, Haskell, or whatever other language best fit the needs of the project. But then I did not know Scala.

An EPFL Coursera class, Functional Programming Principles in Scala, opened and, being out of excuses, I took the class. I was so impressed that I took the next one, Principles of Reactive Programming, too. There is no fitting exclamation or emoticon to describe the experience.

I can’t list all of the advantages of using Scala on the JVM, because I keep finding new ones. Here are a few just to get things started.[1]

• An improved, static type system
• Native support for Functional Programming
• Native support for Object Oriented Programming
• Clever concurrency solutions
• A more concise and expressive syntax
• Runs on the JVM fully interoperable with Java.
Scala classes can call Java methods, create Java objects, inherit from Java classes and implement Java interfaces.
Scala Language Specification Version 2.8

The above advantages and more make Scala the pragmatic choice for productive, maintainable, DRY, and performant code on the JVM platform.

So with all those advantages, why avoid it?

Because something happens when you learn Scala…

I continue to experience situations where I find some aspect of the Scala syntax or some framing of a problem awkward only to have an epiphany. The result of the epiphany is an astonishing change in my way of thinking.

Avoid Scala because once you learn it you will be forever changed and will never again approach problems the same way.

Scala is improving, and promising languages appear all the time. Who knows where the journey will end. If you are content with using the JVM just as you do now then please stick with Java, but stay away from Scala. ;-)

1. There are many more, simply search the web… Many people even explain why these characteristics are advantages, I will save the space.

Machine Virtualization as a Development Tool: Part 2

Prologue

This is the second of a two article series about using machine virtualization as a development tool. It focuses on the question of how. The first article is found here and focuses on the question of what.

Using VMs

It is time now to get to the specifics of using a VM solution, in this case VirtualBox.

Guest Additions

Guest Additions are vendor specific add-ons that permit some useful features of VirtualBox. The Guest Additions should be installed into the guest OS when creating a base VM. See the VirtualBox documentation for details.

Shared Folders

One of the important features that Guest Additions permits is a shared folder. A shared folder permits the sharing of a host directory to the guest VM. This is a nice way to deploy code or otherwise share content with the VM.
If content in a shared folder is stateful, such as code, then it should be managed by a version control system, such as git.

Snapshots

A snapshot saves the state of a VM such that when the snapshot is restored the VM continues exactly where it left off. When a VM is started back at a point in time of a snapshot, remounted network drives and shared folders may have different contents than when the snapshot was made because the content is sourced externally to the VM. This is in contrast to virtual drives which by default will be restored bit for bit.

Snapshots are useful for a developer to test a new release or to rollback a site to an older state to work on a bug-fix. For example, one might be working on a new release of an application, only to have an emergency bug-fix take precedence over work on the new release. It is then a simple task to snapshot the VM to save the state for the new release, then restore a clone of the VM as it existed for the code experiencing the bug, right along with checking out the matching code from one’s SCM, like git. Then when the bug is fixed a similar process is performed to resume work on the new release. This ensures that not only does the code match production, but the other aspects of the host match as well such as the database[1], web server settings, etc.

Taking a snapshot is not a big deal so when in doubt, take one. Always take a snapshot to capture current state before restoring a snapshot. Beyond that, it is a good idea to take a snapshot before making a time consuming change to the VM, and with each code promotion.

It is wise to shut-down the VM prior to taking a snapshot because when one takes a snapshot of a running VM the memory state of the machine is also stored in the snapshot. This extra data is only needed if it is important to restore the snapshot to the running state, and it can occupy as much disk space as the memory defined for the VM.
When making a snapshot, VirtualBox provides a field that can be used to label the snapshot. A description field is also provided. Be sure to use the description field to include any information that is needed to fully restore the state of the snapshot. For example, include the git tag[2] needed to revert contents of a shared folder.

In the following example, the snapshot is labeled to show that it was taken prior to doing some database maintenance. The description field contains the git data needed to restore a shared folder to the correct condition of the snapshot.

The command git log --decorate --oneline -1 is a handy way to get the latest commit information, but ideally one should be creating an annotated tag to go with the snapshot. An annotated tag could be labeled the same way as the snapshot, thus clarifying the snapshot to tag relationship from examination of both the code and the list of snapshots.

The power of a snapshot is the ability to capture a point in time. However, to be able to make full use of the point in time, we may use snapshots with clones.

Clones

As one would expect, cloning a VM is making a copy of a VM. The source for a clone can be a VM’s current state or a particular snapshot.

Let’s consider the following where one is working on a web development project where git is being used to manage the stateful content of a shared folder:

A critical bug is found in production. We must roll-back to a snapshot to work a bug then later resume the current state…

There are two ways to handle this:

Use snapshots

1. Shut down the VM so we do not preserve the VM run-time data.

2. Stash content of shared folder.

$ git stash save "Working on customer lookup when ticket ID 113245 was received"

3. Capture the full stash information for use in the snapshot description.

$ git stash list
stash@{0}: On dev: Working on customer lookup when ticket ID 113245 was received

4. Take a new snapshot of the current state so we have a place to return to when the bug is fixed. Include the stash information in the snapshot description.

5. Restore the VM to the snapshot required to fix the bug. (In this case the snapshot created for the v1.0.1 code promotion.)

6. Restore the shared folder to the correct place[3] for fixing the bug.

$ git checkout -b b1.0.1-113245 tags/v1.0.1

7. Fix the bug…

8. Restore the VM to the snapshot made with our stash comment.

9. Restore the shared folder to the state that corresponds to the snapshot we just restored.

$ git checkout dev
$ git stash list
stash@{0}: On dev: renaming serialized files
stash@{1}: On dev: Working on customer lookup when ticket ID 113245 was received
$ git stash pop stash@{1}

Use snapshots with a clone

What if a co-worker can take on the bug fix while you continue to work on your current sprint? If that is the case, then we can simplify the process down to creating a clone VM for the co-worker.

1. Select the correct snapshot from the snapshots list. (In this case the snapshot created for the v1.0.1 code promotion.)

2. Right-click the snapshot and select clone, or click the clone button (pictured as a sheep).

3. Name the clone including the source VM name, snapshot name and the ticket ID 113245.

4. If your co-worker uses the same network as you, then select the Reinitialize the MAC address of all network cards option. Note that you may need to boot into the VM and update network settings to reflect the new MAC addresses.

5. Select a full clone rather than a linked clone, so that the clone can be operated independently.

6. When prompted for what parts of the snapshot tree to clone, select Current Machine State. It is confusing but Current Machine State is the current state of the VM at the point in time of the snapshot you selected. This can result in a smaller VM because no other states, and their requisite differencing disk images, need to be incorporated into the clone.

Once a VM is fully set up with SSH and the other tools I need, I find that the GUI simply gets in the way. There are two easy ways to start a VM without loading its console or its Graphical User Interface (GUI).

1. Hold down the shift key when starting a VM from the GUI VirtualBox Manager.
2. Start a VM from the command line with VBoxHeadless --startvm <uuid|name>.

Maximize return on virtualization investment

As with any tool, VM software is only useful if it saves more time and headaches than it causes. There are a few things one can do to maximize return on one’s time investment.

Separate source code management from VM management and identify intersections

Keep source code and source code management on the host machine. Share code or deployments with the VM guests by way of shared folders; do not store such things on a VM’s virtual disk. This eliminates unnecessary dependencies on guest VMs by confining code management to the host machine. This does not preclude the use of remote git repositories. It simply ensures that the host is handling those dependencies.

Identify intersections of state between guest VMs and managed code. This means identifying important milestones in code state with SCM tags, such as a git annotated tag. Also identify important milestones in VM state with snapshots. Be clear that each time a tag is needed for the source code, a snapshot should be made in the VM, and vice versa. Keep snapshot descriptions updated with tags that correspond to the intersections of state between the VM and the code. See Snapshots above.

Use pre-built base VMs.

It is my experience that the most time consuming aspect of using VM software is creating the base VM. That said, creating a base VM from scratch is no more complicated than installing an OS on a real machine, and once you have done it once, I find it simpler than dealing with a real machine. A nice short-cut is to use a pre-built base VM, then simply customize it if needed. The ideal place to get a VM would be from whoever created the VM for your production environment, if a VM is used in your production environment. Pre-built VM images are sometimes called virtual appliances. Even though there are many different VM software technologies, a format exists called OVF (Open Virtualization Format) that can be used to make a VM image more compatible across different VM software providers. This is supposed to make it possible, for example, to use a VM created for VMware on VirtualBox. VirtualBox has an import mechanism to let one make use of these OVF VMs.

The following are two places with useful VMs.

The place I find the most useful is vagrantbox.es, but to properly use those images we need another tool called Vagrant.

Use Vagrant

Use Vagrant to simplify and accelerate creating your own base VMs. Once set up, it is possible to create a new VM with a single vagrant up command. The Vagrant build process is automated and controlled by a written configuration, thus making Vagrant VMs reliable, predictable and consistent.

I use Vagrant with Puppet to simplify and unify the provisioning tasks. The following example shows how to use Puppet to ensure that git is installed in the guest VM. Notice that the command does not identify a particular OS or distribution specific package manager, like rpm or yum. Part of the power of Puppet is that it can keep these specifics out of the configuration and out of the way.

This lets one focus more on addressing the needs for a VM, and less on how those needs must be achieved for a particular OS/distribution.

I created a Habari development VM for GrowingLiberty.com using Vagrant. This implementation can be found on GitHub at the following URL.

github.com/mmynsted/vagrant-centos-php

The details are documented in the Readme.md. This implementation uses a base Vagrant box from vagrantbox.es, showing that one can further simplify provisioning a base VM by using a pre-built Vagrant base box.

My preference is to use Vagrant to create a base VM, then clone it to a new VM that is no longer dependent on Vagrant. This way I am free to continue to improve my Vagrant implementation without adversely affecting actively used VMs. The example above shows how one could use Puppet as part of the original provisioning and then use traditional shell commands from the new clone. Puppet is still installed so one can interact with the VM in the way that seems most natural.

The Vagrant and Puppet combination enables one to quickly translate a VM build investment to new and changing needs.

Conclusion

A virtual machine is a software based abstraction of a physical machine. This abstraction permits reduced maintenance time, improved testing, and making better use of both remote and local machine resources. Using a local VM can become a seamless part of one’s development process, can help one be better organized, and more productive.

1. Database state would only be saved if the database was served from inside the same VM or if external, one managed its state much like managing state for source code delivered through a shared folder.

2. Be sure to use an annotated git tag so the full information is captured.

3. Creating a new branch b1.0.1-113245 from tag v1.0.1 on the dev branch.

Machine Virtualization as a Development Tool: Part 1

Prologue

The use of Virtual Machines for web development is becoming ubiquitous, and with good reason.

This is the first of a two article series about using machine virtualization as a development tool. It focuses on the question of what. The second article is found here and focuses on the question of how.

What are Virtual Machines for a software developer?

A Virtual Machine (VM) is a software based abstraction of a physical machine. This abstraction makes it possible to manage and control access to the actual hardware. Hardware that does not physically exist can be emulated.

Here is how I recall the explosion in use of Virtual Machines…

When I first became a professional software developer I worked for IBM. The first VM I used for development was VM/CMS. It was a remote, centrally managed VM, and a sensible abstraction away from big iron mainframes with big consequences if something went wrong.

It was not too long until the PC revolution changed the way applications were used, written and delivered. The 3270 terminal (or terminal emulator) was largely replaced with the web browser and Client-Server applications. It was around that time that I displayed a picture on my office wall of a proud man standing in a giant warehouse stacked with PC servers. His arms were spread so one could get a feel of scale; a tiny man standing in such a giant warehouse with an incalculable number of servers. The picture was from a book where the man described his successful journey to replace a mainframe with PC servers. This was IBM so I am sure the message I should have gained from the book was something like “Wow look at all the hardware that people would buy, and all the services they would need, to support their desire to join this PC revolution!”

There is something to be said for the decentralization of application development accelerating innovation, but what I saw in the picture was “Maintenance nightmare!”

If I recall correctly, the move from centrally managed mainframes to cheap distributed PC infrastructure, this PC revolution, and its requisite services, were quite profitable for IBM. There must be some irony there.

It was a while later before PC architecture servers, with enough surplus power to support virtualization, became cost effective. Until that time the big boys (think IBM, HP, Sun, and many others) continued to offer expensive, but powerful machines with hardware based virtualization (type 1 hypervisor), while the PC guys began to learn the true costs of maintaining so many physical machines.

What do you do if you want the benefits of virtualization but your hardware (architecture) does not meet the Popek and Goldberg virtualization requirements? Well, you write a software hypervisor (type 2 hypervisor) like VMware, and change the world. This kind of VM software made it possible to emulate a number of PC servers on a larger host, with room to scale up as needs grew. Surplus power could then be traded for better maintainability, reliability, predictability and separation.

All that brings us to today where VMs (Virtual Machines) are used throughout the lifecycle of application development and deployment, especially for web applications where PC hardware is still largely ubiquitous.

There are a variety of good VM solutions, but for the sake of simplicity I will focus the rest of this article on one, VirtualBox. VirtualBox is currently free and can be found here.

What can it do for me?

OK, that is all well and good for system administrators and infrastructure planners, but what can it do for me as a developer?

Reduce maintenance time

Changing a network card or adding RAM to a virtual machine is as simple as changing a setting. This is true whether the host is a remotely maintained server, or your laptop.
Virtual Machines are portable, so if the physical hardware is acting up, the VM can be moved to another host until the original is repaired or replaced. One could even continue to work on a clone while software is being upgraded on an original VM. Need a development server and all existing servers are at capacity? No problem, simply fire up a new VM.

It is quicker and easier to update a VM in concert with a closely matching production machine, than it is to update a number of physical test machines, developer laptops, workstations, etc. This is not simply a matter of scale, updating a single image rather than a number of physical boxes, but also a matter of compatibility. The compatibility becomes important if it permits one to apply the same changes to production and test VMs with tools such as Puppet, with predictable results.

Reduced maintenance and provisioning time result in increases of developer productivity, even if the developer never deals with the maintenance or provisioning himself or herself.

Improve testing

There are characteristics of VMs that are useful for testing. VM images are software so they only need to consume resources when they are needed, unlike their physical counterparts. This makes it cheap to keep many different test images. Because VMs are portable and can be copied and cloned, it is simple to make them available to whoever needs them, even to many people simultaneously.

Undo changes with roll-back

VMs make it quick and safe to test dramatic environmental changes. This is because changes can be completely undone with a simple roll-back.

I recall a third party software upgrade that resulted in the upgraded server becoming completely unusable. Performing an un-install would have still left the server damaged. A roll-back restored the server to the state exactly as it was prior to the upgrade. I cannot imagine the nightmare had this server not been a VM.

Better match

Testing is more efficient when all factors, other than those being tested, are consistent and thus eliminated from consideration. This way when differences are observed they can be expected to be the result of the factors being tested.
Nobody wants to experience a problem in production that could not be found in test, but when these environments are different, and the differences are not the factors being tested, such problems can happen.

It is possible for one to create a VM that more closely matches production (also likely a VM) than it would be to configure one’s laptop, workstation, or even a physical test machine to do the same. This is in part because a laptop or workstation is designed to meet different needs than for example, a dedicated web server.

An example might be that one’s development workstation runs Arch Linux and one’s hosting provider, or a company production environment, uses CentOS 6. To address the discrepancy, one could create a CentOS 6 base VM that closely matches production. In some cases such a VM can be derived from the actual VM used in production.

At this point I should mention that many developers use an IDE (Integrated Development Environment) both for writing code and for local testing. I personally avoid IDEs. To me, they divert one’s attention from learning the language and development tools, to learning the IDE. I find that I get better results by writing a script, an editor macro or by re-examining what I am doing, than I do from using an IDE shortcut.

I worked with a developer who experienced a really frustrating and intractable code problem. We found that the problem was not with his code but was simply a flaw in the way his IDE executed it.
The problem did not, and could not, exist anywhere else.

I bring this up because I find that running code, especially tests, in an IDE exacerbates the problems of run-time differences between a local development and production environment. If you are hell-bent on using an IDE, I suspect you could configure it to use a local VM for testing, etc. After all, development tools should conform to the needs of the developer, not the other way around.

Make better use of your local machine

Many developers are familiar with using VMs as part of a remote, company development server pool, but few use VMs locally. VM solutions work well on local machines too. Independent contractors have been taking advantage of virtual machines for years. Visit a new client and create a fresh VM. That VM can then be customized to the client and saved for the next time the client calls. Meanwhile the host machine is kept safe from constant alteration. This also works well when one encounters a client running a different OS. Have a Mac or Linux laptop and the customer runs Windows or FreeBSD? No problem, just use a VM of whatever is needed.

The second article in this series will focus on how to use machine virtualization as a development tool.